### Data Cleaning - Livestock
Based on what we found during data exploration, we now clean the data.  
The goal is a general cleaning step that can be reused in all four tasks.

**TODO**
- Are we removing too much/too little?
- Interpolate where we are missing data

In [65]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [66]:
livestock = pd.read_csv(
    'raw_data/Production_LivestockPrimary_E_All_Data/Production_LivestockPrimary_E_All_Data.csv',
    sep=',',
    encoding='latin-1',
)
livestock.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
0,2,Afghanistan,1062,"Eggs, hen, in shell",5313,Laying,1000 Head,4000.0,F,4400.0,...,9500.0,F,9500.0,F,9337.0,Im,9369.0,Im,10688.0,F
1,2,Afghanistan,1062,"Eggs, hen, in shell",5410,Yield,100mg/An,25000.0,Fc,25000.0,...,18947.0,Fc,19474.0,Fc,21253.0,Fc,21263.0,Fc,18713.0,Fc
2,2,Afghanistan,1062,"Eggs, hen, in shell",5510,Production,tonnes,10000.0,F,11000.0,...,18000.0,F,18500.0,F,19844.0,Im,19921.0,Im,20000.0,F


**Elements**  
As we are primarily interested in meat production in this task, we drop all elements except _Production_ and _Producing Animals/Slaughtered_. 

In [67]:
# Keep only the two elements of interest
livestock_prod = livestock[livestock['Element'].isin(['Production', 'Producing Animals /Slaughtered'])]
livestock_prod.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
2,2,Afghanistan,1062,"Eggs, hen, in shell",5510,Production,tonnes,10000.0,F,11000.0,...,18000.0,F,18500.0,F,19844.0,Im,19921.0,Im,20000.0,F
3,2,Afghanistan,1067,"Eggs, hen, in shell (number)",5513,Production,1000 No,200000.0,F,220000.0,...,360000.0,F,370000.0,F,396880.0,Im,398420.0,Im,400000.0,F
6,2,Afghanistan,919,"Hides, cattle, fresh",5510,Production,tonnes,7200.0,Fc,7680.0,...,14890.0,Fc,,,,,,,,


**Items**  
After removing the other elements, we remove all items that are not meat.

In [75]:
livestock_meat = livestock_prod[livestock_prod['Item'].str.contains('Meat')]
livestock_meat.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
8,2,Afghanistan,1137,"Meat indigenous, camel",5322,Production,Head,20000.0,F,22393.0,...,19500.0,F,,,,,,,,
10,2,Afghanistan,1137,"Meat indigenous, camel",5510,Production,tonnes,3600.0,Fc,4031.0,...,3510.0,Fc,,,,,,,,
11,2,Afghanistan,944,"Meat indigenous, cattle",5322,Production,Head,360000.0,F,384000.0,...,654604.0,F,,,,,,,,


**Units**  
To compare the data easily, we would like to have a common unit for all of it.

In [76]:
livestock_meat.drop_duplicates(subset=['Unit'])

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
8,2,Afghanistan,1137,"Meat indigenous, camel",5322,Production,Head,20000.0,F,22393.0,...,19500.0,F,,,,,,,,
10,2,Afghanistan,1137,"Meat indigenous, camel",5510,Production,tonnes,3600.0,Fc,4031.0,...,3510.0,Fc,,,,,,,,
14,2,Afghanistan,1094,"Meat indigenous, chicken",5323,Production,1000 Head,7000.0,F,7500.0,...,29648.0,F,,,,,,,,


Looking at this, we see that although the units differ, each one makes sense for its category. Therefore we keep them as they are for now.  
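
One way to back up this observation is to summarise which units occur and whether each element code always uses the same unit. The sketch below runs on a small toy frame (`toy`, an assumed stand-in for `livestock_meat`, with illustrative rows only); the same two `groupby` calls should apply to the real dataframe.

```python
import pandas as pd

# Toy stand-in for livestock_meat; rows are illustrative, not real data.
toy = pd.DataFrame({
    'Item': ['Meat indigenous, camel', 'Meat indigenous, camel',
             'Meat indigenous, chicken', 'Meat indigenous, chicken'],
    'Element Code': [5322, 5510, 5323, 5510],
    'Unit': ['Head', 'tonnes', '1000 Head', 'tonnes'],
})

# Number of rows and number of distinct items reported in each unit
unit_summary = toy.groupby('Unit').agg(
    rows=('Item', 'size'),
    items=('Item', 'nunique'),
)
print(unit_summary)

# Number of distinct units per element code; a value of 1 means the
# element code consistently uses a single unit
units_per_code = toy.groupby('Element Code')['Unit'].nunique()
print(units_per_code)
```

If every element code maps to a single unit, comparisons within one element code are unit-consistent even though the dataset as a whole mixes units.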

**Flags**  
The flags indicate how reliable each value is, but as this is the best data we have available, we "trust" all of it. However, we keep the flags in case we observe inconsistencies later.
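
To get a feeling for how the flags are distributed before trusting them, one option is to count the flag values across all `Y....F` columns. This is a minimal sketch on toy data (the frame `toy` and its values are assumptions for illustration); on the real data the same lines could be applied to `livestock_meat`.

```python
import pandas as pd

# Toy stand-in with two year columns and their flag columns.
toy = pd.DataFrame({
    'Area': ['Afghanistan', 'Afghanistan'],
    'Y1961': [20000.0, 3600.0], 'Y1961F': ['F', 'Fc'],
    'Y1962': [22393.0, 4031.0], 'Y1962F': ['F', 'Fc'],
})

# Flag columns start with 'Y' and end with 'F'
flag_cols = [c for c in toy.columns if c.startswith('Y') and c.endswith('F')]

# Reshape all flag columns into one long column and count each flag value;
# value_counts() ignores missing flags by default
flag_counts = toy[flag_cols].melt(value_name='flag')['flag'].value_counts()
print(flag_counts)
```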

**Missing Data**  
Some data is missing from the dataset, which is not very surprising when we are talking about production at country level. Since 1960, a lot of states have been founded and dissolved, with the dissolution of the Soviet Union in 1991 as the most notable example. As we are mostly looking at the continent level, we choose not to clean this up at the moment, although it is important to keep in mind when working with the data.  
However, we choose to remove the data after 2013, as a lot of information is missing there.

In [77]:
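drop_years = ['Y2014', 'Y2015', 'Y2016', 'Y2017', 'Y2014F', 'Y2015F', 'Y2016F', 'Y2017F']
# Reassign instead of using inplace=True: livestock_meat is a filtered slice,
# and modifying it in place can trigger a SettingWithCopyWarning
livestock_meat = livestock_meat.drop(columns=drop_years)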
livestock_meat.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2009,Y2009F,Y2010,Y2010F,Y2011,Y2011F,Y2012,Y2012F,Y2013,Y2013F
8,2,Afghanistan,1137,"Meat indigenous, camel",5322,Production,Head,20000.0,F,22393.0,...,22000.0,F,22000.0,F,19800.0,F,20000.0,F,19500.0,F
10,2,Afghanistan,1137,"Meat indigenous, camel",5510,Production,tonnes,3600.0,Fc,4031.0,...,3960.0,Fc,3960.0,Fc,3564.0,Fc,3600.0,Fc,3510.0,Fc
11,2,Afghanistan,944,"Meat indigenous, cattle",5322,Production,Head,360000.0,F,384000.0,...,745186.0,F,727346.0,F,766676.0,F,764703.0,F,654604.0,F
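
The interpolation item from the TODO list at the top could be approached along the following lines. This is only a sketch under assumptions: it uses a toy frame (`toy`), plain linear interpolation along each row, and fills only gaps that lie between two observed years, so that leading or trailing gaps (for example, states that did not yet exist) stay missing. Whether linear interpolation is appropriate for this data is an open question.

```python
import numpy as np
import pandas as pd

# Toy stand-in: one row per (area, item), one column per year.
toy = pd.DataFrame({
    'Area': ['A', 'B'],
    'Y1961': [10.0, np.nan],
    'Y1962': [np.nan, 5.0],
    'Y1963': [30.0, np.nan],
    'Y1964': [40.0, 9.0],
})

# Year columns start with 'Y' and do not end with 'F' (those are flags)
year_cols = [c for c in toy.columns if c.startswith('Y') and not c.endswith('F')]

filled = toy.copy()
# Interpolate along each row (axis=1); limit_area='inside' only fills
# gaps surrounded by observed values and leaves edge gaps as NaN
filled[year_cols] = toy[year_cols].interpolate(
    axis=1, method='linear', limit_area='inside'
)
print(filled)
```

In the toy data, row `A` gets its 1962 gap filled, while the missing 1961 value of row `B` stays missing because nothing was observed before it.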


**Categorizing the Data**  
We would like to divide our dataset into areas, countries and continents for easier use in the future. Luckily, because of the way area codes are organized, this is easily done. Every code under 251 is a country, and every code above 5000 is an area. 

In [16]:
from scripts.helpers import split_fao_data
print(split_fao_data.__doc__)

meat_countries, meat_area, meat_continents = split_fao_data(livestock_meat)


    Function that splits data into countries, areas and continents.
    params:
        df: fao-dataframe that includes area codes.
        
    returns:
        countries: dataframe with area-code < 500
        area: dataframe with only area-code > 500
        continents: dataframe with the 6 continents
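
The source of `split_fao_data` is not shown in this notebook. As a rough illustration of the idea only, a split by area code could look like the sketch below. The function name `split_by_area_code`, the `continent_names` parameter and the default threshold of 500 (taken from the docstring above) are assumptions, not the actual helper.

```python
import pandas as pd

def split_by_area_code(df, continent_names, threshold=500):
    """Hypothetical stand-in for split_fao_data (the real helper may differ).

    Rows with 'Area Code' below the threshold are treated as countries,
    rows above it as aggregated areas, and continents are the areas
    whose name appears in continent_names.
    """
    countries = df[df['Area Code'] < threshold]
    area = df[df['Area Code'] > threshold]
    continents = area[area['Area'].isin(continent_names)]
    return countries, area, continents

# Toy frame; codes and names are illustrative only.
toy = pd.DataFrame({
    'Area Code': [2, 5000, 5100],
    'Area': ['Afghanistan', 'World', 'Africa'],
})

countries, area, continents = split_by_area_code(toy, continent_names=['Africa'])
print(countries['Area'].tolist())
print(area['Area'].tolist())
print(continents['Area'].tolist())
```

Passing the continent names explicitly keeps the assumption about which aggregates count as continents visible at the call site.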
    
    


**Saving the Data**  
To store the data for later use, we save it as both CSV files and pickles.  
These CSV files are not pushed to git, however, so the notebook has to be run locally to generate them. 

In [40]:
# Save dataframes to CSV
meat_countries.to_csv('./data/csv/meat_countries.csv')
meat_area.to_csv('./data/csv/meat_area.csv')
meat_continents.to_csv('./data/csv/meat_continents.csv')

In [42]:
# Save dataframes to pickles
meat_countries.to_pickle('./data/pickles/meat_countries.pkl')
meat_area.to_pickle('./data/pickles/meat_area.pkl')
meat_continents.to_pickle('./data/pickles/meat_continents.pkl')
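
Since the output files are not in git, a fresh checkout may not have the target folders yet. A minimal sketch of creating them and reading the files back, using a temporary directory and a toy frame (both assumptions for illustration, not the project's real paths or data):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a non-default index, as left over after filtering rows.
toy = pd.DataFrame(
    {'Area Code': [2], 'Area': ['Afghanistan'], 'Y1961': [3600.0]},
    index=[10],
)

with tempfile.TemporaryDirectory() as tmp:
    csv_dir = Path(tmp) / 'data' / 'csv'
    pkl_dir = Path(tmp) / 'data' / 'pickles'
    # Create the folders if they do not exist yet
    csv_dir.mkdir(parents=True, exist_ok=True)
    pkl_dir.mkdir(parents=True, exist_ok=True)

    toy.to_csv(csv_dir / 'toy.csv')
    toy.to_pickle(pkl_dir / 'toy.pkl')

    # index_col=0 restores the saved index instead of adding an extra column
    from_csv = pd.read_csv(csv_dir / 'toy.csv', index_col=0)
    from_pkl = pd.read_pickle(pkl_dir / 'toy.pkl')

print(from_pkl.equals(toy))
print(from_csv.index.tolist())
```

The pickle preserves index and dtypes exactly, while the CSV is easier to inspect by hand, which is one reason to keep both formats.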

In [50]:
# Extract year and flag columns
year_col = [col for col in livestock_meat.columns if col.startswith('Y') and not col.endswith('F')]
flag_col = [col for col in livestock_meat.columns if col.startswith('Y') and col.endswith('F')]

In [61]:
(livestock_meat[year_col].isna()).sum()

Y1961    1082
Y1962    1081
Y1963    1082
Y1964    1082
Y1965    1082
Y1966    1069
Y1967    1065
Y1968    1065
Y1969    1054
Y1970    1049
Y1971    1048
Y1972    1045
Y1973    1048
Y1974    1043
Y1975    1043
Y1976    1041
Y1977    1032
Y1978    1032
Y1979    1024
Y1980    1024
Y1981    1021
Y1982    1028
Y1983    1026
Y1984    1021
Y1985    1015
Y1986    1015
Y1987    1016
Y1988    1016
Y1989    1015
Y1990     962
Y1991     951
Y1992     455
Y1993     404
Y1994     379
Y1995     369
Y1996     364
Y1997     361
Y1998     353
Y1999     345
Y2000     306
Y2001     312
Y2002     305
Y2003     293
Y2004     290
Y2005     291
Y2006     278
Y2007     277
Y2008     275
Y2009     279
Y2010     277
Y2011     275
Y2012     274
Y2013     282
Y2014    4542
Y2015    4542
Y2016    4542
Y2017    4542
dtype: int64
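
Absolute counts are easier to judge as a share of all rows. The sketch below shows one way to compute that share and list the years above a cutoff. The toy frame and the 50% cutoff are assumptions chosen for illustration; the cutoff is not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one well-covered year and one empty year.
toy = pd.DataFrame({
    'Y2013': [1.0, np.nan, 3.0, 4.0],
    'Y2014': [np.nan, np.nan, np.nan, np.nan],
})
year_cols = [c for c in toy.columns if c.startswith('Y') and not c.endswith('F')]

# isna().mean() gives the fraction of missing values per column
missing_share = toy[year_cols].isna().mean()
print(missing_share)

# Years where more than half of the rows are missing (assumed cutoff)
sparse_years = missing_share[missing_share > 0.5].index.tolist()
print(sparse_years)
```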