### Data Cleaning - Livestock
Based on what we found during data exploration, we now clean the data.  
The aim is a general cleaning step that can be reused in all four tasks.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
livestock = pd.read_csv('raw_data/Production_LivestockPrimary_E_All_Data/Production_LivestockPrimary_E_All_Data.csv', sep = ',', encoding = 'latin-1')
livestock.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
0,2,Afghanistan,1062,"Eggs, hen, in shell",5313,Laying,1000 Head,4000.0,F,4400.0,...,9500.0,F,9500.0,F,9337.0,Im,9369.0,Im,10688.0,F
1,2,Afghanistan,1062,"Eggs, hen, in shell",5410,Yield,100mg/An,25000.0,Fc,25000.0,...,18947.0,Fc,19474.0,Fc,21253.0,Fc,21263.0,Fc,18713.0,Fc
2,2,Afghanistan,1062,"Eggs, hen, in shell",5510,Production,tonnes,10000.0,F,11000.0,...,18000.0,F,18500.0,F,19844.0,Im,19921.0,Im,20000.0,F


As we are primarily interested in meat production in this task, we keep only the **Production** and **Producing Animals/Slaughtered** elements.  

After removing the other elements, we drop all items that are not meat.

In [19]:
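# Keep only the two elements of interest; the element name is spelled as it appears in the data ('Producing Animals/Slaughtered', no space before the slash)
livestock_prod = livestock[livestock['Element'].isin(['Production', 'Producing Animals/Slaughtered'])]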
livestock_prod.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
2,2,Afghanistan,1062,"Eggs, hen, in shell",5510,Production,tonnes,10000.0,F,11000.0,...,18000.0,F,18500.0,F,19844.0,Im,19921.0,Im,20000.0,F
3,2,Afghanistan,1067,"Eggs, hen, in shell (number)",5513,Production,1000 No,200000.0,F,220000.0,...,360000.0,F,370000.0,F,396880.0,Im,398420.0,Im,400000.0,F
6,2,Afghanistan,919,"Hides, cattle, fresh",5510,Production,tonnes,7200.0,Fc,7680.0,...,14890.0,Fc,,,,,,,,


In [17]:
livestock_meat = livestock_prod[livestock_prod['Item'].str.contains('Meat')]
livestock_meat.head(3)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
8,2,Afghanistan,1137,"Meat indigenous, camel",5322,Production,Head,20000.0,F,22393.0,...,19500.0,F,,,,,,,,
10,2,Afghanistan,1137,"Meat indigenous, camel",5510,Production,tonnes,3600.0,Fc,4031.0,...,3510.0,Fc,,,,,,,,
11,2,Afghanistan,944,"Meat indigenous, cattle",5322,Production,Head,360000.0,F,384000.0,...,654604.0,F,,,,,,,,
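
Since `str.contains('Meat')` is a case-sensitive substring match, it can be worth listing which items were kept and which were dropped. The following is a minimal sketch on a toy frame; `toy` is a hypothetical stand-in for `livestock_prod`, with item names copied from the outputs above.

```python
import pandas as pd

# Hypothetical stand-in for livestock_prod; item names come from the outputs above.
toy = pd.DataFrame({
    'Item': ['Eggs, hen, in shell', 'Hides, cattle, fresh',
             'Meat indigenous, camel', 'Meat, chicken'],
})

# Same filter as in the cell above: case-sensitive substring match on 'Meat'.
is_meat = toy['Item'].str.contains('Meat')

kept = sorted(toy.loc[is_meat, 'Item'].unique())
dropped = sorted(toy.loc[~is_meat, 'Item'].unique())

print('kept:   ', kept)
print('dropped:', dropped)
```

Running the same two lines on the real `livestock_prod` would show whether any meat item is spelled without a capital "Meat" and therefore missed by the filter.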


Now we look at the units.
To compare the data easily, we would like a common unit across all rows.

In [24]:
livestock_meat.drop_duplicates(subset=['Element', 'Unit'])

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2013,Y2013F,Y2014,Y2014F,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F
8,2,Afghanistan,1137,"Meat indigenous, camel",5322,Production,Head,20000.0,F,22393.0,...,19500.0,F,,,,,,,,
10,2,Afghanistan,1137,"Meat indigenous, camel",5510,Production,tonnes,3600.0,Fc,4031.0,...,3510.0,Fc,,,,,,,,
14,2,Afghanistan,1094,"Meat indigenous, chicken",5323,Production,1000 Head,7000.0,F,7500.0,...,29648.0,F,,,,,,,,
23,2,Afghanistan,1127,"Meat, camel",5320,Producing Animals/Slaughtered,Head,20000.0,F,22393.0,...,19500.0,Im,19823.0,Im,20007.0,Im,19992.0,Im,20310.0,Im
29,2,Afghanistan,1058,"Meat, chicken",5321,Producing Animals/Slaughtered,1000 Head,7000.0,F,7500.0,...,33000.0,Im,31031.0,Im,30716.0,Im,30543.0,Im,34839.0,Im
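
The same check can be written so that only the two columns of interest are shown, which is easier to read than full rows. This is a sketch on a toy frame; `toy_meat` is a hypothetical stand-in for `livestock_meat`, with values copied from the output above.

```python
import pandas as pd

# Hypothetical stand-in for livestock_meat; values are copied from the output above.
toy_meat = pd.DataFrame({
    'Item': ['Meat indigenous, camel', 'Meat indigenous, camel',
             'Meat indigenous, chicken', 'Meat, camel',
             'Meat, chicken', 'Meat indigenous, cattle'],
    'Element': ['Production', 'Production', 'Production',
                'Producing Animals/Slaughtered',
                'Producing Animals/Slaughtered', 'Production'],
    'Unit': ['Head', 'tonnes', '1000 Head', 'Head', '1000 Head', 'Head'],
})

# Select the two columns first, then drop repeated (Element, Unit) pairs.
unit_overview = (
    toy_meat[['Element', 'Unit']]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(unit_overview)
```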


Although the units differ, each one makes sense for its category, so we keep them as they are for now.  
Next, we split the dataset into countries, aggregate areas and continents for easier use later. The way area codes are organized makes this easy: countries have low area codes (well below 500), while aggregate areas have codes of 5000 and above, so a cut-off of 500 separates the two.

In [35]:
meat_countries = livestock_meat[livestock_meat['Area Code'] < 500]
meat_area = livestock_meat[livestock_meat['Area Code'] > 500]
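
Because the two filters use strict inequalities, a row with an area code of exactly 500 would land in neither frame. A quick check that the split loses no rows could look like the sketch below; `toy` and its codes are illustrative values, not taken from the dataset (apart from Afghanistan's code 2, which appears in the outputs above).

```python
import pandas as pd

# Illustrative stand-in for livestock_meat; only the two relevant columns are included.
toy = pd.DataFrame({
    'Area Code': [2, 4, 5000, 5100],
    'Area': ['Afghanistan', 'Country B', 'World', 'Africa'],
})

threshold = 500
countries = toy[toy['Area Code'] < threshold]
areas = toy[toy['Area Code'] > threshold]

# Rows whose code equals the threshold would be in neither subset.
lost = len(toy) - len(countries) - len(areas)
print('rows lost by the split:', lost)
```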

In [44]:
pd.unique(meat_area['Area'])

array(['World', 'Africa', 'Eastern Africa', 'Middle Africa',
       'Northern Africa', 'Southern Africa', 'Western Africa', 'Americas',
       'Northern America', 'Central America', 'Caribbean',
       'South America', 'Asia', 'Central Asia', 'Eastern Asia',
       'Southern Asia', 'South-Eastern Asia', 'Western Asia', 'Europe',
       'Eastern Europe', 'Northern Europe', 'Southern Europe',
       'Western Europe', 'Oceania', 'Australia and New Zealand',
       'Melanesia', 'Micronesia', 'Polynesia', 'European Union',
       'Least Developed Countries', 'Land Locked Developing Countries',
       'Small Island Developing States',
       'Low Income Food Deficit Countries',
       'Net Food Importing Developing Countries'], dtype=object)

In [37]:
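# Note: 'Northern America' and 'South America' are sub-regions of 'Americas';
# 'Central America' and 'Caribbean' are not covered by this list.
continents = ['Africa', 'Northern America', 'South America', 'Asia', 'Oceania', 'Europe']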
meat_continents = meat_area[meat_area['Area'].isin(continents)]
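
`isin` silently ignores names that do not occur in the data, so a misspelled continent would simply produce no rows. A small sketch of how this could be checked follows; `toy_area` is a hypothetical stand-in for `meat_area`, using a few area names from the list printed above.

```python
import pandas as pd

# Hypothetical stand-in for meat_area, using area names from the array above.
toy_area = pd.DataFrame({
    'Area': ['World', 'Africa', 'Americas', 'Northern America',
             'South America', 'Asia', 'Europe', 'Oceania', 'Africa'],
})

continents = ['Africa', 'Northern America', 'South America', 'Asia', 'Oceania', 'Europe']
toy_continents = toy_area[toy_area['Area'].isin(continents)]

# Names requested but not present in the filtered result.
missing = set(continents) - set(toy_continents['Area'])
print('continents not found:', missing)
```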

In [40]:
# Save dataframes to CSV
meat_countries.to_csv('./data/meat_countries.csv')
meat_area.to_csv('./data/meat_area.csv')
meat_continents.to_csv('./data/meat_continents.csv')

In [42]:
# Save dataframes to pickles
meat_countries.to_pickle('./data/pickles/meat_countries.pkl')
meat_area.to_pickle('./data/pickles/meat_area.pkl')
meat_continents.to_pickle('./data/pickles/meat_continents.pkl')
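
For later tasks, the saved files can be read back as sketched below. `to_csv` writes the index as an unnamed first column by default, so `index_col=0` is needed to restore it, whereas a pickle keeps the index and dtypes as they were. The frame `toy` and the temporary paths are placeholders for illustration, not the project's actual files.

```python
import os
import tempfile

import pandas as pd

# Placeholder frame with a non-default index, similar in shape to the saved data.
toy = pd.DataFrame(
    {'Area': ['Afghanistan', 'Afghanistan'],
     'Y1961': [20000.0, 3600.0],
     'Y1961F': ['F', 'Fc']},
    index=[8, 10],
)

# Write to a temporary directory instead of ./data to keep the sketch self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'toy.csv')
pkl_path = os.path.join(tmp_dir, 'toy.pkl')
toy.to_csv(csv_path)
toy.to_pickle(pkl_path)

# index_col=0 turns the unnamed first CSV column back into the index.
from_csv = pd.read_csv(csv_path, index_col=0)
from_pkl = pd.read_pickle(pkl_path)

print(from_csv)
print(from_pkl.equals(toy))
```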