# Livestock - Exploration and Data Cleaning
In this notebook we explore the dataset we got from FAO and try to extract the most important information from it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
livestock = pd.read_csv('raw_data/Production_LivestockPrimary_E_All_Data.csv', sep = ',', encoding = 'latin-1')
livestock.head(3)

## Explanations of data files
Each row of the livestock dataset holds the values for all available years for one combination of metadata (area, item, element and unit).  
There are 7 metadata columns, listed below; the remaining columns hold the data for each year.  
The years usually span 1961 to 2017, but some years are missing.  

| Column name         | Explanation|
|------------------------|--------|
| Area                   |Name of country/area|
| Area Code              |Unique code for each country/area|
| Item                   |Type of product, e.g. "Eggs, hen, in shell"|
| Item Code              |Unique code for each Item|
| Element                |Type of data, e.g. Production, Yield, Milk Animals |
| Unit                   |Unit of measurement of the element (9 different units in total) |
| Element Code           |Unique code based on pairs of Element and Unit|
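
As a minimal sketch of this layout, the toy frame below separates metadata columns from yearly columns. The codes and values are made up for illustration; only the `Y<year>` column naming follows the real file.

```python
import pandas as pd

# Toy frame mimicking the wide layout; codes and values are illustrative only
toy = pd.DataFrame({
    'Area Code': [1], 'Area': ['Country A'],
    'Item Code': [10], 'Item': ['Meat, cattle'],
    'Element Code': [100], 'Element': ['Production'],
    'Unit': ['tonnes'],
    'Y1961': [100.0], 'Y1962': [110.0],
})

# Yearly columns are named 'Y' followed by the year; everything else is metadata
year_cols = [c for c in toy.columns if c[0] == 'Y' and c[1:].isdigit()]
meta_cols = [c for c in toy.columns if c not in year_cols]

print(meta_cols)
print(year_cols)
```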


For each year there is also a flag column, named with the year followed by F (e.g. `Y1961F`), which indicates the source or status of the value.  

|  Flag  | Meaning of flag        |
|--------|------------------------|
| *      | Unofficial data        |
| F      | FAO Estimate           |
| NaN    | Official data          |
| Fc     | Calculated data        |
| A      | Aggregate, may include official, semi-official, estimated or calculated data |
| M      | Data not available     | 
| Im     |FAO data based on imputation methodology |
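
To get a feel for how the flags can be read, the sketch below translates a made-up flag column into the meanings from the table and counts them. The short labels in the dictionary are our own abbreviations of the table entries, not official FAO wording.

```python
import pandas as pd

# Flag meanings taken from the table above; a missing flag (NaN) means official data
flag_meaning = {
    '*': 'Unofficial data',
    'F': 'FAO estimate',
    'Fc': 'Calculated data',
    'A': 'Aggregate',
    'M': 'Data not available',
    'Im': 'Imputed by FAO',
}

# Made-up flag column for a single year
flags_1961 = pd.Series(['F', None, 'Im', None, 'M', 'F'], name='Y1961F')

# Translate the flags to text, treating missing flags as official data
readable = flags_1961.map(flag_meaning).fillna('Official data')
counts = readable.value_counts()
print(counts)
```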



### Elements
|  Element             | Description            |
|----------------------|------------------------|
| Laying               | Birds raised for laying eggs        |
| Yield                | Amount of product obtained per animal           |
| Production           | How much is produced          |
| Producing Animals/Slaughtered    | Number of animals slaughtered for meat        |
| Yield/Carcass Weight | Same as yield, but also taking the size of the animal into account |
| Milk Animals         | Number of animals used for milk production     | 
| Prod Popultn         | Population of a given item  |


### Units

The units can be quite confusing, but since we drop many of them we do not need to spend much time on this. Here is a quick summary.

Producing Animals/Slaughtered: 
- Laying in 1000 heads; 
- Milk Animals in heads; 
- Prod Population (Beehives) in number; 
- Prod Population (Slaughtered animals) in heads. 

Production Quantity: 

- Eggs in tonnes and in number; 
- Meat and milk in tonnes; 
- Wool and Hides and Skins in tonnes; 
- Honey and Beeswax in tonnes.

Yield: 

- 100 milligrams per animal; 
- number per animal; 
- hectograms per animal; 
- hectograms.  

Yield/Carcass Weight:

- 0.1 grams per animal (poultry);
- hectograms per animal (other animals).
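
Since the two carcass-weight units differ by a factor of 1000, a small helper for converting both to kilograms per animal may be handy. This is only a sketch: the unit labels `'hg/An'` and `'0.1g/An'` are assumed spellings and should be checked against the `Unit` column of the actual file.

```python
# Conversion factors to kilograms, derived from the unit definitions
HG_TO_KG = 0.1            # 1 hectogram = 100 g = 0.1 kg
TENTH_GRAM_TO_KG = 1e-4   # 0.1 g = 0.0001 kg

def carcass_weight_kg(value, unit):
    """Convert a Yield/Carcass Weight value to kilograms per animal.

    The unit labels are assumptions for illustration.
    """
    if unit == 'hg/An':
        return value * HG_TO_KG
    if unit == '0.1g/An':
        return value * TENTH_GRAM_TO_KG
    raise ValueError(f'Unknown unit: {unit}')

# Example with made-up values: a 2000 hg carcass and a 15000 x 0.1 g bird
print(carcass_weight_kg(2000, 'hg/An'))
print(carcass_weight_kg(15000, '0.1g/An'))
```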


## Data Cleaning

**Elements**  
As we are primarily interested in meat production in this task, we drop all elements except _Production_ and _Producing Animals/Slaughtered_. 

In [None]:
# Match the element with or without a space before the slash
element = livestock['Element'].str.replace(' /', '/', regex=False)
livestock_prod = livestock[element.isin(['Production', 'Producing Animals/Slaughtered'])]
livestock_prod.head(3)

**Items**  
After removing the other elements, we remove all items that are not meat.

In [None]:
livestock_meat = livestock_prod[livestock_prod['Item'].str.contains('Meat')]
livestock_meat.head(5)

**Units**  
To compare the data easily, we would like a common unit for all of it. For each item we get two numbers: the number of animals (heads) and the weight of the produced meat (tonnes). We choose to look only at the produced meat and remove the head counts.

In [None]:
livestock_meat = livestock_meat[livestock_meat['Unit'].str.contains('tonnes')]
livestock_meat.head(3)

Looking at this, we see that although the units differ somewhat, each makes sense for its category. We therefore keep them as they are for now.  

**Flags**  
The flags say something about the reliability of the data, but as this is the best data we have available, we "trust" all of it. However, we do keep the flags in case we observe inconsistencies in the future.

**Reshaping**  
To make this dataset fit the same format as the others, we reshape it so that the yearly values become rows instead of columns.  

In [None]:
# Identify the year columns ('Y1961'), flag columns ('Y1961F') and metadata columns
col_years = [col for col in livestock_meat.columns if col.startswith('Y') and not col.endswith('F')]
col_flags = [col for col in livestock_meat.columns if col.startswith('Y') and col.endswith('F')]
col_metadata = list(livestock_meat.columns[0:7])

# Melt twice, once on the year values and once on the flags
temp_years = livestock_meat.melt(id_vars = col_metadata, value_vars = col_years, var_name = 'Year', value_name = 'Value')
temp_flags = livestock_meat.melt(id_vars = col_metadata, value_vars = col_flags, var_name = 'FlagYear', value_name = 'Flag')

# Both melts produce rows in the same order, so the flags can be joined on the index
meat_data = temp_years.join(temp_flags['Flag'])
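
The index-based join works because both melts emit their rows in the same order. The toy example below, with made-up areas and values, is a small sketch that checks this alignment explicitly.

```python
import pandas as pd

# Made-up wide frame with two years and their flag columns
toy = pd.DataFrame({
    'Area': ['Country A', 'Country B'],
    'Item': ['Meat, cattle', 'Meat, cattle'],
    'Y1961': [100.0, 50.0], 'Y1961F': ['F', None],
    'Y1962': [110.0, 55.0], 'Y1962F': [None, 'Im'],
})
meta = ['Area', 'Item']

values = toy.melt(id_vars=meta, value_vars=['Y1961', 'Y1962'],
                  var_name='Year', value_name='Value')
flags = toy.melt(id_vars=meta, value_vars=['Y1961F', 'Y1962F'],
                 var_name='FlagYear', value_name='Flag')

# Sanity check: the year embedded in 'FlagYear' matches 'Year' row by row
assert (flags['FlagYear'].str.rstrip('F') == values['Year']).all()

# Rows are ordered identically, so joining on the index pairs value and flag
long_df = values.join(flags['Flag'])
print(long_df)
```

If the two column lists were ever ordered differently, the check would fail, which is a cheap guard before relying on the join.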

In [None]:
# Strip the leading 'Y' from the former column names and store the year as an integer
meat_data['Year'] = meat_data['Year'].str.replace('Y', '').astype(int)

In [None]:
meat_data

**Missing Data**  
Below we can see that some data are missing from the dataset, which is not very surprising for production at country level. Since 1960, many states have been founded and dissolved, the dissolution of the Soviet Union in 1991 being the most notable example. We assume that the missing years are covered by another area (for example, Ukraine within the Soviet Union before 1991), and remove all rows with missing values.  
Also, we remove the data after 2013, as a lot of information is missing there and we want consistent data.
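
One way to probe this assumption is to look at the first and last year for which each area reports a value. The sketch below uses made-up numbers for two areas; it is not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up long-format data: one area stops reporting in 1991, another starts in 1992
toy = pd.DataFrame({
    'Area': ['USSR'] * 4 + ['Ukraine'] * 4,
    'Year': [1990, 1991, 1992, 1993] * 2,
    'Value': [10.0, 11.0, np.nan, np.nan, np.nan, np.nan, 3.0, 4.0],
})

# First and last year with a reported value, per area
coverage = (toy.dropna(subset=['Value'])
               .groupby('Area')['Year']
               .agg(['min', 'max']))
print(coverage)
```

Areas whose coverage ends in one year while others begin in the next are candidates for this kind of succession.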

In [None]:
# Create series with missing values
missing_values = meat_data['Value'].isnull().groupby(meat_data['Year']).sum()

# Plot missing values with years on x-axis and missing values on y-axis
f = plt.figure(figsize = (12,6))
plt.plot(missing_values.index, missing_values.values)
plt.title('Missing Values in Livestock Dataset', fontsize = 16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Missing Values', fontsize=12)
plt.show()

In [None]:
# Remove all null values
meat_data = meat_data[meat_data['Value'].notnull()]

# Also remove all data after 2013 to have consistent data
meat_data = meat_data[meat_data['Year'] < 2014]

**Categorizing the Data**  
We would like to divide our dataset into areas, countries and continents for easier use in the future. Luckily, the way the area codes are organized makes this easy: codes below 251 are countries, and codes above 5000 are areas. 

In [None]:
from scripts.helpers import split_fao_data
print(split_fao_data.__doc__)

meat_countries, meat_area, meat_continents = split_fao_data(meat_data)
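
The helper itself lives in `scripts/helpers.py` and is not shown here. As a rough sketch of the idea only, a two-way split on `Area Code` could look like the following. The function `split_by_area_code` is hypothetical, is not the project's `split_fao_data`, and leaves out the continent split because the rule for it is not stated above.

```python
import pandas as pd

def split_by_area_code(df):
    """Hypothetical two-way split by 'Area Code' (not the project's split_fao_data)."""
    countries = df[df['Area Code'] < 251]   # codes below 251 are countries
    areas = df[df['Area Code'] > 5000]      # codes above 5000 are areas
    return countries, areas

# Made-up rows with illustrative codes
toy = pd.DataFrame({
    'Area': ['Country A', 'Country B', 'Region X'],
    'Area Code': [3, 250, 5100],
    'Value': [1.0, 2.0, 3.0],
})
countries, areas = split_by_area_code(toy)
print(len(countries), len(areas))
```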

**Saving the Data**  
To store the data for future use, we save it both as CSV and as pickle files.  
These CSV files are not pushed to git, however, so this notebook has to be run locally to produce them. 

In [None]:
# Save dataframes to CSV
meat_countries.to_csv('./data/csv/meat_countries.csv')
meat_area.to_csv('./data/csv/meat_area.csv')
meat_continents.to_csv('./data/csv/meat_continents.csv')

In [None]:
# Save dataframes to pickles
meat_countries.to_pickle('./data/pickles/meat_countries.pkl')
meat_area.to_pickle('./data/pickles/meat_area.pkl')
meat_continents.to_pickle('./data/pickles/meat_continents.pkl')
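
When reading the files back, note that `to_csv` writes the index as an unnamed first column by default, whereas a pickle restores the frame as it was. A minimal round-trip sketch on a made-up frame, using a temporary directory rather than the project's `data/` folder:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Area': ['Country A'], 'Year': [1961], 'Value': [100.0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pkl_path = os.path.join(tmp, 'toy.pkl')

    # The index is written as the first column, so read it back with index_col=0
    df.to_csv(csv_path)
    from_csv = pd.read_csv(csv_path, index_col=0)

    # The pickle stores the frame as-is, including index and dtypes
    df.to_pickle(pkl_path)
    from_pkl = pd.read_pickle(pkl_path)

print(list(from_csv.columns))
print(from_pkl.equals(df))
```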

#### Categorizing meats

In [None]:
meat_wanted = meat_continents[meat_continents.Item.str.contains('Meat, ')]
meat_wanted.Item.unique()
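
One effect of matching `'Meat, '` (with the comma) rather than just `'Meat'` is that item names where another word follows "Meat" are excluded. The item names below are illustrative, and the real file may contain other variants.

```python
import pandas as pd

# Illustrative item names, not an exhaustive list from the dataset
items = pd.Series(['Meat, cattle', 'Meat indigenous, cattle',
                   'Meat, Total', 'Eggs, hen, in shell'])

broad = items[items.str.contains('Meat')]      # any item mentioning meat
narrow = items[items.str.contains('Meat, ')]   # only items named 'Meat, <kind>'

print(list(broad))
print(list(narrow))
```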

We want to keep the three main meat categories (cattle, pig, chicken) separate and group the rest as follows:
- Equidae & Camelidae: ass, horse, camel, other camelids, mule
- Birds: bird nes, duck, goose & guinea fowl, turkey
- Bovidae: goat, sheep, buffalo
- Others: game, nes, rabbit, other rodents

In [None]:
main = ['Meat, pig', 'Meat, cattle', 'Meat, chicken', 'Meat, Total']
equidae_camelidae = ['Meat, ass', 'Meat, horse', 'Meat, camel', 'Meat, other camelids', 'Meat, mule']
bovidae = ['Meat, goat', 'Meat, sheep', 'Meat, buffalo']
birds = ['Meat, bird nes', 'Meat, duck', 'Meat, goose and guinea fowl', 'Meat, turkey']
others = ['Meat, rabbit', 'Meat, other rodents', 'Meat, game', 'Meat, nes']

meat_equidae_camelidae = meat_wanted[meat_wanted.Item.isin(equidae_camelidae)]
prod_equidae_camelidae = meat_equidae_camelidae.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_equidae_camelidae['Item'] = 'Meat, equidae and camelidae'

meat_bovidae = meat_wanted[meat_wanted.Item.isin(bovidae)]
prod_bovidae = meat_bovidae.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_bovidae['Item'] = 'Meat, bovidae'

meat_birds = meat_wanted[meat_wanted.Item.isin(birds)]
prod_birds = meat_birds.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_birds['Item'] = 'Meat, bird excl chicken'

meat_others = meat_wanted[meat_wanted.Item.isin(others)]
prod_others = meat_others.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_others['Item'] = 'Meat, other'

meat_main = meat_wanted[meat_wanted.Item.isin(main)]
prod_main = meat_main.groupby(['Area','Element','Year','Unit','Item']).agg({'Value':'sum'}).reset_index()
prod_main = prod_main[['Area','Element','Year','Unit','Value','Item']]

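# Combine all categories; ignore_index avoids duplicate row labels from the separate groupbys
meat_categorized = pd.concat([prod_equidae_camelidae, prod_bovidae, prod_birds, prod_others, prod_main], axis=0, ignore_index=True)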
meat_categorized
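
The same grouping can also be written with a single item-to-category mapping followed by one `groupby`. This is an alternative sketch on made-up numbers with only two of the categories, not the code used to produce `meat_categorized`.

```python
import pandas as pd

# Item -> category; items not listed (e.g. the main meats) keep their own name
category_of = {
    'Meat, goat': 'Meat, bovidae',
    'Meat, sheep': 'Meat, bovidae',
    'Meat, duck': 'Meat, bird excl chicken',
    'Meat, turkey': 'Meat, bird excl chicken',
}

# Made-up production figures for one area and year
toy = pd.DataFrame({
    'Area': ['Europe'] * 5,
    'Year': [2000] * 5,
    'Item': ['Meat, goat', 'Meat, sheep', 'Meat, duck', 'Meat, turkey', 'Meat, pig'],
    'Value': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Map each item to its category, falling back to the item name itself
toy['Category'] = toy['Item'].map(category_of).fillna(toy['Item'])
summed = toy.groupby(['Area', 'Year', 'Category'], as_index=False)['Value'].sum()
print(summed)
```

Keeping the mapping in one dictionary makes it easier to see at a glance which items end up in which category.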

In [None]:
# Save dataframe to pickles
meat_categorized.to_pickle('./data/pickles/meat_categorized.pkl')