In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [106]:
data = pd.read_csv('raw_data/fao_data_crops_data.csv.zip', compression='zip', header=0, sep=',', quotechar='"')
data.head(5)

|   | country_or_area | element_code | element | year | unit | value | value_footnotes | category |
|---|-----------------|--------------|---------|------|------|-------|-----------------|----------|
| 0 | Americas + | 31 | Area Harvested | 2007.0 | Ha | 49404.0 | A | agave_fibres_nes |
| 1 | Americas + | 31 | Area Harvested | 2006.0 | Ha | 49404.0 | A | agave_fibres_nes |
| 2 | Americas + | 31 | Area Harvested | 2005.0 | Ha | 49404.0 | A | agave_fibres_nes |
| 3 | Americas + | 31 | Area Harvested | 2004.0 | Ha | 49113.0 | A | agave_fibres_nes |
| 4 | Americas + | 31 | Area Harvested | 2003.0 | Ha | 48559.0 | A | agave_fibres_nes |


## Explanation of crops data file

Each row of the crops dataset holds one measurement for a given country/area, year, element and crop category.  
There are 8 columns, which are described below.
The years span from 1961 to 2007, but some years are undefined.  

| Column name         | Explanation          |
|---------------------|----------------------|
| country_or_area     | Name of country/area |
| year                | Year of the measurement |
| element             | Data classification type |
| element_code        | Unique code for each type of Element |
| unit                | Unit of measurement |
| value               | The value of the measurement |
| value_footnotes     | Where the data comes from |
| category            | Crop category |

The value footnotes used in the dataset have the following explanations:

|  Footnote  | Meaning of footnote    |
|------------|------------------------|
| Fc         | Calculated data        |
| A          | Aggregate, may include official, semi-official or estimated or calculated data |
| NR         | Not reported by country|
| F          | FAO Estimate           |
| *          | Unofficial figure      |
| NaN        | No footnote given; meaning unclear |

**Observation:** The dataset is sorted alphabetically by category, and each category's block of rows ends with rows that repeat the footnote descriptions above. These rows contain no measurements, so we remove them.

## Cleaning the data

In [122]:
# Removing rows that do not contain useful information:
# the mask is True for rows to keep, i.e. rows that are not footnote descriptions
keep_rows = ~data.country_or_area.isin(['fnSeqID', 'Fc', 'A ', 'NR', 'F ', '* '])
crops_data = data[keep_rows]
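
A filter like this can be sanity-checked on a tiny frame before trusting it on the full dataset. The sketch below uses made-up rows (the `toy` frame and its values are assumptions for illustration, not taken from the FAO file) and counts how many rows the mask removes.

```python
import pandas as pd

# Hypothetical toy frame: two data rows followed by footnote-description rows
toy = pd.DataFrame({
    'country_or_area': ['Americas +', 'Sweden', 'fnSeqID', 'Fc', 'A '],
    'value': [49404.0, 12.0, None, None, None],
})

# Same marker values as in the cleaning cell above
footnote_markers = ['fnSeqID', 'Fc', 'A ', 'NR', 'F ', '* ']
is_footnote_row = toy['country_or_area'].isin(footnote_markers)

# Keep only the rows that are not footnote descriptions
cleaned = toy[~is_footnote_row]
print(is_footnote_row.sum(), 'rows removed,', len(cleaned), 'rows kept')
```

Note that `isin` matches exact strings, so the trailing spaces in markers such as `'A '` matter.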

#### Missing information?

Are we now missing any information in our datasets?

In [227]:
print("Missing information in country based dataset: \n", crops_data.isna().sum())

Missing information in country based dataset: 
 country_or_area         0
element_code            0
element                 0
year                    0
unit                    0
value                   0
value_footnotes    478418
category                0
dtype: int64


The only column with missing values is `value_footnotes`, where 478418 rows have no footnote. Can we find something these rows have in common?

In [229]:
missing_values = crops_data[crops_data.value_footnotes.isna()]
print("Number of unique countries included in missing data: ", missing_values.country_or_area.unique().shape[0])
print("Number of unique elements included in missing data: ", missing_values.element.unique().shape[0])
print("Number of unique years included in missing data: ", missing_values.year.unique().shape[0])
print("Number of unique categories included in missing data: ", missing_values.category.unique().shape[0])
print("Number of unique units included in missing data: ", missing_values.unit.unique().shape[0])

Number of unique countries included in missing data:  217
Number of unique elements included in missing data:  4
Number of unique years included in missing data:  47
Number of unique categories included in missing data:  158
Number of unique units included in missing data:  3


We cannot detect a general rule behind the missing footnotes: they occur across 217 countries, 47 years, 158 categories, 4 elements and 3 units.
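
One way to look further for a pattern is to compare the share of missing footnotes between groups instead of only counting unique values. The following is a minimal sketch on made-up data (the `toy` frame is an assumption standing in for `crops_data`); it computes the fraction of rows without a footnote per element.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for crops_data
toy = pd.DataFrame({
    'element': ['Yield', 'Yield', 'Seed', 'Seed',
                'Area Harvested', 'Area Harvested'],
    'value_footnotes': [np.nan, np.nan, 'F ', np.nan, 'A ', 'Fc'],
})

# The mean of a boolean mask is the fraction of True values,
# here the fraction of rows with a missing footnote within each element
missing_share = (
    toy['value_footnotes'].isna()
       .groupby(toy['element'])
       .mean()
       .sort_values(ascending=False)
)
print(missing_share)
```

A share close to 0 or 1 for some group would hint at a rule; similar shares everywhere would support the conclusion that there is none. The same grouping could be repeated for `year`, `category` or `unit`.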

#### Names with '+'

There are countries/areas that contain a '+' at the end of the name. What names contain this sign and what do they have in common?

In [127]:
# Examining which names end with '+'
country_series = crops_data.country_or_area
names_with_sign = country_series[country_series.str.endswith('+')]
names_with_sign.unique()

array(['Americas +', 'Asia +', 'Caribbean +', 'Central America +',
       'Low Income Food Deficit Countries +',
       'Net Food Importing Developing Countries +',
       'Small Island Developing States +', 'South America +',
       'South-Eastern Asia +', 'World +', 'Africa +',
       'Australia and New Zealand +', 'Central Asia +', 'Eastern Asia +',
       'Eastern Europe +', 'Europe +', 'European Union +',
       'LandLocked developing countries +', 'Least Developed Countries +',
       'Northern Africa +', 'Northern America +', 'Oceania +',
       'Southern Africa +', 'Southern Asia +', 'Southern Europe +',
       'Western Africa +', 'Western Asia +', 'Western Europe +',
       'Eastern Africa +', 'Northern Europe +', 'Middle Africa +',
       'Micronesia +', 'Polynesia +', 'Melanesia +'], dtype=object)

All of the names that end with a '+' are aggregated areas rather than single countries. We can therefore divide the dataset into three groups: one with all countries, one with all continents and one with the remaining areas.

#### Splitting dataset 

In [223]:
# Splitting crops_data into country, continent and area based sets and renaming country_or_area column

# Names ending in '+' are aggregated areas rather than single countries
is_aggregate = country_series.str.endswith('+')
crops_country = crops_data[~is_aggregate].rename(columns={'country_or_area': 'country'})
# Copy so the assignment below modifies an independent frame, not a slice of crops_data
crops_remain = crops_data[is_aggregate].copy()

# Remove the trailing ' +' (last two characters) from the continent/area name
crops_remain['country_or_area'] = crops_remain['country_or_area'].str[:-2]

continents = ['Africa', 'Northern America', 'South America', 'Asia', 'Oceania', 'Europe']
is_continent = crops_remain.country_or_area.isin(continents)

crops_continent = crops_remain[is_continent].rename({'country_or_area': 'continent'}, axis=1)
crops_area = crops_remain[np.logical_not(is_continent)].rename({'country_or_area': 'area'}, axis=1)

print('Number of unique countries:', crops_country.country.unique().shape[0])
print('Number of unique continents:', crops_continent.continent.unique().shape[0])
print('Number of unique areas:', crops_area.area.unique().shape[0])

Number of unique countries: 219
Number of unique continents: 6
Number of unique areas: 28
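
As a hedged sanity check, the three subsets should partition the cleaned data: every row lands in exactly one of them. The sketch below repeats the split on a small made-up frame (the `toy` rows are assumptions for illustration) and verifies that the row counts add up.

```python
import pandas as pd

# Hypothetical toy frame with countries, continents and one other area
toy = pd.DataFrame({
    'country_or_area': ['Sweden', 'Europe +', 'World +', 'Kenya', 'Africa +'],
})

# Names ending in '+' are aggregates; everything else is a country
is_aggregate = toy['country_or_area'].str.endswith('+')
countries = toy[~is_aggregate]

# Strip the trailing ' +' before comparing against the continent list
remain = toy[is_aggregate].copy()
remain['country_or_area'] = remain['country_or_area'].str[:-2]

continents = ['Africa', 'Northern America', 'South America',
              'Asia', 'Oceania', 'Europe']
is_continent = remain['country_or_area'].isin(continents)
continent_rows = remain[is_continent]
area_rows = remain[~is_continent]

# The three subsets together should contain every original row exactly once
total = len(countries) + len(continent_rows) + len(area_rows)
print(total == len(toy))
```

On the real data the same check would compare `len(crops_country) + len(crops_continent) + len(crops_area)` with `len(crops_data)`.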


In [225]:
# Save dataframes to CSV
#crops_country.to_csv('./data/csv/crops_countries.csv')
#crops_area.to_csv('./data/csv/crops_areas.csv')
#crops_continent.to_csv('./data/csv/crops_continents.csv')

In [226]:
# Save dataframes to pickles
crops_country.to_pickle('./data/pickles/crops_countries.pkl')
crops_area.to_pickle('./data/pickles/crops_areas.pkl')
crops_continent.to_pickle('./data/pickles/crops_continents.pkl')
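
Pickles keep the index and column dtypes as they are, which is one reason to prefer them over CSV for intermediate results. The sketch below shows a round trip on a small made-up frame written to a temporary directory; the file name `toy_crops.pkl` and the data are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for one of the split datasets
toy = pd.DataFrame({
    'country': ['Sweden', 'Kenya'],
    'year': [2006.0, 2007.0],
    'value': [1.5, 2.5],
})

# Write to a temporary directory and read the frame back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'toy_crops.pkl'
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original in values and dtypes
print(restored.equals(toy))
print(restored.dtypes.equals(toy.dtypes))
```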

#### Elements

In [185]:
# Aggregate year column to 'min - max' year
def agg_year(series):
    min_year = int(series.min())
    max_year = int(series.max())
    return '{} to {}'.format(min_year, max_year)

# Count the number of distinct countries in a group
def count_unique_area(series):
    return len(series.unique())

# Group by element code and element to see what these columns represent 
crops_country.groupby(['element_code', 'element'])\
             .agg({'value':'sum', 'unit':'unique', 'year':agg_year, 'country':count_unique_area})\
             .sort_values(by='value', ascending=False)

| element_code | element | value | unit | year | country |
|--------------|---------|-------|------|------|---------|
| 51 | Production Quantity | 563526000000.0 | [tonnes] | 1961 to 2007 | 219 |
| 31 | Area Harvested | 180533900000.0 | [Ha] | 1961 to 2007 | 217 |
| 41 | Yield | 36208940000.0 | [Hg/Ha] | 1961 to 2007 | 217 |
| 111 | Seed | 19836450000.0 | [tonnes] | 1961 to 2007 | 194 |
| 152 | Gross Production 1999-2001 (1000 I$) | 11809930000.0 | [1000 Int. $] | 1961 to 2007 | 191 |
| 154 | Net Production 1999-2001 (1000 I$) | 11280550000.0 | [1000 Int. $] | 1961 to 2007 | 191 |
| 438 | Net per capita PIN (base 1999-2001) | 1496454.0 | [Int. $] | 1961 to 2007 | 182 |
| 434 | Grs per capita PIN (base 1999-2001) | 1482819.0 | [Int. $] | 1961 to 2007 | 182 |
| 436 | Net PIN (base 1999-2001) | 995278.0 | [Int. $] | 1961 to 2007 | 182 |
| 432 | Gross PIN (base 1999-2001) | 989080.0 | [Int. $] | 1961 to 2007 | 182 |


Summing all values per element and sorting shows that element 51 (Production Quantity) has the greatest total. The number of unique countries also differs by element, from 219 for Production Quantity down to 182 for the PIN elements.
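
To see which countries are absent for a given element, the set of countries per element can be compared against the full set. This is a minimal sketch on made-up data (the `toy` frame and the choice of `'Seed'` as the element to inspect are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for crops_country
toy = pd.DataFrame({
    'country': ['Sweden', 'Sweden', 'Kenya', 'Peru', 'Peru'],
    'element': ['Production Quantity', 'Seed', 'Production Quantity',
                'Production Quantity', 'Seed'],
})

# Number of distinct countries reporting each element
countries_per_element = toy.groupby('element')['country'].nunique()
print(countries_per_element)

# Countries that appear in the data but never report the 'Seed' element
all_countries = set(toy['country'])
countries_with_seed = set(toy.loc[toy['element'] == 'Seed', 'country'])
missing_seed = sorted(all_countries - countries_with_seed)
print(missing_seed)
```

On the real data, replacing `toy` with `crops_country` would list the countries behind the gap between the per-element counts.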