In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [106]:
data = pd.read_csv('raw_data/fao_data_crops_data.csv.zip', compression='zip', header=0, sep=',', quotechar='"')
data.head(5)

Unnamed: 0,country_or_area,element_code,element,year,unit,value,value_footnotes,category
0,Americas +,31,Area Harvested,2007.0,Ha,49404.0,A,agave_fibres_nes
1,Americas +,31,Area Harvested,2006.0,Ha,49404.0,A,agave_fibres_nes
2,Americas +,31,Area Harvested,2005.0,Ha,49404.0,A,agave_fibres_nes
3,Americas +,31,Area Harvested,2004.0,Ha,49113.0,A,agave_fibres_nes
4,Americas +,31,Area Harvested,2003.0,Ha,48559.0,A,agave_fibres_nes


## Explanations of data files

Each row of the crops dataset contains data for a certain country/area and year.  
There are 8 columns, which are described in the table below.
The years span from 1961 to 2007, but some years are undefined.  

| Column name         | Explanation          |
|---------------------|----------------------|
| country_or_area     | Name of country/area |
| year                | Year of the measurement |
| element             | Data classification type |
| element_code        | Unique code for each type of Element |
| unit                | Unit of measurement |
| value               | The value of the measurement |
| value_footnotes     | Footnote code describing where the value comes from |
| category            | Crop category |
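
The year range stated above can be checked with a few aggregate calls. The following is only a minimal sketch on a made-up `toy` frame, assuming that an "undefined" year shows up as a missing (NaN) entry; the names `toy`, `first_year`, `last_year` and `n_missing` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the 'year' column; NaN stands in for an undefined year
toy = pd.DataFrame({'year': [2007.0, 1961.0, np.nan, 2004.0]})

first_year = int(toy['year'].min())        # min() skips NaN by default
last_year = int(toy['year'].max())         # max() skips NaN by default
n_missing = int(toy['year'].isna().sum())  # number of rows without a year

print('{} to {}, {} missing'.format(first_year, last_year, n_missing))
```

On the real data the same calls would be applied to `data['year']`.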

The value footnotes used in the dataset have the following explanations:

|  Footnote  | Meaning of footnote    |
|------------|------------------------|
| Fc         | Calculated data        |
| A          | Aggregate, may include official, semi-official or estimated or calculated data |
| NR         | Not reported by country |
| F          | FAO Estimate           |
| *          | Unofficial figure      |

Some rows have no footnote at all (NaN).

*Observation:* The dataset is sorted alphabetically by category, and at the end of each category there are rows that repeat the footnote descriptions above. These rows contain no useful data, so we remove them.

In [122]:
# Keep only rows that hold real data; the footnote-description rows are dropped
keep_rows = ~data.country_or_area.isin(['fnSeqID', 'Fc', 'A ', 'NR', 'F ', '* '])
crops_data = data[keep_rows]

In [115]:
crops_data.country_or_area.unique().shape

(253,)

Some country/area names end with a '+'. Which names are these, and what do they have in common?

In [123]:
country_series = crops_data.country_or_area
unique_land = country_series[country_series.str.endswith('+')]
unique_land.unique()

array(['Americas +', 'Asia +', 'Caribbean +', 'Central America +',
       'Low Income Food Deficit Countries +',
       'Net Food Importing Developing Countries +',
       'Small Island Developing States +', 'South America +',
       'South-Eastern Asia +', 'World +', 'Africa +',
       'Australia and New Zealand +', 'Central Asia +', 'Eastern Asia +',
       'Eastern Europe +', 'Europe +', 'European Union +',
       'LandLocked developing countries +', 'Least Developed Countries +',
       'Northern Africa +', 'Northern America +', 'Oceania +',
       'Southern Africa +', 'Southern Asia +', 'Southern Europe +',
       'Western Africa +', 'Western Asia +', 'Western Europe +',
       'Eastern Africa +', 'Northern Europe +', 'Middle Africa +',
       'Micronesia +', 'Polynesia +', 'Melanesia +'], dtype=object)

Every country/area name that ends with a '+' is an area (a region or a group of countries) rather than a single country. We can therefore divide the dataset into two groups: one with all countries and one with all areas.
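
The split described above can be sketched as follows. This is a minimal illustration on a small made-up frame, assuming that a trailing '+' is the only marker of an area; the names `toy`, `is_area`, `areas` and `countries` are hypothetical and not part of the notebook, and the rows for 'World +' and 'Sweden' are invented for the example.

```python
import pandas as pd

# Hypothetical miniature of the crops data; only the column names match the dataset
toy = pd.DataFrame({
    'country_or_area': ['Americas +', 'Colombia', 'World +', 'Sweden'],
    'value': [49404.0, 17294.0, 100.0, 5.0],
})

# Boolean mask: True for names that end with '+', i.e. areas
is_area = toy['country_or_area'].str.endswith('+')

areas = toy[is_area].copy()        # regions and country groups
countries = toy[~is_area].copy()   # individual countries

print(sorted(areas['country_or_area']))
print(sorted(countries['country_or_area']))
```

On the real data, the same mask would be built from `crops_data.country_or_area`. Taking copies avoids pandas warnings if either group is modified later.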

In [4]:
# WHY IS THIS DONE?

# Aggregate year column to 'min - max' year
def agg_year(series):
    min_year = int(series.min())
    max_year = int(series.max())
    return '{} to {}'.format(min_year, max_year)

# Examine the different countries
def count_unique_area(series):
    return len(series.unique())

# Group by element code and element to see what these columns represent
df.groupby(['element_code', 'element']).agg({
    'value': 'sum',
    'unit': 'unique',
    'year': agg_year,
    'country_or_area': count_unique_area,
}).sort_values(by='value', ascending=False)

element_code,element,value,unit,year,country_or_area
51,Production Quantity,2357941000000.0,[tonnes],1961 to 2007,253
31,Area Harvested,797337600000.0,[Ha],1961 to 2007,251
111,Seed,83428710000.0,[tonnes],1961 to 2007,228
41,Yield,51734380000.0,[Hg/Ha],1961 to 2007,245
152,Gross Production 1999-2001 (1000 I$),49609200000.0,[1000 Int. $],1961 to 2007,222
154,Net Production 1999-2001 (1000 I$),47391080000.0,[1000 Int. $],1961 to 2007,222
438,Net per capita PIN (base 1999-2001),1642766.0,[Int. $],1961 to 2007,213
434,Grs per capita PIN (base 1999-2001),1630339.0,[Int. $],1961 to 2007,213
436,Net PIN (base 1999-2001),1105409.0,[Int. $],1961 to 2007,213
432,Gross PIN (base 1999-2001),1100095.0,[Int. $],1961 to 2007,213


#### Element code, element and their values
It seems that element code and element match one to one: each element code has exactly one element. Summing all values per element and sorting shows that 51 (Production Quantity) has the greatest total. We can also see that the number of unique countries/areas differs depending on the element.
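
One way to confirm the one-to-one match, rather than reading it off the grouped table, is to count distinct partners in both directions. The sketch below uses a made-up `toy` frame whose codes and names are copied from the table above; `names_per_code`, `codes_per_name` and `is_one_to_one` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample; the codes and element names come from the grouped table
toy = pd.DataFrame({
    'element_code': [31, 31, 51, 51, 41],
    'element': ['Area Harvested', 'Area Harvested',
                'Production Quantity', 'Production Quantity', 'Yield'],
})

# Distinct element names per code, and distinct codes per element name
names_per_code = toy.groupby('element_code')['element'].nunique()
codes_per_name = toy.groupby('element')['element_code'].nunique()

# The mapping is one to one only if every count is 1 in both directions
is_one_to_one = bool((names_per_code == 1).all() and (codes_per_name == 1).all())
print(is_one_to_one)
```

Checking both directions matters: a single direction would miss the case where two different codes share the same element name.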

In [6]:
print('The different footnotes: {} \n'.format(df['value_footnotes'].unique()))
print(df.isna().sum())
# A few columns contain NaN values, let's examine them.
# Something looks odd about element, year, unit and value: each has 958 NaN values
# These rows are the repeated explanations of the value_footnotes, so we can drop them
df.drop(df[df['value'].isna()].index, inplace=True)
# Now let's look at the rows where value_footnotes is NaN:
df[df['value_footnotes'].isna()]
# These rows seem to be okay and only lack a value footnote. Let's keep them this way for now

The different footnotes: ['A ' 'F ' nan 'Fc' 'NR'] 

country_or_area         0
element_code            0
element                 0
year                    0
unit                    0
value                   0
value_footnotes    478418
category                0
dtype: int64


Unnamed: 0,country_or_area,element_code,element,year,unit,value,value_footnotes,category
567,Colombia,31,Area Harvested,2004.0,Ha,17294.0,,agave_fibres_nes
568,Colombia,31,Area Harvested,2003.0,Ha,17094.0,,agave_fibres_nes
569,Colombia,31,Area Harvested,2002.0,Ha,17391.0,,agave_fibres_nes
570,Colombia,31,Area Harvested,2001.0,Ha,16802.0,,agave_fibres_nes
571,Colombia,31,Area Harvested,2000.0,Ha,17987.0,,agave_fibres_nes
...,...,...,...,...,...,...,...,...
2255150,"Venezuela, Bolivarian Republic of",51,Production Quantity,1965.0,tonnes,61062.0,,yautia_cocoyam
2255151,"Venezuela, Bolivarian Republic of",51,Production Quantity,1964.0,tonnes,59225.0,,yautia_cocoyam
2255152,"Venezuela, Bolivarian Republic of",51,Production Quantity,1963.0,tonnes,57500.0,,yautia_cocoyam
2255153,"Venezuela, Bolivarian Republic of",51,Production Quantity,1962.0,tonnes,55825.0,,yautia_cocoyam
