In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('raw_data/fao_data_crops_data.csv.zip', compression='zip', header=0, sep=',', quotechar='"')
print('Description of the different footnotes')
df.tail(6)[['country_or_area', 'element_code']]

Description of the different footnotes


Unnamed: 0,country_or_area,element_code
2255343,fnSeqID,Footnote
2255344,Fc,Calculated Data
2255345,A,"May include official, semi-official or estimat..."
2255346,NR,Not reported by country
2255347,F,FAO Estimate
2255348,*,Unofficial figure


In [3]:
country_series = df['country_or_area']

unique_land = country_series[np.logical_not(country_series.str.endswith('+'))]
print('The number of countries seems to be: {}'.format(len(unique_land.value_counts())))

The number of countries seems to be: 225
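
The `df.tail(6)` output above shows that the footnote legend (`fnSeqID`, `Fc`, `A`, ...) is stored in the `country_or_area` column, so the count of 225 may include those codes if they were still present when this cell ran. A minimal sketch of a stricter filter, using a hypothetical sample series (the names `'World +'` and `'Americas +'` are made up for illustration, and treating a trailing `+` as an aggregate region is an assumption carried over from the filter above):

```python
import pandas as pd

# Hypothetical sample mimicking the `country_or_area` column.
names = pd.Series(['Colombia', 'Colombia', 'World +', 'Americas +',
                   'Kenya', 'fnSeqID', 'Fc', 'NR'])

# Legend codes copied from the df.tail(6) output above.
legend_codes = {'fnSeqID', 'Fc', 'A', 'NR', 'F', '*'}

# Assumption: a trailing '+' marks an aggregated region rather than a country.
is_aggregate = names.str.endswith('+')
is_legend = names.isin(legend_codes)

# Keep only rows that are neither aggregates nor legend entries.
countries = names[~is_aggregate & ~is_legend]
n_countries = countries.nunique()
print('Distinct countries in the sample: {}'.format(n_countries))
```

On the real data the same two masks could be built from `df['country_or_area']`.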


In [56]:
# Aggregate year column to 'min - max' year
def agg_year(series):
    min_year = int(series.min())
    max_year = int(series.max())
    return '{} to {}'.format(min_year, max_year)

# Count the distinct countries/areas within a group
def count_unique_area(series):
    return len(series.unique())

# Group by element code and element to see what these columns represent
df.groupby(['element_code', 'element']).agg({
    'value': 'sum',
    'unit': 'unique',
    'year': agg_year,
    'country_or_area': count_unique_area,
}).sort_values(by='value', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,value,unit,year,country_or_area
element_code,element,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
51,Production Quantity,2357941000000.0,[tonnes],1961 to 2007,253
31,Area Harvested,797337600000.0,[Ha],1961 to 2007,251
111,Seed,83428710000.0,[tonnes],1961 to 2007,228
41,Yield,51734380000.0,[Hg/Ha],1961 to 2007,245
152,Gross Production 1999-2001 (1000 I$),49609200000.0,[1000 Int. $],1961 to 2007,222
154,Net Production 1999-2001 (1000 I$),47391080000.0,[1000 Int. $],1961 to 2007,222
438,Net per capita PIN (base 1999-2001),1642766.0,[Int. $],1961 to 2007,213
434,Grs per capita PIN (base 1999-2001),1630339.0,[Int. $],1961 to 2007,213
436,Net PIN (base 1999-2001),1105409.0,[Int. $],1961 to 2007,213
432,Gross PIN (base 1999-2001),1100095.0,[Int. $],1961 to 2007,213


#### Element code, element and their values
`element` and `element_code` appear to match one to one: each element code corresponds to exactly one element. Summing all values per element and sorting shows that 51 (Production Quantity) has the largest total, although the totals are in different units and so are not directly comparable. The number of unique countries/areas also differs by element, ranging from 213 to 253.
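
The one-to-one claim can be checked explicitly rather than read off the grouped table. A minimal sketch on a hypothetical miniature of the table (the `sample` frame is made up; on the real data the same two `groupby` calls would run on `df`):

```python
import pandas as pd

# Hypothetical miniature of the crops table; the real `df` has far more rows.
sample = pd.DataFrame({
    'element_code': [31, 31, 51, 51, 41],
    'element': ['Area Harvested', 'Area Harvested',
                'Production Quantity', 'Production Quantity', 'Yield'],
})

# Distinct element names per code, and distinct codes per element name.
names_per_code = sample.groupby('element_code')['element'].nunique()
codes_per_name = sample.groupby('element')['element_code'].nunique()

# The mapping is one to one only if neither direction ever exceeds 1.
is_one_to_one = bool((names_per_code == 1).all() and (codes_per_name == 1).all())
print('One-to-one mapping: {}'.format(is_one_to_one))
```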

In [49]:
print('The different footnotes: {} \n'.format(df['value_footnotes'].unique()))
print(df.isna().sum())
# A few columns contain NaN values, let's examine them.
# element, year, unit and value, with their 958 NaN values, look suspicious.
# Those rows seem to be the repeated explanations of the value_footnotes, so we can drop them.
df.dropna(subset=['value'], inplace=True)
# Now let's look at the rows where value_footnotes is NaN
df[df['value_footnotes'].isna()]
# These rows seem to be okay and are only missing the value footnote. Let's keep them this way for now.

The different footnotes: ['A ' 'F ' nan 'Fc' 'NR'] 

country_or_area         0
element_code            0
element                 0
year                    0
unit                    0
value                   0
value_footnotes    478418
category                0
dtype: int64


Unnamed: 0,country_or_area,element_code,element,year,unit,value,value_footnotes,category
567,Colombia,31,Area Harvested,2004.0,Ha,17294.0,,agave_fibres_nes
568,Colombia,31,Area Harvested,2003.0,Ha,17094.0,,agave_fibres_nes
569,Colombia,31,Area Harvested,2002.0,Ha,17391.0,,agave_fibres_nes
570,Colombia,31,Area Harvested,2001.0,Ha,16802.0,,agave_fibres_nes
571,Colombia,31,Area Harvested,2000.0,Ha,17987.0,,agave_fibres_nes
...,...,...,...,...,...,...,...,...
2255150,"Venezuela, Bolivarian Republic of",51,Production Quantity,1965.0,tonnes,61062.0,,yautia_cocoyam
2255151,"Venezuela, Bolivarian Republic of",51,Production Quantity,1964.0,tonnes,59225.0,,yautia_cocoyam
2255152,"Venezuela, Bolivarian Republic of",51,Production Quantity,1963.0,tonnes,57500.0,,yautia_cocoyam
2255153,"Venezuela, Bolivarian Republic of",51,Production Quantity,1962.0,tonnes,55825.0,,yautia_cocoyam
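
The footnote codes printed above include trailing spaces (`'A '` and `'F '`), which would break comparisons such as `df['value_footnotes'] == 'A'`. A minimal sketch of normalizing them, using a hypothetical series that reproduces the printed codes:

```python
import numpy as np
import pandas as pd

# Hypothetical sample reproducing the codes printed above, trailing spaces included.
footnotes = pd.Series(['A ', 'F ', np.nan, 'Fc', 'NR'])

# .str.strip() removes surrounding whitespace and leaves missing values missing.
cleaned = footnotes.str.strip()
print(cleaned.unique())
```

On the real data this would be `df['value_footnotes'] = df['value_footnotes'].str.strip()`.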


### Explanation of columns in 'fao_data_crops_data'
1. **country_or_area** -> Which country/area the data comes from
1. **element_code** -> The number corresponding to a certain element
1. **element** -> Type of data, e.g. 'Area Harvested' or 'Production Quantity'
1. **year** -> What year the data comes from, column spans from 1961 to 2007
1. **unit** -> The unit of the 'value' column, e.g. 'tonnes' or 'Ha'
1. **value** -> The measured amount of the element, expressed in 'unit'
1. **value_footnotes** -> Value footnote, see description above
1. **category** -> The crop the row refers to, e.g. 'agave_fibres_nes' or 'yautia_cocoyam'
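
The `year` column is displayed as a float (e.g. `2004.0`), presumably because it contained NaN values before the legend rows were dropped. Once no NaN values remain it could be cast to an integer type. A minimal sketch on a hypothetical sample frame:

```python
import pandas as pd

# Hypothetical sample with float years, as shown in the table output above.
sample = pd.DataFrame({
    'year': [2004.0, 2003.0, 1961.0],
    'value': [17294.0, 17094.0, 100.0],
})

# Safe only when the column has no NaN left; astype(int) raises otherwise.
sample['year'] = sample['year'].astype(int)
print(sample.dtypes)
```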

