## Crops - Exploration and Cleaning
This notebook explores and cleans the FAO crops dataset.  
We first describe the dataset and then clean it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv('raw_data/fao_data_crops_data.csv.zip', compression='zip', header=0, sep=',', quotechar='"')
data.head(5)

# Explanation of crops data file

Each row of the crops dataset contains data for a certain country/area and year.  
There are 8 columns, described in the table below.  
The years span from 1961 to 2007, but some years are undefined.  

| Column name         | Explanation          |
|---------------------|----------------------|
| country_or_area     | Name of country/area |
| year                | Year of the measurement |
| element             | Data classification type |
| element_code        | Unique code for each type of Element |
| unit                | Unit of measurement |
| value               | The value of the measurement |
| value_footnotes     | Footnote code describing where the value comes from (see below) |
| category            | Crop category |

### Footnotes
The value footnotes used in the dataset have the following explanations:

|  Footnote  | Meaning of footnote    |
|------------|------------------------|
| Fc         | Calculated data        |
| A          | Aggregate, may include official, semi-official or estimated or calculated data |
| NR         | Not reported by country|
| F          | FAO Estimate           |

### Elements
- **Area harvested** refers to the area under cultivation. Area under cultivation means the area that corresponds to the total sown area, but after the harvest it excludes ruined areas (e.g. due to natural disasters). If the same land parcel is used twice in the same year, the area of this parcel can be counted twice. 
- **Production quantity** means the harvested production. Harvested production means production including on-holding losses and wastage, quantities consumed directly on the farm and marketed quantities, indicated in units of basic product weight. *Harvest year* means the calendar year in which the harvest begins. 
- **Yield** means the harvested production per ha for the area under cultivation. 
- **Seed** quantity comprises all amounts of the commodity in question used during the reference period for reproductive purposes, such as seed or seedlings. Usually, the average seed rate in any given country does not vary greatly from year to year.
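
As a small worked example of how these elements relate, the sketch below computes yield as harvested production divided by area harvested. The function name `yield_per_hectare` and the figures are illustrative assumptions, not values from the dataset, and the dataset's own unit for yield may differ (see the `unit` column).

```python
def yield_per_hectare(production_tonnes: float, area_ha: float) -> float:
    """Return harvested production per hectare of area under cultivation."""
    if area_ha <= 0:
        raise ValueError("area harvested must be positive")
    return production_tonnes / area_ha

# Hypothetical figures: 12 000 tonnes harvested from 4 000 ha
example_yield = yield_per_hectare(12_000, 4_000)
print(example_yield)
```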

### Categories
Crop statistics are recorded for 172 products, covering the following categories: Crops Primary, Fibre Crops Primary, Cereals, Coarse Grain, Citrus Fruit, Fruit, Jute Jute-like Fibres, Oilcakes Equivalent, Oil crops Primary, Pulses, Roots and Tubers, Treenuts and Vegetables and Melons. The objective is to comprehensively cover production of all primary crops for all countries and regions in the world. 

**Cereals**: Area and production data on cereals relate to crops harvested for dry grain only. Cereal crops harvested for hay or harvested green for food, feed or silage or used for grazing are therefore excluded. 

**Vegetables**, total (including melons): Data relate to vegetable crops grown mainly for human consumption. Crops such as cabbages, pumpkins and carrots, when explicitly cultivated for animal feed, are therefore excluded. Statistics on vegetables are not available in many countries, and the coverage of the reported data differs from country to country. In general, it appears that the data refer to crops grown in field and market gardens mainly for sale, thus excluding crops cultivated in kitchen gardens or small family gardens mainly for household consumption.

**Fruit**, total (excluding melons): Data refer to total production of fresh fruit, whether finally used for direct consumption for food or feed, or processed into different products: dry fruit, juice, jam, alcohol, etc. Generally, production data relate to plantation crops or orchard crops grown mainly for sale. Data on production from scattered trees used mainly for home consumption are not usually collected. Production from wild plants, particularly berries, which is of some importance in certain countries, is generally disregarded by national statistical services. Therefore, the data for the various fruits and berries are rather incomplete. Dates, plantains and total grapes are included in the “total fruit” aggregated figures, while olives are excluded.

**Bananas and plantains**: Figures on bananas refer, as far as possible, to all edible fruit-bearing species of the genus Musa except Musa paradisiaca, commonly known as plantain. Unfortunately, several countries make no distinction in their statistics between bananas and plantains and publish only overall estimates. When this occurs and there is some indication or assumption that the data reported refer mainly to bananas, the data are included. The production data on bananas and plantains reported by the various countries are also difficult to compare because a number of countries report in terms of bunches, which generally means that the stalk is included in the weight.  

**Treenuts**, aggregated: Production of nuts (including chestnuts) relates to nuts in the shell or in the husk. Statistics are very scanty and generally refer only to crops for sale. In addition to the kind of nuts shown separately, production data include all other treenuts mainly used as dessert or table nuts, such as pecan nuts, pili nuts, sapucaia nuts and macadamia nuts. Nuts mainly used for flavouring beverages are excluded, as are masticatory and stimulant nuts and nuts used mainly for the extraction of oil or butter, including areca/betel nuts, cola nuts, illipe nuts, karite nuts, coconuts, tung nuts, oilpalm nuts etc.

# Cleaning the data

### Renaming
First, we rename the columns to match the livestock dataset.  
Most names are simply capitalised, but _country_or_area_ becomes _Area_, _value_footnotes_ becomes _Flag_ and _category_ becomes _Item_.

In [None]:
new_names = {'country_or_area': 'Area', 'element_code': 'Element Code', 
             'element': 'Element', 'year': 'Year', 'unit': 'Unit', 
             'value': 'Value', 'value_footnotes': 'Flag', 'category': 'Item'}

data.rename(columns = new_names, inplace = True)
data.head(2)

### Unnecessary rows
The dataset is ordered alphabetically by category, and after the last row of each Item there are rows that repeat the footnote descriptions above. These rows contain no useful data, so we remove them.

In [None]:
# Remove the footnote-description rows, which carry no measurements
keep_rows = ~data.Area.isin(['fnSeqID', 'Fc', 'A ', 'NR', 'F ', '* '])
crops_data = data[keep_rows]
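
The filter above depends on the exact padding of the footnote codes (for example `'A '` with a trailing space). A minimal sketch of a padding-tolerant variant is shown below on a toy frame. The frame `toy` and the set `footnote_markers` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the raw file: two data rows followed by footnote rows
toy = pd.DataFrame({
    'Area': ['Sweden', 'Norway', 'fnSeqID', 'Fc', 'A ', 'F '],
    'Value': [10.0, 7.0, None, None, None, None],
})

# Footnote codes without padding (assumed to be the same codes as above)
footnote_markers = {'fnSeqID', 'Fc', 'A', 'NR', 'F', '*'}

# Strip whitespace before comparing so 'A ' and 'A' are treated alike
is_footnote = toy['Area'].str.strip().isin(footnote_markers)
toy_clean = toy[~is_footnote]
print(toy_clean['Area'].tolist())
```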

In [None]:
print("Number of countries in our dataset:", crops_data.Area.unique().shape[0])

### Keep only category totals

Since we want to analyse the general production of crops it is more interesting for us to look at the total production of the different categories instead of looking at every type of item. We will therefore only keep the totals in our dataset. The categories are stored using the following item names:
- Fibre Crops Primary = fibre_crops_primary 
- Cereals = cereals_total
- Coarse Grain = coarse_grain_total
- Citrus Fruit = citrus_fruit_total
- Fruit = fruit_excl_melons_total
- Jute Jute-like Fibres = jute_jute_like_fibres
- Oilcakes Equivalent = oilcakes_equivalent
- Oil crops Primary = oilcrops_primary
- Pulses = pulses_total
- Roots and Tubers = roots_and_tubers_total 
- Treenuts = treenuts_total 
- Vegetables and Melons = vegetables_melons_total

In [None]:
keywords = ['_total', 'primary', 'jute_jute', 'oilcakes']
items = crops_data.Item
crops_categorized = crops_data[items.str.contains('|'.join(keywords))]

print("Number of countries in categorized dataset:", crops_categorized.Area.unique().shape[0])
print("\nItem categories in categorized dataset:\n", crops_categorized.Item.unique())
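
Joining the keywords with `|` builds a regular-expression alternation, and `str.contains` keeps any item whose name contains at least one keyword. The sketch below reproduces that matching with the standard `re` module. The category names come from the list above, while `wheat` and `apples` are hypothetical single-crop items included only for contrast.

```python
import re

keywords = ['_total', 'primary', 'jute_jute', 'oilcakes']
pattern = re.compile('|'.join(keywords))

# 'wheat' and 'apples' are hypothetical single-crop items used for contrast
items = ['cereals_total', 'fibre_crops_primary', 'jute_jute_like_fibres',
         'oilcakes_equivalent', 'wheat', 'apples']

# re.search looks for the pattern anywhere in the string, like str.contains
kept = [item for item in items if pattern.search(item)]
print(kept)
```

Because the match is substring-based, any single-crop item whose name happens to contain one of the keywords would also be kept, which is why printing the unique items afterwards is a useful check.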

### Missing information?

Are we now missing any information in our datasets?

In [None]:
print("Missing information in categorized dataset: \n", crops_categorized.isna().sum())

No information is missing as far as we can tell: no column reports any missing cells.
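
Note that `isna()` only counts true missing values (NaN/None). A minimal sketch, on a hypothetical toy frame, of how blank or whitespace-only cells could be counted as well:

```python
import pandas as pd

# Hypothetical frame: one true NaN in 'Value' and one blank string in 'Flag'
toy = pd.DataFrame({
    'Area': ['Sweden', 'Norway', 'Chile'],
    'Value': [10.0, None, 3.5],
    'Flag': ['Fc', '   ', 'A'],
})

# Cells that pandas itself considers missing
nan_counts = toy.isna().sum()

# Cells that are empty or whitespace-only once converted to text
blank_counts = toy.apply(lambda col: col.astype(str).str.strip().eq('').sum())

print(nan_counts.to_dict())
print(blank_counts.to_dict())
```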

### Elements

In [None]:
# Aggregate year column to 'min - max' year
def agg_year(series):
    min_year = int(series.min())
    max_year = int(series.max())
    return '{} to {}'.format(min_year, max_year)

# Examine the different countries
def count_unique_area(series):
    return len(series.unique())

# Group by element code and element to see what these columns represent 
crops_categorized.groupby(['Element Code', 'Element'])\
             .agg({'Value':'sum', 'Unit':'unique', 'Year':agg_year, 'Area':count_unique_area})\
             .sort_values(by='Value', ascending=False)

Summing all values and sorting shows that the '51-Production Quantity' element has information stored for all 253 countries/areas, whereas the other elements are missing information for some of them. In this table, the Area column gives the number of unique countries/areas for each element.

**Observation:** Are the elements apart from Production Quantity really necessary?

- *Seed* is the amount of seeds that were planted, which is not relevant for the scope of this project.
- *Area Harvested* is the amount of land that was used for planting the crops in our dataset, which is not relevant for the scope of this project.
- *Yield* is the harvested production per unit of harvested area, which is not relevant for the scope of this project.

We will therefore remove element categories: Seed, Area Harvested and Yield.

In [None]:
elements = ['Seed', 'Area Harvested', 'Yield']
crops_processed = crops_categorized[np.logical_not(crops_categorized['Element'].isin(elements))]
crops_processed

But what do the element categories with element codes > 140 include? Can we remove these?

### Element codes > 140

In [None]:
elem_codes = ['152', '154', '434', '438', '432', '436']
study_data = crops_processed[crops_processed['Element Code'].isin(elem_codes)]
print("Number of countries in this data: ", study_data.Area.unique().shape[0])
study_data

In the new dataset, which contains only these elements, the rows at both the beginning and the end have the Item cereals_total. Is this the only Item?

In [None]:
study_data.Item.unique()

Apparently so. Does this category also exist for the other elements? If it does, we should be able to remove the elements with codes 152-438 from our dataset.

In [None]:
study_data_2 = crops_processed[crops_processed.Item.str.contains('cereals_total')]
study_data_2 = study_data_2[np.logical_not(study_data_2['Element Code'].isin(elem_codes))]
print("Number of countries in this data: ", study_data_2.Area.unique().shape[0])
study_data_2['Element Code'].unique()

We can now see that the data for the Item 'cereals_total' with element codes > 140 are subgroups of element codes < 140. We can therefore remove this data from our dataset.
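
One way to make this check more explicit is to confirm that every (Area, Year) combination found under the high element codes also appears under the remaining codes. The sketch below does this on a hypothetical toy frame. It only tests coverage, not that the values are actual sub-totals.

```python
import pandas as pd

# Hypothetical rows; element codes are strings, as in elem_codes above
toy = pd.DataFrame({
    'Area': ['Sweden', 'Sweden', 'Norway', 'Norway'],
    'Year': [2000, 2000, 2000, 2000],
    'Element Code': ['51', '152', '51', '434'],
    'Item': ['cereals_total'] * 4,
})
high_codes = ['152', '154', '434', '438', '432', '436']

is_high = toy['Element Code'].isin(high_codes)
high_keys = set(zip(toy.loc[is_high, 'Area'], toy.loc[is_high, 'Year']))
low_keys = set(zip(toy.loc[~is_high, 'Area'], toy.loc[~is_high, 'Year']))

# True when dropping the high-code rows loses no (Area, Year) combination
covered = high_keys <= low_keys
print(covered)
```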

In [None]:
# Drop rows with element codes > 140 by keeping all other rows
keep_rows = ~crops_processed['Element Code'].isin(elem_codes)
crops_cleaned = crops_processed[keep_rows]
crops_cleaned

### Area names with '+'

There are countries/areas that contain a '+' at the end of the name. What names contain this sign and what do they have in common?

In [None]:
# Examine which names end with '+'
country_series = crops_cleaned.Area
names_with_sign = country_series[country_series.str.endswith('+')]
names_with_sign.unique()

All of the names that end with '+' are areas, not individual countries. We can therefore divide the dataset into several groups: one with all countries, one with all continents and one with the remaining areas.
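
The three-way split can be sketched in plain Python as follows. It assumes that area names end in `' +'` (a space followed by the plus sign), which is why two characters are stripped. The example names are illustrative and not taken from the dataset output.

```python
# Assumption: aggregate names end in ' +', so the last two characters are removed
names = ['Sweden', 'Africa +', 'Europe +', 'World +',
         'Low Income Food Deficit Countries +']
continents = {'Africa', 'Northern America', 'South America',
              'Asia', 'Oceania', 'Europe'}

countries, continent_names, other_areas = [], [], []
for name in names:
    if not name.endswith('+'):
        countries.append(name)          # plain country name
        continue
    stripped = name[:-2]                # drop the trailing ' +'
    if stripped in continents:
        continent_names.append(stripped)
    else:
        other_areas.append(stripped)

print(countries, continent_names, other_areas)
```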

# Splitting the dataset 

In [None]:
# Splitting crops_cleaned into country, continent and area based sets

is_aggregate = country_series.str.endswith('+')
crops_country = crops_cleaned[~is_aggregate]
# Copy so the Area column can be modified without a SettingWithCopyWarning
crops_remain = crops_cleaned[is_aggregate].copy()

# Remove last two characters from continent/area name
crops_remain['Area'] = crops_remain['Area'].str[:-2]

continents = ['Africa', 'Northern America', 'South America', 'Asia', 'Oceania', 'Europe']
is_continent = crops_remain.Area.isin(continents)

crops_continent = crops_remain[is_continent]
crops_area = crops_remain[~is_continent]

print('Number of unique countries:', crops_country.Area.unique().shape[0])
print('Number of unique continents:', crops_continent.Area.unique().shape[0])
print('Number of unique areas:', crops_area.Area.unique().shape[0])

In [None]:
# Save dataframes to CSV
#crops_country.to_csv('./data/csv/crops_countries.csv')
#crops_area.to_csv('./data/csv/crops_areas.csv')
#crops_continent.to_csv('./data/csv/crops_continents.csv')

In [None]:
# Save dataframes to pickles
crops_country.to_pickle('./data/pickles/crops_countries.pkl')
crops_area.to_pickle('./data/pickles/crops_areas.pkl')
crops_continent.to_pickle('./data/pickles/crops_continents.pkl')

In [None]:
crops_area

In [None]:
crops_continent

In [None]:
crops_country

#### Categorizing crops further

In [None]:
crops_continent.Item.unique()

We choose to categorize into the following:
- Oilcrops & oilcakes
- Fruits excl melons: citrus fruits and fruits
- Vegetables and melons
- Others: treenuts, jute & jutelike fibres, pulses, fibre crops
- Roots and tubers
- Cereals
- Coarse grains

We choose to keep cereals and coarse grain separate even though they belong to the same family, since each is a large category on its own. The 'others' group collects all of the small crop categories.

In [None]:
oil = ['oilcakes_equivalent', 'oilcrops_primary']
fruits = ['fruit_excl_melons_total', 'citrus_fruit_total']
veg = ['vegetables_melons_total']
roots_tubers = ['roots_and_tubers_total']
cereals = ['cereals_total']
coarse_grain = ['coarse_grain_total']
others = ['treenuts_total', 'jute_jute_like_fibres', 'pulses_total', 'fibre_crops_primary']

oil_crops = crops_continent[crops_continent.Item.isin(oil)]
prod_oil = oil_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_oil['Item'] = 'Oilcrops & oilcakes'

fruit_crops = crops_continent[crops_continent.Item.isin(fruits)]
prod_fruit = fruit_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_fruit['Item'] = 'Fruits excl melons'

veg_crops = crops_continent[crops_continent.Item.isin(veg)]
prod_veg = veg_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_veg['Item'] = 'Vegetables & melons'

roots_tubers_crops = crops_continent[crops_continent.Item.isin(roots_tubers)]
prod_roots_tubers = roots_tubers_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_roots_tubers['Item'] = 'Roots & tubers'

cereals_crops = crops_continent[crops_continent.Item.isin(cereals)]
prod_cereals = cereals_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_cereals['Item'] = 'Cereals'

coarse_grain_crops = crops_continent[crops_continent.Item.isin(coarse_grain)]
prod_coarse_grain = coarse_grain_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_coarse_grain['Item'] = 'Coarse grain'

other_crops = crops_continent[crops_continent.Item.isin(others)]
prod_other = other_crops.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
prod_other['Item'] = 'Other crops'

total_crops = crops_continent.groupby(['Area','Element','Year','Unit']).agg({'Value':'sum'}).reset_index()
total_crops['Item'] = 'Crops, total'

crops_categorized = pd.concat([prod_oil, prod_fruit, prod_veg, prod_roots_tubers, prod_cereals, prod_coarse_grain, prod_other, total_crops], axis=0)
crops_categorized
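
The filter, group and label steps above repeat the same pattern for each category. A more compact sketch, assuming a lookup from item name to category label, is shown below on a hypothetical toy frame. The names `groups`, `item_to_group` and `toy`, the unit and the values are illustrative, and only two of the categories are included.

```python
import pandas as pd

# Category label -> item names (subset of the lists above)
groups = {
    'Fruits excl melons': ['fruit_excl_melons_total', 'citrus_fruit_total'],
    'Cereals': ['cereals_total'],
}
# Invert to item name -> category label for use with Series.map
item_to_group = {item: label for label, members in groups.items() for item in members}

# Hypothetical values for one continent and year
toy = pd.DataFrame({
    'Area': ['Europe'] * 3,
    'Element': ['Production Quantity'] * 3,
    'Year': [2000] * 3,
    'Unit': ['tonnes'] * 3,
    'Item': ['fruit_excl_melons_total', 'citrus_fruit_total', 'cereals_total'],
    'Value': [5.0, 2.0, 9.0],
})

# Relabel each row with its category, then sum within each category
grouped = (toy.assign(Item=toy['Item'].map(item_to_group))
              .groupby(['Area', 'Element', 'Year', 'Unit', 'Item'], as_index=False)['Value']
              .sum())
print(grouped[['Item', 'Value']])
```

Items that are missing from the lookup map to NaN and are dropped by `groupby`, so the lookup should cover every item that is meant to be kept.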

In [None]:
# Save dataframe to pickles
crops_categorized.to_pickle('./data/pickles/crops_categorized.pkl')