## Question 1

### Could we find and support a correlation between greenhouse gas emissions and agricultural growth? Could we also assess the influence of the type of crop on the emissions?

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import statsmodels
import folium
import math

# Custom imports
from ipywidgets import IntProgress
from IPython.display import display
import time
from multiprocessing import Pool, Lock
import os
import json
import seaborn as sns

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing   import StandardScaler

## Importing the data

To find a model that helps us understand how the variety of crops affects the ecological impact of agriculture, we need three datasets:

- Land usage
- Crops (area harvested per crop)
- Emissions related to agriculture


In [None]:
dataLands = pd.read_csv("./data/fao_data_land_data.csv")

dataCrops = pd.read_csv("./data/fao_data_crops_data.csv")

dataEmissions = pd.read_csv("./data/current_FAO/raw_files/Environment_Emissions_by_Sector_E_All_Data_(Normalized).csv", encoding="cp1252")

## Cleaning

- Removing rows with missing (NaN) values
- Removing unused columns
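
As a small illustration of these two steps, the sketch below uses a made-up three-row frame (`toy` and its values are hypothetical, not FAO data). It shows one thing to watch for: chaining `where(...)` with a bare `dropna()` also discards matching rows that have a NaN in an unrelated column such as `Flag`, whereas dropping the unused column first keeps them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the emissions table (invented values).
toy = pd.DataFrame({
    "Item": ["Agriculture total", "Agriculture total", "Energy"],
    "Value": [10.0, 12.0, 30.0],
    "Flag": ["A", np.nan, "A"],
})

is_agriculture = toy["Item"] == "Agriculture total"

# where() blanks non-matching rows; dropna() then removes every row
# containing any NaN, including the matching row whose Flag is missing.
via_where = toy.where(is_agriculture).dropna()

# Boolean mask, drop the unused column first, then drop remaining NaN rows.
via_mask = toy.loc[is_agriculture].drop(columns=["Flag"]).dropna()

print(len(via_where), len(via_mask))
```

Which behaviour is preferable depends on whether rows with a missing flag should count as valid observations.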

In [None]:
dataLands = dataLands.dropna(subset=["element"])

dataCrops = dataCrops.dropna(subset=["element"])

dataEmissionsAgriculture = dataEmissions.where(dataEmissions["Item"] == "Agriculture total").where(dataEmissions["Element"] == "Emissions (CO2eq)").dropna()
dataEmissionsAgriculture = dataEmissionsAgriculture.drop(["Item", "Element Code", "Element", "Item Code", "Year Code", "Flag"], axis=1)\
                                                    .rename(columns={"Unit":"Unit emissions","Value":"Value emissions"})

Here is a graph that shows the progression of land usage within each continent.

In [None]:
def cond_countries(dataLands):
    # Boolean mask selecting the five continent-level aggregates
    countries = ["Asia +","Europe +", "Americas +", "Oceania +", "Africa +"]
    return dataLands["country_or_area"].isin(countries)

dataLandsContinent = dataLands.where(dataLands["category"] == "agricultural_area")\
                                .where(cond_countries(dataLands))\
                                .dropna(subset=["country_or_area"])\
                                .sort_values("value",ascending=False)

sns.set(style="darkgrid")

fg = plt.figure(figsize=(20,20))
axes = fg.add_subplot()
# Plot the agricultural area over time, one line per continent
sns.lineplot(x="year", y="value", style="country_or_area", data=dataLandsContinent, hue="country_or_area", ax=axes)

fg.show()

Here are the agricultural emissions of the BRICS countries over the years (the USSR is included alongside the Russian Federation). We can see a clear upward trend, except for South Africa.

In [None]:
sns.set(style="darkgrid")

dataEmissionsAgrContinent = dataEmissionsAgriculture.where(dataEmissionsAgriculture["Area"].isin(["South Africa","Brazil", "China", "India", "USSR","Russian Federation"])).dropna()

fg = plt.figure(figsize=(20,20))
axes = fg.add_subplot()
# Plot the agricultural emissions over time, one line per country
sns.lineplot(x="Year", y="Value emissions", style="Area", data=dataEmissionsAgrContinent, hue="Area", ax=axes)

fg.show()

## Creating the dataset for processing

The dataset we create for processing aggregates the area harvested for each crop, by year and by country, together with the greenhouse gas emissions, also by year and by country. The goal is to find a model that explains and helps predict the emissions from the other features (country, year and crops).
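
To make the reshaping concrete, here is a minimal sketch of the same pivot-then-merge pattern on invented data. The frames `toy_crops` and `toy_emissions` and all their values are hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical long-format crop table: one row per (country, year, crop).
toy_crops = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "Year": [2000, 2000, 2000, 2001],
    "category": ["wheat", "rice", "wheat", "wheat"],
    "Value area": [10.0, 5.0, 7.0, 8.0],
})

# Hypothetical emissions table: one row per (country, year).
toy_emissions = pd.DataFrame({
    "Area": ["A", "B"],
    "Year": [2000, 2000],
    "Value emissions": [100.0, 60.0],
})

# One column per crop; a crop not grown in a given country/year becomes 0.
wide = (toy_crops
        .pivot_table(values="Value area", index=["Area", "Year"], columns="category")
        .reset_index()
        .fillna(0))

# Attach emissions and keep only rows where an emission value exists.
toy_dataset = (wide.merge(toy_emissions, how="left", on=["Area", "Year"])
                   .dropna(subset=["Value emissions"]))
print(toy_dataset)
```

The row for country `B` in 2001 disappears because it has no matching emission value, which is the same effect the final `dropna` has on the real data.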

In [None]:
cropsAndEmissions = dataCrops.drop(["element_code"], axis=1)\
                                .where(dataCrops["element"] == "Area Harvested")\
                                .dropna(subset=["element"])\
                                .drop(["element", "value_footnotes"],axis=1)\
                                .rename(columns={"unit":"Unit area", "value":"Value area", "year":"Year","country_or_area":"Area"})
#                                .where(dataCrops["country_or_area"] == "World +")\
cropsAndEmissions = cropsAndEmissions.pivot_table(values='Value area',index=["Area","Year"],columns="category").reset_index()
cropsAndEmissions = cropsAndEmissions.fillna(0)
cropsAndEmissions = pd.merge(cropsAndEmissions, dataEmissionsAgriculture, how="left", on=['Area',"Year"])#.dropna(subset=["Value area", "Value emissions"])
cropsAndEmissions = cropsAndEmissions.dropna()  # drops every row with a missing value, including rows without emissions
cropsAndEmissions

## Learning to predict emissions

Here, we try to create a model by fitting a ridge regression (a linear regression with an L2 penalty) to the data. We first tune the regularization parameter alpha over a predefined interval using cross-validation, and then use the best value to train the model.
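
To give some intuition for what alpha does, the sketch below solves the ridge problem in closed form with NumPy on synthetic data. It is an illustration under simplifying assumptions (no intercept, no scaling, invented data), not the scikit-learn pipeline used in this notebook; `ridge_closed_form`, `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_w ||y - Xw||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known coefficients (invented data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# A larger alpha shrinks the coefficient vector towards zero.
alphas = (0.0, 1.0, 10.0, 100.0)
coef_norms = [np.linalg.norm(ridge_closed_form(X_toy, y_toy, a)) for a in alphas]
for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:>6}: ||w|| = {norm:.3f}")
```

Since alpha is a penalty weight, only non-negative values are meaningful, which is worth keeping in mind when choosing the search interval.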

In [None]:
SEED = 1
st_pipeline = Pipeline([('scl', StandardScaler()), ('ridge', Ridge(copy_X=True, random_state=SEED))])
st_pipeline

#features = ["Area","category"]

XRidge2 = pd.get_dummies(cropsAndEmissions.drop(["Area Code","Unit emissions", "Value emissions"], axis=1))
yRidge2 = cropsAndEmissions["Value emissions"]

results = []

# Tune for alpha using 10 fold crossvalidation when calculating the mean squared error.
# Ridge requires a non-negative alpha, so the search grid starts just above 0.
for alpha in np.linspace(0.1, 100, 32):
    st_pipeline.set_params(ridge__alpha= alpha) 
    neg_MSE = cross_val_score(st_pipeline, XRidge2, yRidge2, scoring='neg_mean_squared_error', cv=10)  # we use 10 folds crossvalidation since 10 
                                                                                                      # is pretty much standard in the industry
    results.append([neg_MSE, alpha])
    
# Take the mean MSE for each level of alpha
for i in range(len(results)):
    results[i][0] = -np.mean(results[i][0])
    
# Plot the results
plt.figure(figsize=(20,20))
plt.plot([row[1] for row in results], [row[0] for row in results])
plt.xlabel('alpha')
plt.ylabel('MSE')
plt.title('Tuning alpha');
plt.grid()

best_st_alpha = min(results)[1]
print('Best alpha is', best_st_alpha, 'with a MSE of', min(results)[0],'.')

In [None]:
st_Model = st_pipeline.set_params(ridge__alpha= best_st_alpha)  # use best alpha calculated above
st_Model.fit(XRidge2, yRidge2)                                  # fit the new model

In [None]:
#st_Model.named_steps["ridge"].coef_
pd.DataFrame([XRidge2.columns,st_Model.named_steps["ridge"].coef_]).transpose().sort_values(1)

Further analysis will be conducted in order to produce reliable predictions. This first analysis has already given us a clearer picture of the data and helps us assess the feasibility of this research.

## Plan

The next steps will be:

- Visualize the output of the model and its relation to reality.
- Try new models and assess their performances.
- Make predictions on the future using these regressions.
- Extract guidelines that could help in reducing emissions in the future, based on these data.
- Critically assess these guidelines and the predictions using more domain-related knowledge.

## Question 5
This part of the notebook tries to answer the last question:

### What is the ratio of country support/cost vs crop production in developed countries? Could we find other metrics to explain this support? Should we stop supporting farmers in Switzerland?

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#Load data for all countries
GovExpendituresAll = pd.read_csv('Data/Investment_GovernmentExpenditure_E_All_Data_(Normalized).csv',
                                 sep=',',engine='python')
CropsProdAll = pd.read_csv('Data/Production_Crops_E_All_Data_(Normalized).csv',
                                 sep=',',engine='python')
CapitalStockAll = pd.read_csv('Data/Investment_CapitalStock_E_All_Data_(Normalized).csv',
                                 sep=',',engine='python')

### Data pre-processing

In [None]:
GovAgriOriIndexAll = GovExpendituresAll.query('Element == "Agriculture orientation index"')
GovAgriInvestmentAll = GovExpendituresAll.query('Element == "Value US$"')

In [None]:
GovAgriOriIndexIndia = GovAgriOriIndexAll.query('Area == "India"')
GovAgriOriIndexItaly = GovAgriOriIndexAll.query('Area == "Italy"')
GovAgriOriIndexChina = GovAgriOriIndexAll.query('Area == "China, mainland"')
GovAgriOriIndexSwitzerland = GovAgriOriIndexAll.query('Area == "Switzerland"')
GovAgriOriIndexFrance = GovAgriOriIndexAll.query('Area == "France"')
GovAgriOriIndexUK = GovAgriOriIndexAll.query('Area == "United Kingdom"')

In [None]:
WheatProdIndia2001 = CropsProdAll.query('Area == "India" and Item == "Wheat" and Element == "Production" and Year > 2000')
WheatProdChina2001 = CropsProdAll.query('Area == "China, mainland" and Item == "Wheat" and Element == "Production" and Year > 2000')
WheatProdSwitzerland2001 = CropsProdAll.query('Area == "Switzerland" and Item == "Wheat" and Element == "Production" and Year > 2000')
RiceProdIndia2001 = CropsProdAll.query('Area == "India" and Item == "Rice, paddy" and Element == "Production" and Year > 2000')
RiceProdChina2001 = CropsProdAll.query('Area == "China, mainland" and Item == "Rice, paddy" and Element == "Production" and Year > 2000')
CerealsProdSwiss2001 = CropsProdAll.query('Area == "Switzerland" and Item == "Cereals,Total" and Element == "Production" and Year > 2000')

In [None]:
GovAgriInvestmentSwitzerlandCentral = GovAgriInvestmentAll.query('Area == "Switzerland" and Item == "Agriculture (Central Government)"')
GovAgriInvestmentSwitzerlandGeneral = GovAgriInvestmentAll.query('Area == "Switzerland" and Item == "Agriculture (General Government)"')
CapitalStockSwitzerland = CapitalStockAll.query('Area == "Switzerland" and Element == "Value US$" and Item == "Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)" and Year > 2000')

### Data Exploration

In [None]:
fig, ax1 = plt.subplots(figsize=(20,10))
ax1.set_xlabel('Year')
ax1.set_ylabel('Government agriculture orientation index')
ax1.plot(GovAgriOriIndexIndia.Year,GovAgriOriIndexIndia.Value,
         color='red',marker='o',linestyle='dashed',linewidth=2,markersize=8,label='India')
ax1.plot(GovAgriOriIndexChina.Year,GovAgriOriIndexChina.Value,
         color='blue',marker='o',linestyle='dashed',linewidth=2,markersize=8,label='China')
ax1.plot(GovAgriOriIndexItaly.Year,GovAgriOriIndexItaly.Value,
         color='yellow',marker='o',linestyle='dashed',linewidth=2,markersize=8,label='Italy')
ax1.plot(GovAgriOriIndexSwitzerland.Year,GovAgriOriIndexSwitzerland.Value,
         color='green',marker='o',linestyle='dashed',linewidth=2,markersize=8,label='Switzerland')
ax1.plot(GovAgriOriIndexFrance.Year,GovAgriOriIndexFrance.Value,
         color='purple',marker='o',linestyle='dashed',linewidth=2,markersize=8,label='France')
ax1.plot(GovAgriOriIndexUK.Year,GovAgriOriIndexUK.Value,
         color='orange',marker='o',linestyle='dashed',linewidth=2,markersize=8,label='United Kingdom')
fig.tight_layout()
plt.title('Agriculture orientation index')
plt.legend()
plt.show()

### Comment
This first graph shows time series of the agriculture orientation index (AOI) for several countries. The AOI is defined as the ratio of the agriculture share of government expenditures to the agriculture share of gross domestic product (GDP). We can see that the AOI is much higher for Switzerland than for the other countries shown. This suggests that the Swiss government spends much more on agriculture relative to the sector's contribution to GDP.
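
The definition above can be written out as a short function. This is a minimal sketch of the ratio as described in the text; the function name, argument names and input figures are hypothetical and not taken from the FAO files.

```python
def agriculture_orientation_index(agri_expenditure, total_expenditure,
                                  agri_value_added, gdp):
    """AOI = (agriculture share of government spending) / (agriculture share of GDP)."""
    expenditure_share = agri_expenditure / total_expenditure
    gdp_share = agri_value_added / gdp
    return expenditure_share / gdp_share

# Invented figures: 4% of spending goes to agriculture, which makes up 2% of GDP.
example_aoi = agriculture_orientation_index(
    agri_expenditure=4.0, total_expenditure=100.0,
    agri_value_added=2.0, gdp=100.0,
)
print(example_aoi)
```

An AOI above 1 means agriculture receives a larger share of government spending than its share of GDP; a value below 1 means the opposite.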

In [None]:
fig, ax1 = plt.subplots(figsize=(20,10))
color = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Wheat Production [tonnes]',color=color)
ax1.plot(WheatProdIndia2001.Year,WheatProdIndia2001.Value,
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax1.tick_params(axis='y',labelcolor=color)
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:green'
ax2.set_ylabel('Agriculture Index',color=color)
ax2.plot(GovAgriOriIndexIndia.Year[0:14],GovAgriOriIndexIndia.Value[0:14],
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax2.tick_params(axis='y',labelcolor=color)
ax3 = ax1.twinx()
ax3.set_ylabel('Rice Production [tonnes]',color='blue',labelpad=20)
ax3.plot(RiceProdIndia2001.Year,RiceProdIndia2001.Value,
         color='blue',marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax3.tick_params(axis='y',labelcolor='blue')
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.title('Production and agriculture orientation index time series in India')
plt.show()

In [None]:
fig, ax1 = plt.subplots(figsize=(20,10))
color = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Wheat Production [tonnes]',color=color)
ax1.plot(WheatProdChina2001.Year,WheatProdChina2001.Value,
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax1.tick_params(axis='y',labelcolor=color)
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:green'
ax2.set_ylabel('Agriculture Index',color=color)
ax2.plot(GovAgriOriIndexChina.Year[0:8],GovAgriOriIndexChina.Value[0:8],
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax2.tick_params(axis='y',labelcolor=color)
ax3 = ax1.twinx()
ax3.set_ylabel('Rice Production [tonnes]',color='blue',labelpad=20)
ax3.plot(RiceProdChina2001.Year,RiceProdChina2001.Value,
         color='blue',marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax3.tick_params(axis='y',labelcolor='blue')
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.title('Production and agriculture orientation index time series in China')
plt.show()

In [None]:
fig, ax1 = plt.subplots(figsize=(20,10))
color = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Wheat Production [tonnes]',color=color)
ax1.plot(WheatProdSwitzerland2001.Year,WheatProdSwitzerland2001.Value,
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax1.tick_params(axis='y',labelcolor=color)
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:green'
ax2.set_ylabel('Agriculture Index',color=color)
ax2.plot(GovAgriOriIndexSwitzerland.Year[0:14],GovAgriOriIndexSwitzerland.Value[0:14],
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax2.tick_params(axis='y',labelcolor=color)
ax3 = ax1.twinx()
ax3.set_ylabel('Cereals Production [tonnes]',color='blue',labelpad=20)
ax3.plot(CerealsProdSwiss2001.Year,CerealsProdSwiss2001.Value,
         color='blue',marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax3.tick_params(axis='y',labelcolor='blue')
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.title('Production (cereals and wheat) and agriculture orientation index time series in Switzerland')
plt.show()

### Comment 
We tried to see whether a correlation between government investment and the production of the main crops is directly visible. To do so, we looked at the production of the main crops in India, China and Switzerland and compared it to the agriculture orientation index. As the graphs above show, there is no clear correlation.
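
A visual comparison can be complemented by a number. The sketch below aligns two yearly series on their common years and computes the Pearson correlation. It assumes frames with `Year` and `Value` columns, as in the FAO tables used here; the helper `yearly_correlation` and the toy frames are hypothetical.

```python
import pandas as pd

def yearly_correlation(left, right, value_col="Value"):
    """Pearson correlation of two yearly series, restricted to shared years."""
    merged = left[["Year", value_col]].merge(
        right[["Year", value_col]], on="Year", suffixes=("_left", "_right")
    )
    return merged[f"{value_col}_left"].corr(merged[f"{value_col}_right"])

# Invented series that overlap on 2002-2004 only.
toy_index = pd.DataFrame({"Year": [2001, 2002, 2003, 2004],
                          "Value": [1.0, 2.0, 3.0, 4.0]})
toy_production = pd.DataFrame({"Year": [2002, 2003, 2004, 2005],
                               "Value": [20.0, 30.0, 40.0, 10.0]})

r = yearly_correlation(toy_index, toy_production)
print(round(r, 3))
```

With only a dozen or so yearly points, and series that both trend over time, such a coefficient should be read with caution rather than as evidence of a causal link.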

In [None]:
fig, ax1 = plt.subplots(figsize=(20,10))
color = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Central government investments in agriculture [millions US$]',color=color)
ax1.plot(GovAgriInvestmentSwitzerlandCentral.Year,GovAgriInvestmentSwitzerlandCentral.Value,
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax1.tick_params(axis='y',labelcolor=color)
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:green'
ax2.set_ylabel('Gross fixed capital formation',color=color)
ax2.plot(CapitalStockSwitzerland.Year,CapitalStockSwitzerland.Value,
         color=color,marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax2.tick_params(axis='y',labelcolor=color)
ax3 = ax1.twinx()
ax3.set_ylabel('General government investments in agriculture [millions US$]',color='blue',labelpad=20)
ax3.plot(GovAgriInvestmentSwitzerlandGeneral.Year,GovAgriInvestmentSwitzerlandGeneral.Value,
         color='blue',marker='o',linestyle='dashed',linewidth=2,markersize=8)
ax3.tick_params(axis='y',labelcolor='blue')
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.title('Government investment and agriculture capital value in Switzerland')
plt.show()

### Comment
This last graph shows time series of the Swiss central and general government expenditures on agriculture, as well as the gross fixed capital formation of the sector. As can be seen, government expenditures and agricultural capital formation follow each other closely, except in 2009, where the series diverge. This graph suggests that government support is helpful in Switzerland.

### Conclusion
Probably due to the high cost of living and of salaries in Switzerland, the ratio of government investment to agricultural output (as a share of GDP) is remarkably higher there. Indeed, the Swiss government invests 6 to 8 times more for the same relative result than the other countries considered. Nevertheless, the last plot shows investment and agricultural capital formation moving together, which indicates consistent and valuable support from the country.

### Improvements
We could compare government and private sector investments in agriculture and see how each relates to crop production, in order to determine which source of investment is the most valuable.