# **Introduction**
The FAOSTAT Temperature Change domain disseminates statistics of mean surface temperature change by country, with annual updates. The current dissemination covers the period 1961–2019. Statistics are available for monthly, seasonal and annual mean temperature anomalies, i.e., temperature change with respect to a baseline climatology, corresponding to the period 1951–1980. The standard deviation of the temperature change of the baseline methodology is also available. Data are based on the publicly available GISTEMP data, the Global Surface Temperature Change data distributed by the National Aeronautics and Space Administration Goddard Institute for Space Studies (NASA-GISS).

In [None]:
# libraries used
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn import metrics

%matplotlib inline

Below is the data present in the file.

In [None]:
Etemp_path='../input/temperature-change/Environment_Temperature_change_E_All_Data_NOFLAG.csv'
Code_path='../input/temperature-change/FAOSTAT_data_11-24-2020.csv'

data = pd.read_csv(Etemp_path, encoding='latin-1')
data2 = pd.read_csv(Code_path)

data

In [None]:
plt.figure(figsize=(20,8))
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()

Because world politics is in constant flux, many countries have come into existence or have gone the way of the dodo. This means our dataset is incomplete, as shown by the plot above, where the bright lines mark null/NaN values. A simple remedy is to remove the rows that contain NaN values.

In [None]:
data=data.dropna()
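
Before dropping rows it can help to know how much data is lost. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the FAOSTAT file) to count rows before and after `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real wide-format dataset
toy = pd.DataFrame({
    'Area': ['A', 'B', 'C'],
    'Y1961': [0.1, np.nan, 0.3],
    'Y1962': [0.2, 0.4, np.nan],
})

rows_before = len(toy)
toy_clean = toy.dropna()          # drop every row with at least one NaN
rows_after = len(toy_clean)

print(f'dropped {rows_before - rows_after} of {rows_before} rows')
```

Because the data is in wide format, a single missing year removes the whole row, that is, the full series for that country and month.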

Next let us look at the second file.

In [None]:
data2

The second dataset just contains codes used for different countries and groups. This set is not important for our study.

As we can see the main dataset contains the temperature change and standard deviation for countries around the world from 1961 to 2019.

For the purposes of this investigation we:
* delete the columns that contain the various codes, as they won't be necessary
* rename the Area column to country
* keep the temperature change and standard deviation
* keep only the 12 months and get rid of the three month grouping

In [None]:
data=data.rename(columns={'Area':'Country'})
#data=data[data['Element']=='Temperature change']
data=data.drop(columns=['Area Code','Months Code','Element Code','Unit'])
TempC=data.loc[data.Months.isin(['January', 'February', 'March', 'April', 'May', 'June', 'July','August', 'September', 'October', 'November', 'December'])]

After getting rid of the extra columns that are not needed the dataset now looks like the table below.

In [None]:
TempC.head()

Let's look at how many countries we have in the dataset.

In [None]:
TempC.Country.unique()

This is obviously a lot, and the dataset also contains some groupings of countries such as the EU, Least Developed Countries, etc. Let's use just one country for now to see what data can be extracted and understood.

# **Afghanistan as a case study**
Let us start the study by looking at what data is present for each country and what information can be extracted from it. The table below shows the data present for just Afghanistan.

In [None]:
Afg=TempC.loc[TempC.Country=='Afghanistan']
Afg

Let's make a simple plot to see how the temperature varies over the year.

In [None]:
plt.figure(figsize=(15,10))
# Plot one line per selected year, using only the temperature-change rows
temp_mask=Afg.Element=='Temperature change'
for year in ['Y1961','Y1971','Y1981','Y1991','Y2001']:
    sns.lineplot(x=Afg.Months.loc[temp_mask],y=Afg[year].loc[temp_mask], label=year)
plt.xlabel('Months')
plt.ylabel('Temperature change (C)')
plt.title('Temperature Change in Afghanistan')
plt.show()

We reshape the dataset so that instead of a column for each year we now have all the years in one column. Although this gives the dataset more rows, it makes manipulation a bit simpler (for me anyway).

In [None]:
Afg=Afg.melt(id_vars=['Country','Months','Element'],var_name='Year', value_name='TempC')
Afg['Year'] = Afg['Year'].str[1:].astype('str')
Afg.info()
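
To make the reshape concrete, here is a minimal sketch on a made-up two-row frame (`wide` and its values are hypothetical). It mirrors the `melt` call above and shows how each year column turns into rows.

```python
import pandas as pd

# Hypothetical wide-format frame: one column per year, as in the source file
wide = pd.DataFrame({
    'Country': ['X', 'X'],
    'Months': ['January', 'February'],
    'Element': ['Temperature change', 'Temperature change'],
    'Y1961': [0.5, -0.2],
    'Y1962': [0.1, 0.3],
})

long = wide.melt(id_vars=['Country', 'Months', 'Element'],
                 var_name='Year', value_name='TempC')
long['Year'] = long['Year'].str[1:]      # 'Y1961' -> '1961' (still a string)

print(long)
```

Two rows with two year columns become four rows. Note that `Year` stays a string after stripping the leading `Y`, which is why matplotlib treats it as a categorical axis in the plots.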

Let's replot the temperature change over the year and the standard deviation provided for each month.

In [None]:
plt.figure(figsize=(15,15))
plt.subplot(211)
AfgT=Afg.loc[Afg.Element=='Temperature change']
for i in AfgT.Year.unique():
    plt.plot(AfgT.Months.loc[AfgT.Year==i],AfgT.TempC.loc[AfgT.Year==i],linewidth=0.5)
# sort=False keeps the months in calendar order so the average lines up with the x axis
AfgAvg=AfgT.groupby('Months',sort=False)['TempC'].mean()
plt.plot(AfgAvg.index,AfgAvg.values,'r',linewidth=2.0,label='Average')
plt.xlabel('Months')
plt.xticks(rotation=45)
plt.ylabel('Temperature change')
plt.title('Temperature Change in Afghanistan')
plt.legend()

plt.subplot(212)
# The standard deviation is read from the 1961 column
plt.plot(Afg.Months.loc[Afg.Year=='1961'].loc[Afg.Element=='Standard Deviation'],Afg.TempC.loc[Afg.Year=='1961'].loc[Afg.Element=='Standard Deviation'])
plt.xlabel('Months')
plt.xticks(rotation=45)
plt.ylabel('Standard Deviation')
plt.title('Standard Deviation of Temperature Change in Afghanistan')

plt.subplots_adjust(hspace=0.3)
plt.show()
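
One thing to watch when averaging per month: `groupby` sorts its keys alphabetically by default, which does not match the calendar order returned by `Months.unique()`. The sketch below uses made-up values (`toy` is hypothetical) to show the difference and how `sort=False` keeps the order of first appearance.

```python
import pandas as pd

# Hypothetical long-format data: two years, three months
toy = pd.DataFrame({
    'Months': ['January', 'February', 'March'] * 2,
    'TempC': [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
})

sorted_mean = toy.groupby('Months')['TempC'].mean()                # alphabetical keys
ordered_mean = toy.groupby('Months', sort=False)['TempC'].mean()  # first-appearance order

print(sorted_mean.index.tolist())
print(ordered_mean.index.tolist())
```

Plotting the alphabetically sorted means against calendar-ordered labels would silently pair each average with the wrong month.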

It looks like July has the smallest deviation, while the winter months of December, January and February have a large spread, with February having the highest. The plot uses the values stored in the 1961 column. The introduction describes this quantity as the standard deviation of the temperature change under the baseline methodology, so each country has its own standard deviation for each month.

Next let's look at how the data is spread over the different years and how the mean temperature changes.

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(Afg['Year'].loc[Afg.Element=='Temperature change'],Afg['TempC'].loc[Afg.Element=='Temperature change'])
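# Select TempC explicitly: recent pandas versions raise an error when averaging text columns
plt.plot(Afg.loc[Afg.Element=='Temperature change'].groupby('Year')['TempC'].mean(),'r',label='Average')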
plt.axhline(y=0.0, color='k', linestyle='-')
plt.xlabel('Year')
plt.xticks(np.linspace(0,58,20),rotation=45)
plt.ylabel('Temperature change')
plt.legend()
plt.title('Temperature Change in Afghanistan')
plt.show()

We can also look at the histogram of the temperature changes.

In [None]:
plt.figure(figsize=(15,10))
sns.histplot(Afg.TempC.loc[Afg.Element=='Temperature change'],kde=True,stat='density')
plt.axvline(x=0.0, color='b', linestyle='-')
plt.xlabel('Temperature change')
plt.title('Temperature Change in Afghanistan')
plt.show()

Clearly the majority of Afghanistan's monthly temperature changes lie above the baseline corresponding to the period 1951–1980. The yearly average temperature change for Afghanistan is also rising as the years progress.
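
Both observations can be put into numbers. The sketch below is an illustration on synthetic anomalies (the drift of 0.02 per year and the noise level are assumptions, not FAOSTAT values): it computes the share of values above the baseline and fits a straight line to the yearly means with `np.polyfit`.

```python
import numpy as np
import pandas as pd

# Synthetic anomalies (not the FAOSTAT values): a small upward drift plus noise
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1961, 2020), 12)          # 12 monthly values per year
anoms = pd.DataFrame({
    'Year': years,
    'TempC': 0.02 * (years - 1961) + rng.normal(0, 0.5, size=years.size),
})

share_positive = (anoms['TempC'] > 0).mean()           # fraction above the baseline
yearly = anoms.groupby('Year')['TempC'].mean()
slope, intercept = np.polyfit(yearly.index.to_numpy(), yearly.to_numpy(), deg=1)

print(f'share above baseline: {share_positive:.2f}')
print(f'trend: {slope * 10:.2f} degrees per decade')
```

Applied to the real `Afg` table, the same two lines would need `Year` converted to a number first, since it is stored as a string there.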

# **World Temperature**
Obviously the above study can be repeated for every country and area in the dataset, but that would just mean repeating the same graphs more than 200 times. We need to summarise the data and present it in a more digestible way. We start by reshaping the full dataset in the same way as we did for Afghanistan.

In [None]:
TempC=TempC.melt(id_vars=['Country','Months','Element'],var_name='Year', value_name='TempC')
TempC['Year'] = TempC['Year'].str[1:].astype('str')
TempC

To make sure country groupings such as the EU or Africa don't skew our calculations, we remove them from the world data so that we are left with individual countries only. We keep the regional data in a separate dataset in case we want to use it later.

In [None]:
# Names of the aggregate groupings (regions, economic groups) present in the data
region_names=['World', 'Africa',
       'Eastern Africa', 'Middle Africa', 'Northern Africa',
       'Southern Africa', 'Western Africa', 'Americas',
       'Northern America', 'Central America', 'Caribbean',
       'South America', 'Asia', 'Central Asia', 'Eastern Asia',
       'Southern Asia', 'South-Eastern Asia', 'Western Asia', 'Europe',
       'Eastern Europe', 'Northern Europe', 'Southern Europe',
       'Western Europe', 'Oceania', 'Australia and New Zealand',
       'Melanesia', 'Micronesia', 'Polynesia', 'European Union',
       'Least Developed Countries', 'Land Locked Developing Countries',
       'Small Island Developing States',
       'Low Income Food Deficit Countries',
       'Net Food Importing Developing Countries', 'Annex I countries',
       'Non-Annex I countries', 'OECD']

# Keep the groupings in their own table, then remove them from the country data
regions=TempC[TempC.Country.isin(region_names)]

TempC=TempC[~TempC.Country.isin(region_names)]
TempC
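
A quick sanity check for a split like this is that the two parts together account for every original row. The sketch below shows the idea on a made-up table (`toy` and the short `aggregates` list are hypothetical stand-ins).

```python
import pandas as pd

# Hypothetical miniature of the long-format table
toy = pd.DataFrame({
    'Country': ['World', 'Europe', 'Kenya', 'Peru'],
    'TempC': [0.9, 1.2, 0.7, 0.5],
})
aggregates = ['World', 'Europe']           # illustrative subset of the grouping names

is_region = toy['Country'].isin(aggregates)
toy_regions = toy[is_region]
toy_countries = toy[~is_region]

# The two parts should partition the original rows
assert len(toy_regions) + len(toy_countries) == len(toy)
print(toy_countries['Country'].tolist())
```

Building the boolean mask once and negating it guarantees that both tables are derived from the same list of names.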

Now we can look at the distribution of the data. Let's first look at a histogram of the temperature change.

In [None]:
plt.figure(figsize=(15,10))
sns.histplot(TempC.TempC.loc[TempC.Element=='Temperature change'],kde=True,stat='density')
plt.axvline(x=0.0, color='b', linestyle='-')
plt.xlabel('Temperature change')
plt.title('Temperature Change distribution of the World')
plt.xlim(-5,5)
plt.show()

Let us calculate some averages that we can easily use for our plots.

In [None]:
# Select TempC explicitly: recent pandas versions raise an error when averaging text columns
# Average for the whole world
AvgT=TempC.loc[TempC.Element=='Temperature change'].groupby(['Year'],as_index=False)['TempC'].mean()

# Average for every country
AvgTC=TempC.loc[TempC.Element=='Temperature change'].groupby(['Country','Year'],as_index=False)['TempC'].mean()
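
It is worth being explicit about what kind of "world average" this is: an unweighted mean over all country-month rows, with no weighting by area or population. The sketch below uses made-up numbers (`toy` is hypothetical) to show that pooling all rows and averaging the per-country means can differ when countries contribute different numbers of rows.

```python
import pandas as pd

# Hypothetical data: country A has three monthly values, country B only one
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B'],
    'Year': ['1961'] * 4,
    'TempC': [1.0, 1.0, 1.0, 3.0],
})

pooled = toy.groupby('Year')['TempC'].mean()                     # every row counts equally
per_country = toy.groupby(['Country', 'Year'])['TempC'].mean()   # one value per country
balanced = per_country.groupby(level='Year').mean()              # every country counts equally

print(pooled['1961'], balanced['1961'])
```

In the notebook every remaining country should have twelve monthly rows per year, so the two versions are expected to agree closely there; the distinction matters mainly if rows are missing.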

We can also do a scatter plot, like before, for different years for all the countries and plot the world average.

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(TempC['Year'].loc[TempC.Element=='Temperature change'],TempC['TempC'].loc[TempC.Element=='Temperature change'])
plt.plot(AvgT.Year,AvgT.TempC,'r',label='Average')
plt.axhline(y=0.0, color='k', linestyle='-')
plt.xlabel('Year')
plt.xticks(np.linspace(0,58,20),rotation=45)
plt.ylabel('Temperature change')
plt.legend()
plt.title('Temperature Change of the World')
plt.show()

Finally we can plot the temperatures for each country and plot the world average on top.

In [None]:
plt.figure(figsize=(15,10))
for i in AvgTC.Country.unique():
    plt.plot(AvgTC.Year.loc[AvgTC.Country==str(i)],AvgTC.TempC.loc[AvgTC.Country==str(i)],linewidth=0.5)

plt.plot(AvgT.Year,AvgT.TempC,'r',linewidth=2.0)
plt.axhline(y=0.0, color='k', linestyle='-')
plt.xlabel('Year')
plt.xticks(np.linspace(0,58,20),rotation=45)
plt.ylabel('Average Temperature change')
plt.title('Average Temperature Change of the World')
plt.show()

The plot clearly shows that the temperature of the world is rising compared to the baseline corresponding to the period 1951–1980. Clearly a lot of work needs to be done to bring the temperature under control. In the next part we will try to create a machine learning model to predict how temperatures will change in the future.

# Test-Train Split
Before we can make a prediction we need to train our model. We split the data into training and validation sets; the held-out validation set lets us check how well the model predicts data it has not seen.

In [None]:
MonthV={'January':'1', 'February':'2', 'March':'3', 'April':'4', 'May':'5', 'June':'6', 'July':'7','August':'8', 'September':'9', 'October':'10', 'November':'11', 'December':'12'}
TempC=TempC.replace(MonthV)

TempC.head()

In [None]:
y=TempC['TempC'].loc[TempC.Element=='Temperature change']
X=TempC.drop(columns=['TempC','Country','Months','Element']).loc[TempC.Element=='Temperature change']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,random_state=42)
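
In the cell above the only remaining feature is `Year`, and it is still stored as text. A minimal sketch of the same split with an explicit numeric conversion is shown below; `sample`, `X_demo` and `y_demo` are hypothetical names and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical long-format sample with Year stored as text, as in the notebook
sample = pd.DataFrame({
    'Year': [str(y) for y in range(1961, 1971)],
    'TempC': np.linspace(-0.2, 0.7, 10),
})

X_demo = sample[['Year']].astype(int)     # explicit numeric feature
y_demo = sample['TempC']

Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
print(Xtr.shape, Xva.shape)
```

Converting explicitly avoids relying on scikit-learn to coerce strings to numbers, which is an implicit behaviour that may differ between versions.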

# Regression
One of the simplest models we can use is regression. Below we use both a linear model and polynomial models to predict world temperatures in the future.


## Simple Linear Regression
We use the test-train data to train the model and compare the predictions to actual data.

In [None]:
LR = LinearRegression()
LR.fit(X_train,y_train)
LRpreds = LR.predict(X_valid)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_valid, LRpreds)))
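
An RMSE value is easier to judge next to a reference. The sketch below uses made-up targets and predictions (not the notebook's results) to show the metric computed by hand and compared with a trivial baseline that always predicts the mean.

```python
import numpy as np
from sklearn import metrics

# Made-up validation targets and predictions, for illustration only
y_true = np.array([0.2, 0.5, 1.0, 1.3])
y_pred = np.array([0.3, 0.4, 0.9, 1.5])

rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Baseline: always predict the mean of the targets
rmse_baseline = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))

print(rmse_sklearn, rmse_manual, rmse_baseline)
```

If the model's RMSE is not clearly below the baseline, the feature adds little predictive power.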

In [None]:
plt.figure(figsize=(15,8))
plt.plot(y_valid-LRpreds,'o')
plt.axhline(y=0.0, color='k', linestyle='-')
plt.ylabel('Actual value - Predicted value')
plt.show()

We are happy with what the model is predicting, so we now use the whole dataset to train the model.

In [None]:
# Fit the model to the full dataset
LR.fit(X, y)

We now create artificial data that we can use to see what the model predicts for the future.

In [None]:
# Creating prediction data
LR_test=pd.DataFrame({'Year':np.random.randint(1980,2060, size=1000)})
LR_test=LR_test.sort_values(by=['Year']).reset_index(drop=True).astype(str)

#T_test=pd.DataFrame(np.arange(2020, 2046),columns=['Year']).astype(str)
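
The commented-out line above points at a simpler alternative: since the model only depends on the year, one row per year is enough. The sketch below illustrates this with a toy model fitted on an exact line (`toy_model`, `train`, `target` and `grid` are hypothetical names, and the slope of 0.02 is made up).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy model fitted on an exact line, only to make the sketch runnable
train = pd.DataFrame({'Year': np.arange(1961, 2020)})
target = 0.02 * (train['Year'] - 1961)
toy_model = LinearRegression().fit(train, target)

# One row per year instead of 1000 random draws
grid = pd.DataFrame({'Year': np.arange(1980, 2060)})
grid['TempC'] = toy_model.predict(grid[['Year']])

print(grid.tail(3))
```

A deterministic grid gives the same curve on every run and makes a later `groupby('Year').mean()` unnecessary, because each year appears exactly once.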

In [None]:
# Generate test predictions
preds_test = LR.predict(LR_test)
LR_test['TempC']=pd.Series(preds_test, index=LR_test.index)

## Polynomial regression
As seen in the plots above, the data is clearly not linear, so let us experiment with some nonlinear features to see if we can get a more accurate prediction.

In [None]:
PR2_mod = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])

# Note: despite its name, PR3_mod is a degree-5 polynomial
PR3_mod = Pipeline([('poly', PolynomialFeatures(degree=5)),
                  ('linear', LinearRegression(fit_intercept=False))])
# Fit both models to the full dataset
PR2_mod.fit(X, y)
PR3_mod.fit(X, y)
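
A practical caveat with polynomial features of a raw year: a value around 2000 raised to the fifth power is of order 10^16, so the feature columns differ by many orders of magnitude and the least-squares fit can become numerically ill-conditioned. A common remedy, which is not what the notebook does, is to standardise the year first. The sketch below shows this on a synthetic quadratic trend; `scaled_poly`, `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic trend (not the FAOSTAT data)
years = np.arange(1961, 2020)
X_toy = pd.DataFrame({'Year': years})
y_toy = 0.0003 * (years - 1961) ** 2

scaled_poly = Pipeline([
    ('scale', StandardScaler()),                       # centre and scale Year first
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression(fit_intercept=False)), # bias column comes from 'poly'
])
scaled_poly.fit(X_toy, y_toy)

max_err = np.abs(scaled_poly.predict(X_toy) - y_toy).max()
print(f'max absolute fit error: {max_err:.2e}')
```

Scaling does not change which curves the model can represent; it only makes the underlying linear algebra better behaved.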

In [None]:
# Creating prediction data
PR2_test=pd.DataFrame({'Year':np.random.randint(1980,2060, size=1000)})
PR2_test=PR2_test.sort_values(by=['Year']).reset_index(drop=True).astype(str)

PR3_test=pd.DataFrame({'Year':np.random.randint(1980,2060, size=1000)})
PR3_test=PR3_test.sort_values(by=['Year']).reset_index(drop=True).astype(str)

In [None]:
# Generate test predictions
pred2_test = PR2_mod.predict(PR2_test)
pred3_test = PR3_mod.predict(PR3_test)

PR2_test['TempC']=pd.Series(pred2_test, index=PR2_test.index)
PR3_test['TempC']=pd.Series(pred3_test, index=PR3_test.index)

# Plotting Results
Let's plot the results for the linear and polynomial models side by side to see how they fare.

In [None]:
plt.figure(figsize=(15,10))
for i in AvgTC.Country.unique():
    plt.plot(AvgTC.Year.loc[AvgTC.Country==str(i)],AvgTC.TempC.loc[AvgTC.Country==str(i)],linewidth=0.5)

plt.plot(AvgT.Year,AvgT.TempC,'r',linewidth=2.0)
plt.plot(LR_test.Year.unique(),LR_test.groupby('Year').mean(),'b',linewidth=2.0,label='Linear Model')
plt.plot(PR2_test.Year.unique(),PR2_test.groupby('Year').mean(),'g',linewidth=2.0,label='Poly-2 Model')
plt.plot(PR3_test.Year.unique(),PR3_test.groupby('Year').mean(),'c',linewidth=2.0,label='Poly-5 Model')
plt.axhline(y=0.0, color='k', linestyle='-')
plt.xticks(np.linspace(0,100,40),rotation=45)
plt.xlabel('Year')
plt.ylabel('Average Temperature change')
plt.title('Average Temperature Change of the World')
plt.legend()
plt.show()

As the first model is linear, its prediction is simple, without any large fluctuations, and it still predicts that temperatures will rise in the future. The polynomial regression models predict a much higher temperature change, and going from degree 2 to degree 5 does not seem to change much, which suggests that the data currently has no strong higher-order structure. At the edge of the available data the polynomial regression seems to be more accurate than the linear model. So unless the world does something to mitigate the rise, the increase in temperature will continue, and at a higher rate.
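
One way to check the impression that a model is more accurate at the edge of the data is a time-ordered hold-out: train on the earlier years and score on the most recent ones. The sketch below does this on synthetic yearly values (the curvature, noise level and 2010 cut-off are assumptions for illustration, not results from this notebook).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic yearly anomalies with mild curvature plus noise (illustrative only)
rng = np.random.default_rng(1)
years = np.arange(1961, 2020).reshape(-1, 1)
temps = 0.0004 * (years.ravel() - 1961) ** 2 + rng.normal(0, 0.1, size=len(years))

# Train on years before 2010, evaluate on the most recent decade
past = years.ravel() < 2010
models = {
    'linear': make_pipeline(StandardScaler(), LinearRegression()),
    'poly-2': make_pipeline(StandardScaler(), PolynomialFeatures(2), LinearRegression()),
}
holdout_rmse = {}
for name, model in models.items():
    model.fit(years[past], temps[past])
    err = temps[~past] - model.predict(years[~past])
    holdout_rmse[name] = float(np.sqrt(np.mean(err ** 2)))

print(holdout_rmse)
```

Unlike a random split, this tests extrapolation, which is what a forecast beyond 2019 actually requires.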