***
# Temperature Change (Raw Data Processing)
Capstone Project - Ali Sehpar Shikoh
***

<b>Previous Notebook:</b> RoundwoodProduction-RAW

<b>Next Notebook:</b> CombiningDataSets-AdditionalEDA

Climate change (mainly governed by changes in environmental temperatures) is one of the major factors known to directly influence crop yield in many countries [1]. This notebook deals with the processing of raw data on the temperature change observed in various countries on a yearly basis.

### Exploratory Data Analysis

Importing packages required to run this notebook.

In [2]:
import pandas as pd
import statistics

Transforming the 'TemperatureChange-RAW' CSV file into a dataframe.

In [3]:
temperature_df1 = pd.read_csv("DataFiles/01-RawDataFiles/TemperatureChange-RAW/TemperatureChange-RAW.csv")

Checking the imported dataset.

In [4]:
temperature_df1.head()

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Months Code,Months,Year Code,Year,Unit,Value,Flag,Flag Description
0,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1961,1961,°C,-0.751,Fc,Calculated data
1,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1962,1962,°C,0.985,Fc,Calculated data
2,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1963,1963,°C,1.931,Fc,Calculated data
3,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1964,1964,°C,-2.056,Fc,Calculated data
4,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1965,1965,°C,-0.669,Fc,Calculated data


As seen, the dataframe consists of a total of 14 columns containing information such as domain, area and element names. We do not require all of these columns.

Looking at the shape of the temperature_df1 dataframe.

In [5]:
temperature_df1.shape

(135250, 14)

As seen, there are a total of 135250 rows and 14 columns.

Checking the datatypes of all the variable columns.

In [6]:
temperature_df1.dtypes

Domain Code          object
Domain               object
Area Code (FAO)       int64
Area                 object
Element Code          int64
Element              object
Months Code           int64
Months               object
Year Code             int64
Year                  int64
Unit                 object
Value               float64
Flag                 object
Flag Description     object
dtype: object

As seen above, the dataset contains a mix of numerical and categorical columns: 8 columns are of object data type, 5 are of int64 data type and only one is of float64 data type. All of these data types are an appropriate fit for their respective columns.
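The per-dtype column counts quoted above can be tallied rather than counted by eye. The following is a minimal sketch on a small toy dataframe (`toy_df` is a hypothetical stand-in for `temperature_df1`, using two Afghanistan rows from the preview above):

```python
import pandas as pd

# Toy stand-in for temperature_df1 (hypothetical name, three columns only).
toy_df = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Value": [-0.751, 0.985],
})

# Convert each column's dtype to its string name, then count the names.
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)
```

Applied to the full dataframe, the same two calls should reproduce the 8/5/1 split described above, assuming the column types are inferred as shown in the `dtypes` output.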

Checking out the unique values in the 'Element' column.

In [7]:
temperature_df1['Element'].unique()

array(['Temperature change', 'Standard Deviation'], dtype=object)

As seen, there are two temperature-related elements in the dataset. Both could be used; however, 'Temperature change' is the more useful of the two and is therefore the only one used in later stages.

Filtering the dataset on the 'Element' column, keeping only the rows where 'Element' is equal to 'Temperature change'.

In [8]:
temperature_df2 = temperature_df1.loc[temperature_df1['Element'] == 'Temperature change']
temperature_df2.head(2)

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Months Code,Months,Year Code,Year,Unit,Value,Flag,Flag Description
0,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1961,1961,°C,-0.751,Fc,Calculated data
1,ET,Temperature change,2,Afghanistan,7271,Temperature change,7016,Dec–Jan–Feb,1962,1962,°C,0.985,Fc,Calculated data


Dropping redundant columns and rechecking the dataframe.

In [9]:
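# Use the columns= keyword (the positional axis argument is not accepted by recent pandas versions); 'Unit' is listed once.
temperature_df3 = temperature_df2.drop(columns=['Domain Code', 'Domain', 'Element', 'Element Code', 'Months Code', 'Year Code', 'Unit', 'Flag', 'Flag Description'])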
temperature_df3.head(2)

Unnamed: 0,Area Code (FAO),Area,Months,Year,Value
0,2,Afghanistan,Dec–Jan–Feb,1961,-0.751
1,2,Afghanistan,Dec–Jan–Feb,1962,0.985


Renaming the remaining columns to give them more informative and appropriate names, and reordering them.

In [10]:
temperature_df3.rename(columns = {'Area Code (FAO)':'Area Code', 'Value':'Temperature Change (°C)'}, inplace = True)
temperature_df3 = temperature_df3[['Area Code', 'Area', 'Year', 'Months', 'Temperature Change (°C)']]
temperature_df3.head()

Unnamed: 0,Area Code,Area,Year,Months,Temperature Change (°C)
0,2,Afghanistan,1961,Dec–Jan–Feb,-0.751
1,2,Afghanistan,1962,Dec–Jan–Feb,0.985
2,2,Afghanistan,1963,Dec–Jan–Feb,1.931
3,2,Afghanistan,1964,Dec–Jan–Feb,-2.056
4,2,Afghanistan,1965,Dec–Jan–Feb,-0.669


Checking out the null values in the dataframe.

In [11]:
temperature_df3.isnull().sum()

Area Code                     0
Area                          0
Year                          0
Months                        0
Temperature Change (°C)    2294
dtype: int64

As seen, there are a total of 2294 null values in the 'Temperature Change (°C)' column. These null values can be dealt with when joining this dataset with the main processed dataset.
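Before deciding how to treat these nulls, it may help to see how they are distributed. The sketch below uses a toy dataframe with hypothetical area names and placeholder missing values to contrast the per-column count, the grand total and a per-area breakdown:

```python
import numpy as np
import pandas as pd

# Toy data: area names "A" and "B" and the NaN positions are illustrative only.
toy_df = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "Year": [1961, 1962, 1961, 1962],
    "Temperature Change (°C)": [-0.121, np.nan, np.nan, np.nan],
})

# One count per column (a Series).
nulls_per_column = toy_df.isnull().sum()

# A single integer: the sum of the per-column counts.
total_nulls = toy_df.isnull().sum().sum()

# Missing temperature values broken down by area.
nulls_per_area = (
    toy_df["Temperature Change (°C)"].isnull().groupby(toy_df["Area"]).sum()
)

print(nulls_per_column)
print(total_nulls)
print(nulls_per_area)
```

A per-area breakdown of this kind would show whether the missing values are concentrated in a few areas or spread evenly, which may influence how they are handled at the join stage.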

Checking for duplicate rows in the dataset.

In [14]:
temperature_df3.duplicated().sum()

0

As seen there are no duplicate rows in the dataset.
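Note that `duplicated()` with no arguments only flags rows that are identical in every column. A stricter check, sketched below on toy data, looks for repeated (Area, Year, Months) keys even when the values differ. The conflicting value -0.750 is invented purely for illustration and does not appear in the dataset:

```python
import pandas as pd

# Toy data: the second row repeats the key of the first with a different value.
toy_df = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Year": [1961, 1961, 1962],
    "Months": ["Dec–Jan–Feb", "Dec–Jan–Feb", "Dec–Jan–Feb"],
    "Temperature Change (°C)": [-0.751, -0.750, 0.985],
})

# Full-row duplicates: none, because the values differ.
full_row_dupes = toy_df.duplicated().sum()

# Key duplicates: one, because (Afghanistan, 1961, Dec–Jan–Feb) appears twice.
key_dupes = toy_df.duplicated(subset=["Area", "Year", "Months"]).sum()

print(full_row_dupes, key_dupes)
```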

The number of rows seems to be larger than expected. At most, there should be around 12000 rows, considering that there are 202 countries and the data is collected from 1961 to 2019.
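The estimate of roughly 12000 rows follows from one value per country per year. A short sketch of the arithmetic, using the figures quoted above (202 countries, 1961 to 2019 inclusive):

```python
# Figures taken from the text above.
n_countries = 202
first_year, last_year = 1961, 2019

# Both endpoints are included in the year range.
n_years = last_year - first_year + 1

# Upper bound on rows if each country has exactly one value per year.
expected_max_rows = n_countries * n_years

print(n_years, expected_max_rows)
```

This gives 59 years and 11918 rows, which is the "around 12000" upper bound mentioned above.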

Looking at the unique values in the year column.

In [17]:
temperature_df3['Year'].unique()

array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016, 2017, 2018, 2019, 2020], dtype=int64)

The range of years appears to coincide with that of the other datasets, which is good.

Checking the unique entries in the 'Months' column.

In [18]:
temperature_df3['Months'].unique()

array(['Dec–Jan–Feb', 'Mar–Apr–May', 'Jun–Jul–Aug', 'Sep–Oct–Nov',
       'Meteorological year'], dtype=object)

As seen, the 'Months' column contains five categories: four are reported on a quarterly (seasonal) basis and one, 'Meteorological year', is reported on a yearly basis.

It is unclear whether the 'Meteorological year' category really reports values on a yearly basis. To find out, we check whether the mean of the quarterly temperature change values equals the value reported for the 'Meteorological year' category.

Considering dataset for Afghanistan only.

In [20]:
temperature_df3.loc[temperature_df3['Area'] == 'Afghanistan']

Unnamed: 0,Area Code,Area,Year,Months,Temperature Change (°C)
0,2,Afghanistan,1961,Dec–Jan–Feb,-0.751
1,2,Afghanistan,1962,Dec–Jan–Feb,0.985
2,2,Afghanistan,1963,Dec–Jan–Feb,1.931
3,2,Afghanistan,1964,Dec–Jan–Feb,-2.056
4,2,Afghanistan,1965,Dec–Jan–Feb,-0.669
...,...,...,...,...,...
535,2,Afghanistan,2016,Meteorological year,1.581
536,2,Afghanistan,2017,Meteorological year,1.626
537,2,Afghanistan,2018,Meteorological year,1.682
538,2,Afghanistan,2019,Meteorological year,1.125


Checking whether the mean of the quarterly temperature changes equals the temperature change reported for the 'Meteorological year' category, after filtering the data further to the year 1961.

In [29]:
Afghaisntan_1961_Quaters = temperature_df3.loc[(temperature_df3['Months'] != 'Meteorological year') & (temperature_df3['Year'] == 1961) & (temperature_df3['Area'] == 'Afghanistan')]
Afghaisntan_1961_Quaters

Unnamed: 0,Area Code,Area,Year,Months,Temperature Change (°C)
0,2,Afghanistan,1961,Dec–Jan–Feb,-0.751
120,2,Afghanistan,1961,Mar–Apr–May,0.022
240,2,Afghanistan,1961,Jun–Jul–Aug,0.346
360,2,Afghanistan,1961,Sep–Oct–Nov,-0.1


Calculating mean temperature change using quarterly values of 1961.

In [30]:
statistics.mean(Afghaisntan_1961_Quaters['Temperature Change (°C)'])

-0.12075000000000001

If the mean of the quarterly temperature changes equals the value reported for 'Meteorological year', the quarterly values can be discarded, since all of the other datasets report their entries on a yearly basis.

Looking at the 'Meteorological year' temperature change value.

In [31]:
Afghaisntan_1961_MetYear = temperature_df3.loc[(temperature_df3['Months'] == 'Meteorological year') & (temperature_df3['Year'] == 1961) & (temperature_df3['Area'] == 'Afghanistan')]
Afghaisntan_1961_MetYear

Unnamed: 0,Area Code,Area,Year,Months,Temperature Change (°C)
480,2,Afghanistan,1961,Meteorological year,-0.121


Since the mean of the quarterly temperature changes (-0.12075) matches the value reported for 'Meteorological year' (-0.121) after rounding to three decimal places, at least for this Afghanistan 1961 sample, the temperature change statistic can be reported on a yearly basis by keeping only the 'Meteorological year' rows. The 'Months' column can then be dropped.
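The check above covers a single country and year. One way it could be extended to every (Area, Year) pair is sketched below, using the Afghanistan 1961 rows shown earlier as toy input. The `tolerance` of 0.001 is an assumption chosen to absorb three-decimal rounding, not a value taken from the data source:

```python
import pandas as pd

value_col = "Temperature Change (°C)"

# Toy input: the five Afghanistan 1961 rows shown in the outputs above.
toy_df = pd.DataFrame({
    "Area": ["Afghanistan"] * 5,
    "Year": [1961] * 5,
    "Months": ["Dec–Jan–Feb", "Mar–Apr–May", "Jun–Jul–Aug",
               "Sep–Oct–Nov", "Meteorological year"],
    value_col: [-0.751, 0.022, 0.346, -0.1, -0.121],
})

is_yearly = toy_df["Months"] == "Meteorological year"

# Mean of the four seasonal values for each (Area, Year) pair.
seasonal_mean = toy_df[~is_yearly].groupby(["Area", "Year"])[value_col].mean()

# Reported yearly value, indexed by the same (Area, Year) key.
yearly_value = toy_df[is_yearly].set_index(["Area", "Year"])[value_col]

# Align both Series on their shared index and compare.
comparison = pd.DataFrame({"seasonal_mean": seasonal_mean, "yearly": yearly_value})
comparison["abs_diff"] = (comparison["seasonal_mean"] - comparison["yearly"]).abs()

tolerance = 0.001  # assumed allowance for rounding to three decimals
comparison["agrees"] = comparison["abs_diff"] <= tolerance

print(comparison)
```

On the full dataframe, any rows where `agrees` is False would be worth inspecting before discarding the quarterly values.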

### Refined Dataset Creation and Exportation

Keeping only the 'Meteorological year' rows and dropping the now redundant 'Months' column.

In [32]:
temperature_df4 = temperature_df3.loc[(temperature_df3['Months'] == 'Meteorological year')]
temperature_df4 = temperature_df4.drop(['Months'], axis=1)
temperature_df4

Unnamed: 0,Area Code,Area,Year,Temperature Change (°C)
480,2,Afghanistan,1961,-0.121
481,2,Afghanistan,1962,-0.171
482,2,Afghanistan,1963,0.841
483,2,Afghanistan,1964,-0.779
484,2,Afghanistan,1965,-0.254
...,...,...,...,...
135185,181,Zimbabwe,2016,1.470
135186,181,Zimbabwe,2017,0.443
135187,181,Zimbabwe,2018,0.747
135188,181,Zimbabwe,2019,1.359


The above dataset is exported to a CSV file in the 02-RefinedDataFiles folder.

In [36]:
temperature_df4.to_csv(r'DataFiles/02-RefinedDataFiles/TemperatureChange-REFINED.csv', index = False)
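As an optional sanity check, an exported file can be read back and compared with the dataframe it came from. The sketch below does this for a toy dataframe in a temporary directory, so it does not touch the project folders; `toy_df` and the temporary path are illustrative stand-ins:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for temperature_df4 (first two Afghanistan rows shown above).
toy_df = pd.DataFrame({
    "Area Code": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Temperature Change (°C)": [-0.121, -0.171],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "TemperatureChange-REFINED.csv")
    # Same call pattern as the export above: no index column is written.
    toy_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Raises an AssertionError if columns, dtypes or values differ.
pd.testing.assert_frame_equal(toy_df, reloaded)
print(reloaded)
```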

### Summary of things done in this notebook:

- Performed basic EDA.
- Kept only the 'Temperature change' element and the 'Meteorological year' records by applying filters on the 'Element' and 'Months' columns.
- Dropped redundant columns.
- Incorporated more information in selected column names.
- Exported the refined data to a CSV file.


### References

[1] Kang Y, Khan S, Ma X. Climate change impacts on crop yield, crop water productivity and food security–A review. Progress in Natural Science. 2009 Dec 10;19(12):1665-74.