***
# Atmospheric Deposition (Raw Data Processing)
Capstone Project - Ali Sehpar Shikoh
***

<b>Next Notebook:</b> BiologicalFixation-RAW

<u>Capstone Project Problem Statement:</u> What are the key parameters that play a significant role in determining crop yield?

<u>Capstone Project Motivation:</u> Food security is profoundly important for people all over the world and is predominantly studied under various climate change scenarios. Evaluating food security in a complete and systematic fashion requires integrating several other variables as well.

This is the first notebook of the project. It deals with cleaning the dataset related to 'Atmospheric deposition', a feature that could affect crop yield.

Atmospheric deposition is an essential process by which particles and gases are transferred from the atmosphere to terrestrial and aquatic surfaces [1]. This process can have both useful and harmful effects on the environment. On one hand, it cleanses the air and transports additional nutrients to plants; on the other hand, the deposition of sulfur and nutrients may contribute to the acidification and eutrophication of various ecosystems.

To start off, let's import the required packages. Since this notebook, and the notebooks similar to it, only refine the data so that it can be combined with the other datasets for modelling purposes, only the pandas library is imported. A separate notebook deals with detailed exploratory data analysis.

### Exploratory Data Analysis

Importing package(s) to be used in this notebook.

In [4]:
import pandas as pd

Importing the CSV file into a dataframe called AtmosDep_df1.

In [5]:
AtmosDep_df1 = pd.read_csv('DataFiles/01-RawDataFiles/AtmosphericDeposition-RAW/AtmosphericDeposition-RAW.csv')

Checking the shape of the dataframe.

In [7]:
AtmosDep_df1.shape

(10112, 14)

As seen, the dataframe consists of about 10 thousand rows and 14 columns.

Checking the data contents of the dataframe.

In [8]:
AtmosDep_df1.head(2)

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,ESB,Soil nutrient budget,2,Afghanistan,7275,Cropland nutrient flow,5076,Atmospheric Deposition,1961,1961,tonnes,65189.74,Fc,Calculated data
1,ESB,Soil nutrient budget,2,Afghanistan,7275,Cropland nutrient flow,5076,Atmospheric Deposition,1962,1962,tonnes,67127.88,Fc,Calculated data


As seen, the dataframe consists of various columns containing information such as Domain, Element and Item names. Not all of these columns are required. The Flag columns will also be dropped rather than used as a filter. FAO has used various methodologies to fill in the gaps in the dataset, and the flags indicate which methodology was used for each value. Ideally, the number of rows should equal the product of the number of countries surveyed and the number of years for which data was collected. However, this is not the case for all the databases included in this project, where the number of rows is lower than this ideal. Filtering the data on the basis of flags would therefore be counterproductive, as it would reduce the number of rows further.
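The gap between the ideal row count (countries times years) and the actual row count can be quantified with a short check. The following is only a minimal sketch on a small made-up frame: the `coverage_report` helper, the `toy_df` name and its values are illustrative assumptions, not part of the FAO data or of the original notebook. The same helper could be called on `AtmosDep_df1`.

```python
import pandas as pd

def coverage_report(df, area_col="Area", year_col="Year"):
    """Compare the actual row count with the ideal areas x years count."""
    n_areas = df[area_col].nunique()
    n_years = df[year_col].nunique()
    expected = n_areas * n_years
    actual = len(df)
    # Distinct years reported by each area; fewer than n_years means a gap.
    years_per_area = df.groupby(area_col)[year_col].nunique()
    incomplete = years_per_area[years_per_area < n_years]
    return {
        "expected_rows": expected,
        "actual_rows": actual,
        # Meaningful only if each (area, year) pair appears at most once.
        "missing_rows": expected - actual,
        "incomplete_areas": incomplete.to_dict(),
    }

# Hypothetical toy data: area "B" has no record for 1962.
toy_df = pd.DataFrame({
    "Area": ["A", "A", "B"],
    "Year": [1961, 1962, 1961],
    "Value": [1.0, 2.0, 3.0],
})
report = coverage_report(toy_df)
print(report)
```

On the toy frame, two areas and two years give an ideal count of four rows, while only three are present, so area "B" is reported as incomplete.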

Checking the datatypes of each of the columns.

In [9]:
AtmosDep_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10112 entries, 0 to 10111
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Domain Code       10112 non-null  object 
 1   Domain            10112 non-null  object 
 2   Area Code (FAO)   10112 non-null  int64  
 3   Area              10112 non-null  object 
 4   Element Code      10112 non-null  int64  
 5   Element           10112 non-null  object 
 6   Item Code         10112 non-null  int64  
 7   Item              10112 non-null  object 
 8   Year Code         10112 non-null  int64  
 9   Year              10112 non-null  int64  
 10  Unit              10112 non-null  object 
 11  Value             10112 non-null  float64
 12  Flag              10112 non-null  object 
 13  Flag Description  10112 non-null  object 
dtypes: float64(1), int64(5), object(8)
memory usage: 1.1+ MB


As seen above, the dataset contains a mix of numerical and categorical columns. A total of 8 columns are of object data type, 5 columns are of int64 data type, and only one column is of float64 data type. All of these data types are an appropriate fit for their respective columns.

Checking for null values.

In [10]:
AtmosDep_df1.isnull().sum()

Domain Code         0
Domain              0
Area Code (FAO)     0
Area                0
Element Code        0
Element             0
Item Code           0
Item                0
Year Code           0
Year                0
Unit                0
Value               0
Flag                0
Flag Description    0
dtype: int64

The dataset does not contain any null values.

Checking for duplicate rows.

In [11]:
AtmosDep_df1.duplicated().sum()

0

The dataset also does not contain any duplicate rows.

To learn more about the dataset, the unique values in selected columns are checked.

Starting off with the 'Year' column.

In [12]:
AtmosDep_df1['Year'].unique()

array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016, 2017, 2018], dtype=int64)

The 'Year' column comprises values ranging from 1961 to 2018.
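Reading a long array by eye makes it easy to miss a skipped year, so the range can also be checked programmatically. The sketch below uses a hypothetical helper, `missing_years`, which is not part of the original notebook; in the notebook it could be called as `missing_years(AtmosDep_df1['Year'].unique())`.

```python
def missing_years(years):
    """Return the years absent between the smallest and largest year given."""
    present = {int(y) for y in years}
    return [y for y in range(min(present), max(present) + 1) if y not in present]

# A full 1961-2018 span, as in the output above, has no gaps.
full_span_gaps = missing_years(range(1961, 2019))
# Illustrative input with 1963 left out.
toy_gaps = missing_years([1961, 1962, 1964])
print(full_span_gaps, toy_gaps)
```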

Checking the 'Flag', 'Element' and 'Item' columns.

In [13]:
AtmosDep_df1['Flag'].unique() # Flag: Relates to the method used to collect data.

array(['Fc'], dtype=object)

In [None]:
AtmosDep_df1['Element'].unique() # Element: Relates to the main categories of interest.

array(['Cropland nutrient flow'], dtype=object)

In [None]:
AtmosDep_df1['Item'].unique() # Item: Relates to the main subcategories of interest.

array(['Atmospheric Deposition'], dtype=object)

As seen, the 'Flag', 'Item' and 'Element' columns contain only one value each and can thus be dropped, as they either do not provide any useful information or the information conveyed can easily be incorporated into other relevant columns.
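Instead of calling `unique()` on one column at a time, single-valued columns can be listed in one pass with `DataFrame.nunique()`. This is a minimal sketch on made-up data; the `constant_columns` helper and `toy_df` are illustrative names, not part of the original notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold exactly one distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also flagged.
    counts = df.nunique(dropna=False)
    return counts[counts == 1].index.tolist()

# Hypothetical toy frame mimicking the structure of the raw data.
toy_df = pd.DataFrame({
    "Flag": ["Fc", "Fc", "Fc"],
    "Item": ["Atmospheric Deposition"] * 3,
    "Year": [1961, 1962, 1963],
})
single_valued = constant_columns(toy_df)
print(single_valued)
```

Applied to `AtmosDep_df1`, a call such as `constant_columns(AtmosDep_df1)` would list every column that carries no row-level information.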

### Refined Dataset Creation and Exportation

Based on the initial analysis, only four columns are of value, i.e. 'Area Code (FAO)', 'Area', 'Year' and 'Value'. Some of these column names need to be changed, either to include more information or to discard redundant information.

Creating a new dataframe called 'AtmosDep_df2' consisting only of the columns of interest, and renaming the columns included in the new dataframe.

In [None]:
# rename() returns a new dataframe, which avoids the SettingWithCopyWarning of an in-place rename on a slice.
AtmosDep_df2 = AtmosDep_df1[['Area Code (FAO)', 'Area', 'Year', 'Value']].rename(
    columns={'Area Code (FAO)': 'Area Code', 'Value': 'Atmospheric Deposition (tonnes)'}
)  # renaming columns to maximize the incorporation of meaningful information.
AtmosDep_df2


Unnamed: 0,Area Code,Area,Year,Atmospheric Deposition (tonnes)
0,2,Afghanistan,1961,65189.7400
1,2,Afghanistan,1962,67127.8800
2,2,Afghanistan,1963,68986.5110
3,2,Afghanistan,1964,70987.6918
4,2,Afghanistan,1965,72446.0625
...,...,...,...,...
10107,181,Zimbabwe,2014,18855.9000
10108,181,Zimbabwe,2015,18855.9000
10109,181,Zimbabwe,2016,18855.9000
10110,181,Zimbabwe,2017,18855.9000


The refined dataset above, consisting of around 10 thousand rows and 4 columns, incorporates all the information related to 'Atmospheric Deposition' and can thus readily be exported.
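Since the refined data is meant to be combined with other datasets later, it may be worth confirming before export that each country and year pair appears only once. The sketch below assumes that 'Area Code' and 'Year' together form the key. That assumption, the `has_unique_key` helper and the toy values are illustrative and do not come from the original notebook.

```python
import pandas as pd

def has_unique_key(df, key_cols):
    """True when no two rows share the same combination of key columns."""
    return not df.duplicated(subset=key_cols).any()

# Hypothetical toy frame with the refined column names and placeholder values.
toy_refined = pd.DataFrame({
    "Area Code": [2, 2, 181],
    "Area": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "Year": [1961, 1962, 1961],
    "Atmospheric Deposition (tonnes)": [1.0, 2.0, 3.0],
})
key_ok = has_unique_key(toy_refined, ["Area Code", "Year"])

# Appending a repeat of the first row breaks the key.
toy_with_repeat = pd.concat([toy_refined, toy_refined.iloc[[0]]], ignore_index=True)
key_broken = has_unique_key(toy_with_repeat, ["Area Code", "Year"])
print(key_ok, key_broken)
```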

Exporting the refined dataset to the data folder containing the refined datasets.

In [None]:
AtmosDep_df2.to_csv(r'DataFiles/02-RefinedDataFiles/AtmosphericDeposition-REFINED.csv', index = False)
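A quick round trip can confirm that an exported file reloads with the same columns. The sketch below writes a made-up frame to a temporary directory, so it does not touch the project folders; the file name `toy-REFINED.csv` and the values are hypothetical. Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with the refined column names and placeholder values.
toy_refined = pd.DataFrame({
    "Area Code": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Atmospheric Deposition (tonnes)": [1.5, 2.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy-REFINED.csv")
    toy_refined.to_csv(out_path, index=False)  # same export call as in the notebook
    reloaded = pd.read_csv(out_path)           # read back while the file still exists

same_columns = list(reloaded.columns) == list(toy_refined.columns)
same_shape = reloaded.shape == toy_refined.shape
print(same_columns, same_shape)
```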

### Summary of things done in this notebook:

- Performed basic EDA.
- Dropped redundant columns.
- Incorporated more information in selected column names.
- Exported the refined data to a CSV file.

### References:

[1] Greaver TL, Sullivan TJ, Herrick JD, Barber MC, Baron JS, Cosby BJ, Deerhake ME, Dennis RL, Dubois JJ, Goodale CL, Herlihy AT. Ecological effects of nitrogen and sulfur air pollution in the US: what do we know? Frontiers in Ecology and the Environment. 2012 Sep;10(7):365-72.