***
# Biological Fixation (Raw Data Processing)
Capstone Project - Ali Sehpar Shikoh
***

<b>Previous Notebook:</b> AtmosphericDeposition-RAW

<b>Next Notebook:</b> CropRemoval-RAW

This is the second notebook of the project and deals with cleaning the dataset related to 'Biological Fixation', a feature that could affect crop yield.

All plants, including forage crops, need relatively large amounts of nitrogen (N) for proper growth and development. Biological nitrogen fixation (BNF), or simply biological fixation (BF), is the process by which nitrogen gas (N2) from the atmosphere is incorporated into the tissue of certain plants [1]. Only a few groups of plants are able to obtain N this way, with the help of soil micro-organisms. Among forage plants, legumes (plants in the botanical family Fabaceae) are well known for being able to obtain N from atmospheric N2.

### Exploratory Data Analysis

Importing package(s).

In [1]:
import pandas as pd

Importing the CSV file into a dataframe called BioFix_df1.

In [2]:
BioFix_df1 = pd.read_csv('DataFiles/01-RawDataFiles/BiologicalFixation-RAW/BiologicalFixation-RAW.csv')

Checking the shape of the dataframe.

In [5]:
BioFix_df1.shape

(8989, 14)

As seen, the dataframe consists of 8,989 rows and 14 columns.

Checking the data contents of the dataframe.

In [6]:
BioFix_df1.head(2)

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,ESB,Soil nutrient budget,3,Albania,7275,Cropland nutrient flow,5078,Biological Fixation,1961,1961,tonnes,277.41,Fc,Calculated data
1,ESB,Soil nutrient budget,3,Albania,7275,Cropland nutrient flow,5078,Biological Fixation,1962,1962,tonnes,267.87,Fc,Calculated data


As seen, the dataframe contains several descriptive columns, such as the Domain, Element, Item and Flag names. We do not require all of these columns.

Checking the datatypes of each of the columns.

In [7]:
BioFix_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8989 entries, 0 to 8988
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Domain Code       8989 non-null   object 
 1   Domain            8989 non-null   object 
 2   Area Code (FAO)   8989 non-null   int64  
 3   Area              8989 non-null   object 
 4   Element Code      8989 non-null   int64  
 5   Element           8989 non-null   object 
 6   Item Code         8989 non-null   int64  
 7   Item              8989 non-null   object 
 8   Year Code         8989 non-null   int64  
 9   Year              8989 non-null   int64  
 10  Unit              8989 non-null   object 
 11  Value             8989 non-null   float64
 12  Flag              8989 non-null   object 
 13  Flag Description  8989 non-null   object 
dtypes: float64(1), int64(5), object(8)
memory usage: 983.3+ KB


As seen above, the dataset contains a mix of numerical and categorical columns: 8 columns are of object data type, 5 are of int64 data type and 1 ('Value') is of float64 data type. These data types are an appropriate fit for the respective columns.
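The split between numerical and non-numerical columns can also be obtained programmatically rather than read off the `info()` output. The following is a minimal sketch, assuming a two-row stand-in dataframe (`sample_df`, a hypothetical name) built from the rows shown in the `head(2)` output above; on the full dataset the same calls would be made on `BioFix_df1`.

```python
import pandas as pd

# Two rows copied from the head(2) output above; a stand-in for BioFix_df1.
sample_df = pd.DataFrame({
    "Domain Code": ["ESB", "ESB"],
    "Domain": ["Soil nutrient budget", "Soil nutrient budget"],
    "Area Code (FAO)": [3, 3],
    "Area": ["Albania", "Albania"],
    "Element Code": [7275, 7275],
    "Element": ["Cropland nutrient flow", "Cropland nutrient flow"],
    "Item Code": [5078, 5078],
    "Item": ["Biological Fixation", "Biological Fixation"],
    "Year Code": [1961, 1962],
    "Year": [1961, 1962],
    "Unit": ["tonnes", "tonnes"],
    "Value": [277.41, 267.87],
    "Flag": ["Fc", "Fc"],
    "Flag Description": ["Calculated data", "Calculated data"],
})

# select_dtypes splits the columns by dtype family.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
other_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric columns:", numeric_cols)
print(len(other_cols), "non-numeric columns:", other_cols)
```

Using `select_dtypes` keeps the check independent of how a particular pandas version labels text columns.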

Checking for null values.

In [8]:
BioFix_df1.isnull().sum()

Domain Code         0
Domain              0
Area Code (FAO)     0
Area                0
Element Code        0
Element             0
Item Code           0
Item                0
Year Code           0
Year                0
Unit                0
Value               0
Flag                0
Flag Description    0
dtype: int64

The dataset does not contain any null values.

The dataset also does not contain any duplicate rows.

In [9]:
BioFix_df1.duplicated().sum()

0
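The null and duplicate checks above can be turned into explicit counts that are easy to assert on. The sketch below is illustrative only: `sample_df` is a hypothetical three-row stand-in built from rows shown in this notebook, and the extra check on the ('Area', 'Year') pair assumes that each country should appear at most once per year, which the notebook itself does not verify.

```python
import pandas as pd

# Rows copied from the outputs in this notebook; a stand-in for BioFix_df1.
sample_df = pd.DataFrame({
    "Area": ["Albania", "Albania", "Zimbabwe"],
    "Year": [1961, 1962, 2014],
    "Value": [277.41, 267.87, 9012.87],
})

# Total number of missing cells across all columns.
null_total = int(sample_df.isnull().sum().sum())

# Number of rows that are exact copies of an earlier row.
duplicate_rows = int(sample_df.duplicated().sum())

# Assumption: one record per country per year, so repeated keys would be suspicious.
duplicate_keys = int(sample_df.duplicated(subset=["Area", "Year"]).sum())

print(null_total, duplicate_rows, duplicate_keys)
```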

To learn more about the dataset, checking the unique values in selected columns.

Starting off with the 'Year' column.

In [10]:
BioFix_df1['Year'].unique()

array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016, 2017, 2018], dtype=int64)

The 'Year' column contains values ranging from 1961 to 2018.
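Reading a long array by eye makes it easy to miss a gap, so the year range can also be checked in code. This is a small sketch, assuming the year values are exactly those listed in the `unique()` output above; `years` is a hypothetical stand-in for `BioFix_df1['Year']`.

```python
import pandas as pd

# Stand-in for BioFix_df1['Year']: the values listed in the output above.
years = pd.Series(range(1961, 2019), name="Year")

first_year = int(years.min())
last_year = int(years.max())

# Any year inside the range that never appears in the column.
missing_years = sorted(set(range(first_year, last_year + 1)) - set(years))

print(first_year, last_year, years.nunique(), missing_years)
```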

Checking the 'Flag', 'Element' and 'Item' columns.

In [11]:
BioFix_df1['Flag'].unique() # Flag: Relates to the method used to collect data.

array(['Fc'], dtype=object)

In [13]:
BioFix_df1['Element'].unique() # Element: Relates to the main categories of interest.

array(['Cropland nutrient flow'], dtype=object)

In [12]:
BioFix_df1['Item'].unique() # Item: Relates to the main subcategories of interest.

array(['Biological Fixation'], dtype=object)

As seen, the 'Flag', 'Item' and 'Element' columns each contain only one value. They can therefore be dropped, since they either provide no useful information or convey information that can easily be incorporated into other relevant columns.
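Instead of calling `unique()` column by column, single-valued columns can be listed in one pass with `nunique()`. The sketch below uses a hypothetical stand-in (`sample_df`) assembled from rows shown in this notebook; it is not the full dataset, so the result only illustrates the technique.

```python
import pandas as pd

# Rows copied from the outputs in this notebook; a stand-in for BioFix_df1.
sample_df = pd.DataFrame({
    "Area": ["Albania", "Albania", "Zimbabwe"],
    "Year": [1961, 1962, 2014],
    "Value": [277.41, 267.87, 9012.87],
    "Flag": ["Fc", "Fc", "Fc"],
    "Element": ["Cropland nutrient flow"] * 3,
    "Item": ["Biological Fixation"] * 3,
})

# A column with a single distinct value carries no row-level information.
constant_cols = [
    col for col in sample_df.columns
    if sample_df[col].nunique(dropna=False) == 1
]

print(constant_cols)
```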

### Refined Dataset Creation and Exportation

Based on the initial analysis, only four columns are of value: 'Area Code (FAO)', 'Area', 'Year' and 'Value'. Some of these column names need to be changed, either to include more information or to discard redundant information.

Creating a new dataframe called 'BioFix_df2' consisting only of the columns of interest.

In [15]:
BioFix_df2 = BioFix_df1[['Area Code (FAO)', 'Area', 'Year', 'Value']].copy() # taking an explicit copy avoids pandas' SettingWithCopyWarning.
BioFix_df2 = BioFix_df2.rename(columns = {'Area Code (FAO)':'Area Code', 'Value':'Biological Fixation (tonnes)'}) # renaming columns to maximize the incorporation of meaningful information.
BioFix_df2


Unnamed: 0,Area Code,Area,Year,Biological Fixation (tonnes)
0,3,Albania,1961,277.41
1,3,Albania,1962,267.87
2,3,Albania,1963,420.48
3,3,Albania,1964,383.70
4,3,Albania,1965,250.77
...,...,...,...,...
8984,181,Zimbabwe,2014,9012.87
8985,181,Zimbabwe,2015,6420.84
8986,181,Zimbabwe,2016,6500.15
8987,181,Zimbabwe,2017,7971.18


The above refined dataset, consisting of 8,989 rows and 4 columns, incorporates all the information related to 'Biological Fixation' and thus can readily be exported.

Exporting the refined dataset to a data folder containing refined datasets.
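Before exporting, it can be worth confirming that the refinement step only changed the columns and left the rows alone. The following is a minimal sketch on a small stand-in dataframe; `raw_sample` and the helper `refine` are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

# Rows copied from the outputs in this notebook; a stand-in for BioFix_df1.
raw_sample = pd.DataFrame({
    "Area Code (FAO)": [3, 3, 181],
    "Area": ["Albania", "Albania", "Zimbabwe"],
    "Year": [1961, 1962, 2014],
    "Value": [277.41, 267.87, 9012.87],
    "Flag": ["Fc", "Fc", "Fc"],
})


def refine(df):
    """Hypothetical helper: keep the four columns of interest and rename two."""
    keep = ["Area Code (FAO)", "Area", "Year", "Value"]
    renames = {
        "Area Code (FAO)": "Area Code",
        "Value": "Biological Fixation (tonnes)",
    }
    # rename() returns a new DataFrame, so the source frame is left untouched.
    return df[keep].rename(columns=renames)


refined_sample = refine(raw_sample)

# Column selection should not add or drop any rows.
assert len(refined_sample) == len(raw_sample)
print(list(refined_sample.columns))
```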

In [16]:
BioFix_df2.to_csv(r'DataFiles/02-RefinedDataFiles/BiologicalFixation-REFINED.csv', index = False)
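Note that `to_csv` does not create missing folders, so the target directory has to exist before exporting. The sketch below shows one way to create the folder and confirm that the file reads back unchanged. It writes to a temporary directory rather than the project's `DataFiles` folder, and `refined_sample` is a hypothetical stand-in for `BioFix_df2`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Rows copied from the refined output above; a stand-in for BioFix_df2.
refined_sample = pd.DataFrame({
    "Area Code": [3, 3, 181],
    "Area": ["Albania", "Albania", "Zimbabwe"],
    "Year": [1961, 1962, 2014],
    "Biological Fixation (tonnes)": [277.41, 267.87, 9012.87],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder first; to_csv raises an error if it is missing.
    out_dir = Path(tmp_dir) / "02-RefinedDataFiles"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "BiologicalFixation-REFINED.csv"
    refined_sample.to_csv(out_path, index=False)

    # Read the file back while the temporary folder still exists.
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if the round trip changed columns, dtypes or values.
pd.testing.assert_frame_equal(reloaded, refined_sample)
print(reloaded.shape)
```

Passing `index = False`, as the export cell above does, keeps the row index out of the file, so no extra unnamed column appears when the CSV is read again.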

### Summary of things done in this notebook:

- Performed basic EDA.
- Dropped redundant columns.
- Incorporated more information in selected column names.
- Exported the refined data to a CSV file.


### References

[1] “Define Biological Nitrogen Fixation (BNF) and Explain Its Importance.” Forage Information System, 2 June 2009, https://forages.oregonstate.edu/nfgc/eo/onlineforagecurriculum/instructormaterials/availabletopics/nitrogenfixation/definition.