***
# Crop Removal (Raw Data Processing)
Capstone Project - Ali Sehpar Shikoh
***

<b> Previous Notebook: BiologicalFixation-RAW </b>

<b> Next Notebook: Crops-RAW </b>

This is the third notebook of the project and deals with cleaning the dataset related to 'Crop Removal', a feature that could affect crop yield.

Crop nutrient removal is defined as the total amount of nutrients removed from the field in the harvested portion of the crop (e.g., grain, silage, hay) [1]. This term should not be confused with crop nutrient uptake, which is defined as the total amount of nutrients contained in the entire crop at maturity.

### Exploratory Data Analysis

Importing package(s) to be used in this notebook.

In [1]:
import pandas as pd

Importing the CSV file into a dataframe called CropRem_df1.

In [2]:
CropRem_df1 = pd.read_csv('DataFiles/01-RawDataFiles/CropRemoval-RAW/CropRemoval-RAW.csv')

Checking the shape of the dataframe.

In [3]:
CropRem_df1.shape

(10843, 14)

As seen, the dataframe consists of 10,843 rows and 14 columns.

Checking the data contents of the dataframe.

In [4]:
CropRem_df1.head(2)

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,ESB,Soil nutrient budget,2,Afghanistan,7275,Cropland nutrient flow,5077,Crop Removal,1961,1961,tonnes,74154.2,Fc,Calculated data
1,ESB,Soil nutrient budget,2,Afghanistan,7275,Cropland nutrient flow,5077,Crop Removal,1962,1962,tonnes,75966.9064,Fc,Calculated data


As seen, the dataframe contains several descriptive columns such as Domain, Element, Item and Flag. We do not require all of these columns.

Checking the datatypes of each of the columns.

In [5]:
CropRem_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10843 entries, 0 to 10842
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Domain Code       10843 non-null  object 
 1   Domain            10843 non-null  object 
 2   Area Code (FAO)   10843 non-null  int64  
 3   Area              10843 non-null  object 
 4   Element Code      10843 non-null  int64  
 5   Element           10843 non-null  object 
 6   Item Code         10843 non-null  int64  
 7   Item              10843 non-null  object 
 8   Year Code         10843 non-null  int64  
 9   Year              10843 non-null  int64  
 10  Unit              10843 non-null  object 
 11  Value             10843 non-null  float64
 12  Flag              10843 non-null  object 
 13  Flag Description  10843 non-null  object 
dtypes: float64(1), int64(5), object(8)
memory usage: 1.2+ MB


As seen above, the dataset contains a mix of numerical and categorical columns: 8 columns are of object data type, 5 are of int64 data type and one is of float64 data type. All of these data types are an appropriate fit for their respective columns.
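The split between numerical and non-numerical columns can also be obtained programmatically rather than read off the `info()` output. The following is a minimal sketch on a hypothetical toy dataframe (`toy_df` is not part of this notebook; it only mirrors a few columns of the raw file).

```python
import pandas as pd

# Hypothetical toy dataframe mirroring a few columns of the raw file.
toy_df = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Value": [74154.2, 75966.9064],
    "Flag": ["Fc", "Fc"],
})

# Separating numerical columns from all remaining (text-like) columns.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numeric_cols)
print("Other columns:", other_cols)
```

Applied to CropRem_df1, the same two calls would list the six numerical and eight object columns reported above.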

Checking for null values.

In [6]:
CropRem_df1.isnull().sum()

Domain Code         0
Domain              0
Area Code (FAO)     0
Area                0
Element Code        0
Element             0
Item Code           0
Item                0
Year Code           0
Year                0
Unit                0
Value               0
Flag                0
Flag Description    0
dtype: int64

The dataset does not contain any null values.

In [7]:
CropRem_df1.duplicated().sum()

0

The dataset also does not contain any duplicate rows.
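Note that `duplicated()` without arguments only flags rows that are identical in every column. A stricter check, sketched below on hypothetical toy data, looks for repeated 'Area' and 'Year' combinations, which would indicate more than one record per country and year. The third row of `toy_df` is invented purely to trigger the check.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats an Area/Year pair with a different value.
toy_df = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Year": [1961, 1962, 1962],
    "Value": [74154.2, 75966.9064, 1.0],
})

# Total number of missing cells across the whole dataframe.
null_total = int(toy_df.isnull().sum().sum())

# Rows that are identical in every column.
full_duplicates = int(toy_df.duplicated().sum())

# Rows that repeat an (Area, Year) combination, regardless of the value.
key_duplicates = int(toy_df.duplicated(subset=["Area", "Year"]).sum())

print(null_total, full_duplicates, key_duplicates)
```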

To learn more about the dataset, checking the unique values in selected columns.

Starting off with the 'Year' column.

In [8]:
CropRem_df1['Year'].unique()

array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016, 2017, 2018], dtype=int64)

The 'Year' column comprises values ranging from 1961 to 2018.
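Reading the array by eye works for 58 values, but a gap in the range could also be detected programmatically. The sketch below uses a hypothetical toy column in which one year is deliberately left out.

```python
import pandas as pd

# Hypothetical toy column with 1964 deliberately missing.
toy_df = pd.DataFrame({"Year": [1961, 1962, 1963, 1965]})

# Comparing the observed years against the full range between min and max.
years = sorted(toy_df["Year"].unique().tolist())
expected_years = set(range(min(years), max(years) + 1))
missing_years = sorted(expected_years - set(years))

print("First year:", years[0], "Last year:", years[-1])
print("Missing years:", missing_years)
```

An empty `missing_years` list would confirm that the range is continuous.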

### Refined Dataset Creation and Exportation

Based on the initial analysis, only four columns are of value, i.e. 'Area Code (FAO)', 'Area', 'Year' and 'Value'. Since the names of some of these columns are not ideal, they need to be renamed to either include more information or discard redundant information.

Checking the 'Flag', 'Element' and 'Item' columns.

In [9]:
CropRem_df1['Flag'].unique() # Flag: Relates to the method used to collect data.

array(['Fc'], dtype=object)

In [11]:
CropRem_df1['Element'].unique() # Element: Relates to the main categories of interest.

array(['Cropland nutrient flow'], dtype=object)

In [10]:
CropRem_df1['Item'].unique() # Item: Relates to the main subcategories of interest.

array(['Crop Removal'], dtype=object)

As seen, the 'Flag', 'Item' and 'Element' columns contain only one value each and can therefore be dropped: they either provide no useful information or convey information that can easily be incorporated into other relevant column names.
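Instead of inspecting the columns one at a time, single-valued columns could be found in one pass with `nunique()`. This is a minimal sketch on a hypothetical toy dataframe, not a step taken in the original analysis.

```python
import pandas as pd

# Hypothetical toy dataframe: only 'Area' varies between rows.
toy_df = pd.DataFrame({
    "Area": ["Afghanistan", "Zimbabwe"],
    "Element": ["Cropland nutrient flow", "Cropland nutrient flow"],
    "Item": ["Crop Removal", "Crop Removal"],
    "Flag": ["Fc", "Fc"],
})

# Counting distinct values per column and keeping the columns with exactly one.
unique_counts = toy_df.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(constant_cols)
```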

Creating a new dataframe called 'CropRem_df2' consisting only of the columns of interest.

In [13]:
# Taking an explicit copy so that the rename acts on an independent dataframe and does not raise SettingWithCopyWarning.
CropRem_df2 = CropRem_df1[['Area Code (FAO)', 'Area', 'Year', 'Value']].copy()
# Renaming columns so that their names carry more meaningful information.
CropRem_df2 = CropRem_df2.rename(columns={'Area Code (FAO)': 'Area Code', 'Value': 'Crop Removal (tonnes)'})
CropRem_df2


Unnamed: 0,Area Code,Area,Year,Crop Removal (tonnes)
0,2,Afghanistan,1961,74154.2000
1,2,Afghanistan,1962,75966.9064
2,2,Afghanistan,1963,71302.8675
3,2,Afghanistan,1964,76253.3200
4,2,Afghanistan,1965,77364.1100
...,...,...,...,...
10838,181,Zimbabwe,2014,38122.7967
10839,181,Zimbabwe,2015,28149.5661
10840,181,Zimbabwe,2016,26201.6590
10841,181,Zimbabwe,2017,32794.6510


The above refined dataset, consisting of 10,843 rows and 4 columns, incorporates all the information related to 'Crop Removal' and can readily be exported.
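As a side note on the selection and renaming step, calling `rename` with `inplace=True` on a column subset of another dataframe is what typically triggers pandas' SettingWithCopyWarning. One way to avoid it, sketched here on a hypothetical toy dataframe (`raw_df` and `refined_df` are illustrative names), is to chain the selection with a non-inplace `rename`, which returns a new dataframe.

```python
import pandas as pd

# Hypothetical toy dataframe standing in for the raw data.
raw_df = pd.DataFrame({
    "Area Code (FAO)": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Value": [74154.2, 75966.9064],
    "Flag": ["Fc", "Fc"],
})

# Selecting the columns of interest and renaming them in one chained expression.
# rename() without inplace=True returns a new dataframe and leaves raw_df untouched.
refined_df = (
    raw_df[["Area Code (FAO)", "Area", "Year", "Value"]]
    .rename(columns={"Area Code (FAO)": "Area Code", "Value": "Crop Removal (tonnes)"})
)

print(refined_df.columns.tolist())
```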

Exporting the refined dataset to the data folder containing refined datasets.

In [14]:
CropRem_df2.to_csv(r'DataFiles/02-RefinedDataFiles/CropRemoval-REFINED.csv', index = False)
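Passing `index = False` keeps the dataframe index out of the file, so no extra 'Unnamed: 0' column appears when the CSV is read back. A quick round trip can confirm this. The sketch below writes a hypothetical toy dataframe to an in-memory buffer instead of the project folder, purely for illustration.

```python
import io

import pandas as pd

# Hypothetical toy dataframe standing in for the refined data.
refined_df = pd.DataFrame({
    "Area Code": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Crop Removal (tonnes)": [74154.2, 75966.9064],
})

# Writing to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
refined_df.to_csv(buffer, index=False)

# Rewinding the buffer and reading the CSV back.
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)

print(reloaded_df.columns.tolist())
print(reloaded_df.shape)
```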

### Summary of things done in this notebook:

- Performed basic EDA.
- Confirmed that the 'Flag', 'Element' and 'Item' columns each contain a single value.
- Dropped redundant columns.
- Incorporated more information in selected column names.
- Exported the refined data to a CSV file.


### References

[1] Nitrogen Removal by Delaware Crops | Cooperative Extension | University of Delaware. https://www.udel.edu/academics/colleges/canr/cooperative-extension/fact-sheets/nitrogen-removal-delaware-crops/. Accessed 1 Apr. 2022.