***
# Roundwood Production (Raw Data Processing)
Capstone Project - Ali Sehpar Shikoh
***

<b>Previous Notebook:</b> Population-RAW

<b>Next Notebook:</b> TemperatureChange-RAW

This is the ninth notebook of the project. It cleans the dataset on 'Roundwood Production', a feature that could be affecting crop yield.

Roundwood production comprises all quantities of wood removed from forests, other wooded land, or other tree-felling sites during a defined period of time [1]. It can be an important factor because it tends to be directly linked with climate change and thus could reflect the effect of climate change on crop yield. This notebook processes the raw data on roundwood production per country on a yearly basis.

### Exploratory Data Analysis

Importing packages.

In [1]:
import numpy as np
import pandas as pd

Importing the CSV file into a dataframe called RndWoodProd_df1.

In [2]:
RndWoodProd_df1 = pd.read_csv('DataFiles/01-RawDataFiles/RoundwoodProduction-RAW/RoundwoodProduction-RAW.csv')

Checking the shape of the dataframe.

In [3]:
RndWoodProd_df1.shape

(11163, 14)

As seen, the dataframe consists of about 11 thousand rows (11,163) and 14 columns.

Checking the data contents of the dataframe.

In [4]:
RndWoodProd_df1.head(2)

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FO,Forestry Production and Trade,2,Afghanistan,5516,Production,1861,Roundwood,1961,1961,m3,1354291,Im,FAO data based on imputation methodology
1,FO,Forestry Production and Trade,2,Afghanistan,5516,Production,1861,Roundwood,1962,1962,m3,1371568,Im,FAO data based on imputation methodology


As seen, the dataframe consists of 14 columns containing information such as Domain, Element, Item and Flag names. We do not require all of these columns.

Checking the datatypes of each of the columns.

In [5]:
RndWoodProd_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11163 entries, 0 to 11162
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Domain Code       11163 non-null  object
 1   Domain            11163 non-null  object
 2   Area Code (FAO)   11163 non-null  int64 
 3   Area              11163 non-null  object
 4   Element Code      11163 non-null  int64 
 5   Element           11163 non-null  object
 6   Item Code         11163 non-null  int64 
 7   Item              11163 non-null  object
 8   Year Code         11163 non-null  int64 
 9   Year              11163 non-null  int64 
 10  Unit              11163 non-null  object
 11  Value             11163 non-null  int64 
 12  Flag              11163 non-null  object
 13  Flag Description  11163 non-null  object
dtypes: int64(6), object(8)
memory usage: 1.2+ MB


As seen above, the dataset contains a mix of numerical and categorical columns: 8 columns are of the object data type and 6 columns are of the int64 data type. These data types are an appropriate fit for the respective columns.
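
In the preview above, 'Year Code' and 'Year' show the same values and 'Unit' is 'm3' in both rows. The notebook does not show a check that this holds for every row, so the following is only a minimal sketch of how it could be verified before dropping those columns. It uses a small toy dataframe (`toy_df`, hypothetical values) as a stand-in for RndWoodProd_df1.

```python
import pandas as pd

# Toy stand-in for RndWoodProd_df1 (hypothetical values, not the FAO data).
toy_df = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Year Code': [1961, 1962, 1961],
    'Year': [1961, 1962, 1961],
    'Unit': ['m3', 'm3', 'm3'],
})

# True only if the two columns agree on every row.
year_code_is_redundant = bool((toy_df['Year Code'] == toy_df['Year']).all())

# A single distinct unit means the unit can be moved into a column name.
units = toy_df['Unit'].unique().tolist()

print(year_code_is_redundant, units)
```

If both checks pass on the real data, dropping 'Year Code' and 'Unit' loses no information, provided the unit is recorded elsewhere (for example in a column name).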

Checking for null values.

In [6]:
RndWoodProd_df1.isnull().sum()

Domain Code         0
Domain              0
Area Code (FAO)     0
Area                0
Element Code        0
Element             0
Item Code           0
Item                0
Year Code           0
Year                0
Unit                0
Value               0
Flag                0
Flag Description    0
dtype: int64

The dataset does not contain any null values.

Checking the dataset for duplicates.

In [7]:
RndWoodProd_df1.duplicated().sum()

0

The dataset also does not contain any duplicate values.
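
`duplicated()` without arguments only flags rows that are identical in every column. A stricter check, sketched here on a toy dataframe (`toy_df`, hypothetical values), is whether any (Area, Year) pair occurs more than once. This check is an illustrative assumption and was not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for RndWoodProd_df1 (hypothetical values).
toy_df = pd.DataFrame({
    'Area': ['A', 'A', 'B', 'B'],
    'Year': [1961, 1962, 1961, 1961],
    'Value': [1, 2, 3, 4],
})

# Rows identical across all columns.
full_dupes = int(toy_df.duplicated().sum())

# Rows that repeat an (Area, Year) pair, even if 'Value' differs.
key_dupes = int(toy_df.duplicated(subset=['Area', 'Year']).sum())

print(full_dupes, key_dupes)
```

In the toy data the last two rows share the same area and year but differ in value, so only the key-based check flags them.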

To learn more about the dataset, checking the unique values in each column.

Starting off with the 'Year' column.

In [8]:
RndWoodProd_df1['Year'].unique()

array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016, 2017, 2018, 2019, 2020], dtype=int64)

The 'Year' column comprises values ranging from 1961 to 2020.
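
For a long array, reading off the range by eye can miss a skipped year. A minimal sketch of a programmatic gap check, using a toy series (`years`, hypothetical values) in place of the 'Year' column:

```python
import pandas as pd

# Toy stand-in for RndWoodProd_df1['Year'] (hypothetical values).
years = pd.Series([1961, 1962, 1964, 1965])

# Every year we would expect between the first and last observed year.
expected = set(range(int(years.min()), int(years.max()) + 1))

# Years inside the range that never occur in the data.
missing_years = sorted(expected - set(years.tolist()))

print(missing_years)
```

An empty list would mean that every year between the minimum and the maximum occurs at least once.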

Checking the 'Flag', 'Item' and 'Element' columns.

In [9]:
RndWoodProd_df1['Flag'].unique()

array(['Im', 'A'], dtype=object)

As seen, the Flag column contains two unique flag types, i.e. Im and A. The number of rows is already less than the product of the number of countries and the number of years, so dropping the rows for either flag would leave even more country-year gaps. It is therefore recommended to keep the rows related to both of these flags.
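
The comparison between the row count and the number of country-year combinations can be sketched as follows. The toy dataframe (`toy_df`) and its values are hypothetical, and the calculation assumes no (Area, Year) pair is repeated.

```python
import pandas as pd

# Toy stand-in for RndWoodProd_df1 (hypothetical values).
toy_df = pd.DataFrame({
    'Area': ['A', 'A', 'A', 'B', 'B'],
    'Year': [1961, 1962, 1963, 1961, 1963],
    'Flag': ['Im', 'Im', 'A', 'A', 'A'],
})

n_areas = toy_df['Area'].nunique()
n_years = toy_df['Year'].nunique()

# Row count we would see if every area reported every year.
full_grid = n_areas * n_years

# Country-year combinations that are absent from the data.
missing_pairs = full_grid - len(toy_df)

# How many rows each flag contributes.
flag_counts = toy_df['Flag'].value_counts()

print(full_grid, len(toy_df), missing_pairs)
print(flag_counts)
```

A positive `missing_pairs` indicates an incomplete panel, and `flag_counts` shows how many rows would be lost if either flag were filtered out.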

In [10]:
RndWoodProd_df1['Item'].unique()

array(['Roundwood'], dtype=object)

In [11]:
RndWoodProd_df1['Element'].unique()

array(['Production'], dtype=object)

As seen, the 'Item' and 'Element' columns contain only one value each and can thus be dropped, as they do not provide any useful information.
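
Instead of inspecting columns one at a time, single-valued columns can be found in one pass with `nunique()`. This is a minimal sketch on a toy dataframe (`toy_df`, hypothetical values), not the method used in the notebook.

```python
import pandas as pd

# Toy stand-in for RndWoodProd_df1 (hypothetical values).
toy_df = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Item': ['Roundwood'] * 3,
    'Element': ['Production'] * 3,
    'Value': [10, 12, 7],
})

# Number of distinct values per column.
unique_counts = toy_df.nunique()

# Columns holding a single value carry no information for modelling.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(constant_cols)
```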

### Refined Dataset Creation and Exportation

Creating a new dataframe called 'RndWoodProd_df2' consisting only of the columns of interest. In addition, renaming the retained 'Area Code (FAO)' and 'Value' columns to 'Area Code' and 'Roundwood Production (m3)', respectively.

In [13]:
# Select the columns of interest and rename in one chained call.
# rename() without inplace=True returns a new dataframe, which avoids the
# SettingWithCopyWarning raised when renaming a slice in place.
RndWoodProd_df2 = RndWoodProd_df1[['Area Code (FAO)', 'Area', 'Year', 'Value']].rename(
    columns = {'Area Code (FAO)':'Area Code', 'Value':'Roundwood Production (m3)'})
RndWoodProd_df2


Unnamed: 0,Area Code,Area,Year,Roundwood Production (m3)
0,2,Afghanistan,1961,1354291
1,2,Afghanistan,1962,1371568
2,2,Afghanistan,1963,1414937
3,2,Afghanistan,1964,1533399
4,2,Afghanistan,1965,1596956
...,...,...,...,...
11158,181,Zimbabwe,2016,9806850
11159,181,Zimbabwe,2017,9801056
11160,181,Zimbabwe,2018,9920009
11161,181,Zimbabwe,2019,9983350


Exporting the refined dataset, consisting of about 11 thousand rows and 4 columns, to a folder containing refined/filtered data and working files.

In [14]:
RndWoodProd_df2.to_csv(r'DataFiles/02-RefinedDataFiles/RoundwoodProduction-REFINED.csv', index = False)
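
Passing `index = False` keeps the dataframe index out of the file, so no 'Unnamed: 0' column appears when the file is read back. A minimal sketch of a round-trip check follows. It writes a toy dataframe (`toy_refined`, hypothetical values) to a temporary directory rather than to the project folder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for RndWoodProd_df2 (hypothetical values).
toy_refined = pd.DataFrame({
    'Area Code': [2, 2],
    'Area': ['Afghanistan', 'Afghanistan'],
    'Year': [1961, 1962],
    'Roundwood Production (m3)': [100, 110],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy-REFINED.csv')
    toy_refined.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same columns in the same order, and no stray index column.
same_columns = list(reloaded.columns) == list(toy_refined.columns)
same_shape = reloaded.shape == toy_refined.shape

print(same_columns, same_shape)
```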

### Summary of things done in this notebook:

- Performed basic EDA.
- Retained the rows for both flag types ('Im' and 'A'); no row-level filtering was applied.
- Dropped redundant columns.
- Renamed selected columns to make them more descriptive, e.g. by adding the unit to the production column name.
- Exported the refined data to a CSV file.


### References

[1] Glossary:Roundwood Production. https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Roundwood_production. Accessed 1 Apr. 2022.