<div style="text-align: center;">
  <h2>Milestone 2: Formatting Flat File Source</h2>
</div>

## Data Source and Handling

The dataset used in this project was sourced from FAOSTAT: [https://www.fao.org/faostat/en/#data/QCL](https://www.fao.org/faostat/en/#data/QCL).

Because the CSV file is large (approximately 520 MB), it is excluded from Git version control to keep the repository small.


## Steps

### 1 - Import necessary Libraries

In [22]:
import pandas as pd

### 2 - Reading CSV file



In [23]:
# Read the CSV file
# Set the data type of 'Note' explicitly to avoid mixed data type warnings
df = pd.read_csv("Production_Crops_Livestock.csv", dtype={'Note': str})
df.head(5)

|   | Area Code | Area Code (M49) | Area | Item Code | Item Code (CPC) | Item | Element Code | Element | Year Code | Year | Unit | Value | Flag | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | '004 | Afghanistan | 221 | '01371 | Almonds, in shell | 5312 | Area harvested | 1961 | 1961 | ha | 0.0 | A |  |
| 1 | 2 | '004 | Afghanistan | 221 | '01371 | Almonds, in shell | 5312 | Area harvested | 1962 | 1962 | ha | 0.0 | A |  |
| 2 | 2 | '004 | Afghanistan | 221 | '01371 | Almonds, in shell | 5312 | Area harvested | 1963 | 1963 | ha | 0.0 | A |  |
| 3 | 2 | '004 | Afghanistan | 221 | '01371 | Almonds, in shell | 5312 | Area harvested | 1964 | 1964 | ha | 0.0 | A |  |
| 4 | 2 | '004 | Afghanistan | 221 | '01371 | Almonds, in shell | 5312 | Area harvested | 1965 | 1965 | ha | 0.0 | A |  |


### 3 - Remove Unwanted Columns

Remove all the code columns (Area Code, Item Code, and so on) from the dataframe.

These columns are redundant because each code already has a matching descriptive column (Area, Item, Element, Year).

In [24]:
# Remove the code columns
columns_to_remove = ['Area Code', 'Area Code (M49)', 'Item Code', 'Item Code (CPC)', 'Element Code', 'Year Code']  # List of unwanted columns
df = df.drop(columns=columns_to_remove)  # Reassign df without the code columns

df.head(5)

|   | Area | Item | Element | Year | Unit | Value | Flag | Note |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Almonds, in shell | Area harvested | 1961 | ha | 0.0 | A |  |
| 1 | Afghanistan | Almonds, in shell | Area harvested | 1962 | ha | 0.0 | A |  |
| 2 | Afghanistan | Almonds, in shell | Area harvested | 1963 | ha | 0.0 | A |  |
| 3 | Afghanistan | Almonds, in shell | Area harvested | 1964 | ha | 0.0 | A |  |
| 4 | Afghanistan | Almonds, in shell | Area harvested | 1965 | ha | 0.0 | A |  |
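
As an alternative to dropping columns after loading, the code columns could be skipped while the file is parsed. The following is a minimal sketch, assuming the same header layout as the FAOSTAT file; the in-memory `sample_csv` and the names `keep_columns` and `small_df` are hypothetical stand-ins used only for illustration.

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for the FAOSTAT file (same header layout)
sample_csv = io.StringIO(
    "Area Code,Area Code (M49),Area,Item Code,Item Code (CPC),Item,"
    "Element Code,Element,Year Code,Year,Unit,Value,Flag,Note\n"
    "2,'004,Afghanistan,221,'01371,\"Almonds, in shell\",5312,"
    "Area harvested,1961,1961,ha,0.0,A,\n"
)

# Columns to keep, listed in the order they appear in the file
keep_columns = ['Area', 'Item', 'Element', 'Year', 'Unit', 'Value', 'Flag', 'Note']

# usecols tells the parser to load only the listed columns
small_df = pd.read_csv(sample_csv, usecols=keep_columns, dtype={'Note': str})
print(list(small_df.columns))
```

With a file of several hundred megabytes, this may reduce memory use during loading, since the code columns are never stored in the dataframe.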


### 4 - Check the Distinct Values for Unit, Flag and Note

Check the distinct values of these fields to decide whether each one needs further handling.

In [25]:
#Finding Distinct values
print(df['Unit'].unique())
print(df['Flag'].unique())
print(df['Note'].unique())

['ha' 'kg/ha' 't' 'An' '1000 An' 'No/An' 'g/An' '1000 No' 'kg/An' 'No']
['A' 'E' 'X' 'I' 'M']
[nan 'Unofficial figure']
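
The distinct values show what is present but not how often. A minimal sketch of counting each value, including missing ones, with `value_counts(dropna=False)`; the small `demo` dataframe below is made-up illustration data, not taken from the FAOSTAT file.

```python
import numpy as np
import pandas as pd

# Made-up illustration data with the same kind of values as Flag and Note
demo = pd.DataFrame({
    'Flag': ['A', 'A', 'E', 'X'],
    'Note': [np.nan, np.nan, 'Unofficial figure', np.nan],
})

# dropna=False keeps NaN as its own row in the counts
flag_counts = demo['Flag'].value_counts(dropna=False)
note_counts = demo['Note'].value_counts(dropna=False)

print(flag_counts)
print(note_counts)
```

Counts like these make it easier to judge whether a column such as Note carries enough information to keep.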


### 5 - Replace the Distinct values with the description

All of these values are codes that stand for a description. Replace the codes with the descriptions taken from the FAOSTAT site.

In [26]:
# Replace values

# Replacement values for Unit
unit_map = {'ha': 'Hectares', 'kg/ha': 'Kilograms per hectare', 't': 'Tonnes', 'An': 'Animals',
            '1000 An': 'Thousand animals',  # assumed description, by analogy with 'An'
            'No/An': 'Number per animal', 'g/An': 'Grams per animal', '1000 No': 'Thousand Number',
            'kg/An': 'Kilograms per animal', 'No': 'Number'}
# Replace Unit values
df['Unit'] = df['Unit'].replace(unit_map)

# Replacement values for Flag
flag_map = {'A': 'Official figure', 'E': 'Estimated value', 'X': 'Figure from international organizations',
            'I': 'Imputed value', 'M': 'Missing value'}
# Replace Flag values
df['Flag'] = df['Flag'].replace(flag_map)

df.head(5)

|   | Area | Item | Element | Year | Unit | Value | Flag | Note |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Almonds, in shell | Area harvested | 1961 | Hectares | 0.0 | Official figure |  |
| 1 | Afghanistan | Almonds, in shell | Area harvested | 1962 | Hectares | 0.0 | Official figure |  |
| 2 | Afghanistan | Almonds, in shell | Area harvested | 1963 | Hectares | 0.0 | Official figure |  |
| 3 | Afghanistan | Almonds, in shell | Area harvested | 1964 | Hectares | 0.0 | Official figure |  |
| 4 | Afghanistan | Almonds, in shell | Area harvested | 1965 | Hectares | 0.0 | Official figure |  |
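
Because `replace` leaves any value without a dictionary entry unchanged, a quick check for unmapped codes can be worthwhile before replacing. This is a minimal sketch on made-up data; the `units` series and the deliberately incomplete `unit_map` below are illustrative only.

```python
import pandas as pd

# Made-up unit codes; '1000 An' has no entry in the map below
units = pd.Series(['ha', 't', '1000 An', 'ha'])
unit_map = {'ha': 'Hectares', 't': 'Tonnes'}  # deliberately incomplete for the demo

# Codes present in the data but absent from the mapping keys
unmapped = sorted(set(units.dropna().unique()) - set(unit_map))
print(unmapped)
```

An empty list means every code in the column has a description; anything else would pass through `replace` untouched.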


### 6 - Remove Note Column

From the above analysis, the Note column does not add much insight. It is mostly NaN, and its only other value, "Unofficial figure", is already covered by the Flag column.

In [27]:
# Remove the Note column
df = df.drop(columns='Note')

print(f'Total records: {len(df)}')

Total records: 4126411


### 7 - Remove data not related to crops/vegetables

I am interested only in crop and vegetable data. To remove the other records, I filter on the Unit column and keep the values 'Hectares', 'Kilograms per hectare' and 'Tonnes'.


In [28]:
# Keep only crop/vegetable data
df = df[df['Unit'].isin(['Hectares', 'Kilograms per hectare', 'Tonnes'])]

print(f'Total records of crops data: {len(df)}')

Total records of crops data: 3349665


### 8 - Selecting Only the Production Element

I am considering only Production data from this dataset. Other information will be taken from other datasets in the future.

In [29]:
# Select only the Production element
df = df[df['Element'] == 'Production']

print(f'Total records of production data: {len(df)}')

Total records of production data: 1633092


### 9 - Remove Records with a Value of 0

A zero value indicates either unavailable or bad data.

In [30]:
# Keep only rows with a positive Value (this also drops negative and missing values)
df = df[df['Value'] > 0]

print(f'Total records after removing 0 Value {len(df)}')

Total records after removing 0 Value 1509323


### 10 - Identifying/Removing Outliers

I am using Z-scores to identify outliers in the Value column.

In [31]:
from scipy import stats
import numpy as np

# Select the numeric columns (Year and Value in this dataframe)
numeric_cols = df.select_dtypes(include=[np.number])

# Calculate absolute Z-scores
z_scores = np.abs(stats.zscore(numeric_cols, nan_policy='omit'))

# Keep rows where all Z-scores are below 3
df_no_outliers = df[(z_scores < 3).all(axis=1)]

# Print the record count before and after removing outliers
print(f'Count before removing Outliers: {len(df)}')
print(f'Count after removing Outliers: {len(df_no_outliers)}')

# Display df; the filtered rows are kept separately in df_no_outliers
df.head(-5)

Count before removing Outliers: 1509323
Count after removing Outliers: 1501241


|   | Area | Item | Element | Year | Unit | Value | Flag |
|---|---|---|---|---|---|---|---|
| 126 | Afghanistan | Almonds, in shell | Production | 1976 | Tonnes | 9800.00 | Estimated value |
| 127 | Afghanistan | Almonds, in shell | Production | 1977 | Tonnes | 9000.00 | Estimated value |
| 128 | Afghanistan | Almonds, in shell | Production | 1978 | Tonnes | 12000.00 | Estimated value |
| 129 | Afghanistan | Almonds, in shell | Production | 1979 | Tonnes | 10500.00 | Estimated value |
| 130 | Afghanistan | Almonds, in shell | Production | 1980 | Tonnes | 9900.00 | Estimated value |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4126401 | Net Food Importing Developing Countries | Vegetables Primary | Production | 2014 | Tonnes | 81576818.72 | Estimated value |
| 4126402 | Net Food Importing Developing Countries | Vegetables Primary | Production | 2015 | Tonnes | 85136254.08 | Estimated value |
| 4126403 | Net Food Importing Developing Countries | Vegetables Primary | Production | 2016 | Tonnes | 84933484.08 | Estimated value |
| 4126404 | Net Food Importing Developing Countries | Vegetables Primary | Production | 2017 | Tonnes | 85513605.40 | Estimated value |
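
The cell above computes Z-scores over every numeric column, which in this dataframe includes Year as well as Value. The following is a minimal sketch, assuming the intent is to screen the Value column only. It writes the formula out by hand, z = (x - mean) / std, using the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`. The `demo` data is made up for illustration.

```python
import pandas as pd

# Made-up data: 19 ordinary values and one extreme value
demo = pd.DataFrame({
    'Year': list(range(2000, 2020)),
    'Value': [100.0] * 19 + [10_000.0],
})

# Z-score of Value only: distance from the mean in standard deviations
values = demo['Value']
z = (values - values.mean()) / values.std(ddof=0)

# Keep rows whose absolute Z-score is below 3
filtered = demo[z.abs() < 3]
print(len(demo), len(filtered))
```

Note that a single global Z-score compares values across very different crops and areas, including regional aggregates; whether that threshold is appropriate for this dataset is a judgment call rather than something this sketch settles.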


### 11 - Identifying Duplicates

Duplicates are expected in most of the columns. I added this step only to cross-check the values in Element, Unit and Flag.

In [32]:
#Finding Duplicates

# Loop through each column and check for duplicates
for col in df.columns:
    duplicate_values = df[col][df[col].duplicated()]  # Values that repeat an earlier value in the column
    if not duplicate_values.empty:
        print(f"\nColumn '{col}' has {len(duplicate_values.unique())} duplicate values.")  # Number of distinct values that occur more than once
    else:
        print(f"\nColumn '{col}' has no duplicate values.")


Column 'Area' has 245 duplicate values.

Column 'Item' has 280 duplicate values.

Column 'Element' has 1 duplicate values.

Column 'Year' has 63 duplicate values.

Column 'Unit' has 1 duplicate values.

Column 'Value' has 141172 duplicate values.

Column 'Flag' has 4 duplicate values.
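
Per-column repeats are expected, so a row-level check may be more informative. This is a minimal sketch, assuming that Area, Item, Element and Year together should identify one record; that key choice is an assumption, not something stated by the data source, and the `demo` rows are made up.

```python
import pandas as pd

# Made-up rows; the last two share the same Area/Item/Element/Year key
demo = pd.DataFrame({
    'Area': ['Afghanistan', 'Afghanistan', 'Afghanistan'],
    'Item': ['Almonds, in shell'] * 3,
    'Element': ['Production'] * 3,
    'Year': [1976, 1977, 1977],
    'Value': [9800.0, 9000.0, 9000.0],
})

# Assumed record key (hypothetical choice for illustration)
key_columns = ['Area', 'Item', 'Element', 'Year']

# keep=False marks every row that shares its key with another row
dupes = demo[demo.duplicated(subset=key_columns, keep=False)]
print(len(dupes))
```

If this returns no rows on the real dataframe, each combination of the key columns appears only once.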
