# Dataset Cleaner

Import the necessary libraries

In [1]:
import pandas as pd

Load the dataset

In [2]:
cpd = pd.read_csv("assets/Crop Production data.csv")

Display basic information about the dataset

In [3]:
cpd.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print(cpd.head())
print(cpd.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246091 entries, 0 to 246090
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   State_Name     246091 non-null  object 
 1   District_Name  246091 non-null  object 
 2   Crop_Year      246091 non-null  int64  
 3   Season         246091 non-null  object 
 4   Crop           246091 non-null  object 
 5   Area           246091 non-null  float64
 6   Production     242361 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.1+ MB
                    State_Name District_Name  Crop_Year       Season  \
0  Andaman and Nicobar Islands      NICOBARS       2000  Kharif        
1  Andaman and Nicobar Islands      NICOBARS       2000  Kharif        
2  Andaman and Nicobar Islands      NICOBARS       2000  Kharif        
3  Andaman and Nicobar Islands      NICOBARS       2000  Whole Year    
4  Andaman and Nicobar Islands      NICOBARS       2000  

Check for missing values

In [4]:
print(cpd.isnull().sum())

State_Name          0
District_Name       0
Crop_Year           0
Season              0
Crop                0
Area                0
Production       3730
dtype: int64


Check for duplicates

In [5]:
print(cpd.duplicated().sum())

0


Handle missing values by dropping the rows that contain them

In [6]:
# Take an explicit copy so later column assignments operate on an independent frame
cpdc = cpd.dropna().copy()
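
The missing-value check above found gaps only in `Production`, so the drop can be restricted to that column and the row count compared before and after. A minimal sketch on a small made-up frame (the `toy` data and the names `toy_clean` and `rows_dropped` are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the crop production table.
toy = pd.DataFrame(
    {
        "Crop": ["Rice", "Wheat", "Maize", "Rice"],
        "Area": [100.0, 250.0, 80.0, 40.0],
        "Production": [200.0, np.nan, 160.0, np.nan],
    }
)

# Drop only the rows where Production is missing, and take an explicit copy
# so that later column assignments work on an independent frame.
toy_clean = toy.dropna(subset=["Production"]).copy()

# Report how many rows the step removed.
rows_dropped = len(toy) - len(toy_clean)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Restricting `subset` makes no difference while `Production` is the only column with gaps, but it keeps the step from silently discarding rows if another column gains missing values later.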

Ensure correct data types

In [7]:
# Assigning through .loc[:, col] writes into the existing column and may keep its old dtype,
# so cast with astype and rebind the frame instead.
cpdc = cpdc.astype({"Crop_Year": int, "Area": float, "Production": float})

Remove duplicates

In [8]:
cpdc = cpdc.drop_duplicates()

Remove rows whose 'Area' or 'Production' lies more than 1.5*IQR below Q1 or above Q3

In [9]:
Q1_area = cpdc["Area"].quantile(0.25)
Q3_area = cpdc["Area"].quantile(0.75)
IQR_area = Q3_area - Q1_area

Q1_prod = cpdc["Production"].quantile(0.25)
Q3_prod = cpdc["Production"].quantile(0.75)
IQR_prod = Q3_prod - Q1_prod

# Keep only the rows where both columns fall inside their IQR fences
cpdc = cpdc[
    (cpdc["Area"] >= (Q1_area - 1.5 * IQR_area))
    & (cpdc["Area"] <= (Q3_area + 1.5 * IQR_area))
    & (cpdc["Production"] >= (Q1_prod - 1.5 * IQR_prod))
    & (cpdc["Production"] <= (Q3_prod + 1.5 * IQR_prod))
].copy()
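
The filter above repeats the same quartile arithmetic for each column, so it can be factored into a small helper. This is only a sketch: `iqr_mask` and the toy values are hypothetical, and the helper assumes the same inclusive fences as the cell above.

```python
import pandas as pd


def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR] (inclusive)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Hypothetical toy data with one obvious Area outlier in the last row.
toy = pd.DataFrame(
    {
        "Area": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0],
        "Production": [20.0, 22.0, 21.0, 25.0, 23.0, 24.0],
    }
)

# Build both masks from the unfiltered frame, then filter once, so the
# fences of one column are not affected by rows removed for the other.
mask = iqr_mask(toy["Area"]) & iqr_mask(toy["Production"])
toy_filtered = toy[mask].copy()

print(f"Kept {len(toy_filtered)} of {len(toy)} rows")
```

Computing every mask before filtering matches the original cell, where all four quartiles are taken from the same frame.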

Check the cleaned data

In [10]:
cpdc.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print(cpdc.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 190741 entries, 0 to 246090
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   State_Name     190741 non-null  object 
 1   District_Name  190741 non-null  object 
 2   Crop_Year      190741 non-null  int64  
 3   Season         190741 non-null  object 
 4   Crop           190741 non-null  object 
 5   Area           190741 non-null  float64
 6   Production     190741 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 11.6+ MB
           Crop_Year           Area     Production
count  190741.000000  190741.000000  190741.000000
mean     2005.697328    1208.483631    1565.447702
std         4.931001    2080.811909    2896.547890
min      1997.000000       0.100000       0.000000
25%      2002.000000      50.000000      46.000000
50%      2006.000000     277.000000     300.000000
75%      2010.000000    1289.000000    1580.000000
max      2015.

Standardize categorical values: strip surrounding whitespace and apply title case

In [11]:
cpdc.loc[:, "State_Name"] = cpdc["State_Name"].str.strip().str.title()
cpdc.loc[:, "District_Name"] = cpdc["District_Name"].str.strip().str.title()
cpdc.loc[:, "Crop"] = cpdc["Crop"].str.strip().str.title()
cpdc.loc[:, "Season"] = cpdc["Season"].str.strip().str.title()
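
Normalizing text can make rows identical that previously differed only in case or padding, so it may be worth re-checking for duplicates after this step. A minimal sketch on made-up data (the `toy` frame and the names `text_cols` and `n_dupes` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: the first two rows differ only in case and padding.
toy = pd.DataFrame(
    {
        "State_Name": ["Kerala", "kerala ", "KERALA"],
        "Season": ["Kharif     ", "Kharif", "Rabi "],
        "Area": [10.0, 10.0, 5.0],
    }
)

# Apply the same strip + title-case normalization to every text column.
text_cols = ["State_Name", "Season"]
for col in text_cols:
    toy[col] = toy[col].str.strip().str.title()

# Rows that were distinct before normalization may now be exact duplicates.
n_dupes = int(toy.duplicated().sum())
toy = toy.drop_duplicates()

print(f"Found {n_dupes} duplicate row(s) after normalization")
```

Note that `str.title()` capitalizes every word, so a value such as "Andaman and Nicobar Islands" becomes "Andaman And Nicobar Islands".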

Check for missing values again

In [12]:
cpdc.isnull().sum()

State_Name       0
District_Name    0
Crop_Year        0
Season           0
Crop             0
Area             0
Production       0
dtype: int64

Save the cleaned dataset

In [13]:
cpdc.to_csv("assets/cleaned_dataset.csv", index=False, encoding="utf-8")
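
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below uses a temporary directory and a made-up frame rather than the real `assets/` path; the names `toy`, `out_path`, `reloaded` and `round_trip_ok` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset.
toy = pd.DataFrame(
    {
        "Crop": ["Rice", "Wheat"],
        "Crop_Year": [2000, 2001],
        "Production": [200.0, 150.5],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_dataset.csv"
    # Same options as the save cell above: no index column, UTF-8 encoding.
    toy.to_csv(out_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(out_path)

# equals() compares shape, column dtypes and values.
round_trip_ok = reloaded.equals(toy)
print(f"Round trip preserved the data: {round_trip_ok}")
```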