# Task 1: Data Cleaning & Preprocessing

## Objectives:
    - Load the dataset using Pandas.
    - Identify and handle missing values (e.g., imputation or removal).
    - Remove duplicate rows and standardize inconsistent data formats.

#### 1. Importing Libraries

In [1]:
import pandas as pd

#### 2. Reading the CSV file

In [3]:
a = pd.read_csv(r"C:\Users\Rudra Pratap Swain\OneDrive\Desktop\CodVeda\DATASETS\1) iris.csv")
print(a)

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]


#### 3. Checking null values

In [5]:
a.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
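The check above finds no missing values in this file, so nothing needs imputing here. The objectives also call for handling missing values and removing duplicate rows, so the following is a minimal sketch of how both could be done. It runs on a small hypothetical frame (`df` and its values are made up for illustration, not taken from the Iris file), and median imputation is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Iris data): one missing value, one duplicate row.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 4.9],
    "species": ["setosa", "setosa", "setosa", "setosa"],
})

# Count fully duplicated rows, then drop them (the first occurrence is kept).
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Option 1: impute the missing numeric value with the column median.
imputed = df.fillna({"sepal_length": df["sepal_length"].median()})

# Option 2: drop any row that still contains a missing value.
dropped = df.dropna()

print(n_duplicates, len(imputed), len(dropped))
```

Imputation keeps every row at the cost of inventing a value, while removal keeps only observed values at the cost of losing rows; which is preferable depends on how much data is missing.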

#### 4. Verifying Data Types

In [7]:
a.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

#### 5. Checking Unique Values

In [10]:
for i in a.columns :
    print(i,' : \n ',a[i].unique())

sepal_length  : 
  [5.1 4.9 4.7 4.6 5.  5.4 4.4 4.8 4.3 5.8 5.7 5.2 5.5 4.5 5.3 7.  6.4 6.9
 6.5 6.3 6.6 5.9 6.  6.1 5.6 6.7 6.2 6.8 7.1 7.6 7.3 7.2 7.7 7.4 7.9]
sepal_width  : 
  [3.5 3.  3.2 3.1 3.6 3.9 3.4 2.9 3.7 4.  4.4 3.8 3.3 4.1 4.2 2.3 2.8 2.4
 2.7 2.  2.2 2.5 2.6]
petal_length  : 
  [1.4 1.3 1.5 1.7 1.6 1.1 1.2 1.  1.9 4.7 4.5 4.9 4.  4.6 3.3 3.9 3.5 4.2
 3.6 4.4 4.1 4.8 4.3 5.  3.8 3.7 5.1 3.  6.  5.9 5.6 5.8 6.6 6.3 6.1 5.3
 5.5 6.7 6.9 5.7 6.4 5.4 5.2]
petal_width  : 
  [0.2 0.4 0.3 0.1 0.5 0.6 1.4 1.5 1.3 1.6 1.  1.1 1.8 1.2 1.7 2.5 1.9 2.1
 2.2 2.  2.4 2.3]
species  : 
  ['setosa' 'versicolor' 'virginica']
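
The `species` column shows exactly three consistently spelled labels, so no format cleanup is needed for this file. If the labels had differed only in case or surrounding whitespace, they could be normalized along these lines. This is a sketch on a hypothetical series (`raw` is made up for illustration):

```python
import pandas as pd

# Hypothetical messy labels; the Iris file loaded above does not contain these.
raw = pd.Series(["Setosa", " setosa", "SETOSA ", "versicolor", "Virginica"])

# Trim surrounding whitespace and lower-case so variants collapse to one label.
clean = raw.str.strip().str.lower()

print(sorted(clean.unique()))
```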


#### 6. Describing the Dataset

In [16]:
a.describe()

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


#### 7. Dataset Info

In [19]:
a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


#### 8. Encoding the Categorical Values (Label Encoder)
    - LabelEncoder encodes categorical values as integers (here the three species names become 0, 1, and 2).
    - Most ML models operate on numerical values rather than text, so categorical columns are encoded before modelling.

In [22]:
from sklearn.preprocessing import LabelEncoder

In [23]:
le = LabelEncoder()
a['species'] = le.fit_transform(a['species'])

In [24]:
a['species'].unique()

array([0, 1, 2])
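
After encoding, the column only shows integers, so it helps to know which integer stands for which species. A minimal sketch of recovering that mapping follows, using a short hypothetical list in place of the full column. `LabelEncoder` stores the sorted original labels in `classes_`, and `inverse_transform` converts codes back to text.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short list standing in for the species column.
species = ["setosa", "versicolor", "virginica", "setosa"]

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ holds the original labels in sorted order; position = integer code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original text labels from the integer codes.
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Keeping the fitted encoder around makes it possible to translate model predictions back into species names later.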

#### 9. Standardize the Numerical Values
    - StandardScaler standardizes each numerical feature to have mean 0 and standard deviation 1.
    - This helps prevent features with larger ranges from dominating features with smaller ranges.
    - It puts all features on a comparable scale, so that each can contribute similarly to scale-sensitive models.

In [29]:
from sklearn.preprocessing import StandardScaler

In [31]:
sc = StandardScaler()

a_num = a[['sepal_length','sepal_width','petal_length','petal_width']]

a_scaled = sc.fit_transform(a_num)

In [33]:
a_f = pd.DataFrame(a_scaled, columns = a_num.columns, index=a_num.index)
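
The transformation applied to each column is z = (x - mean) / std. `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `describe()` and the pandas `.std()` default report the sample standard deviation (`ddof=1`), so dividing by the `std` row of the summary table would give slightly different numbers. A minimal sketch of the same formula in plain pandas, on hypothetical toy values rather than the full dataset:

```python
import pandas as pd

# Hypothetical toy values, not the full Iris data.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.7, 5.9],
    "petal_width": [0.2, 0.2, 2.3, 1.8],
})

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (df - df.mean()) / df.std(ddof=0)

# Each standardized column should have mean ~0 and population std ~1.
print(z.mean().abs().round(6).tolist())
print(z.std(ddof=0).round(6).tolist())
```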

#### 10. Final Dataset

In [37]:
a_final = pd.concat([a['species'],a_f],axis=1)

#### 11. Final Verification

In [40]:
print(a_final)

     species  sepal_length  sepal_width  petal_length  petal_width
0          0     -0.900681     1.032057     -1.341272    -1.312977
1          0     -1.143017    -0.124958     -1.341272    -1.312977
2          0     -1.385353     0.337848     -1.398138    -1.312977
3          0     -1.506521     0.106445     -1.284407    -1.312977
4          0     -1.021849     1.263460     -1.341272    -1.312977
..       ...           ...          ...           ...          ...
145        2      1.038005    -0.124958      0.819624     1.447956
146        2      0.553333    -1.281972      0.705893     0.922064
147        2      0.795669    -0.124958      0.819624     1.053537
148        2      0.432165     0.800654      0.933356     1.447956
149        2      0.068662    -0.124958      0.762759     0.790591

[150 rows x 5 columns]


In [43]:
a_final.to_csv('Cleaned Iris Dataset.csv', index=False)
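
As an optional last check, the saved file can be read back and compared with the frame that was written. The sketch below assumes a small hypothetical stand-in for `a_final` (named `frame`) and writes to a temporary directory so that it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a_final with a similar column layout.
frame = pd.DataFrame({
    "species": [0, 1, 2],
    "sepal_length": [-0.900681, 0.068662, 1.038005],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Cleaned Iris Dataset.csv")
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if shape, column names, dtypes, or values differ.
pd.testing.assert_frame_equal(frame, reloaded)
print(reloaded.shape)
```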