# Data Cleaning (Iris Species Dataset)

### In this notebook, we will:
1. Load the dataset from the `Data/Raw` folder.
2. Check for duplicate rows.
3. Handle missing values (if any).
4. Convert the target column (`Species`) to numeric labels using label encoding.
5. Save the cleaned dataset for further analysis.
6. Summarize observations for the next step (EDA).

## Step 1: Import Libraries
We'll import the necessary libraries for data handling.

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

## Step 2: Load Dataset
Load the Iris CSV dataset and preview it.

In [7]:
# Raw string (r'...') keeps the Windows backslashes from being read as escape sequences
df = pd.read_csv(r'D:\Thiru\ML_Projects\Iris-Species-Prediction\Data\Raw\iris.csv')
df.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
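
A hard-coded absolute path only works on one machine. As a minimal sketch, the load step could build the path with `pathlib` and fail early when an expected column is missing. The helper `load_iris_csv`, the `EXPECTED_COLUMNS` list, and the temporary folder layout are illustrative assumptions, not part of the original notebook; the demo file contains only the first two rows from the preview above so that the sketch runs without the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Column names taken from the preview above
EXPECTED_COLUMNS = [
    "Id", "SepalLengthCm", "SepalWidthCm",
    "PetalLengthCm", "PetalWidthCm", "Species",
]


def load_iris_csv(csv_path):
    """Hypothetical helper: load the CSV and fail early if a column is missing."""
    df = pd.read_csv(Path(csv_path))
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Demo on a tiny stand-in file so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "Data" / "Raw" / "iris.csv"  # assumed project layout
    raw_path.parent.mkdir(parents=True)
    raw_path.write_text(
        "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
        "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
        "2,4.9,3.0,1.4,0.2,Iris-setosa\n"
    )
    sample = load_iris_csv(raw_path)

print(sample.shape)
```

In a real project, `raw_path` would be built relative to the project root instead of a temporary folder.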


## Step 3: Check for Duplicate Rows
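Check whether the dataset contains duplicate rows. `df.duplicated()` compares entire rows, so if `Id` is a unique running index (as the preview suggests), two rows with identical measurements will not be flagged as duplicates.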

In [11]:
# Number of duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Drop duplicates if any
df = df.drop_duplicates()

Number of duplicate rows: 0
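
The outline also lists a missing-value step. A check of that kind might look like the sketch below. The frame `demo` is a made-up stand-in with one deliberately missing value, not real Iris data; the same `isnull().sum()` call would apply to the notebook's `df`. Dropping incomplete rows is only one possible policy and is shown here as an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in frame with one deliberately missing value (not real Iris data)
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, np.nan],
    "SepalWidthCm": [3.5, 3.0, 3.2],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Count missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# One possible policy: drop every row that contains a missing value
demo_clean = demo.dropna().reset_index(drop=True)
print(f"Rows before: {len(demo)}, after: {len(demo_clean)}")
```

If every count is zero, `dropna()` leaves the frame unchanged, so the step is safe to keep in the pipeline either way.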


## Step 5: Encode Target Column
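Convert the target column `Species` to numeric values using label encoding. `LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `Iris-setosa` becomes `0` in the output.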

In [15]:
# Fit the encoder on the species names and replace them with integer codes
le = LabelEncoder()
df['Species'] = le.fit_transform(df['Species'])

df.head()

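| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |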
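
Once the column is overwritten, the original species names are only recoverable through the fitted encoder. The sketch below shows how the label-to-code mapping could be inspected and reversed. The `species` series is a small stand-in; only `Iris-setosa` appears in the preview above, so the other two label names are assumptions based on the standard Iris dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in labels; names other than "Iris-setosa" are assumed, not shown in the preview
species = pd.Series(
    ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
)

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ holds the original labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original strings
restored = le.inverse_transform(codes)
print(list(restored))
```

Keeping `mapping` (or the fitted `le`) alongside the cleaned data makes it possible to report predictions by species name later.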


## Step 6: Save Cleaned Dataset
Save the cleaned dataset to the `Data/processed` folder for use in EDA and modeling.

In [17]:
import os

# Raw strings keep the Windows backslashes literal
out_dir = r"D:\Thiru\ML_Projects\Iris-Species-Prediction\Data\processed"
out_path = os.path.join(out_dir, "cleaned_iris.csv")

# exist_ok=True makes this a no-op if the folder already exists
os.makedirs(out_dir, exist_ok=True)

df.to_csv(out_path, index=False)
print(f"Cleaned dataset saved to {out_path}")

Cleaned dataset saved to D:\Thiru\ML_Projects\Iris-Species-Prediction\Data\processed\cleaned_iris.csv
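
A quick way to confirm that the save worked is to read the file back and compare it with the frame in memory. The following is a minimal sketch of such a round-trip check, assuming a small stand-in frame `cleaned` and a temporary output folder in place of the real project path.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in for the cleaned frame (a few columns from the first two preview rows)
cleaned = pd.DataFrame({
    "Id": [1, 2],
    "SepalLengthCm": [5.1, 4.9],
    "Species": [0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "Data" / "processed"  # assumed layout, temporary location
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "cleaned_iris.csv"

    # index=False avoids writing the row index as an extra unnamed column
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if values, column names, or dtypes differ
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip OK")
```

Writing with `index=False` matters here: without it, the reloaded file would gain an `Unnamed: 0` column and the comparison would fail.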


## Step 7: Summary & Observations

- Dataset has no missing values.
- No duplicate rows were found (the check returned 0), so `drop_duplicates()` left the data unchanged.
- The target column `Species` was label-encoded to integers for modeling.
- The cleaned dataset was saved as `cleaned_iris.csv`.
- Next steps:
  1. Perform Exploratory Data Analysis (EDA).
  2. Visualize feature relationships and species distribution.
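
As a possible starting point for the EDA notebook, the species distribution and a per-species feature summary can be computed with plain pandas before any plotting. The values in `demo` are made-up placeholders, not rows from the dataset; the column names follow the cleaned file.

```python
import pandas as pd

# Placeholder values for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.3, 4.7, 4.5, 6.0],
    "Species": [0, 0, 1, 1, 2],
})

# Number of rows per encoded species
species_counts = demo["Species"].value_counts().sort_index()
print(species_counts)

# Mean petal length per encoded species
mean_petal_length = demo.groupby("Species")["PetalLengthCm"].mean()
print(mean_petal_length)
```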