# Data Cleaning (Iris Species Dataset)

### In this notebook, we will:
1. Load the dataset from the `Data/Raw` folder.
2. Check for duplicate rows.
3. Handle missing values (if any).
4. Convert the target column (`Species`) to numeric labels using label encoding.
5. Save the cleaned dataset for further analysis.
6. Summarize observations for the next step (EDA).

### Step 1: Set Project Root for Python Imports

In [1]:
import os
import sys

# Add project root to sys.path
sys.path.append(os.path.abspath(".."))
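Later cells pass absolute Windows paths to the helper functions. One way to keep the notebook portable is to build those paths from the project root instead. The sketch below assumes the notebook runs from a folder directly under the project root; `PROJECT_ROOT` and `project_path` are hypothetical names and are not part of `src.data`.

```python
from pathlib import Path


def project_path(root: Path, *parts: str) -> Path:
    """Join path components onto the project root."""
    return root.joinpath(*parts)


# Assumption: the working directory is one level below the project root.
PROJECT_ROOT = Path("..").resolve()

# Builds <project root>/Data/Raw/iris.csv without hard-coding a drive letter.
raw_csv = project_path(PROJECT_ROOT, "Data", "Raw", "iris.csv")
print(raw_csv)
```

A path built this way can be passed to the loading helper in place of the literal string, provided the helper accepts `Path` objects or the path is wrapped in `str(...)`.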

### Step 2: Load Dataset
Load the Iris CSV dataset and preview it.

In [2]:
from src.data import load_data
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = load_data(r'D:\Thiru\ML_Projects\Iris-Species-Prediction\Data\Raw\iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


### Step 3: Check for Duplicate Rows
Check whether the dataset contains any duplicate rows.

In [3]:
from src.data import ckeck_duplicate
df = ckeck_duplicate(df)

Number of duplicate rows: 0
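The body of the duplicate-check helper is not shown in this notebook. As a rough illustration of what such a check could look like with pandas, the sketch below counts fully duplicated rows and drops them; `report_and_drop_duplicates` is a hypothetical name, and the sample frame is made up for the example.

```python
import pandas as pd


def report_and_drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Print the number of fully duplicated rows and return a copy without them."""
    n_dupes = int(df.duplicated().sum())
    print(f"Number of duplicate rows: {n_dupes}")
    return df.drop_duplicates().reset_index(drop=True)


# Tiny illustrative frame (not the real dataset): the last row repeats the first.
sample = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 5.1],
        "SepalWidthCm": [3.5, 3.0, 3.5],
    }
)
deduped = report_and_drop_duplicates(sample)
```

One thing to keep in mind: `duplicated()` compares whole rows. If `Id` is unique per row, as the preview suggests, a check run while `Id` is still present cannot flag two flowers with identical measurements, so it may be worth repeating the check after the identifier column is dropped.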


### Step 4: Drop the `Id` Column
- `Id` is dropped because it is only a row identifier and is not useful for prediction.

In [4]:
from src.data import drop_cols
df = drop_cols(df)
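For reference, dropping an identifier column in pandas is a one-liner. The sketch below is an assumption about what a helper of this kind might do, not the actual `src.data` code; `drop_columns` is a hypothetical name.

```python
import pandas as pd


def drop_columns(df: pd.DataFrame, columns=("Id",)) -> pd.DataFrame:
    """Return a copy of df without the given columns; missing names are ignored."""
    return df.drop(columns=list(columns), errors="ignore")


# Illustrative frame, not the real dataset.
sample = pd.DataFrame({"Id": [1, 2], "SepalLengthCm": [5.1, 4.9]})
print(drop_columns(sample).columns.tolist())
```

Passing `errors="ignore"` makes the cell safe to re-run after the column has already been removed.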

### Step 5: Encode Target Column
Convert the target column `Species` to numeric values using Label Encoding.

In [5]:
from src.data import encode_target
df, le = encode_target(df)
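To make the encoding step concrete, here is a minimal sketch using scikit-learn's `LabelEncoder`. It assumes the helper above wraps something similar; `encode_species` is a hypothetical name and the sample frame is invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_species(df: pd.DataFrame, column: str = "Species"):
    """Return a copy of df with `column` label-encoded, plus the fitted encoder."""
    encoder = LabelEncoder()
    encoded = df.copy()
    encoded[column] = encoder.fit_transform(encoded[column])
    return encoded, encoder


# Illustrative frame, not the real dataset.
sample = pd.DataFrame(
    {"Species": ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]}
)
encoded, encoder = encode_species(sample)

# Classes are stored in sorted order; the integer code is the position in that list.
print(list(encoder.classes_))
print(encoded["Species"].tolist())

# The fitted encoder can map integer codes back to species names.
print(list(encoder.inverse_transform([0, 2])))
```

Keeping the fitted encoder around is what allows model predictions to be translated back into species names later, which is why the encoder is returned alongside the frame.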

### Step 6: Save Cleaned Dataset
Save the cleaned dataset to the `Data/processed` folder for use in EDA and modeling.

In [6]:
from src.data import save_cleaned_data
df = save_cleaned_data(df) 

Cleaned dataset saved to D:\Thiru\ML_Projects\Iris-Species-Prediction\Data\processed\cleaned_iris.csv
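A minimal sketch of a save step with pandas follows. It writes to a temporary folder so it can be run anywhere; `save_csv` is a hypothetical name and the frame is illustrative, not the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, path: Path) -> Path:
    """Write df to path without the row index, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    print(f"Cleaned dataset saved to {path}")
    return path


sample = pd.DataFrame({"SepalLengthCm": [5.1, 4.9], "Species": [0, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(sample, Path(tmp) / "processed" / "cleaned_iris.csv")
    reloaded = pd.read_csv(out)

# Only the original columns come back; no extra index column is created.
print(reloaded.columns.tolist())
```

The `index=False` argument matters here. With the pandas default (`index=True`), the row index is written as an unnamed first column, and reading the file back produces a column called `Unnamed: 0`, which may be the origin of the one visible in the Step 2 preview.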


### Step 7: Save and Reload the Label Encoder

In [7]:
from src.data import save_le
save_le(le, filename=r"D:\Thiru\ML_Projects\Iris-Species-Prediction\models\le.pkl")

from src.data import load_le
le = load_le(filename=r"D:\Thiru\ML_Projects\Iris-Species-Prediction\models\le.pkl")

Label_Encoded saved to D:\Thiru\ML_Projects\Iris-Species-Prediction\models\le.pkl
Label_Encoded loaded from D:\Thiru\ML_Projects\Iris-Species-Prediction\models\le.pkl
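The `.pkl` extension suggests the encoder is persisted with pickle, although the helper bodies are not shown. Under that assumption, a save/load round trip could be sketched as below; `save_object` and `load_object` are hypothetical names, and a plain dictionary stands in for the fitted encoder.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, filename) -> None:
    """Serialize obj to filename with pickle, creating parent folders if needed."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_object(filename):
    """Load a pickled object; only use this on files you trust."""
    with Path(filename).open("rb") as fh:
        return pickle.load(fh)


# Stand-in for the fitted encoder: the class-to-code mapping it would hold.
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "le.pkl"
    save_object(mapping, target)
    restored = load_object(target)

print(restored == mapping)
```

Reloading immediately after saving, as the cell above does, is a cheap way to confirm that the file on disk can actually be read back before later notebooks depend on it.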


### Summary & Observations

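- The dataset has no missing values.
- No duplicate rows were found.
- The `Id` column was dropped because it is not useful for prediction.
- The target column `Species` was label-encoded to numeric values, and the fitted encoder was saved to `models/le.pkl`.
- The cleaned dataset was saved as `cleaned_iris.csv`.
- Next Steps: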
  1. Perform Exploratory Data Analysis (EDA).
  2. Visualize feature relationships and species distribution.
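
No cell in this notebook shows the check behind the missing-values observation. A minimal sketch of how missing values could be counted per column with pandas is given below; `missing_value_report` is a hypothetical name, and the sample frame deliberately contains one gap so the output is not trivially zero.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column."""
    return df.isna().sum()


# Illustrative frame with one deliberately missing value (not the real dataset).
sample = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, None, 4.7],
        "PetalLengthCm": [1.4, 1.4, 1.3],
    }
)
report = missing_value_report(sample)
print(report)
print("Any missing values:", bool(report.sum() > 0))
```

Running the same function on the loaded Iris frame should report zero for every column if the observation above holds.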