# Data Cleaning and Preprocessing – Iris Dataset

## Objective
The objective of this task is to clean, preprocess, and prepare the Iris dataset for machine learning by handling data quality issues, encoding categorical variables, and scaling numerical features.


In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, LabelEncoder


In [6]:
df = pd.read_csv(r"E:\SUMED_DATA\CODVEDA - INTERNSHIP\1) iris.csv")
df.head()


,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Dataset Overview
The Iris dataset contains measurements of iris flowers and is commonly used for classification problems. It includes four numerical features (sepal length, sepal width, petal length, and petal width) and a categorical target variable, species.


In [7]:
df.shape


(150, 5)

In [8]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [9]:
df.describe()


,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### Initial Observations
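- The dataset consists of 150 rows (samples) and 5 columns (four features plus the target).
- The four measurement columns are float64, and species is an object (string) column, so it is the only categorical variable.
- Summary statistics show that the features span different ranges: petal_length has the largest spread (std ≈ 1.76, range 1.0 to 6.9) and sepal_width the smallest (std ≈ 0.43, range 2.0 to 4.4).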


In [10]:
df.isnull().sum()


sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

### Missing Value Analysis
The dataset was examined for missing values. No missing values were detected, indicating that the dataset is complete and does not require imputation.


In [11]:
df.duplicated().sum()


np.int64(3)

In [12]:
df = df.drop_duplicates()


### Duplicate Records
Duplicate records were checked to ensure data integrity. Three duplicate rows were found and removed, leaving 147 of the original 150 rows.
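
Before dropping duplicates it can help to look at the repeated rows themselves. The sketch below uses a small made-up frame (`toy`, not the Iris data) to show the difference between `keep=False`, which marks every member of a duplicate group, and the default `keep="first"`, which marks only the later copies. The `reset_index(drop=True)` call is an optional extra that is not part of the notebook above.

```python
import pandas as pd

# Hypothetical toy frame with one repeated row (values are illustrative only).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 5.8],
    "sepal_width": [3.5, 3.0, 3.5, 2.7],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

# keep=False flags every row that has a twin, so the group can be inspected.
dup_groups = toy[toy.duplicated(keep=False)]

# The default keep="first" flags only the later copies; this is the count
# of rows that drop_duplicates() will remove.
n_removed = int(toy.duplicated().sum())

# Drop the copies and renumber the rows so the index has no gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dup_groups)
print(n_removed, len(deduped))
```

Resetting the index is a design choice: without it, the surviving rows keep their original labels, which matters if the frame is later combined with arrays that are numbered from zero.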


In [13]:
numeric_cols = df.select_dtypes(include=np.number)

Q1 = numeric_cols.quantile(0.25)
Q3 = numeric_cols.quantile(0.75)
IQR = Q3 - Q1

outliers = ((numeric_cols < (Q1 - 1.5 * IQR)) | 
            (numeric_cols > (Q3 + 1.5 * IQR))).sum()

outliers


sepal_length    0
sepal_width     4
petal_length    0
petal_width     0
dtype: int64

### Outlier Detection
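The Interquartile Range (IQR) method was applied to detect potential outliers, using fences at 1.5 × IQR below the first quartile and above the third quartile. It flagged four values in sepal_width and none in the other three features. The flagged values were not judged extreme enough to require removal, so all rows were retained.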
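
To see which values the IQR rule flags, rather than only how many, the fences can be computed per column and used as a filter. This is a minimal sketch on made-up widths; the helper name `iqr_fences` and the sample values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def iqr_fences(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical measurements (not the Iris column itself).
widths = pd.Series([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.3, 3.4, 4.4],
                   name="sepal_width")

lower, upper = iqr_fences(widths)

# Keep only the values that fall outside the fences.
flagged = widths[(widths < lower) | (widths > upper)]
print(lower, upper)
print(flagged)
```

Listing the flagged values makes it easier to decide whether they are plausible measurements to keep or errors to remove.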


In [14]:
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])


### Encoding Categorical Variables
The target variable (species) was converted to integer codes using Label Encoding, which numbers the classes in sorted label order, so that it can be used by machine learning algorithms that expect a numeric target.
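
Once the labels are integers, it is useful to keep the mapping back to the species names. The sketch below assumes the three label strings are setosa, versicolor, and virginica (only setosa is visible in the preview above) and uses a short hypothetical list instead of the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings; the real column is df['species'].
labels = ["setosa", "versicolor", "virginica", "setosa"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {name: int(code) for code, name in enumerate(le.classes_)}

# inverse_transform recovers the original strings from the codes.
decoded = le.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

Keeping the fitted encoder (or the mapping dictionary) around lets later predictions be reported as species names rather than bare integers.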


In [15]:
X = df.drop('species', axis=1)
y = df['species']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reuse X's index so the scaled rows stay aligned with y after drop_duplicates()
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X_scaled.head()


,sepal_length,sepal_width,petal_length,petal_width
0,-0.915509,1.019971,-1.357737,-1.3357
1,-1.15756,-0.128082,-1.357737,-1.3357
2,-1.39961,0.331139,-1.414778,-1.3357
3,-1.520635,0.101529,-1.300696,-1.3357
4,-1.036535,1.249582,-1.357737,-1.3357


### Feature Scaling
Numerical features were standardized using StandardScaler, which rescales each feature to zero mean and unit variance so that features with larger ranges do not dominate model training.
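
As a quick check of what standardization does, the scaled values can be reproduced by hand as (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below uses a small made-up frame, not the Iris features. It also shows why pandas reports a standard deviation slightly above 1 for scaled data: `describe()` and `std()` use the sample formula (`ddof=1`), which differs by a factor of sqrt(n / (n - 1)). For 147 rows that factor is about 1.003419.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                  "b": [10.0, 20.0, 30.0, 50.0]})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual z-score with the population standard deviation (ddof=0).
manual = (X - X.mean()) / X.std(ddof=0)
matches = np.allclose(X_scaled, manual.to_numpy())

# pandas' default std uses ddof=1, so scaled columns report sqrt(n / (n - 1)).
n = len(X)
sample_std = pd.DataFrame(X_scaled, columns=X.columns).std()
expected_std = np.sqrt(n / (n - 1))

print(matches)
print(sample_std.to_numpy(), expected_std)
```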


In [16]:
X_scaled.describe()


,sepal_length,sepal_width,petal_length,petal_width
count,147.0,147.0,147.0,147.0
mean,-4.8336240000000005e-17,1.691768e-16,-2.416812e-16,-3.383537e-16
std,1.003419,1.003419,1.003419,1.003419
min,-1.88371,-2.424189,-1.585902,-1.468099
25%,-0.9155095,-0.5873036,-1.243654,-1.203301
50%,-0.06833389,-0.1280822,0.3535005,0.1206904
75%,0.6578166,0.56075,0.7527893,0.782686
max,2.473193,3.086468,1.779532,1.70948


In [None]:
y.value_counts()

species
1    50
2    49
0    48
Name: count, dtype: int64

## Final Summary
The Iris dataset was inspected and prepared for machine learning. Data quality checks found no missing values; three duplicate rows were identified and removed, leaving 147 rows with class counts of 50, 49, and 48. The IQR method flagged four values in sepal_width, which were retained because none was judged extreme enough to require treatment. The target variable was encoded using Label Encoding, and the numerical features were standardized using StandardScaler so that they share a common scale. The resulting dataset is clean, standardized, and suitable for downstream exploratory analysis and classification modeling.