# Task 1: Data Cleaning of the Iris Dataset

### Objective
This notebook demonstrates data cleaning techniques on the Iris dataset (loaded from `iris.csv`). Steps include handling missing values, removing duplicates, detecting and dealing with outliers, and feature scaling. Each step is explained in detail, and key transformations are shown with before-and-after views of the data.

### Step 1: Load the Dataset
The Iris dataset is loaded from `iris.csv` into a pandas DataFrame for easy data manipulation. The file contains an `Id` column, 
four measurements per flower in centimeters (`SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, and `PetalWidthCm`), and 
a `Species` column.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the iris dataset from a CSV file
df = pd.read_csv("iris.csv")

# Display the first five rows of the dataset
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
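
Before cleaning, it can help to confirm that the file has the columns the later steps rely on. The sketch below is a minimal illustration and not part of the original notebook: it parses the five rows displayed by `df.head()` from an in-memory string instead of reading `iris.csv`, and the names `SAMPLE_CSV`, `EXPECTED_COLUMNS`, and `check_schema` are hypothetical, introduced only for this example.

```python
import io

import pandas as pd

# The five rows displayed by df.head(), inlined so the example runs without iris.csv
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
"""

# Hypothetical constant: the column names this notebook works with
EXPECTED_COLUMNS = ["Id", "SepalLengthCm", "SepalWidthCm",
                    "PetalLengthCm", "PetalWidthCm", "Species"]


def check_schema(frame: pd.DataFrame) -> list:
    """Return the expected columns that are missing from the frame."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing_columns = check_schema(sample)
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print("Missing columns:", missing_columns)
print("Numeric columns:", numeric_columns)
```

Knowing which columns are numeric up front matters for the later steps, because `Id` is numeric but is an identifier rather than a measurement.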


### Step 2: Handling Missing Data
To ensure data integrity, we first check for missing values using `isnull().sum()`. Any missing values in the numeric columns are 
imputed with that column's mean. Mean imputation is a simple default that works reasonably well when only a few values are 
missing. In this dataset the check reports no missing values, so the fill step leaves the data unchanged.

In [2]:
# Check for missing values in each column
print("Missing values in each column:")
print(df.isnull().sum())

# Fill missing values in numeric columns only
df.fillna(df.select_dtypes(include='number').mean(), inplace=True)

Missing values in each column:
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
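
Because the check reports no missing values, the fill step has no visible effect on this dataset. The following sketch uses a small made-up DataFrame (the values are illustrative, not taken from `iris.csv`) to show what the same kind of call does when gaps exist: numeric gaps are replaced by the column mean, while non-numeric columns are left alone.

```python
import numpy as np
import pandas as pd

# Toy data with one missing measurement and one missing label (illustrative values)
toy = pd.DataFrame({
    "SepalLengthCm": [5.0, np.nan, 7.0],
    "Species": ["Iris-setosa", None, "Iris-virginica"],
})

# Column means are computed over the non-missing numeric values only
numeric_means = toy.select_dtypes(include="number").mean()

# fillna with a Series fills each column named in the Series index
filled = toy.fillna(numeric_means)

print(filled)
```

The missing `Species` entry stays missing, since a mean is not defined for labels; such rows would need a separate decision (for example, dropping them).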


### Step 3: Removing Duplicate Records
Duplicate rows are identified using `duplicated()` and removed with `drop_duplicates()`, so that repeated records are not counted 
more than once in later analysis. Both methods compare all columns by default, including `Id`, so two rows count as duplicates only if they also share the same `Id`.


In [3]:
# Check for duplicate records
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Remove duplicate records, if any
df = df.drop_duplicates()
print("Duplicates removed.")

Number of duplicate rows: 0
Duplicates removed.
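
Since an identifier column makes every row distinct, a check over all columns may report no duplicates even when measurements repeat. The sketch below uses made-up rows to contrast the default comparison with one restricted through the `subset` parameter. Whether repeated measurements should actually be dropped is a judgment call, since two different flowers can legitimately share the same measurements.

```python
import pandas as pd

# Illustrative rows: Ids 1 and 3 have identical measurements and species
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 5.1],
    "SepalWidthCm": [3.5, 3.0, 3.5],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Default: all columns are compared, so the unique Id hides the repeat
all_column_duplicates = int(toy.duplicated().sum())

# Compare only the non-Id columns
value_columns = [col for col in toy.columns if col != "Id"]
value_duplicates = int(toy.duplicated(subset=value_columns).sum())

# Keep the first occurrence of each repeated measurement row
deduplicated = toy.drop_duplicates(subset=value_columns, keep="first")

print(all_column_duplicates, value_duplicates, len(deduplicated))
```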


### Step 4: Outlier Detection and Handling
Outliers can distort summary statistics and model fits, so we identify them using the Z-score method. A value more than 3 standard deviations from its column mean 
is treated as an outlier, and any row containing such a value is removed from the dataset.

In [4]:
from scipy.stats import zscore

# Calculate absolute Z-scores for the measurement columns only (Id and Species are dropped)
z_scores = np.abs(zscore(df.drop(columns=['Id', 'Species'])))
outliers = (z_scores > 3).any(axis=1)  # True for rows with any |Z-score| > 3
print(f"Number of outliers detected: {outliers.sum()}")

# Remove outliers from the dataset: keep only the rows that are NOT flagged
df_cleaned = df[~outliers].copy()

Number of outliers detected: 1
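
The filtering logic is easy to get backwards: the boolean mask marks the outlier rows, so keeping the clean data means selecting its negation. A minimal sketch on made-up values follows. It computes the Z-score directly with pandas using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Illustrative column: 20 typical values plus one extreme value
values = [3.0, 3.1, 2.9, 3.2, 2.8] * 4 + [9.0]
toy = pd.DataFrame({"SepalWidthCm": values})

# Z-score: distance from the column mean in units of standard deviation
z_scores = (toy - toy.mean()) / toy.std(ddof=0)

# True for rows where any column exceeds the threshold
is_outlier = (z_scores.abs() > 3).any(axis=1)

outlier_rows = toy[is_outlier]     # only the flagged rows
cleaned = toy[~is_outlier].copy()  # everything except the flagged rows

print(f"Flagged: {len(outlier_rows)}, kept: {len(cleaned)}")
```

The threshold of 3 only makes sense with enough rows: with the population standard deviation, no value in a sample of n points can have an absolute Z-score above sqrt(n - 1), so nothing can exceed 3 when there are 10 or fewer rows.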


### Step 5: Feature Scaling
To normalize the features, we apply `StandardScaler` to the four measurement columns, leaving the `Id` and `Species` columns unchanged. This transformation standardizes each feature to a mean of 0 and a standard deviation of 1, which can help machine learning algorithms that are sensitive to feature scale.

In [5]:
from sklearn.preprocessing import StandardScaler

# Standardize the measurement columns only (Id and Species are left as-is)
feature_cols = df_cleaned.columns.drop(['Id', 'Species'])
scaler = StandardScaler()
df_cleaned = df_cleaned.copy()  # work on an independent copy to avoid SettingWithCopyWarning
df_cleaned[feature_cols] = scaler.fit_transform(df_cleaned[feature_cols])

# Display the first few rows to show scaled features
df_cleaned.head()
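
A quick way to check a standardization step is to confirm that each scaled column has mean 0 and standard deviation 1. The sketch below does the arithmetic by hand in pandas on made-up values rather than calling `StandardScaler`. It divides by the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default, and it leaves the `Id` and `Species` columns untouched.

```python
import numpy as np
import pandas as pd

# Illustrative data (not taken from iris.csv)
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.0, 6.0, 7.0, 8.0],
    "PetalWidthCm": [0.2, 0.4, 1.4, 2.0],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Scale only the measurement columns
feature_cols = toy.columns.drop(["Id", "Species"])

scaled = toy.copy()
features = toy[feature_cols]
scaled[feature_cols] = (features - features.mean()) / features.std(ddof=0)

# Each scaled column should now have mean ~0 and population std ~1
print(scaled[feature_cols].mean().round(6))
print(scaled[feature_cols].std(ddof=0).round(6))
```

A scaled frame in which every value is exactly 0 is a warning sign: it typically means the scaler was fitted on a single row, or on a column with no variation.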


### Before-and-After Comparison
These snapshots provide a comparison of the dataset at different stages:
1. **Before Handling Outliers and Scaling:** Shows the raw data loaded from `iris.csv`.
2. **After Handling Outliers and Scaling:** Shows the cleaned dataset, with outliers removed and features standardized.


In [6]:
# Display dataset before and after handling outliers and scaling
print("Dataset before handling outliers and scaling:")
print(df.head())

print("\nDataset after handling outliers and scaling:")
print(df_cleaned.head())

Dataset before handling outliers and scaling:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

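
Beyond inspecting `head()`, a compact numeric summary can make a before-and-after comparison easier to verify. The helper below, `compare_frames`, is a hypothetical name introduced for this sketch. It reports how many rows were removed and the mean and standard deviation of the numeric columns in each frame, and it is demonstrated on made-up data.

```python
import pandas as pd


def compare_frames(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Summarize row counts and numeric column statistics for two frames."""
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_removed": len(before) - len(after),
        # Rows are 'mean' and 'std'; columns are the numeric columns
        "stats_before": before.select_dtypes(include="number").agg(["mean", "std"]),
        "stats_after": after.select_dtypes(include="number").agg(["mean", "std"]),
    }


# Illustrative frames: 'after' drops the last (extreme) row of 'before'
before = pd.DataFrame({
    "PetalLengthCm": [1.0, 2.0, 3.0, 10.0],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})
after = before.iloc[:3].copy()

summary = compare_frames(before, after)
print(f"Rows removed: {summary['rows_removed']}")
print(summary["stats_after"])
```

Applied to the notebook's frames, the call would be `compare_frames(df, df_cleaned)`.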


### Conclusion
In this notebook, we performed essential data cleaning steps on the Iris dataset: checking for missing values, removing 
duplicates, detecting and removing outliers with the Z-score method, and standardizing the features. The dataset had no missing 
values or duplicate rows, and one row was flagged as an outlier. The resulting `df_cleaned` DataFrame is ready for further analysis.
