# Task 2: Data Transformation of the Iris Dataset

### Objective
This notebook covers data transformation techniques, including encoding categorical data, feature engineering, and aggregating 
data. Each transformation is explained, and key changes are demonstrated with before-and-after views.


### Step 1: Load the Dataset
The Iris dataset is loaded from `iris.csv`. This dataset will undergo several transformations to make it more analysis-ready.


In [1]:
# Import libraries
import pandas as pd

# Load the dataset
df = pd.read_csv("iris.csv")

# Display the first few rows of the dataset
df.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


### Step 2: Encoding Categorical Data
The `Species` column, which contains nominal data, is transformed using one-hot encoding. This technique creates a binary 
indicator column per species, making the data suitable for modeling. We use `drop_first=True` to omit the first category 
(`Iris-setosa`), which avoids perfect multicollinearity: a row with 0 in both remaining columns is implicitly `Iris-setosa`.


In [2]:
# One-hot encoding for nominal variables
df_encoded = pd.get_dummies(df, columns=['Species'], drop_first=True)

# Display the dataset after encoding
df_encoded.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-versicolor,Species_Iris-virginica
0,1,5.1,3.5,1.4,0.2,0,0
1,2,4.9,3.0,1.4,0.2,0,0
2,3,4.7,3.2,1.3,0.2,0,0
3,4,4.6,3.1,1.5,0.2,0,0
4,5,5.0,3.6,1.4,0.2,0,0
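
The first five rows above are all `Iris-setosa`, so both indicator columns are 0. As a minimal sketch of how the dropped category acts as a baseline, the example below uses a small hypothetical frame named `toy` (not the real `iris.csv`) with one row per species. The `dtype=int` argument is an assumption added here to force 0/1 output, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Tiny hypothetical frame with one row per species (not the real iris.csv)
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
})

# dtype=int forces 0/1 output; newer pandas versions default to booleans
toy_encoded = pd.get_dummies(toy, columns=["Species"], drop_first=True, dtype=int)

# A row with no 1 in any indicator column belongs to the dropped baseline category
dummy_cols = ["Species_Iris-versicolor", "Species_Iris-virginica"]
is_baseline = toy_encoded[dummy_cols].sum(axis=1) == 0

print(toy_encoded)
print(is_baseline.tolist())
```

Only the first row is flagged as the baseline, which is how `Iris-setosa` stays recoverable even though it has no column of its own.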


### Step 3: Feature Engineering
To add more insight, we derive two new features:
1. **Petal Area**: Calculated by multiplying petal length and petal width.
2. **Sepal Area**: Calculated by multiplying sepal length and sepal width.

Both products are rectangular approximations (in cm²) rather than true areas. These engineered features may be more informative than the raw measurements for some analyses.

In [3]:
# Create new features
df_encoded['petal_area'] = df['PetalLengthCm'] * df['PetalWidthCm']
df_encoded['sepal_area'] = df['SepalLengthCm'] * df['SepalWidthCm']

# Display the dataset after feature engineering
df_encoded.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-versicolor,Species_Iris-virginica,petal_area,sepal_area
0,1,5.1,3.5,1.4,0.2,0,0,0.28,17.85
1,2,4.9,3.0,1.4,0.2,0,0,0.28,14.7
2,3,4.7,3.2,1.3,0.2,0,0,0.26,15.04
3,4,4.6,3.1,1.5,0.2,0,0,0.3,14.26
4,5,5.0,3.6,1.4,0.2,0,0,0.28,18.0
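
The same two features can also be computed without modifying the input frame. The sketch below is one possible alternative, not the notebook's own method: it uses `DataFrame.assign` on a small frame named `sample` (a name introduced here) built from the first two rows shown above.

```python
import pandas as pd

# First two rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9],
    "SepalWidthCm": [3.5, 3.0],
    "PetalLengthCm": [1.4, 1.4],
    "PetalWidthCm": [0.2, 0.2],
})

# assign() returns a new frame and leaves `sample` untouched
with_areas = sample.assign(
    petal_area=lambda d: d["PetalLengthCm"] * d["PetalWidthCm"],
    sepal_area=lambda d: d["SepalLengthCm"] * d["SepalWidthCm"],
)

# Round only for display; floating-point products are rarely exact
print(with_areas[["petal_area", "sepal_area"]].round(2))
```

Rounding is applied only when printing, so the stored values keep full floating-point precision for later calculations.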


### Step 4: Data Aggregation
To summarize the data, we group by `Species` and calculate the mean of each raw feature (sepal length, sepal width, petal length,
and petal width). This gives the average dimensions of each iris species and makes differences between species easy to compare.


In [4]:
# Group by species and calculate mean values for each feature
df_aggregated = df.groupby('Species').agg({
    'SepalLengthCm': 'mean',
    'SepalWidthCm': 'mean',
    'PetalLengthCm': 'mean',
    'PetalWidthCm': 'mean'
}).reset_index()

# Display aggregated dataset
df_aggregated

,Species,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,Iris-setosa,5.006,3.418,1.464,0.244
1,Iris-versicolor,5.936,2.77,4.26,1.326
2,Iris-virginica,6.588,2.974,5.552,2.026
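
When every column uses the same statistic, the dictionary passed to `agg` can be replaced by selecting the columns and calling `mean()` directly. The sketch below compares the two forms on a small made-up frame; the species labels `A` and `B` and all values are illustrative, not taken from `iris.csv`.

```python
import pandas as pd

# Small made-up sample (labels and values are illustrative only)
toy = pd.DataFrame({
    "Species": ["A", "A", "B", "B"],
    "SepalLengthCm": [5.0, 5.2, 6.0, 6.4],
    "PetalLengthCm": [1.4, 1.6, 4.0, 4.4],
})
features = ["SepalLengthCm", "PetalLengthCm"]

# Explicit dictionary form, mirroring the aggregation cell
agg_dict = toy.groupby("Species").agg({col: "mean" for col in features}).reset_index()

# Shorthand: select the feature columns, then take the mean per group
agg_short = toy.groupby("Species")[features].mean().reset_index()

print(agg_short)

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(agg_dict, agg_short)
```

The dictionary form remains the better choice when different columns need different statistics, for example a mean for one column and a maximum for another.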


### Before-and-After Comparison
The following snapshots provide a comparison of the dataset before and after transformations:
1. **Before Transformations:** Shows the raw data loaded from `iris.csv`.
2. **After Encoding and Feature Engineering:** Highlights the dataset after applying one-hot encoding to `Species` and adding 
the two new features (`petal_area` and `sepal_area`).


In [5]:
# Display dataset before transformations
print("Dataset before transformations:")
print(df.head())

# Display dataset after transformations (encoding and feature engineering)
print("\nDataset after encoding and feature engineering:")
print(df_encoded.head())

Dataset before transformations:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

Dataset after encoding and feature engineering:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0   1            5.1           3.5            1.4           0.2   
1   2            4.9           3.0            1.4           0.2   
2   3            4.7           3.2            1.3           0.2   
3   4            4.6           3.1            1.5           0.2   
4   5            5.0           3.6            1.4           0.2   

   Species_Iris-versicolor  Species_Iris-virginica  pe
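
Because wide frames wrap when printed, comparing column names can be a more compact way to see what the transformations changed. The following is a minimal sketch on a hypothetical two-row stand-in for the raw data; the names `before`, `after`, `added`, and `removed` are introduced here for illustration.

```python
import pandas as pd

# Minimal stand-in for the raw data (two rows, hypothetical subset of columns)
before = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7],
    "PetalWidthCm": [0.2, 1.4],
    "Species": ["Iris-setosa", "Iris-versicolor"],
})

# Apply the same kinds of transformations as in the notebook
after = pd.get_dummies(before, columns=["Species"], drop_first=True)
after["petal_area"] = before["PetalLengthCm"] * before["PetalWidthCm"]

# Columns introduced and removed by the transformations
added = [c for c in after.columns if c not in before.columns]
removed = [c for c in before.columns if c not in after.columns]

print("Added:", added)
print("Removed:", removed)
```

The same two list comprehensions could be run against `df` and `df_encoded` from the cells above.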

### Conclusion
This notebook demonstrated key data transformation steps on the Iris dataset. We encoded the categorical `Species` column, created two area features, and aggregated the measurements by species. These transformations leave the dataset in a numeric, summarized form that is easier to use for machine learning and data analysis.
