Data wrangling (also called data munging) is the process of cleaning, organizing, and transforming raw data into a usable format for analysis.

[Kaggle - Iris Dataset](https://www.kaggle.com/datasets/uciml/iris)  
This dataset contains sepal and petal measurements for three species of iris flowers.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("./Iris.csv")
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


Pandas is an open-source Python library used for data manipulation and analysis. 
NumPy is a powerful open-source Python library used for numerical computing. 
A DataFrame is a two-dimensional, tabular data structure in Python, provided by the Pandas library. It is similar to an Excel spreadsheet or an SQL table.
df.head() is a Pandas method that returns the first n rows of a DataFrame (five by default).
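
As a small illustration of the DataFrame and head() descriptions above, the sketch below builds a DataFrame by hand from the first three rows of the table shown earlier (only three of the columns are copied; the variable names `sample` and `first_two` are made up for this example).

```python
import pandas as pd

# First three rows of the Iris table shown above, typed in by hand.
sample = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 4.7],
        "SepalWidthCm": [3.5, 3.0, 3.2],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

first_two = sample.head(2)      # head(n) returns the first n rows
print(first_two)

# With no argument, head() asks for five rows; this frame only has three,
# so all three come back.
print(sample.head().shape)
```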

In [None]:
print(df.shape)  # (rows , columns)

(150, 6)


In [9]:
df.isnull().sum()  # Number of missing values per column

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
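
The Iris file has no missing values, so every count above is zero. To see what non-zero counts look like, here is a minimal sketch on a made-up three-row frame (`toy` and its values are invented for illustration and are not part of the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberately missing entries.
toy = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, np.nan, 4.7],
        "Species": ["Iris-setosa", None, None],
    }
)

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```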

In [None]:
df.describe()  # Summary stats for numeric columns

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [11]:
df.describe(include='all')

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
count,150.0,150.0,150.0,150.0,150.0,150
unique,,,,,,3
top,,,,,,Iris-setosa
freq,,,,,,50
mean,75.5,5.843333,3.054,3.758667,1.198667,
std,43.445368,0.828066,0.433594,1.76442,0.763161,
min,1.0,4.3,2.0,1.0,0.1,
25%,38.25,5.1,2.8,1.6,0.3,
50%,75.5,5.8,3.0,4.35,1.3,
75%,112.75,6.4,3.3,5.1,1.8,


In [13]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [None]:
| Column Name     | Description                        | Data Type | Example Values                               | Variable Type         |
| --------------- | ---------------------------------- | --------- | -------------------------------------------- | --------------------- |
| `Id`            | Row identifier                     | int64     | 1, 2, 3                                      | Identifier            |
| `SepalLengthCm` | Length of the sepal in centimeters | float64   | 5.1, 4.9, 6.3                                | Continuous            |
| `SepalWidthCm`  | Width of the sepal in centimeters  | float64   | 3.5, 3.0, 2.9                                | Continuous            |
| `PetalLengthCm` | Length of the petal in centimeters | float64   | 1.4, 4.7, 5.6                                | Continuous            |
| `PetalWidthCm`  | Width of the petal in centimeters  | float64   | 0.2, 1.4, 2.1                                | Continuous            |
| `Species`       | Iris flower species                | object    | Iris-setosa, Iris-versicolor, Iris-virginica | Categorical (nominal) |
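
The split between numeric and non-numeric columns can also be read off programmatically. The sketch below uses Pandas' select_dtypes on a small hand-typed frame (`sample` is an assumption for illustration; on the real data the same calls would be made on df).

```python
import pandas as pd

# Small stand-in for the Iris DataFrame, typed in by hand.
sample = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "SepalLengthCm": [5.1, 4.9, 4.7],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

# Columns with a numeric dtype (int64, float64, ...)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else, here only the text column
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```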


Data normalization is the process of scaling numerical data so that it fits within a specific range, usually 0 to 1.

Min-Max Normalization formula
X' = (X - min(X)) / (max(X) - min(X))
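
A worked example of the formula, written in plain Python as a sketch (the helper name `min_max` is made up here). The three inputs are the minimum, the first-row value, and the maximum of SepalLengthCm taken from the describe() and head() outputs above.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1] with the min-max formula."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


# min, first row, and max of SepalLengthCm
sepal_lengths = [4.3, 5.1, 7.9]
scaled = min_max(sepal_lengths)
print([round(s, 6) for s in scaled])  # [0.0, 0.222222, 1.0]
```

The middle value, (5.1 - 4.3) / (7.9 - 4.3), is about 0.222.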

scikit-learn (imported as sklearn) is a Python library for machine learning and data analysis. It provides simple and efficient tools for data mining, analysis, and building machine learning models.

The preprocessing module in scikit-learn (sklearn) provides a set of functions to preprocess data, such as scaling, encoding, and handling missing values. It's an essential part of the data preprocessing pipeline for machine learning tasks.

In [18]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()

In [19]:
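# Note: [:, :4] takes the first four columns, i.e. Id plus three measurements;
# df.iloc[:, 1:5] would select the four measurement columns instead.
x = df.iloc[:, :4]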
x

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm
0,1,5.1,3.5,1.4
1,2,4.9,3.0,1.4
2,3,4.7,3.2,1.3
3,4,4.6,3.1,1.5
4,5,5.0,3.6,1.4
...,...,...,...,...
145,146,6.7,3.0,5.2
146,147,6.3,2.5,5.0
147,148,6.5,3.0,5.2
148,149,6.2,3.4,5.4


In [2]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

Fit: The estimator (e.g., scaler, imputer, encoder) learns the statistics it needs from the input data X, such as the column minimum and maximum for MinMaxScaler, the mean and standard deviation for a standard scaler, or the set of categories for an encoder.

Transform: The data is then transformed using the learned parameters and returned in the desired format. fit_transform performs both steps in one call.
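
To make the two steps concrete, the sketch below calls fit and transform separately on MinMaxScaler. The training values are the min, first-row, and max sepal lengths from the outputs above; `new_values` is a made-up extra measurement used only to show that transform reuses what fit learned.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One column of sepal lengths (scikit-learn expects a 2-D array).
train = np.array([[4.3], [5.1], [7.9]])

scaler = MinMaxScaler()
scaler.fit(train)  # fit: learn the per-column min and max
print(scaler.data_min_, scaler.data_max_)

# transform: apply the learned min and max, here to a value not seen during fit
new_values = np.array([[6.1]])  # hypothetical new measurement
scaled_new = scaler.transform(new_values)
print(scaled_new)  # close to 0.5, since 6.1 is halfway between 4.3 and 7.9

# fit_transform is a shortcut for fit followed by transform on the same data
same = np.allclose(MinMaxScaler().fit_transform(train), scaler.transform(train))
```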

In [20]:
x_scaled = min_max_scaler.fit_transform(x)

In [25]:
df_normalized = pd.DataFrame(x_scaled)
df_normalized

Unnamed: 0,0,1,2,3
0,0.000000,0.222222,0.625000,0.067797
1,0.006711,0.166667,0.416667,0.067797
2,0.013423,0.111111,0.500000,0.050847
3,0.020134,0.083333,0.458333,0.084746
4,0.026846,0.194444,0.666667,0.067797
...,...,...,...,...
145,0.973154,0.666667,0.416667,0.711864
146,0.979866,0.555556,0.208333,0.677966
147,0.986577,0.611111,0.416667,0.711864
148,0.993289,0.527778,0.583333,0.745763
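
fit_transform returns a NumPy array, so the column names are lost when it is wrapped in a new DataFrame. One possible variation, sketched below on three hand-copied rows, scales only the four measurement columns and keeps their names. `sample` and `measurement_cols` are names chosen for this example, and the scaled values depend on these three rows only, not on the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Rows 0, 1 and 145 of the Iris table, typed in by hand.
sample = pd.DataFrame(
    {
        "Id": [1, 2, 146],
        "SepalLengthCm": [5.1, 4.9, 6.7],
        "SepalWidthCm": [3.5, 3.0, 3.0],
        "PetalLengthCm": [1.4, 1.4, 5.2],
        "PetalWidthCm": [0.2, 0.2, 2.3],
    }
)

# Select columns by name so that Id is not scaled by accident.
measurement_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

scaled = MinMaxScaler().fit_transform(sample[measurement_cols])

# Re-attach the column names and the original row index.
scaled_df = pd.DataFrame(scaled, columns=measurement_cols, index=sample.index)
print(scaled_df)
```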


In [30]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

The LabelEncoder from sklearn.preprocessing is used to convert categorical labels into numerical labels.

In [34]:
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])

In [36]:
df['Species'].unique()

array([0, 1, 2])
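
The mapping between codes and original labels stays available on the fitted encoder. A minimal sketch, using a short hand-written list of labels instead of the full column:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of a label is its code.
print(list(encoder.classes_))
print(codes.tolist())  # [0, 1, 2, 0]

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```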

In [39]:
features_df = df.drop(columns=['Species'])

The OneHotEncoder() from sklearn.preprocessing is used to convert categorical features into a one-hot numeric array.

In a one-hot encoded array, the values 0 and 1 represent the presence or absence of a category for each observation.
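
The idea can be sketched by hand with NumPy before using the encoder. This is only an illustration of what a one-hot array looks like, not how OneHotEncoder is implemented; `codes` is a made-up list of label-encoded species.

```python
import numpy as np

codes = np.array([0, 1, 2, 0])  # hypothetical label-encoded species
n_categories = 3

# Row i of the identity matrix has a single 1 in position i,
# so indexing it with the codes gives one one-hot row per observation.
one_hot = np.eye(n_categories)[codes]
print(one_hot)
```

Each row contains exactly one 1, in the column of the category that observation belongs to.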

In [41]:
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray())
enc_df

Unnamed: 0,0,1,2
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
...,...,...,...
145,0.0,0.0,1.0
146,0.0,0.0,1.0
147,0.0,0.0,1.0
148,0.0,0.0,1.0


In [43]:
df_encode = features_df.join(enc_df)
df_encode

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,0,1,2
0,1,5.1,3.5,1.4,0.2,1.0,0.0,0.0
1,2,4.9,3.0,1.4,0.2,1.0,0.0,0.0
2,3,4.7,3.2,1.3,0.2,1.0,0.0,0.0
3,4,4.6,3.1,1.5,0.2,1.0,0.0,0.0
4,5,5.0,3.6,1.4,0.2,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,0.0,0.0,1.0
146,147,6.3,2.5,5.0,1.9,0.0,0.0,1.0
147,148,6.5,3.0,5.2,2.0,0.0,0.0,1.0
148,149,6.2,3.4,5.4,2.3,0.0,0.0,1.0


In [44]:
df_encode.rename(columns={0: 'Iris-setosa', 1: 'Iris-versicolor',
                          2: 'Iris-virginica'}, inplace=True)
df_encode

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Iris-setosa,Iris-versicolor,Iris-virginica
0,1,5.1,3.5,1.4,0.2,1.0,0.0,0.0
1,2,4.9,3.0,1.4,0.2,1.0,0.0,0.0
2,3,4.7,3.2,1.3,0.2,1.0,0.0,0.0
3,4,4.6,3.1,1.5,0.2,1.0,0.0,0.0
4,5,5.0,3.6,1.4,0.2,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,0.0,0.0,1.0
146,147,6.3,2.5,5.0,1.9,0.0,0.0,1.0
147,148,6.5,3.0,5.2,2.0,0.0,0.0,1.0
148,149,6.2,3.4,5.4,2.3,0.0,0.0,1.0
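
Instead of renaming the columns by hand, the names can be taken from the fitted encoder's categories_ attribute. The sketch below assumes the encoder is fitted on the original string labels rather than on the label-encoded integers used above; `sample` is a small hand-written frame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

sample = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

enc = OneHotEncoder()
encoded = enc.fit_transform(sample[["Species"]]).toarray()

# categories_ is a list with one array of categories per input column.
enc_df = pd.DataFrame(encoded, columns=enc.categories_[0], index=sample.index)
print(enc_df)
```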


In [17]:
from sklearn import preprocessing
df['Species'].unique()


array([0, 1, 2])

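pd.get_dummies one-hot encodes a column directly in Pandas. prefix="Species" adds 'Species_' as a prefix to the new columns, and drop_first=True drops the first category (setosa), which is then represented by zeros in both remaining columns. Because Species was label-encoded above, the new columns are named Species_1 (versicolor) and Species_2 (virginica).

    Species_1      |     Species_2
-------------------|-------------------
        0          |        0          ← setosa (dropped)
        1          |        0          ← versicolor
        0          |        1          ← virginica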


In [45]:
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=True)

In [46]:
one_hot_df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_1,Species_2
0,1,5.1,3.5,1.4,0.2,False,False
1,2,4.9,3.0,1.4,0.2,False,False
2,3,4.7,3.2,1.3,0.2,False,False
3,4,4.6,3.1,1.5,0.2,False,False
4,5,5.0,3.6,1.4,0.2,False,False
...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,False,True
146,147,6.3,2.5,5.0,1.9,False,True
147,148,6.5,3.0,5.2,2.0,False,True
148,149,6.2,3.4,5.4,2.3,False,True
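
If get_dummies is applied to the original string labels instead of the label-encoded integers, the new column names carry the species names. A minimal sketch on a hand-written frame (`sample` and its PetalWidthCm values are illustrative, not taken row by row from the dataset); dtype=int is passed so the indicators print as 0/1 instead of True/False.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "PetalWidthCm": [0.2, 1.4, 2.3],  # illustrative values
        "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

dummies = pd.get_dummies(
    sample,
    prefix="Species",
    columns=["Species"],
    drop_first=True,  # drop the first category (Iris-setosa)
    dtype=int,
)
print(dummies)
```

The setosa row has 0 in both indicator columns, which is how the dropped category is represented.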
