# Data Transformation and Normalization

## 1. Normalization and Standardization

Normalization and standardization rescale the features of your data so that they are on a comparable scale. This matters for machine learning algorithms that are sensitive to feature magnitude, such as distance-based methods (k-nearest neighbors, k-means) and models trained with gradient descent.

### MinMaxScaler

The MinMaxScaler transforms features by scaling each feature to a given range, usually between zero and one. The formula is:

---


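>$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Where:
* $X$ is the original value
* $X_{min}$ and $X_{max}$ are the minimum and maximum of the feature
* $X'$ is the transformed value
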

---

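This scaler is useful when you need every feature in a fixed, bounded range. Because the transformation is linear, it preserves the shape of the original distribution and the relative distances between values. It is sensitive to outliers, however, since the minimum and maximum define the range.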

In [12]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Example DataFrame
data = {'Age': [24, 27, 22, 29],
        'Score': [85.5, 90.0, 87.5, 88.0]}

df = pd.DataFrame(data)

# Applying MinMaxScaler
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_minmax)

        Age     Score
0  0.285714  0.000000
1  0.714286  1.000000
2  0.000000  0.444444
3  1.000000  0.555556
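
To make the formula concrete, here is a minimal sketch that applies it by hand with pandas and checks the result against `MinMaxScaler`. It also shows the `feature_range` parameter, which selects a range other than zero to one. The variable names `manual`, `sk`, and `scaled_pm1` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Age': [24, 27, 22, 29],
                   'Score': [85.5, 90.0, 87.5, 88.0]})

# Apply the formula column by column: (X - X_min) / (X_max - X_min)
manual = (df - df.min()) / (df.max() - df.min())

# Compare the hand-computed values with scikit-learn's result
sk = MinMaxScaler().fit_transform(df)
print(np.allclose(manual.to_numpy(), sk))

# feature_range rescales to a different interval, here [-1, 1]
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df)
print(scaled_pm1.min(axis=0), scaled_pm1.max(axis=0))
```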


### StandardScaler

The StandardScaler standardizes features by removing the mean and scaling to unit variance. The formula is:

---


>$$X' = \frac{X - \mu}{\sigma}$$

Where:
* $X$ is the original value
* $\mu$ is the mean of the feature
* $\sigma$ is the standard deviation of the feature (scikit-learn uses the population standard deviation, that is, `ddof=0`)
* $X'$ is the transformed value


---

This scaler is useful when features are roughly Gaussian but have different means and variances. After scaling, each feature has a mean of zero and a standard deviation of one.

In [13]:
from sklearn.preprocessing import StandardScaler

# Applying StandardScaler
scaler = StandardScaler()
df_standard = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standard)

        Age     Score
0 -0.557086 -1.405564
1  0.557086  1.405564
2 -1.299867 -0.156174
3  1.299867  0.156174
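
The same values can be reproduced by hand, which is a quick way to check the formula. This is a sketch rather than a description of scikit-learn's internals. Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed explicitly to match `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [24, 27, 22, 29],
                   'Score': [85.5, 90.0, 87.5, 88.0]})

# (X - mean) / std, using the population standard deviation (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)

# Compare the hand-computed values with scikit-learn's result
sk = StandardScaler().fit_transform(df)
print(np.allclose(manual.to_numpy(), sk))

# With the pandas default (ddof=1) the values would differ
sample_std_version = (df - df.mean()) / df.std()
print(np.allclose(sample_std_version.to_numpy(), sk))
```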


### RobustScaler

The RobustScaler uses the median and the interquartile range for scaling. It is less sensitive to outliers compared to the StandardScaler. The formula is:

---


>$$X' = \frac{X - \text{median}}{\text{IQR}}$$

Where:
* $X$ is the original value
* $\text{median}$ is the median of the feature
* $\text{IQR}$ is the interquartile range of the feature (the difference between the third quartile, Q3, and the first quartile, Q1)
* $X'$ is the transformed value

---

This scaler is useful when your data contains outliers.

In [14]:
from sklearn.preprocessing import RobustScaler

# Applying RobustScaler
scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_robust)

     Age     Score
0 -0.375 -1.500000
1  0.375  1.500000
2 -0.875 -0.166667
3  0.875  0.166667
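
As a check on the formula, the sketch below computes the median and IQR with pandas and compares the result with `RobustScaler` at its default settings. It assumes the default linear interpolation for quantiles in pandas. The names `median`, `iqr`, and `manual` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({'Age': [24, 27, 22, 29],
                   'Score': [85.5, 90.0, 87.5, 88.0]})

# Median and interquartile range (Q3 - Q1) of each column
median = df.median()
iqr = df.quantile(0.75) - df.quantile(0.25)
print(iqr)

# (X - median) / IQR
manual = (df - median) / iqr

# Compare the hand-computed values with scikit-learn's result
sk = RobustScaler().fit_transform(df)
print(np.allclose(manual.to_numpy(), sk))
```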


### Differences and When to Use Them

* **MinMaxScaler**: Use when you need values in a bounded range such as [0, 1] and your data does not contain outliers. A single extreme value sets the minimum or maximum and compresses all the other values.
* **StandardScaler**: Use when your features are roughly Gaussian with different means and variances. Not robust to outliers, because the mean and standard deviation are both affected by extreme values.
* **RobustScaler**: Use when your data contains outliers. It scales the data based on the median and IQR, which extreme values barely affect.
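
The effect of an outlier is easiest to see side by side. The sketch below uses a hypothetical `Income` column, invented for illustration, with four typical values and one extreme value, and applies all three scalers to it.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical feature: four typical values and one extreme outlier
df_out = pd.DataFrame({'Income': [30.0, 35.0, 40.0, 45.0, 1000.0]})

# One column per scaler, so the results can be compared row by row
comparison = pd.DataFrame({
    'Original': df_out['Income'],
    'MinMax': MinMaxScaler().fit_transform(df_out).ravel(),
    'Standard': StandardScaler().fit_transform(df_out).ravel(),
    'Robust': RobustScaler().fit_transform(df_out).ravel(),
})
print(comparison.round(3))
```

With MinMaxScaler the four typical values are squeezed into a narrow band near zero, while RobustScaler keeps them spread between -1 and 0.5 and leaves the outlier far from the rest.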

## 2. Encoding Categorical Variables

Categorical variables need to be converted into a numerical format for machine learning algorithms to process them. There are several methods for encoding categorical variables.

### One-Hot Encoding

One-hot encoding creates a new binary column for each category.

In [15]:
# Example DataFrame with categorical variable
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Team': ['A', 'B', 'A', 'B']}

df = pd.DataFrame(data)

# Applying one-hot encoding (dtype=int gives 0/1 columns; recent pandas versions default to True/False)
df_onehot = pd.get_dummies(df, columns=['Team'], dtype=int)
print(df_onehot)

      Name  Team_A  Team_B
0    Alice       1       0
1      Bob       0       1
2  Charlie       1       0
3    David       0       1
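
`pd.get_dummies` is convenient for exploration. When the same encoding has to be reapplied to new data, scikit-learn's `OneHotEncoder` is a common alternative because it remembers the categories seen during fitting. The sketch below swaps in that technique and is not part of the original example. The unseen team `'C'` is a made-up value used to show `handle_unknown='ignore'`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Team': ['A', 'B', 'A', 'B']})

# Fit on the Team column; unknown categories are encoded as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['Team']]).toarray()

print(encoder.categories_)  # categories learned during fit
print(encoded)

# A category that was not seen during fit (hypothetical team 'C')
unseen = encoder.transform(pd.DataFrame({'Team': ['C']})).toarray()
print(unseen)
```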


### Label Encoding

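Label encoding converts each category to an integer label. The integers imply an order (0 < 1 < 2 ...), which can mislead models when the categories have no natural ranking. Note that scikit-learn's `LabelEncoder` is intended for target labels; for input features, `OrdinalEncoder` is the equivalent transformer.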

In [16]:
from sklearn.preprocessing import LabelEncoder

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Team': ['A', 'B', 'C', 'D']}

df = pd.DataFrame(data)
# Applying label encoding
encoder = LabelEncoder()
df['Team_Encoded'] = encoder.fit_transform(df['Team'])
print(df)

      Name Team  Team_Encoded
0    Alice    A             0
1      Bob    B             1
2  Charlie    C             2
3    David    D             3
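
The mapping from categories to integers is stored on the fitted encoder, so it can be inspected and reversed. A minimal sketch, using an unsorted list of team labels chosen for illustration:

```python
from sklearn.preprocessing import LabelEncoder

teams = ['B', 'A', 'D', 'C', 'A']

encoder = LabelEncoder()
codes = encoder.fit_transform(teams)

# classes_ holds the sorted categories; a label's code is its index there
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Labels are assigned in sorted order of the categories, not in order of appearance, which is why `'B'` receives the code 1 even though it comes first in the list.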


## 3. Handling Dates and Times

Handling dates and times is crucial for time series analysis, feature extraction, and understanding trends over time.

### Converting Strings to Datetime

You can convert date strings to datetime objects using `pd.to_datetime()`.

In [17]:
# Example DataFrame with date strings
data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03'],
        'Value': [100, 150, 200]}

df_dates = pd.DataFrame(data)

# Converting to datetime
df_dates['Date'] = pd.to_datetime(df_dates['Date'])
print(df_dates)

        Date  Value
0 2021-01-01    100
1 2021-01-02    150
2 2021-01-03    200
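
Real data often uses other date layouts or contains invalid entries. The sketch below assumes a hypothetical day/month/year column with one bad value. It passes an explicit `format` so the day and month are not swapped, and `errors='coerce'` so unparseable strings become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw strings in day/month/year order, one of them invalid
raw = pd.Series(['01/02/2021', '15/02/2021', 'not a date'])

# Explicit format avoids ambiguity; errors='coerce' turns bad values into NaT
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)

# Rows that failed to parse can then be located with isna()
print(parsed.isna().tolist())
```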


### Extracting Date Components

You can extract various components of datetime objects for analysis.

In [18]:
# Extracting year, month, day
df_dates['Year'] = df_dates['Date'].dt.year
df_dates['Month'] = df_dates['Date'].dt.month
df_dates['Day'] = df_dates['Date'].dt.day
print(df_dates)

        Date  Value  Year  Month  Day
0 2021-01-01    100  2021      1    1
1 2021-01-02    150  2021      1    2
2 2021-01-03    200  2021      1    3


In [19]:
# Extracting day of the week (Monday=0, Sunday=6)
df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek
print(df_dates)

        Date  Value  Year  Month  Day  DayOfWeek
0 2021-01-01    100  2021      1    1          4
1 2021-01-02    150  2021      1    2          5
2 2021-01-03    200  2021      1    3          6
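
Numeric weekday codes are compact but not very readable. As a small extension, the sketch below adds the weekday name with `dt.day_name()` and derives a weekend flag from `dt.dayofweek`. The column names `DayName` and `IsWeekend` are illustrative.

```python
import pandas as pd

df_dates = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
    'Value': [100, 150, 200],
})

# Human-readable weekday name
df_dates['DayName'] = df_dates['Date'].dt.day_name()

# Saturday (5) and Sunday (6) count as the weekend
df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5
print(df_dates)
```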
