Topic 2: **Data Transformation**

Data transformation converts raw data into a format suitable for analysis or modeling. It typically involves encoding categorical variables as numbers, scaling numerical features, and engineering new features or transforming existing ones. The sections below cover each in turn:

### 1. Encoding Categorical Variables

Categorical variables take values from a fixed set of categories rather than from a numerical range. Many machine learning algorithms require numerical input, so these variables must be encoded as numbers first. Common techniques include:

#### a. One-Hot Encoding
In one-hot encoding, each category becomes its own binary column: a row has 1 (hot) in the column for its category and 0 (cold) in all the others. No order is implied between categories.

In [1]:
import pandas as pd

# Example data with categorical variable
data = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C']})

# One-hot encoding (dtype=int gives 0/1 columns; pandas 2.0+ returns booleans by default)
data_encoded = pd.get_dummies(data, columns=['Category'], dtype=int)

print("Original data:")
print(data)
print("\nOne-hot encoded data:")
print(data_encoded)

Original data:
  Category
0        A
1        B
2        C
3        A
4        B
5        C

One-hot encoded data:
   Category_A  Category_B  Category_C
0           1           0           0
1           0           1           0
2           0           0           1
3           1           0           0
4           0           1           0
5           0           0           1
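
With k categories, the k one-hot columns are redundant, because any one column can be inferred from the others. The sketch below assumes a setting where that redundancy matters (for example, a linear model with an intercept) and drops the first level with `drop_first=True`. The variable name `data_reduced` is illustrative and not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C']})

# Drop the first level (A); a row of all zeros now stands for category A
data_reduced = pd.get_dummies(data, columns=['Category'], drop_first=True, dtype=int)

print(data_reduced)
```

Whether to drop a level depends on the model: tree-based models are generally unaffected by the extra column, so keeping all k columns is often fine there.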


#### b. Label Encoding
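In label encoding, each category is assigned a unique integer. The integers imply an order (here A < B < C), so this works best for ordinal categories or for models that are insensitive to that order, such as tree-based models. Note that scikit-learn's `LabelEncoder` is intended for target labels and assigns codes in sorted order of the class values.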

In [2]:
from sklearn.preprocessing import LabelEncoder

# Example data with categorical variable
data = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C']})

# Label encoding
label_encoder = LabelEncoder()
data['Category_Encoded'] = label_encoder.fit_transform(data['Category'])

print("Label-encoded data:")
print(data)

Label-encoded data:
  Category  Category_Encoded
0        A                 0
1        B                 1
2        C                 2
3        A                 0
4        B                 1
5        C                 2
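
When the categories of an input feature have a meaningful order, the alphabetical codes above may not reflect it. As a hedged sketch, scikit-learn's `OrdinalEncoder` accepts an explicit category order through its `categories` parameter. The `Size` column and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; sorted order would give large < medium < small
data = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})

# Pass the intended order explicitly so the codes follow small < medium < large
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data[['Size']])  # 2-D float array, one column

data['Size_Encoded'] = encoded.ravel().astype(int)

print(data)
```

`OrdinalEncoder` works on 2-D feature input, which is why the column is selected with double brackets.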


### 2. Scaling Features

Scaling puts numerical features on a comparable scale. This matters for algorithms that are sensitive to feature magnitude, such as distance-based methods (k-nearest neighbors, k-means) and models trained with gradient descent. Common scaling techniques include:

#### a. Min-Max Scaling
Min-max scaling linearly maps each feature to a specified range, typically [0, 1], using `x_scaled = (x - min) / (max - min)`.

In [3]:
from sklearn.preprocessing import MinMaxScaler

# Example data with numerical features
data = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5]})

# Min-max scaling
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)

print("Original data:")
print(data)
print("\nScaled data:")
print(data_scaled)

Original data:
   Feature1
0         1
1         2
2         3
3         4
4         5

Scaled data:
   Feature1
0      0.00
1      0.25
2      0.50
3      0.75
4      1.00
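
The scaled values above follow directly from the min-max formula. The sketch below reproduces them by hand and then applies a scaler fitted on one dataset to new values, which is the usual pattern for train/test splits. The names `train`, `new`, `manual`, and `new_scaled` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5]})
new = pd.DataFrame({'Feature1': [0, 3, 6]})  # hypothetical unseen data

# Manual min-max scaling: (x - min) / (max - min)
manual = (train - train.min()) / (train.max() - train.min())

# Fit on the training data only, then reuse its min and max on new data
scaler = MinMaxScaler().fit(train)
new_scaled = scaler.transform(new)

print(manual)
print(new_scaled)
```

Values in `new` that fall outside the training range map outside [0, 1], which is expected when the minimum and maximum come from the training data alone.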


#### b. Standardization (Z-score Scaling)
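Standardization rescales each feature to have a mean of 0 and a standard deviation of 1, using `z = (x - mean) / std`. `StandardScaler` uses the population standard deviation (dividing by n rather than n - 1).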

In [4]:
from sklearn.preprocessing import StandardScaler

# Example data with numerical features
data = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5]})

# Z-score scaling
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)

print("Original data:")
print(data)
print("\nScaled data:")
print(data_scaled)

Original data:
   Feature1
0         1
1         2
2         3
3         4
4         5

Scaled data:
   Feature1
0 -1.414214
1 -0.707107
2  0.000000
3  0.707107
4  1.414214
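
A common source of confusion is that a hand-computed z-score in pandas may not match `StandardScaler`, because `Series.std()` defaults to the sample standard deviation (`ddof=1`). The following sketch computes both variants; the names `z_population` and `z_sample` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5]})
x = data['Feature1']

# Population standard deviation (ddof=0), the convention StandardScaler uses
z_population = (x - x.mean()) / x.std(ddof=0)

# pandas default is the sample standard deviation (ddof=1)
z_sample = (x - x.mean()) / x.std()

print(z_population.round(6).tolist())
print(z_sample.round(6).tolist())
```

The two results differ only by a constant factor, so the choice rarely affects a model, but it explains small mismatches when checking results by hand.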


### 3. Feature Engineering

Feature engineering involves creating new features from existing ones or transforming features to improve the performance of machine learning models. Common techniques include:

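#### a. Creating New Features

New features can be derived from existing columns. The example below adds the square of a feature, which lets a linear model capture a quadratic relationship.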

In [5]:
import pandas as pd

# Example data
data = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5]})

# Creating a new feature
data['Feature2'] = data['Feature1'] ** 2

print("Data with new feature:")
print(data)

Data with new feature:
   Feature1  Feature2
0         1         1
1         2         4
2         3         9
3         4        16
4         5        25
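
New features can also combine several columns. As a minimal sketch, assuming two hypothetical measurement columns `Width` and `Height` that do not appear in the original example, the code below derives a product feature and a ratio feature.

```python
import pandas as pd

# Hypothetical measurements used only for illustration
data = pd.DataFrame({'Width': [2.0, 3.0, 4.0], 'Height': [1.0, 1.5, 8.0]})

# Interaction feature: product of two columns
data['Area'] = data['Width'] * data['Height']

# Ratio feature: assumes Height is never zero
data['Aspect_Ratio'] = data['Width'] / data['Height']

print(data)
```

Which combinations are useful depends on the domain; products and ratios are only worth adding when they correspond to a quantity the model could plausibly use.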


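#### b. Transformation

Transforming an existing feature changes its distribution. A log transformation, for example, compresses large values and can reduce right skew; it is defined only for positive values.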

In [9]:
import pandas as pd
import numpy as np

# Example data
data = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5]})

# Log transformation
data['Log_Feature1'] = np.log(data['Feature1'])

print("Log-transformed data:")
print(data)

Log-transformed data:
   Feature1  Log_Feature1
0         1      0.000000
1         2      0.693147
2         3      1.098612
3         4      1.386294
4         5      1.609438
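
`np.log` returns negative infinity for 0, so count-like features that contain zeros need a different approach. One common option, sketched here with a hypothetical `Count` column, is `np.log1p`, which computes log(1 + x), together with `np.expm1` as its inverse.

```python
import numpy as np
import pandas as pd

# Hypothetical count data that includes a zero
data = pd.DataFrame({'Count': [0, 9, 99]})

# log1p computes log(1 + x), so a count of 0 maps to 0 instead of -inf
data['Log1p_Count'] = np.log1p(data['Count'])

# expm1 computes exp(x) - 1 and recovers the original values
recovered = np.expm1(data['Log1p_Count'])

print(data)
print(recovered.round(6).tolist())
```

This only handles non-negative data; features with negative values need a different transformation.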
