# Module 1: Data Analysis and Data Preprocessing

## Section 2: Feature scaling and normalization

### Part 6: PowerTransformer in scikit-learn

In this part, we explore the PowerTransformer class in scikit-learn, a preprocessing transformer that applies a power transformation to make data more Gaussian-like.

### 6.1 Understanding PowerTransformer

PowerTransformer applies a monotonic power transformation, fitted separately to each feature, that reduces skewness and stabilizes variance so the data becomes more Gaussian-like. It is most useful for heavily skewed features, especially those with long tails or extreme values.

PowerTransformer supports two types of power transformations:

- Yeo-Johnson Transformation: An extension of the Box-Cox transformation that handles zero, positive, and negative values.
- Box-Cox Transformation: Applicable only to strictly positive data. Zeros and negative values are not allowed.
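
For reference, the two transformations are usually written as follows. This is the standard textbook form, with notation chosen here for illustration: x is a single feature value and λ is the power parameter.

$$
x^{(\lambda)}_{\text{Box-Cox}} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
\qquad (x > 0)
$$

$$
x^{(\lambda)}_{\text{Yeo-Johnson}} =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & x \ge 0,\ \lambda \neq 0 \\
\ln(x + 1), & x \ge 0,\ \lambda = 0 \\
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & x < 0,\ \lambda \neq 2 \\
-\ln(1 - x), & x < 0,\ \lambda = 2
\end{cases}
$$

For non-negative inputs, Yeo-Johnson is the Box-Cox formula applied to x + 1, which is why it can accept zeros. The negative branch mirrors the same idea with exponent 2 − λ.
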
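
Usage of PowerTransformer:

The PowerTransformer class in scikit-learn applies either transformation. Set the method parameter to 'yeo-johnson' (the default) or 'box-cox' to choose between them. In both cases, the power parameter lambda is estimated separately for each feature by maximum likelihood, and with the default standardize=True the transformed output is also scaled to zero mean and unit variance.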
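
As a minimal sketch of the method parameter in practice, the snippet below fits both transformers on strictly positive data, reads the fitted exponents from the lambdas_ attribute, and shows that Box-Cox rejects non-positive input with a ValueError. The lognormal sample and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Illustrative data (an assumption for this sketch): strictly positive, right-skewed
rng = np.random.default_rng(0)
positive = rng.lognormal(mean=0.0, sigma=1.0, size=500).reshape(-1, 1)

# 'yeo-johnson' is the default method; 'box-cox' must be requested explicitly
yj = PowerTransformer(method='yeo-johnson').fit(positive)
bc = PowerTransformer(method='box-cox').fit(positive)

# One lambda is estimated per feature (column)
print('Yeo-Johnson lambda:', yj.lambdas_)
print('Box-Cox lambda:', bc.lambdas_)

# Shifting the data down by 1 introduces negative values
mixed = positive - 1.0

# Box-Cox requires strictly positive input, so fitting on mixed-sign data fails
try:
    PowerTransformer(method='box-cox').fit(mixed)
    box_cox_failed = False
except ValueError as err:
    box_cox_failed = True
    print('Box-Cox error:', err)

# Yeo-Johnson accepts the same mixed-sign data
mixed_yj = PowerTransformer(method='yeo-johnson').fit_transform(mixed)
print('Yeo-Johnson output shape:', mixed_yj.shape)
```

If a feature can contain zeros or negative values, Yeo-Johnson is the safer choice. Box-Cox is only an option when every value is strictly positive.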

### 6.2 Using PowerTransformer

Let's demonstrate the usage of PowerTransformer with a synthetic dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer

# Generate skewed data (exponential distribution)
np.random.seed(1)
data = np.random.exponential(1, size=1000).reshape(-1, 1)

# Create a PowerTransformer instance using Yeo-Johnson transformation
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson')
# Create a PowerTransformer instance using Box-Cox transformation
box_cox_transformer = PowerTransformer(method='box-cox')
# Fit and transform the data using Yeo-Johnson and Box-Cox transformations
transformed_data_yj = yeo_johnson_transformer.fit_transform(data)
transformed_data_bc = box_cox_transformer.fit_transform(data)

# Plot the original and transformed distributions
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.hist(data, bins=50, edgecolor='black')
plt.title('Original Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.subplot(1, 3, 2)
plt.hist(transformed_data_yj, bins=50, edgecolor='black')
plt.title('Transformed Data Distribution (Yeo-Johnson)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.subplot(1, 3, 3)
plt.hist(transformed_data_bc, bins=50, edgecolor='black')
plt.title('Transformed Data Distribution (Box-Cox)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

In this example, we applied both the Yeo-Johnson and Box-Cox transformations to the same exponentially distributed data. The first subplot shows the original, right-skewed distribution. The second and third subplots show the data after the Yeo-Johnson and Box-Cox transformations, respectively. Plotting them side by side makes it easy to compare how each transformation reshapes the distribution. Because standardize=True by default, both transformed outputs have zero mean and unit variance.
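
To complement the visual comparison with numbers, the following sketch computes the sample skewness before and after each transformation and checks that inverse_transform recovers the original values. It regenerates the same synthetic data as the example above. The use of scipy.stats.skew and the dictionary names are choices made here for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Same synthetic data as in the plotting example
np.random.seed(1)
data = np.random.exponential(1, size=1000).reshape(-1, 1)

results = {'original': data}
transformers = {}
for method in ('yeo-johnson', 'box-cox'):
    pt = PowerTransformer(method=method)
    results[method] = pt.fit_transform(data)
    transformers[method] = pt

# Sample skewness: values near 0 indicate a roughly symmetric distribution
for name, values in results.items():
    print(f'{name:>12}: skewness = {skew(values.ravel()):.3f}')

# With the default standardize=True, outputs have zero mean and unit variance
for method in ('yeo-johnson', 'box-cox'):
    values = results[method]
    print(f'{method:>12}: mean = {values.mean():.3f}, std = {values.std():.3f}')

# inverse_transform maps transformed values back to the original scale
recovered = transformers['box-cox'].inverse_transform(results['box-cox'])
print('Round trip matches original:', np.allclose(recovered, data))
```

Skewness is only one summary of shape. A histogram or a Q-Q plot remains a useful check on how Gaussian-like the result actually is.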

### 6.3 Summary

The PowerTransformer in scikit-learn transforms skewed data into a more Gaussian-like distribution, which can make it more suitable for statistical methods that assume normally distributed data. It is particularly useful when the data has long tails or extreme values. Box-Cox applies only to strictly positive data, while Yeo-Johnson also handles zeros and negative values. Trying both and comparing the results is a reasonable way to choose a transformation for a given dataset and modeling task.
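
One way to run such an experiment is sketched below, under loudly stated assumptions: the regression task is synthetic and deliberately constructed so that the target is linear in the logarithm of a skewed feature, and the model, scoring, and names are illustrative choices rather than a recommended setup. Placing PowerTransformer inside a pipeline means lambda is re-estimated on the training portion of each cross-validation fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic, illustrative task: the target is linear in log(feature) plus noise
rng = np.random.default_rng(42)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))
y = 3.0 * np.log(X[:, 0]) + rng.normal(scale=0.5, size=500)

scores = {}

# Baseline: linear regression on the raw, skewed feature
scores['none'] = cross_val_score(LinearRegression(), X, y, cv=5).mean()

# Same model with each power transformation applied first
for method in ('yeo-johnson', 'box-cox'):
    model = make_pipeline(PowerTransformer(method=method), LinearRegression())
    scores[method] = cross_val_score(model, X, y, cv=5).mean()

# cross_val_score uses the estimator's default score, which is R^2 for regressors
for name, r2 in scores.items():
    print(f'{name:>12}: mean R^2 = {r2:.3f}')
```

On real data the ranking can differ, and a transformation may not help at all, so the comparison is worth repeating for each dataset.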