# Day 34 — Data Normalization and Scaling

In today’s post, we’ll focus on data normalization and scaling, two essential preprocessing techniques for machine learning. Whether you're building a predictive model or performing exploratory data analysis, these techniques put your features on comparable scales, which can improve model performance and keeps features with large numeric ranges from dominating the result.

## Why Normalize and Scale Data?

Many machine learning algorithms perform better when input features have similar scales. This applies especially to distance-based models like K-Nearest Neighbors (KNN), and also to models trained with gradient-based optimization, such as linear models and neural networks. Normalization and scaling help in:

- **Improving Model Performance:** Ensuring that all features are weighted equally in algorithms sensitive to feature magnitude.

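- **Handling Skewed Data:** Keeping features with very large ranges or extreme values from dominating the model. Note that Min-Max and standard scaling are linear rescalings: they do not change the shape of a distribution, and Min-Max scaling in particular is sensitive to outliers.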

- **Speeding Up Convergence:** In gradient-based optimization algorithms, features on similar scales help models converge faster.
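To make the effect of feature magnitude concrete, here is a minimal sketch using only the standard library. The three customers and the feature ranges are made-up values for illustration, and `min_max` and `rescale` are hypothetical helper names, not part of any library.

```python
import math

# Hypothetical customers as (annual income, age); values are made up for illustration.
a = (30000, 22)
b = (31000, 60)  # close to `a` in income, far from it in age
c = (50000, 23)  # close to `a` in age, far from it in income


def min_max(value, lo, hi):
    """Rescale a value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def rescale(customer):
    # Assumed feature ranges: income 30,000 to 120,000 and age 22 to 60.
    income, age = customer
    return (min_max(income, 30000, 120000), min_max(age, 22, 60))


# Euclidean distances on the raw values
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

# Euclidean distances after rescaling both features to [0, 1]
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(f"raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, the 38-year age gap between `a` and `b` barely registers next to the income differences, so `b` looks like the nearer neighbor. After rescaling, both features contribute and `c` becomes the nearer one.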

## Normalization vs. Scaling

- **Normalization** typically scales your data to a range of [0, 1] or [-1, 1], ensuring all features have similar magnitudes.

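- **Scaling** (here meaning standardization, also called Z-score scaling) transforms each feature so it has a mean of 0 and a standard deviation of 1. Unlike normalization, the result is not bounded to a fixed range.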
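Both definitions can be written out by hand. The sketch below uses only the standard library and a small list of ages chosen for illustration. It assumes the population standard deviation (`pstdev`), which is the convention Scikit-learn's `StandardScaler` follows.

```python
from statistics import mean, pstdev

ages = [22, 35, 45, 50, 60]  # illustrative values

# Min-Max normalization: x' = (x - min) / (max - min)
lo, hi = min(ages), max(ages)
normalized = [(x - lo) / (hi - lo) for x in ages]

# Standardization: z = (x - mean) / std, using the population standard deviation
mu, sigma = mean(ages), pstdev(ages)
standardized = [(x - mu) / sigma for x in ages]

print("normalized:  ", [round(v, 3) for v in normalized])
print("standardized:", [round(v, 3) for v in standardized])
```

The normalized values start at 0 and end at 1, while the standardized values are centered on 0 and can fall outside [-1, 1].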

## Tutorial: Using Normalization Techniques in Python

We’ll explore how to apply normalization and scaling to a dataset using the Pandas and Scikit-learn libraries.

### Step 1: Importing the Necessary Libraries

Make sure you have Pandas and Scikit-learn installed. If not, you can install them using:

```bash
pip install pandas scikit-learn
```

Then, import the required libraries:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
```

### Step 2: Loading a Sample Dataset

Let’s use a sample dataset of customer data to demonstrate normalization techniques.

```python
# Sample dataset of customer attributes
data = pd.DataFrame({
    'Annual Income': [30000, 50000, 70000, 100000, 120000],
    'Age': [22, 35, 45, 50, 60],
    'Credit Score': [650, 720, 800, 680, 710]
})

# Displaying the dataset
print(data)
```

### Step 3: Applying Min-Max Normalization

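Min-Max Scaling rescales each feature independently to the range [0, 1] by subtracting the feature's minimum and dividing by its range (maximum minus minimum). It is a good fit when a bounded range is needed and the data has no extreme outliers, since a single outlier compresses all other values into a narrow band.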

```python
# Applying Min-Max Scaling
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

# Converting back to DataFrame for easy viewing (reusing the original column names)
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)
print("Normalized Data (Min-Max Scaling):")
print(normalized_df)
```
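If a range other than [0, 1] is needed, such as the [-1, 1] range mentioned earlier, `MinMaxScaler` accepts a `feature_range` argument. The following sketch repeats the sample dataset so it runs on its own. It also shows the per-feature minimum and maximum the scaler learned, and that `inverse_transform` recovers the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample dataset as in Step 2, repeated so this block is self-contained
data = pd.DataFrame({
    'Annual Income': [30000, 50000, 70000, 100000, 120000],
    'Age': [22, 35, 45, 50, 60],
    'Credit Score': [650, 720, 800, 680, 710]
})

# Rescale every feature to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)

# Per-feature minimum and maximum learned during fit
print("Learned minimums:", scaler.data_min_)
print("Learned maximums:", scaler.data_max_)

# Map the scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print("Round trip matches original:", np.allclose(restored, data.to_numpy()))
```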

### Step 4: Applying Standard Scaling

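Standard Scaling transforms each feature so that it has a mean of 0 and a standard deviation of 1. It does not make the data normally distributed; it only shifts and rescales it. It’s useful for algorithms that are sensitive to feature scale or work best with centered inputs, such as regularized linear models, support vector machines, and PCA.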

```python
# Applying Standard Scaling (Z-Score Normalization)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Converting back to DataFrame for easy viewing (reusing the original column names)
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)
print("Scaled Data (Standard Scaling):")
print(scaled_df)
```
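As a quick sanity check, the sketch below confirms that each standardized column has a mean of about 0 and a standard deviation of about 1. It repeats the sample dataset so it runs on its own. One detail worth noting: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas Pandas' `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), so a manual calculation needs `ddof=0` to match.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample dataset as in Step 2, repeated so this block is self-contained
data = pd.DataFrame({
    'Annual Income': [30000, 50000, 70000, 100000, 120000],
    'Age': [22, 35, 45, 50, 60],
    'Credit Score': [650, 720, 800, 680, 710]
})

scaled = StandardScaler().fit_transform(data)

# Column-wise mean and standard deviation (NumPy's std defaults to ddof=0)
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print("Means:", np.round(col_means, 6))
print("Standard deviations:", np.round(col_stds, 6))

# Manual Z-score with the population standard deviation
manual = (data - data.mean()) / data.std(ddof=0)
matches_manual = np.allclose(manual.to_numpy(), scaled)
print("Manual calculation matches StandardScaler:", matches_manual)
```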

## Use Case: Preprocessing Data for Machine Learning

Let’s look at how data normalization and scaling can help improve machine learning model performance. Imagine you’re building a K-Nearest Neighbors (KNN) model, which is sensitive to feature magnitudes. Before feeding the data into the model, we need to normalize it.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Sample classification dataset with a target variable
data['Target'] = [1, 0, 1, 0, 1]

# Splitting the dataset into training and test sets
X = data.drop('Target', axis=1)
y = data['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalizing the training data using Min-Max Scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Applying KNN with normalized data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
score = knn.score(X_test_scaled, y_test)

print(f"KNN Accuracy with Normalized Data: {score}")
```

### Explanation:

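In this use case, we used Min-Max Scaling to normalize the dataset before applying a KNN model, so that all features contribute comparably to the distance calculation. Note that the scaler is fit on the training data only and then applied to the test data, which keeps test-set statistics from leaking into training. With only five rows, the reported accuracy is illustrative and says little about how much scaling actually helps.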
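To see the effect on a dataset large enough for the accuracy to mean something, the sketch below compares KNN with and without Min-Max Scaling. The Wine dataset bundled with Scikit-learn is swapped in here purely as an assumption for illustration; it is not part of the original example. The scaler and the model are combined with `make_pipeline`, so the scaler is fit on the training data only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled Wine dataset stands in for a real classification problem
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# KNN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=3)
raw_knn.fit(X_train, y_train)
raw_acc = raw_knn.score(X_test, y_test)

# KNN with Min-Max Scaling; the pipeline fits the scaler on the training data only
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)
scaled_acc = scaled_knn.score(X_test, y_test)

print(f"KNN accuracy on raw features:    {raw_acc:.3f}")
print(f"KNN accuracy on scaled features: {scaled_acc:.3f}")
```

The Wine features span very different numeric ranges, so on the raw data the distance calculation is dominated by the largest-valued feature. The exact accuracies depend on the split and on the choice of `n_neighbors`.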

## Conclusion

In today’s post, we explored data normalization and scaling, key preprocessing steps for machine learning. By bringing features to a common scale, these techniques keep large-valued features from dominating and help scale-sensitive models train and predict more reliably.

### Key Takeaways:

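- Normalization (Min-Max Scaling) is useful when features need a bounded range, for example as inputs to neural networks, but it is sensitive to outliers.

- Scaling (Z-score normalization) centers each feature at 0 with a standard deviation of 1. It does not make data follow a normal distribution, but it helps scale-sensitive models such as regularized linear regression and models trained with gradient descent.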

- Preprocessing your data before training your machine learning models is a critical step to improve model accuracy and convergence.

Stay tuned for tomorrow’s post, where we’ll dive into advanced feature engineering!