<a href="https://colab.research.google.com/github/samvillasmith/EDA/blob/main/Feature_Scaling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Scaling

Feature scaling is a technique used to standardize the range of independent variables or features of data. It's a crucial step in data preprocessing for many machine learning algorithms. The two main types of feature scaling are Normalization (Min-Max Scaling) and Standardization (Z-Score Scaling).

This notebook will demonstrate both.

In [20]:
pip install -U scikit-learn



In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [3]:
df = pd.read_csv('/content/Churn_Modelling_missing.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           9850 non-null   object 
 6   Age              9850 non-null   float64
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Normalization and standardization are two common techniques used in data preprocessing, particularly for machine learning algorithms that are sensitive to the scale of the input features. Here's a breakdown of each:

### Normalization (Min-Max Scaling):

- What it does: Rescales each feature to a fixed range, usually 0 to 1.
- How it works: Subtracts the feature's minimum from each value, then divides by the range (maximum minus minimum).
- Formula: `(x - min) / (max - min)`
- When to use: Useful when features have different ranges and you want them on a common, bounded scale. Because the transform is linear, the ordering and relative spacing of values within each feature are preserved. It is often used with distance-based algorithms such as K-Nearest Neighbors and with Neural Networks.
- Effect on outliers: Sensitive to outliers. A single extreme value sets the minimum or maximum, which compresses the majority of the data into a narrow part of the range.
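
The formula above can be checked by hand against `MinMaxScaler`. This is a minimal sketch on a made-up array (`values` is hypothetical and not part of the churn dataset), chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature (not from the dataset)
values = np.array([10.0, 20.0, 30.0, 50.0])

# Apply the formula directly: (x - min) / (max - min)
manual = (values - values.min()) / (values.max() - values.min())

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
sklearn_scaled = MinMaxScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(manual)                               # values: 0, 0.25, 0.5, 1
print(np.allclose(manual, sklearn_scaled))  # True
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.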

### Standardization (Z-Score Scaling):

- What it does: Rescales each feature to have a mean of 0 and a standard deviation of 1.
- How it works: Subtracts the feature's mean from each value, then divides by the standard deviation.
- Formula: `(x - mean) / standard deviation`
- When to use: Useful when features have different means and spreads and you want them centered and on a comparable scale. Standardization changes only the location and scale of a feature, not the shape of its distribution, so skewed data stays skewed. It is often used with Support Vector Machines, Linear Regression, and Logistic Regression.
- Effect on outliers: Also affected by outliers, because extreme values pull the mean and inflate the standard deviation. It is usually less sensitive than normalization, since its scale depends on all values rather than on the two extremes alone, and the output is not forced into a fixed range.
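
The z-score formula can be reproduced in the same way. This is a small sketch on a hypothetical array (`values` is made up for illustration). One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature (not from the dataset)
values = np.array([10.0, 20.0, 30.0, 40.0])

# Apply the formula directly: (x - mean) / standard deviation
# np.std defaults to ddof=0 (population std), matching StandardScaler
manual = (values - values.mean()) / values.std()

sklearn_scaled = StandardScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(manual.round(4))
print(np.isclose(manual.mean(), 0.0), np.isclose(manual.std(), 1.0))
print(np.allclose(manual, sklearn_scaled))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so z-scores computed with pandas will differ slightly from `StandardScaler` output unless `ddof=0` is passed.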

### Key Differences

- Range: Normalization scales to a fixed range (e.g., 0 to 1), while standardization produces a mean of 0 and a standard deviation of 1 with no fixed bounds.
- Outliers: Both are affected by outliers. Normalization is typically more sensitive, because its scale depends only on the minimum and maximum.
- Distribution: Neither technique changes the shape of a feature's distribution, and neither requires the data to be Gaussian (normal). Z-scores are simply easiest to interpret when the feature is roughly bell-shaped.

In summary, the choice between normalization and standardization depends on the specific dataset, the type of algorithm you're using, and whether you need values in a bounded range or features centered at 0 with unit variance.
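
To make the outlier discussion concrete, here is a rough experiment on synthetic data (the arrays `typical` and `with_outlier` and the helper `typical_spread` are hypothetical, introduced only for this sketch). It fits each scaler with and without a single extreme value and measures how much width the typical values occupy after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature: 100 typical values, then the same values plus one outlier
typical = np.arange(1.0, 101.0).reshape(-1, 1)
with_outlier = np.vstack([typical, [[1000.0]]])

def typical_spread(scaler_cls, fit_data):
    """Fit on fit_data, then return the width the typical values occupy after scaling."""
    scaled = scaler_cls().fit(fit_data).transform(typical)
    return float(scaled.max() - scaled.min())

shrink = {}
for scaler_cls in (MinMaxScaler, StandardScaler):
    before = typical_spread(scaler_cls, typical)
    after = typical_spread(scaler_cls, with_outlier)
    shrink[scaler_cls.__name__] = before / after
    print(f"{scaler_cls.__name__}: spread {before:.3f} -> {after:.3f} "
          f"(shrinks by {shrink[scaler_cls.__name__]:.1f}x)")
```

On this particular data the min-max spread shrinks by roughly 10x, while the z-score spread shrinks by roughly 3.4x. The exact ratios depend on the sample size and on how extreme the outlier is, so treat this as an illustration rather than a general rule.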

In [7]:
df.describe().round(2)

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 9850.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.57 | 650.53 | 38.92 | 5.01 | 76485.89 | 1.53 | 0.71 | 0.52 | 100090.24 | 0.2 |
| std | 2886.9 | 71936.19 | 96.65 | 10.49 | 2.89 | 62397.41 | 0.58 | 0.46 | 0.5 | 57510.49 | 0.4 |
| min | 1.0 | 15565701.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628528.25 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690738.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.92 | 0.0 |
| 75% | 7500.25 | 15753233.75 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.25 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [8]:
# Keep only the two columns to scale; .copy() avoids modifying df later
new_df = df[['Age', 'Tenure']].copy()

In [9]:
new_df.head(5)

| | Age | Tenure |
|---|---|---|
| 0 | 42.0 | 2 |
| 1 | 41.0 | 1 |
| 2 | 42.0 | 8 |
| 3 | 39.0 | 1 |
| 4 | 43.0 | 2 |


In [10]:
# Impute the 150 missing Age values with the column mean
new_df['Age'] = new_df['Age'].fillna(new_df['Age'].mean())

In [13]:
scaler = MinMaxScaler()
# fit_transform returns a NumPy array (columns: Age, Tenure), not a DataFrame
normalized_df = scaler.fit_transform(new_df)
print(normalized_df)

[[0.32432432 0.2       ]
 [0.31081081 0.1       ]
 [0.32432432 0.8       ]
 ...
 [0.24324324 0.7       ]
 [0.32432432 0.3       ]
 [0.13513514 0.4       ]]
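
Since `fit_transform` returns a plain NumPy array, the column names are lost. One possible way to get them back is to wrap the result in a new DataFrame. The sketch below uses a small hypothetical frame (`sample_df`) that stands in for `new_df`; it includes the Age extremes (18 and 92) and Tenure extremes (0 and 10) reported by `describe()` so that the first row scales the same way as in the output above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for new_df, including the min/max seen in describe()
sample_df = pd.DataFrame({"Age": [42.0, 41.0, 18.0, 92.0],
                          "Tenure": [2, 1, 0, 10]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(sample_df)

# Rebuild a DataFrame so column names and the row index are preserved
scaled_df = pd.DataFrame(scaled, columns=sample_df.columns, index=sample_df.index)

# The fitted scaler records the per-column minimum and maximum it learned
print(scaler.data_min_, scaler.data_max_)
print(scaled_df.round(4))
```

For the first row, Age 42 becomes (42 - 18) / (92 - 18) ≈ 0.3243 and Tenure 2 becomes 2 / 10 = 0.2, matching the first row printed above.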


### Smaller Example

In [15]:
x_array = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 2])

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array.reshape(-1, 1))

print(x_scaled)

[[0.73939394]
 [0.87272727]
 [0.6       ]
 [0.8       ]
 [1.        ]
 [0.        ]]


In [19]:
scaler_standard = StandardScaler()
x_scaled_standard = scaler_standard.fit_transform(x_array.reshape(-1, 1))

print(x_scaled_standard)

[[ 0.21898965]
 [ 0.63194156]
 [-0.2127328 ]
 [ 0.40669507]
 [ 1.02612294]
 [-2.07101641]]
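
As a quick sanity check on the result above, the fitted `StandardScaler` exposes the statistics it learned (`mean_` and `scale_`), and `inverse_transform` maps scaled values back to the original units. The sketch below repeats the same `x_array` so that it runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_array = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 2])

scaler_standard = StandardScaler()
x_scaled_standard = scaler_standard.fit_transform(x_array.reshape(-1, 1))

# Statistics learned during fit: per-feature mean and (population) std
print(scaler_standard.mean_)
print(scaler_standard.scale_)

# Undo the scaling to recover the original values
restored = scaler_standard.inverse_transform(x_scaled_standard).ravel()
print(np.allclose(restored, x_array))  # True
```

The value 2 sits about two standard deviations below the mean, which is why it maps to roughly -2.07 in the output above.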
