Data plays a significant role in ensuring the effectiveness of ML applications. However, working with data is not always simple. Considering the variety and scale of information sources we have today, this complexity is unavoidable. 

# Data Scaling in Machine Learning

Feature scaling is an important step to take before training a machine learning model.

Feature scaling is a technique for bringing the independent features in the data onto a common scale or into a fixed range.

### Why is feature scaling important?

Machine learning algorithms tend to train better and faster when the feature values are close to each other. When the values of different features differ widely in magnitude, training takes longer and accuracy is often lower.

The accuracy of many machine learning algorithms improves considerably with scaled data, and some of them even require it. Feature scaling is the process used to put the data on such a common scale.


When working with ML algorithms, we often need to bring all feature values into the same range to get better results and faster training. For example, suppose we have two features, Age (24, 32, 28, 21) and Salary (30k, 42k, 48k, 21k). As we can see, there is a huge difference between their ranges, so we should scale both features either to the range 0 to 1 (normalization) or to a mean of 0 and a standard deviation of 1 (standardization).
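As a minimal sketch of the two options, the Age and Salary values from the example can be scaled by hand with the standard library. The helper names `min_max_scale` and `standardize` are illustrative, not part of any library.

```python
import statistics

# Feature values taken from the Age/Salary example in the text.
ages = [24, 32, 28, 21]
salaries = [30_000, 42_000, 48_000, 21_000]

def min_max_scale(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages_norm = min_max_scale(ages)
salaries_norm = min_max_scale(salaries)
ages_std = standardize(ages)
salaries_std = standardize(salaries)

print("Age, normalized:      ", ages_norm)
print("Salary, normalized:   ", salaries_norm)
print("Age, standardized:    ", ages_std)
print("Salary, standardized: ", salaries_std)
```

After either transformation both features live on a comparable scale, so neither dominates simply because of its units.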

### When to use feature scaling?

###### Required
Scaling is required when we use machine learning algorithms that rely on gradient calculations, such as linear/logistic regression and artificial neural networks.

It is also required for any ML algorithm that involves Euclidean distance, such as KNN and K-Means clustering, and for deep learning techniques trained with gradient descent (where we search a loss surface, such as a parabolic curve, for its minimum point).


###### Not Required
Scaling is not required for tree-based algorithms such as Decision Trees, Random Forest and XGBoost. Distance-based algorithms such as K-Means Clustering, SVM and K Nearest Neighbors do depend on the feature ranges and therefore belong to the "Required" group.

Any technique that is built on decision trees does not need scaled data.

###### Gradient Descent Based

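Machine learning algorithms like linear regression, logistic regression, neural networks, etc. that use gradient descent as an optimization technique require data to be scaled. Features on very different scales produce gradient steps of very different sizes, which slows down convergence.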
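The effect on gradient descent can be sketched with a tiny one-feature linear regression. Everything here is an assumption made for illustration: the target values are synthetic (`3 * age + 5`), and the step sizes are hand-picked. The raw feature needs a much smaller step size to stay stable.

```python
import statistics

def fit_with_gradient_descent(xs, ys, learning_rate, steps):
    """Fit y = w*x + b by plain gradient descent and return the final MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

ages = [24, 32, 28, 21]
targets = [3 * a + 5 for a in ages]  # synthetic, exactly linear targets

# Standardize the feature (mean 0, standard deviation 1).
mean, std = statistics.fmean(ages), statistics.pstdev(ages)
ages_std = [(a - mean) / std for a in ages]

# Same number of steps; the raw feature only tolerates a tiny step size.
mse_raw = fit_with_gradient_descent(ages, targets, learning_rate=0.001, steps=200)
mse_scaled = fit_with_gradient_descent(ages_std, targets, learning_rate=0.1, steps=200)

print("MSE after 200 steps, raw feature:         ", mse_raw)
print("MSE after 200 steps, standardized feature:", mse_scaled)
```

With the standardized feature the loss surface is close to round, so a single step size works well in every direction and the fit converges within the 200 steps. The raw feature is still noticeably off after the same number of steps.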

###### Distance-Based

Distance algorithms like KNN, K-means, and SVM are most affected by the range of features. This is because behind the scenes they are using distances between data points to determine their similarity.
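A short sketch of why distances are sensitive to feature ranges, using the Age and Salary values from the earlier example as (age, salary) points. The helper `min_max_columns` is a hypothetical name used only for this illustration.

```python
import math

# (age, salary) pairs built from the Age/Salary example.
people = [(24, 30_000), (32, 42_000), (28, 48_000), (21, 21_000)]

def min_max_columns(rows):
    """Min-max scale each column of a list of rows to the range [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        scaled_columns.append([(v - lo) / (hi - lo) for v in column])
    return list(zip(*scaled_columns))

scaled_people = min_max_columns(people)

# Euclidean distance between the first two people, before and after scaling.
raw_distance = math.dist(people[0], people[1])
scaled_distance = math.dist(scaled_people[0], scaled_people[1])

print("Raw distance:   ", raw_distance)     # almost exactly the salary gap of 12000
print("Scaled distance:", scaled_distance)  # age and salary both contribute
```

On the raw data the 8-year age gap is invisible next to the 12,000 salary gap. After scaling, both features contribute to the distance.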

###### Tree-Based

Tree-based algorithms are fairly insensitive to the scale of the features. A decision tree splits a node on a single feature at a time, so the split depends only on the ordering of that feature's values, not on their magnitude. Examples: Decision Tree, Random Forest, XGBoost, etc.
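The scale invariance can be illustrated with a single decision stump. This is a simplified sketch, not how any particular library implements trees: the labels are made up, and `best_split_left` is a hypothetical helper that picks the threshold with the lowest weighted Gini impurity.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_left(xs, ys):
    """Return the set of row indices sent to the left by the best threshold."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [i for i, x in enumerate(xs) if x <= threshold]
        right = [i for i, x in enumerate(xs) if x > threshold]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(xs)
        if best is None or score < best[0]:
            best = (score, set(left))
    return best[1]

salaries = [30_000, 42_000, 48_000, 21_000]
labels = [0, 1, 1, 0]  # made-up class labels for illustration

# Min-max scale the same feature to [0, 1].
lo, hi = min(salaries), max(salaries)
salaries_scaled = [(s - lo) / (hi - lo) for s in salaries]

left_raw = best_split_left(salaries, labels)
left_scaled = best_split_left(salaries_scaled, labels)
print(left_raw, left_scaled)  # the same rows go left in both cases
```

The threshold value itself changes with the scale, but the rows that end up on each side of the split do not.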

# Methods of Feature Scaling

Normalization and standardization are the two main methods for scaling data. Both are widely used with algorithms that require scaling.

If the dataset is not normally distributed, or its features are measured in different units, the data points need to be brought onto a common, standard scale. Let’s get into it.
We have two ways to do it.

• Standardization
• Normalization

### Normalized Data Vs Standardized Data

Normalization is used when the data doesn't have Normal (Gaussian) distribution whereas Standardization is used on data having Normal (Gaussian) distribution.

Normalization scales in a range of [0,1] or [-1,1]. Standardization is not bounded by range.

Normalization is highly affected by outliers, because a single extreme value sets the minimum or maximum. Standardization is less affected by outliers.

Normalization is considered when the algorithms do not make assumptions about the data distribution. Standardization is used when algorithms make assumptions about the data distribution.

## Normalization:-

Normalization is used to scale the feature values to the range 0 to 1.

Normalization is typically used when the data doesn't have a Gaussian distribution.

###### Normalization Formula:-

![Normalization%20Formula.PNG](attachment:Normalization%20Formula.PNG)
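In case the image does not render, the min-max formula can be written out as follows. The symbol names are chosen here for illustration; the form matches the manual calculation `(2-1)/(30-1)` used later in this section.

$$
x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Here $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the feature, so the smallest value maps to 0 and the largest to 1.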

###### Library to import in Python

In [7]:
import pandas as pd
list1 = [1,2,3,4,5,30,12,9]
df = pd.DataFrame(list1)

In [8]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df)

In [10]:
scaled_df

array([[0.        ],
       [0.03448276],
       [0.06896552],
       [0.10344828],
       [0.13793103],
       [1.        ],
       [0.37931034],
       [0.27586207]])

###### Note:
After normalization, all values lie between 0 and 1.
The maximum value of the data is now 1
and the minimum value of the data is now 0.


In [13]:
# Manual check of the second value (2): (x - min) / (max - min)
new_value = (2-1)/(30-1)
new_value

0.034482758620689655

## Standardization (Z-Score normalization):-

Standardization is a method of feature scaling in which data values are rescaled using the mean and standard deviation of the feature, so the result is centered on 0 rather than confined to a fixed range such as 0 to 1.

Standardization is used to transform the data so that it has a mean of 0 and a standard deviation of 1.

Standardization is also known as Z-score normalization.

Standardization is the process of putting different variables on the same scale. 
This process allows you to compare scores between different types of variables.

Standardization is used on data having Gaussian distribution. 

###### Example

A hot dog stand has mean daily sales of $420 with a standard deviation of $50. The income has a normal distribution. What is the standardized value for daily sales of $520?

###### Standardization Formula:-

![STatistics3.JPG](attachment:STatistics3.JPG)
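Written out, the z-score formula is the following, where $\mu$ is the mean and $\sigma$ the standard deviation of the feature (symbol names chosen here for illustration):

$$
z = \frac{x - \mu}{\sigma}
$$

Applied to the hot dog stand example, with mean daily sales of 420, a standard deviation of 50 and an observed value of 520:

$$
z = \frac{520 - 420}{50} = 2
$$

So daily sales of 520 dollars lie two standard deviations above the mean.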

Standardization is used for feature scaling when your data follows a Gaussian distribution. It is most useful for:

• Optimization algorithms such as gradient descent
• Clustering models and distance-based classifiers like K-Nearest Neighbors
• Data with high-variance ranges, such as in Principal Component Analysis, where high-variance features would otherwise dominate the first principal components (components with maximum variance)

## When should you standardize your data, and why?

###### 1. BEFORE PRINCIPAL COMPONENT ANALYSIS (PCA)

In principal component analysis, features with high variances or wide ranges get more 
weight than those with low variances, and consequently, they end up illegitimately dominating the first principal components (components with maximum variance).
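A rough sketch of this effect, using the Age and Salary values from the earlier example. For two features, the first principal direction can be computed in closed form from the 2x2 covariance matrix. The helper names are hypothetical, and the closed form assumes the two features have non-zero covariance.

```python
import math
import statistics

ages = [24, 32, 28, 21]
salaries = [30_000, 42_000, 48_000, 21_000]

def standardize(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def first_principal_direction(xs, ys):
    """Unit eigenvector for the largest eigenvalue of the 2x2 covariance matrix.

    Assumes the two features are correlated (non-zero covariance).
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    largest = (var_x + var_y) / 2 + math.sqrt(((var_x - var_y) / 2) ** 2 + cov ** 2)
    vx, vy = cov, largest - var_x
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

raw_direction = first_principal_direction(ages, salaries)
scaled_direction = first_principal_direction(standardize(ages), standardize(salaries))

print("Raw data:          ", raw_direction)     # points almost entirely along salary
print("Standardized data: ", scaled_direction)  # both features weighted equally
```

On the raw data the first component is essentially the salary axis. After standardization, both features receive equal weight.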

###### 2. BEFORE CLUSTERING

Clustering models are distance-based algorithms. In order to measure similarities between 
observations and form clusters they use a distance metric. So, features with high ranges will 
have a bigger influence on the clustering. 

###### 3. BEFORE K-NEAREST NEIGHBORS (KNN)

K-nearest neighbors is a distance-based classifier that classifies new observations based on 
similarity measures (e.g., distance metrics) with labeled observations of the training set. 
Standardization makes all variables contribute equally to the similarity measures.

###### 4. BEFORE SUPPORT VECTOR MACHINE (SVM)

A support vector machine tries to maximize the distance between the separating plane and the 
support vectors. If one feature has very large values, it will dominate over other features 
when calculating the distance. Standardization gives all features the same influence on the 
distance metric.

###### 5. BEFORE MEASURING VARIABLE IMPORTANCE IN REGRESSION MODELS

You can measure variable importance in regression analysis by fitting a regression model 
using the standardized independent variables and comparing the absolute value of their 
standardized coefficients. But, if the independent variables are not standardized, comparing 
their coefficients becomes meaningless.


###### 6. BEFORE LASSO AND RIDGE REGRESSIONS

Lasso and ridge regressions place a penalty on the magnitude of the coefficients associated 
with each variable, and the scale of variables will affect how much of a penalty will be 
applied on their coefficients. The coefficients of variables with a large variance are small and 
thus less penalized. Therefore, standardization is required before fitting both regressions.
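The dependence of the penalty on feature scale can be sketched for a ridge regression with one centered feature, where the closed-form slope is `sum(x*y) / (sum(x*x) + alpha)`. The ratio to the unpenalized slope then depends only on the feature's scale. The penalty strength `alpha = 1000` and the helper names are assumptions for illustration.

```python
import statistics

def ridge_shrinkage(xs, alpha):
    """Ratio of the ridge slope to the least-squares slope for one centered feature.

    The ridge slope is sum(x*y) / (sum(x*x) + alpha), so its ratio to the
    unpenalized slope sum(x*y) / sum(x*x) is sum(x*x) / (sum(x*x) + alpha).
    """
    mean = statistics.fmean(xs)
    sum_sq = sum((x - mean) ** 2 for x in xs)
    return sum_sq / (sum_sq + alpha)

def standardize(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# The same salary feature expressed in two different units.
salary_dollars = [30_000, 42_000, 48_000, 21_000]
salary_thousands = [30, 42, 48, 21]
alpha = 1000.0  # hypothetical penalty strength

shrink_dollars = ridge_shrinkage(salary_dollars, alpha)
shrink_thousands = ridge_shrinkage(salary_thousands, alpha)
shrink_std_dollars = ridge_shrinkage(standardize(salary_dollars), alpha)
shrink_std_thousands = ridge_shrinkage(standardize(salary_thousands), alpha)

print("Dollars:   ", shrink_dollars)    # close to 1, almost no shrinkage
print("Thousands: ", shrink_thousands)  # about 0.3, heavy shrinkage
print("Standardized:", shrink_std_dollars, shrink_std_thousands)  # identical
```

The same information is penalized very differently depending on its units. After standardization, the penalty no longer depends on the choice of units.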

## Cases When Standardization Is Not Needed

###### LOGISTIC REGRESSIONS AND TREE-BASED MODELS

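Tree-based algorithms such as decision trees, random forests, and gradient boosting are not 
sensitive to the magnitude of variables, so standardization is not needed before fitting these 
kinds of models. The predictions of an unregularized logistic regression are likewise unaffected 
by feature scale, since the coefficients simply rescale. Standardization still helps, however, 
when the model is fitted with gradient descent or with a regularization penalty.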

###### Library to import in Python

In [14]:
import pandas as pd
list1 = [1,2,3,4,5,30,12,9]
my_sum = sum(list1)
my_count = len(list1)
my_avg = my_sum / my_count
# Population variance and standard deviation (divide by n, as StandardScaler does)
my_variance = sum([((x - my_avg) ** 2) for x in list1]) / len(list1)
st_dev = my_variance ** 0.5
print(my_sum, 'is the sum of all values and', my_count, 'is the count. The average is', my_avg)
df = pd.DataFrame(list1)

66 is the sum of all values and 8 is the count. The average is 8.25


In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df

array([[-0.81343943],
       [-0.70124089],
       [-0.58904235],
       [-0.47684381],
       [-0.36464526],
       [ 2.4403183 ],
       [ 0.42074454],
       [ 0.08414891]])
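The StandardScaler output above can be reproduced by hand as a sanity check. This sketch uses only the standard library and the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
import statistics

list1 = [1, 2, 3, 4, 5, 30, 12, 9]

mean = statistics.fmean(list1)   # 8.25
std = statistics.pstdev(list1)   # population standard deviation

# z-score of every value: (x - mean) / std
z_scores = [(x - mean) / std for x in list1]

for x, z in zip(list1, z_scores):
    print(x, round(z, 8))
```

The first value, 1, maps to roughly -0.8134 and the outlier, 30, to roughly 2.4403, matching the array above.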

# Normal Distribution:-

### Gaussian/Normal Distribution curve

The data is spread symmetrically around the mean.

About 68%, 95%, and 99.7% of the data lies within 1, 2, and 3 standard deviations of the mean, respectively (the Empirical Rule).

The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.

Any normal distribution can be standardized by converting its values into z-scores. Z-scores tell you how many standard deviations from the mean each value lies.
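The percentages of the Empirical Rule can be checked numerically with the standard normal distribution. This is a small sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8).

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The three shares come out at about 0.6827, 0.9545 and 0.9973, which is where the rounded 68%, 95% and 99.7% figures come from.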


![Statistics2.JPG](attachment:Statistics2.JPG)

### Selection of Feature Scaling Method

Normalization is often preferred for neural networks, since it does not assume any particular data distribution.

Standardization is preferred when the data follows a Gaussian distribution.

Standardization is preferred over normalization when the data contains many outliers, since it is less affected by them.

![Feature%20Scaling%201.PNG](attachment:Feature%20Scaling%201.PNG)

### T-Test (Student Test)

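The t-statistic is another way of rescaling a value relative to the spread of the data. However, it belongs to hypothesis testing rather than feature scaling, so we don't use this test for scaling.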

###### Formula

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
$$

where $\bar{x}$ is the mean of the sample, $\mu_0$ is the assumed mean, $s$ is the standard deviation and $n$ is the number of observations.
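A minimal sketch of the formula, reusing the sample values from the scaling examples. The assumed mean of 5 is a made-up value for illustration, and the sample standard deviation (dividing by n - 1) is used, as is usual for a t-statistic.

```python
import math
import statistics

sample = [1, 2, 3, 4, 5, 30, 12, 9]
assumed_mean = 5  # hypothetical value, chosen only for illustration

sample_mean = statistics.fmean(sample)
sample_std = statistics.stdev(sample)  # sample standard deviation (n - 1)

# t = (mean of sample - assumed mean) / (standard deviation / sqrt(n))
t_statistic = (sample_mean - assumed_mean) / (sample_std / math.sqrt(len(sample)))
print(round(t_statistic, 4))
```

For this sample the statistic is roughly 0.96, which means the sample mean lies a little under one standard error above the assumed mean.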

###### When to use Z test or T test