# Outlier Handling
“The Outlier is an observation that deviates so much from other observations as to raise suspicion that it is produced by another mechanism.” [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980].

Statistics such as the mean and variance are easily affected by outliers. In addition, some machine learning models are sensitive to outliers, which can degrade their performance. We therefore often remove outliers from variables, depending on the algorithm we want to train.

We have already discussed how to identify outliers. In this section, we discuss how to handle them before training machine learning models.

### How to preprocess outliers?
+ Trimming: remove outliers from the data set.
+ Missing data: treat outliers as missing values and apply any missing data imputation technique.
+ Discretization: outliers are placed in the border bins together with the higher or lower values of the distribution.
+ Censoring: cap the variable distribution at a maximum and/or minimum value.
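The second and third options in the list can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the 1.5 × IQR fences used to flag outliers, the median imputation, and the choice of four equal-frequency bins are all assumptions made for the example.

```python
import pandas as pd

prices = pd.Series([100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200], name="Price")

# Assumed rule for flagging outliers: the 1.5 * IQR fences
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (prices < lower) | (prices > upper)

# Option A: treat outliers as missing values, then impute them
# (median imputation is chosen here purely for illustration)
as_missing = prices.astype(float).mask(is_outlier)
imputed = as_missing.fillna(as_missing.median())

# Option B: discretization into equal-frequency bins;
# the outlier simply falls into the top bin with the other high values
bins = pd.qcut(prices, q=4, labels=False, duplicates="drop")

print(pd.DataFrame({"Price": prices, "Imputed": imputed, "Bin": bins}))
```

With discretization the extreme value is no longer distinguishable from the other values in its bin, which is what neutralises its effect.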

# 1. Trimming/Truncation
Trimming (or truncation) removes outliers from the data set. We only need to decide on a rule for flagging outliers. This could be the Gaussian approximation for normally distributed variables or the IQR proximity rule for skewed variables.

### Gaussian approximation rule:
<img src="img/j1.png" width=700/>

### Quantile approximation rule:
<img src="img/j2.png" width=700/>

### IQR proximity rule:
<img src="img/j3.png" width=680/>

#### Advantages
+ Fast
#### Limitations
+ Observations that are outliers for one variable may contain useful information in other variables.
+ We may remove a large part of the data set if outliers are spread across many variables.
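The three rules can be sketched in code as follows. The helper `outlier_bounds` is hypothetical, and the thresholds (3 standard deviations, 1.5 × IQR, the 5th and 95th percentiles) are common defaults assumed here; they may differ from the values shown in the figures above.

```python
import pandas as pd

def outlier_bounds(s, method="iqr"):
    """Return (lower, upper) limits for a numeric Series (hypothetical helper)."""
    if method == "gaussian":
        # Assumed threshold: mean +/- 3 standard deviations
        return s.mean() - 3 * s.std(), s.mean() + 3 * s.std()
    if method == "iqr":
        # Assumed threshold: 1.5 * IQR beyond the quartiles
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        return q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    if method == "quantile":
        # Assumed threshold: 5th and 95th percentiles
        return s.quantile(0.05), s.quantile(0.95)
    raise ValueError(f"unknown method: {method}")

df = pd.DataFrame({"Price": [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]})

# Trimming: keep only the rows that fall inside the limits
lower, upper = outlier_bounds(df["Price"], method="iqr")
trimmed = df[df["Price"].between(lower, upper)]

g_lower, g_upper = outlier_bounds(df["Price"], method="gaussian")
trimmed_gaussian = df[df["Price"].between(g_lower, g_upper)]

print(f"IQR limits: [{lower}, {upper}], rows kept: {len(trimmed)}")
print(f"Gaussian limits: [{g_lower:.2f}, {g_upper:.2f}], rows kept: {len(trimmed_gaussian)}")
```

On this small sample the IQR rule removes the value 200, while the Gaussian rule keeps every row, which shows how much the result depends on the rule chosen.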

# 2. Censoring/Capping
Censoring, or capping, limits the distribution at a maximum and/or minimum value. In other words, values larger or smaller than the chosen (possibly arbitrary) limits are replaced by those limits.

Capping can be done at both ends or one end of the distribution depending on the variable and the user.

The values used to cap the distribution can be determined:
+ Arbitrarily
+ Using the IQR proximity rule
+ Using the Gaussian approximation
+ Using quantiles

#### Advantages:
+ Does not remove data
#### Limitations:
+ Distorts the variable's distribution
+ Distorts the relationships between variables
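A minimal sketch of capping with pandas, assuming the limits come from the IQR proximity rule with a factor of 1.5 (any of the other rules could be substituted):

```python
import pandas as pd

df = pd.DataFrame({"Price": [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]})

# Assumed limits: 1.5 * IQR beyond the quartiles
q1, q3 = df["Price"].quantile(0.25), df["Price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping at both ends: values outside the limits are replaced by the limits
df["Price_capped"] = df["Price"].astype(float).clip(lower=lower, upper=upper)

# Capping at one end only: leave the lower tail untouched
df["Price_capped_upper"] = df["Price"].astype(float).clip(upper=upper)

print(df)
```

All 11 rows are kept; only the value 200 changes, and it becomes the upper limit.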

# 3. Feature Scaling

As discussed earlier, the scale of the features matters when building machine learning models. Specifically:


### Feature magnitude is important because:

- The regression coefficients of a linear model are directly influenced by the scale of the variable.
- Variables with a larger magnitude/range of values will dominate variables with a smaller magnitude/range of values.
- Gradient descent converges faster when the features are on the same scale.
- Feature scaling helps reduce the time needed to find support vectors for SVMs.
- Euclidean distance is sensitive to the magnitude of the features.
- Some algorithms, such as PCA, require the features to be centred at 0.
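The point about Euclidean distance can be illustrated with a small sketch. The two features (age in years and income in currency units) and their values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical observations: [age, income]
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],   # very different age, similar income
    [26.0, 58_000.0],   # similar age, different income
])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# On the raw data the distance is driven almost entirely by income
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# After standardization both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1) = {raw_d01:.1f}, d(0,2) = {raw_d02:.1f}")
print(f"scaled: d(0,1) = {scaled_d01:.2f}, d(0,2) = {scaled_d02:.2f}")
```

Before scaling, the 35-year age gap is almost invisible next to the income difference; after scaling, the two distances are of similar size.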


### Machine learning models directly affected by feature scaling:

- Linear Regression and Logistic Regression
- Neural network
- Support vector machine (SVM)
- KNN
- K-means clustering
- Linear discriminant analysis (LDA)
- Principal component analysis (PCA)


### Feature scaling

**Feature scaling** refers to methods for normalizing the range of the independent variables in the data, in other words, methods that bring the values of all features onto the same scale. Feature scaling is often the final step in the data preprocessing pipeline, performed **just before training machine learning algorithms**.
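As a minimal sketch of scaling as the last step before training, the scaler can be chained with the model in a scikit-learn pipeline. The synthetic data, the choice of `StandardScaler` and the logistic regression model are assumptions made for illustration. Fitting the pipeline on the training split means the scaling parameters are learned from the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [1, 1000]
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling is the final preprocessing step, placed directly before the model
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# The scaler's statistics come from the training split only
learned_mean = model.named_steps["standardscaler"].mean_
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```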

We will discuss some typical feature scaling techniques:

### 1. Standardization (Z-score normalization):
Centres the variable at 0 and sets the variance to 1. The procedure is:
$$ X' = \frac{X - \text{mean}(X)}{\text{std}(X)} $$

The result of the transformation is z, referred to as the z-score, which indicates how many standard deviations a particular observation deviates from the mean. The z-score therefore gives the position of an observation within the distribution, measured in standard deviations from the mean. The sign of the z-score (+ or -) indicates whether the observation is above (+) or below (-) the mean.

<img src="img/k1.png" width=650/>

#### In short, standardization:
1. **Centres the mean at 0.**
2. **Scales the variance to 1.**
3. **Preserves the shape of the original distribution.**
4. **Minimum and maximum values vary.**
5. **Preserves outliers.**

#### When to use:
1. **It is useful when the feature has only a few outliers and none of them is so extreme that clipping is needed.**

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = {'Price': [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]}
df = pd.DataFrame(data)

# Mark whether each observation lies above (+) or below (-) the mean
df['Sign'] = np.where(df['Price'] > df['Price'].mean(), '+', '-')

# Print mean and std before scaling
print("Original Data Statistics:")
print(f'Mean = {df["Price"].mean():.2f}')
print(f'Std = {df["Price"].std():.2f}\n')

# Learn the mean and standard deviation of the variable
scaler = StandardScaler().fit(df[['Price']])

# Transform and round the scaled data
df['Price_scaled'] = scaler.transform(df[['Price']])
df['Price_scaled'] = df['Price_scaled'].round(2)

# Display the DataFrame
print("\n\nAfter Z-score normalization:\n\n", df, "\n")

# Print mean and std after scaling.
# pandas .std() defaults to the sample standard deviation (ddof=1), whereas
# StandardScaler divides by the population standard deviation (ddof=0),
# so the value printed here is sqrt(11/10) ≈ 1.05 rather than exactly 1.
print(f'Mean = {df["Price_scaled"].mean():.2f}')
print(f'Std = {df["Price_scaled"].std():.2f}\n')
```

```
Original Data Statistics:
Mean = 79.09
Std = 50.88



After Z-score normalization:

     Price Sign  Price_scaled
0     100    +          0.43
1      90    +          0.22
2      50    -         -0.60
3      40    -         -0.81
4      20    -         -1.22
5     100    +          0.43
6      50    -         -0.60
7      60    -         -0.39
8     120    +          0.84
9      40    -         -0.81
10    200    +          2.49 

Mean = -0.00
Std = 1.05
```
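The standard deviation of 1.05 reported above, rather than exactly 1, comes from two different conventions. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). A short sketch applying the formula by hand makes this visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

prices = np.array([100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200], dtype=float)

# Manual z-score using the population standard deviation (NumPy's default, ddof=0)
z_manual = (prices - prices.mean()) / prices.std()

# Same transformation with scikit-learn
z_sklearn = StandardScaler().fit_transform(prices.reshape(-1, 1)).ravel()

pop_std = z_manual.std(ddof=0)      # what StandardScaler normalises to 1
sample_std = z_manual.std(ddof=1)   # what pandas' .std() reports by default

print(f"Population std (ddof=0): {pop_std:.2f}")
print(f"Sample std (ddof=1):     {sample_std:.2f}")
```

With n = 11 observations, the sample standard deviation of the scaled data is sqrt(11/10), which is where the 1.05 comes from.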



### 2. Min/max scaling - MinMaxScaling:
Scales the variable between 0 and 1. The procedure is:
$$ X' = \frac{X - \min(X)}{\max(X) - \min(X)}$$

MinMaxScaling is a data normalization technique that transforms the values of a variable into a specific range, usually [0, 1]. It brings variables with different units of measurement onto a common scale and ensures that they all lie within a pre-selected range.

#### In short, MinMaxScaling:
1. **Minimum and maximum values are 0 and 1.**
2. **Mean varies.**
3. **Variance varies.**
4. **Preserves outliers.**

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {'Price': [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]}
df = pd.DataFrame(data)

# Print min and max before scaling
print("Original Data Statistics:")
print(f'Min = {df["Price"].min():.2f}')
print(f'Max = {df["Price"].max():.2f}\n')

# Learn the minimum and maximum of the variable
scaler = MinMaxScaler().fit(df[['Price']])

# Transform and round the scaled data
df['Price_scaled'] = scaler.transform(df[['Price']])
df['Price_scaled'] = df['Price_scaled'].round(2)

# Display the DataFrame
print("\n\nAfter MinMax scaling:\n\n", df, "\n")

# Print min and max after scaling
print(f'Min = {df["Price_scaled"].min():.2f}')
print(f'Max = {df["Price_scaled"].max():.2f}\n')
```

```
Original Data Statistics:
Min = 20.00
Max = 200.00



After MinMax scaling:

     Price  Price_scaled
0     100          0.44
1      90          0.39
2      50          0.17
3      40          0.11
4      20          0.00
5     100          0.44
6      50          0.17
7      60          0.22
8     120          0.56
9      40          0.11
10    200          1.00 

Min = 0.00
Max = 1.00
```
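The text above notes that the target range is usually, but not necessarily, [0, 1]. The following sketch checks the formula by hand and then uses the `feature_range` parameter of `MinMaxScaler` to scale to another range; the range (-1, 1) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices = np.array([100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200], dtype=float)
X = prices.reshape(-1, 1)

# Manual application of the formula
manual = (X - X.min()) / (X.max() - X.min())

# Default range [0, 1]
default_scaled = MinMaxScaler().fit_transform(X)

# Custom range, here [-1, 1]
wide_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(np.column_stack([prices, default_scaled.ravel().round(2),
                       wide_scaled.ravel().round(2)]))
```

Because the minimum and maximum define the range, a single extreme value such as 200 compresses the remaining observations into the lower part of the interval.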



### 3. Mean normalization:
Centres the variable at 0 and rescales it by the value range, so that the values lie within [-1, 1]:
$$ X' = \frac{X - \text{mean}(X)}{\max(X) - \min(X)}$$

#### In short, mean normalization:
1. **Centres the mean at 0**
2. **Minimum and maximum values within [-1, 1].**
3. **Variance varies.**
4. **Preserves outliers.**

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = {'Price': [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]}
df = pd.DataFrame(data)
df['Sign'] = np.where(df['Price'] > df['Price'].mean(), '+', '-')

# Print mean, min and max before scaling
print("Original Data Statistics:")
print(f'Mean = {df["Price"].mean():.2f}')
print(f'Min = {df["Price"].min():.2f}')
print(f'Max = {df["Price"].max():.2f}\n')

# Step 1: centre the variable at 0 (subtract the mean, no division by the std)
scaler = StandardScaler(with_mean=True, with_std=False).fit(df[['Price']])
centred = scaler.transform(df[['Price']]).ravel()

# Step 2: divide by the value range (max - min), then round
value_range = df['Price'].max() - df['Price'].min()
df['Price_scaled'] = (centred / value_range).round(2)

# Display the DataFrame
print("\n\nAfter mean normalization:\n\n", df, "\n")

# Print mean, min and max after scaling
# (adding 0.0 turns a possible -0.0 into 0.0 for display)
print(f'Mean = {round(df["Price_scaled"].mean(), 2) + 0.0:.2f}')
print(f'Min = {df["Price_scaled"].min():.2f}')
print(f'Max = {df["Price_scaled"].max():.2f}\n')
```

```
Original Data Statistics:
Mean = 79.09
Min = 20.00
Max = 200.00



After mean normalization:

     Price Sign  Price_scaled
0     100    +          0.12
1      90    +          0.06
2      50    -         -0.16
3      40    -         -0.22
4      20    -         -0.33
5     100    +          0.12
6      50    -         -0.16
7      60    -         -0.11
8     120    +          0.23
9      40    -         -0.22
10    200    +          0.67 

Mean = 0.00
Min = -0.33
Max = 0.67
```



### 4. Scaling to the absolute maximum value - MaxAbsScaling
Scales the variable to the range [-1, 1] by dividing it by its maximum absolute value:
$$ X' = \frac{X}{\max(|X|)}$$

#### In short, MaxAbsScaling:
1. **Mean not centred.**
2. **Variance not scaled.**
3. **Scikit-learn recommends use with:**
    + Data that is centred
    + Sparse matrices

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

data = {'Price': [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]}
df = pd.DataFrame(data)

# Print max and min before scaling
print("Original Data Statistics:")
print(f'Max = {df["Price"].max()}')
print(f'Min = {df["Price"].min()}\n')

# Perform MaxAbs scaling using MaxAbsScaler
scaler = MaxAbsScaler().fit(df[['Price']])
df['Price_scaled'] = scaler.transform(df[['Price']])

# Display the DataFrame
print("\n\nAfter MaxAbs_Scaling:\n\n", df, "\n")

# Print max and min after MaxAbs scaling
print(f'Max = {df["Price_scaled"].max()}')
print(f'Min = {df["Price_scaled"].min()}')
```

```
Original Data Statistics:
Max = 200
Min = 20



After MaxAbs_Scaling:

     Price  Price_scaled
0     100          0.50
1      90          0.45
2      50          0.25
3      40          0.20
4      20          0.10
5     100          0.50
6      50          0.25
7      60          0.30
8     120          0.60
9      40          0.20
10    200          1.00 

Max = 1.0
Min = 0.1
```
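The recommendation to use MaxAbsScaling with sparse matrices follows from the formula: dividing by a constant never turns a zero into a non-zero value, so the sparsity pattern is preserved. A minimal sketch with a small, made-up SciPy sparse matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse data: most entries are zero
X = sparse.csr_matrix(np.array([
    [0.0, -4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0,  8.0, 0.0],
    [-1.0, 0.0, 5.0],
]))

# Each column is divided by its maximum absolute value (2, 8 and 5 here)
scaled = MaxAbsScaler().fit_transform(X)

print("Still sparse:", sparse.issparse(scaled))
print("Non-zero entries before/after:", X.nnz, scaled.nnz)
print(scaled.toarray())
```

Scalers that subtract the mean or the minimum would shift the zeros and make the matrix dense, which is why they are a poor fit for this kind of data.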


### 5. Scaling to median and quantiles - RobustScaling
Centres the variable at the median and scales it by the interquartile range (IQR):
$$ X' = \frac{X - \text{median}(X)}{Q_{75}(X) - Q_{25}(X)}$$


#### In short, RobustScaling:
1. **Median centred at zero.**
2. **Robust to outliers: the median and the quantiles are barely affected by extreme values.**

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

data = {'Price': [100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200]}
df = pd.DataFrame(data)

# Print median and IQR before scaling
q25 = np.percentile(df["Price"], 25)
q75 = np.percentile(df["Price"], 75)
print("Original Data Statistics:")
print(f'Median = {df["Price"].median()}')
print(f'Q25 = {q25}')
print(f'Q75 = {q75}')
print(f'IQR = Q75 - Q25 = {q75 - q25}\n')

# Perform Robust scaling using RobustScaler
scaler = RobustScaler().fit(df[['Price']])
df['Price_scaled'] = scaler.transform(df[['Price']])

# Display the DataFrame
print("\n\nAfter Robust_Scaling:\n\n", df)
print(f'\nMedian = {np.median(df["Price_scaled"])}')
```

```
Original Data Statistics:
Median = 60.0
Q25 = 45.0
Q75 = 100.0
IQR = Q75 - Q25 = 55.0



After Robust_Scaling:

     Price  Price_scaled
0     100      0.727273
1      90      0.545455
2      50     -0.181818
3      40     -0.363636
4      20     -0.727273
5     100      0.727273
6      50     -0.181818
7      60      0.000000
8     120      1.090909
9      40     -0.363636
10    200      2.545455

Median = 0.0
```
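To see why scaling by the median and quantiles is less affected by outliers, the sketch below makes the largest value ten times more extreme (200 becomes 2000, a change invented for this illustration) and compares how the remaining observations are scaled by `RobustScaler` and by `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

prices = np.array([100, 90, 50, 40, 20, 100, 50, 60, 120, 40, 200], dtype=float)
extreme = prices.copy()
extreme[-1] = 2000.0  # hypothetical, much more extreme outlier

def scale(scaler_cls, values):
    return scaler_cls().fit_transform(values.reshape(-1, 1)).ravel()

# Manual robust scaling, following the formula above
q25, q75 = np.percentile(prices, [25, 75])
manual = (prices - np.median(prices)) / (q75 - q25)

robust_a, robust_b = scale(RobustScaler, prices), scale(RobustScaler, extreme)
std_a, std_b = scale(StandardScaler, prices), scale(StandardScaler, extreme)

# Compare the scaled values of the 10 non-outlying observations
print("RobustScaler unchanged:  ", np.allclose(robust_a[:-1], robust_b[:-1]))
print("StandardScaler unchanged:", np.allclose(std_a[:-1], std_b[:-1]))
```

The median and quartiles do not move when the largest value grows, so the other observations keep exactly the same scaled values, whereas the mean and standard deviation used by standardization both shift.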


### 6. Scaling to vector unit length