An outlier may arise from natural variability in the data or from experimental or human error.

It may therefore point to a measurement problem, or to heavy skewness in the data (a heavy-tailed distribution).

In statistics, there are three measures of central tendency: mean, median, and mode. They help us describe the data.

The mean is a suitable measure for describing the data when no outliers are present.

The median is preferred when the dataset contains outliers.

The mode is used if there are outliers and about half or more of the data points share the same value.

Of the three, the mean is the one most strongly affected by outliers, and this in turn affects the standard deviation.
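As a small illustration of that sensitivity, the sketch below compares the mean and median of the sample list that appears later in this notebook, with and without its extreme value 101. The variable names are chosen here for illustration only.

```python
import numpy as np

# Sample values reused from a later cell of this notebook; 101 is the extreme value.
sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])
without_outlier = sample[sample != 101]

mean_with, median_with = sample.mean(), np.median(sample)
mean_without, median_without = without_outlier.mean(), np.median(without_outlier)

print(f"mean:   {mean_with:.2f} -> {mean_without:.2f}")      # mean:   20.08 -> 12.73
print(f"median: {median_with:.2f} -> {median_without:.2f}")  # median: 14.00 -> 13.00
```

Removing a single value moves the mean by more than 7 units, while the median moves by only 1.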

# Detecting Outliers

Below are some of the techniques for detecting outliers:

Boxplots

Z-score

Interquartile Range (IQR)

# By Z-score

Any data point whose Z-score is more than 3 standard deviations away from the mean (|Z| > 3) is treated as an outlier.

In [None]:
# Keep only the rows that lie within 3 standard deviations of the mean.
# Z = (x - mean) / std, the same formula that StandardScaler applies.
# df is assumed to be loaded earlier in the notebook.
from scipy.stats import zscore
import numpy as np

cols = ['age', 'height', 'ap_hi', 'weight', 'ap_lo']
z_score = zscore(df[cols])                       # column-wise Z-scores
abs_z_score = np.abs(z_score)                    # distance from the mean in standard deviations
filtering_entry = (abs_z_score < 3).all(axis=1)  # True where every column is within 3 std
df = df[filtering_entry]
df.describe()

# By IQR (Interquartile Range)

Data points that lie more than 1.5 times the IQR above Q3 or below Q1 are outliers. The steps and code that follow show this in Python.

Steps:

Sort the dataset in ascending order.

Calculate the 1st and 3rd quartiles (Q1, Q3).

Compute IQR = Q3 - Q1.

Compute lower bound = Q1 - 1.5*IQR and upper bound = Q3 + 1.5*IQR.

Loop through the values of the dataset and mark those that fall below the lower bound or above the upper bound as outliers.


In [None]:
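import numpy as np

# Defined here so the cell runs on its own; the same values appear in a later cell.
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

def detect_outliers_iqr(data):
    """Return the values in data that fall outside the 1.5*IQR bounds."""
    outliers = []  # created per call so repeated calls do not accumulate results
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    IQR = q3 - q1
    lwr_bound = q1 - (1.5 * IQR)
    upr_bound = q3 + (1.5 * IQR)
    for i in data:
        if i < lwr_bound or i > upr_bound:
            outliers.append(i)
    return outliers

# Driver code
sample_outliers = detect_outliers_iqr(sample)
print("Outliers from IQR method: ", sample_outliers)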

In [None]:
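# Upper bound: drop rows whose SkinThickness is above Q3 + 1.5*IQR
q1 = data.quantile(0.25)  # 1st quartile of every column
q3 = data.quantile(0.75)  # 3rd quartile of every column
IQR = q3 - q1

st_high = q3.SkinThickness + (1.5 * IQR.SkinThickness)
print(st_high)

# Positional indices of the rows above the upper bound
index = np.where(data['SkinThickness'] > st_high)[0]
data = data.drop(data.index[index])
print(data.shape)

# reset_index returns a new DataFrame, so assign it back; drop=True discards the old index
data = data.reset_index(drop=True)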

In [None]:
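# Lower bound: drop rows whose BloodPressure is below Q1 - 1.5*IQR
# (q1 and IQR come from the previous cell)
bp_low = q1.BloodPressure - (1.5 * IQR.BloodPressure)
print(bp_low)

# Positional indices of the rows below the lower bound
index = np.where(data['BloodPressure'] < bp_low)[0]
data = data.drop(data.index[index])
print(data.shape)

# Assign the result back so the new index is kept
data = data.reset_index(drop=True)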

# Treating Outliers

Below are some of the methods of treating outliers:

Trimming/removing the outlier

Quantile-based flooring and capping

Mean/median imputation

# Trimming/removing the outliers

In this technique, we remove the outliers from the dataset. It should be used with care, because every removed value is lost information and an outlier may be a genuine observation.

The Python code below deletes the outliers and copies the remaining elements to another array.

In [None]:
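# Trimming: keep every value that was not flagged as an outlier.
# sample is a plain list, so convert it to an array before comparing element-wise.
sample_arr = np.asarray(sample)
a = sample_arr[~np.isin(sample_arr, sample_outliers)]
print(a)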

In [None]:
# print(len(sample), len(a))
# The outlier 101 is deleted and the remaining data points are copied to another array, a.

# Quantile-based flooring and capping
# In this technique, values above the 90th percentile are capped at the 90th percentile value, and values below the 10th percentile are floored at the 10th percentile value.

In [3]:
sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

In [4]:
# Compute the 10th and 90th percentiles and clamp the sample to that range
import numpy as np

sample_arr = np.asarray(sample)  # convert the list so comparisons are element-wise
tenth_percentile = np.percentile(sample_arr, 10)
ninetieth_percentile = np.percentile(sample_arr, 90)
# Floor values below the 10th percentile, then cap values above the 90th
b = np.where(sample_arr < tenth_percentile, tenth_percentile, sample_arr)
b = np.where(b > ninetieth_percentile, ninetieth_percentile, b)
print("New array:", b)

New array: [15.  20.7 18.   7.2 13.  16.  11.  20.7  7.2 15.  10.   9. ]


In [None]:
# The mean is strongly influenced by outliers, so the outliers are replaced with the median instead.

import numpy as np
import matplotlib.pyplot as plt

sample_arr = np.asarray(sample)
median = np.median(sample_arr)  # 14.0 for this sample
# Replace every flagged outlier with the median
c = np.where(np.isin(sample_arr, sample_outliers), median, sample_arr)
print("Sample: ", sample)
print("New array: ", c)

# Visualize the data after treating the outliers
plt.boxplot(c, vert=False)
plt.title("Boxplot of the sample after treating the outliers")
plt.xlabel("Sample")
plt.show()


# Feature Transformation

There are many feature transformation methods; we will talk about the most useful and popular ones:

Standardization

Min-Max Scaling / Normalization

Robust Scaler

Logarithmic Transformation

Reciprocal Transformation

Square Root Transformation

Box-Cox Transformation

Exponential Transformation

Johnson Transformation

# Standardization

Standardization should be used when the features of the input dataset have very different ranges or are measured in different units (for example, a height in meters, a weight in kilograms, or a distance in miles).

It brings all the features to a similar scale, with a mean of 0 and a standard deviation of 1.

In standardization, we subtract the mean from each feature value and then divide by the standard deviation. This only shifts and rescales the data; the result follows a standard normal distribution only if the feature was normally distributed to begin with.
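A minimal sketch of the formula in plain NumPy, assuming a small made-up array of heights and weights (the values and variable names are hypothetical). scikit-learn's StandardScaler applies the same column-wise calculation.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) columns, made up for illustration.
X = np.array([[150.0, 50.0],
              [160.0, 60.0],
              [170.0, 70.0],
              [180.0, 80.0]])

# z = (x - mean) / std, computed column by column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```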

# Min Max Scaler

In simple terms, min-max scaling brings feature values into the range 0 to 1, unless we specify a different target range.

In normalization, we subtract the feature's minimum value from each value and then divide by the range of the feature (range = maximum value - minimum value).
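The calculation can be sketched in NumPy as follows. The input values and the custom target range [-1, 1] are assumptions picked for illustration.

```python
import numpy as np

x = np.array([5.0, 10.0, 15.0, 20.0, 25.0])  # hypothetical feature values

# (x - min) / (max - min) maps the feature onto [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # values 0, 0.25, 0.5, 0.75, 1

# Rescale to a custom range [low, high] if 0 to 1 is not wanted
low, high = -1.0, 1.0
x_custom = low + x_scaled * (high - low)
print(x_custom)  # values -1, -0.5, 0, 0.5, 1
```

Note that the minimum and maximum are themselves extreme values, so a single outlier can squeeze all other points into a narrow part of the range.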

# Robust to Outliers


If the dataset has too many outliers, both standardization and normalization can be hard to depend on. In such cases you can use Robust Scaler for feature scaling.

As the name suggests, Robust Scaler is robust to outliers.

It scales values using the median and the interquartile range, so it is not affected by very large or very small feature values.

The robust scaler subtracts the median from each feature value and then divides by the IQR.

25th percentile = 1st quartile

50th percentile = 2nd quartile (also called the median)

75th percentile = 3rd quartile

100th percentile = 4th quartile (also called the maximum)

IQR = Interquartile Range

IQR = 3rd quartile - 1st quartile
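A minimal NumPy sketch of robust scaling, using the sample values that appear earlier in this notebook. The variable names are chosen here for illustration, and np.percentile's default linear interpolation is assumed for the quartiles.

```python
import numpy as np

# Sample values from earlier in this notebook; 101 is the outlier.
x = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9], dtype=float)

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# (x - median) / IQR: centering and scaling both ignore the extreme value
x_robust = (x - median) / iqr

print("median:", median, "IQR:", iqr)
print(np.round(x_robust, 2))
```

After scaling, the median of the feature is 0 and its IQR is 1, while the outlier stays far from the rest of the values instead of compressing them.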



# A. Log Transformation (right-skewed data)

When the data follow a power-law or otherwise heavily right-skewed distribution, a log transformation can bring the distribution closer to normal. To do this, apply the np.log() function, which requires strictly positive values. In this dataset, most variables fall under this category.

In the logarithmic transformation, we apply log to all values of a feature using NumPy and store the result in a new feature.

Log transformation doesn't fit this dataset very well: it makes the distribution worse by turning it left-skewed. So we have to rely on other methods to achieve a normal distribution.
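To show the effect on data that really is right-skewed, here is a sketch on a synthetic log-normal feature (not this notebook's dataset). The skewness helper is a hypothetical function written for this example.

```python
import numpy as np

def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

rng = np.random.default_rng(0)
# Synthetic right-skewed feature, used only for illustration.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# np.log requires x > 0; np.log1p(x) is a common alternative when zeros occur
x_log = np.log(x)

print("skew before:", skewness(x))
print("skew after: ", skewness(x_log))
```

A skewness close to 0 after the transformation suggests a roughly symmetric distribution; a clearly negative value would indicate the left skew described above.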

# B. Reciprocal Transformation
In the reciprocal transformation, we divide 1 by each value of a feature (its reciprocal, 1/x) and store the result in a new feature.

The reciprocal transformation doesn't work well with this data: instead of giving a normal distribution, it makes the data even more right-skewed.
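A short sketch of the operation on a few made-up positive values (the array is hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 10.0])  # hypothetical strictly positive values

x_reciprocal = 1.0 / x
print(x_reciprocal)  # values 1, 0.5, 0.25, 0.2, 0.1
```

The transformation is undefined for zero and reverses the order of positive values, so the largest original value becomes the smallest.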

# C. Square Root Transformation
In the square root transformation, we raise the values of a feature to the power 1/2 to obtain their square roots. NumPy can also be used for this transformation.

Square root transformation seems to perform better than the reciprocal and log transformations with this data, but the result is still a bit left-skewed.
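Both ways of computing it give the same result, as this small sketch on made-up non-negative values shows:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 100.0])  # hypothetical non-negative values

x_sqrt_pow = x ** (1 / 2)  # raising to the power 1/2
x_sqrt_np = np.sqrt(x)     # NumPy's square root function

print(x_sqrt_np)  # values 1, 2, 3, 4, 10
```

Large values are pulled in more than small ones (100 becomes 10, while 4 becomes 2), which is how the transformation reduces right skew.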

# D. Box-Cox Transformation
Box-Cox transformation is one of the most useful techniques for bringing a data distribution closer to a normal distribution.

The Box-Cox transformation can be defined as:

$$T(Y) = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln Y, & \lambda = 0 \end{cases}$$

where Y is the response variable, which must be strictly positive, and λ is the transformation parameter. λ typically varies from -5 to 5. In the transformation, all values of λ in that range are considered and the optimal value for a given variable is selected.

We can calculate the Box-Cox transformation using stats from the SciPy module.

So far, the Box-Cox transformation seems to be the best fit for transforming the age feature.
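A minimal sketch of the SciPy call, run on a synthetic right-skewed feature and not on the age column itself. When no λ is passed, scipy.stats.boxcox fits λ by maximum likelihood and returns it together with the transformed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive feature (illustration only).
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# With lmbda left unset, boxcox returns the transformed data and the fitted lambda.
x_boxcox, fitted_lambda = stats.boxcox(x)

print("fitted lambda:", fitted_lambda)
print("skew before:", stats.skew(x))
print("skew after: ", stats.skew(x_boxcox))
```

For a DataFrame column, the same call could be applied to the column's values, provided they are all greater than zero.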

In [None]:
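# Exponential transformation: raise Age to the power 1/1.2
df['df_exponential'] = df.Age**(1/1.2)
# plot_data is assumed to be a plotting helper defined earlier in the notebook
plot_data(df, 'df_exponential')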

# Gaussian Transformation

What is a Gaussian transformation?
If our features are not normally distributed, we apply mathematical operations to convert them into a normal (Gaussian) distribution. This is called a Gaussian transformation.

Some machine learning algorithms, such as linear and logistic regression, tend to work better when the features are approximately normally distributed, which can improve their accuracy and performance.

In [None]:
#Intern at Pranathi 
#Student of DataTrained- Saurav
#Date - 30- March - 2023
#Time  - 10:10