
- https://www.pluralsight.com/guides/cleaning-up-data-from-outliers

- https://www.analyticsvidhya.com/blog/2021/05/detecting-and-treating-outliers-treating-the-odd-one-out/

- Once we have identified the outliers, we need to treat them. There are several techniques for this, and we will discuss the most widely used ones below.

#### **1) Log Transformation**

Transforming a skewed variable can help correct its distribution. Common choices are logarithmic, square root, or square transformations. The most common is the logarithmic transformation, which the first line of code below applies to the 'Loan_amount' variable; non-positive values, for which the logarithm is undefined, are mapped to 0. The second and third lines of code print the skewness value before and after the transformation.

In [None]:
df["Log_Loanamt"] = df["Loan_amount"].map(lambda i: np.log(i) if i > 0 else 0) 
print(df['Loan_amount'].skew())
print(df['Log_Loanamt'].skew())

In [None]:
Output:

    2.8146019248106815
    -0.17792641310111373

- The above output shows that the skewness value came down from 2.8 to -0.18, indicating that the transformation has greatly reduced the influence of extreme values on the distribution.
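
The same idea can be tried on synthetic data. The sketch below is illustrative only: the `demo` frame and its lognormal values are assumptions made for this example and are not the tutorial's dataset. It uses `Series.where` so that non-positive values become 0 without triggering a log-of-zero warning.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed loan amounts (hypothetical data for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Loan_amount": rng.lognormal(mean=5.0, sigma=0.6, size=1000)})

# Mask non-positive values to NaN, take the log, then map the masked entries to 0
positive = demo["Loan_amount"].where(demo["Loan_amount"] > 0)
demo["Log_Loanamt"] = np.log(positive).fillna(0.0)

skew_before = demo["Loan_amount"].skew()
skew_after = demo["Log_Loanamt"].skew()
print(skew_before)
print(skew_after)
```

On data like this, the skewness of the transformed column should be much closer to 0 than that of the raw column.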

#### **2) Replacing Outliers with Median Values**

In this technique, we replace the extreme values with the median. The mean is not recommended as a replacement because it is itself affected by outliers. The first line of code below prints the 50th percentile value, or the median, which comes out to be 140. The second line prints the 95th percentile value, which comes out to be around 326. The third line replaces all values of the 'Loan_amount' variable that are greater than 325 (the 95th percentile, rounded down and hard-coded) with the median value of 140. Finally, the fourth line prints summary statistics after all these techniques have been employed for outlier treatment.

In [None]:
print(df['Loan_amount'].quantile(0.50)) 
print(df['Loan_amount'].quantile(0.95)) 
df['Loan_amount'] = np.where(df['Loan_amount'] > 325, 140, df['Loan_amount'])
df.describe()

In [None]:
Output:

    140.0
    325.7500000000001


|       	| Income       	| Loan_amount 	| Term_months 	| Credit_score 	| approval_status 	| Age        	| Log_Loanamt 	|
|-------	|--------------	|-------------	|-------------	|--------------	|-----------------	|------------	|-------------	|
| count 	| 594.000000   	| 594.000000  	| 594.000000  	| 594.000000   	| 594.000000      	| 594.000000 	| 594.000000  	|
| mean  	| 6112.375421  	| 144.289562  	| 366.929293  	| 0.787879     	| 0.688552        	| 50.606061  	| 4.957050    	|
| std   	| 3044.257269  	| 53.033735   	| 63.705994   	| 0.409155     	| 0.463476        	| 16.266324  	| 0.494153    	|
| min   	| 2960.000000  	| 10.000000   	| 36.000000   	| 0.000000     	| 0.000000        	| 22.000000  	| 2.302585    	|
| 25%   	| 3831.500000  	| 111.000000  	| 384.000000  	| 1.000000     	| 0.000000        	| 36.000000  	| 4.709530    	|
| 50%   	| 5050.000000  	| 140.000000  	| 384.000000  	| 1.000000     	| 1.000000        	| 50.500000  	| 4.941642    	|
| 75%   	| 7629.000000  	| 171.000000  	| 384.000000  	| 1.000000     	| 1.000000        	| 64.000000  	| 5.192957    	|
| max   	| 12681.000000 	| 324.000000  	| 504.000000  	| 1.000000     	| 1.000000        	| 80.000000  	| 6.656727    	|
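
The cell above hard-codes the threshold (325) and the median (140). A minimal sketch of the same replacement with both values computed from the data is shown below; the `loans` frame and its values are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with one extreme value
loans = pd.DataFrame(
    {"Loan_amount": [100.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0, 900.0]}
)

# Compute the replacement value and the cut-off from the data itself
median_value = loans["Loan_amount"].median()
upper_limit = loans["Loan_amount"].quantile(0.95)

# Replace every value above the 95th percentile with the median
loans["Loan_amount"] = np.where(
    loans["Loan_amount"] > upper_limit, median_value, loans["Loan_amount"]
)
print(loans["Loan_amount"].describe())
```

Computing the values this way keeps the code correct if the underlying data changes, at the cost of the thresholds no longer being visible as literals.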

#### **3) IQR Score**

This technique uses the IQR scores calculated earlier to remove outliers. The rule of thumb is that any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is an outlier, and can be removed. The first line of code below drops every row that has an outlier in at least one column and stores the result in the data frame 'df_out'. The second line prints the shape of this data, which comes out to be 375 observations of 6 variables. This shows that for our data, a lot of records get deleted if we use the IQR method.

In [None]:
df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
print(df_out.shape)

In [None]:
Output:

 (375, 6)
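
Since `Q1`, `Q3`, and `IQR` are computed outside this section, the following self-contained sketch shows one way they could be obtained and applied. The `sample` frame is hypothetical, and computing the quartiles with `DataFrame.quantile` is an assumption about how the earlier step was done.

```python
import pandas as pd

# Hypothetical data: one extreme income, no extreme ages
sample = pd.DataFrame({
    "Income": [3000, 3200, 3500, 3700, 4000, 4200, 4500, 50000],
    "Age": [25, 30, 35, 40, 45, 50, 55, 60],
})

# Per-column quartiles and interquartile range
Q1 = sample.quantile(0.25)
Q3 = sample.quantile(0.75)
IQR = Q3 - Q1

# A row is flagged if any of its columns falls outside the IQR fences
outlier_rows = ((sample < (Q1 - 1.5 * IQR)) | (sample > (Q3 + 1.5 * IQR))).any(axis=1)
sample_out = sample[~outlier_rows]
print(sample_out.shape)
```

Because a single out-of-range value in any column removes the whole row, the number of dropped rows grows quickly as more columns are included.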

#### **4) Trimming**

In this method, we completely remove data points that are outliers. Consider the 'Age' variable, which had a minimum value of 0 and a maximum value of 200. The first line of code below creates an index of all the data points where the age is 100 or above, or 18 or below, which covers both of these extremes. The second line drops these index rows from the data, while the third line of code prints summary statistics for the variable.

After trimming, the number of observations is reduced from 600 to 594, and the minimum and maximum values are much more acceptable.

In [None]:
index = df[(df['Age'] >= 100)|(df['Age'] <= 18)].index
df.drop(index, inplace=True)
df['Age'].describe()

In [None]:
Output:

    count    594.000000
    mean      50.606061
    std       16.266324
    min       22.000000
    25%       36.000000
    50%       50.500000
    75%       64.000000
    max       80.000000
    Name: Age, dtype: float64
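
As an alternative to dropping rows in place, the same trimming can be expressed as a boolean filter that keeps the valid rows and leaves the original frame untouched. This is a sketch on hypothetical data (`people` is not the tutorial's data frame); the bounds mirror the condition used in the cell above.

```python
import pandas as pd

# Hypothetical ages, including the implausible values 0 and 200
people = pd.DataFrame({"Age": [0, 22, 35, 50, 64, 80, 200]})

# Keep rows with 18 < Age < 100, the complement of the drop condition
trimmed = people[(people["Age"] > 18) & (people["Age"] < 100)]
print(trimmed["Age"].describe())
```

Returning a new frame makes it easy to compare row counts before and after trimming.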

#### **5) Quantile-based Flooring and Capping**

In this technique, we apply:
 - **flooring (e.g., at the 10th percentile)** for the **lower values**

 - **capping (e.g., at the 90th percentile)** for the **higher values**.

The lines of code below print the **10th and 90th percentiles** of the variable **'Income'**, respectively. These values will be used for quantile-based flooring and capping.

In [None]:
print(df['Income'].quantile(0.10))
print(df['Income'].quantile(0.90))

In [None]:
Output:

 2960.1
 12681.0

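Now we treat the outliers, as shown in the lines of code below: values below the 10th percentile are raised to it, and values above the 90th percentile are lowered to it. No rows are removed. Finally, we calculate the skewness value again, which comes out much better now at about 1.04.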

In [None]:
df["Income"] = np.where(df["Income"] < 2960.0, 2960.0, df['Income'])
df["Income"] = np.where(df["Income"] > 12681.0, 12681.0, df['Income'])
print(df['Income'].skew())

In [None]:
Output:

  1.04
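
The two `np.where` calls can also be written as a single `Series.clip` call with the percentiles computed from the data. The sketch below uses a hypothetical `incomes` frame and writes the result to a new column, `Income_capped`, which is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical incomes with one very low and one very high value
incomes = pd.DataFrame({
    "Income": [1000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0,
               8000.0, 9000.0, 10000.0, 11000.0, 60000.0]
})

# Floor at the 10th percentile and cap at the 90th percentile
floor_value = incomes["Income"].quantile(0.10)
cap_value = incomes["Income"].quantile(0.90)
incomes["Income_capped"] = incomes["Income"].clip(lower=floor_value, upper=cap_value)

skew_raw = incomes["Income"].skew()
skew_capped = incomes["Income_capped"].skew()
print(skew_raw)
print(skew_capped)
```

Keeping the raw column alongside the capped one makes it straightforward to compare skewness before and after the treatment.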