# Day 21 - LU1

<h2>Demo - Detecting and Removing Outliers</h2>

This demo shows how to detect and remove outliers using two common techniques: the Z-score and the interquartile range (IQR) score.

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
from scipy import stats

In [15]:


# Define the column names
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Generate random data
num_rows = 20  # Number of rows in the DataFrame
data = np.random.rand(num_rows, len(columns))  # Generate random numbers between 0 and 1

# Create a DataFrame
boston_df = pd.DataFrame(data, columns=columns)
# Introduce outliers in specific columns
boston_df.iloc[5, boston_df.columns.get_loc('INDUS')] = 100.0  # Outlier in the 'INDUS' column
boston_df.iloc[10, boston_df.columns.get_loc('NOX')] = -100.0  # Outlier in the 'NOX' column
boston_df.iloc[15, boston_df.columns.get_loc('DIS')] = 100.0  # Outlier in the 'DIS' column

# Display the DataFrame
print(boston_df)

        CRIM        ZN       INDUS      CHAS         NOX        RM       AGE  \
0   0.567544  0.294749    0.078595  0.080668    0.846974  0.636929  0.572308   
1   0.974854  0.568331    0.296339  0.157526    0.341859  0.531228  0.579402   
2   0.882891  0.678743    0.992182  0.086614    0.959332  0.372590  0.513282   
3   0.955180  0.512427    0.288832  0.909273    0.763343  0.094197  0.388022   
4   0.098543  0.393014    0.919464  0.285657    0.440916  0.437403  0.482075   
5   0.841136  0.648354  100.000000  0.670916    0.661151  0.101968  0.401533   
6   0.460463  0.721960    0.271269  0.597935    0.248620  0.161225  0.295529   
7   0.690075  0.879024    0.759194  0.341545    0.556697  0.821365  0.468875   
8   0.233242  0.263140    0.399725  0.952060    0.808991  0.303799  0.015091   
9   0.350653  0.145023    0.643585  0.939993    0.493332  0.340212  0.700781   
10  0.789831  0.594316    0.711125  0.257468 -100.000000  0.204706  0.554054   
11  0.827056  0.161764    0.435689  0.22

### Using Z-Score

In [16]:
# Step 1: Use the zscore function from the SciPy library to compute the absolute Z-score of every value
boston_df_z = boston_df.copy()  # work on a copy so the original DataFrame is left untouched
z = np.abs(stats.zscore(boston_df))
print(z)

        CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
0   0.049477  0.527311  0.253475  1.238121  0.240735  1.501472  0.296526   
1   1.388053  0.507909  0.243425  0.999640  0.217697  1.005994  0.329676   
2   1.085825  0.925705  0.211307  1.219671  0.245859  0.262369  0.020673   
3   1.323395  0.296372  0.243772  1.332920  0.236920  1.042609  0.564709   
4   1.491842  0.155482  0.214664  0.602070  0.222215  0.566183  0.125166   
5   0.948603  0.810714  4.358547  0.593332  0.232259  1.006183  0.501566   
6   0.302434  1.089236  0.244582  0.366884  0.213445  0.728415  0.996959   
7   0.452158  1.683559  0.222061  0.428657  0.227496  2.366023  0.186855   
8   1.049167  0.646919  0.238653  1.465684  0.239002  0.060092  2.307544   
9   0.663310  1.093867  0.227397  1.428241  0.224606  0.110596  0.896921   
10  0.779994  0.606236  0.224280  0.689535  4.358693  0.524594  0.211216   
11  0.902329  1.030523  0.236993  0.798999  0.219940  1.271425  1.400177   
12  1.159280

From the raw Z-scores alone, it is difficult to tell which data points are outliers.
So let's define a threshold: any value whose absolute Z-score exceeds it is treated as an outlier.
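As a reminder of what the threshold is compared against, the Z-score of a value is its distance from the column mean, measured in standard deviations. The following is a minimal sketch on a made-up array (`sample`, `manual_z` and `scipy_z` are hypothetical names, not part of the demo data). It assumes the default `ddof=0` of `stats.zscore`, which matches NumPy's default population standard deviation.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: nineteen ordinary values in [0.05, 0.95] plus one extreme value
sample = np.append(np.linspace(0.05, 0.95, 19), 100.0)

# Manual Z-score: (value - mean) / population standard deviation
manual_z = (sample - sample.mean()) / sample.std()

# Same calculation with SciPy (ddof=0 by default)
scipy_z = stats.zscore(sample)

print(np.allclose(manual_z, scipy_z))   # the two calculations agree
print(np.abs(scipy_z).argmax())         # position of the most extreme value
```

Only the extreme value ends up with an absolute Z-score above 3; the ordinary values stay well below it.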

In [17]:
# Step 2: Define a threshold and locate the values whose absolute Z-score exceeds it
threshold = 3
print(np.where(z > threshold))

(array([ 5, 10, 15], dtype=int64), array([2, 4, 7], dtype=int64))
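`np.where` returns two arrays: the first holds the row positions and the second holds the column positions of the flagged values. Here that means row 5 / column 2 (`INDUS`), row 10 / column 4 (`NOX`) and row 15 / column 7 (`DIS`), which are exactly the three outliers introduced earlier. A minimal sketch of turning such positions into readable labels is shown below. It uses a small deterministic stand-in frame (`df`, `flagged` and the column names `A`, `B`, `C` are hypothetical), not the demo data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: 20 evenly spaced values per column, one planted outlier
df = pd.DataFrame({
    'A': np.linspace(0.0, 1.0, 20),
    'B': np.linspace(1.0, 0.0, 20),
    'C': np.linspace(0.2, 0.8, 20),
})
df.iloc[5, df.columns.get_loc('B')] = 100.0

# Absolute Z-scores per column, then the positions above the threshold
z = np.abs(stats.zscore(df.to_numpy(), axis=0))
rows, cols = np.where(z > 3)

# Translate (row position, column position) pairs into (index label, column name, value)
flagged = [(df.index[r], df.columns[c], df.iat[r, c]) for r, c in zip(rows, cols)]
for row_label, col_name, value in flagged:
    print(f"row {row_label}, column {col_name}: {value}")
```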


In [21]:
# Step 3: Keep only the rows in which every Z-score is below the threshold
boston_df_z = boston_df_z[(z < threshold).all(axis=1)]

print("The no. of rows before outlier filtering was: ", boston_df.shape)
print("The no. of rows after outlier filtering is: ", boston_df_z.shape)

The no. of rows before outlier filtering was:  (20, 13)
The no. of rows after outlier filtering is:  (17, 13)


### Using IQR Score

In [22]:
# Step 1: Calculate the IQR of every column
boston_df_iqr = boston_df.copy()  # work on a copy so the original DataFrame is left untouched
Q1 = boston_df_iqr.quantile(0.25)
Q3 = boston_df_iqr.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

CRIM       0.546822
ZN         0.439053
INDUS      0.549739
CHAS       0.603651
NOX        0.394636
RM         0.300620
AGE        0.233054
DIS        0.766146
RAD        0.460067
TAX        0.426491
PTRATIO    0.300987
B          0.574997
LSTAT      0.366098
dtype: float64
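The IQR measures the spread of the middle 50% of each column. A common rule of thumb treats any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as an outlier. The following is a minimal sketch of that rule on a single made-up column (`values`, `lower_fence` and `upper_fence` are hypothetical names, not part of the demo):

```python
import pandas as pd

# Hypothetical column: seven ordinary values and one extreme value
values = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence)
print(outliers)
```

Because the quartiles are barely affected by a single extreme value, the fences stay close to the bulk of the data.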


In [23]:
# Step 2: Detect the outliers (True marks a value outside the 1.5 * IQR fences)
print((boston_df_iqr < (Q1 - 1.5 * IQR)) | (boston_df_iqr > (Q3 + 1.5 * IQR)))

     CRIM     ZN  INDUS   CHAS    NOX     RM    AGE    DIS    RAD    TAX  \
0   False  False  False  False  False  False  False  False  False  False   
1   False  False  False  False  False  False  False  False  False  False   
2   False  False  False  False  False  False  False  False  False  False   
3   False  False  False  False  False  False  False  False  False  False   
4   False  False  False  False  False  False  False  False  False  False   
5   False  False   True  False  False  False  False  False  False  False   
6   False  False  False  False  False  False  False  False  False  False   
7   False  False  False  False  False  False  False  False  False  False   
8   False  False  False  False  False  False   True  False  False  False   
9   False  False  False  False  False  False  False  False  False  False   
10  False  False  False  False   True  False  False  False  False  False   
11  False  False  False  False  False  False  False  False  False  False   
12  False  F

A value of <i>False</i> means the data point lies within the accepted range, whereas <b><i>True</i> indicates the presence of an outlier</b>.

In [24]:
# Step 3: Remove every row that contains at least one outlier according to the IQR score
outlier_mask = (boston_df_iqr < (Q1 - 1.5 * IQR)) | (boston_df_iqr > (Q3 + 1.5 * IQR))
boston_df_out = boston_df_iqr[~outlier_mask.any(axis=1)]

print("The no. of rows before outlier filtering was: ", boston_df_iqr.shape)
print("The no. of rows after outlier filtering is: ", boston_df_out.shape)

The no. of rows before outlier filtering was:  (20, 13)
The no. of rows after outlier filtering is:  (16, 13)
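Note that the IQR score removed four rows, whereas the Z-score removed only three. The extra row is row 8: its `AGE` value (0.015091) falls below the lower IQR fence, but its Z-score (about 2.31) stays under the threshold of 3. The sketch below reproduces this kind of disagreement on a small deterministic column. The helper names `zscore_outlier_rows` and `iqr_outlier_rows` and the column `x` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def zscore_outlier_rows(df, threshold=3.0):
    """Return the index labels of rows with at least one |Z-score| above the threshold."""
    z = np.abs(stats.zscore(df.to_numpy(), axis=0))
    return df.index[(z > threshold).any(axis=1)]

def iqr_outlier_rows(df, k=1.5):
    """Return the index labels of rows with at least one value outside the k * IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    mask = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return df.index[mask.any(axis=1)]

# Hypothetical column: 10..27, one moderately high value (40) and one extreme value (500)
df = pd.DataFrame({'x': list(range(10, 28)) + [40, 500]}, dtype=float)

print(list(zscore_outlier_rows(df)))  # rows flagged by the Z-score
print(list(iqr_outlier_rows(df)))     # rows flagged by the IQR score
```

The extreme value inflates the mean and standard deviation, so the moderately high value hides behind it in the Z-score check, while the quartile-based fences still catch it.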


        CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
0   0.935901  0.184580  0.378560  0.327824  0.358714  0.922855  0.577494   
1   0.261980  0.453886  0.765653  0.884395  0.244139  0.289694  0.234978   
2   0.673691  0.239742  0.183082  0.856572  0.721046  0.754697  0.683417   
3   0.058502  0.758463  0.228258  0.010332  0.757972  0.412882  0.639415   
4   0.243489  0.401212  0.621007  0.076008  0.487308  0.527604  0.549366   
5   0.842674  0.734955  0.794781  0.974991  0.523140  0.823195  0.074309   
6   0.737742  0.351063  0.475868  0.208937  0.344342  0.804741  0.172414   
7   0.430843  0.330222  0.792397  0.253859  0.757773  0.371135  0.534966   
8   0.568981  0.554228  0.753827  0.281874  0.312823  0.735936  0.838611   
9   0.670365  0.221349  0.092201  0.292420  0.144678  0.609305  0.541129   
10  0.262854  0.522670  0.582450  0.191589  0.802648  0.623636  0.992020   
11  0.270769  0.865622  0.585053  0.338649  0.559414  0.020169  0.783508   
12  0.772943

Hence, the outliers have been removed.