# Ways to Detect and Remove the Outliers

When you work on a data science project, what do you look for? What is the most important part of the EDA (exploratory data analysis) phase? Some steps, if skipped during EDA, can affect later statistical or machine learning modelling. One of them is finding outliers. In this post we will try to understand what an outlier is, why it is important to identify outliers, and which methods can be used to detect and remove them. Don't worry, we won't just go through the theory; we will also do some coding and plotting of the data.

Credit: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

In [None]:
#Import the libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

In [None]:
#Load the data
import warnings
warnings.filterwarnings('ignore')
california = fetch_california_housing()

# Extract the features and the target
x = california.data
y = california.target

# List the keys of the returned dictionary-like object
print(california.keys())

In [None]:
# Get the feature names
columns = california.feature_names
columns

In [None]:
#Description of dataset
print(california.DESCR)

In [None]:
# Create the dataframe
california_df = pd.DataFrame(california.data, columns=columns)
# Work on an independent copy so the original dataframe is never modified
california_df_o = california_df.copy()
california_df.shape

In [None]:
california_df.head()

In [None]:
# Check1: Outlier detection - Univariate - Boxplot
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.boxplot(x=california_df['Population'])
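The points drawn beyond the whiskers are the ones the boxplot treats as outliers. With seaborn's default setting (`whis=1.5`), those are the values lying more than 1.5 times the interquartile range outside the quartiles. A minimal sketch of that rule, using a small made-up series (`values` is hypothetical data, not the `Population` column), could look like this:

```python
import pandas as pd

# Toy values standing in for a single column (hypothetical data).
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and interquartile range of the column.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot would draw as single points.
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(flagged.tolist())
```

Swapping `values` for `california_df['Population']` should list the same points that show up beyond the whiskers in the plot above.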

In [None]:
# Check2: Multivariate outlier analysis - Scatter plot of two features
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(california_df['MedInc'], california_df['HouseAge'])
ax.set_xlabel('Median income in block group')
ax.set_ylabel('Median house age in block group')
plt.show()

In [None]:
# Check3: Z-score method - absolute z-score of every value, computed per column
from scipy import stats
z = np.abs(stats.zscore(california_df))
z
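To see what the z-score step does, here is a small sketch that reproduces the calculation by hand with NumPy. It assumes the default behaviour of `scipy.stats.zscore`, which standardises each column using the population standard deviation (`ddof=0`); the array `data` is made up for illustration.

```python
import numpy as np

# Toy data: 4 rows, 2 columns (hypothetical values).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 400.0]])

# Subtract each column's mean and divide by its population standard
# deviation (ddof=0), then take the absolute value.
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)
z_manual = np.abs((data - col_mean) / col_std)

print(z_manual.round(2))
```

Each entry tells how many standard deviations a value sits away from its column mean, which is why a single cut-off can be applied to all columns at once.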

In [None]:
z.shape

In [None]:
threshold = 3
# Row indices and column indices of every value whose |z| exceeds the threshold
print(np.where(z > threshold))
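The output of `np.where` is a pair of arrays: the first holds row indices and the second holds column indices. A rough sketch of how to turn them into readable (row, column name) pairs is shown here; `z_toy` is a hypothetical table of absolute z-scores, not the real `z`.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute z-scores for three rows and two columns.
z_toy = pd.DataFrame({"MedInc": [0.2, 3.5, 1.1],
                      "Population": [4.2, 0.3, 0.9]})
threshold = 3

# np.where returns (row_indices, column_indices) of the flagged cells.
rows, cols = np.where(z_toy.to_numpy() > threshold)

# Pair each row index with the name of the offending column.
flagged = [(int(r), z_toy.columns[c]) for r, c in zip(rows, cols)]
print(flagged)
```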

## Removing "whole rows" with outliers


In [None]:
# Keep only the rows in which every column is within the threshold
california_df_o = california_df_o[(z <= threshold).all(axis=1)]
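The filter above keeps a row only when all of its z-scores pass the check, so one extreme value is enough to drop the whole row. A small sketch on made-up data (`df_toy` and `z_toy` are hypothetical) shows the mask at work:

```python
import pandas as pd

# Hypothetical data and matching absolute z-scores.
df_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
z_toy = pd.DataFrame({"a": [0.5, 3.2, 0.1], "b": [0.4, 0.2, 2.9]})
threshold = 3

# True for rows where every column is within the threshold.
keep = (z_toy <= threshold).all(axis=1)

# Boolean indexing drops the rows marked False.
df_clean = df_toy[keep]
print(df_clean)
```

Here the middle row is removed because of a single value in column `a`, even though its value in column `b` is unremarkable.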

In [None]:
california_df.shape

In [None]:
california_df_o.shape

In [None]:
# Check4: IQR method - start again from a copy of the full dataframe
california_df_o1 = california_df.copy()

In [None]:
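# Interquartile range (IQR) of every column
Q1 = california_df_o1.quantile(0.25)
Q3 = california_df_o1.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

# Drop every row that has at least one value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier = (california_df_o1 < (Q1 - 1.5 * IQR)) | (california_df_o1 > (Q3 + 1.5 * IQR))
california_df_out = california_df_o1[~is_outlier.any(axis=1)]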

In [None]:
california_df_out.shape
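Since the IQR filter also removes whole rows, it can help to check which columns are responsible for most of the removals. The sketch below counts flagged values per column on a small made-up dataframe (`df_toy` is hypothetical); the same idea should carry over to `california_df_o1`.

```python
import pandas as pd

# Hypothetical data: column "a" has one extreme value, column "b" has none.
df_toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [5, 6, 7, 8, 9],
})

q1 = df_toy.quantile(0.25)
q3 = df_toy.quantile(0.75)
iqr = q3 - q1

# Cell-wise mask: True where a value lies outside the 1.5 * IQR fences.
outlier_mask = (df_toy < (q1 - 1.5 * iqr)) | (df_toy > (q3 + 1.5 * iqr))

# Number of flagged values in each column.
per_column = outlier_mask.sum()

# Number of rows that would be dropped (at least one flagged value).
rows_dropped = int(outlier_mask.any(axis=1).sum())

print(per_column)
print(rows_dropped)
```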