In [None]:
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10,6)

In [None]:
df = pd.read_csv("../input/weight-height/weight-height.csv")
df.head(5)

In [None]:
plt.hist(df.Height, bins=20, rwidth=0.8)
plt.xlabel('Height (inches)')
plt.ylabel('Count')
plt.show()

In [None]:
from scipy.stats import norm
import numpy as np
plt.hist(df.Height, bins=20, rwidth=0.8, density=True)
plt.xlabel('Height (inches)')
plt.ylabel('Density')  # density=True normalizes the histogram, so the y-axis is no longer a count

rng = np.arange(df.Height.min(), df.Height.max(), 0.1)
plt.plot(rng, norm.pdf(rng,df.Height.mean(),df.Height.std()))

In [None]:
df.Height.describe()

In [None]:
df.Height.mean()

In [None]:
df.Height.std()

Here the mean is 66.37 inches and the standard deviation is 3.84 inches.

**(1) Outlier detection and removal using 3 standard deviations**


One way to remove outliers is to drop any data points that lie more than 3 standard deviations from the mean. This gives us the following upper and lower bounds.
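To see why 3 standard deviations is a common cutoff, the sketch below computes how much of a normal distribution falls within k standard deviations of its mean. It assumes the data is roughly normal, which the histogram above suggests for heights; `fraction_within` is a hypothetical helper name used only for illustration.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # For a standard normal variable Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
```

Under this assumption only about 0.3% of points fall outside the 3 standard deviation bounds, so dropping them removes very little data.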

In [None]:
upper_limit = df.Height.mean() + 3*df.Height.std()
upper_limit

In [None]:
lower_limit = df.Height.mean() - 3*df.Height.std()
lower_limit

Here are the outliers that lie more than 3 standard deviations from the mean.

In [None]:
df[(df.Height>upper_limit) | (df.Height<lower_limit)]

Above, the heights at the high end are around 78 inches, which is about 6 ft 6 in. That is quite an unusual height. Some people are this tall, but it is very uncommon, so it is acceptable to remove those data points. Similarly, the heights at the low end are around 54 inches, which is about 4 ft 6 in. This is also a legitimate height, but few people are this short, so it is safe to treat both cases as outliers.

Now remove these outliers and generate a new dataframe.

In [None]:
df_no_outlier_std_dev = df[(df.Height<upper_limit) & (df.Height>lower_limit)]
df_no_outlier_std_dev.head()

In [None]:
df_no_outlier_std_dev.shape

In [None]:
df.shape

**(2) Outlier detection and removal using Z Score**


Z score is another way to achieve the same thing we did in part (1).
A z score indicates how many standard deviations a data point is away from the mean.
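As a small illustration of the definition, the sketch below computes z scores for a handful of made-up heights using only the standard library. The values are hypothetical and not taken from the dataset. `statistics.stdev` is the sample standard deviation, which matches the default of pandas `Series.std()` (`ddof=1`).

```python
from statistics import mean, stdev

# Made-up heights in inches, for illustration only.
heights = [60.0, 64.0, 66.0, 68.0, 72.0]

mu = mean(heights)      # centre of the data
sigma = stdev(heights)  # sample standard deviation (ddof=1)

# z score: signed distance from the mean, measured in standard deviations.
zscores = [(h - mu) / sigma for h in heights]

for h, z in zip(heights, zscores):
    print(h, round(z, 3))
```

A value equal to the mean gets a z score of 0, and the sign tells you whether a point is above or below the mean.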

In [None]:
df['zscore'] = ( df.Height - df.Height.mean() ) / df.Height.std()
df.head(5)

In [None]:
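# Manual check of the formula: z score of a height of 73.84 using the rounded mean (66.37) and standard deviation (3.84)
(73.84-66.37)/3.84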

Get the data points that have a z score higher than 3 or lower than -3. In other words, get the data points that are more than 3 standard deviations away from the mean.

In [None]:
df[df['zscore']>3]

In [None]:
df[df['zscore']<-3]

Here is the list of all outliers.

In [None]:
df[(df.zscore<-3) | (df.zscore>3)]
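The two-sided condition can also be written with an absolute value. The sketch below uses a small made-up dataframe (`toy`, not the height data) to show that both forms select the same rows.

```python
import pandas as pd

# Made-up z scores for illustration only; not taken from the dataset.
toy = pd.DataFrame({"zscore": [-3.4, -1.2, 0.0, 2.9, 3.1]})

# |z| > 3 selects both tails at once.
outliers = toy[toy.zscore.abs() > 3]

# Equivalent two-sided condition, as used in the cell above.
two_sided = toy[(toy.zscore < -3) | (toy.zscore > 3)]

print(outliers)
```

Which form to use is a matter of taste; the absolute-value version is shorter when the bounds are symmetric.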

**Remove the outliers and produce a new dataframe**

In [None]:
df_no_outliers = df[(df.zscore>-3) & (df.zscore<3)]
df_no_outliers.head()

In [None]:
df_no_outliers.shape

In [None]:
df.shape

**(3) Outlier detection using IQR**

In [None]:
Q1 = df.Height.quantile(0.25)
Q3 = df.Height.quantile(0.75)
Q1, Q3

In [None]:
IQR = Q3 - Q1
IQR

In [None]:
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
lower_limit, upper_limit

In [None]:
df[(df.Height<lower_limit)|(df.Height>upper_limit)]

**Remove outliers**

In [None]:
df_no_outlier = df[(df.Height>lower_limit)&(df.Height<upper_limit)]
df_no_outlier

In [None]:
df_no_outlier.shape
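The IQR steps above can be wrapped in a small helper so the same rule is easy to apply to other columns. This is only a sketch: `iqr_bounds` and its `k` parameter are hypothetical names, and the sample values are made up rather than taken from the dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up heights in inches with one obviously extreme value.
toy_heights = pd.Series([62.0, 64.0, 65.0, 66.0, 67.0, 68.0, 70.0, 90.0])

lower, upper = iqr_bounds(toy_heights)

# Keep only the values strictly inside the limits, as in the cells above.
kept = toy_heights[(toy_heights > lower) & (toy_heights < upper)]

print(lower, upper)
print(kept.tolist())
```

Unlike the standard deviation and z score methods, the limits here are built from quartiles, so a single extreme value has little influence on where they fall.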