# **Feature Engineering: Part A** <b style="color:black;">Handling Missing Values & Outliers</b>

<p style="font-size:160%;"> In this notebook, I practise various techniques for handling missing values and outliers in a dataset. The content is borrowed from the notebook "Feature Engineering from scratch" by https://www.kaggle.com/harshjain123.</p> 

# <b style="color:blue;"> Techniques for Handling Missing Values </b>
<p style="font-size:160%;"> Here is the list of typical techniques used to handle the missing values in a dataset: </p>
<li style="font-size:150%;"> Mean/Median/Mode
<li style="font-size:150%;"> Random Sample Imputation
<li style="font-size:150%;"> Capturing NaN with New Feature
<li style="font-size:150%;"> End of Distribution Imputation
<li style="font-size:150%;"> Arbitrary Imputation
<li style="font-size:150%;"> Frequent Category Imputation
<li style="font-size:150%;"> Regression Imputation
<li style="font-size:150%;"> KNN Imputation

<p style="font-size:180%;"><b> 1. Mean/Median/Mode</b></p>
<p style="font-size:160%;"> This is used when data are missing completely at random (MCAR). In that case, it is reasonable to assume that the missing values resemble the typical observations of the variable, so they can be replaced with the mean, median, or mode of its distribution.</p>

In [None]:
import pandas as pd
import numpy as np

# Import dataset
df = pd.read_csv('../input/titanic/train.csv')

# Check for missing values
df.isnull().sum()

In [None]:
# Selecting & printing columns with missing values
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Cabin', 'Embarked'])
df.head()

In [None]:
# Find percentage of missing values
df.isnull().mean()

In [None]:
# Function to impute missing values with mean/mode/median
def impute_nan(df, variable, mean, mode, median):
    df[variable + '_mean'] = df[variable].fillna(mean)
    df[variable + '_mode'] = df[variable].fillna(mode)
    df[variable + '_median'] = df[variable].fillna(median)

In [None]:
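# Find mean, mode & median for 'Age' column
mean = df.Age.mean()
# mode() returns a Series (there can be ties), so take its first value;
# passing the whole Series to fillna would only fill rows whose index matches
mode = df.Age.mode()[0]
median = df.Age.median()

# Call function 'impute_nan'
impute_nan(df, 'Age', mean, mode, median)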

# Check for updated dataframe
df.head()

In [None]:
# Visualize the 'Age' column with missing values & replaced one
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)

df.Age.plot(kind='density', ax=ax)
df.Age_mean.plot(kind='density', ax=ax)
df.Age_mode.plot(kind='density', ax=ax)
df.Age_median.plot(kind='density', ax=ax)

lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

<p style="font-size:180%;"><b> 2. Random Sample Imputation</b></p>
<p style="font-size:160%;"> This involves drawing a random sample of observed values from the variable and using it to replace the missing values. This technique is suitable when data are missing completely at random (MCAR).</p>

In [None]:
# Selecting & printing columns with missing values
import pandas as pd
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Cabin', 'Embarked'])
df.head()

In [None]:
# Function to impute missing values with random sample
def impute_nan(df, variable, median):
    df[variable + '_median'] = df[variable].fillna(median)
    df[variable + '_random'] = df[variable]
    # Get random sample
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # Get index to merge the dataset
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable + '_random'] = random_sample
    

In [None]:
# Find median & call function
median = df.Age.median()
impute_nan(df, 'Age', median)

# Check the updated dataframe
df.head()

In [None]:
# Visualize the 'Age' column with missing values & replaced one
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)

df.Age.plot(kind='density', ax=ax)
df.Age_median.plot(kind='density', ax=ax)
df.Age_random.plot(kind='density', ax=ax)

lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

<p style="font-size:180%;"><b> 3. Capturing NaN with New Feature </b></p>
<p style="font-size:160%;"> This technique is well suited to data that are missing not at random (MNAR). In this method, the fact that a value is missing is captured in a new binary feature (1 if the value was NaN, 0 otherwise), so the missingness itself is kept as information.</p>

In [None]:
# Selecting & printing columns with missing values
import pandas as pd
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Cabin', 'Embarked'])
df.head()

In [None]:
# Add a new feature 'Age_NaN' based on missing values
df['Age_NaN'] = np.where(df.Age.isnull(), 1, 0)
df.head()

In [None]:
# Visualize the distribution of the new 'Age_NaN' indicator column
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)

df.Age_NaN.plot(kind='density', ax=ax)

lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

<p style="font-size:180%;"><b> 4. End of Distribution </b></p>
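<p style="font-size:160%;"> In this method, the missing values are replaced with a value taken from the far end (tail) of the variable's distribution, for example the mean plus three standard deviations. </p>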

In [None]:
# Selecting & printing columns with missing values
import numpy as np
import pandas as pd
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Cabin', 'Embarked'])

In [None]:
# Visualize 'Age' as histogram
df.Age.hist(bins=50)

In [None]:
# Visualize 'Age' as box-plot
import seaborn as sns
# Pass the column by keyword; recent seaborn versions reject it as a positional argument
sns.boxplot(x='Age', data=df)

In [None]:
# Set extreme values
extreme = df.Age.mean() + 3*df.Age.std()

In [None]:
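# Function to impute missing values with the end-of-distribution value
# (stored in a new column) and with the median (stored in the original column)
def impute_nan(df, variable, median, extreme):
    df[variable + '_end_distribution'] = df[variable].fillna(extreme)
    # Assign the result back instead of using inplace=True on a column selection
    df[variable] = df[variable].fillna(median)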

In [None]:
# Call function
impute_nan(df, 'Age', df.Age.median(), extreme)
# Check updated dataframe
df.head()

<p style="font-size:180%;"><b> 5. Arbitrary Value Imputation </b></p>
<p style="font-size:160%;"> It consists of replacing all occurrences of missing values within a variable with an arbitrary value. The arbitrary value should be different from the mean or median and not within the normal values of the variable. </p>

In [None]:
# Selecting & printing columns with missing values
import numpy as np
import pandas as pd
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Cabin', 'Embarked'])

In [None]:
# Define function for arbitrary value imputation
def impute_nan(df, variable):
    df[variable + '_zero'] = df[variable].fillna(0)
    df[variable + '_hundred'] = df[variable].fillna(100)

In [None]:
# Call function
impute_nan(df, 'Age')
# Check updated dataframe
df.head()

In [None]:
# Visualize the 'Age' column with missing values & replaced one
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)

df.Age.plot(kind='density', ax=ax)
df.Age_zero.plot(kind='density', ax=ax)
df.Age_hundred.plot(kind='density', ax=ax)

lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

<p style="font-size:180%;"><b> 6. Frequent Category Imputation </b></p>
<p style="font-size:160%;"> It consists of replacing all occurrences of missing values in a categorical variable with its most frequent category. As a rule of thumb, use this method when the variable contains no more than 5% missing values. </p>

In [None]:
# Selecting & printing columns with missing values
import numpy as np
import pandas as pd
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Cabin', 'Embarked'])
df.head()

In [None]:
# Compute frequency of each type
df['Embarked'].value_counts().plot.bar()

In [None]:
# Define function to impute missing values with the most frequent one
def impute_nan(df, variable):
    most_frequent_value = df[variable].mode()[0]
    # Assign the result back instead of using inplace=True on a column selection
    df[variable] = df[variable].fillna(most_frequent_value)

In [None]:
# Call function and check for updates
impute_nan(df, 'Embarked')
df.head()
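<p style="font-size:160%;"> The 5% rule of thumb mentioned above can be checked before imputing. The sketch below is only an illustration on made-up data: the helper name <code>impute_mode_if_sparse</code>, its <code>max_missing</code> parameter, and the toy 'Embarked' values are assumptions, not part of the original notebook. </p>

```python
import numpy as np
import pandas as pd

def impute_mode_if_sparse(series, max_missing=0.05):
    """Fill NaN with the mode only when the share of missing values is small.

    Hypothetical helper: returns the (possibly filled) series and the missing share.
    """
    missing_share = series.isnull().mean()
    if missing_share > max_missing:
        # Too many gaps: leave the column untouched
        return series.copy(), missing_share
    return series.fillna(series.mode()[0]), missing_share

# Toy stand-in for the 'Embarked' column: 40 values, 1 of them missing
toy = pd.Series(['S'] * 30 + ['C'] * 9 + [np.nan], name='Embarked')
filled, share = impute_mode_if_sparse(toy)
print(f"missing share: {share:.3f}, remaining NaN: {filled.isnull().sum()}")
```

<p style="font-size:160%;"> Returning the missing share alongside the result makes it easy to log which columns were skipped because they had too many gaps. </p>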

<p style="font-size:180%;"><b> 7. Regression Imputation </b></p>
<p style="font-size:160%;"> This method is used when the variable with missing values is likely to be correlated with other variables in the dataset, so that those variables can be used to predict the missing values. </p>

In [None]:
# Selecting & printing columns with missing values
import numpy as np
import pandas as pd
df = pd.read_csv('../input/titanic/train.csv', usecols=['PassengerId', 'Age', 'Fare', 'Survived'])
df.head()

In [None]:
# Find correlation among variables & plot it
corr = df.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

In [None]:
# Create a subset of data where there are no missing values
# Note: 'Fare' has no missing values, 'Age' has missing values i.e. subset ['Age', 'Fare']
df_Age_Fare = df.dropna(axis=0, subset = ['Age', 'Fare'])
df_Age_Fare = df_Age_Fare.loc[:,['Age', 'Fare']]

# Find entries of 'Age' with missing values
missing_Age = df['Age'].isnull()

# Extract entries of 'Fare' corresponding to missing values of 'Age'
Fare_missAge = df.loc[missing_Age, ['Fare']]

# Assign x & y variables
X = df_Age_Fare[['Fare']]
y = df_Age_Fare[['Age']]

In [None]:
# Split dataset into training data & testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
# Perform linear regression analysis
from sklearn.linear_model import LinearRegression
lm = LinearRegression().fit(X_train, y_train)
Age_pred = lm.predict(Fare_missAge)

In [None]:
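# Plot regression results
# Rows with a missing 'Age' have no observed value to scatter, so the gray points
# are the complete rows and the blue line shows the predictions for the missing ones
import matplotlib.pyplot as plt
plt.scatter(X['Fare'], y['Age'], color='gray')
plt.plot(Fare_missAge['Fare'], Age_pred, color='royalblue', linewidth=2)
plt.xlabel('Fare')
plt.ylabel('Age')
plt.show()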
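<p style="font-size:160%;"> The cells above compute the predictions but do not write them back into the dataframe. A minimal sketch of that last step is shown below on a small made-up frame; the toy values and the column name 'Age_regression' are assumptions chosen for illustration, and a single predictor is used only to mirror the 'Fare' example. </p>

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data in which the known ages follow Age = 20 + 0.5 * Fare
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'Age':  [25.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Boolean mask of the rows where 'Age' is missing
missing = toy['Age'].isnull()

# Fit the regression only on rows where 'Age' is known
model = LinearRegression().fit(toy.loc[~missing, ['Fare']], toy.loc[~missing, 'Age'])

# Keep the original column and store the imputed version in a new one
toy['Age_regression'] = toy['Age']
toy.loc[missing, 'Age_regression'] = model.predict(toy.loc[missing, ['Fare']])
print(toy)
```

<p style="font-size:160%;"> Keeping the imputed values in a separate column follows the pattern of the earlier sections and makes it easy to compare the distributions before and after imputation. </p>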

<p style="font-size:180%;"><b> 8. KNN Imputation </b></p>
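<p style="font-size:160%;"> Like regression imputation, K-Nearest Neighbour (KNN) imputation uses the other variables to estimate a missing value: each missing entry is filled with the average of that feature over the k most similar rows. </p>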

In [None]:
from sklearn.impute import KNNImputer
import numpy as np

X = [[3, np.nan, 5], [1, 0, 0], [3, 3, 3]]  # np.NaN was removed in NumPy 2.0
print("X: ", X)
print("===========")

imputer = KNNImputer(n_neighbors=1)
impute_with_1 = imputer.fit_transform(X)
print("\n Impute with 1 Neighbours: \n", impute_with_1)

imputer = KNNImputer(n_neighbors=2)
impute_with_2 = imputer.fit_transform(X)
print("\n Impute with 2 Neighbours: \n", impute_with_2)
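<p style="font-size:160%;"> <code>KNNImputer</code> returns a plain NumPy array, so column names are lost. The sketch below wraps the same toy matrix in a DataFrame and restores the labels; the helper name <code>knn_impute</code> and the column names 'a', 'b', 'c' are assumptions made for illustration. </p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Same toy matrix as in the cell above, with hypothetical column names
toy = pd.DataFrame([[3, np.nan, 5], [1, 0, 0], [3, 3, 3]], columns=['a', 'b', 'c'])

def knn_impute(frame, k):
    """Impute NaN with KNNImputer and return a DataFrame with the original labels."""
    imputer = KNNImputer(n_neighbors=k)
    values = imputer.fit_transform(frame)
    return pd.DataFrame(values, columns=frame.columns, index=frame.index)

filled_1 = knn_impute(toy, 1)
filled_2 = knn_impute(toy, 2)
print(filled_1, filled_2, sep='\n\n')
```

<p style="font-size:160%;"> With one neighbour, the missing 'b' takes the value of the closest row, [3, 3, 3], which gives 3.0. With two neighbours, it is the mean of 0 and 3, which gives 1.5. </p>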

# <b style="color:blue;"> Techniques for Handling Outliers </b>
<p style="font-size:160%;"> Here is the list of typical techniques used to detect outliers in a dataset: </p>
<li style="font-size:150%;"> Box Plot
<li style="font-size:150%;"> Scatter Plot
<li style="font-size:150%;"> Z-Score
<li style="font-size:150%;"> IQR-Score    
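<p style="font-size:160%;"> The Z-score and IQR rules from the list above can be sketched in a few lines of pandas. The toy ages, the helper names, and the cut-offs (3 standard deviations, 1.5 times the IQR) are assumptions; they are common defaults rather than values taken from this notebook. </p>

```python
import pandas as pd

# Hypothetical toy data with one obviously extreme age
ages = pd.Series([22, 25, 27, 28, 30, 31, 33, 35, 36, 38, 40, 95], name='Age')

def zscore_outliers(series, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold]

def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

print("Z-score outliers:", zscore_outliers(ages).tolist())
print("IQR outliers:", iqr_outliers(ages).tolist())
```

<p style="font-size:160%;"> The IQR rule relies on quartiles, so it is less affected by the extreme value itself than the Z-score, whose mean and standard deviation are both pulled by the outlier. </p>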
<p style="font-size:160%;"> Algorithms NOT sensitive to outliers: </p>
<li style="font-size:150%;"> Naive Bayes
<li style="font-size:150%;"> SVM
<li style="font-size:150%;"> Decision Trees
<li style="font-size:150%;"> XGBoost, GBM
<li style="font-size:150%;"> KNN    
<p style="font-size:160%;"> Algorithms sensitive to outliers: </p>
<li style="font-size:150%;"> Linear Regression
<li style="font-size:150%;"> Logistic Regression
<li style="font-size:150%;"> K-Means Clustering
<li style="font-size:150%;"> Hierarchical Clustering
<li style="font-size:150%;"> PCA
<li style="font-size:150%;"> Neural Networks
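<p style="font-size:160%;"> As a small illustration of why linear regression appears in the sensitive list, the sketch below fits a least-squares line to made-up data with and without a single extreme point. The data and variable names are assumptions, and <code>np.polyfit</code> is used here simply as a convenient least-squares fit. </p>

```python
import numpy as np

# Hypothetical clean data lying exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Same data with the last observation replaced by an extreme value
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# Degree-1 least-squares fits; polyfit returns (slope, intercept)
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

<p style="font-size:160%;"> A single point out of ten is enough to change the fitted slope substantially, because least squares penalises the squared distance to every point. </p>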
