# What is Feature Engineering?

Feature engineering has two main goals:
* Preparing an input dataset that is compatible with the requirements of the machine learning algorithm
* Improving the performance of machine learning models

<img src=https://as2.ftcdn.net/jpg/02/66/42/59/500_F_266425971_tyuVCOtVdfDNSObd2DwcG4TMxKZnHEU1.jpg width="400" height="400">

There are various feature engineering techniques. A few of them are listed below:
* Imputation
* Handling Outliers
* Binning
* Log Transform
* One-Hot Encoding
* Grouping Operations
* Scaling

# Missing Data 

Let us first understand why data might be missing from a dataset.
Assume that the source of the data is a survey. Some people might not be comfortable sharing their salary, age, weight, and so on.

Since people hesitate to share personal details, they might not answer all the survey questions.

There might also be cases where the person who knows the answer is not in a position to complete the questionnaire, for example because they have died or have medical issues.

For all these reasons, the data contains missing values.

<img src="https://image.freepik.com/free-vector/problem-solving-creative-decision-difficult-task-lateral-thinking-man-assembling-puzzle-cartoon-character-right-choice-missing-item_335657-2108.jpg" width="300" height="300">

Now let's discuss the types of missing data.

# Types of missing data 

Understanding why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, the analysis may be biased. For example, in a study of the relation between IQ and income, if participants with an above-average IQ tend to skip the question ‘What is your salary?’, analyses that do not take this missing-at-random (MAR) pattern into account may falsely fail to find a positive association between IQ and salary. Because of these problems, methodologists routinely advise researchers to design studies that minimize the occurrence of missing values.

# Missing Completely at Random 

A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data are MCAR, there is no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data. Nothing systematic makes some data more likely to be missing than others.
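To make the definition concrete, here is a minimal sketch that simulates MCAR on synthetic data. The `toy` frame, the `age` column and the 20% missing rate are made up for illustration and are not part of the Titanic data used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: ages for 10,000 people
toy = pd.DataFrame({'age': rng.normal(40, 12, size=10_000)})

# MCAR: every row has the same 20% chance of losing its value,
# independent of the age itself or of anything else
mcar_mask = rng.random(len(toy)) < 0.2
toy['age_mcar'] = toy['age'].mask(mcar_mask)

# The observed values are a random subset, so the mean barely moves
print(toy['age'].mean(), toy['age_mcar'].mean())
print(toy['age_mcar'].isnull().mean())
```

Because the mask ignores the data, the mean of the remaining values should stay close to the mean of the full column.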

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../input/titanic/train.csv')

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df[df['Embarked'].isnull()]

# Missing at Random 

Missing at random (MAR) occurs when the missingness is not random, but can be fully accounted for by variables for which there is complete information. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. An example is that males are less likely to fill in a depression survey, but this has nothing to do with their level of depression after accounting for maleness. Depending on the analysis method, these data can still induce parameter bias due to the contingent emptiness of cells (the cell for males with very high depression may have zero entries). However, if the parameter is estimated with Full Information Maximum Likelihood, MAR will provide asymptotically unbiased estimates.

For example, men may be more likely to hide their salary, and women their age.
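The depression-survey example can be sketched on synthetic data as follows. The column names `sex` and `score` and the missing rates (40% for males, 10% for females) are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

toy = pd.DataFrame({
    'sex': rng.choice(['male', 'female'], size=n),
    'score': rng.normal(50, 10, size=n),  # generated independently of sex
})

# MAR: the chance of a missing score depends only on the fully observed
# 'sex' column, never on the score itself
p_missing = np.where(toy['sex'] == 'male', 0.4, 0.1)
toy['score_mar'] = toy['score'].mask(rng.random(n) < p_missing)

# Missing rate per group: clearly different between the two groups
rate_by_sex = toy['score_mar'].isnull().groupby(toy['sex']).mean()
print(rate_by_sex)
```

The missingness is systematic (it differs by group), yet it is fully explained by a column that has no missing values, which is what separates MAR from MCAR.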

### A few techniques for handling missing values

* Mean/Median/Mode replacement
* Random Sample Imputation
* Capturing NaN values with a new feature
* End of distribution imputation
* Arbitrary imputation
* Frequent category imputation
* Predicting using linear regression

# Mean/Median/Mode Imputation 

When should we apply it?

Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN values with the mean or median of the variable, or with the mode (the most frequent value) for categorical variables.

In [None]:
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age','Fare','Survived'])
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().mean()

In [None]:
def impute_nan(df, variable, median):
    df[variable+'_median'] = df[variable].fillna(median)

In [None]:
median = df['Age'].median()
median

In [None]:
impute_nan(df, 'Age', median)

In [None]:
df.head()

In [None]:
print(df['Age'].std())
print(df['Age_median'].std())

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df['Age_median'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

## Advantages and Disadvantages of Mean/Median/Mode imputation 

#### Advantages

* Easy to implement (the median is also robust to outliers)
* Fast way to obtain a complete dataset

#### Disadvantages

* Changes or distorts the original variance
* Affects the correlation with other variables
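Both disadvantages can be checked on synthetic data. The sketch below uses made-up columns `x` and `y` and a 30% missing rate; it is an illustration, not a result from the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 5_000

x = rng.normal(0, 1, size=n)
toy = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(0, 1, size=n)})

# Remove 30% of x completely at random, then fill with the median
toy['x_missing'] = toy['x'].mask(rng.random(n) < 0.3)
toy['x_median'] = toy['x_missing'].fillna(toy['x_missing'].median())

# Spread before and after imputation
std_before = toy['x_missing'].std()
std_after = toy['x_median'].std()

# Correlation with y before (observed rows only) and after imputation
corr_before = toy['x_missing'].corr(toy['y'])
corr_after = toy['x_median'].corr(toy['y'])

print(std_before, std_after)
print(corr_before, corr_after)
```

Every filled row sits exactly at the median, so the standard deviation shrinks, and because those rows carry no information about `y`, the correlation weakens as well.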

# Random Sample Imputation 

Random sample imputation consists of taking random observations from the dataset and using them to replace the NaN values.

When should it be used? It assumes that the data are missing completely at random (MCAR).

In [None]:
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age','Fare','Survived'])
df.head()

In [None]:
df.isnull().sum()

In [None]:
# Drop the NaN values from 'Age' and draw a random sample from the values that remain
# The sample size is the number of missing values: df['Age'].isnull().sum()
df['Age'].dropna().sample(df['Age'].isnull().sum(), random_state = 0)

We now have a random sample of values, but their index does not match the rows that contain NaN, so next we look up the index of those rows.

In [None]:
df[df['Age'].isnull()].index

We use this code in the final function:

In [None]:
def impute_nan(df, variable, median):
    df[variable+'_median'] = df[variable].fillna(median)
    df[variable+'_random'] = df[variable]
    # draw one observed value for every missing value
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas aligns on the index when assigning, so give the sample
    # the index of the rows that contain NaN
    random_sample.index = df[df[variable].isnull()].index
    # replace the NaN values with the sampled values
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample

In [None]:
median = df['Age'].median()
median

In [None]:
impute_nan(df, 'Age', median)

In [None]:
df.head()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
df.Age_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

## Advantages and disadvantages of Random Sample Imputation

#### Advantages

* Easy to implement
* There is less distortion in variance

#### Disadvantages

* Randomness won't work in every situation
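The claim about variance can be checked with a small comparison. This is a sketch on a synthetic series (the values and the 20% missing rate are made up), following the same sampling approach as `impute_nan` above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical numeric column with roughly 20% of the values removed at random
s = pd.Series(rng.normal(30, 14, size=2_000))
s = s.mask(rng.random(len(s)) < 0.2)

# Median imputation: every gap gets the same value
median_filled = s.fillna(s.median())

# Random sample imputation: one observed value per gap,
# re-indexed so that fillna can align it with the missing rows
random_sample = s.dropna().sample(s.isnull().sum(), random_state=0)
random_sample.index = s[s.isnull()].index
random_filled = s.fillna(random_sample)

print(s.std(), median_filled.std(), random_filled.std())
```

The randomly filled series should keep a standard deviation close to that of the observed values, while the median-filled series is visibly narrower.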

# Capturing NaN values with a new feature

This works well when the data are not missing completely at random.

We create a new column that holds 1 if the value in 'Age' is missing and 0 otherwise.

In [None]:
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age','Fare','Survived'])
df.head()

In [None]:
import numpy as np
df['Age_Nan'] = np.where(df['Age'].isnull(),1,0)

In [None]:
df.head()

In [None]:
df['Age'].median()

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].median())

In [None]:
df.head(10)

## Advantages and disadvantages of capturing NaN values with a new feature

#### Advantages 

* Easy to implement
* Captures the importance of missing values

#### Disadvantages 

* Creates additional features (curse of dimensionality)
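When several columns have gaps, the same idea can be wrapped in a helper. The function name `add_missing_indicators`, the `_nan` suffix and the toy values below are hypothetical; this is one possible sketch rather than a fixed recipe.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df, columns):
    """Return a copy of df with a 0/1 '<col>_nan' flag per column
    and the original columns filled with their median."""
    out = df.copy()
    for col in columns:
        # flag first, while the NaN values are still there
        out[col + '_nan'] = out[col].isnull().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out

# Made-up values for illustration
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
                    'Fare': [7.25, 71.28, np.nan, 53.10, 8.05]})
result = add_missing_indicators(toy, ['Age', 'Fare'])
print(result)
```

Each flagged column adds one feature, so it is worth flagging only the columns where missingness is likely to be informative.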

# End of Distribution Imputation 

We take a value at the far end of the distribution, where outliers would lie (here the mean plus three standard deviations), and fill the NaN values with it.
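A minimal sketch of how the fill value can be computed on a synthetic series. The mean-plus-three-standard-deviations rule is the one used in this section; the IQR-based variant is an alternative sometimes used for skewed variables and is an assumption added here, not something this notebook applies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical numeric column with roughly 20% of the values missing
s = pd.Series(rng.normal(30, 14, size=1_000)).mask(rng.random(1_000) < 0.2)

# Rule used in this section: three standard deviations above the mean
extreme_gaussian = s.mean() + 3 * s.std()

# Alternative for skewed variables (assumption): three IQRs above the upper quartile
q1, q3 = s.quantile(0.25), s.quantile(0.75)
extreme_iqr = q3 + 3 * (q3 - q1)

# Fill every gap with the far-end value
filled = s.fillna(extreme_gaussian)
print(extreme_gaussian, extreme_iqr)
```

pandas skips NaN when computing `mean`, `std` and `quantile`, so both values come from the observed data only.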

In [None]:
df = pd.read_csv('../input/titanic/train.csv', usecols=['Age','Fare','Survived'])
df.head()

In [None]:
df['Age'].hist(bins=50)

In [None]:
extreme = df['Age'].mean()+3*df['Age'].std()

In [None]:
import seaborn as sns
sns.boxplot(x=df['Age'])

In [None]:
def impute_nan(df, variable, median, extreme):
    df[variable+'_end_distribution'] = df[variable].fillna(extreme)
    df[variable] = df[variable].fillna(median)

In [None]:
impute_nan(df, 'Age', df['Age'].median(), extreme)

In [None]:
df.head()

In [None]:
df['Age'].hist(bins=50)

In [None]:
df['Age_end_distribution'].hist(bins = 50)

In [None]:
sns.boxplot(x=df['Age_end_distribution'])

## Advantages and Disadvantages of End of Distribution Imputation

#### Advantages

* Can bring out the importance of missing values

#### Disadvantages 

* Changes the covariance/variance
* May create biased data

# Arbitrary value Imputation

This technique comes from Kaggle competitions. It consists of replacing NaN with an arbitrary value: all the NaN values are replaced by a single value chosen by the data scientist.

In [None]:
df=pd.read_csv('../input/titanic/train.csv', usecols=['Age','Fare','Survived'])
df.head()

In [None]:
def impute_nan(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred'] = df[variable].fillna(100)

In [None]:
impute_nan(df, 'Age')

In [None]:
df['Age_zero'].hist(bins=50)

## Advantages and Disadvantages of Arbitrary value Imputation 

#### Advantages 

* Easy to implement
* Captures the importance of missingness if there is one

#### Disadvantages 

* Distorts the original distribution of the variable
* If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
* Hard to decide which value to use
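One way to make the choice less arbitrary is to require that the fill value lies outside the observed range, so that imputed rows stay distinguishable from real ones. The helper `impute_arbitrary` and the toy ages below are hypothetical, and the range check is a design assumption rather than a rule from this notebook.

```python
import numpy as np
import pandas as pd

def impute_arbitrary(series, value):
    """Fill NaN with `value`, refusing values inside the observed range."""
    observed = series.dropna()
    if observed.min() <= value <= observed.max():
        raise ValueError(
            f'{value} lies inside the observed range '
            f'[{observed.min()}, {observed.max()}]'
        )
    return series.fillna(value)

# Made-up ages for illustration
ages = pd.Series([22.0, np.nan, 38.0, 0.42, np.nan, 80.0])
ages_filled = impute_arbitrary(ages, -1)
print(ages_filled.tolist())
```

A value such as 50 would be rejected here because it could be confused with a genuine age.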

# Handling Categorical Missing Values

# Frequent Category Imputation

In [None]:
df = pd.read_csv('../input/loancsv/train (1).csv')
df.columns

In [None]:
df = pd.read_csv('../input/loancsv/train (1).csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().mean().sort_values(ascending = True)

## Compute the frequency of each category for every feature

In [None]:
df['BsmtQual'].value_counts().plot.bar()

In [None]:
df['GarageType'].value_counts().plot.bar()

In [None]:
df['FireplaceQu'].value_counts().plot.bar()

In [None]:
df['FireplaceQu'].mode()[0]

In [None]:
df['FireplaceQu'].value_counts().index[0]

##### Replacing NaN with the mode

In [None]:
def impute_nan(df, variable):
    most_freq_category = df[variable].mode()[0]
    df[variable] = df[variable].fillna(most_freq_category)

In [None]:
for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df, feature)

In [None]:
df.head()

In [None]:
df.isnull().sum()

## Advantages and disadvantages of replacing NaN with the mode

#### Advantages

* Easy to implement
* Fast way to obtain a complete dataset

#### Disadvantages

* Because we fill with the most frequent label, that label may become over-represented if there are many NaN values
* It distorts the relation of the most frequent label with the other variables
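The over-representation effect is easy to see on a small made-up column where half of the values are missing. The labels below are illustrative and are not counts from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical quality column: 8 observed labels and 8 missing values
quality = pd.Series(['Gd'] * 4 + ['TA'] * 3 + ['Ex'] * 1 + [np.nan] * 8)

# Share of each label among the observed values (NaN is ignored)
share_before = quality.value_counts(normalize=True)

# Fill every gap with the most frequent label
mode_filled = quality.fillna(quality.mode()[0])
share_after = mode_filled.value_counts(normalize=True)

print(share_before['Gd'], share_after['Gd'])  # 0.5 0.75
```

The most frequent label goes from half of the observed values to three quarters of the column, purely as a result of the imputation.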

In [None]:
# df = pd.read_csv('../input/mercedesbenz-greener-manufacturing/train.csv')
# df.columns

# Adding a variable to capture NaN

In [None]:
df = pd.read_csv('../input/loancsv/train (1).csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

In [None]:
df.head()

In [None]:
import numpy as np
def impute_nan(df, variable, frequent):
    df[variable+'_new'] = np.where(df[variable].isnull(),1,0)
    df[variable] = df[variable].fillna(frequent)

In [None]:
for feature in ['BsmtQual','FireplaceQu','GarageType']:
    frequent = df[feature].mode()[0]
    impute_nan(df, feature, frequent)

In [None]:
df.head()

# If there are several frequent categories, replace NaN with a new category

In [None]:
df = pd.read_csv('../input/loancsv/train (1).csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

In [None]:
df.head()

In [None]:
def impute_nan(df, variable):
    df[variable+'_new_var'] = np.where(df[variable].isnull(),'missing',df[variable])

In [None]:
for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df, feature)

In [None]:
df.head()

In [None]:
df = df.drop(['BsmtQual','FireplaceQu','GarageType'], axis=1)

In [None]:
df.head()
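The same result can also be obtained with `fillna` directly, without `np.where`. A minimal sketch on a made-up frame (the values are illustrative; the column name mirrors the one used above):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({'GarageType': ['Attchd', np.nan, 'Detchd', np.nan]})

# Treat missingness as its own category
toy['GarageType_new_var'] = toy['GarageType'].fillna('missing')

counts = toy['GarageType_new_var'].value_counts()
print(counts)
```

The new `'missing'` label then behaves like any other category in later steps such as one-hot encoding.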