## Abstract
1. To try different imputation methods on the Titanic dataset and evaluate classifier accuracy for each of them. One package we use is fancyimpute.

2. To briefly describe how gradient boosting differs from bagging, to implement gradient boosting as provided in scikit-learn, and to evaluate classifier accuracy on the Titanic dataset.

3. In theory, increasing the number of decision trees (n_estimators) increases classifier performance and/or generalizability. To design and evaluate a computational experiment that tests this on the Titanic dataset.

4. To pick any Kaggle regression dataset; to train, tune and evaluate a Random Forest regression model on it; and to use the model's feature importance calculations to perform feature selection, demonstrated on the chosen dataset.

## About the Titanic dataset

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## PART 1

#### What do you mean by imputation?

Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values.
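As a minimal sketch of the idea, the toy column below (hypothetical values, not taken from the Titanic data) has one missing entry that is replaced with the median of the observed values:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data for illustration)
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

# Median of the observed values; NaN is skipped by default
median_age = ages.median()

# Replace the missing entry with that estimate
ages_imputed = ages.fillna(median_age)

print(ages_imputed.tolist())  # [22.0, 38.0, 35.0, 35.0]
```

After this step the column can be analyzed as if it were complete, which is exactly why the choice of estimate matters.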



## Import necessary packages, and read in data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
import os
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

%matplotlib inline
rcParams['figure.figsize'] = 10,8
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (12,8)})

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
# print(os.listdir("../input"))

In [None]:
# Load data as Pandas dataframe
train = pd.read_csv('train (2).csv')
test = pd.read_csv('test (1).csv')
df = pd.concat([train, test], axis=0, sort=True)

In [None]:
df.head()

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

        
display_all(df.describe(include='all').T)

## 2. Imputation

We can see above that there are a few columns with missing values. The Cabin column is missing over 1000 values, so we won't use that for predictions, but the Age, Embarked and Fare columns are all complete enough that we can fill in the missing values through imputation.

## 2.1. Impute missing age values

A simple option for the missing age values is to use the median age value. Let's go a little further and use each passenger's Title to estimate their age. E.g. if a passenger has the title of Dr, I will give them the median age value for all other passengers with the same title.

## Extract title from name

We can use a regular expression to extract the title from the Name column. We will do this by finding the adjacent letters that are immediately followed by a full stop.
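To see what this pattern captures, here is a small sketch on three names written in the same "Surname, Title. Given names" layout as the Name column. The variable names are illustrative only:

```python
import pandas as pd

# Sample names in the same layout as the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the first run of letters that is immediately followed by a full stop.
# A raw string keeps the backslash in \. from being read as a string escape.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.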

In [None]:
# create new Title column
df['Title'] = df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

In [None]:
df.head()

#### Use only the most common titles

Let's take a look at the unique titles across all passengers:

In [None]:
df['Title'].value_counts()

As we can see above, there are quite a few different titles. However, some of these are just French versions of the more common English titles, e.g. Mme = Madame = Mrs, and others are rare honorifics held by only a handful of passengers.

We will use the six most common titles, replacing all other titles with the most appropriate of these six.

In [None]:
# replace rare titles with more common ones
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr',
           'Don': 'Mr', 'Mme': 'Mrs', 'Jonkheer': 'Mr', 'Lady': 'Mrs',
           'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
df.replace({'Title': mapping}, inplace=True)
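The effect of a replacement mapping can be checked on a small Series first. This sketch uses a shortened, illustrative version of the mapping; titles that are not keys of the dictionary pass through unchanged:

```python
import pandas as pd

# A few titles, some rare and some common (toy data)
titles = pd.Series(["Mr", "Mlle", "Mme", "Dr", "Ms"])

# Shortened version of the mapping, for illustration only
mapping = {"Mlle": "Miss", "Mme": "Mrs", "Ms": "Miss"}

# Values that appear as keys are swapped; everything else is left alone
merged = titles.replace(mapping)

print(merged.tolist())  # ['Mr', 'Miss', 'Mrs', 'Dr', 'Miss']
```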

In [None]:
# confirm that we are left with just six values
df['Title'].value_counts()

#### Use median of title group

Now, for each missing age value, we will impute the age using the median age for all people with the same title.

In [None]:
# impute missing Age values using median of Title groups
title_ages = df.groupby('Title')['Age'].median()

# map each passenger's title to the median age of that title group
age_med = df['Title'].map(title_ages)

# replace all missing ages with the group median
# (assign back instead of a chained inplace fillna, which newer pandas versions warn about)
df['Age'] = df['Age'].fillna(age_med)

We can visualize the median ages for each title group. Below, we see that each title has a distinctly different median age.

Note: There is no risk in doing this after imputation, because filling a title group's missing ages with that group's median leaves the group's median unchanged.
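The claim in the note can be verified on a toy frame (hypothetical values). The sketch below computes the per-title medians, fills the gaps with them, and confirms the medians are the same afterwards:

```python
import numpy as np
import pandas as pd

# Toy data: one missing age in each title group
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age":   [30.0, 40.0, np.nan, 4.0, np.nan, 20.0],
})

# Group medians before imputation (NaN is ignored)
before = toy.groupby("Title")["Age"].median()

# Fill each gap with the median of the row's own title group
group_median = toy.groupby("Title")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)

# Group medians after imputation
after = toy.groupby("Title")["Age"].median()

print(before.equals(after))  # True
```

Other statistics do change: the spread within each group shrinks, since every imputed value sits exactly at the center.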

In [None]:
sns.barplot(x='Title', y='Age', data=df, estimator=np.median, ci=None, palette='Blues_d')
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.countplot(x='Title', data=df, palette='hls', hue='Survived')
plt.xticks(rotation=45)
plt.show()

### 2.2. Impute missing fare values
For the single missing fare value, I also use the median fare value for the passenger's class.

Perhaps you could come up with a cooler way of visualising the relationship between the price a passenger paid for their ticket and their chances of survival?

In [None]:
sns.swarmplot(x='Sex', y='Fare', hue='Survived', data=df)
plt.show()

In [None]:
# impute missing Fare values using median of Pclass groups
class_fares = df.groupby('Pclass')['Fare'].median()

# map each passenger's class to the median fare of that class
fare_med = df['Pclass'].map(class_fares)

# replace all missing fares with the class median
# (assign back instead of a chained inplace fillna)
df['Fare'] = df['Fare'].fillna(fare_med)

### 2.3. Impute missing "embarked" value

There are just two missing values in the Embarked column. Here we will simply use the pandas backfill method, which fills each gap with the next valid value further down the column.

In [None]:
sns.catplot(x='Embarked', y='Survived', data=df,
            kind='bar', palette='muted', ci=None)
plt.show()

In [None]:
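# fill each missing port with the next valid value in the column
# (fillna(method=...) is deprecated in recent pandas; bfill() is the direct equivalent)
df['Embarked'] = df['Embarked'].bfill()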
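Backfilling depends on row order, so it helps to see its behaviour on a toy column (hypothetical values). The sketch also shows filling with the most frequent value, which is offered only as a possible alternative and is not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy port column with a gap in the middle and a gap at the end
ports = pd.Series(["S", np.nan, "C", "S", np.nan], name="Embarked")

# Backfill: each gap takes the next valid value below it.
# A gap in the last row has nothing below it and stays missing.
backfilled = ports.bfill()

# Alternative (an assumption, not the notebook's choice): use the most frequent value
most_common = ports.mode()[0]
mode_filled = ports.fillna(most_common)

print(backfilled.tolist())
print(mode_filled.tolist())
```

If the missing entries can fall in the final rows, checking `isna().sum()` after the backfill confirms that nothing was left unfilled.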

### 3. Add family size column

We can use the two variables of Parch and SibSp to create a new variable called Family_Size. This is simply done by adding Parch and SibSp together.

In [None]:
# create Family_Size column (Parch + SibSp)
df['Family_Size'] = df['Parch'] + df['SibSp']

In [None]:
display_all(df.describe(include='all').T)

### 4. Save cleaned version

Finally, let's save our cleaned data set so we can use it in other notebooks.

In [None]:
train = df[pd.notnull(df['Survived'])]
test = df[pd.isnull(df['Survived'])]

In [None]:
train.to_csv('train_clean.csv', index=False)
test.to_csv('test_clean.csv', index=False)
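The split relies on the fact that concatenating a frame that has a Survived column with one that lacks it leaves NaN in the rows from the second frame. A small sketch with toy frames (hypothetical names and values) illustrates this:

```python
import pandas as pd

# Toy stand-ins for the labelled and unlabelled sets
train_toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
test_toy = pd.DataFrame({"PassengerId": [3]})

# Rows from test_toy get NaN in Survived after concatenation
combined = pd.concat([train_toy, test_toy], axis=0, sort=True)

# Recover the two sets from the presence or absence of a label
train_back = combined[combined["Survived"].notna()]
test_back = combined[combined["Survived"].isna()]

print(len(train_back), len(test_back))  # 2 1
```

Note that the recovered test frame still carries a Survived column made up entirely of NaN, and the labels in the training frame become floats because of the NaN values introduced by the concatenation.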

## Conclusion : 

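We used the Titanic dataset and performed feature engineering, imputing different columns with different methods: Age with the median of each title group, Fare with the median of each passenger class, and Embarked with a backfill.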

### Contribution Statement :

Did the following :
1. Normal EDA.
2. Imputation of data.
3. Saving the cleaned version of data.

### Citations:

Imputation methods knowledge: https://machinelearningmastery.com/handle-missing-data-python/

Titanic dataset reference: https://www.kaggle.com/c/titanic

## License:

Copyright <2019> Ria Rajput

Permission is hereby granted, free of charge, to any person obtaining a copy of this notebook and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
