<p align="center">
<img style="width:80%" src="https://c4.wallpaperflare.com/wallpaper/378/267/803/titanic-ship-cruise-ship-drawing-night-hd-wallpaper-preview.jpg">
</p>

[Image source](https://www.wallpaperflare.com/titanic-ship-cruise-ship-drawing-night-hd-digital-artwork-wallpaper-mzpsf/)

<h1 style="text-align: center; color:#01872A; font-size: 80px;
background:#daf2e1; border-radius: 20px;
">Titanic.<br> Part 2.</h1>

## Please use nbviewer to read this notebook and use all of its features:

https://nbviewer.org/github/sersonSerson/Projects/blob/master/Classification/Titanic/Titanic.ipynb

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Contents </span>

## 4.	[Feature engineering.](#step4)
## 4.1. [Fare.](#Step4.1)
## 4.2. [Name.](#Step4.2)
## 4.3. [SibSp.](#Step4.3)
## 4.4. [Parch.](#Step4.4)
## 4.5. [Age.](#Step4.5)
## 4.6. [Embarked.](#Step4.6)
## 4.7. [Cabin.](#Step4.7)
## 4.8. [Feature encoding.](#Step4.8)

In [689]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from matplotlib import cm
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

In [690]:
pd.options.display.max_columns = 80
pd.options.display.max_rows = 30
pd.options.display.max_colwidth = 60
pd.options.mode.chained_assignment = None

## Load data

In [693]:
train = pd.read_csv('data/train.csv', index_col='PassengerId')
test = pd.read_csv('data/test.csv', index_col='PassengerId')
filled_df = pd.concat([train, test])
filled_df.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C


<div id="Step4">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4. Feature engineering.</span>


<div id="Step4.1">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.1. Fare.</span>


According to
https://autumnmccordckp.weebly.com/tickets-and-accomodations.html
the Titanic ticket prices were:
1. First Class: £30 to £870.
2. Second Class: £12.
3. Third Class: £3 to £8.

## Check dataset fares

In [694]:
filled_df.groupby('Pclass').agg({'Fare': ['mean', 'max', 'min', 'count']})

Pclass,Fare mean,Fare max,Fare min,Fare count
1,87.508992,512.3292,0.0,323
2,21.179196,73.5,0.0,277
3,13.302889,69.55,0.0,708


## Ideas:
1. Some passengers have a zero 'Fare'. One option is to fill these with the
mean fare of the respective 'Pclass'.
2. The mean fare of every class is higher than the expected price, so the
'Fare' may be the total for all passengers who share a ticket number.
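Idea 1 could be sketched as follows. This is only an illustration on a hypothetical toy frame (the values are made up); the notebook itself fills the missing fare with 0 further down and flags zero fares instead of imputing them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare': [80.0, 0.0, 100.0, 8.0, np.nan, 10.0],
})

# Treat zero fares as missing so they do not pull the class mean down.
fare = toy['Fare'].replace(0, np.nan)

# Mean fare of each class, aligned row by row with the original frame.
class_mean = fare.groupby(toy['Pclass']).transform('mean')

# Keep known fares and use the class mean only where the fare is missing.
toy['FareFilled'] = fare.fillna(class_mean)
print(toy)
```

Whether imputing is better than keeping a separate zero-fare flag depends on the model, so treat this as one option rather than a recommendation.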

## Count the passengers on each ticket and add the count to the DataFrame.

In [695]:
ticket_passengers = \
    filled_df.groupby('Ticket')['Name'].agg('count')
ticket_passengers

Ticket
110152         3
110413         3
110465         2
110469         1
110489         1
              ..
W./C. 6608     5
W./C. 6609     1
W.E.P. 5734    2
W/C 14208      1
WE/P 5735      2
Name: Name, Length: 929, dtype: int64

In [696]:
filled_df['PassengersCount'] = filled_df['Ticket'].map(ticket_passengers)
filled_df[:3]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassengersCount
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,2
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1


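Fill the missing 'Fare' value with 0.

In [697]:
# Assign the result back: calling fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions.
filled_df['Fare'] = filled_df['Fare'].fillna(0)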

## Calculate fare per passenger.

In [698]:
filled_df['FarePerPassenger'] = filled_df['Fare'] / filled_df['PassengersCount']
filled_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassengersCount,FarePerPassenger
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,7.25
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,2,35.64165
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1,7.925
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,2,26.55
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,8.05
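The count-then-divide step can also be sketched in one pass with `groupby(...).transform`. The toy tickets and fares below are hypothetical, and this is an equivalent formulation rather than the notebook's own code, which uses `map`.

```python
import pandas as pd

# Toy data (hypothetical tickets and fares).
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'Fare': [60.0, 60.0, 7.5, 24.0, 24.0, 24.0],
})

# Number of rows sharing each ticket, broadcast back to every row.
toy['PassengersCount'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Split the ticket price evenly between the passengers on that ticket.
toy['FarePerPassenger'] = toy['Fare'] / toy['PassengersCount']
print(toy)
```

The even split is an assumption: the data does not say how a shared fare was actually divided among the passengers.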


In [699]:
filled_df.groupby('Pclass').agg({'FarePerPassenger': ['mean', 'max', 'min',
                                                 'count']})

Pclass,FarePerPassenger mean,FarePerPassenger max,FarePerPassenger min,FarePerPassenger count
1,33.9105,128.0823,0.0,323
2,11.41101,16.0,0.0,277
3,7.318808,19.9667,0.0,709


## Now the fares are closer to the expected values.
## Flag passengers with a zero ticket price.

In [700]:
filled_df['ZeroPrice'] = np.where(filled_df['FarePerPassenger'] == 0, 1, 0)
filled_df['ZeroPrice'].value_counts()

0    1291
1      18
Name: ZeroPrice, dtype: int64

<div id="Step4.2">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.2. Name.</span>


In [701]:
# Note: 'FirstName' holds the family name (the text before the comma) and
# 'LastName' holds the given names that follow the title.
filled_df['FirstName'] = filled_df['Name'].str.extract(r"([a-zA-Z '\-]+),")
filled_df['Title'] = filled_df['Name'].str.extract(r' ([a-zA-Z]+)\.')
filled_df['LastName'] = \
    filled_df['Name'].str.extract(r'\. ([a-zA-Z /"]*)').iloc[:, 0].str.strip()
# filled_df['MaidenName'] = \
#     filled_df['Name'].str.extract(r'\(([a-zA-Z ]*)\)')
filled_df[['Name', 'FirstName', 'LastName', 'Title']].head()

PassengerId,Name,FirstName,LastName,Title
1,"Braund, Mr. Owen Harris",Braund,Owen Harris,Mr
2,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",Cumings,John Bradley,Mrs
3,"Heikkinen, Miss. Laina",Heikkinen,Laina,Miss
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Futrelle,Jacques Heath,Mrs
5,"Allen, Mr. William Henry",Allen,William Henry,Mr
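To see what each regular expression captures, here is a small sketch that applies the same three patterns to names from the table above. The variable names `family_name`, `title` and `given_names` are illustrative only and do not appear in the notebook.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Text before the comma: the family name (stored as 'FirstName' above).
family_name = names.str.extract(r"([a-zA-Z '\-]+),", expand=False)

# First word that ends with a period: the title.
title = names.str.extract(r' ([a-zA-Z]+)\.', expand=False)

# Text after the title, up to the first character outside the allowed set
# (for example an opening parenthesis); stored as 'LastName' above.
given_names = names.str.extract(r'\. ([a-zA-Z /"]*)', expand=False).str.strip()

print(pd.DataFrame({'family_name': family_name,
                    'title': title,
                    'given_names': given_names}))
```

With `expand=False` a single capture group comes back as a Series, which avoids the `.iloc[:, 0]` step.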


## Spouse
No feature states directly whether a passenger's spouse is on board; 'SibSp'
only gives the combined number of siblings and spouses.
The idea is to split 'SibSp' into a spouse flag and a sibling count.
Logic:
1. 'SibSp' should be > 0.
2. A woman's title should not be 'Miss', the title of an unmarried woman.
3. A man's title should not be 'Master', the title of a boy (under
 14).
4. Spouses should have the same ticket number.
5. The first word of the wife's 'LastName' should equal the first word of the
husband's 'LastName' (a married woman is listed under her husband's given
names, e.g. 'Mrs. John Bradley').
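The five rules above can be sketched on a toy frame. The rows and the helper `is_candidate` are hypothetical, and the sketch accepts only an unambiguous single match; it is a simplified illustration, not a replacement for the notebook's loop.

```python
import pandas as pd

# Toy passengers (hypothetical rows shaped like the columns built above).
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male', 'male'],
    'Title': ['Mrs', 'Mr', 'Miss', 'Master', 'Mr'],
    'SibSp': [1, 1, 1, 1, 0],
    'Ticket': ['T1', 'T1', 'T2', 'T2', 'T3'],
    'LastName': ['John Henry', 'John Henry', 'Anna', 'Peter', 'Carl'],
})

def is_candidate(row):
    """Rules 1-3: 'SibSp' > 0 and a title that does not rule out marriage."""
    if row['SibSp'] == 0:
        return False
    if row['Sex'] == 'female':
        return row['Title'] != 'Miss'
    return row['Title'] != 'Master'

candidates = toy[toy.apply(is_candidate, axis=1)]
wives = candidates[candidates['Sex'] == 'female']
husbands = candidates[candidates['Sex'] == 'male']

toy['Spouse'] = 0
for wife_index, wife in wives.iterrows():
    # Rules 4-5: same ticket and the same first given name.
    match = husbands[
        (husbands['Ticket'] == wife['Ticket'])
        & (husbands['LastName'].str.split(' ').str[0]
           == wife['LastName'].split(' ')[0])
    ]
    if len(match) == 1:  # accept only an unambiguous match
        toy.loc[[wife_index, match.index[0]], 'Spouse'] = 1

print(toy[['Title', 'Ticket', 'LastName', 'Spouse']])
```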


In [702]:
filled_df['Spouse'] = 0
females_with_sibsp = filled_df[(filled_df['Sex'] == 'female')
                               & (filled_df['SibSp'] > 0)
                               &  (filled_df['Title'] != 'Miss')].copy()
males_with_sibsp = filled_df[(filled_df['Sex'] == 'male')
                                & (filled_df['SibSp'] > 0)].copy()
for female_index, potential_wife in females_with_sibsp.iterrows():
    potential_husband = \
        males_with_sibsp[
            (males_with_sibsp['Ticket'] == potential_wife['Ticket'])
            & (males_with_sibsp['LastName'].str.split(' ').str[0] ==
               potential_wife['LastName'].split(' ')[0])
            & (males_with_sibsp['Title'] != 'Master')]
    if len(potential_husband) > 1:
        print(f'Not found Spouse for: {potential_wife["Name"]}')
    elif len(potential_husband) > 0:
        filled_df.loc[potential_husband.index, 'Spouse'] = 1
        filled_df.loc[female_index, 'Spouse'] = 1
    else:
        print(f'Not found Spouse for: {potential_wife["Name"]}')

Not found Spouse for: Ahlin, Mrs. Johan (Johanna Persdotter Larsson)
Not found Spouse for: Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)
Not found Spouse for: Skoog, Mrs. William (Anna Bernhardina Karlsson)
Not found Spouse for: Strom, Mrs. Wilhelm (Elna Matilda Persson)
Not found Spouse for: Abbott, Mrs. Stanton (Rosa Hunt)
Not found Spouse for: Richards, Mrs. Sidney (Emily Hocking)
Not found Spouse for: Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")
Not found Spouse for: Appleton, Mrs. Edward Dale (Charlotte Lamson)
Not found Spouse for: Stephenson, Mrs. Walter Bertram (Martha Eustis)
Not found Spouse for: Goodwin, Mrs. Frederick (Augusta Tyler)
Not found Spouse for: Hogeboom, Mrs. John C (Anna Andrews)
Not found Spouse for: Hocking, Mrs. Elizabeth (Eliza Needs)
Not found Spouse for: Wilkes, Mrs. James (Ellen Needs)
Not found Spouse for: Hirvonen, Mrs. Alexander (Helga E Lindqvist)
Not found Spouse for: Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)
Not 

Some couples break the rules introduced above and need to be corrected by hand.
Possible reasons:
1. Misspelled names ('Mrs. Said' and 'Mr. Sahid').
2. Shortened names ('Mrs. Frederick' and 'Mr. Charles Frederick').
3. Wrong numbers in the initial data.

In [703]:
def correct_spouses(filled_df):
    exceptions = ['Skoog, Mrs. William (Anna Bernhardina Karlsson)',
                  'Skoog, Mr. Wilhelm',
                  'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs '
                  'Morgan")',
                  'Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")',
                  'Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)',
                  'Faunthorpe, Mr. Harry',
                  'Nakid, Mr. Sahid',
                  'Nakid, Mrs. Said (Waika Mary" Mowad)"',
                  'Goodwin, Mrs. Frederick (Augusta Tyler)',
                  'Goodwin, Mr. Charles Frederick',
                  'Ware, Mr. John James',
                  'Ware, Mrs. John James (Florence Louise Long)',
                  'Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)',
                  'Crosby, Capt. Edward Gifford'
                  ]
    exceptions_mask = filled_df['Name'].isin(exceptions)
    filled_df.loc[exceptions_mask, 'Spouse'] = 1
    return filled_df
filled_df = correct_spouses(filled_df)

<div id="Step4.3">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.3. SibSp.</span>


## Siblings
As the 'Spouse' flag is already calculated, the number of siblings is
'SibSp' minus 'Spouse'.

In [704]:
filled_df['Siblings'] = filled_df['SibSp'] - filled_df['Spouse']
filled_df['Siblings']

PassengerId
1       1
2       0
3       0
4       0
5       0
       ..
1305    0
1306    0
1307    0
1308    0
1309    1
Name: Siblings, Length: 1309, dtype: int64

## Correct wrong data

In [705]:
def correct_families(filled_df):
    filled_df.loc[filled_df['Name'] == 'Abbott, Mrs. Stanton (Rosa Hunt)',
              ['SibSp', 'Siblings', 'Parch', 'Children']] = [0, 0, 2, 2]
    filled_df.loc[filled_df['Name'] == 'Abbott, Master. Eugene Joseph',
              ['SibSp', 'Siblings', 'Parch']] = [1, 1, 1]
    filled_df.loc[filled_df['Name'] == 'Andersson, Miss. Erna Alexandra',
              ['SibSp', 'Siblings', 'Parch']] = 0
    filled_df.loc[filled_df['Name'] == 'Baxter, Mr. Quigg Edmond',
              ['SibSp', 'Siblings']] = 1
    filled_df.loc[filled_df['Name'] == 'Ford, Mr. Edward Watson',
              ['SibSp', 'Siblings', 'Spouse', 'Parch',
               'Parents']] = [3, 3, 0, 1, 1]
    filled_df.loc[filled_df['Name'] == 'Ford, Mrs. Edward (Margaret Ann Watson)',
              ['Siblings', 'Spouse', 'Parch']] = [1, 0, 4]
    filled_df.loc[filled_df['Name'] == 'Ford, Mr. William Neal',
              ['SibSp', 'Siblings', 'Parch']] = [3, 3, 1]
    filled_df.loc[filled_df['Name'] == 'Ford, Miss. Robina Maggie "Ruby"',
              ['SibSp', 'Siblings', 'Parch']] = [3, 3, 1]
    filled_df.loc[filled_df['Name'] == 'Ford, Miss. Doolina Margaret "Daisy"',
              ['SibSp', 'Siblings', 'Parch', 'Parents']] = [3, 3, 1, 1]
    filled_df.loc[filled_df['Name'] == 'Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"',
              ['SibSp', 'Siblings', 'Spouse']] = [2, 1, 1]
    filled_df.loc[filled_df['Name'] == 'Lahtinen, Mrs. William (Anna Sylfven)',
              ['SibSp', 'Siblings', 'Spouse', 'Parch']] = [2, 1, 1, 0]
    filled_df.loc[filled_df['Name'] == 'Lahtinen, Rev. William',
              ['Parch']] = [0]
    filled_df.loc[filled_df['Name'] == 'Natsch, Mr. Charles H',
              ['Parch']] = [0]
    filled_df.loc[filled_df['Name'] == 'Silven, Miss. Lyyli Karoliina',
              ['SibSp', 'Siblings', 'Parch']] = [1, 1, 0]
    filled_df.loc[filled_df['Name'] == 'Ware, Mr. William Jeffery',
              ['SibSp', 'Siblings']] = [0, 0]
    filled_df.loc[filled_df['Name'] == 'Ware, Mrs. John James (Florence Louise Long)',
              ['SibSp', 'Siblings', 'Spouse', 'Parch']] = [1, 0, 1, 0,]
    return filled_df

filled_df = correct_families(filled_df)

<div id="Step4.4">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.4. Parch.</span>


# Divide Parch into Parents and Children
1. Applied only to passengers with 'Parch' > 0.
2. Parents and children should have the same ticket number.
3. Division algorithm:
    * If some members of a group have the 'Spouse' flag set, they are treated
    as the parents and the rest as the children.
    Implemented in the 'divide_by_spouse' function.
    * Otherwise, if 'Age' is filled for everyone in the group, parents should be at
    least 12 years older than children. Implemented in the 'divide_by_age' function.
    * Otherwise (some ages are missing), divide by the 'Title' column.
    Implemented in the 'divide_by_title' function.
    For parents 'Title' is in ['Dr', 'Mr', 'Mrs', 'Capt'], for children in
    ['Miss', 'Mr', 'Master'].
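As an illustration of the age rule, the sketch below splits one hypothetical family and then assigns the 'Children' and 'Parents' counts the same way the notebook does. The ages and the constant name `AGE_GAP` are assumptions made for the example.

```python
import pandas as pd

# Toy family on one ticket (hypothetical ages).
family = pd.DataFrame({
    'Name': ['father', 'mother', 'daughter', 'son'],
    'Age': [40.0, 36.0, 9.0, 4.0],
    'Parch': [2, 2, 2, 2],
})

AGE_GAP = 12  # minimum parent-child age difference used in the rule above

# Everyone at least AGE_GAP years younger than the oldest member is a child.
max_children_age = family['Age'].max() - AGE_GAP
children = family[family['Age'] <= max_children_age]
parents = family[family['Age'] > max_children_age]

# For parents 'Parch' counts children; for children it counts parents.
family['Children'] = 0
family['Parents'] = 0
family.loc[parents.index, 'Children'] = parents['Parch']
family.loc[children.index, 'Parents'] = children['Parch']
print(family)
```

The cutoff is relative to the oldest member only, so a three-generation group on one ticket would not be split correctly by this rule alone.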

In [706]:
def divide_by_spouse(group):
    parents_group, children_group = None, None
    if any(group['Spouse'] > 0):
        Spouse_grouped = group.groupby(['Spouse'])
        for name, Spouse_group in Spouse_grouped:
            if all(Spouse_group['Spouse'] == 1):
                parents_group = Spouse_group
            elif all(Spouse_group['Spouse'] != 1):
                children_group = Spouse_group
    return parents_group, children_group

def divide_by_age(group):
    max_age = group['Age'].max()
    max_children_age = max_age - 12
    children_group = group[group['Age'] <= max_children_age]
    parents_group = group[group['Age'] > max_children_age]
    return  parents_group, children_group

def divide_by_title(group):
    parents_group, children_group = None, None
    parch_grouped = group.groupby(['Parch'])
    for name, parch_group in parch_grouped:
        if all(parch_group['Title'].isin(['Dr', 'Mr', 'Mrs', 'Capt'])):
            if parents_group is None:
                parents_group = parch_group
        elif all(parch_group['Title'].isin(['Miss', 'Mr', 'Master'])):
            if children_group is None:
                children_group = parch_group
    return parents_group, children_group

def divide_in_parents_and_children(group):
    parents_group, children_group = divide_by_spouse(group)

    # True when every passenger in the group has a known age.
    age_filled = group['Age'].notna().all()

    if children_group is None or parents_group is None:
        if age_filled:
            parents_group, children_group = divide_by_age(group)
        else:
            parents_group, children_group = divide_by_title(group)

    return parents_group, children_group

filled_parch = filled_df[filled_df['Parch'] > 0]
filled_df['Parents'] = 0
filled_df['Children'] = 0
ticket_grouped = filled_parch.sort_values('Ticket').groupby(['Ticket'])
for name, ticket_group in ticket_grouped:
    parents_group, children_group = divide_in_parents_and_children(ticket_group)
    if children_group is not None and parents_group is not None\
            and len(children_group) > 0 and len(parents_group) > 0:
        filled_df.loc[parents_group.index, 'Children'] = parents_group['Parch']
        filled_df.loc[children_group.index, 'Parents'] = children_group['Parch']

# Merge everyone whose parents and children were found incorrectly into one group and try to find matches among this group, this time grouping by family name (stored in 'FirstName').

In [707]:
incorrect_parch = filled_df[(filled_df['Children'] + filled_df['Parents']) !=
            filled_df['Parch']]
name_grouped = incorrect_parch.sort_values('FirstName').groupby(['FirstName'])
for name, name_group in name_grouped:
    parents_group, children_group = divide_in_parents_and_children(name_group)
    if children_group is not None and parents_group is not None\
            and len(children_group) > 0 and len(parents_group) > 0:
        filled_df.loc[parents_group.index, 'Children'] = parents_group['Parch']
        filled_df.loc[children_group.index, 'Parents'] = children_group['Parch']

# Correct all remaining incorrect samples by hand.

In [708]:
def incorrect_parents_children(filled_df):

    filled_df.loc[filled_df['Name'] == 'Chibnall, Mrs. (Edith Martha Bowerman)',
              ['Children']] = [1]
    filled_df.loc[filled_df['Name'] == 'Bowerman, Miss. Elsie Edith',
              ['Parents']] = [1]
    filled_df.loc[filled_df['Name'] == 'Klasen, Mr. Klas Albin',
              ['Parents']] = [1]
    filled_df.loc[filled_df['Name'] == 'Newsom, Miss. Helen Monypeny',
              ['Parents']] = [2]
    filled_df.loc[filled_df['Name'] == 'Beckwith, Mr. Richard Leonard',
              ['Children']] = [1]
    filled_df.loc[filled_df['Name'] == 'Beckwith, Mrs. Richard Leonard (Sallie Monypeny)',
              ['Children']] = [1]
    filled_df.loc[filled_df['Name'] == 'Hocking, Mrs. Elizabeth (Eliza Needs)',
              ['Children']] = [3]
    filled_df.loc[filled_df['Name'] == 'Hocking, Mr. Richard George',
              ['Parents']] = [1]
    filled_df.loc[filled_df['Name'] == 'Hays, Mr. Charles Melville',
              ['Children']] = [1]
    filled_df.loc[filled_df['Name'] == 'Hays, Mrs. Charles Melville (Clara Jennings Gregg)',
              ['Children']] = [1]
    filled_df.loc[filled_df['Name'] == 'Davidson, Mrs. Thornton (Orian Hays)',
              ['Parents']] = [2]
    filled_df.loc[filled_df['Name'] == 'Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)',
              ['Children']] = [1]
    filled_df.loc[filled_df['Name'] == 'Frolicher, Miss. Hedwig Margaritha',
              ['Parents']] = [2]
    filled_df.loc[filled_df['Name'] == 'Hiltunen, Miss. Marta',
              ['Parch']] = [0]
    filled_df.loc[filled_df['Name'] == 'Andersson, Miss. Ida Augusta Margareta',
              ['Parch']] = [0]
    filled_df.loc[filled_df['Name'] == 'Newell, Mr. Arthur Webster',
              ['Children']] = [2]
    filled_df.loc[filled_df['Name'] == 'Jacobsohn, Mrs. Sidney Samuel (Amy Frances Christy)',
              ['Parents']] = [1]
    return filled_df

filled_df = incorrect_parents_children(filled_df)

### Check if there are any mistakes left

In [709]:
incorrect_parch = \
    filled_df[(filled_df['Children'] + filled_df['Parents']) !=
              filled_df['Parch']]
len(incorrect_parch)

0

Add some new features:
1. Women travelling with only a husband.

In [710]:
filled_df['OnlyHusband'] = \
    np.where((filled_df['Sex'] == 'female') & (filled_df['Spouse'] == 1)
              & (filled_df['Parents'] == 0) & (filled_df['Children'] == 0),1, 0)

2. Families that have many children.

In [711]:
filled_df['ManyChildren'] = np.where(filled_df['Children'] > 3, 1, 0)

3. Total relatives.

In [712]:
filled_df['TotalRelatives'] = filled_df['Parch'] + filled_df['SibSp']
filled_df['TotalRelatives']

PassengerId
1       1
2       1
3       0
4       1
5       0
       ..
1305    0
1306    0
1307    0
1308    0
1309    2
Name: TotalRelatives, Length: 1309, dtype: int64

4. Travelling alone.

In [713]:
filled_df['Alone'] = \
    np.where(filled_df['Parch'] + filled_df['SibSp'] == 0, 1, 0)
filled_df['Alone']

PassengerId
1       0
2       0
3       1
4       0
5       1
       ..
1305    1
1306    1
1307    1
1308    1
1309    0
Name: Alone, Length: 1309, dtype: int32
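The four derived features can be checked on a small hand-made frame. The rows below are hypothetical; the expressions are the same ones used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) with the columns derived earlier.
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Spouse': [1, 1, 0, 0],
    'Parents': [0, 0, 0, 0],
    'Children': [0, 4, 0, 0],
    'SibSp': [1, 1, 0, 2],
    'Parch': [0, 4, 0, 0],
})

# Married woman with neither parents nor children aboard.
toy['OnlyHusband'] = np.where(
    (toy['Sex'] == 'female') & (toy['Spouse'] == 1)
    & (toy['Parents'] == 0) & (toy['Children'] == 0), 1, 0)

# More than three children aboard.
toy['ManyChildren'] = np.where(toy['Children'] > 3, 1, 0)

# All relatives aboard, and a flag for passengers with none.
toy['TotalRelatives'] = toy['Parch'] + toy['SibSp']
toy['Alone'] = np.where(toy['TotalRelatives'] == 0, 1, 0)
print(toy)
```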

<div id="Step4.5">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.5. Age.</span>

### Age seems to be an important feature, so it is worth spending time on filling it properly.
### Check the mean age of each 'Title'.

In [714]:
title_stats = filled_df.groupby('Title')
title_stats = title_stats.agg({'Age': ['min', 'mean', 'max', 'count']})
title_stats.sort_values(('Age', 'count'), ascending=False)[:10]

Title,Age min,Age mean,Age max,Age count
Mr,11.0,32.252151,80.0,581
Miss,0.17,21.774238,63.0,210
Mrs,14.0,36.994118,76.0,170
Master,0.33,5.482642,14.5,53
Rev,27.0,41.25,57.0,8
Dr,23.0,43.571429,54.0,7
Col,47.0,54.0,60.0,4
Mlle,24.0,24.0,24.0,2
Major,45.0,48.5,52.0,2
Ms,28.0,28.0,28.0,1


In [715]:
def fill_age_by_mask(mask, df):
    mean = df.loc[mask, 'Age'].mean()
    df.loc[mask, 'Age'] = df.loc[mask, 'Age'].fillna(mean)
    return df
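A minimal sketch of how this helper behaves, on hypothetical toy data. The function is repeated here only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def fill_age_by_mask(mask, df):
    """Fill missing 'Age' inside `mask` with the mean 'Age' of that subgroup."""
    mean = df.loc[mask, 'Age'].mean()
    df.loc[mask, 'Age'] = df.loc[mask, 'Age'].fillna(mean)
    return df

# Toy data (hypothetical ages).
toy = pd.DataFrame({
    'Title': ['Master', 'Master', 'Master', 'Mr', 'Mr', 'Mr'],
    'Age': [4.0, np.nan, 6.0, 30.0, np.nan, 40.0],
})

# Each call only touches rows selected by its mask.
toy = fill_age_by_mask(toy['Title'] == 'Master', toy)
toy = fill_age_by_mask(toy['Title'] == 'Mr', toy)
print(toy)
```

When masks overlap, rows filled by an earlier call are left unchanged by a later one, but they do contribute to the later mean, so the order of the calls matters.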

The 'Master' title is used for young boys (ages 0.33 to 14.5 in this dataset).

In [716]:
master_mask = filled_df['Title'] == 'Master'
filled_df = fill_age_by_mask(master_mask, filled_df)

In [717]:
filled_df['TravelsAlone'] = \
    np.where((filled_df['Parch'] + filled_df['SibSp']) == 0, 1, 0)

Passengers travelling alone are mostly men, typically young or middle-aged.

In [718]:
travels_alone_mask = filled_df['TravelsAlone'] == 1
filled_df[travels_alone_mask]['Sex'].value_counts(normalize=True)

male      0.755051
female    0.244949
Name: Sex, dtype: float64

In [719]:
print(f'Mean age of alone travellers: '
      f'{filled_df[travels_alone_mask]["Age"].mean()}')

Mean age of alone travellers: 31.43926246460276


In [720]:
filled_df = fill_age_by_mask(travels_alone_mask, filled_df)

Married people (both men and women) are a bit older.

In [721]:
married_mask = filled_df['Spouse'] == 1
print(f'Mean age of married travellers: '
      f'{filled_df[married_mask]["Age"].mean()}')

Mean age of married travellers: 37.154639175257735


In [722]:
filled_df = fill_age_by_mask(married_mask, filled_df)

People who travel with parents and no spouse are usually young

In [723]:
travels_with_parents_mask = (filled_df['Parents'] > 0) & (filled_df['Spouse']
                             == 0)
print(f'Mean age of travellers with parents: '
      f'{filled_df[travels_with_parents_mask]["Age"].mean()}')
filled_df = fill_age_by_mask(travels_with_parents_mask, filled_df)
filled_df.loc[travels_with_parents_mask]['Age'].value_counts(dropna=False)

Mean age of travellers with parents: 10.856080067731012


10.85608    13
2.00000     12
1.00000     10
9.00000     10
4.00000     10
            ..
39.00000     1
38.00000     1
0.67000      1
31.00000     1
14.50000     1
Name: Age, Length: 43, dtype: int64

All remaining passengers without a spouse.

In [724]:
not_married_mask = filled_df['Spouse'] == 0
print(f'Mean age of travellers without a spouse: '
      f'{filled_df[not_married_mask]["Age"].mean()}')
filled_df = fill_age_by_mask(not_married_mask, filled_df)

Mean age of travellers without a spouse: 28.4412998166156


Check if there are any unfilled values of 'Age' left.

In [725]:
filled_df.isna().sum().sort_values(ascending=False)[:4]

Cabin        1014
Survived      418
Embarked        2
FirstName       0
dtype: int64

<div id="Step4.6">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.6. Embarked.</span>

## Fill empty 'Embarked' values with most frequent value.

In [726]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a 2D array of shape (n, 1); flatten it to one column.
filled_df['Embarked'] = imputer.fit_transform(filled_df[['Embarked']]).ravel()

In [727]:
filled_df.isna().sum().sort_values(ascending=False)[:3]

Cabin        1014
Survived      418
FirstName       0
dtype: int64
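For a single column, a similar result can be sketched with pandas alone. The values below are hypothetical, and this is an alternative formulation rather than the method used above.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two gaps.
embarked = pd.Series(['S', 'C', np.nan, 'S', 'Q', np.nan], name='Embarked')

# mode() ignores NaN and returns the most frequent value(s) in sorted order;
# the first entry is taken as the fill value.
most_frequent = embarked.mode().iloc[0]
embarked_filled = embarked.fillna(most_frequent)
print(embarked_filled.tolist())
```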

<div id="Step4.7">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.7. Cabin.</span>

Create a new feature 'Deck' from the first letter of 'Cabin'.

In [728]:
filled_df['Deck'] = filled_df['Cabin'].str[0].fillna('Empty')
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`.
encoder = OneHotEncoder(sparse=False)
encoded_deck = encoder.fit_transform(filled_df[['Deck']])
for column_index, category in enumerate(encoder.categories_[0]):
    print(category, column_index)
    filled_df['Deck' + category.capitalize()] = encoded_deck[:, column_index]
filled_df.drop('Deck', axis=1, inplace=True)
filled_df.head()

A 0
B 1
C 2
D 3
E 4
Empty 5
F 6
G 7
T 8


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassengersCount,FarePerPassenger,ZeroPrice,FirstName,Title,LastName,Spouse,Siblings,Children,Parents,OnlyHusband,ManyChildren,TotalRelatives,Alone,TravelsAlone,DeckA,DeckB,DeckC,DeckD,DeckE,DeckEmpty,DeckF,DeckG,DeckT
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,7.25,0,Braund,Mr,Owen Harris,0,1,0,0,0,0,1,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,2,35.64165,0,Cumings,Mrs,John Bradley,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1,7.925,0,Heikkinen,Miss,Laina,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,2,26.55,0,Futrelle,Mrs,Jacques Heath,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,8.05,0,Allen,Mr,William Henry,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
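The same deck indicators can be sketched with `pd.get_dummies`, which avoids the explicit loop over categories. The cabin values below are hypothetical, and this is an alternative shown for comparison, not the notebook's approach.

```python
import numpy as np
import pandas as pd

# Toy cabins (hypothetical values); NaN means the cabin is unknown.
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', 'C123', np.nan]})

# The first character of the cabin code is the deck letter.
deck = toy['Cabin'].str[0].fillna('Empty')

# One indicator column per deck: DeckC, DeckE, DeckEmpty.
deck_dummies = pd.get_dummies(deck, prefix='Deck', prefix_sep='').astype(float)
toy = toy.join(deck_dummies)
print(toy)
```

Unlike a fitted encoder, `get_dummies` only creates columns for categories present in the frame it is given, which is fine here because train and test are encoded together.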


<div id="Step4.8">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 4.8. Feature encoding.</span>

## One-hot encoding for categorical features

In [729]:
encoder = OneHotEncoder(sparse=False)
encoded_pclass = encoder.fit_transform(filled_df[['Pclass']])
for column_index, category in enumerate(encoder.categories_[0]):
    print(category, column_index)
    if category == 1:
        category = 'First'
    elif category == 2:
        category = 'Second'
    elif category == 3:
        category = 'Third'
    filled_df['Pclass' + category.capitalize()] = encoded_pclass[:, column_index]
filled_df.drop('Pclass', axis=1, inplace=True)
filled_df.head()

1 0
2 1
3 2


PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassengersCount,FarePerPassenger,ZeroPrice,FirstName,Title,LastName,Spouse,Siblings,Children,Parents,OnlyHusband,ManyChildren,TotalRelatives,Alone,TravelsAlone,DeckA,DeckB,DeckC,DeckD,DeckE,DeckEmpty,DeckF,DeckG,DeckT,PclassFirst,PclassSecond,PclassThird
1,0.0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,7.25,0,Braund,Mr,Owen Harris,0,1,0,0,0,0,1,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,2,35.64165,0,Cumings,Mrs,John Bradley,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1,7.925,0,Heikkinen,Miss,Laina,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,2,26.55,0,Futrelle,Mrs,Jacques Heath,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0.0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,8.05,0,Allen,Mr,William Henry,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [730]:
encoder = OneHotEncoder(sparse=False)
encoded_sex = encoder.fit_transform(filled_df[['Sex']])
for column_index, category in enumerate(encoder.categories_[0]):
    print(category, column_index)
    filled_df['Sex' + category.capitalize()] = encoded_sex[:, column_index]
filled_df.drop('Sex', axis=1, inplace=True)
filled_df.head()

female 0
male 1


PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassengersCount,FarePerPassenger,ZeroPrice,FirstName,Title,LastName,Spouse,Siblings,Children,Parents,OnlyHusband,ManyChildren,TotalRelatives,Alone,TravelsAlone,DeckA,DeckB,DeckC,DeckD,DeckE,DeckEmpty,DeckF,DeckG,DeckT,PclassFirst,PclassSecond,PclassThird,SexFemale,SexMale
1,0.0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,1,7.25,0,Braund,Mr,Owen Harris,0,1,0,0,0,0,1,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38.0,1,0,PC 17599,71.2833,C85,C,2,35.64165,0,Cumings,Mrs,John Bradley,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,1.0,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,1,7.925,0,Heikkinen,Miss,Laina,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,2,26.55,0,Futrelle,Mrs,Jacques Heath,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
5,0.0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,1,8.05,0,Allen,Mr,William Henry,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


In [731]:
encoder = OneHotEncoder(sparse=False)
encoded_embarked = encoder.fit_transform(filled_df[['Embarked']])
for column_index, category in enumerate(encoder.categories_[0]):
    print(category, column_index)
    filled_df['Embarked' + category] = encoded_embarked[:, column_index]
filled_df.drop('Embarked', axis=1, inplace=True)
filled_df.head()

C 0
Q 1
S 2


PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,PassengersCount,FarePerPassenger,ZeroPrice,FirstName,Title,LastName,Spouse,Siblings,Children,Parents,OnlyHusband,ManyChildren,TotalRelatives,Alone,TravelsAlone,DeckA,DeckB,DeckC,DeckD,DeckE,DeckEmpty,DeckF,DeckG,DeckT,PclassFirst,PclassSecond,PclassThird,SexFemale,SexMale,EmbarkedC,EmbarkedQ,EmbarkedS
1,0.0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,1,7.25,0,Braund,Mr,Owen Harris,0,1,0,0,0,0,1,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38.0,1,0,PC 17599,71.2833,C85,2,35.64165,0,Cumings,Mrs,John Bradley,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,1.0,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,1,7.925,0,Heikkinen,Miss,Laina,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
4,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,2,26.55,0,Futrelle,Mrs,Jacques Heath,1,0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
5,0.0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,1,8.05,0,Allen,Mr,William Henry,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
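
The Sex and Embarked cells above repeat the same pattern, so it could be factored into one helper. The sketch below is only one possible way to do that, under stated assumptions: `add_one_hot_columns` is a hypothetical name, the toy frame stands in for `filled_df`, and `pd.get_dummies` is swapped in for `OneHotEncoder` so that the example does not depend on the scikit-learn version.

```python
import pandas as pd


def add_one_hot_columns(df, column, capitalize=False):
    """Hypothetical helper: replace `column` with one 0/1 float column per category."""
    values = df[column].str.capitalize() if capitalize else df[column]
    # prefix_sep='' gives names such as 'SexMale' or 'EmbarkedS'
    dummies = pd.get_dummies(values, prefix=column, prefix_sep='', dtype=float)
    return pd.concat([df.drop(columns=column), dummies], axis=1)


# Toy stand-in for filled_df (assumed values, not taken from the data set)
demo = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                     'Embarked': ['S', 'C', 'Q']},
                    index=pd.Index([1, 2, 3], name='PassengerId'))
demo = add_one_hot_columns(demo, 'Sex', capitalize=True)
demo = add_one_hot_columns(demo, 'Embarked')
print(demo.columns.tolist())
```

Unlike `OneHotEncoder`, `pd.get_dummies` skips missing values by default, so this sketch assumes the column has already been filled.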


## Structure dataframe

In [732]:
filled_df = filled_df[['Survived', 'Name', 'Age', 'Ticket', 'Fare', 'Cabin',
                       'FirstName', 'LastName', 'Title', 'SibSp', 'Siblings',
                       'Spouse', 'Parch', 'Parents', 'Children',
                       'PassengersCount', 'FarePerPassenger', 'ZeroPrice',
                       'TravelsAlone', 'PclassFirst', 'PclassSecond',
                       'PclassThird', 'SexFemale', 'SexMale', 'EmbarkedC',
                       'EmbarkedQ', 'EmbarkedS', 'TotalRelatives', 'Alone',
                       'DeckA', 'DeckB', 'DeckC', 'DeckD', 'DeckE',
                       'DeckEmpty', 'DeckF', 'DeckG', 'DeckT', 'OnlyHusband',
                       'ManyChildren']]

## Mean encoding
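1. Create a new column for each encoded feature.
2. Fill it with the mean of the dependent variable within each group of that feature.

In our case this is the estimated probability of survival given a feature value, such as a particular 'Title' or 'Sex'.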
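
The idea can be sketched on a toy frame before reading the full implementation. This is a minimal illustration under stated assumptions, not the notebook's own code: the helper name `oof_mean_encode`, the toy data and the 2-fold split are made up for the example. Each row is encoded with the target mean computed on the other fold only, so a row never sees its own label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_encode(df, column, target, n_splits=2, random_state=0):
    """Hypothetical helper: out-of-fold mean of `target` per value of `column`."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in folds.split(df):
        # Category means are computed on the fitting folds only
        means = df.iloc[fit_idx].groupby(column)[target].mean()
        # Held-out rows look up their category; unseen categories stay NaN
        encoded.iloc[enc_idx] = df.iloc[enc_idx][column].map(means).to_numpy()
    return encoded


# Toy data (assumed values, not taken from the Titanic data set)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 1, 0, 0, 1],
})
toy['SexMeanEncoded'] = oof_mean_encode(toy, 'Sex', 'Survived')
print(toy)
```

Rows whose category does not occur in the fitting folds are left as NaN here; the notebook code handles that case with either the overall mean or a `-1` marker.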

In [733]:
def generate_column_name(col, function_name, dependent_variable):

    if not isinstance(col, list):
        col = [col]
    name = ''.join([str(elem) for elem in col]) \
           + function_name + dependent_variable

    return name


def generate_dependent_features_new(X, y, columns, dependent_variable='DV',
                                    functions=None, function_names=None,
                                    replace=False, fill_empty_with_mean=False):
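    """
    Generate features by aggregating the dependent variable within the groups
    defined by each entry of `columns`.
    :param X: features of the train and test rows combined
    :param y: dependent variable, NaN for the test rows
    :param columns: grouping columns (a name or a list of names per entry)
    :param dependent_variable: name of the dependent variable column;
        y.name is used when None
    :param functions: aggregation functions applied to the dependent variable
    :param function_names: suffixes used to name the generated columns
    :param replace: if True, drop the grouping columns and give their names
        to the columns generated by the first function
    :param fill_empty_with_mean: if True, fill missing encodings with the
        mean of the dependent variable on the training rows
    :return: the transformed frame, plus the list of new column names when
        replace is False
    """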

    if dependent_variable is None:
        dependent_variable = y.name

    if functions is None or function_names is None:
        functions = (np.mean, np.min, np.max)
        function_names = ['Mean', 'Min', 'Max']

    full_df = pd.concat([X, y], axis=1)
    X, y, X_pred, full_train = create_X_and_y(full_df, dependent_variable)

    # Fill the training set categorical features with function of dependent
    # variable
    filled_train, new_numerical_columns = \
        fill_train_dataset(full_train, columns,
                           dependent_variable=dependent_variable,
                           functions=functions, function_names=function_names,
                           replace=replace, fill_empty_with_mean=fill_empty_with_mean)

    # Fill the values for the test (competition) set
    X_tr = fill_test_dataset(filled_train, full_df, columns,
                             dependent_variable=dependent_variable,
                             functions=functions, function_names=function_names,
                             replace=replace, fill_empty_with_mean=fill_empty_with_mean)

    if replace:
        X_tr.drop(columns, axis=1, inplace=True)
        columns = {colname + function_names[0] + dependent_variable:
                       colname for colname in columns}
        X_tr.rename(columns=columns,
                    inplace=True)
        return X_tr
    else:
        return X_tr, new_numerical_columns


def fill_train_dataset(full_train, columns, dependent_variable='DV',
                       functions=None, function_names=None, replace=False,
                       fill_empty_with_mean=False):
    # Use 5 folds so that each row is encoded with statistics computed on the
    # other folds only, never with its own value of the dependent variable.
    skf = KFold(5, shuffle=True, random_state=0)
    full_train_new = full_train.copy()
    # Algorithm:
    # 1. Loop over the folds so that a row never contributes to its own encoding
    # 2. Loop over each grouping column (or list of columns)
    # 3. Loop over each aggregation function
    index_name = full_train.index.name
    new_numerical_columns = []
    for tr_ind, val_ind in skf.split(full_train, full_train[dependent_variable]):
        X_tr, X_val = full_train.iloc[tr_ind], full_train.iloc[val_ind]
        for col in columns:
            for function, function_name in zip(functions, function_names):
                name = generate_column_name(col, function_name,
                                            dependent_variable)
                # Generate columns for the new features
                if name not in full_train_new.columns:
                    full_train_new[name] = 0
                    new_numerical_columns.append(name)

                x_tr_means = \
                    X_tr.groupby(col)[dependent_variable].agg(
                        function).rename(name)
                # merge() resets the index of a DataFrame, so:
                # 1. Move the index into a column with reset_index()
                # 2. Merge with the aggregates (map() would not work with
                #    multiple grouping columns)
                # 3. Restore the index with set_index()

                full_train_new.iloc[
                    val_ind, full_train_new.columns.get_loc(name)] = \
                    X_val.reset_index().merge(x_tr_means, on=col,
                                              how='left').set_index(index_name)[name]

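    if fill_empty_with_mean:
        # Fill the values for the rows that have no other rows to take mean from
        # (only the generated columns, so original features keep their NaNs)
        prior = full_train_new[dependent_variable].mean()
        full_train_new[new_numerical_columns] = \
            full_train_new[new_numerical_columns].fillna(prior)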

    return full_train_new, new_numerical_columns


def fill_test_dataset(full_train, full_df, columns, dependent_variable='DV',
                      functions=None, function_names=None, replace=False,
                      fill_empty_with_mean=False):
    prior = full_train[dependent_variable].mean()
    index_name = full_train.index.name
    for col in columns:
        for function, function_name in zip(functions, function_names):
            name = generate_column_name(col, function_name,
                                        dependent_variable)
            full_df.loc[:, name] = full_train[name]
            # Aggregate the whole training set with the current function
            full_tr_means = full_train.groupby(col)[
                dependent_variable].agg(function).rename(name)
            test_rows = full_df[dependent_variable].isna()
            full_df.loc[test_rows, name] = \
                full_df.loc[test_rows].drop(name, axis=1).\
                reset_index().merge(full_tr_means, on=col, how='left'). \
                set_index(index_name)[name]

            if fill_empty_with_mean:
                # Fill the values for the rows that have no other rows
                # to take mean from
                full_df[name] = full_df[name].fillna(prior)

    full_df.drop(dependent_variable, inplace=True, axis=1)

    return full_df

def create_X_and_y(df, dependent_variable, return_x=False):

    # Create train and test datasets
    processed_train = df.loc[df[dependent_variable].notna()]
    processed_test = df.loc[df[dependent_variable].isna()]

    # Split train data into X and y
    X = processed_train.drop([dependent_variable], axis=1)
    y = processed_train[dependent_variable]
    X_train = processed_train
    X_pred = processed_test.drop([dependent_variable], axis=1)

    if len(X_pred) == 0:
        X_pred = None
    if return_x:
        return X
    else:
        return X, y, X_pred, X_train

columns_to_encode = ['Ticket', 'Cabin', 'Title', 'SibSp', 'Siblings', 'Spouse',
                    'Parch', 'Parents', 'Children', 'PassengersCount',
                    'FarePerPassenger', 'TravelsAlone', 'PclassFirst',
                    'PclassSecond', 'PclassThird', 'SexFemale', 'SexMale',
                    'EmbarkedC', 'EmbarkedQ', 'EmbarkedS', 'TotalRelatives',
                    'Alone', 'OnlyHusband', 'ManyChildren', 'DeckA', 'DeckB',
                    'DeckC', 'DeckD', 'DeckE', 'DeckEmpty', 'DeckF', 'DeckG',
                    'DeckT']

new_columns = columns_to_encode
new_df = generate_dependent_features_new(filled_df.drop('Survived', axis=1),
                                         filled_df['Survived'],
                                         columns=columns_to_encode,
                                         functions=[np.mean],
                                         dependent_variable=None,
                                         function_names=['Mean'], replace=True)
new_column_names = [column + 'MeanEncoded' for column in columns_to_encode]
filled_df[new_column_names] = new_df[columns_to_encode]
# Mark encodings that could not be computed with -1
filled_df[new_column_names] = filled_df[new_column_names].fillna(-1)
filled_df.head()

PassengerId,Survived,Name,Age,Ticket,Fare,Cabin,FirstName,LastName,Title,SibSp,Siblings,Spouse,Parch,Parents,Children,PassengersCount,FarePerPassenger,ZeroPrice,TravelsAlone,PclassFirst,PclassSecond,PclassThird,SexFemale,SexMale,EmbarkedC,EmbarkedQ,EmbarkedS,TotalRelatives,Alone,DeckA,DeckB,DeckC,DeckD,DeckE,DeckEmpty,DeckF,DeckG,DeckT,OnlyHusband,ManyChildren,TicketMeanEncoded,CabinMeanEncoded,TitleMeanEncoded,SibSpMeanEncoded,SiblingsMeanEncoded,SpouseMeanEncoded,ParchMeanEncoded,ParentsMeanEncoded,ChildrenMeanEncoded,PassengersCountMeanEncoded,FarePerPassengerMeanEncoded,TravelsAloneMeanEncoded,PclassFirstMeanEncoded,PclassSecondMeanEncoded,PclassThirdMeanEncoded,SexFemaleMeanEncoded,SexMaleMeanEncoded,EmbarkedCMeanEncoded,EmbarkedQMeanEncoded,EmbarkedSMeanEncoded,TotalRelativesMeanEncoded,AloneMeanEncoded,OnlyHusbandMeanEncoded,ManyChildrenMeanEncoded,DeckAMeanEncoded,DeckBMeanEncoded,DeckCMeanEncoded,DeckDMeanEncoded,DeckEMeanEncoded,DeckEmptyMeanEncoded,DeckFMeanEncoded,DeckGMeanEncoded,DeckTMeanEncoded
1,0.0,"Braund, Mr. Owen Harris",22.0,A/5 21171,7.25,,Braund,Owen Harris,Mr,1,1,0,0,0,0,1,7.25,0,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,-1.0,-1.0,0.16545,0.545977,0.588235,0.372287,0.355019,0.371522,0.3875,0.284211,0.1875,0.522648,0.317254,0.381206,0.261538,0.201299,0.201299,0.356522,0.396024,0.348837,0.574803,0.522648,0.374815,0.401989,0.395448,0.3789,0.383459,0.383721,0.384615,0.310909,0.394286,0.397183,0.398317
2,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",38.0,PC 17599,71.2833,C85,Cumings,John Bradley,Mrs,1,0,1,0,0,0,2,35.64165,0,0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,-1.0,-1.0,0.791667,0.529762,0.376884,0.491379,0.346225,0.360065,0.374224,0.496599,-1.0,0.492857,0.611765,0.360424,0.547468,0.744939,0.744939,0.527132,0.37963,0.492228,0.552,0.492857,0.805556,0.386913,0.381636,0.365385,0.545455,0.37172,0.367496,0.644172,0.379801,0.382228,0.383966
3,1.0,"Heikkinen, Miss. Laina",26.0,STON/O2. 3101282,7.925,,Heikkinen,Laina,Miss,0,0,0,0,0,0,1,7.925,0,1,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,-1.0,-1.0,0.682759,0.340206,0.384095,0.348333,0.345588,0.366013,0.370543,0.253927,0.411765,0.293706,0.295327,0.365248,0.23057,0.746094,0.746094,0.339655,0.388633,0.337838,0.293706,0.293706,0.355655,0.389205,0.384286,0.359584,0.367625,0.37172,0.371925,0.293578,0.384068,0.385915,0.386236
4,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,113803,53.1,C123,Futrelle,Jacques Heath,Mrs,1,0,1,0,0,0,2,26.55,0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,-1.0,-1.0,0.79,0.545977,0.389831,0.535088,0.355019,0.371522,0.3875,0.514085,0.5,0.522648,0.649425,0.381206,0.563467,0.760956,0.760956,0.356522,0.396024,0.348837,0.574803,0.522648,0.815789,0.401989,0.395448,0.3789,0.604167,0.383721,0.384615,0.693252,0.394286,0.397183,0.398317
5,0.0,"Allen, Mr. William Henry",35.0,373450,8.05,,Allen,William Henry,Mr,0,0,0,0,0,0,1,8.05,0,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,-1.0,-1.0,0.158654,0.340862,0.377104,0.355593,0.348921,0.367742,0.371341,0.269231,0.121951,0.305747,0.308688,0.357016,0.237852,0.18913,0.18913,0.349736,0.386154,0.34585,0.305747,0.305747,0.360119,0.389518,0.385164,0.369118,0.367868,0.370803,0.37409,0.306715,0.382311,0.385915,0.386236


## Save the modified data

In [734]:
filled_df.to_csv('data/Preprocessed data.csv', index=True, header=True,
                 index_label='PassengerId')
filled_df.shape

(1309, 73)
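
A round-trip check can confirm that a file saved this way reloads with the same index and values. The sketch below uses a small stand-in frame and a temporary directory instead of `filled_df` and `data/`, so the frame name `sample` and its values are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for filled_df (assumed values); NaN in Survived marks a test row
sample = pd.DataFrame({'Survived': [0.0, 1.0, None],
                       'TitleMeanEncoded': [0.17, 0.79, -1.0]},
                      index=pd.Index([1, 2, 3], name='PassengerId'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Preprocessed data.csv')
    sample.to_csv(path, index=True, header=True, index_label='PassengerId')
    reloaded = pd.read_csv(path, index_col='PassengerId')

# Raises an AssertionError if the reloaded frame differs from the saved one
pd.testing.assert_frame_equal(sample, reloaded)
print(reloaded.shape)
```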

## Part 1. EDA:

https://nbviewer.org/github/sersonSerson/Projects/blob/master/Classification/Titanic/Titanic.ipynb

## Part 3. Model selection:

https://nbviewer.org/github/sersonSerson/Projects/blob/master/Classification/Titanic/Titanic.ipynb
