## Preamble
Hello Kaggle! I'm new-ish to machine learning and wanted to contribute something to Kaggle. I've read a few other people's notebooks for ideas on the Titanic competition and wanted to throw my hat in the ring. On paper I have a few years' experience in data science, but really it's just been fitting linear models. Recently I decided I wanted to learn more, so I started a few of the courses on here. Along with my data manipulation I've included some thoughts that popped into my head while I was working through the data. They may be useful, they may not, but I thought I'd include them anyway.

I'd appreciate any feedback you have to give. I'm still not sure which models really need data to be set up numerically with a Gaussian distribution, or whether it's okay to have ordinal data encoded as integers, but here's my first attempt at a notebook. Enjoy!

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None 
%matplotlib inline

In [None]:
train_data = pd.read_csv("../input/titanic/train.csv", index_col="PassengerId")
test_data = pd.read_csv("../input/titanic/test.csv", index_col="PassengerId")

## First step is to load up the data and describe the columns and check for differences.

In [None]:
print(train_data.info())
print(train_data.isna().sum())

In [None]:
print(test_data.info())
print(test_data.isna().sum())

In the training set we are missing 177 ages, 2 embarked and 687 cabins.
In the test set we're missing 86 ages, 1 fare and 327 cabins.

With a lot of missing cabins it might make sense to drop the column but we will keep it for now.
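To compare the gaps in the two sets side by side, a small helper along these lines can be handy. This is only a sketch: `missing_report` is a made-up name, and the two toy frames stand in for `train_data` and `test_data`.

```python
import numpy as np
import pandas as pd

def missing_report(train, test):
    """Count missing values per column in both frames, side by side."""
    report = pd.DataFrame({
        "train_missing": train.isna().sum(),
        "test_missing": test.isna().sum(),
    })
    # Columns absent from one frame (e.g. Survived in test) show up as NaN.
    # Keep only the columns with at least one missing value somewhere.
    return report[(report.fillna(0) > 0).any(axis=1)]

# Toy stand-ins for train_data / test_data
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0],
                          "Fare": [7.25, 8.05, 71.3],
                          "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"Age": [np.nan, np.nan], "Fare": [np.nan, 8.05]})

report = missing_report(toy_train, toy_test)
print(report)
```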

## Visualising the Data
I want to see how each of the variables is associated with survival. Let's go in order:

### 1) Pclass:

In [None]:
print(train_data["Pclass"].unique())
train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

There are three classes (1, 2 and 3) representing first, second or third class tickets on the boat.

It seems like the passenger's class has a strong association with survival, with the higher class passengers having a higher survival rate.
It makes sense to include this in the model, likely without much change.

### 2) Name:

In [None]:
print(train_data["Name"])

I can think of a few things we could do with the names.

Firstly, we could match up surnames to group families together.
I could imagine that whole families either survived or died together.

Secondly, we can get the titles of the names. As well as there being common ones such as
Mr and Miss, it seems like there are rare/unique ones such as Rev (reverend). If
someone is important enough to have their own title they might have been more likely to survive.

In [None]:
train_data.Name[1].split()

I'm still pretty new to Python so I'm not sure what the canonical way of doing this is, but using
a string split seems like the way to go.

After fiddling around with Google, I think I want to use the .assign method for a pandas dataframe.

If I split by comma, the first and second entry will give the family name and title respectively.

In [None]:
train_data = train_data.assign(fname = train_data.Name.str.split(",").str[0])
train_data["title"] = pd.Series([i.split(",")[1].split(".")[0].strip() for i in train_data.Name], index=train_data.index)

I think we can drop the Name column now as we won't need it.
We'll also need to repeat the above for the test set.
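Since the same split has to be applied to both sets, it could also be wrapped in one function. The sketch below is an alternative, not what this notebook runs: `split_name` is a hypothetical helper that uses `str.extract` with a regular expression, and the names are just examples written in the dataset's "Surname, Title. Given names" format.

```python
import pandas as pd

def split_name(names):
    """Split 'Surname, Title. Given names' into surname and title columns."""
    # fname: everything before the first comma.
    # title: the text between that comma and the next full stop.
    parts = names.str.extract(r"^(?P<fname>[^,]+),\s*(?P<title>[^.]+)\.")
    return parts.apply(lambda col: col.str.strip())

names = pd.Series(
    ["Braund, Mr. Owen Harris",
     "Heikkinen, Miss. Laina",
     "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"],
    index=[1, 2, 3],
)
parts = split_name(names)
print(parts)
```

Because `str.extract` returns a frame that keeps the index of the input, the result can be joined back onto either set without worrying about row alignment.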


(Edit: I originally didn't have the index=train_data.index and all of my pd.Series list comprehensions were coming up
one value short. The joys of 0 indexing vs 1 indexing!)
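The alignment problem described in that edit is easy to reproduce on a tiny frame. This is a toy illustration with made-up values: pandas assigns a Series to a column by matching index labels, not positions.

```python
import pandas as pd

# A frame indexed from 1, like the data here (indexed by PassengerId)
df = pd.DataFrame({"Name": ["a", "b", "c"]}, index=[1, 2, 3])
upper = [n.upper() for n in df["Name"]]

# Without an index the new Series is labelled 0, 1, 2. Assignment aligns on
# labels, so row 1 receives the second value and the last row is left as NaN.
df["shifted"] = pd.Series(upper)

# Passing the frame's own index keeps every value on its own row
df["aligned"] = pd.Series(upper, index=df.index)
print(df)
```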

In [None]:
test_data = test_data.assign(fname = test_data.Name.str.split(",").str[0])
test_data["title"] = pd.Series([i.split(",")[1].split(".")[0].strip() for i in test_data.Name], index=test_data.index)
train_data.drop("Name", axis=1, inplace=True)
test_data.drop("Name", axis=1, inplace=True)

Now to look at what we've made:

In [None]:
print(test_data.fname.nunique())
print(test_data.title.nunique())

In [None]:
ts = sns.countplot(x="title",data=train_data)
ts = plt.setp(ts.get_xticklabels(), rotation=45)
print(train_data["title"].unique())
print(test_data["title"].unique())
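# Everything except the common titles gets grouped as "Other".
# "Master" is kept out of this list so it can be mapped to its own (child) category below.
common_titles = ["Mr", "Miss", "Mme", "Mlle", "Mrs", "Ms", "Master"]
other_titles = [title
                for title in train_data["title"].unique()
                if title not in common_titles]
other_titles.append("Dona")  # only appears in the test set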

There are a lot of unique titles so I think it makes sense to group them.

#### Titles:
For now we will stick to categories representing male, female, child and other.
I'll then encode them as numerical.
I will use the pandas dataframe replace and map functions for this:

In [None]:
train_data["title"] = train_data['title'].replace(other_titles, 'Other')
train_data["title"] = train_data["title"].map({"Mr":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Master":2, "Other":3})
test_data["title"] = test_data['title'].replace(other_titles, 'Other')
test_data["title"] = test_data["title"].map({"Mr":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Master":2, "Other":3})

In [None]:
print(train_data.title)
print(test_data.title.isna().sum()) # No NaNs left

#### Family name:
As I was thinking about what to do with this, I realised my original idea was probably breaking some rules of data science.
I was going to encode families as "had family that survived". So if a passenger was a part of a family that had survivors,
mark them down. That way, in the test set, if people had shared family names from the training set, they might be more or
less likely to have survived.

So I think while this could actually improve our submission score, it is incorporating some "future" knowledge
into the model, by creating a variable that is dependent on knowing who survived after the fact.

I guess that leaves me a decision to make: am I trying to write a notebook that maximises my score
or am I making one to implement a model with actual predictive (i.e. at the time of sinking) power?

As I'm new to all of this and I want to learn techniques that are transferable, I will focus on only
using data that doesn't depend on knowing survival. With that said, I think I'll drop family name for now.

I am curious however, is there much of a discussion about this type of "hacking" features on Kaggle?
I can imagine we could come up with other examples (such as cabin) where we could potentially improve a submission
score by making a model with less predictive power. Anyways back to the analysis...

Update: After reading more on the topic, I don't think there is anything wrong with doing this, I might have just ended up confusing myself. The idea of "predicting" who survived could be seen as predicting who would survive if the Titanic were to sail again with the same (or similar) population. In that sense, picking a family name as a feature is no different to picking age as a feature. As such I'm going to update the model to include this and see if it makes a difference!


In [None]:
from sklearn.preprocessing import OneHotEncoder
oh = OneHotEncoder(handle_unknown="ignore", sparse = False)

survivors = [x for x in train_data[train_data["Survived"] == 1]["fname"]]
others = [x for x in train_data[train_data["Survived"] == 0]["fname"]]
train_data["fname"] = train_data['fname'].replace(survivors, 'survivor')
train_data["fname"] = train_data['fname'].replace(others, 'other')
train_data["fname"] = train_data["fname"].map({"survivor":1, "other":0})

other_test = [x for x in test_data["fname"] if x not in survivors]
test_data["fname"] = test_data['fname'].replace(survivors, 'survivor')
test_data["fname"] = test_data['fname'].replace(other_test, 'other_test')
test_data["fname"] = test_data["fname"].map({"survivor":1, "other_test":0})

After yet MORE reading I've decided that any model that includes family name will massively overfit the data. It has a near perfect correlation with survival in the training set which almost certainly doesn't hold for the test set. 

There may be something I could do to include family name but for now I will stick with my original decision to drop it.
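One option, sketched here as an assumption rather than something used in this notebook, is a "leave-one-out" version of the feature. Part of the near-perfect correlation comes from each passenger's own outcome being counted in their family's label, so the sketch only looks at the *other* passengers with the same surname. `family_survival_loo` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def family_survival_loo(df, key="fname", target="Survived"):
    """Survival rate of the other passengers sharing the same surname."""
    grouped = df.groupby(key)[target]
    others_survived = grouped.transform("sum") - df[target]
    others_count = grouped.transform("count") - 1
    # Passengers with no relatives in the data get NaN instead of a guess
    return others_survived / others_count.where(others_count > 0)

toy = pd.DataFrame({
    "fname": ["Smith", "Smith", "Smith", "Jones", "Jones", "Brown"],
    "Survived": [1, 1, 0, 0, 0, 1],
})
toy["family_survival"] = family_survival_loo(toy)
print(toy)
```

For the test set the rate would have to come from the training rows with the same surname, and a shared surname is of course not proof of a shared family, so this is a starting point at best.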

In [None]:
train_data.drop("fname", axis = 1, inplace = True)
test_data.drop("fname", axis = 1, inplace = True)

### 3) Sex:

In [None]:
print(train_data["Sex"].unique())
train_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We have two labels for Sex, with females having a much higher survival rate.
It makes sense to include Sex in the model. It is possible that we could use sex to create a new feature by combining
it with other features. For example, what about Sex and Pclass that we looked at earlier?

Across all males and females, females have a much higher survival rate. But what if wealthy males have a higher survival
than poor females? It might make sense to segment this out explicitly.

In [None]:
interactions = train_data.assign(sex_class = train_data['Sex'] + "_" + train_data['Pclass'].astype("str"))
interactions[['sex_class', 'Survived']].groupby(['sex_class'], as_index=False).mean().sort_values(by='Survived', ascending=False)

It certainly seems like this interaction feature adds something...
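The same comparison can also be laid out as a grid, which makes it easier to read across classes within each sex. A minimal sketch with `pivot_table`, using a few made-up rows in place of `train_data`:

```python
import pandas as pd

# Made-up rows standing in for train_data
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Rows are Sex, columns are Pclass, each cell is the mean of Survived
rates = toy.pivot_table(index="Sex", columns="Pclass",
                        values="Survived", aggfunc="mean")
print(rates)
```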

As I'm still new to this I don't yet know if the models will pick up this interaction without me
explicitly adding it as a feature. If I DO include this column, it will be pretty highly associated with
both sex and class so again I'm not sure if that is something that can interfere with modeling.

For now, my ignorance will let me add it to the dataset and deal with any issues that arise later on.

In [None]:
train_data = train_data.assign(sex_class = train_data['Sex'] + "_" + train_data['Pclass'].astype("str"))
test_data = test_data.assign(sex_class = test_data['Sex'] + "_" + test_data['Pclass'].astype("str"))

Something else that just stood out to me is that I'm not quite sure about how important encoding variables is.
I've read some places that many models need everything to be encoded as numbers.

This seems straightforward but the more I think about it, the more confused I get.
Take Pclass for example. This is encoded numerically and I'm pretty sure most models would
happily take it and not throw out any errors. But if it's left as is, it would be treated the same as
something like Age. While Pclass is ordinal, and having it encoded as 1, 2 and 3 doesn't seem too outrageous,
I have an uneasy feeling about encoding something with discrete levels the same as a continuous variable (like Age).

I don't know enough about machine learning to actually justify this feeling but just in case I will encode
anything discrete using dummy variables/one-hot encoding.
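To make the difference concrete, here is a small sketch of the two encodings on a toy Pclass column. With the integer version, a linear model has to assume the step from 1st to 2nd class has the same effect as the step from 2nd to 3rd; with one-hot columns each class gets its own coefficient. Tree-based models can split on the integers either way.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [1, 3, 2, 3]}, index=[1, 2, 3, 4])

# Ordinal: a single column that keeps the order 1 < 2 < 3
ordinal = toy[["Pclass"]]

# One-hot: one 0/1 column per class, no ordering or spacing implied
one_hot = pd.get_dummies(toy["Pclass"], prefix="Pclass").astype(int)
print(one_hot)
```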


In [None]:
#train_data = train_data.join(pd.get_dummies(train_data['Pclass'], prefix="Pclass"))
#test_data = test_data.join(pd.get_dummies(test_data['Pclass'], prefix="Pclass"))

While I'm at it, I'll encode Sex and the new sex_class feature as numeric using the map method.

In [None]:
train_data["Sex"] = train_data["Sex"].map({"female":0, "male":1})
test_data["Sex"] = test_data["Sex"].map({"female":0, "male":1})

In [None]:
train_data["sex_class"] = train_data["sex_class"].map({"female_1":0, "female_2":1, "female_3":2, "male_1":4, "male_2":5, "male_3":6})
test_data["sex_class"] = test_data["sex_class"].map({"female_1":0, "female_2":1, "female_3":2, "male_1":4, "male_2":5, "male_3":6})

### 4) Age

First things first, let's look at the distribution of age and see if there is any association with survival.

In [None]:
g = sns.FacetGrid(train_data, col='Survived')
g = g.map(sns.distplot, "Age")

While the distributions look similar-ish, we can see that a bigger chunk of the survivors
were very young (10 or under) and not many survivors were over 60.

As a novice, my only real experience with machine learning is fitting linear models. Linear models
do pretty poorly when the relationship between a feature and the outcome isn't roughly linear. They can really only pick up a general trend of
"being older is better" or "being younger is better". If a linear model were used here, it might pick up that being
younger is better for survival but it would miss the details. As this is the only reference point I have,
I'll try to transform the data as if I were going to fit a linear model to it.

But first, there are some missing values that need to be dealt with.
There are (at least) three ways we can deal with this, each one being slightly more effort.

Firstly, we can just drop the rows with missing data. While this might be tempting, dropping a row with around 14
other entries just because of one missing value doesn't sound like the brightest idea.

Secondly, we can replace the missing data with the average age (whether it's median/mode/mean) of the data set.
This would be a good first pass method and it would let us get the models up and running.

Thirdly, we can replace the missing data with the average from similar passengers. For example,
if we're missing the age of a 1st class passenger, who is female, who embarked from C etc. we could substitute in the
age of other passengers who fit that description.

As I want to get a bit more experience using Python/Pandas I'm going to go with option 3.
My rough idea is to rank all the passengers based on their similarity to the missing passenger.
I think the title and Pclass features will be good for this. I want to have a hierarchical approach.
Try to replace with someone of the same title and Pclass, then of sex and class, then mean of just sex.
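A rough sketch of that hierarchical fallback is below. It is not the approach the notebook ends up using, `fill_age_hierarchically` is a hypothetical helper, the rows are made up, and the median is used at every level as an assumption (the mean would work the same way).

```python
import numpy as np
import pandas as pd

def fill_age_hierarchically(df, levels):
    """Fill missing Age from progressively broader groups of similar passengers."""
    age = df["Age"].copy()
    for keys in levels:
        # Median age of each group, broadcast back to every row in that group
        group_median = df.groupby(keys)["Age"].transform("median")
        age = age.fillna(group_median)
    # Last resort: the overall median
    return age.fillna(df["Age"].median())

toy = pd.DataFrame({
    "title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Mrs"],
    "Pclass": [1, 1, 3, 3, 3, 2],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, np.nan, np.nan, 20.0, np.nan, np.nan],
})
levels = [["title", "Pclass"], ["Sex", "Pclass"], ["Sex"]]
toy["Age_filled"] = fill_age_hierarchically(toy, levels)
print(toy)
```

Each pass only touches the rows that are still missing, so the most specific group that has any known ages wins.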

Update: After reading around on the sklearn user guide, there is an experimental multivariate feature imputer that I am going to use to fill in all the missing data to see how it fares. I'll add this at the end as I need to drop the target feature (Survived) before I use the imputer (so it doesn't try to impute missing survived data in the test set). To see this, skip to section 11) Imputing Age.

### 5 & 6) SibSp and Parch:

As these both relate to family size it's probably best to tackle them together. Many of the notebooks I've
looked at combined these into a "family sizes" metric.

SibSp: The number of siblings or spouses aboard the Titanic.
Parch: The number of parents/children aboard the Titanic.

Maybe I'm a bit slow but this is taking a while to understand... an example might help.
So for a family of 4 (two parents and two kids), the parents would each have a value of 1 for SibSp
and the kids would also have a value of one (for their spouse and sibling respectively). The parents have 2 kids
on board and each kid has 2 parents on board. So in this case everyone would have a Parch value of 2.

To get the total size of the family from any one of the members you need 1 (from SibSp) + 2 (from Parch) + 1 (the passenger themself).
That makes sense, but just to be sure I'll write out another example with a different size.

A man, his wife and his three kids are on board. The family size is 5 so we should be able to get this from summing SibSp and Parch.

SibSp_Man: 1 (his wife) + Parch_Man: 3 (his kids) + Man: 1 = Total: 5

SibSp_Child_1: 2 (their siblings) + Parch_Child_1: 2 (both parents) + Child_1: 1 = Total: 5
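The same arithmetic can be checked in pandas. The rows below are made up to match the second example (a man, his wife and their three kids); the values are what SibSp and Parch would be if the columns are filled in as described above.

```python
import pandas as pd

# Made-up rows for the family of five in the example above
family = pd.DataFrame(
    {"SibSp": [1, 1, 2, 2, 2], "Parch": [3, 3, 2, 2, 2]},
    index=["man", "wife", "child_1", "child_2", "child_3"],
)

# Siblings/spouse + parents/children + the passenger themself
family["fsize"] = family["SibSp"] + family["Parch"] + 1
print(family)
```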


Okay this seems to work... although I can see a problem. If the man in the above example has a sibling on board,
I would intuitively say that the family size is now 6. But you could only get 6 by summing the man's SibSp and Parch values.
If you tried to sum the man's brother's columns, you would only end up with a family size of 2 or 3 (himself, the man and possibly his own wife).
I'm very likely reading into this too much but it's something to consider. By summing these two columns you end up in a
situation where two people can be in the same "family" but have different family sizes. I might come back to this
but for now I think I'll just sum the columns. First I'll check the association with survival:

In [None]:
train_data[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_data[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Both columns tell a similar story, that smaller families tended to survive more than larger families.

In [None]:
train_data["fsize"] = train_data["SibSp"] + train_data["Parch"] + 1
test_data["fsize"] = test_data["SibSp"] + test_data["Parch"] + 1

In [None]:
train_data[['fsize', 'Survived']].groupby(['fsize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

This looks okay. Small families (4 or fewer) survived better than people who were alone or in bigger families.
We can throw this in the model. On to the next one...

### 7) Ticket:

Let's take a look at what values tickets take on:

In [None]:
print(train_data.Ticket.nunique())
print(train_data.Ticket.tail())

They seem to be numbers, with some having letter prefixes. There are only 681 unique ones in the training dataset
and with no missing values, it means that some tickets have multiple people on them.
I'll do the same trick as with the family name and titles, use string split to separate prefixes.

In [None]:
train_data["ticket_prefix"] = pd.Series([len(i.split()) > 1 for i in train_data.Ticket], index=train_data.index)

In [None]:
train_data[['ticket_prefix', 'Survived']].groupby(['ticket_prefix'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Well that doesn't look promising. I would have thought the prefixes had some sort of importance.
There probably is something I could do with this but for now I'll just drop it.
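One candidate, given that some tickets are shared, is the number of passengers on each ticket. The sketch below is only an idea and is not used in the rest of this notebook; the ticket strings are examples in the dataset's format and the column names `ticket_group_size` and `prefix` are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Ticket": ["A/5 21171", "PC 17599", "113803", "113803", "PC 17599", "PC 17599"]},
    index=[1, 2, 3, 4, 5, 6],
)

# Number of passengers sharing each ticket, repeated on every row
toy["ticket_group_size"] = toy["Ticket"].map(toy["Ticket"].value_counts())

# Text before the final number, or "" when the ticket is just a number
toy["prefix"] = toy["Ticket"].str.rsplit(" ", n=1).map(
    lambda parts: parts[0] if len(parts) > 1 else ""
)
print(toy)
```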

In [None]:
train_data.drop("ticket_prefix", axis=1, inplace=True)
train_data.drop("Ticket", axis=1, inplace=True)
test_data.drop("Ticket", axis=1, inplace=True)

### 8) Fare

In [None]:
g = sns.FacetGrid(train_data, col='Survived')
g = g.map(sns.distplot, "Fare")

While the picture isn't super clear, you can see that survivors had more expensive fares and a wider spread of fare prices.
There is at least one outlier with a fare of >500 so I might drop him.

Apart from that the data is pretty skewed. I'll take a log transformation to reduce the skew and to
decrease the massive range in fares.

In [None]:
import numpy as np
train_data["Fare"] = train_data["Fare"].map(lambda i: np.log(i) if i > 0 else 0)
test_data["Fare"] = test_data["Fare"].map(lambda i: np.log(i) if i > 0 else 0)

g = sns.FacetGrid(train_data, col='Survived')
g = g.map(sns.distplot, "Fare")

It looks better to me. We can probably leave it as is for now.
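One side effect of the lambda used for the transform is worth knowing about: `NaN > 0` is False, so the missing fare in the test set quietly becomes 0 instead of staying missing. The sketch below compares it with `np.log1p`, which is a slightly different transform (log(1 + x) rather than log(x)) but handles zero fares and leaves NaN alone so it can be imputed later. The fares are made-up examples.

```python
import numpy as np
import pandas as pd

fares = pd.Series([0.0, 7.25, 512.3292, np.nan])

# The lambda from the cell above: a missing fare falls into the else branch
lambda_log = fares.map(lambda i: np.log(i) if i > 0 else 0)

# log1p computes log(1 + x): defined at 0, and NaN stays NaN
log1p_fare = np.log1p(fares)

print(pd.DataFrame({"fare": fares, "lambda_log": lambda_log, "log1p": log1p_fare}))
```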

### 9) Cabin:

From earlier we saw that many cabin entries were missing. We could probably do something to impute the data but I'll leave that for another iteration. For now I'm just going to drop it.

In [None]:
train_data.drop("Cabin", axis=1, inplace=True)
test_data.drop("Cabin", axis=1, inplace=True)

### 10) Embarked:

Not much to do here, there are a few missing values which we can fill in.

In [None]:
#train_data = train_data.join(pd.get_dummies(train_data['Embarked'], prefix="Embarked_"))
#test_data = test_data.join(pd.get_dummies(test_data['Embarked'], prefix="Embarked_"))
# Missing values are left as NaN by map, so they can be filled in by the imputer later
train_data["Embarked"] = train_data["Embarked"].map({"S":0, "Q":1, "C":2})
test_data["Embarked"] = test_data["Embarked"].map({"S":0, "Q":1, "C":2})

### 11) Imputing Age

Adding in the code from above to do the imputing of missing data. I am also then adding in a baby feature, which I could only do after filling in missing age. I think I want to train the imputer on both sets, but first I want to set and drop the Survived column. This way the imputer doesn't try to impute the survival values for the test set. Although I wonder what results that would give...

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PolynomialFeatures

train_y = train_data["Survived"]
train_data.drop("Survived", axis=1, inplace=True)

combined = pd.concat([train_data,test_data],keys=[0,1])

imp = IterativeImputer(max_iter = 20, random_state = 42)
combined = pd.DataFrame(imp.fit_transform(combined), index = combined.index, columns = combined.columns)

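# Take copies so that adding a column doesn't write into a slice of combined
train_data = combined.xs(0).copy()
test_data = combined.xs(1).copy()

# Flag babies (age 5 or under). A plain comparison avoids NaNs if an imputed age falls outside fixed bins.
train_data["baby"] = (train_data["Age"] <= 5).astype("int64")
test_data["baby"] = (test_data["Age"] <= 5).astype("int64")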


## Modelling

As it is my first notebook I want to just run something simple that I am comfortable with (logistic regression) as well as two models I'm just starting to learn about (SVM and RF). Features need to be scaled for SVM to work so I'll do that now. I also want to try and see if adding polynomial features will improve LR or SVM.
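As an aside, scikit-learn's `Pipeline` can chain these steps so the order (polynomial features, then scaling, then the model) is fixed in one object, and the scaler is re-fitted on the training part of each cross-validation fold. This is a sketch of that alternative on synthetic data, not the code used below; `make_classification` only stands in for the prepared Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the prepared training features and labels
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Polynomial terms first, scaling second, model last
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Each fold fits the whole pipeline on its own training split only
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```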



In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

train_scaled = pd.DataFrame(ss.fit_transform(train_data), index = train_data.index)
test_scaled = pd.DataFrame(ss.transform(test_data), index = test_data.index)

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)

# Note: combined was built before the baby column was added, so the polynomial set doesn't include it
combined = pd.DataFrame(poly.fit_transform(combined), index = combined.index)

train_data_poly = combined.xs(0)
test_data_poly = combined.xs(1)

ss = StandardScaler()

# Scale the polynomial features (not the already-scaled originals)
train_scaled_poly = pd.DataFrame(ss.fit_transform(train_data_poly), index = train_data_poly.index)
test_scaled_poly = pd.DataFrame(ss.transform(test_data_poly), index = test_data_poly.index)

### 1) Logistic Regression:

First I will split the dataframes up into the independent variables (usually denoted as the matrix X) and the dependent variable (the vector y). I'll do one last check to make sure I have no NAs.

In [None]:
print(train_data.isna().sum())
print(test_data.isna().sum())

I'm going to use a search function from the model_selection module in sklearn. It lets me supply a grid of possible values for the parameters; RandomizedSearchCV then tries a random sample of the combinations (n_iter of them) and stores the best result. As I don't want this "best result" to be overfitted, I'm going to set the cross-validation (cv) parameter to 5, so each candidate is scored with 5-fold cross-validation.
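For reference, the difference between trying every combination and sampling a fixed number of them can be seen with scikit-learn's `ParameterGrid` and `ParameterSampler`, which are the building blocks behind the grid and randomized searches. The parameter values here are arbitrary and only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {"penalty": ["l1", "l2"], "C": list(np.linspace(0.1, 10, 100))}

# Every combination of penalty and C
grid = ParameterGrid(params)

# A random subset of those combinations
sample = list(ParameterSampler(params, n_iter=20, random_state=42))

print(len(grid))    # 2 penalties * 100 values of C
print(len(sample))  # the 20 combinations that would actually be tried
```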

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

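# liblinear supports both the l1 and l2 penalties (the default lbfgs solver rejects l1)
lr_model = LogisticRegression(random_state=42, max_iter=1000, solver="liblinear")

test_params ={
    # No "none" option here: liblinear always penalises, but a large C is close to no penalty
    'penalty': ["l1", "l2"],
    # C must be strictly positive, so the range starts just above 0
    'C': [x for x in np.linspace(0.02, 10, 500)]
}

lr_gs = RandomizedSearchCV(lr_model, test_params, cv=5, n_jobs=4, n_iter =500)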

In [None]:
lr_gs.fit(train_data, train_y)

In [None]:
print(lr_gs.best_params_)
print(lr_gs.best_score_)

And now seeing if polynomial features make a difference...

In [None]:
lr_gs.fit(train_scaled_poly, train_y)

In [None]:
print(lr_gs.best_params_)
print(lr_gs.best_score_)

### 2) Random Forest Model:

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)

rf_params ={
    'bootstrap': [True, False],
    'max_depth': [int(x) for x in np.linspace(1, 50, 50)],
    # None (not the string "none") means "use every feature"; 'auto' meant the same as 'sqrt' for classifiers
    'max_features': ['sqrt', None],
    'min_samples_leaf': [int(x) for x in np.linspace(1, 8, 8)],
    'min_samples_split': [int(x) for x in np.linspace(2, 30, 30)],
    'n_estimators': [int(x) for x in np.linspace(2, 50, 50)]}

rf_gs = RandomizedSearchCV(rf_model, rf_params, cv=5, n_jobs=4, n_iter =500)

In [None]:
rf_gs.fit(train_data, train_y)

In [None]:
print(rf_gs.best_params_)
print(rf_gs.best_score_)

In [None]:
rf_gs.fit(train_scaled_poly, train_y)

In [None]:
print(rf_gs.best_params_)
print(rf_gs.best_score_)

### 3) Support Vector Machine:
Going to use the scaled features for this one.

In [None]:
from sklearn.svm import SVC

svc_model = SVC(random_state=42, max_iter=1000, probability = True)

test_params ={
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "degree": [2],
    # C must be strictly positive, so the range starts just above 0
    "C": [x for x in np.linspace(0.02, 10, 500)]
}

svc_gs = RandomizedSearchCV(svc_model, test_params, cv=5, n_jobs=4, n_iter =500)

In [None]:
svc_gs.fit(train_scaled, train_y)

In [None]:
print(svc_gs.best_params_)
print(svc_gs.best_score_)

And again checking poly + scaled features, making sure to scale AFTER you add polynomial features:

In [None]:
svc_gs.fit(train_scaled_poly, train_y)

In [None]:
print(svc_gs.best_params_)
print(svc_gs.best_score_)

List out all the best scores. Try a voting classifier to merge the three models' votes. Support vector machines don't output a probability by default, so if I want to use soft voting (an average of prediction probabilities) I need to go back and enable it.
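To show what soft voting changes compared with a simple majority vote, here is a toy calculation in NumPy. The probabilities are made up, and this only illustrates the arithmetic, not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Made-up P(survived) from three models for four passengers
probs = np.array([
    [0.90, 0.40, 0.20, 0.95],  # logistic regression
    [0.80, 0.60, 0.10, 0.45],  # random forest
    [0.70, 0.30, 0.40, 0.45],  # SVM
])

# Hard voting: each model casts a 0/1 vote and the majority wins
hard = (probs >= 0.5).sum(axis=0) >= 2

# Soft voting: average the probabilities, then apply the threshold
soft = probs.mean(axis=0) >= 0.5

print(hard.astype(int), soft.astype(int))
```

The last passenger is the interesting one: two models are just under 0.5 and one is very confident, so the majority vote says 0 while the averaged probability says 1.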

In [None]:
print(lr_gs.best_score_)
print(rf_gs.best_score_)
print(svc_gs.best_score_)

In [None]:
from sklearn.ensemble import VotingClassifier

vc = VotingClassifier(estimators = [
    ("lr", lr_gs.best_estimator_),
    ("rf", rf_gs.best_estimator_),
    ("svc", svc_gs.best_estimator_)
], voting = "soft")

Since I have to use scaled data for the SVM, and the polynomial features worked best for LR and SVM, I'll use the scaled_poly set.

In [None]:
vc.fit(train_scaled_poly, train_y)

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

In [None]:
np.mean(evaluate_model(vc, train_scaled_poly, train_y))

In [None]:
preds = vc.predict(test_scaled_poly)

In [None]:
output = pd.DataFrame({'PassengerId': test_data.index,
                       'Survived': preds})

output.to_csv('submission.csv', index=False)

Thanks for your attention :) 