## Preamble
I'd appreciate any feedback you have to give. I'm still not sure which models really need numeric data with a roughly Gaussian distribution, or whether it's okay to leave ordinal data encoded as integers, but here's my first attempt at a notebook. Enjoy! Upvote if possible!

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None 
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
train_data = pd.read_csv("../input/titanic/train.csv", index_col="PassengerId")
test_data = pd.read_csv("../input/titanic/test.csv", index_col="PassengerId")

## First step is to load up the data and describe the columns and check for differences.

In [None]:
print(train_data.info())
print(train_data.isna().sum())

In [None]:
print(test_data.info())
print(test_data.isna().sum())

In the training set we are missing 177 ages, 2 embarked and 687 cabins.
In the test set we're missing 86 ages, 1 fare and 327 cabins.

With a lot of missing cabins it might make sense to drop the column but we will keep it for now.
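
To compare the two sets more easily, here is a minimal sketch that puts the missing-value counts side by side. The tiny frames and the `missing_summary` helper are made up for illustration; in the notebook you would pass `train_data` and `test_data` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data and test_data (hypothetical values).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0],
                          "Cabin": [np.nan, "C85", np.nan],
                          "Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0],
                         "Cabin": [np.nan, np.nan],
                         "Fare": [np.nan, 7.0]})

def missing_summary(train, test):
    """Return one row per column with the number of missing values in each set."""
    summary = pd.DataFrame({"train": train.isna().sum(), "test": test.isna().sum()})
    # Keep only the columns that are missing something in at least one set.
    return summary[(summary > 0).any(axis=1)]

summary = missing_summary(toy_train, toy_test)
print(summary)
```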

## Visualising the Data
I want to see how each of the variables is associated with survival. Let's go in order:

### 1) Pclass:

In [None]:
print(train_data["Pclass"].unique())
train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

There are three classes (1, 2 and 3) representing first, second or third class tickets on the boat.

It seems like the passenger's class has a strong association with survival, with the higher class passengers having a higher survival rate.
It makes sense to include this in the model, likely without much change.

### 2) Name:

In [None]:
print(train_data["Name"])

I can think of a few things we could do with the names.

Firstly, we could match up surnames to group families together.
I could imagine that whole families either survived or died together.

Secondly, we can get the titles of the names. As well as there being common ones such as
Mr and Miss, it seems like there are rare/unique ones such as Rev (reverend). If
someone is important enough to have their own title they might have been more likely to survive.

In [None]:
train_data.Name[1].split()

I'm still pretty new to Python so I'm not sure what the canonical way of doing this is, but using
a string split seems like the way to go.

After fiddling around with google, I think I want to use the .assign method for a pandas dataframe.

If I split by comma, the first and second entry will give the family name and title respectively.

In [None]:
train_data = train_data.assign(fname = train_data.Name.str.split(",").str[0])
train_data["title"] = pd.Series([i.split(",")[1].split(".")[0].strip() for i in train_data.Name], index=train_data.index)

I think we can drop the Name column now as we won't need it.
We'll also need to repeat the above for the test set.


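(Edit: I originally didn't have the index=train_data.index and all of my pd.Series list comprehensions were coming up
one value short. A pd.Series built from a plain list gets a 0-based index, while PassengerId starts at 1, so the values were being matched to the wrong rows. The joys of 0 indexing vs 1 indexing!)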
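
Here is a small sketch of that index problem on made-up names, plus a vectorised alternative using `str.extract`. The three-row frame is hypothetical; the regular expression assumes every name looks like "Surname, Title. Rest", which I believe holds for this dataset but have not checked row by row.

```python
import pandas as pd

# Hypothetical mini version of the Name column, indexed from 1 like PassengerId.
df = pd.DataFrame(
    {"Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)
titles = [n.split(",")[1].split(".")[0].strip() for n in df["Name"]]

# Without an explicit index the Series is labelled 0, 1, 2 and gets aligned
# by label, so every title shifts by one row and the last row ends up NaN.
misaligned = df.assign(title=pd.Series(titles))
aligned = df.assign(title=pd.Series(titles, index=df.index))

# Vectorised alternative: grab whatever sits between the comma and the first full stop.
vectorised = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(misaligned["title"].tolist())
print(aligned["title"].tolist())
```

The vectorised version keeps the original index automatically, so the alignment issue never comes up.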

In [None]:
test_data = test_data.assign(fname = test_data.Name.str.split(",").str[0])
test_data["title"] = pd.Series([i.split(",")[1].split(".")[0].strip() for i in test_data.Name], index=test_data.index)
train_data.drop("Name", axis=1, inplace=True)
test_data.drop("Name", axis=1, inplace=True)

Now to look at what we've made:

In [None]:
print(test_data.fname.nunique())
print(test_data.title.nunique())

In [None]:
ts = sns.countplot(x="title",data=train_data)
ts = plt.setp(ts.get_xticklabels(), rotation=90)
print(train_data["title"].unique())
print(test_data["title"].unique())
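# Everything that is not a common adult title or "Master" gets grouped as "Other".
# "Master" has to be excluded here, otherwise it is replaced before it can be mapped to 2 below.
common_titles = ["Mr", "Miss", "Mme", "Mlle", "Mrs", "Ms", "Master"]
other_titles = [title
                for title in train_data["title"].unique()
                if title not in common_titles]
other_titles.append("Dona")  # only appears in the test set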

There are a lot of unique titles so I think it makes sense to group them.

#### Titles:
For now we will stick to categories representing male, female, child and other.
I'll then encode them as numerical.
I will use the pandas dataframe replace and map functions for this:

In [None]:
train_data["title"] = train_data['title'].replace(other_titles, 'Other')
train_data["title"] = train_data["title"].map({"Mr":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Master":2, "Other":3})
test_data["title"] = test_data['title'].replace(other_titles, 'Other')
test_data["title"] = test_data["title"].map({"Mr":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Master":2, "Other":3})

In [None]:
print(train_data.title)
print(test_data.title.isna().sum()) # No NaNs left
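
A slightly more defensive variant, sketched on made-up titles: `map` returns NaN for anything that is not in the dictionary, so a `fillna` afterwards can send every unknown title to "Other" without having to list the rare titles by hand. The names `title_codes` and `encoded` are only for this illustration.

```python
import pandas as pd

# Hypothetical titles, including two that are not in the mapping.
titles = pd.Series(["Mr", "Mrs", "Master", "Rev", "Dona"])
title_codes = {"Mr": 0, "Miss": 1, "Ms": 1, "Mme": 1, "Mlle": 1, "Mrs": 1, "Master": 2}

# map() gives NaN for unknown keys; fillna() turns those into the "Other" code (3).
encoded = titles.map(title_codes).fillna(3).astype(int)
print(encoded.tolist())
```

This also covers titles that only show up in the test set, which is the reason "Dona" had to be added manually above.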

In [None]:
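from sklearn.preprocessing import OneHotEncoder
# Note: `sparse_output` needs scikit-learn >= 1.2; older versions call this argument `sparse`.
oh = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# One-hot encode family name and title, keeping string column names so that
# the dataframe does not end up with a mix of string and integer column labels.
train_oh = pd.DataFrame(oh.fit_transform(train_data[["fname", "title"]]),
                        columns=oh.get_feature_names_out(), index=train_data.index)
test_oh = pd.DataFrame(oh.transform(test_data[["fname", "title"]]),
                       columns=oh.get_feature_names_out(), index=test_data.index)
train_data = train_data.join(train_oh)
test_data = test_data.join(test_oh)
train_data.drop("fname", axis = 1, inplace = True)
test_data.drop("fname", axis = 1, inplace = True)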

### 3) Sex:

In [None]:
print(train_data["Sex"].unique())
train_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We have two labels for Sex, with females having a much higher survival rate.
It makes sense to include Sex in the model. It is possible that we could use sex to create a new feature by combining
it with other features. For example, what about Sex and Pclass that we looked at earlier?

Across all males and females, females have a much higher survival rate. But what if wealthy males have a higher survival
rate than poor females? It might make sense to segment this out explicitly.

In [None]:
interactions = train_data.assign(sex_class = train_data['Sex'] + "_" + train_data['Pclass'].astype("str"))
interactions[['sex_class', 'Survived']].groupby(['sex_class'], as_index=False).mean().sort_values(by='Survived', ascending=False)

It certainly seems like this interaction feature adds something...

As I'm still new to this I don't yet know if the models will pick up this interaction without me
explicitly adding it as a feature. If I DO include this column, it will be pretty highly associated with
both sex and class so again I'm not sure if that is something that can interfere with modelling.

For now, my ignorance will let me add it to the dataset and deal with any issues that arise later on.
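
On the question of whether a model finds an interaction by itself, here is a toy sketch (made-up data, not the Titanic set) where the target depends only on the combination of two binary features. As far as I understand it, a linear model such as logistic regression needs the interaction added as an explicit column, while a decision tree can find it by splitting twice. The feature names `a` and `b` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data: the target is 1 only when exactly one of the two features is 1.
counts = [60, 50, 50, 40]
a = np.repeat([0, 0, 1, 1], counts)
b = np.repeat([0, 1, 0, 1], counts)
y = a ^ b

X_plain = np.column_stack([a, b])
X_inter = np.column_stack([a, b, a * b])  # same features plus the explicit interaction

# Training accuracy only, to show what each model is able to represent.
linear_plain = LogisticRegression(C=100, max_iter=1000).fit(X_plain, y).score(X_plain, y)
linear_inter = LogisticRegression(C=100, max_iter=1000).fit(X_inter, y).score(X_inter, y)
tree_plain = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_plain, y).score(X_plain, y)

print(linear_plain, linear_inter, tree_plain)
```

So adding `sex_class` is most likely to matter for the logistic regression and less so for the tree-based models, though that is a guess until it is checked on the real data.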

In [None]:
train_data = train_data.assign(sex_class = train_data['Sex'] + "_" + train_data['Pclass'].astype("str"))
test_data = test_data.assign(sex_class = test_data['Sex'] + "_" + test_data['Pclass'].astype("str"))

Something else that just stood out to me is that I'm not quite sure about how important encoding variables is.
I've read some places that many models need everything to be encoded as numbers.

This seems straightforward but the more I think about it, the more confused I get.
Take Pclass for example. This is encoded numerically and I'm pretty sure most models would
happily take it and not throw out any errors. But if it's left as is, it would be treated the same as
something like Age. While Pclass is ordinal, and having it encoded as 1, 2 and 3 doesn't seem too outrageous,
I have an uneasy feeling about encoding something with discrete levels the same as a continuous variable (like Age).

I don't know enough about machine learning to actually justify this feeling but just in case I will encode
anything discrete using dummy variables/one-hot encoding.


In [None]:
train_data = train_data.join(pd.get_dummies(train_data['Pclass'], prefix="Pclass"))
test_data = test_data.join(pd.get_dummies(test_data['Pclass'], prefix="Pclass"))
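
One thing to watch with calling `get_dummies` separately on each set: if a level is absent from one of them, the two dataframes end up with different columns. All three classes appear in both Titanic files, so it does not bite here, but a minimal sketch of guarding against it with `reindex` looks like this (toy frames, hypothetical values):

```python
import pandas as pd

toy_train = pd.DataFrame({"Pclass": [1, 2, 3, 3]})
toy_test = pd.DataFrame({"Pclass": [1, 3]})  # class 2 never appears here

train_dummies = pd.get_dummies(toy_train["Pclass"], prefix="Pclass")
test_dummies = pd.get_dummies(toy_test["Pclass"], prefix="Pclass")

# Force the test columns to match the training columns;
# levels missing from the test set become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_dummies.columns))
```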

While I'm at it, I'll encode Sex and the new sex_class column as numeric using the map method.

In [None]:
train_data["Sex"] = train_data["Sex"].map({"female":0, "male":1})
test_data["Sex"] = test_data["Sex"].map({"female":0, "male":1})

In [None]:
# Consecutive integer codes, one per sex/class combination.
train_data["sex_class"] = train_data["sex_class"].map({"female_1":0, "female_2":1, "female_3":2, "male_1":3, "male_2":4, "male_3":5})
test_data["sex_class"] = test_data["sex_class"].map({"female_1":0, "female_2":1, "female_3":2, "male_1":3, "male_2":4, "male_3":5})

### 4) Age

First things first, let's look at the distribution of age and see if there is any association with survival.

In [None]:
g = sns.FacetGrid(train_data, col='Survived')
g = g.map(sns.distplot, "Age")

First, there are some missing values that need to be dealt with.
There are (at least) three ways we can deal with this, each one being slightly more effort.

1) We can just drop the rows with missing data. While this might be tempting, dropping a row with around 14 other entries just because of one missing value doesn't sound like the brightest idea.

2) We can replace the missing data with the average age (whether it's median/mode/mean) of the data set. This would be a good first pass method and it would let us get the models up and running.

3) We can replace the missing data with the average from similar passengers. For example, if we're missing the age of a 1st class passenger, who is female, who embarked from C etc. we could substitute in the age of other passengers who fit that description.
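
Option 3 can also be sketched without a loop, using `groupby(...).transform`. This is only an illustration on a made-up frame: the grouping columns (`title`, `Pclass`) mirror the ones used in this notebook, and falling back to the overall mean for groups with no known ages is my own assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: the last group (title 2, class 2) has no known age at all.
df = pd.DataFrame({
    "title":  [0, 0, 0, 1, 1, 2],
    "Pclass": [3, 3, 3, 1, 1, 2],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Mean age of each (title, Pclass) group, repeated on every row of that group.
group_mean = df.groupby(["title", "Pclass"])["Age"].transform("mean")

# Use the group mean first, then fall back to the overall mean where the group had no ages.
filled = df["Age"].fillna(group_mean).fillna(df["Age"].mean())
print(filled.tolist())
```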

In [None]:
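def find_similar_passengers(pid, dataset):
    # pid is a PassengerId (an index label), so look it up with .loc rather than .iloc.
    # Start with passengers who share the same title and class.
    subset = dataset[(dataset["title"] == dataset.loc[pid, "title"]) &
                     (dataset["Pclass"] == dataset.loc[pid, "Pclass"])]

    # Fall back to broader groups if nobody in the subset has a known age.
    # (The mean of an all-missing column is NaN, which has to be tested with pd.isna.)
    if pd.isna(subset["Age"].mean()):
        subset = dataset[dataset["sex_class"] == dataset.loc[pid, "sex_class"]]

    if pd.isna(subset["Age"].mean()):
        subset = dataset[dataset["Sex"] == dataset.loc[pid, "Sex"]]

    age = subset["Age"].mean()
    return age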

In [None]:
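# Fill each missing age with the mean age of similar passengers.
# Assigning through .loc avoids chained assignment, which can silently fail to write.
no_ages = train_data[train_data["Age"].isna()].index
for pid in no_ages:
    train_data.loc[pid, "Age"] = find_similar_passengers(pid, train_data)

no_ages_test = test_data[test_data["Age"].isna()].index
for pid2 in no_ages_test:
    test_data.loc[pid2, "Age"] = find_similar_passengers(pid2, test_data)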

Now that the missing data is filled in, we can start to reorganise the Age column to make it easier for a model to "see" what we want it to, namely that children have a much higher survival rate and the elderly have a much lower one. I think segmenting them into groups of <5, 5-65 and >65 might be a good first pass.

After yet MORE googling, pandas has a .cut function to replace a range of values with new labels.

In [None]:
train_data["age_group"] =  pd.cut(train_data["Age"], bins=[0,5,65,100], labels=[0,1,2]).astype("int64")
test_data["age_group"] = pd.cut(test_data["Age"], bins=[0,5,65,100], labels=[0,1,2]).astype("int64")
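
A quick sketch of how `pd.cut` treats the bin edges, on made-up ages. By default each bin includes its right edge and excludes its left edge, so an age of exactly 5 lands in the first group and an age of exactly 0 would fall outside every bin. As far as I can tell, a missing group would also make the `.astype("int64")` above fail.

```python
import pandas as pd

ages = pd.Series([0.0, 0.42, 5.0, 5.1, 65.0, 66.0])
bins = [0, 5, 65, 100]

# Default behaviour: intervals are (0, 5], (5, 65], (65, 100].
default_groups = pd.cut(ages, bins=bins, labels=[0, 1, 2])
# include_lowest=True also puts the lowest edge (0) into the first bin.
inclusive_groups = pd.cut(ages, bins=bins, labels=[0, 1, 2], include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```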

### 5 & 6) SibSp and Parch:

As these both relate to family size it's probably best to tackle them together.

SibSp: The number of siblings or spouses aboard the Titanic.

Parch: The number of parents/children aboard the Titanic.

In [None]:
train_data[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_data[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Both tables tell a similar story: smaller families tended to survive more often than larger families.

In [None]:
train_data["fsize"] = train_data["SibSp"] + train_data["Parch"] + 1
test_data["fsize"] = test_data["SibSp"] + test_data["Parch"] + 1

In [None]:
train_data[['fsize', 'Survived']].groupby(['fsize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

This looks okay. Small families (4 or fewer) survived better than people who were alone or in bigger families, so we can throw this in the model.
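
Because the relationship is not monotonic (alone is worse, small is better, large is worse again), one option is to turn the family size into categories. This is only a sketch on made-up sizes, and the cut points (1, 2 to 4, 5 and up) are my reading of the table above rather than anything tuned.

```python
import pandas as pd

fsize = pd.Series([1, 2, 4, 5, 11])

# Hypothetical cut points: 1 = alone, 2-4 = small family, 5+ = large family.
family_type = pd.cut(fsize, bins=[0, 1, 4, 20], labels=["alone", "small", "large"])
print(family_type.tolist())
```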

### 7) Ticket:

Let's take a look at what values tickets take on:

In [None]:
print(train_data.Ticket.nunique())
print(train_data.Ticket.tail())

They seem to be numbers, with some having letter prefixes. There are only 681 unique ones in the training dataset
and, with no missing values, that means some tickets have multiple people on them.
I'll use the same trick as with the family names and titles: a string split, this time to flag the tickets that have a prefix.

In [None]:
train_data["ticket_prefix"] = pd.Series([len(i.split()) > 1 for i in train_data.Ticket], index=train_data.index)

In [None]:
train_data[['ticket_prefix', 'Survived']].groupby(['ticket_prefix'], as_index=False).mean().sort_values(by='Survived', ascending=False)


In [None]:
train_data.drop("ticket_prefix", axis=1, inplace=True)
train_data.drop("Ticket", axis=1, inplace=True)
test_data.drop("Ticket", axis=1, inplace=True)
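
Before dropping the column entirely, two simple features could be pulled out of it. The sketch below uses made-up ticket strings; `has_prefix` mirrors the flag computed above and `group_size` counts how many passengers share a ticket. Whether either one helps the models is untested.

```python
import pandas as pd

tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "113803", "STON/O2. 3101282"])

# True when the ticket has something in front of the number.
has_prefix = tickets.str.split().str.len() > 1

# Number of passengers travelling on the same ticket.
group_size = tickets.map(tickets.value_counts())

print(has_prefix.tolist())
print(group_size.tolist())
```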

### 8) Fare

In [None]:
g = sns.FacetGrid(train_data, col='Survived')
g = g.map(sns.distplot, "Fare")

While the picture isn't super clear, you can see that survivors had more expensive fares and a wider spread of fare prices.
There is at least one outlier with a fare of >500, which could be worth dropping, although for now I'm leaving it in.

Apart from that, the data is pretty skewed. I'll take a log transformation to reduce the skew and to shrink the massive range in fares.

In [None]:
import numpy as np
train_data["Fare"] = train_data["Fare"].map(lambda i: np.log(i) if i > 0 else 0)
test_data["Fare"] = test_data["Fare"].map(lambda i: np.log(i) if i > 0 else 0)

g = sns.FacetGrid(train_data, col='Survived')
g = g.map(sns.distplot, "Fare")
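
An alternative to the `if i > 0 else 0` special case is `np.log1p`, which computes log(1 + x) and therefore maps a fare of 0 to 0 on its own. It also leaves missing values as NaN, whereas the lambda above quietly turns the one missing test fare into 0 (because `NaN > 0` is False). A minimal sketch on made-up fares:

```python
import numpy as np
import pandas as pd

fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

log_fares = np.log1p(fares)  # log(1 + x), so a fare of 0 stays at 0

# Skewness before and after the transformation.
raw_skew = fares.skew()
log_skew = log_fares.skew()
print(round(raw_skew, 2), round(log_skew, 2))
```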

### 9) Cabin:

From earlier we saw that many cabin entries were missing. We could probably do something to impute the data, but for now I'll drop it.

In [None]:
train_data.drop("Cabin", axis=1, inplace=True)
test_data.drop("Cabin", axis=1, inplace=True)

### 10) Embarked:

Not much to do here, there are a few missing values which we can fill in.

In [None]:
train_data["Embarked"] = train_data["Embarked"].fillna("S")
train_data[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
print(train_data.Embarked.isna().sum())
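
Rather than hard-coding "S", the fill value can be computed from the training data and reused, and the same idea covers the single missing fare in the test set. A minimal sketch on made-up frames; using the training median for the fare is my own choice, not something this notebook does.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan],
                          "Fare": [7.25, 71.28, 8.05, 13.0]})
toy_test = pd.DataFrame({"Embarked": ["Q", "S"],
                         "Fare": [np.nan, 26.0]})

# Learn the fill values from the training set only.
most_common_port = toy_train["Embarked"].mode()[0]
median_fare = toy_train["Fare"].median()

toy_train["Embarked"] = toy_train["Embarked"].fillna(most_common_port)
toy_test["Fare"] = toy_test["Fare"].fillna(median_fare)
print(most_common_port, median_fare)
```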

In [None]:
train_data = train_data.join(pd.get_dummies(train_data['Embarked'], prefix="Embarked_"))
test_data = test_data.join(pd.get_dummies(test_data['Embarked'], prefix="Embarked_"))
#train_data["Embarked"] = train_data["Embarked"].map({"S":0, "Q":1, "C":2})
#test_data["Embarked"] = test_data["Embarked"].map({"S":0, "Q":1, "C":2})

In [None]:
train_data.drop("Embarked", axis=1, inplace=True)
test_data.drop("Embarked", axis=1, inplace=True)

## Modelling

I am going to try my hand at a few different models, starting with a very simple linear classifier (logistic regression) and gradually moving up in complexity.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

train_y = train_data["Survived"]
train_data.drop("Survived", axis=1, inplace=True)

scoring_method = "f1"

train_scaled = ss.fit_transform(train_data)
test_scaled = ss.transform(test_data)

### 1) Logistic Regression:

The dataframes are now split into the independent variables (usually denoted as the matrix X) and the dependent variable (the vector y). I'll do one last check to make sure I have no NAs.

In [None]:
print(train_data.isna().sum())
print(test_data.isna().sum())

I'm going to use GridSearchCV from the model_selection module in sklearn. This lets me supply a grid of possible values for the parameters and it will test all possible combinations, storing the best result. As I don't want this "best result" to be overfitted, I'm going to set the cross-validation (cv) parameter so that every combination is scored with k-fold cross-validation (5 folds for this model and the SVM, 8 folds for the random forest and LightGBM).

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
model = LogisticRegression(random_state=10, max_iter = 1000)
logit_params = {
    "C": [1, 3, 10, 20, 30, 40],
    "solver": ["lbfgs", "liblinear"]
    
}
logit_gs = GridSearchCV(model, logit_params, scoring="f1", cv = 5, n_jobs=4)

In [None]:
logit_gs.fit(train_data, train_y)

In [None]:
print(logit_gs.best_params_)
print(logit_gs.best_score_)
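
If the logistic regression should see scaled features, one option is to put the scaler and the model in a `Pipeline` and grid-search the pipeline, so that the scaler is refitted inside every cross-validation fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features; the step names `scale` and `clf` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (not the real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=10)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search.best_estimator_` then scales new data by itself, so there is no separate scaled copy of the dataset to keep track of.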

### 2) Random Forest:

Next I'm going to try to implement a random forest model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()

rf_params ={
    'bootstrap': [True, False],
    'max_depth': [10, None],
    'max_features': ['sqrt', 'log2'],  # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [5, 10, 15, 20, 25, 30]}

rf_gs = GridSearchCV(rf_model, rf_params, scoring=scoring_method, cv=8, n_jobs=4)

In [None]:
rf_gs.fit(train_data, train_y)

In [None]:
print(rf_gs.best_params_)
print(rf_gs.best_score_)

### 3) SVM

In [None]:
from sklearn.svm import SVC
svc_model = SVC()

test_parameters = {
    "C": [1, 3, 10, 30, 100],
    "kernel": ["linear", "poly", "rbf" , "sigmoid"],
}
svc_gs = GridSearchCV(svc_model, test_parameters, scoring="f1", cv=5, n_jobs=4)

In [None]:
svc_gs.fit(train_scaled, train_y)

In [None]:
print(svc_gs.best_params_)
print(svc_gs.best_score_)

### 4) Light Gradient Boosting:

In [None]:
from lightgbm import LGBMClassifier
lgb_model = LGBMClassifier()
test_parameters = {
    "n_estimators": [int(x) for x in np.linspace(5, 30, 6)],
    "reg_alpha": [0, 0.75, 1, 1.25],
    "learning_rate": [0.5, 0.4, 0.35, 0.3, 0.25, 0.2],
    "subsample": [0.5, 0.75, 1]
}
lgb_gs = GridSearchCV(lgb_model, test_parameters, scoring=scoring_method, cv=8, n_jobs=4)

In [None]:
lgb_gs.fit(train_data, train_y)

In [None]:
print(lgb_gs.best_params_)
print(lgb_gs.best_score_)

## Comparing models

I do eventually want to include a short bit on comparing the models. From what I've read, these different models are probably placing different levels of importance on different features/variables. I think the key to a good ensembler/voter is to have models that have different predictions. Of course if all the models predict the same thing, there isn't much point in having them "vote". In the next iteration I will add models that pick out features differently (maybe some linear models, gradient based models and a neural network). I'll see how their predictions compare and pick a few good but slightly different models to ensemble.
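
A first, rough way to check how different the models are is to measure how often each pair agrees. The predictions below are invented for illustration; in the notebook they would come from each fitted model's `predict` on the same rows.

```python
import numpy as np

# Hypothetical 0/1 predictions from three models on the same eight passengers.
predictions = {
    "rf":  np.array([0, 1, 1, 0, 0, 1, 0, 1]),
    "svc": np.array([0, 1, 0, 0, 0, 1, 0, 1]),
    "lgb": np.array([0, 1, 1, 0, 1, 1, 0, 0]),
}

# Fraction of passengers on which each pair of models makes the same prediction.
names = list(predictions)
agreement = {
    (a, b): float(np.mean(predictions[a] == predictions[b]))
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, share in agreement.items():
    print(pair, share)
```

Pairs with agreement close to 1.0 add little to a vote, while pairs that disagree more often (and are each reasonably accurate) are the interesting ones to combine.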

### Ensembling/Voting
Now that I have a few models, I'm going to use a voting classifier to combine the four models above into an overall prediction. One thing to keep in mind: VotingClassifier refits every estimator on the data passed to fit, so the SVM, whose parameters were tuned on the scaled dataset, is refitted here on the unscaled one.

In [None]:
from sklearn.ensemble import VotingClassifier

ensemble_model = VotingClassifier(estimators=[
    ("logit", logit_gs.best_estimator_),
    ("rf", rf_gs.best_estimator_),
    ("svc", svc_gs.best_estimator_),
    ("lgb", lgb_gs.best_estimator_),
], voting = "hard")
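
To make "hard" voting concrete, here is a hand-rolled sketch on invented predictions. With four voters a 2-2 tie is possible; as far as I understand the scikit-learn documentation, ties go to the class that comes first in sorted order, which is what `argmax` over the vote counts does here.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per model, one column per passenger.
votes = np.array([
    [1, 0, 1, 0],   # logit
    [1, 0, 0, 1],   # rf
    [1, 1, 0, 0],   # svc
    [0, 0, 1, 1],   # lgb
])

# Count the votes per passenger; argmax returns the first maximum,
# so a 2-2 tie is resolved in favour of class 0.
majority = np.apply_along_axis(
    lambda col: np.argmax(np.bincount(col, minlength=2)), axis=0, arr=votes
)
print(majority.tolist())
```

Using an odd number of models, or `voting="soft"` with models that output probabilities, would avoid the ties altogether.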

In [None]:
ensemble_model.fit(train_data, train_y)

In [None]:
ensemble_model.score(train_data, train_y)

In [None]:
preds = ensemble_model.predict(test_data)

In [None]:
output = pd.DataFrame({'PassengerId': test_data.index,
                       'Survived': preds})

output.to_csv('submission.csv', index=False)