Below, I load the training and test data and take a first look at them.

In [1]:
##data processing
#load libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#load in training data
train = pd.read_csv("/kaggle/input/titanic/train.csv")
#load in test data
test = pd.read_csv("/kaggle/input/titanic/test.csv")
#combine training and test data into a single DataFrame
combined_data = pd.concat([train, test]).reset_index(drop=True)

Look at the type of each variable.

In [2]:
combined_data.dtypes


Create boxplots of the numeric variables.

In [3]:
import seaborn as sns
sns.boxplot(x= combined_data['Age'])


In [4]:
sns.boxplot(x= combined_data['Fare'])

In [5]:
sns.boxplot(x= combined_data['SibSp'])

In [6]:
sns.boxplot(x= combined_data['Parch'])

Get summary statistics for the numeric columns. The count row also hints at which columns have missing values.

In [7]:
combined_data.describe()


Look at how many null values there are in each column.

In [8]:
combined_data.isnull().sum()

Age and Cabin have a lot of missing values; Cabin is mostly missing. Survived is missing for every test row, which is expected since the test file has no labels.
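
Raw counts are easier to judge as a share of rows. A minimal sketch on a toy frame (the values are made up and only stand in for combined_data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for combined_data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same one-liner should work on combined_data directly.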

Embarked has only a few missing values, so replace them with the most common value, S.

In [9]:
combined_data['Embarked'].value_counts()

In [10]:
combined_data['Embarked'] =combined_data['Embarked'].fillna('S')
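
Instead of hard-coding 'S', the most common value could be looked up with mode(). A small sketch on a made-up series (not the real Embarked column):

```python
import numpy as np
import pandas as pd

# Made-up port codes with two missing entries
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common)
print(filled.isnull().sum())
```

This keeps the fill value in sync with the data if the file ever changes.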

Create a variable for title, extracted from the passenger name. Family size and the alone flag are created further below.

In [11]:

#create variable for title
common_titles = ['Mr', 'Mrs', 'Miss', 'Master']
combined_data['Temp'] = combined_data['Name'].str.split(', ', expand=True)[1]
combined_data['Title'] = combined_data['Temp'].str.split('.', expand = True)[0]
combined_data['Title'] = combined_data['Title'].replace('Mlle', 'Miss')
combined_data['Title'] = combined_data['Title'].replace('Ms', 'Miss')
combined_data['Title'] = combined_data['Title'].replace('Mme', 'Mrs')
combined_data.loc[~combined_data['Title'].isin(common_titles), "Title"] = "Other"
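
The same title logic could also be written with one regular expression and no temporary column. This is only a sketch of an alternative; the names below are invented but follow the dataset's "Last, Title. First" pattern:

```python
import pandas as pd

# Invented names in the "Last, Title. First" format
names = pd.Series([
    "Smith, Mr. John",
    "Doe, Mrs. Jane (Mary Roe)",
    "Brown, Miss. Anna",
    "Martin, Mlle. Claire",
    "Garcia, Don. Luis",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fold variants into the common titles, then bucket everything else as "Other"
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
common_titles = ["Mr", "Mrs", "Miss", "Master"]
titles = titles.where(titles.isin(common_titles), "Other")
print(titles.tolist())
```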


In [12]:
combined_data.isnull().sum()

In [13]:
combined_data['Title'].value_counts()

Check whether Cabin being missing is related to survival.

In [14]:
combined_data['CabinMissing'] = 0
combined_data.loc[combined_data['Cabin'].isnull(), 'CabinMissing'] = 1
combined_data.groupby(['CabinMissing'], as_index = False)['Survived'].mean()

Whether or not Cabin is missing seems to be related to survival.

Create a variable for family size from siblings/spouses (SibSp) and parents/children (Parch).

In [15]:
#create variable for family size
combined_data["FamilySize"] = combined_data["SibSp"] + combined_data["Parch"]
combined_data[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='FamilySize', ascending=False)


Survival rates seem to vary meaningfully across family sizes of 0, 1-3, and 4-10. However, there are very few observations in the 4-10 group, so its low survival rate may be noise. Instead, just split into family sizes of zero and non-zero.

In [16]:
combined_data['Alone'] = 0
combined_data.loc[combined_data['FamilySize'] == 0,'Alone'] = 1
combined_data[['Alone', 'Survived']].groupby(['Alone'], as_index=False).mean().sort_values(by='Alone', ascending=False)
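
For comparison, the three-way grouping discussed above could be built with pd.cut. A sketch on made-up family sizes; the bin edges and labels are my own choice for illustration:

```python
import pandas as pd

# Made-up family sizes
family_size = pd.Series([0, 1, 3, 4, 10, 0])

# Bins are right-inclusive: (-1, 0], (0, 3], (3, 10]
groups = pd.cut(family_size, bins=[-1, 0, 3, 10], labels=["0", "1-3", "4-10"])

# The binary flag used in this notebook
alone = (family_size == 0).astype(int)
print(groups.tolist())
print(alone.tolist())
```

The binary flag is the coarser of the two, which is the point: it avoids leaning on the sparsely populated 4-10 group.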

Split into training and test datasets

In [17]:
train_data = combined_data.iloc[:len(train)].copy()
test_data = combined_data.iloc[len(train):].copy()

Fill zero and missing fare values with the median by class. Use medians computed on the training data to fill the test data, to avoid data leakage.

In [18]:
#impute zero and missing values for training data
train_data['Fare'] = train_data['Fare'].replace(0, np.nan)
#transform keeps the original index, so the group medians line up row by row
train_data['Fare'] = train_data['Fare'].fillna(train_data.groupby('Pclass')['Fare'].transform('median'))
#impute zero and missing values for test dataset
test_data['Fare'] = test_data['Fare'].replace(0, np.nan)
group_averages = train_data.groupby(['Pclass'])['Fare'].median().rename("Grouped_Fare")
test_data = test_data.reset_index().merge(group_averages, how = "left", on = ["Pclass"]).set_index('index')
test_data.index.name = None
test_data['Fare'] = test_data['Fare'].fillna(test_data["Grouped_Fare"])
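
The train-then-apply pattern above can be checked on a toy example. This sketch uses map instead of merge, which is just an alternative way to look up the group median; all values are made up:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data and test_data
train_toy = pd.DataFrame({"Pclass": [1, 1, 1, 3, 3, 3],
                          "Fare": [80.0, 0.0, 60.0, 8.0, np.nan, 10.0]})
test_toy = pd.DataFrame({"Pclass": [1, 3, 3],
                         "Fare": [np.nan, 0.0, 7.5]})

# Treat zero fares as missing in both sets
train_toy["Fare"] = train_toy["Fare"].replace(0, np.nan)
test_toy["Fare"] = test_toy["Fare"].replace(0, np.nan)

# Medians come from the training rows only
class_medians = train_toy.groupby("Pclass")["Fare"].median()

# Look up each row's class median and use it where Fare is missing
train_toy["Fare"] = train_toy["Fare"].fillna(train_toy["Pclass"].map(class_medians))
test_toy["Fare"] = test_toy["Fare"].fillna(test_toy["Pclass"].map(class_medians))
print(train_toy["Fare"].tolist())
print(test_toy["Fare"].tolist())
```

The test rows never contribute to class_medians, which is what keeps the imputation free of leakage.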

Look if age is correlated with Pclass and sex

In [19]:
train_data.groupby(['Pclass'])['Age'].mean()


In [20]:
train_data.groupby(['Sex', 'Pclass'])['Age'].mean()


Look at distribution of age for pclass and sex groups.

In [21]:
import matplotlib.pyplot as plt
%matplotlib inline
grid = sns.FacetGrid(train_data, row='Pclass', col='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

Fill missing values for age with the mean of the Pclass and Sex group. Use the mean since age looks roughly normally distributed within each group.

In [22]:
#impute missing values for training data (group mean, matching the test imputation below)
train_data['Age'] = train_data['Age'].fillna(train_data.groupby(['Pclass', 'Sex'])['Age'].transform('mean'))
#impute missing values for test dataset
group_age_averages = train_data.groupby(['Pclass', 'Sex'])['Age'].mean().rename("Grouped_Age")
test_data = test_data.reset_index().merge(group_age_averages, how = "left", on = ['Pclass', 'Sex']).set_index('index')
test_data.index.name = None
test_data['Age'] = test_data['Age'].fillna(test_data["Grouped_Age"])

Select relevant features for modeling

In [23]:
features= ["Pclass", "Sex","Age", "Fare", "Title", "Embarked", "CabinMissing","Alone", "Survived"]
train_data = train_data[features]
features.remove("Survived")
test_data = test_data[features]

Make sure that there are no remaining missing values.

In [24]:
print(train_data.isnull().sum())
print ("\n")
print(test_data[features].isnull().sum())


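Create a correlation matrix of the numeric features

In [25]:
#restrict to numeric columns; string-valued features such as Sex cannot be correlated directly
corrMatrix = train_data[features].select_dtypes(include='number').corr()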
sns.heatmap(corrMatrix, annot=True)


Create dummy variables for the categorical features, for use in the models below.

In [26]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

#one-hot encode the categorical features
X = pd.get_dummies(train_data[features])
#reindex so the test columns match the training columns even if a category is absent
X_test = pd.get_dummies(test_data[features]).reindex(columns=X.columns, fill_value=0)
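
Encoding the training and test frames separately can produce different columns when a category shows up in only one of them. A minimal sketch of lining them up with reindex, on made-up data where the test set has no 'C' or 'Q':

```python
import pandas as pd

# Made-up frames: the test set is missing two of the three port codes
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Age": [22.0, 38.0, 26.0]})
test_toy = pd.DataFrame({"Embarked": ["S", "S"], "Age": [30.0, 41.0]})

X_toy = pd.get_dummies(train_toy)

# Force the test matrix onto the training columns; absent dummies become 0
X_test_toy = pd.get_dummies(test_toy).reindex(columns=X_toy.columns, fill_value=0)
print(list(X_test_toy.columns))
```

Without the reindex, the test matrix here would have only two columns and a fitted model would reject it.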

Run logistic regression to understand the effects of the different variables.

In [27]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
predictions = logreg.predict(X_test).astype(int)
#look at accuracy on the training data
log_score = round(logreg.score(X, y) * 100, 2)
log_score

Look at coefficients from logistic regression model

In [28]:
#pair every feature with its fitted coefficient (log-odds scale), keeping the two aligned
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': logreg.coef_[0]})
coefficients.sort_values(by='Coefficient', ascending=False)
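
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A sketch with hypothetical coefficient values (not the ones fitted above):

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only
coef = pd.Series({"Sex_female": 1.2, "Pclass": -0.9, "Alone": 0.0})

# exp(coefficient) is the multiplicative change in the odds per unit increase
odds_ratio = np.exp(coef)
print(odds_ratio.sort_values(ascending=False))
```

A ratio above 1 means the feature raises the odds of survival, below 1 lowers them, and exactly 1 means no effect.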

Run grid search to tune hyperparameters (the search itself is commented out in the cell below).

In [29]:
#function to print best score and best parameters from hyperparameter tuning
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))
from sklearn.model_selection import GridSearchCV
#run grid search
#rf = RandomForestClassifier(random_state=123)

#param_grid = {'max_depth': [5, 10, 50],
#              'criterion': ['gini', 'entropy'],
#              'n_estimators': [100, 200, 500, 1000],
#              'min_samples_split': [2, 4, 6]}

#grid_search_model = GridSearchCV(rf, param_grid = param_grid, cv = 5, verbose = 0, n_jobs = -1)
#best_rf_model= grid_search_model.fit(X, y)
#clf_performance(best_rf_model,'Random Forest')
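
To show the shape of the search without the long runtime, here is a scaled-down sketch. It runs on synthetic data from make_classification rather than the Titanic features, and the small grid is my own choice, not the one above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for X, y
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately tiny grid so the search finishes quickly
param_grid = {"max_depth": [3, 5], "n_estimators": [20, 50]}

search = GridSearchCV(RandomForestClassifier(random_state=123),
                      param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best combination
print(search.best_params_)
print(search.best_score_)
```

The fitted search object exposes best_score_ and best_params_, which is what clf_performance prints.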

Run the random forest model with the best parameters from the grid search.

In [30]:
##model

model = RandomForestClassifier(random_state = 1, max_depth = 5, n_estimators = 100)
model.fit(X, y)
predictions = model.predict(X_test).astype(int)

Look at the random forest's accuracy on the training data.

In [31]:
training_score = round(model.score(X,y) * 100, 2)
training_score

Run cross-validation for the random forest model.

In [32]:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(model,X, y,cv=5)
print(cv)
print(cv.mean())


Create output file and submit

In [33]:
##submission
output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")