# Titanic - Machine Learning Competition
The objective of this notebook is to create and train a machine learning model that predicts whether a passenger on the Titanic survived. Training data is available in "train.csv", which the model uses to learn relationships between passenger variables and survival. The trained model is then run on "test.csv" to predict whether each of those passengers survived, and the predictions are written to a .csv file that is submitted to the competition.

I will also be using both linear and logistic regression to explore the differences in regression models.

In [None]:
import numpy as np
import pandas as pd

# Input data files are available in the read-only "../input/" directory
# For example, running this will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")
test_ids = test["PassengerId"]

data.head(5)

After reading in the files, note that NaN values are present in the data, so the data must be cleaned before any machine learning model can be trained. In addition, for the simple process I aim to conduct, I will not need several of the columns provided.
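
One way to see how many values are missing in each column is to count them with `isna().sum()`. The sketch below uses a tiny made-up frame (`sample`, with illustrative values) rather than the real "train.csv", so the counts it prints are not the competition's actual counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of train.csv (values are made up)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Running the same call on the real `data` frame would show which columns need attention before training.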

In [None]:
#create function to clean train/test data
def clean(data):
    data = data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1) #remove unneeded columns

    cols = ["SibSp", "Parch", "Fare", "Age"] #numeric columns that may contain null values

    for col in cols:
        #assign the result back instead of using inplace=True on a column,
        #which newer pandas versions warn about and may not apply to the frame
        data[col] = data[col].fillna(data[col].mean()) #replace null vals with avg of all vals

    data["Embarked"] = data["Embarked"].fillna("U") #replace null embarked port with U for unknown
    return data

data = clean(data)
test = clean(test)

In [None]:
data.head(5)

Note that previously null numeric values are now the average of all other values present in the given column, and a missing "Embarked" value is now "U". This is a lazy way to correct the problem. More creative approaches would use the number of siblings/spouses, class, ticket, or titles in names (Mr, Miss, Mrs, Dr) to predict missing age values.
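
As a rough illustration of the title-based idea, the sketch below fills each missing age with the median age of passengers who share the same title. It is not what this notebook does: the names and ages in `sample` are made up, and the regular expression assumes names follow the "Surname, Title. Given names" pattern. It also needs the "Name" column, which `clean` drops, so it would have to run before that step.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same "Surname, Title. Given names" format (made-up data)
sample = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Jones, Miss. Anna",
        "Brown, Mr. Peter",
        "Clark, Mr. Henry",
        "Evans, Miss. Ruth",
    ],
    "Age": [22.0, 26.0, 35.0, np.nan, np.nan],
})

# The title is the text between the comma and the next period
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Median age per title, aligned row by row with the original frame
title_median = sample.groupby("Title")["Age"].transform("median")

# Fill only the missing ages; known ages are left unchanged
sample["Age"] = sample["Age"].fillna(title_median)
print(sample)
```

A title whose ages are all missing would still yield NaN here, so a fallback such as the overall median would be needed on real data.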

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

cols = ["Sex", "Embarked"] #for columns with non-numerical values

for col in cols:
    data[col] = le.fit_transform(data[col])
    test[col] = le.transform(test[col])
    print(le.classes_)
    
data.head(5)

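To create numerical values for columns with non-numerical values, such as "Sex" and "Embarked", scikit-learn offers a label encoder that stores all distinct values of a column in a sorted array and replaces each value with its index in that array. The encoder is fitted on the training data and then reused on the test data, so both sets share the same mapping.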
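
A minimal sketch of that mapping on a hand-written list (not the competition data) shows the stored array and the resulting indices:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit_transform learns the distinct values, sorts them, and returns each value's index
codes = le.fit_transform(["male", "female", "female", "male"])

print(le.classes_)  # ['female' 'male']
print(codes)        # [1 0 0 1]

# inverse_transform maps the indices back to the original labels
restored = le.inverse_transform(codes)
print(restored)
```

Because the indices follow sorted order, "female" maps to 0 and "male" to 1 regardless of the order in which they appear.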

In [None]:
from sklearn.linear_model import LogisticRegression #imports logistic regression model
from sklearn.model_selection import train_test_split

y = data["Survived"] #survived parameter we are predicting for
x = data.drop("Survived", axis=1) #all other parameters

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = 0.2, random_state = 42)
#x_train/y_train: 80% of the training rows (features and survived values), used to fit the model
#x_val/y_val: the held-out 20%, used to compare predictions against known survived values

In [None]:
clf = LogisticRegression(random_state = 0, max_iter = 1000).fit(x_train, y_train)
#creates logistic regression model

In [None]:
from sklearn.metrics import accuracy_score

predictions = clf.predict(x_val)
#predictions holds the predicted y (survived) values for the validation x (all other) values
accuracy_score(y_val, predictions)

The score above is the accuracy on the validation set: about 81% of the predictions match the actual survived values of the held-out passengers.
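
Accuracy is simply the fraction of predictions that equal the true label. The short sketch below, using made-up labels rather than this notebook's results, checks that a manual calculation agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Manual accuracy: share of positions where prediction equals truth
manual = (y_true == y_pred).mean()

# Same quantity computed by scikit-learn
score = accuracy_score(y_true, y_pred)

print(manual, score)  # 0.8 0.8
```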

In [None]:
submission_pred = clf.predict(test) #holds predicted survived values

In [None]:
df = pd.DataFrame({"PassengerId": test_ids.values,
                   "Survived": submission_pred}) #pairs each PassengerId with its predicted survived value

In [None]:
df.to_csv("log_reg_submission.csv", index=False) #creates .csv file for submission