# Titanic Kaggle Competition

This notebook uses the Titanic Kaggle competition to practice machine learning. The goal is to predict whether a passenger survived, based on their personal characteristics.

This notebook contains:

- Data cleaning
- Label encoding
- A logistic regression model

The code in this notebook follows a YouTube tutorial by Aladin Persson. 
The video can be found at this link:

https://www.youtube.com/watch?v=pUSi5xexT4Q&t=860s

## Load Dataset

Load the competition's training and test files into a pandas DataFrame each.

In [1]:
import pandas as pd

train_file = pd.read_csv('train.csv')
test_file = pd.read_csv('test.csv')

train_file.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Clean the data

Adjust the dataframes so that they contain no null values.

Some columns hold information that is difficult to turn into useful features (such as 'Name' and 'Ticket') or contain too many null values to help a model (such as 'Cabin'). 
It is simpler to drop these columns than to attempt to fill the null values.

Other columns in the dataframes contain only a few null values. 
For the features containing numeric data, the column median fills the null values. 
For the categorical 'Embarked' feature, the placeholder value 'U' (unknown) fills the null values.
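
As a minimal sketch of the two fill strategies, the snippet below applies them to a small made-up frame (the values are illustrative, not taken from the competition files). It assumes the same choices as the `clean` function in the next cell: the median for a numeric column and the placeholder 'U' for 'Embarked'.

```python
import pandas as pd

# Toy frame standing in for the Titanic data; the values are invented.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count the missing values per column before cleaning.
nulls_before = toy.isna().sum()

# Numeric column: fill with the median of the observed values (26.0 here).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with a placeholder category.
toy["Embarked"] = toy["Embarked"].fillna("U")

# After both fills, no column should report any missing values.
nulls_after = toy.isna().sum()
print(nulls_before, nulls_after, sep="\n")
```

Running `isna().sum()` before and after is a quick way to confirm that the cleaning step removed every null.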

In [2]:
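def clean(data):
    # Drop columns that are hard to use directly or are mostly empty.
    data = data.drop(['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)

    # Fill missing numeric values with the column median.
    cols = ['SibSp', 'Parch', 'Fare', 'Age']
    for col in cols:
        data[col] = data[col].fillna(data[col].median())

    # Fill missing ports of embarkation with the placeholder 'U' (unknown).
    # Assigning the result avoids the chained in-place fill, which newer pandas versions warn about.
    data['Embarked'] = data['Embarked'].fillna('U')

    return data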

train = clean(train_file)
test = clean(test_file)

train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


## Encode non-numeric data

There are two columns in this cleaned dataframe that contain categorical data. 
Since scikit-learn's logistic regression requires numeric input, these features need to be encoded as numbers.

The 'Sex' column contains "female" and "male" which can be encoded as 0 and 1.

The 'Embarked' column contains "C", "Q", "S", and "U" which can be encoded as 0, 1, 2, and 3.
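
The mapping above follows from how `LabelEncoder` works: it sorts the distinct labels and numbers them from 0. A small sketch on a made-up list of ports (not the competition data) shows this:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit on a made-up sample; the classes are stored in sorted order.
codes = le.fit_transform(["S", "C", "Q", "U", "S"])

# Build an explicit label -> code lookup from the fitted encoder.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the integer codes.
decoded = le.inverse_transform(codes)

print(mapping)
```

One consequence of fitting on the training data and only transforming the test data is that a label absent from the training data makes `transform` raise a `ValueError`.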

In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

cols = ['Sex','Embarked']

for col in cols:
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])
    print(le.classes_)
    
train.head()

['female' 'male']
['C' 'Q' 'S' 'U']


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2


## Split the train file into train and test

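The labelled training data is split into two parts: a training set used to fit the model and a held-out test set used to evaluate its accuracy afterwards. The held-out set consists of 20% of the rows. It is separate from the competition's unlabelled test file, which is only used for the final submission.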
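
To illustrate what `test_size` and `random_state` do, here is a rough sketch on ten made-up rows (the lists `X_toy` and `y_toy` are hypothetical stand-ins for the features and labels):

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples with alternating labels.
X_toy = [[i] for i in range(10)]
y_toy = [0, 1] * 5

# test_size=0.2 holds out 20% of the rows (2 of 10 here).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the accuracy figure reproducible between runs, since the same rows end up in the held-out set each time.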

In [4]:
from sklearn.model_selection import train_test_split

y = train['Survived']
X = train[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Fit the model to the data

Use the training data to fit a Logistic Regression model.

In [5]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
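
Conceptually, a fitted binary logistic regression computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and predicts the positive class when that probability is at least 0.5. The sketch below illustrates this idea in plain Python. The weights, bias, and the helper `predict_survival` are hypothetical and are not the values learned by `clf`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_survival(features, weights, bias, threshold=0.5):
    """Return (probability, predicted label) for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    prob = sigmoid(score)
    return prob, int(prob >= threshold)

# Hypothetical weights for two features (Sex, Pclass); invented for illustration.
weights = [-2.5, -1.0]
bias = 3.0

# Female (Sex=0) in first class: score = 3.0 - 1.0 = 2.0
prob_f, label_f = predict_survival([0, 1], weights, bias)

# Male (Sex=1) in third class: score = 3.0 - 2.5 - 3.0 = -2.5
prob_m, label_m = predict_survival([1, 3], weights, bias)

print(prob_f, label_f, prob_m, label_m)
```

With scikit-learn, the corresponding probabilities are available from `clf.predict_proba`, and the learned weights from `clf.coef_` and `clf.intercept_`.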

In [6]:
from sklearn.metrics import accuracy_score

preds = clf.predict(X_test)

# accuracy_score expects the true labels first, then the predictions.
accuracy = accuracy_score(y_test, preds)

accuracy

0.8100558659217877
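
Accuracy is simply the fraction of predictions that match the true labels. A minimal pure-Python sketch of the same calculation, using a hypothetical helper named `accuracy_fraction` and made-up labels:

```python
def accuracy_fraction(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels: four of the five predictions are correct.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]

toy_accuracy = accuracy_fraction(toy_true, toy_pred)
print(toy_accuracy)
```

Assuming the standard 891-row `train.csv`, the 20% held-out set has 179 rows, so a score of about 0.81 corresponds to 145 correct predictions.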

## Make competition submission

Use the trained model to make predictions on the testing dataset. 
Export these predictions as a .csv file for submission.

In [9]:
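pred_test = clf.predict(test)

# Kaggle expects exactly two columns: PassengerId and Survived.
# PassengerId was dropped in clean(), so take it from the original test file.
df_pred_test = pd.DataFrame({'PassengerId': test_file['PassengerId'], 'Survived': pred_test})

df_pred_test.to_csv('2023.5.15-YouTube_Tutorial_Predictions.csv', index=False)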