# Introduction
Hello Kagglers and everyone else! This notebook documents my working process for the Titanic competition. It is also my first time learning Machine Learning and demonstrating what I've learned, using Python as the language (I previously used R).

**Goal**
The goal of this project is to do an Exploratory Data Analysis (EDA), apply Machine Learning, and submit the prediction.

**New skills learnt and applied here**
* Machine Learning
* Using Python as the primary language
* Using Jupyter Notebook

# Exploratory Data Analysis
We start off by importing libraries and understanding the context of the data.

In [None]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Loading the dataset
data = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
test_ids = test["PassengerId"]

# View the Titanic data
data

*Since I'm using a notebook, I can just type in "data" to display the data without typing "print()".*

I'll need to drop the fields that are unrelated to survival or have an insignificant impact.

**PassengerId** field

PassengerId is a unique identifier for each row in the table, like an index. It carries no information about survival, so it is irrelevant for analysis and making predictions.

**Name** field

Although names may have some impact on survivability in this incident, it is still safe to say that they don't have a huge impact on surviving. English names might have more effect, but I can't be sure unless I find the correlation between names, how people viewed names at that time, social class, and survival rate.

**Ticket** field

The same goes for tickets. It is safe to say that nobody was checking tickets during the disaster to decide who could board a lifeboat, so I can exclude this field from the analysis.

**Cabin** field

It is possible that a cabin's position affected the passenger's chance of survival, but the field has too many NaN values to be reliable.
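
A quick way to check how sparse a column is would be to compute the share of missing values per column. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, only the column names mirror the Titanic data); on the real data the same call would be `data.isna().mean()`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column where most values are missing, like `Cabin` in this toy frame, is a reasonable candidate for dropping.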


In [4]:
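def clean(data):
    # Drop the identifier-like and sparse columns discussed above
    data = data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

    # Fill missing numeric values with each column's median
    cols = ["SibSp", "Parch", "Fare", "Age"]
    for col in cols:
        data[col] = data[col].fillna(data[col].median())

    # Mark missing ports of embarkation as "U" (unknown)
    data["Embarked"] = data["Embarked"].fillna("U")
    return data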

data = clean(data)
test = clean(test)

In [5]:
# View the first five rows of the cleaned data
data.head(5)

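| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |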


The following notes are taken from the [Titanic Competitions Page](https://www.kaggle.com/competitions/titanic/data?select=train.csv).

**Variable Notes**

***pclass:*** A proxy for socio-economic status (SES)

* 1st = Upper
* 2nd = Middle
* 3rd = Lower

***age:*** Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

***sibsp:*** The dataset defines family relations in this way...
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife, (mistresses and fiancés were ignored)

***parch:*** The dataset defines family relations in this way...
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them

# Prepare Data for Machine Learning

I'll be using the **preprocessing** module from the **sklearn** library to prepare the data for use in a machine learning algorithm. The module provides tools for scaling or normalizing data and for transforming it into a form more suitable for the algorithm. Missing values were already handled in the cleaning step, so here I only use its **LabelEncoder** to turn the text columns into numbers.
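
This notebook does not scale its features, but as a minimal sketch of what scaling with `sklearn.preprocessing` looks like, the example below standardizes two numeric columns. The values are made up for illustration and only stand in for columns such as Age and Fare.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two numeric columns (e.g. Age and Fare)
values = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# After scaling, each column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0), scaled.std(axis=0))
```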

In [7]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

cols = ["Sex", "Embarked"]

for col in cols:
    data[col] = le.fit_transform(data[col])
    test[col] = le.transform(test[col])
    print(le.classes_)
    
data.head(5)

['female' 'male']
['C' 'Q' 'S' 'U']


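| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 2 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 2 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 2 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 2 |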


**The labels are automatically mapped to:**

* Female = 0
* Male = 1
* C = 0
* Q = 1
* S = 2
* U = 3
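
The mapping above can also be read directly from a fitted encoder instead of being written out by hand. This is a small sketch on a made-up list of labels (the names `encoder` and `mapping` are mine, not from the notebook):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["S", "C", "Q", "U", "S"])

# classes_ is sorted, and each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```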

There are multiple algorithms available for machine learning, but I'll be using **Logistic Regression** because it passes its raw output through the logistic function, which maps it to a value between 0 and 1 that can be interpreted as a probability of survival.
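
To make the logistic function concrete, here is a minimal sketch with made-up raw scores. The helper name `sigmoid` and the 0.5 cut-off are my own illustration of the idea, not sklearn's internal code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw model outputs (a weighted sum of the features)
scores = np.array([-2.0, 0.5, 3.0])
probabilities = sigmoid(scores)

# Turn probabilities into class labels with a 0.5 cut-off
labels = (probabilities >= 0.5).astype(int)
print(probabilities, labels)
```

Large negative scores end up close to 0 and large positive scores close to 1, which is why the output can be read as a probability.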

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

y = data["Survived"]
X = data.drop("Survived", axis=1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
model = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)

In [10]:
from sklearn.metrics import accuracy_score

predictions = model.predict(X_val)
accuracy_score(y_val, predictions)

0.8100558659217877
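
The accuracy score is simply the share of predictions that match the true labels. One way to judge whether a score is good is to compare it with a baseline that always predicts the most common class. The sketch below shows both on made-up labels (these are not the Titanic validation data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration (not the Titanic validation split)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1])

# Accuracy is the share of predictions that match the true labels
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the most common class
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = (y_true == majority_class).mean()

print(manual_accuracy, sklearn_accuracy, baseline_accuracy)
```

If the model's accuracy is clearly above the baseline, it has learned something beyond the class balance.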

In [11]:
submission_preds = model.predict(test)

In [12]:
df = pd.DataFrame({"PassengerId":test_ids.values,
                   "Survived":submission_preds,
                  })

In [13]:
df.to_csv("submission.csv", index=False)

# Conclusion
I've submitted my predictions, and the submission received a score of **0.76315** (about **76%**), which is lower than the validation score of **0.81** (**81%**). I'm unsure how good the result is, but considering I dropped a few columns, I'd say it is quite good.

The result could probably be higher if I used a more complex model and did not drop any data, but since this is my first time using Python and Machine Learning, it is a satisfactory result.