# P2: Titanic Classification

### Loading the Titanic dataset

In [1]:
import pandas as pd
data = pd.read_csv("titanic.csv")

In [2]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
features = ["Survived", "Pclass", "Age", "Fare", "SibSp", "Parch"]
print(data.shape)  # shape before filtering; `features` is a plain list and has no .shape
filtered_data = data[features].dropna()
print(filtered_data.shape)
filtered_data.head()

Unnamed: 0,Survived,Pclass,Age,Fare,SibSp,Parch
0,0,3,22.0,7.25,1,0
1,1,1,38.0,71.2833,1,0
2,1,3,26.0,7.925,0,0
3,1,1,35.0,53.1,1,0
4,0,3,35.0,8.05,0,0


In [4]:
features = filtered_data.drop("Survived", axis=1)
labels = filtered_data["Survived"]
features.head()

Unnamed: 0,Pclass,Age,Fare,SibSp,Parch
0,3,22.0,7.25,1,0
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,0,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,0


### Binary classification

As a first step, we split the data into training and test sets:

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)
print("Training data points: " + str(len(X_train)))
print("Test data points: " + str(len(X_test)))

Training data points: 571
Test data points: 143
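
The two counts follow from `test_size=0.2` applied to the 714 rows that remain after `dropna()`. A minimal sketch of the arithmetic, assuming scikit-learn rounds the test share up to the next whole row (an assumption that matches the counts printed above):

```python
import math

# Rows left after dropping missing values: 571 training + 143 test rows.
n_samples = 571 + 143
test_size = 0.2

# Assumption: the test share is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because 20 percent of 714 is 142.8, the test set cannot contain exactly one fifth of the rows, which is why the split is 571 to 143.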


Now we can train the logistic regression model.

In [6]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(max_iter=1000) # at most 1000 solver iterations (default solver is lbfgs, not plain gradient descent)
logisticRegr.fit(X_train, y_train)

LogisticRegression(max_iter=1000)
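
To illustrate what the fitted model does at prediction time: logistic regression forms a weighted sum of the features plus an intercept, maps it to a probability with the sigmoid function, and predicts class 1 when that probability exceeds 0.5. The sketch below is plain Python. The weights and the intercept are made up for illustration and are not the coefficients learned above.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters in the column order Pclass, Age, Fare, SibSp, Parch.
# These are NOT the values fitted by LogisticRegression above.
weights = [-1.0, -0.03, 0.005, -0.3, 0.2]
intercept = 3.0

def predict_proba(row):
    """Estimated probability of survival (class 1) for one feature row."""
    score = intercept + sum(w * x for w, x in zip(weights, row))
    return sigmoid(score)

def predict(row, threshold=0.5):
    """Predict 1 if the estimated probability exceeds the threshold, else 0."""
    return int(predict_proba(row) > threshold)

# First row of the feature table: Pclass=3, Age=22, Fare=7.25, SibSp=1, Parch=0.
passenger = [3, 22.0, 7.25, 1, 0]
print(predict_proba(passenger), predict(passenger))
```

On a fitted scikit-learn model, the corresponding learned values are available as `logisticRegr.coef_` and `logisticRegr.intercept_`.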

### Evaluating the regression model

In [7]:
predictions = logisticRegr.predict(X_test)

In the next step, we compare these predictions with the true labels and compute the accuracy:

In [8]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

0.6993006993006993

The accuracy of this simple model is about 70 percent. Let's look at precision and recall.

In [9]:
from sklearn.metrics import precision_score
precision_score(y_test, predictions)

0.723404255319149

In [10]:
from sklearn.metrics import recall_score
recall_score(y_test, predictions)

0.53125
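
The three scores can be traced back to the four cells of the confusion matrix. The counts in the sketch below are not printed in this notebook. They are inferred here as the only whole-number combination for 143 test passengers that reproduces the accuracy, precision and recall reported above, so treat them as an assumption and re-check them with `sklearn.metrics.confusion_matrix`.

```python
# Inferred confusion-matrix counts (assumption, see the note above).
tp = 34  # survivors predicted as survivors
fp = 13  # non-survivors predicted as survivors
fn = 30  # survivors predicted as non-survivors
tn = 66  # non-survivors predicted as non-survivors

total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted survivors who survived
recall = tp / (tp + fn)        # share of actual survivors the model found

print(total, accuracy, precision, recall)
```

Read this way, a recall of roughly 0.53 means the model misses almost half of the actual survivors in the test set, even though most of its positive predictions are correct.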

### Adding the gender feature

In [12]:
features = ["Survived", "Pclass", "Age", "Fare", "SibSp", "Parch", "Sex"]
filtered_data = data[features].dropna()
filtered_data["Female"] = filtered_data["Sex"] == "female"
filtered_data

Unnamed: 0,Survived,Pclass,Age,Fare,SibSp,Parch,Sex,Female
0,0,3,22.0,7.2500,1,0,male,False
1,1,1,38.0,71.2833,1,0,female,True
2,1,3,26.0,7.9250,0,0,female,True
3,1,1,35.0,53.1000,1,0,female,True
4,0,3,35.0,8.0500,0,0,male,False
...,...,...,...,...,...,...,...,...
885,0,3,39.0,29.1250,0,5,female,True
886,0,2,27.0,13.0000,0,0,male,False
887,1,1,19.0,30.0000,0,0,female,True
889,1,1,26.0,30.0000,0,0,male,False


In [16]:
features = filtered_data.drop(["Survived", "Sex"], axis=1)
labels = filtered_data["Survived"]

In [17]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)

In [18]:
logisticRegr = LogisticRegression(max_iter=1000) # at most 1000 solver iterations (default solver is lbfgs, not plain gradient descent)
logisticRegr.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [19]:
predictions = logisticRegr.predict(X_test)
print(accuracy_score(y_test, predictions))
print(precision_score(y_test, predictions))
print(recall_score(y_test, predictions))

0.8461538461538461
0.8387096774193549
0.8125
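
To make the effect of the extra feature easier to read, the two runs can be compared metric by metric. The sketch below only reuses the scores printed in this notebook. The dictionary names are illustrative.

```python
# Scores of the model without the gender feature (printed earlier).
baseline = {
    "accuracy": 0.6993006993006993,
    "precision": 0.723404255319149,
    "recall": 0.53125,
}

# Scores of the model with the additional Female feature (printed above).
with_gender = {
    "accuracy": 0.8461538461538461,
    "precision": 0.8387096774193549,
    "recall": 0.8125,
}

# Gain per metric in percentage points.
gains = {name: (with_gender[name] - baseline[name]) * 100 for name in baseline}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f} percentage points")

# Metric that benefits most from the new feature.
largest = max(gains, key=gains.get)
print("largest gain:", largest)
```

All three metrics improve, and recall improves the most (by about 28 percentage points), so the model with the gender feature misses far fewer actual survivors.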
