A logistic regression model that classifies whether a passenger survived the Titanic shipwreck.

In [9]:
import pandas as pd 
df = pd.read_csv('titanic.csv') 
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Number of passengers who did and did not survive the shipwreck.

In [11]:
df['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

Survived = 0: the passenger did not survive (549 passengers).
Survived = 1: the passenger survived (342 passengers).

In [12]:
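# Keep the candidate predictors plus the target column.
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
# One-hot encode Sex and Embarked; drop_first drops one level per variable as the baseline.
dummy_dataframe = pd.get_dummies(df[relevant_columns], drop_first=True, dtype=float)
dummy_dataframe.shape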

(891, 8)

Drop the rows with missing values (the 177 passengers with no recorded Age).

In [18]:
dummy_dataframe = dummy_dataframe.dropna() 
dummy_dataframe.shape

(714, 8)
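
The row count falls from 891 to 714, which matches the 714 non-null Age values reported by df.info(). A missing Embarked value does not cause a row to be dropped, because pd.get_dummies (with its default dummy_na=False) encodes a missing category as zeros in every dummy column rather than as NaN. A minimal sketch of that behaviour on a made-up four-row frame (the toy values are assumptions for illustration, not rows from titanic.csv):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from titanic.csv).
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0],
    "Embarked": ["S", "C", None, "Q"],
})

# Same encoding call as in the notebook.
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Missing values per column after encoding: only Age still has one.
missing_per_column = encoded.isna().sum()
print(missing_per_column)

# dropna() therefore removes only the row with the missing Age.
cleaned = encoded.dropna()
print(cleaned.shape)
```

One consequence worth keeping in mind: a row with a missing Embarked value ends up with zeros in both Embarked_Q and Embarked_S, so the model treats it the same as the dropped baseline level (C).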

In [20]:
import statsmodels.api as sm

y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])
X = sm.add_constant(X)  # Logit does not add an intercept on its own
logit_model = sm.Logit(y, X)
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.443267
         Iterations 6


In [21]:
result.summary()

Dep. Variable:,Survived,No. Observations:,714.0
Model:,Logit,Df Residuals:,706.0
Method:,MLE,Df Model:,7.0
Date:,"Sat, 10 Aug 2024",Pseudo R-squ.:,0.3437
Time:,12:35:01,Log-Likelihood:,-316.49
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,1.103e-67

,coef,std err,z,P>|z|,[0.025,0.975]
const,5.6503,0.633,8.921,0.000,4.409,6.892
Pclass,-1.2118,0.163,-7.433,0.000,-1.531,-0.892
Age,-0.0431,0.008,-5.250,0.000,-0.059,-0.027
SibSp,-0.3806,0.125,-3.048,0.002,-0.625,-0.136
Fare,0.0012,0.002,0.474,0.636,-0.004,0.006
Sex_male,-2.6236,0.217,-12.081,0.000,-3.049,-2.198
Embarked_Q,-0.8260,0.598,-1.381,0.167,-1.999,0.347
Embarked_S,-0.4130,0.269,-1.533,0.125,-0.941,0.115


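Fare and Embarked are not statistically significant: their p-values (0.636 for Fare, 0.167 for Embarked_Q and 0.125 for Embarked_S) are all above the 0.05 threshold, so they are dropped from the model.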
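
The p-values in the table test each coefficient separately. One way to check the three terms jointly is a likelihood-ratio test between this model and the reduced model without Fare and Embarked, whose summary (in the next cell) reports a log-likelihood of -318.36. The sketch below is not part of the original analysis. It assumes both models are fitted on the same 714 rows, which the No. Observations fields confirm, and the helper name chi2_sf_3df is hypothetical. The closed form used is specific to 3 degrees of freedom; scipy.stats.chi2.sf would be the general-purpose choice.

```python
import math

# Log-likelihoods as reported in the two summary tables.
ll_full = -316.49      # Pclass, Age, SibSp, Fare, Sex, Embarked
ll_reduced = -318.36   # Pclass, Age, SibSp, Sex
df_diff = 7 - 4        # Df Model of the full model minus the reduced model

# Likelihood-ratio statistic: twice the gain in log-likelihood.
lr_stat = 2 * (ll_full - ll_reduced)


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_3df(lr_stat)
print(f"LR statistic = {lr_stat:.2f} on {df_diff} df, p-value = {p_value:.3f}")
```

The statistic is about 3.74 and the p-value is roughly 0.29, so the joint test points the same way as the individual p-values: removing the three terms does not significantly worsen the fit.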

In [22]:
# Refit without Fare and Embarked.
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Sex', 'Survived']
dummy_dataframe = pd.get_dummies(df[relevant_columns], drop_first=True, dtype=float)
dummy_dataframe = dummy_dataframe.dropna()

y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])

X = sm.add_constant(X)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.445882
         Iterations 6


Dep. Variable:,Survived,No. Observations:,714.0
Model:,Logit,Df Residuals:,709.0
Method:,MLE,Df Model:,4.0
Date:,"Sat, 10 Aug 2024",Pseudo R-squ.:,0.3399
Time:,12:43:47,Log-Likelihood:,-318.36
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,1.089e-69

,coef,std err,z,P>|z|,[0.025,0.975]
const,5.6008,0.543,10.306,0.000,4.536,6.666
Pclass,-1.3174,0.141,-9.350,0.000,-1.594,-1.041
Age,-0.0444,0.008,-5.442,0.000,-0.060,-0.028
SibSp,-0.3761,0.121,-3.106,0.002,-0.613,-0.139
Sex_male,-2.6235,0.215,-12.229,0.000,-3.044,-2.203
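
The coefficients are on the log-odds scale, so they are easier to read after exponentiating (odds ratios) or after converting a linear predictor to a probability. A minimal sketch using the coefficients copied from the table above; the helper survival_probability and the two example passengers are hypothetical and serve only as illustration:

```python
import math

# Coefficients copied from the reduced model's summary table.
coef = {
    "const": 5.6008,
    "Pclass": -1.3174,
    "Age": -0.0444,
    "SibSp": -0.3761,
    "Sex_male": -2.6235,
}

# Odds ratio for a one-unit increase in each predictor.
odds_ratios = {name: math.exp(b) for name, b in coef.items() if name != "const"}


def survival_probability(pclass, age, sibsp, sex_male):
    """Predicted survival probability from the fitted logit (hypothetical helper)."""
    z = (coef["const"] + coef["Pclass"] * pclass + coef["Age"] * age
         + coef["SibSp"] * sibsp + coef["Sex_male"] * sex_male)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical passengers: 30 years old, first class, no siblings or spouse aboard.
p_female = survival_probability(pclass=1, age=30, sibsp=0, sex_male=0)
p_male = survival_probability(pclass=1, age=30, sibsp=0, sex_male=1)

print({name: round(value, 3) for name, value in odds_ratios.items()})
print(round(p_female, 3), round(p_male, 3))
```

The Sex_male odds ratio is about 0.07: holding class, age and SibSp fixed, the odds of survival for a man are roughly 7% of those for a woman. For the two example passengers this gives a predicted survival probability of about 0.95 for the woman and about 0.58 for the man.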
