# 🚢**Titanic Survival Prediction** (Task 1 CodSoft)

*1. Importing libraries* 📦

*2. Ignoring warnings* ⚠️🙉

*3. Data preprocessing and exploration* 🔍📊

*4. Oversampling using SMOTE*  📈

*5. Splitting Data* 📊🔀

*6. Feature scaling* 📏

*7. Model training* 🤖

*8. Model evaluation* 📊

*9. Visualization* 📊

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

⌛📊*Loading the dataset* 

In [2]:
titanic = pd.read_csv('/kaggle/input/test-file/tested.csv')

In [3]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [5]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

*Handling null values*

In [6]:
columns = ['Age', 'Fare']
for col in columns:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    titanic[col] = titanic[col].fillna(titanic[col].median())

titanic['Cabin'] = titanic['Cabin'].fillna('Unknown')

*Checking for duplicate rows*

In [7]:
dup = titanic.duplicated().sum()
print("The number of duplicated rows in the dataset is: ", dup)

The number of duplicated rows in the dataset is:  0


👉🏼🤷🔤*Checking for typos in the categorical columns*

In [8]:
for col in titanic.select_dtypes(include = "object"):
    print(f"Name of Column: {col}")
    print(titanic[col].unique())
    print('\n', '-'*60, '\n')

Name of Column: Name
['Kelly, Mr. James' 'Wilkes, Mrs. James (Ellen Needs)'
 'Myles, Mr. Thomas Francis' 'Wirz, Mr. Albert'
 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)'
 'Svensson, Mr. Johan Cervin' 'Connolly, Miss. Kate'
 'Caldwell, Mr. Albert Francis'
 'Abrahim, Mrs. Joseph (Sophie Halaut Easu)' 'Davies, Mr. John Samuel'
 'Ilieff, Mr. Ylio' 'Jones, Mr. Charles Cresson'
 'Snyder, Mrs. John Pillsbury (Nelle Stevenson)' 'Howard, Mr. Benjamin'
 'Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)'
 'del Carlo, Mrs. Sebastiano (Argenia Genovesi)' 'Keane, Mr. Daniel'
 'Assaf, Mr. Gerios' 'Ilmakangas, Miss. Ida Livija'
 'Assaf Khalil, Mrs. Mariana (Miriam")"' 'Rothschild, Mr. Martin'
 'Olsen, Master. Artur Karl' 'Flegenheim, Mrs. Alfred (Antoinette)'
 'Williams, Mr. Richard Norris II'
 'Ryerson, Mrs. Arthur Larned (Emily Maria Borie)'
 'Robins, Mr. Alexander A' 'Ostby, Miss. Helene Ragnhild'
 'Daher, Mr. Shedid' 'Brady, Mr. John Bertram' 'Samaan, Mr. Elias'
 'Louch, Mr. Charles Alexa

*Let's display the modified dataset*

In [9]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Unknown,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,Unknown,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Unknown,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,Unknown,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,Unknown,S


*Extracting titles from names*

*Mapping and replacing the titles*

In [10]:
titanic['Title'] = titanic['Name'].str.extract(r',\s(.*?)\.')

titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Dona', 'Mrs')
titanic['Title'] = titanic['Title'].replace(['Col', 'Rev', 'Dr'], 'Rare')
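A minimal sketch of how the extraction pattern behaves, using a few names from the output above plus one made-up entry (`Doe, Ms. Jane` is hypothetical and only there to exercise the `Ms` mapping). The pattern captures whatever sits between the comma and the first period.

```python
import pandas as pd

# Names copied from the output above, plus one hypothetical 'Ms' entry
names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Olsen, Master. Artur Karl",
    "Connolly, Miss. Kate",
    "Doe, Ms. Jane",  # hypothetical, not in the dataset
])

# Lazy match: text after ", " up to the first "."
titles = names.str.extract(r",\s(.*?)\.", expand=False)

# Fold spelling variants into the common titles
titles = titles.replace({"Ms": "Miss", "Dona": "Mrs"})

print(titles.tolist())
```

Passing `expand=False` returns a Series rather than a one-column DataFrame, which keeps the later `replace` calls simple.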

👦👧👨👩👴👵*Creating Age groups*

In [11]:
bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", 'Elderly']
titanic['Age_Group'] = pd.cut(titanic['Age'], bins = bins, labels = labels)
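A small sketch of how these bins assign labels, using the first five ages from the dataset plus two boundary values chosen for illustration. By default `pd.cut` uses right-closed intervals, so an age of exactly 17 falls in "Children" and exactly 32 in "Young".

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", "Elderly"]

# First five ages from the dataset, then two boundary values (17 and 32)
ages = pd.Series([34.5, 47.0, 62.0, 27.0, 22.0, 17.0, 32.0])

# Intervals are (-inf, 17], (17, 32], (32, 45], (45, 50], (50, inf)
groups = pd.cut(ages, bins=bins, labels=labels)

print(groups.astype(str).tolist())
```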

*Generating a family size feature*

In [12]:
titanic['Family'] = titanic['SibSp'] + titanic['Parch']

*Dropping non essential columns*

In [13]:
titanic.drop(['PassengerId', 'Name', 'Ticket'], axis = 1, inplace = True)

*Displaying the modified dataset*

In [14]:
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title,Age_Group,Family
0,0,3,male,34.5,0,0,7.8292,Unknown,Q,Mr,Mid-Aged,0
1,1,3,female,47.0,1,0,7.0,Unknown,S,Mrs,Senior-Adult,1
2,0,2,male,62.0,0,0,9.6875,Unknown,Q,Mr,Elderly,0
3,0,3,male,27.0,0,0,8.6625,Unknown,S,Mr,Young,0
4,1,3,female,22.0,1,1,12.2875,Unknown,S,Mrs,Young,2


*Moving columns to a different position*

*Changing data type of 'Age_Group' column*

In [15]:
col_to_move = titanic.pop('Age_Group')
titanic.insert(4, 'Age_Group', col_to_move)

col_to_move = titanic.pop('Family')
titanic.insert(7, 'Family', col_to_move)

titanic['Age_Group'] = titanic['Age_Group'].astype('object')

*Descriptive statistics*

In [16]:
titanic.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Family,Fare
count,418.0,418.0,418.0,418.0,418.0,418.0,418.0
mean,0.363636,2.26555,29.599282,0.447368,0.392344,0.839713,35.576535
std,0.481622,0.841838,12.70377,0.89676,0.981429,1.519072,55.850103
min,0.0,1.0,0.17,0.0,0.0,0.0,0.0
25%,0.0,1.0,23.0,0.0,0.0,0.0,7.8958
50%,0.0,3.0,27.0,0.0,0.0,0.0,14.4542
75%,1.0,3.0,35.75,1.0,0.0,1.0,31.471875
max,1.0,3.0,76.0,8.0,9.0,10.0,512.3292


*Categorical Descriptive statistics*

In [17]:
titanic.describe(include = 'O')

Unnamed: 0,Sex,Age_Group,Cabin,Embarked,Title
count,418,418,418,418,418
unique,2,5,77,3,5
top,male,Young,Unknown,S,Mr
freq,266,257,327,270,240


👨👩*Group wise analysis by 'Sex'*

In [18]:
titanic.groupby('Sex')[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Family', 'Fare']].mean()

Sex,Survived,Pclass,Age,SibSp,Parch,Family,Fare
female,1.0,2.144737,29.734145,0.565789,0.598684,1.164474,49.747699
male,0.0,2.334586,29.522218,0.379699,0.274436,0.654135,27.478728
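The group means above (a `Survived` mean of 1.0 for female and 0.0 for male) suggest that in this file the target is fully determined by `Sex`. A quick way to confirm this kind of relationship is a cross-tabulation. The sketch below uses a tiny hand-built frame mirroring the first five rows shown earlier, not the full dataset.

```python
import pandas as pd

# Sex and Survived values of the first five rows shown earlier
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Survived": [0, 1, 0, 0, 1],
})

# Counts of each (Sex, Survived) combination
table = pd.crosstab(sample["Sex"], sample["Survived"])
print(table)

# True when every Sex value maps to exactly one Survived value
determined = sample.groupby("Sex")["Survived"].nunique().eq(1).all()
print("Survived fully determined by Sex:", bool(determined))
```

If the same check on the full frame also returns `True`, any model that sees `Sex` can reach perfect accuracy, which is worth keeping in mind when reading the evaluation results.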


*Group wise analysis by 'Embarked'*

In [19]:
titanic.groupby('Embarked')[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Family', 'Fare']].mean()

Embarked,Survived,Pclass,Age,SibSp,Parch,Family,Fare
C,0.392157,1.794118,33.220588,0.421569,0.382353,0.803922,66.259765
Q,0.521739,2.869565,28.108696,0.195652,0.021739,0.217391,10.9577
S,0.325926,2.340741,28.485185,0.5,0.459259,0.959259,28.179413


*Calculating Survival Count*

*Creating a Pie chart*

*Updating traces and layout*

*Displaying Pie chart*

In [20]:
survived_counts = titanic['Survived'].value_counts()
fig_surv_perc = px.pie(titanic, names= survived_counts.index,  values = survived_counts.values, title=f'Distribution of Survived', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_surv_perc.update_traces(textinfo='percent+label')
fig_surv_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_surv_perc.show()

*Calculating Pclass count*

*Creating a Pie chart for Pclass distribution*

*Updating traces and layouts for Pclass Pie chart*

*Displaying the Pie chart for Pclass*

In [21]:
pclass_counts = titanic.Pclass.value_counts()
fig_pclass_perc = px.pie(titanic, names= pclass_counts.index, values = pclass_counts.values, title=f'Distribution of Pclass', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_pclass_perc.update_traces(textinfo='percent+label')
fig_pclass_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_pclass_perc.show()

*Creating a histogram for Sex counts*

*Creating a Pie chart for Sex distribution*

In [22]:
fig_sex_count = px.histogram(titanic, x = 'Sex', color = 'Sex', color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_count.update_layout(title_text='Count of different Sex', xaxis_title='Sex', yaxis_title='Count', plot_bgcolor = 'white')
fig_sex_count.show()

fig_sex_perc = px.pie(titanic, names= 'Sex', title=f'Distribution of Sex', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_perc.update_traces(textinfo='percent+label')
fig_sex_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_sex_perc.show()

*Creating a histogram for age distribution*

*Updating traces and layout for age histogram*

*Displaying the Age histogram*

In [23]:
fig_age = px.histogram(titanic, x='Age', nbins=30, histnorm='probability density')
fig_age.update_traces(marker=dict(color='#440154'), selector=dict(type='histogram'))
fig_age.update_layout(title='Distribution of Age', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Age', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_age.show()

*Creating a histogram for Fare distribution*

*Updating traces and layout for Fare histogram*

*Displaying the Fare histogram*

In [24]:
fig_fare = px.histogram(titanic, x='Fare', nbins=30, histnorm='probability density')
fig_fare.update_traces(marker=dict(color='#440154'), selector=dict(type='histogram'))
fig_fare.update_layout(title='Distribution of Fare', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Fare', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_fare.show()

*Creating a Histogram for Embarked count*

*Creating a Pie chart for Embarked Distribution*

In [25]:
fig_embarked_count = px.histogram(titanic, x = 'Embarked', color = 'Embarked', color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_count.update_layout(title_text='Count of different Embarked', xaxis_title='Embarked', yaxis_title='Count', plot_bgcolor = 'white')
fig_embarked_count.show()

fig_embarked_perc = px.pie(titanic, names= 'Embarked', title=f'Distribution of Embarked', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_perc.update_traces(textinfo='percent+label')
fig_embarked_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_embarked_perc.show()

*Creating a histogram for Title count*

*Creating a Pie chart for Title distribution*

In [26]:
fig_title_count = px.histogram(titanic, x = 'Title', color = 'Title', color_discrete_sequence=px.colors.sequential.Viridis)
fig_title_count.update_layout(title_text='Count of different Title', xaxis_title='Title', yaxis_title='Count', plot_bgcolor = 'white')
fig_title_count.show()

fig_title_perc = px.pie(titanic, names= 'Title', title=f'Distribution of Title', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_title_perc.update_traces(textinfo='percent+label')
fig_title_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_title_perc.show()

*Creating a Grouped Histogram for Pclass and Survival*

*Updating Layout for Pclass and Survival Histogram*

*Displaying the Grouped Histogram*

In [27]:
fig_pclass_surv = px.histogram(titanic, x = 'Pclass', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_pclass_surv.update_layout(title = 'Survival according to passenger classes', plot_bgcolor = 'white')
fig_pclass_surv.show()

*Creating a Grouped Histogram for Sex and Survival*

*Updating Layout for Sex and Survival Histogram*

*Displaying the Grouped Histogram*

In [28]:
fig_sex_surv = px.histogram(titanic, x = 'Sex', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_surv.update_layout(title = 'Survival according to gender', plot_bgcolor = 'white')
fig_sex_surv.show()

*Creating a Grouped Histogram for Age Groups and Survival*

*Updating Layout for Age Groups and Survival Histogram*

*Displaying the Grouped Histogram*

In [29]:
fig_age_group_surv = px.histogram(titanic, x = 'Age_Group', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_age_group_surv.update_layout(title = 'Survival according to age groups', plot_bgcolor = 'white')
fig_age_group_surv.show()

*Creating a Grouped Histogram for Family Size and Survival*

*Updating Layout for Family Size and Survival Histogram*

*Displaying the Grouped Histogram*

In [30]:
fig_family_surv = px.histogram(titanic, x = 'Family', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_family_surv.update_layout(title = 'Survival according to number of family members', plot_bgcolor = 'white')
fig_family_surv.show()

*Creating a Grouped Histogram for Embarked and Survival*

*Updating Layout for Embarked and Survival Histogram*

*Displaying the Grouped Histogram*

In [31]:
fig_embarked_surv = px.histogram(titanic, x = 'Embarked', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_surv.update_layout(title = 'Survival according to embarked', plot_bgcolor = 'white')
fig_embarked_surv.show()

*Grouping Data*

*Creating a Faceted Line Chart*

*Updating Layout and Displaying the Chart*

In [32]:
grouped_data = titanic.groupby(['Age', 'Sex', 'Survived']).agg({'Fare': 'mean'}).reset_index()
fig = px.line(grouped_data, x='Age', y='Fare', color='Survived', facet_col='Sex', facet_col_wrap=2, labels={'Fare': 'Fare', 'Survived': 'Survived'}, title='Relation of age and gender with fare')

fig.update_layout(hovermode='x unified', plot_bgcolor = 'white')
fig.update_xaxes(title_text='Age')
fig.update_yaxes(title_text='Fare', row=1, col=1)
fig.show()

*Initializing LabelEncoder*

*Defining Categorical Columns*

*Label Encoding Loop*

In [33]:
le = LabelEncoder()
cols = ['Sex', 'Age_Group', 'Cabin', 'Embarked', 'Title']

for col in cols:
    titanic[col] = le.fit_transform(titanic[col])
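`LabelEncoder` assigns integer codes in sorted order of the class labels, and refitting one encoder in a loop keeps only the mapping of the last column. A possible variation, sketched on a small hypothetical frame, keeps one encoder per column so that codes can be mapped back to labels later (the `encoders` dictionary is an illustrative name, not part of the notebook).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hypothetical frame with two categorical columns
df = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "Q"],
})

# One fitted encoder per column, so every mapping stays available
encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)

# Codes follow sorted label order: 'female' -> 0, 'male' -> 1
print(list(encoders["Sex"].classes_))

# Recover the original labels from the codes
print(list(encoders["Sex"].inverse_transform(df["Sex"])))
```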

*Count the occurrences of unique values in the 'Survived' column of the Titanic dataset*

In [34]:
titanic.Survived.value_counts()

Survived
0    266
1    152
Name: count, dtype: int64

*Separate the features (independent variables) and the target variable*

In [35]:
X = titanic.drop('Survived', axis = 1)
y = titanic['Survived']

*Applying SMOTE to Resample Data*

In [36]:
smote = SMOTE(random_state = 42)
X_balanced, y_balanced = smote.fit_resample(X, y)

*Splitting the dataset into training and testing parts*

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size = 0.3, random_state = 42)
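The cells above resample the whole dataset and split afterwards, so synthetic rows derived from passengers that land in the test set can also appear in the training set. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that ordering on synthetic data with the same class counts as shown above. To stay dependency-free it swaps in plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; calling `smote.fit_resample` on the training split would take the same place.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data: 266 negatives and 152 positives, as in the counts above
rng = np.random.default_rng(42)
X = rng.normal(size=(418, 3))
y = np.array([0] * 266 + [1] * 152)

# 1) Split first, keeping class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 2) Oversample the minority class in the training part only
minority = y_train == 1
n_needed = int((~minority).sum() - minority.sum())
X_extra, y_extra = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_needed, random_state=42,
)
X_train_bal = np.vstack([X_train, X_extra])
y_train_bal = np.concatenate([y_train, y_extra])

print("train class counts:", np.bincount(y_train_bal))
print("test class counts: ", np.bincount(y_test))  # left untouched
```

With this ordering the test set contains only original rows at the original class ratio, so the reported metrics reflect unseen data.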

*Doing feature scaling by StandardScaler*

In [38]:
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
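Fitting the scaler on the training split only, as above, keeps test-set statistics out of training. When cross-validating, the same idea can be applied per fold by wrapping the scaler and the model in a pipeline, so the scaler is refit on each training fold. A minimal sketch on synthetic data from `make_classification` (the data and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The scaler is fit inside each training fold, then applied to the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print(f"fold accuracies: {scores}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a more compact summary than the raw array of fold scores.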

*Importing Model Classes*

*Initializing Model Instances*

*Training the Models*

*Making Predictions*

In [39]:
# Building the models

lr = LogisticRegression()
rf = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)
gbc.fit(X_train_scaled, y_train)

lr_pred = lr.predict(X_test_scaled)
rf_pred = rf.predict(X_test_scaled)
gbc_pred = gbc.predict(X_test_scaled)

*Generating Classification Reports*

*Performing Cross-Validation*

*Printing Results*

In [40]:
# Evaluating the models by generating classification report and cross validation scores

lr_report = classification_report(y_test, lr_pred)
lr_scores = cross_val_score(lr, X_train_scaled, y_train, cv=5, scoring='accuracy')

rf_report = classification_report(y_test, rf_pred)
rf_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='accuracy')

gbc_report = classification_report(y_test, gbc_pred)
gbc_scores = cross_val_score(gbc, X_train_scaled, y_train, cv=5, scoring='accuracy')


print('The classification report of Logistic Regression is below : ', '\n\n\n', lr_report)
print(f"Logistic Regression Cross-Validation Scores: {lr_scores}")

print('\n', '='*100, '\n')
print('The classification report of Random Forest is below : ', '\n\n\n', rf_report)
print(f"Random Forest Cross-Validation Scores: {rf_scores}")

print('\n', '='*100, '\n')
print('The classification report of Gradient Boosting Classifier is below : ', '\n\n\n', gbc_report)
print(f"Gradient Boosting Classifier Cross-Validation Scores: {gbc_scores}")

The classification report of Logistic Regression is below :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160

Logistic Regression Cross-Validation Scores: [1. 1. 1. 1. 1.]


The classification report of Random Forest is below :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160

Random Forest Cross-Validation Scores: [1. 1. 1. 1. 1.]


The classification report of Gradient Boosting Classifier is below :  


               prec

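*This notebook walks through a complete classification 🏷️ workflow on the Titanic dataset 🚢: cleaning, feature engineering, exploration 🔍, oversampling, scaling, model training 🤖, and evaluation. In this file every female passenger is labeled as a survivor and every male passenger as a non-survivor (see the group-wise analysis by 'Sex'), so the perfect scores above follow from the 'Sex' column alone rather than from model quality. The workflow is a starting point for further refinement and optimization 📈 on data where the target is not determined by a single feature.*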