# Titanic Dataset Analysis Report

## Introduction:

This report documents an exploratory data analysis of the Titanic dataset. The main objectives were to clean the data, convert data types, engineer meaningful features, and prepare the dataset for analysis and modeling.

## Dataset:

The dataset contains passenger information, including ticket details, family relationships, fare, and survival outcome. The file loaded here (`tested.csv`) has 418 rows and 12 columns.

#### Columns and Data Types:

PassengerId (int)

Survived (categorical)

Pclass (categorical)

Sex (categorical)

Age (float)

SibSp (int)

Parch (int)

Fare (float)

Ticket (object)

Cabin (object)

Embarked (categorical)

Last_Name (object, derived from Name)

First_Name (object, derived from Name)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('tested.csv', sep = ',')

In [3]:
data.shape

(418, 12)

In [4]:
data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
data.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
417,1309,0,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


In [7]:
data.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
360,1252,0,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.55,,S
307,1199,0,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S
22,914,1,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
376,1268,1,3,"Kink, Miss. Maria",female,22.0,2,0,315152,8.6625,,S
275,1167,1,2,"Bryhl, Miss. Dagmar Jenny Ingeborg",female,20.0,1,0,236853,26.0,,S
202,1094,0,1,"Astor, Col. John Jacob",male,47.0,1,0,PC 17757,227.525,C62 C64,C
193,1085,0,2,"Lingane, Mr. John",male,61.0,0,0,235509,12.35,,Q
199,1091,1,3,"Rasmussen, Mrs. (Lena Jacobsen Solvang)",female,,0,0,65305,8.1125,,S
260,1152,0,3,"de Messemaeker, Mr. Guillaume Joseph",male,36.5,1,0,345572,17.4,,S
8,900,1,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [9]:
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,0.363636,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.481622,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,0.0,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,0.0,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,1.0,3.0,39.0,1.0,0.0,31.5
max,1309.0,1.0,3.0,76.0,8.0,9.0,512.3292


In [10]:
data.describe(include = 'object')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,418,418,418,91,418
unique,418,2,363,76,3
top,"Kelly, Mr. James",male,PC 17608,B57 B59 B63 B66,S
freq,1,266,5,3,270


In [11]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [12]:
data.duplicated().sum()

0

In [13]:
data[data[['Cabin', 'Age']].isna().all(axis = 1)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
10,902,0,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S
22,914,1,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
29,921,0,3,"Samaan, Mr. Elias",male,,2,0,2662,21.6792,,C
33,925,1,3,"Johnston, Mrs. Andrew G (Elizabeth Lily"" Watson)""",female,,1,2,W./C. 6607,23.4500,,S
36,928,1,3,"Roth, Miss. Sarah A",female,,0,0,342712,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
408,1300,1,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q
410,1302,1,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [14]:
data['Ticket'].value_counts()[:10]

PC 17608              5
CA. 2343              4
113503                4
PC 17483              3
220845                3
347077                3
SOTON/O.Q. 3101315    3
C.A. 31029            3
16966                 3
230136                2
Name: Ticket, dtype: int64

In [15]:
data[data['Ticket'] == 'PC 17608']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
59,951,1,1,"Chaudanson, Miss. Victorine",female,36.0,0,0,PC 17608,262.375,B61,C
64,956,0,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
142,1034,0,1,"Ryerson, Mr. Arthur Larned",male,61.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
375,1267,1,1,"Bowen, Miss. Grace Scott",female,45.0,0,0,PC 17608,262.375,,C


In [16]:
data[data['Ticket'] == 'CA. 2343']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
188,1080,1,3,"Sage, Miss. Ada",female,,8,2,CA. 2343,69.55,,S
342,1234,0,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.55,,S
360,1252,0,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.55,,S
365,1257,1,3,"Sage, Mrs. John (Annie Bullen)",female,,1,9,CA. 2343,69.55,,S


In [17]:
data['Fare'].value_counts()

7.7500     21
26.0000    19
13.0000    17
8.0500     17
7.8958     11
           ..
7.8208      1
8.5167      1
78.8500     1
52.0000     1
22.3583     1
Name: Fare, Length: 169, dtype: int64

In [18]:
data[data['Age'] <= 1.0]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
117,1009,1,3,"Sandstrom, Miss. Beatrice Irene",female,1.0,1,1,PP 9549,16.7,G6,S
201,1093,0,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S
250,1142,1,2,"West, Miss. Barbara J",female,0.92,1,2,C.A. 34651,27.75,,S
263,1155,1,3,"Klasen, Miss. Gertrud Emilia",female,1.0,1,1,350405,12.1833,,S
281,1173,0,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S
296,1188,1,2,"Laroche, Miss. Louise",female,1.0,1,2,SC/Paris 2123,41.5792,,C
307,1199,0,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S
354,1246,1,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,C.A. 2315,20.575,,S


In [19]:
data[(data['Age'] <= 3.0) & (data['Age'] > 1.0)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
89,981,0,2,"Wells, Master. Ralph Lester",male,2.0,1,1,29103,23.0,,S
284,1176,1,3,"Rosblom, Miss. Salli Helena",female,2.0,1,1,370129,20.2125,,S
409,1301,1,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.775,,S


In [20]:
data[(data['Age'] >= 60.0) & (data['Age'] <= 80.0)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
13,905,0,2,"Howard, Mr. Benjamin",male,63.0,1,0,24065,26.0,,S
48,940,1,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",female,60.0,0,0,11813,76.2917,D15,C
69,961,1,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60.0,1,4,19950,263.0,C23 C25 C27,S
81,973,0,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S
96,988,1,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S
114,1006,1,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63.0,1,0,PC 17483,221.7792,C55 C57,S
142,1034,0,1,"Ryerson, Mr. Arthur Larned",male,61.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
152,1044,0,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S
179,1071,1,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C


## Data Cleaning:

#### Handling Missing Values:

    Age: Missing values for third-class passengers (Pclass 3) were filled with the median age of that class (24.0). Passengers in other classes with a missing Age were not imputed.

    Cabin: Missing values were replaced with the label 'No_Cabin'.
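
The notebook imputes Age for one class only. As a hedged alternative, the sketch below fills each missing age with the median of the passenger's own class, so every class is covered. The `toy` frame and its values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, used only to illustrate the technique.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Median age within each passenger class, aligned to the original rows.
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages; observed ages are left untouched.
toy['Age'] = toy['Age'].fillna(class_median)
print(toy)
```

If `Pclass` has already been converted to a categorical column, passing `observed=True` to `groupby` restricts the grouping to the classes that actually occur.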

#### Data Type Standardization:

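    Survived, Pclass, Sex, and Embarked were converted from object or int to category using .astype('category') to improve memory efficiency.
    Name was split into two separate columns (Last_Name and First_Name), and the title prefix (Mr., Mrs., etc.) was removed for easier analysis.
    Rows that still contained missing values (unimputed Age, missing Fare) were dropped.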
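
To give a rough sense of the memory effect of the category conversion, the sketch below compares an object column with its categorical version. The series is made up (two labels repeated many times); actual savings depend on the column and the pandas version.

```python
import pandas as pd

# Made-up column with few distinct values repeated many times.
sex = pd.Series(['male', 'female'] * 500)

# deep=True counts the memory held by the string values themselves.
as_object = sex.memory_usage(deep=True)
as_category = sex.astype('category').memory_usage(deep=True)

print(f'object: {as_object} bytes, category: {as_category} bytes')
```

The gain comes from storing each distinct label once and keeping small integer codes per row, so it is largest for columns with few unique values such as Sex or Embarked.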
    
    


In [21]:
data['Survived'] = data['Survived'].astype('category')
data['Pclass']   = data['Pclass'].astype('category')
data['Sex']      = data['Sex'].astype('category')
data['Embarked'] = data['Embarked'].astype('category')

In [22]:
data['Cabin'] = data['Cabin'].fillna('No_Cabin')

In [23]:
median_age = round(data.loc[data['Pclass'] == 3, 'Age'].median(), 1)
median_age

24.0

In [24]:
data.loc[(data['Age'].isna()) & (data['Pclass'] == 3), 'Age'] = median_age

In [25]:
data = data.dropna()

In [26]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,No_Cabin,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,No_Cabin,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,No_Cabin,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,No_Cabin,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,No_Cabin,S
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,24.0,0,0,A.5. 3236,8.0500,No_Cabin,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,No_Cabin,S
416,1308,0,3,"Ware, Mr. Frederick",male,24.0,0,0,359309,8.0500,No_Cabin,S


In [27]:
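# Split "Last, Title. First" on the first comma only
data[['Last_Name', 'First_Name']] = data['Name'].str.split(',', n = 1, expand = True)
# Drop the title prefix (text before the first period) and trim whitespace
data['First_Name'] = data['First_Name'].str.split('.', n = 1).str[1].str.strip()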

data.drop(columns = 'Name', inplace = True)
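
As an alternative to chained splits, the name parts can be pulled out in one pass with a regular expression. This is a sketch, not the notebook's method; the `Title` column is an assumption added for illustration (the notebook discards the prefix), and the pattern assumes every name follows the "Last, Title. First" layout.

```python
import pandas as pd

# Names copied from rows displayed earlier in this report.
names = pd.Series([
    'Kelly, Mr. James',
    'Wilkes, Mrs. James (Ellen Needs)',
    'Oliva y Ocana, Dona. Fermina',
])

# Pattern: "<last name>, <title>. <remaining name>"
pattern = r'^(?P<Last_Name>[^,]+),\s*(?P<Title>[^.]+)\.\s*(?P<First_Name>.*)$'

# Named groups become the column names of the resulting DataFrame.
parts = names.str.extract(pattern)
print(parts)
```

Rows that do not match the pattern come back as missing values, which makes malformed names easy to spot.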

## Feature Engineering:

#### Family Based Features:

    Created a new column, 'Family_Members', from SibSp and Parch: Family_Members = SibSp + Parch + 1 (the +1 counts the passenger).

#### Age and Fare Segmentation:

    Passengers were grouped into life stages (Infant, Toddler, Child, Teen, Adult, Senior) using conditional logic on Age, and into fare categories (Very_Low, Low, Medium, High, Luxury) by binning Fare.

In [28]:
def age_category(age):
    # Map an age in years to a life-stage label
    if age <= 1:
        return 'Infant'
    elif 1 < age <= 3:
        return 'Toddler'
    elif 3 < age <= 13:
        return 'Child'
    elif 13 < age <= 18:
        return 'Teen'
    elif 60 <= age <= 80:
        return 'Senior'
    else:
        # Ages between 18 and 60, and anything above 80
        return 'Adult'

data['Age_Category'] = data['Age'].apply(age_category)
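
The same life-stage rules can also be written without a row-by-row `apply`. The sketch below uses `np.select`, which evaluates the conditions in order like an if/elif chain; the sample ages are taken from rows shown earlier in this report, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample ages taken from rows shown earlier in this report.
ages = pd.Series([0.17, 2.0, 13.0, 14.5, 34.5, 60.0, 76.0])

# Conditions are checked in order; the first match wins.
conditions = [
    ages <= 1,
    ages <= 3,
    ages <= 13,
    ages <= 18,
    ages.between(60, 80),  # inclusive on both ends by default
]
labels = ['Infant', 'Toddler', 'Child', 'Teen', 'Senior']

# Anything that matches no condition falls back to 'Adult'.
age_labels = pd.Series(np.select(conditions, labels, default='Adult'),
                       index=ages.index)
print(age_labels.tolist())
```

As with the function-based version, a missing age matches none of the conditions and would be labeled 'Adult', so missing values should be handled before this step.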

In [29]:
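# include_lowest = True keeps fares of exactly 0 in the first bin instead of turning them into NaN
data['Price_Category'] = pd.cut(data['Fare'], bins = (0, 20, 60, 100, 200, data['Fare'].max()), labels = ['Very_Low', 'Low', 'Medium', 'High', 'Luxury'], include_lowest = True)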
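
One detail of `pd.cut` matters here: the minimum fare in the summary statistics is 0.0, which sits exactly on the lowest bin edge, and by default that edge is excluded. A minimal sketch of the difference, using a few fares from the outputs above (the variable names are illustrative):

```python
import pandas as pd

# Fares taken from the outputs above: the minimum, a common low fare, a bin edge, the maximum.
fares = pd.Series([0.0, 7.8958, 20.0, 512.3292])
bins = [0, 20, 60, 100, 200, fares.max()]
labels = ['Very_Low', 'Low', 'Medium', 'High', 'Luxury']

# Default: intervals are (left, right], so a fare of exactly 0 falls outside every bin.
default_cut = pd.cut(fares, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(fares, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Because intervals are closed on the right, a fare of exactly 20 lands in 'Very_Low' rather than 'Low' in both versions.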

In [30]:
data['Family_Members'] = data['SibSp'] + data['Parch'] + 1

In [32]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Last_Name,First_Name,Age_Category,Price_Category,Family_Members
0,892,0,3,male,34.5,0,0,330911,7.8292,No_Cabin,Q,Kelly,James,Adult,Very_Low,1
1,893,1,3,female,47.0,1,0,363272,7.0000,No_Cabin,S,Wilkes,James (Ellen Needs),Adult,Very_Low,2
2,894,0,2,male,62.0,0,0,240276,9.6875,No_Cabin,Q,Myles,Thomas Francis,Senior,Very_Low,1
3,895,0,3,male,27.0,0,0,315154,8.6625,No_Cabin,S,Wirz,Albert,Adult,Very_Low,1
4,896,1,3,female,22.0,1,1,3101298,12.2875,No_Cabin,S,Hirvonen,Alexander (Helga E Lindqvist),Adult,Very_Low,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,male,24.0,0,0,A.5. 3236,8.0500,No_Cabin,S,Spector,Woolf,Adult,Very_Low,1
414,1306,1,1,female,39.0,0,0,PC 17758,108.9000,C105,C,Oliva y Ocana,Fermina,Adult,High,1
415,1307,0,3,male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,No_Cabin,S,Saether,Simon Sivertsen,Adult,Very_Low,1
416,1308,0,3,male,24.0,0,0,359309,8.0500,No_Cabin,S,Ware,Frederick,Adult,Very_Low,1


In [31]:
data.to_csv('Titanic_data.csv', index = False)

## Key Insights:

* Gender had a strong impact on survival.
* Passenger class (Pclass) strongly influenced survival.
* Infants and children had higher survival rates than adults.
* Senior passengers had lower survival rates.
* Family size was associated with survival: passengers traveling with family fared differently from those traveling alone.
* Passengers with a recorded cabin showed a survival advantage.
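
Claims like these can be checked by comparing survival rates across groups. The sketch below is one way to do that, assuming the cleaned column names used in this report; `survival_rate_by` is a hypothetical helper, and the `toy` rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def survival_rate_by(df, column):
    """Share of survivors per group of `column` (hypothetical helper)."""
    # Survived is categorical after cleaning, so convert it back to 0/1 numbers.
    survived = df['Survived'].astype(int)
    return survived.groupby(df[column], observed=True).mean()

# Made-up rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    'Sex': pd.Categorical(['male', 'female', 'female', 'male']),
    'Survived': pd.Categorical([0, 1, 0, 0]),
})

rates = survival_rate_by(toy, 'Sex')
print(rates)
```

On the cleaned frame, the same helper could be called with 'Pclass', 'Age_Category', or 'Family_Members' to compare the groups named in the list above.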

## Conclusion:

Through data cleaning, data type conversion, and feature engineering, the Titanic dataset was transformed into a structured, analysis-ready format, improving data quality, interpretability, and suitability for exploratory analysis and machine learning models.

## Tools Used:

Python

Pandas

NumPy

Jupyter Notebook