In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv('Titanic-Dataset.csv')


In [3]:
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


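We loaded the Titanic dataset and used head() and info() to inspect its
structure. It has 891 rows and 12 columns, and info() shows that Age
(714 non-null), Cabin (204 non-null), and Embarked (889 non-null)
contain missing values.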


In [5]:
df.isnull().sum()


Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


Missing values were identified using isnull().sum() to understand
which columns require cleaning.
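
As a rough illustration of this check, the sketch below reports both the
count and the percentage of missing values per column. It runs on a small
made-up frame named toy (an assumption for illustration, not the Titanic
data), so the numbers it prints are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing)
```

Seeing the percentage next to the raw count can make it easier to decide
between filling a column and dropping it.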

In [6]:
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which raises a FutureWarning and stops working in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


In [7]:
df.isnull().sum()


Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,0
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,0


In [8]:
df = df.drop(columns=['Cabin'])


In [9]:
df.isnull().sum()


Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,0
SibSp,0
Parch,0
Ticket,0
Fare,0
Embarked,0


In this step, missing values in the dataset were identified and handled
using appropriate data cleaning techniques.

Numeric columns such as Age were filled using the mean value. This keeps
the column mean unchanged, although it reduces the variance because all
177 filled rows now share the same value.

Categorical columns such as Embarked were filled using the most frequent
value (mode), as this represents the most common category.
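
The mean and mode fills described above can be sketched on a small
made-up frame. The name toy and its values are assumptions for
illustration only. The sketch assigns the filled column back to the frame
and compares the standard deviation of Age before and after filling.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Embarked": ["S", "C", "S", None],
})

std_before = toy["Age"].std()

# Fill the numeric column with its mean and the categorical column with its mode.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

std_after = toy["Age"].std()

print(toy)
print("std before:", std_before, "std after:", std_after)
```

The standard deviation drops after filling, which is the usual trade-off
of mean imputation. Filling with the median instead is a common
alternative when a column is skewed.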

The Cabin column was missing in 687 of 891 rows (about 77%) and was not
essential for basic analysis. Therefore, it was removed from the dataset
rather than filled, since imputing that many values would introduce
noise and inaccuracies.
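
One way to make this kind of decision repeatable is to drop every column
whose missing fraction exceeds a chosen cut-off. The sketch below assumes
a hypothetical threshold of 0.5 and a made-up frame named toy. Neither
comes from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: Cabin is mostly missing, Age only partly.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

threshold = 0.5  # hypothetical cut-off: drop columns more than 50% missing

# Fraction of missing values per column, then the columns above the cut-off.
missing_fraction = toy.isnull().mean()
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=sparse_columns)

print("dropped:", sparse_columns)
print("remaining:", reduced.columns.tolist())
```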

After applying these cleaning steps, the dataset was checked again to
ensure that no missing values remained.


In [10]:
print("Rows before removing duplicates:", df.shape[0])


Rows before removing duplicates: 891


In [11]:
df = df.drop_duplicates()


In [12]:
print("Rows after removing duplicates:", df.shape[0])


Rows after removing duplicates: 891


Removing Duplicate Records

The dataset was checked for duplicate rows using the drop_duplicates()
function, which keeps the first occurrence of each fully identical row
and removes the rest. This ensures that each record is unique and
prevents repeated data from affecting the analysis.

The number of rows was 891 both before and after the call, so the
dataset contained no duplicate rows.
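
Comparing row counts works, but duplicated() can report the number of
duplicates directly before anything is removed. The sketch below uses a
made-up frame named toy, an assumption for illustration, that contains
one repeated row.

```python
import pandas as pd

# Made-up data: the second and third rows are identical.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# Remove the repeats and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicates found:", n_duplicates)
print("rows after removal:", len(deduped))
```

Both functions also accept a subset argument, for example
subset=['PassengerId'], when only certain columns should define a
duplicate.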


In [14]:
df.dtypes


Unnamed: 0,0
PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
Embarked,object


In [15]:
df['Survived'] = df['Survived'].astype(int)


In [16]:
df['Pclass'] = df['Pclass'].astype(int)


In [17]:
df.dtypes


Unnamed: 0,0
PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
Embarked,object


Datatype Conversion

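Datatype conversion was performed to ensure that columns are stored in
the correct format for analysis. The Survived and Pclass columns were
cast to integer type using astype(). Both were already int64, as the
dtypes output before and after the cast shows, so nothing changes here.
The cast makes the expected type explicit and would raise an error if
either column held missing or non-numeric values.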

Correct datatypes improve data consistency and prevent errors during
analysis and reporting.
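
To show a case where astype() does change something, the sketch below
starts from a made-up frame named toy in which Survived arrives as text
and Pclass as float. Both the frame and the category conversion of Sex
are assumptions for illustration and are not steps from this notebook.

```python
import pandas as pd

# Made-up data with types that need fixing.
toy = pd.DataFrame({
    "Survived": ["0", "1", "1"],   # digits stored as text
    "Pclass": [3.0, 1.0, 3.0],     # whole numbers stored as float
    "Sex": ["male", "female", "female"],
})

# Cast the two numeric columns to integers.
toy["Survived"] = toy["Survived"].astype(int)
toy["Pclass"] = toy["Pclass"].astype(int)

# A column with few distinct values can be stored as a categorical.
toy["Sex"] = toy["Sex"].astype("category")

print(toy.dtypes)
```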


In [18]:
# Label passengers under 18 as 'Child' and everyone else as 'Adult'
df['Age_Group'] = np.where(df['Age'] < 18, 'Child', 'Adult')


In [19]:
df[['Age', 'Age_Group']].head()


Unnamed: 0,Age,Age_Group
0,22.0,Adult
1,38.0,Adult
2,26.0,Adult
3,35.0,Adult
4,35.0,Adult


Creating a New Column

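A new column named Age_Group was created using conditional logic.
Passengers below 18 years of age were labelled Child, while those aged
18 and above were labelled Adult. Because missing ages were filled with
the mean earlier (roughly 29.7 years for this dataset), the 177
passengers with an imputed age all fall into the Adult group.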

This transformation helps in grouping passengers and simplifies
analysis based on age categories.
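
If more than two groups are needed, pd.cut() is one way to extend the
same idea. The sketch below is an illustration on a made-up frame named
toy. The Senior group and the cut-off at 65 are hypothetical and are not
part of this notebook.

```python
import pandas as pd

# Made-up ages, including the boundary values 17 and 18.
toy = pd.DataFrame({"Age": [4.0, 17.0, 18.0, 35.0, 70.0]})

# Left-closed bins: [0, 18) is Child, [18, 65) is Adult, [65, inf) is Senior.
toy["Age_Group"] = pd.cut(
    toy["Age"],
    bins=[0, 18, 65, float("inf")],
    labels=["Child", "Adult", "Senior"],
    right=False,
)

print(toy)
```

Setting right=False keeps the boundary consistent with the rule used
above, where a passenger aged exactly 18 counts as an adult.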


In [20]:
df.to_csv('cleaned_data.csv', index=False)


Saving the Cleaned Dataset

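After completing all data cleaning and transformation steps, the final
cleaned dataset was saved as a CSV file using the to_csv() function.
Passing index=False leaves the row index out of the file, so no extra
unnamed column appears when the file is read back. This file can be used
for further analysis or reporting.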
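
A quick round trip is one way to confirm that a saved file reads back
with the same columns. The sketch below writes a made-up frame named toy
to an in-memory buffer instead of a file on disk. Both are assumptions
for illustration.

```python
import io

import pandas as pd

# Made-up data standing in for the cleaned dataset.
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22.0, 38.0],
    "Age_Group": ["Adult", "Adult"],
})

# Write to an in-memory text buffer without the row index, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.shape)
```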
