In [1]:
# Importing Essential Libraries
import pandas as pd
import numpy as np

In [2]:
# Load the Titanic training dataset
df = pd.read_csv(r"D:\Thiru\ML_Projects\Titanic-Survival-Prediction\Data\Raw Data\train.csv")

In [3]:
# Show First 5 Rows
print(df.head(5))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


### Load Dataset
- Load the Titanic training set and print the first five rows to check the column names, data types, and value formats.

In [4]:
# Missing Values
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Missing Values
- Age is missing in 177 of 891 rows (about 20%) and Cabin in 687 (about 77%).
- Embarked is missing in only 2 rows.
- Each of these columns needs a strategy: impute the missing values or drop the column.
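
Raw counts are easier to compare as shares of the total row count. A minimal sketch of that calculation, using a small made-up frame (`toy`, illustrative values only) rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() marks missing cells; mean() turns the marks into a fraction per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the notebook's `df`, the same expression `df.isnull().mean()` would give the missing share for every column.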

### Plan for Handling Missing Values
- **Age:** Impute with median or use predictive methods.  
- **Cabin:** Too many missing values → drop column or extract deck info.  
- **Embarked:** Only a few missing → fill with mode (most frequent value).
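
The "extract deck info" option for Cabin is not used in this notebook. As a rough sketch of what it could look like, assuming the first character of a cabin code is the deck letter and using `"U"` as a hypothetical placeholder for unknown decks:

```python
import numpy as np
import pandas as pd

# Cabin values of the first five rows shown in the head() output above
cabins = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", np.nan]})

# Take the first character as the deck letter; missing cabins stay NaN
# and are then replaced by the placeholder "U" (an assumption, not from the notebook)
cabins["Deck"] = cabins["Cabin"].str[0].fillna("U")
print(cabins)
```

This keeps some of the information in Cabin, at the cost of one category ("U") that covers most passengers.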

In [5]:
# Fill missing Embarked values with the mode (most frequent port)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Fill missing Age values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop the Cabin column (the check makes the cell safe to re-run)
if 'Cabin' in df.columns:
    df.drop('Cabin', axis=1, inplace=True)

### Handling Missing Values
- Filled missing Embarked values with most frequent port.  
- Filled missing Age values with median.  
- Dropped Cabin column due to too many missing entries.

In [6]:
# Convert Sex And Embarked To Numeric
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
df['Embarked'] = df['Embarked'].map({'C':0, 'Q':1, 'S':2})

### Encoding Categorical Variables
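- Sex is mapped to 0 (male) and 1 (female); Embarked is mapped to 0 (C), 1 (Q), and 2 (S), so both columns are numeric for modeling.
- The integer codes for Embarked imply an order that the ports do not have; this may matter for models that treat the values as magnitudes.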
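
An alternative to integer codes for Embarked is one-hot encoding, which this notebook does not use. A minimal sketch with `pd.get_dummies` on a made-up `ports` frame (illustrative values only):

```python
import pandas as pd

# Made-up sample of embarkation ports (illustrative only)
ports = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# Fix the category set so the output columns are the same
# even when a port happens to be absent from the data
ports["Embarked"] = pd.Categorical(ports["Embarked"], categories=["C", "Q", "S"])

# One indicator column per port; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(ports, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded)
```

Each row then has exactly one indicator set to 1, and no ordering between ports is implied.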

In [7]:
# Create FamilySize and IsAlone features
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = np.where(df['FamilySize'] == 1, 1, 0)

### Feature Engineering
- FamilySize: number of family members on board, counting the passenger (SibSp + Parch + 1).  
- IsAlone: 1 if the passenger has no family on board (FamilySize == 1), 0 otherwise.  
- These features may help improve model performance.
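
One quick way to judge whether such a feature carries signal is to compare survival rates across its values. The sketch below rebuilds both features on the first five rows printed earlier (the frame name `sample` is hypothetical); five rows are far too few to draw conclusions, so it only shows the mechanics:

```python
import pandas as pd

# First five rows of the dataset, as printed by df.head(5) above
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Same definitions as in the notebook cell
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

# Mean of the 0/1 Survived column per group is the survival rate of that group
survival_by_alone = sample.groupby("IsAlone")["Survived"].mean()
print(survival_by_alone)
```

Running the same `groupby` on the full `df` would show whether passengers travelling alone survived at a different rate.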

In [8]:
# Drop identifier and free-text columns
df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

### Dropping Columns
- PassengerId is a row identifier, and Name and Ticket are free-text fields that are not used as features here, so all three are dropped.

In [9]:
# Inspect the cleaned dataset
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int64  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    int64  
 8   FamilySize  891 non-null    int64  
 9   IsAlone     891 non-null    int64  
dtypes: float64(2), int64(8)
memory usage: 69.7 KB


   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  FamilySize  IsAlone
0         0       3    0  22.0      1      0   7.2500         2           2        0
1         1       1    1  38.0      1      0  71.2833         0           2        0
2         1       3    1  26.0      0      0   7.9250         2           1        1
3         1       1    1  35.0      1      0  53.1000         2           2        0
4         0       3    0  35.0      0      0   8.0500         2           1        1


### Cleaned Dataset
- No missing values remain: all 10 columns have 891 non-null entries.  
- All columns are numeric (8 int64, 2 float64).  
- The dataset is preprocessed and ready for modeling.
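
These checks can also be made programmatically instead of reading the `info()` output. A minimal sketch, where `check_model_ready` is a hypothetical helper and `cleaned` holds the first five rows of the cleaned data shown above:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# First five rows of the cleaned dataset, as shown in the head() output above
cleaned = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": [2, 0, 2, 2, 2],
    "FamilySize": [2, 2, 1, 2, 1],
    "IsAlone": [0, 0, 1, 0, 1],
})

def check_model_ready(frame):
    """Return True if the frame has no missing values and only numeric columns."""
    no_missing = not frame.isnull().values.any()
    all_numeric = all(is_numeric_dtype(frame[col]) for col in frame.columns)
    return no_missing and all_numeric

print(check_model_ready(cleaned))
```

Calling `check_model_ready(df)` at the end of the notebook would catch a column that was left unencoded or still contains gaps.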