pandas is used for data manipulation and analysis.
numpy is used for numerical operations and array handling.

In [2]:
import pandas as pd
import numpy as np

pd.read_csv() reads the Titanic dataset into a DataFrame df.
This dataset contains information about Titanic passengers such as age, sex, passenger class, fare, and survival status.

In [3]:
df = pd.read_csv("C:/Users/swath/Downloads/Titanic_Dataset/Titanic-Dataset.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Displays the first 5 rows of the dataset to get an overview of the structure and contents.

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Shows statistical summary (count, mean, std, min, max) of numerical columns.
Helps understand the spread, central values, and range of data.
.T transposes the output to make it more readable.

In [5]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


Displays column names, non-null counts, and data types.
Helps identify missing values and types of each column for cleaning.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Combines SibSp (siblings/spouses) and Parch (parents/children) to determine if a passenger is traveling alone.
-> If they have any family onboard → Travelalone = 0
-> If no family onboard → Travelalone = 1
.astype('uint8') reduces memory usage (0/1 only).
why ? 
Helps identify passengers traveling solo — a feature that might influence survival (e.g., families may help each other).

In [8]:
df['Travelalone'] = np.where((df['SibSp'] + df['Parch']) > 0, 0, 1).astype('uint8')

In [None]:
Checks that the Travelalone column was added correctly.

In [10]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Travelalone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


Removes irrelevant or redundant columns:
-> PassengerId, Name, Ticket, Cabin: Not useful for prediction.
-> SibSp, Parch: Already used to create Travelalone.

axis=1 means drop columns (not rows).


PassengerId: Just a unique identifier.
Name: Not useful for prediction.
Ticket: No predictive value.
Cabin: Too many missing values.
SibSp, Parch: Already used to create Travelalone.

In [12]:
df1 = df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis =1)

In [14]:
df1.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Travelalone
0,0,3,male,22.0,7.25,S,0
1,1,1,female,38.0,71.2833,C,0
2,1,3,female,26.0,7.925,S,1
3,1,1,female,35.0,53.1,S,0
4,0,3,male,35.0,8.05,S,1


Identifies columns with missing values.
Important for deciding which columns need cleaning/imputation.

In [16]:
df1.isna().sum()

Survived         0
Pclass           0
Sex              0
Age            177
Fare             0
Embarked         2
Travelalone      0
dtype: int64

Fills missing values in the Age column with the median age.
median is used instead of mean to avoid skew from outliers.
inplace=True updates the DataFrame directly


Why: Filling missing values prevents errors during modeling.
-> Median is better than mean when outliers are present.

In [17]:
df['Age'].fillna(df1['Age'].median(skipna=True), inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(df1['Age'].median(skipna=True), inplace = True)


In [18]:
df1.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Travelalone
0,0,3,male,22.0,7.25,S,0
1,1,1,female,38.0,71.2833,C,0
2,1,3,female,26.0,7.925,S,1
3,1,1,female,35.0,53.1,S,0
4,0,3,male,35.0,8.05,S,1


Converts categorical variables (Pclass, Embarked, Sex) into binary (0/1) columns.
drop_first=True removes the first category to avoid multicollinearity.
For example: Sex → only one column like Sex_male (female = 0), 
Embarked → Embarked_Q, Embarked_S, 
Pclass → Pclass_2, Pclass_3

why ?
Machine learning models need numeric input.drop_first=True: Avoids multicollinearity by dropping one category.

In [19]:
df_titanic = pd.get_dummies(df1, columns = ['Pclass', 'Embarked', 'Sex'], drop_first=True)

In [None]:
Checks the new one-hot encoded columns like Pclass_2, Pclass_3, Sex_male, etc.

In [20]:
df_titanic.head()

Unnamed: 0,Survived,Age,Fare,Travelalone,Pclass_2,Pclass_3,Embarked_Q,Embarked_S,Sex_male
0,0,22.0,7.25,0,False,True,False,True,True
1,1,38.0,71.2833,0,False,False,False,False,False
2,1,26.0,7.925,1,False,True,False,True,False
3,1,35.0,53.1,0,False,False,False,True,False
4,0,35.0,8.05,1,False,True,False,True,True


x: Independent variables (features)
y: Dependent variable (target) – whether the passenger survived

In [21]:
x = df_titanic.drop(['Survived'], axis = 1)
y = df_titanic['Survived']

These are preprocessing tools from scikit-learn to scale features.
why ?
Used to normalize or standardize features before feeding to ML models.

In [22]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

MinMaxScaler(): Scales features to a range of [0, 1]
StandardScaler(): Standardizes features to mean = 0 and standard deviation = 1

In [23]:
trans_MM = MinMaxScaler()
trans_SS = StandardScaler()

Fits and transforms the feature set x using MinMaxScaler.
Converts result into a DataFrame for easy viewing.

In [24]:
df_MM = trans_MM.fit_transform(x)
pd.DataFrame(df_MM)

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.271174,0.014151,0.0,0.0,1.0,0.0,1.0,1.0
1,0.472229,0.139136,0.0,0.0,0.0,0.0,0.0,0.0
2,0.321438,0.015469,1.0,0.0,1.0,0.0,1.0,0.0
3,0.434531,0.103644,0.0,0.0,0.0,0.0,1.0,0.0
4,0.434531,0.015713,1.0,0.0,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...
886,0.334004,0.025374,1.0,1.0,0.0,0.0,1.0,1.0
887,0.233476,0.058556,1.0,0.0,0.0,0.0,1.0,0.0
888,,0.045771,0.0,0.0,1.0,0.0,1.0,0.0
889,0.321438,0.058556,1.0,0.0,0.0,0.0,0.0,1.0


Fits and transforms the feature set x using StandardScaler.
Useful when data is normally distributed or models are distance-based (like KNN or SVM).

In [25]:
df_SS = trans_SS.fit_transform(x)
pd.DataFrame(df_SS)

Unnamed: 0,0,1,2,3,4,5,6,7
0,-0.530377,-0.502445,-1.231645,-0.510152,0.902587,-0.307562,0.619306,0.737695
1,0.571831,0.786845,-1.231645,-0.510152,-1.107926,-0.307562,-1.614710,-1.355574
2,-0.254825,-0.488854,0.811922,-0.510152,0.902587,-0.307562,0.619306,-1.355574
3,0.365167,0.420730,-1.231645,-0.510152,-1.107926,-0.307562,0.619306,-1.355574
4,0.365167,-0.486337,0.811922,-0.510152,0.902587,-0.307562,0.619306,0.737695
...,...,...,...,...,...,...,...,...
886,-0.185937,-0.386671,0.811922,1.960202,-1.107926,-0.307562,0.619306,0.737695
887,-0.737041,-0.044381,0.811922,-0.510152,-1.107926,-0.307562,0.619306,-1.355574
888,,-0.176263,-1.231645,-0.510152,0.902587,-0.307562,0.619306,-1.355574
889,-0.254825,-0.044381,0.811922,-0.510152,-1.107926,-0.307562,-1.614710,0.737695
