# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


## Learning Objectives


At the end of the mini-hackathon you will be able to:
* Perform data preprocessing
* Apply different ML algorithms to the **Titanic** dataset
* Combine several models using VotingClassifier


## Dataset Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

[Data Set Link: Kaggle competition](https://www.kaggle.com/competitions/titanic)

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Survived:** Survived or Not information

**Pclass:** Ticket class, a proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**SibSp:** No. of siblings/spouses of the passenger aboard the Titanic

**Parch:** No. of parents/children of the passenger aboard the Titanic

**Ticket:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation
  * S = Southampton
  * C = Cherbourg
  * Q = Queenstown


## Problem Statement

Build a predictive model that answers the question: “what sort of people were more likely to survive?” using the Titanic's passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook="U1_MH1_Data_Munging" #name of the notebook

def setup():
    # Download the training set and the submission test set into the working directory.
    # run_line_magic replaces the deprecated ipython.magic call.
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/titanic.csv")
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/test_titanic.csv")
    print("Data downloaded successfully")

setup()

In [None]:
!ls

## Exercise 1 - Load and Explore the Data (2 Marks)

* Understand different features in the training dataset
* Understand the data types of each column
* Note which columns have missing values




#### Import Required Packages

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix




In [5]:
# Load the dataset
train_df = pd.read_csv('titanic.csv')
print("Training dataset shape:", train_df.shape)
print("\nFirst 5 rows of training data:")
print(train_df.head())



In [None]:
# Load the submission test set
test_submission_df = pd.read_csv('test_titanic.csv')
print("\nTest dataset shape:", test_submission_df.shape)
print("\nFirst 5 rows of test data:")
print(test_submission_df.head())

In [None]:
print("=== TRAINING DATASET INFO ===")
print("Dataset Info:")
print(train_df.info())
print("\nDataset Description:")
print(train_df.describe())
print("\nMissing Values:")
print(train_df.isnull().sum())
print("\nData Types:")
print(train_df.dtypes)
print("\nUnique values in each column:")
for col in train_df.columns:
    print(f"{col}: {train_df[col].nunique()} unique values")
    
# Display survival distribution
print("\nSurvival Distribution:")
print(train_df['Survived'].value_counts())
print("\nSurvival Rate:")
print(train_df['Survived'].value_counts(normalize=True))

## Exercise 2 - Split the data into train and test sets (1 Mark)
Note: Apply all your data preprocessing steps to the train set first and keep the test set aside.
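
The split can be sketched as below. This is only an illustration on a small made-up frame (toy_df and its values are hypothetical stand-ins for train_df); the 80/20 ratio, random_state and stratify settings are assumptions, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 1, 3, 2],
    "Fare":     [71.3, 7.9, 13.0, 8.1, 53.1, 26.0, 7.2, 30.0, 8.7, 10.5],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})

# Separate the features from the target column.
X = toy_df.drop(columns="Survived")
y = toy_df["Survived"]

# stratify=y keeps the survived / not-survived ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Keeping X_test untouched until the preprocessing function is ready avoids leaking information from the test rows into statistics such as the mean age.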

## Exercise 3 - Data Cleaning and Processing (15 Marks)
### 3.1 Working on the "Cabin" column (2 Marks)
Find the unique entries in the Cabin column. We can label every passenger as either having a cabin or not. Check the data type of each Cabin entry (use: type): recorded cabins are strings, while missing entries are NaN. Write a function that converts a string entry into 1 (passenger with a cabin) and any other entry into 0 (passenger without a cabin), apply it to the Cabin column, and store the result in a new column named "Has_cabin" containing only 0 or 1.
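
One possible shape for this function is sketched below on a made-up frame. The function name has_cabin and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def has_cabin(entry):
    """Return 1 if the Cabin entry is a string (a cabin is recorded), else 0."""
    return 1 if isinstance(entry, str) else 0

# Toy stand-in for the Cabin column; the values are made up.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# Inspect the Python type of every entry before converting.
print(toy["Cabin"].apply(type).unique())

toy["Has_cabin"] = toy["Cabin"].apply(has_cabin)
print(toy)
```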





### 3.2 Working on "SibSp" & "Parch" columns (1 Mark)
Combine the "SibSp" and "Parch" columns into a new column named "family_size" that represents the total number of passengers on one ticket. Each ticket may include siblings/spouses (SibSp = number of siblings/spouses aboard) or parents/children (Parch = number of parents/children aboard) in addition to the passenger who booked the ticket.
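
A minimal sketch on made-up values follows. Adding 1 for the passenger themself is an interpretation of "along with the passenger who booked the ticket"; drop the + 1 if you prefer to count only relatives.

```python
import pandas as pd

# Toy stand-in for the two columns; the values are made up.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Relatives aboard plus the passenger themself.
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1
print(toy)
```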

  

### 3.3 Working on the "Embarked" column (2 Marks)
The "Embarked" column represents the port of embarkation: Cherbourg (C), Queenstown (Q), and Southampton (S). The entries in this column therefore fall into three categories. Fill in the missing rows in this column, for example with the most frequent category, and then map the categorical string entries to numbers.
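
The fill-and-map step might look like the sketch below. The toy values and the particular integer codes in embarked_map are assumptions; any consistent mapping works as long as the same one is reused on the test set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column; the values are made up.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q"]})

# mode() ignores NaN by default and returns the most frequent category first.
most_frequent = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

# Assumed encoding; the exercise does not prescribe specific codes.
embarked_map = {"S": 0, "C": 1, "Q": 2}
toy["Embarked"] = toy["Embarked"].map(embarked_map).astype(int)
print(toy)
```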



### 3.4 Working on the "Age" column (2 Marks)
Find the number of NaN entries in the Age column and their row indices. Calculate the mean and standard deviation of the Age column and check its distribution. We can fill the missing values with randomly generated integers between (mean - standard deviation) and (mean + standard deviation). Use: np.isnan; np.random.randint; the concept of slicing a dataframe. Convert the Age column to an integer data type.
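
The steps above can be sketched as follows on a made-up column. The seed is an assumption added only to make the sketch repeatable; note that np.random.randint excludes the upper bound.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column; the values are made up.
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0]})

# Locate the missing entries (np.isnan(toy["Age"]) gives the same mask).
missing_mask = toy["Age"].isna()
print("Missing count:", missing_mask.sum())
print("Missing rows:", toy.index[missing_mask].tolist())

# mean() and std() skip NaN values by default.
age_mean = toy["Age"].mean()
age_std = toy["Age"].std()

# Draw one random integer per missing row from [mean - std, mean + std).
np.random.seed(0)  # assumed seed, for repeatability only
low, high = int(age_mean - age_std), int(age_mean + age_std)
random_ages = np.random.randint(low, high, size=missing_mask.sum())

# Slice the dataframe with the mask and fill only the missing rows.
toy.loc[missing_mask, "Age"] = random_ages
toy["Age"] = toy["Age"].astype(int)
print(toy)
```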



### 3.5 Working on the "Sex" column (1 Mark)
Map the Sex column as 'female': 0, 'male': 1, and convert it to an integer data type.
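
A short sketch of this mapping on made-up rows:

```python
import pandas as pd

# Toy stand-in for the Sex column; the values are made up.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map the two categories to the codes given in the exercise.
toy["Sex"] = toy["Sex"].map({"female": 0, "male": 1}).astype(int)
print(toy)
```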



### 3.6 Optional - Working on the "Name" column:
Extract the title (Mr, Mrs, Miss, ...) from each name. We can map these titles to numbers and convert the result to an integer data type. Use: the concept of regular expressions.
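
One way to extract the titles is sketched below, assuming names follow the "Surname, Title. First Names" pattern. The title_map codes and the RARE_TITLE fallback are assumptions made for illustration.

```python
import pandas as pd

# A few names in the "Surname, Title. First Names" format used by the dataset.
toy = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
]})

# Capture the text between the comma and the first period.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(toy["Title"].tolist())

# Assumed encoding; titles not listed here fall back to RARE_TITLE.
title_map = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}
RARE_TITLE = 4
toy["Title"] = toy["Title"].map(title_map).fillna(RARE_TITLE).astype(int)
print(toy)
```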

### 3.7 Optional - Working on the "Fare" column:
We can convert Fare into categorical entries such as Low, Medium, and High.
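
A possible sketch using quantile-based bins is shown below. Using three equal-frequency bins is an assumption; fixed fare thresholds with pd.cut would work as well. The toy fares and the Fare_cat column name are made up.

```python
import pandas as pd

# Toy stand-in for the Fare column; the values are made up.
toy = pd.DataFrame({"Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00]})

# Split the fares into three equal-frequency bins and label them.
toy["Fare_cat"] = pd.qcut(toy["Fare"], q=3, labels=["Low", "Medium", "High"])
print(toy)
```

When the same binning has to be applied to the test set, pd.qcut(..., retbins=True) also returns the bin edges, which can then be passed to pd.cut.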



### 3.8 Drop the columns (1 Mark)

Drop the columns: "PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin"
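
A minimal sketch of the drop on a made-up two-row frame (the values are hypothetical):

```python
import pandas as pd

# Toy stand-in with the original columns plus two engineered ones.
toy = pd.DataFrame({
    "PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1],
    "Name": ["A", "B"], "SibSp": [1, 0], "Parch": [0, 0],
    "Ticket": ["T1", "T2"], "Cabin": [None, "C85"],
    "Has_cabin": [0, 1], "family_size": [2, 1],
})

# These columns are either identifiers or already replaced by engineered features.
cols_to_drop = ["PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin"]
toy = toy.drop(columns=cols_to_drop)
print(toy.columns.tolist())
```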

Now apply different ML algorithms and check the accuracy of your model.



### 3.9 Apply StandardScaler (1 Mark)
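
The usual pattern is to fit the scaler on the train split only and reuse it for the test split, as in this sketch. The arrays are made-up stand-ins for the processed features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the processed feature matrices; the values are made up.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[30.0, 8.05]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the train split only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the train statistics; do not call fit on the test split.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```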

### 3.10 Create a single function for preprocessing the test set (X_test) and apply it. (4 Marks)
#### **Note**: All the pre-processing steps that were applied to the train set before ML modelling must also be applied to the test set before it is passed to the predict function.
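
One possible layout for such a function is sketched below. The function name preprocess, its parameters, the integer codes, and the toy rows are all assumptions; the key idea is that statistics such as the Embarked mode and the Age mean/std come from the train split and are passed in.

```python
import numpy as np
import pandas as pd

def preprocess(df, embarked_mode, age_mean, age_std, seed=0):
    """Apply the cleaning steps of 3.1-3.5 and 3.8 to a copy of df.

    embarked_mode, age_mean and age_std are computed on the train split.
    """
    df = df.copy()
    df["Has_cabin"] = df["Cabin"].apply(lambda c: 1 if isinstance(c, str) else 0)
    df["family_size"] = df["SibSp"] + df["Parch"] + 1
    df["Embarked"] = (
        df["Embarked"].fillna(embarked_mode).map({"S": 0, "C": 1, "Q": 2}).astype(int)
    )
    # Fill missing ages with random integers in [mean - std, mean + std).
    rng = np.random.default_rng(seed)
    missing = df["Age"].isna()
    df.loc[missing, "Age"] = rng.integers(
        int(age_mean - age_std), int(age_mean + age_std), size=missing.sum()
    )
    df["Age"] = df["Age"].astype(int)
    df["Sex"] = df["Sex"].map({"female": 0, "male": 1}).astype(int)
    return df.drop(columns=["PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin"])

# Toy stand-in for X_test; every value here is made up.
X_test_toy = pd.DataFrame({
    "PassengerId": [10, 11], "Pclass": [3, 1],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane"],
    "Sex": ["male", "female"], "Age": [np.nan, 40.0],
    "SibSp": [0, 1], "Parch": [0, 2], "Ticket": ["A1", "B2"],
    "Fare": [8.0, 80.0], "Cabin": [np.nan, "C10"], "Embarked": [np.nan, "C"],
})

# The three statistics below would normally be computed on the train split.
clean = preprocess(X_test_toy, embarked_mode="S", age_mean=30.0, age_std=13.0)
print(clean)
```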

In [None]:
## Create a function


In [None]:
## Applying the above function



### 3.11 Apply the StandardScaler transformation to X_test (1 Mark)

## Exercise 4 - Apply multiple ML algorithms and display their accuracy (7 Marks)
 * Optional (Ensemble Technique)
#### Expected Accuracy >= 80%
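
A hedged sketch of comparing several classifiers and a VotingClassifier is given below. It runs on synthetic data from make_classification rather than the Titanic features, so the accuracies it prints say nothing about the expected 80%; the choice of models, hyperparameters, and hard voting are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the processed Titanic features.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the train split and reuse it on the test split.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "svc": SVC(),
}

# Train each model separately and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Hard voting takes the majority class predicted by the three models.
voter = VotingClassifier(estimators=list(models.items()), voting="hard")
voter.fit(X_train, y_train)
scores["voting"] = accuracy_score(y_test, voter.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```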


## Exercise 5 - Pre-process the test_set (3 Marks)
Again, apply the same preprocess function and the fitted StandardScaler to this test set before passing it to the predict function.

#### Understanding the test set:

#### Note: In the initial train set there were no missing entries in the "Fare" column, but the submission test set has one missing entry in this column.

#### There will be a minor change in the preprocess function to address the above issue.
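
The extra step could be as small as the sketch below. Filling with the median fare of the train set is an assumption (the mean would also be defensible), and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the values are made up.
train_fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05])
submission = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Use a statistic from the train set so the submission rows do not leak into it.
fare_median = train_fares.median()
submission["Fare"] = submission["Fare"].fillna(fare_median)
print(submission)
```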

## Exercise 6 - Prediction for test data (2 Marks)