In [1]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

### **Who Survives the Titanic?**

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

#### **Data Dictionary**

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

#### **Variable Notes**

**pclass:** A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

**age:** Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

**sibsp:** The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
import pandas as pd
import numpy as np

In [None]:
kaggle_df = pd.read_csv('titanic_data.csv')

In [None]:
kaggle_df.shape

(891, 12)


In [None]:
kaggle_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
kaggle_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


---
## **HOMEWORK ASSIGNMENT: Titanic Data Cleaning and Feature Engineering**

In this assignment, you will apply your pandas skills to clean and transform the Titanic dataset. For each question:
1. Read the instructions carefully
2. Explain your reasoning in the provided markdown cell
3. Write your code in the empty code cell
4. Compare your solution with the provided solution cell

**Good luck! 🚢**

---
### **Question 1: Handling Null Values - Dropping the Cabin Column**

After exploring the dataset, you notice that the 'Cabin' column has many missing values.

**Task:** Drop the 'Cabin' column from the dataframe.

**Before you code, answer in the cell below:**
- Why might we choose to drop a column instead of filling missing values?
- What percentage of missing values might justify dropping a column?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Drop the 'Cabin' column from kaggle_df


**Solution:**

In [None]:
kaggle_df.drop(columns=['Cabin'], inplace=True)

In [None]:
kaggle_df.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64
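
One way to put a number on "many missing values" is to compute the share of nulls in each column. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not the Titanic data); the same expression should work on `kaggle_df` if you run it before dropping a column. For reference, `info()` above reports 204 non-null 'Cabin' entries out of 891 rows, so roughly 77% of that column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for kaggle_df
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields booleans; the column mean of booleans is the fraction missing
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.round(1))
```

In this toy frame 'Cabin' comes out at 75% missing and 'Age' at 25%, which shows how the same check separates a column worth filling from one that may be better dropped.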

---
### **Question 2: Filling Missing Values in 'Embarked' Column**

The 'Embarked' column has a few missing values. Since this is a categorical variable with only 3 possible values (C, Q, S), we can fill missing values with the most common value.

**Task:** Fill the missing values in the 'Embarked' column with 'S' (Southampton - the most common embarkation port).

**Before you code, answer in the cell below:**
- Why is it reasonable to fill missing categorical values with the most frequent value?
- What other strategies could we use for filling missing categorical data?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Fill missing values in 'Embarked' column with 'S'


**Solution:**

In [None]:
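# Assign the result back: fillna(inplace=True) on a selected column may not update kaggle_df in newer pandas versions
kaggle_df['Embarked'] = kaggle_df['Embarked'].fillna('S')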

In [None]:
kaggle_df.isnull().sum()
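
Rather than hard-coding 'S', the most frequent value can be looked up from the data itself. This is a minimal sketch on a made-up Series (the values and the names `toy_embarked`, `most_common`, and `filled` are illustrative assumptions), not the solution the assignment asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy_embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# mode() ignores NaN by default and returns a Series (ties are possible),
# so take the first entry as the fill value
most_common = toy_embarked.mode()[0]

filled = toy_embarked.fillna(most_common)
print(most_common)
print(filled.tolist())
```

Looking the value up keeps the code correct if the data changes, at the cost of hiding which port is actually being filled in.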

---
### **Question 3: Filling Missing Age Values Using Group-Based Statistics**

The 'Age' column has many missing values. Instead of using a single value for all missing ages, we can use a more sophisticated approach: fill missing ages with the median age of passengers with the same Sex and Pclass.

**Task:** Fill missing values in the 'Age' column with the median age of passengers grouped by 'Sex' and 'Pclass'.

**Hint:** Use `groupby()` with the `transform()` method.

**Before you code, answer in the cell below:**
- Why is filling with group-based statistics better than using a single global median?
- Why use median instead of mean for age?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Fill missing Age values with the median age of passengers with the same Sex and Pclass


**Solution:**

In [None]:
kaggle_df['Age'] = kaggle_df.groupby(['Sex','Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))

In [None]:
kaggle_df.isnull().sum()

Age              0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64
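
To see what `groupby()` with `transform()` does, here is a minimal sketch on a made-up six-row frame (`toy_df` and its ages are illustrative assumptions). It passes the string `"median"` to `transform()` and then calls `fillna()`, which is an alternative spelling of the lambda used in the solution and is expected to give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, one missing age in each
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# transform() returns one value per original row: the median of that row's group
group_median = toy_df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched
toy_df["Age"] = toy_df["Age"].fillna(group_median)
print(toy_df)
```

The missing male/3rd-class age becomes 25.0 and the missing female/1st-class age becomes 45.0, so each gap is filled from its own group rather than from one global value.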

---
### **Question 4: Feature Engineering - Extracting Titles from Names**

Passenger names contain titles (Mr., Mrs., Miss., etc.) that can provide valuable information about age, gender, and social status. Let's extract these titles as a new feature.

**Task:** Create a new column called 'Title' by extracting titles from the 'Name' column.

**Steps:**
1. Split the 'Name' column by ", " and take the second part
2. Split that part by "." and take the first part
3. Store this as a new column 'Title'

**Hint:** Use `str.split()` with the `expand=True` parameter.

**Before you code, answer in the cell below:**
- Why might titles be useful features for predicting survival?
- What information does a title provide that sex and age might not?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Extract titles from the 'Name' column and create a 'Title' column


**Solution:**

In [None]:
kaggle_df['Title'] = kaggle_df['Name'].str.split(", ",expand=True)[1].str.split(".",expand=True)[0]

**Check the distribution of titles:**

In [None]:
kaggle_df['Title'].value_counts()

Title
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Sir               1
Mme               1
Don               1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: count, dtype: int64
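
The two-step split can also be written as a single regular expression with `str.extract()`. This is a sketch of an alternative, not the assignment's solution; the pattern assumes every name follows the "Surname, Title. Given names" layout seen in the rows above, and `names` and `titles` are illustrative variable names.

```python
import pandas as pd

# Three names taken from the first rows of the dataset shown earlier
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# After the comma and optional spaces, capture everything up to the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract()` returns a Series, so the result can be assigned straight to a new column.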

---
### **Question 5: Consolidating Rare Titles**

You'll notice that some titles appear very infrequently. A category with only a handful of rows gives a model too little data to learn from and can encourage overfitting. Let's consolidate rare titles and standardize similar ones.

**Task:**
1. Replace rare titles (Lady, the Countess, Capt, Col, Don, Dr, Major, Rev, Sir, Jonkheer, Dona) with 'Rare'
2. Replace 'Mlle' with 'Miss'
3. Replace 'Ms' with 'Miss'
4. Replace 'Mme' with 'Mrs'

**Before you code, answer in the cell below:**
- Why is it important to consolidate rare categories?
- What could happen if we leave too many rare categories in our data?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Replace rare and similar titles


**Solution:**

In [None]:
kaggle_df['Title'] = kaggle_df['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
kaggle_df['Title'] = kaggle_df['Title'].replace('Mlle', 'Miss')
kaggle_df['Title'] = kaggle_df['Title'].replace('Ms', 'Miss')
kaggle_df['Title'] = kaggle_df['Title'].replace('Mme', 'Mrs')

**Verify the consolidated titles:**

In [None]:
kaggle_df['Title'].value_counts()

Title
Mr        757
Miss      264
Mrs       198
Master     61
Rare       29
Name: count, dtype: int64
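
The four `replace()` calls can be folded into one lookup table. The sketch below is one possible variant under that assumption: it builds a dictionary (`title_map`, a hypothetical name) from the rare-title list in the task and applies it in a single pass to a made-up Series.

```python
import pandas as pd

# Hypothetical sample of raw titles
raw_titles = pd.Series(["Mr", "Mlle", "Dr", "Ms", "Mme", "Master", "Rev"])

rare_titles = ["Lady", "the Countess", "Capt", "Col", "Don", "Dr",
               "Major", "Rev", "Sir", "Jonkheer", "Dona"]

# Every rare title maps to 'Rare'; spelling variants map to their common form
title_map = {title: "Rare" for title in rare_titles}
title_map.update({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# Values not present in the dictionary (e.g. 'Mr', 'Master') are left unchanged
consolidated = raw_titles.replace(title_map)
print(consolidated.tolist())
```

Keeping the rules in one dictionary makes them easy to review and extend in a single place.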

---
### **Question 6: Creating a Family Size Feature**

We have two columns: 'SibSp' (siblings/spouses) and 'Parch' (parents/children). We can combine these to create a new feature representing total family size.

**Task:** Create a new column 'Family_size' that represents the total family size (SibSp + Parch + 1, where the +1 represents the passenger themselves).

**Before you code, answer in the cell below:**
- Why might family size be a useful predictor of survival?
- Why do we add 1 to the sum of SibSp and Parch?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Create 'Family_size' column


**Solution:**

In [None]:
kaggle_df['Family_size'] = kaggle_df['SibSp'] + kaggle_df['Parch'] + 1

**Check a sample of the data with the new column:**

In [None]:
kaggle_df.sample(10)

| | Age | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Title | Family_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 937 | 45.0 | C | 29.7 | Chevre, Mr. Paul Romaine | 0 | 938 | 1 | male | 0 | NaN | PC 17594 | Mr | 1 |
| 981 | 22.0 | S | 13.9 | Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judi... | 0 | 982 | 3 | female | 1 | NaN | 347072 | Mrs | 2 |
| 595 | 36.0 | S | 24.15 | Van Impe, Mr. Jean Baptiste | 1 | 596 | 3 | male | 1 | 0.0 | 345773 | Mr | 3 |
| 362 | 45.0 | C | 14.4542 | Barbara, Mrs. (Catherine David) | 1 | 363 | 3 | female | 0 | 0.0 | 2691 | Mrs | 2 |
| 664 | 20.0 | S | 7.925 | Lindqvist, Mr. Eino William | 0 | 665 | 3 | male | 1 | 1.0 | STON/O 2. 3101285 | Mr | 2 |
| 69 | 26.0 | S | 8.6625 | Kink, Mr. Vincenz | 0 | 70 | 3 | male | 2 | 0.0 | 315151 | Mr | 3 |
| 320 | 22.0 | S | 7.25 | Dennis, Mr. Samuel | 0 | 321 | 3 | male | 0 | 0.0 | A/5 21172 | Mr | 1 |
| 747 | 30.0 | S | 13.0 | Sinkkonen, Miss. Anna | 0 | 748 | 2 | female | 0 | 1.0 | 250648 | Miss | 1 |
| 523 | 44.0 | C | 57.9792 | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | 1 | 524 | 1 | female | 0 | 1.0 | 111361 | Mrs | 2 |
| 298 | 42.0 | S | 30.5 | Saalfeld, Mr. Adolphe | 0 | 299 | 1 | male | 0 | 1.0 | 19988 | Mr | 1 |
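
As a quick check of the formula, the sketch below recomputes family size for three SibSp/Parch pairs that appear in the sample above (Van Impe, Chevre, and Kink). The frame name `toy_df` is an illustrative assumption.

```python
import pandas as pd

# SibSp/Parch values copied from three sampled passengers
toy_df = pd.DataFrame(
    {"SibSp": [1, 0, 2], "Parch": [1, 0, 0]},
    index=["Van Impe", "Chevre", "Kink"],
)

# Column arithmetic is vectorized: each row is summed, then 1 is added for the passenger
toy_df["Family_size"] = toy_df["SibSp"] + toy_df["Parch"] + 1
print(toy_df)
```

A passenger with no relatives aboard ends up with a family size of 1 rather than 0, which is why the `+ 1` matters.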


---
### **Question 7: Dropping Unnecessary Columns**

Now that we've extracted useful information from some columns, we can drop those that are no longer needed or won't be useful for modeling.

**Task:** Drop the following columns: 'Name', 'Parch', 'SibSp', 'Ticket'

**Before you code, answer in the cell below:**
- Why can we safely drop 'Name' now that we've extracted titles?
- Why drop 'Parch' and 'SibSp' after creating 'Family_size'?
- Why might 'Ticket' not be useful for prediction?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Drop the specified columns


**Solution:**

In [None]:
kaggle_df.drop(columns=['Name','Parch','SibSp','Ticket'],inplace=True)

In [None]:
kaggle_df.sample(10)

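| | Age | Embarked | Fare | PassengerId | Pclass | Sex | Survived | Title | Family_size |
|---|---|---|---|---|---|---|---|---|---|
| 1106 | 42.0 | S | 42.5 | 1107 | 1 | male | NaN | Mr | 1 |
| 18 | 31.0 | S | 18.0 | 19 | 3 | female | 0.0 | Mrs | 2 |
| 181 | 29.5 | C | 15.05 | 182 | 2 | male | 0.0 | Mr | 1 |
| 364 | 25.0 | Q | 15.5 | 365 | 3 | male | 0.0 | Mr | 2 |
| 72 | 21.0 | S | 73.5 | 73 | 2 | male | 0.0 | Mr | 1 |
| 14 | 14.0 | S | 7.8542 | 15 | 3 | female | 0.0 | Miss | 1 |
| 310 | 24.0 | C | 83.1583 | 311 | 1 | female | 1.0 | Miss | 1 |
| 490 | 25.0 | S | 19.9667 | 491 | 3 | male | 0.0 | Mr | 2 |
| 950 | 36.0 | C | 262.375 | 951 | 1 | female | NaN | Miss | 1 |
| 1124 | 25.0 | Q | 7.8792 | 1125 | 3 | male | NaN | Mr | 1 |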


---
### **Question 8: Creating a Function to Categorize Family Size**

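Instead of keeping Family_size as a number, we can categorize it into meaningful groups. Grouping lets a model treat passengers travelling alone, in small families, and in large families as distinct cases rather than assuming survival changes steadily with each extra family member.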

**Task:** Create a function called `family_size()` that takes a number and returns:
- "Alone" if the number is 1
- "Small" if the number is between 2 and 4 (inclusive)
- "Large" if the number is 5 or more

**Before you code, answer in the cell below:**
- Why might categorizing continuous variables be useful for machine learning?
- What are the potential advantages and disadvantages of binning numerical data?

**Your reasoning here:**

(Double-click to edit this cell and write your explanation)

In [None]:
# YOUR CODE HERE
# Define the family_size function


**Solution:**

In [None]:
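def family_size(number):
    """Map a numeric family size to 'Alone', 'Small', or 'Large'."""
    if number == 1:
        return "Alone"   # travelling without family
    elif 1 < number < 5:
        return "Small"   # 2 to 4 people
    else:
        return "Large"   # 5 or more people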

In [None]:
kaggle_df['Family_size'] = kaggle_df['Family_size'].apply(family_size)
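
For comparison, pandas can do the same binning without a custom function by using `pd.cut()`. The sketch below is an alternative under stated assumptions: the bin edges are chosen to reproduce the Alone/Small/Large rules from the task, and `sizes` is a made-up Series of family sizes.

```python
import pandas as pd

# Hypothetical family sizes
sizes = pd.Series([1, 2, 4, 5, 11])

# Bins are right-inclusive by default: (0, 1] -> Alone, (1, 4] -> Small, (4, inf] -> Large
size_group = pd.cut(
    sizes,
    bins=[0, 1, 4, float("inf")],
    labels=["Alone", "Small", "Large"],
)
print(size_group.tolist())
```

Unlike `apply()` with a Python function, `pd.cut()` returns an ordered categorical column, which keeps the Alone < Small < Large ordering available for later steps.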
