# Data Science: Titanic Assignment

This notebook answers:
- Count missing values for all attributes and discuss which could be safely imputed vs dropped.
- Try imputation by **Sex** groups instead of **Pclass**. Compare results.
- Extract and create a new feature **Title** from **Name**, and explain why it's useful.

Dataset (used directly from GitHub URL):
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv


In [2]:
# 1) Imports
import pandas as pd
import numpy as np
from IPython.display import display

# set pandas display options for nicer tables in Colab
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 160)


In [None]:
# 2) Load the dataset directly from the provided raw GitHub URL
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
print('Dataset loaded — rows:', df.shape[0], ' columns:', df.shape[1])
df.head()


## 3) Count missing values for every column
List the columns that have missing values and how many values each one is missing.

In [None]:
missing = df.isnull().sum().sort_values(ascending=False)
missing[missing>0]


### Explanation:
- **Age:** missing in a substantial minority of rows. It is useful for analysis, so it is worth imputing (estimating) rather than dropping.
- **Cabin:** missing in most rows. It is usually dropped or reduced to a binary `HasCabin` flag.
- **Embarked:** only a couple of values missing. Fill with the most common value (mode).
- **Fare:** usually complete. If any values are missing, fill with the median.
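
Raw counts are easier to judge as a share of all rows. The sketch below is a minimal illustration on a small toy frame with made-up values (the frame `toy` and the helper `missing_summary` are hypothetical names, not part of the assignment); on the real data you would pass `df` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan, 'Q'],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
})

def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (frame.isnull().mean() * 100).round(1)
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    # Most incomplete columns first
    return summary.sort_values('percent', ascending=False)

summary = missing_summary(toy)
print(summary)
```

A column that is mostly empty (like `Cabin` in the toy frame) is a candidate for dropping or flagging, while a column with a small share missing is a candidate for imputation.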


## 4) Quick decisions: drop or impute
Create a copy of the data and apply the choices described above.


In [None]:
df_clean = df.copy()

# Drop Cabin (too many missing) but keep a 'HasCabin' flag because it might contain information
df_clean['HasCabin'] = df_clean['Cabin'].notnull().astype(int)
df_clean.drop(columns=['Cabin'], inplace=True)

# Fill Embarked with the mode if missing
# (assign the result back; chained inplace fillna may not modify df_clean in newer pandas)
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])

# Fill Fare with the median if any values are missing
if df_clean['Fare'].isnull().any():
    df_clean['Fare'] = df_clean['Fare'].fillna(df_clean['Fare'].median())

print('After basic cleaning, missing per column:')
remaining = df_clean.isnull().sum()
remaining[remaining > 0]


## 5) Impute Age by **Sex** groups
Fill missing Age values using the median Age of each Sex group (male/female).

In [None]:
df_sex_imputed = df_clean.copy()
before_missing = df_sex_imputed['Age'].isnull().sum()
df_sex_imputed['Age'] = df_sex_imputed.groupby('Sex')['Age'].transform(lambda x: x.fillna(x.median()))
after_missing = df_sex_imputed['Age'].isnull().sum()
print(f'Age missing before: {before_missing}, after imputation by Sex: {after_missing}')
df_sex_imputed['Age'].describe()


## 6) Impute Age by **Pclass** (for comparison)
We do the same but using passenger class groups (1, 2, 3). Compare results.

In [None]:
df_pclass_imputed = df_clean.copy()
df_pclass_imputed['Age'] = df_pclass_imputed.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
print('Missing Age after Pclass imputation:', df_pclass_imputed['Age'].isnull().sum())

# Compare median ages used for each method (for transparency)
med_by_sex = df_clean.groupby('Sex')['Age'].median()
med_by_pclass = df_clean.groupby('Pclass')['Age'].median()
print('\nMedian Age by Sex:')
print(med_by_sex)
print('\nMedian Age by Pclass:')
print(med_by_pclass)


### Explanation of comparison:
- **Imputation by Sex**: fills missing ages with the median age of male or female passengers. This works well if the age distributions of men and women differ noticeably.
- **Imputation by Pclass**: fills missing ages with the median age of each class. This works well if age varies with passenger class (for example, older and wealthier adults in 1st class).

To decide between them, compare the group medians printed above: the grouping whose medians differ more carries more information about Age. You can also check which method keeps the imputed age distribution closer to the distribution of the observed ages.
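
One way to make this comparison measurable, offered here as a sketch rather than the method used in this notebook, is to hide some known ages, impute them with each grouping, and compare the error on the hidden rows. The helper `masked_imputation_error`, its `frac` and `seed` parameters, and the synthetic `toy` data are all assumptions for illustration. The toy ages are constructed to depend on Pclass, so the outcome says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def masked_imputation_error(frame, group_col, target='Age', frac=0.2, seed=0):
    """Hide a fraction of known target values, impute them with the
    group median, and return the mean absolute error on the hidden rows."""
    known = frame[frame[target].notnull()]
    hidden_idx = known.sample(frac=frac, random_state=seed).index
    masked = frame.copy()
    masked.loc[hidden_idx, target] = np.nan
    # Group medians are computed only from the rows that remain visible
    medians = masked.groupby(group_col)[target].transform('median')
    imputed = masked[target].fillna(medians)
    return (imputed.loc[hidden_idx] - frame.loc[hidden_idx, target]).abs().mean()

# Synthetic data (hypothetical): age depends on Pclass and not on Sex
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'Sex': rng.choice(['male', 'female'], size=n),
    'Pclass': rng.choice([1, 2, 3], size=n),
})
toy['Age'] = 50 - 10 * toy['Pclass'] + rng.normal(0, 3, size=n)

err_sex = masked_imputation_error(toy, 'Sex')
err_pclass = masked_imputation_error(toy, 'Pclass')
print(f'MAE by Sex: {err_sex:.2f}, MAE by Pclass: {err_pclass:.2f}')
```

On the real data you would call the helper with `df_clean` in place of `toy`; the grouping with the lower error recovers the hidden ages more closely.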


## 7) Extract **Title** from Name and make it a new feature
Title is useful in several ways: it reflects social status, can hint at age ('Master' was used for young boys), and helps with survival analysis.

In [None]:
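import re

def extract_title(name):
    # Extract the text between ', ' and the next '.', e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
    m = re.search(r',\s*([^.]+)\.', name)
    # Return an empty string when the name does not follow the 'Surname, Title. ...' pattern
    return m.group(1).strip() if m else ''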

df['Title'] = df['Name'].apply(extract_title)
display(df[['Name','Title']].head(10))


In [None]:
# Group rare titles into 'Other' to keep categories small
title_counts = df['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index.tolist()
df['Title_clean'] = df['Title'].replace(rare_titles, 'Other')
df['Title_clean'].value_counts()


### Why Title is useful:
- **Social status:** Titles like 'Lady' or 'Sir' often signal wealth or high social standing.
- **Gender:** 'Miss' and 'Mrs' indicate female passengers; 'Mr' and 'Master' indicate male passengers.
- **Age clue:** 'Master' usually means a young boy; 'Dr' or 'Rev' are likely adults.
- **Survival patterns:** Women and children often had better survival rates, and Title captures both sex and approximate age.
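
The age clue can be put to work directly. As a hedged sketch (toy rows with made-up ages; `Age_imputed` is a hypothetical column name), missing ages could be filled with the median age of passengers who share the same title:

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) with a Title column already extracted
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Master', 'Master', 'Miss', 'Miss', 'Mrs'],
    'Age': [30.0, 40.0, np.nan, 4.0, np.nan, 20.0, np.nan, 35.0],
})

# Median age per title, broadcast back to every row with that title
title_median = toy.groupby('Title')['Age'].transform('median')

# Fill by title first; fall back to the overall median for titles with no known age
toy['Age_imputed'] = toy['Age'].fillna(title_median).fillna(toy['Age'].median())
print(toy)
```

Compared with grouping by Sex alone, this keeps a missing 'Master' age in the child range instead of pulling it toward the adult male median.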


In [None]:
# Show survival rates by title to illustrate usefulness
survival_by_title = df.groupby('Title_clean')['Survived'].mean().sort_values(ascending=False)
survival_by_title


## 8) Conclusion
- We counted missing values and decided to drop Cabin but keep a `HasCabin` flag.
- We imputed `Age` by **Sex** and by **Pclass** to compare methods. Both are valid; prefer the grouping whose medians differ more and whose imputed ages stay closer to the observed age distribution.
- We created `Title` from `Name`. This feature can help estimate missing ages and helps explain differences in survival.

