# Day 2 — Student Notebook

*Auto-generated notebook based on provided lecture slides.*

## Learning goals
- Understand, in brief, what data science is
- Learn basic exploratory data analysis (EDA) & visualizations
- Practice creating simple, clear plots and recognise misleading visuals

Estimated time: ~90 minutes

## 1) Quick setup and load dataset
The dataset we'll use throughout the course is the **Titanic** dataset (passenger information plus whether each passenger survived). It is small, mixes numeric and categorical columns, and is ideal for beginners.

In [3]:
# Setup: installs and imports
# If packages are missing (e.g. in a fresh Google Colab session), uncomment the pip install below.
# nbformat is included because plotly needs nbformat>=4.2.0 to render figures inside a notebook.
# !pip install pandas numpy seaborn plotly scikit-learn matplotlib nbformat

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
sns.set_theme(style='whitegrid')

# Load dataset (seaborn's titanic dataset) - we'll use this across all notebooks
df = sns.load_dataset('titanic')
df_original = df.copy()  # keep a pristine copy
print('Loaded titanic dataset with shape:', df.shape)
df.head()


Loaded titanic dataset with shape: (891, 15)


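| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |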


## 2) First look: basic exploration (student exercise)
- Show `.info()` and `.describe()`
- Count missing values per column

**Tasks (student):**
1. Run `df.info()` and `df.describe()`.
2. Create a table of missing value counts for each column.

In [4]:
## Student: run basic exploration
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print('\nSummary statistics (all columns):')
print(df.describe(include='all'))
print('\nMissing values per column:')
print(df.isna().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

Summary statistics (all columns):
          survived      pclass   sex         age       s
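The missing-value count from task 2 can be turned into a small table that also shows the share of rows affected. The following is a minimal sketch, not part of the original notebook: the helper name `missing_value_table` and the toy frame are hypothetical and stand in for the Titanic `df` so the example runs without downloading anything.

```python
import pandas as pd


def missing_value_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with the count and percentage of missing values."""
    counts = frame.isna().sum()
    table = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps, worst first.
    return table[table["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 7.93, 53.10],
})
missing = missing_value_table(toy)
print(missing)
```

In the notebook itself, `missing_value_table(df)` should list `deck`, `age`, `embarked` and `embark_town`, since the `df.info()` output shows fewer than 891 non-null values for exactly those columns.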

## 3) Visualisation basics (student exercises)
We'll make a few simple, interpretable charts:
- Bar chart: passenger class counts
- Histogram: distribution of ages
- Boxplot: fare by passenger class
- Scatter: age vs fare

**Use plotly.express for interactivity (optional)**

In [5]:
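# Bar chart: passenger counts per class (px.histogram counts the rows in each category)
fig = px.histogram(df, x='class', title='Passenger counts per class')
fig.show()

# Histogram: age distribution (rows with a missing age are dropped before plotting)
fig = px.histogram(df.dropna(subset=['age']), x='age', nbins=30, title='Age distribution (missing ages excluded)')
fig.show()

# Boxplot: fare by class
fig = px.box(df, x='class', y='fare', title='Fare by passenger class')
fig.show()

# Scatter: age vs fare (points may overlap, so they are drawn semi-transparent)
# 'alive' carries the same information as 'survived' but as yes/no text, which gives two discrete colors
# instead of the continuous color scale plotly would use for the 0/1 integers in 'survived'.
fig = px.scatter(df, x='age', y='fare', color='alive', opacity=0.6, title='Age vs Fare (colored by survival)')
fig.show()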


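To see which numbers a box plot such as "fare by passenger class" summarises, the quartiles and the usual 1.5 × IQR fences can be computed directly with pandas. This is a sketch under stated assumptions: `box_stats` is a hypothetical helper, the toy fares are made up for illustration, and plotting libraries may use a slightly different quartile convention, so the drawn box can differ a little from these values.

```python
import pandas as pd


def box_stats(values: pd.Series) -> pd.Series:
    """Quartiles and 1.5*IQR fences, the numbers a box plot typically draws."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return pd.Series({
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    })


# Hypothetical toy data: five fares per class, one extreme value in First
toy = pd.DataFrame({
    "class": ["First"] * 5 + ["Third"] * 5,
    "fare": [10.0, 20.0, 30.0, 40.0, 500.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# One row of statistics per class
per_class = toy.groupby("class")["fare"].apply(box_stats).unstack()
print(per_class)

# Points beyond the fences are the ones a box plot usually marks as outliers
first = toy.loc[toy["class"] == "First", "fare"]
stats = box_stats(first)
outliers = first[(first < stats["lower_fence"]) | (first > stats["upper_fence"])]
print(outliers.tolist())
```

The same call on the notebook's data would be `df.groupby('class', observed=True)['fare'].apply(box_stats).unstack()`.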

## 4) Short exercises (student)
1. Which chart would you pick to compare survival rates across classes? Create it.
2. Find an example of a potentially misleading visualization (e.g. truncated y-axis, pie with many slices) and explain why it's misleading.

In [6]:
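# Student: survival rate per class
# 'survived' is 0/1, so its mean is the share of survivors.
# observed=True keeps only the categories present in the data and avoids the pandas
# FutureWarning raised when grouping by a categorical column.
surv_by_class = df.groupby('class', observed=True)['survived'].mean().reset_index()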
print(surv_by_class)
fig = px.bar(surv_by_class, x='class', y='survived', title='Survival rate per class')
fig.update_yaxes(tickformat='.0%')
fig.show()

# Note: students should answer the misleading visualization question in a markdown cell.


    class  survived
0   First  0.629630
1  Second  0.472826
2   Third  0.242363







## Hints & small answers
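- Use `groupby` + `mean()` to get the survival rate by class: `survived` is coded 0/1, so its mean is the share of passengers who survived.
- Misleading charts often truncate or rescale an axis to exaggerate differences, use an ill-suited chart type (such as a 3D pie), or omit labels.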
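
The effect of a truncated y-axis can be put into numbers. The sketch below is an illustration rather than part of the original exercise: `apparent_ratio` is a hypothetical helper, the rates are rounded from the per-class survival output in section 4, and the 0.20 axis start is an arbitrary choice made for the example.

```python
def apparent_ratio(tall: float, short: float, axis_start: float = 0.0) -> float:
    """How many times taller the 'tall' bar is drawn than the 'short' bar
    when the y-axis begins at axis_start instead of 0."""
    if axis_start >= short:
        raise ValueError("axis_start must be below the shorter bar's value")
    return (tall - axis_start) / (short - axis_start)


# Survival rates for First and Third class, rounded from the output in section 4
first, third = 0.63, 0.24

honest = apparent_ratio(first, third)               # y-axis starts at 0
truncated = apparent_ratio(first, third, 0.20)      # hypothetical axis starting at 0.20

print(f"axis from 0:    First bar drawn {honest:.1f}x as tall as Third")
print(f"axis from 0.20: First bar drawn {truncated:.1f}x as tall as Third")
```

The underlying rates are identical in both cases; only the drawn bar heights change, which is why a bar chart's value axis should normally start at zero.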