# OpenIntro Statistics - Chapter 1: Introduction to Data
## Titanic Survival Analysis (Titanic dataset)
>**Source**: *OpenIntro Statistics* (4th ed.)<br>
>**Core Principle**: *'Variable type determines analysis method'*<br>
[OpenIntro Ch 1 Theory summary](../references/openintro_ch1_summary_md.md)


In [None]:
# Data loading and initial inspection.
# • Observation = row (e.g., one passenger is one case)
# • Variable = column (e.g., age, sex)
# • Categorical: nominal (sex) / ordinal (pclass)
# • Numerical: discrete (sibsp: integer counts such as 0, 2, 4) / continuous (age: real-valued, e.g., 1.5)
# • Response Y = survived | Explanatory X = pclass, sex, age
# • Observational study → association ≠ causation (the data cannot prove that "buying a first-class ticket resulted in survival")
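
The nominal/ordinal distinction listed above can be made explicit in pandas. The following is a minimal sketch on a hypothetical four-row toy frame (not the real Titanic table); the column name `class_label` and the label mapping are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic table.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 3, 2],
})

# Nominal: the categories carry no order.
toy["sex"] = pd.Categorical(toy["sex"])

# Ordinal: map the integer codes to labels and declare the order explicitly.
labels = {1: "First", 2: "Second", 3: "Third"}
toy["class_label"] = pd.Categorical(
    toy["pclass"].map(labels),
    categories=["First", "Second", "Third"],
    ordered=True,
)

print(toy["sex"].cat.ordered)          # unordered (nominal)
print(toy["class_label"].cat.ordered)  # ordered (ordinal)

# Order-based operations are only meaningful for the ordinal column.
print(toy["class_label"].min())
print((toy["class_label"] < "Third").tolist())
```

Declaring the order is what allows comparisons and `min()`/`max()` on the ordinal column; the same operations on the unordered `sex` column would raise an error, which matches the idea that the variable type determines which analyses make sense.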

In [69]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Load built-in Titanic dataset
df = sns.load_dataset('titanic')  # seaborn is imported as sns above

print("*"*60)
print("step 1.Check dataset") 
print("*"*60)
print(f"Dimension of titanic dataset: {df.ndim}")  # number of axes; 2 for a DataFrame
print(f"Observations of titanic dataset: {df.shape[0]:,} passengers")
print(f"Variables of titanic dataset: {df.shape[1]} features")
print("List variables group by data type")
# Cast dtypes to str so that value_counts() can group every dtype, including 'category'.
dtype_counts = df.dtypes.astype(str).value_counts()
for dtype, count in dtype_counts.items():
    # List the columns that share this dtype.
    cols = df.select_dtypes(include=[dtype]).columns.tolist()
    print(f"  • {dtype:15s}: {count:2d} columns ({', '.join(cols)})")
print("Show the first three lines")
display(df.head(3))

************************************************************
step 1.Check dataset
************************************************************
Dimension of titanic dataset: 2
Observations of titanic dataset: 891 passengers
Variables of titanic dataset: 15 features
List variables group by data type
  • object         :  5 columns (sex, embarked, who, embark_town, alive)
  • int64          :  4 columns (survived, pclass, sibsp, parch)
  • float64        :  2 columns (age, fare)
  • category       :  2 columns (class, deck)
  • bool           :  2 columns (adult_male, alone)
Show the first three lines


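| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |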
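
Two of the three previewed rows have no `deck` value, which hints at missing data. A minimal sketch for quantifying missingness per column is shown here on a hypothetical toy frame (the values are illustrative and not taken from the full dataset); the same expression should apply unchanged to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing entries.
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

# isna() marks missing cells; the column mean of that boolean mask is the
# fraction missing, which is then converted to a percentage.
missing_pct = toy.isna().mean().mul(100).round(1)

# Sort so that the columns needing the most attention come first.
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```

Taking the mean of the boolean mask avoids dividing by `len(df)` by hand and yields one percentage per column in a single pass.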


In [75]:
# OpenIntro Ch 1.2: Variable Type Classification
print("*"*70)
print("STEP 2: Variable Types")
print("*"*70)
print("OpenIntro Principle: 'Variable type determines analysis method'\n")

#create a new list for the variable type report.
report = []

for col in df.columns:
    # Data type of the column, as a string.
    dtype = str(df[col].dtype)
    # Number of distinct non-missing values. A small count suggests a
    # categorical variable; a large count suggests a numerical one.
    n_unique = df[col].nunique(dropna=True)
    # df has shape (891, 15), so len(df) is the number of rows (passengers).
    # Percentage of missing (null) values in this column.
    missing_pct = df[col].isnull().sum() / len(df) * 100
    
    # check variable type.
    if dtype == 'category':
        var_type = 'Categorical'
        subtype = 'Ordinal' if col in ['class', 'pclass'] else 'Nominal'
    elif dtype == 'object':  # object columns hold strings here, so treat them as categorical nominal
        var_type = 'Categorical'
        subtype = 'Nominal'
    elif dtype == 'bool':
        var_type = 'Categorical'
        subtype = 'Binary'
    elif dtype == 'int64':
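        # Heuristic: an integer column with few unique values is treated as categorical;
        # otherwise it is numerical discrete. Note that this labels the counts sibsp and
        # parch (7 unique values each) as categorical, although they are numerical discrete.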
        if n_unique <= 10:
            var_type = 'Categorical'
            subtype = 'Ordinal' if col == 'pclass' else 'Discrete'
        else:
            var_type = 'Numerical'
            subtype = 'Discrete'
    elif dtype == 'float64':
        var_type = 'Numerical'
        subtype = 'Continuous'
    else:
        var_type = 'Other'
        subtype = '-'
    
    # Determine the variable's role.
    role = 'Response (Y)' if col == 'survived' else 'Explanatory (X)'
    
    # Each report entry is a dictionary describing one column.
    report.append({
        'Variable': col,
        'Dtype': dtype,
        'Type': var_type,
        'Subtype': subtype,
        'Role': role,
        'Unique Values': n_unique,
        'Missing %': f"{missing_pct:.1f}%"
    })

report_df = pd.DataFrame(report)

#show report_df
print("Variable Type Table (OpenIntro Ch 1.2):")
print(report_df[['Variable', 'Type', 'Subtype', 'Role', 'Unique Values', 'Missing %']].to_string(index=False))

# Key findings.
print("\nCritical Insights (OpenIntro Ch 1):")
print("  1. Response Variable (Y): 'survived' - The target we want to predict")
print("  2. Explanatory Variables (X): pclass, sex, age, ... - Features used for prediction")
print("  3. Binary Variables: survived, adult_male, alone - A special case of nominal (only 2 values)")
print("  4. Redundant Pairs: pclass ↔ class, embarked ↔ embark_town - Avoid using both members of a pair")
print("  5. High Missingness: deck (77.2%), age (19.9%) - Requires special handling")

**********************************************************************
STEP 2: Variable Types
**********************************************************************
OpenIntro Principle: 'Variable type determines analysis method'

Variable Type Table (OpenIntro Ch 1.2):
   Variable        Type    Subtype            Role  Unique Values Missing %
   survived Categorical   Discrete    Response (Y)              2      0.0%
     pclass Categorical    Ordinal Explanatory (X)              3      0.0%
        sex Categorical    Nominal Explanatory (X)              2      0.0%
        age   Numerical Continuous Explanatory (X)             88     19.9%
      sibsp Categorical   Discrete Explanatory (X)              7      0.0%
      parch Categorical   Discrete Explanatory (X)              7      0.0%
       fare   Numerical Continuous Explanatory (X)            248      0.0%
   embarked Categorical    Nominal Explanatory (X)              3      0.2%
      class Categorical    Ordinal Explanatory