# Unit 5: Building a Full Preprocessing Pipeline for the Titanic Dataset

# Lesson Introduction

Welcome! Today, we'll learn how to build a full **preprocessing pipeline** for the Titanic dataset. Real-world datasets are often large, with many rows and features, so it pays to have a clear, repeatable sequence of preparation steps.

We aim to learn how to prepare real data for machine learning models by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.

Imagine you have a messy jigsaw puzzle. You need to organize the pieces, find the edges first, and then start assembling. Data preprocessing is like organizing the pieces before starting the puzzle.

-----

## Load and Prepare the Data

Let's start by loading the **Titanic dataset** with **Seaborn**. It contains information about passengers, such as age, fare, and whether they survived. We'll drop some columns we won't use.

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop columns that won't be used
df = df.drop(columns=['deck', 'embarked', 'alive'])

print(df.head())
```

Expected output:

```
   survived  pclass     sex   age  sibsp  parch     fare  class    who  \
0         0       3    male  22.0      1      0   7.2500  Third    man
1         1       1  female  38.0      1      0  71.2833  First  woman
2         1       3  female  26.0      0      0   7.9250  Third  woman
3         1       1  female  35.0      1      0  53.1000  First  woman
4         0       3    male  35.0      0      0   8.0500  Third    man

   adult_male  embark_town  alone
0        True  Southampton  False
1       False    Cherbourg  False
2       False  Southampton   True
3       False  Southampton  False
4        True  Southampton   True
```

We loaded the dataset and dropped the `deck`, `embarked`, and `alive` columns. `deck` is missing for most passengers, `embarked` is just a one-letter code for the port already spelled out in `embark_town`, and `alive` is a yes/no copy of the `survived` target, so keeping it would leak the answer to the model.
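Before dropping columns, it can help to measure how much of each one is missing. The sketch below uses a tiny hand-made DataFrame (the values are illustrative, not taken from the Titanic data), and the 50% threshold is an arbitrary assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# Columns where more than half of the values are missing (threshold is an assumption)
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

Running the same `isna().mean()` check on the real DataFrame is a quick way to see which columns are candidates for dropping and which are worth imputing.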

-----

## Handle Missing Values

Next, let's handle missing values using **SimpleImputer** from **scikit-learn**.

```python
from sklearn.impute import SimpleImputer

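# Handle missing values
imputer_num = SimpleImputer(strategy='mean')           # numeric columns: fill with the column mean
imputer_cat = SimpleImputer(strategy='most_frequent')  # categorical columns: fill with the most common value

# fit_transform() returns a 2-D array, so ravel() flattens it back to a single column
df['age'] = imputer_num.fit_transform(df[['age']]).ravel()
df['embark_town'] = imputer_cat.fit_transform(df[['embark_town']]).ravel()
df['fare'] = imputer_num.fit_transform(df[['fare']]).ravel()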

print(df.head())
```

As a reminder, `ravel()` is a method in NumPy that returns a contiguous flattened array. In this context, it's used to flatten the column vector returned by `fit_transform()` into a 1-dimensional array. This ensures that the `embark_town` column is reshaped back into a 1-D array that fits into the DataFrame correctly.

Expected output:

```
   survived  pclass     sex   age  sibsp  parch     fare  class    who  \
0         0       3    male  22.0      1      0   7.2500  Third    man
1         1       1  female  38.0      1      0  71.2833  First  woman
2         1       3  female  26.0      0      0   7.9250  Third  woman
3         1       1  female  35.0      1      0  53.1000  First  woman
4         0       3    male  35.0      0      0   8.0500  Third    man

   adult_male  embark_town  alone
0        True  Southampton  False
1       False    Cherbourg  False
2       False  Southampton   True
3       False  Southampton  False
4        True  Southampton   True
```

We filled missing numerical data (`age`, `fare`) using the mean and categorical data (`embark_town`) using the most frequent value. This is like guessing a missing puzzle piece based on surrounding ones.
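To see what the imputers actually do, here is a minimal sketch on a three-row toy DataFrame. The values and the `town` column name are made up for illustration. It also prints the array shapes before and after `ravel()`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made-up values) with one gap in each column
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'town': ['Southampton', 'Southampton', np.nan],
})

mean_imputer = SimpleImputer(strategy='mean')
mode_imputer = SimpleImputer(strategy='most_frequent')

filled_age = mean_imputer.fit_transform(toy[['age']])
print(filled_age.shape)          # 2-D column vector: (3, 1)
print(filled_age.ravel().shape)  # flattened to 1-D: (3,)

toy['age'] = filled_age.ravel()
toy['town'] = mode_imputer.fit_transform(toy[['town']]).ravel()

print(mean_imputer.statistics_)  # fill value learned for 'age'
print(toy)
```

The missing age becomes the mean of the two known ages, and the missing town becomes the only town that appears in the column.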

-----

## Encode Categorical Features: Part 1

Most machine learning models require numerical input, so we use **OneHotEncoder** to convert categorical features into numbers.

```python
from sklearn.preprocessing import OneHotEncoder

# Encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_columns = encoder.fit_transform(df[['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone']))
```

-----

## Encode Categorical Features: Part 2

Next, we drop the original categorical columns and concatenate the new encoded columns with the DataFrame.

```python
# Drop and concatenate
df = df.drop(columns=['sex', 'class', 'embark_town', 'who', 'adult_male', 'alone'])
df = pd.concat([df.reset_index(drop=True), encoded_df], axis=1)

print(df.head())
```

Expected output:

```
   survived  pclass   age  sibsp  parch     fare  sex_male  class_Second  \
0         0       3  22.0      1      0   7.2500       1.0           0.0
1         1       1  38.0      1      0  71.2833       0.0           0.0
2         1       3  26.0      0      0   7.9250       0.0           0.0
3         1       1  35.0      1      0  53.1000       0.0           0.0
4         0       3  35.0      0      0   8.0500       1.0           0.0

   class_Third  embark_town_Queenstown  embark_town_Southampton  who_man  \
0          1.0                     0.0                      1.0      1.0
1          0.0                     0.0                      0.0      0.0
2          1.0                     0.0                      1.0      0.0
3          0.0                     0.0                      1.0      0.0
4          1.0                     0.0                      1.0      1.0

   who_woman  adult_male_True  alone_True
0        0.0              1.0         0.0
1        1.0              0.0         0.0
2        1.0              0.0         1.0
3        1.0              0.0         0.0
4        0.0              1.0         1.0

We converted the categorical columns into numerical ones, dropped the originals, and added the new encoded columns. It's like translating words into a secret code for a robot.

-----

## Feature Scaling

Feature scaling ensures all numerical values are on a similar scale. We use **StandardScaler** for this.

```python
from sklearn.preprocessing import StandardScaler

# Feature scaling
scaler = StandardScaler()
scaled_columns = scaler.fit_transform(df[['age', 'fare']])
scaled_df = pd.DataFrame(scaled_columns, columns=['age', 'fare'])

# Drop and concatenate
df = df.drop(columns=['age', 'fare'])
df = pd.concat([df.reset_index(drop=True), scaled_df], axis=1)

print(df.head())
```

Expected output:

```
   survived  pclass  sibsp  parch  sex_male  class_Second  class_Third  \
0         0       3      1      0       1.0           0.0          1.0
1         1       1      1      0       0.0           0.0          0.0
2         1       3      0      0       0.0           0.0          1.0
3         1       1      1      0       0.0           0.0          0.0
4         0       3      0      0       1.0           0.0          1.0

   embark_town_Queenstown  embark_town_Southampton  who_man  who_woman  \
0                     0.0                      1.0      1.0        0.0
1                     0.0                      0.0      0.0        1.0
2                     0.0                      1.0      0.0        1.0
3                     0.0                      1.0      0.0        1.0
4                     0.0                      1.0      1.0        0.0

   adult_male_True  alone_True       age      fare
0              1.0         0.0 -0.592481 -0.502445
1              0.0         0.0  0.638789  0.786845
2              0.0         1.0 -0.284663 -0.488854
3              0.0         0.0  0.407926  0.420730
4              1.0         1.0  0.407926 -0.486337

We scaled our numerical data (`age`, `fare`) to have a mean of 0 and a standard deviation of 1. This is like resizing puzzle pieces to fit perfectly.
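Standard scaling is just "subtract the mean, divide by the standard deviation". The short sketch below, using three made-up numbers, checks that `StandardScaler` matches the manual calculation. It assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[10.0], [20.0], [30.0]])  # made-up numbers, one feature

scaled = StandardScaler().fit_transform(values)

# Same computation by hand: subtract the mean, divide by the population std (ddof=0)
manual = (values - values.mean()) / values.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the middle value sits at 0 and the other two are symmetric around it.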

-----

## Separate Features and Target Variable

Next, we separate our features (used for predictions) and the target variable (the outcome we predict).

```python
# Separate features and target variable
X = df.drop(columns=['survived'])
y = df['survived']

print("X:\n", X.head())
print("\ny:\n", y.head())
```

Expected output:

```
X:
    pclass  sibsp  parch  sex_male  class_Second  class_Third  \
0       3      1      0       1.0           0.0          1.0
1       1      1      0       0.0           0.0          0.0
2       3      0      0       0.0           0.0          1.0
3       1      1      0       0.0           0.0          0.0
4       3      0      0       1.0           0.0          1.0

   embark_town_Queenstown  embark_town_Southampton  who_man  who_woman  \
0                     0.0                      1.0      1.0        0.0
1                     0.0                      0.0      0.0        1.0
2                     0.0                      1.0      0.0        1.0
3                     0.0                      1.0      0.0        1.0
4                     0.0                      1.0      1.0        0.0

   adult_male_True  alone_True       age      fare
0              1.0         0.0 -0.592481 -0.502445
1              0.0         0.0  0.638789  0.786845
2              0.0         1.0 -0.284663 -0.488854
3              0.0         0.0  0.407926  0.420730
4              1.0         1.0  0.407926 -0.486337

y:
 0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64
```

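Here, `X` contains all features except `survived`, and `y` contains the `survived` column. Keeping them separate matters because scikit-learn estimators take the features and the target as two distinct arguments, and the target must never appear among the inputs.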

-----

## Train-Test Split

Finally, we split the dataset into training and test sets using **train_test_split**. This lets us train the model on one part of the data and test it on another part that it has never seen.

```python
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}, Test set size: {len(X_test)}")
```

Expected output:

```
Training set size: 712, Test set size: 179
```

We split the data so 80% is used for training and 20% for testing. This step is like practicing with some pieces before trying the whole puzzle.

-----

## Lesson Summary

Today, we:

  * **Loaded and prepared** the Titanic dataset.
  * **Handled missing values**.
  * **Encoded categorical features**.
  * **Scaled numerical features**.
  * **Separated features and the target variable**.
  * **Split the dataset** into training and test sets.

Now, you'll get to practice these steps hands-on. Happy learning!

## Drop Unwanted Titanic Columns

Hey Space Navigator, let's continue our journey! We need to load the Titanic dataset and drop some columns we don't need. Fill in the missing lines to load the Titanic dataset using Seaborn and drop the specified columns. Let's ace this mission!


```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# TODO: Drop columns that won't be used: deck, embarked and alive

# Display the modified dataset
print(df.head())

```

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop columns that won't be used: deck, embarked and alive
df = df.drop(columns=['deck', 'embarked', 'alive'])

# Display the modified dataset
print(df.head())
```

## Handle Missing Values in Titanic Dataset

Great job so far, Galactic Pioneer! Let's handle some missing data before we move ahead. Fill in the TODOs to complete the code.

Cleaning your dataset by filling in missing values helps ensure that the analysis is accurate and meaningful.

```python
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop some columns for simplicity
df = df.drop(columns=['deck', 'embarked', 'alive'])

# TODO: Handle missing values in 'age' and 'fare' using mean

print(df[['age', 'fare']].isna().sum())  # should be 0 if NaNs are handled!
```

```python
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop some columns for simplicity
df = df.drop(columns=['deck', 'embarked', 'alive'])

# Handle missing values in 'age' and 'fare' using mean
imputer_num = SimpleImputer(strategy='mean')

# Apply to 'age' column
df['age'] = imputer_num.fit_transform(df[['age']])

# Apply to 'fare' column
# Note: 'fare' has no missing values in Seaborn's copy of the dataset, but applying
# the imputer anyway keeps the code robust if missing values show up in other data.
df['fare'] = imputer_num.fit_transform(df[['fare']])


print(df[['age', 'fare']].isna().sum())
```

## Encode Categorical Features and Concatenate

Let's dive deeper, Stellar Navigator! Fill in the blanks to encode the categorical features and concatenate them with the original DataFrame. Use OneHotEncoder for the sex column and LabelEncoder for the class column.

It's time to show your mastery!


```python
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the Titanic dataset
df = sns.load_dataset('titanic')
   
# Encode 'sex' column with OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False)
# TODO: Encode the 'sex' column from the dataset with OneHotEncoder
# TODO: Store the result in a DataFrame named encoded_sex_df, with
#       columns=one_hot_encoder.get_feature_names_out(['sex'])

# Encode 'class' column with LabelEncoder
label_encoder = LabelEncoder()
# TODO: Encode the 'class' column from the dataset with LabelEncoder
# TODO: Store the result in a DataFrame named encoded_class_df

# Concatenate the encoded columns with the original dataframe
df = pd.concat([df.reset_index(drop=True), encoded_sex_df, encoded_class_df], axis=1)
print(df.head())

```

```python
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Encode 'sex' column with OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_sex = one_hot_encoder.fit_transform(df[['sex']])
encoded_sex_df = pd.DataFrame(encoded_sex, columns=one_hot_encoder.get_feature_names_out(['sex']))

# Encode 'class' column with LabelEncoder
label_encoder = LabelEncoder()
encoded_class = label_encoder.fit_transform(df['class'])
encoded_class_df = pd.DataFrame(encoded_class, columns=['class_encoded'])

# Concatenate the encoded columns with the original dataframe
df = pd.concat([df.reset_index(drop=True), encoded_sex_df, encoded_class_df], axis=1)
print(df.head())
```
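A side note on the encoder choice: scikit-learn documents `LabelEncoder` as a tool for target labels, and it always assigns codes in sorted order. For an ordered feature such as passenger class, `OrdinalEncoder` is a possible alternative that lets you state the order explicitly. The sketch below uses made-up rows to compare the two.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data (made-up rows)
toy = pd.DataFrame({'class': ['Third', 'First', 'Second', 'First']})

# LabelEncoder sorts the labels alphabetically: First=0, Second=1, Third=2
label_codes = LabelEncoder().fit_transform(toy['class'])

# OrdinalEncoder takes 2-D feature input and an explicit category order
ordinal = OrdinalEncoder(categories=[['First', 'Second', 'Third']])
ordinal_codes = ordinal.fit_transform(toy[['class']]).ravel()

print(label_codes)
print(ordinal_codes)
```

Here both encoders agree because the alphabetical order happens to match the natural order of the classes. That is a coincidence of these labels, not something to rely on in general.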

## Handle Missing Values and Feature Scaling

Hey Space Voyager! You're making great progress so far. Now, let’s kick it up a notch.

You'll need to complete the missing pieces of code to scale numeric features. Take a look at the TODO comments and fill in the blanks.

May your aim be true!

```python
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop columns that won't be used
df = df.drop(columns=['deck', 'embarked', 'alive'])

# Feature scaling
# TODO: fit and transform scaler on the 'age' and 'fare' columns

print(df[['age', 'fare']].head())

```

```python
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Drop columns that won't be used
df = df.drop(columns=['deck', 'embarked', 'alive'])

# Handle missing values in 'age' and 'fare' before scaling
# For simplicity, fill with the mean; real-world data may call for a more robust approach.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
df['age'] = df['age'].fillna(df['age'].mean())
df['fare'] = df['fare'].fillna(df['fare'].mean())

# Feature scaling
scaler = StandardScaler()
df[['age', 'fare']] = scaler.fit_transform(df[['age', 'fare']])

print(df[['age', 'fare']].head())
```
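The solution above fills missing values before scaling. A quick sketch of why that order matters: `StandardScaler` skips NaNs when it computes the mean and standard deviation, but it leaves them as NaN in the output, so scaling alone does not remove the gaps. The numbers below are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[10.0], [np.nan], [30.0]])  # made-up numbers with one gap

scaled = StandardScaler().fit_transform(values)

print(scaled.ravel())           # the gap is still NaN after scaling
print(np.isnan(scaled).sum())   # number of missing values left
```

Any model that cannot handle NaN would still fail on this output, which is why imputation comes first in the pipeline.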

## Complete Titanic Preprocessing Pipeline

Celestial Traveler, let's finalize our Titanic preprocessing adventure! Follow the `# TODO` steps to load, clean, encode, scale, and split the Titanic dataset as we've learned.

```python
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# TODO: Load Titanic dataset and drop unnecessary columns 'deck', 'embarked', and 'alive'

# TODO: Handle missing values for both numerical and categorical data
#      Use: SimpleImputer with strategy='mean' for numerical columns 'age' and 'fare'
#      Use: SimpleImputer with strategy='most_frequent' for categorical column 'embark_town'

# TODO: Encode categorical features using one-hot encoding

# TODO: Perform feature scaling on numerical features using standard scaling

# TODO: Separate features from the target variable

# TODO: Split the dataset into training and testing sets

# TODO: Print sizes of training and testing sets

```

```python
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load Titanic dataset and drop unnecessary columns 'deck', 'embarked', and 'alive'
df = sns.load_dataset('titanic')
df = df.drop(columns=['deck', 'embarked', 'alive'])

# Handle missing values for both numerical and categorical data
# Numerical Imputer
numerical_imputer = SimpleImputer(strategy='mean')
df[['age', 'fare']] = numerical_imputer.fit_transform(df[['age', 'fare']])

# Categorical Imputer (for 'embark_town' which has missing values)
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[['embark_town']] = categorical_imputer.fit_transform(df[['embark_town']])


# Identify categorical and numerical features after imputation
categorical_features = ['sex', 'pclass', 'who', 'alone', 'class', 'adult_male', 'embark_town']
numerical_features = ['age', 'fare', 'sibsp', 'parch']

# Encode categorical features using one-hot encoding
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_features = encoder.fit_transform(df[categorical_features])
encoded_feature_names = encoder.get_feature_names_out(categorical_features)
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names, index=df.index)

# Perform feature scaling on numerical features using standard scaling
scaler = StandardScaler()
scaled_numerical_features = scaler.fit_transform(df[numerical_features])
scaled_numerical_df = pd.DataFrame(scaled_numerical_features, columns=numerical_features, index=df.index)

# Concatenate all processed features
# First, drop original categorical and numerical columns that have been transformed
df_processed = df.drop(columns=categorical_features + numerical_features)
df_processed = pd.concat([df_processed, scaled_numerical_df, encoded_df], axis=1)

# Separate features from the target variable
X = df_processed.drop('survived', axis=1)
y = df_processed['survived']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print sizes of training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
```