## Lesson Launch: Feature Engineering for Survival Rate Predictors
Welcome back to our course, **Data Cleaning and Preprocessing in Machine Learning**. Today's mission is feature engineering on the Titanic dataset. By the end of the lesson, you will be able to create new features, modify existing ones, and encode categorical values. You will also put your Python and Pandas skills into practice, reinforcing your knowledge in the process.

## Introduction to Feature Engineering

Feature engineering is the process of creating optimized features that improve the effectiveness of machine learning algorithms. It might involve creating new features, transforming existing ones, or identifying and removing irrelevant ones. For instance, our Titanic dataset has features such as `age`, `sex`, and `pclass`, some of which might benefit from this kind of optimization.

Let's take `sibsp` and `parch` as an example: `sibsp` is the number of siblings/spouses aboard, while `parch` is the number of parents/children aboard. Because both features count family members travelling with each passenger, they overlap in meaning. Hence, we can combine them to create a new feature: `family_size`.

## Creating New Features

To start creating new features, we first need to load the Titanic dataset. This can be done quickly in Python using Seaborn, as shown below:

```python
import seaborn as sns

titanic_df = sns.load_dataset("titanic")
```

With the dataset ready, let's compute the `family_size` feature using the number of siblings/spouses (`sibsp`) and parents/children (`parch`).

```python
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1 # +1 for self
```

This snippet of code will create a new column `'family_size'` and add it to the dataset `titanic_df`.
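
As a quick sanity check, the same formula can be run on a tiny hand-made frame where the arithmetic is easy to verify by eye. This is only a sketch: `toy_df` and its values are hypothetical stand-ins, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic data (not real passengers)
toy_df = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 0, 2],
})

# Same formula as above: siblings/spouses + parents/children + the passenger
toy_df["family_size"] = toy_df["sibsp"] + toy_df["parch"] + 1

print(toy_df["family_size"].tolist())  # [2, 1, 6]
```

A passenger travelling alone therefore gets a `family_size` of 1, not 0.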

## Modification of Existing Features

Another valuable task in feature engineering is to modify existing features. For instance, the `fare` feature in our Titanic dataset may contain values that are significantly larger than the average, making the distribution of `fare` values skewed. This skewness can be unhelpful for some machine learning algorithms, because a handful of extreme values may dominate the feature's scale.

In such a case, we can consider applying a log transformation to the `fare` column, with the aim of reducing skewness. The code snippet for this is shown below:

```python
import numpy as np

# Adding a small constant to handle zero fares
titanic_df['log_fare'] = np.log(titanic_df['fare'] + 0.1)
```

Here, we add 0.1 to the `fare` column before applying the logarithmic transformation, because the logarithm of zero is undefined and some fares are zero.
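
To get a feel for what the transformation does, one option is to compare skewness before and after it. The sketch below uses a small set of hypothetical fares (`toy_fares` is made up for illustration) and also shows `np.log1p`, a common alternative that computes `log(1 + x)`; the lesson itself uses the `+ 0.1` shift.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: mostly small values, one zero, and one very large outlier
toy_fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 512.33])

# Same transformation as above: shift by 0.1 so zero fares stay finite
toy_log_fares = np.log(toy_fares + 0.1)

# Alternative (not used in the lesson): log(1 + x), which maps a zero fare to 0
toy_log1p_fares = np.log1p(toy_fares)

skew_before = toy_fares.skew()
skew_after = toy_log_fares.skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

The size of the added constant is a judgment call: a very small shift turns zero fares into strongly negative values, which can create a new outlier on the left side of the distribution.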

## Handling and Encoding Categorical Values

Machine Learning algorithms generally deal with numbers and not text. Therefore, in handling categorical features, it is important to change text values to numerical ones. The process of doing this is called encoding. For our Titanic dataset, we can perform One-Hot Encoding on the `'sex'` column:

```python
import pandas as pd

# Get dummy variables for the 'sex' column
sex_dummies = pd.get_dummies(titanic_df['sex'])
# Join the dummy variables to the main dataframe
titanic_df = pd.concat([titanic_df, sex_dummies], axis=1)
```

Here, we are using the `pd.get_dummies()` function to perform One-Hot Encoding on the `sex` column. This results in a new dataframe `sex_dummies`, in which each category of the `sex` column has its own column.
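
A few optional parameters of `pd.get_dummies()` are often useful here: `dtype`, `prefix`, and `drop_first`. The following is a minimal sketch on a hypothetical toy column (`toy_df` is made up for illustration); depending on your pandas version, the default output may be boolean rather than 0/1 integers, which is why `dtype=int` is passed explicitly.

```python
import pandas as pd

# Hypothetical toy column standing in for titanic_df['sex']
toy_df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# dtype=int yields 0/1 values; prefix makes the origin of each column obvious
sex_dummies = pd.get_dummies(toy_df["sex"], prefix="sex", dtype=int)

# For a two-category feature, one column already carries all the information;
# drop_first=True drops the first category and keeps only 'sex_male'
sex_male_only = pd.get_dummies(
    toy_df["sex"], prefix="sex", dtype=int, drop_first=True
)

# Join the dummy variables to the toy dataframe
toy_df = pd.concat([toy_df, sex_dummies], axis=1)
print(toy_df)
```

Whether to drop one of the columns depends on the model: some linear models are sensitive to the redundancy, while tree-based models usually are not.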

## That's a Wrap For Now

Fantastic work today! You have delved into feature engineering with Python and Pandas: creating new features from scratch, modifying existing ones, and encoding categorical features. With these techniques combined, the cleaned and processed data is structured and ready to feed into a machine learning model for training. I almost forgot: let's go do some practice!


Adjust the code to transform the fare feature using a base-10 logarithm instead of a natural logarithm. This practice will give you insight into data normalization techniques and their impact on feature scaling.

```python
# Import necessary libraries
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic_df = sns.load_dataset("titanic")

# Create a new feature 'family_size'
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1

# Modify 'fare' feature to 'log_fare' using base-10 logarithm
titanic_df['log_fare'] = np.log10(titanic_df['fare'] + 0.1)

# One-hot encode 'sex' feature
sex_dummies = pd.get_dummies(titanic_df['sex'], dtype=int)
titanic_df = pd.concat([titanic_df, sex_dummies], axis=1)

# Print the first 5 rows of the dataframe
print(titanic_df.head())
```
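
Switching the base of a logarithm only rescales the result by a constant, since log10(x) = ln(x) / ln(10). So the base-10 version has the same shape as the natural-log version, just on a different scale. A small check on hypothetical fares (the values are made up for illustration):

```python
import numpy as np

# Hypothetical fares, shifted by 0.1 as in the exercise
shifted = np.array([0.0, 7.25, 71.28, 512.33]) + 0.1

natural_log = np.log(shifted)
base10_log = np.log10(shifted)

# Change of base: log10(x) = ln(x) / ln(10)
print(np.allclose(base10_log, natural_log / np.log(10)))  # True
```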

Astounding perseverance, Space Voyager!

Here's a snippet of code meant to augment the Titanic dataset with a new feature. Although it appears correct at first glance, the code isn't functioning as expected. Could you identify the glitch and apply your feature engineering expertise to fix it? Let the coding journey continue!

```python
# Import necessary libraries
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset("titanic")

# Create a new binary encoded feature 'embarked_southampton'
embark_town_col = pd.DataFrame(
    [1 if i == 'Southampton' else 0 for i in titanic_df['embark_town']],
    columns=["embarked_southampton"],
)

# Join to the main dataframe with aligned indices
titanic_df = titanic_df.reset_index(drop=True)
embark_town_col = embark_town_col.reset_index(drop=True)
titanic_df = pd.concat([titanic_df, embark_town_col], axis=1)

# Print the first 5 rows of the dataframe
print(titanic_df.head())
```
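
One way a join like this can go wrong is index misalignment: `pd.concat(..., axis=1)` matches rows by index label, not by position. This is offered as an illustration, not necessarily the exercise's intended glitch. The sketch below uses a hypothetical toy frame (`toy_df` and its index values are made up) and also shows a vectorized comparison that avoids the join altogether.

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1 (e.g. after filtering rows)
toy_df = pd.DataFrame(
    {"embark_town": ["Southampton", "Cherbourg", None]},
    index=[10, 11, 12],
)

# A new DataFrame built from a list gets a fresh 0..n-1 index
flag_col = pd.DataFrame(
    [1 if town == "Southampton" else 0 for town in toy_df["embark_town"]],
    columns=["embarked_southampton"],
)

# concat aligns on index labels, so no rows match and both sides are padded with NaN
misaligned = pd.concat([toy_df, flag_col], axis=1)
print(misaligned.shape)  # (6, 2)

# A vectorized comparison keeps the original index, so no join is needed
toy_df["embarked_southampton"] = (toy_df["embark_town"] == "Southampton").astype(int)
print(toy_df["embarked_southampton"].tolist())  # [1, 0, 0]
```

Note that a missing `embark_town` is encoded as 0 here, the same as any town other than Southampton.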

Space Explorer, you're doing great! It's now time for you to apply what you've learned about transforming feature scales. Implement the line of code that adds a log-transformed version of the 'fare' column to the dataframe. Remember to calculate the logarithm with base 10, and be mindful of zero fares.

```python
# Import necessary libraries
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic_df = sns.load_dataset("titanic")

# Create a new feature 'family_size'
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1

# TODO: Transform the 'fare' column into a new column 'log10_fare' using log base 10. Include an adjustment for zero fares.
titanic_df['log10_fare'] = np.log10(titanic_df['fare'] + 0.1)

# Print the first 5 rows of the dataframe
print(titanic_df.head())
```

Your skills are now bordering on interstellar, Space Voyager! For this concluding challenge, take command of the IDE and encode the passengers' class information from the Titanic dataset. Implement code that one-hot encodes the 'class' feature, appends it to the dataset, and then displays the frontier of the modified table: the first 5 rows.

```python
# Import necessary libraries
import seaborn as sns
import pandas as pd

# TODO: Load the Titanic dataset and assign it to a variable named 'titanic_df'
titanic_df = sns.load_dataset("titanic")

# TODO: Perform one-hot encoding on the 'class' column to create binary columns for each class
class_dummies = pd.get_dummies(titanic_df['class'])

# TODO: Join the new binary columns to 'titanic_df'
titanic_df = pd.concat([titanic_df, class_dummies], axis=1)

# TODO: Display the first 5 rows of the modified dataframe
print(titanic_df.head())
```