# Lesson 3: Comprehensive Preprocessing With Multiple Techniques: Part 1


Imagine you are cleaning your room and organizing items step-by-step. Data preprocessing is similar! In this lesson, we'll prepare a dataset for analysis by integrating multiple preprocessing techniques. Our goal is to make the data clean and ready for useful insights.

## Drop Unnecessary Columns

Not all columns are useful for our analysis. Some might be redundant or irrelevant. For example, columns like deck, embark_town, alive, class, who, adult_male, and alone may not add much value. Let's drop these columns.

```python
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Drop unnecessary columns
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']
titanic = titanic.drop(columns=columns_to_drop)

# Display the DataFrame after dropping columns
print(titanic.head())
```

```
   survived  pclass     sex   age  sibsp  parch     fare embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S
```

We use the .drop() method and pass the list of column names to remove through its columns argument.
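
Two details of drop are easy to miss: it returns a new DataFrame rather than modifying the original, and by default it raises a KeyError if a listed column does not exist. The sketch below illustrates both on a toy DataFrame; the rows and the name not_a_column are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy DataFrame with hypothetical values (not the real Titanic data)
df = pd.DataFrame({
    'survived': [0, 1, 1],
    'alive': ['no', 'yes', 'yes'],
    'deck': [None, 'C', None],
})

# drop() returns a new DataFrame; df itself is left unchanged
trimmed = df.drop(columns=['alive', 'deck'])
print(list(df.columns))       # the original still has all three columns
print(list(trimmed.columns))  # only 'survived' remains

# errors='ignore' skips names that are not present instead of raising KeyError
safe = df.drop(columns=['alive', 'not_a_column'], errors='ignore')
print(list(safe.columns))
```

This is why the lesson assigns the result back to titanic: without the assignment, the dropped columns would still be there.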

## Handle Missing Values

Data often has missing values, which are problematic for many algorithms. In our Titanic dataset, we can fill missing values with reasonable substitutes like the median for numerical columns and the mode for categorical columns.

```python
# Fill missing values in 'age' with the median value
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Fill missing values in 'embarked' with the mode value
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Fill missing values in 'fare' with the median value
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())
```

Here, we use the fillna method to replace missing values (NaN) with a specified value. You can pass a single value, a dictionary that maps column names to their substitutes, or a computed aggregate such as the median or mode, as we do here.
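
The dictionary form mentioned above can fill several columns in one call. Here is a minimal sketch on a toy DataFrame; the rows and the names fill_values and filled are assumptions made for illustration.

```python
import pandas as pd

# Toy DataFrame with hypothetical values and a few gaps
df = pd.DataFrame({
    'age': [22.0, None, 26.0, 35.0],
    'embarked': ['S', 'C', None, 'S'],
})

# Map each column name to the value that should replace its missing entries
fill_values = {
    'age': df['age'].median(),             # median ignores NaN by default
    'embarked': df['embarked'].mode()[0],  # mode() returns a Series; take the first entry
}

# fillna with a dictionary fills each listed column with its own substitute
filled = df.fillna(fill_values)
print(filled)
```

Columns that are not keys of the dictionary are left untouched, so this form is a compact alternative to one assignment per column.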

Let's check whether filling the titanic DataFrame worked:

```python
# Check for any remaining missing values
print(titanic.isnull().sum())
```

This line prints the number of missing values in each column of the titanic DataFrame. The isnull() method returns a new DataFrame of the same shape, with True where a value is missing and False where it is present. When we sum these boolean values, True counts as 1 and False as 0, so each column's sum equals its number of missing values.

The output is:

```
survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
dtype: int64
```

We see zeros everywhere, indicating that there are no more missing values in the DataFrame.
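
To see how isnull() and sum() work together, it can help to look at the intermediate boolean DataFrame. The following is a small sketch on a toy DataFrame with hypothetical values; mask and counts are illustrative names.

```python
import pandas as pd

# Toy DataFrame: 'age' has two gaps, 'fare' has none
df = pd.DataFrame({
    'age': [22.0, None, None],
    'fare': [7.25, 71.28, 8.05],
})

# Boolean DataFrame of the same shape: True marks a missing value
mask = df.isnull()
print(mask)

# Summing column by column counts the True values
counts = mask.sum()
print(counts)
```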

## Encode Categorical Values

Categorical values need to be converted into numbers for most algorithms. For example, the sex and embarked columns in our dataset are categorical. We'll use the get_dummies function to encode these columns.

```python
# Encode categorical values
titanic = pd.get_dummies(titanic, columns=['sex', 'embarked'], dtype='int')

# Display the DataFrame after encoding
print(titanic.head())
```

Note the dtype='int' parameter. It makes the new encoded columns hold 0 or 1. Without it, pandas 2.0 and later return boolean columns that hold False or True.

```
   survived  pclass   age  sibsp  parch     fare  sex_female  sex_male  embarked_C  embarked_Q  embarked_S
0         0       3  22.0      1      0   7.2500           0         1           0           0           1
1         1       1  38.0      1      0  71.2833           1         0           1           0           0
2         1       3  26.0      0      0   7.9250           1         0           0           0           1
3         1       1  35.0      1      0  53.1000           1         0           0           0           1
4         0       3  35.0      0      0   8.0500           0         1           0           0           1
```
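
One-hot encoding replaces each categorical column with one indicator column per category, so exactly one of the new columns is 1 in every row. A minimal sketch on a toy DataFrame (hypothetical rows, illustrative variable name as_int) makes this visible:

```python
import pandas as pd

# Toy DataFrame with a single categorical column
df = pd.DataFrame({'sex': ['male', 'female', 'female']})

# One indicator column per category, named <column>_<category>
as_int = pd.get_dummies(df, columns=['sex'], dtype='int')
print(as_int)
print(as_int.dtypes)

# Each row has exactly one indicator set to 1
print(as_int.sum(axis=1))
```

The original sex column is removed and replaced by sex_female and sex_male, which matches the column names in the Titanic output above.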

## Scale Numerical Values

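Scaling numerical values matters most for algorithms that rely on the distance between data points, because a feature with a large range would otherwise dominate. We will standardize the age and fare columns by subtracting the mean and dividing by the standard deviation, so that each has a mean of 0 and a standard deviation of 1.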

```python
# Scale numerical values
titanic['age'] = (titanic['age'] - titanic['age'].mean()) / titanic['age'].std()
titanic['fare'] = (titanic['fare'] - titanic['fare'].mean()) / titanic['fare'].std()

# Display the DataFrame after scaling
print(titanic.head())
```

```
   survived  pclass       age  sibsp  parch      fare  sex_female  sex_male  embarked_C  embarked_Q  embarked_S
0         0       3 -0.530005      1      0 -0.502445           0         1           0           0           1
1         1       1  0.571433      1      0  0.786845           1         0           1           0           0
2         1       3 -0.254888      0      0 -0.488854           1         0           0           0           1
3         1       1  0.396745      1      0  0.420730           1         0           0           0           1
4         0       3  0.396745      0      0 -0.486337           0         1           0           0           1
```
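
A quick way to confirm that standardization worked is to check the mean and standard deviation of the result. The sketch below does this on a short Series that reuses the five fare values printed earlier; the names fare and scaled are illustrative.

```python
import pandas as pd

# The five fare values from the head() output, used as a small sample
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Same formula as in the lesson: subtract the mean, divide by the standard deviation
scaled = (fare - fare.mean()) / fare.std()

# Up to floating-point error, the mean is 0 and the standard deviation is 1
print(round(scaled.mean(), 10))
print(round(scaled.std(), 10))
```

The scaled values here differ from those in the lesson output because the mean and standard deviation are computed from five rows instead of the full column.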

## Lesson Summary

Congratulations! You've cleaned and prepared the Titanic dataset using multiple preprocessing techniques. Here's a quick recap:

- Loaded and inspected the dataset
- Dropped unnecessary columns to focus on valuable data
- Handled missing values to ensure the dataset is complete
- Encoded categorical values to make them usable by algorithms
- Scaled numerical values to improve model performance

Now it's time to put your newfound skills to the test! In the upcoming practice session, you'll apply these preprocessing techniques step by step to the Titanic dataset. This hands-on experience will solidify your understanding and give you confidence in tackling data preprocessing in real-world scenarios. Let's get started!

## Drop Irrelevant Columns from Titanic Dataset

Fantastic work, Space Explorer! Let's take your skills up a notch. Complete the code to drop irrelevant columns from the Titanic dataset.

Keep up the great work!

```py
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# TODO: Remove columns that are not useful for analysis
titanic = titanic.drop(columns=[___])

# Show the first 5 rows
print(titanic.head())

```

I notice you'd like help cleaning up the Titanic dataset. Let's analyze which columns we might want to drop. In the Titanic dataset, certain columns like 'deck', 'embark_town', and 'alive' are often redundant or less useful for analysis since:

- 'deck' has many missing values
- 'embark_town' is redundant with 'embarked'
- 'alive' is redundant with 'survived'

Here's the completed code:

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Remove less useful columns
titanic = titanic.drop(columns=['deck', 'embark_town', 'alive'])

# Show the first 5 rows
print(titanic.head())
```

This removes the three most clearly redundant columns. The result keeps 'survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', and 'embarked', along with 'class', 'who', 'adult_male', and 'alone', which the lesson also drops. Would you like me to explain why we kept any particular columns?
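
The reasons given above for dropping each column can be checked directly rather than taken on trust. Here is a hedged sketch on a toy DataFrame; the rows are hypothetical and the variable names are illustrative, but the same three checks could be run on the real dataset.

```python
import pandas as pd

# Toy rows that mimic the structure of the Titanic columns discussed above
df = pd.DataFrame({
    'survived': [0, 1, 1, 0],
    'alive': ['no', 'yes', 'yes', 'no'],
    'embarked': ['S', 'C', 'Q', 'S'],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown', 'Southampton'],
    'deck': [None, 'C', None, None],
})

# 'alive' is redundant if mapping no/yes to 0/1 reproduces 'survived' on every row
alive_matches = (df['alive'].map({'no': 0, 'yes': 1}) == df['survived']).all()

# 'embark_town' is redundant if each 'embarked' code maps to exactly one town
one_town_per_code = (df.groupby('embarked')['embark_town'].nunique() == 1).all()

# Share of missing values per column; a high share suggests dropping the column
missing_share = df.isnull().mean()

print(alive_matches, one_town_per_code)
print(missing_share)
```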

## Dropping Columns and Handling Missing Values

Great job, Galactic Pioneer! Let's tackle another fun task.

Please fill in the missing pieces to drop unnecessary columns and handle missing values in the Titanic dataset.

```py
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# TODO: Drop unnecessary columns that might be irrelevant to our analysis
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']

# TODO: Fill missing values in 'age' with the median and 'embarked' with the mode

print(titanic.head())

```

I'll help you complete the code to handle both dropping columns and dealing with missing values. Here's the solution:

```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Drop unnecessary columns
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']
titanic = titanic.drop(columns=columns_to_drop)

# Fill missing values
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

print(titanic.head())
```

This code:
1. Drops the specified redundant columns
2. Fills missing ages with the median age (a common practice for numerical data)
3. Fills missing embarked values with the most common port (mode)

Would you like me to explain any of these data cleaning steps in more detail?

## Encoding Categorical Data

Great job so far, Space Explorer! Now it's time to make the code work by filling in the missing pieces.

Let's preprocess the Titanic dataset by handling missing values and encoding categorical variables.

Good luck!

```py
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Keep only the columns we need (copy so later assignments modify an independent DataFrame)
titanic = titanic[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']].copy()

# Fill missing values for 'age' and 'embarked'
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# TODO: Encode categorical values using one-hot encoding and store the result in titanic_encoded. Don't forget to use dtype='int'

print(titanic_encoded.head())

```

I'll help you complete the code by adding the one-hot encoding step. Here's the solution:

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Keep only the columns we need (copy so later assignments modify an independent DataFrame)
titanic = titanic[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']].copy()

# Fill missing values for 'age' and 'embarked'
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Encode categorical values using one-hot encoding
titanic_encoded = pd.get_dummies(titanic, columns=['sex', 'embarked'], dtype='int')

print(titanic_encoded.head())
```

This code:
1. Keeps only the columns needed for the analysis
2. Fills missing ages with the median age and missing embarked values with the most common port (mode)
3. One-hot encodes 'sex' and 'embarked' with pd.get_dummies, using dtype='int' so the new columns hold 0 or 1

Would you like me to explain any of these preprocessing steps in more detail?

## Preprocess Titanic Data by Handling Missing Values and Scaling

Galactic Pioneer, let’s put your skills to the test! Fill in the missing pieces of code to complete the preprocessing tasks for the Titanic dataset. These tasks are crucial for cleaning and preparing the data for analysis.

May the stars guide your code!

```py
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Drop unnecessary columns
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']
titanic = titanic.drop(columns=columns_to_drop)

# Fill missing values
# TODO: Fill missing values for 'age' and 'fare' columns using median
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Scale numerical values
# TODO: Scale 'age' and 'fare' columns
print(titanic.head())

```

I'll help you complete the preprocessing tasks by filling in the missing code for handling missing values and scaling numerical features. Here's the solution:

```python
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Drop unnecessary columns
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']
titanic = titanic.drop(columns=columns_to_drop)

# Fill missing values
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Scale numerical values
scaler = StandardScaler()
titanic[['age', 'fare']] = scaler.fit_transform(titanic[['age', 'fare']])

print(titanic.head())
```

This code:
1. Fills missing values in 'age' and 'fare' columns with their respective medians
2. Uses StandardScaler to standardize the 'age' and 'fare' columns (mean 0, standard deviation 1)

The scaling step is important because:
- It puts all numerical features on the same scale
- It often helps distance-based and gradient-based algorithms perform better
- It prevents features with larger values from dominating the model
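
Note that StandardScaler divides by the population standard deviation (ddof=0), while the manual formula in the lesson uses pandas' std(), which defaults to the sample standard deviation (ddof=1). The two results therefore differ by a constant factor. The sketch below shows this with pandas alone, on a toy Series taken from the first five ages; the variable names are illustrative.

```python
import pandas as pd

# First five ages from the lesson output, used as a small sample
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])

# Lesson formula: pandas std() defaults to ddof=1 (sample standard deviation)
sample_scaled = (age - age.mean()) / age.std()

# Same computation with ddof=0, the convention StandardScaler follows
population_scaled = (age - age.mean()) / age.std(ddof=0)

# Every value differs by the same factor, sqrt(n / (n - 1))
ratio = population_scaled / sample_scaled
print(ratio)
print((len(age) / (len(age) - 1)) ** 0.5)
```

The factor approaches 1 as the number of rows grows, so on the full Titanic dataset the two approaches give nearly identical values.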

Would you like me to explain any of these preprocessing steps in more detail?