# Data Cleaning and Preprocessing in Machine Learning

# Topic Overview
Welcome! Today we explore the role of data preprocessing in machine learning. There's no better way to learn than by tackling real-world data, so we'll use the Titanic dataset, which describes the passengers aboard the ill-fated maiden voyage of this once-lauded "unsinkable" ship.

Data preprocessing is a vital preliminary step in any machine learning pipeline, capable of transforming raw, discordant data into a format that can be effectively utilized by machine learning algorithms. This whole process includes diverse techniques such as cleaning the data, dealing with missing values, data format transformations, and data normalization. In this lesson, we set the scene for their application.

By the conclusion of today's lesson, you'll possess an understanding of the necessity of preprocessing in machine learning, an overview of the structure and complexity of the Titanic dataset, and the ability to apply preliminary data analysis techniques to extract initial insights.

So, fasten your seatbelts and start the engines!

# Introduction to Data Preprocessing
Data preprocessing is at the heart of any machine learning pipeline: done well, it can improve accuracy, and when overlooked it often leads to poor performance. The quality of a model's output depends heavily on the quality of its input data, hence the golden rule: "Garbage In, Garbage Out."

In simple terms, the goal of data preprocessing is to cleanse, transform, and format the raw data into a structure that is ready for machine learning algorithms. The right preprocessing techniques depend on the specifics of your data, so there is no "one-size-fits-all" strategy.

The section today works like an introduction to this broad ocean of skills and sets the foundation for how you'll approach datasets in ensuing lessons.

# Overview of the Titanic Dataset
Having understood the concept of preprocessing, it's time to roll up our sleeves and get our hands dirty with the Titanic dataset. We aim to understand the data structure and its characteristics.

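The Titanic dataset is available through Seaborn, a visualization library in Python. Its load_dataset function fetches the data from Seaborn's online example-data repository, so an internet connection is needed the first time you run it. Let's go ahead and load the dataset.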

```python
import seaborn as sns
import pandas as pd

# Load Titanic dataset
titanic_data = sns.load_dataset('titanic')

# Display the first few records
print(titanic_data.head())

# Review the structure of the dataset
print(titanic_data.info())
```

```md
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

[5 rows x 15 columns]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
None
```
In the script above, we used seaborn to load the Titanic dataset as a pandas DataFrame. The .head() method shows the first five rows, and the .info() method summarizes the structure of the DataFrame: the number of rows, the number of non-null entries in each column, and the data type of each column. Because .info() prints its report directly and returns None, wrapping it in print() is what produces the trailing None in the output.
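
To make the pieces of that summary more concrete, here is a minimal sketch that retrieves them one at a time. It uses a tiny made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the Titanic data), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for a few passenger records
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, None, 26.0, 35.0],
    "sex": ["male", "female", "female", "male"],
})

n_rows, n_cols = toy.shape      # number of rows and columns
non_null_counts = toy.count()   # non-null entries per column
column_types = toy.dtypes       # data type of each column

print(f"{n_rows} rows x {n_cols} columns")
print(non_null_counts)
print(column_types)
```

A column whose non-null count is lower than the number of rows, like `age` here, contains missing values.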

# Drawing Insights from the Titanic Dataset
Before parting, let's take a look at some general statistics from the Titanic dataset, which will help us gain a better understanding of what we just loaded.

Pandas DataFrames provide us with the neat .describe() function, which returns various descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution.

```python
print(titanic_data.describe())
```

```md
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
```

Using the .describe() method, you can see detailed statistics for each numeric column in your DataFrame. These include the number of non-missing values (count), the mean, the standard deviation, the minimum, the 25th, 50th (median), and 75th percentiles, and the maximum. Studying these statistics provides a fundamental understanding of the characteristics of the data you are working with.
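
As a rough illustration of how to read that table, the sketch below pulls individual values out of the result of `.describe()`. The `toy` frame and its numbers are made up for this example.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 28.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

stats = toy.describe()

# 'count' ignores missing values, so it can differ between columns
age_count = stats.loc["count", "age"]
fare_count = stats.loc["count", "fare"]

# The '50%' row is the median of each column
fare_median = stats.loc["50%", "fare"]

# A mean well above the median hints at a right-skewed column
fare_is_right_skewed = stats.loc["mean", "fare"] > fare_median

print(stats)
print(fare_is_right_skewed)
```

The same reading applies to the Titanic output above, where fare has a mean of about 32.2 but a median of about 14.45, which suggests that a small number of very expensive tickets pull the mean upward.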

Keep in mind that all the impressive and advanced visualizations and models you'll hear about in data science and machine learning are often built on these humble statistics you're looking at. So, understand these well!

# Lesson Summary and Practice
Great job on reaching the end of the lesson! We started our journey by dipping our toes in the ocean of data preprocessing and explored the Titanic as an example dataset. We unfolded the mystery behind the data structure through some initial data analysis.

Looking back, we started off with the significance of data preprocessing, moved to the initial exploration of the Titanic dataset through understanding its structure, and ended with drawing initial descriptive statistics of the dataset.

For the next stage, get ready for some hands-on exploration of the Titanic dataset using Python and Pandas. The practice will involve gaining on-the-field experience in comprehending datasets. Remember, the magic often lies in the details, and the power to unravel that lies within practice. Keep going, and let the world of data keep fascinating you!

Let's delve deeper! In the Titanic dataset, you have examined its basic structure and overall statistics. For targeted insights, adjust the code to display summary statistics specifically for the age and fare columns, offering a more focused view of this historical data. Are you ready to enhance your data preprocessing skills?

```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_data = sns.load_dataset('titanic')

# Display the structure of the dataset
print(titanic_data.info())

# Display summary statistics of the dataset
print(titanic_data[['age','fare']].describe())
```

```md
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None
              age        fare
count  714.000000  891.000000
mean    29.699
```

Brilliant work so far, Space Voyager! We have a dataset ready for inspection, but it seems our script is tripping over its own feet. Can you identify the hiccup and get the dataset exploration back on track?

The code should output the first few records, the dataset's info, and its general statistics. Look closely—the devil is in the details!

```python
import seaborn as sns
import pandas as pd

# Load Titanic dataset
titanic_data = sns.load_dataset('titanic')

# Display the first few records
print(titanic_data.head())

# Review the structure of the dataset
print(titanic_data.info())

# Print general statistics of the dataset
print(titanic_data.describe())
```

```md
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-nu
```

# Wrangling Missing Data: Techniques Applied to the Titanic Dataset
## Lesson Introduction
Welcome to an intriguing lesson on missing data handling! Today, we're diving into the Titanic dataset, a passage in time to the early 20th century. Our main aim? To wrangle missing data using Python and Pandas. Don't worry if you're unfamiliar with these terms yet; we'll break them down one by one!

- Python: A high-level, interpreted programming language that is easy to learn yet powerful. It has bundles of libraries, like Pandas, that make data manipulation a breeze.
- Pandas: A Python library providing high-performance, easy-to-use data structures and data analysis tools.

By the end of this lesson, you'll understand the basics of handling missing data, which is an essential step in preparing your data for machine learning models. So let's get started!

## Understanding Missing Data
As an analyst or data scientist, it's pivotal to understand why data might be missing, as it helps in choosing the best strategy to handle it. Missing data, which are like missing puzzle pieces, can occur due to several reasons, such as not being collected, being recorded incorrectly, or even being lost over time.

Furthermore, missing data can be categorised as:

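- Missing completely at random (MCAR): The chance that a value is missing is unrelated to any data, observed or unobserved.
- Missing at random (MAR): The chance that a value is missing depends on the observed values of other variables, but not on the missing value itself.
- Missing not at random (MNAR): The chance that a value is missing depends on the unobserved value itself, so the gaps follow a pattern that the observed data cannot fully explain.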
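
One rough way to probe these categories is to compare the share of missing values across groups of another observed variable. If the share differs a lot between groups, the data are probably not MCAR. The sketch below does this on made-up data: the column names mirror the Titanic dataset, but the values are invented for illustration. Note that a check like this cannot separate MAR from MNAR, since MNAR depends on values we never observe.

```python
import pandas as pd

# Hypothetical toy data: age is missing more often in class 3
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3, 3],
    "age": [38.0, 35.0, 54.0, None, 22.0, None, None],
})

# Fraction of missing ages within each passenger class
missing_rate_by_class = toy["age"].isnull().groupby(toy["pclass"]).mean()
print(missing_rate_by_class)
```
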

## Identifying Missing Values in the Titanic Dataset
Before we can consider how to handle missing data, let's learn how to identify it. We'll use the isnull() and sum() functions from the Pandas library to find the number of missing values in our Titanic dataset:

```python
import seaborn as sns
import pandas as pd

# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)
```

The output from this code will be:

```Markdown
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
```

In the output, you'll see each column name accompanied by a number that denotes the number of missing values in that column.
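
Raw counts are easier to compare once they are expressed as a share of all rows. The following sketch shows one way to do that; the `toy` frame is a made-up stand-in so the example runs on its own.

```python
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks each cell True/False; the mean of booleans is the missing fraction
missing_pct = toy.isnull().mean() * 100

# Sort so the most incomplete columns appear first
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```

Applied to the counts above, deck is missing in 688 of 891 rows (about 77%), while age is missing in 177 of 891 rows (about 20%).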

## Strategies to Handle Missing Data
Armed with the knowledge of missing data and its types, it's time to decide how to handle them. Broadly, you can consider three main strategies:

- Deletion: This involves removing the rows or columns containing missing data. However, this might lead to the loss of valuable information.
- Imputation: This involves filling missing values with substitutes, such as the mean, median, or mode (the most common value) of the column.
- Prediction: This involves using a predictive model to estimate the missing values.

A balance of intuition, experience, and technical know-how usually dictates the best method to use.
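
The three strategies can be sketched side by side on a small made-up example. Everything here is an assumption for illustration: the `toy` frame, its column names, and the choice of a straight-line fit via `numpy.polyfit` as a stand-in for "a predictive model". Real projects would usually pick a model suited to the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y is exactly 2 * x, with the last value missing
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, np.nan],
})

# 1. Deletion: drop every row that has a missing value
deleted = toy.dropna()

# 2. Imputation: fill the gap with the column median
imputed = toy.assign(y=toy["y"].fillna(toy["y"].median()))

# 3. Prediction: fit a straight line on the complete rows, then predict the gap
known = toy.dropna()
slope, intercept = np.polyfit(known["x"].to_numpy(), known["y"].to_numpy(), deg=1)
predicted = toy.assign(y=toy["y"].fillna(slope * toy["x"] + intercept))

print(len(deleted))           # rows left after deletion
print(imputed.loc[4, "y"])    # median-based fill
print(predicted.loc[4, "y"])  # model-based fill
```

Here the median fill and the model-based fill disagree, which shows why the choice of strategy matters: the median ignores the relationship between the two columns, while the fitted line uses it.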

## Handling Missing Data in the Titanic Dataset
Let's get our hands dirty and handle missing data firsthand in the Titanic dataset. For the “age” feature, we'll fill in missing entries with the median passenger age. And, for the “deck” feature, where most entries are missing, we'll delete the entire column.

```python
# Dealing with missing values 

# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])

# Imputing median age for missing age data
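# (assigning the result back avoids chained in-place modification,
# which recent pandas versions warn about)
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())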

# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)
```

The updated missing values count comes out to be:


```Markdown
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64
```

As the updated count shows, age no longer has missing values and the deck column is gone. The embarked and embark_town columns still have two missing entries each, which we have not handled yet. Note that we could also use the dropna() function to remove rows with missing values: titanic_df.dropna(). However, we should be cautious, as this might remove a significant portion of our data.
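
To see why caution is warranted, the sketch below counts how many rows survive different calls to `dropna()`. The `toy` frame is made up for illustration; only the column names echo the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'embarked' has one gap
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "deck": [None, "C", None, None],
    "embarked": ["S", "C", None, "S"],
})

# Dropping every row with any missing value keeps only fully complete rows
rows_all = len(toy.dropna())

# Restricting the check to one column removes far fewer rows
rows_subset = len(toy.dropna(subset=["embarked"]))

print(f"dropna(): {rows_all} of {len(toy)} rows kept")
print(f"dropna(subset=['embarked']): {rows_subset} of {len(toy)} rows kept")
```

In the original Titanic data, deck is non-null in only 203 of 891 rows, so calling dropna() before removing that column would keep at most 203 rows.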

## Lesson Summary and Practice
Well done! You have now explored the basics of handling missing data, an essential pre-processing step for any machine-learning model. The skill of dealing with missing data is a key arrow in any data scientist's quiver, ensuring that your data is clean and ready for modeling.

Get set for some upcoming practice sessions that will provide you with opportunities to apply and reinforce what you've learned today. Feel the thrill as we continue venturing deeper into the world of data processing! Nothing should be missing from your data now, so it's time to wield your new skills!

In the given code, we have already cleaned our Titanic dataset by addressing its missing values. The deck column has been removed due to an excessive number of missing data points, and missing values in age, embarked, and embark_town have been imputed with median and mode values, respectively. Run the code to check if all missing values have been handled and to see the improved state of our dataset, now ready for further analysis!

```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify and display missing values
missing_values = titanic_df.isnull().sum()
print("Missing values before handling:\n", missing_values)

# Handle missing data by dropping the 'deck' column and imputing 'age'
# (results are assigned back instead of using chained inplace=True)
titanic_df = titanic_df.drop(columns=['deck'])
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())

# Impute the 'embarked' and 'embark_town' columns with the most common value
most_common_embarked = titanic_df['embarked'].mode()[0]
titanic_df['embarked'] = titanic_df['embarked'].fillna(most_common_embarked)
most_common_embark_town = titanic_df['embark_town'].mode()[0]
titanic_df['embark_town'] = titanic_df['embark_town'].fillna(most_common_embark_town)

# Verify that missing data has been handled
missing_values_after = titanic_df.isnull().sum()
print("Missing values after handling:\n", missing_values_after)
```

```md
Missing values before handling:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
Missing values after handling:
 survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
```


Superb progress, Space Voyager!

Let's enhance our data imputation skills. In the provided starter code, you'll find a line where missing values in the 'embarked' column are filled with a placeholder. Your task is to modify this line to impute missing values with the most common 'embarked' category instead.


```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify and print the number of missing values in the 'age' and 'embarked' columns
missing_values_age_embarked = titanic_df[['age', 'embarked']].isnull().sum()
print('Missing values in age and embarked columns:\n', missing_values_age_embarked)

# Impute the missing values in the 'age' column with the median age
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())

# Original starter line, which used the placeholder value 'U' for Unknown:
#titanic_df['embarked'].fillna('U', inplace=True)
# Impute the missing values in the 'embarked' column with the most common category
titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].value_counts().index[0])

# Print the dataset info to confirm that there are no more missing values in 'age' and 'embarked'
print('\nDataset information post-imputation:')
print(titanic_df.info())
```

```md
Missing values in age and embarked columns:
 age         177
embarked      2
dtype: int64

Dataset information post-imputation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     891 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float6
```

Good job navigating the sea of data, Space Voyager! Now, let's put your skills to the test. Fill in the blanks to impute the missing ages, and clean up the dataset by removing a column that's mostly empty.

```python
import seaborn as sns
import pandas as pd

# Load the dataset
titanic = sns.load_dataset('titanic')

# Find the number of missing values in each column
missing_values_before = titanic.isnull().sum()
print("Missing values before handling:")
print(missing_values_before)

# TODO: Replace missing data in 'age' column with a central tendency measure of your choice
# (here the mean; the result is assigned back rather than modified in place)
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

# TODO: Remove a column with too many missing values to salvage
titanic = titanic.drop(columns=['deck'])

# Verify the handling by checking for missing values again
missing_values_after = titanic.isnull().sum()
print("\nMissing values after handling:")
print(missing_values_after)

# Optionally, show the info of the dataset to visualize the changes
print("\nDataset information after handling missing data:")
print(titanic.info())
```

Great job handling the missing values, Space Explorer! However, the code you have isn't acting as expected. It's generating an error when trying to handle missing categories in the 'age' column. Can you spot the glitch and adjust the thrusters so we can ensure a smooth data preprocessing journey?

```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Drop the 'deck' column due to excessive missing values
titanic_df_cleaned = titanic_df.drop(columns=['deck'])

# Impute the missing 'age' values with the median age
# (results are assigned back instead of using chained inplace=True)
median_age = titanic_df_cleaned['age'].median()
titanic_df_cleaned['age'] = titanic_df_cleaned['age'].fillna(median_age)

# Impute the missing 'embarked' values with the mode
mode_embarked = titanic_df_cleaned['embarked'].mode()[0]
titanic_df_cleaned['embarked'] = titanic_df_cleaned['embarked'].fillna(mode_embarked)

# Impute the missing 'embark_town' values with the mode
mode_embark_town = titanic_df_cleaned['embark_town'].mode()[0]
titanic_df_cleaned['embark_town'] = titanic_df_cleaned['embark_town'].fillna(mode_embark_town)

# Check for remaining missing values
missing_values_after = titanic_df_cleaned.isnull().sum()
print(missing_values_after)
```