## Lesson Introduction
Welcome to this lesson on handling missing data! Today, we're diving into the Titanic dataset, a window into the early 20th century. Our main aim is to wrangle missing data using Python and Pandas. Don't worry if you're unfamiliar with these terms yet; we'll break them down one by one.

Python: A high-level, interpreted programming language that is easy to learn yet powerful. It has a large ecosystem of libraries, such as Pandas, that make data manipulation a breeze.
Pandas: A Python library providing high-performance, easy-to-use data structures and data analysis tools.

By the end of this lesson, you'll understand the basics of handling missing data, which is an essential step in preparing your data for machine learning models. So let's get started!

## Understanding Missing Data
As an analyst or data scientist, it's pivotal to understand why data might be missing, as it helps in choosing the best strategy to handle it. Missing data, which are like missing puzzle pieces, can occur due to several reasons, such as not being collected, being recorded incorrectly, or even being lost over time.

Furthermore, missing data can be categorised as:

Missing completely at random (MCAR): Whether a value is missing is unrelated to any data, observed or unobserved.
Missing at random (MAR): Whether a value is missing depends on the observed values of other variables, but not on the missing value itself.
Missing not at random (MNAR): Whether a value is missing depends on the missing value itself, so the gaps follow a pattern that the observed data cannot fully explain.
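
To make the three categories concrete, the sketch below simulates each mechanism on a small synthetic table. The column names (`age`, `fare`), the sample size, and the missingness probabilities are illustrative assumptions, not properties of the Titanic data.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical values, for illustration only)
rng = np.random.default_rng(seed=0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(1, 80, size=n).astype(float),
    "fare": rng.uniform(5, 100, size=n),
})

# MCAR: every age has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: age is more likely to be missing when the observed fare is low
mar_mask = rng.random(n) < np.where(df["fare"] < 30, 0.4, 0.05)

# MNAR: age is more likely to be missing when the age itself is high
mnar_mask = rng.random(n) < np.where(df["age"] > 60, 0.5, 0.05)

# mask() replaces values with NaN wherever the condition is True
df_mcar = df.assign(age=df["age"].mask(mcar_mask))
df_mar = df.assign(age=df["age"].mask(mar_mask))
df_mnar = df.assign(age=df["age"].mask(mnar_mask))

for name, frame in [("MCAR", df_mcar), ("MAR", df_mar), ("MNAR", df_mnar)]:
    print(name, "missing ages:", frame["age"].isnull().sum())
```

In real data the mechanism is rarely known for certain; in particular, MNAR cannot be confirmed from the observed values alone, because the values that drive the missingness are exactly the ones that were not recorded.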

## Identifying Missing Values in the Titanic Dataset
Before we can consider how to handle missing data, let's learn how to identify it. We'll chain the Pandas methods isnull() and sum() to count the missing values in each column of our Titanic dataset:

In [1]:
import seaborn as sns
import pandas as pd

# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


In the output, each column name is followed by the number of missing values in that column. Here, age, deck, embarked, and embark_town are the only columns with missing entries.
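
Raw counts are easier to judge relative to the size of the table. As a minimal sketch, the snippet below uses a small made-up DataFrame (the values are hypothetical, not taken from the Titanic data) and `isnull().mean()` to report the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Small made-up table (hypothetical values) with some missing entries
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a True/False table; the mean of booleans is the fraction of True
missing_share = df.isnull().mean()
print(missing_share.sort_values(ascending=False))
```

A column where most entries are missing is a candidate for removal, while a column with only a few gaps is usually worth filling in.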

## Strategies to Handle Missing Data
Armed with the knowledge of missing data and its types, it's time to decide how to handle them. Broadly, you can consider three main strategies:

Deletion: This involves removing the rows or columns that contain missing data. However, this can discard valuable information.
Imputation: This involves filling missing values with substitutes, such as the column's mean, median, or mode (its most common value).
Prediction: This involves using a predictive model to estimate the missing values.
A balance of intuition, experience, and technical know-how usually dictates the best method to use.
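
The three strategies can be sketched on a small made-up table. The column names and values are hypothetical, and the prediction step uses a plain straight-line fit (`numpy.polyfit`) purely as a stand-in for a real predictive model.

```python
import numpy as np
import pandas as pd

# Made-up table (hypothetical values) with gaps in 'age' and 'port'
df = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age": [20.0, 30.0, np.nan, 50.0, 60.0],
    "port": ["S", "C", "S", None, "S"],
})

# 1. Deletion: drop every row that has at least one missing value
deleted = df.dropna()

# 2. Imputation: median for the numeric column, mode for the categorical one
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["port"] = imputed["port"].fillna(imputed["port"].mode()[0])

# 3. Prediction: fit age as a straight line in fare on the rows where age is known,
#    then use the fitted line to estimate the missing ages
known = df.dropna(subset=["age"])
slope, intercept = np.polyfit(known["fare"], known["age"], deg=1)
predicted = df.copy()
predicted["age"] = predicted["age"].fillna(slope * predicted["fare"] + intercept)

print(deleted.shape)
print(imputed)
print(predicted)
```

Deletion is the simplest option but shrinks the table, imputation keeps every row at the cost of flattening variation, and prediction can give better estimates when other columns carry information about the missing one.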



## Handling Missing Data in the Titanic Dataset
Let's get our hands dirty and handle missing data firsthand in the Titanic dataset. For the “age” feature, we'll fill in missing entries with the median passenger age. And, for the “deck” feature, where most entries are missing, we'll delete the entire column.

In [2]:
# Dealing with missing values 

# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])

# Imputing median age for missing age data
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())

# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64

Note: the imputation assigns the result of fillna() back to the column instead of calling new_titanic_df['age'].fillna(..., inplace=True). The inplace form on a selected column is chained assignment: recent pandas versions emit a FutureWarning for it, and the behavior will change in pandas 3.0, where the call no longer updates the original DataFrame because the intermediate object always behaves as a copy. The alternatives pandas suggests are df[col] = df[col].method(value) or df.method({col: value}, inplace=True).


As you can see from the updated missing values count, we have successfully handled the missing data! Note that we could also use the dropna() function to handle missing data by removing rows with missing values. However, we should be cautious, as this might remove a significant portion of our data. Here's how you can do it: titanic_df.dropna().
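
How much data `dropna()` removes depends on how it is called. The sketch below uses a small made-up table (hypothetical values) to compare dropping every incomplete row with restricting the check to chosen columns via the `subset` parameter, and with dropping sparse columns via `axis=1` and `thresh`.

```python
import numpy as np
import pandas as pd

# Made-up table (hypothetical values) where every row has at least one gap
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", None, "S"],
})

# Drop every row with at least one missing value: only complete rows survive
all_complete = df.dropna()

# Only require 'age' to be present; gaps in other columns are tolerated
age_present = df.dropna(subset=["age"])

# Drop columns instead of rows: keep columns with at least 3 non-missing values
dense_columns = df.dropna(axis=1, thresh=3)

print(len(df), len(all_complete), len(age_present))
print(list(dense_columns.columns))
```

In this toy table a plain `dropna()` leaves no rows at all, which is the kind of loss the caution above refers to; checking the row count before and after is a cheap safeguard.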



## Lesson Summary and Practice
Well done! You have now explored the basics of handling missing data, an essential pre-processing step for any machine-learning model. The skill of dealing with missing data is a key arrow in any data scientist's quiver, ensuring that your data is clean and ready for modeling.

Get set for some upcoming practice sessions that will provide you with opportunities to apply and reinforce what you've learned today. Feel the thrill as we continue venturing deeper into the world of data processing! Nothing should be missing from your data now, so it's time to wield your new skills!



## Handle Missing Data in the Titanic Dataset



In [3]:
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify and display missing values
missing_values = titanic_df.isnull().sum()
print("Missing values before handling:\n", missing_values)

# Handle missing data by dropping the 'deck' column and imputing 'age'
titanic_df.drop(columns=['deck'], inplace=True)
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())

# Impute the 'embarked' and 'embark_town' columns with the most common value
most_common_embarked = titanic_df['embarked'].mode()[0]
titanic_df['embarked'] = titanic_df['embarked'].fillna(most_common_embarked)
most_common_embark_town = titanic_df['embark_town'].mode()[0]
titanic_df['embark_town'] = titanic_df['embark_town'].fillna(most_common_embark_town)

# Verify that missing data has been handled
missing_values_after = titanic_df.isnull().sum()
print("Missing values after handling:\n", missing_values_after)

Missing values before handling:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
Missing values after handling:
 survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64



## Update Titanic Dataset Handling Missing Data Code

Superb progress, Space Voyager!

Let's enhance our data imputation skills. In the provided starter code, you'll find a line where missing values in the 'embarked' column are filled with a placeholder. Your task is to modify this line to impute missing values with the most common 'embarked' category instead.

In [4]:
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify and print the number of missing values in the 'age' and 'embarked' columns
missing_values_age_embarked = titanic_df[['age', 'embarked']].isnull().sum()
print('Missing values in age and embarked columns:\n', missing_values_age_embarked)

# Impute the missing values in the 'age' column with the median age
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())

# Impute the missing values in the 'embarked' column with a placeholder value 'U' for Unknown
titanic_df['embarked'] = titanic_df['embarked'].fillna('U')

# Print the dataset info to confirm that there are no more missing values in 'age' and 'embarked'
# (info() prints its report directly and returns None, so it needs no print() wrapper)
print('\nDataset information post-imputation:')
titanic_df.info()

Missing values in age and embarked columns:
 age         177
embarked      2
dtype: int64

Dataset information post-imputation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     891 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float6



## Something is missing

Good job navigating the sea of data, Space Voyager! Now, let's put your skills to the test. Fill in the blanks to impute the missing ages, and clean up the dataset by removing a column that's mostly empty.

In [6]:
import seaborn as sns
import pandas as pd

# Load the dataset
titanic = sns.load_dataset('titanic')

# Find the number of missing values in each column
missing_values_before = titanic.isnull().sum()
print("Missing values before handling:")
print(missing_values_before)

# Replace missing data in 'age' column with the median
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Remove a column with too many missing values to salvage ('deck' in this case)
titanic.drop(columns='deck', inplace=True)

# Verify the handling by checking for missing values again
missing_values_after = titanic.isnull().sum()
print("\nMissing values after handling:")
print(missing_values_after)

# Optionally, show the info of the dataset to visualize the changes
print("\nDataset information after handling missing data:")
titanic.info()


Missing values before handling:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Missing values after handling:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64

Dataset information after handling missing data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age         



## Data Cleaning in Titanic Dataset
Great job handling the missing values, Space Explorer! However, the code you have isn't acting as expected. It's generating an error when trying to handle missing values in the 'age' column. Can you spot the glitch and adjust the thrusters so we can ensure a smooth data preprocessing journey?

In [9]:
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Drop the 'deck' column due to excessive missing values
titanic_df_cleaned = titanic_df.drop(columns=['deck'])

# Impute the missing 'age' values with the median age
median_age = titanic_df_cleaned['age'].median()
titanic_df_cleaned['age'] = titanic_df_cleaned['age'].fillna(median_age)

# Impute the missing 'embarked' values with the mode
mode_embarked = titanic_df_cleaned['embarked'].mode()[0]
titanic_df_cleaned['embarked'] = titanic_df_cleaned['embarked'].fillna(mode_embarked)

# Impute the missing 'embark_town' values with the mode
mode_embark_town = titanic_df_cleaned['embark_town'].mode()[0]
titanic_df_cleaned['embark_town'] = titanic_df_cleaned['embark_town'].fillna(mode_embark_town)

# Check for remaining missing values
missing_values_after = titanic_df_cleaned.isnull().sum()
print("Missing values after handling:")
print(missing_values_after)


Missing values after handling:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

