# Introduction

Our destination for today's learning journey is Outlier Detection in Passenger Data. We'll be delving into the vast pool of machine learning data preparation, with a special emphasis on the Titanic Dataset. So, why are we focusing on outliers?

Outliers are data points that deviate significantly from the other data points in our dataset. They can drastically influence the outcomes of our data analysis and machine learning models, possibly leading to erroneous results. While exploring the Titanic Dataset, we may encounter outliers such as extreme ages or abnormally high ticket prices.

In this lesson, we'll use Python and the Pandas library to detect outliers and handle them appropriately. Our itinerary includes understanding the concept of outliers, learning various techniques for their detection, and then exploring strategies to handle them effectively.

The three common methods for outlier detection are **Z-score** (identifying data points whose Z-score has an absolute value greater than 3 as outliers), **IQR** (defining outliers as observations outside the range from $Q1 - 1.5 \cdot IQR$ to $Q3 + 1.5 \cdot IQR$), and **Standard Deviation** (categorizing data points more than three standard deviations from the mean as outliers).

# Outlier Detection Methods

Let's now understand these methods better and see how we can apply Z-score, IQR, and Standard Deviation for outlier detection:

**Z-score**: It describes a data point's relationship to the mean of a group of data points. Z-score is measured in terms of standard deviations from the mean. If a Z-score is `0.0`, it indicates that the data point's score is identical to the mean score. A Z-score of `1.0` represents a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.

```python
import numpy as np
data = titanic_df['fare']
mean = np.mean(data) # calculates the mean
std_dev = np.std(data) # calculates the standard deviation
Z_scores = (data - mean) / std_dev # computes the Z-scores
outliers = data[np.abs(Z_scores) > 3] # finds all the data points that are 3 standard deviations away from the mean
```

**IQR**: The interquartile range (**IQR**) is a measure of statistical dispersion or, in simpler terms, the range within which the central half of the data lies. IQR is calculated as $Q3 - Q1$, where $Q3$ is the third quartile and $Q1$ is the first quartile.

```python
Q1 = titanic_df['fare'].quantile(0.25) # calculates the first quartile
Q3 = titanic_df['fare'].quantile(0.75) # calculates the third quartile
IQR = Q3 - Q1 # computes the IQR

# Below, we find all the data points that fall below the lower bound or above the upper bound
outliers = titanic_df['fare'][
    (titanic_df['fare'] < (Q1 - 1.5 * IQR)) |
    (titanic_df['fare'] > (Q3 + 1.5 * IQR))
]
```
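
To check the bounds by hand, here is a minimal sketch on a small made-up series (the values are illustrative and not taken from the Titanic data). It assumes pandas' default linear interpolation for `quantile`.

```python
import pandas as pd

# Hypothetical toy values, chosen so the quartiles are easy to verify by hand
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)  # 3.0 with pandas' default linear interpolation
q3 = values.quantile(0.75)  # 7.0
iqr = q3 - q1               # 4.0

lower_bound = q1 - 1.5 * iqr  # -3.0
upper_bound = q3 + 1.5 * iqr  # 13.0

# Keep only the values that fall outside the two bounds
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers.tolist())  # [100]
```

Only `100` lies outside the interval from -3 to 13, so it is the single value flagged.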

**Standard Deviation**: This method identifies outliers based on their distance from the mean measured in standard deviations. Unlike the Z-score method, which converts data into standardized Z-scores (mean of zero and standard deviation of one), the standard deviation method does not standardize the data; instead, it directly flags data points as outliers if they are more than a certain number of standard deviations away from the mean (commonly three). Because this comparison is algebraically the same as checking whether the absolute Z-score exceeds the threshold, both methods flag the same points when they use the same mean and standard deviation. This approach is straightforward but may be unreliable if the data distribution is skewed or not normal.

```python
mean = np.mean(titanic_df['fare']) # calculates the mean
standard_deviation = np.std(titanic_df['fare']) # calculates the standard deviation
outliers = titanic_df['fare'][np.abs(titanic_df['fare'] - mean) > 3 * standard_deviation] # finds all the data points that are 3 standard deviations away from the mean
```
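
The relationship between the two rules can be checked directly. The sketch below compares the Z-score rule and the standard deviation rule on made-up fares (illustrative values, not Titanic data), using the population standard deviation in both cases as an assumption that matches `np.std`'s default.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: nineteen ordinary values and one extreme value
fares = pd.Series([7.25, 8.05, 7.9, 13.0, 26.0, 10.5, 7.75, 8.66, 15.5, 9.0,
                   7.23, 12.35, 14.45, 21.0, 8.05, 7.9, 10.5, 13.0, 26.55, 512.33])

mean = fares.mean()
std_dev = fares.std(ddof=0)  # population standard deviation, like np.std's default

# Z-score rule: standardize first, then compare with the threshold
z_mask = np.abs((fares - mean) / std_dev) > 3

# Standard deviation rule: compare the raw distance with 3 standard deviations
sd_mask = np.abs(fares - mean) > 3 * std_dev

print(z_mask.equals(sd_mask))   # True for this data
print(fares[sd_mask].tolist())  # [512.33]
```

Both masks select the same single value here, which is what the algebra suggests: dividing both sides of the comparison by a positive standard deviation does not change its outcome.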

# Outlier Detection in Titanic Dataset

Having understood the techniques of outlier detection, we shall now apply them to the `'age'` and `'fare'` variables of the Titanic dataset to identify any potential outliers. These two attributes are continuous numerical variables that can often include outliers, whether from data entry errors or from genuinely extreme values.

```python
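import pandas as pd
import numpy as np
import seaborn as sns

# Load the Titanic dataset (assumed to come from seaborn, as in the exercises)
titanic_df = sns.load_dataset('titanic')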

# Outlier detection - 'Age'
mean_age = np.mean(titanic_df['age']) # calculates the mean
std_dev_age = np.std(titanic_df['age']) # calculates the standard deviation
Z_scores_age = (titanic_df['age'] - mean_age) / std_dev_age # computes the Z-scores
outliers_age = titanic_df['age'][np.abs(Z_scores_age) > 3] # finds all the data points that are 3 standard deviations away from the mean
print("Outliers in 'Age' using Z-score: \n", outliers_age)

# Outlier detection - 'Fare'
mean_fare = np.mean(titanic_df['fare']) # calculates the mean
std_dev_fare = np.std(titanic_df['fare']) # calculates the standard deviation
Z_scores_fare = (titanic_df['fare'] - mean_fare) / std_dev_fare # computes the Z-scores
outliers_fare = titanic_df['fare'][np.abs(Z_scores_fare) > 3] # finds all the data points that are 3 standard deviations away from the mean
print("\nOutliers in 'Fare' using Z-score: \n", outliers_fare)
```

The output of the above code will be:

```markdown
Outliers in 'Age' using Z-score: 
630    80.0
851    74.0
Name: age, dtype: float64

Outliers in 'Fare' using Z-score: 
 27     263.0000
88     263.0000
118    247.5208
258    512.3292
299    247.5208
311    262.3750
341    263.0000
377    211.5000
380    227.5250
438    263.0000
527    221.7792
557    227.5250
679    512.3292
689    211.3375
700    227.5250
716    227.5250
730    211.3375
737    512.3292
742    262.3750
779    211.3375
Name: fare, dtype: float64
```

# Handling Outliers

After identifying outliers, we'll need to decide on the best strategy for handling them, such as:

1. **Dropping**: If the outlier does not add valuable information or is significantly skewing our data, one option to consider is dropping the outlier.
2. **Capping**: We could also consider replacing the outlier value with a certain maximum and/or minimum value.
3. **Transforming**: Techniques such as log transformations are especially effective when dealing with skewed data. This type of transformation can reduce the impact of the outliers.
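
The first and third strategies in the list above can be sketched on a small made-up series (illustrative values, not Titanic data). The sketch uses the IQR upper bound to decide what to drop and `np.log1p` (the logarithm of 1 + x, which is defined at 0) as one possible transformation; both choices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one extreme value
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 31.0, 52.0, 512.33])

# Dropping: keep only the values at or below the IQR upper bound
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)
kept = fares[fares <= upper_bound]

# Transforming: log1p compresses large values far more than small ones
log_fares = np.log1p(fares)

print(len(fares), len(kept))            # number of values before and after dropping
print(fares.max(), log_fares.max())     # the extreme value before and after the transform
```

Dropping removes information along with the extreme value, while the transformation keeps every row and only shrinks the gap between the extreme value and the rest.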

Let's go ahead and cap the outliers for the `'age'` and `'fare'` variables:

```python
# Drop rows with missing 'age' values
titanic_df = titanic_df.dropna(subset=['age'])

# Calculate the upper bound for 'age'
Q1 = titanic_df['age'].quantile(0.25)
Q3 = titanic_df['age'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

# Cap the outliers for 'age'
titanic_df['age'] = np.where(titanic_df['age'] > upper_bound, upper_bound, titanic_df['age'])

# Calculate the upper bound for 'fare'
Q1 = titanic_df['fare'].quantile(0.25)
Q3 = titanic_df['fare'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

# Cap the outliers for 'fare'
titanic_df['fare'] = np.where(titanic_df['fare'] > upper_bound, upper_bound, titanic_df['fare'])
```

In this code, `np.where` is used to replace values in the `'age'` and `'fare'` columns of the `titanic_df` dataframe with the upper bound if they exceed it, effectively capping outliers at the upper bound.
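
pandas also provides `Series.clip`, which caps values at a lower and/or upper limit in a single call. The sketch below is an alternative formulation on made-up ages (illustrative values, not Titanic data), not the approach used in this lesson.

```python
import pandas as pd

# Hypothetical ages with one unusually high value
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0, 80.0, 110.0])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip returns a new Series; values beyond a bound are replaced by that bound
capped = ages.clip(lower=lower_bound, upper=upper_bound)

print(capped.max() <= upper_bound)  # True
```

Because `clip` returns a new Series, the original `ages` stays unchanged until the result is assigned back.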

There we go! Now that we have identified and suitably handled outliers in the Titanic dataset, we are one step closer to building a machine-learning model that predicts survival rates with high accuracy.

# Lesson Summary and Practice

Well done on completing this lesson on outlier detection! We started by understanding what outliers are, why they matter, and how they can influence our machine learning model. We then learned about three popular outlier detection methods (Z-score, IQR, and Standard Deviation) and saw how to implement them on the Titanic dataset.

The journey doesn't end here. The next stop on our course is Data Transformation for Passenger Features. Before that, let's take some time to reflect on what we've learned today and hone your skills with a few practice exercises. It's the best way to consolidate today's learning and progress in our quest for data cleaning and preprocessing. Happy learning!

# Exercises

## Exercise 1

Ready for a slight twist? Let's adjust the threshold for detecting outliers in the Titanic dataset. You will need to modify the Z-score threshold in the existing code to identify milder outliers in the 'age' and 'fare' columns. Change the threshold from 3 to 2.5 and observe the differences in outlier detection.
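
To see how the threshold changes the result, here is a minimal sketch with a hypothetical helper, `zscore_outliers`, applied to synthetic data (not the Titanic dataset). The helper name, its parameters, and the random "ages" are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the values whose absolute Z-score exceeds `threshold`."""
    z_scores = (series - series.mean()) / series.std(ddof=0)
    return series[np.abs(z_scores) > threshold]

# Hypothetical, synthetic "ages" drawn from a normal distribution
rng = np.random.default_rng(seed=0)
ages = pd.Series(rng.normal(loc=30, scale=10, size=500))

strict = zscore_outliers(ages, threshold=3.0)
mild = zscore_outliers(ages, threshold=2.5)

# Every value flagged at 3.0 is also flagged at 2.5, so mild is at least as large
print(len(strict), len(mild))
```

Lowering the threshold can only add points to the flagged set, never remove them.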

In [1]:

```python
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# Ensure that 'age' and 'fare' columns do not have null values
titanic_df = titanic_df.dropna(subset=['age', 'fare'])

# Compute Z-scores for 'age' and 'fare'
mean_age = np.mean(titanic_df['age'])
std_dev_age = np.std(titanic_df['age'])
Z_scores_age = (titanic_df['age'] - mean_age) / std_dev_age
outliers_age = titanic_df['age'][np.abs(Z_scores_age) > 3]

mean_fare = np.mean(titanic_df['fare'])
std_dev_fare = np.std(titanic_df['fare'])
Z_scores_fare = (titanic_df['fare'] - mean_fare) / std_dev_fare
outliers_fare = titanic_df['fare'][np.abs(Z_scores_fare) > 3]

# Print the outliers
print("Outliers in 'Age' using Z-score: \n", outliers_age)
print("\nOutliers in 'Fare' using Z-score: \n", outliers_fare)
```

```markdown
Outliers in 'Age' using Z-score: 
 630    80.0
851    74.0
Name: age, dtype: float64

Outliers in 'Fare' using Z-score: 
 27     263.0000
88     263.0000
118    247.5208
258    512.3292
299    247.5208
311    262.3750
341    263.0000
377    211.5000
380    227.5250
438    263.0000
679    512.3292
689    211.3375
700    227.5250
716    227.5250
730    211.3375
737    512.3292
742    262.3750
779    211.3375
Name: fare, dtype: float64
```


In [2]:

```python
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# Ensure that 'age' and 'fare' columns do not have null values
titanic_df = titanic_df.dropna(subset=['age', 'fare'])

# Compute Z-scores for 'age' and 'fare'
mean_age = np.mean(titanic_df['age'])
std_dev_age = np.std(titanic_df['age'])
Z_scores_age = (titanic_df['age'] - mean_age) / std_dev_age
outliers_age = titanic_df['age'][np.abs(Z_scores_age) > 2.5]

mean_fare = np.mean(titanic_df['fare'])
std_dev_fare = np.std(titanic_df['fare'])
Z_scores_fare = (titanic_df['fare'] - mean_fare) / std_dev_fare
outliers_fare = titanic_df['fare'][np.abs(Z_scores_fare) > 2.5]

# Print the outliers
print("Outliers in 'Age' using Z-score: \n", outliers_age)
print("\nOutliers in 'Fare' using Z-score: \n", outliers_fare)
```

```markdown
Outliers in 'Age' using Z-score: 
 33     66.0
96     71.0
116    70.5
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: age, dtype: float64

Outliers in 'Fare' using Z-score: 
 27     263.0000
88     263.0000
118    247.5208
258    512.3292
299    247.5208
311    262.3750
341    263.0000
377    211.5000
380    227.5250
438    263.0000
679    512.3292
689    211.3375
700    227.5250
716    227.5250
730    211.3375
737    512.3292
742    262.3750
779    211.3375
Name: fare, dtype: float64
```


## Exercise 2

Our journey delves deeper into the cosmos of data. Your mission, should you choose to accept it, involves a keen eye for detail. Complete the code to calculate the Z-score for the 'fare' column of the Titanic dataset using the sample standard deviation. Will you rise to the challenge?
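
For context on the phrase "sample standard deviation": `np.std` divides by n by default (`ddof=0`, the population formula), passing `ddof=1` divides by n - 1 instead, and pandas' `Series.std` uses `ddof=1` by default. A minimal sketch on made-up numbers (illustrative values, not Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical values, small enough to check by hand
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values)        # ddof=0: divide by n
sample_std = np.std(values, ddof=1)    # ddof=1: divide by n - 1
pandas_default = values.std()          # pandas defaults to ddof=1

print(population_std)                          # 2.0
print(np.isclose(sample_std, pandas_default))  # True
```

The sample standard deviation is always slightly larger than the population one, so Z-scores computed with it are slightly smaller in magnitude.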

In [3]:

```python
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# Ensure that 'age' and 'fare' columns do not have null values
titanic_df = titanic_df.dropna(subset=['age', 'fare'])

# Compute Z-scores for 'age' and 'fare' using sample standard deviation (ddof=1)
mean_age = np.mean(titanic_df['age'])
std_dev_age = np.std(titanic_df['age'], ddof=1)
Z_scores_age = (titanic_df['age'] - mean_age) / std_dev_age
outliers_age = titanic_df['age'][np.abs(Z_scores_age) > 2.5]

# TODO: Calculate the mean and sample standard deviation for the 'fare' column, then compute the Z-scores.
mean_fare = np.mean(titanic_df['fare'])
std_dev_fare = np.std(titanic_df['fare'], ddof=1)
Z_scores_fare = (titanic_df['fare'] - mean_fare) / std_dev_fare
# Take the absolute value of the Z-scores first, then compare with the threshold
outliers_fare = titanic_df['fare'][np.abs(Z_scores_fare) > 2.5]

# Print the outliers
print("Outliers in 'Age' using Z-score: \n", outliers_age)
# TODO: Print the outliers for 'fare' using Z-scores.
print("Outliers in 'Fare' using Z-score: \n", outliers_fare)
```

```markdown
Outliers in 'Age' using Z-score: 
 96     71.0
116    70.5
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: age, dtype: float64
Outliers in 'Fare' using Z-score: 
 27     263.0000
88     263.0000
118    247.5208
258    512.3292
299    247.5208
311    262.3750
341    263.0000
377    211.5000
380    227.5250
438    263.0000
679    512.3292
689    211.3375
700    227.5250
716    227.5250
730    211.3375
737    512.3292
742    262.3750
779    211.3375
Name: fare, dtype: float64
```


## Exercise 3
You've learned various methods for detecting outliers, and now it's time to put that knowledge into practice! Write a complete program to identify outliers in the 'age' column of the Titanic dataset using the IQR method. Exercise caution, as each step will bring you closer to mastering data cleaning.

In [4]:

```python
# Import the necessary libraries
import numpy as np
import seaborn as sns

# TODO: Load the Titanic dataset and store it in a variable named 'titanic_df'

# TODO: Drop any rows with missing values in the 'age' column

# TODO: Calculate the first and third quartile of the 'age' column and store them in variables 'Q1_age' and 'Q3_age'

# TODO: Calculate the Interquartile Range (IQR) for the 'age' column and store it in a variable 'IQR_age'

# TODO: Using IQR, identify any age values that are outliers and store them in a variable called 'outliers_age'

# TODO: Output the outliers found in the 'age' column
```

In [5]:

```python
# Import the necessary libraries
import numpy as np
import seaborn as sns

# TODO: Load the Titanic dataset and store it in a variable named 'titanic_df'
titanic_df = sns.load_dataset("titanic")

# TODO: Drop any rows with missing values in the 'age' column
titanic_df = titanic_df.dropna(subset=['age'])

# TODO: Calculate the first and third quartile of the 'age' column and store them in variables 'Q1_age' and 'Q3_age'
Q1_age = titanic_df['age'].quantile(0.25)
Q3_age = titanic_df['age'].quantile(0.75)

# TODO: Calculate the Interquartile Range (IQR) for the 'age' column and store it in a variable 'IQR_age'
IQR_age = Q3_age - Q1_age

# TODO: Using IQR, identify any age values that are outliers and store them in a variable called 'outliers_age'
outliers_age = titanic_df['age'][
    (titanic_df['age'] < (Q1_age - 1.5*IQR_age)) |
    (titanic_df['age'] > (Q3_age + 1.5*IQR_age))
]

# TODO: Output the outliers found in the 'age' column
print("Outliers in 'Age' using IQR: \n", outliers_age)
```

```markdown
Outliers in 'Age' using IQR: 
 33     66.0
54     65.0
96     71.0
116    70.5
280    65.0
456    65.0
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: age, dtype: float64
```


## Final Exercise
Your next maneuver, Space Voyager, is to locate and handle the outliers. Use the IQR method you've mastered to analyze the 'age' variable in the Titanic dataset. First, make sure the 'age' column does not have null values. After identifying the outliers, handle them by capping them at the lower and upper bounds.

In [6]:

```python
# Import the necessary libraries
import numpy as np
import seaborn as sns

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# TODO: Ensure that 'age' column does not have null values

# TODO: Compute IQR for 'age' by finding the first and third quartile and their difference

# TODO: Using the computed IQR, identify and assign the outliers in 'age' to the variable outliers_age

# Print the outliers
print("Outliers in 'Age' using IQR: \n", outliers_age)

# TODO: Handle outliers by capping them at the upper and lower bounds using np.where

print("\nOutliers have been handled by capping at bounds.")
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
```

```markdown
Outliers in 'Age' using IQR: 
 33     66.0
54     65.0
96     71.0
116    70.5
280    65.0
456    65.0
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: age, dtype: float64

Outliers have been handled by capping at bounds.


NameError: name 'lower_bound' is not defined
```

In [7]:

```python
# Import the necessary libraries
import numpy as np
import seaborn as sns

# Load the Titanic Dataset
titanic_df = sns.load_dataset('titanic')

# TODO: Ensure that 'age' column does not have null values
titanic_df = titanic_df.dropna(subset=['age'])

# TODO: Compute IQR for 'age' by finding the first and third quartile and their difference
Q1_age = titanic_df['age'].quantile(0.25)
Q3_age = titanic_df['age'].quantile(0.75)

# TODO: Using the computed IQR, identify and assign the outliers in 'age' to the variable outliers_age
IQR_age = Q3_age - Q1_age
outliers_age = titanic_df['age'][
    (titanic_df['age'] < (Q1_age - 1.5*IQR_age)) |
    (titanic_df['age'] > (Q3_age + 1.5*IQR_age))
]
upper_bound = Q3_age + 1.5 * IQR_age
lower_bound = Q1_age - 1.5 * IQR_age

# Print the outliers
print("Outliers in 'Age' using IQR: \n", outliers_age)

# TODO: Handle outliers by capping them at the upper and lower bounds using np.where
titanic_df['age'] = np.where(titanic_df['age'] > upper_bound, upper_bound, titanic_df['age'])
titanic_df['age'] = np.where(titanic_df['age'] < lower_bound, lower_bound, titanic_df['age'])

print("\nOutliers have been handled by capping at bounds.")
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
```

```markdown
Outliers in 'Age' using IQR: 
 33     66.0
54     65.0
96     71.0
116    70.5
280    65.0
456    65.0
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: age, dtype: float64

Outliers have been handled by capping at bounds.
Lower bound: -6.69, Upper bound: 64.81
```
