> # Day18. Outliers 

![outlier1](./EDA_PIC/outlier_1.png)

![outlier2](./EDA_PIC/outlier_2.png)

![outlier3](./EDA_PIC/outlier_3.png)

![outlier4](./EDA_PIC/outlier_4.png)

![outlier5](./EDA_PIC/outlier_5.png)

![outlier6](./EDA_PIC/outlier_6.png)

![outlier7](./EDA_PIC/outlier_7.png)

> ## Outlier:
An outlier is a data point that differs significantly from the rest of the data. It doesn't follow the pattern or trend that most of the data follows.

`Example:` In a dataset of salaries like
`[30k, 35k, 32k, 500k]`, __500k__ is an outlier.

### Main Types:
1. __Global Outlier__:

__Meaning__: A data point that is very different from all the other data points in the dataset.

__Example__: In a list of people's ages [25, 27, 30, 29, 105], `105` is a global outlier: it is way outside the usual range.

2. __Contextual Outlier__:

__Meaning__: A data point that is considered an outlier only in a specific context (such as a season or a time of day).

__Example__: A temperature of 40°C is normal in summer but would be a contextual outlier in winter.

3. __Collective Outlier__:

__Meaning__: A group of data points that are normal individually but strange together.

__Example__: If a network suddenly shows a group of users logging in at _3 AM_ from the same location, it may be a collective outlier (it could signal a security issue), even if each login alone looks okay.
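The difference between a global and a contextual outlier can be sketched in a few lines. This is a minimal illustration, not a standard recipe: the temperature readings are hypothetical, the helper `iqr_fences` is a name invented here, and the 1.5 × IQR rule is just one common convention for drawing the fences.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical daily temperatures in Celsius, tagged with their season
readings = (
    [("winter", t) for t in [1, 2, 3, 3, 4, 4, 5, 40]]
    + [("summer", t) for t in [36, 37, 38, 39, 40, 40, 41, 42]]
)

# Global check: fences computed over every reading at once
g_low, g_high = iqr_fences([t for _, t in readings])
global_outliers = [(s, t) for s, t in readings if t < g_low or t > g_high]

# Contextual check: fences computed separately inside each season
contextual_outliers = []
for season in ("winter", "summer"):
    temps = [t for s, t in readings if s == season]
    low, high = iqr_fences(temps)
    contextual_outliers += [(season, t) for t in temps if t < low or t > high]

print("Global outliers:", global_outliers)
print("Contextual outliers:", contextual_outliers)
```

Looked at globally, 40 sits comfortably inside the overall range, so nothing is flagged. Once the readings are split by season, the winter 40 stands out while the summer 40s do not.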

![outlier8](./EDA_PIC/outlier_8.png)

![outlier9](./EDA_PIC/outlier_9.png)

![outlier10](./EDA_PIC/outlier_10.png)

![outlier11](./EDA_PIC/outlier_11.png)

> # How to Identify Outlier?
__Three Methods__:
1. _Using a Histogram_
2. _Using a Box Plot (Interquartile Range)_
3. _Z-Score Method_
![outlier_12](./EDA_PIC/outlier_12.jpg)

![outlier_13](./EDA_PIC/outlier_13.jpg)

![outlier14](./EDA_PIC/outlier_14.png)
> ### IQR = Q_3 - Q_1
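Written out, the box plot rule looks as follows. The multiplier 1.5 is the usual convention (and the one used in the code later in these notes), but it is a choice rather than a fixed law:

```latex
\begin{aligned}
\text{IQR} &= Q_3 - Q_1 \\
\text{lower fence} &= Q_1 - 1.5 \times \text{IQR} \\
\text{upper fence} &= Q_3 + 1.5 \times \text{IQR} \\
x \text{ is an outlier} &\iff x < \text{lower fence} \ \text{ or } \ x > \text{upper fence}
\end{aligned}
```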

![outlier15](./EDA_PIC/outlier_15.png)

![outlier16](./EDA_PIC/outlier_16.png)

> ## How to deal with Outliers?

![outlier17](./EDA_PIC/outlier_17.png)

![outlier18](./EDA_PIC/outlier_18.jpg)

![outlier19](./EDA_PIC/outlier_19.png)


> # Identifying and removing Outliers using IQR (Interquartile Range)

In [1]:
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
print(titanic.head())

# Calculate the IQR for the 'age' column
Q1 = titanic['age'].quantile(0.25)
Q3 = titanic['age'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for the outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers (rows with a missing age are dropped too,
# because comparisons against NaN evaluate to False)
titanic_no_outliers = titanic[(titanic['age'] >= lower_bound) & (titanic['age'] <= upper_bound)]

# Display the first few rows of the dataset without outliers
print(titanic_no_outliers.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  
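One detail of the filter above is easy to miss: a row whose `age` is missing fails both comparisons, so it is removed along with the real outliers. The sketch below shows one way to keep such rows. It uses a small hypothetical DataFrame instead of the Titanic data, and `iqr_bounds` is a helper name made up for this example.

```python
import numpy as np
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences; Series.quantile skips NaN values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical ages with one missing value and one extreme value
df = pd.DataFrame({"age": [22.0, 38.0, 26.0, np.nan, 35.0, 29.0, 31.0, 95.0]})

lower, upper = iqr_bounds(df["age"])
in_range = df["age"].between(lower, upper)  # NaN compares as False

strict = df[in_range]                       # drops the outlier AND the NaN row
lenient = df[in_range | df["age"].isna()]   # drops only the outlier

print("Bounds:", lower, upper)
print("Rows kept (strict):", len(strict))
print("Rows kept (lenient):", len(lenient))
```

Which variant is right depends on how missing values are handled later in the analysis; the point is only that the choice should be deliberate.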

## Causes of Outliers

`Data Entry Errors:` Human mistakes made while collecting, recording, or entering data can produce outliers.

`Measurement Error:` Faulty equipment or experimenter error can produce readings that are far from the true value.

`Experimental Error:` For example, in a controlled environment, an unforeseen factor might disrupt an experiment leading to anomalous results.

`Intentional Outlier:` These are sometimes introduced to test detection methods.

`Sampling Errors:` For instance, during sample collection or extraction, certain unusual samples might be picked.

`Natural Outlier:` They don’t necessarily represent any anomaly. For instance, in a class of students, one student may genuinely be extraordinarily tall or short.

__Why should we care about outliers?__

Outliers matter for several reasons:

`Hidden Clues:` Outliers often whisper important clues. They could be hints of hidden patterns that could change the way you understand your data.

`Quality Check:` Outliers can signal data quality issues. Are they real anomalies, or are they just mistakes in how the data was collected?

`Real-World Impact:` In fields like fraud detection, finance, and healthcare, outliers often represent real-world events that need your attention.

__Detecting Outliers__

`Visualization tools:` Box plots, scatter plots, and histograms can be used to spot outliers.

`Statistical Tests:` The Z-score or IQR (Interquartile Range) and Percentile Methods can be used to identify outliers.

`Machine Learning algorithms:` There are algorithms like DBSCAN and Isolation Forest that can be used to detect outliers.

Now, let’s explore how to find these unusual data points.

> **The Z-Score Method**

Imagine the Z-score as your detective tool. It measures how many standard deviations a data point lies from the mean: the larger its absolute value, the more unusual the point.

Install the required libraries:
`pip install numpy`

`pip install scipy`

In [1]:
# Import libraries
import numpy as np
from scipy import stats

# Sample data
data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 100.0]

# Calculate the Z-score for each data point
z_scores = np.abs(stats.zscore(data))

# Set a threshold for identifying outliers
threshold = 2.5

# Find outliers
outliers = np.where(z_scores > threshold)[0]

print("Indices of Outliers:", outliers)

Indices of Outliers: [9]
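To see what the Z-score is doing under the hood, here is a sketch that computes it by hand with the standard library only. It assumes the population standard deviation, which is also the default used by `scipy.stats.zscore` (`ddof=0`).

```python
from statistics import mean, pstdev

data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 100.0]

mu = mean(data)
sigma = pstdev(data)  # population standard deviation (ddof=0)

# z = (x - mean) / standard deviation, for every data point
z_scores = [(x - mu) / sigma for x in data]

threshold = 2.5
outlier_indices = [i for i, z in enumerate(z_scores) if abs(z) > threshold]

print("Largest |z|:", max(abs(z) for z in z_scores))
print("Indices of Outliers:", outlier_indices)
```

Note that the largest |z| here stays just below 3. With the population standard deviation, no point in a sample of n values can have |z| above √(n − 1), so for ten values the popular cut-off of 3 could never fire. That is one reason a lower threshold such as 2.5 makes sense for small samples.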


> **IQR (Interquartile Range) – Data Detective Work**

The Interquartile Range (IQR) is the spread between the first quartile (Q1) and the third quartile (Q3). Values that fall more than 1.5 × IQR below Q1 or above Q3 are flagged as outliers.

In [None]:
# Import library
import numpy as np

# Sample data
data = [10, 15, 20, 25, 30, 35, 40, 45, 50, 100]

# Calculate the IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

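# Set the multiplier for the IQR fences
threshold = 1.5
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr

# Find outliers: values below the lower fence or above the upper fence
outliers = [x for x in data if x < lower_bound or x > upper_bound]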

print("Outliers:", outliers)
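The percentile method listed among the statistical tests can be sketched the same way. The 5th and 95th percentile cut-offs below are hypothetical values picked for illustration; in practice they depend on the dataset.

```python
import numpy as np

# Sample data (same values as the IQR example)
data = [10, 15, 20, 25, 30, 35, 40, 45, 50, 100]

# Hypothetical cut-offs: flag anything outside the 5th-95th percentile band
low, high = np.percentile(data, [5, 95])

outliers = [x for x in data if x < low or x > high]

print("Band:", low, high)
print("Outliers:", outliers)
```

Unlike the IQR rule, fixed percentile cut-offs always flag roughly the same share of points. Here the unremarkable value 10 is flagged together with 100, so the result needs a sanity check before anything is removed.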

> **Clustering (K-means)**

Clustering techniques like K-means can be used to identify outliers by grouping data points into clusters. Points that end up in very small clusters, or that lie far from their cluster's centroid, are candidate outliers.

In [None]:
# Import library
from sklearn.cluster import KMeans

# Sample data
data = [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]

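# Create a K-means model with two clusters
# (fixed seed so the result is repeatable)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)

# Predict cluster labels
labels = kmeans.predict(data)

# Treat the points in cluster 1 as outliers. K-means numbers its clusters
# arbitrarily, so check which cluster is the unusual one before relying on this.
outliers = [data[i] for i, label in enumerate(labels) if label == 1]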

print("Outliers:", outliers)
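Because cluster numbers are arbitrary, a less fragile variant picks the smallest cluster instead of a fixed label. The sketch below assumes hypothetical sample data with five nearby points and one distant point; it illustrates the idea and is not a general-purpose detector.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical sample: five nearby points and one far-away point
data = [[2, 2], [3, 3], [3, 4], [2, 3], [4, 3], [30, 30]]

# Fixed seed so the clustering is repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

# Count the members of each cluster and pick the smallest one
counts = np.bincount(labels)
smallest = int(np.argmin(counts))

# Points in the smallest cluster are the outlier candidates
outliers = [point for point, label in zip(data, labels) if label == smallest]

print("Cluster sizes:", counts.tolist())
print("Outliers:", outliers)
```

This only works when the outliers really do form a tiny cluster of their own. If the two clusters are of similar size, as in a dataset with two equally large groups, neither should be called an outlier.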

> **Machine Learning Algorithms (Isolation Forest)**

The Isolation Forest is an algorithm specifically designed for anomaly detection. It builds an ensemble of random isolation trees; outliers tend to be isolated after fewer splits, so their average path length is shorter than that of normal data points.

In [None]:
# Import library
from sklearn.ensemble import IsolationForest

# Sample data
data = [[2], [3], [4], [30], [31], [32]]

# Create an Isolation Forest model
clf = IsolationForest(contamination=0.2, random_state=0)  # fixed seed for repeatable results

# Fit the model
clf.fit(data)

# Predict outliers
outliers = [data[i] for i, pred in enumerate(clf.predict(data)) if pred == -1]

print("Outliers:", outliers)
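DBSCAN, the other algorithm named earlier, labels points that do not belong to any dense region as noise (label `-1`). A minimal sketch follows; the sample data and the `eps` and `min_samples` values are hypothetical and were chosen to suit this toy example.

```python
from sklearn.cluster import DBSCAN

# Hypothetical sample: two dense groups and one isolated value
data = [[2], [3], [4], [30], [31], [32], [100]]

# eps: neighbourhood radius; min_samples: points needed to form a dense region
dbscan = DBSCAN(eps=3, min_samples=2)
labels = dbscan.fit_predict(data)

# DBSCAN marks noise points with the label -1
outliers = [point for point, label in zip(data, labels) if label == -1]

print("Labels:", labels.tolist())
print("Outliers:", outliers)
```

Unlike K-means, DBSCAN does not need the number of clusters in advance, but its results are sensitive to `eps`, which has to match the scale of the data.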

## **Handling Outliers**

1. `Removing the outlier:` Detected outliers are dropped from the dataset. This is the simplest approach and fits best when the outliers are clearly errors.

2. `Transforming and binning values:` Outliers can be transformed to bring them within a range. Techniques like log transformation or square root transformation can be used.

3. `Imputation:` Outliers can also be replaced with mean, median, or mode values.

4. `Separate treatment:` In some use-cases, it’s beneficial to treat outliers separately rather than removing or imputing them.

5. `Robust Statistical Methods:` Some statistical methods, such as the median, are less sensitive to outliers and give more reliable results when outliers are present.
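As a quick illustration of the last point, the sketch below compares the mean and the median on the salary example from the start of these notes (values in thousands).

```python
from statistics import mean, median

# Salaries in thousands; 500 is the outlier
salaries = [30, 35, 32, 500]
without_outlier = [30, 35, 32]

# The mean is pulled far towards the outlier
print("Mean with outlier:", mean(salaries))
print("Mean without outlier:", mean(without_outlier))

# The median barely moves
print("Median with outlier:", median(salaries))
print("Median without outlier:", median(without_outlier))
```

A single extreme salary moves the mean from about 32 to about 149, while the median shifts only from 32 to 33.5.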

> **Once you’ve found these unusual data points, what should you do with them?**

### **1. Removing Outliers – Cutting Loose Ends**

Removing outliers is like tidying up your dataset. If they don’t belong in the story you’re telling, consider leaving them out.

In [None]:
# Sample data
data = [2, 3, 4, 30, 31, 32]

# Set a threshold for identifying outliers
threshold = 5

# Remove outliers: keep only the values at or below the threshold
data_no_outliers = [x for x in data if x <= threshold]

print("Data without outliers:", data_no_outliers)

### **2. Data Transformation – Changing the Shape**

Data transformation is like giving your data a new shape. Techniques like the logarithmic transformation compress large values, so extreme points have less influence on the analysis.

In [None]:
# Import numpy
import numpy as np

# Sample data
data = [2, 3, 4, 30, 31, 32]

# Apply a logarithmic transformation to mitigate the impact of outliers
data_transformed = [np.log(x) for x in data]

print("Transformed data:", data_transformed)
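Another way to bring outliers within a range is to cap them at fixed limits. The sketch below caps values at the 1.5 × IQR fences; using those fences as the limits is an assumption made for this example, and other limits (such as chosen percentiles) work the same way.

```python
import numpy as np

# Sample data with one extreme value
data = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 100])

# Compute the IQR fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap every value to the [lower, upper] range; in-range values are unchanged
capped = np.clip(data, lower, upper)

print("Fences:", lower, upper)
print("Capped data:", capped)
```

Capping keeps every row in the dataset, which helps when the sample is small, at the cost of distorting the capped values.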

### **3. Imputation – Data Resurrection**

Imputation involves replacing outlier values with more representative values, such as the mean or median of the non-outlier data points.

In [None]:
# Import numpy
import numpy as np

# Sample data
data = [2, 3, 4, 30, 31, 32]

# Set a threshold for identifying outliers
threshold = 5

# Replace outliers with the median of the non-outlier values
median = np.median([x for x in data if x <= threshold])
data_imputed = [x if x <= threshold else median for x in data]

print("Imputed data:", data_imputed)

### **Outliers in Real-World Applications**

Outliers are pervasive in various industries:

1. **Finance**

In financial analysis, outliers can indicate market anomalies or financial irregularities.

2. **Healthcare**

Outliers in healthcare data can signify rare diseases or extreme patient outcomes.

3. **Environmental Monitoring**

Anomalies in environmental data can point to unusual events, like natural disasters.

### **Best Practices for Handling Outliers**

To effectively manage outliers in your machine learning projects, consider the following best practices:

- **Understand the Domain**

Familiarize yourself with the domain you’re working in to distinguish meaningful outliers from noise.

- **Use Multiple Techniques**

Combine outlier detection methods to ensure robust results.

- **Consider Impact**

Evaluate the impact of different outlier treatment methods on your specific problem and dataset.

- **Document Your Process**

Keep a clear record of how you handle outliers for transparency and reproducibility.
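The advice to combine methods can be sketched by running two detectors and comparing what they flag. This is one possible way to do it, not a fixed recipe: `zscore_flags` and `iqr_flags` are helper names invented here, and the thresholds are the ones used earlier in these notes.

```python
from statistics import mean, pstdev, quantiles


def zscore_flags(values, threshold=2.5):
    """Indices whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return {i for i, x in enumerate(values) if abs(x - mu) / sigma > threshold}


def iqr_flags(values, k=1.5):
    """Indices that fall outside the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {i for i, x in enumerate(values) if x < q1 - k * iqr or x > q3 + k * iqr}


data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 100.0]

by_z = zscore_flags(data)
by_iqr = iqr_flags(data)

agreed = sorted(by_z & by_iqr)    # flagged by both methods
disputed = sorted(by_z ^ by_iqr)  # flagged by only one method

print("Flagged by both:", agreed)
print("Flagged by only one:", disputed)
```

Points flagged by both methods are the strongest candidates. Points flagged by only one deserve a closer look with domain knowledge before any treatment is applied.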

### **Conclusion**

Outliers are observations that deviate dramatically from the rest of the data points in a dataset. They might arise from data-gathering mistakes or abnormalities, or they can be genuine findings that are simply rare or extreme.
If outliers are not properly accounted for, they can produce misleading, inconsistent, and erroneous results. Identifying and dealing with them is therefore critical to accurate and useful data analysis.
Outliers can be detected with a variety of methods, including the percentile approach, the IQR method, and the Z-score method. They can be handled in a variety of ways, including removal, transformation, and imputation.
As you venture further into the world of data, don’t shy away from outliers. They are the remarkable characters of your dataset, each with a unique story to share. Embrace your curiosity and let the secrets hidden in your data come to light. Understanding the mysteries of outliers is not just a science; it’s an art.