# Outliers

Outliers are data points that are significantly different from other data points in a dataset, and can affect the accuracy and reliability of data analysis.

They can be identified with methods such as the Z-score, the interquartile range (IQR), or a box plot, and handled by removing them or treating them separately in the analysis.

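Outliers can affect an analysis in several ways:

1. They can degrade the performance of a machine learning model.
2. They skew the data (e.g., the mean is pulled toward the outliers).
3. They can lead to wrong results.

For these reasons, outliers need to be dealt with before the analysis.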
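
As a minimal sketch of the skew described above, the snippet below compares the mean and the median of a small sample with and without one extreme value. The ages are illustrative values chosen for this example. The mean shifts noticeably, while the median barely moves.

```python
import numpy as np

# Illustrative ages; the final value (50) is the extreme one
ages = np.array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50])
ages_without = ages[:-1]

mean_with, mean_without = ages.mean(), ages_without.mean()
median_with, median_without = np.median(ages), np.median(ages_without)

print(f"Mean:   {mean_with:.2f} with the outlier, {mean_without:.2f} without")
print(f"Median: {median_with:.2f} with the outlier, {median_without:.2f} without")
```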

Other names for _outliers_:

- Deviants.
- Abnormalities.
- Anomalous Points.
- Aberrant Observations.

### Types of Outliers

1. **Univariate:** An extreme value in a single variable. For example, if your data has only the age variable, an unusually high or low age would be a univariate outlier.

2. **Multivariate:** An unusual combination of values across two or more variables, even if each value looks unremarkable on its own. For example, if your data includes both age and income, a very young person with a very high income could be a multivariate outlier.

3. **Global:** A data point that deviates from the entire dataset. For instance, an age or income far outside the range of all other data points would be a global outlier.

4. **Local:** A data point that deviates only relative to its own cluster or neighborhood. For example, an income that is ordinary across the whole dataset but unusual within a particular age group would be a local outlier.

5. **Point:** A single data point that lies far from the rest. For instance, one record that deviates significantly from the others in terms of age and income would be a point outlier. This term is often used interchangeably with global outlier.

6. **Contextual:** A value that is anomalous only in a specific context, such as a time period, location, or group. For example, an income that is normal for a 40-year-old may be unusual for a 16-year-old.

7. **Collective:** A group of data points that together deviate from the expected pattern, even if no single point is unusual on its own. For instance, a cluster of records with the same odd combination of age and income would be collective outliers.

8. **Recurrent:** Outliers that occur repeatedly within specific clusters. For example, consistent outliers in age and income within certain groups of data points would be recurrent outliers.

9. **Periodic:** Outliers that appear regularly or at specific intervals within certain clusters or contexts. For instance, outliers in age and income that show up at fixed intervals within certain groups of data points would be periodic outliers.
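
To illustrate the difference between univariate and multivariate outliers, here is a minimal sketch using hypothetical age and income pairs. The last row has an ordinary age and an ordinary income, so per-column Z-scores do not flag it, but the combination is unusual and the Mahalanobis distance picks it up. The data, the 97.5% chi-square cutoff, and the variable names are assumptions made for this example, and the cutoff presumes roughly Gaussian data.

```python
import numpy as np

# Hypothetical (age, income in thousands) pairs: income rises with age,
# except for the last row, which pairs a low age with a high income.
X = np.array([
    [20, 20], [25, 26], [30, 29], [35, 36], [40, 39],
    [45, 46], [50, 49], [55, 56], [60, 59], [25, 56],
], dtype=float)

# Univariate check: per-column Z-scores (population standard deviation)
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
univariate_flags = np.where((z > 3).any(axis=1))[0]

# Multivariate check: squared Mahalanobis distance from the sample mean
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Assumed cutoff: 97.5% quantile of a chi-square distribution with 2 degrees
# of freedom, which has the closed form -2 * ln(1 - p)
cutoff = -2 * np.log(1 - 0.975)
multivariate_flags = np.where(d2 > cutoff)[0]

print("Flagged by univariate Z-score:", univariate_flags)
print("Flagged by Mahalanobis distance:", multivariate_flags)
```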
  

### Causes of Outliers

- Data Entry Errors.
- Measurement Errors.
- Experimental Errors.
- Intentional Outliers.
- Data Processing Errors.
- Sampling Errors.
- Natural Outliers.

### Detect and Remove Outliers

1. Z-Score
2. IQR
3. DBSCAN
4. Isolation Forest
5. Local Outlier Factor
6. Elliptic Envelope
7. One-Class SVM
8. Mahalanobis Distance
9. Robust Random Cut Forest
10. Histogram-based Outlier Score
11. K-Nearest Neighbors
12. K-Means Clustering
13. Local Correlation Integral

![Z-Score](./images/z-score.png)
![IQR](./images/iqr.png)
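
Written out, the first two methods in the list reduce to the formulas below. This is a summary sketch: $\mu$ and $\sigma$ denote the mean and standard deviation of the variable, $Q_1$ and $Q_3$ its first and third quartiles, and the cutoffs 3 and 1.5 are common conventions rather than fixed rules.

```latex
z_i = \frac{x_i - \mu}{\sigma}, \qquad x_i \text{ is flagged when } |z_i| > 3

\mathrm{IQR} = Q_3 - Q_1, \qquad
x_i \text{ is flagged when } x_i < Q_1 - 1.5\,\mathrm{IQR}
\ \text{ or } \ x_i > Q_3 + 1.5\,\mathrm{IQR}
```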

## Handling Outliers

- Removing the outliers.
- Transforming and binning values.
- Imputation.
- Separate treatment.
- Robust statistical methods.
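
The sketch below shows how some of these options might look in pandas, under stated assumptions: the ages are illustrative, outliers are flagged with the IQR rule, and capping at the IQR bounds is an additional, commonly used option that is not in the list above. It is meant as a starting point, not a recommendation of one strategy over another.

```python
import numpy as np
import pandas as pd

# Illustrative ages; 50 is the extreme value
ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50], dtype=float)

# Flag outliers with the IQR rule (pandas' default linear-interpolation quantiles)
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

# Option 1: remove the flagged rows
removed = ages[~is_outlier]

# Option 2: cap values at the IQR bounds instead of dropping them
capped = ages.clip(lower=lower, upper=upper)

# Option 3: impute flagged values with the median of the remaining values
imputed = ages.mask(is_outlier, removed.median())

# Option 4: transform the variable to compress large values
log_transformed = np.log1p(ages)

print(pd.DataFrame({
    "original": ages, "capped": capped,
    "imputed": imputed, "log1p": log_transformed,
}).tail(3))
```

Removal discards information, capping and imputation keep the row but change its value, and a transformation keeps every value while reducing its influence. Which trade-off is acceptable depends on why the outlier occurred.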

In [1]:
# import the required libraries
import pandas as pd
import numpy as np


In [5]:
# Create the data
data = pd.DataFrame({'age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})
data.head()

   age
0   20
1   21
2   22
3   23
4   24


- **Z-Score**

In [6]:
# Step 1: No sorting is needed; the Z-score does not depend on row order

# Step 2: Calculate the mean and the (population) standard deviation
mean = np.mean(data['age'])
std = np.std(data['age'])  # np.std uses ddof=0 by default

In [8]:
# Step 3: Calculate the Z-Score
data['Z-Score'] = (data['age'] - mean) / std
data

     age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
11   50  3.037793


In [9]:
# Step 4: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n {data}")
print("----------------------------------------")

----------------------------------------
Here is the data with outliers:
     age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
11   50  3.037793
----------------------------------------


In [10]:
# Step 5: Print the outliers (the absolute value also catches unusually low values)
print(f"Here are the outliers based on the z-score threshold, 3:\n {data[data['Z-Score'].abs() > 3]}")
print("----------------------------------------")

Here are the outliers based on the z-score threshold, 3:
     age   Z-Score
11   50  3.037793
----------------------------------------


In [11]:
# Step 6: Remove the outliers (keep rows whose absolute Z-score is within the threshold)
data = data[data['Z-Score'].abs() <= 3]

# Step 7: Print the data without outliers
print(f"Here is the data without outliers:\n {data}")

Here is the data without outliers:
     age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628


- **Z-Score using the SciPy library**

In [12]:
# Import libraries
import numpy as np
from scipy import stats

# Sample data
data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]

# Calculate the Z-score for each data point
z_scores = np.abs(stats.zscore(data))

# Set a threshold for identifying outliers
threshold = 2.5 
outliers = np.where(z_scores > threshold)[0]

# print the data
print("----------------------------------------")
print("Data:", data)
print("----------------------------------------")

print("Indices of Outliers:", outliers)
print("Outliers:", [data[i] for i in outliers])

# Remove outliers
data = [data[i] for i in range(len(data)) if i not in outliers]
print("----------------------------------------")
print("Data without outliers:", data)

----------------------------------------
Data: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]
----------------------------------------
Indices of Outliers: [9]
Outliers: [110.0]
----------------------------------------
Data without outliers: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0]
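
One caveat with small samples: when the population standard deviation is used, no absolute Z-score can exceed the square root of `n - 1`, which is exactly 3 for ten values, so a threshold of 3 could never flag anything in the sample above. A robust alternative is the modified Z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below is a minimal illustration; the constant 0.6745 and the cutoff 3.5 are the conventional choices for this score and are assumptions here, not values from the original example.

```python
import numpy as np

# Same sample values as in the SciPy example above
values = np.array([2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0])

# With the population standard deviation, |z| can never exceed sqrt(n - 1)
max_possible_z = np.sqrt(len(values) - 1)
z_scores = np.abs((values - values.mean()) / values.std())

# Modified Z-score: replace mean/std with median/MAD, which outliers barely move
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad

# 3.5 is a conventional cutoff for the modified Z-score (an assumption here)
robust_outliers = values[np.abs(modified_z) > 3.5]

print(f"Largest classic |z|: {z_scores.max():.4f} (upper limit {max_possible_z:.4f})")
print("Outliers by modified Z-score:", robust_outliers)
```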


- **IQR Method**

In [13]:
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

# Step 2: Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# Step 3: Calculate the first and third quartile
# (NumPy >= 1.22 names this argument `method`; older versions call it `interpolation`)
Q1 = np.percentile(data['Age'], 25, method='midpoint')
Q3 = np.percentile(data['Age'], 75, method='midpoint')

# Step 4: Calculate the IQR
IQR = Q3 - Q1

# Step 5: Calculate the lower and upper bound
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

# Step 6: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n {data}")
print("----------------------------------------")
# Step 7: Print the outliers
print(f"Here are the outliers based on the IQR threshold:\n {data[(data['Age'] < lower_bound) | (data['Age'] > upper_bound)]}")
print("----------------------------------------")
# Step 8: Remove the outliers
data = data[(data['Age'] >= lower_bound) & (data['Age'] <= upper_bound)]

# Step 9: Print the data without outliers
print(f"Here is the data without outliers:\n {data}")

----------------------------------------
Here is the data with outliers:
     Age
0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
10   30
11   50
----------------------------------------
Here are the outliers based on the IQR threshold:
     Age
11   50
----------------------------------------
Here is the data without outliers:
     Age
0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
10   30


- **K-Means Clustering**

In [14]:
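# Import libraries
import numpy as np
from sklearn.cluster import KMeans

# Sample data
data = [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]

# Create a K-means model with two clusters (normal and outlier)
# random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)

# Predict cluster labels
labels = kmeans.predict(data)

# K-means numbers its clusters arbitrarily, so label 1 is not always the outlier group.
# Assumption for this toy data: the cluster with the larger centroid holds the outliers.
outlier_label = int(np.argmax(kmeans.cluster_centers_.sum(axis=1)))

# Identify outliers based on cluster labels
outliers = [data[i] for i, label in enumerate(labels) if label == outlier_label]

# print data
print("Data:", data)
print("Outliers:", outliers)
# Remove outliers
data = [data[i] for i, label in enumerate(labels) if label != outlier_label]
print("Data without outliers:", data)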

Data: [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]
Outliers: [[30, 30], [31, 31], [32, 32]]
Data without outliers: [[2, 2], [3, 3], [3, 4]]
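
K-means is not designed for outlier detection: it assigns every point to a cluster, and the label numbers carry no meaning. As an alternative sketch, Local Outlier Factor from the methods list above scores each point by how sparse its neighborhood is compared with its neighbors' neighborhoods. The points and the `n_neighbors=3` setting below are assumptions chosen for a tiny example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 2-D points: a tight group plus one distant point
points = np.array([[2, 2], [3, 3], [3, 4], [2, 3], [4, 3], [3, 2], [30, 30]])

# n_neighbors=3 is an assumed setting for this tiny sample
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(points)  # -1 marks outliers, 1 marks inliers

outliers = points[labels == -1]
inliers = points[labels == 1]

print("Outliers:", outliers.tolist())
print("Data without outliers:", inliers.tolist())
```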
