# Outlier Cleaning

In general, outlier cleaning is the process of identifying and removing data points that differ significantly from the pattern or distribution of the rest of the dataset.  

**Why do outliers need to be cleaned?**  
- Distorted mean: an outlier pulls the mean toward itself, so the mean no longer represents the bulk of the data.  
- Degraded predictive models: in machine learning, outliers can make a model less accurate. In linear regression, for example, a single outlier can shift the regression line and reduce the model's ability to predict new values correctly.
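
Both effects listed above can be seen on a handful of numbers. The sketch below is illustrative only: the values, the injected outlier, and the helper `ols_slope` are assumptions made for this example and are not part of the notebook's data or code.

```python
from statistics import mean, median


def ols_slope(xs, ys):
    """Slope of the least-squares line fitted to the points (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical measurements; the appended 50.0 plays the role of the outlier.
clean = [5.1, 4.9, 4.7, 4.6, 5.0]
dirty = clean + [50.0]
print("mean:  ", mean(clean), "->", mean(dirty))
print("median:", median(clean), "->", median(dirty))

# Hypothetical points on the line y = 2x, plus one outlying point (6, 40).
xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
slope_clean = ols_slope(xs, ys)
slope_dirty = ols_slope(xs + [6], ys + [40])
print("slope: ", slope_clean, "->", slope_dirty)
```

In this toy case the mean moves from about 4.86 to about 12.4 while the median barely changes, and the fitted slope goes from 2 to 6 because of one point.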

## Data Iris

In [87]:
import pandas as pd
from module.dataTransformer import combineData
from module.fetcher import fetchDataMysql, fetchDataPg

data_pg = fetchDataPg("SELECT petal_length, petal_width, species FROM iris_table")
data_my = fetchDataMysql("SELECT sepal_length, sepal_width FROM iris_table")

iris_df = combineData(data1=data_my, data2=data_pg)
iris_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.10,3.50,1.40,0.20,Iris-setosa
1,4.90,3.00,1.40,0.20,Iris-setosa
2,4.70,3.20,1.30,0.20,Iris-setosa
3,4.60,3.10,1.50,0.20,Iris-setosa
4,5.00,3.60,1.40,0.20,Iris-setosa
...,...,...,...,...,...
145,6.70,3.00,5.20,2.30,Iris-virginica
146,6.30,2.50,5.00,1.90,Iris-virginica
147,6.50,3.00,5.20,2.00,Iris-virginica
148,6.20,3.40,5.40,2.30,Iris-virginica


In [88]:
# Drop the species (class) column
numeric_iris_df = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].astype(float)
numeric_iris_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


## ABOD Model

Create an ABOD (Angle-Based Outlier Detection) model with `fraction` (the proportion of the data to be treated as outliers) set to 0.05, i.e. 5%.
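
To give an idea of what an angle-based score measures, here is a minimal pure-Python sketch of the exhaustive ABOD idea: for each point, take the variance of the distance-weighted angles it forms with every pair of other points. A point far from the rest sees all other points within a narrow angle, so its variance is small. The function `abof` and the toy points are assumptions for illustration. The library behind `create_model("abod")` may use a neighbor-based approximation and different details, so this is not a reproduction of its implementation.

```python
from itertools import combinations
from statistics import pvariance


def abof(point, others):
    """Angle-based outlier factor (simplified, unweighted variance): variance of
    the distance-weighted cosine between `point` and every pair of other points."""
    weighted = []
    for a, b in combinations(others, 2):
        pa = [ai - pi for ai, pi in zip(a, point)]
        pb = [bi - pi for bi, pi in zip(b, point)]
        na2 = sum(v * v for v in pa)  # squared length of vector point -> a
        nb2 = sum(v * v for v in pb)  # squared length of vector point -> b
        if na2 == 0 or nb2 == 0:
            continue  # skip exact duplicates of `point`
        dot = sum(u * v for u, v in zip(pa, pb))
        weighted.append(dot / (na2 * nb2))
    return pvariance(weighted)


# Hypothetical 2-D data: a tight cluster plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]

# Negate the variance so that a higher score means "more anomalous"
# (an assumed convention, chosen here for readability).
scores = [-abof(p, points[:i] + points[i + 1:]) for i, p in enumerate(points)]
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

With this sign convention every score is zero or negative, and the isolated point gets the score closest to zero, which makes it the most anomalous one.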

In [89]:
from pycaret.anomaly import *

s = setup(data=numeric_iris_df)

abod_model = create_model("abod", fraction=0.05)

df_abod = assign_model(abod_model)

df_abod

Unnamed: 0,Description,Value
0,Session id,7108
1,Original data shape,"(150, 4)"
2,Transformed data shape,"(150, 4)"
3,Numeric features,4
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,CPU Jobs,-1
9,Use GPU,False


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,Anomaly,Anomaly_Score
0,5.1,3.5,1.4,0.2,0,-556.251421
1,4.9,3.0,1.4,0.2,0,-400.000928
2,4.7,3.2,1.3,0.2,0,-93.421993
3,4.6,3.1,1.5,0.2,0,-99.229221
4,5.0,3.6,1.4,0.2,0,-82.176201
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,0,-13.831461
146,6.3,2.5,5.0,1.9,0,-9.110201
147,6.5,3.0,5.2,2.0,0,-20.571707
148,6.2,3.4,5.4,2.3,0,-6.345400


### Clean Iris Data

Data without outliers

In [90]:
clean_iris_abod = df_abod[df_abod["Anomaly"] == 0].merge(iris_df["species"], left_index=True, right_index=True)[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]]
clean_iris_abod

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


### Saving the CSV File

Download the result: [Download abod_iris_clean.csv](abod_iris_clean.csv)

In [91]:
clean_iris_abod.to_csv("abod_iris_clean.csv")

## KNN Model

Create a KNN (k-Nearest Neighbors) model with `fraction` (the proportion of the data to be treated as outliers) set to 0.05, i.e. 5%.
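
As a rough illustration of how a distance-based score and `fraction` can work together, the sketch below scores each point by the distance to its k-th nearest neighbor and then flags the highest-scoring share of points. The helpers `knn_scores` and `flag_top_fraction`, the value `k=5`, and the toy data are assumptions made for this example. The library's actual thresholding and defaults may differ.

```python
import math


def knn_scores(points, k=5):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_top_fraction(scores, fraction):
    """Return 1 for the `fraction` highest-scoring points and 0 for the rest.
    Ties at the cutoff are all flagged, so slightly more points may be marked."""
    n_flag = max(1, round(fraction * len(scores)))
    cutoff = sorted(scores, reverse=True)[n_flag - 1]
    return [1 if s >= cutoff else 0 for s in scores]


# Hypothetical data: 19 points on a small grid plus one distant point.
grid = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(4)][:19]
points = grid + [(5.0, 5.0)]

scores = knn_scores(points, k=5)
flags = flag_top_fraction(scores, fraction=0.05)  # 5% of 20 points -> 1 point
print(flags)
```

Here only the distant point is flagged, because its k-th nearest neighbor is far away while every grid point has close neighbors.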

In [92]:
from pycaret.anomaly import *

s = setup(data=numeric_iris_df)

knn_model = create_model("knn", fraction=0.05)

df_knn = assign_model(knn_model)

df_knn

Unnamed: 0,Description,Value
0,Session id,5315
1,Original data shape,"(150, 4)"
2,Transformed data shape,"(150, 4)"
3,Numeric features,4
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,CPU Jobs,-1
9,Use GPU,False


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,Anomaly,Anomaly_Score
0,5.1,3.5,1.4,0.2,0,0.141421
1,4.9,3.0,1.4,0.2,0,0.173205
2,4.7,3.2,1.3,0.2,0,0.264575
3,4.6,3.1,1.5,0.2,0,0.264575
4,5.0,3.6,1.4,0.2,0,0.244949
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,0,0.374166
146,6.3,2.5,5.0,1.9,0,0.479583
147,6.5,3.0,5.2,2.0,0,0.387298
148,6.2,3.4,5.4,2.3,0,0.624500


### Clean Iris Data

Data without outliers

In [93]:
clean_iris_knn = df_knn[df_knn["Anomaly"] == 0].merge(iris_df["species"], left_index=True, right_index=True)[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]]
clean_iris_knn

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


### Saving the CSV File

Download the result: [Download knn_iris_clean.csv](knn_iris_clean.csv)

In [94]:
clean_iris_knn.to_csv("knn_iris_clean.csv")

## LOF Model

Create an LOF (Local Outlier Factor) model with `fraction` (the proportion of the data to be treated as outliers) set to 0.05, i.e. 5%.
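
The idea behind LOF is to compare the local density of a point with the local density of its neighbors. A value near 1 means the point is about as dense as its surroundings, and a much larger value means it sits in a sparser region. The pure-Python sketch below follows the textbook definition under simplifying assumptions: ties at the k-distance are ignored, there are no heavily duplicated points, and `k=3` is an arbitrary choice. The function `lof_scores` and the toy data are hypothetical and do not reproduce the library's implementation or defaults.

```python
import math


def lof_scores(points, k=3):
    """Local Outlier Factor for each point (simplified: exactly k neighbors)."""
    n = len(points)
    # For every point, keep its k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        ranked = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(ranked[:k])
    k_dist = [nb[-1][0] for nb in neighbors]  # distance to the k-th neighbor

    def local_reachability_density(i):
        # Reachability distance from i to j is max(k-distance of j, d(i, j)).
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: 19 points on a small grid plus one distant point.
grid = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(4)][:19]
points = grid + [(5.0, 5.0)]

scores = lof_scores(points, k=3)
print(scores[0], scores[-1])
```

The grid points end up with scores close to 1, while the distant point gets a score many times larger.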

In [95]:
from pycaret.anomaly import *

s = setup(data=numeric_iris_df)

lof_model = create_model("lof", fraction=0.05)

df_lof = assign_model(lof_model)

df_lof

Unnamed: 0,Description,Value
0,Session id,5909
1,Original data shape,"(150, 4)"
2,Transformed data shape,"(150, 4)"
3,Numeric features,4
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,CPU Jobs,-1
9,Use GPU,False


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,Anomaly,Anomaly_Score
0,5.1,3.5,1.4,0.2,0,0.976302
1,4.9,3.0,1.4,0.2,0,1.008758
2,4.7,3.2,1.3,0.2,0,1.019841
3,4.6,3.1,1.5,0.2,0,1.049882
4,5.0,3.6,1.4,0.2,0,0.958473
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,0,0.978474
146,6.3,2.5,5.0,1.9,0,1.004232
147,6.5,3.0,5.2,2.0,0,0.980847
148,6.2,3.4,5.4,2.3,0,1.021819


### Clean Iris Data

Data without outliers

In [96]:
clean_iris_lof = df_lof[df_lof["Anomaly"] == 0].merge(iris_df["species"], left_index=True, right_index=True)[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]]
clean_iris_lof

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


### Saving the CSV File

Download the result: [Download lof_iris_clean.csv](lof_iris_clean.csv)

In [97]:
clean_iris_lof.to_csv("lof_iris_clean.csv")