Imputation means replacing missing data (missing values, NaN, blanks) with substitute values.

Imputation techniques, grouped by data type:
1. Numerical variables:
    - Mean/median imputation.
    - Arbitrary value imputation.
    - End-of-tail imputation.
    
2. Categorical variables:
    - Frequent category imputation.
    - Adding a "missing" category.
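
Before choosing a technique, it helps to see how many values are missing in each column. The snippet below is a minimal sketch on a small hypothetical frame (the column values are made up for illustration). It assumes that blanks arrive as empty strings, which pandas does not count as missing until they are converted to NaN.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one numerical and one categorical column.
df = pd.DataFrame({
    "umur": [29, np.nan, 25, np.nan],
    "merk": ["Ford", "", "Toyota", np.nan],
})

# pandas does not treat empty strings as missing, so convert blanks to NaN first.
df = df.replace("", np.nan)

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```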

## Numerical Data

#### Mean

In [1]:
import pandas as pd
import numpy as np

kolom = {'col1':[2,9,19],
         'col2':[5,np.nan,17],
         'col3':[3,9,np.nan],
         'col4':[4,9,1],
         'col5':[np.nan,7,np.nan]}

data = pd.DataFrame(kolom)

In [2]:
data

Unnamed: 0,col1,col2,col3,col4,col5
0,2,5.0,3.0,4,
1,9,,9.0,9,7.0
2,19,17.0,,1,


In [3]:
data.fillna(data.mean())

Unnamed: 0,col1,col2,col3,col4,col5
0,2,5.0,3.0,4,7.0
1,9,11.0,9.0,9,7.0
2,19,17.0,6.0,1,7.0
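
Median imputation follows the same pattern as the mean example: swap `data.mean()` for `data.median()`. The sketch below rebuilds the same frame so it can run on its own. With so few observed values per column, the medians here happen to coincide with the means, so the result matches the mean-imputed table. On skewed data the two would differ, and the median is less sensitive to outliers.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "col1": [2, 9, 19],
    "col2": [5, np.nan, 17],
    "col3": [3, 9, np.nan],
    "col4": [4, 9, 1],
    "col5": [np.nan, 7, np.nan],
})

# data.median() skips NaN by default and returns one median per column;
# fillna then fills each column's NaN with that column's median.
imputed = data.fillna(data.median())
print(imputed)
```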


#### Arbitrary (a value of your choosing)

In [5]:
import pandas as pd
import numpy as np

umur = {'umur':[29,43,np.nan,25,34,np.nan,50]}

data = pd.DataFrame(umur)
data

Unnamed: 0,umur
0,29.0
1,43.0
2,
3,25.0
4,34.0
5,
6,50.0


In [6]:
data.fillna(99)

Unnamed: 0,umur
0,29.0
1,43.0
2,99.0
3,25.0
4,34.0
5,99.0
6,50.0


#### End of Tail

In [7]:
import pandas as pd
import numpy as np

umur = {'umur':[29,43,np.nan,25,34,np.nan,50]}

data = pd.DataFrame(umur)
data

Unnamed: 0,umur
0,29.0
1,43.0
2,
3,25.0
4,34.0
5,
6,50.0


In [10]:
pip install feature-engine

Collecting feature-engine
  Downloading feature_engine-1.2.0-py2.py3-none-any.whl (205 kB)
Collecting statsmodels>=0.11.1
  Downloading statsmodels-0.13.2-cp39-cp39-win_amd64.whl (9.1 MB)
Collecting patsy>=0.5.2
  Downloading patsy-0.5.2-py2.py3-none-any.whl (233 kB)
Installing collected packages: patsy, statsmodels, feature-engine
Successfully installed feature-engine-1.2.0 patsy-0.5.2 statsmodels-0.13.2
Note: you may need to restart the kernel to use updated packages.


In [13]:
# import EndTailImputer
from feature_engine.imputation import EndTailImputer

# create the imputer
imputer = EndTailImputer(imputation_method='gaussian', tail='right')

# fit the imputer to the data
imputer.fit(data)

# transform the data
test_data = imputer.transform(data)

# display the result
test_data

Unnamed: 0,umur
0,29.0
1,43.0
2,66.896905
3,25.0
4,34.0
5,66.896905
6,50.0
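
To see where 66.896905 comes from, the value can be recomputed by hand. The sketch below assumes that the Gaussian right-tail rule is mean + 3 × standard deviation, with the factor 3 taken as the imputer's default setting and the sample standard deviation (ddof=1) as the spread. Under those assumptions, plain pandas gives a value that agrees with the imputed one above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"umur": [29, 43, np.nan, 25, 34, np.nan, 50]})

# Right-tail Gaussian value: mean + 3 * standard deviation.
# The factor 3 is an assumption about the imputer's default setting.
# pandas skips NaN and uses the sample standard deviation (ddof=1).
tail_value = data["umur"].mean() + 3 * data["umur"].std()

# Fill the missing ages with the computed end-of-tail value.
imputed = data.fillna({"umur": tail_value})
print(tail_value)
print(imputed)
```

Because the replacement sits far out in the right tail of the observed ages, imputed rows stay distinguishable from genuinely observed ones.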


## What about categorical data?
We will use mode imputation: each missing value is replaced with the most frequent category in the column.


In [16]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

mobil = {'merk':['Ford','Ford','Toyota','Honda',np.nan,'Toyota','Honda','Toyota',np.nan,np.nan]}

data = pd.DataFrame(mobil)

In [17]:
data

Unnamed: 0,merk
0,Ford
1,Ford
2,Toyota
3,Honda
4,
5,Toyota
6,Honda
7,Toyota
8,
9,


In [18]:
imputasi = SimpleImputer(strategy='most_frequent')


In [19]:
imputasi.fit_transform(data)

array([['Ford'],
       ['Ford'],
       ['Toyota'],
       ['Honda'],
       ['Toyota'],
       ['Toyota'],
       ['Honda'],
       ['Toyota'],
       ['Toyota'],
       ['Toyota']], dtype=object)
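
The other categorical technique listed at the top, adding a category for missing values, can be sketched with plain pandas. The label `"Missing"` is an arbitrary choice for illustration; any string that does not clash with a real category would do.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"merk": ["Ford", "Ford", "Toyota", "Honda", np.nan,
                              "Toyota", "Honda", "Toyota", np.nan, np.nan]})

# Treat missingness as its own category instead of guessing a brand.
with_missing_label = data.fillna("Missing")
print(with_missing_label["merk"].value_counts())
```

Unlike mode imputation, this keeps the original category frequencies intact and records which rows were missing.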

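### What if we need to handle missing values in categorical and numerical variables at the same time?

We can use random sample imputation, which fills each missing value with a value drawn at random from the observed values of the same variable.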

In [22]:
# import RandomSampleImputer
from feature_engine.imputation import RandomSampleImputer

# create data that contains missing values
data = {'Jenis Kelamin':['Laki-laki','Perempuan','Laki-laki','Laki-laki',np.nan,'Laki-laki','Perempuan',np.nan,'Perempuan','Laki-laki'],
       'Umur':[30,np.nan,24,21,34,np.nan,23,43,np.nan,19]}

df = pd.DataFrame(data)

In [23]:
df

Unnamed: 0,Jenis Kelamin,Umur
0,Laki-laki,30.0
1,Perempuan,
2,Laki-laki,24.0
3,Laki-laki,21.0
4,,34.0
5,Laki-laki,
6,Perempuan,23.0
7,,43.0
8,Perempuan,
9,Laki-laki,19.0


In [24]:
# create the imputer
imputasi = RandomSampleImputer(random_state=29)

# fit it to the data
imputasi.fit(df)

# transform the data
testing_df = imputasi.transform(df)

In [25]:
testing_df

Unnamed: 0,Jenis Kelamin,Umur
0,Laki-laki,30.0
1,Perempuan,43.0
2,Laki-laki,24.0
3,Laki-laki,21.0
4,Perempuan,34.0
5,Laki-laki,34.0
6,Perempuan,23.0
7,Laki-laki,43.0
8,Perempuan,23.0
9,Laki-laki,19.0
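
The idea behind random sample imputation can also be sketched without the library. The helper `random_sample_impute` below is hypothetical, written only for illustration. It draws replacements, with replacement, from each column's observed values. It uses NumPy's own random generator, so it will not reproduce the exact draws shown above even with the same seed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Jenis Kelamin": ["Laki-laki", "Perempuan", "Laki-laki", "Laki-laki", np.nan,
                      "Laki-laki", "Perempuan", np.nan, "Perempuan", "Laki-laki"],
    "Umur": [30, np.nan, 24, 21, 34, np.nan, 23, 43, np.nan, 19],
})

def random_sample_impute(frame, seed=29):
    """Fill each column's NaN with random draws from that column's observed values."""
    rng = np.random.default_rng(seed)
    result = frame.copy()
    for column in result.columns:
        missing = result[column].isna()
        observed = result.loc[~missing, column]
        if missing.any() and not observed.empty:
            # Draw one replacement per missing row, sampling with replacement.
            draws = rng.choice(observed.to_numpy(), size=int(missing.sum()), replace=True)
            result.loc[missing, column] = draws
    return result

imputed = random_sample_impute(df)
print(imputed)
```

Because every replacement comes from the column itself, the same code works for both categorical and numerical variables, and the imputed values roughly preserve each column's observed distribution.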
