
# Libraries
1. **Numpy** is a Python library for array and matrix computation
2. **Matplotlib** is a Python library for presenting data as graphs/plots
3. **Pandas** is a Python library for loading and handling tabular data from external sources such as CSV files



In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

---
Load the data file (.csv) with the **Pandas** library and print the data table



In [4]:
dataset = pd.read_csv('data_tugas3.csv')

print("Data size (rows, columns):", dataset.shape)

# show only the first 5 rows
dataset.head()

Data size (rows, columns): (1000, 5)


Unnamed: 0,Regist. Store of Customers,Comp nm,Sale amt,Visit Count,Profit Amount
0,SEMARANG,TOKO ALI,471032440.0,4.0,207971.0
1,SEMARANG,TOKO PUTRA HIDUP,306474400.0,2.0,
2,SEMARANG,TOKO EDIP,,4.0,349644.0
3,,IMA MART,228492000.0,,-11066354.0
4,SEMARANG,TOKO RENNE,188544000.0,1.0,-5927922.0
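
As a minimal sketch of what `pd.read_csv` does with empty fields, the five rows shown above can be re-typed as CSV text and read from memory. The name `sample` and the in-memory text are illustrative assumptions; the real `data_tugas3.csv` is not reproduced here, and whether it stores an index column is not shown, so none is assumed.

```python
import io

import pandas as pd

# The five rows displayed above, re-typed as CSV text.
# Empty fields stand in for the missing values (NaN) seen in the table.
csv_text = """Regist. Store of Customers,Comp nm,Sale amt,Visit Count,Profit Amount
SEMARANG,TOKO ALI,471032440.0,4.0,207971.0
SEMARANG,TOKO PUTRA HIDUP,306474400.0,2.0,
SEMARANG,TOKO EDIP,,4.0,349644.0
,IMA MART,228492000.0,,-11066354.0
SEMARANG,TOKO RENNE,188544000.0,1.0,-5927922.0
"""

# read_csv accepts any file-like object, so StringIO replaces the file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # numeric columns with gaps are parsed as float64

# An empty field is parsed as NaN.
sale_is_missing = pd.isna(sample.loc[2, "Sale amt"])
print(sale_is_missing)
```

Empty fields become `NaN` on load, which is what the later missing-value steps rely on.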


---

Rename the columns

In [5]:
# rename the columns
dataset.rename(columns={"Regist. Store of Customers": "Regist Store", "Comp nm": "Comp Name", "Sale amt": "Sale Amount"}, inplace=True)
# print the column names
for col in dataset.columns:
    print(col)

Regist Store
Comp Name
Sale Amount
Visit Count
Profit Amount
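
A possible variant of the renaming step, shown only as a sketch: `rename` can return a new frame instead of modifying in place, and `errors="raise"` makes a misspelled old name fail loudly rather than being silently ignored. The empty frame `toy` and the deliberately wrong key `"Sale amount"` are illustrative assumptions.

```python
import pandas as pd

# Empty frame that only carries the original column names.
toy = pd.DataFrame(columns=[
    "Regist. Store of Customers", "Comp nm", "Sale amt",
    "Visit Count", "Profit Amount",
])

new_names = {
    "Regist. Store of Customers": "Regist Store",
    "Comp nm": "Comp Name",
    "Sale amt": "Sale Amount",
}

# Without inplace=True, rename returns a new DataFrame and leaves `toy` untouched.
renamed = toy.rename(columns=new_names, errors="raise")
print(list(renamed.columns))

# With errors="raise", a key that matches no column raises KeyError.
try:
    toy.rename(columns={"Sale amount": "Sale Amount"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```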


---
Count the missing values in each column

In [6]:
# number of missing values per column
dataset.isnull().sum()

Regist Store     1
Comp Name        1
Sale Amount      2
Visit Count      3
Profit Amount    1
dtype: int64
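
Besides the per-column count, it can help to see the share of missing values and which rows contain them. The sketch below uses a made-up four-row frame (`toy` and its values are assumptions, not the notebook's data) to show the same `isnull()` mask used three ways.

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in each column.
toy = pd.DataFrame({
    "Sale Amount": [100.0, np.nan, 300.0, 400.0],
    "Visit Count": [1.0, 2.0, np.nan, 4.0],
})

mask = toy.isnull()                    # True wherever a value is missing

missing_counts = mask.sum()            # missing values per column
missing_share = mask.mean() * 100      # same, as a percentage of rows
rows_with_gaps = toy[mask.any(axis=1)] # rows with at least one missing value

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```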

---
Fill in the missing values using scikit-learn's `SimpleImputer`

In [11]:
# Import the imputer from scikit-learn
from sklearn.impute import SimpleImputer

# Fill missing values with the most frequent value of the column.
# fit_transform returns an (n, 1) array, so .ravel() flattens it to one column.
imputer_store = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
dataset['Regist Store'] = imputer_store.fit_transform(dataset[['Regist Store']]).ravel()

imputer_comp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
dataset['Comp Name'] = imputer_comp.fit_transform(dataset[['Comp Name']]).ravel()

imputer_visit = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
dataset['Visit Count'] = imputer_visit.fit_transform(dataset[['Visit Count']]).ravel()

# Fill missing values with the column mean
imputer_sale = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer_sale.fit(dataset[['Sale Amount']])  # learn the mean
dataset['Sale Amount'] = imputer_sale.transform(dataset[['Sale Amount']]).ravel()  # apply it

imputer_profit = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer_profit.fit(dataset[['Profit Amount']])  # learn the mean
dataset['Profit Amount'] = imputer_profit.transform(dataset[['Profit Amount']]).ravel()  # apply it


In [12]:
# number of missing values per column after imputation
dataset.isnull().sum()

Regist Store     0
Comp Name        0
Sale Amount      0
Visit Count      0
Profit Amount    0
dtype: int64

In [14]:
dataset.head()

Unnamed: 0,Regist Store,Comp Name,Sale Amount,Visit Count,Profit Amount
0,SEMARANG,TOKO ALI,471032400.0,4.0,207971.0
1,SEMARANG,TOKO PUTRA HIDUP,306474400.0,2.0,-128707.4
2,SEMARANG,TOKO EDIP,7186429.0,4.0,349644.0
3,SEMARANG,IMA MART,228492000.0,1.0,-11066350.0
4,SEMARANG,TOKO RENNE,188544000.0,1.0,-5927922.0
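
To make the two imputation strategies concrete, here is a minimal sketch on a made-up four-row frame. The frame `toy` and its values (including the store name KUDUS) are assumptions for illustration only. `most_frequent` fills a gap with the column's most common value, while `mean` fills it with the average of the non-missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: one missing store name, one missing sale amount.
toy = pd.DataFrame({
    "Regist Store": ["SEMARANG", np.nan, "SEMARANG", "KUDUS"],
    "Sale Amount": [100.0, 200.0, np.nan, 600.0],
})

# most_frequent: the gap is replaced by the most common value ("SEMARANG").
store_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy["Regist Store"] = store_imputer.fit_transform(toy[["Regist Store"]]).ravel()

# mean: the gap is replaced by the mean of 100, 200 and 600, i.e. 300.
sale_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy["Sale Amount"] = sale_imputer.fit_transform(toy[["Sale Amount"]]).ravel()

print(toy)
print(sale_imputer.statistics_)  # the fill value learned for each column
```

The mean is pulled towards large values, so for a skewed column such as a sales amount, `strategy="median"` may be worth comparing. That is a design choice the notebook itself does not make.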


# Splitting the dataset into a training set and a test set

I use only the numeric data, so the features `x` are taken from the 3rd and 4th columns (Sale Amount and Visit Count), and the target `y` is the last column (Profit Amount).

In [22]:
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:,-1].values
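
Positional slicing depends on the column order, so a by-name selection is a common alternative. The sketch below assumes the column names produced by the earlier rename step and rebuilds two complete rows from the table shown above in a small frame called `toy` (a hypothetical name). Both selections yield the same arrays.

```python
import pandas as pd

# Two complete rows from the table shown earlier, with the renamed columns.
toy = pd.DataFrame({
    "Regist Store": ["SEMARANG", "SEMARANG"],
    "Comp Name": ["TOKO ALI", "TOKO RENNE"],
    "Sale Amount": [471032440.0, 188544000.0],
    "Visit Count": [4.0, 1.0],
    "Profit Amount": [207971.0, -5927922.0],
})

# Selection by position, as in the notebook: columns with index 2 and 3.
x_by_position = toy.iloc[:, 2:4].values
y_by_position = toy.iloc[:, -1].values

# Equivalent selection by name, which keeps working if columns are reordered.
x_by_name = toy[["Sale Amount", "Visit Count"]].values
y_by_name = toy["Profit Amount"].values

print(x_by_name.shape, y_by_name.shape)
```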

In [33]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,  test_size = 0.2, random_state = 1)
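
As a rough check of what this split produces, the sketch below applies the same `test_size` and `random_state` to synthetic arrays with 1000 rows (the arrays and the `_demo` names are assumptions, not the notebook's data). It shows the 800/200 split, that rows of `x` and `y` stay paired, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i of x_demo is [2*i, 2*i + 1], and y_demo[i] is i.
x_demo = np.arange(2000, dtype=float).reshape(1000, 2)
y_demo = np.arange(1000, dtype=float)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)
print(x_tr.shape, x_te.shape)  # 80% of the rows for training, 20% for testing

# Rows are shuffled, but each x row still matches its y value.
rows_stay_paired = bool(np.all(x_tr[:, 0] == 2 * y_tr))

# The same random_state gives the same split on a second call.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)
split_is_reproducible = bool(np.array_equal(x_te, x_te2))
print(rows_stay_paired, split_is_reproducible)
```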

In [38]:
print(x_train)

[[2.41000e+06 1.00000e+00]
 [1.89000e+04 1.00000e+00]
 [3.84000e+04 1.00000e+00]
 ...
 [1.59634e+05 2.00000e+00]
 [4.38517e+06 3.00000e+00]
 [4.49303e+07 5.00000e+00]]


In [39]:
print(x_test[:10])

[[1.523911e+06 2.000000e+00]
 [3.554940e+05 3.000000e+00]
 [1.830803e+06 3.000000e+00]
 [2.520000e+06 1.000000e+00]
 [4.246248e+06 2.000000e+00]
 [1.335000e+05 1.000000e+00]
 [3.845750e+06 2.000000e+00]
 [3.744000e+05 1.000000e+00]
 [3.025100e+06 4.000000e+00]
 [3.074740e+07 3.000000e+00]]


In [43]:
print(y_train[:10])

[     0.   4297.   6495.  23844. 135022. -31925.  86400. -42670. 287702.
   8681.]


In [44]:
print(y_test[:10])

[ 208838.   49369.   85950. -110464.   24770.    5159. -121708.   33368.
  -64817.  446980.]


In [45]:
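from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Standardize only the last feature column (Visit Count).
# Fit on the training set, then reuse the same mean and std for the test set.
x_train[:, -1:] = sc.fit_transform(x_train[:, -1:])
x_test[:, -1:] = sc.transform(x_test[:, -1:])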

In [46]:
print(x_train)

[[ 2.41000000e+06 -4.45667342e-01]
 [ 1.89000000e+04 -4.45667342e-01]
 [ 3.84000000e+04 -4.45667342e-01]
 ...
 [ 1.59634000e+05 -7.77273279e-02]
 [ 4.38517000e+06  2.90212686e-01]
 [ 4.49303000e+07  1.02609271e+00]]


In [47]:
print(x_test[:10])

[[ 1.52391100e+06 -7.77273279e-02]
 [ 3.55494000e+05  2.90212686e-01]
 [ 1.83080300e+06  2.90212686e-01]
 [ 2.52000000e+06 -4.45667342e-01]
 [ 4.24624800e+06 -7.77273279e-02]
 [ 1.33500000e+05 -4.45667342e-01]
 [ 3.84575000e+06 -7.77273279e-02]
 [ 3.74400000e+05 -4.45667342e-01]
 [ 3.02510000e+06  6.58152699e-01]
 [ 3.07474000e+07  2.90212686e-01]]
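
The scaling step above can be checked by hand. `StandardScaler` computes z = (x - mean) / std, with the mean and the population standard deviation learned from the training data only. The sketch below uses made-up visit counts (`train_visits` and `test_visits` are assumptions) and compares the scaler's output with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up visit counts, shaped (n, 1) like the sliced column in the notebook.
train_visits = np.array([[1.0], [2.0], [3.0], [6.0]])
test_visits = np.array([[1.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_visits)  # learns mean and std here
test_scaled = scaler.transform(test_visits)        # reuses the training statistics

# Manual z-score using the training mean and population std (ddof=0).
train_mean = train_visits.mean(axis=0)
train_std = train_visits.std(axis=0)
manual_test = (test_visits - train_mean) / train_std

print(scaler.mean_, scaler.scale_)
print(np.allclose(test_scaled, manual_test))
```

Fitting on the training set alone keeps information from the test set out of the preprocessing. In the notebook only the last column is scaled, so Sale Amount keeps its original magnitude. Passing the whole array to the scaler would standardize both features, which is a choice the original does not make.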
