# 1. Missing Values

Data is often corrupted or missing. We need to deal with this first, because later processing steps will not work when values are missing or incomplete.

## Imputing missing values with Imputer

In [1]:
import pandas as pd
from sklearn.preprocessing import Imputer

In [13]:
df = pd.read_csv('Data.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [14]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [15]:
df.dropna()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [7]:
# drop the rows that have NaN in a specific column
df.dropna(subset=['Age'])

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [8]:
df.iloc[:, 1:3]

Unnamed: 0,Age,Salary
0,44.0,72000.0
1,27.0,48000.0
2,30.0,54000.0
3,38.0,61000.0
4,40.0,
5,35.0,58000.0
6,,52000.0
7,48.0,79000.0
8,50.0,83000.0
9,37.0,67000.0


In [18]:
# replace every occurrence of missing_values with the value defined by strategy,
# which can be 'mean', 'median', or 'most_frequent' (the mode).
# axis=0 imputes along columns (each column's own statistic); axis=1 imputes along rows

imputer = Imputer(missing_values='NaN', strategy='median', axis = 0)
df.iloc[:, 1:3] = imputer.fit_transform(df.iloc[:, 1:3])
df



Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,61000.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.0,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
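
`Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22, so the cell above only runs on older releases. A minimal sketch of the same median imputation with `SimpleImputer`, assuming a recent scikit-learn. The `sample` DataFrame is a hand-typed copy of the Age and Salary columns shown above, used here so the example does not depend on `Data.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hand-typed stand-in for the Age/Salary columns of Data.csv (illustrative only)
sample = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

# SimpleImputer always works column by column, so it has no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
sample[["Age", "Salary"]] = imputer.fit_transform(sample[["Age", "Salary"]])
print(sample)
```

The missing Age and Salary should be filled with each column's median, as in the output above.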


# 2. Encoding Categorical Data

In [20]:
# LabelEncoder replaces each category with an integer, e.g. 'No' becomes 0 and 'Yes' becomes 1.
# OneHotEncoder creates a separate column for every category and puts a 1 in the column that matches the row's value
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [22]:
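label_encoder = LabelEncoder()
temp = df.copy()
# encode the Purchased column (index 3)
temp.iloc[:, 3] = label_encoder.fit_transform(df.iloc[:, 3])
print(label_encoder.classes_)
# refit the same encoder on the Country column (index 0); classes_ is overwritten
temp.iloc[:, 0] = label_encoder.fit_transform(df.iloc[:, 0])
print(label_encoder.classes_)
print(temp)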

['No' 'Yes']
['France' 'Germany' 'Spain']
   Country   Age   Salary  Purchased
0        0  44.0  72000.0          0
1        2  27.0  48000.0          1
2        1  30.0  54000.0          0
3        2  38.0  61000.0          0
4        1  40.0  61000.0          1
5        0  35.0  58000.0          1
6        2  38.0  52000.0          0
7        0  48.0  79000.0          1
8        1  50.0  83000.0          0
9        0  37.0  67000.0          1
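
Because the cell above refits a single encoder, its `classes_` attribute only remembers the last column it saw. If the original labels are needed later, one option is to keep a separate encoder per column and use `inverse_transform`. A small sketch; the `purchased` list is made-up illustrative data, not read from `Data.csv`:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only
purchased = ["No", "Yes", "No", "No", "Yes"]

purchased_encoder = LabelEncoder()
codes = purchased_encoder.fit_transform(purchased)
print(codes)

# The fitted encoder can map the integers back to the original labels
restored = purchased_encoder.inverse_transform(codes)
print(restored)
```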


In [23]:
# you can pass an array of indices of categorical features
# one_hot_encoder = OneHotEncoder(categorical_features=[0])
# temp = df.copy()
# temp.iloc[:, 0] = one_hot_encoder.fit_transform(df.iloc[:, :0])
# temp
# you can achieve the same thing using get_dummies
pd.get_dummies(df.iloc[:, :-1])

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,1,0,0
1,27.0,48000.0,0,0,1
2,30.0,54000.0,0,1,0
3,38.0,61000.0,0,0,1
4,40.0,61000.0,0,1,0
5,35.0,58000.0,1,0,0
6,38.0,52000.0,0,0,1
7,48.0,79000.0,1,0,0
8,50.0,83000.0,0,1,0
9,37.0,67000.0,1,0,0
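
The `categorical_features` argument in the commented-out lines was removed in scikit-learn 0.22. A sketch of one way to use `OneHotEncoder` on a recent release is to pass only the categorical column. The `countries` DataFrame is a hand-typed stand-in for the Country column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hand-typed stand-in for the Country column (illustrative only)
countries = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(countries[["Country"]]).toarray()

print(encoder.categories_)  # one array of sorted categories per input column
print(encoded)
```

For quick exploration `pd.get_dummies` is usually shorter; a fitted `OneHotEncoder` is more convenient when the same encoding has to be reapplied to new data.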


# 3. Binarizing

Binarizing converts the data into 0s and 1s.
We will try another dataset, the iris dataset that ships with the scikit-learn library (https://archive.ics.uci.edu/ml/datasets/iris).

In [24]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
feature_names = iris_dataset.feature_names
print(feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [25]:
X[:, 1]

array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])

We will set each value to 0 if it is at or below the mean, and to 1 if it is above the mean.

In [26]:
from sklearn.preprocessing import Binarizer
binarizer_obj = Binarizer(threshold=X[:, 1].mean())
X[:, 1:2] = binarizer_obj.fit_transform(X[:, 1].reshape(-1, 1))
X[:, 1]

array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])
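
What `Binarizer` does can be sketched in plain NumPy: values strictly greater than the threshold become 1, everything else becomes 0. The `widths` array below holds only a handful of sepal-width values copied from the output above, so its mean differs from the mean of the full column:

```python
import numpy as np

# A few sepal-width values copied from the array above (illustrative subset)
widths = np.array([3.5, 3.0, 3.2, 2.3, 4.4])
threshold = widths.mean()

# Same rule as Binarizer: 1 if value > threshold, else 0
binary = (widths > threshold).astype(float)
print(threshold, binary)
```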

# 4. Feature Scaling

In [27]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

df = pd.read_csv('Data.csv').dropna()
print(df)
X = df[["Age", "Salary"]].values.astype(np.float64)
print(X)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
[[4.4e+01 7.2e+04]
 [2.7e+01 4.8e+04]
 [3.0e+01 5.4e+04]
 [3.8e+01 6.1e+04]
 [3.5e+01 5.8e+04]
 [4.8e+01 7.9e+04]
 [5.0e+01 8.3e+04]
 [3.7e+01 6.7e+04]]


In [32]:
standard_scaler = StandardScaler()
normalizer = Normalizer()
min_max_scaler = MinMaxScaler()

print("Standardization")
print(standard_scaler.fit_transform(X))

print("Normalizing")
print(normalizer.fit_transform(X))

print("MinMax Scaling")
print(min_max_scaler.fit_transform(X))

Standardization
[[ 0.69985807  0.58989097]
 [-1.51364653 -1.50749915]
 [-1.12302807 -0.98315162]
 [-0.08137885 -0.37141284]
 [-0.47199731 -0.6335866 ]
 [ 1.22068269  1.20162976]
 [ 1.48109499  1.55119478]
 [-0.211585    0.1529347 ]]
Normalizing
[[6.11110997e-04 9.99999813e-01]
 [5.62499911e-04 9.99999842e-01]
 [5.55555470e-04 9.99999846e-01]
 [6.22950699e-04 9.99999806e-01]
 [6.03448166e-04 9.99999818e-01]
 [6.07594825e-04 9.99999815e-01]
 [6.02409529e-04 9.99999819e-01]
 [5.52238722e-04 9.99999848e-01]]
MinMax Scaling
[[0.73913043 0.68571429]
 [0.         0.        ]
 [0.13043478 0.17142857]
 [0.47826087 0.37142857]
 [0.34782609 0.28571429]
 [0.91304348 0.88571429]
 [1.         1.        ]
 [0.43478261 0.54285714]]
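
To see what each scaler computes, the three transformations can be reproduced by hand. This is a NumPy sketch of the default behaviour, not the library implementation; the array is copied from the printed `X` above. Note that `Normalizer` works per row while the other two work per column, which is why its output looks so different:

```python
import numpy as np

# Age and Salary rows copied from the printed X above
X = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000],
              [35, 58000], [48, 79000], [50, 83000], [37, 67000]],
             dtype=np.float64)

# Standardization: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0)
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizing: scale every ROW to unit L2 norm
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

# MinMax scaling: map every column linearly onto [0, 1]
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized[0], normalized[0], min_max[0])
```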


# 5. Feature Extraction
In the previous session you wrote a WordCount program. Counting words is one technique for feature extraction, but you do not need to implement it yourself: scikit-learn already provides it. The extracted features will later be useful for classification, clustering, and other machine learning techniques.
## 5.1 Count Vectorizer

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Pesawat membawa 178 penumpang dewasa, satu anak-anak, dan dua bayi,ujar Kepala Bagian Kerja Sama dan Humas Direktorat Jenderal Perhubungan Udara, Sindu Rahayu dalam keterangan tertulisnya, Senin, 29 Oktober 2018.","Sindu menyebutkan pesawat Lion Air JT-610 penerbangan dari Jakarta menuju Pangkal Pinang mengalami hilang kontak pada pukul 06.33 WIB. Pesawat dengan nomor registrasi PL LQP dilaporkan tertangkap radar pada koordinat 05 46.15 S - 107 07.16 E.","Sindu menambahkan pesawat Lion Air JT-610 dijadwalkan berangkat pada pukul 06.10 WIB dan akan mendarat di Pangkal Pinang pada pukul 07.10 WIB. Sebelum hilang kontak, kata Sindu, kru pesawat sempat meminta retrun to base"]
cv = CountVectorizer()  # the vectorizer has to be created before it is fitted
X = cv.fit_transform(docs)
print(X)
print(cv.vocabulary_)
print(X.todense())

  (0, 8)	1
  (0, 52)	1
  (0, 9)	1
  (0, 70)	1
  (0, 73)	1
  (0, 38)	1
  (0, 20)	1
  (0, 63)	1
  (0, 71)	1
  (0, 75)	1
  (0, 57)	1
  (0, 33)	1
  (0, 28)	1
  (0, 31)	1
  (0, 66)	1
  (0, 37)	1
  (0, 16)	1
  (0, 36)	1
  (0, 76)	1
  (0, 18)	1
  (0, 29)	1
  (0, 21)	2
  (0, 15)	2
  (0, 67)	1
  (0, 24)	1
  :	:
  (2, 35)	1
  (2, 68)	1
  (2, 25)	1
  (2, 47)	1
  (2, 14)	1
  (2, 3)	2
  (2, 19)	1
  (2, 26)	1
  (2, 46)	1
  (2, 2)	1
  (2, 77)	2
  (2, 1)	1
  (2, 61)	2
  (2, 53)	2
  (2, 39)	1
  (2, 30)	1
  (2, 59)	1
  (2, 54)	1
  (2, 12)	1
  (2, 34)	1
  (2, 13)	1
  (2, 42)	1
  (2, 71)	2
  (2, 21)	1
  (2, 58)	2
{'pesawat': 58, 'membawa': 44, '178': 7, 'penumpang': 56, 'dewasa': 24, 'satu': 67, 'anak': 15, 'dan': 21, 'dua': 29, 'bayi': 18, 'ujar': 76, 'kepala': 36, 'bagian': 16, 'kerja': 37, 'sama': 66, 'humas': 31, 'direktorat': 28, 'jenderal': 33, 'perhubungan': 57, 'udara': 75, 'sindu': 71, 'rahayu': 63, 'dalam': 20, 'keterangan': 38, 'tertulisnya': 73, 'senin': 70, '29': 9, 'oktober': 52, '2018': 8, 
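
The sparse output is easier to read on a tiny input. The sketch below uses one made-up sentence built from words in the first document. With the default settings, `CountVectorizer` lowercases the text, splits on punctuation such as the hyphen, and assigns column indices in alphabetical order, which is why `anak-anak` is counted as `anak` twice:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
# One short illustrative sentence
counts = cv.fit_transform(["satu anak-anak dan dua bayi"])

print(cv.vocabulary_)    # token -> column index
print(counts.toarray())  # dense view of the sparse count matrix
```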

## Dict Vectorizer

DictVectorizer maps word-count dictionaries to vectors.

In [37]:
from sklearn.feature_extraction import DictVectorizer

docs = [{"Aku": 1, "suka": 1, "makan": 2}, {"Aku": 1, "tidak": 1, "suka": 2, "makan": 3, "kambing": 1, "bakar": 2, "madu": 3}]
dv = DictVectorizer(sort=False)
X = dv.fit_transform(docs)
print(X)
print(dv.vocabulary_)
print(X.todense())

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	2.0
  (1, 0)	1.0
  (1, 1)	2.0
  (1, 2)	3.0
  (1, 3)	1.0
  (1, 4)	1.0
  (1, 5)	2.0
  (1, 6)	3.0
{'Aku': 0, 'suka': 1, 'makan': 2, 'tidak': 3, 'kambing': 4, 'bakar': 5, 'madu': 6}
[[1. 1. 2. 0. 0. 0. 0.]
 [1. 2. 3. 1. 1. 2. 3.]]


## TfIdf Vectorizer
TF-IDF weights each word count: the term frequency is multiplied by the inverse document frequency.

Tutorials are available at the following links:
- https://datascience.mipa.ugm.ac.id/id/representasi-teks-dalam-vektor-part-1/
- https://datascience.mipa.ugm.ac.id/id/representasi-teks-dalam-vektor-part-2/

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()
docs = ["Mayur is a Guitarist Guitarist", "Mayur is Musician", "Mayur is also a programmer"]
X_idf = tfidf_vectorizer.fit_transform(docs)
X_cv = cv_vectorizer.fit_transform(docs)
print(X_idf.todense())
print(tfidf_vectorizer.vocabulary_)
print(X_cv.todense())

[[0.         0.92276146 0.27249889 0.27249889 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 2 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]
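
The TF-IDF values above can be reproduced from the raw counts. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm="l2"`): the smoothed inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The word "a" is absent from the vocabulary because the default tokenizer only keeps tokens of two or more characters:

```python
import numpy as np

# Raw term counts copied from the CountVectorizer output above
# columns: also, guitarist, is, mayur, musician, programmer
counts = np.array([[0, 2, 1, 1, 0, 0],
                   [0, 0, 1, 1, 1, 0],
                   [1, 0, 1, 1, 0, 1]], dtype=np.float64)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Term frequency times idf, then L2-normalise each row
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 8))
```

Terms that occur in every document ("is", "mayur") get the smallest idf, so rarer terms such as "guitarist" dominate their rows.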


# GROUP PROJECT

The goal of this project is to apply what has been learned in each Digital Talent session to a chosen "big dataset", and ultimately to find "insight" in that data.

Each group consists of 4-5 people, assigned at random by the instructor.

The dataset used is the GDELT dataset (tentative): https://www.gdeltproject.org

Milestones must be reported every week as a PDF:

Milestone 1:
- Project and dataset description
- Data exploration

Milestone 2:
- Exploration with descriptive statistics
- Research question

Milestone 3:
- Machine learning models that could be applied
- Rationale for the chosen method

Milestone 4:
- Discussion of the results of the research carried out
- Data visualization with the tools taught

Milestone 5:
- Answering the research question, with conclusions and discussion

Milestone 6:
- Final report
- Presentation and poster
- Publication as a paper (optional)
