# Machine Learning
- Building mathematical models to help understand data.

## Supervised Learning
- Definition: modeling the relationship between the features of the data and its label.
1. Classification
2. Regression

##### Whether a task is Classification or Regression depends on whether the label is discrete or continuous
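
As a rough illustration of the discrete-versus-continuous distinction, scikit-learn's `type_of_target` helper reports how it would interpret a label vector. The two label lists below are made up for this sketch and are not part of the notebook's data.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical labels, for illustration only.
species_labels = ["setosa", "versicolor", "virginica", "setosa"]  # discrete categories
price_labels = [12.5, 30.1, 18.75, 22.0]                           # real-valued quantities

discrete_kind = type_of_target(species_labels)    # a classification-style target
continuous_kind = type_of_target(price_labels)    # a regression-style target
print(discrete_kind, continuous_kind)
```

A discrete target points toward a classifier, a continuous one toward a regressor.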


### Model Training in Classification
- Model : A line that separates the classes.
- Model Parameters : The numbers that define that line (for example, its slope and intercept).

##### The whole training process is about tuning these parameters
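
A minimal sketch of "parameters that define the line", assuming a toy two-blob dataset and `LogisticRegression` as the linear classifier (neither is used elsewhere in this notebook). After fitting, `coef_` and `intercept_` are the numbers that were tuned during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Two well-separated 2D blobs (toy data, an assumption for this sketch).
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                   rng.normal(5.0, 0.5, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)

clf = LogisticRegression().fit(X_toy, y_toy)

# The learned parameters define the separating line w[0]*x0 + w[1]*x1 + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
train_accuracy = clf.score(X_toy, y_toy)
print("w =", w, "b =", b, "accuracy =", train_accuracy)
```

Points with `w @ x + b > 0` are assigned to class 1, the rest to class 0.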

## Unsupervised Learning
- Labels are not used
- Clustering (grouping samples based on their features alone, without reference to any label)
- Dimensionality Reduction

# Scikit-Learn
- A machine learning algorithms package for Python

In [None]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

In [None]:
%matplotlib inline
sns.set()
sns.pairplot(iris, hue='species', height=1.5)

In [None]:
#Creating Feature Matrix
X_iris = iris.drop('species', axis=1)
X_iris.shape

In [None]:
#Creating Target Vector
y_iris = iris['species']
y_iris.shape

# Scikit-Learn Workflow

1. Choose Model by importing the right estimator class
2. Instantiate this class with desired values and choose hyperparameters
3. Reconstruct the data as feature matrix and target vector.
4. Call model's fit() method to fit the model with data.
5. Apply model with new data
##### Supervised Learning: use predict() method to predict unknown data's label.
##### Unsupervised Learning: use transform() or predict() method to transform the data's features or infer its properties

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# rand()  - uniform distribution over [0, 1)
# randn() - standard normal (Gaussian) distribution
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x,y)

In [None]:
# 1. Choose Model by importing the right estimator class
from sklearn.linear_model import LinearRegression

## 2. Instantiate this class with desired values and choose hyperparameters
#### Few More Things to consider
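- Do we want to fit the offset (the y-intercept)?
- Should the model be normalized or not?
- Do we want to preprocess the features to make the model more flexible?
- How much regularization do we want in the model?
- How many model components do we want to use?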

In [None]:
# 2. Instantiate this class with desired values and choose hyperparameters
model = LinearRegression(fit_intercept=True)
model

In [None]:
# 3. Reconstruct the data as feature matrix and target vector.
# Feature Matrix = (n_samples, n_features)
# Target Vector = (n_samples)
# Here x holds a single scalar feature per sample, so we only need to add an axis to make it 2D.

X = x[:, np.newaxis]
X.shape

In [None]:
# 4. Call model's fit() method to fit the model with data.
model.fit(X, y)

In [None]:
# What happened? Let's look at the fitted slope.
model.coef_

In [None]:
model.intercept_
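
The data were generated as `y = 2 * x - 1` plus noise, so the fitted slope and intercept should land near 2 and -1. A self-contained sketch of that check, re-creating the same data with the same seed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)   # true slope 2, true intercept -1, plus unit-variance noise

model = LinearRegression(fit_intercept=True).fit(x[:, np.newaxis], y)
slope, intercept = model.coef_[0], model.intercept_
print(slope, intercept)
```

The estimates will not match the true values exactly because of the noise term.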

In [None]:
# 5. Apply model with new data
xfit = np.linspace(-1,11)

In [None]:
Xfit = xfit[:, np.newaxis] # must match the (n_samples, n_features) shape
yfit = model.predict(Xfit) # prediction based on fitted model's coef with intercept applied

In [None]:
plt.scatter(x, y)
plt.plot(xfit, yfit)

# Supervised Learning: Gaussian Naive Bayes Classification

In [None]:
from sklearn.model_selection import train_test_split # for splitting data into test and train
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state = 1)

In [None]:
from sklearn.naive_bayes import GaussianNB # Model class selection
model = GaussianNB()                       # Model instantiate
model.fit(Xtrain, ytrain)                  # Model fit to data
y_model = model.predict(Xtest)             # Predict on unseen data

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model) 

# Unsupervised Example : Dimensionality Reduction

In [None]:
from sklearn.decomposition import PCA # Model class selection
model = PCA(n_components = 2)         # Model Instantiate with hyperparameter
model.fit(X_iris)                     # Fit to data. No y is needed this time since we're only reducing dimensionality
X_2D = model.transform(X_iris)        # Convert data into 2 dimension
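
One way to judge how much information survives the projection is `explained_variance_ratio_`. The sketch below loads iris through `sklearn.datasets.load_iris` (an assumption, to keep the example self-contained; it holds the same four measurements as `X_iris`).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# Fraction of the total variance kept by the two principal components.
retained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, retained)
```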

In [None]:
iris['PCA1'] = X_2D[:,0]
iris['PCA2'] = X_2D[:,1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False)  # x/y must be passed by keyword in recent seaborn

# Unsupervised Example : Clustering

In [None]:
from sklearn.mixture import GaussianMixture                       #Model Selection
model = GaussianMixture(n_components=3, covariance_type='full')   #Model Instantiation with hyperparameters
model.fit(X_iris)                                                 #Model fit to iris data
y_gmm = model.predict(X_iris)                                     #Clustering label

In [None]:
iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species', col='cluster', fit_reg=False)  # x/y must be passed by keyword in recent seaborn

# Exercise
- Build a classification model for the Digits dataset.
- Split the dataset into train and test sets and report the model's accuracy.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
X = digits.data
y = digits.target
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, random_state= 0) #splitting dataset
model = GaussianNB() #Model Instantiation
model.fit(Xtrain,ytrain) #Model Fit to data
y_model = model.predict(Xtest) #Model prediction with new data

In [None]:
accuracy_score(ytest, y_model)

# Confusion Matrix
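- The model above reached a certain accuracy; the confusion matrix shows what is behind that number, i.e. which classes are predicted correctly and which are confused with each other.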

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square = True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value')
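
To make the layout of the matrix concrete, here is a small sketch with hand-made labels (hypothetical, not the digits predictions). Rows are true classes, columns are predicted classes, and the diagonal counts correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

mat = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted class
per_class_recall = mat.diagonal() / mat.sum(axis=1)
accuracy = mat.trace() / mat.sum()
print(mat)
print(per_class_recall, accuracy)
```

Off-diagonal entries show which pairs of classes the model mixes up.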

# Hyperparameter and Model Validation
- Choosing the right hyperparameters is important
- Model validation is also important

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

#Now we choose model and hyperparameter
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(X,y)
y_model = model.predict(X)
accuracy_score(y,y_model)

# What accuracy do you expect?

In [None]:
# Self-Exercise : The model validation above has a serious flaw. Restructure it so that the evaluation is valid.

# Types of Model Validation
1. Holdout set (what we've been covering)
2. Cross-Validation ****

##### Cross-Validation
- With 100 samples, think of it as running the holdout procedure twice.
- First round: train on samples 1~50, test on 51~100. Second round: the reverse, train on 51~100, test on 1~50.
    

In [None]:
#holdout set method
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size = 0.5)
model.fit(X1,y1)
y_model = model.predict(X2)
accuracy_score(y2,y_model)

In [None]:
#Cross-Validation method
y1_model = model.fit(X2,y2).predict(X1)
y2_model = model.fit(X1,y1).predict(X2)
accuracy_score(y1,y1_model), accuracy_score(y2,y2_model)

- The example above splits the data into two parts, so it is called two-fold cross-validation.
- In the k-fold cross-validation that shows up in many papers, k is the number of parts the dataset is split into.
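
A small sketch of how k folds partition a dataset, using `KFold` on ten placeholder sample indices (the indices stand in for real data). Every sample lands in the test split exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)          # stand-in for a dataset with 10 samples
folds = [(train_idx, test_idx)
         for train_idx, test_idx in KFold(n_splits=5).split(indices)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```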

In [None]:
#5-fold cross-validation example
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)

# Selecting Best Model
- The validation method matters, but in the end hyperparameter tuning is what sets a machine learning practitioner's skill apart.
- The key question is which model to train and with which hyperparameter values.
- So what is a hyperparameter?

## Hyperparameter vs Parameter
- Parameter : Values that are learned from the data and change during the training process.
- Hyperparameter : Values that are fixed before the learning process begins.

- Ex : Number of layers in a neural network, learning rate, depth of a decision tree, etc.
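
In scikit-learn terms, a rough way to see the difference: hyperparameters are constructor arguments available through `get_params()` before training, while learned parameters (attributes ending in `_`) only appear after `fit()`. The tiny dataset below is made up for the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
hyperparameters = model.get_params()        # chosen by us, before training
fitted_before = hasattr(model, "coef_")     # learned parameters do not exist yet

# Toy data lying exactly on y = 2x + 1 (an assumption for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model.fit(X, y)

print(hyperparameters)
print(model.coef_, model.intercept_)        # parameters derived from the data
```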

##### So how to choose best model?
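- Bias : Error from a model that is too simple to capture the structure in the data. High bias == Underfit
- Variance : How sensitive the model is to the particular training set, i.e. how closely it follows noise in the training data. High variance == Overfit
- Finding the sweet spot between bias and variance is the mission
- Sweet spot == A model complexity that is neither too complex (high variance) nor too simple (high bias)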
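
A minimal NumPy sketch of the trade-off, assuming a noisy sine wave as ground truth (hypothetical data, not the notebook's). Training error can only shrink as the polynomial degree grows, so the held-out error is what reveals whether the extra complexity helps.

```python
import numpy as np

rng = np.random.RandomState(0)

def sample(n):
    # Hypothetical ground truth: a sine wave plus Gaussian noise.
    x = rng.rand(n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.randn(n)

x_train, y_train = sample(30)
x_val, y_val = sample(200)

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    errors[degree] = (train_mse, val_mse)
    print(degree, train_mse, val_mse)
```

Comparing the two columns per degree gives a numeric counterpart to the underfit/overfit picture.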

![image.png](attachment:image.png)

In [None]:
#Validation curves in Scikit-Learn
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),LinearRegression(**kwargs))

def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1) # ravel() flattens the array to 1D
    if err > 0:
        y += err * rng.randn(N)
    return X, y

X, y = make_data(40)

In [None]:
%matplotlib inline
sns.set()
X_test = np.linspace(-0.1, 1.1, 500) [:, None]

plt.scatter(X.ravel(),y,color='black')
axis = plt.axis()
for degree in [1,3,5]:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree = {0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc = 'best')

In [None]:
from sklearn.model_selection import validation_curve
degree = np.arange(0,21)
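# param_name and param_range must be passed by keyword in recent scikit-learn versions
train_score, val_score = validation_curve(PolynomialRegression(), X, y, param_name='polynomialfeatures__degree', param_range=degree, cv=7)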
plt.plot(degree, np.median(train_score, 1), color='blue', label= 'training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0,1)
plt.xlabel('degree')
plt.ylabel('score')
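
Instead of reading the best degree off the plot, it can also be picked programmatically as the degree with the highest median validation score. This sketch rebuilds the same pipeline and the same data recipe as `make_data(40)` so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data recipe as make_data(40) with rseed=1.
rng = np.random.RandomState(1)
X = rng.rand(40, 1) ** 2
y = 10 - 1.0 / (X.ravel() + 0.1) + rng.randn(40)

degree = np.arange(0, 21)
train_score, val_score = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()), X, y,
    param_name="polynomialfeatures__degree", param_range=degree, cv=7)

# One row per degree, one column per CV fold.
median_val = np.median(val_score, axis=1)
best_degree = degree[np.argmax(median_val)]
print(best_degree)
```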

In [None]:
# 3rd degree seems the best complexity for the model.
plt.scatter(X.ravel(),y)
lim = plt.axis()
y_test = PolynomialRegression(3).fit(X,y).predict(X_test)
plt.plot(X_test.ravel(),y_test)
plt.axis(lim)

### Learning Curves
- The optimal model complexity depends on the size of the training data.

In [None]:
X2, y2 = make_data(200)
plt.scatter(X2.ravel(), y2)

In [None]:
degree = np.arange(21)
train_score2, val_score2 = validation_curve(PolynomialRegression(), X2, y2, param_name='polynomialfeatures__degree', param_range=degree, cv=7)
plt.plot(degree, np.median(train_score2, 1), color = 'blue', label='training score')
plt.plot(degree, np.median(val_score2, 1), color='red', label='validation score')
plt.plot(degree, np.median(train_score, 1), color='blue', alpha=0.3, linestyle='dashed')
plt.plot(degree, np.median(val_score, 1), color='red', alpha=0.3, linestyle='dashed')
plt.legend(loc='lower center')
plt.ylim(0,1)
plt.xlabel('degree')
plt.ylabel('score')

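# The validation curve depends on two important inputs: 1) Model Complexity 2) Number of Training Points

- In other words, the size of the training set is the key variable in a learning curve, so a learning curve can be used to judge how much training data is enough when deciding on a train/test split.
- Note that past a certain point the train and validation scores converge, which also means that adding more data beyond that point brings little benefit.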


![image.png](attachment:image.png)

In [None]:
from sklearn.model_selection import learning_curve
fig, ax = plt.subplots(1,2,figsize=(16,6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace = 0.1)
for i, degree in enumerate([2,9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree), X, y, cv=7, train_sizes=np.linspace(0.3, 1, 25))
    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]),N[0],N[-1], color='gray',linestyle='dashed')
    ax[i].set_ylim(0,1)
    ax[i].set_xlim(N[0],N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')
    

##### As the plots above show, the 2nd degree model has already converged at a training size of about 22 and changes little after that, while the more complex 9th degree model needs more data before its scores converge. The side effect is that the 9th degree model overfits more than the 2nd degree model.

# Grid-Search
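- So far we have looked at how to balance bias and variance,
- and at how a model's performance depends on model complexity and training data size.
- In practice, however, a model has many hyperparameters rather than just one, so the search space is multidimensional and it is hard to pick the best model by visual inspection.
- That is where Grid-Search comes in.

#### What is Grid-Search?
- A brute-force search: every combination of values on a predefined grid of hyperparameters is tried and scored, and the best-scoring setting is selected.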
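
To see what "every combination" means, `ParameterGrid` enumerates the candidates that a grid search would evaluate. The grid below mirrors the step-name style of a `make_pipeline(PolynomialFeatures(), LinearRegression())` pipeline; treat it as an illustrative grid rather than a recommended one.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "polynomialfeatures__degree": np.arange(21),        # 21 values
    "linearregression__fit_intercept": [True, False],   # 2 values
}

# Cartesian product of all listed values: 21 * 2 candidate settings.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))
print(candidates[0])
```

With cross-validation, each candidate is fit once per fold, so the cost grows quickly as the grid gets larger.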

In [None]:
from sklearn.model_selection import GridSearchCV
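# First build the grid of hyperparameter values to try exhaustively.
# Note: LinearRegression's `normalize` option was removed in scikit-learn 1.2, so it is not searched here.
param_grid = {'polynomialfeatures__degree': np.arange(21),
              'linearregression__fit_intercept': [True, False]}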


grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)

In [None]:
# Now fit the grid search to the data
grid.fit(X, y)

In [None]:
# Which hyperparameters came out best after fitting?
grid.best_params_

In [None]:
model = grid.best_estimator_
plt.scatter(X.ravel(),y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test)
plt.axis(lim)

# Feature Engineering
- Real-world data rarely comes in a tidy (n_samples, n_features) format
- Feature engineering is about turning whatever information you have into numbers that can be used to build the feature matrix

### Categorical Features
- Example: the building-number feature when surveying apartment residents (is Building 401 < Building 402? The encoding must not imply an ordering)
- Use One-Hot-Encoding

##### What is One-Hot-Encoding?
- A sparse vector made of 0s (not present) and a single 1 (present).
- For example, 9 labels can be represented as vectors of length 9, where each index corresponds to one label.
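
A minimal NumPy sketch of the idea, using made-up building numbers as the categorical feature. Each category gets its own column, and each row has exactly one 1.

```python
import numpy as np

buildings = ["401", "402", "403", "402"]             # hypothetical categorical feature
categories = sorted(set(buildings))                   # one column per distinct category
index_of = {c: i for i, c in enumerate(categories)}

# Pick the matching row of the identity matrix for every sample.
one_hot = np.eye(len(categories), dtype=int)[[index_of[b] for b in buildings]]
print(categories)
print(one_hot)
```

Because every category is its own axis, no ordering between buildings is implied.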

In [None]:
from sklearn.feature_extraction import DictVectorizer
data = [{'price': 850000, 'rooms': 4, 'neighborhood': 'Tae-Won-Woo'},
       {'price': 450000, 'rooms': 2, 'neighborhood': 'Hyun-Bo-Shin'},
       {'price': 8900000, 'rooms': 10, 'neighborhood': 'In-A-Lee'},
       {'price': 50000, 'rooms': 1, 'neighborhood': 'Doo-Sol-Han'}]
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

In [None]:
# Which column corresponds to which neighborhood?
vec.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

### Text Features
- NLP techniques for representing words as numbers
- Word count, word embedding, etc.

In [None]:
#word count example
sample = ['problem of evil','evil queen','horizon problem']
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X

In [None]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

In [None]:
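# Raw counts give too much weight to words that appear very often (such as "a" in English), so TF-IDF can be used instead.
# TF-IDF weights each word count by how often the word appears across all documents, down-weighting words that occur in many of them.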

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
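
For reference, the weights can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings (smoothed IDF, L2 row normalization, raw term counts) and compares the manual result with the library output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ['problem of evil', 'evil queen', 'horizon problem']
counts = CountVectorizer().fit_transform(sample).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                   # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1       # smoothed inverse document frequency
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

sklearn_tfidf = TfidfVectorizer().fit_transform(sample).toarray()
matches = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches)
```

Words that occur in fewer documents get a larger IDF factor and therefore a larger weight.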

### Image Features
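- The book mentions that this topic is too complex and skips it.
- In any case, an image can be represented by its pixel values and features can then be extracted from them, as a CNN does.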

### Derived Features
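- As seen in the hyperparameter section, originally one-dimensional data was expanded into several dimensions (polynomial features) before being fed to the model.
- The key point is that we transform the input rather than change the model.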

In [None]:
x = np.array([1,2,3,4,5])
y = np.array([4,2,1,3,7])
plt.scatter(x,y)

In [None]:
# Points like these clearly cannot be fit well by a straight line, but let's try anyway
X = x[:, np.newaxis] # still a single feature; this only matches scikit-learn's (n_samples, n_features) convention
model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(x,y)
plt.plot(x, yfit)

In [None]:
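# What if we transform the input from a single feature into several features?
# Note: since there is originally only one feature, the derived features must keep a consistent meaning.
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X) # expand into polynomial features up to degree 3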
print(X2)

In [None]:
# The output above is simply col1 = x, col2 = x^2, col3 = x^3.
# Now let's run the regression again.
model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x,y)
plt.plot(x,yfit)
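
As a quick sanity check of what the transform produces, the derived columns can be built by hand and compared with the `PolynomialFeatures` output. A self-contained sketch using the same five x values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5])
X = x[:, np.newaxis]

X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
X_manual = np.column_stack([x, x ** 2, x ** 3])   # x, x^2, x^3 built by hand

same = np.allclose(X_poly, X_manual)
print(same)
```

The model stays linear; only its inputs have been enriched.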

### Imputation of Missing Data
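- When the data contains missing values such as NaN, they must be filled in appropriately, depending on the type of the feature.
- This is called imputation.
- It can be handled as a preprocessing step with the sklearn.impute.SimpleImputer class; since the right strategy is very application-specific, we only touch on it briefly here.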
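
A minimal sketch of mean imputation with `SimpleImputer`, on a small made-up matrix with two missing entries. Each NaN is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing values.
X_missing = np.array([[np.nan, 0, 3],
                      [3, 7, 9],
                      [3, 5, 2],
                      [4, np.nan, 6],
                      [8, 8, 1]])

X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
print(X_filled)
```

Other strategies such as `'median'` or `'most_frequent'` may suit a given feature better.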

### Feature Pipelines
- Doing every step covered so far one by one by hand can get complicated.
- That is what sklearn's Pipeline class is for.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
model = make_pipeline(SimpleImputer(strategy='mean'),PolynomialFeatures(degree=2),LinearRegression())
# The model now contains the whole feature-processing chain, so we can fit and predict in the standard way.

model.fit(X, y)
print(y)
print(model.predict(X))
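
`make_pipeline` names each step after its lowercased class name, which is where keys such as `polynomialfeatures__degree` come from. A short sketch of inspecting and adjusting a step through those names:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())

step_names = list(model.named_steps)              # auto-generated step names, in order
model.set_params(polynomialfeatures__degree=3)    # <step name>__<parameter name>
new_degree = model.named_steps['polynomialfeatures'].degree
print(step_names, new_degree)
```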