# Regression Analysis of the factory Manufacturing Dataset
Using data collected during factory production, we predict the quality of the goods produced on the production line.

### Loading the data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('factory.csv' ,header=0)
df

In [None]:
df.describe()

In [None]:
X=df.drop(["output_quality"],axis=1)
y=df["output_quality"]

#### Multicollinearity analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
df1 = X.corr()
mask = np.zeros_like(df1, dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)]= True
f, ax = plt.subplots(figsize=(11, 9))
ax = sns.heatmap(df1, cmap = 'coolwarm', square = True, mask = mask,
                 vmin = -0.4, vmax = 0.4, annot = True, annot_kws = {"size": 15})
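The correlation heatmap only shows pairwise relationships. As a complementary check, the variance inflation factor (VIF) can be sketched as below. This is a minimal sketch, not part of the original notebook: `vif_table` is a hypothetical helper, the data is synthetic, and the VIF values are taken from the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column (hypothetical helper, not from the notebook).

    For standardized predictors, VIF_j equals the j-th diagonal element
    of the inverse of the correlation matrix.
    """
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF").sort_values(ascending=False)

# Synthetic demo: "a_plus_noise" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})
vif_demo = vif_table(demo)
print(vif_demo)
```

On the notebook's data the call would be `vif_table(X)`. A common rule of thumb treats a VIF above roughly 10 as a sign of strong collinearity.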

### Linear regression

#### Model 1: Baseline model

In [None]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
model1 = sm.OLS(y,sm.add_constant(X))
result1 = model1.fit()
result1.summary()

#### Model 2: Modeling after variable selection

In [None]:
X_reduced = X.drop(['motor_amperage'], axis = 'columns')
model_reduced = sm.OLS(y,sm.add_constant(X_reduced))
result_reduced = model_reduced.fit()
result_reduced.summary()

In [None]:
X_reduced = X.drop(['motor_amperage', 'motor_RPM'], axis = 'columns')
model_reduced = sm.OLS(y,sm.add_constant(X_reduced))
result_reduced = model_reduced.fit()
result_reduced.summary()

In [None]:
X_reduced = X.drop(['motor_amperage', 'motor_RPM','temp'], axis = 'columns')
model_reduced = sm.OLS(y,sm.add_constant(X_reduced))
result_reduced = model_reduced.fit()
result_reduced.summary()

## Data split: training data and test data
* Split the full dataset 50:50 into training data and test data

In [None]:
from sklearn.model_selection import train_test_split
def data_split (x, y) : 
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1234)
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = data_split(X, y)

## Model 1: Linear regression
* Fit the model after removing the explanatory variables that were not significant at the 5% level (motor_amperage, motor_RPM, temp)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# drop the explanatory variables that were not significant
X_train_selected = X_train.drop(['motor_amperage', 'motor_RPM','temp'],axis=1)
X_test_selected = X_test.drop(['motor_amperage', 'motor_RPM','temp'],axis=1)

In [None]:
# training the model & prediction
model_reg = LinearRegression(fit_intercept = True)
fit_reg=model_reg.fit(X_train_selected ,y_train)
y_pred_reg= fit_reg.predict(X_test_selected)

In [None]:
# plot : prediction vs true
import matplotlib.pyplot as plt
def pred_vs_true (y_pred) :
    plt.scatter(y_test, y_pred, alpha=0.3)
    plt.xlabel("Actual Quality")
    plt.ylabel("Predicted Quality")
    grid = np.linspace(1,12,1000)
    plt.plot(grid, grid, '-', color = 'r');
    plt.show()

In [None]:
pred_vs_true (y_pred_reg)

## Model 2: Neural network
* hidden_layer_sizes=(3,4,5),random_state=1234, max_iter = 1000

In [None]:
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

In [None]:
# data scaling 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# training the model & prediction
model_nn = MLPRegressor(hidden_layer_sizes=(3,4,5),random_state=1234, max_iter = 1000)
fit_nn = model_nn.fit(X_train_scaled, y_train)
y_pred_nn =fit_nn.predict(X_test_scaled)

In [None]:
pred_vs_true (y_pred_nn)
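Scaling and the network can also be bundled so that the scaler is always fit on the training rows only. The following is a minimal sketch using scikit-learn's `make_pipeline` with the same `MLPRegressor` settings as above. The `X_demo` and `y_demo` arrays are synthetic stand-ins, not the factory data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 4 numeric features, linear signal plus noise).
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# The pipeline fits the scaler and the network together on the training half.
nn_pipeline = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(3, 4, 5), random_state=1234, max_iter=1000),
)
nn_pipeline.fit(X_demo[:100], y_demo[:100])

# predict() applies the already-fitted scaler before the network.
y_pred_demo = nn_pipeline.predict(X_demo[100:])
```

With the notebook's variables this would read `nn_pipeline.fit(X_train, y_train)` followed by `nn_pipeline.predict(X_test)`.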

## Model 3: Decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
model_tree = DecisionTreeRegressor(random_state = 1234)
fit_tree=model_tree.fit(X_train,y_train)
y_pred_tree = fit_tree.predict(X_test)

In [None]:
pred_vs_true (y_pred_tree)

## Model 4: Random forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
model_rf = RandomForestRegressor(random_state=1234)
fit_rf=model_rf.fit(X_train,y_train)
y_pred_rf = fit_rf.predict(X_test)

In [None]:
pred_vs_true (y_pred_rf)

## Model 5: Support vector regression (SVR)
* Three kernels were tried (linear, polynomial, and rbf); the final model comparison uses the default radial basis function (RBF) kernel

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# rbf kernel (default)
from sklearn.svm import SVR
model_svr_rbf = SVR()  # kernel='rbf' is the default
fit_svr_rbf = model_svr_rbf.fit(X_train_scaled, y_train)
y_pred_svr_rbf  = fit_svr_rbf.predict(X_test_scaled)

In [None]:
pred_vs_true (y_pred_svr_rbf)
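Only the RBF fit is shown above. A comparison of the three kernels mentioned in this section could look roughly like the sketch below. It runs on synthetic data with default hyperparameters, so the resulting numbers say nothing about the factory dataset. `kernel_rmse` and `demo_scaler` are names introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with a mildly nonlinear signal.
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = X_demo[:150], X_demo[150:], y_demo[:150], y_demo[150:]

# Fit the scaler on the training half only, then reuse it for the test half.
demo_scaler = StandardScaler().fit(X_tr)

kernel_rmse = {}
for kernel in ("linear", "poly", "rbf"):
    model = SVR(kernel=kernel).fit(demo_scaler.transform(X_tr), y_tr)
    residual = y_te - model.predict(demo_scaler.transform(X_te))
    kernel_rmse[kernel] = float(np.sqrt(np.mean(residual ** 2)))

print(kernel_rmse)
```

On the notebook's data the loop would use `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` instead of the synthetic arrays.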

## Model comparison
* Predictive performance on the test set was compared using RMSE and MAPE.
* Under both criteria, the random forest and the neural network showed the best predictive performance.
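For reference, the two criteria as they are computed in this notebook can be written as follows, where $y_i$ is the actual quality, $\hat{y}_i$ the prediction, $n$ the number of test observations, and $m$ the number of test observations with $y_i \neq 0$. Dividing by $y_i$ rather than $|y_i|$ assumes the target is positive.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAPE} = \frac{100}{m}\sum_{i:\,y_i \neq 0}\frac{\left|y_i - \hat{y}_i\right|}{y_i}
$$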

In [None]:
def RMSE(y_actual, y_pred):
    # root mean squared error: square root of the mean squared residual
    error = y_actual - y_pred
    return np.sqrt(np.mean(error**2))

In [None]:
rmse_reg = RMSE(y_test, y_pred_reg)
rmse_nn = RMSE(y_test, y_pred_nn)
rmse_tree = RMSE(y_test, y_pred_tree)
rmse_rf = RMSE(y_test, y_pred_rf)
rmse_svr = RMSE(y_test, y_pred_svr_rbf)

In [None]:
plt.plot(['regression', 'neural network', 'decision tree', 'random forest', 'SVR'], 
         [rmse_reg, rmse_nn, rmse_tree, rmse_rf, rmse_svr], marker='o')
plt.ylabel('RMSE', size=15);

In [None]:
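def MAPE(y_actual, y_pred):
    # skip observations whose actual value is 0 to avoid division by zero
    not_zero_idx = y_actual != 0
    _y_actual = y_actual[not_zero_idx]
    _y_pred = y_pred[not_zero_idx]
    abs_error = abs(_y_actual - _y_pred)
    n = len(_y_actual)
    # mean absolute percentage error, in percent
    return sum(abs_error / _y_actual) / n * 100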

In [None]:
mape_reg = MAPE(y_test, y_pred_reg)
mape_nn = MAPE(y_test, y_pred_nn)
mape_tree = MAPE(y_test, y_pred_tree)
mape_rf = MAPE(y_test, y_pred_rf)
mape_svr = MAPE(y_test, y_pred_svr_rbf)

In [None]:
plt.plot(['regression', 'neural network', 'decision tree', 'random forest','SVR'], 
         [mape_reg, mape_nn, mape_tree, mape_rf, mape_svr], marker='o')
plt.ylabel('MAPE', size=15);
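Alongside the two line plots, the metrics can be collected into one table so the models can be ranked directly. The following is a minimal sketch. `comparison_table` is a hypothetical helper, and the toy values are made up purely to show the output shape.

```python
import numpy as np
import pandas as pd

def comparison_table(y_true, predictions):
    """Build a table of RMSE and MAPE per model, sorted by RMSE (hypothetical helper).

    predictions maps a model name to its array of predicted values.
    """
    y_true = np.asarray(y_true, dtype=float)
    nonzero = y_true != 0  # MAPE skips zero actuals, as in the notebook
    rows = {}
    for name, y_pred in predictions.items():
        y_pred = np.asarray(y_pred, dtype=float)
        rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
        mape = np.mean(np.abs(y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero]) * 100
        rows[name] = {"RMSE": rmse, "MAPE": mape}
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("RMSE")

# Toy values for illustration only.
toy_truth = [2.0, 4.0, 5.0]
toy_table = comparison_table(
    toy_truth,
    {"perfect": [2.0, 4.0, 5.0], "off_by_one": [3.0, 5.0, 6.0]},
)
print(toy_table)
```

In the notebook this could be called as `comparison_table(y_test, {"regression": y_pred_reg, "neural network": y_pred_nn, "decision tree": y_pred_tree, "random forest": y_pred_rf, "SVR": y_pred_svr_rbf})`.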