# **Motion Classification from Smartphone Sensor Data**
# Step 2: Baseline Modeling


## 0. Mission

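* Data preprocessing
    * Carry out the preprocessing that is needed: dummy-variable encoding, data splitting, checking for and handling NaN values, scaling, and so on.
* Build classification models with a variety of algorithms
    * Apply at least 4 algorithms.
    * Compare performance
        * Keep a separate Excel file that tracks the performance of each model.
        * Performance guideline: accuracy of 0.900 or higher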
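
One way to keep the per-model performance log requested above is to collect the scores in a pandas DataFrame and write it to a file that Excel can open. The sketch below is only an illustration: the helper name `build_score_table`, the file name, and the accuracy values are placeholders, not results from this notebook.

```python
import pandas as pd

def build_score_table(scores):
    """Turn {model name: validation accuracy} into a table sorted best-first."""
    table = pd.DataFrame(
        [{"model": name, "val_accuracy": acc} for name, acc in scores.items()]
    )
    table = table.sort_values("val_accuracy", ascending=False).reset_index(drop=True)
    # Flag models that meet the 0.900 accuracy guideline from the mission
    table["meets_guideline"] = table["val_accuracy"] >= 0.900
    return table

# Placeholder numbers for illustration only
example_scores = {"model_a": 0.95, "model_b": 0.88, "model_c": 0.91}
score_table = build_score_table(example_scores)
print(score_table)

# A CSV opens directly in Excel; to_excel() also works if openpyxl is installed
score_table.to_csv("model_scores.csv", index=False)
```

Appending one row per experiment to the same file gives a running record that can be compared against the guideline at a glance.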

## 1. Environment Setup

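* Detailed requirements
    - Path setup: run locally (Anaconda)
        * Download the provided archive and extract it.
        * Create a project3_1 folder in the Anaconda root directory (usually C:/Users/< ID >) and copy the extracted files into it.
    - The code already imports the basic libraries that are needed.
        * Add any other libraries you find necessary.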


In [29]:
import sklearn
print(sklearn.__version__)

1.4.2


### (1) Loading Libraries

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

import joblib

# Load the required libraries and functions ------------------

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

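* Provided helper function
    * A function for visualizing feature importance is provided.
    * Inputs:
        * importance : feature importances of a tree model (e.g. model.feature_importances_)
        * names : list of feature names (e.g. x_train.columns)
        * result_only : whether to return only the DataFrame sorted by importance or to draw the plot as well. If False, the function returns the DataFrame and draws the plot.
        * topn : show only the top n features by importance; 'all' shows every feature.
    * Outputs:
        * Importance plot: sorted by importance in descending order
        * Importance DataFrame: sorted by importance in descending order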

In [33]:
# Compute feature importance
def plot_feature_importance(importance, names, result_only = False, topn = 'all'):
    feature_importance = np.array(importance)
    feature_name = np.array(names)

    data={'feature_name':feature_name,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort by feature importance
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    # Plot the feature importances
    if not result_only:
        plt.figure(figsize=(10,20))
        sns.barplot(x='feature_importance', y='feature_name', data = fi_df)

        plt.xlabel('importance')
        plt.ylabel('feature name')
        plt.grid()

    return fi_df
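
A usage sketch for the table part of the helper: fit a small tree on synthetic data, then rank the importances the same way `plot_feature_importance` does. The synthetic data, the column names `f0` to `f5`, and the plotting-free stand-in `rank_importance` are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def rank_importance(importance, names, topn='all'):
    """Plotting-free stand-in: importance table sorted in descending order."""
    table = pd.DataFrame({'feature_name': np.array(names),
                          'feature_importance': np.array(importance)})
    table = table.sort_values('feature_importance', ascending=False)
    table = table.reset_index(drop=True)
    return table if topn == 'all' else table.iloc[:topn]

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(6)])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
top3 = rank_importance(tree.feature_importances_, X.columns, topn=3)
print(top3)
```

In the notebook itself, the equivalent call would be along the lines of `plot_feature_importance(model_DST.feature_importances_, x_train.columns, topn=10)` once a tree model has been fitted.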

### (2) Loading the Data

* Provided datasets
    * data01_train.csv : for training and validation
    * data01_test.csv : for testing

* Detailed requirements
    * Column removal: the 'subject' column in data01_train.csv and data01_test.csv is not needed, so drop it.

#### 1) Loading the Data

In [37]:
file1 = 'data01_train.csv'
file2 = 'data01_test.csv'

In [38]:
data = pd.read_csv(file1)
test = pd.read_csv(file2)

In [39]:
# Drop the unneeded column
data = data.drop(columns='subject')
test = test.drop(columns='subject')

In [40]:
# Remove special characters from column names; they trigger a JSON error in LGBMClassifier
import re
data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
test = test.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
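
To see what the renaming step does, the sketch below applies the same regular expression to a few column names. The raw names are hypothetical, chosen so that they clean to names that appear in this notebook; the collision check is an extra safeguard and not part of the original code.

```python
import re

def clean_name(name):
    """Drop every character that is not a letter, digit, or underscore."""
    return re.sub(r'[^A-Za-z0-9_]+', '', name)

# Hypothetical raw column names, for illustration
raw_names = ['tBodyAcc-mean()-X', 'angle(X,gravityMean)', 'Activity']
cleaned = [clean_name(n) for n in raw_names]
print(cleaned)

# Two different raw names could collapse to the same cleaned name,
# so it is worth confirming the cleaned names are still unique
has_collision = len(set(cleaned)) != len(cleaned)
print('collision:', has_collision)
```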

#### 2) Basic Information

In [42]:
data.shape

(5881, 562)

In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5881 entries, 0 to 5880
Columns: 562 entries, tBodyAccmeanX to Activity
dtypes: float64(561), object(1)
memory usage: 25.2+ MB


In [44]:
data.describe()

statistic,tBodyAccmeanX,tBodyAccmeanY,tBodyAccmeanZ,tBodyAccstdX,tBodyAccstdY,tBodyAccstdZ,tBodyAccmadX,tBodyAccmadY,tBodyAccmadZ,tBodyAccmaxX,...,fBodyBodyGyroJerkMagmeanFreq,fBodyBodyGyroJerkMagskewness,fBodyBodyGyroJerkMagkurtosis,angletBodyAccMeangravity,angletBodyAccJerkMeangravityMean,angletBodyGyroMeangravityMean,angletBodyGyroJerkMeangravityMean,angleXgravityMean,angleYgravityMean,angleZgravityMean
count,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,...,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0
mean,0.274811,-0.017799,-0.109396,-0.603138,-0.509815,-0.604058,-0.628151,-0.525944,-0.605374,-0.46549,...,0.126955,-0.305883,-0.623548,0.008524,-0.001185,0.00934,-0.007099,-0.491501,0.059299,-0.054594
std,0.067614,0.039422,0.058373,0.448807,0.501815,0.417319,0.424345,0.485115,0.413043,0.544995,...,0.249176,0.322808,0.310371,0.33973,0.447197,0.60819,0.476738,0.509069,0.29734,0.278479
min,-0.503823,-0.684893,-1.0,-1.0,-0.999844,-0.999667,-1.0,-0.999419,-1.0,-1.0,...,-0.965725,-0.979261,-0.999765,-0.97658,-1.0,-1.0,-1.0,-1.0,-1.0,-0.980143
25%,0.262919,-0.024877,-0.121051,-0.992774,-0.97768,-0.980127,-0.993602,-0.977865,-0.980112,-0.936067,...,-0.02161,-0.541969,-0.845985,-0.122361,-0.294369,-0.481718,-0.373345,-0.811397,-0.018203,-0.141555
50%,0.277154,-0.017221,-0.108781,-0.943933,-0.844575,-0.856352,-0.948501,-0.849266,-0.849896,-0.878729,...,0.133887,-0.342923,-0.712677,0.010278,0.005146,0.011448,-0.000847,-0.709441,0.182893,0.003951
75%,0.288526,-0.01092,-0.098163,-0.24213,-0.034499,-0.26269,-0.291138,-0.068857,-0.268539,-0.01369,...,0.288944,-0.127371,-0.501158,0.154985,0.28503,0.499857,0.356236,-0.51133,0.248435,0.111932
max,1.0,1.0,1.0,1.0,0.916238,1.0,1.0,0.967664,1.0,1.0,...,0.9467,0.989538,0.956845,1.0,1.0,0.998702,0.996078,0.977344,0.478157,1.0


In [45]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1471 entries, 0 to 1470
Columns: 562 entries, tBodyAccmeanX to Activity
dtypes: float64(561), object(1)
memory usage: 6.3+ MB


## **2. Data Preprocessing**

* Carry out the preprocessing that is needed: dummy-variable encoding, data splitting, checking for and handling NaN values, scaling, and so on.
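
The notebook does not show a NaN check, so here is a minimal sketch of one on a toy frame. The column names and the fill strategy (column median) are assumptions for illustration; on the real data the same calls would run on `data` and `test`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# Count missing values per column, then in total
nan_per_column = toy.isna().sum()
total_nan = int(nan_per_column.sum())
print(nan_per_column)
print('total NaN:', total_nan)

# One possible remedy: fill with the column median
filled = toy.fillna(toy.median())
```

If the total is zero, no remedy is needed and the step can simply be recorded as checked.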


In [47]:
# # Dummy-variable encoding
# dumm_cols = ['Activity']
# data = pd.get_dummies(data, columns=dumm_cols, dtype = 'int')
# data

### (1) Data Split 1: x, y

* Detailed requirements
    - Split the data into x and y.

In [50]:
target = 'Activity'
x = data.drop(columns=target)
y = data[target]

# The train/validation split is done in the next step; the test set comes from data01_test.csv

### (2) Data Split 2: train, validation

* Detailed requirements
    - train : val = 8 : 2 or 7 : 3
    - Set the random_state option so that results are reproducible and can be compared with other models.

In [53]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=.3, random_state = 10)
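
The effect of `random_state` can be checked directly: the same seed yields the same split. The sketch uses toy data; `stratify=y` is an optional extra, not used in the notebook, that keeps class proportions equal across the two parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, two balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

split_a = train_test_split(X, y, test_size=0.3, random_state=10, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=10, stratify=y)

# Same seed, same rows in the training part
same_split = np.array_equal(split_a[0], split_b[0])
print('reproducible:', same_split)

# train_test_split returns x_train, x_val, y_train, y_val in that order
print('train size:', len(split_a[0]), 'val size:', len(split_a[1]))
```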

### (3) Scaling


* Detailed requirements
    - Run this code so that algorithms that require scaling can be used.
    - Use either min-max scaling or standard scaling.

In [56]:
minmax = MinMaxScaler()

# Fit the scaler on the training set only, then apply it to the validation set
x_train_s1 = minmax.fit_transform(x_train)
x_val_s1 = minmax.transform(x_val)

# Convert back to DataFrames
x_train_s1 = pd.DataFrame(x_train_s1, columns=x_train.columns)
x_val_s1 = pd.DataFrame(x_val_s1, columns=x_val.columns)
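
Min-max scaling maps each feature to x' = (x - min) / (max - min), with min and max taken from the training data only. The toy numbers below are made up to show that validation values outside the training range land outside [0, 1], which is expected behavior rather than a bug.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
val = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=0, max=10
val_scaled = scaler.transform(val)           # reuses the training min/max

# Manual version of the same formula
manual = (val - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(val_scaled.ravel())
```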

In [57]:
x_train_s1.describe().T

feature,count,mean,std,min,25%,50%,75%,max
tBodyAccmeanX,4116.0,0.657475,0.057222,0.0,0.648218,0.659581,0.668865,1.0
tBodyAccmeanY,4116.0,0.395710,0.024958,0.0,0.391678,0.396298,0.399962,1.0
tBodyAccmeanZ,4116.0,0.445612,0.030918,0.0,0.439622,0.445618,0.450961,1.0
tBodyAccstdX,4116.0,0.208200,0.238487,0.0,0.003736,0.027098,0.403845,1.0
tBodyAccstdY,4116.0,0.252492,0.261911,0.0,0.011282,0.071486,0.503845,1.0
...,...,...,...,...,...,...,...,...
angletBodyGyroMeangravityMean,4116.0,0.505255,0.302881,0.0,0.260914,0.506262,0.743516,1.0
angletBodyGyroJerkMeangravityMean,4116.0,0.498992,0.237314,0.0,0.321172,0.500061,0.679486,1.0
angleXgravityMean,4116.0,0.257410,0.257581,0.0,0.095169,0.146700,0.247597,1.0
angleYgravityMean,4116.0,0.714069,0.202619,0.0,0.657738,0.797332,0.843697,1.0


## **3. Baseline Modeling**



* Detailed requirements
    - Apply at least 5 algorithms.
    - For each algorithm, try several of the following and compare performance.

In [60]:
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

### (1) KNN

In [62]:
# param = {'n_neighbors': range(2, 34)}

In [63]:
# # Declare the model
# model_knn = GridSearchCV(KNeighborsClassifier(), param, cv=5)

# # Fit
# model_knn.fit(x_train_s1, y_train)

# # Check the results
# print('* Parameters:', model_knn.best_params_)
# print('* Performance:', model_knn.best_score_)
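
The commented-out search above tunes `n_neighbors` on data that was scaled before cross-validation. A common alternative, sketched here on synthetic data, puts the scaler inside a `Pipeline` so that each CV fold fits its own scaler. The data, the parameter range, and the step names `scale` and `knn` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

pipe = Pipeline([('scale', MinMaxScaler()),
                 ('knn', KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter>
param = {'knn__n_neighbors': range(2, 11)}

search = GridSearchCV(pipe, param, cv=5)
search.fit(X, y)

print('* best params:', search.best_params_)
print('* best CV accuracy:', search.best_score_)
```

The fitted `search` object can predict on unscaled features directly, because the scaling step travels with the model.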

In [64]:
model_KNN = KNeighborsClassifier(n_neighbors=4)
model_KNN.fit(x_train_s1, y_train)
pred_KNN = model_KNN.predict(x_val_s1)

print(accuracy_score(y_val, pred_KNN))
print(confusion_matrix(y_val, pred_KNN))
print(classification_report(y_val, pred_KNN))

0.96657223796034
[[330   1   0   0   0   0]
 [  1 260  21   0   0   0]
 [  0  33 290   0   0   0]
 [  0   0   0 323   0   1]
 [  0   0   0   1 246   1]
 [  0   0   0   0   0 257]]
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       331
           SITTING       0.88      0.92      0.90       282
          STANDING       0.93      0.90      0.91       323
           WALKING       1.00      1.00      1.00       324
WALKING_DOWNSTAIRS       1.00      0.99      1.00       248
  WALKING_UPSTAIRS       0.99      1.00      1.00       257

          accuracy                           0.97      1765
         macro avg       0.97      0.97      0.97      1765
      weighted avg       0.97      0.97      0.97      1765
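
The accuracy and per-class recall in the report can be recomputed from the confusion matrix printed above: accuracy is the diagonal sum over the total, and recall for a class is its diagonal entry over its row sum. The class order below follows the label order shown in the classification report.

```python
# Confusion matrix copied from the KNN output above (rows = true class)
labels = ['LAYING', 'SITTING', 'STANDING', 'WALKING',
          'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS']
cm = [
    [330,   1,   0,   0,   0,   0],
    [  1, 260,  21,   0,   0,   0],
    [  0,  33, 290,   0,   0,   0],
    [  0,   0,   0, 323,   0,   1],
    [  0,   0,   0,   1, 246,   1],
    [  0,   0,   0,   0,   0, 257],
]

correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total
print(correct, total, accuracy)

# Recall per class = correct predictions / number of true samples
recall = {name: cm[i][i] / sum(cm[i]) for i, name in enumerate(labels)}
for name, value in recall.items():
    print(f'{name:>18}: {value:.2f}')
```

Of the 59 misclassified samples, 54 are SITTING/STANDING confusions (21 + 33), which is why those two classes have the lowest scores in the report.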



### (2) DecisionTree

In [66]:
# # Declare the parameters
# param = {'max_depth': range(1,51)}

# # Declare the model
# model_dst = GridSearchCV(DecisionTreeClassifier(), param, cv=5)

# # Fit
# model_dst.fit(x_train, y_train)

# # Check the results
# print('* Parameters:', model_dst.best_params_)
# print('* Performance:', model_dst.best_score_)

In [67]:
model_DST = DecisionTreeClassifier(max_depth=8)
model_DST.fit(x_train, y_train)
pred_DST = model_DST.predict(x_val)

print(accuracy_score(y_val, pred_DST))
print(confusion_matrix(y_val, pred_DST))
print(classification_report(y_val, pred_DST))

0.9388101983002833
[[331   0   0   0   0   0]
 [  0 249  33   0   0   0]
 [  0  18 305   0   0   0]
 [  0   0   0 300  10  14]
 [  0   0   0   7 232   9]
 [  0   0   1   8   8 240]]
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       331
           SITTING       0.93      0.88      0.91       282
          STANDING       0.90      0.94      0.92       323
           WALKING       0.95      0.93      0.94       324
WALKING_DOWNSTAIRS       0.93      0.94      0.93       248
  WALKING_UPSTAIRS       0.91      0.93      0.92       257

          accuracy                           0.94      1765
         macro avg       0.94      0.94      0.94      1765
      weighted avg       0.94      0.94      0.94      1765



### (3) LGBM

In [69]:
# # Parameters
# param = {'n_estimators': range(100,301,100), 'learning_rate': np.linspace(0.001, 0.5, 100), 'max_depth': [-1]}

# # Declare the model
# model_lgb = RandomizedSearchCV(LGBMClassifier(verbose=-1), param, cv=5, n_iter = 10)

# model_lgb.fit(x_train, y_train)

# # Best parameters and best score
# print('* Parameters:', model_lgb.best_params_)
# print('* Performance:', model_lgb.best_score_)

In [70]:
# # Parameters
# param = {'n_estimators':range(10,301,10), 'learning_rate': np.linspace(0.001, 0.5, 100)}

# # Declare the model
# model_lgb2 = GridSearchCV(LGBMClassifier(verbose=-1), param, cv=5)

# model_lgb2.fit(x_train, y_train)

# # Best parameters and best score
# print('* Parameters:', model_lgb2.best_params_)
# print('* Performance:', model_lgb2.best_score_)
# Not verified

In [None]:
# Parameters
param = {'n_estimators' : [260] ,'learning_rate':[0.422]}

# Declare the model
model_lgb = GridSearchCV(LGBMClassifier(verbose=-1), param, cv=5)

model_lgb.fit(x_train, y_train)

# Best parameters and best score
print('* Parameters:', model_lgb.best_params_)
print('* Performance:', model_lgb.best_score_)

### (4) Model 4

### (5) Model 5
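
Sections (4) and (5) are empty in the notebook. Since `RandomForestClassifier` and `XGBClassifier` are imported above, one possible way to fill them is sketched below on synthetic data. Everything here is an assumption for illustration, including the hyperparameters and the three stand-in label names. The `LabelEncoder` step is shown because recent `XGBClassifier` versions expect integer class labels 0..n-1 rather than strings such as 'LAYING'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for x / y with string labels
X, y_int = make_classification(n_samples=300, n_features=10, n_informative=5,
                               n_classes=3, random_state=0)
y = np.array(['LAYING', 'SITTING', 'WALKING'])[y_int]

x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=10)

# Random forest accepts string labels directly
model_rf = RandomForestClassifier(n_estimators=100, random_state=10)
model_rf.fit(x_tr, y_tr)
print('RF accuracy:', accuracy_score(y_va, model_rf.predict(x_va)))

# Integer-encode the labels for models that need numeric classes
encoder = LabelEncoder()
y_tr_enc = encoder.fit_transform(y_tr)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# Map integer labels back to the original label names
restored = encoder.inverse_transform(y_tr_enc)
```

A model trained on `y_tr_enc` returns integers, so its predictions would go through `encoder.inverse_transform` before being compared with the string labels in `y_val` or `y_test`.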

## 4. Performance Comparison

* Detailed requirements
    - Measure each model's performance on the test data, then compare.
    

In [None]:
# Split into x and y
x_test = test.drop(['Activity'], axis = 1)
y_test = test['Activity']


In [None]:
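# KNN was trained on min-max scaled features, so scale the test set with the same scaler
x_test_s1 = pd.DataFrame(minmax.transform(x_test), columns=x_test.columns)

# Pair each model with the test features it expects
model_no = [(model_KNN, x_test_s1), (model_DST, x_test), (model_lgb, x_test)]

for model, x_eval in model_no :
    pred = model.predict(x_eval)
    print(accuracy_score(y_test, pred))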

## 5. Saving the Models
* Save the best-performing model for each algorithm
    * Only save models built with the full feature set (joblib.dump).
    * Save tuned models via model.best_estimator_.

In [78]:
import joblib

In [80]:
joblib.dump(model_lgb.best_estimator_, 'minip_LGBM_chan.pkl')

['minip_LGBM_chan.pkl']
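
A minimal round-trip sketch of the saving rule above: dump only `best_estimator_` from a search, reload it, and confirm the predictions match. The synthetic data, the small decision-tree search, and the file name `example_tree.pkl` are stand-ins chosen so the example runs quickly on its own.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 4]}, cv=3)
search.fit(X, y)

# Save only the refitted best model, not the whole search object
path = os.path.join(tempfile.mkdtemp(), 'example_tree.pkl')
joblib.dump(search.best_estimator_, path)

# Reload and confirm the predictions are unchanged
loaded = joblib.load(path)
same_predictions = bool((loaded.predict(X) == search.predict(X)).all())
print('round trip ok:', same_predictions)
```

Loading a pickle generally works best with the same library versions that wrote it, so recording the scikit-learn version printed at the top of the notebook alongside the file is a sensible habit.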