### Building a machine learning model that learns from training data to decide whether a car was produced in the USA

In [1]:
import pandas as pd
import numpy as np

In [4]:
x_train = pd.read_csv('./data/mpg_x_train.csv')
x_train.head(5)

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
0,pontiac j2000 se hatchback,31.0,4,112.0,85.0,2575,16.2,82
1,pontiac safari (sw),13.0,8,400.0,175.0,5140,12.0,71
2,mazda glc custom l,37.0,4,91.0,68.0,2025,18.2,82
3,oldsmobile vista cruiser,12.0,8,350.0,180.0,4499,12.5,73
4,peugeot 504,19.0,4,120.0,88.0,3270,21.9,76


In [5]:
y_train = pd.read_csv('./data/mpg_y_train.csv')
y_train.head(5)

Unnamed: 0,isUSA
0,1
1,1
2,0
3,1
4,0


In [6]:
x_test = pd.read_csv('./data/mpg_x_test.csv')
x_test.head() # predict the class for these rows and submit the result as a CSV file

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
0,maxda glc deluxe,34.1,4,86.0,65.0,1975,15.2,79
1,plymouth sapporo,23.2,4,156.0,105.0,2745,16.7,78
2,dodge coronet brougham,16.0,8,318.0,150.0,4190,13.0,76
3,amc concord dl 6,20.2,6,232.0,90.0,3265,18.2,79
4,fiat strada custom,37.3,4,91.0,69.0,2130,14.7,79


In [7]:
print(x_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          278 non-null    object 
 1   mpg           278 non-null    float64
 2   cylinders     278 non-null    int64  
 3   displacement  278 non-null    float64
 4   horsepower    274 non-null    float64
 5   weight        278 non-null    int64  
 6   acceleration  278 non-null    float64
 7   model_year    278 non-null    int64  
dtypes: float64(4), int64(3), object(1)
memory usage: 17.5+ KB
None


In [8]:
print(x_train.isnull().sum())

name            0
mpg             0
cylinders       0
displacement    0
horsepower      4
weight          0
acceleration    0
model_year      0
dtype: int64


In [10]:
# Handle missing values
from sklearn.impute import SimpleImputer
# strategy: mean, median, most_frequent
# 'constant': a specific value, e.g. SimpleImputer(strategy='constant', fill_value=1)
# Replace missing values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x_train[['horsepower']] = imputer.fit_transform(x_train[['horsepower']])
# Reuse the mean learned from the training data; do not refit on the test set
x_test[['horsepower']] = imputer.transform(x_test[['horsepower']])
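
One common pattern is to let the imputer learn its fill value from the training column only and then apply that same value to the test column. The sketch below illustrates this on made-up numbers; `toy_train`, `toy_test`, and `toy_imputer` are hypothetical names and the values are not taken from the mpg files.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the mpg files used in this notebook.
toy_train = pd.DataFrame({'horsepower': [100.0, np.nan, 200.0]})
toy_test = pd.DataFrame({'horsepower': [np.nan, 50.0]})

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform learns the mean of the non-missing training values (150.0).
train_filled = toy_imputer.fit_transform(toy_train[['horsepower']])
# transform reuses that training mean for the test column.
test_filled = toy_imputer.transform(toy_test[['horsepower']])

print(toy_imputer.statistics_)  # fill value learned from the training column
print(test_filled.ravel())      # the missing test value is filled with it
```

Calling `fit_transform` on the test column instead would fill its gaps with the test set's own mean, so the two sets would be preprocessed differently.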

In [11]:
x_train.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
count,278.0,278.0,278.0,278.0,278.0,278.0,278.0
mean,23.732734,5.374101,189.994604,103.383212,2948.464029,15.580216,76.057554
std,7.647295,1.677084,105.471423,38.695458,862.949746,2.745907,3.605591
min,10.0,3.0,68.0,46.0,1613.0,8.0,70.0
25%,18.0,4.0,98.0,75.0,2206.25,14.0,73.0
50%,23.0,4.0,140.5,90.5,2737.5,15.5,76.0
75%,29.0,6.0,258.0,118.75,3560.0,17.0,79.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0


In [13]:
print(x_train.columns)
print(x_train.info())

Index(['name', 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          278 non-null    object 
 1   mpg           278 non-null    float64
 2   cylinders     278 non-null    int64  
 3   displacement  278 non-null    float64
 4   horsepower    278 non-null    float64
 5   weight        278 non-null    int64  
 6   acceleration  278 non-null    float64
 7   model_year    278 non-null    int64  
dtypes: float64(4), int64(3), object(1)
memory usage: 17.5+ KB
None


In [16]:
print(type(x_train))

<class 'pandas.core.frame.DataFrame'>


In [17]:
col_del = ['name']  # columns to drop
col_num = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model_year']  # numeric columns
col_cat = []  # categorical columns
col_y = ['isUSA']  # target column

x_train = x_train.drop(columns=col_del)  # drop name
x_test = x_test.drop(columns=col_del)    # drop name
x_train.head(3)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
0,31.0,4,112.0,85.0,2575,16.2,82
1,13.0,8,400.0,175.0,5140,12.0,71
2,37.0,4,91.0,68.0,2025,18.2,82


In [18]:
# Split into training and validation sets, 70:30
# (no random_state is set, so the split differs from run to run)
from sklearn.model_selection import train_test_split
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.3)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # standardization (zero mean, unit variance)
scaler.fit(x_tr[col_num]) # learn each column's mean and standard deviation from the training split
x_tr[col_num] = scaler.transform(x_tr[col_num])
x_val[col_num] = scaler.transform(x_val[col_num])
x_test[col_num] = scaler.transform(x_test[col_num])
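
Standardization replaces each value with (x - mean) / std, where the mean and standard deviation come from the data passed to `fit`. As a small check of that formula, the sketch below compares `StandardScaler` with a manual computation on a hypothetical toy matrix (`toy_tr` and `toy_val` are illustrative names, not the mpg data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix (two columns), not the mpg data.
toy_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_val = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler().fit(toy_tr)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of np.std.
manual_tr = (toy_tr - toy_tr.mean(axis=0)) / toy_tr.std(axis=0)
# Validation rows are scaled with the TRAINING mean and std.
manual_val = (toy_val - toy_tr.mean(axis=0)) / toy_tr.std(axis=0)

print(np.allclose(toy_scaler.transform(toy_tr), manual_tr))
print(np.allclose(toy_scaler.transform(toy_val), manual_val))
```

Because the validation and test rows are transformed with training statistics, their scaled columns will generally not have exactly zero mean or unit variance.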

In [33]:
# Passing a 2-D (column-vector) array as the class labels triggers a warning, but training still runs.
print(y_tr.values[:5])
print(y_tr.shape)
print(type(y_tr.values))

[[0]
 [0]
 [1]
 [0]
 [1]]
(194, 1)
<class 'numpy.ndarray'>


In [34]:
# Build (train) the model
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbors classifier
modelKNN = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
modelKNN.fit(x_tr, y_tr.values)

  return self._fit(X, y)


KNeighborsClassifier(metric='euclidean')

In [35]:
# Pass the class labels as a 1-D array; ravel() flattens 2-D -> 1-D
print(y_tr.values.ravel()[:5])
print(y_tr.values.ravel().shape)
print(type(y_tr.values.ravel()))

[0 0 1 0 1]
(194,)
<class 'numpy.ndarray'>


In [36]:
# Build (train) the model
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbors classifier
modelKNN = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
modelKNN.fit(x_tr, y_tr.values.ravel()) # train

KNeighborsClassifier(metric='euclidean')

In [37]:
from sklearn.tree import DecisionTreeClassifier # decision tree classifier
modelDT = DecisionTreeClassifier(max_depth=10)
modelDT.fit(x_tr, y_tr) # train

DecisionTreeClassifier(max_depth=10)

In [40]:
# Use the models to predict the validation set
y_val_p = modelKNN.predict(x_val)
print(y_val_p)
y_val_p = modelDT.predict(x_val)
print(y_val_p)

[1 0 0 1 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0 0
 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1
 0 1 0 0 0 1 0 1 1 0]
[1 0 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0
 0 0 1 0 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1
 1 1 0 0 0 1 0 1 1 0]


In [47]:
y_val_pb = modelKNN.predict_proba(x_val)
print(y_val_pb[:5]) # [P(class 0), P(class 1)]
y_val_pb = modelDT.predict_proba(x_val)
print(y_val_pb[:5]) # [P(class 0), P(class 1)]
print(type(y_val_pb[:5]))

[[0.2 0.8]
 [0.8 0.2]
 [0.8 0.2]
 [0.  1. ]
 [1.  0. ]]
[[0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]]
<class 'numpy.ndarray'>
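
The KNN probabilities above are all multiples of 0.2 because, with `n_neighbors=5` and the default uniform weights, each probability is the share of the five nearest neighbors that carry that class. The tree's probabilities of exactly 0 or 1 presumably mean that each of these rows landed in a leaf containing a single class. A minimal sketch of the KNN case on hypothetical one-dimensional data (`toy_x`, `toy_y`, and `toy_knn` are illustrative names):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: six points labelled 0 or 1.
toy_x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
toy_knn.fit(toy_x, toy_y)

# The 5 nearest neighbours of 0.5 are the points 0, 1, 2, 3, 4,
# whose labels are 0, 0, 0, 1, 1.
proba = toy_knn.predict_proba(np.array([[0.5]]))
print(proba)  # each entry is (votes for that class) / 5
```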


In [44]:
# Evaluate the models (ROC AUC on the validation set)
from sklearn.metrics import roc_auc_score

y_val_pb_KNN = modelKNN.predict_proba(x_val)
y_val_pb_DT = modelDT.predict_proba(x_val)
scoreKNN = roc_auc_score(y_val, y_val_pb_KNN[:, 1])
scoreDT = roc_auc_score(y_val, y_val_pb_DT[:, 1])

print( scoreKNN, scoreDT )

0.8401442307692308 0.8509615384615384
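
These two scores come from a single random split with only 84 validation rows, so a gap this small may not be meaningful. One way to get a steadier comparison is k-fold cross-validation. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for the mpg features, and the names `X_demo`, `y_demo`, `knn_pipe`, and `tree` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the mpg features (assumption: 278 rows, 7 columns).
X_demo, y_demo = make_classification(n_samples=278, n_features=7, random_state=0)

# Putting the scaler in a pipeline refits it on the training part of each fold.
knn_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric='euclidean'),
)
tree = DecisionTreeClassifier(max_depth=10, random_state=0)

# Five ROC AUC scores per model, one for each held-out fold.
auc_knn = cross_val_score(knn_pipe, X_demo, y_demo, cv=5, scoring='roc_auc')
auc_tree = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='roc_auc')

print(auc_knn.mean(), auc_knn.std())
print(auc_tree.mean(), auc_tree.std())
```

Comparing the mean scores together with their spread across folds gives a better sense of whether one model is really ahead of the other.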


In [49]:
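# Submit the answer: P(isUSA = 1) for the test set, using the decision tree
# (the higher validation AUC above); the validation predictions must not be submitted
y_test_pb = modelDT.predict_proba(x_test)
pd.DataFrame({'isUSA': y_test_pb[:, 1]}).to_csv('./send/001.csv', index=False)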