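# Top 10 Machine Learning Algorithms Used in Industry

1. Supervised learning
  - Linear regression
  - Logistic regression
  - k-NN
  - Naive Bayes
  - Decision tree
  - Random forest
  - XGBoost
  - LightGBM

2. Unsupervised learning
  - k-means
  - PCA

3. Selection criteria
  - Versatility
  - Speed
  - Predictive power
  - Hyperparameter tuning
  - Visualization
  - Interpretability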

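# Essential Machine Learning Libraries

1. NumPy: short for Numerical Python
  - The standard library for numerical computing in Python
  - Essential for most scientific computing that involves data structures, algorithms, and numeric data
  - Built around the ndarray object
<br>
2. pandas: high-level data structures designed to make working with structured or tabular data fast, easy, and expressive
  - The standard library for data processing in data science
  - The de facto standard for data handling
  - Its main data structures are the Series and DataFrame objects
<br>
3. Matplotlib: visualization library
  - A Python library for plotting graphs and visualizing 2D data
<br>
4. Seaborn: a library that offers a wide range of plot types
  - Built on top of Matplotlib
<br>
5. SciPy
  - A collection of packages that address fundamental problems in scientific computing
  - scipy.stats: a module containing the most commonly used statistical tools
<br>
6. scikit-learn: the core library for machine learning
  - Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
  - Regression: lasso, ridge regression, etc.
  - Clustering: k-means, etc.
  - Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
  - Model selection: grid search, cross-validation, metrics
  - Preprocessing: feature extraction, normalization, etc.
<br>
7. statsmodels: a statistical analysis package that brings R-style regression models to Python
  - Regression models: linear regression
  - Analysis of variance (ANOVA)
  - Time series analysis: AR, ARMA, ARIMA, etc.
  - Visualization of statistical model results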
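
As a minimal sketch of the core objects named in the list (ndarray, Series, DataFrame), the snippet below builds each one from a few toy values. The variable names are illustrative and the numbers are only sample values, not part of any real pipeline.

```python
import numpy as np
import pandas as pd

# ndarray: a fixed-type, n-dimensional array with vectorized arithmetic
arr = np.array([[1, 2, 3], [4, 5, 6]])
doubled = arr * 2              # element-wise, no Python loop
col_means = arr.mean(axis=0)   # mean of each column

# Series: a labeled 1-D array; DataFrame: a table of labeled columns
prices = pd.Series([450000, 370000, 158000], name="selling_price")
frame = pd.DataFrame({"year": [2014, 2014, 2006], "selling_price": prices})

print(arr.shape, col_means)
print(frame.describe().loc["mean"])
```

Most of the preprocessing later in these notes is a combination of these two ideas: column-wise operations on a DataFrame, backed by NumPy arrays.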
  

# Ensemble Learning

In [1]:
import numpy as np
import pandas as pd

# Voting classifier
from sklearn.ensemble import VotingClassifier
# Base learners used for voting
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [4]:
print(cancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [6]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [8]:
df =  pd.DataFrame(cancer.data, columns = cancer.feature_names)
df.head(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758


In [9]:
df.shape

(569, 30)

In [10]:
logistic_regression = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors=5)

voting_model = VotingClassifier(estimators=[('LogisticRegression', logistic_regression), ('KNN', knn)], voting='soft')

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=156)

In [12]:
voting_model.fit(X_train, y_train)
pred = voting_model.predict(X_test)

In [13]:
print('Voting classifier accuracy : {:.3f}'.format(accuracy_score(y_test, pred)))

Voting classifier accuracy : 0.947


In [14]:
# Train, predict with, and evaluate each individual model
classifier = [logistic_regression, knn]

for model in classifier:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    model_name = model.__class__.__name__
    print('{} accuracy : {:.3f}'.format(model_name, accuracy_score(y_test, pred)))

LogisticRegression accuracy : 0.939
KNeighborsClassifier accuracy : 0.904
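
The voting model above uses `voting='soft'`, which averages the class probabilities of the base models and predicts the class with the highest average. The sketch below reproduces only that arithmetic with NumPy; the probability values are made up for illustration and do not come from the breast cancer data.

```python
import numpy as np

# Hypothetical class probabilities from two classifiers for three samples
# (columns: P(class 0), P(class 1)); the numbers are made up for illustration.
proba_lr = np.array([[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]])
proba_knn = np.array([[0.6, 0.4], [0.2, 0.8], [0.2, 0.8]])

# Soft voting: average the probabilities, then pick the most probable class
avg_proba = (proba_lr + proba_knn) / 2
soft_pred = avg_proba.argmax(axis=1)

# Hard voting would instead count each model's own predicted label
hard_votes = np.stack([proba_lr.argmax(axis=1), proba_knn.argmax(axis=1)])

print(avg_proba)
print(soft_pred)
```

For the third sample the two models disagree on the label, and the averaged probabilities settle the tie in favor of the more confident model. That is the practical difference from hard voting.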


# Random Forest (RandomForest)

In [15]:
wine = pd.read_csv('https://raw.githubusercontent.com/rickiepark/hg-mldl/master/wine.csv')

data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

In [16]:
train_input, test_input, train_target, test_target = train_test_split(
    data, target, test_size = 0.2, random_state = 42)

In [17]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs= -1, random_state=42)
scores = cross_validate(rf, train_input, train_target, return_train_score = True)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9973541965122431 0.8905151032797809


In [20]:
rf.fit(train_input, train_target)

print(rf.feature_importances_)

[0.23167441 0.50039841 0.26792718]


In [21]:
# OOB (out-of-bag) samples: the samples left out of a tree's bootstrap sample
rf = RandomForestClassifier(oob_score = True, n_jobs= -1, random_state=42)
rf.fit(train_input, train_target)

print(rf.oob_score_)

0.8934000384837406
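
To see why out-of-bag samples exist at all, the sketch below draws one bootstrap sample and measures how many rows were never picked. It is a standalone NumPy simulation with an arbitrary sample size, not a view into how `RandomForestClassifier` stores its samples.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# A bootstrap sample draws n_samples indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Out-of-bag samples are the indices that were never drawn
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1 - in_bag.mean()

# Expected fraction is (1 - 1/n)^n, which approaches 1/e (about 0.368)
expected = (1 - 1 / n_samples) ** n_samples
print(round(oob_fraction, 3), round(expected, 3))
```

Because roughly a third of the rows are unseen by each tree, they can serve as a built-in validation set, which is what `oob_score_` reports.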


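# Practical Example: Used-Car Price Prediction

1. Algorithm: random forest (RandomForest)
2. Dataset: an overseas used-car sales dataset
3. About the dataset: dependent variable (selling_price), independent variables ()
  - A dataset collected from used-car sales records
4. Problem type: regression
5. Evaluation metric: RMSE
6. Model: RandomForestRegressor
7. Libraries used: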

In [3]:
data = pd.read_csv('https://media.githubusercontent.com/media/musthave-ML10/data_source/main/car.csv')

In [4]:
data.head(3)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0


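1. Feature overview
  - name : car model
  - year : model year
  - selling_price : selling price
  - km_driven : distance driven (km)
  - fuel : fuel type
  - seller_type : seller type
  - transmission : transmission
  - owner : ownership history
  - mileage : fuel efficiency (kmpl or km/kg)
  - engine : engine displacement (CC)
  - max_power : maximum power (brake horsepower, bhp)
  - torque : torque (the rotational force that turns the wheels)
  - seats : number of seats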

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 825.6+ KB


In [6]:
round(data.describe(), 2)

Unnamed: 0,year,selling_price,km_driven,seats
count,8128.0,8128.0,8128.0,7907.0
mean,2013.8,638271.81,69819.51,5.42
std,4.04,806253.4,56550.55,0.96
min,1983.0,29999.0,1.0,2.0
25%,2011.0,254999.0,35000.0,5.0
50%,2015.0,450000.0,60000.0,5.0
75%,2017.0,675000.0,98000.0,5.0
max,2020.0,10000000.0,2360457.0,14.0


## Preprocessing: Text Data

- split() : splits a string

### engine

In [7]:
data[['engine', 'engine_unit']] = data['engine'].str.split(expand=True)

In [8]:
data['engine'].head()

0    1248
1    1498
2    1497
3    1396
4    1298
Name: engine, dtype: object

In [9]:
data['engine'] = data['engine'].astype(float)

In [10]:
data['engine'].head()

0    1248.0
1    1498.0
2    1497.0
3    1396.0
4    1298.0
Name: engine, dtype: float64

In [11]:
data['engine_unit'].unique()

array(['CC', nan], dtype=object)

In [12]:
data['engine_unit'].value_counts()

CC    7907
Name: engine_unit, dtype: int64

In [13]:
# Drop the helper column
data.drop('engine_unit', axis=1, inplace=True)

### mileage 

In [14]:
data[['mileage', 'mileage_unit']] = data['mileage'].str.split(expand=True)

In [15]:
data['mileage'] = data['mileage'].astype(float)

In [16]:
data['mileage_unit'].unique()

array(['kmpl', 'km/kg', nan], dtype=object)

In [17]:
data['fuel'].unique()

array(['Diesel', 'Petrol', 'LPG', 'CNG'], dtype=object)

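- There are four fuel types.
- To compare fuel efficiency across different fuels, the values need a common basis.
- One option is to use fuel prices: how many km can the car travel per dollar?
- Prices as of 2022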
  - Diesel
  - Petrol
  - LPG
  - CNG

In [18]:
def mile(x):
    if x['fuel'] == 'Petrol':
        return x['mileage'] / 1.048
    elif x['fuel'] == 'Diesel':
        return x['mileage'] / 1.405
    elif x['fuel'] == 'LPG':
        return x['mileage'] / 3.54 
    else:
        return x['mileage'] / 2.76

In [19]:
data['mileage'] = data.apply(mile, axis= 1)

In [20]:
data.drop('mileage_unit', axis=1, inplace =True)
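
The row-wise `mile()` function above can also be written as a vectorized lookup. The sketch below assumes the same four divisors that appear in `mile()`; the `FUEL_PRICE` name, the toy frame, and the last two mileage values are illustrative only.

```python
import pandas as pd

# Divisors copied from the mile() function above; their unit is an assumption
# (price per litre or per kg of fuel).
FUEL_PRICE = {"Petrol": 1.048, "Diesel": 1.405, "LPG": 3.54, "CNG": 2.76}

cars = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "LPG", "CNG"],
    "mileage": [23.4, 17.7, 17.3, 26.6],   # first two from data.head(); last two made up
})

# Map each fuel to its price, then divide the whole column at once
cars["km_per_dollar"] = cars["mileage"] / cars["fuel"].map(FUEL_PRICE)
print(cars)
```

One behavioral difference is worth noting: `mile()` falls through to the CNG divisor for any unrecognized fuel, while `map` yields NaN for a fuel that is missing from the dictionary, which makes unexpected categories visible.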

### torque
  - Extract only the leading number and convert it to a numeric type
  - Put all values on the same unit scale (Nm)

In [21]:
data['torque'].unique()

array(['190Nm@ 2000rpm', '250Nm@ 1500-2500rpm', '12.7@ 2,700(kgm@ rpm)',
       '22.4 kgm at 1750-2750rpm', '11.5@ 4,500(kgm@ rpm)',
       '113.75nm@ 4000rpm', '7.8@ 4,500(kgm@ rpm)', '59Nm@ 2500rpm',
       '170Nm@ 1800-2400rpm', '160Nm@ 2000rpm', '248Nm@ 2250rpm',
       '78Nm@ 4500rpm', nan, '84Nm@ 3500rpm', '115Nm@ 3500-3600rpm',
       '200Nm@ 1750rpm', '62Nm@ 3000rpm', '219.7Nm@ 1500-2750rpm',
       '114Nm@ 3500rpm', '115Nm@ 4000rpm', '69Nm@ 3500rpm',
       '172.5Nm@ 1750rpm', '6.1kgm@ 3000rpm', '114.7Nm@ 4000rpm',
       '60Nm@ 3500rpm', '90Nm@ 3500rpm', '151Nm@ 4850rpm',
       '104Nm@ 4000rpm', '320Nm@ 1700-2700rpm', '250Nm@ 1750-2500rpm',
       '145Nm@ 4600rpm', '146Nm@ 4800rpm', '343Nm@ 1400-3400rpm',
       '200Nm@ 1400-3400rpm', '200Nm@ 1250-4000rpm',
       '400Nm@ 2000-2500rpm', '138Nm@ 4400rpm', '360Nm@ 1200-3400rpm',
       '200Nm@ 1200-3600rpm', '380Nm@ 1750-2500rpm', '173Nm@ 4000rpm',
       '400Nm@ 1750-3000rpm', '400Nm@ 1400-2800rpm',
       '200Nm@ 1750-3000rp

In [22]:
data['torque'] = data['torque'].str.upper()

In [23]:
data['torque'].value_counts()

190NM@ 2000RPM             530
200NM@ 1750RPM             445
90NM@ 3500RPM              407
113NM@ 4200RPM             223
114NM@ 4000RPM             171
                          ... 
72.9NM@ 2250RPM              1
155 NM AT 1600-2800 RPM      1
510NM@ 1600-2800RPM          1
285NM@ 2400-4000RPM          1
96  NM AT 3000  RPM          1
Name: torque, Length: 428, dtype: int64

In [24]:
data['torque'].isna().value_counts()

False    7906
True      222
Name: torque, dtype: int64

In [25]:
data.isnull().sum()


name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
dtype: int64

In [26]:
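# `how` must be the string 'any'; passing the built-in `any` raises the error below,
# so no rows are dropped here (they are dropped later with data.dropna(inplace=True)).
data.dropna(how = any, inplace=True)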

ValueError: invalid how option: <built-in function any>
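
The error above comes from passing the built-in function `any` instead of the string `'any'`. As a small sketch on a toy frame (the column values are placeholders), the two valid options for `how` behave as follows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "engine": [1248.0, np.nan, np.nan],
    "torque": [190.0, 250.0, np.nan],
})

# how="any": drop a row if at least one value is missing (the default)
any_dropped = toy.dropna(how="any")
# how="all": drop a row only if every value is missing
all_dropped = toy.dropna(how="all")

print(len(any_dropped), len(all_dropped))
```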

In [27]:
def torque_unit(x):
    if 'NM' in str(x):
        return 'NM'
    elif 'KGM' in str(x):
        return 'KGM'

In [28]:
data['torque_unit'] = data['torque'].apply(torque_unit)

In [29]:
data['torque_unit'].unique()

array(['NM', 'KGM', None], dtype=object)

In [30]:
data['torque_unit'].isna()

0       False
1       False
2       False
3       False
4       False
        ...  
8123    False
8124    False
8125    False
8126    False
8127    False
Name: torque_unit, Length: 8128, dtype: bool

In [31]:
data[data['torque_unit'].isna()]

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,torque_unit
13,Maruti Swift 1.3 VXi,2007,200000,80000,Petrol,Individual,Manual,Second Owner,,,,,,
31,Fiat Palio 1.2 ELX,2003,70000,50000,Petrol,Individual,Manual,Second Owner,,,,,,
78,Tata Indica DLS,2003,50000,70000,Diesel,Individual,Manual,First Owner,,,,,,
87,Maruti Swift VDI BSIV W ABS,2015,475000,78000,Diesel,Dealer,Manual,First Owner,,,,,,
119,Maruti Swift VDI BSIV,2010,300000,120000,Diesel,Individual,Manual,Second Owner,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7846,Toyota Qualis Fleet A3,2000,200000,100000,Diesel,Individual,Manual,First Owner,,,,,,
7996,Hyundai Santro LS zipPlus,2000,140000,50000,Petrol,Individual,Manual,Second Owner,,,,,,
8009,Hyundai Santro Xing XS eRLX Euro III,2006,145000,80000,Petrol,Individual,Manual,Second Owner,,,,,,
8068,Ford Figo Aspire Facelift,2017,580000,165000,Diesel,Individual,Manual,First Owner,,,,,,


In [32]:
data[data['torque_unit'].isna()]['torque'].unique()

array([nan, '250@ 1250-5000RPM', '510@ 1600-2400', '110(11.2)@ 4800',
       '210 / 1900'], dtype=object)

In [33]:
data['torque_unit'].value_counts()

NM     7390
KGM     504
Name: torque_unit, dtype: int64

In [34]:
data['torque_unit'] = data['torque_unit'].fillna('NM')

In [35]:
data[data['torque_unit'].isna()]

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,torque_unit


In [36]:
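def split_num(x):
    # Return the leading numeric part of the string (digits and '.').
    x = str(x)
    cut = len(x)  # default: the whole string is numeric
    for i, j in enumerate(x):
        if j not in '0123456789.':
            cut = i
            break
    return x[:cut]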

In [37]:
data['torque'] = data['torque'].apply(split_num)

In [38]:
data['torque'].head()

0     190
1     250
2    12.7
3    22.4
4    11.5
Name: torque, dtype: object

In [39]:
data['torque']=data['torque'].astype('float')

ValueError: could not convert string to float: ''

In [40]:
data['torque'] = data['torque'].replace('', np.nan)

In [41]:
data['torque']=data['torque'].astype('float')

In [42]:
data['torque_unit'].value_counts()

NM     7624
KGM     504
Name: torque_unit, dtype: int64

In [43]:
data['torque'].tail()

8123    113.7
8124     24.0
8125    190.0
8126    140.0
8127    140.0
Name: torque, dtype: float64

In [44]:
def trans_nm(x):
    if x['torque_unit'] == 'KGM':
        return x['torque'] * 9.80665
    else:
        return x['torque']    

In [45]:
data['torque'] = data.apply(trans_nm, axis=1)

In [46]:
data.drop('torque_unit', axis=1, inplace=True)
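
The torque steps above (uppercase, detect the unit, cut the leading number, convert kgm to Nm) can be condensed with pandas string methods. This is an alternative sketch, not the approach used in the notebook; the sample strings are taken from the `torque` column shown earlier, and the regular expression is an assumption about what counts as a leading number.

```python
import numpy as np
import pandas as pd

# Sample strings taken from the torque column shown above
torque = pd.Series(["190Nm@ 2000rpm", "12.7@ 2,700(kgm@ rpm)",
                    "22.4 kgm at 1750-2750rpm", "210 / 1900", np.nan])
upper = torque.str.upper()

# Leading number: digits with an optional decimal part at the start of the string
value = upper.str.extract(r"^\s*(\d+(?:\.\d+)?)")[0].astype(float)

# Unit: same precedence as torque_unit(): NM wins, then KGM, otherwise assume NM
has_nm = upper.str.contains("NM", na=False)
is_kgm = upper.str.contains("KGM", na=False) & ~has_nm
torque_nm = np.where(is_kgm, value * 9.80665, value)
print(torque_nm)
```

Missing values stay NaN throughout, so the intermediate empty-string replacement is not needed in this variant.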

In [47]:
data.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,16.654804,1248.0,74 bhp,190.0,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,15.046263,1498.0,103.52 bhp,250.0,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,16.889313,1497.0,78 bhp,124.544455,5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,16.370107,1396.0,90 bhp,219.66896,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,15.362595,1298.0,88.2 bhp,112.776475,5.0


### max_power

In [48]:
data[['max_power', 'max_power_unit']] = data['max_power'].str.split(expand=True)

In [49]:
data['max_power'].head()

0        74
1    103.52
2        78
3        90
4      88.2
Name: max_power, dtype: object

In [50]:
data['max_power'] = data['max_power'].astype(float)

ValueError: could not convert string to float: 'bhp'

In [51]:
data[data['max_power'] == 'bhp']

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,max_power_unit
4933,Maruti Omni CNG,2000,80000,100000,CNG,Individual,Manual,Second Owner,3.949275,796.0,bhp,,8.0,


In [52]:
def isFloat(x):
    # Convert to float; return NaN when the value cannot be parsed.
    try:
        num = float(x)
        return num
    except (ValueError, TypeError):
        return np.nan

In [53]:
data['max_power'] = data['max_power'].apply(isFloat)

In [54]:
data['max_power_unit'].unique()

array(['bhp', nan, None], dtype=object)

In [55]:
data.drop('max_power_unit', axis=1, inplace=True)
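
A helper such as `isFloat` can also be replaced by `pd.to_numeric` with `errors="coerce"`. The sketch below uses a short hand-written Series that mimics the column after the unit has been split off, including the stray `'bhp'` value found above.

```python
import pandas as pd

# Strings as they appear after splitting off the unit; "bhp" is the stray value
raw_power = pd.Series(["74", "103.52", "bhp", None])

# errors="coerce" turns anything that cannot be parsed into NaN
max_power = pd.to_numeric(raw_power, errors="coerce")
print(max_power)
```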

In [56]:
data.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,16.654804,1248.0,74.0,190.0,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,15.046263,1498.0,103.52,250.0,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,16.889313,1497.0,78.0,124.544455,5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,16.370107,1396.0,90.0,219.66896,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,15.362595,1298.0,88.2,112.776475,5.0


- name : contains both the car brand and the model name.

In [57]:
data['name'] = data['name'].str.split(expand=True)[0]

In [58]:
data['name'].unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun', 'Jeep',
       'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen', 'BMW',
       'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo',
       'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu', 'Opel', 'Peugeot'],
      dtype=object)

## Preprocessing: Missing Values and Dummy Variables

In [59]:
data.isna().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        216
torque           222
seats            221
dtype: int64

In [60]:
data.dropna(inplace=True)

In [154]:
season = pd.DataFrame({'season':['spring', 'summer', 'fall', 'winter', np.nan]})
season

Unnamed: 0,season
0,spring
1,summer
2,fall
3,winter
4,


In [155]:
pd.get_dummies(season['season'])

Unnamed: 0,fall,spring,summer,winter
0,0,1,0,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,0


In [156]:
pd.get_dummies(season['season'], dummy_na=True)

Unnamed: 0,fall,spring,summer,winter,NaN
0,0,1,0,0,0
1,0,0,1,0,0
2,1,0,0,0,0
3,0,0,0,1,0
4,0,0,0,0,1


In [61]:
data.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti,2014,450000,145500,Diesel,Individual,Manual,First Owner,16.654804,1248.0,74.0,190.0,5.0
1,Skoda,2014,370000,120000,Diesel,Individual,Manual,Second Owner,15.046263,1498.0,103.52,250.0,5.0
2,Honda,2006,158000,140000,Petrol,Individual,Manual,Third Owner,16.889313,1497.0,78.0,124.544455,5.0
3,Hyundai,2010,225000,127000,Diesel,Individual,Manual,First Owner,16.370107,1396.0,90.0,219.66896,5.0
4,Maruti,2007,130000,120000,Petrol,Individual,Manual,First Owner,15.362595,1298.0,88.2,112.776475,5.0


In [62]:
data = pd.get_dummies(data, columns=['name', 'fuel', 'seller_type' ,
                                     'transmission', 'owner'])

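- One-hot encoding: an encoding technique that represents each category of a categorical variable as a column filled with 1 or 0
- The drop_first parameter of pd.get_dummies(): when set to True, the first category is dropped (the call above leaves it at the default, so every category gets a column)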
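
A short sketch of what `drop_first` changes, using a toy `transmission` column. `dtype=int` is passed only so that the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

cars = pd.DataFrame({"transmission": ["Manual", "Automatic", "Manual"]})

# Default: one column per category
full = pd.get_dummies(cars, columns=["transmission"], dtype=int)
# drop_first=True: the first category (in sorted order) becomes the implicit baseline
reduced = pd.get_dummies(cars, columns=["transmission"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With two categories the dropped column carries no extra information, since `transmission_Manual == 0` already implies an automatic. Tree-based models such as random forests work with either form.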

In [63]:
data.head()

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,name_Ambassador,name_Ashok,...,seller_type_Dealer,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Automatic,transmission_Manual,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,16.654804,1248.0,74.0,190.0,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
1,2014,370000,120000,15.046263,1498.0,103.52,250.0,5.0,0,0,...,0,1,0,0,1,0,0,1,0,0
2,2006,158000,140000,16.889313,1497.0,78.0,124.544455,5.0,0,0,...,0,1,0,0,1,0,0,0,0,1
3,2010,225000,127000,16.370107,1396.0,90.0,219.66896,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
4,2007,130000,120000,15.362595,1298.0,88.2,112.776475,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0


## Modeling and Evaluation

### Splitting into Training and Test Sets

In [64]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('selling_price', axis=1), data['selling_price'], test_size=0.2, random_state=100) 

In [65]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=100)

model.fit(X_train, y_train)
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

In [66]:
# The target is continuous, so evaluate with RMSE: the square root of the mean squared difference between actual and predicted values
from sklearn.metrics import mean_squared_error

print('train_rmse : ', mean_squared_error(y_train, train_pred) ** 0.5,
     'test_rmse : ', mean_squared_error(y_test, test_pred) ** 0.5)

train_rmse :  53578.71322834786 test_rmse :  132755.77987501275


In [67]:
np.sqrt(mean_squared_error(y_train, train_pred))

53578.71322834786

- For the regression metrics MAE, MSE, RMSE, MSLE, and RMSLE, lower values mean better performance, because a smaller value means the predictions are closer to the actual values. R², in contrast, is better when it is higher.
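
To make the metrics concrete, the sketch below computes MAE, MSE, RMSE, and R² by hand with NumPy. The actual and predicted prices are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only
y_true = np.array([450000.0, 370000.0, 158000.0, 225000.0])
y_pred = np.array([430000.0, 390000.0, 170000.0, 215000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))            # mean absolute error
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # same unit as the target
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```

RMSE is often preferred over MSE for reporting because it is expressed in the unit of the target (here, a price), while R² is unitless and bounded above by 1.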

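### K-Fold Cross-Validation
- The purpose of cross-validation is to evaluate a model's predictive power more reliably (cross-validity).
- Cross-validation trains and evaluates the model on several different training/test splits, which gives a more trustworthy estimate of its performance than a single split.
- Plain K-fold is commonly used for regression models and assumes the samples are independent and identically distributed.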

In [68]:
X = data.drop('selling_price', axis=1)
y = data['selling_price']

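- The **K-fold cross-validation procedure** is as follows.

 1. Split the full dataset into a training set and a test set.
 2. Divide the training set into k folds so that it can serve as both training set and validation set.
 3. Use the first fold as the validation set and the remaining folds as the training set.
 4. Train the model, then evaluate it on the first validation set.
 5. Repeat steps 3 and 4, using each following fold in turn as the validation set.
 6. This yields k performance results; their mean is taken as the performance of the model.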
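
The fold rotation in steps 2 to 5 can be sketched with plain NumPy. `kfold_indices` is a hypothetical helper written for this illustration (contiguous, unshuffled folds); the sample count 7906 and k=5 are example values.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Example values: 7906 samples split into 5 folds
splits = list(kfold_indices(7906, 5))
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx[0], val_idx[-1])
```

Each sample lands in a validation fold exactly once, so all k scores together cover the whole dataset. In practice `sklearn.model_selection.KFold` does this bookkeeping and also supports shuffling.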

In [69]:
from sklearn.model_selection import cross_val_score

# Model
model = RandomForestRegressor(random_state=100)

# Cross-validation
# Arguments: (model, features, target); cv is left at its default of 5 folds
scores = cross_val_score(model, X, y)

# Evaluation: for a regressor the default score is R², not accuracy
print('R² per fold : ', np.round(scores, 4))
print('Mean R² : ', np.round(np.mean(scores), 4))

R² per fold :  [0.9665 0.968  0.9796 0.9567 0.9672]
Mean R² :  0.9676
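
For a regressor, `cross_val_score` returns R² by default, so the values printed above are not directly comparable with the RMSE computed earlier. The sketch below shows how the `scoring` argument switches the metric. It uses synthetic data and a linear model purely to stay fast; both are stand-ins, not the car dataset or the random forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data and a linear model keep the sketch fast; both are stand-ins
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_model = LinearRegression()

# Default scoring for a regressor is R^2 (higher is better)
r2_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value
neg_rmse = cross_val_score(demo_model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_scores = -neg_rmse

print(np.round(r2_scores, 4), np.round(rmse_scores, 2))
```

Flipping the sign of the `neg_*` scores gives per-fold RMSE values that can be averaged and compared with a single train/test RMSE.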


### Cross-Validation Splitters

In [70]:
from sklearn.model_selection import KFold

# Model
rf = RandomForestRegressor(random_state=100)

# n_splits : number of folds
# shuffle : whether to shuffle the data before splitting into folds
# random_state : seed that makes the shuffled split reproducible
kfold = KFold(n_splits=6, shuffle=True, random_state=100)

scores = cross_val_score(rf, X, y, cv=kfold)

# Evaluation (R² for a regressor)
print('R² per fold : ', np.round(scores, 4))
print('Mean R² : ', np.round(np.mean(scores), 4))

R² per fold :  [0.9774 0.9666 0.976  0.9721 0.9574 0.9804]
Mean R² :  0.9717


### Automating Cross-Validation

In [71]:
# Inspect the current data
data

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,name_Ambassador,name_Ashok,...,seller_type_Dealer,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Automatic,transmission_Manual,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,16.654804,1248.0,74.00,190.000000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
1,2014,370000,120000,15.046263,1498.0,103.52,250.000000,5.0,0,0,...,0,1,0,0,1,0,0,1,0,0
2,2006,158000,140000,16.889313,1497.0,78.00,124.544455,5.0,0,0,...,0,1,0,0,1,0,0,0,0,1
3,2010,225000,127000,16.370107,1396.0,90.00,219.668960,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
4,2007,130000,120000,15.362595,1298.0,88.20,112.776475,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,2013,320000,110000,17.652672,1197.0,82.85,113.700000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
8124,2007,135000,119000,11.957295,1493.0,110.00,235.359600,5.0,0,0,...,0,1,0,0,1,0,1,0,0,0
8125,2009,382000,120000,13.736655,1248.0,73.90,190.000000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
8126,2013,290000,25000,16.775801,1396.0,70.00,140.000000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0


In [72]:
# reset_index
# By default, reset_index moves the old index into a column.
# Passing drop=True discards the old index instead.
data.reset_index(drop=True, inplace=True)

In [73]:
data

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,name_Ambassador,name_Ashok,...,seller_type_Dealer,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Automatic,transmission_Manual,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,16.654804,1248.0,74.00,190.000000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
1,2014,370000,120000,15.046263,1498.0,103.52,250.000000,5.0,0,0,...,0,1,0,0,1,0,0,1,0,0
2,2006,158000,140000,16.889313,1497.0,78.00,124.544455,5.0,0,0,...,0,1,0,0,1,0,0,0,0,1
3,2010,225000,127000,16.370107,1396.0,90.00,219.668960,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
4,2007,130000,120000,15.362595,1298.0,88.20,112.776475,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7901,2013,320000,110000,17.652672,1197.0,82.85,113.700000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
7902,2007,135000,119000,11.957295,1493.0,110.00,235.359600,5.0,0,0,...,0,1,0,0,1,0,1,0,0,0
7903,2009,382000,120000,13.736655,1248.0,73.90,190.000000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0
7904,2013,290000,25000,16.775801,1396.0,70.00,140.000000,5.0,0,0,...,0,1,0,0,1,1,0,0,0,0


In [74]:
kf = KFold(n_splits = 5)

X = data.drop('selling_price', axis=1)
y = data['selling_price']

In [75]:
for i, j in kf.split(X):
    print(i, j)

[1582 1583 1584 ... 7903 7904 7905] [   0    1    2 ... 1579 1580 1581]
[   0    1    2 ... 7903 7904 7905] [1582 1583 1584 ... 3160 3161 3162]
[   0    1    2 ... 7903 7904 7905] [3163 3164 3165 ... 4741 4742 4743]
[   0    1    2 ... 7903 7904 7905] [4744 4745 4746 ... 6322 6323 6324]
[   0    1    2 ... 6322 6323 6324] [6325 6326 6327 ... 7903 7904 7905]


In [76]:
train_rmse_total = []
test_rmse_total = []

for train_index, test_index in kf.split(X):
    # kf.split yields positional indices, so select rows with iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RandomForestRegressor(random_state=100)
    model.fit(X_train, y_train)

    train_pred = model.predict(X_train) # predictions on the training folds
    test_pred = model.predict(X_test)   # predictions on the held-out fold

    train_rmse = mean_squared_error(y_train, train_pred) ** 0.5
    test_rmse = mean_squared_error(y_test, test_pred) ** 0.5

    train_rmse_total.append(train_rmse)
    test_rmse_total.append(test_rmse)

In [77]:
train_rmse_total

[50919.64551039735,
 58245.94403552136,
 57325.85737174696,
 56262.584157672245,
 59018.944980074004]

In [78]:
test_rmse_total

[163132.33014922726,
 135974.88460685138,
 124878.36127822289,
 152535.97745702695,
 143432.20025599442]

In [79]:
# Final evaluation
print('Training-set RMSE : ', sum(train_rmse_total)/5)
print('Test-set RMSE : ', sum(test_rmse_total)/5)

Training-set RMSE :  56354.595211082386
Test-set RMSE :  143990.75074946458


- Results **before cross-validation** (single train/test split)
- train_rmse :  53578.71322834786 test_rmse :  132755.77987501275