# 자주쓰는 머신러닝 

현업에서 사용중인 머신러닝 알고리즘 Top10

1. 지도학습
  1) 선형회귀
  2) 로지스틱 회귀
  3) k-nn
  4) 나이브에이스
  5) 결정트리
  6) 랜덤포레스트
  7) XGBoost
  8) LightGBM

2. 비지도학습
  1) k-means
  2) PCA


3. 선택이유
  - 범용성
  - 속도
  - 예측력
  - 하이퍼파라미터 튜닝
  - 시각화
  - 해석력

# 머신러닝 필수 라이브러리

1. Numpy : Nummerical Python의 줄임
  - 파이썬 살술계산의 대표적인 라이브러리
  - 자료구조, 알고리즘 산술 데이터를 다루는 대부분의 과학 계산에 필수 라이브러리
  - ndarray 객체
  
2. Pandas : 구조화된 데이터나 표 형식의 데이터를 빠르고 쉽게 표현적으로 다루도록 설계된 고수준 자료구조
  - 데이터 과학에서 데이터를 처리하는 대표적인 라이브러리
  - 데이터 핸들링에 표준
  - Series 객체와 DataFrame 객체가 대표적인 자료구조
  
3. Matplotlib : 시각화 라이브러리
  - 그래프나 2차원 데이터를 시각화하는 파이썬 기반의 라이브러리
  
4. Seaborn : 시본, 다양한 시각화 종류를 제공하는 라이브러리
  - Matplotlib에 종속된 라이브러리

5. Scipy : 사이파이
  - 과학 계산 컴퓨팅 영역의 여러 기본 문제르 다루는 패키지 모음
  - scipy.stats : 가장 많이 사용되는 통계도구를 가지고 있는 라이브러리
  
6. scikit-learn : 머신러닝에 핵심 라이브러리
  - 분류 : SVM, 최근접이웃, 랜덤 포레스트, 로지스틱 회귀 등
  - 회귀 : 라쏘, 릿지 회귀 등
  - 클러스터링 : k-평균 등
  - 자원축소 : PCA, 특징 선택, 행렬 인수분해 등
  - 모델 선택 : 격자 탐색, 교차검증, 행렬
  - 전처리 : 특징 추출, 정규화 등
  
7. statsmodels : R언어용 회귀 분석 모델을 구현한 통계분석 패키지
  - 회귀모델 : 선형회귀
  - 분산분석(ANOVA)
  - 시계열분석 : AR,ARMA,ARIMA 등
  - 통계 모델 결과의 시각화 제공

# 앙상블 학습

In [3]:
import numpy as np
import pandas as pd

# 보팅분류기
from sklearn.ensemble import VotingClassifier
# 보팅용 학습알고리즘
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from warnings import filterwarnings
filterwarnings('ignore')

In [5]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

![nn](images/breast_meta.png)

In [6]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [7]:
df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
df.head(3)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758


In [8]:
logistic_regression = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors = 5)

voting_model = VotingClassifier(estimators = [('LogisticRegression', logistic_regression), ('KNN', knn)], voting = 'soft')

In [9]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                   test_size = 0.2, random_state = 156)

In [10]:
voting_model.fit(X_train, y_train)
pred = voting_model.predict(X_test)

In [11]:
print('보팅 분류기 정확도 : {:.3f}'.format(accuracy_score(y_test, pred)))

보팅 분류기 정확도 : 0.947


In [13]:
# 개별 모델의 학습과 예측 그리고 평가
classifier = [logistic_regression, knn]

for model in classifier:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    model_name = model.__class__.__name__
    print('{} 정확도 : {:.3f}'.format(model_name, accuracy_score(y_test, pred)))

LogisticRegression 정확도 : 0.939
KNeighborsClassifier 정확도 : 0.904


# 랜덤포레스트(RandomForest)

In [14]:
wine = pd.read_csv('https://raw.githubusercontent.com/rickiepark/hg-mldl/master/wine.csv')

In [15]:
data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine[['class']].to_numpy()

In [16]:
X_train, X_test, y_train, y_test = train_test_split(data, target,
                                                   test_size = 0.2, random_state = 42)

In [17]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

In [32]:
rf = RandomForestClassifier(oob_score = True,n_jobs = -1, random_state = 42)

scores = cross_validate(rf, X_train, y_train, return_train_score = True)
rf.fit(X_train, y_train)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))
print(rf.oob_score_)

0.9973541965122431 0.8905151032797809
0.8934000384837406


In [22]:
# 특성 중요도 : 결정트리는 어떤 특성이 가장 유용한지를 나타내는 특성 중요도를 계산해준다.
# df.feature_importances_    ## alcohol, sugar, pH 순으로 중요도 출력 => 트리 모델 예시

## 랜덤포레스트에서도 특성의 중요도를 파악하여야 한다

In [24]:
rf.fit(X_train, y_train)

print(rf.feature_importances_)

# 단일 트리모델의 중요도 : array([0.12345626, 0.86862934, 0.0079144 ])

[0.23167441 0.50039841 0.26792718]


In [29]:
## 부트스트랩을 과정에서 쓰이지 않은 데이터들이 존재함 => OOB(Out Of Bag)
# OOB : 부트스트랩 샘플에 포함되지 않고 남는 샘플
# 검증데이터의 역할로 활용가능(valid)

rf = RandomForestClassifier(oob_score = True, n_jobs = -1, random_state = 42)
rf.fit(X_train, y_train)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))
print(rf.oob_score_)

0.9973541965122431 0.8905151032797809
0.8934000384837406


# 실전예제 : 중고차 가격 예측

1. 알고리즘 : 랜덤포레스트(RandomForest)
2. 데이터 셋 : 해외 중고차 거래 데이터셋 이용
3. 데이터 셋의 소개 : 종속변수(), 독립변수()
 - 중고차 판매이력을 수집한 데이터 셋

4. 문제유형 : 회귀
5. 평가지표 : RMSE
6. 사용할 모델 RandomForestRegressor
7. 사용 라이브러리 :

In [210]:
data = pd.read_csv('https://media.githubusercontent.com/media/musthave-ML10/data_source/main/car.csv')

In [158]:
data.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0


1. feature 탐색
 - name : 차종
 - year : 연도
 - selling_price : 판매가
 - km_driven : 주행거리(km)
 - fuel : 연료
 - seller type : 판매자 유형
 - transmission : 변속기
 - owener : 소유자 이력
 - mileage : 연비(km)
 - torque : 회전력(타이어를 회전시키는 힘)
 - seats : 좌석수(인승)

In [159]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 825.6+ KB


## 전처리 : 텍스트 데이터

- split() : 문자열 분리

### engine

In [211]:
data['engine'].str.split()

0       [1248, CC]
1       [1498, CC]
2       [1497, CC]
3       [1396, CC]
4       [1298, CC]
           ...    
8123    [1197, CC]
8124    [1493, CC]
8125    [1248, CC]
8126    [1396, CC]
8127    [1396, CC]
Name: engine, Length: 8128, dtype: object

In [212]:
data['engine'].str.split(expand=True)

Unnamed: 0,0,1
0,1248,CC
1,1498,CC
2,1497,CC
3,1396,CC
4,1298,CC
...,...,...
8123,1197,CC
8124,1493,CC
8125,1248,CC
8126,1396,CC


In [213]:
data[['engine', 'engine_unit']] = data['engine'].str.split(expand=True)

In [214]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
 13  engine_unit    7907 non-null   object 
dtypes: float64(1), int64(3), object(10)
memory usage: 889.1+ KB


In [215]:
data['engine'] = data['engine'].astype('float')

In [216]:
data['engine_unit'].unique()

array(['CC', nan], dtype=object)

In [217]:
data['engine_unit'].value_counts()

CC    7907
Name: engine_unit, dtype: int64

In [218]:
8128 - 7907

221

In [219]:
# 변수제거
data.drop('engine_unit', axis = 1 , inplace = True)

### mileage

In [220]:
data[['mileage', 'mileage_unit']] = data['mileage'].str.split(expand = True)

In [221]:
data['mileage'] = data['mileage'].astype(float)

In [222]:
data['mileage_unit'].unique()

array(['kmpl', 'km/kg', nan], dtype=object)

In [223]:
data['mileage']

0       23.40
1       21.14
2       17.70
3       23.00
4       16.10
        ...  
8123    18.50
8124    16.80
8125    19.30
8126    23.57
8127    23.57
Name: mileage, Length: 8128, dtype: float64

In [224]:
data['fuel'].unique()

array(['Diesel', 'Petrol', 'LPG', 'CNG'], dtype=object)

In [225]:
data.loc[data['fuel'] == 'LPG', ['mileage_unit','fuel']]

Unnamed: 0,mileage_unit,fuel
6,km/kg,LPG
90,km/kg,LPG
870,km/kg,LPG
1511,km/kg,LPG
1658,km/kg,LPG
1907,km/kg,LPG
2108,km/kg,LPG
2166,km/kg,LPG
2466,,LPG
2484,km/kg,LPG


- 연료종류가 4종류
- 다른 종류의 연료로 거행거리를 비교하려면 같은 기준을 세워야 한다.
- 연료 가격을 활용하면 어떨까? 1달러당 몇 km를 주행할 수 있는지 알아보자
- 연료의 단위를 liter로 바꾸면 어떨까?
- 2022년 시점의 가격
    - Disel
    - Petrol
    - LPG
    - CNG

In [226]:
def mile(x):
    if x['fuel'] == 'Petrol':
        return x['mileage'] / 1.048
    elif x['fuel'] == 'Disel':
        return x['mileage'] / 1.405
    elif x['fuel'] == 'LPG':
        return x['mileage'] / 3.54
    else:
        return x['mileage'] / 2.76

In [227]:
data['mileage'] = data.apply(mile, axis = 1)

In [228]:
data.drop('mileage_unit', axis = 1, inplace = True)

### torque
 - 앞 부분의 숫자만 추출해서 숫자형
 - 단위 스케일(Nm)

In [229]:
data['torque'] = data['torque'].str.upper()

In [230]:
data['torque'].head()

0              190NM@ 2000RPM
1         250NM@ 1500-2500RPM
2       12.7@ 2,700(KGM@ RPM)
3    22.4 KGM AT 1750-2750RPM
4       11.5@ 4,500(KGM@ RPM)
Name: torque, dtype: object

In [244]:
data['torque'].isna().value_counts()

False    7906
True      222
Name: torque, dtype: int64

In [259]:
data[data['torque'].isna()&data['mileage'].isna()&data['seats'].isna()&data['max_power'].isna()&data['engine'].isna()]

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
13,Maruti Swift 1.3 VXi,2007,200000,80000,Petrol,Individual,Manual,Second Owner,,,,,
31,Fiat Palio 1.2 ELX,2003,70000,50000,Petrol,Individual,Manual,Second Owner,,,,,
78,Tata Indica DLS,2003,50000,70000,Diesel,Individual,Manual,First Owner,,,,,
87,Maruti Swift VDI BSIV W ABS,2015,475000,78000,Diesel,Dealer,Manual,First Owner,,,,,
119,Maruti Swift VDI BSIV,2010,300000,120000,Diesel,Individual,Manual,Second Owner,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7846,Toyota Qualis Fleet A3,2000,200000,100000,Diesel,Individual,Manual,First Owner,,,,,
7996,Hyundai Santro LS zipPlus,2000,140000,50000,Petrol,Individual,Manual,Second Owner,,,,,
8009,Hyundai Santro Xing XS eRLX Euro III,2006,145000,80000,Petrol,Individual,Manual,Second Owner,,,,,
8068,Ford Figo Aspire Facelift,2017,580000,165000,Diesel,Individual,Manual,First Owner,,,,,


In [254]:
data[data['torque'].isna() == False]

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,8.478261,1248.0,74 bhp,190NM@ 2000RPM,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,7.659420,1498.0,103.52 bhp,250NM@ 1500-2500RPM,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,16.889313,1497.0,78 bhp,"12.7@ 2,700(KGM@ RPM)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,8.333333,1396.0,90 bhp,22.4 KGM AT 1750-2750RPM,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,15.362595,1298.0,88.2 bhp,"11.5@ 4,500(KGM@ RPM)",5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai i20 Magna,2013,320000,110000,Petrol,Individual,Manual,First Owner,17.652672,1197.0,82.85 bhp,113.7NM@ 4000RPM,5.0
8124,Hyundai Verna CRDi SX,2007,135000,119000,Diesel,Individual,Manual,Fourth & Above Owner,6.086957,1493.0,110 bhp,"24@ 1,900-2,750(KGM@ RPM)",5.0
8125,Maruti Swift Dzire ZDi,2009,382000,120000,Diesel,Individual,Manual,First Owner,6.992754,1248.0,73.9 bhp,190NM@ 2000RPM,5.0
8126,Tata Indigo CR4,2013,290000,25000,Diesel,Individual,Manual,First Owner,8.539855,1396.0,70 bhp,140NM@ 1800-3000RPM,5.0


In [None]:
# 결측값은 'NA'로 대체

In [250]:
data['torque'] = data[data['torque'].isna()]

0       False
1       False
2       False
3       False
4       False
        ...  
8123    False
8124    False
8125    False
8126    False
8127    False
Name: torque, Length: 8128, dtype: bool

In [263]:
def torque_unit(x):
    if 'NM' in str(x):
        return 'NM'
    elif 'KGM' in str(x):
        return 'KGM'

In [264]:
data['torque_unit'] = data['torque'].apply(torque_unit)

In [265]:
data['torque_unit'].unique()

array(['NM', 'KGM', None], dtype=object)

In [266]:
data['torque_unit'].value_counts()

NM     7390
KGM     504
Name: torque_unit, dtype: int64

In [267]:
data[data['torque_unit'].isna()]

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,torque_unit
13,Maruti Swift 1.3 VXi,2007,200000,80000,Petrol,Individual,Manual,Second Owner,,,,,,
31,Fiat Palio 1.2 ELX,2003,70000,50000,Petrol,Individual,Manual,Second Owner,,,,,,
78,Tata Indica DLS,2003,50000,70000,Diesel,Individual,Manual,First Owner,,,,,,
87,Maruti Swift VDI BSIV W ABS,2015,475000,78000,Diesel,Dealer,Manual,First Owner,,,,,,
119,Maruti Swift VDI BSIV,2010,300000,120000,Diesel,Individual,Manual,Second Owner,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7846,Toyota Qualis Fleet A3,2000,200000,100000,Diesel,Individual,Manual,First Owner,,,,,,
7996,Hyundai Santro LS zipPlus,2000,140000,50000,Petrol,Individual,Manual,Second Owner,,,,,,
8009,Hyundai Santro Xing XS eRLX Euro III,2006,145000,80000,Petrol,Individual,Manual,Second Owner,,,,,,
8068,Ford Figo Aspire Facelift,2017,580000,165000,Diesel,Individual,Manual,First Owner,,,,,,


In [268]:
data[data['torque_unit'].isna()]['torque'].unique()

array([nan, '250@ 1250-5000RPM', '510@ 1600-2400', '110(11.2)@ 4800',
       '210 / 1900'], dtype=object)

In [315]:
data['torque_unit'].value_counts()

NM     7402
KGM     504
Name: torque_unit, dtype: int64

In [316]:
data['torque_unit'].fillna('NM', inplace = True)

In [317]:
data['torque_unit'].value_counts()

NM     7402
KGM     504
Name: torque_unit, dtype: int64

In [272]:
def split_num(x):
    x = str(x)
    for i, j in enumerate(x):
        if j not in '0123456789.':
            cut = i
            break
    return x[:cut]

In [273]:
data['torque'] = data['torque'].apply(split_num)

In [274]:
data['torque'].head()

0     190
1     250
2    12.7
3    22.4
4    11.5
Name: torque, dtype: object

In [275]:
data[data['torque'] == '']

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,torque_unit
13,Maruti Swift 1.3 VXi,2007,200000,80000,Petrol,Individual,Manual,Second Owner,,,,,,NM
31,Fiat Palio 1.2 ELX,2003,70000,50000,Petrol,Individual,Manual,Second Owner,,,,,,NM
78,Tata Indica DLS,2003,50000,70000,Diesel,Individual,Manual,First Owner,,,,,,NM
87,Maruti Swift VDI BSIV W ABS,2015,475000,78000,Diesel,Dealer,Manual,First Owner,,,,,,NM
119,Maruti Swift VDI BSIV,2010,300000,120000,Diesel,Individual,Manual,Second Owner,,,,,,NM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7846,Toyota Qualis Fleet A3,2000,200000,100000,Diesel,Individual,Manual,First Owner,,,,,,NM
7996,Hyundai Santro LS zipPlus,2000,140000,50000,Petrol,Individual,Manual,Second Owner,,,,,,NM
8009,Hyundai Santro Xing XS eRLX Euro III,2006,145000,80000,Petrol,Individual,Manual,Second Owner,,,,,,NM
8068,Ford Figo Aspire Facelift,2017,580000,165000,Diesel,Individual,Manual,First Owner,,,,,,NM


In [276]:
data['torque'] = data['torque'].replace('', np.NaN)

In [277]:
data['torque'] = data['torque'].astype('float')

In [284]:
def trans_nm(x):
    if x['torque_unit'] == 'KGM':
        return x['torque'] * 9.80665
    else:
        return x['torque']

In [285]:
data['torque'] = data.apply(trans_nm, axis = 1)

In [None]:
# data

In [286]:
data.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,torque_unit
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,8.478261,1248.0,74 bhp,190.0,5.0,NM
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,7.65942,1498.0,103.52 bhp,250.0,5.0,NM
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,16.889313,1497.0,78 bhp,124.544455,5.0,KGM
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,8.333333,1396.0,90 bhp,219.66896,5.0,KGM
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,15.362595,1298.0,88.2 bhp,112.776475,5.0,KGM


### max_power

In [287]:
data[['max_power', 'max_power_unit']] = data['max_power'].str.split(expand = True)

In [288]:
data['max_power'].head()

0        74
1    103.52
2        78
3        90
4      88.2
Name: max_power, dtype: object

In [294]:
data[data['max_power'] == 'bhp']

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,torque_unit,max_power_unit
4933,Maruti Omni CNG,2000,80000,100000,CNG,Individual,Manual,Second Owner,3.949275,796.0,bhp,,8.0,NM,


In [None]:
def isFload(x):
    try:
        num = float(x)
        return num
    except ValueError:
        return np.NaN

In [None]:
data['max_power'] = data['max_power'].apply(isFloat)

In [295]:
data['max_power_unit'].unique()

array(['bhp', nan, None], dtype=object)

In [296]:
data.drop('max_power_unit', axis = 1, inplace = True)

### name
- 자동차 브랜드와 모델명이 있다.

In [298]:
data['name'] = data['name'].str.split(expand = True)[0]

In [299]:
data['name'].unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun', 'Jeep',
       'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen', 'BMW',
       'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo',
       'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu', 'Opel', 'Peugeot'],
      dtype=object)

## 전처리 : 결측치와 더미 변수 변환

In [302]:
data.isna().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
torque_unit        0
dtype: int64

In [303]:
data.dropna(inplace = True)

In [307]:
len(data)

7906

## 전처리 : 결측치와 더미 변수

In [310]:
data.keys()

Index(['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner', 'mileage', 'engine', 'max_power', 'torque',
       'seats', 'torque_unit'],
      dtype='object')

In [312]:
data = pd.get_dummies(data, columns = ['name', 'fuel', 'seller_type',
                                      'transmission', 'owner'])

- One-hot encoding : 범주형 데이터의 각 범주를 1 아니면 0으로 채우는 인코딩 기법
- pd.get_dummies() 함수의 drop_first : 첫번째 카테고리 값은 사용하지 않음

In [314]:
data

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,torque_unit,name_Ambassador,...,seller_type_Dealer,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Automatic,transmission_Manual,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,8.478261,1248.0,74,190.000000,5.0,NM,0,...,0,1,0,0,1,1,0,0,0,0
1,2014,370000,120000,7.659420,1498.0,103.52,250.000000,5.0,NM,0,...,0,1,0,0,1,0,0,1,0,0
2,2006,158000,140000,16.889313,1497.0,78,124.544455,5.0,KGM,0,...,0,1,0,0,1,0,0,0,0,1
3,2010,225000,127000,8.333333,1396.0,90,219.668960,5.0,KGM,0,...,0,1,0,0,1,1,0,0,0,0
4,2007,130000,120000,15.362595,1298.0,88.2,112.776475,5.0,KGM,0,...,0,1,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,2013,320000,110000,17.652672,1197.0,82.85,113.700000,5.0,NM,0,...,0,1,0,0,1,1,0,0,0,0
8124,2007,135000,119000,6.086957,1493.0,110,235.359600,5.0,KGM,0,...,0,1,0,0,1,0,1,0,0,0
8125,2009,382000,120000,6.992754,1248.0,73.9,190.000000,5.0,NM,0,...,0,1,0,0,1,1,0,0,0,0
8126,2013,290000,25000,8.539855,1396.0,70,140.000000,5.0,NM,0,...,0,1,0,0,1,1,0,0,0,0
