# 8.0 Cars Dataset - Comparing the Accuracy of Different ML Algorithms

<div class="alert alert-block alert-info">
<b>General methodology for selecting and testing the best model</b>

To make everything clear, here is a concrete plan (structure) for training and analyzing the models:

1. **Model training.** Consider at least one simple model and one boosting model. Tune the hyperparameters of at least one model. There are two options:
    - no validation set. Tune the hyperparameters with cross-validation (GridSearchCV, RandomizedSearchCV, or manually with cross_val_score);
    - a validation set is available. Cross-validation can be skipped and the hyperparameters tuned manually.  
    
    
    
2. **Model analysis.** Once the best hyperparameters are found, measure the training time, the prediction time, and RMSE. Again, there are two options:
    - ***no validation set***: RMSE from cross-validation. Training time = time of model.fit(X_train). Prediction time = time of model.predict(X_train);
    
    - ***a validation set is available***: RMSE on the validation set. Training time = time of model.fit(X_train). Prediction time = time of model.predict(X_valid).  
    
    Then draw a conclusion from the analysis and recommend one model to the customer based on their criteria.
    
    
    
3. **Testing.** Compute the final metric of the best model on the test set (the test set must not be used anywhere before this!). RMSE must be below 2500. If the metric falls short, you can address my comments and also tune the hyperparameters further (at the model training stage, not on the test set!)
</div>

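The plan above could be sketched roughly as follows. This is a minimal illustration and not part of the project: the synthetic data from `make_regression`, the `Ridge` model, and the `alpha` grid are assumptions chosen only to keep the example self-contained.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (assumption: stands in for a real dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: tune hyperparameters with cross-validation on the training set only.
param_grid = {"alpha": [0.1, 1.0, 10.0]}  # hypothetical grid
search = GridSearchCV(
    Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_train, y_train)
cv_rmse = -search.best_score_  # the scorer returns negated RMSE

# Step 2: measure training and prediction time of the best configuration.
best_model = Ridge(**search.best_params_)
start = time.perf_counter()
best_model.fit(X_train, y_train)
fit_time = time.perf_counter() - start

start = time.perf_counter()
best_model.predict(X_train)
predict_time = time.perf_counter() - start

# Step 3: use the test set exactly once, for the final metric.
test_rmse = float(np.sqrt(mean_squared_error(y_test, best_model.predict(X_test))))

print(f"CV RMSE: {cv_rmse:.2f}, fit: {fit_time:.4f}s, predict: {predict_time:.4f}s")
print(f"Test RMSE: {test_rmse:.2f}")
```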
In [500]:
import pandas as pd
import warnings
warnings.simplefilter("ignore", category=RuntimeWarning)
pd.options.mode.chained_assignment = None

from sklearn import tree, metrics, ensemble, neural_network, naive_bayes, neighbors
# ensemble for RF
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.svm import SVC

import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

In [342]:
data = pd.read_csv('/Users/yuliabezginova/PycharmProjects/deep_learning/cars.csv')

In [343]:
data.head()

Unnamed: 0,mpg,cylinders,cubicinches,hp,weightlbs,time-to-60,year,brand
0,14.0,8,350,165,4209,12,1972,US.
1,31.9,4,89,71,1925,14,1980,Europe.
2,17.0,8,302,140,3449,11,1971,US.
3,15.0,8,400,150,3761,10,1971,US.
4,30.5,4,98,63,2051,17,1978,US.


In [391]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261 entries, 0 to 260
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mpg          261 non-null    float64
 1   cylinders    261 non-null    int64  
 2   cubicinches  261 non-null    object 
 3   hp           261 non-null    int64  
 4   weightlbs    261 non-null    object 
 5   time-to-60   261 non-null    int64  
 6   year         261 non-null    int64  
 7   brand        261 non-null    object 
dtypes: float64(1), int64(4), object(3)
memory usage: 16.4+ KB


In [392]:
data.columns

Index(['mpg', 'cylinders', 'cubicinches', 'hp', 'weightlbs', 'time-to-60',
       'year', 'brand'],
      dtype='object')

In [393]:
data.columns = data.columns.str.strip()

In [394]:
data.head()

Unnamed: 0,mpg,cylinders,cubicinches,hp,weightlbs,time-to-60,year,brand
0,14.0,8,350,165,4209,12,1972,US.
1,31.9,4,89,71,1925,14,1980,Europe.
2,17.0,8,302,140,3449,11,1971,US.
3,15.0,8,400,150,3761,10,1971,US.
4,30.5,4,98,63,2051,17,1978,US.


In [395]:
# share of missing values per column, in percent
pd.DataFrame(round(data.isna().mean()*100)).style.background_gradient('coolwarm')

Unnamed: 0,0
mpg,0.0
cylinders,0.0
cubicinches,0.0
hp,0.0
weightlbs,0.0
time-to-60,0.0
year,0.0
brand,0.0


In [396]:
data.columns

Index(['mpg', 'cylinders', 'cubicinches', 'hp', 'weightlbs', 'time-to-60',
       'year', 'brand'],
      dtype='object')

In [397]:
# check for duplicates within each feature
check_col = ['mpg', 'cylinders', 'cubicinches', 'hp', 'weightlbs', 'time-to-60',
       'year', 'brand']
for col in check_col:
    print(col, pd.Series(data[col].unique()).duplicated().sum())

mpg 0
cylinders 0
cubicinches 0
hp 0
weightlbs 0
time-to-60 0
year 0
brand 0


In [398]:
# check for duplicate rows
data.duplicated().sum()

0

In [399]:
# creating a separate dataset for target
target = data['brand']

In [400]:
# excluding the target variable and the 'year' column from the features
features = data.drop(columns=['brand', 'year'])

In [401]:
# splitting datasets
# splitting should be done BEFORE encoding in order to avoid a data leakage
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
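A split like this can be made reproducible and class-balanced with the `random_state` and `stratify` parameters of `train_test_split`. The sketch below uses a small toy frame with made-up values (an assumption, not the cars data) to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values): 6 rows of one class, 4 of another.
toy = pd.DataFrame({
    "hp": [165, 71, 140, 150, 63, 90, 88, 110, 95, 70],
    "brand": ["US."] * 6 + ["Europe."] * 4,
})

# stratify keeps the class proportions equal in both parts;
# random_state makes the shuffle repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["hp"]], toy["brand"],
    test_size=0.5, stratify=toy["brand"], random_state=42,
)

# Repeating the call with the same seed yields the same rows.
X_tr2, _, _, _ = train_test_split(
    toy[["hp"]], toy["brand"],
    test_size=0.5, stratify=toy["brand"], random_state=42,
)

print(y_te.value_counts().to_dict())
```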

In [402]:
print('Train features sample size:', X_train.shape[0])
print('Test features sample size:', X_test.shape[0])
print('Train target sample size:', y_train.shape[0])
print('Test target sample size:', y_test.shape[0])

Train features sample size: 208
Test features sample size: 53
Train target sample size: 208
Test target sample size: 53


In [403]:
y_train = y_train.to_frame()

In [404]:
y_test = y_test.to_frame()

In [405]:
target_col = LabelEncoder()

In [406]:
y_train['brand_numeric'] = target_col.fit_transform(y_train['brand'])

In [407]:
# reuse the encoder fitted on the training labels; do not refit on test data
y_test['brand_numeric'] = target_col.transform(y_test['brand'])

In [408]:
y_train = y_train.drop(['brand'], axis='columns')

In [409]:
y_test = y_test.drop(['brand'], axis='columns')
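As a small illustration of label encoding, the sketch below fits one `LabelEncoder` on training labels and reuses it for test labels, so both share the same label-to-integer mapping. The label lists are toy values; in particular `"Japan."` is a hypothetical third class used only for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption: "Japan." is a made-up third class).
train_labels = ["US.", "Europe.", "Japan.", "US."]
test_labels = ["Japan.", "US."]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)  # learns the mapping
test_codes = encoder.transform(test_labels)        # applies the same mapping

# classes_ holds the sorted labels; the position of each is its integer code.
print(list(encoder.classes_))
print(list(train_codes), list(test_codes))

# inverse_transform maps predicted integers back to readable labels.
decoded = encoder.inverse_transform([0, 1, 2])
```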

In [410]:
X_train.columns

Index(['mpg', 'cylinders', 'cubicinches', 'hp', 'weightlbs', 'time-to-60'], dtype='object')

In [412]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208 entries, 212 to 108
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mpg          208 non-null    float64
 1   cylinders    208 non-null    int64  
 2   cubicinches  208 non-null    object 
 3   hp           208 non-null    int64  
 4   weightlbs    208 non-null    object 
 5   time-to-60   208 non-null    int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 11.4+ KB


In [413]:
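# converting object columns to numeric: unparsable values become NaN (errors='coerce'),
# and all values are rounded down with np.floor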
X_train['cubicinches'] = np.floor(pd.to_numeric(X_train['cubicinches'], errors='coerce')).astype('float64')
X_train['weightlbs'] = np.floor(pd.to_numeric(X_train['weightlbs'], errors='coerce')).astype('float64')
X_train['mpg'] = np.floor(pd.to_numeric(X_train['mpg'], errors='coerce')).astype('float64')

X_test['cubicinches'] = np.floor(pd.to_numeric(X_test['cubicinches'], errors='coerce')).astype('float64')
X_test['weightlbs'] = np.floor(pd.to_numeric(X_test['weightlbs'], errors='coerce')).astype('float64')
X_test['mpg'] = np.floor(pd.to_numeric(X_test['mpg'], errors='coerce')).astype('float64')
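To see why missing values can appear after such a conversion, here is a minimal sketch of `pd.to_numeric` with `errors='coerce'`. The series and the `"unknown"` entry are made up for illustration.

```python
import pandas as pd

# Toy column stored as strings; "unknown" cannot be parsed as a number.
raw = pd.Series(["350", "unknown", "302"])

# errors="coerce" turns unparsable entries into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())

print(converted.tolist())
print("missing after conversion:", n_missing)
```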

In [414]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208 entries, 212 to 108
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mpg          208 non-null    float64
 1   cylinders    208 non-null    int64  
 2   cubicinches  208 non-null    float64
 3   hp           208 non-null    int64  
 4   weightlbs    205 non-null    float64
 5   time-to-60   208 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 11.4 KB


In [415]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208 entries, 212 to 108
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   brand_numeric  208 non-null    int64
dtypes: int64(1)
memory usage: 3.2 KB


In [416]:
y_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 66 to 201
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   brand_numeric  53 non-null     int64
dtypes: int64(1)
memory usage: 848.0 bytes


In [417]:
np.any(np.isnan(X_train))

True

In [418]:
np.all(np.isfinite(X_train))

False

In [419]:
np.any(np.isnan(X_test))

True

In [420]:
np.all(np.isfinite(X_test))

False

In [421]:
np.any(np.isnan(y_train))

False

In [422]:
np.any(np.isnan(y_test))

False

In [423]:
X_train.reset_index()

Unnamed: 0,index,mpg,cylinders,cubicinches,hp,weightlbs,time-to-60
0,212,39.0,4,79.0,58,1755.0,17
1,57,33.0,4,98.0,83,2075.0,16
2,48,29.0,4,97.0,78,1940.0,15
3,241,18.0,6,225.0,105,3121.0,17
4,168,20.0,6,156.0,122,2807.0,14
...,...,...,...,...,...,...,...
203,22,24.0,6,200.0,81,3012.0,18
204,1,31.0,4,89.0,71,1925.0,14
205,134,15.0,8,350.0,145,4440.0,14
206,239,23.0,4,122.0,86,2220.0,14


In [424]:
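# replace the NaNs produced by the numeric conversion with 0 and convert to NumPy arrays
X_train = np.nan_to_num(X_train)
X_test = np.nan_to_num(X_test)

# flatten the targets to 1-D arrays, the shape scikit-learn estimators expect for y
y_train = np.nan_to_num(y_train).ravel()
y_test = np.nan_to_num(y_test).ravel()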

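`np.nan_to_num` replaces each NaN with 0, which for a feature such as vehicle weight is far from any realistic value. One possible alternative, shown here only as a sketch on made-up numbers, is median imputation with scikit-learn's `SimpleImputer`, fitted on the training data and then applied to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature arrays with missing entries (hypothetical values).
X_tr = np.array([[3000.0], [np.nan], [2000.0], [4000.0]])
X_te = np.array([[np.nan], [2500.0]])

# The median is learned from the training data only, then reused for test data.
imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)
X_te_filled = imputer.transform(X_te)

print(X_tr_filled.ravel(), X_te_filled.ravel())
```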
### 1) Initiating and building Decision Tree model

In [411]:
dt = tree.DecisionTreeClassifier()

In [425]:
dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [429]:
y_pred_dt = dt.predict(X_test)

In [430]:
print(y_pred_dt)

[0 1 2 2 2 0 0 2 2 2 0 2 1 2 0 1 2 1 2 2 2 2 2 2 0 2 2 0 2 1 2 2 0 2 0 2 2
 2 2 1 2 2 2 1 2 1 2 2 2 0 2 0 2]


### 2) Initiating and building Random Forest Classifier model

In [468]:
rf = ensemble.RandomForestClassifier()

In [469]:
rf.fit(X_train, y_train)


RandomForestClassifier()

In [470]:
y_pred_rf = rf.predict(X_test)

### 3) Initiating and building Neural Network model

In [476]:
# use a separate name so the sklearn.neural_network module is not shadowed
mlp = neural_network.MLPClassifier()
mlp.fit(X_train, y_train)
y_pred_nnw = mlp.predict(X_test)


### 4) Initiating and building Naive Bayes model

In [479]:
nb = naive_bayes.GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)


### 5) Initiating and building k-neighbors model

In [501]:
kneighbors = neighbors.KNeighborsClassifier(n_neighbors=2)
kneighbors.fit(X_train, y_train)
y_pred_kneighbors = kneighbors.predict(X_test)


### 6) Initiating and building SVM model

In [502]:
svmalg = svm.SVC()
svmalg.fit(X_train, y_train)
y_pred_svm = svmalg.predict(X_test)


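Distance- and gradient-based models such as SVM, k-neighbors, and MLP are sensitive to feature scale, and the features here range from single digits (cylinders) to thousands (weight). The sketch below shows one way to combine `StandardScaler` with `SVC` through `make_pipeline`. The data comes from `make_classification` and is an assumption for illustration, so the printed scores say nothing about the cars dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data (assumption); one feature is blown up to mimic
# a column measured in much larger units than the others.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler is fitted on the training data inside the pipeline,
# so no statistics from the test set leak into preprocessing.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)
scaled_accuracy = scaled_svm.score(X_te, y_te)

# Same model without scaling, for comparison.
raw_svm = SVC().fit(X_tr, y_tr)
raw_accuracy = raw_svm.score(X_te, y_te)

print(f"scaled: {scaled_accuracy:.2f}, unscaled: {raw_accuracy:.2f}")
```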
## Comparing algorithms

In [503]:
print("Accuracy of Decision Tree is {}%".format(round(metrics.accuracy_score(y_test, y_pred_dt)*100)))
print("Accuracy of Random Forest is {}%".format(round(metrics.accuracy_score(y_test, y_pred_rf)*100)))
print("Accuracy of Neural Network is {}%".format(round(metrics.accuracy_score(y_test, y_pred_nnw)*100)))
print("Accuracy of Naive Bayes is {}%".format(round(metrics.accuracy_score(y_test, y_pred_nb)*100)))
print("Accuracy of K-Neighbors is {}%".format(round(metrics.accuracy_score(y_test, y_pred_kneighbors)*100)))
print("Accuracy of SVM algorithm is {}%".format(round(metrics.accuracy_score(y_test, y_pred_svm)*100)))

Accuracy of Decision Tree is 79%
Accuracy of Random Forest is 83%
Accuracy of Neural Network is 66%
Accuracy of Naive Bayes is 68%
Accuracy of K-Neighbors is 64%
Accuracy of SVM algorithm is 68%

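With only 53 test rows, a single split gives a fairly noisy ranking. As a sketch of a steadier comparison, the loop below scores several classifiers with 5-fold cross-validation. The synthetic data and the choice of three models are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (assumption: stands in for the prepared features).
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over 5 folds for each model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in sorted(cv_accuracy.items(), key=lambda item: -item[1]):
    print(f"Accuracy of {name} is {acc:.1%}")
```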

## Conclusion:
**On this dataset and this train/test split, Random Forest reached the highest test accuracy (83%), followed by Decision Tree (79%); the other four models scored between 64% and 68%.**

***Thank you for going through this project. Your comments are more than welcome at ybezginova2021@gmail.com***

***Best wishes,***

***Yulia***