Import the required modules.

In [1]:
import numpy as np
import pandas as pd

Read the dataset with the pandas **read_csv** function and look at the first few rows.

In [2]:
cars =  pd.read_csv("cleaned_data.csv")
cars.head()

,Name,style,Exterior color,interior color,Engine,drive type,Fuel Type,Transmission,Mileage,mpg city,mpg highway,price,Year,Engine V,Brand
0,Titan,Pickup Truck,Deep Blue Pearl,Black,V-8 Gas,4WD,Gas,Automatic,82230,15,21,35620,2018,5.6,Nissan
1,Civic,Hatchback,Sonic Gray Pearl,Unknown,Inline-4 Gas Turbocharged,FWD,Gas,Automatic,24282,31,40,24999,2020,1.5,Honda
2,Charger,Sedan,Indigo Blue,Brazen Gold/Black,V-8 Gas,RWD,Gas,Automatic,19468,16,25,41999,2018,5.7,Dodge
3,F-150,Pickup Truck,Shadow Black,Medium Earth Gray,V-6 Gas Turbocharged,4WD,Gas,Automatic,195205,18,23,20995,2018,2.7,Ford
4,Altima,Sedan,White,Black,Inline-4 Gas,FWD,Gas,Automatic,92366,27,38,10995,2015,2.5,Nissan


List the column names.

In [3]:
cars.columns

Index(['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'price', 'Year', 'Engine V', 'Brand'],
      dtype='object')

Now we create the feature matrix X and the target vector Y for the steps described in the first part.

In [4]:
X =  cars[['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'Year', 'Engine V', 'Brand']]


Y = cars["price"].values

#### Encode categorical features as a one-hot numeric array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the `sparse_output` parameter, named `sparse` before scikit-learn 1.2).<br><br>

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">read full documentation</a>
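To make the encoding scheme concrete, here is a minimal sketch on a toy frame. The values are hypothetical and are not taken from cleaned_data.csv; only the column names are borrowed from the dataset. It also shows what `handle_unknown="ignore"` does with a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with hypothetical values (not rows from cleaned_data.csv)
toy = pd.DataFrame({
    "drive type": ["FWD", "RWD", "4WD", "FWD"],
    "Fuel Type": ["Gas", "Gas", "Diesel", "Gas"],
})

encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
encoded = encoder.fit_transform(toy).toarray()

# One binary column per distinct category: 3 drive types + 2 fuel types
print(encoder.categories_)
print(encoded.shape)

# "AWD" was never seen during fit, so its three drive-type columns are all zero
unseen = pd.DataFrame({"drive type": ["AWD"], "Fuel Type": ["Gas"]})
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded)
```

The categories of each feature are sorted, so the column order of the encoded array follows `encoder.categories_`.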

In [5]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(categories="auto", handle_unknown="ignore")

categorical_features = onehot.fit_transform(X.iloc[:, [1,4,5,6,7,13]]).toarray()
print(categorical_features)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [6]:
print(categorical_features.shape)

(6532, 91)


In this part we drop the features we will not use (Name, Exterior color, interior color) together with the categorical features that were encoded in the previous step, so that only the five numeric columns remain.

In [7]:
X = np.delete(X.values, [0,1,2,3,4,5,6,7,13], 1)
print(X.shape)

(6532, 5)


Now we combine the remaining numeric features with the encoded array:

In [8]:
X = np.concatenate((X,categorical_features), axis=1)
X.shape

(6532, 96)
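The encode, delete and concatenate steps above can also be expressed in one object with scikit-learn's `ColumnTransformer`. This is an alternative sketch, not what this notebook does: the toy frame and its values are hypothetical, and the output column order differs (encoded columns come first, passthrough columns last).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with hypothetical values; column names mirror a few of the dataset's
toy = pd.DataFrame({
    "style": ["Sedan", "Hatchback", "Sedan"],
    "Brand": ["Nissan", "Honda", "Dodge"],
    "Mileage": [82230, 24282, 19468],
    "Year": [2018, 2020, 2018],
})

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["style", "Brand"]),
    ],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

features = preprocess.fit_transform(toy)

# 2 style categories + 3 brand categories + 2 numeric columns
print(features.shape)
```

Selecting columns by name rather than by position makes the step less sensitive to changes in column order.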

#### Split arrays or matrices into train and test subsets.
Quick utility that wraps input validation, `next(ShuffleSplit().split(X, y))`, and application to the input data into a single call for splitting (and optionally subsampling) data in a one-liner.<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">read full documentation</a>

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, Y,
    test_size=0.1,
    random_state=82,
    shuffle=True
)
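As a quick check of what these arguments do, the sketch below runs the same call on small placeholder arrays (hypothetical data, 50 samples). With `test_size=0.1`, 5 samples go to the test set and 45 to the training set, rows of X and Y stay aligned, and a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: row i of X_demo starts with 3*i, and y_demo[i] == i
X_demo = np.arange(150).reshape(50, 3)
y_demo = np.arange(50)

xa, xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=82, shuffle=True
)
print(xa.shape, xb.shape)

# Features and targets are shuffled together, so the pairing is preserved
print(np.array_equal(xb[:, 0] // 3, yb))

# The same random_state yields the same split on a second call
_, xb_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=82, shuffle=True
)
print(np.array_equal(xb, xb_again))
```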

### Random Forest Regressor : 

A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">read full documentation</a>
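
The averaging can be verified directly. The following sketch uses synthetic data from `make_regression` (an assumption for illustration, not the car dataset) and compares the forest's prediction with the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Predictions of each fitted tree for the first five samples: shape (20, 5)
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# The forest's prediction is the mean over its trees
same = np.allclose(per_tree.mean(axis=0), forest.predict(X_demo[:5]))
print(same)
```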

### GridSearch : 
Exhaustive search over specified parameter values for an estimator.<br>

GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">read full documentation</a>
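
As a small illustration of what the search records, the sketch below runs a grid search over three candidate values on synthetic data (the data, the candidate values and `cv=3` are assumptions chosen to keep it fast) and reads the per-candidate scores from `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=120, n_features=4, noise=5.0, random_state=0
)

pipe = Pipeline([("rfr", RandomForestRegressor(random_state=0))])

# Pipeline parameters are addressed as "<step name>__<parameter name>"
search = GridSearchCV(pipe, {"rfr__n_estimators": [5, 10, 20]}, cv=3)
search.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean cross-validated score
results = pd.DataFrame(search.cv_results_)[
    ["param_rfr__n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
print(search.best_params_)
```

For a regressor the default score is R², so `mean_test_score` and `best_score_` are mean R² values over the validation folds.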

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

rfr_pip = Pipeline([
    #("standardizer", StandardScaler()),
    ("rfr", RandomForestRegressor())
])

param_range = list(range(50, 101))  # candidate values 50..100 for n_estimators
grid_params = [{"rfr__n_estimators" : param_range}]

grid = GridSearchCV(
    rfr_pip, 
    grid_params,
    n_jobs=-1, 
    cv=5
)
grid.fit(x_train, y_train)

print(grid.best_score_)
print(grid.best_params_)

0.9046922991577464
{'rfr__n_estimators': 96}


As you can see, we used a pipeline for simplicity, with RandomForestRegressor as the estimator. Decision trees do not need the features to be on the same scale, so the standardization step is commented out. You can enable it if you want (it requires `from sklearn.preprocessing import StandardScaler`); the result should not change materially.<br><br>
We used GridSearchCV to find the best value of n_estimators for RandomForestRegressor. With 96 separate estimators, the best mean cross-validated score is about 0.905, slightly better than our previously trained linear regression model.<br><br>
Let's check how well the model fits the test data:

In [13]:
from sklearn.metrics import r2_score

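# obtain the best estimator from the grid search
rfr = grid.best_estimator_
# GridSearchCV (refit=True by default) has already refit it on x_train,
# so this extra fit is redundant and only retrains it with fresh randomness
rfr.fit(x_train, y_train)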

y_pred = rfr.predict(x_test)
print("Test Accuracy : {:.3f}".format(r2_score(y_test, y_pred)))

Test Accuracy : 0.914


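So we get an R² score of 0.914 on the test set (printed as "Test Accuracy" above): the model explains about 91.4 percent of the variance in price, which is a good result for this case.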
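
For reference, the value reported by `r2_score` is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on four made-up prices (hypothetical numbers, not model output) and compares it with the library function.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and predictions, for illustration only
y_true = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_hat = np.array([12000.0, 19000.0, 29000.0, 41000.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, while 0 corresponds to always predicting the mean of the target.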

Sina Kazemi<br>
Github : <a href="https://github.com/sina96n/">sina96n</a>