Imports

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sbn
import matplotlib.pyplot as plt

import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

In our work, the steps will be as follows:

* 1- data overview and data cleaning
* 2- splitting the data into training and test sets
* 3- scaling the data
* 4- building the model
* 5- training the model
* 6- evaluating the model

Reading the data and taking a first look at it:

In [None]:
dataFrame = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/bmw.csv")
dataFrame.head()

We are going to solve a regression problem, but there are 3 non-numeric columns (model, transmission, fuelType).
We can either delete these columns or convert them to a numeric form.
Instead of deleting them, we will use 'dummy variables' to convert them, which keeps their information available to the model and should help its accuracy.
We will do this right before building the model; for now, let's get some idea of the data.
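To make the idea of dummy variables concrete, here is a minimal sketch on a toy data frame. The values are made up for illustration; only the column name mirrors the dataset, and `dtype=int` is my own choice so the output shows 0/1.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Manual"],
    "price": [12000, 25000, 15000],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["transmission"], dtype=int)
print(encoded)
```

Each category becomes its own column, so the model only ever sees numbers.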

In [None]:
sbn.distplot(dataFrame["price"])

![qweerty.PNG](attachment:eaee0cd2-bb74-4006-bff4-5c9fe14cd990.PNG)
The region I marked in yellow stretches the distribution into a long right tail, so it no longer looks like a 'normal distribution'. Removing that part is a choice; I will remove it so that my model works more accurately.
But first, let's get a little more insight into the data.

In [None]:
sbn.countplot(x=dataFrame["year"])  # pass the column as x explicitly; newer seaborn versions expect keyword arguments

This shows how many cars there are for each 'year'. The year labels overlap on the x-axis, but that's okay, we already know that they range between 1970 and 2020.
What we can see here is that most cars are from 2019 and only a few are from older years. For now we will not do anything about it.
Let's take a look at the correlations.

In [None]:
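dataFrame.select_dtypes("number").corr()  # numeric columns only; recent pandas versions raise an error on text columns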

This shows the correlation of every numeric column with every other one, but in this example we are only concerned with the correlation between price and the other columns.



In [None]:
dataFrame.select_dtypes("number").corr()["price"].sort_values()  # numeric columns only

Here we looked at the correlations of 'price' specifically, and added sort_values() so that we can see which features are the most and least correlated with it.
So what do we understand from this table? The correlation of price with itself is 1, which is expected. After that, 'year' has the highest correlation, 0.62, so newer cars tend to be more expensive. The correlation with 'mpg' (miles per gallon) is negative: the higher the mpg, the lower the price, which means cars that use more fuel tend to be the more expensive ones. The most negative correlation is with 'mileage', that is, how far the car has been driven, which is logical anyway. In short, a negative correlation describes an opposite relationship: when one value goes up, the other tends to go down. Keep in mind that correlation shows association, not cause.
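A negative correlation can be seen on a tiny example. The numbers below are invented for illustration and are not taken from the dataset; the sketch only assumes NumPy.

```python
import numpy as np

# Made-up numbers for illustration only: price falls as mileage grows.
mileage = np.array([5_000, 20_000, 40_000, 80_000, 120_000])
price = np.array([40_000, 33_000, 27_000, 18_000, 11_000])

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(mileage, price)[0, 1]
print(round(r, 3))
```

The result is close to -1, which is what a strong "one goes up, the other goes down" relationship looks like.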

In [None]:
# Now the part about trimming the rows that break the 'normal distribution'.
# I showed that region in yellow on the graph earlier. First, let's sort the dataset by 'price'.
dataFrame.sort_values("price",ascending = False).head(20)

As seen here, there are some extreme values in our data set (we only showed 20 of them). In such cases, keeping about 99% of the data usually lets us continue without distorting the overall picture.
So we can remove about 1%.

In [None]:
len(dataFrame) * 0.01

From the above, 1% of the data is about 107 rows. After sorting the dataset in descending order of 'price', we discard the most expensive rows from the top; the cell below cuts at position 131, slightly more than 1%. This gives us a 'more normal' distribution.

In [None]:
dataFrame.sort_values(["price"],ascending=False).iloc[131:]

We used 'iloc' (integer-location based indexing) to create the new data frame: after sorting, .iloc[131:] keeps everything from position 131 onward, which is the same as dropping the first 131 rows.
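The same sort-then-slice pattern on a small, made-up price column looks like this. It is only a sketch: the toy prices and the name `n_drop` are my own, and the cut here is exactly 1% rather than the cut used in the notebook.

```python
import pandas as pd

# Hypothetical prices: 198 ordinary cars plus two extreme ones.
toy = pd.DataFrame({"price": list(range(10_000, 29_800, 100)) + [120_000, 150_000]})

n_drop = len(toy) // 100   # 1% of 200 rows -> 2 rows
trimmed = toy.sort_values("price", ascending=False).iloc[n_drop:]

print(len(toy), len(trimmed), trimmed["price"].max())
```

After the slice, the two extreme prices are gone and the maximum is an ordinary value again.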

In [None]:
trimmedDf = dataFrame.sort_values("price",ascending=False).iloc[131:]
trimmedDf.head(15)

As you can see, the highest price in our new data frame is about 65 thousand. The first 131 rows in descending order, including the cars priced at 123 thousand, were discarded.
Now let's look at the distribution again.

In [None]:
sbn.distplot(trimmedDf["price"])

After this cleaning, the long tail on the right that distorted the distribution is almost gone, and the distribution looks more normal.
If we trim even more, we will start losing useful data, so there's no need.

I discovered something about engineSize that I want to share.
When I looked at the lowest engineSize values I saw 0. This was interesting, and after some research on the internet I learned that engineSize cannot be 0. So after sorting the data by 'engineSize', I will find the rows with 0 and discard them.

In [None]:
trimmedDf.sort_values("engineSize",ascending = True).head(10)

We see that there are cars with engineSize equal to 0. We don't know how many there are yet, so let's drop them all below.
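If we wanted to know how many such rows there are before dropping them, a boolean mask can count them. This is a small sketch on made-up values; `zero_mask` and `cleaned` are names I chose for illustration.

```python
import pandas as pd

# Made-up engine sizes; the two zeros stand in for the suspicious rows.
toy = pd.DataFrame({"engineSize": [2.0, 0.0, 1.5, 3.0, 0.0]})

zero_mask = toy["engineSize"] == 0
print(zero_mask.sum())      # number of rows with engineSize == 0

cleaned = toy[~zero_mask]   # keep only the rows where the mask is False
print(len(cleaned))
```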

In [None]:
trimmedDf = trimmedDf[trimmedDf.engineSize != 0]
trimmedDf.sort_values("engineSize",ascending = True)


Very good, we've deleted the rows where engineSize equals 0.
Now it's time to replace our dataFrame with the cleaned one.

In [None]:
dataFrame = trimmedDf
dataFrame.describe()

We replaced our dataFrame, and the 'count' row confirms that the number of rows has changed.

--- Now, as I said at the beginning, it's time to convert non-numeric columns into numeric with dummy variables! ---


In [None]:
dataFrame = pd.get_dummies(dataFrame, columns=["model"])
dataFrame = pd.get_dummies(dataFrame, columns=["transmission"])
dataFrame = pd.get_dummies(dataFrame, columns=["fuelType"])
dataFrame

We did what we said above using the pd.get_dummies() function.

In [None]:
dataFrame.head()

--- Let's Move On to Creating the Model After Data Cleaning ---

In [None]:
y = dataFrame["price"].values

x = dataFrame.drop("price",axis=1).values

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=10)

First we defined x and y. While defining x, we dropped the target column with drop() instead of listing every feature column one by one, which is more practical.

We split the data into training and test sets with train_test_split(). Remember y = ax + b, where y is the output, x is the feature, a is the weight and b is the bias.

In [None]:
len(x_train)

In [None]:
len(x_test)

We looked at the size of x_train and x_test to check that the data was split correctly.

7422 + 3181 = 10603
10603 * 0.3 = 3180.9 ≈ 3181 ✓
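The same check can be written as a few lines of arithmetic. This sketch assumes that the test share is rounded up and the remainder goes to training, which matches the sizes we observed.

```python
import math

n_rows = 10603     # total row count from the check above
test_size = 0.3

n_test = math.ceil(n_rows * test_size)   # assumption: the test share is rounded up
n_train = n_rows - n_test

print(n_train, n_test)
```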

Before giving the data to the model, we need to scale the features. With MinMaxScaler, each feature column is mapped into the 0-1 range. This usually helps the model train faster and more stably: for example, mileage can be in the tens of thousands while engineSize is a single digit, and bringing both into 0-1 puts them on a comparable scale.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # reuse the range learned from x_train; do not refit on the test data


After importing, we created a variable named scaler as an instance of the MinMaxScaler class. The scaler has to be fitted to x_train first, with scaler.fit(x_train), and only then can it transform the data.
Instead of first fitting and then transforming, we can do both at once with .fit_transform(), which is what I did for x_train. For x_test we only call .transform(), so the test data is scaled with the range learned from the training data and cannot influence it.
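To see what min-max scaling computes, here is a small NumPy sketch. It follows the common convention of learning the minimum and maximum from the training data and reusing them for the test data. The mileage values and the helper `scale` are made up for illustration; this is not the library's implementation.

```python
import numpy as np

# Made-up mileage values for illustration.
train = np.array([[10_000.0], [50_000.0], [90_000.0]])
test = np.array([[30_000.0], [100_000.0]])

# "Fit": learn the minimum and maximum from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def scale(x):
    # "Transform": apply the training statistics to any data.
    return (x - col_min) / (col_max - col_min)


train_scaled = scale(train)
test_scaled = scale(test)
print(train_scaled.ravel(), test_scaled.ravel())
```

Note that a test value above the training maximum ends up slightly above 1. That is expected when the range comes from the training data only.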

Once scaling is finished, we can move on to creating our model.
For this example we import two things: Sequential, which is the type of our model, and Dense, for our layers.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
x_train.shape

The dummy variables increased the number of columns; this is a reminder of how many we have now.

Let's create our model.

In [None]:
model = Sequential()

model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))
model.add(Dense(40,activation="relu"))

model.add(Dense(1))

model.compile(optimizer="adam",loss="mse")

We created an object named 'model' from the Sequential class. This is our model.

Then we add hidden layers to the model and specify the number of neurons in each layer; I set 40 for now. Later we can reduce or increase this based on the model's performance. We also choose the activation function of the neurons; four of the best known are ReLU, Tanh, Sigmoid and Linear. My choice for this example was 'relu'. The last layer has a single neuron and no activation, because we predict one number, the price.
We chose 'mse' (mean squared error) as the loss function because it is a common choice for regression.


Now it's time to run the model, so let's train the model.

In [None]:
model.fit(x = x_train, y = y_train, batch_size=250, validation_data=(x_test,y_test), epochs=300)


We can think of 'batch_size' as the size of the model's mouth. Feeding the model 1 million rows at once may exhaust memory, so we feed it in smaller pieces; but if the pieces are too small, training becomes slow. So we have to find a suitable value. As a rule of thumb, a larger dataset can take a larger batch size.
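With the numbers used in this notebook, we can work out how many batches the model sees in one epoch. This is plain arithmetic; the variable names are my own.

```python
import math

n_train = 7422     # training rows, from the split above
batch_size = 250

# The last batch is smaller when the rows do not divide evenly.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch)
```

So each epoch is 30 weight updates: 29 full batches of 250 rows and one final batch of 172.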

'Epochs' is the number of passes: it determines how many times the model goes over the training set during training. Too many epochs can lead to 'overfitting'; too few leave the model under-trained, and we don't want that either. So the number of epochs needs to be at a reasonable value. There are methods for finding it, but I won't use them in this example because epochs=300 will be enough for us. (Overfitting: the model stops learning the general relationship between features and price and instead memorizes the training data. It then predicts the training prices very well, which is pointless because we already know them; what matters is that it gives the right price when we hand it a new feature set.)
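One such method is early stopping: stop training once the validation loss has stopped improving for a few epochs. Keras provides an EarlyStopping callback for this; the sketch below is not that callback, only a plain-Python illustration of the idea. The helper `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Hypothetical helper: stops once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0          # improvement: reset the counter
        else:
            waited += 1         # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end


# Made-up validation losses: improvement stalls after epoch 3.
losses = [9.0, 7.0, 6.0, 5.5, 5.6, 5.7, 5.8, 5.9]
print(early_stop_epoch(losses))
```

With a patience of 3, training would stop at epoch 6 here instead of running all the way to the end.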

While the model is being trained, TensorFlow can also evaluate it on validation data after each epoch, which automatically gives us the x_test / y_test comparison that we could otherwise do manually. Since we want the comparison against the test data, we pass the test data as validation_data; the training data is what the model actually learns from.

In [None]:
lostData = pd.DataFrame(model.history.history)

We stored the training history in lostData. Let's have a glance.

In [None]:
lostData.head(10)

As you can see, loss and val_loss are close to each other, which is what we wanted.

In [None]:
lostData.plot()

The graph looks as it should.

Let's look at the error values to understand exactly how well our model works.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

guessArray = model.predict(x_test)

In [None]:
guessArray

In [None]:
mean_absolute_error(y_test,guessArray)   # we see that there is a difference of 2597 pounds

mean_absolute_error is 2597. This means our price estimates are off by 2597 pounds on average. For example, our model might predict a 58k vehicle as about 60.6k or 55.4k.
Let's do something else to better understand the error.

In [None]:
dataFrame["price"].mean()

The mean car price is 22168 and mean_absolute_error is 2597.

(2597 / 22168) * 100 ≈ 11.7

So our error rate is about 11.7%.
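The same calculation in code, together with a tiny hand-made example of how a mean absolute error is computed. The three toy prices are invented; the two constants are the values reported above.

```python
# Toy example: mean absolute error on three made-up predictions.
y_true = [20000, 30000, 15000]
y_pred = [22000, 29000, 14000]
toy_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(toy_mae, 2))

# Relative error using the values reported above.
mae = 2597           # mean absolute error
mean_price = 22168   # mean car price
relative_error_pct = mae / mean_price * 100
print(round(relative_error_pct, 1))
```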

What if we didn't get the result we wanted? 

We can go back and clean the data further, change the test split size, increase the epochs, or change the number of neurons and layers. While doing this we should be careful not to overfit: a model that works very well on the data we have may give poor results on a new feature set, so we should test the model with new feature sets.

It's time to look at the regression graph.


In [None]:
plt.scatter(y_test,guessArray)
plt.plot(y_test,y_test,color="r")

We plotted the regression graph using matplotlib.pyplot's .scatter() and .plot() functions. We got a nice regression graph and a good result overall, even though a small number of points deviate a lot.

Finally, let's test our model with a real feature set and finish our work. For a realistic result, the features must also be realistic, so we will take a row from the dataFrame, without its 'price', and give it to the model.

In [None]:
dataFrame.iloc[150]["price"]

Since we will give the row at position 150 to the model, we first looked at its 'price': 52995.

In [None]:
newCarSeries = dataFrame.drop("price",axis=1).iloc[150]

The selected row is a Series, so we defined a variable called newCarSeries and assigned it the row at position 150 of the dataFrame, without 'price'.

In [None]:
newCarSeries

In [None]:
newCarSeries = scaler.transform(newCarSeries.values.reshape(-1,37))

Before giving the data to the model, we need to scale it, so we reshaped the row into a single sample with 37 columns and then scaled it with the same scaler.
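The reshape step is needed because the scaler and the model expect a 2-D array of shape (samples, features), while a single row is 1-D. A minimal sketch with a made-up row of 5 values instead of the notebook's 37:

```python
import numpy as np

# A made-up feature row with 5 values instead of the notebook's 37.
row = np.array([2019, 12000, 145, 50.4, 2.0])
print(row.shape)             # 1-D array

batch = row.reshape(1, -1)   # one sample, as many columns as needed
print(batch.shape)           # 2-D array with a single row
```

Writing reshape(1, -1) lets NumPy work out the number of columns, so the line does not need to change if the number of features changes.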

In [None]:
model.predict(newCarSeries)

Using .predict(), we gave our new feature set to the model, and the prediction came out as 53918.


Example Result =>

# Real Price     : ~ 53000
# Predicted Price: ~ 53900

I am new to Data Science, so I would like to thank Atıl Samancioglu and Murat Mert for their help.