# Predicting Fiat500 used car prices

<img src="https://cdn.pixabay.com/photo/2017/09/05/08/55/isolated-2716838_960_720.png" alt="Fiat500" width="200"/>

# The problem
In this notebook we look at the data from this [Kaggle dataset](https://www.kaggle.com/paolocons/small-dataset-about-used-fiat-500-sold-in-italy).

We will see if we can predict the sales price of a used Fiat 500 car. 

We will explore the dataset, check the features it contains, and build a model that predicts the sales price of a car.

# 1. Import the important libraries / packages
These packages are needed to load and use the dataset

In [None]:
import pandas as pd #we use this to load, read and transform the dataset
import numpy as np #we use this for statistical analysis
import matplotlib.pyplot as plt #we use this to visualize the dataset
import seaborn as sns #we use this to make countplots
import sklearn.metrics as sklm #This is to test the models

In [None]:
#here we load the train data
data = pd.read_csv(r'/kaggle/input/small-dataset-about-used-fiat-500-sold-in-italy/Used_fiat_500_in_Italy_dataset.csv')

#and immediately I would like to see what this dataset looks like
data.head()

In [None]:
#now let's look closer at the dataset we got
data.info()

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
#Let's see what the options are in the model column (the objects)
print(data['model'].unique())

In [None]:
#Let's see what the options are in the transmission column (the objects)
print(data['transmission'].unique())

# 2. Explore the dataset

## Price in the dataset
As this is the column we would like to predict, let's look more closely at it.

In [None]:
#Now let's try a histogram
plt.hist(data['price'])

In [None]:
#Now we will try a box-and-whiskers plot
plt.boxplot(data['price'])

You can see an outlier around 16,000 euro. Let's look more closely at this outlier.

In [None]:
outliers = data[data['price'] > 14000]
outliers.head()
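
The box plot above marks as outliers the points beyond the whiskers. With matplotlib's default setting (`whis=1.5`), the whiskers reach to the furthest data points within 1.5 times the interquartile range of the box. A minimal sketch of that rule, assuming made-up prices (the values in `toy_prices` and the helper name `iqr_outliers` are illustrative and not taken from the dataset):

```python
import pandas as pd

def iqr_outliers(values, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return s[(s < lower) | (s > upper)]

# Made-up prices in euro, with one deliberately extreme value
toy_prices = [8000, 8500, 9000, 9500, 10000, 10500, 11000, 16000]
print(iqr_outliers(toy_prices))
```

Applied to `data['price']`, the same helper would list the rows that the box plot draws as individual points.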

This price is for a pop model, so let's look more closely at the price range per model type.

In [None]:
#first let's set the model column as categorical
data['model'] = data['model'].astype('category')
data.info()

In [None]:
#next let's plot what the price distribution looks like per category
models = list(data['model'])
values = list(data['price'])

fig, axs = plt.subplots(1, 2, figsize=(9,4), sharey=True)
axs[0].bar(models, values)
axs[1].scatter(models, values)
fig.suptitle('Categorical Plotting')

In [None]:
#Make a countplot to see how many cars of each model are sold
countplt, ax = plt.subplots(figsize = (10,7))
ax = sns.countplot(x = 'model', data=data)

Looks like lounge is the most sold type of model and there is only one star model in this dataset.
Let's look into this star model more closely.

In [None]:
star = data[data['model'] == 'star']
star.head()

There is indeed only one star model in this dataset. 

# 3. Make all columns numeric
We need to make all input columns numeric before we can use them in a model.
This is what we will do now.

In [None]:
#the only two columns that are not numeric are 'model' and 'transmission'.
#to show how we have changed the values, let's encode this manually
model_dict = {'pop':4, 'lounge':3, 'sport':2, 'star':1}
#assign the result back instead of using inplace=True on a column, which newer pandas versions do not apply reliably
data['model'] = data['model'].astype(str).map(model_dict)
data.info()

In [None]:
#now we encode the second non-numeric column, 'transmission', in the same way
trans_dict = {'manual':1, 'automatic':2}
#assign the result back instead of using inplace=True on a column
data['transmission'] = data['transmission'].map(trans_dict)
data.info()
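
The manual mapping above gives the models an order (star < sport < lounge < pop) that a linear model will treat as meaningful. An alternative, not used in the rest of this notebook, is one-hot encoding with `pd.get_dummies`. A minimal sketch on a toy frame (the rows in `toy` are made up for illustration):

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook
toy = pd.DataFrame({
    "model": ["pop", "lounge", "sport", "lounge"],
    "price": [8900, 10500, 9800, 11000],
})

# One 0/1 column per model; drop_first drops one level to avoid perfectly collinear columns
encoded = pd.get_dummies(toy, columns=["model"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding each model gets its own coefficient in a linear regression, at the cost of extra columns.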

# 4. Most important features
Let's continue by looking at the most important features according to three different tests.
Then we will use the top ones to train and test our first model.

In [None]:
#First we need to split the dataset in the y-column (the target) and the components (X), the independent columns. 
#This is needed as we need to use the X columns to predict the y in the model. 

y = data['price'] #the column we want to predict 
X = data.drop(labels = ['price'], axis = 1)  #independent columns 

In [None]:
#as Longitude and latitude are features which need to be combined to have an influence, we will drop them for now. 
X = X.drop(labels = ['lon', 'lat'], axis =1)
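
One way to combine the two coordinates into a single feature, not pursued in this notebook, would be the distance from each car to a fixed reference point. Below is a sketch using the haversine formula. The reference point (approximate coordinates of Rome), the helper name `haversine_km` and the sample rows are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Made-up coordinates; the notebook's dataset stores them in 'lat' and 'lon'
toy = pd.DataFrame({"lat": [41.9, 45.46], "lon": [12.5, 9.19]})
ROME_LAT, ROME_LON = 41.9, 12.5  # assumed reference point
toy["dist_to_rome_km"] = haversine_km(toy["lat"], toy["lon"], ROME_LAT, ROME_LON)
print(toy)
```

Whether such a distance helps the prediction would have to be checked with the feature tests that follow.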

In [None]:
#TEST 1 - ExtraTreesRegressor - GOOD IF YOU USE DECISION TREE MODELS
#price is a continuous target, so we use the regressor rather than the classifier
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)
print(model.feature_importances_) #use the built-in feature_importances_ attribute of tree-based models
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).sort_values().plot(kind='barh')
plt.show()

In [None]:
#TEST 2 - SelectKBest - GOOD IF YOU USE A K-NEAREST NEIGHBOR MODEL
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

#price is continuous, so we score with f_regression (chi2 is meant for classification targets)
#k='all' keeps every feature, so that we can rank all of them
bestfeatures = SelectKBest(score_func=f_regression, k='all')
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Name of the column','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print the best features, at most 10

In [None]:
#TEST 3 - Correlations - Linear and logistic regression predict well when features correlate with the target
#get the correlations of each feature in the dataset
corrmat = data.drop(labels =['lon', 'lat'], axis = 1) #we leave out lon and lat here as well, as we dropped them from X
corrmat = corrmat.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))

#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")

It seems that age in days and km have a strong negative relationship with the price of the car, which is logical: the older the car, the less it is worth.

Age and km also have a strong positive relationship with each other, which is also logical: the older the car, the more km it is likely to have run.
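
Each cell of the heat map is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. A minimal sketch of that calculation, assuming made-up numbers (the values in `toy` and the helper name `pearson` are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up values in which older cars have more km and a lower price
toy = pd.DataFrame({
    "age_in_days": [300, 900, 1500, 2400, 3600],
    "km": [10000, 35000, 60000, 90000, 140000],
    "price": [11000, 9800, 8600, 7000, 5200],
})

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(pearson(toy["age_in_days"], toy["price"]))  # manual calculation
print(toy.corr().loc["age_in_days", "price"])     # same value from pandas
```

A value close to -1 means the two columns move in opposite directions, a value close to +1 that they move together.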

In [None]:
#let's keep all features for now. 

# 5. Machine learning Model
As we would like to predict a continuous number, the price, we will use a Linear Regression model here.

In [None]:
#Load the chosen model here
from sklearn.linear_model import LinearRegression

## 5a. Split the dataset in train and test

Before we use the chosen model, we first split the dataset into a train and a test set.
This lets us fit the model on the training set and check its accuracy on data it has not seen during training.

In [None]:
from sklearn.model_selection import train_test_split

#First try with all features

#I want to withhold 35% of the dataset to perform the tests
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.35 , random_state = 25)

In [None]:
print('Shape of X_train is: ', X_train.shape)
print('Shape of X_test is: ', X_test.shape)
print('Shape of y_train is: ', y_train.shape)
print('Shape of y_test is: ', y_test.shape)
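
Fixing `random_state` makes the split reproducible: the same rows end up in the test set on every run. A small sketch on made-up arrays (100 rows of toy data, not the Fiat dataset) showing the resulting sizes and the repeatability:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Same settings as in the notebook: 35% test data and a fixed seed
split_a = train_test_split(X_toy, y_toy, test_size=0.35, random_state=25)
split_b = train_test_split(X_toy, y_toy, test_size=0.35, random_state=25)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# With the same random_state, both calls select identical test rows
print(np.array_equal(y_test_a, split_b[3]))
```

Without a fixed seed, the metrics computed later would vary slightly from run to run.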

## 5b. Make a check for the model

In [None]:
#To check the model, I want to build a check:
import math
def print_metrics(y_true, y_predicted, n_parameters):
    ## First compute R^2 and the adjusted R^2
    r2 = sklm.r2_score(y_true, y_predicted)
    r2_adj = r2 - (n_parameters - 1)/(y_true.shape[0] - n_parameters) * (1 - r2)
    
    ## Print the usual metrics and the R^2 values
    print('Mean Square Error      = ' + str(sklm.mean_squared_error(y_true, y_predicted)))
    print('Root Mean Square Error = ' + str(math.sqrt(sklm.mean_squared_error(y_true, y_predicted))))
    print('Mean Absolute Error    = ' + str(sklm.mean_absolute_error(y_true, y_predicted)))
    print('Median Absolute Error  = ' + str(sklm.median_absolute_error(y_true, y_predicted)))
    print('R^2                    = ' + str(r2))
    print('Adjusted R^2           = ' + str(r2_adj)) #This is the number we will be focussing on. 
    #A good model would have an adjusted R2 of >70%, a bad model below this. 
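
The adjustment used in `print_metrics` is an algebraic rearrangement of the textbook formula 1 - (1 - R^2)(n - 1)/(n - k), where n is the number of observations and k the number of fitted parameters including the intercept. A quick numerical check of that equivalence, as a sketch with arbitrary example numbers (the function names below are hypothetical):

```python
def adjusted_r2_notebook(r2, n_obs, n_parameters):
    """Form used in print_metrics above."""
    return r2 - (n_parameters - 1) / (n_obs - n_parameters) * (1 - r2)

def adjusted_r2_textbook(r2, n_obs, n_parameters):
    """Textbook form, with n_parameters counting the intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_parameters)

# Arbitrary example values, not results from this notebook
r2, n_obs, n_parameters = 0.80, 500, 7
print(adjusted_r2_notebook(r2, n_obs, n_parameters))
print(adjusted_r2_textbook(r2, n_obs, n_parameters))
```

Both forms penalise R^2 more strongly as the number of parameters grows relative to the number of observations.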
    

## 5c. Fit and check the Linear regression model

In [None]:
# Linear regression model
model = LinearRegression() 
model.fit(X_train, y_train)

In [None]:
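#Now let's see how this model performs
Predictions = model.predict(X_test)
#n_parameters counts the fitted coefficients (one per feature) plus the intercept
print_metrics(y_test, Predictions, X_test.shape[1] + 1)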

# Conclusion

As you can see here, we have an adjusted R2 of 79% on the test set. This is above the 70% threshold we set for a good model, so this linear regression predicts the prices reasonably well.