***Car sale price regression for Volkswagen***

This was my first notebook that I made myself from scratch. Very interesting dataset.

Please comment on what you would improve or try out further. Suggestions are very welcome and appreciated!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load packages
import seaborn as sns
import matplotlib.pyplot as plt
# Load all data:
all_data = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv')
# Change order of data
col_index_new = [0,1,3,4,5,6,7,8,2] # Third column should become last one 
all_data = all_data.reindex(columns = all_data.columns.values[col_index_new])
all_data.head(5), all_data.shape

In [None]:
# Plot of distribution as well as density
plt.figure(figsize=(10,10))
h2 = sns.histplot(data= all_data, x = 'price', hue ='model',kde=True, linewidth=0)
print('Skewness = %f and Kurtosis = %f' % (all_data['price'].skew(),all_data['price'].kurt()))

***Take-aways:***
* The overall distribution seems to consist of several peaks. This could possibly be explained by the different types of cars (the industry groups them into classes/segments).
* Most of the cars on sale are Polos, closely followed by Golfs.
* Golf has a second peak? Could this be the GTI spec?
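
The counts and price levels behind these bullets can be checked with a table instead of being read off the plot. A minimal sketch, assuming only the `model` and `price` columns of `all_data`; the helper name `summarise_by_model` and the `toy` frame are made up for illustration and are not part of the dataset:

```python
import pandas as pd

def summarise_by_model(df: pd.DataFrame) -> pd.DataFrame:
    """Number of listings plus median and mean price per model, most common first."""
    summary = df.groupby('model')['price'].agg(['count', 'median', 'mean'])
    return summary.sort_values('count', ascending=False)

# Toy stand-in for all_data (made-up numbers, only here so the sketch runs on its own)
toy = pd.DataFrame({
    'model': ['Polo', 'Polo', 'Polo', 'Golf', 'Golf', 'Tiguan'],
    'price': [9000, 11000, 10000, 15000, 25000, 22000],
})
print(summarise_by_model(toy))
```

In the notebook the call would be `summarise_by_model(all_data)`. A mean clearly above the median for one model would be consistent with a long right tail or a second, more expensive group within that model.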

Let's now look for correlations:

In [None]:
# EDA: Correlations, scatter plots etc.
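corr = all_data.corr(numeric_only=True)  # numeric columns only; pandas >= 2.0 raises on string columns otherwise
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the upper triangle (including the diagonal)
cmap = sns.diverging_palette(20, 220, n=200)
h1 = sns.heatmap(corr, mask=mask, cmap=cmap, square=True, annot=True)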

***Take-aways:***
* The year the car is made is most strongly correlated with the sale price. Newer = More expensive
* More mileage = cheaper car, evidently.
* Tax and engine size are not as correlated as I expected. Tax calculation might be based on factors such as weight as well, which is not included in the data.
* Interestingly, mpg is also quite strongly negatively correlated with price. There might be a reason for that: gasoline = lower mpg but cheaper!

Let's dive (even) deeper into these last points!

In [None]:
f,ax = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot(data=all_data,x='tax',y='engineSize',hue='fuelType',ax=ax[0])
sns.scatterplot(data=all_data,x='mpg',y='price',hue='fuelType',ax=ax[1])


Seems like that last thought was indeed correct! The hybrids strengthen this negative correlation even further!
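
The scatter plot gives a visual impression; the same question can also be put in numbers by computing the mpg/price correlation separately per fuel type. A minimal sketch, assuming the `fuelType`, `mpg` and `price` columns used above; the helper `corr_by_group` and the `toy` values are hypothetical and only there to make the example run on its own:

```python
import pandas as pd

def corr_by_group(df, group_col, x_col, y_col):
    """Pearson correlation between x_col and y_col within each group of group_col."""
    return pd.Series({
        name: group[x_col].corr(group[y_col])
        for name, group in df.groupby(group_col)
    })

# Toy stand-in for all_data (made-up numbers, not taken from the dataset)
toy = pd.DataFrame({
    'fuelType': ['Petrol'] * 4 + ['Diesel'] * 4,
    'mpg':      [40, 45, 50, 55, 55, 60, 65, 70],
    'price':    [20000, 18000, 16000, 14000, 26000, 24000, 22000, 20000],
})

per_fuel = corr_by_group(toy, 'fuelType', 'mpg', 'price')
pooled = toy['mpg'].corr(toy['price'])
print(per_fuel)
print('Pooled correlation:', pooled)
```

In the notebook this would be `corr_by_group(all_data, 'fuelType', 'mpg', 'price')`. The pooled and the per-group values can differ quite a bit, which is why it is worth looking at both before reading too much into a single number in the heatmap.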

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call would only leave an empty figure behind
p1 = sns.pairplot(data=all_data, hue='model')

***Take-aways***
* Clearly not all variables have a linear relationship with the target variable (price). Linear models might struggle with this dataset for that reason. Are non-linear models required?
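
One thing that could be tried before giving up on linear models is fitting them on the logarithm of the price. This is not something the notebook does; it is a sketch of an alternative using scikit-learn's `TransformedTargetRegressor`. The data below is synthetic (an exponential price decay with car age, with made-up numbers), so it says nothing about how well this works on the VW data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in: price decays exponentially with age (made-up numbers)
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=500)
price = 30000 * np.exp(-0.15 * age) * rng.lognormal(0, 0.05, size=500)
X = age.reshape(-1, 1)

# Plain linear regression on the raw price
plain = LinearRegression().fit(X, price)

# Same regressor, but fitted on log(price); predictions are mapped back with exp
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X, price)

# In-sample RMSE, only to compare the two fits on this toy data
rmse_plain = np.sqrt(mean_squared_error(price, plain.predict(X)))
rmse_logged = np.sqrt(mean_squared_error(price, logged.predict(X)))
print(f'RMSE plain: {rmse_plain:.0f}, RMSE log-target: {rmse_logged:.0f}')
```

On the real data the same wrapper could be fitted with `X_train` and `y_train` and evaluated on the test set; whether it helps there would have to be checked.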

In [None]:
# Check for missing data (printed explicitly, since only the last expression of a cell is displayed):
print(all_data.isna().sum())
# Dummy variables
all_dataWithDummy = pd.get_dummies(all_data)
# Predictor and target variables
X_data = all_dataWithDummy.drop(['price'],axis=1)
y_data = all_dataWithDummy['price']

***Take-aways***
* None of the data is missing! 

Let's get modeling!

In [None]:
# Import modeling packages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


In [None]:
# Split train and test set
X_train, X_test, y_train, y_test = train_test_split(X_data,y_data,test_size = 0.2, random_state = 8)
# Create linear regression model
LR = LinearRegression()
# Fit linear regression model
LR.fit(X_train,y_train)
# Use linear regression to predict test set
y_pred = LR.predict(X_test)
# Quantify prediction
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))
print('First 5 real values to predict: ' +  str(np.round(y_test[0:5].values,1)))
print('First 5 predicted values: ' +  str(np.round(y_pred[0:5],1)))
print('RMSE in price units: ' + str(np.round(RMSE,1)))
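
An RMSE value is easier to judge next to a naive baseline. A minimal sketch, assuming the same kind of train/test split as above: scikit-learn's `DummyRegressor` always predicts the training mean, so any useful model should beat it. The data here is synthetic with made-up coefficients, and the helper `rmse` is hypothetical:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_data / y_data (made-up numbers)
rng = np.random.default_rng(8)
X = rng.normal(size=(400, 3))
y = 15000 + 3000 * X[:, 0] - 2000 * X[:, 1] + rng.normal(scale=1000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=8)

def rmse(model):
    """Fit on the training split and return the RMSE on the test split."""
    model.fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

rmse_baseline = rmse(DummyRegressor(strategy='mean'))  # always predicts the training mean
rmse_linear = rmse(LinearRegression())
print(f'Baseline RMSE: {rmse_baseline:.1f}')
print(f'Linear regression RMSE: {rmse_linear:.1f}')
```

In the notebook the same comparison would use `X_train`, `X_test`, `y_train` and `y_test` from the cell above.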

In [None]:
RFR = RandomForestRegressor(random_state=8)  # fixed seed so the comparison is reproducible
DTR = DecisionTreeRegressor(random_state=8)
LR = LinearRegression()

models = [RFR,DTR,LR]

for model in models:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    RMSE = np.sqrt(mean_squared_error(y_test,y_pred))
    print(str(model),': ',np.round(RMSE,1))

***Take-aways:***
* Random forest outperforms other models with default parameters
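
A single train/test split can be lucky or unlucky, so a ranking like this is more convincing when it holds across several folds. A minimal sketch using `cross_val_score` with the `'neg_root_mean_squared_error'` scorer; the data is synthetic (made-up numbers) and the small `n_estimators` is only there to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_data / y_data (made-up numbers)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 20000 * np.exp(-0.2 * X[:, 0]) + 500 * X[:, 1] + rng.normal(scale=300, size=300)

models = {
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=0),
    'DecisionTree': DecisionTreeRegressor(random_state=0),
    'Linear': LinearRegression(),
}

cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores  # scikit-learn returns the negated RMSE so that higher is better
    print(f'{name}: {cv_rmse[name].mean():.1f} +/- {cv_rmse[name].std():.1f}')
```

With the notebook's data, `X` and `y` would be replaced by `X_data` and `y_data`.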

Note: This is definitely NOT the best model to use. There are other, more advanced models to try!
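
One possible next step is tuning the random forest instead of using its defaults, for example with `GridSearchCV`. The sketch below is only an illustration: the parameter grid is an arbitrary small choice, not a recommendation, and the data is synthetic with made-up numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (made-up numbers)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 20000 * np.exp(-0.2 * X[:, 0]) + 500 * X[:, 1] + rng.normal(scale=300, size=200)

# Illustrative grid; the values are assumptions, chosen small so the search runs quickly
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',  # negated RMSE, so the best score is the largest
    cv=3,
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best cross-validated RMSE:', round(-search.best_score_, 1))
```

In the notebook the search would be fitted on `X_train` and `y_train`, and `search.best_estimator_` could then be scored on the held-out test set like the other models.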
