In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/used-car-dataset-ford-and-mercedes/ford.csv')

In [None]:
df.head()

Let's look at the type of each feature and how many non-null values it contains:

In [None]:
df.info()

We confirm that no feature has null values by using the isnull function:

In [None]:
df.isnull().sum()

## Data Exploring and Cleaning 

### Categorical variables: 

Model, transmission and fuel type are neither binary nor ordinal; they are nominal categorical features, so let's look at the value counts for each of them:

In [None]:
df.model.value_counts()

In [None]:
df.transmission.value_counts()

In [None]:
# value_counts() returns a Series indexed by category, so it can be plotted directly
# (this avoids relying on the column names produced by reset_index, which differ across pandas versions)
pie2 = df['transmission'].value_counts()
pie2.plot(kind='pie', title='Pie chart of transmission type',
          autopct='%1.1f%%', shadow=False, legend=False, fontsize=14, figsize=(12,12))

In [None]:
df.fuelType.value_counts()

In [None]:
fuel_counts = df['fuelType'].value_counts()
# Keep the two most common fuel types and pool the remaining ones into a single slice
pie3 = fuel_counts.head(2).copy()
pie3['Hybrid, Electric or Other type'] = fuel_counts.iloc[2:].sum()
pie3.plot(kind='pie', title='Pie chart of fuel type',
          autopct='%1.1f%%', shadow=False, legend=False, fontsize=14, figsize=(12,12))

In [None]:
pd.crosstab(df['fuelType'], df['transmission'], 
            rownames=['fuelType'], colnames=['transmission']).sort_values(by='Manual',ascending=False)

As shown above, these features are nominal, so we will need to one-hot encode them before they can be used in a machine learning model.
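As a minimal sketch of what one-hot encoding does, the snippet below encodes a toy frame with `pd.get_dummies`. The rows are invented for illustration and are not taken from ford.csv; only the column names follow the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only (not rows from ford.csv)
toy = pd.DataFrame({
    'transmission': ['Manual', 'Automatic', 'Semi-Auto', 'Manual'],
    'mileage': [12000, 30000, 8000, 45000],
})

# drop_first=True drops one level per feature, leaving k-1 indicator columns
encoded = pd.get_dummies(toy, columns=['transmission'], drop_first=True)
print(encoded.columns.tolist())
```

With three transmission levels, two indicator columns remain; the dropped level is represented by both indicators being zero.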

### Numerical variables:

First, show the describe table to check whether the summary statistics make sense, and then check one by one whether the outliers are valid:

In [None]:
df.describe()

Plotting a histogram for each feature:

In [None]:
df.hist(bins=30, figsize=(15,13))

Engine size variable:

In [None]:
sns.boxplot(x='engineSize',data=df)

In [None]:
df[df.engineSize>2.5].groupby(by='model').count()

The engine size values are correct: the outliers are the Mustang and Mondeo models with 5-litre engines and the Ranger with a 3.2-litre engine, which correspond to powerful cars often used in sport.
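The points a boxplot draws as outliers are, by default, the values beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule directly; `iqr_outliers` is a hypothetical helper and the engine sizes are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Made-up engine sizes in litres, for illustration only
engine = pd.Series([1.0, 1.0, 1.2, 1.5, 1.5, 1.6, 2.0, 2.0, 5.0])
flagged = iqr_outliers(engine)
print(flagged.tolist())
```

Listing the flagged values this way makes it easier to inspect them per model than reading them off the plot.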

MPG variable:

In [None]:
sns.boxplot(x='mpg',data=df)

In [None]:
df[df.mpg>90]

In [None]:
df[(df.model==' Kuga') & (df.year==2020)]

For mpg there are some outliers, and they correspond to one specific model, the 2020 Kuga, a hybrid car with about 200 mpg and a 2.5-litre engine. Since these values are correct, they are kept unchanged.

Tax variable: 

In [None]:
sns.boxplot(x='tax',data=df)

In [None]:
df[df.tax>350]

The tax values are correct: the outliers are cars with taxes reaching approximately $570, which correspond to models such as the Mustang, Kuga and S-MAX.

Year variable:

In [None]:
df.year.plot.hist(bins=30)

In [None]:
df[df['year']>2020]

Let's try to find some record with similar characteristics as the car of year 2060:

In [None]:
df[df.model.isin([' Fiesta'])].groupby(by='year').count()

In [None]:
df[df.mileage>54000].groupby(by='year').count()

In both tables above, the years 2013-2017 are by far the most frequent, so we assume the car most likely belongs to one of these years.

There is a car with year 2060, which is obviously invalid. We impute the most plausible value as follows:  
1. Retrieve the entire row for the car with year 2060.  
2. Using a multi-condition selection, retrieve the cars with model = 'Fiesta' and mileage > 54000.  
3. The previous step returns many records, so we assume the intended year was 2006 or 2016 (a digit typo) and check which of the two is more plausible.

In [None]:
df[((df['model']==' Fiesta') & (df['mileage']>54000)) & ((df['year']==2006) | (df['year']==2016))].groupby(by='year').count()

As shown above, 43 cars have characteristics similar to the 'year 2060' car: 40 are from 2016 and 3 are from 2006. Based on this we impute 2016 as the year of that car.
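The reasoning above can be sketched end to end on a small invented frame: count the similar cars per candidate year, take the most frequent one, and write it back by condition rather than by row position. All records below are made up; the real counts are the ones in the table above.

```python
import pandas as pd

# Invented records for illustration only
cars = pd.DataFrame({
    'model': [' Fiesta'] * 6,
    'year': [2016, 2016, 2016, 2006, 2016, 2060],
    'mileage': [60000, 58000, 71000, 90000, 55000, 54807],
})

candidates = [2006, 2016]  # plausible typos for 2060
similar = cars[(cars['model'] == ' Fiesta')
               & (cars['mileage'] > 54000)
               & (cars['year'].isin(candidates))]

# The most frequent candidate year among similar cars becomes the imputed value
best_year = similar['year'].value_counts().idxmax()

# Select the invalid row by condition instead of by position
cars.loc[cars['year'] > 2020, 'year'] = best_year
print(best_year, cars['year'].max())
```

Selecting with a condition keeps the fix valid even if the row order of the file changes.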

In [None]:
df.loc[df['year']>2020, 'year'] = 2016  # the single invalid row (year 2060, position 17726)

In [None]:
df.year.min(),df.year.max()

Since the years now range from 1996 to 2020, we plot a new histogram with 25 bins (one per year) to avoid gaps:

In [None]:
df.year.plot.hist(bins=25)

Mileage variable:

In [None]:
sns.boxplot(x='mileage',data=df)

In [None]:
j=sns.regplot(x='mileage',y='price',data=df)
j.set(ylim=(0, None))

In [None]:
g=sns.regplot(x='year',y='mileage',data=df)
g.set(ylim=(0, None))

These values are correct: some cars simply have very high mileage, and mileage is not strongly correlated with year. We can find 2015 cars with very high mileage, and the same holds for cars from before 2005.

Price variable:

In [None]:
df.price.plot.hist(bins=30)

In [None]:
sns.boxplot(x='price',data=df)

In [None]:
df[df['price']>40000]

These values are correct: the outliers correspond to sport models in better condition, which are also known for their high prices, such as the Mustang and Focus.

Let's make a pairplot and focus on the 'price' row (the second row): along the columns we see its scatter plot against every numerical feature. At first glance there appears to be a considerable correlation with year and mileage, and a weaker correlation with miles per gallon, engine size and tax.

In [None]:
sns.pairplot(df)

Let's compute and show in a table the correlation of the label price with each feature:

In [None]:
data = {'Price':[df['price'].corr(df['year']),df['price'].corr(df['mileage']),df['price'].corr(df['tax']),
                 df['price'].corr(df['mpg']),df['price'].corr(df['engineSize'])]}
 
pd.DataFrame(data, index=['Year','Mileage','Tax','Miles per gallon','Engine Size'])
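A more compact way to get the same kind of table is `DataFrame.corrwith`, which computes the Pearson correlation of every column with a given Series. The sketch below uses a small invented sample; only the column names follow the dataset.

```python
import pandas as pd

# Small invented sample; column names follow the dataset
sample = pd.DataFrame({
    'year': [2013, 2015, 2016, 2017, 2019],
    'price': [7000, 9500, 11000, 12500, 17000],
    'mileage': [60000, 45000, 30000, 25000, 8000],
})

# Pearson correlation of every remaining numeric column with price, in one call
corr_with_price = sample.drop(columns='price').corrwith(sample['price'])
print(corr_with_price.sort_values())
```

This avoids repeating `df['price'].corr(...)` once per feature and keeps the feature names as the index automatically.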

Now let's do something similar for the categorical variables:

Showing the min, mean, median and max of price for each model, sorted by mean in descending order:

In [None]:
df.groupby(by='model')['price'].agg(['min','mean','median','max']).sort_values(by='mean',ascending=False)

At the top of the table above are the models known precisely for being expensive and luxurious.

In [None]:
df.groupby(by='transmission')['price'].agg(['min','mean','median','max']).sort_values(by='mean',ascending=False)

In [None]:
df.transmission.value_counts()

Above we see that manual transmission cars tend to be cheaper on average; they are also by far the most frequent in the dataset, which may be related to their lower price.

In [None]:
df.groupby(by='fuelType')['price'].agg(['min','mean','median','max']).sort_values(by='mean',ascending=False)

In [None]:
df.fuelType.value_counts()

For fuel type we cannot conclude much: hybrid, electric and other types tend to be more expensive, but there are only 25 of them in total. Diesel and petrol vehicles are cheaper on average but also include outliers, namely the sport models we have already seen.

Focusing on the 'mean' column of each table, we see substantial differences between groups. We could test this formally, for example with Levene's test for equality of variances followed by a t-test or ANOVA on the means, but the difference is clear enough that we skip it.
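For reference, a minimal sketch of how such tests could be run with `scipy.stats`. The three groups are synthetic draws whose means and spreads are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic prices for three transmission groups (not the real data)
manual = rng.normal(11000, 3000, 200)
automatic = rng.normal(15000, 3500, 200)
semi_auto = rng.normal(14500, 3500, 200)

# Levene's test: H0 = all groups have equal variance
lev_stat, lev_p = stats.levene(manual, automatic, semi_auto)

# One-way ANOVA: H0 = all group means are equal
f_stat, anova_p = stats.f_oneway(manual, automatic, semi_auto)

# Welch's t-test for one pair, which does not assume equal variances
t_stat, t_p = stats.ttest_ind(manual, automatic, equal_var=False)
print(lev_p, anova_p, t_p)
```

With the real data, each group would be something like `df.loc[df.transmission == 'Manual', 'price']`. Levene's test only informs the choice between the equal-variance and Welch variants; the comparison of means is done by the t-test or ANOVA.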

## Feature Engineering

**Before encoding we generate polynomial features:**

In [None]:
df3=df.copy(deep=True)

In [None]:
from sklearn.preprocessing import PolynomialFeatures

Let's first split the dataset into numerical and categorical columns. The numerical columns are passed to PolynomialFeatures, and the result is then concatenated with the corresponding categorical columns:

In [None]:
df_cat=df3[['model','transmission','fuelType']]  # Categorical columns
df_cat.head()

In [None]:
df_num=df3[['year','price','mileage','tax','mpg','engineSize']]   #Numerical columns

In [None]:
pf = PolynomialFeatures(degree=2, include_bias=False)
df3_pf = pf.fit_transform(df_num)

From the 6 numerical columns we expect PolynomialFeatures to generate 27 columns in total, excluding the bias: 6 linear terms, 6 squares and 15 pairwise products.

In [None]:
pd.DataFrame(df3_pf).head()
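The expected column count can be checked without the data. For n input columns and degree d, the number of monomials of total degree 1 to d is C(n + d, d) - 1 when the bias is excluded. The sketch below compares that formula with what `PolynomialFeatures` reports on a dummy array; the array values are arbitrary.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 6, 2
# Monomials of total degree 1..d in n variables: C(n + d, d) - 1 (bias excluded)
expected = comb(n_features + degree, degree) - 1

# Dummy input with the right width; the values themselves do not matter
dummy = np.arange(12, dtype=float).reshape(2, n_features)
pf_check = PolynomialFeatures(degree=degree, include_bias=False).fit(dummy)
print(expected, pf_check.n_output_features_)
```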

In [None]:
X_new=pd.concat([pd.DataFrame(df3_pf), df_cat], axis=1)

And finally our new concatenated dataset corresponds to:

In [None]:
X_new.head() 

Let's check if the data of these first 5 rows correctly match with the prior dataset df:

In [None]:
df.head()

In [None]:
X_new.shape

**Now encoding of categorical features:**  
As we only have nominal features, one-hot encoding will be applied to them:

In [None]:
one_hot_encode_cols = X_new.dtypes[X_new.dtypes == object]  # filtering by string categoricals (np.object was removed from NumPy)
one_hot_encode_cols = one_hot_encode_cols.index.tolist()
one_hot_encode_cols

In [None]:
df2=X_new.copy(deep=True)

In [None]:
df2 = pd.get_dummies(X_new, columns=one_hot_encode_cols, drop_first=True) #To avoid multicollinearity
df2.describe().T

After one-hot encoding the nominal categorical variables we expect a total of 55 columns: 27 polynomial + 22 model + 2 transmission + 4 fuelType (drop_first removes one level per feature).

In [None]:
df2.shape

In [None]:
df2.head()

After the polynomial transformation the feature names were replaced by integers. 'price' will be the label when building the models, and it corresponds to column 1 of the dataframe above, so the next step is to split the data into X and y before the train-test split.

In [None]:
y_col = 1
X = df2.drop(y_col, axis=1)
y = df2[y_col]

## Modeling:

The following models will be built and compared using their corresponding error measurements:  
1. Linear Regression without polynomial features or standardization.
2. Linear Regression with engineered features.
3. Ridge Regression with Cross-validation with engineered features.
4. Lasso Regression with Cross-validation with engineered features.
5. ElasticNet with ratios between 0.1 - 0.9 and Cross-validation with engineered features.

Before building the different models, let's define some error metrics to compare the performance of each one:

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

def rmse(ytrue, ypredicted):
    return np.sqrt(mean_squared_error(ytrue, ypredicted))
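One detail worth keeping in mind when calling these metrics: scikit-learn's functions take the true values first and the predictions second. The sketch below, with invented numbers, shows that MSE does not depend on the order but R2 does.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # invented targets
y_hat = np.array([12.0, 18.0, 33.0, 37.0])    # invented predictions

# MSE is symmetric in its arguments
mse_a = mean_squared_error(y_true, y_hat)
mse_b = mean_squared_error(y_hat, y_true)

# R2 divides by the variance of its first argument, so the order matters
r2_right = r2_score(y_true, y_hat)
r2_swapped = r2_score(y_hat, y_true)
print(mse_a, mse_b, r2_right, r2_swapped)
```

When the fit is very good the two R2 values are close, but for weaker models the swapped call can be noticeably off.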

**1. Linear Regression without polynomial features or standardization.**

In [None]:
df9=df.copy(deep=True)

In [None]:
df9 = pd.get_dummies(df9, columns=one_hot_encode_cols, drop_first=True) #To avoid multicollinearity

In [None]:
df9.head()

In [None]:
y_col = 'price'
Xno_pf = df9.drop(y_col, axis=1)
yno_pf = df9[y_col]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(Xno_pf, yno_pf, test_size=0.3, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression

linearRegression3 = LinearRegression().fit(X_train2, y_train2)

y_pred3 = linearRegression3.predict(X_test2)

# scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
print('MSE: ',mean_squared_error(y_test2,y_pred3))
print('RMSE: ',rmse(y_test2,y_pred3))
print('Coefficient of determination: ',r2_score(y_test2,y_pred3))

**2. Linear Regression with polynomial features and standardization.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()

X_train_s = s.fit_transform(X_train)
X_test_s = s.transform(X_test)

In [None]:
from sklearn.linear_model import LinearRegression

linearRegression = LinearRegression().fit(X_train_s, y_train)

y_pred = linearRegression.predict(X_test_s)

print('MSE: ',mean_squared_error(y_test,y_pred))
print('RMSE: ',rmse(y_test,y_pred))
print('Coefficient of determination: ',r2_score(y_test,y_pred))

**3. Ridge Regression with Cross-validation.**

In [None]:
from sklearn.linear_model import RidgeCV

alphas = [0.0005, 0.001, 0.003, 0.007, 0.009, 0.02]

ridgeCV = RidgeCV(alphas=alphas, cv=4).fit(X_train_s, y_train)

ridgeCV_pre = ridgeCV.predict(X_test_s)

print('Alpha found: ',ridgeCV.alpha_)
print('MSE ', mean_squared_error(y_test,ridgeCV_pre))
print('RMSE: ',rmse(y_test,ridgeCV_pre))
print('Coefficient of determination: ',r2_score(y_test,ridgeCV_pre))

**4. Lasso Regression with Cross-validation.**

In [None]:
from sklearn.linear_model import LassoCV

alphas2 = np.array([1e-6, 5e-6, 1e-5, 2e-5])

lassoCV = LassoCV(alphas=alphas2, max_iter=50000, cv=3).fit(X_train_s, y_train)  # max_iter must be an integer

lassoCV_pre = lassoCV.predict(X_test_s)

print('Alpha found: ',lassoCV.alpha_)
print('MSE ', mean_squared_error(y_test,lassoCV_pre))
print('RMSE: ',rmse(y_test,lassoCV_pre))
print('Coefficient of determination: ',r2_score(y_test,lassoCV_pre))

**5. ElasticNet with ratios between 0.1 - 0.9 and Cross-validation.**

In [None]:
from sklearn.linear_model import ElasticNetCV

l1_ratios = np.linspace(0.1, 0.9, 9)

elasticNetCV = ElasticNetCV(alphas=alphas2, l1_ratio=l1_ratios, max_iter=10000).fit(X_train_s, y_train)  # max_iter must be an integer
elasticNetCV_pre = elasticNetCV.predict(X_test_s)

print('Alpha found: ',elasticNetCV.alpha_)
print('l1_ratio: ', elasticNetCV.l1_ratio_)
print('MSE ', mean_squared_error(y_test,elasticNetCV_pre))
print('RMSE: ',rmse(y_test,elasticNetCV_pre))
print('Coefficient of determination: ',r2_score(y_test,elasticNetCV_pre))

Let's build a table showing the error metrics for the five models:

In [None]:
# Metrics are called as (y_true, y_pred); the first model is scored against its own test split
data = {'Linear without tuning':[mean_squared_error(y_test2,y_pred3),rmse(y_test2,y_pred3),r2_score(y_test2,y_pred3)],
        'Linear with tuning':[mean_squared_error(y_test,y_pred),rmse(y_test,y_pred),r2_score(y_test,y_pred)],
        'RidgeCV': [mean_squared_error(y_test,ridgeCV_pre),rmse(y_test,ridgeCV_pre),r2_score(y_test,ridgeCV_pre)],
        'LassoCV': [mean_squared_error(y_test,lassoCV_pre),rmse(y_test,lassoCV_pre),r2_score(y_test,lassoCV_pre)],
        'ElasticNetCV': [mean_squared_error(y_test,elasticNetCV_pre),rmse(y_test,elasticNetCV_pre),r2_score(y_test,elasticNetCV_pre)]}
 
pd.DataFrame(data, index=['MSE','RMSE','R2 score'])

Feature engineering clearly matters for prediction quality: the first model, trained without it, performs markedly worse than the rest. The other four models all reach R2 scores above 0.9999, so we focus on MSE and RMSE, where the differences are visible. Scores this close to 1 should be read with caution, though: 'price' was among the columns expanded by PolynomialFeatures, so its square and its products with the other features remain in X and leak the target into the inputs. ElasticNetCV and LassoCV have the highest errors, with an RMSE of 13.97; both selected very small alphas such as 1e-6, so this may reflect overfitting or incomplete convergence. RidgeCV, whose alpha is chosen from a grid of larger values (0.0005 to 0.02), has the lowest error with an RMSE of 1.88, and plain Linear Regression follows closely at 2.00.  
Focusing on the best model, Ridge, we would expect an error of about US$1.88 when predicting the price of a car from its 8 original features, subject to the leakage caveat. Let's compare the performance more closely with a couple of plots:

**Plotting R2 score for each model:**

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
model_names = ['Linear without tuning', 'Linear with tuning', 'RidgeCV', 'LassoCV', 'ElasticNetCV']
# r2_score expects (y_true, y_pred)
r2_scores = [r2_score(y_test2,y_pred3),r2_score(y_test,y_pred),r2_score(y_test,ridgeCV_pre),
             r2_score(y_test,lassoCV_pre),r2_score(y_test,elasticNetCV_pre)]
ax.bar(model_names,r2_scores)
ax.set_xticks(ticks=[0, 1, 2, 3, 4])
ax.set_xticklabels(['Linear without tuning', 'Linear with tuning', 'RidgeCV', 'LassoCV', 'ElasticNetCV'], rotation=90)
ax.set_xlabel('Models')
ax.set_ylabel('Coefficient of determination (R2 Score)')
ax.set(ylim=(0.7, None))
plt.show()

**Plotting MSE and RMSE in logarithmic scale for each model:**

In [None]:
# One (MSE, RMSE) pair per model, with metrics called as (y_true, y_pred)
data =  ((mean_squared_error(y_test2,y_pred3),rmse(y_test2,y_pred3)), 
        (mean_squared_error(y_test,y_pred),rmse(y_test,y_pred)), 
        (mean_squared_error(y_test,ridgeCV_pre),rmse(y_test,ridgeCV_pre)), 
        (mean_squared_error(y_test,lassoCV_pre),rmse(y_test,lassoCV_pre)), 
        (mean_squared_error(y_test,elasticNetCV_pre), rmse(y_test,elasticNetCV_pre)))

dim = len(data[0])
w = 0.75
dimw = w / dim

fig, ax = plt.subplots()
x = np.arange(len(data))
for i in range(len(data[0])):
    y = [d[i] for d in data]
    b = ax.bar(x + i * dimw, y, dimw, bottom=0.001)

ax.set_xticks(ticks=[0, 1, 2, 3, 4])
ax.set_xticklabels(['Linear without tuning', 'Linear with tuning', 'RidgeCV', 'LassoCV', 'ElasticNetCV'], rotation=90)
ax.set_yscale('log')
ax.set_xlabel('Models')
ax.set_ylabel('Error, log scale (MSE in US$ squared, RMSE in US$)')

colors = {'MSE':'tab:blue', 'RMSE':'tab:orange'}  # match matplotlib's default bar colours
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
plt.legend(handles, labels)

plt.show()