# Mercedes price prediction

<img src ="https://images.hgmsites.net/hug/mercedes-benz-historical-logos_100711609_h.jpg" width="200" height="200">

**Dataset used:** https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes

**Goal:** predict the car price from the features of the car (e.g. mileage, engine type, transmission), both categorical and continuous.

**ML task type:** regression.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt
# machine learning
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/merc.csv")

# Exploratory Data Analysis (EDA)

In [None]:
# print explicitly: a notebook cell only displays its last expression
print(df.isnull().sum())
df.info()

In [None]:
sns.pairplot(df)

**Let's check how the car models are distributed in the dataset. Hypothesis: price is strongly correlated with the model.**

In [None]:
fig, ax = plt.subplots(figsize = (10,5))
sns.countplot(y = 'model', data = df, order = df['model'].value_counts().index)
plt.ylabel('Car model')
plt.title('Model distribution')

**One-hot encoding of categorical features:**

**We do this so that the ML algorithm does not read an artificial ordering into the categorical values, as it would with plain integer codes.**

In [None]:
ohe = pd.get_dummies(df)
ohe.head()
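
To make the encoding step concrete, here is a minimal sketch on a toy frame. The rows and values are made up for illustration and are not taken from `merc.csv`; only `pd.get_dummies` is the same call as above.

```python
import pandas as pd

# Toy frame with hypothetical values, only to show the shape of the encoding
toy = pd.DataFrame({
    "model": ["A Class", "C Class", "A Class"],
    "transmission": ["Manual", "Automatic", "Manual"],
    "mileage": [12000, 30500, 8000],
})

# Text columns are expanded into one indicator column per category;
# the numeric column is passed through unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded.shape)
```

Each category gets its own indicator column, so the regression sees "is this an A Class?" as a separate feature instead of a single numeric code.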

In [None]:
X = ohe.drop(['price'], axis=1)
y = ohe['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
pipe = Pipeline([('scaler', StandardScaler()), ('LinReg', LinearRegression())])

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_train, y_train)

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
from sklearn.metrics import r2_score
r2_lin = r2_score(y_test, y_pred)
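
For reference, the R2 score used throughout this kernel is the coefficient of determination, `1 - SS_res / SS_tot`. The sketch below computes it by hand on a few toy numbers; the helper name `r2_by_hand` and the toy values are illustrative only and are not part of the notebook.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: what the model fails to explain
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares: spread of the targets around their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, not taken from the dataset
y_true_toy = [3.0, -0.5, 2.0, 7.0]
y_pred_toy = [2.5, 0.0, 2.0, 8.0]
r2_toy = r2_by_hand(y_true_toy, y_pred_toy)
print(round(r2_toy, 3))
```

A score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price.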

**Let's plot learning curves (RMSE as a function of the number of training samples). The train and validation RMSE should converge as the number of samples approaches the size of the training set.**

In [None]:
from sklearn.metrics import mean_squared_error

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train) + 1):
        # fit on the first m training samples, then score on them and on the validation set
        model.fit(X_train[:m], y_train[:m])
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
        val_errors.append(mean_squared_error(y_val, y_val_pred))
    # plot once, after all errors have been collected
    sizes = np.arange(1, len(X_train) + 1)
    plt.plot(sizes, np.sqrt(train_errors), 'r--', linewidth=2, label='train')
    plt.plot(sizes, np.sqrt(val_errors), 'b--', linewidth=2, label='val')
    plt.ylabel('RMSE')
    plt.xlabel('Number of samples')
    plt.legend()

linReg = LinearRegression()
plot_learning_curves(linReg, X[:200], y[:200])

**So, the learning curves tell us that the RMSE stabilizes as the number of samples grows.**
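
As an alternative to a hand-written loop, scikit-learn ships a `learning_curve` helper that cross-validates at several training-set sizes. The sketch below runs it on synthetic data (an assumption, so that it is self-contained); in the kernel one would pass `X` and `y` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data, used only to keep the sketch self-contained
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_toy, y_toy,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# The scorer returns negative RMSE, so flip the sign and average over the folds
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_rmse, val_rmse):
    print(f"{n:4d} samples  train RMSE={tr:.3f}  val RMSE={va:.3f}")
```

Averaging over folds gives smoother curves than a single random split, at the cost of fitting the model several times per training size.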

**Let's try a polynomial regression (maybe there is some hidden non-linear correlation between the features).**

**+ CrossValScore + feature importances**

*This part of the code is commented out because I ran out of memory in the kernel :(*

In [None]:
#from sklearn.preprocessing import PolynomialFeatures
#poly_features = PolynomialFeatures(degree=2)

# the pipeline applies the polynomial expansion itself, so it is fitted on the raw features
#pipe_poly = Pipeline([('poly', poly_features), ('linreg', LinearRegression())])
#pipe_poly.fit(X_train, y_train)
#y_pred_poly = pipe_poly.predict(X_test)

In [None]:
#pipe_poly.score(X_train, y_train)

In [None]:
#r2_poly = pipe_poly.score(X_test, y_test)

In [None]:
#r2_score(y_test, y_pred_poly)

In [None]:
#from sklearn.model_selection import cross_val_score


#print(cross_val_score(pipe_poly, X, y, cv=3))
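
A rough sketch of why a degree-2 expansion gets expensive on one-hot encoded data: a full polynomial expansion of `n` input columns up to degree `d` (bias column included, as `PolynomialFeatures` does by default) produces `C(n + d, d)` output columns. The column counts below are hypothetical and are not the actual width of `ohe`.

```python
from math import comb

def n_poly_features(n_features, degree=2):
    """Number of output columns of a full polynomial expansion, bias column included."""
    return comb(n_features + degree, degree)

# Hypothetical input widths, for illustration only
for n in (10, 50, 100):
    print(n, "->", n_poly_features(n))
```

The output width grows quadratically with the number of input columns for degree 2, and a dense matrix of that width is stored for every row of the training set.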

**Apparently, a more complex model with polynomial features is too much for this kernel. Let's try Ridge regression.**

In [None]:
from sklearn.linear_model import Ridge

pipe_ridge = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=100, solver='sag', max_iter=2000))])

In [None]:
pipe_ridge.fit(X_train, y_train)
pipe_ridge.score(X_train, y_train)

In [None]:
y_pred_ridge = pipe_ridge.predict(X_test)
r2_ridge = r2_score(y_test, y_pred_ridge)

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(pipe_ridge, X, y, cv=3))

**And finally, Lasso regression.**

In [None]:
from sklearn.linear_model import Lasso

pipe_lasso = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.1, max_iter=100000, warm_start=True))])
pipe_lasso.fit(X_train, y_train)
pipe_lasso.score(X_train, y_train)

In [None]:
y_pred_lasso = pipe_lasso.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)

In [None]:
from sklearn.linear_model import ElasticNet

pipe_en = Pipeline([('scaler', StandardScaler()), 
                    ('elasticnet', ElasticNet(alpha=0.01, l1_ratio=0.5))])
pipe_en.fit(X_train, y_train)
pipe_en.score(X_train, y_train)

In [None]:
y_pred_en = pipe_en.predict(X_test)
r2_en = r2_score(y_test, y_pred_en)

In [None]:
# keep each score in the same position as its model name so the bars are labeled correctly
names = ['Linear', 'Ridge', 'Lasso', 'ElasticNet']
r2_scores = [r2_lin, r2_ridge, r2_lasso, r2_en]

In [None]:
import plotly.express as px

g = px.bar(x=names, y=r2_scores, log_y=True)
g.show()
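
If a sorted bar chart is wanted, one way to do it is to sort name and score pairs together, so each label stays attached to its own value. The numbers below are placeholders for illustration and are not results from this kernel.

```python
# Placeholder scores for illustration only, not results from this notebook
scores_by_model = {"Linear": 0.71, "Ridge": 0.73, "Lasso": 0.70, "ElasticNet": 0.72}

# Sort (name, score) pairs by score so labels and values move together
ranked = sorted(scores_by_model.items(), key=lambda item: item[1])
ranked_names = [name for name, _ in ranked]
ranked_scores = [score for _, score in ranked]
print(ranked_names)
print(ranked_scores)
```

The two resulting lists can then be passed as `x` and `y` to the same `px.bar` call.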

**So, even though the R2 scores of these four models differ only marginally, the best model for predicting Mercedes car prices is ElasticNet.**

**To sum up,**

Mercedes car prices correlate with the features in the dataset. ElasticNet regression performs best among the models presented in this kernel.