# **Mercedes Price Prediction**

### **Objective**
##### Our goal is to build a model that predicts the price of a used Mercedes car.

### **Importing Required Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

### **Loading Dataset**

In [None]:
data = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/merc.csv')

In [None]:
# let's check the shape of the data
data.shape

In [None]:
# checking first five rows of dataset
data.head()

### **Data Description**

* model : model of the car
* year : registration year
* price : car price in pounds
* transmission : type of gearbox
* mileage : distance driven
* fuelType : fuel type
* tax : tax
* mpg : miles per gallon (1 US gallon = 3.78541178 liters)
* engineSize : size of the engine (liters)

In [None]:
# checking for missing values 
data.isnull().sum()

##### We can see that there are no missing values present in data.

In [None]:
# summary statistics 
data.describe()

### **Exploratory Data Analysis(EDA)**

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Correlation Analysis
</div>

In [None]:
plt.figure(figsize=(10, 10))
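# restrict to numeric columns; recent pandas versions raise an error on text columns in corr()
sns.heatmap(data.select_dtypes('number').corr(), annot=True, cmap='plasma')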
plt.show()

* Mileage is negatively correlated with price, which could reflect buyer caution toward heavily used cars.
* Engine size has a strong correlation with the target.
* There is multicollinearity between mileage and year.
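
The multicollinearity noted above can be quantified with a variance inflation factor (VIF). The sketch below is illustrative only: it uses synthetic `year`, `mileage` and `engineSize` columns (assumed values, not the Mercedes data) and a hypothetical helper `vif` built on plain NumPy least squares.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: older cars tend to have more mileage.
n = 500
year = rng.integers(2005, 2021, size=n)
mileage = (2021 - year) * 8000 + rng.normal(0, 5000, size=n)
engine = rng.normal(2.0, 0.5, size=n)  # generated independently of the other two
toy = pd.DataFrame({"year": year, "mileage": mileage, "engineSize": engine})

def vif(df, col):
    """Hypothetical helper: VIF = 1 / (1 - R^2) of `col` regressed on the other columns."""
    y = df[col].to_numpy(dtype=float)
    X = df.drop(columns=col).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = {c: vif(toy, c) for c in toy.columns}
corr_year_mileage = toy["year"].corr(toy["mileage"])
print(vifs)
print(corr_year_mileage)
```

A VIF near 1 indicates little overlap with the other features, while large values flag features that are nearly linear combinations of the rest.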

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Models
</div>

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(20,10))
total = float(len(data))
ax = sns.countplot(x="model", data=data)
plt.xticks(rotation=90)
plt.title("Count Plot For Mercedes Models", fontsize=20)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2.
    y = p.get_height()
    ax.annotate(percentage, (x, y),ha='center',va='bottom')
plt.show()

##### Clearly, the Mercedes C Class is the most common model, followed by the A Class.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Year
</div>

In [None]:
plt.figure(figsize=(20, 10))
fig = sns.boxplot(x='year', y="price", data=data)
plt.xticks(rotation=90)
plt.show()

##### Newer cars are priced higher than older ones.

In [None]:
data[data['year'] == 1970]

##### The year 1970 contains only one car, and its price is unusually high (an outlier), so I am dropping that row from the original data.
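A row like this can also be removed by condition rather than by a fixed index label, which keeps working if the row order of the file changes. A minimal sketch on a made-up frame (the values and index labels are assumptions for illustration):

```python
import pandas as pd

# Made-up rows; only the 1970 entry is meant to be removed.
toy = pd.DataFrame(
    {"year": [2018, 1970, 2019, 2018], "price": [21000, 24999, 30500, 18750]},
    index=[10, 11, 12, 13],
)

# Keep every row except those registered in 1970.
cleaned = toy[toy["year"] != 1970]
print(cleaned.index.tolist())
```

Boolean filtering returns a new frame and leaves `toy` untouched, so the step can be rerun safely.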

In [None]:
# drop the single 1970 row by its index label
data.drop(12072,axis=0,inplace=True)

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Transmission
</div>

In [None]:
# Plot Transmission vs Price
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
sns.catplot(y='price', x='transmission', data=data.sort_values('price', ascending=False),
            kind="boxen", height=6, aspect=3)
plt.show()

In [None]:
data[data['transmission'] == 'Other']

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and FuelType
</div>

In [None]:
plt.figure(figsize=(12, 8))
fig = sns.boxplot(x='fuelType', y="price", data=data.sort_values('price',ascending=False))
plt.show()

##### The figure above shows that Petrol cars reach higher prices than the other fuel types, because the Petrol group contains some high-end values.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Analysis of Price and Continuous Variables
</div>

In [None]:
cols = ['mileage', 'tax', 'mpg','engineSize']

font = {'family': 'serif',
        'color':  'darkred',
        'weight': 'normal',
        'size': 16,
        }
fig,axes = plt.subplots(2,2,figsize=(25,15),sharey=True)
fig.subplots_adjust(wspace=0.1, hspace=0.3)
fig.suptitle('Scatter Plot Of Continuous Variables vs Price',fontsize = 20, fontdict=font)
fig.subplots_adjust(top=0.95)

axes = axes.ravel()

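for i,col in enumerate(cols):
    # plot each feature against the log-transformed price
    x = data[col]
    y = data['price'].apply(np.log)
    # recent seaborn versions require x and y as keyword arguments
    sns.scatterplot(x=x, y=y, ax=axes[i])
plt.show()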

##### Mileage, engineSize and mpg have a strong association with the price.

<div style="color:red;
           display:fill;
           font-size:130%;
           font-family:Argentina;
           letter-spacing:0.5px">
Distribution of Target variable - Price
</div>

In [None]:
# let's check the distribution of the target variable - price
f,ax = plt.subplots(1, sharex=True,figsize=(15,6))
mean_price = data['price'].mean()
median_price = data['price'].median()
mode_price = data['price'].mode().values[0]

# histplot with a KDE overlay replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data['price'], kde=True, stat='density', ax=ax)
ax.axvline(mean_price, color='r', linestyle='--', label="Mean")
ax.axvline(median_price, color='g', linestyle='-', label="Median")
ax.axvline(mode_price, color='b', linestyle='-', label="Mode")

ax.legend()
plt.show()

* The majority of cars are priced at around 15,000-30,000 pounds.
* Very few cars are priced between 50,000 and 160,000 pounds.
* The mean price is greater than the median, and the right tail of the distribution is longer than the left, so the distribution is positively skewed.
* A transformation could bring it closer to a normal distribution. Let's first remove some outliers and see if that helps.
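
The skew described above can be measured directly instead of read off the plot. The sketch below uses synthetic log-normal "prices" (an assumption, not the real column) and a log transform, which is one common choice for right-skewed data; the notebook itself goes on to trim outliers instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (log-normal), standing in for the real column.
prices = pd.Series(rng.lognormal(mean=10.0, sigma=0.5, size=5000), name="price")

skew_raw = prices.skew()          # positive for a right-skewed distribution
skew_log = np.log(prices).skew()  # close to 0 after the log transform

print(prices.mean() > prices.median())
print(skew_raw, skew_log)
```

A positive skewness together with a mean above the median matches the pattern seen in the price histogram.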

In [None]:
# drop the most expensive 1% of cars as outliers
data = data.sort_values('price',ascending = False).iloc[int(len(data) * 0.01):]

In [None]:
plt.figure(figsize=(15,6))
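# histplot with a KDE overlay replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data['price'], kde=True, stat='density')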
plt.show()

##### The distribution now appears closer to normal.
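The effect of the trimming step can also be checked numerically. A minimal sketch on synthetic log-normal data (assumed values, not the Mercedes prices) that drops the top 1% the same way and compares skewness before and after:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
toy = pd.DataFrame({"price": rng.lognormal(mean=10.0, sigma=0.5, size=10_000)})

# Same approach as the notebook: sort descending, then skip the first 1% of rows.
n_drop = int(len(toy) * 0.01)
trimmed = toy.sort_values("price", ascending=False).iloc[n_drop:]

skew_before = toy["price"].skew()
skew_after = trimmed["price"].skew()
print(len(toy), len(trimmed))
print(skew_before, skew_after)
```

Comparing the two skewness values gives a less subjective check than judging the histogram by eye.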

In [None]:
data.dtypes

### **Handling Categorical Features**
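One-hot encoding creates one indicator column per category level. Because the indicators of a feature always sum to 1, one level per feature is dropped to avoid perfect collinearity with the intercept (the dummy variable trap). The sketch below uses a made-up frame and `drop_first=True` as an alternative to dropping named columns by hand; note that the dropped level is then the first in sorted order, which differs from choosing it explicitly.

```python
import pandas as pd

# Made-up categorical data for illustration.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Semi-Auto", "Automatic"],
    "fuelType": ["Petrol", "Diesel", "Diesel", "Hybrid"],
})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level of each feature dropped

dropped = sorted(set(full.columns) - set(reduced.columns))
print(full.shape[1], reduced.shape[1])
print(dropped)
```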

In [None]:
# Use get_dummies for unordered categories and LabelEncoder for ordered ones.

categorical_col = ['model','transmission','fuelType']
dummies = pd.get_dummies(data[categorical_col])

# drop one level per feature to avoid the dummy variable trap
dummies.drop(['model_230','transmission_Other','fuelType_Other'],axis=1,inplace=True)

In [None]:
# concat data and dummies df
data = pd.concat([data,dummies],axis=1)

# drop the original categorical columns
data.drop(categorical_col,axis=1,inplace=True)

In [None]:
# let's create a feature age = current year - year
current_year = 2021
data['Age'] = current_year - data['year']
# drop year
data.drop('year',axis=1,inplace=True)

In [None]:
data.shape

In [None]:
data.head()

### **Model Building**

In [None]:
# Separate Dependent and Independent Variables
X = data.drop('price',axis=1)
y = data['price']

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 50)

In [None]:
# Fit each model and record its R2 score on the test set
models = [LinearRegression(), RandomForestRegressor(random_state=50)]
model_names = ['LinearRegression', 'RandomForestRegressor']
score = []
for reg in models:
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    # r2_score expects the true values first, then the predictions
    score.append(r2_score(y_test, y_pred))

# Put the scores in a data frame.
score_df = pd.DataFrame({'Model Names': model_names, 'r2 Score': score})

In [None]:
score_df

##### Random Forest performed best.
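R2 is not symmetric in its two arguments, so the order passed to `r2_score` matters: scikit-learn's signature is `r2_score(y_true, y_pred)`. The sketch below illustrates this with a hypothetical NumPy helper `r2` that follows the usual definition, on made-up numbers.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([22.0, 24.0, 26.0, 28.0])  # predictions squeezed toward the mean

correct = r2(y_true, y_pred)  # truth first, as in r2_score(y_true, y_pred)
swapped = r2(y_pred, y_true)  # arguments reversed
print(correct, swapped)
```

With these numbers the correct order gives about 0.36, while the swapped order gives about -15, because the denominator then measures the spread of the predictions instead of the spread of the true values.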