# Car Price Prediction

##### Hi, 
###### Welcome to this notebook! The objective of this notebook is to best understand linear regression. The data used in this repo has been taken from Kaggle (link below). 
https://www.kaggle.com/hellbuoy/car-price-prediction


### About project Mechanic of Machine Learning:
I am a mechanical engineer by education. Now I want to dive deep into the world of Machine Learning, hence the name, mechanic of ML :D. I have taken up this project to understand the mathematics behind commonly used ML algorithms. Under this project, I will be sharing useful material and links as I explore this domain. The objective is to learn and to share what I learn. Stay tuned to my GitHub for updates!

### Business Case: 
A top automobile manufacturer has realized the need to provide real-time cost estimates to consumers who configure their car through its website. Build an ML-based model to meet this requirement.
### Notebook objectives:

* To understand and implement linear regression 
* To visualize and understand the data
* To select the features that best predict cost from attribute-value pairs.
* To derive conclusions from the data and suggest solutions for business.

### References:
* Linear Regression and Gradient Descent: https://www.youtube.com/watch?v=4b4MUYve_U8&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&index=2
* Stochastic Gradient Descent: https://www.youtube.com/watch?v=vMh0zPT0tLI
* Gradient Descent: https://www.youtube.com/watch?v=sDv4f4s2SB8
* Notes and source code: https://github.com/ArindamRoy23/Car_Price_Prediction_Linear_Regression_Mechanic-of-ML.git
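
The references above cover gradient descent for linear regression. As a rough illustration of the idea (not the method used later in this notebook, which relies on scikit-learn's `LinearRegression`), here is a minimal NumPy sketch. The function name `fit_linear_gd`, the learning rate, the iteration count and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.1, n_iters=2000):
    """Fit y ~ X @ w + b by batch gradient descent on the mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y                      # prediction error per sample
        grad_w = 2.0 / n_samples * (X.T @ residual)   # d(MSE)/dw
        grad_b = 2.0 / n_samples * residual.sum()     # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data (not the car dataset): y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

w_gd, b_gd = fit_linear_gd(X_demo, y_demo)

# Closed-form least squares on the same data, for comparison
A = np.column_stack([X_demo, np.ones(len(X_demo))])
coef_ls, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("gradient descent:", w_gd, b_gd)
print("least squares:   ", coef_ls)
```

Both approaches should land on essentially the same coefficients here, because the features are on a similar scale. On raw car data with very different feature scales, gradient descent would likely need feature scaling or a much smaller learning rate.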


## Index:
1. Exploratory Analysis and Visualization
2. Observations 
3. Running an initial analysis without removing outlier values
4. Running an analysis after removing outlier values
5. Conclusion 

### Exploratory Analysis and Visualization

In [None]:
#Importing packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
%matplotlib inline

In [None]:
input_df = pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')

In [None]:
sns.histplot(input_df['price'], kde=True)  #distplot is deprecated in recent seaborn releases

In [None]:
#Keeping only the quantitative parameters, since we only fit linear regression models.
input_df = input_df[['symboling', 'wheelbase', 'carlength', 
                    'carwidth','carheight',  'curbweight',
                    'cylindernumber', 'enginesize','boreratio','stroke','compressionratio',
                    'horsepower', 'peakrpm', 'citympg', 'highwaympg','price']]

In [None]:
sns.heatmap(input_df.corr(numeric_only=True))  #numeric_only skips any text columns
plt.show()

In [None]:
corr = input_df.corr(numeric_only=True)  #numeric_only skips any text columns
sns.heatmap(corr.where(corr**2 > .25, 0))  #show only r^2 > 0.25, i.e. |r| > 0.5

In [None]:
#Correlation of every numeric column with price, computed once
price_corr = input_df.corr(numeric_only=True)['price']
strong = price_corr**2 > .25  #r^2 > 0.25, i.e. |r| > 0.5
pos_cor_list = list(price_corr[strong & (price_corr > 0)].index)  #includes 'price' itself
neg_cor_list = list(price_corr[strong & (price_corr < 0)].index)
print('High positive correlation with price:', pos_cor_list)
print('High negative correlation with price:', neg_cor_list)
to_keep_list = pos_cor_list + neg_cor_list

In [None]:
input_df = input_df[to_keep_list]

## Observations:

1. The following parameters are highly positively correlated with price:
    wheelbase, car length/base/height, curb weight, engine size, bore ratio & horsepower
2. The following parameters are highly negatively correlated with price:
    citympg, highwaympg
3. The dataset is relatively small. Thus, losing data might affect the model's predictions.
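
The selection rule used above (keep a column when its squared correlation with price exceeds 0.25, which is the same as |r| > 0.5) can be sketched on a small made-up frame. The helper name `select_by_price_correlation` and the toy columns are assumptions for illustration only; the values are random and not taken from the Kaggle data.

```python
import numpy as np
import pandas as pd

def select_by_price_correlation(df, target="price", min_r2=0.25):
    """Return (positive, negative) column lists whose correlation with
    `target` satisfies r**2 > min_r2. The target itself lands in `positive`."""
    r = df.select_dtypes(include="number").corr()[target]
    strong = r**2 > min_r2
    positive = list(r[strong & (r > 0)].index)
    negative = list(r[strong & (r < 0)].index)
    return positive, negative

# Toy data: one feature rising with price, one falling, one unrelated
rng = np.random.default_rng(0)
size = rng.normal(size=300)
demo_df = pd.DataFrame({
    "enginesize": size,
    "citympg": -size + 0.1 * rng.normal(size=300),
    "noise": rng.normal(size=300),
    "price": 2.0 * size + 0.1 * rng.normal(size=300),
})

pos, neg = select_by_price_correlation(demo_df)
print("positive:", pos)
print("negative:", neg)
```

Note that a threshold on pairwise correlation only looks at each feature on its own. Strongly related features (for example, two size measurements) can both pass the filter while carrying largely the same information.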


### Running an initial analysis without removing outlier values: 

In [None]:
X = input_df.drop('price',axis=1)
y = input_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [None]:
reg = LinearRegression().fit(X_train, y_train)

In [None]:
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean absolute error:', int(mae))
#MAE as a percentage of the mean price, halved to report it as a (+/-) band
print('Percentage error: ~(+/-)', int(mae/input_df['price'].mean()*100)/2, '%')
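
To make the two printed numbers concrete, here is a small hand calculation on made-up prices. The helper name `mae_report` is hypothetical. It returns the mean absolute error, the MAE as a percentage of a reference mean, and the halved percentage that the cell above prints. The un-halved percentage is the more conventional figure, since the MAE is already an average deviation in either direction.

```python
import numpy as np

def mae_report(y_true, y_pred, reference_mean):
    """Return (MAE, MAE as % of reference_mean, that percentage halved)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    pct_of_mean = 100.0 * mae / reference_mean
    return mae, pct_of_mean, pct_of_mean / 2.0

# Made-up prices and predictions, not model output
prices = [10000.0, 20000.0, 30000.0]
preds = [11000.0, 18000.0, 33000.0]

mae, pct, half_pct = mae_report(prices, preds, np.mean(prices))
print(mae, pct, half_pct)  # 2000.0 10.0 5.0
```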

### Increasing the test set:

In [None]:
X = input_df.drop('price',axis=1)
y = input_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
reg = LinearRegression().fit(X_train, y_train)

In [None]:
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean absolute error:', int(mae))
#MAE as a percentage of the mean price, halved to report it as a (+/-) band
print('Percentage error: ~(+/-)', int(mae/input_df['price'].mean()*100)/2, '%')

### Running an analysis after removing outlier values: 
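
The filter in the next cell keeps only rows whose price is below the mean plus two standard deviations, so it trims the expensive tail and leaves cheap cars untouched. A minimal sketch of the same rule on toy numbers follows; `drop_high_outliers` and the values are made up for illustration.

```python
import numpy as np

def drop_high_outliers(values, n_std=2.0):
    """Keep values strictly below mean + n_std * std (upper tail only)."""
    values = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation, matching the pandas .std() default
    cutoff = values.mean() + n_std * values.std(ddof=1)
    return values[values < cutoff], cutoff

toy_prices = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]  # one obvious outlier
kept, cutoff = drop_high_outliers(toy_prices)
print("cutoff:", round(cutoff, 2))
print("kept:", kept)
```

Because the outlier itself inflates both the mean and the standard deviation, the cutoff sits well above the bulk of the data. A quantile-based rule would be one alternative if that turns out to be too lenient.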

In [None]:
#Drop high-price outliers: keep rows priced below mean + 2 standard deviations
price_cutoff = input_df['price'].mean() + 2*input_df['price'].std()
clean_df = input_df[input_df['price'] < price_cutoff]

In [None]:
X = clean_df.drop('price',axis=1)
y = clean_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [None]:
reg = LinearRegression().fit(X_train, y_train)

In [None]:
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean absolute error:', int(mae))
#MAE as a percentage of the mean price, halved to report it as a (+/-) band
print('Percentage error: ~(+/-)', int(mae/clean_df['price'].mean()*100)/2, '%')

### Increasing the test set:

In [None]:
X = clean_df.drop('price',axis=1)
y = clean_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
reg = LinearRegression().fit(X_train, y_train)

In [None]:
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean absolute error:', int(mae))
#MAE as a percentage of the mean price, halved to report it as a (+/-) band
print('Percentage error: ~(+/-)', int(mae/clean_df['price'].mean()*100)/2, '%')

## Conclusions:

1. The model predicts with an error in the range of (+/-) 10% on the original data.
2. The model predicts with an error in the range of (+/-) 6.5% on the cleaned data, after removing outlier values.
3. As expected, shrinking the training set has an adverse effect on the predictions, as the dataset is small.
4. This model can further be integrated into the client portal to give customers a range of predictions.
5. More data points can help improve the accuracy of the model.
6. Every run might give a noticeably different result. This is due to the random train/test split. More data will help bridge this gap.
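
The run-to-run variation mentioned in the last point can be measured by repeating the random split many times and looking at the spread of the error. The sketch below does this with plain NumPy on synthetic data, so it runs without the dataset. The function name `mae_over_random_splits`, the sample size and the noise level are assumptions chosen for illustration.

```python
import numpy as np

def mae_over_random_splits(X, y, test_size=0.3, n_repeats=50, seed=0):
    """Fit least squares on repeated random train/test splits; return test MAEs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_size * n))
    A = np.column_stack([X, np.ones(n)])  # append an intercept column
    maes = []
    for _ in range(n_repeats):
        order = rng.permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        coef, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)
        maes.append(np.mean(np.abs(A[test_idx] @ coef - y[test_idx])))
    return np.array(maes)

# Small synthetic dataset: linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=60)

maes = mae_over_random_splits(X_demo, y_demo)
print(f"MAE mean {maes.mean():.2f}, std {maes.std():.2f}, "
      f"min {maes.min():.2f}, max {maes.max():.2f}")
```

Reporting the mean and spread over many splits gives a steadier picture than a single split. In the notebook itself, passing `random_state` to `train_test_split` would make any one split reproducible.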