## <center><b>Project</b></center>
## <center><b>Car Price Prediction Model</b></center>

## Importing Required Libraries

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

<br>

## Loading the Dataset

In [None]:
fileurl = "https://raw.githubusercontent.com/RitikRatnawat/Regression/main/CarPrice_Assignment.csv"
carsdf = pd.read_csv(fileurl)
carsdf.head()

<br>

## Getting the Size of the Data

In [None]:
print("Number of Rows in the Data :",carsdf.shape[0])
print("Number of Columns in the Data :",carsdf.shape[1])

<br>

## List the Datatypes for each column
#### It is important to first understand what type of variable you are dealing with.

In [None]:
carsdf.dtypes
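
Once the dtypes are known, it can help to split the column names into numeric and non-numeric groups, since the two groups are analyzed differently later on. The sketch below uses a tiny made-up frame (the column names mirror the dataset, the values are illustrative only) and `select_dtypes`.

```python
import pandas as pd

# Tiny stand-in frame: column names mirror the dataset, values are made up.
sample = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "enginesize": [130, 152, 109],
    "price": [13495.0, 16500.0, 13950.0],
})

# Numeric columns are candidates for correlation analysis;
# everything else is treated as categorical.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns     :", numeric_cols)
print("Categorical columns :", categorical_cols)
```

The same two calls should work on `carsdf` directly, assuming it has been loaded as shown earlier.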

<br>

## Analyzing Continuous Variables
<p>Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>To start understanding the (linear) relationship between an individual variable and the price, we can use "regplot", which draws the scatterplot together with the fitted regression line for the data.</p>
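
With its default settings, the line drawn by "regplot" is an ordinary least-squares fit of a straight line. The sketch below computes such a line directly with `np.polyfit` on made-up points, to show what the slope and intercept of the plotted line represent. The data and variable names are illustrative and are not taken from the dataset.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# A degree-1 polynomial fit is a least-squares straight line;
# polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```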

<br>

### Creating Correlation Heatmap
##### Here, we create a heatmap to understand how the different features relate to one another.

In [None]:
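# Restrict to numeric columns: text columns cannot be correlated and raise an error in recent pandas versions
sns.heatmap(carsdf.corr(numeric_only = True))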

<br>

##### From the above heatmap we interpret that wheelbase, carlength, carwidth, curbweight, enginesize, boreratio and
##### horsepower have a clear positive linear relationship with price, while citympg and highwaympg have a
##### clear negative linear relationship with price.
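
A heatmap is easy to misread by eye, so it can be useful to list the correlations with price as numbers, ordered by strength. The sketch below does this on a small made-up frame; the column names mirror the dataset but the values are illustrative only.

```python
import pandas as pd

# Made-up values: price rises with enginesize and falls with citympg.
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 180],
    "citympg": [38, 30, 26, 22, 17],
    "price": [6000.0, 8500.0, 12000.0, 15500.0, 23000.0],
})

# Take the 'price' column of the correlation matrix, drop the trivial
# self-correlation, and order the rest by absolute strength.
price_corr = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)

print(price_corr)
```

Sorting by absolute value keeps strong negative correlations (such as the mpg columns) near the top instead of at the bottom.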

<br>

## Understanding the Relationship between Features using Visualisation

#### Between Wheelbase and Price

In [None]:
sns.regplot(x = 'wheelbase', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['wheelbase','price']].corr()

##### <b>Conclusion : </b>As the wheelbase goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Wheel Base is only a moderate predictor of price: the regression line slopes upward, but the linear relationship is not very tight.
##### We can examine the correlation between 'wheelbase' and 'price' and see it's approximately 0.58.

<br>

#### Between Car length and Price

In [None]:
sns.regplot(x = 'carlength', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['carlength','price']].corr()

##### <b>Conclusion : </b>As the carlength goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Car length seems like a moderately good predictor of price: the regression line slopes clearly upward, although the relationship is not very tight.
##### We can examine the correlation between 'carlength' and 'price' and see it's approximately 0.68.

<br>

#### Between Car width and Price

In [None]:
sns.regplot(x = 'carwidth', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['carwidth','price']].corr()

##### <b>Conclusion : </b>As the carwidth goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Car width seems like a good predictor of price, since the linear relationship is fairly strong.
##### We can examine the correlation between 'carwidth' and 'price' and see it's approximately 0.76.

<br>

#### Between Curb Weight and Price

In [None]:
sns.regplot(x = 'curbweight', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['curbweight','price']].corr()

##### <b>Conclusion : </b>As the curbweight goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Curb Weight seems like a good predictor of price, since the linear relationship is strong.
##### We can examine the correlation between 'curbweight' and 'price' and see it's approximately 0.83.

<br>

#### Between Engine Size and Price

In [None]:
sns.regplot(x = 'enginesize', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['enginesize','price']].corr()

##### <b>Conclusion : </b>As the enginesize goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Engine Size seems like a very good predictor of price, since the linear relationship is strong.
##### We can examine the correlation between 'enginesize' and 'price' and see it's approximately 0.87.

<br>

#### Between Bore ratio and Price

In [None]:
sns.regplot(x = 'boreratio', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['boreratio','price']].corr()

##### <b>Conclusion : </b>As the boreratio goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Bore ratio is only a moderate predictor of price: the regression line slopes upward, but the linear relationship is not very tight.
##### We can examine the correlation between 'boreratio' and 'price' and see it's approximately 0.55.

<br>

#### Between Horsepower and Price

In [None]:
sns.regplot(x = 'horsepower', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['horsepower','price']].corr()

##### <b>Conclusion : </b>As the horsepower goes up, the price goes up: this indicates a positive correlation between these two variables.
##### Horsepower seems like a good predictor of price, since the linear relationship is strong.
##### We can examine the correlation between 'horsepower' and 'price' and see it's approximately 0.81.

<br>

#### Between City MPG and Price

In [None]:
sns.regplot(x = 'citympg', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['citympg','price']].corr()

##### <b>Conclusion : </b>As the citympg goes down, the price goes up: this indicates a negative (inverse) correlation between these two variables.
##### City MPG seems like a moderately good predictor of price: the regression line slopes clearly downward, although the relationship is not very tight.
##### We can examine the correlation between 'citympg' and 'price' and see it's approximately -0.68.

<br>

#### Between Highway MPG and Price

In [None]:
sns.regplot(x = 'highwaympg', y = 'price', data = carsdf)
plt.ylim(0,)
carsdf[['highwaympg','price']].corr()

##### <b>Conclusion : </b>As the highwaympg goes down, the price goes up: this indicates a negative (inverse) correlation between these two variables.
##### Highway MPG seems like a moderately good predictor of price: the regression line slopes clearly downward, although the relationship is not very tight.
##### We can examine the correlation between 'highwaympg' and 'price' and see it's approximately -0.70.

<br><br>

## Analyzing Categorical Variables
<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.</p>


<br>

#### Let's look at the relationship between "fueltype" and "price".

In [None]:
sns.boxplot(x = 'fueltype', y = 'price', data = carsdf)

##### <b>Conclusion : </b> We see that the distributions of price between the different fuel-type categories have a significant overlap,
##### and so fuel-type would not be a good predictor of price.

<br>

#### Let's examine engine "aspiration" and "price"

In [None]:
sns.boxplot(x = 'aspiration', y = 'price', data = carsdf)

##### <b>Conclusion : </b> We see that the distributions of price between the different aspiration categories have a significant overlap,
##### and so aspiration would not be a good predictor of price.

<br>

#### Now, Let's look at the relationship between "doornumber" and "price".

In [None]:
sns.boxplot(x = 'doornumber', y = 'price', data = carsdf)

##### <b>Conclusion : </b> We see that the distributions of price between the different doornumber categories have a significant overlap,
##### and so doornumber would not be a good predictor of price.

<br>

#### Now, examine the relationship between "carbody" and "price"

In [None]:
sns.boxplot(x = 'carbody', y = 'price', data = carsdf)

##### <b>Conclusion : </b> We see that the distributions of price between the different carbody categories have a significant overlap,
##### and so carbody would not be a good predictor of price.

<br>

#### Now, let's look at the relationship between "drivewheel" and "price"

In [None]:
sns.boxplot(x = 'drivewheel', y = 'price', data = carsdf)

##### <b>Conclusion : </b> Here we see that the distribution of price between the different drive-wheels categories differs;
##### as such drive-wheels could potentially be a predictor of price.


<br>

#### Now, we examine the relationship between "enginelocation" and "price"

In [None]:
sns.boxplot(x = 'enginelocation', y = 'price', data = carsdf)

##### <b>Conclusion : </b> Here we see that the distributions of price for the two engine-location categories,
##### front and rear, are distinct enough to make engine-location look like a potentially good predictor of price.
##### However, the two categories have extremely different value counts, so it cannot be relied on as a predictor.
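
The imbalance mentioned above can be checked with `value_counts`. The sketch below uses made-up labels purely to show the call; the real counts should be read from `carsdf['enginelocation']`.

```python
import pandas as pd

# Made-up labels; the real proportions come from the dataset itself.
toy = pd.DataFrame({"enginelocation": ["front"] * 8 + ["rear"] * 2})

# Absolute counts per category, and the same as fractions of the total.
counts = toy["enginelocation"].value_counts()
shares = toy["enginelocation"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

When one category holds only a handful of rows, its boxplot summarizes very few cars, so an apparent price difference is weak evidence.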


<br>

## Checking Statistical Significance of Continuous Variables with Price

<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>

<h3>Pearson Correlation</h3>

<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Total positive linear correlation.</li>
    <li><b>0</b>: No linear correlation; the two variables have no linear relationship, although a nonlinear one may still exist.</li>
    <li><b>-1</b>: Total negative linear correlation.</li>
</ul>
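
<p>As a minimal sketch of what the coefficient measures, the code below computes Pearson's r by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations, and compares it with NumPy's result. The data points are made up for illustration.</p>

```python
import numpy as np

# Made-up, nearly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Deviations of each value from its mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: co-variation of x and y, scaled by their individual variation.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print("manual:", r_manual, "numpy:", r_numpy)
```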

<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p><br>


Sometimes we would like to know the significance of the correlation estimate.

<b>P-value</b>: 

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one measured if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05, which means that we call a correlation significant only when such a result would occur by chance less than 5% of the time.</p>

By convention, when the

<ul>
    <li>p-value is $<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
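
<p>The convention above can be written as a small helper. The function name <code>describe_evidence</code> is hypothetical and not part of any library; a p-value of exactly 0.1, which the list leaves open, is treated here as "no evidence".</p>

```python
def describe_evidence(p_value):
    """Map a p-value to the evidence label used in the convention above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    # Assumption: the boundary value 0.1 is grouped with "no evidence".
    return "none"


for p in (0.0004, 0.03, 0.08, 0.4):
    print(p, "->", describe_evidence(p))
```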


<br>

### Wheel-Base vs Price
##### Let's calculate the  Pearson Correlation Coefficient and P-value of 'wheelbase' and 'price'. 

In [None]:
pc, p = stats.pearsonr(carsdf['wheelbase'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship is only moderately strong (~0.577).</p>


<br>

### Car-length vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'carlength' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['carlength'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between carlength and price is statistically significant, although the linear relationship is only moderately strong (~0.682).</p>


<br>

### Car-width vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'carwidth' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['carwidth'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between carwidth and price is statistically significant, and the linear relationship is fairly strong (~0.759).</p>


<br>

### Curb-weight vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'curbweight' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['curbweight'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between curbweight and price is statistically significant, and the linear relationship is strong (~0.835).</p>

<br>

### Engine-size vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'enginesize' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['enginesize'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between enginesize and price is statistically significant, and the linear relationship is strong (~0.874).</p>

<br>

### Bore-ratio vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'boreratio' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['boreratio'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between boreratio and price is statistically significant, although the linear relationship is only moderately strong (~0.553).</p>

<br>

### Horsepower vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['horsepower'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is strong (~0.808).</p>

<br>

### City-mpg vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'citympg' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['citympg'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between citympg and price is statistically significant, although the linear relationship is negative and only moderately strong (~-0.685).</p>

<br>

### Highway-mpg vs Price
##### Let's calculate the Pearson Correlation Coefficient and P-value of 'highwaympg' and 'price'.

In [None]:
pc, p = stats.pearsonr(carsdf['highwaympg'], carsdf['price'])
print("Pearson Correlation Coefficient is",pc,"with P-value :",p)

<h5><b>Conclusion : </b></h5>
<p>Since the p-value is $<$ 0.001, the correlation between highwaympg and price is statistically significant, although the linear relationship is negative and only moderately strong (~-0.697).</p>

<br>

## Checking Statistical Significance of Categorical Variables with Price

<h3>ANOVA: Analysis of Variance</h3>
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>: The p-value tells us how statistically significant the calculated F-test score is.</p>

<p>If our price variable is strongly related to the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.</p>
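
<p>To make the F-test score less of a black box, the sketch below computes it by hand as the ratio of between-group variance to within-group variance and checks the result against <code>stats.f_oneway</code>. The three groups are made-up numbers, not prices from the dataset.</p>

```python
import numpy as np
from scipy import stats

# Three made-up groups with clearly different means.
groups = [
    np.array([10.0, 12.0, 11.0]),
    np.array([20.0, 19.0, 21.0]),
    np.array([15.0, 14.0, 16.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between groups / mean square within groups.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual F:", f_manual, "scipy F:", f_scipy, "p-value:", p_scipy)
```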


### Drive Wheels

<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see whether the different types of 'drive-wheels' impact 'price', we first group the data.</p>


In [None]:
groupeddf = carsdf[['drivewheel','price']].groupby('drivewheel')
groupeddf.head(2)

In [None]:
groupeddf['price'].mean().to_frame()

<br>

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.

In [None]:
f_val, p_val = stats.f_oneway(groupeddf.get_group('fwd')['price'], groupeddf.get_group('rwd')['price'], groupeddf.get_group('4wd')['price'])
print("ANOVA Results : F-Score =",f_val,"P-value =",p_val)

This is a great result, with a large F-test score showing that the group means differ substantially and a P-value of almost 0 implying almost certain statistical significance.<br>
But does this mean that every pair of the three tested groups differs this strongly?

<br>

#### Separately : fwd and rwd

In [None]:
f_val, p_val = stats.f_oneway(groupeddf.get_group('fwd')['price'], groupeddf.get_group('rwd')['price'])
print("ANOVA Results : F-Score =",f_val,"P-value =",p_val)

<br>

#### rwd and 4wd

In [None]:
f_val, p_val = stats.f_oneway(groupeddf.get_group('rwd')['price'], groupeddf.get_group('4wd')['price'])
print("ANOVA Results : F-Score =",f_val,"P-value =",p_val)

<br>

#### fwd and 4wd

In [None]:
f_val, p_val = stats.f_oneway(groupeddf.get_group('fwd')['price'], groupeddf.get_group('4wd')['price'])
print("ANOVA Results : F-Score =",f_val,"P-value =",p_val)
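
The three pairwise tests above repeat the same call, so they can also be written as a loop over all pairs of groups. The sketch below does this with `itertools.combinations` on a made-up frame; the prices are illustrative only. Note that running several pairwise tests inflates the chance of a false positive, so a post-hoc procedure such as Tukey's HSD is usually preferred for a formal analysis.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Made-up prices for the three drive-wheel categories.
toy = pd.DataFrame({
    "drivewheel": ["fwd"] * 4 + ["rwd"] * 4 + ["4wd"] * 4,
    "price": [7000.0, 8000.0, 9000.0, 8500.0,
              18000.0, 21000.0, 19500.0, 23000.0,
              9000.0, 11000.0, 10000.0, 12000.0],
})

# Collect the price values of each group once.
prices_by_group = {
    name: grp["price"].to_numpy() for name, grp in toy.groupby("drivewheel")
}

# Run a one-way ANOVA for every pair of groups.
pairwise = {}
for a, b in combinations(sorted(prices_by_group), 2):
    f_val, p_val = stats.f_oneway(prices_by_group[a], prices_by_group[b])
    pairwise[(a, b)] = (f_val, p_val)
    print(f"{a} vs {b}: F-Score = {f_val:.2f}, P-value = {p_val:.4f}")
```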

<br>

<p>Now, we have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:</p>

Continuous numerical variables:

<ul>
    <li>Length</li>
    <li>Width</li>
    <li>Curb-weight</li>
    <li>Engine-size</li>
    <li>Horsepower</li>
    <li>City-mpg</li>
    <li>Highway-mpg</li>
    <li>Wheel-base</li>
    <li>Bore-ratio</li>
</ul>
    
Categorical variables:
<ul>
    <li>Drive-wheels</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


<br>

## Train_Test_Split
#####  In this section, we are splitting the data into Train and Test Set for training and evaluating the model to predict car price.

<br>

#### Fetching the important variables that affect the price of the car from the Dataset

In [None]:
requireddf = carsdf[['carlength','carwidth','drivewheel','curbweight','enginesize','horsepower','citympg','highwaympg','wheelbase','boreratio']]
requireddf.head()

<br>

#### Replacing the string data with numerical data for model development

In [None]:
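# Encode the 'drivewheel' labels as integers; only this column is mapped, the rest is left untouched
requireddf = requireddf.assign(drivewheel = requireddf['drivewheel'].map({'fwd' : 0, 'rwd' : 1, '4wd' : 2}))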
requireddf.head()
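
The integer codes 0, 1, 2 impose an order on the drive-wheel categories that a linear model will treat as meaningful. One common alternative, shown here only as a sketch and not as the approach used in this notebook, is one-hot encoding with `pd.get_dummies`. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows; column names mirror the dataset.
toy = pd.DataFrame({
    "drivewheel": ["fwd", "rwd", "4wd", "fwd"],
    "enginesize": [109, 152, 136, 97],
})

# One indicator column per category; drop_first removes one of them
# ('4wd', the first in sorted order) to avoid a redundant column.
encoded = pd.get_dummies(toy, columns=["drivewheel"], drop_first=True, dtype=int)

print(encoded)
```

With this encoding, '4wd' becomes the baseline category, and each remaining column gets its own coefficient in the regression.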

<br>

#### Converting the Dataset into Arrays

In [None]:
X = np.asarray(requireddf)
X[0:10]

In [None]:
Y = np.asarray(carsdf['price'])
Y[0:10]

<br>

#### Splitting the Data into Training and Test Set

In [None]:
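# train_test_split must be imported before it is used in this cell
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 4)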
print("Training Set X and Y:",X_train.shape,Y_train.shape)
print("Test Set X and Y:",X_test.shape,Y_test.shape)

<br>

## Model Development

#### Importing Scikit-Learn library for Model Development

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

<br>

#### Creating Model

In [None]:
price_model = linear_model.LinearRegression()
price_model.fit(X_train, Y_train)
Y_rslt = price_model.predict(X_test)
Y_rslt[0:10]

<br>

## Model Evaluation
##### Checking the Accuracy of the Model

In [None]:
from sklearn.metrics import r2_score

In [None]:
print("Mean Absolute Error : %.2f" % np.mean(np.absolute(Y_rslt - Y_test)))
print("R2_score : %.2f" % r2_score(Y_test, Y_rslt))
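
To make the two reported numbers easier to interpret, the sketch below computes the mean absolute error, the root mean squared error and the R2 score by hand on made-up predictions. The arrays are illustrative and are not outputs of the model above.

```python
import numpy as np

# Made-up true prices and predictions, each off by exactly 1000.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 24000.0])

errors = y_pred - y_true

# Average size of the error, in the same unit as price.
mae = np.mean(np.abs(errors))
# Like MAE, but penalizes large errors more heavily.
rmse = np.sqrt(np.mean(errors ** 2))
# Share of the variance in y_true explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE : %.2f" % mae)
print("RMSE: %.2f" % rmse)
print("R2  : %.3f" % r2)
```

MAE and RMSE are expressed in price units, so they are best judged against the typical car price in the test set, while R2 is unitless.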

<br>

## Conclusion

We conclude that <b>a multiple linear regression (MLR) model is a reasonable choice</b> for predicting price from our dataset. This result makes sense, since the dataset contains many variables, and we know that more than one of them is a potential predictor of the final car price.

<br>