> # Car Price Prediction - Regression
### Business Objective
The objective of this analysis is to provide a reliable regression model that predicts the price of a car as accurately as possible from the variables provided. The idea is for the model to be used in the future for any new cars that are added to the dataset.

### Initial Data Pull and Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Read in the Data Dictionary provided to give clearer insight into the dataset's variables
df_dict = pd.read_excel('../input/car-data/Data Dictionary - carprices.xlsx')
df_dict.rename(columns={"Unnamed: 7": "Column_Name", "Unnamed: 11": "Description"}, inplace=True)
df_dict = df_dict.drop(df_dict.index[0:3])
df_dict = df_dict.drop(df_dict.index[26:28])
df_dict = df_dict.filter(["Column_Name","Description"])
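#None removes the column width limit (the older -1 value is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)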
df_dict

In [None]:
#Read in CarPrice_Assignment csv dataset
df = pd.read_csv('../input/car-data/CarPrice_Assignment.csv')
print(df.shape)
df.head()

In [None]:
#Look closer at the CarPrices dataset using describe() and info()
df.describe()

In [None]:
df.info()

# Data Preprocessing

In [None]:
#When looking at the CarName column, you can see that the Manufacturer is at the beginning of the values. We want to extract this as a new column
Manufacturer = df['CarName'].apply(lambda x : x.split(' ')[0])
Manufacturer.unique()

Looking through the Manufacturer names pulled from the CarName column, there appear to be misspellings and variations on a few of the names. These will be corrected.

'maxda' -> 'mazda'

'Nissan' -> 'nissan'

'porcshce' -> 'porsche'

'toyouta' -> 'toyota'

'vokswagen' -> 'volkswagen'

'vw' -> 'volkswagen'

In [None]:
Manufacturer = Manufacturer.replace('maxda', 'mazda')
Manufacturer = Manufacturer.replace('Nissan', 'nissan')
Manufacturer = Manufacturer.replace('porcshce', 'porsche')
Manufacturer = Manufacturer.replace('toyouta', 'toyota')
Manufacturer = Manufacturer.replace('vokswagen', 'volkswagen')
Manufacturer = Manufacturer.replace('vw', 'volkswagen')
Manufacturer.unique()
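
The chained `replace` calls above can also be written with a single mapping. The sketch below is only an illustration: `raw` is a small hypothetical sample rather than the real CarName column, and `corrections` is a name introduced here.

```python
import pandas as pd

# Hypothetical sample of raw manufacturer names, including the misspellings listed above
raw = pd.Series(["maxda", "Nissan", "porcshce", "toyouta", "vokswagen", "vw", "audi"])

# One mapping holds every correction, so adding a new fix is a one-line change
corrections = {
    "maxda": "mazda",
    "Nissan": "nissan",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
}

# Series.replace accepts a dict of {old_value: new_value}
cleaned = raw.replace(corrections)
print(sorted(cleaned.unique()))
```

Keeping the corrections in one dictionary also makes it easier to check them against the list in the markdown cell.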

In [None]:
df.insert(1,'manufacturer',Manufacturer)
df.head()

In [None]:
#Since there are so many unique Car Names compared to the overall size of the dataset, I will remove 'CarName' now that we have Manufacturer to be used instead.
df.drop(['CarName'],axis=1,inplace=True)
df.head()

# Exploratory Data Analysis (EDA)

In [None]:
#Looking into the dependent variable "price"
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
plt.title('Price Boxplot')
sns.boxplot(y="price", data=df)

plt.subplot(1,2,2)
plt.title('Price Distribution')
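#distplot is deprecated in recent seaborn versions; histplot with a KDE overlay gives a comparable view
sns.histplot(df["price"], kde=True, stat="density")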

In [None]:
#Looking at how the different manufacturers compare on price
ax  = sns.boxplot(x="manufacturer", y="price", data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)

In [None]:
#Getting counts of each manufacturer as well
ax = sns.countplot(x="manufacturer", data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)

### Looking into the Categorical variables

In [None]:
df.select_dtypes(include=['object'])

#### Countplots

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,3,1)
plt.title('Fuel Type')
sns.countplot(x="fueltype", data=df)

plt.subplot(1,3,2)
plt.title('Aspiration')
sns.countplot(x="aspiration", data=df)

plt.subplot(1,3,3)
plt.title('Number of Doors')
sns.countplot(x="doornumber", data=df)

plt.show()

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,3,1)
plt.title('Carbody')
sns.countplot(x="carbody", data=df)

plt.subplot(1,3,2)
plt.title('Drivewheel')
sns.countplot(x="drivewheel", data=df)

plt.subplot(1,3,3)
plt.title('Engine Location')
sns.countplot(x="enginelocation", data=df)

plt.show()

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,3,1)
plt.title('Engine Type')
sns.countplot(x="enginetype", data=df)

plt.subplot(1,3,2)
plt.title('Cylinder Number')
sns.countplot(x="cylindernumber", data=df)

plt.subplot(1,3,3)
plt.title('Fuel System')
sns.countplot(x="fuelsystem", data=df)

plt.show()

#### Boxplots

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(1,3,1)
plt.title('Fuel Type')
sns.boxplot(x="fueltype", y="price", data=df)

plt.subplot(1,3,2)
plt.title('Aspiration')
sns.boxplot(x="aspiration", y="price", data=df)

plt.subplot(1,3,3)
plt.title('Number of Doors')
sns.boxplot(x="doornumber", y="price", data=df)

plt.show()

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(1,3,1)
plt.title('Carbody')
sns.boxplot(x="carbody", y="price", data=df)

plt.subplot(1,3,2)
plt.title('Drivewheel')
sns.boxplot(x="drivewheel", y="price", data=df)

plt.subplot(1,3,3)
plt.title('Engine Location')
sns.boxplot(x="enginelocation", y="price", data=df)

plt.show()

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(1,3,1)
plt.title('Engine Type')
sns.boxplot(x="enginetype", y="price", data=df)

plt.subplot(1,3,2)
plt.title('Cylinder Number')
sns.boxplot(x="cylindernumber", y="price", data=df)

plt.subplot(1,3,3)
plt.title('Fuel System')
sns.boxplot(x="fuelsystem", y="price", data=df)

plt.show()

### Categorical Variable Conclusions
After looking at the countplots and boxplots for the categorical variables, the number of doors ("doornumber") does not seem to play much of a role in the price. Engine location ("enginelocation") has only a few records with the value "rear", while the "front" records have prices across the whole range, so it offers little to separate prices. I will therefore not include either variable in the model later.

### Looking into the Numerical variables

In [None]:
df.select_dtypes(include=['int64', 'float64'])

In [None]:
#Looking to understand the symboling variable
plt.figure(figsize=(18,5))

plt.subplot(1,3,1)
sns.boxplot(x="symboling", y="price", data=df)

plt.subplot(1,3,2)
sns.countplot(x="symboling", data=df)

plt.subplot(1,3,3)
sns.scatterplot(x="symboling", y="price", data=df)

In [None]:
#Comparing the other numerical variables to price
plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
plt.title('Wheel Base')
sns.scatterplot(x="wheelbase", y="price", data=df)

plt.subplot(1,3,2)
plt.title('Engine Size')
sns.scatterplot(x="enginesize", y="price", data=df)

plt.subplot(1,3,3)
plt.title('Curb Weight')
sns.scatterplot(x="curbweight", y="price", data=df)

plt.show()

In [None]:
plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
plt.title('Car Length')
sns.scatterplot(x="carlength", y="price", data=df)

plt.subplot(1,3,2)
plt.title('Car Width')
sns.scatterplot(x="carwidth", y="price", data=df)

plt.subplot(1,3,3)
plt.title('Car Height')
sns.scatterplot(x="carheight", y="price", data=df)

plt.show()

In [None]:
plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
plt.title('Bore Ratio')
sns.scatterplot(x="boreratio", y="price", data=df)

plt.subplot(1,3,2)
plt.title('Stroke')
sns.scatterplot(x="stroke", y="price", data=df)

plt.subplot(1,3,3)
plt.title('Compression Ratio')
sns.scatterplot(x="compressionratio", y="price", data=df)

plt.show()

In [None]:
plt.figure(figsize=(23,5))

plt.subplot(1,4,1)
plt.title('Horsepower')
sns.scatterplot(x="horsepower", y="price", data=df)

plt.subplot(1,4,2)
plt.title('Peak RPM')
sns.scatterplot(x="peakrpm", y="price", data=df)

plt.subplot(1,4,3)
plt.title('City MPG')
sns.scatterplot(x="citympg", y="price", data=df)

plt.subplot(1,4,4)
plt.title('Highway MPG')
sns.scatterplot(x="highwaympg", y="price", data=df)

plt.show()

### Numeric Variable Conclusions
After looking at the scatter plots, the numeric variables symboling, carheight, compressionratio and peakrpm do not seem to have a significant correlation with price and will be removed in the restructuring step before modeling.
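
The visual judgement above can be backed with numbers by ranking the numeric variables by their absolute correlation with price. The following is a minimal sketch on synthetic data, not the car dataset: the `toy` frame and its values are assumptions made up for illustration, with `peakrpm` deliberately generated independently of price.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
enginesize = rng.normal(130, 40, n)

# Synthetic stand-in for the real data: price depends on enginesize only
toy = pd.DataFrame({
    "enginesize": enginesize,
    "peakrpm": rng.normal(5000, 500, n),  # unrelated to price by construction
    "price": 100 * enginesize + rng.normal(0, 1000, n),
})

# Absolute Pearson correlation of every variable with price, strongest first
corr_with_price = toy.corr()["price"].drop("price").abs().sort_values(ascending=False)
print(corr_with_price)
```

On the real `df`, the same ranking would need the numeric columns selected first (for example with `select_dtypes`, as in the cell above), since `corr` only makes sense for numeric data.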

I also want to see how these variables are correlated with each other.

In [None]:
sns.pairplot(df[['price','wheelbase','enginesize','curbweight','carlength','carwidth','boreratio','stroke','horsepower','citympg','highwaympg']])
plt.show()

In [None]:
sns.heatmap(df[['price','wheelbase','enginesize','curbweight','carlength','carwidth','boreratio','stroke','horsepower','citympg','highwaympg']].corr())
plt.show()

Based on the pairplots and the correlation heatmap, numerous variables appear to have moderate to high correlations with each other. For example, carlength and wheelbase seem to be positively correlated, but citympg and highwaympg are the two that jump out to me. Since they are so similar, I will perform some feature engineering to combine these variables.

## Develop new variable

In [None]:
#As mentioned above, citympg and highwaympg are strongly correlated, which makes sense logically:
    #both describe the car's fuel economy (how many miles it can drive per gallon).

#I will combine these 2 variables into one by adding the values and dividing by 2. This will give a rough "average mpg".

df["avgmpg"] = (df["citympg"]+df["highwaympg"])/2
df["avgmpg"].head(10)


# Restructure the Data

In [None]:
df_cars = df[['price', 'manufacturer', 'fueltype', 'aspiration','carbody','drivewheel','enginetype','cylindernumber','fuelsystem','wheelbase','enginesize','curbweight','carlength','carwidth','boreratio','stroke','horsepower','avgmpg']]
df_cars

### Setting up dummy variables from the categorical variables

In [None]:
df_dummy = pd.get_dummies(df_cars['manufacturer'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('manufacturer', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['fueltype'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('fueltype', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['aspiration'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('aspiration', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['carbody'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('carbody', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['drivewheel'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('drivewheel', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['enginetype'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('enginetype', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['cylindernumber'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('cylindernumber', axis = 1, inplace=True)

df_dummy = pd.get_dummies(df_cars['fuelsystem'])
df_cars = pd.concat([df_cars, df_dummy], axis = 1)
df_cars.drop('fuelsystem', axis = 1, inplace=True)

print(df_cars.shape)
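
For reference, pandas can encode several categorical columns in one call. This is a hedged sketch on a tiny made-up frame (`toy` and its values are hypothetical). Note that, unlike the cell above, `get_dummies` with `columns=` prefixes each new column with the source column name, so the resulting names differ from the ones used in this notebook.

```python
import pandas as pd

# Tiny hypothetical frame with two of the categorical columns used above
toy = pd.DataFrame({
    "price": [13495.0, 16500.0, 7898.0],
    "fueltype": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std"],
})

# Encode both columns at once; the originals are dropped automatically
encoded = pd.get_dummies(toy, columns=["fueltype", "aspiration"])
print(encoded.columns.tolist())
```

The prefixes make it clear which variable each dummy came from. For an unregularized linear regression, passing `drop_first=True` is a common way to avoid perfectly collinear dummy columns, though that would be a change from the encoding used here.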

### Splitting dataset into Training and Testing sets

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

y = df_cars['price']
X = df_cars.drop(['price'], axis = 1)

X_train_org, X_test_org, y_train, y_test = train_test_split(X, y, random_state = 0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_org)
#.fit_transform first fits the original data and then transforms it
X_test = scaler.transform(X_test_org)

print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

# Building a Model

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics


lreg = LinearRegression()
lreg.fit(X_train, y_train)
print("R2 Training Score: ", lreg.score(X_train, y_train))
print("R2 Testing Score: ", lreg.score(X_test, y_test))
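
R2 alone does not say how far off the predictions are in price units. A minimal sketch of error metrics from `sklearn.metrics`, using three made-up values rather than the actual model output (`y_true` and `y_pred` here are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mae = mean_absolute_error(y_true, y_pred)            # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # square root of mean squared error
r2 = r2_score(y_true, y_pred)                        # 1 - SS_res / SS_tot

print(mae, rmse, r2)
```

With the fitted model, the same functions could be called as `mean_absolute_error(y_test, lreg.predict(X_test))`, which would report the typical error in the same units as price.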

In [None]:
print(lreg.intercept_)
lreg.coef_

The coefficients and intercept above are not especially practical to interpret on their own, but they can still be interesting to look at. We will use this model to run a prediction on the test set.

### Linear Regression Prediction

In [None]:
test_predict = lreg.predict(X_test)
test_predict = pd.DataFrame(test_predict,columns=['Predicted_Price'])
test_predict['Predicted_Price'] = round(test_predict['Predicted_Price'],2)

y_test_index = y_test.reset_index()
y_test_index = y_test_index.drop(columns='index', axis = 1)
test_predict = pd.concat([y_test_index, test_predict], axis = 1)
test_predict.head(15)

Comparing the predictions to the listed prices, you can see examples where the values are very similar, but there are also values that are quite different. The biggest example I can see is index 3, where the predicted price is -4707.10, which is completely invalid since a price cannot be negative. Let's look at how these predictions compare to the price in a regplot.

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.scatterplot(x="price", y="Predicted_Price", data=test_predict)

plt.subplot(1,2,2)
sns.regplot(x="price", y="Predicted_Price", data=test_predict)

You can see that the records with prices/prediction between 5K-20K are fairly close to the prediction line for the most part while values >$20K appear further away. To me this makes sense as the vast majority of prices in the current dataset are in this range while there are fewer cars/records with higher prices.
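
A plain linear model can produce negative prices, as seen at index 3. One option, which is not used in this notebook, is to fit the regression on the logarithm of the price so that predictions are always positive after transforming back. The sketch below uses scikit-learn's `TransformedTargetRegressor` on synthetic data; `X_toy` and `y_toy` are assumptions made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(100, 3))
# Positive, right-skewed synthetic "prices"
y_toy = np.exp(8 + 2 * X_toy[:, 0] + rng.normal(0, 0.2, 100))

# Fit on log(y) and map predictions back with exp, so they cannot be negative
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds.min())
```

Whether this improves the fit on the car data would have to be checked against the same test split; it changes what the model optimizes (relative rather than absolute error).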

While this is not a bad model, I wanted to see how using L1 and L2 regularization could affect the outcome of the model. Below I will run Lasso and Ridge Regression to see how they compare to the base Linear Regression.

### Lasso Regression
Lasso regression is a linear least squares model with L1 regularization. Its loss function is the usual least squares error between the true y values and the estimated y values, plus a penalty equal to alpha times the sum of the absolute values of the coefficients. The penalty shrinks the coefficients and can set some of them to exactly zero.
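
A practical effect of L1 regularization is that larger alpha values drive more coefficients to exactly zero. The following is a minimal sketch on synthetic data (not the car dataset); `X_toy`, `y_toy` and `nonzero_counts` are names introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
# Only the first two features influence the target
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.5, 200)

# Count how many coefficients survive at each regularization strength
nonzero_counts = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    lasso_toy = Lasso(alpha=alpha).fit(X_toy, y_toy)
    nonzero_counts[alpha] = int(np.sum(lasso_toy.coef_ != 0))

print(nonzero_counts)
```

At the largest alpha every coefficient is removed and the model predicts only the mean, which is why the choice of alpha matters so much for Lasso.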

In [None]:
from  sklearn.linear_model import Lasso

#Testing different alpha values for the L1 regularization
alpha_range = [0.01, 0.1, 1, 10, 100]
train_score_list = []
test_score_list = []

for alpha in alpha_range: 
    lasso = Lasso(alpha)
    lasso.fit(X_train,y_train)
    train_score_list.append(lasso.score(X_train,y_train))
    test_score_list.append(lasso.score(X_test, y_test))

In [None]:
#Comparing different alpha values to see which produces the best scores
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(alpha_range, train_score_list, c = 'g', label = 'Train Score')
plt.plot(alpha_range, test_score_list, c = 'b', label = 'Test Score')
plt.xscale('log')
plt.legend(loc = 3)
plt.xlabel(r'$\alpha$')

The alpha values 0.01, 0.1, 1 and 10 seem to be overfitting the data, so I will try alpha = 100.

In [None]:
lasso = Lasso(100)
lasso.fit(X_train,y_train)
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))

These R-Squared scores are quite similar to those of the linear regression model (a somewhat lower training score but a slightly higher test score). We can see how a prediction using this model compares to the linear regression prediction.

### Lasso Regression Prediction

In [None]:
test_predict = lasso.predict(X_test)
test_predict = pd.DataFrame(test_predict,columns=['Predicted_Price'])
test_predict['Predicted_Price'] = round(test_predict['Predicted_Price'],2)
y_test_index = y_test.reset_index()
y_test_index = y_test_index.drop(columns='index', axis = 1)
test_predict = pd.concat([y_test_index, test_predict], axis = 1)
print(test_predict.head(15))
sns.regplot(x="price", y="Predicted_Price", data=test_predict)

Again not a bad model, it seems to have fixed the prediction that was giving a negative value, but when looking at the predictions compared to the price there are larger error terms.

### Ridge Regression
Ridge regression is a linear least squares model with L2 regularization. Its loss function is the sum of the squared differences (errors) between the true y values and the estimated y values, plus a penalty equal to alpha times the sum of the squared coefficients. The penalty shrinks the coefficients toward zero but, unlike Lasso, does not usually set them to exactly zero.
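
The shrinking effect of L2 regularization can be seen by tracking the size of the coefficient vector as alpha grows. This is a minimal sketch on synthetic data (not the car dataset); `X_toy`, `y_toy` and `coef_norms` are names introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(0, 1.0, 50)

# Euclidean norm of the fitted coefficients for each alpha
coef_norms = []
for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge_toy = Ridge(alpha=alpha).fit(X_toy, y_toy)
    coef_norms.append(float(np.linalg.norm(ridge_toy.coef_)))

print(coef_norms)
```

The norms decrease as alpha increases, while the individual coefficients stay nonzero even at the largest alpha.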

In [None]:
from  sklearn.linear_model import Ridge

#Testing different alpha values for the L2 regularization
x_range = [0.01, 0.1, 1, 10, 100]
train_score_list = []
test_score_list = []

for alpha in x_range: 
    ridge = Ridge(alpha)
    ridge.fit(X_train,y_train)
    train_score_list.append(ridge.score(X_train,y_train))
    test_score_list.append(ridge.score(X_test, y_test))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x_range, train_score_list, c = 'g', label = 'Train Score')
plt.plot(x_range, test_score_list, c = 'b', label = 'Test Score')
plt.xscale('log')
plt.legend(loc = 3)
plt.xlabel(r'$\alpha$')

In [None]:
print(train_score_list)
print(test_score_list)

Alpha = 1 gives the best results for test while still having a high score for train, so I will use that alpha value for the model.

### Ridge Regression Prediction

In [None]:
ridge = Ridge(1)
ridge.fit(X_train,y_train)
test_predict = ridge.predict(X_test)
test_predict = pd.DataFrame(test_predict,columns=['Predicted_Price'])
test_predict['Predicted_Price'] = round(test_predict['Predicted_Price'],2)
y_test_index = y_test.reset_index()
y_test_index = y_test_index.drop(columns='index', axis = 1)
test_predict = pd.concat([y_test_index, test_predict], axis = 1)
test_predict.head(15)

In [None]:
sns.regplot(x="price", y="Predicted_Price", data=test_predict)

This appears to give the best model and prediction. It still has higher error terms once the prices go beyond ~25K, but that is a current limitation of this dataset. 

## Conclusion

As with any model, as additional data is collected or new variables become available we could re-evaluate the model and see what happens, but for the variables provided in the dataset, once cleaned and preprocessed, a Ridge Regression model with alpha value = 1 provides the best price prediction on the cars. It gives an R-Squared for the training split of 95.9% and the testing split of 87.9%. 

#### Next Steps & Future Analysis

An interesting exercise going beyond this analysis could be seeing how feature reduction could affect the models above or seeing how different regression techniques/algorithms could be used to create better predictions. If you have any suggestions please feel free to comment, I would love to hear any feedback or ideas.

# Further Analysis utilizing Cross Validation

Beyond linear, lasso and ridge regression on the train and test split, I want to test how Lasso, Ridge and K Neighbors Regressor perform using cross validation against the entire dataset. This is to simulate another type of scenario, where you want to build a model on your entire dataset while validating it with cross validation.

In [None]:
#Re-show X and y (y = price, X = all other variables from df_cars)
print(X.shape)
X.head()

In [None]:
#Scale the X dataset
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
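
One caveat with scaling the whole dataset up front is that each cross validation fold then sees min/max values computed partly from its own validation rows. A common alternative, sketched here on synthetic data as an assumption rather than as what this notebook does, is to put the scaler and the model in a pipeline so the scaler is refit on the training part of every fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for X and y
X_toy = rng.uniform(0, 100, size=(120, 4))
y_toy = 50 * X_toy[:, 0] + 20 * X_toy[:, 1] + rng.normal(0, 100, 120)

# The scaler is fit inside each fold, on that fold's training rows only
pipe = make_pipeline(MinMaxScaler(), Ridge(alpha=1))
cv_scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(cv_scores.mean())
```

With a dataset this small the difference may be minor, but the pipeline keeps the validation scores free of information from the held-out rows.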

In [None]:
print(y.shape)
y[:5]

### K Neighbors Regressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

nn_list = list(range(1,51))

param_grid = {'n_neighbors': nn_list}

grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
grid_search.fit(X_scaled, y)
print(grid_search.score(X_scaled,y))
print("Best parameters: {}".format(grid_search.best_params_))


Using cross validation on the entire dataset (with X scaled), the grid search finds that n_neighbors = 23 provides the best results. Below I will fit the algorithm with that n_neighbors value.

In [None]:
knn_reg = KNeighborsRegressor(n_neighbors=23)
knn_reg.fit(X_scaled, y)
knn_reg.score(X_scaled, y)

### Lasso

In [None]:
param_grid = {'alpha': [0.01,0.1,1,10,100]}

grid_search = GridSearchCV(Lasso(), param_grid, cv=5)
grid_search.fit(X_scaled, y)
print("Best parameters: {}".format(grid_search.best_params_))

In [None]:
lasso = Lasso(alpha=100)
lasso.fit(X_scaled, y)
lasso.score(X_scaled, y)

### Ridge

In [None]:
param_grid = {'alpha': [0.01,0.1,1,10,100]}

grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(X_scaled, y)
print("Best parameters: {}".format(grid_search.best_params_))

In [None]:
ridge = Ridge(alpha=1)
ridge.fit(X_scaled, y)
ridge.score(X_scaled,y)

### Cross Validation Conclusion

After utilizing cross validation and grid search to find the best hyperparameters for each of the respective models, Ridge regression still provides the best results as previously found when running with the train and test split.
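
When comparing the three models, the cross-validated score stored in `best_score_` is a fairer yardstick than `score(X_scaled, y)`, which is computed on the same rows the model was fit on. A minimal sketch of such a comparison on synthetic data follows; `candidates` and `cv_results` are names introduced here, and the grids are shortened for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for X_scaled and y
X_toy = rng.uniform(0, 1, size=(150, 4))
y_toy = 5000 * X_toy[:, 0] + 2000 * X_toy[:, 1] + rng.normal(0, 300, 150)

candidates = {
    "knn": (KNeighborsRegressor(), {"n_neighbors": [1, 5, 10, 20]}),
    "lasso": (Lasso(), {"alpha": [0.01, 0.1, 1, 10, 100]}),
    "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}),
}

# Store the best parameters and the mean cross-validated R2 for each model
cv_results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5).fit(X_toy, y_toy)
    cv_results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in cv_results.items():
    print(name, params, round(score, 3))
```

Running the same loop with `X_scaled` and `y` would put all three models on the same held-out footing.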