# Car Price Prediction with Regression

In this notebook, I apply several regression algorithms to the [Car Price data from Kaggle](https://www.kaggle.com/goyalshalini93/car-data). My aims are to:
1. Understand the data by exploring and visualizing it.
2. Clean the data and do any necessary feature engineering.
3. Try different regression algorithms on different versions of the data (unaltered, standardized, and normalized).
4. Tune hyperparameters to get better results.

### Sections:
- Data Description & Cleaning
- Exploratory Data Analysis
    - Categorical Features
    - Numerical Features
    - Target Variable
- Winsorization
- Heatmap
- One-Hot-Encoding
- Standardization
- Normalization
- Building Machine Learning Models
    - Linear Regression
    - Lasso Regression
    - Ridge Regression
    - Support Vector Regressor
    - XGBoost
    - LightGBM
- Hyperparameter Tuning
    - Lasso Regression
    - Ridge Regression
    - Support Vector Regressor
    - XGBoost
    - LightGBM
- GridSearchCV Results & Final Comparison

In [None]:
!pip install xlrd
!pip install openpyxl

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats.mstats import winsorize

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor


import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/car-data/CarPrice_Assignment.csv')
data_dict = pd.read_excel('/kaggle/input/car-data/Data Dictionary - carprices.xlsx', skiprows=3)
data_dict = data_dict.iloc[:-2, [7, 11]].rename({'Unnamed: 7': 'Column', 'Unnamed: 11': 'Description'}, axis=1)

# Car Data - Data Description & Cleaning

### Columns and Descriptions

In [None]:
pd.set_option('max_colwidth', 250)
display(data_dict)

In [None]:
df.head()

In [None]:
df['brand'] = df['CarName'].apply(lambda x: x.split(' ')[0])
df['brand'] = df['brand'].replace({'vokswagen': 'volkswagen', 'vw': 'volkswagen', 'maxda': 'mazda',
                                   'Nissan': 'nissan', 'porcshce': 'porsche', 'toyouta': 'toyota'})

In [None]:
df['brand'].unique()
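
The brand is taken as the first space-separated token of `CarName`, and the `replace` mapping folds known misspellings into a single spelling. A minimal sketch of the same two steps on a few made-up names (the `sample` Series is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical car names, invented for illustration only.
sample = pd.Series(['vw rabbit', 'toyouta tercel', 'Nissan clipper', 'mazda rx-4'])

# The first token of each name is treated as the brand.
brand = sample.apply(lambda name: name.split(' ')[0])

# Map misspellings and alternate spellings onto one canonical name.
brand = brand.replace({'vw': 'volkswagen', 'toyouta': 'toyota', 'Nissan': 'nissan'})

print(brand.tolist())
```

Values that are not keys of the mapping (here `mazda`) pass through unchanged, which is why checking `unique()` afterwards is a useful way to spot any spelling that was missed.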

In [None]:
df['symboling'] = df['symboling'].astype('category')

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='O')

In [None]:
df.select_dtypes('object').nunique()

# Exploratory Data Analysis

## Categorical Features

In [None]:
categorical_features = list(df.select_dtypes(['O', 'category']).drop('CarName', axis=1).columns)

In [None]:
plt.figure(figsize=(15, 20))

for i in range(len(categorical_features) - 1):
    plt.subplot(4, 3, i+1)
    plt.title(categorical_features[i] + ' (Value Counts)')
    plt.bar(df[categorical_features[i]].value_counts().index,
            df[categorical_features[i]].value_counts())
    
plt.subplot(4, 3, 11)
plt.title('brand (Value Counts)')
plt.bar(df['brand'].value_counts().index,
        df['brand'].value_counts())
plt.xticks(rotation=90)

plt.show()

In [None]:
for i in range(len(categorical_features)):
    print(df.groupby(categorical_features[i])['price'].mean(), '\n')
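
The loop above prints the mean price per category for each categorical feature. As a small sketch of what one of those group means looks like, here is the same `groupby` on a hypothetical four-row frame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset; the prices are invented for illustration.
toy = pd.DataFrame({
    'fueltype': ['gas', 'gas', 'diesel', 'diesel'],
    'price': [10000.0, 14000.0, 15000.0, 17000.0],
})

# Mean price per category, sorted from highest to lowest.
mean_price = toy.groupby('fueltype')['price'].mean().sort_values(ascending=False)

print(mean_price)
```

Sorting the result in descending order gives an index that can be passed as a plotting order, so that bars appear from the most to the least expensive category.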

In [None]:
plt.figure(figsize=(18, 24))

for i in range(len(categorical_features)-1):
    plt.subplot(4, 3, i+1)
    plt.title(categorical_features[i] + ' (mean price)')
    # Pass x and y as keywords: recent seaborn releases no longer accept them positionally.
    sns.barplot(x=categorical_features[i], y='price', data=df,
                order=df.groupby(categorical_features[i])['price'].mean().sort_values(ascending=False).index, ci=None)
    
plt.subplot(4, 3, 11)
plt.title('brand (mean price)')
sns.barplot(x='brand', y='price', data=df, ci=None,
            order=df.groupby('brand')['price'].mean().sort_values(ascending=False).index)
plt.xticks(rotation=90)

plt.show()

## Numerical Features

In [None]:
numeric_features = list(df.select_dtypes([np.int64, np.float64]).drop('car_ID', axis=1).columns)

In [None]:
plt.figure(figsize=(25, 15))

for i in range(len(numeric_features)):
    plt.subplot(3, 5, i+1)
    plt.title(numeric_features[i], fontsize=20)
    plt.hist(df[numeric_features[i]], bins=15)
    
plt.show()

In [None]:
plt.figure(figsize=(25, 15))

for i in range(len(numeric_features)):
    plt.subplot(3, 5, i+1)
    plt.title(numeric_features[i], fontsize=20)
    plt.boxplot(df[numeric_features[i]])
    
plt.show()

### Correlations between Features & Target Variable

In [None]:
plt.figure(figsize=(10, 10))
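# Restrict to numeric columns explicitly; pandas 2.x raises on string columns in corr().
sns.heatmap(df.select_dtypes('number').corr(), annot=True)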
plt.show()

Highest positive correlations with the **price** column:
- **enginesize**: 0.87
- **curbweight**: 0.84
- **horsepower**: 0.81
- **carwidth**: 0.76

Highest negative correlations with the **price** column:
- **highwaympg**: -0.7
- **citympg**: -0.69
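
The heatmap values are Pearson correlation coefficients: positive when a feature tends to rise with price, negative when it tends to fall. A minimal sketch of the calculation, using hypothetical values invented for illustration (not rows from the dataset) and a helper named `pearson_r` that is my own:

```python
import numpy as np

# Hypothetical values, invented for illustration.
enginesize = np.array([90, 110, 130, 180, 210])
highwaympg = np.array([40, 36, 31, 25, 21])
price = np.array([7000, 9000, 12000, 18000, 24000])


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


r_engine = pearson_r(enginesize, price)  # larger engines, higher prices: positive
r_mpg = pearson_r(highwaympg, price)     # better mileage, lower prices: negative

print(round(r_engine, 2), round(r_mpg, 2))
```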

### Scatter Plot of Features with Highest Correlations and Target Variable

In [None]:
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
plt.scatter(df['enginesize'], df['price'])
plt.title('enginesize')

plt.subplot(2, 3, 2)
plt.scatter(df['curbweight'], df['price'])
plt.title('curbweight')

plt.subplot(2, 3, 3)
plt.scatter(df['horsepower'], df['price'])
plt.title('horsepower')

plt.subplot(2, 3, 4)
plt.scatter(df['carwidth'], df['price'])
plt.title('carwidth')

plt.subplot(2, 3, 5)
plt.scatter(df['highwaympg'], df['price'])
plt.title('highwaympg')

plt.subplot(2, 3, 6)
plt.scatter(df['citympg'], df['price'])
plt.title('citympg')

plt.show()

### Scatter Plot of All Features and Target Variable

In [None]:
plt.figure(figsize=(20, 20))

for i in range(len(numeric_features)):
    plt.subplot(4, 4, i+1)
    plt.scatter(df[numeric_features[i]], df['price'])
    plt.title(numeric_features[i])

plt.show()

## Target Variable (Price)

In [None]:
plt.figure(figsize=(15, 7))

plt.subplot(1, 2, 1)
plt.hist(df['price'], bins=30)
plt.title('Price (Histogram)', fontsize=15)

plt.subplot(1, 2, 2)
plt.boxplot(df['price'])
plt.title('Price (Boxplot)', fontsize=15)

plt.show()

# Winsorization

Features with outliers:
- **carwidth**
- **enginesize**
- **stroke**
- **compressionratio**

In [None]:
df['carwidth_winsorize'] = winsorize(df['carwidth'], limits=[0, 0.1])
df['enginesize_winsorize'] = winsorize(df['enginesize'], limits=[0, 0.1])
df['stroke_winsorize'] = winsorize(df['stroke'], limits=[0.1, 0.1])
df['compressionratio_winsorize'] = winsorize(df['compressionratio'], limits=[0.1, 0.1])

In [None]:
display(df[['carwidth','carwidth_winsorize']].describe())
display(df[['enginesize','enginesize_winsorize']].describe())
display(df[['stroke','stroke_winsorize']].describe())
display(df[['compressionratio','compressionratio_winsorize']].describe())
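
Winsorizing caps extreme values instead of dropping rows: the values in the chosen tail are replaced by the nearest value that lies inside the limit. A small sketch of `limits=[0, 0.1]` on a hypothetical ten-element array (values invented for illustration):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical values with one extreme point at the top.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=[0, 0.1]: leave the lower tail alone and cap the top 10% (here, one value).
# winsorize returns a masked array, so convert it back to a plain ndarray.
capped = np.asarray(winsorize(values, limits=[0, 0.1]))

print(capped)
```

The number of rows stays the same; only the largest value is pulled down to the next largest one. With `limits=[0.1, 0.1]`, as used for `stroke` and `compressionratio`, the same capping is applied to both tails.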

In [None]:
plt.figure(figsize=(8, 20))

plt.subplot(4, 2, 1)
plt.boxplot(df['carwidth'])
plt.title('carwidth')

plt.subplot(4, 2, 2)
plt.boxplot(df['carwidth_winsorize'])
plt.title('carwidth_winsorize')

plt.subplot(4, 2, 3)
plt.boxplot(df['enginesize'])
plt.title('enginesize')

plt.subplot(4, 2, 4)
plt.boxplot(df['enginesize_winsorize'])
plt.title('enginesize_winsorize')

plt.subplot(4, 2, 5)
plt.boxplot(df['stroke'])
plt.title('stroke')

plt.subplot(4, 2, 6)
plt.boxplot(df['stroke_winsorize'])
plt.title('stroke_winsorize')

plt.subplot(4, 2, 7)
plt.boxplot(df['compressionratio'])
plt.title('compressionratio')

plt.subplot(4, 2, 8)
plt.boxplot(df['compressionratio_winsorize'])
plt.title('compressionratio_winsorize')

plt.show()

In [None]:
df = df.drop(['carwidth', 'enginesize', 'stroke', 'compressionratio'], axis=1)

# Heatmap (Numerical features only)

In [None]:
df = df[['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
         'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
         'carlength', 'carheight', 'curbweight', 'enginetype', 'cylindernumber',
        'fuelsystem', 'boreratio', 'horsepower', 'peakrpm', 'citympg',
        'highwaympg', 'brand', 'carwidth_winsorize',
        'enginesize_winsorize', 'stroke_winsorize',
        'compressionratio_winsorize', 'price']]

In [None]:
plt.figure(figsize=(10, 10))
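# Restrict to numeric columns explicitly; pandas 2.x raises on string columns in corr().
sns.heatmap(df.select_dtypes('number').corr(), annot=True)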
plt.show()

# One-Hot-Encoding

In [None]:
df = pd.get_dummies(df.drop('CarName', axis=1))

In [None]:
y = df['price']
X = df.drop(['price', 'car_ID'], axis=1)
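
`pd.get_dummies` expands each non-numeric column into one indicator column per category and leaves numeric columns as they are. A minimal sketch on a hypothetical three-row frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    'horsepower': [111, 68, 154],
    'fueltype': ['gas', 'diesel', 'gas'],
})

# 'fueltype' becomes two indicator columns; 'horsepower' passes through unchanged.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

The indicator columns are named `<column>_<category>`, which is why the encoded frame in this notebook ends up much wider than the original one.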

# Standardization

In [None]:
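# StandardScaler shifts each column to mean 0 and rescales it to unit variance.
# It is fit on the full df here, before any train/test split.
scaler = StandardScaler()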

df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
X_scaled = df_scaled.drop(['price', 'car_ID'], axis=1)
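
Standardization works column by column: each feature is shifted to mean 0 and divided by its standard deviation, so features on very different scales become comparable. A minimal sketch on a hypothetical two-column matrix (values invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales.
features = np.array([[2000.0, 20.0],
                     [2500.0, 30.0],
                     [3000.0, 40.0]])

# Each column is centered and rescaled independently of the others.
scaled = StandardScaler().fit_transform(features)

print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contain the same three values even though the raw numbers differed by two orders of magnitude. This matters most for models whose penalty or distance depends on feature scale, such as Lasso, Ridge, and SVR.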

# Normalization

In [None]:
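# Normalizer rescales each row (sample) to unit L2 norm; it is not a column-wise scaler.
# Because it is applied to df, the row norm also includes the price and car_ID columns.
normalizer = Normalizer()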

df_normalized = pd.DataFrame(normalizer.fit_transform(df), columns=df.columns)
X_normalized = df_normalized.drop(['price', 'car_ID'], axis=1)
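
`Normalizer` works differently from the column-wise scalers: it divides each row by that row's own L2 norm. The sketch below contrasts it with `MinMaxScaler`, which rescales each column to the range [0, 1]. The `rows` array is hypothetical and invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical rows, invented for illustration.
rows = np.array([[3.0, 4.0],
                 [10.0, 0.0]])

# Row-wise: each sample is divided by its own L2 norm (5 and 10 here).
unit_rows = Normalizer().fit_transform(rows)

# Column-wise: each feature is mapped to [0, 1] using its own min and max.
minmax_cols = MinMaxScaler().fit_transform(rows)

print(unit_rows)
print(minmax_cols)
```

Because the row norm depends on every value in the row, the result of `Normalizer` for one feature changes whenever another column in the same row changes.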

In [None]:
print('X:')
display(X.head())

print('X_scaled:')
display(X_scaled.head())

print('X_normalized:')
display(X_normalized.head())

# Building Machine Learning Models

- Linear Regression
- Lasso Regression
- Ridge Regression
- Support Vector Regressor
- XGBoost
- LightGBM

In [None]:
def fit_predict_score(model, X_train, y_train, X_test, y_test):
    """Fit the given model on the training data and predict on the test data.

    Returns a tuple: (training R^2, test R^2, MAE, MSE, RMSE)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    return (train_score, test_score, metrics.mean_absolute_error(y_test, y_pred),
            metrics.mean_squared_error(y_test, y_pred), np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

def model_comparison(X, y):
    """Creates a DataFrame comparing Linear Regression, Lasso, Ridge, SVR (kernel: linear),
    XGBRegressor, and LGBMRegressor scores and errors."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    
    lrm_train_score, lrm_test_score, lrm_mae, lrm_mse, lrm_rmse = fit_predict_score(LinearRegression(), X_train, y_train, X_test, y_test)
    lasso_train_score, lasso_test_score, lasso_mae, lasso_mse, lasso_rmse = fit_predict_score(Lasso(), X_train, y_train, X_test, y_test)
    ridge_train_score, ridge_test_score, ridge_mae, ridge_mse, ridge_rmse = fit_predict_score(Ridge(), X_train, y_train, X_test, y_test)
    svr_train_score, svr_test_score, svr_mae, svr_mse, svr_rmse = fit_predict_score(SVR(kernel='linear'), X_train, y_train, X_test, y_test)
    xgbr_train_score, xgbr_test_score, xgbr_mae, xgbr_mse, xgbr_rmse = fit_predict_score(XGBRegressor(), X_train, y_train, X_test, y_test)
    lgbr_train_score, lgbr_test_score, lgbr_mae, lgbr_mse, lgbr_rmse = fit_predict_score(LGBMRegressor(), X_train, y_train, X_test, y_test)
    
    models = ['Linear Regression', 'Lasso Regression', 'Ridge Regression',
          'SVM (kernel:linear)', 'XGBoost (Regressor)', 'LightGBM (Regressor)']
    train_score = [lrm_train_score, lasso_train_score, ridge_train_score, svr_train_score, xgbr_train_score, lgbr_train_score]
    test_score = [lrm_test_score, lasso_test_score, ridge_test_score, svr_test_score, xgbr_test_score, lgbr_test_score]
    mae = [lrm_mae, lasso_mae, ridge_mae, svr_mae, xgbr_mae, lgbr_mae]
    mse = [lrm_mse, lasso_mse, ridge_mse, svr_mse, xgbr_mse, lgbr_mse]
    rmse = [lrm_rmse, lasso_rmse, ridge_rmse, svr_rmse, xgbr_rmse, lgbr_rmse]
    
    model_comparison = pd.DataFrame(data=[models, train_score, test_score, mae, mse, rmse]).T.rename({0: 'Model', 1:'Training Score',
                                                                                    2: 'Test Score',
                                                                                    3:'Mean Absolute Error',
                                                                                    4: 'Mean Squared Error',
                                                                                    5:'Root Mean Squared Error'}, axis=1)
    
    return model_comparison
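
The comparison table reports a score plus three error measures. For scikit-learn regressors, `score()` returns the coefficient of determination R², and the errors are MAE, MSE, and RMSE. A small sketch on hypothetical targets and predictions (values invented for illustration) shows how they relate:

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, invented for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mae = metrics.mean_absolute_error(y_true, y_hat)  # mean of |error|
mse = metrics.mean_squared_error(y_true, y_hat)   # mean of error squared
rmse = np.sqrt(mse)                               # back in the units of the target
r2 = metrics.r2_score(y_true, y_hat)              # 1 - SS_residual / SS_total

print(mae, mse, rmse, r2)
```

The errors here are 1, 0, and 2, so MAE is 1 and MSE is 5/3. RMSE weights the larger miss more heavily than MAE does, which is why the two can rank models differently.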

In [None]:
print("Default:")
display(model_comparison(X, y))
print("Scaled:")
display(model_comparison(X_scaled, y))
print("Normalized:")
display(model_comparison(X_normalized, y))

# Hyperparameter Tuning

## Lasso

In [None]:
params = {'alpha': [10**i for i in range(1, 5)] + [round(0.1**i,5) for i in range(5)]}

lasso_grid = GridSearchCV(estimator = Lasso(),
                        param_grid = params,                        
                        cv = 5)

lasso_grid.fit(X_scaled, y)

In [None]:
print('Best Score: ', lasso_grid.best_score_)
print('Best Params: ', lasso_grid.best_params_)
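
`GridSearchCV` fits the estimator once per parameter value per fold and reports the mean validation score for each value. A minimal sketch on synthetic data, using `Ridge` and a three-value grid chosen only for illustration (the data, the grid, and the names `X_toy` and `y_toy` are assumptions, not part of this notebook's dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Each alpha is scored by 5-fold cross-validation; for a regressor the default score is R^2.
grid = GridSearchCV(estimator=Ridge(), param_grid={'alpha': [0.01, 1, 100]}, cv=5)
grid.fit(X_toy, y_toy)

# best_score_ is the highest mean cross-validated score across the grid.
print(grid.best_params_, grid.best_score_)
```

With the default `refit=True`, the fitted search object also exposes `best_estimator_`, the winning configuration refit on all the data passed to `fit()`, so the chosen parameters do not have to be copied by hand.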

In [None]:
lasso_results = pd.DataFrame(lasso_grid.cv_results_)[['param_alpha', 'mean_test_score']]
display(lasso_results.sort_values('param_alpha').reset_index(drop=True))

## Ridge

In [None]:
params = {'alpha': [10**i for i in range(1, 5)] + [round(0.1**i,5) for i in range(5)]}

ridge_grid = GridSearchCV(estimator = Ridge(),
                        param_grid = params,                        
                        cv = 5)

ridge_grid.fit(X_scaled, y)

In [None]:
print('Best Score: ', ridge_grid.best_score_)
print('Best Params: ', ridge_grid.best_params_)

In [None]:
ridge_results = pd.DataFrame(ridge_grid.cv_results_)[['param_alpha', 'mean_test_score']]
display(ridge_results.sort_values('param_alpha').reset_index(drop=True))

## Support Vector Regressor

In [None]:
params = {'C': [10**i for i in range(1, 2)] + [round(0.1**i,5) for i in range(5)],
          'kernel': ['linear']}

svr_grid = GridSearchCV(estimator = SVR(),
                        param_grid = params,                        
                        cv = 5)

svr_grid.fit(X, y)

In [None]:
svr_results = pd.DataFrame(svr_grid.cv_results_)[['param_C', 'mean_test_score']]
display(svr_results.sort_values('param_C').reset_index(drop=True))

In [None]:
print('Best Score: ', svr_grid.best_score_)
print('Best Params: ', svr_grid.best_params_)

## XGBoost

In [None]:
params = {
        'learning_rate': [0.1, 0.3, 0.5],
        'max_depth': [1, 3, 5],
        'min_child_weight': [1, 3, 5],
        'subsample': [0.1, 0.3, 0.5],
        'colsample_bytree': [0.1, 0.3, 0.5],
        'n_estimators' : [100, 200, 500, 750, 1000],
        'objective': ['reg:squarederror']
}

xgbr_grid = GridSearchCV(estimator = XGBRegressor(),
                         param_grid = params,
                         cv = 3)

xgbr_grid.fit(X_normalized, y)
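
An exhaustive grid grows multiplicatively with every parameter added, so it helps to count the candidates before running the search. The sketch below counts them for the grid above with scikit-learn's `ParameterGrid`; the variable names are my own:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the XGBoost grid in the cell above.
xgb_params = {
    'learning_rate': [0.1, 0.3, 0.5],
    'max_depth': [1, 3, 5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.1, 0.3, 0.5],
    'colsample_bytree': [0.1, 0.3, 0.5],
    'n_estimators': [100, 200, 500, 750, 1000],
    'objective': ['reg:squarederror'],
}

n_candidates = len(ParameterGrid(xgb_params))  # 3 * 3 * 3 * 3 * 3 * 5 * 1
cv_folds = 3
n_fits = n_candidates * cv_folds               # one fit per candidate per fold

print(n_candidates, n_fits)  # 1215 3645
```

That is 3,645 model fits before the final refit, which explains why this cell takes far longer than the Lasso, Ridge, and SVR searches.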

In [None]:
pd.DataFrame(xgbr_grid.cv_results_)[['param_colsample_bytree', 'param_learning_rate', 'param_max_depth',
       'param_min_child_weight', 'param_n_estimators',
       'param_subsample', 'mean_test_score']].sort_values('mean_test_score', ascending=False).head()

In [None]:
print('Best Score: ', xgbr_grid.best_score_)
print('Best Params: ', xgbr_grid.best_params_)

## LightGBM

In [None]:
params = {
    'learning_rate': [0.001, 0.01, 0.1, 1],
    'n_estimators': [100, 200, 500, 750, 1000]
}

lgbr_grid = GridSearchCV(estimator = LGBMRegressor(),
                        param_grid = params,                        
                        cv = 3)

lgbr_grid.fit(X_normalized, y)

In [None]:
pd.DataFrame(lgbr_grid.cv_results_)[['param_learning_rate', 'param_n_estimators', 'mean_test_score']].sort_values('mean_test_score', ascending=False).head()

In [None]:
print('Best Score: ', lgbr_grid.best_score_)
print('Best Params: ', lgbr_grid.best_params_)

# GridSearchCV Results & Final Comparison

In [None]:
print('Lasso: X_scaled')
print('Lasso Best Params: {}\n'.format(lasso_grid.best_params_))

print('Ridge: X_scaled')
print('Ridge Best Params: {}\n'.format(ridge_grid.best_params_))

print('SVR: X' )
print('SVR Best Params: {}\n'.format(svr_grid.best_params_))

print('XGBRegressor: X_normalized')
print('XGBRegressor Best Params: {}\n'.format(xgbr_grid.best_params_))

print('LGBMRegressor: X_normalized')
print('LGBMRegressor Best Params: {}'.format(lgbr_grid.best_params_))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=42)

# Lasso
lasso = Lasso(alpha=100)
lasso.fit(X_train, y_train)

y_pred = lasso.predict(X_test)
lasso_train_score = lasso.score(X_train, y_train)
lasso_test_score = lasso.score(X_test, y_test)

lasso_mae = metrics.mean_absolute_error(y_test, y_pred)
lasso_mse = metrics.mean_squared_error(y_test, y_pred)
lasso_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# Ridge
ridge = Ridge(alpha=1000)
ridge.fit(X_train, y_train)

y_pred = ridge.predict(X_test)
ridge_train_score = ridge.score(X_train, y_train)
ridge_test_score = ridge.score(X_test, y_test)

ridge_mae = metrics.mean_absolute_error(y_test, y_pred)
ridge_mse = metrics.mean_squared_error(y_test, y_pred)
ridge_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Linear Regression
lrm = LinearRegression()
lrm.fit(X_train, y_train)

y_pred = lrm.predict(X_test)
lrm_train_score = lrm.score(X_train, y_train)
lrm_test_score = lrm.score(X_test, y_test)

lrm_mae = metrics.mean_absolute_error(y_test, y_pred)
lrm_mse = metrics.mean_squared_error(y_test, y_pred)
lrm_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# SVR
svr = SVR(C=0.001, kernel='linear')
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
svr_train_score = svr.score(X_train, y_train)
svr_test_score = svr.score(X_test, y_test)

svr_mae = metrics.mean_absolute_error(y_test, y_pred)
svr_mse = metrics.mean_squared_error(y_test, y_pred)
svr_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.33, random_state=42)

# XGBRegressor
xgbr = XGBRegressor(colsample_bytree=0.3, learning_rate=0.1, max_depth=5, min_child_weight=1,
                    n_estimators=500, objective='reg:squarederror', subsample=0.5)
xgbr.fit(X_train, y_train)

y_pred = xgbr.predict(X_test)
xgbr_train_score = xgbr.score(X_train, y_train)
xgbr_test_score = xgbr.score(X_test, y_test)

xgbr_mae = metrics.mean_absolute_error(y_test, y_pred)
xgbr_mse = metrics.mean_squared_error(y_test, y_pred)
xgbr_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# LGBMRegressor
lgbr = LGBMRegressor(learning_rate=0.01, n_estimators=1000)
lgbr.fit(X_train, y_train)

y_pred = lgbr.predict(X_test)
lgbr_train_score = lgbr.score(X_train, y_train)
lgbr_test_score = lgbr.score(X_test, y_test)

lgbr_mae = metrics.mean_absolute_error(y_test, y_pred)
lgbr_mse = metrics.mean_squared_error(y_test, y_pred)
lgbr_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [None]:
models = ['Linear Regression', 'Lasso Regression', 'Ridge Regression',
      'SVM (kernel:linear)', 'XGBoost (Regressor)', 'LightGBM (Regressor)']
train_score = [lrm_train_score, lasso_train_score, ridge_train_score, svr_train_score, xgbr_train_score, lgbr_train_score]
test_score = [lrm_test_score, lasso_test_score, ridge_test_score, svr_test_score, xgbr_test_score, lgbr_test_score]
mae = [lrm_mae, lasso_mae, ridge_mae, svr_mae, xgbr_mae, lgbr_mae]
mse = [lrm_mse, lasso_mse, ridge_mse, svr_mse, xgbr_mse, lgbr_mse]
rmse = [lrm_rmse, lasso_rmse, ridge_rmse, svr_rmse, xgbr_rmse, lgbr_rmse]

# Named final_comparison so the model_comparison() function defined earlier is not overwritten.
final_comparison = pd.DataFrame(data=[models, train_score, test_score, mae, mse, rmse]).T.rename({0: 'Model', 1:'Training Score',
                                                                                2: 'Test Score',
                                                                                3:'Mean Absolute Error',
                                                                                4: 'Mean Squared Error',
                                                                                5:'Root Mean Squared Error'}, axis=1)

display(final_comparison)

### Best Results:

- Normalize the data.
- Use **XGBRegressor**.