### Problem Statement

Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

### Business Goal

We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

DataSet Information:

* Car_ID: Unique id of each observation (Integer)
* Symboling: Assigned insurance risk rating. A value of +3 indicates that the auto is risky; -3 indicates that it is probably pretty safe. (Categorical)
* CarName: Name of car company (Categorical)
* fueltype: Car fuel type, i.e. gas or diesel (Categorical)
* aspiration: Aspiration used in a car (Categorical)
* doornumber: Number of doors in a car (Categorical)
* carbody: body of car (Categorical)
* drivewheel: type of drive wheel (Categorical)
* enginelocation: Location of car engine (Categorical)
* wheelbase: Wheelbase of car (Numeric)
* carlength: Length of car (Numeric)
* carwidth: Width of car (Numeric)
* carheight: height of car (Numeric)
* curbweight: The weight of a car without occupants or baggage. (Numeric)
* enginetype: Type of engine. (Categorical)
* cylindernumber: Number of cylinders in the engine (Categorical)
* enginesize: Size of the engine (Numeric)
* fuelsystem: Fuel system of car (Categorical)
* boreratio: Boreratio of car (Numeric)
* stroke: Stroke or volume inside the engine (Numeric)
* compressionratio: compression ratio of car (Numeric)
* horsepower: Horsepower (Numeric)
* peakrpm: car peak rpm (Numeric)
* citympg: Mileage in city (Numeric)
* highwaympg: Mileage on highway (Numeric)
* price(Dependent variable): Price of car (Numeric)

Features:
* 13 numerical
* 11 categorical
* 1 Car_ID: Unique id of each observation (Integer)

Target:
* 1 numerical - price

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('/kaggle/input/car-price-prediction/CarPrice_Assignment.csv')
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
numeric_features = df.describe().columns
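
`df.describe().columns` works here because `describe()` summarizes only numeric columns by default. A more explicit alternative is `select_dtypes`, sketched below on a small made-up frame (the `toy` data and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical miniature of the car data, for illustration only
toy = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "fueltype": ["gas", "diesel", "gas"],
    "horsepower": [100, 150, 90],
    "price": [10000.0, 20000.0, 15000.0],
})

# Split the column names by dtype instead of relying on describe()
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

print(list(numeric_cols))
print(list(categorical_cols))
```

Either approach gives the same numeric columns; `select_dtypes` just makes the intent visible.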

In [None]:
# plot a histogram for each numeric feature (except car_ID), marking the mean (magenta) and median (cyan)

for col in numeric_features[1:]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

Comment:
* symboling has 6 distinct values (-2 to 3); consider converting it to a categorical type

In [None]:
# plot scatter plots of each numeric feature against the label (except car_ID, symboling)
# calculate the correlation statistic to quantify the apparent relationship.
# the slice [2:-1] skips car_ID and symboling at the start and price at the end

for col in numeric_features[2:-1]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    label = df['price']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Price')
    ax.set_title(f'price vs {col} - correlation: {correlation:.3f}')
plt.show()
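
The `feature.corr(label)` call above returns the Pearson correlation coefficient (the pandas default). As a rough check of what that number means, the sketch below computes it by hand on made-up values and compares it with pandas; the series `x` and `y` are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)
print(r_manual, r_pandas)
```

A value near +1 or -1 indicates a strong linear relationship; it says nothing about non-linear patterns that may still be visible in the scatter plot.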

In [None]:
# change symboling type to categorical

df['symboling'] = df['symboling'].astype('category')

In [None]:
df.describe(include=['object','category'])

In [None]:
# drop the CarName column: it has 147 unique names across only 205 cars, too many to be a useful categorical feature

df = df.drop('CarName',axis=1)

In [None]:
categorical_features = df.describe(include=['object','category']).columns

In [None]:
# plot a bar plot for each categorical feature count  

for col in categorical_features:
    counts = df[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax = ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col) 
    ax.set_ylabel("Frequency")
plt.show()

In [None]:
# plot a boxplot for the label by each categorical feature  

for col in categorical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    df.boxplot(column = 'price', by = col, ax = ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("Price")
plt.show()

## Train a Regression Model

In [None]:
X = df.drop(['car_ID','price'],axis=1)
y = df.price

In [None]:
categorical_features

In [None]:
# get dummies for all categorical features
X = pd.get_dummies(X,columns=categorical_features)
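
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up two-column frame (the `toy` data is an assumption for illustration). It also shows `drop_first=True`, an optional variant not used in this notebook, which drops one level per feature to avoid perfectly collinear dummy columns.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "horsepower": [100, 150, 90],
})

# One indicator column per category level
encoded = pd.get_dummies(toy, columns=["fueltype"])
print(encoded.columns.tolist())

# Optional variant: drop the first level of each encoded feature
encoded_drop = pd.get_dummies(toy, columns=["fueltype"], drop_first=True)
print(encoded_drop.columns.tolist())
```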

In [None]:
print('Features:',X[:3], '\nLabels:', y[:3], sep='\n')

In [None]:
from sklearn.model_selection import train_test_split

# Split data 90%-10% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=38)
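
With only 205 cars, a 10% split leaves a very small test set. The sketch below uses placeholder arrays (an assumption; they only mimic the row count) to show how many rows land on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (205)
X_demo = np.arange(205).reshape(-1, 1)
y_demo = np.arange(205)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=38
)
print(len(X_tr), len(X_te))
```

Metrics computed on so few test rows can shift noticeably with a different `random_state`, which is worth keeping in mind when comparing models later.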

In [None]:
numeric_features

In [None]:
numeric_features = numeric_features[2:-1]
numeric_features

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

#categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

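# Note: X already contains the dummy columns created by pd.get_dummies above,
# so every column, dummies included, is standardized here.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, X.columns)])
        #('cat', categorical_transformer, categorical_features)])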
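
The commented-out `categorical_transformer` hints at an alternative design: leave the frame un-encoded and let the `ColumnTransformer` scale the numeric columns and one-hot encode the categorical ones. A minimal sketch of that variant follows, assuming a made-up frame; `toy`, `num_cols`, and `cat_cols` are hypothetical names, and this is not the approach the notebook actually runs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "horsepower": [100, 150, 90, 200, 120, 110],
    "fueltype": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "price": [10000, 18000, 9000, 25000, 15000, 11000],
})
num_cols = ["horsepower"]
cat_cols = ["fueltype"]

# Scale numeric columns, one-hot encode categorical columns
prep = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline(steps=[("preprocessor", prep), ("regressor", LinearRegression())])
pipe.fit(toy[num_cols + cat_cols], toy["price"])

# Number of columns the model sees: 1 scaled numeric + 2 one-hot columns
n_out = pipe.named_steps["preprocessor"].transform(toy[num_cols + cat_cols]).shape[1]

# A category unseen during fit is encoded as all zeros instead of raising an error
unseen = pd.DataFrame({"horsepower": [130], "fueltype": ["electric"]})
pred_unseen = pipe.predict(unseen)
print(n_out, pred_unseen)
```

Encoding inside the pipeline means the same transformation is applied automatically at prediction time, and `handle_unknown="ignore"` tolerates category levels that were absent from the training data.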

In [None]:
model = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', LinearRegression())]) 

In [None]:
model.fit(X_train,y_train)

In [None]:
predictions = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)
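
For reference, the three metrics above can be reproduced by hand. The sketch below does so on made-up prices (the `y_true` and `y_pred` arrays are illustrative assumptions) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only
y_true = np.array([10000.0, 15000.0, 20000.0, 30000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 28000.0])

# MSE: mean of squared errors; RMSE: its square root (same units as price)
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
```

RMSE is in the same units as the price, so it can be read as a typical prediction error; R2 is the share of price variance that the model explains.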

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='red')
plt.show()

In [None]:
coef_ = model.named_steps.classifier.coef_
coef_

In [None]:
intercept_ = model.named_steps.classifier.intercept_
intercept_

In [None]:
# match column names to coefficients

for col, coef in zip(X_train.columns, coef_):
    print(f'{col}:  {coef}')

In [None]:
# create a data frame of variables and their coefficients

coefs = pd.DataFrame({'variable': X_train.columns, 'coefficients': coef_})

coefs.head()

In [None]:
# top 5 coefficients

coefs.sort_values(by=['coefficients'],axis=0,ascending=False,inplace=True)
coefs.head()

In [None]:
# bottom 5 coefficients
coefs.tail()

In [None]:
# plot the coefficients as a proxy for feature importance
plt.subplots(figsize=(15,15))
plt.barh(X.columns,coef_)
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.show()

### Comment:

* With linear regression, we get the following results:

    - RMSE: 2420.56 
    - R2: 0.93

* Engine size is the most important feature, with a positive coefficient of 4253.9, whereas engine type (ohcv) has the most negative effect on the price, with a coefficient of -1566.47. Because every column is standardized before fitting, each coefficient is the price change per one standard deviation of that feature.
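
Because the model is fitted on standardized columns, its coefficients are not in the original units. One way to convert them back is to divide each coefficient by the scaler's `scale_`. The sketch below illustrates this on synthetic data (the arrays and the step names `scaler` and `regressor` are assumptions, not the notebook's pipeline).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two features on very different scales
X_demo = np.column_stack([rng.normal(100, 20, 200), rng.normal(3, 0.5, 200)])
y_demo = 50 * X_demo[:, 0] + 2000 * X_demo[:, 1] + rng.normal(0, 10, 200)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("regressor", LinearRegression())])
pipe.fit(X_demo, y_demo)

# Coefficients per standard deviation of each feature
std_coefs = pipe.named_steps["regressor"].coef_

# Coefficients per original unit of each feature
raw_coefs = std_coefs / pipe.named_steps["scaler"].scale_

# Fitting without scaling yields the same per-unit coefficients
plain_coefs = LinearRegression().fit(X_demo, y_demo).coef_
print(std_coefs, raw_coefs, plain_coefs)
```

Standardized coefficients are convenient for ranking features against each other, while per-unit coefficients answer questions such as "how much does price change per extra unit of engine size".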

## Other regressor algorithms 
* LGBMRegressor, XGBRegressor, GradientBoostingRegressor, RandomForestRegressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

alg = [LGBMRegressor(), XGBRegressor(),GradientBoostingRegressor(),RandomForestRegressor()]

for regressor in alg:

    model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', regressor)])  

    model.fit(X_train,y_train)

    predictions = model.predict(X_test)
    
    print(regressor)
    print()
    mse = mean_squared_error(y_test, predictions)
    print("MSE:", mse)

    rmse = np.sqrt(mse)
    print("RMSE:", rmse)

    r2 = r2_score(y_test, predictions)
    print("R2:", r2)

    plt.scatter(y_test, predictions)
    plt.xlabel('Actual Labels')
    plt.ylabel('Predicted Labels')
    
    # overlay the regression line
    z = np.polyfit(y_test, predictions, 1)
    p = np.poly1d(z)
    plt.plot(y_test,p(y_test), color='red')
    plt.show()
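
The comparison above rests on a single test split of roughly 21 cars, so the ranking of the models may be sensitive to which rows end up in that split. A hedged alternative is to compare mean cross-validated R2 instead. The sketch below uses synthetic data from `make_regression` and only scikit-learn estimators; the data, the `candidates` dictionary, and the settings are all assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the car data
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

# 5-fold cross-validation: every row is used for testing exactly once
cv_r2 = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much a single-split score could move by chance.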

## Optimize Hyperparameters - Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Try these hyperparameter values with RandomForestRegressor()

params = {
 'max_depth': range(4,8),
 'n_estimators' : range(100,1000,100)
 }

score = make_scorer(r2_score)
gridsearch = GridSearchCV(RandomForestRegressor(), params, scoring=score, cv=3, return_train_score=True)

# Use a Random Forest Regressor algorithm
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gridsearch',gridsearch)])  

# Find the best hyperparameter combination to optimize the R2 metric
model.fit(X_train, y_train)

print("Best parameter combination:", gridsearch.best_params_, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
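
In the cell above, the grid search is the last step of the pipeline, so the preprocessor is fitted once on the whole training set before cross-validation starts. A common alternative is to pass the whole pipeline to `GridSearchCV`, so preprocessing is refitted inside each fold. The sketch below shows that arrangement on synthetic data; the data, the step names, and the reduced grid are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "regressor__max_depth": [4, 6],
    "regressor__n_estimators": [50, 100],
}

search = GridSearchCV(pipe, param_grid, scoring="r2", cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

For tree-based models scaling has little effect on the fit, so the difference is mostly one of hygiene; it matters more when the preprocessing itself learns from the data in ways that affect the model.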

## Randomized Search 

In [None]:
from sklearn.model_selection import RandomizedSearchCV
 
    
# Try these hyperparameter values 

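params = {
 'max_depth': range(2,10),
 'n_estimators' : range(100,1000,100),
 # 'auto' is no longer accepted by recent scikit-learn releases;
 # for regressors it meant "use all features", which is 1.0
 'max_features' : [1.0, 'sqrt'],
 'min_samples_split' : [2, 5, 10, 15],
 'min_samples_leaf' : [1, 2, 5, 10]
 }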
    
score = make_scorer(r2_score)
randomsearch = RandomizedSearchCV(RandomForestRegressor(), params, scoring=score, cv=3, return_train_score=True)

# Use a Random Forest Regressor algorithm
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('randomsearch',randomsearch)])  

# Find the best hyperparameter combination to optimize the R2 metric
model.fit(X_train, y_train)

print("Best parameter combination:", randomsearch.best_params_, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
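
`RandomizedSearchCV` samples only `n_iter` combinations (10 by default) rather than the full grid, so without a fixed `random_state` each run may try different combinations and report a different best result. A minimal sketch with both set explicitly, on synthetic data (the data and the reduced parameter lists are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the car data
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

param_distributions = {
    "max_depth": list(range(2, 10)),
    "n_estimators": [50, 100, 150],
    "min_samples_leaf": [1, 2, 5],
}

# Sample 5 of the 72 possible combinations, reproducibly
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    scoring="r2",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_["params"])
print(n_tried, search.best_params_)
```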

## Grid Search - Gradient Boosting and XGBoost

In [None]:
# Try these hyperparameter values with GradientBoostingRegressor()

params = {
 'max_depth': range(4,8),
 'n_estimators' : range(100,1000,100),
 'learning_rate' : [0.1,0.01,0.001]
 }

score = make_scorer(r2_score)
gridsearch = GridSearchCV(GradientBoostingRegressor(), params, scoring=score, cv=3, return_train_score=True)

# Use a Gradient Boosting Regressor algorithm
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gridsearch',gridsearch)])  

# Find the best hyperparameter combination to optimize the R2 metric
model.fit(X_train, y_train)

print("Best parameter combination:", gridsearch.best_params_, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

In [None]:
# Try these hyperparameter values with XGBRegressor()

params = {
 'max_depth': range(4,8),
 'n_estimators' : range(100,1000,100),
 'learning_rate' : [0.1,0.01,0.001]
 }

score = make_scorer(r2_score)
gridsearch = GridSearchCV( XGBRegressor(), params, scoring=score, cv=3, return_train_score=True)

# Use an XGBoost Regressor algorithm
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gridsearch',gridsearch)])  

# Find the best hyperparameter combination to optimize the R2 metric
model.fit(X_train, y_train)

print("Best parameter combination:", gridsearch.best_params_, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()   
    

## Conclusion:
* XGBRegressor provides the best result: RMSE 1615, R2 0.97
* Best parameter combination for XGBRegressor: learning_rate: 0.1, max_depth: 4, n_estimators: 300