In [None]:
# Importing the required libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline 
sns.set(color_codes=True)
np.random.seed(31415)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# Loading the CSV file into a pandas dataframe.
df = pd.read_csv("../input/cars1/CARS.csv")
df.head(5)

**Cleaning the data**

In this part I removed the $ sign and the commas (,) from the car prices (MSRP).

I also removed the rows that contain NaN values.

I also converted the prices to floats, because after the symbols are removed they are still stored as text, which was causing problems when printing.


In [None]:
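# Strip the $ sign and the thousands separators from every column.
colstocheck = df.columns
df[colstocheck] = df[colstocheck].replace({r'\$': ''}, regex=True)
df[colstocheck] = df[colstocheck].replace({',': ''}, regex=True)
# Report which columns contain missing values.
col_mask = df.isnull().any(axis=0)
print(col_mask)
# Show the affected rows and columns before dropping them.
row_mask = df.isnull().any(axis=1)
print(df.loc[row_mask, col_mask])
df = df.dropna()
# The cleaned prices are still strings, so convert them to floats.
df['MSRP'] = df['MSRP'].astype(float)
df.head(5)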
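
The cell above strips `$` and `,` from every column. A narrower variant touches only the price columns, which avoids altering commas that may legitimately appear in text fields. The sketch below is only an illustration: the frame `toy` and its values are made up (they are not rows from CARS.csv), and it assumes the price columns are named `MSRP` and `Invoice`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "Model": ["A, base", "B", "C"],
    "MSRP": ["$36,945", "$23,820", "$26,990"],
    "Invoice": ["$33,337", "$21,761", "$24,647"],
    "Cylinders": [6.0, 4.0, np.nan],
})

price_cols = ["MSRP", "Invoice"]
for col in price_cols:
    # Remove '$' and ',' from the text, then convert what is left to float.
    toy[col] = toy[col].str.replace(r"[$,]", "", regex=True).astype(float)

# Drop rows with any missing value, as in the notebook.
toy_clean = toy.dropna()
print(toy_clean)
```

The comma inside the `Model` text is left untouched here because only the two price columns are cleaned.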

**Removing irrelevant features**

I will remove the features Make, Model, Type, Origin, DriveTrain, and Invoice from this dataset, because I do not treat them as predictors of the price in this analysis.

In [None]:
df = df.drop(['Make','Model','Type','Origin','DriveTrain','Invoice'],axis=1)
df.head(5)

**Implementing the algorithms**

In this assignment, I am implementing three different machine learning algorithms so that their prediction scores can be compared.

The first step in implementing these algorithms is determining the dependent and independent variables. The dependent variable is the price (MSRP), and the independent variables are all the remaining variables in my data set. I chose this because I could see a pattern in the dataset: cars with a larger engine size tended to have a higher price, and the same held for Horsepower, Cylinders, Length, Wheelbase and other features. Because of this pattern, I predict the price (MSRP) from all the remaining features of the car.
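
One way to check a pattern like this is to look at the correlation between each feature and MSRP. The sketch below uses a small made-up frame (the numbers are illustrative and are not taken from CARS.csv); the column names only mimic those in the dataset.

```python
import pandas as pd

# Made-up values for illustration; not rows from CARS.csv.
toy = pd.DataFrame({
    "EngineSize": [1.6, 2.0, 2.5, 3.0, 3.5, 4.2],
    "Horsepower": [110, 140, 170, 210, 250, 300],
    "MPG_City": [32, 28, 24, 21, 19, 16],
    "MSRP": [15000.0, 19000.0, 24000.0, 31000.0, 38000.0, 52000.0],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_price = toy.corr()["MSRP"].drop("MSRP").sort_values(ascending=False)
print(corr_with_price)
```

A value close to +1 or -1 indicates a strong linear relationship with the price; a value near 0 indicates a weak one.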

In [None]:
# Importing all the required libraries
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

**Implementing LASSO Regression**

In [None]:
# Creating the Lasso Regression Model
reg = linear_model.Lasso(alpha=0.1)

**Preparing the data in a form that is supported by the algorithms**

Here the data needs to be prepared in a particular shape, because a mismatch in dimensions can cause an error. The ‘X’ variable contains all the features except MSRP and has 2 dimensions. Similarly, the ‘y’ variable contains only the MSRP values and has 1 dimension. I converted both data frames into numpy arrays because I was initially getting an error, and scikit-learn estimators work with array-like input. We can check the dimensions using the ndim attribute. If ‘X’ is not two-dimensional, or ‘y’ does not have one value per row of ‘X’, the model cannot be fitted. This is one of the most important checks to make before feeding the values to the model.

In [None]:
X = df.drop("MSRP", axis=1)
y = df["MSRP"]
X = X.to_numpy()
X.ndim

In [None]:
y = y.to_numpy()
y.ndim
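
The dimension check can be tried on a tiny example. This is a sketch with made-up arrays (`X_toy`, `y_col` and `y_flat` are illustrative names, not part of the notebook); it shows how a target stored as a column vector can be flattened to one dimension.

```python
import numpy as np

X_toy = np.array([[1.6, 110], [2.0, 140], [3.0, 210]])  # 3 samples, 2 features
y_col = np.array([[15000.0], [19000.0], [31000.0]])     # column vector, shape (3, 1)

# ndim is an attribute, so it is read without parentheses.
print(X_toy.ndim, X_toy.shape)
print(y_col.ndim, y_col.shape)

# Flatten the column vector into a 1-D array with one value per sample.
y_flat = y_col.ravel()
print(y_flat.ndim, y_flat.shape)
```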

During implementation of the models I will be dividing the data into two parts:
1. Train data: the portion of the data used to fit the algorithm.
2. Test data: the held-out portion used to check the performance of the trained model.

Splitting the data does not by itself make the model more accurate; it lets us measure how well the model predicts data it has not seen during training. Here the data is divided into 70% for training and 30% for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
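
To see what the split produces, the sketch below applies the same `train_test_split` call to synthetic placeholder data (100 random samples, not the cars data) and prints the resulting shapes. The names `X_demo`, `y_demo`, `X_tr` and `X_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 samples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Same arguments as in the notebook: 30% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Fixing `random_state` means the same rows land in the training and test sets every time the cell is run.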

In [None]:
# Fitting the Lasso Regression model on the training data and predicting on the test data
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
# Printing the first five predicted values
pred[:5]


**Plotting Lasso Regression**

In [None]:
plt.figure(figsize= (6, 6))
plt.title("Visualizing the Regression using Lasso Regression algorithm")
sns.regplot(x=pred, y=y_test, color = "teal")
plt.xlabel("Predicted Price (MSRP)")
plt.ylabel("Actual Price (MSRP)")
plt.show()

**Providing statistical information that supports the model**

This step involves printing results such as the Mean Absolute Error, Mean Squared Error, coefficients, intercept and the R² score. Out of these, the R² score is the most important metric here: it measures the proportion of the variance in the price that the model explains on the test data. I have also included the scores of the other algorithms, Linear Regression and Random Forest.

In [None]:
print("Mean Absolute Error is :", mean_absolute_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Mean Squared Error is :", mean_squared_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Coefficients are : ", reg.coef_)
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Intercepts are :" ,reg.intercept_)
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("The R2 score of Lasso (in %) is :", r2_score(y_test, pred)*100)
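
To make the three metrics concrete, they can be computed by hand with numpy. This is a minimal sketch on four made-up values (`y_true` and `y_hat` are illustrative, not predictions from the model above).

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up actual values
y_hat = np.array([2.0, 5.0, 8.0, 9.0])   # made-up predictions

residuals = y_true - y_hat
mae = np.mean(np.abs(residuals))  # mean absolute error
mse = np.mean(residuals ** 2)     # mean squared error

ss_res = np.sum(residuals ** 2)                 # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                      # proportion of variance explained

print(mae, mse, r2)
```

Multiplying `r2` by 100, as the notebook does, expresses the score as a percentage.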

**Random Forest Regression model**

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
pred = model.predict(X_test)

plt.figure(figsize= (6, 6))
plt.title("Visualizing the Regression using Random Forest Regression algorithm")
sns.regplot(x=pred, y=y_test, color = 'teal')
plt.xlabel("Predicted Price (MSRP)")
plt.ylabel("Actual Price (MSRP)")
plt.show()


In [None]:
# Evaluating the Random Forest fitted in the previous cell. Re-fitting here
# without a fixed random_state would score a different forest than the one plotted.
print("Mean Absolute Error is :", mean_absolute_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Mean Squared Error is :", mean_squared_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("The R2 score of RandomForest Regressor (in %) is :", r2_score(y_test, pred)*100)
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
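
A random forest is built with random sampling, so two fits on the same data can give slightly different scores unless the seed is fixed. The sketch below illustrates this on synthetic data (not the cars data); `forest_a`, `forest_b` and the chosen `n_estimators=50` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Two forests built with the same seed are grown identically.
forest_a = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
forest_b = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

same = np.allclose(forest_a.predict(X_demo), forest_b.predict(X_demo))
print("Identical predictions:", same)
```

Passing a fixed `random_state` to `RandomForestRegressor` in the notebook would make its reported score repeatable across runs.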

**Linear Regression model**

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
plt.figure(figsize= (6, 6))
plt.title("Visualizing the Regression using Linear Regression algorithm")
sns.regplot(x=pred, y=y_test, color = 'teal')
plt.xlabel("Predicted Price (MSRP)")
plt.ylabel("Actual Price (MSRP)")
plt.show()

In [None]:
print("Mean Absolute Error is :", mean_absolute_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Mean Squared Error is :", mean_squared_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Coefficients are : ", model.coef_)
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Intercepts are :" ,model.intercept_)
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("The R2 score of Linear Regression (in %) is :", r2_score(y_test, pred)*100)

**Discussion**
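These models can be used to predict the outcome of future instances: given the specifications or other features of a car, they predict its price (MSRP). On the test set, the models give R² scores of 74.5% (Lasso Regression), 80.5% (Random Forest Regression), and 74.5% (Linear Regression). Random Forest scores highest. Lasso and Linear Regression score the same, likely because the penalty alpha=0.1 is very small relative to the scale of the prices, so the two fits are nearly identical.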
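
A comparison like this can also be written as a single loop, so that every model is fitted and scored on the same split. The following is a sketch on synthetic data from `make_regression`, not on the cars data, so its scores will differ from the ones quoted above; the dictionary names `models` and `scores` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with a linear signal plus noise.
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

models = {
    "Lasso": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Linear Regression": LinearRegression(),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te)) * 100

for name, score in scores.items():
    print(f"{name}: R2 = {score:.1f}%")
```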
