This Project employs two different Machine Learning Regression Models to make predictions about the prices of various cars.
Thereafter, we can identify which Regression Model is better suited to make predictions on Multiple Regression data, where the price depends on several parameters at once.

The steps employed for accurate prediction are:
1. Dealing with absent/missing data.

2. Identifying and Converting Non-Numeric/Qualitative data into Quantitative form to optimize the dataset for a more accurate prediction.

3. Identifying the magnitude of inter-dependence between parameters of the dataset and the impact they have on final price.

4. Division of Data into Training and Testing Sets to test the validity of the model.

5. Calculating Result and identifying the model that is better suited.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

**Dealing with absent/missing data**

The given dataset has no missing values and therefore does not require modification. 
However, if such entries exist, the following steps should be followed:

* Identifying the data items needed for prediction in the dataset. 
  Indiscriminate removal of rows owing to missing non-essential data reduces the data available for model training and validation, thus reducing accuracy.

* Using the function **dropna()** to remove the rows that have missing/absent entries in those needed data items.
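
The two steps above can be sketched on a small made-up table. The values and the `notes` column below are hypothetical and only illustrate how the `subset` argument of `dropna()` limits the check to the columns assumed to be needed for prediction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names resemble the dataset, values are made up
toy = pd.DataFrame({
    'horsepower': [111.0, np.nan, 154.0, 102.0],
    'citympg': [21.0, 21.0, 19.0, 24.0],
    'notes': [None, 'imported', None, None],  # assumed non-essential column
    'price': [13495.0, 16500.0, np.nan, 13950.0],
})

# Columns assumed to be needed for prediction
needed = ['horsepower', 'citympg', 'price']

# Dropping on every column removes any row with a gap anywhere
dropped_all = toy.dropna()

# Restricting the check to the needed columns keeps rows whose only gap is in 'notes'
dropped_subset = toy.dropna(subset=needed)

print('Rows kept without subset:', len(dropped_all))
print('Rows kept with subset:', len(dropped_subset))
```

In this toy case, restricting the check to the needed columns keeps more rows for training than dropping on every column.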

In [None]:
file_path = '../input/car-price-prediction/CarPrice_Assignment.csv'
price_prediction = pd.read_csv(file_path, index_col = 0)
price_prediction.isnull().sum()

In [None]:
price_prediction.describe()

In [None]:
price_prediction.head()

In [None]:
plt.figure(figsize = (15, 5))
price_prediction.price.plot(title = 'Variation of Car Statistics by Index')
price_prediction.peakrpm.plot(color = 'y')

plt.legend()
plt.show()

In [None]:
price_prediction.citympg.plot(figsize = (15, 5), title = 'Variation of Engine Statistics', color = 'g')
price_prediction.highwaympg.plot(color = 'r')
price_prediction.horsepower.plot(kind = 'bar', rot = 90)

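# Hide the crowded x-axis tick labels on the current axes.
# plt.gca() is used because plt.axes() adds a new, empty axes in recent Matplotlib versions.
plt.gca().xaxis.set_major_formatter(plt.NullFormatter())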

plt.legend()
plt.show()

In [None]:
type_cars = price_prediction.groupby('CarName').size().sort_values(ascending = False)
type_cars.nlargest(15).plot(kind = 'bar', figsize = (20, 5), title = 'Most Popular Car Models', rot = 10)

plt.show()

**Identifying and Converting Non-Numeric/Qualitative data into Quantitative form to optimize the dataset for a more accurate prediction.**

This is done using the LabelEncoder from scikit-learn. Identical values in a column are grouped together and assigned the same integer, which can then be used to identify the effect of that column on final price.
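
As a minimal illustration of what the encoder does, the sketch below applies it to a short made-up list of fuel types (the list is hypothetical, not taken from the dataset). Each distinct label is mapped to an integer, with the classes ordered alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of a qualitative column
fuel = ['gas', 'diesel', 'gas', 'gas', 'diesel']

encoder = LabelEncoder()
codes = encoder.fit_transform(fuel)

print('Classes:', list(encoder.classes_))  # distinct labels, sorted
print('Codes:', list(codes))               # integer assigned to each entry
print('Decoded:', list(encoder.inverse_transform(codes)))  # back to the original labels
```

Note that the integers carry an arbitrary order (here diesel before gas), so they should be read as identifiers rather than as quantities.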

In [None]:
price_prediction.dtypes

In [None]:
from sklearn.preprocessing import LabelEncoder
conversion = LabelEncoder()

# Qualitative columns to be converted into integer codes
qualitative_columns = ['CarName', 'fueltype', 'aspiration', 'doornumber', 'carbody',
                       'drivewheel', 'enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem']

# The encoder is re-fitted on each column in turn
for column in qualitative_columns:
    price_prediction[column] = conversion.fit_transform(price_prediction[column])

In [None]:
price_prediction.dtypes

**Identifying the magnitude of inter-dependence between parameters of the dataset and the impact they have on final price.**

This is done using a Correlation Matrix, displayed here as a heatmap. Each cell holds the correlation coefficient between a pair of parameters.
The correlation coefficient lies in the range [-1, 1]. A positive value indicates that the two parameters tend to increase together, a negative value indicates that one tends to decrease as the other increases, and a value near 0 indicates little linear relationship.
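
To read off the impact on final price specifically, one option is to take only the `price` column of the matrix and rank the other parameters by the absolute value of their coefficient. The sketch below does this on a tiny made-up table (the numbers are hypothetical and deliberately close to perfectly linear).

```python
import pandas as pd

# Hypothetical values: price rises with engine size and falls with city mileage
toy = pd.DataFrame({
    'enginesize': [100, 120, 140, 160, 180],
    'citympg': [40, 35, 30, 25, 20],
    'price': [8000, 10000, 12000, 14000, 16000],
})

# Correlation of every other column with price
corr_with_price = toy.corr()['price'].drop('price')

# Rank by strength of the relationship, ignoring the sign
ranked = corr_with_price.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

The sign tells the direction of the relationship, while the absolute value tells how strong it is.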

In [None]:
correlation = price_prediction.corr()
# Styler.set_precision() was removed in pandas 2.0; format(precision = 3) is the replacement
correlation.style.background_gradient(cmap = 'coolwarm', axis = None).format(precision = 3)

**Division of Data into Training and Testing Sets to test the validity of the model.**

If the entire dataset is used to train the model, there would be no way to test the accuracy of future real-world predictions. Therefore the dataset is divided into two separate sets. 
One is used to train the Model to make predictions, while the other set, which the Model has not seen, is used to check the validity of those predictions.
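
The split can be sketched on a small made-up array. The `test_size` and `random_state` values below are illustrative choices: when `test_size` is not given, `train_test_split` holds out 25% of the rows, and fixing `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows with 2 features each, and one target per row
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Hold out a quarter of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size = 0.25, random_state = 0)

# Repeating the call with the same random_state gives the same split
_, _, _, y_te_again = train_test_split(X_toy, y_toy, test_size = 0.25, random_state = 0)

print('Training rows:', X_tr.shape[0])
print('Testing rows:', X_te.shape[0])
print('Same split on repeat:', np.array_equal(y_te, y_te_again))
```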

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [None]:
X = price_prediction.drop('price', axis = 1)
y = price_prediction.price

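# A fixed random_state keeps the split, and therefore the reported errors, the same on every run
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)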

**Calculating Result and identifying the model that is better suited.**

The two models being employed to make that prediction are:
* DecisionTreeRegressor
* RandomForestRegressor

Testing more than one model helps identify which one captures the relationships in the data best and therefore makes the most accurate predictions.
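
One way to keep such a comparison fair is to fit every candidate on the same training set and score it on the same testing set. The sketch below shows that pattern on synthetic data from `make_regression`, which stands in for the car dataset here, so its numbers say nothing about the actual results. The names `candidates` and `scores` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only)
X_syn, y_syn = make_regression(n_samples = 200, n_features = 5, noise = 10.0, random_state = 0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state = 0)

candidates = {
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state = 0),
    'RandomForestRegressor': RandomForestRegressor(random_state = 0),
}

# Fit each model on the same training rows and score it on the same testing rows
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))

# The model with the lowest mean absolute error is the better fit for this split
best = min(scores, key = scores.get)

for name, error in scores.items():
    print(name, 'MAE:', error)
print('Lowest error:', best)
```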

In [None]:
from sklearn.tree import DecisionTreeRegressor

DTreeModel = DecisionTreeRegressor(random_state = 0)
DTreeModel.fit(train_X, train_y)

# Predict prices for the held-out rows and measure the average absolute error
predval_DTree = DTreeModel.predict(test_X)
mean_error = mean_absolute_error(test_y, predval_DTree)
# Express the error as a percentage of the mean price in the dataset
percent_error = (mean_error / price_prediction.price.mean()) * 100

print('Mean Price in Original Dataset:', price_prediction.price.mean())
print('Mean Absolute Error of Predictions:', mean_error)
print('Percentage Error: {}%'.format(percent_error))

In [None]:
from sklearn.ensemble import RandomForestRegressor

RandForestModel = RandomForestRegressor(random_state = 1)
RandForestModel.fit(train_X, train_y)

# Same evaluation as for the decision tree, on the same held-out rows
predval_RandFor = RandForestModel.predict(test_X)
mean_error = mean_absolute_error(test_y, predval_RandFor)
percent_error = (mean_error / price_prediction.price.mean()) * 100

print('Mean Price in Original Dataset:', price_prediction.price.mean())
print('Mean Absolute Error of Predictions:', mean_error)
print('Percentage Error: {}%'.format(percent_error))

From the errors reported above, it is apparent that the **RandomForestRegressor** is better suited to make predictions on this Multiple Regression dataset.

Thanks for going through the project!