In [None]:
import numpy as np
import pandas as pd
import scipy.stats  # explicit submodule import so scipy.stats.* calls below always resolve
from matplotlib import pyplot as plt
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Car Price Prediction - Exploratory Data Analysis and Linear Model

In this project, we will create a model to predict car prices based on a subset of features, using Multiple Linear Regression.
The first step will be to understand the target variable, car prices, using EDA.
We will also prepare and treat the data before selecting the features and testing the model.

In [None]:
car_data = pd.read_csv('../input/car-data/CarPrice_Assignment.csv')

## PART 1: Exploratory Data Analysis

Let's try and get a summary of the data:

In [None]:
car_data.describe()

In [None]:
car_data.shape

Understanding the target variable - Car prices.

In [None]:
car_price = car_data['price']
car_price.describe()

Based on the summary above, a car costs 13,276 USD on average, and the cheapest one costs 5,118 USD. Only 25% of the cars in this dataset cost more than 16,503 USD, while the maximum price is 45,400 USD, which suggests a right-skewed distribution of prices.
We can plot the data to get a better understanding of car prices:

In [None]:
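# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
ax = sns.histplot(car_price, kde=True, stat='density')
ax.set_xlabel('Car Price')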
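
As a complement to the plot, right skew can also be checked numerically. The sketch below uses synthetic log-normal values (the name `toy_prices` and all parameters are assumptions for illustration, not the car dataset) to show the two usual signs: a mean above the median and a positive skewness coefficient.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" drawn from a log-normal distribution.
# These values are illustrative only; they are NOT the car dataset.
rng = np.random.default_rng(0)
toy_prices = pd.Series(rng.lognormal(mean=9.3, sigma=0.5, size=1000))

# In a right-skewed sample the mean sits above the median
# and the skewness coefficient is positive.
print('mean:    ', toy_prices.mean())
print('median:  ', toy_prices.median())
print('skewness:', toy_prices.skew())
```

The same three calls could be applied to `car_price` to back up the reading of the `describe()` output.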

## PART 2: Feature Selection

In this part, we will select the variables we will use to build the linear model and prepare them.

Let's try and see how each variable in our dataset correlates with car prices, creating a correlation matrix:

In [None]:
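# numeric_only=True skips text columns, which recent pandas versions refuse to correlate
car_data.corr(numeric_only=True)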

To select the variables for our model, we will remove the variables whose Spearman correlation with car price is not statistically significant (significance level of 0.05):

In [None]:
variables = list(car_data)
variables.remove('car_ID')
selected = []

# selects all variables with p < 0.05
for var in variables:
    p = scipy.stats.spearmanr(car_data[var], car_price)[1]
    if p < 0.05:
        selected.append(var)
print(selected)
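
To make the filtering step more concrete, here is a minimal sketch on synthetic data. The column names `related`, `noise` and `target` are hypothetical. It shows the two values returned by `spearmanr`, the rank correlation and its p-value, for a column that drives the target and for one that does not.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: 'related' drives the target, 'noise' is independent of it.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'related': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
toy['target'] = 3.0 * toy['related'] + rng.normal(scale=0.5, size=n)

# spearmanr returns the rank correlation (rho) and its p-value.
results = {}
for col in ['related', 'noise']:
    rho, p_value = stats.spearmanr(toy[col], toy['target'])
    results[col] = (rho, p_value)
    print(col, 'rho =', round(rho, 3), 'p =', p_value)
```

Note that a p-value filter only tests whether a monotonic association exists. It says nothing about its strength, so inspecting `rho` alongside the p-value is usually worthwhile.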

Now we can create a second dataframe with only the significant variables and visualize their correlations.
These are the variables we will use in our model.

In [None]:
car_data2 = car_data[selected]
# numeric_only=True skips the categorical columns that were selected
car_data2.corr(numeric_only=True)

In [None]:
# we have to remove the target variable from predictors
selected.remove('price')
car_data3 = car_data2[selected]

We can now build a linear model with selected variables.
But first, we will need to encode the categorical ones as dummy variables.
The categorical variables we have from the selected ones are fueltype, aspiration, drivewheel, enginelocation and fuelsystem.
We can check which distinct values each one has:

In [None]:
print(car_data3.fueltype.unique())
print(car_data3.aspiration.unique())
print(car_data3.drivewheel.unique())
print(car_data3.enginelocation.unique())
print(car_data3.fuelsystem.unique())

And dummify them:

In [None]:
car_data_with_dummies = pd.get_dummies(car_data3, columns=['fueltype', 'aspiration',
                                            'drivewheel', 'enginelocation',
                                            'fuelsystem'])
car_data_with_dummies.head()
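
To illustrate what the encoding does, here is a small sketch on a hypothetical three-row frame (the values are made up for illustration). It also shows the optional `drop_first=True` argument, which keeps one fewer indicator column per category. This is a common alternative when perfectly collinear dummies are a concern, not something the notebook above relies on.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
toy = pd.DataFrame({
    'horsepower': [111, 154, 102],
    'fueltype': ['gas', 'diesel', 'gas'],
})

# One indicator column per distinct category value.
encoded = pd.get_dummies(toy, columns=['fueltype'])
print(encoded.columns.tolist())

# drop_first=True drops the first category (alphabetically 'diesel'),
# leaving k-1 indicator columns for a variable with k categories.
encoded_reduced = pd.get_dummies(toy, columns=['fueltype'], drop_first=True)
print(encoded_reduced.columns.tolist())
```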


## PART 3: Building the model
Now that we have selected and prepared our predictors, we can create the linear model.
In order to create a linear model, we need to:

1. Build the model
2. Evaluate and test the model

To evaluate the linear model, we need to split the data into training and testing sets. The training set is used to fit the model, while the testing set is held out and used to check predictions against prices the model has not seen.
We can then verify how far the predictions made by the model are from the real values.
This way we can check that the model we are building is reliable.

In [None]:
model = LinearRegression()

predictors = car_data_with_dummies # features for prediction

# create training and testing sets
x_train, x_test, y_train, y_test = train_test_split(predictors, car_price, 
                                                    test_size=0.3,
                                                   random_state=5)
# fit the model with the training data
model.fit(x_train, y_train) 

predicted = model.predict(x_test)  # model predicted values 

Now that we have used our test set to make predictions, we can see how far the values predicted by the model are from the actual prices in the test set:

In [None]:
plt.scatter(y_test, predicted)
plt.title('Car prices VS Predicted Car Prices')
plt.xlabel('Car prices')
plt.ylabel('Predicted car prices')

We can also measure how close the predicted values are to the real ones using correlation:

In [None]:
scipy.stats.pearsonr(y_test, predicted)

The values predicted by the model correlate strongly and significantly with the real values.
Lastly, we can check the R Squared to understand how well the model fits our data:

In [None]:
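# R squared (coefficient of determination); this is not an accuracy score
r_squared = model.score(x_train, y_train)     # fit on the data the model was trained on
r_squared_test = model.score(x_test, y_test)  # fit on held-out data, the fairer estimate
print(r_squared, r_squared_test)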

By checking the R Squared, we can conclude the model is a good fit for our data.
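
R Squared computed on the training set tends to be optimistic, so it helps to compare it with the value on held-out data and with an error measure in the target's own units. The sketch below does this on synthetic data. The names `toy_model`, `X`, `y` and the chosen coefficients are assumptions for illustration, not the car model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with three features and known coefficients.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

toy_model = LinearRegression().fit(X_train, y_train)
y_pred = toy_model.predict(X_test)

# R squared on training data versus held-out data.
r2_train = toy_model.score(X_train, y_train)
r2_test = r2_score(y_test, y_pred)

# Root mean squared error, expressed in the same units as the target.
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))

print('R2 train:', r2_train)
print('R2 test: ', r2_test)
print('RMSE test:', rmse_test)
```

A large gap between the training and test values would suggest overfitting. For car prices, the RMSE would read directly as a typical error in USD.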

Now that we have built and evaluated our model successfully, we can verify the weights of each feature in our linear model:

In [None]:
coefficients = list(zip(model.coef_, predictors))

for c in coefficients:
    print(c)

The feature weights tell us how each factor is associated with car price in the model.
For example, horsepower has a weight of approximately 36.9, meaning that an increase of one unit of horsepower is associated with an increase of about 36.9 USD in the predicted car price, holding the other features constant.
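
The "holding the other features constant" reading of a coefficient can be checked directly: in a linear model, raising one feature by one unit changes the prediction by exactly that feature's coefficient. A minimal sketch on synthetic data (the names `toy_model`, `base` and `bumped` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two features and known true weights (5.0 and -2.0).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

toy_model = LinearRegression().fit(X, y)

# Two inputs that differ only by one unit in feature 0.
base = np.array([[1.0, 1.0]])
bumped = base + np.array([[1.0, 0.0]])

# The change in prediction matches the fitted coefficient of feature 0.
delta = toy_model.predict(bumped)[0] - toy_model.predict(base)[0]
print(delta, toy_model.coef_[0])
```

This identity describes the model's predictions only. Whether a coefficient reflects a causal effect on price depends on the data and on which other features are included.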

## Summary

Based on the correlations computed above, the factors most strongly associated with car price are:

1. Enginesize
2. Curbweight
3. Horsepower
4. Carlength
5. Width
6. HighwayMPG and CityMPG
7. Wheelbase

Most of these factors are associated with the size and weight of the car (engine size, curb weight, length, width and wheelbase), while others are associated with the vehicle's performance (such as horsepower and MPG, Miles Per Gallon).

Since the size of the engine seems to be the variable most associated with car price, we can try and visualize this relationship:

In [None]:
plt.scatter(car_data['enginesize'], car_data['price'])
plt.title('Effect of engine size on car prices')
plt.xlabel('Engine size')
plt.ylabel('Car price')

Car prices are also strongly associated with fuel consumption, as we can verify below:

In [None]:
plt.scatter(car_data['highwaympg'], car_data['price'])
plt.scatter(car_data['citympg'], car_data['price'])
plt.title('Car prices by fuel consumption')
plt.xlabel('MPG - City (orange) Highway (Blue)')
plt.ylabel('Car price')

According to the linear model, an increase of 1 MPG in highway fuel economy (that is, lower fuel consumption) is associated with a price increase of approximately 100 USD, holding the other features constant.