### Hello, and welcome to my first project! Today I will show you how to explore your data and build some basic models with the sklearn library in order to predict car prices from one or several features.

For this project we will use a dataset created by an automobile importer, which stores several characteristics of cars and their corresponding prices.
### Let's get started!

# Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

### Load data and store in dataframe df:

In [None]:
# path of data 
filename = '../input/auto-eda/automobileEDA.csv'
df = pd.read_csv(filename)
df.head()

In [None]:
df.shape

### Looking for nan or null values:

In [None]:
df.isnull().sum().sum()

In [None]:
df1=df[df.isna().any(axis=1)]
df1

### Removing nan values:

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum().sum()
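
To make the two steps above more concrete, here is a minimal sketch on a tiny made-up dataframe (the column values are hypothetical and are not taken from the automobile dataset). It counts missing values per column and shows that `dropna()` removes every row containing at least one of them:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, used only for illustration
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "price": [13495.0, 16500.0, np.nan, 13950.0],
})

missing_per_column = toy.isnull().sum()        # missing values in each column
total_missing = int(missing_per_column.sum())  # same idea as df.isnull().sum().sum()

# Without inplace=True, dropna() returns a new frame and leaves `toy` untouched
clean = toy.dropna()

print(missing_per_column)
print("rows before:", len(toy), "rows after:", len(clean))
```

Looking at the per-column counts before dropping rows can help decide whether removing rows is acceptable or whether a single column is responsible for most of the gaps.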

### Let's see distribution of column types in our dataframe:

In [None]:
df.dtypes.value_counts()

In [None]:
df.describe(include='object')

In [None]:
df.describe()

In [None]:
(df.select_dtypes(include=['object'])).columns

In [None]:
sns.boxplot(x='body-style',y='price',data=df)

In [None]:
featurecols=df.drop(['price'],axis=1)
label=df['price']

In [None]:
featurecols.corrwith(label)

In [None]:
abs(featurecols.corrwith(label)).sort_values(ascending=False)
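
As a sanity check of what `corrwith` returns, the sketch below computes the Pearson correlation by hand and compares it with pandas. The numbers are a small made-up sample and the helper `pearson` is a hypothetical name introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Small made-up sample (values are illustrative only)
toy = pd.DataFrame({
    "highway-mpg": [27, 26, 30, 22, 25, 38],
    "engine-size": [130, 152, 109, 136, 131, 90],
})
toy_price = pd.Series([13495, 16500, 13950, 17450, 15250, 6295], name="price")

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

by_hand = pd.Series({col: pearson(toy[col], toy_price) for col in toy.columns})
by_pandas = toy.corrwith(toy_price)

print(by_hand)
print(by_pandas)
```

Taking the absolute value before sorting, as in the cell above, ranks features by the strength of the linear relationship regardless of its sign.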

# 1. Linear Regression and Polynomial Regression

#### Let's load the modules for linear regression

In [None]:
from sklearn.linear_model import LinearRegression

#### Create the linear regression object

In [None]:
lm = LinearRegression()
lm

#### How could Highway-mpg help us predict car price?

In [None]:
X = df[['highway-mpg']]
Y = df['price']

Now fit the linear model using highway-mpg as feature and price as label.

In [None]:
lm.fit(X,Y)

To make predictions with the model we call its '.predict( )' method, passing X as its argument.

In [None]:
Yhat=lm.predict(X)
Yhat[0:5]  

Using the seaborn library we can easily draw a regression plot of this relationship:

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

The linear fit does not look very accurate for this feature: several points lie far from the line, which suggests underfitting. For this reason we will use a residual plot from seaborn, which plots the difference between each actual value and the value predicted by the line:

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])  # keyword arguments; recent seaborn versions reject positional x and y
plt.show()

In more accurate models we can expect the residual plot to concentrate many more points near zero on the y-axis, without any visible pattern.
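
The quantity shown in a residual plot can also be computed directly. The following is a minimal sketch on made-up numbers (not the automobile data), assuming the residual is defined as actual value minus predicted value:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, clearly curved relationship between one feature and the target
x_toy = np.array([[20.0], [25.0], [30.0], [35.0], [40.0]])
y_toy = np.array([30000.0, 18000.0, 12000.0, 9000.0, 8000.0])

toy_model = LinearRegression().fit(x_toy, y_toy)

# Residual = actual value minus value predicted by the fitted line
residuals = y_toy - toy_model.predict(x_toy)
print(residuals)
```

In this toy case the residuals are positive at both ends and negative in the middle. A systematic pattern like that is the usual sign that a straight line is missing some curvature in the data.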

As this is a linear regression function, we expect to obtain its intercept and slope:

#### What is the value of the intercept (a)?

In [None]:
lm.intercept_

#### What is the value of the Slope (b)?

In [None]:
lm.coef_
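
To see how the intercept and slope relate to '.predict( )', here is a short sketch on made-up values that rebuilds the predictions as a + b·x. The names `toy_model`, `a` and `b` are used only for this illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up single-feature data (illustrative values only)
x_toy = np.array([[20.0], [30.0], [40.0], [50.0]])
y_toy = np.array([25000.0, 17000.0, 11000.0, 7000.0])

toy_model = LinearRegression().fit(x_toy, y_toy)

a = toy_model.intercept_   # scalar intercept
b = toy_model.coef_[0]     # coef_ is an array with one entry per feature

# Rebuild the predictions by hand from the fitted line
manual = a + b * x_toy[:, 0]

print("intercept:", a, "slope:", b)
print(manual)
```

The slope is negative here because the made-up target falls as the feature grows.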

### Error metrics for regression model:
As we are dealing with predicting a continuous value, we calculate the errors using mean squared error and coefficient of determination (R2 score):

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

print('The R-square is: ', r2_score(Y, Yhat))
mse = mean_squared_error(Y, Yhat)
print('The mean square error of price and predicted value is: ', mse)
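
Both metrics can be written out in a few lines of NumPy, which may help in reading them. This is a sketch on made-up numbers, using the standard definitions MSE = mean((y − ŷ)²) and R² = 1 − SS_res / SS_tot:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices (illustrative only)
y_true = np.array([13495.0, 16500.0, 13950.0, 17450.0])
y_pred = np.array([14000.0, 16000.0, 13000.0, 18000.0])

# Mean squared error: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: one minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

The MSE is expressed in squared price units, so it is mostly useful for comparing models on the same target, while R² has no units and is at most 1.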

We saw earlier in the two plots, and now in the error metrics, that a linear model does not provide the best fit when highway-mpg is the only predictor variable, but we can improve the fit by transforming this feature into polynomial terms.

# Polynomial features 
Polynomial regression can be considered a particular case of the general (multiple) linear regression model: we capture non-linear relationships by adding squared or higher-order terms of the predictor variables, while the model remains linear in its coefficients.
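
As a quick illustration of what "higher-order terms" means in practice, the sketch below expands two made-up values with sklearn's `PolynomialFeatures`. A low degree is chosen here only so that the output is easy to read:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up observations of a single feature
x_small = np.array([[2.0], [3.0]])

# degree=3 produces the columns 1, x, x^2, x^3 for each row
expanded = PolynomialFeatures(degree=3).fit_transform(x_small)
print(expanded)
```

Each row now holds the powers of the original value, starting with the constant 1, so a single input feature of degree d becomes d + 1 columns.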

### Let's transform our feature to polynomial and fit a new model:

In [None]:
X = df[['highway-mpg']]
Y = df['price']

In [None]:
from sklearn.preprocessing import PolynomialFeatures

pr=PolynomialFeatures(degree=6)  #Transformer that expands our feature into polynomial terms up to the 6th degree
poly_feat=pr.fit_transform(X)

Below we see our actual feature, and then the 6th degree polynomial terms created from it:

In [None]:
X

In [None]:
pd.DataFrame(poly_feat)

We have to treat this last one as our new "predictor variable", even though it contains 7 columns (the powers 0 through 6 of highway-mpg). We fit a new linear regression and predict as before:

In [None]:
lm_poly=LinearRegression()
lm_poly.fit(poly_feat,Y)
poly_pred=lm_poly.predict(poly_feat)

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

print('The R-square is: ', r2_score(label, poly_pred))
mse = mean_squared_error(label, poly_pred)
print('The mean square error of price and predicted value is: ', mse)

Now we see that the error metrics have improved considerably, so this model fits the data much more closely.

We will use the following code to plot the polynomial function: 

In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)  # grid of x values used to draw the fitted curve
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)  # title follows the plotted feature
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

In [None]:
x = df['highway-mpg']
y = df['price']

In [None]:
f = np.polyfit(x, y, 6)
p = np.poly1d(f)
print(p)

In [None]:
PlotPolly(p, x, y, 'highway-mpg')

The plot above confirms that the gain in accuracy comes from the curve following the data points more closely than a straight line does.
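
The plot was built with `np.polyfit`, while the error metrics came from sklearn's `PolynomialFeatures` plus `LinearRegression`. Both routes solve the same least-squares problem, which the following sketch checks on synthetic data. The data-generating formula and the low degree are assumptions made only to keep the example small and numerically well behaved:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic trend plus noise (not the automobile data)
rng = np.random.default_rng(0)
x_toy = np.linspace(15, 55, 30)
y_toy = 40000 - 1200 * x_toy + 10 * x_toy ** 2 + rng.normal(0, 500, x_toy.size)

# Route 1: NumPy polynomial fit
p_toy = np.poly1d(np.polyfit(x_toy, y_toy, 2))
pred_numpy = p_toy(x_toy)

# Route 2: polynomial features followed by ordinary linear regression
feats = PolynomialFeatures(degree=2).fit_transform(x_toy.reshape(-1, 1))
pred_sklearn = LinearRegression().fit(feats, y_toy).predict(feats)

print(np.max(np.abs(pred_numpy - pred_sklearn)))
```

With higher degrees the two routes may differ slightly because the powers of the feature become very large and the problem gets harder to solve accurately.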

# 2. Multiple linear Regression
To get the highest possible accuracy from our models we should make use of all available variables, including the categorical ones (nominal and ordinal). To do so we must first transform these features using tools such as LabelBinarizer, LabelEncoder and OneHotEncoder. For the current project, however, we will only use numerical features so as to keep our focus on developing our models. We will deal with categorical variables in the next project, in which we will build more complex models.

### Now let's fit a model with all numerical features:

In [None]:
numerical_cols=featurecols.select_dtypes(exclude=['object']) #Select all columns which are not object type.

In [None]:
lm2=LinearRegression()    #A new object of the same class; the only difference from before is that we will fit multiple predictors.
lm2

### Fit the linear model using all of our numeric features above.

In [None]:
lm2.fit(numerical_cols, label) #Fitting our numerical columns as predictors and price as label.

#### As we know, the function of our model will have one coefficient for each feature
We should get a final linear function with the following structure:
$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + ... + b_n X_n
$$

#### What is the value of the intercept(a)?

In [None]:
lm2.intercept_

#### What are the values of the coefficients (b1, b2, b3, b4, ... , bn)?

In [None]:
lm2.coef_
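
The raw coefficient array is easier to read when each value is paired with its feature name. Here is a small sketch on a made-up dataframe (the values and the names `toy_lm` and `coef_by_name` are illustrative only) that also rebuilds the predictions from the linear function above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up numerical features and prices (illustrative values only)
toy = pd.DataFrame({
    "horsepower": [111, 154, 102, 115, 110],
    "curb-weight": [2548, 2823, 2337, 2824, 2507],
    "highway-mpg": [27, 26, 30, 22, 25],
})
toy_price = pd.Series([13495, 16500, 13950, 17450, 15250])

toy_lm = LinearRegression().fit(toy, toy_price)

# coef_ follows the column order of the training data
coef_by_name = pd.Series(toy_lm.coef_, index=toy.columns)
print(coef_by_name)

# Yhat = a + b_1 X_1 + ... + b_n X_n, written as a matrix product
manual = toy_lm.intercept_ + toy.to_numpy() @ toy_lm.coef_
print(manual)
```

Because the features are on different scales, the size of a coefficient alone does not tell us how important that feature is.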

In [None]:
Y_predicted = lm2.predict(numerical_cols)

How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because we can't visualize it with a regression or residual plot as before.

One way to look at the fit of the model is by looking at the distribution plot: We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.

In [None]:
plt.figure(figsize=(width, height))


# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=Y_predicted, color="b", label="Fitted Values", ax=ax1)
plt.legend()


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

print('The R-square is: ', r2_score(label, Y_predicted))
mse = mean_squared_error(label, Y_predicted)
print('The mean square error of price and predicted value is: ', mse)

We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap a bit. However, there is definitely some room for improvement.

### Now, we are going to create polynomial features, then standardize every column and finally feed our linear model with these features:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
pr=PolynomialFeatures(degree=2)
pr

In [None]:
poly_feat=pr.fit_transform(numerical_cols)

In [None]:
numerical_cols.shape

Initially we had 18 features to use as predictors. After expanding them into 2nd degree polynomial terms, we see below that the total number of features has increased to 190.

In [None]:
poly_feat.shape
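
The jump from 18 to 190 columns follows from a counting formula: with n input features and degree d, the expansion (including the constant column) has C(n + d, d) terms. A minimal check, using an all-zeros dummy array only to obtain the output shape:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 18, 2

# Number of polynomial terms up to the given degree, constant term included
expected = comb(n_features + degree, degree)

# A dummy row is enough to see how many columns the transformer creates
dummy = np.zeros((1, n_features))
actual = PolynomialFeatures(degree=degree).fit_transform(dummy).shape[1]

print(expected, actual)
```

For degree 2 this is 1 constant, 18 original features, 18 squares and 153 pairwise products.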

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
poly_feat=scaler.fit_transform(poly_feat)  #Apply standardization to our polynomial features 
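
Standardization subtracts each column's mean and divides by its standard deviation. The sketch below reproduces `StandardScaler` by hand on a made-up array, relying on the fact that the scaler uses the population standard deviation (NumPy's default, `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler().fit_transform(raw)

# Same computation by hand: (value - column mean) / column standard deviation
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so squared and product terms no longer dwarf the original features.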

Finally, we will feed these processed features to our model and expect a much better performance:

In [None]:
lm4=LinearRegression()
lm4.fit(poly_feat,label)   
Ypoly_predicted = lm4.predict(poly_feat)

In [None]:
plt.figure(figsize=(width, height))


# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=Ypoly_predicted, color="b", label="Fitted Values", ax=ax1)
plt.legend()


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

We see that the fit of this last model is almost perfect: both curves are almost the same, and we can quantify the difference between them by computing the error metrics, keeping in mind that they are measured on the same data the model was fitted on.

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

print('The R-square is: ', r2_score(label, Ypoly_predicted))
mse = mean_squared_error(label, Ypoly_predicted)
print('The mean square error of price and predicted value is: ', mse)

**Now we could predict the price of a new car with relatively high accuracy by taking its features, transforming them with the same pr and scaler objects, and passing the result as the argument of the function lm4.predict( )**
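
One way to keep the preprocessing and the model together when predicting for new rows is sklearn's `make_pipeline`. The following is a sketch on synthetic data, not the notebook's own workflow: the three features, the target formula and the name `pipe` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: 3 features and a target that is quadratic in them
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(50, 3))
y_toy = (5 + 2 * X_toy[:, 0] - X_toy[:, 1] ** 2
         + X_toy[:, 0] * X_toy[:, 2] + rng.normal(0, 0.1, 50))

# The pipeline applies the same three steps at fit time and at predict time
pipe = make_pipeline(PolynomialFeatures(degree=2),
                     StandardScaler(),
                     LinearRegression())
pipe.fit(X_toy, y_toy)

# A new, unseen row goes through the fitted transformers automatically
new_row = np.array([[4.0, 2.0, 7.0]])
predicted = pipe.predict(new_row)
print(predicted)
```

The design choice here is that the polynomial expansion and the scaler are fitted once on the training data and then reused unchanged for every new row, so we cannot forget a step when predicting.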