# Assumptions Of Linear Regression Algorithm

I will analyze this dataset against the main assumptions of linear regression.

Linear relationship between the features and the target - Linear regression requires the relationship between the independent and dependent variables to be linear. Linearity can be checked with scatter plots and the Pearson correlation.

No multicollinearity - Linear regression assumes that the independent variables are not highly correlated with each other. This assumption is tested using Variance Inflation Factor (VIF) values.

Homoscedasticity - This assumption states that the variance of the errors is similar across the values of the independent variables. A plot of standardized residuals versus predicted values shows whether the points are spread evenly across all predicted values.

Normal distribution of errors - Linear regression assumes that the residuals are normally distributed.

# 1. Reading And Understanding The Data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
dataset = pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv", index_col=0)

In [None]:
print(dataset.shape)

In [None]:
print(dataset.head())

In [None]:
print(dataset.info())

There are no null values in the dataset. However, several columns are categorical and will need to be encoded before modeling.

# 2. Cleaning The Data

Some brand names are misspelled or written inconsistently in the dataset. We need to fix them so that each brand is counted under a single name.
 * maxda = mazda 
 * toyouta = toyota
 * vokswagen = volkswagen
 * vw = volkswagen
 * porcshce = porsche
 * Nissan = nissan

In [None]:
# Keep only the first word of CarName, which is the brand
dataset['CarName'] = dataset['CarName'].str.split(' ', expand=True)[0]

In [None]:
print(dataset['CarName'].unique())

In [None]:
dataset['CarName'] = dataset['CarName'].replace({'maxda': 'mazda', 'porcshce': 'porsche', 'toyouta': 'toyota', 
                                                 'vokswagen': 'volkswagen', 'vw': 'volkswagen', 'Nissan': 'nissan'})

In [None]:
print(dataset['CarName'].unique())
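The two cleaning steps above (keep the brand token, then map misspellings to one canonical name) can be tried on a small example. This is a minimal sketch: the sample values in `raw_names` are hypothetical and only meant to mimic the shape of the `CarName` column.

```python
import pandas as pd

# Hypothetical sample of raw CarName values (brand followed by model).
raw_names = pd.Series(
    ["maxda rx3", "toyouta tercel", "vw dasher", "Nissan clipper", "audi 100ls"]
)

# Keep only the first token, which is the brand.
brands = raw_names.str.split(" ").str[0]

# Map known misspellings and aliases to a canonical lowercase brand.
corrections = {
    "maxda": "mazda",
    "toyouta": "toyota",
    "vw": "volkswagen",
    "Nissan": "nissan",
}
brands = brands.replace(corrections)

print(brands.tolist())
```

Values that are not keys of `corrections` (here `audi`) pass through unchanged.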

In [None]:
print(dataset.duplicated().sum())

There are no duplicate rows in the dataset.

# 3. Exploratory Data Analysis

First, I will check the distribution of car prices.

In [None]:
plt.figure(figsize=(16,5))

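plt.subplot(1,2,1)
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(dataset['price'], color='blue', fill=True)
plt.xlabel("price")

plt.subplot(1,2,2)
# Pass the column as x explicitly; a single color avoids the palette-without-hue warning
sns.boxplot(x=dataset['price'], color=sns.color_palette("Set3")[0])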

plt.show()

The car price distribution is right skewed: most cars are relatively cheap, and a long tail extends toward high prices. There are also some outliers in the dataset. Since linear regression is sensitive to outliers, I will deal with them later.
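Skewness and outliers can also be quantified instead of being read off the plots. The sketch below uses synthetic log-normal data as a stand-in for prices (the numbers are not from the car dataset) and flags outliers with the same 1.5 × IQR rule that a box plot draws.

```python
import numpy as np
import pandas as pd

# Synthetic log-normal "prices": many cheap cars and a few very expensive ones.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=9.3, sigma=0.5, size=500))

# Positive skewness means the long tail is on the right (high prices).
skewness = prices.skew()

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged as outliers.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]

print(f"skewness: {skewness:.2f}")
print(f"mean: {prices.mean():.0f}, median: {prices.median():.0f}")
print(f"outliers flagged: {len(outliers)}")
```

A mean noticeably above the median is another quick sign of a right-skewed distribution.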

In [None]:
plt.figure(figsize=(16,5))
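# Order brands by median price (avoid naming the variable "sorted", which shadows the built-in)
median_price = dataset.groupby(['CarName'])['price'].median().sort_values()
sns.boxplot(x=dataset['CarName'], y=dataset['price'], order=list(median_price.index))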
plt.title("Car Name vs Price")
plt.xticks(rotation=90)
plt.show()

Chevrolet is the cheapest brand. Honda, Dodge, and Plymouth are also cheap, but each has some outlier prices. BMW, Porsche, Buick, and Jaguar have the highest prices.

First, I will check the linear relationships between the variables visually. After that, I will look at the Pearson correlation and p-values to quantify them. If a feature's correlation with the price is weak, I will drop that feature.

In [None]:
sns.pairplot(dataset, diag_kind="kde", vars=['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 'price'])
plt.show()

Symboling and Carheight do not show a linear relationship with price.

In [None]:
sns.pairplot(dataset, diag_kind="kde", vars=['boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg', 'price'])
plt.show()

Stroke, Compressionratio, and Peakrpm do not show a linear relationship with price either. I will drop these variables as well.

Next, I will analyze the relationship between the categorical variables and the price.

In [None]:
categorical_columns = ['fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem']
categorical_data = dataset[categorical_columns]

In [None]:
plt.figure(figsize=(20,15))
for index, item in enumerate(categorical_columns, 1):
    plt.subplot(3,3,index)
    sns.barplot(x = item, y = 'price', data = dataset)
plt.show()  

* Diesel-powered cars are more expensive than gas-powered cars.
* Convertibles and hardtops are more expensive than sedans, hatchbacks, and wagons.
* Rear-wheel-drive (rwd) cars are more expensive than fwd and 4wd cars.
* Cars with a rear engine location are more expensive.
* The price tends to rise with the number of cylinders.
* Cars with mpfi and idi fuel systems are more expensive than those with other fuel systems.

# 4. Data Preparation

I will use the Pearson correlation to measure the linear relationship between each continuous variable and the price. If a feature has a weak relationship with the price, I will drop it from the dataset.

I will also check the p-value to see whether each correlation is statistically significant. By the usual convention, a p-value above 0.05 means the correlation is not significant, and a p-value below 0.05 means it is.
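For reference, the sample Pearson coefficient and the test statistic behind its p-value can be written as follows. The symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$, and $n$ are notation introduced here, not taken from the notebook.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
t = r\,\sqrt{\frac{n - 2}{1 - r^2}}
$$

Under the null hypothesis of no correlation (and assuming roughly normal data), $t$ follows a Student's $t$ distribution with $n - 2$ degrees of freedom, and the two-sided p-value is the probability of a value at least as extreme as the observed $|t|$.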

In [None]:
from scipy import stats
numerical_columns = dataset.select_dtypes(exclude='object').columns
for i in (list(numerical_columns)):
    pearson_coef, p_value = stats.pearsonr(dataset[i], dataset['price'])
    print(i.capitalize(), "Pearson Correlation:", pearson_coef, "P-value:", p_value)
    print("The correlation is not significant:", p_value>0.05)
    print()

Symboling, Carheight, Stroke, Compressionratio, and Peakrpm have a weak relationship with Price, and their correlations with the price are not statistically significant. I will drop them from the dataset.

Wheelbase and Boreratio have a moderate relationship with Price. The correlation is statistically significant.

Carlength, Carwidth, Curbweight, Enginesize, and Horsepower have a strong positive relationship with Price. The correlation is statistically significant.

Citympg and Highwaympg have a strong negative relationship with Price. The correlation is statistically significant.

Door number, on the other hand, shows no visible relationship with the price in the bar plots above. I will drop it from the dataset as well.

In [None]:
dataset.drop(['symboling', 'carheight', 'stroke', 'compressionratio', 'peakrpm', 'doornumber'], axis=1, inplace=True)

In [None]:
print(dataset.shape)

### Converting to numerical values

I will add a "Cars_Category" column that groups cars by the mean price of their brand: budget friendly, medium range, and expensive. Since this column replaces the brand information, I will then drop the "CarName" column.

In [None]:
# Mean price of each brand, broadcast back to every row (keeps the original index)
brand_mean_price = dataset.groupby('CarName')['price'].transform('mean')
bins = [0, 10000, 20000, 40000]
label = ['Budget_Friendly', 'Medium_Range', 'Expensive_Cars']
# Bin on the brand-level mean. Going through a merged copy would misalign rows,
# because merge() resets the index while dataset keeps the index read from the CSV.
dataset['Cars_Category'] = pd.cut(brand_mean_price, bins, right=False, labels=label)
dataset.drop("CarName", axis=1, inplace=True)
dataset.head()
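The behavior of `pd.cut` at the bin edges is easy to get wrong, so here is a small sketch with hypothetical brand-level mean prices (the brand names and values are invented for illustration).

```python
import pandas as pd

# Hypothetical brand-level mean prices.
brand_mean_price = pd.Series(
    [6000.0, 10000.0, 17500.0, 26000.0],
    index=["brand_a", "brand_b", "brand_c", "brand_d"],
)

bins = [0, 10000, 20000, 40000]
labels = ["Budget_Friendly", "Medium_Range", "Expensive_Cars"]

# right=False makes the intervals [0, 10000), [10000, 20000), [20000, 40000),
# so a value of exactly 10000 lands in Medium_Range.
category = pd.cut(brand_mean_price, bins, right=False, labels=labels)
print(category)
```

Values outside the outermost bin edges would come back as `NaN`, so the upper edge has to sit above the largest brand mean.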

I will convert the categorical variables to numerical ones. Since I treat the categorical variables in the dataset as nominal, I can one-hot encode them, here with `pd.get_dummies`.

In [None]:
column = ['fueltype','aspiration','carbody', 'drivewheel', 'enginelocation', 'enginetype','cylindernumber', 'fuelsystem', 'Cars_Category']
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
dummies = pd.get_dummies(dataset[column], drop_first=True, dtype=int)
dataset = pd.concat([dataset, dummies], axis = 1)
dataset.drop(column, axis = 1, inplace = True)
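To see what `drop_first=True` does, here is a toy example with a single three-level column. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Toy frame with one nominal column that has three levels.
toy = pd.DataFrame({"drivewheel": ["fwd", "rwd", "4wd", "fwd"]})

# drop_first=True removes the first level in sorted order ("4wd"),
# which then acts as the baseline: a row of all zeros means "4wd".
encoded = pd.get_dummies(toy, columns=["drivewheel"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level per variable avoids the dummy variable trap, where the full set of dummy columns sums to one and is perfectly collinear with the intercept.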

In [None]:
print(dataset.shape)

### Rescaling

The Ordinary Least Squares method does not make normality assumptions about the data itself. It makes normality assumptions about the residuals, so I will not transform the variables to force a Gaussian distribution.

On the other hand, linear regression is sensitive to outliers. The Quantile Transformer is robust to outliers: with its default settings it maps each variable to a uniform distribution on [0, 1] based on ranks, which pulls extreme values in toward the rest of the data.

In [None]:
from sklearn.preprocessing import QuantileTransformer
transform =  QuantileTransformer(n_quantiles=205)
columns = ['wheelbase', 'carlength', 'carwidth', 'curbweight','enginesize','boreratio','horsepower','citympg','highwaympg','price']
dataset[columns] = transform.fit_transform(dataset[columns]) 
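The reason a quantile transform tames outliers can be shown with a simplified rank-based sketch. This is not scikit-learn's implementation (which interpolates between estimated quantiles). The helper `rank_to_uniform` and the sample values are hypothetical.

```python
import numpy as np

def rank_to_uniform(values):
    """Map each value to its rank scaled to [0, 1] (ties are not handled specially)."""
    values = np.asarray(values, dtype=float)
    # argsort of argsort gives each value's rank: 0 for the smallest, n - 1 for the largest.
    ranks = values.argsort().argsort()
    return ranks / (len(values) - 1)

# Hypothetical measurements with one extreme value.
sizes = [90, 110, 130, 150, 1000]
uniform = rank_to_uniform(sizes)
print(uniform)
```

The extreme value 1000 ends up only one step above 150, because only the ordering of the values matters, not their distances.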

In [None]:
print(dataset.columns.values)

These are all the columns in the dataset now.

### Correlation Between Variables

Linear regression assumes that the independent variables are not strongly correlated with each other. If they are, the coefficient estimates become unstable and hard to interpret.

To check for multicollinearity, I will use a correlation heatmap and VIF.

In [None]:
plt.figure(figsize = (40, 40))
sns.heatmap(dataset.corr(method ='pearson'), cmap='PuBu', annot=True, linewidths=.5, annot_kws={'size':8})
plt.show()

In [None]:
print(dataset.corr(method ='pearson').unstack().sort_values().drop_duplicates())


Multicollinearity exists among the predictors. Even after transforming the variables, there are strong relationships between some independent variables. For this reason, I will use the Variance Inflation Factor (VIF) to detect multicollinearity and eliminate the affected variables from the dataset.
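When the heatmap is too large to read, the strongly correlated pairs can be listed directly. The sketch below uses synthetic columns (the names are borrowed from the dataset, the values are random) and an arbitrary illustrative threshold of 0.8.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
length = rng.normal(size=200)
toy = pd.DataFrame({
    "carlength": length,
    "curbweight": length + rng.normal(scale=0.2, size=200),  # built to track carlength
    "peakrpm": rng.normal(size=200),                          # independent noise
})

corr = toy.corr(method="pearson")

# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)

# 0.8 is an arbitrary illustrative threshold, not a rule from the analysis above.
high = pairs[pairs.abs() > 0.8]
print(high)
```

Masking the lower triangle avoids relying on `drop_duplicates()`, which could also discard distinct pairs that happen to share the same correlation value.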

### Checking Pearson Correlation

Before eliminating correlated variables, I will recompute the Pearson correlation and p-value of every remaining feature against the price, and drop the features whose correlation is not statistically significant.

In [None]:
data = list(dataset.columns)
for i in data:
    pearson_coef, p_value = stats.pearsonr(dataset[i], dataset['price'])
    print(i.capitalize(), "Pearson Correlation:", pearson_coef, "P-value:", p_value)
    print("The correlation is not significant:", p_value>0.05)
    if p_value>0.05:
        dataset.drop(i, axis=1, inplace=True)
    print()

In [None]:
print(dataset.shape)

### Variance Inflation Factor (VIF)

The variance inflation factor (VIF) measures how much the variance of a coefficient is inflated because its feature can be predicted from the other features. I will keep the features whose VIF is below 10.
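For intuition, the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. Below is a minimal NumPy sketch on synthetic data. The helper `vif` is hypothetical and adds an intercept to each auxiliary regression, so its numbers can differ from a library routine that uses the design matrix exactly as given.

```python
import numpy as np

def vif(X):
    """VIF of each column of a 2-D array, with an intercept in every auxiliary regression."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # unrelated to a and b

values = vif(np.column_stack([a, b, c]))
print(values)
```

The two near-duplicate columns get a large VIF, while the unrelated column stays close to 1.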

In [None]:
X = dataset.drop('price', axis=1)
y = dataset['price']

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["variables"] = X.columns
vif['vif'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
for index,column in enumerate(X.columns):
    print(index, column, vif['vif'][index])
    if vif['vif'][index]>10:
        vif = vif.drop([index], axis=0)

In [None]:
print(vif)

In [None]:
print(list(vif['variables']))

# 5. Building the Model

In [None]:
# Keep only the features that passed the VIF filter, plus the target
columns = list(vif['variables'])
data = dataset[columns]
data = pd.concat([data, dataset['price']], axis=1)

In [None]:
from sklearn.model_selection import train_test_split
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.75, test_size = 0.25, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lr = LinearRegression()
lr.fit(X_train,y_train)
pred_test = lr.predict(X_test)
pred_train = lr.predict(X_train)
print("R Squared Value of Train Data: {}".format(r2_score(y_train, pred_train)))
print("R Squared Value of Test Data: {}".format(r2_score(y_test, pred_test)))
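As a reminder of what the R squared values measure, the score can be computed by hand as 1 - SS_res / SS_tot. The sketch below fits a least-squares line to synthetic data with NumPy. The variables and the true coefficients (slope 2.0, intercept 0.5) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)

# Ordinary least squares fit with an explicit intercept column.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ beta

ss_res = np.sum((y - pred) ** 2)      # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
print(f"R squared: {r2:.3f}")
```

A test score far below the train score would suggest overfitting, which is why both values are printed above.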

# 6. Evaluating The Model

I will check the residual normality assumption visually. Errors should be normally distributed.

In [None]:
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(y_train - pred_train, kde=True)
plt.title('Train Data Residual Analysis', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)

plt.subplot(1,2,2)
sns.histplot(y_test - pred_test, kde=True)
plt.title('Test Data Residual Analysis', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)

plt.show()
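A rough numeric complement to the histograms is to look at the skewness and excess kurtosis of the residuals, both of which are close to zero for a normal distribution. This is a sketch on synthetic residuals: the helper `skew_and_excess_kurtosis` is hypothetical, and it is a descriptive check rather than a formal normality test.

```python
import numpy as np

def skew_and_excess_kurtosis(residuals):
    """Return (skewness, excess kurtosis) computed from standardized residuals."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

rng = np.random.default_rng(4)
normal_resid = rng.normal(size=5000)             # well-behaved residuals
skewed_resid = rng.exponential(size=5000) - 1.0  # centered but strongly right skewed

normal_stats = skew_and_excess_kurtosis(normal_resid)
skewed_stats = skew_and_excess_kurtosis(skewed_resid)
print("normal:", normal_stats)
print("skewed:", skewed_stats)
```

Values far from zero, as in the second case, indicate residuals that depart from normality.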

I will check ***Homoscedasticity***. The residuals should show no specific pattern: their spread should stay roughly constant across the predicted values.

In [None]:
from yellowbrick.regressor import ResidualsPlot
visualizer = ResidualsPlot(lr)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()  # show() replaces poof(), which is deprecated in Yellowbrick 1.0+
plt.show()
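The residual plot can be backed by an informal numeric check: sort the observations by fitted value, split them in half, and compare the residual variance of the two halves. This is a sketch on synthetic data: the helper `variance_ratio` is hypothetical and is a simplified check, not a formal test such as Breusch-Pagan.

```python
import numpy as np

def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by that in the lower half."""
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(fitted)
    half = len(order) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return high.var() / low.var()

rng = np.random.default_rng(5)
fitted = rng.uniform(0, 1, size=2000)
constant_spread = rng.normal(scale=0.1, size=2000)  # homoscedastic residuals
growing_spread = rng.normal(scale=0.05 + 0.3 * fitted)  # spread grows with the fitted value

ratio_constant = variance_ratio(fitted, constant_spread)
ratio_growing = variance_ratio(fitted, growing_spread)
print(f"constant spread ratio: {ratio_constant:.2f}")
print(f"growing spread ratio:  {ratio_growing:.2f}")
```

A ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 points to a funnel shape in the residual plot.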