Hi! This is my first time modelling a dataset on Kaggle. I've read several submissions here and they are great, with very high model scores. In my first version, I used two regression methods, PCR (Principal Component Regression) and PLS (Partial Least Squares), and compared the two models. Since PCR was the better model, I wondered what would happen if I combined PCA (Principal Component Analysis) with XGBRegressor, since it is the best model so far. What surprised me is that the score is 0.99! I don't know whether it's true or not (amateur shock). So let's get started, and please comment down below.

# Import Packages and Data

First, I'm going to import some packages I need and also the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.signal import savgol_filter
from statsmodels.sandbox.stats.runs import runstest_1samp
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/audi.csv')
data

# Create Dummy Variables

Since the 'model', 'transmission', and 'fuelType' columns are categorical, I will convert their values to numeric values. Here, I'll create dummy variables (one 0/1 column per category, with the first category dropped as the reference).

In [None]:
data_encode_dummy = pd.get_dummies(data,columns=['model', 'transmission','fuelType'], drop_first = True)
data_encode_dummy

In [None]:
unique_model = data['model'].unique()
transmission_unique = data['transmission'].unique()
fueltype_unique = data['fuelType'].unique()

unique_model.sort()
transmission_unique.sort()
fueltype_unique.sort()

print(data_encode_dummy.columns)
print(unique_model)
print(transmission_unique)
print(fueltype_unique)

The code above shows which category is the reference category for each variable. With drop_first = True, get_dummies drops the first level in sorted order, so model A1, transmission Automatic, and fuel type Diesel are the reference categories.
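To make the idea of a reference category concrete, here is a minimal sketch on a made-up table (the `toy` frame and its values are hypothetical, not taken from audi.csv). It only illustrates that the first level in sorted order is the one that disappears from the dummy columns.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "fuelType": ["Petrol", "Diesel", "Hybrid", "Diesel"],
    "price": [12000, 15000, 18000, 14000],
})

# drop_first=True removes the first level in sorted order
toy_dummies = pd.get_dummies(toy, columns=["fuelType"], drop_first=True)

# The dropped level is the reference category
dropped = sorted(toy["fuelType"].unique())[0]

print(list(toy_dummies.columns))
print("Reference category:", dropped)
```

A row with zeros in every remaining dummy column belongs to the reference category.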

# Assumptions Check

Next, we're going to check the assumptions for linear regression. I'll use linear regression because it's the most basic regression method.

### Linearity

In [None]:
X = data_encode_dummy.drop(columns = ['price'])
Y = data_encode_dummy['price']

title = ['Price vs Year', 
         'Price vs Mileage', 
         'Price vs Tax', 
         'Price vs Mpg', 
         'Price vs Engine Size']

fig,ax = plt.subplots(3,2,figsize=(20,20))

i = 0
for rows in range(3):
    for cols in range(2):
        if rows == 2 and cols == 1:
            fig.delaxes(ax[rows,cols])
            break
        ax[rows,cols].scatter(x = X[X.columns[i]], y = Y)
        ax[rows,cols].set_title(title[i])
        i = i+1
        
fig.subplots_adjust(hspace=0.5, wspace=0.5)

We can see from the plots above that not all of the independent variables have a linear relationship with the dependent variable ('price' is the dependent variable and the rest are the independent variables). Therefore, we need to transform the data. Most of them have an exponential relationship with 'price', so we will only take the logarithm of the 'price' variable.
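As a quick illustration of why the logarithm helps, here is a small sketch with synthetic numbers (the decay rate and starting price are made up). If price falls exponentially with mileage, then log(price) is an exact straight line in mileage, which shows up as a correlation of magnitude 1.

```python
import numpy as np

# Synthetic example: price decays exponentially with mileage (made-up numbers)
mileage = np.linspace(0, 100_000, 50)
price = 30_000 * np.exp(-1.5e-5 * mileage)

# Pearson correlation measures how close the relationship is to a straight line
r_raw = np.corrcoef(mileage, price)[0, 1]
r_log = np.corrcoef(mileage, np.log(price))[0, 1]

print(f"corr(mileage, price)      = {r_raw:.4f}")
print(f"corr(mileage, log(price)) = {r_log:.4f}")
```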

### Transformation

In [None]:
data_encode_dummy_transform = data_encode_dummy.copy()
data_encode_dummy_transform['price'] = np.log(data_encode_dummy_transform['price'])
data_encode_dummy_transform

In [None]:
X = data_encode_dummy_transform.drop(columns = ['price'])
Y = data_encode_dummy_transform['price']

title = ['Price vs Year', 
         'Price vs Mileage', 
         'Price vs Tax', 
         'Price vs Mpg', 
         'Price vs Engine Size']

fig,ax = plt.subplots(3,2,figsize=(20,20))

i = 0
for rows in range(3):
    for cols in range(2):
        if rows == 2 and cols == 1:
            fig.delaxes(ax[rows,cols])
            break
        ax[rows,cols].scatter(x = X[X.columns[i]], y = Y)
        ax[rows,cols].set_title(title[i])
        i = i+1
        
fig.subplots_adjust(hspace=0.5, wspace=0.5)

After the transformation, we plot the data again to check whether the relationships are linear now. We can see that they look linear, so we can continue to the next step.

Note: I'm sorry for the long variable names. I just want to make it clear what those variables are hehe...

### Multicollinearity

Next, we're going to check whether the independent variables are correlated with each other. For an overview, we'll check the correlation matrix.

In [None]:
X = data_encode_dummy_transform[['year','mileage','tax','mpg','engineSize']]
X.corr()

We can see that the correlation between 'mileage' and 'year' is high, and so is the correlation between 'mpg' and 'tax'. We need to check the VIF (Variance Inflation Factor) to see the overall correlation between the independent variables.

In [None]:
VIF = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], 
                index=X.columns)
print(VIF)

We can see that the VIFs for 'year', 'mpg', and 'engineSize' are more than 10. It means that each of those variables is strongly related to the other variables (for example, 'year' can largely be explained by a linear combination of the other independent variables). So there is multicollinearity in this data. From a statistical point of view, we could exclude those variables, but I don't know whether they are important for predicting the price. So, I'll keep them.

# Split Data

As usual, we will split the train and test data.

In [None]:
X = data_encode_dummy_transform.drop(columns = ['price'])
Y = data_encode_dummy_transform['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)
Y_train = Y_train.reset_index(drop = True)
Y_test = Y_test.reset_index(drop = True)
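
train_test_split shuffles the rows randomly, so the scores quoted in this notebook will change a little on every run unless a seed is fixed. A small sketch with made-up arrays, where random_state=0 is just an arbitrary seed I picked:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up arrays standing in for X and Y
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same seed gives the same split every time
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Test rows:", split_a[1].shape[0])
```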

# Create Model

And now for the main event, we will create the model. Because the data has multicollinearity, we need a regression method that can handle it. Like I said in the introduction, I made a model in my first version using PCR and PLS and then compared those models. Then I wondered what would happen if I combined PCA and XGBRegressor. So first we will apply PCA to the matrix of independent variables, then we will use XGBRegressor to create the model. I will use the code for PCA from [this site](https://nirpyresearch.com/principal-component-regression-python/).

In [None]:
# Define the PCA object
pca = PCA()

# Preprocessing (1): first derivative
d1X = savgol_filter(X_train, 25, polyorder = 5, deriv=1)

# Preprocess (2) Standardize features by removing the mean and scaling to unit variance
Xstd = StandardScaler().fit_transform(d1X[:,:])

# Run PCA producing the reduced variable Xreg ([:,:] keeps all principal components)
Xreg = pca.fit_transform(Xstd)[:,:]

XGB = XGBRegressor(random_state=0)

XGB.fit(Xreg,Y_train)
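
To show why PCA helps with multicollinearity, here is a minimal sketch on synthetic data (the `X_toy` matrix and the 95% threshold are my own assumptions). Two of the four columns are near-copies of the other two, yet the principal component scores come out uncorrelated, and the cumulative explained variance suggests how many components would be enough.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
# Four columns, where columns 2 and 3 are noisy copies of columns 0 and 1
X_toy = np.column_stack([base, base + 0.05 * rng.normal(size=(300, 2))])

X_std = StandardScaler().fit_transform(X_toy)
pca_toy = PCA().fit(X_std)
scores = pca_toy.transform(X_std)

# Smallest number of components whose cumulative variance reaches 95%
cum_var = np.cumsum(pca_toy.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)

# Correlation matrix of the scores: off-diagonal entries are ~0
corr_scores = np.corrcoef(scores, rowvar=False)

print("Cumulative explained variance:", np.round(cum_var, 4))
print("Components needed for 95%:", n_keep)
```

Passing a number to PCA(n_components=...) would keep only that many components instead of all of them.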

In [None]:
XGB.score(Xreg,Y_train)

In [None]:
Y_PCA_XGB = XGB.predict(Xreg)

mean_squared_error(Y_train, Y_PCA_XGB)

As you can see, the $R^2$ on the training data is $0.989 \approx 99\%$, which is high, and the $MSE$ (measured on the log-price scale) is $0.0022$, which is small.

# Diagnostic Checking

I saw that many people skip this step, but I'll do it anyway.

### Normality of Residuals

In [None]:
resid = Y_train - Y_PCA_XGB

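# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a similar plot
sns.histplot(resid, kde=True)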

Since the plot looks like that, we can assume that the residuals are approximately normal.
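
A numeric check can complement the visual one. Below is a sketch using scipy.stats.normaltest (D'Agostino and Pearson's test), which is my own choice and is not used elsewhere in this notebook. The two residual samples are synthetic: one bell-shaped and one clearly skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(0, 0.05, size=1000)            # bell-shaped residuals
skewed_resid = rng.exponential(0.05, size=1000) - 0.05   # right-skewed residuals

# Null hypothesis: the sample comes from a normal distribution
stat_normal, p_normal = stats.normaltest(normal_resid)
stat_skewed, p_skewed = stats.normaltest(skewed_resid)

print("p-value (bell-shaped):", p_normal)
print("p-value (skewed):     ", p_skewed)
```

A small p-value is evidence against normality. With thousands of rows, even tiny departures can give a small p-value, so the plot is still worth looking at.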

### Autocorrelation of Residuals

To check the autocorrelation of residuals, we will use the runs test, whose null hypothesis is that the residual values are random (there is no autocorrelation in the residuals).

In [None]:
result = runstest_1samp(resid)[1]
print('P-value :',result)

With significance level $\alpha=0.05$, we can see that the p-value $> 0.05$. Therefore, we fail to reject the null hypothesis, so there is no evidence of autocorrelation in the residuals.
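
To get a feel for how the runs test behaves, here is a small sketch on synthetic series (both series are made up). Independent noise switches sides of its mean often, while a random walk stays on one side for long stretches, so it produces far fewer runs than expected.

```python
import numpy as np
from statsmodels.sandbox.stats.runs import runstest_1samp

rng = np.random.default_rng(0)
noise = rng.normal(size=500)             # independent draws
walk = np.cumsum(rng.normal(size=500))   # strongly autocorrelated random walk

# runstest_1samp returns (z statistic, p-value); the default cutoff is the mean
z_noise, p_noise = runstest_1samp(noise)
z_walk, p_walk = runstest_1samp(walk)

print("p-value (independent noise):", p_noise)
print("p-value (random walk):      ", p_walk)
```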

### Homoscedasticity of Residuals

In [None]:
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.scatter(Y_PCA_XGB,resid)

We can see that the residuals don't form any particular shape, which suggests that the variance is constant. Therefore, we can say that the homoscedasticity of residuals is achieved.

Since this model has passed all of the diagnostic checks, we can say that it is good enough to predict Audi car prices.

# Test Data

Now, we will test our model with the test data.

In [None]:
# Preprocessing (1): first derivative
d1X = savgol_filter(X_test, 25, polyorder = 5, deriv=1)

# Preprocess (2) Standardize features by removing the mean and scaling to unit variance
Xstd = StandardScaler().fit_transform(d1X[:,:])

# Run PCA producing the reduced variable Xreg ([:,:] keeps all principal components)
Xreg = pca.fit_transform(Xstd)[:,:]

prediction = XGB.predict(Xreg)

data_test = {
    'Y_test' : Y_test,
    'Prediction' : prediction
}

pd.DataFrame(data_test)

In [None]:
mean_squared_error(Y_test, prediction)
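
One thing worth double-checking in the test cell: it calls fit_transform again on the test data, so StandardScaler and PCA are re-estimated from the test set, and the resulting components may not line up with the ones XGB was trained on. The usual pattern is to fit the preprocessing on the training data only and just transform the test data. Below is a minimal sketch of that pattern on made-up data with a scikit-learn pipeline. GradientBoostingRegressor is only a stand-in for XGBRegressor, the Savitzky-Golay step is left out, and all the `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with two collinear columns
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 6))
X_toy[:, 1] = X_toy[:, 0] + 0.1 * rng.normal(size=400)
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 2] + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Stand-in regressor; XGBRegressor could take its place
model = make_pipeline(
    StandardScaler(), PCA(), GradientBoostingRegressor(random_state=0)
)
model.fit(X_tr, y_tr)        # scaler, PCA and regressor are fitted on train only
pred = model.predict(X_te)   # test rows are only transformed, never refitted

mse_test = mean_squared_error(y_te, pred)
print("Test MSE:", mse_test)
```

Keeping the steps in one pipeline object means the fitted scaler and PCA travel with the model, so the same transformation is applied to any new data.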

Well, the $MSE$ on the test data using this model is higher than using PCR, but it's still low for me haha.... At least, it's great to know how to handle multicollinearity. Please comment down below so that we can learn from each other. :)

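Note: if we want to use this model on new data, we must first apply the same preprocessing to the independent variables (first derivative, standardization, then PCA). Then, after we get the predicted values, we should take their exponent to see the actual price, because the model predicts the logarithm of the price.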
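
The back-transformation in the note can be sketched like this. The prices and the prediction errors are made up, and `log_pred` only pretends to be the model output. It also shows that an MSE computed on the log scale is not in the same units as an error computed on the price scale.

```python
import numpy as np

# Made-up true prices and pretend model output on the log scale
price_true = np.array([12500.0, 18990.0, 30500.0])
log_pred = np.log(price_true) + np.array([0.05, -0.03, 0.02])

# Undo the log transform to get predictions in price units
price_pred = np.exp(log_pred)

mse_log = np.mean((np.log(price_true) - log_pred) ** 2)         # log scale
rmse_price = np.sqrt(np.mean((price_true - price_pred) ** 2))   # price units

print("Predicted prices:", np.round(price_pred, 2))
print("MSE on log scale:", mse_log)
print("RMSE in price units:", rmse_price)
```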