## Methodology
The response variable in our dataset is Price, which is continuous, so this is a regression problem. We will therefore consider several regression models.

In [38]:
# Create variables for feature data and price values
X = new_df.drop(["Price"], axis=1) # feature data (predictors)
y = new_df["Price"].values # price values (response)

### Split the Dataset into Train and Test Datasets
We will select 80% of our dataset as our train dataset, and the remaining 20% as our test dataset.

The **input** will include the following features: <br>
Mileage<br>
EngineV<br>
Registration<br>
Year<br>
Engine Type_Gas, Engine Type_Petrol, Engine Type_Diesel, Engine Type_Other<br>
Brand_Audi, Brand_BMW, Brand_Mercedes-Benz, Brand_Mitsubishi, Brand_Renault, Brand_Toyota<br>
Model_Vista, Model_Vito, Model_X1, Model_X3, Model_X5, Model_X5 M, Model_X6, Model_Yaris, Model_Z3, Model_Z4 ...<br>


The **output** will be the car Price.

We will be using the same inputs and outputs for all the following regression models.

In [39]:
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### The Linear Regression Model

Linear Regression is a widely used supervised learning algorithm, valued for its simplicity and interpretability. It predicts a continuous numerical value from one or more independent variables (features). Here, features such as the car's mileage, brand and engine version can be used to predict car prices. <br>

The goal of a linear regression model is to find a linear relationship between the independent variables and the dependent variable (i.e., car price). The model does this by estimating the values of the coefficients that multiply each independent variable, such that the sum of the product of these coefficients and independent variables, along with an intercept term, results in the predicted value of the dependent variable (i.e., car price). <br>
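
As a compact restatement of the paragraph above, the model can be written as follows. The symbols are our own notation and do not appear elsewhere in this notebook: $x_j$ is the $j$-th feature, $\beta_j$ its coefficient, $\beta_0$ the intercept, $p$ the number of features and $n$ the number of training rows.

$$
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
$$

Ordinary least squares chooses the coefficients that minimise the sum of squared residuals over the training rows:

$$
\min_{\beta_0, \ldots, \beta_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$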

Once the model is trained on a training dataset, it can be used to predict the car prices on unseen data by applying the learned coefficients to the independent variables of the new data. The accuracy of the model's prediction is typically evaluated using metrics such as the mean squared error (MSE) or R-squared. <br>
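
To make the two metrics concrete, here is a minimal sketch that computes MSE, RMSE and R² by hand. The arrays are hypothetical toy values, not rows from the car dataset.

```python
import numpy as np

# Hypothetical toy values, not taken from the car dataset
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print("MSE :", mse)
print("RMSE:", rmse)
print("R²  :", r2)
```

R² compares the model's squared error with that of always predicting the mean, so a value of 1 is a perfect fit and a value of 0 is no better than the mean.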

However, linear regression also has limitations. It assumes that the relationship between the independent variables and the dependent variable is linear, and it can be sensitive to outliers and multicollinearity. Nevertheless, it remains a useful model for predicting car prices and other numerical values in many real-world applications.
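
One common source of multicollinearity with one-hot encoded features is sketched below, using a hypothetical three-brand example. Whether it applies to `new_df` depends on how the dummy columns were created, which this section does not show.

```python
import numpy as np

# Hypothetical category labels for six cars (three brands)
brand = np.array([0, 1, 2, 0, 1, 2])

one_hot = np.eye(3)[brand]                  # full one-hot encoding, one column per brand
intercept = np.ones((len(brand), 1))        # constant column, as used when fit_intercept=True
design = np.hstack([intercept, one_hot])    # 4 columns in total

# The one-hot columns sum to the intercept column, so the columns are linearly dependent
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)

# Dropping one dummy column restores full column rank
design_dropped = np.hstack([intercept, one_hot[:, 1:]])
rank_dropped = np.linalg.matrix_rank(design_dropped)
print("columns:", design_dropped.shape[1], "rank:", rank_dropped)
```

When the rank is lower than the number of columns, the least-squares coefficients are not unique, which can make them unstable.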


### Obtain HyperParameters for the Linear Regression Model

These are the default hyperparameters for the Linear Regression model.

In [40]:
linreg_model = LinearRegression()
linreg_model_params = linreg_model.get_params()
linreg_model_params

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}

### Training for Linear Regression Model

In [41]:
# Linear Regression using Train Data
linreg = LinearRegression()         # create the linear regression object
linreg.fit(X_train, y_train)        # train the linear regression model 

In [42]:
y_train_pred = linreg.predict(X_train)

# Mean Squared Error (MSE)
def mean_sq_err(actual, predicted):
    '''Returns the Mean Squared Error of actual and predicted values'''
    return np.mean(np.square(np.array(actual) - np.array(predicted)))

mse = mean_sq_err(y_train, y_train_pred)

### LinearRegression Training Results
The LinearRegression model appears to fit our training dataset well, achieving the following results:<br>
- Explained Variance (R²) 	: 0.8417371648276059 
- Mean Squared Error (MSE) 	: 14073180.660390412
- Root Mean Squared Error (RMSE) 	: 3751.4238177511234 

In [43]:
print("Goodness Fit of Model (Train set)")

# Explained Variance (R²)
print("Explained Variance (R²) \t:", linreg.score(X_train, y_train))

print("Mean Squared Error (MSE) \t:", mse)
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mse))

Goodness Fit of Model (Train set)
Explained Variance (R²) 	: 0.8417371648276059
Mean Squared Error (MSE) 	: 14073180.660390412
Root Mean Squared Error (RMSE) 	: 3751.4238177511234


### Testing for Linear Regression Model

In [44]:
y_test_pred = linreg.predict(X_test)

# Mean Squared Error (MSE)
def mean_sq_err(actual, predicted):
    '''Returns the Mean Squared Error of actual and predicted values'''
    return np.mean(np.square(np.array(actual) - np.array(predicted)))

mse = mean_sq_err(y_test, y_test_pred)

### LinearRegression Testing Results
The LinearRegression model does not seem to fit our testing dataset, achieving the following results:<br>
- Explained Variance (R²) 	: -4.24883870973393e+20
- Mean Squared Error (MSE) 	: 4.381265430358157e+28
- Root Mean Squared Error (RMSE) 	: 209314725481943.97

In [45]:
print("Goodness Fit of Model (Test set)")

# Explained Variance (R²)
print("Explained Variance (R²) \t:", linreg.score(X_test, y_test))

print("Mean Squared Error (MSE) \t:", mse)
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mse))

Goodness Fit of Model (Test set)
Explained Variance (R²) 	: -4.24883870973393e+20
Mean Squared Error (MSE) 	: 4.381265430358157e+28
Root Mean Squared Error (RMSE) 	: 209314725481943.97


### Tuning Linear Regression HyperParameters
We will use GridSearchCV, which evaluates every combination of values in the parameter grid with cross-validation and returns the best-scoring set of hyperparameters for the Linear Regression model, with the aim of improving its performance.

- 'copy_X' and 'n_jobs' affect memory usage and computational efficiency respectively, so they are unlikely to change how well the model fits.
- Setting 'fit_intercept' to True lets the model estimate an intercept (constant term), which can reduce bias. Setting 'positive' to True constrains every coefficient to be non-negative, which can make the model easier to interpret. These two settings are therefore the ones that can change the R² and RMSE values of our model.
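
As a small illustration of how the search picks between 'fit_intercept' and 'positive', here is a sketch on synthetic data. The names `X_demo`, `y_demo`, `param_grid` and `grid` are hypothetical, and the data is generated with a non-zero intercept and one negative coefficient; none of it comes from the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration): y = 5 + 2*x0 - 3*x1 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Only the two hyperparameters that can change the fitted coefficients
param_grid = {"fit_intercept": [False, True], "positive": [False, True]}

# Each combination is scored by 5-fold cross-validated R² on the data passed to fit()
grid = GridSearchCV(LinearRegression(), param_grid=param_grid, cv=5, scoring="r2")
grid.fit(X_demo, y_demo)

print("Best parameters:", grid.best_params_)
print("Best mean CV R²:", grid.best_score_)
```

Because this synthetic target needs both an intercept and a negative coefficient, the search should prefer the unconstrained model here. On other data, such as ours, a constrained model can score better in cross-validation.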

In [46]:
# 0 and 1 stand in for False and True; newer scikit-learn releases may require real booleans
parameters = {'copy_X': [0, 1],
              'fit_intercept': [0, 1],
              'n_jobs': [-1, 1, 5, 10],
              'positive': [0, 1]
             }

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [48]:
grid_linreg = GridSearchCV(estimator=linreg_model, param_grid = parameters, cv = 2, n_jobs=-1)
grid_linreg.fit(X_train, y_train)

print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n",grid_linreg.best_estimator_)
print("\n The best parameters across ALL searched params:\n",grid_linreg.best_params_)


 Results from Grid Search 

 The best estimator across ALL searched params:
 LinearRegression(copy_X=0, fit_intercept=0, n_jobs=-1, positive=1)

 The best parameters across ALL searched params:
 {'copy_X': 0, 'fit_intercept': 0, 'n_jobs': -1, 'positive': 1}


### Retrain Linear Regression model using Tuned HyperParameters

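GridSearchCV returned the following best parameters across all searched combinations:
 {'copy_X': 0, 'fit_intercept': 0, 'n_jobs': -1, 'positive': 1}.

The retrained model below uses n_jobs=1 rather than -1. This only changes how many CPU cores are used, not the fitted coefficients.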

In [49]:
linreg_tuned = LinearRegression(copy_X=0, fit_intercept=0, n_jobs=1, positive=1)
linreg_tuned.fit(X_train, y_train)

#### LinearRegression Training results using Tuned HyperParameters
Our tuned Linear Regression model achieved the following results on the training dataset:
- Explained Variance (R²) 	: 0.832822917797544
- Mean Squared Error (MSE) 	: 14865860.816592315
- Root Mean Squared Error (RMSE) 	: 3855.627162549864


In [50]:
y_train_pred = linreg_tuned.predict(X_train)

# Mean Squared Error (MSE)
def mean_sq_err(actual, predicted):
    '''Returns the Mean Squared Error of actual and predicted values'''
    return np.mean(np.square(np.array(actual) - np.array(predicted)))

mse = mean_sq_err(y_train, y_train_pred)

print("Goodness Fit of Model (Train set)")

# Explained Variance (R²)
print("Explained Variance (R²) \t:", linreg_tuned.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:", mse)
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mse))

Goodness Fit of Model (Train set)
Explained Variance (R²) 	: 0.832822917797544
Mean Squared Error (MSE) 	: 14865860.816592315
Root Mean Squared Error (RMSE) 	: 3855.627162549864


#### LinearRegression Testing results using Tuned HyperParameters
Our tuned Linear Regression model achieved the following results on the testing dataset:
- Explained Variance (R²) 	: 0.7211749826587799
- Mean Squared Error (MSE) 	: 28751536.432706818
- Root Mean Squared Error (RMSE) 	: 5362.04591855635

In [51]:
y_test_pred = linreg_tuned.predict(X_test)

print("Goodness Fit of Model (Test set)")

# Explained Variance (R²)
print("Explained Variance (R²) \t:", linreg_tuned.score(X_test, y_test))

# Mean Squared Error (MSE)
def mean_sq_err(actual, predicted):
    '''Returns the Mean Squared Error of actual and predicted values'''
    return np.mean(np.square(np.array(actual) - np.array(predicted)))


mse = mean_sq_err(y_test, y_test_pred)

print("Mean Squared Error (MSE) \t:", mse)
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mse))


Goodness Fit of Model (Test set)
Explained Variance (R²) 	: 0.7211749826587799
Mean Squared Error (MSE) 	: 28751536.432706818
Root Mean Squared Error (RMSE) 	: 5362.04591855635


### Analysis
The large negative R² value and extreme RMSE value on our test dataset before hyperparameter tuning indicate a very poor fit. A negative R² means the model performs worse than a horizontal line at the mean of the test prices. Since the same model achieved an R² of about 0.84 on the training data, it fails to generalize to new data, which is consistent with overfitting. Errors of this magnitude can also arise when highly collinear or very sparse one-hot columns receive extreme coefficients, although this was not checked here.
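
The following sketch shows why R² can fall below zero. The values and the helper `r_squared` are hypothetical and only illustrate the definition; they are not taken from our results.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values
y_actual = np.array([10.0, 20.0, 30.0])
baseline = np.full_like(y_actual, y_actual.mean())  # horizontal line at the mean
wild = np.array([10.0, 20.0, 1000.0])               # one wildly wrong prediction

r2_baseline = r_squared(y_actual, baseline)
r2_wild = r_squared(y_actual, wild)

print("R² of mean baseline  :", r2_baseline)
print("R² with one wild miss:", r2_wild)
```

A single extreme prediction is enough to push R² far below zero, because its squared error dwarfs the total variance of the actual values.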

Hyperparameter tuning may have lowered the R² on the training data because the tuned settings (no intercept, non-negative coefficients) constrain the model, making it less flexible and less prone to overfitting. A more constrained model may fit the training data slightly worse but generalize better to new data. Note that the drop in training R² is small (about 0.01, from 0.842 to 0.833) and has little practical impact on the fit for the training data.

Hyperparameter tuning raised the R² on the test data from a large negative value to 0.72. GridSearchCV never sees the test set: it selects hyperparameters by cross-validating on the training data. The improvement therefore likely reflects settings that generalize better to unseen data, rather than settings fitted to the test data itself.


