
Car price prediction is essential for optimizing revenue in the automotive industry. It facilitates dynamic pricing strategies, aligning with market demand and competitor analyses. Accurate predictions aid in inventory optimization, mitigating overstock and stockout risks. Robust predictive models, leveraging features such as mileage, engine volume, and production year, enhance decision-making for marketing and sales campaigns. This data-driven approach supports effective financial planning, ensuring strategic resource allocation and risk management.

**The dataset used for this assignment contains information about cars. Here is an overview of its features and the target variable:**

# **Features:**

**ID:** Identifier for each car entry.

**Levy:** The tax or levy imposed on the car.

**Manufacturer:** The company that produced the car.

**Model:** The specific model name of the car.

**Prod. year:** The year the car was manufactured.

**Category:** The category or type of the car.

**Leather interior:** Binary indicator of whether the car has leather interior (Yes/No).

**Fuel type:** The type of fuel the car uses.

**Engine volume:** The volume of the car's engine.

**Mileage:** The distance the car has traveled.

**Cylinders:** Number of cylinders in the car's engine.

**Gear box type:** The type of gearbox in the car.

**Drive wheels:** The type of wheels that receive power from the engine.

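**Doors:** Number of doors on the car, recorded as a range. Raw values such as `04-May` and `02-Mar` look like spreadsheet date conversions of `4-5` and `2-3`.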

**Wheel:** The steering wheel position (`Left wheel` or `Right-hand drive`).

**Color:** The color of the car.

**Airbags:** Number of airbags in the car.

# **Target Variable:**

**Price:** The target variable, representing the price of the car.

# **Objective:**

The objective of the assignment is to develop and implement machine learning algorithms to accurately predict car prices based on a dataset containing relevant features such as manufacturer, model, production year, mileage, fuel type, engine volume, and other specifications. The goal is to leverage these features to build a predictive model that can effectively estimate the price of cars in the automotive market. The assignment aims to employ data-driven approaches to enhance decision-making in the automotive industry, supporting activities such as pricing strategy optimization, inventory management, and overall competitiveness.

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
data=pd.read_csv("/content/drive/MyDrive/car_price_prediction.csv")



# **Data Preprocessing:**

Data preprocessing is the method of cleaning, transforming, and organizing raw data into a format suitable for analysis or machine learning. It involves handling missing values, scaling features, encoding categorical variables, and addressing outliers to enhance the quality and usability of the data for subsequent tasks. The goal is to ensure accurate, efficient, and meaningful analysis or model training.

In [None]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19237 entries, 0 to 19236
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                19237 non-null  int64  
 1   Price             19237 non-null  int64  
 2   Levy              19237 non-null  object 
 3   Manufacturer      19237 non-null  object 
 4   Model             19237 non-null  object 
 5   Prod. year        19237 non-null  int64  
 6   Category          19237 non-null  object 
 7   Leather interior  19237 non-null  object 
 8   Fuel type         19237 non-null  object 
 9   Engine volume     19237 non-null  object 
 10  Mileage           19237 non-null  object 
 11  Cylinders         19237 non-null  float64
 12  Gear box type     19237 non-null  object 
 13  Drive wheels      19237 non-null  object 
 14  Doors             19237 non-null  object 
 15  Wheel             19237 non-null  object 
 16  Color             19237 non-null  object 
 17  Airbags           19237 non-null  int64  

In [None]:
data.head()

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
0,45654403,13328,1399,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,3.5,186005 km,6.0,Automatic,4x4,04-May,Left wheel,Silver,12
1,44731507,16621,1018,CHEVROLET,Equinox,2011,Jeep,No,Petrol,3.0,192000 km,6.0,Tiptronic,4x4,04-May,Left wheel,Black,8
2,45774419,8467,-,HONDA,FIT,2006,Hatchback,No,Petrol,1.3,200000 km,4.0,Variator,Front,04-May,Right-hand drive,Black,2
3,45769185,3607,862,FORD,Escape,2011,Jeep,Yes,Hybrid,2.5,168966 km,4.0,Automatic,4x4,04-May,Left wheel,White,0
4,45809263,11726,446,HONDA,FIT,2014,Hatchback,Yes,Petrol,1.3,91901 km,4.0,Automatic,Front,04-May,Left wheel,Silver,4


In [None]:
data.tail()

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
19232,45798355,8467,-,MERCEDES-BENZ,CLK 200,1999,Coupe,Yes,CNG,2.0 Turbo,300000 km,4.0,Manual,Rear,02-Mar,Left wheel,Silver,5
19233,45778856,15681,831,HYUNDAI,Sonata,2011,Sedan,Yes,Petrol,2.4,161600 km,4.0,Tiptronic,Front,04-May,Left wheel,Red,8
19234,45804997,26108,836,HYUNDAI,Tucson,2010,Jeep,Yes,Diesel,2,116365 km,4.0,Automatic,Front,04-May,Left wheel,Grey,4
19235,45793526,5331,1288,CHEVROLET,Captiva,2007,Jeep,Yes,Diesel,2,51258 km,4.0,Automatic,Front,04-May,Left wheel,Black,4
19236,45813273,470,753,HYUNDAI,Sonata,2012,Sedan,Yes,Hybrid,2.4,186923 km,4.0,Automatic,Front,04-May,Left wheel,White,12


In [None]:
data.isnull().sum()


ID                  0
Price               0
Levy                0
Manufacturer        0
Model               0
Prod. year          0
Category            0
Leather interior    0
Fuel type           0
Engine volume       0
Mileage             0
Cylinders           0
Gear box type       0
Drive wheels        0
Doors               0
Wheel               0
Color               0
Airbags             0
dtype: int64

In [None]:
data.duplicated().sum()

313

Remove duplicate rows

In [None]:
data = data.drop_duplicates()

In [None]:
data.duplicated().sum()

0
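One detail worth keeping in mind at this step: `drop_duplicates()` keeps the original row labels, so the index has gaps afterwards. The sketch below uses a made-up three-row frame (the name `toy` and its values are illustrative, not taken from the dataset) to show the effect and how `reset_index(drop=True)` restores a contiguous index.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 are exact duplicates
toy = pd.DataFrame({"ID": [1, 1, 2], "Price": [100, 100, 250]})

deduped = toy.drop_duplicates()
print(list(deduped.index))   # original labels are kept, so label 1 is missing

# Renumber the rows 0..n-1 so that later joins on the index line up
deduped = deduped.reset_index(drop=True)
print(list(deduped.index))
```

A contiguous index matters whenever a new frame built from a NumPy array is later concatenated column-wise, because pandas aligns rows by label rather than by position.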

In [None]:
data.dropna(inplace=True)


In [None]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,18924.0,45575380.0,937546.823889,20746880.0,45695007.5,45771914.5,45801742.25,45816654.0
Price,18924.0,18587.44,192135.630726,1.0,5331.0,13172.0,22063.0,26307500.0
Prod. year,18924.0,2010.914,5.665749,1939.0,2009.0,2012.0,2015.0,2020.0
Cylinders,18924.0,4.580216,1.200223,1.0,4.0,4.0,4.0,16.0
Airbags,18924.0,6.568379,4.322323,0.0,4.0,6.0,12.0,16.0


In [None]:
data.columns


Index(['ID', 'Price', 'Levy', 'Manufacturer', 'Model', 'Prod. year',
       'Category', 'Leather interior', 'Fuel type', 'Engine volume', 'Mileage',
       'Cylinders', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color',
       'Airbags'],
      dtype='object')

In [None]:
final_dataset = data[['ID', 'Price', 'Levy', 'Manufacturer', 'Model', 'Prod. year',
       'Category', 'Leather interior', 'Fuel type', 'Engine volume', 'Mileage',
       'Cylinders', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color',
       'Airbags']]

In [None]:
final_dataset.head()

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
0,45654403,13328,1399,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,3.5,186005 km,6.0,Automatic,4x4,04-May,Left wheel,Silver,12
1,44731507,16621,1018,CHEVROLET,Equinox,2011,Jeep,No,Petrol,3.0,192000 km,6.0,Tiptronic,4x4,04-May,Left wheel,Black,8
2,45774419,8467,-,HONDA,FIT,2006,Hatchback,No,Petrol,1.3,200000 km,4.0,Variator,Front,04-May,Right-hand drive,Black,2
3,45769185,3607,862,FORD,Escape,2011,Jeep,Yes,Hybrid,2.5,168966 km,4.0,Automatic,4x4,04-May,Left wheel,White,0
4,45809263,11726,446,HONDA,FIT,2014,Hatchback,Yes,Petrol,1.3,91901 km,4.0,Automatic,Front,04-May,Left wheel,Silver,4


In [None]:
print(data.dtypes)


ID                    int64
Price                 int64
Levy                 object
Manufacturer         object
Model                object
Prod. year            int64
Category             object
Leather interior     object
Fuel type            object
Engine volume        object
Mileage              object
Cylinders           float64
Gear box type        object
Drive wheels         object
Doors                object
Wheel                object
Color                object
Airbags               int64
dtype: object


## **Feature Selection for Prediction Task:**

**Random Forest Feature Importance:**


1.   Train a Random Forest model to assess feature importance.
2.   Use the SelectFromModel method to choose features based on importance.




In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import LabelEncoder

In [None]:
# Label-encode every text column. LabelEncoder assigns integer codes in sorted order of the
# string values, so for the numeric-like columns 'Levy', 'Engine volume' and 'Mileage' the
# codes follow string order rather than numeric magnitude.
label_encoder = LabelEncoder()
text_columns = ['Levy', 'Manufacturer', 'Model', 'Category', 'Leather interior', 'Fuel type',
                'Engine volume', 'Mileage', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel',
                'Color']
for column in text_columns:
    # fit_transform refits the encoder, so each column gets its own mapping
    data[column] = label_encoder.fit_transform(data[column])
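Label encoding treats every distinct string as a category, which discards the magnitudes in numeric-like columns such as `Levy`, `Mileage`, and `Engine volume`. As an alternative, the sketch below parses those columns into numbers. It uses a few values copied from the previews above; the helper name `parse_numeric_columns`, the extra `Turbo` flag column, and the choice to treat `-` in `Levy` as missing are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def parse_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Levy, Mileage and Engine volume stored as floats."""
    out = df.copy()
    # Assumption: '-' marks a missing levy, so it is coerced to NaN
    out["Levy"] = pd.to_numeric(out["Levy"], errors="coerce")
    # '186005 km' -> 186005.0
    out["Mileage"] = out["Mileage"].str.replace(" km", "", regex=False).astype(float)
    # Keep the turbo marker as its own 0/1 column (hypothetical name), then strip it
    out["Turbo"] = out["Engine volume"].str.contains("Turbo", regex=False).astype(int)
    out["Engine volume"] = (
        out["Engine volume"].str.replace("Turbo", "", regex=False).str.strip().astype(float)
    )
    return out

sample = pd.DataFrame({
    "Levy": ["1399", "-", "836"],
    "Mileage": ["186005 km", "300000 km", "116365 km"],
    "Engine volume": ["3.5", "2.0 Turbo", "2"],
})
parsed = parse_numeric_columns(sample)
print(parsed)
```

Parsed this way, the missing `Levy` entries would still need to be imputed or dropped before fitting a model that does not accept NaN values.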


In [None]:
# Separate features and target variable
X = data.drop('Price', axis=1)
y = data['Price']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Train a random forest regressor to get feature importances
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)

In [None]:
# Use feature importances for feature selection
sfm = SelectFromModel(rf_model, threshold=0.1)  # You can adjust the threshold as needed
sfm.fit(X_train, y_train)

# Get selected feature indices
selected_feature_indices = sfm.get_support(indices=True)

# Display selected features
selected_features = X.columns[selected_feature_indices]
print("Selected Features:", selected_features)

Selected Features: Index(['Prod. year', 'Engine volume', 'Mileage'], dtype='object')
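To see how close each feature is to the `0.1` cutoff, it can help to rank the importances directly. The following is a minimal sketch on synthetic data (the column names `a`, `b`, `noise` and all values are invented for illustration); the final filter mirrors a fixed importance threshold of `0.1`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only): the target depends on 'a' and 'b', not on 'noise'
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "noise"])
y_demo = 3.0 * X_demo["a"] + 1.0 * X_demo["b"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sorting them makes the ranking explicit
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))

# Keep the features whose importance reaches the threshold
selected = list(importances[importances >= 0.1].index)
print("Selected:", selected)
```

Identifier-like columns such as `ID` can be left out of `X` before this step, since any importance they receive does not reflect a property of the car.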


## **Feature Engineering:**

Feature engineering involves creating new features or transforming existing ones to improve the model's performance.

In [None]:
current_year = 2023  # Update with the current year
data['Car Age'] = current_year - data['Prod. year']

In [None]:
# Check the data type of the 'Mileage' column; it is int64 here because the column was label-encoded above
print(data['Mileage'].dtype)

# Cast to str so the string operations apply, strip any ' km' suffix and thousands separators, then convert to float
data['Mileage'] = data['Mileage'].astype(str)
data['Mileage'] = data['Mileage'].str.replace(' km', '', regex=False).str.replace(',', '', regex=False).astype(float)


int64


In [None]:
# Make sure 'Mileage' is stored as float
data['Mileage'] = data['Mileage'].astype(float)

# Convert 'Engine volume' to numerical values, stripping any ' L' suffix first
data['Engine volume'] = data['Engine volume'].astype(str).str.replace(' L', '', regex=False).astype(float)

# Create interaction terms
data['Mileage_per_Year'] = data['Mileage'] / data['Car Age']
data['EngineV_times_Age'] = data['Engine volume'] * data['Car Age']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Polynomial features for selected numerical features
base_columns = ['Car Age', 'Mileage', 'Engine volume']
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[base_columns])

# With include_bias=False the output holds the 3 linear terms followed by the
# 6 degree-2 terms, so take the column labels from the transformer itself
poly_columns = poly.get_feature_names_out(base_columns)

# Reuse data's index: drop_duplicates() left gaps in it, and concatenating a frame
# with a fresh 0..n-1 index would misalign rows and introduce NaN values
data_poly = pd.DataFrame(poly_features, columns=poly_columns, index=data.index)

# Replace the original columns with their polynomial expansion
data = pd.concat([data.drop(columns=base_columns), data_poly], axis=1)


In [None]:
# Standardize numerical features
numeric_features = data.select_dtypes(include=['float64']).columns
scaler = StandardScaler()
data[numeric_features] = scaler.fit_transform(data[numeric_features])
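Fitting the scaler on the full dataset lets test-set statistics influence training. One common alternative, sketched here on synthetic data (all names and values are illustrative), is to split first and wrap the scaler and the model in a pipeline so that the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with features on a non-unit scale
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from X_tr only; X_te is merely transformed
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"Test R-squared: {test_r2:.3f}")
```

The same pattern also keeps the target out of the scaler, so error metrics stay in the original price units.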


In [None]:
# Display the updated dataset with engineered features
data.head()

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Gear box type,...,EngineV_times_Age,Car Age*Car Age,Car Age*Mileage,Car Age*Engine volume,Mileage*Car Age,Mileage*Mileage,Mileage*Engine volume,Engine volume*Car Age,Engine volume*Mileage,Engine volume*Engine volume
0,0.084288,-0.027374,-0.563572,-0.061132,0.924073,-0.161366,-0.81193,0.614976,-0.79002,-0.599184,...,0.989982,0.161366,-0.236131,1.409546,-0.040594,-0.04702,0.989982,-0.489566,0.379451,1.296567
1,-0.900111,-0.010235,-1.069006,-1.410441,-0.496921,0.015138,-0.81193,-1.62608,0.870883,1.630041,...,0.528418,-0.015138,-0.181823,0.952113,-0.151328,-0.09327,0.528418,-0.448037,0.256125,0.761464
2,0.212302,-0.052675,-1.087386,-0.679565,-0.433657,-0.867381,-1.170113,-1.62608,0.870883,2.744654,...,-0.407269,0.867381,-0.101696,-1.2697,0.490931,0.507914,-0.407269,-0.383574,-0.658651,-0.942127
3,0.206719,-0.07797,1.265182,-0.960671,-0.489621,0.015138,-0.81193,0.614976,-0.79002,-0.599184,...,0.151631,-0.015138,-0.425319,0.298639,-0.151328,-0.31422,0.151631,-0.620588,-0.262027,0.106237
4,0.249468,-0.035712,0.594332,-0.679565,-0.433657,0.544649,-1.170113,0.614976,0.870883,-0.599184,...,-0.95989,-0.544649,1.792857,-1.2697,-0.430379,0.951699,-0.95989,2.249051,0.227285,-0.942127


## **Random Forest:**

Random Forest is an ensemble learning method that constructs a multitude of decision trees during training and, for regression tasks, outputs the average prediction of the individual trees. Random Forest is suitable for car price prediction because it can handle non-linear relationships and interactions between features. It is robust, less prone to overfitting than a single decision tree, and generally performs well on diverse datasets.
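The averaging step can be checked directly in scikit-learn, where the fitted trees are exposed through `estimators_`. This is a minimal sketch on synthetic data generated with `make_regression`, not on the car dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Predict with every individual tree, then average the results by hand
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction matches the hand-computed average
forest_prediction = forest.predict(X_demo[:5])
print(np.allclose(manual_average, forest_prediction))
```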

**Hyperparameter Tuning and Cross-Validation:**

Important hyperparameters include the number of trees (n_estimators), maximum depth of trees (max_depth), and minimum samples required to split a node (min_samples_split). Hyperparameter tuning can be done using techniques like grid search or randomized search. Cross-validation helps estimate the model's performance on different subsets of the data and aids in selecting the best hyperparameters.
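A randomized search over the three hyperparameters named above could look like the sketch below. It runs on synthetic data, and the candidate values, `n_iter=5`, and 3-fold cross-validation are illustrative choices rather than settings tuned for the car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

# Candidate values (illustrative)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # sample 5 of the 27 combinations
    cv=3,                              # 3-fold cross-validation per candidate
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the negated MSE
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated MSE: {-search.best_score_:.1f}")
```

Randomized search trades completeness for speed: it evaluates a fixed number of sampled combinations instead of the full grid.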

## **Gradient Boosting:**

Gradient Boosting is an ensemble method that builds a series of weak learners (typically decision trees) sequentially. Each new tree corrects the errors made by the previous ones, leading to a strong predictive model. Gradient Boosting is suitable for car price prediction because it can capture complex relationships in the data. It often performs well and is less prone to overfitting than an individual decision tree.
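The sequential correction can be observed with `staged_predict`, which yields the ensemble's prediction after each added tree. The sketch below uses synthetic data and illustrative settings (`n_estimators=50`, `learning_rate=0.1`) and tracks the training error as trees are added.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

booster = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=0)
booster.fit(X_demo, y_demo)

# One training MSE per boosting stage: after 1 tree, 2 trees, ..., 50 trees
train_mse = [mean_squared_error(y_demo, pred) for pred in booster.staged_predict(X_demo)]

print(f"Training MSE after  1 tree : {train_mse[0]:.1f}")
print(f"Training MSE after 50 trees: {train_mse[-1]:.1f}")
```

Training error alone says nothing about generalization, so the same curve computed on held-out data is the more useful guide for choosing the number of trees.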

**Hyperparameter Tuning and Cross-Validation:**

Key hyperparameters include the learning rate (learning_rate), the number of trees (n_estimators), and the maximum depth of trees (max_depth). Hyperparameter tuning is typically done using grid search or randomized search. Cross-validation helps assess model generalization and select optimal hyperparameter values.
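An exhaustive grid search over these three hyperparameters might be set up as follows. The grid values and the 3-fold split are assumptions chosen to keep the sketch fast, and the data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

# 2 x 2 x 2 = 8 candidate combinations (illustrative values)
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,            # each candidate is scored on 3 held-out folds
    scoring="r2",
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(f"Best mean cross-validated R-squared: {grid.best_score_:.3f}")
```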

## **Bayesian Regression:**

Bayesian Regression is a probabilistic approach to linear regression that incorporates Bayesian principles. Instead of providing a point estimate for model parameters, Bayesian Regression provides a probability distribution over possible values, capturing uncertainty in the parameter estimates. The model incorporates a prior distribution representing prior beliefs about the parameters and updates this distribution based on the observed data using Bayes' theorem. This leads to a posterior distribution that characterizes the uncertainty in the parameter estimates. Bayesian Regression is particularly suitable when dealing with limited data. Its probabilistic nature allows for better handling of uncertainty, which is crucial when the dataset is small.
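In scikit-learn, this uncertainty is available from `BayesianRidge.predict` through `return_std=True`. The sketch below fits a one-feature model on 30 synthetic points (all values invented for illustration) and compares the predictive standard deviation inside and far outside the training range.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Small synthetic dataset: y is roughly 2x + 1 with unit noise, x in [0, 10]
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = 2.0 * X_demo[:, 0] + 1.0 + rng.normal(scale=1.0, size=30)

model = BayesianRidge().fit(X_demo, y_demo)

# return_std=True also returns the standard deviation of the predictive distribution
X_new = np.array([[5.0], [50.0]])   # the second point lies far outside the training range
mean, std = model.predict(X_new, return_std=True)

for x, m, s in zip(X_new[:, 0], mean, std):
    print(f"x={x:5.1f}  prediction={m:7.2f}  std={s:.2f}")
```

The larger standard deviation at the extrapolated point reflects the uncertainty in the estimated coefficients.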

**Hyperparameter Tuning and Cross-Validation:**

Bayesian Regression has fewer hyperparameters compared to some other algorithms. The choice of the prior distribution is a key hyperparameter. Different types of prior distributions (e.g., Gaussian, Laplace) and their parameters can be experimented with to assess their impact on model performance. Cross-validation is crucial to assess the model's performance on different subsets of the data. Given the probabilistic nature of Bayesian Regression, cross-validation helps in understanding the stability of predictions and evaluating the generalization performance.
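In scikit-learn's `BayesianRidge`, the weights have a Gaussian prior, and the tunable prior settings are the Gamma hyperprior parameters `alpha_1`, `alpha_2` (noise precision) and `lambda_1`, `lambda_2` (weight precision). The sketch below compares two illustrative settings with 5-fold cross-validation on synthetic data; the candidate values are assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=20.0, random_state=0)

# lambda_1 and lambda_2 are the shape and rate of the Gamma prior on the weight precision
candidates = {"default (1e-6)": 1e-6, "stronger (1.0)": 1.0}

mean_r2 = {}
for label, value in candidates.items():
    model = BayesianRidge(lambda_1=value, lambda_2=value)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    mean_r2[label] = scores.mean()
    print(f"{label}: mean R-squared = {mean_r2[label]:.3f}")
```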

### **Random Forest:**
We split the dataset into training and testing sets specifically for the Random Forest model. We then initialize the Random Forest model, train it on the training set, make predictions on the test set, and finally evaluate its performance using mean squared error (MSE) and R-squared.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Separate features and target variable
X = data.drop('Price', axis=1)
y = data['Price']

# Split the data into training and testing sets for Random Forest
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Impute missing values in the features
imputer = SimpleImputer(strategy='mean')
X_train_rf_imputed = imputer.fit_transform(X_train_rf)
X_test_rf_imputed = imputer.transform(X_test_rf)

# Impute missing values in the target variable
y_train_rf_imputed = imputer.fit_transform(y_train_rf.values.reshape(-1, 1))
y_test_rf_imputed = imputer.transform(y_test_rf.values.reshape(-1, 1))
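Filling a missing target with the mean gives the model a made-up label, and doing so on the test set distorts the evaluation metrics. A common alternative is to drop rows that have no target and impute only the features. The sketch below shows this on a toy frame whose column names echo the dataset but whose values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical values) with gaps in the features and in the target
toy = pd.DataFrame({
    "Mileage": [100.0, np.nan, 300.0, 400.0, 500.0, 600.0],
    "Airbags": [4.0, 8.0, 12.0, np.nan, 2.0, 6.0],
    "Price": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0],
})

# Rows without a target cannot be used for supervised training or scoring
labelled = toy.dropna(subset=["Price"])
X_toy, y_toy = labelled.drop(columns="Price"), labelled["Price"]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)

# Learn the column means on the training rows and reuse them for the test rows
feature_imputer = SimpleImputer(strategy="mean")
X_tr_filled = feature_imputer.fit_transform(X_tr)
X_te_filled = feature_imputer.transform(X_te)

print(np.isnan(X_tr_filled).any(), np.isnan(X_te_filled).any())
```

Keeping a dedicated imputer for the features also avoids refitting the same object on the target afterwards.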

In [None]:
# Initialize the Random Forest model
random_forest_model = RandomForestRegressor(random_state=42)

# Train the Random Forest model on the training set
random_forest_model.fit(X_train_rf_imputed, y_train_rf_imputed.ravel())

# Make predictions on the test set
rf_predictions = random_forest_model.predict(X_test_rf_imputed)

In [None]:
# Evaluate the Random Forest model
mse_rf = mean_squared_error(y_test_rf_imputed, rf_predictions)
r2_rf = r2_score(y_test_rf_imputed, rf_predictions)

# Print the evaluation metrics
print(f"Random Forest - Mean Squared Error: {mse_rf:.2f}, R-squared: {r2_rf:.2f}")

Random Forest - Mean Squared Error: 0.50, R-squared: -48.65
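R-squared compares the model's squared error with that of always predicting the mean of the true values, so it turns negative whenever the model does worse than that baseline. A small worked example with made-up numbers, checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])   # deliberately poor predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean: 5
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 20/5 = -3

print(r2_manual, r2_score(y_true, y_pred))
```

Both values are -3.0: the predictions have four times the squared error of the constant mean prediction.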


### **Bayesian Regression:**

Bayesian Ridge Regression is the form of Bayesian regression used here. scikit-learn's `BayesianRidge` does not accept missing values, so the features are imputed in the same way as for Random Forest. Its ridge-style regularization makes it fairly robust to multicollinearity.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Separate features and target variable
X = data.drop('Price', axis=1)
y = data['Price']

# Split the data into training and testing sets for Bayesian Ridge Regression
X_train_bayesian, X_test_bayesian, y_train_bayesian, y_test_bayesian = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Impute missing values in the features
imputer = SimpleImputer(strategy='mean')
X_train_bayesian_imputed = imputer.fit_transform(X_train_bayesian)
X_test_bayesian_imputed = imputer.transform(X_test_bayesian)

# Impute missing values in the target variable
y_train_bayesian_imputed = imputer.fit_transform(y_train_bayesian.values.reshape(-1, 1))
y_test_bayesian_imputed = imputer.transform(y_test_bayesian.values.reshape(-1, 1))


In [None]:
# Initialize the Bayesian Ridge Regression model
bayesian_model = BayesianRidge()

# Train the Bayesian Ridge Regression model on the training set
bayesian_model.fit(X_train_bayesian_imputed, y_train_bayesian_imputed.ravel())

# Make predictions on the test set
bayesian_predictions = bayesian_model.predict(X_test_bayesian_imputed)

In [None]:
# Evaluate the Bayesian Ridge Regression model
mse_bayesian = mean_squared_error(y_test_bayesian_imputed, bayesian_predictions)
r2_bayesian = r2_score(y_test_bayesian_imputed, bayesian_predictions)

# Print the evaluation metrics
print(f"Bayesian Ridge Regression - Mean Squared Error: {mse_bayesian:.2f}, R-squared: {r2_bayesian:.2f}")

Bayesian Ridge Regression - Mean Squared Error: 0.01, R-squared: 0.09


### **Gradient Boosting:**
This code splits the dataset into training and testing sets specifically for the Gradient Boosting model. It then initializes the Gradient Boosting model, trains it on the training set, makes predictions on the test set, and evaluates its performance using mean squared error (MSE) and R-squared.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Separate features and target variable
X = data.drop('Price', axis=1)
y = data['Price']

# Split the data into training and testing sets for Gradient Boosting
X_train_gb, X_test_gb, y_train_gb, y_test_gb = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Impute missing values in the features
imputer = SimpleImputer(strategy='mean')
X_train_gb_imputed = imputer.fit_transform(X_train_gb)
X_test_gb_imputed = imputer.transform(X_test_gb)

# Impute missing values in the target variable
y_train_gb_imputed = imputer.fit_transform(y_train_gb.values.reshape(-1, 1))
y_test_gb_imputed = imputer.transform(y_test_gb.values.reshape(-1, 1))

In [None]:
# Initialize the Gradient Boosting model
gradient_boosting_model = GradientBoostingRegressor(random_state=42)

# Train the Gradient Boosting model on the training set
gradient_boosting_model.fit(X_train_gb_imputed, y_train_gb_imputed.ravel())

# Make predictions on the test set
gb_predictions = gradient_boosting_model.predict(X_test_gb_imputed)

In [None]:
# Evaluate the Gradient Boosting model
mse_gb = mean_squared_error(y_test_gb_imputed, gb_predictions)
r2_gb = r2_score(y_test_gb_imputed, gb_predictions)

# Print the evaluation metrics
print(f"Gradient Boosting - Mean Squared Error: {mse_gb:.2f}, R-squared: {r2_gb:.2f}")

Gradient Boosting - Mean Squared Error: 0.02, R-squared: -0.73


The Bayesian Ridge Regression model has the lowest MSE (0.01, against 0.02 for Gradient Boosting and 0.50 for Random Forest) and the only positive R-squared (0.09). The negative R-squared values of the other two models mean that they predict the test set worse than a constant equal to the mean price.

## **Conclusion:**

Among the models evaluated, Bayesian Ridge Regression performs best on the test set, with the lowest MSE and the only positive R-squared. An R-squared of 0.09 is still low in absolute terms, so none of the three models captures much of the variation in price yet. The choice of the 'best' model also depends on the requirements of the task and the characteristics of the data. Hyperparameter tuning, additional models, other metrics, and cross-validation would give a more comprehensive evaluation and may improve performance.
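A cross-validated comparison of the three model families could be sketched as follows. The data is synthetic, the models use default settings, and the shuffled 5-fold split is an illustrative choice, so the printed scores say nothing about the car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Bayesian Ridge": BayesianRidge(),
}

# The same shuffled folds are used for every model so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: R-squared = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean shows whether a difference between models is larger than the fold-to-fold variation.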