## Comprehensive Assessment: Machine Learning

### Step 1: Load and Preprocess the Dataset

Loading the Dataset:

Load the dataset from the provided link.

Inspect the dataset to understand its structure, including the data types, missing values, and basic statistics.
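
The inspection step can be sketched with a few pandas calls. This is a minimal illustration on a small made-up frame: `enginesize` and `price` appear in the results further down, while `fueltype` and all the values are assumptions used only for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the car price data; the values are invented.
df = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", None],
    "enginesize": [130, 152, np.nan, 109],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Data type of each column
dtypes = df.dtypes

# Number of missing values per column
missing_counts = df.isna().sum()

# Summary statistics; by default only numeric columns are included
summary = df.describe()

print(dtypes)
print(missing_counts)
print(summary)
```

On the real dataset the same three calls show which columns need imputation and which are categorical.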

Preprocessing:

Handle missing values through imputation or removal, depending on the extent of the missing data.
    
Encode categorical variables using techniques such as one-hot encoding.

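Normalize or standardize the numerical features so that features with large ranges do not dominate scale-sensitive models such as the Support Vector Regressor. Tree-based models are largely insensitive to feature scaling.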
    
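Split the dataset into training and testing sets to evaluate the performance of the models. Fit any imputation and scaling on the training set only, so that no information from the test set leaks into the model.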
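
The preprocessing steps above can be sketched as one scikit-learn `ColumnTransformer`. This is an illustrative sketch on invented data, not the notebook's own pipeline: the frame, the median and most-frequent imputation strategies, and the names `numeric_pipe`, `categorical_pipe`, and `preprocess` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing value in each feature column.
df = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", np.nan, "gas", "diesel", "gas", "gas"],
    "enginesize": [130.0, 152.0, np.nan, 109.0, 136.0, 131.0, 108.0, 164.0],
    "price": [13495.0, 16500.0, 13950.0, 17450.0, 15250.0, 17710.0, 16430.0, 16925.0],
})
X = df.drop(columns=["price"])
y = df["price"]

# Split first so imputation and scaling statistics come from training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array, which is easier to inspect here.
preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0,
)

X_train_t = preprocess.fit_transform(X_train)  # learn statistics on train
X_test_t = preprocess.transform(X_test)        # reuse them on test
print(X_train_t.shape, X_test_t.shape)
```

Whether to impute or drop rows depends on how much data is missing; the imputers here are one option among several.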
    
### Step 2: Implement Regression Models

Implement the following regression algorithms:

Linear Regression:

A simple, interpretable model that assumes a linear relationship between the independent variables and the target variable (car price).

Decision Tree Regressor:

A non-linear model that splits the data into subsets based on feature values, building a tree structure.
    
Random Forest Regressor:

An ensemble method that trains many decision trees on bootstrap samples of the data and averages their predictions, which reduces variance and overfitting compared with a single tree.
    
Gradient Boosting Regressor:

Another ensemble method that builds trees sequentially, with each new tree fitted to the errors of the ensemble built so far.
    
Support Vector Regressor:

A model that uses the principles of support vector machines to perform regression. It is effective in high-dimensional spaces but sensitive to feature scaling and to the choice of kernel, `C`, and `epsilon`.
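
To make the sequential idea behind gradient boosting concrete, the sketch below builds a small boosted ensemble by hand for squared-error loss, where each new tree is fitted to the current residuals. It is a simplified illustration on synthetic data, not the internals of `GradientBoostingRegressor`; the learning rate, tree depth, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (invented for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: predict the target mean
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(50):
    residuals = y - prediction  # errors of the ensemble built so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)      # the new tree models those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

A random forest, by contrast, trains its trees independently and averages them, so the order of the trees does not matter.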
                                                                                 
### Step 3: Model Evaluation
                                                                                 
Metrics:
                                                                                 
Evaluate the models using R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
                                                                                 
Compare these metrics across all models to identify the best performer.
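
The three metrics can be written out directly from their definitions, which helps when interpreting them. The sketch below uses plain Python and invented numbers; the function names are chosen for this example and are not the scikit-learn API.

```python
def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_good = [2.5, 0.0, 2.0, 8.0]   # close to the targets
y_bad = [7.0, 2.0, -0.5, 3.0]   # worse than always predicting the mean

print(mse(y_true, y_good), mae(y_true, y_good), r_squared(y_true, y_good))
print(r_squared(y_true, y_bad))
```

R-squared is negative whenever a model's squared error exceeds that of a constant prediction equal to the mean of the test targets, so a negative value signals a model that is worse than that trivial baseline.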
                                                                                 
### Step 4: Feature Importance Analysis
                                                                                 
Feature Selection:
                                                                                 
Use feature importances from tree-based models or coefficients from linear models to identify the most influential variables. Coefficients are only comparable across features when the features are on the same scale.
                                                                                 
Visualize and interpret the importance of each feature in predicting car prices.
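
Both approaches can be sketched on synthetic data where the true influence of each feature is known. The column names `signal`, `weak`, and `noise` and the data itself are invented for this example; with real data the ranking is an estimate, not ground truth.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: "signal" drives the target, "weak" helps a little, "noise" not at all.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 5.0 * X["signal"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=300)

# Tree-based importance: impurity reduction attributed to each feature.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
tree_importance = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Linear coefficients: standardize first so the magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X)
linear = LinearRegression().fit(X_scaled, y)
coef_magnitude = pd.Series(
    np.abs(linear.coef_), index=X.columns
).sort_values(ascending=False)

print(tree_importance)
print(coef_magnitude)
```

Either sorted Series can be passed to a bar chart for the visualization step.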
                                                                                 
### Step 5: Hyperparameter Tuning
                                                                                 
Tuning:
                                                                                 
Use techniques such as GridSearchCV or RandomizedSearchCV to perform hyperparameter tuning on the models.
    
Evaluate the performance of the tuned models and compare it with that of the same models trained with default settings.
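
As a complement to an exhaustive grid, the sketch below shows `RandomizedSearchCV`, which samples a fixed number of candidates from the search space. The data is synthetic and the search space, `n_iter`, and `cv` values are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: default hyperparameters
baseline = RandomForestRegressor(random_state=0).fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))

# Candidate values; 5 of the 36 combinations are sampled.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True).
tuned_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print("best params:", search.best_params_)
print(f"test MSE, default: {baseline_mse:.3f}, tuned: {tuned_mse:.3f}")
```

Tuning does not guarantee an improvement on the test set, which is why the comparison against the default model is worth reporting.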


### Implementation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Load the dataset
url = 'https://drive.google.com/uc?id=1FHmYNLs9v0Enc-UExEMpitOFGsWvB2dP'
data = pd.read_csv(url)

# Preprocessing
# Handling missing values
data.dropna(inplace=True)

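# Splitting the dataset
# Note: X keeps car_ID (a row identifier with no predictive meaning) and CarName
# (free text that one-hot encodes into many sparse columns); consider dropping them.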
X = data.drop(columns=['price'])
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipelines for numerical and categorical data
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Model pipelines
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor(),
    'Gradient Boosting Regressor': GradientBoostingRegressor(),
    'Support Vector Regressor': SVR()
}

results = {}

for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    
    results[name] = {
        'R-squared': r2_score(y_test, y_pred),
        'MSE': mean_squared_error(y_test, y_pred),
        'MAE': mean_absolute_error(y_test, y_pred)
    }

results_df = pd.DataFrame(results).T
print(results_df)

# Feature importance analysis for tree-based models
tree_model_types = (DecisionTreeRegressor, RandomForestRegressor, GradientBoostingRegressor)
for name, model in models.items():
    # Check the estimator type: feature_importances_ only exists after fitting,
    # so hasattr() would silently depend on the earlier training loop.
    if isinstance(model, tree_model_types):
        pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('model', model)])
        pipeline.fit(X_train, y_train)
        feature_importances = model.feature_importances_
        # Column order after the ColumnTransformer: numeric columns, then one-hot columns
        onehot = pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
        features = numeric_features.tolist() + onehot.get_feature_names_out(categorical_features).tolist()
        feature_importance_df = pd.DataFrame({'feature': features, 'importance': feature_importances})
        feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
        print(f'Feature importances for {name}:')
        print(feature_importance_df.head(10))
        
# Hyperparameter Tuning example for Random Forest
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [10, 20, 30]
}
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', RandomForestRegressor())])
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validated RMSE: ", np.sqrt(-grid_search.best_score_))

# GridSearchCV (refit=True by default) has already refit the best pipeline
# on the full training set, so it can be evaluated directly.
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

final_results = {
    'R-squared': r2_score(y_test, y_pred),
    'MSE': mean_squared_error(y_test, y_pred),
    'MAE': mean_absolute_error(y_test, y_pred)
}
print(final_results)


                             R-squared           MSE          MAE
Linear Regression            -1.261189  1.785074e+08  7036.820982
Decision Tree Regressor       0.878221  9.613756e+06  1932.662610
Random Forest Regressor       0.955453  3.516732e+06  1334.409707
Gradient Boosting Regressor   0.933338  5.262537e+06  1660.950927
Support Vector Regressor     -0.099864  8.682769e+07  5695.713406
Feature importances for Decision Tree Regressor:
                feature  importance
7            enginesize    0.648244
6            curbweight    0.262414
0                car_ID    0.018401
14           highwaympg    0.014662
4              carwidth    0.009319
9                stroke    0.008328
12              peakrpm    0.007991
13              citympg    0.007263
11           horsepower    0.004110
84  CarName_peugeot 504    0.004049
Feature importances for Random Forest Regressor:
       feature  importance
7   enginesize    0.579850
6   curbweight    0.251657
14  highwaympg    0.042796
11