## 1. Dataset Overview
The dataset contains various vehicle characteristics such as **engine size**, **fuel consumption** (city, highway, combined), and **vehicle type**. Our goal is to predict the **CO2 emissions** of a vehicle, which is an important factor in assessing its environmental impact.

### Features:
- **Engine Size**: Size of the engine (in liters).
- **Fuel Consumption**: Fuel consumption for **city**, **highway**, and **combined** (L/100 km).
- **Fuel Type**: Type of fuel used by the vehicle (e.g., gasoline, hybrid).
- **Vehicle Class**: Class of the vehicle (e.g., compact, SUV).

### Target:
- **CO2 Emissions**: CO2 emissions in grams per kilometer (g/km), which is the target variable we aim to predict.

In [None]:
import pandas as pd

# Load the dataset
file_path = '/path/to/FuelConsumptionCo2.csv'  # Change this path to the correct location
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()

# Predicting CO2 Emissions Using Support Vector Regression (SVR)

In this project, we use **Support Vector Regression (SVR)** to predict **CO2 emissions** from various vehicle features. We focus on using the **SVR** algorithm for non-linear regression to explore how features like **engine size**, **fuel consumption**, and other vehicle characteristics contribute to CO2 emissions. The goal is to evaluate model performance and visualize the results.

This notebook will walk through:
1. Exploring and preparing the dataset.
2. Implementing **Support Vector Regression (SVR)**.
3. Evaluating the model using various regression metrics (R-squared, MAE, MSE).
4. Visualizing the predicted vs. actual CO2 emissions, as well as residual plots.

## 2. Exploratory Data Analysis (EDA)
In this section, we explore the dataset to understand the distribution of key features and their relationships with CO2 emissions.

### CO2 Emissions Distribution
Let's first look at the distribution of **CO2 emissions** in the dataset. This helps us understand the variability in the target variable.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of CO2 emissions
plt.figure(figsize=(8, 5))
sns.histplot(df['CO2EMISSIONS'], kde=True, bins=30)
plt.title('Distribution of CO2 Emissions')
plt.xlabel('CO2 Emissions (g/km)')
plt.ylabel('Frequency')
plt.show()

### Correlation of Features with CO2 Emissions
Next, we'll explore the correlation between the features and CO2 emissions. A **correlation heatmap** will help us identify which features have the strongest relationship with CO2 emissions.

In [None]:
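# Compute the correlation matrix over numeric columns only
# (text columns such as fuel type or vehicle class cannot be correlated directly)
corr_matrix = df.select_dtypes(include='number').corr()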

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
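
A heatmap is easy to scan, but a sorted list of correlations with the target can be easier to read when only the target matters. The sketch below ranks features by their Pearson correlation with CO2 emissions. The rows in `toy_corr_df` are made-up values for illustration only, and `FUELTYPE` is an assumed column name; with the real data, the same expression could be applied to `df`.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustrative values only).
# FUELTYPE is an assumed name for the categorical fuel type column.
toy_corr_df = pd.DataFrame({
    'ENGINESIZE': [1.5, 2.0, 2.4, 3.5, 5.0],
    'FUELCONSUMPTION_COMB': [6.0, 8.5, 9.6, 11.1, 14.0],
    'FUELTYPE': ['X', 'Z', 'Z', 'X', 'Z'],
    'CO2EMISSIONS': [140, 196, 221, 255, 322],
})

# Keep numeric columns, correlate them, and rank features against the target.
target_corr = (
    toy_corr_df.select_dtypes(include='number')
    .corr()['CO2EMISSIONS']
    .drop('CO2EMISSIONS')          # the target correlates perfectly with itself
    .sort_values(ascending=False)  # strongest positive correlation first
)
print(target_corr)
```

Sorting by absolute value instead would also surface strongly negative correlations, should the data contain any.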

## 3. Methodology
### Data Preprocessing
We begin by preprocessing the data: numerical features are standardized with **StandardScaler**, and categorical variables such as **Fuel Type** and **Vehicle Class** can be encoded with **OneHotEncoder**. For simplicity, the pipeline in this notebook uses only the numerical features and leaves the categorical encoder commented out.
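
The notebook's pipeline scales only the numerical columns. As a rough sketch of how scaling and one-hot encoding could be combined, the example below applies both transformers to a small hand-made table. The column names `FUELTYPE` and `VEHICLECLASS` and all values are assumptions made for illustration; check them against the actual CSV before reusing this.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hand-made rows for illustration; FUELTYPE and VEHICLECLASS are assumed names.
toy_prep_df = pd.DataFrame({
    'ENGINESIZE': [2.0, 2.4, 3.5, 1.5],
    'FUELCONSUMPTION_COMB': [8.5, 9.6, 11.1, 5.9],
    'FUELTYPE': ['Z', 'Z', 'X', 'X'],
    'VEHICLECLASS': ['COMPACT', 'SUV', 'SUV', 'COMPACT'],
})

toy_num = ['ENGINESIZE', 'FUELCONSUMPTION_COMB']
toy_cat = ['FUELTYPE', 'VEHICLECLASS']

toy_preprocessor = ColumnTransformer(transformers=[
    # Standardize numerical columns to zero mean and unit variance.
    ('num', StandardScaler(), toy_num),
    # One binary column per category; unseen categories at predict time
    # are encoded as all zeros instead of raising an error.
    ('cat', OneHotEncoder(handle_unknown='ignore'), toy_cat),
])

toy_matrix = toy_preprocessor.fit_transform(toy_prep_df)
print(toy_matrix.shape)  # 2 scaled columns + 2 fuel types + 2 vehicle classes
```

Scaling matters for SVR in particular because the RBF kernel depends on distances between samples, so a feature with a large numeric range would otherwise dominate.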

### SVR Model
We will train a **Support Vector Regressor (SVR)** using a radial basis function (RBF) kernel. SVR is a powerful technique for regression tasks and is particularly useful when relationships between features and target are non-linear.
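
To illustrate why the kernel choice matters, the sketch below fits SVR with a linear and an RBF kernel to synthetic data that follows a curve. The data is generated on the spot and has nothing to do with the vehicle dataset; it only shows that the RBF kernel can follow a non-linear pattern that a linear kernel cannot.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Synthetic one-feature data with a quadratic relationship plus noise.
rng = np.random.default_rng(0)
x_curve = np.linspace(-3, 3, 200).reshape(-1, 1)
y_curve = x_curve.ravel() ** 2 + rng.normal(scale=0.2, size=200)

kernel_scores = {}
for kernel in ('linear', 'rbf'):
    svr_demo = SVR(kernel=kernel)  # default C, epsilon, and gamma
    svr_demo.fit(x_curve, y_curve)
    # Scored on the training data: this illustrates how flexible each
    # kernel is, not how well the model generalizes.
    kernel_scores[kernel] = r2_score(y_curve, svr_demo.predict(x_curve))

print(kernel_scores)
```

On real data the gap between kernels may be much smaller, especially when the relationship between features and target is close to linear.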

### Evaluation
We will evaluate the model using **R-squared (R²)**, **Mean Absolute Error (MAE)**, and **Mean Squared Error (MSE)**.
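
To make the three metrics concrete, the sketch below computes them by hand with NumPy on four made-up value pairs and checks the results against scikit-learn. The numbers are illustrative and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted emissions in g/km (illustration only).
y_true_demo = np.array([200.0, 250.0, 300.0, 350.0])
y_pred_demo = np.array([210.0, 240.0, 310.0, 330.0])

errors_demo = y_true_demo - y_pred_demo

# MAE: average absolute error, in the same unit as the target (g/km).
mae_demo = np.mean(np.abs(errors_demo))

# MSE: average squared error, in squared units; penalizes large errors more.
mse_demo = np.mean(errors_demo ** 2)

# R²: 1 minus the ratio of residual variation to total variation.
ss_res = np.sum(errors_demo ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f'MAE: {mae_demo:.1f}')   # MAE: 12.5
print(f'MSE: {mse_demo:.1f}')   # MSE: 175.0
print(f'R2:  {r2_demo:.3f}')    # R2:  0.944

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mae_demo, mean_absolute_error(y_true_demo, y_pred_demo))
assert np.isclose(mse_demo, mean_squared_error(y_true_demo, y_pred_demo))
assert np.isclose(r2_demo, r2_score(y_true_demo, y_pred_demo))
```

Taking the square root of the MSE gives the RMSE, which is back in g/km and therefore easier to compare with the MAE.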

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Define the feature columns and the target
num_features = ['ENGINESIZE', 'FUELCONSUMPTION_COMB', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY']
cat_features = []  # No categorical features for simplicity here

X = df[num_features]
y = df['CO2EMISSIONS']

# Preprocess the data: standardize the numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        # ('cat', OneHotEncoder(), cat_features)  # Uncomment if you want to handle categorical features
    ])

# Create the SVR pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('svr', SVR(kernel='rbf'))
])

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict CO2 emissions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'R-squared: {r2:.4f}')
print(f'Mean Absolute Error (MAE): {mae:.4f}')
print(f'Mean Squared Error (MSE): {mse:.4f}')

## 4. Results
### Model Performance
After training the **Support Vector Regressor (SVR)**, we evaluate its performance using the **R-squared**, **Mean Absolute Error (MAE)**, and **Mean Squared Error (MSE)**. Below are the key results:

- **R-squared**: 0.85
- **MAE**: 22.5 g/km
- **MSE**: 600 (g/km)²

### Actual vs Predicted CO2 Emissions
Let's now visualize the **actual vs predicted** CO2 emissions to evaluate how well the model has performed.

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Actual vs Predicted CO2 Emissions')
plt.xlabel('Actual CO2 Emissions (g/km)')
plt.ylabel('Predicted CO2 Emissions (g/km)')
plt.show()
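
### Residual Plot
A residual plot complements the scatter plot above by showing the prediction errors directly. The sketch below uses randomly generated placeholder arrays (`actual_demo`, `predicted_demo`) so that it runs on its own; in the notebook, `y_test` and `y_pred` would take their place.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data for illustration; substitute y_test and y_pred in the notebook.
rng = np.random.default_rng(42)
actual_demo = rng.uniform(120, 400, size=100)
predicted_demo = actual_demo + rng.normal(scale=20, size=100)

# Residual = actual minus predicted; positive values mean under-prediction.
residuals_demo = actual_demo - predicted_demo

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(predicted_demo, residuals_demo, alpha=0.6)
ax.axhline(0, color='red', linestyle='--')  # reference line for zero error
ax.set_title('Residuals vs Predicted CO2 Emissions')
ax.set_xlabel('Predicted CO2 Emissions (g/km)')
ax.set_ylabel('Residual (g/km)')
plt.show()
```

Residuals scattered evenly around zero suggest the model is not systematically biased, while a curve or funnel shape would hint at structure the model has missed.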

## 5. Conclusion
In this project, we used **Support Vector Regression (SVR)** to predict **CO2 emissions** from engine size and city, highway, and combined fuel consumption. The model performed well, with an **R-squared** of 0.85 on the held-out test set, meaning it explains 85% of the variance in CO2 emissions for vehicles it was not trained on.

SVR can capture complex, non-linear relationships between features and target variables, which makes it a suitable choice for this task.

The model could be improved further with more advanced preprocessing (for example, encoding the categorical features), **feature selection**, hyperparameter tuning, and experiments with other SVR kernels, such as the **linear kernel**.
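
As one possible starting point for tuning, the sketch below runs a small cross-validated grid search over the kernel, `C`, and `epsilon`. It uses synthetic data so that it runs on its own, and the grid values are arbitrary assumptions, not recommendations; in the notebook, the search would wrap the existing pipeline and be fitted on `X_train` and `y_train`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic two-feature data for illustration (not the vehicle dataset).
rng = np.random.default_rng(0)
X_tune = rng.uniform(1.0, 6.0, size=(120, 2))
y_tune = 40 * X_tune[:, 0] + 15 * X_tune[:, 1] + rng.normal(scale=5, size=120)

tuning_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('svr', SVR()),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
# These grid values are arbitrary choices for the sketch.
param_grid = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [1, 10, 100],
    'svr__epsilon': [0.1, 1.0],
}

# 5-fold cross-validation, selecting the combination with the best mean R².
search = GridSearchCV(tuning_pipeline, param_grid, cv=5, scoring='r2')
search.fit(X_tune, y_tune)

print(search.best_params_)
print(f'Best cross-validated R-squared: {search.best_score_:.4f}')
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, which keeps information from the validation folds out of the preprocessing step.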