# Electric Vehicle Prediction

In this checkpoint, I am going to work on the **'Electric Vehicle Data'** dataset that was provided by Kaggle as part of the Electric Vehicle Price Prediction competition.

Dataset description: This dataset contains information on the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered with the Washington State Department of Licensing (DOL). It was introduced as part of an official invitation-based competition on Kaggle. Our SVM model should answer the question: "Given my car's make and model, along with a few other parameters, at what price can this vehicle be bought or sold?"

➡️ Dataset link

https://i.imgur.com/IpuCW3s.jpg
 

Instructions

1. Import your data and perform a basic data exploration phase
- Display general information about the dataset
- Create a pandas profiling report to gain insights into the dataset
- Handle missing and corrupted values
- Remove duplicates, if they exist
- Handle outliers, if they exist
- Encode categorical features
  
2. Select your target variable and the features
3. Split your dataset into training and test sets
4. Build and train an SVM model on the training set
5. Assess your model performance on the test set using relevant evaluation metrics
6. Discuss with your cohort alternative ways to improve your model performance

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR 
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
#loading the dataset
df = pd.read_csv("train.csv")

#### Overview of the dataset

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df["Expected Price ($1k)"].unique()

In [None]:
# Replace the placeholder strings 'N/', 'NA' and '' with NaN, then convert the column to numeric
df["Expected Price ($1k)"] = df["Expected Price ($1k)"].replace(['N/', 'NA', ''], np.nan)
df["Expected Price ($1k)"] = pd.to_numeric(df["Expected Price ($1k)"], errors='coerce')
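As a minimal sketch of what the two cleaning steps above do, the toy series below mixes numeric strings with placeholder tokens. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values (illustrative only) mixing numeric strings with placeholder tokens
raw_prices = pd.Series(["32.5", "N/", "NA", "", "18", "abc"])

# Step 1: map the known placeholder tokens to NaN
cleaned = raw_prices.replace(["N/", "NA", ""], np.nan)

# Step 2: convert to numbers; anything still non-numeric (e.g. "abc") also becomes NaN
numeric_prices = pd.to_numeric(cleaned, errors="coerce")

print(numeric_prices)
print("Missing after cleaning:", numeric_prices.isna().sum())
```

Because `errors="coerce"` silently turns every unparseable value into NaN, it is worth comparing the missing count before and after the conversion.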
    

#### Checking for missing values and duplicates

In [None]:
df.isnull().sum()

In [None]:
df[df["Expected Price ($1k)"].isnull()]

We can see here that the rows with missing prices belong to vehicles whose Make is **FORD**.

In [None]:
df.dropna(subset=["Expected Price ($1k)"], inplace=True)

In [None]:
df = df.dropna()

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

There are no duplicates in this dataset

In [None]:
df.shape

In [None]:
df["Make"].value_counts()

In [None]:
df["Model"].value_counts()

In [None]:
df["Make"].unique()

In [None]:
df["Model"].unique()

##### Let's see the distribution of prices across makes

In [None]:
plt.figure(figsize=(18, 5))
sns.boxplot(x='Make', y='Expected Price ($1k)', data=df)
plt.title('Price Distribution Across Makes')
plt.xticks(rotation=45)  
plt.show()
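The instructions ask for outliers to be handled, and the boxplot above is the usual starting point. One common option, sketched here on made-up prices rather than the real column, is to flag values outside the 1.5 × IQR fences that a boxplot draws as whiskers. Whether flagged rows should be dropped, capped, or kept is a judgment call that depends on the data.

```python
import pandas as pd

# Toy prices in $1k (illustrative values only)
prices = pd.Series([20, 22, 25, 27, 30, 31, 33, 35, 120], name="Expected Price ($1k)")

# Tukey fences: 1.5 * IQR beyond the first and third quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking the values outside the fences
is_outlier = (prices < lower) | (prices > upper)

print(f"Fences: [{lower:.1f}, {upper:.1f}]")
print(prices[is_outlier])
```

To mirror the per-make boxplot, the same fences could be computed within each group using `groupby('Make')`.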


#### Feature Engineering

##### Using Electric Range as a Proxy for Mileage

In [None]:
# An electric range of 0 would produce inf, so treat it as missing instead
df['Price per Mile'] = df['Expected Price ($1k)'] / df['Electric Range'].replace(0, np.nan)

##### Changing the datatype of model year

In [None]:
df["Model Year"] = df["Model Year"].astype(int)

##### Age of the vehicle

In [None]:
current_year = 2024
df['Age of Vehicle'] = current_year - df['Model Year']

#### Encoding categorical features

##### The label encoding method assigns a unique integer to each category.

In [None]:
df["Clean Alternative Fuel Vehicle (CAFV) Eligibility"].value_counts()

In [None]:
label_encoder = LabelEncoder()
df['(CAFV)_Eligibility_encoded'] = label_encoder.fit_transform(df['Clean Alternative Fuel Vehicle (CAFV) Eligibility'])

In [None]:
df["Electric Vehicle Type"].value_counts()

In [None]:
label_encoder = LabelEncoder()
df['EV_Type_encoded'] = label_encoder.fit_transform(df['Electric Vehicle Type'])
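Since new data has to be encoded with the same integers at prediction time, it helps to know which integer each category received. The sketch below uses a toy column with illustrative category names and reads the mapping from the encoder's `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (category names are illustrative)
ev_type = pd.Series(["BEV", "PHEV", "BEV", "BEV", "PHEV"])

encoder = LabelEncoder()
codes = encoder.fit_transform(ev_type)

# classes_ is sorted; the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(encoder.inverse_transform(codes))
```

Keeping a separate fitted encoder per column makes it possible to look up this mapping later.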

##### Target encoding is suitable for the **Model** and **Make** features: both have a large number of categories, so one-hot encoding would result in high dimensionality, while label encoding would impose an arbitrary ordering on them.

In [None]:
df['Model_encoded'] = df.groupby('Model')['Expected Price ($1k)'].transform('mean')

In [None]:
df['Make_encoded'] = df.groupby('Make')['Expected Price ($1k)'].transform('mean')
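Target encoding computed on the full dataset lets each row's own price influence its encoded feature, which can make test scores look better than they are. A possible alternative, sketched here on made-up data, is to learn the per-make means from the training rows only and map them onto the test rows. The global-mean fallback for unseen makes is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "NISSAN", "NISSAN", "BMW", "BMW", "TESLA", "KIA"],
    "Expected Price ($1k)": [50.0, 54.0, 20.0, 22.0, 40.0, 44.0, 52.0, 25.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

# Learn the per-make mean price from the training rows only
make_means = train.groupby("Make")["Expected Price ($1k)"].mean()
global_mean = train["Expected Price ($1k)"].mean()

# Apply the learned mapping to both splits; unseen makes fall back to the global mean
train["Make_encoded"] = train["Make"].map(make_means)
test["Make_encoded"] = test["Make"].map(make_means).fillna(global_mean)

print(test[["Make", "Make_encoded"]])
```

The same pattern would apply to the `Model` column.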

### Combining the data wrangling steps into one function for a test CSV file. 

In [None]:
def wrangle(file_path):
    # Load the dataset
    df = pd.read_csv(file_path)
    
    # Clean and convert 'Expected Price ($1k)' to numeric
    df["Expected Price ($1k)"] = df["Expected Price ($1k)"].replace(['N/', 'NA', ''], np.nan)
    df["Expected Price ($1k)"] = pd.to_numeric(df["Expected Price ($1k)"], errors='coerce')
    
    # Drop rows with missing values
    df = df.dropna()
    
    # Create 'Price per Mile' feature, handle divide by zero with np.where
    df['Price per Mile'] = np.where(df['Electric Range'] == 0, np.nan, df['Expected Price ($1k)'] / df['Electric Range'])
    
    # Convert 'Model Year' to integer and calculate vehicle age
    df["Model Year"] = df["Model Year"].astype(int)
    current_year = 2024
    df['Age of Vehicle'] = current_year - df['Model Year']
    
    # Encode categorical variables
    label_encoder = LabelEncoder()
    df['(CAFV)_Eligibility_encoded'] = label_encoder.fit_transform(df['Clean Alternative Fuel Vehicle (CAFV) Eligibility'])
    df['EV_Type_encoded'] = label_encoder.fit_transform(df['Electric Vehicle Type'])
    
    # Aggregate and encode 'Model' and 'Make' based on mean 'Expected Price'
    df['Model_encoded'] = df.groupby('Model')['Expected Price ($1k)'].transform('mean')
    df['Make_encoded'] = df.groupby('Make')['Expected Price ($1k)'].transform('mean')
    
    return df


#### Checking relationships between the target vector and the feature matrix

In [None]:
selected_columns = ['Expected Price ($1k)', 'Model Year','Price per Mile', 'Electric Range', 'Age of Vehicle', 'Base MSRP', 'Model_encoded',
       'Make_encoded', 'EV_Type_encoded', '(CAFV)_Eligibility_encoded']

correlation_matrix = df[selected_columns].corr()

In [None]:
correlation_matrix

In [None]:
plt.figure(figsize=(15, 6)) 
sns.heatmap(correlation_matrix, 
            annot=True,
            cmap="RdGy",           
            linewidths=0.5,           
            linecolor='black',        
            cbar_kws={'label': 'Correlation Coefficient'})

plt.title('Correlation Matrix of Features', fontsize=18)
plt.xlabel('Features', fontsize=10)
plt.ylabel('Features', fontsize=10)

plt.show()
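When the heatmap gets crowded, a ranked list of each feature's correlation with the target can be easier to read. The sketch below uses synthetic data whose column names only echo the ones above; the numbers are generated for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): price depends strongly on MSRP, weakly on age
rng = np.random.default_rng(0)
msrp = rng.uniform(20, 80, size=200)
age = rng.integers(0, 10, size=200)
toy = pd.DataFrame({
    "Base MSRP": msrp,
    "Age of Vehicle": age,
    "Noise": rng.normal(0, 1, size=200),
    "Expected Price ($1k)": 0.8 * msrp - 1.0 * age + rng.normal(0, 2, size=200),
})

# Correlation of every feature with the target, ranked by absolute strength
target_corr = (
    toy.corr()["Expected Price ($1k)"]
    .drop("Expected Price ($1k)")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a kernel model such as SVR.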

# Modelling

##### Selecting my features and splitting the data into training and test sets

In [None]:
# Split features (X) and target (y) 
X = df.drop(columns=["ID", "VIN (1-10)", "Expected Price ($1k)", "Make", "Model" , "Price per Mile","County",
                                 "City", "State", "ZIP Code", "Electric Vehicle Type", "Legislative District", "DOL Vehicle ID", "Vehicle Location", "Electric Utility", "Clean Alternative Fuel Vehicle (CAFV) Eligibility"])
y = df['Expected Price ($1k)']

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Verifying the features
print(X_train.columns)

##### Scaling my data to ensure that each feature contributes equally to the distance calculations or the optimization process.

In [None]:
# Check for NaN values
print(np.isnan(X_train).sum())

# Check for infinite values
print(np.isinf(X_train).sum())


In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data
X_test_scaled = scaler.transform(X_test)
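One way to keep the scaler and the model together is scikit-learn's `make_pipeline`, which fits the scaler on whatever is passed to `fit()` and reuses it automatically in `predict()`. The sketch below runs on synthetic data; the variable names `X_toy`, `y_toy` and `svr_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (illustrative only) with features on very different scales
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3)) * [1.0, 100.0, 10000.0]
y_toy = X_toy[:, 0] + X_toy[:, 1] / 100.0 + X_toy[:, 2] / 10000.0

# The pipeline fits the scaler on the data passed to fit() and reuses it in predict()
svr_pipeline = make_pipeline(StandardScaler(), SVR(C=90, gamma="scale", kernel="rbf"))
svr_pipeline.fit(X_toy[:80], y_toy[:80])

# Raw (unscaled) rows can be passed directly; scaling happens inside the pipeline
preds = svr_pipeline.predict(X_toy[80:])
print(preds[:5])
```

With this design there is a single object to fit, evaluate and persist, so the scaling step cannot be forgotten at prediction time.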

## Support Vector Regression (SVR)

In [None]:
svr_model = SVR(C=90, gamma='scale', kernel='rbf')
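The value `C=90` above is fixed by hand. As one possible way to improve the model, the hyperparameters could be chosen by cross-validated grid search. The sketch below uses synthetic data and a small hypothetical grid; suitable ranges for the real dataset would need to be explored.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=120)

pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Hypothetical grid; the step name "svr" is generated by make_pipeline
param_grid = {
    "svr__C": [1, 10, 90],
    "svr__epsilon": [0.01, 0.1, 1.0],
}

# 5-fold cross-validation, scored by (negated) RMSE
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling statistics.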

In [None]:
svr_model.fit(X_train_scaled, y_train)

In [None]:
svr_ypred = svr_model.predict(X_test_scaled)

In [None]:
# Model evaluation
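# RMSE is computed as the square root of the MSE (the `squared` argument is deprecated in recent scikit-learn releases)
print("SVR RMSE:", np.sqrt(mean_squared_error(y_test, svr_ypred)))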
print("SVR R2 Score:", r2_score(y_test, svr_ypred))
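To make the two metrics concrete, the sketch below recomputes them by hand on a few made-up prices and compares the results with scikit-learn. RMSE is in the same units as the target ($1k here), while R² is the fraction of the target's variance that the predictions explain.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy true and predicted prices in $1k (illustrative values only)
y_true = np.array([30.0, 45.0, 22.0, 60.0])
y_pred = np.array([28.0, 47.0, 25.0, 55.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("RMSE:", rmse_manual, "| sklearn:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R2:  ", r2_manual, "| sklearn:", r2_score(y_true, y_pred))
```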

#### Saving the trained model

In [None]:
import pickle

In [None]:
# Perform target encoding for 'Model' and 'Make' using the mean of 'Expected Price ($1k)'
model_encoding = df.groupby('Model')['Expected Price ($1k)'].mean().to_dict()
make_encoding = df.groupby('Make')['Expected Price ($1k)'].mean().to_dict()

# Save the model, the fitted scaler and the encoding mappings together in a dictionary.
# The scaler is included because the SVR was trained on scaled features.
model_data = {
    'model': svr_model,  # Trained SVR model
    'scaler': scaler,    # StandardScaler fitted on X_train
    'model_encoding': model_encoding,  # Target encoding for 'Model'
    'make_encoding': make_encoding     # Target encoding for 'Make'
}

# Save the dictionary to a .sav file
filename = "trained_model_and_encodings.sav"
with open(filename, 'wb') as f:
    pickle.dump(model_data, f)

print(f"Model, scaler and encodings saved to {filename}")

#### Loading the saved model

In [None]:
# Load the model, scaler and encodings from the .sav file
filename = "trained_model_and_encodings.sav"
with open(filename, 'rb') as f:
    loaded_data = pickle.load(f)

# Extract the model, the scaler and the encodings
loaded_model = loaded_data['model']  # SVR model
loaded_scaler = loaded_data['scaler']  # Fitted StandardScaler
loaded_model_encoding = loaded_data['model_encoding']  # Model encodings
loaded_make_encoding = loaded_data['make_encoding']  # Make encodings

print("Model, scaler and encodings loaded successfully.")

#### Transforming New Data for Prediction

In [None]:
# Let's assume we have new data for prediction
new_model = "MODEL 3"
new_make = "TESLA"
new_model_year = 2022
new_electric_range = 350
new_base_msrp = 35000
new_vehicle_age = 2          # current_year (2024) - model year
new_cafv_eligibility = 1     # Integer code assigned by the LabelEncoder
new_ev_type = 1              # Integer code assigned by the LabelEncoder

# Apply the encoding using the loaded encoding mappings
# (categories not seen during training fall back to 0)
encoded_model = loaded_model_encoding.get(new_model, 0)
encoded_make = loaded_make_encoding.get(new_make, 0)

# Create the input data list with all required features,
# in the same order as the columns of X_train
new_data = [
    new_model_year,         # 'Model Year'
    new_electric_range,     # 'Electric Range'
    new_base_msrp,          # 'Base MSRP'
    new_vehicle_age,        # 'Age of Vehicle'
    new_cafv_eligibility,   # '(CAFV)_Eligibility_encoded'
    new_ev_type,            # 'EV_Type_encoded'
    encoded_model,          # 'Model_encoded'
    encoded_make            # 'Make_encoded'
]

# Scale the new row with the scaler fitted on the training data, then predict
new_data_scaled = loaded_scaler.transform([new_data])
predicted_price = loaded_model.predict(new_data_scaled)

print(f"Predicted price: ${predicted_price[0]:.2f}k")