<a href="https://colab.research.google.com/github/usshaa/Colabnb/blob/main/EV_Data_Cleaning_Exploration_Model_Training_Practical_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚗 EV Data Analytics – Data Cleaning, Exploration & Model Training

Each task below ends with a question. Write your code and **your answer in the space provided**.

In [None]:
import pandas as pd
import numpy as np

# Random seed for reproducibility
np.random.seed(42)

# Define some EV brands and models
makes = ["Tesla", "Nissan", "BMW", "Hyundai", "Kia", "Chevrolet", "Audi"]
models = {
    "Tesla": ["Model 3", "Model S", "Model X", "Model Y"],
    "Nissan": ["Leaf", "Ariya"],
    "BMW": ["i3", "i4", "iX"],
    "Hyundai": ["Kona Electric", "Ioniq 5"],
    "Kia": ["EV6", "Niro EV"],
    "Chevrolet": ["Bolt EV"],
    "Audi": ["e-tron", "Q4 e-tron"]
}

# Generate synthetic rows
rows = []
for _ in range(100):
    make = np.random.choice(makes)
    model = np.random.choice(models[make])
    year = np.random.randint(2015, 2024)
    battery_capacity = np.random.choice([40, 50, 60, 70, 75, 80, 90, 100]) + np.random.randint(-5, 6)
    range_km = int(battery_capacity * np.random.uniform(4, 6)) + np.random.randint(-20, 20)
    charging_time = np.round(np.random.uniform(0.5, 12), 1)
    price = np.random.randint(30000, 120000)
    fast_charging = np.random.choice(["Yes", "No"], p=[0.7, 0.3])

    rows.append([make, model, year, battery_capacity, range_km, charging_time, price, fast_charging])

df = pd.DataFrame(rows, columns=[
    "Make", "Model", "Year", "Battery_Capacity_kWh", "Range_km", "Charging_Time_hr", "Price_USD", "Fast_Charging"
])

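# Introduce some missing values
# (cast the integer columns to float first, because NaN cannot be stored in an integer column)
df = df.astype({"Battery_Capacity_kWh": float, "Range_km": float})
for col in ["Battery_Capacity_kWh", "Range_km", "Fast_Charging"]:
    df.loc[df.sample(frac=0.05).index, col] = np.nan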

# Save dataset
df.to_csv("ev_data.csv", index=False)
print("✅ Synthetic EV dataset saved as ev_data.csv")
df.head()

In [None]:

import pandas as pd

# Load the dataset
df = pd.read_csv("ev_data.csv")
df.head()


## Task 1: Load the Dataset
1. Load the dataset into a pandas DataFrame.
2. Display the first 5 rows.
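
One way to inspect column names and data types is sketched below. The three-row `toy` frame is a hypothetical stand-in so the snippet runs on its own; in the notebook you would call the same methods on the `df` loaded from `ev_data.csv`.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook use df = pd.read_csv("ev_data.csv")
toy = pd.DataFrame({
    "Make": ["Tesla", "Nissan", "BMW"],
    "Year": [2021, 2019, 2022],
    "Battery_Capacity_kWh": [75.0, None, 80.0],
    "Fast_Charging": ["Yes", "No", None],
})

print(toy.head())            # head() shows the first 5 rows by default
print(toy.columns.tolist())  # column names as a plain list
print(toy.dtypes)            # one dtype per column
```

`toy.info()` prints similar information together with non-null counts, which is a useful preview for Task 2.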

❓ **Question:** What are the column names and their data types?

✍️ **Your Answer:** (Fill here)


## Task 2: Missing Values
1. Check for missing values in the dataset.
2. Display the percentage of missing values for each column.
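
A minimal sketch of both steps, using a small hypothetical `toy` frame rather than the real dataset. It relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps planted by hand
toy = pd.DataFrame({
    "Battery_Capacity_kWh": [75.0, np.nan, 80.0, 60.0],
    "Range_km": [400.0, 310.0, np.nan, np.nan],
    "Fast_Charging": ["Yes", "No", "Yes", "Yes"],
})

missing_count = toy.isna().sum()        # number of missing cells per column
missing_pct = toy.isna().mean() * 100   # share of missing cells per column, in percent

print(missing_pct.sort_values(ascending=False))
worst_column = missing_pct.idxmax()     # column with the largest share of gaps
```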

❓ **Question:** Which column has the highest percentage of missing values?

✍️ **Your Answer:**


## Task 3: Data Cleaning
1. Handle missing values by:
   - Filling numeric columns with median values.
   - Filling categorical columns with mode.
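
The two fill rules above could be sketched as follows. The `toy` frame and the variable names are illustrative assumptions; the column split relies on `select_dtypes`, which treats every non-numeric column as categorical here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column
toy = pd.DataFrame({
    "Battery_Capacity_kWh": [75.0, np.nan, 80.0, 60.0],
    "Range_km": [400.0, 310.0, np.nan, 290.0],
    "Fast_Charging": ["Yes", "No", np.nan, "Yes"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.columns.difference(numeric_cols)

cleaned = toy.copy()
# Passing a Series of medians fills each numeric column with its own median
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
# mode() can return several values on a tie, so take the first one
for col in categorical_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

remaining = int(cleaned.isna().sum().sum())  # total missing cells left
print(remaining)
```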

❓ **Question:** After cleaning, how many missing values remain?

✍️ **Your Answer:**


## Task 4: Duplicates
1. Check for duplicate rows in the dataset.
2. Remove duplicates if any.
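
A short sketch of counting and removing duplicates. The `toy` frame is hypothetical and contains one repeated row on purpose; the generated dataset may well contain none.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "Make": ["Tesla", "Nissan", "Tesla", "Kia"],
    "Model": ["Model 3", "Leaf", "Model 3", "EV6"],
    "Year": [2021, 2019, 2021, 2022],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)
rows_removed = len(toy) - len(deduped)

print(n_duplicates, rows_removed)
```

By default `duplicated()` compares all columns and flags every occurrence after the first.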

❓ **Question:** How many duplicate rows were removed?

✍️ **Your Answer:**


## Task 5: Outlier Detection
1. Use boxplots to check for outliers in `Battery_Capacity_kWh` and `Range_km`.
2. Decide whether to drop or cap them.
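
In the notebook, `df.boxplot(column=["Battery_Capacity_kWh", "Range_km"])` draws the plots. Matplotlib's default whiskers reach to 1.5 × IQR beyond the quartiles, so the same rule can be applied numerically to count outliers and to compare dropping with capping. The sketch below does that on a hypothetical series with one planted outlier; the 1.5 factor is a convention, not a requirement of the task.

```python
import pandas as pd

# Hypothetical ranges; 900 is a planted outlier
range_km = pd.Series([300, 320, 310, 330, 305, 900], name="Range_km")

q1, q3 = range_km.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # boxplot-style fences

outliers = range_km[(range_km < lower) | (range_km > upper)]
capped = range_km.clip(lower=lower, upper=upper)   # option A: cap to the fences
dropped = range_km[range_km.between(lower, upper)]  # option B: drop the rows

print(outliers.tolist(), capped.max(), len(dropped))
```

Capping keeps every row, which matters on a dataset of only 100 rows; dropping is simpler when the flagged values are clearly errors.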

❓ **Question:** Which column had more visible outliers?

✍️ **Your Answer:**


## Task 6: Basic Exploration
1. Find the average `Battery_Capacity_kWh`.
2. Find the most expensive EV (`Price_USD`).
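
Both lookups can be sketched in a few lines. The `toy` frame below is a hypothetical stand-in with made-up prices.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "Make": ["Tesla", "Nissan", "BMW"],
    "Model": ["Model S", "Leaf", "iX"],
    "Battery_Capacity_kWh": [100.0, 40.0, 70.0],
    "Price_USD": [95000, 32000, 88000],
})

avg_battery = toy["Battery_Capacity_kWh"].mean()
# idxmax() returns the index label of the highest price; loc fetches that whole row
most_expensive = toy.loc[toy["Price_USD"].idxmax()]

print(avg_battery)
print(most_expensive)
```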

❓ **Question:** What is the average EV battery capacity?

✍️ **Your Answer:**


## Task 7: Grouping & Aggregation
1. Group the data by `Make` and calculate the average `Range_km`.
2. Sort to find the top 3 EV brands with the highest average range.
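
A sketch of the group, aggregate, and sort steps on hypothetical data:

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "Make": ["Tesla", "Tesla", "Nissan", "BMW", "BMW", "Kia"],
    "Range_km": [500, 460, 300, 420, 380, 410],
})

avg_range = toy.groupby("Make")["Range_km"].mean()
top3 = avg_range.sort_values(ascending=False).head(3)

print(top3)
```

`avg_range.nlargest(3)` is an equivalent shortcut for the last step.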

❓ **Question:** Which 3 brands have the highest average range?

✍️ **Your Answer:**


## Task 8: Correlation Analysis
1. Compute the correlation matrix for numeric columns.
2. Identify the two distinct variables with the strongest positive correlation (ignore the diagonal, where every variable correlates perfectly with itself).
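
One possible way to pick the strongest pair programmatically is to mask out the diagonal and the mirrored lower triangle before taking the maximum. The `toy` frame and variable names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Make is non-numeric and is skipped by numeric_only=True
toy = pd.DataFrame({
    "Battery_Capacity_kWh": [40, 60, 75, 90, 100],
    "Range_km": [210, 300, 380, 450, 520],
    "Charging_Time_hr": [9.0, 3.5, 6.0, 1.0, 7.5],
    "Make": ["Nissan", "Kia", "Tesla", "BMW", "Tesla"],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).unstack().dropna()

strongest_pair = pairs.idxmax()  # tuple of two column names
print(strongest_pair, round(pairs.max(), 3))
```

Reading the answer off a printed `corr` table works just as well for a handful of columns.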

❓ **Question:** Which two variables are most strongly correlated?

✍️ **Your Answer:**


## Task 9: Visualization
1. Create a histogram of EV prices.
2. Create a bar chart comparing average range by `Fast_Charging` support.
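
A sketch of both charts using the pandas plotting interface on top of Matplotlib. The data, bin count, and titles are hypothetical choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "Price_USD": [32000, 45000, 52000, 61000, 88000, 95000],
    "Range_km": [300, 340, 410, 390, 480, 500],
    "Fast_Charging": ["No", "No", "Yes", "Yes", "Yes", "Yes"],
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of prices
toy["Price_USD"].plot.hist(bins=5, ax=ax_hist, title="EV price distribution")
ax_hist.set_xlabel("Price (USD)")

# Bar chart of mean range per Fast_Charging group
avg_range_by_fc = toy.groupby("Fast_Charging")["Range_km"].mean()
avg_range_by_fc.plot.bar(ax=ax_bar, rot=0, title="Average range by fast-charging support")
ax_bar.set_ylabel("Average range (km)")

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()`.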

❓ **Question:** Do EVs with fast charging generally have higher ranges?

✍️ **Your Answer:**


## Task 10: Insights
Write **3 key insights** you learned from this dataset after cleaning and exploration.

✍️ **Your Answer:**


## Task 11: Feature Engineering & Linear Regression

### Task 11.1: Feature Engineering
```python
# TODO: Encode categorical variables (Make, Model, Fast_Charging)
# TODO: Define features (X) and target (y = Price_USD)
# TODO: Split dataset into train (80%) and test (20%)
```
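
The three TODOs could be approached as sketched below. The ten-row `toy` frame is hypothetical and omits `Year`; one-hot encoding via `pd.get_dummies` with `drop_first=True` is one reasonable encoding choice among several, not the only valid one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the same categorical columns as the exercise
toy = pd.DataFrame({
    "Make": ["Tesla", "Nissan", "BMW", "Kia", "Tesla", "Audi", "Nissan", "BMW", "Kia", "Audi"],
    "Model": ["Model 3", "Leaf", "i4", "EV6", "Model Y", "e-tron", "Ariya", "iX", "Niro EV", "Q4 e-tron"],
    "Battery_Capacity_kWh": [75, 40, 80, 77, 75, 95, 63, 100, 64, 82],
    "Range_km": [450, 240, 430, 420, 440, 400, 330, 520, 370, 410],
    "Charging_Time_hr": [1.5, 8.0, 2.0, 1.2, 1.5, 3.0, 6.0, 2.5, 7.0, 3.5],
    "Fast_Charging": ["Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Price_USD": [48000, 31000, 62000, 52000, 55000, 78000, 45000, 99000, 41000, 58000],
})

# One-hot encode the categorical columns; drop_first avoids a redundant dummy per category
X = pd.get_dummies(
    toy.drop(columns="Price_USD"),
    columns=["Make", "Model", "Fast_Charging"],
    drop_first=True,
    dtype=float,
)
y = toy["Price_USD"]

# 80/20 split; random_state fixes the shuffle so results are repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```
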
❓ **Question:** Which features did you select for training?

✍️ **Your Answer:**

### Task 11.2: Model Building
```python
# TODO: Initialize LinearRegression model
# TODO: Train model on training data
```
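
A minimal sketch of fitting the model and checking how many features it saw. The training data here is generated from a made-up linear price rule purely so the snippet runs on its own; in the notebook, `X_train` and `y_train` come from your own split in Task 11.1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical, already-encoded training features
X_train = pd.DataFrame({
    "Battery_Capacity_kWh": rng.uniform(40, 100, size=50),
    "Range_km": rng.uniform(200, 550, size=50),
    "Fast_Charging_Yes": rng.integers(0, 2, size=50),
})
# Made-up price rule, used only to give the model something to learn
y_train = 20000 + 600 * X_train["Battery_Capacity_kWh"] + 5000 * X_train["Fast_Charging_Yes"]

model = LinearRegression()
model.fit(X_train, y_train)

print(model.n_features_in_)  # number of columns the model was trained on
print(dict(zip(X_train.columns, model.coef_.round(2))))
```
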
❓ **Question:** How many features were used in the final model?

✍️ **Your Answer:**

### Task 11.3: Prediction & Evaluation
```python
# TODO: Predict on test data
# TODO: Compute MAE, MSE, RMSE, R² score
```
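
The four metrics can be computed as sketched here. The `y_test` and `y_pred` arrays are hypothetical numbers standing in for `model.predict(X_test)`; RMSE is taken as the square root of MSE, which works regardless of the scikit-learn version installed.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true prices and predictions
y_test = np.array([40000, 55000, 70000, 90000])
y_pred = np.array([42000, 53000, 75000, 86000])

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # same units as the target (USD)
r2 = r2_score(y_test, y_pred)  # 1.0 is a perfect fit; can be negative for a poor one

print(f"MAE={mae:.0f}  MSE={mse:.0f}  RMSE={rmse:.0f}  R2={r2:.4f}")
```
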
❓ **Question:** What is the R² score of your model?

✍️ **Your Answer:**

### Task 11.4: Sample Prediction
```python
# TODO: Predict the price for a sample EV with:
# Make = Tesla, Model = Model 3, Battery = 75, Range = 450, Charging Time = 1.5, Fast_Charging = Yes
# Note: no Year is given, so either leave Year out of X or choose a value for it yourself
```
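
The main pitfall in this step is encoding the single sample so that its columns match the training matrix exactly. One possible approach, sketched on a hypothetical six-row training set that leaves `Year` out, is to one-hot encode the sample and then `reindex` it to the training columns. With so few rows the toy model simply memorizes its training data, so the printed price is not meaningful; only the alignment pattern is the point.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data
train = pd.DataFrame({
    "Make": ["Tesla", "Tesla", "Nissan", "BMW", "Kia", "Nissan"],
    "Model": ["Model 3", "Model S", "Leaf", "i4", "EV6", "Ariya"],
    "Battery_Capacity_kWh": [75, 100, 40, 80, 77, 63],
    "Range_km": [450, 560, 240, 430, 420, 330],
    "Charging_Time_hr": [1.5, 1.0, 8.0, 2.0, 1.2, 6.0],
    "Fast_Charging": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
    "Price_USD": [48000, 95000, 31000, 62000, 52000, 45000],
})
categorical = ["Make", "Model", "Fast_Charging"]

X_train = pd.get_dummies(train.drop(columns="Price_USD"), columns=categorical, dtype=float)
model = LinearRegression().fit(X_train, train["Price_USD"])

# The sample from the task, as a one-row DataFrame
sample = pd.DataFrame([{
    "Make": "Tesla", "Model": "Model 3", "Battery_Capacity_kWh": 75,
    "Range_km": 450, "Charging_Time_hr": 1.5, "Fast_Charging": "Yes",
}])

# Encode the sample, then align it to the training columns;
# dummy columns the sample lacks (e.g. Make_Nissan) are filled with 0
X_sample = pd.get_dummies(sample, columns=categorical, dtype=float).reindex(
    columns=X_train.columns, fill_value=0.0
)

predicted_price = float(model.predict(X_sample)[0])
print(round(predicted_price, 2))
```
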
❓ **Question:** What is the predicted price of the Tesla Model 3 sample?

✍️ **Your Answer:**