# Exploratory Data Analysis of Electric Vehicle Population Data

- **Course:** CS660/71425 Mathematical Foundations of Analytics
- **Instructor:** Prof. Sarbanes
- **Group-1:** Will Torres, Mike Griffin, Watson Blair, Syed Abdul Mubashir, Mohammed Abdul Munaf
- **Semester:** Fall 2024
- **Project #:** 1
- **Due Date:** 07-Oct-2024

## Project Description
Exploratory Data Analysis (EDA) is essential for understanding, cleaning, and preparing data for further analysis in data science projects. This project focuses on analyzing the Electric Vehicle (EV) population dataset from Washington State, USA.


## Questions to be Answered
1. Which car manufacturers (makes) account for the most EVs in Washington?
2. What are the highest and lowest electric ranges in this dataset, and which car makers and models do they correspond to?
3. Is the maximum electric range value unique? If not, which cars share this range?
4. Is the minimum electric range value unique? If not, which cars share this range?
5. How does the electric range vary between car makers and between models?
6. Which are the top 5 cities adopting EVs?
7. How does the EV adoption rate vary among car makers over the years?
8. Is there a correlation between the electric range and the city of an EV?
9. Which county has the greatest variety of EV car models?

## EDA

### Step 1: Understand the Dataset Context
- **Objective Clarification**
- **Data Source Identification**

### Step 2: Import Libraries and Load Data
- **Import Necessary Libraries:** `pandas`, `numpy`, `matplotlib`, `seaborn`
- **Load the Dataset:** `EV_Population_WA_Data.csv`

In [2]:
# Imports
import pandas as pd

# Load the raw dataset
rawData = pd.read_csv('./data/EV_Population_WA_Data.csv')

### Step 3: Initial Data Inspection
- **View Data Structure:** `.head()`, `.info()`, `.describe()`
- **Check Dimensions:** `.shape` (an attribute, not a method)
- **Identify Missing Values:** `.isnull().sum()`
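
The inspection calls listed above can be tried on a small stand-in frame before running them on the full dataset. This is a minimal sketch: the column names `Make` and `Electric Range` and all values are illustrative assumptions, not rows from `EV_Population_WA_Data.csv`.

```python
import pandas as pd

# Toy stand-in for the EV dataset; column names and values are assumptions.
sample = pd.DataFrame({
    "Make": ["A", "B", "A", None],
    "Electric Range": [215.0, 84.0, None, 25.0],
})

print(sample.head())       # first rows of the frame
sample.info()              # dtypes and non-null counts (prints directly)
print(sample.describe())   # summary statistics for numeric columns

# .shape is an attribute that returns (rows, columns); it is not called.
n_rows, n_cols = sample.shape

# Count of missing values in each column.
missing_per_column = sample.isnull().sum()
print(n_rows, n_cols)
print(missing_per_column)
```

On the real data, the same calls would be made on `rawData`.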

### Step 4: Data Cleaning
- **Handle Missing/Incomplete Data**
  - Range Data
- **Handle Outliers**
- **Correct Data Types**
  - Transform categorical data into numeric values for use in correlation operations
- **Handle Duplicates**

In [4]:
from utils import calculateRange, calculateMSRP, convertEligibility

cleanData = rawData.copy(deep=True)

cleanData = calculateRange(cleanData)

cleanData = calculateMSRP(cleanData)  # Corrects approx. 10,000 records

cleanData = convertEligibility(cleanData)



In [18]:
print(cleanData['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].unique())



['Clean Alternative Fuel Vehicle Eligible'
 'Eligibility unknown as battery range has not been researched'
 'Not eligible due to low battery range']
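
One way to turn these three categories into numeric values for correlation work is an explicit mapping. The sketch below is not the `convertEligibility` helper from `utils`, whose implementation is not shown here; the coding scheme (1 for eligible, 0 for not eligible, NaN for unknown) and the `CAFV Code` column name are assumptions made for illustration.

```python
import pandas as pd

CAFV_COL = "Clean Alternative Fuel Vehicle (CAFV) Eligibility"

# Hypothetical coding scheme: 1 = eligible, 0 = not eligible, NaN = unknown.
eligibility_codes = {
    "Clean Alternative Fuel Vehicle Eligible": 1.0,
    "Not eligible due to low battery range": 0.0,
    "Eligibility unknown as battery range has not been researched": float("nan"),
}

# Toy frame containing one row per category.
toy = pd.DataFrame({CAFV_COL: list(eligibility_codes.keys())})

# Series.map looks each string up in the dictionary.
toy["CAFV Code"] = toy[CAFV_COL].map(eligibility_codes)
print(toy["CAFV Code"].tolist())
```

Leaving the unknown category as NaN keeps it out of correlation calculations instead of treating it as a third numeric level.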


### Step 5: Univariate Analysis
- **Summary Statistics**
- **Visualize Distributions:** histograms, box plots, bar charts
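
Several of the project questions (most common makes, maximum and minimum range, and whether those extremes are unique) reduce to single-column summaries. The following sketch uses made-up values and assumed column names (`Make`, `Model`, `Electric Range`); it only illustrates the pattern.

```python
import pandas as pd

# Made-up rows; makes, models, and ranges are placeholders.
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "C", "A", "B"],
    "Model": ["A1", "A2", "B1", "C1", "A1", "B1"],
    "Electric Range": [337, 220, 84, 238, 337, 84],
})

# Frequency of each make, most common first.
make_counts = toy["Make"].value_counts()
print(make_counts)

# Summary statistics for the numeric range column.
print(toy["Electric Range"].describe())

# Extremes, plus the distinct make/model pairs that share each extreme.
max_range = toy["Electric Range"].max()
min_range = toy["Electric Range"].min()
cols = ["Make", "Model"]
at_max = toy.loc[toy["Electric Range"] == max_range, cols].drop_duplicates()
at_min = toy.loc[toy["Electric Range"] == min_range, cols].drop_duplicates()
print(at_max)
print(at_min)
```

If `at_max` or `at_min` holds more than one row, the corresponding extreme is shared by several make/model pairs.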

### Step 6: Bivariate Analysis
- **Correlation Analysis:** Pearson, Spearman
- **Cross-tabulation**
- **Visualize Relationships:** scatter plots, box plots, heatmaps
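
As a rough illustration of the first two items, the sketch below computes Pearson and Spearman coefficients for two numeric columns and a cross-tabulation for two categorical ones. All column names and values are assumptions chosen for the example.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Model Year": [2015, 2017, 2019, 2021, 2023],
    "Electric Range": [80, 110, 150, 150, 300],
    "Make": ["A", "A", "B", "B", "B"],
    "County": ["King", "Pierce", "King", "King", "Pierce"],
})

numeric = toy[["Model Year", "Electric Range"]]

# Pearson measures linear association; Spearman works on ranks.
pearson = numeric.corr(method="pearson").loc["Model Year", "Electric Range"]
spearman = numeric.corr(method="spearman").loc["Model Year", "Electric Range"]
print(pearson, spearman)

# Cross-tabulation: how many vehicles of each make appear in each county.
make_by_county = pd.crosstab(toy["Make"], toy["County"])
print(make_by_county)
```

A column such as city is categorical, so a correlation coefficient applies only after encoding it; comparing range distributions per city (for example with box plots) is often a more direct check.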

### Step 7: Multivariate Analysis
- **Pairplot/Scatterplot Matrix**
- **Multivariate Statistics**
- **Advanced Visualizations**
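
A pivot table is one simple way to look at three variables at once, for example vehicle counts and mean range by model year and make. This sketch assumes columns named `Model Year`, `Make`, and `Electric Range`, and uses invented values.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Model Year": [2020, 2020, 2021, 2021, 2021, 2022],
    "Make": ["A", "B", "A", "A", "B", "B"],
    "Electric Range": [200, 150, 220, 240, 160, 180],
})

# Number of vehicles per (model year, make); missing combinations become 0.
vehicle_counts = toy.pivot_table(
    index="Model Year", columns="Make",
    values="Electric Range", aggfunc="count", fill_value=0,
)

# Mean electric range per (model year, make); missing combinations stay NaN.
mean_range = toy.pivot_table(
    index="Model Year", columns="Make",
    values="Electric Range", aggfunc="mean",
)
print(vehicle_counts)
print(mean_range)
```

Either table can be passed to a heatmap or line plot to compare makes across years.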

### Step 8: Feature Engineering
- **Create New Features**
- **Feature Transformation**
- **Encoding Categorical Variables**
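
The sketch below shows one derived feature and one categorical encoding. The `Vehicle Age` feature, the reference year, and the column names are assumptions for illustration, not steps the project is known to take.

```python
import pandas as pd

# Made-up rows; column names are assumed.
toy = pd.DataFrame({
    "Model Year": [2018, 2022],
    "Electric Vehicle Type": ["BEV", "PHEV"],
})

# New feature: age relative to an assumed reference year.
REFERENCE_YEAR = 2024
toy["Vehicle Age"] = REFERENCE_YEAR - toy["Model Year"]

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["Electric Vehicle Type"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order among categories, which an integer code would do.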

### Step 9: Handle Imbalanced Data (If Applicable)
- **Resampling Techniques:** Use oversampling, undersampling, or SMOTE if the target variable is imbalanced.
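
Random oversampling, the simplest of these techniques, can be sketched with pandas alone. The `label` column and its values are hypothetical; SMOTE, which synthesizes new points instead of repeating existing ones, is not shown.

```python
import pandas as pd

# Toy data with an 8:2 class imbalance.
toy = pd.DataFrame({
    "feature": range(10),
    "label": ["majority"] * 8 + ["minority"] * 2,
})

# Size of the largest class; smaller classes are grown to match it.
target_size = int(toy["label"].value_counts().max())

parts = []
for _, group in toy.groupby("label"):
    extra = target_size - len(group)
    if extra > 0:
        # Draw additional rows with replacement from the smaller class.
        resampled = group.sample(n=extra, replace=True, random_state=0)
        group = pd.concat([group, resampled])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["label"].value_counts())
```

Because oversampling duplicates rows, it should be applied only to training data, after any train/test split.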

### Step 10: Analyze and Validate Assumptions
- **Check for Multicollinearity:** Use VIF (Variance Inflation Factor) to detect multicollinearity among predictors.
- **Normality Testing:** Test whether numerical data follows a normal distribution (e.g., using the Shapiro-Wilk test).
- **Homoscedasticity:** Check the equality of variance across groups.
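
For the multicollinearity check, the VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the others. When each regression includes an intercept, these values equal the diagonal of the inverse correlation matrix. The sketch below uses that identity on synthetic data; the helper name `variance_inflation_factors` is hypothetical, and the predictors are random numbers, not EV columns.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # The diagonal of the inverse correlation matrix holds the VIFs.
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X)
print(vif)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; here `x1` and `x3` should be flagged while `x2` stays near 1.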

### Step 11: Preliminary Insights and Hypotheses
- **Identify Key Findings**
- **Generate Hypotheses**

### Step 12: Document and Communicate Findings
- **Create Visual Summaries**
- **Write a Summary Report**

### Step 13: Next Steps
- **Plan for Further Analysis**

### Step 14: Review and Reiterate
- **Review EDA**
- **Iterate as Needed**