# Final Project - Maintaining EV customers

---------------
## Context
---------------
This dataset presents the results of a survey of FCV and BEV households and compares the sociodemographic profile of FCV buyers with that of BEV households.


-----------------
## Objective
-----------------
Using the survey data, we want to profile the demographics of current FCV and BEV owners and see whether there are any major differences between the two groups.

For our ML model, we will attempt to predict whether a customer is an FCV or a BEV owner based on the information available.

-------------------------
## Data Dictionary
-------------------------

The dataset has the following information:
- When the customer submitted their data
- Whether they previously owned a PHEV, BEV, HEV, or CNG vehicle
- Household income
- Importance of reducing greenhouse gas emissions
- Details of the customer's current car (year, manufacturer, model)
- Demographics of the customer (home type, home ownership, education, gender, age, number of people in household)
- The customer's car usage (longest trip, number of trips over 200 miles, one-way commute distance, annual vehicle miles traveled (VMT))
- Whether the customer is currently an FCV or BEV owner


#### Acronyms used
- BEV: Battery Electric Vehicle
- FCV: Fuel Cell Vehicle (Hydrogen fuel cell vehicle)
- PHEV: Plug-in Hybrid Electric Vehicle
- HEV: Hybrid Electric Vehicle
- CNG: Compressed Natural Gas vehicle

### Import the necessary libraries or dependencies

In [1]:
#Import Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

###  Read the Dataset 

In [None]:
#Import File
fcv_df = pd.read_excel('Resources/FCV&EVMT Data_6.18.19.xlsx')

In [None]:
fcv_df.head(5)

## Descriptive Analysis/EDA

1. Check dimensions of the dataframe in terms of rows and columns
2. Check data types and ensure they are correct. Refer to the data definitions to validate
3. If data types do not match the business definition, change them as required
4. Study summary statistics
5. Check for missing values
6. Study correlation
7. Detect outliers

### Examine Dataset 

### Dimensions of the `fcv_df` DataFrame (shape: rows x columns)

In [None]:
fcv_df.shape

#### Observations: 
The original dataset has 27,021 rows and 25 columns

### Data Types/Categorical vs. Numerical Columns

In [None]:
fcv_df.dtypes

## Clean the dataset

In [None]:
# Update Column Names
fcv_df = fcv_df.rename(columns={'Month Year[subm...Date submitted]':'Month/Year Submitted',
                                'Month[Month Yea...ate submitted]]': 'Month Submitted',
                               'Year[Month Year...ate submitted]]':'Year Submitted'})

In [None]:
# Split Carmain into separate columns (year, manufacturer, and the remainder as model)
fcv_df[['Model Year', 'Manufacturer', 'Model']] = fcv_df['Carmain'].str.split(' ', n=2, expand=True)

# Drop Carmain and the response ID, as they are no longer needed
fcv_df = fcv_df.drop(columns=['Carmain', 'id. Response ID'])
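
To illustrate what the split above does, the sketch below applies the same `str.split(' ', n=2, expand=True)` call to two made-up `Carmain` strings (the sample values are assumptions, not rows from the survey). With `n=2`, only the first two spaces act as separators, so a multi-word model name stays in a single column.

```python
import pandas as pd

# Hypothetical sample values; the real 'Carmain' strings may be formatted differently.
sample = pd.DataFrame({'Carmain': ['2017 Tesla Model S', '2016 Hyundai Tucson fuel cell']})

# n=2 limits the split to the first two spaces: year, manufacturer, then the rest.
parts = sample['Carmain'].str.split(' ', n=2, expand=True)
parts.columns = ['Model Year', 'Manufacturer', 'Model']
print(parts)
```

Note that `Model Year` is still a string after the split, so it would need a separate conversion if it is to be used as a number.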

### Missing Values

**If we encounter missing data, we can:**

* leave it as is
* drop the affected rows with dropna()
* fill missing values with fillna()
* fill missing values with a summary statistic such as the mean

Mode Imputation
* Mode imputation means replacing missing values with the mode, i.e. the most frequent category value.
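
As a minimal sketch of the imputation options listed above, the example below fills one column with its mode and another with its median. The values are made up for illustration; only the column names are borrowed from this notebook, which itself goes on to use `dropna()` instead.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    'Previous BEVs': [0, 1, None, 0, None],
    'Household Income': [100000.0, None, 150000.0, 200000.0, 150000.0],
})

# Mode imputation: fill gaps with the most frequent value (suits categorical or count data).
mode_value = toy['Previous BEVs'].mode()[0]
toy['Previous BEVs'] = toy['Previous BEVs'].fillna(mode_value)

# Median imputation: a common alternative for skewed numeric columns.
median_income = toy['Household Income'].median()
toy['Household Income'] = toy['Household Income'].fillna(median_income)

print(toy)
```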


In [None]:
fcv_df.isnull().sum()

In [None]:
percent_missing = fcv_df.isnull().sum() * 100 / len(fcv_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.style.format({'percent_missing':'{:.0f}%'})

**Observations:**
Most of the missing data comes from `Previous PHEVs`, `Previous BEVs`, `Previous HEVs`, and `Previous CNGs`; each of these features is missing 41% of its values. `FCV, BEV Dummy` is missing 48% of its values. This is the feature we want to predict, so having almost half of it missing is a serious limitation.

In [None]:
fcv_df = fcv_df.dropna()

In [None]:
# Checking the count of unique values in each column
fcv_df.nunique()

In [None]:
fcv_df.shape

#### Observations: 
After all rows with null values have been dropped, the dataset has 4,709 rows and 26 columns.

In [None]:
# Changing the data type of float type column to integer. 
fcv_df['Previous BEVs'] = fcv_df['Previous BEVs'].astype("int32")
fcv_df['Previous PHEVs'] = fcv_df['Previous PHEVs'].astype("int32")
fcv_df['Previous HEVs'] = fcv_df['Previous HEVs'].astype("int32")
fcv_df['Previous CNGs'] = fcv_df['Previous CNGs'].astype("int32")
fcv_df['Number of people in the household'] = fcv_df['Number of people in the household'].astype("int32")
fcv_df['FCV, BEV Dummy'] = fcv_df['FCV, BEV Dummy'].astype("int32")
fcv_df['Age'] = fcv_df['Age'].astype("int32")
fcv_df['Gender (Male 1)'] = fcv_df['Gender (Male 1)'].astype("int32")
fcv_df['Month Submitted'] = fcv_df['Month Submitted'].astype("int32")
fcv_df['Year Submitted'] = fcv_df['Year Submitted'].astype("int32")

## Summary Statistics

In [None]:
fcv_df.describe()

In [None]:
fcv_df.Manufacturer.value_counts()

In [None]:
brand_mapping = {
    'FIAT' : 'Fiat',
    'tesla' : 'Tesla', 
    'Volkswagon' : 'Volkswagen', 
    'Mercedes' : 'Mercedes-Benz',
    'chevy' : 'Chevrolet',
    'VW' : 'Volkswagen',
    'vw' : 'Volkswagen',
    'Tesler' : 'Tesla',
    'Chev' : 'Chevrolet',
    'Telsa' : 'Tesla',
    'hyundai' : 'Hyundai',
    'nISSAN' : 'Nissan',
    'MBW' : 'BMW',
    'Nissa' : 'Nissan',
    'Hyundi' : 'Hyundai',
    'Chevy' : 'Chevrolet',
    'smart' : 'Smart'
}

fcv_df['Manufacturer'] = fcv_df['Manufacturer'].replace(brand_mapping)
fcv_df.Manufacturer.value_counts()

In [None]:
model_mapping = {
    '500E' : '500e',
    'Model X 60D' : 'Model X',
    'LEAF' : 'Leaf',
    'Tuscon' : 'Tucson',
    'Tucson FCV' : 'Tucson',
    'S 85' : 'Model S',
    'Tuscon fuel cell' : 'Tucson',
    'bolt' : 'Bolt EV',
    'Bolt' : 'Bolt EV',
    ' 75d' : 'Model S',
    'Model S P90D' : 'Model S',
    'S 90D' : 'Model S',
    'Tuscon FCV' : 'Tucson',
    ' Chevreolet Volt' : 'Volt',
    ' Tucson' : 'Tucson',
    'S 85 D' : 'Model S',
    'e golf' : 'e-Golf',
    ' E-Golf' : 'e-Golf',
    'eGolf' : 'e-Golf',
    'VW  e-Golf' : 'e-Golf',
    ' Leaf' : 'Leaf',
    'Model S P85D' : 'Model S',
    's' : 'Model S',
    'Tucson fuel cell' : 'Tucson',
    'model S' : 'Model S',
    'Tuscan' : 'Tucson',
    'S' : 'Model S',
    'tuscon' : 'Tucson',
    '85 S' : 'Model S',
    'Model S, P85' : 'Model S',
    ' Model S' : 'Model S',
    ' Tuscon' : 'Tucson',
    'Benz B Class Electric' : 'B250e',
    ' Model X' : 'Model X',
    'e-golf' : 'e-Golf',
    'Focus Electric' : 'Focus',
    'B-Class Electric Drive' : 'B250e',
    ' Chevrolet Volt' : 'Volt',
    'Spark' : 'Spark EV'
    
}

fcv_df['Model'] = fcv_df['Model'].replace(model_mapping)

In [None]:
fcv_df.Model.value_counts()
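
Many of the variants in the mappings above differ only by leading spaces or letter case. A possible way to shrink such a mapping, sketched here on made-up values, is to strip whitespace first and then match on a lowercased key. The `canonical` dictionary is hypothetical and far shorter than the real mapping would need to be.

```python
import pandas as pd

# Made-up raw values showing the kinds of variants handled by the mappings above.
raw = pd.Series([' Leaf', 'LEAF', 'Leaf', ' Model S', 'model S'])

# Step 1: trim surrounding whitespace so ' Leaf' and 'Leaf' compare equal.
cleaned = raw.str.strip()

# Step 2: look up a lowercased key; values without a match keep their stripped form.
canonical = {'leaf': 'Leaf', 'model s': 'Model S'}  # hypothetical, not the full mapping
cleaned = cleaned.str.lower().map(canonical).fillna(cleaned)

print(cleaned.value_counts())
```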

Let's check the distribution and outliers for each column in the data.

In [None]:
# Univariate analysis of a numerical variable lets us study its central tendency and dispersion.
# This helper takes one numerical column as input and draws a boxplot and a histogram
# for it on a shared x-axis.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,      # two stacked subplots
                                           sharex=True,  # x-axis is shared between them
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize)
    # Boxplot; showmeans=True adds a marker at the mean of the column
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='lightblue')
    # Histogram; fall back to seaborn's automatic binning when bins is not given
    sns.histplot(x=feature, kde=False, ax=ax_hist2, bins=bins if bins else "auto")
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')  # mean
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # median

In [None]:
# Build the histogram boxplot for Income
histogram_boxplot(fcv_df['Household Income'])

### Distribution Plot 

In [None]:
sns.displot(fcv_df['Household Income'], kind = 'kde')


#### Observations: 


### Check for Max and Min Values

#### Observation:


### Examine the mean, median, and mode. Are the three measures of central tendency equal?
-- this will help describe the skewness/distribution of attributes
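
A minimal sketch of this comparison on made-up values: when the distribution is roughly symmetric the three measures are close, while a mean noticeably above the median points to a right (positive) skew. The numbers below are illustrative only and do not come from the survey.

```python
import pandas as pd

# Made-up right-skewed values, for illustration only.
values = pd.Series([1, 2, 2, 3, 4, 5, 20])

mean = values.mean()
median = values.median()
mode = values.mode()[0]   # mode() returns a Series; take the first (most frequent) value
skew = values.skew()      # positive for a right-skewed sample

print(f"mean={mean:.2f}, median={median}, mode={mode}, skew={skew:.2f}")
```

The same calls can be applied to a column such as `fcv_df['Household Income']`.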

#### Observations: 

### Pairplot for the variables. 

In [None]:
sns.pairplot(data = fcv_df, vars = ['Household Income', 'Age', 'Number of people in the household'], hue = 'FCV, BEV Dummy')


#### Observations: 

### Scatterplots 

In [None]:
plt.scatter(x = 'Age', y = 'Household Income', data = fcv_df)
plt.show()

#### Observations:

### Boxplots for Variables
Check the distribution and outliers for each column in the data.

In [None]:
plt.boxplot(fcv_df['Age'])
plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()

#### Observations:


### Histograms

In [None]:
plt.hist(fcv_df[fcv_df['FCV, BEV Dummy'] == 1]['Age'], bins = 5)
plt.title('Distribution of Age where FCV, BEV Dummy = 1')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

#### Observations:

### The interquartile range of all the variables

In [None]:
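# Quantiles are only defined for numeric columns
numeric_df = fcv_df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)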
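
One common way to turn the interquartile range into an outlier check is the 1.5 x IQR rule: values more than 1.5 IQRs below the first quartile or above the third are flagged. The sketch below applies it to made-up ages; the threshold of 1.5 is a convention, not something prescribed by this dataset.

```python
import pandas as pd

# Made-up values with one obvious extreme, for illustration only.
ages = pd.Series([25, 30, 35, 40, 45, 50, 120])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Flag points that lie more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers)
```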

#### Observations: 

## Export of File to CSV for database

In [None]:
fcv_df.to_csv('Exports/FCV_Dataset.csv')

### Visualize the Correlation Matrix.

* Correlation is a statistic that measures the degree to which two variables move in relation to each other. It takes a value between -1 and +1.
* A positive correlation indicates the extent to which the variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
* Correlation among multiple variables can be represented as a matrix, which lets us see which pairs have high correlations.
* The correlation between two independent variables is zero, but two variables with zero correlation are not necessarily independent.
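
The last point can be checked with a small worked example. In the sketch below (made-up numbers), `y` is completely determined by `x`, yet the Pearson correlation is zero because the relationship is not linear.

```python
import numpy as np

# y depends entirely on x, but the relationship is symmetric rather than linear.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-12)
```

A correlation matrix therefore only reveals linear relationships; pairs with a near-zero coefficient may still be related in other ways.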

In [None]:
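# Compute pairwise correlations between the numeric columns
corr_matrix = fcv_df.select_dtypes(include='number').corr()

plt.figure(figsize = (8, 8))
sns.heatmap(corr_matrix, annot = True)

# Display the plot
plt.show()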

#### Observations: 


## Data Preprocessing for Modeling

* Renaming Columns
* Scaling/Normalizing
* Dropping unnecessary columns
* One-hot encoding
* Imputing missing values with the mode/median for columns
* Converting data types
* Formatting data types
* Applying `get_dummies` to the DataFrame
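
A few of the steps above can be sketched as follows, using pandas only. The toy frame is made up, and `Home Type` with its values is a hypothetical stand-in for one of the categorical columns; the real column names and categories may differ.

```python
import pandas as pd

# Toy frame for illustration; 'Home Type' and its values are hypothetical.
toy = pd.DataFrame({
    'Household Income': [100000.0, 150000.0, 200000.0],
    'Home Type': ['Detached', 'Apartment', 'Detached'],
    'FCV, BEV Dummy': [0, 1, 0],
})

# Separate the target from the features.
y = toy['FCV, BEV Dummy']
X = toy.drop(columns=['FCV, BEV Dummy'])

# One-hot encode the categorical column.
X = pd.get_dummies(X, columns=['Home Type'])

# Standardize a numeric column to zero mean and unit variance (z-score).
income = X['Household Income']
X['Household Income'] = (income - income.mean()) / income.std()

print(X)
```

When a train/test split is used, the mean and standard deviation for scaling are normally computed on the training portion only and then applied to the test portion.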

## Predictive Analysis/Building Models

### **Checking the below linear regression assumptions**

1. **Mean of residuals should be 0**
2. **No Heteroscedasticity**
3. **Linearity of variables**
4. **Normality of error terms**
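
As a rough sketch of how the first two checks might be run, the example below fits a straight line to synthetic data with NumPy and inspects the residuals. The data, the seed, and the split-in-half variance comparison are all assumptions made for illustration; they are not part of the survey analysis.

```python
import numpy as np

# Synthetic data for illustration only: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

# Ordinary least squares fit with an intercept.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# 1. Mean of residuals: zero up to rounding error when an intercept is fitted.
print(residuals.mean())

# 2. Crude heteroscedasticity check: compare residual variance for the lower
#    and upper halves of the fitted values; similar values suggest constant variance.
order = np.argsort(fitted)
var_low = residuals[order[:100]].var()
var_high = residuals[order[100:]].var()
print(var_low, var_high)
```

Linearity and normality of the error terms are usually judged visually, for example with a residuals-versus-fitted scatterplot and a histogram of the residuals.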

## **Actionable Insights and Business Recommendations**