# Final Project - Maintaining EV customers

---------------
## Context
---------------
This dataset presents results from a survey of FCV and BEV households and compares the sociodemographic profile of FCV buyers with that of BEV households.  


-----------------
## Objective
-----------------
Based on the data, we want to identify the demographics of current FCV and BEV owners and see whether there are any major differences between the two groups of owners.  

For our ML model, we will attempt to predict from this information whether a customer is an FCV or a BEV owner.  

-------------------------
## Data Dictionary
-------------------------

The dataset has the following information:
- When the customer submitted their response
- Whether they previously owned a PHEV, BEV, HEV, or CNG vehicle
- Household income
- Importance of reducing greenhouse gas emissions
- Details of the customer's current car (year, manufacturer, model)
- Demographics of the customer (home type, home ownership, education, gender, age, number of people in the household)
- The customer's car usage (longest trip, number of trips over 200 miles, one-way commute distance, annual vehicle miles traveled (VMT))
- Whether the customer currently owns an FCV or a BEV


#### Acronyms used
- BEV: Battery Electric Vehicle
- FCV: Fuel Cell Vehicle (hydrogen fuel cell vehicle)
- PHEV: Plug-in Hybrid Electric Vehicle
- HEV: Hybrid Electric Vehicle
- CNG: Compressed Natural Gas vehicle

### Import the necessary libraries or dependencies

In [1]:
#Import Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

###  Read the Dataset 

In [2]:
#Import File
fcv_df = pd.read_excel('Resources/FCV&EVMT Data_6.18.19.xlsx')

## Descriptive Analysis/EDA

1. Check dimensions of the dataframe in terms of rows and columns
2. Check data types. Ensure your data types are correct. Refer to the data definitions to validate
3. If data types are not as per business definition, change the data types as per requirement
4. Study summary statistics
5. Check for missing values
6. Study correlation
7. Detect outliers
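
The checklist above can be sketched on a small stand-in dataframe. This is a minimal illustration, not the notebook's actual run: `toy_df` and its values are invented, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for fcv_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Age": [65.0, 55.0, None, 45.0],
    "Household Income": [275000.0, 125000.0, 175000.0, None],
    "Carmain": ["2016 Toyota Mirai", "2017 Toyota Mirai", None, "2016 Toyota Mirai"],
})

n_rows, n_cols = toy_df.shape                # 1. dimensions (rows, columns)
column_types = toy_df.dtypes                 # 2. data type of each column
summary = toy_df.describe()                  # 4. summary statistics (numeric columns)
missing_counts = toy_df.isnull().sum()       # 5. missing values per column
correlations = toy_df.select_dtypes("number").corr()  # 6. pairwise correlation

print(n_rows, n_cols)
print(missing_counts)
```

Steps 3 (type conversion) and 7 (outlier detection) depend on what the earlier steps reveal, so they are left out of this sketch.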

### Examine Dataset 

In [3]:
fcv_df.head()

Unnamed: 0,id. Response ID,submitdate. Date submitted,Month Year[subm...Date submitted],Month[Month Yea...ate submitted]],Year[Month Year...ate submitted]],lastpage. Last page,Carmain,Previous PHEVs,Previous BEVs,Previous HEVs,...,Highest Level of Education,Longest trip in the last 12 months,Number of trips over 200 miles in the last 12 months,One-way commute distance,Number of people in the household,Age,Gender (Male 1),Number of vehicles in the household,Annual VMT Estimate,"FCV, BEV Dummy"
0,FCV_1_3,2017/06/02 11:30:57,06/2017,6.0,2017.0,42.0,2016 Toyota Mirai,1.0,0.0,0.0,...,3.0,483.14,0.0,0.01,2.0,65.0,0.0,2,14622.0,0.0
1,FCV_1_4,2017/06/02 11:15:39,06/2017,6.0,2017.0,42.0,2016 Toyota Mirai,0.0,0.0,1.0,...,4.0,568.09,1.0,10.69,3.0,65.0,0.0,3,9197.142857,0.0
2,FCV_1_2,2017/06/02 10:51:59,06/2017,6.0,2017.0,42.0,2016 Toyota Mirai,1.0,1.0,0.0,...,4.0,398.57,0.0,9.39,5.0,55.0,1.0,4,15360.0,0.0
3,FCV_1_5,,,,,,,,,,...,,,,,,,,1,,
4,FCV_1_15,2017/06/02 19:35:59,06/2017,6.0,2017.0,42.0,2017 Toyota Mirai,1.0,0.0,0.0,...,2.0,255.16,0.0,17.63,2.0,55.0,0.0,3,5082.352941,0.0


### Dimensions of the `fcv_df` dataframe (shape: rows x columns)

#### Observations: 


In [4]:
fcv_df.shape

(27021, 25)

### Data Types/Categorical vs. Numerical Columns

In [5]:
fcv_df.dtypes

id. Response ID                                                                     object
submitdate. Date submitted                                                          object
Month Year[subm...Date submitted]                                                   object
Month[Month Yea...ate submitted]]                                                  float64
Year[Month Year...ate submitted]]                                                  float64
lastpage. Last page                                                                float64
Carmain                                                                             object
Previous PHEVs                                                                     float64
Previous BEVs                                                                      float64
Previous HEVs                                                                      float64
Previous CNGs                                                                      float64
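
The submission timestamp is stored as `object` (text), and the 0/1 flags are stored as floats. A minimal sketch of how such columns could be converted is shown here. It assumes the timestamp format seen in the `head()` output, and `toy_df` is an invented stand-in rather than the real data.

```python
import pandas as pd

# Toy rows mimicking the timestamp format shown in the head() output.
toy_df = pd.DataFrame({
    "submitdate. Date submitted": ["2017/06/02 11:30:57", "2017/06/02 11:15:39", None],
    "Gender (Male 1)": [0.0, 1.0, 0.0],
})

# Parse the text timestamps; missing entries become NaT.
toy_df["submitdate. Date submitted"] = pd.to_datetime(
    toy_df["submitdate. Date submitted"], format="%Y/%m/%d %H:%M:%S"
)

# A 0/1 flag with no missing values can be stored as an integer.
toy_df["Gender (Male 1)"] = toy_df["Gender (Male 1)"].astype(int)

print(toy_df.dtypes)
```

The integer cast only works once missing values have been handled, because `NaN` cannot be represented in a plain integer column.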

#### Observations:

### Missing Values

**If we encounter missing data, we can:**

* leave it as is
* drop the affected rows with dropna()
* fill missing values with fillna()
* fill missing values with a summary statistic such as the mean

Mode Imputation
* Mode imputation means replacing missing values with the mode, i.e. the most frequent category value.
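
As a rough sketch of the fill strategies listed above, the following uses an invented `toy_df`. Mean imputation is applied to a numeric column and mode imputation to a category-like column; which strategy suits which column of the real dataset is a judgment call that this example does not settle.

```python
import pandas as pd

# Invented values; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "Household Income": [275000.0, None, 125000.0, 175000.0],
    "Highest Level of Education": [4.0, 4.0, None, 3.0],
})

# Numeric column: fill with the mean of the observed values.
income_mean = toy_df["Household Income"].mean()
income_filled = toy_df["Household Income"].fillna(income_mean)

# Category-like column: fill with the mode (most frequent value).
education_mode = toy_df["Highest Level of Education"].mode().iloc[0]
education_filled = toy_df["Highest Level of Education"].fillna(education_mode)

print(income_filled.tolist())
print(education_filled.tolist())
```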


In [6]:
fcv_df.isnull().sum()

id. Response ID                                                                        0
submitdate. Date submitted                                                          6939
Month Year[subm...Date submitted]                                                   6939
Month[Month Yea...ate submitted]]                                                   6939
Year[Month Year...ate submitted]]                                                   6939
lastpage. Last page                                                                  996
Carmain                                                                             1405
Previous PHEVs                                                                     11077
Previous BEVs                                                                      11077
Previous HEVs                                                                      11077
Previous CNGs                                                                      11077
Household Income     

#### Observations: 

In [7]:
fcv_df = fcv_df.dropna()
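
Calling `dropna()` with no arguments removes every row that has a missing value in any column. One possible alternative, sketched here on invented data, is to require only selected columns (for example the target) to be present and to compare how many rows each option keeps.

```python
import pandas as pd

# Invented values; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "Previous PHEVs": [1.0, None, 0.0, None],
    "Age": [65.0, 55.0, None, 45.0],
    "FCV, BEV Dummy": [0.0, 1.0, 1.0, None],
})

rows_before = len(toy_df)

# Drops a row if ANY column is missing.
all_complete = toy_df.dropna()

# Drops a row only if the target column is missing.
target_known = toy_df.dropna(subset=["FCV, BEV Dummy"])

print(rows_before, len(all_complete), len(target_known))
```

Keeping more rows this way means the remaining gaps still have to be imputed before modeling.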

### Summary Statistics

In [8]:
fcv_df.describe()

Unnamed: 0,Month[Month Yea...ate submitted]],Year[Month Year...ate submitted]],lastpage. Last page,Previous PHEVs,Previous BEVs,Previous HEVs,Previous CNGs,Household Income,"Importance of reducing greenhouse gas emissions (-3 not important, 3 important)",Home ownership (own 1),...,Highest Level of Education,Longest trip in the last 12 months,Number of trips over 200 miles in the last 12 months,One-way commute distance,Number of people in the household,Age,Gender (Male 1),Number of vehicles in the household,Annual VMT Estimate,"FCV, BEV Dummy"
count,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,...,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0,4709.0
mean,6.489063,2016.004672,43.518794,0.099809,0.175409,0.217244,0.005946,223311.74347,1.715364,0.87938,...,3.406668,371.820822,47208.94,18.578486,3.066893,49.643661,0.218518,2.571034,12547.143431,0.949034
std,1.927018,0.82468,1.128866,0.299777,0.380357,0.412414,0.076889,123880.139436,1.576913,0.32572,...,0.678945,340.614394,3238360.0,40.951945,1.24674,12.380423,0.413284,0.89882,14703.56979,0.219952
min,4.0,2015.0,3.0,0.0,0.0,0.0,0.0,50000.0,-3.0,0.0,...,1.0,0.21,0.0,0.0,1.0,18.0,0.0,1.0,-158400.0,0.0
25%,5.0,2015.0,43.0,0.0,0.0,0.0,0.0,125000.0,1.24,1.0,...,3.0,171.58,0.0,6.68,2.0,45.0,0.0,2.0,7928.571429,1.0
50%,6.0,2016.0,43.0,0.0,0.0,0.0,0.0,175000.0,2.55,1.0,...,4.0,318.14,0.0,14.0,3.0,45.0,0.0,2.0,10838.709677,1.0
75%,8.0,2017.0,45.0,0.0,0.0,0.0,0.0,275000.0,2.74,1.0,...,4.0,437.34,2.0,23.71,4.0,55.0,0.0,3.0,14532.0,1.0
max,11.0,2017.0,45.0,1.0,1.0,1.0,1.0,500000.0,3.0,1.0,...,4.0,4041.57,222223200.0,2381.91,12.0,80.0,1.0,5.0,342000.0,1.0
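
The summary table shows a minimum `Annual VMT Estimate` of -158400 and a maximum of 222223200 for the number of trips over 200 miles, neither of which looks physically plausible. A minimal sketch of how such rows could be flagged follows. The cap of 365 trips per year is an assumption made for this example, and `toy_df` mixes those two extreme values with invented ordinary ones.

```python
import pandas as pd

TRIPS_COL = "Number of trips over 200 miles in the last 12 months"

# The two extreme values are taken from the summary table; the rest are invented.
toy_df = pd.DataFrame({
    "Annual VMT Estimate": [14622.0, -158400.0, 10838.7, 9197.1],
    TRIPS_COL: [0.0, 2.0, 222223200.0, 1.0],
})

MAX_TRIPS_PER_YEAR = 365  # assumed cap: at most one long trip per day

negative_vmt = toy_df["Annual VMT Estimate"] < 0
too_many_trips = toy_df[TRIPS_COL] > MAX_TRIPS_PER_YEAR

suspect_rows = toy_df[negative_vmt | too_many_trips]
plausible_rows = toy_df[~(negative_vmt | too_many_trips)]

print(len(suspect_rows), len(plausible_rows))
```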


In [9]:
# Update Column Names
fcv_df = fcv_df.rename(columns={'Month Year[subm...Date submitted]':'Month/Year Submitted',
                                'Month[Month Yea...ate submitted]]': 'Month Submitted',
                               'Year[Month Year...ate submitted]]':'Year Submitted'})

In [10]:
#Split Carmain into separate columns
fcv_df[['Model Year', 'Manufacturer', 'Model']] = fcv_df['Carmain'].str.split(' ', n=2, expand=True)

#Drop Carmain & ID as they are no longer needed
fcv_df = fcv_df.drop(columns=['Carmain','id. Response ID'])

fcv_df.head(5)

Unnamed: 0,submitdate. Date submitted,Month/Year Submitted,Month Submitted,Year Submitted,lastpage. Last page,Previous PHEVs,Previous BEVs,Previous HEVs,Previous CNGs,Household Income,...,One-way commute distance,Number of people in the household,Age,Gender (Male 1),Number of vehicles in the household,Annual VMT Estimate,"FCV, BEV Dummy",Model Year,Manufacturer,Model
0,2017/06/02 11:30:57,06/2017,6.0,2017.0,42.0,1.0,0.0,0.0,0.0,275000.0,...,0.01,2.0,65.0,0.0,2,14622.0,0.0,2016,Toyota,Mirai
1,2017/06/02 11:15:39,06/2017,6.0,2017.0,42.0,0.0,0.0,1.0,0.0,275000.0,...,10.69,3.0,65.0,0.0,3,9197.142857,0.0,2016,Toyota,Mirai
4,2017/06/02 19:35:59,06/2017,6.0,2017.0,42.0,1.0,0.0,0.0,0.0,125000.0,...,17.63,2.0,55.0,0.0,3,5082.352941,0.0,2017,Toyota,Mirai
5,2017/06/06 12:11:14,06/2017,6.0,2017.0,42.0,0.0,0.0,1.0,0.0,175000.0,...,3.53,2.0,75.0,0.0,2,13025.454545,0.0,2016,Toyota,Mirai
7,2017/06/02 15:57:08,06/2017,6.0,2017.0,42.0,0.0,0.0,1.0,0.0,500000.0,...,28.05,3.0,45.0,0.0,3,18000.0,0.0,2016,Toyota,Mirai
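
The `n=2` argument in the `Carmain` split limits the split to the first two spaces, so a multi-word model name stays in one column. A small sketch of that behavior is shown here. The second car string is a hypothetical example and is not taken from the dataset.

```python
import pandas as pd

# First value appears in the dataset; the second is a hypothetical multi-word model.
cars = pd.Series(["2016 Toyota Mirai", "2017 Honda Clarity Fuel Cell"])

# Split on at most two spaces: year, manufacturer, and everything else as the model.
parts = cars.str.split(" ", n=2, expand=True)
parts.columns = ["Model Year", "Manufacturer", "Model"]

# The year is still text after the split; convert it if numeric use is needed.
parts["Model Year"] = parts["Model Year"].astype(int)

print(parts)
```

This approach assumes the manufacturer is always a single word; a two-word manufacturer name would spill into the `Model` column.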


### Distribution Plot 

In [11]:
#sns.displot(fcv_df['Age'], kind = 'kde')


#### Observations: 


### Check for Max and Min Values

#### Observation:


### Examine the mean, median, and mode. Are the three measures of central tendency equal?
Comparing the three helps describe the skewness of each attribute's distribution.
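
A minimal sketch of this comparison on an invented `Age` series is shown here. When the mean exceeds the median, which in turn exceeds the mode, the distribution is typically right-skewed, and the sample skewness is positive.

```python
import pandas as pd

# Invented ages, used only to illustrate the comparison.
ages = pd.Series([45, 45, 45, 55, 55, 65, 80], name="Age")

mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]   # mode() can return several values; take the first
skewness = ages.skew()           # > 0 suggests a longer right tail

print(mean_age, median_age, mode_age, skewness)
```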

#### Observations: 

### Pairplot for the variables. 

In [12]:
#sns.pairplot(data = fcv_df, vars = ['Age', 'Household Income', 'Annual VMT Estimate'], hue = 'FCV, BEV Dummy')


#### Observations: 

### Scatterplots 

In [13]:
#plt.scatter(x = 'One-way commute distance', y = 'Annual VMT Estimate', data = fcv_df)
#plt.show()

#### Observations:

### Boxplots for Variables
Check the distribution and outliers for each column in the data.

In [14]:
#plt.boxplot(fcv_df['Age'])
#plt.title('Boxplot of Age')
#plt.ylabel('Age')
#plt.show()

#### Observations:


### Histograms

In [15]:
#Assumes 1 = BEV in 'FCV, BEV Dummy' (the Toyota Mirai rows shown earlier are coded 0)
#plt.hist(fcv_df[fcv_df['FCV, BEV Dummy'] == 1]['Age'], bins = 5)
#plt.title('Distribution of Age for BEV Owners')
#plt.xlabel('Age')
#plt.ylabel('Frequency')
#plt.show()

#### Observations:

### The interquartile range of all the variables

In [16]:
#Q1 = fcv_df.select_dtypes('number').quantile(0.25)
#Q3 = fcv_df.select_dtypes('number').quantile(0.75)
#IQR = Q3 - Q1
#print(IQR)
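
The interquartile range is often used to flag outliers with the common 1.5 x IQR rule. That rule is a convention assumed for this sketch rather than something the notebook specifies. The commute distances are invented apart from one extreme value, which echoes the maximum in the summary table.

```python
import pandas as pd

# Invented commute distances plus one extreme value echoing the summary table maximum.
commute = pd.Series(
    [6.7, 14.0, 23.7, 10.7, 17.6, 9.4, 2381.9], name="One-way commute distance"
)

q1 = commute.quantile(0.25)
q3 = commute.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = commute[(commute < lower_fence) | (commute > upper_fence)]

print(iqr, outliers.tolist())
```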

#### Observations: 

## Export of File to CSV for database

In [17]:
fcv_df.to_csv('Exports/FCV_Dataset.csv')
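
By default `to_csv` also writes the dataframe index as the first column, which comes back as `Unnamed: 0` when the file is read again. A small sketch of the difference that `index=False` makes is shown here. It uses an in-memory buffer instead of a file, and `toy_df` is invented.

```python
import io

import pandas as pd

# Non-contiguous index, as left behind after dropping rows.
toy_df = pd.DataFrame({"Age": [65.0, 55.0]}, index=[0, 4])

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to keep the index depends on whether the database needs the original row positions.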

### Visualize the Correlation Matrix.

* Correlation is a statistic that measures the degree to which two variables move in relation to each other. It takes a value between -1 and +1.
* A positive correlation indicates the extent to which the variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
* Correlation among multiple variables can be represented in the form of a matrix, which lets us see which pairs have high correlations.
* Two independent variables have zero correlation, but two variables with zero correlation are not necessarily independent.
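
These points can be illustrated with a small invented dataframe. `linear` rises with `x` (correlation +1), `inverse` falls as `x` rises (correlation -1), and `squared` is fully determined by `x` yet has zero correlation with it, which shows that zero correlation does not imply independence.

```python
import pandas as pd

# Invented data, symmetric around zero so the squared term is uncorrelated with x.
toy_df = pd.DataFrame({"x": [-2.0, -1.0, 0.0, 1.0, 2.0]})
toy_df["linear"] = 3 * toy_df["x"] + 1   # increases with x
toy_df["inverse"] = -toy_df["x"]         # decreases as x increases
toy_df["squared"] = toy_df["x"] ** 2     # depends on x, but not linearly

# Pearson correlation for every pair of columns.
corr_matrix = toy_df.corr()

print(corr_matrix.round(2))
```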

In [18]:
#corr_matrix = fcv_df.select_dtypes('number').corr()
#plt.figure(figsize = (8, 8))
#sns.heatmap(corr_matrix, annot = True)

# Display the plot
#plt.show()

#### Observations: 


## Data Preprocessing for Modeling

* Renaming columns
* Scaling/normalizing
* Dropping unnecessary columns
* One-hot encoding categorical columns (applying get_dummies to the dataframe)
* Imputing missing values with the mode/median for columns
* Converting and formatting data types
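
A minimal sketch of two of these steps, one-hot encoding and scaling, follows. `toy_df` is invented, and "Nissan" is a hypothetical manufacturer value. Standardization to zero mean and unit standard deviation is just one possible scaling choice.

```python
import pandas as pd

# Invented rows; "Nissan" is a hypothetical category used for illustration.
toy_df = pd.DataFrame({
    "Manufacturer": ["Toyota", "Nissan", "Toyota"],
    "Household Income": [275000.0, 125000.0, 175000.0],
})

# One-hot encode the categorical column into one indicator column per category.
encoded = pd.get_dummies(toy_df, columns=["Manufacturer"])

# Standardize the numeric column: subtract the mean, divide by the standard deviation.
income = encoded["Household Income"]
encoded["Household Income"] = (income - income.mean()) / income.std()

print(encoded)
```

When a model is evaluated on held-out data, the mean and standard deviation would normally be computed on the training split only.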

## Predictive Analysis/Building Models

### **Checking the below linear regression assumptions**

1. **Mean of residuals should be 0**
2. **No Heteroscedasticity**
3. **Linearity of variables**
4. **Normality of error terms**
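
As a rough illustration of the first assumption, the sketch below fits an ordinary least squares line to synthetic data with NumPy and checks that the residuals average to zero. The data, seed, and coefficients are invented for this example. The other three assumptions usually call for residual plots or formal tests, which are not shown here.

```python
import numpy as np

# Synthetic data: a known line plus random noise (all values invented).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Design matrix with an intercept column, then an ordinary least squares fit.
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Residuals are the observed values minus the fitted values.
residuals = y - design @ coef

print(coef, residuals.mean())
```

With an intercept in the model, the residual mean is zero up to floating-point error by construction, so this check mainly confirms that the fit was set up correctly.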

## **Actionable Insights and Business Recommendations**