Introduction

Electrification of the transportation sector is on the critical path to reducing carbon emissions and mitigating the harmful impacts of climate change. Yet while Tesla remains a hot stock with fashionable products, electric vehicles remain a niche market. Indeed, adoption has fallen short of where many experts predicted it would be at this stage.

One of the reasons for this gap - and the subject of the analysis below - is the instability that charging stations introduce to the electrical grid. 

Solving this problem will have both market and policy implications.

In [157]:
import pandas as pd
import seaborn as sns
import json
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, ElasticNet

In [158]:
json_file_path = "/Users/toddhendricks/Downloads/acndata_sessions.json"

with open(json_file_path, 'r') as j:
    data = json.load(j)

In [166]:
df = pd.json_normalize(data['_items'])

Generate a description of the dataset.

In [168]:
charging_slots = df.stationID.nunique()
customers = df.userID.nunique()
sessions = df.sessionID.nunique()
total_mwh_delivered = sum(df.kWhDelivered) / 1000
avg_kwh_delivered = df.kWhDelivered.mean()
"There are {} charging stations on campus. The data contains {} unique customers of the stations, with {} unique charging sessions over the course of 2020. The total energy delivered was {:.2f} megawatt hours, with the average session delivering {:.2f} kilowatt hours.".format(charging_slots, customers, sessions, total_mwh_delivered, avg_kwh_delivered)

In [161]:
df.columns

Index(['_id', 'clusterID', 'connectionTime', 'disconnectTime',
       'doneChargingTime', 'kWhDelivered', 'sessionID', 'siteID', 'spaceID',
       'stationID', 'timezone', 'userID', 'userInputs'],
      dtype='object')

Our target is the duration of a charging session represented in minutes. We will need to convert the relevant columns to datetime objects in order to facilitate the math.

In [162]:
df.connectionTime = pd.to_datetime(df.connectionTime)
df.doneChargingTime = pd.to_datetime(df.doneChargingTime)

In [163]:
# Truncate both timestamps to whole minutes (floor keeps any timezone information).
df.connectionTime = df.connectionTime.dt.floor('min')
df.doneChargingTime = df.doneChargingTime.dt.floor('min')

In [164]:
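# The duration column must exist before it can be converted. It is assumed here
# to be the time from plug-in (connectionTime) to end of charging (doneChargingTime).
df['charging_session_duration'] = df.doneChargingTime - df.connectionTime
df['charging_session_duration_in_min'] = df['charging_session_duration'].dt.total_seconds() / 60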
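
As a minimal sketch of the conversion described above, the following uses a toy frame with invented timestamps (not the real session data) to show how subtracting two datetime columns yields a timedelta, and how that timedelta can be expressed in minutes. The column name `duration_min` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the session data; the timestamps are invented for illustration.
toy = pd.DataFrame({
    "connectionTime": ["2020-01-02 13:08:00", "2020-01-02 14:30:00"],
    "doneChargingTime": ["2020-01-02 15:08:00", None],  # None: no completion time recorded
})
toy["connectionTime"] = pd.to_datetime(toy["connectionTime"])
toy["doneChargingTime"] = pd.to_datetime(toy["doneChargingTime"])

# Subtracting two datetime columns yields a timedelta column.
elapsed = toy["doneChargingTime"] - toy["connectionTime"]

# total_seconds() / 60 gives fractional minutes; a missing timestamp becomes NaN.
toy["duration_min"] = elapsed.dt.total_seconds() / 60
print(toy["duration_min"].tolist())
```

Rows with a missing timestamp end up as NaN, so any later comparison such as `<= 1000` silently drops them.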

In [None]:
df_less_outliers = df[df['charging_session_duration_in_min'] <= 1000]

In [None]:
n = 2
# This is the share of rows retained after filtering, not the share lost.
data_retained = (len(df_less_outliers) / len(df)) * 100
print("After removing outliers, we have {0:.{1}f} percent of the data we began with.".format(data_retained, n))

In [None]:
session_length = df_less_outliers['charging_session_duration_in_min']

In [None]:
sns.displot(session_length)

Our target is right-skewed, which makes intuitive sense: most sessions are short, but a few are very long. From a modeling standpoint, we will consider applying a power transformation at the feature engineering stage.
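
One way such a power transformation might look is sketched below with scikit-learn's `PowerTransformer`. The data is a synthetic log-normal sample standing in for session durations, and the choice of Yeo-Johnson over Box-Cox is an assumption made for illustration, not a decision taken in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed "durations" (log-normal); not the real session data.
rng = np.random.default_rng(42)
durations = pd.Series(rng.lognormal(mean=5.0, sigma=0.8, size=2000))

# Yeo-Johnson accepts zero and negative values; Box-Cox needs strictly positive input.
pt = PowerTransformer(method="yeo-johnson")
transformed = pd.Series(pt.fit_transform(durations.to_numpy().reshape(-1, 1)).ravel())

# Compare sample skewness before and after the transformation.
skew_before = durations.skew()
skew_after = transformed.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

If the target itself is transformed, predictions would need to be mapped back with `pt.inverse_transform` before being read as minutes.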

In [None]:
sns.scatterplot(data=df_less_outliers, x='kWhDelivered', y='charging_session_duration_in_min')

The scatterplot reveals an interesting property of the data: for a given amount of energy delivered, there is a minimum session duration but no maximum. Physics explains the lower boundary, since a charger can only deliver energy so fast. We do not yet know which covariate(s) explain the variance above that hard lower bound.
The column of points at zero kWh delivered is also interesting. A considerable number of sessions have a vehicle that is supposedly charging but register no energy delivered.
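
Both observations can be made concrete with a little arithmetic: the average power of a session is the energy delivered divided by its duration in hours, and the fastest sessions trace the lower edge of the scatterplot. The sketch below uses a toy frame with invented values; `avg_power_kw` is a hypothetical column name.

```python
import pandas as pd

# Toy sessions with invented values; the last row delivers no energy.
toy = pd.DataFrame({
    "kWhDelivered": [6.6, 13.2, 3.3, 0.0],
    "charging_session_duration_in_min": [60.0, 120.0, 90.0, 45.0],
})

# Average power (kW) = energy (kWh) / duration (hours).
hours = toy["charging_session_duration_in_min"] / 60
toy["avg_power_kw"] = toy["kWhDelivered"] / hours

# The highest average power corresponds to the lower edge of the scatterplot.
max_avg_power_kw = toy["avg_power_kw"].max()

# Sessions that register no energy at all.
zero_energy_sessions = int((toy["kWhDelivered"] == 0).sum())
print(max_avg_power_kw, zero_energy_sessions)
```

Run against the real data, the same two quantities would give an empirical estimate of the fastest observed charging rate and a count of the zero-energy sessions.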

In [None]:
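# Average only the two numeric columns we plot; recent pandas versions raise an
# error when asked to take the mean of non-numeric columns.
by_parking_slot = df_less_outliers.groupby(by='stationID')[['kWhDelivered', 'charging_session_duration_in_min']].mean()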

In [None]:
x = by_parking_slot['kWhDelivered']
y = by_parking_slot['charging_session_duration_in_min']

In [None]:
sns.scatterplot(x=x,y=y)

In [None]:
X,y = df_less_outliers['kWhDelivered'], df_less_outliers['charging_session_duration_in_min']
X = X.values.reshape(-1,1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

We now have a baseline model. The task before us is to reduce the bias of our model by introducing complexity. 
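
To put a baseline score in context, it can help to compare it against a model that ignores the features entirely. `DummyRegressor` is imported at the top of the notebook; the sketch below shows one way it could be used. The data is synthetic (a linear floor plus positive noise, loosely mimicking the scatterplot), so the scores say nothing about the real sessions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: minutes grow with energy delivered, plus one-sided noise.
rng = np.random.default_rng(42)
kwh = rng.uniform(0, 40, size=500)
minutes = 9.0 * kwh + rng.exponential(scale=60.0, size=500)

X = kwh.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, minutes, test_size=0.3, random_state=42
)

# The dummy model always predicts the training mean, whatever the input.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# score() returns R^2 on the held-out data for both estimators.
dummy_r2 = dummy.score(X_test, y_test)
linear_r2 = linear.score(X_test, y_test)
print(f"dummy R^2: {dummy_r2:.3f}, linear R^2: {linear_r2:.3f}")
```

A mean-only model scores an R^2 near zero by construction, so any gap above that is what the single feature contributes.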

In [None]:
# Missing user inputs load as null values, not the string 'None', so test with isna().
df_less_outliers[df_less_outliers['userInputs'].isna()]