# The Prediction Analysis of Available Bikes and Stands

## Introduction

This file covers only the implementation of the prediction analysis of available bikes and stands at a given bike station on a given date.

## 1 The Data

In our application, a web crawler collects data from the bike website and the weather website every five minutes. As soon as the two sets of data are captured, they are merged into one table named 'real_time'. At the same time, the rows of the 'real_time' table are appended to a table named 'completedata'. Thus, at any time, the 'real_time' table holds the current data for the bike stations and the weather, while all the historical data accumulate in the 'completedata' table. Each row in these tables holds bike station data and weather data matched by time.
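
The crawler code itself is not part of this file. As a rough sketch of what matching bike rows and weather rows by time could look like, the example below uses `pandas.merge_asof` on two small hypothetical snapshots. The sample values, the five-minute tolerance and the choice of `merge_asof` are assumptions for illustration, not the crawler's actual logic.

```python
import pandas as pd

# Hypothetical bike snapshot; column names mirror those used later in this file.
bikes = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-03-01 10:00:20', '2023-03-01 10:05:15']),
    'number': [42, 42],
    'available_bikes': [7, 5],
})

# Hypothetical weather snapshot taken at slightly different times.
weather = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-03-01 10:00:00', '2023-03-01 10:05:00']),
    'temp': [9.8, 10.1],
    'weather_main': ['Clouds', 'Rain'],
})

# merge_asof requires both frames to be sorted by the key column.
# Each bike row is paired with the latest weather row at or before it,
# provided that weather row is no more than five minutes old.
merged = pd.merge_asof(
    bikes.sort_values('timestamp'),
    weather.sort_values('timestamp'),
    on='timestamp',
    direction='backward',
    tolerance=pd.Timedelta('5min'),
)
print(merged)
```

An as-of join is used here because the two sources are unlikely to report at exactly the same second, so an exact equality join on the timestamp would drop most rows.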

Training the prediction model requires the historical data in the 'completedata' table, so the first step is to fetch that table.


In [3]:
from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import datetime
import holidays

# The configuration of AWS RDS MySQL
user = 'admin'
password = '00000000'
host = 'dbbikes.ci3iggfwlke6.eu-west-1.rds.amazonaws.com'
port = 3306
database = 'dbbikes'

# create the link to Database
engine = create_engine(f'mysql+mysqlconnector://{user}:{password}@{host}:{port}/{database}')

The dependent variables we will predict are the numbers of the available bikes and the spare stands at a given station, that is, 'available_bikes' and 'available_bike_stands', respectively.

The independent variables we use to make predictions fall into three groups: the features of the bike station, the features of the weather, and the time.

The features of the bike station:  
>number - number of the station. This is NOT an id, thus it is unique only inside a contract;  
>banking - indicates whether this station has a payment terminal;  
>bonus - indicates whether this is a bonus station.

The features of the weather:  
>weather_main - group of weather parameters (Rain, Snow, Extreme etc.);  
>temp - temperature;  
>wind_speed - wind speed.

The time stamp:  
>timestamp - the update time.

The timestamp is converted into the variables date, weekday, hour, and a holiday indicator.

In [4]:
# read the data table from database on AWS
table_name = 'completedata'
dfAll = pd.read_sql_table(table_name, engine)

In [7]:
# select the needed features
df = dfAll[['available_bikes', 'available_bike_stands', 'banking', 'number', 'bonus', 'weather_main', 'temp', 'wind_speed', 'timestamp']].copy()

# retrieve 'yyyy-mm-dd', 'hour', 'weekday', 'holiday'
df['date'] = df['timestamp'].dt.date
df['hour'] = df['timestamp'].dt.hour
df['date'] = pd.to_datetime(df['date'])
df['weekday'] = df['date'].dt.day_name()
ie_holidays = holidays.IE()
df['holiday'] = df['date'].dt.date.apply(lambda x: x in ie_holidays)

# take out the needed columns
df = df[['available_bikes', 'available_bike_stands', 'banking', 'number', 'bonus', 'weather_main', 'temp', 'wind_speed', 'hour', 'weekday', 'holiday']]

## 2 Data Preparation

We process the data for the subsequent regression analysis.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [9]:
# take out independent features and dependent variable
X = df[['hour', 'number', 'weekday', 'holiday', 'weather_main', 'banking', 'bonus', 'wind_speed', 'temp']].copy()
y = df[['available_bikes', 'available_bike_stands']].copy()

# convert categorical variables to one-hot coded variables
X[['hour', 'number', 'weekday', 'holiday', 'weather_main', 'banking', 'bonus']] = X[['hour', 'number', 'weekday', 'holiday', 'weather_main', 'banking', 'bonus']].astype(str)

X = pd.get_dummies(X, columns=['hour', 'number','weekday', 'holiday', 'weather_main', 'banking', 'bonus'])

# divide data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=93)

In [10]:
# standardize continuous features
scaler = StandardScaler()
X_train[['wind_speed', 'temp']] = scaler.fit_transform(X_train[['wind_speed', 'temp']])
X_test[['wind_speed', 'temp']] = scaler.transform(X_test[['wind_speed', 'temp']])
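
The scaler is fitted on the training rows only, and the same mean and standard deviation are then reused for the test rows. The following is a minimal sketch of that arithmetic in plain Python, using made-up temperatures. `StandardScaler` divides by the population standard deviation, which is why `statistics.pstdev` is used here.

```python
import statistics

# Made-up temperatures, for illustration only.
train_temp = [2.0, 4.0, 6.0]
test_temp = [4.0, 8.0]

# Statistics are estimated from the training values only.
mean = statistics.fmean(train_temp)
std = statistics.pstdev(train_temp)  # population standard deviation

# The same mean and std are applied to both sets.
train_scaled = [(t - mean) / std for t in train_temp]
test_scaled = [(t - mean) / std for t in test_temp]

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test rows out of the fitted model, so the test error remains an honest estimate.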

## 3 Training the Regression Model

Considering the spatial correlation among the bike stations, we pool the data of all the bike stations and use the station number, 'number', as a dummy variable.
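
To make the pooling concrete, here is a small sketch with invented data for two stations. It is not the model trained in this file: it uses `numpy.linalg.lstsq` directly and leaves out the global intercept, so that each station dummy can be read as that station's own baseline while the temperature slope is shared.

```python
import numpy as np

# Invented data: two stations share one temperature slope (0.5)
# but have different baseline levels (20 and 8 bikes).
station = np.array([1, 1, 1, 2, 2, 2])
temp = np.array([5.0, 10.0, 15.0, 5.0, 10.0, 15.0])
baseline = np.where(station == 1, 20.0, 8.0)
bikes = baseline + 0.5 * temp

# One dummy column per station plus the shared continuous feature.
# No separate intercept column is added, so the dummy coefficients
# act as station-specific intercepts.
X = np.column_stack([station == 1, station == 2, temp]).astype(float)

# Ordinary least squares on the pooled rows of both stations.
coef, *_ = np.linalg.lstsq(X, bikes, rcond=None)
intercept_1, intercept_2, slope = coef
print(intercept_1, intercept_2, slope)
```

Pooling this way lets every station contribute to the estimate of the shared effects, while the dummies absorb the level differences between stations.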

In [11]:
# train the regression model
y_train_bikes = y_train['available_bikes']
y_train_stands = y_train['available_bike_stands']
lr_bikes = LinearRegression()
lr_stands = LinearRegression()
X_train.columns = X_train.columns.astype(str)
lr_bikes.fit(X_train, y_train_bikes)
lr_stands.fit(X_train, y_train_stands)

## 4 Evaluating the Trained Model

We use the test data set to produce predictions and evaluate the model's performance by the root mean squared error (RMSE), which the code below reports as the 'Standard Error'.
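
As a quick illustration of the metric, the sketch below computes the mean squared error and its square root by hand for four made-up observations. The numbers are invented and unrelated to the results reported in this section.

```python
import math

# Made-up observed and predicted counts, for illustration only.
y_true = [10, 12, 8, 15]
y_pred = [12, 12, 5, 14]

# Mean of the squared prediction errors.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# Taking the square root brings the error back to the unit of the target (bikes).
rmse = math.sqrt(mse)
print(mse, rmse)
```

Because the result is in the same unit as the target, it can be read as a typical prediction error in bikes or stands.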

In [13]:
# compute prediction on the test data
X_test.columns = X_test.columns.astype(str)
y_pred_bikes = lr_bikes.predict(X_test)
y_pred_stands = lr_stands.predict(X_test)

# Evaluate the model's performance using Mean Squared Error
y_test_bikes = y_test['available_bikes']
y_test_stands = y_test['available_bike_stands']

# compute the root mean squared error (the square root of the MSE)
mse_bikes = mean_squared_error(y_test_bikes, y_pred_bikes)
ste_bikes = np.sqrt(mse_bikes)

mse_stands = mean_squared_error(y_test_stands, y_pred_stands)
ste_stands = np.sqrt(mse_stands)

print("Standard Error for Bikes:", ste_bikes)
print("Standard Error for Stands:", ste_stands)

Standard Error for Bikes: 7.876773122824796
Standard Error for Stands: 8.067956938441425


## 5 Saving the Trained Model

This model will be retrained with new data once a month. Each time the model is trained, it is saved so that the web application can load it.

In [14]:
import os
import pickle

# Record the order of the features in training data set, 
# which will be used to rearrange the columns in the forecasting data
feature_order = list(X_train.columns) 

# save the trained models of the available bikes and of the spare stands,
# the order of the features in the training data set,
# and the scaler for standardizing the continuous features
models = {'lr_bikes': lr_bikes, 'lr_stands': lr_stands,  'scaler':scaler, 'feature_order': feature_order}

# directory in which to save the model
# (raw string, so the backslashes are not read as escape sequences)
file_dir = r'D:\PythonProjects\DublinBike'

file_path = os.path.join(file_dir, 'models.pkl')
with open(file_path, 'wb') as f:
    pickle.dump(models, f)
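
The saved `feature_order` matters because a single prediction request, once one-hot encoded, only yields the dummy columns for the categories it contains. The sketch below illustrates one way the forecasting data could be rearranged to match the training columns. The shortened `feature_order` list and the request values are hypothetical, and the use of `DataFrame.reindex` is an assumption about how the web application might do it.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the saved `feature_order` list.
feature_order = ['wind_speed', 'temp', 'hour_8', 'hour_9',
                 'number_1', 'number_2', 'weekday_Monday', 'weekday_Tuesday']

# One hypothetical request: station 2, Monday, 09:00.
request = pd.DataFrame([{'hour': 9, 'number': 2, 'weekday': 'Monday',
                         'wind_speed': 4.1, 'temp': 11.5}])

# Encode the categorical variables the same way as in training.
categorical = ['hour', 'number', 'weekday']
request = request.astype({col: str for col in categorical})
encoded = pd.get_dummies(request, columns=categorical)

# The single row only produced 'hour_9', 'number_2' and 'weekday_Monday'.
# Reindexing adds the missing dummy columns as 0 and restores the training order.
aligned = encoded.reindex(columns=feature_order, fill_value=0)
print(aligned)
```

After this step, the saved scaler would be applied to 'wind_speed' and 'temp' before calling `predict` on the two saved models.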