# MIE1512H Project

### Topic: Predicting Station-Level Demand in Bike-Sharing Systems

Name: Tanya Tang<br>
Version: V1<br>
Date: March 22, 2020

___
### Project Versions/Timelines

#### V2

CRISP-DM Tasks:
* Data Preparation on Full Scale (selecting, cleaning, constructing, integrating, formatting)
* Modeling (selecting techniques, generating test design, building, assessing)

Timeline:

#### V3

CRISP-DM Tasks:
* Evaluation (evaluating results, reviewing process, determining next steps)

Planned Timeline (Week 3: March 26 to April 2):
* Analyze results and make hypotheses (3 hrs)
* Review project activities to verify correctness (1 hr)
* Summarize project and list some potential next steps (1 hr)

#### F

Planned Timeline (Week 4: April 2 to April 9):
* Write report (6 hrs)

___
### Data Preparation

Data preparation will be carried out on data from January 2018 to February 2020, in the same manner as for the smaller dataset in V1. Following a partitioning similar to the paper's, training will use data from January 2018 to August 2019, validation will use data from September to November 2019, and testing will use data from December 2019 to February 2020.
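
The date boundaries above can be written down explicitly. The following is a minimal sketch, in plain Python rather than Spark, of how a trip date might be mapped to its partition. The function `assign_split` and the constant names are hypothetical and not part of the project code, and the data is assumed to end on February 29, 2020.

```python
from datetime import date

# Partition boundaries taken from the text above (names are illustrative).
TRAIN_START = date(2018, 1, 1)
VALIDATION_START = date(2019, 9, 1)
TEST_START = date(2019, 12, 1)
DATA_END = date(2020, 2, 29)  # assumption: last day of February 2020


def assign_split(trip_date):
    """Return 'train', 'validation', or 'test' for a given trip date."""
    if trip_date < TRAIN_START or trip_date > DATA_END:
        raise ValueError("date outside the prepared range: %s" % trip_date)
    if trip_date < VALIDATION_START:
        return "train"
    if trip_date < TEST_START:
        return "validation"
    return "test"


# Example usage on one date from each partition.
for d in (date(2018, 6, 15), date(2019, 10, 1), date(2020, 1, 20)):
    print(d, assign_split(d))
```

Splitting by date rather than at random keeps the validation and test periods strictly after the training period, which matches how a demand model would be used in practice.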

___
### Modeling

ADD EXPLANATION

First, we need to make sure that all of our modelling techniques can be applied with MLlib to the toy dataset prepared in V1. Since we saved **aggregateTripWeatherCrimeData** as a single Parquet file, we can load the cleaned data immediately after initializing Spark.

In [37]:
# Add Java locations
import os
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_231.jdk/Contents/Home/"
os.environ["JRE_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_231.jdk/Contents/Home/"

# Initialize spark
import findspark
findspark.init("/usr/local/Cellar/apache-spark@2.3.2/2.3.2/libexec/")
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("appName").getOrCreate()

Load the previously cleaned **aggregateTripWeatherCrimeData** data.

In [40]:
v1data = spark.read.parquet('resources/v1_aggregate_trip_weather_crime_data.parquet')
v1data.createOrReplaceTempView('v1data')

Inspect the first rows of the loaded data to make sure it was read correctly.

In [44]:
spark.sql("""
SELECT * FROM v1data
LIMIT 20
""").toPandas()

| | month | day | hour | station_name | lat | long | departure_count | arrival_count | wind_speed_rate | visibility | temperature | precipitation | num_incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 26 | 18 | 10th Ave at E 15th St | 37.792714 | -122.24878 | 0 | 1 | 0.0 | 16093.0 | 10.0 | 0.0 | 0 |
| 1 | 2 | 5 | 17 | 10th Ave at E 15th St | 37.792714 | -122.24878 | 2 | 1 | 0.0 | 16093.0 | 14.4 | 0.0 | 0 |
| 2 | 2 | 11 | 16 | 10th Ave at E 15th St | 37.792714 | -122.24878 | 0 | 1 | 46.0 | 16093.0 | 9.4 | 0.0 | 0 |
| 3 | 1 | 8 | 18 | 10th St at Fallon St | 37.797673 | -122.262997 | 1 | 0 | 15.0 | 3219.0 | 11.1 | 18.0 | 0 |
| 4 | 1 | 17 | 21 | 10th St at Fallon St | 37.797673 | -122.262997 | 0 | 1 | 31.0 | 8047.0 | 13.9 | 0.0 | 0 |
| 5 | 1 | 20 | 0 | 10th St at Fallon St | 37.797673 | -122.262997 | 1 | 0 | 67.0 | 16093.0 | 12.2 | 0.0 | 0 |
| 6 | 1 | 24 | 10 | 10th St at Fallon St | 37.797673 | -122.262997 | 0 | 2 | 0.0 | 16093.0 | 7.8 | 0.0 | 0 |
| 7 | 1 | 29 | 23 | 10th St at Fallon St | 37.797673 | -122.262997 | 1 | 0 | 21.0 | 16093.0 | 13.9 | 0.0 | 0 |
| 8 | 2 | 20 | 14 | 10th St at Fallon St | 37.797673 | -122.262997 | 1 | 3 | 15.0 | 16093.0 | 2.2 | 0.0 | 0 |
| 9 | 1 | 2 | 11 | 11th St at Bryant St | 37.77003 | -122.411726 | 1 | 1 | 0.0 | 16093.0 | 10.0 | 0.0 | 0 |
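
Beyond eyeballing the rows, a few range checks can catch obvious loading problems. The following is a minimal sketch in plain Python, assuming each row is available as a dictionary keyed by the column names shown above. The helper `check_row` and the chosen ranges are illustrative assumptions, not part of the project code.

```python
def check_row(row):
    """Return a list of human-readable problems found in one row."""
    problems = []
    if not 1 <= row["month"] <= 12:
        problems.append("month out of range")
    if not 1 <= row["day"] <= 31:
        problems.append("day out of range")
    if not 0 <= row["hour"] <= 23:
        problems.append("hour out of range")
    if row["departure_count"] < 0 or row["arrival_count"] < 0:
        problems.append("negative trip count")
    if not -90 <= row["lat"] <= 90 or not -180 <= row["long"] <= 180:
        problems.append("coordinates out of range")
    return problems


# Two rows copied from the sample output (only the checked columns).
sample_rows = [
    {"month": 1, "day": 26, "hour": 18, "lat": 37.792714, "long": -122.24878,
     "departure_count": 0, "arrival_count": 1},
    {"month": 1, "day": 8, "hour": 18, "lat": 37.797673, "long": -122.262997,
     "departure_count": 1, "arrival_count": 0},
]

for row in sample_rows:
    print(check_row(row))
```

An empty list means the row passed every check. On the full dataset, the same conditions could be expressed as filters in a Spark SQL query instead.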


Let's start by testing the k-means reduction technique. The other two reduction techniques used in this project do not require any extra work.

In [50]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

Get all distinct stations and their latitude/longitude points. 

In [51]:
station_data = spark.sql("""
SELECT DISTINCT lat, long
FROM v1data
""")

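Transform the station data into the feature-vector format that the KMeans algorithm in MLlib expects, then fit a model with k = 5, evaluate it with the silhouette score, and print the cluster centers.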

In [60]:
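# Wrap each (lat, long) row in a dense vector, the input format KMeans expects
kmeans_data = station_data.rdd.map(lambda r: [Vectors.dense(r)]).toDF(['features'])

# Fit k-means with k = 5 and a fixed seed for reproducibility
station_kmeans = KMeans().setK(5).setSeed(1)
station_model = station_kmeans.fit(kmeans_data)

# Assign each station to a cluster and score the clustering
predictions = station_model.transform(kmeans_data)
silhouette = ClusteringEvaluator().evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the latitude and longitude of each cluster center
centers = station_model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print('Latitude:', center[0], 'Longitude:', center[1])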

Silhouette with squared euclidean distance = 0.7771003013404956
Cluster Centers: 
Latitude: 37.77354269971474 Longitude: -122.41233959625654
Latitude: 37.33065340782083 Longitude: -121.88387651254166
Latitude: 37.8502021372017 Longitude: -122.26919072704749
Latitude: 37.804872849713206 Longitude: -122.26158037281321
Latitude: 37.33871156077894 Longitude: -121.89914932402634
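
To help interpret the silhouette score printed above, the following is a minimal sketch of the textbook silhouette definition with squared Euclidean distance, written in plain Python on a made-up four-point example. The function names and the toy points are assumptions for illustration. Spark's `ClusteringEvaluator` computes the score from per-cluster statistics and may differ in implementation details, so this is not a reproduction of its code.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (textbook definition)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster
        # (the self-distance is 0, so only the divisor excludes the point).
        a = sum(squared_distance(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(squared_distance(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: two tight, well-separated groups of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
print(mean_silhouette(toy_points, toy_labels))
```

Scores lie between -1 and 1. Values near 1 indicate that points sit much closer to their own cluster than to the nearest other cluster, so the value of about 0.78 reported above suggests reasonably well-separated station clusters.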
