### Goal
The goal of this project is to expose you to a real data science problem by working through the end-to-end pipeline.

### This notebook will:

* Access historical bike rental data for 2019 from HealthyRidePGH and summarize the rental data  
* Create graphs to show the popularity of the different rental stations, given filter conditions  
* Create graphs to show the rebalancing issue  
* Cluster the data to group similar stations together, using a variety of clustering functions and visualize the results of the clustering.  

In [96]:
import matplotlib.pyplot as plt
plt.ioff()
import pandas as pd
from sklearn import cluster
from scipy.spatial.distance import euclidean
import warnings
warnings.filterwarnings('ignore')

For the sake of interactive display in Jupyter, we will enable matplotlib inline.

In [97]:
%matplotlib inline


### Access historical bike rental data for 2019 from HealthyRidePGH and summarize the rental data

In [98]:
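# Combine the 2019 quarterly files (Q1-Q3) into one dataframe
q1 = pd.read_csv('HealthyRideRentals2019-Q1.csv')
q2 = pd.read_csv('HealthyRideRentals2019-Q2.csv')
q3 = pd.read_csv('HealthyRideRentals2019-Q3.csv')
# Pass the frames directly instead of building a variable named `list`,
# which would shadow the Python builtin; ignore_index gives a unique row index
df = pd.concat([q1, q2, q3], ignore_index = True)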
month = df.copy()
day = df.copy()
#df.head()

##### Summarizing historical bike rental data for 2019 from HealthyRidePGH.

In [99]:
day['Starttime'] = pd.to_datetime(df['Starttime']).dt.date
day['Stoptime'] = pd.to_datetime(df['Stoptime']).dt.date

fromCNT = day.groupby(['Starttime', 'From station id']).size().reset_index(name = "fromCNT")
toCNT = day.groupby(['Stoptime', 'To station id']).size().reset_index(name = "toCNT")

fromCNT.rename(columns = {'Starttime':'Date'}, inplace = True)
fromCNT.rename(columns = {'From station id':'Station id'}, inplace = True)
toCNT.rename(columns = {'Stoptime':'Date'}, inplace = True)
toCNT.rename(columns = {'To station id':'Station id'}, inplace = True)

daily_breakdown = pd.merge(fromCNT, toCNT, how = "inner", on = ["Date", "Station id"])
daily_breakdown['rebalCNT'] = daily_breakdown['fromCNT'] - daily_breakdown['toCNT']

daily_breakdown['rebalCNT'] = daily_breakdown['rebalCNT'].abs()
#daily_breakdown.head()
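
To make the three counters concrete, here is a minimal sketch on made-up trips. The four rows below are hypothetical; only the column names are taken from the HealthyRide files. It also uses an outer merge with missing counts filled as 0, which is an assumption on my part: the cell above uses `how = "inner"`, which keeps only station-days that have both departures and arrivals.

```python
import pandas as pd

# Hypothetical toy trips; column names mirror the HealthyRide CSVs.
trips = pd.DataFrame({
    "Starttime": ["2019-04-01 08:00", "2019-04-01 09:30",
                  "2019-04-01 17:15", "2019-04-01 18:00"],
    "Stoptime": ["2019-04-01 08:20", "2019-04-01 09:50",
                 "2019-04-01 17:40", "2019-04-01 18:25"],
    "From station id": [1000, 1000, 1001, 1002],
    "To station id": [1001, 1001, 1000, 1000],
})
trips["Date"] = pd.to_datetime(trips["Starttime"]).dt.date
trips["StopDate"] = pd.to_datetime(trips["Stoptime"]).dt.date

# fromCNT: departures per (date, station)
out_cnt = (trips.groupby(["Date", "From station id"]).size()
           .reset_index(name="fromCNT")
           .rename(columns={"From station id": "Station id"}))

# toCNT: arrivals per (date, station)
in_cnt = (trips.groupby(["StopDate", "To station id"]).size()
          .reset_index(name="toCNT")
          .rename(columns={"StopDate": "Date", "To station id": "Station id"}))

# Outer merge keeps station 1002, which has a departure but no arrival;
# an inner merge would drop that row.
toy = pd.merge(out_cnt, in_cnt, how="outer", on=["Date", "Station id"]).fillna(0)

# rebalCNT: absolute difference between departures and arrivals
toy["rebalCNT"] = (toy["fromCNT"] - toy["toCNT"]).abs()
print(toy)
```

Whether one-sided station-days should count toward rebalancing is a modelling choice; the sketch only shows how the two merge types differ.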

In [100]:
month['Starttime'] = pd.to_datetime(df['Starttime']).dt.month
month['Stoptime'] = pd.to_datetime(df['Stoptime']).dt.month

fromCNT_month = month.groupby(['Starttime', 'From station id']).size().reset_index(name = "fromCNT")
toCNT_month = month.groupby(['Stoptime', 'To station id']).size().reset_index(name = "toCNT")

fromCNT_month.rename(columns = {'Starttime':'Month'}, inplace = True)
fromCNT_month.rename(columns = {'From station id':'Station id'}, inplace = True)
toCNT_month.rename(columns = {'Stoptime':'Month'}, inplace = True)
toCNT_month.rename(columns = {'To station id':'Station id'}, inplace = True)

monthly_breakdown = pd.merge(fromCNT_month, toCNT_month, how = "inner", on = ["Month", "Station id"])
monthly_breakdown['rebalCNT'] = monthly_breakdown['fromCNT'] - monthly_breakdown['toCNT']
monthly_breakdown['rebalCNT'] = monthly_breakdown['rebalCNT'].abs()
#monthly_breakdown.head()


---
### Create graphs to show the popularity of the different rental stations, given filter conditions

In [101]:
filter_month = 4
filter_stationID = 1046


In [102]:
month_filter = monthly_breakdown[monthly_breakdown['Month'] == filter_month]
month_filter = month_filter.sort_values(by = ['fromCNT'], ascending = False)
month_filter = month_filter[:25]
plt.ioff()
fig = plt.bar(range(len(month_filter['Station id'])), month_filter['fromCNT'])
plt.xticks(range(len(month_filter['Station id'])), month_filter['Station id'], rotation = 90)
plt.xlabel('Station id')
plt.ylabel('fromCNT')
plt.title("Most popular stations for month " + str (filter_month))
plt.ioff()
#plt.show()


In [103]:
daily_breakdown['Month'] = pd.DatetimeIndex(daily_breakdown['Date']).month
daily_breakdown['Day'] = pd.DatetimeIndex(daily_breakdown['Date']).day

month_filter = daily_breakdown[daily_breakdown['Month'] == filter_month]
sid = month_filter[month_filter['Station id'] == filter_stationID]
sid = sid.sort_values(by = ['fromCNT'], ascending = False)
plt.ioff()
fig = plt.bar(range(len(sid['Day'])), sid['fromCNT'])
plt.xticks(range(len(sid['Day'])), sid['Day'])
plt.xlabel("Days of month " + str(filter_month))
plt.ylabel("fromCNT")
plt.title("Most popular days of month " + str(filter_month) + " for station " + str(filter_stationID))
plt.ioff()
#plt.show()


In [104]:
df['Month'] = pd.DatetimeIndex(df['Starttime']).month
df['Hour'] = pd.DatetimeIndex(df['Starttime']).hour
month_filter = df[df['Month'] == filter_month]

hours = month_filter.groupby(['Hour']).size().reset_index(name = "fromCNT")
hours = hours.sort_values(by = ['fromCNT'], ascending = False)
plt.ioff()
fig = plt.bar(range(len(hours['Hour'])), hours['fromCNT'])
plt.xticks(range(len(hours['Hour'])), hours['Hour'])
plt.xlabel("Hours of the day in month " + str(filter_month))
plt.ylabel("fromCNT")
plt.title("Most popular hours of month " + str(filter_month) + " for all stations" )
plt.ioff()
#plt.show()




In [105]:
totalBikes = month.groupby(['Starttime', 'Bikeid']).size().reset_index(name = "num_rented")
totalBikes.rename(columns = {'Starttime': 'Month'}, inplace = True)
month_filter = totalBikes[totalBikes['Month'] == filter_month]
month_filter = month_filter.sort_values(by = ['num_rented'], ascending = False)
month_filter = month_filter[:25]
plt.ioff()
fig = plt.bar(range(len(month_filter['Bikeid'])), month_filter['num_rented'])
plt.xticks(range(len(month_filter['Bikeid'])), month_filter['Bikeid'], rotation = 90)
plt.xlabel("Bike Id")
plt.ylabel("Number of times bikes were rented")
plt.title("Top 25 most popular bikes rented during month " + str(filter_month))
plt.ioff()
#plt.show()


---
### Create graphs to show the rebalancing issue.

In [106]:
month_filter = monthly_breakdown[monthly_breakdown['Month'] == filter_month]
month_filter = month_filter.sort_values(by = ['rebalCNT'], ascending = False)
month_filter = month_filter[:25]
plt.ioff()
fig = plt.bar(range(len(month_filter['Station id'])), month_filter['rebalCNT'])
plt.xticks(range(len(month_filter['Station id'])), month_filter['Station id'], rotation = 90)
plt.xlabel("Station id ")
plt.ylabel("rebalCNT")
plt.title("Stations with the highest rebalCNT for month " + str(filter_month))
plt.ioff()
#plt.show()


In [107]:
sid = sid.sort_values(by = ['rebalCNT'], ascending = False)
plt.ioff()
fig = plt.bar(range(len(sid['Day'])), sid['rebalCNT'])
plt.xticks(range(len(sid['Day'])), sid['Day'])
plt.xlabel("Days of month " + str(filter_month))
plt.ylabel("rebalCNT")
plt.title("Days with the highest rebalCNT in month " + str(filter_month) + " for station " + str(filter_stationID))
plt.ioff()
#plt.show()


---
### Cluster the data to group similar stations together, using a variety of clustering functions and visualize the results of the clustering.  

In [108]:
# Build one frame per Q3 month. .copy() makes each frame independent of
# monthly_breakdown, so the in-place rename/drop does not act on a slice.
cluster7 = monthly_breakdown[monthly_breakdown['Month'] == 7].copy()
cluster7.rename(columns = {'fromCNT':'fromCNT_7', 'rebalCNT':'rebalCNT_7'}, inplace = True)
cluster7.drop(columns = ['Month', 'toCNT'], inplace = True)

cluster8 = monthly_breakdown[monthly_breakdown['Month'] == 8].copy()
cluster8.rename(columns = {'fromCNT':'fromCNT_8', 'rebalCNT':'rebalCNT_8'}, inplace = True)
cluster8.drop(columns = ['Month', 'toCNT'], inplace = True)

cluster9 = monthly_breakdown[monthly_breakdown['Month'] == 9].copy()
cluster9.rename(columns = {'fromCNT':'fromCNT_9', 'rebalCNT':'rebalCNT_9'}, inplace = True)
cluster9.drop(columns = ['Month', 'toCNT'], inplace = True)

q3_cluster = cluster7.merge(cluster8,on='Station id').merge(cluster9,on='Station id')
#q3_cluster.head()

In [109]:
#K-Means clustering 

def generate_kmeans_cluster(n_clusters, cluster_name):
    k_means = cluster.KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 5000)
    k_means_cluster = k_means.fit(q3_cluster[[ 'fromCNT_7', 'rebalCNT_7', 'fromCNT_8', 'rebalCNT_8', 'fromCNT_9', 'rebalCNT_9']])
    labels = k_means_cluster.labels_
    q3_cluster[cluster_name] = labels
    
generate_kmeans_cluster(2, 'ClusterID_one')
generate_kmeans_cluster(3, 'ClusterID_two')
generate_kmeans_cluster(4, 'ClusterID_three')

In [110]:
#DBScan clustering 

def dbscan(eps, samples, cluster_name):

    dbscan = cluster.DBSCAN(eps=eps, min_samples= samples, metric = 'euclidean', algorithm = 'auto', leaf_size = 30, p= None, n_jobs = 1)
    dbscan_one = dbscan.fit(q3_cluster[[ 'fromCNT_7', 'rebalCNT_7', 'fromCNT_8', 'rebalCNT_8', 'fromCNT_9', 'rebalCNT_9']])
    labels = dbscan_one.labels_
    q3_cluster[cluster_name] = labels
    
# Vary min_samples (2, 3, 4) so each run matches the title of its graph
dbscan(20,2,'db_clusterID_one')
dbscan(20,3,'db_clusterID_two')
dbscan(20,4,'db_clusterID_three')
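
As a small illustration of how `min_samples` changes the number of noise points, the sketch below runs DBSCAN on a hand-built point set. The points and the `eps = 1.5` value are hypothetical and chosen only to make the effect visible; they are not the station data. In scikit-learn, noise points receive the label -1, and `min_samples` counts the point itself.

```python
import numpy as np
from sklearn import cluster

# Hypothetical stand-in data: a dense 5x5 grid plus a few small groups.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
lone = np.array([[50.0, 50.0]])                                   # 1 isolated point
pair = np.array([[-60.0, 40.0], [-60.5, 40.0]])                   # 2 close points
triple = np.array([[30.0, -30.0], [30.5, -30.0], [30.0, -30.5]])  # 3 close points
X = np.vstack([grid, lone, pair, triple])

noise_counts = {}
for min_samples in (2, 3, 4):
    labels = cluster.DBSCAN(eps=1.5, min_samples=min_samples).fit(X).labels_
    # Label -1 marks points that DBSCAN leaves outside every cluster
    noise_counts[min_samples] = int((labels == -1).sum())

print(noise_counts)
```

Small groups stop forming clusters once they hold fewer than `min_samples` points, so the noise count grows as `min_samples` increases. The same count can be read from the notebook's own label columns with `(q3_cluster['db_clusterID_one'] == -1).sum()`.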

In [111]:
def create_graph(cluster_name, title):
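    # Count stations per cluster label. Reading labels from the Series index
    # avoids relying on the 'index' column name, which pandas 2.0+ no longer produces.
    counts = q3_cluster[cluster_name].value_counts()
    plt.ioff()
    plt.figure()  # new figure per call so the three charts do not overlap
    fig = plt.bar(range(len(counts)), counts.values)
    plt.xticks(range(len(counts)), counts.index)
    plt.xlabel("Cluster ID")
    plt.ylabel("Number of Stations")
    plt.title(title)
    #plt.show()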

create_graph('ClusterID_one','K-Means with two clusters')
create_graph('ClusterID_two','K-Means with three clusters')
create_graph('ClusterID_three','K-Means with four clusters')

In [112]:
def create_db_graph(cluster_name, title):
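    # Count stations per DBSCAN label (-1 is noise). Reading labels from the Series
    # index avoids relying on the 'index' column name, which pandas 2.0+ no longer produces.
    counts = q3_cluster[cluster_name].value_counts()
    plt.ioff()
    plt.figure()  # new figure per call so the three charts do not overlap
    fig = plt.bar(range(len(counts)), counts.values)
    plt.xticks(range(len(counts)), counts.index)
    plt.xlabel("Cluster ID")
    plt.ylabel("Number of Stations")
    plt.title(title)
    #plt.show()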
create_db_graph('db_clusterID_one', 'DBScan (eps = 20, min_sample = 2)')
create_db_graph('db_clusterID_two', 'DBScan (eps = 20, min_sample = 3)')
create_db_graph('db_clusterID_three', 'DBScan (eps = 20, min_sample = 4)')


### Conclusion

For the k-means clustering, I ran the algorithm with 2, 3, and 4 clusters to see how increasing the number of clusters affects the quality of the grouping. Of the three, I believe three clusters is the best K value. With two clusters, some data points that should be part of their own cluster are forced into one of the two groups even if they do not match it, while four clusters divides the data too finely. As for whether K-Means or DBSCAN is better, I would argue that K-Means is better here because all three DBSCAN runs contain noise points. DBSCAN does not assign these points to any cluster (it labels them -1), so those stations are left out of the grouping, which can lead to inaccurate results.
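
One way to put a number behind the choice of K is the silhouette score, which is not used in the notebook above and is offered here only as a possible check. The sketch below uses synthetic data with three well-separated groups (an assumption for illustration, not the station data) and compares K = 2, 3, and 4.

```python
import numpy as np
from sklearn import cluster
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data: three well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
X = np.vstack([c + rng.normal(scale=2.0, size=(30, 2)) for c in centers])

scores = {}
for k in (2, 3, 4):
    model = cluster.KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=5000)
    labels = model.fit_predict(X)
    # Silhouette is in [-1, 1]; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

To try this on the real data, `X` could be replaced with the six feature columns of `q3_cluster`. The result there may differ from the judgment above, since the score depends on how the stations are actually distributed.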
