# Part II - Ford GoBike System Data (Exploratory Data Analysis)
## by Ruben Dekker

## Investigation Overview

My analysis of the GoBike usage data revealed a strong relationship between age and usage, with young professionals being the primary target audience. The day of the week and the hour of the day also had a significant impact on usage patterns. The start and end stations are highly similar, although they do not correlate perfectly. The variance in trip duration was high, although the interquartile range was relatively small. Based on these findings, I recommend that Ford, as the operator of the GoBike service, consider partnering with businesses and offering a package tailored towards young professionals in order to expand its GoBike business. These findings are based on a single month of data, so further research may be needed to validate them.

## Dataset Overview

The selected dataset (Ford GoBike System dataset from 2019) contains 183,412 observations and 16 columns that describe bike rentals from the Ford GoBike bike sharing system. The dataset includes information such as the start and end station, start and end time, duration of the ride, and user information (e.g., whether the user is a subscriber or a customer, and their birth year). It also contains geospatial information such as the latitude and longitude of the start and end station. This dataset covers GoBike usage over the course of February 2019 in the Greater San Francisco Bay Area. The dataset was used to analyze bike rental patterns and user behavior, and to formulate advice for bike sharing companies on how to improve the bike sharing system and maximize profit.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [None]:
# load in the dataset into a pandas dataframe
df_bike = pd.read_csv("Ford GoBike data.csv")

In [None]:
df_bike.head(3)

> Note that the above cells have been set as "Skip"-type slides. That means
that when the notebook is rendered as HTML slides, those cells won't show up.

## Visualization 1

* <b>Question</b>: How are trips distributed between days of the week?
* <b>Visualization</b>: Seaborn countplot
* <b>Observations</b>: Trip counts vary only slightly among the weekdays, but there is a significant difference between weekdays and the weekend.

In [None]:
# Convert start_time and end_time columns from object to datetime64[ns]
df_bike[['start_time','end_time']] = df_bike[['start_time','end_time']].apply(pd.to_datetime)

# Create an additional variable called 'day_of_week' (Monday=0 ... Sunday=6) by extracting this information from the start_time column
def filter_day_of_week(df_bike):
    df_bike['day_of_week'] = df_bike['start_time'].dt.weekday
    return df_bike

df_bike = filter_day_of_week(df_bike)

sns.set_style('whitegrid')
ax = sns.countplot(y='day_of_week', data=df_bike, palette='twilight', lw=1, ec='black')
ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.xlabel("Number of trips taken")
plt.ylabel("Day of the Week")
plt.title("Trip count distribution by day")
plt.show()
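
A raw count per day of the week is only a fair comparison when every day occurs equally often in the period covered. February 2019 has 28 days, so each day of the week occurs exactly four times, but for other months it can help to average per calendar date first. The block below is a minimal sketch of that idea; the helper name `mean_trips_per_day_type` and the toy timestamps are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def mean_trips_per_day_type(start_times):
    """Average number of trips per calendar date, split into weekday and weekend."""
    start_times = pd.to_datetime(pd.Series(start_times))
    # Count trips per calendar date (normalize() strips the time of day).
    trips_per_date = start_times.dt.normalize().value_counts()
    # weekday is Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    day_type = np.where(trips_per_date.index.weekday >= 5, "weekend", "weekday")
    return trips_per_date.groupby(day_type).mean()

# Hypothetical toy timestamps, not taken from the GoBike data.
toy_start_times = [
    "2019-02-04 08:05", "2019-02-04 08:40", "2019-02-04 17:10",  # Monday
    "2019-02-05 08:15", "2019-02-05 12:30", "2019-02-05 17:20",  # Tuesday
    "2019-02-09 13:00",                                          # Saturday
]
result = mean_trips_per_day_type(toy_start_times)
print(result)
```

On the real data, the same helper could be called as `mean_trips_per_day_type(df_bike['start_time'])`.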

## Visualization 2

* <b>Question</b>: How are trips distributed between hours of the day?
* <b>Visualization</b>: Seaborn countplot
* <b>Observations</b>: There are major peaks at 8:00 and 17:00, the start and end of conventional office hours.

In [None]:
# Create an additional variable called 'hour_of_day' (0-23) by extracting this information from the start_time column
def filter_hour_of_day(df_bike):
    df_bike['hour_of_day'] = df_bike['start_time'].dt.hour
    return df_bike

df_bike = filter_hour_of_day(df_bike)

sns.set_style('whitegrid')
ax = sns.countplot(x='hour_of_day', data=df_bike, palette='Pastel1', lw=1, ec='black')
plt.xlabel("Hour of the day at which trip was started")
plt.ylabel("Number of trips taken")
plt.title("Trip count distribution by hour")
plt.show()
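
The peaks read off the plot can also be checked numerically. The following is a small sketch, assuming only that start times can be parsed by pandas; the helper `busiest_hours` and the toy timestamps are illustrative and do not come from the notebook.

```python
import pandas as pd

def busiest_hours(start_times, n=2):
    """Return the n hours of the day with the most trip starts, in ascending order."""
    hours = pd.to_datetime(pd.Series(start_times)).dt.hour
    # value_counts() tallies trips per hour; nlargest() keeps the n busiest.
    top = hours.value_counts().nlargest(n)
    return sorted(int(h) for h in top.index)

# Hypothetical toy timestamps, not taken from the GoBike data.
toy_start_times = [
    "2019-02-04 08:05", "2019-02-04 08:40", "2019-02-05 08:15",  # three starts at 8
    "2019-02-04 17:10", "2019-02-05 17:20",                      # two starts at 17
    "2019-02-05 12:30",                                          # one start at 12
]
peak_hours = busiest_hours(toy_start_times)
print(peak_hours)
```

With the column created above, `df_bike['hour_of_day'].value_counts().nlargest(2)` would give the same kind of summary directly.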

## Visualization 3
* <b>Question</b>: How does GoBike usage per day of the week differ between birth year groups?
* <b>Visualization</b>: Seaborn heatmap
* <b>Observations</b>: GoBike usage among the youngest and oldest generations is negligible compared to Millennials.

In [None]:
# create bins for member_birth_year
df_bike['birth_year_bin'] = pd.cut(df_bike['member_birth_year'], bins=[1900, 1920, 1940, 1960, 1980, 2000, 2020], labels=['1900-1920', '1920-1940', '1940-1960', '1960-1980', '1980-2000','2000-2020'])

# create a pivot table to organize the data for the heatmap
pivot_table = df_bike.pivot_table(values=df_bike.columns[0], index='day_of_week', columns='birth_year_bin', aggfunc='size')

# create the heatmap
ax = sns.heatmap(pivot_table, cmap='YlGnBu', annot=True)
ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'], rotation=0)
plt.title("Heatmap of GoBike usage per day by age")
plt.xlabel("Generation")
plt.ylabel("Day of week")
plt.show()
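
Because the heatmap shows raw counts, the largest birth year bin dominates the color scale. One possible complement, sketched here on hypothetical toy rows rather than the GoBike data, is to convert each column to shares with `pd.crosstab(..., normalize="columns")`, so that the weekly pattern of each group can be compared regardless of group size.

```python
import pandas as pd

# Hypothetical toy rows: day_of_week uses Monday=0 ... Sunday=6.
toy = pd.DataFrame({
    "day_of_week": [0, 0, 5, 0, 5, 5],
    "birth_year_bin": ["1960-1980", "1960-1980", "1960-1980",
                       "1980-2000", "1980-2000", "1980-2000"],
})

# Each column sums to 1: the share of a group's trips that falls on each day.
shares = pd.crosstab(toy["day_of_week"], toy["birth_year_bin"], normalize="columns")
print(shares)
```

The resulting table can be passed to `sns.heatmap` in the same way as the count-based pivot table.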

## Visualization 4
* <b>Question</b>: What is the interquartile range (IQR) of trip duration per day of the week?
* <b>Visualization</b>: Seaborn boxplot
* <b>Observations</b>: Despite the high variance in trip duration, the IQR is nearly identical across the days of the week.

In [None]:
# keep only trips shorter than 2000 seconds so that extreme outliers do not compress the boxes
df_bike_duration_limited = df_bike[df_bike["duration_sec"] < 2000]

ax = sns.boxplot(x='day_of_week', y='duration_sec', data=df_bike_duration_limited, color='lightgrey')
ax.set_xticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title("Box plot of the duration of a trip per day of the week")
plt.xlabel("Day of the week")
plt.ylabel("Duration of trips (sec)")
plt.show()
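
A high variance and a small IQR can coexist because the variance is sensitive to a few very long trips, while the IQR only describes the middle half of the data. The sketch below illustrates this on made-up durations (not GoBike values); the helper `iqr` is defined here for illustration only.

```python
import numpy as np

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

# Hypothetical durations in seconds: seven ordinary trips and one extreme outlier.
toy_durations = np.array([300, 400, 500, 600, 700, 800, 900, 80000])

toy_iqr = iqr(toy_durations)
toy_std = toy_durations.std()

print(f"IQR: {toy_iqr:.0f} s, standard deviation: {toy_std:.0f} s")
```

A single outlier inflates the standard deviation far beyond the IQR, which is also why the plot above restricts `duration_sec` to values below 2000.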

## Visualization 5
* <b>Question</b>: When do different generations use GoBike?
* <b>Visualization</b>: Seaborn boxplot
* <b>Observations</b>: Usage by the oldest generation peaks on Monday and Tuesday. Those of working age show a consistent pattern across the weekdays and across the weekend, but the difference between weekdays and the weekend is significant. Usage by the youngest generation peaks on Tuesday afternoon and evening. The median is on average around 13:00, which makes sense considering that the peaks occurred at 8:00 and 17:00.
* <b>N.B.</b> Birth year bin 1900-1920 (see Tuesday) represents 3 observations for which the birth year was exactly 1920.

In [None]:
sns.set_palette(sns.color_palette("viridis_r", n_colors=8))
plt.figure(figsize=(12,8))
ax = sns.boxplot(data=df_bike, x='day_of_week', y='hour_of_day', hue='birth_year_bin')
ax.set_xticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'], rotation=0)
plt.xlabel("Day of the week")
plt.ylabel("Hour of the day")
plt.title("GoBike usage per hour of the day by age group (binned per 20 years)")
plt.yticks(np.arange(0, 25, 1))
plt.show()
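
The note about birth year 1920 follows from how `pd.cut` treats bin edges: by default the intervals are closed on the right, so a value equal to an edge falls into the bin that ends there, and the lowest edge itself is excluded. The sketch below reuses the bins and labels from Visualization 3 on a few hypothetical birth years to show this behavior.

```python
import pandas as pd

bins = [1900, 1920, 1940, 1960, 1980, 2000, 2020]
labels = ['1900-1920', '1920-1940', '1940-1960', '1960-1980', '1980-2000', '2000-2020']

# Hypothetical birth years chosen to sit on or next to bin edges.
toy_years = pd.Series([1920, 1921, 1980, 1981, 2000, 2001])
binned = pd.cut(toy_years, bins=bins, labels=labels)
print(binned.astype(str).tolist())

# The lowest edge is not included unless include_lowest=True is passed.
lowest = pd.cut(pd.Series([1900]), bins=bins, labels=labels)
print(lowest.isna().all())
```

If bins such as 1920-1939 are preferred, passing `right=False` to `pd.cut` makes the intervals closed on the left instead.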

### Generate Slideshow
Once you're ready to generate your slideshow, use the `jupyter nbconvert` command to generate the HTML slide show.  

In [None]:
# Use this command if you are running this file locally
!jupyter nbconvert Part_II_slide_deck.ipynb --to slides --post serve --no-input --no-prompt

> In the classroom workspace, the generated HTML slideshow will be placed in the home folder. 

> On a local machine, the command above should open a tab in your web browser where you can scroll through your presentation. Sub-slides can be accessed by pressing 'down' when viewing their parent slide. Make sure you remove all of the quote-formatted guide notes like this one before you finish your presentation! Finally, you can stop the kernel.

### Submission
If you are using the classroom workspace, you can choose from the following two ways of submission:

1. **Submit from the workspace**. Make sure you have removed the example project from the /home/workspace directory. You must submit the following files:
   - Part_I_notebook.ipynb
   - Part_I_notebook.html or pdf
   - Part_II_notebook.ipynb
   - Part_I_slides.html
   - README.md
   - dataset (optional)


2. **Submit a zip file on the last page of this project lesson**. In this case, open the Jupyter terminal and run the command below to generate a ZIP file. 
```bash
zip -r my_project.zip .
```
The command above will ZIP every file present in your /home/workspace directory. Next, you can download the zip file to your local machine and follow the instructions on the last page of this project lesson.
