# Part I - Ford GoBike System Data (Exploratory Data Analysis)
## by Ruben Dekker
## Introduction
This analysis uses the 2019 Ford GoBike System dataset, which contains information on bike rentals from the Ford GoBike bike-sharing system. It includes the start and end station, start and end time, duration of each ride, and user information (e.g., whether the user is a subscriber or a customer, and their birth year). It also contains geospatial information: the latitude and longitude of the start and end stations. The exact period covered has not been verified here; it is likely a portion of 2019, possibly the full year. The dataset can be used to analyze bike rental patterns and user behavior, and to help city planners and bike-sharing companies improve the system.


# Structure and focus of analysis

### What is the structure of your dataset?

> As can be observed from the output in the *Assessment* chapter:
> * This dataset contains 183,412 rows and 16 columns.
> * All variable names are clear and formatting is consistent.
> * Longitude and latitude have 6 digits after the decimal point, which are necessary for accuracy.
> * The start_time and end_time columns are stored as *object*; they will be converted to datetime64[ns] so that time-based analyses can be performed on them.

### What is/are the main feature(s) of interest in your dataset?

> * Distribution between days of the week/hour of the day, duration, and various demographic factors.
> * Correlation between most frequent start and end station.
> * Customer:Subscriber ratio in various scenarios.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The (converted) datetime columns will be the main focus of this analysis. Only *start_time* is used in this analysis, but the code can easily be converted by replacing *start* with *end* in the code.
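
The swap from *start* to *end* mentioned above can be made explicit with a small helper. The following is a minimal sketch, not the code used in this notebook: the function name `add_time_features`, the `prefix` parameter, the output column names, and the two sample trips are assumptions made for illustration.

```python
import pandas as pd

def add_time_features(df, prefix="start"):
    """Return a copy of df with weekday and hour columns derived from '<prefix>_time'."""
    out = df.copy()
    timestamps = pd.to_datetime(out[f"{prefix}_time"])
    out[f"{prefix}_day_of_week"] = timestamps.dt.weekday  # Monday=0 ... Sunday=6
    out[f"{prefix}_hour_of_day"] = timestamps.dt.hour
    return out

# Two made-up trips, for illustration only (not rows from the dataset)
trips = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10", "2019-03-02 08:05:00"],
    "end_time": ["2019-02-28 17:58:40", "2019-03-02 08:21:30"],
})

by_start = add_time_features(trips, prefix="start")
by_end = add_time_features(trips, prefix="end")
print(by_start[["start_day_of_week", "start_hour_of_day"]])
print(by_end[["end_day_of_week", "end_hour_of_day"]])
```

Because the helper returns a copy, it can be re-run any number of times without altering the input dataframe.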

# Preliminary Wrangling

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

%matplotlib inline

pd.set_option('display.max_columns', 500)
pd.set_option('max_colwidth', 800)

In [None]:
df = pd.read_csv("Ford GoBike data.csv")
print("Dataframe successfully created")

In [None]:
# Work on a copy so the original dataframe stays intact and the cells below can be re-run as often as needed
df_bike = df.copy()

# Assessment

In [None]:
# The number of observations and variables
df_bike.shape

In [None]:
# The number of duplicated rows
df_bike.duplicated().sum()

In [None]:
# Names of columns
df_bike.columns

In [None]:
# Datatype of variables and non-null count
df_bike.info()

In [None]:
# First 10 observations in dataset by index
df_bike.head(10)

In [None]:
# Final 10 observations in dataset by index
df_bike.tail(10)

In [None]:
# Random sample from dataset
df_bike.sample()

In [None]:
# Amount of null values per variable
df_bike.isnull().sum().sort_values(ascending=False)

In [None]:
# Descriptive statistics
df_bike.describe()

In [None]:
# Visualization of the birth year distribution
sns.boxplot(x=df_bike['member_birth_year']);

In [None]:
# Percentage of people born before 1920
mask = df_bike['member_birth_year'] < 1920
pre_count = len(df_bike[mask])
pre_proportion = (pre_count / len(df_bike)) * 100
print("The percentage of people born before 1920 is {:.2f}%".format(pre_proportion))

# Cleaning

## 2.1 Define

### 2.1.1 Tidiness issues
* None found (given that the extraction from the start_time column for the sake of analysis is not considered a quality issue).

### 2.1.2 Quality issues
* *start_time* and *end_time* are stored as object; convert them to datetime64[ns].

## 2.2 Code

In [None]:
# Convert start_time and end_time columns from object to datetime64[ns]
df_bike[['start_time','end_time']] = df_bike[['start_time','end_time']].apply(pd.to_datetime)
print("Conversion successful")

## 2.3 Test

In [None]:
# Check Dtype of start_time and end_time
df_bike.info()

# Univariate Exploration

## Visualization 1
* <b>Question</b>: How are trips distributed between days of the week?
* <b>Visualization</b>: Seaborn countplot
* <b>Observations</b>: There is slight variance between the different weekdays and a significant difference between weekend and weekdays.

In [None]:
# Create additional variable called 'day_of_week' by extracting the weekday (Monday=0 ... Sunday=6) from the start_time column
def filter_day_of_week(df_bike):
    df_bike['day_of_week'] = df_bike['start_time'].dt.weekday
    return df_bike

df_bike = filter_day_of_week(df_bike)

print("day_of_week variable successfully created")

In [None]:
sns.set_style('whitegrid')
ax = sns.countplot(y='day_of_week', data=df_bike, palette='twilight', lw=1, ec='black')
ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.xlabel("Number of trips taken")
plt.ylabel("Day of the Week")
plt.title("Trip count distribution by day")
plt.show()

## Visualization 2
* <b>Question</b>: How are trips distributed between hours of the day?
* <b>Visualization</b>: Seaborn countplot
* <b>Observations</b>: There are major peaks at 8:00 and 17:00, conventional office hours.

In [None]:
# Create additional variable called 'hour_of_day' by extracting the hour (0-23) from the start_time column
def filter_hour_of_day(df_bike):
    df_bike['hour_of_day'] = df_bike['start_time'].dt.hour
    return df_bike

df_bike = filter_hour_of_day(df_bike)

print("hour_of_day variable successfully created")

In [None]:
sns.set_style('whitegrid')
ax = sns.countplot(x='hour_of_day', data=df_bike, palette='Pastel1', lw=1, ec='black')
plt.xlabel("Hour of the day at which trip was started")
plt.ylabel("Number of trips taken")
plt.title("Trip count distribution by hour")
plt.show()

## Visualization 3
* <b>Question</b>: What is the age distribution of GoBike users?
* <b>Visualization</b>: Kernel Density Estimation (KDE) plot
* <b>Observations</b>: The target audience is people in their late twenties and early thirties, presumably young professionals.

In [None]:
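# Removes all observations where member_birth_year is missing or before 1920 (see output to confirm successful removal)
def drop_over_100(df):
    # Operate on the dataframe passed in, not on the global df_bike
    df = df.dropna(subset=['member_birth_year'])
    # .copy() so columns can be added later without a SettingWithCopyWarning
    return df[df['member_birth_year'] >= 1920].copy()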

df_bike = drop_over_100(df_bike)

df_bike['member_birth_year'].describe()

In [None]:
sns.kdeplot(data=df_bike['member_birth_year'], fill=True)
plt.xlabel('Birth Year')
plt.ylabel('Density')
plt.title('Kernel Density Estimation of birth year')
plt.xticks(np.arange(min(df_bike['member_birth_year']), max(df_bike['member_birth_year']), 10))
plt.show()

## Visualization 4
* <b>Question</b>: What is the subscriber:customer ratio?
* <b>Visualization</b>: Matplotlib pie chart
* <b>Observations</b>: The majority of GoBike users are subscribers rather than pay-per-use.

In [None]:
df_bike['user_type'].value_counts().plot.pie(autopct='%.1f%%')
plt.title("User Type Distribution")
plt.ylabel('')
plt.show()

## Visualization 5
* <b>Question</b>: What is the gender distribution of GoBike users?
* <b>Visualization</b>: Matplotlib bar chart
* <b>Observations</b>: GoBike users are predominantly male.

In [None]:
df_bike['member_gender'].value_counts().plot.bar()
plt.xlabel('')
plt.ylabel('Occurrences')
plt.title('Gender Distribution')
plt.show()

## Visualization 6
* <b>Question</b>: What are the 10 most popular start stations?
* <b>Visualization</b>: Seaborn countplot
* <b>Observations</b>: Frequency is well-distributed, no outliers.

In [None]:
# Isolate the 10 most popular start stations
top_10_start_station_name = df_bike['start_station_name'].value_counts().head(10)

sns.countplot(y='start_station_name', data=df_bike, order=top_10_start_station_name.index)
plt.xlabel("Frequency")
plt.ylabel("Start station name")
plt.title("Frequency of the 10 most popular start stations")
plt.show()

## Visualization 7
* <b>Question</b>: What are the 10 most popular end stations?
* <b>Visualization</b>: Seaborn countplot
* <b>Observations</b>: Frequency is well-distributed, no outliers.

In [None]:
# Isolate 10 most popular end stations
top_10_end_station_name = df_bike['end_station_name'].value_counts().head(10)

sns.countplot(y='end_station_name', data=df_bike, order=top_10_end_station_name.index)
plt.xlabel("Frequency")
plt.ylabel("End station name")
plt.title("Frequency of the 10 most popular end stations")
plt.show()

## Review of univariate exploration

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> I looked at my variable of interest from two perspectives: how is usage distributed per day and per hour? I decided to use only the *start_time* variable to answer this question, because outliers caused by programmatic bugs seemed less likely at the start of a trip than at the end. Most trips take place during the week and the fewest during the weekend. There is some variation among the weekdays themselves, but the separation between weekdays and weekend is especially salient. This suggests that the target audience might be either students or (young) professionals. The distribution per hour shows clear peaks around 8:00 and 17:00: in other words, people leaving home to go to work and people going home from work.
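
The weekday/weekend split and the hourly peaks described above can also be expressed as counts rather than read from the plots. The sketch below shows one way to do that; the six start times are made up for illustration and the variable names are assumptions, not part of this notebook.

```python
import pandas as pd

# Made-up start times, for illustration only (not rows from the dataset)
starts = pd.Series(pd.to_datetime([
    "2019-02-25 08:10", "2019-02-25 17:20", "2019-02-26 08:05",
    "2019-02-27 17:45", "2019-03-02 11:30", "2019-03-03 14:00",
]))

# Monday=0 ... Sunday=6, so weekday values of 5 and 6 mark the weekend
day_type = (starts.dt.weekday >= 5).map({False: "weekday", True: "weekend"})
trips_by_day_type = day_type.value_counts()

# Number of trips started in each hour of the day
trips_by_hour = starts.dt.hour.value_counts().sort_index()

print(trips_by_day_type)
print(trips_by_hour)
```

Since a week has five weekdays and two weekend days, dividing each count by the number of such days in the period gives a fairer per-day comparison.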

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> The fact that Thursday is more popular than other days of the week is noticeable; explaining it would require further research and additional data. The distributions by hour and by user type support the initial hypothesis that young professionals are the target audience of this system. Changing the datatype of the *start_time* and *end_time* columns was necessary to extract the day and hour from them. I decided to remove all observations that reported a birth year before 1920: the likelihood of someone over 100 being physically capable of cycling, although not impossible, is low. I also looked into whether the fractional digits in the longitude and latitude data are needed; they are, so I left those columns alone. Other than that, no data seemed to require transformation, at least not for the purpose of this analysis.
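
As a rough check on why the fractional digits in the coordinates matter, the sketch below estimates how much ground distance one unit in the n-th decimal place represents. It assumes a spherical Earth with a mean radius of 6,371 km and uses 37.77 degrees as an approximate latitude for San Francisco; both values, and the function name `metres_per_decimal_place`, are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def metres_per_decimal_place(decimals, latitude_deg=0.0):
    """Ground distance covered by one unit in the given decimal place of a coordinate."""
    step_rad = math.radians(10 ** -decimals)
    north_south = step_rad * EARTH_RADIUS_M  # one step in latitude
    # Meridians converge towards the poles, so a longitude step shrinks with cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ASSUMED_LATITUDE = 37.77  # approximate latitude of San Francisco (assumption)
for decimals in (2, 4, 6):
    ns, ew = metres_per_decimal_place(decimals, ASSUMED_LATITUDE)
    print(f"{decimals} decimals: ~{ns:.2f} m north-south, ~{ew:.2f} m east-west")
```

Under these assumptions, six decimals resolve a position to roughly a tenth of a metre, whereas two decimals would only resolve it to about a kilometre, which is too coarse to tell nearby stations apart.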

# Bivariate Exploration

## Visualization 8
* <b>Question</b>: Are there significant correlations in this dataset?
* <b>Visualization</b>: Seaborn heatmap
* <b>Observations</b>: No significant correlation except for longitude/latitude/id, which all refer to the same concept.

In [None]:
plt.subplots(figsize=(10, 8))
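# numeric_only=True restricts the correlation matrix to numeric columns (required in recent pandas versions)
sns.heatmap(df_bike.corr(numeric_only=True), cmap='coolwarm', annot=True)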
plt.show()

## Visualization 9
* <b>Question</b>: Is there a correlation between birth year and trip duration?
* <b>Visualization</b>: Seaborn scatterplot
* <b>Observations</b>: Younger users occasionally travel for longer periods of time, but most trips are consistently short.

In [None]:
# Scatterplot showing the correlation between birth year and length of trip
sns.scatterplot(x='member_birth_year', y='duration_sec', data=df_bike, alpha=0.25, s=20)
plt.title("Scatter plot on the correlation between birth year and duration of trip in seconds")
plt.xlabel("Birth year")
plt.ylabel("Duration of trip in seconds")
plt.show()

## Visualization 10
* <b>Question</b>: How is usage per day of the week distributed across birth-year cohorts?
* <b>Visualization</b>: Seaborn heatmap
* <b>Observations</b>: GoBike usage among the youngest and oldest generations is negligible compared to Millennials.

In [None]:
# create bins for member_birth_year
df_bike['birth_year_bin'] = pd.cut(df_bike['member_birth_year'], bins=[1900, 1920, 1940, 1960, 1980, 2000, 2020], labels=['1900-1920', '1920-1940', '1940-1960', '1960-1980', '1980-2000','2000-2020'])

# create a pivot table to organize the data for the heatmap
pivot_table = df_bike.pivot_table(values=df_bike.columns[0], index='day_of_week', columns='birth_year_bin', aggfunc='size')

# create the heatmap
plt.subplots(figsize=(10, 8))
ax = sns.heatmap(pivot_table, cmap='YlGnBu', annot=True)
ax.set_yticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'], rotation=0)
plt.title("Heatmap of GoBike usage per day by age")
plt.xlabel("Generation")
plt.ylabel("Day of week")
plt.show()

## Visualization 11
* <b>Question</b>: What is the interquartile range (IQR) of trip duration per day of the week?
* <b>Visualization</b>: Seaborn boxplot
* <b>Observations</b>: Despite the variance in the preceding scatterplot, the IQR for trip duration is nearly identical when plotted against days of the week.

In [None]:
# filter data
df_bike_duration_limited = df_bike[df_bike["duration_sec"] < 2000]

ax = sns.boxplot(x='day_of_week', y='duration_sec', data=df_bike_duration_limited, color='lightgrey')
ax.set_xticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title("Box plot of the duration of a trip per day of the week")
plt.xlabel("Day of the week")
plt.ylabel("Duration of trips (sec)")
plt.show()

## Visualization 12
* <b>Question</b>: Is there a correlation between gender and trip duration?
* <b>Visualization</b>: Seaborn violin plot
* <b>Observations</b>: Women take slightly longer trips, but the difference is minimal.

In [None]:
# Plot 12 filter data
df_bike_duration_limited = df_bike[df_bike["duration_sec"] < 2000]

sns.set_palette("pastel")
sns.violinplot(x='member_gender', y='duration_sec', data=df_bike_duration_limited, color='#20B2AA')
plt.title("Violin plot of duration_sec vs member_gender")
plt.xlabel("Gender")
plt.ylabel("Duration (sec)")
plt.show()

## Visualization 13
* <b>Question</b>: Is there a correlation in the popularity between start and end stations?
* <b>Visualization</b>: Matplotlib stacked bar chart
* <b>Observations</b>: 9/10 of the most popular start and end stations are shared.

In [None]:
top_10_start_station = df_bike['start_station_name'].value_counts().head(10)
top_10_end_station = df_bike['end_station_name'].value_counts().head(10)

# Combine the top 10 start and end station into one dataframe
top_10_stations = pd.concat([top_10_start_station, top_10_end_station], axis=1)
top_10_stations.columns = ['start_station', 'end_station']

# Create the stacked bar chart
top_10_stations.plot(kind='bar', stacked=True)
plt.xlabel("Station Name")
plt.ylabel("Frequency")
plt.title("Frequency of the top 10 start and end stations")
plt.show()

## Review of bivariate exploration

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> As can be derived from the correlation heatmap, there is little correlation between variables, except among the longitude, latitude, and station id variables, which is expected because they all describe station location. A heatmap of trip counts by birth-year cohort (bins of 20 years, roughly one generation) and day of the week shows how skewed the data is: the 1980-2000 group takes between 11,000 and 25,000 trips on a given day of the week, whereas the earliest (1920-1940) and latest (2000-2020) cohorts account for barely over 100 and fewer than 50 trips over the entire week, respectively.

> In the scatterplot, the duration data appears strongly peaked with a long tail: most trips cluster around a central value, but the range is wide. To determine the IQR, I drew a boxplot (limited to trips shorter than 2,000 seconds) and read the range from it empirically. The first conclusion that can be drawn from this boxplot is that the middle 50% of trips taken during the week (the IQR) last between roughly 300 and 750 seconds, or 5 to 12.5 minutes, and that the median trip lasts about 500 seconds, or 8 minutes and 20 seconds. Additionally, the wider IQR for Saturday and Sunday means that fewer long trips are flagged as outliers on those days. Finally, I compared the ten most popular start and end stations. By creating a stacked bar chart, I hoped to quickly see whether the two lists overlap: as it turns out, 9 of the 10 most popular stations are shared.
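
Reading quartiles off a boxplot is approximate; they can also be computed directly. The sketch below does this per day of the week with `groupby` and `quantile`. The durations are made-up values, not taken from the dataset, and the column names `q1`, `median`, `q3`, and `iqr` are chosen for illustration.

```python
import pandas as pd

# Made-up durations (seconds) for a Monday (0) and a Saturday (5); illustration only
toy = pd.DataFrame({
    "day_of_week": [0] * 5 + [5] * 5,
    "duration_sec": [300, 400, 500, 600, 700, 200, 400, 600, 800, 1000],
})

# Quartiles per day of the week: one row per day, one column per quantile
quartiles = (
    toy.groupby("day_of_week")["duration_sec"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The IQR is the width of the box; it contains the middle 50% of the trips
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

Applied to the real data, the same pattern would use `df_bike` (or the duration-limited subset) in place of `toy`.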

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Although there are no significant findings beyond the time-related ones, it is interesting to see that trips taken by women are on average slightly longer than those taken by men. Additionally, I found it surprising that the most popular *start_station* and *end_station* do not match perfectly.
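
The overlap between the most popular start and end stations can be checked with set operations instead of being read from the chart. The following is a minimal sketch on made-up station names; the function name `shared_top_stations` and its parameter `n` are hypothetical.

```python
import pandas as pd

def shared_top_stations(df, n=10):
    """Return (stations in both top-n lists, stations in only one of them)."""
    top_start = set(df["start_station_name"].value_counts().head(n).index)
    top_end = set(df["end_station_name"].value_counts().head(n).index)
    return top_start & top_end, top_start ^ top_end

# Made-up trips between three stations, for illustration only
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C"],
    "end_station_name": ["A", "A", "B", "C", "C", "C"],
})

shared, differing = shared_top_stations(toy, n=2)
print("In both top lists:", sorted(shared))
print("In only one list:", sorted(differing))
```

With `n=10` on the full dataframe, the size of the first set corresponds to the number of shared stations reported above.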

# Multivariate Exploration

## Visualization 14
* <b>Question</b>: Is there variation in the customer:subscriber ratio in different age groups?
* <b>Visualization</b>: Seaborn stripplot
* <b>Observations</b>: The oldest generation is predominantly subscribers, the youngest customers.

In [None]:
# Plot 14: The variation in customer:subscriber ratio per age group of 20 years

plt.figure(figsize=(10,12))
sns.stripplot(x="birth_year_bin", y="duration_sec", hue="user_type", data=df_bike, alpha=0.50, palette='terrain')
plt.xlabel("Birth year (binned per 20 years)")
plt.ylabel("Duration (seconds)")
plt.title("Duration of trip per (binned) birth year, split out by user type")
# Tick labels are taken from the birth_year_bin categories, so no manual override is needed
plt.show()

## Visualization 15
* <b>Question</b>: When do different generations use GoBike?
* <b>Visualization</b>: Seaborn boxplot
* <b>Observations</b>: The oldest generation peaks on Monday and Tuesday. Usage by those of working age is consistent across the weekdays and across the weekend days, but differs clearly between weekdays and weekend. Usage by the youngest generation peaks on Tuesday afternoon and evening. The median is on average around 13:00, which makes sense considering that the peaks occurred at 8:00 and 17:00.

In [None]:
# Plot 15: GoBike usage per hour of the day by age group (binned per 20 years)

sns.set_palette(sns.color_palette("viridis_r", n_colors=8))
plt.figure(figsize=(12,8))
ax = sns.boxplot(data=df_bike, x='day_of_week', y='hour_of_day', hue='birth_year_bin')
ax.set_xticklabels(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'], rotation=0)
plt.xlabel("Day of the week")
plt.ylabel("Hour of the day")
plt.title("GoBike usage per hour of the day by age group (binned per 20 years)")
plt.yticks(np.arange(0, 25, 1))
plt.show()

In [None]:
# Explanation of the 1900-1920 bin
df_bike.sort_values('member_birth_year').head(3)
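
A 1900-1920 bin can still contain rows after birth years before 1920 have been removed, because `pd.cut` closes each interval on the right by default: a birth year of exactly 1920 falls in (1900, 1920]. The sketch below illustrates this with a handful of made-up years and shows `right=False` as one possible alternative; it is an illustration, not a change to the binning used in this notebook.

```python
import pandas as pd

edges = [1900, 1920, 1940, 1960, 1980, 2000, 2020]
labels = ["1900-1920", "1920-1940", "1940-1960", "1960-1980", "1980-2000", "2000-2020"]

# Made-up birth years sitting on or just after a bin edge
years = pd.Series([1920, 1921, 1980, 1981, 2000, 2001])

# Default (right=True): intervals are (lower, upper], so 1920 lands in '1900-1920'
right_closed = pd.cut(years, bins=edges, labels=labels)

# right=False: intervals are [lower, upper), so 1920 lands in '1920-1940'
left_closed = pd.cut(years, bins=edges, labels=labels, right=False)

print(pd.DataFrame({
    "year": years,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```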

## Review of multivariate exploration

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> The final plot summarized all time-related data: day of the week, hour of the day, and age. Usage by the Millennial generation (1980-2000) is the most consistent across weekdays, and this group is nearly always the largest.

### Were there any interesting or surprising interactions between features?

> The first thing that attracts attention in plot 14 is the contrast between the earliest and latest age cohorts: the majority of those born between 1920 and 1940 are subscribers, whereas the youngest generation mostly opts for pay-per-use.
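
The customer:subscriber split per age cohort can be quantified with a normalized cross-tabulation, which is easier to compare than overlapping points in a stripplot. The sketch below uses made-up rows for two cohorts; the proportions are illustrative and are not results from the dataset.

```python
import pandas as pd

# Made-up users in two birth-year bins, for illustration only
toy = pd.DataFrame({
    "birth_year_bin": ["1920-1940"] * 4 + ["2000-2020"] * 4,
    "user_type": ["Subscriber"] * 3 + ["Customer"] + ["Customer"] * 3 + ["Subscriber"],
})

# normalize="index" turns each row into proportions that sum to 1 per cohort
share = pd.crosstab(toy["birth_year_bin"], toy["user_type"], normalize="index")
print(share)
```

On the real data, the same call with `df_bike["birth_year_bin"]` and `df_bike["user_type"]` would give the share of each user type per cohort.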

# Conclusions
> In conclusion, my analysis of the GoBike usage data revealed a strong relationship between age and usage, with young professionals being the primary target audience. Furthermore, I observed that the day of the week and the hour of the day had a clear impact on usage patterns. Additionally, I found that while the most popular start and end stations are largely the same, they do not match perfectly. The variance in trip duration was also found to be high, although the interquartile range was relatively small. Based on these findings, I recommend that Ford GoBike consider partnering with businesses and offering a package tailored towards young professionals in order to expand the service. These findings are based on the data available, and further research may be needed to validate them.
