## Dataset Overview

> This dataset contains information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area.
The dataset is stored as a pandas DataFrame with 16 columns and 183412 rows. The features cover 3 main areas: 
1. trip duration
2. station information
3. member information

## Investigation Overview

> The goal of this presentation is to explore the main features of bike sharing in the greater San Francisco Bay Area, and in particular to identify the main determinants of trip duration by examining how it relates to the other explanatory variables in the dataset. We try to answer the following questions:
1. What does the distribution of trip duration look like?
2. Which days have the highest demand for trips?
3. Which hours of the day have the highest demand for trips?
4. How does trip duration differ by user age, hour, day, and user type?


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# suppress warnings 
import warnings
warnings.simplefilter("ignore")

#### Load the dataset and describe its properties

In [None]:
#Reading the data frame 
df = pd.read_csv("../input/ford-gobike-2019feb-tripdata/201902-fordgobike-tripdata.csv")

In [None]:
# do not truncate rows when displaying large frames
pd.options.display.max_rows = 999999
# display the first 5 rows
df.head()

In [None]:
# column data types and non-null counts
df.info()

In [None]:
#describe the data - main statistics
df.describe()

In [None]:
# Are there any duplicated rows?
df.duplicated().sum()

In [None]:
# number of missing values in each column
df.isna().sum()

### Problems found in the data so far:
1. Many features have incorrect data types
2. Many features have missing values
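
Both problems can be read off a single per-column summary. The following is a minimal sketch on a small, made-up frame (the `toy` values and the `audit` name are illustrative assumptions, not part of the original data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the trip data (hypothetical values)
toy = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10", "2019-02-28 18:53:21", "2019-02-28 12:13:13"],
    "start_station_id": [21.0, np.nan, 86.0],
    "member_gender": ["Male", None, "Female"],
})

# One row per column: its current dtype and the share of missing values
audit = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing_pct": toy.isna().mean() * 100,
})
print(audit)
```

Timestamps stored as text and identifiers stored as floats show up in the `dtype` column, while `missing_pct` shows which columns need filling.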

### Correcting data types 

In [None]:
# convert start_time and end_time into datetime 
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)

# convert start_station_id, end_station_id, and bike_id into strings
df.start_station_id = df.start_station_id.astype('str')
df.end_station_id = df.end_station_id.astype('str')
df.bike_id = df.bike_id.astype('str')

# convert user_type and member_gender into categories
df.user_type = df.user_type.astype('category')
df.member_gender = df.member_gender.astype('category')

In [None]:
# quick check
df.info()

### Filling in Missing Data

In [None]:
# Percent of missing values in each column
(df.isna().sum() / df.shape[0]) * 100

In [None]:
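# fill missing values in each column with that column's most frequent value (mode)
for col in ["start_station_name", "end_station_name", "member_birth_year", "member_gender"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# safeguard only: member_gender was already filled with its mode in the loop above
df["member_gender"] = df["member_gender"].fillna("Male")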

In [None]:
# Percent of missing values in each column
(df.isna().sum() / df.shape[0]) * 100
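
As an illustration of what mode imputation does, here is a minimal sketch on a made-up frame (the `toy` values are assumptions chosen for the example). Each missing entry is replaced by the most frequent value of its column:

```python
import pandas as pd

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "member_gender": pd.Series(["Male", "Female", None, "Male"], dtype="category"),
    "member_birth_year": [1984.0, None, 1990.0, 1984.0],
})

for col in ["member_gender", "member_birth_year"]:
    most_common = toy[col].mode()[0]      # mode() returns a Series; take the first mode
    toy[col] = toy[col].fillna(most_common)

print(toy)
```

Note that this strengthens the dominant category, so shares computed afterwards (for example by gender) lean slightly toward the mode.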

### Feature Engineering 

In [None]:
# add new columns for trip duration in minute, hour of the day, day of week and month

df['duration_minute'] = df['duration_sec']/60
df['start_date'] = df.start_time.dt.strftime('%Y-%m-%d')
df['start_hourofday'] = df.start_time.dt.strftime('%H')
df['start_dayofweek'] = df.start_time.dt.strftime('%A')
df['start_month'] = df.start_time.dt.strftime('%B')

df.head()

In [None]:
# Calculating age from 'member_birth_year', taking 2021 as the reference year
# (the trips themselves were recorded in February 2019)
df['member_age'] = 2021 - df['member_birth_year']
df.head()

### The structure of the dataset:

> The dataset is stored as a pandas DataFrame with 16 columns and 183412 rows. The features cover 3 main areas: 
1. trip duration
2. station information
3. member information

### The main feature(s) of interest:
> 1. duration_sec
2. duration_minute


### Features that will help support our investigation:
> 1. member_birth_year
2. member_age
3. member_gender
4. bike_share_for_all_trip
5. start_month
6. start_dayofweek
7. user_type

## Univariate Exploration

> In this section, we will investigate distributions of individual variables. If
we see unusual points or outliers, we will take a deeper look to clean things up
and prepare ourselves to look at relationships between variables.

In [None]:
# trip distribution by duration
plt.figure(figsize = (8, 4), dpi = 100)

sns.histplot(data = df, x = "duration_minute")
plt.xlim(0, 100)
plt.title("trip distribution by duration in minutes")
plt.xlabel('Duration in minutes')
plt.ylabel('Count')
plt.axvline(x=30, color = "red")
plt.show()

In [None]:
len(df[df["duration_minute"] <= 30]) / len(df["duration_minute"]) * 100

In [None]:
len(df[df["duration_minute"] > 60]) / len(df["duration_minute"]) * 100

**Graph Conclusion:** From the distribution of duration we can see that more than 96 percent of trips lasted 30 minutes or less, and that only 0.93 percent of trips lasted more than 1 hour. The latter might be considered outliers that need to be removed before going further in the bivariate analysis. 
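
The two percentages above are shares of a boolean mask, and the mean of a boolean Series is exactly that share. A minimal sketch on made-up durations (the values and the `share_pct` helper are illustrative assumptions):

```python
import pandas as pd

# Hypothetical trip durations in minutes
durations = pd.Series([4.2, 8.9, 12.5, 17.0, 26.3, 31.0, 45.5, 75.0, 9.1, 14.8])

def share_pct(mask):
    """Percentage of True values in a boolean Series."""
    return mask.mean() * 100

within_30 = share_pct(durations <= 30)   # trips of 30 minutes or less
over_60 = share_pct(durations > 60)      # trips longer than one hour
print(within_30, over_60)
```

This gives the same result as dividing `len()` of the filtered frame by the total length, in one expression.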

In [None]:
# trip distribution over day hours, sorted from busiest to quietest hour
plt.figure(figsize = (8,4), dpi = 100)

base_color = sns.color_palette()[0]

order = df["start_hourofday"].value_counts().index

sns.countplot(data = df, x = "start_hourofday", color = base_color, order = order)
plt.title("trip distribution over day hours")
plt.xticks(rotation = 90)
plt.show()

In [None]:
# trip distribution over day hours, in chronological order
plt.figure(figsize = (8,4), dpi = 100)

base_color = sns.color_palette()[0]

hour = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23']
hour_categ = pd.api.types.CategoricalDtype(ordered=True, categories=hour)
df['start_hourofday'] = df['start_hourofday'].astype(hour_categ)


sns.countplot(data = df, x = "start_hourofday", color = base_color)
plt.title("trip distribution over day hours")
plt.xlabel('Day hours')
plt.ylabel('Count')
plt.xticks(rotation = 90)
plt.show()

**Graph Conclusion:** From the graph we can notice that peak hours are those from 7 - 9 am and from 4 - 6 pm. This might be related to the time when employees and students go to and leave work and school.
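
The peak hours can also be read off numerically rather than from the plot. A minimal sketch, assuming a handful of made-up start times (`starts` is hypothetical data, not taken from the dataset):

```python
import pandas as pd

# Hypothetical trip start times
starts = pd.to_datetime(pd.Series([
    "2019-02-01 08:05:00", "2019-02-01 08:40:00", "2019-02-01 08:55:00",
    "2019-02-01 17:10:00", "2019-02-01 17:45:00", "2019-02-01 12:30:00",
    "2019-02-01 23:15:00",
]))

# Count trips per hour of day, ordered chronologically
trips_per_hour = starts.dt.hour.value_counts().sort_index()

# The two busiest hours
peak_hours = trips_per_hour.nlargest(2).index.tolist()
print(trips_per_hour)
print(peak_hours)
```

Using `dt.hour` yields integers, so the hours sort chronologically without the categorical conversion needed for the zero-padded strings.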

In [None]:
# trip distribution over weekdays
plt.figure(figsize = (8,4), dpi = 100)

weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_categ = pd.api.types.CategoricalDtype(ordered=True, categories=weekday)
df['start_dayofweek'] = df['start_dayofweek'].astype(weekday_categ)

sns.countplot(data=df, x='start_dayofweek', color=base_color)
plt.xlabel('Trip Start Day of Week')
plt.ylabel('Count')
plt.title("Trip distribution over weekdays")

plt.show()

**Graph Conclusion:** The demand for trips gradually increases until it peaks on Thursday, then declines to its lowest levels on Saturday and Sunday. This is likely because Saturday and Sunday are the weekend in the United States.

In [None]:
# trip distribution over months
plt.figure(figsize = (8,4), dpi = 100)

month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
month_categ = pd.api.types.CategoricalDtype(ordered=True, categories=month)
df['start_month'] = df['start_month'].astype(month_categ)

sns.countplot(data=df, x='start_month', color=base_color)
plt.xticks(rotation=90)
plt.xlabel('Trip Start Month')
plt.ylabel('Count')

plt.title("Trip distribution over months")

plt.show()

**Graph Conclusion:** All trips took place in the month of February.  

In [None]:
# Distribution of Age
plt.figure(figsize = (8,4), dpi = 100)

bins = np.arange(0, df['member_age'].max()+5, 5)
sns.histplot(data=df, x='member_age', color=base_color, bins = bins)
plt.xticks(rotation=90)
plt.xlabel('Member age')
plt.ylabel('Count')

plt.title("Distribution of Age")

plt.show()

In [None]:
len(df[df["member_age"] <= 45]) / len(df["member_age"]) * 100

**Graph Conclusion:** The age distribution is skewed to the right, as is typical for age data. It is consistent with the distribution over weekdays: riders aged 20 - 45 account for most of the demand, as they are the most active population in either work or study. 

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

plot = sns.countplot(data=df, x='user_type', color=base_color)
plt.xlabel('User Type')
plt.ylabel('Count')
plt.title("Distribution of Customers by type - count")

plt.show()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

counts = df['user_type'].value_counts(normalize = True)
sns.barplot(x = counts.index, y = counts.values, color=base_color)
plt.xlabel('User Type')
plt.ylabel('Proportion')
print(counts * 100)

plt.title("Distribution of Customers by type - percent")
plt.show()

**Graph Conclusion:** Customers represent 10.8 percent of users, whereas subscribers represent 89.2 percent.

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

counts = df['member_gender'].value_counts(normalize = True)
sns.barplot(x = counts.index, y = counts.values, color=base_color)
plt.xlabel('Member Gender')
plt.ylabel('Proportion')
print(counts * 100)
plt.title("Distribution of Customers by gender")

plt.show()

**Graph Conclusion:** Males represent 75.7 percent of users and females 22.3 percent; the remaining 1.99 percent are of other genders.

#### Removing outliers

In [None]:
# keep trips of at most 60 minutes, then members aged at most 80
df1 = df[df["duration_minute"] <= 60]
df2 = df1[df1["member_age"] <= 80]

df2.head()

In [None]:
df2["duration_minute"].describe()

In [None]:
df2["member_age"].describe()
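
The two filtering steps can also be expressed as one combined boolean mask, which keeps the mask aligned with the frame being filtered. A minimal sketch on made-up rows (the `toy` values are assumptions; the thresholds are the ones used above):

```python
import pandas as pd

# Hypothetical rows: two of them violate one threshold each
toy = pd.DataFrame({
    "duration_minute": [5.0, 12.0, 95.0, 20.0, 30.0],
    "member_age": [25, 101, 40, 33, 80],
})

# Keep a row only if it satisfies both conditions
keep = (toy["duration_minute"] <= 60) & (toy["member_age"] <= 80)
filtered = toy.loc[keep].copy()   # copy() avoids chained-assignment warnings later
print(filtered)
```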

### The distribution(s) of variable(s) of interest:

> 1. From the distribution of duration we noticed that more than 96 percent of trips lasted 30 minutes or less, and that only 0.93 percent of trips lasted more than 1 hour. The latter were considered outliers and were removed before going further in the bivariate analysis.
2. We also noticed that peak hours are those from 7 - 9 am and from 4 - 6 pm. This might be related to the time when employees and students go to and leave work and school. This was also consistent with the distribution of trips over weekdays, where work days have the most demand for trips.
3. The age distribution is skewed to the right, as is typical for age data. It is consistent with the distribution over weekdays: riders aged 20 - 45 account for most of the demand, as they are the most active population in either work or study.
4. Customers represent 10.8 percent of users, whereas subscribers represent 89.2 percent.
5. Males represent 75.7 percent of users and females 22.3 percent; the remaining 1.99 percent are of other genders.

### Unusual distributions:

> 1. The distribution of duration was surprising: 96 percent of trips were 30 minutes or less. There were some outliers that we removed.
2. Age also has some outliers that we removed.
3. We created new features out of the time variable.

## Bivariate Exploration

> In this section, we will investigate relationships between pairs of variables in our
data.

#### Duration distribution by member gender


In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.boxplot(data = df2, x = "member_gender", y = "duration_minute")
plt.xlabel('Gender');
plt.ylabel('Trip Duration in Minute')

plt.title("Distribution of trip duration by Gender")

plt.show()

**Graph Conclusion:** Male riders seem to have shorter trips than females and riders of other genders, as is evident from the smaller median and narrower IQR. However, the difference is very small and we are not sure whether it is significant or not.
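
One way to judge whether such a small gap in medians is more than noise is a permutation test. This is not part of the original analysis. It is a minimal sketch on synthetic data, and the `permutation_pvalue` helper, the group sizes, and the distributions are all assumptions made for illustration:

```python
import numpy as np

def permutation_pvalue(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference of medians."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(np.median(a) - np.median(b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the group labels and recompute the statistic
        shuffled = rng.permutation(pooled)
        diff = abs(np.median(shuffled[:a.size]) - np.median(shuffled[a.size:]))
        hits += int(diff >= observed)
    # Add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_resamples + 1)

# Synthetic, skewed "durations": the second group is shifted by 5 minutes
rng = np.random.default_rng(1)
group_a = rng.exponential(scale=10, size=200)
group_b = rng.exponential(scale=10, size=200) + 5

p_shifted = permutation_pvalue(group_a, group_b)
p_same = permutation_pvalue(group_a, group_a.copy())
print(p_shifted, p_same)
```

With the real data, the two inputs would be the `duration_minute` values of two gender groups in `df2`. Given the large sample, even a tiny difference may come out as statistically significant, so the size of the gap still matters.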

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.boxplot(data = df2, x = "user_type", y = "duration_minute")
plt.xlabel('User Type');
plt.ylabel('Trip Duration in Minute')

plt.title("Distribution of trip duration by user type")

plt.show()

**Graph Conclusion:** Subscribers have shorter trips, whereas casual riders (customers) have longer trips. 

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.barplot(data = df2, x = "start_dayofweek", y = "duration_minute", color=base_color)
plt.xlabel('Day of Week');
plt.ylabel('Avg. Trip Duration in Minute')

plt.show()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.boxplot(data = df2, x = "start_dayofweek", y = "duration_minute", color=base_color)
plt.xlabel('Day of Week');
plt.ylabel('Trip Duration in Minute')

plt.show()

**Graph Conclusion:** The graph reflects stable use along work days. Trip duration is longer during weekends, reflecting more casual and entertainment usage.  

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.countplot(data=df2, x='start_dayofweek', hue='user_type')
plt.xlabel('Trip Start Day of Week')
plt.ylabel('Count')
plt.title("Trip distribution over weekdays")

plt.show()

**Graph Conclusion:** Subscribers seem to have consistent usage for a specific purpose every day, mainly: work and study. As a result the number of their rides declines the most at weekends  

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.countplot(data = df2, x = "start_hourofday", hue='user_type')
plt.title("trip distribution over day hours - by user type")
plt.xlabel('Day hours')
plt.ylabel('Count')
plt.xticks(rotation = 90)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()

**Graph Conclusion:** From the graph we can notice that peak hours for both user types are those from 7 - 9 am and from 4 - 6 pm. This might be related to the time when employees and students go to and leave work and school.

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.scatterplot(data = df2, x = "member_age", y = "duration_minute", alpha = 0.3)
plt.title("Relationship between trip duration and age")
plt.xlabel('Member age')
plt.ylabel('Trip Duration in Minute')
plt.xticks(rotation = 90)

plt.show()

**Graph Conclusion:** There is a clear negative relationship between age and trip duration.
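
The direction and strength of such a relationship can be checked with a correlation coefficient alongside the scatter plot. A minimal sketch on made-up values (the `toy` frame is an assumption). The rank-based coefficient is computed by ranking both columns and taking the Pearson correlation of the ranks, which is Spearman's rho:

```python
import pandas as pd

# Hypothetical ages and trip durations
toy = pd.DataFrame({
    "member_age": [22, 27, 31, 36, 44, 52, 60],
    "duration_minute": [18.0, 15.5, 16.0, 11.0, 9.5, 10.0, 6.0],
})

# Pearson: linear association
pearson_r = toy["member_age"].corr(toy["duration_minute"])

# Spearman: Pearson correlation of the ranks, robust to skew and outliers
spearman_r = toy["member_age"].rank().corr(toy["duration_minute"].rank())
print(pearson_r, spearman_r)
```

A value close to -1 supports a strong negative relationship, while a value near 0 would suggest the pattern in the scatter plot is mostly an effect of overplotting.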

In [None]:
# Distribution of age by user type
plt.figure(figsize = (8,4), dpi = 100)

sns.boxplot(data=df2, x = "user_type", y='member_age')
plt.xticks(rotation=90)
plt.xlabel('User Type')
plt.ylabel('Member age')

plt.title("Distribution of Age by user type")

plt.show()

### Some of the relationships observed in this part of the investigation:

> There are far more subscribers than customers. Subscribers' usage seems to be very consistent: it is intended for daily routines such as work or study. Therefore subscribers' usage reaches its highest levels during rush hours and on work days. Customers, on the other hand, tend to use bikes for fun; their usage is concentrated on weekends, around midnight and midday.   

### Interesting relationships between the other features:

> It was surprising to see that customers' rides mostly occur around midnight and midday.

## Multivariate Exploration


In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.barplot(data = df2, x = "start_dayofweek", y = "duration_minute", hue='user_type')
plt.xlabel('Day of Week');
plt.ylabel('Avg. Trip Duration in Minute')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Trip distribution over weekdays and by customer type")

plt.show()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.boxplot(data = df2, x = "start_dayofweek", y = "duration_minute", hue='user_type')
plt.xlabel('Day of Week');
plt.ylabel('Trip Duration in Minute')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Trip distribution over weekdays and by customer type")

plt.show()

**Graph Conclusion:** Customers have consistently longer trips than subscribers

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.barplot(data = df2, x = "start_hourofday", y = "duration_minute",   hue='user_type', ci = None)
plt.title("Trip duration over day hours and by customer type")
plt.xlabel('Day hours')
plt.ylabel('Avg. Trip Duration in Minute')
plt.xticks(rotation = 90)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

sns.boxplot(data = df2, x = "start_hourofday", y = "duration_minute", hue='user_type')
plt.title("Trip duration over day hours and by customer type")
plt.xlabel('Day hours')
plt.ylabel('Trip Duration in Minute')
plt.xticks(rotation = 90)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()

**Graph Conclusion:** Customers have consistently longer trips across all hours of the day. However, customer trips are much longer at midnight and midday

In [None]:
plt.figure(figsize = (8,4), dpi = 100)

# correlate the numeric columns only; text, category and datetime columns are excluded
sns.heatmap(df2.select_dtypes(include="number").corr(), cmap = "viridis", annot = True)
plt.title("Correlation Matrix")
plt.xticks(rotation = 90)

plt.show()

In [None]:
plt.figure(figsize = (8,6), dpi = 100)

plt.subplot(2, 1, 1)
subscribers = df2[df2['user_type'] == "Subscriber"] 
ct_counts = subscribers.groupby(['start_dayofweek', 'start_hourofday']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index='start_dayofweek', columns='start_hourofday', values='count')
sns.heatmap(ct_counts, cmap='rocket_r');
plt.title('Subscriber', loc='right');
plt.xlabel('Hour of Day');
plt.ylabel('Day of Week');

plt.subplot(2, 1, 2)
customers = df2[df2['user_type'] == "Customer"] 
ct_counts = customers.groupby(['start_dayofweek', 'start_hourofday']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index='start_dayofweek', columns='start_hourofday', values='count')
sns.heatmap(ct_counts, cmap='rocket_r');
plt.title('Customer', loc='right');
plt.xlabel('Hour of Day');
plt.ylabel('Day of Week');

plt.tight_layout()

**Graph Conclusion:** Customers and subscribers show clearly different usage patterns, in the way we previously explained.
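
The day-by-hour count table behind each heatmap can also be built in one step with `pd.crosstab`, as an alternative to the groupby, reset_index and pivot sequence. A minimal sketch on made-up rows (the `toy` frame and the `day_hour_counts` helper are illustrative assumptions):

```python
import pandas as pd

# Hypothetical trips
toy = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Customer"],
    "start_dayofweek": ["Monday", "Monday", "Tuesday", "Saturday", "Saturday"],
    "start_hourofday": ["08", "08", "17", "13", "14"],
})

def day_hour_counts(frame, user_type):
    """Trips per (day of week, hour of day) for one user type."""
    subset = frame[frame["user_type"] == user_type]
    return pd.crosstab(subset["start_dayofweek"], subset["start_hourofday"])

subscriber_counts = day_hour_counts(toy, "Subscriber")
customer_counts = day_hour_counts(toy, "Customer")
print(subscriber_counts)
print(customer_counts)
```

Either table can be passed directly to `sns.heatmap`. Combinations that never occur among the observed days and hours are filled with 0 rather than left missing.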

### Some of the relationships observed in this part of the investigation:
> 1. Customers have consistently longer trips than subscribers.
2. Customers have consistently longer trips across all hours of the day. However, customer trips are much longer at midnight and midday.
3. Customers and subscribers show clearly different usage patterns, in the way we previously explained.

### Interesting or surprising interactions between features:

> It was surprising to see that customers' rides mostly occur around midnight and midday.