## Ford Go-Bike Dataset Exploration
## Prepared By: Mohamad AbouElela

This dataset represents trips taken by members of the Ford Go Bike service during February 2019.
The data includes information about each trip: the user's type, age, and gender, the start and end stations, the trip duration, etc.

### Dataset Dictionary:

1. duration_sec: Trip Duration (seconds)
2. start_time: Start Time and Date
3. end_time: End Time and Date
4. start_station_id: Start Station ID
5. start_station_name: Start Station Name
6. start_station_latitude: Start Station Latitude
7. start_station_longitude: Start Station Longitude
8. end_station_id: End Station ID
9. end_station_name: End Station Name
10. end_station_latitude: End Station Latitude
11. end_station_longitude: End Station Longitude
12. bike_id: Bike ID
13. user_type: User Type (“Subscriber” = Member, “Customer” = Casual)
14. member_birth_year: Member Year of Birth
15. member_gender: Member Gender
16. bike_share_for_all_trip: Boolean to track members who are enrolled in the "Bike Share for All" program for low-income residents

In [None]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
# read dataset file
df = pd.read_csv("../input/ford-gobike-2019feb-tripdata/201902-fordgobike-tripdata.csv")

## Exploratory Data Analysis - EDA

In [None]:
print (f'DataFrame Shape = {df.shape}')
df.head()

In [None]:
df.info()

In [None]:
# checking for null values
df.isna().sum()

In [None]:
# numeric variables information
df.describe()

In [None]:
# members gender types
df['member_gender'].value_counts()

In [None]:
# Users types
df['user_type'].value_counts()

In [None]:
df['start_station_name'].value_counts()

In [None]:
# check duplicates
df.duplicated().value_counts()

### Observations:
1. Data consists of 183412 rows and 16 columns.
2. Trip duration is in seconds.
3. Start and end times are not in datetime format.
4. 197 rows are missing the station ID and station name, and 8265 rows are missing the member birth year and member gender.
5. There are no duplicates in the dataset.
6. Some member_birth_year values need investigation.
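Observation 3 matters only if the timestamps are used later (this notebook drops them during cleaning). As a minimal sketch on made-up rows, assuming the raw columns hold strings in a `YYYY-MM-DD HH:MM:SS.ffff` layout, they could be parsed with `pd.to_datetime`. The `trips` frame and the derived column names are hypothetical.

```python
import pandas as pd

# Toy rows that imitate the raw timestamp layout (values are made up).
trips = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 18:53:21.7890"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-03-01 06:42:03.0560"],
})

# Parse both columns from strings into datetime64 values.
for col in ["start_time", "end_time"]:
    trips[col] = pd.to_datetime(trips[col])

# Time-based features become available once the dtype is datetime.
trips["start_hour"] = trips["start_time"].dt.hour
trips["elapsed_sec"] = (trips["end_time"] - trips["start_time"]).dt.total_seconds()

print(trips.dtypes)
```

With real data, `elapsed_sec` could be compared against `duration_sec` as a consistency check.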

## Data Cleaning and organizing:

To start working with the data, we clean the dataset in the following steps:

1. Drop the unneeded columns ['start_time', 'end_time', 'start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude', 'bike_id']
2. Drop rows with missing values (station ID and name, member birth year and gender)
3. Investigate and drop implausible member_birth_year values
4. Convert duration_sec from seconds to minutes for easier understanding
5. Convert member_birth_year to member_age for easier understanding
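Steps 3 and 5 can also be expressed with a general age cutoff rather than a single hard-coded year. The sketch below uses made-up birth years; the `MAX_AGE` threshold of 100 is an assumption for illustration, not a value taken from this notebook.

```python
import pandas as pd

# Toy birth years (made up); 1878 mirrors the implausible value examined in this notebook.
toy = pd.DataFrame({"member_birth_year": [1985.0, 1992.0, 1878.0, 1900.0, 2000.0]})

DATA_YEAR = 2019  # year the trips were recorded
MAX_AGE = 100     # assumed plausibility cutoff (hypothetical)

# Flag riders whose implied age exceeds the cutoff.
age = DATA_YEAR - toy["member_birth_year"]
implausible = age > MAX_AGE

# Inspect the flagged rows before dropping them.
print(toy[implausible])

# Keep the plausible rows and derive an integer age column.
cleaned = toy[~implausible].copy()
cleaned["member_age"] = (DATA_YEAR - cleaned["member_birth_year"]).astype(int)
print(cleaned)
```

A cutoff catches every out-of-range value at once, but it also encodes a judgment call, so the flagged rows are worth reviewing first.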

In [None]:
# drop unwanted columns
cols_drop = ['start_time', 'end_time', 'start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude', 'bike_id']
df1 = df.drop(columns=cols_drop)

In [None]:
df1.shape

In [None]:
df1.head()

In [None]:
# drop all NaN values 
df2 = df1.dropna()
df2.shape

In [None]:
# check no NaN values
df2.isna().sum()

In [None]:
# check member birth year values
df2['member_birth_year'].value_counts()

In [None]:
# investigate member birth year value of 1878 which is clearly wrong
# (variable renamed so it does not shadow the built-in id())
wrong_birth_year = df2[df2['member_birth_year'] == 1878]
wrong_birth_year

In [None]:
# drop wrong birth year value
df3 = df2.drop(index=wrong_birth_year.index)
df3.shape

In [None]:
# change duration sec to duration min for easy understanding

df3['duration_min'] = df3['duration_sec']/60
df3.head()

In [None]:
# change member birth year to member age
df3['member_age'] = 2019 - df3['member_birth_year']
df3.head()

In [None]:
# drop duration_sec and member birth year columns
drop_cols = ['duration_sec', 'member_birth_year']
df_final = df3.drop(columns = drop_cols, axis = 1)
df_final.head()

## Visual Explorations
### Univariate Exploration

#### Features of interest
1. Gender
2. Users type
3. Age
4. Location 
5. Duration

In [None]:
# Explore members gender and user type

plt.figure(figsize = [10, 6])
plt.subplot(1,2,1)
sorted_gender = df_final['member_gender'].value_counts()
plt.pie(sorted_gender, labels=sorted_gender.index, startangle = 90, autopct = '%1.1f%%')
plt.axis('square')

plt.subplot(1,2,2)
sorted_type = df_final['user_type'].value_counts()
plt.pie(sorted_type, labels = sorted_type.index, startangle = 90, autopct = '%1.1f%%')
plt.axis('equal');

#### Males represent 75% of the dataset
#### More than 90% of users are subscribers to the Ford Go Bike service

In [None]:
# member age exploration

age_bins = np.arange(10, df_final['member_age'].max()+4, 4)
# pass the bin edges computed above (they were previously unused)
plt.hist(data = df_final, x = 'member_age', bins = age_bins);

#### The age distribution is right skewed, with members aged 30 to 40 years making up the largest share of the dataset
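The right skew seen in a histogram can be cross-checked numerically. A minimal sketch on made-up ages: a positive `Series.skew()` and a mean above the median both point to a longer right tail.

```python
import pandas as pd

# Made-up ages with a few older riders stretching the right tail.
ages = pd.Series([22, 25, 27, 28, 30, 31, 33, 35, 38, 45, 60, 75])

skewness = ages.skew()  # sample skewness; positive means right skewed
print(f"skew = {skewness:.2f}, mean = {ages.mean():.1f}, median = {ages.median():.1f}")
```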

In [None]:
# Explore Top ten start stations vs Top ten end stations
base_color = sb.color_palette()[0]
plt.figure(figsize = [10, 10])
plt.subplot(2,1,1)
start_order = df_final['start_station_name'].value_counts()[:10]
df_start = df_final.loc[df_final['start_station_name'].isin(start_order.index)]
sb.countplot(data = df_start, y = 'start_station_name', order = start_order.index, color = base_color)
plt.subplot(2,1,2)
end_order = df_final['end_station_name'].value_counts()[:10]
df_end = df_final.loc[df_final['end_station_name'].isin(end_order.index)]
sb.countplot(data = df_end, y = 'end_station_name', order = end_order.index, color = base_color);

#### Both Market St and San Francisco Caltrain Station 2 are the busiest start and destination stations.
#### We can use this data to increase the number of available bikes at these stations
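To list the stations that rank high as both origin and destination without reading them off two charts, the top-N indexes can be intersected. This is a sketch on made-up single-letter station names; `TOP_N` and the `trips` frame are hypothetical.

```python
import pandas as pd

# Toy trips with single-letter station names (made up).
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C", "D"],
    "end_station_name":   ["B", "A", "A", "A", "C", "B", "B"],
})

TOP_N = 2  # the notebook uses 10; 2 keeps the toy example readable

# value_counts() sorts by frequency, so head() returns the busiest stations.
top_start = trips["start_station_name"].value_counts().head(TOP_N)
top_end = trips["end_station_name"].value_counts().head(TOP_N)

# Stations that appear in both top lists.
busy_both = top_start.index.intersection(top_end.index)
print(sorted(busy_both))
```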

In [None]:
df_final['duration_min'].describe()

In [None]:
# duration exploration
bins = np.arange(df_final['duration_min'].min(), df_final['duration_min'].max()+ 30, 30)
plt.hist(df_final['duration_min'], bins = bins)
plt.xlabel ('Duration in Min')
plt.xlim((0,200));

### Most bike rides last less than 30 minutes; however, the durations have a long tail of outliers, possibly because users keep their bikes rented during work or forget to log off after finishing their rides.
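A long-tailed variable like this is often easier to read on a logarithmic axis. The sketch below builds log-spaced bin edges with NumPy on synthetic durations; the lognormal parameters and the `step` width are assumptions chosen only to produce a similar shape.

```python
import numpy as np

# Synthetic long-tailed durations in minutes (not the real data).
rng = np.random.default_rng(0)
duration_min = rng.lognormal(mean=2.3, sigma=0.8, size=1000)

# Bin edges evenly spaced in log10 units, covering the full value range.
step = 0.1
log_min = np.floor(np.log10(duration_min.min()) / step) * step
log_max = np.log10(duration_min.max()) + step
bin_edges = 10 ** np.arange(log_min, log_max + step, step)

# Count the rides that fall into each log-spaced bin.
counts, _ = np.histogram(duration_min, bins=bin_edges)
print(counts)
```

The same `bin_edges` could be passed to `plt.hist(..., bins=bin_edges)` followed by `plt.xscale('log')` to draw the histogram.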

## Bivariate Exploration

#### Features of interest

In [None]:
# check relation between age and ride duration
plt.figure(figsize = [12,5])
plt.scatter(x = df_final['member_age'], y = df_final['duration_min'], alpha = 1/6);


#### Members aged between 25 and 40 years tend to take the longest rides.
#### As mentioned before, an interesting observation is that some customers may be keeping their bikes rented during working hours or forgetting to log off, as we observe long rides of over 10 hours and up to 20 hours.
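The scatter plot reading can be backed by a grouped summary. A minimal sketch on made-up rides: ages are bucketed with `pd.cut`, and rides above 600 minutes (10 hours) are listed separately. The bin edges in `age_edges` are an assumption for illustration.

```python
import pandas as pd

# Made-up rides: two very long ones fall in the 25-40 age range.
rides = pd.DataFrame({
    "member_age":   [22, 27, 31, 38, 45, 52, 63, 29],
    "duration_min": [8.0, 12.5, 640.0, 15.0, 9.5, 11.0, 7.0, 1200.0],
})

# Assumed age groupings; intervals are right-closed, e.g. (25, 40].
age_edges = [18, 25, 40, 60, 100]
rides["age_group"] = pd.cut(rides["member_age"], bins=age_edges)

# Ride count, median, and maximum duration per age group.
summary = (
    rides.groupby("age_group", observed=True)["duration_min"]
    .agg(["count", "median", "max"])
)
print(summary)

# Rides longer than 10 hours, which are candidates for forgotten log-offs.
long_rides = rides[rides["duration_min"] > 600]
print(long_rides)
```

Comparing the median with the max per group shows whether a group really rides longer or just contains a few extreme trips.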

In [None]:
# sort levels of categorical variables
object_columns = ['user_type', 'member_gender']

def obj_cat (object_columns):
    for i in object_columns:
        col_order = df_final[i].value_counts().index
        cat = pd.api.types.CategoricalDtype(ordered = True, categories = col_order)
        df_final[i] = df_final[i].astype(cat)
obj_cat(object_columns)
df_final.info()

In [None]:
# check relation between gender and age to try to find the average male/female age in the dataset

sb.boxplot(data = df_final, x = 'member_gender', y = 'member_age', color = base_color);

#### Although males represent 75% of the dataset, the plot shows that the median ages of males and females are roughly equal, at around 33 years
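The center line of a box plot marks the median rather than the mean, so exact figures per gender are easier to read from a grouped aggregate. A sketch on made-up rows:

```python
import pandas as pd

# Made-up ages for illustration only.
members = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Female", "Female", "Other"],
    "member_age":    [30, 34, 50, 29, 35, 40],
})

# Median and mean age per gender, side by side.
age_stats = members.groupby("member_gender")["member_age"].agg(["median", "mean"])
print(age_stats)
```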

In [None]:
# Relation between member genders and user types

sb.countplot(data = df_final, x = 'member_gender', hue = 'user_type');

#### This plot is very informative and we can observe:
* Subscribers outnumber customers among both males and females, which may be due to easy and cheap subscription fees, or to the fare difference between subscribers and customers.
* As mentioned before, males represent over 75% of the dataset, and the plot shows that around 90% of males are subscribers to the service
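Shares such as "around 90% of males are subscribers" can be computed directly with a row-normalized cross-tabulation. This is a sketch on made-up rows, so the printed proportions are not the dataset's.

```python
import pandas as pd

# Made-up users: 5 males, 4 females, 1 other.
users = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Male", "Male",
                      "Female", "Female", "Female", "Female", "Other"],
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Subscriber", "Customer",
                  "Subscriber", "Subscriber", "Subscriber", "Customer", "Subscriber"],
})

# normalize="index" makes each gender's row sum to 1.
share = pd.crosstab(users["member_gender"], users["user_type"], normalize="index")
print(share)
```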

## Multivariate exploration

In [None]:
g = sb.FacetGrid(data = df_final, col = 'member_gender', hue = 'user_type')
g.map(plt.hist, 'member_age');

### As observed in the previous plots, the share of subscribers is higher than that of customers regardless of gender

In [None]:
g = sb.FacetGrid(data = df_final, col = 'bike_share_for_all_trip', hue = 'member_gender', legend_out = True)
g.map(plt.hist, 'member_age');

#### Most trips are taken by users who are not enrolled in the "Bike Share for All" program, regardless of their gender

## Exploration Summary

In this section we'll summarize the Ford Go-Bike dataset findings so that any reader can understand them easily.
We'll focus on:
1. Overall explanation for dataset gender distribution and average gender ages
2. Age and Ride duration relation
3. Most active stations

## Dataset Overview

The dataset represents 183412 trips taken by members of the Ford Go Bike service during February 2019.
The data includes information about each trip: the user's type, age, and gender, the start and end stations, the trip duration, etc.

## (Visualization 1)

The following visualization consists of two graphs. The first, a pie chart, shows the gender types in our dataset with the percentage of each. The second, a box plot, shows the age quartile distribution for each gender type.

In [None]:
plt.figure(figsize = [10, 6])

plt.subplot(1,2,1)
sorted_gender = df_final['member_gender'].value_counts()
plt.pie(sorted_gender, labels=sorted_gender.index, startangle = 90, autopct = '%1.1f%%')
plt.axis('square')
plt.title('Gender Types Pie chart')

plt.subplot(1,2,2)
sb.boxplot(data = df_final, x = 'member_gender', y = 'member_age', color = base_color, width = 0.4)
plt.ylim((15,60))
plt.title("Age distribution for each Gender Type");

## (Visualization 2)
The plot illustrates the relation between age and ride duration in minutes

In [None]:
plt.figure(figsize = [12,5])
plt.scatter(x = df_final['member_age'], y = df_final['duration_min'], alpha = 1/10)
plt.xlabel('Age')
plt.ylabel ('Ride duration in Min')
plt.xlim((15,60))
plt.title('Age VS Ride Duration in Minutes');

## (Visualization 3)
The plot shows the top ten start stations and top ten end stations of our dataset

In [None]:
base_color = sb.color_palette()[0]
plt.figure(figsize = [10, 10])
plt.subplot(2,1,1)
start_order = df_final['start_station_name'].value_counts()[:10]
df_start = df_final.loc[df_final['start_station_name'].isin(start_order.index)]
sb.countplot(data = df_start, y = 'start_station_name', order = start_order.index, color = base_color)
plt.title('Top Ten Start Stations')
plt.subplot(2,1,2)
end_order = df_final['end_station_name'].value_counts()[:10]
df_end = df_final.loc[df_final['end_station_name'].isin(end_order.index)]
sb.countplot(data = df_end, y = 'end_station_name', order = end_order.index, color = base_color)
plt.title('Top Ten End Stations');