# Communicate Data Findings Project: Ford GoBike System

## Table of contents 
<ul> 
    <li><a href='#intro'>1. Introduction</a></li> 
    <li><a href='#overview'>2. DataSet General Overview</a></li>  
    <li><a href='#inquiry'>3. Important Inquiries For Research</a></li> 
    <li><a href='#clean'>4. Data Cleaning & Wrangling</a></li> 
    <li><a href='#explore'>5. Exploratory Data Analysis & Visualizations</a></li> 
    <li><a href='#final'>6. Final Report + Explanatory Data Visualization</a></li> 
</ul>

<a id='intro'></a>
## 1. Introduction

My chosen dataset contains over 180K records of individual rides made in a bike-sharing system covering the greater San Francisco Bay Area in February 2019. 

This notebook aims to give an overview of the dataset and answer important questions about bike trips (count, time, users, etc.), which are addressed throughout the notebook via exploratory & explanatory data analysis & visualizations.

The code uses Python data analysis libraries (Pandas, Matplotlib, Seaborn) for data wrangling, cleaning, and creating storytelling visualizations. 

<a id='overview'></a>
## 2. DataSet General Overview 

We start with an overview of the dataset to identify whether the data structure itself needs modification or whether further wrangling and cleaning are needed. 

In [None]:
# Import the necessary libraries 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline 
# Import other libraries 
import time, os, warnings 
# Load the data & quick overview 
df = pd.read_csv('../input/ford-gobike-2019feb-tripdata/201902-fordgobike-tripdata.csv')
df.head(3)

In [None]:
# More Viewing of the data 
df.info()

Some null values can be seen in the member_birth_year and member_gender columns and in the start and end station columns. 
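
The null counts can also be read off directly instead of from the `df.info()` output. The following is a minimal sketch on a hypothetical miniature of the trip table; the `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the trip table; all values are made up.
sample = pd.DataFrame({
    'duration_sec': [520, 1830, 610, 445],
    'start_station_name': ['Stop A', None, 'Stop B', 'Stop A'],
    'member_birth_year': [1984.0, np.nan, 1990.0, np.nan],
    'member_gender': ['Male', None, 'Female', None],
})

# Count missing values per column, then express them as a share of all rows.
null_counts = sample.isna().sum()
null_share = (null_counts / len(sample) * 100).round(1)
print(pd.DataFrame({'missing': null_counts, 'percent': null_share}))
```

Running the same two lines on `df` would show how many rows each column would cost if rows with missing values were dropped.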

In [None]:
df.describe()

In [None]:
# Check for outliers in birth year column 
sns.boxplot(x=df['member_birth_year']);

In [None]:
# Check for outliers in duration column 
sns.boxplot(x=df['duration_sec']);

##### Outliers in the duration column are kept, because removing them would bias the conclusions of the exploratory analysis 
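
One way to see how many long rides are involved without dropping them is to flag values above the usual boxplot fence (Q3 + 1.5 × IQR). This is only a sketch under assumptions: the `durations` series is hypothetical, and the 1.5 × IQR rule is a common convention rather than something the notebook itself applies.

```python
import pandas as pd

# Hypothetical trip durations in seconds, right-skewed like typical ride data.
durations = pd.Series([300, 420, 480, 510, 600, 660, 720, 900, 4200, 86000])

# Upper boxplot fence: third quartile plus 1.5 times the interquartile range.
q1, q3 = durations.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Flag, rather than drop, the rides above the fence.
is_long_ride = durations > upper_fence
print(f'{is_long_ride.sum()} of {len(durations)} rides exceed {upper_fence:.0f} s')
```

Keeping a boolean flag like `is_long_ride` leaves the original rows in place, so later plots can include or exclude them as needed.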

In [None]:
# check for proportion of people born before 1920 
born_before_1920 = len(df[df['member_birth_year']<1920])
before_1920_percent = born_before_1920/len(df)
print (f'People who were born before 1920 represent {(before_1920_percent*100):.2f} %')

In [None]:
# Check for duplicate rows and drop them if any are found 
def dup_checkremove(df): 
    df_nodup = df.drop_duplicates()
    len_dup = len(df)-len(df_nodup)
    if len_dup > 0 : 
        print (f'There are {len_dup} duplicate rows')
        print ('Removing duplicate rows......')
    else : 
        print ('There are no duplicate rows .. you may proceed')
    # Return the result: reassigning df inside the function does not change the caller's frame
    return df_nodup
df = dup_checkremove(df)
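
The same check can be written more compactly with `DataFrame.duplicated`. A minimal sketch on a hypothetical frame (the `sample` values are made up for illustration):

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first one exactly.
sample = pd.DataFrame({
    'bike_id': [4902, 2535, 4902, 5905],
    'duration_sec': [520, 1830, 520, 610],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(f'{n_duplicates} duplicate row(s); {len(deduplicated)} rows remain')
```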

In [None]:
# Check for values in columns with qualitative data 
qualitative_cols = ['start_station_name', 'end_station_name', 'member_gender', 'bike_share_for_all_trip', 'user_type']
for col in qualitative_cols : 
    print (df[col].unique())

This raises a serious question: could one station id have different station names? Let's find out.

In [None]:
# Checking if each station id only has one station name or more 
start_st_ids = df['start_station_id'].unique()
ids_multinames = []
for i in start_st_ids : 
    station_names = df[df['start_station_id']==i]['start_station_name'].unique()
    if len(station_names) > 1:
        ids_multinames.append(i)
end_st_ids = df['end_station_id'].unique()
for i in end_st_ids : 
    station_names = df[df['end_station_id']==i]['end_station_name'].unique()
    if len(station_names) > 1:
        ids_multinames.append(i)
if len(ids_multinames) > 0 :
    print (f'Some stations have more than one name as in the following ids {ids_multinames}')
else : 
    print ('All station ids have only one name .. you may proceed')
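
The loop above can also be expressed as a single `groupby` with `nunique`, which avoids filtering the frame once per station id. This is a sketch on a hypothetical frame; the station ids and names in `stations` are invented for illustration.

```python
import pandas as pd

# Hypothetical station records; id 23.0 appears under two different names.
stations = pd.DataFrame({
    'start_station_id': [21.0, 21.0, 23.0, 23.0, 86.0],
    'start_station_name': ['Stop A', 'Stop A', 'Stop B', 'Stop B (temp)', 'Stop C'],
})

# Number of distinct names recorded for each station id.
names_per_id = stations.groupby('start_station_id')['start_station_name'].nunique()

# Keep only the ids that map to more than one name.
ids_multinames = names_per_id[names_per_id > 1].index.tolist()
print(ids_multinames)
```

The same two lines applied to the end station columns would cover the second loop.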

#### From the previous overview, the following needs to be done in the wrangling & cleaning step: 
1. No duplicate values were found, so no action is needed. 
2. Missing values are found in the gender, birth year, and start & end station columns; all rows containing them will be removed. 
3. The start and end time columns need to be converted to the datetime data type. 
4. New columns for day of week, day of month & hour will be created for better insight into the data. 
5. A new column showing age will be created from the birth year column. 
6. The gender column contains the value 'Other'; these records will also be removed. 
7. Outliers will be removed (e.g. people born before 1920, which affects the age column). 
8. Columns that are unnecessary for the analysis will be dropped. 

<a id='inquiry'></a>
## 3. Important Inquiries For Research 

#### 1. What are the main factors affecting the number & duration of bike trips? 
        1. Location                            2. Age
        3. Gender                              4. User type
        5. Bike Share status                   6. Time 

<a id='clean'></a>
## 4. Data Cleaning & Wrangling 

In [None]:
# Remove all missing values 
df = df.dropna()
df.info()

In [None]:
# modify datetime columns 
time_cols = ['start_time','end_time']
for col in time_cols: 
    df[col] = pd.to_datetime(df[col])
df.info()

In [None]:
# create new column (day of week, day, hour)
df['start_day'] = df['start_time'].dt.day
df['start_day_of_week'] = df['start_time'].dt.day_name()
df['start_hour'] = df['start_time'].dt.hour
df['end_day'] = df['end_time'].dt.day
df['end_day_of_week'] = df['end_time'].dt.day_name()
df['end_hour'] = df['end_time'].dt.hour
df.info()
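
To make the derived columns concrete, here is a small worked example on two hypothetical timestamps. The `trips` frame and the timestamp strings are assumptions made for illustration; only the `pd.to_datetime` and `.dt` calls mirror the cells above.

```python
import pandas as pd

# Two hypothetical start times stored as text, as they would be after read_csv.
trips = pd.DataFrame({
    'start_time': ['2019-02-28 17:32:10.1450', '2019-02-01 08:05:00.0000'],
})

# Convert the text to datetime64, then derive the day, weekday name and hour.
trips['start_time'] = pd.to_datetime(trips['start_time'])
trips['start_day'] = trips['start_time'].dt.day
trips['start_day_of_week'] = trips['start_time'].dt.day_name()
trips['start_hour'] = trips['start_time'].dt.hour
print(trips)
```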

In [None]:
# Create new age column 
df['age'] = 2019 - df['member_birth_year']
df.info()

In [None]:
# Drop odd values in gender column 
ids = df[df['member_gender']=='Other'].index
df = df.drop(index=ids)
df.info()

In [None]:
# Remove outliers in Birth year column 
ids = df[df['member_birth_year']<1920].index
df = df.drop(index=ids)
df.describe()

In [None]:
# Drop the unnecessary columns 
df.columns 

In [None]:
unnecessary_cols = ['start_time', 'end_time', 'start_station_id', 'start_station_name','end_station_name','start_station_latitude', 'start_station_longitude',
                    'end_station_id', 'end_station_latitude', 'end_station_longitude','member_birth_year']
df = df.drop(columns=unnecessary_cols)
df.info()

<a id='explore'></a>
## 5. Exploratory Data Analysis & Visualizations 

#### First, we look at all columns to see the effect of different features on the number of bike trips 

In [None]:
warnings.filterwarnings('ignore')
nrows, ncols = 3, 4
cols = df.columns 
color = sns.color_palette()[0]
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20,10))
for i in range(nrows):
    for j in range(ncols): 
        col = cols[i*ncols+j]
        if df[col].dtype=='O':
            # Categorical column: bar chart of trip counts per category
            sns.countplot(x=df[col], color=color, ax=ax[i,j])
        else : 
            # Numeric column: histogram with a KDE overlay (histplot replaces the deprecated distplot)
            sns.histplot(x=df[col], color=color, ax=ax[i,j], kde=True)

In [None]:
# Get proportion of different categories relative to total trips 
male_prop = len(df[df['member_gender']=='Male'])/len(df) 
sub_prop = len(df[df['user_type']=='Subscriber'])/len(df)
no_prop = len(df[df['bike_share_for_all_trip']=='No'])/len(df)
# Share of trips on Thursday and Friday; both are working days in the Bay Area
thu_fri_prop = len(df[df['start_day_of_week'].isin(['Thursday', 'Friday'])])/len(df)
print (f'Male trips represent {(male_prop*100):.2f} % of the total trips')
print (f'Subscriber trips represent {(sub_prop*100):.2f} % of the total trips')
print (f'Trips with bike_share_for_all_trip set to No represent {(no_prop*100):.2f} % of the total trips')
print (f'Thursday and Friday trips represent {(thu_fri_prop*100):.2f} % of the total trips')
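
The shares of every category in a column can be obtained in one call with `value_counts(normalize=True)`, instead of one filter per category. A minimal sketch on a hypothetical frame (the `sample` values are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical columns from four trips.
sample = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Customer', 'Subscriber'],
    'member_gender': ['Male', 'Female', 'Male', 'Male'],
})

# Percentage of trips per category, for each column of interest.
shares = {}
for col in ['user_type', 'member_gender']:
    shares[col] = sample[col].value_counts(normalize=True).mul(100).round(2)
    print(shares[col].to_string(), end='\n\n')
```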

##### Get an overview of the relationship between different quantitative features and trip duration 

In [None]:
# Define Quantitative variables for plotting 
quantitative_cols = []
for col in df.columns : 
    if df[col].dtype != 'O':
        quantitative_cols.append(col)
# Skip the first numeric column (duration_sec), which is plotted on the y-axis
quantitative_cols = quantitative_cols[1:]
quantitative_cols

In [None]:
# Create Scatter plots to get relationship between quantitative variables and trip duration
nrows, ncols = 2, 3 
cols = quantitative_cols
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20,10))
for i in range(nrows):
    for j in range(ncols): 
        sns.scatterplot(data=df,y=df['duration_sec'],x=df[cols[i*ncols+j]],color=color, ax = ax[i,j],alpha=0.5)

##### Finally, check the effect of different features on bike trip count & duration in relation to gender

In [None]:
# Create Scatter plots to get relationship between quantitative variables and trip duration in relation to gender
nrows, ncols = 2, 3 
cols = quantitative_cols
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20,10))
for i in range(nrows):
    for j in range(ncols): 
        sns.scatterplot(data=df,y=df['duration_sec'],x=df[cols[i*ncols+j]],color=color, ax = ax[i,j],alpha=0.5
                       , hue='member_gender', legend='full')

In [None]:
# Overview of selected columns to see the effect of different features on the number of bike trips by gender
cols = ['duration_sec','start_day','start_day_of_week','start_hour','age','bike_id']
warnings.filterwarnings('ignore')
nrows, ncols = 2, 3
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20,10))
for i in range(nrows):
    for j in range(ncols): 
        col = cols[i*ncols+j]
        # hue assigns one color per gender, so no single color is passed
        if df[col].dtype=='O':
            sns.countplot(data=df, x=col, ax=ax[i,j], hue='member_gender')
        else : 
            sns.histplot(data=df, x=col, ax=ax[i,j], kde=True, hue='member_gender')

##### Check trip durations on different days of the week 

In [None]:
# Group by the start day 
plt.title('Mean Trip Duration in Different Days of The Week (Sec)')
plt.ylabel('Trip Duration (Sec)')
df.groupby(['start_day_of_week'])['duration_sec'].mean().plot(kind='bar');

In [None]:
# Group by the end day 
plt.title('Mean Trip Duration by End Day of The Week (Sec)')
plt.ylabel('Trip Duration (Sec)')
df.groupby(['end_day_of_week'])['duration_sec'].mean().plot(kind='bar');

In [None]:
# Group by different days of the month
plt.title('Mean Trip Duration in Different Days of The Month (Sec)')
plt.ylabel('Trip Duration (Sec)')
df.groupby(['start_day'])['duration_sec'].mean().plot(kind='bar');

<a id='final'></a>
## 6. Final Report + Explanatory Data Visualization

### What are the most common times for bike trips? Is there a difference between genders? 
The following graphs show the most common days & hours for bike trips for both genders.

In [None]:
# Plot for most common hours for having bike trips in relation to gender
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.histplot(data=df, x=df['start_hour'], hue='member_gender', kde=True)
plt.title('Most Common Hours For Having Bike Trips', fontsize=20)
plt.ylabel('Bike Trips', fontsize=14)
plt.xlabel('Start Hour of Trip', fontsize=14); 

In [None]:
# Plot the most common days to have bike trips in relation to gender 
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.histplot(data=df, x=df['start_day'], hue='member_gender', kde=True)
plt.title('Most Common Days of Month For Having Bike Trips', fontsize=20)
plt.ylabel('Bike Trips', fontsize=14)
plt.xlabel('Start Day of Trip', fontsize=14); 

### Why do certain days have higher trip counts? Let's check the days of the week. 

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.histplot(data=df, x=df['start_day_of_week'], hue='member_gender')
plt.title('Most Common Days of Week For Having Bike Trips', fontsize=20)
plt.ylabel('Bike Trips', fontsize=14)
plt.xlabel('Start Weekday of Trip', fontsize=14); 

### What are the main age range & gender of bike trip users? 

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.histplot(data=df, x=df['age'], hue='member_gender', kde=True)
plt.title('Most Common Age For Having Bike Trips', fontsize=20)
plt.ylabel('Bike Trips', fontsize=14)
plt.xlabel('Age', fontsize=14); 

### What is the relationship between age & trip duration? 

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.scatterplot(data=df, y=df['duration_sec'], x=df['age'], hue='member_gender', alpha=0.5)
plt.title('Relationship Between Age & Trip Duration', fontsize=20)
plt.ylabel('Trip Duration (Sec)', fontsize=14)
plt.xlabel('Age', fontsize=14); 

### Does the day of the week affect trip duration? Let's check that too.

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
df_dur = df.groupby('start_day_of_week')['duration_sec'].mean()[['Saturday', 'Sunday','Monday','Tuesday','Wednesday','Thursday','Friday']]
ax = df_dur.plot(kind='bar')
plt.title('Average Trip Duration in Different Days of The Week', fontsize=20)
plt.xlabel('Weekday', fontsize=14)
plt.ylabel('Trip Duration (Sec)', fontsize=14);
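
An alternative to reordering the grouped result by hand is to store the weekday as an ordered categorical, so that every later `groupby` or plot follows the same order. This is a sketch on hypothetical data; the `sample` values are made up, and the Saturday-first `week_order` simply copies the order used in the cell above.

```python
import pandas as pd

# Hypothetical trips with a weekday label and a duration in seconds.
sample = pd.DataFrame({
    'start_day_of_week': ['Monday', 'Sunday', 'Monday', 'Friday', 'Sunday'],
    'duration_sec': [600, 900, 400, 300, 1100],
})

# Same Saturday-first order as in the plot above.
week_order = ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
sample['start_day_of_week'] = pd.Categorical(
    sample['start_day_of_week'], categories=week_order, ordered=True
)

# The grouped result is sorted by category order, not alphabetically.
mean_duration = sample.groupby('start_day_of_week', observed=True)['duration_sec'].mean()
print(mean_duration)
```

With `observed=True`, days that do not occur in the data are left out of the result instead of appearing as empty groups.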

#### Findings in the data wrangling & cleaning process 
1. Missing values were found and the affected rows were removed, as they were not considered statistically significant. 
2. The start and end time columns were converted to the datetime type.
3. New columns were created from the datetime columns (showing day, day of week, hour) to give more insight. The datetime columns were removed afterwards. 
4. A new column was created to show age instead of birth year, and the birth year column was removed. 
5. Outliers in the age column (birth years before 1920) were checked and removed, as they were not considered statistically significant. 
6. All unnecessary columns were removed to focus on the significant data columns. 

#### Findings from the exploratory visualizations 
1. Most trips are made on bikes with higher ids, which may indicate that riders prefer newer bikes, assuming higher ids correspond to newer bikes. 
2. Most users are subscribers: 90.6 % of all trips are made by subscribers, suggesting that people tend to use the service on a consistent basis and subscribe. 
3. Males account for around 76.2 % of all trips, which may indicate that females are less likely to choose bikes as a go-to for workouts. The visualizations show no difference between male and female activity in usage days, hours, or trip duration.  
4. Most riders are between 25 and 35 years old, a strong indication that young adults are more interested in bike trips, even more than kids, which is a somewhat counterintuitive finding. 
5. Thursday and Friday together account for about 35 % of all trips. Both are working days in the San Francisco Bay Area, where the weekend falls on Saturday and Sunday, so this points to regular weekday usage rather than entertainment. 
6. Trips on Sunday and Monday tend to last longer. Since Sunday is a weekend day in the Bay Area, these longer trips cannot be attributed to regular working-day usage alone. 
7. Rush hours for bike trips run from around 8 AM to 6 PM, which is expected whether riders use bikes for entertainment or regular usage. 
8. Trip duration is highest in the 20 to 40 age range, as expected since these riders make up most of the users. 
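
A finding such as the last one mixes two quantities: how many trips an age group makes and how long those trips are. One way to separate them is to bin ages with `pd.cut` and report the count and median duration per band. The following is only a sketch: the `sample` values, the bin edges and the band labels are hypothetical and do not reproduce the notebook's results.

```python
import pandas as pd

# Hypothetical riders with an age and a trip duration in seconds.
sample = pd.DataFrame({
    'age': [19, 24, 31, 38, 45, 52, 67],
    'duration_sec': [700, 650, 600, 620, 580, 540, 500],
})

# Left-closed age bands: [0, 20), [20, 40), [40, 60), [60, 120).
bins = [0, 20, 40, 60, 120]
labels = ['<20', '20-39', '40-59', '60+']
sample['age_band'] = pd.cut(sample['age'], bins=bins, labels=labels, right=False)

# Trip count and median duration per band, side by side.
summary = sample.groupby('age_band', observed=True)['duration_sec'].agg(['count', 'median'])
print(summary)
```

Reading the two columns together shows whether a band stands out because it has many trips, long trips, or both.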