
# **Data Exploration Ford GoBike Project**

# **Table of Contents**
---
* Preliminary Wrangling
  1. Data Gathering & Assessing
  2. Data Cleaning
* Data Exploration
  1. Univariate Exploration
  2. Bivariate Exploration
  3. Multivariate Exploration

# **Preliminary Wrangling**

The Ford GoBike System Dataset includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area in 2017.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import requests
import datetime
import os

sb.set_style('darkgrid')
%matplotlib inline

# Input data files are available in the read-only "../input/" directory.
# os.walk() expects a directory, so walk the dataset folder rather than the CSV file itself.
for dirname, _, filenames in os.walk('/kaggle/input/2017-ford-gobike-ridedata'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **Data Gathering & Assessing**

I chose the 2017 Ford GoBike System Dataset, which consists of six months of bike ride data (Jun-Dec).

In [3]:
# read data from the CSV file and load it into a DataFrame
ford_df = pd.read_csv('2017-fordgobike-tripdata.csv')
ford_df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776435,-122.426244,43,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96,Customer,1987.0,Male
1,78800,2017-12-31 15:56:34.8420,2018-01-01 13:49:55.6170,284,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,96,Dolores St at 15th St,37.76621,-122.426614,88,Customer,1965.0,Female
2,45768,2017-12-31 22:45:48.4110,2018-01-01 11:28:36.8830,245,Downtown Berkeley BART,37.870348,-122.267764,245,Downtown Berkeley BART,37.870348,-122.267764,1094,Customer,,
3,62172,2017-12-31 17:31:10.6360,2018-01-01 10:47:23.5310,60,8th St at Ringold St,37.77452,-122.409449,5,Powell St BART Station (Market St at 5th St),37.783899,-122.408445,2831,Customer,,
4,43603,2017-12-31 14:23:14.0010,2018-01-01 02:29:57.5710,239,Bancroft Way at Telegraph Ave,37.868813,-122.258764,247,Fulton St at Bancroft Way,37.867789,-122.265896,3167,Subscriber,1997.0,Female


In [4]:
# high-level overview of data shape and composition
ford_df.shape

(519700, 15)

In [5]:
ford_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 519700 entries, 0 to 519699
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             519700 non-null  int64  
 1   start_time               519700 non-null  object 
 2   end_time                 519700 non-null  object 
 3   start_station_id         519700 non-null  int64  
 4   start_station_name       519700 non-null  object 
 5   start_station_latitude   519700 non-null  float64
 6   start_station_longitude  519700 non-null  float64
 7   end_station_id           519700 non-null  int64  
 8   end_station_name         519700 non-null  object 
 9   end_station_latitude     519700 non-null  float64
 10  end_station_longitude    519700 non-null  float64
 11  bike_id                  519700 non-null  int64  
 12  user_type                519700 non-null  object 
 13  member_birth_year        453159 non-null  float64
 14  memb

In [6]:
#check for duplicated values
ford_df.duplicated().sum()

0

In [7]:
#check for null values
ford_df.isna().sum()

duration_sec                   0
start_time                     0
end_time                       0
start_station_id               0
start_station_name             0
start_station_latitude         0
start_station_longitude        0
end_station_id                 0
end_station_name               0
end_station_latitude           0
end_station_longitude          0
bike_id                        0
user_type                      0
member_birth_year          66541
member_gender              66462
dtype: int64

**Observations:**

* Missing values in the dataset (member_birth_year and member_gender)
* Erroneous data types:
  - start_time and end_time are of type object instead of datetime
  - member_birth_year should be of type int
  - start_station_id, end_station_id, and bike_id can be of type str
  - user_type and member_gender can be of type category
* Unwanted columns
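
The first two observations can be checked in one pass by looking at the share of missing values and the dtype of each column. The following is a minimal sketch on a made-up frame; `toy` and its values are hypothetical stand-ins, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for ford_df; all values are made up
toy = pd.DataFrame({
    'duration_sec': [600, 1200, 300, 900],
    'member_birth_year': [1987.0, np.nan, 1990.0, np.nan],
    'member_gender': ['Male', None, 'Female', 'Female'],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Current dtypes, to spot columns that need conversion
print(toy.dtypes)
```

Expressing the gaps as a percentage makes it easier to judge whether dropping the affected rows is acceptable.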

# **Data Cleaning**

In [8]:
# Before cleaning, make a copy of the original dataset
ford_clean = ford_df.copy()

**Issue 1: Missing values in the dataset**

**Define** Drop the rows that contain null values from the dataset

In [9]:
#drop the null values from dataset
ford_clean.dropna(how = 'any', axis = 0, inplace=True)

**Test**

In [10]:
# check whether null values are dropped
ford_clean.isna().sum()

duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
dtype: int64
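
Dropping rows also shrinks the dataset, so it can be useful to record how many rows were removed. The sketch below uses a hypothetical `toy` frame with made-up values; only the `dropna` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; all values are made up
toy = pd.DataFrame({
    'duration_sec': [600, 1200, 300, 900],
    'member_birth_year': [1987.0, np.nan, 1990.0, np.nan],
    'member_gender': ['Male', None, 'Female', 'Female'],
})

rows_before = len(toy)

# Same call as in the notebook, but returning a new frame instead of inplace
toy_clean = toy.dropna(how='any', axis=0)

rows_dropped = rows_before - len(toy_clean)
print(f'dropped {rows_dropped} of {rows_before} rows')

# The original index labels are kept, so the index now has gaps
print(toy_clean.index.tolist())
```

Because `dropna` keeps the original index labels, the cleaned frame later reports an index running from 0 to 519699 even though it holds only 453159 rows.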

**Issue 2: Erroneous data types**

**Define** 

* convert start_time and end_time types from object to datetime
* convert member_birth_year type from float to int
* convert start_station_id, end_station_id, and bike_id to str type
* convert user_type and member_gender to type of category

In [11]:
# convert the data type of start_time and end_time to datetime.
ford_clean.start_time = pd.to_datetime(ford_clean.start_time)
ford_clean.end_time = pd.to_datetime(ford_clean.end_time)

# convert member_birth_year from float64 to int
ford_clean.member_birth_year = ford_clean.member_birth_year.astype(int)

# convert ids from int64 to str
ford_clean.start_station_id = ford_clean.start_station_id.astype(str)
ford_clean.end_station_id = ford_clean.end_station_id.astype(str)
ford_clean.bike_id = ford_clean.bike_id.astype(str)


# convert user_type and member_gender datatype into category
ford_clean.user_type = ford_clean.user_type.astype('category')
ford_clean.member_gender = ford_clean.member_gender.astype('category')
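
The order of the cleaning steps matters here: a float column that still contains NaN cannot be cast to a plain integer type, which is why the null rows are dropped before this conversion. A small sketch, assuming a hypothetical `birth_year` series with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; values are made up
birth_year = pd.Series([1987.0, np.nan, 1990.0], name='member_birth_year')

# Casting to int while NaN is still present raises a ValueError
try:
    birth_year.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# After dropping the missing entries the cast succeeds
birth_year_int = birth_year.dropna().astype(int)
print(cast_failed, birth_year_int.tolist())
```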

**Test**

In [12]:
# check whether the data types were converted correctly
ford_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 453159 entries, 0 to 519699
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             453159 non-null  int64         
 1   start_time               453159 non-null  datetime64[ns]
 2   end_time                 453159 non-null  datetime64[ns]
 3   start_station_id         453159 non-null  object        
 4   start_station_name       453159 non-null  object        
 5   start_station_latitude   453159 non-null  float64       
 6   start_station_longitude  453159 non-null  float64       
 7   end_station_id           453159 non-null  object        
 8   end_station_name         453159 non-null  object        
 9   end_station_latitude     453159 non-null  float64       
 10  end_station_longitude    453159 non-null  float64       
 11  bike_id                  453159 non-null  object        
 12  user_type       

**Issue 3: Unwanted Columns**

**Define** Drop the unused columns using the drop() method

In [13]:
# drop the unused columns
ford_clean.drop(['start_station_latitude','start_station_longitude','end_station_latitude','end_station_longitude'], axis = 1 , inplace = True)

**Test**

In [14]:
# check whether columns are dropped
ford_clean.columns

Index(['duration_sec', 'start_time', 'end_time', 'start_station_id',
       'start_station_name', 'end_station_id', 'end_station_name', 'bike_id',
       'user_type', 'member_birth_year', 'member_gender'],
      dtype='object')

**Feature Engineering**

To make the analysis easier, let's extract the hour of the day, the day of the week, and the month of the year from start_time.

In [15]:
# use the strftime() method to extract the time components
ford_clean['start_day'] = ford_clean['start_time'].apply(lambda x: x.strftime('%A')).astype('category')
ford_clean['start_month'] = ford_clean['start_time'].apply(lambda x: x.strftime('%b')).astype('category')
ford_clean['start_hour'] = ford_clean['start_time'].apply(lambda x: x.strftime('%H')).astype(int)

# Let's calculate the member age from the birth year
ford_clean['member_age'] = 2017 - ford_clean['member_birth_year']
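
The same features can also be derived with the vectorized `.dt` accessor instead of `apply` with a lambda. The sketch below is one possible alternative, not the notebook's method. The `toy` frame is hypothetical, and the ordered weekday dtype (`day_dtype`) is an added assumption that keeps days in calendar order when they are later sorted or plotted.

```python
import pandas as pd

# Hypothetical toy frame; timestamps and birth years mirror the two sample rows
# shown later in the notebook output
toy = pd.DataFrame({
    'start_time': pd.to_datetime(['2017-11-28 08:01:31.757',
                                  '2017-07-22 11:35:33.132']),
    'member_birth_year': [1977, 1987],
})

# Assumption: an ordered categorical so weekdays sort Monday..Sunday
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
day_dtype = pd.CategoricalDtype(categories=day_order, ordered=True)

toy['start_day'] = toy['start_time'].dt.day_name().astype(day_dtype)
toy['start_month'] = toy['start_time'].dt.strftime('%b').astype('category')
toy['start_hour'] = toy['start_time'].dt.hour
toy['member_age'] = 2017 - toy['member_birth_year']

print(toy[['start_day', 'start_month', 'start_hour', 'member_age']])
```

On a frame with several hundred thousand rows, the accessor avoids calling a Python function once per row.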

In [16]:
# Brief summary of cleaned DataFrame
ford_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 453159 entries, 0 to 519699
Data columns (total 15 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   duration_sec        453159 non-null  int64         
 1   start_time          453159 non-null  datetime64[ns]
 2   end_time            453159 non-null  datetime64[ns]
 3   start_station_id    453159 non-null  object        
 4   start_station_name  453159 non-null  object        
 5   end_station_id      453159 non-null  object        
 6   end_station_name    453159 non-null  object        
 7   bike_id             453159 non-null  object        
 8   user_type           453159 non-null  category      
 9   member_birth_year   453159 non-null  int64         
 10  member_gender       453159 non-null  category      
 11  start_day           453159 non-null  category      
 12  start_month         453159 non-null  category      
 13  start_hour          453159 no

**Storing the cleaned dataset:**

In [17]:
ford_clean.to_csv('cleaned-fordgobike-tripdata-2017.csv', encoding='utf-8', index=False)

In [19]:
# Read data from csv file and Load it into dataframe
df = pd.read_csv('cleaned-fordgobike-tripdata-2017.csv')
df.sample(2)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,end_station_id,end_station_name,bike_id,user_type,member_birth_year,member_gender,start_day,start_month,start_hour,member_age
89777,465,2017-11-28 08:01:31.757,2017-11-28 08:09:17.044,323,Broadway at Kearny,23,The Embarcadero at Steuart St,2218,Subscriber,1977,Male,Tuesday,Nov,8,40
429196,1581,2017-07-22 11:35:33.132,2017-07-22 12:01:54.136,81,Berry St at 4th St,28,The Embarcadero at Bryant St,2028,Subscriber,1987,Male,Saturday,Jul,11,30
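
A CSV file does not store pandas dtypes, so the datetime, str, and category conversions made during cleaning are lost when the file is read back. One way to restore them at load time is sketched below. The in-memory `csv_text` is a hypothetical, shortened stand-in for the saved file (its two rows mirror the sample above), and the chosen `dtype` mapping is an assumption about which columns to restore.

```python
import io
import pandas as pd

# Hypothetical, shortened stand-in for the cleaned CSV file
csv_text = (
    'duration_sec,start_time,end_time,start_station_id,bike_id,user_type\n'
    '465,2017-11-28 08:01:31.757,2017-11-28 08:09:17.044,323,2218,Subscriber\n'
    '1581,2017-07-22 11:35:33.132,2017-07-22 12:01:54.136,81,2028,Subscriber\n'
)

# Restore datetime, string and category types while reading
toy = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['start_time', 'end_time'],
    dtype={'start_station_id': str, 'bike_id': str, 'user_type': 'category'},
)
print(toy.dtypes)
```

With the real file, the `io.StringIO(csv_text)` argument would be replaced by the file name used in the `to_csv` call.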


# **Data Exploration**

**What is the structure of your dataset?**



> The dataset consists of 453159 bike ride entries and 15 features

**What is/are the main feature(s) of interest in your dataset?**



> The features of most interest are the ride start time (in terms of month of the year, day of the week, and hour of the day) and the ride duration. Do these depend on the characteristics of the riders?



**What features in the dataset do you think will help support your investigation into your feature(s) of interest?**



> I expect the start-time features (start_month, start_day, start_hour) and duration_sec to depend strongly on the following rider characteristics:

* *user_type*
* *member_gender*
* *member_age*



# **Univariate Exploration**

In [20]:
# set the default color
default_color = sb.color_palette()[0]