In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Cyclistic: Bike-Share Case Study**

## **Overview:**
This case study is provided by the [Google Data Analytics Professional Certificate](https://www.coursera.org/professional-certificates/google-data-analytics) on [Coursera](https://www.coursera.org/). Cyclistic is a fictional bike-share company founded in 2016. Cyclistic users are more likely to ride for leisure, while about 30% of users use the bike-share to commute. Because annual members are more profitable, the goal is to *convert casual users into annual members*.<br/> 
The historical trip data used in this case study is provided by Motivate International Inc. under this [license](https://www.divvybikes.com/data-license-agreement).  



## **Ask:**

#### Business Task:
Analyze the trip data to identify how casual users and annual members use the bike-share differently. 
#### Key Stakeholders:
Lily Moreno: director of marketing  
Marketing analytics team  
Executive team  



## **Prepare:**  
**Download, Storage, and Security:**  
The original historical data is downloaded to the local computer in zip format.
The dataset is extracted locally and then uploaded to Kaggle, BigQuery, and Tableau after filtering and sorting. 
All the data is public and does not include personal information.      
**Description of data:**  
* The previous 12 months of data were downloaded, covering Jun 2020 - May 2021.
* Trip data: there are 15 columns for each table

| Field Name | Type | Description |
|-------- |-----|-----|
| ride_id | String | ID for each ride |
| rideable_type | String | bike type |
| started_at | Timestamp | start time |
| ended_at | Timestamp | end time |
| start_station_name | String | start station name |
| start_station_id | Int | start station ID |
| end_station_name | String | end station name |
| end_station_id | Int | end station ID |
| start_lat | Float | start latitude |
| start_lng | Float | start longitude |
| end_lat | Float | end latitude |
| end_lng | Float | end longitude |
| member_casual | String | identifies whether the rider is a member or a casual user |  

**Organization of data:**   

* All datasets are organized as CSV files
* Every file is renamed in snake_case using the format 'divvy_trips_yyyymm'  
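
As a small illustration of this naming scheme, the sketch below generates the 12 file names for Jun 2020 - May 2021. The helper `monthly_table_names` is hypothetical and not part of the original workflow; it only shows how the 'divvy_trips_yyyymm' pattern rolls over the year boundary.

```python
def monthly_table_names(start_year, start_month, months, prefix="divvy_trips_"):
    """Build names like 'divvy_trips_202006' for consecutive months."""
    names = []
    year, month = start_year, start_month
    for _ in range(months):
        names.append(f"{prefix}{year}{month:02d}")
        month += 1
        if month > 12:  # roll over from December to January
            year, month = year + 1, 1
    return names


names = monthly_table_names(2020, 6, 12)
print(names[0], names[-1])
```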

**Credibility of data:**  
The data is first-party data from [Divvy bike-share](https://www.divvybikes.com), so it is considered credible.




## **Process:**  
In this case study, I chose Google BigQuery and its SQL dialect to process the data. 

**Check data for errors:**  
 >To ensure the integrity of every table, I checked each table repeatedly before merging.
1. Check for duplicates: count how many times each ride ID appears in the table. If the query returns no rows, there are no duplicate IDs. The following sample runs on the June 2020 data. *No table contains duplicates.*
```SQL
SELECT ride_id, COUNT(*)
FROM `cyclisticcasestudy.tripData.2020_06` 
GROUP BY ride_id
HAVING COUNT(*) >1
```
2. Check integrity of member/casual inputs: the member_casual field should contain only NULL, member, or casual, indicating the rider's status. If the query returns no rows, there is no invalid input in the member_casual field. The following sample runs on the June 2020 data. *No table contains invalid input in member_casual.*   
```SQL
SELECT ride_id, member_casual
FROM `cyclisticcasestudy.tripData.2020_06` 
WHERE member_casual <> 'casual'
AND member_casual <> 'member'
AND member_casual IS NOT NULL;
```
3. Check integrity of rideable_type inputs: similar to member/casual, rideable_type should contain only one of the three type names or NULL. *There are three rideable types: docked_bike, electric_bike, classic_bike*
```SQL
SELECT ride_id, rideable_type
FROM `cyclisticcasestudy.tripData.2020_06` 
WHERE rideable_type <> 'docked_bike'
AND rideable_type <> 'electric_bike'
AND rideable_type <> 'classic_bike'
AND rideable_type IS NOT NULL;
```
4. Check for errors in start time and end time: the end time is supposed to be later than the start time. Therefore, rows whose start time is later than the end time are deleted from the table:
```SQL
DELETE FROM `cyclisticcasestudy.tripData.2020_06`
WHERE started_at > ended_at
```
Log (rows removed from each monthly table):

| Table | Rows removed |
|---|---|
| cyclisticcasestudy:tripData.2020_06 | 469 |
| cyclisticcasestudy:tripData.2020_07 | 1,745 |
| cyclisticcasestudy:tripData.2020_08 | 2,769 |
| cyclisticcasestudy:tripData.2020_09 | 2,132 |
| cyclisticcasestudy:tripData.2020_10 | 1,911 |
| cyclisticcasestudy:tripData.2020_11 | 865 |
| cyclisticcasestudy:tripData.2020_12 | 434 |
| cyclisticcasestudy:tripData.2021_01 | 2 |
| cyclisticcasestudy:tripData.2021_02 | 0 |
| cyclisticcasestudy:tripData.2021_03 | 2 |
| cyclisticcasestudy:tripData.2021_04 | 5 |
| cyclisticcasestudy:tripData.2021_05 | 2 |

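
The four checks above can also be sketched in plain Python, which is handy for trying the logic on a few rows before running it in BigQuery. This is a minimal sketch, not the method used in this case study: the function `check_trips`, the sample rows, and the allowed-value sets are assumptions for illustration, and `None` stands in for SQL NULL.

```python
from collections import Counter
from datetime import datetime

VALID_MEMBER = {"member", "casual"}
VALID_BIKE = {"docked_bike", "electric_bike", "classic_bike"}


def check_trips(rows):
    """Return the ride_ids that fail each of the four integrity checks."""
    id_counts = Counter(r["ride_id"] for r in rows)
    return {
        # Check 1: ride_ids that appear more than once
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        # Checks 2 and 3: None (NULL) is allowed, anything else must be a known value
        "bad_member_casual": [r["ride_id"] for r in rows
                              if r["member_casual"] is not None
                              and r["member_casual"] not in VALID_MEMBER],
        "bad_rideable_type": [r["ride_id"] for r in rows
                              if r["rideable_type"] is not None
                              and r["rideable_type"] not in VALID_BIKE],
        # Check 4: start time later than end time
        "start_after_end": [r["ride_id"] for r in rows
                            if r["started_at"] > r["ended_at"]],
    }


# Hypothetical rows for illustration only; not taken from the Divvy data.
rows = [
    {"ride_id": "A1", "rideable_type": "docked_bike", "member_casual": "member",
     "started_at": datetime(2020, 6, 1, 8, 0), "ended_at": datetime(2020, 6, 1, 8, 20)},
    {"ride_id": "A1", "rideable_type": "classic_bike", "member_casual": "casual",
     "started_at": datetime(2020, 6, 1, 9, 0), "ended_at": datetime(2020, 6, 1, 9, 5)},
    {"ride_id": "B2", "rideable_type": "scooter", "member_casual": None,
     "started_at": datetime(2020, 6, 2, 9, 0), "ended_at": datetime(2020, 6, 2, 8, 30)},
    {"ride_id": "C3", "rideable_type": None, "member_casual": "guest",
     "started_at": datetime(2020, 6, 3, 7, 0), "ended_at": datetime(2020, 6, 3, 7, 45)},
]
report = check_trips(rows)
print(report)
```
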
**Data integrity:**  
After checking for errors, I confirmed that the data contains no significant errors. The 12 monthly tables contain 4,073,561 rows in total, which is a large enough sample for analysis.  

**Merge all information and transform data:**  
To make the analysis efficient, each timestamp is split into month, day of week, and hour, and the ride duration is computed in minutes. 
UNION ALL merges all 12 monthly tables into one big table (only the first two months are shown here):  
```SQL
SELECT
  ride_id,
  rideable_type,
  EXTRACT(DAYOFWEEK FROM started_at) AS ride_day,
  EXTRACT(HOUR FROM started_at) AS start_time,
  EXTRACT(MONTH FROM started_at) AS start_month,
  DATETIME_DIFF(ended_at, started_at, MINUTE) AS duration,
  start_station_name, start_lat, start_lng,
  end_station_name, end_lat, end_lng
FROM `cyclisticcasestudy.tripData.2020_06`
UNION ALL
SELECT
  ride_id,
  rideable_type,
  EXTRACT(DAYOFWEEK FROM started_at) AS ride_day,
  EXTRACT(HOUR FROM started_at) AS start_time,
  EXTRACT(MONTH FROM started_at) AS start_month,
  DATETIME_DIFF(ended_at, started_at, MINUTE) AS duration,
  start_station_name, start_lat, start_lng,
  end_station_name, end_lat, end_lng
FROM `cyclisticcasestudy.tripData.2020_07`
```
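
To make the derived columns concrete, here is a rough Python equivalent for a single ride. It is a sketch under assumptions: `derive_fields` is a hypothetical helper, the day numbering follows the DAYOFWEEK convention used in this study (1 = Sunday, 7 = Saturday), and the duration is computed by truncating both timestamps to the minute, which I assume matches how `DATETIME_DIFF(..., MINUTE)` counts minute boundaries.

```python
from datetime import datetime


def derive_fields(started_at: datetime, ended_at: datetime) -> dict:
    """Derive ride_day, start_time, start_month and duration for one ride."""
    # isoweekday() runs 1 (Monday) .. 7 (Sunday); shift it to 1 (Sunday) .. 7 (Saturday).
    ride_day = started_at.isoweekday() % 7 + 1
    # Truncate to the minute so the difference counts minute boundaries crossed.
    start_min = started_at.replace(second=0, microsecond=0)
    end_min = ended_at.replace(second=0, microsecond=0)
    duration = int((end_min - start_min).total_seconds() // 60)
    return {
        "ride_day": ride_day,
        "start_time": started_at.hour,
        "start_month": started_at.month,
        "duration": duration,
    }


# Hypothetical ride on Saturday, 6 June 2020.
fields = derive_fields(datetime(2020, 6, 6, 14, 5, 30), datetime(2020, 6, 6, 14, 47, 10))
print(fields)
```
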
Check the merged table:
I checked for possible duplicates by selecting only distinct rows. The total row count stays consistent.
```SQL
SELECT DISTINCT *
FROM `cyclisticcasestudy.tripData.total_selective_data`
```


## **Analyze:**  
**Average Duration:**  

```SQL
SELECT member_casual,AVG(duration) AS averageDuration
FROM `cyclisticcasestudy.tripData.total_tripdata` 
GROUP BY member_casual
```
result:  

|member_casual| averageDuration|
|---|---|
|casual|42.20672687851149|
|member|15.01677871316342|

The average durations of casual users and members differ substantially: casual users spend almost three times as long on each ride (about 42 minutes versus 15 minutes).  
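
For readers who prefer code to SQL, the grouping can be sketched in plain Python as follows. The function `average_duration` and the sample rides are hypothetical; only the two averages used for the ratio come from the result table above.

```python
from collections import defaultdict


def average_duration(rides):
    """Average duration per rider type from (member_casual, duration) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # rider type -> [sum, count]
    for member_casual, duration in rides:
        totals[member_casual][0] += duration
        totals[member_casual][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical rides for illustration only.
rides = [("casual", 40), ("casual", 50), ("member", 10), ("member", 20)]
averages = average_duration(rides)
print(averages)

# Ratio of the averages reported in the result table above.
ratio = 42.20672687851149 / 15.01677871316342
print(round(ratio, 1))
```
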
**Most popular ride day:**
```SQL
-- Run once per rider type: 'casual' here, then 'member'
SELECT ride_day, COUNT(*) AS totalDays
FROM `cyclisticcasestudy.tripData.total_tripdata`
WHERE member_casual = 'casual'
GROUP BY ride_day
ORDER BY COUNT(*) DESC
```
result:   

|casual|member|
|--|--|
|<table> <tr><th>ride_day</th><th>totalDays</th></tr><tr><td>7</td><td>396428</td></tr><tr><td>1</td><td>329948</td></tr><tr><td>6</td><td>246652</td></tr> <tr><td>5</td><td>191762</td></tr><tr><td>2</td><td>188417</td></tr><tr><td>4</td><td>182565</td></tr><tr><td>3</td><td>174412</td></tr></table>|<table> <tr><th>ride_day</th><th>totalDays</th></tr><tr><td>7</td><td>360579</td></tr><tr><td>6</td><td>350586</td></tr><tr><td>4</td><td>347184</td></tr><tr><td>5</td><td>341971</td></tr><tr><td>3</td><td>329416</td></tr><tr><td>2</td><td>315487</td></tr><tr><td>1</td><td>307818</td></tr></table>|  



>The first day of the week is Sunday, so 1 means Sunday and 7 means Saturday.   

The results imply that casual users ride mostly on weekends: Saturday and Sunday are their two busiest days. Members ride at a fairly even rate throughout the week, and Sunday is their least busy day. Saturday is the busiest day for both kinds of users.
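
The weekend pattern can be quantified directly from the two result tables. The sketch below copies the ride counts from the tables above; the helper names `weekend_share` and `busiest_day` are my own and are only meant as an illustration.

```python
# Day numbering used in this study: 1 = Sunday ... 7 = Saturday.
DAY_NAMES = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}

# Ride counts per day, copied from the result tables above.
casual = {7: 396428, 1: 329948, 6: 246652, 5: 191762, 2: 188417, 4: 182565, 3: 174412}
member = {7: 360579, 6: 350586, 4: 347184, 5: 341971, 3: 329416, 2: 315487, 1: 307818}


def weekend_share(counts):
    """Fraction of rides that start on Saturday (7) or Sunday (1)."""
    return (counts[1] + counts[7]) / sum(counts.values())


def busiest_day(counts):
    """Name of the day with the most rides."""
    return DAY_NAMES[max(counts, key=counts.get)]


for label, counts in (("casual", casual), ("member", member)):
    print(label, busiest_day(counts), f"{weekend_share(counts):.0%}")
```
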

**Travel distance:**
```SQL
SELECT member_casual,AVG(ABS(end_lat-start_lat)) AS latDiff,AVG(ABS(end_lng-start_lng)) AS lngDiff
FROM `cyclisticcasestudy.tripData.total_tripdata` 
GROUP BY member_casual
```
result:  

|member_casual|latDiff|lngDiff|
|---|---|---|
|casual|0.015071854591753685|0.01267706615532465|
|member|0.015217680626098365 |0.013367068971533039 |


latDiff and lngDiff show the average absolute latitude and longitude difference between each ride's start and end points. The results show little difference between the average casual ride and the average member ride. Because casual users cover a similar start-to-end displacement but spend much more time on each ride, the result is consistent with the assumption that casual users tend to use bike-share for leisure trips, while members tend to use it to commute.
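
Differences in degrees are hard to read as distances, because a degree of longitude is shorter than a degree of latitude at Chicago's latitude. One common way to get a per-ride distance in kilometres is the haversine formula. The sketch below is an illustration only and was not part of the original analysis; the function name, the Earth radius constant, and the sample coordinates are assumptions. It measures the straight-line distance between start and end points, not the route actually ridden.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical start and end points near downtown Chicago.
distance = haversine_km(41.88, -87.63, 41.895, -87.617)
print(f"{distance:.2f} km")
```
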

## **Share:**  
The merged table is uploaded to Tableau Desktop for data visualization.  
Below is the data visualization from Tableau.

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1623972284244' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Cy&#47;CyclisticCaseStudy_16238384116320&#47;RideableTypeBreakdown&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='CyclisticCaseStudy_16238384116320&#47;RideableTypeBreakdown' /><param name='tabs' value='yes' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Cy&#47;CyclisticCaseStudy_16238384116320&#47;RideableTypeBreakdown&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1623972284244');                    var vizElement = divElement.getElementsByTagName('object')[0];  
vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                
var scriptElement = document.createElement('script');
scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    
vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>             


From the visualization, I can conclude that casual users differ from members in the following ways:
* Casual users tend to ride more in downtown Chicago. In the 'HeatMap' dashboard, casual users are more concentrated in the center.
* Casual users tend to ride for longer on each ride, as shown in the trend line charts of the 'Trend' dashboard.
* Casual users tend to ride during summer. Usage peaks in May and August and bottoms out from December to February.  
* Casual users tend to use more docked bikes and fewer classic bikes than members.

## **Act:**  
The top three recommendations based on the conclusions from the Share phase:
* Target users who live near downtown Chicago
* Advertise more during summer
* Create promotions combining popular leisure riding routes with annual memberships

## **Case Study Wrap Up:**

This is the first case study in which I walked through all six phases of data analysis on my own. The whole case study took approximately 10 hours to complete. It should have been shorter because there are only 12 tables with 15 attributes, but it took me a long time to figure out what data to combine or extract and what data visualizations to build. Because there are two sets of nullable data for the start and end locations, I was not sure what to do with them, and I could not simply use COALESCE() to obtain one primary attribute because the two sets have different data types. Therefore, I kept these redundant data when I ran queries and built visualizations. I also tried to draw the trips as lines on the map, but the trip map was too messy to use because I had not queried the popular routes before the Analyze phase. There is much more for me to learn, and I hope to make progress in every case study or project.   

#### Thank you so much for reading my case study!