# Bike Share Data Analysis

###  As my final project for Google's Data Analytics certificate, I explored a real dataset from a Chicago-based bike-share company to identify differences between riders with memberships (members) and riders without them (casual riders). The [data](https://www.kaggle.com/laborday/cyclistic-data) includes over 3 million rides from April 2020 to March 2021.

###  The dataset contained the following fields:

* ride_id: a set of 16 characters unique to each ride

* rideable_type: the style of bicycle (classic, docked or electric)

* started_at: the date and time that each ride began

* ended_at: the date and time that each ride ended

* start_station_name: which terminal the bike was checked out from

* start_station_id: an ID for each start station

* end_station_name: which terminal the bike was checked into

* end_station_id: an ID for each end station

* start_lat: the latitude of the start station

* start_lng: the longitude of the start station

* end_lat: the latitude of the end station

* end_lng: the longitude of the end station

* member_casual: the membership status of the rider

### I began by using SQL to check the relevant columns for invalid entries.

#### Checking for unique values in the rideable_type column reveals no bad entries:
```
SELECT
  rideable_type,
  COUNT (rideable_type) AS count

FROM `massive-current-311523.Cyclistic.ride_data`

GROUP BY rideable_type
```

![Image](https://i.imgur.com/u8lXnYT.png)

#### Checking for earliest and latest dates reveals none are outside of the expected range:
```
SELECT
  MAX(started_at) as max_started_at,
  MAX(ended_at) as max_ended_at,
  MIN(started_at) as min_started_at,
  MIN(ended_at) as min_ended_at

FROM `massive-current-311523.Cyclistic.ride_data`
```
![Image](https://i.imgur.com/nozg6fR.png)

#### Using a common table expression (CTE) to add a calculated trip-length column reveals 65,124 rides (1.97%) lasting 0 minutes or less, which will be filtered out later:
```
WITH temp_table AS (
 
  SELECT
    started_at,
    ended_at,
    DATE_DIFF (ended_at, started_at, MINUTE) as trip_length
 
  FROM `massive-current-311523.Cyclistic.ride_data`
)

SELECT
  started_at,
  ended_at,
  trip_length
 
FROM temp_table

WHERE trip_length <= 0

ORDER BY trip_length
```
![Image](https://i.imgur.com/NgDDstH.png)
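
For readers who want to try the same check outside BigQuery, here is a minimal Python sketch of the idea. The sample rides are made up for illustration, and the helper `trip_length_minutes` is a hypothetical name. It assumes trip length means elapsed whole minutes, truncated toward zero, which may not match BigQuery's handling of minute boundaries exactly.

```python
from datetime import datetime


def trip_length_minutes(started_at: datetime, ended_at: datetime) -> int:
    """Elapsed whole minutes, truncated toward zero (an assumption, not BigQuery's exact rule)."""
    return int((ended_at - started_at).total_seconds() / 60)


# Hypothetical sample rides as (started_at, ended_at) pairs.
sample_rides = [
    (datetime(2020, 4, 1, 8, 0, 0), datetime(2020, 4, 1, 8, 25, 0)),   # normal ride
    (datetime(2020, 4, 1, 9, 0, 0), datetime(2020, 4, 1, 9, 0, 40)),   # under one minute
    (datetime(2020, 4, 1, 10, 0, 0), datetime(2020, 4, 1, 9, 50, 0)),  # ends before it starts
]

trip_lengths = [trip_length_minutes(start, end) for start, end in sample_rides]

# Flag rides whose length is zero or negative, mirroring the WHERE clause above.
invalid = [length for length in trip_lengths if length <= 0]
invalid_share = 100 * len(invalid) / len(trip_lengths)

print(trip_lengths)
print(f"{len(invalid)} of {len(trip_lengths)} rides flagged ({invalid_share:.1f}%)")
```

Because the length is measured in whole minutes, rides lasting only a few seconds are flagged along with rides that have a negative duration.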

#### Checking for min and max coordinates reveals none outside of the expected range:
```
SELECT
  MAX(start_lat) as max_start_lat,
  MAX(start_lng) as max_start_lng,
  MIN(start_lat) as min_start_lat,
  MIN(start_lng) as min_start_lng,
 
  MAX(end_lat) as max_end_lat,
  MAX(end_lng) as max_end_lng,
  MIN(end_lat) as min_end_lat,
  MIN(end_lng) as min_end_lng

FROM `massive-current-311523.Cyclistic.ride_data`
```
![Image](https://i.imgur.com/qi2lwgz.png)

#### Counting unique starting latitudes reveals excessive decimal precision, which produces over 260,000 distinct values:
```
SELECT
  start_lat,
  COUNT(start_lat) as count

FROM `massive-current-311523.Cyclistic.ride_data`

GROUP BY start_lat
```
![Image](https://i.imgur.com/XRalSfp.png)

#### Truncating the starting latitudes to four decimal places reduces the number of unique entries to 1,552:
```
SELECT
  TRUNC(start_lat, 4) as trimmed_start_lat,
  COUNT(TRUNC(start_lat, 4)) as count

FROM `massive-current-311523.Cyclistic.ride_data`

GROUP BY trimmed_start_lat
```
![Image](https://i.imgur.com/wSXZ9lC.png)
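
The effect of truncation on the number of distinct values can be illustrated with a small Python sketch. The coordinates below are invented, and `trunc4` is a hypothetical helper. It uses `Decimal` with `ROUND_DOWN` (rounding toward zero) as a stand-in for `TRUNC(x, 4)`, which avoids binary floating-point surprises.

```python
from collections import Counter
from decimal import Decimal, ROUND_DOWN


def trunc4(value: str) -> Decimal:
    """Truncate a decimal string to four decimal places, rounding toward zero."""
    return Decimal(value).quantize(Decimal("0.0001"), rounding=ROUND_DOWN)


# Hypothetical latitudes that differ only past the fourth decimal place.
sample_start_lats = ["41.880912", "41.880958", "41.881034", "41.8810349"]

unique_before = len(set(sample_start_lats))
counts = Counter(trunc4(lat) for lat in sample_start_lats)
unique_after = len(counts)

print(f"unique values before: {unique_before}, after: {unique_after}")
for lat, count in sorted(counts.items()):
    print(lat, count)
```

Four decimal places of latitude correspond to roughly 11 meters, so nearby readings from the same dock collapse into one value while separate stations generally remain distinct.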

#### Checking for unique values in the member_casual column reveals no bad entries:
```
SELECT 
  member_casual,
  COUNT (member_casual) as count
  
FROM `massive-current-311523.Cyclistic.ride_data` 

GROUP BY member_casual
```
![Image](https://imgur.com/y6KP8Gf.png)

#### Finally, selecting the relevant columns, filtering out the invalid trip lengths, and truncating the coordinates, then storing the result in a new table for analysis:
```
WITH cleaned_data AS (

  SELECT
    rideable_type,
    started_at,
    ended_at,
    DATE_DIFF (ended_at, started_at, MINUTE) as trip_length,
    TRUNC(start_lat, 4) as trimmed_start_lat,
    TRUNC(start_lng, 4) as trimmed_start_lng,
    TRUNC(end_lat, 4) as trimmed_end_lat, 
    TRUNC(end_lng, 4) as trimmed_end_lng,
    member_casual

  FROM `massive-current-311523.Cyclistic.ride_data`
)

SELECT *

FROM cleaned_data

WHERE trip_length > 0
```

![Image](https://i.imgur.com/LM0jR8s.png)

### With the cleaned dataset, I used Tableau to create visualizations summarizing my findings.

#### 1. Members took 59% of rides, compared to 41% by casual riders:
##### Out of 3,761,854 rides, members took 2,216,736 and casual riders took 1,545,118.

![Image](https://i.imgur.com/ttg2tQe.png)
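
The percentages follow directly from the ride counts quoted above. As a quick check, this short Python sketch recomputes them (the variable names are mine, not from the original analysis):

```python
# Ride counts as reported in the finding above.
member_rides = 2_216_736
casual_rides = 1_545_118

total_rides = member_rides + casual_rides
member_share = 100 * member_rides / total_rides
casual_share = 100 * casual_rides / total_rides

print(f"total rides: {total_rides:,}")
print(f"members: {member_share:.1f}%  casual: {casual_share:.1f}%")
```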


#### 2. Casual riders take longer trips than members:
##### On average, members rode for just under 16 minutes per trip, compared to over 44 minutes per trip by casual riders.

![Image](https://i.imgur.com/a8iuxck.png)
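
The averages here came from Tableau, but the underlying calculation is a mean of trip length grouped by rider type. A minimal Python sketch of that grouping follows. The trips are invented for illustration and do not reproduce the real figures.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (rider_type, trip_length_in_minutes) pairs.
sample_trips = [
    ("member", 12),
    ("member", 18),
    ("casual", 40),
    ("casual", 50),
    ("member", 15),
]

# Collect trip lengths per rider type, then average each group.
lengths_by_type = defaultdict(list)
for rider_type, minutes in sample_trips:
    lengths_by_type[rider_type].append(minutes)

average_by_type = {rider_type: mean(lengths) for rider_type, lengths in lengths_by_type.items()}

for rider_type, average in average_by_type.items():
    print(f"{rider_type}: {average:.1f} minutes on average")
```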

#### 3. Everybody rides longer on weekends, especially casual riders:
##### Trip lengths were below average on weekdays, and above average on weekends.

![Image](https://i.imgur.com/FC56GJV.png)

#### 4. Casual ridership increases significantly on the weekend:
##### While the number of rides by members does increase slightly over the course of the week, most casual rides occur on the weekend.

![Image](https://i.imgur.com/4e3R6V4.png)
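
The weekday and weekend split behind this chart can be sketched in Python by bucketing each ride's start time. The rides below are hypothetical, and treating Saturday and Sunday as the weekend is my assumption about how the days were grouped.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (rider_type, started_at) pairs.
sample_rides = [
    ("member", datetime(2020, 7, 6, 8, 5)),     # Monday
    ("member", datetime(2020, 7, 8, 17, 30)),   # Wednesday
    ("casual", datetime(2020, 7, 11, 13, 0)),   # Saturday
    ("casual", datetime(2020, 7, 12, 15, 45)),  # Sunday
    ("member", datetime(2020, 7, 11, 10, 0)),   # Saturday
    ("casual", datetime(2020, 7, 9, 19, 0)),    # Thursday
]


def day_bucket(started_at: datetime) -> str:
    """Label a start time as 'weekend' (Saturday or Sunday) or 'weekday'."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return "weekend" if started_at.weekday() >= 5 else "weekday"


ride_counts = Counter((rider_type, day_bucket(started_at)) for rider_type, started_at in sample_rides)

for (rider_type, bucket), count in sorted(ride_counts.items()):
    print(f"{rider_type:6} {bucket:7} {count}")
```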

#### 5. Casual riders and members share similar bike preferences:
##### Docked bikes were the most popular with both members and casual riders. Classic bikes, the least popular, were especially unpopular with casual riders.

![Image](https://i.imgur.com/tm2GCVB.png)

#### 6. The number of rides increases throughout the day:
##### Ride counts rise around key points in the day, such as the start of working hours, lunchtime, and the end of the workday.

![Image](https://i.imgur.com/Ot3w7Gl.png)

#### 7. Casual riders take longer trips over the course of the day:
##### While members take trips consistent in length, the length of casual rides grows dramatically over the day, peaking at 4 am.

![Image](https://imgur.com/jLS2or6.png)

#### 8. Seasonal variation is consistent across both casual riders and members:
##### Rides peak during early spring and mid-summer.

![Image](https://i.imgur.com/VDibFao.png)

#### 9. Casual riders are geographically distinct from members:
##### By filtering out all but the most popular starting and ending coordinates (10,000+ rides), we see that members tend to ride in different areas than casual riders. Specifically, casual rides mainly occur around local attractions.

![Image](https://i.imgur.com/DBovUjo.png)

## Summary

### Unsurprisingly, it appears that casual riders use these bikes for leisure; they ride more on the weekends, they ride for longer and they ride around local attractions.
### Members appear to use these bikes for commuting; they ride consistently, every day, and for the same amount of time, usually around residential locations.



