# UEP-0239 Spring 2022 Assignment 4

---

**Name:** (edit this markdown cell and replace this text with your name)

---

In this assignment you will explore 2021 BlueBikes ridership in the Boston Metropolitan Area and attempt to answer the following questions:

1. How many BlueBikes trips took place in 2021? (6 pts)
2. What was the mean trip duration and the most used bike? (4 pts)
3. When did most BlueBikes trips take place? (12 pts)
4. How does temperature affect BlueBikes ridership? (14 pts)
5. How does precipitation affect BlueBikes ridership? (12 pts)
6. What are the most and least used BlueBikes stations and where are they located? (18 pts)
7. What is the most popular route and what is its straight line length? (16 pts)
8. In which Boston area ZIP Code do most BlueBikes riders live? (18 pts)

---

The `data` directory contains the following files:

- `2021MM-bluebikes-tripdata.csv` – BlueBikes trips taken in month `MM` of 2021 (12 files total with `MM` values ranging from `01` to `12`)
- `bos-prcp-temp-2021.csv` – daily 2021 temperature and precipitation readings from the Boston Logan Airport weather station
- `mapc-icc-zcta.geojson` – ZIP Code Tabulation Areas (ZCTAs) for the Boston Metropolitan Area Planning Council (MAPC) Inner Core Committee (ICC) sub-region

---

Each row in a `2021MM-bluebikes-tripdata.csv` file denotes a single BlueBikes trip and the columns are as follows:

- `tripduration` – total time bike was not docked (trip duration) in seconds
- `starttime` – date and time bike was removed from dock (trip start) in `YYYY-MM-DD HH:mm:ss.SSS` format
- `stoptime` – date and time bike was returned to dock (trip end) in `YYYY-MM-DD HH:mm:ss.SSS` format
- `start station id` – unique ID for the station bike was removed from (trip start station)
- `start station name` – trip start station name
- `start station latitude` – trip start station latitude
- `start station longitude` – trip start station longitude
- `end station id` – unique ID for the station bike was returned to (trip end station)
- `end station name` – trip end station name
- `end station latitude` – trip end station latitude
- `end station longitude` – trip end station longitude
- `bikeid` – unique ID for the bike used
- `usertype` – type of user with `Customer` denoting a single trip or day pass user and `Subscriber` denoting an annual or monthly member
- `postal code` – user ZIP code (either self-reported or derived from payment information)

---

Each row in the `bos-prcp-temp-2021.csv` file denotes a single day and the columns are as follows:
- `STATION` – weather station ID
- `NAME` – weather station name
- `DATE` – date of readings in `YYYY-MM-DD` format
- `PRCP` – total daily precipitation in inches
- `SNOW` – total daily snowfall in inches
- `TAVG` – average daily temperature in degrees Fahrenheit
- `TMAX` – maximum daily temperature in degrees Fahrenheit
- `TMIN` – minimum daily temperature in degrees Fahrenheit

---

The `mapc-icc-zcta.geojson` file is in [IETF RFC 7946 GeoJSON](https://datatracker.ietf.org/doc/html/rfc7946) format with each object having the following properties:
- `ZCTA5CE10` – five-digit ZIP Code as designated in the 2010 census

---

## Import Required Packages and Libraries

These are the packages and libraries you will definitely need for this assignment. Feel free to add additional import statements if needed.

```python
from glob import glob
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
from matplotlib import pyplot as plt
import plotly.express as px
import contextily as cx
import seaborn as sns
import hvplot.pandas
import folium
```

---

## 1. Total BlueBikes Trips in 2021 (6 pts)

Currently the BlueBikes data is split across 12 different files. To perform analysis on all 2021 BlueBikes trips, these files must be combined into a single DataFrame. However, individually reading in 12 different files would be extremely tedious...

Luckily the Python built-in [`glob`](https://docs.python.org/3/library/glob.html) library can help automate the process! Run the code cell below to investigate how it works.

```python
glob('data/2021??-bluebikes-tripdata.csv')
```

Note how the `glob()` function takes a file path with one or more wildcards and returns a list of all file paths that match the specified pattern.

Here are two other helpful functions from Pandas:
- [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) takes a file path to a CSV file and reads it in as a DataFrame
- [`pandas.concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) takes a list of DataFrames with matching columns and combines them into a single DataFrame
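As a minimal sketch of how these two functions fit together, the example below reads two tiny in-memory CSV sources and stacks them. The `StringIO` objects and their values are made up for illustration and merely stand in for real file paths.

```python
from io import StringIO

import pandas as pd

# Two tiny in-memory "files" standing in for monthly CSVs (made-up values).
jan = StringIO("tripduration,bikeid\n300,11\n450,12\n")
feb = StringIO("tripduration,bikeid\n600,11\n")

# Read each source into its own DataFrame.
frames = [pd.read_csv(source) for source in (jan, feb)]

# Stack the DataFrames row-wise; ignore_index=True renumbers the rows from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Without `ignore_index=True`, each DataFrame would keep its own row labels, so the combined index would contain repeated values.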

(1.1) Knowing this, combine all 2021 BlueBikes trips into a single Pandas DataFrame called `trips`. Use a loop or experiment with [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp). (5 pts)

(1.2) How many BlueBikes trips were taken in 2021? Use a Pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) attribute to display the result. (1 pt)

---

## 2. Mean Trip Duration and Most Used Bike (4 pts)

(2.1) Display the mean trip duration in minutes. (1 pt)

(2.2) Display the ID of the most used bike. Remember that the most frequently appearing value in a set of values is called the *mode*. (1 pt)
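To illustrate the idea of a mode, here is a small sketch on made-up values (the Series `ids` is hypothetical and unrelated to the assignment data). Note that `pandas.Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent.

```python
import pandas as pd

# Made-up values: 7 appears three times, more often than any other value.
ids = pd.Series([7, 3, 7, 5, 7, 3])

# mode() returns a Series of all values tied for most frequent.
modes = ids.mode()

# Take the first entry when a single value is wanted.
most_common = modes.iloc[0]
print(most_common)
```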

(2.3) How many trips were taken with the most frequently used bike? (2 pts)

---

## 3. Day with Most BlueBikes Trips (12 pts)

Create a new DataFrame called `daily` with 365 rows and the following columns:

- `date` – date in `datetime64` format
- `trips` – total number of BlueBikes trips taken on given date

Use the date of the start time of the trip as the date of the trip. Here is a sample workflow:

1. Use [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) to convert the trip start time into a `datetime64` object.
2. Use [`pandas.Series.dt.date`](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html) to extract the date of the trip into a new column.
3. Use [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), [`pandas.groupby.GroupBy.size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.size.html), [`pandas.Series.to_frame()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html), and [`pandas.DataFrame.reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to extract a DataFrame with dates and total daily trips.
4. Use [`pandas.DataFrame.rename()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) and [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to rename and drop columns if needed.
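The first three steps of the workflow can be sketched on a toy DataFrame as follows. The three start times are made up, and the names `toy` and `counts` are placeholders chosen for this illustration only.

```python
import pandas as pd

# Made-up start times in the same format as the trip data.
toy = pd.DataFrame({"starttime": [
    "2021-01-01 08:15:00.000",
    "2021-01-01 17:40:12.500",
    "2021-01-02 09:05:30.250",
]})

# Step 1: parse the strings into datetime64 values.
toy["starttime"] = pd.to_datetime(toy["starttime"])

# Step 2: keep only the calendar date of each timestamp.
toy["date"] = toy["starttime"].dt.date

# Step 3: count the rows per date and turn the result back into a DataFrame.
counts = toy.groupby("date").size().to_frame("trips").reset_index()
print(counts)
```

Passing a name to `to_frame()` labels the count column directly, which is one way to avoid a separate rename step.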

(3.1) Create the DataFrame `daily` as specified above. (5 pts)

(3.2) How many trips took place on the date with the most trips? (1 pt)

(3.3) What is the date with the most trips? (2 pts)

(3.4) Create a line graph that shows daily trips over time with date on the X-axis and total number of trips on the Y-axis. Make sure to title your plot and label your axes. (4 pts)

---

## 4. Relationship Between Temperature and Ridership (14 pts)

Add the following columns to the `daily` DataFrame:

- `prcp` – total daily precipitation
- `snow` – total daily snowfall
- `temp` – average daily temperature

Remember that [`pandas.DataFrame.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) can be used to join on columns.

You might want to use [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) and [`pandas.Series.dt.date`](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html) to ensure the date in the weather data is compatible with the date in the `daily` DataFrame.
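The sketch below shows why the date types matter when merging, using two toy tables with made-up values. The names `counts`, `weather`, and `merged` are placeholders for this illustration.

```python
from datetime import date

import pandas as pd

# Toy daily counts keyed by Python date objects (made-up values).
counts = pd.DataFrame({
    "date": [date(2021, 1, 1), date(2021, 1, 2)],
    "trips": [2, 1],
})

# Toy weather table whose dates are still plain strings (made-up values).
weather = pd.DataFrame({
    "DATE": ["2021-01-01", "2021-01-02"],
    "TAVG": [30, 41],
})

# Convert the strings to the same type used in counts["date"].
weather["date"] = pd.to_datetime(weather["DATE"]).dt.date

# Join on the shared column; a left join keeps every row of counts.
merged = counts.merge(weather[["date", "TAVG"]], on="date", how="left")
print(merged)
```

If the key columns held different types, for example strings on one side and date objects on the other, no rows would match and the joined columns would be empty.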

(4.1) Update the DataFrame `daily` as specified above. (5 pts)

(4.2) Create a static scatter plot illustrating the relationship between average temperature and BlueBikes ridership. Add a title and axes labels. (4 pts)

Plots with a trend line will be awarded one bonus point!

(4.3) Create an interactive scatter plot illustrating the relationship between average temperature and BlueBikes ridership. Add a title and axes labels. (4 pts)

(4.4) Briefly summarize your results and discuss the relationship between temperature and ridership. (1 pt)

(edit this markdown cell and replace this with your answer)

---

## 5. Relationship Between Precipitation and Ridership (12 pts)

(5.1) Add a new binary column called `prcp_status` to the `daily` DataFrame that reads `True` if the day had **any** precipitation or snowfall and `False` otherwise. (5 pts)

(5.2) Display the average daily ridership for days with precipitation. (1 pt)

(5.3) Display the average daily ridership for days without any precipitation. (1 pt)

(5.4) Create a box plot illustrating the relationship between precipitation status and BlueBikes ridership. Add a title and axes labels. (4 pts)

(5.5) Briefly summarize your results and discuss the relationship between precipitation and ridership. (1 pt)

(edit this markdown cell and replace this with your answer)

---

## 6. Station Usage and Locations (18 pts)

Create a new DataFrame called `stations` where each row is a unique BlueBikes station and the columns are as follows:

- `id` – station ID
- `name` – station name
- `lat` – station latitude
- `lon` – station longitude
- `trips` – number of times a trip was **started** at this station in 2021

You can assume that every station has both originating and terminating trips, so it is sufficient to look at, for example, only the trip start stations.

Consider using [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to group the DataFrame based on multiple columns instead of just one.
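As a hedged illustration of grouping on more than one column, the toy example below counts rows per combination of two columns. The station IDs and names are made up, and `per_station` is a placeholder name.

```python
import pandas as pd

# Made-up trips: two start at station 1, one starts at station 2.
toy = pd.DataFrame({
    "start station id": [1, 1, 2],
    "start station name": ["A St", "A St", "B Ave"],
})

# Passing a list of columns groups by each distinct combination of values.
per_station = (
    toy.groupby(["start station id", "start station name"])
    .size()
    .to_frame("trips")
    .reset_index()
)
print(per_station)
```

Grouping by several columns at once carries the extra attributes through to the result, so they do not need to be joined back in afterwards.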

(6.1) Create the DataFrame `stations` as specified above. (4 pts)

(6.2) How many unique BlueBikes stations are there? (1 pt)

(6.3) How many trips originated from the most used station? (1 pt)

(6.4) Display the name of the most used station. (1 pt)

(6.5) How many trips originated from the least used station? (1 pt)

(6.6) Display the name of the least used station. (1 pt)

(6.7) Convert `stations` into a GeoDataFrame. (1 pt)

Here is a useful reference: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
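A minimal sketch of the approach described in that reference is shown below. The two points are hypothetical, and the example assumes the longitude and latitude values are expressed in EPSG:4326.

```python
import geopandas as gpd
import pandas as pd

# Two hypothetical locations with made-up coordinates.
toy = pd.DataFrame({
    "name": ["Hypothetical Stop A", "Hypothetical Stop B"],
    "lat": [42.36, 42.40],
    "lon": [-71.06, -71.12],
})

# points_from_xy expects x (longitude) first, then y (latitude).
points = gpd.points_from_xy(toy["lon"], toy["lat"])

# Assumption: the coordinates are in EPSG:4326.
toy_gdf = gpd.GeoDataFrame(toy, geometry=points, crs="EPSG:4326")
print(toy_gdf.crs)
```

Swapping the argument order is a common mistake here, since coordinates are often spoken of as "latitude, longitude" while the function takes x before y.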

(6.8) Use [`geopandas.GeoDataFrame.to_crs()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_crs.html) to reproject the `stations` GeoDataFrame to the [EPSG:26986](https://epsg.io/26986) coordinate system. (1 pt)

(6.9) What is the [EPSG:26986](https://epsg.io/26986) coordinate system commonly known as? (1 pt)

(edit this markdown cell and replace this with your answer)

(6.10) Create a static map showing the locations of all the BlueBikes stations as blue graduated circles with the circle size proportional to the number of trips originating from the station. Add a [basemap](https://contextily.readthedocs.io/en/latest/) and title. (6 pts)

---

## 7. Most Popular Route (16 pts)

Create a new DataFrame called `routes` where each row denotes a unique route between two BlueBikes stations and the columns are as follows:

- `origin_id` – ID of the station the route originates from
- `destination_id` – ID of the station the route terminates at
- `trips` – total number of trips along specified route

The workflow for this should be very similar to creating the `stations` DataFrame.

(7.1) Create the DataFrame `routes` as specified above. (2 pts)

(7.2) How many trips took place along the most popular route? (1 pt)

(7.3) What is the origin ID for the most popular route? (1 pt)

(7.4) What is the destination ID for the most popular route? (1 pt)

(7.5) What is the straight line distance for the most popular route? (5 pts)

Extract the coordinates for the origin and destination stations from the `stations` GeoDataFrame and calculate the distance between the two points defined by those coordinates.
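As a sketch of the distance calculation on made-up planar coordinates, the example below measures the same straight line in two ways with Shapely. The points `a` and `b` are hypothetical, and the result is expressed in whatever unit the coordinates themselves use.

```python
from shapely.geometry import LineString, Point

# Two hypothetical points in a planar (projected) coordinate system.
a = Point(0, 0)
b = Point(3, 4)

# Option 1: straight-line distance between the two points.
print(a.distance(b))

# Option 2: build the connecting segment and read its length.
segment = LineString([(a.x, a.y), (b.x, b.y)])
print(segment.length)
```

Both calls compute a planar distance, so they are only meaningful when the coordinates come from a projected coordinate system rather than from raw latitude and longitude.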

(7.6) What unit is the distance in? How do you know? (2 pts)

(edit this markdown cell and replace this with your answer)

(7.7) Recreate the BlueBikes station map from before but this time add a red line connecting the origin and destination stations for the most popular route. (4 pts)

---

## 8. ZIP Code with Most Riders (18 pts)

(8.1) Read the `mapc-icc-zcta.geojson` file into a GeoDataFrame called `zip_codes`. (1 pt)

(8.2) What is the EPSG code of the coordinate system for `zip_codes`? (1 pt)

(8.3) What is this coordinate system commonly known as? (1 pt)

Create a new DataFrame called `zip_summary` where each row denotes a ZIP Code from the `zip_codes` GeoDataFrame and the columns are as follows:

- `zip_code` – five-digit ZIP Code
- `trips` – total number of trips taken by supposed residents of corresponding ZIP Code

Here is how you could create this DataFrame:

1. Extract a list of ZIP Codes from the `zip_codes` GeoDataFrame.
2. Use [`pandas.Series.isin()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html) to select all the rows from `trips` where the ZIP Code is one of the ZIP Codes from the extracted list.
3. Use [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), [`pandas.groupby.GroupBy.size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.size.html), [`pandas.Series.to_frame()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html), and [`pandas.DataFrame.reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to extract a DataFrame with ZIP Codes and total trips.
4. Use [`pandas.DataFrame.rename()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) and [`pandas.DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to rename and drop columns if needed.
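The filtering in step 2 can be sketched on toy data as follows. The postal codes and the list `keep` are made up for illustration.

```python
import pandas as pd

# Made-up postal codes stored as strings so leading zeros are preserved.
toy = pd.DataFrame({"postal code": ["02155", "02144", "10001", "02155"]})

# Hypothetical list of codes of interest.
keep = ["02155", "02144"]

# isin() returns a boolean Series that is True where the value is in the list.
mask = toy["postal code"].isin(keep)

# Boolean indexing keeps only the rows where the mask is True.
subset = toy[mask]
print(subset)
```

`isin()` compares values exactly, so the string `"02155"` does not match the integer `2155`. It is worth checking that both sides of the comparison use the same type.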

(8.4) Create the DataFrame `zip_summary` as specified above. (4 pts)

(8.5) What is the ZIP Code whose residents supposedly took the most trips? (1 pt)

(8.6) Add the `trips` column to the `zip_codes` GeoDataFrame by joining it with the `zip_summary` DataFrame. (2 pts)

Pay attention to the order of the join and ensure the result is a GeoDataFrame!

(8.7) Use [Folium](https://python-visualization.github.io/folium/quickstart.html) to create an interactive choropleth map illustrating the number of trips taken by the residents of each ZIP Code. (8 pts)

Your map should have a suitable basemap, a well-formatted legend, and popups that display the ZIP Code and number of trips.

Note that you can use [`geopandas.GeoDataFrame.to_json()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_json.html) to convert a GeoDataFrame into a GeoJSON object if needed.
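As a small illustration of that conversion, the sketch below builds a one-row GeoDataFrame from a made-up square and inspects the GeoJSON it produces. The ZIP Code `00000`, the trip count, and the geometry are all hypothetical.

```python
import json

import geopandas as gpd
from shapely.geometry import Polygon

# A made-up unit square standing in for a real ZCTA boundary.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])

toy_gdf = gpd.GeoDataFrame(
    {"zip_code": ["00000"], "trips": [5]},  # hypothetical attribute values
    geometry=[square],
    crs="EPSG:4326",
)

# to_json() returns the GeoJSON as a string.
geojson_text = toy_gdf.to_json()

# Parsing the string shows the structure: a FeatureCollection of features,
# with the non-geometry columns stored under each feature's "properties".
parsed = json.loads(geojson_text)
print(parsed["type"])
print(parsed["features"][0]["properties"])
```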

---

## Submitting the Assignment

Once you have finished the assignment, do the following:

1. Select *Kernel > Restart Kernel and Run All Cells* from the menu bar above.
2. The notebook will restart and evaluate all code cells from the beginning. Ensure all cells get evaluated and that there are no errors.
3. Review your notebook. If you need to make any changes, start over from step one.
4. Save your notebook and rename it to `uep239-hw04-surname.ipynb` replacing the placeholder with your surname.
5. Select *File > Save and Export Notebook As > HTML* from the menu bar above and note how an HTML version of your notebook gets downloaded.
6. Select *File > Save and Export Notebook As > Webpdf* from the menu bar above and note how a PDF version of your notebook gets downloaded.

Submit your notebook along with the HTML and PDF export. These are the files you should be submitting:

- `uep239-hw04-surname.ipynb`
- `uep239-hw04-surname.html`
- `uep239-hw04-surname.pdf`