# Get Divvy Data

This notebook only needs to be run once. There is no need to run it again once `data/divvy_dataset.csv` exists.

Run all cells in this notebook to obtain the filtered Divvy data.

In [1]:
PATH_RAW = "data/raw/"
PATH_PROCESSED = "data/"

In this project, we are interested in Divvy data from the month of **October 2022**. This data is available from the [Divvy data index](https://divvy-tripdata.s3.amazonaws.com/index.html) and can be downloaded as a `.zip` file.

In [2]:
filename = "202210-divvy-tripdata.zip"

We first run commands to set up the folders that will store the datasets. These particular commands work on macOS and other Unix-like systems.

In [3]:
! mkdir data; cd data; mkdir raw;

mkdir: data: File exists
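
As a platform-independent alternative to the shell command, the same folders can be created from Python. This is a minimal sketch using the standard library; `ensure_folders` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import os

def ensure_folders(*paths):
    """Create each folder (and any missing parents) if it does not exist yet."""
    for path in paths:
        # exist_ok=True avoids an error when the folder is already there
        os.makedirs(path, exist_ok=True)
```

Calling `ensure_folders(PATH_RAW)` would create both `data/` and `data/raw/`, and it can be re-run without producing a "File exists" message.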


Now, we try to download the file from the index.

In [4]:
import os
import requests

def download_zip(url, output_path):
  """Download the file at `url` into the folder `output_path`."""
  filename = url.split('/')[-1]
  try:
    response = requests.get(url)
    response.raise_for_status()  # treat HTTP errors (e.g. 404) as failures
    with open(os.path.join(output_path, filename), 'wb') as output_file:
      output_file.write(response.content)
    print("Download complete:", filename)
  except requests.RequestException as error:
    print("A problem occurred during download.")
    print(error)

In [5]:
url = "https://divvy-tripdata.s3.amazonaws.com/" + filename

In [6]:
download_zip(url, PATH_RAW)

Download complete: 202210-divvy-tripdata.zip
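
The function above holds the whole response in memory before writing it. For larger archives, a streaming variant may be preferable. The sketch below is an alternative built on the standard library's `urllib.request` rather than `requests`; `download_file_streaming` is a hypothetical name, and error handling is left to the caller.

```python
import os
import shutil
import urllib.request

def download_file_streaming(url, output_path):
    """Stream the file at `url` into the folder `output_path`; return the saved path."""
    filename = url.split("/")[-1]
    destination = os.path.join(output_path, filename)
    # urlopen raises urllib.error.URLError (or its subclass HTTPError) on failure
    with urllib.request.urlopen(url) as response, open(destination, "wb") as output_file:
        shutil.copyfileobj(response, output_file)  # copies in chunks, not all at once
    return destination
```

It would be called the same way as `download_zip`, for example `download_file_streaming(url, PATH_RAW)`.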


At this point, we should have the raw data zip file. We can extract its contents and then delete the archive.

In [7]:
from zipfile import ZipFile

with ZipFile(PATH_RAW + filename, 'r') as zip_obj:
  zip_obj.extractall(PATH_RAW)

In [8]:
! cd data/raw/; rm *.zip;
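
The extraction and clean-up steps can also be combined in Python, which avoids relying on `rm`. This is only a sketch; `extract_and_remove` is a hypothetical helper, and it deletes the archive only after extraction has finished without raising.

```python
import os
from zipfile import ZipFile

def extract_and_remove(zip_path, target_dir):
    """Extract every member of the archive into `target_dir`, then delete the archive."""
    with ZipFile(zip_path, "r") as zip_obj:
        names = zip_obj.namelist()
        zip_obj.extractall(target_dir)
    os.remove(zip_path)  # only reached if extraction did not raise
    return names
```

The returned list of member names is a quick way to confirm that the expected `.csv` file was in the archive.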

Now for the fun part: filtering the data with pandas. The only filtering we do at this stage is to drop every row that contains a `NaN` value.

In [9]:
filename = "202210-divvy-tripdata.csv"

In [10]:
import pandas as pd

In [11]:
df = pd.read_csv(PATH_RAW + filename)

In [12]:
df = df.dropna()

We now have our dataset with the `NaN` rows removed.
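
To make the effect of `dropna()` concrete, here is a toy illustration. The column names mimic the kind of fields found in trip data, but the values are made up for this example and do not come from the Divvy file.

```python
import numpy as np
import pandas as pd

# Toy data: row "b" lacks a station name, row "c" lacks a coordinate.
toy = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "start_station_name": ["Station X", None, "Station Y"],
    "end_lat": [41.9, 41.8, np.nan],
})

missing_per_column = toy.isna().sum()  # number of missing values in each column
cleaned = toy.dropna()                 # keeps only rows with no missing value at all
```

Here only row `"a"` survives, because `dropna()` with its default arguments removes a row as soon as any one of its columns is missing. Inspecting `isna().sum()` first shows which columns are responsible for the dropped rows.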

In [13]:
df.shape

(414269, 13)

This is still too large: it would produce a ~100MB `.csv` file, whereas we need one under 50MB for Observable.

So, we keep only the data from **one week** in October: rides occurring between **Monday, 24th October 2022** and **Sunday, 30th October 2022** (inclusive). We also wish to add two extra attributes that can be derived from the raw dataset, the _distance_ and _duration_ of each trip, which will help us answer our questions.

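Select only one week of data, due to size constraints. The `started_at` values are compared as strings, which works as long as they use the sortable `YYYY-MM-DD HH:MM:SS` format. Note that the lower bound is strict, so a ride starting at exactly 00:00:00 or 00:00:01 on 24 October is not selected.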

In [14]:
mask = (df["started_at"] > "2022-10-24 00:00:01") & (df["started_at"] <= "2022-10-30 23:59:59")

In [15]:
subsetdf = df.loc[mask]

In [16]:
subsetdf.shape

(84747, 13)
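
A sketch of an alternative selection that parses the timestamps and uses a half-open interval, so that every ride starting on any of the seven days is kept, including those in the first second after midnight on Monday. `select_week` is a hypothetical helper; with it, the row count could differ slightly from the shape shown above.

```python
import pandas as pd

def select_week(trips, first_day="2022-10-24", last_day="2022-10-30"):
    """Return the rides whose `started_at` falls on first_day..last_day (inclusive)."""
    started = pd.to_datetime(trips["started_at"])
    start = pd.Timestamp(first_day)
    end = pd.Timestamp(last_day) + pd.Timedelta(days=1)  # midnight after the last day
    # Half-open interval [start, end): no need to spell out 23:59:59
    return trips.loc[(started >= start) & (started < end)]
```

Comparing against the midnight after the last day also covers timestamps with fractional seconds, which a `<= "... 23:59:59"` bound would miss.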

This shrinks our dataset to about 20% of the `NaN`-filtered October data (84,747 of 414,269 rows). We believe meaningful knowledge can still be obtained even with this constraint.
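
The _distance_ and _duration_ attributes mentioned earlier could be derived along the following lines. This is a sketch under stated assumptions: it assumes the columns `started_at`, `ended_at`, `start_lat`, `start_lng`, `end_lat` and `end_lng` are present, it uses the great-circle (haversine) distance between start and end points as a stand-in for trip distance, and the output column names `duration_min` and `distance_km` are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def add_duration_and_distance(trips):
    """Return a copy of `trips` with `duration_min` and `distance_km` columns added."""
    out = trips.copy()

    # Duration: difference between end and start timestamps, in minutes
    started = pd.to_datetime(out["started_at"])
    ended = pd.to_datetime(out["ended_at"])
    out["duration_min"] = (ended - started).dt.total_seconds() / 60

    # Distance: haversine formula on coordinates converted to radians
    lat1 = np.radians(out["start_lat"])
    lat2 = np.radians(out["end_lat"])
    dlat = lat2 - lat1
    dlng = np.radians(out["end_lng"] - out["start_lng"])
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    out["distance_km"] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return out
```

The straight-line distance understates the distance actually ridden, and it is zero for round trips that end where they started, so it is best read as a lower bound.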

Prepare to export dataframe to file.

In [17]:
subsetdf = subsetdf.reset_index(drop=True)  # drop=True avoids exporting the old index as an extra column

In [18]:
subsetdf.to_csv(PATH_PROCESSED + "divvy_dataset.csv", index=False)  # index=False important for Observable
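
Since the goal is a file under 50MB, it can be worth checking the size of the exported file. A minimal sketch follows; `fits_size_limit` is a hypothetical helper, and reading "50MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
import os

# Assumption: the 50MB limit is interpreted as 50 MiB.
OBSERVABLE_LIMIT_BYTES = 50 * 1024 * 1024

def fits_size_limit(path, limit_bytes=OBSERVABLE_LIMIT_BYTES):
    """Return (size_in_bytes, is_under_limit) for the file at `path`."""
    size = os.path.getsize(path)
    return size, size < limit_bytes
```

For this notebook the call would be `fits_size_limit(PATH_PROCESSED + "divvy_dataset.csv")`.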

At this point, **manually** delete the `data/raw/` folder.