### Get the Raw Data

First, we need to get the raw data from the web, which we can do with the `requests` package. The function `download_one_file_of_raw_data` does just that. It takes a year and a month and inserts them into the URL, so any month of any year can be downloaded. Single-digit months need a leading zero in the file name, so we format the month as `{month:02d}`, which pads it to two digits. The type hints in the function definition document what goes in and what comes out, which helps with debugging: the arguments are integers, and the function returns the path of the downloaded file.

After the request, we check `response.status_code`. A value of 200 means the request was successful, so we open the path we chose in `'wb'` (write binary) mode and write `response.content` to it. Binary mode is the right choice for non-text files such as images or Parquet files, and `response.content` holds the body of the response as bytes. If the status code is anything else, we raise an exception saying that the URL is not available.

In [13]:
from pathlib import Path
import requests

def download_one_file_of_raw_data(year: int, month: int) -> str:
    '''Download one month of raw ride data and return the path of the saved file.'''
    # build the url; {month:02d} pads single-digit months with a leading 0
    url = f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year}-{month:02d}.parquet'

    # send a GET request to the url
    response = requests.get(url)

    # status code 200 means the request was successful
    if response.status_code == 200:
        # 'wb' opens the file for writing binary data; response.content holds the response body as bytes
        path = f'../data/raw/rides_{year}-{month:02d}.parquet'
        # the with block closes the file as soon as the write is done
        with open(path, 'wb') as f:
            f.write(response.content)
        return path
    else:
        # any other status code means the file could not be downloaded
        raise Exception(f'{url} is not available.')

In [14]:
# call the function
download_one_file_of_raw_data(year = 2022, month = 1)

'../data/raw/rides_2022-01.parquet'
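
The download function above assumes that the `../data/raw` folder already exists; if it does not, opening the file for writing fails. As a minimal sketch (the names `BASE_URL`, `build_url`, `prepare_raw_path`, and the `raw_dir` parameter are hypothetical helpers, not part of the original notebook), the URL formatting and the folder creation could be separated out like this:

```python
from pathlib import Path

# Base URL copied from the download function above.
BASE_URL = 'https://d37ci6vzurychx.cloudfront.net/trip-data'


def build_url(year: int, month: int) -> str:
    '''Return the download url, padding the month to two digits.'''
    return f'{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet'


def prepare_raw_path(year: int, month: int, raw_dir: str = '../data/raw') -> Path:
    '''Return the local target path, creating the folder if it is missing.'''
    folder = Path(raw_dir)
    # parents=True creates intermediate folders; exist_ok=True avoids an error if it already exists
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f'rides_{year}-{month:02d}.parquet'


# month 1 is padded to 01 in the file name
print(build_url(2022, 1))
# prints https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
```

Calling `prepare_raw_path(2022, 1)` before writing the response content would make sure the target folder is in place, at the cost of one extra helper.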

### Load the Data

In [12]:
import pandas as pd

In [17]:
# load our data using pd.read_parquet
rides = pd.read_parquet('../data/raw/rides_2022-01.parquet')

# view first 5 rows
rides.head()

|  | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 3.8 | 1.0 | N | 142 | 236 | 1 | 14.5 | 3.0 | 0.5 | 3.65 | 0.0 | 0.3 | 21.95 | 2.5 | 0.0 |
| 1 | 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 2.1 | 1.0 | N | 236 | 42 | 1 | 8.0 | 0.5 | 0.5 | 4.0 | 0.0 | 0.3 | 13.3 | 0.0 | 0.0 |
| 2 | 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 0.97 | 1.0 | N | 166 | 166 | 1 | 7.5 | 0.5 | 0.5 | 1.76 | 0.0 | 0.3 | 10.56 | 0.0 | 0.0 |
| 3 | 2 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 1.09 | 1.0 | N | 114 | 68 | 2 | 8.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 11.8 | 2.5 | 0.0 |
| 4 | 2 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 4.3 | 1.0 | N | 68 | 163 | 1 | 23.5 | 0.5 | 0.5 | 3.0 | 0.0 | 0.3 | 30.3 | 2.5 | 0.0 |


For this project, we will only be working with the pickup time and the pickup location, so we keep just those two columns in our dataframe.

In [18]:
# only use pickup time and pick up location
rides = rides[['tpep_pickup_datetime', 'PULocationID']]

In [21]:
# rename columns; assigning the result avoids modifying a slice of the original dataframe in place
rides = rides.rename(columns={
    'tpep_pickup_datetime': 'pickup_datetime',
    'PULocationID': 'pickup_location_id'
})

In [22]:
rides.columns

Index(['pickup_datetime', 'pickup_location_id'], dtype='object')
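
The selection and renaming steps above can also be written as a single chain. The following is a minimal sketch under stated assumptions: `keep_pickup_columns` is a hypothetical helper name, and the small `raw` dataframe is built by hand from the first two rows of the `rides.head()` output so that the example runs without the downloaded file.

```python
import pandas as pd

# Two rows copied from the rides.head() output above; VendorID is kept
# only to show that columns we do not select are dropped.
raw = pd.DataFrame({
    'VendorID': [1, 1],
    'tpep_pickup_datetime': pd.to_datetime(['2022-01-01 00:35:40', '2022-01-01 00:33:43']),
    'PULocationID': [142, 236],
})


def keep_pickup_columns(df: pd.DataFrame) -> pd.DataFrame:
    '''Select the two pickup columns and give them shorter names.'''
    return (
        df[['tpep_pickup_datetime', 'PULocationID']]
        .rename(columns={
            'tpep_pickup_datetime': 'pickup_datetime',
            'PULocationID': 'pickup_location_id',
        })
    )


rides_small = keep_pickup_columns(raw)
print(list(rides_small.columns))
# prints ['pickup_datetime', 'pickup_location_id']
```

Because `rename` returns a new dataframe here rather than modifying one in place, the input dataframe is left unchanged and the function can be reused for every month that is downloaded.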