# Data scraping, preparation, exploration, and pre-processing

## Data scraping

We scrape article metadata using the New York Times API. For our project we analyze articles published between May 15th, 2023 and May 14th, 2024. We start by importing a few libraries and setting up the API.  
> **Note:** To run the following code you will need to create an account at [developer.nytimes.com](https://developer.nytimes.com) and obtain your own API key. You will then need to create a file `keys.py` in your local repo directory and save your API key as a string under the variable name `NYT_API_KEY`.  
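> A minimal sketch of what `keys.py` could contain is shown here. The value is a placeholder, not a real key, and keeping the file out of version control is our suggestion rather than a requirement of the API.

```python
# keys.py
# Placeholder value (assumption): replace it with the key from your
# own NYT developer account, and avoid committing this file.
NYT_API_KEY = "your-api-key-here"
```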

In [2]:
name: NYTimes
channels:
  - conda-forge
  - defaults
dependencies:
  - appnope=0.1.4=pyhd8ed1ab_0
  - asttokens=2.4.1=pyhd8ed1ab_0
  - bzip2=1.0.8=h6c40b1e_6
  - ca-certificates=2024.3.11=hecd8cb5_0
  - comm=0.2.2=pyhd8ed1ab_0
  - debugpy=1.6.7=py312hcec6c5f_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - exceptiongroup=1.2.0=pyhd8ed1ab_2
  - executing=2.0.1=pyhd8ed1ab_0
  - expat=2.6.2=hcec6c5f_0
  - importlib-metadata=7.1.0=pyha770c72_0
  - importlib_metadata=7.1.0=hd8ed1ab_0
  - ipykernel=6.29.3=pyh3cd1d5f_0
  - ipython=8.24.0=pyh707e725_0
  - jedi=0.19.1=pyhd8ed1ab_0
  - jupyter_client=8.6.1=pyhd8ed1ab_0
  - jupyter_core=5.5.0=py312hecd8cb5_0
  - libcxx=14.0.6=h9765a3e_0
  - libffi=3.4.4=hecd8cb5_1
  - libsodium=1.0.18=hbcb3906_1
  - matplotlib-inline=0.1.7=pyhd8ed1ab_0
  - ncurses=6.4=hcec6c5f_0
  - nest-asyncio=1.6.0=pyhd8ed1ab_0
  - openssl=3.3.0=h87427d6_2
  - packaging=24.0=pyhd8ed1ab_0
  - parso=0.8.4=pyhd8ed1ab_0
  - pexpect=4.9.0=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=24.0=py312hecd8cb5_0
  - platformdirs=4.2.2=pyhd8ed1ab_0
  - prompt-toolkit=3.0.42=pyha770c72_0
  - psutil=5.9.0=py312h6c40b1e_0
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.2=pyhd8ed1ab_0
  - pygments=2.18.0=pyhd8ed1ab_0
  - python=3.12.2=hd58486a_0
  - pyzmq=25.1.2=py312hcec6c5f_0
  - readline=8.2=hca72f7f_0
  - setuptools=69.5.1=py312hecd8cb5_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.45.3=h6c40b1e_0
  - stack_data=0.6.2=pyhd8ed1ab_0
  - tk=8.6.14=h4d00af3_0
  - tornado=6.3.3=py312h6c40b1e_0
  - traitlets=5.14.3=pyhd8ed1ab_0
  - typing_extensions=4.11.0=pyha770c72_0
  - wcwidth=0.2.13=pyhd8ed1ab_0
  - wheel=0.43.0=py312hecd8cb5_0
  - xz=5.4.6=h6c40b1e_1
  - zeromq=4.3.5=hcec6c5f_0
  - zipp=3.17.0=pyhd8ed1ab_0
  - zlib=1.2.13=h4b97444_1
  - pip:
      - numpy==1.26.4
      - pandas==2.2.2
      - python-dateutil==2.9.0.post0
      - pytz==2024.1
      - tzdata==2024.1
prefix: /opt/miniconda3/envs/NYTimes


In [1]:
import pandas as pd
from pynytimes import NYTAPI
from datetime import datetime
from keys import NYT_API_KEY

nyt_api = NYTAPI(NYT_API_KEY, parse_dates=True)



We then fetch the data one month at a time and store the result in a new `.csv` file. Note that the API has a limit of 5 calls per minute (see the API [FAQ](https://developer.nytimes.com/faq#a11)), so we need to wait at least 12 seconds between consecutive calls. There is also a limit of 500 calls per day.  
We wrap the process in a function so we can easily adjust the start and end dates or change the path of the destination file. The function also checks whether the file already exists and, if so, asks for confirmation before running. This reduces the risk of accidentally restarting the relatively time-consuming process and limits unnecessary calls to the NYT servers.  
Since we are only interested in articles, we also filter out all other document types during the process.  
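To make the rate-limit arithmetic concrete, here is a minimal sketch using only the standard library. The helper names `min_interval_seconds` and `months_spanned` are illustrative and not part of `pynytimes`; the sketch assumes one archive request per calendar month, as described above.

```python
from datetime import date

CALLS_PER_MINUTE = 5   # per-minute limit quoted above
CALLS_PER_DAY = 500    # daily limit quoted above


def min_interval_seconds(calls_per_minute: int) -> float:
    """Shortest pause between calls that stays within the per-minute limit."""
    return 60 / calls_per_minute


def months_spanned(start: date, end: date) -> int:
    """Number of calendar months touched by the period from start to end, inclusive."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


# Assumption: one request per calendar month in the study period.
n_calls = months_spanned(date(2023, 5, 15), date(2024, 5, 14))
wait = min_interval_seconds(CALLS_PER_MINUTE)
total_wait = n_calls * wait

print(f"{n_calls} calls, {wait:.0f} s apart, about {total_wait:.0f} s of waiting")
print(f"Within the daily limit: {n_calls <= CALLS_PER_DAY}")
```

Under these assumptions a full run needs 13 requests, far below the daily limit, and spends roughly two and a half minutes waiting.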

In [52]:
from os.path import exists
from dateutil.rrule import rrule, MONTHLY
from time import sleep

def scrape(start_date: datetime, end_date: datetime, dest: str='data/nyt_metadata.csv') -> None:
    # Avoid restarting the process if the .csv already exists...
    # (`exists` also handles nested directories and bare file names.)
    execute = not exists(dest)
    # ...unless the user confirms otherwise.
    if not execute:
        execute = input(f'A file already exists at `{dest}`. Do you want to continue? [y/n] ').lower() in ['y', 'yes']
    
    if execute:
        df = pd.DataFrame()
        # rrule steps from start_date in whole months, so the last month is only
        # visited if end_date's day of the month is on or after start_date's.
        for month in rrule(MONTHLY, dtstart=start_date, until=end_date):
            print(f'Working on the month of {month}')
            # One API call returns the metadata for the whole calendar month.
            monthly_df = pd.DataFrame(nyt_api.archive_metadata(date=month))
            # Keep start_date <= pub_date < end_date, comparing the dates as strings.
            pub_date = monthly_df.pub_date.apply(str)
            monthly_df = monthly_df.loc[(pub_date >= str(start_date)) & (pub_date < str(end_date))]
            # Keep articles only and drop every other document type.
            monthly_df = monthly_df.loc[monthly_df.document_type == 'article']
            if not monthly_df.empty:
                df = pd.concat([df, monthly_df]) if not df.empty else monthly_df
            sleep(12)   # The NYT servers have a limit of 5 calls per minute.
        
        df.to_csv(dest)

In [54]:
scrape(datetime(2023, 5, 15), datetime(2024, 5, 15))

Working on the month of 2023-05-15 00:00:00
Working on the month of 2023-06-15 00:00:00
Working on the month of 2023-07-15 00:00:00
Working on the month of 2023-08-15 00:00:00
Working on the month of 2023-09-15 00:00:00
Working on the month of 2023-10-15 00:00:00
Working on the month of 2023-11-15 00:00:00
Working on the month of 2023-12-15 00:00:00
Working on the month of 2024-01-15 00:00:00
Working on the month of 2024-02-15 00:00:00
Working on the month of 2024-03-15 00:00:00
Working on the month of 2024-04-15 00:00:00
Working on the month of 2024-05-15 00:00:00
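
The call above passes May 15th, 2024 as the end date because `scrape` treats the upper bound as exclusive, so the last day kept is May 14th, 2024. The sketch below illustrates that half-open window together with the document-type filter on a few made-up records. It uses plain dictionaries and naive `datetime` objects as a simplifying assumption, rather than the data frame returned by the API, and `in_window` is a hypothetical helper.

```python
from datetime import datetime


def in_window(pub_date: datetime, start: datetime, end: datetime) -> bool:
    """True if start <= pub_date < end (the end bound is exclusive)."""
    return start <= pub_date < end


start, end = datetime(2023, 5, 15), datetime(2024, 5, 15)

# Made-up records for illustration only; not real NYT data.
records = [
    {"pub_date": datetime(2023, 5, 14, 23, 59), "document_type": "article"},     # too early
    {"pub_date": datetime(2023, 5, 15, 0, 0), "document_type": "article"},       # kept
    {"pub_date": datetime(2024, 5, 14, 23, 59), "document_type": "multimedia"},  # not an article
    {"pub_date": datetime(2024, 5, 14, 23, 59), "document_type": "article"},     # kept
    {"pub_date": datetime(2024, 5, 15, 0, 0), "document_type": "article"},       # end is exclusive
]

kept = [
    r for r in records
    if in_window(r["pub_date"], start, end) and r["document_type"] == "article"
]

for r in kept:
    print(r["pub_date"], r["document_type"])
```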
