## A notebook for testing headline processing
This version used the OLD format of the NYT headlines, which came from newsapi. NYT headlines later stopped being served by newsapi, so they need a separate processing method. Then, sometime between January and April 2020, newsapi started serving headlines from nytimes.com.

## Field Mapping

## Bylines
The author field from newsapi is a mess. Each source has a different style. Note: the source for all of these comes through as "ABC News", so there will be duplicates from AP.

From abc-news:
- PHILIP MARCELO Associated Press
- Thomas Smith
- ANDREW DALTON AP Entertainment Writer
- Benjamin Siegel, Adia Robinson

From associated-press:
- By RONALD BLUM AP Baseball
- By JOHNSON LAI and ELAINE KURTENBACH Associated Press
- By JEFF HAMPTON The Virginian-Pilot
- By DANNY MCARTHUR, Daily Journal
- By The Associated Press
- null (empty)

From fox-news:
- Joseph Wulfsohn
- L. Brent Bozell III, Tim Graham
- Associated Press
- null (empty)

From msnbc (OK, all of these are "watch" urls, maybe this isn't what I needed):
- MSNBC.com

From the-washington-post:
- Juan Zamorano | AP
- Associated Press
- Carolyn Hax
- Sudarsan Raghavan, Loveday Morris
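
A few patterns repeat across the newsapi bylines listed so far: a leading "By ", a trailing "| AP" tag, and multiple names joined by commas or "and". A rough sketch of a cleanup step follows. The helper name `split_byline` is hypothetical, and the rules are guesses drawn only from these samples.

```python
import re


def split_byline(author):
    """Split a newsapi `author` string into a list of name-like chunks."""
    if not author:
        # null or empty author field
        return []
    # Drop a leading "By " (associated-press style).
    text = re.sub(r"^By\s+", "", author.strip())
    # Drop anything after a pipe, e.g. "Juan Zamorano | AP".
    text = text.split("|")[0]
    # Separate multiple authors joined by commas or "and".
    parts = re.split(r",\s*|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]


print(split_byline("Benjamin Siegel, Adia Robinson"))
print(split_byline("Juan Zamorano | AP"))
```

This leaves trailing agency or outlet tags in place ("PHILIP MARCELO Associated Press" stays one chunk) and treats "Daily Journal" in "By DANNY MCARTHUR, Daily Journal" as if it were a second author, so it is only a starting point.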

From the-new-york-times (NYT API), `byline['original']` might be:
- By Caroline Biggs
- By The New York Times
- null (`byline['person']` is an empty list)

```
"byline": {
  "original": "By Caroline Biggs",
  "person": [
    {
      "firstname": "Caroline",
      "middlename": null,
      "lastname": "Biggs",
      "qualifier": null,
      "title": null,
      "role": "reported",
      "organization": "",
      "rank": 1
    }
  ],
  "organization": null
}
```
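
Since the NYT byline is structured, an author string could be built from the `person` list, falling back to `original` when the list is empty. This is a minimal sketch: `nyt_author` is a hypothetical helper, and it assumes every byline follows the shape of the sample above.

```python
def nyt_author(byline):
    """Build an author string from an NYT `byline` object."""
    if not byline:
        return None
    names = []
    for person in byline.get("person") or []:
        parts = [
            person.get("firstname"),
            person.get("middlename"),
            person.get("lastname"),
        ]
        # Skip null or empty name parts.
        names.append(" ".join(p for p in parts if p))
    if names:
        return ", ".join(names)
    # No people listed: fall back to the free-text field,
    # e.g. "By The New York Times".
    original = byline.get("original")
    if original and original.startswith("By "):
        return original[3:]
    return original


sample = {
    "original": "By Caroline Biggs",
    "person": [
        {"firstname": "Caroline", "middlename": None, "lastname": "Biggs"}
    ],
    "organization": None,
}
print(nyt_author(sample))
```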

For NYT, the print headline occasionally differs from the main headline.
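
A small sketch of how the two headlines could be compared. It assumes the print version is stored under `headline['print_headline']`, which is not shown in this notebook and should be checked against the raw files. `headline_variants` is a hypothetical helper.

```python
def headline_variants(doc):
    """Return (main, print) for one NYT doc; print is None if absent or identical."""
    headline = doc.get("headline") or {}
    main = headline.get("main")
    # Assumed field name; empty strings are treated as missing.
    printed = headline.get("print_headline") or None
    if printed == main:
        printed = None
    return main, printed


doc = {"headline": {"main": "Web headline", "print_headline": "Print headline"}}
print(headline_variants(doc))
```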

The NYT API returns its results in `data['response']['docs']`.

| newsapi | nyt-api |
|---|---|
| author | `data['response']['docs'][i]['byline']['original']` |
| title | `data['response']['docs'][i]['headline']['main']` |
| description | `data['response']['docs'][i]['abstract']` |
| url | `data['response']['docs'][i]['web_url']` |
| urlToImage | `'https://static01.nyt.com/'` + `data['response']['docs'][i]['multimedia'][0]['url']` |
| publishedAt | `data['response']['docs'][i]['pub_date']` |
| content | `data['response']['docs'][i]['lead_paragraph']` |
| source.id | 'nytimes' |
| source.name | 'The New York Times' |


In [1]:
import glob
import json
import logging
import os
from datetime import datetime, timedelta
import pandas as pd

In [2]:
def find_files(base_dir, begin_date, end_date):
    """Return the JSON files in base_dir whose names contain a date in the range.

    Both begin_date and end_date are included. Files are grouped by date and
    sorted by name within each date.
    """

    n_days = (end_date - begin_date).days

    filenames = []

    for offset in range(n_days + 1):
        # Filenames are expected to embed the date as YYYY-MM-DD.
        search_str = (begin_date + timedelta(days=offset)).strftime('%Y-%m-%d')
        filenames += sorted(glob.glob(os.path.join(base_dir, '*' + search_str + '*.json')))

    return filenames

In [5]:
def process_nytsource_on_date(date, source='the-new-york-times'):
    """Convert one day of raw NYT API JSON files into a single newsapi-style CSV.

    Expects the NYT API layout (articles under response -> docs). A file
    without a top-level 'response' key, for example one saved in the older
    newsapi format, raises KeyError: 'response'.
    """

    source_dir = f"/home/will/Projects/headliner/datastore/raw/{source}/"
    out_dir = f"/home/will/Projects/headliner/datastore/processed/{source}/"

    filenames = find_files(source_dir, date, date)

    # Output columns, named after the newsapi fields they stand in for.
    columns = ['author', 'title', 'description', 'url',
               'urlToImage', 'publishedAt', 'content']

    def concat_files(file_list):
        """Load every file and stack its articles into one DataFrame."""

        results = []

        for filename in file_list:

            with open(filename, "r") as file:
                raw = json.load(file)

            rows = []

            for item in raw['response']['docs']:

                # Image paths are relative to static01.nyt.com; some docs have none.
                if len(item['multimedia']) > 0:
                    image_url = 'https://static01.nyt.com/' + item['multimedia'][0]['url']
                else:
                    image_url = None

                rows.append(
                    {
                        'author': item['byline']['original'],
                        'title': item['headline']['main'],
                        'description': item['abstract'],
                        'url': item['web_url'],
                        'urlToImage': image_url,
                        'publishedAt': item['pub_date'],
                        'content': item['lead_paragraph']
                    }
                )

            to_add = pd.DataFrame(rows, columns=columns)
            to_add['source.id'] = source
            to_add['source.name'] = 'The New York Times'

            results.append(to_add)

        return pd.concat(results, ignore_index=True)

    concatted = concat_files(filenames)
    concatted.to_csv(os.path.join(out_dir, f"{source}-{date.strftime('%Y-%m-%d')}.csv"), index=False)

    return True

In [7]:
process_nytsource_on_date(datetime(2020,1,29,0,0))

KeyError: 'response'
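
The `KeyError` suggests that the files for this date are not in the NYT API layout. One way to check before parsing is to look at the top-level keys. This is a sketch only: `detect_format` is a hypothetical helper, and it assumes the raw files are unmodified API responses, with newsapi payloads carrying a top-level `articles` list.

```python
def detect_format(data):
    """Guess which API produced a parsed JSON payload."""
    if isinstance(data, dict):
        # newsapi responses keep the items under "articles" (assumed here).
        if "articles" in data:
            return "newsapi"
        # NYT API responses keep the items under response -> docs.
        response = data.get("response")
        if isinstance(response, dict) and "docs" in response:
            return "nyt-api"
    return "unknown"


print(detect_format({"status": "ok", "totalResults": 0, "articles": []}))
print(detect_format({"response": {"docs": []}}))
```

Files reported as `newsapi` could then be routed to the older processing path instead of `process_nytsource_on_date`.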

In [None]:
filenames = find_files(, date, date)