# Finding computational notebooks of data journalists

Many data journalism articles are published alongside computational notebooks that are kept under version control with git and hosted on GitHub. This Jupyter notebook details the methods used to search for and download these notebooks as git submodules.
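
As a rough illustration of the submodule step, the sketch below builds a `git submodule add` command for a GitHub URL without running it. The function name `submodule_command` and the `notebooks/<owner>/<repo>` directory layout are hypothetical choices made for this example, not necessarily the layout used in this analysis. It assumes that a URL pointing inside a repo (for example at a `/tree/master/...` subdirectory) should be reduced to the repo root, since a submodule is always a whole repository.

```python
from urllib.parse import urlparse


def submodule_command(url, base_dir='notebooks'):
    """Build (but do not run) a `git submodule add` command for a GitHub URL.

    The target path base_dir/owner/repo is a hypothetical layout chosen
    for illustration only.
    """
    # The first two path segments of a GitHub URL are the owner and the repo;
    # anything after them (e.g. /tree/master/...) is dropped.
    owner, repo = urlparse(url).path.strip('/').split('/')[:2]
    repo_url = 'https://github.com/{}/{}'.format(owner, repo)
    target = '{}/{}/{}'.format(base_dir, owner, repo)
    return ['git', 'submodule', 'add', repo_url, target]
```

The returned list can be passed to `subprocess.run(..., check=True)` from inside an existing git repository to actually add the submodule.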

## A quick search for data journalism repos on GitHub

I start by using the GitHub Search API to see which users tag their repos with the "journalism" or "data-journalism" topics. The code below returns whichever repositories carry these topics on GitHub at the time it is run. Results are written to `search_results.txt` in order to preserve the search results as they were at the time of this analysis.

In [1]:
import requests

# Fetch a sorted, unique list of GitHub repos with relevant tags
repos = []
for tag in ['data-journalism', 'journalism']:
    url = 'https://api.github.com/search/repositories?q=topic:{}'.format(tag)
    r = requests.get(url)
    r.raise_for_status()  # stop early on rate limits or other HTTP errors
    repos += [item['html_url'] for item in r.json()['items']]
repos = sorted(set(repos))  # deduplicate and sort the results

# Save this stuff to a text file
fn = 'search_results.txt'
results = '{} Results\n'.format(len(repos))
results += '\n'.join(repos)
with open(fn, 'w') as f:
    f.write(results)

with open(fn, 'r') as f:
    print(f.read())

54 Results
https://github.com/BurdaMagazinOrg/thunder-distribution
https://github.com/CJWorkbench/cjworkbench
https://github.com/Caaddss/coda.br_workshop
https://github.com/Enegnei/JacobAppelbaumLeavesTor
https://github.com/IVMachiavelli/OSINT_Team_Links
https://github.com/N2ITN/are-you-fake-news
https://github.com/ONLPS/Datasets
https://github.com/OpenRefine/OpenRefine
https://github.com/Ph055a/OSINT-Collection
https://github.com/TryGhost/Ghost
https://github.com/TryGhost/Ghost-CLI
https://github.com/alephdata/aleph
https://github.com/alephdata/opensanctions
https://github.com/brandonrobertz/autoscrape-py
https://github.com/california-civic-data-coalition/django-calaccess-raw-data
https://github.com/coralproject/talk
https://github.com/dannguyen/journalism-syllabi
https://github.com/datadesk/california-ccscore-analysis
https://github.com/datadesk/california-crop-production-wages-analysis
https://github.com/datadesk/california-electricity-capacity-analysis
https://github.com/datadesk/c

A manual inspection of each repo revealed that many of these are tools and tutorials for data journalism. However, I'm only interested in computational notebooks that document the data workflows used in published articles. This search did turn up a handful of interesting repos worth a closer look. I download notebooks from the *Los Angeles Times* later and include salient results from this search in the list of repos downloaded individually.
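
One caveat about the search itself: the GitHub Search API returns results in pages (30 items per page by default, at most 100), so a single request per topic only sees the first page. The sketch below shows one way to walk through the pages. It is a minimal illustration rather than the method used for the results above; the names `fetch_json` and `search_topic` are hypothetical, and the fetcher is injectable so the paging logic can be exercised without network access. It uses only the standard library instead of `requests`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Default fetcher: GET a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def search_topic(topic, fetch=fetch_json, per_page=100, max_pages=10):
    """Return a sorted, unique list of html_url values for repos with `topic`.

    max_pages * per_page defaults to 1,000, the most the Search API
    returns for a single query.
    """
    urls = []
    for page in range(1, max_pages + 1):
        query = urlencode({'q': 'topic:' + topic,
                           'per_page': per_page,
                           'page': page})
        items = fetch('https://api.github.com/search/repositories?' + query)['items']
        urls += [item['html_url'] for item in items]
        if len(items) < per_page:  # a short page means there is nothing left
            break
    return sorted(set(urls))
```

Calling `search_topic('data-journalism')` and `search_topic('journalism')` and merging the two lists would approximate the search above while covering later pages. Unauthenticated requests are rate limited, so a real run may need an access token.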

### Notebooks from the *Los Angeles Times*

While many relevant notebooks from the *Los Angeles Times* were returned in the search above, the *LA Times* Data Desk keeps a centralized [repository of all their computational notebooks](https://github.com/datadesk/notebooks/tree/d0fea75950b7911d37cd5c2fc1995da498f3d494). This index includes notebooks hosted on journalists' personal GitHub profiles, too.

Including the latest commit ID in the raw URL to `notebooks.csv` ensures that the bash command below returns only the notebooks published as of my analysis on May 26, 2019. GitHub has a few [hidden features for permanent links](https://help.github.com/en/articles/getting-permanent-links-to-files).

In [1]:
! mkdir -p notebooks/la_times \
&& cd notebooks/la_times \
&& curl "https://raw.githubusercontent.com/datadesk/notebooks/d0fea75950b7911d37cd5c2fc1995da498f3d494/notebooks.csv" \
| awk -F ',' '{print $4}' \
| tail -n +2

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3781  100  3781    0     0  15063      0 --:--:-- --:--:-- --:--:-- 15063
https://github.com/datadesk/census-hard-to-map-analysis
https://github.com/datadesk/hsr-document-analysis
https://github.com/ryanvmenezes/lawlers-law
https://github.com/datadesk/swana-census-analysis
https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-california-fire-aircraft-delay
https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal
https://github.com/datadesk/highschool-homicide-analysis
https://github.com/datadesk/deleon-district-election-results-analysis
https://github.com/datadesk/nyrb-covers-analysis
https://github.com/datadesk/california-fire-zone-analysis
https://github.com/datadesk/helicopter-accident-analysis
https://github.com/datadesk/hollister-ranch-analysis
https://github.com/datadesk/la-settlements-analysis
https:
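
The `awk -F ','` step above splits on every comma, which would pick the wrong field if any earlier column contained a quoted comma. As a hedged alternative, the sketch below reads the same fourth column with Python's `csv` module, which understands quoting. The function name `fourth_column` is hypothetical, and the sketch assumes, as the bash command does, that the repo URL sits in the fourth column and that the first row is a header.

```python
import csv
import io


def fourth_column(csv_text):
    """Return the fourth field of every data row, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # drop the header, like `tail -n +2`
    # Rows with fewer than four fields (e.g. blank lines) are ignored.
    return [row[3] for row in rows if len(row) > 3]
```

The text of `notebooks.csv` can be fetched from the same commit-pinned raw URL used in the bash command and passed straight to this function.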

### Adding notebooks from *BuzzFeed News*

*BuzzFeed News* doesn't tag [their repos on GitHub](https://github.com/buzzfeed), so they didn't show up in the initial search. However, I know they publish much of the code behind their data journalism. Like the *LA Times*, *BuzzFeed News* has [a central repository](https://github.com/BuzzFeedNews/everything) containing a single markdown document that serves as an index for all their projects on GitHub, including data and analysis.

In [2]:
! mkdir -p notebooks/buzzfeednews \
&& cd notebooks/buzzfeednews \
&& curl -s "https://raw.githubusercontent.com/BuzzFeedNews/everything/a2cbcc78ef67322a8443553645f85e6b310e658c/README.md" \
| grep "\[:link:\]" \
| grep -E -o "https:\/\/github\.com\/BuzzFeedNews\/[^\)]+"

https://github.com/BuzzFeedNews/2019-04-democratic-candidate-codonors
https://github.com/BuzzFeedNews/2018-12-fake-news-top-50
https://github.com/BuzzFeedNews/2018-12-wechat-pence/
https://github.com/BuzzFeedNews/2018-10-russian-troll-tweets
https://github.com/BuzzFeedNews/2018-10-midterm-demographics
https://github.com/BuzzFeedNews/2018-09-ftc-analysis
https://github.com/BuzzFeedNews/2018-08-charlottesville-twitter-trolls
https://github.com/BuzzFeedNews/2018-07-ofsted-inspections
https://github.com/BuzzFeedNews/2018-06-nyc-311-complaints-and-gentrification
https://github.com/BuzzFeedNews/2018-05-fentanyl-and-cocaine-overdose-deaths
https://github.com/BuzzFeedNews/2018-03-oscars-script-diversity-analysis
https://github.com/BuzzFeedNews/2018-02-olympic-figure-skating-analysis
https://github.com/BuzzFeedNews/2018-02-figure-skating-analysis
https://github.com/BuzzFeedNews/2018-01-trump-state-of-the-union
https://github.com/BuzzFeedNews/2018-01-twitter-withheld-accounts
http
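
For readers who prefer to stay in Python, the two `grep` steps above can be approximated as in the sketch below: keep only lines containing `[:link:]`, then pull out every `https://github.com/BuzzFeedNews/...` URL up to the closing parenthesis. The function name `extract_repo_links` is hypothetical. Stripping the trailing slash is an extra normalization added here (some entries in the output above end in `/`), not something the bash command does.

```python
import re

# Same pattern as the second grep: everything up to the closing parenthesis.
REPO_LINK = re.compile(r'https://github\.com/BuzzFeedNews/[^)]+')


def extract_repo_links(markdown):
    """Return BuzzFeedNews repo URLs found on lines marked with [:link:]."""
    links = []
    for line in markdown.splitlines():
        if '[:link:]' in line:  # mirrors the first grep
            # Added normalization: drop any trailing slash from the URL.
            links += [url.rstrip('/') for url in REPO_LINK.findall(line)]
    return links
```

Passing the text of the pinned `README.md` to `extract_repo_links` should yield the same repos as the bash pipeline, in the same order, minus the trailing slashes.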