# Fetching computational notebooks of data journalists

Many data journalism articles are published alongside computational notebooks that are version-controlled with git and hosted on GitHub. This Jupyter notebook details the methods used to search for these notebooks and download them as git submodules.

## A quick search for data journalism repos on GitHub

I initially use the GitHub Search API to see which users are adding "journalism" or "data journalism" tags to their repos. The code below will return search results that reflect the current collection of code repositories with these tags on GitHub. Results are written to `search_results.txt` in order to preserve the search results at the time of this analysis.

In [1]:
import requests

# Fetch a sorted, unique list of GitHub repos with relevant tags
repos = []
for tag in ['data-journalism', 'journalism']:
    url = 'https://api.github.com/search/repositories?q=topic:{}'.format(tag)
    r = requests.get(url)
    repos += [ item['html_url'] for item in r.json()['items'] ]
repos = list(set(repos))  # deduplicate results
repos.sort()

# Save this stuff to a text file
fn = 'search_results.txt'
results = '{} Results\n'.format(len(repos))
results += '\n'.join(repos)
with open(fn, 'w') as f:
    f.write(results)

with open(fn, 'r') as f:
    print(f.read())

54 Results
https://github.com/BurdaMagazinOrg/thunder-distribution
https://github.com/CJWorkbench/cjworkbench
https://github.com/Caaddss/coda.br_workshop
https://github.com/Enegnei/JacobAppelbaumLeavesTor
https://github.com/IVMachiavelli/OSINT_Team_Links
https://github.com/N2ITN/are-you-fake-news
https://github.com/ONLPS/Datasets
https://github.com/OpenRefine/OpenRefine
https://github.com/Ph055a/OSINT-Collection
https://github.com/TryGhost/Ghost
https://github.com/TryGhost/Ghost-CLI
https://github.com/alephdata/aleph
https://github.com/alephdata/opensanctions
https://github.com/brandonrobertz/autoscrape-py
https://github.com/california-civic-data-coalition/django-calaccess-raw-data
https://github.com/coralproject/talk
https://github.com/dannguyen/journalism-syllabi
https://github.com/datadesk/california-ccscore-analysis
https://github.com/datadesk/california-crop-production-wages-analysis
https://github.com/datadesk/california-electricity-capacity-analysis
https://github.com/datadesk/c
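
The cell above requests only the first page of results for each topic. GitHub's Search API returns 30 items per page by default, accepts `per_page` values up to 100, and serves at most 1,000 results per query. The following is a minimal sketch of how the same search could walk through every page. The function name `search_repos_by_topic` and the injectable `fetch` parameter are my own illustrative choices, not part of the original analysis.

```python
def search_repos_by_topic(topic, fetch=None, per_page=100, max_pages=10):
    """Collect the html_url of every repo carrying `topic`, page by page."""
    if fetch is None:
        # Default fetcher: one unauthenticated GET per page of results.
        # requests is imported here so the paging logic can be exercised
        # with a stand-in fetcher when the package is not installed.
        import requests

        def fetch(url, params):
            response = requests.get(url, params=params)
            response.raise_for_status()
            return response.json()

    api = 'https://api.github.com/search/repositories'
    urls = []
    for page in range(1, max_pages + 1):
        params = {'q': 'topic:{}'.format(topic),
                  'per_page': per_page,
                  'page': page}
        items = fetch(api, params).get('items', [])
        urls += [item['html_url'] for item in items]
        if len(items) < per_page:  # a short page means nothing is left
            break
    return urls
```

Calling `sorted(set(search_repos_by_topic('data-journalism') + search_repos_by_topic('journalism')))` would reproduce the deduplicated, sorted list built above. Unauthenticated search requests are rate limited, so a larger crawl would likely need an access token.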

A manual inspection of each repo revealed that many of these are tools and tutorials for data journalism. However, I'm only interested in computational notebooks that contain the data workflows used in published articles. This search did turn up a handful of interesting repos worth a closer look. I download notebooks from the *Los Angeles Times* later and include salient results from this search in the list of repos downloaded individually.

### Notebooks from the *Los Angeles Times*

While many relevant notebooks from the *Los Angeles Times* were returned in the search above, the *LA Times* Data Desk keeps a centralized [repository of all their computational notebooks](https://github.com/datadesk/notebooks/tree/d0fea75950b7911d37cd5c2fc1995da498f3d494). This index also includes notebooks hosted on journalists' personal GitHub profiles.

Including the latest commit ID in the raw URL to `notebooks.csv` ensures that the bash command below returns only notebooks published at the time of my analysis, May 26, 2019. GitHub has a few [hidden features for permanent links](https://help.github.com/en/articles/getting-permanent-links-to-files).
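
To make the pinning explicit, here is a small sketch that assembles a raw-file URL from a full commit SHA rather than a branch name. The helper name `raw_permalink` is hypothetical. The URL layout is taken from the `curl` command in this notebook.

```python
import re


def raw_permalink(owner, repo, commit, path):
    """Build a raw.githubusercontent.com URL pinned to one commit."""
    # Require a full 40-character SHA so the link cannot drift the way a
    # branch name such as "master" would.
    if not re.fullmatch(r'[0-9a-f]{40}', commit):
        raise ValueError('expected a full 40-character commit SHA')
    return 'https://raw.githubusercontent.com/{}/{}/{}/{}'.format(
        owner, repo, commit, path)


url = raw_permalink('datadesk', 'notebooks',
                    'd0fea75950b7911d37cd5c2fc1995da498f3d494',
                    'notebooks.csv')
```

Because the commit is part of the URL, rerunning the download later should fetch the same `notebooks.csv` even if the index has since changed.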

In [2]:
! mkdir -p notebooks/la_times \
&& cd notebooks/la_times \
&& curl "https://raw.githubusercontent.com/datadesk/notebooks/d0fea75950b7911d37cd5c2fc1995da498f3d494/notebooks.csv" \
| awk -F ',' '{print $4}' \
| tail -n +2 \
| xargs -I {} git submodule add "{}.git"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3781  100  3781    0     0   6899      0 --:--:-- --:--:-- --:--:--  6887
Cloning into '/home/jovyan/notebooks/la_times/census-hard-to-map-analysis'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 39 (delta 11), reused 28 (delta 6), pack-reused 0
Unpacking objects: 100% (39/39), done.
Cloning into '/home/jovyan/notebooks/la_times/hsr-document-analysis'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 26 (delta 11), reused 12 (delta 3), pack-reused 0
Unpacking objects: 100% (26/26), done.
Cloning into '/home/jovyan/notebooks/la_times/lawlers-law'...
remote: Enumerating objects: 27948, done.
remote: 
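
The `awk -F ','` step above splits on every comma, so a quoted field that itself contains a comma would shift the columns. As a hedged alternative, the sketch below does the same extraction with Python's `csv` module. It takes the fourth column to mirror `awk '{print $4}'`, skips the header row like `tail -n +2`, and appends `.git` to each value. The function name `clone_urls` and the sample rows are made up for illustration. The real column names in `notebooks.csv` are not shown in this notebook.

```python
import csv
import io


def clone_urls(csv_text, column=3):
    """Return '<url>.git' for the given zero-based column of each data row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return ['{}.git'.format(row[column])
            for row in reader
            if len(row) > column and row[column]]


# Hypothetical sample data; the first field of the first row contains a comma.
sample = (
    'a,b,c,d\n'
    '"x, y",2,3,https://github.com/example/one\n'
    'z,2,3,https://github.com/example/two\n'
)
urls = clone_urls(sample)
```

Each returned URL could then be passed to `git submodule add`, for example through `subprocess.run(['git', 'submodule', 'add', url], check=True)`.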

### Converting to PDF for coding

In [9]:
%%bash

cd notebooks

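# Repos whose notebooks will be converted
notebooks=(california-ccscore-analysis)

for notebook in "${notebooks[@]}"
do
    # The conversion step is not shown in the source; ':' is a no-op
    # placeholder so that the loop is valid bash.
    :
done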

find: ‘**/*.ipynb’: No such file or directory


CalledProcessError: Command 'b'\ncd notebooks && find **/*.ipynb\n'' returned non-zero exit status 1.
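
The error above most likely arises because `%%bash` runs a non-interactive shell with the `globstar` option turned off. In that mode `**` behaves like a single `*`, and a pattern that matches nothing is passed to `find` as a literal string. A minimal sketch of a recursive search that does not depend on shell globbing follows. The helper names `find_notebooks` and `nbconvert_commands` are hypothetical, and using `jupyter nbconvert --to pdf` for the conversion is my assumption about the intended step, not something the source confirms.

```python
from pathlib import Path


def find_notebooks(root):
    """Return every .ipynb file under root, skipping autosave checkpoints."""
    return sorted(
        path for path in Path(root).rglob('*.ipynb')
        if '.ipynb_checkpoints' not in path.parts
    )


def nbconvert_commands(paths):
    """Build one `jupyter nbconvert --to pdf` argument list per notebook."""
    return [['jupyter', 'nbconvert', '--to', 'pdf', str(path)]
            for path in paths]
```

Each argument list can be handed to `subprocess.run(cmd, check=True)`. PDF export through nbconvert also requires a working LaTeX installation, so the commands are built here rather than executed.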