## Initial Test

---

A first request to see what a sample `JSON` response from the bioRxiv API looks like

In [1]:
import os
import requests
import json
import time
import calendar
import numpy as np

from datetime import date, timedelta
from urllib.parse import urljoin
from pathlib import Path

### Settings & Explanations

---

Below, we experiment with date ranges, between `2018-08-21` and another date

However, it turns out that the same date can be used as both the start and the end of the interval, so each request can cover exactly one day, which simplifies the extraction

In [15]:
day = 16
month = 3
year = 2014

lezo_fx = lambda x: f"{x:02d}" ## LEading ZEro lambda function to get properly formatted dates (e.g. 3 -> "03")
url = f'https://api.biorxiv.org/details/biorxiv/{year}-{lezo_fx(month)}-{lezo_fx(day)}/{year}-{lezo_fx(month)}-{lezo_fx(day)}/0'
response = requests.get(url)

response.raise_for_status()

data = response.json()
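
Because the start and the end of the interval are the same day, the URL can also be built directly from a `datetime.date`, whose ISO format is already zero-padded. The following is a minimal sketch; `build_details_url` is a hypothetical helper name that does not appear in the notebook, and the URL pattern is simply the one used in the cell above.

```python
from datetime import date

BASE_URL = "https://api.biorxiv.org/details/biorxiv"  # same endpoint as in the cell above


def build_details_url(day: date, cursor: int = 0) -> str:
    """Build the details URL for a single day (hypothetical helper).

    The same date is used as both the start and the end of the interval.
    """
    iso_day = day.isoformat()  # YYYY-MM-DD, already zero-padded
    return f"{BASE_URL}/{iso_day}/{iso_day}/{cursor}"


print(build_details_url(date(2014, 3, 16)))
# https://api.biorxiv.org/details/biorxiv/2014-03-16/2014-03-16/0
```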

In [16]:
print(json.dumps(data, indent=2))

{
  "messages": [
    {
      "status": "ok",
      "interval": "2014-03-16:2014-03-16",
      "cursor": 0,
      "count": 1,
      "count_new_papers": 1,
      "total": 1
    }
  ],
  "collection": [
    {
      "doi": "10.1101/003319",
      "title": "Genetic influences on translation in yeast",
      "authors": "Frank W. Albert;Dale Muzzey;Jonathan S. Weissman;Leonid Kruglyak;",
      "author_corresponding": "Frank W. Albert",
      "author_corresponding_institution": "University of California, Los Angeles",
      "date": "2014-03-16",
      "version": "1",
      "type": "New Results",
      "license": "cc_by_nc_nd",
      "category": "Genetics ",
      "jatsxml": "https://www.biorxiv.org/content/early/2014/03/16/003319.source.xml",
      "abstract": "Heritable differences in gene expression between individuals are an important source of phenotypic variation. The question of how closely the effects of genetic variation on protein levels mirror those on mRNA levels remains open. Here

## API Access w/ Certain Considerations

---

In this portion, we will collect a few JSON objects:

- `full_dict_by_date`: A JSON object with dates as keys representing the output of bioRxiv metadata for each article from 01-01-2013 onwards

    - This will be further deposited into folders for monthly and then yearly collection

- `return_dict`: A JSON object of JSON objects

    - The first JSON object will have DOIs as keys
    - And the `date`, `license`, and `jatsxml` as values
        - These will represent an easy-to-access format for the data

    - The second JSON object will have the license types (e.g. `CC-BY`) as keys and the count of each license type as values

        - This is specifically for the creation of a pie chart for breakdown by license type

- `full_dict`: A JSON object representing the full output of each request

    - The name of the object will be the date associated
    
    - This object will be deposited into a folder system with subfolder system in the format:
        
        `full_dict`
            L year1
                L month1
                    L day1
                    L day2
                L month2
                    L day1
                    L day2
                    L day3
            L year2
                L month1
                    L day1
                    L day2
                    L day3
                L month2
                    L day1

Each of the above-mentioned dictionaries is saved to disk, and loaded back when requested, in the main loop
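
To make the layout concrete, here is a small sketch of what `return_dict` and `full_dict_by_date` might look like after a single article has been processed. The values are taken from the sample response shown in the initial test; real runs will of course contain many more entries.

```python
import json

# One article, using the fields of the sample response from the initial test
article = {
    "doi": "10.1101/003319",
    "date": "2014-03-16",
    "license": "cc_by_nc_nd",
    "jatsxml": "https://www.biorxiv.org/content/early/2014/03/16/003319.source.xml",
}

# Dates as keys, one short entry per article posted on that date
full_dict_by_date = {
    article["date"]: [{"doi": article["doi"], "license": article["license"]}],
}

# "content" maps DOI -> key fields, "license" maps license type -> count
return_dict = {
    "license": {article["license"]: 1},
    "content": {
        article["doi"]: {
            "date": article["date"],
            "license": article["license"],
            "jatsxml": article["jatsxml"],
        }
    },
}

print(json.dumps(return_dict["license"], indent=2))
```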

In [4]:
def update_return_dict(articles, cur_date):
    '''
    This function processes all of the articles associated with a specific date. For each article, it:
        1) Increments the count for the article's license type, creating the key if it does not exist yet
        2) Adds the DOI to the "content" object; if the DOI has ALREADY been processed, the saved date, license and JATSXML link are overwritten
            @param articles: The response JSON object whose "collection" list is processed
            @param cur_date: The current date (not used in the body)
            @return: None
    '''
    cur_date = str(cur_date)
    articles = articles['collection']
    for article in articles:
        ## Tracks the number of license type count from the return_dict "license" object
        return_dict['license'][article['license']] = return_dict['license'].get(article['license'], 0) + 1

        ## Adds/UPDATES the most recent DOI submission from return_dict
        return_dict['content'][article['doi']] = {
            'date': article['date'],
            'license': article['license'],
            'jatsxml': article['jatsxml']
        }
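
The same bookkeeping can be sketched without global state, which makes it easier to try on a small input. `tally_articles` is a hypothetical name, and the DOIs and file names below are invented placeholders, not real bioRxiv records.

```python
def tally_articles(collection, return_dict=None):
    """Fold a list of article records into a return_dict-style structure.

    Hypothetical, side-effect-free variant of update_return_dict.
    """
    if return_dict is None:
        return_dict = {"license": {}, "content": {}}
    for article in collection:
        lic = article["license"]
        # Count every record, including repeated versions of the same DOI
        return_dict["license"][lic] = return_dict["license"].get(lic, 0) + 1
        # A later record with the same DOI overwrites the earlier one
        return_dict["content"][article["doi"]] = {
            "date": article["date"],
            "license": article["license"],
            "jatsxml": article["jatsxml"],
        }
    return return_dict


# Placeholder records: two versions of one DOI and one other article
sample = [
    {"doi": "10.1101/placeholder-a", "date": "2014-03-16", "license": "cc_by", "jatsxml": "a-v1.xml"},
    {"doi": "10.1101/placeholder-a", "date": "2014-03-20", "license": "cc_by", "jatsxml": "a-v2.xml"},
    {"doi": "10.1101/placeholder-b", "date": "2014-03-20", "license": "cc_by_nc_nd", "jatsxml": "b-v1.xml"},
]
result = tally_articles(sample)
print(result["license"])       # {'cc_by': 2, 'cc_by_nc_nd': 1}
print(len(result["content"]))  # 2
```

Note that, as in the function above, a DOI with several versions contributes to the license count once per version, while `content` keeps only the version processed last.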

In [25]:
def update_full_dict_by_date(articles, cur_date):
    '''
    This function adds to a dict object where the keys are the dates in string format and the values are a list of {DOI, license} entries for that date
        @param articles: All articles in current date to extract information from
        @param cur_date: The current date to process in YYYY-mm-dd format
        @return: None
    '''
    cur_date = str(cur_date)
    articles = articles['collection']

    ## Extends (rather than overwrites) the list, so that later pages of the same date are kept
    full_dict_by_date.setdefault(cur_date, []).extend(
        {'doi': item['doi'], 'license': item['license']} for item in articles
    )

In [26]:
def get_dated_subfolders(date):
    '''
    Function to create the year and month subfolders for the current date, specifically to store the full daily output
        @param date: The current date as a datetime.date object
        @return: The full output path for the month folder containing the current date
    '''
    month = date.month
    year = date.year

    dict_main_folder = Path('./full-dict')

    dict_year_folder = dict_main_folder.joinpath(str(year))
    dict_month_folder = dict_year_folder.joinpath(str(month))
    dict_month_folder.mkdir(parents=True, exist_ok=True) ## parents=True also creates "full-dict" and the year folder if missing

    return dict_month_folder
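
`Path.mkdir` only creates the last path component unless `parents=True` is passed, so nested year/month folders need either that flag or a parent folder that already exists. Below is a minimal sketch that runs in a temporary directory; `dated_folder` and the temporary root are illustrative assumptions.

```python
import tempfile
from datetime import date
from pathlib import Path


def dated_folder(root: Path, day: date) -> Path:
    """Create and return root/<year>/<month> for the given day (hypothetical helper)."""
    folder = root / str(day.year) / str(day.month)
    # parents=True creates missing intermediate folders; exist_ok=True makes reruns safe
    folder.mkdir(parents=True, exist_ok=True)
    return folder


with tempfile.TemporaryDirectory() as tmp:
    folder = dated_folder(Path(tmp) / "full-dict", date(2014, 3, 16))
    relative = folder.relative_to(tmp).as_posix()
    created = folder.is_dir()

print(relative, created)  # full-dict/2014/3 True
```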

In [27]:
def split_dates(dates, num_splits):
    '''
    Function that takes in a list of dates (or anything, really) and splits it into num_splits contiguous sub-lists
        @param dates: A list of dates (or anything really)
        @param num_splits: The number of sub-lists to split the original list into
        @return: A list of [first, last] pairs, one per sub-list
    '''
    first_last_dates_fx = lambda lis: [[li[0], li[-1]] for li in lis] ## Extracts the first and last elements of each sub-list
    return first_last_dates_fx(np.array_split(dates, num_splits)) ## Returns the first and last dates of each split
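
To see what the splits look like, here is a pure-Python sketch that mimics the chunk sizes `np.array_split` produces (the first `len(items) % num_splits` chunks get one extra element). `first_last_pairs` is a hypothetical stand-in for `split_dates`, written without NumPy so that it runs on its own; it assumes there are at least as many items as splits.

```python
from datetime import date, timedelta


def first_last_pairs(items, num_splits):
    """Split items into num_splits contiguous chunks and return [first, last] per chunk.

    Assumes len(items) >= num_splits, so that no chunk is empty.
    """
    base, extra = divmod(len(items), num_splits)
    pairs, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # earlier chunks absorb the remainder
        chunk = items[start:start + size]
        pairs.append([chunk[0], chunk[-1]])
        start += size
    return pairs


days = [date(2013, 1, 1) + timedelta(days=x) for x in range(10)]
pairs = first_last_pairs(days, 3)
for first, last in pairs:
    print(first, "->", last)
# 2013-01-01 -> 2013-01-04
# 2013-01-05 -> 2013-01-07
# 2013-01-08 -> 2013-01-10
```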

In [28]:
def get_split_folder(split_num):
    '''
    Function that creates a folder for the specific split applicable to the current operation
        @param split_num: The split number to obtain the associated folder for
        @return: The path to the folder for the split number
    '''
    split_main_folder = Path('./output_dicts') ## Same root folder that the main loop saves to
    split_folder = split_main_folder.joinpath(str(split_num))
    split_folder.mkdir(parents=True, exist_ok=True) ## Creates "output_dicts" as well if it is missing

    return split_folder ## Not needed but just in case

In [29]:
def process_bioRxiv(start_date=date(2013, 1, 1),
                    end_date=None,
                    load_dicts=False,
                    delay=5,
                    **kwargs):
    '''
    Function to process the data (loading in if necessary) for a period of time defined by:
        @param start_date: Beginning date as a datetime.date object (defaults to 2013-01-01)
        @param end_date: End date as a datetime.date object (defaults to today)
        @param load_dicts: Loads in the dictionary (JSON) objects associated with "full_dict_by_date" and "return_dict"
        @param delay: Number of seconds to sleep between requests
        @param kwargs: Extra keyword arguments (e.g. save_every_n_days) are accepted but currently unused
    '''
    if end_date is None:
        end_date = date.today()

    global full_dict_by_date
    full_dict_by_date = {}

    global return_dict
    return_dict = {
        'license': {},
        'content': {}
    }

    get_split_folder(SELECTED_NUMBER) ## Creates the folder for the current split number

    if load_dicts:
        with open(f'./output_dicts/{SELECTED_NUMBER}/full_dict_by_date.json', 'r') as f:
            full_dict_by_date = json.load(f)
        with open(f'./output_dicts/{SELECTED_NUMBER}/return_dict.json', 'r') as f:
            return_dict = json.load(f)

        if full_dict_by_date: ## Resumes the day after the last saved date; otherwise keeps the given start date
            start_date = date.fromisoformat(max(full_dict_by_date)) + timedelta(days=1)

    dates = [start_date + timedelta(days=x) for x in range((end_date  - start_date).days + 1)] ## Gets a list of available dates

    for cur_date in dates:

        flag = True ## Flag of whether or not to continue the loop to get more pages
        num_skip = 0 ## Flag to skip as per API instructions
        while flag:
            cur_date_dir = get_dated_subfolders(cur_date)

            url = f'https://api.biorxiv.org/details/biorxiv/{str(cur_date)}/{str(cur_date)}/{num_skip}'

            response = requests.get(url) ## Processes request and waits for response / articles in JSON format
            response.raise_for_status()
            articles = response.json()

            if articles['messages'][0]['status'] != 'ok': ## Continues onto the next date if posts for the current date are unavailable
                flag = False
                continue

            update_full_dict_by_date(articles=articles, cur_date=cur_date)
            update_return_dict(articles=articles, cur_date=cur_date)

            print('Saving "full_dict_by_date.json"...')
            with open(f'./output_dicts/{SELECTED_NUMBER}/full_dict_by_date.json', 'w') as f:
                json.dump(full_dict_by_date, f, indent=2)
            
            print('Saving "return_dict.json"...')
            with open(f'./output_dicts/{SELECTED_NUMBER}/return_dict.json', 'w') as f:
                json.dump(return_dict, f, indent=2)

            print('Saving "full_dict_by_date.json"...') ## Saves the updated JSONs
            with open(f'{str(cur_date_dir)}/{cur_date}.json', 'w') as f: ## Saves the whole output as today's date
                json.dump(articles['collection'], f, indent=2)

            print('='*50)
            print('Processed date:', cur_date, 'with', len(articles['collection']), 'articles')
            print(f'Sleeping for {delay} seconds... Zzz...')
            print('='*50)
            time.sleep(delay) ## Uses the default 5 seconds if need be

            num_skip += 100
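
The paging logic in the inner loop can be tried offline by replacing the HTTP call with a stub. In the sketch below, `fake_fetch` is an invented stand-in for `requests.get(url).json()`, and the status string it returns past the last page is a placeholder. It assumes, as the loop above does, that a page past the end comes back with a status other than `ok`. The page size of 100 mirrors the `num_skip += 100` step.

```python
PAGE_SIZE = 100  # step used for the cursor in the loop above


def fake_fetch(cursor, total=250):
    """Stand-in for the API call: returns up to PAGE_SIZE placeholder records."""
    if cursor >= total:
        return {"messages": [{"status": "stub: no more results"}], "collection": []}
    stop = min(cursor + PAGE_SIZE, total)
    records = [{"doi": f"placeholder-{i}"} for i in range(cursor, stop)]
    return {"messages": [{"status": "ok"}], "collection": records}


def fetch_all(fetch):
    """Collect every page for one date by advancing the cursor until status != 'ok'."""
    collected, cursor = [], 0
    while True:
        page = fetch(cursor)
        if page["messages"][0]["status"] != "ok":
            break
        collected.extend(page["collection"])
        cursor += PAGE_SIZE
    return collected


records = fetch_all(fake_fetch)
print(len(records))  # 250
```

Collecting all pages before saving avoids one page overwriting another when a single date has more than 100 articles.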

# Directions for Asynchronous Users

---

From the number you chose in the Discord server, please enter the number in the following cells under `SELECTED_NUMBER`.

DO NOT change the `DO_NOT_CHANGE_FLAG` flag. This represents the total number of splits to use.

Afterwards, simply sync the current folder containing all the metadata with GitHub.

In [30]:
DO_NOT_CHANGE_FLAG = 100 ## Do not change this number, it is the number of splits to create and is set standard to 100

In [None]:
LEFT_OFF = 53 ## The split number to start from, if 0, then starts from the beginning

In [34]:
dates = [date(2013, 1, 1) + timedelta(days=x) for x in range((date.today() - date(2013, 1, 1)).days + 1)] ## Gets a list of available dates
split_dates_list = split_dates(dates, DO_NOT_CHANGE_FLAG) ## Splits the dates into the selected number of splits (the same for every iteration)

for num in range(LEFT_OFF, DO_NOT_CHANGE_FLAG):
    SELECTED_NUMBER = num ## The split currently being processed; read as a global by process_bioRxiv

    start_date = split_dates_list[SELECTED_NUMBER][0] ## Gets the start date from the selected number
    end_date = split_dates_list[SELECTED_NUMBER][1] ## Gets the end date from the selected number

    process_bioRxiv(start_date=start_date, end_date=end_date, save_every_n_days=100, load_dicts=False, delay=5)

Saving "full_dict_by_date.json"...
Saving "return_dict.json"...
Saving "full_dict_by_date.json"...
Processed date: 2013-11-07 with 8 articles
Sleeping for 5 seconds... Zzz...
Saving "full_dict_by_date.json"...
Saving "return_dict.json"...
Saving "full_dict_by_date.json"...
Processed date: 2013-11-11 with 2 articles
Sleeping for 5 seconds... Zzz...
Saving "full_dict_by_date.json"...
Saving "return_dict.json"...
Saving "full_dict_by_date.json"...
Processed date: 2013-11-12 with 7 articles
Sleeping for 5 seconds... Zzz...
Saving "full_dict_by_date.json"...
Saving "return_dict.json"...
Saving "full_dict_by_date.json"...
Processed date: 2013-11-13 with 2 articles
Sleeping for 5 seconds... Zzz...
Saving "full_dict_by_date.json"...
Saving "return_dict.json"...
Saving "full_dict_by_date.json"...
Processed date: 2013-11-14 with 7 articles
Sleeping for 5 seconds... Zzz...
Saving "full_dict_by_date.json"...
Saving "return_dict.json"...
Saving "full_dict_by_date.json"...
Processed date: 2013-11-15

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
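
A dropped connection like the one above stops the whole run. One common way to make such a loop more tolerant is to retry a failed call a few times with an increasing pause. The sketch below is an assumption about how that could look, not something the notebook does. `with_retries` and `flaky` are hypothetical names, and Python's built-in `ConnectionError` stands in for the network error so that the example runs offline. With `requests`, the exception to pass would be `requests.exceptions.ConnectionError`.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, exceptions=(ConnectionError,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... the base delay


calls = {"n": 0}


def flaky():
    """Fails twice, then succeeds (stand-in for a request that is sometimes dropped)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Remote end closed connection without response")
    return "ok"


waits = []
result = with_retries(flaky, sleep=waits.append)  # record the waits instead of sleeping
print(result, calls["n"], waits)  # ok 3 [1.0, 2.0]
```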

# Combine

---

The below section deals with combining all of the obtained data into one single JSON object

In [None]:
def combine_output_dicts(base_dict, path='./output_dicts', num=0):
    '''
    Function to merge the "return_dict.json" of one split into base_dict (in place)
        @param base_dict: The accumulated dict with "license" and "content" keys
        @param path: The path to the folder containing all of the output dicts
        @param num: The split number whose output is merged
        @return: None
    '''
    try:
        with open(f'{path}/{num}/return_dict.json', 'r') as f:
            cur_dict = json.load(f)
    except FileNotFoundError:
        print(f'No files in folder {num}...')
        return ## Nothing to merge for this split

    for k, v in cur_dict.get('license', {}).items():
        base_dict['license'][k] = base_dict['license'].get(k, 0) + v ## Adds up the counts per license type
    base_dict['content'].update(cur_dict.get('content', {})) ## Later splits overwrite duplicate DOIs

In [None]:
def get_base_dict():
    '''
    Function to get the base dict for combining all of the output dicts
        @param: None
        @return: Empty base dict with the same layout as "return_dict"
    '''
    return {'license': {}, 'content': {}}

In [None]:
base_dict = get_base_dict()
for num in range(DO_NOT_CHANGE_FLAG):
    combine_output_dicts(base_dict, path='./output_dicts', num=num)
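
The merge itself reduces to adding up the license counts and taking the union of the `content` entries. Below is a minimal in-memory sketch with two invented splits; `merge_return_dicts` and the placeholder DOIs are assumptions for illustration, and no files are read.

```python
from collections import Counter


def merge_return_dicts(dicts):
    """Merge several return_dict-style objects into one (hypothetical helper)."""
    licenses = Counter()
    content = {}
    for d in dicts:
        licenses.update(d.get("license", {}))  # Counter.update adds counts
        content.update(d.get("content", {}))   # later splits overwrite duplicate DOIs
    return {"license": dict(licenses), "content": content}


split_a = {
    "license": {"cc_by": 2, "cc_by_nc_nd": 1},
    "content": {
        "placeholder-1": {"license": "cc_by"},
        "placeholder-2": {"license": "cc_by_nc_nd"},
    },
}
split_b = {
    "license": {"cc_by": 1},
    "content": {"placeholder-3": {"license": "cc_by"}},
}

merged = merge_return_dicts([split_a, split_b])
print(merged["license"])          # {'cc_by': 3, 'cc_by_nc_nd': 1}
print(sorted(merged["content"]))  # ['placeholder-1', 'placeholder-2', 'placeholder-3']
```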