<a href="https://colab.research.google.com/github/simon-clematide/casdmit-fs21/blob/master/oai_zora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UZH ZORA OAI Downloader
A small script that
 - downloads ZORA publications
 - builds a _multilabel_, _multiclass_ dataset with Dewey codes as classes
 - restricts the dataset to English publications that have an abstract.

## Installation, Setup, and Module Imports


In [None]:
! pip install sickle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sickle
  Downloading Sickle-0.7.0-py3-none-any.whl (12 kB)
Installing collected packages: sickle
Successfully installed sickle-0.7.0


In [None]:
import sickle
import json
import logging
import re

## Downloader


In [None]:
def download(filename, limit=999999999):
    """Download at most `limit` publications to a JSONL file"""

    # Initialize the Sickle client with the base URL of the OAI-PMH server
    sickle_client = sickle.Sickle('https://www.zora.uzh.ch/cgi/oai2')
    
    # Set the metadata prefix and other parameters for the OAI-PMH request
    metadata_prefix = 'oai_dc'
    params = {'metadataPrefix': metadata_prefix, 'ignore_deleted': True}
    
    # While looping over the records, Sickle automatically loads further pages
    # as long as the server returns a resumption token
    records = sickle_client.ListRecords(**params)
    
    # Open the output file and write the records one by one
    with open(filename, "w", encoding="utf-8") as outfile:
        # Count from 1 so that exactly `limit` records are written
        for i, record in enumerate(records, start=1):
            d_record = dict(record)
            print(json.dumps(d_record, ensure_ascii=False), file=outfile)
            
            # Report download progress (only visible if logging is configured at INFO level)
            if i % 100 == 0:
                logging.info(f"Downloaded {i} records")
            # Stop once the download limit is reached
            if i >= limit:
                break

In [None]:
download("zora-1000.jsonl", limit=1000)

In [None]:
! head zora-1000.jsonl

{"relation": ["https://www.zora.uzh.ch/id/eprint/1/", "10.1186/1472-6963-7-7"], "title": ["HEE-GER: a systematic review of German economic evaluations of health care published 1990-2004"], "creator": ["Schwappach, D L B", "Boluarte, T A"], "subject": ["Swiss Research Institute for Public Health and Addiction", "610 Medicine & health"], "description": ["BACKGROUND: Studies published in non-English languages are systematically missing in systematic reviews of growth and quality of economic evaluations of health care. The aims of this study were: to characterize German evaluations, published in English or German-language, in terms of various key parameters; to investigate methods to derive quality-of-life weights in cost-utility studies; and to examine changes in study characteristics over the years. METHODS: We conducted a country-specific systematic review of the German and English-language literature of German economic evaluations (assessment of or application to the German health care

JSON Viewer: https://codebeautify.org/jsonviewer
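
Because every line of the JSONL file is a standalone JSON object, the records can also be inspected directly in Python instead of an external viewer. The following is only a minimal sketch: the helper name `read_jsonl` and the sample record are made up for illustration, with field names taken from the output above.

```python
import json
import os
import tempfile


def read_jsonl(path, max_records=None):
    """Yield one dict per line of a JSONL file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as infile:
        for i, line in enumerate(infile):
            if max_records is not None and i >= max_records:
                break
            yield json.loads(line)


# Illustrative sample record, not real ZORA data
sample = {
    "title": ["An example title"],
    "subject": ["610 Medicine & health"],
    "language": ["eng"],
}

# Write the sample to a temporary JSONL file so the sketch runs on its own
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(tmp_path, "w", encoding="utf-8") as outfile:
    print(json.dumps(sample, ensure_ascii=False), file=outfile)

records = list(read_jsonl(tmp_path, max_records=5))
print(sorted(records[0].keys()))
```

For the downloaded data, the same helper could be pointed at `zora-1000.jsonl`.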

## Extract the Dewey Decimal Code from the ZORA Subject Field

In [None]:
def extract_dewey(zora_subjects):
    """Return Dewey numbers as fastText labels

    Example input:
    ["Institute of Food Safety and Hygiene", "570 Life sciences; biology", "610 Medicine & health"]

    Example output:
    ["__label__570", "__label__610"]
    """

    labels = []
    for subject in zora_subjects:
        m = re.match(r'^(\d\d\d)', subject)
        if m:
            labels.append(f"__label__{m.group(1)}")
    return labels
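
A quick check of the extraction logic, using the subject list from the docstring. The function body is repeated here only so that the sketch runs on its own. Subjects that do not start with three digits, such as institute names, are skipped.

```python
import re


def extract_dewey(zora_subjects):
    """Same logic as above, repeated so that this cell is self-contained."""
    labels = []
    for subject in zora_subjects:
        m = re.match(r'^(\d\d\d)', subject)
        if m:
            labels.append(f"__label__{m.group(1)}")
    return labels


subjects = [
    "Institute of Food Safety and Hygiene",
    "570 Life sciences; biology",
    "610 Medicine & health",
]
labels = extract_dewey(subjects)
print(labels)  # ['__label__570', '__label__610']

# A record with only an institute name yields no labels at all
no_labels = extract_dewey(["Swiss Research Institute for Public Health and Addiction"])
print(no_labels)  # []
```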

## Convert the JSONL Format into the fastText Classification Format

In [None]:
def zora_jsonl2fasttext_tsv(inputfile, outputfile):
    """Convert Zora JSONL into fasttext labeled input format"""

    lines_read = 0
    lines_written = 0

    with open(inputfile, "r", encoding="utf-8") as infile:
        with open(outputfile, "w", encoding="utf-8") as outfile:
            for line in infile:
                record = json.loads(line)
                lines_read += 1
                # ignore entries in languages other than English
                if "eng" not in record.get("language", []):
                    continue
                
                # ignore entries without an abstract or a title
                if not record.get("description") or not record.get("title"):
                    continue

                # collapse whitespace runs (including tabs and newlines) into single spaces
                abstract = re.sub(r"\s+", " ", record["description"][0])
                title = re.sub(r"\s+", " ", record["title"][0])
                dewey_labels = extract_dewey(record.get("subject", []))

                # ignore articles without any Dewey label
                if dewey_labels:
                    print(" ".join(dewey_labels), title + " " + abstract, sep="\t", file=outfile)
                    lines_written += 1
    print(f"Lines read from {inputfile}: {lines_read}")
    print(f"Lines written to {outputfile}: {lines_written}")

In [None]:
zora_jsonl2fasttext_tsv("zora-1000.jsonl", "zora-1000.fasttext.tsv")

Lines read from zora-10000.jsonl: 10002
Lines written to zora-10000.fasttext.tsv: 5955


In [None]:
! tail -n 10 zora-1000.fasttext.tsv

__label__150	Were they really laughed at? That much? Gelotophobes and their history of perceived derisibility The List of Derisible Situations (LDS; Proyer, Hempelmann and Ruch, List of Derisible Situations (LDS), University of Zurich, 2008) consists of 102 different occasions for being laughed at. They were retrieved in a corpus study and compiled into the LDS. Based on this list, information on the frequency and the intensity with which people recall being laughed at during a given time-span (12 months in this study) can be collected. An empirical study (N = 114) examined the relations between the LDS and the fear of being laughed at (gelotophobia), the joy of being laughed at (gelotophilia), and the joy of laughing at others (katagelasticism; Ruch and Proyer this issue). More than 92% of the participants recalled having been laughed at at least once over the past 12 months. Highest scores were found for experiencing an embarrassing situation, chauvinism of others or being laughed at
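
Each output line consists of the space-separated labels, a tab, and the title followed by the abstract. Since whitespace in the title and abstract is collapsed to single spaces, the first tab is the only one in the line. As a minimal sketch, a line can be split back into its parts as follows; `parse_fasttext_line` is a hypothetical helper and the example line is made up.

```python
def parse_fasttext_line(line):
    """Split one labeled line into (labels, text); hypothetical helper."""
    # Everything before the first tab is the label field
    label_field, text = line.rstrip("\n").split("\t", 1)
    return label_field.split(" "), text


# Made-up example line in the format written by the converter
example = "__label__570 __label__610\tAn example title An example abstract.\n"
labels, text = parse_fasttext_line(example)
print(labels)
print(text)
```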

## Compute Brief Statistics on the Distribution of Dewey Labels

In [None]:
! cut -f 1 < zora-1000.fasttext.tsv | sort | uniq -c | sort -rn

   1680 __label__610
    953 __label__570 __label__610
    938 __label__570
    584 __label__570 __label__590
    353 __label__330
    245 __label__540
    236 __label__570 __label__630
    155 __label__910
    149 __label__150
     89 __label__580
     76 __label__530
     74 __label__000
     59 __label__170 __label__610
     53 __label__510
     44 __label__560
     38 __label__300
     27 __label__570 __label__610 __label__600
     22 __label__070
     20 __label__320
     19 __label__150 __label__610
     18 __label__610 __label__540
     13 __label__820
     10 __label__000 __label__410
      9 __label__570 __label__610 __label__540
      7 __label__570 __label__170 __label__610
      7 __label__370
      5 __label__900
      5 __label__790 __label__390 __label__300
      5 __label__570 __label__580 __label__610
      5 __label__100
      4 __label__470 __label__480
      4 __label__340 __label__610
      3 __label__900 __label__330
      3 __label__570 __label__580
      3 __lab
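
The shell pipeline above counts label *combinations*: a publication tagged with both 570 and 610 is counted separately from one tagged with 610 only. A possible Python variant that also reports counts per individual label is sketched below. The function name `count_labels` and the sample lines are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter


def count_labels(lines):
    """Count label combinations and individual labels (hypothetical helper)."""
    combos = Counter()
    singles = Counter()
    for line in lines:
        # The label field is everything before the first tab
        label_field = line.split("\t", 1)[0]
        combos[label_field] += 1
        singles.update(label_field.split())
    return combos, singles


# Made-up lines in the fastText format produced above
sample_lines = [
    "__label__610\tfirst text\n",
    "__label__570 __label__610\tsecond text\n",
    "__label__570\tthird text\n",
    "__label__610\tfourth text\n",
]
combos, singles = count_labels(sample_lines)
print(combos.most_common())
print(singles.most_common())
```

To run this on the real data, pass an open file handle for `zora-1000.fasttext.tsv` to `count_labels` instead of `sample_lines`.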

# Analyze 20,000 Randomly Sampled ZORA Records
Only the English publications are converted!

In [None]:
! curl https://files.ifi.uzh.ch/cl/siclemat/lehre/fs23/bibliosuisse/data/zora-20000.jsonl -o zora-20000.jsonl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38.6M  100 38.6M    0     0  16.4M      0  0:00:02  0:00:02 --:--:-- 16.4M


In [None]:
zora_jsonl2fasttext_tsv("zora-20000.jsonl", "zora-20000.fasttext.tsv")

Lines read from zora-20000.jsonl: 20000
Lines written to zora-20000.fasttext.tsv: 10267


In [None]:
! cut -f 1 < zora-20000.fasttext.tsv | sort | uniq -c | sort -rn

   4020 __label__610
    927 __label__570 __label__610
    603 __label__570
    537 __label__330
    447 __label__150
    440 __label__530
    355 __label__910
    328 __label__570 __label__590
    288 __label__570 __label__630
    260 __label__540
    226 __label__000
    193 __label__510
    180 __label__580
    130 __label__170 __label__610
    112 __label__300
    104 __label__320
     83 __label__142 __label__610
     77 __label__560
     65 __label__370
     65 __label__070
     57 __label__000 __label__410
     50 __label__142
     47 __label__820
     46 __label__142 __label__570 __label__610
     41 __label__340 __label__610
     40 __label__570 __label__610 __label__600
     40 __label__100
     37 __label__610 __label__540
     37 __label__340
     28 __label__800 __label__470 __label__410 __label__440 __label__460 __label__450
     26 __label__490 __label__890 __label__410
     23 __label__900
     19 __label__170 __label__330
     17 __label__340 __label__610 __label__510
