# Train a domain classifier on the [Semantic Scholar dataset](https://api.semanticscholar.org/corpus)

## Part 1: obtain clean data

This can be achieved by downloading a public dataset (as in this case) or by having a Data Engineer set up and grant us access to the organisation's data.

1. **Extract**: Download the Semantic Scholar dataset from a public S3 bucket
2. **Transform**: Project it to keep only the necessary components (text and labels), and clean the textual content using `great_ai.utilities.clean`
3. **Load**: Upload the dataset (or a part of it) to a shared infrastructure using `great_ai.LargeFile` with its S3 backend

> We will speed up the processing using `great_ai.utilities.parallel_map`
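To illustrate the idea of mapping a function over the chunks concurrently, here is a minimal sketch built on the standard library. It is only a stand-in for the concept: `simple_parallel_map` and `fake_process_chunk` are hypothetical names, and this is not a description of how `great_ai.utilities.parallel_map` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def simple_parallel_map(
    func: Callable[[T], R], items: Iterable[T], workers: int = 4
) -> List[R]:
    """Hypothetical stand-in: apply `func` to every item concurrently.

    `Executor.map` yields results in the order of the inputs,
    regardless of which task finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def fake_process_chunk(chunk_key: str) -> int:
    # Placeholder for the per-chunk download-and-clean work
    return len(chunk_key)


results = simple_parallel_map(
    fake_process_chunk, ["chunk-a", "chunk-bb", "chunk-ccc"]
)
print(results)  # [7, 8, 9]
```

Threads suit work that mostly waits on the network, such as downloads; CPU-heavy cleaning would likely benefit more from separate processes.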

In [1]:
import json
from typing import List, Tuple
import urllib.request
from itertools import chain
import gzip
from great_ai.utilities.clean import clean
from great_ai.utilities.parallel_map import parallel_map
from great_ai.utilities.language import is_english, predict_language
from great_ai import configure, LargeFile


DATASET_KEY = "semantic-scholar-dataset-small"
MAX_FILE_COUNT = 12

configure()

2022-05-27 11:02:29,131 |     INFO | Environment variable ENVIRONMENT is not set, defaulting to development mode
2022-05-27 11:02:29,133 |     INFO | Options: configured ✅


In [2]:
# Download the list of available chunks

manifest = (
    urllib.request.urlopen(
        "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-02-01/manifest.txt"
    )
    .read()
    .decode()
)
chunks = manifest.split()[:MAX_FILE_COUNT]
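The manifest handling above can be tried offline with a small sketch. The manifest content below is made up for illustration (the real file names are not shown here), and it assumes the manifest lists one chunk key per line.

```python
MAX_FILE_COUNT = 2

# Hypothetical manifest content: one chunk key per line (these names are invented)
sample_manifest = "chunk-000.gz\nchunk-001.gz\nchunk-002.gz\n"

# str.split() without arguments splits on any whitespace and drops empty strings,
# so the trailing newline does not produce an empty key
chunks = sample_manifest.split()[:MAX_FILE_COUNT]
print(chunks)  # ['chunk-000.gz', 'chunk-001.gz']
```

Slicing with `[:MAX_FILE_COUNT]` keeps the download small; a slice never raises an error if the manifest has fewer entries than the limit.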

In [3]:
# Extract, Transform


def process_chunk(chunk_key: str) -> List[Tuple[str, List[str]]]:
    response = urllib.request.urlopen(
        "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-02-01/"
        + chunk_key
    )  # returns a gzipped JSON Lines file
    decompressed = gzip.decompress(response.read())
    decoded = decompressed.decode()
    chunk = [json.loads(line) for line in decoded.split("\n") if line]

    # Named `pairs` rather than `all` to avoid shadowing the built-in `all()`
    pairs = [
        # Create pairs of `(text, [domains])`
        # The text is cleaned to remove PDF extraction, web scraping, and other common artifacts
        (
            clean(
                f'{c["title"]} {c["paperAbstract"]} {c["journalName"]} {c["venue"]}',
                convert_to_ascii=True,
            ),
            c["fieldsOfStudy"],
        )
        for c in chunk
    ]

    # Keep only labelled, non-empty, English documents
    return [
        (content, domains)
        for content, domains in pairs
        if (domains and content and is_english(predict_language(content)))
    ]


clean_chunks = parallel_map(process_chunk, chunks)

2022-05-27 11:02:30,364 |     INFO | Starting parallel map (concurrency: 12, chunk size: 1)
2022-05-27 11:04:30,870 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,177)
2022-05-27 11:04:36,961 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,2108)
2022-05-27 11:04:44,270 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,1302)
2022-05-27 11:04:44,300 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,1408)
2022-05-27 11:04:44,324 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,1497)
2022-05-27 11:04:44,347 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,1553)
2022-05-27 11:04:45,265 |     INFO | Ignoring parse error (tolerant parsing mode): Unexpected mis
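The decompress, decode, parse, and filter steps of the chunk processing can be checked without network access. The sketch below builds a tiny gzipped JSON Lines payload in memory; the records are invented, `parse_chunk` is a hypothetical helper, and the GreatAI cleaning and language detection steps are deliberately left out.

```python
import gzip
import json
from typing import List, Tuple

# Invented records that mimic the fields used above
records = [
    {
        "title": "Graph kernels",
        "paperAbstract": "We study graphs.",
        "fieldsOfStudy": ["Computer Science"],
    },
    {"title": "Untitled", "paperAbstract": "", "fieldsOfStudy": []},
]

# JSON Lines: one JSON document per line, then gzip the whole file
payload = gzip.compress(
    "\n".join(json.dumps(r) for r in records).encode() + b"\n"
)


def parse_chunk(raw: bytes) -> List[Tuple[str, List[str]]]:
    decoded = gzip.decompress(raw).decode()
    # The trailing newline yields an empty last element, hence `if line`
    rows = [json.loads(line) for line in decoded.split("\n") if line]
    pairs = [
        (f'{r["title"]} {r["paperAbstract"]}'.strip(), r["fieldsOfStudy"])
        for r in rows
    ]
    # Drop records without text or without any label
    return [(text, domains) for text, domains in pairs if text and domains]


parsed = parse_chunk(payload)
print(parsed)  # [('Graph kernels We study graphs.', ['Computer Science'])]
```

The second record is dropped because its label list is empty, which mirrors the `domains and content` check in the real pipeline.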

In [None]:
# Load
# Using this GreatAI utility, the data will be made available on S3 (and in our local cache)

with LargeFile(DATASET_KEY, "w", keep_last_n=1) as f:
    # Each chunk yields its own list, so flatten the list of lists before saving
    json.dump(list(chain.from_iterable(clean_chunks)), f)

2022-05-27 11:01:09,266 |     INFO | Fetching online versions of semantic-scholar-dataset-small
2022-05-27 11:01:09,581 |     INFO | Found versions: [0]
2022-05-27 11:01:09,890 |     INFO | Copying file for semantic-scholar-dataset-small-1
2022-05-27 11:01:09,916 |     INFO | Compressing semantic-scholar-dataset-small-1
2022-05-27 11:01:11,197 |     INFO | Uploading semantic-scholar-dataset-small-1 to S3 from /tmp/large-file-s55zn39v
2022-05-27 11:01:13,095 |     INFO | Uploading semantic-scholar-dataset-small-1.tar.gz 0.79/7.46 MB (10.5%)
2022-05-27 11:01:14,729 |     INFO | Uploading semantic-scholar-dataset-small-1.tar.gz 1.57/7.46 MB (21.1%)
2022-05-27 11:01:16,608 |     INFO | Uploading semantic-scholar-dataset-small-1.tar.gz 2.36/7.46 MB (31.6%)
2022-05-27 11:01:18,209 |     INFO | Uploading semantic-scholar-dataset-small-1.tar.gz 3.15/7.46 MB (42.2%)