# Train a domain classifier on the [Semantic Scholar dataset](https://api.semanticscholar.org/corpus)
> Part 1: obtain and clean data

![position of this step in the lifecycle](diagrams/scope-data.svg)
> The blue boxes show the steps implemented in this notebook.

In [1]:
MAX_CHUNK_COUNT = 4

### Extract

Data can be obtained by downloading a public dataset (as in this case) or by having a data engineer set up and grant access to the organisation's data.

Here, we download the Semantic Scholar dataset from a public S3 bucket.

In [2]:
import urllib.request

manifest = (
    urllib.request.urlopen(
        "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-02-01/manifest.txt"
    )
    .read()
    .decode()
)  # a list of available chunks separated by '\n' characters

chunks = manifest.split()[:MAX_CHUNK_COUNT]

f"Processing {len(chunks)} out of the {len(manifest.split())} available chunks"

'Processing 4 out of the 6002 available chunks'

### Transform

- Filter out non-English abstracts using `great_ai.utilities.predict_language` (together with `is_english`).
- Project each record to keep only the necessary components (text and labels), and clean the textual content using `great_ai.utilities.clean`.
- Speed up processing by handling the chunks in parallel using `great_ai.utilities.parallel_map`.
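To illustrate the idea behind mapping a function over chunks concurrently, here is a minimal sketch built only on the standard library. It is a stand-in, not how `parallel_map` is implemented: `threaded_map`, `fake_preprocess`, and the chunk names are hypothetical. Threads mainly help with the download part of the work, which is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def threaded_map(
    func: Callable[[T], R], items: Iterable[T], concurrency: int = 4
) -> List[R]:
    """Hypothetical stand-in: apply `func` to every item using a thread pool."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in input order, not completion order
        return list(pool.map(func, items))


def fake_preprocess(chunk_key: str) -> int:
    # Placeholder for the real per-chunk work (download, decode, filter)
    return len(chunk_key)


# Made-up chunk names, used only for this illustration
results = threaded_map(fake_preprocess, ["chunk-a.gz", "chunk-bb.gz"])
```

Because the results keep the order of the inputs, they can be matched back to the chunk list by position.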

In [3]:
from typing import List, Tuple
import json
import gzip
from great_ai import parallel_map, clean, is_english, predict_language


def preprocess_chunk(chunk_key: str) -> List[Tuple[str, List[str]]]:
    # Extract
    response = urllib.request.urlopen(
        f"https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-02-01/{chunk_key}"
    )  # a gzipped JSON Lines file

    decompressed = gzip.decompress(response.read())
    decoded = decompressed.decode()
    chunk = [json.loads(line) for line in decoded.split("\n") if line]

    # Transform
    return [
        # Create pairs of `(text, [...domains])`
        # The text is cleaned to remove PDF extraction, web scraping, and other common artifacts
        (
            clean(
                f'{c["title"]} {c["paperAbstract"]} {c["journalName"]} {c["venue"]}',
                convert_to_ascii=True,
            ),
            c["fieldsOfStudy"],
        )
        for c in chunk
        if c["fieldsOfStudy"] and is_english(predict_language(c["paperAbstract"]))
    ]


preprocessed_chunks = parallel_map(preprocess_chunk, chunks)

2022-06-19 14:59:12,563 |     INFO | Starting parallel map (concurrency: 4, chunk size: 1)


  0%|          | 0/4 [00:00<?, ?it/s]
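The cell above decompresses and decodes each chunk fully in memory before parsing it. If memory becomes a concern, the gzipped JSON Lines data could be parsed line by line instead. The following is a sketch under that assumption: `iter_json_lines` is a hypothetical helper and the sample records are made up. A file-like object such as the one returned by `urlopen` should work in place of the `BytesIO` buffer, though this has not been verified against the S3 bucket.

```python
import gzip
import io
import json
from typing import Any, BinaryIO, Dict, Iterator


def iter_json_lines(compressed: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a gzipped JSON Lines stream."""
    with gzip.open(compressed, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)


# Made-up sample records, compressed in memory to stand in for a downloaded chunk
sample = (
    b'{"title": "A", "fieldsOfStudy": ["Physics"]}\n'
    b'{"title": "B", "fieldsOfStudy": []}\n'
)
records = list(iter_json_lines(io.BytesIO(gzip.compress(sample))))
```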

In [4]:
from itertools import chain

preprocessed_data = list(chain(*preprocessed_chunks))
X, y = zip(
    *preprocessed_data
)  # X is the input, y is the expected (ground truth) output
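The flatten-then-unzip idiom used above can be easier to follow on a toy input. The records below are made up for illustration.

```python
from itertools import chain

# Two made-up "chunks", each a list of (text, labels) pairs
toy_chunks = [
    [("paper one", ["Physics"]), ("paper two", ["Biology", "Medicine"])],
    [("paper three", ["Computer Science"])],
]

flat = list(chain.from_iterable(toy_chunks))  # one flat list of pairs
texts, labels = zip(*flat)  # transpose the pairs into two parallel tuples

print(texts)   # ('paper one', 'paper two', 'paper three')
print(labels)  # (['Physics'], ['Biology', 'Medicine'], ['Computer Science'])
```

Note that unpacking `zip(*flat)` into two names raises a `ValueError` when `flat` is empty, so this pattern assumes at least one record survives the filtering.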

### Load

Upload the dataset (or a part of it) to a central repository using `great_ai.add_ground_truth`. This step automatically tags each datapoint with a split label according to the ratios we set. Additional tags can also be given.
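How `add_ground_truth` assigns split labels internally is not shown in this notebook. One plausible reading of "according to the ratios" is that each datapoint independently receives a label with probability proportional to its ratio. The sketch below illustrates that assumption only: `assign_split_tags` is a hypothetical helper, not part of `great_ai`.

```python
import random
from typing import Dict, List


def assign_split_tags(count: int, ratios: Dict[str, float], seed: int = 0) -> List[str]:
    """Hypothetical helper: draw one split label per datapoint, weighted by `ratios`."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=count)


tags = assign_split_tags(1000, {"train": 0.8, "test": 0.2})
train_share = tags.count("train") / len(tags)  # close to 0.8, but not exact
```

With random assignment, the observed shares only approximate the requested ratios, and the approximation tightens as the dataset grows.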

#### Use a different repository

For simplicity, this tutorial uses the local disk (`great_ai.ParallelTinyDbDriver`) as the central repository.
A different backend can be configured, for example, with the following snippet:

```python
from great_ai import configure, MongoDbDriver

configure(tracing_database=MongoDbDriver('mongodb://localhost:27017_or_something_like_that'))
```

In [5]:
from great_ai import add_ground_truth

add_ground_truth(X, y, train_split_ratio=0.8, test_split_ratio=0.2)

2022-06-19 15:03:30,301 |     INFO | Options: configured ✅


### Next: [Part 2](train.ipynb)