# Train a domain classifier on the [Semantic Scholar dataset](https://api.semanticscholar.org/corpus)
> Part 1: obtain and clean data

![position of this step in the lifecycle](../diagrams/scope-data.svg)
> The blue boxes show the steps implemented in this notebook.

### Extract

The data can be obtained either by downloading a public dataset (as in this case) or by having a data engineer set up and grant access to the organisation's data.

Here, we download the Semantic Scholar dataset from a public S3 bucket. `MAX_CHUNK_COUNT` limits how many randomly selected chunks of the corpus are processed.

In [1]:
MAX_CHUNK_COUNT = 1

In [2]:
import urllib.request
from random import shuffle

manifest = (
    urllib.request.urlopen(
        "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-02-01/manifest.txt"
    )
    .read()
    .decode()
)  # a list of available chunks separated by '\n' characters

lines = manifest.split()
shuffle(lines)
chunks = lines[:MAX_CHUNK_COUNT]

f"Processing {len(chunks)} out of the {len(manifest.split())} available chunks"

'Processing 1 out of the 6002 available chunks'
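
The selection above depends on the global random state, so a rerun picks different chunks. As a minimal sketch, assuming reproducibility matters for your experiments, the same selection can be written as a seeded function. The helper name `pick_chunks`, the `seed` parameter, and the chunk names in the example manifest are hypothetical and not part of the original notebook.

```python
import random
from typing import List


def pick_chunks(manifest_text: str, max_chunk_count: int, seed: int = 0) -> List[str]:
    """Return up to `max_chunk_count` chunk keys chosen at random from a manifest."""
    keys = manifest_text.split()  # one chunk key per whitespace-separated token
    rng = random.Random(seed)  # local generator: the global random state is untouched
    return rng.sample(keys, k=min(max_chunk_count, len(keys)))


# Made-up manifest content, used only for illustration.
example_manifest = "chunk-000.gz\nchunk-001.gz\nchunk-002.gz\n"
selected = pick_chunks(example_manifest, max_chunk_count=2)
```

Calling `pick_chunks` twice with the same seed returns the same chunk keys, which makes it easier to compare runs.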

### Transform

- Filter out non-English abstracts using `great_ai.utilities.predict_language` together with `is_english`.
- Project each record to keep only the necessary components (text and labels), and clean the text using `great_ai.utilities.clean`.
- Speed up processing using `great_ai.utilities.parallel_map`.
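
The filter and project steps listed above can be sketched on a few made-up records without any network access. This is an illustration only: `collapse_whitespace` and `looks_ascii` are deliberately crude, hypothetical stand-ins for `clean` and for the language check, and they do not reproduce the behaviour of the `great_ai` functions.

```python
import re
from typing import Any, Callable, Dict, List, Tuple


def transform(
    records: List[Dict[str, Any]],
    keep: Callable[[str], bool],
    clean_text: Callable[[str], str],
) -> List[Tuple[str, List[str]]]:
    """Filter records, then project each one to a (text, labels) pair."""
    return [
        (clean_text(f'{r["title"]} {r["paperAbstract"]}'), list(r["fieldsOfStudy"]))
        for r in records
        if r["fieldsOfStudy"] and keep(r["paperAbstract"])  # drop unlabelled or rejected records
    ]


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for `clean`: only normalises whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def looks_ascii(text: str) -> bool:
    """Hypothetical stand-in for the language filter: NOT real language detection."""
    return text.isascii()


# Made-up records that mimic the fields used in this notebook.
records = [
    {"title": "Graph  kernels", "paperAbstract": "We study\nkernels.", "fieldsOfStudy": ["Computer Science"]},
    {"title": "Sans étiquette", "paperAbstract": "Résumé en français.", "fieldsOfStudy": ["Physics"]},
    {"title": "No labels", "paperAbstract": "Unlabelled paper.", "fieldsOfStudy": []},
]

pairs = transform(records, keep=looks_ascii, clean_text=collapse_whitespace)
```

Only the first record survives: the second is rejected by the filter and the third has no labels.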

In [3]:
from typing import List, Tuple
import json
import gzip
from great_ai import parallel_map, clean, is_english, predict_language


def preprocess_chunk(chunk_key: str) -> List[Tuple[str, List[str]]]:
    # Extract
    response = urllib.request.urlopen(
        f"https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-02-01/{chunk_key}"
    )  # a gzipped JSON Lines file

    decompressed = gzip.decompress(response.read())
    decoded = decompressed.decode()
    chunk = [json.loads(line) for line in decoded.split("\n") if line]

    # Transform
    return [
        (
            clean(
                f'{c["title"]} {c["paperAbstract"]} {c["journalName"]} {c["venue"]}',
                convert_to_ascii=True,
            ),  # The text is cleaned to remove PDF extraction, web scraping, and other common artifacts
            c["fieldsOfStudy"],
        )  # Create pairs of `(text, [...domains])`
        for c in chunk
        if c["fieldsOfStudy"] and is_english(predict_language(c["paperAbstract"]))
    ]


preprocessed_chunks = parallel_map(preprocess_chunk, chunks)

Spacy model en_core_web_sm not found locally, downloading...
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 3.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.3.0


2022-06-25 14:21:57,984 |     INFO | Starting parallel map (concurrency: 1, chunk size: 1)
100%|██████████| 1/1 [03:26<00:00, 206.86s/it]


In [4]:
from itertools import chain

preprocessed_data = list(chain(*preprocessed_chunks))
X, y = zip(
    *preprocessed_data
)  # X is the input, y is the expected (ground truth) output
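
The cell above first flattens the per-chunk lists into one list of pairs and then "unzips" it into two parallel tuples. A small sketch of the same idiom on made-up data (the names `toy_chunks`, `flat`, `texts`, and `labels` are illustrative only):

```python
from itertools import chain
from typing import List, Tuple

# Made-up stand-in for `preprocessed_chunks`: one list of (text, labels) pairs per chunk.
toy_chunks: List[List[Tuple[str, List[str]]]] = [
    [("paper a", ["Physics"]), ("paper b", ["Biology", "Medicine"])],
    [("paper c", ["Computer Science"])],
]

flat = list(chain.from_iterable(toy_chunks))  # same result as chain(*toy_chunks)
texts, labels = zip(*flat)  # "unzip": a tuple of texts and a tuple of label lists
```

Note that if every record is filtered out, the flattened list is empty and the unpacking raises a `ValueError`, so it may be worth checking for that case before unzipping.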

### Load

Upload the dataset (or a part of it) to a central repository using `great_ai.add_ground_truth`. This step automatically tags each datapoint with a split label according to the ratios we set. Additional tags can also be given.
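
To make the idea of ratio-based split tags concrete, here is a minimal sketch of one plausible way to assign them: each datapoint independently receives a tag with probability proportional to its ratio. This is an assumption made for illustration; the notebook does not show how `add_ground_truth` assigns tags internally, and `assign_split_tags` is a hypothetical helper.

```python
import random
from typing import List


def assign_split_tags(
    n: int, train_ratio: float, test_ratio: float, seed: int = 0
) -> List[str]:
    """Draw one split tag per datapoint, weighted by the given ratios."""
    rng = random.Random(seed)  # seeded so the illustration is repeatable
    return rng.choices(["train", "test"], weights=[train_ratio, test_ratio], k=n)


tags = assign_split_tags(1000, train_ratio=0.8, test_ratio=0.2)
train_share = tags.count("train") / len(tags)  # close to 0.8, but not exactly 0.8
```

With random assignment the realised proportions only approximate the requested ratios, and the approximation improves as the dataset grows.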

#### Production-ready backend

The MongoDB driver is automatically configured if `mongo.ini` exists with the following format:

```ini
mongo_connection_string=mongodb://localhost:27017/
mongo_database=my_great_ai_db
```
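
Before running the upload, it can be useful to check that the file contains both keys. The sketch below is an assumption-laden convenience, not how `great_ai` reads the file: because the file has no section header, a dummy `[settings]` section (a name chosen here for illustration) is prepended so that the standard `configparser` module can parse it. `read_flat_ini` is a hypothetical helper.

```python
import configparser
from typing import Dict


def read_flat_ini(text: str) -> Dict[str, str]:
    """Parse key=value lines that lack a section header."""
    parser = configparser.ConfigParser()
    parser.read_string("[settings]\n" + text)  # dummy section name, illustration only
    return dict(parser["settings"])


example = (
    "mongo_connection_string=mongodb://localhost:27017/\n"
    "mongo_database=my_great_ai_db\n"
)
settings = read_flat_ini(example)
missing = {"mongo_connection_string", "mongo_database"} - settings.keys()
```

An empty `missing` set means both expected keys are present; in practice the text would come from reading `mongo.ini` instead of the inline example.
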
> You can install MongoDB from [here](https://www.mongodb.com/docs/manual/installation) or [use it as a service](https://www.mongodb.com/cloud/atlas/register)


In [5]:
from great_ai import add_ground_truth

add_ground_truth(X, y, train_split_ratio=0.8, test_split_ratio=0.2)

2022-06-25 14:25:24,990 |     INFO | Found credentials file (/data/projects/great_ai_example/mongo.ini), initialising MongoDbDriver
2022-06-25 14:25:24,991 |     INFO | Found credentials file (/data/projects/great_ai_example/mongo.ini), initialising LargeFileMongo
2022-06-25 14:25:24,992 |     INFO | Settings: configured ✅
2022-06-25 14:25:24,993 |     INFO | 🔩 tracing_database: MongoDbDriver
2022-06-25 14:25:24,994 |     INFO | 🔩 large_file_implementation: LargeFileMongo
2022-06-25 14:25:24,994 |     INFO | 🔩 is_production: False
2022-06-25 14:25:24,995 |     INFO | 🔩 should_log_exception_stack: True
2022-06-25 14:25:24,996 |     INFO | 🔩 prediction_cache_size: 512
2022-06-25 14:25:24,997 |     INFO | 🔩 dashboard_table_size: 50


### Next: [Part 2](train.ipynb)