# dataQuest pipeline

This notebook illustrates the complete pipeline of dataQuest, from defining keywords and other metadata to selecting final articles and generating output.

## Step0: Install dataQuest package

To install dataQuest, run the following line in a notebook cell: `%pip install dataQuest`

## Step1: Convert your corpus to the expected JSON format

The pipeline expects a set of gzip-compressed JSON files (`.gz`). Each JSON file contains the metadata for a newsletter, magazine, or similar publication, together with its articles. Each article has a title and a body, and the body is given as a list of paragraphs. The files may be organized in different folders or sub-folders.

Below is a snapshot of the JSON file format:

```json
{
    "newsletter_metadata": {
        "title": "Newspaper title ..",
        "language": "NL",
        "date": "1878-04-29",
        ...
    },
    "articles": {
        "1": {
            "title": "title of article1 ",
            "body": [
                "paragraph 1 ....",
                "paragraph 2...."
            ]
        },
        "2": {
            "title": "title of article2",
            "body": [
                "text..."
            ]
        }
    }
}
```

Sample data is available in the [data](https://github.com/UtrechtUniversity/dataQuest/blob/main/example/data/) folder of the repository.
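As a minimal sketch of the conversion step, the snippet below writes one file in the layout shown above and reads it back, using only the Python standard library. The helper names `write_issue` and `read_issue`, the file name, and the sample texts are illustrative assumptions and are not part of dataQuest. Only the keys visible in the snapshot are written; any further metadata fields are left to you.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_issue(path, metadata, articles):
    """Write one gzip-compressed JSON file in the layout shown above.

    `articles` is a list of (title, paragraphs) pairs. Hypothetical helper.
    """
    document = {
        "newsletter_metadata": metadata,
        # Article keys are strings counting from "1", as in the snapshot.
        "articles": {
            str(index): {"title": title, "body": list(paragraphs)}
            for index, (title, paragraphs) in enumerate(articles, start=1)
        },
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(document, handle, ensure_ascii=False, indent=4)
    return path


def read_issue(path):
    """Read a gzip-compressed JSON file back into a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Example run in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    target = write_issue(
        Path(folder) / "1878" / "issue_1878-04-29.json.gz",  # assumed file name
        {"title": "Newspaper title", "language": "NL", "date": "1878-04-29"},
        [
            ("title of article1", ["paragraph 1", "paragraph 2"]),
            ("title of article2", ["text"]),
        ],
    )
    print(sorted(read_issue(target)["articles"]))
```

In a real conversion you would call `write_issue` once per source document and point it at your own data folder instead of a temporary one.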


## Step2: Create a config file 

Create a config file that specifies the following:
- filters
- criteria to select final articles
- output format

```json
{
  "filters": [
    {
      "type": "AndFilter",
      "filters": [
        {
          "type": "YearFilter",
          "start_year": 1800,
          "end_year": 1910
        },
        {
          "type": "NotFilter",
          "filter": {
            "type": "ArticleTitleFilter",
            "article_title": "Advertentie"
          },
          "level": "article"
        },
        {
          "type": "KeywordsFilter",
          "keywords": ["dames", "liberalen"]
        }
      ]
    }
  ],
  "article_selector": {
    "type": "percentage",
    "value": "30"
  },
  "output_unit": "segmented_text",
  "sentences_per_segment": 10
}
```

A sample [config.json](https://github.com/UtrechtUniversity/dataQuest/blob/main/example/config.json) is available in the repository.
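If you prefer to build the config from a notebook cell, the sketch below assembles the same structure as a Python dictionary and writes it to disk. The helpers `filter_types` and `write_config` are hypothetical conveniences, not dataQuest functions. They only list the nested filter types and serialize the dictionary, and they do not check the config against what dataQuest actually accepts.

```python
import json
from pathlib import Path

# Same content as the config shown above.
config = {
    "filters": [
        {
            "type": "AndFilter",
            "filters": [
                {"type": "YearFilter", "start_year": 1800, "end_year": 1910},
                {
                    "type": "NotFilter",
                    "filter": {
                        "type": "ArticleTitleFilter",
                        "article_title": "Advertentie",
                    },
                    "level": "article",
                },
                {"type": "KeywordsFilter", "keywords": ["dames", "liberalen"]},
            ],
        }
    ],
    "article_selector": {"type": "percentage", "value": "30"},
    "output_unit": "segmented_text",
    "sentences_per_segment": 10,
}


def filter_types(node):
    """Collect the "type" of a filter and of every filter nested inside it."""
    types = [node["type"]]
    for child in node.get("filters", []):  # children of a compound filter
        types.extend(filter_types(child))
    if "filter" in node:  # single wrapped filter, as in NotFilter
        types.extend(filter_types(node["filter"]))
    return types


def write_config(data, path):
    """Serialize the config dictionary to a JSON file (hypothetical helper)."""
    path = Path(path)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return path


# Quick overview of the filter tree before writing the file.
print(filter_types(config["filters"][0]))
```

Calling `write_config(config, "config.json")` then produces the file that is passed to the pipeline in the next step.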

## Step3: Run the pipeline
Run the following command:

```
filter-articles \
  --input-dir "data/" \
  --output-dir "output/" \
  --input-type "delpher_kranten" \
  --glob "*.gz" \
  --config-path "config.json" \
  --period-type "decade"
```
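To launch the same command from a notebook cell, one option is Python's `subprocess` module, sketched below. The helpers `build_command` and `run_pipeline` are illustrative names, not part of dataQuest. The sketch only uses the flags and values shown above and assumes that the `filter-articles` executable is on your `PATH` after installation.

```python
import shlex
import shutil
import subprocess


def build_command(input_dir, output_dir, config_path,
                  input_type="delpher_kranten", glob="*.gz",
                  period_type="decade"):
    """Return the filter-articles call as an argument list (hypothetical helper)."""
    return [
        "filter-articles",
        "--input-dir", input_dir,
        "--output-dir", output_dir,
        "--input-type", input_type,
        "--glob", glob,
        "--config-path", config_path,
        "--period-type", period_type,
    ]


def run_pipeline(command):
    """Run the command and raise CalledProcessError if it exits with an error."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    return subprocess.run(command, check=True)


command = build_command("data/", "output/", "config.json")
# Print the shell-quoted form to inspect it before running anything.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with the `*.gz` pattern. Call `run_pipeline(command)` once the data folder and config file are in place.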