# Extending your Metadata using DocumentClassifiers at Index Time

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial16_Document_Classifier_at_Index_Time.ipynb)

With DocumentClassifier you can automatically enrich your documents with categories, sentiments, topics, or any other metadata you like. This metadata can then be used for efficient filtering or further processing. Say you have some categories your users typically filter on. If your documents are currently tagged manually with these categories, you could automate the tagging by training a model. Alternatively, you can use zero-shot classification: you pass your categories to the classifier and no labeled training data is required. This tutorial shows how to integrate either approach into your indexing pipeline.

DocumentClassifier adds the classification result (label and score) to the Document's meta property.
Hence, we can use it to classify documents at index time. \
The result can be accessed at query time, for example by applying a filter on "classification.label".

This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline.
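
To make the idea concrete, here is a minimal sketch of index-time enrichment in plain Python. `Doc`, `keyword_classifier`, and `enrich` are hypothetical stand-ins invented for illustration (they are not Haystack classes or functions); the only thing taken from this tutorial is that the result ends up under the "classification" key of the document's meta.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: just content plus a meta dict."""
    content: str
    meta: dict = field(default_factory=dict)


def keyword_classifier(text: str) -> dict:
    """Toy rule-based classifier standing in for a real model (assumption)."""
    label = "music" if "guitar" in text.lower() else "other"
    return {"label": label, "score": 1.0}


def enrich(docs, classify):
    """Store the classification result in each document's meta before indexing."""
    for doc in docs:
        doc.meta["classification"] = classify(doc.content)
    return docs


docs = enrich(
    [Doc("Distorted guitars define the genre."), Doc("BERT is a language model.")],
    keyword_classifier,
)
print([d.meta["classification"]["label"] for d in docs])  # ['music', 'other']
```

Because the label travels with the document into the document store, no classification has to happen at query time.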

In [1]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]

!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
!tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

# Install  pygraphviz
!apt install libgraphviz-dev
!pip install pygraphviz


In [2]:
# Here are the imports we need
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, BM25Retriever
from haystack.schema import Document
from haystack.utils import convert_files_to_docs, fetch_archive_from_http, print_answers

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.


In [3]:
# This fetches some sample files to work with

doc_dir = "data/tutorial16"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial16.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO - haystack.utils.import_utils -  Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial16.zip to `data/tutorial16`


True

## Read and preprocess documents


In [4]:
# note that you can also use the document classifier before applying the PreProcessor, e.g. before splitting your documents

all_docs = convert_files_to_docs(dir_path=doc_dir)
preprocessor_sliding_window = PreProcessor(split_overlap=3, split_length=10, split_respect_sentence_boundary=False)
docs_sliding_window = preprocessor_sliding_window.process(all_docs)

INFO - haystack.utils.preprocessing -  Converting data/tutorial16/classics.txt
INFO - haystack.utils.preprocessing -  Converting data/tutorial16/bert.pdf
INFO - haystack.utils.preprocessing -  Converting data/tutorial16/heavy_metal.docx


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


100%|██████████| 3/3 [00:00<00:00, 116.41docs/s]
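
The PreProcessor above is configured with split_length=10 and split_overlap=3. Assuming it splits by word (which we believe is the default), each chunk holds up to 10 words and starts with the last 3 words of the previous chunk. The sketch below imitates that windowing in plain Python; `sliding_window` is a hypothetical helper, not the PreProcessor's actual implementation, which also cleans the text.

```python
def sliding_window(words, split_length, split_overlap):
    """Split a list of words into overlapping chunks (simplified illustration)."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Placeholder words w0..w11 instead of real text
words = [f"w{i}" for i in range(12)]
chunks = sliding_window(words, split_length=10, split_overlap=3)
for chunk in chunks:
    print(" ".join(chunk))
```

With 12 words this yields two chunks: the first covers w0 to w9, the second starts again at w7 and runs to the end.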


## Apply DocumentClassifier

We can enrich the document metadata at index time using any transformers document classifier model. While traditional classification models are trained to predict one of a few "hard-coded" classes and require a dedicated training dataset, zero-shot classification is more flexible: you can switch the classes the model should predict on the fly by supplying them via the labels param.
Here we use a zero-shot model that classifies our documents as 'music', 'natural language processing', or 'history'. Feel free to change these to whatever you would like to classify. \
These classes can later be accessed at query time.
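
For intuition on why no training labels are needed: to our understanding, zero-shot classification with an NLI model turns each candidate label into a hypothesis sentence, asks the model how strongly the text entails it, and normalizes the entailment scores across labels. The sketch below shows only that last step in plain Python. The hypothesis template and the logit values are assumptions made up for illustration; no model is run here.

```python
import math


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["music", "natural language processing", "history"]

# One hypothesis per candidate label (template is an assumption)
hypotheses = [f"This example is {label}." for label in labels]

# Made-up entailment logits, one per hypothesis; a real NLI model would produce these
entailment_logits = [2.0, -1.0, 0.5]

scores = softmax(entailment_logits)
predicted = labels[scores.index(max(scores))]
print(predicted)  # music
```

Changing the label list only changes the hypotheses, which is why the classes can be swapped without retraining.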

In [5]:
doc_classifier = TransformersDocumentClassifier(
    model_name_or_path="cross-encoder/nli-distilroberta-base",
    task="zero-shot-classification",
    labels=["music", "natural language processing", "history"],
    batch_size=16,
)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0



In [None]:
# we can also use any other transformers model besides zero shot classification

# doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'
# doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model, batch_size=16, use_gpu=-1)

In [None]:
# we could also specify a different field we want to run the classification on

# doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
#    task="zero-shot-classification",
#    labels=["music", "natural language processing", "history"],
#    batch_size=16, use_gpu=-1,
#    classification_field="description")

In [6]:
# classify the documents (on GPU if one is available); batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_sliding_window)

In [7]:
# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

{'content': 'Classics or classical studies is the study of classical antiquity,', 'content_type': 'text', 'score': None, 'meta': {'name': 'classics.txt', '_split_id': 0, 'classification': {'sequence': 'Classics or classical studies is the study of classical antiquity,', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.3462618887424469, 0.3370615541934967, 0.316676527261734], 'label': 'music'}}, 'embedding': None, 'id': '5f06721d4e5ddd207e8de318274a89b6'}
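
Note that in the output above the three scores are close to each other (roughly 0.35, 0.34, and 0.32), so the 'music' label for this chunk is a weak call, which is not surprising for a 10-word snippet. If you want to treat such results differently, one option is to look at the margin between the two best scores. This is only a sketch: `top_two_margin` and the 0.2 threshold are assumptions for illustration, not part of Haystack.

```python
# Classification entry copied from the printed document above
classification = {
    "labels": ["music", "natural language processing", "history"],
    "scores": [0.3462618887424469, 0.3370615541934967, 0.316676527261734],
    "label": "music",
}


def top_two_margin(result: dict) -> float:
    """Difference between the best and second-best score."""
    ranked = sorted(result["scores"], reverse=True)
    return ranked[0] - ranked[1]


MIN_MARGIN = 0.2  # hypothetical threshold; tune it on your own data

margin = top_two_margin(classification)
is_confident = margin >= MIN_MARGIN
print(f"label={classification['label']} margin={margin:.4f} confident={is_confident}")
```

Here the margin is below 0.01, so a stricter setup might flag this chunk as uncertain or classify longer passages instead.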


## Indexing

In [8]:
# In Colab / No Docker environments: Start Elasticsearch from the downloaded archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30

In [9]:
# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

In [10]:
# Now, let's write the docs to our DB.
document_store.delete_documents()
document_store.write_documents(classified_docs)


In [11]:
# check if indexed docs contain classification results
test_doc = document_store.get_all_documents()[0]
print(
    f'document {test_doc.id} with content \n\n{test_doc.content}\n\nhas label {test_doc.meta["classification"]["label"]}'
)

document 5f06721d4e5ddd207e8de318274a89b6 with content 

Classics or classical studies is the study of classical antiquity,

has label music


## Querying the data

All we have to do to filter for one of our classes is to set a filter on "classification.label".
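
The filter is a dictionary that maps a metadata field to the list of accepted values; the dot in "classification.label" addresses the label key nested inside the classification entry of meta. The actual filtering is done by the document store. The plain-Python sketch below only mimics those semantics so you can see which documents a filter would keep; `get_nested` and `matches` are hypothetical helpers.

```python
def get_nested(meta: dict, dotted_key: str):
    """Resolve a key such as 'classification.label' against nested dicts."""
    current = meta
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # field is missing on this document
        current = current[part]
    return current


def matches(meta: dict, filters: dict) -> bool:
    """True if every filtered field holds one of its accepted values."""
    return all(get_nested(meta, key) in accepted for key, accepted in filters.items())


metas = [
    {"name": "heavy_metal.docx", "classification": {"label": "music"}},
    {"name": "bert.pdf", "classification": {"label": "natural language processing"}},
    {"name": "unclassified.txt"},
]
filters = {"classification.label": ["music"]}

kept = [m["name"] for m in metas if matches(m, filters)]
print(kept)  # ['heavy_metal.docx']
```

Documents without a classification entry are simply not matched by the filter.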

In [12]:
# Initialize QA-Pipeline
from haystack.pipelines import ExtractiveQAPipeline

retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...



INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2



INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


In [13]:
## Voilà! Ask a question while filtering for "music"-only documents
prediction = pipe.run(
    query="What is heavy metal?",
    params={"Retriever": {"top_k": 10, "filters": {"classification.label": ["music"]}}, "Reader": {"top_k": 5}},
)



In [14]:
print_answers(prediction, details="high")




Query: What is heavy metal?
Answers:
[   <Answer {'answer': 'thick, massive sound', 'type': 'extractive', 'score': 0.5388457477092743, 'context': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'offsets_in_document': [{'start': 35, 'end': 55}], 'offsets_in_context': [{'start': 35, 'end': 55}], 'document_id': 'b69a8816c2c8d782dceb412b80a4bf6e', 'meta': {'_split_id': 5, 'classification': {'sequence': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.9268990755081177, 0.04476191848516464, 0.028338981792330742], 'label': 'music'}, 'name': 'heavy_metal.docx'}}>,
    <Answer {'answer': 'a genre', 'type': 'extractive', 'score': 0.39985864609479904, 'context': 'Heavy metal\n\nHeavy metal (or simply metal) is a genre of', 'offsets_in_document': [{'start': 46, 'end': 53}], 'offsets_in_context': [{'start': 46, 'end': 53}], 'document_id': '9903d23737f3d05a9d9ee170703dc245'

## Wrapping it up in an indexing pipeline

In [15]:
from pathlib import Path
from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter

In [16]:
file_type_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
docx_converter = DocxToTextConverter()

indexing_pipeline_with_classification = Pipeline()
indexing_pipeline_with_classification.add_node(
    component=file_type_classifier, name="FileTypeClassifier", inputs=["File"]
)
indexing_pipeline_with_classification.add_node(
    component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"]
)
indexing_pipeline_with_classification.add_node(
    component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"]
)
indexing_pipeline_with_classification.add_node(
    component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"]
)
indexing_pipeline_with_classification.add_node(
    component=preprocessor_sliding_window,
    name="Preprocessor",
    inputs=["TextConverter", "PdfConverter", "DocxConverter"],
)
indexing_pipeline_with_classification.add_node(
    component=doc_classifier, name="DocumentClassifier", inputs=["Preprocessor"]
)
indexing_pipeline_with_classification.add_node(
    component=document_store, name="DocumentStore", inputs=["DocumentClassifier"]
)
indexing_pipeline_with_classification.draw("index_time_document_classifier.png")

document_store.delete_documents()
txt_files = [f for f in Path(doc_dir).iterdir() if f.suffix == ".txt"]
pdf_files = [f for f in Path(doc_dir).iterdir() if f.suffix == ".pdf"]
docx_files = [f for f in Path(doc_dir).iterdir() if f.suffix == ".docx"]
indexing_pipeline_with_classification.run(file_paths=txt_files)
indexing_pipeline_with_classification.run(file_paths=pdf_files)
indexing_pipeline_with_classification.run(file_paths=docx_files)

document_store.get_all_documents()[0]

100%|██████████| 1/1 [00:00<00:00, 115.05docs/s]
100%|██████████| 1/1 [00:00<00:00, 77.00docs/s]
100%|██████████| 1/1 [00:00<00:00, 639.96docs/s]


<Document: {'content': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\nJacob Devlin', 'content_type': 'text', 'score': None, 'meta': {'_split_id': 0, 'classification': {'sequence': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\nJacob Devlin', 'labels': ['natural language processing', 'history', 'music'], 'scores': [0.5377881526947021, 0.3413040041923523, 0.12090784311294556], 'label': 'natural language processing'}}, 'embedding': None, 'id': '5cf7c72dfc1e5e8286f42fa76c7f968d'}>
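
The pipeline above wires TextConverter to output_1, PdfConverter to output_2, and DocxConverter to output_4 of the FileTypeClassifier. This matches a supported-types order of txt, pdf, md, docx, html, which we assume to be the node's default. A rough sketch of that routing follows, with `SUPPORTED_TYPES` and `output_edge` as hypothetical names rather than Haystack API:

```python
from collections import defaultdict
from pathlib import Path

# Assumed default order of file types; position + 1 gives the output edge
SUPPORTED_TYPES = ["txt", "pdf", "md", "docx", "html"]


def output_edge(file_path) -> str:
    """Map a file to the output edge its converter should be connected to."""
    extension = Path(file_path).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {file_path}")
    return f"output_{SUPPORTED_TYPES.index(extension) + 1}"


files = [
    "data/tutorial16/classics.txt",
    "data/tutorial16/bert.pdf",
    "data/tutorial16/heavy_metal.docx",
]

# Group files by edge, similar to how the tutorial runs one file type per call
by_edge = defaultdict(list)
for f in files:
    by_edge[output_edge(f)].append(f)
print(dict(by_edge))
```

If you add a converter for Markdown or HTML files, it would connect to output_3 or output_5 under the same assumption.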

In [17]:
# we can store this pipeline and use it from the REST-API
indexing_pipeline_with_classification.save_to_yaml("indexing_pipeline_with_classification.yaml")


## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)
