# OCR, Index and Semantic Search Tutorial

Before running this tutorial, make sure you have installed Haystack and pulled an OpenSearch image from Docker Hub. If you have not, follow these steps:

1. Install Haystack with `pip install farm-haystack` (the PyPI package that provides the `haystack.nodes` and `haystack.document_stores` modules used below).
2. Pull the OpenSearch image: `docker pull opensearchproject/opensearch:1.0.1`
3. Launch the image: `docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1`
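
Before indexing, it can help to confirm that the container is reachable. The sketch below assumes the image's demo security settings (HTTPS on port 9200 with a self-signed certificate, and the `admin`/`admin` credentials used later in this tutorial). `basic_auth_header` and `cluster_health` are hypothetical helpers written for this example; they are not part of haystack or ocrpy.

```python
import base64
import json
import ssl
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return f"Basic {token.decode('ascii')}"


def cluster_health(host="localhost", port=9200,
                   username="admin", password="admin", timeout=5.0):
    """Query the _cluster/health endpoint and return the parsed JSON reply."""
    # Assumption: the local container uses a self-signed certificate, so
    # verification is disabled. Do not do this outside local development.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    request = urllib.request.Request(
        f"https://{host}:{port}/_cluster/health",
        headers={"Authorization": basic_auth_header(username, password)},
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the container has finished starting, calling `cluster_health()` should return a dictionary describing the cluster; a connection error usually means the container is still booting or the port mapping is wrong.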

In [1]:
from ocrpy import DocumentReader, TextOcrIndexPipeline
from haystack.nodes import BM25Retriever, TfidfRetriever
from haystack.document_stores import OpenSearchDocumentStore

In [None]:
# Unzip the sample data and create the output directory
!unzip sample_data/data.zip -d sample_data/data
!mkdir -p sample_data/output

### Let's create a new pipeline and index the documents

In [4]:
SOURCE = 'sample_data/data'  # S3 bucket, GCS bucket, or local directory containing your documents.
DESTINATION = 'sample_data/output/'  # S3 bucket, GCS bucket, or local directory for the processed documents.
PARSER = 'pytesseract'  # or 'aws-textract' or 'google-cloud-vision'
CREDENTIALS = {"AWS": "path/to/aws-credentials.env/file",
               "GCP": "path/to/gcp-credentials.json/file"}  # Optional: only needed if you use a cloud service.

DATABASE_BACKEND = "opensearch"
DATABASE_CONFIG = {"opensearch": {"port": 9200, "username": "admin", "password": "admin"}, "batch_size": 100}
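
A typo in the database configuration only surfaces once the pipeline tries to connect, so a quick sanity check beforehand can save a run. The sketch below is a hypothetical helper, not part of ocrpy; the rules it enforces (a settings entry for the chosen backend, a valid port, and non-empty `username` and `password`) are assumptions based on the configuration shown in this tutorial.

```python
def check_database_config(backend, config):
    """Return the connection settings for `backend`, raising on obvious mistakes."""
    if backend not in config:
        raise KeyError(f"no settings for backend {backend!r} in database config")

    settings = config[backend]
    port = settings.get("port")
    if not isinstance(port, int) or not 0 < port < 65536:
        raise ValueError(f"port must be an integer between 1 and 65535, got {port!r}")

    # Assumption: the demo container requires credentials.
    for key in ("username", "password"):
        if not settings.get(key):
            raise ValueError(f"missing {key!r} for backend {backend!r}")
    return settings


DATABASE_CONFIG = {
    "opensearch": {"port": 9200, "username": "admin", "password": "admin"},
    "batch_size": 100,
}
settings = check_database_config("opensearch", DATABASE_CONFIG)
```

The returned `settings` dictionary is the same sub-dictionary that is later unpacked into the document store.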

In [5]:
pipeline = TextOcrIndexPipeline(source_dir=SOURCE,
                                destination_dir=DESTINATION,
                                parser_backend=PARSER,
                                credentials_config=CREDENTIALS,
                                database_backend=DATABASE_BACKEND,
                                database_config=DATABASE_CONFIG)

In [6]:
pipeline.process()

Running Pipeline with the following configuration:

1. DOCUMENT_SOURCE: data
2. DOCUMENT_DESTINATION: output
3. SOURCE_STORAGE_TYPE: LOCAL
4. DESTINATION_STORAGE_TYPE: LOCAL
5. PARSER_BACKEND_TYPE: pytesseract
6. TOTAL_DOCUMENT_COUNT: 9
7. IMAGE_FILE_COUNT: 3
8. PDF_FILE_COUNT: 5
9. CREDENTIALS: {'AWS': 'path/to/aws-credentials.env/file', 'GCP': 'path/to/gcp-credentials.json/file'}
10. DATABASE_BACKEND: opensearch
11. DATABASE_CONFIG: {'opensearch': {'port': 9200, 'username': 'admin', 'password': 'admin'}, 'batch_size': 100}


0it [00:00, ?it/s]

FILE: .DS_Store - ERROR: 'FileTypeNotSupported' object is not iterable


9it [07:14, 48.24s/it]
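
The `.DS_Store` error in the log above refers to a macOS metadata file that sits in the source directory alongside the documents. One way to avoid such noise is to look for hidden files before running the pipeline. The sketch below assumes a local source directory; `find_hidden_files` is a hypothetical helper, not part of ocrpy.

```python
from pathlib import Path


def find_hidden_files(source_dir):
    """Return hidden files (names starting with a dot) under source_dir, sorted."""
    root = Path(source_dir)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.name.startswith(".")
    )
```

For example, `find_hidden_files('sample_data/data')` lists the stray files so that you can inspect and delete them before calling `pipeline.process()`.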


### Semantic Search with Haystack

#### Connect to the OpenSearch instance with your credentials and index name

In [8]:
# Create a document store and a BM25 retriever that queries it
doc_store = OpenSearchDocumentStore(**DATABASE_CONFIG['opensearch'])
retriver = BM25Retriever(doc_store)


### Search the index with your query

In [13]:
for i in retriver.retrieve(query="benefits of meditation and visualization", top_k=3):    
    print(f"File Name: {i.meta['file_name']}")
    print("Content: \n",i.content[:200])
    print("-"*10)

File Name: How to improve visualization_pytesseract.json
Content: 
 How to Practice Visualization Meditation: 3 Best Scripts 21/07/22, 8:22 PM

What Is Visualization
Meditation?

 

 

Visualization meditation focuses on the use of guided
imagery to cultivate certain 
----------
File Name: How to improve meditation_pytesseract.json
Content: 
 How to Perform Body Scan Meditation: 3 Best Scripts 21/07/22, 8:23 PM

Research

Nervous
system
response to
body scan
meditation

 

Although MBSR has been studied extensively as a program
and shown t
----------
File Name: Humanistic Psychology_pytesseract.json
Content: 
 Humanistic Psychology's Approach to Wellbeing: 3 Theories

Brief History of Humanistic
Psychology

The revolution of
humanistic
psychology first
began in the 1960s.

At this time, humanistic psycholog
----------


### Let's run another search

In [16]:
for i in retriver.retrieve(query="What are best way to be happy", top_k=3):    
    print(f"File Name: {i.meta['file_name']}")
    print("Content: \n",i.content[:200])
    print("-"*10)

File Name: Know your Self_pytesseract.json
Content: 
 Gretchen Rubin 21/07/22, 8:44 PM

 

Spotlight on the Know Yourself

Better Journal
October 18, 2021

People often ask me, "What's the secret to happiness? If you had to
choose one thing, what would y
----------
File Name: How to improve visualization_pytesseract.json
Content: 
 How to Practice Visualization Meditation: 3 Best Scripts 21/07/22, 8:22 PM

What Is Visualization
Meditation?

 

 

Visualization meditation focuses on the use of guided
imagery to cultivate certain 
----------
File Name: How to Become More Creative | Psychology Today_pages-to-jpg-0002_pytesseract.json
Content: 
 How to Become More Creative | Psychology Today 21/07/22, 9:06 PM
breath and then gasped for another.
| can’t believe this is happening to me again. Why me?!
Another inner voice suddenly called out:

S
----------
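
Both searches above repeat the same loop, and the OCR line breaks use up much of each 200-character preview. The sketch below wraps the loop in a hypothetical helper, `summarize_hits`, that collapses whitespace before truncating. It relies only on what the cells above already use: a retriever with a `retrieve(query=..., top_k=...)` method whose results expose `meta['file_name']` and `content`.

```python
def summarize_hits(retriever, query, top_k=3, preview_chars=200):
    """Return (file_name, preview) pairs for the top_k documents matching query."""
    results = []
    for doc in retriever.retrieve(query=query, top_k=top_k):
        # Collapse runs of whitespace so OCR line breaks do not waste the preview.
        preview = " ".join(doc.content.split())[:preview_chars]
        results.append((doc.meta["file_name"], preview))
    return results
```

With the retriever created earlier, `summarize_hits(retriver, "benefits of meditation and visualization")` returns the same documents as the first search, with each preview on a single line.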
