This project proposes an architecture for efficiently and securely extracting, processing, and searching for information from digital media through the use of deep learning approaches. The architecture is designed to make digital content more accessible through semantic search and explore the domains of information extraction from digital media.
- Neural Search
- Information Retrieval
- Semantic Search
The proposed architecture consists of three serving layers: extraction, transformation, and loading.
The extraction phase retrieves data from various sources (e.g. cloud storage, databases, websites) and categorizes it based on its MIME type (e.g. document, audio/video, image). Necessary text and images are extracted and stored in a document database, while extracted images are passed to the transformation phase for further processing. The extraction phase also involves the creation of a config.yaml file, which contains configurations for various file types and information retrieval techniques.
- Document extraction: Extracting data (text and images) from document types such as .doc, .ppt, .xls, and .pdf
- Video extraction: Extracting video frames and audio and storing them separately
- Web-page extraction: Parsing HTML to extract text and image data from websites
The transformation phase converts raw data from the extraction phase into a normalized format and stores it using an asynchronous task distributed system (e.g. Apache Kafka, Redis). It consists of three blocks: audio transformation, image transformation, and video transformation. Tools like OpenCV and Pillow may be used for image and video processing.
The transformation phase also includes a consumer block with worker nodes that listen to queues and process incoming tasks. These workers have the ability to act as producers, triggering additional tasks and chaining them together. The producer block consists of web nodes that handle web requests and enqueue jobs in the task queue when a new task is received. The task queue, implemented using a Redis transport, acts as the broker, controlling the flow of information into the system and storing tasks temporarily in case of system failure.
The loading phase performs the important processes of information retrieval and indexing, and provides an abstraction layer for searching through all indexed data. The process of indexing begins with the creation of a config.yaml file, which specifies the deep learning-based information retrieval techniques and containers to be used for processing.
A single task goes through the following procedures:
- Downloading a single file or retrieving it from cloud storage, a binary storage database, or an API
- Parsing the binary or file metadata to determine the MIME type and categorizing the file accordingly
- Starting the necessary containers for information retrieval based on the config.yaml file and the file's MIME type
- Extracting relevant information from the file using the specified information retrieval techniques
- Indexing the extracted information for searching
The search phase of the implementation allows users to search indexed data using four methods: text search with semantic search and full-text search, audio search using audio fingerprinting, image-to-image search using Euclidean distance, and face search using facial vectors and the Dlib library. Text search uses BERT and Elasticsearch, while full-text search uses Typesense. Image search uses Xception. The search process starts when a client sends a request to the API Gateway, which routes the request to the appropriate microservice.
This architecture presents a solution for efficiently extracting, processing, and searching for information from digital media through the use of deep learning techniques and an asynchronous task distributed system. It aims to make digital content more accessible through semantic search and improve the process of indexing and storing large amounts of digital media.
Search images based on senetences, phrases or captions
Search similar images based on uploaded images
Search documents with similar meaning using bert vectorization over elastic search
Search similar persons based on facial data using face-to-face search
Search millions of documents within seconds with type errors using typo-tolerant-search
Starts indexing files from Cloud storage
git clone --recurse-submodules https://github.com/semantic-search/indexing_main.git
sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
pip install -r requirements.txt
cd Services
chmod u+x pdfimages
chmod u+x pdftotext
- Document Conversion Container Required
cd docker-unoconv-webservice
docker build -t docker-unoconv-webservice .
docker run --env-file=docker.env -p 80:3000 docker-unoconv-webservice
- Extra Server Requirements
-
Mongo Db with Authentication applied
-
Redis with Authentication applied
-
Apache Kafka Server with Authentication Applied (plain text Method)
Starting Kafka Server
Zookeeper
export KAFKA_OPTS="-Djava.security.auth.login.config=/home/jainal09/kafka_2.13-2.6.0/config/zookeeper_jaas.conf"
kafka_2.13-2.6.0/bin/zookeeper-server-start.sh config/zookeeper.properties
Server
export KAFKA_OPTS="-Djava.security.auth.login.config=/home/jainal09/kafka_2.13-2.6.0/config/kafka_server_jaas.conf"
kafka_2.13-2.6.0/bin/kafka-server-start.sh kafka_2.13-2.6.0/config/server.properties
logstash Server
Setup a Logstash by following this blog
celery -A task_worker worker -l INFO
flower -A task_worker --address=0.0.0.0 --port=5550
Dont Forget to add the environment variables in the .env
file
REDIS_HOSTNAME=
REDIS_PORT=
REDIS_PASSWORD=
KAFKA_HOSTNAME=
KAFKA_PORT=
KAFKA_CLIENT_ID=
KAFKA_USERNAME=
KAFKA_PASSWORD=
MONGO_HOST=
MONGO_PORT=
MONGO_DB=
MONGO_USER=
MONGO_PASSWORD=
CONNECTION_STRING=
BLOB_STORAGE_CONTAINER_NAME=
UNOCONV_SERVER=
STORAGE_PROVIDER=
DASHBOARD_API_URL_UPDATE_STATE=
DASHBOARD_API_URL_REMOVE_FILE=
DASHBOARD_API_CLIENT_ID=
LOGGER_SERVER_HOST=
LOGGER_SERVER_PORT=
CORS_ORIGIN=
python main.py config.yaml
-
Currently we tested on Azure Blob Storage
-
You can add your preferred service provider by creating a simple elif condition to download a file in
task_utils/download_file_from_storage.py
-
Don't Forget to add the Variable
STORAGE_PROVIDER=
in .env
@InProceedings{10.1007/978-981-19-0898-9_8,
author="Gosaliya, Jainal S.
and Gupta, Adarsh K.
and Ashok, Akshay
and Parikh, Swapnil M.",
editor="Pandian, A. Pasumpon
and Fernando, Xavier
and Haoxiang, Wang",
title="Architectural Insight of Neural Information Extraction, Retrieval, and Processing for Multimodal Neural Search",
booktitle="Computer Networks, Big Data and IoT",
year="2022",
publisher="Springer Nature Singapore",
address="Singapore",
pages="93--110",
abstract="In the growing world of digitization, digital media is engendered in abundance. With the ascension of the utilization of the Internet, there has been a prodigious increase in the engendering of digital content which includes images, audio, video, and documents such as pdf and text data. Information is free and more accessible than in any other era of humanity. Due to such a cognizance explosion, there is a vigorous need to make it more accessible. This can be achieved with semantic search. The quandary of processing, indexing, and storing such content has grown exponentially. At the same time, the infrastructure to handle such length has to be efficient and scalable. The current scenario of erudition explosion resulted in sizably voluminous data having a high performant scalable and resilient architecture which can parallelly process this multimodal binary file, can be gamely transmuting, and is becoming a requirement of the future. Different from the subsisting approaches that design handcrafted and task-concrete architectures for neural search to address only a single task, our architecture is tuned to handle multimodality which fundamentally denotes those data types (modalities) that can be audio, video, documents, images. This paper discusses the solution available to make digital content more accessible which is engendered as a result of the cognizance explosion. The proposed architecture will explore the domains of information extraction from this digital media securely and efficiently with various deep learning approaches for some categorical use cases.",
isbn="978-981-19-0898-9"
}