# RAG application with Azure OpenAI & Azure Cognitive Search
## Resume analysis use case - 01 Data

### Objective
Let's build a RAG application that analyses **resume PDF documents**.

<img src="https://github.com/retkowsky/images/blob/master/HR.jpg?raw=true">

**Retrieval-Augmented Generation (RAG)** implementations vary, but at a fundamental level a RAG-based application follows these steps:

- The user submits a query or question.
- The system searches for documents likely to answer the query. These documents often consist of proprietary data and are stored in a document index.
- The system builds a prompt for the large language model (LLM) that contains the user's input, the relevant documents, and instructions on how to use these documents to answer the query.
- The system sends this prompt to the LLM.
- The LLM processes the prompt and generates an answer to the user's question, grounded in the provided context. This answer is the output of our system.
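
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this series: `retrieve`, `build_prompt` and `rag_answer` are hypothetical helper names, the keyword-overlap ranking is a toy stand-in for a real search index such as Azure Cognitive Search, and `llm` is any callable that takes a prompt string and returns text.

```python
from typing import Callable, List


def retrieve(query: str, index: List[str], top_k: int = 2) -> List[str]:
    """Toy retrieval: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in index]
    # Highest overlap first; documents with no overlap are dropped
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query: str, documents: List[str]) -> str:
    """Combine the instructions, the retrieved documents and the user question."""
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the sources below and cite them by number.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}"
    )


def rag_answer(query: str, index: List[str], llm: Callable[[str], str]) -> str:
    """Run the full flow: retrieve, build the prompt, call the model."""
    documents = retrieve(query, index)
    prompt = build_prompt(query, documents)
    return llm(prompt)
```

Passing an echo function such as `llm=lambda prompt: prompt` is a convenient way to inspect the exact prompt that would be sent to the model before plugging in a real one.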

### Steps
- Upload the PDF documents into an Azure Cognitive Search index
- Run some Azure Cognitive Search queries to get answers
- Use a GPT model to analyse the answers (summary, keyword generation)
- Retrieve the source text and its reference from the document to validate the proposed answer
- Provide a chatbot experience with Azure OpenAI to ask questions and get AI-generated answers with references

### Process
<img src="https://github.com/retkowsky/images/blob/master/rag.png?raw=true" width=800>

In [1]:
import os

## Zip file

In [2]:
data_dir = "data"

!ls -lh $data_dir/cv.zip

-rwxrwxrwx 1 root root 18M Nov 13 15:58 data/cv.zip


## Unzipping the zip file

In [3]:
!unzip -q $data_dir/cv.zip
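
If the `unzip` command is not available in your environment, the standard-library `zipfile` module can do the same job. The sketch below is an optional alternative, not part of the original notebook; `extract_zip` is a hypothetical helper name.

```python
import zipfile


def extract_zip(zip_path, target_dir="."):
    """Extract every entry of zip_path into target_dir and return the number of files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        # Count file entries only; directory entries are skipped
        return sum(1 for info in archive.infolist() if not info.is_dir())
```

Calling `extract_zip(f"{data_dir}/cv.zip")` would extract the archive into the current working directory, as the `unzip` cell does.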

## CVs

In [5]:
cv_dir = "cv"

In [22]:
def count_pdf_files_per_dir(top_directory):
    """
    Count the PDF files in a directory and in each of its subdirectories.
    Returns a dict mapping each directory to its PDF count, and the overall total.
    """
    pdf_results = {}
    total_pdf = 0

    for root, dirs, files in os.walk(top_directory):
        # Prune Jupyter checkpoint folders in place so os.walk does not descend into them
        if ".ipynb_checkpoints" in dirs:
            dirs.remove(".ipynb_checkpoints")
        pdf_count = sum(1 for file in files if file.lower().endswith(".pdf"))
        pdf_results[root] = pdf_count
        total_pdf += pdf_count

    return pdf_results, total_pdf

In [29]:
pdf_results, total_pdf = count_pdf_files_per_dir(cv_dir)

for directory, pdf_count in pdf_results.items():
    print(f"Dir: {directory} | Total number of PDF files = {pdf_count}")

print(f"\nTotal of PDF files = {total_pdf}")

Dir: cv | Total number of PDF files = 0
Dir: cv/BUSINESS-DEVELOPMENT | Total number of PDF files = 113
Dir: cv/CONSULTANT | Total number of PDF files = 115
Dir: cv/DESIGNER | Total number of PDF files = 107
Dir: cv/DIGITAL-MEDIA | Total number of PDF files = 96
Dir: cv/ENGINEERING | Total number of PDF files = 118
Dir: cv/INFORMATION-TECHNOLOGY | Total number of PDF files = 120
Dir: cv/SALES | Total number of PDF files = 116

Total of PDF files = 785


> Go to the next notebook