# AI-Powered HR Assistant: A Use Case for Nestlé’s HR Policy Documents

## Nestlé HR Policy Chatbot Overview

This Jupyter Notebook implements a conversational chatbot designed to answer user queries based on information contained within Nestlé's HR policy documents. It leverages several powerful technologies from the field of Natural Language Processing (NLP) and Large Language Models (LLMs) to achieve this:

1.  **Document Loading and Chunking:** The notebook begins by loading the HR policy document (in PDF format) and splitting it into smaller text chunks. This matters because an LLM can only process a limited number of tokens (its context window) in a single request.

2.  **Text Embeddings:** Each text chunk is then converted into a numerical vector representation called an "embedding." These embeddings capture the semantic meaning of the text, allowing the system to understand the relationships between different pieces of information. OpenAI's embedding models are used for this purpose.

3.  **Vector Database:** The embeddings are stored in a vector database (Chroma), which allows for efficient similarity search. This means that when a user asks a question, the system can quickly find the most relevant text chunks from the HR policy.

4.  **Question Answering with LLM:** The most relevant text chunks are then passed to a large language model (OpenAI's GPT model) along with the user's question. The LLM uses this context to generate a coherent and informative answer.

5.  **User Interface:** Finally, a user-friendly interface is created using Gradio, allowing users to easily interact with the chatbot.

In summary, this notebook demonstrates a complete workflow for building a question-answering system over a PDF document: load, chunk, embed, store, retrieve, and generate. The same approach can be applied to other document types and domains, which makes it useful for information retrieval and knowledge management. Markdown explanations and code cells are grouped by functionality so that each step can be followed on its own.
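
The retrieval idea behind steps 2 to 4 can be illustrated with a toy sketch. It assumes hand-made three-dimensional vectors and made-up chunk titles (neither comes from the policy document or from OpenAI's embedding models), and it ranks chunks by cosine similarity to a query vector, which is one common similarity measure for embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored chunks most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical chunks with hand-made vectors (illustration only).
toy_store = [
    ("Annual leave policy", [0.9, 0.1, 0.0]),
    ("Code of conduct", [0.1, 0.9, 0.1]),
    ("Parental leave policy", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for an embedded question about leave

print(top_k(query_vector, toy_store, k=2))
```

In the notebook itself, the embedding model produces the vectors and Chroma performs the similarity search; the sketch only shows why nearby vectors correspond to relevant chunks.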

---

### Installation of Libraries

This code block uses `pip` to install the necessary Python libraries. These libraries are essential for the chatbot's functionality:

*   `openai`: For interacting with OpenAI's models (GPT).
*   `langchain`: A framework for developing applications powered by language models.
*   `chromadb`: A vector database for storing and retrieving text embeddings.
*   `pypdf`: For loading and processing PDF documents.
*   `gradio`: For creating the user interface.
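*   `tiktoken`: For tokenizing text, especially important for managing context windows with large language models.
*   `python-dotenv`: For loading environment variables (such as the API key) from a `.env` file via `load_dotenv`.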

**Note:** You only need to run this cell once. If you've already installed these libraries, you can skip this step. If you are running this in a Google Colab notebook, you will need to run this section every time you connect to a new runtime.

In [None]:
# Install necessary libraries (run this in your notebook environment if needed)
# !pip install openai langchain chromadb pypdf gradio tiktoken python-dotenv
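
To decide whether the install cell needs to be run, a quick check such as the following sketch can list which modules are not importable in the current environment. The module list is an assumption derived from the libraries named above; `dotenv` is the import name of the `python-dotenv` package, which is included because the import cell calls `load_dotenv`.

```python
import importlib.util

# Import names, which can differ from pip names (python-dotenv -> dotenv).
REQUIRED_MODULES = [
    "openai", "langchain", "chromadb", "pypdf", "gradio", "tiktoken", "dotenv",
]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list suggests the install step can be skipped for this runtime.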

### Import Libraries and Set API Key

This block imports the required libraries and sets up the OpenAI API key.

*   **Imports:** Imports the necessary classes and functions from the installed libraries.
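*   **API Key:** Reads the OpenAI API key (`OPENAI_API_KEY`) and the PDF path (`PDF_DOC_PATH`) from environment variables, which `load_dotenv` can populate from a local `.env` file. Keeping the key out of the notebook source is the **recommended** way to manage API keys for security reasons.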

In [None]:
import logging
import os

import gradio as gr
import openai
from dotenv import load_dotenv  # provided by the python-dotenv package
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Read the OpenAI API key and PDF path from environment variables (or a .env file)
def load_env():
    """Load and validate environment variables"""
    load_dotenv(verbose=True)
    api_key = os.getenv('OPENAI_API_KEY')
    pdf_doc = os.getenv('PDF_DOC_PATH')
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables")
    if not pdf_doc:
        raise ValueError("PDF_DOC_PATH not found in environment variables")
    return api_key, pdf_doc

api_key, pdf_doc = load_env()

# Never print the key itself; notebook output is often saved and shared
print('api_key: loaded (value hidden)')
print(f'pdf doc path: {pdf_doc}')
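
For reference, `load_dotenv` reads a plain-text `.env` file made of `KEY=VALUE` lines. The sketch below shows what such a file might contain and parses it with a deliberately simplified helper. `parse_env_text` is a hypothetical illustration, not the `python-dotenv` implementation, and both values are placeholders (the path `docs/hr_policy.pdf` is invented for the example).

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments (simplified)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Placeholder contents of a .env file; never commit a real key.
example_env = """
# .env (keep this file out of version control)
OPENAI_API_KEY=sk-your-key-here
PDF_DOC_PATH="docs/hr_policy.pdf"
"""

settings = parse_env_text(example_env)
print(sorted(settings))
```

The two variable names match the ones `load_env` looks up; everything else in the sketch is an assumption for illustration.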

### Load the PDF Document

This block loads the PDF document using `PyPDFLoader`.

*   **`PyPDFLoader`:** This class from `langchain` loads the PDF file.
*   **`loader.load()`:** This method reads the PDF and returns its text as a list of document objects, one per page.
*   **Error Handling:** The `try-except` block handles the case where the specified PDF file is not found, printing a clear error message before stopping.

In [None]:
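# Load the PDF document from the path given in PDF_DOC_PATH
try:
    loader = PyPDFLoader(pdf_doc)
    documents = loader.load()
except (FileNotFoundError, ValueError):
    # Some LangChain versions raise ValueError (not FileNotFoundError) for a bad path
    print(f"Error: {pdf_doc} not found. Please ensure the file is in the correct directory or provide the correct path.")
    raise  # stop this cell; calling exit() would shut down the notebook kernel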
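
A path can also be validated before it is handed to the loader, which gives a more specific message than a generic load failure. The following is a minimal sketch using only the standard library; `validate_pdf_path` is a hypothetical helper name and the `.pdf` suffix check is an assumption about what counts as a valid input here.

```python
from pathlib import Path


def validate_pdf_path(path_str):
    """Return a Path if it points at an existing .pdf file, else raise."""
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    return path


# Example with a path that is assumed not to exist.
try:
    validate_pdf_path("this_file_does_not_exist.pdf")
except FileNotFoundError as err:
    print(f"Error: {err}")
```

Calling the helper on `pdf_doc` before constructing `PyPDFLoader` would separate "wrong path" errors from PDF parsing errors.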

### Split the Document into Chunks

Large documents are often split into smaller chunks for processing by language models. This block uses `CharacterTextSplitter` to do this.

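*   **`CharacterTextSplitter`:** This class splits the text on a separator (a blank line, `"\n\n"`, by default) and merges the pieces into chunks of up to a specified size.
*   **`chunk_size`:** The target maximum number of characters in each chunk (1000 in this case). A piece that contains no separator can still exceed this size.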
*   **`chunk_overlap`:** The number of overlapping characters between adjacent chunks (0 in this case). Overlapping chunks can help maintain context.
*   **`texts = text_splitter.split_documents(documents)`:** This line performs the actual splitting.

In [None]:
# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
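
The effect of `chunk_size` and `chunk_overlap` can be seen in a simplified sketch. It assumes a plain fixed-width sliding window over characters, which is not how `CharacterTextSplitter` works internally (that class splits on a separator first); `chunk_text` is a hypothetical helper written only to show how overlap repeats the tail of one chunk at the head of the next.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into fixed-width character windows with optional overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=0))
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))
```

With an overlap of 0, as in the cell above, a sentence that straddles a chunk boundary is cut in two; a small positive overlap keeps some shared context on both sides at the cost of storing more text.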