#Project Goal
This project builds a Retrieval-Augmented Generation (RAG) chatbot that answers questions about company documentation. It combines semantic search (MiniLM embeddings stored in Pinecone) with a generative model (OpenAI GPT-4 Turbo), so that responses are grounded in passages retrieved from the documents rather than in the model's general knowledge alone.

In [1]:
!pip install python-docx sentence-transformers langchain pinecone-client openai

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting fastapi
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Collecting starlette<0.46.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.45.3-py3-none-any.whl.metadata (6.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3

#Importing Libraries

In [2]:
from docx import Document
import pinecone
import torch
import openai
import numpy as np
import os
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter


#Data Extraction & Processing

In [3]:
file_path = "/content/Data.docx"

doc = Document(file_path)

text_data = []
for para in doc.paragraphs:
    if para.text.strip():
        text_data.append(para.text.strip())

full_text = "\n".join(text_data)

print(full_text[:1000])

BlueSky Innovations
Table of Contents
Introduction & Company Overview
1.1 Welcome to BlueSky Innovations
1.2 Company History & Founding Principles
1.3 Mission, Vision & Values
1.4 Corporate Structure
1.5 Purpose of the Policy and Benefits Manual
Equal Employment Opportunity & Anti-Discrimination Policies
2.1 Equal Employment Opportunity (EEO) Statement
2.2 Anti-Discrimination and Harassment Policy
2.3 Reasonable Accommodations for Disabilities
2.4 Cultural Competency & Sensitivity Training
2.5 Reporting Procedures and Non-Retaliation
Employment Status & Classification
3.1 Employment at Will
3.2 Full-Time, Part-Time, and Temporary Employees
3.3 Exempt vs. Non-Exempt Status
3.4 Independent Contractors and Consultants
3.5 Job Descriptions and Duties
Recruitment, Hiring & Onboarding
4.1 Recruitment Process and Best Practices
4.2 Job Postings and Internal Applications
4.3 Pre-Employment Screening & Background Checks
4.4 Offer Letters and Employment Contracts
4.5 Orientation and Training
Com

#Embedding Generation & Storage

Use MiniLM (all-MiniLM-L6-v2) to convert each text chunk into a 384-dimensional vector embedding.
Store these embeddings in Pinecone, a vector database built for similarity search, together with the chunk text as metadata.

In [4]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,  # Adjust chunk size as needed
    chunk_overlap=100  # Overlapping text to maintain context
)
chunks = splitter.split_text(full_text)
total_chunks = len(chunks)
print(f"Total number of chunks: {total_chunks}")

Total number of chunks: 94
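To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window chunker. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Each chunk starts with the last character of the previous one (overlap of 1).
demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.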


In [5]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

embeddings = model.encode(chunks, convert_to_tensor=True)
print(f"Embedding Shape: {embeddings.shape}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding Shape: torch.Size([94, 384])


In [None]:
# Read API keys from environment variables so they are never hard-coded in the notebook
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
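`os.getenv` returns `None` when a variable is missing, and the failure then only shows up later as an authentication error. A small fail-fast helper is one way to catch this early. This is a sketch, not part of the original notebook; the name `require_env` is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (commented out so this cell runs without real keys):
# PINECONE_API_KEY = require_env("PINECONE_API_KEY")
```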

In [6]:
pc = pinecone.Pinecone(api_key=PINECONE_API_KEY)

index_name = "company-docs-index"

existing_indexes = [index.name for index in pc.list_indexes()]

if index_name in existing_indexes:
    print(f"Index '{index_name}' already exists.")
else:
    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=pinecone.ServerlessSpec(cloud="aws", region="us-east-1")
    )


Index 'company-docs-index' already exists.
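Because the index may already exist from an earlier run, it can be worth confirming that the vectors about to be stored have the dimension the index was created with (384 here). The check below is a minimal sketch on plain Python lists; `find_dimension_mismatches` is a hypothetical helper and does not call Pinecone.

```python
def find_dimension_mismatches(vectors, expected_dim):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Toy example: the second vector has the wrong length.
toy_vectors = [[0.1] * 384, [0.1] * 768]
print(find_dimension_mismatches(toy_vectors, expected_dim=384))  # [1]
```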


In [7]:
index = pc.Index(index_name)

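# Encode all chunks in one batched call instead of one model call per chunk
chunk_embeddings = model.encode(chunks, convert_to_numpy=True).tolist()

# Each record is (id, embedding, metadata); the chunk text is kept as metadata for retrieval
vectors = [
    (str(i), embedding, {"text": chunk})
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings))
]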

batch_size = 100
for i in range(0, len(vectors), batch_size):
    index.upsert(vectors[i : i + batch_size])

print(f"Stored {len(vectors)} text chunks in Pinecone!")


Stored 94 text chunks in Pinecone!
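With 94 chunks and a batch size of 100, the loop above sends a single request. For larger documents the same slicing pattern yields several batches. The generator below is a minimal sketch of that pattern in isolation; the name `batched_slices` is hypothetical, and the items are stand-ins for the `(id, embedding, metadata)` tuples.

```python
def batched_slices(items, batch_size):
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 94 stand-in records split into batches of 40.
records = list(range(94))
print([len(batch) for batch in batched_slices(records, 40)])  # [40, 40, 14]
```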


#Retrieval of Relevant Information

*   When a user asks a question, embed it with the same MiniLM model that was used for the document chunks.
*   Query Pinecone to retrieve the most relevant document chunks based on cosine similarity.
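
The two steps above can be sketched without Pinecone as a brute-force search over a small in-memory list. This is only an illustration of ranking by cosine similarity with a `top_k` limit and a score threshold; Pinecone's own index is not implemented this way, and the names `cosine` and `top_k_matches` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_matches(query_vec, stored, top_k=3, threshold=0.3):
    """Rank (vector, text) pairs by similarity to query_vec, keep the best top_k above threshold."""
    scored = [(cosine(query_vec, vec), text) for vec, text in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, text) for score, text in scored[:top_k] if score >= threshold]


# Toy 2-dimensional "embeddings"; real MiniLM vectors have 384 dimensions.
stored = [([1.0, 0.0], "chunk A"), ([0.0, 1.0], "chunk B"), ([1.0, 1.0], "chunk C")]
for score, text in top_k_matches([1.0, 0.0], stored):
    print(f"{score:.2f} {text}")
```

In the toy data, "chunk B" is orthogonal to the query, so it scores 0 and is dropped by the threshold even though `top_k` would allow three results.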



In [8]:
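def query_pinecone(question, top_k=3, threshold=0.3):
    """Return the text of up to top_k chunks whose similarity score to the question is at least threshold."""
    query_embedding = model.encode(question, convert_to_numpy=True).astype(np.float32).tolist()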

    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True, include_values=False)

    filtered_responses = [
        match["metadata"]["text"] for match in results["matches"] if match["score"] >= threshold
    ]

    return filtered_responses if filtered_responses else ["This information is not available in the given source."]

user_question = "What is BlueSky Innovations?"
retrieved_texts = query_pinecone(user_question)

for i, text in enumerate(retrieved_texts):
    print(f"Answer {i+1}:\n{text}\n")


Answer 1:
Conclusion
1. Introduction & Company Overview
1.1 Welcome to BlueSky Innovations
BlueSky Innovations is delighted to welcome you to our team. As a leading tech solutions provider, we pride ourselves on fostering an environment of innovation, collaboration, and professional growth. We firmly believe that our employees are the backbone of our success. Your talent, dedication, and unique perspective are the driving force behind our mission to deliver cutting-edge, customer-focused solutions.

Answer 2:
BlueSky Innovations strives to offer a work environment that respects individual differences, promotes fairness, and rewards dedication. We are excited to have you on board and look forward to a mutually beneficial and productive relationship. If you have any questions or concerns about the material covered in this manual, feel free to reach out to your supervisor, the Human Resources (HR) Department, or any member of the management team.
1.2 Company History & Founding Principles


In [None]:
import requests

# Hugging Face API token
HUGGINGFACE_API_TOKEN = HUGGINGFACE_API_KEY

def generate_answer(query, context):
    """Query a free Hugging Face model for an answer based on retrieved context."""

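    # query_pinecone returns a list of chunks; join them so the prompt holds plain text, not a Python list repr
    context_text = "\n\n".join(context) if isinstance(context, list) else context

    prompt = f"""
        Answer the question based only on the following context:
        {context_text}
        Question: {query}
    """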
    headers = {
        "Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}",
        "Content-Type": "application/json"
    }

    data = {
        "inputs": prompt,
        "parameters": {"max_length": 800, "temperature": 0.3}
    }

    # Call Hugging Face API
    response = requests.post(
        "https://api-inference.huggingface.co/models/google/flan-t5-large",
        headers=headers,
        json=data
    )

    if response.status_code == 200:
        return response.json()[0]["generated_text"]
    # Error bodies are not guaranteed to be valid JSON, so report the raw text instead
    return f"Error {response.status_code}: {response.text}"

query = "What is BlueSky Innovations?"
retrieved_text = query_pinecone(query,top_k=5)

print("Retrieved Context:", retrieved_text)

final_answer = generate_answer(query, retrieved_text)
print("Generated Answer:", final_answer)


Retrieved Context: ['Conclusion\n1. Introduction & Company Overview\n1.1 Welcome to BlueSky Innovations\nBlueSky Innovations is delighted to welcome you to our team. As a leading tech solutions provider, we pride ourselves on fostering an environment of innovation, collaboration, and professional growth. We firmly believe that our employees are the backbone of our success. Your talent, dedication, and unique perspective are the driving force behind our mission to deliver cutting-edge, customer-focused solutions.', 'BlueSky Innovations strives to offer a work environment that respects individual differences, promotes fairness, and rewards dedication. We are excited to have you on board and look forward to a mutually beneficial and productive relationship. If you have any questions or concerns about the material covered in this manual, feel free to reach out to your supervisor, the Human Resources (HR) Department, or any member of the management team.\n1.2 Company History & Founding Prin

#Answer Generation using GPT-4 Turbo


In [9]:
# OpenAI API key
openai.api_key = OPENAI_API_KEY


In [10]:
def generate_answer(query, context):
    """Generates an answer using OpenAI GPT-4 Turbo based on retrieved context."""

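    # query_pinecone returns a list of chunks; join them so the prompt holds plain text, not a Python list repr
    context_text = "\n\n".join(context) if isinstance(context, list) else context

    prompt = f"Use the following information to answer the query:\n\n{context_text}\n\nQuery: {query}\nAnswer:"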

    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are an AI assistant that provides company-related answers based on provided documents."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=200
    )

    return response.choices[0].message.content

query = "What is BlueSky Innovations?"
retrieved_text = query_pinecone(query, top_k=5)

print("Retrieved Context:", retrieved_text)

final_answer = generate_answer(query, retrieved_text)
print(f"\n\nGenerated Response: {final_answer}")


Retrieved Context: ['Conclusion\n1. Introduction & Company Overview\n1.1 Welcome to BlueSky Innovations\nBlueSky Innovations is delighted to welcome you to our team. As a leading tech solutions provider, we pride ourselves on fostering an environment of innovation, collaboration, and professional growth. We firmly believe that our employees are the backbone of our success. Your talent, dedication, and unique perspective are the driving force behind our mission to deliver cutting-edge, customer-focused solutions.', 'BlueSky Innovations strives to offer a work environment that respects individual differences, promotes fairness, and rewards dedication. We are excited to have you on board and look forward to a mutually beneficial and productive relationship. If you have any questions or concerns about the material covered in this manual, feel free to reach out to your supervisor, the Human Resources (HR) Department, or any member of the management team.\n1.2 Company History & Founding Prin
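
One design point worth noting: when nothing passes the threshold, `query_pinecone` returns its fallback sentence, and that sentence is then sent to the model as if it were context. A possible refinement is to build the prompt in a separate step that numbers the chunks and signals when there is no usable context, so the caller can skip the API call. The sketch below assumes that approach; `build_prompt` and `NO_CONTEXT_MESSAGE` are hypothetical names, and the prompt wording mirrors the one used above.

```python
NO_CONTEXT_MESSAGE = "This information is not available in the given source."


def build_prompt(query, chunks):
    """Return a prompt with numbered context chunks, or None if no usable context was retrieved."""
    usable = [c.strip() for c in chunks if c.strip() and c.strip() != NO_CONTEXT_MESSAGE]
    if not usable:
        return None
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(usable, start=1))
    return (
        "Use the following information to answer the query:\n\n"
        f"{numbered}\n\nQuery: {query}\nAnswer:"
    )


# With only the fallback sentence as "context", no prompt is built.
print(build_prompt("What is BlueSky Innovations?", [NO_CONTEXT_MESSAGE]))  # None
```

When `build_prompt` returns `None`, the caller could return the fallback sentence directly instead of calling the model.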