#### Setting Up the Project Environment

```bash
# 1. Create Virtual Environment
python -m venv chatbot_rag_env

# 2. Activate the Virtual Environment
# Windows:     .\chatbot_rag_env\Scripts\activate
# macOS/Linux: source chatbot_rag_env/bin/activate

# 3. Install Jupyter and Create Notebook Kernel
pip install jupyter ipykernel
python -m ipykernel install --user --name=chatbot_rag_kernel --display-name "chatbot_rag_kernel"
```


In [9]:
!python --version

Python 3.11.4


In [3]:
pip freeze

aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bleach==6.1.0
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
colorama==0.4.6
comm==0.2.2
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
distro==1.9.0
executing==2.0.1
fastjsonschema==2.20.0
filelock==3.15.4
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.6.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.24.3
idna==3.7
ipykernel==6.29.5
ipython==8.26.0
ipywidgets==8.1.3
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
joblib==1.4.2
json5==0.9.25
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.4
jupyterlab_p

#### `hasattr` Function

The `hasattr` function in Python is used to check if an object has a specific attribute. This helps avoid `AttributeError` exceptions when trying to access an attribute that may not exist.

**Syntax:**
```python
hasattr(object, attribute_name)
```

**Returns:**

- `True`: If the object has the specified attribute.
- `False`: If the object does not have the specified attribute.

In [4]:
class MyClass:
    def __init__(self):
        self.my_attribute = "Hello"

# Create an instance of MyClass
obj = MyClass()

# Check if the object has 'my_attribute'
print(hasattr(obj, 'my_attribute'))  # Output: True

# Check if the object has 'another_attribute'
print(hasattr(obj, 'another_attribute'))  # Output: False

True
False
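
The check above only prints a boolean. As a minimal sketch of how `hasattr` can guard an attribute access in practice, the snippet below reads `page_content` when it exists and falls back to `str()` otherwise. The `Doc` class and `extract_text` helper are hypothetical names made up for this illustration.

```python
class Doc:
    """Hypothetical object carrying a page_content attribute."""

    def __init__(self, page_content):
        self.page_content = page_content


def extract_text(item):
    # Use the attribute when it exists; otherwise fall back to str()
    if hasattr(item, "page_content"):
        return item.page_content
    return str(item)


print(extract_text(Doc("hello")))  # object with the attribute
print(extract_text(42))            # object without the attribute

# getattr with a default expresses the same guard in a single call
print(getattr(Doc("hi"), "missing_attribute", "fallback"))
```

When only a fallback value is needed, `getattr(obj, name, default)` is often the shorter choice; `hasattr` reads better when the two branches do different things.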


## Understanding Embeddings

**Embeddings** are numerical representations of text that capture semantic meaning. They convert words, sentences, or entire documents into vectors in a continuous vector space, allowing models to perform various tasks like text similarity, classification, and clustering.

**Example**:
- **Text**: "Machine learning is fascinating."
- **Embedding**: `[0.12, -0.34, 0.89, ...]` (a high-dimensional vector)
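
To make "capturing semantic meaning" more concrete, here is a small sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors are invented for illustration only; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (made-up values, not model output)
machine_learning = [0.9, 0.1, 0.0]
deep_learning = [0.8, 0.2, 0.1]
cooking = [0.0, 0.2, 0.9]

# Related topics point in similar directions, unrelated ones do not
print(cosine_similarity(machine_learning, deep_learning))
print(cosine_similarity(machine_learning, cooking))
```

A value near 1 means the vectors point in nearly the same direction, while a value near 0 means they are close to orthogonal.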

### Other Models for Text Embeddings

Here are some other popular models that you can use to generate text embeddings:

1. **`sentence-transformers/all-MiniLM-L6-v2`**: A small but efficient model for general-purpose sentence embeddings.
2. **`sentence-transformers/distilbert-base-nli-stsb-mean-tokens`**: A DistilBERT model fine-tuned for sentence similarity tasks.
3. **`bert-base-uncased`**: Standard BERT model, good at capturing general linguistic features. It produces token-level outputs, so a pooling step (such as mean pooling) is needed to get one vector per sentence.
4. **`roberta-base`**: A robustly pre-trained BERT variant suitable for a wide range of NLP tasks. Like BERT, it needs pooling to produce sentence embeddings.
5. **`openai/clip-vit-base-patch16`**: CLIP model that embeds text and images into a shared vector space, linking the two modalities.


In [1]:
pip install torch==2.2.1

Collecting torch==2.2.1
  Using cached torch-2.2.1-cp311-cp311-win_amd64.whl (198.6 MB)
Installing collected packages: torch
Successfully installed torch-2.2.1
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.1.2 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
import torch
# from sentence_transformers import SentenceTransformer

# Check if PyTorch is working
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.2.1+cpu
CUDA available: False


In [3]:
from sentence_transformers import SentenceTransformer


In [4]:
# Load the embedding model
model_name = "BAAI/bge-small-en-v1.5"
model = SentenceTransformer(model_name)

# Example text
texts = [
    "Machine learning is fascinating.",
    "Natural language processing enables computers to understand human language.",
    "Embeddings help in capturing the meaning of text."
]

# Convert text to embeddings
embeddings = model.encode(texts)

# Print embeddings for each text
for i, text in enumerate(texts):
    print(f"Text: {text}")
    print(f"Embedding: {embeddings[i]}\n")


Text: Machine learning is fascinating.
Embedding: [-1.06342416e-02  6.15970092e-03  2.86009908e-03 -2.16785502e-02
  5.52340597e-02 -7.06955418e-03  6.54394478e-02  3.27320956e-02
  4.33941856e-02 -2.41341498e-02  4.39849868e-03  1.36029413e-02
  6.88087486e-04  1.50783611e-02 -1.32442219e-02  2.61649881e-02
 -1.75927095e-02 -4.36784178e-02 -4.17363569e-02 -2.12334190e-02
  2.95715164e-02 -2.41762884e-02 -2.40243319e-03 -9.45207402e-02
 -3.15334089e-02  5.21748848e-02  1.30350119e-03 -9.17303115e-02
 -4.15261462e-02 -1.50486276e-01  3.78408693e-02 -2.72730794e-02
  1.30772248e-01 -2.42583454e-03 -1.70451328e-02  2.29147542e-02
  1.98831912e-02  3.30332927e-02 -1.72940735e-02  2.81937001e-03
  2.29434483e-02 -9.70904678e-02 -1.91013999e-02 -1.37887169e-02
  5.49651571e-02 -1.77916475e-02  2.49328063e-04  1.37854193e-03
 -5.52358851e-02  2.84696091e-02 -2.65588686e-02 -6.59348816e-02
  8.39856546e-03  6.56810924e-02  4.37417813e-02  3.87247279e-02
  4.84960750e-02  1.02218986e-02  2.4467

## Creating a Vector Store using FAISS

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research and optimized for fast nearest neighbor search in high-dimensional spaces. A FAISS-backed vector store is an efficient way to store and query large sets of vector embeddings, and it is widely used for similarity search and clustering.

Here’s a step-by-step explanation and example of how to create a vector store using FAISS to store embeddings and text chunks. The example uses LangChain's `FAISS` wrapper, which builds the index and adds the embeddings in a single `from_documents` call.

#### Step-by-Step Explanation

1. **Generate Embeddings**: First, generate vector embeddings for your text data using a pre-trained model (like the one from Hugging Face).

2. **Initialize FAISS Index**: Choose an appropriate FAISS index type based on your requirements (e.g., `IndexFlatL2` for exact, exhaustive L2 distance search).

3. **Add Embeddings to FAISS Index**: Add the generated embeddings to the FAISS index.

4. **Save the Index**: Optionally, save the FAISS index to disk for later use.

5. **Query the Index**: You can query the FAISS index to find the nearest neighbors of a given query embedding.
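
The steps above can be sketched without any dependencies. The class below is a hypothetical, pure-Python stand-in that mimics what an exhaustive ("flat") L2 index does conceptually: store every vector, then compare the query against all of them. It is not FAISS's implementation, and the name `TinyFlatL2Index` and the toy vectors are invented for illustration.

```python
class TinyFlatL2Index:
    """Hypothetical stand-in for an exhaustive (flat) L2 index."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vectors):
        # Step 3: store the embeddings in the index
        for vector in vectors:
            if len(vector) != self.dim:
                raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
            self.vectors.append(list(vector))

    def search(self, query, k):
        # Step 5: compute the squared L2 distance to every stored vector
        scored = [
            (sum((q - x) ** 2 for q, x in zip(query, vector)), position)
            for position, vector in enumerate(self.vectors)
        ]
        scored.sort()  # smallest distance first
        return scored[:k]


# Toy 2-dimensional "embeddings" (made-up values)
index = TinyFlatL2Index(dim=2)
index.add([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])

hits = index.search([0.9, 0.1], k=2)
for distance, position in hits:
    print(f"position={position} squared_l2={distance:.2f}")
```

An exhaustive scan like this is exact but scales linearly with the number of stored vectors, which is why FAISS also offers approximate index types for large collections.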


In [51]:
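# Import necessary libraries
# Note: newer LangChain releases expose FAISS and HuggingFaceEmbeddings from
# `langchain_community`, and Document from `langchain_core.documents`
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document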

# Sample text data
texts = [
    "Machine learning is fascinating.",
    "Artificial intelligence is the future.",
    "Natural language processing is a key technology.",
    "Deep learning models are powerful."
]


In [52]:
texts

['Machine learning is fascinating.',
 'Artificial intelligence is the future.',
 'Natural language processing is a key technology.',
 'Deep learning models are powerful.']

In [53]:
# Create document objects with page_content attributes
documents = [Document(page_content=text) for text in texts]

In [54]:
documents

[Document(page_content='Machine learning is fascinating.'),
 Document(page_content='Artificial intelligence is the future.'),
 Document(page_content='Natural language processing is a key technology.'),
 Document(page_content='Deep learning models are powerful.')]

In [61]:
# Initialize the Hugging Face embeddings model
embeddings_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5", encode_kwargs={"normalize_embeddings": True})


In [60]:
!pip install faiss-cpu





In [62]:
# Create a FAISS vector store from the documents and their embeddings
vectorstore = FAISS.from_documents(documents, embeddings_model)

In [63]:
# Save the vector store to a local folder (writes index.faiss and index.pkl inside it)
vectorstore.save_local("test_vectorstore.db")

`save_local` creates a folder named `test_vectorstore.db` that contains two files:

**index.faiss**: This file contains the FAISS index data. It stores the vector embeddings and the structure necessary for efficient similarity search. FAISS uses this file to perform nearest neighbor search operations.

**index.pkl**: This file contains the pickled document store and the mapping from positions in the FAISS index to document IDs. This is what allows search results to be mapped back to the original documents, including their contents and any associated metadata. The pickle format serializes Python objects, making it easy to save and load complex data structures. Because loading a pickle can execute arbitrary code, only load an `index.pkl` that you created or trust.
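
As a rough illustration of why a pickle file sits next to the index, the sketch below serializes a tiny document store together with a position-to-ID mapping and restores both. The variable names and ID strings are hypothetical, and the structure is simplified; it is not the exact object layout LangChain writes.

```python
import pickle

# Hypothetical stand-ins for the data a vector store keeps next to the raw vectors
docstore = {
    "doc-0": "Machine learning is fascinating.",
    "doc-1": "Artificial intelligence is the future.",
}
index_to_doc_id = {0: "doc-0", 1: "doc-1"}

# Serialize both objects into one byte string (what would be written to disk)
blob = pickle.dumps((docstore, index_to_doc_id))

# Deserialize and map index position 1 back to its original text
restored_docstore, restored_mapping = pickle.loads(blob)
print(restored_docstore[restored_mapping[1]])
```

A nearest neighbor search only returns positions and distances, so a mapping like this is what turns a position back into readable text.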


#### Let's See What Is in `test_vectorstore.db`

In [27]:
import numpy as np

In [44]:
# Import necessary libraries
# from langchain.vectorstores import FAISS
# from langchain.embeddings import HuggingFaceEmbeddings

# Define the embeddings model
embeddings_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5", encode_kwargs={"normalize_embeddings": True})


In [64]:
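# Load the previously saved vector store with the same embeddings model.
# index.pkl is a pickle file, so LangChain requires an explicit opt-in via
# allow_dangerous_deserialization=True; only do this for files you trust.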
vectorstore = FAISS.load_local("test_vectorstore.db", embeddings=embeddings_model, allow_dangerous_deserialization=True)

In [65]:
# Define a sample query text
query_text = "Tell me about artificial intelligence."

In [67]:
# Create an embedding for the query text using the same embeddings model
query_embedding = embeddings_model.embed_query(query_text)

In [68]:
query_embedding

[-0.021075986325740814,
 -0.02607446350157261,
 0.019068073481321335,
 -0.04253644123673439,
 0.0008565769530832767,
 -0.016443293541669846,
 0.06200935319066048,
 0.05786814168095589,
 0.030383119359612465,
 -0.023833412677049637,
 -0.01376871857792139,
 0.011247210204601288,
 0.006401291582733393,
 0.0384674072265625,
 0.0478692464530468,
 0.04401574283838272,
 -0.01101597212255001,
 -0.06429845094680786,
 -0.03992951288819313,
 -0.003889136714860797,
 0.028591401875019073,
 0.010802383534610271,
 -0.07747098058462143,
 -0.04576127603650093,
 -0.0926128402352333,
 0.02929188497364521,
 0.02536199241876602,
 -0.029110761359333992,
 -0.06440001726150513,
 -0.05355849489569664,
 0.02061765268445015,
 -0.010383758693933487,
 0.11772619187831879,
 0.002045287052169442,
 0.03492391109466553,
 0.04835473373532295,
 0.010746816173195839,
 0.027467498555779457,
 -0.0375484935939312,
 0.017579736188054085,
 -0.01183816883713007,
 -0.09743086993694305,
 -0.010390748269855976,
 -0.01774390228092

In [75]:
# Retrieve similar documents from the vector store
results = vectorstore.similarity_search_with_score_by_vector(
    query_embedding,
    k=2,
)

In [77]:
# Display the results
print("Query:", query_text)
print("Results:")
for result in results:
    print("Text:", result)

Query: Tell me about artificial intelligence.
Results:
Text: (Document(page_content='Artificial intelligence is the future.'), 0.42374086)
Text: (Document(page_content='Machine learning is fascinating.'), 0.4712982)
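
Each result is a `(Document, score)` tuple, and the lower score ranks first. Assuming the store uses its default flat L2 index (which reports squared L2 distances) and that the embeddings are unit-normalized as configured with `normalize_embeddings=True`, a score can be converted to cosine similarity with the identity `||a - b||^2 = 2 - 2 * cos(a, b)`. The helper name below is hypothetical.

```python
def l2_squared_to_cosine(distance):
    # Valid only for unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
    return 1.0 - distance / 2.0


# Scores copied from the output above
scored_texts = [
    ("Artificial intelligence is the future.", 0.42374086),
    ("Machine learning is fascinating.", 0.4712982),
]

for text, distance in scored_texts:
    print(f"cosine={l2_squared_to_cosine(distance):.3f}  {text}")
```

Under those assumptions, the top match has a cosine similarity of about 0.788 and the second about 0.764, which agrees with the ranking returned by the vector store.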


### Understanding Different Large Language Models (LLMs)

When working with large language models (LLMs) for tasks such as retrieval and generation, you can choose from a variety of models, each with its own features, strengths, and use cases. Below is an overview of some popular models and how they differ. Note that BERT, RoBERTa, and DistilBERT are encoder models built for understanding text rather than generating it.

#### 1. **OpenAI GPT-3.5 (and GPT-4)**
- **Model Names**: `gpt-3.5-turbo`, `gpt-4`
- **Provider**: OpenAI
- **Use Cases**: Text generation, summarization, translation, conversational agents, question answering, creative writing.
- **Strengths**: High-quality language understanding and generation, large context window, widely adopted in various applications.
- **Weaknesses**: Requires API access, can be expensive, and sensitive to prompt engineering.

#### 2. **Google BERT**
- **Model Names**: `bert-base-uncased`, `bert-large-uncased`, etc.
- **Provider**: Google
- **Use Cases**: Text classification, named entity recognition, question answering, sentiment analysis.
- **Strengths**: Strong performance on a wide range of NLP tasks, excellent at understanding the context of words in a sentence.
- **Weaknesses**: Primarily used for understanding rather than generation, may require fine-tuning for specific tasks.

#### 3. **Google T5**
- **Model Names**: `t5-small`, `t5-base`, `t5-large`, etc.
- **Provider**: Google
- **Use Cases**: Text-to-text tasks including translation, summarization, text classification.
- **Strengths**: Unified text-to-text framework, good performance across multiple NLP tasks.
- **Weaknesses**: Larger models can be resource-intensive, requires task-specific fine-tuning.

#### 4. **Facebook RoBERTa**
- **Model Names**: `roberta-base`, `roberta-large`
- **Provider**: Facebook AI
- **Use Cases**: Similar to BERT, used for text classification, sentiment analysis, and more.
- **Strengths**: Improved performance over BERT on various benchmarks, robust pre-training.
- **Weaknesses**: Similar limitations to BERT, focusing more on understanding rather than generation.

#### 5. **Hugging Face DistilBERT**
- **Model Names**: `distilbert-base-uncased`
- **Provider**: Hugging Face
- **Use Cases**: Efficient text classification, sentiment analysis, entity recognition.
- **Strengths**: Smaller and faster than BERT with minimal performance loss, suitable for deployment in resource-constrained environments.
- **Weaknesses**: May not capture context as effectively as larger models.

#### 6. **EleutherAI GPT-Neo and GPT-J**
- **Model Names**: `gpt-neo-2.7B`, `gpt-j-6B`
- **Provider**: EleutherAI
- **Use Cases**: Text generation, summarization, conversational AI, similar to OpenAI's GPT-3.
- **Strengths**: Open-source with freely available weights, performance roughly comparable to similarly sized GPT-3 variants.
- **Weaknesses**: Requires significant computational resources, less robust support compared to commercial APIs.

#### 7. **Microsoft Turing-NLG**
- **Model Names**: `turing-nlg-17B`
- **Provider**: Microsoft
- **Use Cases**: Text generation, dialogue systems, content creation.
- **Strengths**: At 17 billion parameters, it was one of the largest language models when announced in 2020; strong generative capabilities.
- **Weaknesses**: Extremely resource-intensive, and the model was not publicly released, so accessibility is limited compared to other models.
