Q 1)

Load the BERT Model and Tokenizer:

In [1]:
!pip install transformers torch gensim numpy scipy matplotlib



Part 1: Word Embedding Arithmetic
Task: Create 5 examples of word arithmetic similar to the "king - man + woman ≈ queen" analogy. Use words that have relevant semantic relationships.

Steps:

Load the BERT model and tokenizer.
Implement functions to get word embeddings and perform word arithmetic.
Write word_arithmetic and find_most_similar functions to create your examples.
The word arithmetic workflow takes two lists of words:
○ The first list holds the arguments to word_arithmetic, for example (paris, france, italy). Run the arithmetic and collect the return value (e.g., paris - france + italy = ?).

○ Call find_most_similar with the return value of word_arithmetic as input, along with the second list of words, such as (rome, romaine, ramania, ronnie, random), to find the word most similar to the answer.

○ Show this for 5 such sets of words.

○ Print the answer for each of the 5 test cases.
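
The steps above can be sketched on toy data before involving BERT. The example below is only an illustration under stated assumptions: the `vectors` table holds hand-made 3-dimensional values (not real embeddings), and the helper names simply mirror the ones the task asks for.

```python
import math

# Toy 3-d vectors, made up for illustration only (not BERT embeddings).
# The dimensions can be read loosely as [is_capital, is_france, is_italy].
vectors = {
    "paris":  [1.0, 1.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "italy":  [0.0, 0.0, 1.0],
    "rome":   [1.0, 0.0, 1.0],
    "milan":  [0.2, 0.0, 1.0],
    "random": [0.3, 0.3, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def word_arithmetic(word1, word2, word3):
    """Return the vector word1 - word2 + word3, computed element-wise."""
    v1, v2, v3 = vectors[word1], vectors[word2], vectors[word3]
    return [a - b + c for a, b, c in zip(v1, v2, v3)]

def find_most_similar(target_vec, word_list):
    """Return the (word, similarity) pair with the highest cosine similarity."""
    scored = [(w, cosine_similarity(target_vec, vectors[w])) for w in word_list]
    return max(scored, key=lambda pair: pair[1])

result = word_arithmetic("paris", "france", "italy")
best_word, best_score = find_most_similar(result, ["rome", "milan", "random"])
print(best_word, round(best_score, 4))
```

With these toy values the result vector coincides with the one assigned to "rome", so that candidate ranks first. Real embeddings are far noisier, which is why the candidate list matters so much in the BERT version.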

In [3]:
import torch
from transformers import AutoTokenizer, AutoModel
import gensim.downloader as api
import numpy as np
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt

# Load pre-trained BERT model and tokenizer using Auto classes
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Function to get the word embedding from BERT
def get_word_embedding(word):
    # Tokenize the input word and get the embeddings
    tokens = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**tokens)
    # Mean-pool the final hidden states over all token positions (this includes [CLS] and [SEP])
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze()
    return embedding

# Function to perform word arithmetic
def word_arithmetic(word1, word2, word3):
    vec1 = get_word_embedding(word1)
    vec2 = get_word_embedding(word2)
    vec3 = get_word_embedding(word3)

    # Perform arithmetic: word1 - word2 + word3
    result_vec = vec1 - vec2 + vec3
    return result_vec

# Function to find the most similar word
def find_most_similar(target_vec, word_list):
    similarities = []
    for word in word_list:
        word_vec = get_word_embedding(word)
        # Calculate cosine similarity
        similarity = 1 - cosine(target_vec, word_vec)
        similarities.append((word, similarity))
    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[0]  # Return the most similar word and its similarity score

# New examples of word arithmetic
examples = [
    ('driver', 'car', 'bus', ['airhostess', 'developer', 'head', 'thar', 'conductor']),
    ('moon', 'midnight', 'afternoon', ['sun', 'smallstars', 'bluesky', 'clouds', 'comet']),
    ('artist', 'arts', 'chisel', ['sculptor', 'actor', 'heavy', 'moon', 'clay']),
    ('pc', 'mouse', 'blackpaper', ['pencile', 'typographiy', 'scanner', 'book', 'bigscreen']),
    ('butterfly', 'gum', 'tea', ['bafelow', 'hourse', 'cat', 'loni', 'icecream'])
]

# Perform word arithmetic and find the most similar word for each example
for word1, word2, word3, options in examples:
    result_emb = word_arithmetic(word1, word2, word3)
    most_similar, similarity = find_most_similar(result_emb, options)
    print(f"{word1} - {word2} + {word3} is most similar to: {most_similar} (similarity: {similarity:.4f})")



driver - car + bus is most similar to: developer (similarity: 0.8167)
moon - midnight + afternoon is most similar to: sun (similarity: 0.8311)
artist - arts + chisel is most similar to: sculptor (similarity: 0.6401)
pc - mouse + blackpaper is most similar to: scanner (similarity: 0.6541)
butterfly - gum + tea is most similar to: cat (similarity: 0.7638)
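
One detail of get_word_embedding is worth noting: the tokenizer adds the special [CLS] and [SEP] tokens by default, so the mean over last_hidden_state includes them alongside the word-piece vectors. Whether that affects the rankings above was not tested here. The sketch below uses made-up 2-dimensional numbers (not real BERT outputs) to show how a mean restricted to word-piece positions can differ from a mean over every position; `tokens`, `token_vectors`, and `mean_pool` are hypothetical names for illustration.

```python
# Made-up 2-d "hidden states" for a word split into two word pieces.
tokens = ["[CLS]", "word", "##piece", "[SEP]"]
token_vectors = [
    [4.0, 0.0],  # [CLS]
    [1.0, 2.0],  # word
    [3.0, 2.0],  # ##piece
    [0.0, 0.0],  # [SEP]
]

# True where the position holds a special token rather than part of the word.
is_special = [t in ("[CLS]", "[SEP]") for t in tokens]

def mean_pool(vectors, keep):
    """Average only the vectors whose matching `keep` flag is True."""
    kept = [v for v, flag in zip(vectors, keep) if flag]
    if not kept:
        raise ValueError("no positions selected for pooling")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

mean_all = mean_pool(token_vectors, [True] * len(tokens))
mean_words = mean_pool(token_vectors, [not s for s in is_special])

print(mean_all)    # [2.0, 1.0]
print(mean_words)  # [2.0, 2.0]
```

With the Hugging Face tokenizer, a mask of this kind can be requested via return_special_tokens_mask=True; that extra key would need to be removed from the encoding before passing it to the model.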


In [4]:
!pip install -U langchain langchain-community langchain-groq groq huggingface_hub faiss-cpu sentence-transformers


Part 2: RAG System Implementation
Task: Implement a simple RAG system using LangChain, process an article of your choice, and run 5 different queries on its content.

Steps:

Choose at least 5 diverse articles on different topics of interest from the Wikipedia dump on HuggingFace (e.g., Artificial Intelligence, Machine Learning, etc.).
Use the provided code from the class to load and process each article, create embeddings, store the embeddings for all articles in a single VectorDB, and set up the RAG system.
Formulate 10 diverse queries that explore various aspects of your articles' content.
Run each query using the run_query function and record the results.
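
The retrieval part of these steps can be sketched without any external service. The example below is a simplified stand-in, not the class code: the corpus texts are made up, `embed` uses bag-of-words counts in place of a sentence-embedding model, a plain list plays the role of the vector database, and the LLM call is omitted so that only the prompt a "stuff"-style chain would assemble is shown.

```python
import math
import re
from collections import Counter

# Tiny stand-in corpus of (source article, chunk text). The texts are made up.
chunks = [
    ("cybersecurity", "Phishing tricks users into revealing passwords."),
    ("cybersecurity", "Ransomware encrypts data and demands payment."),
    ("big data", "Big data means data sets that are too large for traditional tools."),
    ("cloud services", "Cloud services deliver computing resources over the internet."),
]

def embed(text):
    """Bag-of-words counts: a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# One shared index for every article: (source, text, vector).
index = [(source, text, embed(text)) for source, text in chunks]

def similarity_search(query, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine_similarity(query_vec, entry[2]), reverse=True)
    return [(source, text) for source, text, _ in ranked[:k]]

def build_prompt(query, retrieved):
    """Concatenate all retrieved chunks into a single prompt ("stuff" strategy)."""
    context = "\n".join(f"[{source}] {text}" for source, text in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "What does ransomware do to data?"
top = similarity_search(question, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Because all chunks live in one index, a query can pull context from any article. The same property means an off-topic query still retrieves its k nearest chunks, however weakly related they are.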

In [6]:

!pip install langchain faiss-cpu transformers wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=4719ec21626f47fc3a40b21589eeeb36288c0350bca169085c13303c64772726
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [9]:
from langchain.chains.question_answering import load_qa_chain
from langchain_groq import ChatGroq
# Integrations live in langchain_community; the old langchain.* import paths are deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os

# Set the Groq API key. Never hard-code a real key in a notebook;
# supply it through the environment and keep only a placeholder here.
os.environ.setdefault("GROQ_API_KEY", "<your-groq-api-key>")

# Step 1: Choose 5 articles (article titles are just examples, adjust as needed)
articles = [
    "cybersecurity",
    "tecnical support",
    "data science",
    "big data",
    "cloud services"
]

# Step 2: Load and process each article
documents = []
for article_title in articles:
    loader = WikipediaLoader(article_title)
    article_text = loader.load()
    documents.extend(article_text)

# Step 3: Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs_chunks = text_splitter.split_documents(documents)

# Step 4: Create embeddings and store in a VectorDB
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vector_db = FAISS.from_documents(docs_chunks, embedding_model)

# Step 5: Initialize the Groq LLM
llm = ChatGroq(model_name="mixtral-8x7b-32768")  # Ensure `ChatGroq` is compatible with `langchain`

# Step 6: Load the QA chain using the appropriate LLM
qa_chain = load_qa_chain(llm, chain_type="stuff")

# Step 7: Define the query function using the chain and retriever
def run_query(query):
    docs = vector_db.similarity_search(query)
    result = qa_chain.run(input_documents=docs, question=query)
    return result

# Step 8: Run 10 diverse queries on the RAG system
queries = [
    "what are the common cyberattacks ?",
    "what is the biggest issue in cyber security ?",
    "What are the steps involved in installing the software/hardware ?",
    "What should I do if the installation fails?",
    "Does the author demonstrate expertise in the subject area?",
    "How is data science  used in predictive analytics?",
    "What role do big data play in it field?",
    "How dose cloud services work ?",
    "what is the role of data scientist in tecnical field?",
    "What are the differences between data science and data analysis?"
]

# Step 9: Run each query and record results
for i, query in enumerate(queries, 1):
    response = run_query(query)
    print(f"Query {i}: {query}")
    print(f"Response: {response}\n")



Query 1: what are the common cyberattacks ?
Response: Based on the provided context, common cyberattacks include:

1. Phishing: This is a method used to trick individuals into revealing sensitive information, such as usernames and passwords, by disguising as a trustworthy entity in electronic communication.

2. Ransomware: This is a type of malware that encrypts the victim's data and demands payment in exchange for the decryption key.

3. Water holing: This is a technique used by attackers to compromise a specific group of users by infecting websites that the group frequently visits.

4. Scanning: This is a method used by attackers to search for vulnerabilities in computer systems and networks.

5. Hacking: This is the unauthorized access to or manipulation of computer systems and networks, often for malicious purposes.

6. Exploiting vulnerabilities in digital products and Internet of Things devices: This is a method used by attackers to take advantage of security flaws in digital pro