A lightweight Gen AI prototype using LangChain, FAISS, and OpenAI for semantic search and question-answering over Emma Watson’s UN speech.

In [1]:
!pip install langchain



In [2]:
!pip install openai



In [3]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain_community)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 k

In [4]:
import os
os.environ["OPENAI_API_KEY"] = ""
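
The key above is left blank, so it has to be supplied before the OpenAI calls further down will work. One way to avoid pasting it into the notebook is sketched below, under the assumption that the key is either already set in the environment or can be typed in interactively. The helper name `resolve_api_key` and its parameters are hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key from `env`; prompt for it only when it is missing or empty.

    Hypothetical helper for illustration, not a library function.
    """
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in cell output
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is required")
    return key


# Possible usage in the notebook (prompts if the variable is not already set):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

Passing `env` and `prompt` as parameters is only a convenience so the helper can be exercised without touching the real environment.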

In [6]:
# The input document is a PDF, so install pdfplumber to extract its text
!pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.7-py3-none-any.whl (60 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)

In [7]:
# Download the transcript of Emma Watson's speech (PDF) and save its text in the working folder
import requests
import io
import pdfplumber  # install via pip install pdfplumber

url = "https://www.un.int/sites/www.un.int/files/IAPR/full-transcript-of-emma-watson.pdf"
response = requests.get(url)
if response.status_code != 200:
    raise Exception(f"Failed to download PDF (status {response.status_code})")

with pdfplumber.open(io.BytesIO(response.content)) as pdf:
    # extract_text() can return None for a page with no extractable text
    all_text = "\n\n".join((page.extract_text() or "") for page in pdf.pages)

with open("emma_speech_un_transcript.txt", "w", encoding="utf-8") as f:
    f.write(all_text)

In [8]:
# Document loader
from langchain_community.document_loaders import TextLoader
loader = TextLoader('./emma_speech_un_transcript.txt', encoding='utf-8')
documents = loader.load()

In [9]:
# Text splitter
# Split the text into chunks
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Check the total number of chunks
print(f"Total chunks: {len(docs)}")




Total chunks: 3
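
Only three chunks come back even though chunk_size is 1000. A likely reason is that CharacterTextSplitter splits on a single separator ("\n\n" by default) and then merges the pieces up to chunk_size, so a piece that contains no separator stays whole even when it is longer than the limit. The function below is a simplified pure-Python sketch of that behaviour, not LangChain's actual implementation; `split_on_separator` and the toy text are hypothetical.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedy separator-based splitting (simplified illustration only)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # An oversized first piece is kept whole: there is nowhere to cut it
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Toy input: one long "page" with no blank line, followed by two short ones
sample = "a" * 1500 + "\n\n" + "b" * 400 + "\n\n" + "c" * 400
chunks = split_on_separator(sample)
print([len(c) for c in chunks])
```

If smaller chunks are wanted, a splitter that falls back to finer separators (for example RecursiveCharacterTextSplitter) may be worth trying.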


In [58]:
# sentence_transformers is only required when using Hugging Face embeddings.
# It loads the transformer model
# and converts text into embeddings (numerical vectors).

#!pip install sentence_transformers



In [10]:
from langchain_community.embeddings import OpenAIEmbeddings

# Initialize embeddings (uses text-embedding-ada-002 by default)
embeddings = OpenAIEmbeddings()



In [11]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.3 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0.post1


In [12]:
from langchain_community.vectorstores import FAISS

# Use FAISS as the vector store
db = FAISS.from_documents(docs, embeddings)

# Look at the stored vectors in the raw form
# Number of vectors stored
print(db.index.ntotal)

# Vector for the first document
vec = db.index.reconstruct(0)  # FAISS uses integer IDs
print(vec[:10])  # Show first 10 dimensions of the vector


# Vector store data in human-readable form
print(db.index)  # This is the FAISS index itself
print(db.docstore._dict)  # Contains mapping from internal IDs to LangChain Document objects


# There is no need to embed the query manually.
# FAISS.from_documents(docs, embeddings) returns a vector store object (db) that keeps a reference to the embedding model.
# When we pass a raw query string, LangChain embeds it behind the scenes with that same embeddings object.

# Try switching between the two queries below and compare the results.

# The query below should return the first chunk
query = "what is the name of Emma's campaign?"

# This query below should return the third chunk
#query = "How many girls will be married in the next 16 years"

docs_result = db.similarity_search(query)

3
[-0.02684503 -0.02059751 -0.01423405 -0.00765161  0.01162555 -0.00914586
 -0.02280025  0.00354885 -0.03217797 -0.02416568]
<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7ed764111380> >
{'66c51dff-f6d7-43ce-9170-ed0605bbde84': Document(id='66c51dff-f6d7-43ce-9170-ed0605bbde84', metadata={'source': './emma_speech_un_transcript.txt'}, page_content="Full Transcript of Emma Watson's Speech\non Gender Equality at the UN\nEmma Invites All of Us to Fight for Gender Equality\nEmma Watson with UN Secretary General Bank Ki-moon at the launch of the HeForShe\ncampaign in New York City. Eduardo Munoz Alvarez/Stringer\nOn Saturday, September 20, British actor and Goodwill Ambassador for UN Women, Emma\nWatson, gave a smart, important, and moving speech about gender inequality and how to fight\nit. In doing so, she launched the HeForShe initiative, which aims to get men and boys to\npledge to join the feminist fight for gender equality. In the speech M
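
The index printed above is an IndexFlatL2, which compares the query vector against every stored vector by L2 distance. The sketch below illustrates that brute-force idea in pure Python. The two-dimensional toy vectors and the helper names are hypothetical, and this is not how FAISS is implemented internally.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k=1):
    """Return (index, squared distance) pairs for the k nearest stored vectors."""
    scored = sorted((l2_squared(v, query), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored[:k]]


# Toy stand-ins for chunk embeddings (real ones have far more dimensions)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]

nearest = flat_l2_search(toy_vectors, toy_query, k=2)
print(nearest)  # vector 0 is closest, then vector 2
```

A smaller distance means a closer match, which is why the first element of the similarity_search result is treated as the best hit.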

In [13]:
#print(docs[0].page_content)
print(docs_result[0].page_content)

Full Transcript of Emma Watson's Speech
on Gender Equality at the UN
Emma Invites All of Us to Fight for Gender Equality
Emma Watson with UN Secretary General Bank Ki-moon at the launch of the HeForShe
campaign in New York City. Eduardo Munoz Alvarez/Stringer
On Saturday, September 20, British actor and Goodwill Ambassador for UN Women, Emma
Watson, gave a smart, important, and moving speech about gender inequality and how to fight
it. In doing so, she launched the HeForShe initiative, which aims to get men and boys to
pledge to join the feminist fight for gender equality. In the speech Ms. Watson makes the very
important point that in order for gender equality to be achieved, harmful and destructive
stereotypes of and expectations for masculinity have got to change. Below is the full transcript
of her thirteen-minute speech.
Today we are launching a campaign called for HeForShe. I am reaching out to you because
we need your help. We want to end gender inequality, and to do this, we ne

In [14]:
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI

# using FAISS vector store as retriever
retriever = db.as_retriever()

# Creating a QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=retriever,
    return_source_documents=True  #  to see where the answer came from
)

# Run the query
# Query 1
result = qa_chain.invoke({"query": "what did Emma tell herself firmly?"})

# Query 2
#result = qa_chain.invoke({"query": "How many girls will be married in the next 16 years"})

print("Answer:", result["result"])



Answer: 
"If not me, who? If not now, when?"
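
RetrievalQA.from_chain_type uses the "stuff" chain type unless told otherwise: the retrieved chunks are concatenated into one prompt together with the question and sent to the LLM in a single call. The sketch below shows that assembly step only. The prompt wording and the function name `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy chunks standing in for the documents returned by the retriever
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("what did Emma tell herself firmly?", retrieved)
print(prompt)
```

As far as I know the retriever returns up to four documents by default, so with only three chunks in this index every chunk presumably reaches the prompt, whichever one ranks first.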


In [17]:
# This query below should return the third chunk
query = "How many girls will be married in the next 16 years"

docs_result = db.similarity_search(query)

print(docs_result[0].page_content)

Both men and women should feel free to be sensitive. Both men and women should feel free to
be strong. It is time that we all perceive gender on a spectrum, instead of two sets of opposing
ideals. If we stop defining each other by what we are not, and start defining ourselves by who
we are, we can all be freer, and this is what HeForShe is about. It’s about freedom.
I want men to take up this mantle so that their daughters, sisters, and mothers can be free
from prejudice, but also so that their sons have permission to be vulnerable and human too,
reclaim those parts of themselves they abandoned, and in doing so, be a more true and
complete version of themselves.
You might be thinking, “Who is this Harry Potter girl, and what is she doing speaking at the
UN?” And, it’s a really good question. I’ve been asking myself the same thing.
All I know is that I care about this problem, and I want to make it better. And, having seen
what I’ve seen, and given the chance, I feel it is my responsibi

In [18]:
# Query 2
result = qa_chain.invoke({"query": "How many girls will be married in the next 16 years"})

print("Answer:", result["result"])

Answer:  15.5 million
