## Spring 2024 Info Session LangChain

**Transcribing the Info Session MP4 File**

In [4]:
!pip install openai-whisper

Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
     |████████████████████████████████| 800 kB 6.7 MB/s eta 0:00:01
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (PEP 517) ... done
  Created wheel for openai-whisper: filename=openai_whisper-20240930-py3-none-any.whl size=803319 sha256=efe6068d6d8bc66cdcac460ba519ddeb420c773cebbd9d4e272cc4da4990932d
  Stored in directory: /Users/justingong/Library/Caches/pip/wheels/58/9f/3f/657caca5c67b43cb90d168c2061936f3255bc28fef73b752ea
Successfully built openai-whisper
Installing collected packages: openai-whisper
Successfully installed openai-whisper-20240930


In [6]:
!pip install ffmpeg-python

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [4]:
from whisper import load_model

# Load Whisper model
model = load_model("base")



In [12]:
result = model.transcribe("spring2024_info_session.mp4")



FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
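
The `FileNotFoundError` above means the `ffmpeg` executable is not on the system `PATH`. Whisper calls the `ffmpeg` command-line tool to decode the audio track, and the `ffmpeg-python` package installed earlier is only a Python wrapper that does not ship the binary. A minimal sketch of a pre-flight check follows; the helper name `ffmpeg_available` is hypothetical, and the install hint assumes macOS with Homebrew.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the system PATH."""
    return shutil.which(executable) is not None


if not ffmpeg_available():
    # Assumption: macOS with Homebrew; use your platform's package manager otherwise.
    print("ffmpeg not found on PATH. Install it (e.g. `brew install ffmpeg`) and restart the kernel.")
```

Once the check passes, re-running the `model.transcribe(...)` cell should get past this particular error.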

In [None]:
with open("spring2024_info_session.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

**Cleaning Transcription and Converting into PDF Format**

In [5]:
from fpdf import FPDF
import re

with open("spring2024_info_session.txt", "r", encoding="utf-8") as file:
    text = file.read()

cleaned_text = re.sub(r'\b(uh|um)\b', '', text, flags=re.IGNORECASE)

pdf = FPDF()
pdf.set_auto_page_break(auto=True, margin=15)
pdf.add_page()
pdf.set_font("Arial", size=12)
pdf.multi_cell(0, 10, cleaned_text)
pdf.output("spring2024_info_session_transcript.pdf")

''
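
The `re.sub` call above deletes the filler words but leaves behind the spaces and commas that surrounded them. A minimal sketch of a slightly more thorough cleaner is shown here; `remove_fillers` is a hypothetical helper, not part of the original notebook, and it keeps newlines intact because the splitter later relies on them.

```python
import re


def remove_fillers(text: str) -> str:
    """Drop standalone 'uh'/'um' and tidy the whitespace they leave behind."""
    # Remove the filler word plus an optional trailing comma ("um," -> "").
    without_fillers = re.sub(r"\b(?:uh|um)\b,?", "", text, flags=re.IGNORECASE)
    # Collapse runs of spaces or tabs, but never touch newlines.
    collapsed = re.sub(r"[ \t]{2,}", " ", without_fillers)
    # Remove any space stranded in front of punctuation.
    collapsed = re.sub(r" +([,.!?])", r"\1", collapsed)
    # Strip leading/trailing spaces on each line while preserving line breaks.
    return "\n".join(line.strip() for line in collapsed.split("\n"))


print(remove_fillers("So, um, we are uh ready."))
```

The word-boundary anchors (`\b`) matter: without them, words such as "drums" would lose their "um".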

**Loading in Transcription in LangChain**

In [6]:
import openai
openai.api_key = 'REDACTED'

In [7]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("spring2024_info_session_transcript.pdf")
pages = loader.load()

In [8]:
# Inspecting Pages Information
pages[0]

Document(metadata={'source': 'spring2024_info_session_transcript.pdf', 'page': 0}, page_content="Okay, awesome.\nAnd yeah, let's just jump right into it.\nSo my name is Vade, I'm the president of the Data Science Union.\nAnd during this info session, we're just going to be taking you through a couple of things.\nFirst, you're going to meet us, the board.\nAnd we'll give you a little bit of insight into, you know,\nwhat major we are, like what we do for DSU.\nWe'll tell you a little bit about our organization and our core pillars and mission.\nAnd then we'll get into a little bit of what we offer as well as our recruitment process.\nAnd then at 7 p.m., we will have time to answer any and all questions.\nSo I'm going to hand it off now.\nTwo.\nOh, there's me.\nI'm the president.\nAnd then next we have Justin.\nHi, everyone.\nMy name is Justin.\nI'm the internal vice president here at DSU.\nAnd I am a data theory major.\nHi, everyone.\nMy name is ball.\nI'm a third year data theory major.

In [9]:
pages[1]

Document(metadata={'source': 'spring2024_info_session_transcript.pdf', 'page': 1}, page_content="Hi, my name is Caleb.\nAnd I'm a second year data theory major also.\nAnd I'm finance director.\nHi, I'm Sonia.\nI'm also a second year data theory major.\nAnd I'm the director of marketing.\nHi, I'm Riley.\nI'm a second year math of comp major.\nAnd I'm the client relations director.\nHi, everyone.\nI'm Danelle.\nI'm a third year data theory major.\nAnd I am the director of professional development.\nHi, I'm Hannah.\nI'm a second year stats and data science major.\nAnd I'm the project director.\nHey guys, I'm Charlie.\nI'm a second year stats DS major.\nAnd I'm director of membership.\nHey, everyone.\nLet's get to see all your guys' faces.\nI'm Daniel.\nI'm a third year data theory major.\nAnd I'm running some fun research projects here this year.\nHi, I'm Maddie.\nI'm a third year stats major.\nAnd I'm one of the executive advisors.")

In [10]:
print(len(pages))

15


In [11]:
pages[0].page_content[:500]

"Okay, awesome.\nAnd yeah, let's just jump right into it.\nSo my name is Vade, I'm the president of the Data Science Union.\nAnd during this info session, we're just going to be taking you through a couple of things.\nFirst, you're going to meet us, the board.\nAnd we'll give you a little bit of insight into, you know,\nwhat major we are, like what we do for DSU.\nWe'll tell you a little bit about our organization and our core pillars and mission.\nAnd then we'll get into a little bit of what we offer as"

In [12]:
pages[0].metadata

{'source': 'spring2024_info_session_transcript.pdf', 'page': 0}

**Document Splitting for Meaningful Chunks**

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [14]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=75,
    length_function=len,
    separators=["\n", ". ", " "]
)

In [15]:
docs = r_splitter.split_documents(pages)

In [16]:
len(docs)

55

In [17]:
len(pages)

15
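
The 15 pages become 55 chunks because each page is cut into pieces of at most `chunk_size` characters, with some trailing text repeated at the start of the next piece. The following is a simplified sketch of that idea using only the `"\n"` separator. It is not LangChain's implementation: the function `split_lines` is hypothetical, it does not fall back to the `". "` and `" "` separators, and a single line longer than `chunk_size` is kept whole.

```python
from typing import List


def split_lines(text: str, chunk_size: int = 600, chunk_overlap: int = 75) -> List[str]:
    """Greedily pack lines into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current: List[str] = []  # lines of the chunk currently being built

    def joined_len(lines: List[str]) -> int:
        return len("\n".join(lines))

    for line in text.split("\n"):
        if current and joined_len(current + [line]) > chunk_size:
            chunks.append("\n".join(current))
            # Keep trailing lines as overlap: drop from the front until the
            # carried text fits the overlap budget and leaves room for `line`.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [line]) > chunk_size
            ):
                current.pop(0)
        current.append(line)

    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "\n".join(f"sentence {i}" for i in range(20))
sample_chunks = split_lines(sample, chunk_size=50, chunk_overlap=15)
print(len(sample_chunks), "chunks")
```

With these toy settings, the last line of each chunk reappears as the first line of the next one, which is the role `chunk_overlap=75` plays in the configuration above.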

In [18]:
docs[0]

Document(metadata={'source': 'spring2024_info_session_transcript.pdf', 'page': 0}, page_content="Okay, awesome.\nAnd yeah, let's just jump right into it.\nSo my name is Vade, I'm the president of the Data Science Union.\nAnd during this info session, we're just going to be taking you through a couple of things.\nFirst, you're going to meet us, the board.\nAnd we'll give you a little bit of insight into, you know,\nwhat major we are, like what we do for DSU.\nWe'll tell you a little bit about our organization and our core pillars and mission.\nAnd then we'll get into a little bit of what we offer as well as our recruitment process.")

In [19]:
docs[1]

Document(metadata={'source': 'spring2024_info_session_transcript.pdf', 'page': 0}, page_content="And then at 7 p.m., we will have time to answer any and all questions.\nSo I'm going to hand it off now.\nTwo.\nOh, there's me.\nI'm the president.\nAnd then next we have Justin.\nHi, everyone.\nMy name is Justin.\nI'm the internal vice president here at DSU.\nAnd I am a data theory major.\nHi, everyone.\nMy name is ball.\nI'm a third year data theory major.\nAnd I'm the external vice president of DC.\nHi, everyone.\nI'm Jacob.\nI'm one of the curriculum directors for DSU.\nAnd I'm a second year data theory major.")

In [20]:
docs[2]

Document(metadata={'source': 'spring2024_info_session_transcript.pdf', 'page': 1}, page_content="Hi, my name is Caleb.\nAnd I'm a second year data theory major also.\nAnd I'm finance director.\nHi, I'm Sonia.\nI'm also a second year data theory major.\nAnd I'm the director of marketing.\nHi, I'm Riley.\nI'm a second year math of comp major.\nAnd I'm the client relations director.\nHi, everyone.\nI'm Danelle.\nI'm a third year data theory major.\nAnd I am the director of professional development.\nHi, I'm Hannah.\nI'm a second year stats and data science major.\nAnd I'm the project director.\nHey guys, I'm Charlie.\nI'm a second year stats DS major.\nAnd I'm director of membership.\nHey, everyone.")

In [21]:
docs[52]

Document(metadata={'source': 'spring2024_info_session_transcript.pdf', 'page': 13}, page_content="technical section. We won't have you code anything. It's more just to test like your critical thinking\nskills and see how you solve problems, but we really recommend just like talking through your entire\nthought process.\nAnd if you want to use any stats knowledge that you do have, then feel free to do so. But yeah, but\ndon't be nervous. I know it's easier said than done, but we are rooting for you on the other side and\nwe really just want to get to know you. So yeah.")

## Embeddings and Vectorstore

In [23]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

In [24]:
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en")

In [25]:
persist_directory = 'info_session_vectorstore'

In [28]:
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedding_model,
    persist_directory=persist_directory
)

In [29]:
vectordb.persist() 

In [30]:
question = "How do I succeed in coffee chats?"

In [31]:
results = vectordb.similarity_search(question,k=3)

In [32]:
results[0]

Document(metadata={'page': 13, 'source': 'spring2024_info_session_transcript.pdf'}, page_content="place on Saturday morning and afternoon.\nSo coffee chats or more of a casual conversation just to get to know each of you and also to see\nhow you work with other people.\nAnd I know somebody it touched on a lot of our advice, but yeah, our advice this round and along\nwith honestly all the other rounds is just to be yourself. I know it's cliche, but we really do just want to\nget to know you. So if it helps you to practice talking about your past experiences, then we definitely\nrecommend doing that.")

In [33]:
results[1]

Document(metadata={'page': 13, 'source': 'spring2024_info_session_transcript.pdf'}, page_content="recommend doing that.\nAnd we also like to see when you can build off other people's ideas so that your collapse so we can\nsee that you can collaborate and we can see how you work and interact with others.\nSo that's coffee chats and then next slide.\nYeah, then after coffee chats will invite some of you to join us for individual interviews and for this\nyou can also just casually because again our goal is just to get to know you individually and to\nunderstand your thought process.\nSo in the interviews, there will be both a behavioral and a technical section and don't worry for the")
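
Conceptually, `similarity_search` embeds the question with the same model used for the chunks and returns the `k` chunks whose vectors are closest to it. The sketch below illustrates that ranking step on toy two-dimensional vectors. Cosine similarity is an assumption chosen for illustration; the metric Chroma actually applies depends on how the collection is configured, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k most similar document vectors, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for chunk embeddings and a question embedding.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

Real embeddings from `BAAI/bge-large-en` have far more dimensions, but the ranking logic is the same.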

## Loading in Vector Database for Question Answering

In [7]:
import openai

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
persist_directory = 'info_session_vectorstore'
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en")
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [2]:
print(vectordb._collection.count())

55


In [5]:
question = "How do I succeed in coffee chats?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3

In [4]:
docs

[Document(metadata={'page': 13, 'source': 'spring2024_info_session_transcript.pdf'}, page_content="place on Saturday morning and afternoon.\nSo coffee chats or more of a casual conversation just to get to know each of you and also to see\nhow you work with other people.\nAnd I know somebody it touched on a lot of our advice, but yeah, our advice this round and along\nwith honestly all the other rounds is just to be yourself. I know it's cliche, but we really do just want to\nget to know you. So if it helps you to practice talking about your past experiences, then we definitely\nrecommend doing that."),
 Document(metadata={'page': 10, 'source': 'spring2024_info_session_transcript.pdf'}, page_content="And whether it's through our takeovers or just like word of mouth, you've probably just heard how\nimportant community is for us.\nIt's something that is really important to get our members tight knit and just to get to know one\nanother is through programs like we have like a big little pr

In [8]:
openai.api_key = 'REDACTED'

In [9]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
llm_name = "gpt-3.5-turbo"  # A more advanced model could be used, but this one should be sufficient for our purposes
llm = ChatOpenAI(model_name=llm_name, temperature=0, openai_api_key = openai.api_key)

In [11]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, say I don't have enough information to answer the question. 
Always say "thanks for asking!" at the end of the answer. 
Context: {context}
Question: {question}"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
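
To see roughly what the model receives, the sketch below fills the same template with plain `str.format`. It assumes the retrieved chunks are concatenated into `{context}` with blank lines between them, which is how the default "stuff" chain type is understood to behave; `build_prompt` and the `separator` default are illustrative, not LangChain API.

```python
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, say I don't have enough information to answer the question.
Always say "thanks for asking!" at the end of the answer.
Context: {context}
Question: {question}"""


def build_prompt(chunks, question, separator="\n\n"):
    """Join retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


example_prompt = build_prompt(
    ["Coffee chats are a casual conversation.", "Our advice is to be yourself."],
    "How do I succeed in coffee chats?",
)
print(example_prompt)
```

Printing the filled prompt like this is a quick way to check that the instructions, context, and question appear in the intended order.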

In [12]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [13]:
question = "How can I succeed in coffee chats?"
result = qa_chain({"query": question})

In [14]:
result["result"]

"To succeed in coffee chats, it is important to be yourself, practice talking about your past experiences, and be able to build off of other people's ideas to show collaboration skills. Additionally, be prepared to discuss your experiences and tie them into tangible skills that you can bring to the club. Showing eagerness to learn and engage with the social aspects of the club is also important. Thanks for asking!"

In [18]:
def ask_question(question): 
    result = qa_chain({"query": question})
    print(result['result'])
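
Because the chain was built with `return_source_documents=True`, each result also carries the retrieved chunks under `result["source_documents"]`. A minimal sketch of turning those into a list of cited pages follows. `cited_pages` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for LangChain `Document` objects so the example runs on its own.

```python
from types import SimpleNamespace


def cited_pages(source_documents):
    """Return sorted, de-duplicated 1-based page numbers from retrieved chunks."""
    # The loader stores 0-based page indices in metadata["page"], so add 1.
    return sorted({doc.metadata["page"] + 1 for doc in source_documents})


# Stand-ins mimicking the metadata shown in the outputs above.
fake_sources = [
    SimpleNamespace(metadata={"page": 13, "source": "spring2024_info_session_transcript.pdf"}),
    SimpleNamespace(metadata={"page": 13, "source": "spring2024_info_session_transcript.pdf"}),
    SimpleNamespace(metadata={"page": 10, "source": "spring2024_info_session_transcript.pdf"}),
]
print("Pages:", cited_pages(fake_sources))
```

Inside `ask_question`, the same helper could be called as `cited_pages(result["source_documents"])` to show which transcript pages an answer drew on.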

In [19]:
ask_question("What is the application timeline for this quarter?")

The application deadline is this Thursday at midnight. Some applicants will be contacted on Friday for coffee chats on Saturday. Interviews will take place from Monday to Tuesday, with emails sent out to successful applicants by Tuesday night. Thanks for asking!


In [20]:
ask_question("Can you give me a brief overview of the resume screener project by Justin?")

Justin led a project last quarter called the automated resume screener. His team used natural language processing to analyze thousands of resumes and extract important features. They built a clustering model using unsupervised learning to filter the resumes and selected one cluster based on performance. The project also included visualizations of the data, such as separating words in accounting resumes. It was a great learning experience for Justin and his team members. Thanks for asking!


In [21]:
ask_question("Can you tell me what the second quarter curriculum is?")

The second quarter curriculum is a lot more independent, where you get to pick and work on a project that interests you. You will get to do the complete project from data collection to data cleaning to modeling. You will also be paired up with mentors from DSU to guide you. Thanks for asking!


In [22]:
ask_question("What are DSU's four core pillars?")

The four core pillars of DSU are a proprietary curriculum, workshops to prepare for projects, internal and external projects, and a community of data scientists. Thanks for asking!
