# Creating Applications with LangChain

## Case Study: Resumes

Suppose we want to extract meaningful information from a collection of resumes. While resumes contain much of the same information (e.g., education, experience, skills), they can be formatted very differently. As a result, extracting structured data from these files is challenging with standard document-parsing code. What we can do instead is use an LLM to decipher this information for us.

We'll use a publicly available [set](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset?resource=download) of 2400+ anonymized PDF resumes, each labeled with the category of the job for which the person applied.

Before we begin, we will need to install the following packages:

```bash
pip install openai langchain chromadb tiktoken pypdf
```

In [25]:
import os
RESUME_ROOT_DIR = os.path.expanduser('~/Documents/resume_data/data/data/')
PERSIST_DIR = os.path.expanduser("~/Documents/resume_data/persist/")

# Read the OpenAI API key from a local file and expose it as an environment variable
path_to_file = os.path.expanduser('~/openai-key.txt')
with open(path_to_file, 'r') as f:
    os.environ['OPENAI_API_KEY'] = f.read().strip()

In [6]:
### display the image in the notebook

from IPython.display import Image, display
fig = Image(filename='images/resume.png')
display(fig)  # assigning to a variable alone does not render the image

We can load the text from a PDF document using one of several Python PDF libraries (pypdf, PyMuPDF, pdfplumber, pdfminer). LangChain wraps these libraries behind a common document-loader interface.

In [20]:
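from langchain.document_loaders import PyPDFLoader

# Load one resume as a list of LangChain documents
loader = PyPDFLoader(os.path.join(RESUME_ROOT_DIR, 'FITNESS', '10428916.pdf'))
pages = loader.load_and_split()

# Preview the first 550 characters of the first document
print(pages[0].page_content[:550])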

RECREATION & SPORTS COORDINATOR
Objective
To gain a Recreation Supervisor position so that I can provide support to professional and part time staff. I am looking for the opportunity to guide
day to day operations of high quality, community recreation facilities. I hope to provide the type of support and management conducive to a healthy
work environment so that all staff can not only complete their roles & responsibilities, but also provide a facility that runs efficiently and offers
exceptional service to members.
Qualifications
ACSM Exercise


We could submit a prompt containing all the text from the PDF as context and then ask the model a question. However, a PDF often contains more text than the model's maximum context length allows, so it cannot be sent in a single API call. The problem becomes more severe if we want to extract information from all of the PDFs in this dataset. Therefore, we need another solution.
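As a rough illustration of the length problem, the sketch below estimates whether a piece of text fits in a prompt. Everything in it is an assumption made for illustration: `CHARS_PER_TOKEN` is a crude rule of thumb for English text, `CONTEXT_LIMIT` is a hypothetical budget rather than the limit of any particular model, and the sample strings are stand-ins for real resume text. For exact counts, the `tiktoken` package installed above can tokenize the text directly.

```python
# Illustrative assumptions, not exact values for any specific model.
CHARS_PER_TOKEN = 4      # crude rule of thumb for English text
CONTEXT_LIMIT = 4096     # hypothetical prompt budget, in tokens


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(text: str, reserved_for_answer: int = 500) -> bool:
    """True if the text, plus room reserved for the reply, stays under the limit."""
    return estimate_tokens(text) + reserved_for_answer <= CONTEXT_LIMIT


short_text = "Objective: To gain a Recreation Supervisor position."
long_text = short_text * 1000  # stand-in for a long, multi-page document

print(estimate_tokens(short_text), fits_in_context(short_text))
print(estimate_tokens(long_text), fits_in_context(long_text))
```

A check like this only tells us that a document is too long. It does not tell us which part of the document to keep, which is the problem the next section addresses.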

### Vector stores

We can instead split our unstructured text into chunks, convert each chunk into an embedding vector, and store those vectors in a database. At query time, we embed the query in the same way and retrieve the chunks whose vectors are 'most similar' to the query vector (for example, by cosine similarity).
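To make the idea concrete, here is a minimal sketch of a vector store in plain Python. It is a toy under stated assumptions: the `embed` function uses simple word counts as a stand-in for a learned embedding model, the example chunks are made up, and `search` is a hypothetical helper, not part of any library.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up chunks, embedded once and kept alongside their text.
toy_chunks = [
    "education bachelor of science in kinesiology",
    "skills scheduling budgeting staff training",
    "experience recreation and sports coordinator",
]
store = [(chunk, embed(chunk)) for chunk in toy_chunks]


def search(query: str, k: int = 1) -> List[str]:
    """Embed the query the same way and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(search("highest level of education"))
```

A real embedding model places texts with similar meaning close together even when they share no words, which word counts cannot do. The store-then-rank structure is the same, though.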

In [23]:
## display image of vector stores

from IPython.display import Image, display
fig = Image(filename='images/vector_stores.jpeg')
display(fig)  # assigning to a variable alone does not render the image

### Conducting a similarity search on a single document

Let's take the document we loaded earlier and query it using a vector database. Since we are searching for a specific part of the text that matches our query, we'll want to break up the PDF into smaller chunks. A simple way to do this is to split the text on the newline character "\n" and pack the resulting pieces into chunks of limited size. We'll see in a later example why that might not be ideal in all cases, but for now, we'll go with it. There is also a parameter that allows for some overlap between chunks, so that a relevant passage is less likely to be cut in two at a chunk boundary.
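The effect of the chunk size and overlap parameters can be sketched in a few lines of plain Python. This is a simplified stand-in written for illustration, not LangChain's actual implementation. The function name `split_text` and the sample text are hypothetical.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int,
               separator: str = "\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of up to chunk_size
    characters, repeating trailing pieces (up to chunk_overlap characters) at
    the start of the next chunk. A single piece longer than chunk_size is
    kept whole rather than cut."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []

    def length(parts: List[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Carry over only the trailing pieces that fit in the overlap
            # budget and still leave room for the incoming piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "alpha\nbravo\ncharlie\ndelta\necho"
result = split_text(sample, chunk_size=13, chunk_overlap=5)
for chunk in result:
    print(repr(chunk))
```

In this sample, the line `bravo` appears at the end of one chunk and the start of the next, which is the overlap at work. The line `charlie` is too long to fit in the overlap budget, so the chunk after it starts fresh.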

In the code below, the first element of `docs` is the chunk whose embedding is most similar to the query. When we print out its content, we see that it does in fact contain the text that shows the person's education.

In [55]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter

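# Split on newlines into chunks of about 200 characters, with a 10-character overlap
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=10, separator='\n')
chunks = text_splitter.split_documents(pages)
# Embed every chunk with OpenAI embeddings and index the vectors in Chroma
db = Chroma.from_documents(chunks, OpenAIEmbeddings())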

query = "What is the highest level of education listed in this resume?"
docs = db.similarity_search(query)

print(docs[0].page_content)

It's important to note that we never sent the question to an OpenAI language model. The only OpenAI calls were to the embeddings endpoint, which converted the chunks and the query into vectors. Chroma (the third-party vector store used in this example) then computed the similarity scores locally and returned only the matching text. This can be a useful way of searching a large number of documents without sending a completion request to an LLM for each query.

Of course, we can get a more sophisticated and structured response if we instead pass the chunks retrieved from our vector database to an LLM as context for answering our query.
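One simple way to picture "context" here is a prompt that places the retrieved chunks ahead of the question. The sketch below only assembles such a prompt string and does not call any model. The helper `build_prompt`, the prompt wording, and the placeholder chunk text are all assumptions made for illustration.

```python
from typing import List


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder text standing in for chunks returned by a similarity search.
retrieved = [
    "Education (placeholder chunk text)",
    "Qualifications (placeholder chunk text)",
]
prompt = build_prompt(
    "What is the highest level of education listed in this resume?",
    retrieved,
)
print(prompt)
```

Because only the most similar chunks are included, the prompt stays short even when the source documents are long.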