# Retrieval-Augmented Generation (RAG) Demo Project

## Overview

This project demonstrates a **Retrieval-Augmented Generation (RAG)** pipeline that extracts **resume keywords** from a CSV file of job descriptions. The embedding-based similarity search is implemented by hand with pandas and NumPy rather than a vector database: the most relevant descriptions for a job role query are retrieved and then passed to a GPT model as context for generating targeted keywords.

## Key Features

- **CSV Parsing**: Processes CSV files containing job descriptions with a `description` column.
- **Embedding Calculation**: Generates embedding vectors for both the job role query and each job description using the OpenAI Embeddings API.
- **Similarity Scoring and Ranking**: Calculates the dot product of query and description embeddings to rank descriptions based on relevance.
- **Keyword Extraction**: Passes the top-ranked descriptions into a ChatGPT prompt to extract the most relevant keywords for a resume.
- **Customizable Prompt Engineering**: Allows customization of instructions sent to the GPT API for improved accuracy.

## Workflow

1. **Data Loading**: Loads job descriptions from a CSV file.
2. **Embedding Calculation**: Uses OpenAI's embedding models to calculate vectors for job descriptions and the input job role query.
3. **Similarity Scoring**: Computes the dot product to score and rank the descriptions based on similarity to the query.
4. **Top Descriptions**: Retrieves the top 5 most relevant job descriptions.
5. **Keyword Generation**: Sends the top descriptions and the job role query as context to the GPT API to extract targeted keywords.
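
The retrieval part of the workflow above (steps 2 to 4) can be sketched without any API calls. This is a minimal illustration that assumes tiny hand-made 3-dimensional vectors in place of real OpenAI embeddings; the `toy` DataFrame, its values, and `query_embedding` are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for real embeddings (assumption: 3 dimensions, made-up values).
toy = pd.DataFrame({
    "description": ["data scientist role", "restaurant manager", "ml engineer"],
    "embedding": [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]],
})
query_embedding = np.array([1.0, 0.0, 0.0])  # hypothetical query vector

# Score every description by its dot product with the query.
toy["similarity"] = toy["embedding"].apply(
    lambda e: float(np.dot(e, query_embedding))
)

# Keep the k most similar descriptions and join them into one context string.
top_k = toy.sort_values("similarity", ascending=False).head(2)
context = "\n".join(top_k["description"])
print(context)
```

The resulting `context` string is what would be handed to the chat model in step 5.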

## Technologies Used

- **OpenAI GPT Models** (`gpt-3.5-turbo`): For generating keyword suggestions.
- **OpenAI Embeddings API** (`text-embedding-3-small`): For creating embedding vectors for job descriptions and queries.
- **Pandas**: For processing and analyzing CSV files.
- **Python**: Primary programming language for implementation.


In [5]:
import pandas as pd
from openai import OpenAI
import os
import ast
import numpy as np
import pdb

client = OpenAI()


# Simple Pass Through Prompt


In [16]:

question = "What is linear regression?"
question

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}]
)
response



ChatCompletion(id='chatcmpl-AsYI3vjdImicozPzdLpMsg1myVLxI', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Linear regression is a type of statistical technique used to understand the relationship between two continuous variables. It assumes that there is a linear relationship between the independent variable (X) and the dependent variable (Y). The goal of linear regression is to find the equation of the straight line that best fits the data points, allowing for the prediction of the dependent variable based on the independent variable.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1737564887, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier='default', system_fingerprint=None, usage=CompletionUsage(completion_tokens=76, prompt_tokens=12, total_tokens=88, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, 

In [17]:
response.choices[0].message.content

'Linear regression is a type of statistical technique used to understand the relationship between two continuous variables. It assumes that there is a linear relationship between the independent variable (X) and the dependent variable (Y). The goal of linear regression is to find the equation of the straight line that best fits the data points, allowing for the prediction of the dependent variable based on the independent variable.'
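
The raw `ChatCompletion` object printed above is noisy. The sketch below pulls out the fields that are usually of interest. `summarize_response` is a hypothetical helper, and the `fake` object only mimics the attribute layout visible in the output above so that the helper can be tried without an API call.

```python
from types import SimpleNamespace

def summarize_response(response):
    """Return the reply text, finish reason and token usage of a chat completion."""
    choice = response.choices[0]
    return {
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        "total_tokens": response.usage.total_tokens,
    }

# Stand-in with the same attribute names as the printed ChatCompletion
# (assumption: used here only to avoid a network request).
fake = SimpleNamespace(
    choices=[SimpleNamespace(
        finish_reason="stop",
        message=SimpleNamespace(content="Linear regression is ..."),
    )],
    usage=SimpleNamespace(total_tokens=88),
)
summary = summarize_response(fake)
print(summary)
```

With a real response, `summarize_response(response)` would be called in the same way.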

# Now let's add some instructions

In [20]:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an assistant who is helping to answer questions. Please answer as if you were talking to an 8 year old in a poem."},
        {"role": "user", "content": question}
        ]
)
response

ChatCompletion(id='chatcmpl-Asy1V9nGWC8qHA7LJeepkc1iUcVI7', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Linear regression is like a line,\nThat shows us how things will combine.\nIt helps us see trends in data so clear,\nTo predict the future is why it's here.\nWith points on a graph, a line will we draw,\nPredicting the future with what we saw.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1737663805, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier='default', system_fingerprint=None, usage=CompletionUsage(completion_tokens=57, prompt_tokens=46, total_tokens=103, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

In [22]:
print(response.choices[0].message.content)

Linear regression is like a line,
That shows us how things will combine.
It helps us see trends in data so clear,
To predict the future is why it's here.
With points on a graph, a line will we draw,
Predicting the future with what we saw.


# Let's Add Retrieval-Augmented Generation

In [36]:
data = pd.read_csv("data/postings.csv")
data = data.head(1000)  # use only the first 1,000 rows for testing
data.head(1)

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0


In [37]:
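def get_embedding(text, model="text-embedding-3-small"):
    """Return the embedding vector for a single text from the OpenAI Embeddings API."""
    text = text.replace("\n", " ")  # replace newlines with spaces before embedding
    return client.embeddings.create(input=[text], model=model).data[0].embedding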


In [38]:
get_embedding(data['description'].iloc[0])

[-0.05323557183146477,
 0.07691117376089096,
 0.004437448922544718,
 0.05702035129070282,
 0.009537925012409687,
 0.0023913877084851265,
 -0.03143854811787605,
 -0.0006453293608501554,
 -0.020733417943120003,
 0.03320661932229996,
 0.011437221430242062,
 -0.01331579964607954,
 0.023565096780657768,
 -0.024255750700831413,
 0.0321292020380497,
 0.04166021943092346,
 -0.024642515927553177,
 -0.06486617773771286,
 -0.021534575149416924,
 -0.004137014504522085,
 0.040223658084869385,
 0.03444979712367058,
 -0.006167535670101643,
 0.011893053539097309,
 0.017031515017151833,
 0.015028620138764381,
 -0.0437321774661541,
 0.02864830754697323,
 0.07204896956682205,
 0.016658563166856766,
 0.0007903666119091213,
 -0.021603642031550407,
 -0.006713151931762695,
 -0.0033116834238171577,
 -0.009696775116026402,
 0.03069264069199562,
 -0.00019467795209493488,
 -0.0092616630718112,
 -0.01788792572915554,
 0.0036466503515839577,
 -0.004143921192735434,
 0.03690852224826813,
 0.04956129565834999,
 -0.0

In [39]:
%%time
data['description'].head(5).apply(get_embedding)

CPU times: total: 31.2 ms
Wall time: 1.07 s


0    [-0.05323557183146477, 0.07691117376089096, 0....
1    [-0.0542149618268013, -0.016733655706048012, 0...
2    [-0.026636719703674316, 0.032719530165195465, ...
3    [-0.01743720844388008, 0.0350717268884182, 0.0...
4    [-0.08258771896362305, 0.05970561504364014, 0....
Name: description, dtype: object

In [40]:
%%time
data['embedding'] = data['description'].apply(get_embedding)
data.to_csv("data/postings_with_embeddings.csv", index=False)
data.to_pickle("data/postings_with_embeddings.pkl")

CPU times: total: 7.3 s
Wall time: 4min 40s
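
Most of the 4 min 40 s above is spent waiting on one HTTP request per row. The Embeddings API also accepts a list of inputs, so several descriptions can be sent per request. The sketch below is one possible way to do that; `chunked`, `get_embeddings_batched` and the batch size of 100 are assumptions for illustration, and very long descriptions may still run into per-request token limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def get_embeddings_batched(client, texts, model="text-embedding-3-small",
                           batch_size=100):
    """Embed many texts with one request per batch instead of one per row."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        cleaned = [t.replace("\n", " ") for t in batch]
        result = client.embeddings.create(input=cleaned, model=model)
        # Sort by index so the output order matches the input order.
        ordered = sorted(result.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings

# Usage (needs an API key, so it is left commented out):
# data["embedding"] = get_embeddings_batched(client, data["description"].tolist())
```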


In [43]:
%%time

df = pd.read_pickle("data/postings_with_embeddings.pkl")
df.head()

CPU times: total: 62.5 ms
Wall time: 60.5 ms


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,embedding
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0,"[-0.05323557183146477, 0.07691117376089096, 0...."
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0,"[-0.05416886508464813, -0.016683384776115417, ..."
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0,"[-0.026541192084550858, 0.032740212976932526, ..."
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0,"[-0.01743720844388008, 0.0350717268884182, 0.0..."
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0,"[-0.08258771896362305, 0.05970561504364014, 0...."
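
The notebook saves both a CSV and a pickle but reloads the pickle. One likely reason is that a CSV round trip turns each embedding list into a string, which then has to be parsed again (for example with `ast.literal_eval`), while a pickle preserves the Python lists. A small self-contained demonstration, using an in-memory buffer and made-up two-element vectors instead of the real file:

```python
import ast
import io

import pandas as pd

original = pd.DataFrame({"embedding": [[0.1, 0.2], [0.3, 0.4]]})

# Round-trip through CSV text (the buffer stands in for a file on disk).
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(type(from_csv["embedding"].iloc[0]))  # a str, not a list

# Parse the string representation back into a Python list.
from_csv["embedding"] = from_csv["embedding"].apply(ast.literal_eval)
print(type(from_csv["embedding"].iloc[0]))
```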


In [45]:
question = "Data scientist having 3 years of experience in solving real world probelms and experience in GLM's?"
question_embedding = get_embedding(question)

question, question_embedding

("Data scientist having 3 years of experience in solving real world probelms and experience in GLM's?",
 [-0.006650610361248255,
  0.00927453301846981,
  0.06272364407777786,
  -0.016019077971577644,
  -0.0011428779689595103,
  -0.029934007674455643,
  -0.006359411403536797,
  0.01683318242430687,
  0.004176984075456858,
  0.02657739259302616,
  0.024260323494672775,
  -0.013802208006381989,
  -0.012875380925834179,
  0.007295631803572178,
  0.02908232994377613,
  0.03546992316842079,
  0.02494918182492256,
  0.007539863232523203,
  0.0006595032173208892,
  0.05901633948087692,
  -0.008886267431080341,
  0.06893589347600937,
  0.005823980551213026,
  -0.04496363550424576,
  0.05170191824436188,
  -0.0006078388541936874,
  0.014779133722186089,
  0.015104776248335838,
  0.026752738282084465,
  0.008416591212153435,
  0.03567031770944595,
  -0.02429789863526821,
  -0.029382921755313873,
  -0.01965123787522316,
  -0.010815070010721684,
  0.005717521067708731,
  -0.0015084423357620835,
  0

In [46]:
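def dot_similarity(description_embedding):
    # Dot product with the question embedding; a higher score means a closer match.
    return np.dot(description_embedding, question_embedding)

df['similarity'] = df['embedding'].apply(dot_similarity)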

df.head()

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,embedding,similarity
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0,"[-0.05323557183146477, 0.07691117376089096, 0....",0.224795
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0,"[-0.05416886508464813, -0.016683384776115417, ...",0.203736
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0,"[-0.026541192084550858, 0.032740212976932526, ...",0.172038
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0,"[-0.01743720844388008, 0.0350717268884182, 0.0...",0.231028
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0,"[-0.08258771896362305, 0.05970561504364014, 0....",0.1689
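
The row-by-row `.apply` above can also be written as a single matrix-vector product. The sketch below uses made-up unit-length vectors to show this, and to check that for unit-length vectors the dot product equals cosine similarity. OpenAI's documentation states that its embeddings are normalized to length 1, which is why a plain dot product is a reasonable ranking score here. The variable names are illustrative.

```python
import numpy as np

# Toy unit-length vectors (assumption: real embeddings would come from the API).
embeddings = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.6, 0.8])

# One matrix-vector product scores every row at once.
dot_scores = embeddings @ query

# Cosine similarity divides by both vector lengths;
# for unit-length vectors it matches the dot product.
cosine_scores = dot_scores / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
)

print(dot_scores)
print(np.allclose(dot_scores, cosine_scores))
```

With the notebook's data, the matrix could be built with `np.array(df['embedding'].tolist())` and multiplied by `np.array(question_embedding)`.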


In [48]:
df.sort_values('similarity', ascending=False, inplace=True)
df.head()

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,embedding,similarity
283,3742692445,ZenithMinds Inc,Sr Data Engineer with Kafka,Data Engineer with Kafka (W2 Only)💯% Remote\nM...,,,"Austin, TX",81941852.0,39.0,,...,,0,FULL_TIME,,,,78701.0,48453.0,"[-0.028565187007188797, 0.0133209228515625, 0....",0.449585
901,3884430616,EPITEC,Data Management Analyst,Data Management Analysts support Sr. Data Mana...,47.0,HOURLY,United States,25461.0,41.0,,...,,0,CONTRACT,USD,BASE_SALARY,87360.0,,,"[-0.012955163605511189, 0.05803154408931732, 0...",0.445464
663,3871631334,NLB Services,Machine Learning Engineer,Job Title: Python AI/ MLType: FulltimeLocation...,,,"Dallas, TX",490432.0,22.0,,...,,0,FULL_TIME,,,,75201.0,48113.0,"[-0.04581267759203911, 0.011896686628460884, 0...",0.431802
830,3884429682,Diverse Lynx,Cyber security /Report Developer ( W2 Role),Role: Security Analyst/Report Developer (exper...,,,United States,90396.0,4.0,,...,,0,CONTRACT,,,,,,"[-0.00922182947397232, 0.03831450641155243, 0....",0.415013
375,3803659249,Diverse Lynx,BI Reporting Lead,Experience in Business Objects & Spotfire repo...,,,"Stamford, CT",90396.0,4.0,,...,,0,FULL_TIME,,,,6901.0,9001.0,"[0.007336806505918503, 0.029142320156097412, 0...",0.402743


In [53]:
# Join the five highest-ranked descriptions into a single context string.
context = "\n".join(df['description'].iloc[:5])
context

"Data Engineer with Kafka (W2 Only)💯% Remote\nMin 10 to12+ strong development experience neededVery strong experience in Kafka and Kafka data injection Strong exp in working with API.Strong exp in Python with AWS.Experience with Informatica IICS and Snowflake. Expertise in Snowflake's cloud data platform, including data loading, transformation, and querying using Snowflake SQL.Experience with SQL-based development, optimization, and tuning for large-scale data processing.Strong understanding of dimensional modeling concepts and experience in designing and implementing data models for analytics and reporting purposes.hands-on experience in IICS or Informatica Power Center ETL development1+ years of hands-on experience in Linux and shell scripting.1+ years of experience working with git.1+ years of related industry experience in an enterprise environment.1+ years of hands-on experience in Python programming.\n\nData Management Analysts support Sr. Data Management Analysts, Data Managemen

In [54]:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an assistant who helps a job seeker find keywords in the job description that match the job seeker's experience."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"Use this information from LinkedIn job descriptions to answer the user's question: {context}. Please stick to the context and help the user find the keywords in the job description that match the job seeker's experience."}
    ]
)

In [55]:
print(response.choices[0].message.content)

Based on the job descriptions provided, here are the keywords that match the experience of a data scientist with 3 years of experience and experience in Generalized Linear Models (GLMs):

1. Data scientist
2. Python
3. Experience with API
4. SQL
5. Data analysis
6. Data modeling
7. Data governance
8. Data quality improvement
9. Data architecture
10. Machine learning
11. Neural networks
12. Natural Language Processing (NLP)
13. TensorFlow
14. PyTorch
15. Cloud environment (Google Cloud, AWS, Azure)
16. Data validation
17. Data visualization
18. Experience in business analytics or systems analysis
19. Statistical skills
20. Experience in machine learning techniques

These keywords align with the experience of a data scientist with 3 years of experience and knowledge in GLMs. They can be used to tailor the resume or cover letter when applying for relevant job positions.
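
In cell 54 the retrieved context is sent with the `assistant` role, which presents it to the model as something the model itself said earlier. A common alternative is to place the retrieved text inside the user turn, clearly delimited. The sketch below shows that layout; `build_messages`, the `---` delimiter and the exact wording are assumptions, and it is not claimed that this gives better keywords on this dataset.

```python
def build_messages(question, context):
    """Build a chat message list with the retrieved context inside the user turn."""
    system = (
        "You help a job seeker find keywords in job descriptions "
        "that match the job seeker's experience. Use only the provided context."
    )
    user = (
        f"Job seeker profile: {question}\n\n"
        f"Job descriptions:\n---\n{context}\n---"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages("data engineer", "Strong experience in Kafka and Python.")
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`.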


# Modularizing

In [58]:
# Wrap the steps above in a single function.
# The function takes only the question as input and performs all of the tasks above.

def get_job_key_words(question):
    """Return keywords for `question`, using the global `df` and `client`."""
    question_embedding = get_embedding(question)
    # Dot product of each description embedding with the question embedding.
    similarity_series = df['embedding'].apply(lambda x: np.dot(x, question_embedding))

    top_five = similarity_series.sort_values(ascending=False).head(5)

    context = "\n".join(df.loc[top_five.index, 'description'].values)

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an assistant who helps a job seeker find keywords in the job description that match the job seeker's experience."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": f"Use this information from LinkedIn job descriptions to answer the user's question: {context}. Please stick to the context and help the user find the keywords in the job description that match the job seeker's experience."}
        ]
    )
    return response.choices[0].message.content
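
The function joins five full descriptions into one context string, which can become long for verbose postings. One possible safeguard is to cap the context size before sending it. In the sketch below, `build_context` and the 6000-character default are hypothetical, and a character count is only a rough proxy for the model's token limit.

```python
def build_context(descriptions, max_chars=6000, separator="\n\n"):
    """Join descriptions in order, stopping before the total exceeds max_chars."""
    parts, used = [], 0
    for text in descriptions:
        # Count the separator only when something precedes this text.
        cost = len(text) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(text)
        used += cost
    return separator.join(parts)

demo = build_context(["a" * 10, "b" * 10, "c" * 10], max_chars=25)
print(len(demo))
```

Because the descriptions arrive sorted by similarity, the cap drops the least relevant ones first.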

In [63]:
print(get_job_key_words('data analyst with 3 yrs of experience in Insurnace industry'))

Some of the key keywords in the job descriptions that match the job seeker's experience as a data analyst in the insurance industry with 3 years of experience are:

1. Data analysis
2. Insurance industry
3. Data validation
4. Cyber security
5. Data protection
6. Business intelligence tools (Power BI, Tableau)
7. Data visualization
8. Metrics and reporting
9. SQL
10. Data management
11. Data governance
12. Data quality improvement
13. Analytical skills
14. Reporting
15. Insurance products
16. Financial analysis
17. Business process documentation
18. Agile/Waterfall methodologies
19. Healthcare experience
20. Data governance experience

These keywords align with the job seeker's experience in the insurance industry and data analysis, making them relevant for identifying potential job opportunities that match their skill set.


In [64]:
print(get_job_key_words('data engineer '))

Based on the information provided in the job descriptions, the keywords that match the job seeker's experience as a Data Engineer are:

1. Data Engineering
2. Data Modeling
3. ETL (Extract Transform Load)
4. Data Warehousing
5. Data Analytics
6. Azure Data Factory
7. Azure Databricks
8. Azure SQL
9. Snowflake
10. Spark
11. SQL
12. Python
13. AWS
14. Informatica IICS
15. Linux
16. Shell scripting

These keywords align well with the job seeker's experience as a Data Engineer. Good luck with your job search!


In [65]:
print(get_job_key_words('SOftware engineer'))

Based on the job descriptions provided, the keywords that match the job seeker's experience as a Software Engineer are:

- Software engineer
- Software design engineer
- Software development
- Python
- Data manipulation
- RESTful APIs
- Cloud platforms (AWS/Azure/Databricks/Snowflake/GCP)
- Docker/Kubernetes
- CI/CD Pipelines
- Jira
- Git
- Data science
- Machine learning (AI/ML)
- SQL
- Object-oriented programming
- Agile development methodologies

These keywords align with the job seeker's experience and qualifications as a Software Engineer.


In [66]:
print(get_job_key_words('Cyber security'))

Based on the job descriptions provided, here are some keywords related to cybersecurity that match the job seeker's experience:

1. Cybersecurity
2. IT Cybersecurity
3. Security Controls
4. Incident Response
5. Information Security
6. Security Operations
7. Network Security
8. Vulnerability Management
9. Threat Intelligence
10. Security Incident and Investigation
11. Security Orchestration, Automation, and Response (SOAR)
12. Compliance Management
13. Penetration Testing
14. Risk Assessment
15. Endpoint Protection
16. SIEM (Security Information and Event Management)
17. Forensic Investigation
18. Data Protection
19. Security Architecture
20. Security Policies

These keywords can help the job seeker identify relevant job opportunities and tailor their resume to highlight their experience in cybersecurity.


In [67]:
print(get_job_key_words('Business Analyst '))

Based on your experience, some keywords from the job descriptions that match your skills include:
- Business Analyst
- Technical specifications
- Analytical skills
- Business requirements
- System architecture
- User stories
- Data validation
- Business process improvements
- T-SQL
- Data management
- Data governance
- Data quality improvement
- SQL skills
- Business process documentation
- Requirements documentation/gathering
- Agile/Waterfall
- SDLC
- UAT experience
- Healthcare experience

These keywords align with your experience and expertise as a Business Analyst and Data Management Analyst. Good luck with your job search!


In [68]:
print(get_job_key_words('Full Stack developer'))

Based on the job descriptions provided, here are the keywords that match the experience of a Full Stack Developer job seeker:

1. Full Stack Development
2. JavaScript
3. Node.js
4. React
5. Angular
6. RESTful services
7. AWS Cloud
8. Code reviews
9. Testing
10. Debugging
11. React, JavaScript, TypeScript
12. Writing and maintaining code
13. Frontend and backend development
14. Test Driven Development
15. React components
16. API integration
17. Git
18. MySQL
19. Docker
20. AWS Lambda
21. Java
22. Spring Boot
23. Microservices
24. Kafka
25. .NET Core
26. C#
27. VueJS
28. Vite
29. Cloud services (Azure)
30. Microservices architectures
31. Containerization technologies (Docker, Kubernetes)

These keywords align with the skills and experience typically required for a Full Stack Developer role.
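
The replies above mix numbered lists, dash bullets and free-text sentences. If the keywords are needed as a Python list, for example to compare them against a resume, the list items can be extracted with a small parser. `parse_keywords` is a hypothetical helper that assumes each keyword sits on its own line starting with `1.`-style numbering or a `-` bullet, as in the outputs shown.

```python
import re

def parse_keywords(reply):
    """Extract items from lines that start with '1.' style numbers or '-' bullets."""
    pattern = re.compile(r"^\s*(?:\d+\.|-)\s+(.*\S)\s*$")
    keywords = []
    for line in reply.splitlines():
        match = pattern.match(line)
        if match:
            keywords.append(match.group(1))
    return keywords

sample = "Here are the keywords:\n\n1. Python\n2. SQL\n- Git\n\nGood luck!"
print(parse_keywords(sample))
```

Introductory and closing sentences are skipped because they do not start with a number or a bullet.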
