<a href="https://colab.research.google.com/github/rihannh/Career-RAG-Chatbot/blob/main/career_rag_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Career RAG Chatbot

A RAG (Retrieval-Augmented Generation) system for analyzing job postings, built on a Groq-hosted LLM and a FAISS vector store.

## Installing Dependencies

In [None]:
!pip install groq langchain langchain-community langchain-huggingface faiss-cpu tiktoken python-dotenv




## 1. Import Libraries

In [None]:
# LangChain core
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.docstore.document import Document

# Embeddings (updated import)
from langchain_huggingface import HuggingFaceEmbeddings

# LLM clients
from groq import Groq

# Utils
import pandas as pd
import os
import re

# Google Colab specific (the try/except falls back to IS_COLAB = False elsewhere)
try:
    from google.colab import userdata
    IS_COLAB = True
except ImportError:
    IS_COLAB = False

print("All libraries imported successfully!")


All libraries imported successfully!


## 2. API Configuration

In [None]:
# Setup Groq API Key
if IS_COLAB:
    os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
else:
    # For local development, use environment variable or .env file
    from dotenv import load_dotenv
    load_dotenv()
    if not os.environ.get("GROQ_API_KEY"):
        raise ValueError("GROQ_API_KEY not found. Please set it in your environment or .env file")

client = Groq(api_key=os.environ["GROQ_API_KEY"])
print("✓ Groq client initialized successfully!")


✓ Groq client initialized successfully!


## 3. Data Loading & Cleaning

Load the job posting data from CSV and clean it.

In [None]:
def clean_company_name(name):
    """
    Strip a rating appended to the end of a company name.

    Args:
        name (str): Company name that may end with a rating (e.g., "Company 3.8")

    Returns:
        str: The cleaned company name
    """
    return re.sub(r"\s\d\.\d$", "", str(name))


# Load data
df = pd.read_csv("eda_data.csv")

# Remove unnamed column if exists
if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])

# Clean company names
df["Company Name"] = df["Company Name"].apply(clean_company_name)

print(f"✓ Data loaded: {len(df)} job postings")
print(f"✓ Columns: {list(df.columns)}")


✓ Data loaded: 742 job postings
✓ Columns: ['Job Title', 'Salary Estimate', 'Job Description', 'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors', 'hourly', 'employer_provided', 'min_salary', 'max_salary', 'avg_salary', 'company_txt', 'job_state', 'same_state', 'age', 'python_yn', 'R_yn', 'spark', 'aws', 'excel', 'job_simp', 'seniority', 'desc_len', 'num_comp']
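
As a quick check of the cleaning step, the sketch below applies the same regular expression to a few hand-written names. The sample strings are made up for illustration and are not taken from `eda_data.csv`; the sketch assumes the rating is separated from the name by a single whitespace character (a space or a newline).

```python
import re


def clean_company_name(name):
    """Strip a trailing one-decimal rating such as ' 3.8' from a company name."""
    return re.sub(r"\s\d\.\d$", "", str(name))


# Hypothetical inputs, for illustration only
samples = ["Acme Analytics\n3.8", "Example Corp 4.1", "No Rating Inc"]
cleaned = [clean_company_name(s) for s in samples]
print(cleaned)
```

Names without a trailing rating pass through unchanged, so the function is safe to apply to the whole column.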


## 4. Document Processing

Build a structured document from each job posting and clean its text.

In [None]:
def build_document(row):
    """
    Build a structured document from a DataFrame row.

    Args:
        row (pd.Series): DataFrame row holding the job posting information

    Returns:
        str: The structured document as plain text
    """
    doc = f"""
Job Title: {row['Job Title']}
Company: {row['Company Name']}
Location: {row['Location']}
Salary Estimate: {row['Salary Estimate']}
Industry: {row['Industry']}
Sector: {row['Sector']}
Founded: {row['Founded']}
Rating: {row['Rating']}

Skills:
- Python: {row['python_yn']}
- R: {row['R_yn']}
- Excel: {row['excel']}
- AWS: {row['aws']}
- Spark: {row['spark']}

Job Description:
{row['Job Description']}
"""
    return doc


def clean_text(text):
    """
    Collapse repeated newlines and redundant whitespace in a text.

    Args:
        text (str): Text to clean

    Returns:
        str: The cleaned text
    """
    text = str(text)
    text = re.sub(r"\n+", "\n", text)  # Collapse repeated newlines
    text = re.sub(r"\s+", " ", text)    # Collapse whitespace runs (newlines included) into one space
    return text.strip()


# Build documents
df["document"] = df.apply(build_document, axis=1)
df["document_clean"] = df["document"].apply(clean_text)

print(f"✓ Documents created: {len(df)}")


✓ Documents created: 742


## 5. Text Chunking

Split the documents into smaller chunks to make retrieval easier.

In [None]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=len,
)

# Create documents with metadata
documents = []
for idx, row in df.iterrows():
    doc_text = build_document(row)
    chunks = text_splitter.split_text(doc_text)

    for ch in chunks:
        documents.append({
            "text": ch,
            "metadata": {
                "job_id": int(idx),
                "job_title": str(row["Job Title"]),
                "company": str(row["Company Name"])
            }
        })

print(f"✓ Total chunks created: {len(documents)}")
print(f"✓ Sample chunk length: {len(documents[0]['text'])} characters")


✓ Total chunks created: 9045
✓ Sample chunk length: 265 characters
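
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks; the function name `sliding_window_chunks` is a hypothetical helper introduced only for this illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 1200-character input, for illustration only
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

Each window repeats the last 20 characters of the previous one, so a sentence that straddles a boundary still appears intact in at least part of one chunk.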


## 6. Vector Store Creation

Create a FAISS vector store using HuggingFace embeddings.

In [None]:
# Load embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
print("✓ Embedding model loaded!")

# Convert to LangChain Document objects
langchain_documents = []
for doc_dict in documents:
    langchain_documents.append(
        Document(
            page_content=doc_dict['text'],
            metadata=doc_dict['metadata']
        )
    )

# Create vector store
vectorstore = FAISS.from_documents(langchain_documents, embedding_model)
print("✓ FAISS vectorstore created with metadata!")

# Save vector store
VECTORSTORE_PATH = "career_faiss_index"
vectorstore.save_local(VECTORSTORE_PATH)
print(f"✓ Vectorstore saved to: {VECTORSTORE_PATH}")


✓ Embedding model loaded!
✓ FAISS vectorstore created with metadata!
✓ Vectorstore saved to: career_faiss_index
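
Conceptually, a similarity search ranks the stored vectors by their distance to the query vector. The toy sketch below does this by brute force in pure Python, assuming Euclidean (L2) distance. The 3-dimensional vectors and posting labels are hand-written for illustration; embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Hypothetical toy index: (text, embedding) pairs
toy_index = [
    ("Data Scientist posting", [0.9, 0.1, 0.0]),
    ("Data Engineer posting", [0.6, 0.4, 0.1]),
    ("Sales Manager posting", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]
nearest = top_k(toy_query, toy_index, k=2)
for distance, text in nearest:
    print(f"{distance:.3f}  {text}")
```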


## 7. RAG Functions

Functions for the retrieval and generation steps of the RAG pipeline.

In [None]:
def retrieve_docs(query, k=4):
    """
    Retrieve the documents most relevant to a query.

    Args:
        query (str): The user's query
        k (int): Number of documents to retrieve (default: 4)

    Returns:
        tuple: (context_string, metadata_list) - the joined context and the list of metadata
    """
    results = vectorstore.similarity_search(query, k=k)
    contexts = []
    metadata_list = []

    for doc in results:
        contexts.append(doc.page_content)
        metadata_list.append(doc.metadata)

    return "\n\n".join(contexts), metadata_list


def build_prompt(user_query, retrieved_context, metadata_list):
    """Assemble the LLM prompt from the user query, retrieved context, and source metadata."""
    # Format metadata for the prompt
    metadata_text = "\n".join([
        f"- Job ID {m['job_id']}, Title: {m['job_title']}, Company: {m['company']}"
        for m in metadata_list
    ])

    return f"""
You are CareerMate, an AI job analysis assistant.

You MUST follow these rules:
1. Use information from BOTH the context and the metadata.
2. If the answer is not available, say:
   "The information is not available in the job postings."
3. Do NOT add assumptions.
4. If two or more jobs are retrieved for the query,
   you MUST respond: "Multiple job postings were retrieved. Please specify which job."


-----------------------
RETRIEVED CONTEXT:
{retrieved_context}

-----------------------
RETRIEVED METADATA:
{metadata_text}

-----------------------
USER QUESTION:
{user_query}
"""


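def rag_answer(query, model="llama-3.1-8b-instant", k=4):
    """
    Answer a query with the RAG pipeline: retrieve, build the prompt, call the Groq model.

    Returns:
        tuple: (answer, metadata_list)
    """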
    context, metadata_list = retrieve_docs(query, k=k)

    prompt = build_prompt(query, context, metadata_list)

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are CareerMate, an expert job insights assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    answer = response.choices[0].message.content

    return answer, metadata_list
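
Because each posting is split into several chunks, `retrieve_docs` can in principle return more than one chunk from the same posting, and the metadata list would then name that posting twice. The helper below is a hypothetical addition, not part of the original pipeline, that keeps one metadata entry per `job_id`. The duplicated entry in the example is constructed for illustration.

```python
def unique_sources(metadata_list):
    """Drop metadata entries whose job_id was already seen, keeping first-seen order."""
    seen = set()
    unique = []
    for meta in metadata_list:
        job_id = meta.get("job_id")
        if job_id not in seen:
            seen.add(job_id)
            unique.append(meta)
    return unique


# Illustrative input with one deliberately repeated posting
example = [
    {"job_id": 56, "job_title": "Data Scientist", "company": "Netskope"},
    {"job_id": 56, "job_title": "Data Scientist", "company": "Netskope"},
    {"job_id": 451, "job_title": "Data Scientist", "company": "CareDx"},
]
deduped = unique_sources(example)
print([m["job_id"] for m in deduped])
```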



## 8. Usage Examples

Examples of using the RAG system to answer questions about job postings.


In [None]:
# Use Case 1 — Skill Extraction
query_1 = "What skills are required for a Data Scientist?"

# Use Case 2 — Job Requirements / Qualifications
query_2 = "Do any jobs require a Master’s degree?"

# Use Case 3 — Responsibilities Extraction
query_3 = "What responsibilities does a Data Scientist have?"

# Use Case 4 — Company Information
query_4 = "Which company is offering a Data Scientist position?"

# Use Case 5 — Location & Salary
query_5 = "What is the salary range for this job?"

# Use Case 6 — Filtering / Matching
query_6 = "Find jobs that require Python but not AWS."

# Use Case 7 — Comparison
query_7 = "Compare the skills required for Data Scientist and Data Analyst roles."

# Use Case 8 — Natural Language / Human-friendly Query
query_8 = "Is this a beginner-friendly job?"

# Use Case 9 — Out-of-Scope / Hallucination Test
query_9 = "Who is the CEO of the company?"


In [None]:
answer, sources = rag_answer(query_1)

print("=" * 80)
print("QUERY:", query_1)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: What skills are required for a Data Scientist?

ANSWER:
 The Data Scientist position requires the following skills based on the provided information:

1. Computer programming
2. Databases
3. Data visualization
4. Version control
5. Problem-solving skills
6. Understanding of clinical relevance

These skills are mentioned in the experience requirement section of the job posting.

SOURCES:
  1. Job ID: 56, Title: Data Scientist, Company: Netskope
  2. Job ID: 448, Title: Data Scientist, Company: Netskope
  3. Job ID: 451, Title: Data Scientist, Company: CareDx
  4. Job ID: 486, Title: Data Scientist, Company: Brillient
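
The display code in the cell above is repeated for every query in this section. One way to avoid the repetition is a small formatting helper; `format_result` below is a hypothetical name, and the answer text in the demo call is a placeholder rather than real model output.

```python
def format_result(query, answer, sources, width=80):
    """Build the query/answer/sources report as a single string."""
    bar = "=" * width
    lines = [bar, f"QUERY: {query}", bar, "", "ANSWER:", f" {answer}", "", bar, "SOURCES:"]
    for i, source in enumerate(sources, 1):
        lines.append(
            f"  {i}. Job ID: {source.get('job_id')}, "
            f"Title: {source.get('job_title')}, Company: {source.get('company')}"
        )
    lines.append(bar)
    return "\n".join(lines)


# Demo with placeholder values
demo = format_result(
    "What skills are required for a Data Scientist?",
    "(placeholder answer)",
    [{"job_id": 56, "job_title": "Data Scientist", "company": "Netskope"}],
)
print(demo)
```

With the functions defined earlier in the notebook, a cell could then read `print(format_result(query_1, *rag_answer(query_1)))`.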


In [None]:
answer, sources = rag_answer(query_2)

print("=" * 80)
print("QUERY:", query_2)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: Do any jobs require a Master’s degree?

ANSWER:
 Yes, the jobs retrieved for this query have several positions that require a Master's degree. 

Job ID 64 (Data Scientist), Job ID 625 (Data Scientist), Job ID 461 (Data Scientist) all require a Master's degree, preferably in Statistics, Computer Science, Operations Research, Engineering, Mathematics, Economics, or other quantitative fields. Additionally, a Master’s degree in business administration (MBA) or a related field is also required for Job ID 295 (Program/Data Analyst).

SOURCES:
  1. Job ID: 64, Title: Data Scientist, Company: DICK'S Sporting Goods - Corporate
  2. Job ID: 625, Title: Data Scientist, Company: DICK'S Sporting Goods - Corporate
  3. Job ID: 461, Title: Data Scientist, Company: PeoplesBank
  4. Job ID: 295, Title: Program/Data Analyst, Company: General Dynamics Information Technology


In [None]:
answer, sources = rag_answer(query_3)

print("=" * 80)
print("QUERY:", query_3)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: What responsibilities does a Data Scientist have?

ANSWER:
 The information is not available in the job postings. However, based on the provided metadata, we can infer some general job descriptions.

SOURCES:
  1. Job ID: 182, Title: VP, Data Science, Company: PennyMac
  2. Job ID: 284, Title: VP, Data Science, Company: PennyMac
  3. Job ID: 56, Title: Data Scientist, Company: Netskope
  4. Job ID: 448, Title: Data Scientist, Company: Netskope


In [None]:
answer, sources = rag_answer(query_4)

print("=" * 80)
print("QUERY:", query_4)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: Which company is offering a Data Scientist position?

ANSWER:
 Based on the metadata provided, multiple companies are offering Data Scientist positions. 

The companies with Data Scientist positions available are:
1. TRANZACT (Job ID 656)
2. Netskope (Job ID 56 and 448)
3. CareDx (Job ID 451)

To provide a more accurate answer, please specify which job you are interested in.

SOURCES:
  1. Job ID: 656, Title: Data Scientist, Company: TRANZACT
  2. Job ID: 56, Title: Data Scientist, Company: Netskope
  3. Job ID: 448, Title: Data Scientist, Company: Netskope
  4. Job ID: 451, Title: Data Scientist, Company: CareDx


In [None]:
answer, sources = rag_answer(query_5)

print("=" * 80)
print("QUERY:", query_5)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: What is the salary range for this job?

ANSWER:
 Based on the retrieved metadata, I do not see the job description for any job ID. However, I can suggest potential job IDs to retrieve job descriptions and salaries for.

To get the salary range for the specified job, you will need to provide more information. Can you please provide the job title and ID?

SOURCES:
  1. Job ID: 15, Title: Data Engineer I, Company: Audible
  2. Job ID: 67, Title: Data Scientist - Research, Company: C Space
  3. Job ID: 396, Title: Product Engineer – Spatial Data Science and Statistical Analysis, Company: Esri
  4. Job ID: 601, Title: Product Engineer – Spatial Data Science and Statistical Analysis, Company: Esri


In [None]:
answer, sources = rag_answer(query_6)

print("=" * 80)
print("QUERY:", query_6)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: Find jobs that require Python but not AWS.

ANSWER:
 Based on the provided context and metadata, I was able to identify potential jobs. However, since none of the job postings directly mention the exclusion of AWS while requiring Python, these jobs do not specifically match your requirements.

Multiple job postings (Sr. Data Engineer | Big Data SaaS Pipeline, Data Scientist, Data Engineer, Data Scientist) retrieved, but the job postings don't specify the exclusion of AWS. The metadata only provides information about the required skills for the jobs.

SOURCES:
  1. Job ID: 649, Title: Sr. Data Engineer | Big Data SaaS Pipeline, Company: Bridg
  2. Job ID: 435, Title: Data Scientist, Company: HG Insights
  3. Job ID: 267, Title: Data Engineer, Company: PennyMac
  4. Job ID: 433, Title: Data Scientist, Company: Change Healthcare
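
The answer above suggests that similarity search alone is a poor fit for boolean conditions such as "Python but not AWS". Since the dataset already has skill columns (`python_yn`, `aws`, and so on), a structured filter may serve such queries better. The sketch below assumes those columns hold 1 for "required" and 0 otherwise, which the notebook does not show explicitly; `filter_by_skills` and the toy rows are hypothetical.

```python
def filter_by_skills(rows, required=(), excluded=()):
    """Keep rows where every required flag is 1 and every excluded flag is 0."""
    return [
        row for row in rows
        if all(row.get(col) == 1 for col in required)
        and all(row.get(col) == 0 for col in excluded)
    ]


# Hypothetical rows, for illustration only
toy_rows = [
    {"Job Title": "Job A", "python_yn": 1, "aws": 0},
    {"Job Title": "Job B", "python_yn": 1, "aws": 1},
    {"Job Title": "Job C", "python_yn": 0, "aws": 0},
]
matches = filter_by_skills(toy_rows, required=["python_yn"], excluded=["aws"])
print([row["Job Title"] for row in matches])
```

On the real data the same function could be fed `df.to_dict("records")`, and the matching rows could then be passed to the LLM as context in place of the retrieved chunks.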


In [None]:
answer, sources = rag_answer(query_7)

print("=" * 80)
print("QUERY:", query_7)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: Compare the skills required for Data Scientist and Data Analyst roles.

ANSWER:
 Based on the provided context and metadata, I've identified the main skill sets required for the Data Scientist and Data Analyst roles:

**Data Scientist:**

- Bachelor’s degree from an accredited university or college in Data Analytics or Computer Science
- 2+ years’ work experience as a data analyst or a related field
- Ability to work with stakeholders to assess data output
- Ability to translate business requirements into non-technical, lay terms
- Proactive, self-motivated, able to recognize issues and resolve or escalate appropriately
- Excellent interpersonal skills, both in person and over the phone
- Previous experience working with statistical and/or predictive model development
- Experience with large-scale implementations of statistical methods to build decision support or recommender systems
- Innovative and entrepreneurial mindset
- Ability to work independently and oversee mid-level a

In [None]:
answer, sources = rag_answer(query_8)

print("=" * 80)
print("QUERY:", query_8)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: Is this a beginner-friendly job?

ANSWER:
 Multiple job postings were retrieved. Please specify which job.

SOURCES:
  1. Job ID: 215, Title: Data Scientist, Company: Spectrum Communications and Consulting
  2. Job ID: 20, Title: Data Scientist, Company: Porch
  3. Job ID: 76, Title: Customer Data Scientist/Sales Engineer (Bay, Company: h2o.ai
  4. Job ID: 146, Title: Customer Data Scientist/Sales Engineer, Company: h2o.ai


In [None]:
answer, sources = rag_answer(query_9)

print("=" * 80)
print("QUERY:", query_9)
print("=" * 80)
print("\nANSWER:\n", answer)
print("\n" + "=" * 80)
print("SOURCES:")
for i, source in enumerate(sources, 1):
    print(f"  {i}. Job ID: {source.get('job_id')}, Title: {source.get('job_title')}, Company: {source.get('company')}")
print("=" * 80)


QUERY: Who is the CEO of the company?

ANSWER:
 The information is not available in the job postings.

SOURCES:
  1. Job ID: 502, Title: Data Scientist, Company: Strategic Financial Solutions
  2. Job ID: 678, Title: Data Scientist, Company: Strategic Financial Solutions
  3. Job ID: 175, Title: Senior Data Scientist, Company: Plymouth Rock Assurance
  4. Job ID: 268, Title: Senior Data Scientist, Company: Plymouth Rock Assurance


## 9. Loading Saved Vector Store

Once the vector store has been saved, it can be reloaded without rebuilding it.

In [None]:
# Load saved vector store (uncomment if needed)
# vectorstore = FAISS.load_local(
#     VECTORSTORE_PATH,
#     embedding_model,
#     allow_dangerous_deserialization=True
# )
# print("✓ Vectorstore loaded from disk!")
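
Before calling `FAISS.load_local`, it can help to confirm that a saved index is actually on disk. The sketch below assumes the LangChain convention of writing `index.faiss` and `index.pkl` into the target folder when the default index name is used; `faiss_index_exists` is a hypothetical helper, and the demo uses a temporary folder with empty stand-in files rather than a real index.

```python
import os
import tempfile


def faiss_index_exists(folder, index_name="index"):
    """Return True if both files of a saved FAISS index are present in folder."""
    return all(
        os.path.isfile(os.path.join(folder, f"{index_name}{ext}"))
        for ext in (".faiss", ".pkl")
    )


# Demo in a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    missing = faiss_index_exists(tmp)
    for name in ("index.faiss", "index.pkl"):
        open(os.path.join(tmp, name), "wb").close()
    present = faiss_index_exists(tmp)
print(missing, present)
```

In the notebook, a check such as `faiss_index_exists(VECTORSTORE_PATH)` could decide between loading the saved index and rebuilding it from the documents.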
