# Job Recommendation System with Graph Database

This Jupyter Notebook demonstrates how to build a job recommendation system using a graph database (ArangoDB). The process includes:
1. Installing necessary libraries.
2. Loading datasets and generating a directed graph using NetworkX.
3. Storing the graph in ArangoDB.
4. Retrieving the graph from ArangoDB for further use.
5. Implementing a chatbot to query the graph and recommend jobs.

Let's get started!

## Step 1: Install Required Libraries

We need to install the following Python libraries:
- `pandas`: For data manipulation.
- `networkx`: For creating and managing graphs.
- `python-arango`: Python driver for ArangoDB (imported as `arango`).
- `google-genai`: For interacting with Google's generative AI.
- `fuzzywuzzy`: For fuzzy string matching.
- `python-dotenv`: For loading environment variables.
- `langchain-community`: For graph-based QA chains.
- `langchain-google-genai`: For Google AI integration with LangChain.
- `langchain-openai`: For OpenAI integration with LangChain (optional).

Run the following command in a code cell to install them.

In [None]:
%pip install pandas networkx python-arango google-genai fuzzywuzzy python-dotenv langchain-community langchain-google-genai langchain-openai

## Step 2: Load Datasets and Generate Graph

In this section, we'll load the job dataset from a CSV file and build a directed graph with NetworkX. Each row describes one job; its `soft_skill`, `hard_skill`, `interest`, and `education` columns hold pipe-separated values, which become nodes linked to the job node by directed edges.

We'll also define a helper function `clean_key` to sanitize keys for node identifiers.

In [None]:
import pandas as pd
import networkx as nx
import os

# Load CSV datasets
df = pd.read_csv("dataset/sample/job.csv")

# Create a directed graph
G = nx.DiGraph()

def clean_key(key):
    key_str = str(key)
    return ''.join(c if c.isalnum() else '_' for c in key_str).strip('_').lower()

# Add Jobs and attributes
for index, row in df.iterrows():
    job_key = "job_"+clean_key(row["job_title"])
    G.add_node(job_key, type="Job", min_salary=row["min_salary"], max_salary=row["max_salary"],
               min_exp=row["min_exp"], max_exp=row["max_exp"], level=row["level"],
               category=row["job_category"], job_description=row["job_description"],
               name=row["job_title"])
    
    # Soft skills
    soft_skills = set(row["soft_skill"].split("|"))
    for skill in soft_skills:
        skill_key = "soft_"+clean_key(skill)
        G.add_node(skill_key, type="soft_skill", name=skill)
        G.add_edge(skill_key, job_key, relation="soft_skill_leads_to")

    # Hard skills
    hard_skills = set(row["hard_skill"].split("|"))
    for skill in hard_skills:
        skill_key = "hard_"+clean_key(skill)
        G.add_node(skill_key, type="hard_skill", name=skill)
        G.add_edge(skill_key, job_key, relation="hard_skill_leads_to")

    # Interest
    interests = set(row["interest"].split("|"))
    for interest in interests:
        interest_key = "int_"+clean_key(interest)
        G.add_node(interest_key, type="interest", name=interest)
        G.add_edge(interest_key, job_key, relation="supports")

    # Education
    educations = set(row["education"].split("|"))
    for edu in educations:
        edu_key = "edu_"+clean_key(edu)
        G.add_node(edu_key, type="education", name=edu)
        G.add_edge(edu_key, job_key, relation="enables_to")

print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")

Number of nodes: 2014
Number of edges: 11708

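To make the behaviour of `clean_key` concrete, the sketch below repeats the helper and applies it to a few names. The sample names are illustrative and not taken from the dataset. Because every non-alphanumeric character becomes `_` and leading or trailing underscores are stripped, distinct names such as `C`, `C++`, and `C#` collapse to the same key and would share one node in the graph.

```python
def clean_key(key):
    """Same helper as above: keep letters and digits, turn everything else into '_'."""
    key_str = str(key)
    return ''.join(c if c.isalnum() else '_' for c in key_str).strip('_').lower()

# Illustrative names (assumed, not taken from the dataset)
samples = ["Machine Learning", "Node.js", "C", "C++", "C#"]
keys = {name: "hard_" + clean_key(name) for name in samples}
for name, key in keys.items():
    print(f"{name!r} -> {key!r}")

# Group names by key to spot names that would be merged into a single node
collisions = {}
for name, key in keys.items():
    collisions.setdefault(key, []).append(name)
shared = {k: v for k, v in collisions.items() if len(v) > 1}
print("Shared keys:", shared)
```

If such collisions matter for your data, one option is to check for them before building the graph.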

## Step 3: Store Graph in ArangoDB

Now, we'll connect to ArangoDB, create the necessary collections, and store the graph. We'll use environment variables for credentials, so make sure you have a `.env` file with `ARANGO_HOST`, `ARANGO_PASSWORD`, etc.

You can run ArangoDB locally with Docker using this command:

```bash
docker run -d -p 8529:8529 -e ARANGO_ROOT_PASSWORD=password --name hiddenpaths_arangodb arangodb
```

In [2]:
from arango import ArangoClient
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Connect to ArangoDB
client = ArangoClient(hosts=os.getenv("ARANGO_HOST", "http://localhost:8529"))
sys_db = client.db("_system", username="root", password=os.getenv("ARANGO_PASSWORD", "password"), verify=True)

# Create the 'Test' database if it doesn't exist
if not sys_db.has_database("Test"):
    sys_db.create_database("Test")

db = client.db("Test", username="root", password=os.getenv("ARANGO_PASSWORD", "password"))

# Define collections
vertex_collections = ["job", "soft_skill", "hard_skill", "interest", "education"]
edge_collections = ["requires_softskill", "requires_hardskill", "supported_by_interest", "enables_job"]

for vc in vertex_collections:
    if not db.has_collection(vc):
        db.create_collection(vc)
for ec in edge_collections:
    if not db.has_collection(ec):
        db.create_collection(ec, edge=True)

# Map collections
collection_mapping = {
    "job": db.collection("job"),
    "soft_skill": db.collection("soft_skill"),
    "hard_skill": db.collection("hard_skill"),
    "interest": db.collection("interest"),
    "education": db.collection("education")
}
edge_mapping = {
    "requires_softskill": db.collection("requires_softskill"),
    "requires_hardskill": db.collection("requires_hardskill"),
    "supported_by_interest": db.collection("supported_by_interest"),
    "enables_job": db.collection("enables_job")
}

# Clear existing data
for collection in vertex_collections + edge_collections:
    db.collection(collection).truncate()

# Prepare data for insertion
vertices_to_insert = []
edges_to_insert = []

for node, data in G.nodes(data=True):
    doc = data.copy()
    doc["_key"] = node
    vertices_to_insert.append(doc)

for edge in G.edges(data=True):
    node1, node2, attr = edge
    relation = attr['relation']
    if relation == "soft_skill_leads_to":
        doc = {"_from": f"soft_skill/{node1}", "_to": f"job/{node2}", "relation": relation}
    elif relation == "hard_skill_leads_to":
        doc = {"_from": f"hard_skill/{node1}", "_to": f"job/{node2}", "relation": relation}
    elif relation == "supports":
        doc = {"_from": f"interest/{node1}", "_to": f"job/{node2}", "relation": relation}
    elif relation == "enables_to":
        doc = {"_from": f"education/{node1}", "_to": f"job/{node2}", "relation": relation}
    edges_to_insert.append(doc)

# Insert vertices
vertices_by_collection = {vc: [] for vc in vertex_collections}
for doc in vertices_to_insert:
    vertices_by_collection[doc['type'].lower().replace(" ", "_")].append(doc)

for vc, docs in vertices_by_collection.items():
    if docs:
        collection_mapping[vc].insert_many(docs)

# Insert edges
edges_by_collection = {ec: [] for ec in edge_collections}
for doc in edges_to_insert:
    relation = doc["relation"]
    if relation == "soft_skill_leads_to" and doc not in edges_by_collection["requires_softskill"]:
        edges_by_collection["requires_softskill"].append(doc)
    elif relation == "hard_skill_leads_to" and doc not in edges_by_collection["requires_hardskill"]:
        edges_by_collection["requires_hardskill"].append(doc)
    elif relation == "supports" and doc not in edges_by_collection["supported_by_interest"]:
        edges_by_collection["supported_by_interest"].append(doc)
    elif relation == "enables_to" and doc not in edges_by_collection["enables_job"]:
        edges_by_collection["enables_job"].append(doc)

for ec, docs in edges_by_collection.items():
    if docs:
        edge_mapping[ec].insert_many(docs)

# Create the graph if needed, then register each edge definition once
graph_name = "job_graph"
if not db.has_graph(graph_name):
    db.create_graph(graph_name)

graph = db.graph(graph_name)
edge_sources = {
    "requires_softskill": "soft_skill",
    "requires_hardskill": "hard_skill",
    "supported_by_interest": "interest",
    "enables_job": "education",
}
for ec, source in edge_sources.items():
    # Skip definitions that already exist so the cell can be re-run
    if not graph.has_edge_definition(ec):
        graph.create_edge_definition(edge_collection=ec, from_vertex_collections=[source], to_vertex_collections=["job"])

print(f"Graph '{graph_name}' created with {len(graph.edge_definitions())} edge definitions")

Graph 'job_graph' created with 4 edge definitions

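The relation-to-collection routing in the cell above can also be expressed as a lookup table, which keeps the `_from` prefix and the target edge collection for each relation in one place. The following is a minimal sketch: `RELATION_ROUTES` and `build_edge_doc` are hypothetical names introduced here, the node keys in the example are made up, and raising on an unknown relation is a choice of this sketch rather than the notebook's behaviour.

```python
# relation -> (source vertex collection, edge collection); names taken from the cell above
RELATION_ROUTES = {
    "soft_skill_leads_to": ("soft_skill", "requires_softskill"),
    "hard_skill_leads_to": ("hard_skill", "requires_hardskill"),
    "supports": ("interest", "supported_by_interest"),
    "enables_to": ("education", "enables_job"),
}

def build_edge_doc(source_key, job_key, relation):
    """Return (edge_collection, document) for one NetworkX edge."""
    if relation not in RELATION_ROUTES:
        raise ValueError(f"Unknown relation: {relation!r}")
    vertex_collection, edge_collection = RELATION_ROUTES[relation]
    doc = {
        "_from": f"{vertex_collection}/{source_key}",
        "_to": f"job/{job_key}",
        "relation": relation,
    }
    return edge_collection, doc

# Example with made-up keys
edge_collection, doc = build_edge_doc("hard_python", "job_data_scientist", "hard_skill_leads_to")
print(edge_collection, doc)
```

With this layout, adding a new relation type means adding one entry to the table instead of extending two `if`/`elif` chains.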

## Step 4: Retrieve Graph from ArangoDB

Here, we'll retrieve the graph from ArangoDB and load it back into a NetworkX DiGraph for further processing.

In [3]:
# Retrieve graph data from ArangoDB
G_retrieved = nx.DiGraph()

# Fetch vertices
vertex_collections = ["job", "soft_skill", "hard_skill", "interest", "education"]
for vc in vertex_collections:
    query = f"FOR doc IN {vc} RETURN {{_key: doc._key, data: doc}}"
    cursor = db.aql.execute(query)
    for doc in cursor:
        G_retrieved.add_node(doc["_key"], **{k: v for k, v in doc["data"].items() if k not in ["_key", "_id", "_rev"]})

# Fetch edges
edge_collections = ["requires_softskill", "requires_hardskill", "supported_by_interest", "enables_job"]
for ec in edge_collections:
    query = f"FOR doc IN {ec} RETURN {{source: SPLIT(doc._from, '/')[1], target: SPLIT(doc._to, '/')[1], data: doc}}"
    cursor = db.aql.execute(query)
    for doc in cursor:
        G_retrieved.add_edge(doc["source"], doc["target"], **{k: v for k, v in doc["data"].items() if k not in ["_from", "_to", "_id", "_rev"]})

# Verify retrieved graph
print(f"Retrieved graph - Number of nodes: {G_retrieved.number_of_nodes()}")
print(f"Retrieved graph - Number of edges: {G_retrieved.number_of_edges()}")

Retrieved graph - Number of nodes: 2014
Retrieved graph - Number of edges: 11708

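The queries above interpolate the collection name into the AQL string with an f-string. That works for the fixed lists used here, but AQL also supports bind parameters, including collection bind parameters written as `@@name`. A minimal sketch, assuming a python-arango database handle like the `db` object created in Step 3: `fetch_names` is a hypothetical helper, and the stub classes only imitate the `db.aql.execute` call so the snippet runs without a server.

```python
def fetch_names(db, collection):
    """Return the `name` attribute of every document in `collection`."""
    query = "FOR doc IN @@col RETURN doc.name"
    # Collection bind parameters use '@@col' in the query and the key '@col' here
    cursor = db.aql.execute(query, bind_vars={"@col": collection})
    return list(cursor)

class _StubAQL:
    """Stand-in for db.aql that records the call and returns canned names."""
    def __init__(self):
        self.calls = []

    def execute(self, query, bind_vars=None):
        self.calls.append((query, bind_vars))
        return iter(["Python", "Java"])  # made-up result for illustration

class _StubDB:
    """Stand-in for the database handle; only exposes the `aql` attribute."""
    def __init__(self):
        self.aql = _StubAQL()

stub = _StubDB()
names = fetch_names(stub, "hard_skill")
print(names)
```

In the notebook you would pass the real `db` object instead of the stub.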

## Before You Continue: Get Your Google API Key

To use the Google Gemini API in this notebook (e.g., for feature extraction or chatbot responses), you'll need an API key from Google AI Studio.

Setup is quick and takes less than 5 minutes. Follow these steps to get your key.

### How to Create an API Key
1. **Sign In with Your Google Account**  
   - Visit [https://aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey).  
   - Log in with your Google account.

2. **Navigate to the API Key Page**  
   - Once in Google AI Studio, look at the top-left corner of the page. You’ll see a **"Get API Key"** button. Click it!

3. **Generate a New API Key**  
   - Click **"Create API Key"**.  
   - You’ll be prompted to either create a new project or use an existing one. For testing, we recommend creating a new project to keep things tidy.

4. **Copy Your API Key**  
   - After creation, you’ll see your API Key (it looks like a string of letters and numbers, e.g., `AIzaSy...`).  
   - Click the copy button to save it for later use.

5. **Use It in This Notebook**  
   - Add your API key to a `.env` file in the same directory as this notebook, under the name `GOOGLE_API_KEY`.
   - Alternatively, replace `os.getenv("GOOGLE_API_KEY")` in the code with your API key.

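Because `os.getenv("GOOGLE_API_KEY")` returns `None` when the variable is missing, it can help to check for the key before calling the API. A small sketch using only the standard library: `missing_env` is a hypothetical helper, and it assumes `load_dotenv()` has already been called as in Step 3.

```python
import os

def missing_env(names, environ=None):
    """Return the variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env(["GOOGLE_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("GOOGLE_API_KEY is set.")
```
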
## Step 5: Implement Chatbot

Finally, we'll implement a chatbot that processes natural language queries, extracts features, maps them to the graph, and provides job recommendations using LangChain and ArangoDB.

In [4]:
from langchain_community.graphs import ArangoGraph
from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain
from langchain_google_genai import ChatGoogleGenerativeAI
from pydantic import BaseModel
from typing import List
import json
from google import genai
from fuzzywuzzy import fuzz, process
from IPython.display import display, Markdown

# Initialize Google AI client
google_client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

# Define feature extraction schema
class Extract_feature(BaseModel):
    job: List[str]
    hard_skill: List[str]
    soft_skill: List[str]
    interest: List[str]
    education: List[str]

# Extract features from natural language query
def extract(nlq: str, verbose=False) -> dict:
    prompt = f"""
    You are an AI tasked with analyzing natural language input from a user and extracting specific details into a JSON format.
    User's input: {nlq}
    Identify and categorize the following from the user's input:
    - 'job': job titles or roles mentioned (e.g., Software Engineer, Teacher).
    - 'hard_skill': technical skills or specific abilities (e.g., programming languages, tools).
    - 'soft_skill': interpersonal or cognitive skills (e.g., problem-solving, leadership).
    - 'interest': hobbies or personal interests (e.g., gaming, reading).
    - 'education': fields of study or degrees (e.g., Mathematics, Computer Science).
    Return the result as a JSON object with these fields as arrays. If no information is found for a category, use an empty array [].
    Use your judgment to interpret the input accurately, even if the phrasing varies.
    """
    config = {"response_mime_type": "application/json", "response_schema": Extract_feature}
    response = google_client.models.generate_content(model="gemini-2.0-flash", contents=[prompt], config=config)
    return json.loads(response.text)

# Fuzzy mapping to correct input data
def fuzzy_mapping(preprocessed_json: dict):
    all_nodes = {}
    for vc in ["job", "soft_skill", "hard_skill", "interest", "education"]:
        query = f"FOR doc IN {vc} RETURN doc.name"
        cursor = db.aql.execute(query)
        all_nodes[vc] = [doc for doc in cursor]
    corrected_additional_data = {}
    for key, values in preprocessed_json.items():
        matches = []
        for val in values:
            # extractOne returns (match, score), or None when there are no candidates
            best = process.extractOne(val, all_nodes[key], scorer=fuzz.token_sort_ratio)
            if best and best[1] >= 50:
                matches.append(best[0])
        corrected_additional_data[key] = matches
    return corrected_additional_data

# Query the graph and generate response
def text_to_aql_to_text(query: str, preprocessed_json: dict):
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", api_key=os.getenv("GOOGLE_API_KEY"))
    arango_graph = ArangoGraph(db)
    chain = ArangoGraphQAChain.from_llm(llm=llm, graph=arango_graph, verbose=True, allow_dangerous_requests=True)
    combined_input = (
            f"Natural Language Query: {query}\n"
            f"Preprocessed Data for name attribute of node: {json.dumps(preprocessed_json)}\n"
            "Generate an AQL query based on the preprocessed data and the following database schema:\n"
            "- Nodes:\n"
            "  1. Job: {_key: startswith job_, type: Job, min_salary: int, max_salary: int, min_exp: int, max_exp: int, level: [Junior, Mid, Senior], category: str, job_description: str, name: job title}\n"
            "  2. hard_skill: {_key: startswith hard_, type: hard_skill, name: skill name}\n"
            "  3. soft_skill: {_key: startswith soft_, type: soft_skill, name: skill name}\n"
            "  4. interest: {_key: startswith int_, type: interest, name: interest name}\n"
            "  5. education: {_key: startswith edu_, type: education, name: education name}\n"
            "- Edges (direction matters):\n"
            "  1. requires_softskill: {_from: soft_skill/, _to: job/, relation: soft_skill_leads_to} (OUTBOUND from soft_skill to job)\n"
            "  2. requires_hardskill: {_from: hard_skill/, _to: job/, relation: hard_skill_leads_to} (OUTBOUND from hard_skill to job)\n"
            "  3. supported_by_interest: {_from: interest/, _to: job/, relation: supports} (OUTBOUND from interest to job)\n"
            "  4. enables_job: {_from: education/, _to: job/, relation: enables_to} (OUTBOUND from education to job)\n"
            "Instructions:\n"
            "1. Use the preprocessed JSON data (hard_skill, soft_skill, interests, education) to construct an AQL query based on the natural language query.\n"
            "2. Leverage edge relationships (e.g., requires_hardskill, enables_job) to connect the input data to relevant nodes (jobs, skills, interests, or education), paying careful attention to edge direction:\n"
            "   - Use OUTBOUND when traversing from skills, interests, or education to jobs (e.g., soft_skill -> job, hard_skill -> job).\n"
            "   - Use INBOUND when traversing from jobs to skills, interests, or education (e.g., job -> soft_skill), if applicable to the query.\n"
            "   - Respect the schema: edges are defined as OUTBOUND from skills/interests/education to jobs, so prioritize this direction unless the query explicitly requires reverse traversal.\n"
            "3. When matching jobs, allow flexibility: jobs should match at least one skill, interest, or education from the input data, not requiring all to be present.\n"
            "4. Limit the query results to a maximum of 5 objects.\n"
            "5. Internally generate the AQL query, but do not include it in the final response.\n"
            "6. Translate the query results into a natural, user-friendly response, avoiding technical terms like 'AQL', 'INBOUND', 'OUTBOUND', 'node', or 'edge':\n"
            "   - If jobs are relevant, provide a simple list of up to 5 job titles with details like salary range (e.g., '$50,000 - $70,000') and experience required (e.g., '2-5 years'), followed by a brief explanation of why they match (e.g., 'This job fits because you know Python, which is one of the skills it needs').\n"
            "   - If skills or interests are the focus, describe up to 5 related skills or interests and how they connect to potential jobs (e.g., 'Your skill in Python could help you with software engineering roles').\n"
            "   - Use conversational language, as if explaining to a non-technical person.\n"
            "7. If no relevant results are found, respond with a simple message like 'Sorry, I couldn’t find any matches for you,' followed by a brief reason (e.g., 'None of the skills or education you provided seem to connect to the jobs available').\n"
            "Please generate the response based on the instructions and the provided data in Markdown format."
        )
    result = chain.invoke(combined_input)
    return str(result["result"])

# Chatbot function
def chatbot(query: str) -> str:
    try:
        extracted_json = extract(query, True)
        corrected_json = fuzzy_mapping(extracted_json)
        response = text_to_aql_to_text(query, corrected_json)
        return response
    except Exception as e:
        return f"An error occurred: {str(e)}"

# This notebook uses a dataset of technology, business, and creative jobs, so the chatbot can answer questions related to these fields.
query = "I have experience in Python, Java, and C++. I am interested in Machine Learning and Artificial Intelligence. I have a degree in Computer Science. What jobs can I apply for?"
result = chatbot(query)
display(Markdown(result))



> Entering new ArangoGraphQAChain chain...
AQL Query (1):
WITH job, hard_skill, education
FOR hardSkill IN hard_skill
  FILTER hardSkill.name IN ["Python", "Java", "C++", "Machine Learning", "Business Intelligence"]
  FOR jobItem IN OUTBOUND hardSkill requires_hardskill
    FOR educationItem IN education
      FILTER educationItem.name IN ["PhD in Computer Science"]
      FOR jobEducation IN OUTBOUND educationItem enables_job
        FILTER jobEducation._key == jobItem._key
        LIMIT 5
        RETURN {
          "job_name": jobItem.name,
          "min_salary": jobItem.min_salary,
          "max_salary": jobItem.max_salary,
          "min_exp": jobItem.min_exp,
          "max_exp": jobItem.max_exp
        }
AQL Result:
[{'job_name': 'Artificial Intelligence Researcher', 'min_salary': 75000, 'max_salary': 120000, 'min_exp': 4, 'max_exp': 8}, {'job_name': 'Autonomous Vehicle Engineer', 'min_salary': 80000, 'max_salary': 120000, 'min_exp': 4, '

Based on your background in Python, Java, C++, Machine Learning, and Business Intelligence, along with your PhD in Computer Science, here are some jobs you might be interested in:

1.  **Artificial Intelligence Researcher:** Salary range \$75,000 - \$120,000, with 4-8 years of experience.
2.  **Autonomous Vehicle Engineer:** Salary range \$80,000 - \$120,000, with 4-8 years of experience.
3.  **Computer Vision Engineer:** Salary range \$75,000 - \$115,000, with 4-8 years of experience.
4.  **Data Scientist:** Salary range \$75,000 - \$120,000, with 4-8 years of experience.
5.  **Artificial Intelligence Researcher:** Salary range \$75,000 - \$120,000, with 4-8 years of experience.

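In the run above, the generated query filters on names such as `Business Intelligence` and `PhD in Computer Science`, which the user never typed. One plausible cause is the fuzzy-matching step, which accepts any candidate scoring at least 50. The sketch below approximates a token-sort score with the standard library's `difflib` instead of `fuzzywuzzy`, so exact scores may differ from `fuzz.token_sort_ratio`. `token_sort_score` and `best_match` are hypothetical helpers, and the candidate list is made up for illustration.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Approximate token-sort similarity on a 0-100 scale (difflib stand-in)."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())

def best_match(value, choices, threshold=50):
    """Return (choice, score) for the closest choice, or None below the threshold."""
    if not choices:
        return None
    scored = [(choice, token_sort_score(value, choice)) for choice in choices]
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= threshold else None

choices = ["Business Intelligence", "Python", "Project Management"]  # illustrative only
print(best_match("Artificial Intelligence", choices, threshold=50))
print(best_match("Artificial Intelligence", choices, threshold=80))
```

Raising the threshold trades recall for precision; the value that works best depends on the dataset.
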
## Step 6: Generate Career Path Diagram

In this step, we’ll generate a career path diagram using the `json_for_graph` function. This feature:
- Takes a natural language input from the user (e.g., skills, interests).
- Extracts and matches features against the graph to find the top 3 job recommendations.
- For each job, identifies up to 2 "children" jobs (higher-salary career progression options).
- Returns a JSON structure that could be visualized as a tree-like diagram.

We’ll implement the necessary functions with detailed explanations and test it with an example input. The output will be displayed as formatted Markdown for clarity.

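To illustrate the tree shape described above, the following sketch renders a structure with `nodes` and `children` keys as indented text. The field names follow those used by `json_for_graph`; the sample job titles and salaries are invented for illustration, and `render_tree` is a hypothetical helper.

```python
def render_tree(career_path):
    """Return an indented text view of top-level jobs and their child jobs."""
    lines = []
    for node in career_path["nodes"]:
        lines.append(f"{node['job']} (${node['min_salary']}-${node['max_salary']})")
        for child in node.get("children", []):
            lines.append(f"  -> {child['job']} (${child['min_salary']}-${child['max_salary']})")
    return "\n".join(lines)

# Invented sample with the same shape as the structure described above
sample = {
    "nodes": [
        {
            "job": "Junior Analyst", "min_salary": 40000, "max_salary": 55000,
            "children": [
                {"job": "Senior Analyst", "min_salary": 60000, "max_salary": 80000},
            ],
        },
    ],
}
print(render_tree(sample))
```
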
In [5]:
import json
from typing import List

import networkx as nx
from fuzzywuzzy import fuzz, process
from IPython.display import display, Markdown
from pydantic import BaseModel

# --- Define Supporting Classes and Functions ---

class JobSelector:
    """A simple class to track selected jobs and avoid duplicates in career path generation."""
    def __init__(self):
        self.selected_job = []  # List to store job keys already selected

class Extract_feature(BaseModel):
    """Pydantic model to enforce JSON schema for extracted features from user input."""
    job: List[str]
    hard_skill: List[str]
    soft_skill: List[str]
    interest: List[str]
    education: List[str]

def extract(nlq: str, verbose=False) -> dict:
    """Extract features from natural language query using Google Gemini API."""
    # Define the prompt to instruct the AI on feature extraction
    prompt = f"""
    You are an AI tasked with analyzing natural language input from a user and extracting specific details into a JSON format.
    User's input: {nlq}
    Identify and categorize the following from the user's input:
    - 'job': job titles or roles mentioned (e.g., Software Engineer, Teacher).
    - 'hard_skill': technical skills or specific abilities (e.g., programming languages, tools).
    - 'soft_skill': interpersonal or cognitive skills (e.g., problem-solving, leadership).
    - 'interest': hobbies or personal interests (e.g., gaming, reading).
    - 'education': fields of study or degrees (e.g., Mathematics, Computer Science).
    Return the result as a JSON object with these fields as arrays. If no information is found for a category, use an empty array [].
    Use your judgment to interpret the input accurately, even if the phrasing varies.
    """
    # Configure the API to return structured JSON
    config = {"response_mime_type": "application/json", "response_schema": Extract_feature}
    # Call the Google Gemini API to generate the response
    response = google_client.models.generate_content(model="gemini-2.0-flash", contents=[prompt], config=config)
    json_response = json.loads(response.text)  # Parse the response into a Python dict
    if verbose:
        print(f"Original NLQ: {nlq}")
        print("Extracted JSON:", json_response)
    return json_response

def fuzzy_mapping(preprocessed_json: dict, verbose=False) -> dict:
    """Map extracted features to existing nodes in the graph using fuzzy matching."""
    all_nodes = {}
    # Fetch all node names from ArangoDB for each collection
    for vc in ["job", "soft_skill", "hard_skill", "interest", "education"]:
        query = f"FOR doc IN {vc} RETURN doc.name"
        cursor = db.aql.execute(query)
        all_nodes[vc] = [doc for doc in cursor]
    # Correct user input by finding the closest match in the graph (threshold: 50% similarity)
    corrected_additional_data = {}
    for key, values in preprocessed_json.items():
        matches = []
        for val in values:
            # extractOne returns (match, score), or None when there are no candidates
            best = process.extractOne(val, all_nodes[key], scorer=fuzz.token_sort_ratio)
            if best and best[1] >= 50:
                matches.append(best[0])
        corrected_additional_data[key] = matches
    if verbose:
        print("Corrected data:", corrected_additional_data)
    return corrected_additional_data

def required_for_job(job: str, G: nx.DiGraph) -> dict:
    """Get the required skills, interests, and education for a given job from the graph."""
    # Extract incoming edges to identify requirements
    required_hard_skills = list(set([n for n, t, attr in G.in_edges(job, data=True) 
                                     if attr['relation'] == 'hard_skill_leads_to']))
    required_soft_skills = list(set([n for n, t, attr in G.in_edges(job, data=True) 
                                     if attr['relation'] == 'soft_skill_leads_to']))
    required_interests = list(set([n for n, t, attr in G.in_edges(job, data=True) 
                                   if attr['relation'] == 'supports']))
    required_education = list(set([n for n, t, attr in G.in_edges(job, data=True) 
                                   if attr['relation'] == 'enables_to']))
    return {
        "hard_skill": required_hard_skills,
        "soft_skill": required_soft_skills,
        "interest": required_interests,
        "education": required_education
    }

def merge_dicts(dict1: dict, dict2: dict) -> dict:
    """Merge two dictionaries, combining lists and removing duplicates."""
    merged_dict = {}
    all_keys = set(dict1.keys()).union(set(dict2.keys()))
    for key in all_keys:
        merged_dict[key] = list(set(dict1.get(key, []) + dict2.get(key, [])))
    return merged_dict

def find_best_job(user_input: dict, G: nx.DiGraph) -> dict:
    """Find the best job matches based on user input, scoring them by skill matches."""
    job_nodes = [n for n, attr in G.nodes(data=True) if attr['type'] == 'Job']  # Filter job nodes
    job_scores = []
    mapped_node = {key: att['name'] for key, att in G.nodes(data=True)}  # Map node keys to names
    
    for job in job_nodes:
        # Get job requirements
        required_hard_skills = set([n for n, t, attr in G.in_edges(job, data=True) 
                                    if attr['relation'] == 'hard_skill_leads_to'])
        required_soft_skills = set([n for n, t, attr in G.in_edges(job, data=True) 
                                    if attr['relation'] == 'soft_skill_leads_to'])
        required_interests = set([n for n, t, attr in G.in_edges(job, data=True) 
                                  if attr['relation'] == 'supports'])
        required_education = set([n for n, t, attr in G.in_edges(job, data=True) 
                                  if attr['relation'] == 'enables_to'])
        
        # Map requirements to human-readable names
        mapped_required_hard_skills = [mapped_node[hs] for hs in required_hard_skills]
        mapped_required_soft_skills = [mapped_node[ss] for ss in required_soft_skills]
        mapped_required_interests = [mapped_node[it] for it in required_interests]
        mapped_required_education = [mapped_node[edu] for edu in required_education]
        required = [mapped_required_hard_skills, mapped_required_soft_skills, mapped_required_interests, mapped_required_education]
        
        # Convert user input to sets for comparison
        user_hard = set(user_input.get("hard_skill", []))
        user_soft = set(user_input.get("soft_skill", []))
        user_int = set(user_input.get("interest", []))
        user_edu = set(user_input.get("education", []))
        
        # Calculate matches and misses
        matched_hard = len(user_hard & set(mapped_required_hard_skills))
        matched_soft = len(user_soft & set(mapped_required_soft_skills))
        matched_int = len(user_int & set(mapped_required_interests))
        matched_edu = len(user_edu & set(mapped_required_education))
        missing_hard = len(set(mapped_required_hard_skills) - user_hard)
        missing_soft = len(set(mapped_required_soft_skills) - user_soft)
        
        # Calculate score based on matches, misses, and in-degree (complexity penalty)
        in_degree = G.in_degree(job)
        score = (matched_hard * 10) + (matched_soft * 5) + (matched_int * 1) + (matched_edu * 0.5) - (missing_hard * 1) - (missing_soft * 0.5) - (in_degree * 0.5)
        
        # Compute average salary
        avg_salary = (G.nodes[job]['min_salary'] + G.nodes[job]['max_salary']) / 2
        job_attribute = {
            "job_title": G.nodes[job]['name'],
            "category": G.nodes[job]['category'],
            "job_description": G.nodes[job]['job_description'],
            "min_salary": G.nodes[job]['min_salary'],
            "max_salary": G.nodes[job]['max_salary'],
            "min_exp": G.nodes[job]['min_exp'],
            "max_exp": G.nodes[job]['max_exp'],
            "level": G.nodes[job]['level'],
            "average_salary": avg_salary
        }
        
        # Store job details and score
        job_scores.append((job, score, matched_hard, matched_soft, missing_hard, missing_soft, in_degree, job_attribute, required))
    
    # Sort jobs by score (highest first)
    sorted_jobs = sorted(job_scores, key=lambda x: x[1], reverse=True)
    result = {
        'all_scores': [
            {
                'job_key': job,
                'score': score,
                'matched_hard': m_hard,
                'matched_soft': m_soft,
                'missing_hard': miss_hard,
                'missing_soft': miss_soft,
                'in_degree': in_deg,
                'attribute': attribute,
                'hard_skill': required[0],
                'soft_skill': required[1],
                'interest': required[2],
                'education': required[3]
            } for job, score, m_hard, m_soft, miss_hard, miss_soft, in_deg, attribute, required in sorted_jobs
        ]
    }
    return result

def find_children(job: str, avg_salary: float, G: nx.DiGraph, user_correct_json: dict, job_selector: JobSelector) -> list:
    """Find up to 2 higher-salary job progression options (children) for a given job."""
    required = required_for_job(job, G)  # Get requirements for the current job
    merge_required = merge_dicts(required, user_correct_json)  # Combine with user input
    results = find_best_job(merge_required, G)  # Find potential next jobs
    children = []
    for result in results['all_scores']:
        if len(children) >= 2:  # Limit to 2 children
            break
        # Check if job hasn’t been selected and offers higher salary
        if result['job_key'] not in job_selector.selected_job and result["attribute"]["average_salary"] > avg_salary:
            job_selector.selected_job.append(result['job_key'])
            children.append({
                "job": result["attribute"]["job_title"],
                "min_salary": result["attribute"]["min_salary"],
                "max_salary": result["attribute"]["max_salary"],
                "min_exp": result["attribute"]["min_exp"],
                "max_exp": result["attribute"]["max_exp"],
                "level": result["attribute"]["level"],
                "category": result["attribute"]["category"],
                "job_description": result["attribute"]["job_description"],
                "hard_skill": result["hard_skill"],
                "soft_skill": result["soft_skill"],
                "interest": result["interest"],
                "education": result["education"]
            })
    return children

def json_for_graph(user_input: str) -> dict:
    """Generate a JSON structure representing a career path diagram."""
    job_selector = JobSelector()  # Initialize job selector to track used jobs
    extracted_json = extract(user_input)  # Extract features from user input
    corrected_json = fuzzy_mapping(extracted_json)  # Correct features using fuzzy matching
    results = find_best_job(corrected_json, G)  # Find top job matches
    # Mark top 3 jobs as selected
    for j in results['all_scores'][:3]:
        job_selector.selected_job.append(j['job_key'])
    # Build JSON response with top 3 jobs and their children
    json_response = {
        "nlq": user_input,
        "user_input": corrected_json,
        "nodes": [{
            "job": r["attribute"]["job_title"],
            "min_salary": r["attribute"]["min_salary"],
            "max_salary": r["attribute"]["max_salary"],
            "min_exp": r["attribute"]["min_exp"],
            "max_exp": r["attribute"]["max_exp"],
            "level": r["attribute"]["level"],
            "category": r["attribute"]["category"],
            "job_description": r["attribute"]["job_description"],
            "hard_skill": r["hard_skill"],
            "soft_skill": r["soft_skill"],
            "interest": r["interest"],
            "education": r["education"],
            "children": find_children(r["job_key"], r["attribute"]["average_salary"], G, corrected_json, job_selector)
        } for r in results['all_scores'][:3]]
    }
    return json_response
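
The selection rule inside `find_children` can be tried in isolation, without the graph or the LLM. The block below is a minimal sketch under stated assumptions: `SeenJobs` is a hypothetical stand-in for `JobSelector` (only its `selected_job` list is modeled), `pick_children` is an illustrative name, and the ranked list and salaries are made-up values rather than data from the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class SeenJobs:
    """Hypothetical stand-in for the notebook's JobSelector."""
    selected_job: list = field(default_factory=list)


def pick_children(ranked, avg_salary, seen, limit=2):
    """Return up to `limit` unseen job keys that pay more than `avg_salary`."""
    children = []
    for result in ranked:
        if len(children) >= limit:
            break
        key = result["job_key"]
        if key not in seen.selected_job and result["average_salary"] > avg_salary:
            seen.selected_job.append(key)  # reserve it so no other parent reuses it
            children.append(key)
    return children


# Toy ranking, already sorted best match first (all values are made up)
ranked = [
    {"job_key": "job_a", "average_salary": 50000},
    {"job_key": "job_b", "average_salary": 70000},
    {"job_key": "job_c", "average_salary": 65000},
    {"job_key": "job_d", "average_salary": 90000},
]
seen = SeenJobs(selected_job=["job_a"])    # job_a is already a top recommendation
first = pick_children(ranked, 52000, seen)
second = pick_children(ranked, 52000, seen)
print(first)   # ['job_b', 'job_c']
print(second)  # ['job_d']
```

Because the shared list is mutated on every call, the order in which parents are processed decides which parent receives a given child: the second call above only gets `job_d` because the first call already claimed `job_b` and `job_c`.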

In [6]:
# --- Test the Career Path Generation ---
import json
from IPython.display import Markdown, display

# Example user input
user_input = "I have experience in sustainability plan, environmental management, and also have creativity. I am interested in gardening and sustainability. I have a degree in Agriculture."

# Generate career path JSON
career_path = json_for_graph(user_input)

# Format the output as Markdown for readability
markdown_output = f"""
## Career Path Recommendations

**Your Input:** {career_path['nlq']}

### Top Job Recommendations
"""
for i, node in enumerate(career_path['nodes'], 1):
    markdown_output += f"""
#### {i}. {node['job']}
- **Salary Range:** \\${node['min_salary']} - \\${node['max_salary']}
- **Experience Required:** {node['min_exp']} - {node['max_exp']} years
- **Level:** {node['level']}
- **Category:** {node['category']}
- **Description:** {node['job_description']}
- **Hard Skills:** {', '.join(node['hard_skill']) or 'None'}
- **Soft Skills:** {', '.join(node['soft_skill']) or 'None'}
- **Interests:** {', '.join(node['interest']) or 'None'}
- **Education:** {', '.join(node['education']) or 'None'}

**Potential Career Progression:**
"""
    if node['children']:
        for j, child in enumerate(node['children'], 1):
            markdown_output += f"""
- **{j}. {child['job']}**
  - **Salary Range:** \\${child['min_salary']} - \\${child['max_salary']}
  - **Experience Required:** {child['min_exp']} - {child['max_exp']} years
  - **Level:** {child['level']}
  - **Category:** {child['category']}
  - **Description:** {child['job_description']}
"""
    else:
        markdown_output += "- No higher-salary progression options found.\n"

# Display the formatted output
display(Markdown(markdown_output))

# Optionally, save the JSON to a file
with open("career_path.json", "w") as f:
    json.dump(career_path, f, indent=4)
print("Career path JSON saved to 'career_path.json'")


## Career Path Recommendations

**Your Input:** I have experience in sustainability plan, environmental management, and also have creativity. I am interested in gardening and sustainability. I have a degree in Agriculture.

### Top Job Recommendations

#### 1. Organic Farmer
- **Salary Range:** $40000 - $65000
- **Experience Required:** 2 - 10 years
- **Level:** Journey
- **Category:** Agriculture
- **Description:** Grows crops and raises livestock using organic methods.
- **Hard Skills:** Quality Control, Environmental Management, Process Improvement, Sustainability Planning, Operations Management
- **Soft Skills:** Persistence, Creativity, Sustainability Awareness, Communication, Problem Solving
- **Interests:** Healthy Living, Composting, Gardening, Sustainability
- **Education:** Bachelor's in Sustainability, Bachelor's in Agriculture

**Potential Career Progression:**

- **1. Soil Conservationist**
  - **Salary Range:** $50000 - $75000
  - **Experience Required:** 2 - 8 years
  - **Level:** Journey
  - **Category:** Environment
  - **Description:** Implements strategies to prevent soil erosion and degradation.

- **2. Sustainability Manager**
  - **Salary Range:** $65000 - $95000
  - **Experience Required:** 4 - 8 years
  - **Level:** Senior
  - **Category:** Business
  - **Description:** Transportation

#### 2. Sustainability Coordinator
- **Salary Range:** $50000 - $80000
- **Experience Required:** 2 - 8 years
- **Level:** Journey
- **Category:** Environment
- **Description:** Develops programs to promote sustainable agricultural practices.
- **Hard Skills:** Data Analysis, Environmental Management, Sustainability Planning, Project Management, Strategic Planning
- **Soft Skills:** Creativity, Sustainability Awareness, Stakeholder Management, Communication, Problem Solving
- **Interests:** Sustainability, Environmental Management, Green Initiatives, Research
- **Education:** Master's in Sustainability, Bachelor's in Sustainability

**Potential Career Progression:**

- **1. Sustainability Consultant**
  - **Salary Range:** $55000 - $85000
  - **Experience Required:** 3 - 7 years
  - **Level:** Mid
  - **Category:** Business
  - **Description:** Transportation

- **2. Chief Sustainability Officer**
  - **Salary Range:** $80000 - $130000
  - **Experience Required:** 5 - 10 years
  - **Level:** Senior
  - **Category:** Business
  - **Description:** Transportation

#### 3. Environmental Designer
- **Salary Range:** $50000 - $80000
- **Experience Required:** 3 - 7 years
- **Level:** Mid
- **Category:** Creative
- **Description:** Transportation
- **Hard Skills:** Resource Allocation, Environmental Management, Construction Management, Architecture, Sustainability Planning, Solidity
- **Soft Skills:** Spatial Thinking, Creativity, Sustainability Awareness, Communication, Problem Solving, Technical Understanding, Research
- **Interests:** Business Planning, Sustainable Design, Ecology, Green Business
- **Education:** Bachelor's in Architecture, Master's in Environmental Science, Master's in Sustainability

**Potential Career Progression:**

- **1. Environmental Engineer**
  - **Salary Range:** $65000 - $105000
  - **Experience Required:** 3 - 10 years
  - **Level:** Journey
  - **Category:** Environment
  - **Description:** Designs systems to mitigate environmental damage and pollution.

- **2. Environmental Scientist**
  - **Salary Range:** $60000 - $90000
  - **Experience Required:** 2 - 8 years
  - **Level:** Journey
  - **Category:** Environment
  - **Description:** Studies environmental impacts and develops solutions for pollution control.


Career path JSON saved to 'career_path.json'
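
Before handing the saved file to a visualization layer, it can help to confirm that it has the shape `json_for_graph` builds: top-level `nlq`, `user_input`, and `nodes`, with each node carrying the job fields plus a `children` list. The block below is a rough sketch of such a check. `check_career_path`, `NODE_KEYS`, and `make_entry` are hypothetical names introduced here, and the sample entries are placeholders, not real output.

```python
import json

# Field names copied from the dictionaries built in json_for_graph / find_children
NODE_KEYS = {
    "job", "min_salary", "max_salary", "min_exp", "max_exp", "level",
    "category", "job_description", "hard_skill", "soft_skill",
    "interest", "education",
}


def check_career_path(path: dict) -> list:
    """Return a list of readable problems; an empty list means the shape looks right."""
    problems = []
    for key in ("nlq", "user_input", "nodes"):
        if key not in path:
            problems.append(f"missing top-level key: {key}")
    for i, node in enumerate(path.get("nodes", [])):
        missing = (NODE_KEYS | {"children"}) - set(node)
        if missing:
            problems.append(f"nodes[{i}] missing {sorted(missing)}")
        for j, child in enumerate(node.get("children", [])):
            missing = NODE_KEYS - set(child)
            if missing:
                problems.append(f"nodes[{i}].children[{j}] missing {sorted(missing)}")
    return problems


def make_entry(job, **extra):
    """Build a placeholder entry with every expected field present."""
    entry = {key: None for key in NODE_KEYS}
    entry["job"] = job
    entry.update(extra)
    return entry


sample = {
    "nlq": "example question",
    "user_input": {},
    "nodes": [make_entry("Job A", children=[make_entry("Job B")])],
}
round_tripped = json.loads(json.dumps(sample))  # same path the saved file takes
problems_ok = check_career_path(round_tripped)

broken = json.loads(json.dumps(sample))
del broken["nodes"][0]["children"]
problems_broken = check_career_path(broken)

print(problems_ok)
print(problems_broken)
```

To apply the same check to the file written above, load it with `json.load` and pass the result to `check_career_path`.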


## Conclusion

This notebook built a job recommendation system on top of a graph database (ArangoDB) and LLM-based features. Along the way, we:
- Generated a graph from CSV datasets using NetworkX.
- Stored the graph in ArangoDB for persistent storage.
- Retrieved the graph for validation and further processing.
- Implemented a chatbot that queries the graph and provides conversational job recommendations.
- Added a career path generation feature that recommends top jobs and their potential progression paths, output as a JSON structure suitable for visualization.

With these capabilities, you can explore job matches based on skills and interests, get natural language responses, and plan career growth with a structured progression diagram. To extend this further, consider visualizing the career path as a tree diagram (e.g., using `graphviz`) or enhancing the scoring algorithm for more personalized recommendations. Happy career planning!
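
As a starting point for the tree diagram mentioned above, the sketch below turns the career path structure into Graphviz DOT source using plain Python. It is one possible approach, not part of the notebook: `career_path_to_dot` is a hypothetical helper, the `you` root node is an assumption added for illustration, and the sample reuses only job titles from the output shown earlier.

```python
def career_path_to_dot(path: dict) -> str:
    """Build Graphviz DOT source for a 'top jobs -> progression' tree."""

    def quote(text) -> str:
        # DOT string literals are double-quoted; escape backslashes and quotes
        escaped = str(text).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = ["digraph career_path {", "    rankdir=LR;", '    you [label="You"];']
    for node in path["nodes"]:
        parent = quote(node["job"])
        lines.append(f"    you -> {parent};")
        for child in node.get("children", []):
            lines.append(f"    {parent} -> {quote(child['job'])};")
    lines.append("}")
    return "\n".join(lines)


# Trimmed-down sample: only the fields this helper reads
sample = {
    "nodes": [
        {
            "job": "Organic Farmer",
            "children": [
                {"job": "Soil Conservationist"},
                {"job": "Sustainability Manager"},
            ],
        },
        {"job": "Sustainability Coordinator", "children": []},
    ]
}
dot_source = career_path_to_dot(sample)
print(dot_source)
```

The resulting string can be written to a `.dot` file and rendered with the Graphviz command-line tools. If the `graphviz` Python package is installed (it is not in the install list in Step 1), passing the string to `graphviz.Source` should display it inline in the notebook.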