---
format:
    html:
        embed-resources: true
---

# Cleaning: Part-2 

The goal here is the same as in `HW-3.2-cleaning-1.ipynb`, but this time you should repeat the exercise using LLM APIs and prompt engineering to streamline the cleaning process.

Essentially, your job is to write an LLM wrapper to clean the job descriptions. 

How you do this is up to you. You can use any LLM API that you want, and you can use any prompt engineering techniques you want. 
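
As one possible shape for such a wrapper, the sketch below keeps the prompt building and JSON parsing separate from the provider call, so that any LLM API can be plugged in. The names `build_prompt`, `clean_description`, and `fake_llm` are hypothetical and used only for illustration; `fake_llm` is a stand-in for a real API call so that the example runs without credentials.

```python
import json
from typing import Callable, List


def build_prompt(description: str, fields: List[str]) -> str:
    """Ask the model for a JSON object containing exactly the given keys."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        "Extract the following fields from the job description below. "
        f"Return only a JSON object with the keys: {keys}. "
        "Use null for anything the description does not state.\n\n"
        f"Job description:\n{description}"
    )


def clean_description(
    description: str, fields: List[str], call_llm: Callable[[str], str]
) -> dict:
    """Send one description through the LLM callable and parse its JSON reply."""
    reply = call_llm(build_prompt(description, fields))
    return json.loads(reply)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider call (assumption: the reply is raw JSON text)."""
    return '{"Job Title": "Data Scientist", "Salary": null}'


record = clean_description(
    "We are hiring a Data Scientist...", ["Job Title", "Salary"], fake_llm
)
print(record)
```

To use a real provider, replace `fake_llm` with a function that sends the prompt to the API of your choice and returns the response text.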

For example, you can wrap OpenAI's ChatGPT API; however, this requires buying a few credits, e.g. $5 to $10. (You don't have to if you don't want to; you can use free LLM options as well.)

Here is an example of how to use OpenAI's API:

[https://jfh.georgetown.domains/centralized-lecture-content/content/computer-science/general-concepts/openAI-API-example/notes.html](https://jfh.georgetown.domains/centralized-lecture-content/content/computer-science/general-concepts/openAI-API-example/notes.html)

There are also various LLM APIs that you can wrap to get partial access. Search around and find a tool that fits your needs.

* [https://ai.google.dev/gemini-api/docs/quickstart?lang=python](https://ai.google.dev/gemini-api/docs/quickstart?lang=python)



In [2]:
import os
import json
import google.generativeai as genai

with open('/Users/zp/Desktop/gemini.json', 'r') as f:
    data = json.load(f)
# configure() sets the key globally and returns None, so there is nothing to assign.
genai.configure(api_key=data['api_key'])
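
As an alternative to reading the key from a hard-coded path, the key could be looked up in an environment variable first, with the JSON file as a fallback. This is only a sketch: the helper name `load_api_key` and the variable name `GEMINI_API_KEY` are assumptions made for illustration, not something the Gemini SDK requires here.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_var: str = "GEMINI_API_KEY", json_path: Optional[str] = None) -> str:
    """Return the API key from an environment variable, else from a JSON file.

    The JSON file is assumed to look like {"api_key": "..."}.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    if json_path is not None and Path(json_path).is_file():
        with open(json_path, "r") as f:
            return json.load(f)["api_key"]
    raise RuntimeError(f"No API key found in ${env_var} or {json_path!r}")


# Usage (left as a comment so the sketch runs without a key):
# genai.configure(api_key=load_api_key(json_path="gemini.json"))
```

Keeping the key out of the notebook makes it easier to share the notebook without sharing credentials.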


In [3]:
import time
import pandas as pd

model = genai.GenerativeModel("gemini-1.5-flash")
json_folder_path = "/Users/zp/hw-3-zp199717"

def gemini(description):
    prompt = """
    Clean and streamline the job description below.
    Extract the following information from it and return only valid JSON with exactly these keys:

    {
        "Job Title": ...,
        "Roles": ...,
        "Company Name": ...,
        "Sector/Industry": ...,
        "Location": ...,
        "City": ...,
        "State": ...,
        "Job Type": ...,
        "Salary": ...,
        "Experience Level": ...,
        "Education Level": ...,
        "Education Requirement": ...,
        "Skills/Technologies Required": ...,
        "Job Responsibilities/Duties": ...,
        "Summary": ...,
        "Required Years of Experience": ...,
        "Job Description Length": ...,
        "Certifications Required or Preferred": ...,
        "Visa Sponsorship Availability": ...,
        "Working Hours/Shift Type": ...
    }

    Job Description: """ + f'"{description}"'

   
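    response = model.generate_content(prompt)
    json_text = response.candidates[0].content.parts[0].text.strip()

    # str.strip("```json") strips a set of characters, not a prefix, so
    # remove the Markdown code fence explicitly (removeprefix needs Python 3.9+).
    json_text = json_text.removeprefix("```json").removeprefix("```")
    json_text = json_text.removesuffix("```").strip()
    return json.loads(json_text)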


all_processed_data = []


for filename in os.listdir(json_folder_path):
    if filename.endswith(".json"):
        json_file_path = os.path.join(json_folder_path, filename)
        
        
        with open(json_file_path, 'r') as f:
            content = json.load(f)
            jobs_data = content['jobs_results']  
            
            for job in jobs_data:
                description = job.get("description", "")
                
                try:
                    job_features = gemini(description)
                    all_processed_data.append(job_features)
                    
                    
                    time.sleep(1)  

                except Exception as e:
                    # Report the failure instead of silently dropping the job.
                    print(f"Error processing job description: {e}")
                    continue

    
    time.sleep(0.5)  


df = pd.DataFrame(all_processed_data)
df.to_csv('/Users/zp/hw-3-zp199717/data/processed-jobs-2.csv', index=False)
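
The loop above skips a job whenever the API call or the JSON parsing fails. One possible refinement, sketched here under the assumption that failures are often transient (rate limits, a malformed reply), is to retry a few times with exponential backoff before giving up. `call_with_retry` and `flaky` are hypothetical names for illustration; `flaky` simulates a function that fails twice before succeeding.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); after a failure wait base_delay * 2**attempt, then retry.

    The last failure is re-raised so the caller can decide what to do with it.
    """
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            print(f"Attempt {attempt + 1} failed ({exc}); retrying")
            sleep(base_delay * 2 ** attempt)


# Simulated unreliable call: fails twice, then succeeds.
calls = {"n": 0}


def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("malformed JSON")
    return {"Job Title": text}


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, "Analyst", sleep=delays.append)
print(result, delays)
```

In the loop, this would correspond to something like `call_with_retry(gemini, description)` inside the `try` block.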


In [4]:
df.head()

| Unnamed: 0 | Job Title | Roles | Company Name | Sector/Industry | Location | City | State | Job Type | Salary | Experience Level | ... | Work Best Category | Department | Time Type | Weekly Hours | FTE | Shift | Travel | Security Clearance | Remote Work | Remote Work Availability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Solution Architect or Data Scientist | [Solution Architect, Data Scientist] | NVIDIA | Technology | Not Specified | Not Specified | Not Specified | Full-time | 120,000 USD - 276,000 USD | Senior | ... | | | | | | | | | | |
| 1 | AI Automation Engineer | [AI Specialist, AI Analyst, Machine Learning E... | Trilogy | Technology | | | | Full-time | | Mid-Level | ... | | | | | | | | | | |
| 2 | AI and Information Security Analyst | [AI and Information Security Analyst] | RAND Corporation | National Security, AI, Cybersecurity, Policy R... | Multiple Locations (San Francisco, CA; Washing... | [San Francisco, Washington, Santa Monica, Pitt... | [CA, DC, CA, PA, MA] | Full-time | $52,000 - $192,100 | Entry Level, Mid-Level, Senior Level | ... | | | | | | | | | | |
| 3 | Freelance Writer for AI Training | Writer, Editor, AI Trainer | Outlier | Artificial Intelligence, Technology, Writing | Remote | | | Freelance | $15 to $35 USD per hour | Entry Level, Mid Level | ... | | | | | | | | | | |
| 4 | HPC/AI Sales Specialist, Federal | Sales Specialist & Consultant | Hewlett Packard Enterprise | Technology | Remote/Teleworker | | | Full-time | $139,700.00 - $313,900.00 | Experienced | ... | | | | | | | | | | |
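
The preview above contains columns that the prompt never asked for (for example `Work Best Category` and `FTE`), and some cells hold lists where others hold plain strings. A possible post-processing step, sketched below, keeps only the expected keys, fills missing ones with `None`, and joins list values into a single string. The helper `normalize_record`, the shortened `EXPECTED_KEYS` list, and the sample records are made up for illustration.

```python
# Shortened key list for illustration; the full list would mirror the prompt.
EXPECTED_KEYS = ["Job Title", "Company Name", "City", "State", "Salary"]


def normalize_record(record: dict, expected_keys: list) -> dict:
    """Keep only expected keys, fill gaps with None, flatten lists to strings."""
    clean = {}
    for key in expected_keys:
        value = record.get(key)
        if isinstance(value, list):
            value = "; ".join(str(v) for v in value)
        clean[key] = value
    return clean


# Made-up records imitating inconsistent LLM output.
raw = [
    {"Job Title": "Analyst", "City": ["San Francisco", "Washington"], "FTE": "1.0"},
    {"Job Title": "Writer", "Company Name": "Outlier", "Salary": "$15 to $35 USD per hour"},
]

rows = [normalize_record(r, EXPECTED_KEYS) for r in raw]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `all_processed_data`, which gives every row the same columns.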
