# Final Project

Analyze the collected job data with OpenAI and store the results in a MongoDB database. The analyses include:


- Extract entities
- Summarize

## Install Python Libraries

- pymongo: manage the MongoDB database
- openai: call the OpenAI APIs.

In [11]:
pip install pymongo

Note: you may need to restart the kernel to use updated packages.


In [12]:
pip install openai

Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [13]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

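    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError:
        # Re-raise unchanged so the caller sees the original AWS error
        raise

    # The secret is stored as a JSON string; return it as a dict
    secret = get_secret_value_response['SecretString']

    return json.loads(secret)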

## Import Python Libraries and Credentials  

In [14]:
import pymongo
from pymongo import MongoClient
import json
from pprint import pprint
from tqdm.auto import tqdm
import re

openai_api_key  = get_secret('openai')['api_key']

mongodb_connect = get_secret('mongodb')['connection_string']

## Connect to the MongoDB cluster

In [22]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
job_collection = db.job_collection  # use or create a collection named job_collection


## Extract Job Data

Filter the jobs you are interested in. You can use MongoDB Compass to help you write the queries.

In [23]:
# An empty filter matches every document in the collection
query_filter = {}
# Return only the fields needed for the analysis
project = {
    'QualificationSummary': 1,
    'PositionID': 1
}
result = job_collection.find(
    filter=query_filter,
    projection=project
)
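
The empty filter above returns every job. As a rough sketch of a narrower query, the helper below builds a filter document that skips jobs already carrying a ```skills``` field (the field this notebook writes later) and optionally matches a keyword in ```QualificationSummary```. The function name ```build_job_filter``` and its parameters are hypothetical, not part of the original notebook.

```python
import re

def build_job_filter(only_unprocessed=True, keyword=None):
    """Build a MongoDB query document for job_collection.find() (illustrative helper)."""
    conditions = {}
    if only_unprocessed:
        # Match documents that do not have a 'skills' field yet
        conditions['skills'] = {'$exists': False}
    if keyword:
        # Case-insensitive substring match on the qualification summary;
        # re.escape keeps characters such as '+' from being read as regex syntax
        conditions['QualificationSummary'] = {
            '$regex': re.escape(keyword),
            '$options': 'i',
        }
    return conditions
```

The returned dictionary can be passed as the ```filter``` argument of ```find```, for example ```job_collection.find(filter=build_job_filter(keyword='Python'), projection=project)```.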

Save the extracted jobs into the ```job_data``` list. Removing URLs and newlines would reduce token usage; this cell stores the summaries unchanged.

In [24]:
job_data = []
for job in result:
    # Keep only the position ID and the qualification summary for each job
    job_data.append({'position_id': job['PositionID'], 'summary': job['QualificationSummary']})
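
Stripping URLs and line breaks before sending text to the model reduces token usage. A minimal sketch of such a cleanup step follows; ```clean_text``` is a hypothetical helper, and the URL pattern is a simple assumption (anything starting with ```http://``` or ```https://``` up to the next whitespace), not the pattern from the original notebook.

```python
import re

# Assumed pattern: a URL is 'http://' or 'https://' followed by non-whitespace characters
URL_PATTERN = re.compile(r'https?://\S+')

def clean_text(text):
    """Remove URLs and collapse newlines and repeated spaces into single spaces."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()
```

It could be applied when building the list, for example ```'summary': clean_text(job['QualificationSummary'])```.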

In [25]:
print('Number of jobs: ',len(job_data))

Number of jobs:  419


## Set up OpenAI API

Load the OpenAI API key and set the API parameters.

- Model type: gpt-4o is used by default; you can choose any of the [available models](https://platform.openai.com/docs/models).
- Token estimate: 100 tokens ~= 75 words in English. Total token usage = tokens in the prompt + tokens in the completion. You can get a more accurate estimate with the [Tokenizer](https://platform.openai.com/tokenizer).
- Temperature: Lower temperatures produce more consistent outputs, while higher values generate more diverse and creative results. 
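
The 100 tokens ~= 75 words rule of thumb can be turned into a quick estimate. The sketch below is only an approximation based on whitespace-separated word counts; ```estimate_tokens``` is a hypothetical helper, and real counts depend on the model's tokenizer.

```python
def estimate_tokens(text):
    """Rough token estimate using the 100 tokens per 75 English words rule of thumb."""
    words = len(text.split())
    # Integer ceiling of words * 100 / 75
    return (words * 100 + 74) // 75
```

Summing this estimate over the prompt and the expected completion gives a ballpark figure for total usage per request.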

A helper function, ```openai_help```, sends a prompt to the model and returns the text of the reply.

In [26]:
from openai import OpenAI
client = OpenAI(api_key=openai_api_key)
model="gpt-4o"
temperature=0

def openai_help(prompt, model=model, temperature=temperature):
    # Send the prompt as a single user message
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    # Return only the text of the first completion
    return response.choices[0].message.content

## Extract entities

Extract technology skills from each job summary and save the result to the MongoDB database.

In [27]:
for job in tqdm(job_data):

    prompt = f"""
    Identify the common technology skills in the following job summary.
    Job summary: {job['summary']}
    Format the items as a JSON list.
    Be consistent, general, and concise.
    If no technology skills are present, use "Unknown" in the list.
    Do not wrap the JSON in code fence markers.
    """
    try:
        extract_result = openai_help(prompt)

        # Store the parsed skill list on the matching job document
        job_collection.update_one(
            {'PositionID': job['position_id']},
            {"$set": {'skills': json.loads(extract_result)}}
        )
    except Exception as e:
        # Report the failure and continue with the next job
        print(f"Skipped {job['position_id']}: {e}")

  0%|          | 0/419 [00:00<?, ?it/s]

## Close Database Connection

In [28]:
mongo_client.close()