# Analyze Twitter Data

Analyze the collected Twitter data with OpenAI and store the results in a MongoDB database. The analyses include:

- Sentiment analysis
- Language translation
- Identify emotions
- Extract entities
- Summarize

## Install Python libraries

- pymongo: manage the MongoDB database
- openai: call the OpenAI APIs

In [1]:
pip install pymongo

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install openai

Note: you may need to restart the kernel to use updated packages.


## Secret Manager Function

In [3]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials  

In [4]:
import pymongo
from pymongo import MongoClient
import json
from pprint import pprint
from tqdm.auto import tqdm
import re

openai_api_key  = get_secret('openai')['api_key']

mongodb_connect = get_secret('mongodb1')['api_key']

## Connect to the MongoDB cluster

In [5]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
tweet_collection = db.tweet_collection  # use or create a collection named tweet_collection
# Enforce uniqueness on the tweet ID so the same tweet is never stored twice
tweet_collection.create_index([("tweet.id", pymongo.ASCENDING)], unique=True)

## Extract Twitter Data

Filter the Tweets you are interested in. You can use MongoDB Compass to help you write the queries.

In [6]:
# An empty filter matches every document in the collection
query_filter = {}
# Return only the tweet text and ID
projection = {
    'tweet.text': 1,
    'tweet.id': 1
}
result = tweet_collection.find(
    filter=query_filter,
    projection=projection
)
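
The empty filter in this cell matches every tweet. As a minimal sketch of a narrower query, the dictionary below keeps only tweets whose text contains a keyword. The helper name `build_keyword_filter` and the keyword `election` are placeholders chosen for illustration; `$regex` and `$options` are standard MongoDB query operators.

```python
import re

def build_keyword_filter(keyword):
    """Build a MongoDB filter matching tweets whose text contains `keyword` (hypothetical helper)."""
    return {
        'tweet.text': {
            '$regex': re.escape(keyword),  # escape special characters so the keyword is matched literally
            '$options': 'i'                # case-insensitive match
        }
    }

keyword_filter = build_keyword_filter('election')  # 'election' is a placeholder keyword
print(keyword_filter)
```

The resulting dictionary can be passed as the `filter` argument of `find` in place of the empty one.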

Save the extracted tweets into the ```tweet_data``` list. Remove URLs and newlines to reduce token usage.

In [7]:
tweet_data = []
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
for tweet in result:
    text_without_urls = re.sub(url_pattern, '', tweet['tweet']['text'])
    # Replace newlines with spaces so adjacent words do not run together
    tweet_data.append({'tweet_id': tweet['tweet']['id'], 'tweet_text': text_without_urls.replace('\n', ' ')})
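
To see what the cleaning step does to a single tweet, here is a small standalone sketch. It uses a simplified URL pattern (`https?://\S+`), which is an assumption and not the pattern from the cell above, and a hypothetical helper named `clean_tweet_text`. The sample text is made up.

```python
import re

# Simplified URL pattern: "http://" or "https://" followed by non-whitespace characters.
# This is an assumption for illustration, not the pattern used in the notebook cell.
SIMPLE_URL_PATTERN = re.compile(r'https?://\S+')

def clean_tweet_text(text):
    """Remove URLs and collapse newlines and repeated spaces (hypothetical helper)."""
    without_urls = SIMPLE_URL_PATTERN.sub('', text)
    # split() breaks on any whitespace, including newlines; join() rebuilds with single spaces
    return ' '.join(without_urls.split())

sample = "Big news today!\nRead more: https://example.com/story?id=1 #breaking"
print(clean_tweet_text(sample))
```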

In [8]:
print('Number of tweets: ',len(tweet_data))

Number of tweets:  200


## Set up OpenAI API

Load the OpenAI API key and set the API parameters.

- Model type: use gpt-4o by default; you can choose any of the [available models](https://platform.openai.com/docs/models).
- Token estimate: 100 tokens ~= 75 words in English. Total token usage = tokens in the prompt + tokens in the completion. You can get a more accurate estimate with the [Tokenizer](https://platform.openai.com/tokenizer).
- Temperature: Lower temperatures produce more consistent outputs, while higher values generate more diverse and creative results. 
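
The token rule of thumb above can be turned into a rough estimator. This is only a sketch based on the 100 tokens ~= 75 words approximation for English text; the function names are hypothetical, and the Tokenizer remains the accurate source.

```python
import math

def estimate_tokens(text):
    """Rough token estimate from the word count, using 100 tokens ~= 75 words."""
    word_count = len(text.split())
    return math.ceil(word_count * 100 / 75)

def estimate_total_tokens(prompt, expected_completion_words):
    """Total usage = tokens in the prompt + tokens in the completion (both estimated)."""
    completion_tokens = math.ceil(expected_completion_words * 100 / 75)
    return estimate_tokens(prompt) + completion_tokens

example_prompt = " ".join(["word"] * 75)  # a 75-word placeholder prompt
print(estimate_tokens(example_prompt))
print(estimate_total_tokens(example_prompt, 75))
```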

A helper function, ```openai_help```, sends the prompt to the model and returns the text of the reply.

In [9]:
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)
model = "gpt-4o"
temperature = 0

def openai_help(prompt, model=model, temperature=temperature):
    # Send the prompt as a single user message and return the reply text
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content

## Sentiment analysis

Analyze the sentiment of each tweet and save the result to the MongoDB database.

In [10]:
for tweet in tqdm(tweet_data):

    prompt = f"""
    What is the sentiment of the following tweet?
    tweet text: {tweet['tweet_text']}
    Return the result as one word: Positive, Neutral, or Negative.
    """
    try:
        sentiment_result = openai_help(prompt)

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.sentiment': sentiment_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Sentiment analysis failed for tweet {tweet['tweet_id']}: {e}")
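
The prompt asks for a single word, but a model reply can still carry extra whitespace, punctuation, or different casing. One possible way to keep the stored values consistent is to normalize the reply before saving it. The helper `normalize_sentiment` and the fallback label `Unknown` are assumptions for illustration, not part of the notebook.

```python
VALID_SENTIMENTS = ('Positive', 'Neutral', 'Negative')

def normalize_sentiment(raw_reply):
    """Map a raw model reply onto one of the expected labels (hypothetical helper)."""
    # Drop surrounding whitespace, then trailing or leading punctuation and quotes
    cleaned = raw_reply.strip().strip('.!"\'').lower()
    for label in VALID_SENTIMENTS:
        if cleaned == label.lower():
            return label
    # Anything that is not exactly one of the three labels is marked as Unknown
    return 'Unknown'

print(normalize_sentiment(' positive.\n'))
print(normalize_sentiment('The sentiment is mixed'))
```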


## Language translation

Translate each tweet into a different language and save the result to the MongoDB database.

In [11]:
for tweet in tqdm(tweet_data):

    prompt = f"""
    Translate the following tweet into Chinese.
    tweet text: {tweet['tweet_text']}
    """
    try:
        translate_result = openai_help(prompt)

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.translate': translate_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Translation failed for tweet {tweet['tweet_id']}: {e}")
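
The target language is written directly into the prompt. If you want to switch languages without editing the prompt text, one option is to build the prompt from a parameter. The helper `build_translation_prompt` and the extra instruction to return only the translation are assumptions, not part of the notebook.

```python
def build_translation_prompt(tweet_text, target_language='Chinese'):
    """Build a translation prompt for a given target language (hypothetical helper)."""
    return (
        f"Translate the following tweet into {target_language}.\n"
        f"Return only the translation.\n"
        f"tweet text: {tweet_text}"
    )

prompt = build_translation_prompt('Good morning', 'Spanish')
print(prompt)
```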


## Identify emotions

Identify whether a tweet expresses anger, and save the result to the MongoDB database.

In [12]:
for tweet in tqdm(tweet_data):

    prompt = f"""
    Detect the emotion in the following tweet, and determine whether the tweet expresses anger.
    Provide the result as True, False, or Unknown.
    Don't provide any reasoning or other output.
    tweet text: {tweet['tweet_text']}
    """
    try:
        emotion_result = openai_help(prompt)

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.anger': emotion_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Emotion detection failed for tweet {tweet['tweet_id']}: {e}")
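
The reply is stored as a string such as `True` or `False`. If you would rather store a real boolean, which is easier to filter on in MongoDB, a conversion along these lines could be applied first. The helper `parse_anger_flag` is hypothetical, and mapping everything else to `None` is a design assumption.

```python
def parse_anger_flag(raw_reply):
    """Convert the model reply to True, False, or None for Unknown (hypothetical helper)."""
    cleaned = raw_reply.strip().strip('.').lower()
    if cleaned == 'true':
        return True
    if cleaned == 'false':
        return False
    # 'Unknown' and any unexpected reply are treated as missing information
    return None

print(parse_anger_flag('True'))
print(parse_anger_flag(' false.'))
print(parse_anger_flag('Unknown'))
```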


## Extract entities

Extract person and organization names from each tweet and save the result to the MongoDB database.

In [13]:
for tweet in tqdm(tweet_data):

    prompt = f"""
    Identify persons or organizations in the following tweet,
    tweet text: {tweet['tweet_text']},
    format the response as a JSON object with Person and Organization as the keys, and the extracted items in a list.
    If no entity is found for a key, use "Unknown" in the list.
    Do not wrap the JSON in code-block markers.
    """
    try:
        extract_result = openai_help(prompt)

        # json.loads raises an error if the reply is not valid JSON
        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.extracted_item': json.loads(extract_result)}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Entity extraction failed for tweet {tweet['tweet_id']}: {e}")
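
Even when told not to, a model may occasionally wrap its JSON in code-block markers, which makes `json.loads` fail. The sketch below shows one way to strip such markers and make sure both keys hold lists. The helper `parse_entities` is hypothetical, and the sample reply is made up for illustration.

```python
import json
import re

FENCE = '`' * 3  # the three-backtick code-block marker

def parse_entities(raw_reply):
    """Parse the entity JSON, tolerating optional code-block markers (hypothetical helper)."""
    text = raw_reply.strip()
    # Remove a leading fence (optionally followed by "json") and a trailing fence
    text = re.sub(r'^' + FENCE + r'(?:json)?\s*', '', text)
    text = re.sub(r'\s*' + FENCE + r'$', '', text)
    data = json.loads(text)
    result = {}
    for key in ('Person', 'Organization'):
        value = data.get(key, ['Unknown'])
        # Wrap a bare string in a list so the stored shape is always the same
        result[key] = value if isinstance(value, list) else [value]
    return result

wrapped_reply = FENCE + 'json\n{"Person": ["Ada Lovelace"], "Organization": "Unknown"}\n' + FENCE
print(parse_entities(wrapped_reply))
```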


## Summarize

Summarize the tweet texts and save the result to the MongoDB database. By default, 500 tweets are summarized in each batch. You can change the batch size so that each batch fits within the context window of the model you use.

In [None]:
# Define the batch size
batch_size = 500

start_index = 0
tweet_summary = db.tweet_summary  # use or create a collection named tweet_summary

while start_index < len(tweet_data):
    batch = tweet_data[start_index:start_index + batch_size]

    tweet_id_list = [tweet['tweet_id'] for tweet in batch]
    # Join the tweet texts into one string, separated by periods
    tweet_text_summary = '.'.join(tweet['tweet_text'] for tweet in batch)

    prompt = f"""
    Summarize the following tweets in at most 50 words,
    tweet text: {tweet_text_summary}
    """
    try:
        summary_result = openai_help(prompt)

        tweet_summary.insert_one({'id_list': tweet_id_list,
                                  'tweet_text_summary': summary_result})
        print(summary_result, '\n')
    except Exception as e:
        # Report the failure and continue with the next batch
        print(f"Summarization failed for batch starting at {start_index}: {e}")
    start_index += batch_size
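
The batching logic can be tried on its own, without calling the API. The sketch below slices a small made-up list into batches of two and shows the IDs and joined text each batch would contribute. The generator `iter_batches` and the sample tweets are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up sample data with the same keys as the tweet_data list
sample_tweets = [{'tweet_id': i, 'tweet_text': f'tweet {i}'} for i in range(5)]

for batch in iter_batches(sample_tweets, 2):
    ids = [t['tweet_id'] for t in batch]
    joined = '. '.join(t['tweet_text'] for t in batch)
    print(ids, joined)
```

The last batch is shorter than the others when the number of tweets is not a multiple of the batch size.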

## Close Database Connection

In [None]:
mongo_client.close()