# Analyze Twitter Data with OpenAI V1

This notebook assumes that your Tweets were collected with Twitter API V1, or that each Tweet is organized as:
```
{
id:123,
text:'abc',
...
}

```
If your Tweets were collected with Twitter API V2 or are organized in a different format, please use the code for [V2](https://github.com/xbwei/machine_learning_in_python/tree/master).
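
Before running the rest of the notebook, it can help to confirm that a document has the expected top-level fields. The following is a minimal sketch; `has_v1_shape` and the sample documents are hypothetical names introduced only for illustration, and the nested sample is an invented example of a layout that would not match.

```python
def has_v1_shape(tweet):
    """Return True if the document has top-level `id` and `text` fields."""
    return isinstance(tweet, dict) and "id" in tweet and "text" in tweet


sample_tweets = [
    {"id": 123, "text": "abc"},               # matches the layout shown above
    {"data": {"id": "123", "text": "abc"}},   # hypothetical nested layout, does not match
]

for tweet in sample_tweets:
    print(has_v1_shape(tweet))
```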

## Install Python libraries

We need [pymongo](https://pypi.org/project/pymongo/) to manage the MongoDB database and [openai](https://github.com/openai/openai-python) to call the OpenAI APIs.

In [1]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-4.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.4.2-py3-none-any.whl.metadata (4.9 kB)
Downloading pymongo-4.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (677 kB)
Downloading dnspython-2.4.2-py3-none-any.whl (300 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.4.2 pymongo-4.6.0


In [2]:
!pip install openai

Collecting openai
  Downloading openai-1.2.3-py3-none-any.whl.metadata (16 kB)
Collecting anyio<4,>=3.5.0 (from openai)
  Downloading anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.1-py3-none-any.whl.metadata (7.1 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.4.2-py3-none-any.whl.metadata (158 kB)
Collecting httpcore (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl.metadata (20 kB)
Collecting annotated-types>=0.4.0 (from pydantic<3,>=1.9.0->openai)
  Downloading annotated_types-0.6.0-py3-none-any.whl.metadata (12 kB)
Collecting pydantic-core==2.10.1 (from pydantic<3,>=1.9.0->openai)
  Downloading pydantic_core-2.10.1-cp310-cp310-manylinux_2_17_x86_64.man

## Import Python libraries

In [3]:
import pymongo
from pymongo import MongoClient
import json
from pprint import pprint
import configparser
from tqdm.auto import tqdm
import re

## Load the authorization info

Save the database connection info and API key in a config.ini file and use configparser to load the authorization info.

The config.ini file should look like:
```
[myopenai]
openai_api = <your OpenAI API key>

[mymongo]
connection = <your MongoDB connection string>
```
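
The loading cell creates the parser with `interpolation=None`. A likely reason is that connection strings often contain a literal `%` (for example in a URL-encoded password), which the default interpolation of `configparser` rejects when the value is read. The sketch below illustrates this with placeholder values; the key and connection string are invented and are not real credentials.

```python
import configparser

# Hypothetical config contents; both values are placeholders.
sample_ini = """
[myopenai]
openai_api = sk-placeholder

[mymongo]
connection = mongodb+srv://user:p%40ss@cluster.example.net/
"""

# With interpolation disabled, the '%' is returned unchanged
config = configparser.ConfigParser(interpolation=None)
config.read_string(sample_ini)
connection = config["mymongo"]["connection"]
print(connection)

# With the default interpolation, reading the same value raises an error
default_config = configparser.ConfigParser()
default_config.read_string(sample_ini)
try:
    default_config["mymongo"]["connection"]
    default_failed = False
except configparser.InterpolationSyntaxError as err:
    default_failed = True
    print("default interpolation failed:", type(err).__name__)
```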


In [4]:
config = configparser.ConfigParser(interpolation=None)
config.read('config.ini')

openai_api_key   = config['myopenai']['openai_api']

mongod_connect = config['mymongo']['connection']

## Connect to the MongoDB cluster

We will connect to the MongoDB database that contains the tweet data. You need to change the database name and collection name to match your settings.

In [5]:
client = MongoClient(mongod_connect)
db = client.tweet # use or create a database named tweet
tweet_collection = db.gun_va #use or create a collection named gun_va


## Extract Twitter Data

Search for the Tweets you are interested in.
You can use [MongoDB Compass](https://www.mongodb.com/try/download/compass) to help you write the queries.

In [6]:
'''
The following code is generated in MongoDB Compass to find the top 100 tweets 
with a key word of 'shooting', ordered by the favorite count
'''
filter={
    '$text': {
        '$search': 'shooting'
    }
}
project={
    'id': 1, 
    'text': 1
}
sort=list({
    'favorite_count': -1
}.items())
limit=100
result = client['tweet']['gun_va'].find(
  filter=filter,
  projection=project,
  sort=sort,
  limit=limit
)
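
A `$text` query only works if the collection has a text index on the searched field. If your collection does not have one yet, it could be created along these lines. This is a sketch: `ensure_text_index` is a hypothetical helper, `collection` is assumed to be a pymongo collection, and `"text"` is the index type string that pymongo also exposes as `pymongo.TEXT`.

```python
def ensure_text_index(collection, field="text"):
    """Create a text index on `field` so that `$text` queries can run.

    `collection` is assumed to be a pymongo Collection object.
    """
    return collection.create_index([(field, "text")])


# Usage with the collection from this notebook (assumption: no text index yet):
# ensure_text_index(tweet_collection)
```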

Save the extracted Tweets into the ```tweet_data``` list. Remove URLs and newlines to save tokens.

In [7]:
tweet_data = []
# Match http/https URLs so they can be stripped from the tweet text
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
for tweet in result:
    text_without_urls = re.sub(url_pattern, '', tweet['text'])
    # Replace newlines with spaces so adjacent words do not run together
    tweet_data.append({'tweet_id':tweet['id'],'tweet_text':text_without_urls.replace('\n',' ')})
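
To see what the cleaning step does, here is a small standalone sketch on an invented tweet. It deliberately swaps in a simpler URL pattern (`https?://\S+`) than the one used in the cell above, and `clean_tweet_text` is a hypothetical name; the sample text is made up.

```python
import re


def clean_tweet_text(text):
    """Strip URLs, then collapse newlines and repeated spaces into single spaces."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return " ".join(without_urls.split())


sample = "Breaking news https://t.co/abc123\nStay safe"
print(clean_tweet_text(sample))  # Breaking news Stay safe
```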

In [8]:
print('Number of tweets: ',len(tweet_data))

Number of tweets:  73


## Set up OpenAI API

Load the OpenAI API key and set the API parameters. 
- Model type: use ```gpt-3.5-turbo``` by default, but you can choose any of the [available models](https://platform.openai.com/docs/models/overview).
- Token estimate: 100 tokens ~= 75 words in English. Total token usage = tokens in the prompt + tokens in the completion. You can get a more accurate estimate with the [Tokenizer](https://platform.openai.com/tokenizer).
- Temperature: set to ```0``` in this notebook. Lower temperatures result in more consistent outputs, while higher values generate more diverse and creative results.
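
The rule of thumb above (100 tokens for roughly 75 English words) can be turned into a quick estimate before sending a prompt. This is only a rough sketch; `estimate_tokens` is a hypothetical helper, and the count it returns is an approximation rather than what the tokenizer would report.

```python
def estimate_tokens(text, tokens_per_word=100 / 75):
    """Rough token estimate based on the 100 tokens ~= 75 words rule of thumb."""
    return round(len(text.split()) * tokens_per_word)


prompt = "What is the sentiment of the following tweet"
print(estimate_tokens(prompt))
```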

A helper function, ```openai_help```, is created to pass the prompt.

In [9]:
from openai import OpenAI
# Use a separate name so the MongoDB client created earlier is not overwritten
openai_client = OpenAI(api_key=openai_api_key)
model="gpt-3.5-turbo"
temperature=0


def openai_help(prompt, model=model, temperature=temperature):
    """Send the prompt as a single user message and return the model's text reply."""
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content
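
API calls can fail for temporary reasons such as rate limits or network errors. One possible way to handle this is a small retry wrapper with exponential backoff. The sketch below is generic and not part of the original notebook; `call_with_retry`, `max_attempts`, and `base_delay` are hypothetical names.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call `func`, retrying with exponential backoff if it raises an exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))


# Usage with the helper defined above:
# result = call_with_retry(openai_help, prompt)
```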

## Sentiment analysis

Analyze the sentiment of each tweet and save the result to the MongoDB database.

In [10]:
for tweet in tqdm(tweet_data):

    prompt = f"""
    What is the sentiment of the following tweet?
    tweet text: {tweet['tweet_text']}
    Return the result as one word: positive, neutral, or negative.

    """
#     print(prompt)
    try:
        sentiment_result = openai_help(prompt)
#         print(sentiment_result)

        # Store the label on the matching tweet document
        tweet_collection.update_one(
            {'id':tweet['tweet_id']},
            {"$set":{'sentiment':sentiment_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print('Failed on tweet', tweet['tweet_id'], ':', e)


  0%|          | 0/73 [00:00<?, ?it/s]
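
The model does not always reply with exactly one lowercase word; it may add capitalization or punctuation. If you want consistent labels in the database, the reply could be normalized before saving. This is a minimal sketch with a hypothetical helper, `normalize_sentiment`, which maps anything unexpected to `"unknown"`.

```python
VALID_SENTIMENTS = ("positive", "neutral", "negative")


def normalize_sentiment(raw):
    """Lowercase the reply, strip surrounding punctuation, and validate the label."""
    cleaned = raw.strip().strip(".!\"'").lower()
    return cleaned if cleaned in VALID_SENTIMENTS else "unknown"


print(normalize_sentiment(" Positive."))  # positive
print(normalize_sentiment("It depends"))  # unknown
```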

## Language translation 

Translate each tweet into a different language, and save the result to the MongoDB database.

In [11]:
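for tweet in tqdm(tweet_data):

    prompt = f"""
    Translate the following tweet into Chinese
    tweet text: {tweet['tweet_text']}

    """
#     print(prompt)
    try:
        translate_result = openai_help(prompt)
#         print(translate_result)

        # Store the translation on the matching tweet document
        tweet_collection.update_one(
            {'id':tweet['tweet_id']},
            {"$set":{'translate':translate_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print('Failed on tweet', tweet['tweet_id'], ':', e)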


  0%|          | 0/73 [00:00<?, ?it/s]

## Identify emotions

Identify whether a tweet expresses anger, and save the result to the MongoDB database.

In [12]:
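for tweet in tqdm(tweet_data):

    prompt = f"""
    Does the following tweet express anger?
    Provide the result as either True or False.
    tweet text: {tweet['tweet_text']}

    """
#     print(prompt)
    try:
        emotion_result = openai_help(prompt)
#         print(emotion_result)

        # Store the True/False reply on the matching tweet document
        tweet_collection.update_one(
            {'id':tweet['tweet_id']},
            {"$set":{'anger':emotion_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print('Failed on tweet', tweet['tweet_id'], ':', e)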


  0%|          | 0/73 [00:00<?, ?it/s]
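
The loop above stores the model's reply as a string such as `'True'`. If you would rather store a real boolean, which is easier to filter on in MongoDB, the reply could be converted first. This is a sketch with a hypothetical helper, `parse_bool`; it returns `None` when the reply is neither True nor False.

```python
def parse_bool(raw):
    """Convert a 'True'/'False' reply into a boolean, or None if it is unclear."""
    cleaned = raw.strip().strip(".\"'").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None


print(parse_bool("True."))   # True
print(parse_bool("maybe"))   # None
```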

## Extract entities

Extract person and organization names from each tweet and save the result to the MongoDB database.

In [13]:
for tweet in tqdm(tweet_data):

    prompt = f"""
    Identify persons or organizations from the following tweet,
    tweet text: {tweet['tweet_text']},
    format the response as a JSON document with person and organization as the keys.
    If the information is not presented, use "unknown".
    """
#     print(prompt)
    try:
        extract_result = openai_help(prompt)
#         print(extract_result)

        # json.loads raises an error if the reply is not valid JSON
        tweet_collection.update_one(
            {'id':tweet['tweet_id']},
            {"$set":{'extracted_item':json.loads(extract_result)}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print('Failed on tweet', tweet['tweet_id'], ':', e)

  0%|          | 0/73 [00:00<?, ?it/s]
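
The model sometimes wraps the JSON in extra text, in which case `json.loads` on the full reply fails and the tweet is skipped. One possible workaround is to pull out the first `{...}` span before parsing. This is a sketch; `parse_json_reply` is a hypothetical helper and the sample reply is invented.

```python
import json
import re


def parse_json_reply(reply):
    """Parse a JSON object from a model reply, tolerating surrounding text."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


reply = 'Here is the JSON:\n{"person": "unknown", "organization": "NRA"}'
print(parse_json_reply(reply))
```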

## Summarize

Summarize the tweet texts with a specific focus, and save the result to the MongoDB database.
By default, 50 tweets are analyzed in each batch. You can change the batch size based on the model you use.

In [None]:
# Define the batch size
batch_size = 50

start_index = 0


while start_index < len(tweet_data):
    batch = tweet_data[start_index:start_index + batch_size]

    tweet_id_list =[]
    tweet_text_summary =''

    for tweet in batch:
        tweet_id_list.append(tweet['tweet_id'])
        tweet_text_summary = tweet_text_summary+'.'+tweet['tweet_text']

    # Note: no trailing comma inside the braces, otherwise a tuple is formatted
    prompt = f"""
    Summarize the following tweets in at most 50 words,
    focusing on why people oppose gun control.
    tweet text: {tweet_text_summary}

    """
#     print(prompt)
    try:
        summary_result =openai_help(prompt)

        tweet_summary = db.tweet_summary #use or create a collection named tweet_summary
        tweet_summary.insert_one({'id_list':tweet_id_list,
                            'tweet_text_summary':summary_result})
        print(summary_result,'\n')
    except Exception as e:
        # Report the failure and continue with the next batch
        print('Failed on batch starting at', start_index, ':', e)
    start_index += batch_size
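
The batching logic in the loop above can also be written as a small generator, which keeps the index arithmetic in one place. This is a sketch with a hypothetical helper, `iter_batches`; the 73 items mirror the number of tweets reported earlier in this notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sizes = [len(batch) for batch in iter_batches(list(range(73)), 50)]
print(sizes)  # [50, 23]
```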