A [Twitter Developer account](https://developer.twitter.com/) is required to run this script.

In [1]:
import tweepy
import json
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.max_rows', 20)

## Read Twitter API credentials

Read Twitter API credentials from `twitter-credentials.json`. The JSON file should contain the following key/values:

```json
{
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
    "bearer_token": "YOUR_BEARER_TOKEN"
}
```

Create `twitter-credentials.json` with your own keys and tokens in the same folder as this Jupyter notebook.

In [3]:
# Parse the credentials file directly into a dict
with open('twitter-credentials.json', 'r') as f:
    twitter_credentials = json.load(f)
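
A missing or misspelled key in the credentials file would otherwise only surface later, as a `KeyError` or an authentication failure when the client is created. The sketch below checks the file up front. `load_credentials` and `REQUIRED_KEYS` are hypothetical names introduced for illustration; the key list is taken from the JSON template above.

```python
import json

# Key names taken from the JSON template above.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
    "bearer_token",
)


def load_credentials(path="twitter-credentials.json"):
    """Load the credentials file and fail early if a key is missing or empty.

    Hypothetical helper for illustration; not part of Tweepy.
    """
    with open(path, "r") as f:
        credentials = json.load(f)

    # Treat absent keys and empty strings the same way.
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise KeyError(f"Missing or empty credential(s): {', '.join(missing)}")
    return credentials
```

Calling `load_credentials()` with no arguments reads `twitter-credentials.json` from the notebook's folder and returns the same dictionary as the cell above.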

## Create Tweepy client with credentials

Initialize `tweepy.Client` (Twitter API V2).

In [4]:
client = tweepy.Client(
    consumer_key=twitter_credentials["consumer_key"],
    consumer_secret=twitter_credentials["consumer_secret"],
    access_token=twitter_credentials["access_token"],
    access_token_secret=twitter_credentials["access_token_secret"],
    bearer_token=twitter_credentials["bearer_token"],
    wait_on_rate_limit=True
)

## Create a function to retrieve tweets

This function paginates manually rather than through Tweepy's `tweepy.Paginator` helper. It passes the `next_token` from each response to the follow-up request until `max_num_tweets` tweets have been retrieved or no further pages remain.

In [5]:
def retrieve_tweets(query, max_num_tweets=100, sort_order="recency"):
    response = None
    df = None
    num_tweets_retrieved = 0
    next_token = None
    author_id_username_map = {}
    
    print(f'==========================================')
    print(f'Retrieving tweets using query: {query}')
    
    while (max_num_tweets > num_tweets_retrieved) and ((response is None) or next_token):
        max_results = max(min(max_num_tweets - num_tweets_retrieved, 100), 10)
        
        print(f'num_tweets_retrieved={num_tweets_retrieved}, max_results={max_results}, next_token={next_token}')
        
        response = client.search_recent_tweets(
            query=query,
            max_results=max_results,
            sort_order=sort_order,
            next_token=next_token,
            user_fields=["name" , "username"],
            tweet_fields=["author_id"],
            expansions=["entities.mentions.username"]
        )
        
        # response.data is None when a page contains no tweets
        if not response.data:
            break

        new_df = pd.DataFrame(response.data)
        
        if 'edit_history_tweet_ids' in new_df.columns:
            new_df.drop(columns=['edit_history_tweet_ids'], inplace=True)
        num_tweets_retrieved += len(response.data)
        
        ids = new_df['author_id'].unique().tolist()

        for user in client.get_users(ids=ids).data:
            author_id_username_map[user['id']] = user['username']
            
        new_df['username'] = new_df['author_id'].map(author_id_username_map)

        if df is None:
            df = new_df
        else:
            df = pd.concat([df, new_df])
        
        # extract next_token for subsequent call
        next_token = response.meta.get('next_token')
        
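    columns = ['id', 'username', 'text', 'entities']

    if df is None:
        # no tweets matched the query
        return pd.DataFrame(columns=columns)

    # reindex (instead of df[columns]) so a missing 'entities' column becomes NaN rather than raising
    df = df.reindex(columns=columns)

    return df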
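
The token-driven loop can be tried without API access. The following is a minimal sketch of the same pattern, not the notebook's implementation: `paginate`, `fake_fetch` and `FAKE_PAGES` are hypothetical stand-ins, and `fake_fetch` plays the role of `client.search_recent_tweets` by returning a page of items together with the token for the next page.

```python
def paginate(fetch_page, max_items):
    """Collect items page by page, following next_token between requests.

    fetch_page(next_token) is a hypothetical callable that returns
    (items, next_token); next_token is None on the last page.
    """
    items = []
    next_token = None
    first_request = True

    # Continue while more items are wanted and there is a page left to fetch.
    while len(items) < max_items and (first_request or next_token):
        page, next_token = fetch_page(next_token)
        first_request = False
        if not page:
            break
        items.extend(page)

    return items


# Stand-in for the API: three pages of two items each, keyed by token.
FAKE_PAGES = {
    None: (["t1", "t2"], "tok-a"),
    "tok-a": (["t3", "t4"], "tok-b"),
    "tok-b": (["t5", "t6"], None),
}


def fake_fetch(next_token):
    return FAKE_PAGES[next_token]


print(paginate(fake_fetch, max_items=3))
print(paginate(fake_fetch, max_items=100))
```

Like `retrieve_tweets()`, the sketch keeps whole pages, so it can return somewhat more than `max_items`; it stops as soon as the limit is reached or the last page carries no token.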

## Invoke `retrieve_tweets()` with a keyword

The first invocation searches for English-language tweets from verified users, excluding retweets.

The second invocation searches for English-language retweets from all users, verified or not.

In [6]:
search_keyword = 'ChatGPT'

df_tweets = retrieve_tweets(
    query=f'"{search_keyword}" -is:retweet is:verified lang:en',
    max_num_tweets=5000
)

df_retweets = retrieve_tweets(
    query=f'"{search_keyword}" is:retweet lang:en',
    max_num_tweets=5000
)

display(df_tweets.head(3))
print(f"df_tweets has {df_tweets.shape[0]} row(s)")
display(df_retweets.head(3))
print(f"df_retweets has {df_retweets.shape[0]} row(s)")

Retrieving tweets using query: "ChatGPT" -is:retweet is:verified lang:en
num_tweets_retrieved=0, max_results=100, next_token=None
num_tweets_retrieved=100, max_results=100, next_token=b26v89c19zqg8o3fqk70du2zzuniq3oq41sg2rtdf9njx
num_tweets_retrieved=200, max_results=100, next_token=b26v89c19zqg8o3fqk70du2ejjroey2sg7m0cmvlv9ksd
num_tweets_retrieved=300, max_results=100, next_token=b26v89c19zqg8o3fqk70du23l079iimfv5q4bx09sq2yl
num_tweets_retrieved=400, max_results=100, next_token=b26v89c19zqg8o3fqk70du1i37bvfy5t8s81007yi3r7h
num_tweets_retrieved=500, max_results=100, next_token=b26v89c19zqg8o3fqk70drz6ascr52oai4o9mfvfgi16l
num_tweets_retrieved=600, max_results=100, next_token=b26v89c19zqg8o3fqk70drxz9mi8p1fnh1m6qjgc4h8cd
num_tweets_retrieved=700, max_results=100, next_token=b26v89c19zqg8o3fqk70drx2orhj03hsx9atnbe2w08e5
num_tweets_retrieved=800, max_results=100, next_token=b26v89c19zqg8o3fqk70dpu5j3aoy18onnqfql4tnrm2l
num_tweets_retrieved=900, max_results=100, next_token=b26v89c19zqg8o3f

| | id | username | text | entities |
|---|---|---|---|---|
| 0 | 1623376134317690881 | thedailybeast | Plus! Weill and Sommer recap a bombshell investigation into Eliza Bleu, an overnight internet sensation who rose to prominence by becoming an ombu... | |
| 1 | 1623376131570413568 | thedailybeast | Also on the podcast, Weill and Sommer interview @trevoraaronson, the host of investigative podcast the Alphabet Boys, which tells the story of FBI... | {'mentions': [{'start': 48, 'end': 63, 'username': 'trevoraaronson', 'id': '22180521'}]} |
| 2 | 1623376128651182081 | thedailybeast | “Ultimately, I think this AI stuff is awful,” Sommer says. “I think the people in 'Dune' knew what was up when they banned AI and it seemed like t... | |


df_tweets has 5009 row(s)


| | id | username | text | entities |
|---|---|---|---|---|
| 0 | 1623376268245798939 | afaf11140627605 | RT @acrianetwork: https://t.co/SZ1H9KCR87 will take #ChatGPT to the next level: We enable the sharing of training data 💡\n\n10 Million AINF T… | {'mentions': [{'start': 3, 'end': 16, 'username': 'acrianetwork', 'id': '1344595602479468546'}]} |
| 1 | 1623376261480587265 | ayon_parvez | RT @EstadoLatente: The ΔI𝚝𝚊𝚗𝚜 are a new breed of being\n❤️‍🔥\nSubmission teaser for @runwayml's #AIFilm Festival. Pictures, animation, poems… | {'mentions': [{'start': 3, 'end': 17, 'username': 'EstadoLatente', 'id': '1137227054'}, {'start': 82, 'end': 91, 'username': 'runwayml', 'id': '10... |
| 2 | 1623376255163764740 | LanieGirl00 | RT @StopTechnocracy: AI will totally disrupt the knowledge world in 2023\nhttps://t.co/b3tIxqIHpC | {'mentions': [{'start': 3, 'end': 19, 'username': 'StopTechnocracy', 'id': '77918673'}]} |


df_retweets has 5000 row(s)
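
`df_tweets` has 5009 rows although `max_num_tweets=5000` was requested. One plausible explanation, assuming at least one earlier page returned fewer than 100 tweets: the recent search endpoint only accepts `max_results` between 10 and 100, so `retrieve_tweets()` clamps every request to that range and the final request can overshoot by up to 9 tweets. The sketch below isolates that expression; `page_size` is a hypothetical helper name.

```python
def page_size(max_num_tweets, num_tweets_retrieved):
    # Same expression as in retrieve_tweets(): at most 100, at least 10.
    return max(min(max_num_tweets - num_tweets_retrieved, 100), 10)


print(page_size(5000, 0))     # 100
print(page_size(5000, 4950))  # 50
print(page_size(5000, 4999))  # 10: only 1 tweet is missing, but 10 are requested
```

If an exact cap matters, the result can be trimmed afterwards with `df_tweets.head(5000)`.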


## Save as CSV files

The `entities` column is dropped before saving. It can be useful, but mentions can be re-extracted from the tweet text with regular expressions later if required.

In [7]:
for d in [df_tweets, df_retweets]:
    if 'withheld' in d.columns:
        d.drop(columns=['withheld'], inplace=True)

df_tweets.drop(columns=['entities']).to_csv(f'{search_keyword}-tweets.csv', index=False)
df_retweets.drop(columns=['entities']).to_csv(f'{search_keyword}-retweets.csv', index=False)
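
The regular-expression route for recovering mentions from the saved `text` column could look like the sketch below. It assumes that usernames consist of letters, digits and underscores and are at most 15 characters long; `MENTION_PATTERN` and `extract_mentions` are hypothetical names. Unlike the `entities` column, this approach yields usernames only, without user ids or character offsets.

```python
import re

# Assumption: usernames are letters, digits and underscores, up to 15 characters.
# The lookbehind skips '@' inside e-mail addresses; the lookahead rejects
# candidates that run on past 15 characters.
MENTION_PATTERN = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")


def extract_mentions(text):
    """Return the usernames mentioned in a tweet, without the leading '@'."""
    return MENTION_PATTERN.findall(text)


sample = "RT @EstadoLatente: Submission teaser for @runwayml's #AIFilm Festival."
print(extract_mentions(sample))  # ['EstadoLatente', 'runwayml']
```

Applied to a DataFrame loaded from one of the CSV files, `df['text'].apply(extract_mentions)` would produce one list of usernames per tweet.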