# Twitter Text Report

## By Quan Vo

### 10/25/22

For this assignment, I am tasked to collect tweets about a topic of my choice using the Twitter Recent Search API. I decided to choose the Russia-Ukraine War as my topic due to it being a current event that's gotten a lot of attention, so I could gather a lot of information from this topic. I didn't want to collect tweets from simply anybody though, so I narrowed my search down to verified and professional accounts. As a result, the first thing I thought of was to search for tweets specifically from a variety news and journalism accounts on Twitter. I figured these sources would offer professional insight on what's happening in the war.

In [622]:
import pandas as pd
import json
import requests
import urllib

When starting out, I initially had my question for the topic be "what are the most common viewpoints related to the war?" However, I realized that this question might not be as possible to answer because since I'm only collecting tweets from news and journalistic sources, they're more than likely to have wildly differing views from one another, so searching for similar viewpoints might be a bit of a challenge. Instead, I decided to have my question be "which events related to the war are being reported on the most?" The accounts I've chosen include but are not limited to: CNN, Fox News, Reuters, Guardian, Russia Today, and The Kyiv Independent.

In [623]:
endpoint = 'https://api.twitter.com/2/tweets/search/recent'
bt = pd.read_csv('Twitter_Token_9-22.txt', header = 0)

In [624]:
header = {'Authorization': 'Bearer {}'.format(bt['Bearer_Token'][0])}

I started off building the search query with urllib, choosing specific words related to Russia and Ukraine, and then choosing which accounts to search these specific words from. After that, I used variables to type out the specific fields that I want to display in my data frame.

In [625]:
query_param = urllib.parse.quote('(russia OR ukraine OR putin OR nato OR azov OR kiev OR kyiv OR zelenskyy OR russian OR russians OR ukrainian OR ukrainians) (from:KyivIndependent OR from:RT_com OR from:guardian OR from:CNN OR from:FoxNews OR from:reuters OR from:BBCBeaking OR from:nytimes OR from:democracynow)')

In [626]:
user_fields = 'username,name'

In [627]:
expansions = 'entities.mentions.username'

In [628]:
tweet_fields = 'author_id,public_metrics,created_at,lang'

In [629]:
query_url = endpoint + '?query={}&user.fields={}&expansions={}&tweet.fields={}&max_results=100'.format(query_param, user_fields, expansions, tweet_fields)

In [630]:
response = requests.get(query_url, headers = header)

In [631]:
response.status_code

200

In [632]:
# response.text

In [633]:
response_dict = json.loads(response.text)

In [634]:
response_dict.keys()

dict_keys(['data', 'includes', 'meta'])

In [635]:
type(response_dict['data'])

list

In [636]:
type(response_dict['data'][0])

dict

The cell below contains the different fields that are inside the data of the variable response_dict. These will be generated as columns for the first draft of my data frame. Some of the fields provided are what I need ('created_at', 'author_id', 'text') while others are fields that I could exclude ('edit_history_tweet_ids'). There's even one field that needed to be separated into multiple columns ('public_metrics'). The only other field I needed was 'username' but unfortunately, I was not able to include that within the data of resonse_dict, so it was left out. It seems that response_dict['data'] only included the fields from "tweet.fields" and completely excluded "user.fields" from the data. I unfortunately was not able to figure out how to fix that.

In [637]:
response_dict['data'][0].keys()

dict_keys(['lang', 'edit_history_tweet_ids', 'text', 'author_id', 'created_at', 'id', 'public_metrics'])

In [638]:
response_dict['data'][0]['text']

"Infrastructure bottlenecks hamper Russia's booming coal exports to China https://t.co/GVGOtJrQbm https://t.co/P2lVj9Impf"

In [639]:
type(response_dict['data'][0]['public_metrics'])

dict

In [640]:
response_dict['data'][0]['public_metrics'].keys()

dict_keys(['retweet_count', 'reply_count', 'like_count', 'quote_count'])

In [641]:
response_dict['data'][0]['public_metrics']['like_count']

18

In [642]:
response_dict['data'][4]['id']

'1585101864642629633'

In [643]:
response_df = pd.DataFrame(response_dict['data'])

Here is the first draft of the data frame that I generated. As you can see, each column is a different field containing different information. "public_metrics" is the only field that contains multiple types of information, so I need to take steps in order to separate that field into different columns. Another thing I noticed is that this data also contains fields called "withheld" and "entities." I'm not really sure why the data frame contains those fields considering that "response_dict['data'][0].keys()" never listed them. Either way, I think that these two fields aren't necessary to use, so I also need to add some code in order to remove them.

In [644]:
response_df.head()

Unnamed: 0,lang,edit_history_tweet_ids,text,author_id,created_at,id,public_metrics,withheld,entities
0,en,[1585114127197937664],Infrastructure bottlenecks hamper Russia's boo...,1652541,2022-10-26T03:40:32.000Z,1585114127197937664,"{'retweet_count': 7, 'reply_count': 0, 'like_c...",,
1,en,[1585114006074818561],A Russian court upheld Brittney Griner’s sente...,807095,2022-10-26T03:40:03.000Z,1585114006074818561,"{'retweet_count': 19, 'reply_count': 33, 'like...",,
2,en,[1585111475517100033],US may send antiquated missiles to Ukraine – R...,64643056,2022-10-26T03:30:00.000Z,1585111475517100033,"{'retweet_count': 8, 'reply_count': 9, 'like_c...","{'copyright': False, 'country_codes': ['AT', '...",
3,en,[1585101866849144833],"""The invocation of war on religious but not ov...",1462548977367359490,2022-10-26T02:51:49.000Z,1585101866849144833,"{'retweet_count': 44, 'reply_count': 14, 'like...",,
4,en,[1585101864642629633],Assistant Secretary to the Russian Security Co...,1462548977367359490,2022-10-26T02:51:48.000Z,1585101864642629633,"{'retweet_count': 43, 'reply_count': 15, 'like...",,


In [645]:
response_dict['meta']['next_token']

'b26v89c19zqg8o3fpzel4w4gviea0lc66c67j265hpvgd'

In [646]:
next_query_url = query_url + "&next_token={}".format(response_dict['meta']['next_token'])

In [647]:
next_response = requests.get(next_query_url, headers = header)

In [648]:
next_response.status_code

200

In [649]:
# next_response.text

In [650]:
next_response_dict = json.loads(next_response.text)

In [651]:
next_response_dict['meta']

{'newest_id': '1584907641762947073',
 'oldest_id': '1584647195533320192',
 'result_count': 100,
 'next_token': 'b26v89c19zqg8o3fpzel4prhl6qnnq14leykz20q4keil'}

In order to gather at least 300 tweets, I needed to create a function that allows me to do so. I named the function "twt_recent_search" and the parameters contain the query, the number of pages for tweets, and the header. Inside the function contains a for loop that ranges from 0 to the number of pages that will be entered. This will determine the number of tweets that will be collected.

In [652]:
def twt_recent_search (query, num_pages, header):
    response_list = []
    next_token = ''
    for i in range(0, num_pages):
        if i > 0:
            this_query = query + "&next_token={}".format(next_token)
        else:
            this_query = query
        
        this_response = requests.get(this_query, headers = header)
        print(this_response.status_code)
        this_response_dict = json.loads(this_response.text)
        response_list.append(this_response_dict)
        next_token = this_response_dict['meta']['next_token']
        
    return response_list

I made a new variable called "my_responses" In order to use the function. The function itself generates 100 tweets per page, so I entered 3 pages for the second parameter of the function.

In [653]:
my_responses = twt_recent_search(query_url, 3, header)

200
200
200


The second draft of my data frame required multiple steps, so I needed to create more than one variable in order for the data frame to display 300 rows of tweets. This needed "pd.DataFrame.from_records" to convert the data, and then listing the data, to concatenating the data.

In [654]:
results_1 = pd.DataFrame.from_records(my_responses)

In [655]:
data_list = list(results_1['data'])

In [656]:
data_list_of_dfs = [pd.DataFrame(x) for x in data_list]

In [657]:
data_df = pd.concat(data_list_of_dfs)

Here's what the second data frame draft looks like. It's now at 300 rows/tweets but some of the columns still need to be removed and "public_metrics" still needs to be separated.

In [658]:
data_df

Unnamed: 0,public_metrics,author_id,id,lang,created_at,edit_history_tweet_ids,text,withheld,entities
0,"{'retweet_count': 7, 'reply_count': 0, 'like_c...",1652541,1585114127197937664,en,2022-10-26T03:40:32.000Z,[1585114127197937664],Infrastructure bottlenecks hamper Russia's boo...,,
1,"{'retweet_count': 19, 'reply_count': 33, 'like...",807095,1585114006074818561,en,2022-10-26T03:40:03.000Z,[1585114006074818561],A Russian court upheld Brittney Griner’s sente...,,
2,"{'retweet_count': 8, 'reply_count': 9, 'like_c...",64643056,1585111475517100033,en,2022-10-26T03:30:00.000Z,[1585111475517100033],US may send antiquated missiles to Ukraine – R...,"{'copyright': False, 'country_codes': ['AT', '...",
3,"{'retweet_count': 44, 'reply_count': 14, 'like...",1462548977367359490,1585101866849144833,en,2022-10-26T02:51:49.000Z,[1585101866849144833],"""The invocation of war on religious but not ov...",,
4,"{'retweet_count': 43, 'reply_count': 15, 'like...",1462548977367359490,1585101864642629633,en,2022-10-26T02:51:48.000Z,[1585101864642629633],Assistant Secretary to the Russian Security Co...,,
...,...,...,...,...,...,...,...,...,...
95,"{'retweet_count': 24, 'reply_count': 10, 'like...",1652541,1584408150286950400,en,2022-10-24T04:55:14.000Z,[1584408150286950400],Oil prices slide as China demand data disappoi...,,
96,"{'retweet_count': 28, 'reply_count': 20, 'like...",1652541,1584401870394515456,en,2022-10-24T04:30:17.000Z,[1584401870394515456],Russia fired missiles and drones into the Ukra...,,
97,"{'retweet_count': 41, 'reply_count': 15, 'like...",64643056,1584401801419292672,en,2022-10-24T04:30:00.000Z,[1584401801419292672],Moscow warns Paris of ‘dirty bomb’ provocation...,"{'copyright': False, 'country_codes': ['AT', '...",
98,"{'retweet_count': 20, 'reply_count': 6, 'like_...",87818409,1584395308586319872,en,2022-10-24T04:04:12.000Z,[1584395308586319872],Terror to elation: Ukrainian woman’s journey f...,,


In [659]:
len(data_df)

300

Another new variable needed to be created in order to remove some of the columns. The ".drop" function is what I used to get rid of them.

In [660]:
full_data = data_df.drop(columns = ['edit_history_tweet_ids', 'id', 'withheld', 'entities'])

Third draft of the data frame. Now all I need left is to isolate the information from "public_metrics."

In [661]:
full_data

Unnamed: 0,public_metrics,author_id,lang,created_at,text
0,"{'retweet_count': 7, 'reply_count': 0, 'like_c...",1652541,en,2022-10-26T03:40:32.000Z,Infrastructure bottlenecks hamper Russia's boo...
1,"{'retweet_count': 19, 'reply_count': 33, 'like...",807095,en,2022-10-26T03:40:03.000Z,A Russian court upheld Brittney Griner’s sente...
2,"{'retweet_count': 8, 'reply_count': 9, 'like_c...",64643056,en,2022-10-26T03:30:00.000Z,US may send antiquated missiles to Ukraine – R...
3,"{'retweet_count': 44, 'reply_count': 14, 'like...",1462548977367359490,en,2022-10-26T02:51:49.000Z,"""The invocation of war on religious but not ov..."
4,"{'retweet_count': 43, 'reply_count': 15, 'like...",1462548977367359490,en,2022-10-26T02:51:48.000Z,Assistant Secretary to the Russian Security Co...
...,...,...,...,...,...
95,"{'retweet_count': 24, 'reply_count': 10, 'like...",1652541,en,2022-10-24T04:55:14.000Z,Oil prices slide as China demand data disappoi...
96,"{'retweet_count': 28, 'reply_count': 20, 'like...",1652541,en,2022-10-24T04:30:17.000Z,Russia fired missiles and drones into the Ukra...
97,"{'retweet_count': 41, 'reply_count': 15, 'like...",64643056,en,2022-10-24T04:30:00.000Z,Moscow warns Paris of ‘dirty bomb’ provocation...
98,"{'retweet_count': 20, 'reply_count': 6, 'like_...",87818409,en,2022-10-24T04:04:12.000Z,Terror to elation: Ukrainian woman’s journey f...


In order to separate the information from "public_metrics," I started off initializing a new variable called "final_data" and had that equal to "pd.DataFrame(list(full_data['public_metrics']))." That way, it will create a data frame that only lists the information from "public_metrics." This data frame itself contains different columns that I need to include in my final data frame.

In [662]:
final_data = pd.DataFrame(list(full_data['public_metrics']))

In [663]:
final_data

Unnamed: 0,retweet_count,reply_count,like_count,quote_count
0,7,0,18,2
1,19,33,75,8
2,8,9,26,3
3,44,14,244,0
4,43,15,232,5
...,...,...,...,...
295,24,10,88,4
296,28,20,65,8
297,41,15,138,4
298,20,6,44,1


I wanted to rename each of the columns in "public_metrics" as well, so I made another new variable and renamed each of them accordingly.

In [664]:
full_data['retweets'] = final_data['retweet_count']
full_data['replies'] = final_data['reply_count']
full_data['likes'] = final_data['like_count']
full_data['quote_tweets'] = final_data['quote_count']

Fourth draft of my data frame. I finally took the columns out of "public_metrics" but that field itself is still displaying as a separate column along with the rest. My final step is to remove that field.

In [665]:
full_data.head()

Unnamed: 0,public_metrics,author_id,lang,created_at,text,retweets,replies,likes,quote_tweets
0,"{'retweet_count': 7, 'reply_count': 0, 'like_c...",1652541,en,2022-10-26T03:40:32.000Z,Infrastructure bottlenecks hamper Russia's boo...,7,0,18,2
1,"{'retweet_count': 19, 'reply_count': 33, 'like...",807095,en,2022-10-26T03:40:03.000Z,A Russian court upheld Brittney Griner’s sente...,19,33,75,8
2,"{'retweet_count': 8, 'reply_count': 9, 'like_c...",64643056,en,2022-10-26T03:30:00.000Z,US may send antiquated missiles to Ukraine – R...,8,9,26,3
3,"{'retweet_count': 44, 'reply_count': 14, 'like...",1462548977367359490,en,2022-10-26T02:51:49.000Z,"""The invocation of war on religious but not ov...",44,14,244,0
4,"{'retweet_count': 43, 'reply_count': 15, 'like...",1462548977367359490,en,2022-10-26T02:51:48.000Z,Assistant Secretary to the Russian Security Co...,43,15,232,5


In [666]:
final_df = full_data.drop(columns = ['public_metrics'])

After going through each and every step, this is what my final data frame looks like. It contains some very important fields that I need to answer my question, namely the "text" and "created_at" fields. I needed "text" obviously to read the tweets of each row and I needed "created_at" to display how recent the tweets are as a way to display their relevance. I think it's debatable whether viewing the number of likes, retweets, and replies are needed for my question but I decided to keep them just for convenience. The only other field that I needed was "username" but as stated, I unfortunately couldn't figure out how to include that in my data frame. This made it harder to answer my question because I didn't know which tweet belonged to which user without the "username" column. As a result, I decided to copy and paste my search query to the Twitter search engine in order to view which accounts posted each tweet.

As for what news stories were reported, there was a surprising amount of variety on what was being reported despite almost all of stories being linked to the conflict between Russia and Ukraine. even though most of the news reports are different, I did notice a couple a couple specific events being reported from more than one of the news sources I searched from. One of those events that come to mind being Russia's claims of Ukraine having a "dirty bomb" plan. Other than that, most of the stories aren't being repeated across the different news and journalism accounts and are mostly different from one another.

In [667]:
final_df

Unnamed: 0,author_id,lang,created_at,text,retweets,replies,likes,quote_tweets
0,1652541,en,2022-10-26T03:40:32.000Z,Infrastructure bottlenecks hamper Russia's boo...,7,0,18,2
1,807095,en,2022-10-26T03:40:03.000Z,A Russian court upheld Brittney Griner’s sente...,19,33,75,8
2,64643056,en,2022-10-26T03:30:00.000Z,US may send antiquated missiles to Ukraine – R...,8,9,26,3
3,1462548977367359490,en,2022-10-26T02:51:49.000Z,"""The invocation of war on religious but not ov...",44,14,244,0
4,1462548977367359490,en,2022-10-26T02:51:48.000Z,Assistant Secretary to the Russian Security Co...,43,15,232,5
...,...,...,...,...,...,...,...,...
95,1652541,en,2022-10-24T04:55:14.000Z,Oil prices slide as China demand data disappoi...,596,61,5246,21
96,1652541,en,2022-10-24T04:30:17.000Z,Russia fired missiles and drones into the Ukra...,14,8,26,8
97,64643056,en,2022-10-24T04:30:00.000Z,Moscow warns Paris of ‘dirty bomb’ provocation...,20,2,74,1
98,87818409,en,2022-10-24T04:04:12.000Z,Terror to elation: Ukrainian woman’s journey f...,215,13,1650,6


In [668]:
final_df.to_csv(r'C:\Users\Quan\EMAT22110_FA22\Vo_TwitterTextReport.csv')