# Collecting and analysing social media data: A demo
**ResBaz 2022** 

*Noel Zeng, Centre for eResearch, Waipapa Taumata Rau | University of Auckland*

[noel.zeng@auckland.ac.nz](mailto:noel.zeng@auckland.ac.nz)

## Acknowledgement
Parts of this demo draw from [NYU's NLP+CSS 201 course](https://nlp-css-201-tutorials.github.io/nlp-css-201-tutorials/). It's an excellent resource with recordings and notebooks online; do check it out!

## Introduction

Social media data can be a rich source for social science research. Yet it can be daunting to get started, and you will need quantitative methods to handle the volume and velocity of data that sites like Twitter provide. Thankfully, there are lots of tools, libraries and pre-trained models in the Python ecosystem that can help you collect, preprocess and analyse the data.

Through this demo, I hope to give you a taste of the kind of work that goes into collecting, preprocessing and analysing social media data, and the kind of research that natural language processing and computational social science techniques unlock.

### Prerequisites - do these first!
To start collecting and analysing Twitter data as presented here, you will first need:
- [ ] Twitter API access - apply through [Twitter Developer](https://developer.twitter.com/) 
- [ ] Basic Unix shell knowledge (e.g. [SWC Bash Lessons 1 - 4](https://swcarpentry.github.io/shell-novice/)) - to interact with command-line tools like twarc2
- [ ] Basic Python knowledge (e.g. [SWC Python all lessons](http://swcarpentry.github.io/python-novice-gapminder/)) - to interact with tools like Jupyter Notebook and libraries such as Pandas, NLTK and SpaCy
- [ ] Go through your institution's ethics guidelines on using social media data, if any. University of Auckland has created a module on [Social Media Platforms](https://www.coursebuilder.cad.auckland.ac.nz/flexicourses/5066/publish/1/15_1.html)
- [ ] If you want to follow along on your own computer, install and set up required libraries per [the README document](README.md).

### Scenario
Nation states are increasingly conducting digital diplomacy to "amplify the government's policy priorities and perspective on world events and, in doing so, influence the global narrative on contemporary issues." ([Collins et al 2019](https://link.springer.com/article/10.1057/s41254-019-00119-5))

Let's explore the Twitter accounts of a few different foreign affairs ministries from states around the Pacific, to see what they're talking about. What countries do they mention and engage with?

![A screenshot of Fiji's Ministry of Foreign Affairs Twitter account.](assets/fiji_mofa.png)

### Ethics and privacy
There are ethical concerns around social media data, particularly around obtaining users' consent and offering them the option to withdraw it. Because of context collapse ([Davis & Jurgenson 2014](https://www.tandfonline.com/doi/full/10.1080/1369118X.2014.888458?journalCode=rics20)), content that users post with the expectation that only their followers will see it can reach much wider audiences.

This scenario uses Tweets made by public, high-profile organisations, which sidesteps some of these concerns. However, it's worth noting their style of posting will be different from the average personal user - more formal, fewer uses of emojis, iRreGuLaR cASEs, or punctuation!!!

Refer to your ethics committee for guidance around how to collect and store data appropriately. University of Auckland has a module on [Social Media Platforms](https://www.coursebuilder.cad.auckland.ac.nz/flexicourses/5066/publish/1/15_1.html).

### A note about Jupyter notebooks...
We are using [Jupyter Notebook](https://jupyter.org/), a browser-based tool and format popular in data science that lets you write and run code, and annotate it with rich-text comments using Markdown. A notebook is made up of blocks of code, with results printed below each one. Here, we are writing both Python code and Unix shell commands. Unix shell commands have a `!` in front of them.

## Part One - Collecting data
Through the [Twitter API](https://developer.twitter.com/en/portal/dashboard), you can retrieve Tweets, Tweet metadata, and information about users. It lets you apply search criteria, but limits how many Tweets you can get. Everyone gets the Essentials access level; researchers can apply for the Academic level, which offers higher Tweet caps and more complex query terms.

APIs are not designed to be accessed manually. We are using **twarc2**, a command-line tool that retrieves data for you from version 2 of the Twitter API. It takes care of things like pagination and API request limits for you. Other tools are available, including browser-based ones, but they are harder to install.

Refer to the README file for setup and configuration instructions for the twarc tool.

To see a list of things `twarc2` can do for you, run this command:

In [1]:
! twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [de

We will create the needed directories, then run the `twarc2 timeline` command to collect tweets from a user's timeline.

In [109]:
! mkdir -p tweets demo
! twarc2 timeline MFATNZ > demo/mfatnz.jsonl

API limit of 3200 reached:  44%|█████▎      | 3226/7261 [01:23<01:44, 38.50it/s]


`twarc2` automatically makes multiple requests to the API, each of which returns several Tweets, and accumulates them into a file in the `demo` directory named `mfatnz.jsonl`. Note the message saying the API limit was reached after roughly 3,200 Tweets. This is due to [Twitter timeline API limits](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/introduction), which only return a user's most recent 3,200 Tweets. Other APIs will have other limits.
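To make the idea of pagination more concrete, here is a minimal sketch of the kind of loop a collection tool runs on your behalf. It is not twarc2's actual code: `fetch_page` is a hypothetical stand-in for a single API request, the "timeline" is just a list of numbers, and the page size of 100 is an assumption for illustration.

```python
def fetch_page(all_items, token, page_size=100):
    """Hypothetical stand-in for one API request.

    Returns up to `page_size` items plus a token pointing at the next page
    (None when there is nothing left to fetch).
    """
    start = token or 0
    page = all_items[start:start + page_size]
    next_start = start + page_size
    next_token = next_start if next_start < len(all_items) else None
    return {"data": page, "meta": {"next_token": next_token}}


def collect(all_items, cap=3200, page_size=100):
    """Keep requesting pages until the cap is reached or the pages run out."""
    pages = []
    collected = 0
    token = None
    while collected < cap:
        response = fetch_page(all_items, token, page_size)
        pages.append(response)
        collected += len(response["data"])
        token = response["meta"]["next_token"]
        if token is None:
            break
    return pages


# A pretend account with 7,261 Tweets, represented here by plain numbers.
fake_timeline = list(range(7261))
pages = collect(fake_timeline)
total = sum(len(page["data"]) for page in pages)
print(len(pages), "requests,", total, "items collected")
```

Each response carries a pointer to the next page, and the loop stops once the cap is hit even though the pretend account has more Tweets.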

If you open the file `mfatnz.jsonl` you will see what it has captured. The brackets and colons may seem confusing, but the data is in the [JSON format](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON).

Each line contains one API response, which holds several Tweets. This is not very easy to work with!

`twarc2` provides a way to break up, or "flatten" this file into individual Tweets. The resulting file has a Tweet in JSON format on each line.

In [4]:
! twarc2 flatten demo/mfatnz.jsonl demo/mfatnz.flatten.jsonl

100%|██████████████| Processed 17.7M/17.7M of input file [00:01<00:00, 11.1MB/s]
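Conceptually, flattening turns "one page of Tweets per line" into "one Tweet per line". The sketch below illustrates that idea in plain Python on two made-up pages. It is a simplification rather than what `twarc2 flatten` actually does (the real command also merges extra metadata into each Tweet), and the sample records are hypothetical and heavily trimmed.

```python
import json

# Two hypothetical API responses ("pages"), one per line, as in a .jsonl file.
raw_jsonl = "\n".join([
    json.dumps({
        "data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}],
        "meta": {"result_count": 2},
    }),
    json.dumps({
        "data": [{"id": "3", "text": "third"}],
        "meta": {"result_count": 1},
    }),
])


def flatten_pages(jsonl_text):
    """Yield one Tweet dict for every item in each page's "data" list."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        page = json.loads(line)
        for tweet in page.get("data", []):
            yield tweet


# Serialise each Tweet back to JSON, one per line.
flat_lines = [json.dumps(tweet) for tweet in flatten_pages(raw_jsonl)]
print(len(raw_jsonl.splitlines()), "pages ->", len(flat_lines), "Tweets")
```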


Since we know each line in this file represents a Tweet, let's see what a Tweet looks like in JSON format, and what extra metadata Twitter provides. The `jq` command prints JSON-formatted data in an easy-to-read way.

In [5]:
! head -n 1 demo/mfatnz.flatten.jsonl | jq

{
  "public_metrics": {
    "retweet_count": 4,
    "reply_count": 0,
    "like_count": 15,
    "quote_count": 1
  },
  "source": "Brandwatch",
  "lang": "en",
  "edit_history_tweet_ids": [
    "1592743018221780992"
  ],
  "text": "#Climate Change Minister James Shaw gave Aotearoa NZ’s National Statement at #COP27 overnight. He called for greater ambition to limit warming to 1.5 degrees, for all of us in the #Pacific. Read it here: New Zealand National Statement – COP27 | https://t.co/zGKgzCz2MQ",
  "id": "1592743

Finally, let's convert it into the CSV format, which can be opened by a spreadsheet program like Excel, or by Python libraries like Pandas.

In [5]:
! twarc2 csv demo/mfatnz.flatten.jsonl demo/mfatnz.csv

100%|██████████████| Processed 23.5M/23.5M of input file [00:01<00:00, 21.7MB/s]

ℹ️
Parsed 3224 tweets objects from 3224 lines in the input file.
Wrote 3224 rows and output 83 columns in the CSV.



Let's check how many Tweets we have from MFAT.

In [7]:
# wc -l counts how many lines there are in a file (this includes the CSV header row)
! wc -l demo/mfatnz.csv

3225 demo/mfatnz.csv
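The count is one higher than the 3224 rows reported by `twarc2 csv` because `wc -l` also counts the header line. As a cross-check, rows can be counted with a CSV parser instead. The sketch below uses Python's built-in `csv` module on a small made-up sample (the column names and values are hypothetical). In CSV files generally, a quoted field may contain a line break, so counting parsed rows is the more robust approach; in the file above the two counts line up once the header is subtracted.

```python
import csv
import io

# A hypothetical CSV: one header line, then three rows.
# The second row has a quoted text field that spans two lines.
sample_csv = 'id,text\n1,Kia ora\n2,"Bula\nvinaka"\n3,Talofa\n'

# What wc -l would report: every line break counts.
line_count = len(sample_csv.splitlines())

# What a CSV parser sees: the header is consumed as column names,
# and the quoted line break stays inside a single row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print("lines:", line_count, "rows:", len(rows))
```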


Let's move the CSV file into the `tweets` directory for the next step.

In [3]:
! mv demo/mfatnz.csv tweets/mfatnz.csv

**Exercise**: Now do the same for [@DFAT](https://twitter.com/dfat) (Australia) and [@Fiji_MOFA](https://twitter.com/fiji_mofa) (Fiji). We should end up with nine files in total, including three CSV files in the `tweets` folder, one for each organisation.

## Part Two - Analysing what the Foreign Affairs departments are saying
We've finished collecting data from Twitter. Let's see how many Tweets we now have.

In [8]:
! wc -l tweets/*.csv

    3250 tweets/dfat.csv
    1977 tweets/fiji_mofa.csv
    3225 tweets/mfatnz.csv
    8452 total


That's more than 8,000 Tweets! Let's import them into Pandas, a Python library that has neat ways to manipulate data. 

In [14]:
import pandas as pd
from pandas import Grouper
from tqdm import tqdm

# To display Tweet text in full, make the column width long.
pd.set_option('display.max_colwidth', 300)

In [15]:
aus_tweets = pd.read_csv("tweets/dfat.csv")
nz_tweets = pd.read_csv("tweets/mfatnz.csv")
fiji_tweets = pd.read_csv("tweets/fiji_mofa.csv")

To get a flavour of what the Tweets say, we can get a random sample of them. It's good to use this to check your assumptions about the data.

In [11]:
aus_tweets.sample(10).text

3083                     You're invited to join government and industry experts in Canberra on 27 February to network and hear how to take advantage of Australia's free trade agreements. \n\nFor more information, follow this link https://t.co/IAW3I73QuC\n\n#Ausbiz #freetrade #commercialdiplomacy @Austrade
2926                       🇦🇺 Australia has delivered emergency relief supplies to Fiji 🇫🇯  following #TCHarold. Supplies include shelters, water containers &amp; personal hygiene kits to help our Pacific family get back on their feet. #Vuvale @AusHCFJ @DeptDefence @AusHumanitarian https://t.co/F2t4Jd7vdz
219                                                   @PaulD57769364 @DFATVic @SenatorWong @3AWNeilMitchell Hi @PaulD57769364, we don’t have enough information from your comment to identify your application. Please call us on 131 232 or email passports.clientservices@dfat.gov.au with your contact details.
1362                       #ASEAN is at the heart of an inclusive, resilient &a

### How do we get through more than 8,000 Tweets?
With Twitter data, we may be handling thousands or even millions of Tweets at a time, and reading through all of them quickly becomes untenable. Instead, we can rely on [distant reading](https://en.wikipedia.org/wiki/Distant_reading), which applies computational techniques to analyse a large body of text as a whole.

This is where natural language processing (NLP) methods can help, specifically a branch of NLP called Information Extraction (IE). IE research looks at how to automate various tasks, many of which humans take for granted. Some of them include:
* Named-entity recognition (NER) - identify named entities like people, organisations, dates.
* Entity linking/disambiguation - a common next step after NER, working out which names refer to the same entity. Names are often linked to a knowledge base of predefined entities.
* Relation extraction - identify relationship between identified entities.
* Event extraction/frame semantic parsing - finding whether a given event is described in the text, and which words are associated with which roles. e.g. an ATTACK event might have a perpetrator, victim and weapon. 

Recent advances in deep learning techniques and the availability of big textual corpora have improved the accuracy of these techniques.

Preprocessing the text is needed to make these approaches more successful. This includes:
* Tokenising - breaking a sentence into its constituent words or "tokens". There are Twitter-specific things to watch out for, such as hashtags, emojis and casing.
* Removing stopwords - dropping very common words (e.g. "the", "of") that carry little meaning on their own, and sometimes very rare words too.
* Normalising case - e.g. converting iRreguLar CaSiNG to lowercase.
* Stemming and lemmatising - reducing words to a base form, e.g. "writing", "wrote" -> "write".
* Part-of-speech tagging - identifying the role of each word, e.g. nouns, verbs, prepositions.

The steps required vary between models, and depend on the techniques and text features you want to study. For example, it may not be desirable to convert everything to lowercase if you wish to study the emotional content of Tweets, as casing can express emotions.
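To give a feel for a few of these steps, here is a minimal sketch of tokenising, lowercasing and stopword removal using only Python's standard library. It is deliberately crude and is not how SpaCy or NLTK do it: the regular expression, the tiny stopword list and the `preprocess` helper are all assumptions made up for illustration, and the example Tweet is a shortened version of one from the sample above.

```python
import re

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "and", "is", "for", "our"}

# Try URLs first, then @mentions and #hashtags, then ordinary words,
# so that Twitter-specific tokens are kept in one piece.
TOKEN_PATTERN = re.compile(r"https?://\S+|[@#]\w+|\w+(?:['’]\w+)?")


def preprocess(text, lowercase=True, drop_stopwords=True):
    """Split text into tokens, optionally lowercasing and dropping stopwords."""
    tokens = TOKEN_PATTERN.findall(text)
    if lowercase:
        # Leave URLs untouched, since their paths can be case-sensitive.
        tokens = [t if t.startswith("http") else t.lower() for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


tweet = (
    "Australia has delivered emergency relief supplies to Fiji "
    "following #TCHarold @AusHCFJ https://t.co/F2t4Jd7vdz"
)
print(preprocess(tweet))
print(preprocess(tweet, lowercase=False))
```

Passing `lowercase=False` keeps the original casing, which is the kind of per-study choice described above.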

**Exercise**: How might we apply these techniques to the Tweets from the foreign affairs ministries?

**We will focus on a simple application of named-entity recognition today.** Let's try to find out how often these foreign affairs ministries talk about other countries!

### Introducing SpaCy
SpaCy is an "industrial-strength natural language processing" library. It incorporates various pre-written, pre-trained NLP models, so you don't have to start from scratch. By default, its pipelines also include many pre-processing steps.

#### What's the difference between NLTK and SpaCy?
You may have heard about NLTK, which is a popular library for teaching NLP and computational linguistics. The difference comes down to their approach and included models. Approach-wise, NLTK gives you lots of useful functions that you can combine to create an NLP pipeline, which makes it ideal for NLP teaching and research. SpaCy offers you a curated pipeline that you can customise, which makes it easy to use. NLTK is like a set of Lego bricks, compared to SpaCy which is like a pre-built Lego model. SpaCy also offers newer, more accurate pre-trained models, so you don't have to train them yourself. We will be drawing from both libraries today.

In [13]:
import spacy
import nltk
from spacy import displacy
from pprint import pprint

# SpaCy offers different pre-trained language models of various sizes you can download. We are using the small (_sm) version.
nlp = spacy.load("en_core_web_sm")

Let's see what SpaCy can do out of the box. First we can select a random Tweet.

In [17]:
tweet = aus_tweets.sample(1).text.to_list()[0]
tweet

"DFAT's @CentreHealthSec is supporting research in Indonesia on COVID-19 treatment practices by community pharmacies and drug stores. Find out more about the PINTAR study 👇 https://t.co/kqt3V2y8Zs"

Then, we call SpaCy's `nlp` object like a function to process the Tweet using the `en_core_web_sm` model we loaded above. The built-in `displacy.render` visualiser function can show the Tweet with all the named entities the pipeline recognised and classified. This function can be used to show other features recognised by the pipeline too.

In [18]:
doc = nlp(tweet)
displacy.render(doc, style="ent", jupyter=True)

Looks like the pipeline was able to pick up on the fact that Indonesia is a country, and labelled it as a GPE (**g**eo**p**olitical **e**ntity). Each model in SpaCy has different labels it may apply. You can use the `spacy.explain` function to learn more about each label, look at [`en_core_web_sm` model's labels](https://spacy.io/models/en#en_core_web_sm), and learn more about [SpaCy's linguistics features labelling](https://spacy.io/usage/linguistic-features) in general.

To get back to the task at hand, we can figure out how often nation-states talk about each other by summing up the mentions. We can programmatically access the entity labels in each Tweet and tally up the frequency for each mentioned country.

In [106]:
def freq_gpe(df):
    # First, we process all the Tweets using the nlp pipeline.
    tweets = df['text'].tolist()
    docs = [
        # tqdm shows us a progress bar as we loop through the Tweets.
        nlp(t) for t in tqdm(tweets)
    ]
    # Go through each processed Tweet, look through its list of entities
    # and keep the ones labelled as GPEs.
    # This gives a list of lists of GPE entities, one list per Tweet.
    gpe_ent_lists = [
        [ent for ent in doc.ents if ent.label_ == "GPE"]
        for doc in docs
    ]
    # Flatten that list of lists into a single list of entities from all Tweets.
    gpe_ents = []
    for ents in gpe_ent_lists:
        gpe_ents += ents
    # Finally, we use NLTK's FreqDist class to count how often
    # each entity's text occurs.
    return nltk.FreqDist([ent.text for ent in gpe_ents])

We now apply this function to each Foreign Affairs department's Twitter account.

In [96]:
aus_freq = freq_gpe(aus_tweets)

100%|██████████████████████████████████████████████████████████████| 3249/3249 [00:15<00:00, 203.10it/s]


Using `FreqDist`'s `most_common` method, we can look at the 20 most frequently occurring entities labelled as geopolitical entities. In other words, roughly each foreign affairs ministry's 20 most commonly mentioned countries.

In [26]:
aus_freq.most_common(20)

[('Australia', 909),
 ('Indonesia', 74),
 ('India', 66),
 ('Fiji', 65),
 ('Ukraine', 52),
 ('Russia', 41),
 ('Vietnam', 39),
 ('Tonga', 37),
 ('UK', 33),
 ('US', 32),
 ('Afghanistan', 29),
 ('China', 25),
 ('Japan', 23),
 ('Myanmar', 22),
 ('Philippines', 21),
 ('Canberra', 20),
 ('Singapore', 20),
 ('Solomon Islands', 20),
 ('Beirut', 18),
 ('Canada', 16)]

In [101]:
nz_freq = freq_gpe(nz_tweets)

100%|██████████████████████████████████████████████████████████████| 3224/3224 [00:17<00:00, 186.12it/s]


In [102]:
nz_freq.most_common(20)

[('New Zealand', 785),
 ('NZ', 453),
 ('New Zealand’s', 102),
 ('Australia', 59),
 ('US', 55),
 ('UK', 54),
 ('India', 50),
 ('Japan', 46),
 ('China', 46),
 ('Wellington', 45),
 ('Ukraine', 43),
 ('Russia', 42),
 ('Fiji', 41),
 ('Canada', 39),
 ('Singapore', 37),
 ('Auckland', 30),
 ("New Zealand's", 26),
 ('Myanmar', 26),
 ('New York', 26),
 ('Chile', 26)]

In [104]:
fiji_freq = freq_gpe(fiji_tweets)

100%|██████████████████████████████████████████████████████████████| 1976/1976 [00:10<00:00, 184.78it/s]


In [105]:
fiji_freq.most_common(20)

[('Fiji', 1034),
 ('Australia', 86),
 ('New Zealand', 42),
 ('Suva', 38),
 ('Japan', 37),
 ('India', 37),
 ('TCYasa', 21),
 ('the Republic of Fiji', 20),
 ('Nadi', 20),
 ('Geneva', 19),
 ('UK', 17),
 ('US', 16),
 ('USA', 16),
 ('Canada', 16),
 ('🇯', 15),
 ('the United States', 15),
 ('New York', 15),
 ('Vuvale', 14),
 ('Canberra', 14),
 ('@COP26', 14)]

## We did it!

I hope this gives a taste of what's possible with NLP techniques. There are many things to improve on. As you can see, some entities are obviously not countries (e.g. "Canberra", "@COP26"), and some separate entities should be counted as one (e.g. "the United States", "US" and "USA"). Applying preprocessing steps by customising the SpaCy pipeline could reduce these problems.
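One simple way to tackle the duplicate-entity problem, short of proper entity linking, is to merge counts through a hand-written alias table. The sketch below shows the idea with Python's built-in `Counter`, using a handful of counts copied from the Fiji output above. The `ALIASES` table and the `merge_aliases` helper are hypothetical and would need to be extended by hand for real use. Since NLTK's `FreqDist` is a subclass of `Counter`, the same function should also accept the `FreqDist` objects returned by `freq_gpe`.

```python
from collections import Counter

# A few counts copied from the Fiji results above.
fiji_counts = Counter({
    "Fiji": 1034,
    "Australia": 86,
    "the Republic of Fiji": 20,
    "US": 16,
    "USA": 16,
    "the United States": 15,
})

# Hypothetical, hand-written table mapping variants to one canonical name.
ALIASES = {
    "US": "United States",
    "USA": "United States",
    "the United States": "United States",
    "the Republic of Fiji": "Fiji",
}


def merge_aliases(counts, aliases):
    """Add up counts for names that refer to the same entity."""
    merged = Counter()
    for name, n in counts.items():
        # Names without an alias entry are kept as they are.
        merged[aliases.get(name, name)] += n
    return merged


merged = merge_aliases(fiji_counts, ALIASES)
print(merged.most_common(3))
# [('Fiji', 1054), ('Australia', 86), ('United States', 47)]
```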

**Exercise:** Try using another SpaCy model, such as `en_core_web_trf`. How does it differ in results? What other things could we do to improve the results?