## Introduction to Computational Social Science methods with Python

### Natural Language Processing - Information Extraction

<div class='alert alert-block alert-success'>
<b>In this Python notebook</b>, 

we will explore how to use
- regular expressions and, 
- named entity recognition (NER) 

to perform information extraction from a corpus of tweets. 

Information extraction is a critical task in natural language processing (NLP), which involves identifying and extracting relevant information from unstructured text data.

Regular expressions are a powerful tool for pattern matching and text manipulation. They allow us to define complex patterns of characters and symbols that can match specific text patterns, such as email addresses, phone numbers, or URLs. Regular expressions are particularly useful for information extraction tasks that involve dates, times, or addresses.

Named entity recognition (NER) is a technique for identifying and classifying named entities in text data, such as people, organizations, or locations. NER is a critical component of many NLP applications, such as information retrieval, question answering, and text summarization. 

Our corpus of tweets consist of a sample of $500$ tweets related to a specific topic or event.

By the end of this notebook, you will have a basic understanding of how to use regular expressions and named entity recognition to perform information extraction from text data, and how to analyze and visualize the extracted information for insights and understanding. Let's get started!
</div>

In [1]:
import pandas as pd
import re
import emoji

# import the data 
tweets_df = pd.read_csv('../data/top_500_retweeted_tweets.csv', encoding = "utf-8")
tweets_df.head() 

Unnamed: 0,tweet_id,text,retweets
0,1265465820995411973,"This was me, and I want to make one thing clea...",257467
1,1266553959973445639,Mike Pence caught on hot mic delivering empty ...,135818
2,1258750892448387074,THE PANDEMIC IS STILL HAPPENING. THE PANDEMIC ...,88667
3,1263579286201446400,"This just happened on live tv. Wow, what a dou...",82495
4,1266546753182056453,Mask on,66604


## A. Extracting patterns using Regular Expressions

<a href="https://docs.python.org/3/howto/regex.html">Regular expressions</a> (also called regex, regexes, regex pattern, regexp, or REs) are a sequence of characters that define a search pattern. They are used in programming and text processing to match and manipulate strings of text based on a specific pattern.
A regular expression is a pattern used to match one or more text strings. It is usually composed of a combination of characters, symbols, and metacharacters. Metacharacters are special characters that have a specific meaning in regular expressions, for example, the period (.) that matches any single character, or the asterisk (*) that matches zero or more occurrences of the preceding character.
Regular expressions can be used to perform a variety of operations on text data, such as searching for specific patterns, replacing text with other text, or extracting specific information from a text string. 
Some common examples of regular expressions are matching an email addresses, phone numbers, dates, and URLs.

Regular expressions can be complex and difficult to read, but they are a powerful tool for manipulating and processing text data. Luckily, there are many resources that can help us write the correct regular expression for our task. Also, Python has built-in mobule (`re`) to use regular expressions.

<img src='../data/Regular_Expressions_Cheat_Sheet.png'>

In this example, we will extract all URLs from the text of the tweet. A possible regular expression to match an URL is:

`http[s]*\S+`

This regular expression will match all strings that starts with `http`, or eventually with `https`, followed by non-empty spaces. 

We will use the `findall` function from the Python module `re` to match all URLs in text of the tweets:

In [2]:
# we create a new column where we store all the URLs mentioned in the tweet extracted using regex
tweets_df['urls'] = tweets_df['text'].apply(lambda x: re.findall("http[s]*\S+", x))
tweets_df['urls'].values[0]

['https://t.co/349TZijtD8']

We can also extract **mentions and hashtags** applying opportune regular expressions:

In [3]:
# find all mentions
tweets_df['mentions'] = tweets_df['text'].apply(lambda x: re.findall("@[a-zA-Z0-9_]{1,50}", x))
print(tweets_df['mentions'].values[-1])

['@realDonaldTrump']


In [4]:
# excercise: find all hashtags
# hashtags example: #soccomquant
tweets_df['hashtags'] = tweets_df['text'].apply(lambda x: re.findall("#[a-zA-Z0-9_]{1,50}", x))
print(tweets_df['hashtags'].values[-5])

['#COVID']


For **emoji** extraction, in addition to regex, we will use the library called emoji (if not installed before, please install it before running the following cell). This library helps us transform emojis into the related codes (i.e., texts). Once the emojis are converted to text, we apply the same logic applied so far with regex to find them. 

The full list of emojis and related codes is available here: https://unicode.org/emoji/charts/full-emoji-list.html

Let's look at and example:

In [5]:
emoji.demojize("😂")

':face_with_tears_of_joy:'

We can apply this approach to the whole dataset:

In [6]:
def extract_emojis(text, return_codes=False):
    # first turn emojis into related text code
    text_de = emoji.demojize(text)
    # second find all emojis text code
    emojis_list_de = re.findall(r'(:[\a-z]+:)', text_de)
    # reconvert text code to emojis
    list_emoji = [emoji.emojize(x) for x in emojis_list_de]

    if return_codes:
        return emojis_list_de
    else:
        return list_emoji

tweets_df['emoji'] = tweets_df['text'].apply(extract_emojis)
tweets_df['emoji_text'] = tweets_df['text'].apply(extract_emojis, return_codes=True)

tweets_df.tail()

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text
495,1264986843948277760,"People who say ‘well, he’s doing the best he c...",9033,[https://t.co/5POEhfB6vi],[],[#COVID],[],[]
496,1260425005483073538,This young woman was killed in her home for no...,9021,[https://t.co/JzPgOzm4Rm],[],[#BreonnaTaylor],[],[]
497,1259587972728533000,I be like “oh shit my mask” like I’m Batman or...,8994,[],[],[],[😂😂],[:face_with_tears_of_joy::face_with_tears_of_j...
498,1266251584461090816,Really disappointed by @SAfridiOfficial‘s comm...,8984,[],"[@SAfridiOfficial, @narendramodi]",[],[🇮🇳],[:India:]
499,1266728243236950018,Let's be clear about what's happening:\n\n→ Am...,8974,[],[@realDonaldTrump],[],[],[]


Let's see the final results from our extraction example and sort values according to mentions.

In [7]:
tweets_df.sort_values(by='mentions', ascending=False)

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text
489,1258617080430997505,A Black New York State Senator (@zellnor4ny) a...,9151,[https://t.co/NoT8g4uAli],"[@zellnor4ny, @YourFavoriteASW]",[],[],[]
464,1266956300908363776,NEW: A volunteer on Kushner's coronavirus resp...,9327,[https://t.co/jvs2h4IfNQ],[@yabutaleb7],[],[: A volunteer on Kushner's coronavirus respon...,[: A volunteer on Kushner's coronavirus respon...
347,1260559563972960256,Wow! The Front Page @washingtonpost Headline r...,11591,[],[@washingtonpost],[],[],[]
360,1262940294305071104,it would appear that @vp was joking about carr...,11196,[https://t.co/hI9cO4lxcX],[@vp],[],[],[]
412,1261718681882693632,Very happy to present this unseen image of @ta...,10245,[https://t.co/3dzvynlUq3],"[@tarak9999, @DabbooRatnani]","[#HappyBirthdayNTR, #StayHomeStaySafe]",[😎\n\n📸 By @DabbooRatnani \n\n#HappyBirthdayNT...,[:smiling_face_with_sunglasses:\n\n:camera_wit...
...,...,...,...,...,...,...,...,...
167,1256717572373913605,Update: Got her permission with a fuck yeah. T...,19289,[https://t.co/MqV0QJ0D8h],[],[],[],[]
165,1265624335898869760,"Y'all, the mask goes OVER your nose.",19351,[],[],[],[],[]
164,1258599146522464256,Because if its Baghdad its okay for this to ha...,19457,[https://t.co/UdFy61zoT5],[],[],[],[]
163,1266343312304324608,I gotta be honest the worst looting I've ever ...,19527,[],[],[],[],[]


As a final exercise, let's clean text from urls, hashtags, mentions, and emojis for further text analysis.

In [8]:
def remove_urls(text):
    # find all URLs in text using regex
    urls = re.findall("http[s]*\S+", text)
    # iterate through the URLs and remove them
    for url in urls:
        text = text.replace(url, "")
    return text


def remove_hashtags(text):
    # find all hashtags in text using regex
    hashtags = re.findall("@[a-zA-Z0-9_]{1,50}", text)
    # iterate through the hashtags and remove them
    for hashtag in hashtags:
        text = text.replace(hashtag, "")
    return text


def remove_mentions(text):
    # find all mentions in text using regex
    mentions = re.findall("#[a-zA-Z0-9_]{1,50}", text)
    # iterate through the mentions and remove them
    for mention in mentions:
        text = text.replace(mention, "")
    return text


def remove_emojis(text):
    # find all emoji in text
    emojis = extract_emojis(text, return_codes=False)
    # iterate through the emojis and remove them
    for emoji in emojis:
        text = text.replace(emoji, "")
    return text


def clean_text(text):
    # create a cleaning pipeline 
    text = remove_urls(text)
    text = remove_hashtags(text)
    text = remove_mentions(text)
    text = remove_emojis(text)
    return text

tweets_df['cleaned_text'] = tweets_df['text'].apply(lambda x: clean_text(x))

Let's see how it worked:

In [9]:
print('Original Tweet:', tweets_df.text.values[412])
print('\n\nCleaned Tweet:', tweets_df.cleaned_text.values[412])

Original Tweet: Very happy to present this unseen image of @tarak9999 .. I hope you all like it 😎

📸 By @DabbooRatnani 

#HappyBirthdayNTR 🎉

#StayHomeStaySafe 🙏🏼 https://t.co/3dzvynlUq3


Cleaned Tweet: Very happy to present this unseen image of  .. I hope you all like it  


## B. Extracting named entities

A named entity is a real-life object which can be identified and denoted with a proper name. Named Entities can be a place, person, organization, time, object, or geographic entity. For example, named entities would be Joe Biden, New York city, and congress. Named entities are usually instances of entity instances. For example, Joe Biden is an instance of a politician/person, New York City is an instance of a place, and congress is as instance of an organization. 

**Named Entity Recognition** (NER) is the process of NLP for identifying and classifying named entities. The raw and structured text are used to find out named entities, which are classified into persons, organizations, places, money, time, etc. NER systems are developed with various linguistic approaches, as well as statistical and machine learning methods. 

NER model first identifies an entity and then categorizes the entity into the most suitable class. Some of the common types of Named Entities will be as follows and others can be found in the further example of a Wikipedia page text.

1. Organisations : NASA, CERN

2. Places: Istanbul, Germany

3. Money: 1 Billion Dollars, 50 Euros

4. Date: 24th January 2023, season 4

5. Person: Richard Feynman, George Floyd
 
<img src='../data/NER.png' style='height: 500px; float: left'>

<div class='alert-info'>
<big><b>Insight</b></big>

    
For NLP tasks like NER, POS tagging, dependency parsing, word vectors and more, <a href="https://spacy.io/">spaCy</a> has distinct features that provide clear advantage for processing text data and modeling. It is the most trending and advanced free open-source library for implementing NLP in Python nowadays. 
    
An important thing about NER models is that their ability to understand Named Entities depending on the data they have been trained on. There are many applications of NER. NER can be used for content classification, the various Named Entities of a text can be collected, and based on that data, the content themes can be understood.
    
We can use spaCy very easily for NER tasks. However, we need to consider training our own data for research, commercial, and business specific needs, the spaCy model generally performs well for all types of text data. 
    
</div>

As usual, let's import necessary libraries and packages and start with a toy example from our tweets dataframe, which is the second line of the text column. 

In [10]:
import spacy 

# before loading it we need to install this module via: #!python -m spacy download en_core_web_sm
NER = spacy.load("en_core_web_sm")

# Print the second tweet of our dataset
raw_text = tweets_df.cleaned_text[1]
print(raw_text)

Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 


Now, we print the data on the Named Entities found in this raw text sample from our dataset.

In [11]:
# extract the entities using the spacy objects previously defined in the
NER_text = NER(raw_text)

# show all the entities extracted from the text
for word in NER_text.ents:
    print(word.text, word.label_)

Mike Pence PERSON
PPE ORG


<div class='alert-info'>
<big><b>Insight</b></big>
    
Here, PPE is a context specific word to be labeled as organization. In the COVID-19 context like in our example, it stands for "personal protective equipment"; which is not an organization. On the other hand, as an abbreviation of the Philosophy, Politics, and Economics Society, PPE can be labeled as an organization.
</div>  

Now, let's run NER on the full dataset and find out the output with Named Entities and who is the most cited Location:

In [12]:
import spacy

# Load the pre-trained model with NER
nlp = spacy.load("en_core_web_sm")

# Define a dictionary to store the count of each location
location_count = {}

# Loop over each text and analyze it with spaCy's NER
for text in tweets_df.cleaned_text:
    doc = NER(text)
    for ent in doc.ents:
        if ent.label_ == "LOC":
            # If the entity is a location, add it to the count dictionary
            name = ent.text
            if name in location_count:
                location_count[name] += 1
            else:
                location_count[name] = 1

# Find the person with the highest count
most_cited_location = max(location_count, key=location_count.get)
print("The most cited location is:", most_cited_location)


The most cited location is: the Southern Border


Finally, we save the processed data for later use:

In [13]:
tweets_df.to_csv("../data/top_500_retweeted_tweets_clean.csv", index=False)