# Advanced Preprocessing Techniques and Natural Language Processing Applications for Social Media Datasets

Table of Contents

1. N-grams and Phrase Analysis
2. Collocation Analysis
3. Part-of-Speech Tagging
4. Named Entity Recognition
5. Dependency Parsing

## 1. N-grams and Phrase Analysis

N-grams are sequences of N contiguous words in a text. They can provide insights into the co-occurrence of words and the context in which they appear. In this section, we will discuss how to generate and analyze N-grams from text data.
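To make the definition concrete, here is a minimal sketch in plain Python that slides a window of size N over a token list. The helper name `sliding_ngrams` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n contiguous tokens as a tuple."""
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so only complete windows are produced
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the quick brown fox".split()
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A text with T tokens yields T - N + 1 N-grams, so the four-token sample gives three bigrams and two trigrams.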

**Research Question:** "What patterns of language use can be observed in the communication of news headlines across different media outlets?"

By applying N-grams and phrase analysis to a large dataset of news headlines from various media outlets, researchers can identify common phrases, expressions, and linguistic patterns used in crafting headlines. This analysis can provide insights into the framing techniques employed by different media outlets, revealing any potential biases in their presentation of news stories and the way they attract readers' attention.

### 1.1 Generating N-grams

To generate N-grams, we can use the `nltk` library.

In [2]:
!pip install nltk



In [20]:
from nltk import ngrams

In [21]:
def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

In [24]:
text = "I love Python programming language because it is easy to learn and very versatile. I love Python programming language because it is easy to learn and very versatile. I love Python programming language because it is easy to learn and very versatile. I love Python programming language because it is easy to learn and very versatile."
bigrams = generate_ngrams(text, 2)
bigrams

[('I', 'love'),
 ('love', 'Python'),
 ('Python', 'programming'),
 ('programming', 'language'),
 ('language', 'because'),
 ('because', 'it'),
 ('it', 'is'),
 ('is', 'easy'),
 ('easy', 'to'),
 ('to', 'learn'),
 ('learn', 'and'),
 ('and', 'very'),
 ('very', 'versatile.'),
 ('versatile.', 'I'),
 ('I', 'love'),
 ('love', 'Python'),
 ('Python', 'programming'),
 ('programming', 'language'),
 ('language', 'because'),
 ('because', 'it'),
 ('it', 'is'),
 ('is', 'easy'),
 ('easy', 'to'),
 ('to', 'learn'),
 ('learn', 'and'),
 ('and', 'very'),
 ('very', 'versatile.'),
 ('versatile.', 'I'),
 ('I', 'love'),
 ('love', 'Python'),
 ('Python', 'programming'),
 ('programming', 'language'),
 ('language', 'because'),
 ('because', 'it'),
 ('it', 'is'),
 ('is', 'easy'),
 ('easy', 'to'),
 ('to', 'learn'),
 ('learn', 'and'),
 ('and', 'very'),
 ('very', 'versatile.'),
 ('versatile.', 'I'),
 ('I', 'love'),
 ('love', 'Python'),
 ('Python', 'programming'),
 ('programming', 'language'),
 ('language', 'because'),
 ...]

### 1.2 Analyzing N-grams

After generating N-grams, we can analyze them to identify frequently occurring phrases and patterns. For example, we can count the frequency of each N-gram to find the most common ones:

In [25]:
from collections import Counter

bigram_counts = Counter(bigrams)
bigram_counts

Counter({('I', 'love'): 4,
         ('love', 'Python'): 4,
         ('Python', 'programming'): 4,
         ('programming', 'language'): 4,
         ('language', 'because'): 4,
         ('because', 'it'): 4,
         ('it', 'is'): 4,
         ('is', 'easy'): 4,
         ('easy', 'to'): 4,
         ('to', 'learn'): 4,
         ('learn', 'and'): 4,
         ('and', 'very'): 4,
         ('very', 'versatile.'): 4,
         ('versatile.', 'I'): 3})
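Because the text is split on whitespace, punctuation stays attached to tokens (`'versatile.'`), and differences in case would be counted separately. The sketch below shows one possible normalization step before counting, followed by `Counter.most_common` to rank the results. The helper `normalized_bigrams`, the regular expression, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def normalized_bigrams(text):
    # Lowercase the text and keep only runs of letters, digits, or apostrophes,
    # so "Python." and "python" become the same token
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return list(zip(tokens, tokens[1:]))


sample = "I love Python. I love Python because it is versatile."
counts = Counter(normalized_bigrams(sample))
top = counts.most_common(2)
print(top)
```

Note that this simple approach still forms bigrams across sentence boundaries (for example `('python', 'i')`). Splitting into sentences first would avoid that.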

## 2. Collocation Analysis

Collocations are word pairs that occur together more often than expected by chance. They can provide valuable insights into the relationships between words in the text. In this section, we will discuss how to perform collocation analysis using the `nltk` library.
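One common way to quantify "more often than expected by chance" is pointwise mutual information: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below computes it by hand from raw counts. The function name `pmi_scores` and the toy sentence are assumptions, and the probability estimates are deliberately simple, so the scores may differ slightly from those reported by `nltk`.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information (base 2)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), pair_count in pair_counts.items():
        p_pair = pair_count / n_pairs        # observed probability of the pair
        p_w1 = word_counts[w1] / n_words     # probability of the first word
        p_w2 = word_counts[w2] / n_words     # probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "new york is big and new york is busy and the city is new".split()
scores = pmi_scores(tokens)
for pair in [("new", "york"), ("is", "new"), ("the", "city")]:
    print(pair, round(scores[pair], 2))
```

In this toy example the pair `('the', 'city')`, which occurs only once, scores higher than the repeated pair `('new', 'york')`. PMI rewards pairs whose words rarely appear apart, which is why a minimum-frequency filter is often applied before ranking.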

**Research Question:** "What are the most common themes and topics discussed on social media platforms during a major political event?"

By using collocation analysis on social media data collected during a major political event, researchers can identify word pairs that frequently co-occur, shedding light on the main themes and topics of discussion. This information can be useful in understanding public opinion and sentiment, as well as the language patterns used by users when discussing political issues on social media platforms.

In [31]:
import nltk
nltk.download('stopwords')
# nltk.word_tokenize also needs the Punkt tokenizer data; if it is missing, run nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords
import string

def find_collocations(text, num_collocations=10):
    """
    This function takes a text input and finds the top collocations (bigrams) based on their
    pointwise mutual information (PMI) scores.

    :param text: str, input text to analyze for collocations
    :param num_collocations: int, optional, the number of top bigrams to return based on their PMI scores
                             (default is 10)
    :return: list of tuples, the top num_collocations bigrams with the highest PMI scores
    """
    # Tokenize the input text
    tokens = nltk.word_tokenize(text)
    # Create a BigramAssocMeasures object to compute the PMI scores
    bigram_measures = BigramAssocMeasures()
    # Create a BigramCollocationFinder object from the tokens
    finder = BigramCollocationFinder.from_words(tokens)
    # Apply a frequency filter to keep only bigrams that appear at least twice
    finder.apply_freq_filter(2)
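    # Build the stopword set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words('english'))
    # Apply a word filter to exclude bigrams containing stopwords or punctuation
    finder.apply_word_filter(lambda w: w.lower() in stop_words or w in string.punctuation)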
    # Return the top num_collocations bigrams with the highest PMI scores
    return finder.nbest(bigram_measures.pmi, num_collocations)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
space_text = '''
Space exploration has been a topic of fascination for scientists, researchers, and the general public for decades. One of the most intriguing aspects of space exploration is the possibility of colonizing other planets, such as Mars. In recent years, multiple space agencies and private companies have set their sights on sending humans to Mars and establishing a permanent settlement on the red planet.

Mars has long been considered a potential candidate for human colonization due to its similarities to Earth in terms of climate, geology, and the presence of water ice. However, there are numerous challenges that must be overcome before humans can safely set foot on the Martian surface. These challenges include developing advanced propulsion systems, creating sustainable habitats, and ensuring the health and safety of astronauts during the long journey to Mars.

Several ambitious Mars missions are currently being planned by various organizations, including NASA, the European Space Agency (ESA), and private companies like SpaceX. These missions aim to further our understanding of Mars' geology, climate, and potential habitability, as well as to test the technologies needed for future human exploration.

One of the most notable Mars missions is NASA's Mars 2020 mission, which successfully landed the Perseverance rover on the Martian surface in February 2021. Perseverance has been exploring the Jezero Crater, searching for signs of ancient life and collecting samples to be returned to Earth by a future mission.

Meanwhile, SpaceX founder Elon Musk has announced ambitious plans to send humans to Mars as early as 2024, with the ultimate goal of establishing a self-sustaining colony on the planet. SpaceX's Starship, a reusable spacecraft currently under development, is designed to transport large numbers of people and cargo to Mars and other destinations in the solar system.

As the race to Mars continues, the world eagerly awaits the next major milestone in human space exploration. The potential discovery of past or present life on Mars, as well as the establishment of a permanent human presence on the red planet, would undoubtedly have profound implications for our understanding of the universe and our place in it.
'''

collocations = find_collocations(space_text)
collocations

[('Elon', 'Musk'),
 ('February', '2021'),
 ('Jezero', 'Crater'),
 ('advanced', 'propulsion'),
 ('collecting', 'samples'),
 ('colonization', 'due'),
 ('creating', 'sustainable'),
 ('developing', 'advanced'),
 ('eagerly', 'awaits'),
 ('founder', 'Elon')]

In [38]:
tokens = nltk.word_tokenize(space_text)
bigram_measures = BigramAssocMeasures()
# Create a BigramCollocationFinder object from the tokens
finder = BigramCollocationFinder.from_words(tokens)
# No frequency filter is applied here, so bigrams that occur only once are ranked as well
stop_words = set(stopwords.words('english'))
finder.apply_word_filter(lambda w: w.lower() in stop_words or w in string.punctuation)
finder.nbest(bigram_measures.pmi, 15)

[('Elon', 'Musk'),
 ('February', '2021'),
 ('Jezero', 'Crater'),
 ('advanced', 'propulsion'),
 ('collecting', 'samples'),
 ('colonization', 'due'),
 ('creating', 'sustainable'),
 ('developing', 'advanced'),
 ('eagerly', 'awaits'),
 ('founder', 'Elon'),
 ('general', 'public'),
 ('include', 'developing'),
 ('intriguing', 'aspects'),
 ('large', 'numbers'),
 ('major', 'milestone')]

## 3. Part-of-Speech Tagging

Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, and so on) to each word in a text. It is a valuable tool in text analysis because it exposes the linguistic structure and composition of the text. In this section, we will perform part-of-speech tagging with the `nltk` library, which offers pre-trained taggers and tools for training your own tagger.

**Research Question:** "How do social media users employ different parts of speech when expressing their opinions on a controversial topic?"

Part-of-speech tagging can be applied to a dataset of social media posts discussing a controversial topic to determine the prevalence of different parts of speech, such as nouns, verbs, and adjectives. This analysis can reveal insights into the ways users structure their thoughts and arguments and the types of language employed when discussing controversial issues, potentially providing a better understanding of online discourse dynamics.

In [39]:
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

# Replace example_text with the text from your social media dataset
example_text = "I love the new iPhone! It's so fast and the camera is incredible."
pos_tags = pos_tagging(example_text)
print(pos_tags)

[('I', 'PRP'), ('love', 'VBP'), ('the', 'DT'), ('new', 'JJ'), ('iPhone', 'NN'), ('!', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('so', 'RB'), ('fast', 'JJ'), ('and', 'CC'), ('the', 'DT'), ('camera', 'NN'), ('is', 'VBZ'), ('incredible', 'JJ'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


**Outputs Breakdown** 

1. `('I', 'PRP')`: "I" is a pronoun (PRP).
2. `('love', 'VBP')`: "love" is a verb, non-3rd person singular present (VBP).
3. `('the', 'DT')`: "the" is a determiner (DT).
4. `('new', 'JJ')`: "new" is an adjective (JJ).
5. `('iPhone', 'NN')`: "iPhone" is a singular noun (NN).
6. `('!', '.')`: "!" is sentence-final punctuation; the Penn Treebank tag `.` covers ".", "!", and "?".
7. `('It', 'PRP')`: "It" is a pronoun (PRP).
8. `("'s", 'VBZ')`: "'s" (is) is a verb, 3rd person singular present (VBZ).
9. `('so', 'RB')`: "so" is an adverb (RB).
10. `('fast', 'JJ')`: "fast" is an adjective (JJ).
11. `('and', 'CC')`: "and" is a coordinating conjunction (CC).
12. `('the', 'DT')`: "the" is a determiner (DT).
13. `('camera', 'NN')`: "camera" is a singular noun (NN).
14. `('is', 'VBZ')`: "is" is a verb, 3rd person singular present (VBZ).
15. `('incredible', 'JJ')`: "incredible" is an adjective (JJ).
16. `('.', '.')`: "." is sentence-final punctuation, tagged `.` like the "!" above.
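To study the prevalence of different parts of speech, the tagged tokens can be tallied by category. The sketch below reuses the tagged output shown above and collapses the Penn Treebank tags into broad classes by prefix. The helper `coarse_tag` and its mapping are simplifying assumptions, not an `nltk` feature.

```python
from collections import Counter

# Tagged tokens copied from the output above
tagged = [('I', 'PRP'), ('love', 'VBP'), ('the', 'DT'), ('new', 'JJ'),
          ('iPhone', 'NN'), ('!', '.'), ('It', 'PRP'), ("'s", 'VBZ'),
          ('so', 'RB'), ('fast', 'JJ'), ('and', 'CC'), ('the', 'DT'),
          ('camera', 'NN'), ('is', 'VBZ'), ('incredible', 'JJ'), ('.', '.')]


def coarse_tag(tag):
    # Group fine-grained tags by prefix, e.g. NN, NNS, NNP -> noun
    if tag.startswith('NN'):
        return 'noun'
    if tag.startswith('VB'):
        return 'verb'
    if tag.startswith('JJ'):
        return 'adjective'
    if tag.startswith('RB'):
        return 'adverb'
    return 'other'


tag_counts = Counter(coarse_tag(tag) for _, tag in tagged)
total = sum(tag_counts.values())
# Share of each class among all tokens, punctuation included
shares = {cls: count / total for cls, count in tag_counts.items()}
print(tag_counts)
```

Comparing such shares across groups of posts (for example, posts for and against a position) is one way to approach the research question above.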

## 4. Named Entity Recognition

Named Entity Recognition (NER) is a subtask of information extraction that seeks to identify and classify named entities (such as people, organizations, and locations) within a text. NER is useful in various applications, including analyzing social media data to identify influential individuals or organizations mentioned in discussions. In this section, we will perform named entity recognition with the spaCy library, which provides pre-trained models for NER tasks.

**Research Question:** "Can Named Entity Recognition be used to identify influential individuals and organizations mentioned in social media discussions on a specific issue?"

By applying Named Entity Recognition (NER) to social media data related to a particular issue, researchers can extract and analyze mentions of influential individuals, organizations, and other entities. This information can be used to explore the relationships between these entities and the issue being discussed, as well as to identify key players in the conversation and their potential impact on public opinion.

In [40]:
!pip install spacy
!python -m spacy download en_core_web_md


In [41]:
import spacy

nlp = spacy.load("en_core_web_md")

def named_entity_recognition(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Replace example_text with the text from your social media dataset
example_text = "I recently visited New York and went to the MoMA. It was a fantastic experience."
named_entities = named_entity_recognition(example_text)
named_entities

[('New York', 'GPE')]

**Outputs Breakdown** 

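`('New York', 'GPE')`: "New York" is a geopolitical entity (GPE). Note that the model did not tag "MoMA" in this run, so entity lists from a pre-trained model should be treated as potentially incomplete.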

In [46]:
doc = nlp(example_text)
doc.ents[0].label_

'GPE'
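Once entities have been extracted for many posts, counting mentions per label gives a first view of which people and organizations dominate a discussion. The sketch below assumes per-post results in the same `(text, label)` shape that `named_entity_recognition` returns. The sample entities and their labels are hypothetical and were not produced by running the model.

```python
from collections import Counter, defaultdict

# Hypothetical per-post results, shaped like the output of named_entity_recognition
entities_per_post = [
    [("New York", "GPE")],
    [("NASA", "ORG"), ("Mars", "LOC")],
    [("NASA", "ORG"), ("Elon Musk", "PERSON")],
]

# One Counter of entity mentions per label
mentions = defaultdict(Counter)
for entities in entities_per_post:
    for text, label in entities:
        mentions[label][text] += 1

top_orgs = mentions["ORG"].most_common(1)
print(top_orgs)
```

Raw mention counts treat spelling variants as different entities, so some normalization is usually needed before drawing conclusions about influence.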

## 5. Dependency Parsing

Dependency parsing is a technique used to analyze the grammatical structure of a sentence by determining the relationships between words and their dependents. It can provide valuable insights into how users structure their arguments and express their opinions in a text, especially in the context of online discussions on social media platforms. In this section, we will discuss how to perform dependency parsing using the spaCy library, which provides an efficient and accurate parser for processing natural language text.

**Research Question:** "How do users structure their arguments when discussing controversial topics on social media platforms?"

By using dependency parsing to analyze the grammatical structure and relationships between words in social media text data, researchers can explore the ways users construct their arguments and express their opinions on controversial topics. This can provide valuable insights into the complexity and nuance of online discourse, as well as reveal potential patterns and strategies employed by users when engaging in discussions on social media platforms.

In [47]:
import spacy

nlp = spacy.load("en_core_web_md")

def dependency_parsing(text):
    doc = nlp(text)
    return [(token.text, token.dep_, token.head.text) for token in doc]

# Replace example_text with the text from your social media dataset
example_text = "I believe the new policy will have a positive impact on our society."
dependencies = dependency_parsing(example_text)
dependencies

[('I', 'nsubj', 'believe'),
 ('believe', 'ROOT', 'believe'),
 ('the', 'det', 'policy'),
 ('new', 'amod', 'policy'),
 ('policy', 'nsubj', 'have'),
 ('will', 'aux', 'have'),
 ('have', 'ccomp', 'believe'),
 ('a', 'det', 'impact'),
 ('positive', 'amod', 'impact'),
 ('impact', 'dobj', 'have'),
 ('on', 'prep', 'impact'),
 ('our', 'poss', 'society'),
 ('society', 'pobj', 'on'),
 ('.', 'punct', 'believe')]

**Outputs Breakdown**

1. `('I', 'nsubj', 'believe')`: "I" is the nominal subject (nsubj) of the verb "believe".
2. `('believe', 'ROOT', 'believe')`: "believe" is the root (main verb) of the sentence.
3. `('the', 'det', 'policy')`: "the" is a determiner (det) for the noun "policy".
4. `('new', 'amod', 'policy')`: "new" is an adjectival modifier (amod) of the noun "policy".
5. `('policy', 'nsubj', 'have')`: "policy" is the nominal subject (nsubj) of the verb "have".
6. `('will', 'aux', 'have')`: "will" is an auxiliary verb (aux) of the verb "have".
7. `('have', 'ccomp', 'believe')`: "have" is a clausal complement (ccomp) of the verb "believe".
8. `('a', 'det', 'impact')`: "a" is a determiner (det) for the noun "impact".
9. `('positive', 'amod', 'impact')`: "positive" is an adjectival modifier (amod) of the noun "impact".
10. `('impact', 'dobj', 'have')`: "impact" is the direct object (dobj) of the verb "have".
11. `('on', 'prep', 'impact')`: "on" is a preposition (prep) linked to the noun "impact".
12. `('our', 'poss', 'society')`: "our" is a possessive modifier (poss) of the noun "society".
13. `('society', 'pobj', 'on')`: "society" is the object of the preposition (pobj) "on".
14. `('.', 'punct', 'believe')`: "." is punctuation (punct) attached to the verb "believe".
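The relations listed above can be combined into simple subject-verb-object patterns, which is one way to summarize how a claim is structured. The sketch below works directly on the `(token, relation, head)` triples from the output above. The helper `subject_verb_object` is a hypothetical name, and matching heads by their text is a simplification that would break if the same verb appeared twice in a sentence.

```python
# Dependency triples copied from the output above: (token, relation, head)
dependencies = [
    ('I', 'nsubj', 'believe'), ('believe', 'ROOT', 'believe'),
    ('the', 'det', 'policy'), ('new', 'amod', 'policy'),
    ('policy', 'nsubj', 'have'), ('will', 'aux', 'have'),
    ('have', 'ccomp', 'believe'), ('a', 'det', 'impact'),
    ('positive', 'amod', 'impact'), ('impact', 'dobj', 'have'),
    ('on', 'prep', 'impact'), ('our', 'poss', 'society'),
    ('society', 'pobj', 'on'), ('.', 'punct', 'believe'),
]


def subject_verb_object(triples):
    """Pair each verb with its nominal subject and direct object, if present."""
    clauses = {}
    for token, relation, head in triples:
        if relation == 'nsubj':
            clauses.setdefault(head, {})['subject'] = token
        elif relation == 'dobj':
            clauses.setdefault(head, {})['object'] = token
    # A missing subject or object is reported as None
    return [(parts.get('subject'), verb, parts.get('object'))
            for verb, parts in clauses.items()]


svo = subject_verb_object(dependencies)
print(svo)
```

Here the main clause ("I believe") has no direct object because its complement is a whole clause (`ccomp`), while the embedded clause yields the pattern policy / have / impact.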

In [48]:
from spacy import displacy
# Visualize the dependency tree inline in a Jupyter notebook
displacy.render(nlp(example_text), style='dep', jupyter=True)

## Exercise: Exploratory Data Analysis on Social Media Data (Continued)

In [49]:
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk import ngrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords
import string

import spacy
nlp = spacy.load("en_core_web_md")

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [50]:
df = pd.read_csv("Tweets.csv")
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [None]:
# preprocessing

### 1. N-grams and Phrase Analysis

In [51]:
def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

In [53]:
df["n_grams"] = df["text"].apply(lambda x: generate_ngrams(x, n=2))
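Per-tweet bigram lists are most useful once they are aggregated over the whole corpus. The sketch below shows the aggregation step on plain Python lists. The `rows` variable is a hypothetical stand-in for the lists stored in the `n_grams` column, with invented sample bigrams.

```python
from collections import Counter

# Hypothetical stand-ins for the per-tweet bigram lists in df["n_grams"]
rows = [
    [("flight", "was"), ("was", "delayed")],
    [("my", "flight"), ("flight", "was"), ("was", "cancelled")],
]

corpus_counts = Counter()
for bigrams in rows:
    # update() adds each bigram's occurrences to the running corpus total
    corpus_counts.update(bigrams)

most_common_bigram = corpus_counts.most_common(1)
print(most_common_bigram)
```

With the DataFrame from this exercise, the same loop could iterate over `df["n_grams"]` instead of `rows`.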

### 2. Collocation Analysis

In [None]:
def find_collocations(text, num_collocations=10):
    tokens = nltk.word_tokenize(text)
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    # Keep only bigrams that appear at least twice within this text
    finder.apply_freq_filter(2)
    # Build the stopword set once per call instead of once per token
    stop_words = set(stopwords.words('english'))
    finder.apply_word_filter(lambda w: w.lower() in stop_words or w in string.punctuation)
    return finder.nbest(bigram_measures.pmi, num_collocations)

In [56]:
df["collocation"] = df["text"].apply(find_collocations)
df["collocation"]

0                                       []
1                                       []
2                                       []
3                                       []
4                                       []
                       ...                
14635                                   []
14636    [(Late, Flight), (minutes, Late)]
14637                                   []
14638                                   []
14639                     [(next, flight)]
Name: collocation, Length: 14640, dtype: object

### 3. Part-of-Speech Tagging

In [None]:
def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

### 4. Named Entity Recognition

In [None]:
def named_entity_recognition(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

### 5. Dependency Parsing

In [None]:
def dependency_parsing(text):
    doc = nlp(text)
    return [(token.text, token.dep_, token.head.text) for token in doc]