# Intermediate Text Data Analysis Techniques and Introduction to Social Media Data

Table of Contents

1. Social Media Data Characteristics
2. Emoji and Emoticon Analysis
3. Sentiment Analysis
4. Regular Expressions for Advanced Text Cleaning

## 1. Social Media Data Characteristics

Social media text differs from other kinds of text data in several ways:

1. Shorter texts (e.g., tweets, comments)
2. Informal language and slang
3. Emojis and emoticons
4. URLs, mentions, and hashtags

Understanding these characteristics is essential for effectively processing and analyzing social media data.
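
As a rough illustration of these characteristics, the sketch below profiles a single post using only the standard library. The function name `profile_post`, the sample post, and the three patterns are assumptions made for this example. The patterns are deliberately simple and will, for instance, also count the domain part of an email address as a mention.

```python
import re

# Illustrative patterns (assumptions for this sketch, not an official tokenizer)
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')

def profile_post(post):
    """Return simple counts describing one social media post."""
    return {
        'n_chars': len(post),                         # overall length in characters
        'n_words': len(post.split()),                 # whitespace-separated tokens
        'n_urls': len(URL_RE.findall(post)),          # links
        'n_mentions': len(MENTION_RE.findall(post)),  # @user references
        'n_hashtags': len(HASHTAG_RE.findall(post)),  # #topic tags
    }

sample_post = "Thanks @johndoe! Read this: https://example.com/article #learning #python"
print(profile_post(sample_post))
```

Counts like these give a quick first impression of how noisy a dataset is before any cleaning is applied.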

## 2. Emoji and Emoticon Analysis

Emojis and emoticons are widely used in social media to express emotion, so analyzing them can help reveal the sentiment of a text. We will use the `emoji` library to extract and count emojis.

### 2.1 Extracting Emojis and Emoticons

Python package: [emoji](https://carpedm20.github.io/emoji/docs)

In [None]:
!pip install emoji

In [None]:
import emoji

print(emoji.emojize('Python is :thumbs_up:'))

In [None]:
print(emoji.emojize("Python is fun :red_heart:", variant="text_type"))

In [None]:
print(emoji.emojize("Python is fun :red_heart:", variant="emoji_type"))

#### Extracting emoji

In [None]:
# extracting emoji
emoji.emoji_list('Python is 👍')

In [None]:
text = "I love Python! 😍🐍 The weather is great today! 😊 #happy #sunny 🌞"
emoji.emoji_list(text)

In [None]:
# extract distinct emojis
emoji.distinct_emoji_list('Some emoji: 🌍, 😂, 😃, 😂, 🌍, 🌦️')

In [None]:
# count the number of emojis
emoji.emoji_count('Some emoji: 🌍, 😂, 😃, 😂, 🌍, 🌦️')

In [None]:
emoji.emoji_count('Some emoji: 🌍, 😂, 😃, 😂, 🌍, 🌦️', unique=True)
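
The cells above deal with Unicode emojis. Text emoticons such as `:)` or `:-(` are plain ASCII, so a regular expression is one way to find them. The following is a minimal sketch: the function `find_emoticons` and its pattern (one "eyes" character, an optional `-` nose, one "mouth" character) are assumptions chosen for illustration and cover only a few common Western-style emoticons.

```python
import re

# Hypothetical, deliberately small pattern: eyes, optional nose, mouth.
# The lookbehind/lookahead keep it from firing inside words or URLs ("https://").
EMOTICON_RE = re.compile(r"(?<!\w)[:;=]-?[)(\[\]DPp/\\|](?!\w)")

def find_emoticons(post):
    """Return all text emoticons found in the post, in order of appearance."""
    return EMOTICON_RE.findall(post)

print(find_emoticons("Great day :) but traffic was bad :-( see https://example.com ;D"))
# [':)', ':-(', ';D']
```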

## 3. Sentiment Analysis

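Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text. We will use the TextBlob library for a simple version: TextBlob assigns each text a polarity score between -1 (negative) and 1 (positive), and we use the sign of that score to label the text.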

In [None]:
!pip install textblob

In [None]:
from textblob import TextBlob

def get_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

sentiment = get_sentiment(text)
sentiment
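
Testing the polarity for exact equality with 0 treats even a very weak score as positive or negative. One possible refinement, sketched below, is a neutral band around zero. The helper `label_polarity` and the threshold of 0.05 are assumptions for this example, not part of TextBlob; the function takes an already computed polarity value, so it runs without the library.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band around 0.

    The threshold is an arbitrary illustrative choice; tune it for your data.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

for score in (0.5, 0.01, 0.0, -0.3):
    print(score, label_polarity(score))
```

With TextBlob installed, this could be called as `label_polarity(TextBlob(text).sentiment.polarity)`.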

## 4. Regular Expressions for Advanced Text Cleaning

Regular expressions are a powerful tool for advanced text cleaning. We will use them to remove URLs, mentions, hashtags, and special characters from tweet text.

### 4.1 Removing URLs

To remove URLs from the text, we can use the following regular expression pattern:

In [None]:
import re

def remove_urls(text):
    # Match http(s) links and bare www. links; the dot after "www" is escaped
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

text = "Check out this amazing article: https://example.com/article #learning"
cleaned_text = remove_urls(text)
cleaned_text

### 4.2 Removing User Mentions

To remove user mentions from the text, we can use the following regular expression pattern:

In [None]:
def remove_mentions(text):
    mention_pattern = re.compile(r'@\w+')
    return mention_pattern.sub('', text)

text = "Thanks for the great article, @johndoe! #appreciation"
cleaned_text = remove_mentions(text)
cleaned_text

### 4.3 Removing Hashtags

To remove hashtags from the text, we can use the following regular expression pattern:

In [None]:
def remove_hashtags(text):
    hashtag_pattern = re.compile(r'#\w+')
    return hashtag_pattern.sub('', text).strip()

text = "I love Python! 😍🐍 #python #programming"
cleaned_text = remove_hashtags(text)
cleaned_text
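
Hashtags often carry topical information, so it can be useful to save them before removing them. The sketch below shows one way to do that; the helper name `extract_hashtags` and the choice to lowercase the tags are assumptions for this example.

```python
import re

def extract_hashtags(post):
    """Return the hashtag words in a post, without the leading '#', lowercased."""
    # The capture group keeps only the word after '#'
    return [tag.lower() for tag in re.findall(r'#(\w+)', post)]

print(extract_hashtags("I love Python! 😍🐍 #Python #programming"))
# ['python', 'programming']
```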

### 4.4 All-in-One

In [None]:
def clean_text(text):
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs (http\S+ also covers https)
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'\W', ' ', text)  # replace non-word characters (punctuation, '#', emojis) with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    return text.strip()

text = "Check out this amazing article: https://example.com/article #learning, Thanks for the great article, @johndoe! #appreciation, I love Python! 😍🐍 #python #programming"
cleaned_text = clean_text(text)
cleaned_text

### 6.1 Finding Collocations

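Collocations are word pairs that occur together more often than chance would predict (for example, "red planet"). To find them, we can use the `nltk` library.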

In [None]:
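import nltk
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models needed by nltk.word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead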
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords
import string

def find_collocations(text, num_collocations=10):
    """
    This function takes a text input and finds the top collocations (bigrams) based on their
    pointwise mutual information (PMI) scores.

    :param text: str, input text to analyze for collocations
    :param num_collocations: int, optional, the number of top bigrams to return based on their PMI scores
                             (default is 10)
    :return: list of tuples, the top num_collocations bigrams with the highest PMI scores
    """
    # Tokenize the input text
    tokens = nltk.word_tokenize(text)
    # Create a BigramAssocMeasures object to compute the PMI scores
    bigram_measures = BigramAssocMeasures()
    # Create a BigramCollocationFinder object from the tokens
    finder = BigramCollocationFinder.from_words(tokens)
    # Apply a frequency filter to keep only bigrams that appear at least twice
    finder.apply_freq_filter(2)
    # Build the stopword set once, then exclude bigrams containing stopwords or punctuation
    stop_words = set(stopwords.words('english'))
    finder.apply_word_filter(lambda w: w.lower() in stop_words or w in string.punctuation)
    # Return the top num_collocations bigrams with the highest PMI scores
    return finder.nbest(bigram_measures.pmi, num_collocations)

In [None]:
space_text = '''
Space exploration has been a topic of fascination for scientists, researchers, and the general public for decades. One of the most intriguing aspects of space exploration is the possibility of colonizing other planets, such as Mars. In recent years, multiple space agencies and private companies have set their sights on sending humans to Mars and establishing a permanent settlement on the red planet.

Mars has long been considered a potential candidate for human colonization due to its similarities to Earth in terms of climate, geology, and the presence of water ice. However, there are numerous challenges that must be overcome before humans can safely set foot on the Martian surface. These challenges include developing advanced propulsion systems, creating sustainable habitats, and ensuring the health and safety of astronauts during the long journey to Mars.

Several ambitious Mars missions are currently being planned by various organizations, including NASA, the European Space Agency (ESA), and private companies like SpaceX. These missions aim to further our understanding of Mars' geology, climate, and potential habitability, as well as to test the technologies needed for future human exploration.

One of the most notable Mars missions is NASA's Mars 2020 mission, which successfully landed the Perseverance rover on the Martian surface in February 2021. Perseverance has been exploring the Jezero Crater, searching for signs of ancient life and collecting samples to be returned to Earth by a future mission.

Meanwhile, SpaceX founder Elon Musk has announced ambitious plans to send humans to Mars as early as 2024, with the ultimate goal of establishing a self-sustaining colony on the planet. SpaceX's Starship, a reusable spacecraft currently under development, is designed to transport large numbers of people and cargo to Mars and other destinations in the solar system.

As the race to Mars continues, the world eagerly awaits the next major milestone in human space exploration. The potential discovery of past or present life on Mars, as well as the establishment of a permanent human presence on the red planet, would undoubtedly have profound implications for our understanding of the universe and our place in it.
'''

collocations = find_collocations(space_text)
print(collocations)
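
To make the PMI score less of a black box: for a bigram (w1, w2), PMI is commonly defined as `log2(P(w1, w2) / (P(w1) * P(w2)))`, so pairs that co-occur more often than their individual frequencies would predict score higher. The sketch below computes this from raw counts with the standard library. It is a simplified illustration under the assumption that all probabilities are estimated by dividing counts by the number of tokens; the function `bigram_pmi` and the toy sentence are made up for this example, and NLTK's implementation may differ in details.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_freq=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    n = len(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent pairs
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # rare pairs get unreliable (often inflated) PMI scores
        # log2( (count/n) / ((c1/n) * (c2/n)) ) simplifies to the line below
        scores[(w1, w2)] = math.log2(count * n / (unigram_counts[w1] * unigram_counts[w2]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

toy_tokens = "red planet is far red planet is cold".split()
print(bigram_pmi(toy_tokens))
```

The frequency filter matters because PMI favors rare pairs: a bigram seen once, made of words seen once each, would otherwise get the maximum score.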


## Exercise: Exploratory Data Analysis on Social Media Data

Now that we have covered several intermediate text data analysis techniques, let's apply them to a real-life social media dataset.

In [None]:
# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk

Load the dataset and take a look at the first few rows:

In [None]:
df = pd.read_csv("Tweets.csv")
df.head()

### 1. Social Media Data Characteristics

In [None]:
df.info()

### 2. Emoji and Emoticon Analysis

In [None]:
def extract_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

df['emojis'] = df['text'].apply(extract_emojis)
df['emojis'].head()
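
Because the pattern above ends in `+`, adjacent emojis are returned as one string (for example `'😍🐍'`). To count individual emojis, one option is to match a single code point at a time, as in the sketch below. It works on a plain list of strings rather than the DataFrame so that it runs on its own; `count_emojis` and the sample texts are assumptions for this example. These code-point ranges also miss many newer emojis (for instance those in the U+1F900 block), so the `emoji` library is likely the more complete choice.

```python
import re
from collections import Counter

# Same code-point ranges as the pattern above, but without the trailing "+",
# so each emoji is matched separately.
SINGLE_EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "]"
)

def count_emojis(texts):
    """Count how often each individual emoji occurs across a list of texts."""
    counts = Counter()
    for t in texts:
        counts.update(SINGLE_EMOJI_RE.findall(t))
    return counts

sample_texts = ["I love Python! 😍🐍", "Great weather 😊🌞", "So happy 😍"]
print(count_emojis(sample_texts).most_common(3))
```

On the DataFrame, the same function could be called as `count_emojis(df['text'])`.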

### 3. Sentiment Analysis

In [None]:
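from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # lexicon required by the VADER analyzer

sia = SentimentIntensityAnalyzer()
df['sentiment_scores'] = df['text'].apply(lambda x: sia.polarity_scores(x))
df['sentiment_scores'].head()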

In [None]:
# Optional: expand the 'sentiment_scores' dicts into separate columns (neg, neu, pos, compound)
# df['sentiment_scores'].apply(pd.Series)

### 4. Regular Expressions for Advanced Text Cleaning

In [None]:
def clean_text(text):
    text = re.sub(r"@\w+", "", text)  # Remove mentions
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs (http\S+ also covers https)
    text = text.lower()  # Convert to lowercase
    return text

# df['cleaned_text'] = df['text'].apply(clean_text)
# df['cleaned_text'].head()