# Intermediate Text Data Analysis Techniques and Introduction to Social Media Data

Table of Contents

1. Social Media Data Characteristics
2. Emoji and Emoticon Analysis
3. Sentiment Analysis
4. Regular Expressions for Advanced Text Cleaning
5. N-grams and Phrase Analysis
6. Collocation Analysis
7. Exploratory Data Analysis on Social Media Data

## 1. Social Media Data Characteristics

Social media data differs from other types of text data in several ways:

1. Shorter texts (e.g., tweets, comments)
2. Informal language and slang
3. Emojis and emoticons
4. URLs, mentions, and hashtags

Understanding these characteristics is essential for effectively processing and analyzing social media data.
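
As a rough illustration of these characteristics, the sketch below pulls a few surface features out of a single post using only the standard library. The example post and the regular expressions are assumptions made for illustration; they are simple heuristics, not a complete social media tokenizer.

```python
import re

# Hypothetical example post, invented for this illustration.
post = "OMG @johndoe this is gr8 :) check https://example.com #python #nlp"

# Simple heuristic patterns; real posts need more robust handling.
features = {
    "n_chars": len(post),                                  # posts tend to be short
    "n_words": len(post.split()),                          # whitespace-separated tokens
    "urls": re.findall(r"https?://\S+", post),             # links
    "mentions": re.findall(r"@\w+", post),                 # user mentions
    "hashtags": re.findall(r"#\w+", post),                 # hashtags
    "emoticons": re.findall(r"[:;=][-^]?[DP)(]", post),    # text emoticons such as :)
}

for name, value in features.items():
    print(name, value)
```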

## 2. Emoji and Emoticon Analysis

Emojis and emoticons are widely used in social media data to express emotions. Analyzing them can help us understand the sentiment of the text. We will use the emoji library to extract and analyze emojis in the dataset.

### 2.1 Extracting Emojis and Emoticons

In [None]:
!pip install emoji

In [None]:
import emoji
import re

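def extract_emojis_emoticons(text):
    # emoji.UNICODE_EMOJI was removed in emoji 2.0; emoji_list() returns one dict per emoji found
    emojis = [match["emoji"] for match in emoji.emoji_list(text)]
    # Simple text emoticons such as :) ;-) =D :P
    emoticons = re.findall(r'[:;=][-^]?[DP)(]', text)
    return emojis + emoticons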

text = "I love Python! 😍🐍 The weather is great today! 😊 #happy #sunny 🌞"
emojis_emoticons = extract_emojis_emoticons(text)
print(emojis_emoticons)

### 2.2 Analyzing Emojis and Emoticons

After extracting emojis and emoticons, we can analyze them to gain insights into the emotions and sentiments expressed in the text. For example, we can count the frequency of each emoji and emoticon to identify the most commonly used ones:

In [None]:
from collections import Counter

emoji_counts = Counter(emojis_emoticons)
print(emoji_counts)
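
The same counting idea extends to several posts at once. The following is a minimal sketch that accumulates counts over a small, made-up list of posts and reports the most frequent symbols with `Counter.most_common`. To stay dependency-free it uses one broad Unicode range as a rough stand-in for the emoji library, so it will miss emojis outside that range; the posts and the pattern are assumptions for illustration.

```python
import re
from collections import Counter

# Rough stand-in for the emoji library: one broad Unicode block plus the
# emoticon pattern used earlier. Emojis outside this range are not matched.
SYMBOL_PATTERN = re.compile(r"[\U0001F300-\U0001FAFF]|[:;=][-^]?[DP)(]")

# Hypothetical posts, invented for this illustration.
posts = [
    "I love Python! 😍🐍",
    "Great weather today 😊 :)",
    "Python again 😍 :)",
]

symbol_counts = Counter()
for post in posts:
    # update() adds the occurrences found in this post to the running totals
    symbol_counts.update(SYMBOL_PATTERN.findall(post))

top_two = symbol_counts.most_common(2)  # the two most frequent symbols
print(top_two)
```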

## 3. Sentiment Analysis

Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text. We will use the TextBlob library to compute a polarity score, which ranges from -1 (most negative) to 1 (most positive), and map it to a label for the sample text.

In [None]:
from textblob import TextBlob

def get_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

sentiment = get_sentiment(text)

## 4. Regular Expressions for Advanced Text Cleaning

Regular expressions are a powerful tool for advanced text cleaning. We will use them to remove URLs, mentions, hashtags, and special characters from the tweet text.

### 4.1 Removing URLs

To remove URLs from the text, we can use the following regular expression pattern:

In [None]:
def remove_urls(text):
    # Match http:// or https:// links, or bare links starting with "www."
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

text = "Check out this amazing article: https://example.com/article #learning"
cleaned_text = remove_urls(text)

### 4.2 Removing User Mentions

To remove user mentions from the text, we can use the following regular expression pattern:

In [None]:
def remove_mentions(text):
    mention_pattern = re.compile(r'@\w+')
    return mention_pattern.sub('', text)

text = "Thanks for the great article, @johndoe! #appreciation"
cleaned_text = remove_mentions(text)

### 4.3 Removing Hashtags

To remove hashtags from the text, we can use the following regular expression pattern:

In [None]:
def remove_hashtags(text):
    hashtag_pattern = re.compile(r'#\w+')
    return hashtag_pattern.sub('', text)

text = "I love Python! 😍🐍 #python #programming"
cleaned_text = remove_hashtags(text)
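
Removing a hashtag entirely also discards the word it carries, which is often a useful topic signal. As an alternative, the sketch below keeps the tag word and drops only the `#`, or collects the tags into a list. The function names `strip_hashtag_symbol` and `extract_hashtags` are hypothetical helpers introduced here, not part of any library.

```python
import re

def strip_hashtag_symbol(text):
    """Drop the '#' but keep the tag word, e.g. '#python' -> 'python'."""
    return re.sub(r"#(\w+)", r"\1", text)

def extract_hashtags(text):
    """Return the hashtag words without the leading '#'."""
    return re.findall(r"#(\w+)", text)

sample = "I love Python! #python #programming"
print(strip_hashtag_symbol(sample))
print(extract_hashtags(sample))
```

Which variant fits depends on the analysis: keeping the word helps topic and n-gram analysis, while collecting the tags separately makes it easy to count them.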

### 4.4 All-in-One

In [None]:
def clean_text(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'\W', ' ', text)  # replace non-word characters (punctuation, '#', emojis) with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    return text.strip()

text = "Check out this amazing article: https://example.com/article #learning, Thanks for the great article, @johndoe! #appreciation, I love Python! 😍🐍 #python #programming"
cleaned_text = clean_text(text)

## 5. N-grams and Phrase Analysis

N-grams are sequences of N contiguous words in a text. They can provide insights into the co-occurrence of words and the context in which they appear. In this section, we will discuss how to generate and analyze N-grams from text data.

### 5.1 Generating N-grams

To generate N-grams, we can use the nltk library.

In [None]:
from nltk import ngrams

def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

text = "I love Python programming language because it is easy to learn and very versatile."
bigrams = generate_ngrams(text, 2)
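
To see what N-gram generation does under the hood, here is a minimal pure-Python sketch that slides a window of size `n` over a token list using `zip`. The name `simple_ngrams` is a hypothetical helper for illustration and is not the `nltk` implementation.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # tokens[i:] shifts the list by i positions; zipping the n shifted
    # copies lines up each token with its n-1 successors.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I love Python programming".split()
print(simple_ngrams(tokens, 2))
print(simple_ngrams(tokens, 3))
```

A text with fewer than `n` tokens yields an empty list, because `zip` stops at the shortest shifted copy.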

### 5.2 Analyzing N-grams

After generating N-grams, we can analyze them to identify frequently occurring phrases and patterns. For example, we can count the frequency of each N-gram to find the most common ones:

In [None]:
from collections import Counter

bigram_counts = Counter(bigrams)

## 6. Collocation Analysis

Collocations are word pairs that occur together more often than expected by chance. They can provide valuable insights into the relationships between words in the text. In this section, we will discuss how to perform collocation analysis using the nltk library.

### 6.1 Finding Collocations

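The function below tokenizes the text with `nltk`, keeps bigrams that occur at least twice, drops bigrams containing a stopword, and ranks the rest by pointwise mutual information (PMI).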

In [None]:
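import nltk
nltk.download('stopwords')
nltk.download('punkt')      # tokenizer models required by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))  # build once instead of once per token

def find_collocations(text, num_collocations=10):
    tokens = nltk.word_tokenize(text)
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(2)  # keep bigrams seen at least twice; short texts may leave few or none
    finder.apply_word_filter(lambda w: w.lower() in STOP_WORDS)  # drop bigrams containing a stopword
    return finder.nbest(bigram_measures.pmi, num_collocations)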

text = "I love Python programming language because it is easy to learn and very versatile. Python is widely used for data analysis, web development, and automation tasks. Python has a large community and many useful libraries, which makes it a popular choice for developers. In addition to Python, there are other programming languages like Java, JavaScript, and C++, but Python remains my favorite due to its simplicity and flexibility."

collocations = find_collocations(text)
print(collocations)
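
To make the PMI ranking more concrete, the sketch below computes pointwise mutual information by hand as log2(P(x, y) / (P(x) P(y))), estimating the probabilities from raw counts. The helper `bigram_pmi` and the toy sentence are assumptions for illustration; the normalization here may differ slightly from what `nltk` uses internally.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {bigram: PMI} estimated from raw counts in a token list."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)  # avoid division by zero on tiny inputs
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_xy = count / n_bigrams          # probability of the pair
        p_x = word_counts[w1] / n_words   # probability of the first word
        p_y = word_counts[w2] / n_words   # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy sentence, invented for this illustration.
tokens = "new york is big and new york is busy".split()
pmi_scores = bigram_pmi(tokens)
for pair, score in sorted(pmi_scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy example the pair that occurs only once and is made of two rare words ("big", "and") scores higher than the repeated pair ("new", "york"). PMI rewards rarity, which is why a frequency filter is usually applied before ranking.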

## 7. Exploratory Data Analysis on Social Media Data

Now that we have covered several intermediate text data analysis techniques, let's apply them to a real-life social media dataset: the Twitter US Airline Sentiment dataset.

In [None]:
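# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Download the resources used below: VADER lexicon (7.3) and tokenizer models (7.5, 7.6)
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead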

Load the dataset and take a look at the first few rows:

In [None]:
url = "https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv"
df = pd.read_csv(url)
df.head()

### 7.1 Social Media Data Characteristics

In [None]:
df.info()

### 7.2 Emoji and Emoticon Analysis

In [None]:
def extract_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

df['emojis'] = df['text'].apply(extract_emojis)
df['emojis'].head()

### 7.3 Sentiment Analysis

In [None]:
sia = SentimentIntensityAnalyzer()
df['sentiment_scores'] = df['text'].apply(lambda x: sia.polarity_scores(x))
df['sentiment_scores'].head()
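
Each entry in `sentiment_scores` is a dictionary with `neg`, `neu`, `pos`, and `compound` keys. One way to turn it into a single label is to threshold the `compound` score. The sketch below uses cut-offs of ±0.05, a commonly cited convention for VADER; treat the threshold, the helper name `label_from_compound`, and the example scores as assumptions to adjust for your data.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'neutral', or 'negative'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical score dictionary in the shape returned by polarity_scores().
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.7}
print(label_from_compound(example_scores))
```

Applied to the DataFrame, this could look like `df['sentiment_label'] = df['sentiment_scores'].apply(label_from_compound)`.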

### 7.4 Regular Expressions for Advanced Text Cleaning

In [None]:
def clean_text(text):
    text = re.sub(r"@\w+", "", text)  # Remove mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
    text = text.lower()  # Convert to lowercase
    return text

df['cleaned_text'] = df['text'].apply(clean_text)
df['cleaned_text'].head()

### 7.5 N-grams and Phrase Analysis

In [None]:
def generate_ngrams(text, n=2):
    tokens = nltk.word_tokenize(text)
    n_grams = list(ngrams(tokens, n))
    return n_grams

df['bigrams'] = df['cleaned_text'].apply(generate_ngrams)
df['bigrams'].head()
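
Per-tweet bigram lists become more informative once they are aggregated over the whole corpus. The following minimal sketch counts bigrams tweet by tweet, so that no bigram spans the boundary between two tweets. The three tweets are made up for illustration, and whitespace splitting stands in for `nltk.word_tokenize` to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical cleaned tweets, invented for this illustration.
tweets = [
    "flight was delayed again",
    "my flight was cancelled",
    "great crew today",
]

corpus_bigrams = Counter()
for tweet in tweets:
    tokens = tweet.split()  # stand-in for a proper tokenizer
    # Pair each token with its successor within the same tweet only
    corpus_bigrams.update(zip(tokens, tokens[1:]))

print(corpus_bigrams.most_common(3))
```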

### 7.6 Collocation Analysis

In [None]:
# Combine all texts into a single string
all_texts = ' '.join(df['cleaned_text'])

# Tokenize all_texts
all_tokens = nltk.word_tokenize(all_texts)

# Create bigrams
all_bigrams = list(ngrams(all_tokens, 2))

# Collocation Analysis
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(50)  # Only consider bigrams that appear at least 50 times

# Find top 10 bigrams based on PMI (Pointwise Mutual Information) score
top_bigrams = finder.nbest(bigram_measures.pmi, 10)
print("Top 10 bigrams:", top_bigrams)