# Intermediate Text Data Analysis Techniques and Introduction to Social Media Data

Table of Contents

1. Social Media Data Characteristics
2. Emoji and Emoticon Analysis
3. Sentiment Analysis
4. Regular Expressions for Advanced Text Cleaning

## 1. Social Media Data Characteristics

Social media text differs from other kinds of text data in several ways:

1. Shorter texts (e.g., tweets, comments)
2. Informal language and slang
3. Emojis and emoticons
4. URLs, mentions, and hashtags

Understanding these characteristics is essential for effectively processing and analyzing social media data.
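As a rough illustration of these characteristics, the sketch below profiles a single post by counting its words and pulling out URLs, mentions, and hashtags. The function name `profile_post`, the sample text, and the regular expressions are assumptions made for this example, not part of any library.

```python
import re

# Illustrative patterns (assumptions, not an official tokenizer).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def profile_post(text):
    """Return simple counts and matches for typical social media features."""
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "urls": URL_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }


sample = "Thanks @johndoe! Read more at https://example.com/article #learning #python"
print(profile_post(sample))
```

Running a profile like this over a whole dataset gives a quick sense of how much cleaning the text will need.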

## 2. Emoji and Emoticon Analysis

Emojis and emoticons are widely used on social media to express emotion, so analyzing them can help reveal the sentiment of a text. We will use the emoji library to extract and count the emojis in a piece of text.

### 2.1 Extracting Emojis and Emoticons

Python package: [emoji](https://carpedm20.github.io/emoji/docs)

In [1]:
!pip install emoji

Collecting emoji
  Using cached emoji-2.2.0-py3-none-any.whl
Installing collected packages: emoji
Successfully installed emoji-2.2.0


In [2]:
import emoji

print(emoji.emojize('Python is :thumbs_up:'))

Python is 👍


In [3]:
print(emoji.emojize("Python is fun :red_heart:", variant="text_type"))

Python is fun ❤︎


In [4]:
print(emoji.emojize("Python is fun :red_heart:", variant="emoji_type"))

Python is fun ❤️


#### Extracting emoji

In [5]:
# extracting emoji
emoji.emoji_list('Python is 👍')

[{'match_start': 10, 'match_end': 11, 'emoji': '👍'}]

In [6]:
text = "I love Python! 😍🐍 The weather is great today! 😊 #happy #sunny 🌞"
emoji.emoji_list(text)

[{'match_start': 15, 'match_end': 16, 'emoji': '😍'},
 {'match_start': 16, 'match_end': 17, 'emoji': '🐍'},
 {'match_start': 46, 'match_end': 47, 'emoji': '😊'},
 {'match_start': 62, 'match_end': 63, 'emoji': '🌞'}]

In [7]:
# extract distinct emojis
emoji.distinct_emoji_list('Some emoji: 🌍, 😂, 😃, 😂, 🌍, 🌦️')

['🌦️', '🌍', '😂', '😃']

In [8]:
# count the number of emojis
emoji.emoji_count('Some emoji: 🌍, 😂, 😃, 😂, 🌍, 🌦️')

6

In [9]:
emoji.emoji_count('Some emoji: 🌍, 😂, 😃, 😂, 🌍, 🌦️', unique=True)

4
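The calls above deal with Unicode emojis. Text emoticons such as :) or :-( are plain ASCII and need their own handling. The following is a minimal sketch using a small, non-exhaustive regular expression. The pattern and the helper name `extract_emoticons` are assumptions for illustration only.

```python
import re

# Non-exhaustive emoticon pattern (an assumption for illustration):
# eyes [:;=], an optional nose [-'], then a mouth character.
# The lookarounds avoid matching inside words, times (10:30) or URLs (https://).
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-']?[()\[\]DPp/\\|](?!\w)")


def extract_emoticons(text):
    """Return all emoticons found in the text, in order of appearance."""
    return EMOTICON_RE.findall(text)


print(extract_emoticons("Great flight :) but the delay :-( ;D"))
```

A real project would likely need a longer list of emoticons, since this pattern covers only the most common Western-style ones.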

## 3. Sentiment Analysis

Sentiment analysis determines the sentiment or emotion expressed in a piece of text. We will use the TextBlob library to perform a simple polarity-based sentiment analysis.

In [10]:
!pip install textblob

Collecting textblob
  Using cached textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Collecting nltk>=3.1
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting regex>=2021.8.3
  Using cached regex-2023.5.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
Installing collected packages: regex, click, nltk, textblob
Successfully installed click-8.1.3 nltk-3.8.1 regex-2023.5.5 textblob-0.17.1


In [11]:
from textblob import TextBlob

def get_sentiment(text):
    # TextBlob polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

In [23]:
sentiment = get_sentiment(text)
sentiment

'positive'

In [26]:
text1 = "Python is a programming language."
sentiment1 = get_sentiment(text1)
sentiment1

'neutral'

## 4. Regular Expressions for Advanced Text Cleaning

Regular expressions are a powerful tool for advanced text cleaning. We will use them to remove URLs, mentions, hashtags, and special characters from tweet text.

### 4.1 Removing URLs

To remove URLs from the text, we can use the following regular expression pattern:

In [28]:
import re

In [29]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')  # escape the dot so it matches a literal '.'
    return url_pattern.sub('', text)

text = "Check out this amazing article: https://example.com/article #learning"
cleaned_text = remove_urls(text)
cleaned_text

'Check out this amazing article:  #learning'

### 4.2 Removing User Mentions

To remove user mentions from the text, we can use the following regular expression pattern:

In [30]:
def remove_mentions(text):
    mention_pattern = re.compile(r'@\w+')
    return mention_pattern.sub('', text)

text = "Thanks for the great article, @johndoe! #appreciation"
cleaned_text = remove_mentions(text)
cleaned_text

'Thanks for the great article, ! #appreciation'

### 4.3 Removing Hashtags

To remove hashtags from the text, we can use the following regular expression pattern:

In [31]:
def remove_hashtags(text):
    hashtag_pattern = re.compile(r'#\w+')
    return hashtag_pattern.sub('', text).strip()

text = "I love Python! 😍🐍 #python #programming"
cleaned_text = remove_hashtags(text)
cleaned_text

'I love Python! 😍🐍'
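Hashtags often carry meaning (for example #happy), so removing them entirely can discard useful signal. An alternative, sketched here under that assumption, drops only the `#` symbol and keeps the tag word. The helper name `strip_hashtag_symbols` is hypothetical.

```python
import re


def strip_hashtag_symbols(text):
    """Drop the leading '#' of each hashtag but keep the tag word itself."""
    return re.sub(r"#(\w+)", r"\1", text)


print(strip_hashtag_symbols("The weather is great today! #happy #sunny"))
# The weather is great today! happy sunny
```

Which variant to use depends on whether the hashtags in the dataset read as content or as noise.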

### 4.4 All-in-One

In [33]:
def clean_text(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    print(f"remove URLs: {text}")
    text = re.sub(r'@\w+', '', text)  # remove mentions
    print(f"remove mentions: {text}")
    text = re.sub(r'#\w+', '', text)  # remove hashtags
    print(f"remove hashtag: {text}")
    text = re.sub(r'\W', ' ', text)  # remove special characters
    print(f"remove special characters: {text}")
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces
    print(f"remove extra spaces: {text}")
    return text.strip()

text = "Check out this amazing article: https://example.com/article #learning, Thanks for the great article, @johndoe! #appreciation, I love Python! 😍🐍 #python #programming"
cleaned_text = clean_text(text)
cleaned_text

remove URLs: Check out this amazing article:  #learning, Thanks for the great article, @johndoe! #appreciation, I love Python! 😍🐍 #python #programming
remove mentions: Check out this amazing article:  #learning, Thanks for the great article, ! #appreciation, I love Python! 😍🐍 #python #programming
remove hashtag: Check out this amazing article:  , Thanks for the great article, ! , I love Python! 😍🐍  
remove special characters: Check out this amazing article     Thanks for the great article      I love Python      
remove extra spaces: Check out this amazing article Thanks for the great article I love Python 


'Check out this amazing article Thanks for the great article I love Python'
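Note that the `\W` step also strips the emojis, as the output above shows. If the emojis are needed later (for example for sentiment), one option is to capture them before cleaning. The sketch below assumes a simple, non-exhaustive set of Unicode ranges, and the function name `clean_keep_emojis` is hypothetical.

```python
import re

# Simple, non-exhaustive emoji ranges (an assumption for illustration).
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "]"
)


def clean_keep_emojis(text):
    """Return (cleaned_text, emojis), capturing emojis before they are stripped."""
    emojis = EMOJI_RE.findall(text)                      # capture emojis first
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove URLs
    text = re.sub(r"[@#]\w+", "", text)                  # remove mentions and hashtags
    text = re.sub(r"\W", " ", text)                      # remove special characters
    text = re.sub(r"\s+", " ", text).strip()             # collapse extra spaces
    return text, emojis


print(clean_keep_emojis("I love Python! 😍🐍 #python #programming"))
```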

## Exercise: Exploratory Data Analysis on Social Media Data

Now that we have covered several intermediate text data analysis techniques, let's apply them to a real-life social media dataset.

In [34]:
# Load necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk

Load the dataset and take a look at the first few rows:

In [35]:
df = pd.read_csv("Tweets.csv")
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [36]:
df.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

### 1. Social Media Data Characteristics

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

### 2. Emoji and Emoticon Analysis

In [38]:
def extract_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

In [39]:
df['emojis'] = df['text'].apply(extract_emojis)
df['emojis'].head()

0    []
1    []
2    []
3    []
4    []
Name: emojis, dtype: object
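The first five tweets contain no emojis, so `head()` shows only empty lists. To summarize a whole column, the per-tweet lists can be flattened and counted. Because the pattern above ends in `]+`, consecutive emojis come back as one string, so this sketch counts character by character. The sample data and the name `count_emojis` are hypothetical stand-ins, not values from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for df['emojis']: one list of matches per tweet.
emoji_lists = [[], ["😍🐍"], ["😊"], ["😍"], []]


def count_emojis(lists_of_matches):
    """Count individual emoji characters across all tweets."""
    counts = Counter()
    for matches in lists_of_matches:
        for run in matches:
            counts.update(run)  # iterate over the characters of each run
    return counts


counts = count_emojis(emoji_lists)
print(counts.most_common(3))
```

Counting by character splits multi-codepoint emojis such as flags into their parts, so this is only a first approximation.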

### 3. Sentiment Analysis

In [45]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the lexicon VADER needs

sia = SentimentIntensityAnalyzer()
df['sentiment_scores'] = df['text'].apply(sia.polarity_scores)
df['sentiment_scores'].head()

0    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
1    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
2    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3    {'neg': 0.246, 'neu': 0.754, 'pos': 0.0, 'comp...
4    {'neg': 0.321, 'neu': 0.679, 'pos': 0.0, 'comp...
Name: sentiment_scores, dtype: object

In [47]:
df

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,emojis,sentiment_scores
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.246, 'neu': 0.754, 'pos': 0.0, 'comp..."
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.321, 'neu': 0.679, 'pos': 0.0, 'comp..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,,[],"{'neg': 0.0, 'neu': 0.783, 'pos': 0.217, 'comp..."
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,,[],"{'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'comp..."
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",,[],"{'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'comp..."
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada),[],"{'neg': 0.0, 'neu': 0.866, 'pos': 0.134, 'comp..."


In [48]:
# expand the dicts in 'sentiment_scores' into separate columns
df['sentiment_scores'].apply(pd.Series)

Unnamed: 0,neg,neu,pos,compound
0,0.000,1.000,0.000,0.0000
1,0.000,1.000,0.000,0.0000
2,0.000,1.000,0.000,0.0000
3,0.246,0.754,0.000,-0.5984
4,0.321,0.679,0.000,-0.5829
...,...,...,...,...
14635,0.000,0.783,0.217,0.3612
14636,0.286,0.714,0.000,-0.7906
14637,0.000,0.723,0.277,0.3182
14638,0.000,0.866,0.134,0.5027
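The compound score can be turned into a label and compared with the dataset's own `airline_sentiment` column. The sketch below uses the first five rows shown above. The ±0.05 cut-offs are a commonly used convention for VADER and are treated here as an assumption, and the helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a label (threshold is an assumed convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# First five rows shown above: compound scores and the dataset's labels.
compound_scores = [0.0, 0.0, 0.0, -0.5984, -0.5829]
gold_labels = ["neutral", "positive", "neutral", "negative", "negative"]

predicted = [label_from_compound(c) for c in compound_scores]
agreement = sum(p == g for p, g in zip(predicted, gold_labels)) / len(gold_labels)
print(predicted, agreement)
```

Five rows are far too few to judge the analyzer. The same comparison over the full column would give a more meaningful agreement rate.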


In [49]:
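# axis=1 joins the score columns side by side; the default (axis=0) would stack
# the two frames row-wise and fill the non-matching columns with NaN
pd.concat([df, df['sentiment_scores'].apply(pd.Series)], axis=1)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,...,tweet_coord,tweet_created,tweet_location,user_timezone,emojis,sentiment_scores,neg,neu,pos,compound
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,...,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,1.000,0.000,0.0000
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,1.000,0.000,0.0000
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,1.000,0.000,0.0000
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.246, 'neu': 0.754, 'pos': 0.0, 'comp...",0.246,0.754,0.000,-0.5984
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.321, 'neu': 0.679, 'pos': 0.0, 'comp...",0.321,0.679,0.000,-0.5829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,...,,2015-02-22 12:01:01 -0800,,,[],"{'neg': 0.0, 'neu': 0.783, 'pos': 0.217, 'comp...",0.000,0.783,0.217,0.3612
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,...,,2015-02-22 11:59:46 -0800,Texas,,[],"{'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'comp...",0.286,0.714,0.000,-0.7906
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",,[],"{'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'comp...",0.000,0.723,0.277,0.3182
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,...,,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada),[],"{'neg': 0.0, 'neu': 0.866, 'pos': 0.134, 'comp...",0.000,0.866,0.134,0.5027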


### 4. Regular Expressions for Advanced Text Cleaning

In [50]:
def clean_text(text):
    text = re.sub(r"@\w+", "", text)  # Remove mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
    text = text.lower()  # Convert to lowercase
    return text

In [51]:
df['cleaned_text'] = df['text'].apply(clean_text)
df['cleaned_text'].head()

0                                          what  said.
1     plus you've added commercials to the experien...
2     i didn't today... must mean i need to take an...
3     it's really aggressive to blast obnoxious "en...
4             and it's a really big bad thing about it
Name: cleaned_text, dtype: object
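Removing mentions leaves stray whitespace behind, as in the leading space and double space of the first row above. A small follow-up step can normalize this. The helper name `normalize_whitespace` is an assumption for this sketch.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace(" what  said."))
# what said.
```

It could be applied to the column with `df['cleaned_text'].apply(normalize_whitespace)`, or folded into `clean_text` as a final step.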

In [52]:
df

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,emojis,sentiment_scores,cleaned_text
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",what said.
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",plus you've added commercials to the experien...
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),[],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",i didn't today... must mean i need to take an...
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.246, 'neu': 0.754, 'pos': 0.0, 'comp...","it's really aggressive to blast obnoxious ""en..."
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),[],"{'neg': 0.321, 'neu': 0.679, 'pos': 0.0, 'comp...",and it's a really big bad thing about it
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,,[],"{'neg': 0.0, 'neu': 0.783, 'pos': 0.217, 'comp...",thank you we got on a different flight to chi...
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,,[],"{'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'comp...",leaving over 20 minutes late flight. no warni...
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",,[],"{'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'comp...",please bring american airlines to #blackberry10
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada),[],"{'neg': 0.0, 'neu': 0.866, 'pos': 0.134, 'comp...","you have my money, you change my flight, and ..."
