# Business Understanding

The main goal of this work is to explain the basics of what is happening during the protests in Iran through a dataset of tweets. We look for answers to the three questions listed below, which should help clarify what lies behind the protests:
1. -
2. -
3. What is the pre-eminent sentiment/emotion of the tweets in the dataset?

# Data Understanding

In this part we load the dataset into a pandas DataFrame and check for inconsistencies, missing values and other problems that could get in the way of a proper analysis.

In [33]:
import pandas as pd
from collections import defaultdict
import re
from data_wrangler import Wrangler, LINK_PATTERN, LATIN_HASHTAG, PERSIAN_HASHTAG, PERSIAN_WORD, USER_PATTERN

In [2]:
tweets = pd.read_csv('tweets.csv', dtype=object)
wrangle = Wrangler()
tweets.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source
0,Heidi 🌊💙🇺🇦🇺🇸🇸🇰🌻🪴☕️🦅,"Plainville, CT",I enjoy connecting w/ppl around the 🌍 Democrat...,2010-10-03 04:41:23+00:00,5916.0,6472,177328,False,2022-12-02 16:38:01+00:00,Don’t Let Them Stand Alone. The #women + Girls...,"['women', 'IranianRegime']",Twitter for iPhone
1,Captain Merika,earth 🌍🌎,I hate all forms of dictatorship,2012-04-02 20:18:17+00:00,51.0,85,43,False,2022-12-02 16:35:59+00:00,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste...",,Twitter for Android
2,marjan nourai,,in the now,2021-01-06 22:23:55+00:00,72.0,92,14751,False,2022-12-02 16:33:47+00:00,Tweeting isn’t enough! Social Media isn’t enou...,"['WomanLifeFreedom', 'IranProtests2022', 'Iran...",Twitter for iPhone
3,IranWire,,News and stories from the heart of #Iran.,2013-04-17 12:59:02+00:00,37877.0,1291,6455,False,2022-12-02 16:29:00+00:00,"At #Iran's temporary detention centres, for th...",['Iran'],Buffer
4,Hamidreza Azizi,Berlin,PhD | Visiting Fellow @SWPBerlin | Associate @...,2012-12-04 12:22:28+00:00,4523.0,470,34219,True,2022-12-02 16:28:44+00:00,There are at least two problems with this ethn...,,Twitter for iPhone
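
The preview above already shows gaps (for example in `user_location` and `hashtags`). As a minimal sketch of how such gaps could be counted, the snippet below uses a small made-up frame rather than the real `tweets.csv`; the column values are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical three-row sample with the same kind of gaps as the real data
sample = pd.DataFrame({
    'user_name': ['a', 'b', 'c'],
    'user_location': ['Berlin', None, None],
    'hashtags': ["['Iran']", None, "['women']"],
})

# Number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of rows that have at least one missing value
share_incomplete = sample.isna().any(axis=1).mean()
print(share_incomplete)
```

The same two calls applied to `tweets` would show which columns need attention before any rows are dropped.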


# Prepare Data

In this part we pre-process the data so that it is ready for the analysis that answers the questions listed at the top.

In [3]:
tweets = tweets.dropna(
    subset=['user_verified', 'text', 'user_followers', 'user_created', 'user_friends', 'user_favourites', 'date', 'source', 'user_name']
)

valid_sources = [
    'Twitter for Android',
    'Twitter for iPhone',
    'Twitter Web App',
    'Twitter for iPad'
]
valid_src_tweets = tweets.loc[tweets['source'].isin(valid_sources)]

print('Percentage of tweets removed due to having invalid sources is:')
print((len(tweets) - len(valid_src_tweets))/len(tweets) * 100)
print('Number of unique tweets in dataset is:')
print(len(valid_src_tweets.text.unique()))

Percentage of tweets removed due to having invalid sources is:
6.897837084370938
Number of unique tweets in dataset is:
279791


Aside from a few missing values in essential columns for a small subset of tweets, I noticed that a considerable number of tweets were posted from sources that are not Twitter applications, i.e. third-party clients. Further research online suggested that tweets from non-Twitter sources are more likely to come from accounts managed by bots, whereas tweets posted through the valid Twitter applications are more likely to be written by humans. I therefore filtered out every tweet (row) whose source is not one of the valid Twitter applications listed above.

Further research showed that even state-of-the-art bot detection algorithms for Twitter depend heavily on the source of the tweet, which supports this approach.

Next, we keep only the unique tweet texts for further processing and analysis.

In [4]:
unique_tweets = pd.DataFrame(valid_src_tweets.text.unique(), columns=['tweets'])
unique_tweets['sentiment'] = ''
unique_tweets['emotion'] = ''
unique_tweets['processed_tweets'] = ''

In [5]:
unique_tweets.head()

Unnamed: 0,tweets,sentiment,emotion,processed_tweets
0,Don’t Let Them Stand Alone. The #women + Girls...,,,
1,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste...",,,
2,Tweeting isn’t enough! Social Media isn’t enou...,,,
3,There are at least two problems with this ethn...,,,
4,"#Iranprotests2022 : Dec 2 — Zahedan, SE — #Ira...",,,


# Model Data

### Question 3: What is the pre-eminent sentiment/emotion of the tweets in the dataset?

In [6]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
import re

Top-level function that loads the model, tokenizer and labels for a given task from cardiffnlp on Hugging Face.

In [7]:
def acquire_model(task: str):
    MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    # Download the label mapping (one tab-separated "<id> <label>" pair per line)
    mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
    with urllib.request.urlopen(mapping_link) as f:
        lines = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(lines, delimiter='\t')
    # Skip empty rows such as the one produced by a trailing newline
    labels = [row[1] for row in csvreader if len(row) > 1]

    model = AutoModelForSequenceClassification.from_pretrained(MODEL)
    # Save model and tokenizer locally under the same path for later runs
    model.save_pretrained(MODEL)
    tokenizer.save_pretrained(MODEL)
    return (model, tokenizer, labels)
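
To make the label-parsing step easier to follow without a network call, here is a minimal sketch that runs the same `csv.reader` logic on an in-memory string. The sample content is an assumption about the tab-separated layout the function expects, not a copy of the downloaded file.

```python
import csv

# Assumed sample content: one "<id>\t<label>" pair per line
sample_mapping = "0\tnegative\n1\tneutral\n2\tpositive\n"

rows = csv.reader(sample_mapping.split("\n"), delimiter="\t")
# The trailing newline yields an empty row, which the length check skips
labels = [row[1] for row in rows if len(row) > 1]
print(labels)  # ['negative', 'neutral', 'positive']
```

The position of each label in the list matches the index of the corresponding model output, which is why the order of the rows matters.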

Load cardiffnlp's Twitter RoBERTa base models for sentiment and emotion prediction.

In [8]:
(sentiment_model, sentiment_tokenizer, sentiment_labels) = acquire_model(task='sentiment')

In [9]:
(emotion_model, emotion_tokenizer, emotion_labels) = acquire_model(task='emotion')

Create a dictionary for each task to pass to the top-level function below.

In [10]:
emotion_task = dict()
emotion_task['labels'] = emotion_labels
emotion_task['model'] = emotion_model
emotion_task['tokenizer'] = emotion_tokenizer
emotion_task['string'] = 'emotion'

In [11]:
sentiment_task = dict()
sentiment_task['labels'] = sentiment_labels
sentiment_task['model'] = sentiment_model
sentiment_task['tokenizer'] = sentiment_tokenizer
sentiment_task['string'] = 'sentiment'

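Top-level function that cleans the text of each tweet: links become `http`, user mentions become `@user`, and Persian hashtags and words are removed.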

In [12]:
def preprocess(text):
    interim = LINK_PATTERN.sub('http', text).strip()
    interim = PERSIAN_HASHTAG.sub('', interim).strip()
    interim = PERSIAN_WORD.sub('', interim).strip()
    interim = USER_PATTERN.sub('@user', interim).strip()
    final_text = interim.strip()
    return final_text
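
The patterns used above are imported from `data_wrangler` and are not shown in this notebook. The sketch below illustrates the same sequence of substitutions with hypothetical stand-in regular expressions (`LINK`, `USER` and `PERSIAN` are assumptions and may differ from the real patterns).

```python
import re

# Hypothetical stand-ins; the real patterns live in data_wrangler and may differ
LINK = re.compile(r'https?://\S+')
USER = re.compile(r'@\w+')
# Optional '#', then Arabic-script words joined by underscores
PERSIAN = re.compile(r'#?[\u0600-\u06FF]+(?:_[\u0600-\u06FF]+)*')

def preprocess_sketch(text: str) -> str:
    text = LINK.sub('http', text)      # replace links with a placeholder
    text = PERSIAN.sub('', text)       # drop Persian hashtags and words
    text = USER.sub('@user', text)     # anonymise user mentions
    return text.strip()

example = '@melaniejoly Thank you https://t.co/abc #مهسا_امینی'
cleaned = preprocess_sketch(example)
print(cleaned)  # @user Thank you http
```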

Top-level function that fills in the column of the `pandas.DataFrame` that corresponds to the given `task`.

In [13]:
def analyse_tweets(
    tweets_df: pd.DataFrame,
    task: dict
):
    for index, text in tweets_df['tweets'].items():
        analysis_label_scores = {label:0 for label in task['labels']}
        preprocessed_text = preprocess(text)
        tweets_df['processed_tweets'].iloc[index] = preprocessed_text
        encoded_input = task['tokenizer'](
            preprocessed_text,
            return_tensors='pt',
            truncation='only_first',
            max_length=512
        )
        output = task['model'](**encoded_input)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)

        ranking = np.argsort(scores)
        ranking = ranking[::-1]
        for i in range(scores.shape[0]):
            l = task['labels'][ranking[i]]
            s = scores[ranking[i]]
            analysis_label_scores[l] += s
        
        tweets_df[task['string']].iloc[index] = analysis_label_scores
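
The scoring step in the function above turns the raw model outputs into probabilities with a softmax and pairs each probability with its label. A minimal sketch of that step in plain Python, with made-up logits instead of real model output:

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['negative', 'neutral', 'positive']
logits = [2.0, 0.0, -1.0]  # made-up raw model outputs

scores = softmax(logits)
# Pair each label with its probability, keeping the label order
label_scores = dict(zip(labels, scores))
top_label = max(label_scores, key=label_scores.get)
print(label_scores, top_label)
```

Since every label receives exactly one score, pairing labels and scores directly gives the same mapping as the ranking loop in `analyse_tweets`.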

        

To demonstrate, for the first 10 tweets in the dataframe:

In [14]:
try_df = unique_tweets[0:10]

In [15]:
analyse_tweets(
    tweets_df=try_df,
    task=emotion_task   
)
analyse_tweets(
    tweets_df=try_df,
    task=sentiment_task   
)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
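
pandas emits a warning of this kind when a value is written through chained indexing, or into a frame that was itself sliced from another frame, because it cannot tell whether the write reaches the original. A minimal sketch of a pattern that avoids the ambiguity, on a made-up frame: take an explicit `.copy()` of the slice and write each cell with a single `.at[row, column]` call.

```python
import pandas as pd

# Made-up frame standing in for the tweet table
full = pd.DataFrame({'tweets': ['first', 'second', 'third'], 'label': ''})

# An explicit copy makes the demo frame independent of `full`
demo = full[0:2].copy()

for index, text in demo['tweets'].items():
    # One indexing step with .at, so there is no chained assignment
    demo.at[index, 'label'] = text.upper()

print(demo)
print(full)  # the original frame is left untouched
```

Note that this sketch stores plain strings; storing richer objects per cell may behave differently and would need to be checked separately.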


In [16]:
try_df

Unnamed: 0,tweets,sentiment,emotion,processed_tweets
0,Don’t Let Them Stand Alone. The #women + Girls...,negative 0.872835 neutral 0.120743 posi...,anger 0.881560 joy 0.007872 opti...,Don’t Let Them Stand Alone. The #women + Girls...
1,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste...",negative 0.713806 neutral 0.273410 posi...,anger 0.160401 joy 0.021945 opti...,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste..."
2,Tweeting isn’t enough! Social Media isn’t enou...,negative 0.398528 neutral 0.482813 posi...,anger 0.714236 joy 0.009623 opti...,Tweeting isn’t enough! Social Media isn’t enou...
3,There are at least two problems with this ethn...,negative 0.829890 neutral 0.164414 posi...,anger 0.890616 joy 0.004501 opti...,There are at least two problems with this ethn...
4,"#Iranprotests2022 : Dec 2 — Zahedan, SE — #Ira...",negative 0.458482 neutral 0.473956 posi...,anger 0.718373 joy 0.112985 opti...,"#Iranprotests2022 : Dec 2 — Zahedan, SE — #Ira..."
5,"#YasharTohidi , a political prisoner, was shot...",negative 0.927484 neutral 0.069237 posi...,anger 0.184036 joy 0.008386 opti...,"#YasharTohidi , a political prisoner, was shot..."
6,#Iranprotests2022\n#IranRevoIution: 22-year-ol...,negative 0.552572 neutral 0.401029 posi...,anger 0.312359 joy 0.074871 opti...,#Iranprotests2022\n#IranRevoIution: 22-year-ol...
7,"@melaniejoly Thank you, if you want to make a ...",negative 0.185354 neutral 0.614289 posi...,anger 0.613989 joy 0.021918 opti...,"@user Thank you, if you want to make a real im..."
8,Workload sure is heavy at the regime-controlle...,negative 0.574781 neutral 0.387337 posi...,anger 0.810209 joy 0.032480 opti...,Workload sure is heavy at the regime-controlle...
9,More than 10 children have been already killed...,negative 0.950485 neutral 0.047427 posi...,anger 0.569274 joy 0.012548 opti...,More than 10 children have been already killed...


Save demo dataframe as `csv` and `xlsx` files

In [17]:
try_df.to_csv('try_nlp_df.csv')

In [None]:
try_df.to_excel('try_nlp_df.xlsx')

---
Run the model on each tweet and save the results to the dataframe, first for sentiment and then for emotion.

In [17]:
analyse_tweets(
    tweets_df=unique_tweets,
    task=sentiment_task
)

In [18]:
analyse_tweets(
    tweets_df=unique_tweets,
    task=emotion_task
)

Save the final dataframe to `csv` and `xlsx`.

In [19]:
unique_tweets.to_csv('NLP_analysis.csv')

In [20]:
unique_tweets.to_excel('NLP_analysis.xlsx')

In [21]:
print('Concluding data frame has size: ' + str(unique_tweets.size) + ' and shape: ' + str(unique_tweets.shape))

Concluding data frame has size: 1119164 and shape: (279791, 4)


In [22]:
unique_tweets.head()

Unnamed: 0,tweets,sentiment,emotion,processed_tweets
0,Don’t Let Them Stand Alone. The #women + Girls...,negative 0.872835 neutral 0.120743 posi...,anger 0.881560 joy 0.007872 opti...,Don’t Let Them Stand Alone. The #women + Girls...
1,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste...",negative 0.713806 neutral 0.273410 posi...,anger 0.160401 joy 0.021945 opti...,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste..."
2,Tweeting isn’t enough! Social Media isn’t enou...,negative 0.398528 neutral 0.482813 posi...,anger 0.714236 joy 0.009623 opti...,Tweeting isn’t enough! Social Media isn’t enou...
3,There are at least two problems with this ethn...,negative 0.829890 neutral 0.164414 posi...,anger 0.890616 joy 0.004501 opti...,There are at least two problems with this ethn...
4,"#Iranprotests2022 : Dec 2 — Zahedan, SE — #Ira...",negative 0.458482 neutral 0.473956 posi...,anger 0.718373 joy 0.112985 opti...,"#Iranprotests2022 : Dec 2 — Zahedan, SE — #Ira..."


In [34]:
negative_scores = defaultdict(int)
positive_scores = defaultdict(int)
neutral_scores = defaultdict(int)

for analysis in unique_tweets.sentiment:
    # Key access works whether the cell holds a Series or a dict of scores
    neg_scr = analysis['negative']
    neu_scr = analysis['neutral']
    pos_scr = analysis['positive']
    s_list = sorted(
        [(neg_scr, 'neg'), (neu_scr, 'neu'), (pos_scr, 'pos')],
        key=lambda x: x[0],
        reverse=True
    )
    for i in range(len(s_list)):
        if s_list[i][1] == 'neg':
            negative_scores[i+1] += 1
        elif s_list[i][1] == 'neu':
            neutral_scores[i+1] += 1
        elif s_list[i][1] == 'pos':
            positive_scores[i+1] += 1


    


In [36]:
print('Positive Scores: ')
print(positive_scores)
print('\nNeutral Scores: ')
print(neutral_scores)
print('\nNegative Scores: ')
print(negative_scores)

Positive Scores: 
defaultdict(<class 'int'>, {3: 209214, 2: 40175, 1: 30402})

Neutral Scores: 
defaultdict(<class 'int'>, {2: 178545, 1: 101241, 3: 5})

Negative Scores: 
defaultdict(<class 'int'>, {1: 148148, 2: 61071, 3: 70572})
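
In these dictionaries the key is the rank of a label within a tweet (1 means the highest score) and the value is the number of tweets where the label took that rank. As a short worked example, the rank-1 counts printed above can be turned into shares of all tweets:

```python
# Rank-1 counts copied from the printed output
rank1 = {'negative': 148148, 'neutral': 101241, 'positive': 30402}

total = sum(rank1.values())
shares = {label: round(100 * count / total, 1) for label, count in rank1.items()}

print(total)   # 279791, the number of unique tweets
print(shares)  # {'negative': 52.9, 'neutral': 36.2, 'positive': 10.9}
```

Negative is therefore the top-scoring sentiment for roughly half of the tweets, while positive comes first for about one tweet in nine.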


In [41]:
anger_scores = defaultdict(int)
joy_scores = defaultdict(int)
optimism_scores = defaultdict(int)
sadness_scores = defaultdict(int)

for analysis in unique_tweets.emotion:
    # Key access works whether the cell holds a Series or a dict of scores
    ang_scr = analysis['anger']
    joy_scr = analysis['joy']
    opt_scr = analysis['optimism']
    sad_scr = analysis['sadness']
    s_list = sorted(
        [(ang_scr, 'ang'), (joy_scr, 'joy'), (opt_scr, 'opt'), (sad_scr, 'sad')],
        key=lambda x: x[0],
        reverse=True
    )
    for i in range(len(s_list)):
        if s_list[i][1] == 'ang':
            anger_scores[i+1] += 1
        elif s_list[i][1] == 'joy':
            joy_scores[i+1] += 1
        elif s_list[i][1] == 'opt':
            optimism_scores[i+1] += 1
        elif s_list[i][1] == 'sad':
            sadness_scores[i+1] += 1

In [42]:
print('Anger Scores: ')
print(anger_scores)
print('\nJoy Scores: ')
print(joy_scores)
print('\nOptimism Scores: ')
print(optimism_scores)
print('\nSadness Scores: ')
print(sadness_scores)

Anger Scores: 
defaultdict(<class 'int'>, {1: 175031, 2: 52517, 3: 42999, 4: 9244})

Joy Scores: 
defaultdict(<class 'int'>, {4: 177198, 3: 44393, 1: 35577, 2: 22623})

Optimism Scores: 
defaultdict(<class 'int'>, {3: 104587, 2: 121972, 1: 46927, 4: 6305})

Sadness Scores: 
defaultdict(<class 'int'>, {2: 82679, 1: 22256, 3: 87812, 4: 87044})
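
The sentiment and emotion tallies follow the same pattern, so they could share one helper. The sketch below is one possible way to write it, assuming each row is a plain dict that maps labels to scores; `rank_counts` is a hypothetical name and the scores are made up.

```python
from collections import Counter, defaultdict

def rank_counts(score_rows):
    """Count how often each label lands at each rank (1 = highest score)."""
    counts = defaultdict(Counter)
    for scores in score_rows:
        # Labels ordered from highest to lowest score
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, label in enumerate(ordered, start=1):
            counts[label][rank] += 1
    return counts

# Made-up scores for three tweets
rows = [
    {'anger': 0.7, 'joy': 0.1, 'optimism': 0.15, 'sadness': 0.05},
    {'anger': 0.2, 'joy': 0.1, 'optimism': 0.3, 'sadness': 0.4},
    {'anger': 0.6, 'joy': 0.05, 'optimism': 0.25, 'sadness': 0.1},
]
counts = rank_counts(rows)
print(counts['anger'])     # Counter({1: 2, 3: 1})
print(counts['optimism'])  # Counter({2: 3})
```

Because the labels are taken from the rows themselves, the same function covers the three sentiment labels and the four emotion labels without a separate `if`/`elif` chain for each.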
