# Business Understanding

Main goal of this work is to explain the basics of what is happening during the protests in Iran thorugh the tweets dataset. We are looking for the answers of the 3 questions listed below that will help to unravel the mysteries behind the protests:
1. What are the relationships between the most common words (hashtags) used in the tweets?
2. -
3. -

# Data Understanding

In this part we load the dataset into a pandas DataFrame and check for inconsistencies, missing values and other possible problems which may get in the way of a proper analysis of the dataset. 

In [1]:
import pandas as pd
import numpy as np
from data_wrangler import Wrangler

In [2]:
tweets = pd.read_csv('tweets.csv', dtype=object)
wrangle = Wrangler()
tweets.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source
0,Heidi 🌊💙🇺🇦🇺🇸🇸🇰🌻🪴☕️🦅,"Plainville, CT",I enjoy connecting w/ppl around the 🌍 Democrat...,2010-10-03 04:41:23+00:00,5916.0,6472,177328,False,2022-12-02 16:38:01+00:00,Don’t Let Them Stand Alone. The #women + Girls...,"['women', 'IranianRegime']",Twitter for iPhone
1,Captain Merika,earth 🌍🌎,I hate all forms of dictatorship,2012-04-02 20:18:17+00:00,51.0,85,43,False,2022-12-02 16:35:59+00:00,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste...",,Twitter for Android
2,marjan nourai,,in the now,2021-01-06 22:23:55+00:00,72.0,92,14751,False,2022-12-02 16:33:47+00:00,Tweeting isn’t enough! Social Media isn’t enou...,"['WomanLifeFreedom', 'IranProtests2022', 'Iran...",Twitter for iPhone
3,IranWire,,News and stories from the heart of #Iran.,2013-04-17 12:59:02+00:00,37877.0,1291,6455,False,2022-12-02 16:29:00+00:00,"At #Iran's temporary detention centres, for th...",['Iran'],Buffer
4,Hamidreza Azizi,Berlin,PhD | Visiting Fellow @SWPBerlin | Associate @...,2012-12-04 12:22:28+00:00,4523.0,470,34219,True,2022-12-02 16:28:44+00:00,There are at least two problems with this ethn...,,Twitter for iPhone


# Prepare Data

In this part we will pre-process our data for it to be ready for the actual analysis in terms of finding the answers to our questions listed at the top.

In [3]:
tweets = tweets.dropna(
    subset=['user_verified', 'text', 'user_followers', 'user_created', 'user_friends', 'user_favourites', 'date', 'source', 'user_name']
)

valid_sources = [
    'Twitter for Android',
    'Twitter for iPhone',
    'Twitter Web App',
    'Twitter for iPad'
]
valid_src_tweets = tweets.loc[tweets['source'].isin(valid_sources)]

print('Percentage of tweets removed due to having invalid sources is:')
print((len(tweets) - len(valid_src_tweets))/len(tweets) * 100)

Percentage of tweets removed due to having invalid sources is:
6.897837084370938


Aside from a few vital missing values in a small subset of tweets, I have realised that a considerable amount of tweets were posted using sources that are non-Twitter applications and third-party clients. Further research on internet showed that tweets with non-Twitter application sources point to user accounts likely to be managed by bots as well as tweets posted through validated Twitter applications are likely to be human beings. Therefore I have decided to filter out any tweets (rows) form the dataset which does not have a valid Twitter application as its source.

Further research showed that even state of the art bot detection algorithms for twitter depends heavily on the source of the tweet and therefore validated my approach.

There are further cleaning and preparation of the data that is employed through the Wrangler class such as removal of stop words, various regex impressions to disect the text into different types of words etc. Please refer to the data_wrangler.py for more information.

# Model Data

### Question 1: What are the relationships between the most common words (hashtags) used in the tweets?

With the help of the `Wrangle` class, we will be able to take a look at each text of the tweet, remove links, group hashtags and words by the alphabet used and count each of the respective word's total occurence within the dataset.

In [4]:
wrangle.disect_text(
    tweets_list=valid_src_tweets.text.to_list()
)

In [5]:
latin_hashtags = wrangle.sort_to_list(
    dict_name='latin_hashtags',
    num=1
)

New len of list is : 19897
First 1 elements in list are: 

('mahsaamini', 208211)


In [6]:
latin_words =wrangle.sort_to_list(
    dict_name='latin_words',
    num=1,
)

New len of list is : 87997
First 1 elements in list are: 

('iran', 91127)


Reducing each tweet into lists of words and hastags in latin.

In [7]:
wrangle.tweet_to_hashtag(valid_src_tweets.text.unique())

In [8]:
wrangle.tweet_to_text(valid_src_tweets.text.unique())

Taking the first 150 most occured words and hashtags in order to analyse these elements.

In [9]:
words_to_use = []
hashtags_to_use = []

for i in range(150):
    words_to_use.append(latin_words[i][0])
    hashtags_to_use.append(latin_hashtags[i][0])


In [34]:
print('first 5 words chosen are:')
print(latin_words[0:5])
print('\nfirst 5 hashtags chosen are:')
print(latin_hashtags[0:5])

first 5 words chosen are:
[('iran', 91127), ('people', 64596), ('regime', 63421), ('iranian', 48115), ('islamic', 46955)]

first 5 hashtags chosen are:
[('mahsaamini', 208211), ('iranrevolution', 99783), ('iranprotests2022', 55146), ('iranprotests', 49644), ('iran', 44787)]


In [28]:
hash_np = pd.read_csv('../results/question_1/hashtags_mat150.csv').to_numpy()
word_np = pd.read_csv('../results/question_1/words_mat150.csv').to_numpy()

In [38]:
words_pair_occ = []
ind_list = []
for i in range(len(word_np)):
    for j in range(len(word_np)):
        val = np.max(word_np[i,j])
        word_i = words_to_use[i]
        word_j = words_to_use[j]
        if i != j:
            words_pair_occ.append((word_i, word_j, val))

words_pair_occ_top5 = sorted(
    list(set(words_pair_occ)),
    key=lambda x: x[2],
    reverse=True
)[0:5]

words_pair_occ_top5

[('republic', 'voice', 27115.0),
 ('islamic', 'freedom', 27115.0),
 ('iran', 'regime', 19205.0),
 ('iran', 'iranian', 18025.0),
 ('regime', 'people', 18025.0)]

In [36]:
hashtags_pair_occ = []
ind_list = []
for i in range(len(hash_np)):
    for j in range(len(hash_np)):
        val = np.max(hash_np[i,j])
        word_i = hashtags_to_use[i]
        word_j = hashtags_to_use[j]
        if i != j:
            hashtags_pair_occ.append((word_i, word_j, val))

hashtags_pair_occ_top5 = sorted(
    list(set(hashtags_pair_occ)),
    key=lambda x: x[2],
    reverse=True
)[0:5]

hashtags_pair_occ_top5

[('mahsaamini', 'iranprotests2022', 49157.0),
 ('mahsaamini', 'mahsa_amini', 34326.0),
 ('opiran', 'iranrevolution', 34326.0),
 ('mahsaamini', 'iranprotests', 28556.0),
 ('iranprotests2022', 'iranrevolution', 28556.0)]

Cross check each word with others and count when they occur together. Print to a matrix, save to a csv file.

In [11]:
words_mat = np.zeros((len(words_to_use), len(words_to_use)))
for i in range(len(words_to_use) - 1):
    for j in range(i+1, len(words_to_use)):
        count = 0
        for tweet in wrangle.tweet_texts_list:
            if words_to_use[i] in tweet:
                if words_to_use[j] in tweet:
                    count +=1
        words_mat[i,j] = count
        words_mat[j,i] = count


In [12]:
words_mat_pd = pd.DataFrame(words_mat)
words_mat_pd.to_csv('words_mat150.csv')

Cross check each hashtags with others and count when they occur together. Print to a matrix, save to a csv file.

In [13]:
hashtags_mat = np.zeros((len(hashtags_to_use), len(hashtags_to_use)))
for i in range(len(hashtags_to_use) - 1):
    for j in range(i+1, len(hashtags_to_use)):
        count = 0
        for tweet in wrangle.tweet_hashtags_list:
            if hashtags_to_use[i] in tweet:
                if hashtags_to_use[j] in tweet:
                    count +=1
        hashtags_mat[i,j] = count
        hashtags_mat[j,i] = count

In [14]:
hashtags_mat_pd = pd.DataFrame(hashtags_mat)
hashtags_mat_pd.to_csv('hashtags_mat150.csv')

Here we have created two distinct networks with nodes being individual words or hashtags and values are the 'connection weights' based on how many times the two corresponding word or hashtag were used in an unique tweet. 

---
Options for the visualisation of the networks, found with trial and error:

In [20]:
OPTION_LARGE = """
    var options = {
    "nodes": {
        "borderWidth": 2,
        "opacity": 0.8,
        "font": {
        "size": 20,
        "strokeWidth": 10
        }
    },
    "physics": {
        "forceAtlas2Based": {
        "gravitationalConstant": -20,
        "centralGravity": 0.03,
        "springLength": 100,
        "damping": 0.9
        },
        "minVelocity": 0.75,
        "solver": "forceAtlas2Based"
    }
}"""

OPTION_MEDIUM = """
    var options = {
    "nodes": {
        "borderWidth": 2,
        "opacity": 0.8,
        "font": {
        "size": 12,
        "strokeWidth": 6
        }
    },
    "physics": {
        "forceAtlas2Based": {
        "gravitationalConstant": -20,
        "centralGravity": 0.025,
        "springLength": 100,
        "damping": 0.9
        },
        "minVelocity": 0.75,
        "solver": "forceAtlas2Based"
    }
}"""

OPTION_SMALL = """
    var options = {
    "nodes": {
        "borderWidth": 2,
        "opacity": 0.8,
        "font": {
        "size": 10,
        "strokeWidth": 5
        }
    },
    "physics": {
        "forceAtlas2Based": {
        "gravitationalConstant": -20,
        "centralGravity": 0.02,
        "springLength": 100,
        "damping": 0.9
        },
        "minVelocity": 0.75,
        "solver": "forceAtlas2Based"
    }
}"""

Below is a top level function to visualise the networks created, allowing a cutoff ratio to be passed so that connections with lower magnitudes could be eliminated, so that complexity of the visualisation can be modified. Tenable cutoff ratios can be determined for each network by trial and error. 

In [31]:
from pyvis.network import Network


def show_graph(cutoff_ratio, file_name, mat_type, options):   
    if mat_type=='hashtag':
        to_use = hashtags_to_use
        mat = hashtags_mat
    elif mat_type=='word':
        to_use = words_to_use
        mat = words_mat

    filt_words_mat = mat.copy()
    tot = np.sum(filt_words_mat)
    filt_words_mat[filt_words_mat/tot < cutoff_ratio] = 0
    filt_word_mat_pd = pd.DataFrame(filt_words_mat)
    filt_word_mat_pd.to_csv(f'{file_name}.csv')
    net_w = Network(notebook = True, cdn_resources = 'remote')
    for i in range(len(mat)):
        net_w.add_node(i, label = to_use[i], value=np.sum(mat[i]))
    for i in range(len(mat)):
        for j in range(len(mat)):
            if filt_words_mat[i,j] > 0:
                net_w.add_edge(i,j, value=filt_words_mat[i,j])

    net_w.set_options(options)    
    
    #net_w.show_buttons(filter_=['nodes', 'edges', 'physics'])

    net_w.show(f'{file_name}.html')

Creating 15 different graph representations with varying cutoff ratios and saving them as html for hashtags.

In [29]:
i = 0
for cutoff in np.linspace(0.000075, 0.002, num=15):
    i += 1
    print((i, cutoff))
    if i < 5:
        options = OPTION_LARGE
    elif 5 <= i < 10:
        options = OPTION_MEDIUM
    elif 10 <= i:
        options = OPTION_SMALL
    
    show_graph(
        cutoff_ratio=cutoff,
        file_name='hashtag_' + str(i),
        mat_type='hashtag',
        options=options
    )

(1, 7.5e-05)
(2, 0.00021250000000000002)
(3, 0.00035)
(4, 0.0004875)
(5, 0.000625)
(6, 0.0007625)
(7, 0.0009)
(8, 0.0010375)
(9, 0.001175)
(10, 0.0013125)
(11, 0.0014500000000000001)
(12, 0.0015875000000000002)
(13, 0.001725)
(14, 0.0018625)
(15, 0.002)


Creating 15 different graph representations with varying cutoff ratios and saving them as html for words.

In [30]:
i = 0
for cutoff in np.linspace(0.0002, 0.002, num=15):
    i += 1
    print((i, cutoff))
    if i < 5:
        options = OPTION_LARGE
    elif 5 <= i < 10:
        options = OPTION_MEDIUM
    elif 10 <= i:
        options = OPTION_SMALL
    
    show_graph(
        cutoff_ratio=cutoff,
        file_name='word_' + str(i),
        mat_type='word',
        options=options
    )

(1, 0.0002)
(2, 0.0003285714285714286)
(3, 0.00045714285714285713)
(4, 0.0005857142857142858)
(5, 0.0007142857142857143)
(6, 0.0008428571428571429)
(7, 0.0009714285714285714)
(8, 0.0011)
(9, 0.0012285714285714287)
(10, 0.0013571428571428573)
(11, 0.001485714285714286)
(12, 0.0016142857142857144)
(13, 0.001742857142857143)
(14, 0.0018714285714285716)
(15, 0.002)
