# Business Understanding

The main goal of this work is to explain the basics of what is happening during the protests in Iran through a dataset of tweets. We are looking for answers to the three questions listed below, which will help to unravel the mysteries behind the protests:
1. -
2. Which time intervals have the highest average number of tweets per day in the dataset, and how do the trends in the tweets from these intervals differ from those in the complete dataset?
3. -

# Data Understanding

In this part we load the dataset into a pandas DataFrame and check for inconsistencies, missing values and other possible problems which may get in the way of a proper analysis of the dataset. 

In [1]:
import pandas as pd
from data_wrangler import Wrangler

In [2]:
tweets = pd.read_csv('tweets.csv', dtype=object)
wrangle = Wrangler()
tweets.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source
0,Heidi 🌊💙🇺🇦🇺🇸🇸🇰🌻🪴☕️🦅,"Plainville, CT",I enjoy connecting w/ppl around the 🌍 Democrat...,2010-10-03 04:41:23+00:00,5916.0,6472,177328,False,2022-12-02 16:38:01+00:00,Don’t Let Them Stand Alone. The #women + Girls...,"['women', 'IranianRegime']",Twitter for iPhone
1,Captain Merika,earth 🌍🌎,I hate all forms of dictatorship,2012-04-02 20:18:17+00:00,51.0,85,43,False,2022-12-02 16:35:59+00:00,"🇮🇷Mina Yagoubi, 33, a citizen of Arak, arreste...",,Twitter for Android
2,marjan nourai,,in the now,2021-01-06 22:23:55+00:00,72.0,92,14751,False,2022-12-02 16:33:47+00:00,Tweeting isn’t enough! Social Media isn’t enou...,"['WomanLifeFreedom', 'IranProtests2022', 'Iran...",Twitter for iPhone
3,IranWire,,News and stories from the heart of #Iran.,2013-04-17 12:59:02+00:00,37877.0,1291,6455,False,2022-12-02 16:29:00+00:00,"At #Iran's temporary detention centres, for th...",['Iran'],Buffer
4,Hamidreza Azizi,Berlin,PhD | Visiting Fellow @SWPBerlin | Associate @...,2012-12-04 12:22:28+00:00,4523.0,470,34219,True,2022-12-02 16:28:44+00:00,There are at least two problems with this ethn...,,Twitter for iPhone
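
The missing-value check mentioned above is not shown in the cells. A minimal sketch of one way to do it with pandas follows; the `sample` frame and its values are made up for illustration and only stand in for `tweets`.

```python
import pandas as pd

# Tiny stand-in for tweets.csv; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "user_name": ["a", "b", None],
        "text": ["hello", None, "world"],
        "source": ["Twitter for iPhone", "Buffer", "Twitter Web App"],
    },
    dtype=object,
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Share of rows that have at least one missing value, in percent.
rows_with_missing_pct = sample.isna().any(axis=1).mean() * 100

print(missing_per_column)
print(round(rows_with_missing_pct, 2))
```

Running the same two expressions on `tweets` would show which columns need attention before any rows are dropped.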


# Prepare Data

In this part we pre-process the data so that it is ready for the analysis that answers the questions listed at the top.

In [3]:
tweets = tweets.dropna(
    subset=['user_verified', 'text', 'user_followers', 'user_created', 'user_friends', 'user_favourites', 'date', 'source', 'user_name']
)

valid_sources = [
    'Twitter for Android',
    'Twitter for iPhone',
    'Twitter Web App',
    'Twitter for iPad'
]
valid_src_tweets = tweets.loc[tweets['source'].isin(valid_sources)]

print('Percentage of tweets removed due to having invalid attributes is:')
print((len(tweets) - len(valid_src_tweets))/len(tweets) * 100)

Percentage of tweets removed due to having invalid attributes is:
6.897837084370938


Aside from a few vital missing values in a small subset of tweets, I realised that a considerable number of tweets were posted through non-Twitter applications and third-party clients. Further research online suggested that tweets from non-Twitter sources tend to come from accounts likely to be managed by bots, whereas tweets posted through official Twitter applications are more likely to come from human users. I therefore decided to filter out every tweet (row) in the dataset whose source is not an official Twitter application.

Further research showed that even state-of-the-art bot detection algorithms for Twitter depend heavily on the source of the tweet, which supports this approach.
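
To see which sources the filter removes, it can help to list them before dropping the rows. The sketch below uses a made-up `sample` frame (including a hypothetical client name) rather than the real dataset, with the same `valid_sources` list as in the cell above.

```python
import pandas as pd

# Invented example rows; "Hypothetical Bot Client" is not a real source name.
sample = pd.DataFrame(
    {
        "source": [
            "Twitter for iPhone",
            "Twitter for Android",
            "Buffer",
            "Twitter Web App",
            "Hypothetical Bot Client",
            "Twitter for iPhone",
        ]
    }
)

valid_sources = [
    "Twitter for Android",
    "Twitter for iPhone",
    "Twitter Web App",
    "Twitter for iPad",
]

# Boolean mask: True where the tweet came from an official Twitter client.
is_valid = sample["source"].isin(valid_sources)

# Tweet counts per source that the filter would remove.
dropped_sources = sample.loc[~is_valid, "source"].value_counts()

# Share of rows removed by the filter, in percent.
removed_pct = (~is_valid).mean() * 100

print(dropped_sources)
print(round(removed_pct, 2))
```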

Further cleaning and preparation of the data is done through the `Wrangler` class, such as removing stop words and applying various regular expressions to dissect the text into different types of words. Please refer to data_wrangler.py for more information.

# Model Data

### Question 2: Which time intervals have the highest average number of tweets per day in the dataset, and how do the trends in the tweets from these intervals differ from those in the complete dataset?

With the help of the `Wrangler` class, we look at the text of each tweet, remove links, group hashtags and words by the alphabet used, and count each word's total occurrences within the dataset.

In [4]:
wrangle.disect_text(valid_src_tweets.text.unique())
latin_words = wrangle.sort_to_list('latin_words', num=1)
latin_hashtags = wrangle.sort_to_list('latin_hashtags', num=1)

New len of list is : 87997
First 1 elements in list are: 

('iran', 84529)
New len of list is : 19897
First 1 elements in list are: 

('mahsaamini', 193824)
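
The internals of `Wrangler` live in data_wrangler.py and are not reproduced here. As a rough illustration of the kind of dissection described above, the sketch below removes links, separates hashtags from plain words, and buckets tokens by script. The `dissect` function, its regular expressions and the sample texts are assumptions made for this example, not the actual implementation, and stop-word removal is left out.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
LATIN_RE = re.compile(r"^[A-Za-z0-9_]+$")
# The Arabic Unicode block, which also covers the Persian alphabet.
PERSIAN_RE = re.compile(r"^[\u0600-\u06FF_]+$")


def dissect(texts):
    """Count words and hashtags per script (hypothetical helper)."""
    counts = {
        "latin_words": Counter(),
        "latin_hashtags": Counter(),
        "persian_words": Counter(),
        "persian_hashtags": Counter(),
    }
    for text in texts:
        # Drop links first, then normalise case.
        text = URL_RE.sub(" ", text).lower()
        hashtags = HASHTAG_RE.findall(text)
        # Plain words are what remains once hashtags are removed.
        words = re.findall(r"\w+", HASHTAG_RE.sub(" ", text))
        for kind, tokens in (("hashtags", hashtags), ("words", words)):
            for token in tokens:
                if LATIN_RE.match(token):
                    counts[f"latin_{kind}"][token] += 1
                elif PERSIAN_RE.match(token):
                    counts[f"persian_{kind}"][token] += 1
    return counts


# Invented sample tweets; the URL is a placeholder.
sample_texts = [
    "Free #Iran now https://example.com/x",
    "#iran #مهسا_امینی زن زندگی آزادی",
]
counts = dissect(sample_texts)
print(counts["latin_hashtags"].most_common(1))
```

Sorting each counter with `most_common` gives lists of `(token, count)` pairs similar in shape to the output printed above.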


We first copy the filtered dataset so that we can analyse it independently.

In [7]:
q2_tweets = valid_src_tweets.copy()

We convert the `date` column to datetime format.

In [8]:
q2_tweets.date = pd.to_datetime(q2_tweets.date, format='%Y-%m-%d %H:%M:%S%z')

We then create a new column, `YMD`, containing only the year-month-day part of the `date` column.

In [9]:
q2_tweets['YMD'] = q2_tweets['date'].dt.date

To demonstrate, the first row has this value in its new column:

In [10]:
q2_tweets.YMD[0]

datetime.date(2022, 12, 2)

We then count the tweets on each date using the pandas `value_counts` method and zip the counts with their respective dates into a list of tuples.

In [11]:
date_date = q2_tweets.YMD.value_counts().sort_index().index
date_occ = q2_tweets.YMD.value_counts().sort_index()
dates = list(zip(date_date, date_occ))

Next we use the `Wrangler.date_occ_group` method to group successive days into windows of 1 to 7 days and calculate the average tweet count per day for each window.

In [12]:
occ_dict = wrangle.date_occ_group(dates, [1,2,3,4,5,6,7])
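
`date_occ_group` itself is defined in data_wrangler.py. A minimal sketch of the sliding-window averaging it is described as doing is shown below. The function name `group_daily_counts` is hypothetical, the label format is copied from the output further down, and the input is assumed to be sorted with no missing days. The three daily counts are the ones reported for October 1st to 3rd later in this notebook.

```python
import datetime


def group_daily_counts(dates, group_lengths):
    """Average tweets/day for every window of n successive days.

    `dates` is a list of (date, count) tuples sorted by date and assumed
    to contain no gaps. Hypothetical stand-in for Wrangler.date_occ_group.
    """
    result = {}
    for n in group_lengths:
        windows = []
        for start in range(len(dates) - n + 1):
            window = dates[start:start + n]
            first_day, last_day = window[0][0], window[-1][0]
            average = sum(count for _, count in window) / n
            label = f"between {first_day:%m-%d} - {last_day:%m-%d}"
            windows.append((label, average))
        result[n] = windows
    return result


daily_counts = [
    (datetime.date(2022, 10, 1), 9022),
    (datetime.date(2022, 10, 2), 6318),
    (datetime.date(2022, 10, 3), 12009),
]
groups = group_daily_counts(daily_counts, [1, 2, 3])
print(groups[2])
```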

To find the highest average count in each day group category, we scan each dictionary value and keep the element with the largest average.

In [13]:
max_dict = dict()
for key, value in occ_dict.items():
    maks = ('', 0)
    for elem in value:
        if elem[1] > maks[1]:
            maks = elem
    max_dict[key] = maks

max_dict

{1: ('between 10-03 - 10-03', 12009.0),
 2: ('between 10-02 - 10-03', 9163.5),
 3: ('between 10-01 - 10-03', 9116.333333333334),
 4: ('between 10-01 - 10-04', 7882.5),
 5: ('between 10-01 - 10-05', 7820.6),
 6: ('between 10-01 - 10-06', 6771.333333333333),
 7: ('between 10-01 - 10-07', 6872.571428571428)}

From these results, we can conclude that the single day with the most tweets in the dataset was October 3rd. The most tweeted two successive days, on average, were October 2nd to 3rd, the most tweeted three successive days were October 1st to 3rd, and so on.

Additionally, we make the observation that each day group with length $N$, where $2 \le N \le 7$, includes the day group with length $K$ where $1 \le K  \le N-1$.

Since the number of tweets per day in the dataset is concentrated in one connected interval (October 1st to October 7th), I decided to accept day groups whose average tweet count stays at or above 75% of the single-day maximum (12009 tweets). The 3-day group, 10-01 to 10-03, with about 9116 tweets per day, is the longest group that meets this threshold, so we continue with it for the word frequency comparison against the whole dataset.
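
The threshold can be checked with a few lines of arithmetic. The sketch below reads the 75% rule as "the group's best average must be at least 75% of the single-day maximum", which is an interpretation of the text above. The averages are copied from the `max_dict` output.

```python
# Highest average tweets/day per group length, copied from max_dict.
max_avg = {
    1: 12009.0,
    2: 9163.5,
    3: 9116.333333333334,
    4: 7882.5,
    5: 7820.6,
    6: 6771.333333333333,
    7: 6872.571428571428,
}

# 75% of the single-day maximum.
threshold = 0.75 * max_avg[1]

# Group lengths whose best average stays at or above the threshold.
passing = [n for n, avg in max_avg.items() if avg >= threshold]
longest_passing = max(passing)

print(threshold, passing, longest_passing)
```

Under this reading the 1-, 2- and 3-day groups pass, and the 4-day group (7882.5 tweets/day) is the first to fall below the threshold.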

In [14]:
import datetime
start_date = datetime.date(2022,10,1)
end_date = datetime.date(2022,10,3)
mask = (q2_tweets.YMD >= start_date) & (q2_tweets.YMD <= end_date)
q2_tweets_3day = q2_tweets.loc[mask]

In [15]:
q2_tweets_3day.YMD.value_counts()

2022-10-03    12009
2022-10-01     9022
2022-10-02     6318
Name: YMD, dtype: int64

We then take the tweets in this interval and dissect them into hashtags and words, and into Latin and Persian script.

In [16]:
q2_wrangle = Wrangler()
q2_wrangle.disect_text(q2_tweets_3day.text.unique())

In [17]:
latin_words_q2 = q2_wrangle.sort_to_list('latin_words', num=1)

New len of list is : 20390
First 1 elements in list are: 

('iran', 8090)


In [18]:
latin_hashtags_q2 = q2_wrangle.sort_to_list('latin_hashtags', num=1)

New len of list is : 3277
First 1 elements in list are: 

('mahsaamini', 18909)


We take the 100 most frequent words in the 3-day group and find the corresponding word counts in the whole dataset.

In [17]:
words_comparison = q2_wrangle.compare_freq(
    group_word_list=latin_words_q2,
    dataset_word_list=latin_words,
    len_group=len(q2_tweets_3day.text.unique()),
    len_dataset=len(valid_src_tweets.text.unique())
)

In [18]:
hashtags_comparison = q2_wrangle.compare_freq(
    group_word_list=latin_hashtags_q2,
    dataset_word_list=latin_hashtags,
    len_group=len(q2_tweets_3day.text.unique()),
    len_dataset=len(valid_src_tweets.text.unique())
)

Below, each element in the resulting lists is a tuple with the following structure:
- The first element is the word (or hashtag)
- The second element is the relative rise of the frequency in the 3-day group over the frequency in the whole dataset (group frequency divided by dataset frequency, minus 1)
- The third element is the word's occurrence frequency in the 3-day group tweets
- The fourth element is the word's occurrence frequency in the whole dataset
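
The second element can be reproduced from the third and fourth. The sketch below assumes that a frequency is a word's count divided by the number of unique tweets, and that the rise is the ratio of the two frequencies minus 1. Both assumptions are consistent with the numbers printed in this notebook, but `frequency` and `relative_rise` are hypothetical helpers, not the code of `compare_freq`.

```python
def frequency(count, n_tweets):
    """Occurrences of a word per unique tweet (assumed definition)."""
    return count / n_tweets


def relative_rise(freq_group, freq_dataset):
    """How much more frequent a word is in the group than in the dataset.

    0 means no change; 1 means the word is twice as frequent in the group.
    """
    return freq_group / freq_dataset - 1


# Frequencies for 'sharif', copied from the 3-day comparison output.
sharif_group = 0.08942416258938653
sharif_dataset = 0.009810894560582721
sharif_rise = relative_rise(sharif_group, sharif_dataset)

print(sharif_rise)
```

A value of about 8.11 therefore means that 'sharif' is roughly nine times as frequent in the 3-day group as in the complete dataset.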

Looking at the results, a cutoff of first 20 elements for words and hashtags frequency lists seemed reasonable.

In [21]:
words_comparison_cutoff = sorted(
    words_comparison[0:20],
    key=lambda x:x[1],
    reverse=True
)

words_comparison_cutoff

[('sharif', 8.11478173954355, 0.08942416258938653, 0.009810894560582721),
 ('students', 2.7260198625009573, 0.2186676703048551, 0.058686662544542176),
 ('university', 2.670089933919305, 0.13699661272111405, 0.03732786258314242),
 ('police', 1.3087736443852935, 0.056755739555890104, 0.02458263489533259),
 ('tehran', 0.8670021061213494, 0.06977794505080918, 0.0373743258360705),
 ('need', 0.5882178055313486, 0.05965374482499059, 0.03756017884778281),
 ('forces', 0.519347245334075, 0.08814452389913437, 0.05801473242527458),
 ('please', 0.4953124983194085, 0.1140760255927738, 0.07628908721152575),
 ('support', 0.2852370985326363, 0.07866014301844185, 0.061202826395416576),
 ('women', 0.23650712439677252, 0.0929619872036131, 0.07518111733401003),
 ('help', 0.1699571226020172, 0.061083929243507716, 0.052210399905643856),
 ('freedom', 0.16129364538550206, 0.1121942039894618, 0.09661139922299145),
 ('voice', 0.023339914567972107, 0.1220549491908167, 0.1192711702663774),
 ('iran', 0.007824650247

In [22]:
hashtags_comparison_cutoff = []

hashtags_comparison_cutoff = sorted(
    hashtags_comparison[0:20],
    key=lambda x:x[1],
    reverse=True
)

hashtags_comparison_cutoff

[('sharif_university',
  7.514452200309245,
  0.024162589386526157,
  0.0028378325249918688),
 ('sharifuniversity',
  7.377927180586662,
  0.05171245765901392,
  0.006172464446676269),
 ('iranianlivesmatter',
  1.3476474929277749,
  0.023334587881068874,
  0.009939562030229707),
 ('oplran', 0.8202243536147601, 0.06605193827625142, 0.03628780053682928),
 ('iranprotests2022',
  0.8114393341825191,
  0.34561535566428303,
  0.19079598700458555),
 ('tehran', 0.7160812360591224, 0.017952578095596538, 0.010461380101575818),
 ('iranianwomen',
  0.5520689434252547,
  0.01870530673692134,
  0.012051852990267736),
 ('endiranregime',
  0.5252423722901152,
  0.007452013549115544,
  0.004885789750206404),
 ('mahsa_amini', 0.4446383382262884, 0.11242002258185924, 0.07781880046177325),
 ('opiran', 0.37077529869164866, 0.19518253669552127, 0.14238842564628598),
 ('freedom',
  0.30673042837185516,
  0.0077154685735792245,
  0.0059044072182450474),
 ('humanrights',
  0.09053879903691973,
  0.007113285660

When comparing the most frequent words and hashtags in the 3-day group against the complete dataset, I noticed that the words 'students', 'university' and 'sharif' and the hashtags 'sharifuniversity' and 'sharif_university' are among the most frequent in the 3-day group and show the largest rise in frequency relative to the complete dataset.

The frequencies of these words and their relative rise are presented in the table below:

|Word|Relative rise in frequency (3-day group vs. complete dataset)|Frequency in 3-day group|Frequency in complete dataset|
|---|---|---|---|
|sharif|8.11478173954355|0.08942416258938653|0.009810894560582721|
|students|2.7260198625009573|0.2186676703048551|0.058686662544542176|
|university|2.670089933919305|0.13699661272111405|0.03732786258314242|

The same values for the hashtags are:

|Hashtag|Relative rise in frequency (3-day group vs. complete dataset)|Frequency in 3-day group|Frequency in complete dataset|
|---|---|---|---|
|sharifuniversity|7.377927180586662|0.05171245765901392|0.006172464446676269|
|sharif_university|7.514452200309245|0.024162589386526157|0.0028378325249918688|

Research on the timeline of the Mahsa Amini protests revealed that there was a clash between students of the Sharif University of Technology and security forces on October 2nd. Considering this together with the rise in the frequency of these specific words and hashtags, we can infer that the high average tweet counts are likely related to the events that took place on October 2nd. This leads to the natural hypothesis that the clash was one of the major motivations behind the tweets in this period, which warrants further analysis of the event against the tweets.

Taking the 3-day group starting from October 1st may have diluted the signal, since the most frequent words are attributed to an event that took place the next day. I therefore decided to repeat the frequency analysis for 10-02 to 10-03, which is the most tweeted 2-day group according to the previous analysis and should be more representative of the tweets related to the clash on October 2nd.

In [19]:
start_date = datetime.date(2022,10,2)
end_date = datetime.date(2022,10,3)
mask = (q2_tweets.YMD >= start_date) & (q2_tweets.YMD <= end_date)
q2_tweets_2day = q2_tweets.loc[mask]

q2_wrangle_2day = Wrangler()
q2_wrangle_2day.disect_text(q2_tweets_2day.text.unique())

latin_words_q2_2day = q2_wrangle_2day.sort_to_list('latin_words', num=1)
latin_hashtags_q2_2day = q2_wrangle_2day.sort_to_list('latin_hashtags', num=1)

words_comparison_2day = q2_wrangle_2day.compare_freq(
    group_word_list=latin_words_q2_2day,
    dataset_word_list=latin_words,
    len_group=len(q2_tweets_2day.text.unique()),
    len_dataset=len(valid_src_tweets.text.unique())
)

hashtags_comparison_2day = q2_wrangle_2day.compare_freq(
    group_word_list=latin_hashtags_q2_2day,
    dataset_word_list=latin_hashtags,
    len_group=len(q2_tweets_2day.text.unique()),
    len_dataset=len(valid_src_tweets.text.unique())
)

New len of list is : 16749
First 1 elements in list are: 

('students', 5613)
New len of list is : 2547
First 1 elements in list are: 

('mahsaamini', 12003)


In [26]:
hashtags_comparison_cutoff_2day = sorted(
    hashtags_comparison_2day[0:20],
    key=lambda x:x[1],
    reverse=True
)

hashtags_comparison_cutoff_2day

[('sharif_university',
  11.797929227935546,
  0.03631837981557957,
  0.0028378325249918688),
 ('sharifuniversity',
  11.592720777744391,
  0.07772812128754879,
  0.006172464446676269),
 ('mahsaamin', 3.365364990637652, 0.008768456186004412, 0.0020086421650446227),
 ('iranianlivesmatter',
  1.4644050272278715,
  0.024495106635741358,
  0.009939562030229707),
 ('tehran', 0.8277599483376481, 0.019120891553996718, 0.010461380101575818),
 ('endiranregime',
  0.7715283011301848,
  0.00865531481586242,
  0.004885789750206404),
 ('iranprotests2022',
  0.7273990737326296,
  0.3295808112236239,
  0.19079598700458555),
 ('iranianwomen',
  0.5865520073500283,
  0.019120891553996718,
  0.012051852990267736),
 ('mahsa_amini', 0.5396884849083128, 0.11981671098036997, 0.07781880046177325),
 ('humanrights',
  0.3096017811241447,
  0.008542173445720428,
  0.006522725891826399),
 ('freeiran', 0.24134111844718328, 0.020195734570345646, 0.016269286717585627),
 ('opiran', 0.17838687490924804, 0.16778865192

The frequencies of these words and their relative rise are presented in the table below:

|Word|Relative rise in frequency (2-day group vs. complete dataset)|Frequency in 2-day group|Frequency in complete dataset|
|---|---|---|---|
|sharif|12.694508305431826|0.134355377043616|0.009810894560582721|
|students|4.4106204295141325|0.31753125530350174|0.058686662544542176|
|university|4.248205196976158|0.19590428240085989|0.03732786258314242|

The same values for the hashtags are:

|Hashtag|Relative rise in frequency (2-day group vs. complete dataset)|Frequency in 2-day group|Frequency in complete dataset|
|---|---|---|---|
|sharifuniversity|11.592720777744391|0.07772812128754879|0.006172464446676269|
|sharif_university|11.797929227935546|0.03631837981557957|0.0028378325249918688|

From these results we can conclude that:
1. Among 2-day groups, the highest average number of tweets in the dataset occurred between October 2nd and 3rd, with 9163.5 tweets/day.
2. Comparing the frequencies of words and hashtags in tweets posted in this 2-day group with those in the complete dataset shows a marked rise for words such as 'sharif', 'university' and 'students' and for hashtags such as 'sharifuniversity' and 'sharif_university'. This suggests a strong association between what people tweeted during this 2-day period and the events that took place at the Sharif University of Technology on the first day of the period.


For more information regarding the clash between the students and security forces, please refer to [this Reuters report](https://www.reuters.com/world/middle-east/iran-lawmakers-chant-thank-you-police-amid-growing-public-fury-over-womans-death-2022-10-02/).