### Objective

Collect data from an API or another non-static source, store and preprocess it, and carry out a preliminary analysis.

### Goal

Analyze social media activity around the Delhi riots, with a view to sentiment analysis and user profiling.

### Background

Following the complete [lockdown of Indian-administered Kashmir](https://www.aljazeera.com/indepth/inpictures/pictures-100-days-crippling-lockdown-kashmir-191110141155667.html) on August 5, 2019, through the abrogation of Article 370 of the Indian constitution, which gave autonomy to the region, the government of India passed a controversial bill, the ["Citizenship Amendment Bill"](https://www.bbc.com/news/world-asia-india-50670393), on December 11, 2019. The bill aimed to provide citizenship to non-Muslim minorities through naturalization. These two events were opposed [nationally](https://edition.cnn.com/2019/12/31/opinions/india-citizenship-law-crosses-line-singh/index.html) and [internationally](https://www.indiatoday.in/india/story/caa-protest-world-students-international-foreign-modi-india-1637241-2020-01-16). Fueling religious radicalization, they led to [riots](https://en.wikipedia.org/wiki/2020_Delhi_riots) in major cities in the country, most notably Delhi, the capital of India. These riots are referred to as the **Delhi Riots**.

[EU DisinfoLab](https://www.disinfo.eu/), a Brussels-based NGO focused on tackling disinformation campaigns targeting the EU, released a report on November 26, 2019 titled ["Uncovered: 265 coordinated fake local media outlets serving Indian interests"](https://www.disinfo.eu/publications/uncovered-265-coordinated-fake-local-media-outlets-serving-indian-interests). The report raised further questions about the authenticity of online content, and this extends to content on social media. Governments and lobbyists have been using [social media](https://www.nytimes.com/2020/03/29/technology/facebook-google-twitter-november-election.html) to sway public perception.

### Data Identification

#### Platform selection

The criteria for platform selection are as follows:  
- text-rich data
- availability of an API or tool for data collection
- amount of discussion


For this analysis, the following social media platforms are considered:  
- Facebook
- Twitter
- Reddit

For this task **Twitter** is chosen, and data is acquired using [twint](https://github.com/twintproject/twint), an open-source library that fetches public Twitter data without any limit. The figure below shows the rationale behind the platform and tool selection.

![](https://github.com/hamzaafridi/delhi-riots/blob/master/platform_selection_mindmap.jpeg?raw=true)

#### Tool selection

In [None]:
#suppress warnings generated by ipython for cleaner working
import warnings
warnings.simplefilter('ignore')

To collect the data, [twint](https://github.com/twintproject/twint) is used. Twint is an advanced Twitter scraping tool that imposes no limits and requires no authentication. The project is two years old; however, there is [active participation](https://github.com/twintproject/twint/graphs/code-frequency) by contributors.  

In [None]:
import twint #twitter scraping tool
import nest_asyncio #twint depends on this inside a notebook
import os #to measure file size
import time #to pause between requests
import pandas as pd
import seaborn as sns #plotting tool
import matplotlib.pyplot as plt #plotting tool

### Data collection

There are two steps to data collection:  
1. collect and store tweets
2. extract contributing twitter handles and collect profile data for them

#### Collect and store tweets 
Because of a network issue, the scraper would break after acquiring approximately 14,000 tweets. As a safeguard, tweets are acquired 10,000 at a time, with a 30-second delay after each request so that the traffic does not resemble a DDoS attack.

In [None]:
nest_asyncio.apply() #allow nested event loops so twint can run inside the notebook
tweet_client = twint.Config() # configure a client

#search configuration
tweet_client.Search = "delhiriots" #search query
tweet_client.Limit = 10000 #max number of results per run
tweet_client.Store_csv = True #store data to a csv file by appending it
tweet_client.Output = "tweets.csv" #file name
tweet_client.Hide_output = True #do not print output in the notebook
tweet_client.Resume = 'last_qerry.txt' #checkpoint file used to resume the search in case of an error

In [None]:
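#size before the search; 0 if the file does not exist yet (first run)
old_size = os.path.getsize('./tweets.csv') if os.path.exists('./tweets.csv') else 0
twint.run.Search(tweet_client) #run search
new_size = os.path.getsize('./tweets.csv') #size after the search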

In [None]:
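while old_size < new_size: #keep going as long as the last search added data
    twint.run.Search(tweet_client) #run search, resuming from the checkpoint
    old_size = new_size
    new_size = os.path.getsize('./tweets.csv') #track file size
    time.sleep(30) #wait 30 seconds before the next request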
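
The stop condition used above (keep searching while the output file keeps growing) can be separated from twint and tried on its own. The following is a minimal sketch, not the notebook's actual collection code: `collect_until_no_growth`, `fake_fetch` and the demo file are hypothetical names, and the fake fetcher simply stands in for a resumable twint search.

```python
import os
import tempfile
import time


def collect_until_no_growth(fetch_batch, output_path, delay_seconds=30.0, max_rounds=1000):
    """Call fetch_batch() repeatedly until output_path stops growing.

    fetch_batch is a hypothetical zero-argument callable that appends one
    batch of rows to output_path. Returns the number of rounds that added data.
    """
    def size():
        # A missing file counts as empty so the first round can create it.
        return os.path.getsize(output_path) if os.path.exists(output_path) else 0

    productive_rounds = 0
    for _ in range(max_rounds):
        before = size()
        fetch_batch()
        if size() <= before:
            break  # nothing new was written, so the search is exhausted
        productive_rounds += 1
        time.sleep(delay_seconds)  # pause between requests
    return productive_rounds


# Demo with a fake fetcher that "finds" three batches and then nothing.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")
remaining_batches = [3, 2, 1]


def fake_fetch():
    if remaining_batches:
        with open(demo_path, "a") as handle:
            handle.write("row\n" * remaining_batches.pop(0))


rounds = collect_until_no_growth(fake_fetch, demo_path, delay_seconds=0)
print(rounds)
```

Treating a missing file as size zero and capping the number of rounds are design choices of this sketch; they keep the loop from failing on a first run or running forever.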

#### Collect and store participating user data 

This is done by extracting and lightly processing the raw data acquired in the previous section:
- select only the Twitter handle from tweets.csv
- remove duplicates, if any
- acquire user data for each user

In [None]:
# reading twitter handles from tweets.csv
participant_handles_df = pd.read_csv('tweets.csv', usecols=['username'])

# remove duplicates
participant_handles_df.drop_duplicates(inplace=True)
participant_handles_df.to_csv("unique_users.csv") # store users for future use

In [None]:
print("number of unique users: ",participant_handles_df.shape[0])

To acquire data for this many users simultaneously, we will use multithreading.

In [None]:
import threading #multithreading
import asyncio

def search_query(username):
    asyncio.set_event_loop(asyncio.new_event_loop())
    
    #client configuration
    user_client = twint.Config()
    user_client.Store_csv = True
    user_client.Hide_output = True
    user_client.Output = "users_parallel.csv"
    for user in username:
        user_client.Username = user
        twint.run.Lookup(user_client)

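max_queries=100 #number of simultaneous queries
chunk_size=580 #users per thread (about 58k users split over 100 threads)
user_thread=[]

#generating and starting threads
for i in range(max_queries):
    #the slice end is exclusive, so consecutive chunks leave no user out
    user_chunk = participant_handles_df.username[chunk_size*i:chunk_size*(i+1)]
    user_thread.append(threading.Thread(target=search_query, args=(user_chunk,)))
    user_thread[i].start()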
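
The chunk size above is tied to the number of users in this particular data set. As a hedged alternative, the sketch below splits any list into a fixed number of near-equal chunks and hands them to a thread pool. `split_into_chunks` and `lookup_users` are hypothetical helpers; `lookup_users` only lower-cases names and stands in for the real profile lookup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    items = list(items)
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def lookup_users(usernames):
    """Hypothetical stand-in for the per-thread profile lookup."""
    return [name.lower() for name in usernames]


handles = ["User%d" % i for i in range(10)]
chunks = split_into_chunks(handles, 3)

# pool.map returns results in the same order as the input chunks.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lookup_users, chunks))

processed = [name for part in results for name in part]
print([len(chunk) for chunk in chunks])
print(len(processed))
```

Because the chunks are computed from the list length, no user is skipped when the number of users is not an exact multiple of the number of threads.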

### Data preparation and analysis

#### Data loading

In this step the tweets and users data are loaded into separate data frames.

In [None]:
tweets_df = pd.read_csv("tweets.csv") # read tweets data and store in dataframe
users_df = pd.read_csv("users.csv") # read users data and store in dataframe

In [None]:
tweets_df.head()

As this preliminary look at the data shows, many columns are irrelevant to our analysis and can be safely dropped.

In [None]:
users_df.head()

As in the previous case, we can drop several features here, as they are either not relevant to our goals or empty.

#### Feature analysis

In this step we will inspect the columns/features of the raw data for both data sets and decide which columns to drop in the preprocessing step.

Let's look at columns/features of tweets data.

In [None]:
tweets_df.columns

From the feature list above, the following are the relevant features:
- created_at
- date
- time
- timezone
- username
- tweet
- mentions
- replies_count
- retweets_count
- likes_count
- hashtags

In [None]:
users_df.columns

From the features above, the following are the relevant features:
- name
- username
- bio
- location
- join_date
- join_time
- tweets
- following
- followers
- likes
- private
- verified

#### Preprocessing

In this step the data goes through the following steps:
- drop irrelevant columns (features) from both datasets
- convert the date column to datetime format
- extract tweets from before the event and keep a separate copy
- drop tweets from before the event (February 2020)
- join the two data sets appropriately

##### dropping irrelevant features

In [None]:
#drop columns that are not required in tweets dataframe
tweets_df.drop(columns=['id', 'conversation_id', 'user_id', 'name', 'place', 'urls', 'photos',
       'cashtags', 'link', 'retweet', 'quote_url', 'video', 'near', 'geo',
       'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to',
       'retweet_date', 'translate', 'trans_src', 'trans_dest'], axis=1, inplace=True)

In [None]:
#verify that the columns were dropped from the tweets dataframe
tweets_df.columns

In [None]:
#drop columns that are not required in users dataframe
users_df.drop(columns=['id', 'url', 'media', 'profile_image_url', 'background_image'], axis=1, inplace=True)

In [None]:
#verify that the columns were dropped from the users dataframe
users_df.columns

##### convert date column to datetime data type

This conversion is easily accomplished using the to_datetime function in pandas.

In [None]:
tweets_df['date'] = pd.to_datetime(tweets_df.date)
users_df['join_date'] = pd.to_datetime(users_df.join_date)

In [None]:
#verification of successful conversion
tweets_df['date'].head()

Before going any further, let's look at the date of the oldest tweet in the collected data.

In [None]:
print("oldest tweet date:",min(tweets_df.date))

It looks like we have older data, so we will have to drop anything older than our period of interest.

##### extracting tweets older than 22nd February 2020

In [None]:
tweets_bft = tweets_df[(tweets_df.date<"2020-02-22")] #filter tweets older than 22nd February 2020

In [None]:
#verify successful filtering
max(tweets_bft.date)

##### drop tweets before the event

In [None]:
tweets_df = tweets_df[(tweets_df.date>="2020-02-22")] #keep tweets from 22nd February 2020 onwards

In [None]:
#verify successful filtering
min(tweets_df.date)
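
The two filters above are meant to be complementary: every tweet should end up either in the "before" copy or in the retained data, never in both. One way to make that explicit is to build a single boolean mask and use it and its negation. This is a minimal sketch on toy data; the `toy` frame and the `cutoff` variable are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-20", "2020-02-21", "2020-02-22", "2020-02-25"]),
    "tweet": ["a", "b", "c", "d"],
})

cutoff = pd.Timestamp("2020-02-22")  # assumed first day of interest
is_before = toy["date"] < cutoff

before_event = toy[is_before]   # copy of the older tweets
from_event = toy[~is_before]    # tweets kept for the analysis

# Every row lands in exactly one of the two parts.
assert len(before_event) + len(from_event) == len(toy)
print(len(before_event), len(from_event))
```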

##### join users and tweets data frames

Joining the two dataframes might not be an efficient solution; this step is included only to demonstrate how the two data sets can be integrated.

In [None]:
tweet_user_df = tweets_df.set_index('username').join(users_df.set_index('username')) #join with username as key

In [None]:
tweet_user_df.columns

In [None]:
# the filter below keeps only rows with profile data: there were too many participating users (58k+)
# to collect quickly even with multithreading, so user data was scraped using a separate script
tweet_user_df[~(tweet_user_df.join_date.isna())].head(2)
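
Since not every user has profile data, it can help to count how many tweets are left without a matching profile after the join. The sketch below uses `DataFrame.merge` with `how="left"` and `indicator=True` instead of the index-based join above; the toy frames and column values are invented for illustration.

```python
import pandas as pd

toy_tweets = pd.DataFrame({
    "username": ["alice", "bob", "alice", "carol"],
    "tweet": ["t1", "t2", "t3", "t4"],
})
toy_users = pd.DataFrame({
    "username": ["alice", "bob"],
    "followers": [10, 200],
})

# A left merge keeps every tweet; the _merge column records whether a profile was found.
merged = toy_tweets.merge(toy_users, on="username", how="left", indicator=True)
missing_profiles = int((merged["_merge"] == "left_only").sum())

print(len(merged))
print(missing_profiles)
```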

#### Analysis


##### from tweets data

##### tweet frequency analysis

As we have already filtered the tweets data, let's first analyze the frequency of tweets over time.

In [None]:
date_df=tweets_df.date.groupby(tweets_df["date"]).count() #group by using date to generate histogram
date_df.plot(kind="bar", title="histogram of tweets", figsize=(16,6))

The histogram above shows that the number of tweets peaked during the riots, highlighting the importance of this issue. The largest number of tweets is dated 25 February 2020.
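
The peak day can also be read off programmatically rather than from the chart. A minimal sketch on invented dates, assuming the date column has already been converted to datetime:

```python
import pandas as pd

toy_dates = pd.Series(pd.to_datetime([
    "2020-02-24", "2020-02-25", "2020-02-25", "2020-02-26", "2020-02-25",
]))

# Count tweets per day and order the result chronologically.
daily_counts = toy_dates.value_counts().sort_index()
peak_day = daily_counts.idxmax()  # day with the most tweets

print(peak_day.date(), daily_counts.max())
```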

##### engagement analysis

Another important analysis is to visualize engagement on tweets. We first look at the spread of the data using a boxplot and then filter out any tweet that has zero interaction.

In [None]:
tweets_df[['retweets_count','likes_count','replies_count']].boxplot()

As is clear from the diagram above, most engagement is close to zero. Filtering those tweets out may give a better understanding of the spread. There are also clearly a few outliers with significant likes and/or retweets.

We will now generate the box plot again, ignoring tweets with zero engagement.

In [None]:
tweets_df[((tweets_df.retweets_count+tweets_df.likes_count+tweets_df.replies_count)>0)][['retweets_count','likes_count','replies_count']].boxplot()

No significant difference is observed. Let's visualize the distribution of these engagements. Note: engagement is the sum of the retweets, likes and replies counts.

In [None]:
plt.figure(figsize=(16, 6))
engagement = tweets_df.retweets_count+tweets_df.likes_count+tweets_df.replies_count #total engagement per tweet
sns.kdeplot(engagement).set_title("distribution of engagement") #density curve; distplot is deprecated in recent seaborn

The clear spike away from zero suggests that this was a highly engaging topic, with participants actively contributing to the discussion.
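
Engagement counts are typically heavy-tailed, so a few numeric summaries can complement the plots. The following is a sketch on invented numbers, not results from the collected data: it computes the share of tweets with zero engagement, the median, and a log-transformed copy that is easier to plot.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "retweets_count": [0, 1, 0, 50, 2000],
    "likes_count": [0, 3, 1, 120, 9000],
    "replies_count": [0, 0, 0, 10, 500],
})

# Engagement is the row-wise sum of the three counts.
engagement = toy[["retweets_count", "likes_count", "replies_count"]].sum(axis=1)

share_zero = (engagement == 0).mean()   # fraction of tweets nobody interacted with
median_engagement = engagement.median()
log_engagement = np.log1p(engagement)   # log(1 + x) compresses the long tail

print(engagement.tolist())
print(share_zero, median_engagement)
```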

##### top tweets

We will now find the tweets with the highest value of each engagement feature, both individually and cumulatively.

In [None]:
#tweet with most likes
best_tweet_likes = tweets_df[(tweets_df.likes_count==max(tweets_df.likes_count))]
print("Best tweet based on likes: \"%s\" by %s"%(best_tweet_likes.tweet.values[0], best_tweet_likes.username.values[0]))

In [None]:
#tweet with most retweets
best_tweet_retweets = tweets_df[(tweets_df.retweets_count==max(tweets_df.retweets_count))]
print("Best tweet based on retweets: \"%s\" by %s"%(best_tweet_retweets.tweet.values[0], best_tweet_retweets.username.values[0]))

In [None]:
#tweet with most replies
best_tweet_replies = tweets_df[(tweets_df.replies_count==max(tweets_df.replies_count))]
print("Best tweet based on replies: \"%s\" by %s"%(best_tweet_replies.tweet.values[0], best_tweet_replies.username.values[0]))

In [None]:
#tweet with most engagements
best_tweet_engagements = tweets_df[((tweets_df.retweets_count+tweets_df.likes_count+tweets_df.replies_count)==max(tweets_df.retweets_count+tweets_df.likes_count+tweets_df.replies_count))]
print("Best tweet based on engagement: \"%s\" by %s"%(best_tweet_engagements.tweet.values[0], best_tweet_engagements.username.values[0]))

It can be seen that abdulhamidahmad, who is Editor in Chief of Gulf News, got the most likes and retweets. On the other hand, shekhargupta, who is an Indian journalist, got the most replies.  
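
The four lookups above follow the same pattern, so they can be expressed once with `idxmax`. This is a sketch on invented data; the usernames and counts are placeholders, and ties are resolved by taking the first maximum.

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["alice", "bob", "carol"],
    "tweet": ["first", "second", "third"],
    "likes_count": [5, 40, 12],
    "retweets_count": [9, 3, 1],
    "replies_count": [0, 2, 30],
})

metrics = ["likes_count", "retweets_count", "replies_count"]
toy["engagement"] = toy[metrics].sum(axis=1)

# idxmax returns the row label of the (first) maximum of each metric.
top_by_metric = {
    metric: toy.loc[toy[metric].idxmax(), "username"]
    for metric in metrics + ["engagement"]
}
print(top_by_metric)
```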

##### from participants data

Now we will analyze participant data.

##### verified users

The share of verified users is expected to be very small, but the split between verified and unverified accounts is preliminary information we can get from the participating users data.

In [None]:
plt.figure(figsize=(7, 7))
plt.title("verified vs unverified participants")
plt.pie(users_df.username.groupby(users_df.verified).count(), labels=["unverified","verified"],autopct='%1.1f%%', explode=[0.2,0],startangle=90)
plt.show()

The pie chart above shows that 2% of participants are verified users and 98% are unverified. Not a lot of information is available on what percentage of Twitter users overall are verified; the closest numbers I could find were in this [article](https://www.brandwatch.com/blog/twitter-stats-and-statistics/). However, given Twitter's strict rules on verifying handles, I believe 2% is a significant number, which suggests the seriousness of the issue.
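
The pie chart relies on hard-coded labels matching the group order. A hedged alternative is to compute the shares with `value_counts(normalize=True)` and derive the labels from the result itself. The toy data below is invented, and it assumes the verified flag can be cast to a boolean.

```python
import pandas as pd

toy_users = pd.DataFrame({
    "username": ["a", "b", "c", "d", "e"],
    "verified": [0, 0, 1, 0, 0],
})

# Fraction of users per verification status, ordered False then True.
shares = toy_users["verified"].astype(bool).value_counts(normalize=True).sort_index()

# Labels are derived from the index so they cannot drift out of order.
labels = ["verified" if flag else "unverified" for flag in shares.index]

print(dict(zip(labels, shares.round(2))))
```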

##### twitter age analysis

To track down fake army accounts, Twitter age (the time since an account was created) is an important parameter.

In [None]:
users_df["twitter_age"]=max(tweets_df['date'])-users_df['join_date'] #age relative to the latest tweet, which is dated 30th March 2020

In [None]:
sns.kdeplot(users_df["twitter_age"].dt.days) #density curve; distplot is deprecated in recent seaborn
plt.title("distribution of twitter age of participants")
plt.show()

Interestingly, the distribution plot shows two peaks. The first peak, near zero, might be an indicator of the creation of a Twitter army.

##### followers, following, tweets and likes analysis

In order to analyse this I will divide the set into two groups based on age:
1. age < 50 days
2. age > 50 days

In [None]:
#statistical description of users with less than 50 days age
users_df[(users_df.twitter_age.dt.days<50)][['tweets',"followers","following","likes"]].describe()

For participants who are very new on Twitter, the average numbers of tweets and likes are suspicious. However, this needs further investigation before concluding that these are fake users.

In [None]:
#statistical description of users with more than 50 days age
users_df[(users_df.twitter_age.dt.days>50)][['tweets',"followers","following","likes"]].describe()

This is a rough estimate; the group includes celebrities with verified accounts and huge followings, so the statistics in this table will be skewed.
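
Because a handful of very large accounts can dominate the mean, comparing medians per age group is one way to reduce that skew. The sketch below uses invented numbers; the `twitter_age_days` column and the group labels are assumptions, and accounts that are exactly 50 days old are placed in the older group here.

```python
import pandas as pd

toy_users = pd.DataFrame({
    "twitter_age_days": [3, 20, 45, 400, 2000, 3000],
    "followers": [5, 12, 8, 300, 150, 2_000_000],
})

# Label each account by age group using the 50-day threshold.
toy_users["age_group"] = toy_users["twitter_age_days"].apply(
    lambda days: "new (<50 days)" if days < 50 else "established (>=50 days)"
)

# The median is far less sensitive to one celebrity account than the mean.
summary = toy_users.groupby("age_group")["followers"].agg(["mean", "median"])
print(summary)
```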

### Conclusion

Even with this preliminary analysis, twint has proved to be a quite reliable source for data acquisition from Twitter, although it does come with its limitations. The analysis of delhiriots suggests the possible use of bots to influence public sentiment. However, further investigation has to be performed to reach a sound conclusion.

### Future work

With NLP and other statistical analysis tools, this investigation can be taken further and more insights can be drawn from this dataset. If successful, these techniques can be employed to track and uncover such global operations.