# Step one

Collect one lakh (100,000) tweets, taking a specific domain name and duration as input.

#### Example

##### input
sports

##### output format:

|serial_number | screen_name| user_id| tweet_id| retweet_count| date| tweet|
| ---          |  ----------| -------| --------| --------     | --- | ---- |
| 1            | lorem      | ip     | 12      | 1            | 23  | xh   |


We'll be using [`twitterscraper`](https://github.com/taspinar/twitterscraper) for this purpose.


In [1]:
%%bash
twitterscraper python --limit 1000 --lang en --output ~/backups/today\'stweets.json

INFO: queries: ['python since:2006-03-21 until:2006-11-12', 'python since:2006-11-12 until:2007-07-06', 'python since:2007-07-06 until:2008-02-27', 'python since:2008-02-27 until:2008-10-20', 'python since:2008-10-20 until:2009-06-13', 'python since:2009-06-13 until:2010-02-04', 'python since:2010-02-04 until:2010-09-29', 'python since:2010-09-29 until:2011-05-23', 'python since:2011-05-23 until:2012-01-14', 'python since:2012-01-14 until:2012-09-06', 'python since:2012-09-06 until:2013-04-30', 'python since:2013-04-30 until:2013-12-22', 'python since:2013-12-22 until:2014-08-15', 'python since:2014-08-15 until:2015-04-09', 'python since:2015-04-09 until:2015-12-01', 'python since:2015-12-01 until:2016-07-24', 'python since:2016-07-24 until:2017-03-17', 'python since:2017-03-17 until:2017-11-08', 'python since:2017-11-08 until:2018-07-02', 'python since:2018-07-02 until:2019-02-24']
INFO: Querying python since:2006-03-21 until:2006-11-12
INFO: Querying python since:2006-11-12 until:200

In [15]:
import codecs
import json
import pandas as pd
pd.options.mode.chained_assignment = None
from typing import List, Dict

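def load_json_file(file_path: str) -> Dict:
    # codecs.open already decodes the file as UTF-8; json.load() no longer
    # accepts an `encoding` argument (it was removed in Python 3.9)
    with codecs.open(file_path, "r", "utf-8") as f:
        return json.load(f)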
    
tweets = load_json_file("/home/vinay/backups/today\'stweets.json")

list_tweets = [list(elem.values()) for elem in tweets]
list_columns = list(tweets[0].keys())

twitter_data = pd.DataFrame(list_tweets, columns=list_columns)
twitter_data.head()

Unnamed: 0,timestamp,url,text,user,html,retweets,replies,fullname,id,likes
0,2006-11-08T11:46:29,/larskflem/status/59306,coding python. happy time,larskflem,"<p class=""TweetTextSize js-tweet-text tweet-te...",0,0,Lars K. Flem,59306,0
1,2006-11-06T21:20:39,/sergio_101/status/57683,Trying to figure out what phone to get next.. ...,sergio_101,"<p class=""TweetTextSize js-tweet-text tweet-te...",0,0,sergio t. ruiz,57683,0
2,2006-10-23T00:21:20,/thomasknoll/status/46836,Learning python while kim watches city of god,thomasknoll,"<p class=""TweetTextSize js-tweet-text tweet-te...",0,0,Thomas Knoll,46836,0
3,2006-08-02T02:07:24,/marceloeduardo/status/15613,"Finishing some turbogears experience, writing ...",marceloeduardo,"<p class=""TweetTextSize js-tweet-text tweet-te...",1,0,Marcelo Eduardo,15613,1
4,2006-07-16T18:03:45,/nitin/status/10584,Heading to peets in emryvil to hack python tnx...,nitin,"<p class=""TweetTextSize js-tweet-text tweet-te...",1,1,Nitin Borwankar,10584,1


We can drop the columns `html`, `url`, `likes`, and `replies`.

We need to reformat the `timestamp` column, include the `user` and `fullname` columns, and get the user_id of each `user`.

Finally, order the columns based on the given output format.

In [3]:
# making timestamp YYYY-MM-DD
twitter_data['timestamp'] = twitter_data['timestamp'].apply(lambda x: x.split('T')[0])

# dropping html, url, likes and replies
twitter_data.drop(columns=['html', 'url', 'likes', 'replies'], inplace=True)

# twitter_data.head()
twitter_data.columns

Index(['timestamp', 'text', 'user', 'retweets', 'fullname', 'id'], dtype='object')

In [4]:
# renaming column names
twitter_data.columns = ['Date', 'Tweet', 'user', 'retweets', 'fullname', 'Tweet_id']

twitter_data.head()

Unnamed: 0,Date,Tweet,user,retweets,fullname,Tweet_id
0,2006-11-08,coding python. happy time,larskflem,0,Lars K. Flem,59306
1,2006-11-06,Trying to figure out what phone to get next.. ...,sergio_101,0,sergio t. ruiz,57683
2,2006-10-23,Learning python while kim watches city of god,thomasknoll,0,Thomas Knoll,46836
3,2006-08-02,"Finishing some turbogears experience, writing ...",marceloeduardo,1,Marcelo Eduardo,15613
4,2006-07-16,Heading to peets in emryvil to hack python tnx...,nitin,1,Nitin Borwankar,10584
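
The renamed frame above still differs from the step-one output table (no `serial_number`, no `user_id`, different column order). A minimal sketch of the remaining reshaping follows. The helper name `to_output_format` and the sample rows are hypothetical, and `user_id` is left empty because it is only looked up later in the notebook.

```python
import pandas as pd

# Hypothetical two-row sample using the column names set by the renaming cell;
# the tweet texts are placeholders, not real data.
sample = pd.DataFrame({
    "Date": ["2006-08-02", "2006-07-16"],
    "Tweet": ["placeholder tweet A", "placeholder tweet B"],
    "user": ["marceloeduardo", "nitin"],
    "retweets": ["1", "1"],
    "fullname": ["Marcelo Eduardo", "Nitin Borwankar"],
    "Tweet_id": [15613, 10584],
})


def to_output_format(df: pd.DataFrame) -> pd.DataFrame:
    """Rename and reorder columns to match the step-one output table."""
    out = df.rename(columns={
        "user": "screen_name",
        "Tweet_id": "tweet_id",
        "retweets": "retweet_count",
        "Date": "date",
        "Tweet": "tweet",
    })
    # user_id is not known yet at this stage, so keep an empty placeholder column
    out["user_id"] = None
    out = out.reset_index(drop=True)
    # serial numbers start at 1, as in the output format table
    out.insert(0, "serial_number", list(range(1, len(out) + 1)))
    return out[["serial_number", "screen_name", "user_id", "tweet_id",
                "retweet_count", "date", "tweet"]]


formatted = to_output_format(sample)
print(formatted.columns.tolist())
```

`rename` returns a new frame, so the input frame is left untouched.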


## Step 2
From the step 1 output, look at the 5th column of the table, i.e. the number of retweets obtained by each tweet. If a tweet has 0 retweets, discard it; otherwise print it in the above format.

Output: only the tweets that were retweeted, i.e. tweets with more than zero retweets.

In [5]:
# retweets may be stored as strings, so convert to int before comparing
twitter_data = twitter_data[twitter_data.retweets.astype(int) > 0]
twitter_data.head()

Unnamed: 0,Date,Tweet,user,retweets,fullname,Tweet_id
3,2006-08-02,"Finishing some turbogears experience, writing ...",marceloeduardo,1,Marcelo Eduardo,15613
4,2006-07-16,Heading to peets in emryvil to hack python tnx...,nitin,1,Nitin Borwankar,10584
66,2016-07-23,tethne 0.8.1.dev8: Bibliographic network and c...,mastercodeonlin,1,MasterCode.Online,757001950088957952
71,2016-07-23,Thank @mandarlimaye 4 your follow and welcom #...,lennincaro,3,Lennin Caro,757000314071478272
72,2016-07-23,Thank @h1ng 4 your follow and welcom #PostgreS...,lennincaro,4,Lennin Caro,757000213622030336


## for step three

Step 3: Find the number of distinct users who posted the tweets from step 2, since one user may have posted multiple tweets.

Input: output of step 2

Output:

|serial_number | user_name @mention | user_id | tweets (no of tweets posted by user) |
| ----         |  ----------        | ------- | --------                             |
|              |                    |         |                                      |

In [6]:
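# for step 3 the Date and Tweet columns are irrelevant, so drop them
# NOTE: plain assignment creates an alias, not a copy, so the in-place drop
# below also removes Date and Tweet from twitter_data_with_date;
# use twitter_data.copy() here if those columns are needed later (e.g. step nine)
twitter_data_with_date = twitter_data
twitter_data.drop(columns=['Date', 'Tweet'], inplace=True)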
twitter_data.head()

Unnamed: 0,user,retweets,fullname,Tweet_id
3,marceloeduardo,1,Marcelo Eduardo,15613
4,nitin,1,Nitin Borwankar,10584
66,mastercodeonlin,1,MasterCode.Online,757001950088957952
71,lennincaro,3,Lennin Caro,757000314071478272
72,lennincaro,4,Lennin Caro,757000213622030336


In [7]:
# rather than dropping duplicates, we can use `groupby` in pandas
# twitter_data.duplicated(subset='user', keep='first').sum()
# group by the `user` column; size() returns a Series indexed by user name
tweet_count = twitter_data.groupby('user').size()
tweet_count['mastercodeonlin']

2

In [8]:
def get_tweet_count(user: str) -> int:
    return tweet_count[user]

get_tweet_count('mastercodeonlin')

2

In [9]:
twitter_data['no_of_tweets'] = twitter_data['user'].apply(lambda x: get_tweet_count(x))

twitter_data_without_tweet_count = twitter_data.drop_duplicates(subset='user', keep="first")
twitter_data_without_tweet_count.reset_index(drop=True, inplace=True)
twitter_data_without_tweet_count.head()

Unnamed: 0,user,retweets,fullname,Tweet_id,no_of_tweets
0,marceloeduardo,1,Marcelo Eduardo,15613,1
1,nitin,1,Nitin Borwankar,10584,1
2,mastercodeonlin,1,MasterCode.Online,757001950088957952,2
3,lennincaro,3,Lennin Caro,757000314071478272,5
4,devbattles,9,Dev Battles,756996796786900993,2


In [10]:
# in order to get user_id for a user
# we need to use tweepy; getting user_ids the twitterscraper way still needs work.
# NOTE: written against tweepy 3.x; tweepy 4.x removed `wait_on_rate_limit_notify`
# and replaced `TweepError` with `TweepyException`.

import sys

import tweepy

configs = load_json_file("configs.json")

APP_KEY = configs['APP_KEY']
APP_SECRET = configs['APP_SECRET']

# authenticate api
auth = tweepy.AppAuthHandler(APP_KEY, APP_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

if not api:
    print("Can't Authenticate")
    sys.exit(-1)

In [16]:
# get user_id from screen name
from typing import Optional

def get_user_id_from_screen_name(screen_name: str, api: tweepy.API) -> Optional[int]:
    try:
        # returns the numeric id, or None if the lookup fails
        return api.get_user(screen_name=screen_name).id
    except tweepy.TweepError:
        return None

get_user_id_from_screen_name("nitin", api)

988

In [17]:
# the lookup returns None when it fails, so do not wrap it in int() (int(None) raises TypeError)
twitter_data_without_tweet_count['user_id'] = twitter_data_without_tweet_count['user'].apply(lambda x: get_user_id_from_screen_name(x, api))

In [18]:
twitter_data_without_tweet_count.head()

Unnamed: 0,user,retweets,fullname,Tweet_id,no_of_tweets,user_id
0,marceloeduardo,1,Marcelo Eduardo,15613,1,3652
1,nitin,1,Nitin Borwankar,10584,1,988
2,mastercodeonlin,1,MasterCode.Online,757001950088957952,2,3041232857
3,lennincaro,3,Lennin Caro,757000314071478272,5,205824307
4,devbattles,9,Dev Battles,756996796786900993,2,2377678050


## for step four

Not all users in the output of step 3 are influential. To find the influential users
in the table above, find the number of retweets obtained by each user
and calculate the weight (user rank); the weight used here is `log(#retweets)`.

output format:

|serial_number | user_name @mention | user_id| #tweets (no of tweets posted by user)| # retweets | log(#retweets)|
| ----         |  ----------        | -------| --------                             | ----       | -----         |
|              |                    |        |                                      |            |               |


In [22]:
import math
twitter_data_without_tweet_count['log(retweets)'] = twitter_data_without_tweet_count['retweets'].apply(lambda x: math.log(int(x)))
twitter_data_without_tweet_count.head()

Unnamed: 0,user,retweets,fullname,Tweet_id,no_of_tweets,user_id,log(retweets)
0,marceloeduardo,1,Marcelo Eduardo,15613,1,3652,0.0
1,nitin,1,Nitin Borwankar,10584,1,988,0.0
2,mastercodeonlin,1,MasterCode.Online,757001950088957952,2,3041232857,0.0
3,lennincaro,3,Lennin Caro,757000314071478272,5,205824307,1.098612
4,devbattles,9,Dev Battles,756996796786900993,2,2377678050,2.197225


In [25]:
tw_data = twitter_data_without_tweet_count[['user', 'fullname', 'user_id', 'no_of_tweets', 'retweets', 'log(retweets)']]
tw_data.head()

Unnamed: 0,user,fullname,user_id,no_of_tweets,retweets,log(retweets)
0,marceloeduardo,Marcelo Eduardo,3652,1,1,0.0
1,nitin,Nitin Borwankar,988,1,1,0.0
2,mastercodeonlin,MasterCode.Online,3041232857,2,1,0.0
3,lennincaro,Lennin Caro,205824307,5,3,1.098612
4,devbattles,Dev Battles,2377678050,2,9,2.197225
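
Because the cells above keep only the first row per user (`drop_duplicates(..., keep="first")`), the `retweets` column holds the retweet count of that user's first tweet, not the total over all of their tweets. If the intent of step four is the total per user, one possible sketch is a single `groupby` aggregation. This is an assumption about the intent, not the notebook's method, and the sample frame below is hypothetical.

```python
import math

import pandas as pd

# Hypothetical sample of retweeted tweets (output of step 2); values are illustrative.
tweets = pd.DataFrame({
    "user": ["lennincaro", "lennincaro", "nitin", "devbattles"],
    "retweets": ["3", "4", "1", "9"],
})
# retweets may be stored as strings, so convert before summing
tweets["retweets"] = tweets["retweets"].astype(int)

# One row per user: number of tweets and total retweets across those tweets.
per_user = (
    tweets.groupby("user", sort=False)
    .agg(no_of_tweets=("retweets", "count"), retweets=("retweets", "sum"))
    .reset_index()
)

# Weight as defined in step four: natural log of the retweet count.
per_user["log(retweets)"] = per_user["retweets"].apply(math.log)
print(per_user)
```

With this aggregation, `lennincaro` gets 3 + 4 = 7 retweets over 2 tweets instead of the 3 retweets of the first tweet alone.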


## for step five

From the step-four table, where we calculated the weight of each user, pick the users
whose weight is greater than 1.5.

Output format:


|serial_number | user_name @mention | user_id| #tweets (no of tweets posted by user)| # retweets | weights > 1.5 |
| ----         |  ----------        | -------| --------                             | ----       | -----         |
|              |                    |        |                                      |            |               |


In [27]:
tw_data = tw_data[tw_data['log(retweets)'] > 1.5]
tw_data.head()

Unnamed: 0,user,fullname,user_id,no_of_tweets,retweets,log(retweets)
4,devbattles,Dev Battles,2377678050,2,9,2.197225
27,FollowMMA,Jason Chambers,22735770,1,14,2.639057
28,Doclach,Doc,21816418,1,6,1.791759
30,jedisct1,Frank Denis,17396038,2,6,1.791759
34,r_netsec,/r/netsec,238781296,1,7,1.94591


In [29]:
tw_data.reset_index(drop=True, inplace=True)
tw_data.head()

Unnamed: 0,user,fullname,user_id,no_of_tweets,retweets,log(retweets)
0,devbattles,Dev Battles,2377678050,2,9,2.197225
1,FollowMMA,Jason Chambers,22735770,1,14,2.639057
2,Doclach,Doc,21816418,1,6,1.791759
3,jedisct1,Frank Denis,17396038,2,6,1.791759
4,r_netsec,/r/netsec,238781296,1,7,1.94591


## for step six
Count the number of users in the output of step five. These users are called `Influential Users`.
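
The count itself is just the number of rows in the step-five table, since each row is one user. A minimal sketch, using a hypothetical three-row stand-in for `tw_data` with values taken from the step-five output:

```python
import pandas as pd

# Hypothetical stand-in for the step-five table (`tw_data` in the notebook).
tw_data_sample = pd.DataFrame({
    "user": ["devbattles", "FollowMMA", "Doclach"],
    "log(retweets)": [2.197225, 2.639057, 1.791759],
})

# Each row is one user, so the row count is the number of influential users.
n_influential = len(tw_data_sample)
print(f"Influential users: {n_influential}")
```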

## for step seven
For influential users, calculate the global influential score of each user.

$$\text{influential score} = \frac{\text{number of retweets}}{\text{number of tweets}}$$

In [33]:
def inf_score(retweets, tweets):
    return (retweets / tweets)


tw_data['inf_score'] = tw_data.apply(lambda x: inf_score(int(x.retweets), int(x.no_of_tweets)), axis=1)
tw_data.head()

Unnamed: 0,user,fullname,user_id,no_of_tweets,retweets,log(retweets),inf_score
0,devbattles,Dev Battles,2377678050,2,9,2.197225,4.5
1,FollowMMA,Jason Chambers,22735770,1,14,2.639057,14.0
2,Doclach,Doc,21816418,1,6,1.791759,6.0
3,jedisct1,Frank Denis,17396038,2,6,1.791759,3.0
4,r_netsec,/r/netsec,238781296,1,7,1.94591,7.0
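
The same score can also be computed without a row-wise `apply`, by dividing the two columns directly. A minimal sketch, assuming `retweets` may still be stored as strings; the frame name `score_sample` is hypothetical and its values are copied from the table above:

```python
import pandas as pd

# Hypothetical stand-in for tw_data, with values taken from the table above.
score_sample = pd.DataFrame({
    "user": ["devbattles", "FollowMMA", "Doclach", "jedisct1", "r_netsec"],
    "no_of_tweets": [2, 1, 1, 2, 1],
    "retweets": ["9", "14", "6", "6", "7"],
})

# Column-wise division: influential score = retweets / tweets for every row at once.
score_sample["inf_score"] = (
    score_sample["retweets"].astype(int) / score_sample["no_of_tweets"].astype(int)
)
print(score_sample)
```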


## for step eight
Write down the global influence scores in descending order and assign a rank to each influential user.

example:
highest value of influential score = rank `1`
.
.
.
lowest value of influential score = rank `n`

Output format:

| global influential score in descending order |    user name    | global rank x_i |
| ---                                          | ---             | ---             |
|                                              |                 |                 |

In [36]:
tw_data = tw_data.sort_values(by=['inf_score'], ascending=False)

tw_data.reset_index(drop=True, inplace=True)
tw_data.head()

Unnamed: 0,user,fullname,user_id,no_of_tweets,retweets,log(retweets),inf_score
0,benhamner,Ben Hamner,22674817,1,21,3.044522,21.0
1,randal_olson,Randy Olson,49413866,1,15,2.70805,15.0
2,FollowMMA,Jason Chambers,22735770,1,14,2.639057,14.0
3,jetrubyagency,JetRuby,3092433987,1,12,2.484907,12.0
4,arnicas,Lynn Cherny,6146692,1,9,2.197225,9.0
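
The sorted table above does not yet carry an explicit rank column. One way to add it is pandas' `Series.rank`. In this sketch the column name `global_rank` and the choice `method="min"` (tied scores share the best rank) are assumptions, and the sample values are copied from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the sorted tw_data, using values from the output above.
scores = pd.DataFrame({
    "user": ["benhamner", "randal_olson", "FollowMMA", "jetrubyagency", "arnicas"],
    "inf_score": [21.0, 15.0, 14.0, 12.0, 9.0],
})

scores = scores.sort_values(by="inf_score", ascending=False).reset_index(drop=True)

# Highest score gets rank 1; with method="min", tied scores share the same rank.
scores["global_rank"] = scores["inf_score"].rank(method="min", ascending=False).astype(int)

# Column order follows the step-eight output format.
ranking = scores[["inf_score", "user", "global_rank"]]
print(ranking)
```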



## for step nine

* Collect the tweets of the influential users from the output of step two
* Count the number of tweets posted by influential users

Output format:

|serial_number | screen_name| user_id| tweet_id| retweet_count| tweet|
| ----         |  ----------| -------| --------| --------     | ---- |
|              |            |        |         |              |      |

In [38]:
twitter_data_with_date.loc[twitter_data_with_date['user'] == "benhamner"]

Unnamed: 0,user,retweets,fullname,Tweet_id,no_of_tweets
393,benhamner,21,Ben Hamner,585952312125829120,1
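
The cell above looks up a single user by name. To collect the tweets of all influential users at once, one option is `Series.isin`. The sketch below uses a hypothetical step-two frame and a hypothetical list of influential screen names; the tweet ids and texts are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the step-two output; ids and texts are placeholders.
step_two = pd.DataFrame({
    "user": ["benhamner", "nitin", "FollowMMA", "FollowMMA"],
    "Tweet_id": [101, 102, 103, 104],
    "retweets": [21, 1, 14, 2],
    "Tweet": ["placeholder a", "placeholder b", "placeholder c", "placeholder d"],
})

# Hypothetical list of influential screen names (the `user` column of step eight).
influential_users = ["benhamner", "FollowMMA"]

# Keep only tweets whose author is in the influential list.
mask = step_two["user"].isin(influential_users)
influential_tweets = step_two[mask].reset_index(drop=True)

# Number of tweets posted by each influential user.
tweets_per_user = influential_tweets["user"].value_counts()
print(influential_tweets)
print(tweets_per_user)
```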
