In [2]:
import numpy as np 
import pandas as pd 
import altair as alt
import matplotlib.pyplot as plt
import re
import json

import os
print(os.listdir("./data"))

testFilePath = "./data/rule1-4_pubmetrics.twts"

['12ktweets.twts', 'rule1-4_trial3.twts', 'ProposedRules.txt', 'testtweets.twts', 'testdatasets', 'self.twts', 'rule1-4_pubmetrics2.twts', '6ktweets.twts', '40ktweets.twts', 'rule1-4_pubmetrics.twts', 'selectedtweets', 'rules3-4_trial2.twts', 'rules1-4_trial1.twts', 'skeptic.twts']


In this notebook, I run some hypothesis checks on our newly filtered data. First, I adjust our stream parameters to:

```
https://api.twitter.com/2/tweets/search/stream?tweet.fields=created_at,public_metrics&expansions=author_id,referenced_tweets.id
```
where *referenced_tweets.id* has been added to our parameter listing.

Next, for our two most popular rules (SocietalShift and SelfAndID), additional negative constraints are added: *" -is:retweet -is:quote -is:reply"*

This should cut our search down to original posts only. The two less popular rules will not receive these constraints, as their match rate is very low.

Analysis is done on the tweets and metrics recorded for the two sets.
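
The rule change described above can be sketched as a payload for the filtered-stream rules endpoint (`POST /2/tweets/search/stream/rules`). This is only an illustration: the rule values are placeholders, not the queries actually used here, and `build_rules_payload` is a hypothetical helper. Only the appended constraint string comes from the text.

```python
import json

# Constraint string taken from the text above.
ORIGINALS_ONLY = " -is:retweet -is:quote -is:reply"

def build_rules_payload(rules, constrained_tags):
    """Build an 'add' payload, appending the originals-only constraints
    to every rule whose tag is listed in constrained_tags."""
    add = []
    for tag, value in rules.items():
        if tag in constrained_tags:
            value = value + ORIGINALS_ONLY
        add.append({"value": value, "tag": tag})
    return {"add": add}

# Placeholder rule values, for illustration only.
example_rules = {
    "SocietalShift": "(placeholder query A) lang:en",
    "SelfAndID": "(placeholder query B) lang:en",
    "SearchTheVoid": "(placeholder query C) lang:en",
}

payload = build_rules_payload(example_rules, {"SocietalShift", "SelfAndID"})
print(json.dumps(payload, indent=2))
```

Nothing is sent over the network here; the printed JSON is what would go in the request body.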


#### Test1: SocietalShift and SelfAndID tagged tweets are original tweets.

This means that we should not see "RT" prepended to the tweet text. "@" symbols may still be used to mention others in original tweets.

I visually inspected a few raw tweets for the tag "SocietalShift". Already, I can see that there are quotes and retweets present, even though I never requested them.

Question: For an Original Tweet (not Q, RT or Rep), (i) will the referenced tweet field show up? (ii) Will the public metrics near the top of the jsonObj be filled in?

Answer: Using the following two curl requests:

```
curl --request GET 'https://api.twitter.com/2/tweets?ids=1565347524013338625&tweet.fields=author_id,id,public_metrics,referenced_tweets,text&expansions=referenced_tweets.id' --header 'Authorization: Bearer <?>'

curl --request GET 'https://api.twitter.com/2/tweets?ids=1565587106101174274&tweet.fields=author_id,id,public_metrics,referenced_tweets,text&expansions=referenced_tweets.id' --header 'Authorization: Bearer <?>'

```

(i) No. Since there are no referenced tweets, the expansion is not included in the JSON payload.

(ii) Yes! And this means that if we do see this top-level subobject with non-zero values, it must be an original tweet.

Non-zero L, R, Q on top-level public_metrics => original tweet.
Zero values /=> original tweet (it may or may not be; usually it is not).

**More simply, a tweet is original exactly when the referenced_tweets field is absent.**
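
A minimal sketch of that presence check, assuming the payload shape used by the loader in this notebook (a `data` object with an optional `referenced_tweets` list). `classify_tweet` is a hypothetical helper and the sample payloads are made up.

```python
def classify_tweet(payload):
    """Return 'original' when no referenced_tweets field is present,
    otherwise the type of the first referenced tweet."""
    refs = payload["data"].get("referenced_tweets")
    if not refs:
        return "original"
    return refs[0]["type"]

# Made-up payloads, trimmed to the fields the check needs.
original = {"data": {"id": "1", "text": "hello"}}
retweet = {"data": {"id": "2", "text": "RT @someone: hello",
                    "referenced_tweets": [{"type": "retweeted", "id": "1"}]}}
reply = {"data": {"id": "3", "text": "@someone hi",
                  "referenced_tweets": [{"type": "replied_to", "id": "1"}]}}

for sample in (original, retweet, reply):
    print(classify_tweet(sample))
```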

So to begin, let's extend our dataframe fields, and gather some more data.





In [3]:
#Main import and cleaning code goes here.

#String, Int -> DataFrame!
def generatedataframe(path,writeLimit):
    if not isinstance(writeLimit, int):
        #raise needs an exception instance; raising a bare string is itself a TypeError
        raise TypeError("Error: writeLimit not an integer. Check argument.")
        
    cols = ["tweetid", "text", "created_at",
            "tagid","tag","userid","username",
            "rtcount","repcount","likecount",
            "qtcount","tweet_type","ref_tweetid", "ref_authorid",
            "ref_rtcount","ref_repcount","ref_likecount",
            "ref_qtcount"]
    tempDF = pd.DataFrame(columns=cols)
    tweetDF = pd.DataFrame(columns=cols)
    lineCount = 0
    ##Note! On rare occasions, two or more tags can match. This currently keeps only the first tag,
    ##so information loss can occur.
    with open(path) as fp:
        line = fp.readline()
        while line:
            if (lineCount == writeLimit):
                tweetDF = pd.concat([tweetDF,tempDF],copy=False).reset_index().drop(columns="index")
                tempDF = pd.DataFrame(columns=cols) #blow it all to hell!!
                lineCount = 0
            jsonObj = json.loads(line)
            #done for readability, not code terseness
            c1 = jsonObj["data"]["id"]
            c2 = jsonObj["data"]["text"]
            c3 = jsonObj["data"]["created_at"]
            c4 = jsonObj['matching_rules'][0]["id"]
            c5 = jsonObj['matching_rules'][0]["tag"]
            c6 = jsonObj["data"]["author_id"]
            #Note: This assumes the first includes user is the poster (!)
            c7 = jsonObj["includes"]["users"][0]["username"]
            c8 = jsonObj["data"]["public_metrics"]["retweet_count"]
            c9 = jsonObj["data"]["public_metrics"]["reply_count"]
            c10 = jsonObj["data"]["public_metrics"]["like_count"]
            c11 = jsonObj["data"]["public_metrics"]["quote_count"]
            
            #Now check to see if our tweet is Original or Not.
            c12 = "original"
            c13 = 0
            if ("referenced_tweets" in jsonObj["data"]):
                c12 = jsonObj["data"]["referenced_tweets"][0]["type"]
                c13 = jsonObj["data"]["referenced_tweets"][0]["id"]
            
            #If there is a referenced tweet, get its metrics
            c14 = 0 #"ref_authorid"
            c15 = 0 #"ref_rtcount"
            c16 = 0 #"ref_repcount" 
            c17 = 0 #"ref_likecount"
            c18 = 0 #"ref_qtcount"
            if ("tweets" in jsonObj["includes"]):
                c14 = jsonObj["includes"]["tweets"][0]["author_id"] #"ref_authorid"
                c15 = jsonObj["includes"]["tweets"][0]["public_metrics"]["retweet_count"] #"ref_rtcount"
                c16 = jsonObj["includes"]["tweets"][0]["public_metrics"]["reply_count"] #"ref_repcount" 
                c17 = jsonObj["includes"]["tweets"][0]["public_metrics"]["like_count"] #"ref_likecount"
                c18 = jsonObj["includes"]["tweets"][0]["public_metrics"]["quote_count"] #"ref_qtcount"

            tempDF.loc[len(tempDF.index)] = [c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18]    
    
            line = fp.readline()
            lineCount = lineCount + 1
                
        if (lineCount > 0): #add the last of the rows to the final tweetDF.
            tweetDF = pd.concat([tweetDF,tempDF],copy=False).reset_index().drop(columns="index")
        #no explicit fp.close() needed: the with block closes the file
        
    #Quickly convert to numbers.
    #Unlikely we will exceed the int16 maximum (32,767) for tweet metrics. Also, our targets are small accounts anyway,
    #not huge twitter accounts. Downcasting will save some space.
    tweetDF = tweetDF.astype({"tweetid": "int64"}, copy=False) 
    tweetDF = tweetDF.astype({"tagid": "int64"}, copy=False) 
    tweetDF = tweetDF.astype({"userid": "int64"}, copy=False) 
    tweetDF = tweetDF.astype({"rtcount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"repcount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"likecount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"qtcount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"ref_tweetid": "int64"}, copy=False) 
    tweetDF = tweetDF.astype({"ref_authorid": "int64"}, copy=False) 
    tweetDF = tweetDF.astype({"ref_rtcount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"ref_repcount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"ref_likecount": "int16"}, copy=False) 
    tweetDF = tweetDF.astype({"ref_qtcount": "int16"}, copy=False) 

    #[!] for now, I don't use the date string, even though it is recorded. To be formatted into a DateTime object later

    tweetDF.info()
    return tweetDF

tweetDF = generatedataframe(testFilePath,5000)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26407 entries, 0 to 26406
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweetid        26407 non-null  int64 
 1   text           26407 non-null  object
 2   created_at     26407 non-null  object
 3   tagid          26407 non-null  int64 
 4   tag            26407 non-null  object
 5   userid         26407 non-null  int64 
 6   username       26407 non-null  object
 7   rtcount        26407 non-null  int16 
 8   repcount       26407 non-null  int16 
 9   likecount      26407 non-null  int16 
 10  qtcount        26407 non-null  int16 
 11  tweet_type     26407 non-null  object
 12  ref_tweetid    26407 non-null  int64 
 13  ref_authorid   26407 non-null  int64 
 14  ref_rtcount    26407 non-null  int16 
 15  ref_repcount   26407 non-null  int16 
 16  ref_likecount  26407 non-null  int16 
 17  ref_qtcount    26407 non-null  int16 
dtypes: int16(8), int64(5), obj

In [4]:
tweetDF.tail(5)


Unnamed: 0,tweetid,text,created_at,tagid,tag,userid,username,rtcount,repcount,likecount,qtcount,tweet_type,ref_tweetid,ref_authorid,ref_rtcount,ref_repcount,ref_likecount,ref_qtcount
26402,1564733345241698304,"Duvido mt q essa parada do Mineirão aconteça, ...",2022-08-30T21:54:35.000Z,1563582770315665411,HealthySkepticism,964120169034510336,AlexAgiotagens,0,0,0,0,original,0,0,0,0,0,0
26403,1564733346260946946,RT @cooltxchick: J.D. Vance is a bird of a fea...,2022-08-30T21:54:35.000Z,1564718289544151040,SelfandID,948410690464759809,JonZimmerman16,308,0,0,0,retweeted,1564428649603940355,22803269,308,23,643,10
26404,1564733347531816965,And a very proud Vmate. You forgot to add that...,2022-08-30T21:54:36.000Z,1564718289544151040,SelfandID,1508910788026765314,LeloGotStacks,0,0,0,0,quoted,1564726297665933315,228109780,15,5,42,2
26405,1564733345426292736,ser fã de artista internacional e pobre ao msm...,2022-08-30T21:54:35.000Z,1563582770315665411,HealthySkepticism,1222967589732802560,unfuckbrave,0,0,0,0,original,0,0,0,0,0,0
26406,1564733348727214081,@urlovelybones as legendas no msm nivel daquel...,2022-08-30T21:54:36.000Z,1563582770315665411,HealthySkepticism,1488538249492639747,cumskinnyy,0,0,0,0,replied_to,1564733044396900355,1517981648935165954,0,1,0,0
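
The loader above assumes the first entry in `includes.users` is the poster. A more defensive lookup, sketched here under the assumption that each user object carries an `id` matching `data.author_id`, would search the list instead. `poster_username` is a hypothetical helper and the payload is made up.

```python
def poster_username(payload):
    """Find the username whose id matches data.author_id,
    instead of trusting the order of includes.users."""
    author_id = payload["data"]["author_id"]
    for user in payload.get("includes", {}).get("users", []):
        if user["id"] == author_id:
            return user["username"]
    return None  # poster not present in the includes

# Made-up payload where the poster is NOT the first included user.
sample = {
    "data": {"id": "10", "author_id": "222", "text": "RT @alice: hi"},
    "includes": {"users": [
        {"id": "111", "username": "alice"},
        {"id": "222", "username": "bob"},
    ]},
}

print(poster_username(sample))
```

Returning `None` when no match is found makes missing users visible instead of silently recording the wrong name.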


#### Check: Distribution of Tweet Types, and Tags:

(I) Our percentages for our tags have not shifted too much from last time, even with added constraints for two of the tags.

(II) The tweet type percentage chart is interesting. Only 11% of tweets are original, everything else is reactive content.


In [5]:
#Get a tag percentage summary
distDF = pd.DataFrame(columns=["sum","percentage"],index=tweetDF.tag.unique())
distDF["sum"] = tweetDF.groupby('tag')['tagid'].count()
totalSum = tweetDF.shape[0]
#Assign the whole column at once; chained indexing (df[col][row] = ...) triggers SettingWithCopyWarning
distDF["percentage"] = (distDF["sum"]*100)/totalSum

display(distDF)

#Get a tweet type percentage summary
typeDF = pd.DataFrame(columns=["sum","percentage"],index=tweetDF.tweet_type.unique())
typeDF["sum"] = tweetDF.groupby('tweet_type')['ref_tweetid'].count()
totalSum = tweetDF.shape[0]
typeDF["percentage"] = (typeDF["sum"]*100)/totalSum

display(typeDF)

#Finally, let's get our original vs. everything else ratio:




Unnamed: 0,sum,percentage
SelfandID,16530,62.597039
HealthySkepticism,8077,30.586587
SocietalShift,1778,6.733063
SearchTheVoid,22,0.083311



Unnamed: 0,sum,percentage
replied_to,10079,38.16791
retweeted,11820,44.760859
original,2869,10.864543
quoted,1639,6.206688
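
The original versus everything else ratio follows directly from the counts in the table above. A small sketch using those four recorded counts:

```python
# Counts copied from the tweet type summary above.
type_counts = {
    "replied_to": 10079,
    "retweeted": 11820,
    "original": 2869,
    "quoted": 1639,
}

total = sum(type_counts.values())
original = type_counts["original"]
non_original = total - original

original_share = 100 * original / total
ratio = non_original / original

print(f"original: {original_share:.2f}% of {total} tweets")
print(f"non-original to original: {ratio:.2f} to 1")
```

In the notebook itself, the same numbers could be read from `typeDF["sum"]` rather than typed in by hand.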


#### Question: Can sum(Reply Metrics) > sum(Ref Tweet Metrics)? How often does this occur?

Yes: 640 times, or about 2.4% of the time.

In [6]:
tDF = tweetDF #shortens...
querySum = (((tDF["rtcount"] + tDF["repcount"] + tDF["qtcount"] + tDF["likecount"])) > ((tDF["ref_rtcount"] + tDF["ref_repcount"] + tDF["ref_qtcount"] + tDF["ref_likecount"]))).sum()
print("Occurs:",querySum,"times")
print("Occurs:",(100*querySum/tweetDF.shape[0]),"percent of the time")

Occurs: 640 times
Occurs: 2.423599803082516 percent of the time
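
Two caveats apply to the comparison above. The metric columns were downcast to int16, and adding int16 columns keeps the int16 dtype, so a large enough sum can wrap around. Also, original tweets carry zero reference metrics in this dataframe, so any engagement on an original counts as exceeding. The sketch below uses toy values (not from the dataset), widens to int64 before summing, and reports the count with and without originals.

```python
import pandas as pd

own_cols = ["rtcount", "repcount", "likecount", "qtcount"]
ref_cols = ["ref_rtcount", "ref_repcount", "ref_likecount", "ref_qtcount"]

# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "tweet_type": ["retweeted", "original"],
    "rtcount": [0, 5], "repcount": [0, 1],
    "likecount": [0, 9], "qtcount": [0, 0],
    "ref_rtcount": [20000, 0], "ref_repcount": [100, 0],
    "ref_likecount": [30000, 0], "ref_qtcount": [10, 0],
})
toy = toy.astype({c: "int16" for c in own_cols + ref_cols})

# int16 + int16 stays int16, so this sum cannot represent 50000.
narrow_sum = toy["ref_rtcount"] + toy["ref_likecount"]
print(narrow_sum.dtype)

# Widen to int64 before summing to avoid wraparound.
own_total = toy[own_cols].astype("int64").sum(axis=1)
ref_total = toy[ref_cols].astype("int64").sum(axis=1)

exceeds = own_total > ref_total
non_original = toy["tweet_type"] != "original"

print("all rows:", int(exceeds.sum()))
print("non-original rows only:", int((exceeds & non_original).sum()))
```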


#### Check: How many non-original tweets get past our modified tags?

I specified that no replies, quotes or retweets be selected for the 
"HealthySkepticism" and "SelfAndID" tagged tweets. It appears that many non-original tweets got through the filter anyway. But just how many?

Answer: A lot. Only about 13% of the HealthySkepticism tweets are original, so roughly 87% got past the filter. The SelfandID share is not yet known: the earlier 6.5% figure used the HealthySkepticism original count (1072) as its numerator over the SelfandID total (16530), so the cell below has to be re-run. Why?


In [7]:
hsTagSum = (tweetDF["tag"] == "HealthySkepticism").sum()
saiTagSum = (tweetDF["tag"] == "SelfandID").sum()

#Select all tweets that have HealthySkepticism tags, and count the originals.
#it will likely still be 10%
hsSubSet = tweetDF[tweetDF["tag"] == "HealthySkepticism"]
hsOrigTotal = hsSubSet[ hsSubSet["tweet_type"] == "original" ].shape[0]
print("Percentage of Original Tweets for HS:", (100*hsOrigTotal/hsTagSum))

#Same for SelfandID. The originals must be counted in saiSubSet, not hsSubSet.
saiSubSet = tweetDF[tweetDF["tag"] == "SelfandID"]
saiOrigTotal = saiSubSet[ saiSubSet["tweet_type"] == "original" ].shape[0]
print("Percentage of Original Tweets for Sai:", (100*saiOrigTotal/saiTagSum))

Percentage of Original Tweets for HS: 13.272254549956667


In [8]:
#remember that we can plug in a series of truth values, to pick out rows.
rtCount = tweetDF[tweetDF["text"].str.contains("RT")].shape[0]
atCount = tweetDF[tweetDF["text"].str.contains("@")].shape[0]

print("Percentage of RTs::" + str(100*rtCount/tweetDF.shape[0]))
print("Percentage of ATs::" + str(100*atCount/tweetDF.shape[0]))


Percentage of RTs::45.03730071571932
Percentage of ATs::83.01586700496081
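
Note that `str.contains("RT")` matches the letters anywhere in the text, including inside ordinary words, which may explain why the RT percentage sits slightly above the share of `retweeted` tweets. A stricter check, sketched on made-up strings, tests for the `RT @` prefix instead:

```python
import pandas as pd

# Made-up tweet texts, for illustration only.
texts = pd.Series([
    "RT @someone: hello",   # a retweet
    "ALERT: new post",      # contains 'RT' inside a word
    "@friend check this",   # a mention, no retweet
])

contains_rt = texts.str.contains("RT")
starts_rt = texts.str.startswith("RT @")

print("substring match:", int(contains_rt.sum()))
print("prefix match:", int(starts_rt.sum()))
```

The `tweet_type` column remains the more reliable signal; the text check is only a cross-check.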


### Joining User Names to our tweetDF Dataframe

When fetching tweets, we get a user ID, but no user information directly.

From our stream URL parameters, we can screen out users based on
their follows, likes, etc. It would be a good idea to verify that our limits
were enforced. As seen previously, I asked for no replies/quotes/retweets,
and only about 11% of the tweets ended up original.

For this next section, usernames are exported from the tweetDF dataframe, and a node.js script is used to fetch all the user information.

Our code (which uses Bearer Token authentication) sends a GET request. This means we cannot load too many users at once, as they are URL-encoded and URLs have a maximum length. From some Googling, we can't send more than about 2 kilobytes, and our user list is 250 KB+. So a single request doesn't work...
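
One way around the URL length limit is to split the user list into batches and issue one GET per batch. The sketch below only builds the URLs. It assumes the v2 user lookup endpoint `GET /2/users/by?usernames=...` and a limit of 100 usernames per request; both should be checked against the current API documentation. `batch_urls` and the sample names are hypothetical.

```python
from urllib.parse import urlencode

USERS_BY_URL = "https://api.twitter.com/2/users/by"
BATCH_SIZE = 100  # assumed per-request username limit

def batch_urls(usernames, user_fields="public_metrics", batch_size=BATCH_SIZE):
    """Split usernames into batches and build one lookup URL per batch."""
    urls = []
    for start in range(0, len(usernames), batch_size):
        chunk = usernames[start:start + batch_size]
        query = urlencode({
            "usernames": ",".join(chunk),
            "user.fields": user_fields,
        })
        urls.append(f"{USERS_BY_URL}?{query}")
    return urls

# Made-up usernames, for illustration only.
names = [f"user{i}" for i in range(250)]
urls = batch_urls(names)

print(len(urls), "requests")
print(max(len(u) for u in urls), "characters in the longest URL")
```

Each URL would then be fetched with the same Bearer Token header as the curl requests earlier in this notebook.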

In [14]:
#An example of our user names
tweetDF["username"].iloc[20:40]
#Let's write our usernames to file:

#tweetDF.loc[:, ['username']].to_csv("./data/users.csv", sep='\t', encoding='utf-8',index=False)
pd.Series(tweetDF.username.unique()).to_csv("./data/users.csv", sep='\t', encoding='utf-8',index=False)