## Rule Refinement Template Notebook:

This notebook assumes that tweets have already been mined from Twitter and saved to a local file in JSON format.

It separates the tweets by tag, then analyzes each tag's dataframe in its own sub-section.

Users must (sadly) read through the tweets by hand and classify each one as "good" or "bad".

Using simple data tidying and an intuitive method (the set difference between the tokenized strings of "good" and "bad" tweets), we attempt to give users the information they need to refine their rules and to evaluate recent adjustments to those rules.
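
As a rough illustration of the set-difference idea, the sketch below tokenizes two small groups of tweets and reports the tokens that appear in only one group. The function names (`tokenize`, `token_difference`) and the sample tweets are hypothetical and are not part of `mods.dfprocess`; the tokenizer is a simple regex split, which is an assumption about what "tokenized" means here.

```python
import re

def tokenize(text):
    """Lower-case a tweet and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def token_difference(good_tweets, bad_tweets):
    """Return (tokens only in good tweets, tokens only in bad tweets)."""
    good_tokens = set().union(*(tokenize(t) for t in good_tweets))
    bad_tokens = set().union(*(tokenize(t) for t in bad_tweets))
    return good_tokens - bad_tokens, bad_tokens - good_tokens

# Hypothetical hand-labelled examples, for illustration only.
good = ["my journey into self discovery", "learning who i am"]
bad = ["buy my journey nft now", "who wants free nft"]

only_good, only_bad = token_difference(good, bad)
print(sorted(only_good))  # candidate terms to keep in or add to a rule
print(sorted(only_bad))   # candidate terms to exclude from a rule
```

Tokens that show up only on the "bad" side are natural candidates for negation terms in a rule, while tokens shared by both sides carry little signal.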

In [1]:
#Load our custom library. Importing it automatically lists the files in the local directory.
from mods.dfprocess import *

['testjsonPretty.json', 'users.csv', 'rule1-4_trial3.twts', 'ProposedRules.txt', 'testtweets.twts', 'runSept5_rule1-4.twts', 'testdatasets', 'self.twts', 'rule1-4_pubmetrics2.twts', 'selectedtweets.json', 'parensrules.twts', 'rule1-4_pubmetrics.twts', 'rules3-4_trial2.twts', 'rules1-4_trial1.twts', 'skeptic.twts']


In [2]:
#Load the Dataframe: enter path
tweetDF = generatedataframe("./data/runSept5_rule1-4.twts",5000)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5541 entries, 0 to 5540
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweetid        5541 non-null   int64 
 1   text           5541 non-null   object
 2   created_at     5541 non-null   object
 3   tagid          5541 non-null   int64 
 4   tag            5541 non-null   object
 5   userid         5541 non-null   int64 
 6   username       5541 non-null   object
 7   rtcount        5541 non-null   int16 
 8   repcount       5541 non-null   int16 
 9   likecount      5541 non-null   int16 
 10  qtcount        5541 non-null   int16 
 11  tweet_type     5541 non-null   object
 12  ref_tweetid    5541 non-null   int64 
 13  ref_authorid   5541 non-null   int64 
 14  ref_rtcount    5541 non-null   int16 
 15  ref_repcount   5541 non-null   int16 
 16  ref_likecount  5541 non-null   int16 
 17  ref_qtcount    5541 non-null   int16 
dtypes: int16(8), int64(5), object(5)

For the next step, we clean the "text" field of the tweetDF dataframe in four passes:

1) Remove rows with duplicate text (which occur when a user edits their tweet, or when a spam account repeats itself many times).

2) Convert all text to lower case (this simplifies tokenizing).

3) Strip out emoticons and other non-English characters (English characters only!).

4) Filter out slang and junk terms ("fr fr", "i got u fam", "trolololololo", "ya'll!!", etc.).

Let's get started!

In [3]:
#Detect duplicate text rows, and cut down tweetDF
tweetDF.drop_duplicates(subset=["text"],inplace=True)


In [4]:
#next make everything lowercase in the text column
tweetDF['text'] = tweetDF["text"].apply(lambda s: s.lower())

In [5]:
#strip emoticons and other non-English characters (helper from mods.dfprocess)
tweetDF['text'] = tweetDF["text"].apply(remove_noneng_chars)
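
The body of `remove_noneng_chars` lives in the custom library and is not shown in this notebook. A minimal sketch of what such a helper could look like follows, assuming that "English characters" means printable ASCII; the name `strip_non_ascii` is hypothetical and the real helper may behave differently.

```python
import re

# Assumption: "English characters" means printable ASCII (space through tilde).
_NON_ASCII = re.compile(r"[^\x20-\x7E]+")

def strip_non_ascii(text):
    """Replace runs of non-ASCII characters with a space, then tidy whitespace."""
    cleaned = _NON_ASCII.sub(" ", text)
    # split()/join collapses any repeated spaces left behind by the removal.
    return " ".join(cleaned.split())

print(strip_non_ascii("so proud of you 🎉🎉 today"))
```

Note that this approach also splits words containing accented letters, which is acceptable only if the downstream rules target English text.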

In [7]:
#let's generate our word removal list
#remember to apply it after the lowercase step!
screenWords = ["celebs","frfr","fr fr","lulz","rofl",
              "roflmao","lmao","lol","chuds","yall","y'all",
              "dem","demz","hella","cums","onlyfans","only fans",
              "plz","pls","noob","grindset","vibe","vibrations",
              "gurl","chill","nft","coom","cringe","based","alpha",
               "beta","sigma","mindset","babe","tpot","flex",
               "moon","pumps","apes","celeb","cuck","cucked"]

tweetDF['text'] = tweetDF["text"].apply(lambda s: eliminate_slang_strings(s,screenWords))
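
`eliminate_slang_strings` is also defined in the custom library and is not shown here. One possible way to write such a filter is sketched below, assuming whole-word matching is wanted so that a screen word like "based" does not clip "database"; the name `remove_screen_words` is hypothetical.

```python
import re

def remove_screen_words(text, words):
    """Delete whole-word occurrences of each entry in `words` from `text`."""
    if not words:
        return text
    # Longest first so multi-word phrases ("fr fr") match before their parts.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
    )
    stripped = pattern.sub(" ", text)
    # Collapse the gaps left behind by removed words.
    return " ".join(stripped.split())

print(remove_screen_words("lol that is based fr fr", ["fr fr", "lol", "based"]))
print(remove_screen_words("the database is up", ["based"]))
```

A plain `str.replace` loop would be simpler, but it removes substrings inside longer words, which can silently damage legitimate tweets.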

In [8]:
#cleaning can make previously distinct tweets identical, so drop duplicates again
tweetDF.drop_duplicates(subset=["text"],inplace=True)

In [30]:
#next, separate Dataframes based on tags. Get the tags, and call
#our separator a number of times.
tagList = (tweetDF.tag.unique()).tolist()

#SelfandID
#We add an explicit index column, since to_json does not write the index when orient="records"
saiDF = (tweetDF[tweetDF["tag"] == "SelfandID"]).copy(deep=True).reset_index(drop=True).reset_index()
#SearchTheVoid
stvDF = (tweetDF[tweetDF["tag"] == "SearchTheVoid"]).copy(deep=True).reset_index(drop=True).reset_index()
#SocietalShift
ssDF = (tweetDF[tweetDF["tag"] == "SocietalShift"]).copy(deep=True).reset_index(drop=True).reset_index()
#HealthySkepticism
hsDF = (tweetDF[tweetDF["tag"] == "HealthySkepticism"]).copy(deep=True).reset_index(drop=True).reset_index()
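
If the set of tags changes between runs, the per-tag split can also be written without hard-coding each tag name. The sketch below is one way to do that with `groupby`; the helper name `split_by_tag` and the tiny demo dataframe are hypothetical stand-ins for `tweetDF`.

```python
import pandas as pd

def split_by_tag(df, tag_column="tag"):
    """Return {tag: DataFrame}, each with a fresh 0..n-1 'index' column."""
    return {
        tag: group.reset_index(drop=True).reset_index()
        for tag, group in df.groupby(tag_column, sort=False)
    }

# Tiny stand-in for tweetDF, for illustration only.
demo = pd.DataFrame({
    "tweetid": [1, 2, 3],
    "text": ["a", "b", "c"],
    "tag": ["SelfandID", "SearchTheVoid", "SelfandID"],
})

frames = split_by_tag(demo)
print(frames["SelfandID"][["index", "tweetid", "text"]])
```

Each value in the returned dictionary has the same shape as the hand-built dataframes above, so the later export and labelling steps would apply unchanged.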

### SelfAndID Tweets:

First, we export the row indices and the text column, so that each tweet can be reviewed and the good ones identified by index.

In [33]:
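#the row number is already exported as the "index" column; with orient="records" to_json never writes the dataframe index itself, so index=True is dropped
saiDF.loc[:,["index","tweetid","text"]].to_json("./data/saiDF.json",orient="records",force_ascii=False)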

In [31]:
#fill in the "index" values of the tweets judged "good" after reviewing ./data/saiDF.json
saiGoodIndices = []

| | index | tweetid | text | created_at | tagid | tag | userid | username | rtcount | repcount | likecount | qtcount | tweet_type | ref_tweetid | ref_authorid | ref_rtcount | ref_repcount | ref_likecount | ref_qtcount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1566881137418539011 | you might never write the most popular story i... | 2022-09-05T20:09:09.000Z | 1566880937417252880 | SelfandID | 1280637214813294592 | connor_xander | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1566881214358847488 | fourth step guide journey into growth: hazelde... | 2022-09-05T20:09:27.000Z | 1566880937417252880 | SelfandID | 1557623677474291713 | ErikSmitham | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1566881240023650304 | i still- they don't know | 2022-09-05T20:09:33.000Z | 1566880937417252880 | SelfandID | 1559426256093216768 | AvaF3rin | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 1566881265734787072 | journey to the edge of the light: a story of l... | 2022-09-05T20:09:39.000Z | 1566880937417252880 | SelfandID | 1553821745978695680 | ChesterAufderh2 | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 1566881283434655744 | brendan fraser says “i looked different in tho... | 2022-09-05T20:09:43.000Z | 1566880937417252880 | SelfandID | 1510495488637693956 | SandySidney3 | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2979 | 2979 | 1567035171832213504 | fraud companny @paywithring | 2022-09-06T06:21:13.000Z | 1566880937417252880 | SelfandID | 1481282004373368835 | rtdpratik1995 | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 2980 | 2980 | 1567035203633266689 | why being so vulnerable so hard? | 2022-09-06T06:21:21.000Z | 1566880937417252880 | SelfandID | 1485722084378558467 | imconfusedtaf | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 2981 | 2981 | 1567035319027122176 | jaydot__'s journey; day 1 of trying to convinc... | 2022-09-06T06:21:48.000Z | 1566880937417252880 | SelfandID | 1459838331664044036 | jaydot__twt | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
| 2982 | 2982 | 1567035401839452161 | i feel like a legit mak cik now bringing this ... | 2022-09-06T06:22:08.000Z | 1566880937417252880 | SelfandID | 1478708346 | MzAiyang | 0 | 0 | 0 | 0 | original | 0 | 0 | 0 | 0 | 0 | 0 |
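
Once the good indices have been chosen by hand, the tag dataframe can be split into "good" and "bad" subsets for comparison. The sketch below shows one way to do this with `isin`; the helper name `split_good_bad`, the demo rows, and the chosen indices are all hypothetical.

```python
import pandas as pd

def split_good_bad(df, good_indices, index_column="index"):
    """Split df into (good, bad) using hand-picked values of index_column."""
    is_good = df[index_column].isin(good_indices)
    return df[is_good].copy(), df[~is_good].copy()

# Tiny stand-in for a tag dataframe such as saiDF, for illustration only.
demo = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "text": [
        "finding myself again",
        "free crypto giveaway",
        "who am i really",
        "click this link",
    ],
})

good_df, bad_df = split_good_bad(demo, good_indices=[0, 2])
print(good_df["text"].tolist())
print(bad_df["text"].tolist())
```

The two `text` columns are then the inputs for the token set-difference comparison described at the top of the notebook.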
