## Rule Refinement Template Notebook:

This notebook assumes that the data has already been mined from Twitter and saved to a local file in JSON format.

This notebook will separate all tweets by tag, and then run sub-section analysis on each of the tag dataframes.

Users must manually look through the tweets (sadly) and classify each one as "good" or "bad".

Using simple data tidying and an intuitive method (the set difference between the tokenized strings of "good" and "bad" tweets), we attempt to give the user the information needed to refine their rules and to evaluate recent adjustments to them.

In [70]:
#Load our custom library. Outputs files in local directory automatically.
from mods.dfprocess import *

In [None]:
tweetDF = generatedataframe("./data/runSept5_rule1-4.twts",5000)

def cleanDF(tDF):
    tDF.drop_duplicates(subset=["text"],inplace=True)
    tDF['text'] = tDF["text"].apply(lambda s: s.lower())
    tDF['text'] = tDF["text"].apply(lambda s: eliminate_slang_strings(s,screenWords))
    tDF['text'] = tDF["text"].apply(lambda s: remove_punct(s))
    tDF['text'] = tDF["text"].apply(lambda s: remove_noneng_chars(s))
    return tDF

                                    

In [99]:
#Load the Dataframe: enter path
tweetDF = generatedataframe("./data/runSept5_rule1-4.twts",5000)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5541 entries, 0 to 5540
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweetid        5541 non-null   int64 
 1   text           5541 non-null   object
 2   created_at     5541 non-null   object
 3   tagid          5541 non-null   int64 
 4   tag            5541 non-null   object
 5   userid         5541 non-null   int64 
 6   username       5541 non-null   object
 7   rtcount        5541 non-null   int16 
 8   repcount       5541 non-null   int16 
 9   likecount      5541 non-null   int16 
 10  qtcount        5541 non-null   int16 
 11  tweet_type     5541 non-null   object
 12  ref_tweetid    5541 non-null   int64 
 13  ref_authorid   5541 non-null   int64 
 14  ref_rtcount    5541 non-null   int16 
 15  ref_repcount   5541 non-null   int16 
 16  ref_likecount  5541 non-null   int16 
 17  ref_qtcount    5541 non-null   int16 
dtypes: int16(8), int64(5), object(5)

For the next step, we need to clean the "text" field of the tweetDF dataframe. We apply the following steps, in this order:

1) Remove duplicate text rows (which occur when a user edits their tweet, or a spam account repeats itself many times).

2) Apply lower-case to all text (helps simplify our tokenizing).

3) Filter for slang or bullshit terms ("fr fr", "i got u fam", "trolololololo", "ya'll!!", etc.). The screen list is lower-case, so this has to run after step 2.

4) Clip out punctuation, emoticons and weird characters (English characters only!).

Let's get started!
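The helpers `eliminate_slang_strings` and `remove_noneng_chars` come from `mods.dfprocess`, and their source is not shown in this notebook. As a rough sketch only, assuming they do plain substring removal and ASCII filtering (both are guesses; the real implementations may differ), the text-level steps could look like this. All function names ending in `_sketch` are hypothetical.

```python
import re

# Hypothetical stand-ins for the helpers in mods.dfprocess. The real
# implementations are not shown in this notebook and may behave differently.

def eliminate_slang_strings_sketch(text, screen_words):
    """Remove every occurrence of each screen word (plain substring match)."""
    # Longest first, so "roflmao" is removed before "rofl" or "lmao".
    for word in sorted(screen_words, key=len, reverse=True):
        text = text.replace(word, "")
    return text

def remove_noneng_chars_sketch(text):
    """Drop anything outside printable ASCII, then collapse repeated spaces."""
    ascii_only = re.sub(r"[^\x20-\x7E]", "", text)
    return re.sub(r" {2,}", " ", ascii_only).strip()

def clean_text_sketch(text, screen_words):
    """Lower-case, strip slang, then strip non-English characters."""
    text = text.lower()
    text = eliminate_slang_strings_sketch(text, screen_words)
    return remove_noneng_chars_sketch(text)

sample = "LOL this vibe is 🔥 fr fr"
cleaned = clean_text_sketch(sample, ["lol", "fr fr", "vibe"])
print(cleaned)  # this is
```

One caveat with plain substring matching: a short screen word such as "based" would also be cut out of longer words that contain it, so padding entries with spaces (as the list does for " stan " and " da ") may be safer.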

In [100]:
#Detect duplicate text rows, and cut down tweetDF
tweetDF.drop_duplicates(subset=["text"],inplace=True)


In [101]:
#next make everything lowercase in the text column
tweetDF['text'] = tweetDF["text"].apply(lambda s: s.lower())

In [102]:
#lets generate our word removal list
#remember to apply after lowercase function!
screenWords = ["celebs","frfr","fr fr","lulz","rofl",
              "roflmao","lmao","lol","chuds","yall","y'all",
              "dem","demz","hella","cums","onlyfans","only fans",
              "plz","pls","noob","grindset","vibe","vibrations",
              "gurl","chill","nft","coom","cringe","based","alpha",
               "beta","sigma","mindset","babe","tpot","flex",
               "moon","pumps","apes","celeb","cuck","cucked",
              "smh","goes hard"," stan ","jesus","lord"," da ",
               "ass","mfers","mfer","thicc","nigga","!!","!!!",
              "??","???","http://","https://","ya'll"]

tweetDF['text'] = tweetDF["text"].apply(lambda s: eliminate_slang_strings(s,screenWords))

In [103]:
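import re

#Escape "[" and "]" inside the character class: an unescaped "]"
#closes the class early, and the pattern then matches almost nothing.
PUNCT_RE = re.compile(r'[.!?\n/",}{\[\]<>()+:;]+')

def remove_punct(data):
    #Delete every run of the listed punctuation characters (and newlines)
    return PUNCT_RE.sub("",data)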

tweetDF['text'] = tweetDF["text"].apply(lambda s: remove_punct(s))
tweetDF['text'] = tweetDF["text"].apply(lambda s: remove_noneng_chars(s))

In [104]:
tweetDF.drop_duplicates(subset=["text"],inplace=True)

In [105]:
#next, separate Dataframes based on tags. Get the tags, and call
#our separator a number of times.
tagList = (tweetDF.tag.unique()).tolist()

#SelfAndID
#We have to add an explicit "index" column, as to_json doesn't write the row index when orient="records"
saiDF = (tweetDF[tweetDF["tag"] == "SelfandID"]).copy(deep=True).reset_index(drop=True).reset_index()
#SearchTheVoid
stvDF = (tweetDF[tweetDF["tag"] == "SearchTheVoid"]).copy(deep=True).reset_index(drop=True).reset_index()
#SocietalShift
ssDF = (tweetDF[tweetDF["tag"] == "SocietalShift"]).copy(deep=True).reset_index(drop=True).reset_index()
#HealthySkepticism
hsDF = (tweetDF[tweetDF["tag"] == "HealthySkepticism"]).copy(deep=True).reset_index(drop=True).reset_index()
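The four near-identical lines above could also be generated from the tag values themselves. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `frames_by_tag` name are illustrative, not part of this notebook), using `groupby` so that no tag name has to be typed by hand:

```python
import pandas as pd

# Toy frame standing in for tweetDF; the rows are invented for illustration.
toy = pd.DataFrame({
    "tag": ["SelfandID", "SearchTheVoid", "SelfandID"],
    "text": ["a", "b", "c"],
})

# One frame per tag. The second reset_index() turns the fresh row
# numbers into an explicit "index" column, as in the cell above.
frames_by_tag = {
    tag: group.reset_index(drop=True).reset_index()
    for tag, group in toy.groupby("tag")
}

print(frames_by_tag["SelfandID"])
```

A dictionary keyed by tag also avoids silent mismatches between a hard-coded tag string and the spelling that actually appears in the data.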

In [150]:
hsDF.shape

(108, 19)

### SelfAndID Tweets:

First, we export the row indices and text column, to identify what is a good tweet.

In [107]:
saiDF.loc[:,["index","tweetid","text"]].to_json("./data/saiDF.json",orient="records",index=True,force_ascii=False)

In [108]:
saiGoodIndices = [0,62,65,291,381,445,447,566,592,641,945,960,1144,1167]

After going through almost 3000 tweets (ugh), I was able to extract about 30 "good" tweets that mirror what I am looking for. The rest were spam, gibberish, or poor takes.

A few other filters will also have to be devised (implemented above). Most tweets that have URL links (http://t.co ...) are crap tweets.
In addition, there is more spam than I imagined. Many automated accounts post the same tweet with a hash number appended to the end, to fool the Twitter algorithm. Consider the following example, below:

**Note:** Cutting out URL tweets reduced our tweet set by a factor of 2. Noticeable improvement!


So in addition to further screening, we need a function that measures string similarity (and a threshold test to apply to all of our strings).

Based on this Stack Overflow post: https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings

The Levenshtein distance would be quite appropriate (the strings don't have to be the same length, and it does not favor prefix matches over matches elsewhere in the string). However, comparing every string to every other string takes O(n^2) comparisons for n tweets, and each Levenshtein computation costs O(m^2) for strings of length m with the standard dynamic-programming algorithm, so the total is O(n^2 m^2).

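A faster method might be to calculate the LD of each row against a fixed reference string ("The Quick brown fox...."), and then bin rows with similar distances in a histogram. This is an approximation at best: by the triangle inequality, near-duplicates always land in nearby bins, but two unrelated strings can also sit at the same distance from the reference, so each bin would still need a pairwise check.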

I might implement this later, if the spam gets too bad. It might be easier just to add more constraints, to avoid going down this rabbit hole.
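For reference, here is a minimal sketch of the brute-force pairwise check described above. It is not implemented anywhere in this notebook; the function names, the `max_ratio` threshold of 0.25, and the sample tweets are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance via the two-row dynamic program, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def near_duplicate_pairs(texts, max_ratio=0.25):
    """Index pairs whose distance is at most max_ratio of the longer string."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            longest = max(len(texts[i]), len(texts[j]), 1)
            if levenshtein(texts[i], texts[j]) / longest <= max_ratio:
                pairs.append((i, j))
    return pairs

# Invented examples: two copies of a spam tweet with different hash
# suffixes, plus one unrelated tweet.
tweets = [
    "win a free phone today 8f3a1c",
    "win a free phone today 77b2e0",
    "reading about identity and the self",
]
print(near_duplicate_pairs(tweets))  # [(0, 1)]
```

Normalizing by the longer string keeps the threshold meaningful for both short and long tweets. The double loop is exactly the quadratic cost discussed above, so this only makes sense on a few thousand rows at most.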

Now it is time to build the token sets and hash tables (dictionaries), to do counts and set difference operations. From this, I hope to gather information about how to adjust my rules to get fewer, more focused matches. Let's get started...

**Remember:** You must complete all text cleaning before you identify the "good indices"; otherwise, a second cleaning pass can shift or remove rows, and the saved indices will point at the wrong (or nonexistent) tweets.
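One possible way to make the labels survive a later cleaning pass is to record them by `tweetid` rather than by row position. This is a sketch of that idea on a made-up toy frame, not what the cells below do; the `toy` data and `good_ids` values are invented.

```python
import pandas as pd

# Toy frame standing in for saiDF; the tweetid values are made up.
toy = pd.DataFrame({
    "tweetid": [101, 102, 103, 104],
    "text": ["keep me", "spam spam", "also good", "more spam"],
})

# Labels recorded as tweet IDs instead of row positions.
good_ids = {101, 103}

# Simulate a later cleaning pass that drops a row and renumbers the index.
toy = toy[toy["text"] != "spam spam"].reset_index(drop=True)

# The ID-based mask still selects the right rows after the renumbering.
is_good = toy["tweetid"].isin(good_ids)
good_texts = toy.loc[is_good, "text"].tolist()
bad_texts = toy.loc[~is_good, "text"].tolist()
print(good_texts, bad_texts)
```

With positional labels, "also good" would have moved from row 2 to row 1 after the drop, and a saved index of 2 would have pointed at "more spam" instead.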

In [111]:
saiGoodDict = {}
saiBadDict = {}
saiGoodSet = set()
saiBadSet = set()

#For each text string in our saiDF:
#Separate out good rows from bad (two separate dataframes)
#
saiGoodSer = saiDF["text"].iloc[saiGoodIndices]
saiBadSer = saiDF[~saiDF.index.isin(saiGoodIndices)]["text"]


#Series, set, dict -> None (mutates wSet and wDict)
def insert_hashandset(textSeries,wSet,wDict):
    for text in textSeries:
        #Screen out tokens that are <= 3 in length; this also drops
        #the empty strings left behind by repeated spaces
        for token in text.split(" "):
            if len(token) > 3:
                wSet.add(token)
                wDict[token] = wDict.get(token,0) + 1
    return
        
insert_hashandset(saiGoodSer,saiGoodSet,saiGoodDict)  
insert_hashandset(saiBadSer,saiBadSet,saiBadDict)


In [113]:
len(saiGoodSet)

216

In [112]:
len(saiBadSet)

4560

In [117]:
uniqueGoodWords = saiGoodSet.difference(saiBadSet)
uniqueBadWords = saiBadSet.difference(saiGoodSet)
intersectWords = saiGoodSet.intersection(saiBadSet)
len(intersectWords)

183

### Visualizing our Top Words:

We need a fast way to rank our words by counts. A plain dictionary has no ordering by value, and heaps/trees can be made to work on composite objects, but coding this takes time. When doing data analytics, the most important data structure is of course the DataFrame. A short two-column table (word, count) will be made for each word dictionary, which we plug+chug with our word sets. We then just do a row sort on the finished data structure. Easy enough.
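As an aside, the standard library's `collections.Counter` can do the same ranking without a DataFrame. This is only a sketch of an alternative, with invented sample tweets; `token_counts` is a hypothetical helper, and `min_len=4` is chosen to mirror the `len(...) > 3` filter used above.

```python
from collections import Counter

def token_counts(texts, min_len=4):
    """Count whitespace-separated tokens of at least min_len characters."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in text.split() if len(tok) >= min_len)
    return counts

# Invented sample tweets, for illustration only.
good = token_counts(["quiet reflection on identity", "identity takes practice"])
bad = token_counts(["click here free journey", "free journey starts", "free stuff"])

# Words seen only in bad tweets, ranked by how often they occur there.
bad_only = Counter({w: c for w, c in bad.items() if w not in good})
top_bad = bad_only.most_common(2)
print(top_bad)  # [('free', 3), ('journey', 2)]
```

The DataFrame route used below is still handy when the counts need to be displayed, filtered, or exported alongside the rest of the analysis.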


In [131]:
#making our saiGoodDF first:

saiGoodDF = pd.DataFrame(columns=["word","count"])
wordList = []
countList = []
for word in uniqueGoodWords:
    wordList.append(word)
    countList.append(saiGoodDict[word])

saiGoodDF["word"] = pd.Series(wordList)
saiGoodDF["count"] = pd.Series(countList,dtype="int16")
saiGoodDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    33 non-null     object
 1   count   33 non-null     int16 
dtypes: int16(1), object(1)
memory usage: 458.0+ bytes


And, looking at our saiGoodDF, all of our counts are one! 

It looks like all happy families are fairly unique. What about the unhappy ones? Can we find high counts of bad words?

In [132]:
saiBadDF = pd.DataFrame(columns=["word","count"])
wordList = []
countList = []
for word in uniqueBadWords:
    wordList.append(word)
    countList.append(saiBadDict[word])

saiBadDF["word"] = pd.Series(wordList)
saiBadDF["count"] = pd.Series(countList,dtype="int16")
saiBadDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4377 entries, 0 to 4376
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    4377 non-null   object
 1   count   4377 non-null   int16 
dtypes: int16(1), object(1)
memory usage: 42.9+ KB


In [143]:
pd.set_option('display.max_rows', 500)
saiBadDF.sort_values(ascending=False,axis="index",by="count",inplace=True)
display(saiBadDF.head(75))

index,word,count
1572,dnj3989,250
3672,said,137
3918,there,136
2918,personal,132
3950,sort,126
4309,ٱےـهـربَـ,125
4059,نوننمشيay20ay20ٱےهےربَـ,125
139,didn'tك̷و̷د̶,125
2290,̸خ̷ص̸م̴,125
1599,best,119


**Conclusion:** The following words need to be screened out from the search string: empire, personal, fraud, real, education. Also the word "journey" needs to be removed from search, which was highly correlated with clickbait and link posting.
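If "screened out" is done with the negation operator (a leading `-`) of the Twitter API v2 filtered stream rule syntax, the exclusions can be appended programmatically. This is only a sketch under that assumption: `base_rule` is a made-up placeholder and not one of the rules used in this notebook, and `add_exclusions` is a hypothetical helper.

```python
def add_exclusions(rule_value, excluded_words):
    """Append "-word" for each excluded word that is not already negated."""
    existing = set(rule_value.split())
    terms = [f"-{w}" for w in excluded_words if f"-{w}" not in existing]
    return " ".join([rule_value.strip()] + terms)

base_rule = "(identity OR self) lang:en"  # hypothetical rule value
screened = ["empire", "personal", "fraud", "real", "education"]
new_rule = add_exclusions(base_rule, screened)
print(new_rule)
# (identity OR self) lang:en -empire -personal -fraud -real -education
```

Running the helper twice on the same rule leaves it unchanged, which makes it safe to re-apply after each round of analysis.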

Will try to see if this improves our rules.

### SearchTheVoid DataFrame Analysis:

In [147]:
stvDF.shape

(724, 19)

In [148]:
#export and select "good indices"
stvDF.loc[:,["index","tweetid","text"]].to_json("./data/stvDF.json",orient="records",index=True,force_ascii=False)

**Conclusion**: I found a whole three tweets that I would label as good. This is simply too few to do a set difference analysis. Almost every tweet in the set has the words "help me" - this leads to an incredible amount of spam, bullshit requests that fall on deaf ears, and the odd post that has undertones of suicidal ideation. "Help me" is getting removed from the rule value.

### The Other Data Frames:

Both HealthySkepticism and SocietalShift have <120 tweets. I skimmed both of them just to see if there were any terms (by eye) that were highly correlated with low-quality tweets. Didn't find anything.

¯\_(ツ)_/¯

In [151]:
ssDF.loc[:,["index","tweetid","text"]].to_json("./data/ssDF.json",orient="records",index=True,force_ascii=False)

In [152]:
hsDF.loc[:,["index","tweetid","text"]].to_json("./data/hsDF.json",orient="records",index=True,force_ascii=False)

### Final Result:

After adjusting my rules, there was a significant drop-off in our stream rate. I end up with one match every 30 seconds!

Hopefully spam/garbage will be reduced.