In [1]:
import numpy as np 
import pandas as pd 
import altair as alt
import matplotlib.pyplot as plt
import re
import json

import os
print(os.listdir("./data"))

testFilePath = "./data/testtweets.twts"


['ProposedRules.txt', 'testtweets.twts', 'testdatasets', 'self.twts', 'selectedtweets', 'rules3-4_trial2.twts', 'rules1-4_trial1.twts', 'skeptic.twts']


### Introduction:

In this notebook, I hope to load tweets from a file, parse them as JSON objects, and do basic analytics on the data. The tweets are in UTF-8, so cleaning and parsing the data may be challenging. Let's see how it goes...

To begin, let's load the dataset. We start with the test file, which holds 1000 tweets, as a simple test.

In [2]:
dfTweets = pd.read_json(testFilePath,lines=True)
print(dfTweets.shape)
dfTweets.head()

(1000, 2)


Unnamed: 0,data,matching_rules
0,"{'id': '1560703992908451840', 'text': 'RT @dav...","[{'id': '1560702401342017538', 'tag': 'Selfand..."
1,"{'id': '1560703995701669889', 'text': 'RT @Fox...","[{'id': '1560702401342017538', 'tag': 'Selfand..."
2,"{'id': '1560703992308457472', 'text': 'RT @hos...","[{'id': '1560702401342017538', 'tag': 'Selfand..."
3,"{'id': '1560703992736272384', 'text': 'In this...","[{'id': '1560702401342017538', 'tag': 'Selfand..."
4,"{'id': '1560703994040909827', 'text': 'RT @Tej...","[{'id': '1560702401342017538', 'tag': 'Selfand..."


This leaves dictionary objects in our entry cells, which is annoying. It looks like read_json() has no parameter to prevent this.

A simple file-read loop with json.loads() is probably a better approach.
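As an aside, `pd.json_normalize` can also flatten nested records. The sketch below is only an illustration and is not what the rest of this notebook uses: the two sample records are made up to mirror the shape of the lines in the file, and `record_path` yields one row per matching rule instead of keeping only the first one.

```python
import pandas as pd

# Hypothetical sample records, shaped like the JSON lines in the .twts file.
records = [
    {"data": {"id": "1", "text": "first tweet"},
     "matching_rules": [{"id": "10", "tag": "SelfandID"}]},
    {"data": {"id": "2", "text": "second tweet"},
     "matching_rules": [{"id": "10", "tag": "SelfandID"},
                        {"id": "20", "tag": "HealthySkepticism"}]},
]

# One output row per entry in "matching_rules"; the tweet id and text are
# carried along as metadata columns named "data.id" and "data.text".
flatDF = pd.json_normalize(
    records,
    record_path="matching_rules",
    meta=[["data", "id"], ["data", "text"]],
)

# Rename to the column names used elsewhere in this notebook.
flatDF = flatDF.rename(
    columns={"id": "tagid", "data.id": "tweetid", "data.text": "text"}
)
flatDF = flatDF[["tweetid", "text", "tagid", "tag"]]
print(flatDF)
```

A tweet that matches two rules appears twice here, so tag counts stay complete, at the cost of duplicated text.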


In [3]:
tweetDF = pd.DataFrame(columns=["tweetid", "text", "tagid","tag"])
##Note! On rare occasions, two or more rules can match one tweet. This currently keeps only the first match.
##Information loss can occur.
#The with-block closes the file on exit, so no explicit close() is needed.
with open(testFilePath, encoding="utf-8") as fp:
    for line in fp:
        jsonObj = json.loads(line)
        v1 = jsonObj["data"]["id"]
        v2 = jsonObj["data"]["text"]
        v3 = jsonObj['matching_rules'][0]["id"]
        v4 = jsonObj['matching_rules'][0]["tag"]
        #append one row per tweet at the next integer index
        tweetDF.loc[len(tweetDF.index)] = [v1,v2,v3,v4]

tweetDF["tweetid"] = pd.to_numeric(tweetDF["tweetid"])
tweetDF["tagid"] = pd.to_numeric(tweetDF["tagid"])



In [4]:
#verify that it works.
print(tweetDF.shape)
print()
print(tweetDF.isnull().sum())
print()
print(tweetDF.info())

(1000, 4)

tweetid    0
text       0
tagid      0
tag        0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   tweetid  1000 non-null   int64 
 1   text     1000 non-null   object
 2   tagid    1000 non-null   int64 
 3   tag      1000 non-null   object
dtypes: int64(2), object(2)
memory usage: 39.1+ KB
None


Next, let's generate some summary statistics for the 1000 tweets we imported.

In [5]:

distDF = pd.DataFrame(columns=["sum","percentage"],index=tweetDF.tag.unique())

#thankfully, when we write the series to our col, it aligns on the row index labels.
distDF["sum"] = tweetDF.groupby('tag')['tagid'].count()
totalSum = tweetDF.shape[0]
#vectorized assignment; the per-row chained form distDF["percentage"][strIndex] = ... triggers SettingWithCopyWarning
distDF["percentage"] = (distDF["sum"]*100)/totalSum

distDF
#5% societal shift, 61% self, 2.5% searchthevoid, 32% Skepticism.


Unnamed: 0,sum,percentage
SelfandID,607,60.7
HealthySkepticism,319,31.9
SocietalShift,49,4.9
SearchTheVoid,25,2.5


From the percentages above, we can see how much of our bandwidth each tag consumes. There probably isn't enough data for the SocietalShift and SearchTheVoid rules; these should be split off from the other two rules and run over a longer time frame.

For tweets matched by the SelfandID and HealthySkepticism rules, we must separate poor tweets from good ones. This can be done as follows:

1) A subset of tweets is visually inspected in VS Code (for ease), and the "good" tweets are identified.

2) Good tweets are separated from bad ones.

3) Next, word sets are accumulated for the good and bad tweet sets.

4) We calculate the difference between the two sets in both directions: the words found only in the "good" tweets, and the words found only in the "bad" tweets.

5) From this information, we refine our rules and gather data again.
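Steps 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the labelled tweets are invented examples, and `wordSet` is a hypothetical helper, not a function defined elsewhere in this notebook.

```python
import string

# Hypothetical hand-labelled examples; the real sets come from manual inspection.
goodTweets = ["I doubt this claim, where is the evidence?", "Who am I, really?"]
badTweets = ["WIN a FREE phone, click the link!", "Free crypto giveaway, click now"]

def wordSet(tweets):
    """Return the unique lower-cased tokens, with edge punctuation stripped."""
    words = set()
    for text in tweets:
        for token in text.split():
            token = token.strip(string.punctuation).lower()
            if token:
                words.add(token)
    return words

goodWords = wordSet(goodTweets)
badWords = wordSet(badTweets)

# Set difference in both directions: words unique to each class.
onlyGood = goodWords - badWords
onlyBad = badWords - goodWords

print(sorted(onlyGood))
print(sorted(onlyBad))
```

Words shared by both classes (here "the") drop out of both differences, which is what makes the remaining words candidates for rule refinement in step 5.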

In [6]:
#From our data table, let's separate our tweets by tag and write them to file, to make inspecting the tweets easier.

#first get SelfandID tweets:

selfDF =  tweetDF[tweetDF.tag.isin(['SelfandID'])]
selfDF.shape

skeptDF =  tweetDF[tweetDF.tag.isin(['HealthySkepticism'])]
skeptDF.shape


(319, 4)

In [7]:
#Can we write this to our data directory? Yes

selfDF.loc[:, ['tweetid', 'text']].to_csv("./data/self.twts", sep='\t', encoding='utf-8',index=False)

In [8]:
skeptDF.loc[:, ['tweetid', 'text']].to_csv("./data/skeptic.twts", sep='\t', encoding='utf-8',index=False)

After looking at the tweets in a visual editor, we can start to devise methods for classifying tweets as good and bad.

Manual Filter: Tokenize all good and bad tweets, narrow down the two sets (cut out articles, prepositions, etc.), and then find the set difference between them. Use this difference to construct negative filters.

**Implicit Assumption:** Those who write "bad" tweets use certain words and expressions that I can identify and filter on, with minimal false positives on the "good" set. (!!!) This is a very strong assumption (!!!).
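One way to put a number on that assumption is to run a candidate negative filter over both labelled sets and count the hits. The sketch below is illustrative only: the filter words, the labelled tweets, and the `isFlagged` helper are all hypothetical.

```python
import string

# Hypothetical candidate filter words and hand-labelled tweets.
negativeWords = {"giveaway", "free", "click"}
goodTweets = [
    "Feel free to doubt everything you read",
    "Who am I, really?",
    "Where is the evidence for this?",
]
badTweets = [
    "Free crypto giveaway, click now",
    "WIN a phone today!!!",
]

def isFlagged(text, words):
    """True if any token of the tweet is one of the filter words."""
    tokens = {t.strip(string.punctuation).lower() for t in text.split()}
    return not tokens.isdisjoint(words)

# Good tweets the filter would wrongly remove, and bad tweets it would catch.
falsePositives = [t for t in goodTweets if isFlagged(t, negativeWords)]
caught = [t for t in badTweets if isFlagged(t, negativeWords)]

fpRate = len(falsePositives) / len(goodTweets)
catchRate = len(caught) / len(badTweets)
print(f"false positive rate: {fpRate:.2f}, catch rate: {catchRate:.2f}")
```

In this toy data the word "free" flags a good tweet, which is exactly the kind of false positive the assumption glosses over.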

In [9]:
#Code for our tokenizing and set calculation will be found here

#testDF = pd.DataFrame({"a": ["a b c","a d e","a a a"], "b":["b","b","b"], "c":["c","c","c"]}, index=[1,2,3])
#for translate trick, see: https://stackoverflow.com/questions/3939361/remove-specific-characters-from-a-string-in-python
#Built once: maps each unwanted character to None, so translate() deletes it.
stripTable = {ord(c): None for c in '!@#$?:;,.[]<>&*'}

def tokenizeSetMap(text,wSet,wDict):
    #split() with no argument splits on any whitespace and never yields empty tokens
    for token in text.split():
        token = token.translate(stripTable).lower()
        if not token:
            continue #the token was punctuation only
        wSet.add(token)
        wDict[token] = wDict.get(token,0)+1
    return

#Signature: DF, column name -> [Set, Dict]
def constructWordSet(targetDF,colName):
    #Extract the requested text column from the data frame
    wordSet = set()
    wordDict = {}
    targetDF[colName].apply(tokenizeSetMap, args=(wordSet,wordDict,))
    return [wordSet, wordDict]

hold = constructWordSet(skeptDF,"text")
print(hold[1])



The above is a huge mess. There are non-English words, lots of emoticons, misspellings, hashtags, random gibberish, textspeak, abbreviations, etc.

Before I go into NL processing, I should filter out more text at the Twitter endpoint, to minimize work.

Specifically, Twitter can pre-filter with the following operators:

- "lang:en": only English tweets are allowed (only if they have been classified; not perfect).
- "followers_count:a..b", "tweets_count:a..b", "following_count:a..b": use these to cut out very new and very old accounts. We can find our "sweet spot" of users.
- "-has:mentions": cuts out tweets that mention other users.
- "-has:cashtags", "-has:hashtags": cuts out tagged tweets (usually crap, by inspection).

A new tweet set will be mined, using these rules.
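A rule string combining these operators could be assembled as in the sketch below. Everything specific in it is an assumption for illustration: the base query, the follower bounds, and the `buildRule` helper are hypothetical, and the `{"add": [...]}` payload shape reflects my understanding of the filtered stream rules endpoint rather than anything verified in this notebook.

```python
import json

def buildRule(baseQuery, tag, minFollowers, maxFollowers):
    """Wrap a base query with the pre-filter operators listed above."""
    operators = [
        "lang:en",
        f"followers_count:{minFollowers}..{maxFollowers}",
        "-has:mentions",
        "-has:cashtags",
        "-has:hashtags",
    ]
    # Parentheses keep any OR inside the base query from absorbing the operators.
    return {"value": f"({baseQuery}) " + " ".join(operators), "tag": tag}

# Hypothetical base query and follower range, for illustration only.
rule = buildRule('"who am I" OR identity', "SelfandID", 100, 5000)

# Assumed request body for adding a rule to the stream.
payload = json.dumps({"add": [rule]})
print(payload)
```

Keeping the operators in one helper makes it easy to apply the same pre-filters to every tag and to tune the account ranges in a single place.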


## SelfDF Results:

## SkeptDF Results: