## Setup

The following cell imports the packages needed for the text processing in this notebook.

In [None]:
# Some general import statements
import re
import json
import random
import numpy as np
import pandas as pd
from time import time
from pathlib import Path
import matplotlib.pyplot as plt

# Import statements related to Gensim + NLTK
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models import phrases
from nltk.corpus import stopwords

# Import statements for the visualization tools
# (matplotlib.pyplot is already imported above as plt)
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
%matplotlib inline

# Setting up spaCy's nlp module
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Grab the stopwords from NLTK
stopWords = stopwords.words("english")

# Ignore warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Data Preparation

To process the tweets, I need to pull them from each of the JSONs and put them into a consistent form. I'm going to create a dict where:

*account name* --> {"tweets": \[tweet1, tweet2, ...\], "retweets": ..., "likes": ... }

Each entry (whether it's an original Tweet, a Like, or a Retweet) will be a string.
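
As a rough illustration of that mapping, the sketch below builds the structure for one made-up account. The account name and the Tweet strings are placeholders, not real data.

```python
# Hypothetical account and Tweet text, used only to show the shape of the dict
exampleAccountData = {
    "example_user": {
        "tweets": ["just landed in denver", "coffee first"],
        "retweets": ["big news today"],
        "likes": [],
    }
}

# Every entry, regardless of category, is a plain string
for category, texts in exampleAccountData["example_user"].items():
    print(category, len(texts))
```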

To run the rest of the notebook, split the data into two folders, "Blocked" and "Not-Blocked", then insert their paths below:

In [None]:
blocked_folder_path = ""
not_blocked_folder_path = ""

In [None]:
# Iterate through the folders where the JSONs are stored and store their paths
jsonPathList = []
jsonFolders = [Path(blocked_folder_path), Path(not_blocked_folder_path)]
for jsonFolder in jsonFolders:
    for path in jsonFolder.iterdir():
        if path.suffix == ".json":
            jsonPathList.append(path)
    
# Read all of the JSONs 
totalTweetCt = tweetCt = retweetCt = likeCt = 0
accountData = {}
for jsonPath in jsonPathList:
    with open(jsonPath, "r", encoding="utf-8") as jsonFile:
        userType = jsonPath.parents[0].stem
        jsonData = json.load(jsonFile)
        username = jsonData["username"]
        textDict = {"type": userType,
                    "tweets": jsonData["tweets"], 
                    "retweets": jsonData["retweets"], 
                    "likes": jsonData["likes"]}
        accountData[username] = textDict
        totalTweetCt += len(jsonData["tweets"]) + len(jsonData["retweets"]) + len(jsonData["likes"])
        tweetCt += len(jsonData["tweets"])
        retweetCt += len(jsonData["retweets"])
        likeCt += len(jsonData["likes"])
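
The loading cell above relies on each JSON file having top-level `username`, `tweets`, `retweets`, and `likes` keys, and on the parent folder name giving the account type. The sketch below writes one hypothetical file into a temporary "Blocked" folder and reads it back the same way; the account name and Tweet content are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# Build a throwaway "Blocked" folder containing one hypothetical account file
root = Path(tempfile.mkdtemp())
blockedDir = root / "Blocked"
blockedDir.mkdir()
sample = {"username": "example_user",
          "tweets": [{"full_text": "hello world"}],
          "retweets": [],
          "likes": []}
(blockedDir / "example_user.json").write_text(json.dumps(sample), encoding="utf-8")

# Read it back the same way the loading cell does
loaded = {}
for path in blockedDir.iterdir():
    if path.suffix == ".json":
        with open(path, "r", encoding="utf-8") as jsonFile:
            data = json.load(jsonFile)
        loaded[data["username"]] = {"type": path.parents[0].stem,
                                    "tweets": data["tweets"],
                                    "retweets": data["retweets"],
                                    "likes": data["likes"]}

print(loaded["example_user"]["type"])
```

The account type here comes out as the folder name, which is why the two folders need to be named "Blocked" and "Not-Blocked" for the later cells to find them.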

In [None]:
# Create a list of all of the blocked / friendly accounts
accountTypeDict = {}
for account, dataDict in accountData.items():
    if (dataDict["type"] not in accountTypeDict):
        accountTypeDict[dataDict["type"]] = []
    accountTypeDict[dataDict["type"]].append(account)

In [None]:
# This method will extract a single string version of a Tweet
def extractTweetString(tweet, isRetweet=False):
    # For a Retweet, the text is stored under "retweeted_status"
    if isRetweet:
        tweet = tweet.get("retweeted_status", tweet)
    # Prefer "full_text" and fall back to "text"; return "" if neither exists
    if "full_text" in tweet:
        return tweet["full_text"]
    if "text" in tweet:
        return tweet["text"]
    return ""

# This will save "extracted Tweet" versions of each Tweet in the accountTextData dict
accountTextData = {}
for account in accountData.keys():
    tweets = accountData[account]["tweets"]
    retweets = accountData[account]["retweets"]
    likes = accountData[account]["likes"]
    accountTextData[account] = {"tweets": [extractTweetString(x) for x in tweets],
                                "retweets": [extractTweetString(x, True) for x in retweets],
                                "likes": [extractTweetString(x) for x in likes]}

## Pre-Processing Text 

The cell below defines a method that pre-processes the text contained in a Tweet. The cell after it applies that method to each of the Tweets in the data.

In [None]:
# This method will run through various pre-processing steps on Tweets 
def preprocessTweet(tweet):
    
    # Add a space at the end of the tweet (helps w/ processing)
    tweet = tweet + " "
    
    # Make everything in the tweet lowercase
    tweet = tweet.lower()
    
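    # Remove any user account tags (i.e., @____) from the Tweet; handles
    # may contain letters, digits, and underscores
    tweet = re.sub(r'@[a-z0-9_]+\s', "", tweet)
    
    # Remove any links from the Tweet, stopping at the first whitespace
    # after each link so that the text following it is kept
    tweet = re.sub(r'https?://\S+\s', "", tweet)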
    
    # Do some "simple pre-processing" to tokenize / remove punctuation
    tweet = simple_preprocess(tweet)
    
    # Remove any of the English stopwords from the Tweet
    tweet = [x for x in tweet if x not in stopWords]
    
    # Return the preprocessed Tweet
    return tweet
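
To make the individual steps concrete, here is a minimal sketch that runs on one made-up Tweet using only the standard library. It is not the notebook's method: a small regex tokenizer stands in for Gensim's `simple_preprocess`, and `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, so results from the real pipeline may differ. The patterns used here accept digits and underscores in handles and end each link at the first whitespace.

```python
import re

# Stand-in stopword list; the notebook uses NLTK's full English list instead
SAMPLE_STOPWORDS = {"the", "a", "is", "at", "this", "out"}

def sketchPreprocess(tweet):
    # Trailing space lets the patterns below match a tag or link at the very end
    tweet = tweet.lower() + " "
    # Drop @mentions (letters, digits, underscores) and the whitespace after them
    tweet = re.sub(r'@[a-z0-9_]+\s', "", tweet)
    # Drop each link up to the first whitespace that follows it
    tweet = re.sub(r'https?://\S+\s', "", tweet)
    # Rough stand-in for simple_preprocess: keep alphabetic runs of 2+ letters
    tokens = re.findall(r'[a-z]{2,}', tweet)
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

result = sketchPreprocess("@some_user1 Check this out https://t.co/abc123 the launch is LIVE")
print(result)  # ['check', 'launch', 'live']
```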

In [None]:
# Create a dict w/ each account's Tweets processed
processedAccountTextData = {}
for account, textDict in accountTextData.items():
    newTextDict = {k:[] for k in textDict.keys()}
    for key, tweetList in textDict.items():
        for tweet in tweetList:
            newTextDict[key].append(preprocessTweet(tweet))
    processedAccountTextData[account] = newTextDict

## Data Exploration

There are a couple of things I'm interested in figuring out about the textual data before doing any NLP on it! Some of those things include: 

- What's the distribution of "Tweets" vs. "Retweets" for each of the accounts?
- What's the average Tweet length? 
- What does a word cloud look like for any particular user? 

### Tweets vs. Retweets 

The cell below creates a scatterplot where the x-axis is "Number of Tweets" and the y-axis is "Number of Retweets". Since I can't actually pull the Tweets and Retweets separately, I just pulled "the most recent 400 Tweets", so it might be interesting to see how they're distributed between the two.

In [None]:
# Count each user's tweet / retweet count, and store this in arrays
tweetCt = []
retweetCt = []
for account in accountData.keys():
    tweetCt.append(len(accountData[account]["tweets"]))
    retweetCt.append(len(accountData[account]["retweets"]))

# Create a scatterplot of "Tweet count" vs. "Retweet count"
plt.scatter(tweetCt, retweetCt)
plt.title("Tweet vs. Retweets (450 most recent Tweets)")
plt.xlabel("Tweets")
plt.ylabel("Retweets")
plt.show()

### Average Tweet Length

The cell below calculates the average Tweet length, measured in whitespace-separated words, over each account's original Tweets.

In [None]:
# This method will return the length of a given Tweet
def tweetLength(tweet): 
    splitTweet = tweet.split()
    return len(splitTweet)
    
# Iterate through each account and calculate an "average Tweet length"
for account in accountTextData.keys():
    tweetLengths = [tweetLength(tweet) for tweet in accountTextData[account]["tweets"]]
    # Skip accounts with no original Tweets to avoid dividing by zero
    if not tweetLengths:
        continue
    avgLength = sum(tweetLengths)/len(tweetLengths)
    print("%s: %.3f" % (account, avgLength))
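
Because length is counted in whitespace-separated tokens, links and mentions each count as one word. A small worked example on made-up Tweets:

```python
# Hypothetical Tweets; length is the number of whitespace-separated tokens
sampleTweets = ["good morning everyone", "new post: https://t.co/xyz", "wow"]

lengths = [len(tweet.split()) for tweet in sampleTweets]
# Fall back to 0.0 when there are no Tweets, to avoid dividing by zero
avgLength = sum(lengths) / len(lengths) if lengths else 0.0

print(lengths)             # [3, 3, 1]
print("%.3f" % avgLength)  # 2.333
```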

### Word Clouds

The cells below create word clouds from the Tweets, Retweets, and Likes of a group of accounts, using the words that are frequent in one group but not in the other.

In [None]:
# This method will compile the tweets for a given list of users into a 
# single text string; this will be used as input for the word cloud
def collectText(accountList):
    tokens = []
    for account in accountList:
        textDict = processedAccountTextData[account]
        for tweetType, tweetList in textDict.items():
            for tweet in tweetList:
                # extend() appends in place instead of copying the list each time
                tokens.extend(tweet)
    return " ".join(tokens)


# When given two strings from the "collectText" method, this method will
# attempt to generate a list of the words that appear more frequently
# in either string. 
def compareCollectedText(str1, str2, compSize=100):
    
    # First, create corpora for both strings
    corpusList = []
    id2wordList = []
    for curStr in [str1, str2]:
        id2word = corpora.Dictionary([curStr.split()])
        corpus = id2word.doc2bow(curStr.lower().split())
        corpus = sorted(corpus, key = lambda x: x[1], reverse=True)
        corpusList.append(corpus)
        id2wordList.append(id2word)
        
    # Next, iterate through the top compSize words in each to see which are unique
    corpus1_top = [x[0] for x in corpusList[0][:compSize]]
    corpus2_top = [x[0] for x in corpusList[1][:compSize]]
    corpus1_unique = []
    corpus2_unique = []
    
    corpus1_wordList = set([id2wordList[0][x] for x in corpus1_top])
    corpus2_wordList = set([id2wordList[1][x] for x in corpus2_top])
    
    print(corpus1_wordList)
    print(corpus2_wordList)
    
    # Dealing w/ corpus1
    for wordID in corpus1_top:
        corpus1Word = id2wordList[0][wordID]
        if (corpus1Word not in corpus2_wordList): 
            corpus1_unique.append(wordID)
    print(corpus1_unique)
    print("\n\n\n\n\n\n")
    
    # Dealing with corpus2
    for wordID in corpus2_top:
        corpus2Word = id2wordList[1][wordID]
        if (corpus2Word not in corpus1_wordList): 
            corpus2_unique.append(wordID)
    print(corpus2_unique)
            
    # Print the lengths of the unique word lists
    print("corpus 1 has %d unique words" % len(corpus1_unique))
    print("corpus 2 has %d unique words" % len(corpus2_unique))
    
    # Creating dictionaries from the corpora
    corpusDictList = []
    for corpus in corpusList:
        corpusDict = {}
        for wordID, freq in corpus:
            corpusDict[wordID] = freq
        corpusDictList.append(corpusDict)
    
    # Create strings in which each unique word is repeated once per occurrence
    newStr = []
    for corpusIdx, corpus in enumerate([corpus1_unique, corpus2_unique]):
        words = []
        uniqueWords = []
        for wordID in corpus:
            word = (id2wordList[corpusIdx][wordID])
            uniqueWords.append(word)
            wordFreq = (corpusDictList[corpusIdx][wordID])
            repeatedWord = (word + " ") * wordFreq
            words = words + repeatedWord.split()
        random.shuffle(words)
        corpusStr = " ".join(words)
        newStr.append(corpusStr)
    
    # Return the new strings that are representative of the comparison 
    # between these two corpora
    return newStr
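
The core idea of the comparison is: take the most frequent words in each text, then keep the ones that don't show up in the other text's top list. A compact sketch of just that step, using `collections.Counter` rather than Gensim's `Dictionary`, might look like the following. The function name `uniqueTopWords` is hypothetical, and the sketch returns word counts instead of the repeated, shuffled strings that the method above builds for the word cloud.

```python
from collections import Counter

def uniqueTopWords(str1, str2, compSize=100):
    # Count word frequencies in each text
    counts1 = Counter(str1.split())
    counts2 = Counter(str2.split())
    # Take the compSize most frequent words of each text
    top1 = [word for word, _ in counts1.most_common(compSize)]
    top2 = [word for word, _ in counts2.most_common(compSize)]
    top1Set, top2Set = set(top1), set(top2)
    # Keep words that are in one top list but not the other, with their counts
    only1 = {w: counts1[w] for w in top1 if w not in top2Set}
    only2 = {w: counts2[w] for w in top2 if w not in top1Set}
    return only1, only2

first, second = uniqueTopWords("cat cat cat dog dog bird",
                               "dog dog dog fish fish bird", compSize=2)
print(first, second)  # {'cat': 3} {'fish': 2}
```

Note that "bird" appears in both texts but is outside the top 2 of either, so it plays no role in the result; the same is true of any word below the `compSize` cutoff in the method above.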

In [None]:
friendStr = collectText(accountTypeDict["Not-Blocked"])
blockStr = collectText(accountTypeDict["Blocked"])

In [None]:
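# NOTE: this cell swaps the two strings. If it is run once, friendStr holds
# the Blocked text and blockStr holds the Not-Blocked text from here on.
temp = friendStr
friendStr = blockStr
blockStr = temp
print(len(blockStr))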


In [None]:
print(len(friendStr))
print(len(blockStr))

In [None]:
newStrs = compareCollectedText(friendStr, blockStr, 500)

In [None]:
# Import libraries needed for visualizing the word cloud
# (matplotlib.pyplot is already imported in the setup cell as plt)
from wordcloud import WordCloud
from PIL import Image

# Create and show the wordcloud
friendUnique, blockUnique = newStrs
scale = 8
cloud = WordCloud(width=400*scale, height=200*scale).generate(blockUnique)
cloud.to_file("../Visualizations/Blocked Word Cloud UNIQUE.png")
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### LDA Topic Modelling

This is technically entering the "NLP" phase, but it's still a part of data exploration. I want to try to understand the different topics represented in this data!

In [None]:
accountTextList = []
for account in processedAccountTextData.keys():
    accountTextList.append(collectText([account]).split())

In [None]:
# This code will iterate through accountTextList and create some 
# necessary data structures for the LDA
bigram = phrases.Phraser(phrases.Phrases(accountTextList, min_count=3, threshold=100))
trigram = phrases.Phraser(phrases.Phrases(bigram[accountTextList], threshold=100))
processedAccountTextList = [trigram[x] for x in accountTextList]
id2word = corpora.Dictionary(processedAccountTextList)
corpus = [id2word.doc2bow(account) for account in processedAccountTextList]
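
The `corpus` built above is in bag-of-words form: each account becomes a list of (token id, count) pairs. The sketch below mimics that output shape on toy documents with the standard library only. The token ids it assigns are illustrative and need not match the ids Gensim would assign.

```python
from collections import Counter

# Toy documents; the underscore marks a phrase already merged by a phrase model
docs = [["climate_change", "policy", "policy"], ["policy", "vote"]]

# Assign an integer id to each distinct token (a stand-in for corpora.Dictionary)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Convert each document to sorted (token id, count) pairs, like doc2bow output
bowCorpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]
print(bowCorpus)  # [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
```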

In [None]:
# This will create the LDA model using the data structures above
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                        num_topics=6, random_state=255,
                                        update_every=1, chunksize=100,
                                        passes=20, alpha="auto")

In [None]:
# This cell will visualize the topics in the topic model
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(model, corpus, id2word)
vis

In [None]:
# This method will look through *numWords* most associated w/ a topic, and
# return a list of all of the most popular words
def popularWords(model, numWords):
    topicKeywords = model.show_topics(num_words=numWords, formatted=False)
    topics = []
    for topicNum, wordList in topicKeywords:
        # Named "words" to avoid shadowing the imported gensim "phrases" module
        words = []
        for word, prob in wordList:
            words.append((word, prob))
        topics.append(words)
    return topics

# This method will look through *numWords* most associated w/ a topic, 
# and return a list of all of the multi-word phrases from that topic
def popularPhrases(model, numWords):
    topicKeywords = model.show_topics(num_words=numWords, formatted=False)
    topics = []
    for topicNum, wordList in topicKeywords:
        phrases = []
        for word, prob in wordList:
            if ("_" in word): phrases.append((word, prob))
        topics.append(phrases)
    return topics
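
With `formatted=False`, `show_topics` yields one `(topic id, [(word, probability), ...])` pair per topic, and the phrases merged by the phrase models are joined with an underscore. The sketch below applies the same underscore filter to a hand-written list in that shape; the words and probabilities are made up.

```python
# Made-up data in the shape returned by show_topics(formatted=False)
fakeTopics = [
    (0, [("election", 0.05), ("white_house", 0.03), ("vote", 0.02)]),
    (1, [("new_york_city", 0.04), ("game", 0.03)]),
]

# Keep only multi-word phrases, which the phrase models join with "_"
phrasesPerTopic = [[(word, prob) for word, prob in wordList if "_" in word]
                   for _, wordList in fakeTopics]
print(phrasesPerTopic)  # [[('white_house', 0.03)], [('new_york_city', 0.04)]]
```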

In [None]:
for wordIdx, wordGroup in enumerate(popularWords(model, 40)):
    print("\n\nGROUP %d:\n%s\n\n" % (wordIdx, wordGroup))

In [None]:
for wordIdx, wordGroup in enumerate(popularPhrases(model, 400)):
    print("\n\nGROUP %d:\n%s\n\n" % (wordIdx, wordGroup))

## Word2Vec Embeddings

We're going to use Gensim's word2vec models to learn word embeddings for all of the words in the accounts! 

In [None]:
from gensim.models import Word2Vec
# Note: this rebinds `model`, which previously referred to the LDA model
model = Word2Vec(processedAccountTextList, min_count=1, vector_size=128, sg=1)

### Visualizing Embeddings

Here, we want to visualize the embeddings on an account level. We're going to calculate embeddings for each account by averaging together all of the word embeddings used. 

In [None]:
processedAccountTextDict = {}
for idx, account in enumerate(processedAccountTextData.keys()):
    processedAccountTextDict[account] = processedAccountTextList[idx]

In [None]:
# This method will calculate the "average word embedding" for an account
def getAccountWordVec(model, account):
    tokenList = (processedAccountTextDict[account])
    wvList = np.array([model.wv[token] for token in tokenList])
    return np.mean(wvList, axis=0)
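
The account-level embedding is an element-wise mean over one vector per token occurrence, so a word used many times pulls the average toward its own vector. A minimal sketch with hypothetical 3-dimensional vectors in place of the trained model:

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for model.wv
toyVectors = {"vote": np.array([1.0, 0.0, 2.0]),
              "policy": np.array([3.0, 2.0, 0.0])}

def averageVector(tokens, vectors):
    # Stack one vector per token occurrence, then average over the token axis
    return np.mean(np.array([vectors[t] for t in tokens]), axis=0)

avgVec = averageVector(["vote", "policy", "policy"], toyVectors)
print(avgVec)
```

Here "policy" appears twice, so the result is (vote + 2 × policy) / 3 rather than the midpoint of the two vectors.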

In [None]:
# Here, we'll calculate average word embeddings for each account
avgEmbeddingDict = {}
for acctType in accountTypeDict.keys():
    avgEmbeddingDict[acctType] = []
    for account in accountTypeDict[acctType]:
        avgEmbeddingDict[acctType].append((account, getAccountWordVec(model, account)))

In [None]:
# Use the first two components of each account's averaged embedding as the
# x / y coordinates for the plot
blockedAcct_x = [x[1][0] for x in avgEmbeddingDict["Blocked"]]
blockedAcct_y = [x[1][1] for x in avgEmbeddingDict["Blocked"]]
friendAcct_x = [x[1][0] for x in avgEmbeddingDict["Not-Blocked"]]
friendAcct_y = [x[1][1] for x in avgEmbeddingDict["Not-Blocked"]]

In [None]:
# Plot these
fig = plt.figure()
ax = fig.add_axes([0, 0, 2, 2])
blocked = ax.scatter(blockedAcct_x, blockedAcct_y, color="red", alpha=.3)
friends = ax.scatter(friendAcct_x, friendAcct_y, color="green", alpha=.3)
plt.title("2-Dimensional Word Embedding of Twitter Network")
plt.legend((blocked, friends), ("Blocked", "Not-Blocked"))
plt.savefig("../Visualizations/WordEmb2.png")
plt.show()
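
The plot above uses the first two of the 128 embedding components as coordinates. One common alternative, which this notebook does not use, is to project the vectors onto their first two principal components so that the two plotted axes capture as much variance as possible. The sketch below does this with NumPy's SVD on random stand-in data; `projectTo2D` is a hypothetical helper name.

```python
import numpy as np

def projectTo2D(vectors):
    # Center the data so the principal axes pass through the mean
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    # Project every row onto the first two principal directions
    return centered @ Vt[:2].T

# Random stand-in for ten 128-dimensional account embeddings
rng = np.random.default_rng(0)
coords = projectTo2D(rng.normal(size=(10, 128)))
print(coords.shape)  # (10, 2)
```

The two columns of `coords` could then be passed to `ax.scatter` in place of the first two raw components.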