**by: Zach Gozlan**  
McCourt School of Public Policy  
MS-DSPP, May 2020 Grad  
September 8, 2019  

These are a series of tools I created for analyzing Facebook Messenger conversations. I wrote them over the summer of 2019 before attending a pair of weddings; in each case I had a substantial Messenger history with one member of the couple. At one wedding, I used findings from this process to inform a speech I gave; at the other, I presented the findings as a 15-page "card."

To download your own data for this analysis, the path is currently Settings -> Your Facebook Information -> Download Your Information; select "JSON" as the output format. As these files can be massive (I have been on Facebook since 2007), I strongly suggest selecting only "messages" on the selection screen and narrowing the date range if you know the earliest date the conversation could have begun.

#### Setup: Packages, Definitions

In [None]:
#package imports

import pandas as pd #dataframe package
import os           #for navigating the file system
import json         #for navigating json
#import ijson        #same
from pandas import json_normalize #same; this is the import path since pandas 1.0
import numpy as np  #habit
import networkx     #network package
import matplotlib.pyplot as plt
import itertools
import requests     #for twitter api adding
import bs4
from bs4 import BeautifulSoup
#import tweepy #twitter api stuff
import random
import ast
from collections import Counter

from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from scipy.sparse import coo_matrix, csr_matrix
from scipy.sparse import hstack as sparse_hstack


#I'm pretty sure I use all of these, and import statements are basically costless so whatever dude

In [None]:
#stopword list, probably based on https://gist.github.com/sebleier/554280 but added a few words that come up a lot

stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", 'nickname']

#### Import JSON

In [None]:
#put your file here; as far as I can tell it's always message_1 within the nested folder referencing the conversation

data = pd.read_json('personal projects/message_1.json', typ='series', encoding='latin-1')

In [None]:
#echo back to see the data frame makes sense

df = pd.DataFrame.from_dict(data.messages, orient='columns')
df
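The cell above reads a single `message_1.json`. I have only needed that one file, but in case a longer conversation is split across several `message_N.json` files (an assumption worth checking in your own export), here is a minimal sketch that loads every such file in a folder with the standard `json` module. The helper name `load_conversation` and the made-up demo files are mine, not part of the Facebook export.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_conversation(folder):
    """Load every message_*.json file in `folder` into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob('message_*.json')):
        with open(path, encoding='utf-8') as handle:
            payload = json.load(handle)
        # each file is assumed to keep its messages under the 'messages' key
        frames.append(pd.DataFrame(payload.get('messages', [])))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# tiny demonstration with two made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for n, sender in [(1, 'A'), (2, 'B')]:
        sample = {'messages': [{'sender_name': sender,
                                'timestamp_ms': 1561939200000 + n,
                                'content': 'hi'}]}
        (Path(tmp) / f'message_{n}.json').write_text(json.dumps(sample), encoding='utf-8')
    demo = load_conversation(tmp)

print(demo)
```

Reading with `json.load` keeps the text exactly as Facebook wrote it, so any emoji cleanup later in the notebook still applies.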

In [None]:
#converting timestamps to useable format

df['timestamp'] = pd.to_datetime(df['timestamp_ms'], unit='ms')
df.sort_values('timestamp')

#### Message Counts

In [None]:
df['sender_name'].value_counts()
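The total above can hide how the conversation changed over time. As a rough sketch, the same count can be split by month using the `timestamp` column built earlier. The toy dataframe here stands in for `df`, and the sender names and timestamps are invented for illustration.

```python
import pandas as pd

# toy stand-in for df with the same two columns the notebook uses
toy = pd.DataFrame({
    'sender_name': ['A', 'B', 'A', 'A'],
    'timestamp_ms': [1561939200000, 1561942800000, 1564617600000, 1564621200000],
})
toy['timestamp'] = pd.to_datetime(toy['timestamp_ms'], unit='ms')

# label every message with its year and month, then count per sender
month = toy['timestamp'].dt.strftime('%Y-%m').rename('month')
monthly = toy.groupby([month, 'sender_name']).size().unstack(fill_value=0)

print(monthly)
```

Swapping `'%Y-%m'` for `'%Y-%m-%d'` would give daily counts instead.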

**Word counts across all messages**

In [None]:
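counts = Counter()

df2 = df[df['type'] == 'Generic'] #removes certain automated messages not sent by user

for sentence in df2['content']:
    if isinstance(sentence, str): #skips missing content, which pandas stores as NaN (a float)
        #strip leading/trailing punctuation from each word and lowercase it before counting
        counts.update(word.strip('.,?!"/:@;\\)$-(][').lower() for word in sentence.split())

for item in stopwords:
    del counts[item] #Counter ignores keys that are not present, so this never raises KeyError

#despite the name this is a dataframe: one row per word, column 0 holds the count
word_count_dict = pd.DataFrame.from_dict(counts, orient='index')
word_count_dict = word_count_dict.sort_values(0, ascending=False)
word_count_dict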

**Timestamps of specific phrases; case sensitive**

In [None]:
df3 = df.dropna(subset=['content'])
df3.loc[df3.content.str.contains('good morning'), 'timestamp'] #replace 'good morning' with your word or phrase of choice
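The lookup above is case sensitive and treats the phrase as a regular expression, which is the pandas default for `str.contains`. If you would rather match "Good morning" and "good morning" alike, and search for the phrase literally, here is a small sketch using the `case`, `regex` and `na` parameters. The toy dataframe stands in for `df`, and its contents are invented.

```python
import pandas as pd

# toy stand-in for df; one message has no content, as happens with photos or stickers
toy = pd.DataFrame({
    'content': ['Good morning!', None, 'good morning (again)', 'night'],
    'timestamp': pd.to_datetime(['2019-07-01 08:00', '2019-07-01 08:05',
                                 '2019-07-02 08:10', '2019-07-02 23:00']),
})

# case=False ignores capitalization, regex=False matches the phrase literally,
# na=False treats missing content as "no match" so dropna is not needed first
mask = toy['content'].str.contains('good morning', case=False, regex=False, na=False)
hits = toy.loc[mask, 'timestamp']

print(hits)
```

With `regex=False`, characters such as `(` or `?` in your phrase are matched as written rather than interpreted as pattern syntax.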

**All YouTube links shared**

With some processing and when combined with https://www.labnol.org/internet/youtube-playlist-spreadsheet/29183/, this can be used to quickly work through all videos shared without having to click through all of them.

In [None]:
messages = df['share']
youtube = []
for item in messages:
    if 'youtube' in str(item):
        youtube.append(item)
        
DFyoutube = pd.DataFrame({'col':youtube})

In [None]:
DFyoutube
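For the spreadsheet step it helps to have bare video IDs rather than whole share entries. The sketch below assumes each share entry is a dict with the URL under a `'link'` key, which is how my export looks but is worth verifying in yours. The helper name `youtube_id` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def youtube_id(share):
    """Return the video ID from a share entry, or None if there is none."""
    if not isinstance(share, dict):
        return None  # messages without a share show up as NaN
    parsed = urlparse(share.get('link') or '')
    host = parsed.netloc.lower()
    if host.endswith('youtu.be'):
        # short links keep the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip('/') or None
    if 'youtube.com' in host:
        # regular links keep the ID in the v= query parameter
        ids = parse_qs(parsed.query).get('v')
        return ids[0] if ids else None
    return None


examples = [
    {'link': 'https://www.youtube.com/watch?v=abc123&t=5'},
    {'link': 'https://youtu.be/xyz789'},
    {'link': 'https://example.com/article'},
    float('nan'),
]
print([youtube_id(item) for item in examples])
```

On the real data this could be applied as `df['share'].apply(youtube_id).dropna()`.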

**Message Reaction Information**

You know, like, thumbs up, thumbs down, heart, that stuff. It was added to Messenger relatively recently so older conversations may not include this information.

In [None]:
reactions = pd.DataFrame.from_dict(df['reactions'], orient='columns')
reactions['content'] = df['content']

x = pd.DataFrame(reactions['reactions'].apply(pd.Series))
list(x.columns) #the number of columns on x is equal to the maximum number of reactions received by any message in the chat

In [None]:
#There must be one 'reactions['r1']' column for every column in the output list(x.columns). By default I am 
#treating this like a 1-on-1 conversation, but I have included example code for up to seven.

reactions['r1'] = x[0]
#reactions['r1'], reactions['r2'], reactions['r3'], reactions['r4'], reactions['r5'], reactions['r6'], reactions['r7'] = x[0], x[1], x[2], x[3], x[4], x[5], x[6]

reactions

In [None]:
reactions['r1'].value_counts()

In [None]:
#parses reactions into separate columns for the sender and what they sent
#still slow on long chats because apply(pd.Series) works row by row; i'd be totally down for fixes on this

#y = reactions['r1'][6] #spot check of a single reaction dict; the row number is specific to my chat
#y['reaction']

columns = ['r1'] #, 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']

for item in columns:
    print(item)
    expanded = reactions[item].apply(pd.Series) #expand each reaction dict once rather than once per field
    reactions[item + '_reactions'] = expanded['reaction']
    reactions[item + '_actor'] = expanded['actor']
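One possible alternative to the loop above, for anyone who wants to avoid the wide `r1` to `r7` layout entirely: `DataFrame.explode` gives one row per reaction, however many reactions a message received. This is a sketch on invented data, not a drop-in replacement, since the rest of the notebook expects the `r1_reactions` and `r1_actor` columns.

```python
import pandas as pd

# toy stand-in for df: each message has a list of reaction dicts, or NaN if none
toy = pd.DataFrame({
    'content': ['hi', 'lol', 'ok'],
    'reactions': [
        [{'reaction': 'heart', 'actor': 'A'}],
        [{'reaction': 'laugh', 'actor': 'A'}, {'reaction': 'laugh', 'actor': 'B'}],
        float('nan'),
    ],
})

# one row per (message, reaction); reset_index keeps the original row number in 'index'
long = toy.explode('reactions').dropna(subset=['reactions']).reset_index()
long = long.assign(
    reaction=long['reactions'].map(lambda r: r['reaction']),
    actor=long['reactions'].map(lambda r: r['actor']),
)

print(long[['index', 'content', 'reaction', 'actor']])
print(long.groupby(['actor', 'reaction']).size())
```

The long layout works the same way for a 1-on-1 chat and a group chat, so there is no need to know the maximum number of reactions in advance.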

In [None]:
#this takes a while and I recommend saving the output when it's complete
#reactions = pd.read_csv('personal projects/reactions.csv')

In [None]:
#replace encoded emoji with intuitive names and join the output reaction columns back to the main dataframe

reactions2 = reactions[['r1_reactions', 'r1_actor']]

reactions2 = reactions2.replace('ð\x9f\x98\x8d', 'heart')
reactions2 = reactions2.replace('ð\x9f\x91\x8d', 'thumbsup')
reactions2 = reactions2.replace('ð\x9f\x98\x86', 'laugh')
reactions2 = reactions2.replace('ð\x9f\x98®', 'wow')
reactions2 = reactions2.replace('ð\x9f\x98\xa0', 'angry')
reactions2 = reactions2.replace('ð\x9f\x98¢', 'sad')
reactions2 = reactions2.replace('ð\x9f\x91\x8e', 'thumbsdown')
reactions2 = reactions2.replace('â¤', 'other')

#EXAMPLE CODE FOR GROUP CHATS:
#reactions2 = reactions[['r1_reactions', 'r1_actor', 'r2_reactions', 'r2_actor', 'r3_reactions',
#       'r3_actor', 'r4_reactions', 'r4_actor', 'r5_reactions', 'r5_actor',
#       'r6_reactions', 'r6_actor', 'r7_reactions', 'r7_actor']]
df = df.join(reactions2)

In [None]:
df
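The odd-looking strings in the replacement cell are emoji whose UTF-8 bytes were read as Latin-1 characters. For example, the four characters `'ð\x9f\x91\x8d'` correspond to the bytes F0 9F 91 8D, which is the thumbs-up emoji. As a sketch, the encoding can be reversed instead of mapped by hand. The helper name `fix_mojibake` is mine, and the fallback branch leaves any text that does not round-trip untouched.

```python
def fix_mojibake(text):
    """Re-interpret Latin-1 mojibake as UTF-8; return the input unchanged on failure."""
    if not isinstance(text, str):
        return text  # NaN and other non-text values pass through
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or not valid UTF-8 underneath


# the same escaped string the notebook maps to 'thumbsup'
garbled = '\xf0\x9f\x91\x8d'
print(fix_mojibake(garbled))
```

The same function can be applied to message text with `df['content'].apply(fix_mojibake)`, which should also repair accented names and other non-ASCII characters.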

**TF-IDF Unique Terms by Sender**

In [None]:
df_spare = df[['sender_name', 'content']]
df_spare = pd.concat([df_spare, pd.get_dummies(df_spare['sender_name'])], axis=1)

In [None]:
#target_array_full = np.asarray(df_spare['Zach GozÅan'])
target_array_full = np.asarray(df_spare['Beth Shobudubudub'])
corpus = [str(item).lower() for item in df_spare['content']]

tfid_max = int(len(df_spare['content'])*.9)
tfid_vectorizer_all = TfidfVectorizer(ngram_range=(1,1), min_df=10, max_df=tfid_max, stop_words=stopwords)
full_corpus_vectorizer = tfid_vectorizer_all.fit(corpus)
feature_matrix = tfid_vectorizer_all.transform(corpus)

print(feature_matrix.shape)
print(target_array_full.shape)

In [None]:
#doing a straightforward linear regression here but feel free to try something more complex

lreg = LinearRegression()
lreg.fit(feature_matrix, target_array_full)
coefficients = list(lreg.coef_)

feature_names = tfid_vectorizer_all.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2
df_coef = pd.DataFrame(list(zip(feature_names, coefficients)), 
               columns =['Feature', 'Coefficient']) #separate name so the message dataframe df is not overwritten
df_nonzero = df_coef[df_coef['Coefficient'] != 0].sort_values('Coefficient', ascending=False) #only shows non-zero coefficients
df_nonzero