# Facebook Messenger Data Wrangling

Python code that creates nice little .csv’s from a message history in your [Facebook Data Dump](https://www.facebook.com/settings).

### Note: This was written in Jan 2018. As of right now (Aug 2018), this will no longer work. I’m working on an update to this. 


In [13]:
from bs4 import BeautifulSoup
import pandas as pd

import csv
from datetime import datetime
from collections import defaultdict

## 1. Getting the actual conversation into a nicer file

Most data dumps come in .html, to provide a nice web interface for viewing your data. Try opening one of your conversation .html files and you’ll find that you can read the entire thing in your browser, starting from the beginning. This is pretty cool but not THE BEST for DOING THE DATA.

Some things about the data dump:
* Gifs/emojis are not captured and are represented as empty messages (Images are too, kinda).
* Data can be missing. I’ve had entire months excluded from dumps. Hello??? Mark?? why
* __The format changes every now and then (I’ve been doing this periodically for the past few years). This code was adjusted to process a dump from Jan 2018, but I often have to make a few changes. It isn’t too hard though, you can do it!!!__ 


In [14]:
fname = '119.html' # fname is the file name of the .html file that contains the conversation (group chat OR individual)
                   # that you want to analyze. These can be found in the “Messages” folder in the data dump
with open(fname) as f: # the with-block closes the file once parsing is done
    soup = BeautifulSoup(f,'lxml') # This can take a few minutes, depending on the size of the file.

In [15]:
# A message has 3 parts: the person who says it, the metadata (the time it was sent), and the message itself.
# We’ll extract these 3 components into their own lists (the length of each list is the
# number of messages in the conversation). This part might take a while as well, sorry.
# Heads up: .encode('utf-8') returns byte strings. On Python 3 you probably want to drop it so
# everything stays a plain str.

people = [p.text.encode('utf-8') for p in soup.find_all("span","user")]
times = [t.text.encode('utf-8') for t in soup.find_all("span","meta")]
messages = [m.text.encode('utf-8') for m in soup.find_all("p")]
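
To see what those three `find_all` calls pull out, here is a minimal sketch on a made-up snippet. The HTML layout, names and messages below are assumptions for illustration (the real dump's nesting may differ and changes between dumps), and it uses Python's built-in `html.parser` instead of `lxml` so the toy example has one less dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical two-message snippet; the real dump's nesting may differ.
sample_html = """
<div class="thread">
  <div class="message"><div class="message_header">
    <span class="user">Alice</span>
    <span class="meta">Monday, January 15, 2018 at 9:05pm EST</span>
  </div></div>
  <p>hello</p>
  <div class="message"><div class="message_header">
    <span class="user">Bob</span>
    <span class="meta">Monday, January 15, 2018 at 9:06pm EST</span>
  </div></div>
  <p>hi!!</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A string passed as the second argument to find_all is matched against the CSS class.
sample_people = [p.text for p in sample_soup.find_all("span", "user")]
sample_times = [t.text for t in sample_soup.find_all("span", "meta")]
sample_messages = [m.text for m in sample_soup.find_all("p")]
```

If the three lists come out with different lengths on your own file, that is usually the first sign that the dump format has changed.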

In [19]:
# Sometimes the data dump bugs out and has twice the <p> tags that it should (idk lol)
# This will take care of that 

if len(messages) == len(times) * 2:
    messages = messages[1::2]
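
A tiny sketch of what that slice does, on invented data. It assumes (as the `[1::2]` slice implies) that the extra `<p>` tag comes first in each pair; `drop_doubled` is a hypothetical helper name, not something from the dump or a library.

```python
def drop_doubled(messages, times):
    """Keep every second entry when there are exactly twice as many messages as timestamps."""
    if len(messages) == len(times) * 2:
        # [1::2] keeps indexes 1, 3, 5, ... and throws away 0, 2, 4, ...
        return messages[1::2]
    return messages

doubled = ['', 'hello', '', 'hi!!']        # made-up: an empty <p> before each real one
fixed = drop_doubled(doubled, ['t1', 't2'])
untouched = drop_doubled(['hello', 'hi!!'], ['t1', 't2'])
```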

In [20]:
# Create the actual “conversation” which we’ll represent as a list of “message objects” (dicts)

conversation = []
for i in range(len(people)): # theoretically you could replace “people” with “messages” or “times”
    person = people[i]
    time = times[i]
    message = messages[i]
    thing = {
        'Person':person,
        'Time':time,
        'Message':message
    }
    conversation.append(thing)
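
The loop above quietly assumes the three lists line up. A minimal alternative sketch, assuming you would rather fail loudly than get an `IndexError` halfway through: check the lengths first, then `zip` the lists together. `build_conversation` is a hypothetical helper name.

```python
def build_conversation(people, times, messages):
    """Pair up the three lists into a list of message dicts, refusing mismatched input."""
    if not (len(people) == len(times) == len(messages)):
        raise ValueError(
            'length mismatch: %d people, %d times, %d messages'
            % (len(people), len(times), len(messages))
        )
    return [
        {'Person': person, 'Time': time, 'Message': message}
        for person, time, message in zip(people, times, messages)
    ]

toy_conversation = build_conversation(
    ['Alice', 'Bob'],
    ['Monday, January 15, 2018 at 9:05pm EST', 'Monday, January 15, 2018 at 9:06pm EST'],
    ['hello', 'hi!!'],
)
```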

In [21]:
# you know shit’s going down when you convert to dataframe

df = pd.DataFrame(data=conversation)

In [23]:
# FB stores the time-data as one giant string. We’ll separate that into 'Date' and 'Time' columns,
# “Time” being the hour and the minute (24-hour clock), and “Date” being the year/month/day

df['Time'] = pd.to_datetime(df['Time'], format='%A, %B %d, %Y at %I:%M%p %Z') 
df['Date'] = df['Time'].apply(lambda x: x.strftime('%Y/%m/%d'))
df['Time'] = df['Time'].apply(lambda x: x.strftime('%H:%M'))
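
Here is the same split on a single made-up timestamp, using only the standard library. One assumption to flag: this sketch simply chops off the trailing timezone abbreviation instead of parsing it with `%Z`, and `split_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

def split_timestamp(raw):
    """Turn FB's long timestamp string into ('YYYY/MM/DD', 'HH:MM')."""
    # Assumption: the last space-separated token is a timezone abbreviation we can ignore.
    text, _, _tz = raw.rpartition(' ')
    parsed = datetime.strptime(text, '%A, %B %d, %Y at %I:%M%p')
    return parsed.strftime('%Y/%m/%d'), parsed.strftime('%H:%M')

date_part, time_part = split_timestamp('Monday, January 15, 2018 at 9:05pm EST')
```

Note that `%I:%M%p` reads the 12-hour clock from the dump while `%H:%M` writes a 24-hour one, so 9:05pm becomes 21:05.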

In [31]:
# Just in case. We’ll sort the messages chronologically.
# Why not sort by the 'Time' column? Because you likely send multiple messages in the same minute, and that’s
# the lowest level of granularity provided by FB. The index still remembers the order the messages had in the file.

## `ascending=False` reverses that file order, which puts the oldest message at the top. Use `True` for the opposite.

df = df.sort_index(ascending=False)

In [32]:
# We’ll export this as “conversation.csv,” which is basically your entire conversation history but in a nicer
# format.

# Just hold onto this for your own reference or for creating other files


df.to_csv('conversation.csv',index=False)

## 2. Make aggregated data files from this conversation

While the message content is probably the most interesting part, I’m hesitant to really use it because your messages PROBABLY contain some more sensitive data which you don’t want uploaded to THE CLOUD where everyone can see it. We’ll use the `conversation.csv` that we created earlier to make some files for *meta* analysis.

In [33]:
# remember this?
df = pd.read_csv('conversation.csv')

We’ll start by exploring who sent how many messages, at the daily level. This is kinda boring for individual chats but it’s *enlightening* for group chats. 

In [34]:
users = df['Person'].unique()
counts = defaultdict(lambda:{user:0 for user in users}) # obviously defaultdict isn’t necessary but it’s cool

In [35]:
# We’ll get the daterange of the conversation and “manually” get each user’s message counts for that day
# Yes this can probably be achieved with a groupby lol

start = df['Date'].min()
end = df['Date'].max()
for date in pd.date_range(start=start,end=end):
    date_df = df[df['Date'] == date.strftime('%Y/%m/%d')]
    for user in users:
        counts[date][user] = len(date_df[date_df['Person'] == user])
    counts[date]['Total'] = len(date_df) # only needs to be set once per day, not once per user
counts = pd.DataFrame.from_dict(counts)

In [36]:
# transpose so that each row is a date and each column is a user (plus 'Total')

counts = counts.T

In [37]:
counts['Date'] = counts.index

In [38]:
# and now we have a handy csv of every chat member’s daily message counts

counts.to_csv('datecounts.csv',index=False)
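
As the comment in the loop admits, this can probably be done with a groupby. A minimal sketch of that route on invented data, not a drop-in replacement: count rows per (date, person), pivot the people into columns, then reindex over the full date range so days with no messages show up as zeros. The toy names and dates are assumptions.

```python
import pandas as pd

# Made-up mini conversation with a gap on 2018/01/02.
toy = pd.DataFrame({
    'Person': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Date': ['2018/01/01', '2018/01/01', '2018/01/03', '2018/01/03'],
})

# One row per date, one column per person.
daily = toy.groupby(['Date', 'Person']).size().unstack(fill_value=0)

# Fill in the days nobody said anything.
daily.index = pd.to_datetime(daily.index, format='%Y/%m/%d')
full_range = pd.date_range(daily.index.min(), daily.index.max())
daily = daily.reindex(full_range, fill_value=0)

daily['Total'] = daily.sum(axis=1)
```

The explicit loop is easier to follow, but this version avoids filtering the whole dataframe once per day, which matters for long conversations.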

## 3. Making even MORE files from this

Because we have time/date columns, we can actually make even more granular message-count files from `conversation.csv`. 

In [39]:
df = pd.read_csv('conversation.csv')

We’ll start by getting message counts at the day-of-week level, as well as the hour-of-day level.

In [40]:
## add new columns (day of week + hour of day) to the conversation df

def date_to_day(date):
    ''' gets the day of week from a date (Monday is 0, Sunday is 6). isn’t that cool '''
    date = datetime.strptime(date,'%Y/%m/%d')
    day = date.weekday()
    return day

df['Day'] = df.apply(lambda row: date_to_day(row['Date']), axis=1)
df['Hour'] = df.apply(lambda row: row['Time'][0:2], axis=1) # it feels a little dirty doing it this way but w/e lol
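
A quick standard-library check of what ends up in those two columns for one made-up row. `day_and_hour` is a hypothetical helper, and the sample values are invented.

```python
from datetime import datetime

def day_and_hour(date_str, time_str):
    """Return (weekday number, two-character hour string) for one message."""
    day = datetime.strptime(date_str, '%Y/%m/%d').weekday()  # Monday is 0, Sunday is 6
    hour = time_str[0:2]                                     # 'HH:MM' -> 'HH', still a string
    return day, hour

toy_day, toy_hour = day_and_hour('2018/01/15', '21:05')
```

The hour staying a zero-padded string is what lets it sort correctly as text later ('09' before '10').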

In [47]:
## we can create two new dataframes from this 

hour_day = df.groupby([df['Day'],df['Hour']]).count()
hour_date = df.groupby([df['Date'],df['Hour']]).count()

hour_day = hour_day.reset_index()
hour_date = hour_date.reset_index()

# i know the column is called “time” but it really represents the total number of messages lol
hour_day = hour_day[['Day','Hour','Time']]
hour_date = hour_date[['Hour','Time']]

In [49]:
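## per-hour stats: the total number of messages in each hour of the day, plus the
## average and median number of messages per date for that hour.
## note: an hour with zero messages on some date isn’t a row in hour_date, so the
## average and median only cover the dates where that hour was active.
## this took forever to type and could probably be wrapped in a function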

hour1 = hour_day[['Hour','Time']]
hour2 = hour_date[['Hour','Time']]

hour_total = hour1.groupby([hour1['Hour']]).sum()
hour_average = hour2.groupby([hour2['Hour']]).mean()
hour_median = hour2.groupby([hour2['Hour']]).median()

hour_total = hour_total.reset_index()
hour_average = hour_average.reset_index()
hour_median = hour_median.reset_index()

hour_total = hour_total.rename(columns={'Time':'Total'})
hour_average = hour_average.rename(columns={'Time':'Average'})
hour_median = hour_median.rename(columns={'Time':'Median'})

hour_stats = hour_total.merge(hour_average)
hour_stats = hour_stats.merge(hour_median)
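
Since the comment says this could probably be wrapped in a function, here is one possible sketch. `summarize` is a hypothetical name, and the toy dataframe is invented. It computes all three stats from a single counts table, which should match the cell above because summing the per-date counts gives the same per-hour total as summing the per-weekday counts.

```python
import pandas as pd

def summarize(counts_df, key, value='Time'):
    """Total / average / median of `value` for each group in `key` (a column name or list of names)."""
    stats = counts_df.groupby(key)[value].agg(['sum', 'mean', 'median'])
    stats = stats.rename(columns={'sum': 'Total', 'mean': 'Average', 'median': 'Median'})
    return stats.reset_index()

# Made-up stand-in for hour_date: message counts for (some date, hour) pairs.
toy_hour_date = pd.DataFrame({'Hour': ['09', '09', '10'], 'Time': [2, 4, 5]})
toy_hour_stats = summarize(toy_hour_date, 'Hour')
```

Passing a list such as `['Hour', 'Day']` as `key` would cover the hour-and-day breakdown the same way.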

In [50]:
counts = df.groupby([df['Day'],df['Date']]).count()
counts = counts.reset_index()
days = counts[['Day','Time']]

In [51]:
day_average = days.groupby([days['Day']]).mean()
day_total = days.groupby([days['Day']]).sum()
day_median = days.groupby([days['Day']]).median()

In [52]:
## this is more of the same—basically just cleaning up these dataframes. 
## the rest of this notebook is just going through the same steps but for different metrics

day_average = day_average.reset_index()
day_total = day_total.reset_index()
day_median = day_median.reset_index()

day_average = day_average.rename(columns={'Time':'Average'})
day_total = day_total.rename(columns={'Time':'Total'})
day_median = day_median.rename(columns={'Time':'Median'})

day_stats = day_total.merge(day_average)
day_stats = day_stats.merge(day_median)

In [53]:
def daynum_to_daystr(daynum):
    days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
    try:
        day = days[int(daynum)]
    except ValueError:
        day = daynum
    return day
                
day_stats['Day'] = day_stats.apply(lambda row: daynum_to_daystr(row['Day']), axis=1)
day_stats.to_csv('day_stats.csv',index=False)
hour_stats.to_csv('hour_stats.csv',index=False)

In [54]:
hour_day_date = df.groupby([df['Date'],df['Hour'],df['Day']]).count()

In [55]:
hour_day_date = hour_day_date.reset_index()
hour_day_date = hour_day_date[['Hour','Day','Time']]

In [56]:
hdd_total = hour_day_date.groupby([hour_day_date['Hour'],hour_day_date['Day']]).sum()
hdd_average = hour_day_date.groupby([hour_day_date['Hour'],hour_day_date['Day']]).mean()
hdd_median = hour_day_date.groupby([hour_day_date['Hour'],hour_day_date['Day']]).median()

hdd_total = hdd_total.reset_index()
hdd_average = hdd_average.reset_index()
hdd_median = hdd_median.reset_index()

hdd_total = hdd_total.rename(columns={'Time':'Total'})
hdd_average = hdd_average.rename(columns={'Time':'Average'})
hdd_median = hdd_median.rename(columns={'Time':'Median'})

In [57]:
hdd = hdd_total.merge(hdd_average)
hdd = hdd.merge(hdd_median)

In [58]:
hdd = hdd.sort_values(by=['Day','Hour'])

In [59]:
def daynum_to_daystr(daynum):
    days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
    try:
        day = days[int(daynum)]
    except ValueError:
        day = daynum
    return day

def hournum_to_hourstr(hournum):
    t1 = datetime.strptime(hournum,'%H')
    t2 = t1.strftime('%I %p')
    return t2

hdd['Day'] = hdd.apply(lambda row: daynum_to_daystr(row['Day']),axis=1)
hdd['DayHour'] = hdd.apply(lambda row: row['Day'] + ' ' + hournum_to_hourstr(row['Hour']), axis=1)
hdd.to_csv('hdd.csv',index=False)
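
For reference, a standalone sketch of how the `DayHour` labels come out, using only the standard library. `day_hour_label` is a hypothetical helper that combines the two conversions above; the `zfill` is an added assumption to cope with an hour that arrives as a plain number instead of a zero-padded string.

```python
from datetime import datetime

DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def day_hour_label(daynum, hournum):
    """Build a label like 'Monday 09 PM' from a weekday number and a 24-hour hour."""
    hour_text = str(hournum).zfill(2)                        # 9 -> '09', '21' stays '21'
    hour_label = datetime.strptime(hour_text, '%H').strftime('%I %p')
    return DAYS[int(daynum)] + ' ' + hour_label

evening = day_hour_label(0, '21')
midnight = day_hour_label(6, 0)
```

Midnight comes out as "12 AM" because `%I` is the 12-hour clock.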