<img src="wiOtxI3emiC.png" alt="Drawing" style="width: 200px;"/>

## Introduction

In 2015, Facebook deprecated its Chat API, which had allowed websites and apps to access Facebook Messenger and analyze message data, because app developers had access to users' friends' information and this raised privacy concerns. The Facebook developer toolkit currently consists mainly of the Ads API, the Graph API, and the Facebook SDK, but none of these provide access to Facebook Messenger data. This tutorial therefore demonstrates how to extract information from your own messages, using Selenium WebDriver to log in to your account and BeautifulSoup to parse each conversation. More specifically, for each person you have a chat history with, I will compare the number of messages you sent with the number they sent, the total number of characters each of you sent, and the number of times each of you initiated the conversation, and then calculate an overall rating of how much each of you contributes to the conversation. I chose these statistics because I thought they would be interesting to people who use Facebook, but the tutorial also serves as a general guide to scraping Facebook Messenger for whatever data interests you.

## Table of Contents:
* [Libraries](#libraries)
* [Login to Facebook](#login)
* [Accessing Sections in Facebook](#sections)
* [Show Older Conversations](#older)
* [Show Older Messages](#oldermsg)
* [Extract the Links of Conversations](#links)
* [Parsing Dates](#dates)
* [Extracting Data From Conversation](#data)
* [Analyzing the Data](#analyze)
* [Overall Rating](#rating)
* [Visualizing the Data](#visual)
* [Bring It All Together](#all)
* [Print the Results](#print)

## Libraries <a name="libraries"></a>
Before getting started, you need to install a few libraries. The selenium package provides the webdriver that logs in to your Facebook account and clicks on your Messages. I import ActionChains from it because it is needed to scroll up in Facebook to access older messages. You also need to install ChromeDriver if you want to use Google Chrome as the browser for your webdriver. These can be installed by running the commands:

    $pip install selenium
    $brew install chromedriver

BeautifulSoup4 is needed to scrape the page to gather data about the conversation and it can be installed by running:

    $pip install beautifulsoup4
    
Other libraries used are getpass, which lets the user type in their password without it being echoed; time, whose time.sleep() gives the page time to load before any data is scraped; datetime, to parse the date of each message into a common format; re, for regular expressions; termcolor, whose cprint prints text in bold for the final output; and IPython's display function together with the FloatProgress widget, to draw progress bars of the results.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup as bs

from getpass import getpass
import time
import datetime
import re

from termcolor import cprint
from ipywidgets import FloatProgress #IPython.html.widgets moved to the ipywidgets package
from IPython.display import display

## Login to Facebook <a name="login"></a>
Now that you've installed the libraries, the first step is to initialize the webdriver and log in to your Facebook account. You first enter your username and password, and then the webdriver opens https://www.facebook.com/. The code finds the 'email' and 'pass' fields on the page, types in the account information you entered, and submits the form through the 'Log in' button automatically. A cool aspect of the webdriver is that you can watch it open a new browser window on the Facebook page and type in your account information. You can therefore control the browser through your code but also manually, by clicking and typing as you would in a normal browser. This is useful because you can open the web developer tools in Chrome to inspect the HTML and find particular elements on the page you are viewing, which will help when we use BeautifulSoup to parse the page. A downside of using the webdriver is that you need to wait for the website to load before moving on, which is why I imported the time module so the program can wait. Otherwise, the program raises an error because elements cannot be found while the page is still loading.

In [None]:
usr = input('Enter username: ')
pwd = getpass('Enter password: ')
driver = webdriver.Chrome()
driver.get('https://www.facebook.com/')

username = driver.find_element_by_id('email')
username.send_keys(usr)
password = driver.find_element_by_id('pass')
password.send_keys(pwd)
login = driver.find_element_by_id('loginbutton')
login.submit()
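
The fixed pauses used in this tutorial can be brittle on a slow connection. As a minimal sketch of an alternative, the hypothetical helper below (`wait_until` is my own name, not part of Selenium) polls a condition until it succeeds or a timeout passes. It uses only the standard library; Selenium also ships its own `WebDriverWait` class for the same purpose.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    condition is any zero-argument callable; exceptions it raises are treated
    as "not ready yet", which matches how element lookups fail while a page loads.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            result = None  # lookup failed, so the page is probably still loading
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(interval)
```

Assuming the `driver` from the cell above, a call such as `wait_until(lambda: driver.find_element_by_id('email'))` would return the element as soon as it exists instead of sleeping for a fixed time.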

## Accessing Sections in Facebook <a name="sections"></a>
<img src="Sections1.png" alt="Drawing" style="width: 200px;"/>
<img src="Sections2.png" alt="Drawing" style="width: 200px;"/>

As stated earlier, most of the web parsing will be done with BeautifulSoup. However, Selenium WebDriver also provides methods that let you search a page for certain elements. For the purposes of this tutorial, we are interested in analyzing Messenger data, so we search for the 'Messenger' link and click on it. You can also search for other sections like 'Events', 'News Feed', etc.

In [None]:
link = driver.find_element_by_link_text('Messenger')
link.click()

## Show Older Conversations <a name="older"></a>
When the 'Messenger' page first loads, the left panel shows the list of your conversations along with a preview of the most recent message in each. If you scroll far enough down the list, more conversations load in, which means that not all conversations are visible when the page first loads. Webdrivers have a built-in scroll function, but the Facebook website has multiple scroll bars, so the built-in function does not work here and we must simulate scrolling another way. When you scroll, you can briefly see a 'Show Older' button before more conversations load automatically, so if you search for the 'Show Older' element and tell the driver to click on it, this has the same effect as scrolling down the conversation panel. For the purposes of this tutorial, I limited the number of clicks on 'Show Older' to five. Note that you should call time.sleep after each click so the page has time to load the conversations; otherwise the next attempt to click 'Show Older' raises an error.

In [None]:
for i in range(5):
    old = driver.find_element_by_link_text('Show Older')
    old.click()
    time.sleep(1)
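
The loop above raises an error if the 'Show Older' button is missing, for example when all conversations are already loaded. A minimal sketch of a more forgiving version follows; `click_while_present` and its `find_button` parameter are hypothetical names of mine, and the function only assumes that the lookup raises an exception when the element is absent, as Selenium's lookups do.

```python
import time

def click_while_present(find_button, max_clicks=5, pause=1.0):
    """Click the element returned by find_button() up to max_clicks times.

    find_button is a zero-argument callable that returns an object with a
    click() method, or raises if the element is not on the page.
    Returns the number of clicks actually performed.
    """
    clicks = 0
    for _ in range(max_clicks):
        try:
            button = find_button()
        except Exception:
            break  # button is gone, so there is nothing more to load
        button.click()
        clicks += 1
        time.sleep(pause)  # give the page time to load more conversations
    return clicks
```

With the `driver` from this tutorial it could be called as `click_while_present(lambda: driver.find_element_by_link_text('Show Older'))`.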

## Show Older Messages <a name="oldermsg"></a>
As with the conversation list in the section above, not all messages appear when you first click on a conversation. If you scroll up, older messages load in. There is no button for showing older messages like there was for conversations, but there is an element with the tag < div class = '_2k8v' >, which is the loading icon for older messages. If you move to that element using Selenium's ActionChains, the page scrolls to the loading icon, which then loads older messages. For the purposes of this tutorial, I scroll up eight times.

In [None]:
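def scrollConvo():
    for i in range(8):
        try:
            element = driver.find_element_by_xpath("//div[@class='_2k8v']") #loading icon for older messages
            actions = ActionChains(driver)
            actions.move_to_element(element).perform()
            time.sleep(1) #give the older messages time to load
        except Exception: #no loading icon means there is nothing older to load
            return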

## Extract the Links of Conversations <a name="links"></a>
Now that we know how to load all of our conversations, we want to be able to click on each one so we can see all of its messages. Right-click on one of the conversations and click 'Inspect', as shown below.

<img src="inspect.png" alt="Drawing" style="width: 200px;"/>

This will show up on the right side of your screen.

<img src="inspectmenu.png" alt="Drawing" style="width: 200px;"/>

If you look towards the top, you can see an < a class = "_1ht5 _2il3 _5l-3 _3itx" > tag with a 'data-href' attribute, which holds the link to the conversation. You can therefore use BeautifulSoup to parse the page, find all tags with that class, and build a list of all the links.

In [None]:
def extractMsgLinks():
    res = []
    message = bs(driver.page_source, 'lxml')
    links = message.find_all('a', {'class' : '_1ht5 _2il3 _5l-3 _3itx'})
    for link in links:
        res.append(link['data-href'])
    return res

## Parsing Dates <a name="dates"></a>
Since we are trying to analyze the number of times each person initiates a conversation, we need to calculate the time difference between messages. Facebook Messenger has its own way of displaying dates, which makes them quite difficult to parse. If a message was sent on the current day, the timestamp looks like "12:04am". If it was sent in the past week, it looks like "Monday 10:20am". If it was sent within the past month, it looks like "March 22nd, 11:16pm". Any older message is shown as "September 13, 2015 3:18 am". We therefore need to detect which of these formats each timestamp uses and parse it accordingly. It is easiest to do this with the datetime package, which keeps the data type and format consistent.

In [None]:
def parsedate(date):
    today = datetime.datetime.now()
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    for day in days: #if message sent in past week i.e. Monday 10:20am
        if day in date:
            parsed_date = datetime.datetime.strptime(date, '%A %I:%M%p') #parse date
            temp = today - datetime.timedelta(days = (today.weekday()-days.index(day))%7) #step back to the most recent such weekday
            return temp.replace(hour = parsed_date.hour, minute = parsed_date.minute, second = 0, microsecond = 0) #update the time
    if re.search(r'\b\d{4}\b', date): #if over a month ago i.e. September 13, 2015 3:18 am (any four-digit year)
        return datetime.datetime.strptime(date, '%B %d, %Y %I:%M %p') #parse date
    elif any(month in date for month in months): #if message within a month i.e. March 22nd, 11:16pm
        date = re.sub(r'(\d)(st|nd|rd|th)', r'\1', date) #strip the ordinal suffix using regular expressions
        parsed_date = datetime.datetime.strptime(date, '%B %d, %I:%M%p')
        return parsed_date.replace(year = today.year) #no year is shown, so assume the current one
    else: #if message sent same day i.e. 12:04am
        parsed_date = datetime.datetime.strptime(date, '%I:%M%p') #parse date
        return today.replace(hour = parsed_date.hour, minute = parsed_date.minute, second = 0, microsecond = 0) #update the time from today's date
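
To make the two trickier cases concrete, here is a small standalone sketch. The helper names `resolve_weekday` and `strip_ordinal` are hypothetical, and `today` is passed in as a parameter (rather than read from the clock) only so that the example gives the same result whenever it is run.

```python
import datetime
import re

DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def resolve_weekday(stamp, today):
    """Turn a 'Monday 10:20am' style stamp into a full datetime relative to today."""
    day_name, clock = stamp.split(' ', 1)
    parsed = datetime.datetime.strptime(clock, '%I:%M%p')
    # Number of days to step back from today to reach the named weekday
    days_back = (today.weekday() - DAYS.index(day_name)) % 7
    target = today - datetime.timedelta(days=days_back)
    return target.replace(hour=parsed.hour, minute=parsed.minute, second=0, microsecond=0)

def strip_ordinal(stamp):
    """Remove 'st', 'nd', 'rd', 'th' after a digit so strptime can read the day."""
    return re.sub(r'(\d)(st|nd|rd|th)', r'\1', stamp)

today = datetime.datetime(2017, 3, 29, 15, 0)    # a Wednesday, chosen for illustration
print(resolve_weekday('Monday 10:20am', today))  # 2017-03-27 10:20:00
print(strip_ordinal('March 22nd, 11:16pm'))      # March 22, 11:16pm
```

The modulo in `days_back` is what keeps the offset between 0 and 6 even when the named weekday falls later in the week than today.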
    

## Extracting Data From Conversation <a name="data"></a>
Now that we can access every conversation we've had in Messenger, we can start scraping information from each one. Using the same technique as before, if you right-click and inspect a message, you can see that each block of messages begins with a tag < div class = "_1t_p clearfix" >, so we find each of them using BeautifulSoup. The name of the person who sent the block is stored in the 'data-tooltip-content' attribute of a tag < div class = "_4ldz _1t_r _p" >; blocks without that tag are the ones you sent yourself. You can then access the text of each message by finding the tags < div attachments = "List []" >, whose 'data-tooltip-content' attribute holds the timestamp. We are therefore able to extract the name of the person who sent each block, the number of messages in it, and the text and date of each message.

<img src="inspectmsg.png" alt="Drawing" style="width: 200px;"/>
<img src="inspectmsgmenu.png" alt="Drawing" style="width: 200px;"/>

In [None]:
def extractdata():
    res = []
    message = bs(driver.page_source, 'lxml')
    temp = message.find_all('div', {'class' : '_1t_p clearfix'}) #find each message block
    for new in temp:
        name = new.find('div', {'class' : '_4ldz _1t_r _p'}) #find name of person who sent message
        if name is None: #your own messages have no name tag
            name = 'You'
        else:
            name = name['data-tooltip-content']
        tags = new.find_all('div', {'attachments' : 'List []'}) #find message content
        res.append([name, len(tags), [(msg.text , parsedate(msg['data-tooltip-content'])) for msg in tags]])
    return res

## Analyzing the Data <a name="analyze"></a>
Now that we've parsed the data we need from each conversation, we analyze it to get the statistics we are interested in. First, to find the total number of characters each person sent, we add up the lengths of all of their messages. To get the total number of messages sent, we add up the length of each list of messages. Finally, the more complicated part is finding the number of times each person initiated the conversation. To do this, we compute the time difference between consecutive messages, and I decided that a message sent at least 12 hours after the previous one counts as initiating a conversation. This threshold can be changed according to your own interpretation.

In [None]:
def extractMsgLength(msglist):
    length = 0
    for i in range(msglist[1]):
        length += len(msglist[2][i][0]) #add the length of each of the messages
    return length

def extractFirstMsg(listofmsg):
    prev = None
    resFirst = {'You': 0} #keep track of number of times conversation was initiated
    resLength = {'You': [0, 0]} #keep track of [number of messages, number of characters] sent
    for block in listofmsg:
        person = block[0] #person's name
        if person not in resFirst:
            resFirst[person] = 0
        if person not in resLength:
            resLength[person] = [0, 0]

        resLength[person][0] += block[1] #number of messages in this block
        resLength[person][1] += extractMsgLength(block) #number of characters in this block

        for msg in block[2]: #calculate time difference between consecutive messages
            curr = msg[1]
            if prev is not None:
                diff = (curr - prev).total_seconds()/3600 #divide by 3600 to convert from seconds to hours
                if diff >= 12: #if difference greater or equal to 12 hours
                    resFirst[person] += 1 #add to number of times person has initiated conversation
            prev = curr
    return (resFirst, resLength)
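
The 12-hour rule is easiest to see on a toy conversation. The sketch below is a simplified, standalone version of the idea; `count_initiations` and the sample data are made up for illustration. It assumes the messages are already in chronological order and, like the function above, does not count the very first message as an initiation.

```python
import datetime

def count_initiations(messages, gap_hours=12):
    """Count, per sender, the messages sent at least gap_hours after the previous one.

    messages is a chronological list of (sender, datetime) pairs.
    """
    counts = {}
    prev_time = None
    for sender, sent_at in messages:
        counts.setdefault(sender, 0)
        if prev_time is not None:
            hours = (sent_at - prev_time).total_seconds() / 3600
            if hours >= gap_hours:
                counts[sender] += 1  # long silence before this message
        prev_time = sent_at
    return counts

base = datetime.datetime(2017, 3, 1, 9, 0)
sample = [
    ('You', base),
    ('Alex', base + datetime.timedelta(minutes=5)),
    ('Alex', base + datetime.timedelta(hours=13)),  # 12h55m after the previous message
    ('You', base + datetime.timedelta(hours=14)),
    ('You', base + datetime.timedelta(hours=40)),   # 26h after the previous message
]
print(count_initiations(sample))  # {'You': 1, 'Alex': 1}
```

Changing `gap_hours` is the equivalent of changing the hard-coded 12 in the tutorial's function.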

## Overall Rating <a name="rating"></a>
We have now parsed each of our conversations and counted the times each of us initiated a conversation, the total number of characters each of us sent, and the total number of messages each of us sent. I thought an overall rating of how much each person contributes to a conversation would be more interesting than a bunch of raw numbers, so I created a weighting in which the number of conversations initiated contributes 30%, the number of messages sent contributes 30%, and the total number of characters sent contributes 40% of the overall rating. I chose these weights because I think the number of characters sent is the best measure of someone's participation in a conversation, although the others are also important. The percentages are very subjective and can be adjusted to your own criteria. For each statistic I calculate your share by dividing your count by the combined total for you and the other person, using 50% when the total is zero. The overall rating is then the sum of each share multiplied by its weight.

In [None]:
def rate(data):
    
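    #criteria (the weights sum to 1)
    firstmessage = 0.3 #weight for conversations initiated
    nummsg = 0.3 #weight for number of messages sent
    totalmsg = 0.4 #weight for number of characters sent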
    
    finalrating = []
    for person in data:
        totalfirst = data[person][0]+data[person][1]
        if totalfirst == 0: #if no one has initiated conversation its a tie
            ratefirst = 0.5000
        else:
            ratefirst = data[person][1]/totalfirst
        
        totalmsglength = data[person][4]+data[person][5]
        if totalmsglength == 0: #if no messages sent
            ratemsglength = 0.5000
        else:
            ratemsglength = data[person][5]/totalmsglength
            
        totalnummsg = data[person][2]+data[person][3]
        if totalnummsg == 0: #if no messages sent
            ratenummsg = 0.5000
        else:
            ratenummsg = data[person][3]/totalnummsg
        rating = ratefirst*firstmessage+ratemsglength*totalmsg+ratenummsg*nummsg #the final rating
        finalrating.append([person, rating, ratefirst, data[person][5], totalmsglength, data[person][3], totalnummsg]) #return the other statistics to print them out later
        
    return finalrating
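
As a worked example of the weighting, the standalone sketch below computes the rating for one made-up conversation. The function name `contribution_rating` and the sample numbers are hypothetical; the weights and the 50% fallback for empty totals follow the description above.

```python
def contribution_rating(first_you, first_them, msgs_you, msgs_them,
                        chars_you, chars_them, weights=(0.3, 0.3, 0.4)):
    """Weighted share of a conversation contributed by 'you', between 0 and 1."""
    def share(yours, theirs):
        total = yours + theirs
        return 0.5 if total == 0 else yours / total  # a tie when there is no data

    w_first, w_msgs, w_chars = weights
    return (w_first * share(first_you, first_them)
            + w_msgs * share(msgs_you, msgs_them)
            + w_chars * share(chars_you, chars_them))

# You initiated 3 of 4 conversations, sent 40 of 100 messages and 600 of 1000 characters:
# 0.3 * 0.75 + 0.3 * 0.40 + 0.4 * 0.60 = 0.225 + 0.12 + 0.24
rating = contribution_rating(3, 1, 40, 60, 600, 400)
print(round(rating, 3))  # 0.585
```

A rating above 0.5 means you contribute more than the other person under this particular weighting; different weights can be tried by passing another `weights` tuple.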

## Visualizing the Data <a name="visual"></a>
Now that we have an overall rating, we want to visualize it. I use the FloatProgress widget to create a 'progress bar' whose progress is the percentage you contribute to the conversation. You declare the min and max values (0 and 100 respectively in this case), display the bar, and then set its value to the overall rating.

If your contribution is 0% it would look like:
<img src="0.png" alt="Drawing" style="width: 200px;"/>

If your contribution is 50% it would look like:
<img src="50.png" alt="Drawing" style="width: 200px;"/>

If your contribution is 100% it would look like:
<img src="100.png" alt="Drawing" style="width: 200px;"/>

In [None]:
def showbar(num):
    f = FloatProgress(min=0, max=100)
    display(f)
    f.value = num #set the bar to the overall rating

## Bring It All Together <a name="all"></a>
Now that we've defined all the functions needed to parse a conversation, all that remains is to iterate through the conversations and call them. I've limited this to the first 15 conversations to keep the run time down, but you can parse more by changing the limit in the for loop. We first extract the links to all the conversations and then open each one using Selenium's find_element_by_xpath, which takes an XPath expression describing an HTML tag, finds the matching element on the page, and lets us click on it. Each conversation begins with the same tag < a class = "_1ht5 _2il3 _5l-3 _3itx" > but has a different data-href attribute, which we pass in from our list of extracted links. We then scroll through the conversation to load older messages, extract the data, and analyze it to get the number of messages sent and the number of conversations initiated. Finally, we call the rate function to produce an overall rating for each person.

In [None]:
def parseConversation():
    res = dict()
    listoflinks = extractMsgLinks()
    for link in listoflinks[:15]: #first 15 conversations, or fewer if not that many were loaded
        driver.find_element_by_xpath("//a[@class='_1ht5 _2il3 _5l-3 _3itx'][@data-href='"+link+"']").click()
        scrollConvo()
        time.sleep(1)
        msgdata = extractdata()
        firstmsg, avgmsg = extractFirstMsg(msgdata)
        for person in firstmsg:
            if person != 'You':
                if person not in res:
                    res[person] = [firstmsg[person], firstmsg['You'], avgmsg[person][0], avgmsg['You'][0], avgmsg[person][1], avgmsg['You'][1]]
                else:
                    res[person][0] += firstmsg[person]
                    res[person][1] += firstmsg['You']
                    res[person][2] += avgmsg[person][0]
                    res[person][3] += avgmsg['You'][0]
                    res[person][4] += avgmsg[person][1]
                    res[person][5] += avgmsg['You'][1]
    return rate(res)

## Print the Results <a name="print"></a>
To finish the tutorial, we print the results of our analysis and show the overall rating with the progress bar. We show the message initiation rate as a percentage, and the number of characters and number of messages sent as ratios of your count to the total. Here is an image of what the final output looks like (with the names blurred out for privacy reasons).

<img src="output1.png" alt="Drawing" style="width: 200px;"/>
<img src="output2.png" alt="Drawing" style="width: 200px;"/>

In [None]:
ratings = parseConversation()
cprint('Contribution Rates\n', attrs=['bold'])
for person in ratings:
    rating = person[1]
    rating *= 100
    rating = round(rating, 2)
    cprint(person[0], attrs = ['bold'], end='')
    print(', Overall Rating: '+str(rating)+'%')
    print('\tMessage Initiation Rate: '+str(round(person[2]*100,2))+'%')
    print('\tTotal # of Characters Sent: '+str(person[3])+'/'+str(person[4]))
    print('\tTotal # of Messages Sent: '+str(person[5])+'/'+str(person[6]))
    showbar(rating)
    print('\n')