# Who are Your Friends? Natural Language Processing and You 

This tutorial will teach you some interesting analyses you can do using basic natural language processing tools. Extracting information from natural language is a fundamental part of data science: it doesn't require us to solve natural language understanding in full, but rather to summarize large amounts of text using computer science and statistics. Natural language processing tries to answer many kinds of questions: Is the writer of a piece of text enthusiastic or pessimistic? What topic does this document talk about the most? Do you generally use positive or negative language when talking with your friends? It answers questions like these with surprising success.

## Context

If you use Facebook often, it's highly likely you've seen those posts that say "Who should you live with?!" or "Who is your best friend on Facebook?!" They're ubiquitous, though very limited, as they can only access public information on your profile. So they use very generic standards to measure your relationship with other people on Facebook: how many pictures you share with them, how often you post on each other's walls, etc. However, this can be fairly unrepresentative of your true friendship with people. While such an app can make a good guess, it doesn't give any qualitative analysis or explain why it chose that person as your friend (at best, it'll show a loading bar saying it's "calculating your best friend", whatever that means).

I'd argue that a much better metric for measuring your relationship with your friends on Facebook is how often you chat with people and what you chat about with them. This tutorial will cover how to retrieve your archived Facebook messages, visualize how frequently you chat with your friends, and make a good guess at the true relationship between you and your friends.

## Tutorial Content

In this tutorial, I am assuming that you have a Facebook account you use frequently, specifically to talk with friends. If you do not, then this tutorial probably isn't for you. We will be using [fbchat-archive-parser](https://pypi.python.org/pypi/fbchat_archive_parser) and nltk.



We will cover the following topics in this tutorial:

* Installing the libraries
* Extracting the data from your Facebook
* Converting your Archived Messages to Useful Formats
* Visualizing Chat Frequency With Friends
* Determining Relationships With Friends 


## Installation

Before we get started, we first need to install fbchat-archive-parser. You can install it via pip by going to your Terminal and typing in the following:

```pip install fbchat-archive-parser```

You will also need to have nltk. If you are using a Jupyter notebook, then this should already be pre-installed for you. Otherwise, you can easily install nltk the same way you installed fbchat-archive-parser, via pip:

```pip install nltk```

For nltk, we will need the Stopword and Opinion Lexicon corpora. You can download these by going into a python shell and executing the following command:

In [None]:
import nltk

nltk.download()

This will open up a GUI selection screen where you can select all types of corpora. A corpus is simply a curated collection of text or word lists that is useful for natural language processing. In general, you should avoid downloading all corpora, as you will rarely ever need all of them. You can find the opinion lexicon and stopword corpus under the corpora tab. If you would rather skip the GUI, `nltk.download('stopwords')` and `nltk.download('opinion_lexicon')` fetch the two corpora by name. After running these installations, make sure the following runs properly for you:

In [3]:
import json, string, sys, operator, math, pandas
from collections import Counter
from nltk.corpus import stopwords
from nltk import bigrams
from collections import defaultdict

## Extracting the Data From Facebook

This will probably be the worst part of this tutorial, because you'll have to wait. Facebook has a relatively unknown feature where you are allowed to ask for all the data Facebook has on you: pictures, relationships, posts, messages... basically anything you can think of that you do on Facebook, you are allowed to request. This is called Facebook Archive Download. Go to the Drop down menu next to your privacy shortcuts on the main page of Facebook > Settings. From here, there should be some grey text beneath the bars letting you know you can download a copy of your Facebook data. Click there.



<img src = "fb1.png">

<img src = "fb2.png">

Click the "Start My Archive" button, and wait. This'll take awhile, especially if you do a lot on Facebook. Standard time it takes for archive requests to be fulfilled is about 6 to 8 hours. Go outside. Enjoy life. You'll get an email to the account that is linked with your Facebook when your archive is ready. Download the archive, and unpack the zip file. Just a warning, it might be really big, so don't worry if it takes awhile to open.

## Converting Your Messages to Useful Formats

Inside the unzipped folder, you'll find three folders and an html file that you can use to explore your data in a GUI format. We're only interested in one file though. Navigate to html > messages.htm

This file is a collection of all of your Facebook chat history, stored as HTML markup. If you open it up, you might notice that people you talk with a lot are split into separate instances. For some reason, Facebook caps individual conversation segments at 200 messages. Don't worry, the rest of the conversation is still there; it's just scattered throughout the document. Since we cannot work in this format, we'll need to convert it to something more useful. fbchat-archive-parser supports yaml, json, and csv formats. For this tutorial, let's use the json format. As Professor Kolter said, if you don't have to write your own xml parser, then don't do it. In your Terminal, cd into the html folder and run the following command:


```fbcap ./messages.htm -f json > file.json```

Rename file.json to anything you deem appropriate. This will convert all of your messages into json format. Fair warning: this might take a while; I usually wait around 20 minutes. Also note that Facebook might give you XML that is not well formed, so you might see a warning message that looks something like this:

```The streaming parser crashed due to malformed XML. Falling back to the less strict/efficient python html.parser. It may take a while before you see output... ```

 This is fine, just know it might take an extra few minutes to process everything.

# Exploratory Data Analysis

Now that we have our data, what can we do with it? One built-in feature of fbchat-archive-parser is that you can run summary statistics for yourself. Run the following:

```fbcap ./messages.htm -f stats```

This will give you summary statistics for your top 10 chats. Already, this is probably a better measure of your top 10 friends than any random guess will give you. Here is part of my output for reference. Some of the names are blurred out because I did not obtain permission from them to use their names in public. The second chat is actually an offshoot of my first chat, which is why it has mostly the same members. Something interesting to note is that people who change their name will have both names show up in the results.

<img src = stats.png>

It can be hard to read your json file (especially since the parser puts everything on one line), so it's useful to print and find the participants of the group chats in your top 10. This is easily done using the following code:


In [7]:
# file.json is the name of your file that contains your json data
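def printParticipants(file='file.json'):
    # Print the participant list of every thread in the archive
    with open(file) as chatHistory:
        raw = json.load(chatHistory)
        for thread in raw['threads']:
            print(thread['participants'])

def findParticipants(targets, file='file.json', outfile='thread.json'):
    # Save the messages of the thread whose participants exactly match targets.
    # Reading and writing use different files so the archive is not overwritten.
    with open(file,'r') as f:
        raw = json.load(f)
        for thread in raw['threads']:
            if set(targets) == set(thread['participants']):
                with open(outfile,'w') as out:
                    json.dump(thread['messages'],out, indent = 4)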
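
For orientation, the functions above assume the converted file has roughly this shape: a top-level `threads` list whose entries each hold `participants` and `messages`. The sketch below builds a tiny made-up archive with that layout and shows how the participant match works. The names, dates, and messages are invented, `messages_for` is a hypothetical helper, and the exact fields in your own export may differ.

```python
import json

# Hypothetical miniature archive; field names mirror those used in this tutorial.
sample = {
    "threads": [
        {"participants": ["Alice", "Bob"],
         "messages": [
             {"date": "2016-10-01T12:00:00", "sender": "Alice", "message": "hi"},
         ]},
        {"participants": ["Alice", "Bob", "Carol"],
         "messages": [
             {"date": "2016-10-02T09:30:00", "sender": "Carol", "message": "hello all"},
             {"date": "2016-10-02T09:31:00", "sender": "Bob", "message": "hey"},
         ]},
    ]
}

def messages_for(archive, targets):
    """Return the messages of every thread whose participants equal targets."""
    wanted = set(targets)
    found = []
    for thread in archive["threads"]:
        # Comparing sets makes the match independent of name order
        if set(thread["participants"]) == wanted:
            found.extend(thread["messages"])
    return found

group = messages_for(sample, ["Carol", "Bob", "Alice"])
print(json.dumps(group, indent=4))
```

Because the comparison is on sets, the order in which you list the names does not matter, but every participant has to be included for a thread to match.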
                        


Let's start quantitatively analyzing our data. It is very useful to know what we talk about with our friends. We can use something called the co-occurrence matrix to find the number of times words co-occur in the same message.

This structure simply counts how many times each pair of words appears together in your messages. The code for that is as follows:

In [8]:
def findCom(file):
    STOP = set(stopwords.words('english'))
    com = defaultdict(lambda : defaultdict(int))
    with open(file,'r') as f:
        # f is the file pointer to the JSON data set (a list of messages)
        data = json.loads(f.read())
        for message in data:
            # Split the text into words; iterating over the raw string
            # would yield single characters instead of terms
            terms_only = [term for term in message['message'].split()
                          if term not in STOP]
            # Build co-occurrence matrix
            for i in range(len(terms_only)-1):            
                for j in range(i+1, len(terms_only)):
                    word1, word2 = sorted([terms_only[i], terms_only[j]])                
                    if word1 != word2:
                        com[word1][word2] += 1
    return com

def findComMax(file, n=10):
    com = findCom(file)
    com_max = []
    for t1 in com:
        #Find the top 10 terms in the COM
        t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1), reverse=True)[:10]
        for t2, t2_count in t1_max_terms:
            com_max.append(((t1, t2), t2_count))
    # Get the most frequent co-occurrences
    terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
    return terms_max[:n]

My output looks like this

```[(('guys', 'u'), 804), (('And', 'like'), 732), (('LOL', 'like'), 732), (('feel', 'like'), 604), (('like', 'one'), 577), (('like', 'u'), 529), (('dont', 'think'), 456), (('dont', 'like'), 455), (('know', 'u'), 449), (("I'm", 'like'), 412)]```

If you couldn't tell, we use the word 'like' a lot.
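To make the counting concrete, here is a small self-contained sketch of the same idea on three made-up messages. It is not the tutorial's function: `cooccurrence` is a hypothetical helper that takes plain strings, splits them on whitespace, and uses `itertools.combinations` for the pair loop.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(messages, stop=()):
    """Count how often each unordered pair of distinct words shares a message."""
    com = defaultdict(lambda: defaultdict(int))
    for text in messages:
        terms = [t for t in text.split() if t not in stop]
        for a, b in combinations(terms, 2):
            if a != b:
                # Store each pair once, under its alphabetically first word
                w1, w2 = sorted([a, b])
                com[w1][w2] += 1
    return com

toy = ["feel like pizza", "feel like sleeping", "like pizza"]
com = cooccurrence(toy, stop={"the", "a"})
print(com["feel"]["like"])    # 'feel' and 'like' share two messages
print(com["like"]["pizza"])   # 'like' and 'pizza' share two messages
```

Note that each pair lives under its alphabetically first word only, so `com["like"]["feel"]` stays at zero. Keep that in mind when looking up pairs later.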


## Visualizing Chat Frequency

The first thing we are interested in is seeing how often we chat with our friends. These functions can be used to create json files recording how often we chat with friends. These files are easy to plot with any visualization package. I used d3; however, you can use matplotlib to try it yourself.


In [None]:
def countMessages(file):
    result = {}
    with open(file,'r') as f:
        raw = json.load(f)
        for message in raw['threads'][0]['messages']:
            # Keep only the first 10 characters of the date (the day)
            date = message['date'][:10]
            if date in result:
                result[date] += 1
            else:
                # First message seen on this day
                result[date] = 1
    with open('output.json','w') as out:
        json.dump(result,out,indent = 4)


def createTranscript(file,target):
    result = list()
    with open(file,'r') as f:
        raw = json.load(f)
        for message in raw['threads'][0]['messages']:
            if message['sender'] == target:
                result.append(message)
    with open('transcript.json', 'w') as out:
        json.dump(result,out,indent=4)

        
def countTranscript(file):
    result = {}
    with open(file,'r') as f:
        raw = json.load(f)
        for message in raw:
            date = message['date'][:10]
            if date in result:
                result[date] += 1
            else:
                # First message seen on this day
                result[date] = 1
    # Written to its own file so the message data is not overwritten
    with open('counts.json','w') as out:
        json.dump(result,out,indent = 4)

def makeTimeSeriesJson(file):
    result = []
    with open(file,'r') as f:
        raw = json.load(f)
        for date in raw:
            dataPoint = dict()
            dataPoint['date'] = date
            dataPoint['value'] = raw[date]
            result.append(dataPoint)
    with open('timeSeriesOutput.json','w') as out:
        json.dump(result,out,indent = 4)
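
Days on which nobody sent a message never appear in the counts, which can make a line chart misleading. The sketch below shows one way to fill those gaps with zeros before plotting. It assumes the keys are `YYYY-MM-DD` strings (the first 10 characters of each message date, as in the functions above); `dense_daily_series` and the numbers are made up for illustration.

```python
from datetime import date, timedelta

def dense_daily_series(counts):
    """Turn {'YYYY-MM-DD': n} into parallel lists of days and counts, filling gaps with 0."""
    if not counts:
        return [], []
    days = sorted(date.fromisoformat(d) for d in counts)
    xs, ys = [], []
    current = days[0]
    while current <= days[-1]:
        xs.append(current)
        # Days missing from the input had no messages
        ys.append(counts.get(current.isoformat(), 0))
        current += timedelta(days=1)
    return xs, ys

counts = {"2016-10-01": 3, "2016-10-04": 1}   # made-up numbers
xs, ys = dense_daily_series(counts)
print(ys)
```

The two lists can be handed straight to a plotting call such as matplotlib's `plt.plot(xs, ys)`.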

<img src = chat_history.png>

The x-axis is partitioned by days, and on the day we talked the most, we sent up to 2000 messages. One thing to note is that if your friends changed their names, you'll need to account for every name they used when choosing the target of the ```createTranscript``` function. If you don't, you'll get a flat line like the one below.

<img src = "nickname_history.png">
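
One way to handle renamed friends is to match against a set of names instead of a single string. This is only a sketch: `filter_by_senders` is a hypothetical helper, the messages are invented, and it assumes each message has a `sender` field as in `createTranscript`.

```python
def filter_by_senders(messages, names):
    """Keep messages whose sender is any of the given names (e.g. old and new names)."""
    wanted = set(names)
    return [m for m in messages if m["sender"] in wanted]

# Invented messages: the same friend appears under two names.
msgs = [
    {"date": "2016-01-01", "sender": "Bob Smith", "message": "hey"},
    {"date": "2016-06-01", "sender": "Bobby S", "message": "new name"},
    {"date": "2016-06-02", "sender": "Alice", "message": "nice"},
]
bob = filter_by_senders(msgs, {"Bob Smith", "Bobby S"})
print(len(bob))
```

The summary statistics from earlier are a handy place to spot which names belong to the same person.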

# Determining Relationship With Friends

To simplify this tutorial, let's assume that the top 10 chats from your summary statistics are the people who are your best friends. For me, I'll only do the first chat in my list of top 10.

We can figure out what topics you talk about the most and how you feel about them using a very basic unsupervised NLP technique known as sentiment analysis.


## Finding the Semantic Orientation

To figure out how good of a relationship you have with your friends, we'll be calculating something called the semantic orientation. This is a very simple concept with strong implications. Essentially, given two dictionaries of "good" and "bad" words, if we measure how often a term appears alongside "good" words versus "bad" words, then we have a quantifiable metric for how positively you and your friend talk about it. To define this mathematically, we must first define Pointwise Mutual Information (PMI), which is

$PMI(t_1, t_2) = \log(\frac{P(t_1 \cap t_2)}{P(t_1) \cdot P(t_2)})$

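where $t_1$ and $t_2$ are two different terms in a document (in our case, a single message), $P(t)$ is estimated as the count of $t$ divided by the number of messages, and $P(t_1 \cap t_2)$ is estimated the same way from the co-occurrence count. The semantic orientation of a term $t$ is then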

$SO(t) = \displaystyle \sum_{t' \in V^+} PMI(t,t') -\sum_{t' \in V^-} PMI(t,t') $

$V^+$ and $V^-$ represent positive and negative vocabularies. We use the Bing Liu opinion lexicon that we downloaded at the beginning as our positive and negative vocabularies. These are especially good as they include frequent misspellings of common words.
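
As a quick sanity check of the two formulas, here is a worked example on invented counts. Everything in it is hypothetical: the counts, the one-word vocabularies, and the choice to treat a pair that never co-occurs as contributing 0 (so the logarithm is never taken of zero).

```python
import math

def pmi(p_joint, p_a, p_b):
    """Pointwise mutual information in bits; 0.0 when the pair never co-occurs."""
    if p_joint == 0:
        return 0.0
    return math.log2(p_joint / (p_a * p_b))

# Made-up counts over 100 messages.
n_docs = 100
count = {"pizza": 10, "great": 20, "awful": 10}
co = {("pizza", "great"): 8, ("pizza", "awful"): 1}

def p(term):
    return count[term] / n_docs

def p_joint(a, b):
    return co.get((a, b), 0) / n_docs

# SO(pizza) with V+ = {great} and V- = {awful}
so_pizza = (pmi(p_joint("pizza", "great"), p("pizza"), p("great"))
            - pmi(p_joint("pizza", "awful"), p("pizza"), p("awful")))
print(round(so_pizza, 3))   # about 2.0
```

Here "pizza" shows up with "great" four times more often than chance would predict (PMI of about 2 bits) and with "awful" exactly as often as chance (PMI of about 0), so its orientation comes out positive.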

The code for calculating the semantic orientation is as follows:


In [None]:
def findSemanticOrientation(file):
    # The Bing Liu opinion lexicon downloaded earlier is exposed by nltk
    from nltk.corpus import opinion_lexicon
    n_docs = findNDocs(file)
    count_stop_single = mostCommon(file)
    com = findCom(file)
    # n_docs is the total n. of messages
    p_t = {}
    p_t_com = defaultdict(lambda : defaultdict(int))
    for term, n in count_stop_single.items():
        p_t[term] = n / n_docs
        for t2 in com[term]:
            p_t_com[term][t2] = com[term][t2] / n_docs
    positive_vocab = list(opinion_lexicon.positive())
    negative_vocab = list(opinion_lexicon.negative())
    pmi = defaultdict(lambda : defaultdict(int))
    for t1 in p_t:
        for t2 in com[t1]:
            denom = p_t[t1] * p_t[t2]
            pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)
            # findCom stores each pair once (sorted), so mirror the value
            pmi[t2][t1] = pmi[t1][t2]
    semantic_orientation = {}
    for term, n in p_t.items():
        # Pairs that never co-occur contribute 0
        positive_assoc = sum(pmi[term].get(tx, 0) for tx in positive_vocab)
        negative_assoc = sum(pmi[term].get(tx, 0) for tx in negative_vocab)
        semantic_orientation[term] = positive_assoc - negative_assoc
    return semantic_orientation

# findNDocs counts the number of messages

def findNDocs(file):
    with open(file,'r') as f:
        # json.load reads the whole file, even when it spans several lines
        data = json.load(f)
        return len(data)

def mostCommon(file):
    STOP = set(stopwords.words('english'))
    with open(file, 'r') as f:
        count_all = Counter()
        data = json.load(f)
        for message in data:
            # Create a list with all the terms by splitting the text into words
            try:
                terms_only = [term for term in message['message'].split()
                              if term not in STOP]
            except (KeyError, AttributeError, TypeError):
                # Skip entries without usable message text
                continue
            count_all.update(terms_only)
    return count_all
    

## Interpreting Your Results

If you followed along correctly, `findSemanticOrientation` gives you a dictionary mapping each term to its semantic orientation. Listing its items gives pairs of a term and its score, like so:

```
[(LOL, 93.45424444219)]
```

While some of the terms may not make any sense, you will occasionally get words that are topical. You can see in general what you feel about certain topics. For example, with my friends one of my results is

```
[(math, -87.3131451313)]
```


Not surprising.
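
To pull out the strongest results instead of scanning the whole dictionary, you can sort it by score. The sketch below uses invented scores standing in for the output of `findSemanticOrientation`; `rank_terms` is a hypothetical helper, not part of any library.

```python
def rank_terms(semantic_orientation, n=5):
    """Return the n most positive and n most negative (term, score) pairs."""
    ordered = sorted(semantic_orientation.items(),
                     key=lambda kv: kv[1], reverse=True)
    # The tail of the list holds the lowest scores; reverse it so the
    # most negative term comes first
    return ordered[:n], ordered[-n:][::-1]

# Invented scores for illustration only.
scores = {"LOL": 93.4, "math": -87.3, "pizza": 12.0, "exam": -40.5}
top_pos, top_neg = rank_terms(scores, n=2)
print(top_pos)
print(top_neg)
```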




## Summary

Messing around with your Facebook friends' activity is a very fun pastime. With this tutorial, we learned how to retrieve raw messages from Facebook and convert them into usable formats. We also learned how to visualize our data, and how to use the semantic orientation of our messages to learn more about how we communicate with our friends. Another fun thing you could do is make a chatbot from your friends' messages. Hopefully, you learned a little bit more about natural language processing and had fun reading this tutorial.

# References

Here are the libraries and documentation for some of the classes we used.

* NLTK: http://www.nltk.org/
* facebook-archive-parser: https://github.com/CopOnTheRun/FB-Message-Parser

