# Facebook Public Page Classification

In this tutorial, we will take a brief look at the biggest social graph on earth: Facebook.

First, we need to set up the environment so that we can call the Graph API. This includes creating an access token, downloading an SDK, and making some simple requests. Since the SDK is a third-party SDK and quite limited, we also need to build a minimal Graph API client from scratch.

Then we will use our API to scrape a small portion of the public Pages on Facebook, along with the recent public Posts from each of these Pages.

Finally, we will clean up the raw data into a well-formatted dataset and use it to train an SVM model that predicts the genre of a Page from the content of its Posts.



## Hi, Facebook

In this section, we will get an access token from Facebook, download the third-party SDK, and then use that SDK to fetch your own friends list as practice.

You can find a detailed introduction to the Graph API here: https://developers.facebook.com/docs/graph-api/overview


### Create Access Token
Since our third-party SDK doesn't provide an authentication mechanism, we have to use the Graph API Explorer to get our token.

1. Open https://developers.facebook.com/tools/explorer
2. Click the **Get Token** button in the top right of the Explorer.
3. Choose the option **Get User Access Token**.
4. In the following dialog, don't check any boxes; just click the blue **Get Access Token** button.
5. You'll see a Facebook Login Dialog; click **OK** to proceed.
6. You should now see your **Access Token** filled in the Explorer.


**NOTE: This is a short-term token. It will expire in about 2 hours.**



In [1]:
ACCESS_TOKEN = "YOUR_TOKEN"

### Make the first request

Next, download the SDK.

In your terminal, run the following command to install **facebook-sdk** using pip:
```
pip install facebook-sdk
```
Once this succeeds, you should be able to import the SDK in Python.

Try the following code; it should not produce any warnings.

**NOTE: the facebook-sdk package only supports Graph API versions up to 2.7.**


In [16]:
import facebook
import json, requests, time, io, string
import pandas as pd
import numpy as np
import nltk
from collections import Counter
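import sklearn
# import the submodules used later explicitly; `import sklearn` alone does not guarantee they are loaded
import sklearn.svm
import sklearn.feature_extraction.text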

### Work with the Graph

Let's first explain some basic concepts.

The Graph API is named after the idea of a 'social graph' - a representation of the information on Facebook composed of:

* **nodes** - basically "things" such as a User, a Photo, a Page, a Comment
* **edges** - the connections between those "things", such as a Page's Photos, or a Photo's Comments
* **fields** - info about those "things", such as a person's birthday, or the name of a Page
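
To make these three concepts concrete, here is a minimal sketch of how a node, an edge, and a list of fields might map onto a request URL. The helper name `build_graph_url` is hypothetical and not part of any SDK, the base URL is the v2.7 endpoint used later in this tutorial, and no request is actually sent (the access token is left out on purpose).

```python
try:  # Python 3
    from urllib.parse import urlencode
except ImportError:  # Python 2
    from urllib import urlencode

BASE_URL = "https://graph.facebook.com/v2.7/"

def build_graph_url(node_id, edge=None, fields=None):
    """Compose a Graph API URL for a node, optionally one of its edges (hypothetical helper)."""
    # An edge is always addressed through the node it belongs to.
    path = node_id if edge is None else node_id + "/" + edge
    url = BASE_URL + path
    if fields:
        # Fields are passed as one comma-separated query parameter.
        url += "?" + urlencode({"fields": ",".join(fields)})
    return url

node_url = build_graph_url("me", fields=["id", "name"])
edge_url = build_graph_url("me", edge="friends")
```

Here `node_url` addresses a node with two fields, while `edge_url` addresses the `friends` edge of the same node.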

#### Nodes

Each node has a unique **ID**, which is used to access it via the Graph API.
In the SDK, you can get any object by using the `get_object()` method.

If you want to get multiple objects at once, you can also use the `get_objects()` method. **NOTE: the maximum number of objects per call is 50.**

Below is an example of how to get your own User object.

In [20]:
graph = facebook.GraphAPI(access_token=ACCESS_TOKEN, version='2.7')
me = graph.get_object(id='me') # NOTE, in graph api, 'me' is an alias of current token owner's id

print me

{u'name': u'Yu Tianxin', u'id': u'453142421495400'}


#### Edges

Edges don't have IDs of their own; an edge can only be queried through a node. For example, you can query for all your friends using the following code.

In [22]:
friends = graph.get_connections(id='me', connection_name='friends')
print friends['summary']['total_count']
print friends

144
{u'paging': {u'cursors': {u'after': u'QVFIUm1ZALUt6aXMxNFJHSEpqaTd4ZA0h6MThsWFZAldzEtbjJBTmhJSTBEYlRud05mVGNIc0VReUxySjFEUkhtUXJhS1NITi1GaWstdG1QVkt6TGR3WDY2ZAGRB', u'before': u'QVFIUm1ZALUt6aXMxNFJHSEpqaTd4ZA0h6MThsWFZAldzEtbjJBTmhJSTBEYlRud05mVGNIc0VReUxySjFEUkhtUXJhS1NITi1GaWstdG1QVkt6TGR3WDY2ZAGRB'}}, u'data': [{u'name': u'Muyang Li', u'id': u'1169501549760530'}], u'summary': {u'total_count': 144}}


### Get a friends list

As you can see in the example above, the output is paginated. So the final part of this section is to handle pagination and get the full list of your friends' IDs.

In [36]:
def get_all_connections(graph, id, connection_name, limit=100):
    """ Follow pagination cursors until `limit` items are collected or no page is left. """
    results = []
    after = ''
    while True:
        # each request asks for up to 100 items; `after` is the cursor returned with the previous page
        response = graph.get_connections(id=id, connection_name=connection_name, after=after, limit=100)
        results.extend(response['data'])
        if len(results) >= limit:
            break
        # a 'next' link is only present when another page exists
        if 'next' not in response.get('paging', {}):
            break
        after = response['paging']['cursors']['after']
    # never return more items than requested
    return results[:limit]
            

def get_friends(graph, id):
    return get_all_connections(graph, id, 'friends')
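
To see the pagination logic in isolation, the sketch below runs the same kind of cursor-following loop against a stub instead of the real SDK. `StubGraph` and `collect` are hypothetical names introduced only for this illustration, and the stub's cursor format (an offset encoded as a string) is an assumption; real Graph API cursors are opaque strings.

```python
class StubGraph(object):
    """Stand-in for the SDK client (an assumption for this sketch): serves items in pages of 2."""
    def __init__(self, items, page_size=2):
        self.items = items
        self.page_size = page_size

    def get_connections(self, id, connection_name, after='', **kwargs):
        # In this stub the cursor is simply the start offset encoded as a string.
        start = int(after) if after else 0
        end = start + self.page_size
        paging = {'cursors': {'after': str(end)}}
        if end < len(self.items):
            paging['next'] = 'stub://next-page'  # present only while more data remains
        return {'data': self.items[start:end], 'paging': paging}

def collect(graph, id, connection_name, limit=100):
    """Cursor-following loop written against any object with get_connections()."""
    results, after = [], ''
    while len(results) < limit:
        response = graph.get_connections(id=id, connection_name=connection_name, after=after)
        results.extend(response['data'])
        paging = response.get('paging', {})
        if 'next' not in paging:
            break  # last page reached
        after = paging['cursors']['after']
    return results[:limit]

stub = StubGraph([{'id': str(i)} for i in range(5)])
all_items = collect(stub, 'me', 'friends')
first_three = collect(stub, 'me', 'friends', limit=3)
```

With five items and a page size of two, the loop needs three requests to collect everything and stops early when a smaller `limit` is given.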

In [37]:
print len(get_friends(graph, 'me'))

1


It turns out that since Graph API v2.0, Facebook no longer provides a user's full friends list, along with many other kinds of data that involve user privacy.

Keep this in mind when collecting data via a public API: the information you can get is very likely to be limited. It's always a good idea to check what each data-collection approach can actually deliver at the very beginning, because this might change the methodology you choose to tackle a problem, or even change your goal completely.
(This is the author's sad story of why the topic of this tutorial changed from exploring the social graph to Page classification.)






## Tell Me Your Story

The goal of this tutorial is to predict a public Page's genre using only publicly available content.
In order to do that, we need to get all the posts from a given Page.
We also need a decent number of different kinds of Pages to avoid underfitting the machine learning model. While collecting them, we should also consider adding diversity to our training data.

### Get Posts from Page
In this section, we use the SDK to get the posts of a given Page.
As mentioned earlier, we need to handle privacy-related limitations here: if a Page has no public Posts, we simply return `None`.

In [39]:
def get_posts(graph, page_id, limit=100):
    """ get the most recent Posts of the Page with the given ID.
    Inputs:
        graph: Facebook SDK instance
        page_id: str: the id of the requested Page
        limit: total number of Posts required.
        
    Outputs:
        pd.DataFrame: dataframe containing all the required posts, have fields ['created_time', 'id', 'message']
    """
    posts = get_all_connections(graph, page_id, 'posts', limit=limit)
    df = pd.DataFrame(posts)
    if df.shape[0]:
        return df.drop('story', axis=1, errors='ignore')
    else:
        return None

posts_taylor_swift = get_posts(graph, 'TaylorSwift')
print posts_taylor_swift.head()
print len(posts_taylor_swift)

               created_time                             id  \
0  2016-11-03T17:00:00+0000  19614945368_10154139688090369   
1  2016-11-03T17:00:00+0000  19614945368_10154139689400369   
2  2016-11-03T14:48:45+0000  19614945368_10154137185290369   
3  2016-11-03T04:27:16+0000  19614945368_10154136263430369   
4  2016-11-01T19:09:46+0000  19614945368_10154132476040369   

                                             message  
0  That post-show apartment hangggg\nKelsea Balle...  
1                                                Lol  
2  Congratulations on CMA Vocal Group of the Year...  
3                              HI #1 Little Big Town  
4  Feeling really honored that a band I've loved ...  
100


### Search For All Public Pages
The second step is to get a list of public Pages. It's impractical to compile a good list containing tens of thousands of Pages manually.

After digging into the Graph API documentation (https://developers.facebook.com/docs/graph-api/reference), one feasible solution is to use the `Search` endpoint to get a list of Pages for a given keyword.

However, the SDK doesn't support the `Search` endpoint, so we have to write the request ourselves.
This also gives us a deeper understanding of how the Graph API works.

In [44]:
class GraphAPI:
    def __init__(self, token):
        """ create GraphAPI instance
        Inputs:
            token: str: Graph API token
        """
        self.BASE_URL = "https://graph.facebook.com/v2.7/"
        self.token = token   
    
    def node_request(self, url, **payload):
        """ Make GraphAPI node request
        Inputs:
            url: str: node endpoint
            payload: dict: custom parameter passing to Graph API
        """
        payload['access_token'] = self.token
        r = requests.get(self.BASE_URL + url, params=payload)
        return json.loads(r.text)
    
    def connection_request(self, url, limit=100, **payload):
        """ Make GraphAPI connection request
        Inputs:
            url: str: node endpoint
            limit: int: total number of connections required
            payload: dict: custom parameter passing to Graph API
        """
        payload['access_token'] = self.token
        # this client asks for at most 100 items per request; larger limits are served by following 'next' links
        payload['limit'] = min(limit, 100)
        r = requests.get(self.BASE_URL + url, params=payload)
        response = json.loads(r.text)

        result = []
        result.extend(response.get('data', []))
        while len(result) < limit:
            # stop when there is no further page
            if 'next' not in response.get('paging', {}):
                break
            response = json.loads(requests.get(response['paging']['next']).text)
            result.extend(response.get('data', []))
        return pd.DataFrame(result)

api = GraphAPI(ACCESS_TOKEN)
taylor = api.node_request('TaylorSwift', fields=['category'])
artists = api.connection_request('search', type='page', q='Artist', limit=200)
print taylor
print len(artists)
print artists.head()


{u'category': u'Musician/Band', u'id': u'19614945368'}
200
                id                                               name
0  105799462785959                                             Artist
1  105453336153572                                           Artistic
2  100133933802825                                             Artist
3  119467874799527                   Scoala de make-up Ramona Chirita
4  221818141301837  MALI Tattoo Studio รับสักลาย แก้ลายสัก ออกแบบล...
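
A failed request (for example, one made with an expired token) does not return a `data` list; the Graph API answers with a JSON object that has an `error` key instead. The sketch below shows one way a decoded response could be checked before it is used. `GraphRequestError` and `extract_data` are hypothetical names introduced for illustration, and the two sample dictionaries are made up rather than real API output.

```python
class GraphRequestError(Exception):
    """Hypothetical exception type for this sketch; not part of the facebook-sdk package."""

def extract_data(response):
    """Return (items, next_url) from a decoded Graph API response, raising on an error payload."""
    if 'error' in response:
        error = response['error']
        raise GraphRequestError("%s (type=%s, code=%s)" % (
            error.get('message'), error.get('type'), error.get('code')))
    items = response.get('data', [])
    next_url = response.get('paging', {}).get('next')  # None when this is the last page
    return items, next_url

# Made-up sample payloads, shaped like a successful page and an error response.
ok = {'data': [{'id': '1', 'name': 'Artist'}],
      'paging': {'next': 'https://example.invalid/next'}}
expired = {'error': {'message': 'token expired', 'type': 'OAuthException', 'code': 190}}

items, next_url = extract_data(ok)
```

Calling `extract_data(expired)` raises `GraphRequestError`, which makes a token problem visible immediately instead of surfacing later as a missing key.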


### Build the training and evaluation table
With the help of these APIs, we are now able to collect all the data we need.

First, we make some search queries to get an ideally random list of public Pages.
Then, we get the posts from each of these Pages.
Finally, we get each Page's category, which is later used to generate the labels for the training data.

In [46]:
keyword_list = ['Artist', 'Musician', 'Trump', 'Hillary', 'PHP', 'Data Science', 'NBA', 'Alpine Ski', 'Scuba Diving', 'Computer']

def get_all_pages_and_posts(graph, api, keyword_list, limit=1000):
    """ get all the pages and related posts from Graph API using `Search` endpoint given keyword_list
    Inputs:
        graph: Facebook SDK instance
        api: custom GraphAPI instance
        keyword_list: keywords list used to perform Search for Pages
        limit: number of Pages required for each keyword
    Output:
        all_pages: array: list of tuple(page_id, page_name)
        all_posts: dictionary(page_id, pd.Dataframe)
    """
    all_pages = []
    all_posts = {}
    for keyword in keyword_list:
        pages = api.connection_request('search', type='page', q=keyword, limit=limit)
        posts = {id: get_posts(graph, id) for id in pages['id']}
        all_pages.append(pages)
        all_posts.update(posts)
    return all_pages, all_posts

def get_page_category(graph, page_ids):
    """ Look up the category of each Page, 50 ids per request
    Inputs:
        graph: Facebook SDK instance
        page_ids: list(str): ids of the requested Pages
    Output:
        list(str): category of each Page in the same order as page_ids, 'Unknown' when not available
    """
    result = {}
    for i in range(0, len(page_ids), 50):
        # get_objects() accepts at most 50 ids per call
        pages = graph.get_objects(ids=page_ids[i:i+50], fields='category')
        result.update({id: page.get('category', 'Unknown') for id, page in pages.items()})
    return [result.get(id, 'Unknown') for id in page_ids]

def collect_training_data(pages_file, posts_file, graph, api, keyword_list):
    """ Wrap up all the previous function to collect data from Graph API. 
        If local files provided, simply load from them.
    Input:
        pages_file: file path to pages content
        posts_file: file path to posts content
        graph: Facebook SDK instance
        api: custom GraphAPI instance
        keyword_list: keywords list used to perform Search for Pages
    Output:
      df_pages: pd.Dataframe. containing fields ['id', 'name', 'category']
      df_posts: pd.Dataframe. containing fields ['id', 'message']    
    """
    if pages_file and posts_file:
        # load previously collected data from the given file paths
        with open(pages_file, 'r') as csv:
            df_pages = pd.DataFrame.from_csv(csv, encoding='UTF-8')

        with open(posts_file, 'r') as csv:
            df_posts = pd.DataFrame.from_csv(csv, encoding='UTF-8')

        return df_pages.reset_index(), df_posts.reset_index()
    else:
        all_pages, all_posts = get_all_pages_and_posts(graph, api, keyword_list)
        df_pages = pd.concat(all_pages)
        # the Page ids live in the 'id' column, not in the index
        df_pages = df_pages.assign(category=get_page_category(graph, [str(id) for id in df_pages['id']]))
        
        posts = [[id, posts.iloc[i]['message']] for id, posts in all_posts.items() if posts is not None for i in range(len(posts)) if 'message' in posts.iloc[i]]
        df_posts = pd.DataFrame(posts, columns=['id', 'message'])
        return df_pages, df_posts

df_pages, df_posts = collect_training_data('all_pages.csv', 'all_posts.csv', graph, api, keyword_list)

print df_pages.head()
print df_posts.head()

                id                                      name  \
0  105799462785959                                    Artist   
1  354577571346409               Artist in Residence Program   
2  105453336153572                                  Artistic   
3  906740039387472                     Artistically Speaking   
4  348604348593352  Associazione Artisti di Strada di Milano   

                 category  
0              Profession  
1  Community Organization  
2                Interest  
3           Event Planner  
4               Community  
                id                                            message
0  283319135042170  اعزاءنا الطلبة \r\nللاطلاع على القاعات الخاصة ...
1  283319135042170  إعلان هام لجميع الطلبة\r\nننوه للطلبة الأعزاء ...
2  283319135042170  الجامعة تستقبل الدكتور سايمون غالبين المدير ال...
3  283319135042170  الجامعة تستقبل الدكتور سايمون غالبين المدير ال...
4  283319135042170  الجامعة تستقبل الدكتور سايمون غالبين المدير ال...


## I Know It!
In this section, we use the collected data to train a model that predicts the genre of a Page from the content of its public Posts. Since there are lots of possible categories in the data, predicting each of them would be too much work. To simplify the problem, we instead predict only whether a given Page is IT-related.

After a quick investigation of the collected data, we treat a Page as IT-related if its category is in the following list:
* Computers/Technology
* Computer Company
* Internet/Software
* Computers
* Software
* Internet Company
* Computers/Internet Website


### Data pre-processing
The first step is to pre-process the data to remove unusable rows. That includes:
1. removing all the pages with no posts
2. removing all the non-English posts (approximated here by dropping every post that contains non-ASCII characters)
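
The ASCII check is only a rough proxy for "English". The small sketch below, written so that it also runs on Python 3, shows what such a filter keeps and drops; the sample strings are taken from the post data shown in this tutorial, and the one-line `is_ascii` here is an illustrative variant rather than the function used in the notebook.

```python
def is_ascii(message):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in message)

samples = [
    u"GAME TIME!! Houston Rockets - NY Knicks",           # English
    u"Tutto pronto per San Antonio Spurs - Miami Heat!",  # Italian, but pure ASCII
    u"\u0627\u0644\u062c\u0627\u0645\u0639\u0629",         # Arabic
]
kept = [s for s in samples if is_ascii(s)]
```

The Arabic sample is dropped, but the Italian one survives because it happens to contain only ASCII characters, so some non-English text will remain after this step.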

In [47]:
def pre_process_data(df_posts):
    """ remove posts that can't be used for training
    Input:
        df_posts: pd.DataFrame. posts dataset containing field `message`
    Output:
        pd.DataFrame: cleaned dataset with same schema
    """
    def is_ascii(message):
        # a message counts as usable only if it is pure ASCII
        try:
            message.decode('ascii')
            return True
        except (UnicodeError, AttributeError):
            return False
        
    df = df_posts[pd.notnull(df_posts['message'])]
    df = df[df['message'].apply(is_ascii)]
    df = df.reset_index(drop=True)
    return df

print len(df_posts)
posts = pre_process_data(df_posts)
print posts.head()
print len(posts)

305967
                 id                                            message
0  1507446969543009            GAME TIME!! Houston Rockets - NY Knicks
1  1507446969543009  Dal caldo al freddo....da Manu Ginobili al Bar...
2  1507446969543009   Tutto pronto per San Antonio Spurs - Miami Heat!
3  1507446969543009  Vista la giornata nuvolosa, relax e shopping s...
4  1507446969543009                                       EXTRA GAME!!
150326


### Extract features
Now that we have pre-processed the raw data, we can use Natural Language Processing (NLP) techniques to create the features for training the machine learning model.

In this section, we convert all the post content into a list of tokens for each page. For each token, we need to:
1. convert the token to lowercase
2. convert the token to its lemmatized form. Note that this does not remove non-English words: tokens the lemmatizer does not recognize are kept unchanged.
3. handle punctuation as follows: (a) an apostrophe of the form `'s` is dropped together with the `s`; (b) all other apostrophes are removed; (c) the word is broken at every other punctuation character
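
The case and punctuation rules can be tried out without NLTK. The following is a minimal sketch that covers only steps 1 and 3 (lemmatization is deliberately left out, so `rockets` stays plural); `normalize` is a hypothetical helper name, and it splits on whitespace instead of using a tokenizer.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")

def normalize(text):
    """Lowercase, drop 's, remove other apostrophes, and break words at remaining punctuation."""
    text = text.lower()
    text = text.replace("'s", "").replace("'", "")
    text = PUNCTUATION.sub(" ", text)
    return text.split()

tokens = normalize("GAME TIME!! Houston Rockets - NY Knicks")
more_tokens = normalize("Taylor's fans don't use well-known words")
```

In `more_tokens`, `Taylor's` becomes `taylor`, `don't` becomes `dont`, and `well-known` is split into two tokens.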


In [49]:
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
    Outputs:
        list(str): tokenized text
    """
    text = text.lower()
    text = text.replace("'s", '').replace("'", '')
    replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
    text = text.translate(replace_punctuation)
    tokens = nltk.word_tokenize(text)

    result = []
    for token in tokens:
        try:
            res = lemmatizer.lemmatize(token)
            result.append(res)
        except:
            continue
    return result

print process('GAME TIME!! Houston Rockets - NY Knicks')

['game', 'time', 'houston', u'rocket', 'ny', 'knicks']


With the help of this function, we could now generate a list of tokens for each Page.

In [50]:
def generate_tokens(df_pages, df_posts):
    """ Generate tokens for each page.
    Input:
        df_pages: pd.DataFrame. pages dataset containing 'id' field
        df_posts: pd.DataFrame. Posts dataset containing 'id' and 'message' field.
    Output:
        pd.DataFrame. dataset with row per page 'id'
    """
    result = {}
    for id, indices in df_posts.groupby('id').groups.items():
        token = [process(str(df_posts.iloc[index]['message'])) for index in indices]
        token = np.concatenate(token)
        result[id] = token
    
    data = pd.merge(pd.DataFrame().assign(id=result.keys(), message=result.values()), df_pages, on='id', how='left')
    return data

data = generate_tokens(df_pages, posts)
print data.head()


                 id                                            message  \
0   542952449048579  [by, matt, beck, a, todos, nuestros, clientes,...   
1      129650712580  [the, ieee, computer, society, and, acm, have,...   
2   361468313944069  [from, eric, trump, twitter, erictrump, riding...   
3   178787235479558  [we, are, really, thrilled, to, introduce, kni...   
4  1472341039718413  [admiring, that, contour, yo, bts, of, this, b...   

                                    name                   category  
0         Hillary Salon Peluqueria & Spa  Spas/Beauty/Personal Care  
1  Computing Now - IEEE Computer Society                    Website  
2            Trump Tower at Century City                Real Estate  
3                             JoomShaper       Computers/Technology  
4                              Makeuport       Professional Service  


When doing prediction based on natural language, you almost certainly don't need every possible word.
Some words are so common that they add no value to our model, like stopwords. Others are so rare that they are most likely typos.
The NLTK package provides a list of stopwords we can borrow. In the following section, we build a list of rare words, defined as words that occur only once.
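
As a small illustration of the "occurs only once" rule, the sketch below counts tokens across a toy corpus. The token lists are invented for this example and are not taken from the collected data.

```python
from collections import Counter

# Toy corpus: one token list per page (invented data).
pages_tokens = [
    ['php', 'code', 'release'],
    ['nba', 'game', 'code'],
    ['nba', 'relaese'],   # a typo that appears only once
]

# Count every token across all pages, then keep those seen exactly once.
counter = Counter(token for tokens in pages_tokens for token in tokens)
rare = sorted(word for word, count in counter.items() if count == 1)
```

On such a tiny corpus the rule also flags legitimate words like `php` and `game`; it only becomes a reasonable typo filter once the corpus is large.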

In [51]:
def get_rare_words(data):
    """ use the word count information across all posts in training data to come up with a feature list
    Inputs:
        data: pd.DataFrame: the output of generate_tokens() function
    Outputs:
        list(str): list of rare words
    """
    token_list = [token for tokens in data['message'] for token in tokens]
    counter = Counter(token_list)
    return [k for k,v in counter.iteritems() if v == 1]

rare_words = get_rare_words(data)
print len(rare_words) 

91644


Now we can create a feature matrix with one row per page using `sklearn.feature_extraction.text.TfidfVectorizer`.
The `sklearn` package contains lots of useful APIs for machine learning.
The TF-IDF technique converts natural language words into numbers a machine can work with.
Each number also reflects the **importance** of a word: it grows with how often the word appears in a page's posts and shrinks with how many pages use the word.

You can find more information about TF-IDF here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
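
To show what the numbers mean, here is a hand-rolled sketch of the weighting in plain Python. It assumes the variant that scikit-learn documents as its default (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); `tfidf_rows` is a hypothetical helper and the toy documents are invented.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, using smoothed idf and L2 normalization."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * (math.log((1.0 + n) / (1.0 + df[t])) + 1.0) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [
    ['php', 'code', 'code'],
    ['nba', 'game', 'code'],
    ['nba', 'score'],
]
rows = tfidf_rows(docs)
```

In the second document every term occurs once, yet `game` gets a higher weight than `code` because `code` also appears in another document.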

In [53]:
def create_features(data, rare_words):
    """ creates the feature matrix using the Page posts
    Inputs:
        data: pd.DataFrame: Page posts collected above
        rare_words: list(str): the output of get_rare_words() function
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
        scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.extend(rare_words)
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=stopwords)

    # fit the vocabulary and compute the sparse TF-IDF matrix, one row per page
    all_tokens = [' '.join(tokens) for tokens in data['message']]
    freq_matrix = vectorizer.fit_transform(all_tokens)
    return (vectorizer, freq_matrix)

(tfidf, X) = create_features(data, rare_words)
print X.shape

(3619, 72898)


### Create Labels
The previous section converted all the post content into numbers we can use to train the model.
But we are missing one important part here: the label.

As we are solving a classification problem, we need to assign each sample a label, the class that serves as the training target.
It's a simple process here. For each sample, if its category is in the IT-related category list defined above, it gets the label 1, otherwise 0.

In [None]:
def create_labels(data):
    """ create label data.
    Input:
        data: pd.Dataframe. containing field 'category'
    Output:
        arraylike: a list of 1 or 0. 1 indicating the corresponding sample in data is IT-related
    """
    it_category = ['Computers/Technology', 'Computer Company', 'Internet/Software', 'Computers', 'Software',
                  'Internet Company', 'Computers/Internet Website']
    return np.array([1 if category in it_category else 0 for category in data['category']])

y = create_labels(data)

### Finally, Learning!

Finally, we can train our machine learning model.
The model we choose is the Support Vector Machine (SVM); you can find a descriptive introduction here: https://en.wikipedia.org/wiki/Support_vector_machine

Implementing an SVM is not very complicated, but we don't need to reinvent the wheel for now.
With the help of `sklearn`, we can easily train an SVM classifier using
```
clf = sklearn.svm.SVC(kernel=kernel)
clf.fit(X,y)
```

However, you need to keep one thing in mind: **the data we collected is strongly correlated with its order**.
This is because we used keywords to search for Pages, so neighboring Pages are mostly about the same topic.
To solve this problem, we need to shuffle our data before using it.
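
A shuffled train/validation split can be sketched with the standard library alone. `shuffled_split` is a hypothetical helper, and the fixed seed is an assumption added here so that the split is reproducible.

```python
import random

def shuffled_split(n_samples, train_fraction=0.8, seed=0):
    """Shuffle the sample indices, then cut them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the same split can be recreated
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, val_idx = shuffled_split(10)
```

Because the indices are shuffled before cutting, Pages found by the same keyword no longer end up next to each other in only one of the two parts.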



In [65]:
def learn_classifier(X, y, kernel='linear'):
    """ learns a classifier from the input features and labels using the kernel function supplied
    Inputs:
        X: scipy.sparse.csr.csr_matrix: sparse matrix of features
        y: numpy.ndarray(int): dense binary vector of class labels
        kernel: str: kernel function to be used with classifier. [linear|poly|rbf|sigmoid]
    Outputs:
        sklearn.svm.classes.SVC: classifier learnt from data
    """
    clf = sklearn.svm.SVC(kernel=kernel)
    clf.fit(X, y)
    return clf


Now we can evaluate the accuracy of our model.
Note that we split our data into two parts: training and validation.

This is because a model can overfit: it can fit the training data so closely that it predicts the training data almost perfectly but no longer generalizes to unseen data.

So it's always a good idea to hold back a portion of your data from training and use it only for evaluating the performance of your model.

In [71]:
def evaluate_classifier(classifier, X_validation, y_validation):
    """ evaluates a classifier based on a supplied validation data
    Inputs:
        classifier: sklearn.svm.classes.SVC: classifier to evaluate
        X_validation: scipy.sparse.csr.csr_matrix: sparse matrix of features
        y_validation: numpy.ndarray(int): dense binary vector of class labels
    Outputs:
        double: accuracy of classifier on the validation data
    """
    # predict all samples in one call and compare against the true labels
    predictions = classifier.predict(X_validation)
    return float(np.sum(predictions == y_validation)) / X_validation.shape[0]


N = len(y)
perm = np.random.permutation(range(N))
training_data_indices = perm[:int(N * 0.8)]
validation_data_indices = perm[int(N * 0.8):]
X_train = X[training_data_indices]
y_train = y[training_data_indices]
X_validation = X[validation_data_indices]
y_validation = y[validation_data_indices]

classifier = learn_classifier(X_train, y_train, 'linear')
accuracy = evaluate_classifier(classifier, X_train, y_train)
print accuracy

accuracy = evaluate_classifier(classifier, X_validation, y_validation)
print accuracy

0.968221070812
0.911602209945


As you can see from the output, the model's accuracy is very high on the training data (about 96.8%) and drops a little on the validation data (about 91.2%).



## Last but Not Least
After training the first model, it's always a good idea to play around with different parameters.
For example, in this model, we could change the kernel function to see which one works best for us: https://en.wikipedia.org/wiki/Kernel_method

Also, we could use a technique called cross-validation to get a more reliable estimate of the accuracy of our classifier: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
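
The idea behind k-fold cross-validation can be sketched as an index generator: every sample is used for validation exactly once, and the accuracy is then averaged over the folds. `k_fold_indices` is a hypothetical helper written in plain Python, and it assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=3))
```

For each fold, one would train a classifier on the `train` indices, evaluate it on the `validation` indices, and report the mean of the resulting accuracies.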

Best of all, it's highly recommended to take the 15-688 class if you can, to get a much better understanding of all the content mentioned in this tutorial :P

Due to time limitations, we could not dig further into these topics.

**Disclaimer**: some of the ideas in this tutorial are taken from the 15-688 class homework, but all the content and code are original.
