# Using Text Classification to identify UGbots from the 2016 Ugandan Election
We live in a time where `fake news` is becoming more and more prevalent, with users not really sure how much stock they should put in what they read online. It is important for social media users to be able to identify genuine information from biased information. This presents a unique opportunity to apply data science (i.e. Text Classification) to help with this identification. 

One good case for this was the 2016 Ugandan Presidential Election. It saw the increase of Social Media use in campaigns by politicians. From YouTube videos to Facebook Live sessions and trending hashtags and posts on Twitter. One interesting phenonmenon that rose from this election is the UGBot. UGBots are the term given to 'fake' accounts created in order to push the agenda of one candidate. These account are in no way automated, however borrow the 'bot' name due to the fact that their content during the election period was scripted, with several of them tweeting the exact same content at the same time. One challenge that came up was some of the accounts have since been suspended by Twitter. 

Twitter in particular saw a proliferation of these UGBot accounts, therefore this tutorial will focus on extracting Twitter data in order to identify these accounts. We shall be using the [Twitterscraper](https://github.com/taspinar/twitterscraper) to gather data from crucial points during the election period that saw a significant increase in Twitter traffic.  

### Tutorial Content
In this tutorial we shall cover the following
- [Installations](#Installation)
- [Data Pre-Processing: Collection, Conversion and Cleaning](#Data-Pre-processing:-Collection,-Conversion-and-Cleaning)
- [Text Processing](#Text-Processing)
- [Feature Extraction](#Feature-Extraction)
- [Classification](#Classification)



## Installation

To get started, you will need to set up your computer with the approporiate libraries. It is also recommended to use virtual environments. I used conda to set mine up on my machine. 

    $ conda create -new myenv python=3.6
    
    $ source activate myenv
    
Once the virtual environment is created and activated, you can proceed to install any environments that did not already come preinstalled with `conda`. To install Twitterscraper to get access to the tweets

    $ sudo pip install twitterscraper

Below are libraries that are needed to successfully complete this tutorial - Also included is how to install the libraries in case they were not included in your version of anaconda. 
- nltk: `conda install -c anaconda nltk`
- sklearn: `conda install -c anaconda scikit-learn`
- pandas: `conda install -c anaconda pandas`
- numpy:  `conda install -c anaconda numpy`
- scipy: `conda install -c anaconda scipy`

Two other libraries that will be important in this tutorial are: `json`, `Counter` and `string`

In [1]:
import nltk as nl
import sklearn as sk
import pandas as pd
import numpy as np
import scipy as sp

import json
import csv
import string
from collections import Counter

## Data Pre-processing: Collection, Conversion and Cleaning

After installing the libraries, we are now ready to download the data from Twitter and begin getting it ready for classification. 

Using the Twitterscraper, we shall download Twitter data from 10 accounts. 5 of the accounts will be known bot accounts and the other 5 accounts belonging to media personalities (i.e. Journalists). 

We shall be extracting tweets from the following 10 accounts: `@AnishaUwase, @SarahKagingo, @JonahByaru1, @CynthiaNyamai, @MtwahaN, @kacungira, @kasujja, @qataharraymond, @jkkarungi, @Snduhukire`

We shall specifically be targeting tweets from `02-11-2016` to `02-20-2016` because this time period coincides with the 2nd ever presidential debate as well as the election day. 

To extract the data, we shall be using TwitterScraper in command line:

    $ twitterscraper "from:twitter-handle" -l 1000 -bd 2016-02-11 -ed 2016-02-20 -o anisha.json

*replace `twitter-handle` with the user's twitter handle

_____________________

From this data, we are interested in the `user` and `text` columns - where text is the content of the tweet. 

The goal of this tutorial is to determine which accounts are the bot accounts, as our ground truth, we shall be using the user's twitter handle which is labelled `user`.

- @AnishaUwase, @SarahKagingo, @JonahByaru1, @CynthiaNyamai, @MtwahaN: are considered the bot accounts
- @kacungira, @kasujja, @qataharraymond, @jkkarungi, @Snduhukire: are considered to be users posting without influence or bias

The data that is extracted is stored as json file, we shall read these files in, extract the columns that are relevant to our classification problem and then convert the result to a csv. 

Access to all the json (and csv) files used in this tutorial can be downloaded from [here](https://cmu.box.com/s/t52gfcczaswab429ndn1f1m95blv66mr). Ensure that these are saved in the same folder as your notebook file.

In [2]:
all_files=['ray_1','anisha_1','byaru_1','cynthia_1','kacungira_1','kasujja_1','karungi_1','nduhu_1','sarah_1','twaha_1']

for file in all_files:
    
    data = json.load(open('{}.json'.format(file)))
    row_data =[]
    for each in data:
        row = {}
        row['user'] =each['user']
        row['text'] = each['text']
        row_data.append(row)
    
    file = open( "{}.csv".format(file), "w")
    fileWriter = csv.writer(file , delimiter=",",quotechar='"', quoting=csv.QUOTE_MINIMAL)
    headers = ['user','text']
    fileWriter.writerow(headers)

    for rows in row_data:
        output = [rows['user'], rows['text']]
        fileWriter.writerow(output)
    file.close()

After the csv conversion - we can then merge the user handles and tweets into one csv document. We use the RAND() function to randomize the order in which the data is displayed. 

The last step in getting our data ready will be to split into 3 chunks that we shall use in our classificaton, these are the `train-tweets`, `test-tweets` and `validation-tweets` - Access to these files can also be found [here](https://cmu.box.com/s/t52gfcczaswab429ndn1f1m95blv66mr).

## Text Processing

For this section, we shall used the `train-tweets` data. Using the NLTK package that we previously imported, we shall clean the data that we have collected. 

So to process all the text, we are going to split the methods we use into two. While this may appear not to be necessary, it will reduce how complex it would be to try and fit everything into one function (You can also view this as an attempt to apply the laws of object oriented programming). 

The first function will be `tokenize_clean` and it will focus on tokenizing the a string of text and also perfomring a number of tasks.

Here are the following tasks that we shall perform on our tweet content:
1. Change all content to lower case
2. Process all the punctuation, i.e 
    - Case 1: apostrophe where she's becomes she
    - Case 2: apostrophe where don't becomes dont
    - Case 3: hypen will cause a break in the work

Note: It is usefule to check and see if your nltk download is working. To do so, I verified by downloading the stopwords. 

```python
    >>> nl.download('stopwords')
    >>> nl.download('wordnet')
```

In [3]:
def tokenize_clean(text, lemmatizer=nl.stem.wordnet.WordNetLemmatizer()):
    """
    This function will take a string of text and clean it according to the requirements set above.
    We shall be cleaning out punctuation, i.e. apostrophe's and hyphens. 
    
    Inputs: text (str) - the raw text of the tweet content
    Outputs: list (str) -  tokenized version of the text
    
    """
    text = str(text.lower())
    
    if "'s" in text:
        text = text.replace("'s", "")
    if "'" in text:
        text = text.replace("'", "")
    
    text_punc = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    clean_text = text.translate(text_punc)
    tokens = nl.word_tokenize(clean_text)
    final=[]
    
    for token in tokens:
        try:
            word = lemmatizer.lemmatize(token)
            final.append(word)
        except:
            continue
            
    return final

We now need to test out function to see that it is working. It is a good idea to use a simple string of text and not the actual data yet.

```python
   >>> tokenize_clean("Africa is actually not a country. She's a huge land mass that can contain several other continents. Think of a country like Burkina-Faso, wait.. Do we use a dash or not?")

['africa', 'is', 'actually', 'not', 'a', 'country', 'she', 'a', 'huge', 'land', 'mass', 'that', 'can', 'contain', 'several', 'other', 'continent', 'think', 'of', 'a', 'country', 'like', 'burkina', 'faso', 'wait', 'do', 'we', 'use', 'a', 'dash', 'or', 'not']
```

_________________________________

To complete our text processing section, we shall implement one more feature that will bring it all together: this is the `process` function. This function will take in a dataframe as input and an instance of the lemmatize method. The output of the function will be a dataframe whose text column has been converted to a list of strings that have been cleaned by our `tokenize_clean` method.

In [4]:
def process(data, lemmatizer=nl.stem.wordnet.WordNetLemmatizer()):
    """ This function will process all text in the dataframe using tokenize_clean() function.
    Inputs
        data: pd.DataFrame: dataframe containing a column 'text' loaded from the CSV file
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nl.stem.wordnet.WordNetLemmatizer)
    Outputs
        pandas dataframe: dataframe with altered text column - now a list of strings.
    """
    for index, text in enumerate(data['text']):
        data['text'].iloc[index] = tokenize_clean(text, lemmatizer)
    
    return data


In [5]:
tweets = pd.read_csv("train-tweets.csv", na_filter=False)
all_tweets = process(tweets)

Your final output should look like this:

```python
    >>> print(all_tweets.head())
                 user                                               text
    0    SarahKagingo  [president, museveni, i, refused, to, negotiat...
    1    SarahKagingo  [uganda, olivia, byanyima, one, of, lioness, o...
    2         MtwahaN  [in, zimbabwe, a, man, appeared, and, announce...
    3  qataharraymond  [spartakussug, district, returning, officer, d...
    4     Jonahbyaru1  [candidate, museveni, greets, other, candidate...
    ```

## Feature Extraction

Now that we have tokenized tweets, we are ready to generate feature vectors that we shall use in our classification. To do this we shall be using Bag of Words TF-IDf (Term Frequency-Inverse Document Frequency). We define Bag of Words as a which takes as input a set of documents and outputs a table containing the frequency counts of each word in each document ([source](http://datameetsmedia.com/bag-of-words-tf-idf-explained/)) while TF-IDF is the product of the term frequency and the inverse document frequency ([source](http://datameetsmedia.com/bag-of-words-tf-idf-explained/)).

When we think of Bag of Words TF IDF, we are looking at the words that are to be used in the corpus, however, we do not want to include terms that are used very frequently (e.g like the use of articles in sentences) or very rare words. The reason for this is that these two categories of words do not add or represent any information when looking at the similarity of text

Again, like before, we shall be writing two functions that will help us reach our goals. The first function will be used to retreive all the rare words, we shall call this function `all_rare_words`. The logic of this method is to use the frequency with which a word appears in a text to determine whether it is rare or not. 

Our rare_words are then going to help us in the next method called `create_features`. In the create features we shall take this list of rare words and add to it stop words from the NLTK corpus, and then remove these words from our tweet content. We shall use SKLearn's TfidfVectorizer to create a vectorizer that we shall then transform to our corpus.

In [6]:
def retrieve_rare_words(all_tweets):
    """ from the word count of words in our corpus, determine which words are rare and return a list of them.
    
    Inputs:
        all_tweets: pd.DataFrame: the output of process function
    Outputs:
        list(str): list of rare words, sorted alphabetically.
    """
    
    all_tweets = all_tweets['text']
    rare_word_counter = Counter()
    
    for tweets in all_tweets:
        for word in tweets:
            rare_word_counter[word] += 1
    
    rare_words_dict = { word:count for word, count in rare_word_counter.items() if count == 1 }
    all_rare_words = sorted(list(rare_words_dict.keys()))
    return all_rare_words

# AUTOLAB_IGNORE_START
rare_words = retrieve_rare_words(all_tweets)
# AUTOLAB_IGNORE_STOP

When I ran the above method, I get a rare_word list of length: 1558
``` python
    >>> print(len(rare_words))
    1558
```

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

def create_features(all_tweets, rare_words):
    """ creates the feature matrix using the processed tweet text
    Inputs:
        all_tweets: pd.DataFrame: tweets read from train/test csv file, containing the column 'text'
        rare_words: list(str): one of the outputs of retrieve_rare_words function
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
                                                we need this to tranform test tweets in the same way as train tweets
        scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    
    stopwords = nl.corpus.stopwords.words("english")
    corpus = []
    
    for word in rare_words:
        stopwords.append(word)
    
    for row in all_tweets['text']:
        sentence = ' '.join(row)
        corpus.append(sentence)
        
    vectorizer = TfidfVectorizer(stop_words=stopwords, max_features=655)
    vect_corpus = vectorizer.fit_transform(corpus)
    result_matrix = csr_matrix(vect_corpus)
    
    return vectorizer, result_matrix

# AUTOLAB_IGNORE_START
(tfidf, X) = create_features(all_tweets, rare_words)
# AUTOLAB_IGNORE_STOP

The next step that we are going to do is to label our data. We shall do this using the method `create_labels`. For the labels we shall be using 0 or 1. i.e.  0 for 'MtwahaN','JonahByaru1','CynthiaNyamai','AnishaUwase', 'SarahKagingo', and lastly 1 for the rest.

In [8]:
def create_labels(all_tweets):
    """ creates the class labels from user column
    Inputs:
        tweets: pd.DataFrame: tweets read from train file, containing the column 'user'
    Outputs:
        numpy.ndarray(int): dense binary numpy array of class labels
    """
    class_labels = []
    
    for name in all_tweets['user']:
        if name == 'MtwahaN' or name == 'JonahByaru1'or name == 'CynthiaNyamai' or name == 'AnishaUwase'or name == 'SarahKagingo':
            class_labels.append(int(0))
        else:
            class_labels.append(int(1))
    
    return np.asarray(class_labels)

# AUTOLAB_IGNORE_START
y = create_labels(all_tweets)
# AUTOLAB_IGNORE_STOP

When you print out `y` it should look like this:

``` python
    >>> print(y)
    [0 0 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1
 0 1 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
 1 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0 1 0
 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1
 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1
 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 0 0
 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 0 0 1 1 1 0 1 0
 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0
 1 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0
 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 1 1 1
 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1
 1 0 1 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1
 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 1]
```

## Classification

Now that we have the parts ready, we shall be using the Support Vector Machine (SVM) classifier. A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes ([source](http://blog.aylien.com/support-vector-machines-for-dummies-a-simple/)). SVM supports several kernel functions `linear`, `poly`, `rbf`,`sigmoid` - it is good practise try several of them to see which one best fits your date - however, for purposes of this tutorial, we shall be using the `linear` kernal function. 

In [9]:
from sklearn.svm import SVC

def fit_classifier(X_train, y_train):
    """ learns a classifier from the input features and labels using the linear kernal function
    Inputs:
        X_train: scipy.sparse.csr.csr_matrix: sparse matrix of features, output of create_features_and_labels()
        y_train: numpy.ndarray(int): dense binary vector of class labels, output of create_features_and_labels()
    Outputs:
        sklearn.svm.classes.SVC: classifier learnt from data
    """
    X = X_train
    y = y_train
    
    clf = SVC(kernel='linear')
    clf.fit(X, y)
    
    return clf

# AUTOLAB_IGNORE_START
classifier = fit_classifier(X, y)
# AUTOLAB_IGNORE_STOP

Now that we have got our classifier ready, we can now evaluate it against the second batch of data that we set aside: the `validation-tweet` dataset. We shall measure using accuracy metric (although there are several other metrics that can be used to measure including precision, recall, etc). 

With accuracy we are looking at the fraction of how many of the tweets are correctly classified

In [10]:
# Prepare our validation data
validation_tweets = pd.read_csv("validation-tweets.csv", na_filter=False)
validation_all_tweets = process(validation_tweets)
validation_rare_words = retrieve_rare_words(validation_all_tweets)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

def create_validation_features(tfidf, all_tweets, rare_words):
    """ creates the feature matrix using the processed tweet text
    Inputs:
        tweets: pd.DataFrame: tweets read from train/test csv file, containing the column 'text'
        rare_words: list(str): one of the outputs of retrieve_rare_words function
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
                                                we need this to tranform test tweets in the same way as train tweets
        scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    
    stopwords = nl.corpus.stopwords.words("english")
    corpus = []
    
    for word in rare_words:
        stopwords.append(word)
    
    for row in all_tweets['text']:
        sentence = ' '.join(row)
        corpus.append(sentence)
    
#     vectorizer = TfidfVectorizer(stop_words=stopwords, max_features=655)
    vect_corpus = tfidf.transform(corpus)
    result_matrix = csr_matrix(vect_corpus)
    
    return result_matrix

In [12]:
X_validation = create_validation_features(tfidf, validation_all_tweets, validation_rare_words)
y_validation = create_labels(validation_all_tweets)

In [13]:
from sklearn.metrics import accuracy_score

def classifier_evaluation(classifier, X_validation, y_validation):
    """ evaluates a classifier based on a supplied validation data
    Inputs:
        classifier: sklearn.svm.classes.SVC: classifer to evaluate
        X_train: scipy.sparse.csr.csr_matrix: sparse matrix of features
        y_train: numpy.ndarray(int): dense binary vector of class labels
    Outputs:
        double: accuracy of classifier on the validation data
    """
    prediction = classifier.predict(X_validation)
    accuracy = accuracy_score(y_validation, prediction)
    return accuracy

# AUTOLAB_IGNORE_START
accuracy = classifier_evaluation(classifier, X_validation, y_validation)
# AUTOLAB_IGNORE_STOP

When I ran the above method, I get the following accuracy

``` python
    >>> print(accuracy)
    0.925851703407
```
_________

Now we can use our model to classify unlablled tweets from our test-tweets dataset

In [14]:
def classify_unlabelled_data(tfidf, classifier, unlabeled_tweets):
    """ predicts class labels for raw tweet text
    Inputs:
        tfidf: sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used on training data
        classifier: sklearn.svm.classes.SVC: classifier learnt
        unlabeled_tweets: pd.DataFrame: tweets read from tweets_test.csv
    Outputs:
        numpy.ndarray(int): dense binary vector of class labels for unlabeled tweets
    """
    
    processed_tweets = process(unlabeled_tweets)
    corpus = []
        
    for row in processed_tweets['text']:
        sentence = ' '.join(row)
        corpus.append(sentence)
    
    vect_corpus = tfidf.transform(corpus)
    result_matrix = csr_matrix(vect_corpus)
    y_pred = classifier.predict(result_matrix)
    
    return y_pred
    

# AUTOLAB_IGNORE_START
classifier = fit_classifier(X, y)
unlabelled_tweets = pd.read_csv("test-tweets.csv", na_filter=False, encoding='latin-1')
y_pred = classify_unlabelled_data(tfidf, classifier, unlabelled_tweets)
# AUTOLAB_IGNORE_STOP

## Conclusion

Now you have a model with which you can identify tweets that were posted without monetary influence or bias during the 2016 Ugandan Election. 

In an age where trending content quickly morphs into reality and influences decisions that people make - it is becoming increasingly important to identify whether content posted on social media is genuine or driven by hidden incentives such as monetary gain. 

In this tutorial we have covered: 
1. Installation of libraries that are important 
2. Data Pre-Processing: Collection, Conversion and Cleaning - How we collected the data as well as processed it.
3. Text Processing: Cleaning up the text content so that it is ready to begin the process of feature extraction
4. Feature Extraction: Here we looked at the identification of words that will not add value to our similarity calculations.
5. Classification: Here we looked at using the SVM classifier with the linear kernal function to fit our data. We then evaluated our classifier against our validation data. After which we ran our model on our unlabelled data.