# Stylext: Tweet Attribution with Naive Bayes & Logistic Regression

Note to non-programmer friends: *You need to select "Restart & Clear Output" under the Kernel tab before running the code cells below*

### Introduction

Although technically not 100% pure stylometry (because distinguishing one user from the other with the code below is affected by the respective topics being discussed), this notebook file will illustrate how the same sort of algorithms used to distinguish spam from valid email can also be used to distinguish one Twitter user from another using their post content alone.

Both feeds used in the sample csv data are about the same topic (economics). However, there are no *conscious* attempts by the users to obfuscate their Tweet styles, but the techniques used can still be useful if someone is unaware of what distinguishes their tweet style from others.

With each Python code cell, click on it to highlight then shift + enter to execute it. The * symbol means it's running, while a number means it completed.

## Part 1: Importing Needed Libraries

You will need *pandas* to read in rows and colums (containing the raw article text, and columns for all of the criteria of interest.

*Numpy* and *scipy* add functionality that you will depend on throughout notebook use. Very specific tools are also imported from *scikit-learn.* In particular, a few natural language processing tools are imported which may be used to boost model accuracy (with iterative trial and error).

**Do not worry if the brief library descriptions in the code below do not make sense to you; the specifics of what they do are ellaborated further in this notebook as they are put to use.**

In [104]:
# These are the core libraries you need to import to run the scripts that follow.

import pandas as pd # this is needed to read in dataframes (rows and columns of data)
import numpy as np # numpy allows you to efficiently work with and execute operations on arrays
import scipy as sp # scipy builds off of numpy, giving you added linear algebra functionality under the hood

Now that our core libraries are imported, we need to import several things from Scikit-Learn. These will allow use to *add structure* to otherwise unstructured text, *apply machine learning models* to classify text samples, and *measure the accuracy* of the output for the data we will load in. 

In [105]:
# Here are more specific tools from Scikit-Learn for natural language processing and measuring accuracy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # two vectorization methods we want for later
from sklearn.naive_bayes import MultinomialNB # multinomial naive bayes classifier
from sklearn.linear_model import LogisticRegression # basic logistic regression classifier
from sklearn.model_selection import train_test_split # this splits the data loaded in into training & testing groups
from sklearn import metrics # this will help us understand the results of the train/test split simulation

## Part 2: Load in CSV File Containing Tweets and Define Train/Test Variables

Now we will read in the data file (in comma seperated values format) from an online github repo, and store it in a python variable called "tweets" so we can continue to work with it throughout the notebook. Pandas' "read_csv" feature will allow use to read in and define the CSV.

In [106]:
# Read post_feed.csv into a DataFrame. Any CSV with columns containing raw tweet contents and usernames can often work.
# If you're offline, replace the link with the file location for post_feed.csv if you have it stored locally.

url = 'csv/10users.csv' # define url as csv data
tweets = pd.read_csv(url) # read the csv file using the pandas python library and define it as 'tweets'

Defining training and testing variables must be done in order to even begin testing and improving predictive accuracy.

In [107]:
# define X and y, or the manipulated variable and the responding variable: Given the text, which user tweeted it?

X = tweets.raw_text  # this defines X as the csv column that contains all the raw tweet text of both users
y = tweets.username  # the responding variable (y) is now defined as the column with the two Twitter usernames


# split the new DataFrame into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

The code below ensures the dataframe has been read in and the variables are defined as intended.

In [108]:
 # check the first five rows/tweets

tweets.head()

Unnamed: 0,username,raw_text
0,DwyaneWade,b'I\xe2\x80\x99ll take each game just like thi...
1,DwyaneWade,b'https://t.co/CDaW9lo7dp'
2,DwyaneWade,b'RT @RobinRoberts: It was a show-stopping per...
3,DwyaneWade,b'I guess all black people do look alike \xf0\...
4,DwyaneWade,b'Happy Anniversary to you both!!! https://t.c...


In [109]:
# check the last five rows/tweets, notice the change in which username's tweets are visible

tweets.tail()

Unnamed: 0,username,raw_text
27098,wojespn,b'Sources: Brooklyn is out of the running on f...
27099,wojespn,b'@ChrisMannixYS @KDTrey5 Too bad KD plays for...
27100,wojespn,b'RT @TheVertical: Sources: Arron Afflalo agre...
27101,wojespn,b'RT @BobbyMarks42: Kings projected cap space ...
27102,wojespn,"b'Arron Afflalo has agreed to a two-year, $25M..."


In [110]:
# check the number of rows (tweets stored) and columns

tweets.shape

# even though there are 3 columns, most of the time we're only going to use two at a time (no need for time_stamp)

(27103, 2)

In [111]:
# check the first five rows in a shorter format - it's only displaying the raw_text column that X was defined as.

X.head()

0    b'I\xe2\x80\x99ll take each game just like thi...
1                           b'https://t.co/CDaW9lo7dp'
2    b'RT @RobinRoberts: It was a show-stopping per...
3    b'I guess all black people do look alike \xf0\...
4    b'Happy Anniversary to you both!!! https://t.c...
Name: raw_text, dtype: object

In [112]:
# now we will see the first five rows of the responding variable (the username of who posted what)

y.head()

0    DwyaneWade
1    DwyaneWade
2    DwyaneWade
3    DwyaneWade
4    DwyaneWade
Name: username, dtype: object

## Part 3: Time to Vectorize

- **What:** Separate text into units such as sentences or words that can be better quantified
- **Why:** Gives structure to previously unstructured text that machine learning can be applied to
- **Notes:** Relatively easy with English language text, not easy with some languages

We are now going to create what are called "document-term matrices" of the tweets. Think of these as rows and columns which store numbers representing how often certain terms appear in a given document (or passage of text). The image below may help you understand what that looks like under the hood:

&nbsp;

![Document-Term Matrix](http://mlg.postech.ac.kr/static/research/nmf_cluster1.PNG)

&nbsp;

In [113]:
# use CountVectorizer to create document-term matrices from X_train and X_test

vect = CountVectorizer() # because vect is way easier to type than CountVectorizer...
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.fit_transform(X_test)

# now we have quantitative info about the tweets that a 'multinomial naive Bayes classifier' can work with

**Just to clarify what's going on in the adjacent cells:** All the **rows** are the *individual tweets* that are stored in the CSV file. But the astronomical crapload of **columns** is literally *each unique term* that appears. Those are going to be the "features" used to "fingerprint" one user from another. 

In [114]:
# rows are documents, columns are terms (aka "tokens" or "features")

X_train_dtm.shape

(20327, 31449)

In [115]:
# last 50 features

print(vect.get_feature_names()[-50:])

['zn6trwryhi', 'zne', 'znianvui01', 'zo', 'zo2_', 'zo2lft2fhf', 'zoehan06090', 'zombie', 'zombies', 'zone', 'zones', 'zonkerdk', 'zonwdofjxf', 'zookeeper', 'zoolander', 'zoom', 'zoran_dragic', 'zosh7milj2', 'zpjwgn', 'zpydnmvv2n', 'zq7lirqyoj', 'zq8dprgk0f', 'zr', 'zrgvdvqnaf', 'zrrcisbnyz', 'zryl3kro1m', 'zs718ifpiy', 'zslkdfic', 'zsoldrofou', 'zsu9qc8dr1', 'zsxmxo9kqs', 'ztemupbtij', 'ztsbk0nzhb', 'zu1vo8woqy', 'zug0ncxmew', 'zuma', 'zusswrvq9u', 'zv50tc8ahl', 'zv91pwsgvr', 'zvwwvexg7h', 'zvzy2mudlw', 'zw9rfoqmcq', 'zwdrt26nmt', 'zwjthcryau', 'zxfodksdsq', 'zxfoyjg5vt', 'zynhggzxso', 'zyygjlthdg', 'zzuv1ayiyr', 'zzzz']


In [116]:
# show vectorizer options, which are currently at their default values

vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

**Take a look at the output above.** These are settings that CountVectorizer has (which is currently stored in "vect"). The main ones to keep in mind are lowercase (whether all the words get converted to lowercase), max_features (how many of those words are used to "fingerprint" the text of the tweets), ngram_range (if set to (1, 2), it will look at individual words as well as word pairs, and so on as you increment the latter number), and stop_words (words that are so common that they might not be useful for tweet classification can be ignored).

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) - in case you might be interested.

In [117]:
# We will not convert to lowercase for now, but if we did it would reduce the number of unique words looked at

vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(20327, 36404)

- Parameter **lowercase:** boolean, True by default
    - If True, Convert all characters to lowercase before tokenizing.
    
This can be useful for preventing word capitalization from making your results less predictive.

In [118]:
# last 50 features

print(vect.get_feature_names()[-50:])

['zgloIYgjV8', 'zgq0PajuSD', 'ziByLyfeKr', 'ziJUvzh8ks', 'ziKqmXMpkp', 'ziggurat', 'zigzaganalytics', 'zirvin21', 'zjRgmNjGBc', 'zjqLWzyWSZ', 'zkCkzNUpO5', 'zkT4KokYhS', 'zkvqUmSw', 'zlah2ov2Ty', 'zmB9YgJ9kA', 'zmane2', 'zmbiAtURsF', 'zmfHPnFDDj', 'zoNhzB7RMD', 'zombies', 'zone', 'zones', 'zoolander', 'zoomed', 'zoop', 'zoran_dragic', 'zoykWGGtwG', 'zplc1y5NQH', 'zpoRQzqjRF', 'zq72pYXw9x', 'zqQ7loGnE8', 'zquPwXRvxn', 'zquadri90', 'zrDUJukU', 'zrxUeJIyvQ', 'zsKF5HdiBS', 'zsQzDKn6', 'ztKX6JH2fy', 'ztPHrAlhLL', 'ztYAMYpbxb', 'ztsp8TUMUn', 'zufIuarf', 'zukovka', 'zupWf2Gxql', 'zvILQN9S', 'zw4DaT34MT', 'zxRPHDa7AR', 'zynga', 'zzZz', 'zzz']


In [119]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(20327, 168356)

- Parameter **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [120]:
# last 50 features

print(vect.get_feature_names()[-50:])

['zuwhm6hiao', 'zuz5segxgh', 'zuz5segxgh https', 'zv6cpa9po6', 'zveeqrjpu2', 'zvilqn9s', 'zvl23bexir', 'zw4dat34mt', 'zwade', 'zwade what', 'zwazbpljyj', 'zwhueysw0l', 'zwischenzug', 'zwjm1bih47', 'zwomuphxnt', 'zwomuphxnt https', 'zxbjh2sbgr', 'zxfoyjg5vt', 'zxihl69ga5', 'zxmczel8', 'zxrphda7ar', 'zxsxfixu', 'zxsxfixu xe2', 'zxuvqys3ps', 'zxvjpiep6k', 'zxvwo28nay', 'zxxvkjvm', 'zyg_26', 'zyg_26 lol', 'zygrqqbfwm', 'zynga', 'zynga poker', 'zytm8cr5yt', 'zyuutsailv', 'zz2zwmw8j3', 'zz2zwmw8j3 xe2', 'zz46znjvyy', 'zzblosoqmj', 'zzly7stiny', 'zznxe7pr76', 'zzo39r7ggy', 'zzt3', 'zzt3 xe2', 'zztwdqd2fu', 'zzxnr0qtt7', 'zzz', 'zzz and', 'zzzz', 'zzzz https', 'zzzzzzzzz']


**Predicting which user made what Tweet:** 

Now for the moment of truth... How accurate can we predict who is who?

In [121]:
# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

0.659238488784


**The cell below will eliminate the need for typing in the same code over and over again, as well as produce an output that includes all the information we need to know about how the number of unique features is affecting the classifier accuracy.**

In [122]:
# define a function that accepts a vectorizer and calculates the accuracy

lr = LogisticRegression()

def tokenize_test(vect, model):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    if model == 'lr':
        lr.fit(X_train_dtm, y_train)
        y_pred_class = lr.predict(X_test_dtm)
        algorithm = 'Logistic Regression'
    elif model == 'nb':
        nb.fit(X_train_dtm, y_train)
        y_pred_class = nb.predict(X_test_dtm)
        algorithm = 'Multinomial Naive Bayes'
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
    print(algorithm)

In [123]:
vect = CountVectorizer()
tokenize_test(vect, model='lr')

Features:  31449
Accuracy:  0.723288075561
Logistic Regression


In [124]:
# include 1-grams and 2-grams
vect = CountVectorizer()
tokenize_test(vect, model='nb')

vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect, model='nb')

vect = CountVectorizer(stop_words='english')
tokenize_test(vect, model='nb')

vect = CountVectorizer(stop_words='english',ngram_range=(1, 2))
tokenize_test(vect, model='nb')

Features:  31449
Accuracy:  0.659238488784
Multinomial Naive Bayes
Features:  168356
Accuracy:  0.636806375443
Multinomial Naive Bayes
Features:  31167
Accuracy:  0.67458677686
Multinomial Naive Bayes
Features:  146877
Accuracy:  0.668683589138
Multinomial Naive Bayes
