# Classifying Twitter users based on their tweets
In this workshop we will figure out which famous Norwegian tweeter wrote which tweet.

#### Structure of the notebook
There are three sections in this notebook: 
- Build Dataset, 
- Build machine learning model, and 
- See how well the model works.

You do not need to know the first section well.
It contains all the "boring" data preprocessing that is necessary to prepare the data for the algorithm.
It is included and commented for completeness, so that you can go through it if you really want to.

The second section is what we will focus on during this workshop. 
For now, skim through the preprocessing, but really start at the second section.
The third section evaluates how good your model is.
At the end there are some tips to help you create an even better model than this one!

#### Jupyter notebooks
This Jupyter notebook creates a neural network that uses the tweet to predict the author. 
A Jupyter notebook is a file where you can run code and show results in the same document. 
If you want an introduction, you can look [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/), 
but the main idea is that you have markdown chunks (like this one) and code chunks. 
The notebook has a 'python console' running in the background.
To run a chunk press **shift+enter**. Then the code is executed and you can see the results. To add a chunk press **esc+a** and an empty chunk will be created above where you stand :)


In [None]:
# Import all needed libraries
import datetime
import jsonlines
import os
import pandas as pd
import numpy as np
import re
from keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing
import keras
from keras import backend as K  # K.mean is used when the model is defined
from keras.layers import *
from keras.models import Model
from sklearn.metrics import confusion_matrix
import seaborn as sns
%matplotlib inline

Using TensorFlow backend.


## Build dataset
First we need to build our dataset.
This is not really part of the workshop, but the preparation code is left here so that those interested can check it out later.
In the data directory there are some files from the Twitter users we will investigate.
These files contain all the tweets each user has published, and were pre-downloaded for the workshop.
Basically what we do is the following:

- Read the files and put them in a nice pandas dataframe
- Build a vocabulary of all words and give them indices
- Transform all tweets into indices (something required by tensorflow later)


#### Read files 

In [None]:
# Read files:
def load_files_and_create_dataframe():
    files = ['data/' + f for f in os.listdir("data")]
    L = []
    for path in files:
        try:
            with jsonlines.open(path, mode='r') as reader:
                tmp = [line for line in reader.iter()]
                # Only keep files that contain more than 50 tweets
                if isinstance(tmp, list) and len(tmp) > 50:
                    L.extend(tmp)
                else:
                    print("%s had 50 or fewer tweets. skipping.." % path)
        except Exception:
            print("Did not manage to process: %s" % path) 
            
    raw = pd.DataFrame(L)
    raw.timestamp = pd.to_datetime(raw.timestamp)
    
    # Shuffle dataset and filter out some users that have sneaked into the files 
    # (i.e. have less than 50 observations)
    
    counts = raw.user.value_counts()
    counts = counts[counts > 50]
    keep_users = counts.index
    
    raw = (
        raw[raw.user.isin(keep_users)]
        .sample(frac=1) # shuffle dataset
        .reset_index(drop=True)
    )
    return raw

raw = load_files_and_create_dataframe()

print("We got %d observations/tweets in our dataset. The first 5 look like this:" % raw.shape[0])
raw.head()

#### Build vocabulary
We build a vocabulary of the "known" words in the dataset.
Essentially we require that a word has been seen more than 5 times for it to be in the vocabulary.
Otherwise, we will replace the word with "UNK".

The rest is pretty straightforward. 
We split by space and only keep the most important characters in the words.
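
To make the count rule concrete, here is a minimal sketch on made-up, already tokenized tweets. It uses a plain `Counter` rather than the NumPy-based function in this notebook, and the toy words, the names `toy_tweets` and `toy_word2ind`, and the threshold `min_count = 1` are assumptions chosen only to keep the example small.

```python
from collections import Counter

# Toy, already-tokenized tweets (made up for illustration).
toy_tweets = [
    ["god", "morgen", "norge"],
    ["god", "kveld", "norge"],
    ["god", "natt"],
]

min_count = 1  # keep words seen MORE than min_count times

# Count every word across all tweets.
counts = Counter(word for tweet in toy_tweets for word in tweet)
keep_words = sorted(w for w, c in counts.items() if c > min_count)

# Index 0 is reserved for padding ("EMPTY") and index 1 for unknown words ("UNK").
toy_word2ind = {"EMPTY": 0, "UNK": 1}
for i, w in enumerate(keep_words):
    toy_word2ind[w] = i + 2

# Words that are too rare fall back to the UNK index 1.
encoded = [[toy_word2ind.get(w, 1) for w in tweet] for tweet in toy_tweets]
print(toy_word2ind)
print(encoded)
```

Only "god" and "norge" occur more than once here, so every other word is encoded as 1.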


In [None]:
def tokenizeString(s):
    s = s.lower().strip()
    s = re.sub(u"([.!?])", r" ", s)
    
    # Let @ and # be independent words:
    s = re.sub(u"(#)", r"# ", s)
    s = re.sub(u"(@)", r"@ ", s)
    
    # Replace everything except letters (including ø, æ, å) and . ! # @ ? with a space
    s = re.sub(u"[^a-zA-Z.!#@?\xf8\xe6\xe5]+", u" ", s)
    s = s.split(" ")
    return s

def build_vocabulary(text, min_count = 5):
    normalized = text.map(tokenizeString)
    all_words = np.array([item for sublist in normalized.values.tolist() for item in sublist])
    words, counts = np.unique(all_words, return_counts=True)
    keep_words = words[counts > min_count]
    ind2word = {i + 2 : w for i, w in enumerate(keep_words)}
    ind2word[1] = "UNK"
    ind2word[0] = "EMPTY"
    word2ind = {w : i for i, w in ind2word.items()}
    
    return word2ind, ind2word

word2ind, ind2word = build_vocabulary(raw.text)
print("We have a total vocabulary of %d words." % len(word2ind))

In [None]:
print("If you want to find the index of a word:")
print( word2ind["hus"] )
print("Similarly: If you want to find the word corresponding to an index:")
print( ind2word[1000] )

#### Preprocess text
Now we want to transform every tweet from words into a fixed-length sequence of indices.
We do this in two steps:
- First we transform each word into an index (using the vocabulary we defined above)
- Then we set all tweets to a length of 40. If a tweet has fewer than 40 words, we fill it up with empty words. If the tweet has more, we truncate the end.

The whole process is contained in the encode_text function. Also note that we have a decode_tweet function to go back again!
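
As a rough illustration of the second step, the sketch below pads and truncates plain Python lists. The helper `pad_post` is hypothetical and is not part of this notebook or of Keras; it only mimics padding and truncating at the end of the sequence, with a toy length of 5 instead of 40.

```python
def pad_post(indices, padlen, pad_value=0):
    """Cut off anything beyond padlen, then right-pad with pad_value up to padlen."""
    clipped = indices[:padlen]
    return clipped + [pad_value] * (padlen - len(clipped))

short_tweet = [7, 3, 9]               # fewer than 5 words: gets filled with 0
long_tweet = [4, 8, 15, 16, 23, 42]   # more than 5 words: the end is dropped

print(pad_post(short_tweet, padlen=5))  # [7, 3, 9, 0, 0]
print(pad_post(long_tweet, padlen=5))   # [4, 8, 15, 16, 23]
```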

In [None]:
# Prepare sequences:
vectorize_tweet = lambda x: [word2ind.get(w,1) for w in x]
PAD_LENGTH = 40

def encode_text(s, padlen = PAD_LENGTH):
    tokenized = tokenizeString(s)
    vectorized = vectorize_tweet(tokenized)
    padded = pad_sequences([vectorized], maxlen = padlen, padding = "post", truncating= "post")
    return padded

def decode_tweet(vec):
    dec = [ind2word.get(ind) for ind in vec if ind != 0]
    text = " ".join(dec)
    
    # Assume every @ and # is followed by a word without a space:
    text = re.sub(u"# ", r"#", text)
    text = re.sub(u"@ ", r"@", text)
    return text

# All the work happens here:
X = np.squeeze(np.array(raw.text.map(encode_text).values.tolist()))

In [None]:
print("We now have all our tweets in the numpy array X. Each row is a tweet (in total %d), and each column is a word (in total %d)" % (X.shape[0], X.shape[1]))

print("We can easily decode a tweet by doing the following:")
print('--')
decode_tweet(X[300,])

#### Preprocess labels
Each author represents a class. The classes need to be represented by indices.
Therefore, we will also transform all the usernames into numbers.
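
The idea can be sketched without any library: sort the unique usernames and number them from 0. The usernames below are placeholders, and the names `class2ind` and `toy_ind2class` are assumptions for this example. The notebook itself uses scikit-learn's `LabelEncoder`, which also numbers its classes in sorted order.

```python
toy_users = ["user_b", "user_a", "user_c", "user_a", "user_b"]  # placeholder usernames

# Sorted unique names get the indices 0, 1, 2, ...
classes = sorted(set(toy_users))
class2ind = {user: i for i, user in enumerate(classes)}
toy_ind2class = {i: user for user, i in class2ind.items()}

# Replace every username with its class index.
toy_labels = [class2ind[user] for user in toy_users]
print(toy_labels)      # [1, 0, 2, 0, 1]
print(toy_ind2class)   # {0: 'user_a', 1: 'user_b', 2: 'user_c'}
```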

In [None]:
def build_classes(labels):
    le = preprocessing.LabelEncoder()
    le.fit(labels)
    label_index = le.transform(labels)
    ind2class = {i : user for i, user in  enumerate(le.classes_)}
    return label_index, ind2class

raw['user_class'], ind2class = build_classes(raw['user'])
N_classes = len(ind2class) # Store total number of authors somewhere
print("The index -> author mapping can be found in this dictionary:")
ind2class

In [None]:
y = raw.user_class.values

In [None]:
# Now y contains the class index of the author of each tweet:
y

In [None]:
# We can find the author of the tweet above:
y[300]

In [None]:
# We can transform the index back to the user by using the ind2class dictionary:
ind2class[5]

In [None]:
# Further, it could be interesting to know how many tweets are from each author. This is a summary:
raw.groupby("user")['user'].count()

#### Split into training and test
When we build our model it is important that we test on a different dataset than the one we train on.
Otherwise, we could just have written a lookup table to predict each tweet's author.
Instead, we want the model to be able to generalize, 
i.e. to understand in some sense what makes a tweet belong to that author.

When we train the algorithm, it will only see the training data. 
Then, when we have finished training, we see how well it is doing on the test data.

We split the data randomly, with roughly 70% for training and 30% for testing.
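
Because the split in the next cell is made with a random mask, the shares are approximate rather than exact. The sketch below shows the same idea with the standard library's `random` module instead of NumPy. The dataset size `n_rows` is made up, and `train_fraction = 0.7` mirrors the threshold used in the code cell that follows.

```python
import random

rng = random.Random(42)  # fixed seed so the split is reproducible
n_rows = 1000            # hypothetical number of tweets
train_fraction = 0.7     # same threshold as in the cell below

# Each row independently lands in the training set with probability train_fraction.
is_train = [rng.random() < train_fraction for _ in range(n_rows)]

train_rows = [i for i, flag in enumerate(is_train) if flag]
test_rows = [i for i, flag in enumerate(is_train) if not flag]

# Every row ends up in exactly one of the two sets.
print(len(train_rows), len(test_rows))
```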

In [None]:
np.random.seed(42)
train_ind = np.random.rand(X.shape[0]) < 0.7
X_train = X[train_ind,]
X_test = X[~train_ind,]
y_train = y[train_ind]
y_test = y[~train_ind]
assert(len(np.unique(y_train)) ==  len(np.unique(y_test)))

print("We have %d tweets in training set and %d tweets in the test set"% (X_train.shape[0], X_test.shape[0]))

## Build machine learning model (start here!)
This is where the machine learning starts.
Hopefully, you haven't really looked at what we did above, so here is a short recap of what we have right now.

### Data:  
We have split our data into the training data (X_train, y_train) and the test data (X_test, y_test).
When we build our model, we will only train it using the training data, but we will check how well it is doing using the test data.

- The tweets are stored in X_train and X_test.  
- The authors of the corresponding tweets are stored in y_train and y_test.



In [None]:
print("The first tweet in our training data looks like this: ")
print(X_train[0,])
print("We can decode it back to natural language by using the decoding function: ")
print('--')
print(decode_tweet(X_train[0,]))
print('--')
print("Then we can get the author of the tweet by looking in the y_train variable: ")
print("--")
print(y_train[0])
print("--")
print("Which corresponds to the author...: ")
print("--")
print(ind2class[y_train[0]])

### Algorithm
We will build our model in a framework called [Keras](http://keras.io).
It is a deep learning framework using tensorflow as its backend (not important).

We have gone through the model in the slides.
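
As a rough picture of what the model in the next cell computes, here is a NumPy sketch of a single forward pass: look up word vectors, average them, apply a hidden layer, and turn the result into class probabilities. All sizes and the randomly drawn weights are assumptions for illustration. In the real model Keras learns the weights during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, much smaller than in the notebook).
vocab_size, emb_dim, hidden_dim, n_classes = 20, 5, 10, 3

# Randomly initialised parameters; training would adjust these.
embedding = rng.normal(size=(vocab_size, emb_dim))
W1, b1 = rng.normal(size=(emb_dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, n_classes)), np.zeros(n_classes)

def forward(tweets):
    """tweets: integer array of shape (batch, pad_length)."""
    word_vectors = embedding[tweets]          # (batch, pad_length, emb_dim)
    avg = word_vectors.mean(axis=1)           # (batch, emb_dim), padding included
    hidden = np.maximum(avg @ W1 + b1, 0.0)   # ReLU hidden layer
    logits = hidden @ W2 + b2                 # (batch, n_classes)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # softmax

toy_batch = np.array([[2, 5, 7, 0], [3, 3, 0, 0]])  # two padded toy tweets
toy_probs = forward(toy_batch)
print(toy_probs.shape)        # (2, 3)
print(toy_probs.sum(axis=1))  # each row sums to 1
```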

In [None]:
tweet_input = Input((PAD_LENGTH,))
emb = Embedding(input_dim = len(ind2word), output_dim= 5)

# We can represent each word as a vector of length 5:
word_vectors = emb(tweet_input)

# These word vectors are generic, 
# The simplest way to handle them is just to use the average word vector:
avg_word_vectors = Lambda(lambda x: K.mean(x, 1))(word_vectors)

# Add a hidden layer to abstract the average word:
hidden_layer = Dense(10, activation = "relu")(avg_word_vectors)

# Compute another layer. The output of this one is a vector of size N_classes: 
# Each number is the probability that the tweet belongs to that class:
probs = Dense(N_classes, activation = "softmax")(hidden_layer)


model = Model(inputs = tweet_input, outputs = probs)


The model is now defined in the 'model' object.
If we want to inspect it, we can either call "model.summary()" to see the whole model, 
or we can look at an individual layer by simply printing it out here:

In [None]:
print('Print model:')
print('--')
model.summary()  # prints the summary itself and returns None
print("Show what a specific layer looks like (most importantly the dimensions):")
print('--')
print(probs)

## Train the model
We have defined the model, but the model has not seen any data points yet.
To change that, we need to feed it data.
First we have to add an optimizer (how the parameters of the model should be updated).
Then we need to compile the model and specify a couple of things:
- The loss we want to minimize (basically, "sparse_categorical_crossentropy" tries to push the probability of the right class towards 100% and that of every other class towards 0%).
- The metric we want to monitor: the accuracy (the share of observations where we guess the right author).
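
To get a feel for the loss and the accuracy, the sketch below computes both by hand for three made-up probability vectors. The function is a simplified, single-observation illustration and not the Keras implementation. The name `toy_accuracy` and the example probabilities are assumptions.

```python
import math

def sparse_categorical_crossentropy(probs, true_class):
    """Loss for one observation: minus the log-probability of the true class."""
    return -math.log(probs[true_class])

true_class = 1  # index of the real author in all three examples
examples = {
    "confident and right": [0.05, 0.90, 0.05],
    "unsure": [0.30, 0.40, 0.30],
    "confident and wrong": [0.90, 0.05, 0.05],
}

for name, p in examples.items():
    print(name, round(sparse_categorical_crossentropy(p, true_class), 3))

# Accuracy: share of examples where the most probable class is the true one.
toy_accuracy = sum(p.index(max(p)) == true_class for p in examples.values()) / len(examples)
print(toy_accuracy)
```

The loss is small (about 0.1) when the model is confident and right, and large (about 3.0) when it is confident and wrong, while the accuracy only counts whether the top guess was correct.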

Calling model.fit starts a training procedure that will take some time.

In [None]:
optimizer = keras.optimizers.SGD(lr = 0.05)
model.compile(loss = "sparse_categorical_crossentropy",
              optimizer = optimizer, 
              metrics = ["accuracy"])

In [None]:
model.fit(X_train,y_train, 
          verbose = 2,
          validation_data = (X_test, y_test),
         epochs = 10)

## See how well the model works
Now that we have trained the model, we can look at the val_acc metric above to see how often our model guesses right.
However, it does not say anything about what the model is getting right and what it is getting wrong.
For that, we can calculate a [confusion matrix](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/).
A confusion matrix tells us, for each user, which authors the model assigned that user's tweets to.
The rows are the real authors, while the columns are the authors the model predicted.
We have made the confusion matrix relative: each row shows the share of that author's tweets assigned to each predicted author.

For example, if the 'audunstrand' row has 0.3 in the 'lenealexandra' column, it means that the model thinks 30% of audunstrand's tweets were from lenealexandra.
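
A tiny hand-made case shows how the relative matrix is built. The labels below are invented, and the sketch uses plain Python lists instead of scikit-learn so that every step is visible.

```python
# Invented true and predicted class indices for two authors (0 and 1).
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
n_authors = 2

# counts[t][p] = number of tweets by author t that were predicted as author p.
counts = [[0] * n_authors for _ in range(n_authors)]
for t, p in zip(toy_true, toy_pred):
    counts[t][p] += 1

# Divide each row by its total to get shares per true author.
relative = [[round(c / sum(row), 2) for c in row] for row in counts]

print(counts)    # [[3, 1], [3, 3]]
print(relative)  # [[0.75, 0.25], [0.5, 0.5]]
```

Here the model assigns 25% of author 0's tweets to author 1, and splits author 1's tweets evenly between the two.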

In [None]:
def create_relative_confusion_matrix(y_test,yhat):
    confusion = confusion_matrix(y_test,yhat)

    rel_confusion = np.round(confusion / confusion.sum(axis=1, keepdims = True),2)
    rel_confusion = pd.DataFrame(rel_confusion, columns = ind2class.values(), index= ind2class.values())
    rel_confusion.index.name = "true"
    rel_confusion.columns.name = "predicted"
    return rel_confusion

# Predict probabilities for each class:
yhat_probs = model.predict(X_test)

# For each observation we take the author with highest probability as the predicted author:
yhat = yhat_probs.argmax(axis=1)

rel_confusion = create_relative_confusion_matrix(y_test, yhat)
rel_confusion

In [None]:
# Plot the confusion matrix
sns.heatmap(rel_confusion)

In [None]:
# We can also show some random errors to get a feel for what the model gets wrong:

def show_random_error():
    errors = np.where(y_test != yhat)[0]
    idx = np.random.choice(errors)
    print('--- Tweet: ---')
    print(decode_tweet(X_test[idx,]))
    
    print("Model believes:\t %s" %ind2class[yhat[idx]])
    print("True author:\t %s" %ind2class[y_test[idx]])
    
show_random_error()

## Now it's your turn..

Are you able to make a better model than the one we just ran? 

Possible extensions of the model:

- Try to increase the width of the dense layer (https://keras.io/layers/core/#dense)
- Try to adjust the learning rate of your optimizer. Lower learning rates take longer but often give better results (https://keras.io/optimizers/)
- Try to use a different optimizer. The Adam optimizer is often a good alternative 
- Try to stack another dense layer on top of the one you have (or many!)
- We are just using the "average" vector to predict. Could we try to stack the words horizontally instead? (https://keras.io/layers/core/#reshape)
- Is the training accuracy much higher than the test accuracy? Try to add Dropout (https://keras.io/layers/core/#dropout) or Batch Normalization (https://keras.io/layers/normalization/)
- Instead of those average vectors, a very fancy alternative approach would be to model a tweet with a recurrent neural network (https://keras.io/layers/recurrent/)
- There are millions of possible tricks. Try googling 'text classification twitter' or just 'text classification'.