## Text analytics - tweets

### Load Libraries

In [None]:
# For data and matrix manipulation
import pandas as pd
import numpy as np

# For visualisation
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# For string manipulation
import re 
import string

# For text pre-processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Necessary dependencies from NLTK
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# For assigning sentiment polarity scores
from textblob import TextBlob

# For extracting features -- i.e. the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

# Train_test_split 
from sklearn.model_selection import train_test_split

# Some ML models
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# For evaluation of ML models
from sklearn.metrics import accuracy_score, classification_report


### Data preprocessing

Looking above, we can see there's much to be done -- some questions we could ask ourselves:
 - What do we do with handles? E.g. @Apple 
 - What do we do with punctuation? E.g. !?.-# 
 - How do we handle words spelt incorrectly? E.g. that vs. thattt
 - What do we do with emoticons? E.g. :), :-) etc.
 - What about words that have inflectional changes? Do we keep them or return them to their base? See [NLP Stanford on Stemming and Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
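
Several of these questions can be explored with plain regular expressions before reaching for NLTK. The sketch below shows one possible set of answers, not the choices made later in this notebook; the helper name `normalise_tweet` is hypothetical, and the rule for repeated characters is a rough heuristic (it would also shorten a legitimately long run of letters).

```python
import re

def normalise_tweet(text):
    # Hypothetical helper: one possible answer to the questions above
    text = re.sub(r'https?://\S+', ' ', text)   # drop URLs
    text = re.sub(r'@\w+', ' ', text)           # drop handles such as @Apple
    text = re.sub(r'[:;]-?[)(]', ' ', text)     # drop simple emoticons such as :) and :-(
    text = re.sub(r'(\w)\1{2,}', r'\1', text)   # collapse 3+ repeated characters: thattt -> that
    text = re.sub(r'[^\w\s]', ' ', text)        # replace remaining punctuation with a space
    return re.sub(r'\s+', ' ', text).strip().lower()

print(normalise_tweet('Thattt is SOOOO good :-) @Apple!!! https://decoded.co'))
```

Whether handles and emoticons should be dropped or kept as features depends on the analysis; for sentiment work, emoticons may well carry useful signal.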

It may be a good idea to begin with a sample sentence, and see how we do:

In [None]:
sample = 'This is the pre-processing stages          @ApplE. #whataday This is SOOOO EXCI~TING. Cows. Thats all i have to say come find me at https://decoded.co'

# word_tokenize(sample)
# sent_tokenize(sample)

In [None]:
# Collapse runs of whitespace into a single space (raw string avoids an invalid-escape warning)
re.sub(r'\s+', ' ', sample)

In [None]:
# Using the lemmatizer function - by default it lemmatizes nouns, for e.g.:
print(WordNetLemmatizer().lemmatize('cars'))

# But it can also be adjusted to lemmatize verbs by setting the 'pos' parameter to 'v' (or adjectives with 'a'):
print(WordNetLemmatizer().lemmatize('cleanse', pos = 'v'))

In [None]:
PorterStemmer().stem('cleanse')

## Cleaning/Transformation function

Now let's create a function that does all the cleaning/transformation in one go.

Remember we can always go back and change our choices later!

In [None]:
def clean_text(sample_text):

    # First, let's try changing everything to lower case
    sample_text = sample_text.lower()

    # Remove all URLs before touching punctuation -- note that str.replace() does not understand
    # regular expressions, so we use re.sub() here
    sample_text = re.sub(r'https?://\S+', '', sample_text)

    # Let's now replace a select few symbols with white space to make things easier for ourselves
    sample_text = re.sub(r'[#@~?!]', ' ', sample_text)

    # We then strip any extra white space
    sample_text = re.sub(r'\s+', ' ', sample_text).strip()

    # Now that we transformed our text, we need to tokenize it. Let's treat each word as a token.
    words = word_tokenize(sample_text)

    # Then lemmatize our words one token at a time -- the lemmatizer works on single words, not on
    # whole sentences. Note, stemming was deemed too crude here, and therefore not chosen
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w, pos = 'n') for w in words]

    # Try lemmatizing adjectives & verbs

    # As we now have a list of words, we can go ahead and remove those words that also belong to the
    # stopwords list from the NLTK corpus (built once as a set, rather than once per word)
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]

    # We then join the list of words back into 'free text', or string format
    text = ' '.join(words)

    return text

## Modeling

### Bag of words model

Now that we've cleaned our corpus of tweets, we need to put it in a form that can be used by our machine learning models. The representation we'll be using is called the **bag of words** model.

This involves creating a document-term matrix -- a matrix with **documents** or tweets as the rows, and unique **terms** as the columns -- for which we use the _CountVectorizer()_ class from the scikit-learn package. 

In [None]:
# We specify that we need no more than 10000 features -- i.e. 10000 unique terms. Of course, this is an arbitrary
# number, so feel free to play around with this parameter!

# We also specify the min_df parameter to be 0.01. This means that each term must appear in at least 1% of our 
# tweets to be kept.

# Finally, we specify an ngram_range of (1,1). This means that we're only looking for single words -- an ngram_range
# of (1,2) would include both words (length = 1) and phrases or combinations of words of length = 2 
vector = CountVectorizer(max_features= 10000 , min_df=0.01, ngram_range= (1,1))

# We use the fit_transform() function to apply the above to our tweets
bag_of_words = vector.fit_transform(tweets)

### Model and Understand – Visualisation

There's lots to explore here! Let's begin with a few visualisations that could be of interest. 

In [None]:
# Find the sum of occurrences of each term
sum_of_words = bag_of_words.sum(axis= 0)

# Create a list of tuples where each element represents the term in question and how many times it occurs in our 
# corpus.
words_freq = [(word, sum_of_words[0, idx]) for word, idx in vector.vocabulary_.items()]

# Sort in decreasing order of frequency.
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

words_freq

In [None]:
# Ignoring top word (which is "apple" in this case)
top_words = words_freq[1:30]

word = []
count = []

for i, j in top_words: 
    word.append(i)
    count.append(j)

# Adjusting figure size
plt.figure(figsize = (10,10))

# Plotting a barplot of most frequent words using Seaborn
sns.barplot(x = count, y = word)

In [None]:
# Another way to plot most frequent words is through use of Wordclouds
words_dict = {}
for k,v in top_words:
    words_dict[k] = int(v)

# Using the WordCloud library
wordcloud = WordCloud(width=1000, height=500, background_color="white").generate_from_frequencies(words_dict)

plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

### Model and Understand – Sentiment analysis

Now that we've visualised the most frequent words, we can delve into more interesting aspects, like sentiment. 


To do this, we'll be using a library called **TextBlob**. The package assigns a _sentiment polarity score_ from -1 to 1 to each of the words it recognises. It then calculates the overall sentiment polarity for each tweet by averaging the scores for the terms in the tweet in question. 

TextBlob assigns sentiment scores based on a pre-populated lexicon, i.e. a dictionary of words that have previously been assigned scores by humans -- no magic here! For more information on how the scoring is calculated, read more [here](https://planspace.org/20150607-textblob_sentiment/). 
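
The averaging idea can be sketched in a few lines of plain Python. The lexicon and scores below are invented for illustration and are not TextBlob's actual values; the real library also handles things like negation and intensifiers, which this sketch ignores. The names `toy_lexicon` and `toy_polarity` are hypothetical.

```python
# Toy lexicon, invented for illustration -- not TextBlob's actual scores
toy_lexicon = {'great': 0.8, 'love': 0.5, 'terrible': -1.0, 'slow': -0.3}

def toy_polarity(text):
    # Look up each known word and average the scores; unknown words are skipped
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    # A text with no scored words is treated as neutral
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity('great phone but terrible battery'))
print(toy_polarity('the phone arrived'))
```

A tweet that mixes a positive and a negative word ends up close to zero, which is one reason lexicon-based scores can look neutral for tweets that a human would read as strongly opinionated.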

In [None]:
# We create a list of sentiment scores for each of our tweets using TextBlob
sentiments = []

for tweet in tweets:
    analysis = TextBlob(tweet)
    sentiments.append(analysis.sentiment.polarity)

# We add that list to a new dataframe of our tweets, naming the column 'text' so we can refer to it later
tweets_df = pd.DataFrame({'text': list(tweets)})

tweets_df['sentiments'] = sentiments

tweets_df

In [None]:
# Create a new column that categorises sentiment -- 'Negative' if <0, 'Positive' if >0 and 'Neutral' if equal to 0.
categories = []

for sentiment in tweets_df['sentiments']: 
    if sentiment > 0:
        categories.append('Positive')
    elif sentiment < 0: 
        categories.append('Negative')
    else:
        categories.append('Neutral')
        

tweets_df['sentiment_category'] = categories

### Predict Sentiment – Text Classification

Now that we've explored our data (although not exhaustively) let's see if we can build a machine learning model that is able to predict the sentiment of a tweet!

We split into a training and a testing set as before -- specifying: 
 - test_size = 0.3 -- i.e. we train on 70% of our dataset and test on 30%
 - a random_state = 123 -- for our results to be reproducible

In [None]:
# Using the train_test_split function
x_train, x_test, y_train, y_test = train_test_split(tweets_df['text'], tweets_df['sentiment_category'],  
                                                   test_size = 0.3, random_state = 123)

We now apply our bag of words model to our entire text using the _fit_ function

In [None]:
# Fit bag of words model (CountVectorizer) to full text first
vector.fit(tweets_df['text'])

In [None]:
# Now we apply same feature transformation to both x_train, and x_test
x_train_bow = vector.transform(x_train)

x_test_bow = vector.transform(x_test)

In [None]:
x_train_bow

In [None]:
x_test_bow

In [None]:
# Notice the same number of features 
print(x_train_bow.shape, x_test_bow.shape)

Now we can call on sklearn's machine learning algorithms!

One basic model that is typically used for text modelling is _Naive Bayes_.
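
As a minimal sketch of what fitting and evaluating such a model could look like, the example below trains `MultinomialNB` on a tiny invented corpus. The toy texts, labels and variable names are assumptions made for illustration, and the vectorizer here is fitted on the toy training texts only, which is a choice of this sketch rather than the notebook's approach.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Invented toy data standing in for the tweets and their sentiment categories
toy_train = ['love great phone', 'great battery love', 'hate terrible phone', 'terrible battery hate']
toy_train_labels = ['Positive', 'Positive', 'Negative', 'Negative']
toy_test = ['love great', 'hate terrible']
toy_test_labels = ['Positive', 'Negative']

# Turn the texts into bag-of-words count matrices
toy_vector = CountVectorizer()
toy_train_bow = toy_vector.fit_transform(toy_train)
toy_test_bow = toy_vector.transform(toy_test)

# Fit Naive Bayes on the training counts, then predict the held-out texts
nb = MultinomialNB()
nb.fit(toy_train_bow, toy_train_labels)
toy_pred = nb.predict(toy_test_bow)

print(toy_pred)
print(accuracy_score(toy_test_labels, toy_pred))
```

In this notebook, the same calls would take `x_train_bow` and `y_train` for fitting, and `x_test_bow` and `y_test` for evaluation; `classification_report` can be used in place of `accuracy_score` for a per-class breakdown.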

### Evaluate & Communicate

We've now modelled our tweets using the bag-of-words model. There are many more ways we can explore this further. To name a few directions: 

 - We can improve the accuracy of our current models
 - We can try other machine learning models
 - We can try different pre-processing techniques
 - There are other modelling techniques besides the bag of words model -- maybe try TFIDF or Word2Vec? Lots you can explore here!