## Text analytics - tweets

### Hello Jupyter Notebooks

DO: Try printing "hello world"! 

### Load Libraries

In [None]:
# For data and matrix manipulation
import pandas as pd
import numpy as np

# For visualisation
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# For string manipulation
import re 
import string

# For text pre-processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

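# Necessary dependencies from NLTK
nltk.download('punkt')
# Recent NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')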

# For assigning sentiment polarity scores
from textblob import TextBlob

# For extracting features -- i.e. the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer


### Source and Manage Data

DO: Load the CSV data using pandas, then print the first 6 rows with `head(6)` (`head()` on its own returns 5 rows)

### Data preprocessing

Looking above, we can see there's much to be done -- some questions we could ask ourselves:
 - What do we do with handles? E.g. @Apple
 - What do we do with punctuation? E.g. !?.-#
 - How do we handle words spelt incorrectly? E.g. that vs. thattt
 - What do we do with emoticons? E.g. :), :-) etc.
 - What about words that have inflectional changes? Do we keep them or return them to their base form? See [NLP Stanford on Stemming and Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
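
One possible set of answers to the first few questions can be sketched with the standard `re` module. The function name `normalise_tweet`, the placeholder tokens `<user>` and `<smile>`, and the squeeze-repeats heuristic are all illustrative assumptions, not the only reasonable choices:

```python
import re

def normalise_tweet(text):
    # Replace any @handle with a generic placeholder token
    text = re.sub(r'@\w+', '<user>', text)
    # Squeeze runs of 3+ identical characters down to one ("thattt" -> "that").
    # This is a crude heuristic: it would also turn "coool" into "col".
    text = re.sub(r'(\w)\1{2,}', r'\1', text)
    # Map simple smiley emoticons such as :) ;) :-) to a placeholder token
    text = re.sub(r'[:;]-?\)', ' <smile> ', text)
    # Collapse any extra white space introduced above
    return re.sub(r'\s+', ' ', text).strip()

print(normalise_tweet("@Apple thattt is great :-)"))
```

Whether to keep, replace, or drop each of these elements depends on the analysis; for sentiment work, for instance, emoticons may carry useful signal.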

It may be a good idea to begin with a sample sentence, and see how we do:

In [None]:
sample = 'This is the pre-processing stages    @ApplE. #whataday This is SOOOO EXCI~TING. Cows. Thats all i have to say. Come find me at https://decoded.com'

In [None]:
# word_tokenize - splits our string/text into a list of words and punctuation marks, where each item is 
# called a token.
word_tokenize(sample)

In [None]:
# sent_tokenize - splits our string/text into a list of sentences (using sentence-ending punctuation such as '.'), 
# where each sentence is a token.
sent_tokenize(sample)

In [None]:
# Transforms each character to its lower case form
sample.lower()

In order to capture patterns in our text data, we use a handy tool called _regex_ or regular expressions. This is a whole interesting area of text analytics worthy of exploration. 

For reference on some regular expressions and how to use them: [W3 Schools Regex](https://www.w3schools.com/python/python_regex.asp)

For trialling various regex patterns and combinations on more text:  [Regexr](https://www.regexr.com)
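
As a small illustration of pattern matching, `re.findall` returns every non-overlapping match of a pattern in a string. The tweet below is made up for the example:

```python
import re

tweet = "Loving the new phone from @Apple! #whataday https://decoded.com"

handles = re.findall(r'@\w+', tweet)    # '@' followed by word characters
hashtags = re.findall(r'#\w+', tweet)   # '#' followed by word characters
urls = re.findall(r'http\S+', tweet)    # 'http' followed by non-whitespace characters

print(handles, hashtags, urls)
```

The `r` prefix makes each pattern a raw string, so backslash sequences such as `\S` reach the regex engine unchanged.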

In [None]:
# Removes URLs from the sample text (raw string so that \S is passed to the regex engine as-is):
re.sub(r'http\S+', '', sample)

In [None]:
# Substitutes a specified character with another specified character of your choice
re.sub('@', '@@', sample)

In [None]:
# Collapses runs of white space into a single space
re.sub(r'\s+', ' ', sample)

In [None]:
# Removing a select few symbols
re.sub('#|!|~','', sample)

Finally, we may also want to consider exploring 'stopwords'. Those are words like 'and', 'you', and 'I' that add little value to our model and would only clutter our corpus. 

The NLTK library has a very handy pre-populated list of such words that we can use to our advantage.

They also have a [list of other corpora](http://www.nltk.org/nltk_data/) that may be useful for future text analytics projects!

In [None]:
stopwords.words('english')
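
Filtering works by checking each token against the stopword list. The sketch below uses a tiny hand-written set (`toy_stopwords`, an assumption standing in for the much longer NLTK list) so that it runs without any downloads:

```python
# A tiny stand-in for stopwords.words('english'); the real list is much longer
toy_stopwords = {'this', 'is', 'the', 'all', 'i', 'have', 'to', 'and', 'you'}

tokens = ['this', 'is', 'the', 'best', 'phone', 'i', 'have', 'owned']

# Keep only the tokens that are not stopwords
kept = [t for t in tokens if t not in toy_stopwords]
print(kept)
```

Storing the stopwords in a set keeps each membership check fast, which helps when cleaning thousands of tweets.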

Now that we've played around with text preprocessing and have some idea of what we can do, let's go ahead and put it all in a function.  

Remember we can always go back and change our choices later!

In [None]:
def clean_text(sample_text):
    
    # Let's remove all of the URLs from the text: 
    sample_text = re.sub(r'http\S+', '', sample_text)
    
    # Given a sample text (as a string), we first substitute a select few symbols with white space.
    # Inside [...] each character is listed directly (no '|' needed), and '-' is escaped so that it is
    # treated as a literal hyphen rather than a range
    sample_text = re.sub(r'[#@\-?!]', ' ', sample_text)
    
    # We then strip extra white space
    sample_text = re.sub(r'\s+', ' ', sample_text)
    
    # Then change everything to lower case
    sample_text = sample_text.lower()
    
    # Now that we've transformed our text, we need to tokenize it. Let's treat each word as a token.
    words = word_tokenize(sample_text)
    
    # As we now have a list of words, we can go ahead and remove those words that also belong to the 
    # stopwords list from the NLTK corpus (loaded once into a set for fast lookups)
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    
    # We then join the list of words back into 'free text', i.e. string format
    text = ' '.join(words)
    
    return text

DO: Let's apply the `clean_text` function to our tweets! First, let's copy the original text so we don't lose it.

## Modeling

### Bag of words model

Now that we've cleaned our corpus of tweets, we have put it in a form that can be used by our machine learning models. The model we'll be using is called the **bag of words** model.

This involves creating a document-term matrix -- a matrix with **documents** or tweets as the rows, and unique **terms** as the columns -- for which we use the _CountVectorizer()_ class from the scikit-learn package. 
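
To make the idea concrete, here is a minimal pure-Python sketch of a document-term matrix built from three made-up documents. It illustrates the data structure only and is not how `CountVectorizer` is implemented:

```python
from collections import Counter

docs = ['apple makes great phones', 'great great battery', 'apple battery']

# Vocabulary: every unique term, sorted so the column order is predictable
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per term, each cell a count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny example several cells are zero, because most terms do not appear in most documents.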

In [None]:
# We specify that we need no more than 10000 features -- i.e. 10000 unique terms. Of course, this is an arbitrary number,
# so feel free to play around with this parameter!

# We also specify the min_df parameter to be 0.01. This means that each term must appear in at least 1% of our 
# tweets.

# Finally, we specify an ngram_range of (1,1). This means that we're only looking for single words -- an ngram_range of (1,2) 
# would include both words (length = 1) and phrases or combinations of words of length = 2 
vector = CountVectorizer(max_features=10000, min_df=0.01, ngram_range=(1, 1))

# We use the fit_transform() function to apply the above to our tweets
bag_of_words = vector.fit_transform(tweets)

bag_of_words

We can see that this is a sparse matrix of:
 - 9991 documents - i.e. tweets
 - 241 unique terms

Each cell holds the number of times the term in question occurs in the document in question. It's sparse because, inevitably, there are lots of zeros!
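
To make the `min_df` and `ngram_range` settings used above concrete, the following pure-Python sketch computes n-grams and a document-frequency filter by hand. The documents, the `ngrams` helper, and the 0.5 threshold are made up for illustration:

```python
from collections import Counter

docs = ['apple makes great phones', 'great great battery', 'apple battery']

def ngrams(tokens, n):
    # All runs of n consecutive tokens, joined with a space
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Features for the first document, in the style of ngram_range=(1, 2)
tokens = docs[0].split()
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Keep terms that appear in at least half of the documents (in the style of min_df=0.5)
min_df = 0.5
kept = sorted(t for t, df in doc_freq.items() if df / len(docs) >= min_df)
print(kept)
```

Note that the filter counts documents rather than occurrences: "great" appears twice in the second document but only counts once towards its document frequency.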

DO: Use the `get_feature_names_out()` function (named `get_feature_names()` in scikit-learn versions before 1.0) to view the features!

### Model and Understand – Visualisation

There's lots to explore here! Let's begin with a few visualisations that could be of interest. 

In [None]:
# Find the sum of occurrences of each term
sum_of_words = bag_of_words.sum(axis=0)

# Create a list of tuples where each element represents the term in question and how many times it occurs in our 
# corpus.
words_freq = [(word, sum_of_words[0, idx]) for word, idx in vector.vocabulary_.items()]

# Sort in decreasing order of frequency.
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

words_freq

In [None]:
# Ignoring top word (which is "apple" in this case)
top_words = words_freq[1:30]

word = []
count = []

for i, j in top_words: 
    word.append(i)
    count.append(j)

# Adjusting figure size
plt.figure(figsize = (10,10))

# Plotting a barplot of most frequent words using Seaborn
sns.barplot(x = count, y = word)

In [None]:
# Another way to plot most frequent words is through use of Wordclouds
words_dict = {}
for k,v in top_words:
    words_dict[k] = int(v)

# Using the WordCloud library
wordcloud = WordCloud(width=1000, height=500, background_color="white").generate_from_frequencies(words_dict)

plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
top_words

In [None]:
# Let's try again!
words_dict = {}
for k,v in top_words:
    words_dict[k] = int(v)

wordcloud = WordCloud(width=1000, height=500, background_color="white").generate_from_frequencies(words_dict)

plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

### Model and Understand – Sentiment analysis

Now that we've visualised the most frequent words, we can delve into more interesting aspects, like sentiment. 

To do this, we'll be using a library called **TextBlob**. The package assigns a _sentiment polarity score_ from -1 to 1 to the words it recognises. It then calculates the overall sentiment polarity for each tweet by averaging the scores for the terms in that tweet. 

TextBlob assigns sentiment scores based on a pre-populated lexicon, or dictionary of words that have been previously assigned scores by humans -- no magic here! For more information on how the scoring is calculated, read more [here](https://planspace.org/20150607-textblob_sentiment/). 
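
The averaging idea can be sketched in a few lines of plain Python. The lexicon and its scores below are invented for illustration, and the function `toy_polarity` is hypothetical; the real scorer uses a far larger lexicon and applies additional rules described in the linked article:

```python
# Hypothetical lexicon: these scores are invented for illustration only
toy_lexicon = {'great': 0.8, 'love': 0.5, 'terrible': -1.0, 'slow': -0.3}

def toy_polarity(text):
    # Look up each word; words missing from the lexicon are ignored
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    # Average the scores found, or return 0.0 (neutral) if nothing matched
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity('great phone but terrible battery'))
print(toy_polarity('the phone arrived today'))
```

The first tweet mixes a positive and a negative word, so its average lands slightly below zero; the second contains no lexicon words and is scored as neutral.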

DO: Create a list of sentiment scores for each of our tweets using TextBlob

DO: Find the most positive and negative words

DO: Use `sns.histplot` (or `sns.distplot` on seaborn versions older than 0.11, where `histplot` is not available) to plot the distribution of sentiments of the tweets

DO: Create a new column that categorises sentiment -- 'Negative' if <0, 'Positive' if >0 and 'Neutral' if equal to 0.

DO: Use `sns.countplot` to create a countplot of each different sentiment category
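
As a starting point for the last two exercises, the categorisation rule can be sketched as a plain function. The name `categorise` and the scores below are made up for illustration:

```python
from collections import Counter

def categorise(score):
    # Map a polarity score in [-1, 1] to a sentiment label
    if score < 0:
        return 'Negative'
    if score > 0:
        return 'Positive'
    return 'Neutral'

scores = [0.5, -0.2, 0.0, 0.8, 0.0]   # made-up polarity scores
labels = [categorise(s) for s in scores]

# Count how many tweets fall into each category
label_counts = Counter(labels)
print(label_counts)
```

With a DataFrame, the same function could be passed to `Series.apply` on whichever column holds the scores, and the resulting column plotted with `sns.countplot`.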