<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

# Python Solutions

This notebook contains all the solutions to this project, as well as example visualisations for section 3. You may produce different visualisations, but this doesn't matter.

In [None]:
#importing libraries we need
import json
import pandas as pd
import string
import re
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
nltk.data.path.append("../pre_course/nltk_data")
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import statistics
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from itertools import chain
import pyLDAvis
import pyLDAvis.gensim_models
import gensim
from gensim import models
import pyLDAvis.gensim_models as gensimvis
from gensim.models.coherencemodel import CoherenceModel
import seaborn as sns
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Use of emoji is optional!
# import emoji

# 1.0 Loading in the data

In [None]:
tweets = pd.read_csv("../Data/tweets.tsv",sep='\t') # replace with your file location

def replace_tweet_number(cell_contents):
    """
    By default the long Tweet IDs are displayed
    as exponentials. We convert them to strings
    as it makes it easier to read and doesn't
    impact our processing.
    
    params
    ------
    cell_contents:  float
                    Either a Tweet ID or NaN
    """
    try:
        return str(int(cell_contents))
    except ValueError:
        return cell_contents

tweets['repliedto_tweet'] = tweets['repliedto_tweet'].apply(replace_tweet_number)
tweets['quoted_tweet'] = tweets['quoted_tweet'].apply(replace_tweet_number)

tweets.head()

# 1.1 Preparing the data for sentiment analysis

The Vader package is able to process text that has been only minimally processed, and calculates sentiment based not just on words but also on punctuation, capitalisation, and emojis. For this reason we should be careful not to over-process our Tweets and instead only remove elements that will not contribute to the sentiment.

### 1.1.1 Removing hyperlinks and twitter handles

Hyperlinks and Twitter handles carry no meaning that would affect the sentiment, so we will remove them using regular expression (regex).
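
As a quick check of what this pattern matches, here is a minimal sketch using Python's built-in `re` module on a single Tweet. The Tweet text and URL are made up for illustration.

```python
import re

# Same pattern as the notebook: a URL, or an @handle
pattern = r'(https?://[^"\s]+)|(@\w+)'

# Hypothetical Tweet, invented for illustration
sample = "Thanks @ONS for the update https://example.com/report !"

# Replace every match with an empty string
cleaned = re.sub(pattern, "", sample)
print(repr(cleaned))
```

The double spaces left behind are tidied up in a later step. When the same pattern is used with pandas string methods, pass `regex=True` so that it is treated as a regular expression rather than a literal string.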

In [None]:
#defining the pattern we want to remove using regex
pattern = r'(https?://[^"\s]+)|(@\w+)'
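# replace the hyperlinks and twitter handles with "" using pandas string methods
# (regex=True is needed because recent pandas versions treat the pattern as a literal string by default)
tweets['sentiment_analysis_text'] = tweets['text'].str.replace(pattern, "", regex=True)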

In [None]:
#viewing the tweets after removing hyperlinks and twitter handles
tweets.head()

### 1.1.2 Tokenising the Tweets

Tweets can be made up of several sentences, each of which may carry its own sentiment. We will use nltk's sentence tokeniser to split the Tweets into sentences for sentiment analysis. Later we can find the average sentiment of a Tweet.

In [None]:
#applying nltk's built in sentence tokenizer to the tweets
tweets['sentiment_analysis_text'] = tweets['sentiment_analysis_text'].apply(nltk.sent_tokenize)

tweets.head()

### 1.1.3 Removing unnecessary punctuation

Vader's sentiment analysis tool can analyse punctuation including '!' and '?', but some punctuation appears in Tweets without having any meaning attached, for example, the hashtag. These can be removed without losing meaning.
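
The sketch below walks through the clean-up steps on two made-up sentences, using a shorter, hypothetical list of symbols than the one in the solution. The helper name `tidy_sentence` is an assumption for this example only.

```python
import re

def tidy_sentence(sentence, remove):
    """Strip the chosen symbols, collapse whitespace, and close up '!' and '?'."""
    # remove the selected punctuation marks
    sentence = re.sub(f"[{remove}]", "", sentence).strip()
    # collapse runs of whitespace into a single space
    sentence = re.sub(r"\s+", " ", sentence)
    # remove whitespace that sits directly before '!' or '?'
    sentence = re.sub(r"\s+(?=[!?])", "", sentence)
    return sentence

# Hypothetical symbols and sentences, invented for illustration
remove = r"#_,;:"
sentences = ["#Census results are out , finally !", "Really  ?"]

tidied = [tidy_sentence(s, remove) for s in sentences]
print(tidied)
```

Note that '!' and '?' survive, so Vader can still use them as signals of emphasis.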

In [None]:
def remove_twitter_punct(sentence_list, remove):
    """
    Remove user-specified punctuation from each sentence in a list.
    
    params:
    ------
    sentence_list   (List[str]) sentences from one Tweet
    remove          (str) punctuation symbols to remove
    
    returns
    ------
    List of sentences without the specified punctuation
    """
    # remove the &amp symbols
    new_sentence_list = [re.sub(string = sent, pattern = r"&amp", repl="") for sent in sentence_list]
    # remove selected punctuation marks
    new_sentence_list = [re.sub(string = sent, pattern = f"[{remove}]", repl="").strip() for sent in new_sentence_list]
    # remove double spaces
    new_sentence_list = [re.sub(string = sent, pattern = r"\s+", repl=" ") for sent in new_sentence_list]
    # remove space from before punctuation mark
    new_sentence_list = [re.sub(string = sent, pattern = r"\s+(?=[!?])", repl="") for sent in new_sentence_list]
    return new_sentence_list

punctuation_to_remove = r"<>$£%&_,;:'’:#\n.\(\)\[\]-" # what would you define as punctuation that carries no meaning?

tweets['sentiment_analysis_text'] = tweets['sentiment_analysis_text'].apply(remove_twitter_punct, remove = punctuation_to_remove)
tweets.head()

This is enough pre-processing to prepare the Tweets for sentiment analysis using Vader. We have not:

* removed all punctuation, only a selection of symbols we have deemed unimportant
* removed numbers
* lowercased the text
* removed stopwords

We have justified these decisions by saying that Vader can handle complex text and is able to analyse sentiment using features like capitalisation and punctuation. You might want to experiment by performing more or less pre-processing than us, to see what effect it has on the sentiment analysis.

## 1.2 Preparing the data for topic modelling

Further pre-processing steps are required for topic modelling.

### 1.2.1 Lowercasing
If all of the text is in the same case, our model will treat "Vaccine" and "vaccine" as the same word rather than as two separate tokens.

In [None]:
# lowercase the text using pandas string methods
tweets['processed_text'] = tweets['text'].str.lower()

# remove hyperlinks and handles again (regex=True so the pattern is read as a regular expression)
tweets['processed_text'] = tweets['processed_text'].str.replace(pattern, "", regex=True)

In [None]:
#viewing the tweets after lowercasing text
tweets.head()

### 1.2.2 Extracting the meaning of emojis

Depending on the task at hand, we may choose to either:

* Remove emojis entirely
* Replace the emoji with its equivalent meaning
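
To see what the "remove" option does in practice, here is a minimal sketch of the ASCII round-trip on a made-up string. The sample text is an assumption, and the emoji and accented letter are written as escape codes so the snippet is safe to copy.

```python
# Hypothetical text: "great news <party popper emoji> caf<e-acute>"
text = "great news \U0001F389 caf\u00e9"

# Encode to ASCII, silently dropping anything that cannot be represented,
# then decode back to a normal string
ascii_only = text.encode("ascii", "ignore").decode()
print(repr(ascii_only))
```

The emoji is removed, but so is the accented letter. This approach drops every non-ASCII character, which is worth bearing in mind if the Tweets contain names or words with accents.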

In [None]:
# Use of emoji is optional!

def replace_emojis(ptext):
    """
    Replace any emojis in the tweets with its meaning in words
        
    params
    ------
    ptext:  str
            Text containing emojis
    """
    ptext = emoji.demojize(ptext, delimiters=("", ""))
    return ptext

In [None]:
# alternative to using emoji

def remove_emojis(ptext):
    """
    Remove any non-ASCII characters (including emojis)
    
    params
    ------
    ptext:  str
            Text from which to remove characters
    """
    ptext = ptext.encode('ascii', 'ignore').decode()
    return ptext

In [None]:
#applying the functions to the data

# if using emoji
# tweets['processed_text'] = tweets['processed_text'].apply(replace_emojis)

# otherwise
tweets['processed_text'] = tweets['processed_text'].apply(remove_emojis)

In [None]:
#viewing the data after emojis have been removed
tweets.head()

### 1.2.3 Removing Punctuation

We can remove punctuation from the corpus that is not relevant to our analysis.
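
A minimal sketch of this step on one made-up string is shown below. It wraps `string.punctuation` in `re.escape`, which is a defensive choice made for this example so that characters such as `]`, `^` and `-` cannot be read as regex syntax inside the character class.

```python
import re
import string

# Character class matching any single standard punctuation character
punct_pattern = f"[{re.escape(string.punctuation)}]"

# Hypothetical lowercased Tweet, invented for illustration
sample = "covid-19: cases up 5% (provisional)!"

no_punct = re.sub(punct_pattern, "", sample)
print(no_punct)
```

Removing the hyphen joins "covid-19" into the single token "covid19", which matters later when tokens containing digits are filtered out.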

In [None]:
# display the standard string punctuation
print(string.punctuation)

In [None]:
# creating a regular expression which captures all the above punctuation characters
# (re.escape stops characters such as ], ^ and - being read as regex syntax)
"[{}]".format(re.escape(string.punctuation))

In [None]:
# Creating a function that uses regex to remove punctuation from strings
def remove_punct(ptext):
    """
    replace any punctuation with nothing "", effectively removing it
    
    params
    ------
    ptext:  str
            Text from which to remove punctuation
    """
    ptext = re.sub(string=ptext,
                   pattern=f"[{re.escape(string.punctuation)}]",
                   repl="")
    return ptext

# by making a function that works for one piece of text
# we can then apply the function to all the pandas text

In [None]:
# apply the punctuation-removal function to every tweet
tweets['processed_text'] = tweets['processed_text'].apply(remove_punct)

In [None]:
#viewing data after punctuation has been removed
tweets.head()

### 1.2.4 Tokenizing the data

As explained above, we can use nltk's sentence tokeniser to split the data into sentences for sentiment analysis. Tokens don't have to be sentences, they can also be words or other sized pieces of text. Many natural language processing tasks require access to each word in a string, and we can achieve this via the following:



In [None]:
#applying nltk's built in word tokenizer to the tweets
tweets['processed_text'] = tweets['processed_text'].apply(nltk.word_tokenize)

In [None]:
#viewing tokenized tweets
tweets.head()

### 1.2.5 Lemmatization

Lemmatization, like stemming, reduces a word to a base form, but it looks the word up in a dictionary (here WordNet) so that the result is a real word rather than a truncated stem. For example, a stemmer reduces "vaccinated" and "vaccinates" to "vaccin", whereas a lemmatizer aims to return a dictionary form such as "vaccinate", which is a lot more valuable to us. Note that `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is supplied, so with the default settings used below many verb forms are left unchanged.

In [None]:
def lemmatise(ptokens):
    """
    Apply lemmatization to given list of word tokens
    
    params
    ------
    ptokens:    List[str]
                List of word tokens
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in ptokens]

In [None]:
## Applying lemmatization to the data
tweets['processed_text'] = tweets['processed_text'].apply(lemmatise)

In [None]:
#viewing data after text has been lemmatized
tweets.head()

### 1.2.6 Stopwords

Stopwords are very common words (such as "the", "is" and "of") that carry little or no meaning on their own, so they are removed from the text before analysis.
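
The idea can be sketched with a tiny hand-written stopword set. The set and tokens below are made up for illustration and are much smaller than any real stopword list.

```python
# Hypothetical stopword set; a set gives fast membership checks
stop_set = {"the", "is", "a", "of", "ons"}

# Hypothetical tokens from one Tweet
tokens = ["the", "ons", "census", "is", "a", "count", "of", "people"]

# Keep only the tokens that are not stopwords
kept = [token for token in tokens if token not in stop_set]
print(kept)
```

Domain-specific words such as "ons" can be added to the set when they appear in nearly every document and so do not help to tell topics apart.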

The NLTK library consists of a list of words that are considered stopwords for the English language. We can see these below:

In [None]:
# Display the basic stopwords given by nltk
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

In [None]:
def clean_stopwords(tokens):
    """
    Remove stopwords from list of tokens
    
    params
    ------
    tokens:     List[str]
                List of word tokens
    """
    # define stopwords, extended with words that appear in most Tweets about the ONS
    stopwords = set(nltk.corpus.stopwords.words('english'))
    newStopWords = ['office','national','statistics','statistic','amp','ons']
    stopwords.update(newStopWords)
    # loop through each token and if the word isn't in the set 
    # of stopwords keep it
    return [item for item in tokens if item not in stopwords]

In [None]:
#removing stopwords from the tweets
tweets['processed_text'] = tweets['processed_text'].apply(clean_stopwords)

In [None]:
#viewing data after stopwords have been removed
tweets.head()

### 1.2.7 Removing numbers

Similar to punctuation, we can remove numbers from text data as they may hold no relevance to the analysis.
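
A small sketch of the alphabetic filter on made-up tokens shows one side effect worth knowing about: `str.isalpha()` rejects any token that contains a digit, not just pure numbers.

```python
# Hypothetical tokens, invented for illustration
tokens = ["covid19", "2021", "vaccine", "7day", "rate"]

# Keep only tokens made up entirely of letters
alphabetic = [token for token in tokens if token.isalpha()]
print(alphabetic)
```

Here "covid19" is dropped along with "2021". If mixed tokens like this matter for your analysis, you may prefer to strip the digits from each token instead of discarding it.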

In [None]:
def remove_num(ptokens):
    """
    Keeps only alphabetic text
    
    params
    ------
    ptokens:    List[str]
                List of word tokens
    """
    return [token for token in ptokens if token.isalpha()]

In [None]:
#applying the above function to the data
tweets['processed_text'] = tweets['processed_text'].apply(remove_num)

In [None]:
#viewing tweets after only alphabetic text remains
tweets.head()

### 1.2.8 Removing words shorter than 3 characters

Ideally, we want as little noise as possible in our text data so that it is of the highest possible quality. Removing tokens that are fewer than 3 characters long will help to get rid of this noise.

In [None]:
def remove_short_tokens(ptokens):
    """
    Remove tokens that are less than
    3 characters in length.
    
    params
    ------
    ptokens:    List[str]
                List of word tokens
    """
    return [token for token in ptokens if len(token) > 2]

In [None]:
#applying the above function to the data
tweets['processed_text'] = tweets['processed_text'].apply(remove_short_tokens)

In [None]:
#viewing tweets after short tokens have been removed
tweets.head()

# 2.1 Sentiment Analysis with Tweets

If you are not sure how to use VADER for sentiment analysis, take a look at the [instructions](../instructions.html#41_Sentiment_Analysis_with_VADER).

Now that we have a basic grasp on how VADER works, we can apply it to our dataframe of tweets. We will use the `sentiment_analysis_text` column, as this has only been part-preprocessed. VADER will do the rest for us.

To perform sentiment analysis on our dataframe of tweets, we need to do the following:
- Read in the dataframe of cleaned tweets
- Add the VADER metrics to the dataframe - `pos`, `neg`, `neu`, `compound`
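
Because each Tweet has been split into sentences, each Tweet gets one score dictionary per sentence, and these are averaged to give one value per metric. The sketch below shows the averaging on hypothetical scores. The numbers are invented and only mimic the shape of VADER's output.

```python
import statistics

# Hypothetical per-sentence scores for one Tweet (not real VADER output)
ratings = [
    {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
    {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.2},
]

# Average each metric across the sentences of the Tweet
averaged = {
    key: statistics.mean(rating[key] for rating in ratings)
    for key in ("pos", "neg", "neu", "compound")
}
print(averaged)
```

A Tweet with one positive and one negative sentence can therefore end up close to neutral overall.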


In [None]:
# Add the VADER metrics to the dataframe - pos, neg, neu, compound

analyzer = SentimentIntensityAnalyzer()

tweets['rating'] = tweets['sentiment_analysis_text'].apply(lambda t: [analyzer.polarity_scores(sentence) for sentence in t])

print(tweets['sentiment_analysis_text'][0])
print(tweets['rating'][0])

In [None]:
def calculate_average_sentiment(list_of_score_dicts, key):
    """
    Calculate the mean score for the given key for a list of dictionaries.
    
    params
    ------
    list_of_score_dicts:    List[Dict]
                            A list of dictionaries containing VADER scores
                            
    key:                    str
                            Key of dictionary to extract to calculate mean.
    """
    return statistics.mean([score[key] for score in list_of_score_dicts])

In [None]:
# Split the dictionary into separate columns
for key in ['pos', 'neg', 'neu', 'compound']:
    tweets[key] = tweets['rating'].apply(calculate_average_sentiment, key = key)
    
tweets.drop(columns = ['rating'], inplace = True)
tweets.head()

In [None]:
#using .describe() to view general stats about the data
tweets.describe()

Now that we have our sentiment scores, we can explore some different visualisations of sentiment analysis.

# 3. Data Visualisation

The below visualisations will help you to answer the following questions:

1. Which Tweets have the most positive and most negative sentiments?
2. On what day of the week and time of day are the most people talking about the ONS?
3. On which times of day and days of the week do people show the most positive sentiment when talking about ONS?
4. Which Tweets from the @ONS account generated the most positive and negative responses?
5. Does length of Tweet have an impact on sentiment? If so what is the link between them?
6. What topics do people associate with the ONS and what is the average sentiment of Tweets about these topics?

## 3.1 - Which tweets have the most positive and negative sentiments?

We can start by creating three separate word clouds which will contain, respectively:

* most common words from all tweets

* most common words from tweets labelled positive

* most common words from tweets labelled negative

Although this won't answer the question directly, it gives a useful insight into which words and phrases are associated with positive and negative sentiments across all of the tweets.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def convert_series_to_text(series):
    text = [contents for token in list(series) for contents in token]
    return ' '.join(text).lower()

def show_wordcloud(data, title, max_words = 100):
    """
    Displays a wordcloud based on the given data
    
    params
    ------
    data:       pd.Series
                List of Tweets from which to
                construct wordcloud
    title:      str
                Title of wordcloud
    """
    wordcloud = WordCloud(
        background_color = 'white',
        max_words = max_words,
        max_font_size = 40, 
        scale = 3,
        random_state = 42
    ).generate(convert_series_to_text(data))

    fig = plt.figure(1, figsize = (10, 10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize = 20)
        fig.subplots_adjust(top = 2.3)
    plt.imshow(wordcloud)
    plt.show()
    
# print wordcloud
show_wordcloud(tweets["processed_text"], 'Most common words from all tweets', 500)

To create wordclouds for the positive and negative Tweets, we will need to define exactly what a positive, negative or neutral Tweet is. For this course, we will say that:

- a positive Tweet is a tweet with a compound score more than or equal to 0.05

- a negative Tweet is a tweet with a compound score less than or equal to -0.05

- a neutral Tweet is everything else, i.e tweets with a compound score between (but not including) -0.05 and 0.05.

With this information we can split all the Tweets into positive and negative Tweets.

In [None]:
pos_str = tweets[tweets['compound'] >= 0.05]['sentiment_analysis_text'] # filter the dataframe on the compound column
neg_str = tweets[tweets['compound'] <= -0.05]['sentiment_analysis_text']

show_wordcloud(pos_str, 'Wordcloud of positive Tweets')

In [None]:
neg_str = tweets[tweets['compound'] <= -0.05]['sentiment_analysis_text']
show_wordcloud(neg_str, 'Wordcloud of negative Tweets')

In order to answer the question, we can filter the Tweets to include only positive ones, and order them by their compound score (descending).

In [None]:
#show positive tweets and order by 'positivity' i.e. show tweets with highest positivity score first.
pos_tweets = tweets[tweets["compound"] >= 0.05].sort_values("compound", ascending = False)[["text", "compound"]].head(10)
pos_tweets

In [None]:
#show negative tweets and order by 'negativity' i.e. show tweets with highest negativity score first.
neg_tweets = tweets[tweets["compound"] <= -0.05].sort_values("compound")[["text", "compound"]].head(10)
neg_tweets

In [None]:
#most positive tweet (first row of the sorted table)
pos_tweets.iloc[0,0]

In [None]:
#most negative tweet
neg_tweets.iloc[0,0]

## 3.2 On what day of the week and time of day are the most people talking about the ONS?

Here we will be focusing on the `created_at` column. We will need to make sure the column is in the correct format, and then split it into two separate columns, which we will call `date` and `time`.
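
The same split can be sketched for a single timestamp with the standard library's `datetime` module. The timestamp below is hypothetical, and `DAY_NAMES` is a helper list introduced only for this example.

```python
from datetime import datetime

# Hypothetical timestamp in the style of a created_at value
created_at = datetime.fromisoformat("2021-03-04 10:15:30")

date_part = created_at.date()          # calendar date only
time_part = created_at.time()          # time of day only
weekday_number = created_at.weekday()  # Monday = 0 ... Sunday = 6
hour = created_at.hour                 # hour of the day, 0-23

# Helper list for turning the weekday number into a name
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
day_name = DAY_NAMES[weekday_number]

print(date_part, time_part, day_name, hour)
```

pandas exposes the same pieces for a whole column through the `.dt` accessor, for example `.dt.date`, `.dt.dayofweek` and `.dt.hour`.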

In [None]:
#change created_at column to datetime format
tweets['created_at'] = pd.to_datetime(tweets['created_at'])

In [None]:
#separating date and time into separate columns
tweets['date'] = pd.to_datetime(tweets['created_at'], errors='coerce').dt.date
tweets['time'] = pd.to_datetime(tweets['created_at'], errors='coerce').dt.time

In [None]:
tweets.head()

In [None]:
#create a third column named `day of week`, where 0 = Monday, 1 = Tuesday etc
tweets['day_of_week'] = tweets['created_at'].dt.dayofweek

In [None]:
"""
Map each number in the day_of_week column to the day of the week in word format.
This is not really a necessary step, but is easier to go by the word rather than the number.
"""

dw_mapping={
    0: 'Monday', 
    1: 'Tuesday', 
    2: 'Wednesday', 
    3: 'Thursday', 
    4: 'Friday',
    5: 'Saturday', 
    6: 'Sunday'
} 
tweets['day_of_week_name']=tweets['created_at'].dt.weekday.map(dw_mapping)

In [None]:
tweets.head()

In [None]:
#find the day that ONS is mentioned most in our dataset
a = tweets['date'].value_counts().idxmax()
b = tweets['day_of_week_name'].value_counts().idxmax()

print(f"ONS was mentioned most frequently on {a}, which is a {b}")

In [None]:
#find the most common day of the week for people to tweet in our dataset
tweets['day_of_week_name'].value_counts()

In [None]:
#create bar chart to visualise findings

# number of Tweets on each day of the week
values = tweets['day_of_week_name'].value_counts()

ax = values.plot(kind='bar',
                 title ="On what day of the week are the most people talking about the ONS?",
                 figsize=(10, 5),
                 ylabel='Number of tweets',
                 xlabel='Day of the week',
                 color = 'crimson',
                 legend=False,
                 fontsize=12)

From this bar chart, we can clearly see that Thursday is the day of the week on which the most people are talking about ONS, while Monday is the day on which the fewest people mention ONS. Can you think of any reasons why this could be?

In [None]:
#split time into hourly increments, for example any tweet published between 16:00:00 and 16:59:59 will return 16
tweets['hour'] = tweets['created_at'].apply(lambda x: x.time().hour)

In [None]:
tweets.head()

In [None]:
#find the most common time of day for people to tweet in our dataset

times = tweets.hour.value_counts()
times.sort_index(inplace=True)

In [None]:
#create bar chart to visualise findings

ax = times.plot(kind='bar',
                title ="What time of day do most people talk about the ONS?",
                figsize=(10, 5),
                xlabel='Hour of the day',
                ylabel='Number of tweets',
                color = 'cornflowerblue',
                legend=False,
                fontsize=12)
plt.xticks(rotation=0);

The results of this bar chart are what you might expect, with few people talking about ONS in the early hours of the morning. There is a sharp increase in conversation between 4am and 10am, the hour in which the most people are talking about ONS. Conversation then slowly decreases as the day goes on, with a noticeable dip at 1pm before it picks up again at 2pm.

## 3.3 On which times of day and days of the week do people show the most positive sentiment when talking about ONS?

In [None]:
# We will need to know the 'compound' score and hour of the Tweets
tweet_times = tweets.filter(['text', 'compound', 'hour'], axis = 1)
tweet_times

In [None]:
#find the mean sentiment for each hourly increment of the day
# (numeric_only=True skips the 'text' column, which cannot be averaged)
tweet_times = tweet_times.groupby(['hour']).mean(numeric_only=True)
tweet_times = tweet_times.sort_values(by = 'compound', ascending=False)
tweet_times.reset_index(inplace=True)

tweet_times

As we can see from the bar chart above, it's not very common to see tweets late at night or very early in the morning. To make our data slightly simpler to look at, we could only include daytime tweets, for example from 7am-7pm.

In [None]:
#drop times that are outside of 7-19
tweet_times_daytime = tweet_times[(tweet_times['hour']>=7) & (tweet_times['hour']<=19)]
# sort into a new dataframe rather than sorting the filtered slice in place
tweet_times_daytime = tweet_times_daytime.sort_values(by='hour')

Now we have got rid of 'unsociable hours', we can see midday has the most positive sentiment, so we can suggest to the comms team that if they would like to get the most positive reaction to their tweets then this is the time they should post.

In [None]:
#create bar chart to display findings
ax2 = tweet_times_daytime.plot(kind='bar',
                             x='hour',
                             ylabel='compound',
                             title ="What time of day do people show the most positive sentiment towards ONS?",
                             figsize=(10, 5),
                             color = 'coral',
                             legend=False,
                             fontsize=12)
plt.xticks(rotation=0);

Unlike our bar chart for the most popular times of day for people to be talking about ONS, there's no clear pattern here. Sentiment generally tends to be more positive in the morning, with a very big dip in compound score at 2pm.

In [None]:
# We will need to know the 'compound' score and day of the Tweets
tweet_days = tweets.filter(['text', 'compound', 'day_of_week_name'], axis = 1)
tweet_days

In [None]:
#find the mean sentiment for each day of the week
# (numeric_only=True skips the 'text' column, which cannot be averaged)
tweet_days = tweet_days.groupby(['day_of_week_name']).mean(numeric_only=True)
tweet_days = tweet_days.sort_values(by = 'compound', ascending=False)
tweet_days.reset_index(inplace=True)

tweet_days

In [None]:
#create horizontal bar chart to display findings
ax2 = tweet_days.plot(kind='barh',
                      x='day_of_week_name',
                      xlabel='day of week',
                      ylabel='Compound',
                      title ="What day of the week do people show the most positive sentiment towards ONS?",
                      figsize=(10, 5),
                      color = 'green',
                      legend=False,
                      fontsize=12)

From our charts above, we discovered that Saturdays and Sundays tend to be quite quiet in terms of people talking about ONS. From this chart, we can see that the people who are talking about ONS are being quite negative about it! In contrast to the weekend, the most popular day of the week to mention ONS was a Thursday, which is also showing up with quite a negative sentiment in this chart. Only three days out of seven have an overall positive sentiment.

## 3.4 Which tweets from the @ONS account generated the most positive and negative responses?

For this question, we will be focusing on tweets that are in reply to the ONS, so we will need to filter our dataframe to reflect this. We can then further filter it by showing the `repliedto_tweet` id and the positive/negative sentiment score of each tweet. Then, by sorting our new dataframe by positive and negative sentiment, we can find the tweet that generated the most positive and negative responses.
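
The grouping logic can be sketched without pandas on a handful of made-up replies. The tweet ids and compound scores below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (repliedto_tweet, compound) pairs, one per reply
replies = [("111", 0.6), ("111", 0.2), ("222", -0.5),
           ("222", -0.1), ("333", 0.0)]

# Collect the compound scores of all replies to each original tweet
scores_by_parent = defaultdict(list)
for parent_id, compound in replies:
    scores_by_parent[parent_id].append(compound)

# Average the replies for each original tweet
mean_by_parent = {pid: mean(scores) for pid, scores in scores_by_parent.items()}

# Rank the original tweets from most positive to most negative responses
ranked = sorted(mean_by_parent, key=mean_by_parent.get, reverse=True)
print(ranked)
```

An original tweet with only one or two replies can top either ranking by chance, so it may be worth checking how many replies sit behind each mean.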

In [None]:
#only show tweets that are in reply to @ONS
ons_tweets = tweets.loc[tweets['in_reply_to_ons'] == True]
ons_tweets.head()

In [None]:
#filter the dataframe to only show columns we need
ons_tweets = ons_tweets.filter(['text', 'repliedto_tweet', 'compound'], axis = 1)

In [None]:
#remove any null values
ons_tweets = ons_tweets.dropna()

In [None]:
ons_tweets

In [None]:
#find the mean sentiment for each replied to tweet id
# (numeric_only=True skips the 'text' column, which cannot be averaged)
ons_tweets = ons_tweets.groupby(['repliedto_tweet']).mean(numeric_only=True)

In [None]:
#sort by positive sentiment
ons_tweets.sort_values(by=['compound'], ascending = False).head(10)

Now that we have this information, we can copy the tweet id of the first row and add it to the following URL:

`http://twitter.com/ons/status/tweet-id-here`

This will then bring up the tweet that we are looking for. We can do the same for negative tweets:

In [None]:
#sort by negative sentiment
ons_tweets.sort_values(by=['compound']).head(10)

## 3.5 Does length of tweet have an impact on sentiment? If so, what is the link between them?

In [None]:
def label_tweets(num):
    """
    Label tweets as either pos, neg or neu
    
    params
    ------
    num:    float
            Compound score from VADER sentiment
            analysis indicating sentiment.
    """
    if num >=0.05:
        return 'pos'
    elif num <= -0.05:
        return 'neg'
    else:
        return 'neu'

In [None]:
#create a column in the tweets dataframe with tweet labels
tweets['label'] = tweets['compound'].apply(label_tweets)

In [None]:
def tweet_word_count(tweet):
    """
    Calculates the number of words in the given Tweet.
    
    params
    ------
    tweet:      List[str]
                Sentence-tokenized Tweet string
    """
    return sum([len(sent.split(' ')) for sent in tweet])

In [None]:
#calculating the length of each value in the sentiment_analysis_text column
tweets['length'] = tweets['sentiment_analysis_text'].apply(tweet_word_count)

In [None]:
#creating a new df with only the columns we need
sentiment_length_df = tweets[['sentiment_analysis_text','length','compound','label']]

In [None]:
#dividing data into X and y values
x = sentiment_length_df[['length']]
y = sentiment_length_df[['compound']]

In [None]:
#show findings with scatter plot

plt.scatter(x, y, alpha=0.2)
plt.xlabel("length of tweet")
plt.ylabel("compound")
plt.title("Does the length of a tweet affect its sentiment?")
plt.show()

In [None]:
#The scatter plot shows no clear correlation between the length of a tweet and its compound sentiment score.
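
One way to put a number on what the scatter plot shows is the Pearson correlation coefficient. The sketch below is a plain-Python version run on made-up values, not on the Tweet data. A result near 0 suggests no linear relationship, while values near 1 or -1 suggest a strong one.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical lengths and compound scores with no linear relationship
lengths = [1, 2, 3, 4]
compounds = [0.1, -0.1, -0.1, 0.1]
r_unrelated = pearson(lengths, compounds)

# Hypothetical scores that rise steadily with length
r_linear = pearson(lengths, [0.1, 0.2, 0.3, 0.4])

print(r_unrelated, r_linear)
```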

## 3.6 What topics do people associate with ONS, and what is the overall sentiment of tweets about these topics?

We can use **Latent Dirichlet Allocation** (LDA), one of the most popular topic modelling algorithms, to extract topics from our tweets. For this project, we'll define a topic as a collection of dominant keywords that are typical representatives.

If you are unfamiliar with topic modelling, take some time to read through the topic modelling section in the [instructions](../instructions.html#41_Sentiment_Analysis_with_VADER) document.

In [None]:
processed_text = tweets['processed_text']

# Create dictionary
dictionary = gensim.corpora.Dictionary(processed_text)

In [None]:
# Test dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

In [None]:
# Bag of Words
bow_corpus = [dictionary.doc2bow(tweet) for tweet in processed_text]

In [None]:
# Iterate through range of k-topics fitting LDA model to each and computing coherence scores for each model
coherenceList_umass = []
coherenceList_cv = []
num_topics_list = np.arange(4,14+1)
for num_topics in num_topics_list:
    lda = models.LdaMulticore(corpus=bow_corpus, num_topics=num_topics, id2word=dictionary, 
                              passes=10,chunksize=4000,random_state=0)
    cm = CoherenceModel(model=lda, corpus=bow_corpus, 
                        dictionary=dictionary, coherence='u_mass')
    coherenceList_umass.append(cm.get_coherence())

In [None]:
# Plot coherence scores across topic numbers

plotData = pd.DataFrame({'Number of topics':num_topics_list,
                         'CoherenceScore':coherenceList_umass})
f,ax = plt.subplots(figsize=(16,10))
sns.set_style("darkgrid")
sns.set(font_scale = 2)
sns.pointplot(x='Number of topics', y= 'CoherenceScore',data=plotData)
plt.axhline(y=-4.8, color='red')
plt.title('Topic Coherence');

The coherence score is highest at number of topics = 5, so we will use 5 topics.
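
Reading the best value off the plot can also be done in code. The sketch below uses invented coherence scores for 4 to 14 topics, not the values produced by the models above.

```python
# Candidate numbers of topics, matching the range tried above
num_topics_list = list(range(4, 15))

# Hypothetical u_mass-style coherence scores, invented for illustration
scores = [-5.2, -4.7, -5.0, -5.4, -5.9, -6.1, -6.0, -6.4, -6.8, -7.0, -7.3]

# Pick the number of topics whose score is highest (closest to zero here)
best_k, best_score = max(zip(num_topics_list, scores), key=lambda pair: pair[1])
print(best_k, best_score)
```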

In [None]:
# LDA Model using BOW
lda_model_bow = gensim.models.LdaMulticore(corpus=bow_corpus, num_topics=5, id2word=dictionary, decay=0.5,
                                           chunksize=10000, passes=10, workers=4, random_state=0)

In [None]:
topic_desc = []
for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
pyLDAvis.enable_notebook()
lda_viz = gensimvis.prepare(lda_model_bow, bow_corpus, dictionary)
lda_viz

##### What does our output mean? 

* Each bubble on the left hand side represents a topic. The bigger the bubble, the more common the topic is in our tweets. Ideally, the bubbles should be spread across all four quadrants, and shouldn't overlap too much. If you have a lot of small bubbles with a lot of overlap, then you have too many topics.

* Hovering over each bubble shows the 30 most salient words related to that topic on the right hand side.

In [None]:
# index the model with the document to find out the topics and percentage contribution

lda_model_bow[bow_corpus[0]]

In [None]:
# In this case document 0 could be topic 0 (18% contribution), topic 2 (64% contribution) or topic 3 (16% contribution)
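
To assign a single topic to the document we want the pair with the largest contribution. The sketch below uses the contributions quoted above for document 0 and shows why the sort direction matters.

```python
# (topic number, contribution) pairs for document 0, as quoted above
doc_topics = [(0, 0.18), (2, 0.64), (3, 0.16)]

# The dominant topic is the pair with the largest contribution
dominant_topic, share = max(doc_topics, key=lambda pair: pair[1])

# Sorting in ascending order and taking the first item picks the weakest topic instead
weakest_topic = sorted(doc_topics, key=lambda pair: pair[1])[0][0]

print(dominant_topic, share, weakest_topic)
```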

In [None]:
def assign_topic(document, lda_model):
    """
    Fetch the topics and percentage contribution of the document
    for the given LDA model and return the topic number with the 
    highest contribution.
    
    params
    -----
    document:   List[Tuple]
                A BOW document where each tuple represents
                (word index, count)
    lda_model:  gensim LDA model
                Fitted model used to score the document
    """
    topics = lda_model[document]
    topics = sorted(topics, key=lambda x:(x[1]), reverse=True) # sort in descending order of percent contribution
    return topics[0][0]

tweets['topic'] = [assign_topic(doc, lda_model = lda_model_bow) for doc in bow_corpus]
tweets.head()

In [None]:
topics = {}

for idx, topic in lda_model_bow.print_topics(-1):
    topics[idx] = f"Topic {idx}: "
    topic_words = []
    for word in re.findall('[a-z]+', topic):
        topic_words.append(word)
    topics[idx] += ', '.join(topic_words)

tweets['topic_words'] = tweets['topic'].map(topics)
tweets[['text', 'topic', 'topic_words']].head(10)

In [None]:
# what is the average sentiment for each topic?
# select the 'compound' column before averaging so that text columns are ignored

tweets.groupby(['topic_words'])['compound'].mean()

In [None]:
tweets.groupby(['topic'])['compound'].mean().plot(kind='bar',
                                                  xlabel='Topic number',
                                                  ylabel='Mean sentiment',
                                                  title ="What is the average sentiment of Tweets about different topics?",
                                                  figsize=(8, 5),
                                                  color = 'green',
                                                  legend=False,
                                                  fontsize=12);

for key, val in topics.items():
    print(val)
plt.xticks(rotation=0);

## Conclusion

Now that we've successfully answered all six questions, we can get back to the ONS comms team to help them with their Twitter strategy. Here are the main points we should feed back to the team:

* All of the topics we found relate to Covid-19. Topics 0, 1 and 3 are slightly positive, which could be seen as surprising as the key words include death and suicide. Only topic 4 is somewhat negative, so we could tentatively conclude that overall, the topics involving ONS that people are talking about on Twitter are positive. Although topic 2 is positive, it's so close to zero that we can safely say it has a neutral sentiment.
* However, only three days out of seven have people showing a positive sentiment towards ONS: Tuesday, Wednesday and Friday. Weekend tweets should be avoided at all costs!
* People tend to talk about the ONS late morning, and on Thursdays.
* Midday is the time where tweets are at their most positive, but not for long; tweets between 2 and 4pm are much more negative!
* There is no relationship between sentiment and length of tweets, so the team don't need to worry about shortening or lengthening their tweets.