# Pfizer Vaccine Sentiment Analysis with fastai
In this notebook we will perform sentiment analysis on tweets about COVID-19 vaccines using the [`fastai`](https://docs.fast.ai/) library. I will provide a brief overview of the process here; a much more in-depth explanation of NLP with [`fastai`](https://docs.fast.ai/) can be found in [lesson 8](https://course.fast.ai/videos/?lesson=8) of the [`fastai`](https://docs.fast.ai/) course. For convenience, clicking on inline code written like [`this`](https://docs.fast.ai/tutorial.text.html) will take you to the relevant part of the [`fastai`](https://docs.fast.ai/) documentation where appropriate.

## Transfer learning in NLP - the ULMFiT approach

We will be making use of *transfer learning* to help us create a model to analyse tweet sentiment. The idea behind transfer learning is that neural networks learn information that generalises to new problems, [particularly the early layers of the network](https://arxiv.org/pdf/1311.2901.pdf). In computer vision, for example, we can take a model that was trained on the ImageNet dataset to recognise different features of images such as circles, then apply that to a smaller dataset and *fine-tune* the model to be more suited to a specific task (e.g. classifying images as cats or dogs). This technique allows us to train neural networks much faster and with far less data than we would otherwise need.

In 2018 [a paper](https://arxiv.org/abs/1801.06146) introduced a transfer learning technique for NLP called 'Universal Language Model Fine-Tuning' (ULMFiT). The approach is as follows:
1. Train a *language model* to predict the next word in a sentence. This step is already done for us; with [`fastai`](https://docs.fast.ai/) we can download a model that has been pre-trained for this task on millions of Wikipedia articles. A good language model already knows a lot about how language works in general - for instance, given the sentence 'Tokyo is the capital of', the model might predict 'Japan' as the next word. In this case the model understands that Tokyo is closely related to Japan and that 'capital' refers to 'city' here instead of 'upper-case' or 'money'.
2. Fine-tune the language model to a more specific task. The pre-trained language model is good at understanding Wikipedia English, but Twitter English is a bit different. We can take the information the Wikipedia model has learned and apply that to a Twitter dataset to get a Twitter language model that is good at predicting the next word in a tweet.
3. Fine-tune a *classification model* to identify sentiment using the pre-trained language model. The idea here is that since our language model already knows a lot about Twitter English, it's not a huge leap from there to train a classifier that understands that 'love' refers to positive sentiment and 'hate' refers to negative sentiment. If we tried to train a classifier without using a pre-trained model it would have to learn the whole language from scratch first, which would be very difficult and time-consuming.

<img alt="Diagram of the ULMFiT process (source: course.fast.ai)" width="700" align="left" caption="The ULMFiT process" id="ulmfit_process" src="https://i.imgur.com/8XLluAn.png">

This notebook will walk through steps 2 and 3 with [`fastai`](https://docs.fast.ai/). Afterwards we can use our new classifier to analyse sentiment in the COVID-19 vaccine tweets.

## Loading the data
First, let's import [`fastai`](https://docs.fast.ai/)'s [`text`](https://docs.fast.ai/tutorial.text.html) module and take a look at our data.

In [None]:
from fastai.text.all import *

In [None]:
path = Path('/kaggle/input/')
path.ls()

In [None]:
vax_tweets = pd.read_csv(path/'pfizer-vaccine-tweets/vaccination_tweets.csv')
vax_tweets.head()

We could use the `text` column of this dataset to train a Twitter language model, but since our end goal is sentiment analysis we will need to find another dataset that also contains sentiment labels to train our classifier. Let's use ['Complete Tweet Sentiment Extraction Data'](https://www.kaggle.com/maxjon/complete-tweet-sentiment-extraction-data), which contains 40,000 tweets labelled as either negative, neutral or positive sentiment. For more accurate results you could use the ['sentiment140'](https://www.kaggle.com/kazanova/sentiment140) dataset instead, which contains 1.6m tweets labelled as either positive or negative.

In [None]:
tweets = pd.read_csv(path/'complete-tweet-sentiment-extraction-data/tweet_dataset.csv')
tweets.head()

For our language model, the only input we need is the tweet text. As we will see in a moment, [`fastai`](https://docs.fast.ai/) can handle text preprocessing and tokenization for us, but it might be a good idea to remove things like Twitter handles, URLs, hashtags and emojis first. You could experiment with leaving these in for your own models and see how it affects the results. There are also some rows with blank tweets which need to be removed.

We ideally want the language model to learn not just about tweet language, but more specifically about vaccine tweet language. We can therefore use text from both datasets as input for the language model. For the classification model we need to remove all rows with missing sentiment, however.

In [None]:
# Code via https://www.kaggle.com/garyongguanjie/comments-analysis
def de_emojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')

# Code via https://www.kaggle.com/pawanbhandarkar/generate-smarter-word-clouds-with-log-likelihood
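def tweet_proc(df, text_col='text'):
    # Work on a copy so the DataFrame passed in is not modified in place
    df = df.copy()
    df['orig_text'] = df[text_col]
    # Remove twitter handles (raw string avoids an invalid '\s' escape)
    df[text_col] = df[text_col].apply(lambda x: re.sub(r'@[^\s]+', '', x))
    # Remove URLs
    df[text_col] = df[text_col].apply(lambda x: re.sub(r"http\S+", "", x))
    # Remove emojis
    df[text_col] = df[text_col].apply(de_emojify)
    # Remove hashtags
    df[text_col] = df[text_col].apply(lambda x: re.sub(r'\B#\S+', '', x))
    # Drop rows whose text is empty or whitespace-only after cleaning
    return df[df[text_col].str.strip() != '']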

# Clean the text data and combine the dfs
tweets = tweets[['old_text', 'new_sentiment']].rename(columns={'old_text':'text', 'new_sentiment':'sentiment'})
vax_tweets['sentiment'] = np.nan
tweets = tweet_proc(tweets)
vax_tweets = tweet_proc(vax_tweets)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df_lm = pd.concat([tweets[['text', 'sentiment']], vax_tweets[['text', 'sentiment']]])
df_clas = df_lm.dropna(subset=['sentiment'])
print(len(df_lm), len(df_clas))
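
To see what each cleaning step does, here is a minimal, self-contained sketch that applies the same four substitutions to a single made-up tweet. The `clean_tweet` helper and the sample text are illustrative assumptions and are not part of the original notebook.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the four cleaning steps from tweet_proc to one string."""
    text = re.sub(r'@[^\s]+', '', text)                     # Twitter handles
    text = re.sub(r'http\S+', '', text)                     # URLs
    text = text.encode('ascii', 'ignore').decode('ascii')   # emojis / non-ASCII
    text = re.sub(r'\B#\S+', '', text)                      # hashtags
    return text

# Hypothetical tweet, invented for this example
sample = "@NHS Got my jab today 💉 #CovidVaccine https://t.co/abc123"
print(repr(clean_tweet(sample)))

# A tweet made up only of a handle and a link is left as whitespace
print(repr(clean_tweet("@user https://t.co/xyz")))
```

Note that the substitutions leave the surrounding spaces behind, so a tweet that consisted only of handles, links or hashtags ends up as whitespace rather than an empty string. It can be worth stripping whitespace before checking for blank rows.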

In [None]:
df_clas.head()

## Training a language model
To train our language model we can use self-supervised learning; we just need to give the model some text as an independent variable and [`fastai`](https://docs.fast.ai/) will automatically preprocess it and create a dependent variable for us. We can do this in one line of code using the [`TextDataLoaders`](https://docs.fast.ai/text.data.html#TextDataLoaders) class, which converts our input data into a [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders) object (holding a training and a validation [`DataLoader`](https://docs.fast.ai/data.load.html#DataLoader)) that can be used as an input to a [`fastai`](https://docs.fast.ai/) [`Learner`](https://docs.fast.ai/learner.html#Learner).

In [None]:
dls_lm = TextDataLoaders.from_df(df_lm, text_col='text', is_lm=True, valid_pct=0.1)

Here we told [`fastai`](https://docs.fast.ai/) that we are working with text data, which is contained in the `text` column of a [`pandas`](https://pandas.pydata.org/docs/) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) called `df_lm`. We set [`is_lm=True`](https://docs.fast.ai/text.data.html#TextDataLoaders) since we want to train a language model, so [`fastai`](https://docs.fast.ai/) needs to label the input data for us. Finally, we told [`fastai`](https://docs.fast.ai/) to hold out a random 10% of our data for a validation set using [`valid_pct=0.1`](https://docs.fast.ai/text.data.html#TextDataLoaders).

Let's take a look at the first two rows of the [`DataLoader`](https://docs.fast.ai/data.load.html#DataLoader) using [`show_batch`](https://docs.fast.ai/data.core.html#TfmdDL.show_batch).

In [None]:
dls_lm.show_batch(max_n=2)

We have a new column, `text_`, which is `text` offset by one token. This is the dependent variable [`fastai`](https://docs.fast.ai/) created for us. By default [`fastai`](https://docs.fast.ai/) uses *word tokenization*, which splits the text on spaces and punctuation marks and breaks up words like *can't* into two separate tokens. [`fastai`](https://docs.fast.ai/) also has some special tokens starting with 'xx' that are designed to make things easier for the model; for example [`xxmaj`](https://docs.fast.ai/text.data.html) indicates that the next word begins with a capital letter and [`xxunk`](https://docs.fast.ai/text.data.html) stands in for an unknown word, i.e. one that is too rare to be included in the vocabulary. You could experiment with *subword tokenization* instead, which splits the text on commonly occurring groups of letters instead of spaces. This might help if you wanted to leave hashtags in since they often contain multiple words joined together with no spaces, e.g. #CovidVaccine. The [`fastai`](https://docs.fast.ai/) tokenization process is explained in much more detail [here](https://youtu.be/WjnwWeGjZcM?t=626) for those interested.
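
The "offset by one" idea can be illustrated without `fastai` at all. The sketch below uses a hand-written token list (an assumption made for illustration, not real tokenizer output) and builds the input and target sequences the way a language model sees them: at every position the target is simply the next token.

```python
# Hand-written tokens that mimic the style of fastai's output (illustrative only)
tokens = ['xxbos', 'xxmaj', 'got', 'my', 'first', 'dose', 'today']

# Independent variable: every token except the last
inputs = tokens[:-1]
# Dependent variable: the same sequence shifted left by one token
targets = tokens[1:]

for x, y in zip(inputs, targets):
    print(f"{x:>6} -> {y}")
```

No manual labelling is needed, which is why this kind of training is called self-supervised.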

### Fine-tuning the language model
The next step is to create a language model using [`language_model_learner`](https://docs.fast.ai/text.learner.html#language_model_learner).

In [None]:
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()

Here we passed [`language_model_learner`](https://docs.fast.ai/text.learner.html#language_model_learner) our [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders), `dls_lm`, and the pre-trained [RNN](https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn) model, [*AWD_LSTM*](https://docs.fast.ai/text.models.awdlstm.html), which is built into [`fastai`](https://docs.fast.ai/). [`drop_mult`](https://docs.fast.ai/text.learner.html#text_classifier_learner) is a multiplier applied to all [dropouts](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/) in the AWD_LSTM model to reduce overfitting. For example, by default [`fastai`](https://docs.fast.ai/)'s AWD_LSTM applies [`EmbeddingDropout`](https://docs.fast.ai/text.models.awdlstm.html#EmbeddingDropout) with 10% probability (at the time of writing), but we told [`fastai`](https://docs.fast.ai/) that we want to reduce that to 3%. The [`metrics`](https://docs.fast.ai/metrics.html) we want to track are *perplexity*, which is the exponential of the loss (in this case cross entropy loss), and *accuracy*, which tells us how often our model predicts the next word correctly. We can also train with fp16 to use less memory and speed up the training process.
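
As a rough illustration of the two quantities mentioned above, the following sketch computes cross-entropy loss and perplexity from a handful of made-up probabilities, and shows how a dropout multiplier scales a dropout probability. The probabilities are invented for the example, and the 10% default is simply the figure quoted in the text rather than something checked against a specific `fastai` version.

```python
import math

# Hypothetical probabilities the model assigned to the *correct* next token
# at four positions (made-up numbers for illustration)
p_correct = [0.5, 0.25, 0.125, 0.125]

# Cross-entropy loss: mean negative log-probability of the correct token
loss = -sum(math.log(p) for p in p_correct) / len(p_correct)

# Perplexity is the exponential of that loss
perplexity = math.exp(loss)
print(f"loss={loss:.4f}, perplexity={perplexity:.4f}")

# drop_mult scales a dropout probability by a constant factor
default_embed_p = 0.1   # figure quoted in the text above
drop_mult = 0.3
print(f"effective embedding dropout: {default_embed_p * drop_mult:.2f}")
```

A perplexity of *k* can be read loosely as the model being as uncertain as if it were choosing uniformly between *k* tokens, so lower is better.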

We can find a good learning rate for training using [`lr_find`](https://docs.fast.ai/callback.schedule.html#Learner.lr_find) and use that to fit our model.

In [None]:
learn.lr_find()

When we created our [`Learner`](https://docs.fast.ai/learner.html#Learner) the embeddings from the pre-trained AWD_LSTM model were merged with random embeddings added for words that weren't in the vocabulary. The pre-trained layers were also automatically frozen for us. Using [`fit_one_cycle`](https://docs.fast.ai/callback.schedule.html#Learner.fit_one_cycle) with our [`Learner`](https://docs.fast.ai/learner.html#Learner) will train only the *new random embeddings* (i.e. words that are in our Twitter vocab but not the Wikipedia vocab) in the last layer of the neural network.

In [None]:
learn.fit_one_cycle(1, 3e-2)

After one epoch our language model is correctly predicting the next word in a tweet around 23% of the time - not too bad! We can [`unfreeze`](https://docs.fast.ai/learner.html#Learner.unfreeze) the entire model, find a more suitable learning rate and train for a few more epochs to improve the accuracy further.

In [None]:
learn.unfreeze()
learn.lr_find()

In [None]:
learn.fit_one_cycle(4, 1e-3)

After a bit more training we can predict the next word in a tweet just under 26% of the time. Let's test the model out by using it to write some random tweets (in this case it will generate some text following 'I love').

In [None]:
# Text generation using the language model
TEXT = "I love"
N_WORDS = 30
N_SENTENCES = 2
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
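
The `temperature` argument controls how adventurous the sampling is. The sketch below is a generic illustration of temperature scaling on a softmax, using made-up scores; it is not `fastai`'s exact implementation, and the `softmax_with_temperature` helper is a hypothetical name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens
logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.75, 0.25):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the most likely tokens, which tends to give safer but more repetitive text; temperatures above 1 do the opposite.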

Let's save the model *encoder* so we can use it to fine-tune our classifier. The encoder is all of the model except for the final layer, which converts activations to probabilities of picking each token in the vocabulary. We want to keep the knowledge the model has learned about tweet language but we won't be using our classifier to predict the next word in a sentence, so we won't need the final layer any more.

In [None]:
learn.save_encoder('finetuned_lm')

## Training a sentiment classifier
To get the [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders) for our classifier let's use the [`DataBlock`](https://docs.fast.ai/tutorial.datablock.html#Text) API this time, which is more customisable.

In [None]:
dls_clas = DataBlock(
    blocks = (TextBlock.from_df('text', seq_len=dls_lm.seq_len, vocab=dls_lm.vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('sentiment'),
    splitter=RandomSplitter()
).dataloaders(df_clas, bs=64)

To use the API, [`fastai`](https://docs.fast.ai/) needs the following:
* [`blocks`](https://docs.fast.ai/data.block.html#TransformBlock):
    * [`TextBlock`](https://docs.fast.ai/text.data.html#TextBlock): Our x variable will be text contained in a [`pandas`](https://pandas.pydata.org/docs/) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). We want to use the same sequence length and vocab as the language model [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders) so we can make use of our pre-trained model.
    * [`CategoryBlock`](https://docs.fast.ai/data.block.html#CategoryBlock): Our y variable will be a single-label category (negative, neutral or positive sentiment).
* [`get_x`](https://docs.fast.ai/data.transforms.html#ColReader), [`get_y`](https://docs.fast.ai/data.transforms.html#ColReader): Get data for the model by reading the `text` and `sentiment` columns from the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
* [`splitter`](https://docs.fast.ai/data.transforms.html#RandomSplitter): We will use [`RandomSplitter()`](https://docs.fast.ai/data.transforms.html#RandomSplitter) to randomly split the data into a training set (80% by default) and a validation set (20%).
* [`dataloaders`](https://docs.fast.ai/data.block#DataBlock.dataloaders): Builds the [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders) using the [`DataBlock`](https://docs.fast.ai/tutorial.datablock.html#Text) template we just defined, the *df_clas* [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and a batch size of 64.
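
To make the splitting step concrete, here is a minimal plain-Python sketch of a random 80/20 split over row indices. The `random_split` function is a hypothetical stand-in written for this example; it mirrors the idea behind `RandomSplitter` but is not its implementation.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idx, valid_idx) after shuffling the indices 0..n_items-1."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    # The first `cut` shuffled indices become the validation set
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

Fixing the seed makes the split reproducible, which is useful when comparing models trained with different hyperparameters.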

We can call [`show_batch`](https://docs.fast.ai/data.core.html#TfmdDL.show_batch) as before; this time the dependent variable is sentiment.

In [None]:
dls_clas.show_batch(max_n=2)

Initialising the [`Learner`](https://docs.fast.ai/learner.html#Learner) is similar to before, but in this case we want a [`text_classifier_learner`](https://docs.fast.ai/text.learner.html#text_classifier_learner).

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

Finally, we want to load the encoder from the language model we trained earlier, so our classifier uses pre-trained weights.

In [None]:
learn = learn.load_encoder('finetuned_lm')

### Fine-tuning the classifier
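Now we can train the classifier using *discriminative learning rates* and *gradual unfreezing*, which have been found to give better results for this type of model. The pre-trained layers are frozen by default when the [`Learner`](https://docs.fast.ai/learner.html#Learner) is created, so first we train with all but the last layer group frozen: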

In [None]:
learn.fit_one_cycle(1, 3e-2)

Now freeze all but the last two layer groups:

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

Now all but the last three:

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

Finally, let's unfreeze the entire model and train a bit more:

In [None]:
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-3/(2.6**4),1e-3))
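
The `slice(lr/(2.6**4), lr)` arguments above are what give each part of the model its own learning rate. The sketch below shows one way to read them, under two assumptions that are mine rather than the notebook's: that the model is split into five layer groups, and that the rates are spaced geometrically between the two ends of the slice. The `spread_learning_rates` helper is hypothetical.

```python
def spread_learning_rates(lr_min, lr_max, n_groups):
    """Geometrically space n_groups learning rates from lr_min to lr_max."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]

lr_max = 1e-3
# Assumed: five layer groups, earliest layers first
lrs = spread_learning_rates(lr_max / 2.6**4, lr_max, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.2e}")
```

Under those assumptions each group trains 2.6 times faster than the one before it, so the early layers, which hold the most general knowledge, change the least.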

In [None]:
learn.save('classifier')

Our model correctly predicts sentiment just under 77% of the time. We could perhaps do better with a larger dataset as mentioned earlier, or different model hyperparameters. It might be worth experimenting with this yourself to see if you can improve the accuracy.

We can quickly sense check the model by calling [`predict`](https://docs.fast.ai/learner.html#Learner.predict), which returns the predicted sentiment, the index of the prediction and predicted probabilities for negative, neutral and positive sentiment.

In [None]:
learn.predict("I love")

In [None]:
learn.predict("I hate")

## Analysing the tweets
To carry out sentiment analysis on the vaccine tweets, we can add them to the [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders) as a test set:

In [None]:
pred_dl = dls_clas.test_dl(vax_tweets['text'])

We can then make predictions using [`get_preds`](https://docs.fast.ai/learner.html#Learner.get_preds):

In [None]:
preds = learn.get_preds(dl=pred_dl)

Let's go ahead and check out the results.

In [None]:
# Get predicted sentiment
vax_tweets['sentiment'] = preds[0].argmax(dim=-1)
vax_tweets['sentiment'] = vax_tweets['sentiment'].map({0:'negative', 1:'neutral', 2:'positive'})

# Save to csv
vax_tweets.to_csv('vax_tweets_sentiment.csv')

# Plot sentiment value counts
vax_tweets['sentiment'].value_counts(normalize=True).plot.bar();
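
The prediction step boils down to taking the index of the largest probability in each row and looking up the matching label. Here is a small plain-Python sketch of that decoding, using made-up probabilities; the `decode` helper is hypothetical.

```python
# Class labels in the order the classifier is assumed to use (alphabetical)
labels = ['negative', 'neutral', 'positive']

# Made-up predicted probabilities for three tweets
probs = [
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
]

def decode(row, labels):
    """Return the label whose probability is largest in this row."""
    best = max(range(len(row)), key=row.__getitem__)
    return labels[best]

predicted = [decode(row, labels) for row in probs]
print(predicted)
```

The hard-coded `{0: 'negative', 1: 'neutral', 2: 'positive'}` mapping relies on the classes being stored in alphabetical order. If in doubt, it is safer to read the label order from the classifier's `DataLoaders` (I believe it is available as `dls_clas.vocab[1]`, but treat that as an assumption to verify) than to type it by hand.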

We can see that the predominant sentiment is neutral, with more positive tweets than negative. It's encouraging that negative sentiment isn't higher! We can also visualise how sentiment changes over time:

In [None]:
# Get counts of number of tweets by sentiment for each date
vax_tweets['date'] = pd.to_datetime(vax_tweets['date']).dt.date
timeline = vax_tweets.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index()

# Plot results
import plotly.express as px
fig = px.line(timeline, x='date', y='tweets', color='sentiment')
fig.show()

### Further analysis using 'smarter' word clouds
To dig a bit deeper, let's generate some word clouds to see which words are indicative of each sentiment. The code below is from [this notebook](https://www.kaggle.com/pawanbhandarkar/generate-smarter-word-clouds-with-log-likelihood), which contains a more detailed explanation of the methodology used to generate 'smarter' word clouds. Please go and upvote the original notebook if you find this part useful!

In [None]:
!pip install wordninja
!pip install pyspellchecker

In [None]:
from wordcloud import WordCloud, ImageColorGenerator
import wordninja
from spellchecker import SpellChecker
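from collections import Counter
import math    # used by get_log_likelihood
import random  # used by the colour functions and get_scaled_list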
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english'))  
stop_words.add("amp")

In [None]:
# FUNCTIONS REQUIRED

def flatten_list(l):
    return [x for y in l for x in y]

def is_acceptable(word: str):
    return word not in stop_words and len(word) > 2

# Color coding our wordclouds 
def red_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl(0, 100%, {random.randint(25, 75)}%)" 

def green_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl({random.randint(90, 150)}, 100%, 30%)" 

def yellow_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl(42, 100%, {random.randint(25, 50)}%)" 

# Reusable function to generate word clouds 
def generate_word_clouds(neg_doc, neu_doc, pos_doc):
    # Display the generated image:
    fig, axes = plt.subplots(1,3, figsize=(20,10))
    
    wordcloud_neg = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neg_doc))
    axes[0].imshow(wordcloud_neg.recolor(color_func=red_color_func, random_state=3), interpolation='bilinear')
    axes[0].set_title("Negative Words")
    axes[0].axis("off")

    wordcloud_neu = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neu_doc))
    axes[1].imshow(wordcloud_neu.recolor(color_func=yellow_color_func, random_state=3), interpolation='bilinear')
    axes[1].set_title("Neutral Words")
    axes[1].axis("off")

    wordcloud_pos = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(pos_doc))
    axes[2].imshow(wordcloud_pos.recolor(color_func=green_color_func, random_state=3), interpolation='bilinear')
    axes[2].set_title("Positive Words")
    axes[2].axis("off")

    plt.tight_layout()
    plt.show();

def get_top_percent_words(doc, percent):
    # Returns a list of "top-n" most frequent words in a list 
    top_n = int(percent * len(set(doc)))
    counter = Counter(doc).most_common(top_n)
    top_n_words = [x[0] for x in counter]
    
    return top_n_words
    
def clean_document(doc):
    spell = SpellChecker()
    lemmatizer = WordNetLemmatizer()
    
    # Lemmatize words (needed for calculating frequencies correctly )
    doc = [lemmatizer.lemmatize(x) for x in doc]
    
    # Get the top 10% of all words. This may include "misspelled" words 
    top_n_words = get_top_percent_words(doc, 0.1)

    # Get a list of misspelled words 
    misspelled = spell.unknown(doc)
    
    # Accept the correctly spelled words and top_n words 
    clean_words = [x for x in doc if x not in misspelled or x in top_n_words]
    
    # Try to split the misspelled words to generate good words (ex. "lifeisstrange" -> ["life", "is", "strange"])
    words_to_split = [x for x in doc if x in misspelled and x not in top_n_words]
    split_words = flatten_list([wordninja.split(x) for x in words_to_split])
    
    # Some splits may be nonsensical, so reject them ("llouis" -> ['ll', 'ou', "is"])
    clean_words.extend(spell.known(split_words))
    
    return clean_words

def get_log_likelihood(doc1, doc2):    
    doc1_counts = Counter(doc1)
    doc1_freq = {
        x: doc1_counts[x]/len(doc1)
        for x in doc1_counts
    }
    
    doc2_counts = Counter(doc2)
    doc2_freq = {
        x: doc2_counts[x]/len(doc2)
        for x in doc2_counts
    }
    
    doc_ratios = {
        # 1 is added to prevent division by 0
        x: math.log((doc1_freq[x] +1 )/(doc2_freq[x]+1))
        for x in doc1_freq if x in doc2_freq
    }
    
    top_ratios = Counter(doc_ratios).most_common()
    top_percent = int(0.1 * len(top_ratios))
    return top_ratios[:top_percent]

# Function to generate a document based on likelihood values for words 
def get_scaled_list(log_list):
    counts = [int(x[1]*100000) for x in log_list]
    words = [x[0] for x in log_list]
    cloud = []
    for i, word in enumerate(words):
        cloud.extend([word]*counts[i])
    # Shuffle to make it more "real"
    random.shuffle(cloud)
    return cloud
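
To get a feel for the scores that `get_log_likelihood` produces, here is a tiny worked example on two made-up word lists. It repeats the same formula, the log of the ratio of (relative frequency + 1) in each document, outside the function; the word lists and the `relative_freqs` helper are assumptions made for illustration.

```python
import math
from collections import Counter

def relative_freqs(doc):
    """Map each word to its share of the document."""
    counts = Counter(doc)
    return {w: c / len(doc) for w, c in counts.items()}

# Two tiny made-up "documents"
doc_a = ['sore', 'arm', 'arm', 'thanks']
doc_b = ['thanks', 'thanks', 'arm', 'nhs']

freq_a, freq_b = relative_freqs(doc_a), relative_freqs(doc_b)

# Only words present in both documents receive a score
ratios = {
    w: math.log((freq_a[w] + 1) / (freq_b[w] + 1))
    for w in freq_a if w in freq_b
}
for w, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{w}: {r:+.4f}")
```

A positive score means the word is relatively more common in the first document. Because relative frequencies never exceed 1, every score lies between log(0.5) and log(2), which is presumably why `get_scaled_list` multiplies by 100000 before turning scores into word counts. Words that appear in only one of the two documents are skipped entirely.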

In [None]:
# Convert string to a list of words
vax_tweets['words'] = vax_tweets.text.apply(lambda x:re.findall(r'\w+', x ))

neg_doc = flatten_list(vax_tweets[vax_tweets['sentiment']=='negative']['words'])
neg_doc = [x for x in neg_doc if is_acceptable(x)]

pos_doc = flatten_list(vax_tweets[vax_tweets['sentiment']=='positive']['words'])
pos_doc = [x for x in pos_doc if is_acceptable(x)]

neu_doc = flatten_list(vax_tweets[vax_tweets['sentiment']=='neutral']['words'])
neu_doc = [x for x in neu_doc if is_acceptable(x)]

# Clean all the documents
neg_doc_clean = clean_document(neg_doc)
neu_doc_clean = clean_document(neu_doc)
pos_doc_clean = clean_document(pos_doc)

# Combine classes B and C to compare against A (ex. "positive" vs "non-positive")
top_neg_words = get_log_likelihood(neg_doc_clean, flatten_list([pos_doc_clean, neu_doc_clean]))
top_neu_words = get_log_likelihood(neu_doc_clean, flatten_list([pos_doc_clean, neg_doc_clean]))
top_pos_words = get_log_likelihood(pos_doc_clean, flatten_list([neu_doc_clean, neg_doc_clean]))

# Generate a synthetic corpus using our log-likelihood values
neg_doc_final = get_scaled_list(top_neg_words)
neu_doc_final = get_scaled_list(top_neu_words)
pos_doc_final = get_scaled_list(top_pos_words)

# Visualise our synthetic corpus
generate_word_clouds(neg_doc_final, neu_doc_final, pos_doc_final)

This looks pretty good! The positive tweets appear to be from people who have just received their first vaccine or are grateful for the job scientists and healthcare workers are doing, whereas the negative tweets seem to be from people who have suffered adverse reactions to the vaccine. The neutral tweets seem to be more like news, which could explain why it is the most prevalent sentiment; in fact, the vast majority of tweets contain URLs:

In [None]:
vax_tweets['has_url'] = np.where(vax_tweets['orig_text'].str.contains('http'), 'yes', 'no')
vax_tweets['has_url'].value_counts(normalize=True).plot.bar();

People have died after receiving the vaccine in Norway, which explains why 'Norway' shows up in the negative sentiment word cloud:

In [None]:
def get_cloud(df, string, c_func):
    string_l = string.lower()
    df[string_l] = np.where(df['text'].str.lower().str.contains(string_l), 1, 0)
    cloud_df = df.copy()[df[string_l]==1]
    doc = flatten_list(cloud_df['words'])
    doc = [x for x in doc if is_acceptable(x)]
    doc = clean_document(doc)
    fig, axes = plt.subplots(figsize=(9,5))
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(doc))
    axes.imshow(wordcloud.recolor(color_func=c_func, random_state=3), interpolation='bilinear')
    axes.set_title("Tweets Containing '%s'" % (string))
    axes.axis("off")
    plt.show();
    print(cloud_df['orig_text'].head(5))
    
get_cloud(vax_tweets, 'Norway', red_color_func)

The overall sentiment about the NHS appears to be positive, however:

In [None]:
get_cloud(vax_tweets, 'NHS', green_color_func)

## Conclusion
`fastai` makes NLP really easy, and we were able to get quite good results with a limited dataset and not a lot of training time by using the ULMFiT approach. To summarise, the steps are:
1. Fine-tune a language model to predict the next word in a tweet, using a model pre-trained on Wikipedia.
2. Fine-tune a classification model to predict tweet sentiment using the pre-trained language model.
3. Apply the classifier to unlabelled tweets to analyse sentiment.

Hopefully you found this useful!