Some people don't seem to like [WordCloud](https://amueller.github.io/word_cloud/)? I read recently that [Manoj](https://www.kaggle.com/mks2192) found word clouds [useless](https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/152852).


Joke aside, in this notebook I will explore this NLP visualization technique and compare it to another, more recent one: [Shifterator](https://pypi.org/project/shifterator/). 

Some of the work is inspired by this [notebook](https://www.kaggle.com/mrisdal/shifterator-analysis-on-animal-crossing-reviews), so if you find this notebook useful, consider exploring that one too. 

Let's get started!

# WordCloud

Alright, let's start with the most popular (is that so?) technique: WordCloud. 


Before that, we need a few preprocessing steps: 

- extract the words into a single string
- remove [stopwords](https://en.wikipedia.org/wiki/Stop_words) (i.e. words that don't add much to the meaning)
- (optional) select an appropriate mask
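
The first two steps can be sketched with the standard library alone. This is only a minimal illustration of the idea, not what the `wordcloud` package does internally: `SAMPLE_STOPWORDS` and `tweets` are made-up stand-ins for `STOPWORDS` and the `text` column used in the cells that follow.

```python
# Minimal sketch of the first two preprocessing steps (standard library only).
# SAMPLE_STOPWORDS and tweets are hypothetical stand-ins for illustration.
SAMPLE_STOPWORDS = {"the", "is", "a", "to", "i", "my"}

tweets = [
    "I love the new phone",
    "My flight is delayed",
    None,  # missing values are skipped, much like dropna() would do
]

# Step 1: extract the words into a single lowercase string.
text = " ".join(t.lower() for t in tweets if t is not None)

# Step 2: drop the stopwords.
kept_words = [w for w in text.split() if w not in SAMPLE_STOPWORDS]

print(text)
print(kept_words)
```

In the notebook itself, the second step is delegated to `WordCloud` through its `stopwords` argument.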



In [None]:
# Some useful imports

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from PIL import Image
import requests
from io import BytesIO
%matplotlib inline

In [None]:
# Loading train and test datasets
train_df = pd.read_csv("../input/tweet-sentiment-extraction/train.csv")
test_df = pd.read_csv("../input/tweet-sentiment-extraction/test.csv")

In [None]:
# Checking some of the stopwords
count = 0
for sw in STOPWORDS:
    print(sw)
    count += 1
    if count == 10:
        break

In [None]:
# Here is an "appropriate" mask.
url = "https://static01.nyt.com/images/2014/08/10/magazine/10wmt/10wmt-articleLarge-v4.jpg"
response = requests.get(url)
img = Image.open(BytesIO(response.content))

mask = np.array(img)
img

In [None]:
# And finally, generating the train wordcloud
text = " ".join(train_df["text"].dropna().str.lower().values)
stopwords = set(STOPWORDS)

wc = WordCloud(max_words=3000, mask=mask, stopwords=stopwords, margin=10,
               random_state=1, contour_color='white', contour_width=1).generate(text)

fig, ax = plt.subplots(1, 1, figsize=(15, 15))

ax.imshow(wc, interpolation="bilinear")
ax.set_title("Tweet Sentiment Extraction Train")

In [None]:
# Let's do the same thing but this time for the test dataset
text = " ".join(test_df["text"].dropna().str.lower().values)

wc = WordCloud(max_words=3000, mask=mask, stopwords=stopwords, margin=10,
               random_state=1, contour_color='white', contour_width=1).generate(text)

fig, ax = plt.subplots(1, 1, figsize=(15, 15))

ax.imshow(wc, interpolation="bilinear")
ax.set_title("Tweet Sentiment Extraction Test")

That was easy and "cute". Time to move to the second contender: Shifterator. 

# Shifterator

Alright, before we start, let's get something out of the way: this technique isn't directly comparable to WordCloud.
It is another recent visualization technique for text, but that's about the only thing they have in common. We will see how they differ 
quite soon. 


In [None]:
# First, we need to install the library.
!pip install shifterator

In [None]:
# We also need the frequency (i.e. number of occurrences) of each word, hence this short utility function.
from collections import Counter
from itertools import chain

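def get_word_freq(s):
    """Count lowercase, whitespace-split words in a text Series, skipping stopwords."""
    words = chain.from_iterable(s.dropna().str.lower().str.split())
    return Counter(w for w in words if w not in STOPWORDS)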

We will start with what are called word shift graphs. 

## Word Shift Graphs

For this type of graph, you will need four inputs: 
    
1. Word frequencies for text 1
2. Word frequencies for text 2
3. Sentiment dict for text 1
4. Sentiment dict for text 2

We will use this graphical representation to compare the train and test datasets. 

Also, here is a nice formula for computing a word shift, taken from the GitHub repo: 

<img src="https://raw.githubusercontent.com/ryanjgallagher/shifterator/master/figures/contribution.png"> 

Check it [out](https://github.com/ryanjgallagher/shifterator) for more details.
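
To make the idea more concrete, here is a small sketch that computes each word's contribution by hand. It follows the decomposition described in the Shifterator documentation as I understand it (a term for the change in frequency plus a term for the change in score), and it is not the library's own code. The function name `word_shift_contributions` and the toy inputs are made up for illustration, and the reference value is assumed to default to the average score of text 1.

```python
def word_shift_contributions(freq_1, freq_2, score_1, score_2, reference=None):
    """Per-word contributions to the change in average score (text 2 minus text 1)."""
    # Only words that have a score in both dictionaries are kept.
    vocab = (set(freq_1) | set(freq_2)) & set(score_1) & set(score_2)

    # Relative frequencies over the shared vocabulary.
    total_1 = sum(freq_1.get(w, 0) for w in vocab)
    total_2 = sum(freq_2.get(w, 0) for w in vocab)
    p1 = {w: freq_1.get(w, 0) / total_1 for w in vocab}
    p2 = {w: freq_2.get(w, 0) / total_2 for w in vocab}

    # Assumption: the default reference is the average score of text 1.
    if reference is None:
        reference = sum(p1[w] * score_1[w] for w in vocab)

    contributions = {}
    for w in vocab:
        mean_score = 0.5 * (score_1[w] + score_2[w])
        mean_p = 0.5 * (p1[w] + p2[w])
        # Frequency term + score term.
        contributions[w] = ((mean_score - reference) * (p2[w] - p1[w])
                            + mean_p * (score_2[w] - score_1[w]))
    return contributions


# Toy inputs: word counts for two texts and one shared score dictionary.
freq_1 = {"happy": 3, "sad": 1, "day": 2}
freq_2 = {"happy": 1, "sad": 3, "day": 2}
scores = {"happy": 1.0, "sad": -1.0, "day": 0.0}

contributions = word_shift_contributions(freq_1, freq_2, scores, scores)
for word, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:>6}: {value:+.3f}")
```

A handy sanity check: the contributions should add up to the difference between the average scores of the two texts.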

(Note: for now, the displayed graph isn't correct and I am struggling to build the sentiment dicts. Please share your tips, thanks!)

In [None]:
from shifterator import relative_shift as rs


train_freq = dict(get_word_freq(train_df["text"]))
test_freq = dict(get_word_freq(test_df["text"]))

# TODO: These don't look right, fix! If you have any ideas, please share them in the comments!
# TODO: How to make the sentiment dict?
train_pos = get_word_freq(train_df.loc[lambda df: df["sentiment"] == "positive", "text"].copy())
train_neg = get_word_freq(train_df.loc[lambda df: df["sentiment"] == "negative", "text"].copy())
test_pos = get_word_freq(test_df.loc[lambda df: df["sentiment"] == "positive", "text"].copy())
test_neg = get_word_freq(test_df.loc[lambda df: df["sentiment"] == "negative", "text"].copy())
train_sentiment_score = {**train_pos, **train_neg}
test_sentiment_score = {**test_pos, **test_neg}




sentiment_shift = rs.SentimentShift(train_freq, test_freq, train_sentiment_score, test_sentiment_score)

sentiment_shift.get_shift_graph(title="Word Shift for Train (left) vs Test (right) datasets")
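
One possible way to build the sentiment dicts from the labelled tweets is to score each word by how much more often it appears in positive than in negative tweets. This is only a sketch of an idea, not an approach recommended by the library: the helper `polarity_scores` and its `min_count` threshold are made up, and the toy counters stand in for `train_pos` and `train_neg`.

```python
from collections import Counter


def polarity_scores(pos_counts, neg_counts, min_count=1):
    """Score each word in [-1, 1]: +1 if only positive, -1 if only negative."""
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        pos, neg = pos_counts.get(word, 0), neg_counts.get(word, 0)
        total = pos + neg
        # Hypothetical threshold: skip words seen too rarely to score reliably
        # (and never divide by zero).
        if total >= max(min_count, 1):
            scores[word] = (pos - neg) / total
    return scores


# Toy counters standing in for train_pos and train_neg.
toy_pos = Counter({"love": 8, "day": 5, "hate": 1})
toy_neg = Counter({"hate": 9, "day": 5, "love": 2})

toy_scores = polarity_scores(toy_pos, toy_neg)
print(toy_scores)
```

By contrast, `{**train_pos, **train_neg}` holds raw counts rather than scores, and for words present in both dicts the negative count simply overwrites the positive one. A dict like `toy_scores` could be passed as the score dict for both texts instead, so that the graph reflects changes in word usage only.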

Next, we will explore entropy shift graphs. Note that there are other graphs similar to the entropy one (Kullback-Leibler Divergence 
and Jensen-Shannon Divergence), but we won't explore these since they are quite similar.

## Entropy Shift Graphs

For these graphs, you will only need two things: 
    
1. Word frequencies for text 1
2. Word frequencies for text 2
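
To give an idea of what can be computed from these two inputs, here is a simplified sketch of an entropy shift. Each word contributes the change in its `-p * log2(p)` term between the two texts, and the contributions add up to the difference in Shannon entropy. This is my own simplification rather than the library's implementation (which, among other things, has to decide how to treat words missing from one of the texts); the helper names and toy counts are made up.

```python
import math


def relative_freq(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def entropy_term(p):
    """Term -p * log2(p) of one word; 0 for a word that never occurs."""
    return -p * math.log2(p) if p > 0 else 0.0


def entropy_shift(counts_1, counts_2):
    """Per-word change in the entropy term, text 2 minus text 1."""
    p1, p2 = relative_freq(counts_1), relative_freq(counts_2)
    vocab = set(p1) | set(p2)
    return {w: entropy_term(p2.get(w, 0.0)) - entropy_term(p1.get(w, 0.0))
            for w in vocab}


# Toy counts: text 1 is dominated by one word, text 2 is evenly spread.
counts_1 = {"good": 6, "bad": 1, "day": 1}
counts_2 = {"good": 2, "bad": 2, "day": 2, "night": 2}

shift = entropy_shift(counts_1, counts_2)
for word, value in sorted(shift.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:>6}: {value:+.3f}")
```

Words with a positive value add more to the entropy of text 2 than to that of text 1, and the other way around for negative values.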

In what follows, we will build three different entropy shift graphs: 

- Positive train vs negative train
- Positive test vs negative test
- Train vs test

In [None]:
from shifterator import relative_shift as rs


train_pos_freq = get_word_freq(train_df.loc[lambda df: df["sentiment"] == "positive", "text"])
train_neg_freq = get_word_freq(train_df.loc[lambda df: df["sentiment"] == "negative", "text"])



sentiment_shift = rs.EntropyShift(reference=train_pos_freq,
                                  comparison=train_neg_freq)

sentiment_shift.get_shift_graph(title="Entropy Shift for Train Positive (left) vs Negative (right) Sentiments")

In [None]:
test_pos_freq = get_word_freq(test_df.loc[lambda df: df["sentiment"] == "positive", "text"])
test_neg_freq = get_word_freq(test_df.loc[lambda df: df["sentiment"] == "negative", "text"])



sentiment_shift = rs.EntropyShift(reference=test_pos_freq,
                                  comparison=test_neg_freq)
sentiment_shift.get_shift_graph(title="Entropy Shift for Test Positive (left) vs Negative (right) Sentiments")

In [None]:
from shifterator import relative_shift as rs


train_freq = get_word_freq(train_df["text"])
test_freq = get_word_freq(test_df["text"])



sentiment_shift = rs.EntropyShift(reference=train_freq,
                                  comparison=test_freq)
sentiment_shift.get_shift_graph(title="Entropy Shift for Train (left) vs Test (right) datasets")

# Conclusion

To sum up: 
    
- Shifterator offers graphs to compare two different texts using word shift graphs. These can be based on **relative word shifts** or **entropy** (and similar metrics) shifts. 
- WordCloud is used to quickly visualize words that are used the most. The more frequent the word, the bigger its size in 
the representation.

That's it for now. I am quite new to the world of word shift graphs, so please leave comments and I will try to improve this notebook as time permits. 
Finally, thanks to [Ryan Gallagher](https://github.com/ryanjgallagher) for creating the library. 

(Note: as stated above, my sentiment shift isn't quite correct and I will try to fix it in the upcoming days so please let me know
if you have any hints/tips. Many thanks!)