# SEaM free text sentiment visualiser

Start by selecting a file. (The widget seems to need you to click on `Select` before it actually starts properly.)

In [None]:
from ipyfilechooser import FileChooser

fc=FileChooser()
display(fc)

## Pull in the data and import libraries

The data doesn't appear to be in a standard format. I'll continue to assume that the student responses are all in sheet 2, and try to pull out just the columns containing free text. Again, `NLTK` is probably the best tool for processing that text.

Start by importing the necessary modules:

In [None]:
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import string

And stop pandas from truncating the output, so we can see the whole of each text cell:

In [None]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

First, let's get a SEaM data file.

In [None]:
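# Change this line to read a different SEaM file.
# Note: an integer sheet_name is a zero-based position, so sheet_name=2
# reads the third sheet in the workbook.

seam_df=pd.read_excel(fc.selected, sheet_name=2)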

Check the contents of the dataframe:

In [None]:
seam_df.head()

## Filter the columns to the text columns

There probably isn't an especially principled way of distinguishing the text columns from the non-text columns... Let's create a set of the terms which appear in the question columns:

In [None]:
question_terms={'1.',
                '2.',
                '3.',
                '4.',
                '5.',
                'definitely',
                'mostly',
                'neither',
                'student',
                'agree',
                'answer',
                'did',
                'disagree',
                'nor',
                'not',
                'question',
                'this'}

... and assume that a column represents text if it contains several terms that aren't in this list.

I'll create a function `text_series` which returns `True` if the values contain several terms which aren't in the list of question terms.

Let's put a threshold of 10: the function returns `True` if there are at least 10 distinct non-question terms.

In [None]:
def text_series(s_in):
    '''Return True if the series' values contain at least 10 distinct
       terms which aren't in question_terms'''
    
    terms=(s_in
           .dropna()
           .str.lower()
           .str.split()
           .values)
    
    terms=[x.strip(string.punctuation+'0123456789') for y in terms
              for x in y]
    
    terms={term for term in terms
           if term not in question_terms
           and term}   # remove ''
    
    return len(terms)>=10

text_series(seam_df['If you answered Disagree to any of the statements above, we would like to understand why so we can make improvements in the future'])
    

In [None]:
feedback_df=seam_df.filter([c for c in seam_df.columns if text_series(seam_df[c])])

feedback_df.head()

Looks about right.
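As a quick sanity check on the heuristic, here is a minimal, self-contained sketch on toy data. The helper `count_other_terms`, the shortened `demo_question_terms` set and both example columns are invented for illustration; they are not part of the SEaM data or of the function above.

```python
import string

import pandas as pd

# Hypothetical, shortened stand-in for the notebook's question_terms set.
demo_question_terms = {'definitely', 'mostly', 'agree', 'disagree',
                       'neither', 'nor', 'did', 'not', 'answer'}


def count_other_terms(s_in, known_terms):
    '''Count the distinct terms in a Series that are not in known_terms.'''
    strip_chars = string.punctuation + '0123456789'
    terms = {word.strip(strip_chars)
             for cell in s_in.dropna().astype(str).str.lower().str.split()
             for word in cell}
    # Drop the known terms and the empty string left by tokens like '1.'
    return len(terms - known_terms - {''})


# Toy columns (invented): one Likert-style, one free text.
likert = pd.Series(['1. Definitely agree', '2. Mostly agree', None,
                    '4. Mostly disagree'])
free_text = pd.Series(['The tutor was helpful and the forums were lively.',
                       None,
                       'Too much reading in block two, but great notebooks!'])

print(count_other_terms(likert, demo_question_terms))
print(count_other_terms(free_text, demo_question_terms))
```

The Likert-style column contributes no unknown terms, while even two short comments clear a threshold of 10, which is why a fixed cut-off seems to separate the two kinds of column well enough here.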

So now we should be able to do the splitting and merging thing on these columns.
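The shape we are aiming for can be sketched on toy data: one row per (response, sentence number) pair, with one column per free text question. In this sketch `naive_sentences` is a deliberately crude regex splitter standing in for a real sentence tokenizer, and `toy_df` and its column names are invented for illustration.

```python
import re

import pandas as pd


def naive_sentences(text):
    '''Very rough splitter: break after ., ! or ? followed by whitespace.
       A stand-in for illustration only, not a real sentence tokenizer.'''
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


# Toy feedback table (invented); the index plays the role of response id.
toy_df = pd.DataFrame({
    'liked': ['Great tutor. Good pace!', None, 'Loved the notebooks.'],
    'disliked': [None, 'Too long. Too dry.', 'Nothing really.'],
})

# One long-format frame per text column.
frames = []
for col in toy_df.columns:
    rows = [{'response': idx, 'sentence_num': i, col: sentence}
            for idx, text in toy_df[col].dropna().items()
            for i, sentence in enumerate(naive_sentences(text))]
    frames.append(pd.DataFrame(rows))

# Outer-merge on the shared keys so each (response, sentence_num) pair
# gets one row, with NaN where a column has no such sentence.
toy_sentences = frames[0]
for frame in frames[1:]:
    toy_sentences = toy_sentences.merge(frame, how='outer',
                                        on=['response', 'sentence_num'])

toy_sentences = (toy_sentences
                 .sort_values(['response', 'sentence_num'])
                 .reset_index(drop=True))
print(toy_sentences)
```

Naming the merge keys explicitly with `on=` is a choice made for this sketch; it keeps the join from silently picking up any other column that two frames happen to share.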

## Split the sentences in the free text cells

To split the input into separate sentences, use the NLTK library function `sent_tokenize`:

In [None]:
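# import the language model for sentence splitting
# (newer NLTK releases, 3.9 onwards as far as I know, look for 'punkt_tab'
#  instead: if sent_tokenize raises a LookupError, download that as well)

import nltk
nltk.download('punkt')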

In [None]:
from nltk.tokenize import sent_tokenize

Let's see if we can put all the sentences into a single DataFrame, reasonably tidily.

In [None]:
all_comments_df=pd.DataFrame(columns=['response', 'sentence_num'])

for text_column in feedback_df.columns:

    ss=(feedback_df[text_column]
        .dropna())

    l=[]

    for (idx, text) in ss.items():
        # use the text from the loop directly rather than looking it up again
        l.extend([{'response':idx, 'sentence_num':i, text_column:s}
                  for (i, s) in enumerate(sent_tokenize(text))])
        
    all_comments_df=all_comments_df.merge(pd.DataFrame(l), how='outer')

In [None]:
all_comments_df.head()

A small quirk: the ordering seems to have gone awry in places, so let's just make sure it's sorted properly:

In [None]:
all_comments_df.sort_values(by=['response', 'sentence_num'], inplace=True)

## Apply the sentiment analyser

We can use the VADER sentiment analyser from NLTK.

In [None]:
# import the language model for sentiment analysis

import nltk
nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

In [None]:
sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores("TM351 was the best module I have ever imagined!")

In [None]:
sia.polarity_scores("TM351 is the worst course I have studied in decades at the OU")

The `'compound'` key in the dictionary is the one we want: it ranges from -1 (most negative) to +1 (most positive).
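Why the score stays inside that range: as far as I can tell from the VADER source, the compound value is the sum of the adjusted word valences, squashed with $x / \sqrt{x^2 + \alpha}$ where $\alpha = 15$. The sketch below reproduces only that final normalisation step, not the lexicon lookup or the rule-based adjustments; the function name `normalise` is mine.

```python
import math


def normalise(total_valence, alpha=15):
    '''Squash an unbounded valence sum into the range [-1, 1].
       alpha=15 is assumed to match VADER's default.'''
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    # Clip defensively in case of floating point overshoot.
    return max(-1.0, min(1.0, score))


for total in (-10, -1, 0, 1, 10):
    print(total, round(normalise(total), 4))
```

Small sums stay close to linear, while large sums flatten out towards the ends of the range, so a long gushing sentence cannot score much higher than a short enthusiastic one.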

## Visualising the responses

We can combine the power of *seaborn*, which generates nice graded palettes, with *pandas*' styling methods for DataFrames.

A diverging palette, running from red through a light centre to green, suits scores on a -1 to +1 scale:

In [None]:
sentiment_colour_map=sns.diverging_palette(10, 125, s=75, l=50,
                                           n=12, center="light", as_cmap=True)
sentiment_colour_map

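The "bad" swatch shown under the middle of the colour bar isn't a verdict on the central values: in matplotlib, "bad" is the colour used for masked or invalid (NaN) values, alongside "under" and "over" for out-of-range ones.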

And then map the sentences in the DataFrame onto the `compound` values:

In [None]:
def polarity_scores_check(txt):
    '''Returns the result of polarity_scores, but with 0 for cases
       raising an error (avoids throwing errors for NaNs and the
       like).
    '''
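    try:
        return sia.polarity_scores(txt)['compound']
    except Exception:
        # non-text cells (NaNs, the integer key columns) score as neutral
        return 0

# Note: pandas 2.1+ deprecates applymap in favour of DataFrame.map
all_comments_df.applymap(polarity_scores_check)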

And finally, we can use the polarity scores DataFrame to colour the cells in the text DataFrame:

In [None]:
all_comments_df.style.background_gradient(cmap=sentiment_colour_map,
                                          axis=None, vmin=-1, vmax=1,
                                          gmap=all_comments_df.applymap(polarity_scores_check))