## Sentiment Analysis for Consultation Sandbox

This notebook tests sentiment analysis on dummy consultation data.

For this example, we'll use the cardiffnlp/twitter-roberta-base-sentiment model.

-----
### 1. Set up model
Import the libraries needed:

In [None]:
# Hugging Face model and tokenizer loaders (PyTorch backend)
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
# Numerical helpers for post-processing the model output
import numpy as np
from scipy.special import softmax
# Standard library: label download and path handling
import csv
import urllib.request
import os

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Read in [labels for the outcomes](https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/sentiment/mapping.txt). These translate the model output into words (e.g. 0 is negative, 1 is neutral, 2 is positive).

In [None]:
# Download label mapping (one "index<TAB>label" pair per line)
mapping_link = "https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/sentiment/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    lines = f.read().decode('utf-8').split("\n")
csvreader = csv.reader(lines, delimiter='\t')
# Keep the label text from each row, skipping blank lines
labels = [row[1] for row in csvreader if len(row) > 1]
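
To check the parsing step without a network call, here is a minimal sketch that runs the same logic on an in-memory string. `sample_mapping` and `parse_labels` are illustrative names, and the sample text assumes the tab-separated 0/1/2 layout described above.

```python
import csv

# Hypothetical local copy of the mapping: one "index<TAB>label" pair per line
sample_mapping = "0\tnegative\n1\tneutral\n2\tpositive\n"

def parse_labels(text):
    """Return the label names in file order, skipping blank lines."""
    rows = csv.reader(text.split("\n"), delimiter="\t")
    return [row[1] for row in rows if len(row) > 1]

sample_labels = parse_labels(sample_mapping)
print(sample_labels)
```

The position of each label in the list matches the index of the corresponding model output, which is what the later steps rely on.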

---
### 2. Prepare data

In [None]:
from arrow_pd_parser import reader

In [None]:
s3_bucket = "s3://alpha-everyone/nlp-code-examples/"
file_loc = "Consultation_Dummy_NewQuestions.csv"

In [None]:
df = reader.read(os.path.join(s3_bucket, file_loc))

Clean the column names:

In [None]:
import re 

def multiple_replace(replacements, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: replacements[mo.group()], text) 

In [None]:
replacements = {" ":"_",
              "-":"_",
              "/":"_",
              "?":"",
              "'":""}

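# Keep only the text after the last '- ' in each header, then normalise it.
# Strip whitespace first so a trailing space does not become an underscore.
new_cols = []
for parts in df.columns.str.split('- '):
    cleaned = multiple_replace(replacements, parts[-1].strip()).lower()
    new_cols.append(cleaned)
df.columns = new_cols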
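
As an illustration of what the cleaning does to a single header, the sketch below applies the same helper to a made-up column name. The header text `raw_header` is hypothetical and is not taken from the real dataset.

```python
import re

def multiple_replace(replacements, text):
    # Build one regex that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    # Swap each match for its replacement value
    return regex.sub(lambda mo: replacements[mo.group()], text)

replacements = {" ": "_", "-": "_", "/": "_", "?": "", "'": ""}

# Hypothetical raw header with a question-number prefix
raw_header = "Q3 - Has the pilot scheme been successful?"

# Keep the text after the last '- ', strip whitespace, then clean it
question = raw_header.split("- ")[-1].strip()
cleaned = multiple_replace(replacements, question).lower()
print(cleaned)
```

The result is a valid Python identifier, which is why the column can be accessed with attribute syntax such as `df.has_the_pilot_scheme_been_successful`.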

Look at the column we want to do sentiment analysis on:

In [None]:
df.has_the_pilot_scheme_been_successful.head()

---
### 3. Apply Model

We'll pass a string to the model to get sentiment back. To do this, first encode the text so that it can be understood by the model:

In [None]:
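def tokenize_string(text):
    # Return PyTorch tensors; truncate long responses to 512 tokens
    # (assumed here to be the model's input limit) so the model call does not fail
    encoded_text = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    return encoded_text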

In [None]:
df['tokenized_ps_success'] = df['has_the_pilot_scheme_been_successful'].apply(lambda x: tokenize_string(x))

Pass the encoded text to the model:

In [None]:
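def analyze_sentiment(encoded_text):
    output = model(**encoded_text)
    # output[0] holds the logits, one row per input; take the single row
    scores = output[0][0].detach().numpy()
    # Convert the logits to probabilities that sum to 1
    scores = softmax(scores)
    return scores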

In [None]:
df['scores'] = df['tokenized_ps_success'].apply(lambda x: analyze_sentiment(x))
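
To make the softmax step concrete, here is a small sketch using NumPy only. The logits are invented for illustration and are not real model output, and `softmax_1d` is a hand-written stand-in for `scipy.special.softmax` on a 1-D array.

```python
import numpy as np

def softmax_1d(logits):
    """Numerically stable softmax for a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits in label order: negative, neutral, positive
logits = np.array([-1.0, 0.5, 2.0])
probs = softmax_1d(logits)
print(probs.round(3))
```

The probabilities keep the same ordering as the logits, so the largest logit always corresponds to the most likely label.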

---
### 4. Tidy Results

Interpret scores:

In [None]:
# Function to extract the nth number from an array
# def get_n_number(array, n) :
#     value = array[n]
#     return value

In [None]:
# for i in range(0,len(labels)):
#     new_col_name = labels[i]
#     df[new_col_name] = df['scores'].apply(lambda x: get_n_number(x, i)) 
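
The commented-out cells above split each score array into one column per label. An alternative sketch, assuming every response has a score array of the same length, is to stack the arrays into a matrix and slice it by column. The scores below are made up for illustration.

```python
import numpy as np

labels = ["negative", "neutral", "positive"]

# Hypothetical score arrays, one per response
scores = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.3, 0.6])]

# Stack into shape (n_responses, n_labels)
score_matrix = np.vstack(scores)

# One array of scores per label, in label order
per_label = {label: score_matrix[:, i] for i, label in enumerate(labels)}
print(per_label["positive"])
```

In the notebook, each entry of `per_label` could then be assigned to a dataframe column of the same name, though that step is not shown here.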

Next: find the min/max score for each response and save them to separate columns

In [None]:
def get_max_score(array, results_labels):
    # Find the positions of the highest and lowest scores
    # (argmax/argmin return the first position if there is a tie)
    max_index = np.argmax(array)
    min_index = np.argmin(array)

    # Extract max/min value and label (uses the global `labels` list)
    max_score = array[max_index]
    max_label = labels[max_index]
    min_score = array[min_index]
    min_label = labels[min_index]

    # Pair each result with its name, in the order given by results_labels
    results = [max_score, max_label, min_score, min_label]
    return dict(zip(results_labels, results))

In [None]:
results_labels = ["max_score", "max_label", "min_score", "min_label"]

In [None]:
df["extreme_scores"] = df.scores.apply(lambda x: get_max_score(x, results_labels))

In [None]:
df.extreme_scores.head()
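
For reference, here is a worked example of the max/min lookup on a single score array. The scores are invented for illustration and `summary` is an illustrative name.

```python
import numpy as np

labels = ["negative", "neutral", "positive"]

# Hypothetical probabilities for one response, in label order
scores = np.array([0.15, 0.05, 0.80])

# Index of the most and least likely label
max_index = int(np.argmax(scores))
min_index = int(np.argmin(scores))

summary = {
    "max_score": float(scores[max_index]),
    "max_label": labels[max_index],
    "min_score": float(scores[min_index]),
    "min_label": labels[min_index],
}
print(summary)
```

Here the response is labelled positive with a score of 0.80, and neutral is the least likely label.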

In [None]:
def extract_dict_content(result_dict, dict_key):
    content = result_dict[dict_key]
    return content    

In [None]:
for i in results_labels:
    df[i] = df.extreme_scores.apply(lambda x: extract_dict_content(x, i))

In [None]:
df.head()

---
### 5. Examine Results
Are they as expected?

In [None]:
df_subset = df[["has_the_pilot_scheme_been_successful","max_label", "max_score", "min_label", "min_score"]]

In [None]:
df_subset.max_label.value_counts()

In [None]:
n = 5
# For each label, sort by confidence: head(n) gives the most confident
# predictions and tail(n) the least confident ones
top_n_pos = df_subset.loc[df_subset.max_label == "positive"].sort_values("max_score", ascending = False).head(n)
bottom_n_pos = df_subset.loc[df_subset.max_label == "positive"].sort_values("max_score", ascending = False).tail(n)
top_n_neg = df_subset.loc[df_subset.max_label == "negative"].sort_values("max_score", ascending = False).head(n)
bottom_n_neg = df_subset.loc[df_subset.max_label == "negative"].sort_values("max_score", ascending = False).tail(n)
top_n_neutral = df_subset.loc[df_subset.max_label == "neutral"].sort_values("max_score", ascending = False).head(n)
bottom_n_neutral = df_subset.loc[df_subset.max_label == "neutral"].sort_values("max_score", ascending = False).tail(n)

#### Positive examples

In [None]:
for i in range(0, n):
    print(top_n_pos.has_the_pilot_scheme_been_successful.iloc[i] + "\n")

In [None]:
for i in range(0, n):
    print(bottom_n_pos.has_the_pilot_scheme_been_successful.iloc[i] + "\n")

#### Negative examples

In [None]:
for i in range(0, n):
    print(top_n_neg.has_the_pilot_scheme_been_successful.iloc[i] + "\n")

In [None]:
for i in range(0, n):
    print(bottom_n_neg.has_the_pilot_scheme_been_successful.iloc[i] + "\n")

#### Neutral examples

In [None]:
for i in range(0, n):
    print(top_n_neutral.has_the_pilot_scheme_been_successful.iloc[i] + "\n")

In [None]:
for i in range(0, n):
    print(bottom_n_neutral.has_the_pilot_scheme_been_successful.iloc[i] + "\n")

---
### 6. Next Steps

- Do these results make sense?
- Work out whether this approach is optimal: would breaking responses down into smaller chunks be beneficial?
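
One possible way to explore the chunking idea is to split each response into sentences before scoring. The sketch below is a deliberately naive splitter and only an assumption about how chunks might be defined. `split_into_sentences` and the sample response are hypothetical.

```python
import re

def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, e.g. when the input is empty
    return [part for part in parts if part]

# Hypothetical consultation response
response = (
    "The pilot worked well in our area. "
    "Waiting times were still too long! "
    "Would you extend it?"
)
chunks = split_into_sentences(response)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed through `tokenize_string` and `analyze_sentiment` separately. How to combine the per-chunk scores into one result per response is left open.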