# Introduction

This notebook provides simple examples of using pretrained models from [John Snow Labs](https://www.johnsnowlabs.com) [spark-nlp](https://www.johnsnowlabs.com/spark-nlp) for sentiment and emotion analysis. Additionally, a small demo of text-to-text analysis, i.e. question answering, is presented (directly from the spark-nlp demos).

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

It's important to realize that the pre-trained models and pipelines have constant weights and cannot be fine-tuned. Typically, one would like to begin with a pre-trained deep-learning model and then fine-tune it for a specific task (aka transfer learning). However, spark-nlp doesn't support this:
* https://github.com/JohnSnowLabs/spark-nlp/issues/989

I've found the following resources helpful in my use of spark-nlp:
* https://www.johnsnowlabs.com/
* https://nlp.johnsnowlabs.com/models
* https://nlp.johnsnowlabs.com/api/python/reference/index.html
* https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/
* https://nlp.johnsnowlabs.com/demos
* https://github.com/JohnSnowLabs/spark-nlp-workshop
  * https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
* https://www.johnsnowlabs.com/annotation-lab/

## In case of notebook errors, consider these problems and fixes

**Occasionally, a spark-nlp download (referenced in pretrained pipeline or model) fails with something like:**
* ConnectionRefusedError: [Errno 111] Connection refused.
* ERROR:root:Exception while sending command.
* Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

To fix:
* Restart the runtime.
* Re-run the initial cells to create the Spark session: ```spark = get_spark()```
* Re-run the spark-nlp import cell.
* Re-run the cells, starting with the download that threw the error.

**Reference to a pretrained model or pipeline will throw an exception TypeError: 'JavaPackage' object is not callable, when the spark session is created without having loaded the required spark-nlp jar.** (Explained [here](https://stackoverflow.com/questions/72859407/typeerror-javapackage-object-is-not-callable-using-java-11-for-spark-3-3-0/73693255#73693255).)
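
One way to catch this before loading a model is to inspect the package list the session was configured with. The sketch below is an assumption about how to check, not part of spark-nlp: `has_spark_nlp_jar` is a hypothetical helper, and it only looks at a `spark.jars.packages`-style string, so a jar supplied some other way (for example via `spark.jars`) would not be detected.

```python
def has_spark_nlp_jar(packages: str) -> bool:
    """Return True if a comma-separated Maven package list names a spark-nlp artifact."""
    coords = [p.strip() for p in packages.split(",") if p.strip()]
    return any(c.startswith("com.johnsnowlabs.nlp:spark-nlp") for c in coords)


# In the notebook, the string would come from the live session, e.g.:
#   packages = spark.sparkContext.getConf().get("spark.jars.packages", "")
print(has_spark_nlp_jar("com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4"))
print(has_spark_nlp_jar(""))
```

If the check returns `False` for the running session, restarting the runtime and creating the session with the jar configured is the likely remedy.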

**Odd errors can occur when a model is only *partially* downloaded.** You can delete the models using this command cell:
```
%%bash
ls -l /root/cache_pretrained/
rm -fr /root/cache_pretrained/
```
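
The same cleanup can be done from Python, which also lets you see what was removed. This is a minimal sketch: `clear_pretrained_cache` is a hypothetical helper, and the cache path is taken from the shell cell above (it may differ outside Colab).

```python
import shutil
from pathlib import Path


def clear_pretrained_cache(cache_dir):
    """Delete cache_dir and return the sorted names of the entries it contained."""
    cache = Path(cache_dir).expanduser()
    if not cache.is_dir():
        return []  # nothing cached, nothing to delete
    names = sorted(entry.name for entry in cache.iterdir())
    shutil.rmtree(cache)
    return names


# Destructive, so left commented out here:
# print(clear_pretrained_cache("/root/cache_pretrained"))
```

After clearing the cache, the next `pretrained()` call downloads the model again.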

Some errors seem to arise from limited memory. When run with 25g of memory, fewer restarts are needed.

# Prepare Spark environment

## Install Spark into Colab

In [1]:
!pip install -q pyspark findspark

In [2]:
# Not always necessary, but just in case...
import findspark
findspark.init()

In [3]:
CUSTOM_SPARK_SESSION = True

# Common method to create Spark session
from pyspark.sql import SparkSession

if not CUSTOM_SPARK_SESSION:
  spark = SparkSession.builder\
          .master("local[*]")\
          .appName("Colab")\
          .config('spark.ui.port', '4050')\
          .getOrCreate()
  print(f"Spark version: {spark.version}")

Dependencies for spark-nlp.
  * Search Maven for the package com.johnsnowlabs to find com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4
  * Install spark-nlp from PyPI
    * Ensure the PyPI spark-nlp version (e.g., 4.2.4) matches the Maven spark-nlp version (4.2.4)
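
One way to keep the two versions from drifting apart is to derive both strings from a single constant. This is only a sketch: the helper names are hypothetical, and the Scala suffix `2.12` is taken from the Maven coordinate used in this notebook.

```python
SPARK_NLP_VERSION = "4.2.4"  # the version pinned in this notebook
SCALA_VERSION = "2.12"       # Scala build named in the Maven artifact


def spark_nlp_coordinate(version=SPARK_NLP_VERSION, scala=SCALA_VERSION):
    """Maven coordinate to pass to spark.jars.packages."""
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala}:{version}"


def pip_requirement(version=SPARK_NLP_VERSION):
    """Requirement string for pip install."""
    return f"spark-nlp=={version}"


print(spark_nlp_coordinate())
print(pip_requirement())
```

Changing `SPARK_NLP_VERSION` then updates the pip requirement and the jar coordinate together.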

In [4]:
!pip install -q spark-nlp==4.2.4

In [5]:
# Because spark-nlp relies on jars, use this function to load them when creating a session.
from pyspark.sql import SparkSession

SPARK_JARS = ["com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4"]

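def get_spark(master="local[*]", name="Colab"):
    # Apply the master argument, which was previously accepted but never used.
    # Each config() call below updates the builder in place and returns it.
    builder = SparkSession.builder.master(master).appName(name)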
    builder.config('spark.ui.port', '4050')
    builder.config('spark.jars.packages', ",".join(SPARK_JARS))
    builder.config("spark.driver.memory", "16G")
    builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    builder.config("spark.kryoserializer.buffer.max", "2000M")
    builder.config("spark.driver.maxResultSize", "0")
    return builder.getOrCreate()

if CUSTOM_SPARK_SESSION:  
  spark = get_spark()
  print(f"Spark version: {spark.version}")

Spark version: 3.3.1


## Install ngrok for access to Spark UI on Colab [optional]
* You'll need a free account: https://dashboard.ngrok.com/signup
  * Place the authorization code into the cell below
* Link to Spark UI will appear
* If it fails, restart the runtime and try again


In [6]:
!wget -qnc https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
!tar zxf ngrok-v3-stable-linux-amd64.tgz
get_ipython().system_raw('./ngrok http 4050 &')
!sleep 5
!curl -s http://localhost:4040/api/tunnels | grep -Po 'public_url":"(?=https)\K[^"]*'

https://1744-34-80-37-173.jp.ngrok.io


In [7]:
# Replace AUTH_CODE with your ngrok authtoken (keep real tokens out of shared notebooks)
# Run before accessing the web link
!./ngrok config add-authtoken AUTH_CODE

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


## Provide access to your Google Drive for data [optional]

In [8]:
# from google.colab import drive
# drive.mount('/content/drive')

# Prepare spark-nlp

In [9]:
import sparknlp

from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *
from sparknlp.annotator import *

import pandas as pd

# This start() is ignored if a Spark session exists.
# This creates a new spark session if one has not been created previously.
# Additionally, it only loads the jar for spark-nlp, which is a problem if you want to load other jars.
# https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/SparkNLP.scala
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 4.2.4
Apache Spark version: 3.3.1


In [10]:
# common text to check
comment = "The movie I watched today was not a good one"

# Pretrained Pipeline
* Pretrained pipelines are applied directly. You can customize a *pipeline* for pretrained models, as shown in the following section. (None of these pipelines provide for transfer learning with the pretrained models.)

## Sentiment (general text, Vivekn, pipeline)

In [11]:
# https://nlp.johnsnowlabs.com/2021/03/24/analyze_sentiment_en.html
sentiment = PretrainedPipeline('analyze_sentiment', lang='en')
result = sentiment.annotate(comment)

result['sentiment']

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


['negative']

## Sentiment (IMDB)

In [12]:
# https://nlp.johnsnowlabs.com/2021/01/15/analyze_sentimentdl_use_imdb_en.html
sentiment_imdb = PretrainedPipeline('analyze_sentimentdl_use_imdb', lang='en')
result = sentiment_imdb.annotate(comment)

result['sentiment']

analyze_sentimentdl_use_imdb download started this may take some time.
Approx size to download 935.7 MB
[OK!]


['pos']

## Sentiment (glove)

In [13]:
# https://nlp.johnsnowlabs.com/2021/01/15/analyze_sentimentdl_glove_imdb_en.html
sentiment_imdb_glove = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang='en')
result = sentiment_imdb_glove.annotate(comment)
sentiment_imdb_glove.fullAnnotate(comment)[0]['sentiment']

analyze_sentimentdl_glove_imdb download started this may take some time.
Approx size to download 155.3 MB
[OK!]


[Annotation(category, 0, 43, pos, {'sentence': '0', 'pos': '0.99994636', 'neg': '5.3649594E-5'})]
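
The annotation metadata stores the class probabilities as strings. A small sketch of turning them into numbers and picking the top label follows; `label_scores` is a hypothetical helper, and the dictionary is copied from the output above rather than read from a live annotation.

```python
def label_scores(metadata):
    """Convert annotation metadata into {label: probability}, dropping the 'sentence' index."""
    return {k: float(v) for k, v in metadata.items() if k != "sentence"}


# Metadata copied from the fullAnnotate() output above.
metadata = {'sentence': '0', 'pos': '0.99994636', 'neg': '5.3649594E-5'}

scores = label_scores(metadata)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Here the model is almost certain the comment is positive, even though the sentence reads as negative.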

# Specify training pipeline for a model
* A pipeline must be fit to data with ```.fit()``` to produce a pipeline model before it can be applied with ```.transform()```
* Pipelines allow customization of the steps that run before a pre-trained model or a learning algorithm is applied.

## Download data used to illustrate pipeline.

In [14]:
!wget -O IMDB-Dataset.csv https://github.com/Ankit152/IMDB-sentiment-analysis/blob/master/IMDB-Dataset.csv?raw=true

data = spark.read.csv("IMDB-Dataset.csv", inferSchema=True, header=True, mode='DROPMALFORMED')
data = data.withColumnRenamed('review', 'text').withColumnRenamed('sentiment', 'sentiment_label')

# pdf = pd.read_csv("IMDB-Dataset.csv")
# data = spark.createDataFrame(pdf).toDF('text', 'sentiment_label')

data.show(truncate=30)

--2022-12-01 21:14:44--  https://github.com/Ankit152/IMDB-sentiment-analysis/blob/master/IMDB-Dataset.csv?raw=true
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/Ankit152/IMDB-sentiment-analysis/raw/master/IMDB-Dataset.csv [following]
--2022-12-01 21:14:45--  https://github.com/Ankit152/IMDB-sentiment-analysis/raw/master/IMDB-Dataset.csv
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv [following]
--2022-12-01 21:14:45--  https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercon

## Sentiment (Vivekn pipeline with data)

In [15]:
# https://nlp.johnsnowlabs.com/2021/11/22/sentiment_vivekn_en.html

document = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

token = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normal")

vivekn =  ViveknSentimentModel.pretrained() \
.setInputCols(["document", "normal"]) \
.setOutputCol("result_sentiment")

finisher = Finisher() \
.setInputCols(["result_sentiment"]) \
.setOutputCols("final_sentiment")

pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])

# Fit to data
pipelineModel = pipeline.fit(data)
result = pipelineModel.transform(data)

result.show(truncate=75)

sentiment_vivekn download started this may take some time.
Approximate size to download 873.6 KB
[OK!]
+---------------------------------------------------------------------------+---------------+---------------+
|                                                                       text|sentiment_label|final_sentiment|
+---------------------------------------------------------------------------+---------------+---------------+
|One of the other reviewers has mentioned that after watching just 1 Oz e...|       positive|     [positive]|
|Basically there's a family where a little boy (Jake) thinks there's a zo...|       negative|     [positive]|
|I sure would like to see a resurrection of a up dated Seahunt series wit...|       positive|     [positive]|
|This show was an amazing, fresh & innovative idea in the 70's when it fi...|       negative|     [positive]|
|Encouraged by the positive comments about this film on here I was lookin...|       negative|     [negative]|
|If you like orig

In [47]:
result


DataFrame[text: string, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence_embeddings: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentiment: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>]

In [46]:
result.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContains

### Apply the "trained" model, created from the pipeline.
Note that the sentiment model weights are fixed for pre-trained models.

In [16]:
# apply trained model to text
example = spark.createDataFrame([[comment]]).toDF("text")
pipelineModel.transform(example).show(truncate=False)

+--------------------------------------------+---------------+
|text                                        |final_sentiment|
+--------------------------------------------+---------------+
|The movie I watched today was not a good one|[negative]     |
+--------------------------------------------+---------------+



### Evaluate the "trained" model.
* Note that fitting a pipeline that contains a pretrained model does not change the model, as the following shows.

In [17]:
from sklearn.metrics import classification_report, accuracy_score

df2 = pipelineModel.transform(data).selectExpr('sentiment_label','text',"final_sentiment as result").toPandas()

df2['result'] = df2['result'].apply(lambda x: x[0])

print(classification_report(df2.sentiment_label, df2.result))
print(accuracy_score(df2.sentiment_label, df2.result))

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

   ,positive       0.00      0.00      0.00         2
    negative       0.53      0.62      0.57     13792
    positive       0.58      0.49      0.53     14897

    accuracy                           0.55     28691
   macro avg       0.37      0.37      0.37     28691
weighted avg       0.56      0.55      0.55     28691

0.5523334843679203


  _warn_prf(average, modifier, msg_start, len(result))
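
The report contains a stray `,positive` class with a support of 2, which probably comes from a couple of malformed CSV rows (an assumption; the rows were not inspected here). A minimal sketch of dropping such rows before scoring is shown below. The helper names and the toy labels are made up for illustration and are not the notebook's data.

```python
VALID_LABELS = {"positive", "negative"}


def keep_valid(y_true, y_pred, valid=VALID_LABELS):
    """Keep only the (truth, prediction) pairs whose true label is a known class."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in valid]
    return [t for t, _ in pairs], [p for _, p in pairs]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only.
y_true = ["positive", "negative", ",positive", "negative"]
y_pred = ["positive", "positive", "positive", "negative"]

clean_true, clean_pred = keep_valid(y_true, y_pred)
print(len(clean_true), accuracy(clean_true, clean_pred))
```

On the pandas frame used above, the equivalent filter would be `df2[df2.sentiment_label.isin(["positive", "negative"])]`.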


### Evaluate the untrained model.
* The following shows that including a pretrained model in a pipeline and running fit does not change the model. The results after fitting on the data (above) and after fitting on an empty DataFrame (below) are therefore the same.

In [18]:
# Fit to no data (thus no "training")
empty_data = spark.createDataFrame([['']]).toDF("text")
pipelineModel = pipeline.fit(empty_data)

In [19]:
# apply same pipeline, but model has not been "trained"
example = spark.createDataFrame([[comment]]).toDF("text")
pipelineModel.transform(example).show(truncate=False)

+--------------------------------------------+---------------+
|text                                        |final_sentiment|
+--------------------------------------------+---------------+
|The movie I watched today was not a good one|[negative]     |
+--------------------------------------------+---------------+



In [20]:
# Same result as the pipeline that was fit on the full data

from sklearn.metrics import classification_report, accuracy_score

df2 = pipelineModel.transform(data).selectExpr('sentiment_label','text',"final_sentiment as result").toPandas()

df2['result'] = df2['result'].apply(lambda x: x[0])

print(classification_report(df2.sentiment_label, df2.result))
print(accuracy_score(df2.sentiment_label, df2.result))

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

   ,positive       0.00      0.00      0.00         2
    negative       0.53      0.62      0.57     13792
    positive       0.58      0.49      0.53     14897

    accuracy                           0.55     28691
   macro avg       0.37      0.37      0.37     28691
weighted avg       0.56      0.55      0.55     28691

0.5523334843679203


  _warn_prf(average, modifier, msg_start, len(result))


## XLNet sentiment (IMDB trained)
* Best pre-trained available from spark-nlp for sentiment?

In [21]:
# https://nlp.johnsnowlabs.com/2021/12/23/xlnet_base_sequence_classifier_imdb_en.html
# https://huggingface.co/datasets/imdb (Trained on the Stanford sentiment data, 25,000 movie reviews)

document_assembler = DocumentAssembler() \
  .setInputCol('text') \
  .setOutputCol('document')

tokenizer = Tokenizer() \
  .setInputCols(['document']) \
  .setOutputCol('token')

sequenceClassifier = XlnetForSequenceClassification \
  .pretrained('xlnet_base_sequence_classifier_imdb', 'en') \
  .setInputCols(['token', 'document']) \
  .setOutputCol('class') \
  .setCaseSensitive(False) \
  .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
  document_assembler,
  tokenizer,
  sequenceClassifier
  ])

# Fit to data
pipelineModel = pipeline.fit(data)
result = pipelineModel.transform(data)

result.show(truncate=False)

xlnet_base_sequence_classifier_imdb download started this may take some time.
Approximate size to download 419.7 MB
[OK!]
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [22]:
# apply trained model to text
example = spark.createDataFrame([[comment]]).toDF("text")
result = pipelineModel.transform(example)
result.select('text', 'class').show(truncate=False)
result.selectExpr('document.result[0] as sentence', 'class.result[0] as sentiment', 'class.metadata as metadata').show(truncate=False)

+--------------------------------------------+------------------------------------------------------------------------------------------------+
|text                                        |class                                                                                           |
+--------------------------------------------+------------------------------------------------------------------------------------------------+
|The movie I watched today was not a good one|[{category, 0, 43, neg, {sentence -> 0, Some(neg) -> 0.9989093, Some(pos) -> 0.0010907185}, []}]|
+--------------------------------------------+------------------------------------------------------------------------------------------------+

+--------------------------------------------+---------+--------------------------------------------------------------------+
|sentence                                    |sentiment|metadata                                                            |
+--------------------------

## Sentiment (spark-nlp deep learning model)
* Best pre-trained available from spark-nlp for sentiment?

In [23]:
# https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb
# https://nlp.johnsnowlabs.com/2021/01/15/sentimentdl_use_imdb_en.html (Trained on the Stanford sentiment data, 25,000 movie reviews)

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

# notice SentimentDLModel not ClassifierDLModel
sentimentdl = SentimentDLModel.pretrained(name='sentimentdl_use_imdb', lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])

empty_data = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_data)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_imdb download started this may take some time.
Approximate size to download 12 MB
[OK!]


In [24]:
# apply trained model to text
example = spark.createDataFrame([[comment]]).toDF("text")
result = pipelineModel.transform(example)
result.select('text', 'sentiment').show(truncate=False)
result.selectExpr('document.result[0] as sentence', 'sentiment.result[0] as sentiment', 'sentiment.metadata as metadata').show(truncate=False)

+--------------------------------------------+------------------------------------------------------------------------------------+
|text                                        |sentiment                                                                           |
+--------------------------------------------+------------------------------------------------------------------------------------+
|The movie I watched today was not a good one|[{category, 0, 43, pos, {sentence -> 0, pos -> 0.9950311, neg -> 0.0049688937}, []}]|
+--------------------------------------------+------------------------------------------------------------------------------------+

+--------------------------------------------+---------+--------------------------------------------------------+
|sentence                                    |sentiment|metadata                                                |
+--------------------------------------------+---------+-------------------------------------------------------

## Emotion (bert_sequence_classifier_emotion)

In [25]:
# https://nlp.johnsnowlabs.com/2022/01/14/bert_sequence_classifier_emotion_en.html
# https://huggingface.co/nateraw/bert-base-uncased-emotion: "Not the best model, but it works in a pinch I guess..."

document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')

tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_emotion', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

# Rely on the given trained model, so "train" on an empty DataFrame to create the pipeline model.
empty_data = spark.createDataFrame([['']]).toDF("text")

# Fit to data
pipelineModel = pipeline.fit(empty_data)

bert_sequence_classifier_emotion download started this may take some time.
Approximate size to download 391.1 MB
[OK!]


In [26]:
# apply trained model to text
example = spark.createDataFrame([[comment]]).toDF("text")
result = pipelineModel.transform(example)
result.selectExpr('document.result[0] as sentence', 'class.result[0] as sentiment', 'class.metadata as metadata').show(truncate=False)

+--------------------------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                    |sentiment|metadata                                                                                                                                                                                 |
+--------------------------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The movie I watched today was not a good one|joy      |[{Some(sadness) -> 0.017389426, Some(surprise) -> 0.0030828605, Some(fear) -> 0.003483941, Some(love) -> 0.005832119, Some(anger) -> 0.01538162, Some(joy) -> 0.95483005, sentence -> 0}]|
+---------------------------

In [27]:
# Similar, but not exactly the same as found with original model:
# https://huggingface.co/nateraw/bert-base-uncased-emotion?text=I+like+you.+I+love+you

example = spark.createDataFrame([["I like you. I love you"]]).toDF("text")
result = pipelineModel.transform(example)
result.selectExpr('document.result[0] as sentence', 'class.result[0] as sentiment', 'class.metadata as metadata').show(truncate=False)

+----------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence              |sentiment|metadata                                                                                                                                                                               |
+----------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|I like you. I love you|love     |[{Some(sadness) -> 0.027391676, Some(surprise) -> 0.002645207, Some(fear) -> 0.002119432, Some(love) -> 0.61994994, Some(anger) -> 0.015406931, Some(joy) -> 0.3324868, sentence -> 0}]|
+----------------------+---------+------------------------------------------------------------------------------------------
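
The transformer classifiers wrap their label names as `Some(label)` in the metadata, unlike the earlier pipelines. A sketch of normalizing those keys and ranking the labels is shown below; `clean_scores` is a hypothetical helper, and the dictionary is copied from the "I like you. I love you" output above.

```python
import re

# Matches keys of the form "Some(label)" and captures the label.
_SOME = re.compile(r"^Some\((.*)\)$")


def clean_scores(metadata):
    """Return {label: probability}, stripping any Some(...) wrapper and the 'sentence' index."""
    scores = {}
    for key, value in metadata.items():
        if key == "sentence":
            continue
        match = _SOME.match(key)
        label = match.group(1) if match else key
        scores[label] = float(value)
    return scores


# Metadata copied from the output above.
metadata = {
    "Some(sadness)": "0.027391676",
    "Some(surprise)": "0.002645207",
    "Some(fear)": "0.002119432",
    "Some(love)": "0.61994994",
    "Some(anger)": "0.015406931",
    "Some(joy)": "0.3324868",
    "sentence": "0",
}

scores = clean_scores(metadata)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])
```

Keys without the wrapper pass through unchanged, so the same helper also works on the metadata from the earlier sentiment pipelines.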

## Emotion (spark-nlp deep learning model)
* Best pre-trained available from spark-nlp for emotions?

In [28]:
# https://nlp.johnsnowlabs.com/2020/07/03/classifierdl_use_emotion_en.html

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")


sentimentdl = ClassifierDLModel.pretrained(name='classifierdl_use_emotion')\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])

empty_data = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_data)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]


In [29]:
# apply trained model to text
example = spark.createDataFrame([[comment]]).toDF("text")
result = pipelineModel.transform(example)
result.selectExpr('text', 'sentiment.result as result', 'sentiment.metadata as metadata').show(truncate=False)

+--------------------------------------------+---------+---------------------------------------------------------------------------------------------------------+
|text                                        |result   |metadata                                                                                                 |
+--------------------------------------------+---------+---------------------------------------------------------------------------------------------------------+
|The movie I watched today was not a good one|[sadness]|[{surprise -> 0.03977137, joy -> 6.099707E-4, fear -> 2.8340122E-5, sadness -> 0.9595903, sentence -> 0}]|
+--------------------------------------------+---------+---------------------------------------------------------------------------------------------------------+



In [30]:
# Compare with https://nlp.johnsnowlabs.com/2020/07/03/classifierdl_use_emotion_en.html

sample = spark.createDataFrame([["@Mira I just saw you on live t.v!!"],
                              ["Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!"],
                              ["Nooooo! My dad turned off the internet so I can't listen to band music!"],
                              ["My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack and chill in bones followed soon after."]]).toDF("text")
result = pipelineModel.transform(sample)
result.selectExpr('text', 'sentiment.result as result', 'sentiment.metadata as metadata').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                 |result    |metadata                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------+
|@Mira I just saw you on live t.v!!                                                                                                   |[surprise]|[{surprise -> 0.99995255, joy -> 2.7403673E-6, fear -> 8.520834E-6, sadness -> 3.62882

### spark-nlp deep learning example

In [31]:
# https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_EMOTION.ipynb#scrollTo=1XxHWemdE5hX
MODEL_NAME='classifierdl_use_emotion'

text_list = [
            """I am SO happy the news came out in time for my birthday this weekend! My inner 7-year-old cannot WAIT!""",
            """That moment when you see your friend in a commercial. Hahahaha!""",
            """My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack &amp; chill in bones followed soon after.""",
            """For some reason I woke up thinkin it was Friday then I got to school and realized its really Monday -_-""",
            """I'd probably explode into a jillion pieces from the inablility to contain all of my if I had a Whataburger patty melt right now. #drool""",
            """These are not emotions. They are simply irrational thoughts feeding off of an emotion""",
            """Found out im gonna be with sarah bo barah in ny for one day!!! Eggcitement :)""",
            """That awkward moment when you find a perfume box full of sensors!""",
            """Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!""",
            """Nooooo! My dad turned off the internet so I can't listen to band music!""",
            ]

In [32]:
# Apply the model to the data

import pandas as pd

df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
result = pipelineModel.transform(df)
result.selectExpr('text', 'sentiment.result as result', 'sentiment.metadata as metadata').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                   |result    |metadata                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------+
|I am SO happy the news came out in time for my birthday this weekend! My inner 7-year-old cannot WAIT!                                 |[surprise]|[{surprise -> 0.9964477, joy -> 0.0035500138, fear -> 4.2735905E-7, sadness

In [33]:
# Display results
from pyspark.sql import functions as F

r = result.selectExpr("document.result as document", "sentiment.result as sentiment") \
          .select(F.explode(F.arrays_zip('document', 'sentiment')).alias('cols')).selectExpr("cols.*")

pdf = r.toPandas()
pdf

| |document|sentiment|
|---|---|---|
|0|I am SO happy the news came out in time for my...|surprise|
|1|That moment when you see your friend in a comm...|surprise|
|2|My soul has just been pierced by the most evil...|fear|
|3|For some reason I woke up thinkin it was Frida...|sadness|
|4|I'd probably explode into a jillion pieces fro...|sadness|
|5|These are not emotions. They are simply irrati...|fear|
|6|Found out im gonna be with sarah bo barah in n...|surprise|
|7|That awkward moment when you find a perfume bo...|surprise|
|8|Just home from group celebration - dinner at T...|joy|
|9|Nooooo! My dad turned off the internet so I ca...|sadness|
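
To get a quick feel for how the predictions are distributed, the labels can be tallied. This is a small sketch using only the standard library, with the ten labels copied by hand from the table above; on the actual pandas DataFrame, `pdf["sentiment"].value_counts()` should give the same counts.

```python
from collections import Counter

# Predicted labels copied from the table above, in row order.
labels = ["surprise", "surprise", "fear", "sadness", "sadness",
          "fear", "surprise", "surprise", "joy", "sadness"]

counts = Counter(labels)
for label, n in counts.most_common():
    print(f"{label:<9}{n}")
```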


# Question / Answer models

**T5 only seems to work if the runtime is *restarted* before running this section.**

If you have more than 24 GB of memory, it's likely to work better. :-)


## Download T5 Model and Create Spark NLP Pipeline
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/10.Question_Answering_and_Summarization_with_T5.ipynb

In [34]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") 

# Can take in document or sentence columns
t5 = T5Transformer.pretrained(name='t5_base',lang='en')\
    .setInputCols('document')\
    .setOutputCol("T5")\
    .setMaxOutputLength(400)

t5_base download started this may take some time.
Approximate size to download 451.8 MB
[OK!]


### Set the Task to question

In [35]:
# Set the T5 task to question answering. The model behaves differently depending on the current task setting.
t5.setTask('question')

T5TRANSFORMER_8078c2d39352


### Answer Closed Book Questions

Closed book means that no additional context is given, so the model must answer the question using only the knowledge stored in its weights.


In [36]:
# Build pipeline with T5
pipe_components = [documentAssembler,t5]
pipeline = Pipeline().setStages(pipe_components)

# define Data
data = [["Who is president of the USA? "],
        ["What is the capital of the USA?"],
        ["What is the capital of Montana?"],
        ["What is the best breed of dog?"],
        ["Who is president of Nigeria? "],
        ["What is the most common language in India? "],
        ["What is the capital of Germany? "],]
df=spark.createDataFrame(data).toDF('text')

#Predict on text data with T5
model = pipeline.fit(df)
annotated_df = model.transform(df)
annotated_df.select(['text','t5.result']).show(truncate=False)

+-------------------------------------------+------------------+
|text                                       |result            |
+-------------------------------------------+------------------+
|Who is president of the USA?               |[John F. Kennedy] |
|What is the capital of the USA?            |[Washington]      |
|What is the capital of Montana?            |[Montana]         |
|What is the best breed of dog?             |[dog]             |
|Who is president of Nigeria?               |[Muhammadu Buhari]|
|What is the most common language in India? |[Hindi]           |
|What is the capital of Germany?            |[Berlin]          |
+-------------------------------------------+------------------+
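
Some of the closed-book answers above just repeat a word from the question ("Montana", "dog"). A rough heuristic for flagging such answers is sketched below. The function name `echoes_question` is hypothetical, and the rows are copied from the table above; this is only a sanity check, not a measure of correctness.

```python
import re


def echoes_question(question, answer):
    """True if every word of the answer already appears in the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    a_words = re.findall(r"\w+", answer.lower())
    return bool(a_words) and all(w in q_words for w in a_words)


# (question, answer) pairs copied from the output table above.
rows = [
    ("What is the capital of Montana?", "Montana"),
    ("What is the best breed of dog?", "dog"),
    ("What is the capital of Germany? ", "Berlin"),
    ("Who is president of the USA? ", "John F. Kennedy"),
]

flagged = [q for q, a in rows if echoes_question(q, a)]
print(flagged)
```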




### Answer Open Book Questions

These are questions where we give the model some additional context that it can use to answer the question.


In [37]:
context    = 'context: Peters last week was terrible! He had an accident and broke his leg while skiing!'
question1  = 'question: Why was peters week so bad? '
question2  = 'question: How did peter broke his leg? ' 
question3  = 'question: How did peter broke his leg? '
 
data = [[question1+context],[question2+context],[question3+context],]
df=spark.createDataFrame(data).toDF('text')

#Predict on text data with T5
model = pipeline.fit(df)
annotated_df = model.transform(df)
annotated_df.select(['text','t5.result']).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
|text                                                                                                                             |result                                             |
+---------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
|question: Why was peters week so bad? context: Peters last week was terrible! He had an accident and broke his leg while skiing! |[He had an accident and broke his leg while skiing]|
|question: How did peter broke his leg? context: Peters last week was terrible! He had an accident and broke his leg while skiing!|[skiing]                                           |
|question: How did peter broke his leg? context: Peters last week was terrible! 

In [38]:
# Ask T5 questions in the context of a News Article
question1 = 'question: Who is Jack ma? '
question2 = 'question: Who is founder of Alibaba Group? '
question3 = 'question: When did Jack Ma re-appear? '
question4 = 'question: How did Alibaba stocks react? '
question5 = 'question: Whom did Jack Ma meet? '
question6 = 'question: Who did Jack Ma hide from? '


# from https://www.bbc.com/news/business-55728338 
news_article_context = """ context:
Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire.
His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses.
The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media.
Alibaba shares surged 5% on Hong Kong's stock exchange on the news.
"""

data = [
             [question1+ news_article_context],
             [question2+ news_article_context],
             [question3+ news_article_context],
             [question4+ news_article_context],
             [question5+ news_article_context],
             [question6+ news_article_context]]


df=spark.createDataFrame(data).toDF('text')

#Predict on text data with T5
model = pipeline.fit(df)
annotated_df = model.transform(df)
annotated_df.select(['t5.result']).show(truncate=False)

+-----------------------+
|result                 |
+-----------------------+
|[Alibaba Group founder]|
|[Jack Ma]              |
|[Wednesday]            |
|[surged 5%]            |
|[100 rural teachers]   |
|[Chinese regulators]   |
+-----------------------+
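
The open-book cells above all build their input the same way: a `question:` part followed by a `context:` part. A small helper can generate those rows from a list of questions. This is only a sketch; `build_rows` is a hypothetical name, and the one-sentence context is an excerpt of the article used above.

```python
def build_rows(questions, context):
    """Return one single-column row per question, all sharing the same context."""
    context = context.strip()
    return [[f"question: {q.strip()} context: {context}"] for q in questions]


questions = ["Who is founder of Alibaba Group?", "How did Alibaba stocks react?"]
context = "Alibaba shares surged 5% on Hong Kong's stock exchange on the news."

rows = build_rows(questions, context)
for row in rows:
    print(row[0])
```

The resulting `rows` list has the same shape as `data` in the cells above, so it could be passed to `spark.createDataFrame(rows).toDF('text')` in the same way.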



## Summarize documents

In [39]:
# Set the T5 task to summarization
t5.setTask('summarize')

T5TRANSFORMER_8078c2d39352

In [40]:
# https://www.reuters.com/article/instant-article/idCAKBN2AA2WF
text = """(Reuters) - Mastercard Inc said on Wednesday it was planning to offer support for some cryptocurrencies on its network this year, joining a string of big-ticket firms that have pledged similar support.

The credit-card giant’s announcement comes days after Elon Musk’s Tesla Inc revealed it had purchased $1.5 billion of bitcoin and would soon accept it as a form of payment.

Asset manager BlackRock Inc and payments companies Square and PayPal have also recently backed cryptocurrencies.

Mastercard already offers customers cards that allow people to transact using their cryptocurrencies, although without going through its network.

"Doing this work will create a lot more possibilities for shoppers and merchants, allowing them to transact in an entirely new form of payment. This change may open merchants up to new customers who are already flocking to digital assets," Mastercard said. (mstr.cd/3tLaPZM)

Mastercard specified that not all cryptocurrencies will be supported on its network, adding that many of the hundreds of digital assets in circulation still need to tighten their compliance measures.

Many cryptocurrencies have struggled to win the trust of mainstream investors and the general public due to their speculative nature and potential for money laundering.
"""
data = [[text]]
df=spark.createDataFrame(data).toDF('text')
#Predict on text data with T5
model = pipeline.fit(df)
annotated_df = model.transform(df)
annotated_df.select(['t5.result']).show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [41]:
v = annotated_df.take(1)
print(f"Original Length {len(v[0].text)}   Summarized Length : {len(v[0].T5[0].result)} ")

Original Length 1284   Summarized Length : 352 
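
The two lengths printed above can be turned into a compression ratio, which is a little easier to read. A minimal sketch using the numbers from the output:

```python
original_len = 1284  # characters in the input article (from the output above)
summary_len = 352    # characters in the T5 summary (from the output above)

ratio = summary_len / original_len
print(f"Summary is {ratio:.1%} of the original length")
```

So the summary is a bit more than a quarter of the original length.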


In [42]:
# Full summarized text
v[0].T5[0].result

'mastercard said on Wednesday it was planning to offer support for some cryptocurrencies on its network this year . the credit-card giant’s announcement comes days after Elon Musk’s Tesla Inc revealed it had purchased $1.5 billion of bitcoin . asset manager blackrock and payments companies Square and PayPal have also recently backed cryptocurrencies .'

## Multi Problem T5 model for Summarization and more
The main T5 model was trained for over 20 tasks from the SQuAD/GLUE/SuperGLUE datasets. See [this notebook](https://github.com/JohnSnowLabs/nlu/blob/master/examples/webinars_conferences_etc/multi_lingual_webinar/7_T5_SQUAD_GLUE_SUPER_GLUE_TASKS.ipynb) for a demo of all tasks.


# Overview of every task available with T5
[The T5 model](https://arxiv.org/pdf/1910.10683.pdf) is trained on various datasets for 17 different tasks which fall into 8 categories.



1. Text summarization
2. Question answering
3. Translation
4. Sentiment analysis
5. Natural language inference
6. Coreference resolution
7. Sentence completion
8. Word sense disambiguation

### Every T5 Task with explanation:
|Task Name | Explanation | 
|----------|--------------|
|[1.CoLA](https://nyu-mll.github.io/CoLA/)                   | Classify if a sentence is grammatically correct|
|[2.RTE](https://dl.acm.org/doi/10.1007/11736790_9)                    | Classify whether a statement can be deduced from a sentence|
|[3.MNLI](https://arxiv.org/abs/1704.05426)                   | Classify for a hypothesis and premise whether they entail each other, contradict each other, or neither (3 classes).|
|[4.MRPC](https://www.aclweb.org/anthology/I05-5002.pdf)                   | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent)|
|[5.QNLI](https://arxiv.org/pdf/1804.07461.pdf)                   | Classify whether the answer to a question can be deduced from an answer candidate.|
|[6.QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)                    | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent)|
|[7.SST2](https://www.aclweb.org/anthology/D13-1170.pdf)                   | Classify the sentiment of a sentence as positive or negative|
|[8.STSB](https://www.aclweb.org/anthology/S17-2001/)                   | Rate the semantic similarity of a pair of sentences on a scale from 1 to 5 (21 classes)|
|[9.CB](https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601)                     | Classify for a premise and a hypothesis whether they contradict each other or not (binary).|
|[10.COPA](https://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418/0)                   | Classify for a question, premise, and 2 choices which choice the correct choice is (binary).|
|[11.MultiRc](https://www.aclweb.org/anthology/N18-1023.pdf)                | Classify for a question, a paragraph of text, and an answer candidate, if the answer is correct (binary).|
|[12.WiC](https://arxiv.org/abs/1808.09121)                    | Classify for a pair of sentences and an ambiguous word if the word has the same meaning in both sentences.|
|[13.WSC/DPR](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492/0)       | Predict for an ambiguous pronoun in a sentence what it is referring to.  |
|[14.Summarization](https://arxiv.org/abs/1506.03340)          | Summarize text into a shorter representation.|
|[15.SQuAD](https://arxiv.org/abs/1606.05250)                  | Answer a question for a given context.|
|[16.WMT1.](https://arxiv.org/abs/1706.03762)                  | Translate English to German|
|[17.WMT2.](https://arxiv.org/abs/1706.03762)                   | Translate English to French|
|[18.WMT3.](https://arxiv.org/abs/1706.03762)                   | Translate English to Romanian|
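
The "21 classes" noted for STSB in the table can look odd for a 1-to-5 scale. As I understand the T5 paper linked above, scores are rounded to the nearest 0.2 and emitted as text, which gives 21 possible labels from 1.0 to 5.0. A minimal sketch of that arithmetic, under that assumption (the function name `stsb_label` is hypothetical):

```python
def stsb_label(score):
    """Round a score to the nearest 0.2 and render it as a string label."""
    return f"{round(score * 5) / 5:.1f}"


# All distinct labels on the 1-to-5 scale in steps of 0.2.
labels = [f"{1 + 0.2 * i:.1f}" for i in range(21)]

print(len(labels), labels[0], labels[-1])
print(stsb_label(2.57))
```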

