<a href="https://colab.research.google.com/github/raineydavid/22-day-coding-challenge/blob/master/python/CLASSIFICATION_TREC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb)




# **Classify text according to TREC classes**

## 1. Colab Setup

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash
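# Alternatively, download colab.sh and run it with explicit versions:
#   -p sets the PySpark version
#   -s sets the Spark NLP version
# !bash colab.sh -p 3.1.1 -s 3.0.1
# Without flags, both default to the latest versions set in the script.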

--2021-06-27 01:14:36--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-06-27 01:14:36--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘STDOUT’

setup Colab for PySpark 3.0.3 and Spark NLP 3.1.1

2021-06-27 01:14:36 (41.0 MB/s) - written to stdout [1608/1608]

Ign:1 https://developer.download.nvidia.

In [2]:
!wget http://setup.johnsnowlabs.com/colab.sh 

--2021-06-27 01:16:19--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-06-27 01:16:20--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘colab.sh’


2021-06-27 01:16:20 (27.0 MB/s) - ‘colab.sh’ saved [1608/1608]



In [3]:
!cat colab.sh

#!/bin/bash

#default values for pyspark, spark-nlp, and SPARK_HOME
SPARKNLP="3.1.1"
PYSPARK="3.0.3"
SPARKHOME="/content/spark-3.1.2-bin-hadoop2.7"

while getopts s:p: option
do
 case "${option}"
 in
 s) SPARKNLP=${OPTARG};;
 p) PYSPARK=${OPTARG};;
 esac
done

echo "setup Colab for PySpark $PYSPARK and Spark NLP $SPARKNLP"
apt-get update
apt-get purge -y openjdk-11* -qq > /dev/null && sudo apt-get autoremove -y -qq > /dev/null
apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

if [[ "$PYSPARK" == "3.1"* ]]; then
  wget -q "https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz" > /dev/null
  tar -xvf spark-3.1.2-bin-hadoop2.7.tgz > /dev/null
  SPARKHOME="/content/spark-3.1.2-bin-hadoop2.7"
elif [[ "$PYSPARK" == "3.0"* ]]; then
  wget -q "https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz" > /dev/null
  tar -xvf spark-3.0.3-bin-hadoop2.7.tgz > /dev/null
  SPARKHOME="/content/spark-3.0.3-bin-hadoop2.7"
elif [[ "$PYSPARK" == "2"* ]];

In [4]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

In [5]:
spark = sparknlp.start()

## 3. Select the DL model and re-run all the cells below

The classes in TREC-6 are:

ABBR - Abbreviation

DESC - Description and abstract concepts

ENTY - Entities

HUM - Human beings

LOC - Locations

NUM - Numeric values

The classes in TREC-50 are defined at <https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html>.
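The class list above can be kept as a small lookup table, which helps when reading model output. The sketch below is a minimal illustration, not part of Spark NLP: the names `TREC6_CLASSES`, `coarse_class`, and `describe` are hypothetical, and the `COARSE_fine` label format is an assumption based on labels such as `DESC_desc` and `NUM_count` in the results table later in this notebook.

```python
# TREC-6 coarse classes, as listed above (NUM is the code used in the model output).
TREC6_CLASSES = {
    "ABBR": "Abbreviation",
    "DESC": "Description and abstract concepts",
    "ENTY": "Entities",
    "HUM": "Human beings",
    "LOC": "Locations",
    "NUM": "Numeric values",
}


def coarse_class(label):
    """Return the coarse TREC-6 class of a coarse or fine-grained label.

    Assumption: fine-grained labels look like "COARSE_fine" (e.g. "NUM_count").
    Surrounding whitespace is stripped before the label is split.
    """
    coarse = label.strip().split("_", 1)[0].upper()
    if coarse not in TREC6_CLASSES:
        raise ValueError(f"Unknown TREC class: {label!r}")
    return coarse


def describe(label):
    """Return the human-readable description of a label's coarse class."""
    return TREC6_CLASSES[coarse_class(label)]


print(describe("NUM_count"))  # prints "Numeric values"
```

Labels outside the six coarse classes raise a `ValueError` rather than being silently mapped, so an unexpected label format shows up early.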



In [6]:
### Select Model
#model_name = 'classifierdl_use_trec6'
model_name = 'classifierdl_use_trec50'

## 4. Sample examples

In [7]:

text_list = [
    "What effect does pollution have on the Chesapeake Bay oysters?",
    "What financial relationships exist between Google and its advertisers?",
    "What financial relationships exist between the Chinese government and the Cuban government?",
    "What was the number of member nations of the U.N. in 2000?",
    "Who is the Secretary-General for political affairs?",
    "When did the construction of stone circles begin in the UK?",
    "In what country is the WTO headquartered?",
    "What animal was the first mammal successfully cloned from adult cells?",
    "What other prince showed his paintings in a two-prince exhibition with Prince Charles in London?",
    "Is there evidence to support the involvement of Garry Kasparov in politics?",
  ]

## 5. Define Spark NLP pipeline

In [8]:
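# Wrap the raw "text" column into Spark NLP document annotations.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Embed each document with the pretrained Universal Sentence Encoder.
use = UniversalSentenceEncoder.pretrained(lang="en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Classify each document from its sentence embeddings with the selected TREC model.
document_classifier = ClassifierDLModel.pretrained(model_name, 'en') \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    use,
    document_classifier
])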


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
classifierdl_use_trec50 download started this may take some time.
Approximate size to download 21.2 MB
[OK!]


## 6. Run the pipeline

In [9]:
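# Every stage is pretrained, so fitting on an empty DataFrame only assembles the PipelineModel.
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlpPipeline.fit(empty_df)

# Put the sample questions in a single "text" column and run the pipeline on them.
df = spark.createDataFrame(pd.DataFrame({"text": text_list}))
result = pipelineModel.transform(df)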
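For a handful of strings, the fitted pipeline can also be wrapped in Spark NLP's `LightPipeline`, whose `annotate` method returns a dict that maps each output column to a list of result strings. The sketch below shows one way to turn that into `(text, label)` pairs. The helper `classify_texts` and the `StubPipeline` class are hypothetical: the stub only imitates the shape of `annotate` output so the example runs without a Spark session, and its labels are made up rather than real model predictions.

```python
def classify_texts(light_pipeline, texts, output_col="class"):
    """Return (text, label) pairs using a LightPipeline-like object.

    `light_pipeline.annotate(text)` is expected to return a dict that maps
    each output column name to a list of result strings.
    """
    pairs = []
    for text in texts:
        annotations = light_pipeline.annotate(text)
        labels = annotations.get(output_col, [])
        # Take the first label and strip whitespace; None if nothing was returned.
        pairs.append((text, labels[0].strip() if labels else None))
    return pairs


class StubPipeline:
    """Hypothetical stand-in for LightPipeline; its labels are NOT model output."""

    def annotate(self, text):
        label = "HUM_ind" if text.lower().startswith("who") else "DESC_desc"
        return {"document": [text], "class": [label]}


print(classify_texts(StubPipeline(), ["Who is the Secretary-General for political affairs?"]))

# In the notebook, replace the stub with the fitted pipeline:
#   light = LightPipeline(pipelineModel)   # LightPipeline comes from sparknlp.base
#   classify_texts(light, text_list)
```

This path avoids building a Spark DataFrame for small inputs. For larger batches, `pipelineModel.transform(df)` as used above remains the natural choice.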

## 7. Visualize results

In [10]:
# Pair each document with its predicted class, then show one row per pair.
result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("document"),
            F.expr("cols['1']").alias("class")).show(truncate=False)

+------------------------------------------------------------------------------------------------+------------+
|document                                                                                        |class       |
+------------------------------------------------------------------------------------------------+------------+
|What effect does pollution have on the Chesapeake Bay oysters?                                  | DESC_desc  |
|What financial relationships exist between Google and its advertisers?                          | DESC_desc  |
|What financial relationships exist between the Chinese government and the Cuban government?     | DESC_desc  |
|What was the number of member nations of the U.N. in 2000?                                      | NUM_count  |
|Who is the Secretary-General for political affairs?                                             | HUM_ind    |
|When did the construction of stone circles begin in the UK?                                     | LOC_o