## Goal: Train and deploy an NLP classification model on labeled text data using Spark NLP transfer learning on the BERT (Bidirectional Encoder Representations from Transformers) language model

**Source dataset**: https://www.kaggle.com/datasets/reihanenamdari/mental-health-corpus?resource=download

**Dataset description**: "The Mental Health Corpus is a collection of texts related to people with anxiety, depression, and other mental health issues. The corpus consists of two columns: one containing the comments, and the other containing labels indicating whether the comments are considered poisonous or not. The corpus can be used for a variety of purposes, such as sentiment analysis, toxic language detection, and mental health language analysis. The data in the corpus may be useful for researchers, mental health professionals, and others interested in understanding the language and sentiment surrounding mental health issues."

## Databricks cluster configuration considerations

https://nlp.johnsnowlabs.com/docs/en/production-readiness

## Confirm Spark NLP is available

In [0]:
spark.sparkContext.getConf().get('spark.jars.packages')

Out[5]: 'com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.8'

## Start the Spark session

In [0]:
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

spark = sparknlp.start()

print("Version of SparkNLP:", sparknlp.version())
print("Version of Spark :", spark.version)

Version of SparkNLP: 4.2.8
Version of Spark : 3.3.1


## Load and read the dataset

In [0]:
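# Import only the functions used here; a wildcard import shadows Python built-ins such as sum and max
from pyspark.sql.functions import col, when

# Rename the numeric label column and map it to readable class names
df = spark.read.table('default.mental_health') \
    .withColumnRenamed('label', 'category') \
    .select("category", "text") \
    .withColumn(
        "category",
        when(col("category") == 1, "depressed").otherwise("not_depressed"))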

df.groupBy("category").count().show()

+-------------+-----+
|     category|count|
+-------------+-----+
|not_depressed|14139|
|    depressed|13838|
+-------------+-----+
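
The two classes are close to balanced, which is worth confirming before treating accuracy as a headline metric. A minimal sketch in plain Python, using the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the groupBy output above.
counts = {"not_depressed": 14139, "depressed": 13838}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

With 27,977 rows in total, the split is roughly 50.5% to 49.5%, so no resampling or class weighting is obviously needed.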



In [0]:
df.show(50, truncate=100)

+-------------+----------------------------------------------------------------------------------------------------+
|     category|                                                                                                text|
+-------------+----------------------------------------------------------------------------------------------------+
|not_depressed|dear american teens question dutch person heard guys get way easier things learn age us sooooo th...|
|    depressed|nothing look forward lifei dont many reasons keep going feel like nothing keeps going next day ma...|
|not_depressed|music recommendations im looking expand playlist usual genres alt pop minnesota hip hop steampunk...|
|    depressed|im done trying feel betterthe reason im still alive know mum devastated ever killed myself ever p...|
|    depressed|worried  year old girl subject domestic physicalmental housewithout going lot know girl know girl...|
|    depressed|hey rredflag sure right place post this goes  im 

## Split the dataset into training and testing sets

In [0]:
train_text, test_text = df.randomSplit([0.8, 0.2], seed = 12345)
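
`randomSplit` assigns rows to splits at random, so the 0.8/0.2 weights are targets rather than exact proportions, and the `seed` makes the assignment repeatable. The idea can be sketched in plain Python. This only illustrates weighted, seeded assignment and is not Spark's actual sampling implementation; `seeded_split` is a hypothetical helper.

```python
import random


def seeded_split(rows, weights, seed):
    """Assign each row to one split using a seeded weighted draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = seeded_split(rows, [0.8, 0.2], seed=12345)
print(len(train), len(test))
```

In this notebook's run the test split holds 5,722 of the 27,977 rows (about 20.5%), which is the kind of small deviation from 0.2 that random assignment produces.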

## Start MLflow for experiment and model tracking

In [0]:
import mlflow
from mlflow.models import Model, infer_signature, ModelSignature

mlflow_run = mlflow.start_run()

# Signature: infer from the input column only, so the label is not required at inference time
signature = infer_signature(test_text.select("text"))

## Classification with `BertSentenceEmbeddings`

Using **BERT LaBSE Sentence Embeddings** (https://nlp.johnsnowlabs.com/2020/09/23/labse.html) and **ClassifierDLApproach** (https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/classifier/dl/ClassifierDLApproach)

In [0]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings\
    .pretrained("labse", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(10)\
    .setEnableOutputLogs(True)

nlp_pipeline_bert = Pipeline(
    stages=[document,
            embeddings,
            classifierdl])

labse download started this may take some time.
Approximate size to download 1.7 GB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][OK!]


#### Train the model on the training set (depending on cluster configuration, this can take 30 minutes to an hour)

In [0]:
classification_model_bert = nlp_pipeline_bert.fit(train_text)

#### Log the model in MLflow and build a reference to the model URI

In [0]:
import pandas as pd

input_example = pd.DataFrame([{"index": 0, "text": "I'm lost in the darkness and can't find my way out."}, {"index": 1, "text": "Life is full of exciting possibilities!"}])

conda_env = {
    'channels': ['conda-forge'],
    'dependencies': [
        'python=3.9.5',
        {
            "pip": [
              'pyspark==3.3.1',  # match the Spark version reported by the cluster above
              'mlflow<3,>=2.1',
              'spark-nlp==4.2.8'
            ]
        }
    ],
    'name': 'mlflow-env'
}

model_name = "BERT_NLP_mental_health_classification_model"

mlflow.spark.log_model(classification_model_bert, model_name, conda_env=conda_env, signature=signature, input_example=input_example)

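# Record the Spark NLP Maven coordinate as a run parameter
# (log_artifacts expects a local directory path, not a package coordinate)
mlflow.log_param("spark_jars_packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.8")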

mlflow.end_run()

mlflow_model_uri = "runs:/{}/{}".format(mlflow_run.info.run_id, model_name)

display(mlflow_model_uri)



'runs:/96ddf7167cd748b68e38af22cd5cc54b/BERT_NLP_mental_health_classification_model'
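
The `runs:/<run_id>/<artifact_path>` form shown above is the MLflow URI scheme for artifacts logged under a run. A small sketch of building it, where `build_runs_uri` is a hypothetical helper and the run ID is the one from the output above:

```python
def build_runs_uri(run_id: str, artifact_path: str) -> str:
    """Build a runs:/ URI for an artifact logged under an MLflow run."""
    if not run_id or not artifact_path:
        raise ValueError("run_id and artifact_path must be non-empty")
    # Strip stray slashes so the URI has exactly one separator.
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


uri = build_runs_uri(
    "96ddf7167cd748b68e38af22cd5cc54b",
    "BERT_NLP_mental_health_classification_model",
)
print(uri)
```

Keeping the artifact path identical to the name passed to `log_model` is what lets the URI resolve later.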

#### Load the logged model back in from Experiments (takes a while)

In [0]:
loaded_model = mlflow.spark.load_model(mlflow_model_uri)

2023/02/06 20:58:55 INFO mlflow.spark: 'runs:/96ddf7167cd748b68e38af22cd5cc54b/BERT_NLP_mental_health_classification_model' resolved as 'dbfs:/databricks/mlflow-tracking/507237834301906/96ddf7167cd748b68e38af22cd5cc54b/artifacts/BERT_NLP_mental_health_classification_model'


#### Run predictions on test dataset (also takes some time to process)

In [0]:
df_bert = loaded_model.transform(test_text).select("category", "text", "class.result").toPandas()
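
`class.result` is an array column, so each cell in the resulting pandas frame holds a list-like value such as `[depressed]` rather than a bare string. A minimal sketch of unwrapping it in plain Python; `first_label` and the sample rows are hypothetical and stand in for the real frame.

```python
def first_label(result):
    """Return the first predicted label, or None when the list is empty."""
    return result[0] if len(result) > 0 else None


# Made-up rows shaped like the selected columns (text omitted for brevity).
rows = [
    {"category": "depressed", "result": ["depressed"]},
    {"category": "depressed", "result": ["not_depressed"]},
    {"category": "not_depressed", "result": []},
]

predicted = [first_label(row["result"]) for row in rows]
print(predicted)
```

Guarding against an empty list is an assumption about edge cases. The pandas accessor `.str[0]` performs the same first-element extraction on a whole column.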

#### View some 'depressed' predictions

In [0]:
df_bert_depressed = df_bert[df_bert['category'] == 'depressed']
df_bert_depressed.head(50)

Unnamed: 0,category,text,result
0,depressed,mg xanax thinking taking all know im even po...,[depressed]
1,depressed,college graduate make k year live atlanta kid...,[not_depressed]
2,depressed,commit redflag mondaymy situation pity asked ...,[depressed]
3,depressed,days ago redflag failedthe past days despera...,[depressed]
4,depressed,dead every fucking second day good enough sta...,[depressed]
5,depressed,extremely depressed struggle feel sense impor...,[depressed]
6,depressed,feel bad cant pass way idk want cry time feel...,[depressed]
7,depressed,hours i hopeim indecisive bastard first jan ...,[depressed]
8,depressed,male college student need help badlyi really ...,[depressed]
9,depressed,malemy mum alcoholic mum means mom england dr...,[not_depressed]


#### View some 'not_depressed' predictions

In [0]:
df_bert_not_depressed = df_bert[df_bert['category'] == 'not_depressed']
df_bert_not_depressed.head(50)

Unnamed: 0,category,text,result
2805,not_depressed,yes need proof prove need start fundament...,[not_depressed]
2806,not_depressed,im still horndog reddit listening musicals b...,[not_depressed]
2807,not_depressed,already starting shitty woke damn spider dick...,[not_depressed]
2808,not_depressed,assignments core class alone feelin good,[not_depressed]
2809,not_depressed,bbc production jane eyre starring zelah clark...,[not_depressed]
2810,not_depressed,biased muppet fan love treasure island christ...,[not_depressed]
2811,not_depressed,cool facts house cats ahhhhh oh god fyck fycj...,[not_depressed]
2812,not_depressed,days ago days ago lost cat congestive heart ...,[depressed]
2813,not_depressed,days till im oap okay maybe oap styll one yea...,[not_depressed]
2814,not_depressed,e n bad joke,[not_depressed]


## Generate Scikit-learn Classification report

In [0]:
from sklearn.metrics import classification_report

# `result` holds a one-element list per row; .str[0] extracts the predicted label
print(classification_report(df_bert.category, df_bert.result.str[0]))

               precision    recall  f1-score   support

    depressed       0.91      0.91      0.91      2805
not_depressed       0.91      0.91      0.91      2917

     accuracy                           0.91      5722
    macro avg       0.91      0.91      0.91      5722
 weighted avg       0.91      0.91      0.91      5722
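
The report's columns follow the usual per-class definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A minimal sketch in plain Python on a made-up six-row example (the labels below are illustrative and not taken from the test set; `per_class_metrics` is a hypothetical helper):

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


y_true = ["depressed", "depressed", "depressed",
          "not_depressed", "not_depressed", "not_depressed"]
y_pred = ["depressed", "depressed", "not_depressed",
          "not_depressed", "not_depressed", "depressed"]

for label in ("depressed", "not_depressed"):
    p, r, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The `support` column in the report is simply the number of true rows per class, here 2,805 and 2,917.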



#### Create a new DataFrame of unlabeled text snippets to test the model on previously unseen data

In [0]:
df_new = spark.createDataFrame([
    (1, "I'm so alone and isolated in my darkness."),
    (2, "remember that when we show consideration for others, great things can be achieved"),
    (3, "oopdeedoop beedlebob goopdeedoo pooplepoo!"),
    (4, "i am very unhappy my life is miserable no one likes me i want to be alone"),
    (5, "i went to bed early last night and got lots of sleep"),
    (6, "today wasn't bad i got some shit done and went ona nice walk"),
    (7, "Kitties and doggies snuggle up together, giving each other the love and warmth that only furry friends can provide. Their purrs and yips fill the air with joy and sweetness. Together, they make a truly wonderful pair."),
    (8, "I feel like I'm all alone in this world. No one understands me or cares about me. I just wish things were different."),
    (9, "Every sunrise brings a new possibility for creating your own happiness."),
    (10, "I just can't find the will to go on.")
]).toDF("id", "text")

#### See prediction results

In [0]:
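# Use a separate name so the original `df` loaded earlier is not overwritten
df_new_predictions = loaded_model.transform(df_new).select("text", "class.result").toPandas()
display(df_new_predictions)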

text,result
I'm so alone and isolated in my darkness.,List(depressed)
"remember that when we show consideration for others, great things can be achieved",List(not_depressed)
oopdeedoop beedlebob goopdeedoo pooplepoo!,List(not_depressed)
i am very unhappy my life is miserable no one likes me i want to be alone,List(depressed)
i went to bed early last night and got lots of sleep,List(depressed)
today wasn't bad i got some shit done and went ona nice walk,List(not_depressed)
"Kitties and doggies snuggle up together, giving each other the love and warmth that only furry friends can provide. Their purrs and yips fill the air with joy and sweetness. Together, they make a truly wonderful pair.",List(not_depressed)
I feel like I'm all alone in this world. No one understands me or cares about me. I just wish things were different.,List(depressed)
Every sunrise brings a new possibility for creating your own happiness.,List(not_depressed)
I just can't find the will to go on.,List(depressed)
