<img src="../assets/a_type_readme.gif" style="float:right ; margin: 10px ; width:300px;"> 
<h1><left>NLP Project</left></h1>
<h4><left>Using Natural Language Processing to better understand Depression & Anxiety</left></h4>
___

## 3. Analysis

In [None]:
# !pip install tokenizers==0.9.4 
# !pip install --upgrade transformers==4.2.2 

In [21]:
import numpy as np
import pandas as pd

from random import randint
from time import time 
import logging 
import multiprocessing

import re
import json
from sklearn.model_selection import train_test_split

import transformers
assert transformers.__version__ == "4.2.2"

from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import pipeline

# from transformers import BertTokenizerFast, BertForSequenceClassification
# from transformers import Trainer, TrainingArguments

In [3]:
logging.basicConfig(filename="../logs/7_finetune_language-model.log",
                    format='%(asctime)s %(message)s',
                    filemode='w')
logger = logging.getLogger()

def print_time(input_str, start_time=0):
    # Print the minutes elapsed since start_time, prefixed with a label.
    print("{}: {} min".format(input_str, round((time() - start_time) / 60, 2)))
    
# #Setting the threshold of logger to DEBUG
# logger.setLevel(logging.DEBUG)
  
# #Test messages
# logger.debug("Harmless debug Message")
# logger.info("Just an information")
# logger.warning("Its a Warning")
# logger.error("Did you try to divide by zero")
# logger.critical("Internet is down")

In [4]:
model_data = pd.read_csv('../data/data_for_model.csv', keep_default_na=False)
print(model_data.info())
model_data.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   title                      1930 non-null   object
 1   selftext                   1930 non-null   object
 2   author                     1930 non-null   object
 3   score                      1930 non-null   int64 
 4   num_comments               1930 non-null   int64 
 5   is_anxiety                 1930 non-null   int64 
 6   url                        1930 non-null   object
 7   selftext_clean             1930 non-null   object
 8   selftext_broken_sentences  1930 non-null   object
 9   selftext_broken_words      1930 non-null   object
 10  title_clean                1930 non-null   object
 11  author_clean               1930 non-null   object
 12  megatext_clean             1930 non-null   object
dtypes: int64(3), object(10)
memory usage: 196.1+ KB
None


Unnamed: 0,title,selftext,author,score,num_comments,is_anxiety,url,selftext_clean,selftext_broken_sentences,selftext_broken_words,title_clean,author_clean,megatext_clean
0,Our most-broken and least-understood rules is ...,We understand that most people who reply immed...,SQLwitch,2319,175,0,https://www.reddit.com/r/depression/comments/d...,understand people reply immediately op invitat...,['we understand that most people who reply imm...,"['understand', 'people', 'reply', 'immediately...",broken least understood rule helper may invite...,sql witch,sql witch understand people reply immediately ...
1,"Regular Check-In Post, with important reminder...",Welcome to /r/depression's check-in post - a p...,SQLwitch,312,1136,0,https://www.reddit.com/r/depression/comments/m...,welcome r depression check post place take mom...,"[""welcome to /r/depression's check-in post - a...","['welcome', 'r', 'depression', 'check', 'post'...",regular check post important reminder private ...,sql witch,sql witch welcome r depression check post plac...
2,Low,I'm so low rn I can't even type anything coher...,RagingFlock89,263,43,0,https://www.reddit.com/r/depression/comments/n...,low rn even type anything coherent want expres...,"[""i'm so low rn i can't even type anything coh...","['low', 'rn', 'even', 'type', 'anything', 'coh...",low,raging flock 89,raging flock 89 low rn even type anything cohe...


In [52]:
data_column = "selftext_clean"
labels_idx2str = {0: "depression", 1: "anxiety"}
labels_str2idx = {'depression': 0, 'anxiety': 1}
# model_data["megatext_clean"].to_csv(data_path, header=None, index=None, sep='\t', mode='a')

## Fine Tuning

### Language Model

#### Prepare data and model

In [53]:
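def build_text_files(texts, dest_path):
    # Write all posts into one file, each followed by two spaces.
    # The context manager makes sure the file is flushed and closed.
    with open(dest_path, 'w') as f:
        f.write("".join(text + "  " for text in texts))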
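
`build_text_files` joins posts with plain spaces, so the language model has no marker for where one post ends and the next begins. A minimal sketch of an alternative, assuming you want GPT-2's end-of-text token between posts: the helper name `build_text_file_with_eos` and the sample posts are hypothetical, and it is worth verifying that your tokenizer maps the separator string to a single token id before relying on it.

```python
import os
import tempfile

# GPT-2's end-of-text marker. Using it as a post separator is an assumption,
# not something the notebook does.
EOS_TOKEN = "<|endoftext|>"


def build_text_file_with_eos(texts, dest_path, separator=EOS_TOKEN):
    """Hypothetical variant of build_text_files that joins posts with a separator."""
    with open(dest_path, "w", encoding="utf-8") as f:
        f.write(separator.join(texts))


# Illustrative posts, not taken from the dataset.
posts = ["first post", "second post", "third post"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")
    build_text_file_with_eos(posts, path)
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(content)
```

The separator only appears between posts, so the file neither starts nor ends with it.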

In [54]:
def load_dataset(train_path, test_path, tokenizer):
    # Tokenize each text file and cut it into fixed-length blocks of 128 tokens.
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128,
    )

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128,
    )

    # mlm=False selects causal (next-token) language modeling instead of masked LM.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator
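
To make `block_size=128` concrete: as far as I understand `TextDataset`, it tokenizes the whole file into one long sequence and slices it into consecutive, non-overlapping blocks, dropping the incomplete tail. The sketch below imitates that slicing in plain Python. `split_into_blocks` is a hypothetical helper and the integer list stands in for real token ids, so treat it as an illustration rather than the library's implementation.

```python
def split_into_blocks(token_ids, block_size=128):
    """Cut a token sequence into full, non-overlapping blocks (hypothetical helper).

    Tokens left over after the last full block are dropped.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Stand-in for a tokenized corpus of 300 tokens.
token_ids = list(range(300))
blocks = split_into_blocks(token_ids, block_size=128)

print("number of blocks:", len(blocks))
print("tokens dropped:", len(token_ids) - sum(len(b) for b in blocks))
```

One consequence is that the number of training examples depends on the total token count of the file, not on the number of posts, and a block can span several posts.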


In [55]:
EPOCHS = 10
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 64
reddit = {}

for label_str, label_int in labels_str2idx.items():
    data = model_data[model_data["is_anxiety"] == label_int]
    print(label_str, "total data =", len(data))
    
    train_text, test_text, train_labels, test_labels = train_test_split(
        data[data_column].tolist(), 
        data["is_anxiety"].tolist(), 
        test_size=.15
    )
    print("\tTrain data =", len(train_text))
    print("\tTest data =", len(test_text))
    
    train_path = '../data/{}.bert_lm.train_dataset.txt'.format(label_str)
    test_path = '../data/{}.bert_lm.test_dataset.txt'.format(label_str)
                                             
    build_text_files(train_text, train_path)
    build_text_files(test_text, test_path)

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  
    train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)                                             
    model = AutoModelWithLMHead.from_pretrained("distilgpt2")
    
    training_args = TrainingArguments(
        output_dir='../models/{}.bert_lm'.format(label_str),    #The output directory
        overwrite_output_dir=True,                              #overwrite the content of the output directory
        num_train_epochs=EPOCHS,                                # number of training epochs
        per_device_train_batch_size=TRAIN_BATCH_SIZE,           # batch size for training
        per_device_eval_batch_size=EVAL_BATCH_SIZE,             # batch size for evaluation
        eval_steps = 400,                                       # Number of update steps between two evaluations.
        save_steps=800,                                         # after # steps model is saved 
        warmup_steps=500,                                       # number of warmup steps for learning rate scheduler
        prediction_loss_only=True,
        logging_dir ='../logs/{}.bert_lm'.format(label_str),    # directory for storing logs
        logging_steps=200,                                      # log training metrics every logging_steps update steps
        load_best_model_at_end=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )
    
    trainer.train()  
    
    trainer.evaluate()
    
    model_path = "../models/{}.bert_lm".format(label_str)
    # Save the fine-tuned weights and tokenizer so the pipeline below can load them from model_path.
    trainer.save_model(model_path)
    tokenizer.save_pretrained(model_path)
    
    reddit[label_str] = pipeline('text-generation', model=model_path, tokenizer='distilgpt2',config={'max_length':800})

depression total data = 932
	Train data = 792
	Test data = 140





anxiety total data = 998
	Train data = 848
	Test data = 150
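
The training cell sets `warmup_steps=500`, `eval_steps=400` and `save_steps=800`, but whether those thresholds are ever reached depends on how many optimizer steps the run takes. A rough sketch for checking this, assuming a single device, no gradient accumulation, and a linear warmup followed by linear decay: the helper names and the block count of 1000 are hypothetical, since the notebook does not report the dataset size in blocks.

```python
import math


def total_training_steps(num_blocks, batch_size, epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    return math.ceil(num_blocks / batch_size) * epochs


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values from the training cell; NUM_BLOCKS is a made-up figure for illustration.
EPOCHS = 10
TRAIN_BATCH_SIZE = 32
WARMUP_STEPS = 500
NUM_BLOCKS = 1000

steps = total_training_steps(NUM_BLOCKS, TRAIN_BATCH_SIZE, EPOCHS)
final_multiplier = linear_warmup_decay(steps, WARMUP_STEPS, steps)

print("total steps:", steps)
print("warmup finishes before training ends:", WARMUP_STEPS < steps)
print("learning-rate multiplier at the last step:", final_multiplier)
```

With these illustrative numbers the run ends while the learning rate is still ramping up, so it can be worth comparing `len(train_dataset)` against the step-based settings before training.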





#### Examples

In [None]:
# # reload our model/tokenizer. Optional, only usable when in Python files instead of notebooks
# for label_str, label_int in labels_str2idx.items():
#     model_path = "../models/{}.bert_lm".format(label_str)

#     model = AutoModelWithLMHead.from_pretrained(model_path, num_labels=len(labels_str2idx)).to("cuda")
#     tokenizer = AutoTokenizer.from_pretrained(model_path)
#     reddit[label_str] = pipeline('text-generation', model=model_path, tokenizer='distilgpt2',config={'max_length':800})

In [57]:
reddit

{'depression': <transformers.pipelines.text_generation.TextGenerationPipeline at 0x7f9dccdb0e80>,
 'anxiety': <transformers.pipelines.text_generation.TextGenerationPipeline at 0x7f9dcc139af0>}

In [76]:
exp_count = 2
results = []

for label_str, label_int in labels_str2idx.items():
    data = model_data[model_data["is_anxiety"] == label_int][data_column]
    
    for i in range(exp_count):
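        # Pick a random post by position: randint is inclusive on both ends,
        # and the filtered index labels are not guaranteed to be contiguous.
        seed_text = data.iloc[randint(0, len(data) - 1)]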
        seed_text = seed_text[:50]
        
        generated_dep = reddit[labels_idx2str[0]](seed_text)[0]['generated_text']
        generated_anx = reddit[labels_idx2str[1]](seed_text)[0]['generated_text']
        
        model_results = {}
        model_results["seed_text"] = seed_text
        model_results["is_anxiety"] = label_int
        model_results["generated_depression"] = generated_dep
        model_results["generated_anxiety"] = generated_anx

        results.append(model_results) 

results = pd.DataFrame(results)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [77]:
pd.set_option("display.max_colwidth", 1000)
results

Unnamed: 0,seed_text,is_anxiety,generated_depression,generated_anxiety
0,talking family member face face friend gave bad ne,0,talking family member face face friend gave bad nek think bad one see world face bad get worse every time friend find fault find family friend give wrong friend feel bad day get worse want someone think really like world like see world like see past see world like,talking family member face face friend gave bad nexie irlfax prescription day ago ago bought caffeine drink caffeine med last night thought wa really bad ever tried caffeine drink really help feel good still got depressed since first thought wa really bad wa much way
1,stuck life want succeed taking much losing sad ene,0,stuck life want succeed taking much losing sad enecember last 2nd year lost love hope friend wa im going get better i dont remember last month wa looking im thinking maybe one day would start asking thing one day try get therapy everyday one night want,stuck life want succeed taking much losing sad eneapnity got good time anxious make anxiety worse day would get bad experience bad experience even worse go see doctor day think something anxiety make mistake stop going see doctor try trying try every day try find
2,turned kind long post tl dr recently depression an,1,turned kind long post tl dr recently depression an average day last 5 minute yesterday started dating would rather never get married got really lonely got divorced 2 year ago 4 year ago 2 year ago decided to return get married 2 year ago 1 year,turned kind long post tl dr recently depression an 11 year old went doctor wanted 2 month ago 3 month ago 4 month ago also went 3 month ago back last year ha diagnosed anxiety disorder later diagnosis anxiety disorder anxiety disorder schizophrenia diagnosis
3,issue forever whenever something bad happens make,1,issue forever whenever something bad happens make much worse time wa time every time time like want want feel like need something else could live without always felt like really bad time want become useless one time think like wrong something,issue forever whenever something bad happens make feel alone feel alone never want to deal honestly feel alone feel like getting together like never felt alone feel like really hard feel like need help someone know anyone else always scared really


In [81]:
source = 'everything looks like'
test1 = reddit["depression"](source)[0]['generated_text']
test2 = reddit["anxiety"](source)[0]['generated_text']
print("**depression**\n", test1)
print("\n**anxiety**\n", test2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


**depression**
 everything looks like it just like every time look back look back look back look back look back look back want look back look back look back look back look back see nothing always look back always look back remember know always look back remember remember remember remember always remember

**anxiety**
 everything looks like someone else may be one thing seem weird feel like something everyone seems like think different make someone feel like one get started want something make someone think else seem like someone else get started get started feel tired feeling like someone someone else think get busy


In [79]:
model_results = {}
model_results["seed_text"] = source
model_results["is_anxiety"] = -1
model_results["generated_depression"] = test1
model_results["generated_anxiety"] = test2

results.append(model_results, ignore_index=True) 

Unnamed: 0,seed_text,is_anxiety,generated_depression,generated_anxiety
0,talking family member face face friend gave bad ne,0,talking family member face face friend gave bad nek think bad one see world face bad get worse every time friend find fault find family friend give wrong friend feel bad day get worse want someone think really like world like see world like see past see world like,talking family member face face friend gave bad nexie irlfax prescription day ago ago bought caffeine drink caffeine med last night thought wa really bad ever tried caffeine drink really help feel good still got depressed since first thought wa really bad wa much way
1,stuck life want succeed taking much losing sad ene,0,stuck life want succeed taking much losing sad enecember last 2nd year lost love hope friend wa im going get better i dont remember last month wa looking im thinking maybe one day would start asking thing one day try get therapy everyday one night want,stuck life want succeed taking much losing sad eneapnity got good time anxious make anxiety worse day would get bad experience bad experience even worse go see doctor day think something anxiety make mistake stop going see doctor try trying try every day try find
2,turned kind long post tl dr recently depression an,1,turned kind long post tl dr recently depression an average day last 5 minute yesterday started dating would rather never get married got really lonely got divorced 2 year ago 4 year ago 2 year ago decided to return get married 2 year ago 1 year,turned kind long post tl dr recently depression an 11 year old went doctor wanted 2 month ago 3 month ago 4 month ago also went 3 month ago back last year ha diagnosed anxiety disorder later diagnosis anxiety disorder anxiety disorder schizophrenia diagnosis
3,issue forever whenever something bad happens make,1,issue forever whenever something bad happens make much worse time wa time every time time like want want feel like need something else could live without always felt like really bad time want become useless one time think like wrong something,issue forever whenever something bad happens make feel alone feel alone never want to deal honestly feel alone feel like getting together like never felt alone feel like really hard feel like need help someone know anyone else always scared really
4,everything looks like,-1,everything looks like life anymore life even worse end always end life life end really never really hope make want death even hate every time else hate never really want want help make happy life feel different get worse life feel better life feel worse lot better life feel like,everything looks like shit said know want make sure feel like like like wa feel like want feel like make weird feel like feel like make feel like wa feel like wa feel like like like feel like feel like wa feel like wa feel like feel like wa feel
