## BERT Vector Creation for Binary Depression Classifier

In this notebook, the entire corpus, consisting of depressive and neutral text (30,000 Reddit posts each), is converted to BERT vectors using spaCy's transformer module and the DistilBERT language model, which offers faster processing with minimal loss in semantic accuracy.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import spacy
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

from nltk.tokenize import RegexpTokenizer, word_tokenize
import re

import tensorflow as tf
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, GRU, Input, Flatten, LSTM, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Embedding, Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

In [9]:
corpus = pd.read_csv('../data/corpus.csv').drop(columns='Unnamed: 0')

In [10]:
corpus

,full_text,subreddit,class,neg,neu,pos,comp
0,why is it that the person who beats themself u...,CasualConversation,0,0.087,0.830,0.083,-0.0258
1,dealing with sadness hi i’m will and i’ve been...,CasualConversation,0,0.164,0.726,0.110,-0.8376
2,"my life has never been better, and i feel as t...",CasualConversation,0,0.033,0.863,0.104,0.9637
3,it‘s my cake day!!!! :o i love reddit and will...,CasualConversation,0,0.032,0.617,0.351,0.9429
4,can i have weed dealer i colorado about 15 min...,CasualConversation,0,0.000,1.000,0.000,0.0000
...,...,...,...,...,...,...,...
79995,eye discomfort and heaviness when particularly...,Anxiety,2,0.177,0.773,0.050,-0.7116
79996,"cbd gummies for anxiety hi, i recently bought ...",Anxiety,2,0.179,0.722,0.100,-0.3182
79997,dae have to open their eyes multiple times whi...,Anxiety,2,0.093,0.840,0.067,-0.6848
79998,"pandemic ruined my life, my work and dreams co...",Anxiety,2,0.047,0.793,0.159,0.9421


The full file also includes posts from r/Anxiety. They are not used in this analysis, so we exclude them, leaving only the depression and neutral posts.

In [11]:
binary_df = corpus[corpus['subreddit']!='Anxiety']

In [12]:
binary_df.head()

,full_text,subreddit,class,neg,neu,pos,comp
0,why is it that the person who beats themself u...,CasualConversation,0,0.087,0.83,0.083,-0.0258
1,dealing with sadness hi i’m will and i’ve been...,CasualConversation,0,0.164,0.726,0.11,-0.8376
2,"my life has never been better, and i feel as t...",CasualConversation,0,0.033,0.863,0.104,0.9637
3,it‘s my cake day!!!! :o i love reddit and will...,CasualConversation,0,0.032,0.617,0.351,0.9429
4,can i have weed dealer i colorado about 15 min...,CasualConversation,0,0.0,1.0,0.0,0.0


### Getting BERT Vectors for the Whole Corpus

In [24]:
X = binary_df['full_text']
y = binary_df['subreddit']

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [26]:
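import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class WordVectorTransformer(TransformerMixin, BaseEstimator):
    """Turn each document into a single vector using a spaCy pipeline."""
    def __init__(self, model="en_trf_distilbertbaseuncased_lg"):  # spaCy DistilBERT model
        self.model = model
    def fit(self, X, y=None):
        # Nothing to learn: the language model is pretrained
        return self
    def transform(self, X):
        # Note: the pipeline is reloaded on every call to transform
        nlp = spacy.load(self.model)
        # One row per document, one column per embedding dimension
        return np.concatenate([nlp(doc).vector.reshape(1, -1) for doc in X])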

In [27]:
bertvect = WordVectorTransformer()

In [29]:
X_train_bvect = bertvect.fit_transform(X_train)
X_test_bvect = bertvect.transform(X_test)

In [81]:
y_train_cat = y_train.map({'depression':1,'CasualConversation':0,'happy':0})
y_train_cat

35125    1
59835    1
24115    0
43143    1
52381    1
        ..
49340    1
54185    1
37984    1
8520     0
6759     0
Name: subreddit, Length: 48000, dtype: int64

In [82]:
y_test_cat = y_test.map({'depression':1,'CasualConversation':0,'happy':0})
y_test_cat

4067     0
13389    0
20021    0
26140    0
57845    1
        ..
58989    1
32142    1
40661    1
50425    1
3614     0
Name: subreddit, Length: 12000, dtype: int64

In [46]:
y_train_vect = tf.keras.utils.to_categorical(y_train_cat)
y_test_vect = tf.keras.utils.to_categorical(y_test_cat)

In [31]:
X_train_bvect.shape

(48000, 768)

In [32]:
X_test_bvect.shape

(12000, 768)

In [48]:
y_train_vect.shape

(48000, 2)

In [49]:
y_test_vect.shape

(12000, 2)
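
The one-hot arrays above have shape `(n, 2)`, whereas the test classifier later in this notebook ends in a single sigmoid unit and is fit on the integer labels (`y_train_cat`). As a minimal sketch of how the two encodings relate, the following uses toy labels (illustrative values, not the notebook's data) and plain NumPy rather than `to_categorical`:

```python
import numpy as np

# Toy binary labels standing in for y_train_cat (illustrative values only).
labels = np.array([1, 1, 0, 1, 0])

# One-hot encode: row i has a 1 in column labels[i].
# For integer labels in {0, 1} this gives the same (n, 2) layout as the arrays above.
one_hot = np.eye(2, dtype="float32")[labels]

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(recovered)
```

Either encoding carries the same information; which one a model needs depends on its output layer and loss.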

In [33]:
# helpers for saving and loading NumPy arrays as .npy files
from numpy import asarray
from numpy import save
from numpy import load

### Saving Vectors and Corresponding Classes as Numpy Arrays

In [35]:
save('X_train_bvect', X_train_bvect)

In [36]:
save('X_test_bvect', X_test_bvect)

In [50]:
save('y_train_vect', y_train_vect)

In [51]:
save('y_test_vect', y_test_vect)

In [54]:
peek = load('X_train_bvect.npy')

In [57]:
peek[:2]

array([[ -4.640353 ,  25.654243 ,  50.3219   , ..., -67.152176 ,
          7.7993217, -20.4163   ],
       [ 45.34771  ,  64.30396  ,  47.350266 , ..., -11.698645 ,
         26.163946 ,  -1.3377178]], dtype=float32)

In [58]:
X_train_bvect[:2]

array([[ -4.640353 ,  25.654243 ,  50.3219   , ..., -67.152176 ,
          7.7993217, -20.4163   ],
       [ 45.34771  ,  64.30396  ,  47.350266 , ..., -11.698645 ,
         26.163946 ,  -1.3377178]], dtype=float32)
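
The two outputs above are compared by eye. A programmatic round-trip check might look like the sketch below, which assumes a small random matrix in place of `X_train_bvect` and writes to a temporary directory (the file name `vectors_demo` is hypothetical):

```python
import os
import tempfile

import numpy as np

# Small random matrix standing in for X_train_bvect (hypothetical data).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 768)).astype("float32")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors_demo")
    np.save(path, vectors)             # np.save appends ".npy" when it is missing
    reloaded = np.load(path + ".npy")

# The .npy format stores dtype, shape and raw values, so the match should be exact.
round_trip_ok = np.array_equal(vectors, reloaded) and vectors.dtype == reloaded.dtype
print(round_trip_ok)
```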

In [59]:
X_train_reshape = X_train_bvect.reshape(-1,768,1)
X_test_reshape = X_test_bvect.reshape(-1,768,1)

In [64]:
X_train_reshape.shape

(48000, 768, 1)

In [65]:
X_test_reshape.shape

(12000, 768, 1)
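
Keras `Conv1D` layers take 3-D input of shape `(batch, steps, channels)`, so the reshape above presents each 768-dimensional vector as 768 steps with one channel. A small sketch with made-up values (not the notebook's vectors) shows that the reshape only adds a trailing axis and leaves the values untouched:

```python
import numpy as np

# Two made-up 768-dimensional vectors standing in for rows of X_train_bvect.
vectors = np.arange(2 * 768, dtype="float32").reshape(2, 768)

# Add a trailing channel axis: (n, 768) -> (n, 768, 1).
as_sequence = vectors.reshape(-1, 768, 1)

# np.expand_dims yields the same result and makes the intent explicit.
same_as_expand = np.array_equal(as_sequence, np.expand_dims(vectors, axis=-1))

print(as_sequence.shape)
print(same_as_expand)
```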

### Testing the Vectors Out on a Neural Net Classifier, One Epoch

In [87]:
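model_l = Sequential()

# 32 convolution filters with kernel size 7, followed by max pooling
model_l.add(Conv1D(32, 7, activation='relu'))
model_l.add(MaxPooling1D())
# Bidirectional LSTM with 24 units per direction
model_l.add(Bidirectional(LSTM(24)))
# Two L2-regularized dense layers, each followed by dropout
model_l.add(Dense(64, activation='relu', kernel_regularizer=l2(0.001)))
model_l.add(Dropout(0.5))
model_l.add(Dense(64, activation='relu', kernel_regularizer=l2(0.001)))
model_l.add(Dropout(0.5))
# Single sigmoid unit: predicted probability of class 1 (depression)
model_l.add(Dense(1, activation='sigmoid'))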

In [88]:
model_l.compile(optimizer='nadam', metrics=['accuracy'], loss='binary_crossentropy')

In [90]:
history_l = model_l.fit(X_train_reshape, y_train_cat.to_numpy(), validation_data=(X_test_reshape,y_test_cat.to_numpy()), epochs=1)

In [67]:
preds = model_l.predict(X_test_reshape[:10])
preds

array([[0.33556902],
       [0.12134761],
       [0.0249674 ],
       [0.23292479],
       [0.97206175],
       [0.97704536],
       [0.69252664],
       [0.66885364],
       [0.92831266],
       [0.23672733]], dtype=float32)
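
The sigmoid outputs above are probabilities for class 1, which the label mapping earlier in the notebook assigns to `depression`. To turn them into hard class labels, one option is a fixed threshold. The sketch below copies the ten values printed above and assumes a cut-off of 0.5, which the notebook does not specify:

```python
import numpy as np

# The ten predicted probabilities shown above.
preds = np.array([[0.33556902], [0.12134761], [0.0249674], [0.23292479],
                  [0.97206175], [0.97704536], [0.69252664], [0.66885364],
                  [0.92831266], [0.23672733]], dtype="float32")

threshold = 0.5  # assumed cut-off, not taken from the notebook

# Flatten (10, 1) -> (10,) and label a post 1 when its probability reaches the threshold.
pred_labels = (preds.ravel() >= threshold).astype(int)

print(pred_labels)  # [0 0 0 0 1 1 1 1 1 0]
```

A different threshold trades false positives against false negatives, so 0.5 is only a starting point.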

In [92]:
save('y_test_cat',y_test_cat.to_numpy())

In [93]:
save('y_train_cat',y_train_cat.to_numpy())

In [94]:
X_train

35125    even if i'm having a good time i feel bad toda...
59835    i need a better reason to live i fantasize abo...
24115    the kalimba, or african thumb piano, is a mode...
43143    i feel alone but not lonely i know this sounds...
52381    i can’t feel empathy for others anymore. somet...
                               ...                        
49340    i'm home alone, it's been a tough day and i ne...
54185    i have been feeling bad lately i feel like i'm...
37984    i'm both physically and mentally tired. i go t...
8520     i'm wondering how to accept the fact that i di...
6759     i just had to call 911 while leaving kroger be...
Name: full_text, Length: 48000, dtype: object

In [95]:
y_train

35125            depression
59835            depression
24115                 happy
43143            depression
52381            depression
                ...        
49340            depression
54185            depression
37984            depression
8520     CasualConversation
6759     CasualConversation
Name: subreddit, Length: 48000, dtype: object

In [96]:
X_train.to_csv('../data/X_train.csv')
X_test.to_csv('../data/X_test.csv')

y_train.to_csv('../data/y_train.csv')
y_test.to_csv('../data/y_test.csv')