Start with some imports and read the data:

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

RootDir = "/kaggle/input/sentimental-analysis-for-tweets"

# Build the path to the CSV file and load it into a DataFrame
filename = os.path.join(RootDir, "sentiment_tweets3.csv")
df = pd.read_csv(filename)
print(df.shape)

We have 10,314 tweets in our dataset. Each one has a label: 0 = not depressed, 1 = depressed. Let's get the tweets (input) and labels (output), and print a sample of each type of tweet:

In [None]:
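# Column 1 holds the tweet text, column 2 the 0/1 label
tweets = df.values[:,1]
labels = df.values[:,2].astype(float)
# Print one sample tweet of each type, together with its label
print(tweets[40], labels[40])
print(tweets[8002], labels[8002])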
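
Before going further, it can be useful to check how many tweets fall into each class, since the balance between classes affects how an accuracy figure should be read. Below is a minimal sketch; the helper name `class_counts` is hypothetical and the small label array is made up for illustration (in the notebook, `labels` would be passed instead):

```python
import numpy as np

def class_counts(labels):
    """Return a dict mapping each label value to its number of occurrences."""
    values, counts = np.unique(labels, return_counts=True)
    return {float(v): int(c) for v, c in zip(values, counts)}

# Made-up labels for illustration only; in the notebook, pass `labels` instead.
example_labels = np.array([0, 0, 0, 1, 0, 1], dtype=float)
print(class_counts(example_labels))
```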

Next, we load a pretrained DistilBERT sentence-embedding model:

In [None]:
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
bert_model = SentenceTransformer('distilbert-base-nli-mean-tokens')

We can now run the BERT model on all tweets to get their embeddings:

In [None]:
embeddings = bert_model.encode(tweets, show_progress_bar=True)
print (embeddings.shape)
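
The `mean-tokens` part of the model name refers to mean pooling: the per-token vectors of a sentence are averaged into a single fixed-size vector. The sketch below illustrates that idea in plain NumPy on toy data. It is not the library's own code; the function name `mean_pool` and the toy arrays are assumptions made for illustration:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding (mask == 0)."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Toy input: 1 sentence, 3 token slots (the last one is padding), 2-dim vectors.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded slot is masked out, so only the two real token vectors contribute to the average.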

The embeddings will be our features to train a classifier, but first we need to split the data into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, 
                                          test_size=0.2, random_state=42)
print ("Training set shapes:", X_train.shape, y_train.shape)
print ("Test set shapes:", X_test.shape, y_test.shape)
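
Conceptually, a hold-out split just shuffles the sample indices and cuts them into two disjoint groups. The following NumPy sketch shows the idea; `holdout_split` is a hypothetical helper, and because it uses NumPy's own random generator it will not reproduce the exact shuffle that scikit-learn produces for the same seed:

```python
import numpy as np

def holdout_split(n_samples, test_size=0.2, seed=42):
    """Return (train_idx, test_idx): shuffled, disjoint index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(test_size * n_samples))
    # The first n_test shuffled indices form the test set, the rest the training set
    return order[n_test:], order[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))
```

Indexing both `embeddings` and `labels` with the same index arrays keeps each feature vector aligned with its label.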

Each tweet's embedding vector has 768 features. Now we build a simple classification model to work on them:

In [None]:
from tensorflow.keras import Sequential, layers

# One hidden layer on top of the 768-dimensional embeddings,
# followed by a single sigmoid unit for the binary label
classifier = Sequential()
classifier.add(layers.Dense(256, activation='relu', input_shape=(768,)))
classifier.add(layers.Dense(1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Note: the test set is also used as the validation set here
hist = classifier.fit(X_train, y_train, epochs=10, batch_size=16,
                      validation_data=(X_test, y_test))
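
To make the architecture concrete, here is a sketch of what this two-layer network computes for a batch of embeddings, written as plain NumPy matrix operations. The weights are random stand-ins rather than trained values, and the `forward` helper is hypothetical; the point is only to show the shapes involved and the number of trainable parameters:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Dense(256, relu) followed by Dense(1, sigmoid), as plain matrix ops."""
    hidden = np.maximum(0.0, x @ W1 + b1)    # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> probability of class 1

rng = np.random.default_rng(0)
# Random stand-in weights with the same shapes as the Keras layers above
W1, b1 = rng.normal(scale=0.05, size=(768, 256)), np.zeros(256)
W2, b2 = rng.normal(scale=0.05, size=(256, 1)), np.zeros(1)

x = rng.normal(size=(4, 768))                # 4 stand-in embedding vectors
probs = forward(x, W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)
```

The parameter count is 768 × 256 + 256 for the hidden layer plus 256 + 1 for the output layer, 197,121 in total.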

Plot the loss and accuracy:

In [None]:
from matplotlib import pyplot

pyplot.figure(figsize=(15,5))
pyplot.subplot(1, 2, 1)
pyplot.plot(hist.history['loss'], 'r', label='Training loss')
pyplot.plot(hist.history['val_loss'], 'g', label='Validation loss')
pyplot.legend()
pyplot.subplot(1, 2, 2)
pyplot.plot(hist.history['accuracy'], 'r', label='Training accuracy')
pyplot.plot(hist.history['val_accuracy'], 'g', label='Validation accuracy')
pyplot.legend()
pyplot.show()

It seems we reach a very good prediction accuracy (>98%) almost immediately, and additional epochs bring almost no further improvement.
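
Accuracy alone can hide how well the positive class is detected, so it may be worth looking at precision and recall as well. Below is a minimal sketch that derives both from predicted probabilities with a 0.5 threshold. The helper `precision_recall` is hypothetical and the arrays are made up for illustration, not results from this notebook:

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold=0.5):
    """Compute precision and recall for the positive class (label 1)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    y_true = np.asarray(y_true).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and probabilities for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.6, 0.9, 0.4, 0.8, 0.2]
precision, recall = precision_recall(y_true, y_prob)
print(precision, recall)
```

In the notebook, the same helper could be called as `precision_recall(y_test, classifier.predict(X_test).ravel())`.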