# Analysis of outwardly depressive mood on social media

Use of Natural Language Processing on posts made on Twitter and Reddit to predict depressive thoughts.

---

This notebook can be run either locally or on Google Colab.

## Setup for local run

- Download the root folder as is.
- Install the required packages.

In [None]:
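# Uncomment to install the dependencies. pickle ships with Python, so it needs no install.
# ! pip install pandas
# ! pip install numpy
# ! pip install nltk
# ! pip install scikit-learn
# ! pip install gensim
# ! pip install matplotlib
# ! pip install tensorflow
# ! pip install keras
# ! pip install tqdm
# ! pip install dask
# ! pip install seaborn
# ! pip install wordcloud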

## Setup for Google Colab

- Download this notebook and upload it to Google Colab.
- Download the zip files (within /input) and upload them to the root directory of your Google Drive.

*You may download the [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset directly from the source and replace the provided copy (within /input). No edits were made to the data.*

In [None]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt

from re import sub
from time import time

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from gensim.models.word2vec import Word2Vec

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
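from keras.layers import Embedding, Dropout, Dense
try:
    from keras.layers import CuDNNLSTM
except ImportError:
    # Newer Keras/TensorFlow releases no longer export CuDNNLSTM; the regular LSTM
    # layer picks the cuDNN kernel automatically on a GPU when left at its defaults
    from keras.layers import LSTM as CuDNNLSTM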
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

from tqdm import tqdm
from tqdm.keras import TqdmCallback
tqdm.pandas()

from dask.diagnostics import ProgressBar
import dask.dataframe as dd
ProgressBar().register()

from wordcloud import WordCloud, STOPWORDS

In [None]:
def runLocally():

    import shutil
    import os
    import tensorflow as tf

    print("Running locally...")

    path = './build'

    if not os.path.isdir(path):
        for x in os.listdir('./input'):
            shutil.unpack_archive(f'./input/{x}', path)
            print(f"Extracted {x} into '{path}' directory")
    else:
        print(f"{path} directory already exists. Skipping extracting of zip files.")

    gpuCount = len(tf.config.list_physical_devices('GPU'))
    
    if gpuCount > 0:

        print(f"{gpuCount} GPUs detected.")

        if tf.test.is_built_with_cuda():
            print("Tensorflow has CUDA support.")
        else:
            print("Tensorflow doesn't have CUDA support.")
    else:
        print("No GPUs detected on local device.")

    return path

def runOnColab():

    from google.colab import drive
    
    print("Running on Google Colab")
    
    drive.mount('/content/drive')

    !unzip "/content/drive/MyDrive/training.1600000.processed.noemoticon.csv.zip"
    !unzip "/content/drive/MyDrive/scrapped_posts.zip"

    return '/content'
    
directory = runLocally()

# Loading data

We use two sources of data: pre-categorised Twitter posts from Kaggle, and Reddit posts scraped from specific subreddits.

1. Twitter posts from Kaggle

2. Posts scraped from the subreddits [/r/depression](https://www.reddit.com/r/depression/) and [/r/suicidewatch](https://www.reddit.com/r/SuicideWatch/)

Drawing on two different websites gives a larger vocabulary, which makes the NLP model more general.

---

## Twitter data from [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140)

In [None]:
df1 = pd.read_csv(f'{directory}/training.1600000.processed.noemoticon.csv', encoding = 'latin', header=None)
df1.columns = ['sentiment', 'id', 'date', 'query', 'user_id', 'text']
df1.head()

---

## Reddit data from scraper

*Below is the scraping code we ran ahead of time. The raw data was exported to CSV files and zipped into /input/scrapped_posts.zip, because scraping took hours and the API used had limitations.*

*Note: the API endpoints may have changed since this code was last updated.*

In [None]:
def redditScrapper():
    # psaw is a Python wrapper around the Pushshift API
    from psaw import PushshiftAPI
    api = PushshiftAPI()

    # Fetch /r/depression submissions posted between two Unix timestamps
    api_request_generator = api.search_submissions(
        after=1645459200,
        before=1647878400,
        filter=['id', 'title', 'author', 'selftext', 'score', 'num_comments', 'created', 'subreddit', 'full_link'],
        subreddit='depression')

    finalframe = pd.DataFrame([submission.d_ for submission in api_request_generator])

    # Convert the Unix timestamp to a datetime, then rename the columns
    finalframe['created'] = pd.to_datetime(finalframe['created'], unit='s')
    finalframe.columns = ["Post_iD", "Title", "Author", "Body", "Score", "Total_no_of_comments", "Publish_date", "Subreddit", "Link"]

    return finalframe

In [None]:
df2 = pd.read_csv(f"{directory}/depression.csv")
df3 = pd.read_csv(f"{directory}/suicide_watch.csv")

df2.head()

---

# Merging Twitter and Reddit data

- Standardise Twitter columns
  - Drop excess columns
  - Follow the Reddit date format
- Standardise Reddit columns to Twitter columns
  - Remove invalid posts and users ~ users and posts may be deleted or removed
  - Rename columns
  - Merge title and body (Reddit posts) into body ~ Twitter posts don't have titles
  - Drop excess columns
- Assign a sentiment score based on the subreddit the post was pulled from
- Merge into single dataframe

*We assume that posts from the same subreddit have a similar sentiment score; posts from [/r/depression](https://www.reddit.com/r/depression/) and [/r/suicidewatch](https://www.reddit.com/r/SuicideWatch/) are treated as negative.*

*The Twitter data from [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) is far larger than the Reddit data, so this assumption does not affect the merged data much.*

In [None]:
def convert(date):

    monthdict={
        "Jan":"01", 
        "Feb":"02", 
        "Mar": "03", 
        "Apr":"04", 
        "May":"05", 
        "Jun":"06", 
        "Jul":"07", 
        "Aug":"08", 
        "Sep":"09", 
        "Oct":"10", 
        "Nov": "11", 
        "Dec":"12"}

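    # Twitter dates look like "Mon Apr 06 22:19:45 PDT 2009"
    year = date[24:28]
    day = date[8:10]
    month = monthdict[date[4:7]]
    timeOfDay = date[11:19]  # HH:MM:SS, without the trailing space

    # Build a "YYYY-MM-DD HH:MM:SS" string
    return year + "-" + month + "-" + day + " " + timeOfDay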

def standardiseTwitterTweets(df):
    
    df = df.drop(columns=['query'])
    df["date"]=df["date"].progress_map(convert)

    return df

df1 = standardiseTwitterTweets(df1)

df1.head()
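
The fixed-offset slicing above can also be expressed with the standard library. The following is a minimal sketch, not the method used in this notebook: `convertWithStrptime` is a hypothetical helper, and it assumes the dates follow the `Mon Apr 06 22:19:45 PDT 2009` layout and that the interpreter runs under an English locale (the `%a` and `%b` codes are locale dependent). The timezone token is dropped before parsing, because `%Z` does not reliably recognise abbreviations such as `PDT`.

```python
from datetime import datetime

def convertWithStrptime(date):
    # Split into tokens: weekday, month, day, time, timezone, year
    parts = date.split()
    # Drop the timezone token (index 4) before parsing
    cleaned = " ".join(parts[:4] + parts[5:])
    parsed = datetime.strptime(cleaned, "%a %b %d %H:%M:%S %Y")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")

print(convertWithStrptime("Mon Apr 06 22:19:45 PDT 2009"))
```

Compared with slicing, this fails loudly on a malformed date instead of silently producing a wrong string.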

In [None]:
def removeInvalidRedditPost(df):
    df = df[df.Body.notna()]
    df = df[df.Author != "[removed]"]
    df = df[df.Body != "[removed]"]
    df = df[df.Author != "[deleted]"]
    df = df[df.Body != "[deleted]"]

    return df

def standardiseRedditDF(dff, sentimentValue=None):

    dff = removeInvalidRedditPost(dff)
    # Reassign instead of renaming in place, since dff is a filtered slice of the original frame
    dff = dff.rename(columns={'Author': 'user_id', 'Post_iD': 'id', 'Publish_date':'date', 'Body':'text'})
    dff['text'] = dff['Title'].str.cat(dff['text'], sep=" ")
    dff = dff.drop(columns=['Score', 'Total_no_of_comments', 'Link', 'Subreddit', 'Title'])
    
    if sentimentValue is not None:
        dff['sentiment'] = sentimentValue

    return dff

df2 = standardiseRedditDF(df2, 0)
df3 = standardiseRedditDF(df3, 0)

df2.head()

In [None]:
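# ignore_index avoids duplicate row labels, since each source frame starts its index at 0
df = pd.concat([df1, df2, df3], ignore_index=True)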

df.sample(10)
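
One way to sanity-check the earlier assumption that the Reddit posts barely shift the label balance is to compare label shares before and after merging. The sketch below uses hypothetical toy labels, not the real dataset, and `labelShares` is an illustrative helper that does not appear elsewhere in this notebook.

```python
from collections import Counter

def labelShares(labels):
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels, not the real data: 0 = negative, 4 = positive
twitterLabels = [0] * 5 + [4] * 5
redditLabels = [0] * 2

print(labelShares(twitterLabels))
print(labelShares(twitterLabels + redditLabels))
```

On the merged frame the same check could be run as `labelShares(df.sentiment)`, or with `df.sentiment.value_counts(normalize=True)`.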

# Prepping data

## Mapping sentiments

* 0 - negative
* 2 - neutral
* 4 - positive



In [None]:
def sentimentMapping(label):
    decodeMap = {0: "Negative", 2: "Neutral", 4: "Positive"}
    return decodeMap[int(label)]

df.sentiment = df.sentiment.progress_map(sentimentMapping)

## Cleaning text

1. Lower casing
2. Replacing URLs
3. Replacing username references 
4. Removing non-alphanumerics
5. Removing stopwords

In [None]:
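# Build the stopword set once; a set lookup is much faster than scanning a list for every token
stopWords = set(stopwords.words("english"))

def preprocess(text):

    text = str(text).lower()

    # Raw strings keep the regex escapes intact
    urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
    userPattern       = r"@[^\s]+"
    alphaPattern      = r"[^a-zA-Z0-9]"

    # Replace URLs and user mentions with placeholders, then blank out remaining symbols
    text = sub(urlPattern, ' URL', text).strip()
    text = sub(userPattern, ' USER', text).strip()
    text = sub(alphaPattern, ' ', text).strip()

    tokens = [token for token in text.split() if token not in stopWords]

    return " ".join(tokens)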

def parellelPreProcess(df):
    ddf = dd.from_pandas(df, npartitions=4)
    # meta names the output column and its dtype so dask does not have to infer them
    ddf["text"] = ddf["text"].map(preprocess, meta=('text', str))
    return ddf.compute()

df = parellelPreProcess(df)
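
To make the cleaning steps concrete, here is a small self-contained sketch that applies the same regular expressions to one sample string. It is illustrative only: `cleanDemo`, the sample text, and the tiny `demoStopWords` set are assumptions made for this example, standing in for `preprocess` and the full NLTK stopword list.

```python
import re

urlPattern = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
userPattern = r"@[^\s]+"
alphaPattern = r"[^a-zA-Z0-9]"

# Hypothetical, much smaller stand-in for the NLTK English stopword list
demoStopWords = {"i", "am", "so", "of", "this", "the", "a"}

def cleanDemo(text):
    text = str(text).lower()                          # 1. lower casing
    text = re.sub(urlPattern, " URL", text).strip()   # 2. replace URLs
    text = re.sub(userPattern, " USER", text).strip() # 3. replace username references
    text = re.sub(alphaPattern, " ", text).strip()    # 4. remove non-alphanumerics
    # 5. remove stopwords
    return " ".join(t for t in text.split() if t not in demoStopWords)

sample = "@Someone I am SO tired of this... see https://example.com/post"
print(cleanDemo(sample))  # prints "USER tired see URL"
```

The placeholders stay upper case because they are inserted after the text has been lower-cased.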

## Cleaned and merged data

In [None]:
df.sample(10)

# Initial data analysis

In [None]:
def wordcloudPosNeg(df):
    df_neg=df.loc[df['sentiment']=="Negative"]
    df_pos=df.loc[df['sentiment']=="Positive"]

    stopwords = set(STOPWORDS)
    stopwords.update(["a", "the", "I", "of", "then", "dont", "don", "going", "don't", "make", "m", "way", "day", "one", "s", "t", "ve", "USER", "URL", "amp"])
    
    neg_text = " ".join(str(review) for review in df_neg.text)
    wordcloud_neg = WordCloud(stopwords=stopwords,max_font_size=300, max_words=100, width = 1200, height = 1200, scale = 1, background_color="white").generate(neg_text)

    pos_text = " ".join(str(review) for review in df_pos.text)
    wordcloud_pos = WordCloud(stopwords=stopwords,max_font_size=300, max_words=100, width = 1200, height = 1200, scale = 1, background_color="white").generate(pos_text)

    fig, axs = plt.subplots(1, 2, figsize=(20, 8))
    fig.tight_layout()

    axs[0].set_title("Wordcloud negative sentiment")
    axs[0].axis('off')
    axs[1].set_title("Wordcloud positive sentiment")
    axs[1].axis('off')

    axs[0].imshow(wordcloud_neg)
    axs[1].imshow(wordcloud_pos)

    plt.show()

wordcloudPosNeg(df)

# Creating the model

## Generate model

- Train and test splitting ~ 80/20 split
- Tokenisation
- Encoder
- Building Embedding layer

In [None]:
def generateModel(df):
    
    # Splitting train and test
    trainData, testData = train_test_split(df, train_size=0.8)

    # Tokenisation
    tokeniser = Tokenizer()
    tokeniser.fit_on_texts(trainData.text)

    vocabSize = len(tokeniser.word_index) + 1

    # Encoder
    encoder = LabelEncoder()
    encoder.fit(trainData.sentiment.to_list())

    # X and Y train and test
    xTrain = pad_sequences(tokeniser.texts_to_sequences(trainData.text), maxlen = 300)
    xTest = pad_sequences(tokeniser.texts_to_sequences(testData.text), maxlen = 300)

    yTrain = encoder.transform(trainData.sentiment.to_list()).reshape(-1,1)
    yTest = encoder.transform(testData.sentiment.to_list()).reshape(-1,1)

    # Building Word2Vec and Embedding layer
    w2vModel = Word2Vec(vector_size=300, window=7, min_count=10, workers=6)

    _words = [_text.split() for _text in tqdm(trainData.text)]

    w2vModel.build_vocab(_words)
    w2vModel.train(tqdm(_words), 
                    total_examples=len(_words), 
                    epochs=8,
                )

    embMatrix = np.zeros((vocabSize, 300))
    for word, i in tqdm(tokeniser.word_index.items()):
        if word in w2vModel.wv:
            embMatrix[i] = w2vModel.wv[word]

    embLayer = Embedding(vocabSize, 300, weights=[embMatrix], input_length=300, trainable=False)

    # Model building
    model = Sequential()
    model.add(embLayer)
    model.add(Dropout(0.5))
    model.add(CuDNNLSTM(100))
    model.add(Dense(1, activation='sigmoid'))

    return xTrain, yTrain, xTest, yTest, model, tokeniser

xTrain, yTrain, xTest, yTest, model, tokeniser = generateModel(df)
model.summary()
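
The tokenisation and padding step can be pictured with plain Python. This is a simplified sketch under stated assumptions, not the Keras implementation: `padLeft`, `toSequence` and `wordIndex` are hypothetical names, and the word index is a toy stand-in for `tokeniser.word_index`. It mirrors the default `pad_sequences` behaviour, which pads with zeros at the front and truncates from the front.

```python
def padLeft(sequence, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for a single sequence."""
    trimmed = sequence[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

# Hypothetical word index, standing in for tokeniser.word_index
wordIndex = {"tired": 1, "sad": 2, "happy": 3}

def toSequence(text):
    # Words missing from the index are skipped, as Tokenizer does when no oov_token is set
    return [wordIndex[word] for word in text.split() if word in wordIndex]

padded = padLeft(toSequence("so tired sad"), maxlen=5)
print(padded)  # prints [0, 0, 0, 1, 2]
```

Index 0 is reserved for padding, which is why the vocabulary size is `len(tokeniser.word_index) + 1` and why row 0 of the embedding matrix stays all zeros.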

## Model compilation and training

In [None]:
%%time

model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(xTrain, yTrain,
          batch_size=1024,
          epochs=8,
          validation_split=0.1,
          verbose=0,
          callbacks=[ 
            ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0), 
            EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5),
            TqdmCallback(verbose=2)
          ])

# Saving and loading model

These functions save a trained model together with its tokeniser, and load a pre-trained model from model.h5. To use, place the 'model.h5' file within


In [None]:
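def saveModel(model, tokeniser):
    try:
        model.save("model.h5")
        # A context manager ensures the file handle is closed even if pickling fails
        with open("tokenizer.pkl", "wb") as f:
            pickle.dump(tokeniser, f, protocol=0)
        return "Successfully saved"
    except Exception as e:
        return f"Failed save: {e}"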

def loadModel(pathToModel, pathToPKL):
    with open(pathToPKL, 'rb') as f:
        tokeniser = pickle.load(f)
    return load_model(pathToModel), tokeniser

# Example usage
# model, tokeniser = loadModel("./model.h5", "./tokenizer.pkl")
saveModel(model, tokeniser)
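
The tokeniser half of the save and load cycle is an ordinary pickle round trip. The sketch below shows the idea with a plain dictionary standing in for the fitted tokeniser; `roundTrip` and `standIn` are hypothetical names used only for this illustration, and a temporary directory is used so no file is left behind.

```python
import os
import pickle
import tempfile

def roundTrip(obj):
    """Pickle obj to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tokenizer.pkl")
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        with open(path, "rb") as f:
            return pickle.load(f)

# Hypothetical stand-in for the fitted tokeniser
standIn = {"word_index": {"tired": 1, "sad": 2}}
restored = roundTrip(standIn)
print(restored == standIn)
```

The restored object is an equal copy rather than the same object, so a loaded tokeniser maps words to the same indices the model was trained with.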

# Analysis

In [None]:
def sentimentFromScore(score):
    score = float(score)
    label = 'Neutral'
    if score <= 0.35:
        label = 'Negative'
    elif score >= 0.65:
        label = 'Positive'

    return label

def predict(text, wantsTime=False):
    if wantsTime:
        start_at = time()

    # Apply the same cleaning that was used on the training text
    text = preprocess(str(text))
    # model.predict returns an array of shape (1, 1); take the single scalar out of it
    score = float(model.predict(pad_sequences(tokeniser.texts_to_sequences([text]), maxlen=300))[0][0])

    result = {"label": sentimentFromScore(score),
              "score": score}

    if wantsTime:
        result["elapsedTime"] = time() - start_at

    return result

# Example usage
prediction = predict("I'm sick of this game", True)
print(f"Label: {prediction['label']}")
print(f"Score: {prediction['score']}")
print(f"Time elapsed: {prediction['elapsedTime']}s")