# Analysis of outwardly depressive mood on social media

Use of Natural Language Processing on posts made on Twitter and Reddit to predict depressive thoughts.

---

This notebook is written to be run either locally or on Google Colab.

## Setup for local run

- Download the root file as is.
- Install packages

```bash
pip install -r requirements.txt
```


## Setup for Google Colab

- Download this notebook and upload it to Google Colab.
- Download the zip files (within /input) and upload them into the root directory of your Google Drive.

*You may download the [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset directly from the source and use it in place of the provided copy (within /input). No edits were made to the data.*

In [None]:
import pandas as pd
import numpy as np
import nltk
import pickle

from re import sub
from time import time

nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from gensim.models.word2vec import Word2Vec

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
from keras.layers import Embedding, Dropout, LSTM, Dense
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

from matplotlib.pyplot import show

from tqdm import tqdm
tqdm.pandas()

In [None]:
def runLocally():

    import shutil
    import os
    import tensorflow as tf

    print("Running locally...")

    path = './build'

    if not os.path.isdir(path):
        for x in os.listdir('./input'):
            shutil.unpack_archive(f'./input/{x}', path)
            print(f"Extracted {x} into '{path}' directory")
    else:
        print(f"{path} directory already exists. Skipping extracting of zip files.")

    gpuCount = len(tf.config.list_physical_devices('GPU'))
    
    if gpuCount > 0:

        print(f"{gpuCount} GPUs detected.")

        if tf.test.is_built_with_cuda():
            print("Tensorflow has CUDA support.")
        else:
            print("Tensorflow doesn't have CUDA support.")
    else:
        print("No GPUs detected on local device.")

    return path

def runOnColab():

    from google.colab import drive
    
    print("Running on Google Colab")
    
    drive.mount('/content/drive')

    !unzip "/content/drive/MyDrive/training.1600000.processed.noemoticon.csv.zip"
    !unzip "/content/drive/MyDrive/scrapped_posts.zip"

    return '/content'
    
directory = runLocally()
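
The cell above always calls `runLocally()`, so switching to Colab means editing that line by hand. As a minimal sketch, assuming the `google.colab` package can only be found inside a Colab runtime, the environment could be detected instead. `isColab` is a hypothetical helper and not part of the notebook.

```python
import importlib.util

def isColab():
    """Return True when the google.colab package can be found (assumed to mean Colab)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent 'google' package is missing or malformed
        return False

# Pick the setup routine by name; the notebook would then call
# runOnColab() or runLocally() accordingly.
runner = "runOnColab" if isColab() else "runLocally"
print(f"Selected setup function: {runner}")
```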

# Loading data

We are using two sources of data: pre-categorised Twitter posts from Kaggle, and posts scraped from specific subreddits.

1. Twitter posts from Kaggle

2. Posts scraped from subreddits: [/r/depression](https://www.reddit.com/r/depression/), [/r/suicidewatch](https://www.reddit.com/r/SuicideWatch/)

Drawing on two different websites gives the model a larger vocabulary, for more general NLP.

---

## Twitter data from [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140)

In [None]:
df1 = pd.read_csv(f'{directory}/training.1600000.processed.noemoticon.csv', encoding = 'latin', header=None)
df1.columns = ['sentiment', 'id', 'date', 'query', 'user_id', 'text']
df1.head()

---

## Reddit data from scraper

*The scraping code was run ahead of time. The raw data was exported to CSV files and zipped into /input/scrapped_posts.zip, as it took hours to scrape the data and there were limitations with the API used.*

*Note: the API endpoints may have changed since this notebook was last updated.*

In [None]:
df2 = pd.read_csv(f"{directory}/depression.csv")
df3 = pd.read_csv(f"{directory}/suicide_watch.csv")

df2.head()

---

# Merging Twitter and Reddit data

- Standardise Twitter columns
  - Drop excess columns
- Standardise Reddit columns to match the Twitter columns
  - Rename columns
  - Merge title and body (Reddit posts) into the body, since Twitter posts don't have titles
  - Drop excess columns
- Assign a sentiment score based on the subreddit the post was pulled from
- Merge into a single dataframe

*We assume posts from the same subreddit have a similar sentiment score; posts from [/r/depression](https://www.reddit.com/r/depression/) and [/r/suicidewatch](https://www.reddit.com/r/SuicideWatch/) are treated as negative.*

*The size of the Twitter data from [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) far exceeds that of the Reddit data, so this assumption does not affect the data much.*
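
To check this claim once the data is loaded, the share of rows contributed by each source can be computed. The following is a minimal sketch: the Sentiment140 count is taken from the file name (1,600,000 rows), while the Reddit counts are placeholders and not the real sizes of the scraped files. `sourceShares` is a hypothetical helper.

```python
def sourceShares(rowCounts):
    """Map each source name to its fraction of the total number of rows."""
    total = sum(rowCounts.values())
    return {name: count / total for name, count in rowCounts.items()}

# Placeholder Reddit counts for illustration only; in the notebook use
# len(df1), len(df2) and len(df3) instead.
rowCounts = {"twitter": 1_600_000, "depression": 20_000, "suicide_watch": 20_000}

for name, share in sourceShares(rowCounts).items():
    print(f"{name}: {share:.2%}")
```

Since every Reddit row is labelled negative, the same shares also show how far the merge shifts the class balance.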

In [None]:
df1 = df1.drop(columns=['query'])
df1.head()

In [None]:
def standardiseRedditDF(dff, sentimentValue=None):

    # Rename on a copy so the caller's DataFrame is not mutated
    dff = dff.rename(columns={'Author': 'user_id', 'Post_iD': 'id', 'Publish_date':'date', 'Body':'text'})
    # Twitter posts have no title, so fold the Reddit title into the text
    dff['text'] = dff['Title'].str.cat(dff['text'], sep=" - ")
    dff = dff.drop(columns=['Score', 'Total_no_of_comments', 'Link', 'Subreddit', 'Title'])

    if sentimentValue is not None:
        dff['sentiment'] = sentimentValue

    return dff

df2 = standardiseRedditDF(df2, 0)
df3 = standardiseRedditDF(df3, 0)

df2.head()

In [None]:
df = pd.concat([df1, df2, df3])

df.sample(10)

# Prepping data

## Removing invalid posts

The Reddit data contains posts by deleted users and posts that were removed. We need to drop these.

In [None]:
%%time

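def removeInvalidRedditPost(df):
    # Title and body were merged with " - " earlier, so a removed body
    # appears as a suffix ("<title> - [removed]") rather than the whole text
    df = df[~df.text.astype(str).str.endswith("[removed]")]
    df = df[df.user_id != "[deleted]"]

    return df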

df = removeInvalidRedditPost(df)

## Mapping sentiments

* 0 - negative
* 2 - neutral
* 4 - positive



In [None]:
def sentimentMapping(label):
    decodeMap = {0: "Negative", 2: "Neutral", 4: "Positive"}
    return decodeMap[int(label)]

df.sentiment = df.sentiment.progress_apply(lambda x: sentimentMapping(x))

## Cleaning text

1. Lower casing
2. Replacing URLs
3. Replacing username references 
4. Removing non-alphanumerics
5. Removing stopwords

In [None]:
# Build the stopword set and the patterns once, rather than on every call
stopWords    = set(stopwords.words("english"))
urlPattern   = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
userPattern  = r"@[^\s]+"
alphaPattern = r"[^a-zA-Z0-9]"

def preprocess(text):

  text = str(text).lower()
  text = sub(urlPattern, ' URL', text)    # replace links with a placeholder
  text = sub(userPattern, ' USER', text)  # replace @mentions with a placeholder
  text = sub(alphaPattern, ' ', text)     # drop everything that is not alphanumeric

  # split() discards surrounding whitespace, so no strip() is needed
  return " ".join(token for token in text.split() if token not in stopWords)

df.text = df.text.progress_apply(preprocess)
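
To make the five cleaning steps concrete, here is a small worked example on a made-up post. It is a sketch only: a tiny hand-picked stopword set stands in for NLTK's English list so that the snippet runs without any download, and `cleanText` is a hypothetical mirror of `preprocess`.

```python
from re import sub

# Tiny stand-in for stopwords.words("english"), for illustration only
STOP_WORDS = {"i", "am", "so", "of", "this", "the", "a"}

def cleanText(text):
    text = str(text).lower()                                    # 1. lower casing
    text = sub(r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)",
               " URL", text)                                    # 2. replace URLs
    text = sub(r"@[^\s]+", " USER", text)                       # 3. replace @mentions
    text = sub(r"[^a-zA-Z0-9]", " ", text)                      # 4. drop non-alphanumerics
    return " ".join(t for t in text.split()
                    if t not in STOP_WORDS)                     # 5. drop stopwords

cleaned = cleanText("@bob I am SO tired of this... see https://example.com/x")
print(cleaned)  # USER tired see URL
```

Note that the placeholders `URL` and `USER` stay upper case because they are inserted after the lower-casing step.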

## Cleaned and merged data

In [None]:
df.sample(10)

# Creating the model

## Train and test splitting

80/20 split

In [None]:
trainData, testData = train_test_split(df, train_size=0.8)

print("Train size:", len(trainData))
print("Test size:", len(testData))

## Tokenisation

In [None]:
tokeniser = Tokenizer()
tokeniser.fit_on_texts(trainData.text)

vocabSize = len(tokeniser.word_index) + 1
print(f'Vocab size: {vocabSize}')
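
The `+ 1` accounts for index 0, which the Keras `Tokenizer` never assigns to a word, leaving it free to act as the padding value. The sketch below mimics that numbering in plain Python; `buildWordIndex` is a hypothetical helper and ignores the lower-casing and punctuation filtering that the real `Tokenizer` performs.

```python
from collections import Counter

def buildWordIndex(texts):
    """Number words from 1 upwards, most frequent first; 0 is left unused."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

wordIndex = buildWordIndex(["feel sad today", "feel fine"])
vocabSize = len(wordIndex) + 1  # one extra row for the reserved index 0

print(wordIndex)
print(f"Vocab size: {vocabSize}")
```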

## Encoder

In [None]:
%%time

encoder = LabelEncoder()
encoder.fit(trainData.sentiment.to_list())

## Reshaping train and test variables

In [None]:
%%time

xTrain = pad_sequences(tokeniser.texts_to_sequences(trainData.text), maxlen = 300)
xTest = pad_sequences(tokeniser.texts_to_sequences(testData.text), maxlen = 300)

yTrain = encoder.transform(trainData.sentiment.to_list()).reshape(-1,1)
yTest = encoder.transform(testData.sentiment.to_list()).reshape(-1,1)
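
With its default arguments, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front, so every row ends up exactly `maxlen` long. A plain-Python sketch of that default behaviour follows; `padPre` is a hypothetical helper, not the Keras implementation.

```python
def padPre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    sequence = list(sequence)[-maxlen:]          # truncate from the front
    return [value] * (maxlen - len(sequence)) + sequence

print(padPre([5, 2], 4))           # [0, 0, 5, 2]
print(padPre([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
```

Posts longer than 300 tokens therefore lose their opening words rather than their closing ones.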

## Model build

In [None]:
%%time

w2vModel = Word2Vec(vector_size=300, window=7, min_count=10, workers=6)

_words = [_text.split() for _text in trainData.text]

w2vModel.build_vocab(_words)
w2vModel.train(_words, total_examples=len(_words), epochs=8)

embMatrix = np.zeros((vocabSize, 300))
for word, i in tokeniser.word_index.items():
  if word in w2vModel.wv:
    embMatrix[i] = w2vModel.wv[word]

embLayer = Embedding(vocabSize, 300, weights=[embMatrix], input_length=300, trainable=False)
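
Words that Word2Vec dropped (here, those seen fewer than `min_count=10` times) keep an all-zero row in `embMatrix`. A quick way to see how much of the vocabulary that affects is to measure coverage. This is a sketch on a toy vocabulary; `embeddingCoverage` is a hypothetical helper.

```python
def embeddingCoverage(wordIndex, knownWords):
    """Fraction of tokeniser words that have a pretrained vector."""
    if not wordIndex:
        return 0.0
    covered = sum(1 for word in wordIndex if word in knownWords)
    return covered / len(wordIndex)

# Toy data; in the notebook this could be called as
# embeddingCoverage(tokeniser.word_index, w2vModel.wv.key_to_index)
wordIndex = {"feel": 1, "sad": 2, "today": 3, "zzzz": 4}
knownWords = {"feel", "sad", "today"}

coverage = embeddingCoverage(wordIndex, knownWords)
print(f"{coverage:.0%} of the vocabulary has a Word2Vec vector")
```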

In [None]:
model = Sequential()
model.add(embLayer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.summary()

In [None]:
%%time

model.compile(optimizer='adam', loss='binary_crossentropy')
callbacks = [ ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0), 
              EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5)]

In [None]:
%%time

history = model.fit(xTrain, yTrain,
                    batch_size=1024,
                    epochs=8,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)
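
With `patience=5` and only 8 epochs, early stopping can fire no earlier than epoch 6. The sketch below is a simplified plain-Python version of the patience rule, not Keras's implementation; `stopEpoch` is a hypothetical helper and the loss values are made up.

```python
def stopEpoch(valLosses, patience=5, minDelta=1e-4):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(valLosses, start=1):
        if loss < best - minDelta:   # improved by more than minDelta
            best = loss
            wait = 0
        else:                        # no meaningful improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Validation loss bottoms out at epoch 2, then drifts upwards
print(stopEpoch([0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46]))
# Validation loss keeps improving, so all epochs run
print(stopEpoch([0.50, 0.45, 0.40, 0.39, 0.38, 0.37, 0.36, 0.35]))
```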

# Saving and loading model

These functions save a trained model together with its tokeniser, and load a pre-trained model from model.h5. To use, place the 'model.h5' file within


In [None]:
def saveModel():
    model.save("model.h5")
    # Use a context manager so the file handle is closed after writing
    with open("tokenizer.pkl", "wb") as f:
        pickle.dump(tokeniser, f, protocol=0)

def loadModel(pathToModel, pathToPKL):
    with open(pathToPKL, 'rb') as f:
        tokeniser = pickle.load(f)
    return load_model(pathToModel), tokeniser

# Example usage
# model, tokeniser = loadModel("./model.h5", "./tokenizer.pkl")
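
The tokeniser has to be saved alongside the model because the word-to-index mapping must be identical at prediction time. As a minimal sketch of the pickle round trip, the snippet below saves and restores a plain dictionary that stands in for the fitted tokeniser; `saveObject` and `loadObject` are hypothetical helpers.

```python
import os
import pickle
import tempfile

def saveObject(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def loadObject(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for the fitted tokeniser, for illustration only
standIn = {"word_index": {"feel": 1, "sad": 2}}

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, "tokenizer.pkl")
    saveObject(standIn, path)
    restored = loadObject(path)

print(restored == standIn)  # True
```

Unpickling can execute arbitrary code, so only load `.pkl` files from a source you trust.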

# Analysis

In [None]:
def sentimentFromScore(score):
  label = 'Neutral'
  if score <= 0.35:
      label = 'Negative'
  elif score >= 0.65:
      label = 'Positive'

  return label

def predict(text):
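  start_at = time()
  # Clean the input the same way as the training data, then tokenise and pad
  sequences = pad_sequences(tokeniser.texts_to_sequences([preprocess(text)]), maxlen=300)
  # model.predict returns an array of shape (1, 1); take the scalar score
  score = float(model.predict(sequences)[0][0])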

  return {"label": sentimentFromScore(score), 
          "score": score,
          "elapsedTime": time() - start_at}

prediction = predict("I'm sick of this game")

print(f"Label: {prediction['label']}")
print(f"Score: {prediction['score']}")
print(f"Time elapsed: {prediction['elapsedTime']}")

## Preparing data from uncategorised subreddits

In [None]:
def hi():
    df4 = pd.read_csv(f"{directory}/teenagers.csv")
    df4 = standardiseRedditDF(df4)
    df4 = removeInvalidRedditPost(df4)
    # Use item assignment: attribute assignment does not create a new column
    df4['sentiment'] = df4.text.progress_apply(lambda x: predict(str(x))['label'])

    df4.hist(column="date")
    show()
    