# Project Cerina!
A first approach to a machine learning model that detects signs of self-harming tendencies in a person's texts. The second approach uses advanced fine-tuning with OpenAI API models (see the other folder for a detailed walkthrough). Hope you like it! The model trained below powers a small application:


`Web App Link:`   
https://ubaidkhan08-mental-health-application-ml-st-appstreamlit-efzurp.streamlit.app/

`GitHub Repo:`    
https://github.com/ubaidkhan08/Mental-Health-Application-ML-stack

# 

# Importing required libraries

In [1]:
import json
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
%matplotlib inline

# 

# Reading the dataset!

In [2]:
df = pd.read_csv("data/dataset.csv")
df = df.drop(['Unnamed: 0'], axis=1)

In [137]:
df.head()

| Unnamed: 0 | text | class |
|---|---|---|
| 0 | Ex Wife Threatening SuicideRecently I left my ... | suicide |
| 1 | Am I weird I don't get affected by compliments... | non-suicide |
| 2 | Finally 2020 is almost over... So I can never ... | non-suicide |
| 3 | i need helpjust help me im crying so hard | suicide |
| 4 | I’m so lostHello, my name is Adam (16) and I’v... | suicide |


# 

# Preprocessing!
Here, we encode the labels using LabelEncoder, split the data into training and test sets, and tokenize the text with a maximum vocabulary size of 5000. This approach enables the machine learning model to learn from the data by converting the text into numerical inputs that can be processed by the model.

In [3]:
df = pd.read_csv('data/dataset.csv')
X = df['text'].values
y = df['class'].values

encoder = LabelEncoder()
y = encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

# Save the tokenizer to a file
tokenizer_json = tokenizer.to_json()
with open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer_json)
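
To make the tokenization step concrete, here is a minimal pure-Python sketch of what a frequency-ranked word index with an out-of-vocabulary token does. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `to_sequence` and the toy sentences are made up for illustration, and punctuation filtering is left out.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unseen or too-rare words get the OOV index."""
    oov = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word, oov)
        # Only indices below num_words are kept, mirroring a vocabulary cap
        sequence.append(idx if idx < num_words else oov)
    return sequence


# Hypothetical toy corpus, not taken from the dataset
toy_texts = ["i need help", "help me please", "i am fine"]
toy_index = build_word_index(toy_texts)
toy_seq = to_sequence("i need sleep", toy_index, num_words=5000)
print(toy_seq)  # "sleep" was never seen, so it maps to the OOV index 1
```

Because the index is built from `X_train` only, words that appear solely in the test set or in new user input end up as `<OOV>` at prediction time.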

# 

# Loading the saved tokenizer file (for later usage)

In [4]:
import tensorflow as tf

with open('tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer_json = f.read()
tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer_json)

# 

# Model training!
Now, we create a Sequential model using Keras with an embedding layer to learn a dense vector representation of the text, an LSTM layer for processing sequential data, and a dense output layer with a `sigmoid` activation function for binary classification. The model is trained using the `binary_crossentropy` loss function and the Adam optimizer.
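
As a rough illustration of the output layer and loss described above, the sketch below computes a sigmoid probability and the binary cross-entropy for a single example in plain Python. The logit value `2.0` and the clipping constant `eps` are arbitrary assumptions for the demo. Keras does the same arithmetic in batches.

```python
import math


def sigmoid(z):
    """Squash a raw score (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


p = sigmoid(2.0)                              # roughly 0.88
loss_if_positive = binary_crossentropy(1, p)  # small: prediction agrees with the label
loss_if_negative = binary_crossentropy(0, p)  # large: prediction contradicts the label
print(p, loss_if_positive, loss_if_negative)
```

The loss grows quickly when the model is confident and wrong, which is what pushes the sigmoid output toward the correct label during training.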

In [17]:
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_train_padded = pad_sequences(X_train_sequences, padding='post', maxlen=100)

X_test_sequences = tokenizer.texts_to_sequences(X_test)
X_test_padded = pad_sequences(X_test_sequences, padding='post', maxlen=100)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(5000, 16, input_length=100),
    tf.keras.layers.LSTM(64, dropout=0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_padded, y_train, epochs=10, validation_data=(X_test_padded, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1b9aece05e0>
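
The `pad_sequences` calls above use `padding='post'` and `maxlen=100` and leave `truncating` at its Keras default of `'pre'`. This means short texts get zeros appended, while long texts keep only their last 100 tokens. A minimal pure-Python sketch of that behaviour follows. The helper `pad_post` is a hypothetical name used only here, and a small `maxlen` keeps the output readable.

```python
def pad_post(sequence, maxlen, value=0):
    """Drop tokens from the front if too long, then pad at the end."""
    trimmed = sequence[-maxlen:]  # keep the last `maxlen` tokens
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([7, 3, 9], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [7, 3, 9, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

If the start of a post matters more than its end, passing `truncating='post'` to `pad_sequences` would keep the first 100 tokens instead. Whether that helps on this dataset is untested here.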

# 

# Saving & loading the model files!

In [5]:
from tensorflow.keras.models import load_model

# Uncomment to save the trained model before loading it back
# model.save('my_model.h5')
loaded_model = load_model('my_model.h5')

# 

# Model Evaluation!

In [22]:
from sklearn.metrics import classification_report

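# health() is defined in the "Predictions using our model!" section below; run that cell first.
# Note: this scores the first 1,000 training examples, not the held-out test set.
y_pred = [health(x) for x in X_train[0:1000]]

# Map the string predictions to the encoded labels (0 = non-suicide, 1 = suicide)
# without refitting the encoder that was fitted on the dataset classes.
new = np.array([1 if label == "Self-harmful" else 0 for label in y_pred])

print(classification_report(y_train[0:1000], new))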

              precision    recall  f1-score   support

           0       0.98      0.95      0.96       483
           1       0.95      0.98      0.96       517

    accuracy                           0.96      1000
   macro avg       0.96      0.96      0.96      1000
weighted avg       0.96      0.96      0.96      1000
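
For readers unfamiliar with the columns in the report, the sketch below shows how precision, recall, and F1 are derived from prediction counts for one class. The labels in `toy_true` and `toy_pred` are invented for the demo and are not the model's outputs. Label `1` stands for the `suicide` class, since `LabelEncoder` sorts class names alphabetically.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = binary_report(toy_true, toy_pred)
print(precision, recall, f1)
```

For a screening model like this one, recall on class `1` is arguably the number to watch, because a false negative means a concerning message was labelled "Normal".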



# 

# Predictions using our model!

In [50]:
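def health(text):
    # Convert the text to a padded integer sequence, as done for the training data
    text_sequence = tokenizer.texts_to_sequences([text])
    text_padded = pad_sequences(text_sequence, padding='post', maxlen=100)
    # predict() returns an array of shape (1, 1); take the scalar probability
    probability = loaded_model.predict(text_padded)[0][0]

    # Probabilities at or above 0.5 are treated as the positive class
    if probability >= 0.5:
        return "Self-harmful"
    return "Normal"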
    
    
health("I'm feeling really down today. I don't know if I can take it anymore.")



'Self-harmful'

# 

# Thank you!