# Data Pre-Processing

Here we have two datasets:

1. mental_health.csv (From Kaggle: https://www.kaggle.com/datasets/reihanenamdari/mental-health-corpus)

The Mental Health Corpus is a collection of texts related to people with anxiety, depression, and other mental health issues. The corpus consists of two columns: one containing the comments, and the other containing labels indicating whether the comments are considered poisonous or not. The corpus can be used for a variety of purposes, such as sentiment analysis, toxic language detection, and mental health language analysis. The data in the corpus may be useful for researchers, mental health professionals, and others interested in understanding the language and sentiment surrounding mental health issues.

2. Suicide_Detection.csv (From Kaggle: https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch)

The dataset is a collection of posts from the "SuicideWatch" and "depression" subreddits of the Reddit platform. The posts were collected using the Pushshift API. All posts made to "SuicideWatch" from Dec 16, 2008 (its creation) to Jan 2, 2021 were collected, while "depression" posts were collected from Jan 1, 2009 to Jan 2, 2021.

We will clean both datasets and merge them, with the final aim of a single dataset with only two columns: text and label (0 for not depression, 1 for depression).

#### Pre-processing dataset 1

In [2]:
import pandas as pd
import numpy as np

In [3]:
# import data in dataset1 from mental_health.csv

dataset1 = pd.read_csv('mental_health.csv')
dataset1.head()

Unnamed: 0,text,label
0,dear american teens question dutch person hear...,0
1,nothing look forward lifei dont many reasons k...,1
2,music recommendations im looking expand playli...,0
3,im done trying feel betterthe reason im still ...,1
4,worried year old girl subject domestic physic...,1


In [4]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27977 entries, 0 to 27976
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    27977 non-null  object
 1   label   27977 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 437.3+ KB


The output of info() shows that there are no null values. The dataset also contains only two columns, one holding the text and the other a binary label, so no further pre-processing is required.
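The same checks can also be made explicit in code rather than read off the info() output. The sketch below is one possible way to do that; the helper name `check_text_label_frame` and the two sample rows are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def check_text_label_frame(df: pd.DataFrame) -> None:
    """Raise ValueError if df is not a clean two-column text/label frame."""
    # The frame should have exactly the two expected columns, in this order
    if list(df.columns) != ["text", "label"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    # No missing values anywhere in the frame
    if df.isna().any().any():
        raise ValueError("frame contains null values")
    # Labels must be drawn from {0, 1}
    if not set(df["label"].unique()) <= {0, 1}:
        raise ValueError("label column is not binary")

# Tiny stand-in for dataset1 (illustrative rows, not taken from the corpus)
sample = pd.DataFrame({"text": ["first comment", "second comment"], "label": [0, 1]})
check_text_label_frame(sample)  # passes silently
```

Calling `check_text_label_frame(dataset1)` after loading would turn the visual inspection into a check that fails loudly if the file ever changes.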

#### Pre-processing dataset 2

In [5]:
# import data in dataset2 from Suicide_Detection.csv

dataset2 = pd.read_csv('Suicide_Detection.csv')
dataset2.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",suicide


In [6]:
dataset2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232074 entries, 0 to 232073
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  232074 non-null  int64 
 1   text        232074 non-null  object
 2   class       232074 non-null  object
dtypes: int64(1), object(2)
memory usage: 5.3+ MB


In [7]:
# drop the first column ("Unnamed: 0"), a leftover row index from the CSV file
dataset2 = dataset2.drop(dataset2.columns[0], axis=1)
dataset2

Unnamed: 0,text,class
0,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,Am I weird I don't get affected by compliments...,non-suicide
2,Finally 2020 is almost over... So I can never ...,non-suicide
3,i need helpjust help me im crying so hard,suicide
4,"I’m so lostHello, my name is Adam (16) and I’v...",suicide
...,...,...
232069,If you don't like rock then your not going to ...,non-suicide
232070,You how you can tell i have so many friends an...,non-suicide
232071,pee probably tastes like salty tea😏💦‼️ can som...,non-suicide
232072,The usual stuff you find hereI'm not posting t...,suicide
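Dropping by position works here, but it silently removes the wrong column if the file layout ever changes. As a hedged alternative, the leftover index column can be dropped by name, or skipped at load time with `usecols`. The in-memory CSV below is a made-up stand-in for Suicide_Detection.csv, assuming the file stores the index under an empty header (which pandas names "Unnamed: 0").

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Suicide_Detection.csv (rows are made up)
csv_text = ",text,class\n2,some post,suicide\n3,another post,non-suicide\n"

# Option 1: load everything, then drop the index column by name
by_name = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Option 2: load only the columns that are actually needed
selected = pd.read_csv(io.StringIO(csv_text), usecols=["text", "class"])

print(list(by_name.columns))
print(by_name.equals(selected))
```

Both options leave a frame with just the `text` and `class` columns.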


Converting suicide to 1 and non-suicide to 0

In [8]:
dataset2["class"] = dataset2["class"].map({"suicide" : 1, "non-suicide" : 0})
dataset2

Unnamed: 0,text,class
0,Ex Wife Threatening SuicideRecently I left my ...,1
1,Am I weird I don't get affected by compliments...,0
2,Finally 2020 is almost over... So I can never ...,0
3,i need helpjust help me im crying so hard,1
4,"I’m so lostHello, my name is Adam (16) and I’v...",1
...,...,...
232069,If you don't like rock then your not going to ...,0
232070,You how you can tell i have so many friends an...,0
232071,pee probably tastes like salty tea😏💦‼️ can som...,0
232072,The usual stuff you find hereI'm not posting t...,1
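One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes NaN instead of raising an error. In this notebook the later info() output shows no nulls in the label column, so the mapping was complete, but a quick check like the sketch below can make that explicit. The sample values, including the deliberately malformed one, are invented for illustration.

```python
import pandas as pd

label_map = {"suicide": 1, "non-suicide": 0}

# The last value is deliberately malformed to show what an unmapped class looks like
classes = pd.Series(["suicide", "non-suicide", "Suicide "])

mapped = classes.map(label_map)

# Values that were not found in label_map come back as NaN
unmapped = classes[mapped.isna()]
print(unmapped.tolist())
```

If `unmapped` is empty, every class string was covered by the mapping and the column can safely be treated as integer labels.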


In [9]:
dataset2 = dataset2.rename(columns={'class' : 'label'})
dataset2

Unnamed: 0,text,label
0,Ex Wife Threatening SuicideRecently I left my ...,1
1,Am I weird I don't get affected by compliments...,0
2,Finally 2020 is almost over... So I can never ...,0
3,i need helpjust help me im crying so hard,1
4,"I’m so lostHello, my name is Adam (16) and I’v...",1
...,...,...
232069,If you don't like rock then your not going to ...,0
232070,You how you can tell i have so many friends an...,0
232071,pee probably tastes like salty tea😏💦‼️ can som...,0
232072,The usual stuff you find hereI'm not posting t...,1


#### Combining both dataframes and saving the result to CSV

In [10]:
newDataset = pd.concat([dataset1, dataset2])
newDataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260051 entries, 0 to 232073
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    260051 non-null  object
 1   label   260051 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 6.0+ MB


In [11]:
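# hold out a random 10% of the rows as test data; the rest is training data
# caution: newDataset keeps duplicate row labels from the two source frames,
# so drop() by label can remove more rows than were sampled
test_data = newDataset.sample(frac=0.1)
train_data = newDataset.drop(test_data.index)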
test_data
train_data

Unnamed: 0,text,label
0,dear american teens question dutch person hear...,0
1,nothing look forward lifei dont many reasons k...,1
2,music recommendations im looking expand playli...,0
4,worried year old girl subject domestic physic...,1
5,hey rredflag sure right place post this goes ...,1
...,...,...
232067,I’m a fucking scummy person and I can’t change...,1
232070,You how you can tell i have so many friends an...,0
232071,pee probably tastes like salty tea😏💦‼️ can som...,0
232072,The usual stuff you find hereI'm not posting t...,1


In [12]:
test_data.to_csv("test_data.csv", index=False)
train_data.to_csv("train_data.csv", index=False)
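A note on the split above: `pd.concat` keeps each frame's original row labels, so labels 0 to 27976 occur twice in `newDataset`. Because `drop` works on labels, every row that shares a label with a sampled row is removed as well. The numbers are consistent with this: a 10% sample of 260051 rows is 26005 rows, which would leave 234046 for training, yet the reloaded train_data.csv further down reports 228949 entries. A minimal sketch of one way to avoid this, assuming the concatenated index can be rebuilt with `ignore_index=True`, follows. The tiny frames and the `random_state` value are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Two small stand-in frames (made-up rows)
a = pd.DataFrame({"text": ["a0", "a1", "a2"], "label": [0, 1, 0]})
b = pd.DataFrame({"text": ["b0", "b1", "b2", "b3"], "label": [1, 0, 1, 0]})

# Plain concat keeps labels 0..2 from `a` and 0..3 from `b`, so labels repeat
stacked = pd.concat([a, b])
print(stacked.index.is_unique)

# ignore_index=True relabels the rows 0..n-1, so every label is unique
combined = pd.concat([a, b], ignore_index=True)

# random_state is an assumption added here to make the split reproducible
test_part = combined.sample(frac=0.3, random_state=0)
train_part = combined.drop(test_part.index)

# With unique labels, the two parts partition the combined frame exactly
print(len(test_part) + len(train_part) == len(combined))
```

With unique labels, the training part contains exactly the rows that were not sampled, and no others.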

In [3]:
# reload the saved training split from train_data.csv

dataset1 = pd.read_csv('train_data.csv')
dataset1.head()

Unnamed: 0,text,label
0,dear american teens question dutch person hear...,0
1,nothing look forward lifei dont many reasons k...,1
2,music recommendations im looking expand playli...,0
3,worried year old girl subject domestic physic...,1
4,hey rredflag sure right place post this goes ...,1


In [4]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228949 entries, 0 to 228948
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    228949 non-null  object
 1   label   228949 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.5+ MB


In [8]:
dataset1['label'].value_counts()

0    114610
1    114339
Name: label, dtype: int64
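The two classes are almost perfectly balanced: 114339 of 228949 rows, about 49.9%, carry label 1. The shares can also be read directly with `value_counts(normalize=True)`, as in this small sketch on made-up labels.

```python
import pandas as pd

# Made-up labels standing in for dataset1['label']
labels = pd.Series([0, 1, 1, 0, 1], name="label")

# normalize=True returns the fraction of rows per class instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares)
```

On the real data, `dataset1['label'].value_counts(normalize=True)` would give the per-class fractions in the same way.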