## Checking Data

This notebook is a first sanity check of the data gathered by each scraper, looked at one file at a time, to confirm that collection through the PushShift API worked as expected. It is not central to the analysis itself.

In [2]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [35]:
casomc = pd.read_csv('../data/CasualConvOMC.csv')
happy = pd.read_csv('../data/happy_15k.csv')
depression = pd.read_csv('../data/depression_30k.csv')
anxiety = pd.read_csv('../data/anxiety_20k.csv')

In [25]:
casomc.head()

| | Unnamed: 0 | title | author | selftext | created_utc | subreddit |
|---|---|---|---|---|---|---|
| 0 | 0 | Why is it that the person who beats themself u... | ToesyToeNails | [removed] | 1602713864 | CasualConversation |
| 1 | 1 | Dealing with sadness | willhound71 | Hi I’m Will and I’ve been a lurker for a while... | 1602713155 | CasualConversation |
| 2 | 2 | My life has never been better, and I feel as t... | mrsleveman | Hi :). I live in the UK and I'm 18, currently ... | 1602713095 | CasualConversation |
| 3 | 3 | It‘s my cake day!!!! :o | sinah-mv | I love Reddit and will probably spend too much... | 1602713014 | CasualConversation |
| 4 | 4 | Can I have weed dealer I colorado about 15 min... | WALMART_RAPIST | [removed] | 1602712660 | CasualConversation |


In [26]:
happy.shape

(15000, 6)

In [27]:
casomc.shape

(30000, 6)
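
Beyond `head()` and `shape`, a few cheap checks can confirm that a scrape came back intact. The following is a minimal sketch, not part of the original notebook: `check_scrape` is a hypothetical helper, the expected column names are taken from the `head()` output above, and the tiny frame stands in for one of the real CSVs.

```python
import pandas as pd

# Column names taken from the head() output above.
EXPECTED_COLUMNS = ['title', 'author', 'selftext', 'created_utc', 'subreddit']

def check_scrape(df):
    """Hypothetical helper: summarize basic health indicators for one scraped frame."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return {'rows': len(df), 'missing_columns': missing}
    return {
        'rows': len(df),
        'missing_columns': missing,
        # Rows that repeat an earlier row's author, title, and timestamp.
        'duplicate_posts': int(df.duplicated(subset=['author', 'title', 'created_utc']).sum()),
        # Share of posts whose body reads '[removed]'.
        'removed_share': float((df['selftext'] == '[removed]').mean()),
        'subreddits': sorted(df['subreddit'].unique()),
    }

# Stand-in data; in the notebook this would be casomc, happy, depression, or anxiety.
demo = pd.DataFrame({
    'title': ['Dealing with sadness', 'Dealing with sadness', 'It is my cake day'],
    'author': ['a', 'a', 'b'],
    'selftext': ['text', 'text', '[removed]'],
    'created_utc': [1, 1, 2],
    'subreddit': ['CasualConversation'] * 3,
})
report = check_scrape(demo)
print(report)
```

A high `removed_share` or an unexpected entry in `subreddits` would be a reason to look at the scraper before going further.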

In [7]:
analyzer = SentimentIntensityAnalyzer()

In [8]:
cc_sample = casomc[casomc['subreddit']=='CasualConversation'].sample(100)
analyzer.polarity_scores(cc_sample['title'])

{'neg': 0.089, 'neu': 0.76, 'pos': 0.151, 'compound': 0.9977}

In [9]:
omc_sample = casomc[casomc['subreddit']=='offmychest'].sample(100)
analyzer.polarity_scores(omc_sample['title'])

{'neg': 0.169, 'neu': 0.679, 'pos': 0.152, 'compound': -0.959}

In [10]:
casomc['subreddit'].unique()

array(['CasualConversation', 'offmychest'], dtype=object)

#### Creating a BaseText Corpus from CasualConversation and Happy

In [11]:
basetext = pd.concat([casomc[casomc['subreddit']=='CasualConversation'],happy])

In [12]:
basetext.shape

(30000, 6)
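
Because `pd.concat` keeps each input's original row labels by default, the combined frame can contain repeated index values. A minimal sketch of one way to get a unique index, assuming nothing downstream relies on the original labels; the small frames are stand-ins for the real data.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be the CasualConversation rows and happy.
casual = pd.DataFrame({'title': ['a', 'b'], 'subreddit': ['CasualConversation'] * 2})
happy_demo = pd.DataFrame({'title': ['c', 'd'], 'subreddit': ['happy'] * 2})

# Default behaviour: labels 0 and 1 each appear twice.
stacked = pd.concat([casual, happy_demo])
print(stacked.index.is_unique)

# ignore_index=True discards the old labels and renumbers rows from 0.
combined = pd.concat([casual, happy_demo], ignore_index=True)
print(combined.index.tolist())
```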

In [13]:
basetext.drop(columns = 'Unnamed: 0', inplace=True)

In [14]:
basetext.head()

| | title | author | selftext | created_utc | subreddit |
|---|---|---|---|---|---|
| 0 | Why is it that the person who beats themself u... | ToesyToeNails | [removed] | 1602713864 | CasualConversation |
| 1 | Dealing with sadness | willhound71 | Hi I’m Will and I’ve been a lurker for a while... | 1602713155 | CasualConversation |
| 2 | My life has never been better, and I feel as t... | mrsleveman | Hi :). I live in the UK and I'm 18, currently ... | 1602713095 | CasualConversation |
| 3 | It‘s my cake day!!!! :o | sinah-mv | I love Reddit and will probably spend too much... | 1602713014 | CasualConversation |
| 4 | Can I have weed dealer I colorado about 15 min... | WALMART_RAPIST | [removed] | 1602712660 | CasualConversation |


#### Extracting a Sample and Seeing if We Can Attach VADER Polarity Scores to the Corpus

In [15]:
base_sample = basetext.sample(100)

In [16]:
res = analyzer.polarity_scores(base_sample['title'])

In [17]:
[res['neg'], res['neu']]

[0.06, 0.739]

In [18]:
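def sentiment(text):
    """Return the VADER neg, neu, pos, and compound scores for one title."""
    # Reuse the analyzer created above; building a new one per row reloads the lexicon each time.
    res = analyzer.polarity_scores(text)
    return pd.Series([res['neg'], res['neu'], res['pos'], res['compound']])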

In [19]:
base_sample[['neg','neu','pos','comp']] = base_sample['title'].apply(sentiment)

In [37]:
base_sample.head()

| | title | author | selftext | created_utc | subreddit | neg | neu | pos | comp |
|---|---|---|---|---|---|---|---|---|---|
| 4045 | I have my new cover for my now serialized book... | Faustyna | | 1592786749 | happy | 0.0 | 0.857 | 0.143 | 0.7921 |
| 14094 | I'm still happy about something that happened ... | rocijim | A month ago, the popular manga haikyuu came to... | 1598473453 | CasualConversation | 0.0 | 0.865 | 0.135 | 0.3291 |
| 414 | Are you voting for Orange Man or Senile Old Fool? | PapadinDanse | [removed] | 1602590213 | CasualConversation | 0.244 | 0.756 | 0.0 | -0.4404 |
| 10887 | Today I proved to myself that I can be a good ... | KilotonCarcajou | I've been the shift manager of a small rental ... | 1576901204 | happy | 0.0 | 0.791 | 0.209 | 0.4404 |
| 1689 | Sometimes I wish I could be tiny so I could se... | ryoto500 | [removed] | 1602188188 | CasualConversation | 0.0 | 0.924 | 0.076 | 0.4019 |


#### Quick sampled check of sentiment in the base, depression, and anxiety classes to see whether the base class is different enough

In [33]:
analyzer.polarity_scores(basetext.sample(500)['title'])

{'neg': 0.076, 'neu': 0.722, 'pos': 0.202, 'compound': 1.0}

In [34]:
analyzer.polarity_scores(depression.sample(500)['title'])

{'neg': 0.219, 'neu': 0.678, 'pos': 0.103, 'compound': -1.0}

In [36]:
analyzer.polarity_scores(anxiety.sample(500)['title'])

{'neg': 0.216, 'neu': 0.671, 'pos': 0.113, 'compound': -0.9999}
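
The `compound` values of 1.0 and -1.0 are expected for inputs this long. VADER sums the valence of every scored word and squashes the total into [-1, 1], so a large block of text tends to saturate while `neg`, `neu`, and `pos` remain informative proportions. The sketch below reproduces that normalization as we understand it from the library source (the total divided by the square root of its square plus a constant of 15). Treat the constant and the function name as assumptions to verify against the installed version.

```python
import math

def normalize(total_valence, alpha=15):
    """Map an unbounded valence sum into [-1, 1], as VADER is understood to do."""
    return total_valence / math.sqrt(total_valence ** 2 + alpha)

# A short title carries a small total; hundreds of titles together carry a very large one.
for total in (1.0, 5.0, 50.0, 500.0):
    print(total, round(normalize(total), 4))
```

For this reason the `neg` and `pos` shares are the more useful numbers when comparing the three samples above.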

##### The results match intuition: the base sample skews positive, while the depression and anxiety samples skew negative. Now we can export our base text corpus for use with the more robust classifier.

#### Exporting the basetext corpus

In [38]:
basetext.head()

| | title | author | selftext | created_utc | subreddit |
|---|---|---|---|---|---|
| 0 | Why is it that the person who beats themself u... | ToesyToeNails | [removed] | 1602713864 | CasualConversation |
| 1 | Dealing with sadness | willhound71 | Hi I’m Will and I’ve been a lurker for a while... | 1602713155 | CasualConversation |
| 2 | My life has never been better, and I feel as t... | mrsleveman | Hi :). I live in the UK and I'm 18, currently ... | 1602713095 | CasualConversation |
| 3 | It‘s my cake day!!!! :o | sinah-mv | I love Reddit and will probably spend too much... | 1602713014 | CasualConversation |
| 4 | Can I have weed dealer I colorado about 15 min... | WALMART_RAPIST | [removed] | 1602712660 | CasualConversation |


In [40]:
basetext.to_csv('./basetext.csv')
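
Writing with the default settings stores the index as an unnamed first column, which is how columns such as `Unnamed: 0` appear when a file is read back. A minimal sketch of avoiding that with `index=False`, using an in-memory buffer and a stand-in frame rather than the real corpus:

```python
import io
import pandas as pd

demo = pd.DataFrame({'title': ['a', 'b'], 'subreddit': ['happy', 'happy']})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False: only the real columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```

If the index is worth keeping, passing `index_col=0` to `read_csv` on reload is the other way to avoid the extra column.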