## Scrape Potentially Depressive Tweets from Twitter

We want to gather tweets from Twitter based on depressive hashtags such as #depressed, #depression, #loneliness and #hopelessness, and then apply several filtering techniques to remove non-depressive messages.
The result of this script is a dataset containing a filtered collection of potentially depressive tweets. The script also removes all hashtags from the tweet text, so that a machine learning model cannot cheat by simply looking for depressive hashtags.
The final dataset will be manually reviewed and labelled, so that both the depressive and non-depressive messages within it are correctly marked.

In [None]:
!pip install nest_asyncio

Collecting nest_asyncio
  Downloading https://files.pythonhosted.org/packages/09/82/76dfcb16ba761a70bc89d93c6af2d13c1677f04bb8fa70213685ee071588/nest_asyncio-1.1.0-py3-none-any.whl
Installing collected packages: nest-asyncio
Successfully installed nest-asyncio-1.1.0


In [None]:
!pip install twint



In [None]:
import nest_asyncio
nest_asyncio.apply()
import pandas as pd
import twint

In [None]:
import re  # pandas is already imported in the previous cell

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
users = pd.read_csv('/content/gdrive/My Drive/data/users_list.csv')

In [None]:
depress_tags = ["#depressed", "#depression", "#loneliness", "#hopelessness"]

# Run one search per hashtag; every search writes to the same CSV file.
for tag in depress_tags:
    print(tag)
    c = twint.Config()

    c.Format = "Tweet id: {id} | Tweet: {tweet}"
    c.Search = tag
    c.Limit = 1000
    c.Store_csv = True
    c.Store_Object = True
    c.Output = "/content/gdrive/My Drive/data/dataset_en_all7.csv"
    c.Hide_output = True
    c.Stats = True
    c.Lowercase = True
    c.Filter_retweets = True
#    c.Near = "Detroit"
    twint.run.Search(c)

#depressed
#depression
#loneliness
#hopelessness


In [None]:
# add more examples of the depressed and depression tags, but with another year so they don't overlap
depress_tags = ["#depressed", "#depression"]

for tag in depress_tags:
    c = twint.Config()

    c.Format = "Tweet id: {id} | Tweet: {tweet}"
    c.Search = tag
    c.Limit = 1000
    c.Year = 2016
    c.Store_csv = True
    c.Store_Object = True
    c.Output = "/content/gdrive/My Drive/data/dataset_en_al19.csv"
    c.Hide_output = True
    c.Stats = True
    c.Lowercase = True
#    c.Near = "london"
    twint.run.Search(c)

In [None]:
df1 = pd.read_csv("/content/gdrive/My Drive/data/dataset_en_all7.csv")
df2 = pd.read_csv("/content/gdrive/My Drive/data/dataset_en_al19.csv")
df_all = pd.concat([df1, df2])
len(df1), len(df2), len(df_all)

(4034, 2000, 6034)

In [None]:
df1.hashtags.value_counts()

['#depressed']                                                                                                                                                                                                                                                                335
['#loneliness']                                                                                                                                                                                                                                                               170
['#depression']                                                                                                                                                                                                                                                               129
['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']                                                                         

In [None]:
len(df1), len(df2), len(df_all)

(4034, 2000, 6034)

In [None]:
len(df_all.id.value_counts())

5969

## 1. Combine datasets and remove duplicates based on id and tweet content
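The cell below drops duplicates on `id` only. As a minimal sketch of how duplicates by tweet content could be handled as well, the example below uses a toy DataFrame and a hypothetical helper, `normalize_text`; neither is part of the original notebook.

```python
import pandas as pd


def normalize_text(text: str) -> str:
    """Hypothetical helper: lowercase and collapse whitespace so that
    near-identical tweets compare as equal."""
    return " ".join(text.lower().split())


# Toy data standing in for the scraped tweets (illustrative only).
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "tweet": [
        "I feel #depressed",
        "I feel #depressed",
        "i feel   #depressed",
        "Another tweet",
    ],
})

# Step 1: drop exact duplicates by tweet id, as the notebook does.
deduped = toy.drop_duplicates(subset=["id"])

# Step 2 (assumption, not in the notebook): drop rows whose normalized
# text has already been seen under a different id.
text_key = deduped["tweet"].apply(normalize_text)
deduped = deduped[~text_key.duplicated()]

print(deduped)
```

Normalizing before comparing is a design choice; comparing the raw text would only catch byte-for-byte copies.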

In [None]:
df_all = df_all.drop_duplicates(subset =["id"]) 

In [None]:
df_all.shape

(5969, 31)

In [None]:
# show full column contents; None replaces the deprecated -1
pd.set_option('display.max_colwidth', None)

In [None]:
df_all.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
0,1163051752083488768,1162721526644260864,1566128120000,2019-08-18,11:35:20,UTC,1154343391578091520,las733re,Las733re,,Manakit ng babae kc lahat sila ng aano sakin eh #depressed HAHAHAHA,"['jullanayysbel', 'keisha_nemd']",[],[],0,0,0,['#depressed'],[],https://twitter.com/las733re/status/1163051752083488768,False,,0,,,,,,,"[{'user_id': '1154343391578091520', 'username': 'las733re'}, {'user_id': '993896021536079872', 'username': 'jullanayysbel'}, {'user_id': '978525852412465153', 'username': 'keisha_neMD'}]",
1,1163050916330770433,1163050916330770433,1566127921000,2019-08-18,11:32:01,UTC,1062143056017846273,lowerdepression,Lower Depression,,"#Depressed mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.",[],[],[],0,0,0,['#depressed'],[],https://twitter.com/LowerDepression/status/1163050916330770433,False,,0,,,,,,,"[{'user_id': '1062143056017846273', 'username': 'LowerDepression'}]",
2,1163050768221425665,1163048181409599488,1566127886000,2019-08-18,11:31:26,UTC,740917930104262656,aliyanolasco_,liah,,#Iyak #depressed #blade,"['httpliaaah', 'ravenanaaa', 'je']",[],[],1,0,1,"['#iyak', '#depressed', '#blade']",[],https://twitter.com/aliyanolasco_/status/1163050768221425665,False,,0,,,,,,,"[{'user_id': '740917930104262656', 'username': 'aliyanolasco_'}, {'user_id': '901042136509894656', 'username': 'httpliaaah'}, {'user_id': '1042036723402854400', 'username': 'ravenanaaa'}, {'user_id': '29068692', 'username': 'je'}]",
3,1163042193575153664,1163042193575153664,1566125841000,2019-08-18,10:57:21,UTC,1163035997921271808,samtayl30246562,Sam Taylor,,Go listen to our latest track <3 https://soundcloud.com/ajyr/like-that-prodtaylorj … #xxxtentacion #sad #rap #love #music #llj #memes #lilpeep #like #hiphop #lilpump #follow #ix #art #liluzivert #sadedits #ripxxxtentacion #edits #depressed #aesthetic #bhfyp #drake #trippieredd #meme #depression,[],['https://soundcloud.com/ajyr/like-that-prodtaylorj'],[],0,0,1,"['#xxxtentacion', '#sad', '#rap', '#love', '#music', '#llj', '#memes', '#lilpeep', '#like', '#hiphop', '#lilpump', '#follow', '#ix', '#art', '#liluzivert', '#sadedits', '#ripxxxtentacion', '#edits', '#depressed', '#aesthetic', '#bhfyp', '#drake', '#trippieredd', '#meme', '#depression']",[],https://twitter.com/SamTayl30246562/status/1163042193575153664,False,,0,,,,,,,"[{'user_id': '1163035997921271808', 'username': 'SamTayl30246562'}]",
4,1163039493806444545,1163039493806444545,1566125198000,2019-08-18,10:46:38,UTC,4467577942,naijabin,Naija Bin,,#Depressed Man Kills Himself By Jumping Inside A Well In Enugu (Watch Video) - https://naijabin.com/depressed-man-kills-himself-by-jumping-inside-a-well-in-enugu-watch-video/ … pic.twitter.com/jFTVw80Biy,[],['https://naijabin.com/depressed-man-kills-himself-by-jumping-inside-a-well-in-enugu-watch-video/'],['https://pbs.twimg.com/media/ECPyQZeU8AA982X.jpg'],0,0,0,['#depressed'],[],https://twitter.com/NaijaBin/status/1163039493806444545,False,,0,,,,,,,"[{'user_id': '4467577942', 'username': 'NaijaBin'}]",


In [None]:
df_all.hashtags.value_counts().head(20)

['#depressed']                                                                                                                                       619
['#depression']                                                                                                                                      267
['#loneliness']                                                                                                                                      170
['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']                                            113
['#hopelessness']                                                                                                                                    108
['#onlinetherapy', '#anxiety', '#depression']                                                                                                        48 
['#sadness', '#hopelessness', '#battledepression']                                

Let's look at an example where the same long list of tags recurs many times. That looks suspiciously like a marketing message.
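Such repeated tag lists can also be found programmatically rather than by eye. The sketch below runs on a toy DataFrame and flags accounts that post the same long hashtag list several times; the thresholds `MIN_TAGS` and `MIN_REPEATS` are illustrative assumptions, not values from this notebook.

```python
import pandas as pd

# Toy data: hashtags are stored as strings, as in the scraped CSV.
toy = pd.DataFrame({
    "username": ["promo", "promo", "promo", "alice", "bob"],
    "hashtags": ["['#a', '#b', '#c', '#d']"] * 3
                + ["['#depressed']", "['#depressed']"],
})

# Count how often each account posts exactly the same hashtag list.
repeats = (
    toy.groupby(["username", "hashtags"])
       .size()
       .reset_index(name="n_tweets")
)

# Assumed thresholds: a "long" list has at least MIN_TAGS tags and is
# "repeated" when one account uses it at least MIN_REPEATS times.
MIN_TAGS, MIN_REPEATS = 4, 3
suspicious = repeats[
    (repeats["hashtags"].str.count("#") >= MIN_TAGS)
    & (repeats["n_tweets"] >= MIN_REPEATS)
]

print(suspicious)
```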

In [None]:
df_all[df_all["hashtags"] =="['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']"]

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
3029,1161787287123873793,1160863504196349952,1565826648000,2019-08-14,23:50:48,UTC,788477216535425024,boomer12k,Happiness Today,,"The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has ALSO stood at the Abyss. Blessings... pic.twitter.com/ae0897jHKa","['timetochange', 'breaking_taboo']",[],['https://pbs.twimg.com/media/EB9_YHsUEAAB90s.jpg'],0,0,0,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1161787287123873793,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '20527466', 'username': 'TimetoChange'}, {'user_id': '865579308319985665', 'username': 'Breaking_Taboo'}]",
3033,1161498231991353344,1161454142050770947,1565757732000,2019-08-14,04:42:12,UTC,788477216535425024,boomer12k,Happiness Today,,"The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings...",['mayoclinic'],[],[],0,0,0,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1161498231991353344,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '14592723', 'username': 'MayoClinic'}]",
3040,1161023584816340992,1161015597565534208,1565644568000,2019-08-12,21:16:08,UTC,788477216535425024,boomer12k,Happiness Today,,"The technique has brought me and OTHERS...GOOD results. It can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings...",['psypost'],[],[],0,0,0,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1161023584816340992,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '125512325', 'username': 'PsyPost'}]",
3043,1160416840109084674,1160367617049186304,1565499908000,2019-08-11,05:05:08,UTC,788477216535425024,boomer12k,Happiness Today,,"Stops more than a broken heart. The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings...",['psychcentral'],[],[],0,0,0,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1160416840109084674,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '17813135', 'username': 'PsychCentral'}]",
3044,1160415821346177026,1160370499081269249,1565499665000,2019-08-11,05:01:05,UTC,788477216535425024,boomer12k,Happiness Today,,"I got Depression covered... The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings... pic.twitter.com/TsC6NdNRKO",['psychtoday'],[],['https://pbs.twimg.com/media/EBqgCXHUIAAdoaU.jpg'],0,0,1,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1160415821346177026,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '27726303', 'username': 'PsychToday'}]",
3047,1159943607584866304,1159928476150853632,1565387081000,2019-08-09,21:44:41,UTC,788477216535425024,boomer12k,Happiness Today,,"The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, WITHOUT DRUGS OR EVEN A THERAPIST, I DID IT ALONE. Best Wishes from a researcher who has stood at the Abyss. Blessings.",['psychcentral'],[],[],0,0,1,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1159943607584866304,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '17813135', 'username': 'PsychCentral'}]",
3050,1159647238806630400,1159634405335412736,1565316421000,2019-08-09,02:07:01,UTC,788477216535425024,boomer12k,Happiness Today,,"It helps with ""flagging motivation""... The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings...","['jeannieclarkson', 'careynieuwhof']",[],[],0,0,1,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1159647238806630400,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '4344465795', 'username': 'JeannieClarkson'}, {'user_id': '14317735', 'username': 'Careynieuwhof'}]",
3053,1159347867657465856,1155671774580944897,1565245045000,2019-08-08,06:17:25,UTC,788477216535425024,boomer12k,Happiness Today,,"The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings...","['wxmangrosshans', 'drvernig']",[],[],0,0,1,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1159347867657465856,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '56207784', 'username': 'wxmangrosshans'}, {'user_id': '1164040386', 'username': 'drvernig'}]",
3055,1159213232025821184,1159204825684488192,1565212946000,2019-08-07,21:22:26,UTC,788477216535425024,boomer12k,Happiness Today,,"Results MATTER... The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings... pic.twitter.com/DXU0Micp8C",['namicommunicate'],[],['https://pbs.twimg.com/media/EBZaSZYUcAAE94u.jpg'],0,0,1,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1159213232025821184,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '24178302', 'username': 'NAMICommunicate'}]",
3058,1158974480544256005,1158954038509150208,1565156023000,2019-08-07,05:33:43,UTC,788477216535425024,boomer12k,Happiness Today,,"Well.. I hope that helps... The technique has brought me GOOD results. I can stop #suicidal thoughts, #flashbacks, #panic attacks, #sadness, #grief, #guilt, #worthlessness, #hopelessness, and more. Best Wishes from a researcher who has stood at the Abyss. Blessings...",['amybarnhorst'],[],[],0,0,0,"['#suicidal', '#flashbacks', '#panic', '#sadness', '#grief', '#guilt', '#worthlessness', '#hopelessness']",[],https://twitter.com/boomer12k/status/1158974480544256005,False,,0,,,,,,,"[{'user_id': '788477216535425024', 'username': 'boomer12k'}, {'user_id': '2612653424', 'username': 'amybarnhorst'}]",


## 2. Filtering out the relevant rows

**Ideas for cleaning / filtering**
1. Remove entries that contain positive or medical-sounding tags.
2. Remove entries with more than three hashtags, as they may be promotional messages.
3. Remove entries with @-mentions, as they may be promotional messages.
4. Remove entries with fewer than x characters or words.
5. Remove entries containing URLs, as they are also likely to be promotional messages.
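The cells that follow apply these ideas one at a time, working directly on the string form of each list column. As an alternative sketch, the example below parses those strings into real lists with `ast.literal_eval` and combines all five checks into one mask. The toy data, `excluded_tags`, `MAX_TAGS` and `MIN_CHARS` are assumptions made for illustration. Note also that this version matches tags exactly, whereas the notebook's check is a substring match.

```python
import ast

import pandas as pd

# Toy data mimicking the scraped CSV, where list columns are strings.
toy = pd.DataFrame({
    "tweet": [
        "I just feel so alone and tired of everything #depressed",
        "Be happy today! #happy #depression",
        "Check this out https://example.com #depressed",
        "#depressed",
        "@friend thanks for listening to me last night #loneliness",
    ],
    "hashtags": ["['#depressed']", "['#happy', '#depression']",
                 "['#depressed']", "['#depressed']", "['#loneliness']"],
    "mentions": ["[]", "[]", "[]", "[]", "['friend']"],
    "urls": ["[]", "[]", "['https://example.com']", "[]", "[]"],
})

# Parse the string representation of each list into a real Python list.
for col in ["hashtags", "mentions", "urls"]:
    toy[col] = toy[col].apply(ast.literal_eval)

# Assumed parameters, chosen only for this example.
excluded_tags = {"#happy", "#mentalhealth"}
MAX_TAGS, MIN_CHARS = 3, 25

keep = (
    toy["hashtags"].apply(lambda tags: excluded_tags.isdisjoint(tags))  # idea 1
    & toy["hashtags"].apply(len).le(MAX_TAGS)                           # idea 2
    & toy["mentions"].apply(len).eq(0)                                  # idea 3
    & toy["tweet"].str.len().gt(MIN_CHARS)                              # idea 4
    & toy["urls"].apply(len).eq(0)                                      # idea 5
)
filtered = toy[keep]

print(filtered["tweet"].tolist())
```

Applying the masks one by one, as the notebook does, makes it easier to inspect what each step removes; a combined mask is more compact once the rules are settled.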

In [None]:
selection_to_remove = ["#mentalhealth", "#health", "#happiness", "#mentalillness", "#happy", "#joy", "#wellbeing"]

In [None]:
# 1. remove entries that contain positive or medical-sounding tags
# (hashtags is stored as a string, so this is a substring match)
mask1 = df_all.hashtags.apply(lambda x: any(tag in x for tag in selection_to_remove))
df_all[mask1].tweet.tail()

1985    2015: when music destroyed #mentalhealth stigma  http://goo.gl/52eKru  #despair #depression #anxiety #suicide #bipolar via .@guardian                                                                                 
1986    Be happy in 2016. Enjoy a special #HealthyMeSummit with @taniadejong #depression & #anxiety  http://ow.ly/W0387   http://fb.me/3rRZ5rnxX                                                                              
1987    Be happy in 2016. Enjoy a special #HealthyMeSummit with @taniadejong #depression & #anxiety  http://ow.ly/W0387  pic.twitter.com/b0y5KcstCe                                                                           
1990    RT mc1748 When words don't work, #arts program can help heal #veterans  http://strib.mn/1mPKarx  #PTSD #MentalHealth #NAMI #depression #anxi…                                                                         
1991    Debunking the myth that #suicides increase over the holiday season  http://nymag.com/scienceofus/201

In [None]:
# review the result of removing certain tags
df_all[~mask1].tweet.head(10)

0     Manakit ng babae kc lahat sila ng aano sakin eh #depressed HAHAHAHA                                                                                                                                                                                                                                      
1     #Depressed mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.                                                                                                                                                                     
2     #Iyak #depressed #blade                                                                                                                                                                                                                                                                                  
3     Go listen to our latest track <3   https://soundcloud.com/ajyr/like-that-prodtaylo

In [None]:
# the results above look okay, so let's apply mask1
df_all = df_all[~mask1]
len(df_all)

4924

In [None]:
# 2. remove entries with more than three hashtags, as they may be promotional messages
# (each tag in the string contributes one "#")
mask2 = df_all.hashtags.apply(lambda x: x.count("#") < 4)

In [None]:
df_all = df_all[mask2]

In [None]:
len(df_all)

2768

In [None]:
df_all.head()


Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
0,1163051752083488768,1162721526644260864,1566128120000,2019-08-18,11:35:20,UTC,1154343391578091520,las733re,Las733re,,Manakit ng babae kc lahat sila ng aano sakin eh #depressed HAHAHAHA,"['jullanayysbel', 'keisha_nemd']",[],[],0,0,0,['#depressed'],[],https://twitter.com/las733re/status/1163051752083488768,False,,0,,,,,,,"[{'user_id': '1154343391578091520', 'username': 'las733re'}, {'user_id': '993896021536079872', 'username': 'jullanayysbel'}, {'user_id': '978525852412465153', 'username': 'keisha_neMD'}]",
1,1163050916330770433,1163050916330770433,1566127921000,2019-08-18,11:32:01,UTC,1062143056017846273,lowerdepression,Lower Depression,,"#Depressed mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.",[],[],[],0,0,0,['#depressed'],[],https://twitter.com/LowerDepression/status/1163050916330770433,False,,0,,,,,,,"[{'user_id': '1062143056017846273', 'username': 'LowerDepression'}]",
2,1163050768221425665,1163048181409599488,1566127886000,2019-08-18,11:31:26,UTC,740917930104262656,aliyanolasco_,liah,,#Iyak #depressed #blade,"['httpliaaah', 'ravenanaaa', 'je']",[],[],1,0,1,"['#iyak', '#depressed', '#blade']",[],https://twitter.com/aliyanolasco_/status/1163050768221425665,False,,0,,,,,,,"[{'user_id': '740917930104262656', 'username': 'aliyanolasco_'}, {'user_id': '901042136509894656', 'username': 'httpliaaah'}, {'user_id': '1042036723402854400', 'username': 'ravenanaaa'}, {'user_id': '29068692', 'username': 'je'}]",
4,1163039493806444545,1163039493806444545,1566125198000,2019-08-18,10:46:38,UTC,4467577942,naijabin,Naija Bin,,#Depressed Man Kills Himself By Jumping Inside A Well In Enugu (Watch Video) - https://naijabin.com/depressed-man-kills-himself-by-jumping-inside-a-well-in-enugu-watch-video/ … pic.twitter.com/jFTVw80Biy,[],['https://naijabin.com/depressed-man-kills-himself-by-jumping-inside-a-well-in-enugu-watch-video/'],['https://pbs.twimg.com/media/ECPyQZeU8AA982X.jpg'],0,0,0,['#depressed'],[],https://twitter.com/NaijaBin/status/1163039493806444545,False,,0,,,,,,,"[{'user_id': '4467577942', 'username': 'NaijaBin'}]",
6,1163030382360629248,1163030382360629248,1566123025000,2019-08-18,10:10:25,UTC,2468633587,chrisbontheweb,Chris,,"With all of this unnessary family drama, I feel like moving far away and starting over again. From one thing to another I just feel #depressed. Hope I get through this",[],[],[],0,0,0,['#depressed'],[],https://twitter.com/ChrisBOnTheWeb/status/1163030382360629248,False,,0,,,,,,,"[{'user_id': '2468633587', 'username': 'ChrisBOnTheWeb'}]",


In [None]:
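# 3. remove tweets with @-mentions, as they are sometimes retweets
# (mentions is stored as a string: an empty list "[]" has length 2,
# while any non-empty list has at least 5 characters)
mask3 = df_all.mentions.apply(lambda x: len(x) < 5)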

In [None]:
df_all = df_all[mask3]

In [None]:
len(df_all)

2172

In [None]:
# let's check the hashtags value counts again
df_all.hashtags.value_counts().head(20)

['#depressed']                                        453
['#depression']                                       217
['#loneliness']                                       137
['#hopelessness']                                     72 
['#onlinetherapy', '#anxiety', '#depression']         48 
['#sadness', '#hopelessness', '#battledepression']    44 
['#anxiety', '#depression']                           21 
['#depression', '#anxiety']                           16 
['#mindfulness', '#anxiety', '#depression']           13 
['#depression', '#depressed']                         12 
['#tms', '#depression']                               11 
['#depressed', '#stressed', '#alone']                 10 
['#depression', '#helpme', '#iwantpeace']             10 
['#sad', '#depressed']                                10 
['#stoner', '#instahookah', '#depressed']             8  
['#loneliness', '#depression']                        7  
['#depressed', '#anxious']                            6  
['#sleep', '#d

In [None]:
df_all.tweet.tail(10)

1964    #DEPRESSION                                                                                                                                                                                                                    
1965    ur best is plenty good enough 4 anyone or anything that is meant 4U😊Don't let ppl nor circumstances kill you😘#suicideprevention #depression                                                                                    
1968    RT talkspace #Depression costs companies $52 billion/year in absenteeism & reduced productivity; results in 400 million lost work days/year…                                                                                   
1977    Sleep is extremely important, and for this author, regulating #sleep is what finally relieved his #depression --  http://buff.ly/1QVI7yW                                                                                       
1978    Stemming the Tide of #Depression with Transcendental #Meditation

In [None]:
# 4. remove entries with fewer than x chars / words

In [None]:
# keep only tweets longer than 25 characters
mask4a = df_all.tweet.apply(lambda x: len(x) > 25)


In [None]:
df_all = df_all[mask4a]
len(df_all)

2102

In [None]:
# keep only tweets with more than 5 spaces, as a rough word-count check
mask4b = df_all.tweet.apply(lambda x: x.count(" ") > 5)

In [None]:
df_all = df_all[mask4b]
len(df_all)

1921

In [None]:
df_all.tweet

1       #Depressed mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.                                                                                                                                                                               
4       #Depressed Man Kills Himself By Jumping Inside A Well In Enugu (Watch Video) -  https://naijabin.com/depressed-man-kills-himself-by-jumping-inside-a-well-in-enugu-watch-video/ … pic.twitter.com/jFTVw80Biy                                                                                                       
6       With all of this unnessary  family drama, I feel like moving far away and starting over again. From one thing to another I just feel #depressed. Hope I get through this                                                                                                                                           
7       Stress na nga sa bahay, stress pa sa school😔

In [None]:
# 5. remove entries containing urls, as they are likely to be promotional messages
# (urls is stored as a string, so an empty list "[]" has length 2)
mask5 = df_all.urls.apply(lambda x: len(x) < 5)

In [None]:
# let's have a look at what we will be removing from the dataset
df_all[~mask5].tweet.head(10), df_all[~mask5].tweet.tail(10)

(4      #Depressed Man Kills Himself By Jumping Inside A Well In Enugu (Watch Video) -  https://naijabin.com/depressed-man-kills-himself-by-jumping-inside-a-well-in-enugu-watch-video/ … pic.twitter.com/jFTVw80Biy                                                                                                          
 17     I stay home EVERYDAY and everybody knows this 😞 #Depressed  https://twitter.com/litlikezayy/status/1162967050593214464 …                                                                                                                                                                                              
 20     7 Tips for Anyone Who Gets #Depressed in the #Summer. https://www.self.com/story/summer-depression-tips …                                                                                                                                                                                                             
 22     I'm #depressed and I wont do shit a

The output above shows that tweets with URLs are indeed more likely to be promotional, informational or educational messages that do not reflect the user's actual emotional state. They can therefore be removed (or marked as negative examples).
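The next cell drops these rows. If they were instead kept as negative examples, the labelling could look like the sketch below. The toy data and the choice of `0` for the negative class are assumptions for illustration, and any such label would still need the manual review mentioned in the introduction.

```python
import pandas as pd

# Toy data: urls is a string column, as in the scraped CSV.
toy = pd.DataFrame({
    "tweet": [
        "I feel so low today",
        "7 tips for summer https://example.com/tips",
    ],
    "urls": ["[]", "['https://example.com/tips']"],
})

# Same length check as mask5: the string "[]" has length 2, so a value
# with 5 or more characters holds at least one URL.
has_url = toy["urls"].str.len() >= 5

# Assumption: label URL tweets 0 (negative) instead of dropping them.
toy["target"] = (~has_url).astype(int)

print(toy[["tweet", "target"]])
```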

In [None]:
df_all = df_all[mask5]
len(df_all)

1102

## 3. Finally, let's create a column containing the tweet text with all hashtags removed

This column can be used as input to the model, or sent to other software for further emotional and linguistic analysis. The idea is that, with the hashtags removed, the model and the software must examine the text itself to determine whether the actual emotion is negative and indicative of depression.
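The next cell removes hashtags with a single `re.sub` call, which can leave gaps such as `feel . Hope` where a tag used to be. The sketch below shows an optional variant that also tidies the leftover whitespace; the helper `strip_hashtags` and its cleanup steps are assumptions, not part of the original notebook.

```python
import re


def strip_hashtags(text: str) -> str:
    """Remove hashtags, then tidy the whitespace they leave behind.

    The hashtag pattern matches the notebook's; the two cleanup
    substitutions are an assumed extra step.
    """
    no_tags = re.sub(r"#\w+", "", text)
    # Close the gap before punctuation: "feel ." becomes "feel."
    no_tags = re.sub(r"\s+([.,!?])", r"\1", no_tags)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s{2,}", " ", no_tags).strip()


print(strip_hashtags("I just feel #depressed. Hope I get through this"))
print(strip_hashtags("stress pa sa school #doublekill #depressed"))
```

Whether to tidy punctuation is a judgment call: it gives cleaner input for downstream tools, but it also alters the original text slightly more than plain removal does.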

In [None]:
df_all["mod_text"] = df_all["tweet"].apply(lambda x: re.sub(r'#\w+', '', x))

In [None]:
df_all.mod_text.head(15), df_all.mod_text.tail(15)

(1      mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.                                                                                                                                                                       
 6     With all of this unnessary  family drama, I feel like moving far away and starting over again. From one thing to another I just feel . Hope I get through this                                                                                                                                   
 7     Stress na nga sa bahay, stress pa sa school😔                                                                                                                                                                                                                                                     
 8     Step 1.  Anfangen, richtig zu essen. Nicht zu wenig, nicht zu viel. & am besten ausgewogen.  Damit ich

In [None]:
# let's check the hashtags value counts again
df_all.hashtags.value_counts().head(20)

['#depressed']                               296
['#depression']                              110
['#loneliness']                              78 
['#hopelessness']                            21 
['#depressed', '#stressed', '#alone']        10 
['#sad', '#depressed']                       9  
['#depression', '#anxiety']                  9  
['#stoner', '#instahookah', '#depressed']    8  
['#depression', '#depressed']                6  
['#tms', '#depression']                      6  
['#depression', '#helpme', '#iwantpeace']    5  
['#lonely', '#depressed']                    4  
['#depressed', '#lonely']                    4  
['#anxiety', '#depression']                  4  
['#depressed', '#anxious']                   4  
['#depressed', '#positive']                  3  
['#ptsd', '#depression']                     3  
['#depression', '#notjustsad']               3  
['#loneliness', '#depression']               3  
['#depressed', '#sad']                       3  
Name: hashtags, dtyp

In [None]:
df_all.columns

Index(['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
       'user_id', 'username', 'name', 'place', 'tweet', 'mentions', 'urls',
       'photos', 'replies_count', 'retweets_count', 'likes_count', 'hashtags',
       'cashtags', 'link', 'retweet', 'quote_url', 'video', 'near', 'geo',
       'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to',
       'retweet_date', 'mod_text'],
      dtype='object')

In [None]:
col_list = ["id", "conversation_id", "date", "username", "mod_text", "hashtags", "tweet"]

In [None]:
df_final1 = df_all[col_list]
df_final1 = df_final1.rename(columns={"mod_text": "tweet_processed", "tweet": "tweet_original"})


In [None]:
df_final1["target"] = 1

In [None]:
df_final1.head()

Unnamed: 0,id,conversation_id,date,username,tweet_processed,hashtags,tweet_original,target
1,1163050916330770433,1163050916330770433,2019-08-18,lowerdepression,"mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.",['#depressed'],"#Depressed mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.",1
6,1163030382360629248,1163030382360629248,2019-08-18,chrisbontheweb,"With all of this unnessary family drama, I feel like moving far away and starting over again. From one thing to another I just feel . Hope I get through this",['#depressed'],"With all of this unnessary family drama, I feel like moving far away and starting over again. From one thing to another I just feel #depressed. Hope I get through this",1
7,1163028021244133376,1163028021244133376,2019-08-18,kimberlybenedi5,"Stress na nga sa bahay, stress pa sa school😔","['#doublekill', '#depressed']","Stress na nga sa bahay, stress pa sa school😔 #doublekill #depressed",1
8,1163027065463087104,1163027065463087104,2019-08-18,ag0n1z3d,"Step 1. Anfangen, richtig zu essen. Nicht zu wenig, nicht zu viel. & am besten ausgewogen. Damit ich dann die nötige Kraft habe, um den Tag zu überstehen. In der letzten Zeit war ich viel zu schwach. Das muss sich ändern.",['#depressed'],"Step 1. Anfangen, richtig zu essen. Nicht zu wenig, nicht zu viel. & am besten ausgewogen. Damit ich dann die nötige Kraft habe, um den Tag zu überstehen. In der letzten Zeit war ich viel zu schwach. Das muss sich ändern. #depressed",1
11,1163020226977386497,1163020226977386497,2019-08-18,wildfoxtherapy,"I'm going to keep banging on about this, cos it's true. What you focus on, you get more of. Stop telling yourself you're or . Tell yourself you're happy, strong, confident, powerful. Not only cos you ARE, but cos your brilliant mind listens to what you tell it. pic.twitter.com/gBQn7yEjsJ","['#depressed', '#anxious']","I'm going to keep banging on about this, cos it's true. What you focus on, you get more of. Stop telling yourself you're #depressed or #anxious. Tell yourself you're happy, strong, confident, powerful. Not only cos you ARE, but cos your brilliant mind listens to what you tell it. pic.twitter.com/gBQn7yEjsJ",1


In [None]:
len(df_final1) 

1102

In [None]:
df_final1_1 = df_final1[:400]
df_final1_2 = df_final1[400:800]
df_final1_3 = df_final1[800:]
len(df_final1_1), len(df_final1_2), len(df_final1_3)

(400, 400, 302)
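
The three slices above hard-code the boundaries for 1102 rows. As a sketch of the same split for any length, the hypothetical helper `split_into_chunks` below cuts a sequence into consecutive pieces of at most `chunk_size` items. It relies only on `len()` and slicing, so it should also accept a DataFrame, where `[a:b]` selects rows.

```python
def split_into_chunks(rows, chunk_size):
    """Return consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [rows[start:start + chunk_size]
            for start in range(0, len(rows), chunk_size)]


# A plain list stands in for the 1102-row DataFrame
chunks = split_into_chunks(list(range(1102)), 400)
print([len(chunk) for chunk in chunks])  # [400, 400, 302]
```

Each chunk could then be written to its own CSV file in a loop, which keeps the review batches the same size if the dataset grows.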

In [None]:
df_final1.to_csv("/content/gdrive/My Drive/data/tweets_final.csv")

In [None]:
df_final1_1.to_csv("/content/gdrive/My Drive/data/tweets_final_1.csv")
df_final1_2.to_csv("/content/gdrive/My Drive/data/tweets_final_2.csv")
df_final1_3.to_csv("/content/gdrive/My Drive/data/tweets_final_3.csv")

In [None]:
df_all.to_csv("/content/gdrive/My Drive/data/tweets_v3.csv")

In [None]:
users = df_all.username

In [None]:

content = {}
for i in users:
    # Fetch up to 100 "#depressed" tweets per user and append them to one CSV file
    c = twint.Config()
    c.Search = "#depressed"
    c.Username = i  # restrict the search to this user's tweets
    c.Format = "Tweet id: {id} | Tweet: {tweet}"
    c.Limit = 100
    c.Store_csv = True
    c.Store_object = True  # twint's option is spelled Store_object
    c.Output = "/content/gdrive/My Drive/data/dataset_v3.csv"
    c.Hide_output = True
    c.Stats = True
    c.Lowercase = True
    twint.run.Search(c)
    
#     tweets = twint.output.tweets_list()
#     print(tweets)
#     for tweet in tweets:
#     # then iterate over the hashtags of that single tweet
#         for t in tweet.tweet:
#         # increment the count if the hashtag already exists, otherwise initialize it to 1
#             if tweet.username in content:
#                 content[tweet.username].append(t)
#             else:
#                 content[tweet.username] = []
#                 content[tweet.username].append(t)
        
    print(i)
#     print(content)
#     with open('dataset.csv', 'w') as output:
#         output.write('username, tweet\n')
#         for user in content:
#             for h in content[user]:
#                 output.write('{},{}\n'.format(user, content[user][h]))
    

ag0n1z3d
simonblue16
puffpuffnpass1
lowerdepression
bobymcboby
_arxn_
depressedaunty
joshstebbins2
hokey_hoke18
ericsequeira
hunterwastaken
nick63360
rimrod007
nick63360
lowlifekev
celerglersk
wildfoxtherapy
epicgabe
samanthajoule
paklongmail1
al__zaainn
janusha61949990
friedonbusiness
sadtimes0813
semsannen_
maudlinmuse
sadtimes0813
_bluenightx
puffpuffnpass1
masederealwolf
ilyseroyal
amishman9000
goodboypaden
sadtimes0813
jameswifties
briannakole19
shy91771526
aleuthemermaid
gracie_m721
lena38348916
katrinamunoz18
clarenstro
hashtagsaloobin
hashtagsaloobin
hashtagsaloobin
nctzoozeus
vaporaccessshop
masederealwolf
delzharina
hulk27watkins
therabbitchu
wildfoxtherapy
little_red2596
siddiqbetrayer
dark_swan
semsannen_
mozenkoffmich
badassid
naveentp36cq
reeteshkhadgi
trillasahbella
richerd2020
lowerdepression
airametuc09
paddasumeet
joshlaioloplays
lowerdepression
lowerdepression
wendy_ellas
alttheoalt
darkymishi
gracie_m721
chrisbontheweb
lisamonique_04
alyssamnunez
mickirei
mickirei
m

KeyboardInterrupt: ignored
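
The printed names above contain repeats (for example `nick63360` and `sadtimes0813`), because `df_all.username` has one entry per tweet, so the same user is scraped several times. A minimal sketch of removing duplicates while keeping first-seen order is shown below. `unique_in_order` is a hypothetical helper and the sample list is illustrative.

```python
def unique_in_order(names):
    """Drop repeated names, keeping the first occurrence of each."""
    # dict preserves insertion order, so fromkeys de-duplicates in order
    return list(dict.fromkeys(names))


sample_users = ["ag0n1z3d", "nick63360", "rimrod007", "nick63360"]
print(unique_in_order(sample_users))
```

Looping over `unique_in_order(df_all.username)` instead of the raw column would issue one search per user.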

In [None]:
help(twint.output.tweets_list)

Help on list object:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate sign