# **Preprocessing**
This notebook preprocesses the scraped data by combining posts with their comments. The raw data are stored in `JSON` format and need to be transformed into data frames for further analysis.

In [None]:
import pandas as pd
import numpy as np
import json
from ast import literal_eval
import multiprocess as mp

### 1. Load submission and comment data from the `txt` files.

In [None]:
subs = []
comments = []

# placeholder values that mark removed, deleted, or empty text
skip_values = {'[removed]', '[deleted]', ''}

# load submissions data, parsing each line only once
with open('depressed_submission.txt') as file:
    for line in file:
        record = literal_eval(line)
        if record['selftext'] not in skip_values:
            subs.append(record)

# load comment data, parsing each line only once
with open('depressed_comment.txt') as file:
    for line in file:
        record = literal_eval(line)
        if record['body'] not in skip_values:
            comments.append(record)
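
The loader reads each line with `literal_eval`, which suggests the dump stores one Python dict literal per line. The following is a minimal sketch of parsing a line once and filtering on the result. The helper name `keep_record` and the sample lines are invented for illustration and are not part of the original notebook.

```python
from ast import literal_eval

# Placeholder values that mark unusable text (same values the loader filters on).
PLACEHOLDERS = {'[removed]', '[deleted]', ''}

def keep_record(line, text_key):
    """Parse one line and return the record, or None if its text is unusable.

    Hypothetical helper for illustration only.
    """
    record = literal_eval(line)  # parse once, then reuse the result
    if record.get(text_key) in PLACEHOLDERS:
        return None
    return record

# Made-up sample lines in the same dict-literal shape as the dump.
sample_lines = [
    "{'id': 'a1', 'selftext': 'some text'}",
    "{'id': 'a2', 'selftext': '[removed]'}",
    "{'id': 'a3', 'selftext': ''}",
]
parsed = (keep_record(line, 'selftext') for line in sample_lines)
kept = [record for record in parsed if record is not None]
print([record['id'] for record in kept])
```

Parsing once per line matters mostly for large dumps, where repeated `literal_eval` calls dominate the load time.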

In [None]:
# convert the records to dataframes, keeping only the columns we need
posts_df = pd.DataFrame(subs)[['id', 'title', 'author', 'subreddit', 'selftext']]
display(posts_df.head())
comments_df = pd.DataFrame(comments)[['id', 'author', 'link_id', 'parent_id', 'body']]
display(comments_df.head())

Unnamed: 0,id,title,author,subreddit,selftext
0,j4jr9d,I have become hollow from inside.,VeeerrWho,depressed,"I have never received a ""surprise birthday pre..."
1,j4j99c,Losing myself,Salty-Strain-7322,depressed,I think I’ve lost myself as a person . I can’t...
2,j4isu3,Not even the worst day of my life,Paruffy,depressed,Today my mom rejected the birthdaygift I spend...
3,j4ipcc,Can someone help me to understand what Is goin...,kraiovida,depressed,I have been diagnosed with severe depression f...
4,j4ih5r,Pain,Ice_and_water50,depressed,My will to live is minor and my hope to die so...


Unnamed: 0,id,author,link_id,parent_id,body
0,g7j6k0e,livongwaters,t3_j4hy93,t3_j4hy93,please don’t go my guy
1,g7j60h8,whoiamidonotknow,t3_j4h8sg,t1_g7j220r,I'm so sorry to hear that this and all else ha...
2,g7j5ptc,_anxious_lemon,t3_j4hy93,t3_j4hy93,Heyy you still there?
3,g7j5074,Every-Common,t3_j4hy93,t1_g7j4o8f,"Yes, I don't know how people can handle divorc..."
4,g7j4wl2,_anxious_lemon,t3_j4hy93,t1_g7j4ptq,And what are your interests?


In [None]:
# rename columns in both dataframes
posts_df.rename(columns = {"id": "link_id", "author": "post_author", "selftext": "post_content"}, inplace = True)
comments_df.rename(columns = {"id": "comment_id", "author": "comment_author", "body": "comment_content"}, inplace = True)

# strip the type prefix (e.g. 't3_', 't1_') from the ids
comments_df.link_id = comments_df.link_id.apply(lambda x: x.split('_')[1])
comments_df.parent_id = comments_df.parent_id.apply(lambda x: x.split('_')[1])
comments_df.head()

Unnamed: 0,comment_id,comment_author,link_id,parent_id,comment_content
0,g7j6k0e,livongwaters,j4hy93,j4hy93,please don’t go my guy
1,g7j60h8,whoiamidonotknow,j4h8sg,g7j220r,I'm so sorry to hear that this and all else ha...
2,g7j5ptc,_anxious_lemon,j4hy93,j4hy93,Heyy you still there?
3,g7j5074,Every-Common,j4hy93,g7j4o8f,"Yes, I don't know how people can handle divorc..."
4,g7j4wl2,_anxious_lemon,j4hy93,g7j4ptq,And what are your interests?
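
Reddit prefixes ids with a type tag: `t3_` for a post and `t1_` for a comment, as the `link_id` and `parent_id` columns show. A small sketch of stripping that tag follows. `strip_type_prefix` is a hypothetical helper that, unlike `x.split('_')[1]`, leaves an id without a prefix unchanged instead of raising an `IndexError`.

```python
def strip_type_prefix(fullname):
    """Drop a leading type tag such as 't1_' or 't3_' from a Reddit id.

    Hypothetical helper for illustration; ids without a prefix pass through unchanged.
    """
    _prefix, sep, rest = fullname.partition('_')
    return rest if sep else fullname

print(strip_type_prefix('t3_j4hy93'))   # post id
print(strip_type_prefix('t1_g7j220r'))  # comment id
print(strip_type_prefix('j4hy93'))      # already stripped
```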


In [None]:
# merge post df and comment df
df = posts_df.merge(comments_df, on = 'link_id')

# rearrange by link_id (:= post) and parent_id (:= thread)
df.sort_values(by = ['link_id', 'parent_id'], inplace = True)
df.head(10)

Unnamed: 0,link_id,title,post_author,subreddit,post_content,comment_id,comment_author,parent_id,comment_content
50105,106igx,Too many things at once,[deleted],depressed,too many things gone wrong at once: *horrible*...,c6bedgs,spark0,106igx,Is it weird that I as a guy feel the same way ...
50106,106igx,Too many things at once,[deleted],depressed,too many things gone wrong at once: *horrible*...,c6b98jw,[deleted],106igx,"you ever get that, where you look in a mirror ..."
50107,106igx,Too many things at once,[deleted],depressed,too many things gone wrong at once: *horrible*...,c6b4wva,bacon_nuts,106igx,You have to separate yourself from the things ...
50104,106igx,Too many things at once,[deleted],depressed,too many things gone wrong at once: *horrible*...,c6bl817,[deleted],c6bedgs,*hugs*
50102,10hy0b,At least,kelso408,depressed,no one can see my face on Reddit. Much love.,c6fdayt,AtooZ,10hy0b,Ok
50103,10hy0b,At least,kelso408,depressed,no one can see my face on Reddit. Much love.,c6domkb,MykeHawk,10hy0b,&lt;3
50096,10nre1,Lots of people posting here tonight.. Don't mi...,[deleted],depressed,I've been pretty down lately.. Old feelings su...,c6fb43e,MykeHawk,10nre1,Hey bud! Hang in there :)\n\nMaybe you could s...
50100,10nre1,Lots of people posting here tonight.. Don't mi...,[deleted],depressed,I've been pretty down lately.. Old feelings su...,c6f4udo,Unixchaos,10nre1,Sorry your feeling down and I couldn't say hel...
50101,10nre1,Lots of people posting here tonight.. Don't mi...,[deleted],depressed,I've been pretty down lately.. Old feelings su...,c6f1vs4,[deleted],10nre1,I can't even get a damn comment... I guess I s...
50099,10nre1,Lots of people posting here tonight.. Don't mi...,[deleted],depressed,I've been pretty down lately.. Old feelings su...,c6f4up0,[deleted],c6f4udo,I live in a small town.. There isn't anything ...
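
Sorting on `parent_id` groups comments that share a parent, but a reply's `parent_id` is a comment id, so replies are not necessarily placed directly under the top-level comment they belong to. One possible way to recover the thread of each comment is to follow the parent links up to the top-level comment. This is only a sketch under that assumption. `thread_root` is a hypothetical helper, and the ids are taken from the output above.

```python
def thread_root(comment_id, parent_of, post_id):
    """Follow parent links until reaching the top-level comment of the thread.

    Returns None when the chain leaves the data, e.g. because a parent
    comment was filtered out earlier. Hypothetical helper for illustration.
    """
    current = comment_id
    seen = set()  # guards against accidental cycles
    while current in parent_of and current not in seen:
        seen.add(current)
        parent = parent_of[current]
        if parent == post_id:
            return current  # parent is the post, so this is a top-level comment
        current = parent
    return None

# comment_id -> parent_id for the post '106igx'
parent_of = {
    'c6bedgs': '106igx',
    'c6b98jw': '106igx',
    'c6b4wva': '106igx',
    'c6bl817': 'c6bedgs',
}
roots = {cid: thread_root(cid, parent_of, '106igx') for cid in parent_of}
print(roots['c6bl817'])
```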


In [None]:
# remove deleted post contents or comment contents
df = df[(df.post_content != '[deleted]') & (df.post_content != '[removed]') & (df.comment_content != '[deleted]') & (df.comment_content != '[removed]')]
df = df[(df.post_content.str.len() != 0) & (df.comment_content.str.len() != 0)]
df.dropna(subset = ['post_content', 'comment_content'], inplace = True)

# remove deleted author names for both post and comment
df = df[(df.post_author != '[deleted]') & (df.post_author != '[removed]') & (df.comment_author != '[deleted]') & (df.comment_author != '[removed]')]
df = df[(df.post_author.str.len() != 0) & (df.comment_author.str.len() != 0)]
df.dropna(subset = ['post_author', 'comment_author'], inplace = True)
df = df.reset_index().drop('index', axis = 1)
df

Unnamed: 0,link_id,title,post_author,subreddit,post_content,comment_id,comment_author,parent_id,comment_content
0,10hy0b,At least,kelso408,depressed,no one can see my face on Reddit. Much love.,c6fdayt,AtooZ,10hy0b,Ok
1,10hy0b,At least,kelso408,depressed,no one can see my face on Reddit. Much love.,c6domkb,MykeHawk,10hy0b,&lt;3
2,10smhx,No one to talk to and hate my situation,bondlegolas,depressed,I'm still 17 going to community college (becau...,c7kb9pa,meaningfuluse2,10smhx,Don't let your school work suffer rather rewar...
3,10smhx,No one to talk to and hate my situation,bondlegolas,depressed,I'm still 17 going to community college (becau...,c6gm57o,bacon_nuts,10smhx,I know how you feel with friends moving away. ...
4,10v1hu,I obviously can't tell anyone I actually know ...,RWN406,depressed,"So I'm a single, 21 year old college dropout, ...",c6h4kcw,Willisis2,10v1hu,"I think you shouldn't be so caught up on ""Will..."
...,...,...,...,...,...,...,...,...,...
49556,zm26b,I don't know what to do anymore...,InsertRudeWordsHere,depressed,I used to be an A student and have loads of fr...,c667u09,MykeHawk,zm26b,"Hey, i have been in a similar state and i know..."
49557,zr77m,Hey everyone -- You've Got This,shuriken36,depressed,"Hey all,\n\nI don't post a ton, but I've been ...",c8rmyyq,BeatThaOdds,zr77m,aint got shit
49558,zr77m,Hey everyone -- You've Got This,shuriken36,depressed,"Hey all,\n\nI don't post a ton, but I've been ...",c77e17y,18janselau,zr77m,Thanks :') my extremely shitty day has gotten ...
49559,zr77m,Hey everyone -- You've Got This,shuriken36,depressed,"Hey all,\n\nI don't post a ton, but I've been ...",c6f2rpw,Mominator,zr77m,Much love to you
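
The same cleaning rule (drop a row if any author or content field is missing, empty, or a `[deleted]`/`[removed]` placeholder) can also be written as a single mask. This is a sketch on a made-up toy frame, not a replacement for the cell above.

```python
import pandas as pd

PLACEHOLDERS = ['[deleted]', '[removed]', '']

# Toy data for illustration only.
toy = pd.DataFrame({
    'post_author': ['alice', '[deleted]', 'carol', 'dave'],
    'post_content': ['text', 'text', '[removed]', 'text'],
    'comment_author': ['bob', 'bob', 'bob', None],
    'comment_content': ['hi', 'hi', 'hi', 'hi'],
})
cols = ['post_author', 'post_content', 'comment_author', 'comment_content']

# A cell is bad if it is a placeholder or missing; a row is dropped if any cell is bad.
bad = toy[cols].isin(PLACEHOLDERS) | toy[cols].isna()
clean = toy[~bad.any(axis=1)].reset_index(drop=True)
print(clean['post_author'].tolist())
```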


In [None]:
# save dataframe
df.to_csv('depressed_df_convs.csv', index = False)

### 2. For large files, restart the notebook and load the saved dataframe directly to produce dyadic and multiparty conversations.

In [None]:
# open the processed dataframe in chunks, reading every column as a string
# (the `pd.np` alias is deprecated and no longer exists in pandas 2.0; the builtin `str` works everywhere)
dtypes = {'link_id': str,
          'title': str,
          'post_author': str,
          'subreddit': str,
          'post_content': str,
          'comment_id': str,
          'comment_author': str,
          'parent_id': str,
          'comment_content': str}
df_chunks = pd.read_csv('Anxietyhelp_df_convs.csv', dtype = dtypes, chunksize = 100000, engine = 'python')
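
Two details of this call are easy to miss: `chunksize` makes `read_csv` return an iterator of dataframes instead of one dataframe, and the explicit string dtypes keep ids that happen to look numeric from being parsed as numbers. The sketch below shows both on a made-up in-memory CSV.

```python
import io
import pandas as pd

# Made-up CSV whose ids look like numbers ('1e5', '007').
csv_text = "link_id,comment_id\n1e5,c1\n1e5,c2\n007,c3\n"

# With chunksize, read_csv yields dataframes of at most that many rows.
reader = pd.read_csv(io.StringIO(csv_text),
                     dtype={'link_id': str, 'comment_id': str},
                     chunksize=2)
chunks = list(reader)
print([len(chunk) for chunk in chunks])
print(chunks[1]['link_id'].iloc[0])   # the id is kept as text

# Without the dtype, the same column is inferred as numeric.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred['link_id'].tolist())
```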

  


In [None]:
def get_dyadic_convs(df):
    """
    From the original dataframe, build a sub-dataframe containing dyadic conversations
    between post authors and first comment authors, including only the first comment thread.

    Arg:
        df: The dataframe of posts and comments scraped from a given subreddit.
    Return:
        df_dyadic_convs: The dataframe containing only the dyadic conversations between
                         post author and comment author from the first thread.
    """
    # collect rows in a list; DataFrame.append was removed in pandas 2.0
    dyadic_rows = []

    # consider each post
    for link_id in df['link_id'].unique():
        df_link_id = df[df['link_id'] == link_id].reset_index(drop = True)

        if len(df_link_id) < 1:
            continue

        post_author = df_link_id.loc[0, 'post_author']
        first_comment_author = df_link_id.loc[0, 'comment_author']

        # top-level comments (parent is the post itself) mark the start of each thread
        thread_starts = df_link_id[df_link_id['link_id'] == df_link_id['parent_id']].index

        # consider only the first conversation thread between post author and comment author
        if len(thread_starts) > 1:
            first_thread_index = thread_starts[0]
            # .loc slicing is inclusive, so stop one row before the second thread starts
            first_thread = df_link_id.loc[first_thread_index:thread_starts[1] - 1, :]
        elif len(thread_starts) == 1:
            first_thread_index = thread_starts[0]
            first_thread = df_link_id.loc[first_thread_index:, :]
        else:
            continue

        # keep the opening comment, then only the replies written by the two participants
        dyadic_rows.append(first_thread.loc[first_thread_index, :])
        for i in first_thread.index[1:]:
            if first_thread.loc[i, 'comment_author'] in (post_author, first_comment_author):
                dyadic_rows.append(first_thread.loc[i, :])

    return pd.DataFrame(dyadic_rows)
        

def get_multi_convs(df):
    """
    From the original dataframe, build a sub-dataframe containing multiparty conversations
    between post authors and the comment authors in the longest thread.

    Arg:
        df: The dataframe of posts and comments scraped from a given subreddit.
    Return:
        df_multi_convs: The dataframe containing only the multiparty conversations from the longest threads.
    """
    # collect threads in a list; DataFrame.append was removed in pandas 2.0
    multi_threads = []

    # consider each post
    for link_id in df['link_id'].unique():
        df_link_id = df[df['link_id'] == link_id].reset_index(drop = True)

        if len(df_link_id) < 1:
            continue

        # top-level comments (parent is the post itself) mark the start of each thread
        post_indices = list(df_link_id[df_link_id['link_id'] == df_link_id['parent_id']].index)

        # include only the rows of the thread with the longest conversation
        if len(post_indices) > 1:
            # split the rows into threads; .loc slicing is inclusive, hence the -1
            bounds = post_indices[1:] + [len(df_link_id)]
            post_convs_list = [df_link_id.loc[start:end - 1, :]
                               for start, end in zip(post_indices, bounds)]

            # pick the longest conversation (the earliest one on ties)
            len_convs_list = list(map(len, post_convs_list))
            max_convs_ind = len_convs_list.index(max(len_convs_list))
            long_thread = post_convs_list[max_convs_ind].reset_index(drop = True)
        elif len(post_indices) == 1:
            long_thread = df_link_id.loc[post_indices[0]:, :]
        else:
            continue

        # a thread with a single comment is not a conversation
        if len(long_thread) > 1:
            multi_threads.append(long_thread)

    return pd.concat(multi_threads) if multi_threads else pd.DataFrame()
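
The thread selection in `get_multi_convs` comes down to cutting the rows of a post at every top-level comment and keeping the longest piece. A small worked example of that step on plain row positions follows. `longest_segment` is a hypothetical helper and assumes at least one start position.

```python
def longest_segment(n_rows, starts):
    """Cut range(n_rows) at the given start positions and return the longest piece.

    Ties go to the earliest piece, matching list.index(max(...)).
    Hypothetical helper; assumes `starts` is non-empty and sorted.
    """
    bounds = list(starts) + [n_rows]
    segments = [range(bounds[i], bounds[i + 1]) for i in range(len(starts))]
    lengths = [len(segment) for segment in segments]
    return segments[lengths.index(max(lengths))]

# 7 rows with top-level comments at rows 0, 1 and 5:
# the threads are rows [0], [1, 2, 3, 4] and [5, 6].
segment = longest_segment(7, [0, 1, 5])
print(list(segment))
```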

In [None]:
chunk_list_dyadic = []

with mp.Pool(16) as pool:
    pending = []
    for chunk in df_chunks:
        # preprocess each chunk
        # note: a post whose rows straddle a chunk boundary is split across two chunks
        filtered_chunk = chunk.drop_duplicates().dropna().reset_index(drop = True)

        # submit the chunk without waiting, so the workers can run in parallel
        pending.append(pool.apply_async(get_dyadic_convs, [filtered_chunk]))

    # collect the results in submission order
    chunk_list_dyadic = [result.get() for result in pending]


# construct dyadic conversations based on link_id, post_author, and comment_author
df_dyadic_convs = pd.concat(chunk_list_dyadic).reset_index(drop = True)
df_dyadic_convs

Unnamed: 0,comment_author,comment_content,comment_id,link_id,parent_id,post_author,post_content,subreddit,title
0,cldhrdfacts,Hey you seem like a really sweet and nice pers...,c698cof,1002a4,1002a4,throwaway930017,"This is a bit of a ramble, so be warned. \n\nW...",depression,"I've tried everything, but it's back."
1,SirBrutis,I don't mean this in a bad way but there are l...,c69bc1z,10036f,10036f,coleiscool,Yesterday i told me best friend (thats a girl)...,depression,On the verge of ending it.
2,freezyjelly,"Just a question, do you exercise? It's the big...",c69cytn,100373,100373,AwkwardRecluse,I don't even know why I'm doing this. I've jus...,depression,All I know how to be anymore is a giant useles...
3,dontyousassme,1.rotten fruit\n2.my messy hair\n3.children\n4...,c69rs06,1003tb,1003tb,opal--moon,* Sundays \n* Yellow sunlight\n* Homework\n* L...,depression,List of strange things I find depressing- Anyo...
4,paganel,Just hang on. Not because life is beautiful or...,c69dvsp,1004e7,1004e7,Rho7000,I don't know exactly how I feel or how to expl...,depression,I don't know why I bother to continue anymore....
...,...,...,...,...,...,...,...,...,...
849044,Waroftheages,But my brain is telling me none of this will h...,c698p3j,zzxe7,zzxe7,stupidgroupie,I voluntarily admitted myself to a ward. I was...,depression,What I've learned from my day hospital program.
849045,cldhrdfacts,Dude just start lifting. Wake up. Drink a pr...,c6984g1,zzypa,zzypa,throwaway34353,"x/post from suicide watch, still feel like I n...",depression,I feel like the biggest failure ever...
849046,SolvencyMechanism,Bi-Polar Disorder (manic depression) can manif...,c696h94,zzyv8,zzyv8,allmytoes,I'm concerned that I may have something like d...,depression,Resources for figuring out if I actually have ...
849047,x34460,Thank you for sharing this.\n\nHow about appro...,c698cn1,zzz26,zzz26,jocalyga,"Hi. I'm a senior in high school, and I've been...",depression,I feel like I have no more friends to talk to ...
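
With `apply_async`, calling `.get()` blocks until that task has finished, so tasks only overlap when all of them are submitted before any result is collected. The sketch below shows that submit-then-collect pattern. It uses the standard library's `ThreadPool`, which exposes the same `apply_async` interface, as a stand-in for `mp.Pool`, and `count_rows` is a made-up placeholder for the per-chunk function.

```python
from multiprocessing.pool import ThreadPool

def count_rows(chunk):
    """Stand-in for the per-chunk function: any function applied to one chunk."""
    return len(chunk)

# Made-up chunks for illustration.
chunks = [[1, 2, 3], [4, 5], [6]]

with ThreadPool(2) as pool:
    # submit everything first so the workers can run concurrently
    pending = [pool.apply_async(count_rows, [chunk]) for chunk in chunks]
    # then collect; results come back in submission order
    results = [task.get() for task in pending]

print(results)
```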


In [None]:
# construct multiparty conversations based on link_id, post_author, and comment_author
# the chunk reader is a one-shot iterator, so re-open the file before iterating again
df_chunks = pd.read_csv('Anxietyhelp_df_convs.csv', dtype = dtypes, chunksize = 100000, engine = 'python')
chunk_list_multi = []

with mp.Pool(16) as pool:
    pending = []
    for chunk in df_chunks:
        # preprocess each chunk
        filtered_chunk = chunk.drop_duplicates().dropna().reset_index(drop = True)

        # submit the chunk without waiting, so the workers can run in parallel
        pending.append(pool.apply_async(get_multi_convs, [filtered_chunk]))

    # collect the multiparty conversations in submission order
    chunk_list_multi = [result.get() for result in pending]

df_multi_convs = pd.concat(chunk_list_multi).reset_index(drop = True)
df_multi_convs

Unnamed: 0.1,Unnamed: 0,link_id,title,post_author,subreddit,post_content,comment_id,comment_author,parent_id,comment_content
0,52605,1sf0k9,Some guided meditations,MrParker12,Anxietyhelp,These are some guided meditations that I have ...,cdx8e04,justifiablehate,1sf0k9,might have forgotten the link :)
1,52604,1sf0k9,Some guided meditations,MrParker12,Anxietyhelp,These are some guided meditations that I have ...,cdxbs3d,MrParker12,cdx8e04,I did indeed. Thanks for telling me!\n\nhttp:/...
2,52603,1sf19k,Cool website with some good article on anxiety,MrParker12,Anxietyhelp,"Don't be scared by the title, you do not have ...",cdx06hj,wiharu,1sf19k,"Hey there, I don't see the links, MrParker12."
3,52602,1sf19k,Cool website with some good article on anxiety,MrParker12,Anxietyhelp,"Don't be scared by the title, you do not have ...",cdxbrve,MrParker12,cdx06hj,"Ha sorry, off to a good start!\n\nhttp://tinyb..."
4,52598,1sgew0,Start a diary,MrParker12,Anxietyhelp,This is something my therapist and numerous se...,cdxuxtq,quid__,1sgew0,Definitely agree with this. Mine is more a of ...
...,...,...,...,...,...,...,...,...,...,...
14219,19308,fmw38j,Video Chat Anxiety,RasputinsThirdLeg,Anxietyhelp,Does anyone else really have a hard time with ...,fl6epd8,myredbagwithmymakeup,fmw38j,Yes! I feel the same way. Hate video chatting ...
14220,19305,fmw38j,Video Chat Anxiety,RasputinsThirdLeg,Anxietyhelp,Does anyone else really have a hard time with ...,fpefrce,RasputinsThirdLeg,fpefkww,The horrible irony is that I’m an actor. I jus...
14221,18645,fsen54,How to reduce depression nd anxiety. Stop Weed...,johntucker_,Anxietyhelp,"Tips/Tricks to change habits nd behavior, redu...",fm0ws0v,johntucker_,fsen54,In my state seeing a mental professional costs...
14222,18639,fsen54,How to reduce depression nd anxiety. Stop Weed...,johntucker_,Anxietyhelp,"Tips/Tricks to change habits nd behavior, redu...",g23c3d7,johntucker_,g239bob,I have tinnitus too bro. Gotta hope that event...


In [None]:
# save dataframes without the index, so reloading does not add an 'Unnamed: 0' column
df_dyadic_convs.to_csv('depression_dyadic_convs.csv', index = False)
df_multi_convs.to_csv('Anxietyhelp_multi_convs.csv', index = False)