In this notebook, we clean and preprocess our Reddit data.

For cleaning, we apply the following steps to the values of the 'body' column and store the result in 'clean_body':
  - Convert to lowercase
  - Remove punctuation and digits
  - Expand word contractions (e.g. "I've" to "I have")
  - Remove stop words (e.g. has, at, etc.)

For text preprocessing, we apply the following steps to the cleaned text ('clean_body_str' column):
  - Tokenize the text
  - Tag parts of speech using NLTK's standard POS tagger
  - Convert the standard NLTK POS tags to the WordNet POS tag format
  - Lemmatize the WordNet POS-tagged tokens

We save the cleaned and preprocessed data to a pickle file and a CSV file for further EDA.


In [None]:
pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/00/92/a05b76a692ac08d470ae5c23873cf1c9a041532f1ee065e74b374f218306/contractions-0.0.25-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
 

In [None]:
import contractions
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
import string
%matplotlib inline

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# Load the three Reddit CSV exports.
df1 = pd.read_csv('RedditCancer.csv')
df2 = pd.read_csv('RedditCancerCaregivers.csv')
df3 = pd.read_csv('RedditCancerFamilySupport.csv')

# Stack them into one DataFrame. The original row labels are kept,
# so the same label can appear once per source file.
df = pd.concat([df1, df2, df3], axis=0)


In [None]:
#(#Rows, #Columns)
df.shape

(2631, 8)

In [None]:
# Drop every row that has at least one missing value
df = df.dropna()
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07


In [None]:
# Summary of the rows that remain after dropping missing values
df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2308 entries, 0 to 999
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      2308 non-null   object 
 1   score      2308 non-null   int64  
 2   id         2308 non-null   object 
 3   url        2308 non-null   object 
 4   comms_num  2308 non-null   int64  
 5   created    2308 non-null   float64
 6   body       2308 non-null   object 
 7   timestamp  2308 non-null   object 
dtypes: float64(1), int64(2), object(5)
memory usage: 162.3+ KB


In [None]:
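# Detect the language of each post (disabled: the code below is commented out
# and refers to a DataFrame `rws` that is not defined in this notebook)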
# pretrained_model = "lid.176.bin" 
# model = fasttext.load_model(pretrained_model)
# langs = []
# for sent in rws['rating_description_str']:
#     lang = model.predict(sent)[0]
#     langs.append(str(lang)[11:13])
# rws['langs'] = langs

In [None]:
# Make everything lowercase
df['clean_body'] = df['body'].str.lower()

In [None]:
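def remove_punctuations_numbers(text):
    """Return `text` with all ASCII punctuation and digits removed."""
    # string.punctuation covers ASCII punctuation only (printed in the next cell)
    punc = string.punctuation + '0123456789'
    for punctuation in punc:
        text = text.replace(punctuation, '')
    return text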
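
The same cleanup can also be done in a single pass with `str.translate`. The sketch below is an alternative, not the code used in this notebook; the helper name `strip_punct_digits` and the sample sentence are made up for illustration. It also shows that the typographic apostrophe (’) survives, because `string.punctuation` contains only ASCII characters.

```python
import string

# Translation table that deletes ASCII punctuation and the digits 0-9.
DROP_TABLE = str.maketrans('', '', string.punctuation + string.digits)

def strip_punct_digits(text):
    """Hypothetical helper: delete ASCII punctuation and digits in one pass."""
    return text.translate(DROP_TABLE)

# Made-up sample; note the typographic apostrophe in "i’m".
sample = "in 2016 i was diagnosed; stage 4, i’m not ready."
cleaned = strip_punct_digits(sample)
print(cleaned.split())
```

Deleted characters leave their neighbouring spaces behind, so splitting on whitespace afterwards is a simple way to normalize the gaps.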

In [None]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
# Remove punctuation and digits

df["clean_body"] = df["clean_body"].astype(str)
df['clean_body'] = df['clean_body'].apply(remove_punctuations_numbers)
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...",diagnosed stage severely aggressive bone can...,2016 diagnosed stage 4 severely aggressive bon...,"[2016, diagnosed, stage, 4, severely, aggressi...","[(2016, CD), (diagnosed, VBD), (stage, NN), (4...","[(2016, n), (diagnosed, v), (stage, n), (4, n)...","[2016, diagnose, stage, 4, severely, aggressiv..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...",last treatment option immunotherapy sarcoma ca...,last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...",wonderful husband diagnosed cancer june august...,wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...",incredible six months together diagnosis cheri...,incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...",saw doctor gave results last round test I am o...,saw doctor gave results last round test I am o...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."


In [None]:
# Expand word contractions (e.g. "I've" to "I have"); each row becomes a list of strings
df['clean_body'] = df['clean_body'].apply(lambda x: [contractions.fix(word) for word in x.split()])
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",2016 diagnosed stage 4 severely aggressive bon...,"[2016, diagnosed, stage, 4, severely, aggressi...","[(2016, CD), (diagnosed, VBD), (stage, NN), (4...","[(2016, n), (diagnosed, v), (stage, n), (4, n)...","[2016, diagnose, stage, 4, severely, aggressiv..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I am o...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."
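
To illustrate what contraction expansion does, here is a minimal sketch that uses a tiny hand-written mapping instead of the `contractions` package. The dictionary `TOY_CONTRACTIONS` and the helper `expand_toy` are hypothetical and cover only a few forms; the real package has its own, much larger mapping and its own matching rules.

```python
import re

# Tiny illustrative mapping (assumption); not the data shipped with `contractions`.
TOY_CONTRACTIONS = {
    "i've": "i have",
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
}

# One regex that matches any key as a whole word.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in TOY_CONTRACTIONS) + r")\b"
)

def expand_toy(text):
    """Hypothetical helper: expand the known contractions in lowercase text."""
    # Normalize the typographic apostrophe so both spellings match.
    text = text.replace("’", "'")
    return _PATTERN.sub(lambda match: TOY_CONTRACTIONS[match.group(0)], text)

expanded = expand_toy("i’m tired but i don't quit")
print(expanded)
print(expanded.split())
```

An expansion can turn one token into two, so in this sketch the text is split into words only after the expansion has run.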


In [None]:
# Remove stop words
stop_words = set(stopwords.words('english'))
print(len(stop_words))
# Add corpus-specific stop words ('i' is already in the NLTK list,
# so the set grows by 7 entries, not 8).
stop_words.update(['i', 'even', 'still', 'ever', 'really', 'seem', 'almost', 'go'])
print(len(stop_words))
df['clean_body'] = df['clean_body'].apply(lambda x: [word for word in x if word not in stop_words])
df.head()

179
186


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",diagnosed stage severely aggressive bone cance...,"[diagnosed, stage, severely, aggressive, bone,...","[(diagnosed, VBN), (stage, NN), (severely, RB)...","[(diagnosed, v), (stage, n), (severely, r), (a...","[diagnose, stage, severely, aggressive, bone, ..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I offi...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."
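
The stop-word filter is an exact, case-sensitive set lookup, which is likely why a capital "I" can still appear in the cleaned text. The sketch below uses a made-up mini stop list (`toy_stop_words`) and made-up tokens to show the difference between exact and case-insensitive matching; it is an illustration, not a change to the pipeline.

```python
# Hypothetical mini stop list; NLTK's English list is longer and all lowercase.
toy_stop_words = {"i", "am", "the", "was", "with"}
toy_stop_words.update({"even", "still"})  # custom additions, as in the cell above

tokens = ["I", "am", "still", "diagnosed", "with", "the", "flu"]

# Exact match: "I" survives because only lowercase "i" is in the set.
exact = [t for t in tokens if t not in toy_stop_words]

# Case-insensitive match: compare the lowercased token instead.
folded = [t for t in tokens if t.lower() not in toy_stop_words]

print(exact)
print(folded)
```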


In [None]:
# Join the remaining tokens back into one string per post
df['clean_body_str'] = [' '.join(map(str, tokens)) for tokens in df['clean_body']]
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",diagnosed stage severely aggressive bone cance...,"[diagnosed, stage, severely, aggressive, bone,...","[(diagnosed, VBN), (stage, NN), (severely, RB)...","[(diagnosed, v), (stage, n), (severely, r), (a...","[diagnose, stage, severely, aggressive, bone, ..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I offi...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."


In [None]:
#Tokenize
df['tokenized'] = df['clean_body_str'].apply(word_tokenize)
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",diagnosed stage severely aggressive bone cance...,"[diagnosed, stage, severely, aggressive, bone,...","[(diagnosed, VBN), (stage, NN), (severely, RB)...","[(diagnosed, v), (stage, n), (severely, r), (a...","[diagnose, stage, severely, aggressive, bone, ..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I offi...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."


In [None]:
# Tag parts of speech for the tokenized text
df['pos_tags'] = df['tokenized'].apply(nltk.tag.pos_tag)
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",diagnosed stage severely aggressive bone cance...,"[diagnosed, stage, severely, aggressive, bone,...","[(diagnosed, VBN), (stage, NN), (severely, RB)...","[(diagnosed, v), (stage, n), (severely, r), (a...","[diagnose, stage, severely, aggressive, bone, ..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I offi...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."


In [None]:
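def get_wordnet_pos(tag):
    """Map an NLTK (Penn Treebank) POS tag to the matching WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # Any other tag (e.g. CD for numbers) falls back to noun
        return wordnet.NOUN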
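
The mapping above can also be written as a dictionary lookup on the first letter of the tag. The sketch below is self-contained and uses the plain single-letter codes ('a', 'v', 'n', 'r') that appear in the 'wordnet_pos' column; the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# First letter of the Penn Treebank tag -> WordNet POS code:
# 'a' adjective, 'v' verb, 'n' noun, 'r' adverb.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Hypothetical helper: unknown tags fall back to noun, as in get_wordnet_pos."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

# Sample (word, tag) pairs in the style of the 'pos_tags' column.
tagged = [("2016", "CD"), ("diagnosed", "VBD"), ("severely", "RB"), ("last", "JJ")]
converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```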

In [None]:
# Convert each (word, NLTK tag) pair to a (word, WordNet tag) pair
df['wordnet_pos'] = df['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
df.head()


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",diagnosed stage severely aggressive bone cance...,"[diagnosed, stage, severely, aggressive, bone,...","[(diagnosed, VBN), (stage, NN), (severely, RB)...","[(diagnosed, v), (stage, n), (severely, r), (a...","[2016, diagnose, stage, 4, severely, aggressiv..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I offi...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."


In [None]:
#Lemmatization
wnl = WordNetLemmatizer()
df['lemmatized'] = df['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
df.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp,lower,clean_body,clean_body_str,tokenized,pos_tags,wordnet_pos,lemmatized
0,I’m not ready.,616,9isza1,https://www.reddit.com/r/cancer/comments/9isza...,82,1537917000.0,In 2016 I was diagnosed with stage 4 of a seve...,2018-09-25 23:13:21,"[in, 2016, i, was, diagnosed, with, stage, 4, ...","[diagnosed, stage, severely, aggressive, bone,...",diagnosed stage severely aggressive bone cance...,"[diagnosed, stage, severely, aggressive, bone,...","[(diagnosed, VBN), (stage, NN), (severely, RB)...","[(diagnosed, v), (stage, n), (severely, r), (a...","[diagnose, stage, severely, aggressive, bone, ..."
5,Onto Hospice. End of journey.,447,8y27xr,https://www.reddit.com/r/cancer/comments/8y27x...,108,1531363000.0,"The last treatment option, Immunotherapy, for ...",2018-07-12 02:41:20,"[the, last, treatment, option, immunotherapy, ...","[last, treatment, option, immunotherapy, sarco...",last treatment option immunotherapy sarcoma ca...,"[last, treatment, option, immunotherapy, sarco...","[(last, JJ), (treatment, NN), (option, NN), (i...","[(last, a), (treatment, n), (option, n), (immu...","[last, treatment, option, immunotherapy, sarco..."
6,"Diagnosed in June, Dead in August",437,cuo28h,https://www.reddit.com/r/cancer/comments/cuo28...,49,1566644000.0,My wonderful husband was diagnosed with cancer...,2019-08-24 11:00:29,"[my, wonderful, husband, was, diagnosed, with,...","[wonderful, husband, diagnosed, cancer, june, ...",wonderful husband diagnosed cancer june august...,"[wonderful, husband, diagnosed, cancer, june, ...","[(wonderful, JJ), (husband, NN), (diagnosed, V...","[(wonderful, a), (husband, n), (diagnosed, v),...","[wonderful, husband, diagnose, cancer, june, a..."
7,Goodbye my sweet angel. I Lost my 5 year old d...,441,e1o110,https://www.reddit.com/r/cancer/comments/e1o11...,47,1574750000.0,We had an incredible six months together after...,2019-11-26 06:31:46,"[we, had, an, incredible, six, months, togethe...","[incredible, six, months, together, diagnosis,...",incredible six months together diagnosis cheri...,"[incredible, six, months, together, diagnosis,...","[(incredible, JJ), (six, CD), (months, NNS), (...","[(incredible, a), (six, n), (months, n), (toge...","[incredible, six, month, together, diagnosis, ..."
9,I’m officially cancer free!!!,432,at7r12,https://www.reddit.com/r/cancer/comments/at7r1...,89,1550808000.0,Just saw my doctor and he gave me results from...,2019-02-22 04:07:07,"[just, saw, my, doctor, and, he, gave, me, res...","[saw, doctor, gave, results, last, round, test...",saw doctor gave results last round test I offi...,"[saw, doctor, gave, results, last, round, test...","[(saw, NN), (doctor, NN), (gave, VBD), (result...","[(saw, n), (doctor, n), (gave, v), (results, n...","[saw, doctor, give, result, last, round, test,..."


In [None]:
# The pickle keeps list-valued columns (tokens, tags, lemmas) as Python lists
df.to_pickle("Reddit_dataset_clean.pkl")

In [None]:
# The CSV stores list-valued columns as their string representation
df.to_csv('Reddit_dataset_clean.csv')
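
When the CSV is read back, list-valued columns such as 'lemmatized' come back as strings. The sketch below uses the standard `csv` module as a stand-in for pandas, with a made-up row, to show the effect and one way to restore the list with `ast.literal_eval`. Loading the pickle instead avoids this step.

```python
import ast
import csv
import io

# Made-up row with a list-valued field, similar to the 'lemmatized' column.
rows = [{"id": "a1", "lemmatized": ["diagnose", "stage"]}]

# Write to an in-memory CSV; the list is written as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "lemmatized"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the field is now a plain string, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["lemmatized"]

# Parse the string back into a Python list.
restored = ast.literal_eval(raw)
print(type(raw).__name__, type(restored).__name__)
```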