# Milestone 2 - NLP Preprocessing Full Solution

In [9]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

## Workflow 1: Open the `initial_eda.csv` as a Pandas dataframe

In [10]:
file_name = './initial_eda.csv'
df = pd.read_csv(file_name,index_col=0)

In [11]:
df.reply.shape

(6323,)

In [12]:
df.reply[20:60]

20    Yeah I’ve been playing around with it a lot and I can’t really pinpoint what exactly is the difference. They had it just perfect before, and now, they’re just like any other app.\n\nI used notability for about a month and I came to the conclusion that the iPad isn’t meant for extended writing, until I decided to try Goodnotes 4. That completely changed it, and I could finally say that the iPad is a complete paper replacement for me. You may have noticed that I tend to hang around /r/iPad a lot, and I have made many comments praising Goodnotes 4 for its inking, and now they took away the only feature that mattered to me in Goodnotes 5. I am devastated right now. ):
21    Here's a sneak peek of /r/ipad using the [top posts]

## Some observations:
- HEX color codes, http links, and other content that carries no meaning for our analysis
    - remove them
    - technically, the hex-code replies only give names to hex codes, so we don't care about the words in them either and could drop those replies altogether; they are handled with regex here to show how to remove any other pattern you may want to strip
- emojis
    - we can keep them
- abbreviations like 'Thx', or the same word in different tenses
    - e.g. annoy, annoys, annoyed, annoying
    - we will want to stem the words
- upper case and lower case
    - we can preserve case if we want to see when people are using pronouns
    - or convert everything to lower case to help with counting
- punctuation
    - if we are counting words, we don't really care about punctuation
- numbers
    - numbers can be written as numerals like 7 or 5334, or spelled out like seven or five thousand three hundred and thirty-four; do we care to combine them?
    - likely a small % of our dataset
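
Before deciding what to strip, it can help to measure how often each kind of noise actually appears. The following is a minimal sketch, assuming a small hand-written list `sample_replies` that stands in for `df['reply']`; the three patterns are illustrative and are not the ones used later in this notebook.

```python
import re

# Hypothetical stand-in for df['reply']; swap in the real column to measure your data.
sample_replies = [
    "lemon sunshine: FFDC74",
    "Unfortunately, they are. http://imgur.com/UJqsKhy",
    "In goodnotes 4 I could make bookmarks",
    "You are a lifesaver!!!",
]

# Illustrative patterns, one per kind of noise listed above.
# Caveat: the hex pattern also matches six-letter words made only of a-f.
patterns = {
    "hex_code": re.compile(r"\b[0-9a-fA-F]{6}\b"),
    "hyperlink": re.compile(r"https?://\S+"),
    "number": re.compile(r"\b\d+\b"),
}

# Share of replies that contain at least one match for each pattern.
shares = {
    name: sum(bool(pattern.search(reply)) for reply in sample_replies) / len(sample_replies)
    for name, pattern in patterns.items()
}
print(shares)
```

A low share for a pattern (as expected for numbers here) suggests that a special-case rule may not be worth the effort.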

## Workflow 2: Remove HEX color code and hyperlinks using replace and regular expression

### 1. Remove HEX color codes

In [13]:
# there is a thread that posts HEX color codes and their names
df[df['submission_title']=='HEX set of bright highlighters — enjoy! (HEX codes will be also in the comments)'].head()

Unnamed: 0,submission_id,submission_score,submission_title,submission_link_flair_text,submission_selftext,reply_author,reply_body,reply_created_utc,reply,reply_char_counts,reply_word_counts_by_space
3311,jg3xzz,1,HEX set of bright highlighters — enjoy! (HEX codes will be also in the comments),Templates,,cleothefairy,lemon sunshine: FFDC74\nbeach at sunset: FBAC87\nwatermelon spring: FF8C87\ncotton candy cheeks: F3A6C8\nmoonlit lily: DEACF9\nclear sky sea: AEB5FF\nmountain breeze waves: 95C8F3\ntoes in the lagoon: 81E3E1\njungle getaway: 7DE198\ngolf club special: B3E561,1603387174,lemon sunshine: FFDC74\nbeach at sunset: FBAC87\nwatermelon spring: FF8C87\ncotton candy cheeks: F3A6C8\nmoonlit lily: DEACF9\nclear sky sea: AEB5FF\nmountain breeze waves: 95C8F3\ntoes in the lagoon: 81E3E1\njungle getaway: 7DE198\ngolf club special: B3E561,249,37
3312,jg3xzz,1,HEX set of bright highlighters — enjoy! (HEX codes will be also in the comments),Templates,,collegegeek99,You are a lifesaver!!!,1603391774,You are a lifesaver!!!,22,4
3313,jg3xzz,1,HEX set of bright highlighters — enjoy! (HEX codes will be also in the comments),Templates,,cleothefairy,Glad you like it!,1603393936,Glad you like it!,17,4
3314,jg3xzz,1,HEX set of bright highlighters — enjoy! (HEX codes will be also in the comments),Templates,,nblscgntn,Thanks! \nAnd for any one wanting just the codes (in order):\n\nFFDC74\nFBAC87\nFF8C87\nF3A6C8\nDEACF9\nAEB5FF\n95C8F3\n81E3E1\n7DE198\nB3E561,1603396860,Thanks! \nAnd for any one wanting just the codes (in order):\n\nFFDC74\nFBAC87\nFF8C87\nF3A6C8\nDEACF9\nAEB5FF\n95C8F3\n81E3E1\n7DE198\nB3E561,130,21
3315,jg3xzz,1,HEX set of bright highlighters — enjoy! (HEX codes will be also in the comments),Templates,,mayashhhh,I love the names of these colours!,1603406267,I love the names of these colours!,34,7


In [29]:
# check for whether the regular expression pattern is correct
import re

HEX_regex_pattern = r': (?:[0-9a-fA-F]){6}'
re.search(HEX_regex_pattern, 'golf club special: B3E561 color is great')



<re.Match object; span=(17, 25), match=': B3E561'>

Test your regular expression at this site: [https://regex101.com](https://regex101.com)
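
One thing worth checking is whether the pattern can also fire on ordinary prose, since words such as "deface" consist only of hex letters. The sketch below compares the pattern from the cell above with a hypothetical stricter variant, `HEX_regex_strict`, which requires the six characters to end at a word boundary. Both test sentences other than the "golf club special" one are made up for illustration.

```python
import re

# Pattern from the cell above: colon, space, six hex characters.
HEX_regex_pattern = r': (?:[0-9a-fA-F]){6}'
# Hypothetical stricter variant: the six characters must end at a word boundary.
HEX_regex_strict = r': [0-9a-fA-F]{6}\b'

hex_reply = 'golf club special: B3E561 color is great'
prose_reply = 'note: defaced wall'  # made-up sentence; "deface" is all hex letters

loose_on_prose = re.sub(HEX_regex_pattern, '', prose_reply)
strict_on_prose = re.sub(HEX_regex_strict, '', prose_reply)
strict_on_hex = re.sub(HEX_regex_strict, '', hex_reply)

print(loose_on_prose)   # the original pattern eats part of a normal word
print(strict_on_prose)  # the stricter variant leaves the prose untouched
print(strict_on_hex)    # and still removes the real hex code
```

The stricter variant is still not perfect: a complete six-letter word made of hex letters after a colon (for example "decade") would be removed too.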

In [30]:
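# remove the hex codes from replies in the thread that lists HEX codes
# regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default
df['reply_clean'] = df['reply'].str.replace(HEX_regex_pattern, '', regex=True)
# this one is a literal string, so no regex is needed
df['reply_clean'] = df['reply_clean'].str.replace(':FFDC74FBAC87FF8C87F3A6C8DEACF9AEB5FF95C8F381E3E17DE198B3E561', '', regex=False)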

  


### 2. Remove hyperlinks (e.g. http://drive.google.com/xyz ) from the reply

In [31]:
# Test on one sentence
test = 'http://drive.google.com/xyz'
hyperlink_regex_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
re.search(hyperlink_regex_pattern, test)

<re.Match object; span=(0, 27), match='http://drive.google.com/xyz'>

In [32]:
# Remove hyperlinks (regex=True is needed because pandas >= 2.0 defaults to literal matching)
df['reply_clean'] = df['reply_clean'].str.replace(hyperlink_regex_pattern, '', regex=True)
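
The pattern above is fairly permissive: the character class `[$-_@.&+]` contains the range `$-_`, which covers digits, upper-case letters, and much of the punctuation. If that level of detail is not needed, a simpler pattern may be easier to reason about. The sketch below is one possible alternative, not the notebook's method; `simple_url_pattern` and `remove_urls` are hypothetical names.

```python
import re

# Hypothetical simpler pattern: the scheme followed by any run of non-whitespace.
simple_url_pattern = re.compile(r'https?://\S+')

def remove_urls(text):
    """Drop URLs, then collapse the whitespace they leave behind."""
    without_urls = simple_url_pattern.sub('', text)
    # Note: this also turns newlines into single spaces.
    return re.sub(r'\s+', ' ', without_urls).strip()

cleaned = remove_urls('Unfortunately, they are. http://imgur.com/UJqsKhy')
print(cleaned)
```

The trade-off is that `\S+` also swallows punctuation attached to the end of a link, such as a closing parenthesis.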

  


## Workflow 3-6: Remove stop words and punctuation, convert to lower case, and stem the words

In [33]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import string

In [22]:
# we are getting the punctuation characters from string and adding our custom punctuation ’ (curly apostrophe)
punctuation_list = [char for char in (string.punctuation + '’')]

In [23]:
punctuation_list

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~',
 '’']

In [26]:
# we are getting the stopwords from nltk.corpus
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [27]:
# a test sentence to illustrate what the functions are doing
test_sentence = df['reply'][5]
print("original sentence")
print(test_sentence)

original sentence
Am I missing something or did they significantly reduce the functionality of the bookmarks button? 
In goodnotes 4 I could make bookmarks in textbooks and name them, now goodnotes 5 just has them as favorites with no option to name them or change how they are viewed? 

I don't know if anyone else has noticed this or maybe found a way around this problem, but I would love to hear y'alls thoughts.


In [28]:
print('.lower()')
test_sentence_lower = test_sentence.lower()
print(test_sentence_lower)

.lower()
am i missing something or did they significantly reduce the functionality of the bookmarks button? 
in goodnotes 4 i could make bookmarks in textbooks and name them, now goodnotes 5 just has them as favorites with no option to name them or change how they are viewed? 

i don't know if anyone else has noticed this or maybe found a way around this problem, but i would love to hear y'alls thoughts.


In [29]:
print('remove punctuations')
test_sentence_lower_no_punctuation = "".join([char for char in test_sentence_lower if char not in punctuation_list])
print(test_sentence_lower_no_punctuation)

# alternatively, if you only want letters, check each character with .isalpha()
# (note that this also drops spaces and digits unless you allow them explicitly)

remove punctuations
am i missing something or did they significantly reduce the functionality of the bookmarks button 
in goodnotes 4 i could make bookmarks in textbooks and name them now goodnotes 5 just has them as favorites with no option to name them or change how they are viewed 

i dont know if anyone else has noticed this or maybe found a way around this problem but i would love to hear yalls thoughts
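
The character-by-character filter above can also be written with `str.translate`, which deletes every listed character in a single pass. This is a minimal sketch of that alternative; the names `punctuation_chars` and `remove_punctuation_table` and the sample sentence are assumptions made for illustration.

```python
import string

# Same character set as punctuation_list above, including the curly apostrophe.
punctuation_chars = string.punctuation + '’'
# Translation table whose third argument lists the characters to delete.
remove_punctuation_table = str.maketrans('', '', punctuation_chars)

sample = "i don't know, but i would love to hear y’alls thoughts."
no_punctuation = sample.translate(remove_punctuation_table)
print(no_punctuation)
```

Both approaches should give the same result; the translation table is mostly a convenience when the same set of characters is removed from many strings.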


In [30]:
print('word tokenizer')
tokenzied = word_tokenize(test_sentence_lower_no_punctuation)
print(tokenzied)

word tokenizer
['am', 'i', 'missing', 'something', 'or', 'did', 'they', 'significantly', 'reduce', 'the', 'functionality', 'of', 'the', 'bookmarks', 'button', 'in', 'goodnotes', '4', 'i', 'could', 'make', 'bookmarks', 'in', 'textbooks', 'and', 'name', 'them', 'now', 'goodnotes', '5', 'just', 'has', 'them', 'as', 'favorites', 'with', 'no', 'option', 'to', 'name', 'them', 'or', 'change', 'how', 'they', 'are', 'viewed', 'i', 'dont', 'know', 'if', 'anyone', 'else', 'has', 'noticed', 'this', 'or', 'maybe', 'found', 'a', 'way', 'around', 'this', 'problem', 'but', 'i', 'would', 'love', 'to', 'hear', 'yalls', 'thoughts']


In [31]:
print('filter out stopword')
no_stopwords = [word for word in tokenzied if word not in stop_words]
print(no_stopwords)

filter out stopword
['missing', 'something', 'significantly', 'reduce', 'functionality', 'bookmarks', 'button', 'goodnotes', '4', 'could', 'make', 'bookmarks', 'textbooks', 'name', 'goodnotes', '5', 'favorites', 'option', 'name', 'change', 'viewed', 'dont', 'know', 'anyone', 'else', 'noticed', 'maybe', 'found', 'way', 'around', 'problem', 'would', 'love', 'hear', 'yalls', 'thoughts']
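
Checking `word not in stop_words` against a list scans the list for every token. Converting the list to a `set` first keeps the same logic while making each lookup cheaper, which may matter across thousands of replies. The sketch below uses a hypothetical seven-word subset, `stop_words_subset`, in place of the full NLTK list.

```python
# Hypothetical subset of the NLTK English stop word list, for illustration only.
stop_words_subset = ['i', 'am', 'or', 'did', 'they', 'the', 'of']
# Membership tests on a set take constant time on average.
stop_word_set = set(stop_words_subset)

tokens = ['am', 'i', 'missing', 'something', 'or', 'did', 'they',
          'significantly', 'reduce', 'the', 'functionality']
kept = [token for token in tokens if token not in stop_word_set]
print(kept)
```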


In [32]:
print('Porter Stemmer')
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in no_stopwords]
print(stemmed)

Porter Stemmer
['miss', 'someth', 'significantli', 'reduc', 'function', 'bookmark', 'button', 'goodnot', '4', 'could', 'make', 'bookmark', 'textbook', 'name', 'goodnot', '5', 'favorit', 'option', 'name', 'chang', 'view', 'dont', 'know', 'anyon', 'els', 'notic', 'mayb', 'found', 'way', 'around', 'problem', 'would', 'love', 'hear', 'yall', 'thought']


### Combining the steps

In [33]:
# lower-case each reply and remove punctuation
df['reply_no_punctuation'] = ["".join([char.lower() for char in reply if char not in punctuation_list]) for reply in df['reply_clean']]
# tokenization
df['reply_tokenized'] = [word_tokenize(reply) for reply in df['reply_no_punctuation']]
# remove stop words
df['reply_tokenized'] = [[word for word in word_list if word not in stop_words] for word_list in df['reply_tokenized']]
# stemming (stop words were already removed in the previous step)
df['reply_preprocessed'] = [[porter.stem(word) for word in word_list] for word_list in df['reply_tokenized']]
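
The same steps can be wrapped in one function so that they are easy to test on a single string. This is a sketch under stated assumptions, not the notebook's implementation: `preprocess` is a hypothetical helper, it splits on whitespace instead of calling `word_tokenize`, the stemmer defaults to a no-op so the block runs without NLTK, and `demo_stop_words` is a tiny made-up subset of the stop word list.

```python
import string

# Deletion table for standard punctuation plus the curly apostrophe.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation + '’')

def preprocess(text, stop_words, stem=lambda word: word):
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    cleaned = text.lower().translate(PUNCTUATION_TABLE)
    tokens = cleaned.split()  # assumption: whitespace split instead of word_tokenize
    return [stem(token) for token in tokens if token not in stop_words]

# Hypothetical stop word subset; in the notebook this would be set(stop_words).
demo_stop_words = {'that', 'a', 'on'}
result = preprocess("That sounds a lot like what's going on, thanks!", demo_stop_words)
print(result)
```

In the notebook, passing `porter.stem` as `stem` and `set(stop_words)` as `stop_words` would bring this closer to the cell above, and the function could then be applied to `df['reply_clean']` with `Series.apply`.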

In [34]:
df.drop(columns=['reply_no_punctuation','reply_clean'],inplace=True)

In [35]:
df.head(20)

Unnamed: 0,submission_id,submission_score,submission_title,submission_link_flair_text,submission_selftext,reply_author,reply_body,reply_created_utc,reply,reply_char_counts,reply_word_counts_by_space,reply_tokenized,reply_preprocessed
0,aglcrj,1,Goodnotes 4 vs. Goodnotes 5 right now,,"I have used Goodnotes 4 for work a ton. And I do mean a ton. I bought and downloaded 5 yesterday, transfer was easy. I realized I can't use the Mac App with 5. Right? Changes I made in 5 aren't synced to the desktop app. Also I kept getting a syncing error in 5. What was it syncing with? I am going to continue to play with it but I am not sure I feel comfortable diving in yet.",Mauri97,"I'm getting a ton of bugs with 5 as well (snappy lines, no response at times) and for some reason the ability to sync to google drive and to download multiple files at once from google drive is gone.\n\n I think they need some time to cope with the new launch.",1547656134,"I'm getting a ton of bugs with 5 as well (snappy lines, no response at times) and for some reason the ability to sync to google drive and to download multiple files at once from google drive is gone.\n\n I think they need some time to cope with the new launch.",258,51,"[im, getting, ton, bugs, 5, well, snappy, lines, response, times, reason, ability, sync, google, drive, download, multiple, files, google, drive, gone, think, need, time, cope, new, launch]","[im, get, ton, bug, 5, well, snappi, line, respons, time, reason, abil, sync, googl, drive, download, multipl, file, googl, drive, gone, think, need, time, cope, new, launch]"
1,aglcrj,1,Goodnotes 4 vs. Goodnotes 5 right now,,"I have used Goodnotes 4 for work a ton. And I do mean a ton. I bought and downloaded 5 yesterday, transfer was easy. I realized I can't use the Mac App with 5. Right? Changes I made in 5 aren't synced to the desktop app. Also I kept getting a syncing error in 5. What was it syncing with? I am going to continue to play with it but I am not sure I feel comfortable diving in yet.",nathanwj,Goodnotes 5 is not yet compatible with the desktop app as it says in the release notes. There will be more features added in the near future.,1547658625,Goodnotes 5 is not yet compatible with the desktop app as it says in the release notes. There will be more features added in the near future.,141,27,"[goodnotes, 5, yet, compatible, desktop, app, says, release, notes, features, added, near, future]","[goodnot, 5, yet, compat, desktop, app, say, releas, note, featur, ad, near, futur]"
2,aglcrj,1,Goodnotes 4 vs. Goodnotes 5 right now,,"I have used Goodnotes 4 for work a ton. And I do mean a ton. I bought and downloaded 5 yesterday, transfer was easy. I realized I can't use the Mac App with 5. Right? Changes I made in 5 aren't synced to the desktop app. Also I kept getting a syncing error in 5. What was it syncing with? I am going to continue to play with it but I am not sure I feel comfortable diving in yet.",Rowyfo,"I haven't downloaded GN5 yet but watched a walkthrough and I did see that there's a snap option to check and uncheck, I think somewhere in pen options. Hope that helps!",1547662874,"I haven't downloaded GN5 yet but watched a walkthrough and I did see that there's a snap option to check and uncheck, I think somewhere in pen options. Hope that helps!",168,31,"[havent, downloaded, gn5, yet, watched, walkthrough, see, theres, snap, option, check, uncheck, think, somewhere, pen, options, hope, helps]","[havent, download, gn5, yet, watch, walkthrough, see, there, snap, option, check, uncheck, think, somewher, pen, option, hope, help]"
3,aglcrj,1,Goodnotes 4 vs. Goodnotes 5 right now,,"I have used Goodnotes 4 for work a ton. And I do mean a ton. I bought and downloaded 5 yesterday, transfer was easy. I realized I can't use the Mac App with 5. Right? Changes I made in 5 aren't synced to the desktop app. Also I kept getting a syncing error in 5. What was it syncing with? I am going to continue to play with it but I am not sure I feel comfortable diving in yet.",Mauri97,"That sounds a lot like what's going on, thanks!",1547663080,"That sounds a lot like what's going on, thanks!",47,9,"[sounds, lot, like, whats, going, thanks]","[sound, lot, like, what, go, thank]"
4,aglcrj,1,Goodnotes 4 vs. Goodnotes 5 right now,,"I have used Goodnotes 4 for work a ton. And I do mean a ton. I bought and downloaded 5 yesterday, transfer was easy. I realized I can't use the Mac App with 5. Right? Changes I made in 5 aren't synced to the desktop app. Also I kept getting a syncing error in 5. What was it syncing with? I am going to continue to play with it but I am not sure I feel comfortable diving in yet.",nongaussian,I will probably stick to GoodNotes 4 for a while. Noticed that in GoodNotes 5 there is no easy way to export all your notes to Dropbox in PDF format. Love to have those available for my non-Apple devices (i.e. most of them).\n\nNot a complaint: I expect that GN 5 will mature and get more features.,1547675267,I will probably stick to GoodNotes 4 for a while. Noticed that in GoodNotes 5 there is no easy way to export all your notes to Dropbox in PDF format. Love to have those available for my non-Apple devices (i.e. most of them).\n\nNot a complaint: I expect that GN 5 will mature and get more features.,296,57,"[probably, stick, goodnotes, 4, noticed, goodnotes, 5, easy, way, export, notes, dropbox, pdf, format, love, available, nonapple, devices, ie, complaint, expect, gn, 5, mature, get, features]","[probabl, stick, goodnot, 4, notic, goodnot, 5, easi, way, export, note, dropbox, pdf, format, love, avail, nonappl, devic, ie, complaint, expect, gn, 5, matur, get, featur]"
5,aglcrj,1,Goodnotes 4 vs. Goodnotes 5 right now,,"I have used Goodnotes 4 for work a ton. And I do mean a ton. I bought and downloaded 5 yesterday, transfer was easy. I realized I can't use the Mac App with 5. Right? Changes I made in 5 aren't synced to the desktop app. Also I kept getting a syncing error in 5. What was it syncing with? I am going to continue to play with it but I am not sure I feel comfortable diving in yet.",shshort,"Am I missing something or did they significantly reduce the functionality of the bookmarks button? \nIn goodnotes 4 I could make bookmarks in textbooks and name them, now goodnotes 5 just has them as favorites with no option to name them or change how they are viewed? \n\nI don't know if anyone else has noticed this or maybe found a way around this problem, but I would love to hear y'alls thoughts.",1547770692,"Am I missing something or did they significantly reduce the functionality of the bookmarks button? \nIn goodnotes 4 I could make bookmarks in textbooks and name them, now goodnotes 5 just has them as favorites with no option to name them or change how they are viewed? \n\nI don't know if anyone else has noticed this or maybe found a way around this problem, but I would love to hear y'alls thoughts.",398,72,"[missing, something, significantly, reduce, functionality, bookmarks, button, goodnotes, 4, could, make, bookmarks, textbooks, name, goodnotes, 5, favorites, option, name, change, viewed, dont, know, anyone, else, noticed, maybe, found, way, around, problem, would, love, hear, yalls, thoughts]","[miss, someth, significantli, reduc, function, bookmark, button, goodnot, 4, could, make, bookmark, textbook, name, goodnot, 5, favorit, option, name, chang, view, dont, know, anyon, els, notic, mayb, found, way, around, problem, would, love, hear, yall, thought]"
6,agoowm,1,The bundle is available !,,,daven1985,Thank you. I have been waiting!\n\n&amp;#x200B;,1547671644,Thank you. I have been waiting!\n\n&amp;#x200B;,45,7,"[thank, waiting, ampx200b]","[thank, wait, ampx200b]"
7,agoowm,1,The bundle is available !,,,TabulatorSpalte,"Upgrade is 8,99€ (Germany) for me. Seems to be the same price as the standalone app. Is this the correct price?",1547715935,"Upgrade is 8,99€ (Germany) for me. Seems to be the same price as the standalone app. Is this the correct price?",111,21,"[upgrade, 899€, germany, seems, price, standalone, app, correct, price]","[upgrad, 899€, germani, seem, price, standalon, app, correct, price]"
8,agoowm,1,The bundle is available !,,,3Dhisham,"They shouldn’t charge you. I just downloaded it yesterday and I’m on the German appstore. Note that it is not GoodNotes 5 that you have to download, it’s the upgrade bundle which you can get to when you view “more by this developer “",1547716542,"They shouldn’t charge you. I just downloaded it yesterday and I’m on the German appstore. Note that it is not GoodNotes 5 that you have to download, it’s the upgrade bundle which you can get to when you view “more by this developer “",233,44,"[shouldnt, charge, downloaded, yesterday, im, german, appstore, note, goodnotes, 5, download, upgrade, bundle, get, view, “, developer, “]","[shouldnt, charg, download, yesterday, im, german, appstor, note, goodnot, 5, download, upgrad, bundl, get, view, “, develop, “]"
9,agoowm,1,The bundle is available !,,,TabulatorSpalte,"Unfortunately, they are. http://imgur.com/UJqsKhy",1547717530,"Unfortunately, they are. http://imgur.com/UJqsKhy",49,4,[unfortunately],[unfortun]


## Note:
- Contractions are not expanded. For example, "I'm" is not converted to "I am"; once punctuation is removed it becomes "im", which is not in the stop word list and so is kept as a token.
    - if you really want to get rid of them, see this example: https://www.geeksforgeeks.org/nlp-expand-contractions-in-text-processing/
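
One way to handle this is to expand contractions before punctuation is removed, so that the expanded words can be caught by the stop word filter. The following is a minimal sketch with a deliberately tiny, hypothetical mapping `CONTRACTIONS`; a real list would be much longer, and some entries are ambiguous ("it's" can mean "it is" or "it has").

```python
import re

# Hypothetical, deliberately tiny mapping; extend it for real use.
CONTRACTIONS = {
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}

# One alternation pattern; longer keys first so they win over shorter ones.
_contraction_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)

def expand_contractions(text):
    """Lowercase, normalise curly apostrophes, then expand known contractions."""
    normalised = text.lower().replace("’", "'")
    return _contraction_pattern.sub(lambda match: CONTRACTIONS[match.group(1)], normalised)

expanded = expand_contractions("I’m getting a ton of bugs and I don't know why")
print(expanded)
```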

## Workflow 7: Export the csv file

In [37]:
df.to_csv('./preprocessed.csv')
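
Keep in mind that CSV stores every cell as text, so list columns such as `reply_tokenized` and `reply_preprocessed` are written as their string representation and come back as strings when the file is read again. The sketch below illustrates the round trip with the standard `csv` module rather than pandas, using a made-up single row, and parses the cell back into a list with `ast.literal_eval`.

```python
import ast
import csv
import io

# Write one row with a list-valued cell, the way a token column ends up in a CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['reply_preprocessed'])
writer.writerow([str(['thank', 'wait'])])

# Reading it back yields a string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]
print(type(raw_cell).__name__, raw_cell)

# ast.literal_eval safely parses the string back into a Python list.
tokens = ast.literal_eval(raw_cell)
print(tokens)
```

With pandas, the same parsing step could be applied to the column after `pd.read_csv`, for example through `Series.apply(ast.literal_eval)`.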

Congratulations! You have finished Milestone 2! Good job!