In [1]:
from src.features.preprocess import PreProcess
import pandas as pd

### Call the preprocess function with the DataFrame and the column you want to preprocess

The `preprocess` function is an all-in-one entry point: it calls the other preprocessing functions in sequence to quickly generate preprocessed text.  
It preprocesses the column in two stages.

Steps performed in place (on the same column):

1. fill NaN values with empty strings
2. remove URLs
3. expand contractions
4. remove hashtags
5. remove escape characters such as '\n'

Steps that generate a new column:

6. tokenize the text into a list of words
7. filter out special characters that are neither alphabetical nor numerical
8. filter out stopwords
9. apply stemming (the default) or lemmatization
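The in-place steps (1 to 5) can be pictured as a chain of string transformations. The sketch below is a stand-in written with only the standard library. It is not the `PreProcess` implementation: the regular expressions, the tiny `CONTRACTIONS` table, the final whitespace clean-up, and the function name `clean_text` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny contraction table; a real one is much larger.
CONTRACTIONS = {
    "haven't": "have not",
    "i'm": "i am",
    "it'll": "it will",
    "don't": "do not",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
CONTRACTION_RE = re.compile(r"\b\w+'\w+\b")
HASHTAG_RE = re.compile(r"#\w+")


def _expand(match):
    """Replace one contraction, keeping a leading capital letter."""
    word = match.group(0)
    key = word.lower()
    if key not in CONTRACTIONS:
        return word
    expanded = CONTRACTIONS[key]
    return expanded.capitalize() if word[0].isupper() else expanded


def clean_text(text):
    """Apply illustrative versions of steps 1-5 to a single comment."""
    # 1. Treat missing values (None or NaN) as an empty string.
    if text is None or text != text:  # NaN is the only value unequal to itself
        return ""
    # 2. Remove URLs.
    text = URL_RE.sub("", text)
    # 3. Expand contractions found in the table.
    text = CONTRACTION_RE.sub(_expand, text)
    # 4. Remove hashtags.
    text = HASHTAG_RE.sub("", text)
    # 5. Replace newline escape characters with a space.
    text = text.replace("\n", " ")
    # Extra clean-up (an assumption, not one of the listed steps):
    # collapse the runs of spaces left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("I'm here, see https://example.com #python\nand haven't left"))
```

A function like this could be applied to a whole column with pandas' `Series.apply`, for example `df['comment'] = df['comment'].apply(clean_text)`, which mirrors the "same column" behaviour described above.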

In [2]:
df = pd.read_csv("../data/raw/computerscience_comments.csv")
df.head()

Unnamed: 0,post_id,comment
0,n2n0ax,How late is too late to start a career in prog...
1,n2n0ax,I am a freshman at a university and haven't be...
2,n2n0ax,I'm still in highschool but really interested ...
3,n2n0ax,"This is probably a common question, but how we..."
4,n2n0ax,I am planning on starting a CS major this fall...


In [3]:
preprocessor = PreProcess()

preprocessor.preprocess(df, 'comment')
df

Unnamed: 0,post_id,comment,comment_word_token
0,n2n0ax,How late is too late to start a career in prog...,"[late, late, start, career, program]"
1,n2n0ax,I am a freshman at a university and have not b...,"[freshman, univers, abl, work, side, project, ..."
2,n2n0ax,I am still in highschool but really interested...,"[still, highschool, realli, interest, comput, ..."
3,n2n0ax,"This is probably a common question, but how we...","[probabl, common, question, well, code, bootca..."
4,n2n0ax,I am planning on starting a CS major this fall...,"[plan, start, cs, major, fall]"
...,...,...,...
3174,myc3u1,Look for Ben Eater on YouTube if you are looki...,"[look, ben, eater, youtub, look, stuff, transi..."
3175,myc3u1,"I would recommend ""Structured Computer Organiz...","[would, recommend, structur, comput, organ, an..."
3176,myc3u1,"Just to give you a very brief overview, there ...","[give, brief, overview, littl, circuit, compon..."
3177,myc3u1,Thank you for this amazing list of resources :...,"[thank, amaz, list, resourc, never, imagin, ge..."


This generates one extra column, `comment_word_token` (named `<column>_word_token` after the input column), which holds the stemmed tokens.
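As a rough illustration of how a token list like the one in `comment_word_token` comes about, the sketch below covers steps 6 to 8 (tokenize, drop non-alphanumeric tokens, drop stopwords) and leaves stemming out. It is an assumption-laden stand-in, not the library's code: the regex tokenizer, the lowercasing, the abbreviated `STOPWORDS` set, and the name `to_tokens` are all hypothetical.

```python
import re

# Hypothetical, abbreviated stopword list; real lists are far longer.
STOPWORDS = {
    "how", "is", "too", "to", "a", "in", "i", "am",
    "the", "and", "of", "on", "this",
}


def to_tokens(text):
    """Illustrative versions of steps 6-8 for a single comment."""
    # 6. Tokenize: words and standalone punctuation marks become tokens.
    #    Lowercasing here is an assumption based on the lowercase output above.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 7. Keep only tokens made of letters and digits.
    tokens = [t for t in tokens if t.isalnum()]
    # 8. Drop stopwords.
    return [t for t in tokens if t not in STOPWORDS]


print(to_tokens("How late is too late to start a career in programming?"))
```

The result differs from row 0 of the output above only in the last token, because no stemmer is applied here: the sketch keeps `programming`, whereas the stemmed column shows `program`.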

### Add lemmatization

In [4]:
df = pd.read_csv("../data/raw/computerscience_comments.csv")
df.head()

Unnamed: 0,post_id,comment
0,n2n0ax,How late is too late to start a career in prog...
1,n2n0ax,I am a freshman at a university and haven't be...
2,n2n0ax,I'm still in highschool but really interested ...
3,n2n0ax,"This is probably a common question, but how we..."
4,n2n0ax,I am planning on starting a CS major this fall...


In [5]:
preprocessor.preprocess(df, 'comment', lemm=True)
df

Unnamed: 0,post_id,comment,comment_word_token,comment_tag
0,n2n0ax,How late is too late to start a career in prog...,"[late, late, start, career, program]","[[(late, RB)], [(late, RB)], [(start, NN)], [(..."
1,n2n0ax,I am a freshman at a university and have not b...,"[freshman, university, able, work, side, proje...","[[(freshman, NN)], [(university, NN)], [(able,..."
2,n2n0ax,I am still in highschool but really interested...,"[still, highschool, really, interested, comput...","[[(still, RB)], [(highschool, NN)], [(really, ..."
3,n2n0ax,"This is probably a common question, but how we...","[probably, common, question, well, cod, bootca...","[[(probably, RB)], [(common, JJ)], [(question,..."
4,n2n0ax,I am planning on starting a CS major this fall...,"[planning, start, c, major, fall]","[[(planning, NN)], [(starting, VBG)], [(cs, NN..."
...,...,...,...,...
3174,myc3u1,Look for Ben Eater on YouTube if you are looki...,"[look, ben, eater, youtube, look, stuff, trans...","[[(look, NN)], [(ben, NN)], [(eater, NN)], [(y..."
3175,myc3u1,"I would recommend ""Structured Computer Organiz...","[would, recommend, structure, computer, organi...","[[(would, MD)], [(recommend, NN)], [(structure..."
3176,myc3u1,"Just to give you a very brief overview, there ...","[give, brief, overview, little, circuit, compo...","[[(give, VB)], [(brief, NN)], [(overview, NN)]..."
3177,myc3u1,Thank you for this amazing list of resources :...,"[thank, amaze, list, resource, never, imagine,...","[[(thank, NN)], [(amazing, VBG)], [(list, NN)]..."


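With `lemm=True`, the call adds two columns: `comment_word_token`, which now holds lemmatized rather than stemmed tokens, and `comment_tag`, which holds the part-of-speech (POS) tags.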
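In the output above, each `comment_tag` cell is displayed as a list of one-element lists, each holding a `(word, tag)` tuple. If a flat list of pairs is easier to work with, it can be unpacked as sketched below. This assumes the cells really contain Python lists of tuples as displayed; the helper name `flatten_tags` is hypothetical, and the sample data is limited to the three entries visible for row 0.

```python
# First three entries of the `comment_tag` cell displayed for row 0 above.
nested_tags = [[("late", "RB")], [("late", "RB")], [("start", "NN")]]


def flatten_tags(nested):
    """Turn [[(word, tag)], ...] into [(word, tag), ...]."""
    return [pair for group in nested for pair in group]


flat = flatten_tags(nested_tags)
words = [word for word, _ in flat]  # tokens before lemmatization
tags = [tag for _, tag in flat]     # POS tags in the same order

print(flat)
print(tags)
```

Having the tags as a separate flat list makes it straightforward to count tag frequencies or to filter tokens by part of speech.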

### Call preprocess functions manually

In [6]:
df = pd.read_csv("../data/raw/computerscience_comments.csv")
df.head()

Unnamed: 0,post_id,comment
0,n2n0ax,How late is too late to start a career in prog...
1,n2n0ax,I am a freshman at a university and haven't be...
2,n2n0ax,I'm still in highschool but really interested ...
3,n2n0ax,"This is probably a common question, but how we..."
4,n2n0ax,I am planning on starting a CS major this fall...


In [7]:
# Fill NaN values with empty strings ('')
preprocessor.fill_na(df, 'comment')
# Remove URLs from the text
preprocessor.remove_urls(df, 'comment')
# Expand any contractions
preprocessor.expand_contractions(df, 'comment')

# Tokenize the string into a list of words
preprocessor.tokenize(df, 'comment')

# Filter stopwords
preprocessor.filter_stopwords(df, 'comment')

# We chose not to stem or lemmatize the words, so the call below is commented out.
# Uncomment it to see how lemmatization changes the DataFrame.
# preprocessor.lemm(df, 'comment')

# Convert the word_token column back into a plain string
preprocessor.token_to_str(df, 'comment')
df

Unnamed: 0,post_id,comment,comment_word_token,comment_untokenized
0,n2n0ax,How late is too late to start a career in prog...,"[How, late, late, start, career, programming, ?]",How late late start career programming ?
1,n2n0ax,I am a freshman at a university and have not b...,"[I, freshman, university, able, work, side, pr...",I freshman university able work side project l...
2,n2n0ax,I am still in highschool but really interested...,"[I, still, highschool, really, interested, com...",I still highschool really interested computer ...
3,n2n0ax,"This is probably a common question, but how we...","[This, probably, common, question, ,, well, co...","This probably common question , well coding bo..."
4,n2n0ax,I am planning on starting a CS major this fall...,"[I, planning, starting, CS, major, fall, .]",I planning starting CS major fall .
...,...,...,...,...
3174,myc3u1,Look for Ben Eater on YouTube if you are looki...,"[Look, Ben, Eater, YouTube, looking, stuff, ``...",Look Ben Eater YouTube looking stuff `` transi...
3175,myc3u1,"I would recommend ""Structured Computer Organiz...","[I, would, recommend, ``, Structured, Computer...",I would recommend `` Structured Computer Organ...
3176,myc3u1,"Just to give you a very brief overview, there ...","[Just, give, brief, overview, ,, little, circu...","Just give brief overview , little circuit comp..."
3177,myc3u1,Thank you for this amazing list of resources :...,"[Thank, amazing, list, resources, :, ), Never,...",Thank amazing list resources : ) Never imagine...


Calling the preprocessing functions manually gives you full control over which steps run. In the example above, we did not want to stem or lemmatize the text, so we simply did not call that function. Likewise, we did not call the special-character filter, which is why punctuation tokens such as `?` and `,` remain in the output.
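The same idea of choosing steps freely can be expressed as a list of functions applied in order. The sketch below uses plain string functions as stand-ins; they share names with some of the steps but are not the `PreProcess` methods, and `run_steps` is a hypothetical helper.

```python
import re


# Stand-in steps that each take a string and return a string.
def remove_urls(text):
    return re.sub(r"https?://\S+", "", text)


def remove_hashtags(text):
    return re.sub(r"#\w+", "", text)


def remove_newlines(text):
    return text.replace("\n", " ")


def run_steps(text, steps):
    """Apply only the chosen steps, in the order given."""
    for step in steps:
        text = step(text)
    return text


raw = "see https://x.io #tag\nnow"

# Skip hashtag removal simply by leaving it out of the list.
print(repr(run_steps(raw, [remove_urls, remove_newlines])))

# Include every step when the full clean-up is wanted.
print(repr(run_steps(raw, [remove_urls, remove_hashtags, remove_newlines])))
```

Leaving a function out of the list plays the same role as not calling the corresponding method on `preprocessor`.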