https://docs.fast.ai/text.transform.html
https://github.com/fastai/fastai/blob/0b31610c6a836c56a337e2a34ee2d1510456d1c6/tests/test_text_transform.py#L19

### Steps

1. Load the dataset into a pandas dataframe.
1. Use regular expressions to remove elements that are not words, such as HTML tags, LaTeX expressions, URLs, digits, and line returns.
1. Remove rows whose text is missing.
1. Remove texts that are too long or too short to be informative to the model. Keep paragraphs that contain at least a few words, and drop paragraphs that consist mostly of large numerical tables.
1. Use a tokenizer to create a version of the original text that is a string of space-separated lowercase tokens.

### Deliverable

* A .csv file that contains the original columns and a new column for the string of lowercase, space-separated tokens

### Download the dataset

In [16]:
import os
from urllib import request

GZ_FILE   = 'stackexchange_812k.csv.gz'
#DATA_FILE = 'stackexchange_812k.csv'
DATA_URL  = 'https://liveproject-resources.s3.amazonaws.com/116/other/stackexchange_812k.csv.gz'

# Create the data directory if needed; urlretrieve fails when it is missing.
os.makedirs('data', exist_ok=True)
# Download only when the archive is not already on disk.
if not os.path.exists(f'data/{GZ_FILE}'):
    request.urlretrieve(DATA_URL, f'data/{GZ_FILE}')

### Load into Pandas

In [22]:
import gzip
import pandas as pd
# Read the CSV straight from the gzip archive.
with gzip.open(f'data/{GZ_FILE}') as f:
    df = pd.read_csv(f)

### Clean the Data

In [23]:
import re

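def clean_text(txt):
    # Order matters: code blocks go first so the markup inside them is dropped with them.
    txt = re.sub(r'<pre>.*?</pre>', r'', txt, flags=re.S)  # <pre> code blocks, across newlines
    txt = re.sub(r'<[^<]+?>', '', txt)                     # remaining HTML tags
    txt = re.sub(r'\$[^$]+\$', '', txt)                    # inline LaTeX between $...$
    txt = re.sub(r'https?://[^\s]*', '', txt)              # URLs
    txt = re.sub(r'\s+', ' ', txt)                         # condense whitespace, including line returns
    return txt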

df['text'] = df['text'].apply(clean_text)
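
The step list also mentions digits, which `clean_text` leaves in place. Below is a minimal sketch of a variant that also drops standalone numbers. The name `clean_text_no_digits` and the reading of "digits" as whole numeric tokens (so that something like `x2` is kept) are assumptions, not part of the original notebook.

```python
import re

def clean_text_no_digits(txt):
    """Hypothetical variant of clean_text that also removes standalone numbers."""
    txt = re.sub(r'<pre>.*?</pre>', '', txt, flags=re.S)  # <pre> code blocks
    txt = re.sub(r'<[^<]+?>', '', txt)                    # remaining HTML tags
    txt = re.sub(r'\$[^$]+\$', '', txt)                   # inline LaTeX
    txt = re.sub(r'https?://\S*', '', txt)                # URLs
    txt = re.sub(r'\b\d+(?:[.,]\d+)*\b', '', txt)         # numbers such as 1250 or 3.14
    txt = re.sub(r'\s+', ' ', txt)                        # condense whitespace
    return txt.strip()

sample = "<p>Fit $y = mx + b$ to 1250 points, see https://example.com/docs</p>\n<pre>x = 1</pre>"
print(clean_text_no_digits(sample))
```

Whether numbers should go at all depends on the downstream model; quantities like "5 years" may carry meaning worth keeping.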


In [25]:
# Keep only rows whose cleaned text is longer than 10 characters.
long_enough = df["text"].str.len() > 10
df = df[long_enough]

In [26]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title


### Explore the Data

You can skip this section unless you want to see what the data looks like.

In [27]:
#largest text fields
df['length'] = df['text'].str.len()
df.sort_values('length', ascending=False).head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,length
211158,123567,123063.0,,In this my answer (a second and additional to ...,post,22088
155389,438347,,,I would like to clean multiple time series of ...,post,20902
193171,316129,315502.0,,This answer aims to do four things: Review Ros...,post,18729
246925,247250,247094.0,,"If ""manually"" includes ""mechanical"" then you h...",post,16999
211091,123389,121852.0,,I am going to change the order of questions ab...,post,16892


In [28]:
#shortest text fields
df.sort_values('length', ascending=True).head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,length
400901,181229,,344015.0,Maybe see:,comment,11
705986,350492,,782409.0,Some dups:,comment,11
261914,1889,,2034.0,like so ;-),comment,11
487681,375856,,706374.0,Please see,comment,11
688593,333532,,741790.0,[DataCamp](,comment,11
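
Rows such as `Maybe see:` pass the 10-character filter but hold only one or two words. The steps ask for paragraphs with "at least a few words", so a word-count check may be a closer fit. This is a minimal sketch in plain Python; the helper name and the `min_words` and `max_chars` thresholds are illustrative assumptions, not values from the notebook.

```python
def has_enough_words(txt, min_words=3, max_chars=5000):
    """Keep texts with at least min_words whitespace-separated words and at most max_chars characters."""
    return len(txt.split()) >= min_words and len(txt) <= max_chars

samples = ["Maybe see:", "[DataCamp](", "What is normality?", "word " * 2000]
kept = [s for s in samples if has_enough_words(s)]
print(kept)
```

On a dataframe, the same predicate could be applied to the `text` column to build a boolean mask.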


In [29]:
df['length'].plot.hist(bins=100)

<matplotlib.axes._subplots.AxesSubplot at 0x7f6d25cc6f60>

### Tokenize the results

#### spacy tokenization

In [30]:
import spacy
import re
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English


In [41]:
nlp = English()

In [None]:
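# Tokenize each text and re-join the tokens with single spaces.
# t.text keeps the original casing; t.lower_ would give the lowercase tokens the deliverable asks for.
df['tokenized'] = df['text'].apply(lambda s: ' '.join(t.text for t in nlp(s)))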

In [65]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,length,tokenized
0,1,,,Eliciting priors from experts,title,29,Eliciting priors from experts
1,2,,,What is normality?,title,18,What is normality ?
2,3,,,What are some valuable Statistical Analysis op...,title,65,What are some valuable Statistical Analysis op...
3,4,,,Assessing the significance of differences in d...,title,58,Assessing the significance of differences in d...
4,6,,,The Two Cultures: statistics vs. machine learn...,title,50,The Two Cultures : statistics vs. machine lear...
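
The `tokenized` column above keeps the original casing, while the deliverable asks for lowercase tokens. The sketch below shows the target format using a small regex tokenizer from the standard library. It is a stand-in for illustration, not spaCy's tokenizer, and it splits differently in places (for example, it separates `vs.` into `vs` and `.`). `TOKEN_RE` and `simple_tokenize` are hypothetical names.

```python
import re

# A word (optionally with an apostrophe suffix such as "don't"), or any single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Return a string of lowercase, space-separated tokens."""
    return ' '.join(TOKEN_RE.findall(text.lower()))

print(simple_tokenize("What is normality?"))
print(simple_tokenize("The Two Cultures: statistics vs. machine learning"))
```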


#### fastai tokenization

I haven't run this yet. I have it here in case it turns out to be a faster solution.

In [None]:
from fastai.text import *

In [48]:
tokenizer = Tokenizer()

In [49]:
%%time
texts  = df['text'].values
tokens = tokenizer.process_all(texts) #faster to do it all at once?


In [None]:
df['tokenized2'] = [' '.join(tt) for tt in tokens]

### Write to CSV

In [66]:
OUT_FILE= 'stackexchange_tokenized.csv'
df.to_csv(f'data/{OUT_FILE}')

In [57]:
#easy to combine spacy and fastai. 

#tokenizer = Tokenizer()
tok = SpacyTokenizer('en')
' '.join(tokenizer.process_text(df.loc[211091].text, tok))

'i am going to change the order of questions about . i \'ve found textbooks and lecture notes frequently disagree , and would like a system to work through the choice that can safely be recommended as best practice , and especially a textbook or paper this can be cited to . xxmaj unfortunately , some discussions of this issue in books and so on rely on received wisdom . xxmaj sometimes that received wisdom is reasonable , sometimes it is less so ( at the least in the sense that it tends to focus on a smaller issue when a larger problem is ignored ) ; we should examine the justifications offered for the advice ( if any justification is offered at all ) with care . xxmaj most guides to choosing a t - test or non - parametric test focus on the normality issue . xxmaj that ’s true , but it ’s somewhat misguided for several reasons that i address in this answer . xxmaj if performing an " unrelated samples " or " unpaired " t - test , whether to use a xxmaj welch correction ? xxmaj this ( to

In [5]:
texts = ['one two three four', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', "I'm suddenly SHOUTING FOR NO REASON"]


In [6]:
tokenizer.process_all(texts)

[['one', 'two', 'three', 'four'],
 ['xxmaj',
  'lorem',
  'ipsum',
  'dolor',
  'sit',
  'amet,',
  'consectetur',
  'adipiscing',
  'elit,',
  'sed',
  'do',
  'eiusmod',
  'tempor',
  'incididunt',
  'ut',
  'labore',
  'et',
  'dolore',
  'magna',
  'aliqua.'],
 ['xxmaj',
  "i'm",
  'suddenly',
  'xxup',
  'shouting',
  'xxup',
  'for',
  'xxup',
  'no',
  'xxup',
  'reason']]
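
In the output above, `xxmaj` marks a token that started with a capital letter and `xxup` marks a token that was written in all caps; the token itself is lowercased. The sketch below reproduces that marking on whitespace-split tokens to make the rule explicit. It is a rough re-implementation inferred from the output, not fastai's code, and `mark_caps` is a hypothetical name.

```python
def mark_caps(tokens):
    """Lowercase tokens, inserting xxup before all-caps words and xxmaj before capitalized ones."""
    out = []
    for t in tokens:
        if len(t) > 1 and t.isupper():
            out += ['xxup', t.lower()]
        elif len(t) > 1 and t[0].isupper() and t[1:].islower():
            out += ['xxmaj', t.lower()]
        else:
            out.append(t.lower())
    return out

print(mark_caps("I'm suddenly SHOUTING FOR NO REASON".split()))
```

These markers let a model keep capitalization information with a lowercase vocabulary. If the deliverable should contain plain lowercase tokens only, they would need to be filtered out before writing the CSV.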

Load the dataset into a pandas dataframe.

In [8]:
import gzip
import pandas as pd
with gzip.open(f'data/{GZ_FILE}') as f:
    df = pd.read_csv(f)

In [27]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title


Use regular expressions to remove elements that are not words such as HTML tags, LaTeX expressions, URLs, digits, line returns, and so on.


In [33]:
# What will fastai do here?
html = fix_html("<b>hello</b> 1234")  # the tags survive: fix_html did not strip them
tokenizer.process_all([html])

[['<b>hello<', '/', 'b>', '1234']]

In [103]:
# Simple regex approach: one sample string per pattern
import re

pre = "text begin<pre>code inside!\n\n</pre> text end"
latex = r'hello $y = mx + b$ is my equation'
url = "my favorite website is https://www.stylemepretty.com. Love it"
spaces = "this is no good.\r\n no good.    bad\r\r\r\n\n boy."

no_pre = re.sub(r'<pre>.*?</pre>', r'', pre, flags=re.S)  # drop <pre> blocks, across newlines
no_tags = re.sub(r'<[^<]+?>', '', html)                   # drop HTML tags (html comes from the previous cell)
no_latex = re.sub(r'\$[^$]+\$', '', latex)                # drop inline LaTeX
no_url = re.sub(r'https?://[^\s]*', '', url)              # drop URLs
condensed = re.sub(r'\s+', ' ', spaces)                   # condense whitespace

# TODO: remove punctuation

no_pre

'text begin text end'
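
The cell above leaves punctuation removal open. One possible approach, sketched here with the standard library only, replaces every ASCII punctuation character with a space and then condenses whitespace. `PUNCT_TABLE` and `strip_punctuation` are hypothetical names. The sketch assumes that splitting contractions (`don't` becomes `don t`) is acceptable, and it leaves non-ASCII marks such as curly quotes untouched.

```python
import re
import string

# Map each ASCII punctuation character to a space so adjacent words do not merge.
PUNCT_TABLE = str.maketrans({ch: ' ' for ch in string.punctuation})

def strip_punctuation(txt):
    """Replace ASCII punctuation with spaces, then condense whitespace."""
    return re.sub(r'\s+', ' ', txt.translate(PUNCT_TABLE)).strip()

print(strip_punctuation("this is no good. no good; bad (boy)!"))
```

If punctuation should survive as separate tokens instead, this step can be skipped and left to the tokenizer.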

In [45]:
#messing with spacy
# https://spacy.io/usage/linguistic-features#native-tokenizers
import spacy
import re
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[[("']''')
suffix_re = re.compile(r'''[])"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

#nlp = spacy.load("en_core_web_sm")
nlp = English()
tok = Tokenizer(nlp.vocab, rules=special_cases,
                prefix_search=prefix_re.search,
                suffix_search=suffix_re.search,
                infix_finditer=infix_re.finditer,
                #token_match=simple_url_re.match
               )

#nlp.tokenizer = tok  # still commented out, so nlp uses the default English tokenizer below
doc = nlp("my favorite website is https://www.stylemepretty.com - boom. :)") #suffix_re matches to 'boom.'
print([t.text for t in doc])


['my', 'favorite', 'website', 'is', 'https://www.stylemepretty.com', '-', 'boom', '.', ':)']


Remove missing values for texts.

In [53]:
df[df["text"].isnull()] #no text is null

Unnamed: 0,post_id,parent_id,comment_id,text,category
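
No row is null here, but a guard is still useful: pandas reads empty CSV fields as `NaN` (a float) by default, and `re.sub` raises a `TypeError` when given a float. The sketch below shows one way to detect missing or blank texts in plain Python. `is_missing` is a hypothetical helper, and treating whitespace-only strings as missing is an assumption.

```python
import math

def is_missing(value):
    """True for None, NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()

texts = ["Eliciting priors from experts", None, float("nan"), "   ", "What is normality?"]
kept = [t for t in texts if not is_missing(t)]
print(kept)
```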


In [59]:
minsize = df["text"].str.len() > 10
maxsize = df["text"].str.len() < 300
df[maxsize & minsize].head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title


Remove the paragraphs that are composed of large numerical tables.

In [60]:
very_big = df["text"].str.len() > 300
df[very_big].head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
91755,4,,,<p>I have two groups of data. Each with a dif...,post
91756,5,3.0,,"<p>The R-project</p>\n\n<p><a href=""http://www...",post
91757,6,,,"<p>Last year, I read a blog post from <a href=...",post
91758,7,,,<p>I've been working on a new method for analy...,post
91762,11,,,"<p>Is there a good, modern treatment covering ...",post


In [77]:
df.loc[91755].text

"<p>I have two groups of data.  Each with a different distribution of multiple variables.  I'm trying to determine if these two groups' distributions are different in a statistically significant way.  I have the data in both raw form and binned up in easier to deal with discrete categories with frequency counts in each.  </p>\n\n<p>What tests/procedures/methods should I use to determine whether or not these two groups are significantly different and how do I do that in SAS or R (or Orange)?</p>\n"

In [70]:
# str.contains accepts a regular expression; flag texts with a run of 8 or more digits
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
has_long_number = df["text"].str.contains(r'\d{8,}', regex=True)

In [71]:
df[has_long_number].head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
91757,6,,,"<p>Last year, I read a blog post from <a href=...",post
91796,71,58.0,,<p>It's an algorithm for training feedforward ...,post
91826,126,125.0,,"<p>My favorite is <a href=""http://www.amazon.c...",post
91839,151,118.0,,<p>If the goal of the standard deviation is to...,post
91883,231,223.0,,<p>This is one I've used successfully:</p>\n\n...,post


In [75]:
df.loc[91826].text

'<p>My favorite is <a href="http://www.amazon.com/exec/obidos/ISBN=158488388X/">"Bayesian Data Analysis"</a> by Gelman, et al.</p>\n'
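
The row above is a false positive: the 8-digit run comes from an ISBN inside a URL, not from a numerical table. A share-of-digits heuristic applied after cleaning may separate tables from prose more reliably. This is a minimal sketch; the function names and the `0.5` threshold are illustrative assumptions and would need tuning against the real data.

```python
def digit_ratio(txt):
    """Fraction of non-whitespace characters that are digits."""
    non_space = [ch for ch in txt if not ch.isspace()]
    if not non_space:
        return 0.0
    return sum(ch.isdigit() for ch in non_space) / len(non_space)

def looks_like_table(txt, threshold=0.5):
    """Flag texts made up mostly of digits."""
    return digit_ratio(txt) > threshold

prose = 'My favorite is "Bayesian Data Analysis" by Gelman, et al.'
table = "0.12 0.45 0.33 1.20 3.40 5.60 7.80 9.10"
print(looks_like_table(prose), looks_like_table(table))
```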

In [81]:
#largest text fields
df['length'] = df['text'].str.len()
df.sort_values('length', ascending=False).head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,length
99414,13845,,,<p>I’m working on a trading system and need to...,post,38847
124323,67228,,,<p>How do I calculate the uncertainties in lin...,post,35321
235896,215962,,,<p>I am trying to determine the habitat of a s...,post,33306
249773,254466,,,"<p>I have a time series Y, for one year and me...",post,29837
183579,286236,,,<p>I have a fitted mixed-effects model with a ...,post,29457


In [86]:
#example of large numerical table
#I wonder if i should filter out rows with `<pre>`?
df.loc[99414].text

'<p>I’m working on a trading system and need to apply some statistics on the results. Unfortunately I forgot all about statistics after I left university over a decade ago and now I really have no clue how I must calculate what I need. Hopefully someone can help me out.</p>\n\n<p>Out of the trading application (currently in test mode), I get profit / loss (PL) per trade and per day.</p>\n\n<p>Let’s say I have the day-to-day PL (an accumulation will give the total PL over the given period) of 5 years back testing (about 1250 points), what is the best way of “predicting” what the total profit might be in the next 6 months (125 points ahead) and the next year (250 points ahead)?</p>\n\n<p>Of course not every trade is profitable. So I have some trades with losses and (hopefully) more trades with profit.</p>\n\n<p>What is the best way of calculating what the profit per day (with a certain reliability) will be when you only take the winning trades into account, what lose will be when you onl

In [9]:
import re

def clean_text(txt):
    txt = re.sub(r'<pre>.*?</pre>', r'', txt, flags=re.S)
    txt = re.sub(r'<[^<]+?>', '', txt) #html tags
    txt = re.sub(r'\$[^$]+\$', '', txt)  #latex
    txt = re.sub(r'https?://[^\s]*', '', txt) #remove URLs
    txt = re.sub(r'\s+', ' ', txt) #condense spaces 
    return txt

df['text'] = df['text'].apply(clean_text)

In [10]:
df.loc[99414].text

'I’m working on a trading system and need to apply some statistics on the results. Unfortunately I forgot all about statistics after I left university over a decade ago and now I really have no clue how I must calculate what I need. Hopefully someone can help me out. Out of the trading application (currently in test mode), I get profit / loss (PL) per trade and per day. Let’s say I have the day-to-day PL (an accumulation will give the total PL over the given period) of 5 years back testing (about 1250 points), what is the best way of “predicting” what the total profit might be in the next 6 months (125 points ahead) and the next year (250 points ahead)? Of course not every trade is profitable. So I have some trades with losses and (hopefully) more trades with profit. What is the best way of calculating what the profit per day (with a certain reliability) will be when you only take the winning trades into account, what lose will be when you only look at the losing trades and what the PL

In [11]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title
