https://docs.fast.ai/text.transform.html
https://github.com/fastai/fastai/blob/0b31610c6a836c56a337e2a34ee2d1510456d1c6/tests/test_text_transform.py#L19

### Steps

1. Load the dataset into a pandas dataframe.
1. Use regular expressions to remove elements that are not words, such as HTML tags, LaTeX expressions, URLs, digits, and line returns.
1. Remove rows whose text is missing.
1. Remove texts that are too long or too short to carry useful information for the model. Keep paragraphs that contain at least a few words, and drop paragraphs that consist mostly of large numerical tables.
1. Use a tokenizer to create a version of the original text that is a string of space-separated lowercase tokens.

### Deliverable

* A .csv file that contains the original columns plus a new column holding the string of lowercase, space-separated tokens.

### Download the dataset

In [68]:
import os
from urllib import request

# Create the local data directory if it does not already exist.
os.makedirs('./data', exist_ok=True)

GZ_FILE  = 'stackexchange_812k.csv.gz'
DATA_URL = 'https://liveproject-resources.s3.amazonaws.com/116/other/stackexchange_812k.csv.gz'

# Download the archive only once; later runs reuse the local copy.
if not os.path.exists(f'data/{GZ_FILE}'):
    request.urlretrieve(DATA_URL, f'data/{GZ_FILE}')

### Load into Pandas

In [22]:
import gzip
import pandas as pd

# Decompress on the fly and parse the CSV into a dataframe.
with gzip.open(f'data/{GZ_FILE}') as f:
    df = pd.read_csv(f)
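
pandas can also decompress gzip on its own, so the explicit `gzip.open` is optional. A minimal sketch, using a small throwaway file rather than the real dataset (the `demo.csv.gz` name and its single row are made up for illustration):

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real archive: a tiny gzipped CSV.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write('post_id,text,category\n1,Eliciting priors from experts,title\n')

# compression='infer' (the default) selects gzip from the .gz extension.
demo_df = pd.read_csv(demo_path, compression='infer')
print(demo_df.shape)
```

Applied to this notebook, that would presumably be `pd.read_csv(f'data/{GZ_FILE}')`.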

### Clean the Data

In [23]:
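import re

def clean_text(txt):
    txt = re.sub(r'<pre>.*?</pre>', '', txt, flags=re.S)  # <pre> blocks, including their contents
    txt = re.sub(r'<[^<]+?>', '', txt)         # remaining HTML tags
    txt = re.sub(r'\$[^$]+\$', '', txt)        # inline LaTeX
    txt = re.sub(r'https?://[^\s]*', '', txt)  # URLs
    txt = re.sub(r'\s+', ' ', txt)             # collapse whitespace, including line returns
    return txt

# Drop rows with missing text first; clean_text expects a string.
df = df.dropna(subset=['text'])
df['text'] = df['text'].apply(clean_text)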
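
To see what the substitutions do, here is a small check on a made-up sample string. `clean_text` is repeated so the snippet runs on its own. `remove_digits` is a hypothetical extra step that is not part of the notebook: the steps list mentions digits, but the function above leaves them in place.

```python
import re

def clean_text(txt):
    # Same substitutions as the cell above, repeated so this snippet is self-contained.
    txt = re.sub(r'<pre>.*?</pre>', '', txt, flags=re.S)
    txt = re.sub(r'<[^<]+?>', '', txt)
    txt = re.sub(r'\$[^$]+\$', '', txt)
    txt = re.sub(r'https?://[^\s]*', '', txt)
    txt = re.sub(r'\s+', ' ', txt)
    return txt

def remove_digits(txt):
    # Hypothetical extra step (not in the notebook): drop digit runs, then re-collapse spaces.
    return re.sub(r'\s+', ' ', re.sub(r'\d+', '', txt)).strip()

sample = '<p>See $x^2$ at https://example.com for\n3 examples.</p><pre>code()</pre>'
cleaned = clean_text(sample)
print(repr(cleaned))                 # 'See at for 3 examples.'
print(repr(remove_digits(cleaned)))  # 'See at for examples.'
```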


In [25]:
# Keep only texts longer than 10 characters.
long_enough = df["text"].str.len() > 10
df = df[long_enough]
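
The filter above only handles short texts. Steps 3 and 4 also call for dropping missing values and overly long texts, which might look like the following sketch on toy data. `MAX_CHARS` is a hypothetical cutoff chosen for the toy example; a real value would come from inspecting the length distribution.

```python
import pandas as pd

# Toy data standing in for the real dataframe.
toy = pd.DataFrame({'text': ['ok', None, 'A short but usable sentence.', 'x' * 50]})

MIN_CHARS = 10  # same threshold as the filter above
MAX_CHARS = 40  # hypothetical upper bound, for illustration only

toy = toy.dropna(subset=['text'])  # step 3: remove missing texts
lengths = toy['text'].str.len()
toy = toy[(lengths > MIN_CHARS) & (lengths <= MAX_CHARS)]  # step 4: too short or too long
print(toy['text'].tolist())
```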

In [26]:
df.head()

| Unnamed: 0 | post_id | parent_id | comment_id | text | category |
|---|---|---|---|---|---|
| 0 | 1 | | | Eliciting priors from experts | title |
| 1 | 2 | | | What is normality? | title |
| 2 | 3 | | | What are some valuable Statistical Analysis op... | title |
| 3 | 4 | | | Assessing the significance of differences in d... | title |
| 4 | 6 | | | The Two Cultures: statistics vs. machine learn... | title |


### Explore the Data

You can skip this section unless you want to see what the data looks like.

In [27]:
#largest text fields
df['length'] = df['text'].str.len()
df.sort_values('length', ascending=False).head()

| Unnamed: 0 | post_id | parent_id | comment_id | text | category | length |
|---|---|---|---|---|---|---|
| 211158 | 123567 | 123063.0 | | In this my answer (a second and additional to ... | post | 22088 |
| 155389 | 438347 | | | I would like to clean multiple time series of ... | post | 20902 |
| 193171 | 316129 | 315502.0 | | This answer aims to do four things: Review Ros... | post | 18729 |
| 246925 | 247250 | 247094.0 | | If "manually" includes "mechanical" then you h... | post | 16999 |
| 211091 | 123389 | 121852.0 | | I am going to change the order of questions ab... | post | 16892 |


In [28]:
#shortest text fields
df.sort_values('length', ascending=True).head()

| Unnamed: 0 | post_id | parent_id | comment_id | text | category | length |
|---|---|---|---|---|---|---|
| 400901 | 181229 | | 344015.0 | Maybe see: | comment | 11 |
| 705986 | 350492 | | 782409.0 | Some dups: | comment | 11 |
| 261914 | 1889 | | 2034.0 | like so ;-) | comment | 11 |
| 487681 | 375856 | | 706374.0 | Please see | comment | 11 |
| 688593 | 333532 | | 741790.0 | [DataCamp]( | comment | 11 |


In [29]:
df['length'].plot.hist(bins=100)

[Figure: histogram of the `length` column, 100 bins]

### Tokenize the results

#### spacy tokenization

In [30]:
import spacy
from spacy.lang.en import English


In [41]:
nlp = English()

In [None]:
%%time
df['tokenized'] = df['text'].apply(lambda txt: ' '.join(t.text for t in nlp(txt)))

In [65]:
df.head()

| Unnamed: 0 | post_id | parent_id | comment_id | text | category | length | tokenized |
|---|---|---|---|---|---|---|---|
| 0 | 1 | | | Eliciting priors from experts | title | 29 | Eliciting priors from experts |
| 1 | 2 | | | What is normality? | title | 18 | What is normality ? |
| 2 | 3 | | | What are some valuable Statistical Analysis op... | title | 65 | What are some valuable Statistical Analysis op... |
| 3 | 4 | | | Assessing the significance of differences in d... | title | 58 | Assessing the significance of differences in d... |
| 4 | 6 | | | The Two Cultures: statistics vs. machine learn... | title | 50 | The Two Cultures : statistics vs. machine lear... |
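
The deliverable calls for lowercase tokens, while the `tokenized` column above preserves case. Below is a minimal sketch of one way to lowercase while tokenizing. It streams the texts through `nlp.pipe` instead of calling `nlp` once per row. The helper name `to_token_string` is an assumption for illustration, as is the regex fallback, which is used only if spaCy is not installed and is not guaranteed to split text exactly as spaCy does.

```python
import re

try:
    from spacy.lang.en import English
    _nlp = English()  # blank English pipeline: tokenizer only
except ImportError:
    _nlp = None

def to_token_string(texts):
    """Return one lowercase, space-separated token string per input text."""
    if _nlp is not None:
        # nlp.pipe processes the texts as a stream, in batches.
        return [' '.join(tok.lower_ for tok in doc) for doc in _nlp.pipe(texts)]
    # Rough fallback: runs of word characters and single punctuation marks.
    return [' '.join(re.findall(r'\w+|[^\w\s]', t.lower())) for t in texts]

result = to_token_string(['What is normality?', 'Eliciting priors from experts'])
print(result)
```

On the dataframe this would presumably be used as `df['tokenized'] = to_token_string(df['text'])`.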


#### fastai tokenization

I haven't run this yet. I have it here in case it turns out to be a faster solution.

In [None]:
from fastai.text import *

In [48]:
tokenizer = Tokenizer()

In [49]:
%%time
texts  = df['text'].values
tokens = tokenizer.process_all(texts) #faster to do it all at once?


In [None]:
df['tokenized2'] = [' '.join(tt) for tt in tokens]

### Write to CSV

In [66]:
OUT_FILE = 'stackexchange_tokenized.csv'
# index=False avoids writing the dataframe index as an extra unnamed column.
df.to_csv(f'data/{OUT_FILE}', index=False)
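
An `Unnamed: 0` column like the one in the loaded data is typical of a CSV that was written with its index and then read back. Passing `index=False` to `to_csv` keeps the index out of the file. A small round-trip sketch on toy data (the frame contents and the `tokenized_demo.csv` file name are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the tokenized dataframe.
out_df = pd.DataFrame({
    'post_id': [1, 2],
    'text': ['Eliciting priors from experts', 'What is normality?'],
    'tokenized': ['eliciting priors from experts', 'what is normality ?'],
})

out_path = os.path.join(tempfile.mkdtemp(), 'tokenized_demo.csv')  # hypothetical file name
out_df.to_csv(out_path, index=False)  # no extra index column in the file

# Read the file back and confirm the columns survived unchanged.
check = pd.read_csv(out_path)
print(list(check.columns))
print(len(check))
```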