# Preparing textual data for statistics and machine learning

1. Importing the dataset
2. Cleaning the dataset
3. Tokenization
4. Feature extraction on a large dataset



## Importing Data

We use the Reddit (https://www.reddit.com/) self-posts dataset, available on Kaggle.

The data consist of 1,013,000 self-posts from 1013 subreddits (1000 posts per subreddit). For each post, the dataset gives the subreddit, the title and the content of the self-post.

A subreddit is a specific online community, and the posts associated with it, on the social media website Reddit. Subreddits are dedicated to a particular topic that people write about, and they're denoted by /r/, followed by the subreddit's name, e.g., /r/gaming.

In [1]:
import pandas as pd

In [2]:
posts_file = "rspct.tsv"

In [3]:
posts_df = pd.read_csv(posts_file, sep='\t')

In [4]:
posts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013000 entries, 0 to 1012999
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   id         1013000 non-null  object
 1   subreddit  1013000 non-null  object
 2   title      1013000 non-null  object
 3   selftext   1013000 non-null  object
dtypes: object(4)
memory usage: 30.9+ MB


In [5]:
posts_df.head()

Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


In [6]:
# number of distinct subreddits
posts_df['subreddit'].nunique()

1013
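The dataset description states that every subreddit contributes the same number of posts. One way to check such a claim is to count the rows per class. The sketch below uses a small made-up frame (`toy_posts` is hypothetical, not the Kaggle data) so that it runs without the file.

```python
import pandas as pd

# Hypothetical miniature stand-in for posts_df (not the real Kaggle data).
toy_posts = pd.DataFrame({
    "subreddit": ["Harley", "Harley", "intel", "intel", "teenmom", "teenmom"],
    "title": ["a", "b", "c", "d", "e", "f"],
})

# Number of posts per subreddit.
posts_per_subreddit = toy_posts["subreddit"].value_counts()

# The classes are balanced when every subreddit has the same count.
is_balanced = posts_per_subreddit.nunique() == 1
print(posts_per_subreddit.to_dict(), is_balanced)
```

On the full data, the same two lines applied to `posts_df['subreddit']` would be expected to report a single count of 1000 if the description is accurate.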

**subreddit_info.csv**

Contains manual annotations for about 3000 subreddits:

- a top-level category and a subcategory for each subreddit,

- a reason for exclusion if the subreddit does not appear in the data.

This information can be considered **metadata**: information about the characteristics of the text (and not the content of the text).

In [7]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file).set_index(['subreddit'])

In [None]:
subred_df.info()

In [None]:
subred_df.head(10)

In [None]:
subred_df['in_data'].nunique()

In [None]:
n=(subred_df['in_data']==True).value_counts()
print(n)

In [None]:
subred_df.loc['Harley']

In [8]:
df=posts_df.join(subred_df, on ='subreddit')
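`DataFrame.join` with `on='subreddit'` looks up each post's subreddit in the index of `subred_df`. It performs a left join by default, so posts without a matching annotation are kept, with missing values in the added columns. A minimal sketch on made-up rows (the frames and category values below are illustrative only):

```python
import pandas as pd

# Hypothetical posts and annotations, only to show the mechanics of the join.
toy_posts = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "subreddit": ["Harley", "intel", "unknown_sub"],
})
toy_info = pd.DataFrame({
    "subreddit": ["Harley", "intel"],
    "category_1": ["autos", "electronics"],
}).set_index("subreddit")

# Left join: the 'subreddit' column of toy_posts is matched
# against the index of toy_info.
toy_joined = toy_posts.join(toy_info, on="subreddit")
print(toy_joined)

# Rows without an annotation keep NaN in the joined columns.
n_missing = int(toy_joined["category_1"].isna().sum())
print(n_missing)
```

Counting missing values after the join is therefore a quick way to spot subreddits that have no annotation.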

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.isna().sum()

### Standardizing Attributes Names

Usual practice:
- **df**: name of the dataset
- **text**: name of the column containing text to analyze

In [None]:
print(df.columns)

#### Renaming columns

- selftext renamed as text
- category_1 renamed as category
- category_2 renamed as subcategory


category_3, in_data and reason_for_exclusion **are dropped (incomplete data)**

In [9]:
column_mapping = {
    'id':'id',
    'subreddit':'subreddit',
    'title':'title',
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
    'category_3': None,
    'in_data': None,
    'reason_for_exclusion': None
}

In [10]:
columns = [c for c in column_mapping if column_mapping[c] is not None]

In [11]:
print(columns)

['id', 'subreddit', 'title', 'selftext', 'category_1', 'category_2']


In [12]:
df=df[columns].rename(columns=column_mapping)

In [13]:
print(df.columns)

Index(['id', 'subreddit', 'title', 'text', 'category', 'subcategory'], dtype='object')


In [14]:
df.head()

Unnamed: 0,id,subreddit,title,text,category,subcategory
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi...",writing/stories,tech support
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...,tv_show,teen mom
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...,hardware/tools,doorbells
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,...",electronics,cpu


### Selection of data for the autos category

We restrict the data to the autos category.

In [15]:
df=df[df['category']=='autos']

In [None]:
df.info()

In [None]:
df.head()

In [None]:
len(df)

## Python libraries

Two related Python libraries:

- **textacy**: preprocessing (cleaning, normalizing and exploring raw data before processing it with spaCy)

- **spaCy**: fundamentals (tokenization, part-of-speech tagging, dependency parsing, ...)

## Preliminary step: Cleaning Text Data with textacy

The texts are not well edited. Several quality problems need to be taken into account:

- **Salutations, signatures and addresses**: usually not informative.

- **Replies**: if the text contains replies that repeat the question, the duplicated question must be removed. Otherwise, it can bias the statistical analysis.

- **Special formatting and program code**: if the text contains special characters, HTML entities, Markdown tags, etc., these must be removed before the analysis.

The textacy module is used to perform preliminary (cleaning) NLP tasks on texts:

- replacing and removing punctuation, extra whitespace and numbers from the text before processing it with spaCy

- textacy is built upon the spaCy module

https://www.geeksforgeeks.org/textacy-module-in-python/
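To make the kind of noise listed above concrete, the sketch below removes a few of these artifacts with the standard `re` and `html` modules only. It is not how textacy implements its functions: the patterns and the helper name `rough_clean` are simplifying assumptions for illustration.

```python
import html
import re

def rough_clean(text):
    """Rough regex-based cleaning (illustrative only, not textacy's implementation)."""
    text = html.unescape(text)                     # HTML entities: &amp; -> &
    text = re.sub(r"<[^<>]*>", " ", text)          # tags and markers such as <lb>
    text = re.sub(r"https?://\S+", "_URL_", text)  # mask URLs with a placeholder
    text = re.sub(r"\*{1,2}([^*]+)\*{1,2}", r"\1", text)  # Markdown emphasis
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

sample = "See https://example.com/page<lb><lb>It says **long life** coolant &amp; more."
print(rough_clean(sample))
```

In this sketch the tags are removed before the URLs are masked. Otherwise the URL pattern, which runs up to the next whitespace, would also swallow the `<lb>` markers and the word glued to them.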

In [16]:
text = df.loc[df.index[3], 'text']  # text of the fourth row of the filtered data
print(text)

https://www.cars.com/articles/how-often-should-i-change-engine-coolant-1420680853669/<lb><lb>I have a IS 250 AWD from 2006. About 73K miles on it. I've never touched the engine radiator coolant and can't find anything on when to change this in the book. It just says 'long life 100k Toyota coolant.' <lb><lb>Does anyone get this flushed or changed at ten years?? Do I wait until 100k? 


In [17]:
import textacy
import textacy.preprocessing as tprep

In [18]:
preproc = tprep.make_pipeline(
    tprep.replace.urls,
    tprep.remove.html_tags,
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.remove.accents,
    tprep.remove.punctuation,
    tprep.normalize.whitespace,
    tprep.replace.numbers
)

In [19]:
clean_text=preproc(text)

print(clean_text)

URL have a IS _NUMBER_ AWD from _NUMBER_ About 73K miles on it I ve never touched the engine radiator coolant and can t find anything on when to change this in the book It just says long life 100k Toyota coolant Does anyone get this flushed or changed at ten years Do I wait until 100k


### Alternative: creating a specific function

In [None]:
def normalize(text):
    text = tprep.replace.urls(text)  # replace URLs with a placeholder string
    text = tprep.remove.html_tags(text)
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    text = tprep.remove.punctuation(text)
    text = tprep.normalize.whitespace(text)
    text = tprep.replace.numbers(text)
    return text

In [None]:
print(normalize(text))
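Both variants apply the same functions in the same order: a pipeline is a left-to-right composition of text-to-text functions. The sketch below re-creates that idea with `functools.reduce`. The names `compose_pipeline`, `strip_line_breaks` and `squeeze_spaces` are hypothetical helpers, not part of textacy.

```python
from functools import reduce

def compose_pipeline(*funcs):
    """Return a function that applies funcs from left to right."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline

def strip_line_breaks(text):
    # Toy step: the dataset marks line breaks with the literal string <lb>
    return text.replace("<lb>", " ")

def squeeze_spaces(text):
    # Toy step: collapse runs of whitespace into single spaces
    return " ".join(text.split())

toy_preproc = compose_pipeline(strip_line_breaks, squeeze_spaces)
print(toy_preproc("Do I wait<lb><lb>until   100k?"))
```

Because each step sees the output of the previous one, the order matters. In the cleaned text above, the pronoun "I" that followed the URL and the `<lb>` markers is missing, which suggests that the URL step matched everything up to the next space before the markers were removed.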

We load one of spaCy's trained pipelines for English:

https://spacy.io/models/en

In [20]:
import spacy

In [21]:
from spacy.cli import download
download('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [22]:
doc = textacy.make_spacy_doc(clean_text,lang="en_core_web_sm")
doc._.preview

'Doc(62 tokens: "URL have a IS _NUMBER_ AWD from _NUMBER_ About ...")'

In [35]:
print(doc)

URL have a IS _NUMBER_ AWD from _NUMBER_ About 73K miles on it I ve never touched the engine radiator coolant and can t find anything on when to change this in the book It just says long life 100k Toyota coolant Does anyone get this flushed or changed at ten years Do I wait until 100k


### Displaying tokens in a document

In [37]:
for tok in doc:
    print(tok.text)

URL
have
a
IS
_
NUMBER
_
AWD
from
_
NUMBER
_
About
73
K
miles
on
it
I
ve
never
touched
the
engine
radiator
coolant
and
can
t
find
anything
on
when
to
change
this
in
the
book
It
just
says
long
life
100k
Toyota
coolant
Does
anyone
get
this
flushed
or
changed
at
ten
years
Do
I
wait
until
100k


### Tokens have attributes 

- token.is_punct: Is the token punctuation?
- token.is_alpha: Does the token consist of alphabetic characters?
- token.like_email: Does the token resemble an email address?
- token.like_url: Does the token resemble a URL?
- token.is_stop: Is the token part of a “stop list”?
- token.lemma_: Base form of the token, with no inflectional suffixes.
- token.pos_: Coarse-grained part-of-speech tag (https://universaldependencies.org/u/pos/)

See https://spacy.io/api/token for the list of all attributes.

In [39]:
for token in doc:
    print(token,token.is_punct)

URL False
have False
a False
IS False
_ True
NUMBER False
_ True
AWD False
from False
_ True
NUMBER False
_ True
About False
73 False
K False
miles False
on False
it False
I False
ve False
never False
touched False
the False
engine False
radiator False
coolant False
and False
can False
t False
find False
anything False
on False
when False
to False
change False
this False
in False
the False
book False
It False
just False
says False
long False
life False
100k False
Toyota False
coolant False
Does False
anyone False
get False
this False
flushed False
or False
changed False
at False
ten False
years False
Do False
I False
wait False
until False
100k False


In [41]:
for token in doc:
    print(token,token.is_alpha)

URL True
have True
a True
IS True
_ False
NUMBER True
_ False
AWD True
from True
_ False
NUMBER True
_ False
About True
73 False
K True
miles True
on True
it True
I True
ve True
never True
touched True
the True
engine True
radiator True
coolant True
and True
can True
t True
find True
anything True
on True
when True
to True
change True
this True
in True
the True
book True
It True
just True
says True
long True
life True
100k False
Toyota True
coolant True
Does True
anyone True
get True
this True
flushed True
or True
changed True
at True
ten True
years True
Do True
I True
wait True
until True
100k False


In [40]:
for token in doc:
    print(token,token.is_stop)

URL False
have True
a True
IS True
_ False
NUMBER False
_ False
AWD False
from True
_ False
NUMBER False
_ False
About True
73 False
K False
miles False
on True
it True
I True
ve False
never True
touched False
the True
engine False
radiator False
coolant False
and True
can True
t False
find False
anything True
on True
when True
to True
change False
this True
in True
the True
book False
It True
just True
says False
long False
life False
100k False
Toyota False
coolant False
Does True
anyone True
get True
this True
flushed False
or True
changed False
at True
ten True
years False
Do True
I True
wait False
until True
100k False


### Part-of-speech tags

- The types of words are called **part-of-speech tags**

- Examples: nouns, verbs, adjectives

- It is often important to restrict the words used to certain categories

In [46]:
for token in doc:
    print(token, token.pos_)

URL NOUN
have VERB
a DET
IS PROPN
_ PRON
NUMBER NOUN
_ PUNCT
AWD PROPN
from ADP
_ PROPN
NUMBER NOUN
_ NOUN
About ADV
73 NUM
K NOUN
miles NOUN
on ADP
it PRON
I PRON
ve AUX
never ADV
touched VERB
the DET
engine NOUN
radiator NOUN
coolant NOUN
and CCONJ
can AUX
t NOUN
find VERB
anything PRON
on ADP
when SCONJ
to PART
change VERB
this PRON
in ADP
the DET
book NOUN
It PRON
just ADV
says VERB
long ADJ
life NOUN
100k NUM
Toyota PROPN
coolant NOUN
Does AUX
anyone PRON
get VERB
this PRON
flushed ADJ
or CCONJ
changed VERB
at ADP
ten NUM
years NOUN
Do AUX
I PRON
wait VERB
until ADP
100k NUM


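## Lemmatization / Stemming

- Stemming replaces words with a common stem:
    - "economic", "economics" and "economically" are all reduced to the same stem (the Porter stemmer gives "econom")
    - Porter stemmer (Porter 1980): standard stemming tool for English-language text
- Lemmatization replaces words with their dictionary form (lemma), e.g., "miles" becomes "mile"; this is what spaCy's lemma_ attribute provides
- Smaller vocabulary: faster execution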
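The difference between stemming and lemmatization can be sketched in a few lines. The suffix list and the lookup table below are toy assumptions chosen for this example. They are neither the Porter algorithm nor spaCy's lemmatizer.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just the general idea.
SUFFIXES = ("ically", "ics", "ic", "ed", "s")  # longest suffixes first

def crude_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a dictionary lookup that returns real words.
LEMMA_TABLE = {"miles": "mile", "says": "say", "touched": "touch", "years": "year"}

def crude_lemma(word):
    return LEMMA_TABLE.get(word, word)

for w in ["economic", "economics", "economically", "touched", "miles"]:
    print(w, crude_stem(w), crude_lemma(w))
```

A stem such as "econom" need not be a real word, whereas a lemma is always a dictionary form. That is why lemmas are usually easier to read in later analyses.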

In [44]:
for token in doc:
    print(token,token.lemma_)

URL url
have have
a a
IS IS
_ _
NUMBER number
_ _
AWD AWD
from from
_ _
NUMBER number
_ _
About about
73 73
K k
miles mile
on on
it it
I I
ve ve
never never
touched touch
the the
engine engine
radiator radiator
coolant coolant
and and
can can
t t
find find
anything anything
on on
when when
to to
change change
this this
in in
the the
book book
It it
just just
says say
long long
life life
100k 100k
Toyota Toyota
coolant coolant
Does do
anyone anyone
get get
this this
flushed flushed
or or
changed change
at at
ten ten
years year
Do do
I I
wait wait
until until
100k 100k


### Alternative syntax

In [25]:
nlp=spacy.load('en_core_web_sm')

In [26]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x17e19f3d970>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x17e19f3da30>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x17e0b375cb0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x17e1dafa750>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x17e1daf3310>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x17e0b375b60>)]

In [27]:
doc_alt = nlp(clean_text)

In [28]:
doc_alt._.preview

'Doc(62 tokens: "URL have a IS _NUMBER_ AWD from _NUMBER_ About ...")'

### Analysis of a Doc

- extracting n-grams

In [29]:
from textacy import extract
list(extract.ngrams(doc,2))

[73K,
 K miles,
 engine radiator,
 radiator coolant,
 t find,
 says long,
 long life,
 life 100k,
 100k Toyota,
 Toyota coolant]
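An n-gram is a sequence of n consecutive tokens. The sketch below builds bigrams from a plain token list and drops those that begin or end with a stop word, which is roughly what the default filtering of `extract.ngrams` appears to do in the output above. The stop list `TOY_STOPS` is a hand-picked assumption, not spaCy's list.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

TOY_STOPS = {"the", "and", "on", "it", "i", "never"}  # hypothetical mini stop list

tokens = ["I", "never", "touched", "the", "engine", "radiator", "coolant"]

bigrams = ngrams(tokens, 2)

# Keep only bigrams whose first and last token are not stop words.
kept = [
    " ".join(gram)
    for gram in bigrams
    if gram[0].lower() not in TOY_STOPS and gram[-1].lower() not in TOY_STOPS
]
print(kept)
```

The word "touched" does not appear in any kept bigram because both of its neighbours are stop words, which matches its absence from the textacy output.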

- Identifying key terms

In [30]:
extract.keyterms.textrank(doc, normalize="lemma",topn=5)

[('engine radiator coolant', 0.0653202474980594),
 ('Toyota coolant', 0.04838537088251609),
 ('long life', 0.036703850291551265),
 ('k mile', 0.036327644669932505),
 ('AWD', 0.01817422855742821)]

### Remark: some components of the spaCy pipeline can be disabled

If some components are not needed, we can load the pipeline without them.

In [43]:
nlp_2=spacy.load('en_core_web_sm', disable=["parser","ner"])

## Customizing Tokenization

Sometimes it is necessary to adjust the tokenizer to handle characters such as hyphens, underscores and hash signs (#).

In [None]:
text = "@Pete: can't choose low-carb # food #eat-smart. _url_ ; -) "
doc = nlp.make_doc(text)
for token in doc:
    print(token, end="|")

## Working with stop words

- spaCy uses language-specific stop word lists to set the is_stop property for each token
- Filtering stop words (and punctuation tokens) is easy
- The list of stop words is loaded when an nlp object is created

In [51]:
print(nlp.Defaults.stop_words)

{'where', 'just', 'into', 'so', 'it', 'did', '’ll', 'around', 'why', 'thereupon', 'four', 'must', 'nor', '‘ve', 'own', 'nevertheless', 'eleven', 'until', 'two', 'top', 'ours', 'often', 'due', 'thru', 'take', 'when', 'throughout', 'forty', 'anything', 'whatever', 'yet', 'could', 'several', 'was', 'indeed', 'each', 'last', '’d', 'that', 'a', 'done', 'least', '‘s', 'never', 'seeming', 'seem', 'go', 'less', 'any', 'are', 'else', 'perhaps', 'of', 'between', 'hence', 'such', 'therefore', 'against', 'always', 'anywhere', 'us', 'although', 'using', 'moreover', 'would', 'no', "n't", 'anyway', 'hers', 'too', 'beyond', 'already', 'toward', 'if', 'and', 'becoming', 'in', 'next', 'the', 'whether', 'sixty', 'he', 'however', 'him', 'twenty', 'sometime', 'they', 'same', 'also', 'formerly', 'amongst', 'before', '‘re', '’m', 'towards', 'everyone', 'again', "'ve", 'as', 're', 'amount', 'per', 'off', 'among', 'under', 'namely', 'once', 'mostly', 'yours', 'anyhow', 'together', 'because', 'neither', 'whence

### The list of stop words can be modified

In [None]:
nlp.vocab['down'].is_stop=False
nlp.vocab['Dear'].is_stop=True
nlp.vocab['Regards'].is_stop = True

### Selection of words according to part-of-speech tags

- Each token in a spaCy doc has two part-of-speech attributes:
    - pos_
    - tag_
- tag_ can be language specific 
- pos_ contains the simplified tag of the universal part-of-speech tagset

    https://spacy.io/usage/linguistic-features

- pos_ can be used as an alternative to stop words
- pronouns, prepositions, conjunctions, determiners: 
    - called **function words**
    - their main function is to create grammatical relationships in a sentence
    - not very informative

- nouns, verbs, adjectives and adverbs: 
    - **content** words
    - the meaning of a sentence depends on them
    

- We can use **part-of-speech tags** to select the word types

In [53]:
nouns=[t for t in doc if t.pos_ in ['NOUN','PROPN']]
print(nouns)

[URL, IS, NUMBER, AWD, _, NUMBER, _, K, miles, engine, radiator, coolant, t, book, life, Toyota, coolant, years]
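The same kind of selection can be written directly against (word, tag) pairs. The pairs below are copied from the tagger output shown earlier. The split of universal POS tags into `CONTENT_POS` and everything else is a common convention, written here as an assumption rather than a spaCy constant.

```python
# (token, universal POS tag) pairs taken from the tagger output above
tagged = [
    ("I", "PRON"), ("ve", "AUX"), ("never", "ADV"), ("touched", "VERB"),
    ("the", "DET"), ("engine", "NOUN"), ("radiator", "NOUN"),
    ("coolant", "NOUN"), ("and", "CCONJ"), ("can", "AUX"),
]

CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}  # assumed content-word tags

content_words = [word for word, pos in tagged if pos in CONTENT_POS]
function_words = [word for word, pos in tagged if pos not in CONTENT_POS]

print(content_words)
print(function_words)
```

Filtering on tags in this way needs no stop list, but it depends on the quality of the tagger, which can be low on noisy text (see the tags assigned to "_" and "t" above).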


In [54]:
# The input must be a Doc object
tokens=textacy.extract.words(doc, include_pos={"ADJ","NOUN"})

In [55]:
print(*[t for t in tokens], sep="|")

URL|NUMBER|NUMBER|K|miles|engine|radiator|coolant|t|book|long|life|coolant|flushed|years


In [49]:
# The input must be a Doc object
tokens=textacy.extract.words(doc, include_pos={"ADJ","NOUN"},min_freq=2)

In [None]:
print(*[t for t in tokens], sep="|")

In [None]:
def extract_lemmas(doc, **kwargs):
    # kwargs are passed on to textacy.extract.words (e.g. include_pos, min_freq)
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

In [None]:
lemmas = extract_lemmas(doc,  include_pos={"ADJ","NOUN"})
print(*lemmas, sep = "|")

### Extracting Named entities

- Named-entity recognition is the process of detecting entities such as people, locations and organizations in texts
- The **named-entity recognizer** sets the following attributes:
    - Doc.ents
    - Token.ent_iob_
    - Token.ent_type_
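`Token.ent_iob_` marks whether a token begins an entity (`B`), is inside one (`I`) or lies outside any entity (`O`), and `Token.ent_type_` holds the entity label. The sketch below regroups such token-level tags into entity spans. The tags are written by hand for illustration and are not the output of a model.

```python
# Hand-written (text, iob, type) triples, mimicking Token.ent_iob_ / Token.ent_type_
tagged_tokens = [
    ("James", "B", "PERSON"), ("O'Neill", "I", "PERSON"), (",", "O", ""),
    ("chairman", "O", ""), ("of", "O", ""),
    ("World", "B", "ORG"), ("Cargo", "I", "ORG"), ("Inc", "I", "ORG"),
    (",", "O", ""), ("lives", "O", ""), ("in", "O", ""),
    ("San", "B", "GPE"), ("Francisco", "I", "GPE"),
]

def iob_to_spans(tokens):
    """Group B/I-tagged tokens into (entity text, label) pairs."""
    spans, current, label = [], [], None
    for text, iob, ent_type in tokens:
        if iob == "B":                    # a new entity starts
            if current:
                spans.append((" ".join(current), label))
            current, label = [text], ent_type
        elif iob == "I" and current:      # the current entity continues
            current.append(text)
        else:                             # outside any entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:                           # flush the last entity
        spans.append((" ".join(current), label))
    return spans

print(iob_to_spans(tagged_tokens))
```

`Doc.ents` exposes this kind of grouping directly, so the token-level attributes are mainly useful when features are built token by token.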

In [None]:
text="James O'Neill, chairman of World Cargo Inc, lives in San Francisco"
doc=nlp(text)

In [None]:
# spaCy's English models label cities and countries as GPE and other locations as LOC
list(textacy.extract.entities(doc, include_types={"PERSON","ORG","GPE","LOC"}))

In [None]:
for ent in doc.ents:
    print(f"({ent.text},{ent.label_})",end="")

In [None]:
from spacy import displacy
displacy.render(doc,style="ent")

## Making a Corpus

A textacy.Corpus is an ordered collection of spaCy Doc objects, all processed by the same language pipeline.

In [31]:
records=df['text']

preproc_records=((preproc(text)) for text in records)

corpus=textacy.Corpus("en_core_web_sm",data=preproc_records)
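`preproc_records` is a generator expression: the texts are cleaned one at a time as they are consumed, so no intermediate list of cleaned strings is built. The generator can also be iterated only once. A small sketch of both properties, with a hypothetical `toy_clean` function standing in for `preproc`:

```python
calls = []

def toy_clean(text):
    # Hypothetical stand-in for preproc; records each call to show laziness.
    calls.append(text)
    return text.lower().strip()

records = ["  First POST ", "Second post  ", " THIRD post"]

lazy_records = (toy_clean(text) for text in records)
calls_before = len(calls)         # nothing has been cleaned at this point

first_pass = list(lazy_records)   # consuming the generator triggers the cleaning
second_pass = list(lazy_records)  # the generator is exhausted after one pass

print(calls_before, first_pass, second_pass)
```

If the cleaned texts are needed again later, for example to rebuild the corpus, the generator has to be re-created or the results stored.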

In [None]:
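df = df.assign(clean_text=df['text'].apply(preproc))  # create the clean_text column used below
corpus = textacy.Corpus(nlp, data=df['clean_text'])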

In [34]:
corpus[-1]._.preview

'Doc(111 tokens: "Looking for some help I ve never owned any luxu...")'

In [None]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens