# Preparing textual data for statistics and machine learning

1. Importing the dataset
2. Cleaning the dataset
3. Tokenization
4. Feature extraction on a large dataset



## Importing Data

Reddit (https://www.reddit.com/) self-posts dataset, available on Kaggle.

The data consists of 1.013M self-posts, posted from 1013 subreddits (1000 examples per class). For each post we give the subreddit, the title and content of the self-post.

A subreddit is a specific online community, and the posts associated with it, on the social media website Reddit. Subreddits are dedicated to a particular topic that people write about, and they're denoted by /r/, followed by the subreddit's name, e.g., /r/gaming.

In [1]:
import pandas as pd

In [2]:
posts_file = "rspct.tsv"

In [3]:
posts_df = pd.read_csv(posts_file, sep='\t')

In [None]:
posts_df.info()

In [None]:
posts_df.head()

In [None]:
# number of distinct subreddits
posts_df['subreddit'].nunique()

**subreddit_info.csv**

Contains manual annotations for about 3000 subreddits:

- a top-level category and a subcategory for each subreddit,
- a reason for exclusion if the subreddit does not appear in the data.

This information can be considered **metadata**: information about the characteristics of the text, not about its content.

In [4]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file).set_index(['subreddit'])

In [None]:
subred_df.info()

In [None]:
subred_df.head(10)

In [5]:
df=posts_df.join(subred_df, on ='subreddit')
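To make the semantics of this join concrete, here is a minimal plain-Python sketch; it is not the pandas implementation. `DataFrame.join` performs a left join by default: every post is kept, and a post whose subreddit has no metadata receives missing values, which is why checking `isna()` afterwards is useful. The records and category values below are made up for illustration.

```python
# Hypothetical toy records standing in for posts_df and subred_df.
posts = [
    {"id": "p1", "subreddit": "cars", "title": "Coolant change?"},
    {"id": "p2", "subreddit": "not_annotated", "title": "Hello"},
]

# Metadata keyed by subreddit, like subred_df indexed on 'subreddit'.
subreddit_info = {"cars": {"category_1": "autos", "category_2": "cars"}}

# Values used when a subreddit has no metadata (NaN in pandas).
MISSING = {"category_1": None, "category_2": None}

# Left join: keep every post, attach the metadata when the key exists.
joined = [
    {**post, **subreddit_info.get(post["subreddit"], MISSING)}
    for post in posts
]

for row in joined:
    print(row)
```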

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.isna().sum()

### Standardizing Attributes Names

Usual practice:
- **df**: name of the dataset
- **text**: name of the column containing text to analyze

In [None]:
print(df.columns)

#### Renaming columns

- selftext renamed as text
- category_1 renamed as category
- category_2 renamed as subcategory


category_3, in_data and reason_for_exclusion **are dropped (incomplete data)**

In [6]:
column_mapping = {
    'id':'id',
    'subreddit':'subreddit',
    'title':'title',
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
    'category_3': None,
    'in_data': None,
    'reason_for_exclusion': None
}

In [7]:
# keep only the columns that are mapped to a new name (None marks a column to drop)
columns = [c for c in column_mapping if column_mapping[c] is not None]

In [8]:
print(columns)

['id', 'subreddit', 'title', 'selftext', 'category_1', 'category_2']


In [9]:
df=df[columns].rename(columns=column_mapping)

In [None]:
print(df.columns)

In [None]:
df.head()

### Selection of data for the autos category

We restrict the data to the autos category.

In [10]:
df=df[df['category']=='autos']

In [None]:
df.info()

In [None]:
df.head()

In [None]:
len(df)

## Python libraries

Two associated Python libraries:

**textacy**: preprocessing, i.e. cleaning, normalizing and exploring raw data before processing it with spaCy

**spaCy**: fundamentals, i.e. tokenization, part-of-speech tagging, dependency parsing...

## Preliminary step: Cleaning Text Data with textacy

The texts are not well edited. Several quality problems need to be taken into account:

- **Salutations, signatures and addresses**: usually not informative.

- **Replies**: if the text contains replies that repeat the question, the duplicated question must be removed; otherwise it can bias the statistical analysis.

- **Special formatting and program code**: if the text contains special characters, HTML entities, Markdown tags, etc., these must be removed before the analysis.

- The textacy module is used to perform preliminary cleaning tasks on texts:

    - replacing or removing punctuation, extra whitespace and numbers before processing with spaCy

- textacy is built upon the spaCy module

https://www.geeksforgeeks.org/textacy-module-in-python/

In [11]:
text = df.loc[df.index[3], 'text']  # text of the fourth row, selected via its index label
print(text)

https://www.cars.com/articles/how-often-should-i-change-engine-coolant-1420680853669/<lb><lb>I have a IS 250 AWD from 2006. About 73K miles on it. I've never touched the engine radiator coolant and can't find anything on when to change this in the book. It just says 'long life 100k Toyota coolant.' <lb><lb>Does anyone get this flushed or changed at ten years?? Do I wait until 100k? 


In [12]:
import textacy
import textacy.preprocessing as tprep

In [13]:
preproc = tprep.make_pipeline(
    tprep.replace.urls,
    tprep.remove.html_tags,
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.remove.accents,
    tprep.remove.punctuation,
    tprep.normalize.whitespace,
    tprep.replace.numbers
)

In [14]:
clean_text=preproc(text)

print(clean_text)

URL have a IS _NUMBER_ AWD from _NUMBER_ About 73K miles on it I ve never touched the engine radiator coolant and can t find anything on when to change this in the book It just says long life 100k Toyota coolant Does anyone get this flushed or changed at ten years Do I wait until 100k
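The cleaned text above starts with "URL have", so the "I" of "I have" has disappeared. One plausible explanation is that URL replacement runs before tag removal, and that the URL pattern also consumes the `<lb><lb>I` glued to the end of the link. The sketch below reproduces this effect with two deliberately simple regular expressions from the standard library; these patterns are assumptions for illustration and are not textacy's own.

```python
import re

# Assumed, simplified patterns (textacy's real patterns are more elaborate).
URL_RE = re.compile(r"https?://\S+")   # a URL runs until the next whitespace
TAG_RE = re.compile(r"<[^>]+>")        # anything that looks like <tag>

raw = ("https://www.cars.com/articles/how-often-should-i-change-engine-coolant-"
       "1420680853669/<lb><lb>I have a IS 250 AWD from 2006.")

def urls_then_tags(text):
    """Replace URLs first: the tags and the 'I' glued to the URL are swallowed."""
    text = URL_RE.sub("_URL_", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())

def tags_then_urls(text):
    """Remove tags first: the URL is separated from the following word."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub("_URL_", text)
    return " ".join(text.split())

print(urls_then_tags(raw))  # _URL_ have a IS 250 AWD from 2006.
print(tags_then_urls(raw))  # _URL_ I have a IS 250 AWD from 2006.
```

If this explanation holds for the real pipeline, putting the tag-removal step before the URL step would be worth trying.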


### Alternative: creating a specific function

In [None]:
def normalize(text):
    text = tprep.replace.urls(text)  # replace each URL with a placeholder
    text = tprep.remove.html_tags(text)
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    text = tprep.remove.punctuation(text)
    text = tprep.normalize.whitespace(text)
    text = tprep.replace.numbers(text)
    return text

In [None]:
print(normalize(text))

## Linguistic Processing with spaCy

- spaCy: a library for linguistic data processing
    
- spaCy provides an integrated pipeline for processing documents:
    
    1. a tokenizer (by default)
    2. a part-of-speech tagger  
    3. a dependency parser
    4. a named-entity recognizer
    5. a lemmatizer
    
- the tokenizer is based on language-dependent rules => fast


- components 2, 3 and 4 are based on pretrained neural models => they can take 10-20 times as long as tokenization

- The initial input is a text

- The final output is a **Doc** object

- The **Doc** object contains a list of **Tokens** objects

- Any range selection of tokens creates a **Span**

We import spaCy and download one of its trained pipelines for English:

https://spacy.io/models/en

In [15]:
import spacy

In [16]:
from spacy.cli import download
print(download('en_core_web_sm'))

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
None


In [17]:
doc = textacy.make_spacy_doc(clean_text,lang="en_core_web_sm")
doc._.preview

'Doc(62 tokens: "URL have a IS _NUMBER_ AWD from _NUMBER_ About ...")'

In [40]:
print(doc)

URL have a IS _NUMBER_ AWD from _NUMBER_ About 73K miles on it I ve never touched the engine radiator coolant and can t find anything on when to change this in the book It just says long life 100k Toyota coolant Does anyone get this flushed or changed at ten years Do I wait until 100k


### Displaying tokens in a document

In [41]:
for tok in doc:
    print(tok.text)

URL
have
a
IS
_
NUMBER
_
AWD
from
_
NUMBER
_
About
73
K
miles
on
it
I
ve
never
touched
the
engine
radiator
coolant
and
can
t
find
anything
on
when
to
change
this
in
the
book
It
just
says
long
life
100k
Toyota
coolant
Does
anyone
get
this
flushed
or
changed
at
ten
years
Do
I
wait
until
100k


### Tokens have attributes 

- token.is_punct: Is the token punctuation?
- token.is_alpha: Does the token consist of alphabetic characters?
- token.like_email: Does the token resemble an email address?
- token.like_url: Does the token resemble a URL?
- token.is_stop: Is the token part of a “stop list”?
- token.lemma_: Base form of the token, with no inflectional suffixes.
- token.pos_: Coarse-grained part-of-speech category, see https://universaldependencies.org/u/pos/
            
            
See https://spacy.io/api/token for the list of all attributes

In [None]:
for token in doc:
    print(token,token.is_punct)

In [None]:
for token in doc:
    print(token,token.is_alpha)

In [None]:
for token in doc:
    print(token,token.is_stop)

### Part-of-speech tags

- The types of words are called **part-of-speech tags**

- examples: nouns, verbs, adjectives

- it is often important to restrict the analysis to certain types of words

In [19]:
for token in doc:
    print(token, token.pos_)

URL NOUN
have VERB
a DET
IS PROPN
_ PRON
NUMBER NOUN
_ PUNCT
AWD PROPN
from ADP
_ PROPN
NUMBER NOUN
_ NOUN
About ADV
73 NUM
K NOUN
miles NOUN
on ADP
it PRON
I PRON
ve AUX
never ADV
touched VERB
the DET
engine NOUN
radiator NOUN
coolant NOUN
and CCONJ
can AUX
t NOUN
find VERB
anything PRON
on ADP
when SCONJ
to PART
change VERB
this PRON
in ADP
the DET
book NOUN
It PRON
just ADV
says VERB
long ADJ
life NOUN
100k NUM
Toyota PROPN
coolant NOUN
Does AUX
anyone PRON
get VERB
this PRON
flushed ADJ
or CCONJ
changed VERB
at ADP
ten NUM
years NOUN
Do AUX
I PRON
wait VERB
until ADP
100k NUM


In [20]:
for token in doc:
    print(token, token.tag_)

URL NN
have VBP
a DT
IS NNP
_ DT
NUMBER NN
_ .
AWD NNP
from IN
_ NNP
NUMBER NN
_ NN
About RB
73 CD
K NN
miles NNS
on IN
it PRP
I PRP
ve VBP
never RB
touched VBN
the DT
engine NN
radiator NN
coolant NN
and CC
can MD
t NN
find VB
anything NN
on IN
when WRB
to TO
change VB
this DT
in IN
the DT
book NN
It PRP
just RB
says VBZ
long JJ
life NN
100k CD
Toyota NNP
coolant NN
Does VBZ
anyone NN
get VB
this DT
flushed JJ
or CC
changed VBN
at IN
ten CD
years NNS
Do VBP
I PRP
wait VB
until IN
100k CD


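## Lemmatization / Stemming

- Replacing words with their root: 
    - stemming: "economic", "economics", "economically" are all reduced to the same stem ("econom" with the Porter stemmer)
    - Porter stemmer (Porter 1980): standard stemming tool for English-language text
    - lemmatization: each word is replaced by its dictionary form (the lemma), which is what spaCy stores in token.lemma_
- smaller vocabulary: faster execution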
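The effect on vocabulary size can be illustrated with a deliberately naive suffix stripper. This is a toy sketch and not the Porter algorithm: the suffix list and the minimum stem length are arbitrary assumptions chosen for this example.

```python
# Toy suffix list, longest first (an assumption for illustration only).
SUFFIXES = ("ically", "ics", "ic", "y")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["economic", "economics", "economically", "economy"]
stems = {w: naive_stem(w) for w in words}

print(stems)
# Four distinct surface forms collapse to a single vocabulary entry.
print(len(set(words)), "->", len(set(stems.values())))
```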

In [None]:
for token in doc:
    print(token,token.lemma_)

### Alternative syntax

In [None]:
nlp=spacy.load('en_core_web_sm')

In [None]:
nlp.pipeline

In [None]:
doc_alt = nlp(clean_text)

In [None]:
doc_alt._.preview

### Analysis of a Doc

- extracting n-grams

In [None]:
from textacy import extract
list(extract.ngrams(doc,2))
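As a reminder of what an n-gram is, the following plain-Python sketch builds all runs of n consecutive tokens from a list of strings. It only illustrates the idea: textacy works on spaCy tokens and can additionally apply filters (for instance on stop words and punctuation), which this sketch does not reproduce. The token list is a made-up example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["engine", "radiator", "coolant", "change"]

bigrams = ngrams(tokens, 2)
print(bigrams)
# [('engine', 'radiator'), ('radiator', 'coolant'), ('coolant', 'change')]
```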

- Identifying key terms

In [None]:
extract.keyterms.textrank(doc, normalize="lemma",topn=5)

### Remark: some components of the spaCy pipeline can be disabled

We can load only selected components of the pipeline when some of them are not needed

In [None]:
nlp_2=spacy.load('en_core_web_sm', disable=["parser","ner"])

## Working with stop words

- spaCy uses language-specific stop word lists to set the is_stop property for each token
- Filtering stop words (and punctuation tokens) is easy
- The list of stop words is loaded when an nlp object is created
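The principle of stop word filtering can be sketched without spaCy. The stop list below is a tiny hypothetical one, chosen for illustration; spaCy's English list is much longer, and spaCy stores the information as a flag on each token rather than as a set lookup.

```python
# Hypothetical, tiny stop list for illustration only.
stop_words = {"i", "the", "and", "to", "this", "in", "on", "it"}

tokens = ["I", "never", "touched", "the", "engine", "radiator", "coolant"]

def remove_stops(tokens, stop_words):
    """Keep the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stops(tokens, stop_words))

# Customizing the list: also treat "never" as a stop word.
stop_words.add("never")
print(remove_stops(tokens, stop_words))
```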

In [None]:
print(nlp.Defaults.stop_words)

### The list of stop words can be modified

In [None]:
nlp.vocab['down'].is_stop=False
nlp.vocab['Dear'].is_stop=True
nlp.vocab['Regards'].is_stop = True

### Selection of words according to part-of-speech tags

- Each token in a spaCy doc has two part-of-speech attributes:
    - pos_
    - tag_
- tag_ can be language specific 
- pos_ contains the simplified tag of the universal part-of-speech tagset

- A simplified form is the identification of words as nouns, verbs, adjectives, adverbs, etc.

    https://spacy.io/usage/linguistic-features

- pos_ can be used as an alternative to stop words
- pronouns, prepositions, conjunctions, determiners: 
    - called **function words**
    - their main function is to create grammatical relationships in a sentence
    - not very informative

- nouns, verbs, adjectives and adverbs: 
    - **content** words
    - the meaning of a sentence depends on them
    

- We can use **part-of-speech tags** to select the word types
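The distinction between function words and content words can be sketched on a few (token, pos_) pairs copied from the tagger output shown earlier. The set of tags treated as "content" is a choice made for this example, not a spaCy setting.

```python
# (token, pos_) pairs taken from the tagger output shown earlier.
tagged = [
    ("I", "PRON"), ("ve", "AUX"), ("never", "ADV"), ("touched", "VERB"),
    ("the", "DET"), ("engine", "NOUN"), ("radiator", "NOUN"), ("coolant", "NOUN"),
]

# Tags considered content words in this sketch (an illustrative choice).
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

content_words = [tok for tok, pos in tagged if pos in CONTENT_POS]
function_words = [tok for tok, pos in tagged if pos not in CONTENT_POS]

print(content_words)   # ['never', 'touched', 'engine', 'radiator', 'coolant']
print(function_words)  # ['I', 've', 'the']
```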

In [19]:
nouns=[t for t in doc if t.pos_ in ['NOUN','PROPN']]
print(nouns)

[URL, IS, NUMBER, AWD, _, NUMBER, _, K, miles, engine, radiator, coolant, t, book, life, Toyota, coolant, years]


### Extracting tokens according to pos_

In [20]:
# The input must be a Doc object
tokens1=textacy.extract.words(doc, include_pos={"ADJ","NOUN"})

In [21]:
print(*[t for t in tokens1], sep="|")

URL|NUMBER|NUMBER|K|miles|engine|radiator|coolant|t|book|long|life|coolant|flushed|years


In [22]:
# The input must be a Doc object
tokens2=textacy.extract.words(doc, include_pos={"ADJ","NOUN"},min_freq=2)

In [23]:
print(*[t for t in tokens2], sep="|")

NUMBER|NUMBER|coolant|coolant
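The effect of `min_freq=2` can be reproduced by hand on the 15 tokens obtained without the frequency filter. The sketch below counts lowercased tokens with `collections.Counter` and keeps those seen at least twice; counting on the lowercase form is an assumption about how textacy compares tokens.

```python
from collections import Counter

# Tokens kept by include_pos={"ADJ","NOUN"} in the output shown earlier.
tokens = ("URL|NUMBER|NUMBER|K|miles|engine|radiator|coolant|t|book|"
          "long|life|coolant|flushed|years").split("|")

def filter_min_freq(tokens, min_freq):
    """Keep the tokens whose lowercase form occurs at least min_freq times."""
    counts = Counter(t.lower() for t in tokens)
    return [t for t in tokens if counts[t.lower()] >= min_freq]

print(*filter_min_freq(tokens, 2), sep="|")  # NUMBER|NUMBER|coolant|coolant
```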


### Extracting lemmas

In [24]:
def extract_lemmas(doc, **kwargs):
    # kwargs are passed through to textacy.extract.words (e.g. include_pos, min_freq)
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

In [36]:
tokenized_doc = extract_lemmas(doc,  include_pos={"ADJ","NOUN"})
print(*tokenized_doc, sep = "|")
len(tokenized_doc)

url|number|number|k|mile|engine|radiator|coolant|t|book|long|life|coolant|flushed|year


15

### Extracting Named entities

- Named-entity recognition: the process of detecting entities such as people, locations and organizations in texts
- The **named-entity recognizer** sets the following attributes:
    - Doc.ents
    - Token.ent_iob_
    - Token.ent_type_
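The per-token attributes encode entities in IOB style: "B" marks the first token of an entity, "I" a token inside one, and "O" a token outside any entity. The plain-Python sketch below shows how such tags can be grouped back into entity spans, which is the information `Doc.ents` exposes directly. The annotated tokens are hypothetical and were not produced by a model.

```python
# Hypothetical per-token annotations: (text, ent_iob_, ent_type_).
tokens = [
    ("Toyota", "B", "ORG"),
    ("recalled", "O", ""),
    ("cars", "O", ""),
    ("in", "O", ""),
    ("Las", "B", "GPE"),
    ("Vegas", "I", "GPE"),
]

def group_entities(tokens):
    """Group IOB-tagged tokens into (entity text, label) pairs."""
    entities, current = [], None
    for text, iob, ent_type in tokens:
        if iob == "B":                    # a new entity starts here
            if current:
                entities.append(current)
            current = ([text], ent_type)
        elif iob == "I" and current:      # continue the open entity
            current[0].append(text)
        else:                             # outside any entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(" ".join(words), label) for words, label in entities]

print(group_entities(tokens))  # [('Toyota', 'ORG'), ('Las Vegas', 'GPE')]
```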

In [None]:
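# en_core_web_sm labels locations as "GPE" (countries, cities, states) or "LOC"; it has no "LOCATION" label
list(textacy.extract.entities(doc, include_types={"PERSON","ORG","GPE","LOC"}))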

In [None]:
for ent in doc.ents:
    print(f"({ent.text},{ent.label_})",end="")

In [None]:
from spacy import displacy
displacy.render(doc,style="ent")

## Making a Corpus

A textacy.Corpus is an ordered collection of spaCy Docs, all processed by the same language pipeline.

In [102]:
records=df['text']

# lazy generator: each text is cleaned only when the Corpus consumes it
preproc_records = (preproc(text) for text in records)

corpus=textacy.Corpus("en_core_web_sm",data=preproc_records)

In [82]:
corpus[0]._.preview

'Doc(143 tokens: "Funny story I went to college in Las Vegas This...")'

In [45]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

(20000, 96359, 2674826)

### Transforming a corpus into an array 

**textacy.representations.vectorizers** : Transform a collection of tokenized docs into a doc-term matrix of shape (# docs, # unique terms), with various ways to filter or limit included terms and flexible weighting schemes for their values.
    
    
https://textacy.readthedocs.io/en/latest/api_reference/representations.html#  

In [129]:
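# Materialize the tokenized docs as lists: a generator would be exhausted after the first
# fit_transform() call, and the same input is reused with a second vectorizer below
tokenized_docs = [
    [term.lemma_ for term in textacy.extract.words(doc, include_pos={"ADJ","NOUN"})]
    for doc in corpus[:500]
]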

In [121]:
from textacy.representations import Vectorizer

In [122]:
#vectorizer = Vectorizer( tf_type="linear")
vectorizer = Vectorizer(tf_type="linear", idf_type="smooth", norm="l2",min_df=3, max_df=0.95)

tf_type: type of term frequency (tf) used for the local component of the weights; tf_type="linear" uses the raw term counts.

idf_type: type of inverse document frequency (idf) used for the global component of the weights.

In [123]:
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
doc_term_matrix

<500x883 sparse matrix of type '<class 'numpy.float64'>'
	with 9483 stored elements in Compressed Sparse Row format>

In [124]:
vectorizer.terms_list[:40]

['10k',
 '20k',
 '22k',
 '2nd',
 '3rd',
 '4runner',
 '4th',
 '50k',
 '5th',
 'a3',
 'a4',
 'able',
 'abs',
 'acceleration',
 'access',
 'accident',
 'accord',
 'activation',
 'actual',
 'ad',
 'adapter',
 'additional',
 'address',
 'adjustable',
 'advance',
 'advice',
 'aero',
 'aesthetic',
 'aftermarket',
 'age',
 'air',
 'amazing',
 'american',
 'amp',
 'android',
 'annoying',
 'answer',
 'app',
 'appreciated',
 'appropriate']

In [125]:
print(doc_term_matrix[:20, vectorizer.vocabulary_terms["story"]].toarray())

[[0.203154]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]
 [0.      ]]


In [126]:
vectorizer.weighting

'tf * log((n_docs + 1) / (df + 1)) + 1'
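To see what this weighting does, here is a minimal standard-library sketch on three made-up documents. It reads the idf part of the string above as `log((n_docs + 1) / (df + 1)) + 1`, the usual "smooth" idf definition, and then applies L2 normalization per document. This reading is an assumption about textacy's internals, and the `min_df` / `max_df` filtering is left out.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["coolant", "engine", "coolant"],
    ["engine", "oil"],
    ["tire", "oil", "engine"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    """Smooth inverse document frequency (assumed form)."""
    return math.log((n_docs + 1) / (df[term] + 1)) + 1.0

def tfidf_l2(doc):
    """Linear tf times smooth idf, followed by L2 normalization."""
    tf = Counter(doc)
    weights = {t: tf[t] * smooth_idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_l2(doc) for doc in docs]
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

A term that appears in every document ("engine" here) gets the minimum idf of 1, so rarer terms such as "coolant" dominate the normalized row.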

In [130]:
vectorizer_alt = Vectorizer( tf_type="linear")

In [131]:
doc_term_matrix_alt = vectorizer_alt.fit_transform(tokenized_docs)
doc_term_matrix_alt

<500x2790 sparse matrix of type '<class 'numpy.int32'>'
	with 11802 stored elements in Compressed Sparse Row format>

In [133]:
print(doc_term_matrix_alt[:20, vectorizer_alt.vocabulary_terms["story"]].toarray())

[[1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]


In [134]:
vectorizer_alt.weighting

'tf'