# Preparing textual data for statistics and machine learning

The purpose of this session is to use specialized Python libraries to prepare a sample of text for subsequent quantitative analysis, for instance text classification. 
The different steps of the process are:

1. Importing the dataset
2. Cleaning the dataset
3. Tokenization
4. Feature extraction on a large dataset



##  Data

We use the data of the Reddit self-post classification task on Kaggle (https://www.kaggle.com/datasets/mswarbrickjones/reddit-selfposts).

Reddit (https://www.reddit.com/) is a social media website.  A subreddit is a specific online community, and the posts associated with it. 

Subreddits are dedicated to a particular topic that people write about, and they're denoted by /r/, followed by the subreddit's name, e.g., /r/gaming.

We have two datasets:

1. **rspct.tsv**

This dataset consists of 1.013M self-posts, posted from 1013 subreddits (1000 examples per class). 

For each post we give:
- the subreddit, 
- the title,
- the content of the self-post.

In this file, the fields of each observation are separated by tabs.


2. **subreddit_info.csv**

Contains manual annotation of about 3000 subreddits :

- a top-level category and subcategory for each subreddit, 

- a reason for exclusion if this does not appear in the data.


As a first step, we will:

- Import these two datasets
- Join these two dataframes on the subreddit

In [None]:
import pandas as pd

In [None]:
posts_file = "rspct.tsv"

posts_df = pd.read_csv(posts_file, sep='\t')

posts_df.info()

In [None]:
posts_df.shape

In [None]:
posts_df.head(10)

In [None]:
# number of distinct subreddits
posts_df['subreddit'].nunique()

In [None]:
mask=posts_df['subreddit']=='whatsthatbook'
posts_df.loc[mask,]

**subreddit_info.csv**

As described above, this file contains the manual annotation of about 3000 subreddits (top-level category, subcategory and, where relevant, the reason for exclusion).

This information can be considered **metadata**: information on the characteristics of the text (and not on the content of the text).

In [None]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file)
subred_df.info()
subred_df.head(10)

In [None]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file).set_index(['subreddit'])

In [None]:
subred_df.shape

In [None]:
subred_df.head(10)

## Joining the two dataframes ##

We want to combine the two previous datasets on the basis of the subreddit, which is a column of posts_df and the index of subred_df. 

In the join, on='subreddit' names the column of the caller (posts_df) that is matched against the index of subred_df.
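
As a minimal sketch of this kind of join, the toy example below uses two small hand-made dataframes. The names toy_posts and toy_info and their contents are invented for illustration and are not part of the Reddit data. Since join performs a left join by default, every row of the caller is kept.

```python
import pandas as pd

# Toy stand-in for posts_df: 'subreddit' is an ordinary column
toy_posts = pd.DataFrame({
    "subreddit": ["cars", "cars", "books"],
    "title": ["Brake noise", "Oil change", "Lost novel"],
})

# Toy stand-in for subred_df: 'subreddit' is the index
toy_info = pd.DataFrame({
    "subreddit": ["cars", "books", "music"],
    "category_1": ["autos", "writing", "music"],
}).set_index("subreddit")

# Match the 'subreddit' column of the caller against the index of toy_info
toy_joined = toy_posts.join(toy_info, on="subreddit")
print(toy_joined)
```

The subreddit "music" has no post in toy_posts, so it does not appear in the result.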


In [None]:
df=posts_df.join(subred_df, on ='subreddit')

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isna().sum()

### Standardizing Attributes Names

Usual practice:
- **df**: name of the dataset
- **text**: name of the column containing text to analyze

In [None]:
print(df.columns)

In [None]:
df=df.drop(columns=['category_3', 'in_data', 'reason_for_exclusion'])

In [None]:
column_mapping = {
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
}

In [None]:
df=df.rename(columns=column_mapping)
print(df.columns)

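#### Renaming columns and dropping the NaN columns - alternative method

This method replaces the drop and rename cells above: it has to be run on the joined dataframe instead of them, not after them (once the columns have been renamed, the original names no longer exist).

- selftext renamed as text
- category_1 renamed as category
- category_2 renamed as subcategory


category_3, in_data and reason_for_exclusion **are dropped (incomplete data)**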

In [None]:
column_mapping = {
    'id':'id',
    'subreddit':'subreddit',
    'title':'title',
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
    'category_3': None,
    'in_data': None,
    'reason_for_exclusion': None
}

In [None]:
column_mapping.keys()

In [None]:
columns=[c for c in column_mapping.keys() if column_mapping[c] is not None]

In [None]:
print(columns)

In [None]:
df=df[columns].rename(columns=column_mapping)

In [None]:
print(df.columns)

In [None]:
df.head()

In [None]:
print(df['category'].unique())

### Selection of data for the autos category

We restrict the data to the autos category.

In [None]:
df=df[df['category']=='autos']

In [None]:
df.info()

In [None]:
df.head()

In [None]:
len(df)

## Python libraries

Two associated Python libraries:

- **textacy** (https://pypi.org/project/textacy/): preprocessing, i.e. cleaning, normalizing and exploring raw data before processing it with spaCy

- **spaCy** (https://spacy.io/): fundamentals, i.e. tokenization, part-of-speech tagging, dependency parsing...

## Preliminary step: Cleaning Text Data with textacy

We don't have well-edited texts. There are several quality problems that we need to take into account:

- **Salutations, signatures and addresses**: usually not informative

- **Replies**: if the text contains replies that repeat the question, we need to remove the duplicated question. Otherwise, we can introduce bias in the statistical analysis.

- **Special formatting and program code**: if the text contains special characters, HTML entities, Markdown tags, etc., we need to remove them before the analysis.

The textacy module is used to perform preliminary (cleaning) NLP tasks on texts:

- replacing and removing punctuation, extra whitespace and numbers from the text before processing it with spaCy

textacy is built upon the spaCy module in Python.

https://www.geeksforgeeks.org/textacy-module-in-python/

In [None]:
df.index

In [None]:
text=df.loc[df.index[0],'text'] # selection of text by using df.index[list]
print(text)

Raw text sometimes needs to be cleaned before analysis.

The textacy.preprocessing sub-package contains a number of functions to:

- normalize (whitespace, quotation marks,...)

- remove (punctuation, accents,...)

- replace (URLs, emails, numbers,...)

In [None]:
import textacy
import textacy.preprocessing as tprep

With make_pipeline, we build a callable pipeline that takes a text as input, passes it through the functions in sequential order, and outputs a single preprocessed string. 
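
To make the idea of a sequential pipeline concrete, here is a minimal sketch in plain Python. It is not textacy's implementation: compose, strip_html and collapse_whitespace are hypothetical helpers written for this illustration, and they only mimic the principle of chaining text-to-text functions.

```python
import re
from functools import reduce


def compose(*funcs):
    """Return a callable that applies funcs to a text, one after the other."""
    def pipeline(text):
        return reduce(lambda current, func: func(current), funcs, text)
    return pipeline


def strip_html(text):
    # Naive tag removal, good enough for a toy example
    return re.sub(r"<[^>]+>", "", text)


def collapse_whitespace(text):
    # Replace runs of whitespace with a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


toy_preproc = compose(strip_html, str.lower, collapse_whitespace)
toy_result = toy_preproc("<p>Hello   <b>World</b></p>")
print(toy_result)
```

The order of the functions matters: each step receives the output of the previous one.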

In [None]:
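# The order matters: URLs, HTML tags and bracketed text are handled before
# punctuation is removed (otherwise they are broken up and no longer recognized),
# and whitespace is normalized after the removals to clean up the gaps they leave.
preproc = tprep.make_pipeline(
    tprep.replace.urls,
    tprep.remove.html_tags,
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.remove.accents,
    tprep.remove.brackets,
    tprep.remove.punctuation,
    tprep.normalize.whitespace,
    tprep.replace.numbers,
    tprep.replace.currency_symbols,
)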

In [None]:
clean_text=preproc(text)

print(clean_text)

In [None]:
text2= 'There is (no) of these 10 examples of 100 £ loans'

In [None]:
preproc(text2)

### Alternative: creating a specific function

In [None]:
def normalize(text):
    text = tprep.replace.urls(text)# we replace url with text
    text = tprep.remove.html_tags(text)
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    text = tprep.remove.punctuation(text)
    text = tprep.normalize.whitespace(text)
    text = tprep.replace.numbers(text)
    return text

In [None]:
print(normalize(text))

In [None]:
df_small = df.loc[df.index[:5],]

In [None]:
df_small.info()

In [None]:
df_small['text'].apply(normalize)

## Linguistic Processing with spaCy

- spaCy is a library for linguistic data processing

- spaCy's pipelines are language dependent: we have to load a particular pipeline to process the text 
    
- spaCy provides an integrated pipeline for processing documents:
    
    1. a tokenizer (always applied first)
    2. a part-of-speech tagger : tagger
    3. a dependency parser : parser
    4. a sentence recognizer : senter
    5. an attribute ruler : attribute_ruler
    6. a lemmatizer : lemmatizer
    7. a named-entity recognizer : ner
    
- The tokenizer is based on language-dependent rules => fast

- The tok2vec component listed in nlp.pipeline is not the tokenizer: it computes the token vectors used by the trained components

- Steps 2, 3 and 4 are based on pretrained neural models => they can take 10-20 times as long as tokenization

- The initial input is a text

- The final output is a **Doc** object

- The **Doc** object contains a list of **Token** objects

- Any range selection of tokens creates a **Span**
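
The relation between Doc, Token and Span can be sketched with a blank English pipeline, which contains only the rule-based tokenizer and therefore needs no trained model. The names blank_pipeline, doc_demo and span_demo are chosen for this illustration only.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# A blank pipeline only tokenizes: no tagger, parser or NER
blank_pipeline = spacy.blank("en")

doc_demo = blank_pipeline("You need to write an abstract")

# Iterating over a Doc yields Token objects
token_texts = [token.text for token in doc_demo]

# Slicing a Doc yields a Span
span_demo = doc_demo[3:6]

print(type(doc_demo).__name__, token_texts)
print(type(span_demo).__name__, span_demo.text)
```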

We load one of spaCy's trained pipelines for English.

For example, en_core_web_sm is a small English pipeline that was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more.

https://spacy.io/models/en

In [None]:
import spacy

In [None]:
# 'en_core_web_sm' is the name of the spaCy pipeline to install
from spacy.cli import download
print(download('en_core_web_sm'))
#print(download('en_core_web_md'))
#print(download('en_core_web_lg'))

We make a spaCy **Doc** from the text.

A Doc is the input required by the functions used in the rest of the session.

In [None]:
doc = textacy.make_spacy_doc(clean_text,lang="en_core_web_sm")
doc._.preview


In [None]:
print(doc)

### Alternative code

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
nlp.pipeline

In [None]:
doc_alt = nlp(clean_text)
print(doc_alt)

### Displaying tokens in a document

In [None]:
for token in doc:
    print(token.text)

### Tokens have attributes 

- token.is_punct : Is the token punctuation? 
- token.is_alpha : Does the token consist of alphabetic characters? 
- token.like_email : Does the token resemble an email address?
- token.like_url : Does the token resemble a URL?
- token.is_stop : Is the token part of a “stop list”?
- token.lemma_ : Base form of the token, with no inflectional suffixes.
- token.pos_ : Core part-of-speech category https://universaldependencies.org/u/pos/

See https://spacy.io/api/token for the list of all attributes

In [None]:
for token in doc:
    print(token,token.is_punct)

In [None]:
# identifying alphabetical characters
for token in doc:
    print(token,token.is_alpha)

In [None]:
# identifying stop words in a document
for token in doc:
    print(token,token.is_stop)

## Part-of-speech

- **Parts of speech** are the grammatical categories of words: verbs, nouns, adjectives, adverbs, pronouns, prepositions...

- Parts of speech can be used to explore syntax

- Each token in a spaCy doc has two part-of-speech attributes:
    - pos_
    - tag_
- tag_ can be language specific 
- pos_ contains the simplified tag of the universal part-of-speech tagset
 
- pos_ can be used as an alternative to stop words

- Parts of speech can be classified into two categories:

    - pronouns, prepositions, conjunctions, determiners: 
        - called **function words**
        - their main function is to create grammatical relationships in a sentence
        - not very informative

    - nouns, verbs, adjectives and adverbs: 
        - called **content words**
        - the meaning of a sentence depends on them

- We can use **part-of-speech tags** to select the word types

https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/13-POS-Keywords.html#

- Part-of-speech tags can be used to make a selection among tokens
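
The split between content words and function words can be sketched without running a tagger. In the toy example below, the (word, tag) pairs are written by hand and are only an assumption about what a tagger might return; CONTENT_POS is a name chosen for this illustration.

```python
# Universal POS tags that we treat as content words
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

# Hand-written (word, tag) pairs, standing in for the output of a tagger
tagged_sentence = [
    ("You", "PRON"),
    ("need", "VERB"),
    ("to", "PART"),
    ("write", "VERB"),
    ("an", "DET"),
    ("abstract", "NOUN"),
]

content_words = [word for word, pos in tagged_sentence if pos in CONTENT_POS]
function_words = [word for word, pos in tagged_sentence if pos not in CONTENT_POS]

print(content_words)
print(function_words)
```

With a real spaCy doc, the same selection would test token.pos_ instead of the hand-written tag.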

spaCy has been trained to recognize pos_ according to the context in which the word appears

In [None]:
sentence1 = 'You need to write an abstract'
token_sentence1 = nlp(sentence1)
for token in token_sentence1:
    print(token, token.pos_)

In [None]:
sentence2 = 'At his age, he still fails to abstract certain concepts'
token_sentence2 = nlp(sentence2)
for token in token_sentence2:
    print(token, token.pos_)

In [None]:
sentence3 = "He ages well"
token_sentence3 = nlp(sentence3)
for token in token_sentence3:
    print(token, token.pos_)

### Tokens and pos_ of doc

In [None]:
for token in doc:
    print(token, token.pos_, spacy.explain(token.pos_))

We want to make the list of the nouns in doc

In [None]:
nouns=[]
for token in doc:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)

In [None]:
nouns

In [None]:
from collections import Counter 
nouns_count = Counter(nouns)
print(nouns_count)

In [None]:
nouns_count.most_common()

### Specific functions of textacy to extract words according to their pos
The output is a generator of tokens, which we convert to a list

In [None]:
token_alt =textacy.extract.words(doc)
print(list(token_alt))


In [None]:
# The input must be a Doc
tokens1=textacy.extract.words(doc, include_pos={"ADJ","NOUN"})
print(list(tokens1))
#print(*[t for t in tokens1], sep="|")

In [None]:
tokens2=textacy.extract.words(doc, include_pos={"ADJ","NOUN"},min_freq=2)
print(list(tokens2))
#print(*[t for t in tokens2], sep="|")

### Tags 
A more detailed classification 

In [None]:
for token in doc:
    print(token,token.tag_,spacy.explain(token.tag_))


### dep_: dependency structure

In [None]:
from spacy import displacy

In [None]:
#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(token_sentence1, style="dep", options=options)


In [None]:
for token in doc:
    print(token,token.dep_,spacy.explain(token.dep_))


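## Lemmatization / Stemming

- Stemming: replacing words with their stem (root) by stripping suffixes: 
    - "economic", "economics", "economically" are all reduced to the same stem (e.g. "econom")
    - Porter stemmer (Porter 1980): standard stemming tool for English language text
- Lemmatization: replacing words with their dictionary form (lemma); this is what spaCy provides through token.lemma_
- Smaller vocabulary: increases the speed of execution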
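
The effect of stemming on vocabulary size can be sketched with a toy suffix stripper. This is not the Porter stemmer: TOY_SUFFIXES is a hypothetical list chosen so that the example words collapse to one stem, and real stemmers use many more rules.

```python
# Hypothetical suffix list, longest first so that "ically" wins over "y"
TOY_SUFFIXES = ("ically", "ics", "ic", "y")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["economic", "economics", "economically", "economy"]
stems = [toy_stem(word) for word in words]

print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stem")
```

Note that the stem is not necessarily a dictionary word, which is the main difference with a lemma.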

In [None]:
for token in doc:
    print(token,token.lemma_)

### Analysis of a Doc

- Extracting n-grams

In [None]:
from textacy import extract
list(extract.ngrams(doc,2))
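
For reference, an n-gram is simply a run of n consecutive tokens. The sketch below builds them from a plain list of strings; toy_ngrams is a name chosen for this illustration, and unlike textacy's function it applies no filtering (for instance on stop words or punctuation).

```python
def toy_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


toy_tokens = ["check", "engine", "light", "stays", "on"]
toy_bigrams = toy_ngrams(toy_tokens, 2)

print(toy_bigrams)
```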

### Remark: we can disable some components of the spaCy pipeline

We can load the pipeline without the components that we do not need

In [None]:
nlp_2=spacy.load('en_core_web_sm', disable=["parser","ner"])

In [None]:
nlp_2.pipeline

## Working with stop words

- spaCy uses language-specific stop word lists to set the is_stop property of each token
- Filtering stop words (and punctuation tokens) is easy
- The list of stop words is loaded when an nlp object is created
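
The principle of stop word filtering can be sketched in plain Python. TOY_STOP_WORDS is a tiny hypothetical list written for this illustration; spaCy's own English list is much longer.

```python
# Hypothetical, deliberately short stop word list
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "to"}

sentence_tokens = "The engine of the car is making a strange noise".split()

# Compare in lower case so that "The" is also filtered out
kept_tokens = [t for t in sentence_tokens if t.lower() not in TOY_STOP_WORDS]

print(kept_tokens)
```

With a spaCy doc, the equivalent filter would keep the tokens for which token.is_stop is False.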

In [None]:
print(nlp.Defaults.stop_words)

### The list of stop words can be modified

In [None]:
nlp.vocab['down'].is_stop=False
nlp.vocab['Dear'].is_stop=True
nlp.vocab['Regards'].is_stop = True

### Extracting lemmas

In [None]:
def extract_lemmas(doc,**kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

In [None]:
tokenized_doc = extract_lemmas(doc,min_freq=2)
print(*tokenized_doc, sep = "|")
len(tokenized_doc)

In [None]:
tokenized_doc = extract_lemmas(doc,  include_pos={"ADJ","NOUN"})
print(*tokenized_doc, sep = "|")
len(tokenized_doc)

### Extracting Named entities

- Named-entity recognition is the process of detecting entities such as people, locations and organizations in texts
- The results of the **named-entity recognizer** are available through the following attributes:
    - Doc.ents
    - Token.ent_iob_
    - Token.ent_type_
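
To see what these three attributes contain, the sketch below sets two entities by hand on a blank pipeline instead of running a trained recognizer. The sentence, the labels and the names blank_nlp and ner_demo are assumptions made for this illustration.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained NER component
blank_nlp = spacy.blank("en")
ner_demo = blank_nlp("Toyota recalled cars in Japan")

# Entities set by hand (a trained pipeline would predict them)
ner_demo.ents = [
    Span(ner_demo, 0, 1, label="ORG"),
    Span(ner_demo, 4, 5, label="GPE"),
]

# Doc.ents: the entity spans with their labels
entities = [(ent.text, ent.label_) for ent in ner_demo.ents]

# Token-level view: position in the entity (ent_iob_) and entity type (ent_type_)
token_view = [(t.text, t.ent_iob_, t.ent_type_) for t in ner_demo]

print(entities)
print(token_view)
```

In ent_iob_, "B" marks the first token of an entity and "I" a token inside one.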

In [None]:
text0=df.loc[df.index[0],'text'] # selection of text by using df.index[list]
print(text0)

In [None]:
# Preprocessing with the textacy pipeline
clean_text0=preproc(text0)

print(clean_text0)

In [None]:
doc0 = textacy.make_spacy_doc(clean_text0,lang="en_core_web_sm")
doc0._.preview

In [None]:
doc0

In [None]:
# spaCy's English pipelines label locations as LOC (countries and cities as GPE), not LOCATION
list(textacy.extract.entities(doc0, include_types={"DATE","PRODUCT","ORG","LOC"}))

In [None]:
for ent in doc.ents:
    print(f"({ent.text},{ent.label_})",end="")

In [None]:
from spacy import displacy
displacy.render(doc,style="ent")

## Making a Corpus

A textacy.Corpus is an ordered collection of spaCy Docs, all processed by the same language pipeline.

In [None]:
records=df['text']

preproc_records=((preproc(text)) for text in records)

In [None]:
corpus=textacy.Corpus("en_core_web_sm",data=preproc_records)

In [None]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

In [None]:
corpus[0]._.preview

In [None]:
corpus[0]

### Transforming a corpus into an array 

**textacy.representations.vectorizers** : Transform a collection of tokenized docs into a **doc-term matrix** of shape (# docs, # unique terms), with various ways to filter or limit included terms and flexible weighting schemes for their values.
    
    
https://textacy.readthedocs.io/en/latest/api_reference/representations.html#  
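
The shape of a doc-term matrix can be illustrated on a toy example in plain Python. The documents below are invented, and the result is a dense list of lists, whereas textacy returns a sparse matrix.

```python
from collections import Counter

# Three toy tokenized documents
toy_docs = [
    ["car", "engine", "noise"],
    ["car", "car", "brake"],
    ["engine", "oil"],
]

# Map each unique term to a column index (alphabetical order)
unique_terms = sorted({term for doc_terms in toy_docs for term in doc_terms})
toy_vocabulary = {term: j for j, term in enumerate(unique_terms)}

# One row per document, one column per term, raw counts as values
toy_matrix = []
for doc_terms in toy_docs:
    counts = Counter(doc_terms)
    toy_matrix.append([counts.get(term, 0) for term in toy_vocabulary])

print(toy_vocabulary)
for row in toy_matrix:
    print(row)
```

The matrix has shape (3, 5): three documents and five unique terms.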

In [None]:
tokenized_docs = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"ADJ","NOUN"})) for doc in corpus[:20])

In [None]:
from textacy.representations import Vectorizer

### Specification of the Vectorizer

- tf_type: type of term frequency (tf) to use for the weights' local component; can be linear, sqrt, log or binary (here, tf_type = linear)

- idf_type: type of inverse document frequency (idf) to use for the weights' global component; can be standard, smooth or bm25
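
The sketch below shows how such options could translate into weights. The formulas follow our reading of the textacy documentation and should be treated as assumptions to check against it; tf_weight and idf_weight are hypothetical helpers, not textacy functions.

```python
import math


def tf_weight(count, tf_type="linear"):
    """Local weight of a term that occurs `count` times in a document."""
    if tf_type == "linear":
        return float(count)
    if tf_type == "sqrt":
        return math.sqrt(count)
    if tf_type == "log":
        return math.log(count) + 1.0 if count > 0 else 0.0
    if tf_type == "binary":
        return 1.0 if count > 0 else 0.0
    raise ValueError(f"unknown tf_type: {tf_type}")


def idf_weight(n_docs, doc_freq, idf_type="standard"):
    """Global weight of a term found in `doc_freq` of `n_docs` documents."""
    if idf_type == "standard":
        return math.log(n_docs / doc_freq) + 1.0
    if idf_type == "smooth":
        return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0
    if idf_type == "bm25":
        return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    raise ValueError(f"unknown idf_type: {idf_type}")


# A term seen 3 times in a document and present in 5 of 20 documents
example_weight = tf_weight(3) * idf_weight(n_docs=20, doc_freq=5)
print(example_weight)
```

Under these definitions, a term that appears in every document gets a standard idf of 1, so it is not discarded but simply not boosted.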

In [None]:
vectorizer_alt = Vectorizer( tf_type="linear")

In [None]:
vectorizer_alt.weighting

In [None]:
doc_term_matrix_alt = vectorizer_alt.fit_transform(tokenized_docs)
doc_term_matrix_alt

Terms associated with columns

In [None]:
vectorizer_alt.terms_list[:10]

In [None]:
print(doc_term_matrix_alt[:20, vectorizer_alt.vocabulary_terms["story"]].toarray())

In [None]:
# next batch of 20 documents (indices 20 to 39), so that no document is skipped after corpus[:20]
tokenized_docs_n = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"ADJ","NOUN"})) for doc in corpus[20:40])

In [None]:
doc_matrix_terms_alt_n = vectorizer_alt.transform(tokenized_docs_n)
doc_matrix_terms_alt_n

## Another example of tokenization and vectorization

In [None]:
tokenized_docs = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"VERB"})) for doc in corpus[:20])

In [None]:
#vectorizer = Vectorizer( tf_type="linear")
vectorizer = Vectorizer(tf_type="linear", idf_type="standard",min_df=5, max_df=0.95)

In [None]:
vectorizer.weighting

In [None]:
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
doc_term_matrix

In [None]:
print(doc_term_matrix[:20, vectorizer.vocabulary_terms["know"]].toarray())

In [None]:
# next batch of 20 documents (indices 20 to 39), so that no document is skipped after corpus[:20]
tokenized_docs = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"VERB"})) for doc in corpus[20:40])

In [None]:
doc_matrix_terms= vectorizer.transform(tokenized_docs)
doc_matrix_terms

In [None]:
print(doc_matrix_terms[:20, vectorizer.vocabulary_terms["know"]].toarray())