# Preparing textual data for statistics and machine learning

The purpose of this session is to use specialized Python libraries to prepare a sample of text for subsequent quantitative analysis, for instance text classification. 
The steps of the process are:

1. Importing the dataset
2. Cleaning the dataset
3. Tokenization
4. Feature extraction on a large dataset



##  Data

We use the data from the Reddit self-post classification task on Kaggle (https://www.kaggle.com/datasets/mswarbrickjones/reddit-selfposts).

Reddit (https://www.reddit.com/) is a social media website.  A subreddit is a specific online community, and the posts associated with it. 

Subreddits are dedicated to a particular topic that people write about, and they're denoted by /r/, followed by the subreddit's name, e.g., /r/gaming.

We have two datasets:

1. **rspct.tsv**

This dataset consists of 1.013M self-posts, posted from 1013 subreddits (1000 examples per class). 

For each post we give:
- the subreddit, 
- the title,
- the content of the self-post.

In this file, the fields of each observation are separated by tabs.


2. **subreddit_info.csv**

Contains the manual annotation of about 3000 subreddits:

- a top-level category and subcategory for each subreddit,

- a reason for exclusion if the subreddit does not appear in the data.


As a first step, we will:

- Import these two datasets
- Join them into a single dataframe on the subreddit name

In [1]:
import pandas as pd

In [2]:
posts_file = "rspct.tsv"

posts_df = pd.read_csv(posts_file, sep='\t')

posts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013000 entries, 0 to 1012999
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   id         1013000 non-null  object
 1   subreddit  1013000 non-null  object
 2   title      1013000 non-null  object
 3   selftext   1013000 non-null  object
dtypes: object(4)
memory usage: 30.9+ MB


In [3]:
posts_df.shape

(1013000, 4)

In [4]:
posts_df.head(10)

Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."
5,5qw3x0,residentevil,What if Saddler won?,I just wanted to start a thread about what wou...
6,7jve7p,BATProject,Net Neutrality and Brave,If and when net neutrality laws are repealed i...
7,6icvfu,hockeyplayers,Inline Hockey: Where Do I Need To Be? (Positio...,My game is coming on well but one HUGE aspect ...
8,4y7c5c,asmr,[Question] Who is your favorite defunct ASMRtist?,"""Defunct"" being defined here as NOT having rel..."
9,6azhj1,rawdenim,Had a custom embroidery job done on my ranch j...,[Album First](http://imgur.com/a/DYdKC)<lb><lb...


In [None]:
# number of distinct subreddits
posts_df['subreddit'].nunique()

In [None]:
# rows belonging to a single subreddit
mask = posts_df['subreddit'] == 'whatsthatbook'
posts_df.loc[mask]

**subreddit_info.csv**

As described above, this file contains the manual annotation of about 3000 subreddits. This information can be considered **metadata**: information about the characteristics of a text, not about its content.

In [5]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file)
subred_df.info()
subred_df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3394 entries, 0 to 3393
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   subreddit             3394 non-null   object
 1   category_1            3394 non-null   object
 2   category_2            3362 non-null   object
 3   category_3            536 non-null    object
 4   in_data               3394 non-null   bool  
 5   reason_for_exclusion  2381 non-null   object
dtypes: bool(1), object(5)
memory usage: 136.0+ KB


Unnamed: 0,subreddit,category_1,category_2,category_3,in_data,reason_for_exclusion
0,whatsthatbook,advice/question,book,,True,
1,CasualConversation,advice/question,broad,,False,too_broad
2,Clairvoyantreadings,advice/question,broad,,False,too_broad
3,DecidingToBeBetter,advice/question,broad,,False,too_broad
4,HelpMeFind,advice/question,broad,,False,too_broad
5,LifeProTips,advice/question,broad,,False,too_broad
6,MLPLounge,advice/question,broad,,False,too_broad
7,NoStupidQuestions,advice/question,broad,,False,too_broad
8,RBI,advice/question,broad,,False,too_broad
9,TooAfraidToAsk,advice/question,broad,,False,too_broad


In [3]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file).set_index(['subreddit'])

In [7]:
subred_df.shape

(3394, 5)

In [8]:
subred_df.head(10)

subreddit,category_1,category_2,category_3,in_data,reason_for_exclusion
whatsthatbook,advice/question,book,,True,
CasualConversation,advice/question,broad,,False,too_broad
Clairvoyantreadings,advice/question,broad,,False,too_broad
DecidingToBeBetter,advice/question,broad,,False,too_broad
HelpMeFind,advice/question,broad,,False,too_broad
LifeProTips,advice/question,broad,,False,too_broad
MLPLounge,advice/question,broad,,False,too_broad
NoStupidQuestions,advice/question,broad,,False,too_broad
RBI,advice/question,broad,,False,too_broad
TooAfraidToAsk,advice/question,broad,,False,too_broad


## Joining the two dataframes ##

We want to combine the two previous datasets on the subreddit name, which is a column of posts_df and the index of subred_df. 

The argument on='subreddit' names the column of the caller (posts_df) that is matched against the index of subred_df.


In [4]:
df=posts_df.join(subred_df, on ='subreddit')
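
To make the column-on-index matching concrete, here is a minimal sketch on toy data. The two small dataframes and their values are assumptions made up for illustration; only the join call mirrors the one used above.

```python
import pandas as pd

# Toy stand-ins for posts_df and subred_df (values invented for illustration).
posts = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "subreddit": ["Harley", "Mustang", "Harley"],
})
info = pd.DataFrame({
    "subreddit": ["Harley", "Mustang"],
    "category_1": ["autos", "autos"],
    "category_2": ["harley davidson", "ford"],
}).set_index("subreddit")

# join: the "subreddit" column of the caller is matched against the index of `info`.
joined = posts.join(info, on="subreddit")

# merge: the same left join written with explicit keys.
merged = posts.merge(info, how="left", left_on="subreddit", right_index=True)

print(joined)
```

Both calls keep every row of the left dataframe; a post whose subreddit is missing from the index would simply get NaN in the added columns.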

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013000 entries, 0 to 1012999
Data columns (total 9 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   id                    1013000 non-null  object
 1   subreddit             1013000 non-null  object
 2   title                 1013000 non-null  object
 3   selftext              1013000 non-null  object
 4   category_1            1013000 non-null  object
 5   category_2            1013000 non-null  object
 6   category_3            136000 non-null   object
 7   in_data               1013000 non-null  bool  
 8   reason_for_exclusion  0 non-null        object
dtypes: bool(1), object(8)
memory usage: 62.8+ MB


In [51]:
df.shape

(1013000, 9)

In [52]:
df.isna().sum()

id                            0
subreddit                     0
title                         0
selftext                      0
category_1                    0
category_2                    0
category_3               877000
in_data                       0
reason_for_exclusion    1013000
dtype: int64

### Standardizing Attributes Names

Common conventions:
- **df**: name of the dataframe
- **text**: name of the column containing the text to analyze

In [53]:
print(df.columns)

Index(['id', 'subreddit', 'title', 'selftext', 'category_1', 'category_2',
       'category_3', 'in_data', 'reason_for_exclusion'],
      dtype='object')


In [5]:
df=df.drop(columns=['category_3', 'in_data', 'reason_for_exclusion'])

In [14]:
column_mapping = {
    'id':'id',
    'subreddit':'subreddit',
    'title':'title',
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
}

In [6]:
column_mapping = {
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
}

In [7]:
df=df.rename(columns=column_mapping)
print(df.columns)

Index(['id', 'subreddit', 'title', 'text', 'category', 'subcategory'], dtype='object')


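#### Renaming columns and dropping NaN columns - alternative method

This method maps every original column either to its new name or to None (column to drop). It is applied to the joined dataframe in place of the drop and rename calls above.

- selftext renamed as text
- category_1 renamed as category
- category_2 renamed as subcategory


category_3, in_data and reason_for_exclusion **are dropped** (category_3 and reason_for_exclusion are mostly missing)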

In [None]:
column_mapping = {
    'id':'id',
    'subreddit':'subreddit',
    'title':'title',
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
    'category_3': None,
    'in_data': None,
    'reason_for_exclusion': None
}

In [None]:
column_mapping.keys()

In [None]:
# keep only the columns that are mapped to a new name
columns=[c for c in column_mapping.keys() if column_mapping[c] is not None]

In [None]:
print(columns)

In [None]:
df=df[columns].rename(columns=column_mapping)

In [None]:
print(df.columns)

In [None]:
df.head()

In [None]:
print(df['category'].unique())

### Selection of data for the autos category

We restrict the data to the autos category.

In [8]:
df=df[df['category']=='autos']

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 2 to 1012979
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           20000 non-null  object
 1   subreddit    20000 non-null  object
 2   title        20000 non-null  object
 3   text         20000 non-null  object
 4   category     20000 non-null  object
 5   subcategory  20000 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB


In [61]:
df.head()

Unnamed: 0,id,subreddit,title,text,category,subcategory
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson
56,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ...",autos,ford
78,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...,autos,VW
270,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...,autos,lexus
286,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu...",autos,chevrolet


In [17]:
len(df)

20000
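
The index of the filtered dataframe above runs from 2 to 1012979: a boolean filter keeps the original row labels. The sketch below, on an invented toy dataframe, illustrates the difference between label-based and position-based access after such a filter.

```python
import pandas as pd

# Toy frame; the category names mimic the tutorial, the texts are invented.
toy = pd.DataFrame({
    "text": ["post A", "post B", "post C", "post D"],
    "category": ["music", "autos", "music", "autos"],
})

autos = toy[toy["category"] == "autos"]
print(list(autos.index))          # the original labels survive the filter

# Label-based access: autos.index[0] is the label of the first remaining row.
first_by_label = autos.loc[autos.index[0], "text"]

# Position-based access: row number 0, whatever its label.
first_by_position = autos.iloc[0]["text"]

# Optional: renumber the rows 0..n-1 if the original labels are not needed.
autos_reset = autos.reset_index(drop=True)
```

Keeping the original labels is convenient when rows have to be traced back to the full dataset; resetting the index is a matter of taste.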

## Python libraries

Two associated Python libraries:

- **textacy** (https://pypi.org/project/textacy/): preprocessing, i.e. cleaning, normalizing and exploring raw text before processing it with spaCy

- **spaCy** (https://spacy.io/): fundamentals, i.e. tokenization, part-of-speech tagging, dependency parsing...

## Preliminary step: Cleaning Text Data with textacy

The texts are not well edited. Several quality problems need to be taken into account:

- **Salutations, signatures and addresses**: usually not informative.

- **Replies**: if the text contains replies that quote the question, the duplicated question has to be removed; otherwise it can bias the statistical analysis.

- **Special formatting and program code**: if the text contains special characters, HTML entities, Markdown tags, etc., these have to be removed before the analysis.

The textacy module performs such preliminary cleaning tasks on texts:

- replacing or removing punctuation, extra whitespace and numbers before processing with spaCy

textacy is built upon the spaCy module.

https://www.geeksforgeeks.org/textacy-module-in-python/

In [62]:
df.index

Index([      2,      56,      78,     270,     286,     337,     361,     415,
           502,     582,
       ...
       1012426, 1012455, 1012520, 1012552, 1012614, 1012634, 1012658, 1012859,
       1012969, 1012979],
      dtype='int64', length=20000)

In [10]:
text=df.loc[df.index[0],'text'] # df.index[0] is the label of the first row; .loc selects by label
print(text)

Funny story. I went to college in Las Vegas. This was before I knew anything about motorcycling whatsoever. Me and some college buddies would always go out on the strip to the dance clubs. We always ended up at a bar called Hogs &amp; Heifers. It's worth noting the females working there can outdrink ANYONE. Anyway, there was a sign on the front door that read 'No Club Colors'. So we lose our ties and blazers before heading there. Also we assumed bright colors like red, yellow, green etc were not allowed. So we would always bring an xtra t-shirt and pair of jeans. This went on for years! Looking back now on how naive we were, it's just hilarious. I was never able to walk out of that bar....had to crawl out! So much booze. <lb><lb>Cheers. Ride safe, boys! 


Raw text often needs to be cleaned before analysis.

The textacy.preprocessing sub-package contains functions to:

- normalize (whitespace, quotation marks,...)

- remove (punctuation, accents,...)

- replace (URLs, emails, numbers,...)

In [11]:
import textacy
import textacy.preprocessing as tprep

make_pipeline builds a callable pipeline that takes a text as input, passes it through the given functions in sequential order and returns a single preprocessed string. The order matters: each function sees the output of the previous one.

In [13]:
preproc = tprep.make_pipeline(
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.normalize.whitespace,
    tprep.remove.html_tags,
    tprep.remove.accents,
    tprep.remove.punctuation,
    tprep.remove.brackets,
    tprep.replace.numbers,
    tprep.replace.urls,
    tprep.replace.currency_symbols
)

In [14]:
clean_text=preproc(text)

print(clean_text)

Funny story  I went to college in Las Vegas  This was before I knew anything about motorcycling whatsoever  Me and some college buddies would always go out on the strip to the dance clubs  We always ended up at a bar called Hogs   Heifers  It s worth noting the females working there can outdrink ANYONE  Anyway  there was a sign on the front door that read  No Club Colors   So we lose our ties and blazers before heading there  Also we assumed bright colors like red  yellow  green etc were not allowed  So we would always bring an xtra t shirt and pair of jeans  This went on for years  Looking back now on how naive we were  it s just hilarious  I was never able to walk out of that bar    had to crawl out  So much booze  Cheers  Ride safe  boys 


In [15]:
text2= 'There is (no) of these 10 examples of 100 £ loans'

In [16]:
preproc(text2)

'There is  no  of these _NUMBER_ examples of _NUMBER_ _CUR_ loans'
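
To illustrate why the order of the steps matters, here is a minimal standard-library sketch of sequential function composition. It is not textacy's implementation: compose_pipeline and the three regex-based steps are hypothetical helpers, much cruder than the textacy functions used above.

```python
import re
from functools import reduce

def compose_pipeline(*funcs):
    """Return a function that applies each step in turn to a string."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline

# Illustrative steps (hypothetical, far simpler than textacy's own functions).
def replace_urls(text):
    return re.sub(r"https?://\S+", "_URL_", text)

def remove_punctuation(text):
    # Replace anything that is not a word character or whitespace with a space.
    return re.sub(r"[^\w\s]", " ", text)

def normalize_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

sample = "See http://imgur.com/a/DYdKC for pics!"

urls_first = compose_pipeline(replace_urls, remove_punctuation, normalize_whitespace)
punct_first = compose_pipeline(remove_punctuation, replace_urls, normalize_whitespace)

print(urls_first(sample))    # the URL is recognized and replaced
print(punct_first(sample))   # the URL was broken up before it could be recognized
```

When punctuation is stripped first, the URL no longer looks like a URL, so the replacement step has nothing to match.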

### Alternative: creating a specific function

In [17]:
def normalize(text):
    text = tprep.replace.urls(text)  # replace URLs first, while their punctuation is still intact
    text = tprep.remove.html_tags(text)
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    text = tprep.remove.punctuation(text)
    text = tprep.normalize.whitespace(text)
    text = tprep.replace.numbers(text)
    return text

In [18]:
print(normalize(text))

Funny story I went to college in Las Vegas This was before I knew anything about motorcycling whatsoever Me and some college buddies would always go out on the strip to the dance clubs We always ended up at a bar called Hogs Heifers It s worth noting the females working there can outdrink ANYONE Anyway there was a sign on the front door that read No Club Colors So we lose our ties and blazers before heading there Also we assumed bright colors like red yellow green etc were not allowed So we would always bring an xtra t shirt and pair of jeans This went on for years Looking back now on how naive we were it s just hilarious I was never able to walk out of that bar had to crawl out So much booze Cheers Ride safe boys


In [19]:
df_small = df.loc[df.index[:5],]

In [20]:
df_small.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 2 to 286
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           5 non-null      object
 1   subreddit    5 non-null      object
 2   title        5 non-null      object
 3   text         5 non-null      object
 4   category     5 non-null      object
 5   subcategory  5 non-null      object
dtypes: object(6)
memory usage: 280.0+ bytes


In [21]:
df_small['text'].apply(normalize)

2      Funny story I went to college in Las Vegas Thi...
56     I am trying to determine which is faster and I...
78     Hello Trying to find some information on repla...
270    URL have a IS _NUMBER_ AWD from _NUMBER_ About...
286    Hi new to this subreddit I m considering buyin...
Name: text, dtype: object
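
The call above only displays the cleaned texts. A common next step is to store them in a new column so that the raw text stays available. The sketch below assumes a hypothetical column name clean_text and uses a crude standard-library stand-in (simple_clean) for the normalize() function, so that it runs without textacy.

```python
import re
import pandas as pd

def simple_clean(text):
    """Crude stand-in for the normalize() function above (stdlib only)."""
    text = re.sub(r"<lb>", " ", text)          # line-break markers seen in the raw posts
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

# Invented example texts in the style of the posts.
toy = pd.DataFrame({"text": ["Hello! <lb><lb>Trying to find info...", "Ride safe, boys!"]})

# Store the result in a new column; the raw text is left untouched.
toy["clean_text"] = toy["text"].apply(simple_clean)
print(toy["clean_text"].tolist())
```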

## Linguistic Processing with spaCy

- spaCy: a library for linguistic data processing

- spaCy's pipeline is language dependent: we have to load a particular pipeline to process the text 
    
- spaCy provides an integrated pipeline for processing documents:
    
    1. a tokenizer (always applied first; it is not listed in nlp.pipeline)
    2. a token-to-vector embedding layer : tok2vec
    3. a part-of-speech tagger : tagger
    4. a dependency parser : parser
    5. a sentence recognizer : senter (it does not appear in the nlp.pipeline output shown below)
    6. an attribute ruler : attribute_ruler
    7. a lemmatizer : lemmatizer
    8. a named-entity recognizer : ner
    
- the tokenizer is based on language-dependent rules => fast


- the tagger, the parser and the sentence recognizer are based on pretrained neural models => they can take 10-20 times as long as tokenization

- The initial input is a text

- The final output is a **Doc** object

- The **Doc** object contains a sequence of **Token** objects

- Any range selection of tokens creates a **Span**
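
The relation between Doc, Token and Span can be tried out without downloading a trained model. The sketch below assumes that spaCy is installed and uses spacy.blank("en"), which contains only the rule-based tokenizer; the sentence is a fragment of the example post.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model has to be downloaded for this sketch.
nlp_blank = spacy.blank("en")

doc_demo = nlp_blank("I went to college in Las Vegas.")

print(type(doc_demo).__name__)               # the result of the call is a Doc
print([token.text for token in doc_demo])    # iterating over a Doc yields Token objects

span = doc_demo[5:7]                         # slicing a Doc gives a Span
print(type(span).__name__, "->", span.text)
```

Because no tagger or parser is loaded, attributes such as pos_ or dep_ stay empty on this Doc; they require a trained pipeline such as en_core_web_sm.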

We load one of spaCy's trained pipelines for English.

For example, en_core_web_sm is a small English pipeline that was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more.

https://spacy.io/models/en

In [27]:
import spacy

In [28]:
# 'en_core_web_sm' is the name of the spaCy pipeline to download and install
from spacy.cli import download
print(download('en_core_web_sm'))
#print(download('en_core_web_md'))
#print(download('en_core_web_lg'))

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
None


We make a spaCy **Doc** from the cleaned text.

spaCy's linguistic functions operate on a Doc, not on a raw string.

In [29]:
doc = textacy.make_spacy_doc(clean_text,lang="en_core_web_sm")
doc._.preview


'Doc(166 tokens: "Funny story  I went to college in Las Vegas  Th...")'

In [30]:
print(doc)

Funny story  I went to college in Las Vegas  This was before I knew anything about motorcycling whatsoever  Me and some college buddies would always go out on the strip to the dance clubs  We always ended up at a bar called Hogs   Heifers  It s worth noting the females working there can outdrink ANYONE  Anyway  there was a sign on the front door that read  No Club Colors   So we lose our ties and blazers before heading there  Also we assumed bright colors like red  yellow  green etc were not allowed  So we would always bring an xtra t shirt and pair of jeans  This went on for years  Looking back now on how naive we were  it s just hilarious  I was never able to walk out of that bar    had to crawl out  So much booze  Cheers  Ride safe  boys 


### Alternative code

In [38]:
nlp = spacy.load('en_core_web_sm')

In [77]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x24a23c7f170>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x24a23c7f350>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x24a23cb0200>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x24a21a28190>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x24a23ce2a90>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x24a23cb03c0>)]

In [87]:
doc_alt = nlp(clean_text)
print(doc_alt)

Funny story  I went to college in Las Vegas  This was before I knew anything about motorcycling whatsoever  Me and some college buddies would always go out on the strip to the dance clubs  We always ended up at a bar called Hogs   Heifers  It s worth noting the females working there can outdrink ANYONE  Anyway  there was a sign on the front door that read  No Club Colors   So we lose our ties and blazers before heading there  Also we assumed bright colors like red  yellow  green etc were not allowed  So we would always bring an xtra t shirt and pair of jeans  This went on for years  Looking back now on how naive we were  it s just hilarious  I was never able to walk out of that bar    had to crawl out  So much booze  Cheers  Ride safe  boys 


### Displaying tokens in a document

In [32]:
for token in doc:
    print(token.text)

Funny
story
 
I
went
to
college
in
Las
Vegas
 
This
was
before
I
knew
anything
about
motorcycling
whatsoever
 
Me
and
some
college
buddies
would
always
go
out
on
the
strip
to
the
dance
clubs
 
We
always
ended
up
at
a
bar
called
Hogs
  
Heifers
 
It
s
worth
noting
the
females
working
there
can
outdrink
ANYONE
 
Anyway
 
there
was
a
sign
on
the
front
door
that
read
 
No
Club
Colors
  
So
we
lose
our
ties
and
blazers
before
heading
there
 
Also
we
assumed
bright
colors
like
red
 
yellow
 
green
etc
were
not
allowed
 
So
we
would
always
bring
an
xtra
t
shirt
and
pair
of
jeans
 
This
went
on
for
years
 
Looking
back
now
on
how
naive
we
were
 
it
s
just
hilarious
 
I
was
never
able
to
walk
out
of
that
bar
   
had
to
crawl
out
 
So
much
booze
 
Cheers
 
Ride
safe
 
boys


### Tokens have attributes 

- token.is_punct : Is the token punctuation?
- token.is_alpha : Does the token consist of alphabetic characters?
- token.like_email : Does the token resemble an email address?
- token.like_url : Does the token resemble a URL?
- token.is_stop : Is the token part of a “stop list”?
- token.lemma_ : Base form of the token, with no inflectional suffixes.
- token.pos_ : Core part-of-speech category (https://universaldependencies.org/u/pos/)

See https://spacy.io/api/token for the list of all attributes
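
These boolean attributes are typically combined to filter tokens. The sketch below assumes that spaCy is installed and again uses a blank English pipeline (tokenizer only); the short sentence is invented from words of the example post.

```python
import spacy

nlp_blank = spacy.blank("en")   # tokenizer only; lexical attributes still work
doc_demo = nlp_blank("Funny story, the strip!")

for token in doc_demo:
    print(f"{token.text!r:10} punct={token.is_punct} alpha={token.is_alpha} stop={token.is_stop}")

# Keep alphabetic tokens that are not stop words.
kept = [t.text for t in doc_demo if t.is_alpha and not t.is_stop]
print(kept)
```

Attributes that depend on trained components, such as lemma_ and pos_, are not filled in by a blank pipeline.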

In [88]:
for token in doc:
    print(token,token.is_punct)

Funny False
story False
  False
I False
went False
to False
college False
in False
Las False
Vegas False
  False
This False
was False
before False
I False
knew False
anything False
about False
motorcycling False
whatsoever False
  False
Me False
and False
some False
college False
buddies False
would False
always False
go False
out False
on False
the False
strip False
to False
the False
dance False
clubs False
  False
We False
always False
ended False
up False
at False
a False
bar False
called False
Hogs False
   False
Heifers False
  False
It False
s False
worth False
noting False
the False
females False
working False
there False
can False
outdrink False
ANYONE False
  False
Anyway False
  False
there False
was False
a False
sign False
on False
the False
front False
door False
that False
read False
  False
No False
Club False
Colors False
   False
So False
we False
lose False
our False
ties False
and False
blazers False
before False
heading False
there False
  False
Also False
we False

In [90]:
# identifying alphabetical characters
for token in doc:
    print(token,token.is_alpha)

Funny True
story True
  False
I True
went True
to True
college True
in True
Las True
Vegas True
  False
This True
was True
before True
I True
knew True
anything True
about True
motorcycling True
whatsoever True
  False
Me True
and True
some True
college True
buddies True
would True
always True
go True
out True
on True
the True
strip True
to True
the True
dance True
clubs True
  False
We True
always True
ended True
up True
at True
a True
bar True
called True
Hogs True
   False
Heifers True
  False
It True
s True
worth True
noting True
the True
females True
working True
there True
can True
outdrink True
ANYONE True
  False
Anyway True
  False
there True
was True
a True
sign True
on True
the True
front True
door True
that True
read True
  False
No True
Club True
Colors True
   False
So True
we True
lose True
our True
ties True
and True
blazers True
before True
heading True
there True
  False
Also True
we True
assumed True
bright True
colors True
like True
red True
  False
yellow True
  Fa

In [89]:
# identifying stop words in a document
for token in doc:
    print(token,token.is_stop)

Funny False
story False
  False
I True
went False
to True
college False
in True
Las False
Vegas False
  False
This True
was True
before True
I True
knew False
anything True
about True
motorcycling False
whatsoever False
  False
Me True
and True
some True
college False
buddies False
would True
always True
go True
out True
on True
the True
strip False
to True
the True
dance False
clubs False
  False
We True
always True
ended False
up True
at True
a True
bar False
called False
Hogs False
   False
Heifers False
  False
It True
s False
worth False
noting False
the True
females False
working False
there True
can True
outdrink False
ANYONE True
  False
Anyway True
  False
there True
was True
a True
sign False
on True
the True
front True
door False
that True
read False
  False
No True
Club False
Colors False
   False
So True
we True
lose False
our True
ties False
and True
blazers False
before True
heading False
there True
  False
Also True
we True
assumed False
bright False
colors False
like F

## Part-of-speech

- **Parts of speech** are the grammatical categories of words: verbs, nouns, adjectives, adverbs, pronouns, prepositions...

- Parts of speech can be used to explore syntax

- Each token in a spaCy doc has two part-of-speech attributes:
    - pos_
    - tag_
- tag_ can be language specific 
- pos_ contains the simplified tag of the universal part-of-speech tagset
 
- Filtering on pos_ can be used as an alternative to stop words

- Parts of speech fall into two broad categories:

- pronouns, prepositions, conjunctions, determiners: 
    - called **function words**
    - their main function is to create grammatical relationships in a sentence
    - not very informative

- nouns, verbs, adjectives and adverbs: 
    - **content** words
    - the meaning of a sentence depends on them
    

- We can therefore use **part-of-speech tags** to select tokens by word type

https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/13-POS-Keywords.html#
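
The split between content words and function words can be sketched with plain Python. The (token, pos_) pairs below are the ones spaCy reports in this section for the sentence 'You need to write an abstract'; the set CONTENT_POS is an illustrative choice covering nouns, verbs, adjectives and adverbs.

```python
# Universal POS tags treated here as content words (an illustrative choice).
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

# (token, pos_) pairs for 'You need to write an abstract'.
tagged = [
    ("You", "PRON"), ("need", "VERB"), ("to", "PART"),
    ("write", "VERB"), ("an", "DET"), ("abstract", "NOUN"),
]

content_words = [word for word, pos in tagged if pos in CONTENT_POS]
function_words = [word for word, pos in tagged if pos not in CONTENT_POS]

print(content_words)    # ['need', 'write', 'abstract']
print(function_words)   # ['You', 'to', 'an']
```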

spaCy's models are trained to predict pos_ from the context in which a word appears, so the same word can receive different tags:

In [40]:
sentence1 = 'You need to write an abstract'
token_sentence1 = nlp(sentence1)
for token in token_sentence1:
    print(token, token.pos_)

You PRON
need VERB
to PART
write VERB
an DET
abstract NOUN


In [47]:
sentence2 = 'At his age, he still fails to abstract certain concepts'
token_sentence2 = nlp(sentence2)
for token in token_sentence2:
    print(token, token.pos_)

At ADP
his PRON
age NOUN
, PUNCT
he PRON
still ADV
fails VERB
to PART
abstract ADJ
certain ADJ
concepts NOUN

Note that the small model tags "abstract" as ADJ here although it is used as a verb: the predictions are statistical and not always correct.


In [49]:
sentence3 = "He ages well"
token_sentence3 = nlp(sentence3)
for token in token_sentence3:
    print(token, token.pos_)

He PRON
ages VERB
well ADV


### Tokens and pos_ of doc

In [35]:
for token in doc:
    print(token, token.pos_, spacy.explain(token.pos_))

Funny ADJ adjective
story NOUN noun
  SPACE space
I PRON pronoun
went VERB verb
to ADP adposition
college NOUN noun
in ADP adposition
Las PROPN proper noun
Vegas PROPN proper noun
  SPACE space
This PRON pronoun
was AUX auxiliary
before SCONJ subordinating conjunction
I PRON pronoun
knew VERB verb
anything PRON pronoun
about ADP adposition
motorcycling VERB verb
whatsoever ADP adposition
  SPACE space
Me PRON pronoun
and CCONJ coordinating conjunction
some DET determiner
college NOUN noun
buddies NOUN noun
would AUX auxiliary
always ADV adverb
go VERB verb
out ADP adposition
on ADP adposition
the DET determiner
strip NOUN noun
to ADP adposition
the DET determiner
dance NOUN noun
clubs NOUN noun
  SPACE space
We PRON pronoun
always ADV adverb
ended VERB verb
up ADP adposition
at ADP adposition
a DET determiner
bar NOUN noun
called VERB verb
Hogs PROPN proper noun
   SPACE space
Heifers PROPN proper noun
  SPACE space
It PRON pronoun
s VERB verb
worth ADJ adjective
noting VERB verb
the D

We want to make a list of the nouns in doc.

In [53]:
nouns=[]
for token in doc:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)
        

In [54]:
nouns

['story',
 'college',
 'college',
 'buddies',
 'strip',
 'dance',
 'clubs',
 'bar',
 'females',
 'sign',
 'door',
 'ties',
 'blazers',
 'colors',
 'etc',
 'shirt',
 'pair',
 'jeans',
 'years',
 'bar',
 'booze',
 'boys']

In [60]:
from collections import Counter 
nouns_count = Counter(nouns)
print(nouns_count)

Counter({'college': 2, 'bar': 2, 'story': 1, 'buddies': 1, 'strip': 1, 'dance': 1, 'clubs': 1, 'females': 1, 'sign': 1, 'door': 1, 'ties': 1, 'blazers': 1, 'colors': 1, 'etc': 1, 'shirt': 1, 'pair': 1, 'jeans': 1, 'years': 1, 'booze': 1, 'boys': 1})


In [63]:
nouns_count.most_common()

[('college', 2),
 ('bar', 2),
 ('story', 1),
 ('buddies', 1),
 ('strip', 1),
 ('dance', 1),
 ('clubs', 1),
 ('females', 1),
 ('sign', 1),
 ('door', 1),
 ('ties', 1),
 ('blazers', 1),
 ('colors', 1),
 ('etc', 1),
 ('shirt', 1),
 ('pair', 1),
 ('jeans', 1),
 ('years', 1),
 ('booze', 1),
 ('boys', 1)]

### textacy functions to extract words according to their part of speech
textacy.extract.words returns a generator of tokens; wrap it in list() to obtain a list.

In [119]:
token_alt =textacy.extract.words(doc)
print(list(token_alt))


[Funny, story, went, college, Las, Vegas, knew, motorcycling, whatsoever, college, buddies, strip, dance, clubs, ended, bar, called, Hogs, Heifers, s, worth, noting, females, working, outdrink, sign, door, read, Club, Colors, lose, ties, blazers, heading, assumed, bright, colors, like, red, yellow, green, etc, allowed, bring, xtra, t, shirt, pair, jeans, went, years, Looking, naive, s, hilarious, able, walk, bar, crawl, booze, Cheers, Ride, safe, boys]


In [113]:
# The input must be a spaCy Doc; the generator is unpacked and printed with "|" as separator
tokens1=textacy.extract.words(doc, include_pos={"ADJ","NOUN"})
print(*tokens1, sep="|")

Funny|story|college|college|buddies|strip|dance|clubs|bar|worth|females|sign|door|ties|blazers|bright|colors|red|yellow|green|etc|shirt|pair|jeans|years|naive|hilarious|able|bar|booze|safe|boys


In [114]:
tokens2=textacy.extract.words(doc, include_pos={"ADJ","NOUN"},min_freq=2)
print(*tokens2, sep="|")

college|college|bar|bar
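
The effect of min_freq can be mimicked with a Counter. This is only a sketch of the idea, not textacy's code: keep_frequent is a hypothetical helper, counting on lowercased forms is an assumption, and the word list is a subset of the ADJ/NOUN tokens printed above.

```python
from collections import Counter

# A few of the ADJ/NOUN tokens extracted above, as plain strings.
words = ["Funny", "story", "college", "college", "buddies", "bar", "worth", "bar", "booze"]

def keep_frequent(tokens, min_freq=2):
    """Keep tokens whose lowercased form occurs at least `min_freq` times."""
    counts = Counter(token.lower() for token in tokens)
    return [token for token in tokens if counts[token.lower()] >= min_freq]

print(*keep_frequent(words), sep="|")   # college|college|bar|bar
```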


### Tags 
A more detailed classification than pos_.

In [34]:
for token in doc:
    print(token,token.tag_,spacy.explain(token.tag_))


Funny JJ adjective (English), other noun-modifier (Chinese)
story NN noun, singular or mass
  _SP whitespace
I PRP pronoun, personal
went VBD verb, past tense
to IN conjunction, subordinating or preposition
college NN noun, singular or mass
in IN conjunction, subordinating or preposition
Las NNP noun, proper singular
Vegas NNP noun, proper singular
  _SP whitespace
This DT determiner
was VBD verb, past tense
before IN conjunction, subordinating or preposition
I PRP pronoun, personal
knew VBD verb, past tense
anything NN noun, singular or mass
about IN conjunction, subordinating or preposition
motorcycling VBG verb, gerund or present participle
whatsoever IN conjunction, subordinating or preposition
  _SP whitespace
Me PRP pronoun, personal
and CC conjunction, coordinating
some DT determiner
college NN noun, singular or mass
buddies NNS noun, plural
would MD verb, modal auxiliary
always RB adverb
go VB verb, base form
out RP adverb, particle
on IN conjunction, subordinating or prepositi

### dep_: dependency structure

In [67]:
from spacy import displacy

In [70]:
#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(token_sentence1, style="dep", options=options)


In [65]:
for token in doc:
    print(token,token.dep_,spacy.explain(token.dep_))


Funny amod adjectival modifier
story ROOT root
  dep unclassified dependent
I nsubj nominal subject
went ROOT root
to prep prepositional modifier
college pobj object of preposition
in prep prepositional modifier
Las compound compound
Vegas pobj object of preposition
  dep unclassified dependent
This nsubj nominal subject
was ROOT root
before mark marker
I nsubj nominal subject
knew advcl adverbial clause modifier
anything dobj direct object
about prep prepositional modifier
motorcycling pcomp complement of preposition
whatsoever advmod adverbial modifier
  dep unclassified dependent
Me pobj object of preposition
and cc coordinating conjunction
some det determiner
college compound compound
buddies conj conjunct
would aux auxiliary
always advmod adverbial modifier
go conj conjunct
out prt particle
on prep prepositional modifier
the det determiner
strip pobj object of preposition
to prep prepositional modifier
the det determiner
dance compound compound
clubs pobj object of preposition
  d

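## Lemmatization / Stemming

- Replacing words with their root:
    - Stemming strips suffixes: "economic", "economics", "economically" are all reduced to the same stem ("econom" with the Porter stemmer)
    - Lemmatization maps each word to its dictionary form (its lemma): "went" becomes "go", "buddies" becomes "buddy"
    - Porter stemmer (Porter 1980): standard stemming tool for English-language text
- Smaller vocabulary: faster execution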
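
As a rough illustration of how stemming shrinks the vocabulary, here is a toy suffix-stripping stemmer. It is not the Porter algorithm: the suffix list and the minimum stem length are hypothetical choices made only for this example.

```python
# Toy stemmer for illustration only (NOT the Porter algorithm).
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ically", "ics", "ic", "ies", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["economic", "economics", "economically", "college", "colleges"]
stems = [toy_stem(w) for w in words]

vocabulary_before = set(words)   # 5 distinct word forms
vocabulary_after = set(stems)    # 2 distinct stems
print(stems)
print(len(vocabulary_before), "->", len(vocabulary_after))
```

Five distinct word forms collapse into two stems, which is the vocabulary reduction mentioned above. Real stemmers and lemmatizers use far more careful rules.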

In [71]:
for token in doc:
    print(token,token.lemma_)

Funny funny
story story
   
I I
went go
to to
college college
in in
Las Las
Vegas Vegas
   
This this
was be
before before
I I
knew know
anything anything
about about
motorcycling motorcycle
whatsoever whatsoever
   
Me I
and and
some some
college college
buddies buddy
would would
always always
go go
out out
on on
the the
strip strip
to to
the the
dance dance
clubs club
   
We we
always always
ended end
up up
at at
a a
bar bar
called call
Hogs Hogs
     
Heifers Heifers
   
It it
s s
worth worth
noting note
the the
females female
working work
there there
can can
outdrink outdrink
ANYONE anyone
   
Anyway anyway
   
there there
was be
a a
sign sign
on on
the the
front front
door door
that that
read read
   
No no
Club Club
Colors Colors
     
So so
we we
lose lose
our our
ties tie
and and
blazers blazer
before before
heading head
there there
   
Also also
we we
assumed assume
bright bright
colors color
like like
red red
   
yellow yellow
   
green green
etc etc
were be
not not
allowed a

### Analysis of a Doc

- Extracting n-grams

In [73]:
from textacy import extract
list(extract.ngrams(doc,2))

[Funny story,
 Las Vegas,
 motorcycling whatsoever,
 college buddies,
 dance clubs,
 bar called,
 called Hogs,
 s worth,
 worth noting,
 females working,
 Club Colors,
 assumed bright,
 bright colors,
 colors like,
 like red,
 green etc,
 xtra t,
 t shirt,
 Ride safe]
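
The idea behind the bigram list above can be sketched in plain Python: slide a window of size n over the tokens and, as textacy appears to do by default, drop the n-grams that start or end with a stop word. The stop list below is a hypothetical three-word subset, the token list is hand-written, and the function ngrams is an illustration rather than the textacy implementation.

```python
STOP_WORDS = {"i", "to", "in"}  # hypothetical, tiny stop list for this example

def ngrams(tokens, n, stop_words=frozenset()):
    """Return the n-grams (as tuples) that neither start nor end with a stop word."""
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram[0].lower() in stop_words or gram[-1].lower() in stop_words:
            continue
        grams.append(gram)
    return grams

# Hand-written tokens taken from the first two sentences of the post
tokens = ["Funny", "story", "I", "went", "to", "college", "in", "Las", "Vegas"]
bigrams = ngrams(tokens, 2, STOP_WORDS)
print(bigrams)  # [('Funny', 'story'), ('Las', 'Vegas')]
```

Without the stop-word filter, the same window yields all eight consecutive pairs.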

### Remark: we can disable some components of the spaCy pipeline

When some components are not needed, we can load the pipeline without them, which speeds up processing. Here the parser and the named-entity recognizer are disabled:

In [75]:
nlp_2=spacy.load('en_core_web_sm', disable=["parser","ner"])

In [76]:
nlp_2.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x24a43f38650>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x24a43f38830>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x24a41957150>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x24a3a8717d0>)]

## Working with stop words

- spaCy uses language-specific stop word lists to set the is_stop property for each token
- Filtering stop words (and punctuation tokens) then amounts to testing is_stop (and is_punct) on each token
- The list of stop words is loaded when an nlp object is created

In [79]:
print(nlp.Defaults.stop_words)

{'without', 'against', 'any', 'anything', 'twelve', 'many', 'about', '’re', 'my', 'various', 'myself', 'these', 'whom', 'nowhere', 'ten', 'neither', 'nevertheless', 'eleven', 'must', 'our', 'same', 'down', 'do', 'often', "'ll", 'thru', 'thereby', 'where', 'amongst', 'again', 'being', 'really', 'hereupon', 'here', 'front', 'might', 'should', 'most', 'your', 'yourself', 'sixty', 'formerly', 'serious', 'on', 'only', 'latterly', 'now', 'whenever', 'go', 'empty', 'using', 'i', 'two', 'unless', 'under', 'whence', 'this', 'because', 'he', 'several', 'sometime', 'call', 'someone', 'has', 'see', 'six', 'former', 'somehow', 'n’t', 'please', 'when', 'themselves', 'next', 'she', 'perhaps', 'sometimes', 'did', 'above', 'part', 'always', 'throughout', 'between', 'too', 'say', 'yours', 'first', 'seemed', 'mine', 'whereas', 'ourselves', 'may', 'thus', 'why', 'toward', 'due', 'which', 'its', 'else', 'they', 'whereupon', 'whereby', 'anyhow', 'up', 'nor', 'upon', 'top', 'onto', 'we', 'behind', 'via', 'wh

### The list of stop words can be modified

In [80]:
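# Stop-word status is stored on the vocabulary entry (lexeme) of each word
nlp.vocab['down'].is_stop = False    # "down" is no longer a stop word
nlp.vocab['Dear'].is_stop = True     # add "Dear" and "Regards" as stop words
nlp.vocab['Regards'].is_stop = True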
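
The effect of such changes can be sketched without spaCy. In the example below the stop list is a hypothetical four-word subset and matching is done on the lowercased token, which is a simplification of how spaCy stores the flag; remove_stop_words is an illustrative helper, not a spaCy function.

```python
def remove_stop_words(tokens, stop_words):
    """Keep the tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

default_stops = {"i", "to", "the", "down"}  # hypothetical subset of a stop list
tokens = ["Dear", "team", "I", "went", "down", "to", "the", "bar", "Regards"]

filtered_default = remove_stop_words(tokens, default_stops)
print(filtered_default)  # ['Dear', 'team', 'went', 'bar', 'Regards']

# Same customisation as in the cell above: keep "down", drop "Dear" and "Regards"
custom_stops = (default_stops - {"down"}) | {"dear", "regards"}
filtered_custom = remove_stop_words(tokens, custom_stops)
print(filtered_custom)   # ['team', 'went', 'down', 'bar']
```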

### Extracting lemmas

In [89]:
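def extract_lemmas(doc, **kwargs):
    """Return the lemmas of the words selected by textacy.extract.words (kwargs are passed through)."""
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]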

In [90]:
tokenized_doc = extract_lemmas(doc,min_freq=2)
print(*tokenized_doc, sep = "|")
len(tokenized_doc)

go|college|college|bar|s|Colors|color|go|s|bar


10

In [91]:
tokenized_doc = extract_lemmas(doc,  include_pos={"ADJ","NOUN"})
print(*tokenized_doc, sep = "|")
len(tokenized_doc)

funny|story|college|college|buddy|strip|dance|club|bar|worth|female|sign|door|tie|blazer|bright|color|red|yellow|green|etc|shirt|pair|jean|year|naive|hilarious|able|bar|booze|safe|boy


32

### Extracting Named entities

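- Named-entity recognition is the process of detecting entities such as people, locations and organizations in texts
- The **named-entity recognizer** sets the following attributes:
    - Doc.ents: the entity spans found in the document
    - Token.ent_iob_: position of the token relative to an entity ("B" begins, "I" inside, "O" outside)
    - Token.ent_type_: the entity label of the token (empty for tokens outside any entity)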
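
To show how the token-level attributes relate to Doc.ents, the sketch below rebuilds entity spans from IOB tags. The tokens, tags and labels are written by hand for illustration (they are not model output), and spans_from_iob is a hypothetical helper.

```python
def spans_from_iob(tokens, iob_tags, ent_types):
    """Group consecutive B/I tokens into (text, label) entity spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, iob, ent_type in zip(tokens, iob_tags, ent_types):
        if iob == "B":                        # a new entity starts here
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], ent_type
        elif iob == "I" and current_tokens:   # continuation of the open entity
            current_tokens.append(token)
        else:                                 # "O": close any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                        # entity running to the end
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Hand-written example mimicking Token.ent_iob_ and Token.ent_type_
tokens = ["I", "went", "to", "college", "in", "Las", "Vegas", "for", "years"]
iob_tags = ["O", "O", "O", "O", "O", "B", "I", "O", "B"]
ent_types = ["", "", "", "", "", "GPE", "GPE", "", "DATE"]

entities = spans_from_iob(tokens, iob_tags, ent_types)
print(entities)  # [('Las Vegas', 'GPE'), ('years', 'DATE')]
```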

In [92]:
text0 = df.loc[df.index[0], 'text']  # text of the first row of df
print(text0)

Funny story. I went to college in Las Vegas. This was before I knew anything about motorcycling whatsoever. Me and some college buddies would always go out on the strip to the dance clubs. We always ended up at a bar called Hogs &amp; Heifers. It's worth noting the females working there can outdrink ANYONE. Anyway, there was a sign on the front door that read 'No Club Colors'. So we lose our ties and blazers before heading there. Also we assumed bright colors like red, yellow, green etc were not allowed. So we would always bring an xtra t-shirt and pair of jeans. This went on for years! Looking back now on how naive we were, it's just hilarious. I was never able to walk out of that bar....had to crawl out! So much booze. <lb><lb>Cheers. Ride safe, boys! 


In [93]:
# Preprocessing with the textacy pipeline
clean_text0=preproc(text0)

print(clean_text0)

Funny story  I went to college in Las Vegas  This was before I knew anything about motorcycling whatsoever  Me and some college buddies would always go out on the strip to the dance clubs  We always ended up at a bar called Hogs   Heifers  It s worth noting the females working there can outdrink ANYONE  Anyway  there was a sign on the front door that read  No Club Colors   So we lose our ties and blazers before heading there  Also we assumed bright colors like red  yellow  green etc were not allowed  So we would always bring an xtra t shirt and pair of jeans  This went on for years  Looking back now on how naive we were  it s just hilarious  I was never able to walk out of that bar    had to crawl out  So much booze  Cheers  Ride safe  boys 


In [94]:
doc0 = textacy.make_spacy_doc(clean_text0,lang="en_core_web_sm")
doc0._.preview

'Doc(166 tokens: "Funny story  I went to college in Las Vegas  Th...")'

In [95]:
doc0

Funny story  I went to college in Las Vegas  This was before I knew anything about motorcycling whatsoever  Me and some college buddies would always go out on the strip to the dance clubs  We always ended up at a bar called Hogs   Heifers  It s worth noting the females working there can outdrink ANYONE  Anyway  there was a sign on the front door that read  No Club Colors   So we lose our ties and blazers before heading there  Also we assumed bright colors like red  yellow  green etc were not allowed  So we would always bring an xtra t shirt and pair of jeans  This went on for years  Looking back now on how naive we were  it s just hilarious  I was never able to walk out of that bar    had to crawl out  So much booze  Cheers  Ride safe  boys 

In [96]:
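# Note: spaCy's English models label places as "GPE" or "LOC"; "LOCATION" matches no entity,
# which is why Las Vegas (GPE) is not returned here
list(textacy.extract.entities(doc, include_types={"DATE","PRODUCT","ORG","LOCATION"}))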

[years]

In [97]:
for ent in doc.ents:
    print(f"({ent.text},{ent.label_})",end="")

(Las Vegas,GPE)(years,DATE)

In [98]:
from spacy import displacy
displacy.render(doc,style="ent")

## Making a Corpus

A textacy.Corpus is an ordered collection of spaCy Docs, all processed by the same language pipeline.

In [99]:
records=df['text']

# Generator expression: texts are preprocessed lazily, one at a time
preproc_records = (preproc(text) for text in records)

In [100]:
corpus=textacy.Corpus("en_core_web_sm",data=preproc_records)

In [102]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

(20000, 88521, 2970183)

In [103]:
corpus[0]._.preview

'Doc(166 tokens: "Funny story  I went to college in Las Vegas  Th...")'

In [104]:
corpus[0]

Funny story  I went to college in Las Vegas  This was before I knew anything about motorcycling whatsoever  Me and some college buddies would always go out on the strip to the dance clubs  We always ended up at a bar called Hogs   Heifers  It s worth noting the females working there can outdrink ANYONE  Anyway  there was a sign on the front door that read  No Club Colors   So we lose our ties and blazers before heading there  Also we assumed bright colors like red  yellow  green etc were not allowed  So we would always bring an xtra t shirt and pair of jeans  This went on for years  Looking back now on how naive we were  it s just hilarious  I was never able to walk out of that bar    had to crawl out  So much booze  Cheers  Ride safe  boys 

### Transforming a corpus into an array 

**textacy.representations.vectorizers** : Transform a collection of tokenized docs into a **doc-term matrix** of shape (# docs, # unique terms), with various ways to filter or limit included terms and flexible weighting schemes for their values.
    
    
https://textacy.readthedocs.io/en/latest/api_reference/representations.html#  
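
To make the idea of a doc-term matrix concrete, the sketch below builds one by hand with plain Python lists. The three tokenized documents are made up for illustration, and the dense list of lists is a simplification: textacy returns a sparse matrix.

```python
# Made-up tokenized documents (lists of lemmas)
tokenized_docs = [
    ["funny", "story", "college", "college", "bar"],
    ["advice", "story"],
    ["bar", "bar", "answer"],
]

# One column per unique term, in alphabetical order
terms = sorted({term for doc in tokenized_docs for term in doc})
vocabulary = {term: j for j, term in enumerate(terms)}

# One row per document; entry (i, j) counts term j in document i
matrix = [[0] * len(terms) for _ in tokenized_docs]
for i, doc in enumerate(tokenized_docs):
    for term in doc:
        matrix[i][vocabulary[term]] += 1

print(terms)
for row in matrix:
    print(row)
print("shape:", (len(matrix), len(terms)))
```

The shape is (number of docs, number of unique terms), here (3, 6), and most entries are zero, which is why a sparse representation is used in practice.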

In [143]:
tokenized_docs = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"ADJ","NOUN"})) for doc in corpus[:20])

In [144]:
from textacy.representations import Vectorizer

### Specification of the Vectorizer

- tf_type: type of term frequency (tf) to use for the local component of the weights; can be "linear", "sqrt", "log" or "binary". Here tf_type = "linear".
- idf_type: type of inverse document frequency (idf) to use for the global component of the weights; can be "standard", "smooth" or "bm25".
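
The options above can be written out as small functions. The formulas are the ones we understand the textacy documentation to describe; treat them as assumptions to be checked against the API reference, not as the library's code. The function names tf_weight and idf_weight are hypothetical.

```python
import math

def tf_weight(tf, tf_type="linear"):
    """Local weight of a term that occurs tf times in a document (assumed formulas)."""
    if tf == 0:
        return 0.0
    if tf_type == "linear":
        return float(tf)
    if tf_type == "sqrt":
        return math.sqrt(tf)
    if tf_type == "log":
        return math.log(tf) + 1.0
    if tf_type == "binary":
        return 1.0
    raise ValueError(f"unknown tf_type: {tf_type}")

def idf_weight(n_docs, df, idf_type="standard"):
    """Global weight of a term that occurs in df of the n_docs documents (assumed formulas)."""
    if idf_type == "standard":
        return math.log(n_docs / df) + 1.0
    if idf_type == "smooth":
        return math.log((n_docs + 1) / (df + 1)) + 1.0
    if idf_type == "bm25":
        return math.log((n_docs - df + 0.5) / (df + 0.5))
    raise ValueError(f"unknown idf_type: {idf_type}")

# A term occurring 4 times in a document and present in 5 of 20 documents
for tf_type in ("linear", "sqrt", "log", "binary"):
    print(tf_type, tf_weight(4, tf_type))
for idf_type in ("standard", "smooth", "bm25"):
    print(idf_type, idf_weight(20, 5, idf_type))
```

The final weight of a term in a document is the product of its local (tf) and global (idf) components.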

In [146]:
vectorizer_alt = Vectorizer( tf_type="linear")

In [147]:
vectorizer_alt.weighting

'tf'

In [148]:
doc_term_matrix_alt = vectorizer_alt.fit_transform(tokenized_docs)
doc_term_matrix_alt

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 505 stored elements and shape (20, 383)>

Terms associated with columns

In [149]:
vectorizer_alt.terms_list[:10]

['100k',
 '35',
 '4matic',
 'EDIT',
 'a4',
 'able',
 'advice',
 'altitude',
 'anchor',
 'answer']

In [150]:
print(doc_term_matrix_alt[:20, vectorizer_alt.vocabulary_terms["story"]].toarray())

[[1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]


In [152]:
# Note: corpus[21:40] selects 19 documents (indices 21 to 39); document 20 is in neither batch
tokenized_docs_n = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"ADJ","NOUN"})) for doc in corpus[21:40])

In [156]:
doc_matrix_terms_alt_n = vectorizer_alt.transform(tokenized_docs_n)
doc_matrix_terms_alt_n

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 228 stored elements and shape (19, 383)>
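
The new matrix keeps 383 columns because transform reuses the vocabulary learned by fit_transform on the first batch. A minimal sketch of this behaviour in plain Python follows; the helper names fit_vocabulary and transform_counts and the documents are hypothetical, and the sketch assumes that terms absent from the fitted vocabulary are simply ignored.

```python
def fit_vocabulary(tokenized_docs):
    """Map each unique term of the training documents to a column index."""
    terms = sorted({term for doc in tokenized_docs for term in doc})
    return {term: j for j, term in enumerate(terms)}

def transform_counts(tokenized_docs, vocabulary):
    """Count terms using a fixed vocabulary; unseen terms are skipped."""
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for term in doc:
            if term in vocabulary:
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

train_docs = [["story", "bar"], ["advice"]]
new_docs = [["bar", "bar", "motorcycle"]]   # "motorcycle" was never seen in training

vocabulary = fit_vocabulary(train_docs)     # {'advice': 0, 'bar': 1, 'story': 2}
new_matrix = transform_counts(new_docs, vocabulary)
print(new_matrix)  # [[0, 2, 0]]
```

The number of columns is fixed by the training batch, so matrices built from different batches are directly comparable.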

## Another example of tokenization and vectorization

In [169]:
tokenized_docs = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"VERB"})) for doc in corpus[:20])

In [170]:
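#vectorizer = Vectorizer( tf_type="linear")
# min_df=5: keep terms that appear in at least 5 documents
# max_df=0.95: drop terms that appear in more than 95% of the documents
vectorizer = Vectorizer(tf_type="linear", idf_type="standard", min_df=5, max_df=0.95)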

In [165]:
vectorizer.weighting

'tf * log(n_docs / df) + 1'

In [172]:
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
doc_term_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 42 stored elements and shape (20, 6)>

In [173]:
print(doc_term_matrix[:20, vectorizer.vocabulary_terms["know"]].toarray())

[[2.38629436]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [2.38629436]
 [2.38629436]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [2.38629436]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [2.38629436]
 [0.        ]
 [0.        ]]
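
The value 2.38629436 can be checked by hand against the weighting string printed above, tf * log(n_docs / df) + 1. The check assumes tf = 1 in each document where "know" occurs, n_docs = 20 (the batch size) and df = 5 (the column has five non-zero entries). With tf = 1 the result is the same whether or not the + 1 is inside the product.

```python
import math

n_docs = 20   # documents in the fitted batch
df = 5        # documents containing "know" (five non-zero rows above)
tf = 1        # assumed: "know" is counted once per document

weight = tf * (math.log(n_docs / df) + 1)
print(round(weight, 8))  # 2.38629436
```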


In [174]:
# Note: corpus[21:41] selects 20 documents (indices 21 to 40); document 20 is again skipped
tokenized_docs = ((term.lemma_ for term in textacy.extract.words(doc,include_pos={"VERB"})) for doc in corpus[21:41])

In [175]:
doc_matrix_terms= vectorizer.transform(tokenized_docs)
doc_matrix_terms

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34 stored elements and shape (20, 6)>

In [177]:
print(doc_matrix_terms[:20, vectorizer.vocabulary_terms["know"]].toarray())

[[0.        ]
 [0.        ]
 [2.38629436]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [2.38629436]
 [2.38629436]
 [0.        ]
 [2.38629436]
 [2.38629436]
 [0.        ]
 [2.38629436]]
