<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Natural Language Processing: Intro and Preprocessing
              
</p>
</div>

Data Science Cohort Live NYC Nov 2022
<p>Phase 4: Topic 37</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
# Use this to install nltk if needed
# !pip install nltk

##using conda env
#conda activate <your_env>
#conda install -c anaconda nltk

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import string
import re
import numpy as np

In [None]:
# Use this to download the stopwords if you haven't already - only ever needs to be run once

nltk.download("stopwords")

#### Natural Language Processing (NLP)

- Machine learning tasks with unstructured free language text.

#### Supervised learning: training on labeled free text documents
- Build document classifiers

<center><img src = "Images/spamvsham.png" />
Spam filtration </center>

<center>
<img src = "Images/doc_classification.jpg" />
Document management systems for your business
</center>

- Using free text as input in regression.
    - e.g., free text reviews to predict restaurant quality 0-10
    - sentiment analysis (extremely displeased to ecstatic)
    
<img src = "Images/anton_ego.jpg" width = 450/>


<img src = "Images/ego_quote.jpg" width = 450/>
<center> Our algorithm predicts a 9.8 for Gusteau's. </center>

Based on text:
- Algorithm predicts Anton was extremely pleased.

<center><img src = "Images/sentiment_analysis.jpg" > Sentiment Analysis</center>

#### Unsupervised Learning 

- Topic modeling
    - learn topics from a collection of documents



<img src = "Images/topicmodels.png" width = 600 >

Many, many more types of NLP tasks.
- Just named a few.


Need to represent information in free text in a form useable by an ML model:
- i.e. vectorize/structure information inside body of documents
- create numeric representations of words, sentences, documents

Simple example: count vectorizer
<img src = "Images/vectorchart.png" >
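
To make the count vectorizer idea concrete, here is a minimal sketch in plain Python. The three documents are made up for illustration and the tokenization is a naive whitespace split; it is not the vectorizer used later in the course.

```python
from collections import Counter

# Toy corpus (hypothetical documents, already lowercased)
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the cat chased the dog",
]

# Naive whitespace tokenization, enough for this illustration
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique tokens across all documents
vocab = sorted(set(tok for doc in tokenized for tok in doc))

# One row per document, one column per vocabulary token
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[tok] for tok in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

Each row is the numeric representation of one document: the entry in a column is how often that vocabulary token appears in the document.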

Processing text is a multistep process:
- Text pre-processing
- Feature extraction (vectorization)

A simple NLP workflow:

<img src = "Images/text_feature_pipe.png" >

Many types of vectorization schemes exist that can be trained:

- But first: text data must be preprocessed.
- This is the first phase in the NLP pipeline
- **Essential**: helps in learning an effective vector representation.

#### Text Preprocessing
1. **Tokenization**
2. Normalization

## Tokenization 

In order to convert the texts into data suitable for machine learning, we need to break down the documents into smaller parts. 

The first step in doing that is **tokenization**.

Tokenization is the process of splitting documents into units of observation. We usually represent the tokens as __n-grams__, where n represents the number of consecutive words occurring in a document that we treat as a single unit. In the case of unigrams (one-word tokens), the sentence "David works here" would be tokenized into:

- "David", "works", "here";

If we want (also) to consider bigrams, we would (also) consider:

- "David works" and "works here".
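
The unigram/bigram example above can be sketched in a few lines of plain Python. The helper name `make_ngrams` is hypothetical; nltk also ships its own `nltk.util.ngrams`, which returns tuples rather than joined strings.

```python
def make_ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["David", "works", "here"]
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

print(unigrams)  # ['David', 'works', 'here']
print(bigrams)   # ['David works', 'works here']
```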

Tokenizing: cutting text into small semantic subunits (tokens).
<img src = "Images/tokenization.webp" >


Tokenization: language-specific splitting/contraction rules

Many NLP packages with excellent tokenizers (among other things):
- nltk
- spaCy
- gensim

Will use nltk: The Natural Language Toolkit
    
<center><img src = "Images/nltk_logo.png" width = 250></center>    

In [None]:
import nltk # the natural language toolkit

In [None]:
# need to download punkt to access better tokenization rules
# word_tokenize won't work without it
# (newer NLTK releases may ask for 'punkt_tab' instead; download that if prompted)
nltk.download('punkt') 


In [None]:
from nltk.tokenize import word_tokenize # nltk's gold standard word tokenizer
from nltk.tokenize import sent_tokenize # nltk's sentence tokenizer

In [None]:
import pandas as pd
satire_df = pd.read_csv('data/satire_nosatire.csv')

Predict whether an article is satire or real.

In [None]:
satire_df.head()

In [None]:
satire_df.info()

In [None]:
first_doc = satire_df['body'].iloc[0]
first_doc

Let's see what word tokenizer does:

In [None]:
print(word_tokenize(first_doc, language='english'))

Deals with splitting on whitespace, punctuation, and contractions.
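
To see why a dedicated tokenizer helps, compare two naive alternatives on a made-up sentence. This is only a sketch of the problem, not of how `word_tokenize` works internally.

```python
import re

text = "Don't panic, it's fine."

# Splitting on whitespace leaves punctuation glued to the words
whitespace_tokens = text.split()

# A simple regex separates runs of word characters from punctuation marks,
# but it also breaks contractions apart at the apostrophe
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Neither result is ideal: the first keeps `panic,` and `fine.` as tokens, the second shreds `Don't` into three pieces. `word_tokenize` instead applies Treebank-style rules that keep contraction parts such as `n't` together.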

In [None]:
first_doc

There are other more powerful tokenizers that can be dialect specific.

Can explore this later.

The sentence tokenizer
- sometimes want to chunk sentences before doing word tokenization.
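
Sentence splitting is harder than it looks. As a sketch (with a made-up sentence and a deliberately naive rule), splitting after every period trips over abbreviations:

```python
import re

text = "Dr. Smith arrived late. He left early."

# Naive rule: split on whitespace that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
# ['Dr.', 'Smith arrived late.', 'He left early.']
```

The abbreviation `Dr.` is wrongly treated as a sentence end. The punkt model behind `sent_tokenize` is designed to handle cases like this.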

In [None]:
sent_tokenize(first_doc)

Word tokenize each chunked sentence:

In [None]:
print([word_tokenize(sent) for sent in sent_tokenize(first_doc)])

List of lists: each sentence, word tokenized.

For our use case: 
- vectorizing documents in word-count vector
- word tokenization

- Word tokenize each document in collection of documents
- List of token lists for each document in collection: **corpus**
- Unique tokens in entire corpus: **dictionary**

In [None]:
corpus = [word_tokenize(doc) for doc in satire_df['body']]
print(corpus[0:4])

For purposes of understanding the dictionary/vocabulary:
- flattening corpus

In [None]:
import itertools
flattenedcorpus_tokens = pd.Series(list(itertools.chain(*corpus)))
print(flattenedcorpus_tokens.shape)

Dictionary, then, is unique values of tokens in corpus:

In [None]:
dictionary = pd.Series(
    flattenedcorpus_tokens.unique())
print(len(dictionary))

In [None]:
flattenedcorpus_tokens.value_counts()

Tokens in the dictionary become features for a token-frequency matrix.

<center><img src = "Images/vectorchart.png" ></center>

In this light, think about the dictionary:

- any problems?
- look at various types of tokens. Anything that you notice?

#### Problem 1

- 30,000 features: way too many. Curse of dimensionality.

#### Problem 2
- Want features to help us in classification task
- But many useless features: tokens too common in the English language.
    - punctuation
    - prepositions, articles, etc.: **stop words**

In [None]:
flattenedcorpus_tokens.value_counts()[0:20]

#### Problem 3

In [None]:
flattenedcorpus_tokens.isin(["warning"]).sum()

In [None]:
flattenedcorpus_tokens.isin(["Warning"]).sum()

Same exact word: just capitalized
- Shouldn't be independent feature.
- lowercase all of these.

#### Problem 4

In [None]:
flattenedcorpus_tokens.isin(["warns"]).sum()

In [None]:
flattenedcorpus_tokens.isin(["warned"]).sum()

In [None]:
flattenedcorpus_tokens.isin(["warn"]).sum()

All of these are treated as unique features:
- but are just variants of the same word
- need to normalize these in some way

#### Problem 5

Let's get the number of tokens with only one occurrence in the entire corpus:

In [None]:
num_one_occurence = (flattenedcorpus_tokens.
                     value_counts() == 1).sum()
num_one_occurence

~ 1/3 of tokens only appear **once**!

- Rare tokens are not useful to keep around.
- Not useful in building relationship between features and target.

#### Problem 6

Many of these tokens are numbers: 
- don't have semantic meaning that will aid in classification

In [None]:
dictionary[dictionary.str.isnumeric()]

#### Addressing these problems step-by-step

- Lower casing, removing punctuation, and stop words.
- Keep only alphabetic tokens (drop numbers)

In [None]:
# imports package with many stopword lists
from nltk.corpus import stopwords

# get common stop words in english that we'll remove during tokenization/text normalization
stop_words = stopwords.words('english')
print(stop_words[0:5])

Create a simple helper function:

In [None]:
def first_step_normalizer(doc):
    # keeps alphabetic tokens only (no punctuation or numbers) and filters out stop words
    # the stop word list is lowercase, so compare the lowercased token; then lower case all tokens
    norm_text = [x.lower() for x in word_tokenize(doc) if x.isalpha() and x.lower() not in stop_words]
    return norm_text
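
NLTK's English stop word list is all lowercase, so the stop-word test needs to see the lowercased token. The sketch below uses a tiny made-up stop list (`toy_stop_words`) and made-up tokens to show what happens when the original casing is tested instead.

```python
# Small hypothetical stop word list, lowercase like nltk's English list
toy_stop_words = {"the", "a", "of"}
tokens = ["The", "mayor", "of", "the", "town"]

# Test the original casing, lowercase afterwards: capitalized "The" slips through
filter_then_lower = [t.lower() for t in tokens if t not in toy_stop_words]

# Test the lowercased token against the list instead
lower_then_filter = [t.lower() for t in tokens if t.lower() not in toy_stop_words]

print(filter_then_lower)  # ['the', 'mayor', 'town']
print(lower_then_filter)  # ['mayor', 'town']
```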

In [None]:
satire_df['tok_norm'] = satire_df['body'].apply(first_step_normalizer)
satire_df.head()

In [None]:
print(satire_df['tok_norm'].iloc[0])

In [None]:
norm_toks_flattened = pd.Series(list(
    itertools.chain(*satire_df['tok_norm'])))
new_dictionary = norm_toks_flattened.unique()
print(len(new_dictionary))

Process removed roughly 7,000 features from the dictionary.

In [None]:
print(len(dictionary))

#### Text Preprocessing
1. Tokenization
2. **Normalization**

#### Next step: stemming/lemmatizing
- Converting variants of the same word to a base form or root

Stemmers consolidate similar words by chopping off the ends of the words.
<center><img src = "Images/stemmer.png" width = 200> Stem isn't always a word.</center>
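
To illustrate what "chopping off the ends" means, here is a toy suffix stripper. It is a deliberately crude sketch with a made-up suffix list, not the Porter, Snowball, or Lancaster algorithm, which use many more rules and conditions.

```python
# Hypothetical suffix list, checked in this order; real stemmers use many more rules
SUFFIXES = ["ing", "ed", "es", "s"]

def toy_stem(token, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token

for word in ["warns", "warned", "warning", "leaves", "was"]:
    print(word, "->", toy_stem(word))
```

The three variants of *warn* collapse to the same stem, while *leaves* becomes `leav`, which is not a word.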


Different stemming algorithms (in order of increasing aggression):
- Porter stemmer
- Snowball stemmer (an improved Porter, also called Porter2: faster, slightly more aggressive, smarter)
- Lancaster stemmer (most aggressive, **ultrafast**)




<img src = "Images/stemmers.jpg" >

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [None]:
p_stemmer = PorterStemmer()
s_stemmer = SnowballStemmer(language="english")
l_stemmer = LancasterStemmer()

Running a Porter stemmer on a document

In [None]:
sample_doc = satire_df['tok_norm'].iloc[0]
print(sample_doc)

.stem(token) method

In [None]:
port_stemmed_doc  = [p_stemmer.stem(token) 
                     for token in sample_doc]
print(port_stemmed_doc)

Compare Porter and Snowball stemmer on a document

In [None]:
print(port_stemmed_doc)

In [None]:
snowball_stemmed_doc  = [s_stemmer.stem(token) 
                     for token in sample_doc]
print(snowball_stemmed_doc)

Nearly identical results. Snowball is generally faster. Often also better.

Marked difference in results between Porter/Snowball vs. Lancaster

In [None]:
print(snowball_stemmed_doc)

In [None]:
lancaster_stemmed_doc  = [l_stemmer.stem(token) 
                     for token in sample_doc]

print(lancaster_stemmed_doc)

#### Advantages/Disadvantages of stemming:
- Uses simple, **fast** rule-based (suffix-stripping) algorithms to normalize word variants
- Stems not always words
- Can produce base forms that are pretty weird/merge different words

#### Lemmatization

- Another way to convert inflections of word to a base form 
- Not simply cutting to word root

Changes to word *lemma*:
- is, was, were $\rightarrow$ be
- has, having, had $\rightarrow$ have
- leafs, leaves $\rightarrow$ leaf

This enhanced ability comes at a small cost:

- Requires part of speech (POS) information
- due to possible ambiguities in form

Example:
- *leaves* (verb or noun)
- *leaves* (noun) $\rightarrow$ leaf
- *leaves* (verb) $\rightarrow$ leave
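
The role of the POS tag can be sketched with a tiny hand-built lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and purely illustrative; WordNet's lemmatizer relies on a far larger lexical database.

```python
# Tiny hand-built lookup keyed on (word, part of speech); purely illustrative
TOY_LEMMAS = {
    ("leaves", "noun"): "leaf",
    ("leaves", "verb"): "leave",
    ("was", "verb"): "be",
    ("had", "verb"): "have",
}

def toy_lemmatize(token, pos):
    """Return the lemma for (token, pos), or the token itself if unknown."""
    return TOY_LEMMAS.get((token, pos), token)

print(toy_lemmatize("leaves", "noun"))  # leaf
print(toy_lemmatize("leaves", "verb"))  # leave
```

The same surface form maps to different lemmas, so the lookup cannot be resolved without the part of speech.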

nltk has implementation of the WordNet Lemmatizer:
- links into WordNet
- the mother of all semantic/lexical databases
- stores library of contextual word relationships, POS tagging, etc.
- *excellent* for rule-based document parsing

<img src = "Images/wordnet.webp" >
<center><a href = "https://wordnet.princeton.edu/" >Princeton's WordNet</a> </center>

In [None]:
from nltk import WordNetLemmatizer # lemmatizer using WordNet
from nltk.corpus import wordnet # imports WordNet
from nltk import pos_tag # nltk's native part of speech tagging

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
#nltk.download('all')

Part of Speech (POS) Tagging

- identify parts of speech of each token from ordered list of tokens.

In [None]:
sent_string = "The dog licked the babies in the face."
sent_tok_list = word_tokenize(sent_string)

In [None]:
sent_tok_list

In [None]:
pos_tag(sent_tok_list)

<a href = "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">List of Penn Treebank POS tags (the tag set NLTK's tagger uses)</a>

Use POS tagging in lemmatizer, but:
- WordNet has different POS tagging system.
- Helper function to convert (reuse this code)

In [None]:
# helper function to change nltk's part of speech tagging to a wordnet format.
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None

Let's see this tagging in action

In [None]:
# document to list of tuples with tokens and POS tags in nltk format
# converts to wordnet format

wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tag(sample_doc))) 
print(wordnet_tagged)

This format can be inputted directly into WordNet lemmatizer.

- Instantiate lemmatizer object:
- WordNetLemmatizer()
- has method .lemmatize()

In [None]:
wnl = WordNetLemmatizer()
doc_lemmatized = [wnl.lemmatize(token, pos) for token, pos in wordnet_tagged if pos is not None]
print(doc_lemmatized)

Compare original tokens and lemmatized tokens

In [None]:
print(sample_doc)

In [None]:
print(doc_lemmatized)

Compare snowball stemmer and lemmatization

In [None]:
print(snowball_stemmed_doc)

In [None]:
print(doc_lemmatized)

Lemmatization: 
- far superior to stemming in terms of semantic text normalization
- but need good POS tagging.
- slower than stemming: issue for processing large amounts of text

Applying lemmatizer to corpus
- useful to wrap all preprocessing steps/necessary subroutines into one function


In [None]:
# takes in untokenized document and returns fully normalized token list
def process_doc(doc):

    #initialize lemmatizer
    wnl = WordNetLemmatizer()

    # helper function to change nltk's part of speech tagging to a wordnet format.
    def pos_tagger(nltk_tag):
        if nltk_tag.startswith('J'):
            return wordnet.ADJ
        elif nltk_tag.startswith('V'):
            return wordnet.VERB
        elif nltk_tag.startswith('N'):
            return wordnet.NOUN
        elif nltk_tag.startswith('R'):
            return wordnet.ADV
        else:         
            return None
        
    # remove stop words and punctuation, then lower case
    # (the stop word list is lowercase, so compare the lowercased token)
    doc_norm = [tok.lower() for tok in word_tokenize(doc) if tok.isalpha() and tok.lower() not in stop_words]

    #  POS detection on the result will be important in telling Wordnet's lemmatizer how to lemmatize
    
    # creates list of tuples with tokens and POS tags in wordnet format
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tag(doc_norm))) 
    doc_norm = [wnl.lemmatize(token, pos) for token, pos in wordnet_tagged if pos is not None]
    
    return doc_norm

In [None]:
print(process_doc(satire_df['body'].iloc[0]))

Apply text tokenization/normalization to whole body of documents

In [None]:
fully_normalized_corpus = satire_df['body'].apply(process_doc)

In [None]:
fully_normalized_corpus.head()

In [None]:
flattened_fully_norm = pd.Series(list(itertools.chain(*fully_normalized_corpus)))
len(flattened_fully_norm.unique())

Original dictionary length

In [None]:
print(len(dictionary))

Cleaning reduced the dictionary to around half its size:
- Normalized text appropriately
- Still not dealt with infrequent tokens
- Tokens too common but not in stop words list.
- Will do when vectorizing.
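
As a preview of what that pruning might look like, here is a sketch that filters a vocabulary by document frequency. The corpus and both thresholds are made up for illustration. scikit-learn's `CountVectorizer` exposes similar `min_df` and `max_df` parameters, though the vectorizing step of this course may handle it differently.

```python
from collections import Counter

# Hypothetical normalized corpus: one token list per document
toy_corpus = [
    ["mayor", "warn", "city", "flood"],
    ["city", "council", "vote", "budget"],
    ["mayor", "city", "budget", "plan"],
    ["city", "flood", "plan", "delay"],
]

# Document frequency: number of documents each token appears in
doc_freq = Counter(tok for doc in toy_corpus for tok in set(doc))

n_docs = len(toy_corpus)
min_df = 2      # assumed threshold: keep tokens seen in at least 2 documents
max_df = 0.75   # assumed threshold: drop tokens seen in more than 75% of documents

kept_vocab = sorted(
    tok for tok, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept_vocab)  # ['budget', 'flood', 'mayor', 'plan']
```

Tokens seen in a single document are dropped as too rare, and `city`, which appears in every document, is dropped as too common.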

Let's join each document's tokens into a single string and save to csv:

In [None]:
fnc_output = fully_normalized_corpus.apply(
    " ".join)

fnc_output.to_csv("data/satire_norm.csv")

In [None]:
fnc_output