https://github.com/keithgalli/pycon2020

## Basic text classifier - Bag of Words
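Bag of Words: this approach turns each sentence into a vector of word counts (CountVectorizer) and trains the model on those vectors.

Unigram - individual words by themselves.
Bigram - two consecutive words at a time.
    With binary=True each feature records only whether a term is present (0 or 1); ngram_range=(1, 2) includes both unigrams and bigrams.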


## Limitations
If a word that was not in the training set appears, the model has no feature for it, so the word contributes nothing and the prediction is left to chance. Example: using 'books' or 'story' instead of 'book' means the classifier does not know how to classify the sentence.
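
The limitation can be seen directly in the vectors. This is a small sketch with an illustrative two-sentence training set: a word seen during `fit` sets one feature, while unseen words produce an all-zero row.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_sentences = ["i love book", "this is a great book"]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(train_sentences)

# "book" was seen during fit; "books" and "story" were not.
seen = vectorizer.transform(["book"]).toarray()
unseen = vectorizer.transform(["books story"]).toarray()

print(seen.sum())    # number of known terms found in "book"
print(unseen.sum())  # unseen words leave the whole row at zero
```

With an all-zero row the classifier has nothing to separate the classes on, which is why the prediction for such input is arbitrary.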

In [13]:
%pip install -U scikit-learn
%pip install -U spacy
!python -m spacy download en_core_web_md


Note: you may need to restart the kernel to use updated packages.

In [14]:
class Category:
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"

train_x = ["i love book", "this is a great book", "the fit is great", "i love the shoes"]
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorizer = CountVectorizer(binary=True, ngram_range=(1,2))
vectorizer = CountVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x)

print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())

['book' 'fit' 'great' 'is' 'love' 'shoes' 'the' 'this']
[[1 0 0 0 1 0 0 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]


In [16]:
from sklearn import svm 
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(train_x_vectors, train_y)

In [17]:
test_x  = vectorizer.transform(['i love the story'])
clf_svm.predict(test_x)

array(['CLOTHING'], dtype='<U8')

## Advanced text classifier - Word Vectors

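Refer to spaCy. The large model can be downloaded with the command below; the cells in this section load the medium model `en_core_web_md` installed earlier.

```python -m spacy download en_core_web_lg```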

As the number of categories and the number of words per sentence increase, meaning can get lost because the sentence vector averages all of its word vectors together.
Word vectors are powerful but do not solve everything; there is a lot more to explore.
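
The averaging effect can be sketched with toy numbers. The 3-dimensional vectors below are invented for illustration (real spaCy vectors have hundreds of dimensions), and `sentence_vector` is a hypothetical helper that mimics how a document vector is formed by averaging token vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "book": np.array([1.0, 0.0, 0.0]),
    "shoes": np.array([0.0, 1.0, 0.0]),
    "the": np.array([0.1, 0.1, 0.1]),
    "and": np.array([0.1, 0.1, 0.1]),
}


def sentence_vector(tokens):
    """Average the token vectors into one sentence-level vector."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


short = sentence_vector(["book"])
mixed = sentence_vector(["the", "book", "and", "the", "shoes"])

# The longer, mixed sentence drifts away from the "book" direction.
print(cosine(short, toy_vectors["book"]))
print(cosine(mixed, toy_vectors["book"]))
```

The more words and topics a sentence mixes, the less its averaged vector resembles any single word in it.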

In [18]:
import spacy

nlp = spacy.load("en_core_web_md")

In [19]:
docs = [nlp(text) for text in train_x]
train_x_word_vectors = [x.vector for x in docs]

In [20]:
from sklearn import svm

clf_svm_wv = svm.SVC(kernel="linear")
clf_svm_wv.fit(train_x_word_vectors, train_y)

In [21]:
test_x = ["I love the book", "cloth", "books", "story", "earring", "kindle"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors = [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_word_vectors)

array(['BOOKS', 'CLOTHING', 'BOOKS', 'BOOKS', 'CLOTHING', 'BOOKS'],
      dtype='<U8')

## Regexes

Pattern matching on strings in Python. Example: matching phone numbers written as 123-123-1234, 555-555-5555, or +1(123)-123-1234.
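
As a sketch of how the three example formats could be matched, the pattern below allows an optional "+1" prefix and optional parentheses around the area code. The pattern and the name `phone_re` are assumptions made for illustration; it is deliberately permissive (for instance it does not check that parentheses are balanced).

```python
import re

# Optional "+1" country code, optional parentheses around the area code,
# then the 3-3-4 digit groups separated by hyphens.
phone_re = re.compile(r"(\+1)?\(?\d{3}\)?-\d{3}-\d{4}")

candidates = ["123-123-1234", "555-555-5555", "+1(123)-123-1234", "12-34-5678"]

# fullmatch() requires the whole string to fit the pattern.
valid = [c for c in candidates if phone_re.fullmatch(c)]
print(valid)
```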


In [22]:
import re

regexp = re.compile(r"read|story|book")

phrases = ["I liked that story", "the car treaded up the hill", "this hat is nice"]

matches = []
for phrase in phrases:
    # search() finds the pattern anywhere, so "read" also matches inside "treaded"
    if regexp.search(phrase):
        matches.append(phrase)

print(matches)

['I liked that story', 'the car treaded up the hill']
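
The match on "the car treaded up the hill" above happens because "read" occurs inside "treaded". One way to avoid this, sketched here, is to wrap the alternatives in `\b` word boundaries so only whole words match.

```python
import re

# \b asserts a word boundary on both sides of the alternation.
word_re = re.compile(r"\b(read|story|book)\b")

phrases = ["I liked that story", "the car treaded up the hill", "this hat is nice"]

matches = [phrase for phrase in phrases if word_re.search(phrase)]
print(matches)
```

Only the first phrase is kept, since "treaded" no longer counts as an occurrence of "read".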


## Stemming and Lemmatization

Techniques used to normalize words. Example: take 'books' and reduce it to its canonical form 'book'.

reading -> read
books -> book
stories -> 'stori' with stemming (rule-based suffix stripping), 'story' with lemmatization (dictionary lookup)
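
The difference between the two techniques can be sketched without any library. The suffix rules and the dictionary below are toy stand-ins invented for illustration; they are not the Porter algorithm or WordNet, only the same two ideas in miniature.

```python
# Toy suffix rules, invented for illustration; a real stemmer such as
# Porter applies a much larger, ordered rule set.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Toy dictionary standing in for a lexical resource such as WordNet.
LEMMA_DICT = {"stories": "story", "books": "book", "reading": "read"}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned as-is."""
    return LEMMA_DICT.get(word, word)


for w in ["reading", "books", "stories"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

Stemming is cheap and needs no vocabulary but can produce non-words like "stori"; lemmatization returns real words but only for entries its dictionary knows.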

In [23]:
%pip install -U nltk

Note: you may need to restart the kernel to use updated packages.


In [24]:
import nltk

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')


True

In [25]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

phrase = "reading the books."
words = word_tokenize(phrase)

stemmed_words = [stemmer.stem(word) for word in words]
" ".join(stemmed_words)

'read the book .'

In [26]:
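# lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

phrase = "reading the books."
words = word_tokenize(phrase)

# lemmatize() treats every word as a noun by default (pos="n"), so the verb
# "reading" is left unchanged; lemmatizer.lemmatize("reading", pos="v") gives "read".
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
" ".join(lemmatized_words)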

'reading the book .'

## Stopwords removal

The most common English words, which add little meaning to a sentence in the context of NLP.

Examples: this, that, he, it, etc.

In [27]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Use a separate name so the imported `stopwords` module is not shadowed;
# a set makes the membership test fast.
stop_words = set(stopwords.words("english"))

phrase = "Here is an example sentence demonstrating the removal of stopwords."
words = word_tokenize(phrase)

filtered_words = [word for word in words if word not in stop_words]
" ".join(filtered_words)

'Here example sentence demonstrating removal stopwords .'
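
The output above still contains "Here" because NLTK's English list is lowercase and the comparison is case-sensitive. A minimal sketch of the fix is to lowercase each token before the lookup. To stay self-contained it uses a small hand-written stop list and a regex tokenizer, both of which are stand-ins for the NLTK pieces used in the cell above.

```python
import re

# Small illustrative subset; NLTK's English list is much longer.
stop_words = {"here", "is", "an", "the", "of"}

phrase = "Here is an example sentence demonstrating the removal of stopwords."

# Simple tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", phrase)

case_sensitive = [t for t in tokens if t not in stop_words]
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(" ".join(case_sensitive))
print(" ".join(case_insensitive))
```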

## Various other techniques (Spell Correction, Sentiment, & POS tagging)

In [28]:
%pip install -U textblob
!python -m textblob.download_corpora


Finished.



In [29]:
from textblob import TextBlob

phrase = "i read the book, it was great. It would be good if it was written by me."

tb_phrase = TextBlob(phrase)

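tb_phrase.correct()   # returns a spell-corrected copy; tb_phrase itself is unchanged
tb_phrase.tags        # part-of-speech tags as (word, tag) pairs
tb_phrase.sentiment   # only this last expression is shown as the cell output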

Sentiment(polarity=0.75, subjectivity=0.675)

## Recurrent Neural Networks

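Word vectors are set in stone: a word gets the same vector no matter how it is used, so the representation does not depend on context. Recurrent neural networks process the sentence token by token and carry context forward from one token to the next.

Drawbacks:
Long-range dependencies are not always captured well.
"I need to go to the bank today so that I can make a deposit, I hope it's not closed" -> here the RNN may fail to relate "closed" to "bank" because the two words are far apart.

RNNs are sequential in nature and cannot be parallelized across the tokens of a sequence.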
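
The sequential constraint is visible in the recurrence itself. The sketch below is a bare-bones RNN cell in NumPy with random, untrained weights; the sizes and the names `W_x` and `W_h` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 5

# Random, untrained weights; a real model would learn these.
W_x = rng.normal(size=(hidden_size, input_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
inputs = rng.normal(size=(seq_len, input_size))

h = np.zeros(hidden_size)
states = []
for x_t in inputs:
    # h_t depends on h_{t-1}, so step t cannot start before step t-1 ends.
    h = np.tanh(W_x @ x_t + W_h @ h)
    states.append(h)

states = np.stack(states)
print(states.shape)  # one hidden state per time step
```

Because information from early tokens must survive every intermediate update of `h`, distant words such as "bank" and "closed" are harder to connect.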

## Attention is all you need.

Which tokens are particularly relevant and should be attended to?
"I need to go to the bank and write a check". Here "check" attends strongly to "write" and "bank", and "bank" is important for "go".

An attention network lets each token ask a question of the other tokens and learns, within the rules of the attention mechanism, which of them to draw on.
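
A minimal sketch of scaled dot-product attention in NumPy follows. The token embeddings are random stand-ins, and the same matrix is used for queries, keys and values to keep the example short; a real transformer applies learned projections to produce each of the three.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how strongly each token matches each other token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
tokens = ["I", "need", "to", "go", "to", "the", "bank"]
X = rng.normal(size=(len(tokens), 8))  # random stand-ins for token embeddings

output, weights = attention(X, X, X)
print(weights.shape)  # (tokens, tokens): one weight per pair of tokens
print(output.shape)   # (tokens, embedding size)
```

Every row of `weights` is computed independently of the others, so all tokens can be processed at once, unlike the step-by-step RNN update.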

## Transformer Architecture
BERT
GPT
ELMo (not a transformer architecture, by the way; it is built on bidirectional LSTMs)

In [30]:
%pip install -U spacy-transformers
!python -m spacy download en_core_web_trf



In [31]:
%pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [32]:
import spacy
import torch
import numpy


nlp = spacy.load("en_core_web_trf")
doc = nlp("Here is some text to encode.")

In [33]:
class Category:
    BOOKS = "BOOKS"
    BANK = "BANK"

train_x = ["good characters and plot progression", "check out that book", "good story, would recommend.", "novel recommendations", "need to make deposit", "balance inquiry savings", "save money"]
train_y = [Category.BOOKS, Category.BOOKS, Category.BOOKS, Category.BOOKS, Category.BANK, Category.BANK, Category.BANK]

In [35]:
from sklearn import svm

docs = [nlp(text) for text in train_x]
train_x_vectors = [doc.vector for doc in docs]
clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x = ["check this story out"]
docs = [nlp(text) for text in test_x]
test_x_vectors = [doc.vector for doc in docs]

clf_svm.predict(test_x_vectors)

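ValueError: Found array with 0 feature(s) (shape=(7, 0)) while a minimum of 1 is required by SVC.

Note: the error arises because `en_core_web_trf` ships without static word vectors, so `doc.vector` is empty for every document and the training matrix has shape (7, 0). The transformer output is stored on `doc._.trf_data` instead and would have to be pooled into a fixed-size vector before fitting the SVM; the medium model `en_core_web_md` used earlier does provide `doc.vector`.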