<a href="https://colab.research.google.com/github/weibb123/NLP_tutorial/blob/main/NLP_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm


In [None]:
!pip install spacy
!python -m spacy download en_core_web_md

# Bag of Words
Turn arbitrary text into fixed-length vectors by counting how many times each word appears.

In [6]:
# Build a text classifier using Bag of Words features and an SVM
class Category:
  BOOKS = "BOOKS"
  CLOTHING = "CLOTHING"

train_x = ["i love the book", 'this is a great book', 'the fit is great', 'i love the shoes']
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

In [7]:
vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

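# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())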
print(train_x_vectors.toarray())

['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]
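The vocabulary above has no 'i' or 'a' because CountVectorizer, by default, lowercases text and keeps only tokens of two or more word characters. As a rough illustration of what the counting step does, here is a minimal pure-Python sketch. The helper names (`tokenize`, `build_vocabulary`, `to_count_vector`) are made up for this example and are not part of scikit-learn.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default tokenization (tokens of 2+ word characters)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(texts):
    """Collect every distinct token and sort it alphabetically."""
    return sorted({token for text in texts for token in tokenize(text)})

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]
vocabulary = build_vocabulary(train_x)
vectors = [to_count_vector(text, vocabulary) for text in train_x]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every sentence ends up as a vector of the same length (one slot per vocabulary word), which is what lets a classifier such as an SVM consume it.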




In [8]:
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [9]:
test_x = vectorizer.transform(['i like the book'])

clf_svm.predict(test_x)

array(['BOOKS'], dtype='<U8')

In [10]:
# Issue with this model
test_x = vectorizer.transform(['i like the story'])

clf_svm.predict(test_x)

array(['CLOTHING'], dtype='<U8')

<b>Bag of Words works well on words it saw during training, but it fails badly on words it has never seen.</b>
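To see why the second prediction goes wrong, it helps to check which words of each test sentence the model actually knows. The sketch below is a simplified stand-in for the vectorizer (plain whitespace splitting, words of two or more characters); `split_known_unknown` is a hypothetical helper written for this illustration.

```python
train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

# Words of two or more characters that appear in the training sentences
vocabulary = {word for text in train_x for word in text.split() if len(word) > 1}

def split_known_unknown(text):
    """Separate the words the model has seen from the ones it has not."""
    words = [w for w in text.lower().split() if len(w) > 1]
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w not in vocabulary]
    return known, unknown

print(split_known_unknown("i like the book"))   # (['the', 'book'], ['like'])
print(split_known_unknown("i like the story"))  # (['the'], ['like', 'story'])
```

For "i like the story" the only known word is "the", which says nothing about books or clothing, so the classifier has no useful signal to work with.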

# Word Vectors

Convert text into numerical vectors and map them onto a vector space.

Words that are related to each other sit closer together in the vector space.\
Ex. "I like apples" and "I like fruits" are close together.

**Captures semantics**
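"Closer together" is usually measured with cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors purely for illustration; the numbers are invented, and real models such as en_core_web_md use hundreds of dimensions learned from data.

```python
import math

# Invented toy vectors, for illustration only
toy_vectors = {
    "apple": [0.9, 0.8, 0.1],
    "fruit": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

apple_fruit = cosine_similarity(toy_vectors["apple"], toy_vectors["fruit"])
apple_car = cosine_similarity(toy_vectors["apple"], toy_vectors["car"])

print(round(apple_fruit, 2))
print(round(apple_car, 2))
```

With these toy numbers "apple" scores much higher against "fruit" than against "car". In spaCy, `doc1.similarity(doc2)` computes a comparable score from the real vectors.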

In [1]:
import spacy

nlp = spacy.load('en_core_web_md') # Word vector space model from spacy

In [11]:
print(train_x)

['i love the book', 'this is a great book', 'the fit is great', 'i love the shoes']


In [14]:
docs = [nlp(text) for text in train_x]
train_x_word_vectors = [x.vector for x in docs]

In [16]:
clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_y)

SVC(kernel='linear')

In [18]:
# Predict using the word-vector model
test_x = ["i love the story"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors = [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_word_vectors)

# This handles the unseen word that Bag of Words could not

array(['BOOKS'], dtype='<U8')

**Disadvantage: averaged word vectors do not work well with long text or with many classes**

**They also struggle when a word has different meanings depending on the sentence**
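One way to picture the long-text problem: spaCy's `Doc.vector` is, by default, the average of the token vectors, so every extra filler word pulls the average away from the words that matter. The sketch below shows that dilution with invented 2-dimensional vectors (one axis standing for "topic", the other for "filler"); the values are assumptions made up for the example.

```python
import math

# Invented toy vectors: [topic axis, filler axis]
toy_vectors = {
    "book": [1.0, 0.0],
    "this": [0.0, 1.0],
    "is":   [0.0, 1.0],
    "a":    [0.0, 1.0],
}

def average_vector(words):
    """Average the word vectors component by component."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

short_doc = average_vector(["book"])
long_doc = average_vector(["this", "is", "a", "book"])

print(cosine_similarity(short_doc, toy_vectors["book"]))           # 1.0
print(round(cosine_similarity(long_doc, toy_vectors["book"]), 3))  # 0.316
```

The longer the text, the more the one informative word gets averaged away.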

# Regexes

Pattern matching on strings in Python

Useful for password checkers, phone numbers, emails, and more

Example phone number formats: 123-123-1234, 555-555-5555, +1-(123)-123-1234
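As a rough sketch, a single pattern can accept phone numbers written in these shapes. The pattern below is one possible choice, assuming an optional `+1-` prefix and optional parentheses around the area code; it is not a complete phone number validator (for example, it does not check that the parentheses are balanced).

```python
import re

# Assumed format: optional "+1-", optional parentheses around the area code,
# then 3 digits, 3 digits, 4 digits separated by hyphens
PHONE_PATTERN = re.compile(r"(\+1-)?\(?\d{3}\)?-\d{3}-\d{4}")

def is_phone_number(text):
    """Return True if the whole string matches the pattern."""
    return PHONE_PATTERN.fullmatch(text) is not None

candidates = ["123-123-1234", "555-555-5555", "+1-(123)-123-1234", "12-123-1234", "call me"]
for candidate in candidates:
    print(candidate, is_phone_number(candidate))
```

`fullmatch` is used instead of `search` so that extra characters before or after the number cause a rejection.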

In [25]:
import re

# \b marks a word boundary, so "story" matches but "history" does not, and "read" matches but "tread" does not
regexp = re.compile(r"\bread\b|\bstory\b|book")

phrases = ['I liked that history', 'i like that book', 'this hat is nice', 'the car tread up the hill']

# Keep only the phrases that contain a match
matches = [phrase for phrase in phrases if regexp.search(phrase)]

print(matches)


['i like that book']


# Stemming/Lemmatization

Both techniques normalize text by reducing words to a base form.

reading -> read\
Books -> book\
Stories -> stori (stemming) or story (lemmatization)
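The difference comes from how each technique works: a stemmer chops suffixes by rule, so the result need not be a real word, while a lemmatizer looks the word up and returns a dictionary form. The sketch below is a deliberately tiny toy to show that contrast. It is not the Porter algorithm, and both the suffix rules and the `TOY_LEMMAS` table are invented for this example.

```python
# Toy suffix rules: (suffix to strip, replacement). Far simpler than Porter's rules.
TOY_SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("s", ""))

# Hypothetical mini dictionary standing in for a real lexicon such as WordNet
TOY_LEMMAS = {"reading": "read", "books": "book", "stories": "story"}

def toy_stem(word):
    """Strip the first matching suffix, provided at least 3 letters remain."""
    word = word.lower()
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word if it is unknown."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

for word in ["reading", "Books", "Stories"]:
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

The rule-based version produces "stori" because it only rewrites the ending, while the lookup returns the real word "story".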

In [27]:
import nltk

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [29]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

phrase = 'reading the books'
words = word_tokenize(phrase)

stemmed_words = []
for i in words:
  stemmed_words.append(stemmer.stem(i))

" ".join(stemmed_words)

'read the book'

Stemming turns "reading the books" into "read the book".

In [30]:
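# Try lemmatization instead.
# lemmatize() treats every word as a noun by default (pos='n'), so 'reading' is left unchanged;
# lemmatizer.lemmatize('reading', pos='v') would return 'read'.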
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

phrase = 'reading the books'
words = word_tokenize(phrase)

lemmatized_words = []
for i in words:
  lemmatized_words.append(lemmatizer.lemmatize(i))

" ".join(lemmatized_words)

'reading the book'

# Stopword Removal
Remove the most common words from a sentence.

This can help the word-vector model, because the averaged vector is then driven by the words that carry meaning.

In [32]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english') # common words in english

phrase = 'Here is an example sentence demonstrating the removal of stopwords'

words = word_tokenize(phrase)

stripped_phrase = []
for i in words:
  # The stop word list is all lowercase, so compare in lowercase; otherwise "Here" slips through
  if i.lower() not in stop_words:
    stripped_phrase.append(i)

" ".join(stripped_phrase)


'example sentence demonstrating removal stopwords'

## Various other techniques (spell correction, sentiment, & part-of-speech tagging)

In [None]:
!python -m textblob.download_corpora

In [41]:
from textblob import TextBlob

phrase = 'i hate reading this book'

tb_phrase = TextBlob(phrase)
print(tb_phrase.correct())

print(tb_phrase.tags)

print(tb_phrase.sentiment)

i hate reading this book
[('i', 'JJ'), ('hate', 'VBP'), ('reading', 'VBG'), ('this', 'DT'), ('book', 'NN')]
Sentiment(polarity=-0.8, subjectivity=0.9)


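JJ -> adjective\
VBP -> verb, non-3rd person singular present\
VBG -> verb, gerund or present participle\
DT -> determiner\
NN -> singular noun\
VBD -> verb, past tense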


# Transformer

Feed in a phrase. As the model processes each token, attention lets it work out which other tokens in the phrase deserve extra weight.
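The core of that idea is scaled dot-product self-attention: each token scores every token in the phrase, the scores are turned into weights with a softmax, and the token's new representation is the weighted mix. The sketch below is a stripped-down illustration with invented 2-dimensional token vectors. It leaves out the learned query/key/value projections and the multiple heads that a real model such as BERT uses.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """For each token, weight all tokens by similarity and mix their vectors."""
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for query in vectors:
        # Score this token against every token, scaled by sqrt(dimension)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        all_weights.append(weights)
        outputs.append(mixed)
    return all_weights, outputs

# Invented toy vectors for a three-token phrase
tokens = ["here", "is", "text"]
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attention_weights, new_vectors = self_attention(token_vectors)
for token, weights in zip(tokens, attention_weights):
    print(token, [round(w, 2) for w in weights])
```

Each printed row shows how much attention one token pays to every token in the phrase, including itself.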

In [None]:
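# Note: this model name comes from spaCy 2.x with spacy-transformers 0.x;
# spaCy 3.x ships its transformer pipeline as en_core_web_trf instead.
!pip install spacy-transformers
!python -m spacy download en_trf_bertbaseuncased_lg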

In [1]:
import spacy
import torch

nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc = nlp("Here is some text to encode.")