# Natural Language Processing

## Steps
* [tokenization](#Tokenization)
* [vectorization](#Vectorization)
* [TF-IDF](#TF-IDF)

# Tokenization
## start small

In [None]:
token_test = "Here is a sentence. Or two, I don't think there will be more."
token_test_2 = "i thought this sentence was good."
token_test_3 = "Here's a sentence... maybe two. Depending on how you like to count!"

In [None]:
# let's tokenize a document... into sentences
def make_sentences(doc):
    pass

make_sentences(token_test_3)
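
One possible way to fill in `make_sentences`, sketched with only the standard library. The name `make_sentences_sketch` and the regex rule are assumptions for illustration, not the intended answer: it splits on whitespace that follows `.`, `!`, or `?`, so the ellipsis in the third test case counts as a sentence break.

```python
import re

def make_sentences_sketch(doc):
    """Split a document on whitespace that follows ., ! or ?."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", doc.strip())
    return [part for part in parts if part]

sample = "Here's a sentence... maybe two. Depending on how you like to count!"
sentences = make_sentences_sketch(sample)
print(sentences)
```

A rule this simple will also split after abbreviations such as "Dr.", which is one reason to reach for a trained tokenizer later on.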

In [None]:
# let's tokenize a document into words
# with these 3 test cases what would you look out for?
def tokenize_it(doc):
    pass

tokenize_it(token_test)
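
A minimal sketch of a word tokenizer, assuming we lowercase everything, keep contractions such as "don't" as one token, and treat each punctuation mark as its own token. `tokenize_it_sketch` is a hypothetical name, and these choices are only one reasonable set of answers to "what would you look out for?".

```python
import re

def tokenize_it_sketch(doc):
    """Return lowercase word tokens and single punctuation marks."""
    # \w+(?:'\w+)* matches words, including internal apostrophes (don't).
    # [^\w\s] matches any single character that is neither word nor space.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", doc.lower())

tokens = tokenize_it_sketch("Here is a sentence. Or two, I don't think there will be more.")
print(tokens)
```

NLTK's `word_tokenize` makes different choices (for instance, it typically splits contractions into two tokens), so compare the output of this sketch with the cells below.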

# Before running the next cell, let's look at the NLTK tokenizers in action
* https://text-processing.com/demo/tokenize/

In [None]:
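# using the Natural Language Toolkit (NLTK) library to tokenize sentences
# if the Punkt tokenizer data is missing, run nltk.download('punkt') once
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk import sent_tokenize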

In [None]:
sent_tokenize(token_test)

In [None]:
# let's tokenize a document into words now
from nltk import word_tokenize

In [None]:
# how would I find out which tokenizer nltk is using?
print(word_tokenize(token_test))
print(word_tokenize(token_test_2))
print(word_tokenize(token_test_3))

## Intuitively how would we compare these 'documents'? 
By counting how many times each word appears in each document!
<br>
This is known as a **bag of words**


In [None]:
# this will take one sentence as tokens
def bag_o_words(bag):
    pass

bag_o_words(token_test)
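
One way `bag_o_words` could be filled in, sketched with `collections.Counter`. The name `bag_o_words_sketch` and the decision to lowercase tokens before counting are assumptions made for illustration.

```python
from collections import Counter

def bag_o_words_sketch(tokens):
    """Count how often each token appears, ignoring case."""
    return Counter(tok.lower() for tok in tokens)

bag = bag_o_words_sketch(["the", "machine", "trains", "The", "machine"])
print(bag)
```

Word order is thrown away here: only the counts survive, which is exactly what "bag" means.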

## problems with comparing two documents?
Yes, of course!

In [None]:
# write some potential problems here

### Stop words

In [None]:
# if the stop word lists are missing, run nltk.download('stopwords') once
from nltk.corpus import stopwords
print(stopwords.words('english')[:20])

In [None]:
# stop word lists should be tailored to each corpus/project; start from NLTK's English list
my_stopwords = set(stopwords.words('english'))

In [None]:
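# take the stop words out of 'token_test'
# the NLTK list is lowercase, so lowercase each token before checking (otherwise 'Here' slips through)
[x for x in word_tokenize(token_test) if x.lower() not in my_stopwords]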

### Now that stop words are out of the way, check out stems and lemmas in action
* https://text-processing.com/demo/stem/

In [None]:
from nltk.stem import LancasterStemmer, SnowballStemmer, RegexpStemmer, WordNetLemmatizer 

In [None]:
stem_sentence = """when data scientists are performing natural language processing analysis, they must take\
 different verb tenses and singular versus plural words into account."""

In [None]:
snowball = SnowballStemmer('english')

In [None]:
# function to get stems and lemmas
# fill in comments
def stem_words(document,stemmer):
    #
    toks = word_tokenize(document)
    wrd_list = []
    #
    for word in toks:
        #
        wrd_list.append(stemmer.stem(word))
    #
    return " ".join(wrd_list)

In [None]:
stem_words(stem_sentence,snowball)

In [None]:
lancaster = LancasterStemmer()

In [None]:
stem_words(stem_sentence,lancaster)

In [None]:
regex_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [None]:
stem_words(stem_sentence,regex_stemmer)
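
To see what a pattern like `'ing$|s$|e$|able$'` with `min=4` does to individual words, here is a standard-library sketch of the same idea. It mirrors my understanding of how a regex stemmer behaves (leave short words alone, otherwise delete whatever matches); `regexp_stem_sketch` is a hypothetical helper, not NLTK's implementation.

```python
import re

def regexp_stem_sketch(word, pattern=r"ing$|s$|e$|able$", min_len=4):
    """Strip a matching suffix from words that have at least min_len characters."""
    if len(word) < min_len:
        return word  # too short: leave untouched
    return re.sub(pattern, "", word)

stems = [regexp_stem_sketch(w) for w in ["processing", "words", "take", "are", "data"]]
print(stems)
```

Note that "take" loses its final "e" even though "tak" is not a word: stems only need to be consistent, not readable.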

In [None]:
lemma = WordNetLemmatizer()

In [None]:
# function for lemmas
def lem_words(document,lemmer):
    toks = word_tokenize(document)
    wrd_list = []
    for word in toks:
        wrd_list.append(lemmer.lemmatize(word))
    return " ".join(wrd_list)

In [None]:
# test it out
lemma.lemmatize('things')

In [None]:
lem_words(stem_sentence,lemma)

# Vectorization
## this step happens after we account for stopwords and lemmas; depending on the library...
* we make a **Count Vector**, which is the formal term for a **bag of words**
* we use vectors to pass text into machine learning models
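
The idea behind a count vector can be sketched by hand before we use a library. This example assumes lowercased tokens of two or more word characters and an alphabetically sorted vocabulary; `count_vector_sketch` is a hypothetical helper written for this notebook.

```python
import re

def count_vector_sketch(docs):
    """Build a shared vocabulary and one count vector per document."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            row[index[tok]] += 1  # one column per vocabulary term
        vectors.append(row)
    return vocab, vectors

vocab, vectors = count_vector_sketch(["the machine trains", "the machine learns the data"])
print(vocab)
print(vectors)
```

Every document ends up with the same number of columns, which is what lets a model treat text as ordinary numeric features.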


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Let's check out the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

In [None]:
# try the CountVectorizer class on 'basic_example'
basic_example = ['The Data Scientist wants to train a machine to train machine learning models.']
cv = CountVectorizer()
cv.fit(basic_example)

In [None]:
# what info can we get from cv?
# hint -- look at the docs again

## Vectorization allows us to compare two documents

In [None]:
# use pandas to help see what's happening
import pandas as pd

In [None]:
# we fit the CountVectorizer on the 'basic_example', now we transform 'basic_example'
example_vector_doc_1 = cv.transform(basic_example)

In [None]:
# # what is the type 

# print(type(example_vector_doc_1))

# # what does it look like

# print(example_vector_doc_1)

In [None]:
# let's visualize it
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
example_vector_df = pd.DataFrame(example_vector_doc_1.toarray(), columns=cv.get_feature_names_out())
example_vector_df

In [None]:
# # here we compare new text to the CountVectorizer fit on 'basic_example'
# new_text = ['the data scientist plotted the residual error of her model']
# new_data = cv.transform(new_text)
# new_count = pd.DataFrame(new_data.toarray(),columns=cv.get_feature_names_out())
# new_count
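
What happens to words the vectorizer never saw during fitting? The sketch below assumes a fixed vocabulary taken from 'basic_example' (lowercased tokens of two or more characters) and simply ignores anything outside it. `transform_sketch` and `fitted_vocab` are hypothetical names used only for this illustration.

```python
import re

def transform_sketch(doc, vocab):
    """Count only the tokens that already exist in a fixed vocabulary."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = [0] * len(vocab)
    for tok in re.findall(r"\b\w\w+\b", doc.lower()):
        if tok in index:  # unseen words are silently dropped
            counts[index[tok]] += 1
    return counts

fitted_vocab = ["data", "learning", "machine", "models", "scientist",
                "the", "to", "train", "wants"]
new_counts = transform_sketch(
    "the data scientist plotted the residual error of her model", fitted_vocab
)
print(dict(zip(fitted_vocab, new_counts)))
```

"model" is not counted because only "models" is in the vocabulary, which is a good argument for stemming or lemmatizing before vectorizing.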

## N-grams

In [None]:
# here the object 'sentences' becomes the corpus
sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
'the data scientist plotted the residual error of her model in her analysis',
'Her analysis was so good, she won a Kaggle competition.',
'The machine gained sentience']

In [None]:
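# go back to the docs for CountVectorizer: how would we use n-grams?
# pro tip -- include stop words
# one option (an assumption, not the only answer): ngram_range=(2, 2) keeps only bigrams
bigrams = CountVectorizer(ngram_range=(2, 2))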

In [None]:
bigram_vector = bigrams.fit_transform(sentences)
bigram_vector

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(f'There are {len(bigrams.get_feature_names_out())} features for this corpus')
bigrams.get_feature_names_out()[:10]

In [None]:
# let's visualize it
bigram_df = pd.DataFrame(bigram_vector.toarray(), columns=bigrams.get_feature_names_out())
bigram_df.head()
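
An n-gram is just a sliding window of `n` consecutive tokens. Here is a minimal standard-library sketch, assuming the text is already tokenized; `ngrams_sketch` is a hypothetical helper, not part of scikit-learn.

```python
def ngrams_sketch(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigram_list = ngrams_sketch("the machine gained sentience".split(), n=2)
print(bigram_list)
```

This also shows why stop words matter for n-grams: removing "the" first would erase the bigram "the machine" entirely.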

# TF-IDF
## Term Frequency - Inverse Document Frequency
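
TF-IDF weights a term's count in a document by how rare the term is across the corpus. The sketch below computes it by hand under stated assumptions: a smoothed IDF, `ln((1 + n_docs) / (1 + df)) + 1`, followed by L2 normalization of each row. I believe this matches scikit-learn's default settings, but treat that as an assumption and check the docs. `tfidf_sketch` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_sketch(tokenized_docs):
    """Return (vocab, rows) where rows holds L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab_tfidf, weights = tfidf_sketch([
    ["machine", "learning"],
    ["machine", "gained", "sentience"],
])
print(vocab_tfidf)
print(weights)
```

"machine" appears in both documents, so it gets a lower weight than "learning" in the first row even though both occur once there.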

In [None]:
tf_idf_sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
'the data scientist plotted the residual error of her model in her analysis',
'Her analysis was so good, she won a Kaggle competition.',
'The machine gained sentience']
# take out stop words
tfidf = TfidfVectorizer(stop_words='english')
# fit transform the sentences
tfidf_sentences = tfidf.fit_transform(tf_idf_sentences)

In [None]:
# visualize it
tfidf_df = pd.DataFrame(tfidf_sentences.toarray(), columns=tfidf.get_feature_names_out())

In [None]:
tfidf_df

In [None]:
# compared to bigrams
bigram_df

In [None]:
# now let's test out our TfidfVectorizer
# none of these words are in the fitted vocabulary, so expect rows of zeros
test_tdidf = tfidf.transform(['this is a test document','look at me I am a test document'])

In [None]:
# this is a sparse matrix with one row per test document
test_tdidf

In [None]:
test_tfidf_df = pd.DataFrame(test_tdidf.toarray(), columns=tfidf.get_feature_names_out())
test_tfidf_df