# Section 39: Foundations of Natural Language Processing

- xx/xx/xx
- my-best-lesson-plans

## Learning Objectives

- Introduce the field of Natural Language Processing
- Learn about the extensive preprocessing involved with text data
- Walk through text classification - Finding Trump 


## Questions

### Questions from Gdoc

### Additional Questions?


# Natural Language Processing

> **_Natural Language Processing_**, or **_NLP_**, is the study of how computers can interact with humans through the use of human language.  Although this is a field that is quite important to Data Scientists, it does not belong to Data Science alone.  NLP has been around for quite a while, and sits at the intersection of *Computer Science*, *Artificial Intelligence*, *Linguistics*, and *Information Theory*. 

## Where is NLP Used?


- Reviews (e.g., Amazon)
- Stock market trading
- AI Assistants
- Spam Detection

- **Demonstration:**
    - [Google Duplex AI Assistant](https://youtu.be/D5VN56jQMWM)

# Working with Text Data

## Preprocessing

**Preparing text data requires more preprocessing than typical numeric or tabular data.**
1. We must remove or normalize things like:
    - punctuation
    - numbers
    - differences between upper and lowercase letters
    
    
2. It is generally recommended to go a step beyond this and remove **commonly used words that contain little information (called "stopwords")** for our machine learning algorithms. Words like: the, was, he, she, it, etc.


3. Additionally, most analyses **need the text tokenized** into a list of words rather than left in natural sentence form. The result is a list of words (**tokens**) separated by "`,`", which tells the algorithm what should be considered one word.


4. While not always required, it is often a good idea to reduce similar words down to a shared core.
There are often **multiple variants of the same word with the same/similar meaning**,<br> where one may be plural **(e.g., "democrat" and "democrats")**, or the form of the word differs **(e.g., "run" and "running").**<br> Simplifying words down to a basic core (the word *stem*) is referred to as **"stemming"**. <br><br> A more advanced form of this also recognizes words that are simply in a **different tense**, such as **"ran", "run", "running"**. This process is called **"lemmatization"**, where the words are reduced to their simplest form, called "**lemmas**".<br>  
    - Stemming<br><img src="https://raw.githubusercontent.com/learn-co-students/dsc-nlp-and-word-vectorization-online-ds-ft-100719/master/images/new_stemming.png" width=40%>
    - Lemmatization
    
|   Word   |  Stem | Lemma |
|:--------:|:-----:|:-----:|
|  Studies | Studi | Study |
| Studying | Study | Study |

5. Finally, we have to convert our text data into numeric form for our machine learning models to analyze, a process called **vectorization**.
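The first three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lesson's implementation: the `STOPWORDS` set and the `preprocess` helper are hypothetical stand-ins, and the notebook itself relies on NLTK's stopword list and tokenizers.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the lesson uses nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {"the", "was", "he", "she", "it", "a", "and", "to"}

def preprocess(text):
    """Lowercase, strip punctuation/numbers, tokenize, and drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Whitespace splitting stands in for a real tokenizer
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("He was running to the store at 9 AM, and it was closed!"))
```

Stemming or lemmatization would then be applied to the returned tokens before vectorization.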

## Vectorization

- For computers to process text, it must first be converted to a numerical representation.
- **There are several different ways we can vectorize our text:**
    - Count vectorization
    - Term Frequency-Inverse Document Frequency (TF-IDF)
        -  Used for multiple texts
    - Word Embeddings (Deep NLP)
    
    
>- **_Term Frequency_** is calculated with the following formula:
$$ \text{Term Frequency}(t) = \frac{\text{number of times } t \text{ appears in a document}} {\text{total number of terms in the document}} $$ <br>
- Which can also be represented as:
$$\begin{align}
 \text{tf}_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $$

> - **_Inverse Document Frequency_** is calculated with the following formula:
$$ \text{IDF}(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$<br>
- Which can also be represented as: 
$$\begin{align}
\text{idf}(t) = \log \dfrac{N}{df_t}
\end{align} $$

> The **_TF-IDF_** value for a given word in a given document is found by multiplying the two:
$$ \begin{align}
w_{i,j} &= tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} &= \text{frequency of term } i \text{ in document } j \\
df_i &= \text{number of documents containing } i \\
N &= \text{total number of documents}
\end{align} $$
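As a worked example, the formulas above can be computed by hand on a toy corpus. This is a minimal sketch: the three documents and the helper names are made up for illustration, and library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math

# Toy corpus (made up for illustration): each document is a list of tokens
docs = [
    ["great", "wall", "great", "deal"],
    ["fake", "news", "media"],
    ["great", "news"],
]

def term_frequency(term, doc):
    """Occurrences of `term` in `doc` divided by the total terms in `doc`."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """Natural log of (number of documents / documents containing `term`)."""
    # Assumes `term` appears in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of the term frequency and the inverse document frequency."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

print(tf_idf("great", docs[0], docs))
print(tf_idf("wall", docs[0], docs))
```

Although "great" appears twice in the first document and "wall" only once, "wall" receives the higher score because it occurs in fewer documents.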

- There are additional ways to vectorize using Deep Neural Networks to create Word Embeddings (see Module 4 > Appendix: Deep NLP)

## Feature Engineering for Text Data


* Do we remove stop words or not?    
* Do we stem or lemmatize our text data, or leave the words as is?   
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?  
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?  
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?   
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?  


# Practicing Text Preprocessing - Finding Trump 

## Tweet Natural Language Processing Overview

To prepare Donald Trump's tweets for modeling, **it is essential to preprocess the text** and simplify its contents.
<br><br>
1. **At a minimum, things like:**
    - punctuation
    - numbers
    - upper vs lowercase letters<br>
    ***must*** be addressed before any initial analyses. I refer to this initial cleaning as **"minimal cleaning"** of the text content.<br>
    
> Version 1 of the tweet processing removes these items, as well as any URLs in a tweet. The resulting data column is referred to here as `content_min_clean`.

<br><br>
2. It is **generally recommended** to go a step beyond this and<br> remove **commonly used words that contain little information** <br>for our machine learning algorithms. Words like "the", "was", "he", "she", and "it"<br> are called **"stopwords"**, and it is important to address them as well.

> Version 2 of the tweet processing removes these items, and the resulting data column is referred to here as `cleaned_stopped_content`.

<br>

3. Additionally, many analyses **need the text tokenized** into a list of words<br> rather than left in natural sentence form. The result is a list of words (**tokens**) separated by ",", which tells the algorithm what should be considered one word.<br><br>For the tweet processing, I used a form of tokenization called `regexp_tokenize`, <br>which uses a pattern of letters and symbols (the regular `expression`) <br>to indicate what combination of characters should be considered a single token.<br><br>The pattern I used was `"([a-zA-Z]+(?:'[a-z]+)?)"`, which allows for words such as "can't" that contain "'" in the middle of the word. This process was actually applied to produce Versions 1 and 2 of the tweets, but the resulting text was put back into sentence form. 

> Version 3 of the tweets keeps the text in its regexp-tokenized form and is referred to as `cleaned_stopped_tokens`.
<br>

4. While not always required, it is often a good idea to reduce similar words down to a shared core.
There are often **multiple variants of the same word with the same/similar meaning**,<br> where one may be plural **(e.g., "democrat" and "democrats")**, or the form of the word differs **(e.g., "run" and "running").**<br> Simplifying words down to a basic core (the word *stem*) is referred to as **"stemming"**. <br><br> A more advanced form of this also recognizes words that are simply in a **different tense**, such as **"ran", "run", "running"**. This process is called **"lemmatization"**, where the words are reduced to their simplest form, called "**lemmas**".<br>  

> In Version 4, the tweets are all reduced down to their word lemmas, further aiding the algorithm in learning the meaning of the texts.
<!-- 

#### EXAMPLE TWEETS AND PROCESSING STEPS:

**TWEET FROM 08-25-2017 12:25:10:**
* **["content"] column:**<p><blockquote>***"Strange statement by Bob Corker considering that he is constantly asking me whether or not he should run again in '18. Tennessee not happy!"***
    
    
* **["content_min_clean"] column:**<p><blockquote>***"strange statement by bob corker considering that he is constantly asking me whether or not he should run again in  18  tennessee not happy "***
    
    
* **["cleaned_stopped_content"] column:**<p><blockquote>***"strange statement bob corker considering constantly asking whether run tennessee happy"***
    
    
* **["cleaned_stopped_tokens"] column:**<p><blockquote>***"['strange', 'statement', 'bob', 'corker', 'considering', 'constantly', 'asking', 'whether', 'run', 'tennessee', 'happy']"***
    
    
* **["cleaned_stopped_lemmas"] column:**<p><blockquote>***"strange statement bob corker considering constantly asking whether run tennessee happy"*** -->

## Finding Trump - Code

In [None]:
# !pip install -U fsds
from fsds.imports import *

In [None]:
import pandas as pd
finding_trump = 'finding-trump.csv'
df = pd.read_csv(finding_trump)
df.head()

In [None]:
## Create a variable "corpus" containing all text
corpus = df['text'].to_list()
corpus[:10]

### Make a Bag-of-Words Frequency Distribution 

- "bag-of-words": collection of all words from a corpus and their frequencies
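A bag-of-words can be sketched with the standard library's `collections.Counter` (NLTK's `FreqDist` is a `Counter` subclass, so it behaves similarly). The two example strings are made up, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Made-up example documents standing in for the tweet corpus
toy_corpus = ["make america great again", "great again and again"]

# Counting the raw strings treats each whole document as a single item
doc_counts = Counter(toy_corpus)

# Splitting into words first gives the word frequencies we actually want
tokens = [word for doc in toy_corpus for word in doc.split()]
word_counts = Counter(tokens)

print(doc_counts.most_common(2))
print(word_counts.most_common(3))
```

The difference between the two counters is why the corpus needs to be tokenized before building a frequency distribution.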


In [None]:
from nltk import FreqDist
corpus[0]

In [None]:
## Make a FreqDist from the corpus

## Display 100 most common words


> That's not quite right...

In [None]:
## Tokenize corpus then generate FreqDist
from nltk import word_tokenize


> Better...but what's our next issue?

In [None]:
## Make a list of stopwords to remove
from nltk.corpus import stopwords
import string

In [None]:
# Get all the stop words in the English language

## Add punctuation to stopwords_list


In [None]:
## Some additional Tweet Punctuation
additional_punc = ['“','”','...',"''",'’','``']
stopwords_list.extend(additional_punc)

In [None]:
## Commentary on not always accepting what is or isn't in stopwords
print('until' in stopwords_list)

stopwords_list.remove('until')
print('until' in stopwords_list)

In [None]:
## Remove stopwords and then re-produce the FreqDist


### Additional Ways to Show Frequency

- [Word Clouds](https://www.geeksforgeeks.org/generating-word-cloud-python/)

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(stopwords=stopwords_list,collocations=False)
wordcloud.generate(','.join(stopped_tokens))
plt.figure(figsize = (12, 12), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis('off')

### Comparing Phases of Preprocessing/Tokenization

In [None]:
from nltk import word_tokenize
from ipywidgets import interact

@interact
def tokenize_tweet(i=(0,len(corpus)-1)):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize
    
    print(f"- Tweet #{i}:\n")
    print(corpus[i],'\n')
    tokens = word_tokenize(corpus[i])

    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += list(string.punctuation)
    # Lowercase before comparing so capitalized stopwords (e.g. "The") are removed
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    
    print(tokens,end='\n\n')
    print(stopped_tokens)

> What recognizable pattern of characters is high on the frequency list?

## Other Bag of Words Statistics

### Bigrams

In [None]:
import nltk
bigram_measures = nltk.collocations.BigramAssocMeasures()
tweet_finder = nltk.BigramCollocationFinder.from_words(stopped_tokens)
tweets_scored = tweet_finder.score_ngrams(bigram_measures.raw_freq)
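To make the idea of bigram frequencies concrete, here is a dependency-free sketch. The token list is made up and stands in for `stopped_tokens`; the relative frequency here divides by the number of bigrams, which is an assumption for illustration, and NLTK's `raw_freq` may normalize slightly differently.

```python
from collections import Counter

# Made-up token list standing in for `stopped_tokens`
tokens = ["fake", "news", "media", "fake", "news",
          "great", "wall", "fake", "news"]

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# Relative frequency: count divided by the total number of bigrams
total = len(bigrams)
bigram_freqs = {pair: count / total for pair, count in bigram_counts.items()}

for pair, count in bigram_counts.most_common(3):
    print(pair, count, round(bigram_freqs[pair], 3))
```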

In [None]:
## Make a DataFrame from the Bigrams


### Mutual Information Scores

In [None]:
import nltk
bigram_measures = nltk.collocations.BigramAssocMeasures()

tweet_pmi_finder = nltk.BigramCollocationFinder.from_words(stopped_tokens)
tweet_pmi_finder.apply_freq_filter(5)

tweet_pmi_scored = tweet_pmi_finder.score_ngrams(bigram_measures.pmi)

In [None]:
## Make a DataFrame from the Bigrams with PMI
pd.DataFrame.from_records(tweet_pmi_scored,columns=['Words','PMI']).head(20)
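Pointwise mutual information compares how often two words occur together with how often they would co-occur by chance if they were independent. A minimal sketch of that calculation follows, assuming a made-up token list and a hypothetical `pmi` helper; NLTK's exact counting conventions may differ.

```python
import math
from collections import Counter

# Made-up token list standing in for `stopped_tokens`
tokens = ["fake", "news", "media", "fake", "news",
          "great", "wall", "fake", "news"]

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_words = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w1, w2):
    """log2 of the observed pair probability over the chance expectation."""
    # Assumes the pair (w1, w2) occurs at least once in the tokens
    p_pair = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = word_counts[w1] / n_words
    p_w2 = word_counts[w2] / n_words
    return math.log2(p_pair / (p_w1 * p_w2))

print(round(pmi("fake", "news"), 3))
print(round(pmi("great", "wall"), 3))
```

In this toy data the pair seen only once ("great", "wall") scores higher than the pair seen three times, which illustrates why a minimum frequency filter such as `apply_freq_filter(5)` is applied before scoring.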

# Regular Expressions

- Regular expressions can help us capture/remove complicated patterns in our text.
- Best regexp resource and tester: https://regex101.com/

    - Make sure to check "Python" under Flavor menu on left side.
    
    
- Let's use regular expressions to remove URLs

In [1]:
## Select an example tweet
text =  corpus[6615]
text

In [2]:
## Select a second example tweet
text2=corpus[7347]
text2

In [None]:
## From the lessons
from nltk import regexp_tokenize
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
regexp_tokenize(text,pattern)

### Let's use regex to find/remove URLS

- www.regex101.com
    - Copy and paste example text to search
    - Test out regular expressions and see what they pick up

In [None]:
print(text,text2)

In [None]:
import re
re.findall(r"(https://\w*\.\w*/+\w+)",text)

In [None]:
def clean_text(text,regex=True):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize

    ## tokenize text
    if regex:
        pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
        tokens= regexp_tokenize(text,pattern)
    else:
        tokens = word_tokenize(text)
        
    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += list(string.punctuation)
    
    # Lowercase before comparing so capitalized stopwords (e.g. "The") are removed
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    
    return stopped_tokens

In [None]:
## Other uses of RegEx for Tweet preprocessing
import re

def find_urls(string): 
    return re.findall(r"(http[s]?://\w*\.\w*/+\w+)",string)

def find_hashtags(string):
    return re.findall(r'\#\w*',string)

def find_retweets(string):
    return re.findall(r'RT [@]?\w*:',string)

def find_mentions(string):
    return re.findall(r'\@\w*',string)

In [None]:
find_urls(text)

In [None]:
find_mentions(text2)
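The patterns used by these helper functions can be tried on a single string without loading the corpus. This sketch uses a hypothetical tweet written for illustration, and the cleanup loop is one possible way to combine the patterns rather than the lesson's own method.

```python
import re

# Same patterns as the helper functions (URL capture group dropped)
URL_PATTERN = r"http[s]?://\w*\.\w*/+\w+"
HASHTAG_PATTERN = r"\#\w*"
RETWEET_PATTERN = r"RT [@]?\w*:"
MENTION_PATTERN = r"\@\w*"

# Hypothetical tweet, made up for this example
tweet = "RT @WhiteHouse: Great meeting today! #MAGA https://t.co/abc123"

print(re.findall(URL_PATTERN, tweet))
print(re.findall(HASHTAG_PATTERN, tweet))
print(re.findall(RETWEET_PATTERN, tweet))
print(re.findall(MENTION_PATTERN, tweet))

# Remove the retweet marker before mentions, since it contains a mention
cleaned = tweet
for pattern in (URL_PATTERN, RETWEET_PATTERN, HASHTAG_PATTERN, MENTION_PATTERN):
    cleaned = re.sub(pattern, "", cleaned)

# Collapse the leftover runs of whitespace
cleaned = " ".join(cleaned.split())
print(cleaned)
```

Note that the URL pattern expects a single dot in the host name, as in Twitter's shortened `t.co` links, so a longer host such as `www.example.com` would not match without adjusting the expression.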

## Stemming/Lemmatization

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('feet')) # foot
print(lemmatizer.lemmatize('running', pos='v')) # run (the default pos='n' would return 'running')

In [None]:
text_in =  corpus[6615]
text_in

In [None]:
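def process_tweet(text,as_lemmas=False,as_tokens=True):
    """Strip URLs, retweet markers, and hashtags, then optionally tokenize/lemmatize."""
    for x in find_urls(text):
        text = text.replace(x,'')
        
    for x in find_retweets(text):
        text = text.replace(x,'')    
        
    for x in find_hashtags(text):
        text = text.replace(x,'')    

    if as_lemmas or as_tokens:
        ## Tokenize and remove stopwords
        tokens = clean_text(text)
        
        if as_lemmas:
            from nltk.stem.wordnet import WordNetLemmatizer
            lemmatizer = WordNetLemmatizer()
            ## Lemmatize each token; lemmatizing the whole string at once
            ## treats it as a single (unknown) word and changes nothing
            tokens = [lemmatizer.lemmatize(w) for w in tokens]
        
        ## Return a list of tokens, or re-join them into a string
        text = tokens if as_tokens else ' '.join(tokens)
    
    if len(text)==0:
        text=''
            
    return text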

In [None]:
@interact
def show_processed_text(i=(0,len(corpus)-1)):
    text_in = corpus[i]#.copy()
    print(text_in)
    text_out = process_tweet(text_in)
    print(text_out)
    text_out2 = process_tweet(text_in,as_lemmas=True)
    print(text_out2)

In [None]:
corpus[:6]

# ACTIVITY: FINDING TRUMP

## Finding Trump with sklearn

In [None]:
finding_trump = '../finding-trump.csv'#'https://raw.githubusercontent.com/jirvingphd/online-ds-pt-1007109-text-classification-finding-trump/master/finding-trump.csv'

df = pd.read_csv(finding_trump,#'https://raw.githubusercontent.com/jirvingphd/capstone-project-using-trumps-tweets-to-predict-stock-market/master/data/trump_tweets_12012016_to_01012020.csv',
                index_col='created_at',parse_dates=['created_at'])
df.head()

### Early in His Presidency, Trump Used an Unofficial Android Phone  
> During this time period, his staffers were the ones Tweeting from the official iPhone. 

In [None]:
## Check Value Counts for Source
df['source'].value_counts()

In [None]:
## Get time period where Trump still had his personal Android


In [None]:
## Check new value counts 


In [None]:
## What do the "Web" tweets look like?


In [None]:
## Remove the Web tweets


In [None]:
## Make new Trump Tweet Column of 0 and 1s


In [None]:
## Make X and y


In [None]:
## Train Test Split
from sklearn.model_selection import train_test_split


In [None]:
## Check y_train and y_test value counts



### Tokenization & Vectorization 

In [None]:
import nltk
tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False,)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

## Make TfidfVectorizer


# Make X_train_tfidf and X_test_tfidf


### Modeling Continued

In [None]:
## Make and fit a random forest
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_train_tfidf,y_train)

In [None]:
## Get predictions
y_hat_test = rf.predict(X_test_tfidf)
y_hat_train = rf.predict(X_train_tfidf)

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
# my_scorer = metrics.make_scorer(evaluate_model,)

def evaluate_model(y_test,y_hat_test,X_test,clf=None,
                  scoring=metrics.recall_score,verbose=False,
                  scorer=False,classes=('Not Trump','Trump')):
    """Print a classification report and plot a normalized confusion matrix."""
    classes = list(classes)
    print(metrics.classification_report(y_test,y_hat_test,
                                        target_names=classes))
    
    ## plot_confusion_matrix was removed in scikit-learn 1.2
    metrics.ConfusionMatrixDisplay.from_estimator(clf,X_test,y_test,
                                 normalize='true',cmap='Blues',
                                 display_labels=classes)
    plt.show()
    if verbose:
        print("MODEL PARAMETERS:")
        ## Report the parameters of the model passed in, not the global rf
        print(pd.Series(clf.get_params()))
        
    if scorer:
        return scoring(y_test,y_hat_test)
    
    
## Evaluate Model
evaluate_model(y_test,y_hat_test,X_test_tfidf,rf)

### Note About Pipelines and GridSearch

- You may want to do this process in multiple steps (first Count Vectorize, then transform to TF or TF-IDF).
- You can then use these steps in a Pipeline to GridSearch more aspects of the text preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer #TfidfVectorizer
from sklearn.pipeline import Pipeline

count_vect = CountVectorizer()
#X_train_counts = count_vect.fit_transform(twenty_train.data)

tf_transformer = TfidfTransformer(use_idf=False)
#tf_transformer.fit(X_train_counts)
#X_train_tf = tf_transformer.transform(X_train_counts)
#X_train_tf.shape


text_pipe = Pipeline(steps=[
    ('count_vectorizer',count_vect),
    ('tf_transformer',tf_transformer)])

```


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer #TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

count_vect = CountVectorizer()
tf_transform = TfidfTransformer(use_idf=True)

text_pipe = Pipeline(steps=[
    ('count_vectorizer',count_vect),
    ('tf_transformer',tf_transform)])

full_pipe = Pipeline(steps=[
    ('text_pipe',text_pipe),
    ('clf',RandomForestClassifier(class_weight='balanced'))
])

In [None]:
X_train

In [None]:
X_train_pipe = text_pipe.fit_transform(X_train)
X_test_pipe = text_pipe.transform(X_test)
X_train_pipe

In [None]:
## Modeling with full pipeline
full_pipe.fit(X_train,y_train)
y_hat_test = full_pipe.predict(X_test)
evaluate_model(y_test,y_hat_test,X_test,full_pipe)

In [None]:
from sklearn import set_config
set_config(display='text')

full_pipe

In [None]:
import nltk
from sklearn.model_selection import GridSearchCV

tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False,)

params = {'text_pipe__tf_transformer__use_idf':[True,False],
         'text_pipe__count_vectorizer__tokenizer':[None,tokenizer.tokenize],
         'clf__criterion':['gini','entropy']}

grid = GridSearchCV(full_pipe,params)
grid.fit(X_train,y_train)
grid.best_params_

In [None]:
best_pipe = grid.best_estimator_
y_hat_test = best_pipe.predict(X_test)
evaluate_model(y_test,y_hat_test,X_test,clf=best_pipe)

### Get feature importances as text

In [None]:
X_train_pipe = text_pipe.fit_transform(X_train)
X_test_pipe = text_pipe.transform(X_test)
X_train_pipe

In [None]:
X_train_pipe.shape

In [None]:
features = text_pipe.named_steps['count_vectorizer'].get_feature_names_out()
features[:10]

In [None]:
len(features)

In [None]:
# vectorizer.get_feature_names_out()
with plt.style.context('seaborn-talk'):
    importance = pd.Series(rf.feature_importances_,index= vectorizer.get_feature_names_out())
    importance.sort_values(inplace=True)

    importance.tail(30).plot(kind='barh')

In [3]:
# df[df['text'].str.contains('...',regex=False)]['source'].value_counts(normalize=True)

In [None]:
top_word_probs = {}
for word in importance.tail(20).index:
    rows = df['text'].str.contains(word,regex=False,case=False)
    val_count= df[rows]['source'].value_counts(normalize=True)
    top_word_probs[word] = val_count
#     print(f'\n\n{word}:\n{val_count}')

In [None]:
top_probs = pd.DataFrame(top_word_probs).T
top_probs.style.background_gradient(axis=1)

## T-SNE (for Student Question)

In [None]:
X_train_pipe.todense()

In [None]:
from sklearn.manifold import TSNE
from mpl_toolkits.mplot3d import Axes3D

In [None]:
## TSNE For Visualizing High Dimensional Data
## init='random' because PCA initialization does not support sparse input
t_sne_object_3d = TSNE(n_components=3,init='random')
transformed_data_3d = t_sne_object_3d.fit_transform(X_train_pipe)
transformed_data_3d

In [None]:
## Separate into Trump/Not Trump
trump = transformed_data_3d[y_train==1]
not_trump = transformed_data_3d[y_train==0]

In [None]:
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(projection='3d')
ax.scatter(trump[:,0],trump[:,1],
           trump[:,2],c='orange',label='Trump')
ax.scatter(not_trump[:,0],not_trump[:,1],
           not_trump[:,2],c='black',label='Not Trump')
ax.legend()
ax.view_init(30, 10)


fig.tight_layout()

In [None]:
## TSNE For Visualizing High Dimensional Data
## init='random' because PCA initialization does not support sparse input
t_sne_object_2d = TSNE(n_components=2,init='random')
transformed_data_2d = t_sne_object_2d.fit_transform(X_train_pipe)
## Separate into Trump/Not Trump
trump = transformed_data_2d[y_train==1]
not_trump = transformed_data_2d[y_train==0]

In [None]:
fig,ax = plt.subplots(figsize=(20,10))
ax.scatter(trump[:,0],trump[:,1],c='orange',label='Trump')
ax.scatter(not_trump[:,0],not_trump[:,1],c='black',label='Not Trump')
ax.legend()

fig.tight_layout()

## Other Classifiers - Naive Bayes

In [None]:
nb_classifier = MultinomialNB()#alpha = 1.0e-08)
nb_classifier.fit(X_train_pipe,y_train)
y_hat_test = nb_classifier.predict(X_test_pipe)
evaluate_model(y_test,y_hat_test,X_test_pipe,nb_classifier)

# APPENDIX

## GridSearch Random Forest

In [None]:
# from sklearn.model_selection import GridSearchCV
# params  = {'criterion':['gini','entropy'],
#            'max_depth':[3,5,10,50,100,None],
#           'class_weight':['balanced',None],
#            'bootstrap':[True ,False],
#           'min_samples_leaf':[1,2,3,4],
#           }
# rf_clf = RandomForestClassifier()
# grid = GridSearchCV(rf_clf,params,return_train_score=False,
#                     scoring='recall_weighted',n_jobs=-1)
# grid.fit(X_train_tfidf,y_train)
# print(grid.best_score_)
# grid.best_params_

In [None]:
# best_rf = grid.best_estimator_
# best_rf.fit(X_train_tfidf, y_train)

# y_hat_test = best_rf.predict(X_test_tfidf)

In [None]:
# evaluate_model(y_test,y_hat_test,X_test_tfidf,best_rf)

In [None]:
# importance = pd.Series(best_rf.feature_importances_,index= vectorizer.get_feature_names())
# importance.sort_values().tail(20).plot(kind='barh')

## BOOKMARK: Better Handling Emojis

> https://medium.com/towards-artificial-intelligence/emoticon-and-emoji-in-text-mining-7392c49f596a

## Excluded Code

### Summary table from Finding Trump

In [None]:
## Summary Table with Most Frequent Words 
prob_cols =['Twitter for Android','Twitter for iPhone']
top_probs['importance'] = importance.loc[top_probs.index]

top_probs['max_prob'] = top_probs[prob_cols].max(axis=1)
top_probs['Max Prob Class'] = top_probs[prob_cols].idxmax(axis=1)
top_probs.sort_values('importance',ascending=False,inplace=True)
top_probs.style.bar('importance')\
                    .background_gradient(subset=['max_prob'])\
                    .highlight_max(subset=prob_cols,axis=1,color='lightgreen')
#.background_gradient(subset=prob_cols,axis=1,cmap='Reds')

In [None]:
results = top_probs[['Max Prob Class','max_prob','importance']]
display(results.style.bar('importance').background_gradient(subset=['max_prob']))
results['Max Prob Class'].value_counts(normalize=True)


### Interactive Tokenizer Example

In [None]:
from nltk import word_tokenize
from ipywidgets import interact

@interact
def tokenize_tweet(i=(0,len(corpus)-1)):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize
    
    print(f"- Tweet #{i}:\n")
    print(corpus[i],'\n')
    tokens = word_tokenize(corpus[i])

    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += list(string.punctuation)
    # Lowercase before comparing so capitalized stopwords (e.g. "The") are removed
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    
    print(tokens,end='\n\n')
    print(stopped_tokens)

### NLP Vocabulary
- Corpus
    - Body of text
    
- Bag of Words
    - Collection of all words from a corpus.


## Regular Expressions

- Use https://regex101.com/ to test out regular expressions

## Context-Free Grammars and POS Tagging

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-context-free-grammars-and-POS-tagging-online-ds-ft-100719/master/images/new_LevelsOfLanguage-Graph.png">

#### Syntax and Meaning Can be Difficult for Computers 

In English, sentences consist of a **_Noun Phrase_** followed by a **_Verb Phrase_**, which may optionally be followed by a **_Prepositional Phrase_**.

This ***seems simple, but it gets trickier*** when we realize that there is a recursive structure to these phrases.

- A noun phrase may consist of multiple smaller noun phrases, and in some cases, even a verb phrase. 
- Similarly, a verb phrase can consist of multiple smaller verb phrases and noun phrases, which can themselves be made up of smaller noun phrases and verb phrases. 


This leads to levels of **_ambiguity_** that can be troublesome for computers. NLTK's documentation explains this by examining the classic Groucho Marx joke:

> ***"While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know."***



<img src="https://raw.githubusercontent.com/jirvingphd/dsc-context-free-grammars-and-POS-tagging-online-ds-ft-100719/master/images/parse_tree.png">
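
The ambiguity can be reproduced with a small context-free grammar. The sketch below follows the toy grammar used in the NLTK book for this sentence; it covers only the words of the joke's first clause and is an illustration, not a general English grammar.

```python
import nltk

# Toy grammar adapted from the NLTK book's treatment of this sentence
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sentence = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

# The chart parser returns every parse tree the grammar allows
parser = nltk.ChartParser(groucho_grammar)
trees = list(parser.parse(sentence))

for tree in trees:
    print(tree)
```

The parser finds two trees: in one, "in my pajamas" attaches to the verb phrase (the shooter wore the pajamas), and in the other it attaches to the noun phrase (the elephant was in the pajamas).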