# Text modeling

Hi everyone! Today we're exploring natural language processing. First, we will look at the linguistic features that our documents contain. We will then move on to deeper representations of our corpora and, finally, to text classification. We will use `scikit-learn`, `TextBlob`, and `spaCy` in this notebook, with a collection of SMS spam messages as our dataset.

To install `spacy` and its models properly on the **terminal**:

`pip3 install spacy`

`python3 -m spacy download en_core_web_sm`

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight')

### Data exploration

In [None]:
# Import messages into dataframe
messages = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
messages.head()

In [None]:
# Describe our dataset
messages.groupby('label').describe()

In [None]:
messages['length'] = messages['message'].apply(len)
messages.hist(column='length', by='label', bins=50, figsize=(12,4))

The histograms show that spam messages tend to contain more characters than legitimate (ham) messages.
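To put a number on the difference in length, we can compare the average message length per label. The snippet below is only a sketch on a made-up four-message sample (the `sample` list is hypothetical, not taken from the dataset), so the values it prints say nothing about the real corpus.

```python
from statistics import mean

# Hypothetical mini-sample of (label, message) pairs, invented for illustration
sample = [
    ('ham', 'Ok lar, see you at 5'),
    ('ham', 'Running late, start without me'),
    ('spam', 'WINNER!! You have been selected for a free prize. Reply WIN to claim it now.'),
    ('spam', 'URGENT! Your account has won a cash bonus. Reply YES to collect your reward today.'),
]

# Group message lengths (in characters) by label
lengths = {}
for label, text in sample:
    lengths.setdefault(label, []).append(len(text))

# Average length per label
mean_length = {label: mean(values) for label, values in lengths.items()}
print(mean_length)
```

On the full dataframe, the same comparison should be available through `messages.groupby('label')['length'].describe()`.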

-----

### Text Segmentation

In [None]:
from textblob import Word, TextBlob

Let's take a look at the message at index 219.

In [None]:
message = messages['message'][219]
message

In [None]:
blob = TextBlob(message)
blob

Let's try to split our text data into more meaningful pieces. First, into separate sentences. Second, into separate terms. Generally, we call this process **tokenization**.
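To make the idea concrete, here is a deliberately naive tokenizer built on regular expressions. It is only a sketch: the helper names `split_sentences` and `split_words` and the sample text are made up for this example, and TextBlob's own tokenizers are more careful (for instance with abbreviations such as "Dr.").

```python
import re

def split_sentences(text):
    # Split on whitespace that follows a sentence-ending ., ! or ?
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_words(sentence):
    # Keep runs of letters, digits and apostrophes; punctuation is dropped
    return re.findall(r"[A-Za-z0-9']+", sentence)

# Hypothetical message, invented for illustration
text = "Free entry in a weekly comp! Text WIN to 80086 now. Don't miss out."

sentences = split_sentences(text)
words = split_words(sentences[1])

print(sentences)
print(words)
```

The cells that follow do the same two steps with TextBlob's `sentences` and `words` properties.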

In [None]:
blob.sentences

In [None]:
sentence = blob.sentences[1]
sentence.words

TextBlob can extract noun phrases: groups of words that act as the subject, object, or prepositional object of a sentence. This feature isn't perfect, though.

We can also get word counts like last time.

In [None]:
blob.noun_phrases

In [None]:
blob.word_counts

----

### Linguistic features
Text naturally has several features of interest. We will take a look at parts of speech and lemmas of terms in sentences. We will also take a look at word synonyms and definitions.

**Parts of speech** (POS) describe the grammatical role a term plays in a statement (e.g. noun, verb, adjective). A POS tagger assigns each word its most likely category given the surrounding context, typically using a statistical model, hand-written rules, or a mix of both. A comprehensive list of tags is found at http://www.clips.ua.ac.be/pages/mbsp-tags.

In [None]:
sentence.tags

**Lemmas** are the base (dictionary) forms of words, such as the singular of a noun (matrices -> matrix) or the base form of a verb (threw -> throw). The process of deriving them is called lemmatization. TextBlob's lemmatizer relies on the WordNet lexicon, so the result depends on the part of speech you pass in.

The `lemmatize()` method takes an optional part-of-speech parameter:

- `'n'` for noun (default)
- `'v'` for verb
- `'a'` for adjective
- `'r'` for adverb (doesn't always work)
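Why does the part of speech matter? A toy lookup-based lemmatizer makes it visible. This is a sketch under strong assumptions: `LEXICON` and `toy_lemmatize` are hypothetical names, the four entries are hand-picked, and a real lexicon-backed lemmatizer also applies suffix rules that are left out here.

```python
# Hypothetical mini-lexicon mapping (inflected form, POS) -> lemma
LEXICON = {
    ('went', 'v'): 'go',
    ('threw', 'v'): 'throw',
    ('matrices', 'n'): 'matrix',
    ('stronger', 'a'): 'strong',
}

def toy_lemmatize(word, pos='n'):
    # Look the word up under the given POS; fall back to the word itself
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize('went', 'v'))  # looked up as a verb
print(toy_lemmatize('went'))       # looked up as a noun (the default)
```

With the default noun tag, the lookup for "went" finds nothing and the word comes back unchanged, which is why passing the right tag is worth the effort.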

In [None]:
w = Word('went')
w.lemmatize('v')

In [None]:
w = Word('alumnae')
w.lemmatize('n')

In [None]:
w = Word('stronger')
w.lemmatize('a')

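TextBlob can also pluralize and singularize words, look up their WordNet synonym sets and definitions, correct likely spelling errors, and even compare how similar two words are! Older releases could also detect a text's language and translate it. Those methods were deprecated in TextBlob 0.16 and removed in 0.18, so the `detect_language()` and `translate()` cells at the end of this section only run on older versions.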

In [None]:
w = Word('corpora')
w.pluralize()

In [None]:
w = Word('corpora')
w.singularize()

In [None]:
w = Word('eat')
list(zip(w.synsets, w.definitions))

In [None]:
w = Word('havv')
w.correct()
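TextBlob's documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below is a much-reduced version of that idea, not TextBlob's code: the `WORD_COUNTS` vocabulary is made up, and it only considers candidates one edit away from the input.

```python
import string

# Hypothetical word frequencies, invented for illustration
WORD_COUNTS = {'have': 120, 'hat': 30, 'save': 20, 'heavy': 15}

def edits1(word):
    # All strings one delete, transpose, replace or insert away from `word`
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_correct(word):
    # Known words are returned unchanged
    if word in WORD_COUNTS:
        return word
    # Otherwise pick the most frequent known word one edit away, if any
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=WORD_COUNTS.get)

print(toy_correct('havv'))
```

Choosing the most frequent candidate is the simplest possible ranking; a fuller corrector would also look two edits away and weigh how likely each kind of typo is.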

In [None]:
apple = Word('apple').synsets[0]
orange = Word('orange').synsets[0]
apple.path_similarity(orange)
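`path_similarity` scores two synsets by the shortest path between them in the WordNet hypernym ("is a kind of") hierarchy, as 1 / (1 + path length). Here is a sketch of that formula on a tiny hand-made taxonomy. The `PARENT` tree is hypothetical and much smaller than WordNet, so the score it produces will not match the cell above.

```python
# Hypothetical taxonomy: each node points to its single parent
PARENT = {
    'apple': 'edible_fruit',
    'orange': 'citrus',
    'citrus': 'edible_fruit',
    'edible_fruit': 'produce',
}

def distances_to_ancestors(node):
    # Map the node and each of its ancestors to the number of steps up from `node`
    dist, steps = {}, 0
    while node is not None:
        dist[node] = steps
        node = PARENT.get(node)
        steps += 1
    return dist

def toy_path_similarity(a, b):
    da, db = distances_to_ancestors(a), distances_to_ancestors(b)
    common = set(da) & set(db)
    if not common:
        return None  # no connecting path in this taxonomy
    # Shortest path goes up from a to a shared ancestor, then down to b
    shortest = min(da[n] + db[n] for n in common)
    return 1 / (1 + shortest)

print(toy_path_similarity('apple', 'orange'))
```

Identical nodes score 1.0, and the score shrinks as the two concepts sit further apart in the tree.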

In [None]:
b = TextBlob('Je suis Will.')
b.detect_language()

In [None]:
b.translate(from_lang='fr', to='en')

----

### Exercises

##### Get the message at index 518 and make it into a blob. 

In [None]:
# Your answer here

##### Print the first 5 unique POS tags in blob

In [None]:
# Your answer here

##### Find all the lemmas (of all nouns and verbs) in blob

In [None]:
# Your answer here

##### List all words in blob that are plural (with index of each word)

In [None]:
# Your answer here

----

### Using spaCy

In [None]:
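import spacy
from spacy import displacy

# Load the small English pipeline; spacy.load raises OSError if it is not installed
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    print("Error loading 'en_core_web_sm' model. Run: python3 -m spacy download en_core_web_sm")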

![title](spacy_pipeline.png)

In [None]:
doc = nlp(str(sentence))
columns = ['text', 'lemma', 'pos', 'tag',
           'dependency', 'shape', 'is_alphabet',
           'is_stopword', 'head_text', 'head_pos']

# Collect one row of attributes per token, then build the table in one step
rows = [[token.text, token.lemma_, token.pos_,
         token.tag_, token.dep_, token.shape_,
         token.is_alpha, token.is_stop,
         token.head.text, token.head.pos_]
        for token in doc]
tokens = pd.DataFrame(rows, columns=columns)
tokens

In [None]:
s = 'Apple’s iOS 13 is here — or rather, the public beta for iOS 13 has arrived,giving the masses their first chance to take Apple’s latest operating system for a spin.'
doc = nlp(s)
displacy.render(doc, style='ent', jupyter=True)