# NLP tutorial


* How to analyse text:
   * the main stages of NLP are covered in general terms, with explanations and code examples
   
   
* Libraries: sklearn, nltk, spaCy and more...


* As usual, plenty of comments and notes are included to help you follow along


* ENJOY

# 1) Text Preprocessing

**. Noise removal — stripping text of formatting (e.g., HTML tags)**

**. Tokenization — breaking text into individual words** 

**. Normalization — cleaning text data in any other way:**
   * -> **Stemming** is like a blunt axe that chops off word prefixes and suffixes. 
     (“booing” and “booed” become “boo”, but  “sing” may become “s” 
     and “sung” would remain “sung.”).
   * -> **Lemmatization** is a scalpel to bring words down to their root forms. 
     For example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”
   * -> Other common tasks include lowercasing, stopword removal, spelling correction, etc.
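
The noise-removal and normalization steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the method used in the cell that follows: the helper names `remove_noise` and `normalize`, the tag-stripping pattern and the tiny `STOP_WORDS` set are all assumptions made for the example.

```python
import re

# Tiny illustrative stop list (an assumption for this sketch); a real
# pipeline would use a fuller one such as nltk.corpus.stopwords.
STOP_WORDS = {"the", "a", "an", "of", "out", "are", "so"}

def remove_noise(raw):
    """Strip HTML-style tags, then collapse punctuation and whitespace into single spaces."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\W+", " ", no_tags).strip()

def normalize(raw):
    """Lowercase the cleaned text, split it on whitespace and drop stop words."""
    tokens = remove_noise(raw).lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

html = "<p>So many <b>squids</b> are jumping out of suitcases!</p>"
print(remove_noise(html))
print(normalize(html))
```

A regular expression is enough for simple tags like these; messy real-world HTML is usually better handled by a proper HTML parser.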


In [23]:
# regex for removing punctuation!
import re
# nltk preprocessing magic
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# grabbing a part of speech function:
from part_of_speech import get_part_of_speech # local helper module (not shown here); its result is passed to lemmatize() as the POS tag

text = """So many squids are jumping out of suitcases these days that you can barely 
go anywhere without seeing one burst forth from a tightly packed valise. 
I went to the dentist the other day, and sure enough I saw an angry one jump out of 
my dentist's bag within minutes of arriving. She hardly even noticed."""

# raw string so that \W is read as a regex class, not as a string escape
cleaned = re.sub(r'\W+', ' ', text)
print("cleaned \n", cleaned)
# Tokenization
tokenized = word_tokenize(cleaned)
# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
# Lemmatization (by default, lemmatize() treats every word as a noun)
lemmatizer = WordNetLemmatizer()
lemmatized_lessgood = [lemmatizer.lemmatize(token) for token in tokenized]
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

print("\n Stemmed text:")
print(stemmed)
print("\nlemmatized_lessgood text:")
print(lemmatized_lessgood)
print("\nBetter Lemmatized text:")
print(lemmatized)

cleaned 
 So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise I went to the dentist the other day and sure enough I saw an angry one jump out of my dentist s bag within minutes of arriving She hardly even noticed 

 Stemmed text:
['So', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'I', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

lemmatized_lessgood text:
['So', 'many', 'squid', 'are', 'jumping', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'seeing', 'one', 'burst', 'forth', 'from', 'a', 'tightly'

# 2) Parsing Text

*Parsing is a stage of NLP concerned with segmenting text based on syntax.*


   * **Part-of-speech** tagging (**POS** tagging) 
     identifies parts of speech (verbs, nouns, adjectives, etc.).

   * **Named entity recognition** (**NER**)  
     helps identify the proper nouns (e.g., “Natalia” or “Berlin”) in a text. 
    
   * **Dependency grammar trees**
     help you understand the relationship between the words in a sentence. 
     Python library "spaCy" is at your service, even if it isn’t always perfect. 

   * **Regex parsing**  
     using Python’s "re" library, allows for a bit more nuance. 
     When coupled with POS tagging, you can identify specific phrase pieces. 
     (On its own, it can find you addresses, emails, and many other common 
     patterns within large pieces of text).
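
As a small illustration of regex parsing on its own, the sketch below pulls email addresses out of a sentence with Python's `re` module. The pattern is deliberately simple and is an assumption for this example: it covers common addresses, not every address the email standards allow, and the sample text is made up.

```python
import re

# Simple illustrative pattern: local part, "@", then a dotted domain name.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

sample = "Write to natalia@example.com or the Berlin office (berlin.office@example.org) today."

# findall returns every non-overlapping match, in order of appearance.
emails = EMAIL_PATTERN.findall(sample)
print(emails)
```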

In [24]:
import spacy
from nltk import Tree
from squids import squids_text # just the text

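# spaCy 3 removed the 'en' shortcut, which raises the OSError [E050] recorded after this cell.
# Load the small English model by its package name instead
# (install it first with: python -m spacy download en_core_web_sm).
dependency_parser = spacy.load('en_core_web_sm')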

parsed_squids = dependency_parser(squids_text)

# Assign my_sentence a new value:
my_sentence = "I can say shit as well here!"
my_parsed_sentence = dependency_parser(my_sentence)

def to_nltk_tree(node):
  if node.n_lefts + node.n_rights > 0:
    parsed_child_nodes = [to_nltk_tree(child) for child in node.children]
    return Tree(node.orth_, parsed_child_nodes)
  else:
    return node.orth_

for sent in parsed_squids.sents:
  to_nltk_tree(sent.root).pretty_print()
  
for sent in my_parsed_sentence.sents:
  to_nltk_tree(sent.root).pretty_print()

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

# 3) Language Models

*-> We can help computers make predictions about language by training a
language model on a corpus (a bunch of example text)*

 _**Language models**_ = probabilistic computer models of language. 
    We build and use these models to figure out the likelihood that 
    a given sound, letter, word, or phrase will be used. 
    Once a model has been trained, it can be tested out on new texts.

### 3-1 Bag-of-Words Approach

* **Bag-of-words model** (good when grammar and word order are irrelevant)
  = It counts the number of times each word appears in a text (after the preprocessing step).
  Topics (words) shared between different examples (texts) can then be observed.
  It is used to make predictions about the topic or sentiment of a text.
  = One of the most common statistical language models (a unigram model).

  Problem: at test time, a bag-of-words model will probably come across 
  words that it never encountered during training.


 > ex: 
 
      “The squids jumped out of the suitcases.”
 
      {"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}
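
The unseen-word problem can be made concrete with `collections.Counter`. This is a minimal sketch: the training tokens are the already-lemmatized example above, while the test tokens are made up for illustration.

```python
from collections import Counter

# Tokens are assumed to be cleaned, lowercased and lemmatized already.
training_tokens = ["the", "squid", "jump", "out", "of", "the", "suitcase"]
training_bag = Counter(training_tokens)
print(training_bag)

# Hypothetical test sentence: any token missing from the training bag
# has a count of zero, so the model knows nothing about it.
test_tokens = ["the", "squid", "hide", "in", "the", "valise"]
unseen = [token for token in test_tokens if token not in training_bag]
print(unseen)
```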

In [25]:
# importing regex and nltk
import re, nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# importing Counter to get word counts for bag of words
from collections import Counter
# importing a passage from Through the Looking Glass
#from looking_glass import looking_glass_text
# importing part-of-speech function for lemmatization
from part_of_speech import get_part_of_speech # local helper module (not shown here); its result is passed to lemmatize() as the POS tag

# Change text to another string:
text = "Let's know what to write in order to check. This is your text. Just try it out"

cleaned = re.sub(r'\W+', ' ', text).lower()
tokenized = word_tokenize(cleaned)

stop_words = stopwords.words('english')
filtered = [word for word in tokenized if word not in stop_words]

normalizer = WordNetLemmatizer()
normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in filtered]
# Count how often each normalized word occurs: this Counter is the bag of words
bag_of_looking_glass_words = Counter(normalized)
print(bag_of_looking_glass_words)

Counter({'let': 1, 'know': 1, 'write': 1, 'order': 1, 'check': 1, 'text': 1, 'try': 1})


### 3-2 N-Grams and NLM (neural language models)

* **n-gram model**: considers sequences of n units (characters or words) and calculates the probability of each unit in a
    body of language given the preceding n - 1 units.
 
   Problem:
   Like bag-of-words, an n-gram model will probably come across test sequences that it never encountered
   during training.
   As the chosen sequence length n grows, the number of examples of each sequence within your
   training corpus (text) shrinks.

 > ex : 

      “The squids jumped out of the suitcases. The squids were furious.” 
       
       For n = 2: 
       
        {('', 'the'): 2, ('the', 'squids'): 2, ('squids', 'jumped'): 1, ('jumped', 'out'): 1, ('out', 'of'): 1, ('of', 'the'): 1, ('the', 'suitcases'): 1, ('suitcases', ''): 1, ('squids', 'were'): 1, ('were', 'furious'): 1, ('furious', ''): 1}        


* **neural language models** (**NLM**): a deep learning approach. Much recent NLP work involves developing and training neural networks that approximate the way human brains handle language. 
This gives computers a much more adaptive approach to processing human language.
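
The bigram counts in the n = 2 example above can be reproduced with the standard library, and the same counts give a simple next-word probability. This is a minimal sketch: padding each sentence with an empty string follows the example, while the helper names `padded_bigrams` and `next_word_probability` are assumptions for illustration.

```python
from collections import Counter

def padded_bigrams(sentence_tokens):
    """Pair each token with its successor, padding both ends with ''."""
    padded = [""] + sentence_tokens + [""]
    return list(zip(padded, padded[1:]))

sentences = [
    ["the", "squids", "jumped", "out", "of", "the", "suitcases"],
    ["the", "squids", "were", "furious"],
]
bigram_counts = Counter(
    bigram for sentence in sentences for bigram in padded_bigrams(sentence)
)

def next_word_probability(previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous, anything)."""
    total = sum(count for (first, _), count in bigram_counts.items() if first == previous)
    return bigram_counts[(previous, word)] / total if total else 0.0

print(bigram_counts[("the", "squids")])
print(next_word_probability("the", "squids"))
```

With only two sentences, most bigrams occur once, which is the sparsity problem described above in miniature.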

In [26]:
import nltk, re
from nltk.tokenize import word_tokenize
# importing ngrams module from nltk
from nltk.util import ngrams
from collections import Counter
from looking_glass import looking_glass_full_text # just the text 

cleaned = re.sub(r'\W+', ' ', looking_glass_full_text).lower()
tokenized = word_tokenize(cleaned)

# Change the n value to 2:
looking_glass_bigrams = ngrams(tokenized, 2)
looking_glass_bigrams_frequency = Counter(looking_glass_bigrams)

# Change the n value to 3:
looking_glass_trigrams = ngrams(tokenized, 3)
looking_glass_trigrams_frequency = Counter(looking_glass_trigrams)

# n = 1 gives single-word (unigram) counts; try a value greater than 3 to see how sparse longer n-grams become:
looking_glass_ngrams = ngrams(tokenized, 1)
looking_glass_ngrams_frequency = Counter(looking_glass_ngrams)

print("Looking Glass Bigrams:")
print(looking_glass_bigrams_frequency.most_common(10))

print("\nLooking Glass Trigrams:")
print(looking_glass_trigrams_frequency.most_common(10))

print("\nLooking Glass n-grams:")
print(looking_glass_ngrams_frequency.most_common(10))

Looking Glass Bigrams:
[(('of', 'the'), 101), (('said', 'the'), 98), (('in', 'a'), 97), (('in', 'the'), 90), (('as', 'she'), 82), (('you', 'know'), 72), (('a', 'little'), 68), (('the', 'queen'), 67), (('said', 'alice'), 67), (('to', 'the'), 66)]

Looking Glass Trigrams:
[(('the', 'red', 'queen'), 54), (('the', 'white', 'queen'), 31), (('said', 'in', 'a'), 21), (('she', 'went', 'on'), 18), (('said', 'the', 'red'), 17), (('thought', 'to', 'herself'), 16), (('the', 'queen', 'said'), 16), (('said', 'to', 'herself'), 14), (('said', 'humpty', 'dumpty'), 14), (('the', 'knight', 'said'), 14)]

Looking Glass n-grams:
[(('the',), 1575), (('and',), 850), (('a',), 758), (('to',), 728), (('it',), 557), (('she',), 524), (('i',), 509), (('you',), 501), (('of',), 484), (('said',), 470)]


# 4) Topic Models

_Topic modeling is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language_  
* **Term frequency-inverse document frequency** (**tf-idf**) :  
     * _**deprioritize**_ the _most common words_ and 
     * **_prioritize_** the _less frequently_ used terms 
     
      
     
  (When we’re working with a lot of text, we usually don’t want our topics filled 
  with words like “the” and “is”.)   
  The Python libraries gensim and sklearn have modules to handle _tf-idf_.
  
  
* **Latent Dirichlet Allocation** (**LDA**): 
    LDA is a statistical model that takes documents and determines which words keep popping up together in 
    the same contexts (i.e., documents).  
    The sklearn library can tackle this for us.
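
To show what tf-idf does to common and rare words, here is a hand-rolled sketch using the textbook formula tf × log(N / df). It is not what sklearn computes: `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers will differ. The three toy documents and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

documents = [
    ["the", "king", "sent", "the", "photograph"],
    ["the", "son", "met", "the", "father"],
    ["the", "king", "met", "the", "son"],
]

def tf_idf(documents):
    """Score each term per document as term frequency times log(N / document frequency)."""
    n_docs = len(documents)
    # Number of documents that contain each term at least once.
    document_frequency = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(documents)
print(scores[0])
```

"the" appears in every document, so its score is zero, while "photograph" (one document only) scores higher than "king" (two documents).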

In [27]:
import nltk, re
from sherlock_holmes import bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3 # just the text
from preprocessing import preprocess_text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# preparing the text
corpus = [bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3]
preprocessed_corpus = [preprocess_text(chapter) for chapter in corpus]

# Update stop_list:
stop_list = ["say", "see", "holmes", "shall", 
"man", "upon", "know", "quite", "one", 
"well", "could", "would", "take", "may", 
"think", "come", "go", "little", "must", 
"look"]
# filtering topics for stop words
def filter_out_stop_words(corpus):
  no_stops_corpus = []
  for chapter in corpus:
    no_stops_chapter = " ".join([word for word in chapter.split(" ") if word not in stop_list])
    no_stops_corpus.append(no_stops_chapter)
  return no_stops_corpus
filtered_for_stops = filter_out_stop_words(preprocessed_corpus)

# creating the bag of words model
bag_of_words_creator = CountVectorizer()
bag_of_words = bag_of_words_creator.fit_transform(filtered_for_stops)

# creating the tf-idf model
tfidf_creator = TfidfVectorizer(min_df = 0.2)
tfidf = tfidf_creator.fit_transform(preprocessed_corpus)

# creating the bag of words LDA model
lda_bag_of_words_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_bag_of_words = lda_bag_of_words_creator.fit_transform(bag_of_words)

# creating the tf-idf LDA model
lda_tfidf_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_tfidf = lda_tfidf_creator.fit_transform(tfidf)

print("~~~ Topics found by bag of words LDA ~~~")
for topic_id, topic in enumerate(lda_bag_of_words_creator.components_):
  message = "Topic #{}: ".format(topic_id + 1)
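  # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (needs scikit-learn >= 1.0)
  message += " ".join([bag_of_words_creator.get_feature_names_out()[i] for i in topic.argsort()[:-5 :-1]])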
  print(message)

print("\n\n~~~ Topics found by tf-idf LDA ~~~")
for topic_id, topic in enumerate(lda_tfidf_creator.components_):
  message = "Topic #{}: ".format(topic_id + 1)
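  # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (needs scikit-learn >= 1.0)
  message += " ".join([tfidf_creator.get_feature_names_out()[i] for i in topic.argsort()[:-5 :-1]])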
  print(message)

~~~ Topics found by bag of words LDA ~~~
Topic #1: find foot hand good
Topic #2: follow house cry time
Topic #3: hand cigar bad word
Topic #4: mccarthy father son mr
Topic #5: son mccarthy cigar mr
Topic #6: street house room minute
Topic #7: king majesty photograph sherlock
Topic #8: hand photograph woman street
Topic #9: mr hand majesty street
Topic #10: note paper matter write


~~~ Topics found by tf-idf LDA ~~~
Topic #1: slowly foot ten come
Topic #2: briony male read maid
Topic #3: hang last centre title
Topic #4: hat second quietly family
Topic #5: nature field service detail
Topic #6: holmes say upon man
Topic #7: colony pity whole immense
Topic #8: alive prefer bad god
Topic #9: fellow colour majesty listen
Topic #10: gun minute drive nice


# 5) Text Similarity

* **Levenshtein distance**   
    = *minimal edit distance between two words*  
    Distance is calculated as the minimum number of _insertions_, _deletions_, and _substitutions_ that would 
    need to occur for one word to become another 
   
> Ex:
turning “bees” into “beans” would require one _substitution_ (“a” for “e”) and one _insertion_ (“n”), so the Levenshtein distance would be two  
-> Addressing word similarity and misspelling for spellcheck or autocorrect often involves considering the Levenshtein distance

* **Phonetic similarity**  
  _how much two words or phrases sound the same_  
   Is a major challenge within speech recognition

> Ex : English-speaking humans can easily tell from context whether someone said “euthanasia” or “youth in Asia”
  
  
* **Lexical similarity**  
   _the degree to which texts use the same vocabulary and phrases_  
   Is helpful to find out if texts are the same to guard against plagiarism
   

* **Semantic similarity**  
    _the degree to which documents contain similar meaning or topics_  
    Is useful when you want to find (or recommend) an article or book similar to one you recently finished.
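
The Levenshtein distance defined at the start of this section can be computed with a short dynamic-programming routine. This is a sketch of the standard row-by-row algorithm, not NLTK's implementation; the function name `levenshtein` is an assumption for illustration.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions turning source into target."""
    # previous_row[j] = distance between the processed prefix of source and target[:j]
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]  # turning source[:i] into an empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete source_char
                current_row[j - 1] + 1,                   # insert target_char
                previous_row[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("bees", "beans"))
```

Keeping only two rows at a time holds memory proportional to the length of the target word rather than to the full table.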


In [28]:
import nltk
# NLTK has a built-in function
# to check Levenshtein distance:
from nltk.metrics import edit_distance

def print_levenshtein(string1, string2):
  print("The Levenshtein distance from '{0}' to '{1}' is {2}!".format(string1, string2, edit_distance(string1, string2)))

# Check the distance between
print_levenshtein("fart", "target") # Levenshtein distance = 3
# Assign passing strings here:
three_away_from_code = "mate" # Levenshtein distance = 3
two_away_from_chunk = "junk" # Levenshtein distance = 2
# Run the checks
print_levenshtein("code", three_away_from_code)
print_levenshtein("chunk", two_away_from_chunk)

The Levenshtein distance from 'fart' to 'target' is 3!
The Levenshtein distance from 'code' to 'mate' is 3!
The Levenshtein distance from 'chunk' to 'junk' is 2!
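
Lexical similarity, described earlier in this section, can be approximated by comparing vocabularies. One common choice (an assumption here, since the text does not name a measure) is the Jaccard index: the number of shared words divided by the number of distinct words across both texts.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct lowercase words that two texts have in common (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

print(jaccard_similarity("the squids jumped out", "the squids were furious"))
```

This ignores word order and word counts, so it is only a rough signal; plagiarism checks typically also compare longer phrases.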


# 6) Language Prediction & Text Generation

* **Language prediction**  
Is an application of NLP concerned with predicting text given preceding text. _Autosuggest_, _autocomplete_, and _suggested replies_ are common forms of language prediction  
 * Bag of words alone is generally not a great model for language prediction; no matter what the preceding word was, you will just get one of the most commonly used words from your training corpus.
 * **Markov chains** are memory-less and make statistical predictions based entirely on the current _n-gram_ on hand. Rely on Markov chains to predict the statistical likelihood of each following word (or character) based on the training corpus
 >Ex : “I ate so many grilled cheese (_something_)”.  
 _n-gram_ used: Trigram model (n = 3)   
 Markov chain predicts the following word (_something_) as “_sandwiches_” based on the number of times the sequence “grilled cheese sandwiches” has appeared in the training data out of all the times “grilled cheese” has appeared in the training data
 * The **Long Short Term Memory** (**LSTM**) model is a more advanced approach using a neural language model. An LSTM uses deep learning with a network of artificial “cells” that manage memory, making it _better_ suited for text prediction than traditional neural networks.
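
The trigram prediction in the “grilled cheese” example can be sketched by counting which word follows each pair of words. The toy corpus and the helper name `predict_next` are made up for illustration; a real model would be trained on far more text.

```python
from collections import Counter, defaultdict

# Hypothetical training text, already lowercased and tokenizable on spaces.
corpus = (
    "i love grilled cheese sandwiches . "
    "she ate grilled cheese sandwiches . "
    "he likes grilled cheese toast ."
)
tokens = corpus.split()

# Map each pair of consecutive words to a count of the words that follow it.
following = defaultdict(Counter)
for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
    following[(first, second)][third] += 1

def predict_next(first, second):
    """Return the most frequent follower of (first, second) and its relative frequency."""
    choices = following.get((first, second))
    if not choices:
        return None, 0.0  # this pair never appeared in the training text
    word, count = choices.most_common(1)[0]
    return word, count / sum(choices.values())

print(predict_next("grilled", "cheese"))
```

Here “sandwiches” follows “grilled cheese” in two of its three occurrences, so it is predicted with an estimated probability of 2/3.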
  

In [33]:
import nltk, re, random
from nltk.tokenize import word_tokenize
from collections import defaultdict, deque
from document1 import training_doc1 # lose yourself, eminem
from document2 import training_doc2 # linkin park, in the end
from document3 import training_doc3 # queen, bohemian rhapsody

class MarkovChain:
  def __init__(self):
    self.lookup_dict = defaultdict(list)
    self._seeded = False
    self.__seed_me()

  def __seed_me(self, rand_seed=None):
    if self._seeded is not True:
      try:
        if rand_seed is not None:
          random.seed(rand_seed)
        else:
          random.seed()
        self._seeded = True
      except NotImplementedError:
        self._seeded = False
    
  def add_document(self, text):
    # 'text' rather than 'str', so the built-in str type is not shadowed
    preprocessed_list = self._preprocess(text)
    pairs = self.__generate_tuple_keys(preprocessed_list)
    for pair in pairs:
      self.lookup_dict[pair[0]].append(pair[1])
  
  def _preprocess(self, text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokenized = word_tokenize(cleaned)
    return tokenized

  def __generate_tuple_keys(self, data):
    if len(data) < 1:
      return

    for i in range(len(data) - 1):
      yield [ data[i], data[i + 1] ]
      
  def generate_text(self, max_length=50):
    context = deque()
    output = []
    if len(self.lookup_dict) > 0:
      # Note: __init__ already seeded the generator, so _seeded is True and this call has no effect
      self.__seed_me(rand_seed=len(self.lookup_dict))
      chain_head = [list(self.lookup_dict)[0]]
      context.extend(chain_head)
      
      while len(output) < (max_length - 1):
        next_choices = self.lookup_dict[context[-1]]
        if len(next_choices) > 0:
          next_word = random.choice(next_choices)
          context.append(next_word)
          output.append(context.popleft())
        else:
          break
      output.extend(list(context))
    return " ".join(output)

my_markov = MarkovChain()
my_markov.add_document(training_doc1)
my_markov.add_document(training_doc2)
my_markov.add_document(training_doc3)
generated_text = my_markov.generate_text()
print(generated_text)

look out time when i m not that i m like i got so hard and you only one shot or shot do anything you only get one shot or shot do this there s cold product they were before you only one shot do the wind blows doesn t
