# PRE-PROCESSING TUTORIAL (beginner friendly)
Based on the Codecademy courses

### This tutorial will help you become familiar with NLP by explaining its main concepts. I added many comments, explanations and examples to make things easier to understand.

Pre-processing is:
- Cleaning and preparing text data for use in a specific context (ultimate goal is to reduce the text to only the words that you need for your NLP goals)
- Removing noise from data.


While this list is not exhaustive, we will cover a few common approaches for cleaning and processing text data.   

They include:

    . Using Regex & NLTK libraries
    . Removing unnecessary characters and formatting
    . Tokenization – break multi-word strings into smaller components
    . Normalization – a catch-all term for processing data; this includes stemming
      and lemmatization


### MENU:

1) Noise Removal

2) Tokenization

3) Normalization

4) Stemming

5) Lemmatization
* lemmatization 5.1
* lemmatization 5.2
      
6) Review


# 1) Noise Removal  
[More details about Regular Expressions (re)](https://github.com/santinon/big_data/blob/master/Natural_Langage_Processing/Regular%20Expressions%20(RE)%2C%20regex%20or%20regexp.ipynb)

The .sub() method in Python’s regular expression (re) library covers most of your noise removal needs.


The .sub() method has three required arguments:
- *pattern*:  a regular expression that is searched for in the input string. Prefix the string with an *r* to make it a raw string, so that Python passes backslashes to the regex engine unchanged instead of treating them as escape characters.
- *replacement_text* : text that replaces all matches in the input string
- *input* : the input string that will be edited by the *.sub()* method
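
To make the three arguments and the role of the *r* prefix concrete, here is a small sketch. The sample sentence is made up for this example, and the optional `count` argument is an extra that the list above does not cover.

```python
import re

text = "the cat sat on the theatre steps"

# pattern, replacement_text, input
# r'\bthe\b': \b is a regex word boundary, so only the whole word "the" matches
whole_word = re.sub(r'\bthe\b', 'THE', text)

# Without the r prefix, Python turns '\b' into a backspace character before
# the regex engine sees it, so nothing in this text matches
no_raw = re.sub('\bthe\b', 'THE', text)

# The optional count argument limits the number of replacements
first_only = re.sub(r'\bthe\b', 'THE', text, count=1)

print(whole_word)
print(no_raw)
print(first_only)
```

Note that "theatre" is left alone in every case, because the word boundary after "the" does not match inside a longer word.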


In [9]:
import re 

text = "<p>    This is a paragraph</p>"
# removes the HTML <p> and </p> tags (replaces them with an empty string '')
result = re.sub(r'<.?p>', '', text)  # ".?" = zero or one of any character before the p, so both <p> and </p> match

print(result) # note the leading spaces that remain

    This is a paragraph


In [10]:
import re 

text = "    This is a paragraph" # The whitespace consists of four spaces.
# removes the four leading whitespace characters (replaces them with an empty string '')
result = re.sub(r'\s{4}', '', text)

print(result) 

This is a paragraph


In [11]:
import re

headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'

tweet = '@fat_meats, veggies are better than you think.'

headline_no_tag = re.sub(r'<.?h1>', '', headline_one) # ".?" = zero or one of any character before h1, so both <h1> and </h1> match
tweet_no_at = re.sub(r'@', '', tweet)

print(headline_no_tag, "\n") # "\n" adds an empty line
print(tweet_no_at)

Nation's Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini 

fat_meats, veggies are better than you think.
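
The individual substitutions above can be chained into one helper. This is only a sketch: the function name `remove_noise` and the more general tag pattern `<[^>]+>` are my own choices for illustration, not part of the original course material.

```python
import re

def remove_noise(text):
    """Strip HTML-style tags and the '@' of handles, then tidy the whitespace."""
    text = re.sub(r'<[^>]+>', '', text)    # any tag such as <p>, </p> or <h1>
    text = re.sub(r'@(\w+)', r'\1', text)  # "@fat_meats" -> "fat_meats"
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace into one space
    return text.strip()                    # drop leading and trailing whitespace

print(remove_noise("<p>    This is a paragraph</p>"))
print(remove_noise("@fat_meats, veggies are better than you think."))
```

The order matters a little: removing tags first means the whitespace step also cleans up any gaps the tags leave behind.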


# 2) Tokenization

The method for breaking text into smaller components is called tokenization, and the individual components are called tokens.   
In other words, tokenization splits text into units such as words or sentences.

A few common operations that require tokenization include:

    - Finding how many words or sentences appear in text
    - Determining how many times a specific word or phrase exists
    - Accounting for which terms are likely to co-occur
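
The operations listed above can be sketched with the standard library alone. The regex tokenizer below is a simplified stand-in for NLTK's `word_tokenize` (it lowercases the text and drops punctuation), used here only so the example runs without any NLTK data.

```python
import re
from collections import Counter

text = "Tokenize this sentence. Also, tokenize this sentence."

# Simplified stand-in for word_tokenize: lowercase words only, punctuation dropped
tokens = re.findall(r"[a-z']+", text.lower())

# How many words appear, and how often each one occurs
word_counts = Counter(tokens)

# Adjacent pairs (bigrams) give a simple view of which terms co-occur
pair_counts = Counter(zip(tokens, tokens[1:]))

print(len(tokens))
print(word_counts["tokenize"])
print(pair_counts[("this", "sentence")])
```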


In [38]:
# words-tokenization

from nltk.tokenize import word_tokenize

text = "Tokenize this text"
tokenized = word_tokenize(text)

print(tokenized)
print(type(tokenized))

['Tokenize', 'this', 'text']
<class 'list'>


In [22]:
# sentences-tokenization

from nltk.tokenize import sent_tokenize

text = "Tokenize this sentence. Also, tokenize this sentence."
tokenized = sent_tokenize(text)

print(tokenized)

['Tokenize this sentence.', 'Also, tokenize this sentence.']


In [19]:
# both tokenizations are applied in the cell below

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

ecg_text = 'An electrocardiogram is used to record the electrical conduction through a person\'s heart. \
            The readings can be used to diagnose cardiac arrhythmias.'

tokenized_by_word = word_tokenize(ecg_text)
tokenized_by_sentence = sent_tokenize(ecg_text)

print(tokenized_by_word, "\n") # "\n" = empty-line
print(tokenized_by_sentence)

['An', 'electrocardiogram', 'is', 'used', 'to', 'record', 'the', 'electrical', 'conduction', 'through', 'a', 'person', "'s", 'heart', '.', 'The', 'readings', 'can', 'be', 'used', 'to', 'diagnose', 'cardiac', 'arrhythmias', '.'] 

["An electrocardiogram is used to record the electrical conduction through a person's heart.", 'The readings can be used to diagnose cardiac arrhythmias.']


# 3) Normalization

Text normalization is a catch-all term for various text pre-processing tasks. In the next few exercises, we’ll cover a few of them:

* Upper or lowercasing


* _Stopword_ removal: 
   * Stopwords are very common words such as “a”, “an”, and “the”. NLTK provides a built-in list of these words


* **Stemming**: bluntly removing prefixes and suffixes from a word


* **Lemmatization**: replacing a single-word token with its root



In [30]:
# Python’s built-in string methods make a string all uppercase or lowercase

my_string = 'tHiS HaS a MiX oF cAsEs OH shiT'

print(my_string.upper())

print(my_string.lower(), "\n")


#other example : 
brands = 'Salvation IRA, YMCA, Boys & Girls Club of Ireland, feuk. Brexit'

brands_lower = brands.lower()
print(brands_lower)
brands_upper = brands.upper()
print(brands_upper)

THIS HAS A MIX OF CASES OH SHIT
this has a mix of cases oh shit 

salvation ira, ymca, boys & girls club of ireland, feuk. brexit
SALVATION IRA, YMCA, BOYS & GIRLS CLUB OF IRELAND, FEUK. BREXIT


In [22]:
# Stopword removal

from nltk.tokenize import word_tokenize

from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english')) # creates a set (the type) of stopwords

sentence = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"

# tokenize the sentence
word_tokens = word_tokenize(sentence) 

# list comprehension to remove them from a sentence
statement_no_stop = [word for word in word_tokens if word not in stop_words]

# check here
print("tokens with stopwords: \n", word_tokens)
print("\n tokens without stopwords: \n",statement_no_stop)


print("\n 2nd example \n")
# Other example:
from nltk.tokenize import word_tokenize

from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english')) 

survey_text = 'A YouGov study found that American\'s like Italian food more \
               than any other country\'s cuisine.'

tokenized_survey = word_tokenize(survey_text)

text_no_stops = [word for word in tokenized_survey if word not in stop_words]

print("\n tokens with stopwords: \n", tokenized_survey)
print("\n tokens without stopwords: \n", text_no_stops)

tokens with stopwords: 
 ['NBC', 'was', 'founded', 'in', '1926', 'making', 'it', 'the', 'oldest', 'major', 'broadcast', 'network', 'in', 'the', 'USA']

 tokens without stopwords: 
 ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']

 2nd example 


 tokens with stopwords: 
 ['A', 'YouGov', 'study', 'found', 'that', 'American', "'s", 'like', 'Italian', 'food', 'more', 'than', 'any', 'other', 'country', "'s", 'cuisine', '.']

 tokens without stopwords: 
 ['A', 'YouGov', 'study', 'found', 'American', "'s", 'like', 'Italian', 'food', 'country', "'s", 'cuisine', '.']
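
Notice that 'A' survives in the second example: the comparison is case-sensitive and the stopword list holds lowercase entries. One possible fix is to compare the lowercased token. The tiny `stop_words` set below is a hand-written stand-in for NLTK's list, used only to keep the sketch self-contained.

```python
# Hand-written stand-in for NLTK's English stopword list (entries are lowercase)
stop_words = {"a", "that", "more", "than", "any", "other"}

tokens = ['A', 'YouGov', 'study', 'found', 'that', 'American', "'s", 'like',
          'Italian', 'food', 'more', 'than', 'any', 'other', 'country', "'s",
          'cuisine', '.']

# Exact match: 'A' does not equal 'a', so it is kept
case_sensitive = [word for word in tokens if word not in stop_words]

# Compare the lowercased token instead, but keep the original spelling in the output
case_insensitive = [word for word in tokens if word.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only for the comparison keeps proper nouns such as 'YouGov' intact in the result.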


# 4) Stemming

Stemming is the text preprocessing normalization task concerned with bluntly **removing word affixes (prefixes and suffixes)**. 
> ex: stemming would cast the word “going” to “go”.   
(this is a common method used by search engines to improve matching between user input and website hits).
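
To see what "bluntly removing affixes" means, here is a deliberately naive suffix stripper. It is a toy written for this explanation, not the Porter algorithm: the function name `toy_stem` and the three suffixes are assumptions chosen to keep the sketch short.

```python
SUFFIXES = ("ing", "ed", "s")  # toy list, far smaller than a real stemmer's rules

def toy_stem(word):
    """Lowercase the word and chop off the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least two characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["going", "founded", "was", "is", "network"]])
```

Even this toy shows the trade-off: "going" becomes "go", but "was" becomes "wa" because the rule has no idea what the word means.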

NLTK has a built-in stemmer called _PorterStemmer_. You can use it with a _list comprehension_ to stem each word in a tokenized list of words. 

In [8]:
# import and initialize the stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# apply stemmer to each word in a list using a list comprehension
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the',
             'oldest', 'major', 'broadcast', 'network', '.']

stemmed = [stemmer.stem(token) for token in tokenized]

# check
print(stemmed)

# Notice that words like ‘was’ and ‘founded’ became ‘wa’ and ‘found’, respectively.  
# You need to be careful when stemming strings.  
# Words can often be converted to something unrecognizable

['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']


In [32]:
#  import and initialize the stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# import the tokenizers
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize  

populated_island = 'Java is an Indonesian island in the Pacific Ocean. It is the most populated \
                    island in the world, with over 140 million people.'

# "populated_island" tokenization first
island_tokenized = word_tokenize(populated_island)
print(" tokenized sentence: \n", island_tokenized, "\n")

# stemming with a list comprehension
stemmed = [stemmer.stem(token) for token in island_tokenized]
print('"stemmed sentence" : \n', stemmed)

 tokenized sentence: 
 ['Java', 'is', 'an', 'Indonesian', 'island', 'in', 'the', 'Pacific', 'Ocean', '.', 'It', 'is', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'with', 'over', '140', 'million', 'people', '.'] 

"stemmed sentence" : 
 ['java', 'is', 'an', 'indonesian', 'island', 'in', 'the', 'pacif', 'ocean', '.', 'It', 'is', 'the', 'most', 'popul', 'island', 'in', 'the', 'world', ',', 'with', 'over', '140', 'million', 'peopl', '.']


# 5) Lemmatization

Lemmatization is a method for casting words to their **root forms**.   
(this is a more involved process than stemming, because it requires the method to know the _part-of-speech_ for _each word_).   

Since lemmatization requires the _part-of-speech_, it is a less efficient approach than stemming.

###### To take advantage of the power of lemmatization, we need to tag each word in our text with its most likely _part of speech_. We’ll do that in the next part, lemmatization 5.2.

#### lemmatization 5.1

In [24]:
# import NLTK’s WordNetLemmatizer to lemmatize text
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() # initializes instance of "WordNetLemmatizer"

# a list comprehension to apply the lemmatize operation to each word in a list
tokenized = ["NBC", "was", "founded", "in", "1926"]

lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

# check it
print(lemmatized)

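# The result, saved to lemmatized, contains 'wa', while the rest of the words remain the same. Not too useful.
# Without a part of speech, lemmatize() treats every word as a noun, which is why 'was' is not reduced to 'be'.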


['NBC', 'wa', 'founded', 'in', '1926']


In [37]:
# import tokenization library
from nltk.tokenize import word_tokenize

# import lemmatization library
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() # initializes instance of "WordNetLemmatizer"

# text
populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'

# tokenization
tokenized_string = word_tokenize(populated_island)
print(" tokenized sentence: \n", tokenized_string, "\n")

# lemmatization
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokenized_string]

# check
print("lemmatized sentence: \n", lemmatized_words)

 tokenized sentence: 
 ['Indonesia', 'was', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.'] 

lemmatized sentence: 
 ['Indonesia', 'wa', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


#### lemmatization 5.2
#### _Part-of-Speech_ Tagging
To improve lemmatization performance, we need to find the part of speech for each word in our string.

>ex: the 8 major parts of speech in English grammar are noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection.

Detailed code: uncomment the print statements to follow the 4 steps.

In [61]:
# 1/ Import libraries
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
# Import wordnet and Counter
# wordnet: lexical database used to look up the senses (synsets) of a word
# Counter: dictionary subclass that stores elements as keys and their counts as values
from nltk.corpus import wordnet # gives each word's synsets, each tagged with a part of speech
from collections import Counter # counts how often each part of speech appears


# 2/ Get the synsets (via a function here)
# function to get the synsets and then the most likely POS for a word (uncomment the prints to understand)
def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word) # wordnet.synsets(): returns a list of synsets (groups of synonyms), one per sense of the word
    #print("- probable_part_of_speech: ",word, "\n", probable_part_of_speech, "   type:  ", type(probable_part_of_speech))

    pos_counts = Counter() # container: stores elements as dictionary keys
    #print("- pos_counts BEFORE: \n", pos_counts, "   type:  ", type(pos_counts))

# 3/ Use synonyms to determine the most likely part of speech  
    # nouns
    pos_counts["n"] = len( [ word for word in probable_part_of_speech if word.pos()=="n"] ) 
    # verbs
    pos_counts["v"] = len( [ word for word in probable_part_of_speech if word.pos()=="v"] )
     # adjectives
    pos_counts["a"] = len( [ word for word in probable_part_of_speech if word.pos()=="a"] )
    # adverbs (r)
    pos_counts["r"] = len( [ word for word in probable_part_of_speech if word.pos()=="r"] )
  #    PERSONAL NOTES:
  # - ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
  # - word.pos(): returns the part-of-speech tag of a synset ("n", "v", "a", "r" or "s"),
  #   it is not a position in the list
  # - each list comprehension keeps only the synsets whose tag matches, for example:
  #  "[ word for word in probable_part_of_speech if word.pos()=="n"]"
  # - then, takes the length (len) of that list = number of synsets with that POS
  # - pos_counts["type of POS"]: stores that length in the Counter as a key-value pair ( POS: count )
    #print("- pos_counts AFTER: \n", pos_counts, "   type:  ", type(pos_counts))

# 4/ Returns the most common part of speech   
    # most_common(n): returns the n most common elements as a list of (element, count) tuples
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  # [0]: first index takes the top ( POS, count ) tuple from the list
  # [0]: second index takes the POS from that tuple
    #print("- most_likely_part_of_speech: ", most_likely_part_of_speech, "   type:  ", type(most_likely_part_of_speech), "\n")
    return most_likely_part_of_speech # a one-letter POS tag: "n", "v", "a" or "r"


# sentence we want to lemmatize
populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'

# tokenization of the sentence
tokenized_string = word_tokenize(populated_island)

# lemmatization process
lemmatizer = WordNetLemmatizer() # initializes instance of "WordNetLemmatizer"
lemmatized_pos = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized_string] # lemmatization POS based

# check it
print("lemmatization POS based: \n", lemmatized_pos)

lemmatization POS based: 
 ['Indonesia', 'be', 'found', 'in', '1945', '.', 'It', 'contain', 'the', 'most', 'populate', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']
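
The counting step can be looked at in isolation. In the sketch below, the tag lists are made-up stand-ins for the `pos()` values of a word's synsets (they are not real WordNet output), and `most_likely_pos` is a hypothetical helper name. It also shows why a token with no synsets, such as a number, falls back to "n": all counts are zero, and `most_common` returns tied elements in the order they were first inserted.

```python
from collections import Counter

VALID_TAGS = ("n", "v", "a", "r")  # noun, verb, adjective, adverb

def most_likely_pos(pos_tags):
    """Return the most frequent tag among n, v, a, r (ties go to the earliest key)."""
    pos_counts = Counter({tag: 0 for tag in VALID_TAGS})  # "n" is inserted first
    pos_counts.update(tag for tag in pos_tags if tag in VALID_TAGS)
    # most_common(1) -> [(tag, count)]; [0][0] extracts the tag
    return pos_counts.most_common(1)[0][0]

print(most_likely_pos(["v", "v", "n"]))  # mostly verb senses
print(most_likely_pos([]))               # no synsets at all
print(most_likely_pos(["s", "s"]))       # only satellite adjectives, which are not counted
```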


### *Same code as above, but more 'professional' and with fewer comments,* in a single function

In [None]:
# single function: tokenizes a sentence, estimates the POS of each token from its synsets, then lemmatizes
from collections import Counter # counts how often each part of speech appears

from nltk.corpus import wordnet # gives each word's synsets, each tagged with a part of speech
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize 


def lemmatize_via_synonyms(sentence_to_tokenize):
    # tokenization of the sentence
    tokenized_string = word_tokenize(sentence_to_tokenize)

    lemmatizer = WordNetLemmatizer() # initializes instance of "WordNetLemmatizer"
    lemmatized_pos = []
    for token in tokenized_string:
        probable_part_of_speech = wordnet.synsets(token) # gets the synsets of the token
        pos_counts = Counter()
        # count the synsets per part of speech: nouns, verbs, adjectives, adverbs
        for pos in ("n", "v", "a", "r"):
            pos_counts[pos] = len( [ synset for synset in probable_part_of_speech if synset.pos()==pos] )
        # most common part of speech for this token
        most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
        # lemmatization POS based
        lemmatized_pos.append(lemmatizer.lemmatize(token, most_likely_part_of_speech))
    return lemmatized_pos


Ready-to-use code with fewer annotations ("professional")

In [59]:
# 1/ Import libraries
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet # gives each word's synsets, each tagged with a part of speech
from collections import Counter # counts how often each part of speech appears


# 2/ Get the synsets (via a function here)
def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word) # wordnet.synsets(): returns a list of synsets (groups of synonyms), one per sense of the word
    pos_counts = Counter() # container: stores elements as dictionary keys
# 3/ Use synonyms to determine the most likely part of speech  
    pos_counts["n"] = len( [ word for word in probable_part_of_speech if word.pos()=="n"] ) # nouns
    pos_counts["v"] = len( [ word for word in probable_part_of_speech if word.pos()=="v"] ) # verbs
    pos_counts["a"] = len( [ word for word in probable_part_of_speech if word.pos()=="a"] ) # adjectives
    pos_counts["r"] = len( [ word for word in probable_part_of_speech if word.pos()=="r"] ) # adverb

# 4/ Returns the most common part of speech   
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0] # most_common(1): list with the top (POS, count) tuple; [0][0] takes its POS
    return most_likely_part_of_speech # a one-letter POS tag: "n", "v", "a" or "r"


# sentence we want to lemmatize
populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'

# tokenization of the sentence
tokenized_string = word_tokenize(populated_island)

# lemmatization process
lemmatizer = WordNetLemmatizer() # initializes instance of "WordNetLemmatizer"
lemmatized_pos = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized_string] # lemmatization POS based

# check it
print("lemmatization POS based: \n", lemmatized_pos)

lemmatization POS based: 
 ['Indonesia', 'be', 'found', 'in', '1945', '.', 'It', 'contain', 'the', 'most', 'populate', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


It is useful to compare the two results:

. lemmatized sentence:   
 ['Indonesia', 'wa', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']

. lemmatization POS based:   
 ['Indonesia', 'be', 'found', 'in', '1945', '.', 'It', 'contain', 'the', 'most', 'populate', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']
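
A short comparison makes the difference easier to spot than reading the two lists by eye. The three lists below are copied from the outputs shown in this tutorial; the variable name `changed` is just for this sketch.

```python
tokens = ['Indonesia', 'was', 'founded', 'in', '1945', '.', 'It', 'contains', 'the',
          'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',',
          'with', 'over', '140', 'million', 'people', '.']
lemmatized_words = ['Indonesia', 'wa', 'founded', 'in', '1945', '.', 'It', 'contains', 'the',
                    'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',',
                    'with', 'over', '140', 'million', 'people', '.']
lemmatized_pos = ['Indonesia', 'be', 'found', 'in', '1945', '.', 'It', 'contain', 'the',
                  'most', 'populate', 'island', 'in', 'the', 'world', ',', 'Java', ',',
                  'with', 'over', '140', 'million', 'people', '.']

# keep only the positions where the two lemmatizations disagree
changed = [(token, plain, with_pos)
           for token, plain, with_pos in zip(tokens, lemmatized_words, lemmatized_pos)
           if plain != with_pos]

for token, plain, with_pos in changed:
    print(f"{token}: {plain} -> {with_pos}")
```

All four differences are verbs, which is where knowing the part of speech helps the lemmatizer most in this sentence.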

# 6) Review
Let’s review what we covered in this lesson:

   - Text preprocessing is all about cleaning and prepping text data so that it’s ready for other NLP tasks.
   
   
   - Noise removal is a text preprocessing step concerned with removing unnecessary formatting from our text.
   
   
   - Tokenization is a text preprocessing step devoted to breaking up text into smaller units (usually words or discrete terms).
   

   - Normalization is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.
   
   
   - Stemming is the normalization preprocessing task focused on removing word affixes.
   
   
   - Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.