# AI for Business by AIChampionsHub
## Module : Natural Language Processing
## Lesson : 01 - NLP Basics

This NLP course tries to balance theory, practice, and business application.
This and the other NLP notebooks introduce key NLP concepts.
- NLTK : A popular library that helps balance theory and practice.
- Guide to downloading NLTK corpora manually: https://thinkinfi.com/how-to-download-nltk-corpus-manually/

In [4]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download("tests")  # Optional step: other collections include "popular" and "book"

[nltk_data] Downloading collection 'tests'
[nltk_data]    | 
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package porter_test to /root/nltk_data...
[nltk_data]    |   Unzipping stemmers/porter_test.zip.
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package wmt15_eval to /root/nltk_data...
[nltk_data]    |   Unzipping models/wmt15_eval.zip.
[nltk_data]    | Downloading package subjectivity to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/subjectivity.zip.
[nltk_data]    | Downloading package framenet_v17 to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/framenet_v17.zip.
[nltk_data]    | Download

True

In [10]:
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# BASICS

## Step 1 TOKENIZE
A token is a sequence of characters in text that serves as a unit. Tokens can be words, emoticons, hashtags, links, or even individual characters.
*   A basic way of breaking language into tokens is to split the text on whitespace and punctuation.
*   The result is a list of words, numbers, and punctuation marks.
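
The whitespace-and-punctuation rule described above can be sketched with Python's standard `re` module alone. This is only an illustration: `simple_tokenize` is a hypothetical helper name, not an NLTK function, and the sample sentence is borrowed from the text used later in this lesson.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      -> one or more letters, digits, or underscores
    # [^\w\s]  -> any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "A gallon of milk in the U.S. costs $2.99."
tokens = simple_tokenize(sample)
print(tokens)
```

This naive rule breaks `U.S.` and `$2.99` into several pieces, which is one reason to prefer a purpose-built tokenizer such as NLTK's `word_tokenize`.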



In [40]:
text1 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is! More costs "
text = text1

In [50]:
sentences = nltk.sent_tokenize(text)
print(len(sentences))
print("ORIGINAL TEXT : ", text)
print("SENTENCE TOKENISER OUTPUT: ", sentences)

5
ORIGINAL TEXT :  This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is! More costs 
SENTENCE TOKENISER OUTPUT:  ['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!', 'More costs']
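
For comparison, here is a rough sketch of what a naive sentence splitter would do with the same text. It assumes a sentence ends at `.`, `!` or `?` followed by whitespace and uses only the standard `re` module; it is not how Punkt works internally.

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is! More costs ")

# Naive rule: split on whitespace that directly follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text.strip())

print(len(naive_sentences))
for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule yields six pieces because it also splits after the abbreviation `U.S.`, whereas `sent_tokenize` above returns five sentences.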


In [None]:
words = nltk.word_tokenize(text)
print(len(words))
print("ORIGINAL TEXT : ", text)
print("WORD TOKENISER OUTPUT: ", words)

In [58]:
words_unique = set(words)
list(words_unique)[:10]  # set order is arbitrary, so the ten tokens shown may vary

['third', ',', 'in', 'Is', 'This', 'this', 'A', 'the', 'U.S.', '!']

### Find unique words and their frequencies

In [None]:
# Method 1 : Counter tallies how many times each token occurs (set(), used above, only keeps the distinct tokens)
from collections import Counter
Counter(words)

In [57]:
# Method 2 : Frequency of words
from nltk.probability import FreqDist
dist = FreqDist(words)
len(dist)
dist

FreqDist({'the': 3, 'is': 2, 'sentence': 2, '.': 2, 'costs': 2, 'This': 1, 'first': 1, 'A': 1, 'gallon': 1, 'of': 1, ...})

In [65]:
# Keep the tokens that occur more than once and are longer than one character.
# Try other thresholds, e.g. len(w) > 3 or len(w) > 4.
freqwords = {w for w in words if len(w) > 1 and dist[w] > 1}
freqwords

{'costs', 'is', 'sentence', 'the'}
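
The same filter can be wrapped in a small reusable function so the thresholds are easy to vary. The sketch below uses `collections.Counter` from the standard library instead of `FreqDist`; `frequent_tokens`, `min_count` and `min_len` are hypothetical names chosen for illustration, and the token list is made up.

```python
from collections import Counter

def frequent_tokens(tokens, min_count=2, min_len=2):
    """Return the distinct tokens that occur at least min_count times
    and are at least min_len characters long."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count and len(tok) >= min_len}

# A small made-up token list, just to exercise the function
sample_tokens = ["the", "cost", "of", "the", "milk", ".", "the", "cost", "."]
result = frequent_tokens(sample_tokens)
print(sorted(result))
```

With the defaults, the one-character token `.` is dropped even though it occurs twice, matching the `len(w) > 1 and dist[w] > 1` condition used above.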

## Step 2 STEMMING

In [67]:
# different forms of the same "word"
input1 = 'List listed lists listing listings'
words1 = input1.lower().split(' ')
words1

porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

In [68]:
set([porter.stem(t) for t in words1])

{'list'}

# NLP EXAMPLE EXERCISE : 1 - SENTIMENT ANALYSIS

We use sample tweets to perform the analysis.
To use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API: https://developer.twitter.com/en/docs.html

In [7]:
from nltk.test import *

In [12]:
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


In [69]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')   #  returns all of the tweets within a dataset as strings

## STEP 1 : Tokenize the data
Here we can use the Punkt tokenizer. The punkt module is a pre-trained model that helps you tokenize words and sentences. It is trained with an unsupervised algorithm, so it does not need hand-labelled data.

In [73]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [74]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

In [75]:
print(tweet_tokens[0])

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']


## STEP 2 : Normalize the data - Stemming and Lemmatization
*    As we saw before, 'List', 'Lists', 'Listing', etc. may mean similar things, and we don't want to treat them as different words.
*    So normalization is used to group together words with the same meaning but different forms.
*    Normalization in NLP is the process of converting a word to its canonical form.
*    Two popular techniques to explore: stemming and lemmatization.



Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

In this tutorial you will use the process of lemmatization, which normalizes a word using vocabulary and morphological analysis of the words in the text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade-off between speed and accuracy.
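
To make the contrast concrete, here is a deliberately tiny sketch in plain Python. It is neither the Porter algorithm nor the WordNet lemmatizer: `toy_stem` just strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a hand-written dictionary (`TOY_LEMMAS`). All names and entries are made up for illustration.

```python
# Suffixes are checked in this order, so the longest candidate wins.
SUFFIXES = ("ings", "ing", "ed", "s")

def toy_stem(word):
    """Heuristic: chop a known suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A hand-written stand-in for a real vocabulary such as WordNet.
TOY_LEMMAS = {"being": "be", "members": "member", "studies": "study"}

def toy_lemmatize(word):
    """Vocabulary lookup: return the dictionary form if known, else the word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

for w in ["listed", "listings", "being", "studies"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The suffix rule is fast but can produce non-words (`studies` becomes `studie`), while the lookup returns real words only for entries its vocabulary knows about. This mirrors the speed versus accuracy trade-off described above.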

In [None]:
nltk.download('wordnet')  #wordnet is a lexical database for the English language that helps the script determine the base word.
nltk.download('averaged_perceptron_tagger')  # to determine the context of a word in a sentence.

In [77]:
# Before running a lemmatizer, you need to determine the context for each word in your text.
# This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence.

from nltk.tag import pos_tag
from nltk.corpus import twitter_samples

tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(pos_tag(tweet_tokens[0]))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]


From the list of tags, here is the list of the most common items and their meaning:

1.   NNP: Noun, proper, singular
2.   NN: Noun, common, singular or mass
3.   IN: Preposition or conjunction, subordinating
4.   VBG: Verb, gerund or present participle
5.   VBN: Verb, past participle

To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.  
Example : the verb 'being' changes to its root form, 'be', and the noun 'members' changes to 'member'.

In [80]:
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

In [81]:
print(lemmatize_sentence(tweet_tokens[0]))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']
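
The `lemmatize_sentence` function above sends every tag that is not a noun or a verb to `'a'` (adjective). As a possible refinement, sketched here under assumptions rather than taken from the original notebook, the tag mapping can be factored into its own helper that also recognises adverbs. `penn_to_wordnet` is a hypothetical name; the one-letter codes `'n'`, `'v'`, `'a'` and `'r'` are the WordNet part-of-speech codes for noun, verb, adjective and adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by pos_tag) to a WordNet POS code."""
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # assumption: fall back to noun, the lemmatizer's default

# A few (token, tag) pairs copied from the pos_tag output shown earlier
tagged = [("being", "VBG"), ("top", "JJ"), ("engaged", "VBN"),
          ("members", "NNS"), ("my", "PRP$")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Falling back to `'n'` rather than `'a'` is a design choice, not a correction. Either fallback leaves most function words unchanged.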


## STEP 3 : Remove noise from the data

*   Noise is any part of the text that does not add meaning or information to the data.
*   Stop words : Examples of stop words are “is”, “the”, and “a”. They are generally not relevant when processing language, unless a specific use case warrants their inclusion.
*   We use regular expressions and wrap them in a reusable function.
Link for more details : https://docs.python.org/3.6/howto/regex.html

*   Examples
    *   Hyperlinks: All hyperlinks in Twitter are converted to the URL shortener t.co and therefore don't add much value to the analysis.
    *   Replies: Twitter handles appear in certain replies. These usernames are preceded by an @ symbol and likewise add little value.
    *   Punctuation and special characters: While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.
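
The three kinds of noise listed above can be stripped with a couple of regular expressions. The sketch below uses only the standard library and simplified patterns. `strip_noise`, `URL_PATTERN` and `HANDLE_PATTERN` are hypothetical names, and the handle and t.co link in the sample are made up.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")        # simplified: scheme plus any non-space run
HANDLE_PATTERN = re.compile(r"@[A-Za-z0-9_]+")   # Twitter handles such as @some_user

def strip_noise(tokens):
    """Drop URLs, @handles and punctuation-only tokens; lowercase the rest."""
    cleaned = []
    for token in tokens:
        token = URL_PATTERN.sub("", token)
        token = HANDLE_PATTERN.sub("", token)
        # Skip tokens that are now empty or consist only of punctuation characters
        if token and not all(ch in string.punctuation for ch in token):
            cleaned.append(token.lower())
    return cleaned

# Made-up tokens for illustration
sample_tokens = ["@some_user", "Thanks", "for", "the", "link",
                 "https://t.co/abc123", "!", ":)"]
cleaned_sample = strip_noise(sample_tokens)
print(cleaned_sample)
```

This sketch also discards emoticons such as `:)` because they consist only of punctuation. Since emoticons can carry sentiment, whether to drop them depends on the analysis.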



In [None]:
# LIST OF ALL STOPWORDS IN ENGLISH
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# stop_words[:25]
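
Filtering stop words is a simple membership test. The sketch below uses a tiny hand-written set (`TOY_STOP_WORDS`, a stand-in for the full NLTK list) so that it runs without any downloads; `remove_stop_words` is a hypothetical helper name.

```python
# A tiny stand-in for the NLTK list, for illustration only
TOY_STOP_WORDS = frozenset({"is", "the", "a", "in", "of", "this"})

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sample_tokens = ["This", "is", "the", "first", "sentence", "."]
filtered = remove_stop_words(sample_tokens)
print(filtered)
```

`stopwords.words('english')` returns a Python list, so wrapping it in `set(...)` before filtering many tweets makes each membership check faster.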

In [82]:
# Remove noise (URLs, @handles, punctuation, stop words) and lemmatize each token.
# Adapted from a sample on another site.
import re, string
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def fn_Remove_Noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []
    lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

    for token, tag in pos_tag(tweet_tokens):
        # Strip hyperlinks (raw strings avoid invalid-escape warnings for \( and \))
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        # Strip @handles
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # Map the Penn Treebank tag to a WordNet part of speech
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # Keep non-empty tokens that are neither punctuation nor stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

In [None]:
print(fn_Remove_Noise(tweet_tokens[0], stop_words))