In [1]:
text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."

### Tokenisation

Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences. Depending on the task at hand, we can define our own conditions to divide the input text into meaningful tokens. Let's take a look at how to do this.

In [2]:
# Import Libraries

import nltk
nltk.download('punkt')

# Sentence tokenization
from nltk.tokenize import sent_tokenize


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\avtar8\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
sent_tokenize_list = sent_tokenize(text)
print ("\nSentence tokenizer:")
print (sent_tokenize_list)


Sentence tokenizer:
['Are you curious about tokenization?', "Let's see how it works!", 'We need to analyze a couple of sentences with punctuations to see it in action.']


In [4]:
# Create a new WordPunct tokenizer
from nltk.tokenize import WordPunctTokenizer
 
word_punct_tokenizer = WordPunctTokenizer()
print ("\nWord punct tokenizer:")
print (word_punct_tokenizer.tokenize(text))


Word punct tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'", 's', 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']
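
Since the conditions for splitting text can be customized, here is a minimal sketch of a hand-rolled tokenizer that uses only Python's `re` module. The function name `simple_word_punct_tokenize` is made up for illustration, and the pattern is an assumption chosen to mirror the word/punctuation split shown above, not NLTK's own code.

```python
import re

def simple_word_punct_tokenize(text):
    # Match runs of word characters, or runs of characters that are
    # neither word characters nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_word_punct_tokenize("Let's see how it works!")
print(tokens)
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

Changing the regular expression is all it takes to define a different notion of "token", for example keeping contractions such as "Let's" together.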


### Stemming

When we deal with a text document, we encounter different forms of a word. Consider the word "play". This word can appear in various forms, such as "play", "plays", "player", "playing", and so on. These are basically families of words with similar meanings.

During text analysis, it's useful to extract the base form of these words. This will help us in extracting some statistics to analyze the overall text. The goal of stemming is to reduce these different forms into a common base form. This uses a heuristic process to cut off the ends of words to extract the base form. Let's see how to do this in Python.

In [5]:
# Import Libraries

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

In [6]:
words = ['table', 'probably', 'wolves', 'playing', 'is', 'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

In [7]:
# Compare different stemmers
stemmers = ['PORTER', 'LANCASTER', 'SNOWBALL']

In [8]:
# Initialize Stemmer

stemmer_porter = PorterStemmer()
stemmer_lancaster = LancasterStemmer()
stemmer_snowball = SnowballStemmer('english')

In [9]:
formatted_row = '{:>16}' * (len(stemmers) + 1)
print ('\n', formatted_row.format('WORD', *stemmers), '\n')


             WORD          PORTER       LANCASTER        SNOWBALL 



In [10]:
# Let's iterate through the list of words and stem them using the three stemmers:
print ('\n', formatted_row.format('WORD', *stemmers), '\n')

for word in words:
    stemmed_words = [stemmer_porter.stem(word),
        stemmer_lancaster.stem(word),
        stemmer_snowball.stem(word)]
    print (formatted_row.format(word, *stemmed_words))


             WORD          PORTER       LANCASTER        SNOWBALL 

           table            tabl            tabl            tabl
        probably         probabl            prob         probabl
          wolves            wolv            wolv            wolv
         playing            play            play            play
              is              is              is              is
             dog             dog             dog             dog
             the             the             the             the
         beaches           beach           beach           beach
        grounded          ground          ground          ground
          dreamt          dreamt          dreamt          dreamt
        envision           envis           envid           envis


The three stemming algorithms differ mainly in how aggressively they strip word endings. In the output above, the Lancaster stemmer is the strictest: it reduces "probably" to "prob" and "envision" to "envid", where the other two produce "probabl" and "envis". The Porter stemmer is the least strict and Lancaster is the strictest. On this particular word list, Porter and Snowball happen to produce identical stems.
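
To make the idea of "cutting off the ends of words" concrete, here is a toy suffix-stripping stemmer. It is a deliberately naive sketch, not the Porter, Lancaster, or Snowball algorithm; the function name `toy_stem`, the suffix list, and the minimum stem length of three characters are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "es", "ed", "s")):
    # Try each suffix in order and strip the first one that matches,
    # as long as at least three characters remain
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["playing", "beaches", "grounded", "wolves", "dog", "is"]:
    print(w, "==>", toy_stem(w))
```

Even this crude rule set turns "wolves" into "wolv", which shows why stems are not guaranteed to be real words.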

### Lemmatisation

The goal of lemmatization is also to reduce words to their base forms, but it takes a more structured approach. In the previous recipe, we saw that the base forms produced by stemmers are not always real words. For example, the word "wolves" was reduced to "wolv", which is not a real word.

Lemmatization solves this problem by using a vocabulary and a morphological analysis of words. It removes inflectional word endings, such as "ing" or "ed", and returns the base form of a word. This base form is known as the lemma. If you lemmatize the word "wolves", you will get "wolf" as the output. The output depends on whether the token is treated as a verb or a noun. Let's take a look at how to do this in this recipe.

In [11]:
# Import Libraries

from nltk.stem import WordNetLemmatizer

In [12]:
words = ['table', 'probably', 'wolves', 'playing', 'is', 'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

In [13]:
# Compare different lemmatizers
lemmatizers = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']

In [14]:
lemmatizer_wordnet = WordNetLemmatizer()

formatted_row = '{:>24}' * (len(lemmatizers) + 1)
print ('\n', formatted_row.format('WORD', *lemmatizers), '\n')

for word in words:
    lemmatized_words = [lemmatizer_wordnet.lemmatize(word, pos='n'), lemmatizer_wordnet.lemmatize(word, pos='v')]
    print (formatted_row.format(word, *lemmatized_words))


                     WORD         NOUN LEMMATIZER         VERB LEMMATIZER 

                   table                   table                   table
                probably                probably                probably
                  wolves                    wolf                  wolves
                 playing                 playing                    play
                      is                      is                      be
                     dog                     dog                     dog
                     the                     the                     the
                 beaches                   beach                   beach
                grounded                grounded                  ground
                  dreamt                  dreamt                   dream
                envision                envision                envision
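
The table shows that the same word can have different lemmas depending on the part of speech. A minimal way to picture this is a vocabulary lookup keyed by both word and part of speech. The dictionary `TOY_LEMMAS` and the function `toy_lemmatize` below are hypothetical and only contain a few entries copied from the table above; WordNet's real lookup is far richer.

```python
# Hypothetical mini-vocabulary: (word, part of speech) -> lemma
TOY_LEMMAS = {
    ("wolves", "n"): "wolf",
    ("beaches", "n"): "beach",
    ("beaches", "v"): "beach",
    ("playing", "v"): "play",
    ("is", "v"): "be",
    ("dreamt", "v"): "dream",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the vocabulary has no entry
    return TOY_LEMMAS.get((word, pos), word)

for w in ["wolves", "playing", "is", "table"]:
    print(w, "==>", toy_lemmatize(w, "n"), "/", toy_lemmatize(w, "v"))
```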


### Chunking

Chunking refers to dividing the input text into pieces based on an arbitrary condition. It differs from tokenization in that there are no constraints on the pieces, and they do not need to be meaningful at all. Chunking is used very frequently in text analysis: when you deal with really large text documents, you need to divide them into chunks for further analysis. In this recipe, we will divide the input text into a number of pieces, where each piece has a fixed number of words.

In [15]:
# Importing the Libraries

import nltk
nltk.download('brown')
 
import numpy as np
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\avtar8\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [16]:
# Split a text into chunks - divide the text based on spaces
def splitter(data, num_words):
    words = data.split(' ')
    output = []
    cur_count = 0
    cur_words = []
    for word in words:
        cur_words.append(word)
        cur_count += 1
        # Once the current chunk is full, store it and start a new one
        if cur_count == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
            cur_count = 0
    # Store the leftover words; this is an empty string when the
    # word count is an exact multiple of num_words
    output.append(' '.join(cur_words))
    return output


In [17]:
if __name__=='__main__':
    # Read the data from the Brown corpus
    data = ' '.join(brown.words()[:10000])

In [18]:
# Number of words in each chunk
num_words = 1700
chunks = []
counter = 0
text_chunks = splitter(data, num_words)
print ("Number of text chunks =", len(text_chunks))

Number of text chunks = 6
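
The same fixed-size chunking can also be sketched with list slicing. This is an alternative formulation, not the `splitter` function used in this notebook; the name `chunk_words` and the sample data are made up for illustration.

```python
def chunk_words(words, num_words):
    # Step through the word list in strides of num_words and join each slice
    return [' '.join(words[i:i + num_words])
            for i in range(0, len(words), num_words)]

sample_words = ['w' + str(i) for i in range(10)]
sample_chunks = chunk_words(sample_words, 4)
print(len(sample_chunks))   # 3 chunks: 4 + 4 + 2 words
print(sample_chunks[-1])    # w8 w9
```

One difference worth noting: slicing never yields an empty trailing chunk when the number of words is an exact multiple of the chunk size.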


### Bag-of-Words

When we deal with text documents that contain millions of words, we need to convert them into some kind of numeric representation. The reason for this is to make them usable for machine learning algorithms. These algorithms need numerical data so that they can analyze them and output meaningful information.

This is where the bag-of-words approach comes in. It is a model that learns a vocabulary from all the words in all the documents. After this, it models each document by building a histogram of the words in that document over the learned vocabulary.
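
Before turning to a library, the idea can be sketched with the standard library alone. The two toy documents and all variable names below are assumptions made for illustration.

```python
from collections import Counter

toy_docs = ["the cat sat on the mat", "the dog sat"]
tokenized = [doc.split() for doc in toy_docs]

# Learn the vocabulary from all words in all documents
vocabulary = sorted(set(word for doc in tokenized for word in doc))

# Model each document as a histogram over the vocabulary
histograms = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(histograms)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row is one document and each column is one vocabulary word, which is the same layout as a document-term matrix.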

In [19]:
# Importing the Libraries

import numpy as np
from nltk.corpus import brown
#from chunking import splitter

In [21]:
if __name__=='__main__':
    # Read the data from the Brown corpus
    data = ' '.join(brown.words()[:10000])

In [22]:
# Number of words in each chunk
num_words = 2000

chunks = []
counter = 0

text_chunks = splitter(data, num_words)

In [23]:
# Create a dictionary that is based on these text chunks
for text in text_chunks:
    chunk = {'index': counter, 'text': text}
    chunks.append(chunk)
    counter += 1

In [24]:
# The next step is to extract a document-term matrix.
# This is a matrix that counts the number of occurrences of each word in each document.
# We will use scikit-learn for this because it is better suited to this task than NLTK.

# Extract document term matrix
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=.95)
doc_term_matrix = vectorizer.fit_transform([chunk['text'] for chunk in chunks])

In [25]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names_out())
print ("\nVocabulary:")
print (vocab)


Vocabulary:
['about' 'after' 'against' 'aid' 'all' 'also' 'an' 'and' 'are' 'as' 'at'
 'be' 'been' 'before' 'but' 'by' 'committee' 'congress' 'did' 'each'
 'education' 'first' 'for' 'from' 'general' 'had' 'has' 'have' 'he'
 'health' 'his' 'house' 'in' 'increase' 'is' 'it' 'last' 'made' 'make'
 'may' 'more' 'no' 'not' 'of' 'on' 'one' 'only' 'or' 'other' 'out' 'over'
 'pay' 'program' 'proposed' 'said' 'similar' 'state' 'such' 'take' 'than'
 'that' 'the' 'them' 'there' 'they' 'this' 'time' 'to' 'two' 'under' 'up'
 'was' 'were' 'what' 'which' 'who' 'will' 'with' 'would' 'year' 'years']


In [26]:
print ("\nDocument term matrix:")
chunk_names = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4']
formatted_row = '{:>12}' * (len(chunk_names) + 1)
print ('\n', formatted_row.format('Word', *chunk_names), '\n')

for word, item in zip(vocab, doc_term_matrix.T):
    # 'item' is a 'csr_matrix' data structure
    output = [str(x) for x in item.data]
    print (formatted_row.format(word, *output))


Document term matrix:

         Word     Chunk-0     Chunk-1     Chunk-2     Chunk-3     Chunk-4 

       about           1           1           1           1           3
       after           2           3           2           1           3
     against           1           2           2           1           1
         aid           1           1           1           3           5
         all           2           2           5           2           1
        also           3           3           3           4           3
          an           5           7           5           7          10
         and          34          27          36          36          41
         are           5           3           6           3           2
          as          13           4          14          18           4
          at           5           7           9           3           6
          be          20          14           7          10          18
        been           7

### Building a Text Classifier

The goal of text classification is to categorize text documents into different classes. This is an extremely important analysis technique in NLP. We will use a technique based on a statistic called tf-idf, which stands for term frequency-inverse document frequency. This statistic helps us understand how important a word is to a document within a set of documents, and it serves as the feature vector that's used to categorize documents. You can learn more about it at http://www.tfidf.com.

In [27]:
# Import Data
from sklearn.datasets import fetch_20newsgroups

In [28]:
# These categories are available as part of the news groups dataset that we just imported

category_map = {'misc.forsale': 'Sales', 'rec.motorcycles': 'Motorcycles',
        'rec.sport.baseball': 'Baseball', 'sci.crypt': 'Cryptography',
        'sci.space': 'Space'}

In [29]:
# Load the training data based on the categories
training_data = fetch_20newsgroups(subset='train',
        categories=category_map.keys(), shuffle=True, random_state=7)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [30]:
# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Extracting Features
vectorizer = CountVectorizer()
X_train_termcounts = vectorizer.fit_transform(training_data.data)
print ("\nDimensions of training data:", X_train_termcounts.shape)


Dimensions of training data: (2968, 40605)


In [31]:
# Training the Classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

In [32]:
# Testing the Input Sentence
input_data = [
    "The curveballs of right handed pitchers tend to curve to the left",
    "Caesar cipher is an ancient form of encryption",
    "This two-wheeler is really good on slippery roads"
]

In [33]:
# Define the tf-idf transformer object and train it

# tf-idf transformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_termcounts)

# Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(X_train_tfidf, training_data.target)

# Input data Transformation using word counts
X_input_termcounts = vectorizer.transform(input_data)

# Transform the input data using the tf-idf transformer
X_input_tfidf = tfidf_transformer.transform(X_input_termcounts)

# Predict the output categories
predicted_categories = classifier.predict(X_input_tfidf)

In [34]:
# Print the outputs
for sentence, category in zip(input_data, predicted_categories):
    print ('\nInput:', sentence, '\nPredicted category:', \
            category_map[training_data.target_names[category]])


Input: The curveballs of right handed pitchers tend to curve to the left 
Predicted category: Baseball

Input: Caesar cipher is an ancient form of encryption 
Predicted category: Cryptography

Input: This two-wheeler is really good on slippery roads 
Predicted category: Motorcycles
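
To give a feel for what a multinomial Naive Bayes classifier does with word counts, here is a small standard-library sketch. It is not scikit-learn's implementation: it works on raw counts rather than tf-idf weights, it assumes add-one (Laplace) smoothing, and the function names and toy data are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(tokenized_docs, labels):
    class_counts = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)    # word counts per class
    vocabulary = set()
    for tokens, label in zip(tokenized_docs, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, tokens, alpha=1.0):
    class_counts, word_counts, vocabulary = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        # Start from the log prior of the class
        score = math.log(n_label / n_docs)
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        for token in tokens:
            if token in vocabulary:   # ignore words never seen in training
                score += math.log((word_counts[label][token] + alpha) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

toy_docs = [d.split() for d in [
    "pitcher throws curveball", "batter hits ball",
    "cipher encrypts message", "key decrypts cipher"]]
toy_labels = ["Baseball", "Baseball", "Cryptography", "Cryptography"]

toy_model = train_naive_bayes(toy_docs, toy_labels)
print(predict_naive_bayes(toy_model, "the cipher key".split()))
```

The predicted class is simply the one whose prior and per-word likelihoods give the highest total log score.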


#### The Workings of TF-IDF

The tf-idf technique is used frequently in information retrieval. The goal is to understand the importance of each word within a document. We want to identify words that occur many times in a document. At the same time, common words like “is” and “be” don't really reflect the nature of the content, so we need to extract the words that are true indicators. The importance of a word increases with the number of times it appears in a document, but this is offset by how frequently the word appears across all the documents. These two effects tend to balance each other out. We extract the term counts from each document. Once we convert these to feature vectors, we train the classifier to categorize the documents.

The term frequency (TF) measures how frequently a word occurs in a given document. As multiple documents differ in length, the numbers in the histogram tend to vary a lot. So, we need to normalize this so that it becomes a level playing field. To achieve normalization, we divide term-frequency by the total number of words in a given document.
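
The normalization described above can be sketched in a few lines. The function name `term_frequency` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def term_frequency(tokens):
    # Count each word, then divide by the total number of words
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("the cat sat on the mat".split())
print(tf["the"])   # "the" accounts for 2 of the 6 words
print(tf["cat"])   # "cat" accounts for 1 of the 6 words
```

Because each value is a fraction of the document's length, the values for one document always sum to 1, regardless of how long the document is.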

The inverse document frequency (IDF) measures the importance of a given word. When we compute TF, all words are considered to be equally important. To counterbalance the frequencies of commonly occurring words, we need to weigh them down and scale up the rare ones. We calculate the ratio of the number of documents containing the given word to the total number of documents. IDF is calculated by taking the negative logarithm of this ratio.
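
Here is a minimal sketch of the IDF calculation and of combining it with TF. The function name and toy documents are assumptions. Note that scikit-learn's `TfidfTransformer` uses a smoothed variant of this formula by default and normalizes the resulting vectors, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

def inverse_document_frequency(tokenized_docs):
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))   # count each word once per document
    # log(N / df) is the same as the negative logarithm of df / N
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

docs = [d.split() for d in ["the cat sat", "the dog ran", "the bird sang"]]
idf = inverse_document_frequency(docs)

# tf-idf for the first document: normalized term frequency times IDF
first_doc = docs[0]
tf = {w: c / len(first_doc) for w, c in Counter(first_doc).items()}
tfidf = {w: tf[w] * idf[w] for w in tf}
print(tfidf)
```

In this toy corpus, "the" appears in every document, so its IDF (and therefore its tf-idf weight) is zero, while "cat" keeps a positive weight.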

For example, simple words, such as "is" or "the" tend to appear a lot in various documents. However, this doesn't mean that we can characterize the document based on these words. At the same time, if a word appears a single time, this is not useful either. So, we look for words that appear a number of times, but not so much that they become noisy. This is formulated in the tf-idf technique and used to classify documents. Search engines frequently use this tool to order the search results by relevance.

### Gender Identification

Identifying the gender of a name is an interesting task in NLP. We will use the heuristic that the last few characters in a name are its defining characteristic. For example, if the name ends with "la", it's most likely a female name, such as "Angela" or "Layla". On the other hand, if the name ends with "im", it's most likely a male name, such as "Tim" or "Jim". As we are not sure of the exact number of characters to use, we will experiment with this. Let's see how to do it.

In [35]:
# Importing the Libraries

import nltk
nltk.download('names')
import random
from nltk.corpus import names
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\avtar8\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\names.zip.


In [36]:
# Extract features from the input word
def gender_features(word, num_letters=2):
    return {'feature': word[-num_letters:].lower()}
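
As a quick illustration of what this feature extractor returns, the snippet below repeats the function so that it runs on its own and applies it to the example names mentioned earlier.

```python
def gender_features(word, num_letters=2):
    return {'feature': word[-num_letters:].lower()}

for name in ['Angela', 'Layla', 'Tim', 'Jim']:
    print(name, '==>', gender_features(name))

# Using only the last letter instead of the last two
print(gender_features('Tim', num_letters=1))   # {'feature': 'm'}
```

Each name is reduced to a single-entry dictionary, which is the input format the NLTK classifier works with.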

In [42]:
# Defining the Main Function

if __name__=='__main__':
    # Extract labeled names
    labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
            [(name, 'female') for name in names.words('female.txt')])
    random.seed(7)
    random.shuffle(labeled_names)
    input_names = ['Leonardo', 'Amy', 'Sam']
    # Sweeping the parameter space
    for i in range(1, 5):
        print ('\nNumber of letters:', i)
        featuresets = [(gender_features(n, i), gender) for (n, gender) in labeled_names]
        # Divide this into train and test datasets
        train_set, test_set = featuresets[500:], featuresets[:500]
        classifier = NaiveBayesClassifier.train(train_set)
        # Print classifier accuracy
        print ('Accuracy ==>', str(100 * nltk_accuracy(classifier, test_set)) + str('%'))

        # Predict outputs for new inputs
        for name in input_names:
            print (name, '==>', classifier.classify(gender_features(name, i)))


Number of letters: 1
Accuracy ==> 76.2%
Leonardo ==> male
Amy ==> female
Sam ==> male

Number of letters: 2
Accuracy ==> 78.60000000000001%
Leonardo ==> male
Amy ==> female
Sam ==> male

Number of letters: 3
Accuracy ==> 76.6%
Leonardo ==> male
Amy ==> female
Sam ==> female

Number of letters: 4
Accuracy ==> 70.8%
Leonardo ==> male
Amy ==> female
Sam ==> female


### Sentiment Analysis

Sentiment analysis is one of the most popular applications of NLP. It refers to the process of determining whether a given piece of text is positive or negative. In some variations, we consider "neutral" as a third option. This technique is commonly used to discover how people feel about a particular topic, and it is applied to user feedback from many sources, such as marketing campaigns, social media, e-commerce customers, and so on.

In [43]:
# Importing the Libraries

import nltk
nltk.download('movie_reviews')
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\avtar8\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [44]:
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

In [46]:
# Training Data

if __name__=='__main__':
    # Load positive and negative reviews
    positive_fileids = movie_reviews.fileids('pos')
    negative_fileids = movie_reviews.fileids('neg')
    
    # Separate the positive and negative reviews and extract their features
    
    features_positive = [(extract_features(movie_reviews.words(fileids=[f])),
            'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])),
            'Negative') for f in negative_fileids]
    
    # Split the data into train and test (80/20)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))
    threshold_negative = int(threshold_factor * len(features_negative))

    # Build the training and test sets
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print ("\nNumber of training datapoints:", len(features_train))
    print ("Number of test datapoints:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print ("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))
    
    # Get the Most Informative Word
    print ("\nTop 10 most informative words:")
    for item in classifier.most_informative_features()[:10]:
        print (item[0])
        
    # Create Sample input reviews
    input_reviews = [
        "It is an amazing movie",
        "This is a dull movie. I would never recommend it to anyone.",
        "The cinematography is pretty great in this movie",
        "The direction was terrible and the story was all over the place"
    ]

    # Running on Classifiers for Predictions
    
    print ("\nPredictions:")
    for review in input_reviews:
        print ("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        
        # Output
        print ("Predicted sentiment:", pred_sentiment)
        print ("Probability:", round(probdist.prob(pred_sentiment), 2))


Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 0.735

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
astounding
avoids
fascination
seagal
darker

Predictions:

Review: It is an amazing movie
Predicted sentiment: Positive
Probability: 0.61

Review: This is a dull movie. I would never recommend it to anyone.
Predicted sentiment: Negative
Probability: 0.77

Review: The cinematography is pretty great in this movie
Predicted sentiment: Positive
Probability: 0.67

Review: The direction was terrible and the story was all over the place
Predicted sentiment: Negative
Probability: 0.63


#### Working of the Classifier

We use NLTK's Naive Bayes classifier for our task here. In the feature extractor function, we basically extract all the unique words. However, the NLTK classifier needs the data to be arranged in the form of a dictionary. Hence, we arranged it in such a way that the NLTK classifier object can ingest it.
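
To see the dictionary format concretely, the snippet below repeats the feature extractor so that it runs on its own and applies it to a made-up sentence.

```python
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

features = extract_features("a great great movie".split())
print(features)   # {'a': True, 'great': True, 'movie': True}
```

Repeated words collapse into a single key, so the features record only whether a word is present, not how often it occurs.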

Once we divide the data into training and testing datasets, we train the classifier to categorize the sentences into positive and negative. If you look at the top informative words, you can see that we have words such as "outstanding" to indicate positive reviews and words such as "insulting" to indicate negative reviews. This is interesting information because it tells us what words are being used to indicate strong reactions.

### Topic Modelling

Topic modeling refers to the process of identifying hidden patterns in text data. The goal is to uncover some hidden thematic structure in a collection of documents. This will help us in organizing our documents in a better way so that we can use them for analysis. This is an active area of research in NLP. You can learn more about it at http://www.cs.columbia.edu/~blei/topicmodeling.html.

In [47]:
# Import Libraries

import nltk
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\avtar8\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [48]:
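# Load input data
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline; slicing with [:-1] would
            # drop a real character if the last line has no newline
            data.append(line.rstrip('\n'))

    return data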

In [50]:
# Class to preprocess text
class Preprocessor(object):
    # Initialize various operators
    def __init__(self):
        # Create a regular expression tokenizer
        self.tokenizer = RegexpTokenizer(r'\w+')
        # get the list of stop words
        self.stop_words_english = stopwords.words('english')
        # Create a Snowball stemmer
        self.stemmer = SnowballStemmer('english')
    
    # Define a processor function that takes care of tokenization, stop word removal, and stemming
    def process(self, input_text):
        # Tokenize the string
        tokens = self.tokenizer.tokenize(input_text.lower())
        
        # Remove the stop words
        tokens_stopwords = [x for x in tokens if x not in self.stop_words_english]
        
        # Perform stemming on the tokens
        tokens_stemmed = [self.stemmer.stem(x) for x in tokens_stopwords]

        # Return the processed tokens
        return tokens_stemmed

In [53]:
# Writing the Main File

if __name__=='__main__':
    # File containing linewise input data
    input_file = 'data_topic_modeling.txt'
 
    # Load data
    data = load_data(input_file)
    
    # Create a preprocessor object
    preprocessor = Preprocessor()
    
    # Create a list for processed documents
    processed_tokens = [preprocessor.process(x) for x in data]

    # Create a dictionary based on the tokenized documents
    dict_tokens = corpora.Dictionary(processed_tokens)
    
    # Create a document-term matrix
    corpus = [dict_tokens.doc2bow(text) for text in processed_tokens]
    
    # Generate the LDA model based on the corpus we just created
    num_topics = 2
    num_words = 4
    
    ldamodel = models.ldamodel.LdaModel(corpus,
            num_topics=num_topics, id2word=dict_tokens, passes=25)

    # Once the model identifies the two topics, we can see how it separates them
    # by looking at the most contributing words
    
    print ("Most contributing words to the topics:")
    for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
        print ("\nTopic", item[0], "==>", item[1])


Most contributing words to the topics:

Topic 0 ==> 0.056*"need" + 0.033*"order" + 0.033*"younger" + 0.033*"talent"

Topic 1 ==> 0.058*"need" + 0.035*"parti" + 0.035*"make" + 0.035*"sure"
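
The `corpus` passed to the LDA model is a list of bag-of-words vectors, one per document, where each vector is a list of `(token_id, count)` pairs. The sketch below mimics that structure with the standard library only. It is not gensim's implementation, the token IDs gensim assigns may differ, and the toy documents and the helper `to_bow` are made up for illustration.

```python
from collections import Counter

# Toy documents, already tokenized and stemmed
processed = [["need", "order", "talent"], ["need", "parti", "need"]]

# Assign an integer ID to each distinct token, in order of first appearance
token_to_id = {}
for doc in processed:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))

def to_bow(doc):
    # Count token IDs and return sorted (token_id, count) pairs
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in processed]
print(token_to_id)   # {'need': 0, 'order': 1, 'talent': 2, 'parti': 3}
print(toy_corpus)    # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
```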
