### Loading the Data

In [1]:
import nltk
import pandas as pd
nltk.download('wordnet')
# Read the article inside a context manager so the file handle is closed afterwards
with open('../data/C50train/JoeOrtiz/242939newsML.txt') as f:
    Text = f.read()
Text

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/puneet.jain/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


"After a five year struggle, creditors of the collapsed, fraud-ridden BCCI will receive a payment of $2.65 billion on Tuesday, equal to 24.5 percent of their claims, a spokesman for the liquidators said on Monday.\nBank of Credit and Commerce International, founded in 1972, was closed by central banks in 1991 and collapsed with debts of more than $12 billion when evidence of massive fraud and money laundering was unearthed leading to a tangled web of litigation which shows no sign of reaching an early conclusion.\nBCCI had assets of $24 billion and operations in 71 countries at the time of its collapse.\nLiquidator Deloitte and Touche said a further payment, reportedly of 10 percent, of the admitted claims which total some $10.5 billion should be made in the next 12 to 16 months.\nThe gross fund of amounts recovered by the liquidators stands at around $4.0 billion and includes $1.5 billion paid by BCCI's majority shareholder, the government of Abu Dhabi, which will pay a further $250 m

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 1 
***
### Instructions
* Perform a sentence tokenization on the above data using `sent_tokenize()` and store it in a variable called '**Sent**'

In [2]:
import nltk
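# sent_tokenize needs NLTK's Punkt sentence models
# (nltk.download('punkt'), or 'punkt_tab' on recent NLTK releases)
Sent = nltk.sent_tokenize(Text)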
Sent

['After a five year struggle, creditors of the collapsed, fraud-ridden BCCI will receive a payment of $2.65 billion on Tuesday, equal to 24.5 percent of their claims, a spokesman for the liquidators said on Monday.',
 'Bank of Credit and Commerce International, founded in 1972, was closed by central banks in 1991 and collapsed with debts of more than $12 billion when evidence of massive fraud and money laundering was unearthed leading to a tangled web of litigation which shows no sign of reaching an early conclusion.',
 'BCCI had assets of $24 billion and operations in 71 countries at the time of its collapse.',
 'Liquidator Deloitte and Touche said a further payment, reportedly of 10 percent, of the admitted claims which total some $10.5 billion should be made in the next 12 to 16 months.',
 "The gross fund of amounts recovered by the liquidators stands at around $4.0 billion and includes $1.5 billion paid by BCCI's majority shareholder, the government of Abu Dhabi, which will pay a f

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 2 
***
**Bag of Words** <br>
In this task, we will apply the basic NLP operations of tokenizing, removing stop words, and lemmatizing to our data. We will also build a list of the most frequent words.

### Instructions
- Iterate over every sentence in the list **Sent** using a for loop and convert each sentence into
    - lower case
    - and then tokenize it using the instantiated `RegexpTokenizer` object
- Remove the stopwords from the tokens
- Lemmatize the remaining tokens using `WordNetLemmatizer()` and save the result in `lemmatized_tokens`
- Append `lemmatized_tokens` to the list called **Text**
- Convert `Counter(lemmatized_tokens)` into a dictionary and save it in a variable called `BoW_dict`
- Sort `BoW_dict` in descending order of frequency using the `sorted()` function with the arguments `BoW_dict.items()`, `key=operator.itemgetter(1)`, `reverse=True`. Store the result in a variable called `sorted_d`
- Finally, append `sorted_d` to the list called **Texts**
- Print `Texts` to check the words of each sentence with their frequencies in descending order
- Print the top 10 words from `sorted_d` (after the loop it holds the counts for the last sentence)
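
The steps above count words one sentence at a time. As a minimal sketch, assuming you also want frequencies across the whole document, the per-sentence token lists can be accumulated in a single `Counter`. The token lists and the name `doc_counts` are illustrative assumptions, not part of the exercise data.

```python
from collections import Counter

# Hypothetical per-sentence lemmatized tokens, standing in for the list `Text`
sentence_tokens = [
    ["creditor", "bcci", "receive", "payment", "billion"],
    ["bank", "credit", "bank", "billion", "fraud"],
    ["bcci", "asset", "billion", "country"],
]

# Accumulate counts over all sentences instead of one sentence at a time
doc_counts = Counter()
for tokens in sentence_tokens:
    doc_counts.update(tokens)

# most_common(n) returns the n most frequent (word, count) pairs
top_words = [word for word, _ in doc_counts.most_common(3)]
print(doc_counts.most_common(3))
```

A document-level count like this gives a single ranking for the whole article, whereas `sorted_d` in the exercise is rebuilt for every sentence.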

In [3]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
import operator

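# Split on runs of word characters, which also drops punctuation
tokenizer = RegexpTokenizer(r'\w+')

# English stop words (requires the NLTK 'stopwords' corpus)
en_stop = set(stopwords.words('english'))

Lema = WordNetLemmatizer()

Texts = []  # per-sentence (word, count) pairs, most frequent first
Text = []   # per-sentence lemmatized tokens (rebinds `Text`, which held the raw article)
for sentence in Sent:
    raw = sentence.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [tok for tok in tokens if tok not in en_stop]
    lemmatized_tokens = [Lema.lemmatize(tok) for tok in stopped_tokens]
    Text.append(lemmatized_tokens)
    BoW_dict = dict(Counter(lemmatized_tokens))
    sorted_d = sorted(BoW_dict.items(), key=operator.itemgetter(1), reverse=True)
    Texts.append(sorted_d)
print(Texts)
# After the loop, sorted_d holds the counts for the last sentence only
print("\nTop 10 words:\n", sorted_d[:10])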

[[('five', 1), ('year', 1), ('struggle', 1), ('creditor', 1), ('collapsed', 1), ('fraud', 1), ('ridden', 1), ('bcci', 1), ('receive', 1), ('payment', 1), ('2', 1), ('65', 1), ('billion', 1), ('tuesday', 1), ('equal', 1), ('24', 1), ('5', 1), ('percent', 1), ('claim', 1), ('spokesman', 1), ('liquidator', 1), ('said', 1), ('monday', 1)], [('bank', 2), ('credit', 1), ('commerce', 1), ('international', 1), ('founded', 1), ('1972', 1), ('closed', 1), ('central', 1), ('1991', 1), ('collapsed', 1), ('debt', 1), ('12', 1), ('billion', 1), ('evidence', 1), ('massive', 1), ('fraud', 1), ('money', 1), ('laundering', 1), ('unearthed', 1), ('leading', 1), ('tangled', 1), ('web', 1), ('litigation', 1), ('show', 1), ('sign', 1), ('reaching', 1), ('early', 1), ('conclusion', 1)], [('bcci', 1), ('asset', 1), ('24', 1), ('billion', 1), ('operation', 1), ('71', 1), ('country', 1), ('time', 1), ('collapse', 1)], [('10', 2), ('liquidator', 1), ('deloitte', 1), ('touche', 1), ('said', 1), ('payment', 1), ('

In [6]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/puneet.jain/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 3
***
Since nouns are important in the topic modeling process, we will pick out the top nouns from the bag of words we created in the last task.

### Instructions
- Join the previously created bag of words `lemmatized_tokens` back into a string using the `join()` method and store the result in `BoW_joined`.
- Convert `BoW_joined` into a textblob using `TextBlob()` and store the result in `blob`.
- Print `blob.tags` to look at the part-of-speech tags associated with the words.
- Get the tags of all the words from `lemmatized_tokens` using `blob.tags` and store the result in a variable called `tags`.
- From `tags`, extract the words tagged `NN` and store them in a list called `nouns`.
- The 10 most frequent words are already stored in a list called `top_words`.
- Compare the two lists `top_words` and `nouns` and store their common elements in a new list called `top_nouns`.
- Print `top_nouns` to see the most commonly appearing nouns.
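
The filtering and comparison steps can be sketched without TextBlob. The example below assumes a hand-written list of `(word, tag)` pairs in the same shape as `blob.tags`; both that list and the `top_words` values are hypothetical.

```python
# Hypothetical (word, tag) pairs in the same shape as `blob.tags`
tags = [("report", "NN"), ("high", "JJ"), ("court", "NN"),
        ("fees", "NNS"), ("said", "VBD"), ("court", "NN")]

# Hypothetical list of the most frequent words, most frequent first
top_words = ["court", "said", "report", "high"]

# Keep words whose tag is exactly "NN", as the exercise asks
nouns = [word for word, tag in tags if tag == "NN"]

# Walk top_words so the result keeps frequency order and has no duplicates
noun_set = set(nouns)
top_nouns = [word for word in top_words if word in noun_set]
print(top_nouns)  # ['court', 'report']
```

Iterating over `top_words` instead of calling `list(set(...))` keeps the result in a stable, frequency-based order.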

In [7]:
from collections import Counter
import operator
import numpy as np
import nltk
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')  

#Extracting the top 10 words with their count
top_10 = sorted_d[:10]

#Storing only the top 10 words
top_words=[]
for x in top_10:
    top_words.append(x[0])


#Code begins

# Joining the list back to sentences
BoW_joined = " ".join(lemmatized_tokens)

# Converting the data to textblob
blob = TextBlob(BoW_joined)

# Print the first 10 tags
print("First 10 tags:\n" ,blob.tags[:10])

#Extracting the tags
tags = blob.tags

#Initialising an empty list
nouns = []

#Extracting the words with NN tags (singular or mass nouns; NNS, NNP and NNPS are not matched)
for x in tags:
    if x[1]=="NN":
        nouns.append(x[0])
        

#Comparing the two lists to extract the common elements        
top_nouns=[x for x in nouns if x in top_words]
top_nouns  = list(set(top_nouns))

print("\nTop Nouns:", top_nouns)

First 10 tags:
 [('report', 'NN'), ('high', 'JJ'), ('court', 'NN'), ('earlier', 'RBR'), ('year', 'NN'), ('liquidator', 'NN'), ('said', 'VBD'), ('legal', 'JJ'), ('fee', 'NN'), ('liquidation', 'NN')]

Top Nouns: ['court', 'fee', 'report', 'liquidator', 'year', 'liquidation']


<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 4
***
Use the `Dictionary()` class in the `corpora` module to assign a unique integer id to every word, then print the assigned ids using the `.token2id` attribute.

In [8]:
from gensim import corpora, models
dictionary = corpora.Dictionary(Text)
print(dictionary)
print(dictionary.token2id)

Dictionary(171 unique tokens: ['2', '24', '5', '65', 'bcci']...)
{'2': 0, '24': 1, '5': 2, '65': 3, 'bcci': 4, 'billion': 5, 'claim': 6, 'collapsed': 7, 'creditor': 8, 'equal': 9, 'five': 10, 'fraud': 11, 'liquidator': 12, 'monday': 13, 'payment': 14, 'percent': 15, 'receive': 16, 'ridden': 17, 'said': 18, 'spokesman': 19, 'struggle': 20, 'tuesday': 21, 'year': 22, '12': 23, '1972': 24, '1991': 25, 'bank': 26, 'central': 27, 'closed': 28, 'commerce': 29, 'conclusion': 30, 'credit': 31, 'debt': 32, 'early': 33, 'evidence': 34, 'founded': 35, 'international': 36, 'laundering': 37, 'leading': 38, 'litigation': 39, 'massive': 40, 'money': 41, 'reaching': 42, 'show': 43, 'sign': 44, 'tangled': 45, 'unearthed': 46, 'web': 47, '71': 48, 'asset': 49, 'collapse': 50, 'country': 51, 'operation': 52, 'time': 53, '10': 54, '16': 55, 'admitted': 56, 'deloitte': 57, 'made': 58, 'month': 59, 'next': 60, 'reportedly': 61, 'total': 62, 'touche': 63, '0': 64, '1': 65, '250': 66, '4': 67, 'abu': 68, 'amo
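
The id assignment can be pictured with a few lines of plain Python. This is a simplified sketch, not gensim's implementation: it assumes, consistent with the output above, that the unseen tokens of each sentence receive the next free ids in sorted order. The helper name `build_token2id` is hypothetical.

```python
def build_token2id(documents):
    """Assign consecutive integer ids to tokens, document by document."""
    token2id = {}
    for tokens in documents:
        # Unseen tokens of this document get the next free ids, in sorted order
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Two tiny hypothetical tokenized sentences
docs = [["bcci", "billion", "24"], ["bank", "billion", "12"]]
token2id = build_token2id(docs)
print(token2id)  # {'24': 0, 'bcci': 1, 'billion': 2, '12': 3, 'bank': 4}
```

A word that reappears in a later sentence, such as `billion` here, keeps the id it was given the first time.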

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 5
***
Now convert each tokenized sentence in **Text** into a bag-of-words vector, a list of `(token_id, count)` pairs, using the `.doc2bow()` method of `dictionary`, and store the results in a variable called **corpus**.

In [9]:
corpus = [dictionary.doc2bow(text) for text in Text]
for line in corpus:
    print(line)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1)]
[(5, 1), (7, 1), (11, 1), (23, 1), (24, 1), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1)]
[(1, 1), (4, 1), (5, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1)]
[(2, 1), (5, 1), (6, 1), (12, 1), (14, 1), (15, 1), (18, 1), (23, 1), (54, 2), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(2, 1), (4, 1), (5, 2), (12, 1), (22, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1)]
[(14, 1), (81, 1), (89, 1), (90, 1), (91, 1)
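
Each printed line is one sentence expressed as `(token_id, count)` pairs. A minimal sketch of that conversion in plain Python follows. It is not gensim's code; the helper `doc_to_bow` and the small id mapping are hypothetical, and it assumes tokens missing from the mapping are simply skipped.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens present in token2id."""
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

# Hypothetical id mapping and tokenized sentence
token2id = {"bank": 0, "billion": 1, "fraud": 2}
bow = doc_to_bow(["bank", "fraud", "bank", "unknown"], token2id)
print(bow)  # [(0, 2), (2, 1)]
```

Ids with a count of zero are left out, which is why each sentence's vector lists only the words it contains.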

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 6
***
Create an LDA model with 5 topics and a number of passes of your choice. Then print the 5 topics along with the top 3 words in each topic.

In [10]:
import gensim
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)

print(ldamodel)

LdaModel(num_terms=171, num_topics=5, decay=0.5, chunksize=2000)


In [11]:
for topic in ldamodel.print_topics(num_topics=5, num_words=3):
    print(topic)

(0, '0.030*"billion" + 0.020*"million" + 0.020*"year"')
(1, '0.027*"bcci" + 0.027*"creditor" + 0.027*"united"')
(2, '0.030*"mahfouz" + 0.029*"million" + 0.016*"court"')
(3, '0.037*"billion" + 0.037*"liquidator" + 0.028*"said"')
(4, '0.031*"bcci" + 0.031*"amp" + 0.031*"bank"')
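
Each topic is printed as a string of `weight*"word"` terms, where the weight is the word's probability within that topic. If you need the values as data, a small parser can split the string. The helper `parse_topic` below is a hypothetical convenience, written against the format shown above; gensim's `show_topics(formatted=False)` is an alternative that returns the pairs without any string parsing.

```python
import re

def parse_topic(topic_string):
    """Split a '0.030*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 0 as printed above
topic_0 = '0.030*"billion" + 0.020*"million" + 0.020*"year"'
print(parse_topic(topic_0))
# [('billion', 0.03), ('million', 0.02), ('year', 0.02)]
```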


In [13]:
# Visualize the topics
# Note: pyLDAvis 3.x renamed this module to `pyLDAvis.gensim_models`
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
vis

<img src="../images/icon/quiz.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

## Topic Modeling
***

Q1. What percentage of the following statements about topic modeling are correct?
```text
1. It is a supervised learning technique
2. LDA (Linear Discriminant Analysis) can be used to perform topic modeling
3. Selection of the number of topics in a model does not depend on the size of the data
4. The number of topic terms is directly proportional to the size of the data
A) 0
B) 25
C) 50
D) 75
E) 100

Solution: (A)

Topic modeling with LDA is an unsupervised technique. In this context LDA stands for Latent Dirichlet Allocation, not Linear Discriminant Analysis. The number of topics does depend on the size of the data, while the number of topic terms is not directly proportional to it. Hence none of the statements is correct.
```

Q2. In a Latent Dirichlet Allocation model for text classification purposes, what do the alpha and beta hyperparameters represent?
```text
A) Alpha: number of topics within documents, beta: number of terms within topics
B) Alpha: density of terms generated within topics, beta: density of topics generated within terms
C) Alpha: number of topics within documents, beta: number of terms within topics
D) Alpha: density of topics generated within documents, beta: density of terms generated within topics

Solution: (D)
```
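
To see what "density" means here, the sketch below samples document-topic mixtures from a symmetric Dirichlet prior using only the standard library. It is an illustration under stated assumptions, not part of any LDA library: the helper names, the 5 topics, and the two alpha values are hypothetical choices. A small alpha concentrates each document on few topics, while a large alpha spreads weight across all of them; beta plays the same role for terms within a topic.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n=2000, seed=0):
    """Average weight of the dominant topic over n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

# Small alpha: each document is dominated by one or two topics.
# Large alpha: weight is spread almost evenly across all topics.
sparse = mean_largest_share(alpha=0.1)
dense = mean_largest_share(alpha=10.0)
print(round(sparse, 2), round(dense, 2))
```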

Q3. Social media platforms are the most intuitive form of text data. You are given a complete corpus of tweets from social media. How can you create a model that suggests hashtags?
```text
A) Perform topic modeling to obtain the most significant words of the corpus
B) Train a bag-of-n-grams model to capture the top n-grams – words and their combinations
C) Train a word2vec model to learn repeating contexts in the sentences
D) All of these

Solution: (D)

All of these techniques can be used to extract the most significant terms of a corpus.
```
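
As a minimal sketch of the bag-of-n-grams option, the snippet below counts adjacent word pairs across a few tokenized tweets. The tweets and the helper name `top_ngrams` are hypothetical, and a real hashtag suggester would need cleaning and ranking steps that are not shown here.

```python
from collections import Counter

def top_ngrams(token_lists, n=2, k=3):
    """Count n-grams within each token list and return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted copies yields every run of n adjacent tokens
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Hypothetical tokenized tweets
tweets = [
    ["topic", "modeling", "with", "lda"],
    ["learning", "topic", "modeling", "today"],
    ["topic", "modeling", "rocks"],
]
print(top_ngrams(tweets, n=2, k=1))  # [(('topic', 'modeling'), 3)]
```

Joining the most frequent n-grams, for example into `#topicmodeling`, gives candidate hashtags.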
