
## Loading the dataset


In [0]:
import pandas as pd
import numpy as np

In [0]:
url='https://raw.githubusercontent.com/udacity/NLP-Exercises/master/2.2-topic-modeling/abcnews-date-text.csv'

In [0]:
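# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv(url, on_bad_lines='skip')
# Keep the first 300,000 headlines; .copy() avoids a SettingWithCopyWarning below
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text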

In [8]:
documents.head(5)

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4


In [9]:
##total number of documents
print(len(documents))

300000


In [10]:
documents[:5]

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4


# Data Preprocessing

1. **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.

2. Words with 3 or fewer characters are removed.

3. All **stopwords** are removed.

4. Words are **lemmatized**: words in the third person are changed to first person, and verbs in past and future tenses are changed to the present tense.

5. Words are **stemmed**: words are reduced to their root form.
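
The tokenization, length filter, and stopword removal can be sketched in plain Python before bringing in Gensim and NLTK. This is only a rough illustration under stated assumptions: `TOY_STOPWORDS` and `basic_preprocess` are hypothetical names, the stopword set is a tiny stand-in for Gensim's much larger list, and the regular expression only approximates what `simple_preprocess` does. The length threshold mirrors the `len(token) > 3` check used in this notebook's `preprocess` function.

```python
import re

# Hypothetical, deliberately tiny stopword set (an assumption for illustration only).
TOY_STOPWORDS = {"the", "for", "must", "with", "from", "against"}

def basic_preprocess(text, min_len=4):
    """Lowercase, split into alphabetic tokens, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

print(basic_preprocess("Air NZ strike to affect Australian travellers"))
# ['strike', 'affect', 'australian', 'travellers']
```

Lemmatization and stemming are left out of this sketch because they depend on NLTK; the cells that follow demonstrate both.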









In [0]:
## Loading Gensim and nltk libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
np.random.seed(400)

In [14]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [15]:
### Practice Lemmatizer Example
print(WordNetLemmatizer().lemmatize('went', pos='v'))  ## past tense to present tense

go


In [19]:
#### Stemmer Example
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted'] 
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data={'original_words':original_words,'Stemmed':singles})

Unnamed: 0,original_words,Stemmed
0,caresses,caress
1,flies,fli
2,dies,die
3,mules,mule
4,denied,deni
5,died,die
6,agreed,agre
7,owned,own
8,humbled,humbl
9,sized,size


In [0]:
## Now on the entire dataset
def lemmantize_stemming(text):
    # Lemmatize as a verb, then stem the result
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

## Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmantize_stemming(token))
    return result
  

In [32]:
## Preview a document after preprocessing
document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]
print("Original Document:")
words = doc_sample.split(' ')
print(words)
print("\n\nTokenized and lemmatized document:")
print(preprocess(doc_sample))

Original Document:
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document:
['rain', 'help', 'dampen', 'bushfir']


In [0]:
## Now use the pandas map function to apply preprocess() to headline_text

processed_docs=documents['headline_text'].map(preprocess)

In [35]:
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

# Bag of words on the dataset

In [0]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [37]:
## checking dictionary created
count = 0
for k, v in dictionary.iteritems():
  print(k,v)
  count+=1
  if(count>10):
    break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


**Gensim filter_extremes**

`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`

Filter out tokens that appear in

1. fewer than `no_below` documents (absolute number) or

2. more than `no_above` documents (fraction of total corpus size, not absolute number).

After (1) and (2), keep only the `keep_n` most frequent tokens (or keep all if `keep_n` is `None`).
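
To make the three rules concrete, here is a small pure-Python sketch applied to a toy corpus. It follows the description above rather than Gensim's source code: `filter_by_doc_freq` and `toy_docs` are hypothetical names, and "most frequent" is assumed to mean highest document frequency.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.5, keep_n=None):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    # Rules 1 and 2: drop tokens that are too rare or too common.
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Rule 3: keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["rain", "help", "bushfir"],
    ["rain", "flood", "town"],
    ["polic", "help", "town"],
    ["polic", "rain", "charg"],
]
print(sorted(filter_by_doc_freq(toy_docs)))
# ['help', 'polic', 'town']
```

Here `rain` is dropped because it appears in 3 of 4 documents (more than the 0.5 fraction), while `bushfir`, `flood`, and `charg` are dropped because each appears in only one document.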

In [0]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)


### Gensim doc2bow

`doc2bow(document)`

Convert `document` (a list of words) into the bag-of-words format: a list of `(token_id, token_count)` 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in `document`; apply tokenization, stemming, etc. before calling this method.
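
The bag-of-words conversion itself is simple enough to sketch without Gensim. The snippet below is an illustration of the idea, not the library's implementation: `toy_doc2bow` and `toy_token2id` are hypothetical names, and tokens missing from the mapping are assumed to be skipped.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Map a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary mapping each known token to an integer id.
toy_token2id = {"rain": 0, "help": 1, "bushfir": 2}

print(toy_doc2bow(["rain", "help", "rain", "dampen"], toy_token2id))
# [(0, 2), (1, 1)]
```

The token `dampen` is not in the toy vocabulary, so it does not appear in the result, and `rain` is counted twice.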

In [0]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [40]:
bow_corpus[document_num]

[(71, 1), (107, 1), (462, 1), (3530, 1)]

In [44]:
## BoW for our sample preprocessed document

bow_doc_4310 = bow_corpus[document_num]
for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
          dictionary[token_id], token_count))

Word 71 ("bushfir") appears 1 time.
Word 107 ("help") appears 1 time.
Word 462 ("rain") appears 1 time.
Word 3530 ("dampen") appears 1 time.
