## Task 5 - Topic Generation
This notebook uses LDA topic modeling to create 10 topics.
Both the bag-of-words and TF-IDF vectorisation techniques are used with LDA.
### Notebook pipeline
- data loading and concatenation
- data preprocessing
- dictionary and corpus creation
- TF-IDF data vectorisation
- LDA topic modeling (with bag of words and TF-IDF)
- topic testing on the reviews and the new data

The LDA topic modeling code was mostly taken from this article:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

### Load the data:

In [None]:
# Load the review texts of both apps from the CSV files
import os
import importlib
import pandas as pd
oneonetwo_data = pd.read_csv(os.path.join("data", "part", "Suomi112_eng.csv"), encoding='utf-8-sig')['content']
sos_live = pd.read_csv(os.path.join("data", "part", "SosLive_eng.csv"), encoding='utf-8-sig')['content']

In [None]:
# Concatenate the data from the two apps
data = pd.concat([oneonetwo_data, sos_live], ignore_index=True)

### Preprocess the data:

In [None]:
# Initialize the processor object:
import utils
importlib.reload(utils)
processor = utils.Processor()
processor.ini_dowload()

In [None]:
#Preprocess the data:
data = data.apply(lambda x: processor.preprocess(str(x)))
data = data.apply(lambda x: processor.tokenize(x))
data = data.apply(lambda x: processor.remove_stopwords(x))
data = data.apply(lambda x: processor.process_tokens(x))
data_copy = data.copy()  # copy of the Series; plain assignment would only alias it
data

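`utils.Processor` is not shown in this notebook, so the four calls above are opaque here. As a rough illustration only, the sketch below shows what such a pipeline typically does, using just the standard library. The function names mirror the calls above, but the bodies, the tiny `STOPWORDS` set, and the `min_len` threshold are assumptions, not the actual `utils` implementation.

```python
import re

# Hypothetical stand-in for utils.Processor; the real implementation is not shown here.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of", "my"}  # assumed tiny list


def preprocess(text):
    """Lowercase the text and replace every character that is not a-z or whitespace with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split the cleaned text on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def process_tokens(tokens, min_len=3):
    """Drop very short tokens (an assumed final clean-up step)."""
    return [t for t in tokens if len(t) >= min_len]


def run_pipeline(text):
    """Chain the four steps in the same order as the notebook cell."""
    return process_tokens(remove_stopwords(tokenize(preprocess(text))))


print(run_pipeline("This application drains my battery!"))
# ['application', 'drains', 'battery']
```

The result is a list of tokens per review, which is the input shape that `Dictionary` and `doc2bow` expect in the next section.
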
### Create dictionary and corpus:


In [None]:
# Create the dictionary: a mapping between each unique token and an integer id
from gensim.corpora import Dictionary
flat_data = [tag for sentence in data for tag in sentence]  # all tokens in one flat list (not used below)
dictionary = Dictionary(data)
# Drop tokens that occur in fewer than 15 documents or in more than 50% of them; keep at most 100000 tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# Bag-of-words corpus: each document becomes a list of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in data]

In [None]:
bow_corpus[1]

In [None]:
# Print the dictionary as (token_id, token) pairs:
for k, v in dictionary.items():
    print(k, v)

In [None]:
# Check how many times each word appears in a single review:
bow_review = bow_corpus[2]
for token_id, count in bow_review:
    print("Word {} (\"{}\") appears {} time(s).".format(token_id, dictionary[token_id], count))

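To make the dictionary and corpus steps concrete, here is a minimal pure-Python sketch of the two ideas involved: filtering the vocabulary by document frequency and turning a token list into `(token_id, count)` pairs. It follows the semantics of `filter_extremes` and `doc2bow` as I understand them from the gensim documentation; the toy documents, the helper names, the alphabetical id assignment, and the omission of `keep_n` are assumptions made for illustration.

```python
from collections import Counter

# Toy tokenised reviews (illustrative data, not from the notebook's dataset).
docs = [
    ["app", "crash", "login"],
    ["app", "battery", "drain"],
    ["app", "crash", "update"],
    ["app", "battery", "crash"],
]

# Document frequency: in how many documents each token occurs at least once.
doc_freq = Counter(token for doc in docs for token in set(doc))


def filter_vocab(doc_freq, num_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    int(no_above * num_docs) documents."""
    max_docs = int(no_above * num_docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)


def doc2bow(doc, token2id):
    """Count the known tokens of one document; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


vocab = filter_vocab(doc_freq, len(docs), no_below=2, no_above=0.75)
token2id = {token: idx for idx, token in enumerate(vocab)}  # ids assigned alphabetically here

print(token2id)
# {'battery': 0, 'crash': 1}
print(doc2bow(["app", "battery", "crash", "crash"], token2id))
# [(0, 1), (1, 2)]
```

Note that "app" is removed because it occurs in every document, and "login", "drain", and "update" are removed because they occur in only one. With `no_below=15` on a small review set, a similarly large share of the vocabulary can disappear.
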
### Vectorize using TF-IDF:

In [None]:
# Create the TF-IDF vector representation of the documents:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In [None]:
#preview TF-IDF scores for our first document.
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

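The weighting behind those scores can be reproduced by hand. The sketch below assumes gensim's default `TfidfModel` settings (raw term count as the local weight, base-2 logarithm of `N / df` as the global weight, and L2 normalisation per document); if the model is constructed with other options, the numbers will differ. The function name and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(bow_corpus):
    """Compute TF-IDF vectors for a corpus of (token_id, count) lists."""
    num_docs = len(bow_corpus)
    # Each token id appears at most once per bag-of-words document,
    # so counting occurrences across documents gives the document frequency.
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    idf = {t: math.log2(num_docs / df) for t, df in doc_freq.items()}

    vectors = []
    for doc in bow_corpus:
        # Tokens present in every document get idf == 0 and are dropped.
        weights = [(t, count * idf[t]) for t, count in doc if idf[t] > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        vectors.append([(t, w / norm) for t, w in weights] if norm else [])
    return vectors


# Toy corpus: every token occurs in two of the three documents.
toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(1, 1), (2, 1)]]
toy_vectors = tfidf_vectors(toy_corpus)
for vec in toy_vectors:
    print([(t, round(w, 3)) for t, w in vec])
```

In the first toy document, token 1 occurs twice and token 0 once, and both have the same IDF, so after normalisation token 1 carries twice the weight of token 0.
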
### LDA topic modeling:

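For reference, the generative model that LDA assumes can be written as follows. The symbols are the standard textbook notation and do not appear in the notebook: $K$ is the number of topics (`num_topics=10` below), $D$ the number of documents, $\phi_k$ the word distribution of topic $k$, $\theta_d$ the topic mixture of document $d$, and $\alpha$, $\eta$ the Dirichlet priors (gensim exposes them as the `alpha` and `eta` parameters).

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\eta),   & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Here $w_{d,n}$ is the $n$-th word of document $d$ and $z_{d,n}$ is the topic it was drawn from. Training estimates the $\phi_k$, which `print_topics` displays as weighted word lists.
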
In [None]:
#Topics with simple bag of words
import gensim
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
# Topics with TF-IDF vectorisation:
import gensim
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

### Topic testing:

In [None]:
# Test on one review with the model trained on the TF-IDF corpus.
# Note: the query itself is the raw bag-of-words vector, not its TF-IDF transform.
print("Review: ", data[1])
review = bow_corpus[1]
for index, score in sorted(lda_model_tfidf[review], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

In [None]:
# Test on new, unseen data
unseen_document = "This application drains battery!"
tmp =  processor.preprocess(str(unseen_document))
tmp = processor.tokenize(tmp)
tmp = processor.remove_stopwords(tmp)
preprocessed_document = processor.process_tokens(tmp)
bow_vector = dictionary.doc2bow(preprocessed_document)


In [None]:
# Test on bag-of-words topics:
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

In [None]:
# Test on tfidf topics:
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))
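
The test cells above all share the same ranking step: sort the `(topic_id, score)` pairs returned by the model by descending score. A small sketch of that step in isolation follows; `rank_topics` and the sample scores are illustrative and not part of the notebook.

```python
def rank_topics(topic_scores, top_n=None):
    """Sort (topic_id, score) pairs by descending score and optionally keep the best top_n."""
    ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Made-up scores in the shape that indexing an LDA model with a bag-of-words vector returns.
sample = [(0, 0.05), (3, 0.62), (7, 0.33)]
print(rank_topics(sample, top_n=2))
# [(3, 0.62), (7, 0.33)]
```

As far as I know, gensim omits topics whose probability falls below the model's `minimum_probability`, so the list for a short document may hold fewer than 10 entries.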