<a href="https://colab.research.google.com/github/shaifuss/data_science_seminar/blob/master/topic_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [36]:
import numpy as np
import pandas as pd
import json
import pickle
import datetime



The goal of this section is to identify distinct topics within the corpus of pizza review texts. Once topics are identified, reviews that do not contain food-related topics can be filtered out. This has the potential to improve the image classification in the next stage by weeding out irrelevent samples - where reviewers didn't base their scores on the pizza.

This will be achieved by building a hybrid topic model that melds a latent Dirichlet allocation (LDA) model with a word vector clustering model.

The premise of LDA (Biel et al., 2003) is that documents with similar topics use similar words. The algorithm aims to discover groups of words the occur frequently occur together in the same document. A topic is modeled as a probability distribution over words. Moreover, a document can be modeled as a probability distribution over different topics. 

Thus, the algorithm words as follows:


1.   Remove unimportant words and set how many topics to find.
2.   Randomly assign each word in each document to a random topic
3.   For each document,
>a. choose a topic, assuming all others are allocated correctly

>>i. calculate the topic distribution within the document:  p(topic | document)

>>ii. calculate the word distribution within the topic: p(word | topic)

>> iii. multiply i and ii together and assign words to new topics based on the result

4. terminate when there are no new assignments

The model is finetuned by several parameters:
Alpha reflects how many topics are in a given document (higher values lead to more topics per document in the model)
Beta reflects how many words are in a given topic (higher values lead to more words per topic in the model).

 If a set of words are repeated in many documents, those words are said to be a topic.  



In [2]:
def load_reviews():
  with open(r'/content/drive/My Drive/Data Science Class/pizza_reviews.json', 'r') as f:
    pizza_reviews = json.load(f)
  return pizza_reviews
review_list = load_reviews()

In [3]:
review_df = pd.DataFrame(review_list)

text_df = review_df[['review_id', 'text']].copy()
text_df.head()

Unnamed: 0,review_id,text
0,mM8i91yWP1QbImEvz5ds0w,"In the heart of Chinatown, I discovered it enr..."
1,09qxjFi4abaW66JeSLazuQ,Was a Chicago style deep dish. Homemade type ...
2,K-wdPGHbErfxbKK6PetrmA,First time eating there and everything was so ...
3,jkVxX4ieJwVRO9n4E8tNMw,More than just Pizza. This location is small ...
4,Lb9r62Qlu12ZB909CbFeOQ,I ordered a pizza at 4:49. Got an email that s...


word level preprocessing utils

In [42]:
!pip install pyspellchecker
!pip install sentence-transformers

Collecting sentence-transformers
[?25l  Downloading https://files.pythonhosted.org/packages/24/47/0ed64014af68aaf36f2e0a42bb30a5caf82e54edf92329d8aca4959ba9d7/sentence-transformers-0.2.6.2.tar.gz (60kB)
[K     |████████████████████████████████| 61kB 1.8MB/s 
[?25hCollecting transformers==2.11.0
[?25l  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
[K     |████████████████████████████████| 675kB 7.1MB/s 
Collecting sentencepiece
[?25l  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
[K     |████████████████████████████████| 1.1MB 16.1MB/s 
Collecting tokenizers==0.7.0
[?25l  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

import re
import time

from spellchecker import SpellChecker

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [12]:
def regex_filter(sentence):
    # fix missing delimiter - i.e deepDishPizza
    sentence = re.sub(r'([a-z])([A-Z])', r'\1\. \2', sentence)
    sentence = sentence.lower()
    sentence = re.sub(r'&gt|&lt', ' ', sentence)
    # fix letter repetition (if more than 2)
    sentence = re.sub(r'([a-z])\1{2,}', r'\1', sentence)
    # fix non-word repetition (if more than 1)
    sentence = re.sub(r'([\W+])\1{1,}', r'\1', sentence)
    # string * as delimiter
    sentence = re.sub(r'\*|\W\*|\*\W', '. ', sentence)
    # xxx[?!]. -- > xxx.
    sentence = re.sub(r'\W+?\.', '.', sentence)
    # [.?!] --> [.?!] xxx
    sentence = re.sub(r'(\.|\?|!)(\w)', r'\1 \2', sentence)
    # fix phrase repetition
    sentence = re.sub(r'(.{2,}?)\1{1,}', r'\1', sentence)

    return sentence.strip()

In [6]:
# remove numbers and punctuation marks
def filter_punctuation(word_list):
    return [word for word in word_list if word.isalpha()]

# remove unimportant connective words such as "and", "the", etc
def filter_stopwords(word_list):
  return [word for word in word_list if word not in stopwords.words('english')]

# TODO is this helpful or not?
def retain_nouns(word_list):
    return [word for (word, pos) in nltk.pos_tag(word_list) if pos[:2] in ['NN']] #, 'VBP', 'VBN']]

# normlize for part of speech
def stem_words(word_list):
  ps = PorterStemmer()
  return [ps.stem(word) for word in word_list]

def fix_spelling(word_list):
  spell = SpellChecker()
  return [spell.correction(word) for word in word_list]

In [14]:
def preprocess_words(text):
  word_list = word_tokenize(text)
  word_list = filter_punctuation(word_list)
  word_list = fix_spelling(word_list) 
  word_list = filter_stopwords(word_list)
  word_list = retain_nouns(word_list)
  return stem_words(word_list)

In [32]:
def preprocess(reviews, samp_size=None):
  if not samp_size:
        samp_size = 1000

  start = time.time()
  print('Stage 1: Preprocess raw review texts')
  #review_count = len(reviews)
  texts = []  
  token_lists = []  
  idx_in = []

  indicies = np.random.choice(len(reviews), samp_size)
  for i in indicies:
      text = regex_filter(reviews[i])
      token_list = preprocess_words(text)
      if token_list:
          idx_in.append(i)
          texts.append(text)
          token_lists.append(token_list)

  end = time.time()
  print("Preprocessing {} reviews took {} minutes".format(len(indicies), str((end - start)/60)))
  return texts, token_lists, idx_in
  

In [None]:
sentences, token_lists, idx_in = pre_process(text_df.text)

Stage 1: Preprocess raw review texts
Preprocessing 479792 reviews took 9.448280354340872 minutes


In [21]:
import keras
from keras.layers import Input, Dense
from keras.models import Model
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')


class Autoencoder:
    """
    Autoencoder for learning latent space representation
    architecture simplified for only one hidden layer
    """

    def __init__(self, latent_dim=32, activation='relu', epochs=200, batch_size=128):
        self.latent_dim = latent_dim
        self.activation = activation
        self.epochs = epochs
        self.batch_size = batch_size
        self.autoencoder = None
        self.encoder = None
        self.decoder = None
        self.his = None

    def _compile(self, input_dim):
        """
        compile the computational graph
        """
        input_vec = Input(shape=(input_dim,))
        encoded = Dense(self.latent_dim, activation=self.activation)(input_vec)
        decoded = Dense(input_dim, activation=self.activation)(encoded)
        self.autoencoder = Model(input_vec, decoded)
        self.encoder = Model(input_vec, encoded)
        encoded_input = Input(shape=(self.latent_dim,))
        decoder_layer = self.autoencoder.layers[-1]
        self.decoder = Model(encoded_input, self.autoencoder.layers[-1](encoded_input))
        self.autoencoder.compile(optimizer='adam', loss=keras.losses.mean_squared_error)

    def fit(self, X):
        if not self.autoencoder:
            self._compile(X.shape[1])
        X_train, X_test = train_test_split(X)
        self.his = self.autoencoder.fit(X_train, X_train,
                                        epochs=200,
                                        batch_size=128,
                                        shuffle=True,
                                        validation_data=(X_test, X_test), verbose=0)

Using TensorFlow backend.


In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim import corpora
import gensim
import numpy as np

# define model object
class Topic_Model:
    def __init__(self, k=5, method='LDA_BERT'):
        """
        :param k: number of topics
        :param method: method chosen for the topic model
        """
        if method not in {'LDA', 'BERT', 'LDA_BERT'}:
            raise Exception('Invalid method!')
        self.k = k
        self.dictionary = None
        self.corpus = None
        self.cluster_model = None
        self.ldamodel = None
        self.vec = {}
        self.gamma = 15  # parameter for reletive importance of lda
        self.method = method
        self.AE = None
        self.id = method + '_' + str(time.time())

    def vectorize(self, sentences, token_lists, method=None):
        """
        Get vector representations from selected methods
        """
        # Default method
        if method is None:
            method = self.method

        # turn tokenized documents into a id <-> term dictionary
        self.dictionary = corpora.Dictionary(token_lists)
        # convert tokenized documents into a document-term matrix
        self.corpus = [self.dictionary.doc2bow(text) for text in token_lists]

        if method == 'LDA':
            print('Getting vector representations for LDA ...')
            if not self.ldamodel:
                self.ldamodel = gensim.models.ldamodel.LdaModel(self.corpus, num_topics=self.k, id2word=self.dictionary,
                                                                passes=20)

            def get_vec_lda(model, corpus, k):
                """
                Get the LDA vector representation (probabilistic topic assignments for all documents)
                :return: vec_lda with dimension: (n_doc * n_topic)
                """
                n_doc = len(corpus)
                vec_lda = np.zeros((n_doc, k))
                for i in range(n_doc):
                    # get the distribution for the i-th document in corpus
                    for topic, prob in model.get_document_topics(corpus[i]):
                        vec_lda[i, topic] = prob

                return vec_lda

            vec = get_vec_lda(self.ldamodel, self.corpus, self.k)
            print('Finished getting vector representations for LDA')
            return vec

        elif method == 'BERT':

            print('Getting vector representations for BERT ...')
            from sentence_transformers import SentenceTransformer
            model = SentenceTransformer('bert-base-nli-max-tokens')
            vec = np.array(model.encode(sentences, show_progress_bar=True))
            print('Finished getting vector representations for BERT')
            return vec

        #         elif method == 'LDA_BERT':
        else: 
            vec_lda = self.vectorize(sentences, token_lists, method='LDA')
            vec_bert = self.vectorize(sentences, token_lists, method='BERT')
            vec_ldabert = np.c_[vec_lda * self.gamma, vec_bert]
            self.vec['LDA_BERT_FULL'] = vec_ldabert
            if not self.AE:
                self.AE = Autoencoder()
                print('Fitting Autoencoder ...')
                self.AE.fit(vec_ldabert)
                print('Fitting Autoencoder Done!')
            vec = self.AE.encoder.predict(vec_ldabert)
            return vec

    def fit(self, sentences, token_lists, method=None, m_clustering=None):
        """
        Fit the topic model for selected method given the preprocessed data
        :docs: list of documents, each doc is preprocessed as tokens
        :return:
        """
        # Default method
        if method is None:
            method = self.method
        # Default clustering method
        if m_clustering is None:
            m_clustering = KMeans

        # turn tokenized documents into a id <-> term dictionary
        if not self.dictionary:
            self.dictionary = corpora.Dictionary(token_lists)
            # convert tokenized documents into a document-term matrix
            self.corpus = [self.dictionary.doc2bow(text) for text in token_lists]

        ####################################################
        #### Getting ldamodel or vector representations ####
        ####################################################

        if method == 'LDA':
            if not self.ldamodel:
                print('Fitting LDA ...')
                self.ldamodel = gensim.models.ldamodel.LdaModel(self.corpus, num_topics=self.k, id2word=self.dictionary,
                                                                passes=20)
                print('Fitting LDA Done!')
        else:
            print('Clustering embeddings ...')
            self.cluster_model = m_clustering(self.k)
            self.vec[method] = self.vectorize(sentences, token_lists, method)
            self.cluster_model.fit(self.vec[method])
            print('Clustering embeddings. Done!')

    def predict(self, sentences, token_lists, out_of_sample=None):
        """
        Predict topics for new_documents
        """
        # Default as False
        out_of_sample = out_of_sample is not None

        if out_of_sample:
            corpus = [self.dictionary.doc2bow(text) for text in token_lists]
            if self.method != 'LDA':
                vec = self.vectorize(sentences, token_lists)
                print(vec)
        else:
            corpus = self.corpus
            vec = self.vec.get(self.method, None)

        if self.method == "LDA":
            lbs = np.array(list(map(lambda x: sorted(self.ldamodel.get_document_topics(x),
                                                     key=lambda x: x[1], reverse=True)[0][0],
                                    corpus)))
        else:
            lbs = self.cluster_model.predict(vec)
        return lbs

In [44]:
outfile = r'/content/drive/My Drive/Data Science Class/tm.file'

def main():
    
  method = "LDA_BERT"
  samp_size = 100
  ntopic = 5

  data = text_df
  data = data.fillna('')  # only the comments has NaN's
  reviews = data.text
  sentences, token_lists, idx_in = preprocess(reviews, samp_size=samp_size)
  tm = Topic_Model(k = ntopic)
  tm.fit(sentences, token_lists)
  with open(outfile, 'wb') as f:
    pickle.dump(tm, f, pickle.HIGHEST_PROTOCOL)
  
  # coherence measures internal consistency of a topic
  #print('Coherence:', get_coherence(tm, token_lists, 'c_v'))
  # silhoutte measures consistency of clusters
  #print('Silhouette Score:', get_silhouette(tm))
  # visualize and save img
  #visualize(tm)
  #for i in range(tm.k):
  #  get_wordcloud(tm, token_lists, i)


In [43]:
main()

Stage 1: Preprocess raw review texts
Preprocessing 100 reviews took 1.1484277844429016 minutes
Clustering embeddings ...
Getting vector representations for LDA ...
Finished getting vector representations for LDA
Getting vector representations for BERT ...


100%|██████████| 405M/405M [00:44<00:00, 9.20MB/s]


HBox(children=(FloatProgress(value=0.0, description='Batches', max=13.0, style=ProgressStyle(description_width…


Finished getting vector representations for BERT
Fitting Autoencoder ...
Fitting Autoencoder Done!
Clustering embeddings. Done!


NameError: ignored