<a href="https://colab.research.google.com/github/salmanarif86/MLAI/blob/master/Slack_Bot_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem statement:**

I need a way to more quickly understand what's happening in the Kaggle forums and act on it. I want a faster way to summarize trends in forum posts, figure out which questions are good for me to answer, and alert my teammates if the community is reporting something en masse.

This is actually three separate problems:

* Summarization (level of activity, new/emerging topics, topics that are newly popular)
* Flag questions I'm likely to know the answer to
* Identify possible answerers for a given question
* Alerts based on anomaly detection (lots of community discussion around a specific topic)

**Measuring success:**

*Summarization:*

* User feedback on bot output (online learning)
* Unsupervised NLU

    * Manual verification of topics
    * Manual verification of keywords

*Flagging questions:*

* Accuracy of predicting which questions I replied to, using my forum history

*Alerts:*

* Accuracy on past events/bugs

*Possible approaches to summarization:*

**Level of activity**

* Time series modelling of the number of posts over time
* X posts this week (+/- from last week), most popular (upvotes), most replied to
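A minimal sketch of the "X posts this week (+/- from last week)" idea, assuming the posts are in a pandas DataFrame with a timestamp column. The toy data and the column names `Id` and `PostDate` are assumptions for illustration only, not taken from the real forum export.

```python
import pandas as pd

# Toy data; the column names "Id" and "PostDate" are hypothetical.
posts = pd.DataFrame({
    "Id": range(8),
    "PostDate": pd.to_datetime([
        "2019-03-04", "2019-03-05", "2019-03-07",   # first week: 3 posts
        "2019-03-11", "2019-03-12", "2019-03-13",
        "2019-03-14", "2019-03-15",                  # second week: 5 posts
    ]),
})

# Count posts per calendar week (pandas weeks end on Sunday by default).
weekly_counts = posts.set_index("PostDate")["Id"].resample("W").count()

# Change versus the previous week; the first week has no predecessor (NaN).
weekly_change = weekly_counts.diff()

summary = pd.DataFrame({"posts": weekly_counts, "change": weekly_change})
print(summary)
```

The same weekly series could also feed a time series model or a simple anomaly threshold; that part is not sketched here.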
        
**Keywords**

* https://repositorio.inesctec.pt/bitstream/123456789/7623/1/P-00N-NF5.pdf
* Faster

**Topics**

  * More flexible to differences in vocabulary
    
**Hybrid Approach**

  * Keywords + embeddings to group similar keywords
  
**Clustering based on embeddings**

  * Do we want to train our own embeddings?
  * Look into current approaches

**Notes**


**Unsupervised text Clustering**

*Words -> Inputs*

* Embeddings (might not be helpful if we don't train new embeddings each time we run the model)

    * Fasttext can handle out of vocabulary words
    * Subword embeddings
    * Biggest factor - How long do they take to train?
    * Universal Sentence Encoder Embedding - We train our own embedding each time

* tf-idf

* LDA

* Take the frequency matrix, remove the expected frequency (by subtracting, or using the column marginal as a noise model) -Leland McInnes

* Embedding weighted with tf-idf

* Embedding and then performing PCA, removing the first principal component - Arora et al. (2017), 'A Simple but Tough-to-Beat Baseline for Sentence Embeddings'

* pLSA (Cheaper version of LDA)
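To make the "remove the first principal component" option from the list above concrete, here is a minimal NumPy sketch. It assumes we already have one embedding vector per post (the toy vectors below are made up) and it only shows the component-removal step, using the first singular vector of the uncentered matrix as a simplification. The word-weighting step from the paper is not shown.

```python
import numpy as np

def remove_first_component(embeddings):
    """Subtract each row's projection onto the first right singular vector."""
    # Rows of vt are unit-length directions, ordered by singular value.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    first = vt[0]
    return embeddings - np.outer(embeddings @ first, first)

rng = np.random.default_rng(0)

# Toy "post embeddings": a large shared direction plus small noise.
common_direction = np.array([1.0, 1.0, 0.0, 0.0])
post_vectors = 3.0 * common_direction + rng.normal(scale=0.1, size=(5, 4))

cleaned = remove_first_component(post_vectors)

# After removal, what is left is mostly the per-post variation.
print(np.linalg.norm(post_vectors), np.linalg.norm(cleaned))
```

The intuition is that the dominant shared direction mostly carries information common to every post, so removing it should leave the parts that distinguish posts from each other.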


*Topic Modelling*

* LDA
        
        * Too slow for this particular use case
        * Hard to interpret
        

*Clustering with Embeddings*

* Hierarchical clustering
* Brown clusters

        * Hierarchical
        * Work on the word level
        * Can be updated actively
        * Would need to find a Python implementation
        
* DBSCAN

      * needs embedding as input
      * should reduce dimensionality
      * note: clusters should be of similar density
      * HDBSCAN is the hierarchical version

--------------------------------------------------------


Whole pipeline : 

* https://topsmb.github.io
* https://github.com/bigartm/bigartm

1. Words to  numbers

* tf-idf
* LDA
* pLSA
* Embeddings

    * fasttext
    * USE (Universal Sentence Encoder) embeddings
    * GloVe?
    * Word2Vec
    * ELMo embeddings
    * Text-to-knowledge mapping (mapping text to knowledge using Multi-Sense LSTMs) - probably won't do this, as it would require building our own knowledge graph and we have no bandwidth

2. Dimensionality Reduction
  
    * UMAP
    * PCA
    
3. Clustering

    * DBSCAN
    * HDBSCAN
    * Spectral clustering
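The three stages above can be sketched end to end with scikit-learn, picking one option per stage (tf-idf, PCA, DBSCAN). This is only an illustration under assumptions: the toy posts are invented, and the values for `n_components`, `eps` and `min_samples` are arbitrary and would need tuning on real forum data.

```python
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy posts, only to exercise the pipeline.
posts = [
    "submission file format error",
    "error in submission file upload",
    "kernel gpu memory limit",
    "gpu memory limit in kernels",
    "team merge deadline question",
    "question about team merge deadline",
]

# 1. Words to numbers: sparse tf-idf matrix, one row per post.
tfidf = TfidfVectorizer().fit_transform(posts)

# 2. Dimensionality reduction: PCA on the dense matrix (fine for toy sizes).
reduced = PCA(n_components=2, random_state=0).fit_transform(tfidf.toarray())

# 3. Clustering: DBSCAN labels each post; -1 means "noise" (no cluster).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(reduced)

for label, post in zip(labels, posts):
    print(label, post)
```

UMAP or HDBSCAN could be swapped in at stages 2 and 3 without changing the overall shape of the pipeline.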
    
    

Experimenting with YAKE!!

In [0]:
! pip install git+https://github.com/LIAAD/yake

In [0]:
import yake
import pandas as pd
from Clustering import Corpus
from Brown_Clustering_yangyuan import *
from nltk import word_tokenize
from nltk import RegexpTokenizer
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


#### Testing the YAKE algorithm on a custom passage to get a feel for how well it extracts keywords. This call uses the default parameters, so we don't have to tweak anything.

In [21]:
text_content = """
	Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning
	competitions. Details about the transaction remain somewhat vague , but given that Google is hosting
	its Cloud Next conference in San Francisco this week, the official announcement could come as early
	as tomorrow.  Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the
	acquisition is happening. Google itself declined 'to comment on rumors'.
"""

# assuming default parameters
simple_kwextractor = yake.KeywordExtractor()
keywords = simple_kwextractor.extract_keywords(text_content)

for kw in keywords:
	print(kw)

('machine learning competitions', 0.005240218636588412)
('hosts data science', 0.007231800172852763)
('learning competitions', 0.02651298908496934)
('platform that hosts', 0.03633877348319309)
('hosts data', 0.03633877348319309)
('data science', 0.03633877348319309)
('science and machine', 0.03633877348319309)
('machine learning', 0.03633877348319309)
('ceo anthony goldbloom', 0.03727546242790534)
('acquiring kaggle', 0.046501713202057565)
('kaggle co-founder ceo', 0.05500284979172434)
('san francisco', 0.05743727907793731)
('google', 0.06726505100386607)
('google is acquiring', 0.06754045633911093)
('anthony goldbloom declined', 0.07472471161713069)
('co-founder ceo anthony', 0.07489379267114575)
('francisco this week', 0.09080847547468633)
('ceo anthony', 0.10396234776227407)
('anthony goldbloom', 0.10396234776227407)
('hosting its cloud', 0.11556884354166654)


#### Same algorithm on the same text, but this call lets us pass parameters to tweak it for our use case. The custom parameters give the same results as the defaults here, so we are just going to use the simple keyword extractor.

In [22]:
# specifying parameters
custom_kwextractor = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.9, dedupFunc='seqm', windowsSize=1, top=20, features=None)

keywords = custom_kwextractor.extract_keywords(text_content)

for kw in keywords:
	print(kw)

('machine learning competitions', 0.005240218636588412)
('hosts data science', 0.007231800172852763)
('learning competitions', 0.02651298908496934)
('platform that hosts', 0.03633877348319309)
('hosts data', 0.03633877348319309)
('data science', 0.03633877348319309)
('science and machine', 0.03633877348319309)
('machine learning', 0.03633877348319309)
('ceo anthony goldbloom', 0.03727546242790534)
('acquiring kaggle', 0.046501713202057565)
('kaggle co-founder ceo', 0.05500284979172434)
('san francisco', 0.05743727907793731)
('google', 0.06726505100386607)
('google is acquiring', 0.06754045633911093)
('anthony goldbloom declined', 0.07472471161713069)
('co-founder ceo anthony', 0.07489379267114575)
('francisco this week', 0.09080847547468633)
('ceo anthony', 0.10396234776227407)
('anthony goldbloom', 0.10396234776227407)
('hosting its cloud', 0.11556884354166654)


In [0]:
forum_post = pd.read_csv('/content/ForumMessages.csv')

### Now trying the simple keyword extractor on our forum text. We will try it on a few message posts to get some intuition for the results.

In [24]:
simple_kwextractor.extract_keywords(forum_post.Message[0])

[('free hunch blog', 0.00036928641235685713),
 ('div', 0.003962474153257461),
 ('hunch blog', 0.0050494898992956925),
 ('free hunch', 0.005274063752072036),
 ('href', 0.026476296695875336),
 ('features', 0.05842835043495728),
 ('blog', 0.06936327841616126),
 ('free', 0.07243199525061537),
 ('hunch', 0.07243199525061537),
 ('mce', 0.0803814044228403),
 ('train a svm', 0.14354624602359264),
 ('svm', 0.2056295305809825),
 ('sequences', 0.21175280747641545),
 ('sequence', 0.21175280747641545),
 ('los alamos', 0.23088984029260495),
 ('website', 0.23849729378219245),
 ('give', 0.2394444994311782),
 ('response', 0.2524750771695876),
 ('comment', 0.2524750771695876),
 ('model', 0.2722507351479002)]

In [25]:
simple_kwextractor.extract_keywords(forum_post.Message[1])

[('super computing power', 0.0007587825802569232),
 ('released upon entry.', 0.007789229896782814),
 ('computing power', 0.007789229896782814),
 ('interested in participating', 0.009281830427428734),
 ('access to proprietary', 0.009281830427428734),
 ('proprietary software', 0.009281830427428734),
 ('software and super', 0.009281830427428734),
 ('super computing', 0.009281830427428734),
 ('predictions work', 0.009281830427428734),
 ('simulated by users', 0.0670458894265181),
 ('trademarked and licensed.', 0.0670458894265181),
 ('platform', 0.07536829389396676),
 ('possibly even release', 0.07813908785389614),
 ('release a model', 0.07813908785389614),
 ('make a contribution', 0.07813908785389614),
 ('contest', 0.08059613614664227),
 ('entry.', 0.08059613614664227),
 ('power', 0.08059613614664227),
 ('interested', 0.09589822912399033),
 ('participating', 0.09589822912399033)]
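Since YAKE returns `(keyword, score)` pairs where a lower score means a more relevant keyword, one optional refinement is to keep only the strongest keywords per post. The sketch below uses a few pairs copied from the output above. The helper `top_keywords` and its `max_score` and `limit` values are hypothetical choices for illustration, not something the notebook uses later.

```python
# (keyword, score) pairs copied from the output above.
keywords = [
    ('super computing power', 0.0007587825802569232),
    ('computing power', 0.007789229896782814),
    ('proprietary software', 0.009281830427428734),
    ('platform', 0.07536829389396676),
    ('contest', 0.08059613614664227),
    ('participating', 0.09589822912399033),
]

def top_keywords(pairs, max_score=0.01, limit=5):
    """Keep the lowest-scoring (most relevant) keywords under a threshold."""
    kept = [(kw, score) for kw, score in pairs if score <= max_score]
    kept.sort(key=lambda pair: pair[1])  # ascending: most relevant first
    return [kw for kw, _ in kept[:limit]]

print(top_keywords(keywords))
# ['super computing power', 'computing power', 'proprietary software']
```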

#### Now we are going to take the keywords for each post and turn them into a text string "sentence". This "sentence" will serve as input to our Brown clustering algorithm.

#### The lower a keyword's score, the higher its importance.

#### We will first try it on the last 1,000 messages.

In [0]:
sample_messages = forum_post.Message[-1000:].astype(str)

In [46]:
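sentences = []

for post in sample_messages:
  # Extract (keyword, score) pairs for this post
  post_keywords = simple_kwextractor.extract_keywords(post)

  # Concatenate the keywords into one space-separated "sentence"; scores are dropped
  sentences_output = ''
  for words, numbers in post_keywords:
    sentences_output = sentences_output + words + " "

  sentences.append(sentences_output)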




#### This tokenizes our "sentences" and removes any punctuation, using NLTK's RegexpTokenizer.

In [0]:
tokenizer = RegexpTokenizer(r'\w+')
sample_data_tokenized =[w.lower() for w in sentences]
sample_data_tokenized =[tokenizer.tokenize(i) for i in sample_data_tokenized]
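To see what this tokenization step does to a keyword "sentence", here is a small standard-library sketch that applies the same `\w+` pattern with `re.findall`. The example string is made up from keywords seen in the outputs above, and the helper name `tokenize` is only for illustration.

```python
import re

def tokenize(sentence):
    """Lowercase, then keep runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", sentence.lower())

example = "released upon entry. kaggle co-founder ceo "
print(tokenize(example))
# ['released', 'upon', 'entry', 'kaggle', 'co', 'founder', 'ceo']
```

Note that punctuation is dropped and hyphenated words such as "co-founder" are split into two tokens, so multi-word keywords end up as separate words in the clustering input.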

#### Now we train the Brown clustering algorithm. This should take less time to run, as it is running only on keywords as opposed to full messages.

In [0]:
corpus = Corpus(sample_data_tokenized, 0.001)
clustering = BrownClustering(corpus, 6)
clustering.train()


#### Now let's test it out and see which similar words we get after the algorithm has been trained on our keyword corpus, starting with what's similar to "submission". I think the number within each tuple is mutual information. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.

In [55]:
clustering.get_similar('submission')

[('data', 385),
 ('kaggle', 385),
 ('dataset', 385),
 ('set', 385),
 ('test', 385),
 ('train', 385),
 ('model', 385),
 ('make', 385),
 ('public', 385),
 ('team', 385)]
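The mutual information intuition described before this cell can be checked numerically with a small standard-library sketch. It computes mutual information from a made-up joint probability table of two binary variables. It does not reproduce the numbers returned by `get_similar`, whose exact meaning is still a guess here.

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) of two discrete variables from a joint probability table."""
    row_totals = [sum(row) for row in joint]          # marginal of the first variable
    col_totals = [sum(col) for col in zip(*joint)]    # marginal of the second variable
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # zero-probability cells contribute nothing
                mi += p_xy * math.log2(p_xy / (row_totals[i] * col_totals[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # knowing one variable tells nothing about the other
dependent = [[0.5, 0.0], [0.0, 0.5]]        # one variable fully determines the other

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```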

#### Now let's see if the word "data" gives us the same words. It looks like it doesn't. My guess is that the Brown clustering algorithm returns the words with the highest reduction in uncertainty, but within the same cluster.

In [56]:
clustering.get_similar('data')

[('model', 400),
 ('kaggle', 399),
 ('test', 398),
 ('set', 397),
 ('private', 396),
 ('public', 395),
 ('dataset', 394),
 ('make', 393),
 ('train', 392),
 ('file', 391)]

#### The results look decent, but I am not sure this will give us coherent clusters that we as humans can look at to infer the topic. I think the pipeline of keyword extraction / dimensionality reduction followed by clustering worked really well. Usually keyword extraction reduces the informativeness of the n-grams, but we didn't see that problem here.