
# Division1: Topic modeling and cosine similarity search on articles featured as topics

### PRINCIPLES:
* Separation of code from data
* Minimally Viable Product

### ORGANIZATION:
* ./DATA
* ./CORPUSES
* ./MODELS
* requirements.txt
* create_corpuses.py
* create_models.py

### SYSTEM DEVELOPMENT ENVIRONMENT ADVICE
* It is best to create a new Python virtual environment
* The code was tested in an IPython terminal
* All code is written in Python 3 for Python 3 systems

Example: install the code dependencies after creating a new virtualenv.

 (one possible way to set up the environment)
    > mkvirtualenv --python=/usr/bin/python3 summerpy3
    > workon summerpy3
    > ~/.virtualenvs/summerpy3/bin/pip3 install "ipython[notebook]"
 (get the Python 3 requirements)
    > ~/.virtualenvs/summerpy3/bin/pip3 install -r requirements.txt
 (start following along with the tutorial)
    > ipython
    > %run ...
    ...
    > deactivate

### ATTRIBUTIONS
The data-->corpus-->model work I've done is based on, and inspired by, the gensim creator's online tutorials:
    http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
  
The word cloud work is based on, and inspired by, code from "Building Machine Learning Systems with Python" by Richert and Coelho, available under the MIT license.


### TODO 
1. The biggest thing to work on is an update capability; currently, to update the LDA model, you need to create a new corpus and a new model.
    
2. Merge code written for Collocation and NER cleaning


## CREATING LEMMATIZED CORPUS FROM DIRECTORY OF .TXT FILES
1. **Create corpuses if you have a directory of .txt files.**

    This saves corpuses (.mm) and dictionaries (.dict) for the lemmatized, processed data.

    The lemmatized corpus will be saved in ./CORPUSES. (The script uses a relative path for the CORPUSES location.)

        ./create_corpuses.py dir_txt_files corpus_name [--nodup]

    The optional `--nodup` flag adds duplicate content only once.

    >   **./create_corpuses.py "./DATA/content"  "db"**
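
How `create_corpuses.py` detects duplicates for `--nodup` is not shown in this README. One common approach, sketched here under that assumption, is to hash each file's text and skip any hash that has already been seen. The helper name `iter_unique_texts` is hypothetical.

```python
import hashlib
from pathlib import Path


def iter_unique_texts(dir_txt_files):
    """Yield (filename, text) for each .txt file, skipping repeated content.

    Hypothetical helper: it illustrates content-hash de-duplication and is
    not taken from create_corpuses.py.
    """
    seen = set()
    for path in sorted(Path(dir_txt_files).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        digest = hashlib.sha1(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content was already yielded
        seen.add(digest)
        yield path.name, text
```

Hashing the stripped text means two files that differ only in leading or trailing whitespace count as duplicates; near-duplicates with small edits would still pass through.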


****
**Let's create a corpus for ./DATA/content, which has 186 database articles.**

Three files will be placed in ./CORPUSES:

* corpus_name_lemmatized.dict
* corpus_name_lemmatized.mm
* corpus_name_lemmatized.mm.index
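
Going by the log output in this README, the file names are the corpus name plus a `_lemmatized` suffix. A small sketch makes that naming explicit; the helper `corpus_paths` is hypothetical and not part of the scripts.

```python
def corpus_paths(corpus_name, corpus_dir="./CORPUSES"):
    """Return the three file paths expected for a lemmatized corpus.

    The "_lemmatized" suffix and the extensions are inferred from the
    log lines in this README, not read from create_corpuses.py.
    """
    base = f"{corpus_dir}/{corpus_name}_lemmatized"
    return {
        "dictionary": base + ".dict",
        "corpus": base + ".mm",
        "index": base + ".mm.index",
    }


paths = corpus_paths("_db")
print(paths["dictionary"])  # ./CORPUSES/_db_lemmatized.dict
print(paths["corpus"])      # ./CORPUSES/_db_lemmatized.mm
print(paths["index"])       # ./CORPUSES/_db_lemmatized.mm.index
```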

In [123]:
%run create_corpuses.py "DATA/content" _db  # --nodup

INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(14414 unique tokens: ['richard', 'clinical', 'portalrecent', 'escrow', 'iyaz']...) from 186 documents (total 119911 corpus positions)
INFO : saving Dictionary object under ./CORPUSES/_db_lemmatized.dict, separately None
INFO : storing corpus in Matrix Market format to ./CORPUSES/_db_lemmatized.mm
INFO : saving sparse matrix to ./CORPUSES/_db_lemmatized.mm
INFO : PROGRESS: saving document #0


style is lemmatized


INFO : saved 186x14414 matrix, density=2.460% (65955/2681004)
DEBUG : closing ./CORPUSES/_db_lemmatized.mm
DEBUG : closing ./CORPUSES/_db_lemmatized.mm
INFO : saving MmCorpus index to ./CORPUSES/_db_lemmatized.mm.index


## GOING FROM CORPUS TO MODEL
1. **Create models from corpuses in ./CORPUSES.**

        create_models.py corpus_name num_topics no_of_passes

    * **corpus_name**: use the same name you gave in the step that created the corpus from the .txt files
    * **num_topics**: how many topics the LDA model should contain
    * **no_of_passes**: how many training passes over the corpus to train the model
>   **./create_models.py  "_db"   100   40**



The models will be saved in ./MODELS. 
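
The model file names used in this README (`*_tfidf_lda.model`) suggest that LDA is trained on TF-IDF-weighted vectors rather than on raw counts. The sketch below shows, in plain Python, what TF-IDF reweighting does to a bag-of-words corpus. The weighting used here (term count times `log2(N / df)`, followed by L2 normalization) is an assumption for illustration; `create_models.py` may configure it differently.

```python
import math
from collections import Counter


def tfidf_transform(bow_corpus):
    """Reweight a bag-of-words corpus with TF-IDF.

    bow_corpus: list of documents, each a list of (term_id, count) pairs
    with each term_id appearing at most once per document.
    Returns the same structure with L2-normalized TF-IDF weights.
    """
    num_docs = len(bow_corpus)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [
            (term_id, count * math.log2(num_docs / doc_freq[term_id]))
            for term_id, count in doc
        ]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Terms present in every document get weight 0 and are dropped.
        weighted_corpus.append(
            [(t, w / norm) for t, w in weights if w > 0.0] if norm else []
        )
    return weighted_corpus


# Toy corpus: term 0 appears in all documents, so it carries no signal.
toy = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(0, 1), (1, 1), (2, 1)]]
print(tfidf_transform(toy))
```

In the toy corpus, term 0 disappears from every document because it occurs everywhere, which is the effect TF-IDF is meant to have on very common words.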

**NOTE**: The cosine similarity function uses a function from "create_corpuses.py".

### Loading corpuses and models
> dictionary = gensim.corpora.Dictionary.load("./CORPUSES/_db_lemmatized.dict")

> model = gensim.models.LdaModel.load("./MODELS/_db_lemmatized_tfidf_lda.model")

> corpus = gensim.corpora.MmCorpus("./CORPUSES/_db_lemmatized.mm")

-----
**Get LDA model for the privacy database (~ 186 articles)**

With 100 topics and 40 passes, this will take a few minutes.

I've already saved the model files that this produces in **./MODELS**. Let's just load them.

In [149]:
# I've already saved models for the corpus _db, so I will load it
import gensim
privacy_lda_tfidf_model  = gensim.models.LdaModel.load("./MODELS/_db_lemmatized_tfidf_lda.model")
privacy_lemmatized_dict= gensim.corpora.Dictionary.load("./CORPUSES/_db_lemmatized.dict")
privacy_lemmatized_corpus = gensim.corpora.MmCorpus("./CORPUSES/_db_lemmatized.mm")

INFO : loading LdaModel object from ./MODELS/_db_lemmatized_tfidf_lda.model
INFO : loading id2word recursively from ./MODELS/_db_lemmatized_tfidf_lda.model.id2word.* with mmap=None
INFO : setting ignored attribute dispatcher to None
INFO : setting ignored attribute state to None
INFO : loading LdaModel object from ./MODELS/_db_lemmatized_tfidf_lda.model.state
INFO : loading Dictionary object from ./CORPUSES/_db_lemmatized.dict
INFO : loaded corpus index from ./CORPUSES/_db_lemmatized.mm.index
INFO : initializing corpus reader from ./CORPUSES/_db_lemmatized.mm
INFO : accepted corpus with 186 documents, 14414 features, 65955 non-zero entries


-----
### SOME COOL THINGS TO DO WITH MODELS: cosine similarity between two articles in the DB
**Let's use our model to find the cosine similarity between two articles in the DB.**

In [150]:
from create_models import *

# some articles from the DB
art1fn = "google search autocomplete feature links a german .txt"
art2fn = "google search autocomplete links german prime mini.txt"
art3fn = "some germans dislike having photos of their homes .txt"

# Clearly art1 and art2 are more related to each other than to art3
# Let's see if our  cosine similarity gets this right as well

STORY_CONTENT_DIR = "./DATA/content"

print("cos_sim score for g1 and g2 should be high, and it is %f" %topic_relev_score(art1fn,art2fn, STORY_CONTENT_DIR, privacy_lemmatized_dict, privacy_lda_tfidf_model))
print("cos_sim score for g1 and g3 should be medium, and it is %f" %topic_relev_score(art1fn,art3fn, STORY_CONTENT_DIR, privacy_lemmatized_dict, privacy_lda_tfidf_model))
print("cos_sim score for g2 and g3 should be medium, and it is %f" %topic_relev_score(art2fn,art3fn, STORY_CONTENT_DIR, privacy_lemmatized_dict, privacy_lda_tfidf_model))


cos_sim score for g1 and g2 should be high, and it is 0.947885
cos_sim score for g1 and g3 should be medium, and it is 0.716024
cos_sim score for g2 and g3 should be medium, and it is 0.691978
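
The internals of `topic_relev_score` are not shown in this README. As a rough illustration of the underlying idea, the sketch below computes the cosine similarity between two sparse topic distributions, given as lists of `(topic_id, probability)` pairs like those an LDA model returns for a document. The function name and the toy vectors are made up for the example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors of (index, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(index, 0.0) for index, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Toy topic distributions (illustrative values, not real model output).
doc1 = [(3, 0.7), (8, 0.3)]
doc2 = [(3, 0.6), (8, 0.4)]
doc3 = [(5, 0.9), (8, 0.1)]

print(cosine_similarity(doc1, doc2))  # close to 1: same dominant topic
print(cosine_similarity(doc1, doc3))  # close to 0: little topic overlap
```

Because topic probabilities are non-negative, the score always falls between 0 and 1, which matches the range of the scores printed above.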


### SOME COOL THINGS TO DO WITH MODELS: Top 10 topics
**Get the top 10 topics of a corpus, with n words/phrases for each topic**

The function **create_models.panda_topics(path_to_model, path_to_corpus, n)** needs the location of the model you want to examine, the location of the corresponding corpus that the model was built from, and the number of words to show per topic.

**NOTE**: To use this function, the model must have been created with at least 10 topics.

In [137]:
#import importlib
#importlib.reload(create_models)
from create_models import *

panda_topics("./MODELS/_db_lemmatized_tfidf_lda.model" , "./CORPUSES/_db_lemmatized.mm", 10)



INFO : loading LdaModel object from ./MODELS/_db_lemmatized_tfidf_lda.model
INFO : loading id2word recursively from ./MODELS/_db_lemmatized_tfidf_lda.model.id2word.* with mmap=None
INFO : setting ignored attribute dispatcher to None
INFO : setting ignored attribute state to None
INFO : loading LdaModel object from ./MODELS/_db_lemmatized_tfidf_lda.model.state
INFO : loaded corpus index from ./CORPUSES/_db_lemmatized.mm.index
INFO : initializing corpus reader from ./CORPUSES/_db_lemmatized.mm
INFO : accepted corpus with 186 documents, 14414 features, 65955 non-zero entries


| Topic rank | Terms (`weight*word`) |
| --- | --- |
| 1st Most Discussed Topic (MDT) | `0.008*data + 0.006*breach + 0.006*said + 0.006*customer + 0.005*record + 0.004*school + 0.004*information + 0.004*patient + 0.004*online + 0.004*twitter` |
| 2nd MDT | `0.013*user + 0.012*facebook + 0.008*security + 0.007*site + 0.006*network + 0.006*photo + 0.006*setting + 0.006*friend + 0.005*feature + 0.005*policy` |
| 3rd MDT | `0.011*apps + 0.008*child + 0.007*snapchat + 0.007*website + 0.005*video + 0.005*number + 0.005*government + 0.005*personal + 0.005*party + 0.004*digital` |
| 4th MDT | `0.010*credit + 0.009*location + 0.008*affected + 0.005*http + 0.005*week + 0.005*statement + 0.004*open + 0.004*researcher + 0.004*file + 0.004*iphone` |
| 5th MDT | `0.015*google + 0.012*app + 0.009*student + 0.008*address + 0.007*court + 0.006*york + 0.006*mail + 0.006*law + 0.006*developer + 0.006*post` |
| 6th MDT | `0.010*search + 0.007*nytimes + 0.006*com + 0.006*rule + 0.005*european + 0.005*archive + 0.005*www + 0.005*engine + 0.005*term + 0.004*campaign` |
| 7th MDT | `0.008*consumer + 0.007*web + 0.007*phone + 0.007*say + 0.006*cooky + 0.006*target + 0.006*message + 0.006*state + 0.005*technology + 0.005*collected` |
| 8th MDT | `0.012*sony + 0.010*card + 0.008*compromised + 0.007*stolen + 0.007*million + 0.007*theft + 0.005*payment + 0.004*breach + 0.004*tjx + 0.004*exposed` |
| 9th MDT | `0.007*agency + 0.007*city + 0.006*agent + 0.005*worker + 0.005*database + 0.004*general + 0.004*victim + 0.004*mother + 0.004*family + 0.004*attorney` |
| 10th MDT | `0.007*supra + 0.004*pregnant + 0.004*published + 0.004*teen + 0.003*subject + 0.003*didfalse + 0.003*duke + 0.003*father + 0.003*info + 0.003*secure` |
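
Each row above is a topic written as `weight*word` terms joined by ` + `. If you want the terms as data rather than as a string, a small parser is enough. This is a sketch; `parse_topic` is a hypothetical helper, not part of `create_models.py`.

```python
def parse_topic(topic_string):
    """Split '0.008*data + 0.006*breach' into [('data', 0.008), ('breach', 0.006)]."""
    terms = []
    for chunk in topic_string.split(" + "):
        weight, _, word = chunk.partition("*")
        # Some gensim versions wrap the word in double quotes; strip them.
        terms.append((word.strip().strip('"'), float(weight)))
    return terms


row = "0.012*sony + 0.010*card + 0.008*compromised + 0.007*stolen"
print(parse_topic(row))
# [('sony', 0.012), ('card', 0.01), ('compromised', 0.008), ('stolen', 0.007)]
```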


-------
### SOME COOL THINGS TO DO WITH MODELS: Brangelina top topics post-divorce announcement

**I'm interested to know what the top topics are for the Brangelina breakup.**

I did a quick Bing scrape for Angelina Jolie, placing 54 Bing articles in **./DATA/jolie**.

Let's make a corpus and model and then run our **panda_topics()** function to see what the top topics are. 

In [167]:
# Create corpus, checking for duplicate articles
%run create_corpuses.py "DATA/jolie" jolie  --nodup

# Create LDA_TFIDF stacked model for our corpus
%run create_models.py "jolie"  50 25  # 50 topics and 25 passes

# Toggle off processing output since it's large...


INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(3085 unique tokens: ['richard', 'phnom', 'clinical', 'pitt', 'parenthood']...) from 54 documents (total 10317 corpus positions)
INFO : saving Dictionary object under ./CORPUSES/jolie_lemmatized.dict, separately None
INFO : storing corpus in Matrix Market format to ./CORPUSES/jolie_lemmatized.mm
INFO : saving sparse matrix to ./CORPUSES/jolie_lemmatized.mm
INFO : ITER_DB_ARTILCES():duplicate files below not added twice to content_dump of script:  
INFO : PROGRESS: saving document #0
INFO : saved 54x3085 matrix, density=3.974% (6621/166590)
DEBUG : closing ./CORPUSES/jolie_lemmatized.mm
DEBUG : closing ./CORPUSES/jolie_lemmatized.mm
INFO : saving MmCorpus index to ./CORPUSES/jolie_lemmatized.mm.index
INFO : loaded corpus index from ./CORPUSES/jolie_lemmatized.mm.index
INFO : initializing corpus reader from ./CORPUSES/jolie_lemmatized.mm
INFO : accepted corpus with 54 documents, 3085 features, 6621 non-ze

style is lemmatized


INFO : running online LDA training, 50 topics, 40 passes over the supplied corpus of 54 documents, updating model once every 54 documents, evaluating perplexity every 54 documents, iterating 50x with a convergence threshold of 0.001000
DEBUG : bound: at document #0
INFO : -852.239 per-word bound, 35439916659840407091196100302870322717915435831171396527926175548848177660243666142773353116512915303453414827474387434327858924501419221902095291231966614659277778881055296232306937233013282980062351104922699030208200889210820573636634554411913778361794560.0 perplexity estimate based on a held-out corpus of 54 documents with 458 words
INFO : PROGRESS: pass 0, at document #54/54
DEBUG : performing inference on a chunk of 54 documents
DEBUG : 54/54 documents converged within 50 iterations
DEBUG : updating topics
INFO : topic #14 (0.020): 0.001*couple + 0.001*close + 0.001*divorce + 0.001*feliz + 0.001*jennifer + 0.001*source + 0.001*los + 0.001*pitt + 0.001*miraval + 0.001*cotillard
INFO : topi
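
The very large perplexity figure in the training log above follows from the per-word bound on the same line. As far as I can tell, gensim reports perplexity as 2 raised to the negative bound, so the logged value can be reproduced approximately (the bound is rounded to three decimals in the log):

```python
import math

per_word_bound = -852.239               # copied from the log line above
perplexity = 2.0 ** (-per_word_bound)   # assumed relation: 2^(-bound)

print(f"{perplexity:.3e}")
print(math.floor(math.log10(perplexity)) + 1)  # digits before the decimal point
```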

| Topic rank | Terms (`weight*word`) |
| --- | --- |
| 1st Most Discussed Topic (MDT) | `0.015*divorce + 0.014*pitt + 0.013*brad + 0.010*child + 0.010*said + 0.010*couple + 0.008*split + 0.008*year + 0.007*family + 0.007*filed` |
| 2nd MDT | `0.009*source + 0.006*report + 0.006*cotillard + 0.006*marion + 0.006*like + 0.006*say + 0.006*want + 0.005*claim + 0.005*destructive + 0.005*set` |
| 3rd MDT | `0.012*film + 0.009*woman + 0.006*star + 0.005*oscar + 0.005*hollywood + 0.005*movie + 0.005*acting + 0.004*director + 0.004*certainly + 0.004*think` |
| 4th MDT | `0.011*inbox + 0.007*latest + 0.007*loung + 0.006*ung + 0.005*giveaway + 0.005*content + 0.005*signup + 0.005*news + 0.005*trailer + 0.005*khmer` |
| 5th MDT | `0.007*heard + 0.007*news + 0.006*kid + 0.005*matter + 0.005*tuesday + 0.005*statement + 0.005*press + 0.004*later + 0.004*list + 0.004*privacy` |
| 6th MDT | `0.007*father + 0.007*killed + 0.007*cambodia + 0.005*son + 0.005*movie + 0.005*scene + 0.005*people + 0.005*tell + 0.005*work + 0.004*normal` |
| 7th MDT | `0.019*cancer + 0.015*breast + 0.008*gene + 0.007*mastectomy + 0.006*risk + 0.005*surgery + 0.005*double + 0.005*brca + 0.004*mutation + 0.004*preventive` |
| 8th MDT | `0.005*read + 0.004*exposed + 0.004*sandwich + 0.004*starving + 0.004*eat + 0.003*love + 0.003*course + 0.003*surprising + 0.003*hour + 0.003*shocking` |
| 9th MDT | `0.007*marvel + 0.006*gazing + 0.006*horoscope + 0.005*finally + 0.005*captain + 0.004*future + 0.004*finished + 0.004*stop + 0.004*far + 0.004*superhero` |
| 10th MDT | `0.006*rape + 0.004*hand + 0.004*inevitable + 0.003*month + 0.003*march + 0.003*london + 0.003*war + 0.003*violence + 0.003*shame + 0.003*survivor` |


In [171]:
# Load the jolie LDA model and corpus (which are saved to ./MODELS and ./CORPUSES
# by running the create_models and create_corpuses scripts)
panda_topics("./MODELS/jolie_lemmatized_tfidf_lda.model" , "./CORPUSES/jolie_lemmatized.mm", 20)


INFO : loading LdaModel object from ./MODELS/jolie_lemmatized_tfidf_lda.model
INFO : loading id2word recursively from ./MODELS/jolie_lemmatized_tfidf_lda.model.id2word.* with mmap=None
INFO : setting ignored attribute state to None
INFO : setting ignored attribute dispatcher to None
INFO : loading LdaModel object from ./MODELS/jolie_lemmatized_tfidf_lda.model.state
INFO : loaded corpus index from ./CORPUSES/jolie_lemmatized.mm.index
INFO : initializing corpus reader from ./CORPUSES/jolie_lemmatized.mm
INFO : accepted corpus with 54 documents, 3085 features, 6621 non-zero entries


| Topic rank | Terms (`weight*word`) |
| --- | --- |
| 1st Most Discussed Topic (MDT) | `0.015*divorce + 0.014*pitt + 0.013*brad + 0.010*child + 0.010*said + 0.010*couple + 0.008*split + 0.008*year + 0.007*family + 0.007*filed + 0.006*old + 0.006*new + 0.006*actress + 0.006*actor + 0.006*know + 0.006*custody + 0.006*time + 0.006*way + 0.006*celebrity + 0.005*day` |
| 2nd MDT | `0.009*source + 0.006*report + 0.006*cotillard + 0.006*marion + 0.006*like + 0.006*say + 0.006*want + 0.005*claim + 0.005*destructive + 0.005*set + 0.005*photo + 0.005*self + 0.004*claimed + 0.004*rumor + 0.004*comment + 0.004*friend + 0.004*husband + 0.004*going + 0.004*maddox + 0.004*advertisement` |
| 3rd MDT | `0.012*film + 0.009*woman + 0.006*star + 0.005*oscar + 0.005*hollywood + 0.005*movie + 0.005*acting + 0.004*director + 0.004*certainly + 0.004*think + 0.004*right + 0.004*career + 0.004*working + 0.004*look + 0.004*read + 0.004*direct + 0.004*debut + 0.004*second + 0.004*having + 0.003*home` |
| 4th MDT | `0.011*inbox + 0.007*latest + 0.007*loung + 0.006*ung + 0.005*giveaway + 0.005*content + 0.005*signup + 0.005*news + 0.005*trailer + 0.005*khmer + 0.005*newsletter + 0.005*book + 0.004*coordinate + 0.004*rouge + 0.004*exclusive + 0.004*cambodian + 0.004*currently + 0.004*day + 0.003*story + 0.003*born` |
| 5th MDT | `0.007*heard + 0.007*news + 0.006*kid + 0.005*matter + 0.005*tuesday + 0.005*statement + 0.005*press + 0.004*later + 0.004*list + 0.004*privacy + 0.004*confirmed + 0.004*ask + 0.004*challenging + 0.004*space + 0.004*kindly + 0.004*saddened + 0.004*deserve + 0.003*break + 0.003*manager + 0.003*geyer` |
| 6th MDT | `0.007*father + 0.007*killed + 0.007*cambodia + 0.005*son + 0.005*movie + 0.005*scene + 0.005*people + 0.005*tell + 0.005*work + 0.004*normal + 0.004*little + 0.004*directed + 0.004*face + 0.004*red + 0.004*carpet + 0.004*issue + 0.003*oldest + 0.003*start + 0.003*onscreen + 0.003*went` |
| 7th MDT | `0.019*cancer + 0.015*breast + 0.008*gene + 0.007*mastectomy + 0.006*risk + 0.005*surgery + 0.005*double + 0.005*brca + 0.004*mutation + 0.004*preventive + 0.003*ovarian + 0.003*developing + 0.003*increase + 0.003*ovary + 0.003*tube + 0.003*underwent + 0.003*chance + 0.003*fallopian + 0.002*removing + 0.002*hormone` |
| 8th MDT | `0.005*read + 0.004*exposed + 0.004*sandwich + 0.004*starving + 0.004*eat + 0.003*love + 0.003*course + 0.003*surprising + 0.003*hour + 0.003*shocking + 0.002*sexy + 0.002*pretty + 0.002*stealing + 0.002*kim + 0.002*wenn + 0.002*make + 0.002*knot + 0.002*hot + 0.002*lost + 0.002*tied` |
| 9th MDT | `0.007*marvel + 0.006*gazing + 0.006*horoscope + 0.005*finally + 0.005*captain + 0.004*future + 0.004*finished + 0.004*stop + 0.004*far + 0.004*superhero + 0.004*detail + 0.004*sign + 0.003*focus + 0.003*won + 0.003*project + 0.003*fact + 0.003*africa + 0.003*studio + 0.003*wonder + 0.003*today` |
| 10th MDT | `0.006*rape + 0.004*hand + 0.004*inevitable + 0.003*month + 0.003*march + 0.003*london + 0.003*war + 0.003*violence + 0.003*shame + 0.003*survivor + 0.003*rapist + 0.003*arriving + 0.003*sexual + 0.003*global + 0.003*terminal + 0.003*lax + 0.003*exited + 0.003*smoke + 0.003*visit + 0.003*signature` |
