# Output from Topic Modeling - basic model validated, no plots

### Using larger model picked by cross-validation: 7 topics (no transformation)

**How does this notebook work?**

To keep the notebook readable, most of the code is hidden. To see or change the code, click the link next to `toggle on` at the cell location. Even when the code is hidden, short instructions above each cell describe what is happening.

Steps:

* Load libraries
* Note: orca needs to be installed for inline interactive plots (frequency plot)
* You may need to download the NLTK resources: `nltk.download(['punkt', 'stopwords', 'wordnet'])`

**Important:** to view all possible arguments, see the local modules `helper.helper_funs`, `helper.word_preprocess` and `methods.LDAprep`. Precise definitions can be found there.

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True) # may be a deprecation warning 

import pandas as pd
import numpy as np
from itertools import product

from os import path
import random
import warnings
warnings.filterwarnings("ignore")

import nltk
from nltk.corpus import stopwords

import gensim
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()


# local modules
from helper.helper_funs import time_filename, save_folder_file, logged
from helper.word_preprocess import DataLoader, CleanText, WordCount, PlotWords
from methods.LDAprep import GensimPrep, GenMod, Diagnostics, ModelResults 

from helper.dk_ipython import toggled
from IPython.display import HTML
toggled()

## Set parameters:

`seed`: set seed number for reproducibility.

`start`, `max_num_topics`, `step` are for setting topic number (`int`) for cross-validation (lowest topic number, highest, step size)
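As a rough illustration, the three values can be read as a grid of candidate topic counts. The sketch below uses made-up values and assumes that the highest value is included in the grid; the helper modules may treat the upper bound differently.

```python
# Hypothetical values; the settings used for the real cross-validation
# are not shown in this notebook.
start = 2            # lowest number of topics to try
max_num_topics = 12  # highest number of topics to try (assumed inclusive)
step = 5             # gap between consecutive candidates

# Candidate topic counts that a cross-validation loop would iterate over.
candidate_topics = list(range(start, max_num_topics + 1, step))
print(candidate_topics)  # [2, 7, 12]
```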

`col` is for the columns with text. If more than one is chosen, text will be concatenated as one document per art center.

`stop` sets the languages whose low-meaning words (the, a, are, etc.) are removed. French is included here because some French text may remain even after observations labeled as French are removed.

`stop.update` adds extra words to be removed that are specific to this problem. For example, 'art' is not a meaningful word for our purposes. More words may be added by hand.

**To view the handpicked words removed, click on the `here` link by `toggle on/off`, under the first cell to review the code**
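The effect of `stop` and `stop.update` can be sketched with a tiny stand-in set. The short word list and the example tokens below are assumptions for illustration only; the notebook itself builds `stop` from the much longer NLTK lists.

```python
# Tiny stand-in for the NLTK English + French stop word lists (assumption:
# the real lists come from nltk.corpus.stopwords and are much longer).
stop = {"the", "a", "are", "le", "la", "les"}

# Domain-specific words judged uninformative for this corpus.
stop.update(["art", "arts", "artist", "artists"])

# Made-up tokens from one document.
tokens = ["the", "artists", "are", "building", "a", "community", "mural"]
kept = [tok for tok in tokens if tok not in stop]
print(kept)  # ['building', 'community', 'mural']
```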

In [2]:
seed = 2018 
random.seed(seed)

# setting for tuning topic number in LDA
eval_every = 5      # the smaller the number, the more precise the plot

col = ['Mandate/Mission', 
       'Main Community-Engaged Arts / Arts for Social Change Activities', 
       'Additional Info']  # name of column(s) with text

stop = set(stopwords.words(['english', 'french'])) # words to remove from corpus

# stop.update([ 'community', 'use', 'make', 'organization', 'people', 'art', 'arts', 'artist', 'artists']) # hand picked words to remove



## Load and clean the data

Use `DataLoader` and `CleanText` to load and clean the data: accents, duplicates, empty rows, foreign-language records and URLs are removed. Small words (2 letters or fewer) are removed, then words are shortened and simplified with lemmatization and stemming. The stemmer can be changed from `Porter` to `Lancaster` (more aggressive) or `Snowball` (language specific).

Result: `processed_data` is a list of lists containing every unique (stemmed and lemmatized) word in each document.
`pd_data` is a DataFrame with the raw text, for later display.

**Note that the path should be replaced as needed**
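The cleaning steps described above can be approximated with the standard library alone. This is only a sketch under stated assumptions: `strip_accents` and `simple_preprocess` are hypothetical names, not functions from `helper.word_preprocess`, and stemming and lemmatization are left out.

```python
import re
import unicodedata

# Matches http(s) links and bare www. links up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def strip_accents(text):
    """Replace accented letters with their plain base letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def simple_preprocess(text, stop, min_len=3):
    """Lowercase, drop URLs and accents, keep plain-letter words of min_len or more."""
    text = URL_RE.sub(" ", text.lower())
    text = strip_accents(text)
    words = re.findall(r"[a-z]+", text)
    return [w for w in words if len(w) >= min_len and w not in stop]

# Made-up stop words and document, for illustration only.
stop = {"the", "and", "for", "our"}
doc = "Our café is at https://example.org and we run workshops for the youth."
tokens = simple_preprocess(doc, stop)
print(tokens)  # ['cafe', 'run', 'workshops', 'youth']
```

A stemmer would then be applied to each remaining token, which is what produces shortened forms such as `commun` in the tables further down.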

In [3]:
data = DataLoader(filename='artbridges_profiles.csv', 
                       data_folder='data',
                       colname=col,  # column(s) to remove weird chars
                       rm_NAs=False, # completely remove records with NaNs 
                                     #(implemented for 1st column only, other cols replace NaN with empty strings and concat)
                       removed_language=['french']) # list of languages to remove (lowercase)

donnee = data.data_loader()  # examples of problem mission statements: â, Lâ 


clean = CleanText(donnee,
                  stop=stop,
                  stemmer='Porter',
                  remove_urls=True)

processed_data = clean.preprocess()

pd_data = pd.DataFrame({'Text': donnee}) # for using unstemmed text later

Reading and cleaning data file...

Removing duplicates in column(s): ['Mandate/Mission', 'Main Community-Engaged Arts / Arts for Social Change Activities', 'Additional Info']...

Removing french words...

Done!


Only plain letters kept and words of 3 letters or more.

63 url(s) removed.


In [4]:
gp = GensimPrep(processed_data)
lda_dict = gp.gensimDict()
bow_rep = gp.gensimBOW(lda_dict)

Removing words of less than 3 characters, and words present in at least 0.8% of documents

Removing gaps in indices caused by preprocessing...

Saving gensim dictionary to corpus_data/2018-12-4/Gensim_dict_Params_MWL3_PD0.8__8h12m10s.dict

Saving .mm matrix to corpus_data/2018-12-4/BOWmat_8h12m10s.mm
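The dictionary and bag-of-words steps logged above can be sketched without gensim. The toy documents are made up, and the sketch assumes that the `0.8` in the log is a document fraction (words appearing in more than 80% of documents are dropped); `GensimPrep` may apply a different rule.

```python
from collections import Counter

# Made-up, already stemmed documents.
docs = [
    ["art", "youth", "program", "art"],
    ["art", "danc", "cultur"],
    ["art", "youth", "school"],
]

max_doc_fraction = 0.8  # assumed meaning of the "0.8" in the log above

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

# Keep words that are not too common, then assign gap-free integer ids.
kept = sorted(w for w, n in doc_freq.items() if n / len(docs) <= max_doc_fraction)
token2id = {word: idx for idx, word in enumerate(kept)}

def doc2bow(doc):
    """Return sorted (token_id, count) pairs for words in the vocabulary."""
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

bow = [doc2bow(doc) for doc in docs]
print(token2id)  # {'cultur': 0, 'danc': 1, 'program': 2, 'school': 3, 'youth': 4}
print(bow)       # [[(2, 1), (4, 1)], [(0, 1), (1, 1)], [(3, 1), (4, 1)]]
```

Here `art` appears in every document, so it is filtered out and never receives an id.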



## Running Optimized LDA Model 

__*using results from validation process*__

In [5]:
get_model = GenMod(bow_rep, lda_dict)

lda_model = get_model.LDA(eval_every=eval_every,
                          random_state=seed, num_topics= 7, alpha= 0.25, decay= 0.75)

Parameters used in model: 
Number of topics: 7
TFIDF transformation: False
Number of iterations: 10000
Batch size: 100
Update every 1 pass
Number of passes: 10
Topic inference per word: True
Alpha: 0.25
Eta: auto
Decay: 0.75
Minimum probability: 0.05
Minimum_phi_value: 0.02
Evaluate every: 5
Random seed: 2018

Saving LDA model to: 
saved_models/LDA/2018-12-4/LDA_Params_NT7_TFIDFFalsePer_word_topicTrue_8h12m17s.model


**Printing** linear combination of keywords for topics.

## Results: diagnostic plots and summary tables

## Tables with Topic Modeling Summary

Tables showing the distribution of words within topics and distribution of topics within documents (1 document = all texts from 1 organization)

**Important:** Topics are numbered starting at 0. For example, if we have 4 topics, they will be numbered from 0 to 3.

In [6]:
diag = Diagnostics(lda_dict,
                   bow_rep, 
                   processed_data)

warnings.filterwarnings("ignore", category=FutureWarning)

diag.LDAvis(lda_model)

Rendering visualization...


<hr>

### Table 1: Show Most Important Topic in each Document.

Displays the topic that contributes most to each document (in terms of marginal probability). A very high probability (close to 1) suggests that the topics are easily distinguished from one another.

Only the first rows are displayed below, but all organizations with non-empty text are gathered in this table.
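Picking the dominant topic amounts to taking the topic with the highest probability for each document. A minimal sketch with made-up probabilities follows; `dominant_topic` is a hypothetical helper, not the code behind `format_topics_sentences`.

```python
# Hypothetical per-document topic probabilities, as (topic_id, probability)
# pairs; topics below a minimum probability are simply absent.
doc_topics = [
    [(0, 0.02), (3, 0.96)],
    [(0, 0.51), (2, 0.30), (5, 0.19)],
]

def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

for doc_id, probs in enumerate(doc_topics):
    topic_id, prob = dominant_topic(probs)
    print(doc_id, topic_id, prob)
```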

In [7]:
mod_res = ModelResults(lda_model, bow_rep, donnee)

df_dominant_topic = mod_res.format_topics_sentences()

small_dom_topic = pd.concat([df_dominant_topic.Dominant_Topic, 
                             df_dominant_topic.Percent_Contribution, 
                             df_dominant_topic.Important_Keywords,
                             df_dominant_topic.Mission], axis = 1)

small_dom_topic.columns = ['Dominant_Topic', 'Percent_Contribution', 'Important_Keywords', 'Mission']

small_dom_topic.head(15)

Saving the table to: results/2018-12-4/dominant_topic_per_text__8h12m31s.csv


| | Dominant_Topic | Percent_Contribution | Important_Keywords | Mission |
|---|---|---|---|---|
| 0 | 3.0 | 0.9592 | art, commun, space, stori, cultur, creativ, sh... | our goal is to have the space become self-sust... |
| 1 | 0.0 | 0.5118 | art, commun, youth, artist, program, work, pro... | white water gallery is a not-for-profit artist... |
| 2 | 0.0 | 0.3697 | art, commun, youth, artist, program, work, pro... | women sweetgrass film festival 2011 |
| 3 | 2.0 | 0.811 | commun, art, artist, program, creativ, studio,... | (in development):\n\n -to provide appropriat... |
| 4 | 0.0 | 0.8157 | art, commun, youth, artist, program, work, pro... | to create original physical theatre. to share ... |
| 5 | 1.0 | 0.9887 | commun, art, nation, cultur, indigen, danc, wo... | the woodland cultural centre is a first nation... |
| 6 | 0.0 | 0.5568 | art, commun, youth, artist, program, work, pro... | to provide an open and supportive environment ... |
| 7 | 2.0 | 0.5737 | commun, art, artist, program, creativ, studio,... | the mandate of the 4cs foundation is to build ... |
| 8 | 3.0 | 0.9694 | art, commun, space, stori, cultur, creativ, sh... | 4elements living arts is a multidisciplinary, ... |
| 9 | 0.0 | 0.9757 | art, commun, youth, artist, program, work, pro... | 7th generation image makers is a community art... |


<hr>

### Table 2: Most Important Document/Text for each Topic

These are the documents that were most clearly classified into a given topic.
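The selection can be thought of as keeping, for each topic, the document with the highest contribution. The sketch below uses a handful of rows copied from Table 1, so its result covers only those rows and not the full corpus; the loop is an illustration, not the code behind `top_texts_per_topic`.

```python
# (document_id, dominant_topic, percent_contribution), copied from Table 1.
rows = [
    (0, 3, 0.9592),
    (1, 0, 0.5118),
    (5, 1, 0.9887),
    (8, 3, 0.9694),
    (9, 0, 0.9757),
]

# For each topic, remember the document with the highest contribution so far.
best = {}
for doc_id, topic, contribution in rows:
    if topic not in best or contribution > best[topic][1]:
        best[topic] = (doc_id, contribution)

for topic in sorted(best):
    print(topic, best[topic])
```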

In [8]:
tab = mod_res.top_texts_per_topic(df_dominant_topic)

tab = pd.concat([tab.Dominant_Topic, 
                             tab.Percent_Contribution, 
                             tab.Important_Keywords,
                             tab.Mission], axis = 1)

tab.columns = ['Dominant_Topic', 'Percent_Contribution', 'Important_Keywords', 'Mission']
tab

Saving the table to: results/2018-12-4/top_texts_per_topic_8h12m32s.csv


| | Dominant_Topic | Percent_Contribution | Important_Keywords | Mission |
|---|---|---|---|---|
| 0 | 0.0 | 0.9934 | art, commun, youth, artist, program, work, pro... | gallery mandate: incorporated in 1988 and work... |
| 1 | 1.0 | 0.9887 | commun, art, nation, cultur, indigen, danc, wo... | the woodland cultural centre is a first nation... |
| 2 | 2.0 | 0.9902 | commun, art, artist, program, creativ, studio,... | founded in 1987, canadian stage is one of the ... |
| 3 | 3.0 | 0.9879 | art, commun, space, stori, cultur, creativ, sh... | the heroes of 107th project aims to:\n\n-empow... |
| 4 | 4.0 | 0.9454 | aborigin, art, urban, youth, commun, artist, p... | mandate: urban society for aboriginal youth (u... |
| 5 | 5.0 | 0.9865 | art, program, commun, school, artist, disabl, ... | in-definite arts is a nonprofit visual arts ce... |
| 6 | 6.0 | 0.9705 | commun, art, creativ, women, workshop, write, ... | established in 1991 in calgary, alberta, canad... |


### Table 3: How documents are distributed between topics

In [9]:
top_dist = mod_res.topic_distribution(df_dominant_topic, tab)

top_dist

Saving the table to: results/2018-12-4/doc_distribution_in_topics_8h12m32s.csv


| | Dominant_Topic | Important_Keywords | Num_Documents | Perc_Documents |
|---|---|---|---|---|
| 0 | 0.0 | art, commun, youth, artist, program, work, pro... | 129 | 0.56087 |
| 1 | 1.0 | commun, art, nation, cultur, indigen, danc, wo... | 17 | 0.07391 |
| 2 | 2.0 | commun, art, artist, program, creativ, studio,... | 15 | 0.06522 |
| 3 | 3.0 | art, commun, space, stori, cultur, creativ, sh... | 16 | 0.06957 |
| 4 | 4.0 | aborigin, art, urban, youth, commun, artist, p... | 7 | 0.03043 |
| 5 | 5.0 | art, program, commun, school, artist, disabl, ... | 31 | 0.13478 |
| 6 | 6.0 | commun, art, creativ, women, workshop, write, ... | 15 | 0.06522 |
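`Perc_Documents` is a fraction of all documents rather than a percentage. As a quick check, the column can be recomputed from the `Num_Documents` counts in Table 3; the rounding to five decimals is an assumption made to match the displayed values.

```python
# Document counts per dominant topic, copied from Table 3.
num_documents = {0: 129, 1: 17, 2: 15, 3: 16, 4: 7, 5: 31, 6: 15}

total = sum(num_documents.values())

# Share of all documents whose dominant topic is each topic.
perc_documents = {t: round(n / total, 5) for t, n in num_documents.items()}

print(total)              # 230
print(perc_documents[0])  # 0.56087
```

More than half of the 230 documents fall under topic 0, so the other six topics share the remaining 44%.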


<hr>
<hr>

## Update Corpus with BC Specific Data

Update the trained model with additional data from the BC dataset, using the `Mandate/Mission` and `Additional Info` columns.

In [10]:
col2 = ['Mandate/Mission', 
       'Additional Info']  

data2 = DataLoader(filename='artbridges_test.csv', 
                       data_folder='data',
                       colname=col2,  # column(s) to remove weird chars
                       rm_NAs=False, # completely remove records with NaNs 
                                     #(implemented for 1st column only, other cols replace NaN with empty strings and concat)
                       removed_language=['french']) # list of languages to remove (lowercase)

donnee2 = data2.data_loader()  # examples of problem mission statements: â, Lâ 


clean2 = CleanText(donnee2,
                  stop=stop,
                  stemmer='Porter',
                  remove_urls=True)

processed_data2 = clean2.preprocess()

pd_data2 = pd.DataFrame({'Text': donnee2}) # for using unstemmed text later

Reading and cleaning data file...

Removing duplicates in column(s): ['Mandate/Mission', 'Additional Info']...

Removing french words...

Done!


Only plain letters kept and words of 3 letters or more.

9 url(s) removed.


In [11]:
gp2 = GensimPrep(processed_data2)
lda_dict2 = gp2.gensimDict()
lda_id2 = lda_dict2.token2id
bow_rep2 = gp2.gensimBOW(lda_dict2)


# tfidfTrans2 = gp2.tfidf_trans(bow_rep2)

Removing words of less than 3 characters, and words present in at least 0.8% of documents

Removing gaps in indices caused by preprocessing...

Saving gensim dictionary to corpus_data/2018-12-4/Gensim_dict_Params_MWL3_PD0.8__8h12m32s.dict

Saving .mm matrix to corpus_data/2018-12-4/BOWmat_8h12m32s.mm



In [12]:
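# Create a new corpus, made of previously unseen documents.
# Encode it with the dictionary the model was trained on (lda_dict), so that
# token ids match the model's vocabulary; words unseen in training are ignored.

other_corpus = [lda_dict.doc2bow(text) for text in processed_data2]

lda_model.update(other_corpus)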
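An LDA model stores word probabilities by integer id, so documents passed to an update need ids from the vocabulary the model was trained with. The sketch below illustrates that encoding step in plain Python; the vocabulary, the documents and the `encode_with_training_vocab` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical training vocabulary (word -> id), fixed when the model was fit.
train_token2id = {"art": 0, "commun": 1, "youth": 2, "danc": 3}

def encode_with_training_vocab(doc, token2id):
    """Bag-of-words using the training ids; unseen words are dropped."""
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

# Made-up new documents; "film" and "festiv" were never seen in training.
new_docs = [["youth", "art", "film", "art"], ["danc", "festiv"]]
encoded = [encode_with_training_vocab(d, train_token2id) for d in new_docs]
print(encoded)  # [[(0, 2), (2, 1)], [(3, 1)]]
```

If the new documents were instead encoded with a freshly built dictionary, the same id could point to a different word than the one the model learned.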

In [13]:
diag2 = Diagnostics(lda_dict2,
                   bow_rep2, 
                   processed_data2)

warnings.filterwarnings("ignore", category=FutureWarning)

diag.LDAvis(lda_model)

Rendering visualization...


### Again, no improvement with the inclusion of the BC data