# Finding optimal number of topics
The Machinelearningplus.com approach to finding the optimal number of topics is to build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value.

Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.
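
As a rough illustration of "the end of a rapid growth", the sketch below returns the first k after which coherence improves by less than a threshold. The function name `end_of_rapid_growth`, the `min_gain` threshold, and the sample scores are assumptions made for this example; they are not part of the original approach or results.

```python
def end_of_rapid_growth(ks, scores, min_gain=0.01):
    """Return the first k after which coherence grows by less than min_gain."""
    for i in range(len(scores) - 1):
        if scores[i + 1] - scores[i] < min_gain:
            return ks[i]
    # Coherence kept growing quickly over the whole range
    return ks[-1]

# Illustrative values only, not results from this notebook
ks = [2, 6, 10, 14, 18]
scores = [0.30, 0.40, 0.43, 0.435, 0.43]
print(end_of_rapid_growth(ks, scores))  # 10
```

The threshold is a judgment call; a smaller `min_gain` pushes the chosen k higher.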

If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.
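
One possible way to quantify repeated keywords is the mean Jaccard similarity between the keyword sets of every pair of topics. This is only a sketch: the function name `topic_overlap` and the keyword lists are hypothetical, and in practice the lists would be the top words per topic taken from a trained model.

```python
from itertools import combinations

def topic_overlap(topics):
    """Mean Jaccard similarity between the keyword sets of all topic pairs."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        a, b = set(a), set(b)
        union = a | b
        if union:
            total += len(a & b) / len(union)
    return total / len(pairs)

# Hypothetical top keywords per topic
distinct = [["price", "market", "stock"], ["goal", "match", "team"]]
repeated = [["price", "market", "stock"], ["price", "market", "trade"]]

print(topic_overlap(distinct))  # 0.0
print(topic_overlap(repeated))  # 0.5
```

A value that rises as k increases would support the suspicion that k is too large.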

The `compute_coherence_values()` function (see below) trains multiple LDA models and returns the models and their corresponding coherence scores.

## How to run
1. Paste the relevant corpus
2. 
3. 


In [18]:
import pickle

def openData(filename):
    # Load a pickled object; the context manager closes the file even on error
    with open(filename, 'rb') as infile:
        return pickle.load(infile)

dictionary = openData('model/dictionaryData')  # Gensim dictionary
M1 = openData('model/M1Data')                  # corpus passed to the model
corpus = openData('model/corpusData')          # input texts used for the coherence score

In [4]:
import gensim
from gensim.models import CoherenceModel

# mallet_path must be set to the location of the MALLET binary before calling this function.
# gensim.models.wrappers (including LdaMallet) is only available in gensim versions before 4.0.

def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

        # Progress indicator: number of topics just evaluated
        print(num_topics)

    return model_list, coherence_values

## Coherence score over topics, Stepping every 4th
Stepping from 2 to 98 with a step length of 4 (`start=2`, `limit=102`, `step=4`; `limit` itself is excluded)
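
Because Python's `range` excludes its upper bound, the topic counts that are actually evaluated depend on `limit`. A small check, using the `start`, `limit`, and `step` values from the plotting cell in this section:

```python
# Topic counts swept with start=2, limit=102, step=4
ks = list(range(2, 102, 4))
print(ks[0], ks[-1], len(ks))  # 2 98 25
```

The number of coherence values returned by a run must equal `len(ks)`, otherwise the plot cannot pair each score with its number of topics.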

In [19]:
# Can take a long time to run
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=M1, texts=corpus, limit=102, start=2, step=4)


In [None]:
print(coherence_values)

In [None]:
#coherence_values = [0.3250157403264292, 0.4116900808150424, 0.4277707176143622, 0.4419215919328382, 0.43300547700202796, 0.43331531510788773, 0.4507613241261437, 0.4401537362798364, 0.4436828004648704, 0.4321437701507225, 0.4287353256235635, 0.4345214971377863, 0.4207751964079457, 0.42903262376408574, 0.41356412264891634, 0.4094172711686478, 0.41385811506381215, 0.3971588021061537, 0.4020413137080951, 0.3944484660319797, 0.40475857425077855, 0.4043484985706639, 0.40233617235048474, 0.39983668071530726, 0.39050519443737175]
import matplotlib.pyplot as plt

# Show graph
limit = 102
start = 2
step = 4
x = range(start, limit, step)
plt.figure(figsize=(12, 6), dpi=80)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.xticks(list(x))
# The labels must be a list; a bare string would be split into single characters
plt.legend(["coherence_values"], loc='best')
plt.grid(True)
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

### Result
The highest coherence score is:
- Num Topics = xx  has Coherence Value of x.xxx
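
The highest-scoring number of topics can also be picked programmatically instead of being read off the printed list. A minimal sketch, assuming the scores are paired with the same range that was used for the run; `best_num_topics` is a hypothetical helper and the values are illustrative only:

```python
def best_num_topics(ks, scores):
    """Return the (k, score) pair with the highest coherence score."""
    return max(zip(ks, scores), key=lambda pair: pair[1])

# Illustrative values only, not results from this notebook
ks = list(range(2, 22, 4))  # [2, 6, 10, 14, 18]
scores = [0.31, 0.40, 0.44, 0.42, 0.41]

best_k, best_score = best_num_topics(ks, scores)
print("Num Topics =", best_k, " has Coherence Value of", round(best_score, 4))
```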

## Coherence score over topics, Stepping every one
To be more exact, we now evaluate every value of k. Since coherence drops as the number of topics grows, we also reduce the upper limit.

We step through every value from 10 to 29 (`start=10`, `limit=30`, `step=1`).

In [None]:
# Can take a long time to run
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=M1, texts=corpus, limit=30, start=10, step=1)

In [None]:
print(coherence_values)

In [None]:
# Print the coherence scores
x = range(10, 30, 1)  # must match start, limit and step of the run above
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
#coherence_values = [0.32571064083112744, 0.3537344947550003, 0.38589689375318126, 0.39384185450287246, 0.3968198253781281, 0.4085182731090585, 0.42127183913002364, 0.41899078263050016, 0.42801836357265693, 0.4309421068656622, 0.4235910756450276, 0.4516905876830981, 0.44230979084842603, 0.4296217041847969, 0.4312776000207533, 0.42955783914286844, 0.4300818857106711, 0.43599484572542013, 0.4288093062032049, 0.431145903679033, 0.45102550720523704, 0.4408525746026434, 0.43442023864813467, 0.437182271259546, 0.44709664105064434, 0.44248557145575135, 0.42246041004330903, 0.43784503147185494, 0.43667989244832234, 0.4351125576380536, 0.43902968092362493, 0.4440178347094399, 0.4271598549200618, 0.43617684419539876, 0.43125821112489443, 0.4329721336785587, 0.43217676681552086, 0.43310381516722185, 0.45072485690785824, 0.4401630451262257, 0.43130301356185136, 0.4334126436221655, 0.421009575540349, 0.4272964973804624, 0.4350421319148597, 0.4233540568965465, 0.4350122511003696, 0.43232603003979947, 0.41460403500831033, 0.42528017344333635, 0.4267933944574832, 0.4276083290805306, 0.4232434154417294, 0.4289482978832848, 0.4148342197092429, 0.4184389211956406, 0.4167830014416288, 0.41978178212929607, 0.40395936537564947, 0.40763329977423757, 0.421859133976826, 0.4079528877973058, 0.408300929971383, 0.41109076687251356, 0.4171141990155899, 0.40615382099134045, 0.4129390628617565, 0.40967900018905035]
import matplotlib.pyplot as plt

# Show graph
limit = 30
start = 10
step = 1
x = range(start, limit, step)
plt.figure(figsize=(12, 6), dpi=80)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.xticks(list(x))
# The labels must be a list; a bare string would be split into single characters
plt.legend(["coherence_values"], loc='best')
plt.grid(True)
plt.show()

### Result
This gives us:
- Num Topics = xx  has Coherence Value of x.xxxx
- Num Topics = yy  has Coherence Value of y.yyyy
- Num Topics = zz  has Coherence Value of z.zzzz

This leaves three good candidates for the number of topics.
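
One way to shortlist such candidates is to look for local peaks in the coherence curve and keep the highest few. This is a sketch under assumptions: `top_candidates` is a hypothetical helper, "peak" here means strictly higher than both neighbours, and the values are illustrative rather than results from this notebook.

```python
def top_candidates(ks, scores, n=3):
    """Return the n highest-scoring local maxima as (k, score) pairs."""
    peaks = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < len(scores) - 1 else float("-inf")
        if score > left and score > right:
            peaks.append((ks[i], score))
    # Highest coherence first
    return sorted(peaks, key=lambda p: p[1], reverse=True)[:n]

# Illustrative values only
ks = list(range(10, 18))
scores = [0.40, 0.43, 0.41, 0.45, 0.44, 0.42, 0.44, 0.43]

for k, score in top_candidates(ks, scores):
    print("Num Topics =", k, " has Coherence Value of", round(score, 4))
```

The shortlisted models still need a manual look at their topics, since coherence alone does not guarantee interpretability.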