
Doesn't seem to work for Gensim Topic Models #32

Open
avisekksarma opened this issue Feb 29, 2024 · 9 comments
Labels
bug Something isn't working


avisekksarma commented Feb 29, 2024

I have trained an LDA model using Gensim and now want to use topicwizard for visualization. But even after following the README for the Gensim topic model case, it doesn't seem to work.

Note: I am working in Nepali; the LDA model is also trained on Nepali text.
Code:

```python
from gensim.corpora.dictionary import Dictionary
from topicwizard.compatibility import gensim_pipeline
import topicwizard

# dictionary and lda_model are loaded from my earlier Gensim training.
# I have checked, and there is no problem with either of them.
dictionary_form_data = Dictionary(dictionary)
pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)

corpus = [" ".join(tokenized_news) for tokenized_news in dictionary]
```

No problem so far; the corpus prints as expected:

```python
print(corpus[10:12])
```

[screenshot of printed corpus]

Now fitting the corpus:

```python
pipeline.fit(corpus)
```

[screenshot of fit output]

Printing topic_names:

[screenshot of printed topic names]

So everything seems to be fine, but visualization fails:

[screenshot of error]

So what is the problem here? Is it that topicwizard doesn't work with Gensim topic models?

x-tabdeveloping (Owner)
I think I wrote this down in the documentation, but only pretrained Gensim models are supported. To me it seems like you were trying to train the wrapper pipeline. Try fitting the model first and then pack it in a pipeline. I would also encourage using scikit-learn wherever possible, as it's easier to work with (in my humble opinion).

The way I would do it in Gensim goes something like this:

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.utils import tokenize
from topicwizard.compatibility import gensim_pipeline

# corpus is assumed to be a list of raw text strings
tokenized_corpus = [list(tokenize(text, lower=True)) for text in corpus]
dictionary = Dictionary(tokenized_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_corpus]
lda = LdaModel(bow_corpus, num_topics=10)
pipeline = gensim_pipeline(dictionary, model=lda)
topic_data = pipeline.prepare_topic_data(corpus)
```

I know for a fact that this works, because it's in the library's test suite.
I hope I could be of help; if you experience further issues, feel free to write again.

avisekksarma (Author) commented Mar 8, 2024

I'm sorry I couldn't get back to this sooner.
My code, as described in this issue, is exactly the same as yours, except that:

1. I just loaded an lda_model that was trained earlier and saved to disk:

```python
lda_model = models.ldamodel.LdaModel.load('./results/models/40_topics')
```

I don't think loading a saved model versus training on the fly makes any difference to the model, except that training makes every topicwizard run slower. You trained on the fly with:

```python
lda = LdaModel(bow_corpus, num_topics=10)
```

2. I also loaded the tokenized corpus, which is in the same format as yours, only the language differs; in my code the tokenized corpus was named dictionary (just a different name). So I did:

```python
dictionary_form_data = Dictionary(dictionary)
pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)
topic_data = pipeline.prepare_topic_data(corpus)
```

but it throws the following error:

[screenshot of error]

3. This time I also trained on the fly as you suggested, using the loaded bow_corpus variable (I checked it):

[screenshot]

And I still got the same error as above.

Conclusion

In conclusion: I have the tokenized corpus, bow_corpus, and lda_model (in point 3 I also trained the lda_model on the fly). I checked the format/shape of those variables, and they match the English-language case. I couldn't use Gensim's tokenize() function, since tokenization and stemming work differently for Nepali. Running pipeline.prepare_topic_data(corpus) throws an error, and so does topicwizard.visualize(corpus, model=pipeline).

Note: I wanted to avoid any misunderstanding, so I have posted this detailed comment; let me know if you need any more information.

x-tabdeveloping (Owner)
Thanks for the effort, and sorry for the trouble! I will mark this as a bug and try to fix it as soon as possible.

@x-tabdeveloping added the bug (Something isn't working) label on Mar 8, 2024
x-tabdeveloping (Owner)
@avisekksarma version 1.0.2 should fix this issue; can you please confirm that it works on your end?

avisekksarma (Author)
I apologize for not getting back to this sooner; I was going through exams.
I have now tested with the same code as in the first part of this thread, and it is still not working.

What has changed: it no longer throws NotFittedError as before. Instead, running

```python
topicwizard.visualize(corpus, model=pipeline)
```

throws:

```
IndexError: index 18580 is out of bounds for axis 1 with size 18580
```
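For context, this kind of IndexError usually means something indexed one position past the width of the topic-term matrix, i.e. an off-by-one between a token id and the vocabulary size. A minimal NumPy reproduction of the same class of error (illustrative only, not topicwizard's actual code):

```python
import numpy as np

vocab_size = 18580
# A topic-term matrix with one column per vocabulary entry
topic_term = np.zeros((40, vocab_size))

# Valid column ids are 0 .. vocab_size - 1; asking for column
# vocab_size itself raises the same IndexError as above.
try:
    topic_term[:, vocab_size]
except IndexError as err:
    print(err)
```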

To be clear, all my code (as described above) is:

```python
from load_variables import load_data

# Loads everything, since I already trained with Gensim
# (the same training code as "lda = LdaModel(bow_corpus, num_topics=10)")
processed_data, bow_corpus, id2word, lda_model = load_data()

dictionary = processed_data['body'].to_numpy().tolist()
print(dictionary[0])
# prints ['सर्वोच्च', 'अदालत', 'प्रस्तावित', 'न्यायाधीश', 'अब्दुल', 'अजिज', ...]
# i.e. tokenized Nepali text, just like the English case

corpus = [" ".join(tokenized_news) for tokenized_news in dictionary]

from gensim.corpora.dictionary import Dictionary
from topicwizard.compatibility import gensim_pipeline
import topicwizard

dictionary_form_data = Dictionary(dictionary)
# The next two lines are skipped, since I already loaded a trained model.
# Note: even if I run them, the error is the same; I just have to retrain.
# bow_corpus = [dictionary.doc2bow(text) for text in texts]
# current_lda_model = LdaModel(bow_corpus, num_topics=40)

pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)

pipeline.fit(corpus)

# No error up to this point, but the next line fails
topicwizard.visualize(corpus, model=pipeline)
```

The last line threw this error:

[screenshot of error]

And, as you suggested, running

```python
topic_data = pipeline.prepare_topic_data(corpus)
```

also gives an error:

[screenshot of error]

Shouldn't I be able to use the LDA model I trained with Gensim by just loading it, rather than retraining it here as you suggested? Training an LDA model takes a lot of time, and I feel that training on the fly (i.e. lda = LdaModel(bow_corpus, num_topics=10)) versus loading a trained model should make no difference, since it's the same model underneath.

Can you provide a code snippet that you believe should work with Gensim LDA models, so that I can tweak it and see whether it works?

x-tabdeveloping (Owner)
Hmm, your code should be fine. I will look into this further. Very strange behaviour, considering that I have a test case for exactly this and it passes. Loading the model from disk should, in theory, be fine. In the meantime, as I said above, you can try using sklearn instead of Gensim.

avisekksarma (Author)
By using sklearn, do you mean feeding the Gensim-trained lda_model into some sklearn function? Can you provide some code for how to use sklearn with the trained lda_model, or clarify what you meant by using sklearn here?
Thank you.

x-tabdeveloping (Owner)
If you don't mind training a new model, you can do so with sklearn like this:

```python
import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import topicwizard
from topicwizard.pipeline import make_pipeline

pipeline = make_pipeline(CountVectorizer(), LatentDirichletAllocation())
topic_data = pipeline.prepare_topic_data(corpus)

# I recommend persisting this data to disk
joblib.dump(topic_data, "topic_data.joblib")

topicwizard.visualize(topic_data=topic_data)
```

Though I personally would NOT recommend using LDA unless you have very good reasons to do so.
My experience is that it's incredibly slow and gives subpar results most of the time. If you want to keep it classic and don't want to mess with contextual models, I would recommend a good preprocessing pipeline and NMF, because it's way faster and gives nicer results.
If you want the best results possible, you can try KeyNMF.

@x-tabdeveloping
Copy link
Owner

Also, big apologies for not fixing this issue for a while; I've been working on a publication and have been ill for a bit. I will hopefully get to it soon :D
