Unable to handle nan output from a topic model. #12
Can I get the full stack trace on the first error, so that I know which function it might come from? :)

If there's no GDPR issue, it would also be useful to know what data you used and what hyperparameters you supplied to the model.
Hello, thanks for the response. :)
Absolutely. Here you go:
There are data privacy concerns (FERPA, to be exact). Therefore, it's not a good idea for me to share my dataset. But here's a bit of domain context:
Hyperparameters for DMM model on Tweetopic:
Thanks for the info, I will try to deliver a fix as quickly as possible. I think you were right in your judgment: it has to be the nans being output by tweetopic. In the meantime, you can try to identify which texts are problematic (i.e., result in nans) and remove them before you pass the corpus as a list of texts to topicwizard.
I checked your Colab notebook, and I think when you try to remove the texts there's some pandas shenanigans going on. I would try:

```python
import numpy as np

transformed_corpus = topic_pipeline.transform(corpus)
# Turn the corpus into an array so it can be indexed with a boolean mask
filtered_corpus = np.array(corpus)
# Get a mask of the documents where any topic probability is nan
problematic_indices = np.isnan(transformed_corpus).any(axis=1)
# Remove them
filtered_corpus = filtered_corpus[~problematic_indices]

topicwizard.visualize(pipeline=pipeline, corpus=filtered_corpus)
```

I think this should work fine, and I will try to address these issues in the meantime.
I managed to reproduce the error with a custom version of NMF that randomly assigns nans to certain observations:

```python
import numpy as np
from sklearn.decomposition import NMF


class RandomNanNMF(NMF):
    """NMF variant that fills 30 random documents' topic distributions with nans."""

    def transform(self, X):
        res = super().transform(X)
        n_docs = res.shape[0]
        nans = np.random.choice(np.arange(n_docs), size=30, replace=False)
        res[nans, :] = np.nan
        return res

    def fit_transform(self, X, y=None, W=None, H=None):
        res = super().fit_transform(X, y, W, H)
        n_docs = res.shape[0]
        nans = np.random.choice(np.arange(n_docs), size=30, replace=False)
        res[nans, :] = np.nan
        return res
```

The solution was to filter out the nan values in the preprocessing step of topicwizard and throw a warning to the user informing them about the removal of these documents.
Fix merged into main, and a new version has been built and published to PyPI. You should try installing topicwizard 0.2.6 and running your code again :)
Thank you! I'll give it a shot now.
Can you confirm that the fix worked?
Hi! Sorry for the long wait; this query got lost in my massive work-email mountain. I reran the visualization command with version 0.2.6 installed and get the following error. Note that the UserError shows up, which means your validation is working as intended.
Hello! I am very impressed with this library, which I found through Marton Kardos's article on Medium.

I attempted to use topicwizard to visualize short-text topic modeling inferences from a quickly trained tweetopic model. The results of my issues and troubleshooting are located in this hosted Google Colab notebook. Please note that you can't run the notebook; I've only published it so you can easily view it via Google Colab.
Information about my Conda environment:
I can train a topic model in tweetopic with no problems, and I can import the topicwizard module with no problem. Once I've finished training my tweetopic model, I can infer topic names via

```python
topicwizard.infer_topic_names(pipeline=pipeline)
```

with no problems. However, when I attempt to run

```python
topicwizard.visualize(vectorizer=vectorizer, topic_model=dmm, corpus=corpus_cleaned, port=8080)
```

I receive the following error:

I troubleshot and found that when I `.transform(...)` my corpus post-training, some inferences contain nans. I dropped those rows so that they don't interfere with the elaborate computations the /prepare/<...py> files have in place to easily get the Dash app running. Despite cleaning up the nans, when I run the same `.visualize()` function above with the further-cleaned inferences, I receive the following error, tracing back to ...tweetopic/lib/site-packages/joblib/parallel.py:288

Further context on the steps I followed is available in that Google Colab notebook. Could anyone help me figure out what is preventing me from getting the Dash app working? Thank you!
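For context on the `pipeline=` argument used in the thread above: such a pipeline can be built by composing the vectorizer and the topic model with scikit-learn's `make_pipeline`. A minimal sketch, using NMF as a stand-in topic model (the actual model here is a tweetopic DMM, which is designed to follow the same scikit-learn API; the corpus below is made up for illustration):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus purely for illustration
corpus = [
    "tweets about machine learning",
    "short text topic modeling",
    "machine learning on short text",
    "topic models for tweets",
]

# Compose vectorizer + topic model into a single pipeline object,
# which can then be passed as pipeline= to topicwizard
pipeline = make_pipeline(CountVectorizer(), NMF(n_components=2, random_state=0))
doc_topic = pipeline.fit_transform(corpus)
print(doc_topic.shape)  # (4, 2): one topic distribution per document
```

Passing the composed pipeline (rather than separate `vectorizer=` and `topic_model=` arguments) keeps the vectorization and transform steps consistent between training and visualization.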