Unable to handle nan output from a topic model. #12

Closed
vshourie-asu opened this issue Jun 18, 2023 · 10 comments · Fixed by #13
Labels: bug (Something isn't working)

Comments

@vshourie-asu

vshourie-asu commented Jun 18, 2023

Hello! I am very impressed with this library, based on Marton Kardos's article on Medium.

I attempted to use topicwizard to visualize short-text topic modeling inferences based on a quickly trained tweetopic model. My troubleshooting steps and their results are in this hosted Google Colab notebook. Please note that you can't run the notebook; I have only published it so that you can easily view it in Google Colab.

Information about my Conda environment:

  • Python 3.9.16 (Installed via Anaconda)
  • ipykernel 6.9.12 and its dependencies (Anaconda)
  • tweetopic 0.3.0 (PyPI)
  • topic-wizard 0.2.5 (PyPI)
  • And all other dependencies which ensue from these two libraries.

I can train a topic model in tweetopic with no problems. I can import the topicwizard module with no problems. Once my tweetopic model has finished training, I can infer topic names via topicwizard.infer_topic_names(pipeline=pipeline) with no problems.

However, when I attempt to run topicwizard.visualize(vectorizer=vectorizer, topic_model=dmm, corpus=corpus_cleaned, port=8080) I receive the following error:

ValueError:
Invalid element(s) received for the 'size' property of scatter.marker
Invalid elements include: [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

The 'size' property is a number and may be specified as:
  - An int or float in the interval [0, inf]
  - A tuple, list, or one-dimensional numpy array of the above

While troubleshooting, I found that calling .transform(...) on my corpus after training yields inferences that contain nans. I dropped those rows so that they would not interfere with the computations in the /prepare/<...py> files that get the Dash app running. Even with the nans removed, running the same .visualize() call with the cleaned inferences raises the following error, which traces back to ...tweetopic/lib/site-packages/joblib/parallel.py:288. Further context on the steps I followed is available in the Google Colab notebook.

ValueError: cannot assign slice from input of different size

Could anyone help me figure out what is preventing me from getting the Dash app working? Thank you!

@x-tabdeveloping
Owner

Can I get the full stack trace on the first error, so that I know which function it might come from? :)

@x-tabdeveloping
Owner

If there's no GDPR issue it would also be useful to know what data you used and what hyperparameters you supplied to the model.

@vshourie-asu
Author

Hello, thanks for the response. :)

Can I get the full stack trace on the first error, so that I know which function it might come from? :)

Absolutely. Here you go:

ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 topicwizard.visualize(vectorizer=vectorizer, topic_model=dmm, corpus=corpus_cleaned, port=8080)

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\app.py:245, in visualize(corpus, vectorizer, topic_model, pipeline, document_names, topic_names, port, enable_notebook)
    242     (_, vectorizer), (_, topic_model) = pipeline.steps
    244 print("Preprocessing")
--> 245 app = get_dash_app(
    246     vectorizer=vectorizer,
    247     topic_model=topic_model,
    248     corpus=corpus,
    249     document_names=document_names,
    250     topic_names=topic_names,
    251 )
    252 return run_app(app, port=port)

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\app.py:73, in get_dash_app(vectorizer, topic_model, corpus, document_names, topic_names)
     42 def get_dash_app(
     43     vectorizer: Any,
     44     topic_model: Any,
   (...)
     47     topic_names: Optional[List[str]] = None,
     48 ) -> Dash:
     49     """Returns topicwizard Dash application.
     50 
     51     Parameters
   (...)
     71         Dash application object for topicwizard.
     72     """
---> 73     blueprint = get_app_blueprint(
     74         vectorizer=vectorizer,
     75         topic_model=topic_model,
     76         corpus=corpus,
     77         document_names=document_names,
     78         topic_names=topic_names,
     79     )
     80     app = Dash(
     81         __name__,
     82         blueprint=blueprint,
   (...)
     92         ],
     93     )
     94     return app

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\app.py:31, in get_app_blueprint(vectorizer, topic_model, corpus, document_names, topic_names)
     24 def get_app_blueprint(
     25     vectorizer: Any,
     26     topic_model: Any,
   (...)
     29     topic_names: Optional[List[str]] = None,
     30 ) -> DashBlueprint:
---> 31     blueprint = prepare_blueprint(
     32         vectorizer=vectorizer,
     33         topic_model=topic_model,
     34         corpus=corpus,
     35         document_names=document_names,
     36         topic_names=topic_names,
     37         create_blueprint=create_blueprint,
     38     )
     39     return blueprint

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\template.py:31, in prepare_blueprint(vectorizer, topic_model, corpus, create_blueprint, document_names, topic_names)
     29 if topic_names is None:
     30     topic_names = [f"Topic {i}" for i in range(n_topics)]
---> 31 blueprint = create_blueprint(
     32     vocab=vocab,
     33     document_term_matrix=document_term_matrix,
     34     document_topic_matrix=document_topic_matrix,
     35     topic_term_matrix=topic_term_matrix,
     36     document_names=document_names,
     37     corpus=corpus,
     38     vectorizer=vectorizer,
     39     topic_model=topic_model,
     40     topic_names=topic_names,
     41 )
     42 return blueprint

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\app.py:35, in create_blueprint(vocab, document_term_matrix, document_topic_matrix, topic_term_matrix, document_names, corpus, vectorizer, topic_model, topic_names)
     23 def create_blueprint(
     24     vocab: np.ndarray,
     25     document_term_matrix: np.ndarray,
   (...)
     33 ) -> DashBlueprint:
     34     # --------[ Collecting blueprints ]--------
---> 35     topic_blueprint = topics.create_blueprint(
     36         vocab=vocab,
     37         document_term_matrix=document_term_matrix,
     38         document_topic_matrix=document_topic_matrix,
     39         topic_term_matrix=topic_term_matrix,
     40         document_names=document_names,
     41         corpus=corpus,
     42         vectorizer=vectorizer,
     43         topic_model=topic_model,
     44         topic_names=topic_names,
     45     )
     46     documents_blueprint = documents.create_blueprint(
     47         vocab=vocab,
     48         document_term_matrix=document_term_matrix,
   (...)
     55         topic_names=topic_names,
     56     )
     57     words_blueprint = words.create_blueprint(
     58         vocab=vocab,
     59         document_term_matrix=document_term_matrix,
   (...)
     66         topic_names=topic_names,
     67     )

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\topics.py:65, in create_blueprint(vocab, document_term_matrix, document_topic_matrix, topic_term_matrix, topic_names, **kwargs)
     56 (
     57     topic_importances,
     58     term_importances,
   (...)
     61     topic_term_matrix, document_term_matrix, document_topic_matrix
     62 )
     64 # --------[ Collecting blueprints ]--------
---> 65 intertopic_map = create_intertopic_map(
     66     topic_positions, topic_importances, topic_names
     67 )
     68 blueprints = [
     69     intertopic_map,
     70     relevance_slider,
   (...)
     74     wordcloud,
     75 ]
     76 # layouts = [blueprint.layout for blueprint in blueprints]
     77 
     78 # --------[ Creating app blueprint ]--------

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\components\topics\intertopic_map.py:29, in create_intertopic_map(topic_positions, topic_importances, topic_names)
     20 x, y = topic_positions
     22 intertopic_map = DashBlueprint()
     24 intertopic_map.layout = dcc.Graph(
     25     id="intertopic_map",
     26     responsive=True,
     27     config=dict(scrollZoom=True),
     28     animate=True,
---> 29     figure=plots.intertopic_map(
     30         x=x,
     31         y=y,
     32         topic_importances=topic_importances,
     33         topic_names=topic_names,
     34     ),
     35     className="flex-1",
     36 )
     38 intertopic_map.clientside_callback(
     39     """
     40     function(currentTopic, topicNames, currentPlot) {
   (...)
     61     prevent_initial_call=True,
     62 )
     63 return intertopic_map

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\plots\topics.py:18, in intertopic_map(x, y, topic_importances, topic_names)
     11 def intertopic_map(
     12     x: np.ndarray,
     13     y: np.ndarray,
     14     topic_importances: np.ndarray,
     15     topic_names: List[str],
     16 ) -> go.Figure:
     17     n_topics = x.shape[0]
---> 18     topic_trace = go.Scatter(
     19         x=x,
     20         y=y,
     21         mode="text+markers",
     22         text=topic_names,
     23         marker=dict(
     24             size=topic_importances,
     25             sizemode="area",
     26             sizeref=2.0 * max(topic_importances) / (100.0**2),
     27             sizemin=4,
     28             color="rgb(168,162,158)",
     29         ),
     30         customdata=np.atleast_2d(np.arange(x.shape[0])).T,
     31     )
     32     fig = go.Figure([topic_trace])
     33     fig.update_layout(
     34         clickmode="event",
     35         modebar_remove=["lasso2d", "select2d"],
   (...)
     40         margin=dict(l=0, r=0, b=0, t=0, pad=0),
     41     )

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\graph_objs\_scatter.py:3378, in Scatter.__init__(self, arg, alignmentgroup, cliponaxis, connectgaps, customdata, customdatasrc, dx, dy, error_x, error_y, fill, fillcolor, fillpattern, groupnorm, hoverinfo, hoverinfosrc, hoverlabel, hoveron, hovertemplate, hovertemplatesrc, hovertext, hovertextsrc, ids, idssrc, legend, legendgroup, legendgrouptitle, legendrank, legendwidth, line, marker, meta, metasrc, mode, name, offsetgroup, opacity, orientation, selected, selectedpoints, showlegend, stackgaps, stackgroup, stream, text, textfont, textposition, textpositionsrc, textsrc, texttemplate, texttemplatesrc, uid, uirevision, unselected, visible, x, x0, xaxis, xcalendar, xhoverformat, xperiod, xperiod0, xperiodalignment, xsrc, y, y0, yaxis, ycalendar, yhoverformat, yperiod, yperiod0, yperiodalignment, ysrc, **kwargs)
   3376 _v = marker if marker is not None else _v
   3377 if _v is not None:
-> 3378     self["marker"] = _v
   3379 _v = arg.pop("meta", None)
   3380 _v = meta if meta is not None else _v

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\basedatatypes.py:4865, in BasePlotlyType.__setitem__(self, prop, value)
   4863 # ### Handle compound property ###
   4864 if isinstance(validator, CompoundValidator):
-> 4865     self._set_compound_prop(prop, value)
   4867 # ### Handle compound array property ###
   4868 elif isinstance(validator, (CompoundArrayValidator, BaseDataValidator)):

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\basedatatypes.py:5276, in BasePlotlyType._set_compound_prop(self, prop, val)
   5273 # Import value
   5274 # ------------
   5275 validator = self._get_validator(prop)
-> 5276 val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   5278 # Save deep copies of current and new states
   5279 # ------------------------------------------
   5280 curr_val = self._compound_props.get(prop, None)

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\_plotly_utils\basevalidators.py:2475, in CompoundValidator.validate_coerce(self, v, skip_invalid, _validate)
   2472     v = self.data_class()
   2474 elif isinstance(v, dict):
-> 2475     v = self.data_class(v, skip_invalid=skip_invalid, _validate=_validate)
   2477 elif isinstance(v, self.data_class):
   2478     # Copy object
   2479     v = self.data_class(v)

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\graph_objs\scatter\_marker.py:1674, in Marker.__init__(self, arg, angle, angleref, anglesrc, autocolorscale, cauto, cmax, cmid, cmin, color, coloraxis, colorbar, colorscale, colorsrc, gradient, line, maxdisplayed, opacity, opacitysrc, reversescale, showscale, size, sizemin, sizemode, sizeref, sizesrc, standoff, standoffsrc, symbol, symbolsrc, **kwargs)
   1672 _v = size if size is not None else _v
   1673 if _v is not None:
-> 1674     self["size"] = _v
   1675 _v = arg.pop("sizemin", None)
   1676 _v = sizemin if sizemin is not None else _v

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\basedatatypes.py:4873, in BasePlotlyType.__setitem__(self, prop, value)
   4869         self._set_array_prop(prop, value)
   4871     # ### Handle simple property ###
   4872     else:
-> 4873         self._set_prop(prop, value)
   4874 else:
   4875     # Make sure properties dict is initialized
   4876     self._init_props()

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\basedatatypes.py:5217, in BasePlotlyType._set_prop(self, prop, val)
   5215         return
   5216     else:
-> 5217         raise err
   5219 # val is None
   5220 # -----------
   5221 if val is None:
   5222     # Check if we should send null update

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\plotly\basedatatypes.py:5212, in BasePlotlyType._set_prop(self, prop, val)
   5209 validator = self._get_validator(prop)
   5211 try:
-> 5212     val = validator.validate_coerce(val)
   5213 except ValueError as err:
   5214     if self._skip_invalid:

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\_plotly_utils\basevalidators.py:777, in NumberValidator.validate_coerce(self, v)
    772             v_invalid = np.logical_not(v_valid)
    773             some_invalid_els = np.array(v, dtype="object")[v_invalid][
    774                 :10
    775             ].tolist()
--> 777             self.raise_invalid_elements(some_invalid_els)
    779     v = v_array  # Always numeric numpy array
    780 elif self.array_ok and is_simple_array(v):
    781     # Check numeric

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\_plotly_utils\basevalidators.py:303, in BaseValidator.raise_invalid_elements(self, invalid_els)
    301     def raise_invalid_elements(self, invalid_els):
    302         if invalid_els:
--> 303             raise ValueError(
    304                 """
    305     Invalid element(s) received for the '{name}' property of {pname}
    306         Invalid elements include: {invalid}
    307 
    308 {valid_clr_desc}""".format(
    309                     name=self.plotly_name,
    310                     pname=self.parent_name,
    311                     invalid=invalid_els[:10],
    312                     valid_clr_desc=self.description(),
    313                 )
    314             )

ValueError: 
    Invalid element(s) received for the 'size' property of scatter.marker
        Invalid elements include: [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

    The 'size' property is a number and may be specified as:
      - An int or float in the interval [0, inf]
      - A tuple, list, or one-dimensional numpy array of the above
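The traceback shows `size=topic_importances` being passed to `go.Scatter`, so the nans must already be present in `topic_importances`. The trace does not show how topicwizard computes that array. As a minimal sketch, assuming it is some column-wise aggregate of the document-topic matrix (here a plain sum, which is an assumption), a single all-nan document is enough to turn every topic's importance into nan:

```python
import numpy as np

# Toy document-topic matrix (invented data): 4 documents x 3 topics.
# The second document stands in for a text the model could not score.
document_topic_matrix = np.array(
    [
        [0.7, 0.2, 0.1],
        [np.nan, np.nan, np.nan],
        [0.1, 0.8, 0.1],
        [0.3, 0.3, 0.4],
    ]
)

# Assumption: importance is a column-wise sum over documents.
# nan propagates through the sum, so every topic ends up nan.
naive_importances = document_topic_matrix.sum(axis=0)
print(naive_importances)

# Dropping the affected rows first gives finite importances again.
valid_rows = ~np.isnan(document_topic_matrix).any(axis=1)
clean_importances = document_topic_matrix[valid_rows].sum(axis=0)
print(clean_importances)
```

This would explain why the error lists only nan values even though a small fraction of documents is affected.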

If there's no GDPR issue it would also be useful to know what data you used and what hyperparameters you supplied to the model.

There are data privacy concerns (FERPA, to be exact). Therefore, it's not a good idea for me to share my dataset.

But here's a bit of domain context:

  • We're repurposing tweetopic to do short-text topic modeling (STTM) on Salesforce case descriptions written by university financial aid employees.
  • We have a preprocessing pipeline built with spaCy for tokenization, POS removal, stop word removal, etc.
  • We supply only non-null preprocessed documents to the model.

Hyperparameters for DMM model on Tweetopic:

  • alpha and beta both set to 0.1
  • num_topics: 50
  • iterations: 25 (get the same error when I set to 50 or 100)

@x-tabdeveloping
Owner

Thanks for the info. I will try to deliver a fix as quickly as possible. I think you were right in your judgment: it has to be the nans being output by tweetopic. In the meantime, you can try to identify which texts are problematic (i.e. result in nans) and perhaps remove them before you pass the corpus as a list of texts to topicwizard.
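One way to see which texts are affected is to pair each nan row of the model output with its source text. The sketch below is only an illustration: `find_nan_documents` is a hypothetical helper, not part of topicwizard or tweetopic, and the corpus and matrix are invented stand-ins for the real data and for the output of `.transform(...)`.

```python
import numpy as np


def find_nan_documents(corpus, document_topic_matrix):
    """Return (index, text) pairs for documents whose topic vector contains nan."""
    matrix = np.asarray(document_topic_matrix, dtype=float)
    if matrix.shape[0] != len(corpus):
        raise ValueError("corpus and matrix must have the same number of rows")
    # True for every row that has at least one nan entry
    nan_rows = np.isnan(matrix).any(axis=1)
    return [(int(i), corpus[int(i)]) for i in np.flatnonzero(nan_rows)]


# Invented toy data standing in for the real corpus and model output.
corpus = ["first document", "second document", "third document"]
matrix = np.array([[0.9, 0.1], [np.nan, np.nan], [0.2, 0.8]])

offenders = find_nan_documents(corpus, matrix)
for index, text in offenders:
    print(index, repr(text))
```

Inspecting the reported texts may reveal a pattern, for example documents that end up very short after preprocessing, although the thread does not confirm any particular cause.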

@x-tabdeveloping x-tabdeveloping self-assigned this Jun 26, 2023
@x-tabdeveloping x-tabdeveloping added the bug Something isn't working label Jun 26, 2023
@x-tabdeveloping x-tabdeveloping changed the title from "Unable to use in conjunction with tweetopic" to "Unable to handle nan output from a topic model." Jun 26, 2023
@x-tabdeveloping
Owner

I checked your Colab notebook and I think when you try to remove the texts there's some pandas shenanigans going on. I would try:

import numpy as np
import topicwizard

# Assumes `pipeline` (fitted vectorizer + topic model) and `corpus` already exist
transformed_corpus = pipeline.transform(corpus)
# Turning it into an array so you can index it with a boolean mask
filtered_corpus = np.array(corpus)
# Flagging the documents where any topic value is nan
problematic_indices = np.isnan(transformed_corpus).any(axis=1)
# Removing them
filtered_corpus = filtered_corpus[~problematic_indices]
topicwizard.visualize(pipeline=pipeline, corpus=filtered_corpus)

I think this should work fine. I will try to address these issues in the meantime.

@x-tabdeveloping
Owner

I managed to reproduce the error with a custom version of NMF that randomly assigns nans to certain observations.

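import numpy as np
from sklearn.decomposition import NMF


class RandomNanNMF(NMF):
    """NMF that overwrites 30 randomly chosen document rows with nan."""

    def transform(self, X):
        res = super().transform(X)
        n_docs = res.shape[0]
        # Pick 30 distinct documents (needs at least 30 documents in X)
        nans = np.random.choice(np.arange(n_docs), size=30, replace=False)
        res[nans, :] = np.nan
        return res

    def fit_transform(self, X, y=None, W=None, H=None):
        res = super().fit_transform(X, y, W, H)
        n_docs = res.shape[0]
        # Same corruption on the training output
        nans = np.random.choice(np.arange(n_docs), size=30, replace=False)
        res[nans, :] = np.nan
        return res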

The solution was to filter out the nan values in the preprocessing step of topicwizard and throw a warning to the user informing them about the removal of these documents.
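A filter-and-warn step of this kind could look roughly like the sketch below. This is not the code from the linked pull request: `drop_nan_documents` is a hypothetical helper, the toy data is invented, and only the warning text is taken from the message quoted later in this thread. The point it illustrates is that the corpus, the matrix, and the document names are all filtered with the same mask so that they stay aligned.

```python
import warnings

import numpy as np


def drop_nan_documents(corpus, document_topic_matrix, document_names=None):
    """Remove documents with nan topic values and warn about how many were dropped."""
    matrix = np.asarray(document_topic_matrix, dtype=float)
    keep = ~np.isnan(matrix).any(axis=1)
    n_dropped = int((~keep).sum())
    if n_dropped:
        warnings.warn(
            f"{n_dropped} documents had nan values in the output of the topic "
            "model, these are removed in preprocessing and will not be visible "
            "in the app."
        )
    # Apply the same mask everywhere so rows and texts stay aligned
    filtered_corpus = [text for text, ok in zip(corpus, keep) if ok]
    filtered_names = None
    if document_names is not None:
        filtered_names = [name for name, ok in zip(document_names, keep) if ok]
    return filtered_corpus, matrix[keep], filtered_names


# Invented toy data: the second document has a nan topic value.
corpus = ["a", "b", "c"]
matrix = np.array([[0.5, 0.5], [np.nan, 0.1], [1.0, 0.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    texts, clean_matrix, names = drop_nan_documents(
        corpus, matrix, document_names=["d0", "d1", "d2"]
    )
print(texts, names, clean_matrix.shape)
```

Any other per-document array used downstream, such as the document-term matrix, would need the same mask applied.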

@x-tabdeveloping x-tabdeveloping linked a pull request Jun 26, 2023 that will close this issue
@x-tabdeveloping
Owner

The fix has been merged into main, and a new version has been built and published to PyPI. Try installing topicwizard 0.2.6 and running your code again :)

@vshourie-asu
Author

Thank you! I'll give it a shot now.

@x-tabdeveloping
Owner

Can you confirm that the fix worked?

@vshourie-asu
Author

vshourie-asu commented Sep 14, 2023

Hi!

Sorry for the long wait; this query got lost in my massive work email mountain.

I reran the visualization command with version 0.2.6 installed.

I get the following error after running it. Note that the UserWarning shows up, which means your validation is working as intended.

`C:\Users\vshourie\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\template.py:33: UserWarning: 31 documents had nan values in the output of the topic model, these are removed in preprocessing and will not be visible in the app.
  warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 topicwizard.visualize(vectorizer=vectorizer, topic_model=dmm, corpus=corpus_cleaned, port=8080)

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\app.py:245, in visualize(corpus, vectorizer, topic_model, pipeline, document_names, topic_names, port, enable_notebook)
    242     (_, vectorizer), (_, topic_model) = pipeline.steps
    244 print("Preprocessing")
--> 245 app = get_dash_app(
    246     vectorizer=vectorizer,
    247     topic_model=topic_model,
    248     corpus=corpus,
    249     document_names=document_names,
    250     topic_names=topic_names,
    251 )
    252 return run_app(app, port=port)

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\app.py:73, in get_dash_app(vectorizer, topic_model, corpus, document_names, topic_names)
     42 def get_dash_app(
     43     vectorizer: Any,
     44     topic_model: Any,
   (...)
     47     topic_names: Optional[List[str]] = None,
     48 ) -> Dash:
     49     """Returns topicwizard Dash application.
     50 
     51     Parameters
   (...)
     71         Dash application object for topicwizard.
     72     """
---> 73     blueprint = get_app_blueprint(
     74         vectorizer=vectorizer,
     75         topic_model=topic_model,
     76         corpus=corpus,
     77         document_names=document_names,
     78         topic_names=topic_names,
     79     )
     80     app = Dash(
     81         __name__,
     82         blueprint=blueprint,
   (...)
     92         ],
     93     )
     94     return app

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\app.py:31, in get_app_blueprint(vectorizer, topic_model, corpus, document_names, topic_names)
     24 def get_app_blueprint(
     25     vectorizer: Any,
     26     topic_model: Any,
   (...)
     29     topic_names: Optional[List[str]] = None,
     30 ) -> DashBlueprint:
---> 31     blueprint = prepare_blueprint(
     32         vectorizer=vectorizer,
     33         topic_model=topic_model,
     34         corpus=corpus,
     35         document_names=document_names,
     36         topic_names=topic_names,
     37         create_blueprint=create_blueprint,
     38     )
     39     return blueprint

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\template.py:44, in prepare_blueprint(vectorizer, topic_model, corpus, create_blueprint, document_names, topic_names)
     42 if topic_names is None:
     43     topic_names = [f"Topic {i}" for i in range(n_topics)]
---> 44 blueprint = create_blueprint(
     45     vocab=vocab,
     46     document_term_matrix=document_term_matrix,
     47     document_topic_matrix=document_topic_matrix,
     48     topic_term_matrix=topic_term_matrix,
     49     document_names=document_names,
     50     corpus=corpus,
     51     vectorizer=vectorizer,
     52     topic_model=topic_model,
     53     topic_names=topic_names,
     54 )
     55 return blueprint

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\app.py:46, in create_blueprint(vocab, document_term_matrix, document_topic_matrix, topic_term_matrix, document_names, corpus, vectorizer, topic_model, topic_names)
     23 def create_blueprint(
     24     vocab: np.ndarray,
     25     document_term_matrix: np.ndarray,
   (...)
     33 ) -> DashBlueprint:
     34     # --------[ Collecting blueprints ]--------
     35     topic_blueprint = topics.create_blueprint(
     36         vocab=vocab,
     37         document_term_matrix=document_term_matrix,
   (...)
     44         topic_names=topic_names,
     45     )
---> 46     documents_blueprint = documents.create_blueprint(
     47         vocab=vocab,
     48         document_term_matrix=document_term_matrix,
     49         document_topic_matrix=document_topic_matrix,
     50         topic_term_matrix=topic_term_matrix,
     51         document_names=document_names,
     52         corpus=corpus,
     53         vectorizer=vectorizer,
     54         topic_model=topic_model,
     55         topic_names=topic_names,
     56     )
     57     words_blueprint = words.create_blueprint(
     58         vocab=vocab,
     59         document_term_matrix=document_term_matrix,
   (...)
     66         topic_names=topic_names,
     67     )
     68     blueprints = [
     69         topic_blueprint,
     70         words_blueprint,
     71         documents_blueprint,
     72     ]

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\blueprints\documents.py:32, in create_blueprint(vocab, document_term_matrix, document_topic_matrix, topic_term_matrix, document_names, corpus, vectorizer, topic_model, **kwargs)
     19 def create_blueprint(
     20     vocab: np.ndarray,
     21     document_term_matrix: np.ndarray,
   (...)
     29 ) -> DashBlueprint:
     30     # --------[ Preparing data ]--------
     31     n_topics = topic_term_matrix.shape[0]
---> 32     document_positions = prepare.document_positions(
     33         document_term_matrix=document_term_matrix
     34     )
     35     dominant_topics = prepare.dominant_topic(
     36         document_topic_matrix=document_topic_matrix
     37     )
     38     # Creating unified color scheme

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\topicwizard\prepare\documents.py:47, in document_positions(document_term_matrix)
     41 perplexity = np.min((40, n_docs - 1))
     42 manifold = umap.UMAP(
     43     n_components=2,
     44     n_neighbors=perplexity,
     45     metric="cosine",
     46 )
---> 47 x, y = manifold.fit_transform(document_term_matrix).T
     48 return x, y

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\umap\umap_.py:2772, in UMAP.fit_transform(self, X, y)
   2742 def fit_transform(self, X, y=None):
   2743     """Fit X into an embedded space and return that transformed
   2744     output.
   2745 
   (...)
   2770         Local radii of data points in the embedding (log-transformed).
   2771     """
-> 2772     self.fit(X, y)
   2773     if self.transform_mode == "embedding":
   2774         if self.output_dens:

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\umap\umap_.py:2516, in UMAP.fit(self, X, y)
   2510     nn_metric = self._input_distance_func
   2511 if self.knn_dists is None:
   2512     (
   2513         self._knn_indices,
   2514         self._knn_dists,
   2515         self._knn_search_index,
-> 2516     ) = nearest_neighbors(
   2517         X[index],
   2518         self._n_neighbors,
   2519         nn_metric,
   2520         self._metric_kwds,
   2521         self.angular_rp_forest,
   2522         random_state,
   2523         self.low_memory,
   2524         use_pynndescent=True,
   2525         n_jobs=self.n_jobs,
   2526         verbose=self.verbose,
   2527     )
   2528 else:
   2529     self._knn_indices = self.knn_indices

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\umap\umap_.py:328, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    325     n_trees = min(64, 5 + int(round((X.shape[0]) ** 0.5 / 20.0)))
    326     n_iters = max(5, int(round(np.log2(X.shape[0]))))
--> 328     knn_search_index = NNDescent(
    329         X,
    330         n_neighbors=n_neighbors,
    331         metric=metric,
    332         metric_kwds=metric_kwds,
    333         random_state=random_state,
    334         n_trees=n_trees,
    335         n_iters=n_iters,
    336         max_candidates=60,
    337         low_memory=low_memory,
    338         n_jobs=n_jobs,
    339         verbose=verbose,
    340         compressed=False,
    341     )
    342     knn_indices, knn_dists = knn_search_index.neighbor_graph
    344 if verbose:

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\pynndescent\pynndescent_.py:804, in NNDescent.__init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, init_dist, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
    793         print(ts(), "Building RP forest with", str(n_trees), "trees")
    794     self._rp_forest = make_forest(
    795         data,
    796         n_neighbors,
   (...)
    802         self._angular_trees,
    803     )
--> 804     leaf_array = rptree_leaf_array(self._rp_forest)
    805 else:
    806     self._rp_forest = None

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\pynndescent\rp_trees.py:1097, in rptree_leaf_array(rp_forest)
   1095 def rptree_leaf_array(rp_forest):
   1096     if len(rp_forest) > 0:
-> 1097         return np.vstack(rptree_leaf_array_parallel(rp_forest))
   1098     else:
   1099         return np.array([[-1]])

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\pynndescent\rp_trees.py:1089, in rptree_leaf_array_parallel(rp_forest)
   1088 def rptree_leaf_array_parallel(rp_forest):
-> 1089     result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
   1090         joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
   1091     )
   1092     return result

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\joblib\parallel.py:1098, in Parallel.__call__(self, iterable)
   1095     self._iterating = False
   1097 with self._backend.retrieval_context():
-> 1098     self.retrieve()
   1099 # Make sure that we get a last message telling us we are done
   1100 elapsed_time = time.time() - self._start_time

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\joblib\parallel.py:975, in Parallel.retrieve(self)
    973 try:
    974     if getattr(self._backend, 'supports_timeout', False):
--> 975         self._output.extend(job.get(timeout=self.timeout))
    976     else:
    977         self._output.extend(job.get())

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\multiprocessing\pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\multiprocessing\pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\joblib\_parallel_backends.py:620, in SafeFunction.__call__(self, *args, **kwargs)
    618 def __call__(self, *args, **kwargs):
    619     try:
--> 620         return self.func(*args, **kwargs)
    621     except KeyboardInterrupt as e:
    622         # We capture the KeyboardInterrupt and reraise it as
    623         # something different, as multiprocessing does not
    624         # interrupt processing for a KeyboardInterrupt
    625         raise WorkerInterrupt() from e

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\joblib\parallel.py:288, in BatchedCalls.__call__(self)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~\AppData\Local\miniconda3\envs\tweetopic\lib\site-packages\joblib\parallel.py:288, in <listcomp>(.0)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

ValueError: cannot assign slice from input of different size`
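The thread ends without a diagnosis of this second error. The traceback only shows that it is raised while UMAP builds its nearest-neighbour index on `document_term_matrix`. One thing that might be worth ruling out, purely as an assumption and not something the thread confirms, is whether the matrix contains documents with no in-vocabulary terms at all, since all-zero rows are degenerate under a cosine metric. The helper below is a hypothetical diagnostic on invented toy data:

```python
import numpy as np


def count_empty_rows(document_term_matrix):
    """Count documents whose row in the document-term matrix sums to zero.

    Works for dense arrays and for SciPy sparse matrices, since both
    provide .sum(axis=1).
    """
    row_sums = np.asarray(document_term_matrix.sum(axis=1)).ravel()
    return int((row_sums == 0).sum())


# Invented toy counts: the second document has no terms in the vocabulary.
dtm = np.array([[1, 0, 2], [0, 0, 0], [0, 3, 0]])
n_empty = count_empty_rows(dtm)
print(n_empty)
```

If the count is nonzero on the real data, removing those documents before calling `topicwizard.visualize` would be a cheap experiment. A zero count would rule this hypothesis out.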
