Skip to content

Conversation

@asifmallik
Copy link
Contributor

This completes #290

Just noticed that I incorrectly capitalized d in MPdist so I need to correct that. Another thing is to decide whether to include a hierarchical clustering for euclidean distance. I was not able to replicate the figure from the paper so far. I am finalizing a notebook currently which showcases multiple attempts to replicate the figure and will post it in Issues.

Pull Request Checklist

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • Run black . in the root stumpy directory
  • Run flake8 . in the root stumpy directory
  • Run ./setup.sh && ./test.sh in the root stumpy directory
  • Reference a Github issue (and create one if one doesn't already exist)

Skipped ./test.shbecause this is a documentation only pull request

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@asifmallik asifmallik changed the title [WIP] Add tutorial for MPDist [WIP] Add tutorial for MPdist Jul 10, 2021
@@ -0,0 +1,309 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider replacing the for loop with this:

data = loadmat('MaryBethAnhLisa_data.mat')['XSsY4']
dfs = {data[1][i][0]:pd.DataFrame(data[0][i].flatten()) for i in range(data[0].shape[0])}

Reply via ReviewNB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the for-loop is easier to read

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I agree, the dictionary comprehension is far from readable. Maybe I was just challenging myself to write a one liner.....

In any case, the final version should look like neither since we'll upload the datasets to Zenodo. Maybe you could do the data wrangling locally and THEN upload the cleaned dataset to Zenodo, so that on the final tutorial, it's just a matter of indexing a single dimension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, cleaning it first sounds like a good idea, I will fix this later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is using Zenodo something I can do for this something I can do? @seanlaw

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asifmalik I can help you with that. I'm thinking that we'll just create 6 separate files (one for each name)?

@seanlaw
Copy link
Contributor

seanlaw commented Jul 12, 2021

@asifmallik So far, so good! Good work. I like where this is going

@codecov-commenter
Copy link

codecov-commenter commented Jul 12, 2021

Codecov Report

Merging #433 (96fe7d5) into main (66b3402) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #433   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines         2766      2781   +15     
=========================================
+ Hits          2766      2781   +15     
Impacted Files Coverage Δ
stumpy/core.py 100.00% <0.00%> (ø)
stumpy/motifs.py 100.00% <0.00%> (ø)
stumpy/aamp_motifs.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66b3402...96fe7d5. Read the comment docs.

- Correct spelling and capitlization errors
- Make explanations better and less ambiguous
- More succint code
- Variable renaming
@asifmallik
Copy link
Contributor Author

@seanlaw @alvii147 Made some changes and incorporated some of your suggestions and corrections. Still have a bit left though

@seanlaw
Copy link
Contributor

seanlaw commented Jul 30, 2021

@asifmallik No problem! I value quality over quantity and what you've done so far is really taking shape

- Add dendrogram for Euclidean
- Add explanation for what we expect in cluster
- Add explanation for difference in Euclidean and MPdist result
- Remove mention of error in paper
- Improve code quality
- Align different names
- Other minor fixes
@asifmallik asifmallik changed the title [WIP] Add tutorial for MPdist Add tutorial for MPdist Aug 1, 2021
…ndas with numpy

- Grammar fixes
- Complete explanation for MPdist
@asifmallik
Copy link
Contributor Author

@seanlaw ready for a more thorough review

@@ -0,0 +1,380 @@
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*of their subsequences


Reply via ReviewNB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "determining whether most" maybe it should be "determining whether a limited subset of subsequences - as parameterized by a threshold - are similar"

Copy link
Contributor

@seanlaw seanlaw Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, if two time series is made up of the same repeating subsequences of window length m, then they would MPdist of 0 (if window size for MPdist is set to m), even if they are phase shifted. On the other hand, the Euclidean distance would be non-zero as long as they are phase shifted.

I find this sentence to be very confusing/abstract as it is not anchored to anything and expects the user to already have some prior knowledge

Consider how the original paper tries to describe this and paraphrase from there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, if two time series is are made up of the same repeating subsequences of window length m, then they would have an MPdist of 0...

I agree it's a bit difficult to follow this definition, although it's tough to come up with an alternative phrasing. Maybe something like: For instance, two time series that are made up of the same periodic subsequences, but are phase shifted, their MPdist would be zero, while their Euclidean distance would be non-zero.

I understand the significance of m , but maybe in this exact definition you can remove the mention of it? Idk it seems cleaner for the sake of explaining.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 1, 2021

@seanlaw ready for a more thorough review

@asifmallik I will try to find some time to review it. Thank you

@@ -0,0 +1,380 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #4.    t_1 = base[:50]

Please use T_A and T_B to refer to time series A and time series B


Reply via ReviewNB

@@ -0,0 +1,380 @@
{
Copy link
Contributor

@seanlaw seanlaw Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #7.    plt.style.use(mplstyle_url)

Please remove the above line


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -0,0 +1,380 @@
{
Copy link
Contributor

@seanlaw seanlaw Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would make more sense to have the long name first and then followed by the short parts


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's a good idea too, changing the ordering

@@ -0,0 +1,380 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like you are saying a lot here but you aren't showing the evidence for it until later. It is okay to make a statement and show the evidence inline.


Reply via ReviewNB

@@ -0,0 +1,380 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do the colors in the dendrogram mean?


Reply via ReviewNB

Copy link
Contributor Author

@asifmallik asifmallik Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the docs, I am not exactly quite sure what it means, seems like they color it something depending on whether the distance between the children cluster nodes exceeds a certain threshold (default is set to be 70% of max distance). This doesn't seem particularly relevant for this case so I can just set link_color_func to be a function that returns the same color for every cluster.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 3, 2021

@asifmallik I really like the story with both the Euclidean distance vs MPdist. It "feels" complete.

I was wondering if it would be possible to start with a simple example that only focuses on using MPdist to compare just two time series first. This way the focus is purely on MPdist and its output. After we've done through MPdist then we can talk about the name example. Otherwise, the dendrogram work might overshadow the point of this tutorial and that is to learn about MPdist and why it is useful and how it works.

@alvii147 I am curious as to your thoughts here as well!

@asifmallik
Copy link
Contributor Author

@asifmallik I really like the story with both the Euclidean distance vs MPdist. It "feels" complete.

I was wondering if it would be possible to start with a simple example that only focuses on using MPdist to compare just two time series first. This way the focus is purely on MPdist and its output. After we've done through MPdist then we can talk about the name example. Otherwise, the dendrogram work might overshadow the point of this tutorial and that is to learn about MPdist and why it is useful and how it works.

Do you mean I should replace the current introduction which starts with two randomly generated time series with the time series for two of the names instead?

@seanlaw
Copy link
Contributor

seanlaw commented Aug 10, 2021

Do you mean I should replace the current introduction which starts with two randomly generated time series with the time series for two of the names instead?

I would like to replace the random data example and use the data from Figure 1 (and maybe Figure 2) and also motivate and explain the pitfalls of other measures/methods by following what the authors covered/discussed in the introduction.

@@ -0,0 +1,380 @@
{
Copy link
Contributor

@alvii147 alvii147 Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would look nicer if the repeated subsequences were highlighted, or more visible somehow.

Try this (or something similar):

from matplotlib.patches import Rectangle

rect = Rectangle((0, 0), 20, 50, facecolor='lightgrey')
axs[0].add_patch(rect)
axs[0].plot(np.arange(20), t_1[:20], color='aquamarine', linewidth=10, alpha=0.5)

rect = Rectangle((5, 0), 20, 50, facecolor='lightgrey') axs[1].add_patch(rect) axs[1].plot(np.arange(5, 25), t_2[5:25], color='aquamarine', linewidth=10, alpha=0.5)

Reply via ReviewNB

@seanlaw seanlaw changed the title Add tutorial for MPdist [WIP] Add tutorial for MPdist Sep 22, 2021
@seanlaw
Copy link
Contributor

seanlaw commented Feb 6, 2022

@asifmallik Any updates on this?

@seanlaw seanlaw merged commit 3e991ac into stumpy-dev:main Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants