[WIP] Add tutorial for MPdist #433
docs/Tutorial_MPDist.ipynb
Consider replacing the for loop with this:
data = loadmat('MaryBethAnhLisa_data.mat')['XSsY4']
dfs = {data[1][i][0]:pd.DataFrame(data[0][i].flatten()) for i in range(data[0].shape[0])}
I think the for-loop is easier to read
Yea I agree, the dictionary comprehension is far from readable. Maybe I was just challenging myself to write a one liner.....
In any case, the final version should look like neither since we'll upload the datasets to Zenodo. Maybe you could do the data wrangling locally and THEN upload the cleaned dataset to Zenodo, so that on the final tutorial, it's just a matter of indexing a single dimension.
Yeah, cleaning it first sounds like a good idea. I will fix this later.
Is using Zenodo for this something I can do? @seanlaw
@asifmallik I can help you with that. I'm thinking that we'll just create 6 separate files (one for each name)?
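To make the "wrangle locally, then upload one cleaned file per name" idea concrete, here is a minimal sketch. It does not read the real .mat file: the 2 x N object array below only imitates the indexing used in the suggestion above (data[0][i] for the series, data[1][i][0] for the name), and the names, series lengths, values, and output file names are all placeholders, not the actual dataset.

```python
import tempfile
from pathlib import Path

import numpy as np

# Placeholder stand-in for loadmat(...)['XSsY4']; the real layout is assumed,
# not verified. Row 0 holds column-vector series, row 1 holds the names.
rng = np.random.default_rng(0)
names = ["Mary", "Beth", "Anh"]  # hypothetical subset of names
data = np.empty((2, len(names)), dtype=object)
for i, name in enumerate(names):
    data[0, i] = rng.random((50 + 10 * i, 1))
    data[1, i] = np.array([name])

# The readable for-loop version of the one-line dictionary comprehension,
# using plain NumPy arrays instead of DataFrames.
series_by_name = {}
for i in range(data[0].shape[0]):
    name = data[1][i][0]
    series_by_name[name] = data[0][i].flatten()

# Write one cleaned file per name (a temporary directory stands in for
# whatever would eventually be uploaded).
out_dir = Path(tempfile.mkdtemp())
for name, series in series_by_name.items():
    np.savetxt(out_dir / f"{name}.csv", series)

# In the tutorial, loading then reduces to reading a single 1-D array.
loaded = np.loadtxt(out_dir / "Beth.csv")
```

With files shaped like this, the tutorial would only need one load call per name and no nested indexing.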
@asifmallik So far, so good! Good work. I like where this is going
Codecov Report
@@            Coverage Diff            @@
##              main      #433   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           35        35
  Lines         2766      2781    +15
=========================================
+ Hits          2766      2781    +15
- Correct spelling and capitalization errors
- Make explanations better and less ambiguous
- More succinct code
- Variable renaming
@asifmallik No problem! I value quality over quantity and what you've done so far is really taking shape
- Add dendrogram for Euclidean
- Add explanation for what we expect in cluster
- Add explanation for difference in Euclidean and MPdist result
- Remove mention of error in paper
- Improve code quality
- Align different names
- Other minor fixes

…ndas with numpy
- Grammar fixes
- Complete explanation for MPdist
@seanlaw ready for a more thorough review
docs/Tutorial_MPDist.ipynb
Instead of "determining whether most" maybe it should be "determining whether a limited subset of subsequences - as parameterized by a threshold - are similar"
For instance, if two time series is made up of the same repeating subsequences of window length m, then they would MPdist of 0 (if window size for MPdist is set to m), even if they are phase shifted. On the other hand, the Euclidean distance would be non-zero as long as they are phase shifted.
I find this sentence to be very confusing/abstract as it is not anchored to anything and expects the user to already have some prior knowledge
Consider how the original paper tries to describe this and paraphrase from there
For instance, if two time series are made up of the same repeating subsequences of window length m, then they would have an MPdist of 0...
I agree it's a bit difficult to follow this definition, although it's tough to come up with an alternative phrasing. Maybe something like: "For two time series that are made up of the same periodic subsequences but are phase shifted, the MPdist would be zero, while the Euclidean distance would be non-zero."
I understand the significance of m, but maybe in this exact definition you can remove the mention of it? Idk, it seems cleaner for the sake of explaining.
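A small numerical check may help anchor the phase-shift claim discussed above. The sketch below is a brute-force NumPy illustration and is not STUMPY's implementation: the helper names, the sine-wave data, the window size m = 20, and the 5% threshold are all assumptions chosen for the example. It z-normalizes every subsequence, takes the nearest-neighbor distance in both directions (A to B and B to A), sorts the combined values, and reports the k-th smallest one.

```python
import math

import numpy as np


def znorm_subsequences(T, m):
    """Return every length-m subsequence of T, z-normalized row by row."""
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)  # assumes no constant windows
    return (windows - mean) / std


def mpdist_sketch(T_A, T_B, m, percentage=0.05):
    """Brute-force MPdist-style distance (illustrative, O(n^2) memory)."""
    A = znorm_subsequences(T_A, m)
    B = znorm_subsequences(T_B, m)
    # Pairwise distances between all subsequences of T_A and T_B
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Nearest-neighbor distances in both directions, combined and sorted
    P_ABBA = np.sort(np.concatenate([D.min(axis=1), D.min(axis=0)]))
    # Threshold: a small fraction of the combined length of both series
    k = min(math.ceil(percentage * (len(T_A) + len(T_B))), len(P_ABBA) - 1)
    return P_ABBA[k]


# Two sine waves with the same period (20 samples); T_B is shifted by 5 samples
t = np.arange(100)
T_A = np.sin(2 * np.pi * t / 20)
T_B = np.sin(2 * np.pi * (t + 5) / 20)

mpdist_value = mpdist_sketch(T_A, T_B, m=20)
euclidean_value = np.linalg.norm(T_A - T_B)
```

Here every subsequence of T_B has an essentially identical match somewhere in T_A, so mpdist_value is zero up to floating-point error, while euclidean_value is clearly non-zero because the two series are compared point by point.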
@asifmallik I will try to find some time to review it. Thank you
docs/Tutorial_MPDist.ipynb
Line #4. t_1 = base[:50]
Please use T_A and T_B to refer to time series A and time series B
docs/Tutorial_MPDist.ipynb
Removed
docs/Tutorial_MPDist.ipynb
I wonder if it would make more sense to have the long name first and then followed by the short parts
Yes, I think that's a good idea too, changing the ordering
docs/Tutorial_MPDist.ipynb
It feels like you are saying a lot here but you aren't showing the evidence for it until later. It is okay to make a statement and show the evidence inline.
docs/Tutorial_MPDist.ipynb
Reading the docs, I am not quite sure what it means. It seems like they color the links depending on whether the distance between the child cluster nodes exceeds a certain threshold (the default is 70% of the max distance). This doesn't seem particularly relevant in this case, so I can just set link_color_func to a function that returns the same color for every cluster.
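For reference, a minimal sketch of the single-color approach using SciPy's dendrogram. The random 2-D points, the "average" linkage method, and the color "C0" are placeholders rather than the tutorial's actual data or settings; no_plot=True is used only so the returned colors can be inspected without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Placeholder observations; the tutorial would pass its own distances instead
rng = np.random.default_rng(0)
points = rng.random((6, 2))
Z = linkage(points, method="average")

# link_color_func is called with each link's id and must return a color
# string; returning a constant gives every link the same color.
result = dendrogram(Z, no_plot=True, link_color_func=lambda k: "C0")
link_colors = result["color_list"]
```

Dropping no_plot=True (with matplotlib installed) draws the same dendrogram with uniformly colored links, which sidesteps the threshold-based coloring entirely.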
@asifmallik I really like the story with both the Euclidean distance vs MPdist. It "feels" complete. I was wondering if it would be possible to start with a simple example that only focuses on using MPdist to compare just two time series first. This way the focus is purely on MPdist and its output. After we've gone through MPdist, then we can talk about the name example. Otherwise, the dendrogram work might overshadow the point of this tutorial, which is to learn about MPdist, why it is useful, and how it works. @alvii147 I am curious as to your thoughts here as well!
Do you mean I should replace the current introduction which starts with two randomly generated time series with the time series for two of the names instead?
I would like to replace the random data example and use the data from Figure 1 (and maybe Figure 2) and also motivate and explain the pitfalls of other measures/methods by following what the authors covered/discussed in the introduction.
docs/Tutorial_MPDist.ipynb
It would look nicer if the repeated subsequences were highlighted, or more visible somehow.
Try this (or something similar):
from matplotlib.patches import Rectangle

rect = Rectangle((0, 0), 20, 50, facecolor='lightgrey')
axs[0].add_patch(rect)
axs[0].plot(np.arange(20), t_1[:20], color='aquamarine', linewidth=10, alpha=0.5)

rect = Rectangle((5, 0), 20, 50, facecolor='lightgrey')
axs[1].add_patch(rect)
axs[1].plot(np.arange(5, 25), t_2[5:25], color='aquamarine', linewidth=10, alpha=0.5)
@asifmallik Any updates on this?
This completes #290
Just noticed that I incorrectly capitalized the d in MPdist, so I need to correct that. Another thing to decide is whether to include a hierarchical clustering for Euclidean distance. I was not able to replicate the figure from the paper so far. I am currently finalizing a notebook that showcases multiple attempts to replicate the figure and will post it in Issues.
Pull Request Checklist
- black (i.e., python -m pip install black or conda install -c conda-forge black)
- flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
- pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
- black . in the root stumpy directory
- flake8 . in the root stumpy directory
- ./setup.sh && ./test.sh in the root stumpy directory

Skipped ./test.sh because this is a documentation only pull request