Paper: PyLZJD: An Easy to Use Tool for Machine Learning #462

Open · wants to merge 6 commits into base: 2019

Conversation

@EdwardRaff commented May 20, 2019:

If you are creating this PR in order to submit a draft of your paper,
see http://procbuild.scipy.org/ for logs generated by the build
process.

See the project readme
for more information.

@bmcfee (Contributor) left a comment:

I just gave this a first pass, and it already seems to be in pretty good shape -- well done!

I've made a few suggestions to help improve the draft. The main thing that I think needs improving is a clear statement of what the package technically does: specifying its inputs and outputs. The how and why can come later (and are clearly articulated as is).


The Lempel Ziv Jaccard Distance
-------------------------------

bmcfee (Contributor) commented May 25, 2019:

Overall, the introduction here is great and clearly sets up the problem addressed by this tool.

In this section though, it would be helpful to clearly state up front what exactly the tool does: what it takes as input and what it produces as output. As it's currently written, the functionality of the tool comes later, after some exposition about compression distance and hashing, and it's a little confusing to a first-time reader.

I think the rest of the section is great, but a preamble that outlines the procedure first (byte sequence => set => distance / vector) would strengthen it.

EdwardRaff (Author) commented May 27, 2019:

Added that to the end of the first paragraph in this section. I also chose that as a good place to introduce what a digest is, per one of your other comments.

bmcfee (Contributor) commented May 27, 2019:

This looks great

LZJD stands for "Lempel Ziv Jaccard Distance" :cite:`raff_lzjd_2017`, and is the algorithm implemented in PyLZJD. The inspiration and high-level understanding of LZJD comes from compression algorithms. Let :math:`C(\cdot)` represent your favorite compression algorithm (e.g., zip, bz2, etc.), which will take an input :math:`x` and produce a compressed version :math:`C(x)`. Using a decompressor, you can recover the original object or file :math:`x` from :math:`C(x)`. The purpose of this compression is to reduce the size of the file stored on disk. So if :math:`|x|` represents how many bytes it takes to represent the file :math:`x`, the goal is that :math:`|C(x)| < |x|`.

What if we wanted to compare the similarity of two files, :math:`x` and :math:`y`? We can use compression to help us do that. Consider two files :math:`x` and :math:`y`, with absolutely no shared content. Then we would expect that if we concatenated :math:`x` and :math:`y` together to make one larger file, :math:`x \Vert y`, then compressing the concatenated version of the files should be about the same size as the files compressed separately, :math:`|C(x \Vert y)| = |C(x)| + |C(y)|`. But what if :math:`|C(x \Vert y)| < |C(x)| + |C(y)|`? For that to be true, there must be some overlapping content between :math:`x` and :math:`y` that our compressor :math:`C(\cdot)` was able to reuse in order to achieve a smaller output. The more similarity between :math:`x` and :math:`y`, the greater the difference in file size we should see. In that case, we could use the ratio of compressed file lengths to tell us how similar the files are. We could call this a "Compression Distance Metric" :cite:`Keogh:2004:TPD:1014052.1014077` as shown in Equation :ref:`cdm`, where CDM(:math:`x,y`) returns a smaller value the more similar :math:`x` and :math:`y` are, and a larger value if they are different.

bmcfee (Contributor) commented May 25, 2019:

It's a little misleading to call Eq 1 a "distance metric", since it does not satisfy identifiability. (Triangle inequality may or may not be satisfied, depending on the choice of C.) I think this should be reworded, or at least clarified in the text.

Note that this does not apply to the derived Jaccard similarity used later, which is a proper metric.

EdwardRaff (Author) commented May 27, 2019:

I meant to include a statement about that in the next section, which has now been added! I wanted to keep the conversation in this section at an "intuition level", without getting too into the weeds.
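At that intuition level, the CDM ratio can be sketched in a few lines, with zlib standing in for :math:`C(\cdot)` (illustrative only; not part of PyLZJD):

.. code-block:: python

    import zlib

    def cdm(x, y):
        # Compression Distance Metric: smaller means more
        # shared content. x and y are byte strings.
        c_x = len(zlib.compress(x))
        c_y = len(zlib.compress(y))
        c_xy = len(zlib.compress(x + y))
        return c_xy / (c_x + c_y)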


While the above strategy has seen much success, it also suffers from drawbacks. Using a compression algorithm for every similarity comparison %makes prior methods
is slow, and the mechanics of standard compression algorithms are not optimized for machine learning tasks. LZJD rectifies these issues by converting a specific compression algorithm, LZMA, into a dedicated distance metric :cite:`raff_lzjd_2017`. By doing so, LZJD is fast enough to use for larger datasets, and maintains the properties of a true distance metric\footnote{symmetry, indiscernibility, and the triangle inequality
}. LZJD works by first creating the compression dictionary of the Lempel Ziv algorithm :cite:`Lempel1976`.

bmcfee (Contributor) commented May 25, 2019:

Typos; %makes and footnote did not render properly.

dist = 1.0-sim(lzset(a),lzset(b))
While the procedure above will implement the LZJD algorithm, it does not include the speedups that have been incorporated into PyLZJD. Following :cite:`raff_lzjd_2017` we use Min-Hashing :cite:`Broder:1998:MIP:276698.276781` to convert a set :math:`A` into a more compact representation :math:`A'`, which is of a fixed size :math:`k` (i.e., :math:`|A'|=k`) but guarantees that :math:`J(A, B) \approx J(A', B')`. :cite:`raff_lzjd_digest` reduced computational time and memory use further by mapping every sub-sequence to a hash, and performing :code:`lzset` construction using a rolling hash function to ensure every byte of input was only processed once. To handle class imbalance scenarios, a stochastic variant of LZJD allows over-sampling to improve accuracy :cite:`raff_shwel`. All of these optimizations were implemented with Cython :cite:`behnel2010cython` in order to make PyLZJD as fast as possible.

bmcfee (Contributor) commented May 25, 2019:

Minor detail: Is this using permutation-based minhash, or the (open-vocabulary) hashing approximation? It's maybe more detail than is strictly necessary for this article, but it might be a relevant point to mention since the purpose here is exactly to capture jaccard similarity.

EdwardRaff (Author) commented May 27, 2019:

The bottom-k approach with a single hash function is used. Added that detail as a footnote.
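For readers following along, a minimal sketch of that bottom-k scheme, with crc32 standing in for the library's hash function (illustrative only, not the PyLZJD implementation):

.. code-block:: python

    import zlib

    def bottomk_digest(lz_set, k=1024):
        # Hash every sub-sequence once, keep the k smallest values.
        hashes = {zlib.crc32(s.encode()) for s in lz_set}
        return sorted(hashes)[:k]

    def sim_digest(da, db):
        # Standard bottom-k estimate of the Jaccard similarity.
        k = min(len(da), len(db))
        union_k = set(sorted(set(da) | set(db))[:k])
        return len(union_k & set(da) & set(db)) / k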


The LZJD algorithm as discussed so far provides only a distance metric. This is valuable for search and information retrieval problems, many clustering algorithms, and :math:`k`-nearest-neighbor style classification, but does not give us access to all the algorithms that would be available in Scikit-Learn. Prior work proposed one method of vectorizing LZSets :cite:`raff_shwel` based on feature hashing :cite:`Weinberger2009a`, where every item in the set is mapped to a random position in a large and high dimensional input (they used :math:`d=2^{20}`). For new users, we want to avoid such high dimensional spaces to avoid the "curse of dimensionality" :cite:`Bellman1957`, a phenomenon that makes obtaining meaningful results in higher dimensions difficult.

Working in such high dimensional spaces often requires greater consideration and expertise. To make PyLZJD easier for novices to use, we have developed a different vectorization strategy. To make this possible, we use a new version of Min-Hashing called "SuperMinHash" :cite:`Ertl2017`. The new SuperMinHash adds a fixed overhead compared to the prior method, but enables us to use what is known as :math:`b`-bit minwise hashing to convert sets to a more compact vectorized representation :cite:`Li:2011:TAB:1978542.1978566`. Since :math:`k \leq 1024` in most cases, and :math:`b \leq 8`, we arrive at a more modest :math:`d=k\cdot b \leq 8,192`. By keeping the dimension smaller, we make PyLZJD easier to use and a wider selection of algorithms from Scikit-Learn should produce reasonable results.

bmcfee (Contributor) commented May 25, 2019:

I didn't understand what "fixed overhead" means in this context. A bit more detail would be appreciated here.

EdwardRaff (Author) commented May 27, 2019:

Just that it's not as fast as the default method. Changed it to a more descriptive "up to 40% slower".
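One plausible reading of the b-bit encoding described in the excerpt above (a sketch; the library's exact layout may differ): keep only the lowest b bits of each of the k MinHash values and unpack them into k*b binary features.

.. code-block:: python

    def digest_to_vector(digest, b=8):
        # k hash values -> k*b binary features (d = k*b).
        return [(h >> i) & 1 for h in digest for i in range(b)]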

% (scores.mean(), scores.std() * 2))
The above code returns a value of 91\% accuracy. This was all done without us having to specify anything about the associated file formats, how to parse them, or any feature engineering work. We can also leverage other distance metric based tools that Scikit-Learn provides. For example, we can use the t-SNE :cite:`Maaten2008` algorithm to create a 2D embedding of our data that we can then visualize with matplotlib. Using Scikit-Learn, this is only one line of code:

bmcfee (Contributor) commented May 25, 2019:

Example cites 91% accuracy, but does not provide a baseline for comparison. What is the class balance in the data / what accuracy would a majority-vote classifier get?

EdwardRaff (Author) commented May 27, 2019:

Added. Majority vote gets you 24.5%
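The t-SNE one-liner mentioned in the excerpt looks roughly like the following, assuming dists is a precomputed n-by-n LZJD distance matrix and y holds the labels:

.. code-block:: python

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # init='random' because PCA initialization cannot be
    # used with a precomputed distance matrix.
    xy = TSNE(n_components=2, metric='precomputed',
              init='random').fit_transform(dists)
    plt.scatter(xy[:, 0], xy[:, 1], c=y)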

yGood = [0 for i in range(len(ham_paths))]
y = yBad + yGood
X = vectorize(all_paths)

bmcfee (Contributor) commented May 25, 2019:

A figure with examples of ham and spam images would go a long way to making this example more intuitive to readers.

EdwardRaff (Author) commented May 27, 2019:

Added!

F1-Score & 0.958119 & 0.966374 \\
AUC & 0.987108 & 0.991602 \\ \hline
\end{tabular}
\end{table}

bmcfee (Contributor) commented May 25, 2019:

6 decimal places is almost certainly excessive for data of n~=1e4.

EdwardRaff (Author) commented May 27, 2019:

Trimmed down to 3.


\subsection{Text Classification}

As our last example, we will use a text-classification problem. While other methods will work better, the purpose is to show that LZJD can be used in a wide array of potential applications. For this, we will use the well-known 20 Newsgroups dataset, which is available in Scikit-Learn. We use this dataset because LZJD works best with longer input sequences. For simplicity we will stick with distinguishing between the newsgroup categories of 'alt.atheism' and 'comp.graphics'. An example of a email from the latter group is shown below.

bmcfee (Contributor) commented May 25, 2019:

a email => an email

X_train = vectorize(paths_train*10,
false_seen_prob=0.05)
X_test = vectorize(paths_test)

bmcfee (Contributor) commented May 25, 2019:

It's a little strange up front to see train and test data treated differently here.

One suggestion for making this come across more clearly is to describe over-sampling in terms of data augmentation. If you do that, then it might be a good idea to include X_train both with and without over-sampling. Eg:

X_train_clean = vectorize(paths_train)
X_train_aug = vectorize(paths_train * 9, false_seen_prob=0.05)
X_test = vectorize(paths_test)

This way, it's clear that the over-sampled data are "extra" in a sense, but you still have the original training data included.

EdwardRaff (Author) commented May 27, 2019:

I like having the "clean" and "aug" versions, and added that / slightly adjusted the following text. I'm a bit conflicted on describing it as data augmentation. I can see it helping at an intuition level, but I'm not really augmenting the data. I haven't modified or transformed the underlying data in any realistic way. Instead we have different sub-sequences of the same underlying data stream.

In our previous paper, we described/compared this approach with SMOTE. Do you think that would be helpful? I didn't include it in this as I wanted to minimize external ML background for the reader to follow along.

bmcfee (Contributor) commented May 27, 2019:

Maybe that's just a slight difference of perspective then. I generally think of any synthetic transformation of data (realistic or not) for the purpose of improving robustness as DA, but it's of course up to you.

I think you're right that analogy to SMOTE might be more confusing for the scipy audience.
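For reference, combining the clean and augmented matrices from the suggestion above might look like this sketch (y_train is a hypothetical label list aligned with paths_train, and vectorize is assumed to return NumPy arrays):

.. code-block:: python

    import numpy as np

    X_train = np.vstack([X_train_clean, X_train_aug])
    # Labels repeat in step with the over-sampled paths_train * 9.
    y_train_all = y_train + y_train * 9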

@EdwardRaff (Author) commented May 25, 2019:

@bmcfee Thank you for the first review!

Since this is a little different from normal academic review, I had two quick questions. 1) Should I be waiting for comments from other reviewers before making adjustments (or is it one-to-one)? 2) Should I discuss the comments in-line before making changes, or make changes and address the inline comments after? I'm not sure what the protocol is here.

@bmcfee (Contributor) commented May 26, 2019:

@EdwardRaff I'll defer to @deniederhut on this, but the author instructions do imply iteration during review, so I think it's fine to make changes as you see fit.

@deniederhut (Member) commented May 27, 2019:

Correct - you should do whatever is easier/better for you. As for reviewers, we assigned multiple to each paper, but there is no need to wait for everyone to chime in before making changes.

EdwardRaff added some commits May 27, 2019

rst is not latex...
Stupid dollar signs not working... blerg.
Fixing logic error in what was presented for false_seen_prob. The pyLZJD code is correct, and matches original paper/algorithm. Just the main.rst had the wrong statment.
@ankurankan left a comment:

Sorry for the late review. @bmcfee covered most of the issues which have already been fixed. The paper seems to be in great shape already so I don't really have much to add except for a few very minor suggestions (mostly typos). Cheers!!


.. class:: abstract

As Machine Learning (ML) becomes more widely known and popular, so too does the desire for new users from other backgrounds to apply ML techniques to their own domains. A difficult prerequisite that often confounds new users is the feature creation and engineering process. This is especially true when users attempt to apply ML to domains that have not historically received attention from the ML community (e.g., outside of text, images, and audio). The Lempel Ziv Jaccard Distance (LZJD) is a compression based technique that can be used for many machine learning tasks. Because of its compression background, users do not need to specify any feature extraction, making it easy to apply to new domains. We introduce pyLZJD, a library that implements LZJD in a manner meant to be easy to use and and apply for novice practitioners. We will discuss the intuition and high-level mechanics behind LZJD, followed by examples of how to use it on problems of disparate data types.

ankurankan commented Jun 8, 2019:

Typo: ... easy to use and and apply ...


For this first example, we will stick to using LZJD as a similarity tool and distance metric. When you want to use distance based algorithms, you want to use the :code:`digest` and :code:`sim` functions instead of :code:`vectorize`. :code:`vectorize` will be less accurate and slower when computing distances.

To use LZJD's digest with Scikit-Learn, we need to massage the files into a form that it expects. Scikit-Learn needs a distance function between data stored as a list of vectors (i.e., a matrix :math:`X`). However, our digests are not vectors in the way that Scikit-Learn understands them, and needs to know how to properly measure distances. An easy way to do this [#], which is compatible with other specialized distances a user may want to leverage, is to create a 1-D list of vectors. Each vector will store the index of its digest in the created :code:`X_hashes` list. Then we can can create a distance function which uses the index, and returns the correct value. While wordy to explain, it takes only a few lines of code:

ankurankan commented Jun 8, 2019:

Typo Then we can can create a ....

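The index trick described in the excerpt above comes out to only a few lines in practice; a sketch, assuming digest and sim are imported from pyLZJD:

.. code-block:: python

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X_hashes = [digest(p) for p in paths]        # one digest per file
    X = np.arange(len(X_hashes)).reshape(-1, 1)  # 1-D "vectors" of indices

    def lzjd_dist(a, b):
        # Look up the real digests by index, return a distance.
        return 1.0 - sim(X_hashes[int(a[0])], X_hashes[int(b[0])])

    knn = KNeighborsClassifier(metric=lzjd_dist)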
Example of t-SNE visualization created using LZJD. Best viewed digitally and in color.


A plot of the result is shown in Figure 1, where we see that the groups are mostly clustered into separate regions, but that there is a significant collection of points that were difficult to organize with their respective groups. While a tutorial on effective use t-SNE is beyond the scope of this paper, LZJD allows us to leverage this popular tool for immediate visual feedback and exploration.

ankurankan commented Jun 8, 2019:

Maybe use of t-SNE instead of use t-SNE?

Introduction
------------

Machine Learning (ML) has become an increasingly popular tool, with libraries like Scikit-Learn :cite:`scikit-learn` and others :cite:`xgboost,JMLR:v18:16-131,JMLR:v17:15-237,Hall2009` making ML algorithms available to a wide audience of potential users. However, ML can be daunting for news and amateur users to pick up and use. Before even considering what algorithm should be used for a given problem, feature creation and engineering is a prerequisite step that is not easy to perform, nor is it easy to automate.

ankurankan commented Jun 8, 2019:

Maybe new and amateur instead of news and amateur?

.. code-block:: python
def lzset(b): #b should be a list

ankurankan commented Jun 8, 2019:

Is b supposed to be a string instead of list? Because the code wouldn't work if it is a list, won't be able to add b_s (list) to s (set).

EdwardRaff (Author) commented Jun 8, 2019:

Ah, ok, that's correct.

This code isn't how LZJD is implemented, so I'm a little conflicted on modifications. It should be a string the way it is written, but I don't want to imply that lzset only works over strings. I also don't want to necessarily convert the sub-list into a tuple so it would go into a set.

Do you have any suggestions/thoughts on removing the comment, changing the comment to say a string, vs changing the code to accept tuples? I'm going to continue giving some thought on which might be best for the reader.

ankurankan commented Jun 10, 2019:

I am not sure what's the best solution in this case because as a reader I would expect the code to be working for all possible inputs. So, maybe it can be explicitly mentioned somewhere that the example code only works on one data type and the full implementation works on other types as well.
Or another possibility could be to show pseudocode instead of actual code to give the general idea of what's happening.

mfenner1 commented Jun 10, 2019:

you could include a comment: # code for string case only

EdwardRaff (Author) commented Jun 22, 2019:

I like mfenner1's proposal, switched to that!
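For reference, a runnable version of the snippet under discussion with that comment adopted (the paper's illustrative construction, not the optimized PyLZJD implementation):

.. code-block:: python

    def lzset(b):  # code for string case only
        s = set()
        start = 0
        for end in range(1, len(b) + 1):
            b_s = b[start:end]
            if b_s not in s:   # grow until we reach a new sub-sequence,
                s.add(b_s)     # record it,
                start = end    # and start the next one
        return s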

@mfenner1 left a comment:

I've proposed many grammatical edits primarily focused on improving clarity and simplifying sentence structure. I'll give a more general review in the discussion.


.. class:: abstract

As Machine Learning (ML) becomes more widely known and popular, so too does the desire for new users from other backgrounds to apply ML techniques to their own domains. A difficult prerequisite that often confounds new users is the feature creation and engineering process. This is especially true when users attempt to apply ML to domains that have not historically received attention from the ML community (e.g., outside of text, images, and audio). The Lempel Ziv Jaccard Distance (LZJD) is a compression based technique that can be used for many machine learning tasks. Because of its compression background, users do not need to specify any feature extraction, making it easy to apply to new domains. We introduce pyLZJD, a library that implements LZJD in a manner meant to be easy to use and and apply for novice practitioners. We will discuss the intuition and high-level mechanics behind LZJD, followed by examples of how to use it on problems of disparate data types.

mfenner1 commented Jun 10, 2019:

there are several references to feature extraction, creation, engineering and in the conclusion "feature specification" ... they are all related and varyingly similar ... but I'd pick one (probably feature extraction or feature engineering) and stick with it ... you might say a few words about terminology and that you are using one term as a blanket term for all of them


Machine Learning (ML) has become an increasingly popular tool, with libraries like Scikit-Learn :cite:`scikit-learn` and others :cite:`xgboost,JMLR:v18:16-131,JMLR:v17:15-237,Hall2009` making ML algorithms available to a wide audience of potential users. However, ML can be daunting for news and amateur users to pick up and use. Before even considering what algorithm should be used for a given problem, feature creation and engineering is a prerequisite step that is not easy to perform, nor is it easy to automate.

In normal use, we as ML practitioners would describe our data as a matrix :math:`\boldsymbol{X}` that has :math:`n` rows and :math:`d` columns. Each of the :math:`n` rows corresponds to one of our data points (i.e., an example), and each of the :math:`d` columns corresponds to one of our features. Using cars as an analogy problem, we may want to know what color a car is, how old it is, or its odometer mileage, as features. We want to have these features in every row :math:`n` of our matrix so that we have the information for every car. Once done, we might train a model :math:`m(\cdot)` to perform a classification problem (e.g., is the car an SUV or sedan?), or use some distance measure :math:`d(\cdot, \cdot)` to help us find similar or related examples (e.g., which used car that has been sold is most like my own?).

mfenner1 commented Jun 10, 2019:

using cars, we may ... [as an analogy problem is awkward]

EdwardRaff (Author) commented Jun 22, 2019:

Adjusted
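To make the analogy concrete, the n-by-d matrix might look like this small sketch (hypothetical data):

.. code-block:: python

    import pandas as pd

    # n = 3 cars (rows), d = 3 features (columns)
    X = pd.DataFrame({
        "color":   ["red", "blue", "red"],
        "age":     [3, 7, 1],            # years
        "mileage": [42000, 98000, 9000],
    })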


The question becomes, how do we determine what to use as our features? One could begin enumerating every property a car might have, but that would be time consuming, and not all of the features would be relevant to all tasks. If we had an image of a car, we might use a Neural Network to help us extract information or find similar looking images. But if one does not have prior experience with machine learning, these tasks can be daunting. For some types of complex data, feature engineering can be challenging even for experts.

To help new users avoid this difficult task, we have developed the PyLZJD library. PyLZJD makes it easy to get started with ML algorithms and retrieval tasks without needing any kind of feature specification, selection, or engineering, to be done by the user. Instead, all a user needs to do is represent their data as a file (i.e., one file for every data point, for :math:`n` total files). PyLZJD will automatically process the file and can be used with Scikit-Learn to tackle many common tasks. While PyLZJD will likely not be the best method to use for most problems, it provides an avenue for new users to begin using machine learning with minimal effort and time.

mfenner1 commented Jun 10, 2019:

without any kind of feature specification, selection, or engineering, to be done by the user. Instead, all a user needs to do is represent their data as a file ----> without any kind of feature specification, selection, or engineering from the user. Instead, a user represents their data as a file

will automatically process the files [otherwise it seems awkward since each data point is a file .... we have many to process]

EdwardRaff (Author) commented Jun 22, 2019:

Adjusted!

The Lempel Ziv Jaccard Distance
-------------------------------

LZJD stands for "Lempel Ziv Jaccard Distance" :cite:`raff_lzjd_2017`, and is the algorithm implemented in PyLZJD. LZJD takes a byte or character sequence :math:`x` (i.e., a "string"), converts it to a set of sub-strings, and then converts the set into a *digest*. This digest is a fixed-length summary of the input sequence, which requires a total of :math:`k` integers to represent. We can then measure the similarity of digests using a distance function, and we can trade accuracy for speed and compactness by decreasing :math:`k`. We can optionally convert this digest into a vector in euclidean space, allowing greater flexibility to use LZJD with other machine learning algorithms.

mfenner1 commented Jun 10, 2019:

no comma after reference. The form of the sentence is: LZJD stands for .... and .... . No comma with that and.
There are several of these throughout the paper, which I'll note below.

recommend: :math:`x` (a string)

recommend: Euclidean space

EdwardRaff (Author) commented Jun 22, 2019:

Adjustments made!


LZJD stands for "Lempel Ziv Jaccard Distance" :cite:`raff_lzjd_2017`, and is the algorithm implemented in PyLZJD. LZJD takes a byte or character sequence :math:`x` (i.e., a "string"), converts it to a set of sub-strings, and then converts the set into a *digest*. This digest is a fixed-length summary of the input sequence, which requires a total of :math:`k` integers to represent. We can then measure the similarity of digests using a distance function, and we can trade accuracy for speed and compactness by decreasing :math:`k`. We can optionally convert this digest into a vector in euclidean space, allowing greater flexibility to use LZJD with other machine learning algorithms.

The inspiration and high-level understanding of LZJD comes from compression algorithms. Let :math:`C(\cdot)` represent your favorite compression algorithm (e.g., zip, bz2, etc.), which will take an input :math:`x` and produce a compressed version :math:`C(x)`. Using a decompressor, you can recover the original object or file :math:`x` from :math:`C(x)`. The purpose of this compression is to reduce the size of the file stored on disk. So if :math:`|x|` represents how many bytes it takes to represent the file :math:`x`, the goal is that :math:`|C(x)| < |x|`.

mfenner1 commented Jun 10, 2019:

(e.g., zip or bz2) [they're examples, there could be more, don't need etc.]

which will take ---> [I'd strong recommend keeping almost all of the paper in the present tense. writing in the future gets very wordy and awkward.]

The purpose of this compression is to reduce the size of the file stored on disk. ---> Compression attempts to reduce the size of the file.

EdwardRaff (Author) commented Jun 22, 2019:

Adjusted

\end{tabular}
\end{table}

LZJD won't always be effective for images, and convolutional neural networks (CNNs) are a better approach if you need the best possible accuracy. However, this example demonstrates that LZJD can still be useful, and has been used successfully to find slightly altered images :cite:`Faria-joao`. This example also shows how to build a more deployable classifier with pyLZJD and tackle class-imbalance situations.

mfenner1 commented Jun 10, 2019:

LZJD can still be useful, and has been used successfully to find slightly altered images ---> LZJD can still be useful and it found slightly altered images

EdwardRaff (Author) commented Jun 22, 2019:

I prefer the original wording here. I feel like the proposed wording doesn't make it clear that we are referring to an example of someone else's work with LZJD.


\subsection{Text Classification}

As our last example, we will use a text-classification problem. While other methods will work better, the purpose is to show that LZJD can be used in a wide array of potential applications. For this, we will use the well-known 20 Newsgroups dataset, which is available in Scikit-Learn. We use this dataset because LZJD works best with longer input sequences. For simplicity we will stick with distinguishing between the newsgroup categories of 'alt.atheism' and 'comp.graphics'. An example of an email from the latter group is shown below.

mfenner1 commented Jun 10, 2019:

For this, we will use the well-known 20 Newsgroups dataset, which is available in Scikit-Learn. We use this dataset because LZJD works best ---> We will use the well-known 20 Newsgroups dataset, which is available in Scikit-Learn. LZJD works best

EdwardRaff (Author) commented Jun 22, 2019:

In the context of the larger paragraph, I think the original wording does a better job explaining to the reader why choices are being made, and so prefer to keep it over the proposed wording.

Readers should verify what I wrote... :-)
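For reference, loading the two categories from Scikit-Learn takes only a few lines; a sketch:

.. code-block:: python

    from sklearn.datasets import fetch_20newsgroups

    cats = ['alt.atheism', 'comp.graphics']
    train = fetch_20newsgroups(subset='train', categories=cats)
    test = fetch_20newsgroups(subset='test', categories=cats)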


When a string is not a valid path to a file, pyLZJD will process the string itself to create a digest. This makes working with data stored as strings simple, and getting results is as easy as the code snippet below:

mfenner1 commented Jun 10, 2019:

This simplifies working with strings, and getting results is as easy as:

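Either way, the behavior described in the excerpt can be sketched as follows (illustrative path and string; digest and sim assumed imported from pyLZJD):

.. code-block:: python

    h_file = digest("/path/to/data.bin")    # a valid path: the file is read
    h_text = digest("any raw string here")  # not a path: digested directly
    print(sim(h_file, h_text))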
pred, average='macro')
With the above code, we get an :math:`F_1` score of 83\%. It is important to note that in this case, using Scikit-Learn's TfidfVectorizer one can get 89\% :math:`F_1`. The point here is that with pyLZJD we can get decent results without having to think about what kind of vectorization is being performed, and that any string encoded data can be fed directly into the :code:`vectorize` or :code:`digest` functions to get immediate results.

mfenner1 commented Jun 10, 2019:

With the above code, we get an :math:`F_1` score of 83%. Using Scikit-Learn's TfidfVectorizer achieves an :math:`F_1` of 89%. Our point is that with PyLZJD we can get decent results without thinking about what kind of vectorization is being performed, and any string encoded data can be fed directly into the :code:`vectorize` or :code:`digest` functions to get forms immediately usable by Scikit-Learn.

EdwardRaff (Author) commented Jun 22, 2019:

Adjusted
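The TfidfVectorizer baseline mentioned in the excerpt can be reproduced along these lines (a sketch; the paper does not name the classifier, so LogisticRegression is an assumption, and train/test are the 20 Newsgroups splits loaded earlier):

.. code-block:: python

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    vec = TfidfVectorizer()
    X_tr = vec.fit_transform(train.data)
    X_te = vec.transform(test.data)

    clf = LogisticRegression().fit(X_tr, train.target)
    print(f1_score(test.target, clf.predict(X_te), average='macro'))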

Conclusion
----------

We have shown, by example, how to use pyLZJD on a number of different datasets composed of raw binary files, images, and regular ASCII text. In all cases, we did not have to do any feature specifications to use pyLZJD, making application simpler and easier. This shortcut is particularly useful when feature specification is hard, such as raw file types, but can also make it easier for people to get into applying Machine Learning.

mfenner1 commented Jun 10, 2019:

specification wasn't really used earlier in the paper ... perhaps prefer engineering or extraction

@mfenner1 commented Jun 10, 2019:

So, I love this paper. Feature extraction/construction/engineering is an incredibly important topic and I've been working with it for a solid 15 years. I'm always happy to see new approaches and particularly approaches that aim to be domain agnostic and easy on the user. I think you use just enough code, mathematics, and verbal descriptions to get your ideas across clearly. I think you have a good overall balance of background, implementation, and examples.

To reproduce your results, I feel like I would have a bit of struggle: primarily in tying together the LZJD description with SuperMinHashing. Even if public code is available, I would love to see an overall Python or pseudo-code algorithm that ties all of the pieces together. [I don't know the page limit off the top of my head -- that could be an issue] Is it feasible to include some pseudo-code for the SuperMinHashing?

Have you done (or do you plan to do) any larger scale studies on a variety of problems (for example, pick 20 datasets and compare some generic baseline learners against PyLZJD)?

I placed specific grammatical stuff in individual line level comments.

@EdwardRaff (Author) commented Jun 10, 2019:

@mfenner1 Thank you so much for the kind words and the extensive style edits/notes! I'll set aside some time to go through and adjust.

While we still have 2 pages free, SuperMinHash is fairly complicated to specify. Adding just that to the paper would take at least 1.5 pages, and wouldn't leave enough space to make a large / complete specification of the whole process in pseudo-code. I think it would make it very difficult for non ML experts to follow along.

I would love to do a study like that, and it's kind-of on our radar; everything we've explored has been predominantly malware related - where LZJD is competitive with state-of-the-art for a number of tasks in part because of how difficult it is to get features. It is unfortunately a lot of work to go hunting for datasets like that, more than I have freely available.

On our radar is to do something similar to the CDM paper, which showed how to use SAX to apply the CDM to time series classification/clustering problems, which worked quite well. Combining their approach with this large archive of time series datasets would allow us to do a large study without too much pain.

@mfenner1 commented Jun 13, 2019:

Would there be a way to strike a balance where you tie the pieces together with high level pseudo code and give the SuperMinHash algorithm its own very high level (even just conceptual) breakdown as a sub-procedure?

I can definitely get behind leaving the details of SuperMinHash out, if it will just be confusing (and given that readers can go to those references if they want those details). So, I guess I'm back to how to tie all the pieces together ... even if SuperMinHash is left as a black-box.

Best,
Mark

@EdwardRaff (Author) commented Jun 22, 2019:

I've tried to address all the individual comments in the latest commit

I'm not sure I fully understand your "tying it all together" concern. Perhaps it's due to a different set of expectations.

I'm assuming the standard SciPy reader is not a Machine Learning researcher / does not have experience implementing algorithms from paper descriptions. Based on the other/previous papers, I was under the understanding that the target audience was predominantly practitioners / how & why to use the software.

The entire process could potentially be fit into the paper, but I think it will be jarring for almost all readers and not easy to follow unless they are already experienced ML researchers. At the same time though, I would expect those readers to be able to pick up from the references in this paper the deeper detail to independently reproduce. There is the additional issue that the rst format doesn't seem to have a method for me to write pseudo code.

@deniederhut (Member) commented Jun 22, 2019:

Hi everyone! I have not read this paper yet, but perhaps I can chime in with some perspective.

To address the easy question first, rst allows both preformatted text and text with code highlighting. You can see examples in the demo paper here -> https://github.com/scipy-conference/scipy_proceedings/blob/2019/papers/00_vanderwalt/00_vanderwalt.rst

As for the target audience, SciPy tends to be split about 50/50 between academia and industry, but even this is becoming a less useful distinction as more companies become engaged in R&D, and more grad students become founders. The intention behind the conference proceedings is that papers should be accessible to a scientifically literate person who is not in your field (e.g. you can imagine that you are presenting this to a group of computationally savvy researchers from the Geological Survey), and enlightening to someone who is already experienced in your field (e.g. one of the TPOT developers). The examples are intentional - they were both at SciPy last year.

In a lot of ways, these are competing goals, and striking a good balance is hard. I might think about it this way:

  1. Would the NGS researcher understand at a surface level what you did and why it matters?
  2. Would the TPOT developer see enough detail to reproduce your experiment? (This would not necessarily need to be in the paper itself if you can link to source elsewhere)
@EdwardRaff (Author) commented Jun 22, 2019:

Hi @deniederhut I'm not sure I understand what you are referring to in the vanderwalt example. I do have code highlighting used in my current paper, but I don't think that would benefit writing more pseudo-code like @mfenner1 is asking for. Or, I'm not understanding how that is achieved/done in the vanderwalt example.

@deniederhut (Member) commented Jun 25, 2019:

Sorry, I thought you had said that rst does not support formatting for pseudo-code -- I was pointing you to an example of how to write pseudo-code in rst.

@EdwardRaff (Author) commented Jun 25, 2019:

I'm confused. I don't see any pseudo-code in that link? I'm thinking something like this:
[Screenshot omitted: an example of a typeset pseudo-code listing.]
