The probability of a sequence can be expanded with the chain rule:
$$P(s_1, s_2, \dots, s_l) = P(s_1) \cdot P(s_2 \mid s_1) \cdot P(s_3 \mid s_1, s_2) \cdots P(s_l \mid s_1, s_2, \dots, s_{l-1})$$
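We use the Markov assumption to simplify the above equation: each symbol is assumed to depend only on a bounded number of the symbols immediately preceding it.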
$$P(s_l \mid s_1 s_2 \dots s_{l-1}) \approx P(s_l \mid s_k s_{k+1} \dots s_{l-1}), \quad 1 < k < l$$
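As a small illustration, assume a memory of at most two preceding symbols (the bound of 2 is chosen only for this example and is not a setting taken from the program). A four-symbol sequence then factors as:

$$P(s_1, s_2, s_3, s_4) \approx P(s_1) \cdot P(s_2 \mid s_1) \cdot P(s_3 \mid s_1, s_2) \cdot P(s_4 \mid s_2, s_3)$$

Only the last factor changes: its context $s_1, s_2, s_3$ is truncated to $s_2, s_3$.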
Given a database of sequences, we estimate these conditional probabilities from the observed data. However, creating, storing, and efficiently searching them poses a significant challenge. To address this, the program uses a Probabilistic Suffix Tree (PST), which organizes the probabilities in a tree structure. The concepts and ideas behind PSTs can be found in the paper "Mining for Outliers in Sequential Databases.pdf" included in the repository.
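To make the idea concrete, here is a minimal single-machine sketch of a PST in plain Scala. It is not the program's implementation: the names `PstSketch`, `Node`, `Pst`, `maxDepth`, and `normalizedLogLikelihood` are hypothetical, symbols are assumed to be `Char`s, and no smoothing or pruning is applied. Each node stands for a context (a suffix of the preceding symbols) and counts which symbols followed it. The length-normalized log-likelihood is shown as one possible similarity score, which is an assumption rather than the program's documented measure.

```scala
import scala.collection.mutable

object PstSketch {
  // One node per context; the path from the root spells the context backwards.
  final class Node {
    val children = mutable.Map.empty[Char, Node]
    val counts = mutable.Map.empty[Char, Long] // next-symbol counts
    var total = 0L
  }

  final class Pst(maxDepth: Int) {
    private val root = new Node

    private def record(n: Node, next: Char): Unit = {
      n.counts(next) = n.counts.getOrElse(next, 0L) + 1L
      n.total += 1L
    }

    // Count every symbol under each of its contexts up to maxDepth symbols long.
    def add(seq: String): Unit =
      for (i <- seq.indices) {
        val next = seq(i)
        var node = root
        record(node, next)
        var d = 1
        while (d <= maxDepth && i - d >= 0) {
          node = node.children.getOrElseUpdate(seq(i - d), new Node)
          record(node, next)
          d += 1
        }
      }

    // P(next | longest suffix of `context` that is stored in the tree).
    // No smoothing: an unseen continuation gets probability 0.
    def prob(context: String, next: Char): Double = {
      var node = root
      var d = 1
      var descending = true
      while (descending && d <= maxDepth && context.length - d >= 0) {
        node.children.get(context(context.length - d)) match {
          case Some(child) =>
            node = child
            d += 1
          case None =>
            descending = false
        }
      }
      if (node.total == 0L) 0.0
      else node.counts.getOrElse(next, 0L).toDouble / node.total
    }

    // Chain-rule log probability with truncated contexts.
    def logProb(seq: String): Double =
      seq.indices.map(i => math.log(prob(seq.substring(0, i), seq(i)))).sum

    // Assumed score: log-likelihood divided by sequence length.
    def normalizedLogLikelihood(seq: String): Double =
      logProb(seq) / seq.length
  }
}
```

A tree built with `new PstSketch.Pst(2)` and trained through `add` can then score a sequence with `normalizedLogLikelihood`. Under this assumed score, lower values mean the sequence is less similar to the training data. A sequence containing a continuation never seen in training scores negative infinity, which is why practical PSTs usually add smoothing.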
- The program extends the `Estimator` interface, so it can be used as a stage in a Spark MLlib pipeline.
- It is designed to run in a distributed manner, taking advantage of the high-performance capabilities of Spark.
- The final transformed output is the similarity between each individual sequence and the PST.
- Spark >= 2.4
- Scala