# Sampling from the generative model 

In this notebook, we use the generative model of the HDHP (Hierarchical Dirichlet-Hawkes Process) to sample events. We start with a predefined number of users, say `10`, and model their behavior as they post questions on an online platform. For simplicity, we use a dummy vocabulary.

We start by importing all the libraries that will be required.

In [1]:
%matplotlib inline
import datetime
import string
import hdhp
import notebook_helpers
import seaborn as sns



Now, let us set some parameters for our model. These fall into two categories: the ones relevant to the content and the ones relevant to the time dynamics. Starting with the first set, we need to decide on:

* the vocabulary: a dummy set of `100` words, i.e. `word0`, `word1`, ... , `word99`.
* the minimum and maximum length of a question
* the number of words of each pattern

As far as the time dynamics are concerned, we need to set:

* $\alpha_0$: the parameters of the Gamma prior for the time kernel of each pattern
* $\mu_0$: the parameters of the Gamma prior for the user activity rate
* $\omega$: the time decay parameter
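
The role of these three parameters can be illustrated with a small standalone sketch. It assumes an exponential triggering kernel, a common choice for Hawkes processes that this notebook does not confirm, and it assumes that each prior tuple is a (shape, scale) pair. The function `intensity` and the event times are hypothetical and are not part of `hdhp`.

```python
import math
import random

rng = random.Random(12)

# Assumption: the prior tuples are (shape, scale) pairs of a Gamma distribution.
mu = rng.gammavariate(2, 0.5)        # user activity (base) rate, drawn from mu_0
alpha = rng.gammavariate(2.5, 0.75)  # kernel weight of a pattern, drawn from alpha_0
omega = 3.5                          # time decay

def intensity(t, mu, alpha, omega, history):
    """Base rate plus exponentially decaying contributions of past events."""
    return mu + sum(alpha * math.exp(-omega * (t - t_i))
                    for t_i in history if t_i < t)

history = [1.0, 1.2, 3.0]  # made-up event times
before = intensity(0.5, mu, alpha, omega, history)   # no past events yet
after = intensity(1.25, mu, alpha, omega, history)   # shortly after two events
print(before, after)
```

Under these assumptions, a larger `omega` makes the influence of past events fade faster, while a larger `alpha` makes each event raise the rate more.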

Finally, in order to make the generative process more user-friendly, we can pre-set the number of patterns that our users can sample from.

In [2]:
vocabulary = ['word' + str(i) for i in range(100)]  # the `words` of our documents
doc_min_length = 5
doc_length = 10
words_per_pattern = 50

alpha_0 = (2.5, 0.75)
mu_0 = (2, 0.5)
beta_0 = (4, 5)
zeta_0 = [500, 5, 5]
omega = 3.5

num_global_patterns = 12
num_patterns = num_global_patterns
num_cluster = 3
num_patterns_per_cluster = [20, 4, 8]
process = hdhp.HDHProcess(num_cluster=num_cluster,
                          num_global_patterns=num_global_patterns,
                          num_patterns_per_cluster=num_patterns_per_cluster,
                          num_patterns=num_patterns,
                          alpha_0=alpha_0, mu_0=mu_0,
                          beta_0=beta_0, zeta_0=zeta_0,
                          vocabulary=vocabulary,
                          omega=omega, words_per_pattern=words_per_pattern,
                          random_state=12)

{0: 0.47024072731095173, 1: 0.3842734324577831, 2: 0.4319450350423422}


Before generating any questions, we can take a look at the patterns that we initialized our process with, and look at the content distribution of each pattern. Although each pattern has a different word distribution, we can still plot the overlap (Jaccard similarity) between the words that have non-zero probability for each pattern. Since we used a limited number of patterns, the distribution of the overlap will not be smooth.
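
For reference, the Jaccard similarity of two word sets is the size of their intersection divided by the size of their union. The sketch below computes it for every pair of patterns in a small made-up dictionary. It is independent of `notebook_helpers`, and the `patterns` data is purely illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up supports: the words with non-zero probability in each pattern.
patterns = {
    0: ['word0', 'word1', 'word2', 'word3'],
    1: ['word2', 'word3', 'word4', 'word5'],
    2: ['word6', 'word7'],
}

# One overlap value per unordered pair of patterns.
overlap = {(i, j): jaccard(patterns[i], patterns[j])
           for i, j in combinations(sorted(patterns), 2)}
print(overlap)
```

With only a few patterns there are only a few pairs, which is why a histogram of these values looks coarse.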

In [3]:
# overlap = notebook_helpers.compute_pattern_overlap(process)
# sns.distplot(overlap, kde=True, norm_hist=True, axlabel='Content overlap')

The next step is to generate the questions for the users. As mentioned above, for the purpose of this notebook we limit ourselves to a set of `10` users. For each of them, we sample their questions from the process. We sample at least `100` and at most `5000` questions per user, and we make sure that we do not sample for more than `365` time units (assume that 1 time unit = 1 day).
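
To give an intuition for how the event cap and the time horizon bound the sampling, here is a simplified sketch for a single user with a single pattern. It uses Ogata's thinning with an exponential kernel, which is an assumption and not necessarily what `sample_user_events` does internally. The function `sample_hawkes` is hypothetical, and the minimum number of events is not handled here.

```python
import math
import random

def sample_hawkes(mu, alpha, omega, t_max, max_num_events, seed=0):
    """Sample event times of a univariate Hawkes process by thinning."""
    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # sum of alpha * exp(-omega * (t - t_i)) over past events
    while len(events) < max_num_events:
        # Between events the intensity only decays, so its current value
        # is an upper bound until the next accepted event.
        upper = mu + excitation
        wait = rng.expovariate(upper)
        if t + wait > t_max:
            break  # the time horizon is reached
        t += wait
        excitation *= math.exp(-omega * wait)  # decay up to the candidate time
        # Accept the candidate with probability intensity(t) / upper.
        if rng.random() * upper <= mu + excitation:
            events.append(t)
            excitation += alpha  # each accepted event raises the intensity
    return events

events = sample_hawkes(mu=0.5, alpha=0.8, omega=3.5,
                       t_max=365, max_num_events=5000)
print(len(events))
```

The loop stops at whichever limit is hit first: the maximum number of events or the time horizon.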

In [4]:
process.reset()  # removes any previously generated data
for i in range(10):
    cluster, events = process.sample_user_events(min_num_events=100,
                                                 max_num_events=5000,
                                                 t_max=365)
#     print(process.local_pattern_popularity)
#     print('Total #events', len(events))
#     print(process.level_history_per_user[i].count(True))
#     print("cluster:" + str(cluster))

[0.97281838 0.01528673 0.01189489]
(array([0], dtype=int64),)
1.0
0
5001
150
{0: 3, 1: 3, 2: 3, 3: 3, 4: 0, 5: 3, 6: 3, 7: 3, 8: 1, 9: 3, 10: 5, 11: 3, 12: 1, 13: 11, 14: 3}
[0.97281838 0.01528673 0.01189489]
(array([0], dtype=int64),)
1.0
0
5001
4894
{0: 4, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 10, 7: 5, 8: 1, 9: 1, 10: 1, 11: 3, 12: 3, 13: 5, 14: 3, 15: 1, 16: 3, 17: 5, 18: 1, 19: 6, 20: 11, 21: 3, 22: 11, 23: 3, 24: 1, 25: 1, 26: 3, 27: 5, 28: 3, 29: 10, 30: 8, 31: 1, 32: 3, 33: 3, 34: 3, 35: 3, 36: 3, 37: 3, 38: 3, 39: 0, 40: 3, 41: 6, 42: 3, 43: 3, 44: 3, 45: 2, 46: 8, 47: 2, 48: 3, 49: 3, 50: 1, 51: 6, 52: 1, 53: 3, 54: 3, 55: 3, 56: 6, 57: 3, 58: 3, 59: 3, 60: 5, 61: 5, 62: 3, 63: 1, 64: 6, 65: 0, 66: 3, 67: 5, 68: 11, 69: 5}
[0.97281838 0.01528673 0.01189489]
(array([1], dtype=int64),)
1.0
1
5001
4956
{0: 3, 1: 1, 2: 3, 3: 8, 4: 3, 5: 2, 6: 3, 7: 3, 8: 6, 9: 6, 10: 1, 11: 3, 12: 3, 13: 11, 14: 11, 15: 1, 16: 4, 17: 11, 18: 5, 19: 6}
[0.97281838 0.01528673 0.01189489]
(array([0], dty

We can now review which patterns each user has adopted and check the content distribution of one of these patterns.

In [5]:
print(process.user_patterns_set(user=0))

AttributeError: 'HDHProcess' object has no attribute 'dish_on_table_per_user'

In [None]:
print(process.user_pattern_history_str(user=0, patterns=[0, 1], show_time=True))

In [None]:
print(process.pattern_content_str(patterns=[0, 1],
                                  show_words=10))

The last step is to plot the intensity (the rate at which each user asks questions) for each user and each pattern. Below, the plots share the same $y$-axis, so that the user intensities are comparable. Each color corresponds to a single pattern. We will also manually set the "beginning of time" at an arbitrary date.
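
Setting the "beginning of time" amounts to shifting every sampled timestamp by a fixed date. A minimal sketch, assuming that timestamps are measured in days and using made-up event times:

```python
import datetime

start_date = datetime.datetime(2015, 9, 15)
event_times = [0.0, 1.5, 30.25]  # made-up timestamps, in days

# Each timestamp is an offset in days from the chosen start date.
event_dates = [start_date + datetime.timedelta(days=t) for t in event_times]
for date in event_dates:
    print(date.isoformat())
```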

In [None]:
start_date = datetime.datetime(2015, 9, 15)
fig = process.plot(start_date=start_date, user_limit=5,
                   num_samples=5000, time_unit='days',
                   label_every=1, seed=5)
fig.show()