
Novelty and collective attention


Note: this page contains reading notes on the paper "Novelty and collective attention" [1].





From the abstract of [1]: "The observations can be described by a dynamical model characterized by a single novelty factor. Our measurements indicate that novelty within groups decays with a stretched-exponential law, suggesting the existence of a natural time scale over which attention fades."


Log-normal distribution

We measured the histogram of the final diggs of all 29,864 popular stories in the year 2006. As can be seen from Fig. 1, the distribution is quite skewed, and the normal Q–Q plot of <math>log(N_\infty)</math> is close to a straight line. A Kolmogorov–Smirnov normality test of <math>log(N_{\infty})</math> with mean 6.546 and standard deviation 0.6626 yields a P value of 0.0939, suggesting that <math>N_{\infty}</math> follows a log-normal distribution.

Let <math>N_t</math> denote the number of diggs of a popular story after a finite time t. The distribution of <math>log(N_t)</math> again follows a bell-shaped curve. As an example, a Kolmogorov–Smirnov normality test of <math>log(N_{2h})</math> with mean 5.925 and standard deviation 0.5451 yields a P value as high as 0.5605, supporting the hypothesis that <math>N_t</math> also follows a log-normal distribution.

(Figure: Figs. 1 and 2 of [1] — the distribution of final diggs and the sample mean vs. sample variance of log N_t.)
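As an illustration of this test (a minimal sketch, not the paper's code or data), the following runs a Kolmogorov–Smirnov normality test on log-transformed counts with SciPy; here N_final is a synthetic stand-in for the real final digg counts.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic stand-in for the 29,864 final digg counts (log-normal by construction)
N_final = rng.lognormal(mean=6.546, sigma=0.6626, size=29864)

logN = np.log(N_final)
# K-S test of log(N) against a normal with the sample mean and standard deviation
stat, p = stats.kstest(logN, 'norm', args=(logN.mean(), logN.std()))
print(stat, p)  # a large p-value is consistent with log-normality of N_final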

A simple stochastic dynamical model

  • <math>N_t</math> represents the number of people who know the story at time t, and a fraction <math>\mu</math> of those people will further spread the story to some of their friends.
  • Mathematically, this assumption can be expressed as <math>N_t = (1 + X_t)N_{t-1}</math>, where <math>X_1, X_2, \ldots</math> are positive, independent, and identically distributed random variables with mean <math>\mu</math> and variance <math>\sigma^2</math>.
  • This growth in time is eventually curtailed by a decay in novelty, which we parameterize by a time-dependent factor <math>r_t</math>, consisting of a series of decreasing positive numbers with the property that <math>r_1 = 1</math> and <math>r_t \rightarrow 0</math> , as <math> t \rightarrow \infty</math>.
  • With this additional parameter, the full stochastic dynamics of story propagation is governed by <math> N_t = (1 + r_t X_t)N_{t-1} </math>, where the factor <math>r_t X_t</math> acts as a discounted random multiplicative factor.
  • Put together, we have <math>N_t = \prod_{s = 1}^{t}(1 + r_s X_s)N_0</math>
  • When <math>X_t</math> is small (which is the case for small time steps), we have the following approximate solution:
<math>N_t = \prod_{s = 1}^{t}(1 + r_s X_s)N_0 \approx \prod_{s = 1}^{t} e^{r_s X_s} N_0 = e^{\sum_{s = 1}^{t} r_s X_s} N_0</math> [1]

This is because, when x is small, <math> 1 + x \approx e^x </math> (see the appendix at the end of this page).

  • Taking the logarithm of both sides, we obtain
<math>log N_t - log N_0 = \sum_{s = 1}^{t} r_s X_s </math> [2]

Mean and Variance

  • Taking the mean and variance of both sides of Eq. 2:
<math>\frac{E(log N_t - log N_0)}{var(log N_t - log N_0)} = \frac{\sum_{s = 1}^{t} r_s \mu}{\sum_{s = 1}^{t} r_s \sigma^2} = \frac{\mu}{\sigma^2}</math> [3]
 Question: how is Eq. [3] derived? (See the note below.)
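One way to see what is going on (a sketch of my own, not the paper's derivation): since the <math>r_s</math> form a deterministic sequence and the <math>X_s</math> are independent with mean <math>\mu</math> and variance <math>\sigma^2</math>, taking the expectation and the variance of the right-hand side of Eq. 2 term by term gives

<math>E\left(\sum_{s = 1}^{t} r_s X_s\right) = \mu \sum_{s = 1}^{t} r_s, \qquad \operatorname{Var}\left(\sum_{s = 1}^{t} r_s X_s\right) = \sigma^2 \sum_{s = 1}^{t} r_s^2</math>

A literal reading therefore gives a variance proportional to <math>\sum r_s^2</math> rather than <math>\sum r_s</math>; the constant ratio <math>\mu/\sigma^2</math> in Eq. 3 holds if the variance contributed by step s is taken to scale with <math>r_s</math>, for example when <math>r_s</math> is read as rescaling the effective duration of step s rather than as a multiplier on <math>X_s</math>.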

Product of independent variables

If two variables X and Y are independent, the variance of their product is given by[2]

<math>
\begin{align} \operatorname{Var}(XY) &= [E(X)]^2 \operatorname{Var}(Y) + [E(Y)]^2 \operatorname{Var}(X) + \operatorname{Var}(X)\operatorname{Var}(Y). \end{align} </math>

Equivalently, using the basic properties of expectation, it is given by

<math>
\operatorname{Var}(XY) = E(X^2) E(Y^2) - [E(X)]^2 [E(Y)]^2. </math>

Now if X and Y are independent, then by definition j(x,y) = f(x)g(y) where f and g are the marginal PDFs for X and Y. Then

<math>\begin{align}
\operatorname{E}[XY] &= \iint xy \,j(x,y)\,\mathrm{d}x\,\mathrm{d}y = \iint x y f(x) g(y)\,\mathrm{d}y\,\mathrm{d}x \\ &= \left[\int x f(x)\,\mathrm{d}x\right]\left[\int y g(y)\,\mathrm{d}y\right] = \operatorname{E}[X]\operatorname{E}[Y] \end{align}</math>

and Cov(X, Y) = 0.

Observe that independence of X and Y is required only to write j(x, y) = f(x)g(y), and this is required to establish the second equality above. The third equality follows from a basic application of the Fubini-Tonelli theorem.
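A quick numerical sanity check of the product-variance identity (my own sketch, not from the source), using independent normal samples and NumPy:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(2.0, 0.5, 1_000_000)   # independent samples of X
Y = rng.normal(-1.0, 1.5, 1_000_000)  # independent samples of Y

lhs = np.var(X * Y)
rhs = np.mean(X**2) * np.mean(Y**2) - np.mean(X)**2 * np.mean(Y)**2
print(lhs, rhs)  # the two estimates agree up to sampling error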

If the model is correct, a plot of the sample mean versus the sample variance for each time t should yield a straight line passing through the origin with slope <math>\frac{\mu}{\sigma^2}</math>, as shown in Fig. 2 above.

Computing the decay factor

The decay factor <math>r_t</math> can now be computed explicitly from <math>N_t</math> up to a constant scale. By taking expectation values of Eq. 2 and normalizing <math>r_1</math> to 1, we have

<math>r_t = \frac{E(log N_t) - E(log N_{t-1})}{E(log N_1) - E(log N_0)}</math> [4]

 Question: how is Eq. [4] derived?

From Eq. 2 we obtain:

<math>log N_t - log N_{t-1} = r_t X_t</math> [5]

<math>log N_1 - log N_0 = r_1 X_1</math> [6]

Because the <math>r_t</math> are deterministic and <math>E(X_t) = \mu</math>, taking expectations of [5] gives <math>E(log N_t - log N_{t-1}) = r_t \mu</math> [7]

and taking expectations of [6], with <math>r_1 = 1</math>, gives <math>E(log N_1 - log N_0) = \mu</math> [8]

Dividing [7] by [8] then yields Eq. 4:

<math>r_t = \frac{E(log N_t) - E(log N_{t-1})}{E(log N_1) - E(log N_0)}</math>
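As a check of Eq. 4 (a minimal sketch, not the paper's code), the following simulates many stories under <math>N_t = (1 + r_t X_t)N_{t-1}</math> with a known decay factor and recovers it from the sample means of <math>log N_t</math>; the small residual bias comes from the approximation <math>log(1 + r_s X_s) \approx r_s X_s</math>.

import numpy as np

rng = np.random.default_rng(0)
T, n_stories = 50, 5000
mu, sigma = 0.3, 0.1
r_true = np.exp(-np.arange(1, T + 1)**0.4)
r_true = r_true / r_true[0]              # normalize so that r_1 = 1

logN = np.zeros((n_stories, T + 1))      # log N_0 = 0, i.e. N_0 = 1
for t in range(1, T + 1):
    X = rng.normal(mu, sigma, n_stories)
    logN[:, t] = logN[:, t - 1] + np.log(1 + r_true[t - 1] * X)

# Eq. 4: r_t = [E(log N_t) - E(log N_{t-1})] / [E(log N_1) - E(log N_0)]
m = logN.mean(axis=0)
r_est = (m[1:] - m[:-1]) / (m[1] - m[0])
print(np.round(r_true[:5], 3))
print(np.round(r_est[:5], 3))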

(Figure: the estimated decay factor r_t; cf. Fig. 3 in [1].)

Stretched exponential relaxation

  1. The curve of <math>r_t</math> estimated from the 1,110 stories in January 2007 is shown in Fig. 3a. As can be seen, <math>r_t</math> decays very fast within the first few hours, and its value drops to 0.03 after 3 hours.
  2. Fig. 3 b and c shows that <math>r_t</math> decays slower than exponential and faster than power law.
  3. Fig. 3d shows that <math>r_t</math> can be fit empirically to a stretched exponential relaxation or Kohlrausch–Williams–Watts law[3]:
<math>r_t \sim e^{-0.4 t^{0.4}}</math>.

The half-life <math> \tau </math> of <math> r_t</math> can then be determined by solving the equation

<math>\int_{0}^{\tau} e^{-0.4 t^{0.4}} \mathrm{d}t = \frac{1}{2} \int_{0}^{\infty} e^{-0.4 t^{0.4}} \mathrm{d}t</math>

A numerical calculation gives <math>\tau \approx 69</math> minutes, or about 1 hour. This characteristic time is consistent with the fact that a story usually lives on the front page for a period between 1 and 2 hours.
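A minimal sketch of this numerical calculation with SciPy (assuming, as the 69-minute result suggests, that t in the fitted formula is measured in minutes):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

r = lambda t: np.exp(-0.4 * t**0.4)

total, _ = quad(r, 0, np.inf)  # total area under r(t) on [0, infinity)
# find tau such that the area over [0, tau] is half of the total area
tau = brentq(lambda u: quad(r, 0, u)[0] - total / 2, 1e-6, 1000)
print(tau)  # roughly 69 (minutes), i.e. about 1 hour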

Numerical simulation

1. The normal distribution in Python

from random import normalvariate
import numpy as np                  # used by the code blocks below
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# draw 500 samples from a normal distribution with mean 0.5 and sd 0.1
x = [normalvariate(0.5, 0.1) for i in range(500)]
plt.hist(x)
plt.show()

(Figure: histogram of the 500 samples.)

We see a bell-shaped distribution; adjusting the mean and the standard deviation gives different samples.

2. The stochastic growth model

def random_model(mean, sd):
    # simulate Nt = (1 + Xt) * Nt-1 with Xt ~ Normal(mean, sd)
    Nt = {}
    Nt[0] = 1
    for t in range(1, 100):
        xt = normalvariate(mean, sd)
        Nt[t] = (1 + xt) * Nt[t-1]
    return Nt

fig = plt.figure(figsize=(12, 4), facecolor='white')
cmap = cm.get_cmap('rainbow_r', 10)

for mean in np.linspace(0.1, 0.9, 10):
    Nt = random_model(mean, 0.1)
    plt.plot(list(Nt.keys()), list(Nt.values()), color=cmap(mean),
             linestyle='-', marker='.', label=str(np.round(mean, 2)))
plt.yscale('log')                   # base-10 log scale
plt.ylabel('log(Nt)'); plt.xlabel('t')
plt.legend(loc=2, fontsize=8)

plt.show()

(Figure: Nt vs. t on a log scale for different means; straight lines, i.e. exponential growth.)

Clearly the growth here is exponential, and the number of diggs grows far too fast. We therefore need an additional parameter, a decay factor, that slows the growth down more and more over time. One choice is the stretched exponential relaxation.

3. The stochastic growth model with decay

def random_model_with_decay(mean, sd, decay_parameter):
    # simulate Nt = (1 + rt * Xt) * Nt-1 with a stretched-exponential decay factor
    Nt = {}
    Nt[0] = 1
    for t in range(1, 100):
        xt = normalvariate(mean, sd)
        rt = np.e**(-(t**decay_parameter))  # simplified here: rt = exp(-t**decay_parameter)
        Nt[t] = (1 + rt*xt) * Nt[t-1]
    return Nt


fig = plt.figure(figsize=(12, 4), facecolor='white')
cmap = cm.get_cmap('rainbow_r', 10)

for mean in np.linspace(0.5, 0.9, 1):      # a single mean (0.5)
    for dp in np.linspace(0.1, 0.5, 5):    # five decay parameters
        Nt = random_model_with_decay(mean, 0.1, dp)
        plt.plot(list(Nt.keys()), list(Nt.values()),
                 color=cmap(mean*dp), linestyle='-', marker='.',
                 label='Mean = ' + str(np.round(mean, 2)) + ' & Decay parameter = ' + str(dp))
plt.yscale('log')                          # base-10 log scale
plt.ylabel('log(Nt)'); plt.xlabel('t')
plt.legend(loc=2, fontsize=8)

plt.show()

(Figure: Nt vs. t on a log scale for different decay parameters; growth levels off as novelty decays.)

Here we can clearly see that different decay parameters change the decay factor markedly, and with it the growth curve: the decay is slower than exponential (an exponential decays too quickly) but faster than a power law (a power law's long tail decays too slowly).

Appendix: showing that <math> 1 + x \approx e^x </math> for small x

Euler's number e is approximately 2.71828, but where it comes from matters more. In 1748 Euler published the Introductio in analysin infinitorum, which established the mathematical standing of e.

Proof:

Because <math>e^x = 1 + \frac{x}{1!}+ \frac{x^2}{2!}+\cdots+ \frac{x^n}{n!}+\cdots</math> (see [4]),

when x approaches 0 we have <math>e^x \approx 1 + \frac{x}{1!} = 1 + x</math>.
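A two-line numerical illustration (my own, not from the source):

import numpy as np
for x in [0.5, 0.1, 0.01]:
    print(x, 1 + x, np.exp(x))  # 1 + x and e^x converge as x shrinks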

Back to [[Collective Order]]

References

Wu F, Huberman BA (2007) Novelty and collective attention. Proceedings of the National Academy of Sciences 104: 17599–17601.

