# Find words that change most over time

## Approach

Compare each word's distribution over time with a uniform distribution. The null hypothesis is that a word's distribution does not change over time. Filter out all words for which the null hypothesis cannot be rejected, i.e. for which the deviation from uniformity is not statistically significant.

https://en.wikipedia.org/wiki/Goodness_of_fit#Categorical_data


## Candidate methods

### Chi-square test for a discrete uniform distribution

A χ2 goodness-of-fit test is used to determine how (un)likely it is that a data series (i.e. the word's distribution over time) has been generated by a (discrete) uniform distribution. The actual word counts for each year are used, since the χ2 test is not applicable to relative frequencies. As a rule of thumb, the χ2 test requires each expected count to be greater than or equal to 5.

From [stats.stackexchange.com](https://stats.stackexchange.com/questions/25827/how-does-one-measure-the-non-uniformity-of-a-distribution):

*If you have not only the frequencies but the actual counts, you can use a χ2 goodness-of-fit test for each data series. In particular, you wish to use the test for a discrete uniform distribution. This gives you a good test, which allows you to find out which data series are likely not to have been generated by a uniform distribution, but does not provide a measure of uniformity..... (I guess that the chi-squared statistic can be seen as a measure of uniformity, but it has some drawbacks, such as the lack of convergence, dependence on the arbitrarily placed bins, that the number of expected counts in the cells needs to be sufficiently large, etc. Which measure/test to use is a matter of taste though, and entropy is not without its problems either (in particular, there are many different estimators of the entropy of a distribution). To me, entropy seems like a less arbitrary measure and is easier to interpret.)*

$\tilde{\chi}^2=\frac{1}{d}\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$ ($d$ degrees of freedom, $n$ samples, $E_k$ expected count, $O_k$ observed count)
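
As a minimal sketch of the test described above, `scipy.stats.chisquare` can be applied to raw yearly counts; when no expected frequencies are passed, it assumes all categories are equally likely, which is the discrete uniform case. The two count vectors and the helper name `chi2_uniform` are hypothetical and serve only as illustration. SciPy returns the unscaled statistic, so the sketch also divides by the degrees of freedom to match the formula above.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly counts for two words (illustrative data only).
steady_counts = np.array([12, 9, 11, 10, 8])
bursty_counts = np.array([2, 3, 40, 3, 2])


def chi2_uniform(counts):
    """Test observed counts against a discrete uniform distribution.

    Returns (statistic, statistic / degrees of freedom, p-value).
    """
    # With f_exp omitted, scipy assumes equal expected counts in every bin.
    statistic, p_value = stats.chisquare(counts)
    dof = len(counts) - 1
    return statistic, statistic / dof, p_value


steady_stat, steady_reduced, steady_p = chi2_uniform(steady_counts)
bursty_stat, bursty_reduced, bursty_p = chi2_uniform(bursty_counts)

alpha = 0.05  # significance level, an assumption for this sketch
for name, p_value in [("steady", steady_p), ("bursty", bursty_p)]:
    verdict = "reject uniformity" if p_value < alpha else "keep null hypothesis"
    print(f"{name}: p={p_value:.4g} -> {verdict}")
```

Words for which the null hypothesis is kept would be filtered out. When many words are tested at once, some correction for multiple comparisons is probably needed as well.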
 
References:

  - [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test)
  - [comparing-two-word-distributions](https://stats.stackexchange.com/questions/236192/comparing-two-word-distributions)

### Simple linear regression

Use a least-squares fit to compute the slope of each word's distribution over time. Use as null hypothesis the belief that a word's distribution does not change over time, i.e. that the slope is zero. Filter out all the words for which there is no significance.

| Measure | Model | Description |
| --- | --- | --- |
| Slope | $y = k * x + m$ | Use linear regression to compute the slope $k$. Select the $n$ words having the highest absolute value of $k$. |
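
The slope-based selection can be sketched with `scipy.stats.linregress`, which returns both the slope $k$ and a p-value for the null hypothesis that the slope is zero. The word names, counts and the helper `slope_per_word` below are hypothetical and only illustrate the idea.

```python
import numpy as np
from scipy import stats

years = np.arange(2000, 2010)

# Hypothetical yearly counts per word (illustrative data only).
word_counts = {
    "rising": np.array([5, 9, 10, 14, 18, 19, 24, 25, 30, 32]),
    "flat": np.array([10, 11, 9, 10, 10, 11, 9, 10, 11, 9]),
}


def slope_per_word(years, word_counts):
    """Fit y = k * x + m per word and return {word: (k, p_value)}."""
    results = {}
    for word, counts in word_counts.items():
        fit = stats.linregress(years, counts)
        results[word] = (fit.slope, fit.pvalue)
    return results


results = slope_per_word(years, word_counts)

# Rank words by the absolute value of the slope, largest first.
ranked = sorted(results, key=lambda w: abs(results[w][0]), reverse=True)

for word in ranked:
    k, p_value = results[word]
    print(f"{word}: k={k:.3f}, p={p_value:.4f}")
```

A linear fit only captures monotone trends; a word that peaks in the middle of the period can have a slope close to zero even though its distribution is far from uniform.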

### G-test for a discrete uniform distribution

https://en.wikipedia.org/wiki/G-test
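
A possible way to run a G-test in SciPy is `scipy.stats.power_divergence` with `lambda_="log-likelihood"`; as with the χ2 test, omitting the expected frequencies means a uniform expectation. The counts below are hypothetical, and the manual computation is included only to show which statistic is being computed, $G = 2\sum_k O_k \ln(O_k / E_k)$.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly counts for one word (illustrative data only).
counts = np.array([12, 9, 11, 10, 8])

# G-test against a uniform expectation (f_exp omitted).
g_stat, g_p = stats.power_divergence(counts, lambda_="log-likelihood")

# Manual computation of the same statistic for comparison.
# Note: this simple form assumes no zero counts (log(0) is undefined).
expected = np.full(len(counts), counts.sum() / len(counts))
g_manual = 2.0 * np.sum(counts * np.log(counts / expected))

print(f"G={g_stat:.4f}, manual G={g_manual:.4f}, p={g_p:.4f}")
```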

### Kolmogorov-Smirnov test (KS-test)

https://stackoverflow.com/questions/25208421/how-to-test-for-uniformity
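
As a rough sketch, `scipy.stats.kstest` can compare a sample with the continuous uniform distribution on [0, 1]. The KS test is formulated for continuous data, so this sketch assumes that each occurrence of a word has a position in time that has been rescaled to [0, 1]; counts binned per year produce many ties, and the p-values should then be read with caution. Both samples below are synthetic.

```python
import numpy as np
from scipy import stats

n = 200

# Synthetic occurrence positions in [0, 1] (illustrative data only).
even = (np.arange(n) + 0.5) / n   # spread evenly over the whole period
skewed = even ** 3                # concentrated at the start of the period

# "uniform" without extra arguments means the uniform distribution on [0, 1].
even_stat, even_p = stats.kstest(even, "uniform")
skewed_stat, skewed_p = stats.kstest(skewed, "uniform")

print(f"even:   D={even_stat:.4f}, p={even_p:.4f}")
print(f"skewed: D={skewed_stat:.4f}, p={skewed_p:.4g}")
```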

### Entropy 

https://stats.stackexchange.com/questions/25827/how-does-one-measure-the-non-uniformity-of-a-distribution

*There are other possible approaches, such as computing the entropy of each series - the uniform distribution maximizes the entropy, so if the entropy is suspiciously low you would conclude that you probably don't have a uniform distribution. That works as a measure of uniformity in some sense.*
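
A minimal sketch of the entropy approach, using `scipy.stats.entropy` and dividing by $\ln d$ so that a uniform distribution scores exactly 1 and more concentrated distributions score lower. The helper name `normalized_entropy` and the sample matrix are illustrative assumptions; what counts as "suspiciously low" is left open here.

```python
import numpy as np
from scipy import stats

# Rows are distributions over d = 10 time bins (illustrative data only).
m = np.array([
    [0.10, 0.11, 0.10, 0.09, 0.09, 0.11, 0.10, 0.10, 0.12, 0.08],
    [0.10, 0.10, 0.10, 0.08, 0.12, 0.12, 0.09, 0.09, 0.12, 0.08],
    [0.03, 0.02, 0.61, 0.02, 0.03, 0.07, 0.06, 0.05, 0.06, 0.05],
])


def normalized_entropy(matrix):
    """Return entropy of each row divided by its maximum, ln(d)."""
    d = matrix.shape[1]
    # scipy.stats.entropy works column-wise by default, hence the transpose.
    # It also normalizes each distribution to sum to 1.
    return stats.entropy(matrix.T) / np.log(d)


scores = normalized_entropy(m)
uniform_score = normalized_entropy(np.ones((1, 10)))[0]

print(np.round(scores, 4))
```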

### Kullback-Leibler (KL) divergence
Another suggestion would be to use a measure like the Kullback-Leibler divergence, which measures how much one distribution differs from another (here, a word's distribution over time versus the uniform distribution).
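
As a sketch, `scipy.stats.entropy(p, q)` computes the KL divergence $D_{KL}(p \,\|\, q) = \sum_k p_k \ln(p_k / q_k)$ when a second distribution is passed. With $q$ uniform over $d$ bins this equals $\ln d - H(p)$, so it ranks words the same way as the entropy measure. The helper name `kl_from_uniform` and the sample vectors are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def kl_from_uniform(p):
    """Return D_KL(p || uniform) for a vector of counts or frequencies."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                          # normalize to a distribution
    uniform = np.full(len(p), 1.0 / len(p))  # reference distribution
    return stats.entropy(p, uniform)


flat = [1, 1, 1, 1]                 # illustrative data only
peaked = [0.03, 0.02, 0.61, 0.02, 0.03, 0.07, 0.06, 0.05, 0.06, 0.05]

kl_flat = kl_from_uniform(flat)
kl_peaked = kl_from_uniform(peaked)

print(f"flat: {kl_flat:.4f}, peaked: {kl_peaked:.4f}")
```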


### L2 norm

*Here is a simple heuristic: if you assume elements in any vector sum to 1 (or simply normalize each element with the sum to achieve this), then uniformity can be represented by L2 norm, which ranges from $\frac{1}{\sqrt{d}}$ to 1, with $d$ being the dimension of vectors.*

*The lower bound $\frac{1}{\sqrt{d}}$ corresponds to uniformity and upper bound to the 1-hot vector.*

*To scale this to a score between 0 and 1, you can use $\frac{n\sqrt{d}-1}{\sqrt{d}-1}$, where $n$ is the L2 norm.*

*An example modified from yours with elements summing to 1 and all vectors with the same dimension for simplicity:*

    0.10    0.11    0.10    0.09    0.09    0.11    0.10    0.10    0.12    0.08
    0.10    0.10    0.10    0.08    0.12    0.12    0.09    0.09    0.12    0.08
    0.03    0.02    0.61    0.02    0.03    0.07    0.06    0.05    0.06    0.05

The following will yield 0.0028, 0.0051, and 0.4529 for the rows:

```
d = size(m, 2);          % number of elements in each row vector
for i = 1 : size(m, 1)   % iterate over the rows of m
    disp((norm(m(i,:)) * sqrt(d) - 1) / (sqrt(d) - 1));
end
```



In [4]:
%load_ext autoreload
%autoreload 2

import operator
import scipy
import scipy.stats
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt

from westac.common import vectorized_corpus

import westac.common.utility as utility

logger = utility.setup_logger(filename='./westac.log')

%matplotlib inline

matplotlib.rcParams["figure.figsize"] = [20, 4.8] 
pd.set_option('display.max_rows', 1000)




In [12]:

import numpy as np
import math

m = np.array([
    [ 0.10, 0.11, 0.10, 0.09, 0.09, 0.11, 0.10, 0.10, 0.12, 0.08 ],
    [ 0.10, 0.10, 0.10, 0.08, 0.12, 0.12, 0.09, 0.09, 0.12, 0.08 ],
    [ 0.03, 0.02, 0.61, 0.02, 0.03, 0.07, 0.06, 0.05, 0.06, 0.05 ]
])

# The following will yield 0.0028, 0.0051, and 0.4529 for the rows:

d = m.shape[1]

for i in range(0, m.shape[0]):

    l2_norm = (np.linalg.norm(m[i, :]) * math.sqrt(d) - 1 ) / (math.sqrt(d) - 1)

    print('{:.4f}'.format(l2_norm))


0.0028
0.0051
0.4529


In [15]:
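def gof_by_l2_norm(matrix, axis=1):
    """Computes the scaled L2-norm score for rows (axis=1) or columns (axis=0).

    The score is 0 for a uniform vector and 1 for a 1-hot vector.
    """
    # d is the length of the vectors the norm is taken over, so it must
    # follow the chosen axis (previously it was always the column count).
    d = matrix.shape[axis]

    l2_norm = (np.linalg.norm(matrix, axis=axis) * math.sqrt(d) - 1) / (math.sqrt(d) - 1)

    return l2_norm


print(np.round(gof_by_l2_norm(m), 4))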


[0.0028 0.0051 0.4529]
