The goal here is to assess what effect social network centrality has on a speaker's influence on the overall speech pattern of a language community. To do this we will:

1. Break the corpus up into year-long periods
2. Create a social network for each period
3. Measure users' centrality in the social network
4. Create a language model for each period
5. Measure the average [perplexity](https://en.wikipedia.org/wiki/Perplexity#Perplexity_per_word) of each user's speech in each period
6. Correlate centrality with perplexity
  1. We expect the strongest negative correlation between centrality in one period and perplexity in the following period
  2. Check this against the correlation between centrality and perplexity within the same period
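
To make the centrality step concrete, here is a minimal pure-Python sketch of closeness centrality. It assumes an undirected, connected network given as an adjacency dictionary and measures distance in hops; the function name and toy graph are hypothetical, and the notebook itself relies on `nx.closeness_centrality` rather than this code.

```python
from collections import deque


def closeness_centrality(adjacency):
    """Closeness centrality for an undirected, connected graph.

    `adjacency` maps each node to an iterable of its neighbours.
    Closeness is (n - 1) divided by the sum of shortest-path hop
    distances from the node to every other node.
    """
    n = len(adjacency)
    result = {}
    for source in adjacency:
        # Breadth-first search gives hop distances from `source`
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        result[source] = (n - 1) / float(total) if total > 0 else 0.0
    return result


# Toy path graph a - b - c (illustrative data only)
toy = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
scores = closeness_centrality(toy)
```

In the toy graph the middle node `b` is one hop from everyone, so it gets the highest score.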

In [None]:
from __future__ import print_function
import wiki
import kenlm
import datetime
from nltk.tokenize import word_tokenize
import os
import scipy
import networkx as nx
import pandas as pd

This notebook uses the Wiki corpus and assumes it is located in `../data/corpus/`. The corpus can be downloaded using `get_corus.sh`.

In [None]:
corpus = wiki.Corpus('../data/corpus/')

Use KenLM to build an n-gram model for each period.

- [Download and install](https://kheafield.com/code/kenlm/)
- [Paper](https://kheafield.com/papers/avenue/kenlm.pdf)
- [Python module](https://github.com/kpu/kenlm)

KenLM must be run from the command line to estimate a language model. It expects a corpus in the format described in the [estimation documentation](https://kheafield.com/code/kenlm/estimation/). The resulting model file can then be loaded into Python.
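
As a rough sketch of the command-line step, the snippet below builds an `lmplz` call with `subprocess` so that a failed run raises an error instead of passing silently. The helper names `lmplz_command` and `build_model` are hypothetical; the `-o 3 -S 20%` options are the ones this notebook passes to `lmplz`.

```python
import os
import subprocess


def lmplz_command(lmplz_path, order=3, memory='20%'):
    """Return the argument list for an lmplz run (hypothetical helper)."""
    # expanduser resolves a leading '~', which subprocess does not do itself
    return [os.path.expanduser(lmplz_path), '-o', str(order), '-S', memory]


def build_model(lmplz_path, corpus_path, arpa_path, order=3, memory='20%'):
    """Estimate a model: lmplz reads the corpus on stdin, writes ARPA to stdout."""
    cmd = lmplz_command(lmplz_path, order, memory)
    with open(corpus_path, 'rb') as src, open(arpa_path, 'wb') as dst:
        # check_call raises CalledProcessError on a non-zero exit status
        subprocess.check_call(cmd, stdin=src, stdout=dst)


cmd = lmplz_command('~/kenlm/build/bin/lmplz')
```

Calling `build_model` requires a compiled `lmplz` binary at the given path, so only the command construction is exercised here.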

In [None]:
# Location of the compiled KenLM utility
lmplz = '~/kenlm/build/bin/lmplz'
# Location to store temporary input and output files for KenLM
corpus_file = '../data/lm_corpus_{0}.txt'
kenlm_file = '../data/lm_{0}.arpa'

First, create the language model for each period (one-year periods here).
Further utterance cleanup is still needed:

- remove URLs: `[www.google.com]` --> `''`
- strip the URL from links: `[www.google.com|Google]` --> `Google`
- dereference Wikipedia links: `[[The New York Times]]` --> `The New York Times`
- remove unencoded Unicode characters
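
The first three cleanup rules could be sketched with regular expressions along these lines. This is a heuristic under stated assumptions, not part of the `wiki` module: `clean_utterance` is a hypothetical helper, a bare URL is assumed to be a bracketed token without spaces that contains a dot, and the Unicode rule is left out because its exact meaning is not specified here.

```python
import re

# Hypothetical patterns for the link markup described in the list above
WIKI_LINK = re.compile(r'\[\[([^\[\]|]+)\]\]')            # [[Title]]
LABELLED_LINK = re.compile(r'\[[^\[\]|]+\|([^\[\]]+)\]')  # [url|label]
BARE_URL = re.compile(r'\[[^\[\]|\s]*\.[^\[\]|\s]*\]')    # [www.example.com]


def clean_utterance(text):
    """Apply the three link-cleanup rules, then tidy whitespace."""
    text = WIKI_LINK.sub(r'\1', text)      # keep the article title
    text = LABELLED_LINK.sub(r'\1', text)  # keep the link label
    text = BARE_URL.sub('', text)          # drop bare URLs entirely
    # Removing a URL can leave double spaces behind, so collapse them
    return re.sub(r'\s+', ' ', text).strip()


example = clean_utterance(
    'Read [[The New York Times]] via [www.nytimes.com|this link] [www.google.com]'
)
```

Wikipedia links are handled first so that the single-bracket patterns never see the inner half of a double-bracket link.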

In [None]:
for year in range(2006, 2012):

    start_date = datetime.datetime(year, 1, 1)
    end_date = datetime.datetime(year, 12, 31)

    utts = corpus.get_utts(start_date=start_date, end_date=end_date)

    # Write this year's corpus in the format KenLM expects: one utterance per
    # line, tokens separated by spaces.
    # See the corpus formatting notes: https://kheafield.com/code/kenlm/estimation/
    kenlm_corpus = '\n'.join(' '.join(utt.tokenized) for utt in utts)
    with open(corpus_file.format(year), 'w') as f:
        f.write(kenlm_corpus)
    # Use KenLM to create the trigram (-o 3) language model for this year
    os.system('{0} -o 3 -S 20% <{1} >{2}'.format(lmplz, corpus_file.format(year), kenlm_file.format(year)))

Now we gather the following data for each period:

1. Number of utterances by user
2. Network centrality by user
3. Average perplexity of users' utterances based on that period's language model
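
For reference, per-utterance perplexity can be derived from a log10 sentence probability, such as the score a KenLM model returns, as sketched below. The function names are hypothetical and the numbers are made up for illustration; the sketch assumes one extra token is counted for the end-of-sentence marker.

```python
def perplexity(log10_prob, n_words):
    """Perplexity of one utterance from its total log10 probability.

    Assumes n_words excludes the end-of-sentence marker, so one is added.
    """
    return 10.0 ** (-log10_prob / (n_words + 1))


def average_perplexity(scored_utterances):
    """Mean perplexity over (log10_prob, n_words) pairs; NaN if empty."""
    values = [perplexity(lp, n) for lp, n in scored_utterances]
    return sum(values) / len(values) if values else float('nan')


# Illustrative scores only: a two-word and a one-word utterance
single = perplexity(-3.0, 2)
average = average_perplexity([(-3.0, 2), (-4.0, 1)])
```

A user with no utterances in a period has no defined average, which is why the empty case returns NaN rather than zero.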


In [99]:
data = {}
for year in range(2006, 2012):
    print(year)
    start_date = datetime.datetime(year, 1, 1)
    end_date = datetime.datetime(year, 12, 31)

    # Load the language model estimated for this year
    model = kenlm.Model(kenlm_file.format(year))

    # Build this year's social network and compute each user's closeness centrality
    network = corpus.generate_network(start_date=start_date, end_date=end_date)
    centrality = nx.closeness_centrality(network)

    year_data = {}
    for user in network:
        # A set, so identical utterances by the same user are counted once
        user_utts = {' '.join(utt.tokenized) for utt in corpus.get_utts(user, start_date, end_date)}
        n_utts = len(user_utts)
        # Mean per-utterance perplexity under this year's language model
        avg_perplexity = scipy.mean([model.perplexity(utt) for utt in user_utts])
        year_data[user] = {'n_utts': n_utts,
                           'avg_perplexity': avg_perplexity,
                           'centrality': centrality[user]}

    # One row per user; one DataFrame per year
    data[year] = pd.DataFrame.from_dict(year_data, orient='index')
panel = pd.Panel(data)
        

2006
('Generating network from', 46191, 'utterances...')
('There were', 15474, 'replies to unknown users.')
('The unpruned network has ', 5883, 'nodes.')
Pruning network to its largest component...
('\t removed', 143, 'users from', 64, 'disconnected components.')
Normalizing edge weights...
2007
('Generating network from', 58586, 'utterances...')
('There were', 19237, 'replies to unknown users.')
('The unpruned network has ', 7024, 'nodes.')
Pruning network to its largest component...
('\t removed', 170, 'users from', 71, 'disconnected components.')
Normalizing edge weights...
2008
('Generating network from', 56579, 'utterances...')
('There were', 17440, 'replies to unknown users.')
('The unpruned network has ', 6962, 'nodes.')
Pruning network to its largest component...
('\t removed', 213, 'users from', 99, 'disconnected components.')
Normalizing edge weights...
2009
('Generating network from', 54344, 'utterances...')
('There were', 15728, 'replies to unknown users.')
('The unpruned n

In [108]:
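for year in range(2006, 2011):
    # Join each user's data for this year with their data for the next year,
    # then compute pairwise correlations between all columns
    corrs = panel[year].join(panel[year+1], lsuffix=str(year), rsuffix=str(year+1)).corr()
    print(year)
    # Utterance count vs. perplexity, same year
    print('#utts <> perplexty', corrs['n_utts'+str(year)]['avg_perplexity'+str(year)])
    # Centrality vs. perplexity, same year
    print('Centrality <> Perplexity', corrs['centrality'+str(year)]['avg_perplexity'+str(year)])
    # Centrality this year vs. perplexity next year
    print('centrality <> next year perplexity', corrs['centrality'+str(year)]['avg_perplexity'+str(year+1)])
    print()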

2006
#utts <> perplexty -0.00848778352041
Centrality <> Perplexity -0.0457800953046
centrality <> next year perplexity -0.0231191578568

2007
#utts <> perplexty -0.00608223952797
Centrality <> Perplexity -0.0549031939729
centrality <> next year perplexity -0.0530200005688

2008
#utts <> perplexty -0.00933152555147
Centrality <> Perplexity -0.0360035216956
centrality <> next year perplexity -0.0305712133347

2009
#utts <> perplexty -0.010928106278
Centrality <> Perplexity -0.0421389554152
centrality <> next year perplexity -0.061451343082

2010
#utts <> perplexty -0.0139684027976
Centrality <> Perplexity -0.0525567776622
centrality <> next year perplexity -0.0468398856555



| Year | # utterances vs. perplexity | Centrality vs. perplexity (same year) | Centrality vs. next-year perplexity |
|------|-----------------------------|---------------------------------------|-------------------------------------|
| 2006 | -0.008                      | -0.046                                | -0.023                              |
| 2007 | -0.006                      | -0.055                                | -0.053                              |
| 2008 | -0.009                      | -0.036                                | -0.031                              |
| 2009 | -0.011                      | -0.042                                | -0.061                              |
| 2010 | -0.014                      | -0.053                                | -0.047                              |
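
The lagged comparison in the last column can be illustrated without pandas. The sketch below computes a Pearson correlation between centrality in one year and perplexity in the next, restricted to users present in both years. The function names and toy numbers are hypothetical and are not drawn from the corpus.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(centrality, perplexity, year):
    """Correlate centrality in `year` with perplexity in `year + 1`."""
    # Only users observed in both years can contribute a data point
    users = sorted(set(centrality[year]) & set(perplexity[year + 1]))
    xs = [centrality[year][u] for u in users]
    ys = [perplexity[year + 1][u] for u in users]
    return pearson(xs, ys)


# Toy data (illustrative only): user 'd' appears in 2007 but not 2006
toy_centrality = {2006: {'a': 0.1, 'b': 0.2, 'c': 0.3}}
toy_perplexity = {2007: {'a': 300.0, 'b': 200.0, 'c': 100.0, 'd': 50.0}}
r = lagged_correlation(toy_centrality, toy_perplexity, 2006)
```

In the toy data perplexity falls linearly as centrality rises, so the coefficient is close to -1; the correlations in the table are far weaker.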