## Chapter 21 Network Analysis

Many interesting data problems can be fruitfully thought of in terms of **networks** consisting of **nodes** of some type + the **edges** that join them.
Ex: FB friends form the nodes of a network whose edges = friendship relations, or the Web itself, w/ each web page a node, + each hyperlink from 1 page to another an edge.

FB friendship = mutual, so in this case, edges = **undirected**. Hyperlinks are *not*, so those edges = **directed**.

### Betweenness Centrality

Ch1: Computed the key connectors in the network by counting # of friends each user had. Now we have enough machinery to look @ other approaches. Recall that the network comprised users and friendships:

In [1]:
users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" }
]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

We also added friend lists to each user `dict`:

In [3]:
for user in users:
    user["friends"] = []
    
for i,j in friendships:
    # add i as friend of j and vice versa
    users[i]["friends"].append(users[j])
    users[j]["friends"].append(users[i])

When we left off, we were dissatisfied w/ our notion of **degree centrality**, which didn’t really agree w/ intuition about the **key connectors** of the network.

An alternative metric = **betweenness centrality**, which ID's people who are frequently on the shortest paths between pairs of other people. In particular, the betweenness centrality of node `i` is computed by adding up, for every other pair of nodes `j` and `k`, the proportion of shortest paths between node `j` and node `k` that pass through `i`.

That is, to figure out Thor’s betweenness centrality, compute all the shortest
paths between all pairs of people who aren’t Thor + then count how many of those shortest paths pass through Thor. 

* Ex: Only shortest path between Chi (id 3) and Clive (id 5) passes through Thor, while neither of the 2 shortest paths between Hero (id 0) and Chi (id 3) does.
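
Written as a formula, the definition above might look like the following. The symbols are assumptions introduced here, not notation from these notes: $\sigma_{jk}$ = # of shortest paths between `j` and `k`, and $\sigma_{jk}(i)$ = # of those that pass through `i`. The sum runs over unordered pairs; counting ordered pairs instead would simply double every score.

$$
c_B(i) \;=\; \sum_{\substack{j < k \\ j \neq i,\; k \neq i}} \frac{\sigma_{jk}(i)}{\sigma_{jk}}
$$

For Thor, the pair (Chi, Clive) contributes $1/1 = 1$, while the pair (Hero, Chi) contributes $0/2 = 0$.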

So, as a 1st step, figure out the shortest paths between all pairs of people. There are some pretty sophisticated algorithms for doing so efficiently, but (as is almost always the case) we will use a less efficient, easier-to-understand algorithm.

This algorithm = an implementation of **breadth-first search**:

1. Goal = a function that takes a `from_user` + finds all shortest paths to every other user.
2. Represent a path = list of user IDs. Since every path starts at `from_user`, don’t include that ID in the list (length of list representing the path will be the length of the path itself).
3. Maintain a dictionary `shortest_paths_to` where keys = user IDs + values = lists of paths that end at the user w/ the specified ID. If there is a unique shortest path, list will just contain that 1 path. If there are multiple shortest paths, list will contain all of them.
4. Also maintain a queue `frontier` that contains the users we want to explore in the order we want to explore them, stored as pairs `(prev_user, user)` so we know how we got to each one.
5. Initialize the queue w/ all neighbors of `from_user`. (**Queues** = data structures optimized for “add to the end” + “remove from the front” operations; in Python they are implemented as `collections.deque`, which is actually a double-ended queue.)
6. Explore the graph, + whenever we find new neighbors we don’t already know shortest paths to, add them to the end of the queue to explore later, w/ current user as `prev_user`.
7. When we take a user off the queue + we’ve never encountered that user before, we’ve definitely found 1+ shortest paths to him — each of the shortest paths to `prev_user` w/ 1 extra step added.
8. When we take a user off the queue + have encountered that user before, either we’ve found another shortest path (in which case we should add it) or we’ve found a longer path (in which case we shouldn’t).
9. When no more users are left on the queue, we’ve explored the whole graph (or, at least, the parts that are reachable from the starting user) + we’re done
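
The steps above can be sketched roughly as follows. This is a minimal version under some assumptions: the function name `shortest_paths_from` and the `neighbors` dict of adjacency lists (used here instead of the `"friends"` lists on each user `dict`) are choices made for this sketch so the block runs on its own.

```python
from collections import deque

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# adjacency lists keyed by user id (assumed stand-in for user["friends"])
neighbors = {}
for i, j in friendships:
    neighbors.setdefault(i, []).append(j)
    neighbors.setdefault(j, []).append(i)

def shortest_paths_from(from_user):
    """Return {user_id: [every shortest path from from_user to user_id]}.
    Each path is a list of ids that leaves out from_user itself."""
    shortest_paths_to = {from_user: [[]]}   # one empty path to ourselves
    frontier = deque((from_user, friend) for friend in neighbors[from_user])

    while frontier:
        prev_user, user = frontier.popleft()

        # each shortest path to prev_user, extended by one step to user
        new_paths = [path + [user] for path in shortest_paths_to[prev_user]]

        old_paths = shortest_paths_to.get(user, [])
        min_len = len(old_paths[0]) if old_paths else float("inf")

        # keep only paths that are not longer than the best known and not duplicates
        new_paths = [p for p in new_paths
                     if len(p) <= min_len and p not in old_paths]
        shortest_paths_to[user] = old_paths + new_paths

        # queue up neighbors we have not reached yet
        frontier.extend((user, friend) for friend in neighbors[user]
                        if friend not in shortest_paths_to)

    return shortest_paths_to

print(shortest_paths_from(3)[5])   # [[4, 5]]
print(shortest_paths_from(0)[3])   # [[1, 3], [2, 3]]
```

The two printed results match the example: the only shortest path from Chi (3) to Clive (5) goes through Thor (4), and neither shortest path from Hero (0) to Chi (3) does.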

In [None]:
import requests
import re
from bs4 import BeautifulSoup

def fix_unicode(text):
    return text.replace(u"\u2019","'")

url = "http://radar.oreilly.com/2010/06/what-is-data-science.html"
html = requests.get(url).text
soup = BeautifulSoup(html,"html5lib")

content = soup.find("div","article-body") # find entry content div
regex = r"[\w]+|[\.]" # match either a word or period
print(content)

In [None]:
document = []

for paragraph in content("p"):
    words = re.findall(regex,fix_unicode(paragraph.text))
    document.extend(words)
    
print(document)

Certainly could (+ likely should) clean this data further. There is still some amount of extraneous text in the doc (ex: “Section”), we’ve split on midsentence periods (ex: “Web 2.0”), + there are a handful of captions + lists sprinkled throughout. Having said that, we’ll work w/ the doc as it is.

Now that we have the text as a sequence of words, can model language in the
following way: 

* given some starting word, look @ all words that follow it in the source doc 
* randomly choose 1 of these to be the next word
* repeat process until we get to a period = signifies end of sentence.

This = a **bigram model** = is determined completely by frequencies of **bigrams (word pairs)** in the original data. For a starting word, just pick randomly from words that follow a period.

To start, precompute the possible word transitions (recall `zip` stops when any of its inputs is done, so `zip(document, document[1:])` gives us precisely the pairs of consecutive elements of `document`):

In [None]:
from collections import defaultdict

bigrams = zip(document,document[1:])
transitions = defaultdict(list)
for prev,current in bigrams:
    transitions[prev].append(current)
    
#print(bigrams,"\n")
#print(transitions.keys())

Now generate sentences

In [None]:
import random

def generate_using_bigrams():
    current = "." # next word starts a sentence
    result = []
    while True:
        next_word_candidates = transitions[current] # bigrams(current, _)
        current = random.choice(next_word_candidates) # choose one randomly
        result.append(current) # append to results
        if current == ".": return " ".join(result) # if ".", we're done
        
print(generate_using_bigrams())

The sentences produced = gibberish, but the kind of gibberish you might put on a website if trying to sound data-science-y.

* Ex: If you may know which are you want to data sort the data feeds web friend someone on trending topics as the data in Hadoop is the data science requires a book demonstrates why visualizations are but we do massive correlations across many commercial disk drives in Python language and creates more tractable form making connections then use and uses it to solve a data.

We can make sentences less gibberishy by looking at **trigrams**, triplets of consecutive words. (More generally, look at **n-grams** consisting of `n` consecutive words, but 3 will be plenty for now.) Now the transitions will depend on the previous *two* words:

In [None]:
trigrams = zip(document,document[1:],document[2:])
trigram_transitions = defaultdict(list)
starts = []

for prev, current, next_word in trigrams:
    if prev == ".": # if the previous "word" was a period
        starts.append(current) # then it's a start word
    trigram_transitions[(prev,current)].append(next_word)

Notice we now have to track starting words separately, but can generate sentences in pretty much the same way:

In [None]:
def generate_using_trigrams():
    current = random.choice(starts) # choose random word from list above
    prev = "." # precede this word with a period
    result = [current]
    
    while True:
        next_word_candidates = trigram_transitions[(prev,current)]
        next_word = random.choice(next_word_candidates)
        
        prev, current = current, next_word
        result.append(current)
        
        if current == ".":
            return " ".join(result)
        
generate_using_trigrams()        

This produces better sentences like:

* In hindsight MapReduce seems like an epidemic and if so does that give us new insights into how economies work That’s not a question we could even have asked a few years there has been instrumented. 

Of course, they sound better b/c at each step, generation process has fewer choices, + at many steps only a single choice. This means you frequently generate sentences (or at least long phrases) that were seen verbatim in the original data. Having more data would help; it would also work better if you collected n-grams from multiple essays about data science.
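
To make the "fewer choices" point concrete, here is a small sketch that builds transitions for any `n`. The helper name `ngram_transitions` and the toy corpus are made up for illustration; they are not part of the scraped document.

```python
from collections import defaultdict

def ngram_transitions(tokens, n):
    """Map each (n-1)-word context (a tuple) to the list of words that follow it."""
    transitions = defaultdict(list)
    # zipping n staggered copies of the token list yields every n-gram
    for ngram in zip(*(tokens[i:] for i in range(n))):
        context, next_word = ngram[:-1], ngram[-1]
        transitions[context].append(next_word)
    return transitions

# toy corpus, invented for this example
toy = "data science is fun . data science is hard . big data is big .".split()

bigram = ngram_transitions(toy, 2)
trigram = ngram_transitions(toy, 3)

print(bigram[("is",)])              # ['fun', 'hard', 'big']
print(trigram[("science", "is")])   # ['fun', 'hard']
```

With one word of context, "is" has three possible followers; with two words of context, "science is" has only two, which is why longer contexts reproduce more of the source text verbatim.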

### Grammars

Different approach to modeling language = w/ **grammars** = rules for generating acceptable sentences.

In [None]:
grammar = {"_S" : ["_NP _VP"],
           "_NP" : ["_N",
                    "_A _NP _P _A _N"],
           "_VP" : ["_V",
                    "_V _NP"],
           "_N" : ["data science", "Python", "regression"],
           "_A" : ["big", "linear", "logistic"],
           "_P" : ["about", "near"],
           "_V" : ["learns", "trains", "tests", "is"]
}

Names starting w/ underscores = rules that need further expanding, + other names = terminals that don’t need further processing.

* i.e "_S" = “sentence” rule, which produces a "_NP" (“noun phrase”) rule followed by a "_VP" (“verb phrase”) rule.
* verb phrase rule can produce either the "_V" (“verb”) rule, or the verb rule followed by the noun phrase rule.
* Notice the "_NP" rule contains itself in 1 of its productions. 
    * Grammars can be recursive = allows even finite grammars like this to generate infinitely many different sentences.

How do we generate sentences from this grammar? = start w/ a list containing the sentence rule ["_S"], repeatedly expand each rule by replacing it w/ a randomly chosen one of its productions, + stop when we have a list consisting solely of terminals.

* Ex: 1 such progression might look like:


In [None]:
['_S']
['_NP','_VP']
['_N','_VP']
['Python','_VP']
['Python','_V','_NP']
['Python','trains','_NP']
['Python','trains','_A','_NP','_P','_A','_N']
['Python','trains','logistic','_NP','_P','_A','_N']
['Python','trains','logistic','_N','_P','_A','_N']
['Python','trains','logistic','data science','_P','_A','_N']
['Python','trains','logistic','data science','about','_A', '_N']
['Python','trains','logistic','data science','about','logistic','_N']
['Python','trains','logistic','data science','about','logistic','Python']

How do we implement this? == start by creating a simple helper function to ID terminals:

In [None]:
def is_terminal(token):
    return token[0] != "_"

Next, write a function to turn a list of tokens into a sentence. Look for the 1st non-terminal token; if we can’t find one, we have a completed sentence + we’re done. If we *do* find a non-terminal, randomly choose 1 of its productions. If that production = a terminal (i.e., a word), simply replace the token w/ it. Otherwise it’s a sequence of space-separated non-terminal tokens that we need to split + then splice into the current tokens. Either way, repeat the process on the new set of tokens. Putting it all together we get:

In [None]:
def expand(grammar,tokens):
    for i,token in enumerate(tokens):
        # skip over terminals
        if is_terminal(token):
            continue
        # if we get here, we have a non-terminal token, 
        #   so choose random replacements
        replacement = random.choice(grammar[token])
        
        if is_terminal(replacement):
            tokens[i] = replacement
        else:
            tokens = tokens[:i] + replacement.split() + tokens[(i+1):]
        
        # now recursively call expand on new list of tokens
        return expand(grammar,tokens)

    # if we get here, we have all terminals + we're done
    return tokens

And now we can start generating sentences:

In [None]:
def generate_sentences(grammar):
    return expand(grammar, ["_S"])

generate_sentences(grammar)

Try changing the grammar — add more words, more rules, your own parts of
speech — until you’re ready to generate as many web pages as the company needs.

Grammars = actually more interesting when used in the *other direction*: Given a sentence, use a grammar to parse the sentence which allows us to ID
subjects + verbs + helps us make sense of the sentence. Using data science to generate text = a neat trick; using it to understand text is more magical. 

### An Aside: Gibbs Sampling

Generating samples from some distributions is easy. We can get uniform random variables with:

In [None]:
random.random()

and normal random variables w/:

In [None]:
import sys

sys.path.insert(0, './../../../00_DataScience/DSFromScratch/code')

from probability import inverse_normal_cdf

inverse_normal_cdf(random.random())

But some distributions = harder to sample from. **Gibbs sampling** = technique for generating samples from multidimensional distributions when we only know some of the conditional distributions.

Ex: Rolling 2 dice + let x = value of 1st die + let y = sum of dice, + imagine you wanted to generate lots of `(x, y)` pairs. In this case it’s easy to
generate the samples directly:


In [None]:
def roll_a_die():
    return random.choice([1,2,3,4,5,6])

def direct_sample():
    d1 = roll_a_die()
    d2 = roll_a_die()
    return d1, d1 + d2

But imagine you only knew the *conditional* distributions = **distribution of y conditional on x is easy == if you know value of x, y = equally likely to be x + 1, x + 2, x + 3, x + 4, x + 5, or x + 6:**

In [None]:
def random_y_given_x(x):
    """equally likely to be x + 1, x + 2, ... , x + 6"""
    return x + roll_a_die()

Other direction = more complicated. If we know y = 2, necessarily x = 1 (since only way 2 dice sum to 2 is both = 1). If we know y = 3, x = equally likely to be 1 or 2. Similarly, if y is 11, x is either 5 or 6:

In [None]:
def random_x_given_y(y):
    if y <= 7:
        # if total is 7 or less, first die is equally likely to be 1, 2, ..., total - 1
        return random.randrange(1,y)
    else:
        # if total is 7 or more, first die is equally likely to be total - 6, total - 5, ..., 6
        return random.randrange(y-6,7)

**Gibbs sampling works by starting w/ any (valid) value for x + y + then repeatedly alternate-replacing x w/ a random value, picked conditional on y + replacing y w/ a random value picked conditional on x. After a # of iterations, resulting values of x + y will represent a sample from the unconditional joint distribution:**

In [None]:
def gibbs_sample(num_iters=100):
    x,y = 1,2 # arbitrary
    for _ in range(num_iters):
        x = random_x_given_y(y)
        y = random_y_given_x(x)
    return x,y

def compare_distributions(num_samples=100):
    counts = defaultdict(lambda: [0,0])
    for _ in range(num_samples):
        counts[gibbs_sample()][0] += 1
        counts[direct_sample()][1] += 1
    return counts
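
`compare_distributions` is defined above but never called. As a rough sanity check, the sketch below repeats the sampling functions so it runs on its own and then tabulates both samplers side by side. The sample size (2000) and the shortened chain length (20 iterations) are arbitrary choices for this example.

```python
import random
from collections import Counter

def roll_a_die():
    return random.choice([1, 2, 3, 4, 5, 6])

def direct_sample():
    d1 = roll_a_die()
    return d1, d1 + roll_a_die()

def random_y_given_x(x):
    return x + roll_a_die()

def random_x_given_y(y):
    if y <= 7:
        return random.randrange(1, y)     # 1, ..., y - 1
    return random.randrange(y - 6, 7)     # y - 6, ..., 6

def gibbs_sample(num_iters=100):
    x, y = 1, 2                           # arbitrary valid starting point
    for _ in range(num_iters):
        x = random_x_given_y(y)
        y = random_y_given_x(x)
    return x, y

random.seed(0)
n = 2000
gibbs = Counter(gibbs_sample(num_iters=20) for _ in range(n))
direct = Counter(direct_sample() for _ in range(n))

# each of the 36 possible (x, y) pairs has probability 1/36 (about 0.028),
# so both columns should hover around that value
for pair in sorted(direct)[:5]:
    print(pair, gibbs[pair] / n, direct[pair] / n)
```

The two columns will not match exactly since both are random samples, but neither should show a systematic bias toward particular pairs.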

### Topic Modeling

A more sophisticated approach to understanding users’ interests for You Should Know recommenders might try to ID topics that underlie those interests. A technique called **Latent Dirichlet Allocation (LDA)** = commonly used to ID common topics in a set of docs. We’ll apply it to docs that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier in that *it assumes a probabilistic model for docs*. Glossing over the hairier mathematical details, the LDA model assumes that:

* There is some fixed number `K` of topics.
* There is a random variable that assigns each topic an associated probability distribution over the words (think of this distribution as the probability of seeing word `w` given a topic `k`).
* There is another random variable that assigns each doc a probability distribution over topics (this distribution = the mixture of topics in doc `d`).
* Each word in a doc was generated by 1st randomly picking a topic (from doc's distribution of topics) + then randomly picking a word (from chosen topic’s distribution of words).
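
The generative story in the last bullet can be sketched as below. All numbers here are hypothetical (2 topics, a 4-word vocabulary, hand-picked probabilities), chosen only to show the two-step draw; they are not estimates from the data.

```python
import random

# hypothetical setup: K = 2 topics over a 4-word vocabulary
vocabulary = ["Hadoop", "Java", "statistics", "probability"]
topic_word_probs = [
    [0.50, 0.40, 0.05, 0.05],   # topic 0 mostly produces engineering words
    [0.05, 0.05, 0.50, 0.40],   # topic 1 mostly produces statistics words
]
doc_topic_probs = [0.8, 0.2]    # this doc is 80% topic 0, 20% topic 1

def generate_document(length):
    """Return a list of (topic, word) pairs drawn the way LDA assumes."""
    pairs = []
    for _ in range(length):
        # step 1: pick a topic from the doc's distribution over topics
        topic = random.choices(range(len(doc_topic_probs)),
                               weights=doc_topic_probs)[0]
        # step 2: pick a word from that topic's distribution over words
        word = random.choices(vocabulary,
                              weights=topic_word_probs[topic])[0]
        pairs.append((topic, word))
    return pairs

random.seed(1)
print(generate_document(5))
```

Topic modeling runs this story in reverse: only the words are observed, and the topic behind each word has to be inferred.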

In particular, we have a collection of docs, each of which = a list of words, + we have a corresponding collection of `doc_topics` that assigns a topic (here a # between 0 and K – 1) to each word in each doc, so that the 5th word in the 4th doc = `docs[3][4]` + the topic from which that word was chosen = `doc_topics[3][4]`.

This very explicitly defines each doc’s distribution over topics, + implicitly defines each topic’s distribution over words. We can estimate the likelihood that topic 1 produces a certain word by comparing how many times topic 1 produces said word w/ how many times topic 1 produces *any* word. (Similarly, the Ch13 spam filter compared how many times each word appeared in `spams` w/ the total # of words appearing in `spams`.)

Although these topics are just #'s, we can give them descriptive names by looking at the words on which they put the *heaviest weight*. We just have to somehow generate the `doc_topics`, which is where Gibbs sampling comes into play.

Start by assigning every word in every doc a topic completely at random, then go through each doc, 1 word at a time. For that word + doc, construct weights for each topic that depend on the (current) distribution of topics in that doc + the (current) distribution of words for that topic. Then use those weights to sample a new topic for that word. If we iterate this process many times, we end up w/ a joint sample from the topic-word distribution + the doc-topic distribution.

To start, need a function to randomly choose an index based on an arbitrary set of weights:

In [None]:
def sample_from(weights):
    """Returns i with p = weights[i]/sum(weights)"""
    total = sum(weights)
    rnd = total * random.random() # uniform random variable between 0 and total
    for i,w in enumerate(weights):
        rnd -= w   # return smallest i such that weights[0] + ... + weights[i] >= rnd
        if rnd <= 0: return i

Ex: If we give it weights = `[1, 1, 3]`, then 1/5 of the time it will return 0,
1/5 of the time it will return 1, + 3/5 of the time it will return 2.
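
A quick empirical check of that claim, with `sample_from` repeated so the block runs on its own (the seed and the 10,000 draws are arbitrary choices for this sketch):

```python
import random
from collections import Counter

def sample_from(weights):
    """Returns i with p = weights[i]/sum(weights)"""
    total = sum(weights)
    rnd = total * random.random()   # uniform between 0 and total
    for i, w in enumerate(weights):
        rnd -= w
        if rnd <= 0:
            return i

random.seed(0)
n = 10000
counts = Counter(sample_from([1, 1, 3]) for _ in range(n))

# observed frequencies should land near 0.2, 0.2 and 0.6
for i in range(3):
    print(i, counts[i] / n)
```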

Our docs = users’ interests, which look like:

In [None]:
docs = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

Try to find K = 4 topics. In order to calculate the sampling weights, must keep track of several counts. 1st, create the data structures for them:

In [None]:
from collections import Counter

K = 4

# how many times each topic is assigned to each doc:
# a list of Counters, one for each doc
doc_topic_counts = [Counter() for _ in docs]

# how many times each word is assigned to each topic:
# a list of Counters, one for each topic
topic_word_counts = [Counter() for _ in range(K)]

# total # of words assigned to each topic:
# a list of #'s, one for each topic
topic_counts = [0 for _ in range(K)]

# total # of words contained in each doc:
# a list of #'s, one for each doc
doc_lengths = [len(d) for d in docs]

# # of distinct words
distinct_words = set(word for doc in docs for word in doc)
W = len(distinct_words)

# # of docs
D = len(docs)

Once we populate these, we can find, for example, the # of words in `docs[3]` associated w/ topic 1 as `doc_topic_counts[3][1]` and the # of times "nlp" is associated w/ topic 2 as `topic_word_counts[2]['nlp']`.

Now to define the conditional probability functions, each with a smoothing term that ensures every topic has a non-zero chance of being chosen in any doc and that every word has a non-zero chance of being chosen for any topic:

In [None]:
def p_topic_given_document(topic,d,alpha=.1):
    """Returns fraction of words in document 'd' that are
    assigned to 'topic' (including some smoothing)"""
    return ((doc_topic_counts[d][topic] + alpha) / 
            (doc_lengths[d] + K*alpha))

def p_word_given_topic(word,topic,beta=.1):
    """Returns fraction of words assigned to 'topic'
    that equal 'word' (plus some smoothing)"""
    return ((topic_word_counts[topic][word] + beta) / 
           (topic_counts[topic] + W*beta))

# use the above to create weights for updating topics
def topic_weight(d,word,k):
    """Given a document and a word within it
    return the weight for the kth topic"""
    return p_word_given_topic(word,k)*p_topic_given_document(k,d)

def choose_new_topic(d,word):
    return sample_from(weights=[topic_weight(d,word,k) for k in range(K)])

There're solid mathematical reasons why `topic_weight` is defined the way it is, but details would lead us too far afield. Intuitive sense = given a word + its document, the **likelihood of any topic choice depends on both how likely that topic is for the document + how likely that word is for the topic.**
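
Reading the weight straight off the code above, it can be written as below. The count symbols are shorthand introduced here, not notation from these notes: $n_{k,w}$ = `topic_word_counts[k][w]`, $n_k$ = `topic_counts[k]`, $n_{d,k}$ = `doc_topic_counts[d][k]`, and $n_d$ = `doc_lengths[d]`.

$$
\text{weight}_k(d, w) \;=\;
\underbrace{\frac{n_{k,w} + \beta}{n_k + W\beta}}_{\texttt{p\_word\_given\_topic}}
\;\cdot\;
\underbrace{\frac{n_{d,k} + \alpha}{n_d + K\alpha}}_{\texttt{p\_topic\_given\_document}}
$$

`sample_from` then picks topic $k$ with probability $\text{weight}_k / \sum_{j=0}^{K-1} \text{weight}_j$, so the weights don't need to sum to 1.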

Now, we start by assigning every word to a random topic + populating counters appropriately:


In [None]:
random.seed(0)
doc_topics = [[random.randrange(K) for word in doc] for doc in docs]
#print(doc_topics)

In [None]:
for d in range(D):
    for word,topic in zip(docs[d],doc_topics[d]):
        doc_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1
print(doc_topic_counts,"\n",topic_word_counts,"\n",topic_counts)

**Goal = to get a joint sample of the topics-words distribution + the documents-topics distribution using a form of Gibbs sampling that uses the conditional probabilities defined previously:**

In [None]:
for iter in range(1000):
    for d in range(D):
        for i, (word,topic) in enumerate(zip(docs[d],doc_topics[d])):
            # remove word/topic from counts so that
            # it doesn't influence weights
            doc_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            doc_lengths[d] -= 1
            
            # choose new topic based on weights
            new_topic = choose_new_topic(d,word)
            doc_topics[d][i] = new_topic
            #print(new_topic)
            # now add back into counts, using the newly chosen topic
            doc_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            doc_lengths[d] += 1

The topics = #'s 0-3, and to get names for them must add them ourselves.

Can look @ the 5 most heavily weighted words for each topic:

In [None]:
for k,word_counts in enumerate(topic_word_counts):
    for word,count in word_counts.most_common(5):
        if count > 0: print(k,word,count)

Based on these, probably assign topic names:


In [None]:
topic_names = ['Machine Learning','Deep Learning','Languages','Big Data']

at which point we can see how the model assigns topics to each user’s interests:

In [None]:
# loop variable is named `counts` so it doesn't overwrite the global `topic_counts` list
for doc,counts in zip(docs,doc_topic_counts):
    print(doc)
    for topic,count in counts.most_common():
        if count > 0:
            print(topic_names[topic], count)
            


In [None]:
#for k, word_counts in enumerate(topic_word_counts):
#    for word, count in word_counts.most_common():
#        if count > 0: print(k, word, count)

topic_names = ["Big Data and programming languages",
               "databases",
               "machine learning",
               "statistics"]

# loop variable is named `counts` so it doesn't overwrite the global `topic_counts` list
for document, counts in zip(docs, doc_topic_counts):
    print(document)
    for topic, count in counts.most_common():
        if count > 0:
            print(topic_names[topic], count)
    print()

which gives:

    ['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']
    Big Data and programming languages 4 databases 3
    ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres']
    databases 5
    ['Python', 'scikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas']
    Python and statistics 5 machine learning 1

and so on. Given the “ands” we needed in some of our topic names, it’s possible we should use more topics, although most likely we don’t have enough data to successfully learn them.

### For Further Exploration

* Natural Language Toolkit is a popular (and pretty comprehensive) library of NLP tools for Python. It has its own entire book, which is available to read online.
* gensim is a Python library for topic modeling, which is a better bet than our from-scratch model.