#### Pair Problem

You are given documents as probability distributions over topics, and topics as probability distributions over words.

Implement a function `make_doc` that takes a document (as `topic_probs`) and a number of words. The function should randomly generate a document by choosing a topic for each word using the document's topic probabilities and then choosing a particular word using that topic's word probabilities. The function should return a string containing all the generated document's words.

For example:

```python
docs = [[0.98, 0.01, 0.01],
        [0.01, 0.98, 0.01],
        [0.01, 0.01, 0.98]]
topics = [[ 0.4,      0.4,   0.01,        0.01,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],
          [0.01,     0.01,    0.4,         0.4,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],
          [0.02,     0.02,   0.01,        0.01,     0.4,        0.4,
           0.02,      0.1,   0.01,        0.01]]
words =  ['cat', 'kitten',  'dog',     'puppy',  'deep', 'learning',
          'fur',  'image',  'GPU', 'asparagus']


def make_doc(topic_probs, n_words):
    raise NotImplementedError

for doc in docs:
    print(make_doc(topic_probs=doc, n_words=10))

#  Example output:
## cat learning kitten image kitten cat deep image cat kitten
## puppy puppy learning dog puppy dog dog puppy image dog
## deep learning deep image deep deep deep deep learning learning
```

Extension:

Update your `make_doc` function so that if `topic_probs` isn't specified, it will draw a random set of topic probabilities from a Dirichlet distribution.

In [3]:
import random

In [1]:
# Defining variables
docs = [[0.98, 0.01, 0.01],  # topic distribution within each document
        [0.01, 0.98, 0.01],
        [0.01, 0.01, 0.98]]

topics = [[ 0.4,      0.4,   0.01,        0.01,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],  # word distribution for each topic
          [0.01,     0.01,    0.4,         0.4,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],
          [0.02,     0.02,   0.01,        0.01,     0.4,        0.4,
           0.02,      0.1,   0.01,        0.01]]

words =  ['cat', 'kitten',  'dog',     'puppy',  'deep', 'learning',
          'fur',  'image',  'GPU', 'asparagus']

In [22]:
def topic_word_proba_join(topics, words):
    topic_words_proba_pair_list = list()
    for topic in topics:
        topic_words_proba_pair = list(zip(topic, words))
        topic_words_proba_pair_list.append(topic_words_proba_pair)
    return topic_words_proba_pair_list    

topic_word_proba_join(topics, words)

[[(0.4, 'cat'),
  (0.4, 'kitten'),
  (0.01, 'dog'),
  (0.01, 'puppy'),
  (0.01, 'deep'),
  (0.01, 'learning'),
  (0.1, 'fur'),
  (0.04, 'image'),
  (0.01, 'GPU'),
  (0.01, 'asparagus')],
 [(0.01, 'cat'),
  (0.01, 'kitten'),
  (0.4, 'dog'),
  (0.4, 'puppy'),
  (0.01, 'deep'),
  (0.01, 'learning'),
  (0.1, 'fur'),
  (0.04, 'image'),
  (0.01, 'GPU'),
  (0.01, 'asparagus')],
 [(0.02, 'cat'),
  (0.02, 'kitten'),
  (0.01, 'dog'),
  (0.01, 'puppy'),
  (0.4, 'deep'),
  (0.4, 'learning'),
  (0.02, 'fur'),
  (0.1, 'image'),
  (0.01, 'GPU'),
  (0.01, 'asparagus')]]

In [21]:
import numpy as np


def make_doc(topic_probs, n_words=10):
    """
    Randomly generate a document by:
    - choosing a topic for each word using the document's topic probabilities,
    - then choosing a particular WORD using that TOPIC'S word probabilities.

    The function should return a string containing all the generated document's words.
    """
    # First attempt (incomplete): draws a single topic index and stops there.
    randomly_chosen_topic = np.random.choice(len(topic_probs), p=topic_probs)
    

## Proper Solution

In [23]:
# Today's pair exercise reinforces what we learned last week

# We learned a few topic modelling techniques:
# LDA, LSA, NMF

In [24]:
# LDA
# How do we gain intuition about LDA?

# Imagine yourself as a writer. You want to write about a few topics - you won't write on all of them
# Internally, you have a whole range of vocabulary you could use.
# But if you're writing about cats and dogs, would you want to use words from a deep learning topic?

In [25]:
# In real life, humans see the WORDS and infer from them which topics are being written about

In [26]:
# In this pair, we are trying to do it in the opposite direction.
# We give you the distribution of topics, and you generate words from that distribution.

In [27]:
import numpy as np

# topic distribution within each document
docs = [[0.98, 0.01, 0.01],  # first row, first document (cats): each word has a 98% chance of coming from the first topic, 1% from the second, etc.
        [0.01, 0.98, 0.01],  # second row, second document (dogs and puppies)
        [0.01, 0.01, 0.98]]  # third row, third document (deep learning)

topics = [[ 0.4,      0.4,   0.01,        0.01,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],  # word distribution for each topic
          [0.01,     0.01,    0.4,         0.4,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],
          [0.02,     0.02,   0.01,        0.01,     0.4,        0.4,
           0.02,      0.1,   0.01,        0.01]]

words =  ['cat', 'kitten',  'dog',     'puppy',  'deep', 'learning',
          'fur',  'image',  'GPU', 'asparagus']

In [28]:
np.random.dirichlet(np.ones(len(topics)))

# A reminder of the distribution we are using:
# a Dirichlet draw gives one probability per class, and the probabilities sum to one.

# With every concentration parameter set to 1, no topic is favoured over another on average.
# Here we are just randomizing a document-topic probability distribution.

# However, we already defined these distributions above in docs.

array([0.36552452, 0.12355184, 0.51092365])

In [30]:
np.ones(len(topics))   # they are equally likely

array([1., 1., 1.])

In [31]:
np.random.dirichlet(np.ones(np.array([5,1,1])))

ValueError: object too deep for desired array
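
The call above fails because `np.ones` treats its argument as a shape: `np.ones(np.array([5,1,1]))` builds a 3-D array of shape `(5, 1, 1)`, while `np.random.dirichlet` expects a 1-D vector of concentration parameters. The sketch below assumes the intent was to favour the first topic, and passes the parameters directly. The seeded generator `rng` and the names `alpha` and `mean_probs` are illustrative and not part of the original notebook.

```python
import numpy as np

# Seeded only so the sketch is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

# Concentration parameters: a larger value pulls more probability mass
# towards that topic on average (assumed intent of the failing cell).
alpha = [5, 1, 1]

# One draw: a vector of three topic probabilities that sums to one.
sample = rng.dirichlet(alpha)
print(sample, sample.sum())

# Averaging many draws should land near alpha / sum(alpha).
mean_probs = rng.dirichlet(alpha, size=5000).mean(axis=0)
print(mean_probs)
```

Each individual draw can still give the first topic a modest share; only the average behaviour is skewed towards it.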

In [None]:
# LDA's generative story is the reverse of what we do as readers.
# As readers, we use the WORDS in a document to work out the TOPICS of that DOCUMENT/ARTICLE.

# LDA samples TOPICS from the DOCUMENT-TOPIC distribution,
# and then uses each topic's TOPIC-WORD distribution
# to get WORDS
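
The two-step sampling described in the comments above can be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the notebook's intended solution: the helper name `generate_document` and its `rng` parameter are hypothetical, and `topics` and `words` are copied from the exercise data.

```python
import numpy as np

# Toy data copied from the exercise.
topics = [[0.4, 0.4, 0.01, 0.01, 0.01, 0.01, 0.1, 0.04, 0.01, 0.01],
          [0.01, 0.01, 0.4, 0.4, 0.01, 0.01, 0.1, 0.04, 0.01, 0.01],
          [0.02, 0.02, 0.01, 0.01, 0.4, 0.4, 0.02, 0.1, 0.01, 0.01]]
words = ['cat', 'kitten', 'dog', 'puppy', 'deep', 'learning',
         'fur', 'image', 'GPU', 'asparagus']


def generate_document(topic_probs, n_words, rng=None):
    """Hypothetical helper: sample n_words words, one topic draw per word."""
    rng = np.random.default_rng() if rng is None else rng
    generated = []
    for _ in range(n_words):
        # Step 1: pick a topic index from the document-topic distribution.
        topic_idx = rng.choice(len(topics), p=topic_probs)
        # Step 2: pick a word from that topic's topic-word distribution.
        word = rng.choice(words, p=topics[topic_idx])
        generated.append(str(word))
    return ' '.join(generated)


# A document dominated by the first (cat) topic; output varies with the seed.
cat_doc = generate_document([0.98, 0.01, 0.01], n_words=10,
                            rng=np.random.default_rng(0))
print(cat_doc)
```

Drawing a fresh topic for every word, rather than once per document, is what lets an occasional off-topic word appear even in a strongly single-topic document.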

In [None]:
def make_doc(topic_probs=None, n_words=40, verbose=True):
    if topic_probs is None:
        topic_probs = np.random.dirichlet(np.ones(len(topics)))
    if verbose:
        print('topic_probs:', topic_probs)
    results = []
    for _ in range(n_words):
        