# Lab 2

The aims of the lab are to:
*   Introduce the spaCy python library for text processing 
*   Learn the details of how a dictionary is implemented
*   Practice creating a sparse one-hot encoding 
*   Implement Jaccard similarity
*   Learn to use SciKit-Learn to vectorize text with a bag-of-words representation
*   Use Cosine similarity to find similar documents in a collection
*   Perform KMeans clustering on Reddit posts
*   Practice performing basic evaluation of clustering



## Reddit Data Review

**Thread fields**
*   URL - reddit URL of the thread
*   title - title of the thread, as written by the first poster
*   is_self_post - True if the first post in the thread is a self-post (text addressed to the reddit community as opposed to an external link)
*   subreddit - the subreddit of the thread
*   posts - a list of all posts in the thread

**Post fields**
*   id - post ID, reddit ID of the current post
*   body - the text of the post
*   in_reply_to - parent ID, reddit ID of the parent post, or the post that the current post is in reply to
*   post_depth - the number of replies the current post is from the initial post
*   is_first_post - True if the current post is the initial post


Download the Reddit dataset.  

In [1]:
# The local location to store the reddit dataset.
local_file = "coarse_discourse_dump_reddit.json"

#!gsutil cp gs://textasdata/coarse_discourse_dump_reddit.json $local_file
  
# The ! performs a shell command to download the reddit dataset using wget.
#!wget -O  $local_file https://storage.googleapis.com/textasdata/coarse_discourse_dump_reddit.json


Load the JSON data into DataFrame with each post as a row.

In [2]:
# The reddit thread structure is nested with posts in a new content.
# This block reads the file as json and cates a new data frame.
import pandas as pd
import json

# A temporary variable to store the list of post content.
posts_tmp = list()

with open(local_file) as jsonfile:
  for i, line in enumerate(jsonfile):
    thread = json.loads(line)
    for post in thread['posts']:
      # Keep the thread title and subreddit with each post.
      posts_tmp.append((thread['subreddit'], thread['title'], thread['url'],
                        post['id'], post.get('author', ""), post.get('body', "")))
print(len(posts_tmp))

# Create the posts data frame.  
labels = ['subreddit', 'title', 'id', 'url', 'author', 'body']
post_frame = pd.DataFrame(posts_tmp, columns=labels)

110595


## Introduction to spaCy
[spaCy](https://spacy.io/) is an open-source software library for  Natural Language Processing, written in Python and Cython. The library is published under the MIT license and currently offers statistical models for English, German, Spanish, Portuguese, French, Italian, Dutch as well as tokenization for various other languages. In contrast to NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. SpaCy is widely used by many companies for text and NLP processing. 

We will also use it later in the course for more advanced NLP tasks. 

**Note:** SpaCy includes a variety of models. Below we are using the english web small.  See the full list of [models](https://spacy.io/usage/models).  In practice, better effectiveness can be obtained by using a larger model.  

In [3]:
#!python -m spacy download en

import spacy

import sys
print(sys.version)

# Version checks
import importlib
def version_check(libname, min_version):
    m = importlib.import_module(libname)
    print ("%s version %s is" % (libname, m.__version__))
    print ("OK" if m.__version__ >= min_version 
           else "out-of-date. Please upgrade!")
    
version_check("spacy", "2.0")

# Load the small english model. 
# Disable the advanced NLP features in the pipeline for efficiency.
nlp = spacy.load('en_core_web_sm', disable=['ner'])
print(nlp.pipeline)
print(nlp.pipe_names)
nlp.remove_pipe('tagger')
nlp.remove_pipe('parser')
# Verify they are empty.
print(nlp.pipeline)


3.7.2 (default, Jan 10 2019, 23:51:51) 
[GCC 8.2.1 20181127]
spacy version 2.0.16 is
OK
[('tagger', <spacy.pipeline.Tagger object at 0x7f0127724da0>), ('parser', <spacy.pipeline.DependencyParser object at 0x7f00f381e728>)]
['tagger', 'parser']
[]


### SpaCy Tokenization

Last week we used NLTK to tokenized and normalized text.  This week we moved to a more advanced library. 

Below is example code of processing one of the Reddit posts with spaCy.  In particular, the code below prints out a few of the properties of the [Token](https://spacy.io/api/token) class. This class exposes many useful properties of tokens.



In [4]:
doc = nlp(post_frame.loc[10]['body'])
for token in doc[:13]:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
    ))

I	0	I	False	False	X
love	2	love	False	False	xxxx
cheese	7	cheese	False	False	xxxx
cake	14	cake	False	False	xxxx
!	18	!	True	False	!
I	20	I	False	False	X
love	22	love	False	False	xxxx
both	27	both	False	False	xxxx
making	32	make	False	False	xxxx
and	39	and	False	False	xxx
eating	43	eat	False	False	xxxx
it	50	it	False	False	xx
,	52	,	True	False	,


Note that spaCy includes the raw token, it's position in the original string, the lemma (using its [lemmatizer](https://spacy.io/api/lemmatizer)), as well other properties of the token. 

- What do the 'shape' patterns capture?  How might this token attribute be useful? 

### Optional warm-up: tokenization and normalization with spaCy
Below are several optional warm-up tasks that review material from the previous lab, but implemented with spaCy.   These should be done pretty quickly (5-10 minutes). If they are taking longer, see the solutions and move on to next section of the lab. 

#### Optional task
- Create a function ``spacy_tokenize`` function that uses spaCy to tokenize a string. The function should:
 - Accept a string as input
 - Output a list of spaCy token objects
 
 You may click SHOW CODE below to see the answer. 

In [5]:
#@title
def spacy_tokenize(string):
  tokens = list()
  doc = nlp(string)
  for token in doc:
    tokens.append(token)
  return tokens

Below we  apply the ``spacy_tokenize`` function to the ``body`` field of the posts in the ``post_frame`` DataFrame. The results are flattened into a ``flat_tokens`` variable that contains a single list of all tokens from all posts concatenated together. 

Note: Applying spaCy's tokenizer to all the posts will take a couple minutes. 

In [6]:

# This tokenizes the body posts and creates vector of tokens for each post.
# Note: This selections the body column from the posts only. 
all_posts_tokenized = post_frame.body.apply(spacy_tokenize)

import itertools
# A single variable with the (flattened) tokens from all posts.
flat_tokens = list(itertools.chain.from_iterable(all_posts_tokenized))

#### Optional task

- Inspect some of the tokenization to verify that it worked correctly 
- Print out the 20 most frequent (raw) tokens in the collection from the ``flat_tokens`` variable.
 - Hint: Recall the ``Counter`` object from the last lab. 
 - *Tip*: There are multiple ways to do this. Consider using a list comprehension to extract the  text from the tokens.

In [7]:
#@title
import collections
raw_words = [t.text for t in flat_tokens]
raw_count = collections.Counter(raw_words)
raw_count.most_common(20)

[('.', 244379),
 (',', 175780),
 ('the', 162024),
 ('I', 137844),
 ('to', 122575),
 ('a', 110136),
 ('and', 97918),
 ('of', 72791),
 ('it', 70102),
 ('you', 68689),
 ('is', 60717),
 ('\n\n', 60224),
 ('that', 57067),
 ('in', 53429),
 ('for', 45987),
 ("'s", 39097),
 ('-', 38366),
 ("n't", 38080),
 ('*', 36234),
 (')', 36016)]

#### Optional task:
Create a ``normalize`` function that normalizes raw text into a canonical form. The function should:
 - Take a list of spaCy token objects as input
 - Output a list of normalized strings
 - The normalization should only keep tokens consisting of alphanumeric characters. 
 - Normalization should use Spacy's lemma property from the token. 
 - The output should be lowerecased and trimmed of any extra whitespace.
 - One special case to handle is when Spacy's lemma is "-PRON-" , instead preserve the lowercased text token.
 
 
 You will need this code below. You may see show the answer below if you get stuck.  Click SHOW CODE to see it.

In [8]:
#@title
def normalize(tokens):
  normalized = list()
  for token in tokens:
    if (token.is_alpha):
      lemma = token.lemma_.lower().strip() if token.lemma_ != "-PRON-" else token.lower_
      normalized.append(lemma)
  return normalized

The code below runs the ``normalize`` function on the ``flat_tokens`` and stores it in ``normalized_tokens``. We will use these for our vocabulary and processing.

In [9]:
normalized_tokens = normalize(flat_tokens)

## One-hot encoding

We will now implement a one-hot encoding text representation using a dictionary.

Below is a skeleton class that implements a dictionary.  Recall from Lecture 1 that a dictionary allows us to translate a series of tokens to integer values (and back).

#### Your task
- The ``SimpleDictionary`` skeleton below is incomplete, fill in the missing elements. Specifically:  
 - Complete the ``_init_`` constructor to initialize the member variables appropriately
 - Implement the ``tokens_to_ids`` function that maps strings to integer values

In [10]:

"""class SimpleDictionary(object):
  
  # Special UNK token for unseen tokens
  UNK_TOKEN = "<unk>"

  def __init__(self, tokens, size=None):
    
    # All unigrams with their counts
    self.unigram_counts = 
    
    # The total size of the collection in tokens
    self.collection_size = 
    
    # The number of unique unigrams
    self.num_unigrams = 
    
    # Set of most frequent words (limited to top K by size if defined)
    # These are in descending order of collection frequency.
    # Remember to leave space for "<unk>" tokens.
    # Where should it go in the ordering? Why?
    self.vocab = 

    # Dictionary that assigns an id to each token, by frequency.
    self.id_to_token = 
    
    # Dictionary that assigns a token to id 
    self.token_to_id = 
    
    self.size = len(self.id_to_token)
    if size is not None:
        assert(self.size <= size)

    # For convenience keep a set of unique words.
    self.tokenset = set(iter(self.token_to_id.keys()))

    # Store special IDs for convenience
    self.UNK_ID = self.token_to_id[self.UNK_TOKEN]

  # Given a sequence of ids, return a sequence of corresponding tokens.
  def ids_to_tokens(self, ids):
    return [self.id_to_token[i] for i in ids]
  
  # Given an input sequence of tokens, return a sequence of token id.
  def tokens_to_ids(self, tokens):
    # YOUR CODE HERE"""

'class SimpleDictionary(object):\n  \n  # Special UNK token for unseen tokens\n  UNK_TOKEN = "<unk>"\n\n  def __init__(self, tokens, size=None):\n    \n    # All unigrams with their counts\n    self.unigram_counts = \n    \n    # The total size of the collection in tokens\n    self.collection_size = \n    \n    # The number of unique unigrams\n    self.num_unigrams = \n    \n    # Set of most frequent words (limited to top K by size if defined)\n    # These are in descending order of collection frequency.\n    # Remember to leave space for "<unk>" tokens.\n    # Where should it go in the ordering? Why?\n    self.vocab = \n\n    # Dictionary that assigns an id to each token, by frequency.\n    self.id_to_token = \n    \n    # Dictionary that assigns a token to id \n    self.token_to_id = \n    \n    self.size = len(self.id_to_token)\n    if size is not None:\n        assert(self.size <= size)\n\n    # For convenience keep a set of unique words.\n    self.tokenset = set(iter(self.token_t

If you get stuck here, the solution is provided below, click SHOW CODE to see it.  

We need it to work for the later exercises. 

In [11]:
#@title

class SimpleDictionary(object):
  
  # Special UNK token
  UNK_TOKEN = "<unk>"

  def __init__(self, tokens, size=None):
    # All unigrams with their counts
    self.unigram_counts = collections.Counter(tokens)
    
    # The total size of the collection in tokens
    self.collection_size = len(tokens)
    
    # The number of unique unigrams
    self.num_unigrams = len(self.unigram_counts.keys())
    
    top_counts = self.unigram_counts.most_common(None if size is None else (size - 1))

    # Set of most frequent words (limited to top K by size if defined)
    # Remember to leave space for "<unk>" tokens.
    self.vocab = ([self.UNK_TOKEN] + [t for t,c in top_counts])

    # Dictionary that assigns an id to each token, by frequency.
    self.id_to_token = dict(enumerate(self.vocab))
    
    # Dictionary that assign a token to id 
    self.token_to_id = {v:k for k,v in iter(self.id_to_token.items())}
    
    self.size = len(self.id_to_token)
    if size is not None:
        assert(self.size <= size)

    # For convenience keep a set of unique words.
    self.tokenset = set(iter(self.token_to_id.keys()))

    # Store special IDs for convenience
    self.UNK_ID = self.token_to_id[self.UNK_TOKEN]

  # Given a sequence of ids, return a sequence of corresponding tokens.
  def ids_to_tokens(self, ids):
    return [self.id_to_token[i] for i in ids]
  
  # Given an input sequence of tokens, return a sequence of token IDs.
  def tokens_to_ids(self, tokens):
    return [self.token_to_id.get(t, self.UNK_ID) for t in tokens]

Run the dictionary on the ``normalized_tokens`` that contains all of the tokens in the collection.  In Sci-kit Learn this is called "fitting", creating a vocabulary from a fixed collection of text. 

In [12]:
dictionary = SimpleDictionary(normalized_tokens)

#### Optional task

- Use the ``dictionary`` to print out properties of the text collection. 

 - Print out the total number of tokens (N)
 - Print out the size of the vocabulary  (V)
 - Print out the top 20 most frequent unigrams with three values: token, collection frequency, percentage of collection tokens.

In [13]:
#@title
print("Collection size: " + "{0}".format(dictionary.collection_size))
print("Vocabulary size: " + "{0}".format(dictionary.size))

for (word, count) in dictionary.unigram_counts.most_common(20):
  print("{0}\t{1}\t{2}".format(word, count, 100 * count / dictionary.collection_size))

Collection size: 4448955
Vocabulary size: 72344
the	175938	3.9545915838663235
be	167894	3.77378507986707
a	149415	3.3584291142526728
i	146584	3.2947961937129056
to	124878	2.8069063409272514
and	102057	2.2939544230049527
it	83701	1.8813631515715488
you	78023	1.7537376754766008
of	73378	1.6493311350643016
that	66063	1.484910501454836
in	56212	1.2634877179022939
have	56019	1.2591496205288657
for	48242	1.0843445258493287
do	48228	1.0840298452108417
on	34400	0.7732152831395238
but	34350	0.7720914237163559
with	33157	0.7452761378795695
this	32615	0.7330935017324293
can	31966	0.7185058064197098
my	29458	0.6621330177536073


The most frequent word, *the*, accounts for approximately 4% of all tokens.  The top 10 most frequent words account for over 25% of all word occurrences.  Recall [Zipf's law](https://simple.wikipedia.org/wiki/Zipf%27s_law) from lecture 2 and the power law distribution of text data. A few number of terms account for a large fraction of occurrences, but many words occur rarely.  You might consider, how many words occur just once?  These are all taking up space in the vocabulary.  As the collection size increases, we may prune the size of the dictionary.   

### From tokens to IDs and back again
Below are some examples of using the dictionary to map tokens to IDs in our vocabulary (and vice versa). Consider trying some of your own words to experiment with what happens here. 

In [14]:
# Pick a word from the dictionary
print(dictionary.tokens_to_ids(["like"]))

# What's the value of a made up word? 
print(dictionary.tokens_to_ids(["likemymadeupword"]))

# For fun, let's print out a couple random words from the vocab.
# Feel free to explore the vocabulary.
import random as rand
print(dictionary.ids_to_tokens([2]))
print(dictionary.ids_to_tokens([rand.randint(0, dictionary.size-1)]))

[29]
[0]
['be']
['gazetter']


- What is the value of the second word? Why?

SpaCy also provides access to a [vocabulary](https://spacy.io/api/vocab) object. It contains the normalized types, called Lexemes.  
 - What does spaCy use for its dictionary implementation? 
  - *Hint*: The values are from it's [StringStore](https://spacy.io/api/stringstore) object. 
  - What is the rationale for this implementation? What does this mean for token IDs for different spaCy models? 
  - Look at the token class. What field has the vocab integer identifier? 
  
  You can access spaCy's vocab for the language you are using from the ``nlp.vocab`` variable.  The code below prints out the top n words of its vocab. 


In [15]:
n = 0
for w in nlp.vocab:
  if (n > 20): 
    break
  print(w.text)
  n+=1

convincing
故明
palm
Bamboo
Hundred
nonprofit
upholstery
Beltway
steakhouse
maureen
tentative
Jiayangduoji
encoded
Run
532,000
futures
bascially
Surrounding
midcapitalization
flow
Meetings


### Creating a one-hot encoding representation

#### Your task
- Create a function: ``one_hot_encoding`` that uses the ``SimpleDictionary`` to take a string and return a vector of integers:
 - Takes a string as input and applies tokenization and normalization using the provided ``tokenize_normalize`` function.
 - Output a sparse one-hot encoding of the text (sorted in ascending order) 
 - Test the code by running it on the post in row index 10 in the post_frame.



In [16]:
def tokenize_normalize(string):
  return normalize(spacy_tokenize(string))

In [17]:
def one_hot_encoding(s):
    return set(dictionary.tokens_to_ids(tokenize_normalize(s)))

The original input sequence for the post has 87 tokens. The one-hot encoding has 62 values (including all of tokens 1-7).  

Run the same on the string below.  (and then try favorite sentence)

In [18]:
cofveve = "What is covfefe and why am I seeing all over social media?"
one_hot_encoding(cofveve)


{0, 2, 4, 6, 34, 40, 57, 117, 124, 779, 813}

- What does the first 0 indicate?
- What token does it represent? 

Our model only knows about tokens it has previously seen (and maybe only a subset of tokens if the size of the dictionary is truncated to remove rare words). The rest are assigned to the UNK token. 

## Jaccard similarity betwen pieces of text

#### Your task 

- Create a function ``jaccard_similarity`` that takes two documents represented as sparse one-hot encodings and computes the jaccard similarity. 
- *Hint*: You might want to look at the operations on the built-in set datastructure 
(https://docs.python.org/3/tutorial/datastructures.html#sets)
- *Debugging Tip*: Consider printing the different elements of Jaccard. 

In [19]:
def jaccard_similarity(doc1, doc2):
    return (len(doc1 & doc2)) / (len(doc1 | doc2))

In [20]:
doc1 = one_hot_encoding("the cat jumped over the fox")
doc2 = one_hot_encoding("the brown fox jumped over the dog")

print(doc1)
print(doc2)
jaccard_similarity(doc1, doc2)

{1, 1292, 754, 2837, 117}
{1, 491, 1644, 754, 117, 2837}


0.5714285714285714

The jaccard similarity of the sequences should be approximately 0.5714.  You might also recall other set-based similarity measures we discussed. Implementing them with a one-hot encoding should be familiar now. 

### Section summary
In the previous section we: 
 - Created a dictionary object and used it to represent text. 
 - Created a one-hot encoding of text documents
 - Implemented the Jaccard similarity function 


We could also have extended our functions to create bag-of-words representations.  In the next section we'll explore how to do this with one of the most widely used machine learning libraries. 

## Vector representations with Scikit-Learn

Scikit-learn is a widely machine learning library that includes tools for performing operations on data: similarity computation, clustering, classification, and many others. We'll use Scikit-learn to create vector representations of text data.

We first extract out a few fields from the DataFrame and put it into a friendlier format. 

In [21]:
from itertools import islice

# Parallel arrays of the post keys and values.
post_vals = list()
post_keys = list()

# Limit the size of the data loaded
# Recall that there is approximately 110k posts in the dataset.
posts_to_load = 10000

for post in islice(post_frame.itertuples(index=True, name='Pandas'), posts_to_load):
    post_keys.append(getattr(post, 'id'))
    post_vals.append(getattr(post,'body'))

#### Your task
Create a document-term matrix with term frequency (counts) from the collection using the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). This will have the following steps:
 - Import the ``CountVectorizer`` and create an instance; assign it to a variable, ``tf_vectorizer``
 - All Scikit-Learn vectorizers accept a tokenizer as an optional parameter (it has a built-in tokenizer).  Pass in the ``tokenize_normalize`` function to ``CountVectorize (tokenizer=...)`` we defined above that performs both operations in a single step with spaCy. 
 - Call ``fit`` on the ``post_vals`` variable to learn a vocabulary
 - ``Transform`` the ``post_vals`` into a document-term matrix, assign it to a variable, tf_document_term_matrix.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(tokenizer=tokenize_normalize)
tf_vectorizer.fit(post_vals)
tf_document_term_matrix = tf_vectorizer.transform(post_vals)

What did this process do?
 - The ``fit()`` function tokenized the text collection and built a vocabulary/dictionary 
 - The ``transform()`` function created a document-term matrix with a bag-of-words representation (raw TF) word counts as the weighting.
 
 These steps are sometimes combined together with the single step called ``fit_transform``.  The constructor of ``CountVectorizer`` accepts parameters on how to control the dictionary created. 

Let's now apply the vectorizer on new unseen text.  We do this by calling ``transform()`` on the string data (technically an array of strings, each entry a document). 


In [24]:
mystring = 'The next town over recently got a brand new flagship Lidl – now my town is getting a brand new flagship Aldi. I fear war.'
response = tf_vectorizer.transform([mystring])
print (response)
print (tf_vectorizer.inverse_transform(response))

  (0, 8)	2
  (0, 1165)	1
  (0, 1598)	2
  (0, 4687)	1
  (0, 4858)	2
  (0, 5355)	2
  (0, 6287)	1
  (0, 8494)	1
  (0, 8672)	2
  (0, 8687)	1
  (0, 8847)	1
  (0, 9219)	1
  (0, 10568)	1
  (0, 12950)	1
  (0, 13241)	2
  (0, 14141)	1
[array(['a', 'be', 'brand', 'fear', 'flagship', 'get', 'i', 'my', 'new',
       'next', 'now', 'over', 'recently', 'the', 'town', 'war'],
      dtype='<U152')]


What is the output here? 
 - (0, 2) 2 has three parts --> (row, column) count; this shows a simple document-term matrix with count values.
 
 
 What happened? 
 - ``Transform`` applies the vectorizor, just like ``tokens_to_ids`` in our dictionary implementation.  
 - The result is a document-term matrix for the data passed to it. 
 
 The ``inverse_transform`` is just like ``ids_to_tokens`` applied to every non-UNK value (which are not invertable). 
 
 
 **Question:** We did not call ``fit``.  Why? What would this have done?

In the inverse output,  see that some of the words are not present (e.g. lidl, aldi, etc...). These are just ignored; calling ``transform()`` does not update our vocabulary and there is no UNK representation in this vocabulary.  Scikit-Learn ignores all UNK tokens that haven't been seen when using CountVectorizer. 

#### Your task 
- Create a TF-IDF vectorizer and apply it to the ``post_vals`` similar to what was done for CountVectorizer. 
 - Use the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
 - Set sublinear_tf = true to use the log scaling (as opposed to raw TF counts)
 - Create n-grams up to length 2
 - Limit the number of features (the size of the vocabulary) to 50000
 - Assign it to a ``ngram_vectorizer`` variable
 
 
Read the documentation of the vectorizer for details on the parameters as needed.  Warning: if this takes too long, or crashes then your configuration is wrong.


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
ngram_vectorizer = TfidfVectorizer(sublinear_tf=True, tokenizer=tokenize_normalize, ngram_range=(0,2), max_features=50000)
ngram_vectorizer.fit(post_vals)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50000, min_df=1,
        ngram_range=(0, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=True,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenize_normalize at 0x7f00b32f9ae8>,
        use_idf=True, vocabulary=None)

In [28]:
str = 'The next town over recently got a brand new flagship Lidl – now my town is getting a brand new flagship Aldi. I fear war.'
ngram_matrix = ngram_vectorizer.transform([str])
print (ngram_matrix)
print (ngram_vectorizer.inverse_transform(ngram_matrix))

  (0, 46655)	0.1529200505454279
  (0, 45176)	0.21900116111582324
  (0, 45174)	0.26513495403373744
  (0, 42302)	0.1514435625636529
  (0, 41407)	0.045425923871779426
  (0, 33698)	0.21043380970617825
  (0, 33681)	0.1510874409082641
  (0, 27804)	0.11383289200470212
  (0, 23089)	0.22468290543476732
  (0, 23060)	0.10079337738312472
  (0, 22491)	0.12951330275210632
  (0, 22426)	0.19329392423313688
  (0, 22126)	0.2013520512394951
  (0, 21781)	0.06962748951097884
  (0, 16447)	0.23200794786233606
  (0, 16325)	0.04550970901909656
  (0, 13682)	0.2049998156315833
  (0, 13681)	0.12639447503160012
  (0, 12174)	0.18014428319259548
  (0, 6375)	0.3804212278568931
  (0, 6373)	0.3180425346969255
  (0, 4968)	0.16481152250476264
  (0, 4617)	0.04360237765441838
  (0, 183)	0.39282360279061307
  (0, 36)	0.07736838173801036
  (0, 0)	0.10742221143788376
[array(['war', 'town be', 'town', 'the next', 'the', 'recently get',
       'recently', 'over', 'now my', 'now', 'next', 'new', 'my town',
       'my', 'i fear',

We're just scratching the surface of what's possible.  There are other types of representations as well.  For example, there is a [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) and others.
- What are the pros and cons of using HasingVectorizer vs TFIDFVectorizer?
- Question: What does the ``fit()`` function do on the HashingVectorizer?  Why is this? 

## Cosine similarity
We will now use sklearn's cosine similarity implementation to find similar posts.

In [29]:
from sklearn.metrics.pairwise import cosine_similarity
 
# A function that given an input query item returns the top-k most similar items 
# by their cosine similarity.
def find_similar(query_vector, td_matrix, top_k = 5):
    cosine_similarities = cosine_similarity(query_vector, td_matrix).flatten()
    related_doc_indices = cosine_similarities.argsort()[::-1]
    return [(index, cosine_similarities[index]) for index in related_doc_indices][0:top_k]

#### Your task
- Find the closes 10 posts to the string below. 
  - For each of the most similar posts, print out four values: the cosine similarity, index in the post data, its URL, and it's body content.  Hint: You might use the ``post_keys`` and ``post_values`` we are operating over. 
  - Repeat the exercise for both the CountVectorize as well as the TFIDFVectorizer with n-grams. 
   - Try a different string value and repeat . 


In [42]:
# An input string.
##str = 'The next town over recently got a brand new flagship Lidl – now my town is getting a brand new flagship Aldi. I fear war.'

# Or take the content of a random post.
import random as rand
post_index = rand.randint(0, len(post_vals))
string = post_vals[post_index]

matrix = ngram_vectorizer.transform([string])
similar = find_similar(matrix, ngram_vectorizer.transform(post_vals))
print(string)
print("-" * 10)
print(post_vals[similar[1][0]])

Thank you
----------
Thank you!


What do you see?  If you use a post, it should find itself and return a similarity score of 1.0.

Try experimenting with different vectorizers and matrix representations (count, tfidf, ngrams). 
- How do the most similar posts change?
- What do you think is most effective? Why?

## KMeans clustering

What's in the Reddit dataset? When we want to explore a dataset, one method is to apply clustering and then to inspect the clusters.


From the SKlearn documentation:
The [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields.  K-means is sometimes referred to as Lloyd’s algorithm. 

Recall from lecture: In basic terms, the algorithm has three steps. 
1) The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset X. 


After initialization, K-means consists of looping between the two other steps. The first assigns each sample to its nearest centroid. The second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids are computed and the algorithm repeats until the centroids do not change significantly.


#### Your task
 - Run K-Means on the document-term matrix from TFIDF vectorization. 
  - Start with ``k=8`` clusters
  - Use KMeans with 'random' cluster initialization
  - Add verbose=10 to show the clustering progress
  - Just like fitting a vocabulary, clustering is produced by calling ``fit``, but on the document-term matrix (not raw data), to fit clusters to do matrix. 
  - Assign the result to a ``kmeans`` variable. 
  
 
 
 If it is slow, you might need to use MiniBatchKMeans



In [44]:
from sklearn.cluster import KMeans
num_clusters = 8
kmeans = KMeans(n_clusters=num_cluster, init='random', verbose=10)
kmeans.fit(tf_document_term_matrix)

Initialization complete
Iteration  0, inertia 831853.000
Iteration  1, inertia 601731.473
Iteration  2, inertia 575068.552
Iteration  3, inertia 560676.705
Iteration  4, inertia 551549.977
Iteration  5, inertia 542724.980
Iteration  6, inertia 536735.640
Iteration  7, inertia 530771.770
Iteration  8, inertia 523503.240
Iteration  9, inertia 517367.704
Iteration 10, inertia 513723.236
Iteration 11, inertia 510880.340
Iteration 12, inertia 508810.143
Iteration 13, inertia 507414.907
Iteration 14, inertia 505985.749
Iteration 15, inertia 504978.234
Iteration 16, inertia 503898.227
Iteration 17, inertia 502667.265
Iteration 18, inertia 501784.532
Iteration 19, inertia 501171.386
Iteration 20, inertia 500437.983
Iteration 21, inertia 499206.950
Iteration 22, inertia 496985.285
Iteration 23, inertia 492703.207
Iteration 24, inertia 490638.710
Iteration 25, inertia 489087.922
Iteration 26, inertia 487916.624
Iteration 27, inertia 487595.276
Iteration 28, inertia 487101.877
Iteration 29, inert

Iteration 52, inertia 485121.870
Iteration 53, inertia 485115.583
Iteration 54, inertia 485115.266
Iteration 55, inertia 485114.998
Iteration 56, inertia 485114.177
Iteration 57, inertia 485111.940
Iteration 58, inertia 485110.945
Iteration 59, inertia 485110.721
Iteration 60, inertia 485110.329
Iteration 61, inertia 485110.220
Iteration 62, inertia 485110.133
Converged at iteration 62: center shift 0.000000e+00 within tolerance 5.531234e-07
Initialization complete
Iteration  0, inertia 778459.000
Iteration  1, inertia 565825.682
Iteration  2, inertia 548230.848
Iteration  3, inertia 540141.154
Iteration  4, inertia 532849.064
Iteration  5, inertia 525235.682
Iteration  6, inertia 519676.813
Iteration  7, inertia 515076.117
Iteration  8, inertia 511816.753
Iteration  9, inertia 509646.462
Iteration 10, inertia 507962.746
Iteration 11, inertia 506707.426
Iteration 12, inertia 505504.203
Iteration 13, inertia 504101.653
Iteration 14, inertia 503050.978
Iteration 15, inertia 502023.647
It

Iteration  4, inertia 521592.289
Iteration  5, inertia 514923.090
Iteration  6, inertia 510727.567
Iteration  7, inertia 508216.084
Iteration  8, inertia 506053.201
Iteration  9, inertia 504089.313
Iteration 10, inertia 502183.846
Iteration 11, inertia 500409.994
Iteration 12, inertia 499635.937
Iteration 13, inertia 498930.218
Iteration 14, inertia 498226.481
Iteration 15, inertia 497160.984
Iteration 16, inertia 493902.761
Iteration 17, inertia 492372.966
Iteration 18, inertia 490814.984
Iteration 19, inertia 489673.175
Iteration 20, inertia 488229.334
Iteration 21, inertia 487613.796
Iteration 22, inertia 487274.247
Iteration 23, inertia 487015.645
Iteration 24, inertia 486809.031
Iteration 25, inertia 486651.478
Iteration 26, inertia 485950.984
Iteration 27, inertia 485808.690
Iteration 28, inertia 485755.276
Iteration 29, inertia 485703.549
Iteration 30, inertia 485681.679
Iteration 31, inertia 485657.670
Iteration 32, inertia 485616.249
Iteration 33, inertia 485570.428
Iteration 

KMeans(algorithm='auto', copy_x=True, init='random', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=10)

We should now have a clustering with K clusters. Let's examine the centroids. Recall that they are not documents; they represent a typical example (average) document in the cluster.

We'll print out the top 10 terms from each of the centroids.

NOTE: ``vectorizer`` below should be replaced with the name of your TFIDFVectorizor used to produce the document-term matrix.

In [48]:
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tf_vectorizer.get_feature_names()
for i in range(num_clusters):
  print("Cluster %d:" % i)
  for ind in order_centroids[i, :10]:
    print(' %s' % terms[ind])
  print()

Cluster 0:
 be
 i
 the
 a
 you
 to
 it
 and
 that
 of

Cluster 1:
 i
 be
 a
 the
 to
 and
 it
 my
 that
 have

Cluster 2:
 the
 be
 to
 a
 i
 and
 of
 it
 that
 in

Cluster 3:
 the
 be
 and
 i
 to
 a
 of
 in
 have
 that

Cluster 4:
 i
 be
 a
 the
 to
 and
 of
 have
 it
 that

Cluster 5:
 the
 be
 a
 to
 and
 of
 you
 it
 i
 that

Cluster 6:
 the
 be
 a
 to
 and
 of
 i
 in
 you
 it

Cluster 7:
 i
 be
 a
 the
 to
 and
 it
 you
 of
 that



We can also look at the cluster assignments.  Each post is assigned to one cluster (partioning the document space). 

In [49]:
# Group the posts by their cluster labels.
clustering = collections.defaultdict(list)
for idx, label in enumerate(kmeans.labels_):
  clustering[label].append(idx)


In [50]:
for cluster, indices in clustering.items():
  print("\nCluster:", cluster, " Num posts: ", len(indices))
  cur_docs = 0
  for index in indices:
    if (cur_docs > 10):
      break
    post_contents = post_vals[index].replace('\n', '')
    print(index, post_keys[index], (post_contents[:75] + '..') if len(post_contents) > 75 else post_contents)
    cur_docs+=1


Cluster: 2  Num posts:  39
0 https://www.reddit.com/r/100movies365days/comments/1bx6qw/dtx120_87_nashville/ 4/7/13  7/27/12  http://www.imdb.com/title/tt0073440/referenceIt was only a..
65 https://www.reddit.com/r/2007scape/comments/1uz5j7/training_or_slayer/ Lawl I seriously hate this argument. Probably nobody will see this outside ..
566 https://www.reddit.com/r/911truth/comments/3as2wh/are_there_people_who_still_believe_911_unfolded/ The real problem is not that there are people who believe everything the go..
567 https://www.reddit.com/r/911truth/comments/3as2wh/are_there_people_who_still_believe_911_unfolded/ It's the farthest thing from "self evident", the officially sanctioned and ..
643 https://www.reddit.com/r/AMA/comments/4ee3q7/im_chrisacrosstheworld_japans_first_salaried/ 1. What are your dreams and aspiration.I have so many dreams!! I made a buc..
885 https://www.reddit.com/r/Advice/comments/4bfbxd/jealousy_issues/ My boyfriend and I work for the same company (yes I know 



*   Is the clustering useful to explore the data?
*   Can you label the clusters?
*   Are these 'good' clusters?



#### Optional task : Creating a better clustering

Create a better clustering than the one above.  
 - Use MiniBatchKMeans instead of kmeans (you might start with a batch size of 500 or so).
 - Try using kmeans++ instead of random
 - Vary the number of clusters (k)
 - Plot the sum of distances of samples to their closest cluster center (see the kmeans.inertia_ value) for different values of K. You may also try the silhouette score mentioned in lecture - sklearn.metrics.silhouette_score.  See also  https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
 - What is a 'good value'  of K for this data?
 


## Summary 

In this lab we covered a lot of practical ground:
- We introduced spaCy for text processing
- We implemented a dictionary and created a one-hot encoding of text
- Implemented the Jaccard similarity function
- Used Sci-kit Learn to vectorize text using bag-of-words representation with TF and TF-IDF weights
- Ran KMeans clustering on Reddit data and used it to explore data. 


Next time we'll perform more advanced language prediction tasks focusing on modeling sequences of text.

Please take the [Lab 2 Moodle Feedback quiz](https://moodle.gla.ac.uk/mod/feedback/view.php?id=1110531). 