In [None]:
version = "v2.2.033020"

# Assignment 4: Mining Sequence Data (Part I)

Welcome to Assignment 4, the last assignment in this course. In this assignment, we will explore the sequence representation of data. Much real-world data can be represented as sequences. Text is a typical and widely available example, and it is what we will use for this assignment. We will look at the Tweets with the colorful emojis again, but this time we are not going to filter the textual content. Several toolkits and packages have been developed to process text data. In this assignment, we will rely on the [NLTK](https://www.nltk.org/) package (Natural Language Toolkit).

In this assignment, you will: 
* Tokenize text data and extract n-grams and skip-grams.
* Implement the calculation of edit distance. 
* Find near-duplicate sequences of a given piece of text.  

First, let's load the dependencies and the Tweets.

In [None]:
import pandas as pd
import csv

import nltk

from collections import Counter

In [None]:
tweets = []
with open('assets/tweets.txt', encoding='utf-8') as f:
    for line in f:
        if len(line) > 0:
            tweets.append(line.strip())
tweets[:5]

To construct the sequence representation of a Tweet, we need to *tokenize* it into a sequence of language units, which in our case are words. For this assignment, we will use NLTK's `TweetTokenizer`. For some languages, however, tokenization can be challenging, for example languages such as Chinese that do not separate words with spaces.
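
As a rough illustration of why a dedicated tokenizer helps, the sketch below compares plain whitespace splitting with a toy regular-expression tokenizer. The `naive_tokenize` helper is hypothetical and is not how `TweetTokenizer` works internally; it only shows that punctuation and emojis need to be separated from neighboring words before they can be counted as tokens.

```python
import re

def naive_tokenize(text):
    """Hypothetical toy tokenizer: a run of word characters, or one non-space symbol."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Happy birthday!! 🎂🍰"

# Whitespace splitting keeps punctuation and emojis glued together.
whitespace_tokens = sample.split()

# The toy tokenizer separates each punctuation mark and each emoji.
regex_tokens = naive_tokenize(sample)

print(whitespace_tokens)
print(regex_tokens)
```

A real tokenizer also has to handle cases this sketch ignores, such as URLs, hashtags, and emoticons like `:-)`.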

In [None]:
tokenizer = nltk.tokenize.casual.TweetTokenizer()

In [None]:
tokenizer.tokenize(tweets[0])

For a quick sanity check, we can calculate the word frequency and see which ones are the most frequently used. With the `Counter` object, we can easily obtain the most frequently used words and their numbers of occurrences.

In [None]:
unigram_counter = Counter()
for tweet in tweets:
    unigram_list = tokenizer.tokenize(tweet)
    unigram_counter.update(unigram_list)
unigram_counter.most_common(20)

One common type of sequential pattern is the $n$-gram, a run of $n$ consecutive tokens. In particular, 1-grams are often called "unigrams", 2-grams "bigrams", and 3-grams "trigrams". Let's examine the bigrams of a Tweet:

In [None]:
bigram_list = list(nltk.bigrams(tokenizer.tokenize(tweets[0])))
bigram_list

Note that each bigram is represented as a tuple of two strings.
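
To make the sliding-window idea concrete, here is a minimal pure-Python sketch that builds $n$-grams from a token list. The `sliding_ngrams` helper is a hypothetical name used only for illustration; in the rest of the assignment we keep using the NLTK functions.

```python
def sliding_ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple (hypothetical helper)."""
    # zip stops at the shortest shifted copy, so only full windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["Happy", "Birthday", "to", "you"]

toy_bigrams = sliding_ngrams(tokens, 2)
toy_trigrams = sliding_ngrams(tokens, 3)

print(toy_bigrams)
print(toy_trigrams)
```

A sequence of $m$ tokens yields $m - n + 1$ $n$-grams when $m \ge n$, and none otherwise.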

### Exercise 1. Find the most frequent bigrams (20 pts)

Please complete the `freq_bigrams` function to find the `top_n` most frequent bigrams. Your function should return a list of `top_n` tuples. Each tuple should contain a bigram tuple (such as ('👍', '👏')) and its number of occurrences.

In [None]:
def freq_bigrams(tweets, top_n):
    bigram_counter = Counter()
    for tweet in tweets:
        # YOUR CODE HERE
        raise NotImplementedError()
    return bigram_counter.most_common(top_n)

In [None]:
freq_bigrams(tweets, 10)

In [None]:
# This code block tests whether the `freq_bigrams` function is implemented correctly.
# We hide some tests. Passing the displayed assertions does not guarantee full points.
answer = freq_bigrams(tweets, 10)
assert answer[0] == (('!', '!'), 1334)
assert answer[8] == (('🎂', '🍰'), 276)

answer2 = freq_bigrams(tweets[:5000], 5)
assert answer2[0] == (('!', '!'), 682)
assert answer2[2] == (('Happy', 'Birthday'), 290)


Similarly, you can generate trigrams by calling the `nltk.trigrams` API. 

### Exercise 2. Find the most frequent skipgrams (20 pts)

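In this exercise, we will compute another common type of sequential pattern: skip-grams. A $k$-skip-$n$-gram is a sequence of $n$ tokens taken in order from the text, where up to $k$ tokens in total may be skipped between them. Luckily, skip-grams are also supported by NLTK. You can find the documentation [here](https://tedboy.github.io/nlps/generated/generated/nltk.skipgrams.html).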
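
The sketch below enumerates $k$-skip-$n$-grams for a single toy token list in pure Python. It assumes the usual definition, in which each skip-gram starts at some token and picks its remaining $n-1$ tokens, in order, from the following $n-1+k$ tokens. The `simple_skipgrams` helper is hypothetical and is meant only to illustrate the definition; for the exercise itself, use the NLTK function.

```python
from itertools import combinations

def simple_skipgrams(tokens, n, k):
    """Enumerate k-skip-n-grams of one token list (hypothetical helper)."""
    result = []
    for i, head in enumerate(tokens):
        # Candidate followers: the next n - 1 + k tokens after the head.
        window = tokens[i + 1 : i + n + k]
        # combinations keeps the original order of the chosen tokens.
        for tail in combinations(window, n - 1):
            result.append((head,) + tail)
    return result

tokens = ["Happy", "Birthday", "to", "you"]

# 1-skip-bigrams: adjacent pairs plus pairs with one token skipped.
toy_skipgrams = simple_skipgrams(tokens, n=2, k=1)
print(toy_skipgrams)
```

With `k=0` no token may be skipped, so the result reduces to ordinary $n$-grams.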

Please implement the `freq_skipgrams` function to calculate the most frequently used $k$-skip-$n$-grams. Your function should return a list of `top_n` tuples. Each of the tuples should contain a $k$-skip-$n$-gram tuple (such as ('Happy', 'Birthday', '🎂')) and its number of occurrences. 

In [None]:
def freq_skipgrams(tweets, n, k, top_n):
    skipgram_counter = Counter()
    # YOUR CODE HERE
    raise NotImplementedError()

With this function, you can find the 10 most frequent 2-skip-trigrams with the following command.

In [None]:
freq_skipgrams(tweets, n=3, k=2, top_n=10)

In [None]:
# test
# This test cell contains hidden tests. Passing the displayed assertions does not guarantee full points.
answer = freq_skipgrams(tweets, n=3, k=2, top_n=10)
assert answer[0] == (('!', '!', '!'), 511)
assert answer[3] == (('🎂', '🎂', '🎂'), 282)


N-grams and skip-grams are commonly used in text mining, biological sequence mining, and behavior mining tasks. They serve directly as the basis for tasks like phrase detection, named-entity detection, and motif detection, and they are also used as features for building machine learning models in general. Have fun using them in your own data analysis!