# Tokenization, Stemming, n-gram and skip-gram generation

In this exercise, we'll use the emails in the <i>Twenty Newsgroups</i> dataset to perform simple tokenization, stemming, n-gram and skip-gram generation. More information on the <i>Twenty Newsgroups</i> dataset can be found on the [UCI website](http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html).

**Prerequisites:**
* Load and preprocess the data set using `00_prepare_data_set/01_preprocess_newsgroups_data.ipynb`

## Setup database connectivity

We'll reuse our module from the previous notebook (***`00_database_connectivity_setup.ipynb`***) to establish connectivity to the database.

In [1]:
%run '00_database_connectivity_setup.ipynb'
%matplotlib inline
from IPython.display import display
from IPython.display import HTML



Your connection object is ***`conn`***:
1. Queries: You can run your queries using ***```psql.read_sql("""<YOUR SQL>""", conn)```***.
2. Create/Delete/Updates: You can run these statements using ***```psql.execute("""<YOUR SQL>""", conn)```***, followed by a ***```conn.commit()```*** command to ensure your transaction is committed. Otherwise, your changes will be rolled back if you terminate your kernel.

If you created a new connection object (say to connect to a new cluster) as shown in the last section of `00_database_connectivity_setup.ipynb` notebook, use that connection object where needed.
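
The commit requirement described above can be illustrated without a database server. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for the notebook's PostgreSQL connection `conn`; the table name `demo_tokens` and the helper `count_rows` are hypothetical and exist only for this illustration.

```python
import sqlite3

# Stand-in connection: an in-memory SQLite database, not the notebook's `conn`.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table demo_tokens (token text)")
demo_conn.commit()


def count_rows(connection):
    """Return the number of rows currently visible in demo_tokens."""
    return connection.execute("select count(*) from demo_tokens").fetchone()[0]


# Insert without committing, then roll back: the row is discarded.
demo_conn.execute("insert into demo_tokens values ('insurgents')")
demo_conn.rollback()
rows_after_rollback = count_rows(demo_conn)

# Insert, commit, then roll back: the committed row survives.
demo_conn.execute("insert into demo_tokens values ('insurgents')")
demo_conn.commit()
demo_conn.rollback()
rows_after_commit = count_rows(demo_conn)

print(rows_after_rollback, rows_after_commit)
demo_conn.close()
```

A kernel shutdown before `conn.commit()` has a similar effect to the explicit `rollback()` in the first half of the sketch: the uncommitted statement leaves no trace.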

## Generate n-grams and skip-grams 

Skip-grams are a generalization of n-grams in which we relax the constraint that the tokens be adjacent to each other. For instance, consider the phrase `"insurgents killed in ongoing fighting"`. If we tokenize this phrase and extract <i>unigrams</i> from it, we get `['insurgents', 'killed', 'in', 'ongoing', 'fighting']`. If we extract <i>bigrams</i>, we get `[('insurgents','killed'), ('killed','in'), ('in','ongoing'), ('ongoing', 'fighting')]`. If we generalize this to `k-skip-n-grams`, say `2-skip-bigrams`, we get the following tokens:

`[('insurgents', 'killed'), 
    ('insurgents', 'in'), 
    ('insurgents', 'ongoing'), 
    ('killed', 'in'), 
    ('killed', 'ongoing'), 
    ('killed', 'fighting'), 
    ('in', 'ongoing'), 
    ('in', 'fighting'), 
    ('ongoing', 'fighting')]`
    
In many NLP tasks, models built on skip-grams have reached accuracy comparable to models trained on n-grams while using far fewer training samples. For more information on skip-grams, please refer to [A Closer Look at Skip-gram Modelling](http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf).
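
The unigram, bigram and 2-skip-bigram lists for the example phrase can be reproduced in a few lines of plain Python. This is only an illustrative sketch, separate from the database UDF; the helper names `ngrams` and `skip_bigrams` are hypothetical.

```python
def ngrams(tokens, n):
    """Contiguous n-grams: slide a window of width n over the tokens."""
    return list(zip(*[tokens[i:] for i in range(n)]))


def skip_bigrams(tokens, k):
    """Bigrams whose two tokens have at most k tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # The partner token may sit 1 .. k+1 positions to the right.
        for right in tokens[i + 1 : i + k + 2]:
            pairs.append((left, right))
    return pairs


tokens = "insurgents killed in ongoing fighting".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
two_skip_bigrams = skip_bigrams(tokens, 2)

print(bigrams)
print(two_skip_bigrams)
```

With `k = 0` the inner slice contains only the immediate neighbor, so `skip_bigrams(tokens, 0)` yields the same pairs as `ngrams(tokens, 2)`.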

### Define the UDF to generate k-skip-n-grams

In [8]:
sql = """
    drop function if exists YOUR_SCHEMA.generate_k_skip_n_grams(text[], int, int) cascade;
    create function YOUR_SCHEMA.generate_k_skip_n_grams(
        tokens text[],
        k int,
        n int
    )
    returns setof text[]
    as
    $$
        from itertools import combinations
        def is_valid_k_skip(lst, k):
            '''
                Check whether an n-gram, given as a sequence of (index, token) pairs,
                is a valid k-skip n-gram: every pair of adjacent items must be
                between 1 and k+1 positions apart. A set of 2-skip bigrams therefore
                includes 0-skip bigrams, 1-skip bigrams and 2-skip bigrams.
            '''
            idx_diff = [(next_item[0] - current_item[0]) for current_item, next_item in zip(lst, lst[1:])]
            #every gap between adjacent indices must be a valid skip
            return all(1 <= diff <= k + 1 for diff in idx_diff)

        def generate_k_skip_n_gram(lst, k, n):
            '''
                Return all k-skip, n-grams as defined in 
                http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
                ex: "Insurgents killed in ongoing fighting"
                Bi-grams = {insurgents killed, killed in, in ongoing,
                ongoing fighting}.
                2-skip-bi-grams = {insurgents killed, insurgents in,
                insurgents ongoing, killed in, killed ongoing, killed
                fighting, in ongoing, in fighting, ongoing fighting}
                Tri-grams = {insurgents killed in, killed in ongoing, in
                ongoing fighting}.
                2-skip-tri-grams = {insurgents killed in, insurgents killed
                ongoing, insurgents killed fighting, insurgents in ongoing,
                insurgents in fighting, insurgents ongoing fighting, killed
                in ongoing, killed in fighting, killed ongoing fighting, in
                ongoing fighting}.
            '''
            if n > len(lst) or k > len(lst):
                #raising a plain string is not valid Python; raise a proper exception instead
                raise ValueError('Invalid values for n:{0} or k:{1}'.format(n, k))
            #Optimization for normal n-grams (0-skip-n-grams)
            if(k==0):
                return zip(*[lst[i:] for i in range(n)])
            else:
                n_grams = combinations(enumerate(lst), n)
                return [[tup[1] for tup in ngram] for ngram in filter(lambda ngram: is_valid_k_skip(ngram, k), n_grams)]
        return generate_k_skip_n_gram(tokens, k, n)
    $$language plpythonu;
"""
psql.execute(sql, conn)
conn.commit()
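
The UDF above filters the output of `combinations(enumerate(lst), n)`, which enumerates every n-element subset of token positions before discarding the invalid ones; for long documents that candidate set grows quickly. The sketch below, written in plain Python 3 outside the database, compares that filter-based approach with a hypothetical alternative that only ever builds valid index sequences. The function names `skip_ngrams_filtered` and `skip_ngrams_direct` are illustrative and not part of the notebook, and both assume the same rule as the UDF: adjacent tokens of an n-gram are at most `k+1` positions apart.

```python
from itertools import combinations


def skip_ngrams_filtered(tokens, k, n):
    """Mirror of the UDF's approach: enumerate all index combinations, then filter."""
    result = []
    for idx in combinations(range(len(tokens)), n):
        if all(1 <= b - a <= k + 1 for a, b in zip(idx, idx[1:])):
            result.append([tokens[i] for i in idx])
    return result


def skip_ngrams_direct(tokens, k, n):
    """Build only valid index sequences: each next index is 1..k+1 past the previous."""
    def extend(prefix):
        if len(prefix) == n:
            yield [tokens[i] for i in prefix]
            return
        if prefix:
            # Later positions are bounded by the gap limit.
            start = prefix[-1] + 1
            stop = min(len(tokens), prefix[-1] + k + 2)
        else:
            # The first position is unconstrained.
            start, stop = 0, len(tokens)
        for i in range(start, stop):
            yield from extend(prefix + [i])

    return list(extend([]))


sample = "insurgents killed in ongoing fighting".split()
two_skip_trigrams = skip_ngrams_direct(sample, 2, 3)
print(len(two_skip_trigrams))
```

Both functions return the n-grams in the same order, because each walks the index sequences in ascending lexicographic order.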

### Invoke the UDF to generate bigrams and trigrams on a sample document

In [9]:
sql = """
    select
        YOUR_SCHEMA.generate_k_skip_n_grams(
            tokens,
            0, --no. of skips (k)
            2  --n-gram (n=2 => bigram, n=3 => trigram etc.) 
        ) as bigram
    from
    (
        select
            regexp_split_to_array(
                regexp_replace(
                    trim(both from
                        $$ Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational 
                        linguistics concerned with the interactions between computers and human (natural) languages. 
                        As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve 
                        natural language understanding, that is, enabling computers to derive meaning from human or 
                        natural language input, and others involve natural language generation. $$
                    ),
                    E'[\\\n\\\r]+', ' ', 'g'
                ),
                E'\\\s+'
            ) as tokens
    )q;
"""
df = psql.read_sql(sql, conn)
HTML(df.to_html())

Unnamed: 0,bigram
0,"[Natural, language]"
1,"[language, processing]"
2,"[processing, (NLP)]"
3,"[(NLP), is]"
4,"[is, a]"
5,"[a, field]"
6,"[field, of]"
7,"[of, computer]"
8,"[computer, science,]"
9,"[science,, artificial]"
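
The token array passed to the UDF comes from three nested SQL calls: `trim`, `regexp_replace` and `regexp_split_to_array`. A rough Python equivalent, using the standard `re` module on a shortened version of the sample text, shows why punctuation stays attached to tokens such as `science,` in the output above. This is a sketch of the same steps, not the code the database runs.

```python
import re

# Shortened version of the sample paragraph, with a line break inside it.
text = """ Natural language processing (NLP) is a field of computer science,
    artificial intelligence, and computational linguistics. """

# trim(both from ...) removes leading and trailing spaces.
trimmed = text.strip(" ")
# regexp_replace(..., '[\n\r]+', ' ', 'g') turns line breaks into spaces.
single_line = re.sub(r"[\n\r]+", " ", trimmed)
# regexp_split_to_array(..., '\s+') splits on runs of whitespace only.
tokens = re.split(r"\s+", single_line)

print(tokens[:4])
print(tokens[8:10])
```

Because the split pattern matches whitespace only, commas, periods and parentheses remain part of the neighboring token.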


In [10]:
sql = """
    select
        YOUR_SCHEMA.generate_k_skip_n_grams(
            tokens,
            0, --no. of skips (k)
            3  --n-gram (n=2 => bigram, n=3 => trigram etc.) 
        ) as bigram
    from
    (
        select
            regexp_split_to_array(
                regexp_replace(
                    trim(both from
                        $$ Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational 
                        linguistics concerned with the interactions between computers and human (natural) languages. 
                        As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve 
                        natural language understanding, that is, enabling computers to derive meaning from human or 
                        natural language input, and others involve natural language generation. $$
                    ),
                    E'[\\\n\\\r]+', ' ', 'g'
                ),
                E'\\\s+'
            ) as tokens
    )q;
"""
df = psql.read_sql(sql, conn)
HTML(df.to_html())

Unnamed: 0,bigram
0,"[Natural, language, processing]"
1,"[language, processing, (NLP)]"
2,"[processing, (NLP), is]"
3,"[(NLP), is, a]"
4,"[is, a, field]"
5,"[a, field, of]"
6,"[field, of, computer]"
7,"[of, computer, science,]"
8,"[computer, science,, artificial]"
9,"[science,, artificial, intelligence,]"


### Invoke the UDF to generate `2-skip-bi-grams`

In [11]:
sql = """
    select
        YOUR_SCHEMA.generate_k_skip_n_grams(
            tokens,
            2, --no. of skips (k)
            2  --n-gram (n=2 => bigram, n=3 => trigram etc.) 
        ) as bigram
    from
    (
        select
            regexp_split_to_array(
                regexp_replace(
                    trim(both from
                        $$ Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational 
                        linguistics concerned with the interactions between computers and human (natural) languages. 
                        As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve 
                        natural language understanding, that is, enabling computers to derive meaning from human or 
                        natural language input, and others involve natural language generation. $$
                    ),
                    E'[\\\n\\\r]+', ' ', 'g'
                ),
                E'\\\s+'
            ) as tokens
    )q;
"""
df = psql.read_sql(sql, conn)
HTML(df.to_html())

Unnamed: 0,bigram
0,"[Natural, language]"
1,"[Natural, processing]"
2,"[Natural, (NLP)]"
3,"[language, processing]"
4,"[language, (NLP)]"
5,"[language, is]"
6,"[processing, (NLP)]"
7,"[processing, is]"
8,"[processing, a]"
9,"[(NLP), is]"


### Invoke the UDF to generate `2-skip-tri-grams`

In [12]:
sql = """
    select
        YOUR_SCHEMA.generate_k_skip_n_grams(
            tokens,
            2, --no. of skips (k)
            3  --n-gram (n=2 => bigram, n=3 => trigram etc.) 
        ) as bigram
    from
    (
        select
            regexp_split_to_array(
                regexp_replace(
                    trim(both from
                        $$ Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational 
                        linguistics concerned with the interactions between computers and human (natural) languages. 
                        As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve 
                        natural language understanding, that is, enabling computers to derive meaning from human or 
                        natural language input, and others involve natural language generation. $$
                    ),
                    E'[\\\n\\\r]+', ' ', 'g'
                ),
                E'\\\s+'
            ) as tokens
    )q;
"""
df = psql.read_sql(sql, conn)
HTML(df.to_html())

Unnamed: 0,bigram
0,"[Natural, language, processing]"
1,"[Natural, language, (NLP)]"
2,"[Natural, language, is]"
3,"[Natural, processing, (NLP)]"
4,"[Natural, processing, is]"
5,"[Natural, processing, a]"
6,"[Natural, (NLP), is]"
7,"[Natural, (NLP), a]"
8,"[Natural, (NLP), field]"
9,"[language, processing, (NLP)]"
