# Phrase Learning

Phrase learning, the task of detecting multi-word expressions that behave as a single unit (for example, "stack overflow"), comes up often when working with text data. In this notebook, we use PySpark to build a phrase learner on Stack Overflow's dataset of JavaScript posts.

### Imports

We use the [Spark NLP library](https://nlp.johnsnowlabs.com/) for sentence detection and token normalization.

In [1]:
from pyspark.sql.functions import explode, pandas_udf, PandasUDFType, col, split, log
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

### Read data

Before running the cell below, download the files, decompress them and place them in the working directory: [questions](https://bostondata.blob.core.windows.net/stackoverflow/orig-q.tsv.gz), [duplicates](https://bostondata.blob.core.windows.net/stackoverflow/dup-q.tsv.gz) and [answers](https://bostondata.blob.core.windows.net/stackoverflow/ans.tsv.gz).

In [2]:
questions = spark.read.options(delimiter='\t')\
                 .csv('orig-q.tsv')\
                 .toDF('id', 'ans_id', 'text', 'date')

duplicates = spark.read.options(delimiter='\t')\
                  .csv('dup-q.tsv')\
                  .toDF('id', 'ans_id', 'text', 'date')

answers = spark.read.options(delimiter='\t')\
               .csv('ans.tsv')\
               .toDF('id', 'text')

We will work on all the text data available in these files.

In [3]:
df = questions.select('text').union(answers.select('text')).union(duplicates.select('text'))

### Paragraph extraction from text

The text of any post starts with the title and continues with the body. The body usually consists of code blocks, quotes and paragraphs. Since this structure is enforced by HTML tags, detecting paragraphs becomes a little easier.

We use Spark's `RegexTokenizer` to split each post into paragraphs. Here is a sample post:

In [4]:
print(df.take(1)[0][0])

Accessing the web page's HTTP Headers in JavaScript. <p>How do I access a page's HTTP response headers via JavaScript?</p> <p>Related to <a href="http://stackoverflow.com/questions/220149/how-do-i-access-the-http-request-header-fields-via-javascript"><strong>this question</strong></a>, which was modified to ask about accessing two specific HTTP headers.</p> <blockquote> <p><strong>Related:</strong><br> <a href="http://stackoverflow.com/questions/220149/how-do-i-access-the-http-request-header-fields-via-javascript">How do I access the HTTP request header fields via JavaScript?</a></p> </blockquote>


#### Spark RegexTokenizer

In [5]:
regex_tokenizer = RegexTokenizer(pattern=r'(<p>(.*?)</p>)+', gaps=False, inputCol='text', outputCol='paragraphs')

# first tokenize each post into paragraphs, then return a new row for each paragraph in the post
df = regex_tokenizer.transform(df)\
                    .select(explode('paragraphs').alias('paragraphs'))
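
To see what the tokenizer does outside Spark, the sketch below mimics it with Python's `re` module on a made-up post. It assumes `RegexTokenizer`'s default lowercasing and the `gaps=False` behaviour, where each regex match becomes one token. The helper name `extract_paragraphs` and the sample text are illustrative and not part of the notebook.

```python
import re

# Same pattern as the cell above, with a non-capturing group so each match is one paragraph.
PARAGRAPH = re.compile(r'(?:<p>.*?</p>)+')

def extract_paragraphs(post):
    """Return the lowercased <p>...</p> spans of a post, one list entry per match."""
    return [m.group(0) for m in PARAGRAPH.finditer(post.lower())]

sample_post = 'A title. <p>First paragraph.</p> <p>Second <a href="x">linked</a> one.</p>'
paragraphs = extract_paragraphs(sample_post)
print(paragraphs)
```

The title is dropped because it sits outside any `<p>` tag, which matches the first rows of the output shown later in the notebook.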

### Clean body of post

Strip code blocks and HTML tags (including link markup) from each paragraph. The text inside links and inline tags is kept.

In [6]:
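@pandas_udf('string', PandasUDFType.SCALAR)
def clean_text(s):
    # drop <pre><code> blocks entirely and strip every other HTML tag, keeping the inner text;
    # regex=True is required on recent pandas versions, where str.replace is literal by default
    return s.str.strip()\
            .str.replace(r'<pre><code>.*?</code></pre>|<[^>]+>', '', regex=True)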

df = df.withColumn('clean_paragraphs', clean_text(col('paragraphs')))
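
The cleaning step can be tried on single strings without Spark. This is a minimal sketch using Python's `re` module with the two alternatives of the pattern that actually match (whole `<pre><code>` blocks, then any other tag). The function name `clean_paragraph` and the sample strings are made up for illustration.

```python
import re

# Remove whole <pre><code> blocks and every other tag; text inside ordinary tags is kept.
HTML_NOISE = re.compile(r'<pre><code>.*?</code></pre>|<[^>]+>')

def clean_paragraph(paragraph):
    """Strip surrounding whitespace, code blocks and HTML tags from one paragraph."""
    return HTML_NOISE.sub('', paragraph.strip())

linked = clean_paragraph(
    '<p>related to <a href="http://example.com"><strong>this question</strong></a>.</p>')
with_code = clean_paragraph('<p>try <pre><code>alert(1);</code></pre> first</p>')
print(linked)
print(with_code)
```

Note that the link text survives while the code block disappears completely, leaving a double space behind.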

In [7]:
df.show()

+--------------------+--------------------+
|          paragraphs|    clean_paragraphs|
+--------------------+--------------------+
|<p>how do i acces...|how do i access a...|
|<p>related to <a ...|related to this q...|
|<p><strong>relate...|related: how do i...|
|<p>i need to some...|i need to somehow...|
|   <p>any ideas?</p>|          any ideas?|
|<p>i'm not agains...|i'm not against u...|
|<p>i am using <co...|i am using setint...|
|<p>i want the use...|i want the user t...|
|<p>how can an ema...|how can an email ...|
|<p>suppose i atta...|suppose i attach ...|
|<p>is there a way...|is there a way to...|
|<p>for example, s...|for example, supp...|
|<p>if i click the...|if i click the sp...|
|<p>ps: if the onc...|ps: if the onclic...|
|<p>pps: the backg...|pps: the backgrou...|
|<p>i'm writing a ...|i'm writing a web...|
|<p>is it possible...|is it possible to...|
|<p>so the smes at...|so the smes at my...|
|<p>what the users...|what the users ha...|
|<p>i know there a...|i know the

### Sentence boundary detection

We use the [Spark-NLP library](https://nlp.johnsnowlabs.com/) to detect sentence boundaries in paragraphs.

In [8]:
document_assembler = DocumentAssembler()\
  .setInputCol("clean_paragraphs")\
  .setOutputCol("document")

sentence_detector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sent")

finisher = Finisher() \
    .setInputCols(["sent"]) \
    .setIncludeKeys(False) \
    .setCleanAnnotations(True)

pipeline = Pipeline(
    stages = [
    document_assembler,
    sentence_detector,
    finisher
  ])

df = pipeline.fit(df).transform(df)
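
For intuition only, here is a deliberately naive sentence splitter. It is not the algorithm behind Spark NLP's `SentenceDetector`, which handles many more cases. It simply splits after `.`, `!` or `?` when whitespace follows. The function name and the sample paragraph are hypothetical.

```python
import re

def naive_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace (a rough stand-in only)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

sentences = naive_sentences('i need to somehow get this working. not even ssi. any ideas?')
print(sentences)
```

A rule this simple breaks on abbreviations and decimal numbers, which is one reason to use a dedicated sentence detector in the pipeline.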

In [9]:
df.show()

+--------------------+--------------------+--------------------+
|          paragraphs|    clean_paragraphs|       finished_sent|
+--------------------+--------------------+--------------------+
|<p>how do i acces...|how do i access a...|how do i access a...|
|<p>related to <a ...|related to this q...|related to this q...|
|<p><strong>relate...|related: how do i...|related: how do i...|
|<p>i need to some...|i need to somehow...|i need to somehow...|
|   <p>any ideas?</p>|          any ideas?|          any ideas?|
|<p>i'm not agains...|i'm not against u...|i'm not against u...|
|<p>i am using <co...|i am using setint...|i am using setint...|
|<p>i want the use...|i want the user t...|i want the user t...|
|<p>how can an ema...|how can an email ...|how can an email ...|
|<p>suppose i atta...|suppose i attach ...|suppose i attach ...|
|<p>is there a way...|is there a way to...|is there a way to...|
|<p>for example, s...|for example, supp...|for example, supp...|
|<p>if i click the...|if 

### Cleaning and Splitting Sentences

This step reimplements in PySpark what is already done [here](https://github.com/Azure/MachineLearningSamples-QnAMatching/blob/master/modules/phrase_learning.py).

#### Regex to split on punctuation

In [10]:
punctuation_tokenizer = RegexTokenizer(pattern=r"[\"\!\?\)\]\}\,\:\;\*\-]*\s+\([0-9]+\)\s+[\(\[\{\"\*\-]*"
                                               r"|[\"\!\?\)\]\}\,\:\;\*\-]+\s+[\(\[\{\"\*\-]*"
                                               r"|\.\.+"       # ..
                                               r"|\s*\-\-+\s*" # --
                                               r"|\s+\-\s+"    # -
                                               r"|\:\:+"       # ::
                                               r"|\s+[\/\(\[\{\"\-\*]+\s*"
                                               r"|[\,!\?\"\)\(\]\[\}\{\:\;\*](?=[a-zA-Z])"
                                               r"|[\"\!\?\)\]\}\,\:\;]+[\.]*$",
                                       inputCol='finished_sent',
                                       outputCol='sent_sans_punct')

df = punctuation_tokenizer.transform(df)
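
With the default `gaps=True`, `RegexTokenizer` treats the pattern as a separator and keeps the text between matches. The sketch below reproduces that behaviour with `re.split`, using only two of the alternatives from the full pattern (punctuation followed by whitespace, and punctuation at the end of the sentence). The reduced pattern and the name `split_on_punctuation` are simplifications for illustration.

```python
import re

SPLIT = re.compile(
    r'["!?)\]},:;*\-]+\s+[(\[{"*\-]*'  # punctuation run followed by whitespace
    r'|["!?)\]},:;]+\.*$'              # punctuation at the very end of the sentence
)

def split_on_punctuation(sentence):
    """Split a sentence on the separators above and drop empty fragments."""
    return [part for part in SPLIT.split(sentence) if part]

fragments = split_on_punctuation('related: how do i access headers?')
untouched = split_on_punctuation('not even ssi.')
print(fragments)
print(untouched)
```

A plain trailing period is not a separator here, which is consistent with rows such as `not even ssi.` in the output below.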

In [11]:
df.show()

+--------------------+--------------------+--------------------+--------------------+
|          paragraphs|    clean_paragraphs|       finished_sent|     sent_sans_punct|
+--------------------+--------------------+--------------------+--------------------+
|<p>how do i acces...|how do i access a...|how do i access a...|[how do i access ...|
|<p>related to <a ...|related to this q...|related to this q...|[related to this ...|
|<p><strong>relate...|related: how do i...|related: how do i...|[related, how do ...|
|<p>i need to some...|i need to somehow...|i need to somehow...|[i need to someho...|
|   <p>any ideas?</p>|          any ideas?|          any ideas?|         [any ideas]|
|<p>i'm not agains...|i'm not against u...|i'm not against u...|[i'm not against ...|
|<p>i am using <co...|i am using setint...|i am using setint...|[i am using setin...|
|<p>i want the use...|i want the user t...|i want the user t...|[i want the user ...|
|<p>how can an ema...|how can an email ...|how can an 

In [12]:
df = df.select(explode(col('sent_sans_punct')).alias('sentences'))

In [13]:
df.show()

+--------------------+
|           sentences|
+--------------------+
|how do i access a...|
|related to this q...|
|which was modifie...|
|             related|
|how do i access t...|
|i need to somehow...|
|       not even ssi.|
|           any ideas|
|i'm not against u...|
|if someone can su...|
|i am using setint...|
|               fname|
|10000);@to call a...|
|i want the user t...|
|how can an email ...|
|suppose i attach ...|
|is there a way to...|
|the element which...|
|inside the functi...|
|         for example|
+--------------------+
only showing top 20 rows



#### Remove underscores, equals signs and parentheses

In [14]:
# replace each underscore with a space
@pandas_udf('string', PandasUDFType.SCALAR)
def underbar_to_spaces(s):
    return s.str.strip()\
            .str.replace('_', ' ', regex=False)

df = df.withColumn('sentences_without_underbars', underbar_to_spaces('sentences'))

In [15]:
df.show()

+--------------------+---------------------------+
|           sentences|sentences_without_underbars|
+--------------------+---------------------------+
|how do i access a...|       how do i access a...|
|related to this q...|       related to this q...|
|which was modifie...|       which was modifie...|
|             related|                    related|
|how do i access t...|       how do i access t...|
|i need to somehow...|       i need to somehow...|
|       not even ssi.|              not even ssi.|
|           any ideas|                  any ideas|
|i'm not against u...|       i'm not against u...|
|if someone can su...|       if someone can su...|
|i am using setint...|       i am using setint...|
|               fname|                      fname|
|10000);@to call a...|       10000);@to call a...|
|i want the user t...|       i want the user t...|
|how can an email ...|       how can an email ...|
|suppose i attach ...|       suppose i attach ...|
|is there a way to...|       is

In [16]:
document_assembler = DocumentAssembler() \
    .setInputCol("sentences_without_underbars") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized") \
  .setPatterns([r"_+|\(\$?|=|\)\$?"])

finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setIncludeKeys(False) \
    .setCleanAnnotations(True)

pipeline = Pipeline(
    stages = [
    document_assembler,
    tokenizer,
    normalizer,
    finisher
  ])

df = pipeline.fit(df).transform(df)

# the finisher output is a single '@'-delimited string; split on @ to get an array of tokens
df = df.withColumn('tokens', split(df['finished_normalized'], '@'))

# remove stopwords
stopwords_remover = StopWordsRemover(inputCol='tokens', outputCol='clean_tokens')

df = stopwords_remover.transform(df)
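
The last two steps of this cell, splitting the finished string on `@` and dropping stop words, can be sketched in plain Python. The stop-word set below is a tiny hypothetical list chosen for the example; Spark's `StopWordsRemover` uses a much longer default list.

```python
# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {'how', 'do', 'i', 'a', 'to', 'this', 'which', 'was'}

def split_and_filter(finished, delimiter='@'):
    """Split a delimited token string and return (all tokens, tokens without stop words)."""
    tokens = finished.split(delimiter)
    # compare in lowercase so that matching is case-insensitive
    clean = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens, clean

tokens, clean_tokens = split_and_filter('related@to@this@question')
print(tokens)
print(clean_tokens)
```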

In [17]:
df.show()

+--------------------+---------------------------+--------------------+--------------------+--------------------+
|           sentences|sentences_without_underbars| finished_normalized|              tokens|        clean_tokens|
+--------------------+---------------------------+--------------------+--------------------+--------------------+
|how do i access a...|       how do i access a...|how@do@i@access@a...|[how, do, i, acce...|[access, page, 's...|
|related to this q...|       related to this q...|related@to@this@q...|[related, to, thi...| [related, question]|
|which was modifie...|       which was modifie...|which@was@modifie...|[which, was, modi...|[modified, ask, a...|
|             related|                    related|             related|           [related]|           [related]|
|how do i access t...|       how do i access t...|how@do@i@access@t...|[how, do, i, acce...|[access, http, re...|
|i need to somehow...|       i need to somehow...|i@need@to@somehow...|[i, need, to, som

### Normalized Pointwise Mutual Information

To find commonly used bigram phrases, we rank candidate bigrams by Normalized Pointwise Mutual Information (NPMI). For two words $x$ and $y$, $npmi(x,y)$ is defined as:

$$npmi(x,y) = -\frac{1}{\ln{p(x, y)}} \cdot \ln{\frac{p(x, y)}{p(x)p(y)}}$$

$p(x)$ and $p(y)$ are the probabilities of $x$ and $y$, and $p(x, y)$ is the probability of the bigram. NPMI ranges from $-1$ (the words never occur together) through $0$ (the words are independent) to $1$ (the words only occur together). Ideally, we would use smoothed probabilities, but for simplicity we estimate them from raw counts here.

Read more about NPMI [here](https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf).
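
As a quick numeric check of the definition, the sketch below computes the score from raw counts for two toy cases. The function name, its parameters and the counts are invented for illustration, and it assumes every probability is estimated as a count divided by the same total.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalized pointwise mutual information estimated from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# x and y always appear together: the score reaches its maximum.
always_together = npmi(count_xy=10, count_x=10, count_y=10, total=1000)
# p(x, y) equals p(x) * p(y): the words look independent.
independent = npmi(count_xy=1, count_x=10, count_y=10, total=100)
print(always_together, independent)
```

The first case evaluates to 1 and the second to 0, up to floating-point rounding.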

In [18]:
# get unigram counts
unigrams = df.select(explode('clean_tokens').alias('unigram')).groupBy('unigram').count()

# bigram counts
ngrams = NGram(n=2, inputCol='clean_tokens', outputCol='ngram')
bigrams = ngrams.transform(df).select(explode('ngram').alias('bigram')).groupBy('bigram').count()
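
The same counting can be sketched on a couple of token lists with `collections.Counter`. Bigrams are formed from adjacent tokens within one sentence and joined with a space, which is how Spark's `NGram` represents them. The helper name and the sample sentences are made up.

```python
from collections import Counter

def count_ngrams(token_lists):
    """Count unigrams and adjacent-token bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        # pair each token with its successor; bigrams never span two sentences
        bigrams.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return unigrams, bigrams

unigram_counts, bigram_counts = count_ngrams([
    ['uncaught', 'typeerror', 'thrown'],
    ['uncaught', 'typeerror'],
])
print(unigram_counts)
print(bigram_counts)
```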

In [19]:
# get individual words for the ngram
split_col = split(col('bigram'), ' ')
bigrams = bigrams.select('bigram', col('count').alias('bigram_count'),
                         split_col.getItem(0).alias('word1'),
                         split_col.getItem(1).alias('word2'))

# get count of word1
bigrams = bigrams.join(unigrams, bigrams['word1'] == unigrams['unigram'])\
                 .select('bigram', 'bigram_count', col('count').alias('word1_count'), 'word2')

bigrams = bigrams.join(unigrams, bigrams['word2'] == unigrams['unigram'])\
                 .select('bigram', 'bigram_count', 'word1_count',
                         col('count').alias('word2_count'))

In [20]:
bigrams.cache()

DataFrame[bigram: string, bigram_count: bigint, word1_count: bigint, word2_count: bigint]

In [21]:
N = bigrams.groupBy().sum('bigram_count').collect()[0][0]

bigrams = bigrams.withColumn('npmi',
    log(2.0, col('bigram_count') * N / col('word1_count') / col('word2_count')) / -log(2.0, col('bigram_count')/N))
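
The cell above uses base-2 logarithms while the formula is written with natural logarithms. Because NPMI is a ratio of two logarithms, the base cancels out. The sketch below checks this with the counts from the `douglas crockford` row of the output; the total of 100000 is an assumed value, since the notebook does not print `N`, so the resulting score is not expected to match the table.

```python
import math

def npmi_with_base(count_xy, count_x, count_y, total, base):
    """Same arithmetic as the Spark expression above, with a configurable log base."""
    pmi = math.log(count_xy * total / count_x / count_y, base)
    return pmi / -math.log(count_xy / total, base)

# total=100000 is an assumption made for this example only
base2 = npmi_with_base(22, 23, 46, 100000, 2.0)
base_e = npmi_with_base(22, 23, 46, 100000, math.e)
print(base2, base_e)
```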

In [22]:
final_df = bigrams.filter(col('bigram_count') > 20).orderBy(-col('npmi'))

In [23]:
final_df.show()

+--------------------+------------+-----------+-----------+------------------+
|              bigram|bigram_count|word1_count|word2_count|              npmi|
+--------------------+------------+-----------+-----------+------------------+
|   douglas crockford|          22|         23|         46|0.9235799177561944|
|                & lt|         137|        268|        157|0.9039493435633289|
|          john resig|          23|         65|         23|0.8980390219748962|
|  possible duplicate|        2350|       3778|       2615|0.8954366413246286|
|    doctype html&gt;|          34|         44|         76| 0.891592775739534|
|  uncaught typeerror|         155|        262|        237| 0.885336038746305|
|   internet explorer|          89|        222|        104|0.8789273729401587|
|    unexpected token|          75|        129|        132|0.8770265244977943|
|  requested resource|          60|         91|        127|0.8736378242708454|
|      stack overflow|          59|        107|     