# Language model

Language model is a probability distribution over sequences of word.

In this lab we will apply laguage model for a classification problem. The task is to implement a filter for spam documents.

Read this article
https://towardsdatascience.com/learning-nlp-language-models-with-real-data-cdff04c51c25

### Dataset
Download this https://www.kaggle.com/uciml/sms-spam-collection-dataset dataset.
Normalize the text and split by sentences using nltk library. Split sentences to the terms. We don't need to do lemmatize words and remove stop words. For simplicity we will lose the punctuation and characters register.
Make a lists of sentences for spam and ham messages.

In [18]:
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [19]:
import pandas as pd
import nltk

from itertools import chain
import re

df = pd.read_csv('spam.csv', encoding = 'ISO-8859-1')[['v1', 'v2']]
df.rename(columns={'v1': 'type', 'v2': 'content'}, inplace=True)
df.head()

spam_messages = df[df['type'] == 'spam']['content'].tolist() #list of sentences, each sentence represented as a list of terms
ham_messages = df[df['type'] == 'ham']['content'].tolist()

len(spam_messages), len(ham_messages)

Unnamed: 0,type,content
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


(747, 4825)

In [20]:
spam_messages[0:5]

["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv",
 'WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
 'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info']

Print the average length and average number of sentences in spam message.

In [21]:
from nltk.tokenize import sent_tokenize, word_tokenize
_ = nltk.download('punkt')

sent_sizes = [len(word_tokenize(sent)) for msg in spam_messages for sent in sent_tokenize(msg)]
avg_sent_len = sum(sent_sizes) / len(sent_sizes)

avg_sent_num = sum([len(sent_tokenize(msg)) for msg in spam_messages]) / len(spam_messages)

print(f'Average sentence length: {avg_sent_len:.2f}, average num of sent: {avg_sent_num:.2f}')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Average sentence length: 9.17, average num of sent: 3.02


In [22]:
import string

PUNCT = string.punctuation + '“”«»—•\\/'

def normalize(text, allow_asterix=False):
    text = text.lower()
    text = re.sub('\'', '', text)                          # remove apostrophes
    #text = re.sub('[!\?@#$.\-+=><—,\(\)&:“”]', ' ', text)        # replace all punctuation signs with spaces
    text = re.sub(f'[{PUNCT}]', ' ', text)        # replace all punctuation signs with spaces
    
    if not allow_asterix:
        text = re.sub('\*', ' ', text)                      # replace all astrixes (*) with spaces
    text = re.sub('[0-9]', ' ', text)                      # replace all digits with spaces
    result = " ".join([x.lower() for x in text.split()])   # lower all letters and delete all doubled spaces
    return result

def messages_to_sentences(messages):
    return list(chain(
        *[[word_tokenize(normalize(sent)) for sent in sent_tokenize(msg)] for msg in messages]))

spam_sentences = messages_to_sentences(spam_messages)
ham_sentences = messages_to_sentences(ham_messages)

for i in range(10):
    print(spam_sentences[i])

['free', 'entry', 'in', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', 'st', 'may']
['text', 'fa', 'to', 'to', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 't', 'cs', 'apply', 'over', 's']
['freemsg', 'hey', 'there', 'darling', 'its', 'been', 'weeks', 'now', 'and', 'no', 'word', 'back']
['id', 'like', 'some', 'fun', 'you', 'up', 'for', 'it', 'still']
['tb', 'ok']
['xxx', 'std', 'chgs', 'to', 'send', 'å£', 'to', 'rcv']
['winner']
['as', 'a', 'valued', 'network', 'customer', 'you', 'have', 'been', 'selected', 'to', 'receivea', 'å£', 'prize', 'reward']
['to', 'claim', 'call']
['claim', 'code', 'kl']


### Unigram model

Calculate the number of occurancies of each term separately for spam and ham messages. 

Calculate the total number of terms.

In [0]:
from collections import Counter

class CountDict(Counter):
    """
    Class that is used as counter, inherited from dict. 
    """
    def __getitem__(self, item):
        """
        Gets item without exceptions
        :param item:  key you want to get
        :return: value, associated with `item` (if item in the `keys`), 0 otherwise
        """
        if item not in self:
            return 0
        return super().__getitem__(item)

In [24]:
START = 'START'
END = 'END'

spam_term_c = CountDict(Counter(list(chain(*spam_sentences))))  # dict()
spam_term_c[START] = spam_term_c[END] = len(spam_sentences)
spam_N = sum(spam_term_c.values())

ham_term_c = CountDict(Counter(list(chain(*ham_sentences))))  # dict()
ham_term_c[START] = ham_term_c[END] = len(ham_sentences)
ham_N = sum(ham_term_c.values())

spam_N, ham_N

(21863, 86107)

Print 10 most popular words in spam messages.

In [25]:
'Spam most popular words', spam_term_c.most_common(10)

('Spam most popular words',
 [('START', 2256),
  ('END', 2256),
  ('to', 688),
  ('a', 390),
  ('call', 370),
  ('å£', 299),
  ('you', 291),
  ('your', 264),
  ('free', 228),
  ('the', 206)])

In [26]:
'Ham most popular words', ham_term_c.most_common(10)

('Ham most popular words',
 [('START', 8841),
  ('END', 8841),
  ('i', 2295),
  ('you', 1858),
  ('to', 1554),
  ('the', 1124),
  ('a', 1055),
  ('u', 1010),
  ('and', 857),
  ('in', 820)])

### Bigram model

We will use sentence begining and sentence ending as a special terms. Calculate the number of occuracnies for bigrams. As a key in dictionary you might use words, separated by the space symbol.

Also, for a genetative model, epxlained later, for each term we will need a list of next term, found in the dataset.

In [0]:
def bigram_nextword(sentences):
    bigrams = CountDict()
    next_words = dict()
    for i in range(len(sentences)):
        sent = [START] + sentences[i] + [END]
        for cur_token_pos in range(len(sent) - 1):
            cur_token, next_token = sent[cur_token_pos:cur_token_pos + 2]
            bigram = cur_token + ' ' + next_token
            if not bigram in bigrams:
                bigrams[bigram] = 0
            bigrams[bigram] += 1
            
            if not cur_token in next_words:
                next_words[cur_token] = set()
            next_words[cur_token].add(next_token)
    for k in next_words:
        next_words[k] = list(next_words[k]) 
    return bigrams, next_words

spam_bigram_c, spam_next_words = bigram_nextword(spam_sentences)
ham_bigram_c, _ = bigram_nextword(spam_sentences)

spam_bi_N = sum(spam_bigram_c.values())
ham_bi_N = sum(ham_bigram_c.values())

Which bigrams are the most popular in spam messages?

From which words spam sentence usually begins?

In [28]:
spam_bigram_pop = sorted(list(spam_bigram_c.items()), 
                         key=lambda x: x[1],
                         reverse=True
                         )[:10]

spam_bigram_start_pop = sorted(
    [x for x in spam_bigram_c.items() if x[0].startswith(START)],
    key=lambda x: x[1],
    reverse=True
    )[:10]

spam_bigram_pop, spam_bigram_start_pop

([('START call', 149),
  ('now END', 103),
  ('START you', 80),
  ('you have', 73),
  ('to END', 73),
  ('a å£', 73),
  ('START your', 67),
  ('START to', 66),
  ('START txt', 61),
  ('call now', 59)],
 [('START call', 149),
  ('START you', 80),
  ('START your', 67),
  ('START to', 66),
  ('START txt', 61),
  ('START urgent', 54),
  ('START free', 47),
  ('START text', 40),
  ('START claim', 38),
  ('START reply', 37)])

Implement a function, which return the conditional probability $P(t_2 | t_1) = \frac{count(t_1 t_2)}{count(t_1)}$

In [0]:
def conditional_prob(t1, t2, spam=True):
    bigram = t1 + ' ' + t2
    if spam:
        if spam_bigram_c[bigram] * spam_term_c[t1] == 0:
            return 0
        return spam_bigram_c[bigram] / spam_term_c[t1]
    else:
        #print(t1, t2, ham_bigram_c[bigram], ham_term_c[t1])
        if ham_bigram_c[bigram] * ham_term_c[t1] == 0:
            return 0
        return ham_bigram_c[bigram] / ham_term_c[t1]

### Genetative model

Now is the funny task. Using your language model generate a spam message. Remember you calculated the average number of sentences, average sentence size for spam messages.

Print few generated ouptuts.

#### Interesting outputs

- `Urgent. Reply or å£ cash every week. You are trying to no prepayment`
- `Win a guaranteed. Txt chat on orange line rental camcorder hit. Stop to contact u been renewed and downloads`
- `Free top quality ringtone club credits pls reply. Just txt great graphics from landline box croydon. Urgent`

In [30]:
from random import random, choice, seed, randint
import numpy as np

seed(1)
np.random.seed(1)

def generate_sentence(bigrams, next_words, sent_size):
    sent = []
    prev_word = START
    for j in range(round(sent_size)):
        if prev_word == END:
            break
        possible = next_words[prev_word]
        pdf = np.array([conditional_prob(prev_word, word) for word in possible])
        cdf = np.cumsum(pdf)

        cur_word = possible[0]
        r = random()
        for i in range(len(cdf)):
            if cdf[i] > r:
                cur_word = possible[i]
                break
        sent.append(cur_word)
        prev_word = cur_word
        
    sent[0] = sent[0].capitalize()
    return ' '.join(sent[:-1])

def generate(bigrams, next_words, sent_size, sent_num):
    text = [
            generate_sentence(bigrams, next_words, sent_size)
            for _ in range(round(sent_num))
    ]
    return '. '.join(text)

for i in range(5):
    print(generate(spam_bigram_c, spam_next_words, avg_sent_len, avg_sent_num))

Hi babe to send nokia to. Xmas reward of jordan txt swap and up. Sunshine
Free game å£ of discount vouchers. Yrs only with a å£ this is the. Spook to maximize ur smart though this weeks
Cost p mt msgrcvd skip an urgent. Call. U will receive å£ bonus caller prize
Call free. Money over. Private
We have new mobile for euro stop to. Im i put you are the following service. Txt or a sim subscriber ur games arcade


### Smoothing

The problem is that if the bigram $t_1 t_2$ occuted $0$ times in the corpus, the conditional probability $P(t_2|t_1) = 0$

The solution is smoothing. Read this document https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

Your task is to implement one of the advanced (from the document, except additive smoothing) smoothing techniques from it. Be ready to explain it defending the lab.

Implement a function, which return the conditional probability $P(t_2 | t_1)$ with a smoothing.

### Jelinek-Mercer smoothing (interpolation)

$P(t_2 | t_1) = \lambda \cdot \frac{count(t_1 t_2)}{count(t_1)} + (1 - \lambda) \cdot \frac{count(t_1)}{\sum_{i=0}^{M} count(t_i)} $

In [46]:
def unigram_prob(t, spam=True, lamb=0.5):
    if spam:
        return lamb * spam_term_c[t] / spam_N + (1 - lamb) / spam_N
    else:
        return lamb * ham_term_c[t] / ham_N + (1 - lamb) / ham_N

def bigram_prob(t1, t2, spam=True, lamb=0.5):
    return lamb * conditional_prob(t1, t2, spam) + (1 - lamb) * unigram_prob(t2)

def smoothing_conditional_prob(t1, t2, spam=True, lamb=0.5):
    # Jelinek-Mercer smoothing (interpolation)
    return bigram_prob(t1, t2, spam, lamb)

t1, t2 = choice(list(spam_bigram_c.keys())).split()
t1, t2,
for lamb in np.linspace(0, 1, num=11):
    print(f'lambda= {lamb:.1f}, smoothing_cond_prob= {smoothing_conditional_prob(t1, t2, True, lamb):.2f}')

('then', 'text')

lambda= 0.0, smoothing_cond_prob= 0.00
lambda= 0.1, smoothing_cond_prob= 0.02
lambda= 0.2, smoothing_cond_prob= 0.04
lambda= 0.3, smoothing_cond_prob= 0.06
lambda= 0.4, smoothing_cond_prob= 0.08
lambda= 0.5, smoothing_cond_prob= 0.10
lambda= 0.6, smoothing_cond_prob= 0.12
lambda= 0.7, smoothing_cond_prob= 0.14
lambda= 0.8, smoothing_cond_prob= 0.16
lambda= 0.9, smoothing_cond_prob= 0.18
lambda= 1.0, smoothing_cond_prob= 0.20


### Classification

Now, implement a bayessian classifier for the sentence. Test one of your generated sentences on it.

It should return, which probability is higher

$$P(spam|t_1, \dots , t_k) = \frac{P(t_1, \dots , t_k|spam)P(spam)}{P(t_1, \dots , t_k)} \sim P(t_1, \dots , t_k|spam)P(spam)$$ 
$$\sim P(t_1 | BEGIN, spam) \cdot \sim P(t_2 | t_1, spam) \cdot \dots \cdot \sim P(END | t_k, spam)$$

or the same for ham sentence.

In [0]:
from termcolor import colored

EPS = 10 ** (-20)

def classification_prob(sentence, spam, lamb=0.5):
    if isinstance(sentence, str):
        sentence = word_tokenize(sentence)
    sentence_bigrams = [(sentence[i], sentence[i + 1])
                        for i in range(len(sentence) - 1)]
    return np.prod([
                    smoothing_conditional_prob(bi[0], bi[1], spam, lamb)
                    for bi in sentence_bigrams
    ])


def classify(sentence, lamb=0.5):
    spam_prob = classification_prob(sentence, spam=True, lamb=lamb)
    ham_prob = classification_prob(sentence, spam=False, lamb=lamb)
    
    #if abs(spam_prob - ham_prob) < EPS: # with this it classifies much worse
    #    return 'same'
    if spam_prob > ham_prob:
        return 'spam'
    return 'ham'


def colorify(r):
    if r == 'same':
        return colored('same', 'blue')
    if r == 'spam':
        return colored('spam', 'red')
    return colored('ham', 'green')

In [48]:
from sklearn.metrics import balanced_accuracy_score

best_lamb = 0
best_score = 0
for lamb in np.linspace(0.0, 1.0, num=20):
    y_pred = [classify(sent, lamb) for sent in spam_sentences + ham_sentences]
    y_true = ['spam'] * len(spam_sentences) + ['ham'] * len(ham_sentences)
    score = balanced_accuracy_score(y_true, y_pred)
    if score > best_score:
        best_score = score
        best_lamb = lamb
    print(f'lambda= {lamb:.2f}, score= {score:.5f}')

lambda= 0.00, score= 0.50000
lambda= 0.05, score= 0.59250
lambda= 0.11, score= 0.60779
lambda= 0.16, score= 0.61610
lambda= 0.21, score= 0.61959
lambda= 0.26, score= 0.62236
lambda= 0.32, score= 0.62413
lambda= 0.37, score= 0.62845
lambda= 0.42, score= 0.62972
lambda= 0.47, score= 0.63388
lambda= 0.53, score= 0.63443
lambda= 0.58, score= 0.63532
lambda= 0.63, score= 0.63725
lambda= 0.68, score= 0.63903
lambda= 0.74, score= 0.64030
lambda= 0.79, score= 0.64252
lambda= 0.84, score= 0.64363
lambda= 0.89, score= 0.64363
lambda= 0.95, score= 0.64518
lambda= 1.00, score= 0.87569


In [49]:
print("ACTUALLY SPAM")
for i in range(10):
    k = randint(0, len(spam_sentences))
    pred = colorify(classify(spam_sentences[k], best_lamb))
    print(pred, spam_sentences[k])


print("\nACTUALLY HAM")
for i in range(10):
    k = randint(0, len(ham_sentences))
    pred = colorify(classify(ham_sentences[k], best_lamb))
    print(pred, ham_sentences[k])

ACTUALLY SPAM
[31mspam[0m ['call', 'from', 'land', 'line']
[31mspam[0m ['freemsg', 'fancy', 'a', 'flirt']
[31mspam[0m ['you', 'have', 'new', 'message']
[31mspam[0m ['mila', 'age', 'blonde', 'new', 'in', 'uk']
[31mspam[0m ['claim', 'is', 'easy', 'just', 'call', 'now']
[31mspam[0m ['cc', 'hg', 'suite', 'lands', 'row', 'w', 'j', 'hl']
[31mspam[0m ['your', 'å£', 'prize', 'from', 'yesterday', 'is', 'still', 'awaiting', 'collection']
[32mham[0m ['congrats']
[31mspam[0m ['we', 'are', 'trying', 'to', 'contact', 'you']
[31mspam[0m ['u', 'up', 'for', 'some', 'fun']

ACTUALLY HAM
[32mham[0m ['darren', 'is', 'wif', 'them', 'now']
[32mham[0m ['how', 'r', 'u', 'man']
[32mham[0m ['yes']
[32mham[0m ['the', 'battery', 'is', 'for', 'mr', 'adewale', 'my', 'uncle']
[32mham[0m ['am', 'on', 'a', 'train', 'back', 'from', 'northampton', 'so', 'im', 'afraid', 'not']
[32mham[0m ['and', 'is', 'there', 'a', 'way', 'you', 'can', 'send', 'shades', 'stuff', 'to', 'her']
[32mham[0m [

## Funny application
Super Innopolis messages generator bot: [@SuperInnoMsgBot](https://teleg.run/SuperInnoMsgBot)
