# Probabilistic classifier of texts into spam / ham

## Intro

Here is a classical "complete the notebook" assignment. 

You can run all the cells in the notebook, and some of them you have to complete. 

The code you have to complete is marked with `#TODO` comments. The cells containing such code also contain assertions that you should fulfill. 

If the cells run without errors, you can be fairly confident that your solution is correct. 

Let's try it!

In [4]:
def square_root(x):
    """ This is a function that takes a non-negative numeric argument x and produces its square root. """
    # TODO: calculate the square root of x and put it into the y variable instead of None. 
    # If you are not sure, have a look at the list of Python basic operators
    # https://www.tutorialspoint.com/python/python_basic_operators.htm
    y = x**0.5
    return y

assert square_root(144) == 12


Now that you understand the format, let's have a look at a [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) that sorts short messages into spam and not-spam.

The main idea behind it is that $$P(spam|text) = \frac{P(spam)P(text|spam)}{P(text)}$$

You will have to implement this formula along with some hacks to make its application more robust.
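To get a feeling for the formula, here is a small numeric sketch. The three input numbers are made up for illustration and are not estimated from the dataset; $P(text)$ is expanded with the law of total probability.

```python
# Illustrative numbers only (assumptions, not estimated from the SMS data)
p_spam = 0.14              # P(spam): prior probability that a message is spam
p_text_given_spam = 0.020  # P(text|spam): probability of this text among spam
p_text_given_ham = 0.001   # P(text|ham): probability of this text among ham

p_ham = 1 - p_spam
# P(text) = P(spam) P(text|spam) + P(ham) P(text|ham)
p_text = p_spam * p_text_given_spam + p_ham * p_text_given_ham

# Bayes rule
p_spam_given_text = p_spam * p_text_given_spam / p_text
print(round(p_spam_given_text, 3))  # prints 0.765
```

Even with a low prior, the message is judged to be probably spam, because the text is twenty times more likely under the spam label.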

![](https://pics.me.me/suppose-you-have-one-rabbit-now-suppose-someone-gives-you-21826742.png)

## Loading the data

The cell below loads the file with messages. 

If you run this notebook locally on Windows, you have to download the file manually. 

In [5]:
!wget https://raw.githubusercontent.com/avidale/ps4ds2019/master/homework/week1/spam_classifier/SMSSpamCollection

/bin/sh: wget: command not found
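If `wget` is not available, the file can also be fetched from Python itself. The helper below is a minimal sketch using only the standard library; the name `download_if_missing` is hypothetical, and the URL is the one from the `wget` command.

```python
import os
import urllib.request

# URL copied from the wget command
DATA_URL = ('https://raw.githubusercontent.com/avidale/ps4ds2019/master/'
            'homework/week1/spam_classifier/SMSSpamCollection')

def download_if_missing(url, path):
    """ Download url to path unless the file already exists; return True if a download happened. """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True

# usage (needs network access):
# download_if_missing(DATA_URL, 'SMSSpamCollection')
```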


The following cell imports some Python libraries. Some of them (namely, `pandas`) may not be installed on your machine. In this case, install them with a package manager from the command line. The command would look like `pip install pandas` or `conda install pandas`.

If you run this notebook in Google Colab, the libraries are already installed.

In [3]:
# load some useful Python libraries

import pandas as pd # the library for working with data tables
import re
from collections import Counter # a class for counting objects (words and text labels, in our case)

ModuleNotFoundError: No module named 'pandas'

In [None]:
# load the data from disk to a tabular format, and give readable names to its columns
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
data.columns = ['target', 'text']

In this dataset, "ham" is a good text, and "spam" is, well, spam. 

In [19]:
# enable pandas to display large texts and look into our data

pd.options.display.max_colwidth = 300

print(data.shape) # number of rows and columns
data.head(5)

(5572, 2)


| | target | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


## Preprocessing the data

In a minute we will have to estimate the probabilities of different texts. 

We could use *language models*, based e.g. on n-grams or recurrent neural networks, to calculate the probability of the original texts. 

But for our problem, it will be sufficient to represent each text by the set of words (and other symbols) that occur in it. This representation ignores word order and the number of times each word occurs.

That is, we will not distinguish between the texts 

> this one is a long message. 

and 

> this message is a long long long long long long one.

Both will be represented as a set of tokens:

In [219]:
# Optional: stemming with NLTK (disabled by default; uncomment to experiment)
# from nltk import stem
# stemmer = stem.SnowballStemmer('english')

# tokens that are removed from every bag of words; by default only the empty string,
# which `re.split` produces at the start or end of a text
stopwords = {""}

def get_words(text):
    """ This function converts the given text into an unordered and uncounted bag of words. """
    return set(re.split(r'\W+', text.lower())).difference(stopwords)

    # Alternative to experiment with: stem every token before building the set
    # return {stemmer.stem(word) for word in re.split(r'\W+', text.lower())}.difference(stopwords)

# just an example
get_words("this message is a long, long, long, long long long one caresses ponies.")

{'a', 'caresses', 'is', 'long', 'message', 'one', 'ponies', 'this'}
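As a quick check of the claim that word order and repetition are ignored, the sketch below applies the same splitting logic to the two example messages. The helper name `to_token_set` is hypothetical; it mirrors `get_words` so that the snippet runs on its own.

```python
import re

def to_token_set(text):
    """ Lowercase the text, split on non-word characters, drop empty strings. """
    return set(re.split(r'\W+', text.lower())) - {''}

short_message = "this one is a long message."
long_message = "this message is a long long long long long long one."

print(to_token_set(short_message) == to_token_set(long_message))  # True
print(sorted(to_token_set(short_message)))
```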

This simplified approach will allow us to train the probabilistic model of texts using a modest amount of data.

In [None]:
# apply this logic to texts of all messages
bags_of_words = [get_words(text) for text in data.text]

To evaluate how well our model classifies messages, let's train it on the first 3000 texts, and measure accuracy on the rest.

In [None]:
n_train = 3000
train_x, test_x, train_y, test_y = bags_of_words[:n_train], bags_of_words[n_train:], data.target[:n_train], data.target[n_train:]
# print(bags_of_words)
# print(train_x)
# print(test_x)

## The basic classifier

In the cell below, we will count occurrences of words under different labels.

We are going to use `Counter` objects. If you are not sure how they work, please look at [the documentation](https://docs.python.org/3.6/library/collections.html#collections.Counter). 
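The behaviour of `Counter.update` depends on what you pass to it, which is easy to get wrong. A short sketch with made-up words:

```python
from collections import Counter

c = Counter()
c.update({'free', 'win'})   # a set: each element is counted once
c.update(['free', 'free'])  # a list: repeated elements are counted repeatedly
c.update({'win': 5})        # a mapping: its values are added to the counts
c.update('ab')              # a string is an iterable of characters!

print(c['free'], c['win'], c['a'], c['missing'])  # prints 3 6 1 0
```

Note that a missing key yields 0 instead of raising `KeyError`, which is convenient for words that never occurred under a label.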

In [None]:
# this counter will keep the number of spam and ham texts
label_counter = Counter()

# these counters will keep the frequency of each word in ham and spam texts
word_counters = {
    'spam': Counter(), 
    'ham': Counter()
}

all_words = set()

for label, words in zip(train_y, train_x):
    all_words.update(words)
    # TODO: use the `update` methods of all 3 counters, to calculate total number of 
    label_counter.update([label])       # one more text with this label
    word_counters[label].update(words)  # `words` is a set, so each word counts once per text


assert label_counter['spam'] == 409
assert word_counters['ham']['hello'] >= 2

Now let's calculate different probabilities of words, texts, and labels for our classifier.

In [None]:
def prior_probability_of_label(label):
    """ This function evaluates probability of the given label (it can be 'spam' or 'ham'), using the counters. """
    # TODO: calculate and return this probability as the ratio of the number of texts with this label to the number of all texts
    return label_counter[label] / sum(label_counter.values())

# print(label_counter['spam'])
# print(len(list(label_counter)))
# print(prior_probability_of_label('spam'))
assert round(prior_probability_of_label('spam'), 2) == 0.14
assert round(prior_probability_of_label('ham'), 2) == 0.86

In [None]:
def word_probability_given_label(word, label):
    """ This function calculates probability of a word occurrence in text, conditional on the label of this text. """
    # TODO: calculate and return this probability 
    # as ratio of number of texts with this word and label to number of texts with this label
    return word_counters[label][word] / label_counter[label]

assert round(word_probability_given_label("99", "spam"), 3) == 0.002

Here we encounter the first practical problem: some words have never occurred in our training data. 

But they can probably occur in the texts to which our model will be applied in the future. 

To assign a non-zero probability to such texts, we can slightly modify the `word_probability_given_label`. For example, instead of original estimate, 

$$\hat{p}(word|label) = \frac{count(word, label)}{count(label)}$$

we could use a "smoothed" version

$$\hat{p}(word|label) = \frac{count(word, label) + \alpha\times p}{count(label) + p}$$

where $\alpha\in(0, 1)$ is the anchor probability towards which we move our estimate, and $p$ is the step size towards this anchor. 

Values like $p=0.1$ and $\alpha=10^{-3}$ would do.  
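The effect of the smoothing can be seen on a few numbers. This is only a sketch: the function name `smoothed_estimate` is hypothetical, and 409 is the number of spam texts checked by the assertion on `label_counter` earlier in the notebook.

```python
import math

def smoothed_estimate(word_count, label_count, alpha=1e-3, p=0.1):
    """ Move the raw frequency word_count / label_count towards alpha with step size p. """
    return (word_count + alpha * p) / (label_count + p)

n_spam = 409  # number of spam texts in the training part

unseen = smoothed_estimate(0, n_spam)     # word never seen in spam
seen_once = smoothed_estimate(1, n_spam)  # word seen in one spam text

print(unseen > 0)                              # no more zero probabilities
print(abs(seen_once - 1 / n_spam) < 1e-5)      # observed frequencies barely change
print(math.isclose(smoothed_estimate(0, 0), 1e-3))  # with no data at all, the estimate equals alpha
```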

In [None]:
# TODO: modify the `word_probability_given_label` function, by moving each probability towards a small positive constant
def word_probability_given_label(word, label):
    """ This function calculates probability of a word occurrence in text, conditional on the label of this text. """
    # TODO: calculate and return this probability 
    # as ratio of number of texts with this word and label to number of texts with this label
    p = 0.009        # step size (pseudo-count) towards the anchor
    alpha = 1 ** -3  # anchor probability; note that 1 ** -3 evaluates to 1.0, while 10 ** -3 would give 0.001

    return (word_counters[label][word] + alpha * p) / (label_counter[label] + p)

assert word_probability_given_label("999", "spam") > 0
assert word_probability_given_label("999", "spam") < 0.005

Now we can move from words to texts. 

Here is where we apply our naive assumption that occurrences of each word are independent:
$$ P(text|label) = \prod_{word \in text} P(word|label) \times \prod_{word \notin text} (1-P(word|label)) $$
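A product of thousands of small factors can underflow to zero in floating-point arithmetic. One common remedy, shown here only as a sketch and not as what this notebook does, is to sum logarithms instead of multiplying probabilities. The vocabulary and its probabilities below are hypothetical.

```python
import math

# hypothetical per-word probabilities P(word|label) for a tiny vocabulary
word_probs = {'free': 0.30, 'win': 0.20, 'hello': 0.05, 'meeting': 0.01}

def log_text_probability(words, word_probs):
    """ log P(text|label) under the independence assumption, over the whole vocabulary. """
    total = 0.0
    for word, prob in word_probs.items():
        # a present word contributes log P, an absent word contributes log(1 - P)
        total += math.log(prob) if word in words else math.log(1 - prob)
    return total

text = {'free', 'win'}
direct = 0.30 * 0.20 * (1 - 0.05) * (1 - 0.01)
print(math.isclose(math.exp(log_text_probability(text, word_probs)), direct))  # True
```

As in the formula, words outside the vocabulary do not contribute to the result.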

In [None]:
def text_probability_given_label(text, label):
    """ This function calculates probability of the text conditional on its label. """
    if isinstance(text, str):
        text = get_words(text)
    probability = 1.0
    # TODO: calculate the probability of text given label. 
    # use a function defined above and the naive assumption of word independence
    for word in all_words:
        if word in text:
            probability *= word_probability_given_label(word, label)
        else:
            probability *= 1 - word_probability_given_label(word, label)
    return probability

greeting1 = 'hello how are you'
greeting2 = 'hello teacher how are you'

assert text_probability_given_label(greeting1, 'ham') > 0
assert text_probability_given_label(greeting1, 'ham') < 0.0001
assert text_probability_given_label(greeting2, 'ham') < text_probability_given_label(greeting1, 'ham')

Now you have all the components to compile your first probabilistic classifier!

In [195]:
def label_probability_given_text(text, label):
    """ This function calculates probability of the label (spam or ham) conditional on the text. """
    # TODO: calculate label probability conditional on text
    # use the Bayes rule and the functions defined above

    # Bayes rule: P(label|text) = P(label) * P(text|label) / P(text),
    # where P(text) is the sum of P(l) * P(text|l) over both labels
    numerator = prior_probability_of_label(label) * text_probability_given_label(text, label)
    evidence = sum(
        prior_probability_of_label(l) * text_probability_given_label(text, l)
        for l in ('spam', 'ham')
    )
    # rounding to 15 decimals hides floating-point noise in the sum-to-one check below
    return round(numerator / evidence, 15)


text1 = 'hello how r you'
text2 = 'only today you can buy our book with 50% discount!'

print(label_probability_given_text(text1, 'ham'))
print(label_probability_given_text(text1, 'spam'))
#a = word_probability_given_label(text1, "") 
#print("all_words:{}".format(len(all_words)))
print("new")
print(label_probability_given_text(text2, 'ham'))
print(label_probability_given_text(text2, 'spam'))
a = label_probability_given_text(text1, 'ham') + label_probability_given_text(text1, 'spam')
print("a:{}".format(a))
assert label_probability_given_text(text1, 'ham') + label_probability_given_text(text1, 'spam') == 1.0
assert label_probability_given_text(text1, 'ham') > label_probability_given_text(text1, 'spam')
assert label_probability_given_text(text1, 'ham') > label_probability_given_text(text2, 'ham')

0.999999459718435
5.40281565e-07
new
0.112191797747459
0.887808202252541
a:1.0
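Exact equality checks on floating-point numbers are fragile. A possible alternative to rounding, sketched here with made-up scores, is a tolerant comparison with `math.isclose`:

```python
import math

# hypothetical unnormalized scores P(label) * P(text|label)
score_spam = 0.14 * 3e-40
score_ham = 0.86 * 7e-38

total = score_spam + score_ham
p_spam, p_ham = score_spam / total, score_ham / total

# tolerant check instead of rounding followed by exact equality
assert math.isclose(p_spam + p_ham, 1.0, rel_tol=1e-12)
print(p_ham > p_spam)  # True
```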


## Tuning the classifier

Now we have the classifier, but we don't know how well it works on unseen data. 

Let's see what fraction of test messages are classified correctly:

In [1]:
#threshold = 0.33
threshold = 0.004
test_spam_probabilities = [label_probability_given_text(text, 'spam') for text in test_x]
test_predictions = ['spam' if spamness > threshold else 'ham' for spamness in test_spam_probabilities]



accuracy = sum(1 if pred == fact else 0 for pred, fact in zip(test_predictions, test_y)) / len(test_y)
print(accuracy)

# i = 0
# cp_spam = 0
# cp_ham = 0
# for pred, fact in zip(test_predictions, test_y):
#   if pred != fact:
#     if fact == "spam":
#       cp_spam += 1
#     else:
#       cp_ham += 1
#     print (fact)
#     print(test_spam_probabilities[i])
#     print(test_x[i])
#   i += 1
# print(cp_spam) 
# print(cp_ham)

assert accuracy > 0.9
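Accuracy alone does not tell which kind of mistakes the classifier makes. The sketch below counts both error types with a `Counter`; the predictions and labels are toy values made up for illustration.

```python
from collections import Counter

# toy predictions and true labels (made up for illustration)
predicted = ['spam', 'ham', 'ham', 'spam', 'ham', 'ham']
actual = ['spam', 'ham', 'spam', 'ham', 'ham', 'ham']

# count (actual, predicted) pairs
confusion = Counter(zip(actual, predicted))
toy_accuracy = (confusion['spam', 'spam'] + confusion['ham', 'ham']) / len(actual)

print(confusion['spam', 'ham'])  # spam that slipped through as ham: 1
print(confusion['ham', 'spam'])  # ham wrongly flagged as spam: 1
print(round(toy_accuracy, 3))    # 0.667
```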


This is good accuracy, but you can achieve better results by tuning the algorithm. 

What you can do:
* play with the different values of the threshold
* play with the regularization constants that you used in `word_probability_given_label`
* experiment with different implementations of `get_words` - e.g. ignore the word case, or use word lemmas
* use your imagination
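Trying several thresholds can be automated. The following is a minimal sketch on toy data: the probabilities, labels, and candidate thresholds are made up, and `accuracy_at` is a hypothetical helper.

```python
# toy spam probabilities and true labels (made up for illustration)
probabilities = [0.001, 0.003, 0.02, 0.40, 0.95, 0.0005]
labels = ['ham', 'ham', 'spam', 'spam', 'spam', 'ham']

def accuracy_at(threshold):
    """ Fraction of correct predictions when everything above threshold is called spam. """
    predictions = ['spam' if prob > threshold else 'ham' for prob in probabilities]
    return sum(pred == fact for pred, fact in zip(predictions, labels)) / len(labels)

candidates = [0.0001, 0.004, 0.05, 0.5]
best = max(candidates, key=accuracy_at)
print(best, accuracy_at(best))  # prints 0.004 1.0
```

Keep in mind that a threshold chosen on the test set makes the reported accuracy optimistic; choosing it on a separate held-out part of the training data gives a fairer estimate.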

Can you beat 99% accuracy?

Have a good time! (-: