<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/introduction_to_information_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Introduction to Information Search

You might have come across the term Information Retrieval in the context of search engines: for example, Google famously started its business by providing a powerful search algorithm that kept improving over time. The search for information, however, is a basic need that you may face not only in the context of searching online: for instance, every time you search for the files on your computer, you also perform sort of information retrieval. In fact, the task predates digital era: before computers and Internet became a commodity, one
had to manually wade through paper copies of encyclopedias, books, documents, files and so on. Thanks to the technology, the algorithms these days help you do many of these tasks automatically.

##Setup

In [1]:
import os
import math
import random
import string

import nltk
from nltk import word_tokenize, WordNetLemmatizer, pos_tag
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from nltk.text import Text

from operator import itemgetter

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
%%shell

wget -qq https://github.com/ekochmar/Essential-NLP/raw/master/cisi.zip

unzip -qq cisi.zip



##Step 1: Understanding the task

---

**Scenario 1**:

Imagine that you have to perform the search in a collection of documents yourself, i.e. without the help of the machine. For example, you have a thousand printed out notes and minutes related to the meetings at work, and you
only need those that discuss the management meetings. How will you find all such documents? How will you identify the most relevant of these?

---

if you were tasked with this in actual life, you would go through the
documents one by one, identifying those that contain the key words (like `management` and `meetings`) and split all the documents into two piles: e.g. those documents that you should keep and look into further and those that you can discard because they do not answer your information need in learning more about the management meetings. This task is akin to filtering.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/1.png?raw=1' width='800'/>

Now, there are a couple of points that we did not get to discuss before: imagine there are a hundred of documents in total and you can quickly skim through them to filter out the most irrelevant ones – those that do not even mention either “meetings” or “management”.

Luckily, these days we have computers and most documents are stored electronically. Computers can really help us speed the things up here.



---

**Scenario 2** (based on Scenario 1, but more technical!):

Imagine that you have to perform the search in a collection of documents, this time with the help of the machine. For
example, you have a thousand notes and minutes related to the meetings at work stored in electronic format, and
you only need those that discuss the management meetings.
- First, how will you find all such documents? In other words, how can you code the search algorithm and what
characteristics of the documents should the search be based on?
- Second, how will you identify the most relevant of these documents? In other words, how can you implement a
sorting algorithm to sort the documents in order of decreasing relevance?

---

It allows you to leverage the computational power of the machine, but the drill is the same as before: get the machine to identify the texts that have the keywords in them, and then sort the “keep” pile according to the relevance of the texts, starting with the most relevant for the user or yourself to look at.

Despite us saying just now that the procedure is similar to how the humans perform the task (as in Scenario 1), there are actually some steps involved in getting the machine identify the documents with the keywords in them and sorting by relevance that we are not explicitly mentioning here. 

For instance, we humans have the following abilities that we naturally possess but machines naturally lack:

- We know what represents a word, while a machine gets in a sequence of symbols and does not, by itself, have a notion of what a “word” is.
- We know which words are keywords: e.g., if we are interested in finding the
documents on management meetings, we will consider those containing “meeting”
and “management”, but also those containing “meetings” and potentially even
“manager” and “managerial”. The machine, on the other hand, does not know that
these words are related, similar, or basically different forms of the same word.
- We have an ability to focus on what matters: in fact, when reading texts we usually skim through them rather than pay equal attention to each word. For instance, when
reading a sentence “Last Friday the management committee had a meeting”, which
words do you pay more attention to? Which ones express the key idea of this
message? Think about it – and we will return to this question later. The machines, on the other hand, should be specifically “told” which words matter more.
- Finally, we also intuitively know how to judge what is more relevant. The machines can make relevance judgments, too, but unlike us humans they need to be “told” how to measure relevance in precise numbers.

That, in a nutshell, represents the basic steps in the search algorithm.


<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/2.png?raw=1' width='800'/>

In this notebook, you will learn about other NLP techniques to preselect words, map the different forms of the same word to each other and weigh the words according to how much information they contribute to the task. Then you will build an information search algorithm that for any query (for example, “management meetings”) will find the most relevant documents in the collection of documents (for example, all minutes of the past managerial meetings sorted by their relevance).

Suppose you have built such an application following all the steps. You type in a query and the algorithm returns a document or several documents that are supposedly relevant to this query. How can you tell whether the algorithm has picked out the right documents?

Let’s use a dataset of documents and queries, where the documents are labeled with respect to their relevance to the queries. You will use this dataset as your gold standard, and before using the information search algorithm in practice, evaluate its performance against the ground truth labels in the labeled dataset.




###Data and data structures

You are going to use a publicly available dataset labeled for the task. That
means, a dataset with a number of documents and various queries, and a labeled list specifying which queries correspond to which documents. Once you implement and evaluate a search algorithm on such data labeled with ground truth, you can apply it to your own documents in your own projects.

You will use the dataset collected by the Centre for Inventions and Scientific Information (CISI), which contains abstracts and some additional metadata on the journal articles on information systems and information retrieval.

You will need to keep precisely three data structures for this application: 
- one for the documents, 
- another one for the queries, and 
- the third one matching the queries to the documents

Information search is based on the idea that the content of a document or set of documents is relevant given the content of a particular query, so both documents and queries data structures should keep the contents of all documents and all queries. 

What would be the best way to keep track of which content represents which document?

The most informative and useful way would be to assign a unique identifier – an index – to each document and each query. You can imagine, for example, storing content of the documents and queries in two separate tables, with each row representing a single document or query, and row numbers corresponding to the documents and queries ids. In Python, tables can be represented with dictionaries.

Now, if you keep two Python dictionaries (tables) matching each unique document
identifier (called key) to the document’s content (called value) in documents dictionary and matching each unique query identifier to the query’s content in queries dictionary, how should you represent the relevance mappings? 

You can use a dictionary structure again: this time, the keys will contain the queries ids, while the values should keep the matching documents ids. Since each query may correspond to multiple documents, it would be best to
keep the ids of the matching documents as lists.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/3.png?raw=1' width='800'/>

As this figure shows, query with id 1 matches documents with ids `1` and `1460`, therefore the mappings data structure keeps a list of `[1, 1460]` for query `1`; similarly it keeps `[3]` for query `2`, `[2]` for query `112`, and an empty list for query `3`, because in this example there are no
documents relevant for this query.

Now let’s look into the CISI dataset and code the data reading and initialization step. All documents are stored in a single text file CISI.ALL. It has a peculiar format: it keeps the abstract of each article and some additional information, such as the index in the set, the title, authors’ list and cross-references – a list of indexes for the articles that cite each other.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/4.png?raw=1' width='800'/>

For the information search application, arguably the most useful information is the content of the abstract: abstracts in the articles typically serve as a concise summary of what the article presents, something akin to a snippet.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/5.png?raw=1' width='800'/>

As you can see, the field identifiers such as .A or .W are separated from the actual text by new line. In addition, the text within each field, for example, the abstract may be spread across multiple lines. Ideally, we would like to convert this format into something like text.

Note that for the text that falls within the same field, e.g. .W, the line breaks (“\n”) are replaced with whitespaces, so each line now starts with a field identifier followed by the field content:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/6.png?raw=1' width='800'/>

The format is much easier to work with: you can now read the text line by line,
extract the unique identifier for the article from the field .I, merge the content of the fields `.T`, `.A` and `.W`, and store the result in the documents dictionary as `{1: "18 Editions of the … this country and abroad."}`.

In [4]:
def read_documents():
  file = open("cisi/CISI.ALL")
  merged = ""

  for a_line in file.readlines():
    # Unless a string starts with a new field identifier, add the content to the current field separating the content from the previous line with a whitespace; 
    #votherwise, start a new line with the next identifier and field.
    if a_line.startswith("."):
      merged += "\n" + a_line.strip()
    else:
      merged += " " + a_line.strip()

    documents = {}
    content = ""
    doc_id = ""

    for a_line in merged.split("\n"):
      if a_line.startswith(".I"):
        doc_id = a_line.split(" ")[1].strip()  # doc_id can be extracted from the line with the .I field identifier
      elif a_line.startswith(".X"):
        documents[doc_id] = content
        content = ""
        doc_id = ""
      else:
        content += a_line.strip()[3:] + " "  # Otherwise, keep extracting the content from other fields (.T, .A and .W) removing the field identifiers themselves

  file.close()   
  return documents

As a sanity check, print out the size of the dictionary (make sure it contains all 1460 articles) and print out the content of the very first article.

In [5]:
documents = read_documents()
print(len(documents))
print(documents.get("1"))

1460
 18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 


The queries are stored in `CISI.QRY` file and follow a very similar format: half the time, you see only two fields – `.I` for the unique identifier and `.W` for the content of the query. 

Other queries though are formulated not as questions but rather as abstracts from other articles. In such cases, the query also has an `.A` field for the authors’ list, `.T` for the title and `.B` field, which keeps the reference to the original journal in which the abstract was published.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/7.png?raw=1' width='800'/>

We are going to only focus on the unique identifiers and the content of the query itself (fields `.W` and `.T`, where available), so the code below is quite similar to the above as it allows you to populate the queries dictionary with data:

In [6]:
def read_queries():
  file = open("cisi/CISI.QRY")
  merged = ""

  for a_line in file.readlines():
    # Unless a string starts with a new field identifier, add the content to the current field separating the content from the previous line with a whitespace; 
    #votherwise, start a new line with the next identifier and field.
    if a_line.startswith("."):
      merged += "\n" + a_line.strip()
    else:
      merged += " " + a_line.strip()

    queries = {}
    content = ""
    query_id = ""

    for a_line in merged.split("\n"):
      if a_line.startswith(".I"):
        if not content == "":
          queries[query_id] = content
          content = ""
          query_id = ""
        query_id = a_line.split(" ")[1].strip()  # query_id can be extracted from the line with the .I field identifier
      elif a_line.startswith(".W"):
        content += a_line.strip()[3:] + " "  # Otherwise, keep adding content to the content variable

    # The very last query is not followed by any next .I field, so the strategy from above won’t work –
    # you need to add the entry for the last query to the dictionary using this extra step
    queries[query_id] = content

  file.close()   
  return queries

In [7]:
queries = read_queries()

# Print out the length of the dictionary (it should contain 112 entries), and the content of the very first query
print(len(queries))
print(queries.get("1"))

112
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles? 


Finally, let's read in the mapping between the queries and the documents – we'll keep these in the mappings data structure – with tuples where each query index (key) corresponds to the list of one or more document indices (value):

In [8]:
def read_mappings():
  file = open("cisi/CISI.REL")

  mappings = {}

  for a_line in file.readlines():
    voc = a_line.strip().split()
    key = voc[0].strip()
    current_value = voc[1].strip()  # The key (query id) is stored in the first column, while the document id is stored in the second column
    value = []
    """
    If the mappings dictionary already contains some document ids for the documents matching the given
    query, you need to update the existing list with the current value; otherwise just add current value to the new list
    """
    if key in mappings.keys():
      value = mappings.get(key)
    value.append(current_value)
    mappings[key] = value

  file.close()   
  return mappings

In [9]:
mappings = read_mappings()
print(len(mappings))
print(mappings.keys())
print(mappings.get("1"))

76
dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '37', '39', '41', '42', '43', '44', '45', '46', '49', '50', '52', '54', '55', '56', '57', '58', '61', '62', '65', '66', '67', '69', '71', '76', '79', '81', '82', '84', '90', '92', '95', '96', '97', '98', '99', '100', '101', '102', '104', '109', '111'])
['28', '35', '38', '42', '43', '52', '65', '76', '86', '150', '189', '192', '193', '195', '215', '269', '291', '320', '429', '465', '466', '482', '483', '510', '524', '541', '576', '582', '589', '603', '650', '680', '711', '722', '726', '783', '813', '820', '868', '869', '894', '1162', '1164', '1195', '1196', '1281']


That’s it – you have successfully initialized one dictionary for documents with the ids linked to the articles content, another dictionary for queries linking queries ids to their correspondent texts, and the mappings dictionary, which matches the queries ids to the lists of relevant document ids.

Now, you are all set to start implementing the search algorithm for this data.

### Boolean search algorithm

Let’s start with the simplest approach: the information need is formulated as a query. If you extract the words from the query, you can then search for the documents that contain these words and return these documents, as they should be relevant to the query.

Here is the algorithm in a nutshell:
- Extract the words from the query
- For each document, compare the words in the document to the words in the query
- Return the document as relevant if any of the query words occurs in the document

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/8.png?raw=1' width='800'/>

The very first step in this algorithm is extraction of the words from both queries and documents. You may recall that text comes in as a sequence of symbols or characters, and the machine needs to be told what a word is – you used a special NLP tool called tokenizer to extract words.

In [None]:
def get_words(text):
  word_list = [word for word in word_tokenize(text.lower())]  # Text is converted to lower case and split into words
  return word_list

In [None]:
doc_words = {}
query_words = {}

# Entries in both documents and queries are represented as word lists
for doc_id in documents.keys():
  doc_words[doc_id] = get_words(documents.get(doc_id))
for qry_id in queries.keys():
  query_words[qry_id] = get_words(queries.get(qry_id))  

# check out the length of the dictionaries (these should be the same as before – 1460 and 112), 
# and check what words are extracted from the first document and the first query
print(len(doc_words))
print(doc_words.get("1"))
print(len(doc_words.get("1")))

print(len(query_words))
print(query_words.get("1"))
print(len(query_words.get("1")))

1460
['18', 'editions', 'of', 'the', 'dewey', 'decimal', 'classifications', 'comaromi', ',', 'j.p.', 'the', 'present', 'study', 'is', 'a', 'history', 'of', 'the', 'dewey', 'decimal', 'classification', '.', 'the', 'first', 'edition', 'of', 'the', 'ddc', 'was', 'published', 'in', '1876', ',', 'the', 'eighteenth', 'edition', 'in', '1971', ',', 'and', 'future', 'editions', 'will', 'continue', 'to', 'appear', 'as', 'needed', '.', 'in', 'spite', 'of', 'the', 'ddc', "'s", 'long', 'and', 'healthy', 'life', ',', 'however', ',', 'its', 'full', 'story', 'has', 'never', 'been', 'told', '.', 'there', 'have', 'been', 'biographies', 'of', 'dewey', 'that', 'briefly', 'describe', 'his', 'system', ',', 'but', 'this', 'is', 'the', 'first', 'attempt', 'to', 'provide', 'a', 'detailed', 'history', 'of', 'the', 'work', 'that', 'more', 'than', 'any', 'other', 'has', 'spurred', 'the', 'growth', 'of', 'librarianship', 'in', 'this', 'country', 'and', 'abroad', '.']
113
112
['what', 'problems', 'and', 'concerns',

Now let’s code the simple search algorithm. We will refer to it as the
Boolean search algorithm since it relies on presence (1) or absence (0) of the query words in the documents:

In [None]:
def retrieve_documents(doc_words, query):
  docs = []
  query_word = []
  for doc_id in doc_words.keys():
    found = False
    i = 0
    # Keep iterating through the words in the query word list until either of the two conditions is satisfied
    while i < len(query) and not found:
      word = query[i]
      if word in doc_words.get(doc_id):
        docs.append(doc_id)
        query_word.append(word)
        found = True
      else:
        i += 1
  return (docs, query_word)

In [None]:
# Check the results: select a query by its id (e.g., query with id 3 here), print out the ids of the documents
# that the algorithm found (e.g., the first 100, as there may be many),check how many there are in total
docs, query_word = retrieve_documents(doc_words, query_words.get("3"))
print(docs[:100])
print(len(docs))
print(query_word)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102']
1397
['is', 'is', 'is', 'information', 'is', 'is', '.', 'is', 'definitions', 'is', 'is', 'information', 'what', 'is', 'information', 'is', 'what', 'is', 'is', 'is', 'is', 'is', 'information', 'what', 'is', 'is', 'information', 'what', 'is', 'is', 'is', 'information', 'information', 'is', '.', '.', 'is', 'is', '.', 'is', '.', 'is', 'is', 'is', 'is', 'is', 'is', 'is', '.', 'is', 'is', 'information', 'inf

In [None]:
# Let’s, for example, look into how the algorithm decided on the documents relevant for query with id 6
docs, query_word = retrieve_documents(doc_words, query_words.get("6"))
print(docs[:100])
print(len(docs))
print(query_word)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100']
1460
['there', 'are', 'are', 'for', 'for', 'are', 'for', 'there', 'for', 'are', 'are', 'for', 'what', 'are', 'communication', 'are', 'what', 'and', 'for', 'for', 'for', 'possibilities', 'are', 'what', 'are', 'are', 'for', 'what', 'are', 'are', 'are', 'are', 'between', 'are', 'there', ',', 'and', ',', 'between', 'for', 'for', 'for', 'are', 'for', 'are', 'there', 'are', 'are', 'are', 'are', 'are', 'for', '

As it shows, the query is matched to the document based on occurrence of such words as “there”, “this”, “the”, “and”, “is” and even a comma since punctuation marks are part of the word list returned by the tokenizer:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/9.png?raw=1' width='800'/>

On the face of it, there is a considerable word overlap between the query and the document, yet if you read the text of the query and the text of the document, they don’t seem to have any ideas in common, so in fact this document is not relevant for the given query at all! 

It seems like the words on the basis of which the query and the document are matched here are simply the wrong ones – they are somewhat irrelevant to the actual information need expressed in the query. 

How can you make sure that the query and the documents are matched on the basis of more meaningful words?


---
**Exercise**

Another way to match the documents to the queries would be to make it a requirement that the document should contain all the words from the query rather than any.

Is this a better approach? Modify the code of the simple Boolean search algorithm to match documents to the queries on the basis of all words, and compare the results.

---

You will notice that it is rarely the case that a document, even if it is generally relevant,
contains all words from the query (at the very least, it does not have to contain question
words like “what” and “which” from the query to be relevant). Therefore, this more
conservative approach of returning only the documents with all query words in them will
work even worse at this stage – it simply will not find any relevant documents.



In [None]:
def retrieve_documents(doc_words, query):
  docs = []
  for doc_id in doc_words.keys():
    # here, you are interested in the documents that contain all words
    found = True
    i = 0
    # iterate through words in the query
    while i < len(query) and found:
      word = query[i]
      if not word in doc_words.get(doc_id):
        # if the word is not in document, turn found flag off and stop
        found = False
      else:
        # rwise, move on to the next query word
        i += 1

    # if all words are found in the document, the last index is len(query)-1 add the doc_id only in this case
    if i == len(query) - 1:
      docs.append(doc_id)
  return docs

In [None]:
docs = retrieve_documents(doc_words, query_words.get("112"))
print(docs[:100])
print(len(docs))

[]
0


In fact, it is a very rare case that you may have any single document that contains all the words from the query, therefore, with this approach, you will likely get no relevant documents returned for any queries in this dataset.


Before we move on, let’s summarize which steps of the algorithm you have implemented so far: you have read the data, initialized the data structures and tokenized the texts.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/10.png?raw=1' width='800'/>

##Step 2: Preprocessing the data

Since we have identified several weaknesses of the current algorithm. 

Let’s look into further preprocessing steps that will help you represent the content of both the documents and the queries in a more informative way.

###Preselecting the words that matter: stopwords removal

The main problem with the search algorithm identified so far is that it considers all words in the queries and documents as equally important. This leads to poor search results, but on top of that it is also intuitively incorrect.

You may notice that not all words are equally meaningful in the sentences above. A good test for that would be to ask yourself whether you can define in one phrase what a particular word means: for example, what does “the” mean? You can say that “the” does not have a precise meaning of its own, rather it serves a particular function.

In linguistic terms, such words are called function words. You might even notice that when you read a text, for example an article or an email, you tend to skim over such words without paying much attention to them.

What happens to the search algorithm when these words are present? You have seen in the example before that they don’t help identify the relevant texts, so in fact the algorithm’s effort is wasted on them. What would happen if the less meaningful words were not taken into consideration?

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/11.png?raw=1' width='800'/>

You can see that, were the less meaningful words removed before matching documents to queries, document 1 would not stand a chance – there is simply not a single word overlapping between the query and this document. You can also see that the words that are not grayed out concisely summarize the main idea of the text.

**This suggests the first improvement to the developed algorithm: let’s remove the less meaningful words. In NLP applications, the less meaningful words are called stopwords,** and luckily you don’t have to bother with enumerating them – since stopwords are highly repetitive in English, most NLP toolkits have a specially defined stopwords list, so you can rely on this list when processing the data, unless you want to customize it.





In [10]:
def process(text):
  stoplist = set(stopwords.words("english"))  # use English stopwords
  # Tokenize text, convert it to lower case and only add the words if they are not included in the stoplist and are not punctuation marks
  word_list = [word for word in word_tokenize(text.lower())
               if not word in stoplist and not word in string.punctuation
              ]

  return word_list

In [11]:
# Check the result of these preprocessing steps on some documents or queries, e.g. document 1
word_list = process(documents.get("1"))
print(word_list)

['18', 'editions', 'dewey', 'decimal', 'classifications', 'comaromi', 'j.p.', 'present', 'study', 'history', 'dewey', 'decimal', 'classification', 'first', 'edition', 'ddc', 'published', '1876', 'eighteenth', 'edition', '1971', 'future', 'editions', 'continue', 'appear', 'needed', 'spite', 'ddc', "'s", 'long', 'healthy', 'life', 'however', 'full', 'story', 'never', 'told', 'biographies', 'dewey', 'briefly', 'describe', 'system', 'first', 'attempt', 'provide', 'detailed', 'history', 'work', 'spurred', 'growth', 'librarianship', 'country', 'abroad']


That is, the preprocessing step helps removing the stopwords like “of” and “the” from the word list.

###Matching forms of same word: morphological processing

One effect that stopwords and punctuation marks removal has is optimization of search algorithm – the words that do not matter much are removed, so the computational resources are not wasted on them. In general, the more concise and the more informative the data representation is, the better.

This brings us to the next issue. Take a look at the query with id 15
and document with id 27, which are a match according to the ground truth mappings:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/11.png?raw=1' width='800'/>
> The words highlighted in blue will be matched between the query and the document; the ones in red will be missed.

The reason for this mismatch is that words may take different forms in different contexts: some contexts may require a mention of a single object or concept like “system”, while others may need multiple “systems” to be mentioned. 

Such different forms of a word that depend on the context and express different aspects of meaning, for instance multiplicity of “systems”, are technically called morphological forms, and when you see a word like “systems” and try to match it to its other variant “system” you are dealing with morphology.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/12.png?raw=1' width='800'/>

Stemming takes word matching one step further and tries to map related words across the board, and this means not just the forms of the very same word. For that, the stemmers rely on a set of rules that try to reduce the related words to the same basic core.

The stem in `{retrieve, retrieves, retrieved, retrieving, retrieval}` is retriev. So here is the difference with the technique that you used before – stemming might result in non-words, as for example, you
won’t find a word like retriev in a dictionary. To provide you with a couple of other examples, the stem for `{expect, expects, expected, expecting, expectation, expectations}` is expect and the stem for `{continue, continuation, continuing}` is continu.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/03-information-retrieval/images/13.png?raw=1' width='800'/>

Note that the stemmer tries to identify which part of the word is shared between the different forms and related words, and returns this part as a stem by cutting off the differing word endings.

Now let’s implement the stemming preprocessing step using NLTK’s stemming
functionality.



##Step 3: Extract and normalize the features

Once the words are extracted from running text, you need to convert them into features. In particular, you need to put all words into lower case to make your algorithm establish the connection between different formats like “Lottery” and “lottery”.

Putting all strings to lower case can be achieved with Python’s string functionality. To extract the features (words) from the text, you need to iterate through the recognized words and put all words to lower case.

In [None]:
def get_features(text):
  features = {}
  word_list = [word for word in word_tokenize(text.lower())]
  # For each word in the email let’s switch on the ‘flag’ that this word is contained in the email
  for word in word_list:
    features[word] = True
  
  return features

In [None]:
# it will keep tuples containing the list of features matched with the “spam” or “ham” label for each email
all_features = [(get_features(email), label) for (email, label) in all_emails]

print(get_features("Participate In Our New Lottery NOW!"))

print(len(all_features))
print(len(all_features[0][0]))
print(len(all_features[99][0]))

{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
29
18


With this bit of code, you iterate over the emails in your collection (all_emails) and store the list of features extracted from each email matched with the label.

For example, if a spam email consists of a single sentence “Participate In Our New Lottery NOW!” your algorithm will first extract the list of features present in this email and assign a ‘True’ value to each of them.

Then, the algorithm will add this list of features to
all_features together with the “spam” label.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/1.png?raw=1' width='800'/>

Imagine your whole dataset contained only one spam text “Participate In Our New Lottery NOW!” and one ham text “Participate in the Staff Survey”. What features will be extracted from this dataset?

You will end up with the following feature set:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/2.png?raw=1' width='800'/>

Let’s now clarify what each tuple structure representing an email contains. Tuples pair up two information fields: in this case a list of features extracted from the email and its label, i.e. each tuple in `all_features` contains a pair (`list_of_features`, `label`).

So if you’d like to access first email in the list, you call on `all_features[0]`, to access its list of features you use `all_features[0][0]`, and to access its label you use `all_features[0][1]`.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/3.png?raw=1' width='800'/>





In [None]:
# access first email in the list with feature and label
all_features[0]

({'!': True,
  ',': True,
  '.': True,
  ':': True,
  'am': True,
  'and': True,
  'fl': True,
  'for': True,
  'friends': True,
  'from': True,
  'hey': True,
  'hi': True,
  'homepage': True,
  'i': True,
  'is': True,
  'jane': True,
  'last': True,
  'looking': True,
  'miami': True,
  'my': True,
  'name': True,
  'new': True,
  'photos': True,
  'see': True,
  'subject': True,
  'webcam': True,
  'weblog': True,
  'with': True,
  'you': True},
 'spam')

In [None]:
# access its list of features only
all_features[0][0]

{'!': True,
 ',': True,
 '.': True,
 ':': True,
 'am': True,
 'and': True,
 'fl': True,
 'for': True,
 'friends': True,
 'from': True,
 'hey': True,
 'hi': True,
 'homepage': True,
 'i': True,
 'is': True,
 'jane': True,
 'last': True,
 'looking': True,
 'miami': True,
 'my': True,
 'name': True,
 'new': True,
 'photos': True,
 'see': True,
 'subject': True,
 'webcam': True,
 'weblog': True,
 'with': True,
 'you': True}

In [None]:
# access its label only
all_features[0][1]

'spam'

##Step 4: Train the classifier

Next, let’s apply machine learning and teach the machine to distinguish between the features that describe each of the two classes. There are a number of classification algorithms that you can use, let’s start with one of the most interpretable ones – an algorithm called **Naïve Bayes**. Don’t be misled by the word “Naïve” in its title, though: despite relative simplicity of the approach compared to other ones, this algorithm often works well in practice
and sets a competitive performance baseline that is hard to beat with more sophisticated approaches.

Naïve Bayes is a probabilistic classifier, which means that it makes the class prediction based on the estimate of which outcome is most likely: i.e., it assesses the probability of an
email being spam and compares it with the probability of it being ham, and then selects the outcome that is most probable between the two.

In the previous step, you extracted the content of the email and converted it into a list of individual words (features). In this step, the machine will try to predict whether the email content represents spam or ham. 

In other words, it will try to predict whether the email is spam or ham given or conditioned on its content. This type of probability, when the outcome (class of “spam” or “ham”) depends on the condition (words used as features), is called conditional probability. 

For spam detection, you estimate `P(spam | email content)` and `P(ham | email content)`, or generally `P(outcome | (given) condition)`.Then you compare one estimate to another and return the most probable class.

```python
If P(spam | content) = 0.58 and P(ham | content) = 0.42, predict spam
If P(spam | content) = 0.37 and P(ham | content) = 0.63, predict ham
```

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/4.png?raw=1' width='800'/>

A machine can estimate the probability that an email is spam or ham conditioned on its content taking the number of times it has seen this content leading to a particular outcome.

```
P(spam | "Participate in our lottery now!") = (number of emails "Participate in our lottery now!" that are spam) / (total number of emails "Participate in our lottery now!", either spam or ham)

P(ham | "Participate in our lottery now!") = (number of emails "Participate in our lottery now!" that are ham) / (total number of emails "Participate in our lottery now!", either spam or ham)
```

In the general form, this can be expressed as:

```
P(outcome | condition) = number_of_times(condition led to outcome) number_of_times(condition applied)
```

Remember that you used tokenization to split long texts into separate words to let the algorithm access the smaller bits of information – words rather than whole sequences. The idea of estimating probabilities based on separate features rather than based on the whole sequence of features (whole text) is somewhat similar.

In the previous step, you converted this single text into a set of features as:

```['participate': True, 'in': True, …, 'now': True, '!': True]``` 

Note that the conditional probabilities like:

``` 
P(spam| "Participate in our lottery now!") and P(spam| ['participate': True,
‘in’: True, …, ‘now’: True, ‘!’: True])
```

are the same because this set of features encodes the text.

Is there a way to split this set to get at more fine-grained, individual probabilities, for example to establish a link between `[‘lottery’: True]` and the class of “spam”?

Unfortunately, there is no way to split the conditional probability estimation like `P(outcome | conditions)` when there are multiple conditions specified, however it is possible to split the probability estimation like `P(outcomes | condition)` when there is a single condition and multiple outcomes.

In spam detection, the class is a single value (it is “spam” or “ham”), while features are a set `([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’:
True])`. If you can flip around the single value of class and the set of features in such a way that the class becomes the new condition and the features become the new outcomes, you can split the probability into smaller components and establish the link between individual features like `[‘lottery’: True]` and class values like “spam”.


<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/5.png?raw=1' width='800'/>

Luckily, there is a way to flip the outcomes (class) and conditions (features extracted from the content) around!

Let’s look into the estimation of conditional probabilities again: you
estimate the probability that the email is spam given that its content is “Participate in our new lottery now!” based on how often in the past an email with such content was spam. For that, you take the proportion of the times you have seen “Participate in our new lottery now!” in a spam email among the emails with this content.

```
P(spam | "Participate in our new lottery now!") = P("Participate in our new lottery now!" is used in a spam email) / P("Participate in our new lottery now!" is used in an email)
```

Similarly to how you estimated the probabilities above, you need the proportion of times you have seen “Participate in our new lottery now!” in a spam email among all spam emails.

```
P("Participate in our new lottery now!" | spam) = P("Participate in our new lottery now!" is used in a spam email) / P(an email is spam)
```

That is, every time you use conditional probabilities, you need to
divide how likely it is that you see the condition and outcome together by how likely it is that you see the condition on its own – this is the bit after |.

Now you can see that both Formulas 1 and 2 rely on how often you see particular content in an email of particular class. They share this bit, so you can use it to connect the two formulas. For instance, from Formula 2 you know that:

```
P("Participate in our new lottery now!" is used in a spam email) = P("Participate in our new lottery now!" | spam) * P(an email is spam)
```

Now you can fit this into Formula 1:

```
P(spam | "Participate in our new lottery now!") = P("Participate in our new lottery now!" is used in a spam email) / P("Participate in our new lottery now!" is used in an email) = [P("Participate in our new lottery now!" | spam) * P(an email is spam)] / P("Participate in our new lottery now!" is used in an email)
```

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/6.png?raw=1' width='800'/>

In the general form:

```
P(class | content) = P(content represents class) / P(content) = [P(content | class) * P(class)] / P(content)
```

In other words, you can express the probability of a class given email content via the probability of the content given the class.

Now you can replace the conditional probability of `P(class | content)` with `P(content | class)`, e.g. whereas before you had to calculate `P(“spam” | “Participate in our new lottery now!”)` or equally `P(“spam” | [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True])`, which is hard to do because you will often end up with too few examples of exactly the same email content or exactly the same combination of features, now you can estimate `P([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True] | “spam”)` instead.

But how does this solve the problem? Aren’t you still dealing with a long sequence of features?

Here is where the “naïve” assumption in Naïve Bayes helps: it assumes that the features are independent of each other, or that your chances of seeing a word “lottery” in an email are independent of seeing a word “new” or any other word in this email before. So you can estimate the probability of the whole sequence of features given a class as a product of probabilities of each feature given this class.

```
P([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True] | “spam”) = P(‘participate’: True | “spam”) * P(‘in’: True | “spam”) … * P(‘!’: True | “spam”)
```

If you express `[‘participate’: True]` as the first feature in the feature list, or `f1`, `[‘in’: True]` as `f2`, and so on, until `fn = [‘!’: True]`, you can use the general formula:

```
P([f1, f2, …, fn] | class) = P(f1 | class) * P(f2| class) … * P(fn| class)
```

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/7.png?raw=1' width='800'/>

Now that you have broken down the probability of the whole feature list given class into the probabilities for each word given that class, how do you actually estimate them?

Since for each email you note which words occur in it, the total number of times you can switch on the flag `[‘feature’: True]` equals the total number of emails in that class, while the actual number of times you switch on this flag is the number of emails where this feature is actually
present. The conditional probability `P(feature | class)` is simply the proportion of the two:

```
P(feature | class) = number(emails in class with feature present) / total_number(emails in class)
```

These numbers are easy to estimate from the training data – let’s try to do that with an example.

>Suppose you have 5 spam emails and 10 ham emails. What are the conditional probabilities for P('prescription':True | spam), P('meeting':True | ham), P('stock':True | spam) and P('stock':True | ham), if:
- 2 spam emails contain word prescription
- 1 spam email contains word stock
- 3 ham emails contain word stock
- 5 ham emails contain word meeting

**Solution:**

The probabilities are simply:
- P('prescription':True | spam) = number(spam emails with 'prescription')/number(spam emails) = 2/5 = 0.40
- P('meeting':True | ham) = 5/10 = 0.50
- P('stock':True | spam) = 1/5 = 0.20
- P('stock':True | ham) = 3/10 = 0.30

Let’s iterate through the classification steps again: during the training phase, the algorithm learns prior class probabilities (this is simply
class distribution, e.g. `P(ham)=0.71 and P(spam)=0.29)` and probabilities for each feature given each of the classes (this is simply the proportion of emails with each feature in each class, e.g. `P(‘meeting’:True | ham) = 0.50)`. 

During test phase, or when the algorithm is applied to a new email and is asked to predict its class, the following comparison from the beginning of this section is applied:

```
Predict “spam” if P(spam | content) > P(ham | content) 
Predict “ham” otherwise
```

This is what we started with originally, but we said that the conditions are flipped, so it becomes:

```
Predict “spam” if P(content | spam) * P(spam) / P(content) > P(content | ham) * P(ham) / P(content)

Predict “ham” otherwise
```

Note that we end up with `P(content)` in denominator on both sides of the expression, so the absolute value of this probability doesn’t matter and it can be removed from the expression altogether. So we can simplify the expression as:

```
Predict “spam” if P(content | spam) * P(spam) > P(content | ham) * P(ham)
Predict “ham” otherwise
```

`P(spam)` and `P(ham)` are class probabilities estimated during training, and `P(content | class)`, using naïve independence assumption, are products of probabilities, so:

```
Predict “spam” if P([f1, f2, …, fn]| spam) * P(spam) > P([f1, f2, …, fn]| ham) * P(ham)
Predict “ham” otherwise
```

is split into the individual feature probabilities as:

```
Predict “spam” if P(f1 | spam) * P(f2| spam) … * P(fn| spam) * P(spam) > P(f1 | ham) * P(f2| ham) … * P(fn| ham) * P(ham)
Predict “ham” otherwise
```

This is the final expression the classifier relies on. The following code implements this idea.
Since Naïve Bayes is frequently used for NLP tasks, NLTK comes with its own
implementation, too, and here you are going to use it.





In [None]:
def train(features, proportion):
  train_size = int(len(features) * proportion)
  # Use the first n% (according to the specified proportion) of emails with their features for training, and the rest for testing
  train_set, test_set = features[: train_size], features[train_size:]
  print(f"Training set size = {str(len(train_set))} emails")
  print(f"Test set size = {str(len(test_set))} emails")

  classifier = NaiveBayesClassifier.train(train_set)

  return train_set, test_set, classifier

In [None]:
# Apply the train function using 80% (or a similar proportion) of emails for training. 
train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails


##Step 5: Evaluate your classifier

Finally, let’s evaluate how well the classifier performs in detecting whether an email is spam or ham. For that, let’s use the accuracy score returned by the NLTK’s classifier:

In [None]:
def evaluate(train_set, test_set, classifier):
  print(f"Accuracy on the training set = {str(classify.accuracy(classifier, train_set))}")
  print(f"Accuracy of the test set = {str(classify.accuracy(classifier, test_set))}")

  # inspect the most informative features (words). You need to specify the number of the top most informative features to look into, e.g. 50 here
  classifier.show_most_informative_features(50)

In [None]:
evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.95987430505197
Accuracy of the test set = 0.9497584541062802
Most Informative Features
               forwarded = True              ham : spam   =    204.3 : 1.0
                    2004 = True             spam : ham    =    141.9 : 1.0
                    2001 = True              ham : spam   =    130.8 : 1.0
            prescription = True             spam : ham    =    127.8 : 1.0
                     nom = True              ham : spam   =    125.5 : 1.0
                    pain = True             spam : ham    =    107.4 : 1.0
                     ect = True              ham : spam   =    106.9 : 1.0
                    spam = True             spam : ham    =     90.1 : 1.0
                  health = True             spam : ham    =     87.0 : 1.0
                featured = True             spam : ham    =     74.5 : 1.0
              nomination = True              ham : spam   =     73.9 : 1.0
                  differ = True             spam : ham 

As you can see, many spam emails in this dataset are related to medications,
which shows a particular bias – the most typical spam that you personally get might be on a different topic altogether! What effect might this mismatch between the training data from the publicly available dataset like Enron and your personal data have?

One other piece of information presented in this output is accuracy. Test accuracy shows the proportion of test emails that are correctly classified by Naïve Bayes among all test emails.

Note, that since the classifier is trained on the training data, it actually gets to “see” all the correct labels for the training examples. 

Shouldn’t it then know the correct answers and perform at 100% accuracy on the training data?

Well, the point here is that the classifier doesn’t just
retrieve the correct answers: during training it has built some probabilistic model (i.e., learned about the distribution of classes and the probability of different features), and then it applies this model to the data. So, it is actually very likely that the probabilistic model doesn’t capture all the things in the data 100% correctly.

Therefore, when you run the code above, you will get accuracy on the training data of `96.13%`. This is not perfect (i.e., not `100%`) but very close to it! When you apply the same classifier to new data – the test set that the classifier hasn’t seen during training – the accuracy reflects its generalizing ability. That is, it shows whether the probabilistic assumptions it made based on the training data can be successfully applied to any other data. The accuracy on the test set is `94.20%`, which is slightly lower than that on the training set, but is also very high.

Finally, if you’d like to gain any further insight into how the words are used in the emails from different classes, you can also check the occurrences of any particular word in all available contexts.

For example, word “stocks” features as a very strong predictor of spam
messages. Why is that? You might be thinking, “OK, some emails containing “stocks” will be spam, but surely there must be contexts where “stocks” is used in a completely harmless way?”

In [None]:
def concordance(data_list, search_word):
  for email in data_list:
    word_list = [word for word in word_tokenize(email.lower())]
    text_list = Text(word_list)

    """
    “Concordancer” is a tool that checks for the occurrences of the specified word and prints out the word in
    its context. By default, NLTK’s concordancer prints out the search_word surrounded by the previous
    36 and the following 36 characters – so note, that it doesn’t always result in full words
    """
    if search_word in word_list:
      text_list.concordance(search_word)

In [None]:
# Apply this function to two lists – ham_list and spam_list – to find out about the different contexts of use for the word “stocks”
print("STOCKS in HAM:")
concordance(ham_list, "stocks")
print("\n\nSTOCKS in SPAM:")
concordance(spam_list, "stocks")

STOCKS in HAM:
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ad my portfolio is diversified into stocks that have lost even more money than
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files


STOCKS in SPAM:
Displaying 5 of 5 matches:
5 where were you when the following stocks exploded : scos : exploded from . 3
d . 80 on friday . face it . little stocks can mean big gains for you . this r
might occur . as with many microcap stocks , today ' s company has additional 
his email pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this report . none 
Displaying 4 of 4 matches:
hree days . play of the week tracks stocks on downward trends , foresees botto
mark is our unc

If you run this code and print out the contexts for “stocks”, you will find out that “stocks” feature in only 4 ham contexts (e.g., an email reminder “Follow your stocks and news headlines”) as compared to hundreds of spam contexts including “Stocks to play”, “Big money was made in these stocks”, “Select gold mining stocks”, “Little stocks can mean big gains for you”, and so on.

##Deploying your spam filter in practice

For instance, the classifier that you’ve built performs at 94% accuracy, so
you can expect it to classify real emails into spam and ham quite accurately. It’s time to deploy it in practice then. When you run it on some new emails (perhaps, some from your own inbox) you need to perform the same steps on these emails as before, that is:

- you need to read them in, then
- you need to extract the features from these emails, and finally
- you need to apply the classifier that you trained before on these emails.

In [None]:
# Feel free to provide your own examples
test_spam_list = [
  "Participate in our new lottery!",
  "Try out this new medicine"
]
test_ham_list = [
  "See the minutes from the last meeting attached", 
  "Investors are coming to our office on Monday"
]

# Read the emails extracting their textual content and keeping the labels for further evaluation
test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

# Extract the features
new_test_set = [(get_features(email), label) for (email, label) in test_emails]

# Apply the trained classifier and evaluate its performance
evaluate(train_set, new_test_set, classifier)

Accuracy on the training set = 0.95987430505197
Accuracy of the test set = 1.0
Most Informative Features
               forwarded = True              ham : spam   =    204.3 : 1.0
                    2004 = True             spam : ham    =    141.9 : 1.0
                    2001 = True              ham : spam   =    130.8 : 1.0
            prescription = True             spam : ham    =    127.8 : 1.0
                     nom = True              ham : spam   =    125.5 : 1.0
                    pain = True             spam : ham    =    107.4 : 1.0
                     ect = True              ham : spam   =    106.9 : 1.0
                    spam = True             spam : ham    =     90.1 : 1.0
                  health = True             spam : ham    =     87.0 : 1.0
                featured = True             spam : ham    =     74.5 : 1.0
              nomination = True              ham : spam   =     73.9 : 1.0
                  differ = True             spam : ham    =     71.3 :

The classifier that you’ve trained performs with 100% accuracy on these examples. Good! How can you print out the predicted label for each particular email though?

For that, you simply extract the features from the email content and print out the label, i.e. you don’t need to run the full evaluation with the accuracy calculation.


In [None]:
for email in test_spam_list:
  print(email)
  print(classifier.classify(get_features(email)))

for email in test_ham_list:
  print(email)
  print(classifier.classify(get_features(email)))

Participate in our new lottery!
spam
Try out this new medicine
spam
See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham


Let’s summarize what you have covered.

You have learned how build a classifier in five steps:

1. the emails should be read, and the two classes should be clearly defined for the machine to learn from.
2. the text content should be extracted.
3. then the content should be converted into features.
4. the classifier should be trained on the training set of the data.
5. finally, the classifier should be evaluated on the test set

There are a number of machine learning classifiers, and you’ve applied one of the most interpretable of them – Naïve Bayes. Naïve Bayes is a probabilistic
classifier: it assumes that the data in two classes is generated by different probability distributions, which are learned from the training data. Despite its simplicity and “naïve” feature independence assumption, Naïve Bayes often performs well in practice, and sets competitive baseline for other more sophisticated algorithms.



##Assignment

Apply the trained classifier to a different dataset, for example to `enron2/spam` and ham emails that originate with a different owner (check `Summary.txt` for more information). For that you need to:

- read the data from the `spam/` and `ham/` subfolders in `enron2/`
- extract the textual content and convert it into features
- evaluate the classifier

What do the results suggest? 

Hint: one man’s spam may be another man’s ham. If you are not satisfied with the results, try combining the data from the two owners in one dataset.

In [None]:
test_spam_list = read_files("enron2/spam/")
print(len(test_spam_list))
print(test_spam_list[0])

test_ham_list = read_files("enron2/ham/")
print(len(test_ham_list))
print(test_ham_list[0])

test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

random.shuffle(test_emails)

new_test_set = [(get_features(email_content), label) for email_content, label in test_emails]

evaluate(train_set, new_test_set, classifier)

1496
Subject: help
Television in 1919 by seat to my knoweledge. Chrono cross in 1969
4361
Subject: re: eol
Clayton,
Great news. I would like to sit down with you, tom and stinson and review
Where
We are with this project. Also, I would like to talk to you about your
Status (finalizing
The transfer to another group).
Vince
Clayton vernon@ enron
01/18/2001 03: 21 pm
To: vasant shanbhogue/hou/ect@ ect
Cc: stinson gibner/hou/ect@ ect, vince j kaminski/hou/ect@ ect
Subject: eol
Vasant -
Dave delaney called an hour ago. He needed a statistic from eol that the eol
Folks couldn' t give him (it seems they had a database problem in 1999), and
The grapevine had it we had the data. Tom barkley was able to give him the
Data he needed for his presentation, within a matter of 10 minutes or so.
Clayton
Accuracy on the training set = 0.95987430505197
Accuracy of the test set = 0.7611405156223322
Most Informative Features
               forwarded = True              ham : spam   =    

Now, we will combine the two datasets.

In [None]:
spam_list = read_files("enron1/spam/") + read_files("enron2/spam/")
print(len(spam_list))

ham_list = read_files("enron1/ham/") + read_files("enron2/ham/")
print(len(ham_list))

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]

random.shuffle(test_emails)

all_features = [(get_features(email_content), label) for email_content, label in all_emails]
print(len(all_features))

train_set, test_set, classifier = train(all_features, 0.8)
evaluate(train_set, new_test_set, classifier)

2996
8033
11029
Training set size = 8823 emails
Test set size = 2206 emails
Accuracy on the training set = 0.9818655786013828
Accuracy of the test set = 0.9820727334813044
Most Informative Features
                   meter = True              ham : spam   =    264.5 : 1.0
                   vince = True              ham : spam   =    200.0 : 1.0
                     nom = True              ham : spam   =    195.6 : 1.0
                     sex = True             spam : ham    =    195.1 : 1.0
            prescription = True             spam : ham    =    169.2 : 1.0
                     ect = True              ham : spam   =    167.7 : 1.0
                    spam = True             spam : ham    =    145.8 : 1.0
               forwarded = True              ham : spam   =    137.7 : 1.0
                     fyi = True              ham : spam   =    137.0 : 1.0
                    2005 = True             spam : ham    =    128.1 : 1.0
                   logos = True             spam : h