# Bag of Words Lab

## Introduction

**Bag of words (BoW)** is an important technique in text mining and [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). BoW represents the content of each text document as a term-frequency vector, which makes it possible to analyze and compare documents with mathematics and computer programs.

BoW contains the following information:

1. A dictionary of all the terms (words) in the text documents. The terms are normalized for letter case (e.g. `Ironhack` => `ironhack`), tense (e.g. `had` => `have`), number (e.g. `students` => `student`), etc.
1. The number of occurrences of each normalized term in each document.

For example, assume we have three text documents:

DOC 1: **Ironhack is cool.**

DOC 2: **I love Ironhack.**

DOC 3: **I am a student at Ironhack.**

The BoW of the above documents looks like this:

| TERM | DOC 1 | DOC 2 | DOC 3 |
|---|---|---|---|
| a | 0 | 0 | 1 |
| am | 0 | 0 | 1 |
| at | 0 | 0 | 1 |
| cool | 1 | 0 | 0 |
| i | 0 | 1 | 1 |
| ironhack | 1 | 1 | 1 |
| is | 1 | 0 | 0 |
| love | 0 | 1 | 0 |
| student | 0 | 0 | 1 |


The term-frequency array of each document in BoW can be treated as a high-dimensional vector. Data scientists use these vectors to represent the content of the documents. For instance, DOC 1 is represented with `[0, 0, 0, 1, 0, 1, 1, 0, 0]`, DOC 2 is represented with `[0, 0, 0, 0, 1, 1, 0, 1, 0]`, and DOC 3 is represented with `[1, 1, 1, 0, 1, 1, 0, 0, 1]`. **Two documents are considered similar if the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of their vector representations is close to 1.**
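To make the comparison concrete, here is a minimal sketch that computes the cosine similarity of the three example vectors in plain Python. The helper name `cosine_similarity` is our own choice for illustration and is not part of the lab.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = [0, 0, 0, 1, 0, 1, 1, 0, 0]
doc2 = [0, 0, 0, 0, 1, 1, 0, 1, 0]
doc3 = [1, 1, 1, 0, 1, 1, 0, 0, 1]

sim_12 = cosine_similarity(doc1, doc2)  # the two docs share only "ironhack"
sim_23 = cosine_similarity(doc2, doc3)  # the two docs share "i" and "ironhack"
print(round(sim_12, 3), round(sim_23, 3))  # 0.333 0.471
```

A value of 1 means the two vectors point in exactly the same direction, and 0 means the documents share no terms at all.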

In practice, many additional techniques improve text mining accuracy, such as removing [stop words](https://en.wikipedia.org/wiki/Stop_words) (i.e. ignoring common words such as `a`, `I`, `to` that don't contribute much meaning), applying synonym lists (e.g. treating `NYC` and `Big Apple` as `New York City`), and stripping HTML tags if the data sources are web pages. In Module 3 you will learn how to use those advanced techniques for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), a component of text mining.
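As a rough sketch of the first two ideas, the snippet below drops stop words and maps synonyms to one canonical term. `STOP_WORDS`, `SYNONYMS`, and `normalize` are hypothetical names with tiny hand-picked contents, used here only for illustration. Real projects rely on much larger curated lists.

```python
# Hypothetical, hand-picked lists for illustration only
STOP_WORDS = {"a", "i", "to", "the", "is", "am", "at"}
SYNONYMS = {"nyc": "new york city", "big apple": "new york city"}

def normalize(text):
    """Lowercase the text, unify synonyms, and drop stop words."""
    text = text.lower()
    # Naive substring replacement; good enough for a sketch
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return [term for term in text.split() if term not in STOP_WORDS]

tokens = normalize("I am a student at Ironhack")
print(tokens)  # ['student', 'ironhack']
```

The plain `str.replace` call can also match inside longer words, so a real implementation would probably match on token boundaries instead.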

In real text mining projects data analysts use packages such as Scikit-Learn and NLTK, which you will learn in Module 3, to extract BoW from texts. In this exercise, however, we would like you to create BoW manually with Python. This is because by manually creating BoW you can better understand the concept and also practice the Python skills you have learned so far.

## The Challenge

We need to create a BoW from a list of documents. The documents (`doc1.txt`, `doc2.txt`, and `doc3.txt`) can be found in the `your-code` directory of this exercise. You will read the content of each document into an array of strings named `corpus`.

*What is a corpus (plural: corpora)? Read the reference in the README file.*

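Your challenge is to use Python to generate the BoW of these documents. The result should look like the following (terms are listed alphabetically here, while the step-by-step cells below keep them in order of first appearance):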

```python
bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']

term_freq = [
    [0, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0, 1],
]
```
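To read the two structures together: position `j` in every row of `term_freq` belongs to `bag_of_words[j]`. The short sketch below pairs them back up into per-term rows like the table in the Introduction. The name `table` is only an illustrative choice.

```python
bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']
term_freq = [
    [0, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0, 1],
]

# zip(*term_freq) transposes the matrix: one tuple of per-document counts per term
table = dict(zip(bag_of_words, zip(*term_freq)))

for term, counts in table.items():
    print(term, counts)
```

For example, `table['ironhack']` is `(1, 1, 1)` because the term appears once in each of the three documents.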

Now let's define the `docs` array that contains the paths of `doc1.txt`, `doc2.txt`, and `doc3.txt`.

In [1]:
docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']

Define an empty array `corpus` that will contain the content strings of the docs. Loop over `docs` and read the content of each doc into the `corpus` array.

In [2]:
# Write your code here
corpus = []

# Read the full text of each document and strip the trailing newline
for doc in docs:
    with open(doc, 'r') as f:
        corpus.append(f.read().strip())

Print `corpus`.

In [3]:
corpus

['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']

You expected to see:

```['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']```

But you actually saw:

```['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']```

This is because you haven't done two important steps:

1. Remove punctuation from the strings

1. Convert strings to lowercase

Write your code below to process `corpus` (convert to lowercase and remove punctuation).

In [4]:
# Write your code here
import re

# Remove punctuation (anything that is not a word character or whitespace),
# then convert each document to lowercase
corpus = [re.sub(r'[^\w\s]', '', doc).lower() for doc in corpus]
print(corpus)

['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']


Now define `bag_of_words` as an empty array. It will be used to store the unique terms in `corpus`.

In [5]:
bag_of_words = []

Loop through `corpus`. In each iteration, do the following:

1. Break the string into an array of terms.
1. Create a sub-loop to iterate over the terms array.
   * In each sub-iteration, check whether the current term is already in `bag_of_words`. If it is not, append it to the array.

In [6]:
# Write your code here
for doc in corpus:
    # Break the string into an array of terms
    terms = doc.split()
    for term in terms:
        # Append only unseen terms, which keeps first-appearance order
        if term not in bag_of_words:
            bag_of_words.append(term)

In [7]:
print(bag_of_words)

['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']


Print `bag_of_words`. You should see: 

```['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']```

If not, fix your code in the previous cell.

Now define an empty array called `term_freq`. Loop over `corpus` a second time. In each iteration, create a sub-loop that iterates over the terms in `bag_of_words` and counts how many times each term appears in the current doc. Append the resulting term-frequency array to `term_freq`.

In [8]:
# Write your code here
term_freq = []

for doc in corpus:
    terms = doc.split()
    # Count how often each term of bag_of_words occurs in this doc
    term_freq.append([terms.count(term) for term in bag_of_words])

term_freq

[[1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 1, 1, 0, 0, 0, 0],
 [1, 0, 0, 1, 0, 1, 1, 1, 1]]

Print `term_freq`. You should see:

```[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]```

**If your output is correct, congratulations! You've solved the challenge!**

If not, go back and check for errors in your code.
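The expected output above keeps terms in order of first appearance, while the example in The Challenge lists them alphabetically. As an optional sketch, the columns can be reordered to the alphabetical layout like this. The names `order`, `sorted_bag`, and `sorted_freq` are our own and are not required by the lab.

```python
bag_of_words = ['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']
term_freq = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 1, 1, 1],
]

# Column positions of the terms once they are sorted alphabetically
order = sorted(range(len(bag_of_words)), key=lambda j: bag_of_words[j])

# Apply the same column order to the term list and to every row
sorted_bag = [bag_of_words[j] for j in order]
sorted_freq = [[row[j] for j in order] for row in term_freq]

print(sorted_bag)
print(sorted_freq)
```

Both layouts describe the same BoW. Only the column order differs, so either is fine as long as `bag_of_words` and `term_freq` stay aligned.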