## The Bag-of-Words Model

The bag-of-words model is simple to understand and has been used with great success in language modeling and document classification.

#### 1. The problem with text.

The problem with text is that it is messy, while ML algorithms prefer well-defined, fixed-length inputs and outputs. Since machine learning algorithms cannot work with raw text directly, the text must be converted into numbers, specifically vectors of numbers. This process is called feature extraction or feature encoding.

#### 2. What is a Bag-of-Words?

BOW is a way of extracting features from text for use in modeling. It is a representation of text that describes the occurrence of words within a document, and involves two things:
    
    1. A vocabulary of known words.
    2. A measure of the presence of words.

**Note:** It is called a bag of words because any information about the order or structure of words in the document is discarded. The model is only concerned with *whether* known words occur in the document, not *where* in the document.

The BOW can be as simple or complex as you'd want it to be. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.

#### 3. Example of the Bag-of-Words Model

3.1. Step 1: Collect Data
    
    Collect the data as sentences or a document.
    
    
3.2. Step 2: Design the Vocabulary
    
    Make a list or set of all words in the vocabulary, ignoring punctuation and case.
    

3.3. Step 3: Create Document Vectors
    
    Score the words in each document. The objective is to turn each document of free text into a vector that we can use as input or output for an ML model. Words that are not in the vocabulary built from the training documents are ignored.
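
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the two example sentences and the helper names `tokenize` and `vectorize` are assumptions made for the example, and the scoring used here is a simple binary presence flag.

```python
import string

# Step 1: collect data (hypothetical example documents).
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 2: design the vocabulary (sorted so each word has a stable index).
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

def vectorize(text, vocabulary):
    """Binary scoring: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

# Step 3: create one fixed-length vector per document.
vectors = [vectorize(doc, vocabulary) for doc in docs]

# A word that is not in the vocabulary ("bird") is simply ignored.
unseen = vectorize("The bird sat", vocabulary)
```

Every vector has the same length as the vocabulary, regardless of how long the document is, which is what gives the model its fixed-length input.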

#### 4. Managing Vocabulary

As the vocabulary size increases, so does the length of the vector representation of each document. Because any single document uses only a small fraction of the vocabulary, the result is a vector with lots of zeros, called a sparse vector or sparse representation.

Sparse vectors require more memory and computational resources when modeling, and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms. There is therefore pressure to reduce the size of the vocabulary when using a BOW model.

Simple text cleaning techniques can be used as a first step:
    
    1. Ignoring case.
    2. Ignoring punctuation.
    3. Ignoring frequent words that don't contain much information, called stop words.
    4. Fixing misspelled words.
    5. Reducing words to their stem using stemming algorithms.
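
A rough sketch of some of these cleaning steps follows. It covers case, punctuation, stop words, and stemming, and skips spelling correction. The stop-word list and the `crude_stem` function are hypothetical stand-ins invented for this example; a real pipeline would normally use a full stop-word list and an established stemming algorithm.

```python
import string

# Hypothetical, tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

tokens = clean("The dogs jumped on the walls, barking!")
# tokens == ["dog", "jump", "wall", "bark"]
```

Each step shrinks the vocabulary: "Dogs", "dogs," and "dog" all end up as the single token `dog`.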
    
A more sophisticated approach is to create a vocabulary of grouped words. This changes the scope of the vocabulary and allows the BOW to capture a bit more meaning from the document. In this approach, each word or token is called a *gram*, and a vocabulary of two-word pairs is called a *bi-gram* model.

As with single words, only the bi-grams that appear in the corpus are modeled, not all possible bi-grams.
A vocabulary that tracks triplets of words is called a *tri-gram* model, and the general approach is called the *n-gram* model, where **n** refers to the number of grouped words. Often, a simple bi-gram approach is better than a 1-gram BOW model for tasks like document classification.
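
As a small illustration, n-grams can be extracted by sliding a window of size `n` over the token list. The function name `ngrams` and the example sentence are assumptions made for this sketch.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# The vocabulary holds only the bi-grams actually seen in the corpus.
bigram_vocab = sorted(set(bigrams))
```

Here `bigrams` contains five pairs, starting with `("it", "was")` and ending with `("of", "times")`. A sequence shorter than `n` yields an empty list.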

#### 5. Scoring Words

Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored. Some scoring methods include:
    
    1. Counts: Count the number of times each word appears in a document.
    2. Frequencies: Calculate the frequency that each word appears in a document out of all the words in the document.
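
Both scoring methods can be sketched with `collections.Counter`. The function names, the four-word vocabulary, and the example sentence are assumptions made for this illustration. Note that the frequency is taken over all words in the document, including words outside the vocabulary.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Counts: how many times each vocabulary word appears in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def frequency_vector(tokens, vocabulary):
    """Frequencies: each count divided by the total number of words."""
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / total for word in vocabulary]

vocabulary = ["cat", "mat", "sat", "the"]
tokens = "the cat sat on the mat".split()

counts = count_vector(tokens, vocabulary)          # [1, 1, 1, 2]
frequencies = frequency_vector(tokens, vocabulary)  # "the" scores 2/6
```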

5.1. Word Hashing:
    
    **Remember:** A hash function is a bit of math that maps data to a fixed-size set of numbers. For example, we use them in hash tables when programming, where perhaps names are converted to numbers for fast lookup.
    
    We can use a hash representation of known words in our vocabulary. This addresses the problem of having a very large vocabulary for a large text corpus, because we can choose the size of the hash space, which is in turn the size of the vector representation of the document.
    
    Words are hashed deterministically to the same integer index in the target hash space. A binary score or count can then be used to score the word. This is called the hashing trick or feature hashing.
    
    The challenge then is to choose a hash space large enough to accommodate the chosen vocabulary size, so that the probability of collisions stays low, while trading this off against sparsity.
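
A minimal sketch of the idea, assuming MD5 from `hashlib` as the hash function and count-based scoring. The names `hash_index` and `hashed_vector` are hypothetical. Python's built-in `hash()` is avoided here because it is salted per process for strings, so it would not map a word to the same index across runs.

```python
import hashlib

def hash_index(word, n_features):
    """Map a word deterministically to an index in [0, n_features)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashed_vector(tokens, n_features):
    """Count-based document vector of fixed length n_features."""
    vector = [0] * n_features
    for token in tokens:
        # Two different words may collide and share an index.
        vector[hash_index(token, n_features)] += 1
    return vector

tokens = "the cat sat on the mat".split()
vector = hashed_vector(tokens, n_features=16)
```

No vocabulary needs to be stored: the vector length is fixed by `n_features` alone. A smaller `n_features` gives denser vectors but more collisions, which is the trade-off described above.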
    
5.2. TF-IDF:
    
    A problem with scoring word frequencies is that highly frequent words start to dominate the document vector (they get larger scores), yet they may carry less information for the model than rarer, perhaps domain-specific words. One approach is to rescale the frequency of words by how often they appear across all documents, so that the scores for words like *the*, which are frequent in every document, are penalized. This approach is called Term Frequency - Inverse Document Frequency, or TF-IDF for short, where:
    
    Term Frequency: a scoring of the frequency of the word in the current document.
    Inverse Document Frequency: a scoring of how rare the word is across documents.
    
The scores are a weighting in which not all words are equally important or interesting. They have the effect of highlighting words that are distinctive (contain useful information) in a given document.

*Thus the IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low.*
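
The sketch below uses one common unsmoothed variant, chosen here as an assumption: term frequency is the word count divided by the document length, and IDF is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. Libraries often use smoothed or otherwise adjusted formulas, so their numbers may differ. The function name `tf_idf` and the three example documents are made up for the illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)
```

With this variant, *the* appears in every document, so its IDF is `log(3 / 3) = 0` and its score is zero everywhere, while *bird*, which appears in a single document, receives the highest IDF.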

#### 6. Limitations of Bag-of-Words

The bag-of-words model, though quite robust and simple to understand, suffers from a few shortcomings:
    
    1. Vocabulary: The vocabulary requires careful design, most specifically in order to manage its size, which impacts the sparsity of the document representations.
        
    2. Sparsity: Sparse representations are harder to model, both for computational reasons (space and time complexity) and for informational reasons, where the challenge is for the models to harness so little information in such a large representational space.
        
    3. Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model: if modeled, they could tell the difference between the same words arranged differently.
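
The third limitation is easy to demonstrate. In this small sketch, with two sentences made up for the example, the word counts are identical even though the meanings are opposite:

```python
from collections import Counter

# Same words, different order, opposite meaning.
bag_a = Counter("the dog bit the man".split())
bag_b = Counter("the man bit the dog".split())

# The bags are indistinguishable to a BOW model.
same_representation = (bag_a == bag_b)  # True
```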