# Converting Text into a Numerical Representation

## Bag of Words

A Bag-of-Words (BoW) model is a technique in natural language processing (NLP) and machine learning that represents text data by counting the frequency of each word in a document, completely ignoring the order of words, grammar, and sentence structure.

It converts text into numerical vectors that machine learning models can process, making it useful for tasks like text classification, sentiment analysis, and spam filtering.

### How it Works

1. **Tokenization:** The text is broken down into individual words or tokens.
2. **Vocabulary Creation:** A unique list of all words from the entire document set is created.
3. **Frequency Counting:** For each document, the model counts the occurrences of each word in the vocabulary, generating a numerical representation.
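The three steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the `tokenize` helper, its regular expression, and the two sample sentences are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters, digits, or apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

# Illustrative corpus (assumed for this sketch)
docs = ["The cat sat on the mat", "The dog sat on the log"]

# Step 1: tokenization
tokenized = [tokenize(d) for d in docs]

# Step 2: vocabulary creation (sorted so that column order is reproducible)
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Step 3: frequency counting, one vector per document
vectors = []
for toks in tokenized:
    counts = Counter(toks)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Every document is mapped to a vector of the same length (the vocabulary size), which is what lets a machine learning model treat the documents as rows of one matrix.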

### Example:

**Document:** "The cat sat on the mat".

**Bag-of-Words Representation:** `{the: 2, cat: 1, sat: 1, on: 1, mat: 1}`
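Note that "The" and "the" are only counted together if the text is lowercased first. A short sketch using `collections.Counter`, assuming simple whitespace splitting is good enough for this sentence:

```python
from collections import Counter

document = "The cat sat on the mat"

# Lowercase first so that "The" and "the" map to the same token
bow = Counter(document.lower().split())

print(dict(bow))  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```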

## Using **`CountVectorizer`**

**`CountVectorizer`** is a class in the `scikit-learn` library that implements the Bag-of-Words model.

It converts a collection of raw text documents into a matrix of token counts, where a token can be a word, a word n-gram, or a character n-gram. Each row represents a document and each column represents a unique token from the entire collection.

The value in each cell is the number of times that token occurs in that document.

This numerical representation is crucial for applying machine learning algorithms to text data, as most algorithms require numerical input.

### How it works

* **Vocabulary Creation:** `CountVectorizer` first analyzes the input text documents to build a vocabulary of all unique words or tokens found across the entire corpus. Each unique token is assigned a unique integer ID.
* **Document-Term Matrix:** For each document, it then counts the occurrences of each token from the established vocabulary. This results in a document-term matrix where rows represent documents and columns represent the tokens in the vocabulary. The values in the matrix indicate the frequency of each token in each document. This matrix is typically sparse, meaning most entries are zero, and is often stored efficiently using sparse matrix formats like `scipy.sparse.csr_matrix`.
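Both pieces can be inspected on a fitted vectorizer. The sketch below assumes a small made-up corpus; `vocabulary_` is the fitted attribute that maps each token to its column index.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (assumed for this sketch)
corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

# Token -> column index mapping learned during fitting
print(vectorizer.vocabulary_)

# The result is a SciPy sparse matrix with one row per document
print(sparse.issparse(dtm), dtm.shape)

# Dense view, for inspection only
print(dtm.toarray())
```

Column indices follow the alphabetical order of the tokens, so `cat` maps to column 0 and `the` to column 6 here.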


### Example

In this example, we will use `CountVectorizer` to transform two text documents into a matrix.

We will use the following two text documents:
* I like watching SciFi movies. I like SciFi.
* I hate watching horror movies. I hate horror.

#### Create the Text Documents

In [1]:
# import libraries
import pandas as pd

# create text documents: a list of two sentences
text = ['I like watching SciFi movies. I like SciFi.',
        'I hate watching horror movies. I hate horror.']

# create dataframe from list of sentences
data_text = pd.DataFrame({'review':['review1', 'review2'], 'text':text})

# show dataframe
data_text

| | review | text |
|---|---|---|
| 0 | review1 | I like watching SciFi movies. I like SciFi. |
| 1 | review2 | I hate watching horror movies. I hate horror. |


#### Create an instance of `CountVectorizer`

We will create an instance of the `CountVectorizer` and use it to transform the text documents into a document-term matrix.

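When instantiating the `CountVectorizer`, we will set the parameter `stop_words` to `'english'`. This parameter removes common words (like "the", "is", "a") that often carry little semantic meaning. It accepts either a custom list of words or the string `'english'`, which selects the built-in English stop-word list.

By default, `CountVectorizer` also lowercases the text and keeps only tokens of two or more alphanumeric characters. This is why "SciFi" appears as `scifi` in the learned vocabulary, and why the single-letter word "I" would be dropped even without a stop-word list.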

In [2]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# create an instance of CountVectorizer
cv = CountVectorizer(stop_words='english')

#### Create the Document-Term Matrix

The **`fit_transform()`** method combines two distinct steps:
* **`fit`:** This step analyzes a collection of text documents (the "corpus") to learn a vocabulary. It identifies all unique words or tokens present in the corpus and assigns each a unique index. This forms the basis for converting text into numerical representations.
* **`transform`:** This step then converts the raw text documents into a numerical representation based on the learned vocabulary. For each document, it counts the occurrences of each word in the vocabulary, resulting in a sparse matrix where each row represents a document and each column represents a word from the vocabulary, with the cell values indicating the count of that word in that document.

The `fit_transform()` method performs both the learning of the vocabulary and the subsequent conversion of the text into a matrix of word counts in a single call.

In [3]:
# create the document-term matrix
cv_matrix = cv.fit_transform(data_text['text'])

cv_matrix

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8 stored elements and shape (2, 6)>

#### Print the Learned Vocabulary

The **`get_feature_names_out()`** method is used to retrieve the names of the features (words or n-grams) that were extracted from the text corpus after fitting the vectorizer.

Each feature name corresponds to a column in the resulting document-term matrix.


In [9]:
# print the learned vocabulary
print(cv.get_feature_names_out())

['hate' 'horror' 'like' 'movies' 'scifi' 'watching']


#### Create DataFrame from Document-Term Matrix

For readability, we will create a DataFrame from the document-term matrix.

We will use the `toarray()` method to convert the sparse representation into a dense `numpy.ndarray`, where all elements, including zeros, are explicitly stored in memory. This is fine for a small example, but it can exhaust memory on a large corpus.

In [4]:
# create dataframe from sparse matrix
data_dtm = pd.DataFrame(cv_matrix.toarray(), index=data_text['review'].values,
                        columns=cv.get_feature_names_out())

data_dtm

| | hate | horror | like | movies | scifi | watching |
|---|---|---|---|---|---|---|
| review1 | 0 | 0 | 2 | 1 | 2 | 1 |
| review2 | 2 | 2 | 0 | 1 | 0 | 1 |


## Term Frequency - Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates the importance of a word within a document relative to a collection of documents (corpus).

It combines two metrics: Term Frequency (TF), which counts how often a term appears in a single document, and Inverse Document Frequency (IDF), which measures how rare or common a term is across the entire corpus.

A higher TF-IDF score indicates that a word is more significant in a document.

### How it works:

1. **Term Frequency (TF):** This component measures how frequently a term appears in a specific document.
   * **Calculation:** It's typically calculated by dividing the number of times a term appears in a document by the total number of terms in that document.
   * **Purpose:** TF gives more weight to words that are central to a document's content.

2. **Inverse Document Frequency (IDF):** This component accounts for the term's rarity across the entire corpus.
   * **Calculation:** It's calculated by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that value.
   * **Purpose:** IDF gives higher weight to words that are rare across the corpus and lower weight to common words (like "the" or "a"), thus filtering out general terms and highlighting more specific, informative words.

3. **TF-IDF Score:** The final TF-IDF score for a term in a document is the product of its TF and IDF values.
   * **Interpretation:** A high TF-IDF score means a term is frequent in that particular document (high TF) but uncommon across the entire set of documents (high IDF), making it a highly important and representative term for that document.

### Applications:

TF-IDF is widely used in:
* **Information Retrieval:**<br>
Online search engines use it to rank documents based on their relevance to a user's query.
* **Text Mining and NLP:**<br>
It helps in identifying keywords and phrases in large text collections.
* **Machine Learning:**<br>
It's used as a feature extraction technique to represent documents numerically for various machine learning models.

### Calculation:

TF is calculated as follows:

$$\text{TF}(t,d) = \frac{\text{Number of occurrences of term} \ t \ \text{in document} \ d}{\text{Total number of terms in document} \ d}$$

IDF is calculated as follows:

$$\text{IDF}(t,D) = \ln \left( \frac{\text{Total number of documents in the corpus} \ D}{\text{Number of documents with term} \ t \ \text{in them}} \right)$$

where $\ln$ is the natural logarithm.

TF-IDF is calculated as follows:

$$\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \ \times \ \text{IDF}(t,D)$$
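The three formulas translate directly into plain Python. The sketch below is a minimal illustration: the function names `tf`, `idf`, and `tf_idf` and the pre-tokenized documents are assumptions made for this example.

```python
import math

# Illustrative corpus, already tokenized and lowercased (assumed for this sketch)
docs = {
    "review1": ["like", "watching", "scifi", "movies", "like", "scifi"],
    "review2": ["hate", "watching", "horror", "movies", "hate", "horror"],
}

def tf(term, tokens):
    # Occurrences of the term divided by the total number of terms in the document
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Natural log of (number of documents / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_id, corpus):
    return tf(term, corpus[doc_id]) * idf(term, corpus)

for term in ["like", "movies"]:
    print(term, round(tf_idf(term, "review1", docs), 4))
# like 0.231
# movies 0.0
```

"movies" appears in both documents, so its IDF is $\ln(2/2) = 0$ and its TF-IDF score is zero under these formulas, whereas "like" appears only in the first document and receives a positive score.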

### Using **`TfidfVectorizer`**

`TfidfVectorizer` is a class in the `scikit-learn` library that converts a collection of raw text documents into a matrix of TF-IDF features.

This process transforms unstructured text data into a numerical representation that can be used by machine learning algorithms.

#### Key Functionality

* **Term Frequency (TF):** Measures how frequently a term appears in a document.
* **Inverse Document Frequency (IDF):** Measures the importance of a term by considering its inverse frequency across the entire document corpus. Terms that appear in fewer documents receive higher IDF scores, indicating greater specificity.
* **TF-IDF Calculation:** Combines TF and IDF to produce a score for each term in each document, reflecting both its local frequency and global importance.


### Example

In this example, we will use `TfidfVectorizer` to transform two text documents into a matrix.

We will use the following two text documents:
* I like watching SciFi movies. I like SciFi.
* I hate watching horror movies. I hate horror.

#### Create the Text Documents


In [5]:
# create text documents: a list of two sentences
text = ['I like watching SciFi movies. I like SciFi.',
        'I hate watching horror movies. I hate horror.']

# create dataframe from list of sentences
data_text = pd.DataFrame({'review':['review1', 'review2'], 'text':text})

# show dataframe
data_text

| | review | text |
|---|---|---|
| 0 | review1 | I like watching SciFi movies. I like SciFi. |
| 1 | review2 | I hate watching horror movies. I hate horror. |


#### Create an instance of `TfidfVectorizer`

In [6]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

#### Create the Document-Term Matrix

In [7]:
# create document-term matrix
tfidf_matrix = tfidf.fit_transform(data_text['text'])

tfidf_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8 stored elements and shape (2, 6)>

#### Create DataFrame from Document-Term Matrix

We will create a DataFrame from the document-term matrix.

In [8]:
# create dataframe from document-term matrix
data_dtm = pd.DataFrame(tfidf_matrix.toarray(), index=data_text['review'].values,
                        columns=tfidf.get_feature_names_out())

data_dtm

| | hate | horror | like | movies | scifi | watching |
|---|---|---|---|---|---|---|
| review1 | 0.0 | 0.0 | 0.666205 | 0.237005 | 0.666205 | 0.237005 |
| review2 | 0.666205 | 0.666205 | 0.0 | 0.237005 | 0.0 | 0.237005 |
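These values do not follow from the textbook formulas: "movies" and "watching" appear in both reviews, so their textbook IDF would be $\ln(2/2) = 0$, yet they score 0.237005. With its default settings, `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of $\ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$, and then scales each row to unit Euclidean (L2) length. The sketch below reproduces the table by hand under those defaults; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ['I like watching SciFi movies. I like SciFi.',
        'I hate watching horror movies. I hate horror.']

# Raw term counts (TF as used by scikit-learn's defaults)
counts = CountVectorizer(stop_words='english').fit_transform(text).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF

weighted = counts * idf
# L2-normalize each row
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer(stop_words='english').fit_transform(text).toarray()

print(np.round(manual, 6))
print(np.allclose(manual, reference))
```

Because of the row normalization, scores are comparable across documents of different lengths, and the squared values in each row sum to 1.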
