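Some common terms to remember:
1. Corpus: the entire collection of documents (texts) being analyzed.
2. Vocabulary: the set of unique words (tokens) that occur across the corpus.
3. Document: a single text unit in the corpus, such as a sentence, a review, or an article.
4. Word: an individual token within a document.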

# Bag of words (BoW)

The **bag-of-words (BoW) model** represents a text as an unordered collection (a "bag") of its words. It is used in natural language processing and information retrieval (IR). It disregards word order (and thus most syntax and grammar) but keeps multiplicity, that is, how many times each word occurs.

The bag-of-words model is commonly used in methods of document classification where, for example, the (frequency of) occurrence of each word is used as a feature for training a classifier. It has also been used for computer vision.
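
As a minimal sketch of the idea, a bag of words can be built in plain Python with `collections.Counter`. The tokenizer here is an assumption (lowercase the text and keep runs of two or more word characters), and the helper names `tokenize` and `bag_of_words` are illustrative, not part of any library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters.

    This is an assumed, simplified tokenizer used only for illustration.
    """
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(tokenize(text))

bag = bag_of_words("This document is the second document.")
print(sorted(bag.items()))
# [('document', 2), ('is', 1), ('second', 1), ('the', 1), ('this', 1)]

# The same words in a different order produce the same bag.
print(bag == bag_of_words("The second document is this document."))  # True
```

The count of 2 for `document` is the multiplicity the model keeps, while the second comparison shows that word order is lost.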

In [20]:
# Import libraries
import numpy as np
import pandas as pd

In [7]:
df = pd.DataFrame({"text":['This is the first document.',
                    'This document is the second document.',
                    'And this is the third one.',
                    'Is this the first document?',],"output":[1,1,0,0]})

df

| | text | output |
|---|---|---|
| 0 | This is the first document. | 1 |
| 1 | This document is the second document. | 1 |
| 2 | And this is the third one. | 0 |
| 3 | Is this the first document? | 0 |


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer() # Convert a collection of text documents to a matrix of token counts.

In [9]:
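# Learn the vocabulary and build the document-term count matrix (sparse)
bow = cv.fit_transform(df['text'])
# Learned vocabulary: maps each token to its column index in the matrix
print(cv.vocabulary_)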

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


In [12]:
cv.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [13]:
cv.fit_transform(df['text']).toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [14]:
bow.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [15]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 1 1 1 0 0 1 0 1]]
[[0 2 0 1 0 1 1 0 1]]
[[1 0 0 1 1 0 1 1 1]]


In [18]:
# Encode a new, unseen document using the vocabulary learned above
cv.transform(['this is the third document']).toarray()

array([[0, 1, 0, 1, 0, 0, 1, 1, 1]])
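
The call above reuses the vocabulary learned during fitting, and words that are not in that vocabulary are ignored. The following plain-Python sketch mimics that lookup with the vocabulary copied from the earlier output. The helper `transform_counts` and its simplified tokenizer are assumptions for illustration, not scikit-learn's implementation.

```python
import re

# Vocabulary copied from the printed cv.vocabulary_ output.
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}

def transform_counts(text, vocabulary):
    """Count known tokens into a fixed-length row; unknown tokens are skipped."""
    row = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        index = vocabulary.get(token)
        if index is not None:
            row[index] += 1
    return row

print(transform_counts("this is the third document", vocabulary))
# [0, 1, 0, 1, 0, 0, 1, 1, 1]

# "brand" and "new" were never seen during fitting, so they add nothing.
print(transform_counts("this is a brand new document", vocabulary))
# [0, 1, 0, 1, 0, 0, 0, 0, 1]
```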

In [19]:
X = bow.toarray()
y = df['output']

In [21]:
X,y

(array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]]),
 0    1
 1    1
 2    0
 3    0
 Name: output, dtype: int64)

# N-grams
An **n-gram** is a collection of n successive items in a text document that may include words, numbers, symbols, and punctuation. **N-gram** models are useful in many text analytics applications where sequences of words are relevant, such as in sentiment analysis, text classification, and text generation.

An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be letters, syllables, or whole words found in a language dataset; adjacent phonemes extracted from a speech-recording dataset; or adjacent base pairs extracted from a genome.
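
A minimal sketch of how n-grams can be generated from a token sequence with a sliding window. The function name `ngrams` is illustrative, and splitting on whitespace is an assumed, simplified tokenization.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent items in the sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is the first document".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print([" ".join(g) for g in bigrams])
# ['this is', 'is the', 'the first', 'first document']
print([" ".join(g) for g in trigrams])
# ['this is the', 'is the first', 'the first document']
```

A sequence of k tokens yields k - n + 1 n-grams, so five words give four bigrams and three trigrams. If the sequence is shorter than n, the result is an empty list.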

In [22]:
df = pd.DataFrame({"text":['This is the first document.',
                    'This document is the second document.',
                    'And this is the third one.',
                    'Is this the first document?',],"output":[1,1,0,0]})

df

| | text | output |
|---|---|---|
| 0 | This is the first document. | 1 |
| 1 | This document is the second document. | 1 |
| 2 | And this is the third one. | 0 |
| 3 | Is this the first document? | 0 |


In [23]:
# Bigrams (sequences of two adjacent words)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))

In [24]:
gram = cv.fit_transform(df['text'])

In [25]:
print(cv.vocabulary_)

{'this is': 11, 'is the': 3, 'the first': 6, 'first document': 2, 'this document': 10, 'document is': 1, 'the second': 7, 'second document': 5, 'and this': 0, 'the third': 8, 'third one': 9, 'is this': 4, 'this the': 12}


In [26]:
print(gram[0].toarray())
print(gram[1].toarray())
print(gram[2].toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]]
[[0 1 0 1 0 1 0 1 0 0 1 0 0]]
[[1 0 0 1 0 0 0 0 1 1 0 1 0]]


In [27]:
gram.toarray()

array([[0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]])

In [28]:
# Trigrams (sequences of three adjacent words)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(3,3))

In [29]:
gram = cv.fit_transform(df['text'])

In [30]:
print(cv.vocabulary_)

{'this is the': 10, 'is the first': 2, 'the first document': 6, 'this document is': 9, 'document is the': 1, 'is the second': 3, 'the second document': 7, 'and this is': 0, 'is the third': 4, 'the third one': 8, 'is this the': 5, 'this the first': 11}


In [31]:
print(gram[0].toarray())
print(gram[1].toarray())
print(gram[2].toarray())

[[0 0 1 0 0 0 1 0 0 0 1 0]]
[[0 1 0 1 0 0 0 1 0 1 0 0]]
[[1 0 0 0 1 0 0 0 1 0 1 0]]


# TF-IDF (Term Frequency–Inverse Document Frequency)
In information retrieval, **tf–idf** (also TF*IDF, **TFIDF**, **TF–IDF**, or **Tf–idf**), short for term frequency–inverse document frequency, is a measure of the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. Like the bag-of-words model, it treats a document as a multiset of words, without word order. It refines the simple bag-of-words model by letting the weight of each word depend on the rest of the corpus.

It was often used as a weighting factor in searches of information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used **tf–idf**. Variations of the **tf–idf** weighting scheme were often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the **tf–idf** for each query term; many more sophisticated ranking functions are variants of this simple model.
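
The summing approach can be sketched in plain Python. This is an illustration under stated assumptions, not a production ranking function: term frequency is the raw count, inverse document frequency is the unsmoothed `log(N / df)` (which differs from scikit-learn's default), and the helper names `tokenize`, `tf_idf`, and `score` are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    """Assumed simplified tokenizer: lowercase, tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def tf_idf(term, tokens):
    """Raw term count times log(N / df); 0 for terms absent from the corpus."""
    if doc_freq[term] == 0:
        return 0.0
    return tokens.count(term) * math.log(n_docs / doc_freq[term])

def score(query, tokens):
    """Sum the tf-idf of each query term for one document."""
    return sum(tf_idf(term, tokens) for term in tokenize(query))

query = "second document"
scores = [score(query, toks) for toks in tokenized]
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
print(ranking)  # [1, 0, 3, 2]
```

Document 1 ranks first because it contains the rare term `second` and mentions `document` twice. Document 2 contains neither query term and scores zero.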

In [32]:
df = pd.DataFrame({"text":['This is the first document.',
                    'This document is the second document.',
                    'And this is the third one.',
                    'Is this the first document?',],"output":[1,1,0,0]})

df

| | text | output |
|---|---|---|
| 0 | This is the first document. | 1 |
| 1 | This document is the second document. | 1 |
| 2 | And this is the third one. | 0 |
| 3 | Is this the first document? | 0 |


In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer()

In [34]:
arr = tfid.fit_transform(df['text']).toarray()

In [35]:
arr

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [36]:
print(tfid.idf_) # Inverse document frequency vector, only defined if use_idf=True. Returns: ndarray of shape (n_features,)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
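
The numbers above can be reproduced by hand, which helps show what the vectorizer is doing. This sketch assumes the `TfidfVectorizer` defaults: raw counts as term frequency, smoothed idf computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The document frequencies are read off the count matrix shown earlier.

```python
import math

n_docs = 4

# Vocabulary in column order, and how many documents contain each term.
vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
doc_freq = [1, 3, 2, 4, 1, 1, 4, 1, 4]

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Bag-of-words counts for the first document, "This is the first document."
counts = [0, 1, 1, 1, 0, 0, 1, 0, 1]

# Weight each count by its idf, then scale the row to unit Euclidean length.
weighted = [c * w for c, w in zip(counts, idf)]
norm = math.sqrt(sum(v * v for v in weighted))
row = [v / norm for v in weighted]

print([round(v, 8) for v in idf])
print([round(v, 8) for v in row])
```

Terms that appear in every document (`is`, `the`, `this`) get the minimum idf of 1, while terms found in a single document get the largest weight.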
