<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/CAM_DS_C301_BoW_and_TF_IDF_Demo_1_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!



# Demonstration 1.2.1 Bag of Words and TF-IDF

This Notebook accompanies two demonstrations, one on the Bag-of-Words technique, and one on Term Frequency - Inverse Document Frequency. In the demonstrations, you will create each type of representation, and learn how to:
- construct a vectoriser to count words
- extract the vocabulary from a set of documents
- import and initialise the TF-IDF vectoriser
- view the TF-IDF values and vocabulary of words.


## a. Bag-of-Words representation

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
docs = [
        'This is the first document.',
        'This document is the second document., the document is nice',
        'And this is the third one.',
        'Is this the first document?', ]


In [None]:
# Count lowercased word-level tokens, keeping at most the 50 most frequent terms.
vectorizer = CountVectorizer(analyzer="word",
                             lowercase=True,
                             max_features=50)

In [None]:
# Convert the documents into a document-term matrix.
wm = vectorizer.fit_transform(docs)
print(wm.todense())

[[0 1 1 1 0 0 0 1 0 1]
 [0 3 0 2 1 0 1 2 0 1]
 [1 0 0 1 0 1 0 1 1 1]
 [0 1 1 1 0 0 0 1 0 1]]


In [None]:
vocabulary = vectorizer.vocabulary_
print(vocabulary)

{'this': 9, 'is': 3, 'the': 7, 'first': 2, 'document': 1, 'second': 6, 'nice': 4, 'and': 0, 'third': 8, 'one': 5}


In [None]:
tokens = vectorizer.get_feature_names_out()
print(tokens)

['and' 'document' 'first' 'is' 'nice' 'one' 'second' 'the' 'third' 'this']
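
The two outputs above are linked: `vocabulary_` maps each term to its column index, and `get_feature_names_out()` lists the terms in that column order. A minimal sketch of the relationship, using the printed dictionary as a literal so that it runs without refitting the vectoriser (the name `ordered_terms` is introduced here purely for illustration):

```python
# Vocabulary as printed above: term -> column index in the document-term matrix.
vocabulary = {'this': 9, 'is': 3, 'the': 7, 'first': 2, 'document': 1,
              'second': 6, 'nice': 4, 'and': 0, 'third': 8, 'one': 5}

# Sorting the terms by their column index recovers the column order.
ordered_terms = sorted(vocabulary, key=vocabulary.get)
print(ordered_terms)

# The indices follow alphabetical order, so a plain sort gives the same list.
print(ordered_terms == sorted(vocabulary))
```

The dictionary prints in the order the terms were first encountered, which is why it looks unsorted even though the column indices are alphabetical.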


In [None]:
# Create an index for each row.
doc_names = [f'Doc{idx}' for idx in range(wm.shape[0])]
df_bow = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=tokens)

In [None]:
df_bow

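|      | and | document | first | is | nice | one | second | the | third | this |
|------|-----|----------|-------|----|------|-----|--------|-----|-------|------|
| Doc0 | 0   | 1        | 1     | 1  | 0    | 0   | 0      | 1   | 0     | 1    |
| Doc1 | 0   | 3        | 0     | 2  | 1    | 0   | 1      | 2   | 0     | 1    |
| Doc2 | 1   | 0        | 0     | 1  | 0    | 1   | 0      | 1   | 1     | 1    |
| Doc3 | 0   | 1        | 1     | 1  | 0    | 0   | 0      | 1   | 0     | 1    |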
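
To see where the counts in `df_bow` come from, the same table can be rebuilt with the standard library alone. This is a rough sketch rather than a description of scikit-learn's internals: it assumes the vectoriser's default behaviour of lowercasing the text and keeping tokens of two or more word characters, and the helper `tokenize` is a name introduced here for illustration.

```python
import re
from collections import Counter

docs = [
    'This is the first document.',
    'This document is the second document., the document is nice',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumption: mimic CountVectorizer's defaults by lowercasing and keeping
# tokens of two or more word characters (its default token_pattern).
token_pattern = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    """Split a document into lowercased word tokens."""
    return token_pattern.findall(text.lower())


# The alphabetically sorted vocabulary gives the column order.
vocab = sorted({token for doc in docs for token in tokenize(doc)})

# One row of counts per document, one column per vocabulary term.
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[term] for term in vocab])

print(vocab)
for name, row in zip(['Doc0', 'Doc1', 'Doc2', 'Doc3'], rows):
    print(name, row)
```

Each row is simply a tally of how often each vocabulary term occurs in that document; punctuation and word order are discarded.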


## b. Term Frequency - Inverse Document Frequency Representation (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Weight lowercased word-level tokens by TF-IDF, keeping at most the 50 most frequent terms.
vectorizer = TfidfVectorizer(analyzer="word",
                             lowercase=True,
                             max_features=50)

In [None]:
docs = [
        'This is the first document.',
        'This document is the second document., the document is nice',
        'And this is the third one.',
        'Is this the first document?', ]

In [None]:
# Convert the documents into a document-term matrix.
wm = vectorizer.fit_transform(docs)
print(wm.todense())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.         0.38408524 0.         0.38408524]
 [0.         0.67208551 0.         0.36631596 0.35098394 0.
  0.35098394 0.36631596 0.         0.18315798]
 [0.51184851 0.         0.         0.26710379 0.         0.51184851
  0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.         0.38408524 0.         0.38408524]]


In [None]:
vocabulary = vectorizer.vocabulary_
print(vocabulary)

{'this': 9, 'is': 3, 'the': 7, 'first': 2, 'document': 1, 'second': 6, 'nice': 4, 'and': 0, 'third': 8, 'one': 5}


In [None]:
tokens = vectorizer.get_feature_names_out()
print(tokens)

['and' 'document' 'first' 'is' 'nice' 'one' 'second' 'the' 'third' 'this']


In [None]:
# Create an index for each row.
doc_names = [f'Doc{idx}' for idx in range(wm.shape[0])]
df_tfidf = pd.DataFrame(data=wm.toarray(), index=doc_names,
                        columns=tokens)

In [None]:
df_tfidf

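|      | and      | document | first    | is       | nice     | one      | second   | the      | third    | this     |
|------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| Doc0 | 0.0      | 0.469791 | 0.580286 | 0.384085 | 0.0      | 0.0      | 0.0      | 0.384085 | 0.0      | 0.384085 |
| Doc1 | 0.0      | 0.672086 | 0.0      | 0.366316 | 0.350984 | 0.0      | 0.350984 | 0.366316 | 0.0      | 0.183158 |
| Doc2 | 0.511849 | 0.0      | 0.0      | 0.267104 | 0.0      | 0.511849 | 0.0      | 0.267104 | 0.511849 | 0.267104 |
| Doc3 | 0.0      | 0.469791 | 0.580286 | 0.384085 | 0.0      | 0.0      | 0.0      | 0.384085 | 0.0      | 0.384085 |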
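
The TF-IDF values can be traced back to the Bag-of-Words counts. The sketch below assumes the `TfidfVectorizer` defaults: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, followed by scaling each row to unit Euclidean length. The counts are copied from the Bag-of-Words table, and `tfidf_row` is a helper name introduced for illustration.

```python
import math

# Term counts copied from the Bag-of-Words table (rows: Doc0..Doc3).
terms = ['and', 'document', 'first', 'is', 'nice',
         'one', 'second', 'the', 'third', 'this']
counts = [
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
    [0, 3, 0, 2, 1, 0, 1, 2, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
]

n_docs = len(counts)

# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed IDF (assumed default, smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf = [tfidf_row(row) for row in counts]

for name, row in zip(['Doc0', 'Doc1', 'Doc2', 'Doc3'], tfidf):
    print(name, [round(v, 6) for v in row])
```

Terms that appear in every document get the minimum IDF of 1, so their weight depends only on their count and on the row normalisation, while terms confined to a single document get the largest IDF.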


## Key information
This demonstration illustrated how to implement Bag-of-Words and TF-IDF representations in Python.

## Reflect
Examine the dataframes at the end of each demonstration and compare them with the original documents to check that you understand where the numbers come from.
- Which word does the BoW value 3 correspond to?
- Which word does the TF-IDF value 0.512 correspond to?
- Describe some of the differences between the weightings given by BoW and TF-IDF, specifically for this set of texts.
- What, if anything, might be more useful about the TF-IDF approach?
> Select the pen from the toolbar to add your entry.