## Count Vector, TFIDF Representations of Text

Working with text generally involves converting it into a format that a model can understand, which mostly means numbers. In this notebook, you will take a closer look at two of the most basic and ubiquitously used formats: 

 - Count Vector
 - TFIDF

You will also build a Machine Learning model on a real world dataset of **BBC News** and perform text classification utilizing the above two formats.

#### Table of Contents
1. About the Dataset
2. Preprocessing Text
3. Working with Count Vector
4. Using TFIDF to improve Count Vector
5. Conclusion
6. Challenge

### 1. About the Dataset

The dataset that you are going to use is a collection of news articles from BBC across 5 major categories, namely:
 
 - Business
 - Entertainment
 - Politics
 - Sport
 - Tech

There are a total of 2225 articles in the dataset, which is a mix of all of the above categories. Let's load the dataset using pandas and have a quick look at some of the articles. 

**Note:** You can get the dataset [here](https://trainings.analyticsvidhya.com/asset-v1:AnalyticsVidhya+LP_DL_2019+2019_T1+type@asset+block@bbc_news_mixed.csv)


In [1]:
import os
from PyPDF2 import PdfReader
import pandas as pd

In [2]:
!python3 -m pip install scikit-learn --upgrade



In [2]:
def pdf_to_text(pdf_path):
    """Return the text of the first page of the PDF at pdf_path."""
    reader = PdfReader(pdf_path)

    # only the first page is used to represent the document
    page = reader.pages[0]

    # extract the text from that page
    text = page.extract_text()
    return text

In [3]:
df = pd.DataFrame(columns = ['text', 'label'])

In [4]:
mortgage_doc_dir = 'mortgage_documents'
label = 'mortgage'
for filename in os.listdir(mortgage_doc_dir):
    f = os.path.join(mortgage_doc_dir, filename)
    # checking if it is a file
    if os.path.isfile(f):
        #print(f)
        pdf_text = pdf_to_text(f)
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        df = pd.concat([df, pd.DataFrame([{'text': pdf_text, 'label': label}])], ignore_index=True)

In [5]:
insurance_doc_dir = 'insurance_documents'
label = 'insurance'
for filename in os.listdir(insurance_doc_dir):
    f = os.path.join(insurance_doc_dir, filename)
    # checking if it is a file
    if os.path.isfile(f):
        #print(f)
        pdf_text = pdf_to_text(f)
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        df = pd.concat([df, pd.DataFrame([{'text': pdf_text, 'label': label}])], ignore_index=True)

In [6]:
df.head(12)

| | text | label |
|---|---|---|
| 0 | MORTG... | mortgage |
| 1 | 1 \nDEED OF SIMPLE MORTGAGE \n \nTHIS DEED O... | mortgage |
| 2 | Citibank \nMortgage Loan Agreement\n(Applicab... | mortgage |
| 3 | LD/2239 \n \n \nMEMORANDUM OF DEPOSIT OF TITL... | mortgage |
| 4 | \n \nFORM NO. 3 \n \n \nForm of Mortgage Dee... | mortgage |
| 5 | MORTGAGE Of Agricultural \nLand With Possessio... | mortgage |
| 6 | 1a. INSURED’S I.D. NUMBER (FOR PROGR... | insurance |
| 7 | ITGI / TP / 07 \n \nTRAVEL PROTECTOR INSURANCE... | insurance |
| 8 | \n 1 \nIFFCO-TOKIO GENERAL INSURANCE COMPANY... | insurance |
| 9 | \n \n 1 \nClaim Fo... | insurance |


In [7]:
# print first 2 articles
for art in df.text[:2]:
    print(art)

 1 
DEED OF SIMPLE MORTGAGE  
 
THIS DEED OF SIMPLE MORTGAGE is made at ______________ this _____ 
day of ________ 200__ 
 between  
 
(1) _______________, son/daughter of _____________ aged about ______ years 
resident at ___________________(hereinafter referred to as the ‘Mortgagor which 
expression shall unless repugnant to the context or meaning thereof, be deemed 
to mean and include his/her legal heirs, executors and administrators) of the one 
part.  
 
OR (applicable in case of a couple ) 
 
_________________, son/daughter of _________________ aged about _____ years 
and his/her spouse __________________, son/daughter of ________________ 
aged about _____ years both residing at ___________ (hereinafter referred to as the ‘Mortgagors’  which expression sh all unless repugnant to the context or 
meaning thereof, be deemed to mean and include their legal heirs, executors and 
administrators) of the one part; and  
 
(2) ____________________ [Housing Finance Company (HFC)], a compa

Now that you have an idea of what your data looks like, let's see the count of each category in the dataset!

In [7]:
# category-wise count
df.label.value_counts()

insurance    6
mortgage     6
Name: label, dtype: int64

### 2. Preprocessing Text

You will have noticed that the labels are in text format. To build a model on this dataset, you have to create a mapping between the labels and numbers such as 0, 1, 2, 3. This process is called Label Encoding. You can easily label encode your text data using sklearn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Let's have a look at how to do that!

In [8]:
from sklearn.preprocessing import LabelEncoder

# initialize LabelEncoder
lencod = LabelEncoder()
# fit_transform() converts the text to numbers
df.label = lencod.fit_transform(df.label)
# label-wise count
df.label.value_counts()

1    6
0    6
Name: label, dtype: int64

In [31]:
df.label

0     1
1     1
2     1
3     1
4     1
5     1
6     0
7     0
8     0
9     0
10    0
11    0
Name: label, dtype: int64

**Note:** In the output of the above code, the text labels have been replaced by numbers. `LabelEncoder` assigns the codes in alphabetical order of the labels, so the mapping is:
 - 0 is insurance
 - 1 is mortgage
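
Rather than reading the mapping off the output, you can ask the encoder for it. The following is a small sketch on a toy label list (the names `toy_labels`, `encoder` and `mapping` are introduced here for illustration); it relies on the `classes_` attribute and `inverse_transform` method of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["mortgage", "mortgage", "insurance", "insurance"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is sorted; the position of each class is its integer code
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels
decoded = list(encoder.inverse_transform([0, 1]))
print(decoded)
```

In this notebook the same calls would be `lencod.classes_` and `lencod.inverse_transform(...)`.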
 
### 3. Working with Count Vector

Sklearn provides an easy way to create count vectors from a piece of text. You can use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to do that. Let's see how simple it is!

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vector
cvec = CountVectorizer(stop_words='english')
# create Bag of Words
bow = cvec.fit_transform(df.text)
# shape of Bag of Words
print('shape of BOW:', bow.shape)
# number of words in the vocabulary
print('No. of words in vocabulary:', len(cvec.vocabulary_))

shape of BOW: (12, 906)
No. of words in vocabulary: 906


Let's have a closer look at the Bag of Words that you have just generated.

If you explore the Bag of Words as a dataframe, you will find one row per document and one column per word in the vocabulary, with each cell holding the number of times that word occurs in that document. For example, if the word "called" appears only once in the first document, there is a 1 at its index. Now that your BOW is created, let's see just how good it is at classifying the articles in an ML model.

You'll be using the [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model because it works well with the sparse features of text.

In [11]:
df.head()

| | text | label |
|---|---|---|
| 0 | 1 \nDEED OF SIMPLE MORTGAGE \n \nTHIS DEED O... | 1 |
| 1 | \n \nFORM NO. 3 \n \n \nForm of Mortgage Dee... | 1 |
| 2 | LD/2239 \n \n \nMEMORANDUM OF DEPOSIT OF TITL... | 1 |
| 3 | Citibank \nMortgage Loan Agreement\n(Applicab... | 1 |
| 4 | MORTGAGE Of Agricultural \nLand With Possessio... | 1 |


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# creates a ML model based on parameters
def create_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = MultinomialNB()
    model = model.fit(X_train, y_train)
    return model, X_test, y_test

In [11]:
# create BOW based classification model
model_b, X_test_b, y_test_b = create_model(bow, df.label)

Now that the model is created and trained, have a look at the classification accuracy:

In [12]:
from sklearn.metrics import accuracy_score

# check accuracy 
accuracy_score(y_test_b, model_b.predict(X_test_b))

0.75
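
With only 12 documents and `test_size=0.3`, this accuracy is measured on just 4 held-out documents, so a single misclassification moves it by 25 points. On a dataset this small, cross-validation may give a steadier estimate. The following is a sketch under assumptions: the toy corpus and the names `toy_texts`, `toy_labels` and `pipeline` are made up for illustration, and 3 folds is an arbitrary choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_texts = [
    "deed of simple mortgage",
    "mortgage loan agreement",
    "memorandum of deposit of title deeds for mortgage",
    "travel protector insurance policy",
    "general insurance company claim form",
    "insurance claim for the insured person",
]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = mortgage, 0 = insurance

# the vectorizer is refit inside each fold, so test documents never shape the vocabulary
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=cv)
print(scores, scores.mean())
```

Putting the vectorizer inside the pipeline also avoids fitting the vocabulary on documents that later end up in the test split.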

In [13]:
type(X_test_b)

scipy.sparse.csr.csr_matrix

In [14]:
test_df = pd.DataFrame(["This is a insurance document"], columns=['text'])

In [15]:
test_df.head()

| | text |
|---|---|
| 0 | This is a insurance document |


In [16]:
# create Bag of Words
bow_test = cvec.transform(test_df.text)

In [19]:
# bow_test.shape
bow_test


<1x906 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [18]:
model_b.predict(bow_test)

array([0])

In [20]:
import joblib

In [21]:
filename = 'model.sav'
joblib.dump(model_b, filename)

filename = 'countvectorizer.sav'
joblib.dump(cvec, filename)

['countvectorizer.sav']

In [22]:
import joblib
loaded_model = joblib.load("model.sav")
loaded_cvec = joblib.load("countvectorizer.sav")


In [23]:
test_df1 = pd.DataFrame(["This is a MORTGAGE document"], columns=['text'])

In [24]:
bow_test1 = loaded_cvec.transform(test_df1.text)

In [28]:
bow_test2 = loaded_cvec.transform(test_df.text)

In [27]:
loaded_model.predict(bow_test1)

array([1])

In [30]:
loaded_model.predict(bow_test2)

array([0])
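
The model and the vectorizer above are saved as two separate files, which must always be loaded and used together. One alternative is to bundle both into a single sklearn `Pipeline` and persist that one object. This is a sketch only: the toy training texts, the file name `text_classifier.joblib` and the variable names are assumptions made for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_texts = [
    "deed of simple mortgage",
    "mortgage loan agreement",
    "travel insurance policy",
    "insurance claim form",
]
toy_labels = [1, 1, 0, 0]  # 1 = mortgage, 0 = insurance

# vectorizer and classifier are fit together and travel as one object
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(toy_texts, toy_labels)

# save to a temporary directory and load it back
path = os.path.join(tempfile.mkdtemp(), "text_classifier.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# raw text goes in directly; no separate transform() call is needed
prediction = restored.predict(["This is a MORTGAGE document"])
print(prediction)
```

Because the pipeline accepts raw strings, the loaded object can be used for prediction without a separate `transform` step.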

In [28]:
cvec.vocabulary_

{'deed': 291,
 'simple': 762,
 'mortgage': 557,
 '______________': 58,
 '_____': 49,
 'day': 287,
 '________': 52,
 '200__': 26,
 '_______________': 59,
 'son': 769,
 'daughter': 286,
 '_____________': 57,
 'aged': 126,
 '______': 50,
 'years': 896,
 'resident': 706,
 '___________________': 63,
 'hereinafter': 416,
 'referred': 678,
 'mortgagor': 561,
 'expression': 359,
 'shall': 750,
 'unless': 861,
 'repugnant': 701,
 'context': 256,
 'meaning': 531,
 'thereof': 823,
 'deemed': 293,
 'mean': 530,
 'include': 444,
 'legal': 495,
 'heirs': 414,
 'executors': 352,
 'administrators': 117,
 'applicable': 142,
 'case': 205,
 'couple': 269,
 '_________________': 61,
 'spouse': 776,
 '__________________': 62,
 '________________': 60,
 'residing': 708,
 '___________': 55,
 'mortgagors': 562,
 'sh': 749,
 '____________________': 64,
 'housing': 428,
 'finance': 378,
 'company': 241,
 'hfc': 419,
 'companies': 240,
 'act': 112,
 '1956': 22,
 'having': 408,
 'registered': 681,
 'office': 593,
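
The vocabulary above contains many fill-in-the-blank placeholders such as `'_____'` and `'200__'`. They become tokens because the default `token_pattern` of `CountVectorizer` treats underscores and digits as word characters. If these tokens are not wanted, one option is a stricter pattern that keeps only alphabetic tokens. A minimal sketch, where the pattern and the names `sample` and `alpha_cvec` are choices made for this example:

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = [
    "THIS DEED OF SIMPLE MORTGAGE is made at ______________ this _____ "
    "day of ________ 200__"
]

# keep only tokens made of two or more ASCII letters
alpha_cvec = CountVectorizer(
    stop_words="english",
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",
)
alpha_cvec.fit(sample)
print(sorted(alpha_cvec.vocabulary_))
```

Whether dropping the placeholders helps is an empirical question; blank form fields may themselves be a useful signal for some document types.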
 

### 5. Conclusion

 - Notice that using the TFIDF word representation, you were able to build a better model using just 4000 words as opposed to the 29,192 words of the BOW. 
 - This is where TFIDF's strength lies: it gives the intuition that the rest of the 25,000+ words weren't adding much useful information to the model and were likely common among many documents.
 - You can learn more about word vectors, TFIDF and similar text embeddings in [this comprehensive article](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/).
 - Finally, note that you could have achieved even better accuracy by preprocessing the text further, for example with normalization, spelling correction and much more.
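
The TFIDF step that these points refer to is not shown in the cells of this notebook. A minimal sketch of how it might look is given here, assuming sklearn's `TfidfVectorizer` with `max_features` to cap the vocabulary. The toy corpus, the cap of 5 words and the names `tfidf_vec` and `model_t` are illustrative assumptions; the cap stands in for the 4000-word limit mentioned in the first point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_texts = [
    "deed of simple mortgage",
    "mortgage loan agreement",
    "travel insurance policy",
    "insurance claim form",
]
toy_labels = [1, 1, 0, 0]  # 1 = mortgage, 0 = insurance

# keep only the 5 most frequent terms and weight them by TF-IDF
tfidf_vec = TfidfVectorizer(stop_words="english", max_features=5)
tfidf = tfidf_vec.fit_transform(toy_texts)
print("shape of TFIDF:", tfidf.shape)

# TF-IDF weights are non-negative, so MultinomialNB can be trained on them
model_t = MultinomialNB().fit(tfidf, toy_labels)
print(model_t.predict(tfidf_vec.transform(["mortgage deed"])))
```

The resulting matrix can be passed to `create_model` in the same way as the Bag of Words.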

### 6. Challenge

If you look at the TFIDF dataframe, words like `demand`, `demands` and `demanded` are counted separately. This is because the dataset isn't normalized yet. I encourage you to go ahead and try to do that using the concepts learnt in the previous classes.
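
As a starting hint, a vectorizer accepts a callable `analyzer`, which lets you normalize tokens before they are counted. The sketch below uses a deliberately naive suffix stripper: `crude_stem` and `normalizing_analyzer` are hypothetical helpers written for this example, not a substitute for a proper stemmer or lemmatizer.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def crude_stem(token):
    """Strip a trailing 'ed' or 's'; naive on purpose, for illustration only."""
    for suffix in ("ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalizing_analyzer(text):
    """Lowercase, split into alphabetic tokens, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens]


norm_vec = CountVectorizer(analyzer=normalizing_analyzer)
counts = norm_vec.fit_transform(["demand demands demanded"])
print(norm_vec.vocabulary_, counts.toarray())
```

All three word forms now share a single column. The same `analyzer` argument is accepted by `TfidfVectorizer`.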

In [None]:
# Your code here