# Requirements

In [5]:
# Install prerequisites
!pip install -r requirements.txt
!python -m spacy download en_core_web_sm


# Tokenization
Tokenization divides a text into smaller chunks, either sentences or words. Accordingly, there are two kinds of tokenization: sentence tokenization and word tokenization. Let's go through them one by one.

In [6]:
# Use a sample passage to demonstrate sentence and word tokenization
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural language processing (NLP) is a field " + \
       "of computer science, artificial intelligence " + \
       "and computational linguistics concerned with " + \
       "the interactions between computers and human " + \
       "(natural) languages, and, in particular, " + \
       "concerned with programming computers to " + \
       "fruitfully process large natural language " + \
       "corpora. Challenges in natural language " + \
       "processing frequently involve natural " + \
       "language understanding, natural language" + \
       "generation frequently from formal, machine" + \
       "-readable logical forms), connecting language " + \
       "and machine perception, managing human-" + \
       "computer dialog systems, or some combination " + \
       "thereof."

tokenized_sents = sent_tokenize(text)
tokenized_words = word_tokenize(text)

print(tokenized_sents,'\n',tokenized_words)

# As the output shows, sent_tokenize splits the text at sentence boundaries, while word_tokenize splits on whitespace and also separates punctuation into its own tokens

['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural languagegeneration frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.'] 
 ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 't
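To see roughly what the two tokenizers do, here is a minimal regex-based sketch. It is an approximation for illustration only, not how NLTK works internally (sent_tokenize relies on the pretrained Punkt model), and the names simple_sent_tokenize and simple_word_tokenize are made up for this example.

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ? and precedes an uppercase letter
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def simple_word_tokenize(text):
    # A word is a run of word characters, optionally joined by hyphens or apostrophes;
    # every other non-space character (punctuation) becomes its own token
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "NLP is fun. It powers human-computer dialog systems!"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sample)
print(sentences)
print(words)
```

A rule this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a trained sentence tokenizer in practice.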

# Remove stop words from a list of tokens
Stop words are very common words that carry little meaning on their own, e.g. the, has, have. Let's remove them from the list of tokens generated above; this shrinks the input and cuts down the training time of machine learning models built for downstream tasks.

Let's import the stop word list from the nltk library and filter it out of the list of word tokens.

In [7]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # NLTK corpus file names are lowercase

In [8]:
filtered_sentences = [word for word in tokenized_words if word.lower() not in stop_words]
print(filtered_sentences)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'field', 'computer', 'science', ',', 'artificial', 'intelligence', 'computational', 'linguistics', 'concerned', 'interactions', 'computers', 'human', '(', 'natural', ')', 'languages', ',', ',', 'particular', ',', 'concerned', 'programming', 'computers', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'Challenges', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'languagegeneration', 'frequently', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'combination', 'thereof', '.']
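The output above still contains punctuation tokens such as '(' and ','. As a rough sketch, the same filtering idea can be extended to drop them as well. The stop word set below is a tiny hypothetical stand-in for NLTK's much longer list, and remove_stop_words is a name invented for this example.

```python
import string

# Hypothetical, deliberately tiny stop word list (NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "with", "in", "to"}

def remove_stop_words(tokens, drop_punctuation=True):
    kept = []
    for token in tokens:
        # Compare in lowercase so that "The" and "the" are treated alike
        if token.lower() in STOP_WORDS:
            continue
        # Optionally drop tokens made up only of punctuation characters
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["Natural", "language", "processing", "(", "NLP", ")", "is", "a",
          "field", "of", "computer", "science", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```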


# Stemming and Lemmatization
Stemming and lemmatization both reduce inflected words to a root form. For example, charging and charges both reduce to the root charge (a stemmer may return the truncated form charg instead).

* The difference between the two is that stemming may produce a root that is not an actual word, whereas lemmatization always returns an actual word. In that sense, lemmatization can be seen as a more specialized form of stemming.

In [9]:
# Stemming using NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['Programs','Programming','Charging','Studying','Coding']
for w in words:
    print(ps.stem(w))

program
program
charg
studi
code


In [10]:
# Lemmatization using NLTK
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

words_ = ['rocks','programs','guests','games']
for w in words_:
    print(wnl.lemmatize(w))


rock
program
guest
game
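The difference between the two approaches can be illustrated with a toy sketch: a stemmer applies mechanical suffix rules, while a lemmatizer looks words up in a dictionary. Both the suffix list and the lemma table below are hypothetical and far simpler than what PorterStemmer and WordNetLemmatizer actually use.

```python
# Hypothetical suffix rules, far simpler than the Porter algorithm
SUFFIXES = ("ing", "es", "s")

# Hypothetical lemma dictionary standing in for WordNet
LEMMAS = {
    "studies": "study",
    "studying": "study",
    "charging": "charge",
    "charges": "charge",
}

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Fall back to the lowercased word when it is not in the dictionary
    return LEMMAS.get(word.lower(), word.lower())

for w in ["charging", "charges", "studies"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

The stems (charg, studi) are not dictionary words, whereas the lemmas (charge, study) are.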


# Represent document as vector
Let's say we have a bunch of sentences in a document and we want to classify them. How can we feed them to a machine learning model? Since models work on numeric values, we need to convert the text into numeric feature vectors first.

There are different methods of vectorization, such as the count vectorizer (which counts the occurrences of each word in a document) and the TF-IDF (term frequency-inverse document frequency) vectorizer.

Let's import the vectorizers from the scikit-learn library.

* Count Vectorizer

In [11]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import pandas as pd
import numpy as np

# Let's play with sample tweets dataset
data = pd.read_csv('tweets_2020-05-29-20.csv')
docs = list(data['full_retweet_text'][:10])

docs = [str(sent) for sent in docs]
print('Before vectorization')
print(docs)



Before vectorization
['.@AshishXL Bhai, Plasma donation in this case, has been done, thanks to the donor and our health team (Santosh, Kiran, Sirtaj, Gurpeet)🙏🏼\n#StayHome #DelhiFightsCorona \n@sharmanagendar https://t.co/MGXZv3DzO5', 'nan', 'Impact of COVID-19 on pregnancy \n#Obstetrics #Maternalhealth #childlife #UNICEF \n\nhttps://t.co/vIjP4gF9HT', "China's cover-up of the Wuhan virus allowed the disease to spread all over the world, instigating a global pandemic that has caused over 100000 American lives &amp; over a million lives worldwide. Chinese officials ignored their reporting obligations to WHO: US President Donald Trump https://t.co/jo15L2s38C", '#COVID19 recovery rate improves to around 43 per cent; over 71,000 cured so far\n\nhttps://t.co/dO48FjSEyb', 'BREAKING:\n\nPresident Trump just announced that the US will be TERMINATING our relationship with the WHO—the Wuhan Health Organization!\n\nRT if you are THRILLED this is finally happening!', '#IndiaFightsCorona \n\nLet us 

In [12]:
# Initialize vectorizer
count_vectorizer = CountVectorizer() # Parameters such as ngram_range and analyzer can be tweaked here: ngram_range defaults to (1,1) (single words); set it to (2,2) to count pairs of words, and so on
vectors = count_vectorizer.fit_transform(docs)


print('After vectorization')
print(vectors.toarray())

# Each row is a document and each column is a vocabulary word; an entry is the number of times that word occurs in that document

After vectorization
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]]


* TF-IDF Vectorizer

In [13]:
tfidf_vectorizer = TfidfVectorizer(use_idf = True, analyzer = 'word', ngram_range=(1,2))
tfidf_vectors = tfidf_vectorizer.fit_transform(docs)  # fit_transform learns the vocabulary and transforms docs in one step
print('TF-IDF Vectors', tfidf_vectors.toarray())
print('Features names', tfidf_vectorizer.get_feature_names_out())

TF-IDF Vectors [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Features names ['000' '000 cured' '100000' '100000 american' '1024' '1024 new' '16'
 '16 281' '19' '19 on' '206' '206 or' '207' '207 finland' '231'
 '231 recovered' '24' '24 hours' '2462' '2462 people' '28' '28 may' '281'
 '281 cases' '316' '316 number' '43' '43 per' '443' '443 the' '71'
 '71 000' '7495' '7495 death' '8470' '8470 delhifightscorona' 'active'
 'active cases' 'all' 'all boost' 'all over' 'allowed' 'allowed the'
 'american' 'american lives' 'amp' 'amp over' 'and' 'and our' 'announced'
 'announced that' 'are' 'are thrilled' 'around' 'around 43' 'as' 'as of'
 'ashishxl' 'ashishxl bhai' 'be' 'be terminating' 'been' 'been done'
 'bengalnewzworld' 'bhai' 'bhai plasma' 'bigbreakingnews'
 'bigbreakingnews in' 'boost' 'boost the' 'breaking' 'breaking president'
 'breaking_24x7' 'breaking_24x7 trumpdepression' 'br

# Query documents by similarity

After TF-IDF vectorization, we can use cosine similarity scores to find similar sentences in a document.

Note: TF-IDF vectorization usually works better than count vectorization here, because it considers not only how often a word occurs in a document but also how many documents contain that word. Down-weighting words that occur everywhere helps to determine the similarity between individual sentences in a document.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
# Let's use the first sentence in docs as the query sentence
query = tfidf_vectors[0:1]
similarity_matrix = cosine_similarity(query, tfidf_vectors)  # sparse input is accepted, so toarray() is not needed
print(similarity_matrix)

[[1.         0.         0.02614298 0.04764016 0.03083108 0.05700748
  0.03265538 0.03920179 0.01180718 0.05367363]]
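For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch, with made-up vectors and a helper name (cosine_sim) chosen for this example:

```python
import math

def cosine_sim(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # a scaled copy of the query
orthogonal = [0.0, 0.0, 3.0]       # shares no non-zero dimension with the query

print(cosine_sim(query_vec, same_direction))
print(cosine_sim(query_vec, orthogonal))
```

Because only the direction matters, a vector always has similarity 1 with itself, which is why the first entry of the matrix above is 1.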


# Apply Word Embeddings Model to create Document Vectors

Let's explore another method of creating document vectors: pretrained word embedding models, which extract contextual features from given text sequences. These document vectors can then be used to classify sentences and for many other downstream tasks.

In [15]:
# Let's use a pretrained spaCy English pipeline to create document vectors
import spacy
nlp = spacy.load('en_core_web_sm') # spacy.load takes no download argument; the model was already downloaded in the Requirements step. en_core_web_sm is the small pipeline, en_core_web_md -> medium size and en_core_web_lg -> large size

# Doc.vector gives one vector per sentence; by default it is an average over the vectors of the sentence's tokens
vectors = [nlp(sent).vector for sent in docs]
print(vectors)


[array([ 0.15237004,  0.18714751, -0.5484187 ,  0.30625474,  1.6435678 ,
        0.7385486 ,  0.45245537, -0.15774831,  0.99820447,  1.5292078 ,
        0.6275515 , -1.1710415 , -0.55321056, -0.942731  , -0.6987732 ,
       -1.3398434 , -0.2996181 ,  0.32035145,  0.10361589, -0.5445901 ,
        0.6154242 , -0.3586897 , -0.31722206, -0.19259013, -0.511952  ,
       -0.3073782 , -0.5385157 , -0.69947106,  0.7838872 , -0.58011055,
        0.05466099,  0.5952998 ,  0.2915551 ,  0.18558408,  0.29152414,
       -0.33659157,  1.5055457 , -0.9074809 ,  0.05406311, -0.6646737 ,
        1.3476272 ,  0.49729127,  0.21240169, -1.8145515 , -1.2545425 ,
       -0.89764935,  0.09181776, -0.3249551 , -0.7165451 ,  0.52670294,
        0.55528575, -1.1691321 , -0.5744502 ,  0.00452233, -1.8886673 ,
        0.8994386 ,  0.23017785,  0.52385014, -0.3442439 , -0.07341588,
       -0.16352   , -0.36046058,  1.0966384 ,  0.52782434,  0.06773804,
       -0.17929795,  0.58979917, -1.109526  , -0.90127087,  0.6
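Conceptually, a document vector of this kind is an average over per-token vectors. The sketch below shows that averaging step in plain Python. The 3-dimensional embeddings are invented for illustration and have nothing to do with spaCy's actual vectors, and document_vector is a hypothetical helper.

```python
# Hypothetical 3-dimensional word embeddings, made up for illustration
EMBEDDINGS = {
    "covid": [0.9, 0.1, 0.0],
    "recovery": [0.5, 0.5, 0.2],
    "rate": [0.1, 0.9, 0.4],
}

def document_vector(tokens, embeddings, dim=3):
    # Collect the vectors of the tokens that have an embedding
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # Average the vectors component by component
    return [sum(component) / len(vectors) for component in zip(*vectors)]

doc_vec = document_vector(["covid", "recovery", "rate", "improves"], EMBEDDINGS)
print(doc_vec)
```

Skipping unknown tokens (here "improves") is one possible choice; another would be to map them to a zero vector.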

# Extract Text Features and use them in Classification Pipelines

Now that we have extracted text features, let's classify text using classification pipelines in sklearn.


* Training

In [40]:
# Let's load the 20 newsgroups dataset, which contains posts from 20 newsgroups, and classify each post into its newsgroup
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.metrics import accuracy_score,classification_report
# More detailed documentation about fetch newsgroup dataset is here - https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
data_newsgroups  = fetch_20newsgroups(subset='train') # Training subset of labelled data

# Split Dataset
X_train,X_test,y_train,y_test = train_test_split(data_newsgroups.data,data_newsgroups.target,test_size=0.3)

# Let's use TF-IDF for numeric feature extraction and a support vector classifier to classify text
# Let's build a classification pipeline from these two components, i.e. a TF-IDF vectorizer and a support vector classifier (SVC) with a linear kernel

svc_tfidf = Pipeline([('tf_idf_vectorizer',TfidfVectorizer(stop_words='english',max_features=5000)),
                      ('svc',SVC(kernel='linear'))])

scores = cross_val_score(svc_tfidf,X_train,y_train,cv=2).mean()
print('Score',scores)


Score 0.8277564416583107


* Testing

In [41]:
svc_tfidf.fit(X_train,y_train)
preds = svc_tfidf.predict(X_test)

In [42]:

accuracy_score_ = accuracy_score(y_test,preds)
classification_report_ = classification_report(y_test,preds)

print('Accuracy Score', accuracy_score_)
print('Classification Report',classification_report_)

Accuracy Score 0.8771723122238586
Classification Report               precision    recall  f1-score   support

           0       0.92      0.91      0.91       144
           1       0.77      0.82      0.80       177
           2       0.79      0.81      0.80       197
           3       0.68      0.77      0.72       178
           4       0.84      0.81      0.83       168
           5       0.85      0.87      0.86       168
           6       0.84      0.84      0.84       159
           7       0.90      0.86      0.88       199
           8       0.95      0.95      0.95       173
           9       0.95      0.86      0.90       175
          10       0.94      0.97      0.96       184
          11       0.97      0.93      0.95       155
          12       0.73      0.79      0.76       166
          13       0.92      0.93      0.93       175
          14       0.97      0.93      0.95       196
          15       0.89      0.94      0.91       187
          16       0.96  

# Latent Semantic Analysis(LSA) for Document Classification
LSA is an approach for finding hidden topics that exist across a set of documents.

The two main steps in this process are building a term-document matrix and then reducing the dimensionality of that matrix. To reduce the dimensions and discover topics we will use SVD (Singular Value Decomposition).
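The two steps can be sketched with plain NumPy on a tiny hand-made matrix. The counts and the term labels in the comments are hypothetical; the point is only to show how keeping the top k singular values maps each document to a k-dimensional topic vector.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents
A = np.array([
    [2, 1, 0, 0],  # "graphics"
    [1, 2, 0, 0],  # "image"
    [0, 0, 1, 2],  # "religion"
    [0, 0, 2, 1],  # "faith"
], dtype=float)

# Step 1 is the matrix above; step 2 factorizes it as A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent topics to keep
# One k-dimensional row per document
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T
# Rank-k approximation of the original matrix
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

print(doc_topics.shape)
print(np.round(A_k, 2))
```

In this toy matrix the first two documents share their terms, so they end up pointing in the same direction in topic space, while the first and third documents do not overlap at all.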

Let's walk through briefly how we can do it using scikit-learn.

In [32]:
# Let's use same dataset as above for document classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,classification_report

categories = ['talk.religion.misc','comp.graphics'] # Let's consider 2 categories out of 20 categories of newsgroups
newsgroups_train = fetch_20newsgroups(subset='train',categories=categories,remove=('headers','footers','quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',categories=categories,remove=('headers','footers','quotes'))

Let's discover latent topics from the training data, fit the model, and then predict on the held-out testing data.

In [17]:
vectorizer = TfidfVectorizer(max_df=0.5,stop_words='english',use_idf=True,max_features=5000)
X_train,X_test,y_train,y_test = train_test_split(newsgroups_train.data,newsgroups_train.target,test_size=0.3)
X_train_tfidf = vectorizer.fit_transform(X_train)

In [33]:
# Let's project the TF-IDF vectors onto 100 latent components (topics) with truncated SVD
svd = TruncatedSVD(100)
lsa = make_pipeline(svd,Normalizer(copy=False))

# Project the training data to lower dimensions using SVD
X_train_lsa = lsa.fit_transform(X_train_tfidf)

# Let's apply transformations to testing data
X_test_tfidf = vectorizer.transform(X_test)
X_test_lsa = lsa.transform(X_test_tfidf)

# Build a classifier model, e.g. a k-nearest neighbors classifier
knn_classify = KNeighborsClassifier()
knn_classify.fit(X_train_lsa,y_train)

preds = knn_classify.predict(X_test_lsa)
score = accuracy_score(y_test,preds)
classify_report = classification_report(y_test,preds)
print(score)
print(classify_report)


0.8546712802768166
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       178
           1       0.77      0.89      0.82       111

    accuracy                           0.85       289
   macro avg       0.85      0.86      0.85       289
weighted avg       0.86      0.85      0.86       289



# Compare various vectorization methods for document classification

Machine learning algorithms operate on numerical features, on which mathematical operations can be performed easily; they cannot work on raw text. Text must therefore be converted into a numerical representation, i.e. a vector (or matrix). To convert textual data into numerical features we can use the following text vectorization techniques in Natural Language Processing.

* Bag Of Words (Count Vectorizer)
* Term Frequency and Inverse Document Frequency (TF-IDF)
* Word2Vec

Raw data contains numerical values, punctuation, spaces and special characters, which can hamper the performance of a model, so it is necessary to pre-process the data first. For that we can use various pre-processing techniques, such as:

* Regular expressions – for removing numerical values, punctuation, special characters etc.
* Lowercasing the text data
* Tokenization – converting a group of sentences into tokens
* Removing stop words from the text data, e.g. of, in, the
* Stemming and/or lemmatization – reducing a word to its stem or lemma

After applying these pre-processing techniques we need to convert the final extracted features into numerical features in order to build our model. This is where text vectorization techniques come into the picture.

Let’s have a look at each of them in detail:

* Bag Of Words

BOW is a text vectorization model commonly used for document representation in the field of information retrieval.

The bag-of-words model ignores the word order, grammar and syntax of a document and treats it as an unordered collection of words. The appearance of each word in the document is assumed to be independent of whether other words appear.


Consider an example in which three sentences contain 12 unique words in total. The order in which the words appear in a sentence does not matter. The sentences are transformed to vectors using CountVectorizer(). Each output vector contains 12 elements, where the i-th element is the number of times the i-th word in the dictionary appears in the sentence.

Imagine a huge document set D with a total of M documents. The unique words are extracted from the documents, giving a vocabulary of N words. In the bag-of-words model, each document is then represented as an N-dimensional vector.

The BOW model can be considered as a statistical histogram. It is used in text retrieval, document classification and processing applications.

* TF-IDF (Term Frequency – Inverse Document Frequency)

Another popular text vectorization technique for extracting features from data is TF-IDF. TF-IDF is a numerical statistic used to measure how relevant a word is to a document that is part of a larger collection of documents.

The two metrics TF and IDF are as follows:

Term Frequency (TF) – TF scores each word or token by how frequently it occurs in a document. The raw frequency depends on the length of the document: a word tends to occur more often in a long document than in a short or medium-sized one.

To overcome this problem we normalize by dividing the frequency of a word by the length of the document (the total number of words). Doing this for every word in each document produces a sparse matrix of term frequencies.

TF = number of times the term occurs in a document / total number of words in the document


Inverse Document Frequency (IDF) – IDF measures the general importance of a word. The main idea is that the fewer documents contain the term t, the larger its IDF, which means the term is good at distinguishing between categories. The IDF of a specific word is calculated by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.

IDF = log (total number of documents / number of documents containing term t)

The base of the logarithm only rescales the values: the natural logarithm (base e) is common, while the example below uses base 10 so that the numbers come out round.


Example:
Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (TF) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these.

Then, using a base-10 logarithm, the inverse document frequency (IDF) is log10(10,000,000 / 1,000) = 4.

Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
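The arithmetic of this example can be checked in a few lines of Python. The function names are chosen for this sketch, and the base-10 logarithm is used to match the numbers in the example.

```python
import math

def term_frequency(term_count, total_terms):
    # Share of the document's words that are this term
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    # Base-10 logarithm, as in the worked example
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
tf_idf = tf * idf

print(tf, idf, tf_idf)
```

With the natural logarithm the IDF would be about 9.21 instead of 4, but the ranking of terms by TF-IDF would stay the same.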


For example, TfidfVectorizer() can compute the TF-IDF of three sentences and convert them to vectors. The TF-IDF value of a word increases with its frequency in a document and decreases with the number of documents that contain it. As with Bag of Words, this technique does not capture any semantic meaning of words.

* TF-IDF applications

  * Search engines
  * Keyword extraction
  * Text similarity
  * Text summarization

* Word2Vec

The Bag of Words and TF-IDF techniques do not capture the semantic meaning of words. However, many NLP applications, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.

Word embeddings capture semantic and syntactic relationships between words as well as the context of words in a document. Word2vec is one technique used to learn word embeddings.

The Word2vec model takes a large corpus as input and produces a vector space as output, typically with several hundred dimensions. Each word in the corpus is assigned a vector in this space, positioned so that words that share common contexts in the corpus lie close to each other.

The Word2vec method learns these relationships between words while building a model. For this purpose word2vec uses two methods:

* Skip-gram
* CBOW (Continuous Bag of Words)

The Word2vec model captures relationships between words within a context window, using either the skip-gram or the CBOW method. The window size sets how many neighbouring words on each side of a word count as its context, similar to n-grams, where we create sequences of n words.


* Skip-gram
The skip-gram method takes the center word of the window as input and the context words (neighbouring words) as outputs; in other words, the model predicts the context words from the center word. Skip-gram works well with small datasets and represents rare words well.

* Continuous Bag-of-Words (CBOW)
CBOW is the reverse of the skip-gram method: here the context words are the input and the model predicts the center word within the window. Another difference from skip-gram is that CBOW trains faster and gives better representations for frequent words.
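The difference between the two methods is easiest to see in the training examples they generate. The sketch below only builds those examples from a token list, it does not train a model, and the helper names skipgram_pairs and cbow_examples are invented for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    # Skip-gram: one (center word, context word) pair per neighbour in the window
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBOW: one (list of context words, center word) example per position
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

tokens = ["we", "love", "natural", "language"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg)
print(cb)
```

With a window of 1, skip-gram turns the center word "love" into two separate pairs, ("love", "we") and ("love", "natural"), while CBOW produces a single example that predicts "love" from ["we", "natural"].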


Word2Vec has its applications in knowledge discovery and recommendation systems.

Conclusion:
We can use any of these text feature extraction methods depending on our project requirements, because every method has its advantages: Bag of Words is suitable for text classification, TF-IDF for document classification, and word2vec when you need semantic relations between words.

We can’t say in advance which type of feature extraction will give the best results. Building word embeddings from our own dataset or corpus can give better results, but we don’t always have a large enough dataset; in that case we can use pre-trained models with transfer learning.