# Entity Classification

When we read a text, we recognize entities such as people, values, and locations straight away. For example, in the sentence “Alexander the Great was a king of the ancient Greek kingdom of Macedonia.”, we can identify three types of entities:


> * Person: Alexander
> * Culture: Greek
> * Kingdom: Macedonia

Enormous amounts of text data are now available, and modern machines can process this text to perform tasks such as sentiment analysis, content search, named entity recognition, part-of-speech tagging, information retrieval, and more.

In this practice session, we will use a Naive Bayes classifier to assign each text to the category (entity) it belongs to. To perform this task, we are going to use the well-known 20 newsgroups dataset, which comprises nearly 19,000 newsgroup posts on 20 different topics.

# Code Implementation to identify entities



Create the Environment:

Set up the Python environment by installing and importing the required frameworks and libraries.

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim --user -q --no-warn-script-location


In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

Let’s quickly walk through the dataset.

In [None]:
dataset = fetch_20newsgroups(subset='all',shuffle=True,random_state=42)

We cannot feed raw text like this directly to a machine learning model. It must first be converted into vectors of numerical values, one vector per document. Utilities such as CountVectorizer and TfidfTransformer, provided by scikit-learn, turn raw text into meaningful vectors.

In [None]:
count_vecto = CountVectorizer()

CountVectorizer tokenizes a collection of text documents and builds a vocabulary of known words. When you call fit_transform on the documents, each one is encoded as a vector with the length of the full vocabulary, holding an integer count of how many times each word appears in that document, as shown in the picture above. Because most entries are zero, the result is returned as a sparse matrix. To see what the function has done, you can convert it into a NumPy array by calling its toarray() method.

In [None]:
x_train_count = count_vecto.fit_transform(dataset.data)

This representation alone is not ideal, because it counts words such as ‘the’, ‘an’, and ‘a’, which appear many times throughout the documents; their large counts carry little meaning in the encoded vectors.

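Note: if you want to inspect the dense vector representation, limit the dataset to the train subset first (subset='train'), since converting the full matrix with toarray() is memory-intensive.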

TfidfTransformer re-weights the counts produced by CountVectorizer instead of using them as they are. TF-IDF scores are word frequency scores that try to highlight the words most relevant to a document.

Term Frequency measures how often a term occurs in a document. Inverse Document Frequency weights each word by how rare it is across the whole collection; in other words, it downscales words that appear in many documents, such as ‘a’, ‘an’, and ‘the’. TF-IDF vectors are used in the same way as the CountVectorizer output.
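A worked example may help here. The sketch below computes TF-IDF weights by hand for a toy corpus, using the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 that scikit-learn documents as the TfidfTransformer default (smooth_idf=True). The final L2 normalization step is left out to keep the arithmetic visible. The corpus and the helper name smoothed_idf are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization keeps the example simple.
toy_docs = [
    "the engine of the car",
    "the graphics card",
    "the car and the road",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


# Raw term counts for the first document, each multiplied by its IDF.
term_counts = Counter(tokenized[0])
tfidf_first_doc = {t: c * smoothed_idf(t) for t, c in term_counts.items()}

for term, weight in sorted(tfidf_first_doc.items()):
    print(f"{term:8s} count={term_counts[term]} "
          f"idf={smoothed_idf(term):.3f} tfidf={weight:.3f}")
```

Since "the" occurs in every document, it receives the minimum IDF of 1.0, while "engine", which occurs in only one document, receives a higher weight per occurrence.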

Here we have already performed the first step of TF-IDF, the term counts. We can feed our CountVectorizer output directly to TfidfTransformer to calculate the inverse document frequencies, i.e. to downscale the counts.

In [None]:
#pd.DataFrame(x_train_count.toarray())

In [None]:
tfid_vecto = TfidfTransformer()
train_tfid = tfid_vecto.fit_transform(x_train_count)
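As a side note, scikit-learn also offers TfidfVectorizer, which it describes as equivalent to CountVectorizer followed by TfidfTransformer. The following sketch checks that equivalence on a made-up toy corpus with default settings; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus.
toy_docs = ["the engine of the car", "the graphics card", "the car and the road"]

# Two-step route: raw counts first, then TF-IDF re-weighting.
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: tokenizing, counting and re-weighting in a single object.
one_step = TfidfVectorizer().fit_transform(toy_docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())
print(same_result)
```

The two-step route used in this notebook is handy when the raw counts are also of interest, as they are here.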

In [None]:
train_tfid.data

Using the Multinomial Naive Bayes classifier to identify entities:

The MultinomialNB classifier is designed for discrete features such as word counts, and in practice it also works with fractional TF-IDF weights like those in our training data. It uses these term frequencies to compute (smoothed) maximum likelihood estimates of the conditional probabilities from the training data.
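The idea can be sketched without scikit-learn. The toy implementation below estimates log P(class) and Laplace-smoothed log P(word | class) from word counts, then picks the class with the highest total log score. It illustrates the technique on made-up data (the training sentences, labels and function names are all hypothetical) and is not a reproduction of the MultinomialNB internals.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled toy data.
train = [
    ("honda motorbike engine ride", "motorcycles"),
    ("motorbike helmet ride road", "motorcycles"),
    ("gpu graphics card render", "graphics"),
    ("graphics image render pixel", "graphics"),
]


def fit_multinomial_nb(samples, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    class_docs = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})

    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # Laplace smoothing: add alpha to every word count in the vocabulary.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(text, log_prior, log_likelihood):
    """Return the class with the highest unnormalized log posterior."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


log_prior, log_likelihood = fit_multinomial_nb(train)
print(predict("i ride a honda motorbike", log_prior, log_likelihood))
print(predict("render the image on a gpu", log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.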

In [None]:
len(dataset.target)

In [None]:
model= MultinomialNB().fit(train_tfid,dataset.target)

Let’s try to predict the category of a few new sentences that touch on some of the topics in our training data.

In [None]:
new=['i have a motorbike which made by honda.','i have GPU based system.','The Bible is simply the written core of that tradition.']

In [None]:
test = count_vecto.transform(new)
test.toarray()

In [None]:
pd.DataFrame(test.toarray())

In [None]:
len(test.data)

In [None]:
pred=model.predict(tfid_vecto.transform(test))
pred
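New text has to pass through the same fitted vectorizer and transformer, in the same order, before it reaches the classifier. One way to keep these steps together is scikit-learn's make_pipeline. The sketch below trains such a pipeline on a made-up toy corpus rather than on the newsgroups data; the sentences, labels and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data.
toy_texts = [
    "honda motorbike engine ride",
    "motorbike helmet ride road",
    "gpu graphics card render",
    "graphics image render pixel",
]
toy_labels = ["motorcycles", "motorcycles", "graphics", "graphics"]

# Counting, TF-IDF re-weighting and classification chained into one estimator.
toy_pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline applies every fitted step in order.
toy_predictions = toy_pipeline.predict(
    ["i have a honda motorbike", "my system has a gpu card"]
).tolist()
print(toy_predictions)
```

With a pipeline, there is no need to call transform on each step by hand when predicting.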

Cross-check the predictions against the class names.

In [None]:
for a, b in zip(new, pred):
    print(a, '---> is predicted as --->', dataset.target_names[b])