# TensorFlow Text Classification

Text classification is the task of assigning the right label to a given piece of text. The text can be a phrase, a sentence, or even a paragraph. Our aim is to take some text as input and assign a label to it. Since we will be using TensorFlow (through the higher-level TFLearn API) as our deep learning library, we can call this a TensorFlow text classification system.

This task involves training a neural network with lots of data indicating what a piece of text represents. I am sure you have heard of the term “sentiment analysis”. Sentiment analysis is a text classification task, but it is restricted to identifying the sentiment of the person saying something. For example, the sentence “The food was amazing” has a positive sentiment. On the other hand, “the movie was horrible” has a negative sentiment, while the sentence “sun rises from the east” has a neutral sentiment.

For sentiment analysis, the labels are usually positive, negative and neutral. But this is just one use of text classification. If you are building other text-based applications like a chatbot or a document parsing algorithm, you might want to know which category a particular sentence belongs to. For example, “Hello! how are you?” can have the label “Greeting” attached to it, and the sentence “It was a pleasure meeting you” can have the label “Farewell” attached to it.

### What are we going to do?

We are going to build a simple text classifier.

# Step 1: Data Preparation

Before we train a model that can classify a given text to a particular category, we have to first prepare the data. We can create a simple JSON file that will hold the required data for training.

Following is a sample file that I have created, which contains 5 categories. You can create as many categories as you want.

{
  "time": ["what time is it?", "how long has it been since we started?", "that's a long time ago", " I spoke to you last week", " I saw you yesterday"],
  "sorry": ["I'm extremely sorry", "did he apologize to you?", "I shouldn't have been rude"],
  "greeting": ["Hello there!", "Hey man! How are you?", "hi"],
  "farewell": ["It was a pleasure meeting you", "Good Bye.", "see you soon", "I gotta go now."],
  "age": ["what's your age?", "How old are you?", "I'm a couple of years older than her", "You look aged!"]
}


In the above structure, we have a simple JSON object with 5 categories (time, sorry, greeting, farewell, and age). For each category, we have a set of sentences which we can use to train our model.

Given this data, we have to classify any given sentence into one of these 5 categories.
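
If you would rather generate the file from Python than type it by hand, something like the following should work. This is only a minimal sketch: the helper name `save_training_data` is my own (it is not part of any library), and it simply dumps the dictionary with the standard `json` module.

```python
import json

# The same five categories and sentences as in the sample file above.
training_data = {
    "time": ["what time is it?", "how long has it been since we started?",
             "that's a long time ago", " I spoke to you last week",
             " I saw you yesterday"],
    "sorry": ["I'm extremely sorry", "did he apologize to you?",
              "I shouldn't have been rude"],
    "greeting": ["Hello there!", "Hey man! How are you?", "hi"],
    "farewell": ["It was a pleasure meeting you", "Good Bye.",
                 "see you soon", "I gotta go now."],
    "age": ["what's your age?", "How old are you?",
            "I'm a couple of years older than her", "You look aged!"],
}


def save_training_data(data, path="data.json"):
    """Hypothetical helper: write the category -> sentences mapping as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
```

Calling `save_training_data(training_data)` would then write `data.json` into the current directory.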

# Step 2: Data Load and Pre-processing

In this step, we load the JSON data that we have created. Let us assume that we have that data stored in a file named “data.json”.

Once we load the data, we have to clean it and form the bag of words: a fixed-length vector with one slot per known word, set to 1 when that word occurs in the sentence and 0 otherwise.
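
To make the bag-of-words idea concrete before we bring in NLTK, here is a small standalone sketch. It is a simplification: it splits on whitespace and skips stemming, and the function names `build_vocabulary` and `bag_of_words` are made up for this illustration.

```python
def build_vocabulary(sentences):
    """Collect the unique lower-cased words across all sentences, sorted."""
    vocabulary = set()
    for sentence in sentences:
        vocabulary.update(sentence.lower().split())
    return sorted(vocabulary)


def bag_of_words(sentence, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in the sentence."""
    present = set(sentence.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


vocabulary = build_vocabulary(["see you soon", "how old are you"])
print(vocabulary)                               # ['are', 'how', 'old', 'see', 'soon', 'you']
print(bag_of_words("are you old", vocabulary))  # [1, 0, 1, 0, 0, 1]
```

Every sentence, whatever its length, ends up as a vector with one entry per vocabulary word, which is what lets us feed it to a network with a fixed input size.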

In [5]:
import nltk
from nltk.stem.lancaster import LancasterStemmer
import numpy as np
import tflearn
import tensorflow as tf
import random
import json
import string
import unicodedata
import sys

# a table structure to hold the different punctuation used
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))
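
To see what that table does, here is a small standalone sketch. The names `punctuation_table` and `strip_punctuation` are chosen just for this illustration. `str.translate` deletes every character whose code point maps to `None`, and `dict.fromkeys` gives us exactly that mapping for all code points whose Unicode category starts with `P`.

```python
import sys
import unicodedata

# Map every code point in a Unicode punctuation category ('P...') to None.
punctuation_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)


def strip_punctuation(text):
    """Delete all punctuation characters from the text."""
    return text.translate(punctuation_table)


print(strip_punctuation("Hey man! How are you?"))   # Hey man How are you
print(strip_punctuation("that's a long time ago"))  # thats a long time ago
```

Note that the apostrophe counts as punctuation too, so "that's" becomes "thats".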



In [6]:
# method to remove punctuations from sentences.
def remove_punctuation(text):
    return text.translate(tbl)

# initialize the stemmer
stemmer = LancasterStemmer()
# variable to hold the Json data read from the file
data = None

# read the json file and load the training data
with open('data.json') as json_data:
    data = json.load(json_data)
    print(data)

# get a list of all categories to train for
categories = list(data.keys())
words = []
# a list of tuples with words in the sentence and category name
docs = []

for each_category in data.keys():
    for each_sentence in data[each_category]:
        # remove any punctuation from the sentence
        each_sentence = remove_punctuation(each_sentence)
        print(each_sentence)
        # extract words from each sentence and append to the word list
        # (word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand)
        w = nltk.word_tokenize(each_sentence)
        print("tokenized words: ", w)
        words.extend(w)
        docs.append((w, each_category))

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words]
words = sorted(list(set(words)))

print(words)
print(docs)

{'time': ['what time is it?', 'how long has it been since we started?', "that's a long time ago", ' I spoke to you last week', ' I saw you yesterday'], 'sorry': ["I'm extremely sorry", 'did he apologize to you?', "I shouldn't have been rude"], 'greeting': ['Hello there!', 'Hey man! How are you?', 'hi'], 'farewell': ['It was a pleasure meeting you', 'Good Bye.', 'see you soon', 'I gotta go now.'], 'age': ["what's your age?", 'How old are you?', "I'm a couple of years older than her", 'You look aged!']}
what time is it
tokenized words:  ['what', 'time', 'is', 'it']
how long has it been since we started
tokenized words:  ['how', 'long', 'has', 'it', 'been', 'since', 'we', 'started']
thats a long time ago
tokenized words:  ['thats', 'a', 'long', 'time', 'ago']
 I spoke to you last week
tokenized words:  ['I', 'spoke', 'to', 'you', 'last', 'week']
 I saw you yesterday
tokenized words:  ['I', 'saw', 'you', 'yesterday']
Im extremely sorry
tokenized words:  ['Im', 'extremely', 'sorry']
did he 

In [7]:
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(categories)
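
The `output_empty` list is the template for a one-hot label: all zeros, with a single 1 at the index of the sentence's category. A minimal sketch of that encoding follows. The helper name `one_hot` is my own, and the category order is assumed to follow the key order of the JSON file.

```python
# Category order assumed to match the keys of data.json.
categories = ["time", "sorry", "greeting", "farewell", "age"]


def one_hot(category, categories):
    """Return a list of zeros with a 1 at the position of the given category."""
    row = [0] * len(categories)
    row[categories.index(category)] = 1
    return row


print(one_hot("greeting", categories))  # [0, 0, 1, 0, 0]
print(one_hot("age", categories))       # [0, 0, 0, 0, 1]
```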


In [9]:
for doc in docs:
    # initialize our bag of words(bow) for each document in the list
    bow = []
    # list of tokenized words for the pattern
    token_words = doc[0]
    # stem each word
    token_words = [stemmer.stem(word.lower()) for word in token_words]
    # create our bag of words array
    for w in words:
        bow.append(1 if w in token_words else 0)

    output_row = list(output_empty)
    output_row[categories.index(doc[1])] = 1

    # our training set will contain the bag of words model and the output row that tells
    # which category that bow belongs to.
    training.append([bow, output_row])

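# shuffle the (bag of words, label) pairs; each pair stays together
random.shuffle(training)

# train_x contains the bags of words and train_y contains the label/category.
# They are split with list comprehensions because np.array(training) would have to
# build a ragged array (the two halves of each pair differ in length), which recent
# NumPy versions reject.
train_x = [pair[0] for pair in training]
train_y = [pair[1] for pair in training]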

In [10]:
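# reset underlying graph data
# (TensorFlow 1.x API; TensorFlow 2.x exposes it as tf.compat.v1.reset_default_graph)
tf.reset_default_graph()
# Build neural network
# input layer: one unit per word in the vocabulary
net = tflearn.input_data(shape=[None, len(train_x[0])])
# two hidden layers of 8 units each
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
# output layer: one softmax probability per category
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)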

# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
# Start training (apply gradient descent algorithm)
model.fit(train_x, train_y, n_epoch=1000, batch_size=8, show_metric=True)
model.save('model.tflearn')

Training Step: 2999  | total loss: 0.19740 | time: 0.011s
| Adam | epoch: 1000 | loss: 0.19740 - acc: 0.9794 -- iter: 16/19
Training Step: 3000  | total loss: 0.17892 | time: 0.016s
| Adam | epoch: 1000 | loss: 0.17892 - acc: 0.9815 -- iter: 19/19
--
INFO:tensorflow:C:\Users\Prabhat\Desktop\100DaysOfMLCode\40. Tensorflow Text Classification\model.tflearn is not in all_model_checkpoint_paths. Manually adding it.
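
The step counter in this log can be checked by hand. Our sample file has 19 sentences, and with `batch_size=8` each epoch is split into batches of 8, 8 and 3 sentences, which is why the log shows `iter: 16/19` followed by `iter: 19/19`. The short sketch below redoes that arithmetic; the variable names are just for illustration.

```python
import math

# Number of sentences per category in the sample data.json.
sentences_per_category = {"time": 5, "sorry": 3, "greeting": 3, "farewell": 4, "age": 4}

n_samples = sum(sentences_per_category.values())
batch_size = 8
n_epoch = 1000

# The last batch of an epoch is smaller when the data does not divide evenly.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_steps = steps_per_epoch * n_epoch

print(n_samples)        # 19
print(steps_per_epoch)  # 3
print(total_steps)      # 3000
```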


In [11]:
# let's test the model for a few sentences:
# the first two sentences are used for training, and the last two sentences are not present in the training data.
sent_1 = "what time is it?"
sent_2 = "I gotta go now"
sent_3 = "do you know the time now?"
sent_4 = "you must be a couple of years older then her!"

# a method that takes in a sentence and returns its bag of words
# in a form that can be fed to tensorflow


def get_tf_record(sentence):
    # remove punctuation, as was done for the training sentences
    sentence = remove_punctuation(sentence)
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    # bag of words: 1 for every vocabulary word present in the sentence
    bow = [1 if w in sentence_words else 0 for w in words]

    return np.array(bow)


# we can start to predict the results for each of the 4 sentences
print(categories[np.argmax(model.predict([get_tf_record(sent_1)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_2)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_3)]))])
print(categories[np.argmax(model.predict([get_tf_record(sent_4)]))])

time
farewell
time
age
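
Each prediction line above takes the probabilities returned by the softmax layer, finds the index of the largest one with `np.argmax`, and looks up the category name at that index. Here is a pure-Python sketch of that last step. The raw scores are made-up numbers for illustration, not real model output, and the helper names are my own.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(probabilities, categories):
    """Return the category whose probability is the largest (argmax)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return categories[best_index]


categories = ["time", "sorry", "greeting", "farewell", "age"]
raw_scores = [2.0, 0.1, -1.0, 0.3, 0.5]  # hypothetical values, not from the trained model
probabilities = softmax(raw_scores)

print(predict_category(probabilities, categories))  # time
```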


Credit: https://sourcedexter.com/tensorflow-text-classification-python/