# Introduction: Text Classification with CNNs
Hello everyone, and welcome to this kernel. I am going to show you how to build a Convolutional Neural Network (CNN) with TensorFlow to classify text.

Before starting, let's take a look at the table of contents.

# Table of Contents
1. But CNNs Are For Images!?!
1. Preparing Environment
1. Preparing Data
1. Neural Network Modeling
1. EXTRA: How To Make Our Model Ready-to-Deploy?
1. Conclusion


# But CNNs Are For Images!?!
In deep learning, we generally use Convolutional Neural Networks and their variants to classify image data, so most people think *they can only be used for image data*.

But a convolution operator simply **extracts** features from the data it is given, as long as that data has an axis to slide over. If we use **word embeddings** to convert each word into a vector, a text becomes a two-dimensional matrix (words × embedding dimensions), and we can run a one-dimensional convolution along the word axis.
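
To make the idea concrete, here is a minimal NumPy sketch of what happens to one short text. The sizes, the random embedding matrix and the random filters are all hypothetical and chosen only for illustration; the real model learns them, and its `Conv1D` layers also add a bias term that is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, embed_dim = 10, 4      # 10 known tokens, 4-dimensional vectors
kernel_size, n_filters = 3, 2      # each filter looks at 3 consecutive words

# A "sentence" of 6 token ids, as a tokenizer would produce.
token_ids = np.array([1, 5, 2, 7, 3, 9])

# Embedding lookup: each id selects one row, giving a (6, 4) matrix.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
embedded = embedding_matrix[token_ids]

# 1D convolution without padding: slide a window over the word axis and
# take a weighted sum of everything inside the window, once per filter.
filters = rng.normal(size=(kernel_size, embed_dim, n_filters))
n_windows = len(token_ids) - kernel_size + 1
features = np.stack([
    np.tensordot(embedded[i:i + kernel_size], filters, axes=([0, 1], [0, 1]))
    for i in range(n_windows)
])

print(embedded.shape)   # (6, 4)
print(features.shape)   # (4, 2)
```

Each row of `features` summarizes one window of three consecutive words, which is the text equivalent of a filter looking at a small patch of an image.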

Let's start.


# Preparing Environment
In this section we'll import libraries and read our data from disk.

In [22]:
import pandas as pd
import numpy as np
import re

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix
import matplotlib.pyplot as plt


In [23]:
data_true = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
data_fake = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')

# Preparing Data
In this section we're going to prepare the data for our neural network.

In [24]:
data_true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


* We can drop title, subject and date.
* We also need to add a label: 1 for real news and 0 for fake news.

In [25]:
data_true["label"] = 1
data_fake["label"] = 0
# Pass axis by keyword; recent pandas versions reject it as a positional argument.
data = pd.concat([data_true,data_fake],axis=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44898 entries, 0 to 23480
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


In [26]:
data = data.loc[:,["text","label"]]
data.head()

Unnamed: 0,text,label
0,WASHINGTON (Reuters) - The head of a conservat...,1
1,WASHINGTON (Reuters) - Transgender people will...,1
2,WASHINGTON (Reuters) - The special counsel inv...,1
3,WASHINGTON (Reuters) - Trump campaign adviser ...,1
4,SEATTLE/WASHINGTON (Reuters) - President Donal...,1


In [27]:
x = data["text"]
y = data["label"]

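* Now we're going to define a function that cleans the data: it replaces every character other than letters, digits and apostrophes with a space, then lowercases and strips the text.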

In [28]:
def cleanText(text):
    cleaned = re.sub("[^'a-zA-Z0-9]"," ",text)
    lowered = cleaned.lower().strip()
    return lowered

* Let's test our function.

In [29]:
cleanText("Test .* yup *?! okay!.")

'test    yup     okay'

In [30]:
st = time.time()
x_cleaned = [cleanText(t) for t in x]
print("This process took {} seconds".format(round(time.time()-st,2)))

This process took 8.06 seconds


In [31]:
x_cleaned[0]

'washington  reuters    the head of a conservative republican faction in the u s  congress  who voted this month for a huge expansion of the national debt to pay for tax cuts  called himself a  fiscal conservative  on sunday and urged budget restraint in 2018  in keeping with a sharp pivot under way among republicans  u s  representative mark meadows  speaking on cbs   face the nation   drew a hard line on federal spending  which lawmakers are bracing to do battle over in january  when they return from the holidays on wednesday  lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues  such as immigration policy  even as the november congressional election campaigns approach in which republicans will seek to keep control of congress  president donald trump and his republicans want a big budget increase in military spending  while democrats also want proportional increases for non defense  discretionary  spending on programs that support educat

* Now we'll tokenize our data using the Keras `Tokenizer` that ships with TensorFlow.

In [32]:
st = time.time()
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(x_cleaned)
x_tokenized = tokenizer.texts_to_sequences(x_cleaned)
print("This process took {} seconds".format(round(time.time()-st,2)))

This process took 22.78 seconds


In [33]:
print(x_tokenized[0])

[106, 66, 1, 440, 3, 4, 318, 78, 6, 1, 37, 7, 201, 29, 828, 26, 256, 10, 4, 1254, 3007, 3, 1, 128, 981, 2, 474, 10, 189, 1285, 158, 411, 4, 1405, 318, 9, 347, 5, 1144, 527, 6, 1225, 6, 1787, 16, 4, 3337, 147, 165, 335, 143, 37, 7, 827, 918, 621, 9, 1946, 473, 1, 322, 1772, 4, 528, 583, 9, 180, 731, 49, 459, 28, 2, 90, 1421, 68, 6, 425, 59, 31, 796, 25, 1, 9, 203, 459, 40, 1466, 386, 2, 901, 4, 180, 527, 6, 4, 545, 331, 2, 22, 1732, 2, 80, 455, 177, 18, 325, 219, 109, 18, 1, 494, 605, 97, 2060, 1523, 6, 49, 143, 40, 1201, 2, 418, 364, 3, 201, 35, 71, 12, 5, 19, 143, 188, 4, 424, 527, 995, 6, 179, 731, 108, 213, 61, 188, 3574, 10, 738, 387, 731, 9, 1016, 8, 160, 1013, 4113, 1098, 1478, 175, 324, 5, 1415, 1233, 1, 12, 146, 21, 308, 44, 1384, 2, 129, 33, 121, 142, 2, 995, 738, 387, 731, 20, 39, 579, 138, 614, 3, 1, 758, 34, 3504, 73, 803, 2196, 14, 9, 1, 341, 99, 213, 28, 187, 8, 7, 24, 487, 33, 240, 2, 427, 1, 72, 4, 474, 1361, 3, 329, 2, 581, 138, 10, 4, 1405, 318, 32, 159, 38, 192, 139,
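
As a rough picture of what `Tokenizer(num_words=...)` does, here is a pure-Python approximation. It is a sketch and not the Keras implementation: `build_word_index` and `texts_to_ids` are hypothetical helper names, and the real tokenizer also applies its own filtering and lowercasing before counting.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map words to ids, dropping unknown words and ids >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, toy_index, num_words=3)

print(toy_index)   # every word gets an index, frequent words first
print(toy_ids)     # [[1, 2], [1, 2], [1]]
```

The same thing happens in the notebook: `tokenizer.word_index` contains every word seen in the corpus, but with `num_words=5000` the sequences only ever contain ids below 5000.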

* Now we need to pad our sequences to a common length. To choose that length, I'll use the third quartile of the length array (the array holding the length of every sequence), so about 75% of the texts fit without being truncated.

In [34]:
length_array = [len(s) for s in x_tokenized]
SEQUENCE_LENGTH = int(np.quantile(length_array,0.75))
print(SEQUENCE_LENGTH)

474
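
A small sketch with made-up lengths shows why a quartile can be a more practical choice than the maximum: one very long text would otherwise force every sequence to be padded to its length. The numbers below are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical sequence lengths: most texts are short, one is very long.
toy_lengths = [3, 5, 8, 10, 12, 15, 20, 400]

toy_seq_len = int(np.quantile(toy_lengths, 0.75))
covered = np.mean(np.array(toy_lengths) <= toy_seq_len)

print(max(toy_lengths))   # 400
print(toy_seq_len)        # 16
print(covered)            # 0.75
```

Padding to 16 instead of 400 keeps the input small, at the cost of truncating the longest quarter of the texts.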


* And let's pad.

In [35]:
x_padded = pad_sequences(x_tokenized,maxlen=SEQUENCE_LENGTH)

In [36]:
x_padded.shape

(44898, 474)
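
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence. The NumPy sketch below imitates that default behavior on two toy sequences; `pad_pre` is a hypothetical helper written for illustration, not the Keras function itself.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` ids of long sequences."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]              # truncate from the front
        if kept:
            out[row, -len(kept):] = kept  # right-align, padding on the left
    return out

toy_padded = pad_pre([[4, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
# [[0 0 4 7]
#  [3 4 5 6]]
```

One consequence for this notebook is that articles longer than 474 tokens are represented by their last 474 tokens, not their first.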

* Our text data is ready to use, let's split our dataset into train and test sets.

In [37]:
x_train,x_test,y_train,y_test = train_test_split(x_padded,y,test_size=0.2,random_state=42)

In [38]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(35918, 474)
(8980, 474)
(35918,)
(8980,)


# Neural Network Modeling
In this section I'm going to build and train our convolutional neural network using Keras' Sequential API.

In [39]:
# We add 1 because the word index runs from 1 to len(tokenizer.word_index) and
# index 0 is used for padding, so the embedding needs len(tokenizer.word_index) + 1 rows.
VOCAB_LENGTH = len(tokenizer.word_index) + 1
VECTOR_SIZE = 100

def getModel():
    """
    Returns a compiled 1D Convolutional Neural Network with a sigmoid output for binary classification
    """
    model = keras.Sequential()
    model.add(layers.Embedding(input_dim=VOCAB_LENGTH,
                               output_dim=VECTOR_SIZE,
                               input_length=SEQUENCE_LENGTH
                              ))
    
    model.add(layers.Conv1D(128,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Conv1D(256,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Conv1D(512,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Flatten())
    model.add(layers.Dense(1,activation="sigmoid"))
    
    model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
    
    return model

In [40]:
model = getModel()
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 474, 100)          12236200  
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 471, 128)          51328     
_________________________________________________________________
batch_normalization_3 (Batch (None, 471, 128)          512       
_________________________________________________________________
activation_3 (Activation)    (None, 471, 128)          0         
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 235, 128)          0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 232, 256)          131328    
_________________________________________________________________
batch_normalization_4 (Batch (None, 232, 256)         
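
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in `getModel` (stride 1 and no padding for `Conv1D`, non-overlapping windows for `MaxPooling1D(2)`); the two helper functions are hypothetical names introduced only for this check.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D without padding."""
    return length - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

length, channels = 474, 100   # SEQUENCE_LENGTH and VECTOR_SIZE from the notebook
shape_trace = []
for filters in (128, 256, 512):
    n_params = conv1d_params(4, channels, filters)
    length = conv1d_out_len(length, 4)
    shape_trace.append(("conv", length, filters, n_params))
    length = length // 2          # MaxPooling1D(2) drops a trailing odd step
    shape_trace.append(("pool", length, filters, 0))
    channels = filters

for row in shape_trace:
    print(row)
print("flattened features:", length * channels)
```

The first rows reproduce the values visible in the summary: `(471, 128)` with 51328 parameters, `(235, 128)` after pooling, and `(232, 256)` with 131328 parameters.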

In [41]:
history = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=1)



* 1 epoch and 93% validation accuracy: this is how well a convolutional neural network can work with text data.

# EXTRA: How To Make Our Model Ready-to-Deploy?
Before finishing this kernel, I want to show you one more important thing: how to make a model ready to deploy with a web library or framework like Flask or Django.

Let's start.

* First we'll save weights of our model and pickle our tokenizer.

In [42]:
model.save_weights("trained_model.h5")

In [43]:
import pickle
with open("tokenizer.pickle",mode="wb") as F:
    pickle.dump(tokenizer,F)


* Let's also save our label map using the json library.

In [44]:
import json
label_map = {0:"Fake",
             1:"Real"
            }

# Use a context manager so the file is flushed and closed.
with open("label_map.json",mode="w") as F:
    json.dump(label_map,F)
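
One detail worth knowing before loading this file again: JSON object keys are always strings, so the integer keys of the label map come back as `"0"` and `"1"`. The sketch below round-trips through a JSON string instead of a file to stay self-contained; `toy_label_map` and `restored` are names used only here.

```python
import json

toy_label_map = {0: "Fake", 1: "Real"}

# Round-trip through a JSON string instead of a file.
restored = json.loads(json.dumps(toy_label_map))

print(restored)            # {'0': 'Fake', '1': 'Real'}
print(restored[str(1)])    # Real
```

This is why a predicted class id has to be converted with `str(...)` before it is looked up in the loaded map.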

* And now we'll write a class with a method that predicts the label of a given text.

In [93]:
class DeployModel():
    
    def __init__(self,weights_path,tokenizer_path,seq_length,label_map_path
                ):
        
        self.model = getModel()
        self.model.load_weights(weights_path)
        # Use context managers so the files are closed after loading.
        with open(tokenizer_path,mode="rb") as f:
            self.tokenizer = pickle.load(f)
        self.seq_len = seq_length
        with open(label_map_path) as f:
            self.label_map = json.load(f)
    
    def _prepare_data(self,text):
        
        cleaned = cleanText(text)
        tokenized = self.tokenizer.texts_to_sequences([cleaned])
        padded = pad_sequences(tokenized,maxlen=self.seq_len)
        return padded
    
    def _predict(self,text):
        
        text = self._prepare_data(text)
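        # predict_classes is not available in newer TensorFlow releases,
        # so threshold the sigmoid output at 0.5 instead.
        pred = int(self.model.predict(text)[0][0] > 0.5)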
        return str(pred)
    
    def result(self,text):
        
        pred = self._predict(text)
        return self.label_map[pred]

* And let's create an object using our class.

In [94]:
deploy_model = DeployModel(weights_path="./trained_model.h5",
                           tokenizer_path="./tokenizer.pickle",
                           seq_length=SEQUENCE_LENGTH,
                           label_map_path="./label_map.json"
                          )

In [60]:
test_text = x_cleaned[0]

In [95]:
print(test_text)
print("\n\n===========================")
print("Results: ",deploy_model.result(test_text))

washington  reuters    the head of a conservative republican faction in the u s  congress  who voted this month for a huge expansion of the national debt to pay for tax cuts  called himself a  fiscal conservative  on sunday and urged budget restraint in 2018  in keeping with a sharp pivot under way among republicans  u s  representative mark meadows  speaking on cbs   face the nation   drew a hard line on federal spending  which lawmakers are bracing to do battle over in january  when they return from the holidays on wednesday  lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues  such as immigration policy  even as the november congressional election campaigns approach in which republicans will seek to keep control of congress  president donald trump and his republicans want a big budget increase in military spending  while democrats also want proportional increases for non defense  discretionary  spending on programs that support educati

* And yes, it was real!