# Introduction: Text Classification with CNNs
Hello people, welcome to this kernel. In this kernel I am going to show you how to build a Convolutional Neural Network with TensorFlow to classify text.

Before starting, let's take a look at the table of contents.

# Table of Contents
1. But CNNs Are For Images!?!
1. Preparing Environment
1. Preparing Data
1. Neural Network Modeling
1. EXTRA: How To Make Our Model Ready-to-Deploy?
1. Conclusion


# But CNNs Are For Images!?!
In deep learning, we generally use Convolutional Neural Networks and their variants to classify image data, so most people think *we can use them only for image data*.

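But a convolution operator **extracts** features from whatever data it is given, as long as that data is laid out along one or more dimensions. A sentence is a sequence of words, and if we use **word embeddings** to turn each word into a vector, the sentence becomes a matrix (sequence length by embedding size) that a one-dimensional convolution can slide over, much like a 2D convolution slides over the pixels of an image.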
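
To make the idea concrete, here is a minimal sketch in plain Python (no TensorFlow) of what a single 1D convolution filter does to an embedded sentence. The toy embeddings, the filter weights and the `conv1d_single_filter` name are all invented for illustration; a real `Conv1D` layer learns many such filters and adds a bias.

```python
# Toy illustration only: the embeddings and filter weights are invented values.
embeddings = {
    "fake": [1.0, 0.0],
    "news": [0.0, 1.0],
    "is":   [0.5, 0.5],
    "bad":  [1.0, 1.0],
}

def conv1d_single_filter(sequence, kernel):
    """Slide `kernel` (kernel_size x embedding_dim) over `sequence`
    (seq_len x embedding_dim) with stride 1 and no padding."""
    size = len(kernel)
    outputs = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        # Multiply the window element-wise with the kernel and sum everything.
        total = sum(w * x
                    for k_row, x_row in zip(kernel, window)
                    for w, x in zip(k_row, x_row))
        outputs.append(total)
    return outputs

sentence = ["fake", "news", "is", "bad"]
embedded = [embeddings[word] for word in sentence]  # 4 words x 2 dimensions

kernel = [[1.0, 0.0],   # weights applied to the first word of each window
          [0.0, 1.0]]   # weights applied to the second word of each window

features = conv1d_single_filter(embedded, kernel)
print(features)  # one value per pair of neighbouring words: [2.0, 0.5, 1.5]
```

Each output value summarises a window of neighbouring words, which is why a convolution over text behaves a bit like a learned n-gram detector.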

Let's start.


# Preparing Environment
In this section we'll import libraries and read our data from disk.

In [None]:
import pandas as pd
import numpy as np
import re

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix
import matplotlib.pyplot as plt


In [None]:
data_true = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
data_fake = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')

# Preparing Data
In this section we're going to prepare the data so we can feed it to our neural network.

In [None]:
data_true.head()

* We can drop the title, subject and date columns.
* We also need to add a label: 1 for real news and 0 for fake news.

In [None]:
data_true["label"] = 1
data_fake["label"] = 0
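# Pass axis by keyword (recent pandas versions reject a positional axis)
# and rebuild the index so row labels are not duplicated.
data = pd.concat([data_true,data_fake],axis=0,ignore_index=True)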
data.info()

In [None]:
data = data.loc[:,["text","label"]]
data.head()

In [None]:
x = data["text"]
y = data["label"]

* Now we're going to define a function which will clean data.

In [None]:
def cleanText(text):
    cleaned = re.sub("[^'a-zA-Z0-9]"," ",text)
    lowered = cleaned.lower().strip()
    return lowered

* Let's test our function.

In [None]:
cleanText("Test .* yup *?! okay!.")
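
As a standalone sketch of what this call produces: every character that is not a letter, a digit or an apostrophe becomes a space, so runs of spaces are left behind. `clean_text` below is just a copy of the function above under a different name so the snippet runs on its own.

```python
import re

def clean_text(text):
    # Same logic as cleanText above: keep letters, digits and apostrophes,
    # replace everything else with a space, then lowercase and strip.
    cleaned = re.sub("[^'a-zA-Z0-9]", " ", text)
    return cleaned.lower().strip()

result = clean_text("Test .* yup *?! okay!.")
print(repr(result))     # the words are separated by several spaces
print(result.split())   # ['test', 'yup', 'okay']
```

The extra spaces should be harmless here, because splitting on whitespace ignores empty pieces.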

In [None]:
st = time.time()
x_cleaned = [cleanText(t) for t in x]
print("This process took {} seconds".format(round(time.time()-st,2)))

In [None]:
x_cleaned[0]

* Now we'll tokenize our data using TensorFlow's tokenizer, keeping only the most frequent words (`num_words=5000`).

In [None]:
st = time.time()
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(x_cleaned)
x_tokenized = tokenizer.texts_to_sequences(x_cleaned)
print("This process took {} seconds".format(round(time.time()-st,2)))

In [None]:
print(x_tokenized[0])
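
If you wonder where these numbers come from, the sketch below imitates the idea in plain Python: words are ranked by frequency, the most frequent word gets index 1, and words whose index is `num_words` or higher are dropped. The helper names `fit_word_index` and `texts_to_sequences` and the tiny corpus are made up for illustration; this is a simplified picture, not the library's actual implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words ranked num_words or higher."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

corpus = ["the news is real", "the news is fake", "the end"]  # invented corpus
word_index = fit_word_index(corpus)
sequences = texts_to_sequences(corpus, word_index, num_words=4)
print(word_index)
print(sequences)  # [[1, 2, 3], [1, 2, 3], [1]]
```

Rare words simply disappear from the sequences, so a tokenized article can be shorter than the original text.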

* Now we need to pad our sequences so they all have the same length. To choose that length, I'll use the third quartile of the length array (the array that holds the length of every sequence).

In [None]:
length_array = [len(s) for s in x_tokenized]
SEQUENCE_LENGTH = int(np.quantile(length_array,0.75))
print(SEQUENCE_LENGTH)
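
Why the third quartile? It covers about 75% of the articles completely while a few very long ones do not inflate the input size. Here is a small sketch with invented lengths; `quantile_linear` is a hypothetical helper that uses linear interpolation, which is also the default behaviour of `np.quantile`.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

lengths = [120, 80, 300, 150, 2000, 200, 90, 110]  # invented sequence lengths
max_len = int(quantile_linear(lengths, 0.75))

# Share of sequences that fit without being truncated.
covered = sum(1 for n in lengths if n <= max_len) / len(lengths)
print(max_len, covered)  # 225 0.75
```

Using the maximum length instead (2000 in this toy list) would make every input almost nine times longer just to fit one outlier.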

* And let's pad.

In [None]:
x_padded = pad_sequences(x_tokenized,maxlen=SEQUENCE_LENGTH)

In [None]:
x_padded.shape
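
One detail worth knowing: with its default arguments, `pad_sequences` pads at the front and also truncates at the front, so a long article keeps its last `maxlen` tokens. The plain-Python sketch below imitates that default for a single sequence; `pad_pre` is a hypothetical helper, not part of the library.

```python
def pad_pre(sequence, maxlen, value=0):
    """Pad at the front and, if too long, keep only the last `maxlen` tokens."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_pre([5, 8, 2, 9, 4, 7, 1], maxlen=5))  # [2, 9, 4, 7, 1]
```

If the beginning of an article matters more than its end, passing `padding="post"` and `truncating="post"` to `pad_sequences` reverses both behaviours.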

* Our text data is ready to use. Let's split our dataset into train and test sets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x_padded,y,test_size=0.2,random_state=42)

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

# Neural Network Modeling
In this section I'm going to build and train our convolutional neural network using Keras' Sequential API.

In [None]:
# The word index runs from 1 to len(tokenizer.word_index), and padding adds
# the 0 token, so the vocabulary has len(tokenizer.word_index) + 1 entries.
VOCAB_LENGTH = len(tokenizer.word_index) + 1
VECTOR_SIZE = 100

def getModel():
    """
    Returns a compiled, trainable Convolutional Neural Network with a sigmoid output
    """
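    model = keras.Sequential()
    # Declare the input shape explicitly; the `input_length` argument of
    # Embedding is deprecated in recent Keras releases.
    model.add(keras.Input(shape=(SEQUENCE_LENGTH,)))
    model.add(layers.Embedding(input_dim=VOCAB_LENGTH,
                               output_dim=VECTOR_SIZE
                              ))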
    
    model.add(layers.Conv1D(128,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Conv1D(256,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Conv1D(512,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Flatten())
    model.add(layers.Dense(1,activation="sigmoid"))
    
    model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
    
    return model

In [None]:
model = getModel()
model.summary()
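
To read the output shapes in the summary, it helps to do the arithmetic by hand. With the default settings used above (no padding, stride 1), each `Conv1D` with `kernel_size=4` shortens the sequence by 3, and each `MaxPooling1D(2)` halves it, rounding down. The sketch below assumes a made-up sequence length of 500; your real `SEQUENCE_LENGTH` is whatever the notebook printed earlier.

```python
def conv1d_out_len(length, kernel_size):
    # "valid" padding, stride 1: the window must fit entirely inside the input.
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    # Non-overlapping windows; a trailing partial window is dropped.
    return length // pool_size

def flattened_size(seq_len, filters=(128, 256, 512), kernel_size=4, pool_size=2):
    """Return (final sequence length, number of inputs to the Dense layer)."""
    length = seq_len
    for _ in filters:
        length = maxpool1d_out_len(conv1d_out_len(length, kernel_size), pool_size)
    return length, length * filters[-1]

print(flattened_size(500))  # (59, 30208)
```

So after three blocks, a 500-token input would reach the final `Dense` layer as 59 positions times 512 filters.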

In [None]:
history = model.fit(x_train,y_train,validation_data=(x_test,y_test),epochs=1)

* One epoch and about 93% validation accuracy: a convolutional neural network can work well with text data too.
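
A single accuracy number does not tell us which class gets misclassified. The sketch below shows, on invented labels and invented model outputs, how sigmoid probabilities can be turned into 0/1 labels and counted in a confusion matrix. `to_labels` and `confusion_counts` are hypothetical helpers written in plain Python; on the real data, the `accuracy_score` and `confusion_matrix` functions imported at the top would do the same job.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into hard 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true classes, columns predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [1, 0, 1, 1, 0, 0]              # invented labels (1 = real, 0 = fake)
probs = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]   # invented model outputs
y_pred = to_labels(probs)
matrix = confusion_counts(y_true, y_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(matrix)    # [[2, 1], [1, 2]]
print(accuracy)
```

In this toy case one fake article is predicted as real and one real article as fake, which the accuracy value alone would not reveal.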

# EXTRA: How To Make Our Model Ready-to-Deploy?
Before finishing this kernel, I want to show you one more important thing: how to make a model ready to deploy with a web library or framework such as Flask or Django.

Let's start.

* First we'll save the weights of our model and pickle our tokenizer.

In [None]:
model.save_weights("trained_model.h5")

In [None]:
import pickle
with open("tokenizer.pickle",mode="wb") as F:
    pickle.dump(tokenizer,F)


* Let's also save our label map using the json library.

In [None]:
import json
label_map = {0:"Fake",
             1:"Real"
            }

# Use a context manager so the file is closed (and flushed) right away.
with open("label_map.json",mode="w") as F:
    json.dump(label_map,F)
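
One thing to keep in mind: JSON object keys are always strings, so the integer keys 0 and 1 come back as `"0"` and `"1"` when the file is loaded. The small sketch below shows the round trip in memory, using `json.dumps` and `json.loads` instead of a file.

```python
import json

label_map = {0: "Fake", 1: "Real"}

# Serialize and parse again, as saving and loading the file would do.
restored = json.loads(json.dumps(label_map))
print(restored)           # {'0': 'Fake', '1': 'Real'}
print(1 in restored)      # False: the integer key is gone
print(restored[str(1)])   # Real
```

This is why a prediction has to be converted to a string before it is used to look up a label in the loaded map.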

* And now we'll write a class with a method that predicts the label of a given text.

In [None]:
class DeployModel():
    
    def __init__(self,weights_path,tokenizer_path,seq_length,label_map_path
                ):
        
        self.model = getModel()
        self.model.load_weights(weights_path)
        # Open the files with context managers so they are closed after loading.
        with open(tokenizer_path,mode="rb") as F:
            self.tokenizer = pickle.load(F)
        self.seq_len = seq_length
        with open(label_map_path) as F:
            self.label_map = json.load(F)
    
    def _prepare_data(self,text):
        
        cleaned = cleanText(text)
        tokenized = self.tokenizer.texts_to_sequences([cleaned])
        padded = pad_sequences(tokenized,maxlen=self.seq_len)
        return padded
    
    def _predict(self,text):
        
        text = self._prepare_data(text)
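        # predict_classes was removed from recent TensorFlow releases, so we
        # threshold the sigmoid output at 0.5 ourselves.
        pred = int(self.model.predict(text)[0][0] > 0.5)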
        return str(pred)
    
    def result(self,text):
        
        pred = self._predict(text)
        return self.label_map[pred]

* And let's create an object using our class.

In [None]:
deploy_model = DeployModel(weights_path="./trained_model.h5",
                           tokenizer_path="./tokenizer.pickle",
                           seq_length=SEQUENCE_LENGTH,
                           label_map_path="./label_map.json"
                          )

In [None]:
test_text = x_cleaned[0]

In [None]:
print(test_text)
print("\n\n===========================")
print("Results: ",deploy_model.result(test_text))

* And yes, it was real!