<a href="https://colab.research.google.com/github/shivangibithel/IRMiDis_Task2/blob/main/Text_Classification_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[A Comprehensive Guide to Understand and Implement Text Classification in Python](https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/)

# Imports

In [127]:
# pip install emoji --upgrade

In [128]:
import numpy as np
import pandas as pd #to work with csv files

#matplotlib imports are used to plot confusion matrices for the classifiers
import matplotlib as mpl 
import matplotlib.cm as cm 
import matplotlib.pyplot as plt 

#import feature extraction methods from sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import stop_words

#pre-processing of text
import string
import re

#import classifiers from sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

#import different metrics to evaluate the classifiers
from sklearn.metrics import accuracy_score
#from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
from sklearn import metrics

#import time function from time module to track the training duration
from time import time

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers


In [129]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# Section 1: Load and explore the dataset

In [147]:
our_data = pd.read_csv("cleaned_train.csv",index_col = "id")
test_data = pd.read_csv("cleaned_test.csv",index_col = "id")
our_data = shuffle(our_data)
our_data.head()

Unnamed: 0_level_0,tweet,label,label_num
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.33083e+18,a quick reminder that cambridge not known for ...,Neutral,2
1.32591e+18,really good news pfizer and biontech report th...,ProVax,0
1.3262e+18,nhs to be ready to roll out coronavirus vaccin...,Neutral,2
1.32903e+18,so its gone up just as a new vaccine announced...,AntiVax,1
1.33228e+18,south korea says it foiled north korea attempt...,Neutral,2


In [148]:
display(our_data.shape) #Number of rows (instances) and columns in the dataset
print(our_data["label"].value_counts()/our_data.shape[0]) #Class distribution in the dataset

(2792, 3)

Neutral    0.361748
ProVax     0.354943
AntiVax    0.283309
Name: label, dtype: float64


In [149]:
# Prepare Dataset
trainDF = pd.DataFrame()
trainDF['tweet'] = our_data.tweet
trainDF['label'] = our_data.label

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['tweet'], trainDF['label'])

# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# train_y = encoder.fit_transform(trainDF['label'])
valid_y = encoder.fit_transform(valid_y)

In [150]:
trainDF

Unnamed: 0_level_0,tweet,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1.330830e+18,a quick reminder that cambridge not known for ...,Neutral
1.325910e+18,really good news pfizer and biontech report th...,ProVax
1.326200e+18,nhs to be ready to roll out coronavirus vaccin...,Neutral
1.329030e+18,so its gone up just as a new vaccine announced...,AntiVax
1.332280e+18,south korea says it foiled north korea attempt...,Neutral
...,...,...
1.334160e+18,todays coronavirus news ontario is reporting c...,Neutral
1.325770e+18,i love that trump said uhh uhh coronavirus is ...,ProVax
1.336230e+18,covid vaccine first person receives #pfizer #c...,Neutral
1.333440e+18,moderna to ask health regulators to authorize ...,Neutral


### Section 2: Text Pre-processing

Typical steps involve tokenization, lower casing, removing, stop words, punctuation markers etc, and vectorization. Other processes such as stemming/lemmatization can also be performed. Here, we are performing the following steps: removing br tags, punctuation, numbers, and stopwords. While we are using sklearn's list of stopwords, there are several other stop word lists (e.g., from NLTK) or sometimes, custom stopword lists are needed depending on the task. 

In [134]:
stopwords = stop_words.ENGLISH_STOP_WORDS
def clean(doc): #doc is a string of text
    # doc = doc.replace("@", " ")
    # doc = doc.replace(" ", " ")
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()])
    # doc = " ".join([token for token in doc.split() if token not in stopwords])
    #remove punctuation and numbers
    return doc

In [135]:
import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

In [136]:
def remove_url(data):
  url = re.compile("http[s]?\:\/\/.[a-zA-Z0-9\.\/\_?=%&#\-\+!]+", re.UNICODE)
  return re.sub(url, '', data)

In [137]:
# for i in range(len(our_data)):
#   our_data.tweet.iloc[i] = remove_url(our_data.tweet.iloc[i])
#   our_data.tweet.iloc[i] = clean(our_data.tweet.iloc[i])
#   our_data.tweet.iloc[i] = remove_emojis(our_data.tweet.iloc[i])

In [138]:
our_data.head()

Unnamed: 0,id,tweet,label,label_num
827,1.32654e+18,no company or user will directly profit from #...,Neutral,2
330,1.33425e+18,cdc vaccine advisers vote on which groups shou...,Neutral,2
487,1.32903e+18,now what the fuck is this efficacy revision of...,AntiVax,1
348,1.32577e+18,some good news the covid vaccine being develop...,ProVax,0
1442,1.3308e+18,norbertelekes hope some vaccine results will c...,ProVax,0


# Section 3: Modeling

Now we are ready for the modelling. We are going to use algorithms from sklearn package. We will go through the following steps:

1 Read train data    
2 Extract features from the training data using CountVectorizer, which is a bag of words feature  implementation. We will use the pre-processing function above in conjunction with Count Vectorizer  
3 Transform the test data into the same feature vector as the training data.  
4 Train the classifier  
5 Evaluate the classifier  

# Feature Engineering

In [174]:
X_train = train_x
# X_train = trainDF['tweet']
y_train =train_y
print(X_train.shape, y_train.shape, valid_x.shape, valid_y.shape)

(2094,) (2094,) (698,) (698,)


In [None]:
X_train

 ### Count Vectors as features

In [180]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['tweet'])

# transform the training and validation data using count vectorizer object
X_train_dtm =  count_vect.transform(X_train)
xvalid_count =  count_vect.transform(valid_x)

In [181]:
print(X_train_dtm.shape, xvalid_count.shape)

(2094, 7716) (698, 7716)


### TF-IDF Vectors as features

In [182]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['tweet'])
xtrain_tfidf =  tfidf_vect.transform(X_train)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
print(xtrain_tfidf.shape, xvalid_tfidf.shape)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['tweet'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(X_train)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)
print(xtrain_tfidf_ngram.shape, xvalid_tfidf_ngram.shape)

# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['tweet'])
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_train)
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(valid_x) 
print(xtrain_tfidf_ngram_chars.shape, xvalid_tfidf_ngram_chars.shape)

(2094, 5000) (698, 5000)
(2094, 5000) (698, 5000)
(2094, 5000) (698, 5000)


In [183]:
print(xtrain_tfidf_ngram)

  (0, 4770)	0.286356967336268
  (0, 4537)	0.23613552329447768
  (0, 4518)	0.30261147337599364
  (0, 4517)	0.21924968854523927
  (0, 4282)	0.2658787115734172
  (0, 3782)	0.286356967336268
  (0, 3464)	0.24703695015549074
  (0, 3261)	0.30261147337599364
  (0, 3238)	0.19681550611370044
  (0, 3195)	0.29366597318366866
  (0, 876)	0.2748242117657422
  (0, 864)	0.13144551705544316
  (0, 195)	0.30261147337599364
  (0, 194)	0.30261147337599364
  (0, 118)	0.12795735759481333
  (1, 4804)	0.16756732145158726
  (1, 4441)	0.2751674834341599
  (1, 4438)	0.21722730462171483
  (1, 4210)	0.2821908876982268
  (1, 4033)	0.148205016102205
  (1, 2943)	0.2425278136648661
  (1, 2942)	0.2425278136648661
  (1, 2650)	0.26922927974039096
  (1, 2175)	0.23738390765157036
  (1, 2174)	0.20949705350404296
  :	:
  (2092, 1931)	0.21285747806848465
  (2092, 1842)	0.17257547764014033
  (2092, 1841)	0.17257547764014033
  (2092, 959)	0.12027060638112788
  (2092, 921)	0.06033190563492359
  (2092, 566)	0.1874187018300794
  (20

### Word Embeddings  --> Not Working

In [None]:
# load the pre-trained word-embedding vectors 
embeddings_index = {}
for i, line in enumerate(open('data/wiki-news-300d-1M.vec')):
    values = line.split()
    embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')

# create a tokenizer 
token = text.Tokenizer()
token.fit_on_texts(our_data['tweet'])
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors 
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)

# create token-embedding mapping
embedding_matrix = numpy.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

### Text / NLP based features

In [103]:
trainDF = our_data
trainDF['char_count'] = trainDF['tweet'].apply(len)
trainDF['word_count'] = trainDF['tweet'].apply(lambda x: len(x.split()))
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)
trainDF['punctuation_count'] = trainDF['tweet'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
trainDF['title_word_count'] = trainDF['tweet'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
trainDF['upper_case_word_count'] = trainDF['tweet'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [104]:
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

# function to check and get the part of speech tag count of a words in a given sentence
def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except:
        pass
    return cnt
trainDF['noun_count'] = trainDF['tweet'].apply(lambda x: check_pos_tag(x, 'noun'))
trainDF['verb_count'] = trainDF['tweet'].apply(lambda x: check_pos_tag(x, 'verb'))
trainDF['adj_count'] = trainDF['tweet'].apply(lambda x: check_pos_tag(x, 'adj'))
trainDF['adv_count'] = trainDF['tweet'].apply(lambda x: check_pos_tag(x, 'adv'))
trainDF['pron_count'] = trainDF['tweet'].apply(lambda x: check_pos_tag(x, 'pron'))

### Topic Models as features

In [105]:
# train a LDA Model
lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(X_train_dtm)
topic_word = lda_model.components_ 
vocab = vect.get_feature_names()

# view the topic models
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))

In [106]:
topic_summaries

['vaccinations chances six miss modernavaccine revlox conduct renders extended mutate',
 'demand difficult substantial thoughts organized liquid gold duck ddale maintain',
 'preparing similar ur rehman atta dr squashtheclot suits unveil shipping',
 'wow leading congratulations quite sputnikvaccine underlying yale onboard value replication',
 'travel canadians explains warn finance dailybriefing debunked stockmarket retrigger roars',
 'seems gonna trump completely america wrong rollout freedom transition youre',
 'china begin mid shut spray nasal wantai stage agency expert',
 'credit send patent rogue tn referendum fyi profits generic gives',
 'funder qqq dia spy err wonderwoman meaning programmes prepare qanon',
 'india times doses serum writes institute buy trained sections table',
 'evil bye jail coalition positions ariana tour swab deliveries abs',
 'petition sign dna lockdowns via scientific billgates pm meeting meet',
 'the vaccine to of covid and be will coronavirus in',
 'russia

# Classifiers

### Non Deep Methods

In [190]:
# Naive Bayes Classifier
nb = MultinomialNB() #instantiate a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)#train the count vectorizer model
# nb.fit(xtrain_tfidf, y_train)#train the TF-IDF model
# nb.fit(xtrain_tfidf_ngram, y_train)#train the TF-IDF ngram model
# nb.fit(xtrain_tfidf_ngram_chars, y_train)#train the TF-IDF ngram CHAR model

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [201]:
# Logistic Regression Classifier
logreg = LogisticRegression(class_weight="balanced",max_iter=1000) #instantiate a logistic regression model
# logreg.fit(X_train_dtm, y_train) #fit the model with training data
logreg.fit(xtrain_tfidf, y_train)#train the TF-IDF model
# logreg.fit(xtrain_tfidf_ngram, y_train)#train the TF-IDF ngram model
# logreg.fit(xtrain_tfidf_ngram_chars, y_train)#train the TF-IDF ngram CHAR model

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1000, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [212]:
# Linear SVM Classifier
classifier = LinearSVC(class_weight='balanced') #instantiate a logistic regression model
# classifier.fit(X_train_dtm, y_train) #fit the model with training data
classifier.fit(xtrain_tfidf, y_train)#train the TF-IDF model
# classifier.fit(xtrain_tfidf_ngram, y_train)#train the TF-IDF ngram model
# classifier.fit(xtrain_tfidf_ngram_chars, y_train)#train the TF-IDF ngram CHAR model

LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [222]:
# Bagging Model
classifier = ensemble.RandomForestClassifier()
classifier.fit(X_train_dtm, y_train)
# classifier.fit(xtrain_tfidf, y_train)#train the TF-IDF model
# classifier.fit(xtrain_tfidf_ngram, y_train)#train the TF-IDF ngram model
# classifier.fit(xtrain_tfidf_ngram_chars, y_train)#train the TF-IDF ngram CHAR model

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [229]:
# Extereme Gradient Boosting
EXB = xgboost.XGBClassifier()
EXB.fit(X_train_dtm.tocsc(), y_train)
# EXB.fit(xtrain_tfidf.tocsc(), y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [248]:
# Shallow Neural Network
def create_model_architecture(input_size):
    # create input layer 
    input_layer = layers.Input((input_size, ), sparse=True)
    
    # create hidden layer
    hidden_layer = layers.Dense(100, activation="relu")(input_layer)
    
    # create output layer
    output_layer = layers.Dense(1, activation="sigmoid")(hidden_layer)

    classifier = models.Model(inputs = input_layer, outputs = output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return classifier 

classifier = create_model_architecture(xtrain_tfidf.shape[1])
classifier.fit(xtrain_tfidf.toarray(), y_train,epochs =1, shuffle = True)



<keras.callbacks.History at 0x7f8e12d53450>

### Deep Neural Models

In [None]:
# CNN
def create_cnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)
    
    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_cnn()
classifier.fit(train_seq_x, train_y)
# train_seq_x -- > Word Embeddings required -->Later
# accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
# print "CNN, Word Embeddings",  accuracy

In [None]:
#  LSTM
def create_rnn_lstm():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.LSTM(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_rnn_lstm()
classifier.fit(train_seq_x, train_y)
# train_seq_x -- > Word Embeddings required -->Later

In [None]:
def create_rnn_gru():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the GRU Layer
    lstm_layer = layers.GRU(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_rnn_gru()
classifier.fit(train_seq_x, train_y)
# train_seq_x -- > Word Embeddings required -->Later

In [None]:
def create_bidirectional_rnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_bidirectional_rnn()
classifier.fit(train_seq_x, train_y)
# train_seq_x -- > Word Embeddings required -->Later

In [None]:
def create_rcnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)
    
    # Add the recurrent layer
    rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)
    
    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_rcnn()
classifier.fit(train_seq_x, train_y)
# train_seq_x -- > Word Embeddings required -->Later

# Evaluation

### Naive Bayes Evaluation

In [185]:
predictions = nb.predict(xvalid_count)
print(" NB, Count Vectors:", metrics.accuracy_score(predictions, valid_y))

 NB, Count Vectors: 0.7320916905444126


In [187]:
predictions = nb.predict(xvalid_tfidf)
print(" NB, WordLevel TF-IDF:", metrics.accuracy_score(predictions, valid_y))

 NB, WordLevel TF-IDF: 0.7134670487106017


In [189]:
predictions = nb.predict(xvalid_tfidf_ngram)
print(" NB, N-Gram Vectors:", metrics.accuracy_score(predictions, valid_y))

 NB, N-Gram Vectors: 0.6762177650429799


In [191]:
predictions = nb.predict(xvalid_tfidf_ngram_chars)
print(" NB, CharLevel Vectors:", metrics.accuracy_score(predictions, valid_y))

 NB, CharLevel Vectors: 0.673352435530086


### Linear Classification --> Logistic Regression

In [194]:
predictions = logreg.predict(xvalid_count)
print(" LR, Count Vectors:", metrics.accuracy_score(predictions, valid_y))

 LR, Count Vectors: 0.7134670487106017


In [196]:
predictions = logreg.predict(xvalid_tfidf)
print(" LR, WordLevel TF-IDF:", metrics.accuracy_score(predictions, valid_y))

 LR, WordLevel TF-IDF: 0.7306590257879656


In [198]:
predictions = logreg.predict(xvalid_tfidf_ngram)
print(" LR, N-Gram Vectors:", metrics.accuracy_score(predictions, valid_y))

 LR, N-Gram Vectors: 0.6776504297994269


In [200]:
predictions = logreg.predict(xvalid_tfidf_ngram_chars)
print(" LR, CharLevel Vectors:", metrics.accuracy_score(predictions, valid_y))

 LR, CharLevel Vectors: 0.7048710601719198


### SVM Evaluation

In [204]:
predictions = classifier.predict(xvalid_count)
print(" SVM, Count Vectors:", metrics.accuracy_score(predictions, valid_y))

 SVM, Count Vectors: 0.7020057306590258


In [213]:
predictions = classifier.predict(xvalid_tfidf)
print(" SVM, WordLevel TF-IDF:", metrics.accuracy_score(predictions, valid_y))

 SVM, WordLevel TF-IDF: 0.7392550143266475


In [209]:
predictions = classifier.predict(xvalid_tfidf_ngram)
print(" SVM, N-Gram Vectors:", metrics.accuracy_score(predictions, valid_y))

 SVM, N-Gram Vectors: 0.663323782234957


In [211]:
predictions = classifier.predict(xvalid_tfidf_ngram_chars)
print(" SVM, CharLevel Vectors:", metrics.accuracy_score(predictions, valid_y))

 SVM, CharLevel Vectors: 0.7177650429799427


### Random Forrest Evaluation Evaluation

In [224]:
predictions = classifier.predict(xvalid_count)
print(" RF, Count Vectors:", metrics.accuracy_score(predictions, valid_y))

 RF, Count Vectors: 0.6934097421203438


In [215]:
predictions = classifier.predict(xvalid_tfidf)
print(" RF, WordLevel TF-IDF:", metrics.accuracy_score(predictions, valid_y))

 RF, WordLevel TF-IDF: 0.6905444126074498


In [219]:
predictions = classifier.predict(xvalid_tfidf_ngram)
print(" RF, N-Gram Vectors:", metrics.accuracy_score(predictions, valid_y))

 RF, N-Gram Vectors: 0.6532951289398281


In [221]:
predictions = classifier.predict(xvalid_tfidf_ngram_chars)
print(" RF, CharLevel Vectors:", metrics.accuracy_score(predictions, valid_y))

 RF, CharLevel Vectors: 0.6432664756446992


### Boosting

In [226]:
predictions = EXB.predict(xvalid_count.tocsc())
print(" EXB, Count Vectors:", metrics.accuracy_score(predictions, valid_y))

 EXB, Count Vectors: 0.6977077363896849


In [228]:
predictions = EXB.predict(xvalid_tfidf.tocsc())
print(" RF, WordLevel TF-IDF:", metrics.accuracy_score(predictions, valid_y))

 RF, WordLevel TF-IDF: 0.6805157593123209


### Shallow NN

In [249]:
predictions = classifier.predict(xvalid_tfidf)
predictions = predictions.argmax(axis=-1)
print(" Shallow NN, WordLevel TF-IDF:", metrics.accuracy_score(predictions, valid_y))

 Shallow NN, WordLevel TF-IDF: 0.27793696275071633


### Deep