<h2>TWEET CLASSIFICATION MODEL (BASIC) </h2>

This Jupyter notebook implements a basic model for classifying tweets 
into "Positive", "Negative" and "Neutral" categories.
Two classifiers are used:
1) RandomForest<br>
2) LSTM

<h3>1. DATA PREPROCESSING</h3>

In [1]:
import numpy as np
import pandas as pd
#Loading the downloaded training data csv into a dataframe for pre-processing
data = pd.read_csv('Aitrainingdata.csv')
#Counting the non-null values in each column to check for missing data
data.count()

tweet                                  4104
twitterHandlerused                     4104
label                                  4104
fasttext_labels                        4104
Comparison_human&ft                      87
Labels_Manish_Krishna_Anu_Harvinder       2
2nd review                                0
Final_labels                           4104
dtype: int64

In [2]:
#Tweet without cleaning 
data['tweet'].head(5)

0    @HatePatroller All @mindvalley students are li...
1    RT @AlphaGammaHQ: Conferences are great platfo...
2    @AlphaGammaHQ @wobi_en @GIFLondon @Esportsbzsu...
3    RT @mindvalley: You asked, we delivered. ...
4    @dubeji18 Check this one out by @thesleepdocto...
Name: tweet, dtype: object

In [3]:
from nltk.corpus import stopwords
import re
from nltk.tokenize import TweetTokenizer
import preprocessor as p
import string

<h4> 1.1 Tokenizing, Stopword and Punctuation Removal</h4>

In [4]:
#Function to apply cleaning to all tweets by removing usernames, special characters and filtering out the tweet body
def clean_tweet(row):
    '''  
    Parameters
    ----------
    row : dataframe row        

    Returns
    -------
    tweet : df column values
    cleaned tweet from df row  

    '''
    tweet = row['tweet']
    tweet = p.clean(tweet)
    return tweet
#Applying to all rows of the tweet column in the data
data['tweet'] = data.apply(clean_tweet,axis =1)

In [5]:
def tokenize_rem_stopwords(row):
    '''  
    Parameters
    ----------
    row : dataframe row        

    Returns
    -------
    tweet : df column values
    stopwords and punctuation removed tweet from df row  

    '''
    tweet = row['tweet']
    tweet = TweetTokenizer().tokenize(tweet)
    stop = stopwords.words('english') #Importing stopwords from english
    newtweet = []
    for ww in tweet:
        if ww.lower() not in stop and ww not in string.punctuation:
            newtweet.append(ww)
    return newtweet
#Tokenizing every tweet and removing stopwords and punctuation
data['tweet'] = data.apply(tokenize_rem_stopwords,axis = 1)
data['tweet_token'] = data['tweet']
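
The filtering step inside tokenize_rem_stopwords can be looked at in isolation. The sketch below is a simplified stand-in, not the notebook's code: it splits on whitespace instead of using TweetTokenizer, and it uses a tiny hypothetical stopword set (DEMO_STOPWORDS) instead of NLTK's English list, so that it runs without any downloads.

```python
import string

# Hypothetical miniature stopword set; the notebook uses stopwords.words('english')
DEMO_STOPWORDS = {"are", "to", "and", "the", "a"}

def filter_tokens(tokens, stop=DEMO_STOPWORDS):
    """Drop stopwords (case-insensitive) and tokens found in string.punctuation."""
    return [t for t in tokens if t.lower() not in stop and t not in string.punctuation]

# Whitespace splitting stands in for TweetTokenizer in this sketch
demo_tokens = "Conferences are great platforms to exchange ideas !".split()
filtered = filter_tokens(demo_tokens)
print(filtered)
```

Two details may be worth keeping in mind. Building the stopword collection once as a set, outside the per-row function, avoids rebuilding the list for every tweet. Also, `t not in string.punctuation` is a substring test on a string, so a multi-character token such as `...` is not removed by it.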

<h4> 1.2 Label Encoding </h4>

In [6]:
from sklearn.preprocessing import LabelEncoder
#Label encoding for converting the label into 0,1,2
label_encode = LabelEncoder()
data['Final_labels'] = label_encode.fit_transform(data['Final_labels'])
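
LabelEncoder assigns integer codes in the sorted order of the unique label values. The sketch below reproduces that rule by hand on hypothetical label strings (the actual values stored in Final_labels are not shown in this notebook), to make clear which class ends up as 0, 1 and 2.

```python
# Hypothetical label strings; the real values in Final_labels may be spelled differently
demo_labels = ["Positive", "Negative", "Neutral", "Positive"]

# Sorted unique values define the codes, mirroring what LabelEncoder does
classes = sorted(set(demo_labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in demo_labels]
decoded = [classes[code] for code in encoded]

print(classes)   # position in this list is the integer code
print(encoded)
```

On the fitted encoder itself, `label_encode.classes_` holds the same ordering and `label_encode.inverse_transform` maps predictions back to label names.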

In [7]:
list1 = data['tweet'][1]
list1 = ' '.join(list1)
list1

"Conferences great platforms exchange ideas meet like-minded people build Here's"

In [8]:
#Converting the tokenized tweet back to string for vectorization
def stringyfy(row):
    '''  
    Parameters
    ----------
    row : dataframe row        

    Returns
    -------
    tweet : df column values
    string version of tokenized sentence
    
    '''
    tweet = row['tweet']
    tweet = ' '.join(tweet)
    return tweet
data['tweet'] = data.apply(stringyfy,axis=1)
data['tweet']

0                                           students like
1       Conferences great platforms exchange ideas mee...
2                  Awesome list Humble thank including us
3       asked delivered years youve asking us make FRE...
4       Check one recommended takes consideration natu...
                              ...                        
4099    Hopefully eventually come use meantime belittl...
4100                                          thanks mine
4101    hey got text real HSBC ALERT authorised paymen...
4102                                          basic bitch
4103    Thank letting us know definitely scam ever get...
Name: tweet, Length: 4104, dtype: object

<h4> 1.3 TF-IDF Vectorization </h4>

In [9]:
#Creating a bag of sentences of all tweets for vectorizing
bagOfSentences = data['tweet'].to_list()
#for converting the string to vector(numeric) using countvectorizer and tfidf transform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
cv = CountVectorizer()
word_count = cv.fit_transform(bagOfSentences)
word_count.shape

(4104, 7209)

In [10]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count)
tf_idf_vector=tfidf_transformer.transform(word_count).toarray()
#Tweets in tf-idf vector format
X = tf_idf_vector
X.shape

(4104, 7209)
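
To make the weighting concrete, the following is a hand-rolled sketch of the tf-idf computation with the settings used above (smooth_idf=True, and the default L2 row normalisation). The three-document corpus is hypothetical, and whitespace splitting replaces CountVectorizer's own tokenisation, so the numbers are only illustrative.

```python
import math
from collections import Counter

# Hypothetical mini corpus of already-cleaned tweets
docs = ["great list thank", "great thanks", "scam alert"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw term counts times idf, then scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

A term that appears in many tweets (here "great") receives a lower idf than a term that appears in only one, which is the property the vectorisation relies on.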

<h3> 2. CLASSIFICATION USING RANDOM FOREST </h3>

In [11]:
#X = data.loc[:,'tweet':'fasttext_labels']
#Splitting the data into train and test sets; train_size = 0.20 keeps 20% of the rows for training and 80% for testing
Y = data['Final_labels']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,random_state=1403,train_size = 0.20)
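
The effect of the train_size argument on the split can be checked with simple arithmetic. The sketch below assumes that, when only a fractional train_size is given, the training count is rounded down and the remaining rows go to the test set; split_sizes is a hypothetical helper written for this illustration.

```python
import math

def split_sizes(n_samples, train_size):
    """Return (n_train, n_test) for a fractional train_size."""
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train

# 4104 tweets, as reported by data.count() above
print(split_sizes(4104, 0.20))
print(split_sizes(4104, 0.80))
```

With train_size = 0.20 the model is trained on roughly a fifth of the tweets and evaluated on the rest; train_size = 0.80 would give the reverse proportion.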

In [12]:
#Classification using RandomForest on the tf-idf vector
from sklearn.ensemble import RandomForestClassifier 
rf_clf = RandomForestClassifier(n_estimators=250, random_state=0) 
rf_clf.fit(X_train, y_train) 
y_pred = rf_clf.predict(X_test)

In [28]:
#Accuracy on the test data 
print("The accuracy on test data using random forest classifier",np.mean(y_pred==y_test)*100)

The accuracy on test data using random forest classifier 74.72594397076736


<h3> 3. CLASSIFICATION USING TENSORFLOW MODEL </h3>

In [14]:
#Import for tensorflow and Keras
import tensorflow as tf
#Only the layers and utilities used below are imported, all from tensorflow.keras
from tensorflow.keras.layers import LSTM, Embedding, Dense, Activation, Dropout, Input
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
#Keras uses tokenized  data for model training
X_deep = data['tweet_token']

In [23]:
#Mapping words to integer indices (vocabulary capped with num_words=2000) and setting the maximum sequence length to 600 tokens
max_len = 600
tok = Tokenizer(num_words=2000)
tok.fit_on_texts(X_deep)
sequences = tok.texts_to_sequences(X_deep)
#Creating sequence matrix to represent each tweet in a vector format
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
X_train_deep, X_test_deep, y_train_deep, y_test_deep = train_test_split(sequences_matrix,Y,random_state=1403,train_size = 0.20)
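
The padding step can be pictured with a small pure-Python sketch. It mimics what pad_sequences does with its default settings (zeros added at the front, over-long sequences cut from the front); pad_pre is a hypothetical helper written for this illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, keeping only the last `maxlen` items of `seq`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 12, 7], maxlen=6)
long = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(short)
print(long)
```

Since a cleaned tweet usually contains far fewer than 600 tokens, most of each padded row is likely to be zeros; a smaller max_len chosen from the observed tweet lengths may reduce training time.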

In [29]:
def tensorflow_based_model():
    #Builds the LSTM classifier: embedding, LSTM, one hidden dense layer with dropout, and a 3-way softmax output
    inputs = Input(name='inputs',shape=(max_len,)) #step1: padded sequence of word indices
    layer = Embedding(2000,50)(inputs) #step2: 50-dimensional embedding for the 2000-word vocabulary
    layer = LSTM(64)(layer) #step3
    layer = Dense(256,name='FC1')(layer) #step4
    layer = Activation('relu')(layer) #step5
    layer = Dropout(0.5)(layer) #step6
    layer = Dense(3,name='out_layer')(layer) #step7: one unit per class (labels are encoded as 0, 1, 2)
    layer = Activation('softmax')(layer) #step8: class probabilities
    model = Model(inputs=inputs,outputs=layer)
    return model

In [25]:
model = tensorflow_based_model()
#Integer labels with a softmax output call for sparse categorical cross-entropy
model.compile(loss='sparse_categorical_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])
#Training on the training split, holding out 10% of it for validation
history=model.fit(X_train_deep,y_train_deep,batch_size=80,epochs=6, validation_split=0.1)
print('Training finished !!')

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Training finished !!
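
For a target with three integer-encoded classes (0, 1, 2), a softmax output paired with sparse categorical cross-entropy is a common choice. The sketch below is a hand-rolled illustration of those two functions on hypothetical scores; it is not Keras code, and the names are chosen only for this example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_class, probs):
    """Negative log-probability assigned to the true integer class."""
    return -math.log(probs[true_class])

logits = [0.2, 1.5, -0.3]  # hypothetical raw scores for classes 0, 1, 2
probs = softmax(logits)
loss_if_1 = sparse_categorical_crossentropy(1, probs)
loss_if_2 = sparse_categorical_crossentropy(2, probs)
print([round(p, 3) for p in probs], round(loss_if_1, 3), round(loss_if_2, 3))
```

The loss is small when the true class receives the highest probability and grows as the model puts probability mass on the other classes. The predicted class is the index of the largest probability.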


In [26]:
#Accuracy of the model on the test data
accr1 = model.evaluate(X_test_deep,y_test_deep)

