### Brief explanation
The aim of the model is to predict whether a title is a viral news headline. The `up_votes` column holds the number of positive votes for each headline. Every row whose `up_votes` value is at or above the 95th percentile is labeled viral (1), and all others non-viral (0), in a new `target` column. The `title` column is pre-processed by removing stop words, lowercasing, stripping punctuation, and tokenizing. Because class 0 has about 480,000 rows and class 1 only about 25,000, the majority class is undersampled to obtain a balanced dataset. The deep learning model is a recurrent network built from Long Short-Term Memory (LSTM) layers, chosen because LSTM cells are designed to mitigate the vanishing-gradient problem that affects plain RNNs during training. The model scored 92.98% training accuracy and 62.32% test accuracy (91.32% and 62.79% in the run recorded below), a gap that indicates overfitting. Dropout is used to regularize the network and limit that overfitting.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import pandas as pd 
df = pd.read_csv('Eluvio_DS_Challenge.csv')   #Loading the csv data into a DataFrame.
df.head() 

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


In [2]:
df['up_votes'].quantile(0.95) #checking the up_vote at 95th percentile.

418.0

In [3]:
df['target'] = (df['up_votes'] >= df['up_votes'].quantile(0.95)).astype('int')  # 1 = viral (up_votes at or above the 95th percentile, 418 here), 0 = otherwise.
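
The labeling rule above can be illustrated without pandas. The sketch below computes a quantile with linear interpolation between order statistics, which is the default behavior of pandas' `quantile`, and then thresholds on it. The helper name `percentile_linear` and the vote counts are made up for illustration and are not taken from the dataset.

```python
import math

def percentile_linear(values, q):
    """q-th quantile (0 <= q <= 1) using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)              # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Made-up vote counts, for illustration only.
votes = [3, 2, 3, 1, 4, 10, 250, 7, 900, 5, 12]

threshold = percentile_linear(votes, 0.95)
target = [int(v >= threshold) for v in votes]  # 1 = viral, 0 = not viral

print(threshold)
print(target)
```

By construction only a small fraction of rows land in class 1, which is why the class imbalance has to be handled before training.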

In [9]:
df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category,target
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews,0
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews,0
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews,0
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews,0
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews,0


In [10]:
#Dropping all the unnecessary columns in a single call.
df.drop(columns=['down_votes', 'category', 'time_created', 'date_created',
                 'over_18', 'author', 'up_votes'], inplace=True)

In [11]:
#Pre-processing the title column by removing the stop words.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))   # a set makes each membership test O(1)

# Drop stop words from every title. The match is case-sensitive and runs before
# lowercasing, so capitalized stop words such as 'The' are kept.
df['title'] = df['title'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Saura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
df.head()

Unnamed: 0,title,target
0,Scores killed Pakistan clashes,0
1,Japan resumes refuelling mission,0
2,US presses Egypt Gaza border,0
3,Jump-start economy: Give health care,0
4,Council Europe bashes EU&UN terror blacklist,0
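
Stop-word lists such as NLTK's are lowercase, so a filter applied to text that has not been lowercased yet leaves capitalized stop words in place. The sketch below shows the difference on a made-up headline; the tiny `STOP` set and the `drop_stop_words` helper are hypothetical stand-ins, not the NLTK list or the notebook's code.

```python
# A tiny hand-picked stop list for illustration; NLTK's English list is much longer.
STOP = {"the", "in", "on", "to", "all", "of"}

def drop_stop_words(title, case_sensitive=True):
    """Remove stop words from a whitespace-tokenized title."""
    kept = []
    for word in title.split():
        key = word if case_sensitive else word.lower()
        if key not in STOP:
            kept.append(word)
    return " ".join(kept)

title = "The state of the economy"   # made-up headline
print(drop_stop_words(title))                        # The state economy
print(drop_stop_words(title, case_sensitive=False))  # state economy
```

Whether the leftover capitalized words matter depends on the data; lowercasing before filtering removes the discrepancy.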


In [13]:
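#Lowercasing the titles and removing all punctuation from the title column.
import re
df['title'] = df['title'].apply(lambda x : x.lower())
# Keep only lowercase letters, digits and whitespace; everything else is stripped.
df['title'] = df['title'].apply(lambda x : re.sub(r'[^a-z0-9\s]', '', x))
df['title'].head()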

0                 scores killed pakistan clashes
1               japan resumes refuelling mission
2                   us presses egypt gaza border
3             jumpstart economy give health care
4    council europe bashes euun terror blacklist
Name: title, dtype: object
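
A note on the character class used when stripping punctuation: a range written as `A-z` covers ASCII codes 65 to 122, which includes the symbols `[`, `\`, `]`, `^`, `_` and the backtick that sit between `Z` and `a`. A pattern built on that range therefore keeps those symbols. The short check below compares the two ranges on a made-up string; the names `LOOSE` and `STRICT` are only for this illustration.

```python
import re

LOOSE = re.compile(r"[^a-zA-z0-9\s]")   # 'A-z' also matches [ \ ] ^ _ and `
STRICT = re.compile(r"[^a-z0-9\s]")     # lowercase letters, digits, whitespace only

sample = "Jump-start economy: give [health]_care".lower()

print(LOOSE.sub("", sample))    # jumpstart economy give [health]_care
print(STRICT.sub("", sample))   # jumpstart economy give healthcare
```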

In [14]:
#importing all the keras dependencies.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Embedding, Dropout
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.preprocessing.text import text_to_word_sequence



In [15]:
#Downsampling data to fix the imbalance.
from sklearn.utils import resample
# Separate majority and minority classes
df_majority = df[df['target']==0]
df_minority = df[df['target']==1]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority,
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),  # match the minority class size (25,465 rows)
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
data = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
data['target'].value_counts()


1    25465
0    25465
Name: target, dtype: int64
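
Because the downsampling happens before the train/test split, the test set ends up balanced as well, so its accuracy describes a 50/50 mix rather than the original class ratio. One possible alternative, sketched here in plain Python on synthetic rows, is to split first and downsample only the training portion. The function `split_then_downsample` and the generated rows are hypothetical; this is not what the notebook does.

```python
import random

def split_then_downsample(rows, test_fraction=0.2, seed=123):
    """rows: list of (title, target) pairs. Returns (balanced_train, untouched_test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    minority = [r for r in train if r[1] == 1]
    majority = [r for r in train if r[1] == 0]
    # Sample the majority class without replacement down to the minority size.
    majority_down = rng.sample(majority, len(minority))
    balanced = minority + majority_down
    rng.shuffle(balanced)
    return balanced, test

# Synthetic rows in which 5% of the titles carry the positive label.
rows = [(f"headline {i}", int(i % 20 == 0)) for i in range(1000)]
train, test = split_then_downsample(rows)

print(len(train), sum(t for _, t in train))   # training set is half positives
print(len(test))                              # test set keeps the original mix
```

With this arrangement the test metrics reflect the imbalance the model would face on unseen headlines, at the cost of a test set dominated by class 0.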

In [16]:
#Converting each title to a sequence of integer word indices (vocabulary capped by num_words=5000).
tokenizer = Tokenizer(num_words=5000, split=" ")
tokenizer.fit_on_texts(data['title'].values)

X = tokenizer.texts_to_sequences(data['title'].values)
X = pad_sequences(X) # zero-padding the sequences so they all have the same length
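
To make the tokenizing and padding step concrete, here is a rough pure-Python approximation of what it does: words are ranked by frequency (index 1 is the most frequent, 0 is reserved for padding), words whose index is not below `num_words` are dropped, and shorter sequences are left-padded with zeros. The helper names are hypothetical, and the sketch skips the lowercasing and character filtering that the Keras `Tokenizer` also applies by default.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable: ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences):
    """Left-pad with zeros to the length of the longest sequence."""
    width = max(len(s) for s in sequences)
    return [[0] * (width - len(s)) + s for s in sequences]

titles = ["scores killed pakistan clashes", "pakistan border clashes", "pakistan"]
word_index = fit_word_index(titles)
padded = pad_pre(texts_to_sequences(titles, word_index, num_words=4))
print(word_index)
print(padded)   # [[3, 1, 2], [0, 1, 2], [0, 0, 1]]
```

In the notebook the padded width is 37, which is the `input_length` the embedding layer receives.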


In [17]:
#Structuring the deep neural network.
model = Sequential()
model.add(Embedding(5000, 256, input_length=X.shape[1]))
model.add(Dropout(0.3))
model.add(LSTM(256, return_sequences=True, dropout=0.3, recurrent_dropout=0.2))
model.add(LSTM(256, dropout=0.3, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))

In [18]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 37, 256)           1280000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 37, 256)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 37, 256)           525312    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514       
Total params: 2,331,138
Trainable params: 2,331,138
Non-trainable params: 0
_________________________________________________________________
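
The parameter counts in the summary can be checked by hand. An embedding layer has one vector per vocabulary entry, an LSTM layer has four weight blocks (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and a bias, and a dense layer has a weight matrix plus a bias. The short calculation below reproduces the figures; the variable names are chosen for this illustration.

```python
vocab_size, embed_dim, units, classes = 5000, 256, 256, 2

def lstm_params(input_dim, units):
    # 4 blocks x (input weights + recurrent weights + bias) per unit
    return 4 * ((input_dim + units + 1) * units)

embedding = vocab_size * embed_dim      # 1,280,000
lstm_1 = lstm_params(embed_dim, units)  # 525,312
lstm_2 = lstm_params(units, units)      # 525,312
dense = units * classes + classes       # 514

total = embedding + lstm_1 + lstm_2 + dense
print(total)   # 2331138
```

More than half of the parameters sit in the embedding table, which is worth keeping in mind given that the balanced dataset has only about 51,000 titles.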


In [19]:
y = pd.get_dummies(data['target']).values

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  #Data split

In [16]:
batch_size = 32
epochs = 12

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/12
 - 246s - loss: 0.6320 - accuracy: 0.6469
Epoch 2/12
 - 260s - loss: 0.5747 - accuracy: 0.6999
Epoch 3/12
 - 266s - loss: 0.5371 - accuracy: 0.7272
Epoch 4/12
 - 268s - loss: 0.4943 - accuracy: 0.7568
Epoch 5/12
 - 276s - loss: 0.4539 - accuracy: 0.7815
Epoch 6/12
 - 280s - loss: 0.4130 - accuracy: 0.8061
Epoch 7/12
 - 281s - loss: 0.3697 - accuracy: 0.8307
Epoch 8/12
 - 283s - loss: 0.3321 - accuracy: 0.8496
Epoch 9/12
 - 342s - loss: 0.2948 - accuracy: 0.8684
Epoch 10/12
 - 329s - loss: 0.2631 - accuracy: 0.8864
Epoch 11/12
 - 309s - loss: 0.2348 - accuracy: 0.8975
Epoch 12/12
 - 343s - loss: 0.2086 - accuracy: 0.9132


<keras.callbacks.callbacks.History at 0x2184c78bf48>
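
The log above tracks only training loss and accuracy, so it cannot show when the model stops generalizing. A common remedy is to monitor loss on held-out validation data and stop once it stalls; Keras provides an `EarlyStopping` callback for this. The framework-free sketch below shows only the stopping logic. The function name and the loss values are invented for illustration and are not results from this model.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has gone `patience` epochs
    without improving; if that never happens it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses: improving for three epochs, then rising.
hypothetical_val_losses = [0.64, 0.61, 0.60, 0.63, 0.70, 0.82]
print(early_stop_epoch(hypothetical_val_losses, patience=2))   # (5, 3)
```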

In [17]:
test_loss, test_acc = model.evaluate(X_test, y_test)  # evaluate the model on the held-out test data
print(test_loss)  # test loss (categorical cross-entropy)
print(test_acc)  # test accuracy

1.2411577065380013
0.6279206871986389


In [18]:
import pickle
# Saving the trained model to a pickle file.
with open('Viral_or_not_Model', 'wb') as file:
    pickle.dump(model, file)


In [21]:
import pickle
# Loading the saved model back from the pickle file.
with open('Viral_or_not_Model', 'rb') as file:
    mp = pickle.load(file)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [27]:
predictions = mp.predict(X_test)

In [26]:
import numpy as np
print(np.argmax(predictions[0]))

1
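
Each row of `predictions` holds two softmax probabilities, and the index of the larger one is the predicted class. Applying that to every row and comparing against the one-hot labels gives a confusion matrix, which says more than a single accuracy figure. The sketch below does this in plain Python; the helper names and the probability values are invented for illustration and are not model outputs.

```python
def argmax(row):
    """Index of the largest value in a sequence."""
    return max(range(len(row)), key=row.__getitem__)

def confusion_matrix(prob_rows, onehot_rows, n_classes=2):
    """matrix[true_class][predicted_class] = number of samples."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for probs, onehot in zip(prob_rows, onehot_rows):
        matrix[argmax(onehot)][argmax(probs)] += 1
    return matrix

# Invented softmax outputs and one-hot labels (column 0 = non-viral, column 1 = viral).
probs = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.1, 0.9]]
truth = [[0, 1], [1, 0], [1, 0], [0, 1], [0, 1]]

cm = confusion_matrix(probs, truth)
accuracy = (cm[0][0] + cm[1][1]) / len(truth)
recall_viral = cm[1][1] / (cm[1][0] + cm[1][1])
print(cm)         # [[1, 1], [1, 2]]
print(accuracy)   # 0.6
```

Recall on the viral class is often the more relevant number here, since missing a viral headline and flagging a non-viral one are different kinds of error.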
