# **LSTM**
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This behavior is required in complex problem domains such as machine translation and speech recognition. This notebook trains an LSTM on a fake-news dataset to predict each article's binary label from its title.

## **Required Libraries**

In [1]:
import pandas as pd
import numpy as np 
import nltk
import re 
import tensorflow as tf
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import one_hot

### **Reading the CSV data**

In [2]:
df = pd.read_csv(r'C:\Users\jgaur\Tensorflow_Tut\LSTM\fake-news\train.csv')
df.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [3]:
# Drop every row that contains a NaN value
df = df.dropna()

In [4]:
# Split into independent features (X) and the dependent target (y)
X = df.drop('label', axis=1)
y = df['label']

In [6]:
# Check the shape of the feature data
X.shape

(18285, 4)

In [10]:
# Vocabulary size used for the word indices and the embedding layer
voc_size = 5000

In [11]:
messages = X.copy()

In [12]:
# Reset the index so rows are numbered 0..n-1 again after dropna()
messages.reset_index(inplace=True)

### **Downloading stopwords**
`Stop words` are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. In Text Mining and Natural Language Processing (NLP), stop words are usually removed because they occur so often that they carry very little useful information.
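
As a small illustration of the idea, the sketch below filters a sentence against a tiny hand-written stop-word set. The set `SAMPLE_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names that only stand in for the much longer NLTK list used in this notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word set (assumption: NLTK's English list is much longer)
SAMPLE_STOP_WORDS = {"a", "the", "is", "are", "in", "of", "on"}

def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep letters only, lowercase the text, and drop stop words."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(word for word in words if word not in stop_words)

print(remove_stop_words("The cat is on the mat"))  # cat mat
```

Using a `set` keeps each membership check cheap, which matters once the same filter runs over thousands of titles.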

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jgaur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
corpus = []
# Build the stop-word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))
for i in range(len(messages)):
    # Replace everything except a-z and A-Z with a space
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])

    # Lowercase the title and split it into words
    review = review.lower().split()

    # Remove stop words and join the remaining words back into one string
    review = [word for word in review if word not in stop_words]
    corpus.append(' '.join(review))

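### **One hot encoding**
A one hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1. Note that the Keras `one_hot` function used in the next cell performs only the first step: it hashes each word to an integer index between 1 and `voc_size - 1`, so two different words can share an index. The Embedding layer later consumes these indices directly.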
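
To make the binary-vector description concrete, here is a minimal sketch in plain Python. The toy vocabulary and the helper `one_hot_vector` are hypothetical and are not part of the notebook's pipeline.

```python
def one_hot_vector(index, size):
    """Return a binary list of length `size` with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

# Hypothetical toy vocabulary (assumption, not taken from the dataset)
vocabulary = ["cat", "dog", "fish"]

# Step 1: map every category to an integer
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Step 2: expand every integer into a binary vector
encoded = {word: one_hot_vector(i, len(vocabulary)) for word, i in word_to_index.items()}

print(encoded["dog"])  # [0, 1, 0]
```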

In [16]:
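# Store the result under a new name so the imported one_hot function is not overwritten
onehot_repr = [one_hot(words, voc_size) for words in corpus]

### **What does pad_sequences do?**
`pad_sequences` ensures that all sequences in a list have the same length. By default it pads with 0 at the beginning of each sequence until every sequence is as long as the longest one. When `maxlen` is given, as in the cell below, shorter sequences are padded to that length and longer ones are truncated, by default from the beginning.

In [23]:
sen_length = 25
pad_sequence = pad_sequences(onehot_repr, padding='pre', maxlen=sen_length)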
print(pad_sequence)

[[   0    0    0 ... 4180  498 4163]
 [   0    0    0 ...  683 3960 1313]
 [   0    0    0 ... 3920 4857 3950]
 ...
 [   0    0    0 ... 2928  833 1776]
 [   0    0    0 ...  489 4089 2476]
 [   0    0    0 ... 4747 3494 1277]]
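
The leading zeros in this output come from pre-padding. The behavior can be sketched in plain Python as follows; `pad_pre` is a hypothetical helper that imitates the effect of `padding='pre'` with the default pre-truncation for a single sequence, not the Keras implementation itself, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Pad at the front, or keep only the last `maxlen` items if the sequence is too long."""
    trimmed = sequence[-maxlen:]  # assumes maxlen >= 1
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```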


In [24]:
len(pad_sequence)

18285

## **LSTM Model**

In [25]:
embedding_vector_features = 40
model = Sequential()
model.add(Embedding(voc_size, embedding_vector_features, input_length=sen_length))

# LSTM layer with 100 units
model.add(LSTM(100))

# Classification layer: a single sigmoid unit for the binary label
model.add(Dense(1, activation='sigmoid'))

# Set the loss, optimizer and metrics
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 40)            200000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               56400     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
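
The parameter counts in the summary can be checked by hand. The short calculation below uses the standard formulas for each layer type; the variable names are only illustrative.

```python
voc_size = 5000
embedding_vector_features = 40
lstm_units = 100

# Embedding: one vector of 40 values per vocabulary index
embedding_params = voc_size * embedding_vector_features

# LSTM: four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_vector_features + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 200000 56400 101 256501
```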


In [27]:
y.shape

(18285,)

In [29]:
# Convert the independent and dependent data into NumPy arrays
X_final = np.array(pad_sequence)
y_final = np.array(y)

### **Train Test Split**
The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and with any supervised learning algorithm. The procedure divides a dataset into two subsets: a training set used to fit the model and a test set used to evaluate it on data it has not seen.
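
The idea can be sketched without any library. The helper `simple_train_test_split` below is hypothetical: it shuffles the indices with a fixed seed and rounds the test size up, which is an assumption of this sketch, so it will not reproduce the exact rows that scikit-learn selects.

```python
import math
import random

def simple_train_test_split(items, test_size, seed):
    """Shuffle the items reproducibly and hold out a fraction of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))  # assumption: round the test size up
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

# Same number of samples and test fraction as in this notebook
train, test = simple_train_test_split(list(range(18285)), test_size=0.33, seed=42)
print(len(train), len(test))  # 12250 6035
```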

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

In [31]:
# Train for 10 epochs, reporting loss and accuracy on the held-out split after each one
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
