# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](http://www.fakenewschallenge.org/).

### Download files required for the project from [here](https://drive.google.com/drive/folders/1D6YRfeXGg5k8GaMCwQWOc16pqiSFi6bC?usp=sharing).

 ## <font color=red> Milestone - 1 </font>

## Step1: Load the given dataset <h1> [10 marks] </h1>

1. Mount Google Drive

2. Import the GloVe embeddings

3. Import the test and train datasets

### Mount Google Drive to access the required project files

Run the commands below.

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive/')

Mounted at /content/drive/


#### Path for Project files on google drive

**Note:** Change this path to match the location where you saved the files in your Google Drive.

In [None]:
project_path = "/content/drive/My Drive/Datasets/Fake News Challenge-20190812T030358Z-001/Fake News Challenge/"

### Loading the GloVe Embeddings

In [None]:
from zipfile import ZipFile
with ZipFile(project_path+'glove.6B.zip', 'r') as z:
  z.extractall()
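
The cell above only extracts the archive; the vectors still have to be read into memory before they can be used. A minimal sketch of one way to parse the GloVe text format (one word per line, followed by its vector components separated by spaces) is given here. The helper name `load_glove_vectors` and the tiny in-memory sample are illustrative assumptions, not part of the project files.

```python
import io
import numpy as np

def load_glove_vectors(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into a dict of word -> vector.

    `lines` can be any iterable of strings, for example an open file handle.
    This helper is hypothetical and only sketches the parsing step.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

# Tiny in-memory stand-in for a GloVe file, used only for this demonstration
sample_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
glove_demo = load_glove_vectors(sample_file)
print(glove_demo['cat'].shape)
```

With the real data, the same helper could be called on an opened file such as `glove.6B.100d.txt`, assuming the archive was extracted into the working directory and the 100-dimensional variant is the one wanted.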

### Load the dataset

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas, load the given training files **`train_bodies.csv`** and **`train_stances.csv`**.

2. Using the [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas, merge the two datasets on the `Body ID` column.

Note: Save the final merged dataset in a dataframe named **`dataset`**.

In [None]:
#load train_bodies.csv
import pandas as pd
path = project_path + 'train_bodies.csv' #project_path is set above
bodies = pd.read_csv(path, encoding='latin1')
bodies = bodies.ffill() #fill missing values with the value from the previous row
bodies = bodies.sort_values('Body ID')
bodies.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) â Wonder how long a Quarter Pounder...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


Loaded Article Bodies:

| | Body ID | articleBody |
|---|---|---|
| 0 | 0 | A small meteorite crashed into a wooded area i... |
| 1 | 4 | Last week we hinted at what was to come as Ebo... |
| 2 | 5 | (NEWSER) â Wonder how long a Quarter Pounder... |
| 3 | 6 | Posting photos of a gun-toting child online, I... |
| 4 | 7 | At least 25 suspected Boko Haram insurgents we... |

In [None]:
#load train_stances.csv
path = project_path + 'train_stances.csv' #project_path is set above
stances = pd.read_csv(path, encoding='latin1')
stances = stances.ffill() #fill missing values with the value from the previous row
stances = stances.sort_values('Body ID')
stances.head(27)

Unnamed: 0,Headline,Body ID,Stance
27879,Soldier shot near Canadian parliament building,0,unrelated
21704,Caught a catfish record in Po: 127 kg and 2.67...,0,unrelated
7110,Enormous 20-stone catfish caught with fishing ...,0,unrelated
12573,Soldier shot at war memorial in Canada,0,unrelated
16307,A soldier has been shot at Canadaâs war memo...,0,unrelated
37891,Canadian Soldier Shot At Ottawa War Memorial: ...,0,unrelated
37896,Iraqi social-media rumors claim IS leader slain,0,unrelated
35767,Breaking: Soldier shot at National War Memoria...,0,unrelated
44961,Kurds fear Isis use of chemical weapon in Kobani,0,unrelated
4740,Giant 8ft 9in catfish weighing 19 stone caught...,0,unrelated


Loaded Headlines:

| | Headline | Body ID | Stance |
|---|---|---|---|
| 27879 | Soldier shot near Canadian parliament building | 0 | unrelated |
| 21704 | Caught a catfish record in Po: 127 kg and 2.67... | 0 | unrelated |
| 7110 | Enormous 20-stone catfish caught with fishing ... | 0 | unrelated |
| 12573 | Soldier shot at war memorial in Canada | 0 | unrelated |
| 16307 | A soldier has been shot at Canadaâs war memo... | 0 | unrelated |

In [None]:
#merge the bodies and stances into dataset
dataset=pd.merge(bodies, stances, on='Body ID',sort=True)
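
Because each body can be paired with many headlines, the merge is one-to-many on `Body ID`, and the merged frame should have exactly one row per stance record. The sketch below illustrates this on two toy frames; the `toy_*` names and their contents are made up for illustration, and `validate='one_to_many'` is an optional pandas check that raises an error if a `Body ID` appears more than once on the left side.

```python
import pandas as pd

# Toy stand-ins for the bodies and stances tables (illustrative data only)
toy_bodies = pd.DataFrame({
    'Body ID': [0, 4],
    'articleBody': ['body zero', 'body four'],
})
toy_stances = pd.DataFrame({
    'Headline': ['h1', 'h2', 'h3'],
    'Body ID': [0, 0, 4],
    'Stance': ['unrelated', 'agree', 'discuss'],
})

# validate='one_to_many' checks that every Body ID is unique in toy_bodies
toy_merged = pd.merge(toy_bodies, toy_stances, on='Body ID',
                      sort=True, validate='one_to_many')

# One merged row per headline/stance record
assert len(toy_merged) == len(toy_stances)
print(toy_merged)
```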


<h2> Check1:</h2>
  
<h3> See the data: </h3>

In [None]:
dataset.head()

Unnamed: 0,Body ID,articleBody,Headline,Stance
0,0,A small meteorite crashed into a wooded area i...,Soldier shot near Canadian parliament building,unrelated
1,0,A small meteorite crashed into a wooded area i...,Caught a catfish record in Po: 127 kg and 2.67...,unrelated
2,0,A small meteorite crashed into a wooded area i...,Enormous 20-stone catfish caught with fishing ...,unrelated
3,0,A small meteorite crashed into a wooded area i...,Soldier shot at war memorial in Canada,unrelated
4,0,A small meteorite crashed into a wooded area i...,A soldier has been shot at Canadaâs war memo...,unrelated


Merged Dataset:

| | Body ID | articleBody | Headline | Stance |
|---|---|---|---|---|
| 0 | 0 | A small meteorite crashed into a wooded area i... | Soldier shot near Canadian parliament building | unrelated |
| 1 | 0 | A small meteorite crashed into a wooded area i... | Caught a catfish record in Po: 127 kg and 2.67... | unrelated |
| 2 | 0 | A small meteorite crashed into a wooded area i... | Enormous 20-stone catfish caught with fishing ... | unrelated |
| 3 | 0 | A small meteorite crashed into a wooded area i... | Soldier shot at war memorial in Canada | unrelated |
| 4 | 0 | A small meteorite crashed into a wooded area i... | A soldier has been shot at Canadaâs war memo... | unrelated |

## Step2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

5. `VALIDATION_SPLIT` = Fraction of the data to hold out for validation.

In [None]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2
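
The merged dataset is sorted by `Body ID`, so a validation set taken from the end of the data without shuffling would contain only the highest-numbered bodies. One common way to apply a value such as `VALIDATION_SPLIT` is to shuffle the row indices first and then cut off the requested fraction. The sketch below shows that idea; the helper `split_indices` and its fixed seed are assumptions for illustration and are not taken from the project code.

```python
import numpy as np

def split_indices(n_samples, validation_split, seed=0):
    """Return (train_idx, val_idx) after shuffling the indices 0..n_samples-1.

    Hypothetical helper: the validation part holds
    int(n_samples * validation_split) shuffled indices.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = int(n_samples * validation_split)
    return order[n_val:], order[:n_val]

# Demonstration on 10 samples with a 0.2 validation fraction
train_idx, val_idx = split_indices(10, 0.2)
print(len(train_idx), len(val_idx))
```

The resulting index arrays can be used to slice the feature and label arrays consistently, for example `features[train_idx]` and `labels[train_idx]`.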

### Download the `punkt` tokenizer models from NLTK for sentence tokenization.




In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Tokenizing the text and loading the pre-trained GloVe word embeddings for each token

#### Import the Tokenizer from keras preprocessing text

In [None]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


#### Initialize the Tokenizer class with maximum vocabulary count as `MAX_NB_WORDS`

In [None]:
#initializing tokenizer t with MAX_NB_WORDS as the maximum number of words
t = Tokenizer(num_words=MAX_NB_WORDS)

#### Now, using fit_on_texts() from the Tokenizer class, build the vocabulary from the data

In [None]:
#getting all the words in article body and headline into a list - text
headlines=list(dataset['Headline'])
body=list(dataset['articleBody'])
text = headlines + body

In [None]:
print(bodies['articleBody'])

0       A small meteorite crashed into a wooded area i...
1       Last week we hinted at what was to come as Ebo...
2       (NEWSER) â Wonder how long a Quarter Pounder...
3       Posting photos of a gun-toting child online, I...
4       At least 25 suspected Boko Haram insurgents we...
5       There is so much fake stuff on the Internet in...
6       (CNN) -- A meteorite crashed down in Managua, ...
7       Move over, Netflix and Hulu.\nWord has it that...
8       Weâve all seen the traditional depictions of...
9       A SOLDIER has been shot at Canadaâs National...
10      mboxCreate('FoxNews-Politics-Autoplay-Videos-I...
11      Don't fucking cheat on Cassy, aka @NessLovnTre...
12      Kai the shar pei-crossbreed was discovered tie...
13      An article saying NASA confirmed six days of â...
14      Italian fisherman Dino Ferrari landed what cou...
15      HBO's subscription streaming service will be c...
16      In a sprawling Facebook post and subsequent in...
17      Macaul

In [None]:
#displaying the first 5 entries in text list (these are headlines, since headlines come first)
print(text[:5])

['Soldier shot near Canadian parliament building', 'Caught a catfish record in Po: 127 kg and 2.67 meters', 'Enormous 20-stone catfish caught with fishing rod in Italy after 40-minute boat battle', 'Soldier shot at war memorial in Canada', 'A soldier has been shot at Canadaâ\x80\x99s war memorial just steps away from the nationâ\x80\x99s parliament']


First 5 entries in text list:

['Soldier shot near Canadian parliament building', 
'Caught a catfish record in Po: 127 kg and 2.67 meters', 'Enormous 20-stone catfish caught with fishing rod in Italy after 40-minute boat battle',
'Soldier shot at war memorial in Canada', 
'A soldier has been shot at Canadaâ\x80\x99s war memorial just steps away from the nationâ\x80\x99s parliament']

In [None]:
#fitting the tokenizer on the words from headlines and articleBody
t.fit_on_texts(text)

#### fit_on_texts() sets the following attributes:

* **word_counts:** dictionary mapping words (str) to the number of times they appeared during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared in during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
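
To make the relationship between `word_counts` and `word_index` concrete, the sketch below rebuilds both with plain Python on three toy texts. It is a simplified stand-in and not the Keras implementation: the regular expression used for splitting and the names `build_word_index`, `demo_counts` and `demo_index` are assumptions for illustration. The property it mirrors is that more frequent words get smaller indices, starting at 1, with 0 left free for padding.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Count lower-cased words and rank them by frequency (most frequent = 1).

    Simplified illustration only; the real Tokenizer applies its own filters.
    """
    counts = Counter()
    for doc in texts:
        counts.update(re.findall(r"[a-z0-9']+", doc.lower()))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
    return counts, index

demo_counts, demo_index = build_word_index(
    ['Soldier shot', 'soldier shot near parliament', 'Soldier']
)
print(demo_counts)
print(demo_index)
```

Note that in Keras `word_index` holds every word seen during fitting; the `num_words` limit is only applied later, when texts are converted to sequences.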



### Now, tokenize the sentences using nltk sent_tokenize() and encode the sentences with the ids we got from the above `t.word_index`

Initialise 2 lists with names `texts` and `articles`.

```
texts = [] to store text of article as it is.

articles = [] split the above text into a list of sentences.
```

In [None]:
#initialise a list texts with all the articles from the dataset
texts= list(dataset['articleBody'])
#displaying the first article
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

The First Article:

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumberto Garcia, of the Astronomy Center at the National Autonomous University of Nicaragua, said the meteorite could be related to an asteroid that was forecast to pass by the planet Saturday night.\n\n"We have to study it more because it could be ice or rock," he said.\n\nWilfried Strauch, an adviser to the Institute of Territorial Studies, said it was "very strange that no one reported a streak of light. We have to ask if anyone has a photo or something."\n\nLocal residents reported hearing a loud boom Saturday night, but said they didn\'t see anything strange in the sky.\n\n"I was sitting on my porch and I saw nothing, then all of a sudden I heard a large blast. We thought it was a bomb because we felt an expansive wave," Jorge Santamaria told The Associated Press.\n\nThe site of the crater is near Managua\'s international airport and an air force base. Only journalists from state media were allowed to visit it.'



In [None]:
#splits the articles into sentences
articles=[]
#for each article among all articles, tokenize the article and add it to the list of tokenized articles
for text in texts:
    article = nltk.tokenize.sent_tokenize(text)
    articles.append(article)


## Check 2:

The first elements of `texts` and `articles` should be as given below.

In [None]:
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

The First Article:

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumberto Garcia, of the Astronomy Center at the National Autonomous University of Nicaragua, said the meteorite could be related to an asteroid that was forecast to pass by the planet Saturday night.\n\n"We have to study it more because it could be ice or rock," he said.\n\nWilfried Strauch, an adviser to the Institute of Territorial Studies, said it was "very strange that no one reported a streak of light. We have to ask if anyone has a photo or something."\n\nLocal residents reported hearing a loud boom Saturday night, but said they didn\'t see anything strange in the sky.\n\n"I was sitting on my porch and I saw nothing, then all of a sudden I heard a large blast. We thought it was a bomb because we felt an expansive wave," Jorge Santamaria told The Associated Press.\n\nThe site of the crater is near Managua\'s international airport and an air force base. Only journalists from state media were allowed to visit it.'



In [None]:
articles[0]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

The First Article after Sentence Tokenization:

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",

 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 
 'He said it is still not clear if the meteorite disintegrated or was buried.',
 
 'Humberto Garcia, of the Astronomy Center at the National Autonomous University of Nicaragua, said the meteorite could be related to an asteroid that was forecast to pass by the planet Saturday night.',
 
 '"We have to study it more because it could be ice or rock," he said.',
 
 'Wilfried Strauch, an adviser to the Institute of Territorial Studies, said it was "very strange that no one reported a streak of light.',
 
 'We have to ask if anyone has a photo or something."',
 "Local residents reported hearing a loud boom Saturday night, but said they didn't see anything strange in the sky.",
 
 '"I was sitting on my porch and I saw nothing, then all of a sudden I heard a large blast.',
 
 'We thought it was a bomb because we felt an expansive wave," Jorge Santamaria told The Associated Press.',
 
 "The site of the crater is near Managua's international airport and an air force base.",
 
 'Only journalists from state media were allowed to visit it.']

 ## <font color=red> Milestone - 2 </font>

#### Now iterate through each article and each sentence to encode the words into ids using t.word_index

Use `text_to_word_sequence` to get the words from a sentence.

1. Import text_to_word_sequence

2. Initialize a variable of shape (no.of articles, MAX_SENTS, MAX_SENT_LENGTH)

In [None]:
#import text_to_word_sequence
from keras.preprocessing.text import text_to_word_sequence

In [None]:
#find Number of articles
no_of_articles=len(articles)
print(no_of_articles)

49972


Number of articles = 49972

In [None]:
#initialize data
import numpy as np
data=np.zeros(shape=(no_of_articles,MAX_SENTS,MAX_SENT_LENGTH),dtype='int32')
print(data[0])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [None]:
#this code block encodes the body of each article as word ids and stores the result in data[article]
import keras
for article in range(no_of_articles):
    #convert each sentence of the article into a list of word ids from t.word_index
    #(texts_to_sequences splits each sentence into words internally)
    sentence_ids = t.texts_to_sequences(articles[article])
    #pad or truncate every sentence to MAX_SENT_LENGTH word ids
    encoded_data = keras.preprocessing.sequence.pad_sequences(sentence_ids, maxlen=MAX_SENT_LENGTH, dtype='int32', padding='post', truncating='post', value=0)
    #keep at most MAX_SENTS sentences per article; unused rows stay zero
    for encoding in range(min(MAX_SENTS, len(encoded_data))):
        data[article][encoding] = encoded_data[encoding]
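
The encoding step can also be viewed as a small stand-alone function: each document becomes a fixed-size grid of word ids with one row per sentence, zero-padded on the right and truncated where it is too long. The NumPy sketch below illustrates that layout on toy data. The function `encode_document`, the whitespace splitting and the `toy_index` dictionary are simplifying assumptions, not the Keras code path used above.

```python
import numpy as np

def encode_document(sentences, word_index, max_sents, max_len, num_words=None):
    """Encode a list of sentences as a (max_sents, max_len) int32 grid of word ids.

    Unknown words are skipped. If num_words is given, ids >= num_words are
    dropped, which is how the Keras Tokenizer applies its vocabulary limit.
    """
    encoded = np.zeros((max_sents, max_len), dtype='int32')
    for row, sentence in enumerate(sentences[:max_sents]):
        ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        ids = ids[:max_len]  # truncate long sentences at the end
        if ids:
            encoded[row, :len(ids)] = ids  # remaining positions stay 0 (padding)
    return encoded

toy_index = {'soldier': 1, 'shot': 2, 'near': 3}
toy_encoded = encode_document(['Soldier shot', 'near unknown soldier'],
                              toy_index, max_sents=3, max_len=4)
print(toy_encoded)
```

Here the word `unknown` is not in `toy_index`, so it is skipped, and the third row stays all zeros because the toy document has only two sentences.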
      
        

### Check 3:

Accessing the first element in `data` should give something like the output below.

In [None]:
#data.astype('int32')
data[0, :, :]

array([[    3,   481,   427,  7211,    81,     3,  3734,   331,     5,
         3892,   350,     4,  1431,  2960,     1,    89,    12,   466,
            0,     0],
       [  758,    95,  1047,     3,  2679,  1752,     7,   189,     3,
         1217,  1075,  2030,   700,   159,     1,  3033,   448,     1,
          555,   235],
       [   89,  1068,  4117,  2349,    12,     3,  1092,  3307,    19,
            1,    89,     2,  1793,     1,   521,  2009,    15,     9,
            3,  3111],
       [  181,  3641,   972,   200,  2558,    44,  6776,  1722,  1252,
            5, 13324, 17943,     1,   778,    31,   740,  3991,    67,
           85,     0],
       [ 2349,    12,  1557,    38,  1094,   351,   775,     2,   367,
          260,  1770,     5,  4455,    70,   494,     0,     0,     0,
            0,     0],
       [    1,   700,   189,    19,     1,   427,    32,     3,  7423,
            4,  2159,  1252,     6,     3,  5271,     4,  1217,  1252,
           12,  3365],
       [  

First Article after encoding:

array([[    3,   481,   427,  7211,    81,     3,  3734,   331,     5,
         3892,   350,     4,  1431,  2960,     1,    89,    12,   466,
            0,     0],
       [  758,    95,  1047,     3,  2679,  1752,     7,   189,     3,
         1217,  1075,  2030,   700,   159,     1,  3033,   448,     1,
          555,   235],
       [   89,  1068,  4117,  2349,    12,     3,  1092,  3307,    19,
            1,    89,     2,  1793,     1,   521,  2009,    15,     9,
            3,  3111],
       [  181,  3641,   972,   200,  2558,    44,  6776,  1722,  1252,
            5, 13324, 17943,     1,   778,    31,   740,  3991,    67,
           85,     0],
       [ 2349,    12,  1557,    38,  1094,   351,   775,     2,   367,
          260,  1770,     5,  4455,    70,   494,     0,     0,     0,
            0,     0],
       [    1,   700,   189,    19,     1,   427,    32,     3,  7423,
            4,  2159,  1252,     6,     3,  5271,     4,  1217,  1252,
           12,  3365],
       [   13,    12,    15,     8,   149,    25,   543,    64,     1,
          427,  3727,    41,     9,  1850,     0,     0,     0,     0,
            0,     0],
       [ 3365,  5734,     4,     1,  5876,   614,    21,     1,   311,
         3439,   795,     4,  1557,    12,     1,   427,    69,    23,
          787,     2],
       [   37,    17,     2,  1793,    15,    52,   120,    15,    69,
           23,  4923,    41,  1963,    13,    12,     0,     0,     0,
            0,     0],
       [ 4737,  3339,    24,  3971,     2,     1,  1316,     4,  3073,
         1655,    12,    15,     9,   195,  1421,     7,    58,    40,
           95,     3],
       [   37,    17,     2,  1094,    64,   510,    20,     3,   250,
           41,   264,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [  260,   758,    95,  1047,     3,  1808,  1752,   531,   276,
           29,    12,    33,   703,   163,   893,  1421,     5,     1,
         2081,     0],
       [   35,     9,  2058,    10,   116,  5828,     6,    35,   576,
          656,   104,    59,     4,     3,  2411,    35,   241,     3,
          512,  1911],
       [   37,   341,    15,     9,     3,  2082,   120,    37,   881,
           24,  4456,  2585,  4317,  4924,    55,     1,   555,   235,
            0,     0],
       [    1,   255,     4,     1,   700,     8,   159,  3961,   351,
          448,     6,    24,   155,   465,  1930,     0,     0,     0,
            0,     0],
       [  126,   921,    22,    47,   100,    36,  1834,     2,  1213,
           15,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0]], dtype=int32)

In [None]:
#initialise a list texts with all the headings of articles from the dataset
texts= list(dataset['Headline'])
#displaying the first article Headline
texts[0]

'Soldier shot near Canadian parliament building'

First Article headline:
'Soldier shot near Canadian parliament building'

In [None]:
#splits the headings into sentences
headlines=[]
#for each headline among all headings, tokenize the heading into sentences and add it to the list of tokenized headings
for text in texts:
    headline = nltk.tokenize.sent_tokenize(text)
    headlines.append(headline)

In [None]:
#find Number of headlines
no_of_headlines=len(headlines)
print(no_of_headlines)

49972


No of headlines = 49972

In [None]:
#initialize headline_data
import numpy as np
data_heading=np.zeros(shape=(no_of_headlines,MAX_SENTS,MAX_SENT_LENGTH),dtype='int32')
print(data_heading[0])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [None]:
#this code block encodes the headline of each article as word ids and stores the result in data_heading[headline]
import keras
for headline in range(no_of_headlines):
    #convert each sentence of the headline into a list of word ids from t.word_index
    #(texts_to_sequences splits each sentence into words internally)
    headline_ids = t.texts_to_sequences(headlines[headline])
    #pad or truncate every sentence to MAX_SENT_LENGTH word ids
    encoded_data = keras.preprocessing.sequence.pad_sequences(headline_ids, maxlen=MAX_SENT_LENGTH, dtype='int32', padding='post', truncating='post', value=0)
    #data_heading was allocated with MAX_SENTS rows, so copy at most MAX_SENTS sentences
    for encoding in range(min(MAX_SENTS, len(encoded_data))):
        data_heading[headline][encoding] = encoded_data[encoding]
      

In [None]:
print(data_heading[0])

[[717 206 159 356 343 387   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

First Article Heading after encoding:
[[717 206 159 356 343 387   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]]

### Now that the features are ready, let's prepare the labels for the model to process.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.

In [None]:
#create one-hot encoding of target classes
labels=pd.get_dummies(dataset['Stance'])

In [None]:
type(labels)

pandas.core.frame.DataFrame

In [None]:
# display the first 30 target classes
print(labels[0:30])
# convert labels (a pandas DataFrame) to a NumPy array
labels=labels.values

    agree  disagree  discuss  unrelated
0       0         0        0          1
1       0         0        0          1
2       0         0        0          1
3       0         0        0          1
4       0         0        0          1
5       0         0        0          1
6       0         0        0          1
7       0         0        0          1
8       0         0        0          1
9       0         0        0          1
10      0         0        0          1
11      0         0        0          1
12      0         0        0          1
13      0         0        0          1
14      0         0        0          1
15      0         0        0          1
16      0         0        0          1
17      0         0        0          1
18      0         0        0          1
19      0         0        0          1
20      0         0        0          1
21      0         0        0          1
22      0         0        0          1
23      0         0        0          1


Class labels read from dataframe:

 agree  disagree  discuss  unrelated
 
0       0         0        0          1

1       0         0        0          1

2       0         0        0          1

3       0         0        0          1
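The one-hot conversion above can be tried on a toy column first. This is a minimal sketch with made-up stance values, not the real dataset; `toy`, `one_hot`, and `class_names` are illustrative names. Passing `dtype=int` is an assumption made here so that the result is 0/1 regardless of the installed pandas version.

```python
import numpy as np
import pandas as pd

# Toy stance column (illustrative values, not the real dataset)
toy = pd.DataFrame({'Stance': ['unrelated', 'agree', 'discuss', 'disagree', 'unrelated']})

# One indicator column per class; dtype=int gives 0/1 values
one_hot = pd.get_dummies(toy['Stance'], dtype=int)
class_names = list(one_hot.columns)      # columns come out in sorted order
y = one_hot.values                       # NumPy array of shape (5, 4)

# Recover the class name of each row from the position of its 1
recovered = [class_names[i] for i in y.argmax(axis=1)]
print(class_names)
print(y.shape)
print(recovered)
```

Keeping the column order around is useful later, because the position of the largest model output has to be mapped back to a class name.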

### Check 4:

The shape of data and labels:

In [None]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (49972, 20, 20)
Shape of label tensor: (49972, 4)



### Shuffle the data

In [None]:
## indices 0 .. number of articles - 1
indices = np.arange(data.shape[0])
## shuffle the indices in place
np.random.shuffle(indices)
print(indices)

[28867 37484  1708 ...  8377 32283 37493]


Indices of the articles after shuffling:
[23979 43074  6633 ... 20021 16735 18750]

In [None]:
## shuffle the data
data = data[indices]
data_heading = data_heading[indices]
## shuffle the labels according to data
labels = labels[indices]
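Indexing every array with the same permutation is what keeps bodies, headings, and labels aligned. The sketch below illustrates this on small stand-in arrays. The `toy_*` names and the fixed seed are assumptions for the example, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the shuffle is repeatable

# Toy stand-ins: row i of each array belongs to the same example
toy_body = np.arange(5).reshape(5, 1) * 10
toy_heading = np.arange(5).reshape(5, 1) * 100
toy_labels = np.arange(5)

perm = rng.permutation(len(toy_body))    # one permutation shared by all arrays
toy_body, toy_heading, toy_labels = toy_body[perm], toy_heading[perm], toy_labels[perm]

# Rows still line up: body == 10 * label and heading == 100 * label
print(perm)
print((toy_body[:, 0] == 10 * toy_labels).all())
```

Shuffling each array with its own call to a shuffle function would break this alignment, which is why a single index array is reused.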

In [None]:
# display the first entry of data, data_heading and labels
print("data 1:")
print(data[0])
print("headline 1:")
print(data_heading[0])
print("label 1:")
print(labels[0])

data 1:
[[ 5255     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]
 [ 1132  3435     5   808     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]
 [   78     2  2592    30   296    21     1 19849  2079  7239 18077    19
    996   982     1  2244    56   808   954    21]
 [  160   447    24  2473  2395     2    60     0     0     0     0     0
      0     0     0     0     0     0     0     0]
 [    1   179     7   296    61   163     1  3239  2442     0     0     0
      0     0     0     0     0     0     0     0]
 [  160  6170     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]
 [   78     2     3   107    10  2475   982    30   403   117  3407    11
   1621     1  6681   644  1134  2381    96  3987]
 [  182  2660     1  5363   756  5516   130   144   759     1    26   622
      8    14     1  1334  1189   196   226

### Split into train and validation sets. Split the data in an 80:20 ratio to get the train and validation sets.


Use the variable names as given below:

x_train, x_val - for body of articles.

x_heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.

<h1> [10 marks] </h1>

In [None]:
import math
index1 = 0.8*len(data)
index1=math.ceil(index1)
index1

39978

Training set: indices 0-39977 (39978 samples),
Validation set: indices 39979-49971 (9993 samples). The `index1+1` slice in the next cell skips index 39978.

In [None]:
#splitting into training and validation sets in the ratio 80:20
x_train,x_val = data[0:index1],data[index1+1:]
x_heading_train,x_heading_val = data_heading[0:index1],data_heading[index1+1:]
y_train,y_val = labels[0:index1],labels[index1+1:]
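Because the validation slice starts at `index1+1`, one sample (index 39978) ends up in neither set. The sketch below is one possible gap-free variant, using a stand-in array of the same length. `toy`, `split`, and `skipping` are illustrative names. It would give 9994 validation rows, so the shapes printed in Check 5 would read 9994 instead of 9993.

```python
import math
import numpy as np

n_samples = 49972                         # number of rows reported in Check 4
toy = np.arange(n_samples)                # stand-in for data / data_heading / labels

split = math.ceil(0.8 * n_samples)        # same value as index1
train, val = toy[:split], toy[split:]     # [split:] starts exactly where [:split] ends

skipping = toy[split + 1:]                # the slice used in the notebook cell
print(len(train), len(val), len(skipping))
```

With the gap-free slices, `len(train) + len(val)` equals the number of samples, which is a quick sanity check after any manual split.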

### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [None]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

(39978, 20, 20)
(39978, 4)
(9993, 20, 20)
(9993, 4)


Shape of training data and class labels:- (39978, 20, 20) and (39978, 4)

Shape of data and class labels used for validation:- (9993, 20, 20) and (9993, 4)

### Create embedding matrix with the glove embeddings


Run the code below to create `embedding_matrix`, which holds the GloVe vector for every vocabulary word that appears in the GloVe word list. Rows for words without a GloVe vector stay all zeros.

In [None]:
vocab_size = len(t.word_index)
print(vocab_size)

27873


Vocabulary size = 27873

In [None]:
# load the whole embedding into memory
embeddings_index = dict()
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs
# NOTE: if t is a Keras Tokenizer, word_index starts at 1, so the largest
# index equals vocab_size and would need a matrix with vocab_size + 1 rows;
# the guard below skips any index that does not fit.
embedding_matrix = np.zeros((vocab_size, 100))

for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None and i < vocab_size:
        embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.
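The lookup logic can be checked on a tiny vocabulary. This is a sketch under the assumption that the word index starts at 1, as a Keras `Tokenizer` produces. `toy_word_index`, `toy_glove`, and `toy_matrix` are hypothetical stand-ins for `t.word_index`, `embeddings_index`, and `embedding_matrix`.

```python
import numpy as np

# Hypothetical toy inputs standing in for t.word_index and the GloVe lookup
toy_word_index = {'the': 1, 'news': 2, 'zzzunknown': 3}   # indices start at 1
toy_glove = {
    'the': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'news': np.array([0.4, 0.5, 0.6], dtype='float32'),
}
embedding_dim = 3

# One extra row so that the largest index is valid; row 0 stays free for padding
num_rows = len(toy_word_index) + 1
toy_matrix = np.zeros((num_rows, embedding_dim), dtype='float32')

for word, i in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:          # words missing from GloVe keep a zero row
        toy_matrix[i] = vector

print(toy_matrix.shape)
```

Sizing the matrix as `len(word_index) + 1` is the usual choice when index 0 is used for padding, since every index in the word index then has a row.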


 ## <font color=red> Milestone - 3 </font>

## Try different sequential models and report accuracy scores for each model.

<h1>[50 marks]  </h1>

### Import layers from Keras to build the model

In [None]:
#importing layers and models from keras
from keras.models import Model
from keras.layers import LSTM, Embedding , Dense, Dropout, Bidirectional, Input , TimeDistributed, BatchNormalization

In [None]:
from keras.layers import Flatten,Activation

### Model

**Model 1: Without using GloVe embeddings**

1. Training and validation use the article bodies only (`x_train`, `x_val`).


In [None]:
print(x_train.shape,x_val.shape)
type(x_train)

(39978, 20, 20) (9993, 20, 20)


numpy.ndarray

In [None]:
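# flatten each article from (MAX_SENTS, MAX_SENT_LENGTH) to one sequence of token ids
x_train_model1 = x_train.reshape(x_train.shape[0], MAX_SENTS * MAX_SENT_LENGTH)
x_val_model1 = x_val.reshape(x_val.shape[0], MAX_SENTS * MAX_SENT_LENGTH)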

In [None]:
print(x_train_model1.shape,x_val_model1.shape)

(39978, 400) (9993, 400)


In [None]:
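Article_len = 400
# input: one flattened article body (400 token ids)
text_input_model1 = sentence_input = Input(shape=(Article_len,), dtype='int32')
# trainable embedding layer; no pretrained weights are passed, so embedding_matrix is not used here
embedded_sequence_m1 = Embedding(output_dim=100, input_dim=vocab_size, input_length=Article_len)(text_input_model1)
# bidirectional LSTM returning one 200-dimensional vector per time step
l_lstm_m1 = Bidirectional(LSTM(100,return_sequences=True))(embedded_sequence_m1)
l_dense_m11 = TimeDistributed(Dense(100))(l_lstm_m1)
l_flatten_m1 = Flatten()(l_dense_m11)
# 4-way softmax over the stance classes
l_dense_m1 = Dense(4,activation='softmax')(l_flatten_m1)

model = Model(inputs=text_input_model1, outputs=l_dense_m1)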

### Compile and fit the model

In [None]:
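# NOTE: the targets are one-hot vectors over 4 classes and the output is a softmax,
# for which 'categorical_crossentropy' is the conventional loss. With
# 'binary_crossentropy', Keras 2.x resolves 'accuracy' to element-wise binary
# accuracy, so the acc / val_acc values logged below are likely higher than
# the 4-class accuracy.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])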
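The choice of loss also affects how the accuracy figure should be read. With one-hot targets, an element-wise (binary) accuracy counts every correctly predicted 0 as a hit, while a categorical accuracy counts only whole rows. The NumPy sketch below illustrates the gap on made-up predictions. It is not the Keras implementation, and `y_true` and `y_pred` are toy values.

```python
import numpy as np

# Toy one-hot targets over 4 classes (columns: agree, disagree, discuss, unrelated)
y_true = np.array([[0, 0, 0, 1],
                   [0, 0, 0, 1],
                   [1, 0, 0, 0],
                   [0, 0, 1, 0]])

# A toy model that always predicts "unrelated" with high confidence
y_pred = np.tile([0.03, 0.03, 0.04, 0.90], (4, 1))

# Categorical accuracy: fraction of rows whose arg-max class is correct
categorical_acc = np.mean(y_pred.argmax(axis=1) == y_true.argmax(axis=1))

# Binary accuracy: fraction of individual 0/1 entries matched after thresholding at 0.5
binary_acc = np.mean((y_pred > 0.5).astype(int) == y_true)

print(categorical_acc, binary_acc)   # 0.5 0.75
```

Here half of the rows are classified correctly, yet the element-wise score is 0.75, because each wrong row still matches two of its four entries.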

In [None]:
model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 400)               0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 400, 100)          2787300   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 400, 200)          160800    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 400, 100)          20100     
_________________________________________________________________
flatten_3 (Flatten)          (None, 40000)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 160004    
Total params: 3,128,204
Trainable params: 3,128,204
Non-trainable params: 0
_________________________________________________
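The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The arithmetic below uses the standard formulas for embedding, LSTM, and dense layers, with the sizes taken from the notebook. The variable names are illustrative.

```python
vocab_size = 27873        # printed by the notebook
embed_dim = 100           # Embedding output_dim
lstm_units = 100          # LSTM(100)
seq_len = 400             # flattened article length for Model 1
n_classes = 4

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_one_direction

# TimeDistributed(Dense(100)) on the 200-wide BiLSTM output; weights are shared over time
td_dense_params = (2 * lstm_units) * 100 + 100

# Final Dense(4) on the flattened (seq_len * 100) vector
output_params = (seq_len * 100) * n_classes + n_classes

total = embedding_params + bilstm_params + td_dense_params + output_params
print(embedding_params, bilstm_params, td_dense_params, output_params, total)
# 2787300 160800 20100 160004 3128204
```

Only the final dense layer depends on the sequence length, so doubling the input to 800 tokens changes that term to 320,004 and leaves the other counts unchanged.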

In [None]:
model.fit(x=x_train_model1, y=y_train, epochs=2, verbose=1, validation_data=(x_val_model1,y_val), shuffle=False)

Train on 39978 samples, validate on 9993 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7be17afbe0>

model.fit results:

Train on 39978 samples, validate on 9993 samples

Epoch 1/2
39978/39978 [==============================] - 2018s 50ms/step - loss: 0.2822 - acc: 0.8876 - val_loss: 0.2579 - val_acc: 0.8993

Epoch 2/2

39978/39978 [==============================] - 2031s 51ms/step - loss: 0.2431 - acc: 0.9021 - val_loss: 0.2613 - val_acc: 0.8984

<keras.callbacks.History at 0x7f7be17afbe0>


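**MODEL 1:**

This model uses only the article bodies for prediction. The input passes through an embedding layer and a bidirectional LSTM block. No `batch_size` is passed to `fit`, so the Keras default of 32 applies, and the log above shows a validation accuracy of 89.84% (`val_acc: 0.8984`) after 2 epochs.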
    

    

**Model 2: Using both headline and body for prediction**

In [None]:
# reshape the headline data to (n_samples, 400) before passing it to the model
x_train_headline = x_heading_train.reshape(x_heading_train.shape[0], MAX_SENTS * MAX_SENT_LENGTH)
x_val_headline = x_heading_val.reshape(x_heading_val.shape[0], MAX_SENTS * MAX_SENT_LENGTH)

In [None]:
#concatenating headline and body before passing to the model
x_train_m2= np.concatenate((x_train_headline,x_train_model1),axis=1)
x_val_m2 = np.concatenate((x_val_headline,x_val_model1),axis=1)

In [None]:
x_val_m2.shape

(9993, 800)
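The reshape-and-concatenate step can be checked on a small stand-in batch. This sketch assumes the same (20, 20) layout per example as in the notebook. The `toy_*` arrays are filled with constants so that the two halves of the combined input are easy to tell apart.

```python
import numpy as np

n, max_sents, max_len = 3, 20, 20                                  # toy batch of 3 examples
toy_heading = np.ones((n, max_sents, max_len), dtype='int32')      # stand-in for x_heading_val
toy_body = np.full((n, max_sents, max_len), 2, dtype='int32')      # stand-in for x_val

# Flatten each (20, 20) grid of token ids into one 400-long sequence
heading_flat = toy_heading.reshape(n, -1)
body_flat = toy_body.reshape(n, -1)

# Headline first, body second, joined along the feature axis
combined = np.concatenate((heading_flat, body_flat), axis=1)
print(combined.shape)   # (3, 800)
```

Columns 0 to 399 hold the headline and columns 400 to 799 hold the body, so the order of the two arrays in `np.concatenate` fixes what the model sees first.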

In [None]:

# Input layer containing both headline and body (400 + 400 token ids)
main_input_m2 = Input(shape=(800,), dtype='int32')

# This embedding layer will encode the input sequence
# into a sequence of dense 100-dimensional vectors.
main_embed_m2 = Embedding(output_dim=100, input_dim=vocab_size, input_length=800)(main_input_m2)

l_lstm_m2 = Bidirectional(LSTM(100,return_sequences=True))(main_embed_m2)
l_dense_m22 = TimeDistributed(Dense(100))(l_lstm_m2)
l_flatten_m2 = Flatten()(l_dense_m22)
l_dense_m2 = Dense(4,activation='softmax')(l_flatten_m2)


In [None]:
# note: the variable is named model3, but this is Model 2 in the write-up
model3 = Model(inputs=main_input_m2, outputs=l_dense_m2)

In [None]:
model3.summary()

Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 800)               0         
_________________________________________________________________
embedding_14 (Embedding)     (None, 800, 100)          2787300   
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 800, 200)          160800    
_________________________________________________________________
time_distributed_8 (TimeDist (None, 800, 100)          20100     
_________________________________________________________________
flatten_8 (Flatten)          (None, 80000)             0         
_________________________________________________________________
dense_16 (Dense)             (None, 4)                 320004    
Total params: 3,288,204
Trainable params: 3,288,204
Non-trainable params: 0
_________________________________________________

In [None]:
#compiling the model
model3.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
type(x_train_headline)

numpy.ndarray

In [None]:
model3.fit(x_train_m2, y=y_train,batch_size=32, epochs=2, verbose=1, validation_data=(x_val_m2,y_val), shuffle=False)

Train on 39978 samples, validate on 9993 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7bddffde48>

Model.fit results:

Train on 39978 samples, validate on 9993 samples

Epoch 1/2

39978/39978 [==============================] - 3985s 100ms/step - loss: 0.2676 - acc: 0.8946 - val_loss: 0.2367 - val_acc: 0.9072

Epoch 2/2

39978/39978 [==============================] - 3924s 98ms/step - loss: 0.2136 - acc: 0.9154 - val_loss: 0.2385 - val_acc: 0.9047
<keras.callbacks.History at 0x7f7bddffde48>


**MODEL 2:**

This model uses both the article body and its heading for prediction. The input is the concatenated array of heading and body, which is passed through an embedding layer and a bidirectional LSTM block. It uses a batch size of 32 and reaches 90.47% validation accuracy after 2 epochs.

In the logged runs the model performs slightly better when both the article heading and body are used (`val_acc: 0.9047`) than when only the article body is used (`val_acc: 0.8984`).


**MODEL 3**


In [None]:
# Input layer containing both headline and body (400 + 400 token ids)
main_input_m3 = Input(shape=(800,), dtype='int32')

# This embedding layer will encode the input sequence
# into a sequence of dense 100-dimensional vectors.
main_embed_m3 = Embedding(output_dim=100, input_dim=vocab_size, input_length=800)(main_input_m3)

l_lstm_m3 = Bidirectional(LSTM(100,return_sequences=True))(main_embed_m3)
l_dense_m33 = TimeDistributed(Dense(100))(l_lstm_m3)
l_flatten_m3 = Flatten()(l_dense_m33)


# Adding a dropout layer before the dense layer
dropout = Dropout(0.2)(l_flatten_m3)
#Output layer uses softmax activation function
l_out_m3 = Dense(4,activation='softmax')(dropout)


In [None]:
model4 = Model(inputs=main_input_m3, outputs=l_out_m3)

In [None]:
#compiling the model
model4.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
# fitting the model, increasing the batch size from 32 to 50
model4.fit(x_train_m2, y=y_train,batch_size=50, epochs=2, verbose=1, validation_data=(x_val_m2,y_val), shuffle=False)

Train on 39978 samples, validate on 9993 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7ff546e4e0>

In [None]:
model4.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 800)               0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 800, 100)          2787300   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 800, 200)          160800    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 800, 100)          20100     
_________________________________________________________________
flatten_2 (Flatten)          (None, 80000)             0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 80000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 3200

**MODEL 3:**

This model also takes the combined array of headline and body as input. It adds a dropout layer (rate 0.2) just before the output layer, which uses the softmax activation function. The batch size used for training is 50. The model achieves a validation accuracy of 90.75% after 2 epochs of training.

Adding the dropout layer gives only a small improvement in validation accuracy over Model 2 (90.75% versus 90.47%). Training is also faster, at about 2545 s per epoch versus about 3950 s, but the batch size was raised from 32 to 50 at the same time, so the speed-up cannot be attributed to dropout alone.
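For readers unfamiliar with what a dropout layer does during training, the following NumPy sketch shows the usual "inverted dropout" idea: a random fraction of activations is zeroed and the rest are scaled up so the expected value is unchanged. It is an illustration only, not the Keras implementation, and `inverted_dropout` is a hypothetical helper name.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each entry with probability `rate` and rescale the survivors."""
    keep = rng.random(x.shape) >= rate        # True for entries that are kept
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 80))             # toy activations, all equal to 1
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

# About 20% of the entries are zero, and the mean stays close to 1
print((dropped == 0).mean(), dropped.mean())
```

At prediction time no units are dropped, which is why the rescaling during training matters: the following layer sees activations of the same average size in both modes.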

**Model Summary:**

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 800)               0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 800, 100)          2787300   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 800, 200)          160800    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 800, 100)          20100     
_________________________________________________________________
flatten_2 (Flatten)          (None, 80000)             0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 80000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 320004    

Total params: 3,288,204
Trainable params: 3,288,204
Non-trainable params: 0

**Model Results:**

Train on 39978 samples, validate on 9993 samples

Epoch 1/2

39978/39978 [==============================] - 2545s 64ms/step - loss: 0.2679 - acc: 0.8940 - val_loss: 0.2346 - val_acc: 0.9075

Epoch 2/2

39978/39978 [==============================] - 2544s 64ms/step - loss: 0.2104 - acc: 0.9168 - val_loss: 0.2374 - val_acc: 0.9075

<keras.callbacks.History at 0x7f7ff546e4e0>


In [None]:
print("validation data shape:",x_val_m2.shape)
x_val_first= x_val_m2[0:1]
print("first data shape:",x_val_first.shape)


validation data shape: (9993, 800)
first data shape: (1, 800)


In [None]:
#predicting the label for a news article
ynew = model4.predict(x_val_first)

In [None]:
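# map the arg-max of the prediction back to the class names; this is the same
# get_dummies call that produced the training labels, so the column order matches.
# A separate name is used so the shuffled `labels` array is not overwritten.
label_names = pd.get_dummies(dataset['Stance']).columns
print('Predicted label for data at index 0:',label_names[ynew.argmax()])
print('Actual label for data at index 0:',label_names[y_val[0].argmax()])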

Predicted label for data at index 0: unrelated
Actual label for data at index 0: unrelated
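The same arg-max lookup extends from one example to a whole batch. The sketch below uses made-up softmax outputs in place of `model4.predict(...)`. `class_names`, `toy_probs`, and `toy_true` are illustrative, and the class order is assumed to match the one-hot label columns.

```python
import numpy as np

class_names = ['agree', 'disagree', 'discuss', 'unrelated']   # column order of the one-hot labels

# Hypothetical softmax outputs for three examples (each row sums to 1)
toy_probs = np.array([[0.05, 0.05, 0.10, 0.80],
                      [0.60, 0.10, 0.20, 0.10],
                      [0.10, 0.10, 0.70, 0.10]])
toy_true = np.array([[0, 0, 0, 1],
                     [0, 0, 1, 0],
                     [0, 0, 1, 0]])

# axis=1 takes the arg-max per row; without it, argmax works on the flattened array
pred_idx = toy_probs.argmax(axis=1)
true_idx = toy_true.argmax(axis=1)

pred_names = [class_names[i] for i in pred_idx]
accuracy = float(np.mean(pred_idx == true_idx))
print(pred_names, accuracy)
```

Comparing `pred_idx` with `true_idx` row by row gives the 4-class accuracy directly, independent of which metric was selected at compile time.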


**MODEL 4**

In [None]:
# Input layer containing both headline and body (400 + 400 token ids)
main_input_m4 = Input(shape=(800,), dtype='int32')

# This embedding layer will encode the input sequence
# into a sequence of dense 100-dimensional vectors.
main_embed_m4 = Embedding(output_dim=100, input_dim=vocab_size, input_length=800)(main_input_m4)
#Adding a dropout of 0.2
dropout1 = Dropout(0.2)(main_embed_m4)
l_lstm_m4 = Bidirectional(LSTM(100,return_sequences=True))(dropout1)
#Adding a Normalization layer
l_norm_m4 = BatchNormalization()(l_lstm_m4)
l_dense_m44 = TimeDistributed(Dense(100))(l_norm_m4 )
l_flatten_m4 = Flatten()(l_dense_m44)


# Adding a dropout layer before the dense layer
dropout4 = Dropout(0.2)(l_flatten_m4)
#Output layer uses softmax activation function
l_out_m4 = Dense(4,activation='softmax')(dropout4)

In [None]:
model5 = Model(inputs=main_input_m4, outputs=l_out_m4)

In [None]:
#compiling the model
model5.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
model5.summary()

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 800)               0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 800, 100)          2787300   
_________________________________________________________________
dropout_7 (Dropout)          (None, 800, 100)          0         
_________________________________________________________________
bidirectional_6 (Bidirection (None, 800, 200)          160800    
_________________________________________________________________
batch_normalization_3 (Batch (None, 800, 200)          800       
_________________________________________________________________
time_distributed_5 (TimeDist (None, 800, 100)          20100     
_________________________________________________________________
flatten_4 (Flatten)          (None, 80000)             0   

In [None]:
# fitting the model with a batch size of 50 (same as Model 3)
model5.fit(x_train_m2, y=y_train,batch_size=50, epochs=2, verbose=1, validation_data=(x_val_m2,y_val), shuffle=False)

Train on 39978 samples, validate on 9993 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7fa75830f0>

In [None]:
#predicting the label for a news article
prediction = model5.predict(x_val_first)

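# map the arg-max of the prediction back to the class names; a separate name
# is used so the shuffled `labels` array is not overwritten
label_names = pd.get_dummies(dataset['Stance']).columns
print('Predicted label for data at index 0:',label_names[prediction.argmax()])
print('Actual label for data at index 0:',label_names[y_val[0].argmax()])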

Predicted label for data at index 0: unrelated
Actual label for data at index 0: unrelated


**MODEL 4**

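Model 4 also uses the concatenated array of headlines and article bodies. Compared to Model 3, it adds a second Dropout layer (rate 0.2) after the embedding and a BatchNormalization layer after the bidirectional LSTM. Model 4 has a validation accuracy of 90.38% after 2 epochs.

This is slightly lower than Model 3 (90.75%). Because dropout and batch normalization were added together, the small decrease cannot be attributed to Batch Normalization alone.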
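As a rough picture of what the added normalization layer computes during training, the NumPy sketch below normalizes each feature over the batch and time axes and then applies a learned scale and shift. It is a simplified illustration, not the Keras layer: `batch_norm_train` is a hypothetical helper, the moving averages used at inference are omitted, and `eps=1e-3` is assumed.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalise each feature (last axis) over all other axes, then scale and shift."""
    axes = tuple(range(x.ndim - 1))           # every axis except the feature axis
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
features = 200                                 # width of the BiLSTM output in Model 4
x = rng.normal(loc=3.0, scale=2.0, size=(8, 50, features))   # toy (batch, time, features)

gamma = np.ones(features)                      # trainable scale
beta = np.zeros(features)                      # trainable shift
y = batch_norm_train(x, gamma, beta)

# gamma and beta are trainable; the moving mean and variance are not
trainable, non_trainable = 2 * features, 2 * features
print(trainable + non_trainable, non_trainable)   # 800 400
```

The four vectors of length 200 account for the 800 parameters of the normalization layer, of which the 400 belonging to the moving statistics are the non-trainable parameters.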

**Model Summary:**

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 800)               0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 800, 100)          2787300   
_________________________________________________________________
dropout_7 (Dropout)          (None, 800, 100)          0         
_________________________________________________________________
bidirectional_6 (Bidirection (None, 800, 200)          160800    
_________________________________________________________________
batch_normalization_3 (Batch (None, 800, 200)          800       
_________________________________________________________________
time_distributed_5 (TimeDist (None, 800, 100)          20100     
_________________________________________________________________
flatten_4 (Flatten)          (None, 80000)             0         
_________________________________________________________________
dropout_8 (Dropout)          (None, 80000)             0         
_________________________________________________________________
dense_9 (Dense)              (None, 4)                 320004    

 
Total params: 3,289,004
Trainable params: 3,288,604
Non-trainable params: 400

**Model Results:**

Train on 39978 samples, validate on 9993 samples

Epoch 1/2

39978/39978 [==============================] - 2536s 63ms/step - loss: 0.3281 - acc: 0.8811 - val_loss: 0.2569 - val_acc: 0.9030

Epoch 2/2

39978/39978 [==============================] - 2544s 64ms/step - loss: 0.2297 - acc: 0.9098 - val_loss: 0.2500 - val_acc: 0.9038

<keras.callbacks.History at 0x7f7fa75830f0>
