<h1>Stance Detection</h1>
Team Members: Tanvi Padhye [Z1906477], Ankita Ratnaparkhi [Z1907718]

**Step 1: Import the necessary libraries** <br>
This step imports all the libraries used in the notebook and downloads the NLTK resources (tokenizer models and stop-word list).

In [None]:
import pandas as pd
import os
import requests
import io
import matplotlib.pyplot as plt
import numpy as np
import nltk
nltk.download('punkt')
import operator
from nltk.corpus import stopwords
nltk.download('stopwords')
import re
import seaborn as sns
from wordcloud import WordCloud

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Step 2: Loading the data set**  <br>
The dataset used for this project is the EDA climate change dataset from Kaggle, which contains 15000 tweets. <br>

In [None]:
path = "https://raw.githubusercontent.com/iAnkitar/ISR_Spotlight/master/train.csv"
r = requests.get(path)
train_df = pd.read_csv(io.StringIO(r.text),header=0)

train_df.head(5)

Unnamed: 0,stance,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [None]:
# Drop the news tweets (stance == 2) from the dataset
train_df = train_df.drop(train_df[train_df.stance == 2].index)

Printing the total number of tweets, the distinct stance values and the number of tweets per stance.

In [None]:
print(len(train_df))
print(train_df.stance.unique())
print(train_df['stance'].value_counts())

12179
[ 1  0 -1]
 1    8530
 0    2353
-1    1296
Name: stance, dtype: int64
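
The counts above show a clear class imbalance. As a quick worked check (a sketch that uses only the numbers printed above; the variable names are illustrative), the share of each stance can be computed as follows:

```python
# Tweet counts per stance, copied from the value_counts() output above.
counts = {1: 8530, 0: 2353, -1: 1296}
total = sum(counts.values())  # 12179, matching len(train_df)

# Fraction of the dataset that belongs to each stance.
shares = {stance: n / total for stance, n in counts.items()}
for stance, share in shares.items():
    print(f"stance {stance:>2}: {share:.1%}")
```

About 70% of the tweets have stance 1, so a classifier that always predicts stance 1 would already be right on roughly 70% of this data. Accuracy figures are best read against that baseline.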


**Text preprocessing**<br>
The dataset needs to be cleaned before it can be used for modelling. We do this by removing stop words, punctuation and special characters.<br><br>**Simple text cleaning processes**: Common text cleaning steps include: <br>


1.  Removing leading, trailing and extra white spaces/tabs
2.  Removing punctuation, special characters, URLs and hashtags
3.  Correcting typos, slang and abbreviations

**Stop-word removal:** Using nltk, a generic list of stop words such as 'i', 'you', 'a' and 'the' can be removed. <br> <br>

In [None]:
train_df = train_df.reset_index(drop=True)
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = str(text)
    text = text.lower() # lowercase the text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace each character matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text) # delete every character that is not a digit, a lowercase letter, a space, '#', '+' or '_'
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stop words from the text
    return text

train_df['message'] = train_df['message'].apply(clean_text)
train_df['message'] = train_df['message'].str.replace(r'\d+', '', regex=True)  # remove digits; regex=True is required in pandas 2.0+
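
To see what the cleaning steps do to a single tweet, here is a minimal standalone sketch. The sample tweet is made up, `clean_text_demo` is an illustrative copy of the function above, and `SMALL_STOPWORDS` is a tiny stand-in for NLTK's English stop-word list so that the example runs without any downloads.

```python
import re

# Same patterns as in the cell above.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a tiny stand-in for NLTK's English stop-word list.
SMALL_STOPWORDS = {"it", "its", "and", "is", "the", "a"}

def clean_text_demo(text):
    text = str(text).lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join(w for w in text.split() if w not in SMALL_STOPWORDS)
    return re.sub(r'\d+', '', text)  # same effect as the digit-removal step

# Hypothetical tweet, not taken from the dataset.
sample = "RT @user: It's 2016, and climate change is real! https://t.co/abc"
cleaned = clean_text_demo(sample)
print(cleaned.split())
# ['rt', 'user', 'climate', 'change', 'real', 'https', 'tco', 'abc']
```

Note that with these patterns the URL is not removed; it is broken into the fragments `https`, `tco` and `abc`, which then enter the vocabulary as ordinary tokens.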

In [None]:
from keras.preprocessing.text import Tokenizer

# Maximum number of words to keep in the vocabulary (the most frequent ones).
MAX_NB_WORDS = 50000
# Maximum number of tokens kept per tweet; shorter tweets are padded, longer ones truncated.
MAX_SEQUENCE_LENGTH = 50
# Dimension of the word embedding vectors.
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters=r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(train_df['message'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 26147 unique tokens.


In [None]:
from keras.preprocessing.sequence import pad_sequences
X = tokenizer.texts_to_sequences(train_df['message'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', X.shape)

Shape of data tensor: (12179, 50)
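
With its default settings, `pad_sequences` pads and truncates at the start of each sequence. A minimal pure-Python sketch of that default behaviour follows; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                   # truncate from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front

short = pad_pre([5, 3], maxlen=4)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short, long_)
```

Every tweet therefore becomes a row of exactly 50 integers, with zeros on the left for tweets shorter than `MAX_SEQUENCE_LENGTH`.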


In [None]:
Y = pd.get_dummies(train_df['stance']).values

print('Shape of label tensor:', Y.shape)

Shape of label tensor: (12179, 3)
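
`pd.get_dummies` creates one column per distinct value, ordered by sorted value, so the three columns of `Y` correspond to stances -1, 0 and 1 in that order. Below is a small sketch of mapping a row of predicted probabilities back to a stance; `CLASSES` and `decode_stance` are illustrative names, and the probabilities are made up.

```python
# Column order of the one-hot labels: sorted distinct stance values.
CLASSES = sorted({1, 0, -1})  # [-1, 0, 1]

def decode_stance(probabilities):
    """Return the stance whose column has the highest score."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASSES[best]

print(decode_stance([0.1, 0.2, 0.7]))  # highest score in the last column
print(decode_stance([0.6, 0.3, 0.1]))  # highest score in the first column
```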


Splitting the dataset into training (80%) and test (20%) sets. During fitting, 10% of the training set is further held out for validation.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(9743, 50) (9743, 3)
(2436, 50) (2436, 3)
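
The split sizes printed above can be reproduced by hand. The sketch below also estimates how many rows end up in the validation set once `validation_split=0.1` is passed to `model.fit`; the rounding used for that last step is an assumption about how Keras picks the split point.

```python
import math

n_samples = 12179
test_size = 0.2
validation_split = 0.1

# train_test_split rounds the test set up and gives the rest to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Assumption: Keras takes the last validation_split fraction of the
# training rows for validation, flooring the split point.
n_fit = math.floor(n_train * (1.0 - validation_split))
n_val = n_train - n_fit

for name, n in [("fit", n_fit), ("validation", n_val), ("test", n_test)]:
    print(f"{name:>10}: {n:5d} rows ({n / n_samples:.1%})")
```

Under these assumptions the model is fitted on about 72% of the tweets, validated on about 8% and tested on 20%.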


The code below builds and trains the model used for stance detection: an embedding layer, an LSTM layer and a softmax output layer with one unit per stance.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import SpatialDropout1D
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 5
batch_size = 128

print(model.summary())

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 50, 100)           5000000   
                                                                 
 spatial_dropout1d_1 (Spatia  (None, 50, 100)          0         
 lDropout1D)                                                     
                                                                 
 lstm_1 (LSTM)               (None, 100)               80400     
                                                                 
 dense_1 (Dense)             (None, 3)                 303       
                                                                 
Total params: 5,080,703
Trainable params: 5,080,703
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
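
The parameter counts in the summary can be checked with a little arithmetic. This is a sketch using the standard formulas for embedding, LSTM and dense layers; the variable names are illustrative.

```python
vocab_size, embedding_dim = 50000, 100   # MAX_NB_WORDS, EMBEDDING_DIM
lstm_units, n_classes = 100, 3

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# Dense: weight matrix plus one bias per output class.
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 5000000 80400 303 5080703
```

Nearly all parameters sit in the embedding layer. Since the tokenizer found only 26147 unique tokens, roughly half of the 50000 embedding rows are never looked up.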


Evaluating the model on the test data and printing the loss and accuracy.

In [None]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.821
  Accuracy: 0.770
