# Importing Dependencies & extracting dataset
----
This part covers the _Extraction_ step of the ***E***TL process. First, the required modules are imported, then the dataset is loaded.

In [2]:
import pandas as pd
import numpy as np
import re

from keras.preprocessing import sequence as seq
import tensorflow as tf
import tensorflow_datasets as tfds

Using TensorFlow backend.


In [3]:
### Optional: mount Google Drive as file storage (when running in Google Colab)
from google.colab import drive

drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [4]:
### Some commands to create directory structure and rename file - already executed in the Data Exploration Notebook
# %cd /content/gdrive/My\ Drive
# !rm -r projects
# !mkdir projects
# !mkdir projects/capstone
# !mkdir projects/capstone/checkpoints
%cd /content/gdrive/My\ Drive/projects/capstone

/content/gdrive/My Drive/projects/capstone


In [5]:
### Already executed in the Data Exploration Notebook
# !rm trainingandtestdata.zip -f
# !wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip 
# !unzip trainingandtestdata.zip
# !mv training.1600000.processed.noemoticon.csv sentiment_analysis_trainingset.csv

## Loading the dataset into a pandas DataFrame:

In [8]:
nameColumns = ["sentiment", "id", "date", "query", "user", "post"]

fullDataset = pd.read_csv('sentiment_analysis_trainingset.csv', 
                          header=None, 
                          names=nameColumns, 
                          encoding='latin1', 
                          engine='python')

In [12]:
print(fullDataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1600000 non-null  int64 
 1   id         1600000 non-null  int64 
 2   date       1600000 non-null  object
 3   query      1600000 non-null  object
 4   user       1600000 non-null  object
 5   post       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
None
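
Because the file has no header row, it is the combination of `header=None` and `names=` that attaches the column labels. The following is a minimal sketch on an in-memory CSV; the two rows are invented for illustration and are not taken from the dataset, and the variable names (`raw`, `toy`, `name_columns`) are hypothetical.

```python
import io
import pandas as pd

# Two made-up rows in the same six-column layout (illustrative only).
raw = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","what a great day"\n'
)

name_columns = ["sentiment", "id", "date", "query", "user", "post"]

# header=None tells pandas that the first line is data, not column names;
# names= supplies the labels instead.
toy = pd.read_csv(raw, header=None, names=name_columns)

print(toy.dtypes)
print(toy[["sentiment", "post"]])
```

Without `header=None`, pandas would treat the first tweet as the header and silently drop it from the data.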


## Initial sampling
To reduce processing time, a 10% random sample (160,000 rows) is drawn from the 1.6 million data points.


In [13]:
sampleDataset = fullDataset.sample(frac=0.1, replace=False, random_state=300)
sampleDataset.index = range(len(sampleDataset))
sampleDataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   sentiment  160000 non-null  int64 
 1   id         160000 non-null  int64 
 2   date       160000 non-null  object
 3   query      160000 non-null  object
 4   user       160000 non-null  object
 5   post       160000 non-null  object
dtypes: int64(2), object(4)
memory usage: 7.3+ MB
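
Fixing `random_state` is what makes the sample reproducible between runs. The sketch below illustrates this on a small toy frame; the frame contents and the names `toy`, `first` and `second` are assumptions made for the example only.

```python
import pandas as pd

# Toy frame standing in for the full dataset (values are illustrative).
toy = pd.DataFrame({"sentiment": [0, 4] * 50,
                    "post": [f"tweet {i}" for i in range(100)]})

# Same seed, same rows: the draw is reproducible.
first = toy.sample(frac=0.1, replace=False, random_state=300).reset_index(drop=True)
second = toy.sample(frac=0.1, replace=False, random_state=300).reset_index(drop=True)

print(len(first))            # 10 rows, i.e. 10% of 100
print(first.equals(second))  # True
```

Here `reset_index(drop=True)` plays the same role as reassigning `sampleDataset.index`: the sampled rows get a fresh 0-based index.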


# Transformations
---
In this part, we will run the necessary data transformations (E***T***L):

## (re)tagging sentiments
Ensure that the input labels are `0` or `1`. The raw data uses `4` for the positive class, so it is remapped to `1`.

In [14]:
sampleDatasetLabels = sampleDataset.sentiment.values
sampleDatasetLabels[sampleDatasetLabels == 4] = 1
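
Note that assigning into the array returned by `.values` modifies it in place, and depending on the pandas version this may also change the underlying DataFrame column. If that side effect is not wanted, one possible alternative is to build a new array with `np.where`. This is only a sketch on made-up labels; `raw_labels` and `binary_labels` are hypothetical names.

```python
import numpy as np

raw_labels = np.array([0, 4, 4, 0, 4])   # illustrative raw labels

# Build a new array instead of modifying raw_labels in place.
binary_labels = np.where(raw_labels == 4, 1, 0)

print(binary_labels)  # [0 1 1 0 1]
print(raw_labels)     # [0 4 4 0 4], unchanged
```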

## tweets text treatment
Here we clean the tweet text, removing noise and anything non-essential to our analysis:

In [15]:
def TweetTreament(stringTweet):
    tweet = re.sub(r"https?://[a-zA-Z0-9./]+", ' ', stringTweet)  # remove URLs
    tweet = re.sub(r"(w|W){3}[a-zA-Z0-9./]+", ' ', tweet)         # remove www references
    tweet = re.sub(r"@[a-zA-Z0-9]+", ' ', tweet)                  # remove @mentions of accounts/profiles
    tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)                  # keep only letters and basic punctuation (. ! ? ')
    tweet = re.sub(r" +", ' ', tweet).strip()                     # collapse repeated spaces and trim both ends
    
    return tweet

cleanedTweets = [TweetTreament(t) for t in sampleDataset.post]
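
To see what each substitution does, the same five patterns can be applied to a single made-up tweet. The sketch below copies the regular expressions into a standalone function; the name `clean_tweet` and the sample text are hypothetical and exist only to keep the example self-contained.

```python
import re

def clean_tweet(text):
    """Apply the same five substitutions as the cleaning cell, in the same order."""
    text = re.sub(r"https?://[a-zA-Z0-9./]+", ' ', text)  # URLs
    text = re.sub(r"(w|W){3}[a-zA-Z0-9./]+", ' ', text)   # www references
    text = re.sub(r"@[a-zA-Z0-9]+", ' ', text)            # @mentions
    text = re.sub(r"[^a-zA-Z.!?']", ' ', text)            # digits, '#', '%', etc.
    return re.sub(r" +", ' ', text).strip()               # extra whitespace

sample = "@bob Check https://t.co/abc123 www.example.com I love it!!! #happy 100%"
print(clean_tweet(sample))  # Check I love it!!! happy
```

The hashtag word survives (only the `#` is dropped), while digits and the percent sign disappear entirely.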

## tokenization
Tokenization converts each word (or word piece) in a text into a unique integer, called a token. This is a common task in NLP, since models operate on numbers rather than characters. Luckily, TensorFlow Datasets provides the methods we need.

The method `SubwordTextEncoder.build_from_corpus()` allows us to build a vocabulary using the universe of words in our dataset. 

The `.encode()` method converts each tweet into a list of tokens. This process can be inverted with `.decode()`.
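
The encode/decode round trip can be illustrated without any library. The following is a deliberately simplified word-level sketch, not the subword algorithm used by `SubwordTextEncoder`; the corpus, the helper names, and the choice to start ids at 1 (keeping 0 free for padding) are assumptions of this example.

```python
corpus = ["i love this song", "i hate this weather"]

# Assign ids starting at 1; 0 is kept free for padding (an assumption of this sketch).
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_id = {word: i + 1 for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

def encode(sentence):
    """Map each known word to its integer id."""
    return [word_to_id[word] for word in sentence.split()]

def decode(ids):
    """Invert encode(): map ids back to words."""
    return " ".join(id_to_word[i] for i in ids)

tokens = encode("i love this weather")
print(tokens)          # [2, 3, 5, 6]
print(decode(tokens))  # i love this weather
```

A word-level vocabulary like this one fails on unseen words, which is one reason subword encoders are often preferred for noisy text such as tweets.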


In [16]:
targetVocab = 2**16

tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
                                          cleanedTweets, 
                                          target_vocab_size=targetVocab)

tokenizedData = [tokenizer.encode(sentence) for sentence in cleanedTweets]

> **point of decision:** the vocabulary size is a critical variable: a large vocabulary results in a large number of word-embedding parameters in the model, and therefore higher storage and memory requirements. 
> A quick Google search suggests that an average English speaker knows around 20,000-40,000 words. For our experiment, we will use `targetVocab = 2**16`, i.e. a target vocabulary of 65,536 entries.
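
The cost of this decision can be estimated with simple arithmetic: an embedding layer holds one vector per vocabulary entry. The sketch below assumes an embedding dimension of 128 and 32-bit floats; both values are hypothetical, since the model itself is not defined in this section.

```python
vocab_size = 2**16        # targetVocab from the notebook
embedding_dim = 128       # hypothetical embedding dimension
bytes_per_float32 = 4

# One embedding vector per vocabulary entry.
n_params = vocab_size * embedding_dim
size_mib = n_params * bytes_per_float32 / 2**20

print(f"{n_params:,} parameters, about {size_mib:.0f} MiB")
# 8,388,608 parameters, about 32 MiB
```

Halving the vocabulary halves both numbers, so the trade-off between coverage and model size is linear in `targetVocab`.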

## padding
Since we will use a Convolutional Neural Network (CNN) model, every input must have the same length, so padding is required.

Some tweets contain more tokens than others, and this is where padding comes in.

Padding normalizes the length of each tokenized tweet by filling it with `0` until it reaches a common length.

Luckily, Keras `sequence.pad_sequences()` does this for us. As the common length (`maxLength`) we use the maximum length found in our sample, and `padding="pre"` places the zeros in front of the tokens.

In [17]:
maxLength = max([len(p) for p in tokenizedData])

paddedData = tf.keras.preprocessing.sequence.pad_sequences(tokenizedData,
                                                           value=0,
                                                           padding="pre",
                                                           maxlen=maxLength)
print(maxLength)

58


In [18]:
paddedData[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,   774,  5919,
         589,   208,    86, 53975], dtype=int32)
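
The leading zeros above are the result of `padding="pre"`. The same behavior can be reproduced on a toy input with plain NumPy; this is a sketch meant to mimic the effect, not the Keras implementation, and `pad_pre` is a hypothetical helper.

```python
import numpy as np

def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=np.int32)
    for row, s in enumerate(sequences):
        if len(s) > 0:
            # Write the tokens into the right-hand end of the row.
            padded[row, max_len - len(s):] = s
    return padded

toy = [[7, 8, 9], [5], [3, 4]]
print(pad_pre(toy))
# [[7 8 9]
#  [0 0 5]
#  [0 3 4]]
```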

# Loading

## Splitting data into training/testing sets
As a best practice, we split our dataset into two groups: a training set (80%) and a testing set (20%).

In [22]:
testSize = int(len(paddedData) * 0.2)
print('The testing set has {0} datapoints\n'.format(testSize))

testInputs = paddedData[-testSize:]
testLabels = sampleDatasetLabels[-testSize:]
trainInputs = paddedData[:-testSize]
trainLabels = sampleDatasetLabels[:-testSize]

### Checking the shape of the model's inputs and labels 
print('Size of testing Inputs:', len(testInputs))
print('Size of testing Labels:', len(testLabels))
print('\nSize of training Inputs:', len(trainInputs))
print('Size of training Labels:', len(trainLabels))
print('\nFirst observation of training label:', trainLabels[0])
print('\nFirst observation of training inputs:', trainInputs[0])
print('\nTraining inputs shape:', trainInputs[0].shape)

The testing set has 32000 datapoints

Size of testing Inputs: 32000
Size of testing Labels: 32000

Size of training Inputs: 128000
Size of training Labels: 128000

First observation of training label: 1

First observation of training inputs: [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0   774  5919   589   208    86 53975]

Training inputs shape: (58,)
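
Taking the last 20% of rows as the testing set gives a random split only because the sample was drawn in random order earlier. A quick sanity check is to compare the class balance of both parts. The sketch below wraps the slicing in a hypothetical helper, `tail_split`, and runs it on toy arrays; it is one possible way to perform the check, not part of the original pipeline.

```python
import numpy as np

def tail_split(inputs, labels, test_fraction=0.2):
    """Use the last `test_fraction` of the rows as the testing set."""
    test_size = int(len(inputs) * test_fraction)
    if test_size == 0:
        # inputs[:-0] would be empty, so refuse a zero-sized test set.
        raise ValueError("test_fraction is too small for this dataset")
    return (inputs[:-test_size], labels[:-test_size],
            inputs[-test_size:], labels[-test_size:])

# Toy data: 10 rows of 2 features with alternating labels (illustrative only).
toy_inputs = np.arange(20).reshape(10, 2)
toy_labels = np.array([0, 1] * 5)

train_x, train_y, test_x, test_y = tail_split(toy_inputs, toy_labels)

print(len(train_x), len(test_x))          # 8 2
print(np.bincount(train_y, minlength=2))  # [4 4]
print(np.bincount(test_y, minlength=2))   # [1 1]
```

If the two class counts differ strongly between the parts, shuffling again or using a stratified split would be worth considering.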
