<a href="https://colab.research.google.com/github/stavIatrop/Fake-News-Detection/blob/master/text_preprocessing_embeddings_isot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import and split data

In [45]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data = pd.read_csv("/content/drive/My Drive/datasets/isot_rev.csv", sep=",")
data_labels = data['label'].values
data = data['text'].values

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Each iteration overwrites the previous split, so only the last of the 5 splits is kept.
for train_index, test_index in sss.split(data, data_labels):
    X_train, X_test = data[train_index], data[test_index]
    Y_train, Y_test = data_labels[train_index], data_labels[test_index]

print("Train shape : ",X_train.shape)
print("Test shape : ",X_test.shape)

Train shape :  (35918,)
Test shape :  (8980,)


Remove non-ascii characters

In [0]:
import re

# Matches runs of characters outside the ASCII range (U+0000 to U+007F)
NON_ASCII = re.compile('[^\u0000-\u007F]+', re.UNICODE)

def remove_non_ascii(X):
  for i in range(len(X)):
    # Replace every non-ASCII run with a space, word by word
    filtered_list = [NON_ASCII.sub(" ", word) for word in X[i].split()]
    # Build the result per document, so an empty text stays empty
    X[i] = ' '.join(filtered_list)
  return X

In [0]:
X_train = remove_non_ascii(X_train)
X_test = remove_non_ascii(X_test)
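
To see what this step does to a single string, here is a minimal sketch. `strip_non_ascii` is a hypothetical helper, not part of the notebook, and the sample sentence is made up; unlike the function above, it also collapses the extra spaces left behind by the replacement.

```python
import re

# Runs of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def strip_non_ascii(text):
    """Replace each non-ASCII run with a space, then normalise whitespace."""
    return ' '.join(NON_ASCII.sub(' ', text).split())

# U+2019 is a curly apostrophe, U+00E9 is an accented e.
sample = "Trump\u2019s caf\u00e9 visit"
print(strip_non_ascii(sample))  # Trump s caf visit
```

Note that a curly apostrophe splits a possessive into two tokens, so the word and a bare `s` are counted separately afterwards.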

Build the training vocab

In [0]:
def build_vocab(sentences):     #sentences --> list of lists of tokens
  vocab = dict()
  for sentence in sentences:
    for word in sentence:
      if word in vocab.keys():
        vocab[word] += 1
      else:
        vocab[word] = 1
  return vocab

In [49]:
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:10]})

{'Amateur': 15, 'president': 11227, 'Donald': 21771, 'Trump': 91090, 's': 183353, 'hostility': 153, 'towards': 1389, 'the': 735851, 'Environmental': 435, 'Protection': 609}
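
As a cross-check, the same word counts can be produced with the standard library's `collections.Counter`. This is a minimal sketch rather than the notebook's implementation: `build_vocab_counter` is a hypothetical name and the two toy sentences are made up for illustration.

```python
from collections import Counter
from itertools import chain

def build_vocab_counter(sentences):
    """Count token occurrences; sentences is a list of lists of tokens."""
    # chain.from_iterable flattens the list of token lists into one stream.
    return Counter(chain.from_iterable(sentences))

# Toy input (illustrative only, not taken from the dataset).
toy_sentences = [["the", "president", "said"], ["the", "house"]]
toy_vocab = build_vocab_counter(toy_sentences)
print(toy_vocab.most_common(1))  # [('the', 2)]
```

A `Counter` is a `dict` subclass, so it could be passed to the coverage check in the same way as a plain dictionary.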


In [0]:
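import numpy as np
def load_glove_index():
    EMBEDDING_FILE = "/content/drive/My Drive/GloVe/glove.6B.50d.txt"
    def get_coefs(word, *arr):
        # Each line is "<word> <v1> ... <v50>"; keep the 50 vector components
        return word, np.asarray(arr, dtype='float32')[:50]
    # Read the file as UTF-8 and close it once the index is built
    with open(EMBEDDING_FILE, encoding="utf-8") as f:
        embeddings_index = dict(get_coefs(*o.rstrip("\n").split(" ")) for o in f)
    return embeddings_index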

In [0]:
glove_index = load_glove_index()

Check what percentage of the training vocab is covered by the GloVe vocab

In [0]:
import operator

def check_cover(vocab, glove_index):

  not_in_embeddings = dict()
  in_embeddings = dict()
  text_len_in = 0
  text_len_out = 0
  for word in vocab.keys():
    if word in glove_index.keys():
      in_embeddings[word] = vocab[word]
      text_len_in += vocab[word]
    else:
      not_in_embeddings[word] = vocab[word]
      text_len_out += vocab[word]
  
  print("Training vocabulary is covered at %.2f %%" % ((len(in_embeddings)/len(vocab)) * 100 ))
  print("Training text is covered at %.2f %%" % ((text_len_in/(text_len_in + text_len_out)) * 100) )
  
  not_in_emb_sorted = sorted(not_in_embeddings.items(), key=operator.itemgetter(1))[::-1]
  
  return not_in_emb_sorted



In [53]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 14.50 %
Training text is covered at 74.83 %


GloVe embeddings cover only ~15% of the vocabulary and ~75% of the text, which means that ~25% of the tokens in our dataset have no embedding and would not be utilized. The next step is to check the vocabulary words that are not included in the embeddings to see if we can improve things.
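
To make the difference between the two numbers concrete, here is a small worked example on made-up counts. It is only a sketch: `coverage`, `toy_vocab` and `toy_known` are hypothetical names, and the figures are not from the dataset. Vocabulary coverage counts distinct words, while text coverage weights each word by how often it occurs, so a few frequent covered words can give high text coverage even when most distinct words are missing.

```python
def coverage(vocab, known_words):
    """Return (vocabulary coverage, text coverage) as fractions in [0, 1]."""
    covered = [w for w in vocab if w in known_words]
    # Share of distinct words that have an embedding.
    vocab_cov = len(covered) / len(vocab)
    # Share of token occurrences that have an embedding.
    text_cov = sum(vocab[w] for w in covered) / sum(vocab.values())
    return vocab_cov, text_cov

# Made-up counts: one frequent covered word, three rarer uncovered ones.
toy_vocab = {"the": 90, "Trump": 6, "said,": 3, "(Reuters)": 1}
toy_known = {"the", "trump", "said"}

vocab_cov, text_cov = coverage(toy_vocab, toy_known)
print(f"vocabulary: {vocab_cov:.0%}, text: {text_cov:.0%}")  # vocabulary: 25%, text: 90%
```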

In [54]:
not_in_embeddings[:20]

[('Trump', 91090),
 ('The', 79814),
 ('I', 47920),
 ('U.S.', 37371),
 ('President', 26261),
 ('Donald', 21771),
 ('It', 21602),
 ('He', 19900),
 ('Obama', 19363),
 ('Clinton', 18708),
 ('Republican', 18527),
 ('United', 18372),
 ('House', 17523),
 ('(Reuters)', 17239),
 ('We', 16609),
 ('In', 15988),
 ('A', 13283),
 ('White', 12787),
 ('But', 12560),
 ('Hillary', 12282)]

It seems that many words that start with a capital letter are omitted from the embeddings, but are their lower-case forms omitted too?

In [55]:
'trump' in glove_index

True

In [56]:
'the' in glove_index

True

In [57]:
'president' in glove_index

True

In [58]:
'in' in glove_index

True

The lower-case forms are all included in the embeddings, so let's convert the text to lower case.

In [0]:
def toLowerCase(X):

  for i in range(len(X)):
    # Lower-case every token; joining per document keeps empty texts empty
    X[i] = ' '.join(word.lower() for word in X[i].split())
  
  return X

In [0]:
X_train = toLowerCase(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Now that the text is lower case, check the coverage of embeddings.

In [61]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 24.51 %
Training text is covered at 89.40 %


Vocabulary coverage increased from ~15% to ~25%, and text coverage from ~75% to ~89%. Let's now check which words are still not included in the embeddings.
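
One reason for the gain is that lower-casing merges entries such as 'The' and 'the' into a single vocabulary key. A minimal sketch with made-up counts follows; `lowercase_vocab` and `toy_vocab` are hypothetical names used only for illustration.

```python
from collections import Counter

def lowercase_vocab(vocab):
    """Merge the counts of keys that differ only by case."""
    merged = Counter()
    for word, count in vocab.items():
        merged[word.lower()] += count
    return dict(merged)

# Made-up counts, not taken from the dataset.
toy_vocab = {"The": 3, "the": 5, "Trump": 2}
print(lowercase_vocab(toy_vocab))  # {'the': 8, 'trump': 2}
```

The vocabulary shrinks from three keys to two, and both remaining keys are lower-case forms of the kind that were found in the embeddings above.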

In [62]:
not_in_embeddings[:20]

[('(reuters)', 17239),
 ('trump,', 6425),
 ('said,', 6294),
 ('it.', 4850),
 ('however,', 4484),
 ('trump.', 3511),
 ('it,', 3359),
 ('year,', 3135),
 ('that,', 3089),
 ('states,', 2972),
 ('election.', 2875),
 ('year.', 2544),
 ('election,', 2538),
 ('statement.', 2507),
 ('people,', 2494),
 ('week,', 2409),
 ('now,', 2366),
 ('him.', 2353),
 ('years,', 2347),
 ('them.', 2293)]

It seems that punctuation is still attached to the word tokens, while the embeddings apparently keep punctuation symbols as separate entries. So, I will replace punctuation with a space if it is in the middle of a word token, or split it off as a separate token if it is at the start/end of it.

In [0]:
import string

# Maps every punctuation character to a space (used for punctuation inside a token)
PUNCT_TO_SPACE = str.maketrans(dict.fromkeys(string.punctuation, " "))

def handle_punctuation(X):

  for i in range(len(X)):
    filtered_list = []
    
    for word in X[i].split():
      # Split leading punctuation off as separate tokens
      start = 0
      while start < len(word) and word[start] in string.punctuation:
        filtered_list.append(word[start])
        start += 1
      # Find where the trailing punctuation begins
      end = len(word)
      while end > start and word[end - 1] in string.punctuation:
        end -= 1
      if start < end:
        # Punctuation in the middle becomes a space, which splits the token later
        filtered_list.append(word[start:end].translate(PUNCT_TO_SPACE))
      # Emit trailing punctuation after the word, in its original order
      filtered_list.extend(word[end:])
    
    # Build the result per document, so an empty text stays empty
    X[i] = ' '.join(filtered_list)
  
  return X
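
The rule described above can be checked in isolation on a few tokens taken from the list of uncovered words. This is a per-token sketch, not the notebook's function: `split_punctuation` is a hypothetical helper that returns the pieces of one token in reading order.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(dict.fromkeys(string.punctuation, " "))

def split_punctuation(token):
    """Split one whitespace-free token into word and punctuation pieces."""
    start, end = 0, len(token)
    # Skip over leading punctuation.
    while start < end and token[start] in string.punctuation:
        start += 1
    # Skip over trailing punctuation.
    while end > start and token[end - 1] in string.punctuation:
        end -= 1
    leading = list(token[:start])
    # Inner punctuation becomes a space, which splits the core into sub-tokens.
    core = token[start:end].translate(_PUNCT_TO_SPACE).split()
    trailing = list(token[end:])
    return leading + core + trailing

for token in ["trump,", "(reuters)", "u.s."]:
    print(token, "->", split_punctuation(token))
```

Each punctuation character ends up as its own token, so it can be looked up in the embeddings independently of the word it was attached to.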

In [0]:
X_train = handle_punctuation(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage now.

In [65]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 65.77 %
Training text is covered at 99.34 %


Almost all of the text is covered! Try removing the punctuation completely.

In [0]:
def remove_punctuation(X):
  
  # Maps every punctuation character to a space
  t = str.maketrans(dict.fromkeys(string.punctuation, " "))
  
  for i in range(len(X)):
    filtered_list = [word.translate(t) for word in X[i].split()]
    # Build the result per document, so an empty text stays empty
    X[i] = ' '.join(filtered_list)
  
  return X

In [0]:
# X_train = remove_punctuation(X_train)
# sentences = [row.split() for row in X_train]
# vocab = build_vocab(sentences)

Check coverage without punctuation

In [44]:
#not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 65.76 %
Training text is covered at 99.28 %


Both figures are slightly lower (65.76%, 99.28%), so we will keep the punctuation. Let's check for further improvement.

In [66]:
not_in_embeddings[:20]

[('realdonaldtrump', 3842),
 ('brexit', 1774),
 ('21wire', 1748),
 ('daca', 522),
 ('antifa', 515),
 ('puigdemont', 484),
 ('filessupport', 479),
 ('screengrab', 474),
 ('2017the', 461),
 ('fjs', 435),
 ('scaramucci', 413),
 ('somodevilla', 372),
 ('reince', 371),
 ('youtu', 340),
 ('finicum', 308),
 ('cdata', 294),
 ('tmsnrt', 288),
 ('2016the', 242),
 ('hadn', 238),
 ('2017trump', 233)]

Many preprocessed embeddings have replaced large numbers with #, so let's clean the numbers.

In [0]:
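def clean_numbers(X):
  for i in range(len(X)):
    x = X[i]
    if re.search(r'\d', x):
      x = re.sub(r'[0-9]{4,}', ' ### ', x)  # runs of 4 or more digits
      x = re.sub(r'[0-9]{3}', ' ## ', x)    # remaining runs of 3 digits
      x = re.sub(r'[0-9]{2}', ' # ', x)     # remaining runs of 2 digits; single digits are kept
    X[i] = x
  return X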
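
To see what these substitutions do to individual strings, here is a small sketch that applies the same three patterns to made-up examples. `mask_numbers` is a hypothetical helper name; single digits are left unchanged because none of the patterns matches them.

```python
import re

def mask_numbers(text):
    """Apply the three digit-run substitutions to one string."""
    text = re.sub(r'[0-9]{4,}', ' ### ', text)  # runs of 4 or more digits
    text = re.sub(r'[0-9]{3}', ' ## ', text)    # remaining runs of 3 digits
    text = re.sub(r'[0-9]{2}', ' # ', text)     # remaining runs of 2 digits
    return text

for sample in ["2017the", "350 people", "12 votes", "7 days"]:
    # split() drops the extra spaces introduced by the padded replacements.
    print(sample, "->", mask_numbers(sample).split())
```

Because the replacements are padded with spaces, a fused token such as '2017the' is also split into a number placeholder and the word 'the'.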

In [0]:
X_train = clean_numbers(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage with '#' instead of numbers.

In [69]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 67.48 %
Training text is covered at 99.43 %


There is an improvement of ~2% in vocabulary coverage (65.77% to 67.48%); text coverage rises slightly, from 99.34% to 99.43%.

In [71]:
not_in_embeddings[:20]

[('realdonaldtrump', 3842),
 ('brexit', 1775),
 ('antifa', 522),
 ('daca', 522),
 ('puigdemont', 484),
 ('filessupport', 479),
 ('screengrab', 474),
 ('fjs', 435),
 ('scaramucci', 415),
 ('reince', 376),
 ('somodevilla', 372),
 ('youtu', 340),
 ('finicum', 308),
 ('cdata', 294),
 ('tmsnrt', 288),
 ('hadn', 238),
 ('hillaryclinton', 231),
 ('wfb', 215),
 ('nusra', 204),
 ('hesher', 189)]

So far, vocabulary coverage has increased from ~15% to ~68% (and text coverage from ~75% to ~99%), and there is no obvious further improvement. The training data is ready.