
Import and split data

In [53]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data = pd.read_csv("/content/drive/My Drive/datasets/gossipcop_withPunctRevBalanced.csv", sep=",")
data_labels = data['label'].values
data = data['text'].values

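# n_splits=5 yields five independent shuffles; the loop below overwrites its
# variables on every pass, so only the last split is kept
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)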

for train_index, test_index in sss.split(data, data_labels):
    X_train, X_test = data[train_index], data[test_index]
    Y_train, Y_test = data_labels[train_index], data_labels[test_index]

print("Train shape : ",X_train.shape)
print("Test shape : ",X_test.shape)

Train shape :  (6601,)
Test shape :  (1651,)


Remove non-ascii characters

In [0]:
import re

# Matches runs of characters outside the 7-bit ASCII range
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def remove_non_ascii(X):
  for i in range(len(X)):
    # Replace every non-ASCII run with a space, word by word
    X[i] = ' '.join(NON_ASCII.sub(' ', word) for word in X[i].split())
  return X

In [0]:
X_train = remove_non_ascii(X_train)
X_test = remove_non_ascii(X_test)

Build the training vocab

In [0]:
def build_vocab(sentences):     #sentences --> list of lists of tokens
  vocab = dict()
  for sentence in sentences:
    for word in sentence:
      if word in vocab.keys():
        vocab[word] += 1
      else:
        vocab[word] = 1
  return vocab

In [57]:
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:10]})

{'Trending': 210, 'Update:After': 1, 'we': 8927, 'received': 1064, 'EXCLUSIVE': 59, 'info': 44, 'that': 44580, 'the': 175742, 'two': 5587, 'of': 82930}
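
The same frequency count can also be sketched with `collections.Counter` from the standard library. The function name `build_vocab_counter` and the toy sentences are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter

def build_vocab_counter(sentences):
    """Count token frequencies across a list of token lists."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)  # adds 1 per occurrence of each token
    return vocab

# Toy input, not taken from the dataset
toy_sentences = [["the", "prince", "said"], ["the", "queen", "said", "the"]]
toy_vocab = build_vocab_counter(toy_sentences)
print(toy_vocab.most_common(2))
```

A `Counter` behaves like a dict here, so it could be passed to the coverage check unchanged; missing tokens simply report a count of 0.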


In [0]:
import numpy as np
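def load_glove_index():
    EMBEDDING_FILE = "/content/drive/My Drive/GloVe/glove.6B.50d.txt"
    # Each line holds a token followed by its 50 vector components
    def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')[:50]
    # Open with an explicit encoding and close the file when done
    with open(EMBEDDING_FILE, encoding='utf-8') as f:
        embeddings_index = dict(get_coefs(*o.rstrip("\n").split(" ")) for o in f)
    return embeddings_index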

In [0]:
glove_index = load_glove_index()
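
To make the file format concrete, here is a minimal standard-library sketch of the parsing step. It assumes the usual GloVe text layout (one token per line, followed by space-separated vector components). The helper `parse_glove_lines` and the 3-dimensional vectors are made up for illustration.

```python
import io

def parse_glove_lines(lines, dim):
    """Parse 'token v1 v2 ... v_dim' lines into {token: [floats]}."""
    index = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index

# Stand-in for an embeddings file, using made-up 3-dimensional vectors
fake_file = io.StringIO("the 0.1 0.2 0.3\n, -0.5 0.0 0.25\n\n")
toy_index = parse_glove_lines(fake_file, dim=3)
print(sorted(toy_index))
```

Note that punctuation such as `,` can be a token in its own right, which matters for the punctuation handling later on.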

Check what percentage of the training vocab is covered by the GloVe vocab

In [0]:
import operator

def check_cover(vocab, glove_index):

  not_in_embeddings = dict()
  in_embeddings = dict()
  text_len_in = 0
  text_len_out = 0
  for word in vocab.keys():
    if word in glove_index.keys():
      in_embeddings[word] = vocab[word]
      text_len_in += vocab[word]
    else:
      not_in_embeddings[word] = vocab[word]
      text_len_out += vocab[word]
  
  print("Training vocabulary is covered at %.2f %%" % ((len(in_embeddings)/len(vocab)) * 100 ))
  print("Training text is covered at %.2f %%" % ((text_len_in/(text_len_in + text_len_out)) * 100) )
  
  not_in_emb_sorted = sorted(not_in_embeddings.items(), key=operator.itemgetter(1))[::-1]
  
  return not_in_emb_sorted

In [61]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 13.73 %
Training text is covered at 68.27 %


GloVe embeddings cover only ~14% of the vocabulary (distinct words) and ~68% of the text (word occurrences), which means that ~32% of the tokens in our dataset have no embedding and are not utilized. The next step is to check the vocabulary words that are not included in the embeddings to see if we can improve things.
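
The two percentages measure different things, which a small worked example may make clearer. The counts below are invented for illustration and the helper `coverage` is hypothetical; it only mirrors the arithmetic of the coverage check.

```python
def coverage(vocab, known_words):
    """Return (vocabulary coverage, text coverage) as fractions."""
    covered_types = [w for w in vocab if w in known_words]
    covered_tokens = sum(vocab[w] for w in covered_types)
    total_tokens = sum(vocab.values())
    return len(covered_types) / len(vocab), covered_tokens / total_tokens

# Made-up counts: one frequent known word, three rare unknown ones
toy_vocab = {"the": 97, "Kardashian": 1, "said,": 1, "it?s": 1}
toy_known = {"the"}

vocab_cov, text_cov = coverage(toy_vocab, toy_known)
print(f"vocabulary: {vocab_cov:.0%}, text: {text_cov:.0%}")
```

Only 1 of 4 distinct words is known (25%), yet 97 of 100 occurrences are covered (97%). This is why a low vocabulary coverage can coexist with a much higher text coverage: frequent words dominate the token count.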

In [62]:
not_in_embeddings[:20]

[('I', 31531),
 ('The', 28208),
 ('In', 8002),
 ('She', 7484),
 ('It', 6101),
 ('He', 5474),
 ('And', 4745),
 ('A', 4643),
 ('But', 4415),
 ('This', 4197),
 ('New', 3777),
 ('We', 3760),
 ('They', 2962),
 ('You', 2905),
 ('Prince', 2544),
 ('Kardashian', 2413),
 ('Kim', 2402),
 ('"I', 2386),
 ('On', 2317),
 ('Brad', 2270)]

It seems that many words that start with a capital letter are omitted from the embeddings, but are their lower-case forms omitted too?

In [63]:
'i' in glove_index

True

In [64]:
'and' in glove_index

True

In [65]:
'prince' in glove_index

True

In [66]:
'kardashian' in glove_index

True

As suspected, the embeddings seem to have been built from lowercased text. So, let's transform the data into lower case.
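
An alternative to lowercasing the whole corpus is a lookup that falls back to the lowercased form only when the exact form is missing. This is not what the notebook does; the helper `lookup` and the toy index are assumptions, shown only to illustrate the idea.

```python
def lookup(word, index):
    """Return the vector for word, trying the lowercased form as a fallback."""
    if word in index:
        return index[word]
    return index.get(word.lower())  # None when neither form is present

# Made-up 2-dimensional vectors, keyed by lowercased tokens only
toy_index = {"prince": [0.1, 0.2], "i": [0.3, 0.4]}

print(lookup("Prince", toy_index))
print(lookup("Kim", toy_index))
```

Lowercasing the text up front, as done here, is simpler and keeps the vocabulary smaller; the fallback lookup would only matter with cased embeddings.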

In [0]:
def toLowerCase(X):

  for i in range(len(X)):
    # Lowercase every token; an empty row now stays empty instead of
    # reusing the result of the previous row
    X[i] = ' '.join(word.lower() for word in X[i].split())

  return X

In [0]:
X_train = toLowerCase(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Now that the text is lower case, check the coverage of embeddings.

In [69]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 21.91 %
Training text is covered at 83.42 %


Vocabulary coverage increased from ~14% to ~22%, and text coverage from ~68% to ~83%. Let's now check the words that are still not included in the embeddings.

In [70]:
not_in_embeddings[:20]

[("it's", 3317),
 ('"i', 2390),
 ('it.', 1968),
 ('it?s', 1923),
 ("i'm", 1843),
 ("don't", 1791),
 ('however,', 1551),
 ('like,', 1491),
 ("she's", 1438),
 ('said,', 1352),
 ('"the', 1319),
 ('her.', 1228),
 ("didn't", 1225),
 ('time,', 1210),
 ("that's", 1206),
 ('it,', 1189),
 ('year,', 1143),
 ('?i', 1121),
 ('time.', 1062),
 ("he's", 1053)]

In [71]:
'it' in glove_index

True

In [72]:
'thats' in glove_index

True

In [73]:
'dont' in glove_index

True

In [74]:
',' in glove_index

True

It seems that punctuation symbols were not entirely removed when the embeddings were preprocessed (',' has its own vector). So, I will replace punctuation with a space if it is in the middle of a word token, or split it off as a separate token if it is at the start or end of one.
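
The rule can be illustrated on single tokens with a minimal sketch. The helper `split_token` is hypothetical and uses `str.lstrip`/`str.rstrip` rather than the character-by-character loop of the notebook's function.

```python
import string

PUNCT = string.punctuation
# Translation table that turns every punctuation character into a space
INNER_TO_SPACE = str.maketrans(dict.fromkeys(PUNCT, " "))

def split_token(token):
    """Split one whitespace-free token into punctuation and word pieces."""
    rest = token.lstrip(PUNCT)
    leading = token[:len(token) - len(rest)]
    core = rest.rstrip(PUNCT)
    trailing = rest[len(core):]
    # Punctuation inside the core becomes a space, so "it's" yields "it" and "s"
    inner = core.translate(INNER_TO_SPACE).split()
    return list(leading) + inner + list(trailing)

for token in ['"i', "it's", "however,", "..."]:
    print(token, "->", split_token(token))
```

Tokens such as `"i` and `however,` from the uncovered list split into a word plus a punctuation token, both of which can be looked up separately.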

In [0]:
import string

# Maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(dict.fromkeys(string.punctuation, " "))

def handle_punctuation(X):

  for i in range(len(X)):
    filtered_list = []

    for word in X[i].split():
      leading, trailing = [], []

      # Peel punctuation off the start of the token, one character at a time
      while word and word[0] in string.punctuation:
        leading.append(word[0])
        word = word[1:]

      # Peel punctuation off the end; characters are collected right to left
      while word and word[-1] in string.punctuation:
        trailing.append(word[-1])
        word = word[:-1]

      filtered_list.extend(leading)
      if word:
        # Punctuation left inside the token becomes a space ("don't" -> "don t")
        filtered_list.append(word.translate(PUNCT_TO_SPACE))
      # Emit trailing punctuation after the word, in its original order
      filtered_list.extend(reversed(trailing))

    X[i] = ' '.join(filtered_list)

  return X

In [0]:
X_train = handle_punctuation(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage now.

In [76]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 54.32 %
Training text is covered at 98.41 %


Almost all of the text is covered! Next, try removing the punctuation completely.

In [0]:
def remove_punctuation(X):

  # Maps every punctuation character to a space
  t = str.maketrans(dict.fromkeys(string.punctuation, " "))

  for i in range(len(X)):
    X[i] = ' '.join(word.translate(t) for word in X[i].split())

  return X

In [0]:
# X_train = remove_punctuation(X_train)
# sentences = [row.split() for row in X_train]
# vocab = build_vocab(sentences)

Check coverage without punctuation

In [51]:
#not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 54.31 %
Training text is covered at 98.17 %


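Coverage is slightly lower without punctuation (54.31% of the vocabulary and 98.17% of the text, versus 54.32% and 98.41% with it), so we will keep the punctuation; the cells above stay commented out. Let's check for further improvement.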

In [82]:
not_in_embeddings[:20]

[('aposs', 638),
 ('khlo', 395),
 ('disponible', 327),
 ('disick', 321),
 ('stormi', 305),
 ('hollywoodlife', 269),
 ('wireimage', 263),
 ('edici', 261),
 ('verlo', 260),
 ('gustar', 260),
 ('personalizado', 260),
 ('hollywoodlifers', 244),
 ('shookus', 220),
 ('hadn', 206),
 ('apost', 197),
 ('viewcomments', 177),
 ('kimkardashian', 166),
 ('selfie', 165),
 ('conservatee', 159),
 ('updated5', 154)]

Many preprocessed embeddings have replaced large numbers with #, so let's clean the numbers.
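
The masking scheme used in the next cell can be sketched as a single pass with a replacement callback. The mapping (4 or more digits to `###`, 3 digits to `##`, 2 digits to `#`, single digits untouched) is read off the notebook's regular expressions; the names `mask_numbers` and `_mask` are assumptions for illustration.

```python
import re

def _mask(match):
    """Choose the mask from the length of the matched digit run."""
    n = len(match.group())
    if n >= 4:
        return " ### "
    if n == 3:
        return " ## "
    if n == 2:
        return " # "
    return match.group()  # single digits are left as they are

def mask_numbers(text):
    return re.sub(r"[0-9]+", _mask, text)

sample = "in 2018 she paid 150 for 25 tickets and 7 hats"
print(mask_numbers(sample).split())
```

Matching whole digit runs once avoids depending on the order of three separate substitutions, while giving the same tokens after splitting on whitespace.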

In [0]:
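def clean_numbers(X):
  for i in range(len(X)):
    x = X[i]
    if bool(re.search(r'\d', x)):
      # Order matters: the longest digit runs must be masked first
      x = re.sub(r'[0-9]{4,}', ' ### ', x)  # runs of 4 or more digits
      x = re.sub(r'[0-9]{3}', ' ## ', x)    # remaining runs of 3 digits
      x = re.sub(r'[0-9]{2}', ' # ', x)     # remaining runs of 2 digits; single digits are kept
    X[i] = x
  return X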

In [0]:
X_train = clean_numbers(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage with '#' instead of numbers.

In [85]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 54.53 %
Training text is covered at 98.47 %


Vocabulary coverage improves by ~0.2 percentage points (54.32% to 54.53%) and text coverage by ~0.06 points (98.41% to 98.47%).

In [86]:
not_in_embeddings[:20]

[('aposs', 638),
 ('khlo', 395),
 ('disponible', 327),
 ('disick', 321),
 ('stormi', 305),
 ('hollywoodlife', 269),
 ('wireimage', 263),
 ('edici', 261),
 ('verlo', 260),
 ('gustar', 260),
 ('personalizado', 260),
 ('hollywoodlifers', 244),
 ('shookus', 220),
 ('hadn', 206),
 ('apost', 197),
 ('viewcomments', 177),
 ('kimkardashian', 166),
 ('selfie', 165),
 ('conservatee', 159),
 ('updated5', 154)]

So far, vocabulary coverage has increased from ~14% to ~54% and text coverage from ~68% to ~98%, and there is no obvious further improvement. The training data is ready.
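
Since the lowercasing, punctuation and number steps in this section are applied to `X_train` only, one way to reuse them on another split is to chain them in a fixed order. The sketch below assumes each step takes and returns a collection of texts, as the notebook's functions do; `run_pipeline` and the stand-in steps are hypothetical.

```python
def run_pipeline(X, steps):
    """Apply each X -> X preprocessing step in order."""
    for step in steps:
        X = step(X)
    return X

# Stand-in steps; in the notebook these would be the functions
# defined above, e.g. toLowerCase, handle_punctuation, clean_numbers
def lower_all(X):
    return [text.lower() for text in X]

def strip_all(X):
    return [text.strip() for text in X]

toy_texts = ["  The Prince ", "KIM said"]
print(run_pipeline(toy_texts, [lower_all, strip_all]))
```

Keeping the steps in one list makes it harder to apply them in a different order, or to skip one, when preparing the test data.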