
Import and split data

In [41]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data = pd.read_csv("drive/My Drive/datasets/politifact.csv", sep=",")
data_labels = data['label'].values
data = data['text'].values

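# n_splits=5 generates five splits, but each loop iteration below overwrites
# the previous one, so only the last split is kept.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)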

for train_index, test_index in sss.split(data, data_labels):
    X_train, X_test = data[train_index], data[test_index]
    Y_train, Y_test = data_labels[train_index], data_labels[test_index]

print("Train shape : ",X_train.shape)
print("Test shape : ",X_test.shape)



Train shape :  (559,)
Test shape :  (140,)


Remove non-ascii characters

In [0]:
import re

# Matches runs of characters outside the ASCII range (U+0000 to U+007F).
NON_ASCII_PATTERN = re.compile(r'[^\u0000-\u007F]+')

def remove_non_ascii(X):
  # Replace non-ASCII runs in every document with a space, modifying X in place.
  for i in range(len(X)):
    filtered_list = []
    for word in X[i].split():
      filtered_list.append(NON_ASCII_PATTERN.sub(" ", word))
    # Join once per document, after all of its words have been processed.
    X[i] = ' '.join(filtered_list)
  return X

In [0]:
X_train = remove_non_ascii(X_train)
X_test = remove_non_ascii(X_test)
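
To see what the pattern does, here is a minimal sketch on a made-up string. The helper name `strip_non_ascii` and the sample text are illustrative only and are not part of the notebook.

```python
import re

# Same character class as in remove_non_ascii: anything outside U+0000..U+007F.
NON_ASCII = re.compile(r'[^\u0000-\u007F]+')

def strip_non_ascii(text):
    # Hypothetical single-string version of the notebook's function.
    return " ".join(NON_ASCII.sub(" ", word) for word in text.split())

sample = "caf\u00e9 na\u00efve plain"   # "café naïve plain"
cleaned = strip_non_ascii(sample)
print(cleaned.split())
```

Because each non-ASCII run becomes a space, a word such as "naïve" ends up as two tokens once the text is split again.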

Build the training vocab

In [0]:
def build_vocab(sentences):     #sentences --> list of lists of tokens
  vocab = dict()
  for sentence in sentences:
    for word in sentence:
      if word in vocab.keys():
        vocab[word] += 1
      else:
        vocab[word] = 1
  return vocab

In [45]:
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:10]})

{'George': 177, 'W.': 41, 'Bush': 285, 'has': 2956, 'lobbed': 3, 'thinly-veiled': 1, 'critiques': 2, 'of': 20547, 'President': 1164, 'Donald': 257}
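
The same counts can also be obtained with `collections.Counter` from the standard library. This is only a sketch of an equivalent approach on toy data; the name `build_vocab_counter` is made up for the example.

```python
from collections import Counter

def build_vocab_counter(sentences):
    # sentences: list of lists of tokens, as in build_vocab
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)   # adds 1 for every token in the sentence
    return vocab

toy_sentences = [["the", "cat", "sat"], ["the", "dog"]]
toy_vocab = build_vocab_counter(toy_sentences)
print(toy_vocab.most_common(2))
```

A `Counter` is a `dict` subclass, so it could be passed to code that expects the plain dictionary returned by `build_vocab`.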


In [0]:
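import numpy as np

def load_glove_index():
    EMBEDDING_FILE = "/content/drive/My Drive/GloVe/glove.6B.50d.txt"

    # Each line holds a token followed by its space-separated vector components.
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')[:50]

    # Open with an explicit encoding and close the file once the index is built.
    with open(EMBEDDING_FILE, encoding='utf-8') as f:
        embeddings_index = dict(get_coefs(*line.rstrip().split(" ")) for line in f)
    return embeddings_index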

In [0]:
glove_index = load_glove_index()
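
For reference, each line of a GloVe text file is a token followed by space-separated floats. The sketch below parses two made-up lines with that layout using only the standard library; the values and the helper name `parse_glove` are illustrative, not taken from the real file.

```python
import io

# Two invented lines in the GloVe text layout (3 dimensions instead of 50).
fake_glove = io.StringIO("the 0.1 -0.2 0.3\n, 0.0 0.5 -0.5\n")

def parse_glove(handle):
    index = {}
    for line in handle:
        # First field is the token, the rest are the vector components.
        word, *values = line.rstrip("\n").split(" ")
        index[word] = [float(v) for v in values]
    return index

toy_index = parse_glove(fake_glove)
print(toy_index["the"])
```

Note that punctuation such as `,` appears as a token of its own, which matters for the coverage checks that follow.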

Check what percentage of the training vocab is covered by the GloVe vocab

In [0]:
import operator

def check_cover(vocab, glove_index):

  not_in_embeddings = dict()
  in_embeddings = dict()
  text_len_in = 0
  text_len_out = 0
  for word in vocab.keys():
    if word in glove_index.keys():
      in_embeddings[word] = vocab[word]
      text_len_in += vocab[word]
    else:
      not_in_embeddings[word] = vocab[word]
      text_len_out += vocab[word]
  
  print("Training vocabulary is covered at %.2f %%" % ((len(in_embeddings)/len(vocab)) * 100 ))
  print("Training text is covered at %.2f %%" % ((text_len_in/(text_len_in + text_len_out)) * 100) )
  
  not_in_emb_sorted = sorted(not_in_embeddings.items(), key=operator.itemgetter(1))[::-1]
  
  return not_in_emb_sorted



In [49]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 31.06 %
Training text is covered at 72.35 %


GloVe embeddings cover only ~31% of the vocabulary and ~72% of the text, which means that ~28% of the tokens in our dataset have no embedding and are not utilized. The next step is to check the vocabulary words that are not included in the embeddings to see if we can improve things.
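
The two figures measure different things: vocabulary coverage counts distinct words, while text coverage weights each word by how often it occurs. A small worked example with made-up counts (the `coverage` helper is hypothetical and mirrors the arithmetic in `check_cover`):

```python
# Toy vocabulary with occurrence counts, and a toy set of words that have embeddings.
toy_vocab = {"the": 6, "The": 3, "cat": 1}
toy_known = {"the", "cat"}

def coverage(vocab, known):
    covered_types = sum(1 for word in vocab if word in known)
    covered_tokens = sum(count for word, count in vocab.items() if word in known)
    return covered_types / len(vocab), covered_tokens / sum(vocab.values())

vocab_cov, text_cov = coverage(toy_vocab, toy_known)
print("vocabulary: %.2f %%, text: %.2f %%" % (vocab_cov * 100, text_cov * 100))
```

Here 2 of 3 distinct words are covered, but 7 of 10 occurrences are, so a single frequent uncovered word such as "The" can account for most of the missing text.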

In [50]:
not_in_embeddings[:20]

[('I', 11151),
 ('And', 4814),
 ('The', 3137),
 ('We', 2205),
 ('But', 2096),
 ('It', 1288),
 ("it's", 1210),
 ('So', 1209),
 ('You', 1172),
 ('President', 1164),
 ("don't", 1152),
 ('American', 1070),
 ('United', 1010),
 ('Senator', 1006),
 ("that's", 1006),
 ('THE', 999),
 ('Well,', 963),
 ('Obama', 937),
 ('know,', 921),
 ('That', 914)]

It seems that many words that start with a capital letter are omitted from the embeddings, but are their lower-case forms omitted too?

In [51]:
'i' in glove_index

True

In [52]:
'and' in glove_index

True

In [53]:
'the' in glove_index

True

In [54]:
'president' in glove_index

True

As suspected, the lower-case forms are present, so the preprocessing of the embeddings probably included lower-casing the data. So, let's transform the data into lower case.

In [0]:
def toLowerCase(X):
  # Lower-case every document, modifying X in place.
  for i in range(len(X)):
    filtered_list = []
    for word in X[i].split():
      filtered_list.append(word.lower())
    # Join once per document, after all of its words have been processed.
    X[i] = ' '.join(filtered_list)

  return X

In [0]:
X_train = toLowerCase(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)
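
Since `str.lower()` works on a whole string, the word-by-word loop can also be written as a one-liner. The sketch below is an alternative under the assumption that a new list is acceptable instead of in-place modification; `to_lower_case` and the sample documents are made up.

```python
def to_lower_case(texts):
    # Lower-case each document and collapse runs of whitespace to single spaces,
    # which is what splitting and re-joining does in the loop version.
    return [" ".join(text.lower().split()) for text in texts]

docs = ["The  President SAID", "Well, I don't KNOW"]
lowered = to_lower_case(docs)
print(lowered)
```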

Now that the text is lower case, check the coverage of embeddings.

In [57]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 44.47 %
Training text is covered at 84.92 %


Vocabulary coverage increased from 31% to 44%, and text coverage from 72% to 85%. Let's now check the words that are still not included in the embeddings.

In [58]:
not_in_embeddings[:20]

[("it's", 1838),
 ("that's", 1635),
 ("don't", 1188),
 ("we're", 1178),
 ('well,', 1117),
 ('know,', 921),
 ("i'm", 848),
 ('it.', 827),
 ('now,', 806),
 ("we've", 748),
 ('(applause.)', 586),
 ('obama:', 578),
 ('that.', 562),
 ("they're", 552),
 ("you're", 532),
 ('that,', 494),
 ('tapper:', 487),
 ("i've", 485),
 ("there's", 480),
 ('said,', 470)]

In [59]:
'obama' in glove_index

True

In [60]:
'know' in glove_index

True

In [61]:
'thats' in glove_index

True

In [62]:
'dont' in glove_index

True

In [63]:
'theres' in glove_index

True

In [64]:
'\'' in glove_index

True

In [65]:
':' in glove_index

True

In [66]:
',' in glove_index

True

In [67]:
'(' in glove_index

True

In [68]:
').' in glove_index

False

It seems that punctuation symbols were not completely removed during the preprocessing of the embeddings. So, I will replace punctuation with a space if it is in the middle of a word token, or separate it into its own token if it is at the start or end of one.

In [0]:
import string

def handle_punctuation(X):
  # Maps every punctuation character to a space (used inside a token).
  table = str.maketrans(dict.fromkeys(string.punctuation, " "))

  for i in range(len(X)):
    filtered_list = []

    for word in X[i].split():
      leading, trailing = [], []

      # Split off punctuation at the start of the token, one character at a time.
      while word and word[0] in string.punctuation:
        leading.append(word[0])
        word = word[1:]

      # Split off punctuation at the end of the token, keeping its original order.
      while word and word[-1] in string.punctuation:
        trailing.insert(0, word[-1])
        word = word[:-1]

      filtered_list.extend(leading)
      if word:
        # Punctuation left in the middle of the token becomes a space.
        filtered_list.append(word.translate(table))
      # Trailing punctuation goes after the word, not before it.
      filtered_list.extend(trailing)

    X[i] = ' '.join(filtered_list)

  return X

In [0]:
X_train = handle_punctuation(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage now.

In [31]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 94.19 %
Training text is covered at 99.80 %
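
Separating punctuation from words can also be approximated with a single regular expression. This is a different technique from `handle_punctuation`, shown only as a sketch: it keeps punctuation inside a token as a separate token instead of replacing it with a space, and the pattern assumes the text is already ASCII.

```python
import re

# Either a run of letters/digits, or one character that is neither alphanumeric nor whitespace.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def split_punctuation(text):
    # Hypothetical helper: returns the tokens of one document.
    return TOKEN_PATTERN.findall(text)

sample = "well, it's (applause.) obama:"
tokens = split_punctuation(sample)
print(tokens)
```

With this pattern "it's" becomes the three tokens `it`, `'` and `s`, whereas `handle_punctuation` yields `it` and `s`.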


Almost all of the text is covered! Next, try removing the punctuation completely.

In [0]:
def remove_punctuation(X):
  # Maps every punctuation character to a space.
  t = str.maketrans(dict.fromkeys(string.punctuation, " "))

  for i in range(len(X)):
    filtered_list = []
    for word in X[i].split():
      filtered_list.append(word.translate(t))
    # Join once per document, after all of its words have been processed.
    X[i] = ' '.join(filtered_list)

  return X

In [0]:
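# Left commented out: remove_punctuation would overwrite X_train in place, and the
# punctuation is kept in the end. The coverage figures reported for this step come
# from a run with these lines enabled.
# X_train = remove_punctuation(X_train)
# sentences = [row.split() for row in X_train]
# vocab = build_vocab(sentences)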

Check coverage without punctuation

In [71]:
#not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 94.18 %
Training text is covered at 99.77 %


Coverage is slightly lower (94.18% of the vocabulary, 99.77% of the text), so we will keep the punctuation. Let's check for further improvement.

In [35]:
not_in_embeddings[:20]

[('karibjanian', 34),
 ('hadn', 28),
 ('guccifer', 22),
 ('mikerin', 21),
 ('thinkprogress', 20),
 ('chyron', 20),
 ('abcnews', 19),
 ('strzok', 18),
 ('nucera', 17),
 ('antifa', 17),
 ('booo', 16),
 ('isil', 13),
 ('shutterstock', 11),
 ('sciutto', 10),
 ('6079', 10),
 ('delawareans', 10),
 ('realdonaldtrump', 10),
 ('daca', 9),
 ('dcleaks', 9),
 ('hillaryclinton', 8)]

Many preprocessed embeddings have replaced large numbers with #, so let's clean the numbers.

In [0]:
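def clean_numbers(X):
  # Replace digit runs with '#' placeholders, modifying X in place.
  # Longer runs are replaced first; single digits are left unchanged.
  for i in range(len(X)):
    x = X[i]
    if re.search(r'\d', x):
      x = re.sub(r'[0-9]{4,}', ' ### ', x)
      x = re.sub(r'[0-9]{3}', ' ## ', x)
      x = re.sub(r'[0-9]{2}', ' # ', x)
    X[i] = x
  return X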

In [0]:
X_train = clean_numbers(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage with '#' instead of numbers.

In [38]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 94.68 %
Training text is covered at 99.82 %


Vocabulary coverage improves by ~0.5 percentage points (from 94.19% to 94.68%).
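
To make the effect of the three substitutions in `clean_numbers` concrete, here they are applied to one made-up sentence. The helper name `mask_numbers` is hypothetical; the patterns and replacements are the ones used above.

```python
import re

def mask_numbers(text):
    # Same three substitutions as clean_numbers, applied to a single string.
    text = re.sub(r'[0-9]{4,}', ' ### ', text)  # runs of four or more digits
    text = re.sub(r'[0-9]{3}', ' ## ', text)    # remaining three-digit runs
    text = re.sub(r'[0-9]{2}', ' # ', text)     # remaining two-digit runs
    return text

sample = "in 2016 about 350 of 12 and 7"
masked_tokens = mask_numbers(sample).split()
print(masked_tokens)
```

The order matters: if the two-digit pattern ran first, "2016" would be split into two `#` tokens instead of one `###`.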

In [73]:
not_in_embeddings[:20]

[('karibjanian', 34),
 ('hadn', 28),
 ('guccifer', 22),
 ('mikerin', 21),
 ('thinkprogress', 20),
 ('chyron', 20),
 ('abcnews', 19),
 ('strzok', 18),
 ('nucera', 17),
 ('antifa', 17),
 ('booo', 16),
 ('isil', 13),
 ('shutterstock', 11),
 ('sciutto', 10),
 ('6079', 10),
 ('delawareans', 10),
 ('realdonaldtrump', 10),
 ('daca', 9),
 ('dcleaks', 9),
 ('hillaryclinton', 8)]

So far, vocabulary coverage has increased from ~31% to ~94.7%, and there is no obvious further improvement. The training data is ready.
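
The cells above apply only `remove_non_ascii` to `X_test`. If the test split is meant to go through the same transformations as the training split (an assumption, since the notebook does not show this step), one way to keep the two in sync is to list the steps once and apply them in order. The sketch below uses tiny stand-in steps so that it runs on its own; `apply_steps`, `lower_all` and `squeeze_spaces` are hypothetical names.

```python
def apply_steps(texts, steps):
    # Run each preprocessing step in order; every step takes and returns a sequence of strings.
    for step in steps:
        texts = step(texts)
    return texts

# Stand-ins for the notebook's functions, kept minimal for the example.
def lower_all(texts):
    return [t.lower() for t in texts]

def squeeze_spaces(texts):
    return [" ".join(t.split()) for t in texts]

steps = [lower_all, squeeze_spaces]
toy_train = apply_steps(["The  President SAID"], steps)
toy_test = apply_steps(["Well,   I KNOW"], steps)
print(toy_train, toy_test)
```

In the notebook, the list could hold `toLowerCase`, `handle_punctuation` and `clean_numbers`, since each of them takes the array of documents and returns it.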