<a href="https://colab.research.google.com/github/stavIatrop/Fake-News-Detection/blob/master/text_preprocessing_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import and split data

In [1]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data = pd.read_csv("drive/My Drive/datasets/politifact.csv", sep=",")
data_labels = data['label'].values
data = data['text'].values

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# sss yields 5 splits; each iteration overwrites the variables, so only the last split is kept
for train_index, test_index in sss.split(data, data_labels):
    X_train, X_test = data[train_index], data[test_index]
    Y_train, Y_test = data_labels[train_index], data_labels[test_index]

print("Train shape : ",X_train.shape)
print("Test shape : ",X_test.shape)



Train shape :  (559,)
Test shape :  (140,)


Remove non-ASCII characters

In [0]:
import re

# Matches runs of characters outside the ASCII range (compiled once, not per word)
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def remove_non_ascii(X):
  # Strips non-ASCII characters from every document in X (X is modified in place)
  for i in range(len(X)):
    filtered_list = [NON_ASCII.sub('', word) for word in X[i].split()]
    # Join once per document; this also handles documents with no tokens
    X[i] = ' '.join(filtered_list)
  return X

In [0]:
X_train = remove_non_ascii(X_train)
X_test = remove_non_ascii(X_test)
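
For comparison, the same filtering can be sketched without a regular expression by round-tripping through an ASCII encoding. This is only an illustration: `strip_non_ascii` and `sample` are hypothetical names, and unlike `remove_non_ascii` this version works on a single string and leaves whitespace untouched.

```python
def strip_non_ascii(text):
    # Encode to ASCII, dropping anything that cannot be represented, then decode back
    return text.encode("ascii", errors="ignore").decode("ascii")

# Accented letters and curly quotes are the typical non-ASCII characters in scraped text
sample = "caf\u00e9 \u201cquoted\u201d na\u00efve"
print(strip_non_ascii(sample))
```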

Build the training vocab

In [0]:
def build_vocab(sentences):     #sentences --> list of lists of tokens
  vocab = dict()
  for sentence in sentences:
    for word in sentence:
      if word in vocab.keys():
        vocab[word] += 1
      else:
        vocab[word] = 1
  return vocab

In [5]:
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:10]})

{'George': 177, 'W.': 41, 'Bush': 284, 'has': 2956, 'lobbed': 3, 'thinly-veiled': 1, 'critiques': 2, 'of': 20543, 'President': 1135, 'Donald': 255}
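
The counting done by `build_vocab` can also be sketched with `collections.Counter` from the standard library. The names `build_vocab_counter` and `toy` below are hypothetical and the sentences are made up; a `Counter` behaves like the dict above and adds helpers such as `most_common`.

```python
from collections import Counter

def build_vocab_counter(sentences):
    # sentences: iterable of token lists; returns token -> frequency
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)
    return vocab

toy = [["the", "cat", "sat"], ["the", "dog"]]
vocab_toy = build_vocab_counter(toy)
print(vocab_toy.most_common(1))
```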


In [0]:
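import numpy as np

def load_glove_index():
    EMBEDDING_FILE = "/content/drive/My Drive/GloVe/glove.6B.50d.txt"
    # Each line holds a token followed by its 50 vector components
    def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')[:50]
    # Read as UTF-8 and close the file once the index is built
    with open(EMBEDDING_FILE, encoding='utf-8') as f:
        embeddings_index = dict(get_coefs(*o.rstrip().split(" ")) for o in f)
    return embeddings_index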

In [0]:
glove_index = load_glove_index()
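
The loader relies on the plain-text layout of the GloVe files: one token per line, followed by its vector components separated by single spaces. A minimal sketch of that parsing step on an in-memory stand-in for the file is shown here. The helper `parse_glove` and the two 3-dimensional vectors are made up for illustration, and plain Python lists are used instead of NumPy arrays to keep the example dependency-free.

```python
import io

def parse_glove(lines, dim):
    # Each line: a token followed by `dim` space-separated floats
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index

# Two made-up 3-dimensional vectors in GloVe's text layout (not real embeddings)
fake_file = io.StringIO("the 0.1 0.2 0.3\n, -0.5 0.0 0.25\n")
toy_index = parse_glove(fake_file, dim=3)
print(sorted(toy_index))
```

Note that punctuation such as `,` can be a token of its own in this layout.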

Check what percentage of the training vocabulary is covered by the GloVe vocab

In [0]:
import operator

def check_cover(vocab, glove_index):

  not_in_embeddings = dict()
  in_embeddings = dict()
  text_len_in = 0
  text_len_out = 0
  for word in vocab.keys():
    if word in glove_index.keys():
      in_embeddings[word] = vocab[word]
      text_len_in += vocab[word]
    else:
      not_in_embeddings[word] = vocab[word]
      text_len_out += vocab[word]
  
  print("Training vocabulary is covered at %.2f %%" % ((len(in_embeddings)/len(vocab)) * 100 ))
  print("Training text is covered at %.2f %%" % ((text_len_in/(text_len_in + text_len_out)) * 100) )
  
  not_in_emb_sorted = sorted(not_in_embeddings.items(), key=operator.itemgetter(1))[::-1]
  
  return not_in_emb_sorted



In [9]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 30.65 %
Training text is covered at 72.05 %


GloVe covers only ~30% of the distinct words in the training vocabulary. Those words account for ~72% of the tokens in the text, so ~28% of the training tokens have no embedding and would go unused. The next step is to inspect the words missing from the embeddings to see whether coverage can be improved.
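
The two percentages measure different things: vocabulary coverage counts distinct words, while text coverage weights each word by how often it occurs. A toy sketch of the difference follows; the counts and the set `toy_known` are invented for illustration and are not taken from the dataset.

```python
def coverage(vocab, known):
    # vocab: token -> count in the text; known: set of tokens that have an embedding
    covered = [w for w in vocab if w in known]
    vocab_cov = len(covered) / len(vocab)
    text_cov = sum(vocab[w] for w in covered) / sum(vocab.values())
    return vocab_cov, text_cov

# Two frequent words are known, three rare ones are not
toy_vocab = {"the": 90, "of": 60, "Bush": 5, "thinly-veiled": 1, "it's": 4}
toy_known = {"the", "of"}
vocab_cov, text_cov = coverage(toy_vocab, toy_known)
print("vocabulary: %.2f, text: %.2f" % (vocab_cov, text_cov))
```

Because a few frequent words dominate the text, text coverage is typically much higher than vocabulary coverage.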

In [10]:
not_in_embeddings[:20]

[('I', 10424),
 ('And', 4814),
 ('The', 3134),
 ('But', 2096),
 ('We', 1938),
 ("it's", 1210),
 ('So', 1208),
 ("don't", 1152),
 ('President', 1135),
 ('You', 1104),
 ('American', 1067),
 ('United', 1010),
 ('Senator', 1006),
 ("that's", 1006),
 ('THE', 999),
 ('It', 971),
 ('Well,', 963),
 ('know,', 921),
 ("we're", 898),
 ('This', 895)]

It seems that many words that start with a capital letter are omitted from the embeddings, but are their lower-case forms omitted too?

In [11]:
'i' in glove_index

True

In [12]:
'and' in glove_index

True

In [13]:
'the' in glove_index

True

In [14]:
'president' in glove_index

True

As suspected, the embeddings appear to have been built from lower-cased text. So, let's convert the data to lower case.
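
The effect of lower-casing on the vocabulary can be sketched directly on the counts: forms that differ only by case collapse into one entry whose count is the sum. The helper `lowercase_vocab` and the toy counts are hypothetical.

```python
from collections import Counter

def lowercase_vocab(vocab):
    # Merge the counts of tokens that differ only by case
    merged = Counter()
    for word, count in vocab.items():
        merged[word.lower()] += count
    return merged

toy_vocab = {"The": 3, "the": 7, "THE": 1, "President": 2}
merged = lowercase_vocab(toy_vocab)
print(dict(merged))
```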

In [0]:
def toLowerCase(X):
  # Lower-cases every document in X (X is modified in place)
  for i in range(len(X)):
    # Join once per document; this also handles documents with no tokens
    X[i] = ' '.join(word.lower() for word in X[i].split())
  return X

In [0]:
X_train = toLowerCase(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Now that the text is lower case, check the coverage of embeddings.

In [17]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 44.12 %
Training text is covered at 84.67 %


Vocabulary coverage increased from ~31% to ~44%, and text coverage from ~72% to ~85%. Let's check which words are still missing from the embeddings.

In [18]:
not_in_embeddings[:20]

[("it's", 1838),
 ("that's", 1635),
 ("don't", 1188),
 ("we're", 1178),
 ('well,', 1117),
 ('know,', 921),
 ("i'm", 848),
 ('it.', 827),
 ('now,', 806),
 ("we've", 748),
 ('(applause.)', 586),
 ('obama:', 577),
 ('that.', 562),
 ("they're", 552),
 ("you're", 532),
 ('that,', 493),
 ('tapper:', 487),
 ("i've", 485),
 ("there's", 480),
 ('said,', 470)]

In [19]:
'obama' in glove_index

True

In [20]:
'know' in glove_index

True

In [21]:
'thats' in glove_index

True

In [22]:
'dont' in glove_index

True

In [23]:
'theres' in glove_index

True

In [24]:
'\'' in glove_index

True

In [25]:
':' in glove_index

True

In [26]:
',' in glove_index

True

In [27]:
'(' in glove_index

True

In [28]:
').' in glove_index

False

It seems that punctuation symbols were not removed when the embeddings were built: they exist as separate tokens. So, I will remove punctuation that sits in the middle of a word token, and split it off as its own token when it sits at the start or end.
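
The rule described above can be illustrated on single tokens. This is a minimal sketch rather than the notebook's own function: `split_token` is a hypothetical helper that returns the pieces of one whitespace-delimited token in reading order.

```python
import string

PUNCT = set(string.punctuation)

def split_token(token):
    # Find the span of the token that remains after peeling punctuation off both ends
    start, end = 0, len(token)
    while start < end and token[start] in PUNCT:
        start += 1
    while end > start and token[end - 1] in PUNCT:
        end -= 1
    # Punctuation left inside the core is deleted, not split off
    core = "".join(ch for ch in token[start:end] if ch not in PUNCT)
    pieces = list(token[:start])      # leading punctuation, one piece per character
    if core:
        pieces.append(core)
    pieces.extend(token[end:])        # trailing punctuation, one piece per character
    return pieces

for tok in ["(applause.)", "don't", "obama:", "well,"]:
    print(tok, "->", split_token(tok))
```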

In [0]:
import string

def handle_punctuation(X):
  # Splits leading/trailing punctuation into separate tokens and deletes
  # punctuation inside a token (X is modified in place)
  strip_table = str.maketrans('', '', string.punctuation)

  for i in range(len(X)):
    filtered_list = []
    for word in X[i].split():
      leading, trailing = [], []

      # Peel punctuation off the front, one character per token
      while word and word[0] in string.punctuation:
        leading.append(word[0])
        word = word[1:]

      # Peel punctuation off the back; insert at 0 to keep the original order
      while word and word[-1] in string.punctuation:
        trailing.insert(0, word[-1])
        word = word[:-1]

      # Emit the pieces in reading order: leading punctuation, word, trailing punctuation
      filtered_list.extend(leading)
      if word:
        filtered_list.append(word.translate(strip_table))
      filtered_list.extend(trailing)

    X[i] = ' '.join(filtered_list)

  return X

In [0]:
X_train = handle_punctuation(X_train)
sentences = [row.split() for row in X_train]
vocab = build_vocab(sentences)

Check coverage now.

In [31]:
not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Training vocabulary is covered at 81.71 %
Training text is covered at 99.08 %


Almost all of the text is covered! Let's also try removing the punctuation completely.

In [0]:
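def remove_punctuation(X):
  # Deletes every punctuation character from every document in X (X is modified in place)
  strip_table = str.maketrans('', '', string.punctuation)

  for i in range(len(X)):
    filtered_list = [word.translate(strip_table) for word in X[i].split()]
    # Join once per document; this also handles documents with no tokens
    X[i] = ' '.join(filtered_list)

  return X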

In [0]:
# X_train_no_punc = remove_punctuation(X_train.copy())  # copy, because the function edits its argument in place
# sentences = [row.split() for row in X_train_no_punc]
# vocab = build_vocab(sentences)

Check coverage without punctuation

In [0]:
# not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

Coverage is slightly lower without punctuation, so we will keep it. Let's look for further improvements.

In [35]:
not_in_embeddings[:20]

[('youve', 351),
 ('theyve', 210),
 ('shouldnt', 102),
 ('theyll', 78),
 ('werent', 74),
 ('odonnell', 62),
 ('250000', 49),
 ('bbld', 47),
 ('whove', 39),
 ('africanamerican', 33),
 ('202n', 30),
 ('africanamericans', 29),
 ('shortterm', 29),
 ('theyd', 29),
 ('hadnt', 28),
 ('200000', 28),
 ('costsharing', 26),
 ('karibjanian', 25),
 ('doddfrank', 25),
 ('twothirds', 22)]

Many pretrained embeddings replace large numbers with # during preprocessing, so let's try masking the numbers the same way.

In [0]:
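def clean_numbers(X):
  # Replaces runs of digits with '#' placeholders (X is modified in place)
  for i in range(len(X)):
    x = X[i]
    if bool(re.search(r'\d', x)):
      # Longest runs first, so long numbers are not split into shorter matches
      x = re.sub(r'[0-9]{4,}', ' ### ', x)
      x = re.sub(r'[0-9]{3}', ' ## ', x)
      x = re.sub(r'[0-9]{2}', ' # ', x)
    X[i] = x
  return X  # return the whole array, not the last document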
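
A worked example of the three substitutions used in `clean_numbers`, applied to one made-up sentence. `mask_numbers` and `sample` are hypothetical names. Whether the tokens `#`, `##` and `###` actually exist in the GloVe file is an assumption that should be checked with a lookup such as `'###' in glove_index`.

```python
import re

def mask_numbers(text):
    # Longest runs first, so a 6-digit number is not split into shorter matches
    text = re.sub(r"[0-9]{4,}", " ### ", text)
    text = re.sub(r"[0-9]{3}", " ## ", text)
    text = re.sub(r"[0-9]{2}", " # ", text)
    return text

sample = "paid 250000 in 2019 for 350 units and 42 more"
print(mask_numbers(sample).split())
```

Single digits are left unchanged, since none of the patterns match a run of length one.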

In [0]:
# X_train = clean_numbers(X_train)
# sentences = [row.split() for row in X_train]
# vocab = build_vocab(sentences)

Check coverage with '#' instead of numbers.

In [0]:
# not_in_embeddings = check_cover(vocab=vocab, glove_index=glove_index)

In [0]:
# not_in_embeddings[:20]