## 02 - Clean and Transform

This notebook highlights the steps to clean and transform the data into
matrices ready for model fitting.

Two types of matrices will be created:
1. Term document matrix (sparse)
2. Term frequency - inverse document frequency (TF-IDF) matrix

Fortunately, there are no missing values in this case.

#### Small dataset

Extract only 20000 articles containing a mix of fake and reliable labels.
While 20000 is small compared to the full dataset, the matrix created afterward is still massive.

After some testing, starting from row 910000 gives a good mix of reliable
and fake articles.

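Note: Labeling every article as fake would already give about 78% accuracy on this sample (15600 of 20000 articles are fake), so that is the baseline a model has to beat.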
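
As a quick check of that baseline, the sketch below computes the accuracy of always predicting the majority class. The counts are copied from the `value_counts()` output shown further down; the variable names are illustrative only.

```python
from collections import Counter

# Label counts as reported by df.type.value_counts() for this sample
label_counts = Counter({"fake": 15600, "reliable": 4400})

# Always predicting the most common label is right exactly that often
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())

print(f"Always predicting '{majority_label}' gives {baseline_accuracy:.0%} accuracy")
```

Any model fitted on this sample should be compared against this number rather than against 50%.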

In [23]:
import pandas as pd
import numpy as np
import pickle

from words_clean_function import denoise_text, normalize
from nltk import word_tokenize

from gensim import corpora, matutils, models
from gensim.similarities.docsim import MatrixSimilarity

from scipy.sparse import save_npz, load_npz

In [6]:
# define data path
data_path = 'D:\\PycharmProjects\\springboard\\data\\'

# Skip to row 910000 for a good mix of fake and reliable news
skiprows = 910000
nrows = 20000

# Read in sample data
df = pd.read_csv(f'{data_path}news_clean_1.csv',
                 skiprows=skiprows,
                 index_col=False,
                 nrows=nrows,
                 names=['index', 'type', 'content'])

# Drop the index column
df = df.drop('index', axis=1)
df.type.value_counts()

fake        15600
reliable     4400
Name: type, dtype: int64

In [7]:
# Print first article to view
print(df.content[0])

Statement on the Release of Prisoners with link to list

(Before It's News)

[For list of the terrorists being released in Hebrew:

http://www.shabas.gov.il/NR/rdonlyres/B43A2078-2C20-449D-AFC3-FDB247722BAD/0/reshima1.pdf

Shortened link:

http://bit.ly/19XgFuS ]

Statement on the Release of Prisoners

(Communicated by the Prime Ministerâs Media Adviser)

In wake of the Cabinet decision to resume the diplomatic negotiations

between Israel and the Palestinians and authorize a team of ministers to

deal with the release of prisoners during the negotiations, the ministerial

committee convened this evening (Sunday, 11 August 2013). Defense Minister

Moshe Yaalon chaired the discussion; Justice Minister Tzipi Livni and

Science, Technology and Space Minister Yaakov Peri also participated, as did

representatives of the Prison Service, the Justice Ministry, the IDF and

other agencies.

The committee approved the release of 26 prisoners. The list of prisoners

will be published on the Pr

### Cleaning

The goal here is to clean all the articles and create a sparse term document
matrix (TDM). As with any NLP project, the following functions
clean and tokenize the articles. Afterward, the TDM is created using the
gensim package.

Note: Writing custom cleaning functions is a good way to learn more about NLP.

The steps are as follows:
1. De-noise
    - Remove brackets/links content
    - Remove contractions
2. Tokenize
3. Normalize
    - Remove non ASCII
    - Remove stopwords
    - Remove punctuation
    - Lowercase all words

All the functions above are defined in words_clean_function.py
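
The real implementations live in words_clean_function.py and are not shown here. The sketch below is only a rough illustration of the three steps, using the standard library. All names ending in `_sketch`, the tiny contraction map, the short stopword set, and the regex tokenizer are assumptions made for this example (the notebook itself uses `nltk.word_tokenize`), and the order of the normalization steps may differ from the original.

```python
import re
import string
import unicodedata

# Hypothetical stand-ins: tiny illustrative lists, not the ones used in the notebook
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}
STOPWORDS = {"the", "a", "an", "of", "to", "is", "it", "on", "in", "and", "with", "for", "not", "do"}

def denoise_text_sketch(text):
    """Drop bracketed content and links, then expand a few contractions."""
    text = re.sub(r"\[[^\]]*\]", " ", text)    # remove [bracketed] content
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    return text

def tokenize_sketch(text):
    """Very rough tokenizer: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize_sketch(tokens):
    """Strip non-ASCII characters, lowercase, drop punctuation and stopwords."""
    cleaned = []
    for tok in tokens:
        tok = unicodedata.normalize("NFKD", tok).encode("ascii", "ignore").decode("ascii")
        tok = tok.lower().strip(string.punctuation)
        if tok and tok not in STOPWORDS:
            cleaned.append(tok)
    return cleaned

sample = "Statement on the Release of Prisoners [see http://bit.ly/x ] (Before It's News)"
tokens = normalize_sketch(tokenize_sketch(denoise_text_sketch(sample)))
print(tokens)
```

Each step is a plain function from one representation to the next, which is why the notebook can apply them one after another with `map`.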

In [8]:
# Using map to quickly clean and tokenize our data
df.content = df.content.map(denoise_text)
df.content = df.content.map(word_tokenize)
df.content = df.content.map(normalize)

In [9]:
# Print the first article again
print(df.content[0])

['statement', 'release', 'prisoners', 'link', 'list', 'news', 'statement', 'release', 'prisoners', 'communicated', 'prime', 'ministeras', 'media', 'adviser', 'wake', 'cabinet', 'decision', 'resume', 'diplomatic', 'negotiations', 'israel', 'palestinians', 'authorize', 'team', 'ministers', 'deal', 'release', 'prisoners', 'negotiations', 'ministerial', 'committee', 'convened', 'evening', 'sunday', '11', 'august', '2013', 'defense', 'minister', 'moshe', 'yaalon', 'chaired', 'discussion', 'justice', 'minister', 'tzipi', 'livni', 'science', 'technology', 'space', 'minister', 'yaakov', 'peri', 'also', 'participated', 'representatives', 'prison', 'service', 'justice', 'ministry', 'idf', 'agencies', 'committee', 'approved', 'release', '26', 'prisoners', 'list', 'prisoners', 'published', 'prison', 'service', 'website', 'later', 'tonight', 'notice', 'given', 'bereaved', 'families', 'asked', 'informed', 'advance', 'list', 'includes', '14', 'prisoners', 'transferred', 'gaza', '12', 'judea', 'samari

### Sparse Term Document Matrix

[Bag of Words](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)

[Term Document Matrix](https://en.wikipedia.org/wiki/Document-term_matrix)

The Gensim package provides a simple way to create a sparse TDM with matutils.
While a TDM is a very straightforward representation of the bag of words model,
it contains a lot of zeros, because every unique word becomes a feature and most
words do not appear in most documents.

Steps to create the TDM:
1. Create a lexicon (dictionary of all words)
2. Transform the bag of words into a matrix

The dictionary created has 196679 unique tokens/words
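
To make the two steps concrete, here is a minimal pure-Python sketch on two toy documents. It only mimics the idea behind `corpora.Dictionary` and `doc2bow`; the names `toy_docs`, `toy_lexicon`, `toy_bow` and `toy_dense` are made up for this example, and the ids it assigns are not meant to match the ones Gensim would assign.

```python
from collections import Counter

toy_docs = [
    ["release", "prisoners", "list", "release"],
    ["minister", "list", "prisoners"],
]

# Step 1: lexicon mapping each unique token to an integer id
toy_lexicon = {}
for doc in toy_docs:
    for token in doc:
        toy_lexicon.setdefault(token, len(toy_lexicon))

# Step 2: each document becomes a sparse list of (token_id, count) pairs
toy_bow = [
    sorted((toy_lexicon[tok], n) for tok, n in Counter(doc).items())
    for doc in toy_docs
]

# Dense terms x documents view, built only to show where the zeros come from
toy_dense = [[0] * len(toy_docs) for _ in toy_lexicon]
for doc_id, pairs in enumerate(toy_bow):
    for token_id, count in pairs:
        toy_dense[token_id][doc_id] = count

print(toy_lexicon)
print(toy_bow)
print(toy_dense)
```

Even with two documents and four words the dense view already contains zeros; with 196679 words and 20000 articles almost every entry is zero, which is why a sparse format is used.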

In [10]:
# Create a word lexicon
lexicon = corpora.Dictionary(df.content)
print(lexicon)

# bag of words
bow = []
for doc in df.content:
    bow.append(lexicon.doc2bow(doc))

# Create term frequency matrix
tf_sparse_matrix = matutils.corpus2csc(bow)

Dictionary(196679 unique tokens: ['11', '12', '14', '2013', '26']...)


In [11]:
# Check the type and shape of the matrix
print(type(tf_sparse_matrix))
print(tf_sparse_matrix.shape)

<class 'scipy.sparse.csc.csc_matrix'>
(196679, 20000)


Since the matrix is terms x documents (rows are terms, columns are documents),
it must be transposed to documents x terms before it is saved for model fitting.

Labels are encoded as fake = 0 and reliable = 1.
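
The toy sketch below illustrates both points with plain Python lists: transposing a terms x documents matrix, and why the labels end up as fake = 0 and reliable = 1. It assumes category codes follow alphabetical order of the label names, which is how pandas is understood to assign them for string categories. All names are illustrative.

```python
# Toy terms x documents matrix: 3 terms (rows), 2 documents (columns)
terms_by_docs = [
    [2, 0],
    [1, 1],
    [0, 3],
]

# Transpose so that each row is one document and each column one term
docs_by_terms = [list(row) for row in zip(*terms_by_docs)]

# Alphabetical category codes, mirroring astype('category').cat.codes
labels = ["reliable", "fake"]
categories = sorted(set(labels))              # ['fake', 'reliable']
codes = [categories.index(label) for label in labels]

print(docs_by_terms)
print(dict(zip(labels, codes)))
```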

In [12]:
# Transpose tf_sparse_matrix to documents x terms before saving
X = tf_sparse_matrix.T
# Encode labels as category codes: 0 for fake and 1 for reliable
y = df.type.astype('category').cat.codes

# Quick check before saving
print(f'X shape: {X.shape} and X type: {type(X)}')
print(f'y shape: {y.shape} and y type: {type(y)}')

# Save to disk
save_npz(f'{data_path}news_tf_sparse.npz', X)
y.to_csv(f'{data_path}news_labels.csv')

X shape: (20000, 196679) and X type: <class 'scipy.sparse.csr.csr_matrix'>
y shape: (20000,) and y type: <class 'pandas.core.series.Series'>


### Term Frequency - Inverse Document Frequency

[TF-IDF](http://www.tfidf.com/)

TF-IDF weighs the importance of each word in a document and gives it a score.
The higher the score, the more important that word is to the document.
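
As a rough illustration of how such a score can be computed, the sketch below uses one common weighting, tf x log2(N / df), followed by scaling each document to unit length. This is believed to match the defaults of Gensim's `TfidfModel`, but that is an assumption and should be confirmed against the Gensim documentation. The toy documents and the names `toy_docs`, `doc_freq` and `tf_idf_sketch` are made up for this example.

```python
import math
from collections import Counter

toy_docs = [
    ["release", "prisoners", "list", "release"],
    ["minister", "list", "prisoners"],
    ["minister", "science"],
]

n_docs = len(toy_docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf_sketch(doc):
    """Return {term: weight} using tf * log2(N / df), scaled to unit length."""
    counts = Counter(doc)
    weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items() if w > 0}

for doc in toy_docs:
    print(tf_idf_sketch(doc))
```

In the first toy document, "release" gets the highest score: it occurs twice there and in no other document, whereas "list" and "prisoners" are shared with the second document and are weighted down.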

Again, the Gensim package provides an intuitive way to create the matrix.

In [14]:
# Initialize tf-idf model
tf_idf = models.TfidfModel(bow)

# tf-idf corpus
tf_idf_corpus = []
for doc in bow:
    tf_idf_corpus.append(tf_idf[doc])

# Create tf idf matrix
tf_idf_matrix = matutils.corpus2dense(tf_idf_corpus, num_terms=len(lexicon.token2id))

array([[0.04208539, 0.        , 0.        , 0.        ],
       [0.03988195, 0.        , 0.0580506 , 0.        ],
       [0.04474735, 0.        , 0.        , 0.        ],
       [0.04389755, 0.05998502, 0.02129851, 0.        ],
       [0.05441932, 0.        , 0.02640354, 0.        ],
       [0.06655783, 0.        , 0.        , 0.        ],
       [0.05444635, 0.        , 0.        , 0.        ],
       [0.05802789, 0.        , 0.        , 0.        ],
       [0.06138134, 0.        , 0.        , 0.        ],
       [0.1343428 , 0.        , 0.        , 0.        ],
       [0.05370708, 0.        , 0.        , 0.        ],
       [0.01518245, 0.        , 0.00736633, 0.        ],
       [0.06234901, 0.        , 0.        , 0.        ],
       [0.03910037, 0.        , 0.        , 0.        ],
       [0.0479917 , 0.06557959, 0.06985483, 0.        ],
       [0.0948211 , 0.        , 0.        , 0.        ],
       [0.11451864, 0.        , 0.        , 0.        ],
       [0.06860378, 0.        ,

Save the matrix for the models. The TF-IDF corpus is saved as well, since it
will be used to build the similarity matrix later on.

In [22]:
# Save as X since y is unchanged
X = tf_idf_matrix.T

# Checking before save to disk
print(f'X shape is {X.shape}')

# Save to disk
np.save(f'{data_path}news_tf_idf.npy', X)
with open(f'{data_path}news_tf_idf_corpus.txt', 'wb') as fp:
    pickle.dump(tf_idf_corpus, fp)

X shape is (20000, 196679)
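
Because this TF-IDF matrix is dense, it is worth estimating how much memory and disk space it needs. The sketch below does the arithmetic for the shape printed above, assuming 4 bytes per value (float32, which is believed to be the default dtype of `corpus2dense`; check `X.dtype` to confirm).

```python
# Shape of the dense TF-IDF matrix as printed above
n_docs, n_terms = 20000, 196679

# Assumption: float32 values, i.e. 4 bytes each
bytes_per_value = 4

dense_bytes = n_docs * n_terms * bytes_per_value
print(f"Dense matrix size: {dense_bytes / 1e9:.1f} GB")
```

If that is too large for the machine at hand, one option is to keep the TF-IDF matrix sparse as well, for example by passing `tf_idf_corpus` to `matutils.corpus2csc` in the same way as the term frequency matrix earlier.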
