# LIBRARIES

In [None]:
from mitielib import *
import numpy as np
import utils
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cluster
import pandas as pd

# GLOBALS/CONSTANTS

In [None]:
articles_path='./articles'

# GATHERING NEWS ARTICLES

For this tutorial, it is assumed that you have access to electronic versions of news stories as text, ideally all from a single day. At least 50 articles are suggested. You can extract the content of these articles, including their titles, manually or with Python-based tools such as PythonGoose (https://github.com/grangier/python-goose).

The rest of this tutorial assumes that the content of each news article is stored in the following format (the loading code later in this tutorial numbers the articles starting from 0):
 - <b>Filename:</b> title-<article_number>.txt, example: title-1.txt
 - <b>Contents:</b> The title of the news story.

 - <b>Filename:</b> article-<article_number>.txt, example: article-1.txt
 - <b>Contents:</b> The contents of the news story.
        
In addition, it is recommended that you tag each article with its actual <b>“topic”</b>, which will help us evaluate the performance of the spectral clustering algorithm. A convenient choice is the name of the section under which the hosting website files the story. For instance, news stories about <b>“Brexit”</b> were typically classified under the <b>“Brexit”</b> section on most online news websites, stories about the <b>“Middle East”</b> under <b>“Middle East”</b>, and so on. If this information is available, it can be stored in the following format:
 - <b>Filename:</b> topic-<article_number>.txt, example: topic-1.txt
 - <b>Contents:</b> The actual “topic” (section or sub-section) under which the news story was classified on the hosting website.
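A quick sanity check of this layout can save debugging time later. The sketch below uses a hypothetical helper, `missing_files`, which is not part of this tutorial's `utils` module. It reports which of the expected files are absent and assumes the numbering starts at 0, as the loading loop later in this tutorial does.

```python
import os
import tempfile

def missing_files(articles_path, n_articles, kinds=("title", "article", "topic")):
    """Return the expected file names that are absent from articles_path.

    Hypothetical helper: assumes files are named <kind>-<i>.txt for
    i in 0..n_articles-1.
    """
    missing = []
    for i in range(n_articles):
        for kind in kinds:
            name = f"{kind}-{i}.txt"
            if not os.path.isfile(os.path.join(articles_path, name)):
                missing.append(name)
    return missing

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for kind in ("title", "article", "topic"):
        with open(os.path.join(demo_dir, f"{kind}-0.txt"), "w") as f:
            f.write("placeholder text")
    # Only index 0 was written, so the three files for index 1 are reported.
    demo_missing = missing_files(demo_dir, 2)
    print(demo_missing)
```

Running the check against `articles_path` before loading should return an empty list if every article has its title, contents and topic file.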

In [None]:
utils.get_articles(articles_path, news_website = 'https://edition.cnn.com/world', language='en', lm_articles = 300)

# ENTITY EXTRACTION

We now want to represent each article’s contents (the corpus) as a “bag of entities”. This simply means that we look for mentions of named entities, i.e. names of people, organizations, locations, etc. Such entities can be detected using various publicly available libraries. One such library with Python bindings is “MITIE” (https://github.com/mit-nlp/MITIE). You will need to install it and note the path to its NER model file, which is used below.

In [4]:
# NER path   
path_to_ner_model = './MITIE-models/english/ner_model.dat'
ner = named_entity_extractor(path_to_ner_model)

Load Articles: 

In [5]:
# Load the article files stored under articles_path
# total number of articles to process
N = 300
# in-memory stores for the topics, titles and contents of the news stories
topics_array = []
titles_array = []
corpus = []
for i in range(N):
    # get the contents of the article
    with open(articles_path+f'/article-{i}.txt', 'r') as myfile:
        d1=myfile.read().replace('\n', '')
        d1 = d1.lower()
        corpus.append(d1)
        
    #get the original topic of the article
    with open(articles_path+f'/topic-{i}.txt', 'r') as myfile:
        to1=myfile.read().replace('\n', '')
        to1 = to1.lower()
        topics_array.append(to1)
        
    #get the title of the article
    with open(articles_path+f'/title-{i}.txt', 'r') as myfile:
        ti1=myfile.read().replace('\n', '')
        ti1 = ti1.lower()
        titles_array.append(ti1)

Now that MITIE is installed and the NER model is loaded, we are ready to do the following:
1. Loop over all the articles, tokenize each one, and extract the named entities that the NER model finds in it.
2. Collect the lower-cased text of every entity and remove duplicates, which gives the set of unique entities mentioned across the dataset.

This can be achieved using the following lines of code:

In [6]:
# entity subset array
entity_text_array = []
for i in range(N):
    # Load the article contents text file and convert it into a list of words.
    tokens = tokenize(load_entire_file((articles_path+f'/article-{i}.txt')))
    # extract all entities known to the ner model mentioned in this article
    entities = ner.extract_entities(tokens)
    # extract the actual entity words and append to the array
    for e in entities:
        range_array = e[0]
        entity_text = b' '.join([tokens[j] for j in range_array])
        entity_text_array.append(entity_text.lower())

# remove duplicate entities detected
entity_text_array = np.unique(entity_text_array)

entity_text_array

array([b'#vaccinescauseautism', b'&pizza', b"'mona lisa", ...,
       b'zulkiflee anwar haque', b'zunar', b'zurcher'], dtype='|S62')
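To make the indexing step concrete without requiring MITIE, the sketch below applies the same join to made-up data. It assumes each entity is a tuple whose first element is the range of token indices, which is how the loop above uses `e[0]`. The tag and score fields, the sample tokens, and the `entity_strings` helper are illustrative assumptions. Decoding the bytes to `str` is an extra step that the tutorial code does not perform.

```python
def entity_strings(tokens, entities):
    """Join the tokens covered by each entity range into a lower-cased string."""
    texts = []
    for entity in entities:
        token_range = entity[0]  # first element: indices of the entity's tokens
        text = b" ".join(tokens[j] for j in token_range)
        texts.append(text.lower().decode("utf-8"))
    # De-duplicate and sort, similar in effect to np.unique on the list.
    return sorted(set(texts))

# Made-up tokens and entity tuples, shaped like (range, tag, score).
demo_tokens = [b"Angela", b"Merkel", b"visited", b"Paris", b"and",
               b"reporters", b"followed", b"Angela", b"Merkel"]
demo_entities = [
    (range(0, 2), "PERSON", 1.2),
    (range(3, 4), "LOCATION", 0.9),
    (range(7, 9), "PERSON", 1.1),
]
unique_entities = entity_strings(demo_tokens, demo_entities)
print(unique_entities)
```

The repeated mention of the same person collapses into a single entry, which is the purpose of the de-duplication step.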

# TF-IDF

Now that we have the list of all entities used across our dataset, we can represent each article as a vector that contains the TF-IDF (https://en.wikipedia.org/wiki/Tf–idf) score for each entity stored in <i>entity_text_array</i>. This is easily done with the scikit-learn library (http://scikit-learn.org/stable/) for Python. Please ensure scikit-learn is installed before proceeding. The following lines of code represent each article in the dataset as a vector of TF-IDF values:

In [7]:
# Construct the TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=entity_text_array)
corpus_tf_idf = vect.fit_transform(corpus)
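To show what the resulting numbers mean, here is a minimal pure-Python sketch of the weighting. As far as I understand scikit-learn's defaults, `sublinear_tf=True` replaces a raw count `tf` with `1 + ln(tf)`, the smoothed IDF is `ln((1 + n) / (1 + df)) + 1`, and each row is L2-normalized. The `tfidf_rows` helper, the sample documents, and the naive substring counting are illustrative assumptions, not the vectorizer's actual tokenization. One detail may be worth checking in your own run: with `analyzer='word'` and the default `ngram_range=(1, 1)`, only single tokens are generated, so multi-word vocabulary entries would probably not be matched unless a wider `ngram_range` is passed.

```python
import math

def tfidf_rows(documents, vocabulary):
    """Sublinear TF-IDF with smoothed IDF and L2-normalized rows."""
    n_docs = len(documents)
    # Raw counts: how often each vocabulary phrase occurs in each document
    # (naive substring counting, for illustration only).
    counts = [[doc.count(term) for term in vocabulary] for doc in documents]
    # Document frequency: number of documents containing each phrase.
    doc_freq = [sum(1 for row in counts if row[k] > 0)
                for k in range(len(vocabulary))]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    rows = []
    for row in counts:
        weighted = [(1 + math.log(c)) * w if c > 0 else 0.0
                    for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return rows

demo_docs = [
    "angela merkel met macron in paris",
    "paris hosts the summit",
    "markets fall",
]
demo_vocab = ["angela merkel", "paris"]
demo_matrix = tfidf_rows(demo_docs, demo_vocab)
for row in demo_matrix:
    print([round(v, 3) for v in row])
```

In the first document the rarer entity receives the larger weight, and a document that mentions no entity at all ends up as an all-zero row, which gives the clustering step nothing to work with.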

# SPECTRAL CLUSTERING

Now that we have the articles represented as vectors of their TF-IDF scores, we are ready to perform <b>Spectral Clustering</b> on the articles. We can use the scikit-learn library for this purpose as well. The following lines of code will cluster our articles into 7 clusters:

In [8]:
# change n_clusters to equal the number of clusters desired
n_clusters = 7
#spectral clustering
spectral = cluster.SpectralClustering(n_clusters= n_clusters,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors",
                                      n_neighbors = 17)
spectral.fit(corpus_tf_idf)

if hasattr(spectral, 'labels_'):
    # cast the cluster labels to plain integers
    cluster_assignments = spectral.labels_.astype(int)
# print the first 40 documents; use len(cluster_assignments) to print all
for i in range(0, 40):
    # The document topics are not informative here because the site (CNN)
    # did not provide useful topic labels by default
    print('Document Topic : {}'.format(topics_array[i]))
    print('Cluster Assignment : {}'.format(cluster_assignments[i]))
    print('Document Title : {}'.format(titles_array[i]))
    print('------------------------')

Document Topic : edition.cnn.com
Cluster Assignment : 5
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 2
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 3
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 2
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 3
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 2
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 3
Document Title : cnn.com - transcripts
------------------------
Document Topic : edition.cnn.com
Cluster Assignment : 2
Document Title : cnn.com - transcripts
------------------------
...
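The `affinity="nearest_neighbors"` option used above builds the similarity graph from each document's nearest neighbors rather than from a dense kernel. The sketch below illustrates the idea in pure Python on four 2-D points. To my understanding it mirrors what scikit-learn does (a k-nearest-neighbor connectivity matrix that includes the point itself, averaged with its transpose), but it should be read as an approximation. The `knn_affinity` helper and the sample points are illustrative assumptions.

```python
import math

def knn_affinity(points, n_neighbors):
    """Symmetric k-nearest-neighbor affinity matrix as nested lists."""
    n = len(points)
    connectivity = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(points):
        # Sort all points (including p itself) by Euclidean distance to p.
        order = sorted(range(n), key=lambda j: math.dist(p, points[j]))
        for j in order[:n_neighbors]:
            connectivity[i][j] = 1.0
    # Average with the transpose so that the matrix is symmetric.
    return [[0.5 * (connectivity[i][j] + connectivity[j][i]) for j in range(n)]
            for i in range(n)]

# Two tight pairs of points that are far apart from each other.
demo_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
demo_affinity = knn_affinity(demo_points, n_neighbors=2)
for row in demo_affinity:
    print(row)
```

The matrix is block-diagonal here: each pair is connected internally and has no link to the other pair, which is the structure spectral clustering exploits. A larger `n_neighbors`, such as the 17 used above, links each document to more of its neighbors and tends to blur such blocks.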

## Result Visualization

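The table below cross-tabulates the recorded topic of each article against its cluster assignment. It is not very informative in this run, because the site (CNN) did not provide useful topics by default: the "topic" stored for each article is simply the hostname it came from.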

In [9]:
df = pd.DataFrame({'Topic': topics_array, 'Assignment': cluster_assignments})
df['val'] = 1
tb = df.groupby(['Topic', 'Assignment']).sum().unstack().fillna(0).astype(int)
tb

                  val
Assignment          0    1    2    3    4    5    6
Topic
edition.cnn.com     9    5  112  105    3    3    6
money.cnn.com       1   12   16   15    1    0    2
www.abc12.com       0    0    1    0    0    0    0
www.cbs46.com       0    0    0    1    0    0    0
www.cnn.com         1    0    2    1    0    1    0
www.kctv5.com       0    0    1    0    0    0    0
www.kptv.com        0    0    0    1    0    0    0
www.wsmv.com        0    0    0    0    0    0    1
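When meaningful topic labels are available, one simple way to summarize such a table in a single number is cluster purity: the fraction of documents whose topic is the most common topic within their cluster. The sketch below is an illustration only. `cluster_purity` is a hypothetical helper and the labels are made up. Purity is easy to inflate when one label dominates the dataset, as the hostname labels do here, so it is only meaningful with real section names.

```python
from collections import Counter

def cluster_purity(topics, assignments):
    """Fraction of documents whose topic is the majority topic of their cluster."""
    by_cluster = {}
    for topic, cluster_id in zip(topics, assignments):
        by_cluster.setdefault(cluster_id, Counter())[topic] += 1
    # For each cluster, count the documents that carry its most common topic.
    majority_total = sum(max(counter.values()) for counter in by_cluster.values())
    return majority_total / len(topics)

# Made-up labels: cluster 0 is pure, cluster 1 contains one stray article.
demo_topics = ["brexit", "brexit", "middle east", "middle east", "brexit"]
demo_assignments = [0, 0, 1, 1, 1]
purity = cluster_purity(demo_topics, demo_assignments)
print(purity)
```

With the tutorial's variables this would be called as `cluster_purity(topics_array, cluster_assignments)`.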
