<a href="https://colab.research.google.com/github/s-miramontes/News_Filter/blob/master/Pilot/Data%20Cleanse%20and%20Clustering%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Clustering News Headlines

In this notebook we begin by importing the data and analyzing its contents, so that we can choose the clustering algorithm that best groups the articles most closely related to each other.

We start by importing some dependencies.

In [2]:
import numpy as np
import pandas as pd

## Importing Data from Local File System

Datasets are located here: https://www.kaggle.com/snapcrack/all-the-news/version/4#articles3.csv

Download all three 'csv' files and store them in a 'data' directory at a location of your choice.
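The absolute paths in the loading cell are tied to one machine. A minimal sketch of building them from a single configurable directory could look like the following; `DATA_DIR` and `article_paths` are hypothetical names, and the file names assume the archives are saved as `articlesN.csv.zip`, as in the notebook's own `read_csv` calls.

```python
from pathlib import Path

# Hypothetical location of the downloaded files; adjust to your own setup.
DATA_DIR = Path("data")

def article_paths(data_dir=DATA_DIR, n_files=3):
    """Return the paths articles1.csv.zip ... articlesN.csv.zip inside data_dir."""
    return [Path(data_dir) / f"articles{i}.csv.zip" for i in range(1, n_files + 1)]

for path in article_paths():
    print(path)

# With pandas imported as pd, the frames could then be loaded with:
# frames = [pd.read_csv(p, compression="zip") for p in article_paths()]
```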


In [3]:
articles_3 = pd.read_csv('/Users/mirasilvia/mids/w266/News_Filter/data/articles3.csv.zip', compression='zip')
articles_2 = pd.read_csv('/Users/mirasilvia/mids/w266/News_Filter/data/articles2.csv.zip', compression='zip')
articles_1 = pd.read_csv('/Users/mirasilvia/mids/w266/News_Filter/data/articles1.csv.zip', compression='zip')

In [4]:
full_data = pd.concat([articles_1, articles_2, articles_3], ignore_index=True)

In [5]:
# remove duplicates
full_data = full_data.drop_duplicates(subset=['title', 'publication', 'author', 'date'])

In [6]:
# remove missing titles 
full_data = full_data.dropna(subset=['title'])

In [7]:
title_lengths = [len(s) for s in full_data['title']]

In [8]:
len(title_lengths), len(full_data)

(142568, 142568)

In [9]:
full_data['title_length'] = full_data['title'].apply(len)

In [10]:
full_data = full_data[full_data.title_length > 2]

In [11]:
# sample from full_data (set seed to 5)
small_data = full_data.sample(n=10000, random_state=5).reset_index()
set(small_data.publication)

{'Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post'}

In [12]:
# tokenization
tok_title = [title.lower().split() for title in small_data.title]
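Splitting on whitespace keeps punctuation attached to tokens, so `trump,` and `trump` end up as different vocabulary entries. A minimal sketch of a regex-based alternative is shown here; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names that are not part of the notebook, and the pattern is only one reasonable choice.

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally joined by an
# apostrophe or hyphen (so "12-year-old's" stays one token).
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['’-][a-z0-9]+)*")

def simple_tokenize(title):
    """Lowercase a headline and return its tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(title.lower())

print(simple_tokenize("Trump, Sanders clash!"))
print(simple_tokenize("Hillary Clinton to Wellesley Grads: 'Cheer Up'"))
```

Swapping this in for `title.lower().split()` would change the vocabulary, so any downstream results would need to be recomputed.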

## Cluster Model 2 (Silvia's)

Similarity measures to attempt:
1. Word2Vec
2. Word embeddings
3. Knowledge-based measures (WordNet)

Clustering methods to attempt:
1. TFIDF + K-Means
2. Hierarchical Clustering?
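The first clustering method in the list, TF-IDF followed by K-Means, can be sketched with scikit-learn as follows. This is an illustration only: the toy headlines, `n_clusters=2`, and the English stop-word list are assumptions made for the example, not choices taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy headlines standing in for small_data.title (illustrative only).
toy_titles = [
    "Senate votes on health care bill",
    "House passes health care bill",
    "Team wins championship game in overtime",
    "Star player injured before championship game",
]

# Turn each headline into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(toy_titles)

# Cluster the TF-IDF vectors; n_clusters=2 is an arbitrary choice for the toy data.
tfidf_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = tfidf_kmeans.fit_predict(tfidf)

for label, title in zip(toy_labels, toy_titles):
    print(label, title)
```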



**Word2Vec as Preprocessing of Text.**

In [13]:
from gensim.models import Word2Vec
from nltk.cluster import KMeansClusterer
import nltk

from sklearn import cluster
from sklearn import metrics

In [14]:
model = Word2Vec(tok_title, min_count=1)

In [15]:
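def sentence_vectorizer(s, model):
    """Average the Word2Vec vectors of the tokens in s."""
    sent_vector = np.zeros(model.wv.vector_size, dtype=np.float32)
    num_words = 0

    for word in s:
        try:
            # look vectors up through model.wv; indexing the model directly is deprecated
            sent_vector = np.add(sent_vector, model.wv[word])
            num_words += 1
        except KeyError:
            # skip tokens that are not in the model vocabulary
            pass

    # avoid dividing by zero when no token is in the vocabulary
    if num_words == 0:
        return sent_vector
    return sent_vector / num_words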
    

In [16]:
X = []

for sentence in tok_title:
    X.append(sentence_vectorizer(sentence, model))
    
print("======================")
#print(X)





In [17]:
#print(model[model.wv.vocab])

In [18]:
# what is the similarity between trump and sanders?
print(model.wv.similarity('trump', 'sanders'))
print(model.wv.most_similar(positive=['trump'], negative=[], topn=2))

0.9992844
[('for', 0.9999223947525024), ('by', 0.9999154210090637)]


  


Introducing Clustering.

- How should we determine the best number of clusters?
- For now, it is probably best to estimate based on 10 topics.
- Bring this up for discussion.
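One common way to approach the first question is to compare silhouette scores across several candidate values of k and keep the value with the highest score. The sketch below uses synthetic data, so `X_demo`, the range of k, and the blob layout are all assumptions for illustration; on the title vectors the scores may be far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the title vectors: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

Plotting `kmeans.inertia_` against k (the elbow method) is another option that could be discussed alongside this one.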

In [19]:
NUM_CLUSTERS = 15

kclusterer = KMeansClusterer(NUM_CLUSTERS, 
                             distance = nltk.cluster.util.cosine_distance,
                             repeats=25)

assigned_clusters = kclusterer.cluster(X, assign_clusters=True)


Note that here we used cosine distance to cluster our data.
Once we have the cluster results, we can associate each sentence with the cluster it was assigned to.

**Below we see what cluster each sentence was assigned to.**

In [20]:

#for index, sentence in enumerate(tok_title):
#    print(str(assigned_clusters[index]) + ":" + str(sentence))
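Printing every sentence gets unwieldy for 10,000 titles. A small sketch of grouping sentences by their assigned cluster instead is shown here; `group_by_cluster` is a hypothetical helper, and the toy lists stand in for `tok_title` and `assigned_clusters`.

```python
from collections import defaultdict

def group_by_cluster(sentences, cluster_ids):
    """Map each cluster id to the list of sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, cluster_id in zip(sentences, cluster_ids):
        groups[cluster_id].append(sentence)
    return dict(groups)

# Toy inputs standing in for tok_title and assigned_clusters.
demo_sentences = [["trump", "tours", "carrier"], ["biles", "wins", "gold"], ["trump", "rally"]]
demo_clusters = [0, 1, 0]

for cluster_id, members in group_by_cluster(demo_sentences, demo_clusters).items():
    print(cluster_id, members)
```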

**Now we proceed to apply scikit-learn's KMeans, which uses Euclidean distance**

In [21]:
kmeans = cluster.KMeans(n_clusters = NUM_CLUSTERS)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=15, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [22]:
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

In [28]:
# sum of squared distances of samples to their closest cluster center.
kmeans.inertia_

565.483632094283
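To make the inertia value concrete, the quantity can be recomputed by hand. The sketch below assumes each point is labeled with its closest center, which is the case for a fitted K-Means model; `inertia` and the toy arrays are hypothetical and only illustrate the formula.

```python
import numpy as np

def inertia(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = points - centers[np.asarray(labels)]
    return float(np.sum(diffs ** 2))

demo_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
demo_centers = [[0.0, 1.0], [10.0, 0.0]]
demo_labels = [0, 0, 1]

# Squared distances are 1, 1, and 0, so this prints 2.0.
print(inertia(demo_points, demo_centers, demo_labels))
```

Applied to the notebook's own objects, `inertia(X, kmeans.cluster_centers_, kmeans.labels_)` should be close to `kmeans.inertia_`, up to floating-point differences.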

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

In [41]:
# cosine similarity between each row and its assigned cluster center
cos_sim = [
    cosine_similarity(X[i].reshape(1, -1),
                      kmeans.cluster_centers_[label].reshape(1, -1))[0][0]
    for i, label in enumerate(labels)
]
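The same per-row similarity can also be computed in one vectorized step with NumPy. This is a sketch under the assumption that no row and no center is the zero vector; `rowwise_cosine_to_center` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def rowwise_cosine_to_center(X, centers, labels):
    """Cosine similarity between each row of X and the center it is assigned to."""
    X = np.asarray(X, dtype=float)
    assigned = np.asarray(centers, dtype=float)[np.asarray(labels)]
    dots = np.sum(X * assigned, axis=1)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(assigned, axis=1)
    return dots / norms

demo_X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
demo_centers = [[1.0, 0.0], [0.0, 1.0]]
demo_labels = [0, 1, 0]
print(rowwise_cosine_to_center(demo_X, demo_centers, demo_labels))
```

Note that the clusters were fitted with Euclidean distance, so cosine similarity to the center is a separate diagnostic rather than the quantity K-Means optimized.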

In [42]:
# simple dataframe with title, cluster, and cosine similarity to the cluster center
title_data = pd.DataFrame({'title':small_data.title, 
                          'cluster':labels,
                          'cos_sim':cos_sim})

In [43]:
title_data

| | title | cluster | cos_sim |
|---|---|---|---|
| 0 | ’Window Into The Future’: Scientists Document ... | 1 | 0.988962 |
| 1 | How museums are quietly resisting President Trump | 7 | 0.999906 |
| 2 | Pro boxer beats up NYPD sergeant during traffi... | 9 | 0.999633 |
| 3 | Whoopi ’Implores’ People to Vote So ‘Fearmonge... | 4 | 0.998595 |
| 4 | People who took a gamble on Trump’s nomination... | 0 | 0.999871 |
| 5 | New technique may prevent the gruesome deaths ... | 14 | 0.998786 |
| 6 | Justin Wolfe admits role in drug dealer’s slay... | 5 | 0.999975 |
| 7 | Why Did It Take So Long for Class-Based School... | 0 | 0.999709 |
| 8 | Why celebrities are wearing a T-shirt in winter | 4 | 0.998995 |
| 9 | Hillary Clinton to Wellesley Grads: ’Cheer Up’... | 7 | 0.999470 |


In [50]:
# inspect the titles with the highest cosine similarity in each cluster
top_5 = title_data.groupby('cluster')['cos_sim'].nlargest(5)
for i, ind in top_5.index:
    print("cluster", i)
    print(title_data['title'][ind])
    print('----------------')

cluster 0
EPA to pull back on fuel-efficiency standards for cars, trucks in future model years
----------------
cluster 0
GOP Senator To Angry Constituents: Schedule Your Protest In Advance!
----------------
cluster 0
Cleveland police shooting of Tamir Rice: city to pay $6 million after 12-year-old’s death
----------------
cluster 0
This woman chose to go homeless in San Francisco instead of paying high rent
----------------
cluster 0
Trump tours a hulking aircraft carrier to promote hike in military spending
----------------
cluster 1
Louisiana’s Democratic Governor Robs Kids of School Choice
----------------
cluster 1
Protesters Crowd Outside Fox News’ GOP Debate In Detroit (PHOTOS)
----------------
cluster 1
Mary Tyler Moore, TV legend, has died at 80
----------------
cluster 1
ICE Deports MS-13 Gang Member Wanted for Violent Crimes in El Salvador
----------------
cluster 1
Fifa hands authorities 20,000 pieces of evidence as internal inquiry concludes
----------------
cluster 2
How 

### Predict Input

In [52]:
input_topic = ["Hillary Clinton defends handling of Benghazi attack",
              "Women's March Highlights", "Hillary Clinton emails"]

In [53]:
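# run word2vec
# NOTE: input_topic holds raw strings, so Word2Vec treats each character as a
# token, and this second model's vector space is unrelated to that of `model`.
w2v = Word2Vec(input_topic, min_count=1)

In [54]:
X2 = []

# each string is iterated character by character here as well
for sent in input_topic:
    X2.append(sentence_vectorizer(sent, w2v))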



In [57]:
X_train = X
y_train = title_data.cluster
X_test = X2

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

res = knn.predict(X_test)

In [58]:
print(res)

[3 3 3]
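For a prediction to be comparable with the clusters, a new headline has to be tokenized into words and embedded with the same vectors that produced the training data. The sketch below illustrates that idea with a toy embedding table and swaps the KNN classifier for a plain nearest-centroid lookup; `embed_tokens`, `nearest_centroid`, `toy_vectors`, and `toy_centroids` are all hypothetical stand-ins (in the notebook, `model.wv` and `kmeans.cluster_centers_` would play those roles).

```python
import numpy as np

def embed_tokens(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors` (any token -> vector mapping)."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # no known token: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(found, axis=0)

def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest Euclidean distance to vec."""
    distances = np.linalg.norm(np.asarray(centroids) - vec, axis=1)
    return int(np.argmin(distances))

# Toy embedding table and centroids (illustrative values only).
toy_vectors = {
    "clinton": np.array([1.0, 0.0]),
    "emails": np.array([0.8, 0.2]),
    "olympics": np.array([0.0, 1.0]),
}
toy_centroids = np.array([[0.9, 0.1], [0.0, 1.0]])

# Tokenize the same way as the training titles, then embed and assign.
new_title = "Hillary Clinton emails".lower().split()
vec = embed_tokens(new_title, toy_vectors, dim=2)
print(nearest_centroid(vec, toy_centroids))  # prints 0
```

Tokens missing from the table ("hillary" in this toy case) are skipped, so headlines made up mostly of unseen words would be embedded poorly.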


In [65]:
for i,ind in top_5.index:
    if (i==3):
        print("cluster", i)
        print(title_data['title'][ind])
        print('-------------------')

cluster 3
U.S., Japan first ladies: both unconventional yet poles apart
-------------------
cluster 3
Using Another Secret Tunnel, Drug Kingpin ’El Chapo’ Almost Evaded Capture
-------------------
cluster 3
Military Convoy Sporting ’Trump’ Flag Was Naval Special Warfare Unit
-------------------
cluster 3
Wall St. ends flat amid election doubts, M&A flurry
-------------------
cluster 3
Rio Highlights: Simone Biles Wins All-Around; Michael Phelps Gets 22nd Gold 
-------------------
