<a href="https://colab.research.google.com/github/shashanknigade/Data-Science/blob/main/Unsupervised_News_Aggregator_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Article Aggregator
---
### Problem Statement
Develop an algorithm to group news articles by story.
**Dataset**:
https://www.kaggle.com/uciml/news-aggregator-dataset

---

### Approach
1.   Text Preprocessing using Gensim & NLTK Package
1.   Doc2Vec generation
1.   Dimensionality reduction using stacked autoencoders
1.   Unsupervised clustering using the KMeans algorithm
1.   Metrics calculation







In [2]:
# Import all necessary libraries
# For file copy from Kaggle etc.
import os, shutil
from google.colab import drive
import zipfile
# For data exploration and graphs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# For text preprocessing
import gensim
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from gensim.parsing.preprocessing import strip_punctuation
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import word_tokenize
# For saving models etc
import pickle
# Deep Neural Network libraries
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
# KMeans
from sklearn.cluster import KMeans,MiniBatchKMeans
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Folder Structure Setup and input data
`Google Drive`
  - `kaggle`
    - `kaggle.json` - contains the username and API key for downloading data from Kaggle directly

`Colab Folder`
  - `content`
    - `data` - the CSV file is downloaded here
      - `kaggle` - `kaggle.json` is copied here

In [5]:
# Mount the drive to read kaggle config file
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [6]:
# global config values
zipName = 'news-aggregator-dataset.zip'
csvFileName = 'uci-news-aggregator.csv'
modelFileSaveLocation = '/content/gdrive/MyDrive'
run = 'newsagg-1'
exportModels = False

In [7]:
# Create the data folder and its kaggle subfolder
os.makedirs('data',exist_ok=True)
os.makedirs('data/kaggle',exist_ok=True)
# Change the current working directory to data where we will download 
# the data
os.chdir(os.path.join(os.getcwd(),'data'))

In [8]:
# copy kaggle.json to the content folder's kaggle folder
shutil.copy('/content/gdrive/MyDrive/kaggle/kaggle.json',
            'kaggle')
# Set environment for kaggle directory to the folder in prev. step
os.environ['KAGGLE_CONFIG_DIR'] = 'kaggle'

In [9]:
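# Download the dataset (skipped when an up-to-date local copy exists)
!kaggle datasets download -d uciml/news-aggregator-dataset
# Extract the files into the current working directory; the context
# manager closes the archive afterwards
with zipfile.ZipFile(zipName,'r') as zipfileref:
    zipfileref.extractall()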

news-aggregator-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


### Text Preprocessing

In [10]:
# Read the csv file
df = pd.read_csv(csvFileName)

In [94]:
def ProcessTest(text):
  '''
    text: Input Text
    This function lowers the case, removes punctuation and then removes
    stop words from a text. Lowercasing comes first because Gensim's
    stop word list is lowercase and is matched case-sensitively.
  '''
  text = strip_punctuation(str.lower(text))
  text = remove_stopwords(text)
  return str.strip(text)
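
To make the effect of this step concrete, here is a minimal sketch that mimics it using only the standard library. `TOY_STOPWORDS` and `preprocess_headline` are hypothetical names introduced for illustration; the tiny stop word set is an assumption and is not Gensim's list, so real output from `remove_stopwords` will differ. The headline is one of the titles that appears later in this notebook.

```python
import string

# Hypothetical, deliberately tiny stop word set (an assumption for this
# sketch); Gensim's real list is much larger.
TOY_STOPWORDS = {"a", "an", "at", "by", "for", "in", "of", "the", "to"}

# Replace every punctuation character with a space, similar in spirit to
# Gensim's strip_punctuation.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess_headline(text):
    """Lowercase, drop punctuation, then drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    tokens = [tok for tok in no_punct.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)


example = preprocess_headline("Free coffee at McDonald's amid breakfast war")
print(example)
```

In this sketch the text is lowercased before stop words are filtered, so capitalised stop words such as "The" are matched as well.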

In [95]:
# Store processed text in list
processedText = []
# Read and process TITLE column in the CSV which will be the input to
# this algorithm
for index, value in df['TITLE'].items():
    processedText.append([index,ProcessTest(value)])


In [96]:
# Create a tagged document object list by tokenizing the words in a sentence from processedText
inputText = [TaggedDocument(word_tokenize(sen[1]),[sen[0]]) for sen in processedText]

In [97]:
# Train a Doc2Vec model that produces one 200-dimensional vector per title
# https://radimrehurek.com/gensim/models/doc2vec.html?highlight=doc2vec#module-gensim.models.doc2vec
vec_size = 200
gmodel = Doc2Vec(inputText, vector_size=vec_size, window=2, min_count=1, workers=4)

In [98]:
gmodel.docvecs.vectors_docs.shape

(422419, 200)

In [30]:
# if the config value is set to export the model
if exportModels:
  # Export the trained Doc2Vec model
  folder = os.path.join(modelFileSaveLocation,run)
  # Create folder if it does not exist
  os.makedirs(folder,exist_ok=True)
  gmodel.save(os.path.join(folder,'newsaggword2vec_200.pkl'))

In [9]:
# Load the previously trained Doc2Vec model.
gmodel = gensim.models.Doc2Vec.load("/content/gdrive/MyDrive/newsaggword2vec_200.pkl")

In [129]:
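# Document vectors, one row per title. This is the Gensim 3.x attribute;
# Gensim 4 exposes the same array as gmodel.dv.vectors
X = gmodel.docvecs.vectors_docs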

### Dimensionality Reduction
*   Create an autoencoder and train it on X
*   The encoder part compresses the 200 input features into a 10-dimensional code in such a way that the decoder part can reconstruct the inputs from it
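
The encode/decode idea can be sketched without a neural network. The example below is a swapped-in technique, not the Keras model used in this notebook: it builds a purely linear autoencoder from a truncated SVD (equivalent to PCA) in NumPy, compresses synthetic 200-dimensional vectors to 10 dimensions, and measures the reconstruction error. The function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_linear_autoencoder(X, code_size):
    """Return (mean, components) for a rank-`code_size` linear encoder/decoder."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:code_size]


def encode(X, mean, components):
    """Project centered inputs onto the retained directions."""
    return (X - mean) @ components.T


def decode(codes, mean, components):
    """Map codes back to the original feature space."""
    return codes @ components + mean


rng = np.random.default_rng(0)
# Synthetic stand-in for document vectors: 200 features driven by
# 10 latent factors plus a little noise.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 200))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 200))

mean, components = fit_linear_autoencoder(X_demo, code_size=10)
codes = encode(X_demo, mean, components)
reconstruction = decode(codes, mean, components)
mse = float(np.mean((X_demo - reconstruction) ** 2))
print(codes.shape, mse)
```

A low reconstruction error means the 10-dimensional codes keep most of the information in the inputs. The network below differs in that its SELU layers allow a non-linear mapping.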




In [130]:
# Encoder network
autoModelEncoder=Sequential()
autoModelEncoder.add(Dense(100, input_shape=[vec_size]))
autoModelEncoder.add(Dense(50, activation='selu'))
autoModelEncoder.add(Dense(10, activation='selu'))
# Decoder network
autoModelDecoder=Sequential()
autoModelDecoder.add(Dense(50,input_shape=[10]))
autoModelDecoder.add(Dense(100, activation='selu'))
autoModelDecoder.add(Dense(vec_size, activation='selu'))
# Stacked autoEncoder
autoEncoder = Sequential([autoModelEncoder,autoModelDecoder])
autoEncoder.compile('adam',loss='mse')

In [131]:
autoEncoder.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sequential_7 (Sequential)    (None, 10)                25660     
_________________________________________________________________
sequential_8 (Sequential)    (None, 200)               25850     
Total params: 51,510
Trainable params: 51,510
Non-trainable params: 0
_________________________________________________________________


In [132]:
history = autoEncoder.fit(X,X,epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [133]:
codings = autoModelEncoder.predict(X)

### Clustering

In [None]:
# Fit KMeans with 100 clusters
# For the purpose of this assignment the number of clusters is fixed at
# 100; it can be explored further to find the optimum value
Kmean = KMeans(n_clusters=100)
Kmean.fit(codings)

In [None]:
# Save the labels in the dataframe
df['Prediction'] = Kmean.labels_

In [151]:
df[['TITLE','STORY','Prediction']].head()

| | TITLE | STORY | Prediction |
|---|---|---|---|
| 0 | Fed official says weak data caused by weather,... | ddUyU0VZz0BRneMioxUPQVP6sIxvM | 11 |
| 1 | Fed's Charles Plosser sees high bar for change... | ddUyU0VZz0BRneMioxUPQVP6sIxvM | 11 |
| 2 | US open: Stocks fall after Fed official hints ... | ddUyU0VZz0BRneMioxUPQVP6sIxvM | 39 |
| 3 | Fed risks falling 'behind the curve', Charles ... | ddUyU0VZz0BRneMioxUPQVP6sIxvM | 21 |
| 4 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | ddUyU0VZz0BRneMioxUPQVP6sIxvM | 9 |
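
Because the dataset ships a `STORY` identifier, the predicted clusters can be compared against it. The sketch below shows one possible way to do that with scikit-learn's `homogeneity_completeness_v_measure` and `adjusted_rand_score`. The helper `story_agreement` and the toy labels are assumptions made for illustration and are not results from this notebook.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_completeness_v_measure


def story_agreement(true_stories, predicted_clusters):
    """Summarise how well predicted clusters line up with known stories."""
    homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(
        true_stories, predicted_clusters
    )
    return {
        "homogeneity": homogeneity,    # 1.0 when each cluster holds a single story
        "completeness": completeness,  # 1.0 when each story sits in a single cluster
        "v_measure": v_measure,        # harmonic mean of the two
        "adjusted_rand": adjusted_rand_score(true_stories, predicted_clusters),
    }


# Toy labels (made up): story "a" is split over clusters 0 and 1,
# story "b" is kept whole in cluster 2.
toy_stories = ["a", "a", "a", "b", "b", "b"]
toy_clusters = [0, 0, 1, 2, 2, 2]
scores = story_agreement(toy_stories, toy_clusters)
print(scores)
```

On the notebook's dataframe the call would presumably be `story_agreement(df['STORY'], df['Prediction'])`. A story spread over several clusters, as in the five rows above, lowers completeness while leaving homogeneity untouched.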


In [152]:
# Inspect the news articles assigned to cluster 0
df[df['Prediction'] == 0][['TITLE','STORY']].value_counts()
# The output shows that the cluster contains several sub-clusters of
# similar articles

TITLE                                                                            STORY                        
Fingerprint lock in Samsung Galaxy 5 easily defeated by whitehat hackers         dynZ2e-zzuhPqbMOq7Mq3rCo44vyM    2
Free coffee at McDonald's amid breakfast war                                     dM-ub8KuIwsrB2MFbfiNkhw0PNvgM    1
Freddie Prinze, Jr.: Working with Kiefer Sutherland Made Me Want to Quit Acting  dhLBOkcY7Z2Jr-MoauAW9o5smlJIM    1
Free Ben & Jerry's Ice Cream Tomorrow!                                           dNkKiBF-kLRjmeM4B7Hq0v2I_5PPM    1
Free Drug Samples for Doctors Might Prove Costly                                 dbSy6LpD5LyNoWM3dUsR9yyWV50_M    1
                                                                                                                 ..
Nokia names Rajeev Suri as new CEO                                               d0P4azXWamm0lRMJMTmGz5l2BskIM    1
Nokia names Rajeev Suri as new CEO and reports sales drop                    
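
One way to quantify how mixed a cluster is would be its purity: the share of its articles that belong to the most frequent story. The sketch below computes this per cluster with pandas. `cluster_purity` is a hypothetical helper and the toy frame is made up; only the column names `STORY` and `Prediction` come from this notebook.

```python
import pandas as pd


def cluster_purity(frame, cluster_col="Prediction", story_col="STORY"):
    """Fraction of each cluster taken up by its most frequent story."""
    # Number of articles per (cluster, story) pair.
    counts = frame.groupby([cluster_col, story_col]).size()
    dominant = counts.groupby(level=cluster_col).max()
    sizes = counts.groupby(level=cluster_col).sum()
    return (dominant / sizes).rename("purity")


# Toy frame (made up): cluster 0 mixes two stories, cluster 1 holds one.
toy = pd.DataFrame(
    {
        "STORY": ["s1", "s1", "s2", "s3", "s3"],
        "Prediction": [0, 0, 0, 1, 1],
    }
)
purity = cluster_purity(toy)
print(purity)
```

A purity close to 1.0 means the cluster is dominated by a single story, while low values point to clusters that mix unrelated stories.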

In [None]:
if exportModels:
  # Export the fitted KMeans model
  folder = os.path.join(modelFileSaveLocation,run)
  # Create folder if it does not exist
  os.makedirs(folder,exist_ok=True)
  # The context manager closes the file once the model is written
  with open(os.path.join(folder,'newsaggkmeansAutoEncoder.pkl'),'wb') as f:
    pickle.dump(Kmean, f)

### Next steps

1. As this is an unsupervised problem, the silhouette score can be used to measure the quality of the clusters.

2. Articles are currently grouped together, but the grouping is not yet up to the mark. To improve the quality of the algorithm, the following can be tried:
> * Trying out different dimensions for the Doc2Vec vector
> * Trying different encoder network configurations
> * Trying various KMeans parameters
> * Trying other clustering algorithms like DBSCAN
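
As a starting point for the first two items, the sketch below scores several candidate cluster counts with the silhouette score. It is a minimal illustration under stated assumptions: it runs on synthetic blobs rather than the encoder's `codings`, it uses `MiniBatchKMeans` (imported at the top of this notebook) instead of `KMeans` for speed, and the helper `silhouette_by_k` and its parameters are hypothetical. The silhouette is computed on a random sample, since the full computation is expensive on several hundred thousand rows.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, candidate_ks, sample_size=1000, seed=0):
    """Fit one model per candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed)
        labels = model.fit_predict(features)
        # Score on a random sample to keep the pairwise distances affordable.
        scores[k] = silhouette_score(
            features,
            labels,
            sample_size=min(sample_size, len(features)),
            random_state=seed,
        )
    return scores


rng = np.random.default_rng(0)
# Synthetic stand-in for the 10-dimensional codings: four well-separated blobs.
centers = np.eye(4, 10) * 20.0
features_demo = np.vstack([c + rng.normal(size=(200, 10)) for c in centers])

scores_by_k = silhouette_by_k(features_demo, candidate_ks=[2, 3, 4, 5, 6])
best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```

Scores range from -1 to 1, with higher values indicating tighter and better separated clusters. The same loop could be pointed at `codings` with a wider range of cluster counts.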





In [153]:
!pip freeze > requirements.txt