<a href="https://colab.research.google.com/github/yuvaravii/BBC-News-article-Topic-Identification/blob/main/LSA_Theme_extraction_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem Description**

In this project your task is to identify the major themes/topics across a collection of BBC news articles. You can use topic-modeling techniques such as Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA).

In [None]:
# for dataframes
import pandas as pd
import numpy as np
import re

#for ignoring warnings
import warnings
warnings.filterwarnings("ignore")

import json
import glob
import os


#gensim
import gensim
import gensim.corpora as corpora 
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel


from spacy import displacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel

import sklearn
import keras

#spacy
import spacy 
from nltk.corpus import stopwords

# for visualisation of data
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
processed_data_filepath='/content/drive/MyDrive/Colab Notebooks/Capstone Project/BBC article/2. Cleaned and Preprocessed data/3rd_cleaned_dataset_stg.csv'
new_df=pd.read_csv(processed_data_filepath)
df=new_df.copy()
df=df.drop(columns=['Unnamed: 0'])
df.head()

In [None]:
# importing necessary libraries
from sklearn.feature_extraction.text import  TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [None]:
corpus=df['cleaned_doc']
corpus[:5]

**TF IDF vectorization**
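
To make the scores in the cells below easier to read, here is a minimal hand-computed sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The names `toy_docs`, `toy_vocab` and `tfidf_row` are hypothetical and not part of the notebook, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for df['cleaned_doc']
toy_docs = [
    "market economy growth",
    "football match win",
    "market economy slump",
]
toy_tokens = [doc.split() for doc in toy_docs]
toy_vocab = sorted({tok for doc in toy_tokens for tok in doc})
n_docs = len(toy_docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(tok for doc in toy_tokens for tok in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in toy_vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in toy_vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

toy_tfidf = [tfidf_row(doc) for doc in toy_tokens]

# Non-zero entries of the first document, analogous to print(X[0])
for term, weight in zip(toy_vocab, toy_tfidf[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents ("growth") receive a higher weight than terms shared across documents ("market"), which is the behaviour the vectorizer relies on.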

In [None]:
# create the TF-IDF vectorizer
vectorizer=TfidfVectorizer(use_idf=True)

# learn the vocabulary and IDF weights, then transform the corpus into a sparse document-term matrix
X=vectorizer.fit_transform(corpus)

In [None]:
X[0]

In [None]:
print(X[0])  # each line shows (document index, term index) followed by the TF-IDF score

In [None]:
X.shape # (number of documents, vocabulary size): one row per document, one column per term (about 7 lakh, i.e. 700,000, columns)

In [None]:
X.size # Number of stored (non-zero) entries in the sparse matrix

**LSA - LATENT SEMANTIC ANALYSIS**

The steps are broadly similar to those of LDA:
1. Documents ----> Document-term matrix
2. Document-term matrix ----> Document-topic matrix + Topic-term matrix

The difference lies in how the factorization is obtained:
1. SVD (singular value decomposition) is applied to the document-term matrix.

  1.1 It factors the document-term matrix into 3 parts
      a) A matrix with orthonormal columns - the document-topic matrix
      b) A matrix with orthonormal rows - the topic-term matrix
      c) A diagonal matrix of singular values - the importance of each topic

2. The number of topics k is a hyperparameter, tuned using topic coherence as the evaluation metric.
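
The decomposition described above can be sketched with NumPy on a small matrix. This is only an illustration under stated assumptions: the 4 x 5 matrix `A` and the choice `k = 2` are made up, and a dense `np.linalg.svd` is used here instead of the truncated solver applied to the real corpus.

```python
import numpy as np

# Hypothetical document-term matrix (rows: documents, columns: terms)
A = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of topics to keep
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

doc_topic = U_k * s_k          # documents expressed in topic space, shape (4, k)
topic_term = Vt_k              # topics expressed over terms, shape (k, 5)
A_k = doc_topic @ topic_term   # rank-k approximation of A

print("singular values:", np.round(s, 3))
print("reconstruction error:", np.linalg.norm(A - A_k))
```

Keeping only the `k` largest singular values discards the weakest directions, so the reconstruction error equals the norm of the dropped singular values.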


In [None]:
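#Model creation (n_iter only applies to the 'randomized' solver and is ignored by 'arpack')
lsa=TruncatedSVD(n_components=10 , n_iter=50,algorithm='arpack',random_state=100)

# Fit the model and project the documents onto the 10 topics (returns the document-topic matrix)
lsa.fit_transform(X)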
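
The array returned by `fit_transform` is the document-topic matrix, while `components_` holds the topic-term matrix. The sketch below shows both on a toy corpus. The documents and every `toy_*` name are hypothetical, and taking the largest absolute score as a document's dominant topic is just one possible convention, not something the notebook prescribes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus with two obvious themes
toy_docs = [
    "market economy growth bank",
    "bank market economy",
    "football match goal win",
    "match win team goal referee",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

toy_lsa = TruncatedSVD(n_components=2, algorithm="arpack", random_state=100)
toy_doc_topic = toy_lsa.fit_transform(toy_X)   # shape (n_documents, n_topics)

print(toy_doc_topic.shape)            # document-topic matrix
print(toy_lsa.components_.shape)      # topic-term matrix
print(toy_lsa.singular_values_)       # importance of each topic

# Dominant topic per document: the component with the largest absolute score
toy_dominant = np.abs(toy_doc_topic).argmax(axis=1)
print(toy_dominant)
```

Keeping the returned matrix in a variable, rather than discarding it, makes it possible to assign each article to a theme later on.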

In [None]:
lsa.components_[9]

In [None]:
lsa.get_params()

In [None]:
vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
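# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    # pair every term with its weight in this topic and keep the 10 highest-weighted terms
    vocab_comp = zip(vocab, comp)
    sorted_words = sorted(vocab_comp, key=lambda x: x[1], reverse=True)[:10]

    print("Topic "+str(i)+": ")
    for t in sorted_words:
        print(t[0], end=" ")
    print("\n")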
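
The same ranking can be done with `np.argsort` instead of sorting Python tuples, which tends to be faster on a large vocabulary. The helper below is a hypothetical sketch, not the notebook's method. Its `by_abs` flag is an added assumption: SVD components may carry negative weights and their sign is arbitrary, so ranking by absolute value is sometimes preferred.

```python
import numpy as np

def top_terms(components, vocab, n_top=10, by_abs=False):
    """Return, for each topic, the n_top (term, weight) pairs with the highest weight."""
    vocab = np.asarray(vocab)
    topics = []
    for weights in components:
        scores = np.abs(weights) if by_abs else weights
        idx = np.argsort(scores)[::-1][:n_top]   # indices of the largest scores first
        topics.append([(str(vocab[i]), float(weights[i])) for i in idx])
    return topics

# Hypothetical vocabulary and topic-term weights, for illustration only
toy_vocab = ["bank", "goal", "market", "match"]
toy_components = np.array([
    [0.7, 0.0, 0.6, 0.1],
    [-0.1, 0.8, -0.2, 0.5],
])

toy_topics = top_terms(toy_components, toy_vocab, n_top=2)
for i, terms in enumerate(toy_topics):
    print(f"Topic {i}:", " ".join(term for term, _ in terms))
```

With the fitted model, the call would look like `top_terms(lsa.components_, vocab)`.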

In [None]:
from wordcloud import WordCloud
# Generate a word cloud image for given topic
def draw_word_cloud(index):
  imp_words_topic=""
  comp=lsa.components_[index]
  vocab_comp = zip(vocab, comp)
  sorted_words = sorted(vocab_comp, key= lambda x:x[1], reverse=True)[:50]
  for word in sorted_words:
    imp_words_topic=imp_words_topic+" "+word[0]

  wordcloud = WordCloud(width=600, height=400).generate(imp_words_topic)
  plt.figure( figsize=(5,5))
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout()
  
  plt.show()
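
Because the function above passes each word to `WordCloud.generate` exactly once, the word sizes in the image cannot reflect the topic weights. One possible alternative, sketched here under assumptions, is to build a term-to-weight dictionary and hand it to `WordCloud.generate_from_frequencies`. The helper `topic_frequencies` is hypothetical, and dropping non-positive weights is a simplification made for this example.

```python
import numpy as np

def topic_frequencies(weights, vocab, n_top=50):
    """Map the n_top highest-weighted terms of one topic to their positive weights."""
    weights = np.asarray(weights, dtype=float)
    idx = np.argsort(weights)[::-1][:n_top]
    # Negative or zero weights are skipped (an assumption of this sketch)
    return {str(vocab[i]): float(weights[i]) for i in idx if weights[i] > 0}

# Hypothetical vocabulary and weights for a single topic
toy_vocab = ["bank", "goal", "market", "match"]
toy_weights = [0.7, -0.2, 0.6, 0.1]

toy_freqs = topic_frequencies(toy_weights, toy_vocab, n_top=3)
print(toy_freqs)

# With the wordcloud package installed, the dictionary can be drawn with:
# WordCloud(width=600, height=400).generate_from_frequencies(toy_freqs)
```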
 

In [None]:
for i in range(0,10):
  # draw_word_cloud returns None, so print the label first and then draw the cloud
  print('topic{}'.format(i))
  draw_word_cloud(i)