<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/09-topic-analysis/topic_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Topic Analysis

In this notebook, you will work with one more application of
this powerful framework – topic classification. 

Let’s start with a scenario: suppose you work
as a content manager for a large news platform. Your platform hosts texts from a wide
variety of authors and mainly specializes in the following set of well-established topics:
“Politics”, “Finance”, “Science”, “Sports”, and “Arts”. Your task is to decide, for every
incoming article, which topic it belongs to and post it under the relevant tab on the platform.

This scenario relates to the task that we can broadly define as topic analysis.
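As a rough illustration of the task, the sketch below trains a tiny supervised topic classifier. The four example articles, their labels, and all variable names are invented for illustration and are not part of the dataset used in this notebook. Only `TfidfVectorizer`, `MultinomialNB`, and `make_pipeline` are real scikit-learn components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy articles with hand-assigned topics (not real platform data)
toy_articles = [
    "the senate passed the election law",
    "the president addressed parliament before the vote",
    "the team won the match in the final",
    "the striker scored a goal in the game",
]
toy_topics = ["Politics", "Politics", "Sports", "Sports"]

# Turn each text into TF-IDF weights, then fit a Naive Bayes classifier on them
toy_classifier = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
toy_classifier.fit(toy_articles, toy_topics)

# Route two unseen articles to a topic tab
incoming = ["parliament will vote on the election law", "the team scored a late goal"]
predicted = toy_classifier.predict(incoming)
print(list(zip(incoming, predicted)))
```

With so little data the result only shows the mechanics: each incoming text is assigned the topic whose training vocabulary it overlaps with most.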

Now imagine that the content on your platform does not stay the same over time: new topics may emerge in the data. Unfortunately, if you want to train a classification model to cover these topics in a supervised manner, you will need labeled articles for the new topics as well.

Data availability is the major bottleneck for supervised machine learning. 

Therefore, you will learn about alternative approaches to topic discovery and apply two unsupervised machine learning algorithms:

* clustering
* Latent Dirichlet Allocation
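
To give a feel for what these two approaches produce, here is a minimal sketch on four made-up documents. The toy texts, the choice of two clusters/topics, and the variable names are assumptions for illustration only. The notebook's own setup may differ. The cluster and topic ids are arbitrary: only the grouping is meaningful.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical unlabeled toy documents: two about space, two about hockey
toy_docs = [
    "rocket launch orbit satellite",
    "satellite orbit nasa rocket",
    "hockey goal team playoff",
    "team playoff hockey score",
]

# Clustering: group TF-IDF vectors into k=2 clusters without using any labels
tfidf = TfidfVectorizer().fit_transform(toy_docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(tfidf)
print("Cluster per document:", cluster_ids)

# LDA: model each document as a mixture of 2 topics over raw word counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # one row per document, rows sum to 1
print("Topic mixture per document:")
print(doc_topic.round(2))
```

The difference in output is the design point: k-means assigns each document to exactly one cluster, while LDA gives each document a probability distribution over topics.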


<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/09-topic-analysis/images/1.png?raw=1' width='600'/>

##Setup

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

from sklearn import metrics
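# plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions provide ConfusionMatrixDisplay.from_estimator instead.
from sklearn.metrics import plot_confusion_matrix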

from sklearn.datasets import fetch_20newsgroups

import random
import numpy as np
import matplotlib.pyplot as plt

##Dataset

In [2]:
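def load_dataset(subsets, cats):
  # Load the "train" or "test" subset for the given categories; strip headers,
  # footers, and quoted replies so that only the body of each post remains
  dataset = fetch_20newsgroups(subset=subsets, categories=cats, remove=("headers", "footers", "quotes"), shuffle=True)
  return dataset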

In [3]:
categories = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball"]
categories += ["rec.sport.hockey", "sci.crypt", "sci.med", "sci.space", "talk.politics.mideast"]

newsgroups_train = load_dataset("train", categories)
newsgroups_test = load_dataset("test", categories)

In [4]:
# Let's check the loaded data subsets
def check_data(dataset):
  print(list(dataset.target_names))
  print(dataset.filenames.shape)
  print(dataset.target.shape)

  if dataset.filenames.shape[0] == dataset.target.shape[0]:
    print("Equal sizes for data and targets")
  
  print(dataset.filenames[0])
  print(dataset.data[0])
  print(dataset.target[:10])

In [6]:
check_data(newsgroups_train)
print("\n***\n")
check_data(newsgroups_test)

['comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.med', 'sci.space', 'talk.politics.mideast']
(5913,)
(5913,)
Equal sizes for data and targets
/root/scikit_learn_data/20news_home/20news-bydate-train/rec.sport.baseball/102665
I have posted the logos of the NL East teams to alt.binaries.pictures.misc 
 Hopefully, I'll finish the series up next week with the NL West.

 Darren

[4 3 9 7 4 3 0 5 7 8]

***

['comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.med', 'sci.space', 'talk.politics.mideast']
(3937,)
(3937,)
Equal sizes for data and targets
/root/scikit_learn_data/20news_home/20news-bydate-test/misc.forsale/76785
As the title says. I would like to sell my Star LV2010 9 pin printer.
Its a narrow colum dot matrix, supports both parallel and serial
interfacing, prints at 200 characters per second, has a 16K buffer, 
and is very dependab
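
The `target` array holds integer indices into `target_names`, so a label of `4` means `rec.sport.baseball`. A small helper such as the one below can turn those integers into per-category counts, which is a quick way to check class balance. `class_counts` is a hypothetical helper, not part of scikit-learn, and the example uses only the first ten training labels printed above.

```python
import numpy as np

def class_counts(target, target_names):
    # Count how often each integer label occurs and map it back to its category name
    counts = np.bincount(np.asarray(target), minlength=len(target_names))
    return {name: int(count) for name, count in zip(target_names, counts)}

# The ten categories and the first ten training labels from the output above
names = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles",
         "rec.sport.baseball", "rec.sport.hockey", "sci.crypt", "sci.med",
         "sci.space", "talk.politics.mideast"]
first_labels = [4, 3, 9, 7, 4, 3, 0, 5, 7, 8]

for name, count in class_counts(first_labels, names).items():
    print(name, count)
```

On the full data, the same call would be `class_counts(newsgroups_train.target, newsgroups_train.target_names)`.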

##Naïve Bayes classifier