# Internet news: Topic Modeling and Search with Top2Vec

[Top2Vec](https://github.com/ddangelov/Top2Vec) is an algorithm for **topic modeling** and **semantic search**. It **automatically** detects the topics present in a text corpus and generates jointly embedded topic, document, and word vectors. Once you train the Top2Vec model, you can:
* Get number of detected topics.
* Get topics.
* Search topics by keywords.
* Search documents by topic.
* Find similar words.
* Find similar documents.

This notebook preprocesses the [Kaggle Internet news data with readers engagement](https://www.kaggle.com/szymonjanowski/internet-articles-data-with-users-engagement) and treats the content of each article as a distinct document. A Top2Vec model is then trained on those documents.

Once the model is trained, you can run **semantic** searches: search for documents by topic, search for documents by keywords, search for topics by keywords, and find similar words. These methods all rely on distances in the joint topic, document, and word embedding space, which represent semantic similarity.




# Import and Setup 

### 1. Install the [Top2Vec](https://github.com/ddangelov/Top2Vec) library

In [None]:
!pip install top2vec==1.0.6

### 2. Import Libraries

In [None]:
import numpy as np 
import pandas as pd 
import json
import os
from top2vec import Top2Vec

# Import dataset

In [None]:
df = pd.read_csv("../input/internet-articles-data-with-users-engagement/articles_data.csv",
                usecols = ["content"])
df.head()

In [None]:
df.isna().sum()

In [None]:
df1 = df.dropna()

In [None]:
df1.isna().sum()

In [None]:
# Convert the article contents to a plain Python list of strings, the input format Top2Vec expects
df1 = df1.content.values.tolist()
type(df1)

In [None]:
len(df1)

In [None]:
df1[:1]

# Train Top2Vec Model



Parameters:

* documents: Input corpus, should be a list of strings.
* speed: Determines how long the model takes to train. The 'fast-learn' option is the fastest and generates the lowest quality vectors. The 'learn' option learns better quality vectors but takes longer to train. The 'deep-learn' option learns the best quality vectors but takes significant time to train.
* workers: The number of worker threads used to train the model. More workers generally speed up training, up to the number of available CPU cores.
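The two input-related parameters above can be prepared with a small amount of plain Python. The following is a minimal sketch, not part of Top2Vec: `prepare_corpus` and `pick_workers` are hypothetical helper names, and the toy documents are made up for illustration. It assumes you want to keep only non-empty strings and leave one CPU core free.

```python
import os


def prepare_corpus(raw_documents):
    """Keep only non-empty strings, stripped of surrounding whitespace.

    Hypothetical helper: it guards against None, NaN, and blank entries
    so that the result is a clean list of strings.
    """
    cleaned = []
    for doc in raw_documents:
        if isinstance(doc, str) and doc.strip():
            cleaned.append(doc.strip())
    return cleaned


def pick_workers(reserve=1):
    """Choose a worker count from the CPU count, keeping `reserve` cores free."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available - reserve)


# Toy input (made up) containing the kinds of entries to filter out.
toy_documents = ["Stocks rallied on Friday.", None, "   ", float("nan"), " Match report "]
corpus = prepare_corpus(toy_documents)
workers = pick_workers()
print(corpus, workers)
```

The resulting `corpus` and `workers` values could then be passed as the `documents` and `workers` arguments when training.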



In [None]:
model = Top2Vec(documents=df1, speed="learn", workers=8)

# Explore Top2Vec Discovered Topics

## 1. Get Number of Topics

This will return the number of topics that Top2Vec has found in the data.

In [None]:
model.get_num_topics()

## 2. Get Topics

This returns the topics in order of decreasing size.

In [None]:
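# Request every detected topic instead of hard-coding the count found in one particular run
topic_words, word_scores, topic_nums = model.get_topics(model.get_num_topics())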
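To make "decreasing size" concrete, here is a toy illustration that counts how many documents carry each topic label and sorts the topics by that count. It is a sketch of the ordering idea only, not Top2Vec's internal code: `order_topics_by_size` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def order_topics_by_size(doc_topic_labels):
    """Return (topic_label, size) pairs sorted by decreasing size.

    Ties are broken by the smaller label so the result is deterministic.
    """
    counts = Counter(doc_topic_labels)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up topic label for each of six documents.
toy_labels = [2, 0, 2, 1, 2, 0]
ordered = order_topics_by_size(toy_labels)
print(ordered)
```

In this toy case, label 2 covers three documents, label 0 covers two, and label 1 covers one, so that is the order in which they are listed.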

## 3. Generate Word Clouds

Using a topic number, you can generate a word cloud. We will generate word clouds for topics 70 through 74.

In [None]:
for topic in topic_nums[70:75]:
    model.generate_topic_wordcloud(topic, background_color="black")

## 4. Search Documents by Topic

We are going to search by topic 40.

In [None]:
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=40, num_docs=2)


Returns:

* documents: The documents in a list, with the most similar first.
* doc_scores: Semantic similarity of each document to the topic, computed as the cosine similarity between the document vector and the topic vector.
* doc_ids: Unique ids of the documents. If ids were not given, the index of each document in the original corpus.
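The scoring described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the library's implementation: `cosine_similarity` and `rank_documents` are hypothetical helpers, and the two-dimensional vectors are toy values standing in for real document and topic embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(doc_vectors, topic_vector, num_docs):
    """Return (doc_ids, doc_scores) for the num_docs most similar documents.

    The document id is simply the index of the vector in doc_vectors.
    """
    scored = [
        (cosine_similarity(vec, topic_vector), doc_id)
        for doc_id, vec in enumerate(doc_vectors)
    ]
    scored.sort(key=lambda pair: -pair[0])  # highest similarity first
    top = scored[:num_docs]
    return [doc_id for _, doc_id in top], [score for score, _ in top]


# Toy embeddings (made up): four documents and one topic in a 2-D space.
toy_doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_topic_vector = [1.0, 0.0]
toy_ids, toy_scores = rank_documents(toy_doc_vectors, toy_topic_vector, num_docs=2)
print(toy_ids, toy_scores)
```

Document 0 points in the same direction as the topic vector, so it ranks first; document 2 sits at a 45 degree angle and ranks second.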


For each of the returned documents we are going to print its content, score and document number.

In [None]:
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=40, num_docs=2)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

## 5. Similar Keywords

Search for words similar to **football**.

In [None]:
words, word_scores = model.similar_words(keywords=["football"], keywords_neg=[], num_words=10)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")

## 6. Search Documents by Keywords

Search documents for content semantically similar to **games** and **players**.

In [None]:
documents, document_scores, document_ids = model.search_documents_by_keyword(keywords=["games", "players"], num_docs=4)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()
