## Topic Modeling

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds some natural groups of items (topics) even when we’re not sure what we’re looking for.

A document can belong to multiple topics, much like in fuzzy (soft) clustering, where each data point can belong to more than one cluster.

Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
It can help with the following:

1. Discovering the hidden themes in the collection.
2. Classifying the documents into the discovered themes.
3. Using the classification to organize/summarize/search the documents.


For example, let’s say a document belongs to the topics food, dogs and health. If a user queries “dog food”, they might find this document relevant because it covers those topics (among others). We are able to figure out its relevance to the query without going through the entire document.

Therefore, by annotating each document with the topics predicted by the modeling method, we are able to optimize our search process.
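The idea of searching through topic annotations can be sketched in a few lines. This is only an illustration: the document ids, the topic labels and the `word_to_topic` mapping are all made up for the example, and a real system would take its annotations from a trained topic model.

```python
# Hypothetical topic annotations: document id -> set of topic labels.
doc_topics = {
    "doc_a": {"food", "dogs", "health"},
    "doc_b": {"chess", "computers"},
    "doc_c": {"dogs", "training"},
}

# Hypothetical mapping from query words to topic labels.
word_to_topic = {"dog": "dogs", "food": "food", "chess": "chess"}


def search(query):
    """Rank documents by how many of the query's topics they cover."""
    query_topics = {
        word_to_topic[w] for w in query.lower().split() if w in word_to_topic
    }
    scores = {doc: len(topics & query_topics) for doc, topics in doc_topics.items()}
    # Keep only documents that share at least one topic, best match first.
    matches = [doc for doc in scores if scores[doc] > 0]
    return sorted(matches, key=lambda doc: scores[doc], reverse=True)


print(search("dog food"))
```

Only the short topic annotations are compared with the query; the document text itself is never read.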

**We are going to make use of LDA (Latent Dirichlet Allocation).**

It is one of the most popular topic modeling methods. Each document is made up of various words, and each topic also has various words belonging to it. The aim of LDA is to find the topics a document belongs to, based on the words in it.
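LDA treats each document as a mixture of topics and each topic as a probability distribution over words. The sketch below shows how the two combine into the probability of seeing a word in a document. All numbers and topic names are invented for illustration; a trained model estimates them from the text, which this sketch does not attempt.

```python
# Hypothetical topic -> word probabilities (each inner dict sums to 1).
topic_word = {
    "chess": {"king": 0.5, "board": 0.4, "vote": 0.1},
    "politics": {"king": 0.1, "board": 0.1, "vote": 0.8},
}

# Hypothetical topic proportions for one document (sums to 1).
doc_topic = {"chess": 0.7, "politics": 0.3}


def word_probability(word):
    """Probability of a word in the document under the mixture of topics."""
    return sum(doc_topic[t] * topic_word[t].get(word, 0.0) for t in doc_topic)


for w in ("king", "board", "vote"):
    print(w, round(word_probability(w), 2))
```

A document that leans towards the chess topic gives more weight to chess words, but words from the other topic still have some probability.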

In [4]:
pip install pyldavis


In [5]:
import pyLDAvis

In [6]:
import pyLDAvis.gensim_models

In [7]:
import spacy
import gensim



In [8]:
nlp = spacy.load('en_core_web_sm')



### Accessing the texts

In [9]:
text_1 = ('''Chess, one of the oldest and most popular board games, played by two opponents on a checkered board with specially designed pieces of contrasting colours, commonly white and black. White moves first, after which the players alternate turns in accordance with fixed rules, each player attempting to force the opponent’s principal piece, the King, into checkmate—a position where it is unable to avoid capture. Chess first appeared in India about the 6th century AD and by the 10th century had spread from Asia to the Middle East and Europe. Since at least the 15th century, chess has been known as the “royal game” because of its popularity among the nobility. Rules and set design slowly evolved until both reached today’s standard in the early 19th century. Once an intellectual diversion favoured by the upper classes, chess went through an explosive growth in interest during the 20th century as professional and state-sponsored players competed for an officially recognized world championship title and increasingly lucrative tournament prizes. Organized chess tournaments, postal correspondence games, and Internet chess now attract men, women, and children around the world. This article provides an in-depth review of the history and the theory of the game by noted author and international grandmaster Andrew Soltis. For a chronological list of world champions since the mid-19th century, featuring direct links to biographical articles, see the table of world chess champions. Chess is played on a board of 64 squares arranged in eight vertical rows called files and eight horizontal rows called ranks. These squares alternate between two colours: one light, such as white, beige, or yellow; and the other dark, such as black or green. The board is set between the two opponents so that each player has a light-coloured square at the right-hand corner. Individual moves and entire games can be recorded using one of several forms of notation. 
By far the most widely used form, algebraic (or coordinate) notation, identifies each square from the point of view of the player with the light-coloured pieces, called White. The eight ranks are numbered 1 through 8 beginning with the rank closest to White. The files are labeled a through h beginning with the file at White’s left hand. Each square has a name consisting of its letter and number, such as b3 or g8. Additionally, files a through d are referred to as the queenside, and files e through h as the kingside''')

In [10]:
text_1

'Chess, one of the oldest and most popular board games, played by two opponents on a checkered board with specially designed pieces of contrasting colours, commonly white and black. White moves first, after which the players alternate turns in accordance with fixed rules, each player attempting to force the opponent’s principal piece, the King, into checkmate—a position where it is unable to avoid capture. Chess first appeared in India about the 6th century AD and by the 10th century had spread from Asia to the Middle East and Europe. Since at least the 15th century, chess has been known as the “royal game” because of its popularity among the nobility. Rules and set design slowly evolved until both reached today’s standard in the early 19th century. Once an intellectual diversion favoured by the upper classes, chess went through an explosive growth in interest during the 20th century as professional and state-sponsored players competed for an officially recognized world championship ti

In [11]:
text_2 = ('''This article documents the progress of significant human–computer chess matches. Chess computers were first able to beat strong chess players in the late 1980s. Their most famous success was the victory of Deep Blue over then World Chess Champion Garry Kasparov in 1997, but there was some controversy over whether the match conditions favored the computer. In 2002–2003, three human–computer matches were drawn, but, whereas Deep Blue was a specialized machine, these were chess programs running on commercially available computers. Chess programs running on commercially available desktop computers won decisive victories against human players in matches in 2005 and 2006. The second of these, against then world champion Vladimir Kramnik is (as of 2019) the last major human-computer match. Since that time, chess programs running on commercial hardware—more recently including mobile phones—have been able to defeat even the strongest human players. MANIAC (1956)In 1956 MANIAC, developed at Los Alamos Scientific Laboratory, became the first computer to defeat a human in a chess-like game. Playing with the simplified Los Alamos rules, it defeated a novice in 23 moves.[1] Mac Hack VI (1966–1968). In 1966 MIT student Richard Greenblatt wrote the chess program Mac Hack VI using MIDAS macro assembly language on a Digital Equipment Corporation PDP-6 computer with 16K of memory. Mac Hack VI evaluated 10 positions per second. In 1967, several MIT students and professors (organized by Seymour Papert) challenged Dr. Hubert Dreyfus to play a game of chess against Mac Hack VI. Dreyfus, a professor of philosophy at MIT, wrote the book What Computers Can’t Do, questioning the computer's ability to serve as a model for the human brain. He also asserted that no computer program could defeat even a 10-year-old child at chess. Dreyfus accepted the challenge. Herbert A. Simon, an artificial intelligence pioneer, watched the game. 
He said, "it was a wonderful game—a real cliffhanger between two woodpushers with bursts of insights and fiendish plans ... great moments of drama and disaster that go in such games." The computer was beating Dreyfus when he found a move which could have captured the enemy queen. The only way the computer could get out of this was to keep Dreyfus in checks with its own queen until it could fork the queen and king, and then exchange them. That is what the computer did. Soon, Dreyfus was losing. Finally, the computer checkmated Dreyfus in the middle of the board. In the spring of 1967, Mac Hack VI played in the Boston Amateur championship, winning two games and drawing two games. Mac Hack VI beat a 1510 United States Chess Federation player. This was the first time a computer won a game in a human tournament. At the end of 1968, Mac Hack VI achieved a rating of 1529. The average rating in the USCF was near 1500.[2] Chess x.x (1968–1978) In 1968, Northwestern University students Larry Atkin, David Slate and Keith Gorlen began work on Chess (Northwestern University). On 14 April 1970 an exhibition game was played against Australian Champion Fred Flatow, the program running on a Control Data Corporation 6600 model. Flatow won easily. On 25 July 1976, Chess 4.5 scored 5–0 in the Class B (1600–1799) section of the 4th Paul Masson chess tournament in Saratoga, California. This was the first time a computer won a human tournament. Chess 4.5 was rated 1722. Chess 4.5 running on a Control Data Corporation CDC Cyber 175 supercomputer (2.1 megaflops) looked at less than 1500 positions per second. On 20 February 1977, Chess 4.5 won the 84th Minnesota Open Championship with 5 wins and 1 loss. It defeated expert Charles Fenner rated 2016. On 30 April 1978, Chess 4.6 scored 5–0 at the Twin Cities Open in Minneapolis. Chess 4.6 was rated 2040.[3] International Master Edward Lasker stated that year, "My contention that computers cannot play like a master, I retract. 
They play absolutely alarmingly. I know, because I have lost games to 4.7."[4] David Levy's bet (1978) Main article: David Levy (chess player) § Computer chess bet For a long time in the 1970s and 1980s, it remained an open question whether any chess program would ever be able to defeat the expertise of top humans. In 1968, International Master David Levy made a famous bet that no chess computer would be able to beat him within ten years. He won his bet in 1978 by beating Chess 4.7 (the strongest computer at the time). Cray Blitz (1981) In 1981, Cray Blitz scored 5–0 in the Mississippi State Championship. In round 4, it defeated Joe Sentef (2262) to become the first computer to beat a master in tournament play and the first computer to gain a master rating (2258).[5] HiTech (1988) In 1988, HiTech won the Pennsylvania State Chess Championship with a score of 4½–½. HiTech defeated International Master Ed Formanek (2485).[6] The Harvard Cup Man versus Computer Chess Challenge was organized by Harvard University. There were six challenges from 1989 until 1995. They played in Boston and New York City. In each challenge the humans scored higher and the highest scorer was a human''')

In [12]:
text_2

'This article documents the progress of significant human–computer chess matches. Chess computers were first able to beat strong chess players in the late 1980s. Their most famous success was the victory of Deep Blue over then World Chess Champion Garry Kasparov in 1997, but there was some controversy over whether the match conditions favored the computer. In 2002–2003, three human–computer matches were drawn, but, whereas Deep Blue was a specialized machine, these were chess programs running on commercially available computers. Chess programs running on commercially available desktop computers won decisive victories against human players in matches in 2005 and 2006. The second of these, against then world champion Vladimir Kramnik is (as of 2019) the last major human-computer match. Since that time, chess programs running on commercial hardware—more recently including mobile phones—have been able to defeat even the strongest human players. MANIAC (1956)In 1956 MANIAC, developed at Los

In [13]:
text_3 = ('''Surprised. Shocked. Stunned. 4th of April 2022 just left everybody nonplussed. The National Assembly was reeling. The opposition was fuming. The government was celebrating. The media was wondering. The people of Pakistan were enquiring. Nobody and nothing had prepared the political pundits for this move. This move’s unpredictability, its alacrity and its associated actions swept the National Assembly carpet from the feet of parliamentarians. The overflowing media gallery was caught shaking their heads. The breaking news went hoarse breaking loudly and repetitively. By the time people had gathered their wits, the National assembly was prorogued and the National assembly was dissolved. If politics is a game of chess, this was a checkmate. In the game of chess when a king is attacked, it is called check. A checkmate occurs when a king is placed in check and has no other moves to escape. When a checkmate happens, the game ends immediately, and the player who delivered the checkmate wins. They say in chess, checkmating your opponent should be your top priority since this will ensure your victory even if you have less material or if you have had a worse position throughout the game. It seems that the happening on 4th April was also a chess move. Whether it will lead to eventual victory despite the Supreme Court striking it down, remains to be seen.Let us see what are the moves and counter moves that government and opposition have played in this game of political chess: 1. The Super Powers discomfort- The super power becomes a super power by finding a “powerless” regional target that can be easily used as a base for its agenda in the region. In Central America where US has been involved in nearly 25 regime changes it targets weaker countries like Guatemala, Panama, Chile, Venezuela etc. In Middle East it used to be Iraq. Iraq rebelled and had to be tamed through a war. In South Asia it shifted between Afghanistan and Pakistan''')

In [14]:
text_3

'Surprised. Shocked. Stunned. 4th of April 2022 just left everybody nonplussed. The National Assembly was reeling. The opposition was fuming. The government was celebrating. The media was wondering. The people of Pakistan were enquiring. Nobody and nothing had prepared the political pundits for this move. This move’s unpredictability, its alacrity and its associated actions swept the National Assembly carpet from the feet of parliamentarians. The overflowing media gallery was caught shaking their heads. The breaking news went hoarse breaking loudly and repetitively. By the time people had gathered their wits, the National assembly was prorogued and the National assembly was dissolved. If politics is a game of chess, this was a checkmate. In the game of chess when a king is attacked, it is called check. A checkmate occurs when a king is placed in check and has no other moves to escape. When a checkmate happens, the game ends immediately, and the player who delivered the checkmate wins. 

In [15]:
# Creating a list of texts

texts = [text_1, text_2, text_3]

In [16]:
print(texts)

['Chess, one of the oldest and most popular board games, played by two opponents on a checkered board with specially designed pieces of contrasting colours, commonly white and black. White moves first, after which the players alternate turns in accordance with fixed rules, each player attempting to force the opponent’s principal piece, the King, into checkmate—a position where it is unable to avoid capture. Chess first appeared in India about the 6th century AD and by the 10th century had spread from Asia to the Middle East and Europe. Since at least the 15th century, chess has been known as the “royal game” because of its popularity among the nobility. Rules and set design slowly evolved until both reached today’s standard in the early 19th century. Once an intellectual diversion favoured by the upper classes, chess went through an explosive growth in interest during the 20th century as professional and state-sponsored players competed for an officially recognized world championship t

### Creating a word list

In [17]:
words_list = []
for text in texts:
    doc = nlp(text)
    # Keep the lemma of every token that is not a stop word, punctuation,
    # a number-like token, or a bare newline.
    text_words = [
        token.lemma_
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.like_num
        and token.text != '\n'
    ]
    words_list.append(text_words)

In [18]:
print(words_list)

[['Chess', 'old', 'popular', 'board', 'game', 'play', 'opponent', 'checkered', 'board', 'specially', 'design', 'piece', 'contrast', 'colour', 'commonly', 'white', 'black', 'white', 'move', 'player', 'alternate', 'turn', 'accordance', 'fix', 'rule', 'player', 'attempt', 'force', 'opponent', 'principal', 'piece', 'King', 'checkmate', 'position', 'unable', 'avoid', 'capture', 'Chess', 'appear', 'India', 'century', 'ad', 'century', 'spread', 'Asia', 'Middle', 'East', 'Europe', 'century', 'chess', 'know', 'royal', 'game', 'popularity', 'nobility', 'rule', 'set', 'design', 'slowly', 'evolve', 'reach', 'today', 'standard', 'early', 'century', 'intellectual', 'diversion', 'favour', 'upper', 'class', 'chess', 'go', 'explosive', 'growth', 'interest', 'century', 'professional', 'state', 'sponsor', 'player', 'compete', 'officially', 'recognize', 'world', 'championship', 'title', 'increasingly', 'lucrative', 'tournament', 'prize', 'organized', 'chess', 'tournament', 'postal', 'correspondence', 'gam

In [21]:
print(len(words_list))

3


In [22]:
print(len(words_list[0]))

216


In [23]:
print(len(words_list[1]))

477


In [24]:
print(len(words_list[2]))

155
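
The word list printed above keeps `'Chess'` and `'chess'` as separate entries, because the lemmas preserve the original capitalisation. One possible adjustment, shown here only as a sketch on a small made-up sample, is to lowercase each lemma before counting. In the loop above this would mean appending `token.lemma_.lower()` instead of `token.lemma_`; whether that is desirable depends on the data, since proper nouns lose their capitalisation too.

```python
from collections import Counter

# A few lemmas as they might come out of the loop above (illustrative sample).
lemmas = ["Chess", "old", "board", "game", "chess", "King", "chess", "game"]

raw_counts = Counter(lemmas)
lowered_counts = Counter(lemma.lower() for lemma in lemmas)

# In the raw counts the two spellings are counted separately.
print(raw_counts["Chess"], raw_counts["chess"])
# After lowercasing they are merged into a single entry.
print(lowered_counts["chess"])
```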


### Creating a corpus

In [25]:
corpus = []
from gensim.corpora import Dictionary

In [27]:
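# Note: naming this variable `dict` shadows Python's built-in dict type;
# the name is kept because later cells refer to it.
dict = Dictionary(words_list)
type(dict)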

gensim.corpora.dictionary.Dictionary

In [28]:
# Convert each document's word list into (token_id, count) pairs.
for doc_words in words_list:
    corpus.append(dict.doc2bow(doc_words))

In [29]:
print(corpus)

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 3), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 4), (29, 3), (30, 1), (31, 6), (32, 2), (33, 1), (34, 1), (35, 1), (36, 6), (37, 1), (38, 1), (39, 1), (40, 1), (41, 3), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 5), (65, 1), (66, 1), (67, 2), (68, 1), (69, 5), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 2), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 3), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 2), (97, 1), (98, 2), (99, 1), (100, 2), (101, 1), (102, 1), (103, 3), (104, 1), (105, 3), (106, 2), (107, 5), (108, 1), (109, 1), (110, 1)

In [30]:
len(corpus)

3

In [31]:
len(corpus[0])

151

In [32]:
len(corpus[1])

253

In [33]:
len(corpus[2])

120
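
Each document in the corpus is now a list of `(token_id, count)` pairs, a bag-of-words representation. Repeated words collapse into one pair, which is why each corpus entry is shorter than the matching word list (151 pairs versus 216 words for the first text). The sketch below mimics the idea with the standard library only. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the way ids are assigned here (order of first appearance) is an assumption that need not match gensim's numbering.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


docs = [["chess", "board", "game", "chess"], ["computer", "chess", "game"]]
vocab = build_vocabulary(docs)
bows = [to_bow(tokens, vocab) for tokens in docs]
print(vocab)
print(bows)
```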

### Creating an LDA model

In [35]:
# No random_state is set, so the topics can differ from run to run.
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=5, id2word=dict)

### Displaying the topics

In [36]:
lda.print_topics()

[(0,
  '0.029*"chess" + 0.021*"game" + 0.016*"computer" + 0.011*"human" + 0.011*"player" + 0.010*"Chess" + 0.010*"win" + 0.010*"play" + 0.009*"checkmate" + 0.007*"Dreyfus"'),
 (1,
  '0.025*"chess" + 0.023*"computer" + 0.015*"game" + 0.012*"human" + 0.011*"Chess" + 0.010*"win" + 0.009*"play" + 0.009*"Mac" + 0.008*"Hack" + 0.008*"VI"'),
 (2,
  '0.014*"chess" + 0.014*"game" + 0.009*"computer" + 0.009*"Chess" + 0.008*"file" + 0.007*"century" + 0.007*"square" + 0.006*"play" + 0.006*"tournament" + 0.006*"player"'),
 (3,
  '0.026*"computer" + 0.021*"chess" + 0.015*"game" + 0.014*"Chess" + 0.011*"player" + 0.011*"human" + 0.011*"play" + 0.008*"defeat" + 0.008*"program" + 0.008*"win"'),
 (4,
  '0.011*"chess" + 0.010*"computer" + 0.009*"game" + 0.008*"Chess" + 0.007*"play" + 0.005*"player" + 0.005*"Hack" + 0.005*"defeat" + 0.005*"human" + 0.005*"win"')]
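
Each topic is printed as a weighted sum of words, where the number in front of a word is the weight of that word in the topic. If the pairs are needed as data, the string can be parsed with a regular expression, as in this sketch (the `parse_topic` helper is hypothetical). As far as we know, gensim's `lda.show_topic(topic_id)` returns such pairs directly, so parsing is mainly useful when only the printed text is available.

```python
import re

# One topic string copied from the output above (shortened to four terms).
topic = '0.029*"chess" + 0.021*"game" + 0.016*"computer" + 0.011*"human"'


def parse_topic(topic_string):
    """Turn '0.029*"chess" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


print(parse_topic(topic))
```

Note that the five topics above share most of their top words (chess, game, computer), which is not surprising with five topics fitted to only three closely related documents.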

### Getting topics for a word

In [38]:
lda.get_term_topics('game')

[(0, 0.019751012), (1, 0.013281167), (2, 0.011446461), (3, 0.013203524)]

In [40]:
lda.get_term_topics('move')

[]

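**The word `move` is in the corpus, but its probability under every topic falls below the minimum-probability threshold that `get_term_topics` applies, so an empty list is returned. The same happens for `checkmate` further down.**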
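
The filtering behaviour can be imitated with a small stand-alone function. This is a sketch under assumptions: the probabilities are invented, the `term_topics` helper is hypothetical, and the threshold of 0.01 is what we believe gensim uses by default for `minimum_probability`.

```python
# Hypothetical per-topic probabilities for two words (not taken from the model).
term_topic_probs = {
    "game": [0.020, 0.013, 0.011, 0.013, 0.009],
    "move": [0.004, 0.003, 0.005, 0.002, 0.004],
}


def term_topics(word, minimum_probability=0.01):
    """Return (topic_id, probability) pairs at or above the threshold."""
    probs = term_topic_probs.get(word, [])
    return [(i, p) for i, p in enumerate(probs) if p >= minimum_probability]


print(term_topics("game"))  # topic 4 is filtered out
print(term_topics("move"))  # nothing passes, so the list is empty
print(term_topics("move", minimum_probability=0.001))  # lower threshold
```

With the real model, passing a smaller `minimum_probability` to `lda.get_term_topics` should likewise reveal the low-probability topics for a word.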

In [42]:
lda.get_term_topics('human')

[(1, 0.010753638)]

In [43]:
lda.get_term_topics('computer')

[(0, 0.014984815), (1, 0.021099873), (3, 0.024760805)]

In [45]:
lda.get_term_topics('checkmate')

[]

### Visualization of the topics

In [46]:
pyLDAvis.enable_notebook()

In [47]:
plot = pyLDAvis.gensim_models.prepare(lda, corpus = corpus, dictionary = lda.id2word)



In [49]:
plot

**Each circle in the plot represents one topic, and the size of a circle shows how prevalent that topic is in the corpus. The right-hand panel lists the 30 most relevant terms for the selected topic. The blue bars show the overall frequency of each term in the corpus, while the red bars show the estimated frequency of the term within the selected topic.**