
# Task 6.3 Topic Modelling
## Problem Descriptions

Topic modelling aims to discover the hidden semantic structure of a large text corpus. Its applications include automatic categorisation of documents, text mining, and text information retrieval.

Latent Dirichlet allocation (LDA) is a common method for topic modelling. It assumes that each document in a corpus is a mixture of one or more hidden topics, and that each topic is a probability distribution over words. Fitting the model means inferring these hidden topics and their supporting words from the observed documents, by approximating the posterior distribution of the topics given the words in the corpus.
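The generative assumption behind LDA can be illustrated with a small simulation. This is only a sketch using the Python standard library: the vocabulary, the topic-word probabilities, and the helper names `sample_dirichlet` and `generate_document` are made up for illustration and are not part of the notebook or of `gensim`.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's topic mixture.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical vocabulary and three hand-written topics (each row sums to 1).
vocabulary = ["film", "effect", "lego", "robot", "orwel", "farm"]
topic_word_probs = [
    [0.45, 0.45, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.45, 0.45, 0.025, 0.025],
    [0.025, 0.025, 0.025, 0.025, 0.45, 0.45],
]

rng = random.Random(0)
theta, words = generate_document(topic_word_probs, vocabulary, [0.5, 0.5, 0.5], 8, rng)
print([round(p, 2) for p in theta])  # the document's topic mixture
print(words)                         # the generated words
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.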

## Implementation and Results

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora


In [None]:
documents = [
  """
  Though Star Wars initially opened in only 42 theatres, 
  the film earned almost $3 million in its first week and had grossed 
  $100 million by the end of the summer. The film won six Academy Awards 
  along with a special-achievement award for accomplishments in sound, 
  and it revolutionized the motion picture industry with its advancements 
  in special effects. Lucas’s effects company, Industrial Light and Magic (ILM), 
  designed a slew of imaginative alien creatures and mechanical “droids” that 
  populated a variety of exotic locales. Perhaps most impressive, however, 
  were the elaborate space battles accomplished with scaled miniatures. 
  The series continued to make remarkable advancements in the field of 
  special effects into the 21st century, and ILM became one of the most 
  successful effects studios utilized by Hollywood. Lucas followed the first 
  Star Wars film with two sequels, Star Wars: Episode V—The Empire Strikes 
  Back (1980) and Star Wars: Episode VI—Return of the Jedi (1983). The 
  franchise thrived in the 1980s and ’90s through the release of videos, a 
  substantial merchandise line, and the theatrical re-release of 
  the trilogy in 1997.
  """,
  """
  Lego’s own MI6, its top-secret R & D lab, is on the second floor of a 
  drab brick structure called the Tech Building. Inside, gearheads in jeans 
  and fleece pullovers are surrounded by enough electronic ganglia to 
  jump-start Frankenstein’s monster. Amid a spaghetti of wires and a blaze of 
  red, green, blue, yellow and purple blocks is an amazing array of robot 
  prototypes, all capable of exasperating behavior. Some of these marvels 
  propel themselves on Lego wheels; others skitter around on Lego legs. 
  There’s a scorpionlike robot that turns sharply, snaps its claws and 
  searches for an infrared beacon “bug.” There’s a Mohawked android that 
  flings little red balls as it rumbles. And there’s a fanged robot snake that, 
  with the wave of a smartphone, shakes, rattles and rolls. Dangle your cell 
  in front of the serpent’s head and it lunges to bite you.
  With bricks, action and hues as vibrant as tropical sunsets, 
  Lego created a way for novices to learn the basics of structural engineering: 
  bracing, tension and compression, loading constraints, building to scale. 
  By combining Lego bricks to sensors, servo motors and microprocessors, those 
  novices can now explore everything from basic pulleys and belts to computer 
  programming. “Mindstorms EV3 makes tinkering with machines cool again,” says 
  Ralph Hempel, author of Lego Spybiotics Secret Agent Training Manual. 
  """, 
  """
  It’s no exaggeration to say that George Orwell is one of the most 
  influential writers the UK has ever produced. His novels, including 
  Nineteen Eighty-Four and Animal Farm, continue to be relevant today, 
  and his non-fiction writing on truth, politics and society remain 
  as sharply observed as ever. Born Eric Blair in 1903 in India, 
  Orwell’s father served the British Empire. While Orwell’s first 
  job was as a policeman in Burma, he is well-known for being anti-imperialist.
  By the time he dies in 1950, Orwell was a successful and respected novelist
  and journalist. If you don’t know where to start with his books, our
  handy reading guide will help.
  There was a moment in 1944 when George Orwell almost lost the original manuscript 
  for Animal Farm, meaning we could now be without one of the greatest 
  allegorical novels ever written in English. The completed manuscript for the 
  book was at Orwell’s home in Kilburn when a V1 flying bomb destroyed the house.
  Orwell’s adopted son Richard Blair told the Ham & High newspaper that 
  the author “spent hours and hours rifling through (bomb site) rubbish)” and 
  finally found the manuscript. Animal Farm was written between 1943 and 1944, 
  after Orwell’s experiences during the Spanish Civil War, and was a 
  condemnation of Stalin’s rule. In modern times, Animal Farm has remained 
  popular for its insight into authoritarian control or governments that buy 
  into their power too much. And its popularity has seen it published in a 
  number of new editions, including this graphic novel, illustrated by Odyr.
  """


]

In [None]:
# Clean the data by using stemming and stopwords removal
nltk.download('stopwords')
stemmer = SnowballStemmer('english')
stop_words = stopwords.words('english')
texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in documents
  ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
texts

[['though',
  'star',
  'war',
  'initi',
  'open',
  '42',
  'theatres,',
  'film',
  'earn',
  'almost',
  '$3',
  'million',
  'first',
  'week',
  'gross',
  '$100',
  'million',
  'end',
  'summer.',
  'film',
  'six',
  'academi',
  'award',
  'along',
  'special-achiev',
  'award',
  'accomplish',
  'sound,',
  'revolution',
  'motion',
  'pictur',
  'industri',
  'advanc',
  'special',
  'effects.',
  'luca',
  'effect',
  'company,',
  'industri',
  'light',
  'magic',
  '(ilm),',
  'design',
  'slew',
  'imagin',
  'alien',
  'creatur',
  'mechan',
  '“droids”',
  'popul',
  'varieti',
  'exot',
  'locales.',
  'perhap',
  'impressive,',
  'however,',
  'elabor',
  'space',
  'battl',
  'accomplish',
  'scale',
  'miniatures.',
  'seri',
  'continu',
  'make',
  'remark',
  'advanc',
  'field',
  'special',
  'effect',
  '21st',
  'century,',
  'ilm',
  'becam',
  'one',
  'success',
  'effect',
  'studio',
  'util',
  'hollywood.',
  'luca',
  'follow',
  'first',
  'star',


In [None]:
# Create a dictionary from the words
dictionary = corpora.Dictionary(texts)

# Create a document-term matrix
doc_term_mat = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model 
num_topics = 3
ldamodel = models.ldamodel.LdaModel(doc_term_mat, 
        num_topics=num_topics, id2word=dictionary, passes=25)


In [None]:
doc_term_mat

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 2),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 2),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 3),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 2),
  (32, 1),
  (33, 1),
  (34, 3),
  (35, 2),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 2),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 2),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 2),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 2),
  (75, 1),
  (76, 4),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1),
  (82, 1),
  (83, 1),
  (84, 1),
  (85, 1),
  (86, 1),
  (87, 1),
  (88, 1),
  (89, 1),
  (90, 1),
  (91, 1)

In [None]:
# Number of top words to show for each topic
num_words = 5

# Print each topic as a raw 'weight*"word" + ...' string
for i in range(num_topics):
  print(ldamodel.print_topic(i, topn=num_words))

# Print the same top words as percentages, one word per line
print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        details = text.split('*')
        print("%-12s:%0.2f%%" %(details[1], 100*float(details[0])))


0.022*"orwel" + 0.011*"star" + 0.011*"anim" + 0.009*"film" + 0.009*"first"
0.003*"orwel" + 0.003*"anim" + 0.003*"farm" + 0.003*"lego" + 0.003*"remain"
0.025*"lego" + 0.013*"robot" + 0.013*"there" + 0.009*"structur" + 0.009*"brick"

Top 5 contributing words to each topic:

Topic 0
"orwel"     :2.20%
"star"      :1.10%
"anim"      :1.10%
"film"      :0.90%
"first"     :0.90%

Topic 1
"orwel"     :0.30%
"anim"      :0.30%
"farm"      :0.30%
"lego"      :0.30%
"remain"    :0.30%

Topic 2
"lego"      :2.50%
"robot"     :1.30%
"there"     :1.30%
"structur"  :0.90%
"brick"     :0.90%


In [None]:
new_docs = [
  """
  We’ve been working with LEGO Star Wars since the beginning, 
  Chris and I, and we acknowledged right away that there was a huge 
  [adult audience]. We just saw it in the office when we started with 
  LEGO Star Wars — how many of my colleagues were super interested and 
  into that. It was already, in the year 2000, the second year of LEGO Star 
  Wars, that we introduced the Ultimate Collector Series models. At that time, 
  I think they were marked 16+, but we knew that these were something that 
  were definitely more complex and targeted to an older audience. And because 
  of that we had a special packaging. Do you remember, Chris? 
  It was black and white.
  """
]

new_texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in new_docs
  ]
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])


[(0, 0.6364652), (1, 0.026084848), (2, 0.33744997)]


In [None]:
new_docs = [
  """
  In the decades since the publication of George Orwell’s seminal work of 
  anti-Stalinist satire, we have seen the collapse of the regime that 
  disturbed and inspired its author; the beginning and end of the 
  Cold War, with all its attendant horrors; and the rise and fall of 
  any number of would-be Napoleons, both at home and abroad. Animal Farm, 
  once a work so controversial that it seemed unlikely to find a publisher, 
  has served for so long, and in so many school curriculums, as the 
  predominant introduction to the concept of totalitarianism that it is 
  in danger of being perceived as trite.
  """
]

new_texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in new_docs
  ]
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])





[(0, 0.9386559), (1, 0.030818595), (2, 0.03052543)]


## Discussions

In this task, I performed topic modelling using an LDA model.
The three documents in the corpus were taken from different online articles on the topics "Star Wars", "LEGO", and "George Orwell's Animal Farm".

I then cleaned the data by removing stop words and performing stemming with the `SnowballStemmer` provided by the `nltk` package.
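The printed `texts` output shows that splitting on whitespace leaves punctuation attached to some tokens (for example `'theatres,'` and `'effects.'`), so `'effects.'` and `'effect'` are counted as different words. One possible refinement, sketched below with the standard library only, is to extract runs of letters and digits before removing stop words. The `tokenize` helper and the small `STOP_WORDS` set are illustrative assumptions, not the notebook's code, and stemming is left out to keep the sketch self-contained.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "in", "and", "its", "had", "by", "of", "a"}

def tokenize(document, stop_words=STOP_WORDS):
    """Lowercase, keep runs of letters and digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [tok for tok in tokens if tok not in stop_words]

sample = "The film earned almost $3 million in its first week, and had grossed $100 million."
print(tokenize(sample))
# ['film', 'earned', 'almost', '3', 'million', 'first', 'week', 'grossed', '100', 'million']
```

A simple pattern like this also splits words at apostrophes and hyphens, so it is a starting point, not a complete tokenizer.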

Then a dictionary was created from the cleaned words, and a document-term matrix was built using the bag-of-words model.
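To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what a dictionary and a `(token_id, count)` representation look like. The helpers `build_vocabulary` and `to_bow` are hypothetical stand-ins for illustration; the ids that `gensim` assigns may differ from the first-appearance order used here.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for text in texts:
        for token in text:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(text, token_to_id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token_to_id[t] for t in text if t in token_to_id)
    return sorted(counts.items())

texts = [["star", "war", "film", "film"], ["lego", "robot", "lego"]]
token_to_id = build_vocabulary(texts)
print(to_bow(["lego", "film", "film", "orwel"], token_to_id))
# [(2, 2), (3, 1)]  -> 'film' (id 2) twice, 'lego' (id 3) once, 'orwel' ignored
```

As in this sketch, words in a new document that never appeared in the training corpus are dropped, so they cannot influence the topic scores.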

I then applied the LDA model to find the topics and their supporting words (where I opted to print the top 5 supporting words per topic).
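The topic strings printed by the model have the form `weight*"word" + weight*"word" + ...`. The sketch below parses one such string, copied from the output of this notebook, into `(word, weight)` pairs; `parse_topic` is a hypothetical helper written for illustration.

```python
def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' topic string into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip('"'), float(weight)))
    return pairs

# Topic 0 as printed in the notebook output.
topic_0 = '0.022*"orwel" + 0.011*"star" + 0.011*"anim" + 0.009*"film" + 0.009*"first"'
for word, weight in parse_topic(topic_0):
    print(f"{word:<12}{weight:.1%}")
```

Stripping the quotation marks gives cleaner labels than printing the raw split string.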

After that, I applied the trained model to new text to find out whether it could relate the new text to the topics learned from the corpus, and to what degree the new text corresponds to each topic.
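The model returns a list of `(topic_id, probability)` pairs for each new document. A small sketch of how such a result can be reduced to its dominant topic follows; the vector is copied from the first output printed in this notebook, and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# Topic distribution printed for the "LEGO Star Wars" example.
lego_star_wars = [(0, 0.6364652), (1, 0.026084848), (2, 0.33744997)]

topic_id, prob = dominant_topic(lego_star_wars)
print(f"dominant topic: {topic_id} ({prob:.1%})")
# dominant topic: 0 (63.6%)
```

The topic id only becomes meaningful once it is matched against that topic's top words, since LDA does not name its topics.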



## Results

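I applied the model to two new text examples. The first is about "LEGO Star Wars" and is a transcription of an interview I found online. In the run shown above, the model assigned this text about 64% to topic 0, about 3% to topic 1, and about 34% to topic 2. Judging by the top words, topic 2 corresponds to "LEGO" (`lego`, `robot`, `brick`), while topic 0 mixes "Star Wars" words (`star`, `film`) with "George Orwell's Animal Farm" words (`orwel`, `anim`). Topic 1 has near-uniform weights and no clear theme.

For the second example, I took an excerpt from an article about the 75th anniversary of Orwell's Animal Farm.

In the run shown above, the model assigned this text about 94% to topic 0 and about 3% each to topics 1 and 2.

The model therefore separates the "LEGO" content from the rest: it gives the interview a substantial LEGO share and the Orwell excerpt almost none. With only three training documents, however, this run did not learn separate topics for "Star Wars" and "Animal Farm", so these scores cannot tell the two apart. Because LDA training is randomised and no `random_state` was fixed, the topics and percentages change from run to run.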
