# Topic modelling

## Problem description
This notebook performs topic modelling with Latent Dirichlet Allocation (LDA).
Topic modelling aims to discover the hidden semantic structures of a large text
corpus. Its applications include automatic categorisation of documents,
text mining and text information retrieval (course material).

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora


In [2]:
documents = [
  """
  Artificial intelligence (AI), sometimes called machine
  intelligence, is intelligence demonstrated by machines, unlike
  the natural intelligence displayed by humans and animals. Leading
  AI textbooks define the field as the study of "intelligent
  agents": any device that perceives its environment and takes
  actions that maximize its chance of successfully achieving its
  goals. Colloquially, the term "artificial intelligence" is often
  used to describe machines (or computers) that mimic "cognitive"
  functions that humans associate with the human mind, such
  as "learning" and "problem solving".
  """,
  """
  Association football, more commonly known as football or
  soccer, is a team sport played with a spherical ball between
  two teams of 11 players. It is played by approximately 250
  million players in over 200 countries and dependencies, making it
  the world's most popular sport. The game is played on a
  rectangular field called a pitch with a goal at each end. The
  object of the game is to outscore the opposition by moving the
  ball beyond the goal line into the opposing goal. The team with
  the higher number of goals wins the game.  
  """
]

In [3]:
# Clean the data: lowercase, split on whitespace, drop stop words, then stem
nltk.download('stopwords')
stemmer = SnowballStemmer('english')
stop_words = stopwords.words('english')
texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in documents
  ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
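
The cleaning step splits on whitespace, so punctuation stays attached to tokens such as `(ai),`. The sketch below shows one possible alternative that keeps only alphabetic runs. It uses the standard library only; `SAMPLE_STOP_WORDS` and `tokenize` are hypothetical names introduced for illustration, and the tiny stop word set stands in for NLTK's full English list.

```python
import re

# Hypothetical miniature stop word list, for illustration only;
# the notebook uses NLTK's full English list instead.
SAMPLE_STOP_WORDS = {"a", "is", "by", "the", "of", "and", "or", "with"}


def tokenize(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, keep alphabetic runs only, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


sample = 'Artificial intelligence (AI), sometimes called "machine intelligence".'

# Whitespace splitting keeps brackets, commas and quotes inside the tokens.
whitespace_tokens = sample.lower().split()
# Regex tokenisation yields clean words that can then be passed to the stemmer.
regex_tokens = tokenize(sample)

print(whitespace_tokens)
print(regex_tokens)
```

Whether this changes the learned topics on such a small corpus is not something the notebook tests; it mainly keeps the dictionary free of tokens that differ only by punctuation.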


In [4]:
# Create a dictionary from the words
dictionary = corpora.Dictionary(texts)

# Create a document-term matrix
doc_term_mat = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model 
num_topics = 2
ldamodel = models.ldamodel.LdaModel(doc_term_mat, 
        num_topics=num_topics, id2word=dictionary, passes=25)


In [5]:
num_words = 5
for i in range(num_topics):
  print(ldamodel.print_topic(i, topn=num_words))

print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        details = text.split('*')
        print("%-12s:%0.2f%%" %(details[1], 100*float(details[0])))


0.035*"human" + 0.035*"intellig" + 0.025*"machin" + 0.015*"associ" + 0.015*"field"
0.036*"team" + 0.036*"play" + 0.036*"goal" + 0.026*"game" + 0.026*"ball"

Top 5 contributing words to each topic:

Topic 0
"human"     :3.50%
"intellig"  :3.50%
"machin"    :2.50%
"associ"    :1.50%
"field"     :1.50%

Topic 1
"team"      :3.60%
"play"      :3.60%
"goal"      :3.60%
"game"      :2.60%
"ball"      :2.60%

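The formatted output above keeps the quotation marks around each word because the topic string is split without stripping them. A minimal sketch of a parser that returns clean `(word, weight)` pairs is given here; `parse_topic_string` is a hypothetical helper, and the input is the Topic 0 string printed above. As far as I know, gensim's `show_topic` method already returns such pairs directly, which would avoid string parsing altogether.

```python
def parse_topic_string(topic_string):
    """Turn '0.035*"human" + 0.025*"machin"' into [('human', 0.035), ...]."""
    pairs = []
    for term in topic_string.split(" + "):
        # Each term looks like 0.035*"human": weight, asterisk, quoted word.
        weight, word = term.split("*", 1)
        pairs.append((word.strip('"'), float(weight)))
    return pairs


topic_0 = '0.035*"human" + 0.035*"intellig" + 0.025*"machin" + 0.015*"associ" + 0.015*"field"'

for word, weight in parse_topic_string(topic_0):
    # Left-align the word and print the weight as a percentage.
    print(f"{word:<12}: {weight:.1%}")
```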

In [6]:
new_docs = [
  """
Artificial intelligence was founded as an academic discipline in 1956, 
and in the years since has experienced several waves of optimism,
followed by disappointment and the loss of funding (known as an "AI winter"),
 followed by new approaches, success and renewed funding.
  AI research has tried and discarded many different approaches since 
  its founding, including simulating the brain, modeling human problem solving, 
  formal logic, large databases of knowledge and imitating animal behavior. 
  In the first decades of the 21st century, highly mathematical-statistical 
  machine learning has dominated the field, and this technique has proved 
  highly successful, helping to solve many challenging problems throughout 
  industry and academia.
  
  """
]

new_texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in new_docs
  ]
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])


[(0, 0.9264854), (1, 0.07351459)]


## Discussion

We used LDA to perform topic modelling.
Latent Dirichlet allocation (LDA) is a common method for topic modelling. It is based
on the assumption that each document in a corpus is composed of one or more
hidden topics, and that each topic is supported by a number of words. The process
finds these hidden topics and their supporting words by inferring the topics and
word distributions that are most probable given the whole corpus.
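
The assumption described above can be read as a generative story: each document draws a mixture over topics, and each word is drawn from one of those topics. The sketch below simulates that story with the standard library only. It is an illustration, not what gensim does internally: the vocabulary, the topic-word probabilities and the `alpha` values are hypothetical, and the topic-word distributions are fixed by hand here, whereas LDA also treats them as unknowns to be inferred.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocabulary, doc_length, alpha, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw the document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(doc_length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)
vocabulary = ["intellig", "machin", "human", "team", "goal", "ball"]
# Hypothetical topic-word distributions (each row sums to 1).
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # an "AI"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "football"-like topic
]

theta, words = generate_document(topic_word_probs, vocabulary, 8, [0.5, 0.5], rng)
print("Topic mixture:", theta)
print("Generated words:", words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the topic-word distributions.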

For preprocessing, we remove stop words using NLTK's English stop word list and reduce the remaining words to their stems with the Snowball stemmer.
We also display the top words for each topic:

Topic 0
"human"     :3.50%
"intellig"  :3.50%
"machin"    :2.50%
"associ"    :1.50%
"field"     :1.50%

Topic 1
"team"      :3.60%
"play"      :3.60%
"goal"      :3.60%
"game"      :2.60%
"ball"      :2.60%

Finally, we pass a new article to the model to see which topic it most likely belongs to:

 [(0, 0.9264854), (1, 0.07351459)]

The model assigns the new article a probability of about 93% of belonging to Topic 0, the artificial intelligence topic.
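
Reading the dominant topic off this result can also be done in code. This is a minimal sketch that hard-codes the vector printed above rather than querying the model; the variable names are chosen for illustration.

```python
# Topic distribution of the new article, copied from the output above.
vector = [(0, 0.9264854), (1, 0.07351459)]

# Pick the (topic_id, probability) pair with the highest probability.
dominant_topic, probability = max(vector, key=lambda pair: pair[1])

print(f"Topic {dominant_topic} with probability {probability:.1%}")
# → Topic 0 with probability 92.6%
```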
