
# Lab 6. Natural Language Processing
# Task 6.1 Topic Modeling
# Problem Description

Topic modeling is a method used in text mining to discover abstract topics within a collection of documents. The method involves the following steps:

1. Data Collection: Gather and compile the text data from various sources, such as articles, books, social media posts, or any other text-rich documents.

2. Preprocessing: Clean the data by removing stop words and stemming each remaining word with SnowballStemmer, one of the stemmers provided by NLTK.

3. Feature Extraction: Transform the preprocessed text into a format that can be analyzed and build a document-term matrix using the bag-of-words (BoW) model.

4. Model Selection: Choose a topic modeling algorithm. One of the most common is Latent Dirichlet Allocation (LDA), which is used in this task.

5. Training the Model: Run the chosen algorithm on your document-term matrix. The model will attempt to discover patterns and group words into topics based on their distribution across the documents.

6. Topic Interpretation: Analyze the output of the model, which includes a list of topics and the words that are most representative of each topic.

7. Application: Apply the trained LDA model to a new text.
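
Steps 2 and 3 can be illustrated without any external libraries. The following is a minimal sketch, not the pipeline used in this notebook: the tiny stop-word set and the `tokenize` and `bag_of_words` helpers are hypothetical, stemming is omitted, and a regular expression is assumed for tokenization so that punctuation does not stay attached to words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (NLTK's list is much longer).
STOP_WORDS = {"the", "of", "a", "is", "and", "in", "are"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, rows); each row maps a word id to its count."""
    tokenized = [tokenize(d) for d in docs]
    unique_words = sorted({w for tokens in tokenized for w in tokens})
    vocabulary = {w: i for i, w in enumerate(unique_words)}
    rows = [dict(Counter(vocabulary[w] for w in tokens)) for tokens in tokenized]
    return vocabulary, rows


sample_docs = [
    "Technology is reducing emissions. Technology matters.",
    "The art of cooking is a cultural art.",
]
vocabulary, rows = bag_of_words(sample_docs)
print(vocabulary)
print(rows)
```

Each row is a sparse representation of one document: only words that occur in it are stored, which is the same idea as the list of `(id, count)` pairs that a document-term matrix holds.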

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora


In [None]:
documents = [
  """
  In the realm of environmental sustainability, the role of technology has been pivotal.
  Innovative solutions for renewable energy, like solar panels and wind turbines, are transforming how we power our cities.
  Meanwhile, advancements in electric vehicles and battery storage are essential in the fight against climate change.
  The impact of technology on reducing carbon emissions is undeniable.
  Additionally, conservation efforts are being augmented by technology, from wildlife tracking systems to data analysis of climate patterns.
  The intersection of technology and environmentalism is creating new pathways for sustainable development.
  """,
  """
  The culinary arts are a profound expression of cultural heritage.
  Traditional recipes passed down through generations are more than just instructions for preparing food; they are a window into a culture’s history and values.
  Each ingredient, technique, and flavor tells a story of geography, trade, and human connection. The evolution of regional cuisines reflects the blending of diverse cultures and histories.
  For instance, the fusion of indigenous ingredients and colonial influences can be seen in many Latin American dishes. Moreover, the art of cooking is a medium of preserving and celebrating cultural identity in a globalized world.
  """
]

In [None]:
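# Clean the data: remove stop words, then stem the remaining words
nltk.download('stopwords')
stemmer = SnowballStemmer('english')
stop_words = stopwords.words('english')
# Note: split() tokenizes on whitespace only, so punctuation stays attached
# to words (for example, "technology." is tokenized with its trailing period).
texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in documents
]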

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Create a dictionary from the words
dictionary = corpora.Dictionary(texts)

# Create a document-term matrix
doc_term_mat = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model
num_topics = 2
ldamodel = models.ldamodel.LdaModel(doc_term_mat,
        num_topics=num_topics, id2word=dictionary, passes=25)


In [None]:
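num_words = 5

# Print each topic in gensim's raw 'weight*"term" + ...' format
for i in range(num_topics):
    print(ldamodel.print_topic(i, topn=num_words))

# Print the same topics as one term per line, with weights as percentages
print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        # Each entry looks like 0.033*"technolog"; the quotes are kept in the output
        details = text.split('*')
        print("%-12s:%0.2f%%" % (details[1], 100*float(details[0])))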


0.033*"technolog" + 0.024*"environment" + 0.024*"climat" + 0.014*"vehicl" + 0.014*"power"
0.043*"cultur" + 0.024*"art" + 0.014*"seen" + 0.014*"mani" + 0.014*"window"

Top 5 contributing words to each topic:

Topic 0
"technolog" :3.30%
"environment":2.40%
"climat"    :2.40%
"vehicl"    :1.40%
"power"     :1.40%

Topic 1
"cultur"    :4.30%
"art"       :2.40%
"seen"      :1.40%
"mani"      :1.40%
"window"    :1.40%

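Splitting the topic string on `*` leaves the double quotes around each term, which is why they appear in the listing above. Below is a minimal sketch of a parser that strips them, assuming the `weight*"term" + ...` format shown in the output. `parse_topic` is a hypothetical helper, not part of gensim, and the input string is copied from the output above.

```python
def parse_topic(topic_string):
    """Turn '0.033*"technolog" + 0.024*"climat"' into [(term, weight), ...]."""
    pairs = []
    for part in topic_string.split(" + "):
        weight, term = part.split("*", 1)
        # Remove surrounding whitespace and the double quotes around the term.
        pairs.append((term.strip().strip('"'), float(weight)))
    return pairs


topic_0 = ('0.033*"technolog" + 0.024*"environment" + 0.024*"climat" '
           '+ 0.014*"vehicl" + 0.014*"power"')
for term, weight in parse_topic(topic_0):
    print(f"{term:<12}: {100 * weight:.2f}%")
```

When the model object is available, gensim's `show_topic` method returns `(term, probability)` pairs directly, which avoids string parsing altogether.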

In [None]:
new_docs = [
  """
  Innovative culinary techniques are increasingly focusing on sustainability,
  blending the art of cooking with technology to minimize environmental impact.
  Chefs are utilizing modern technology to source local, organic ingredients,
  reducing carbon footprints associated with food transportation.
  Techniques like sous-vide and molecular gastronomy not only enhance flavor but also optimize energy use.
  Additionally, technology in agriculture, such as precision farming,
  directly influences the quality and sustainability of ingredients used in the culinary arts.
  This fusion highlights the importance of technology in evolving culinary traditions while emphasizing environmental responsibility.
  """
]

new_texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in new_docs
  ]
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])


[(0, 0.5712285), (1, 0.42877153)]
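
One caveat when reading this result: `doc2bow` ignores tokens that are not in the training dictionary, so the distribution is inferred only from words the model has already seen. The sketch below shows one way to check how much of a new document is covered. `vocabulary_coverage` is a hypothetical helper, and the term set and token list are small illustrative stand-ins, not the notebook's actual data.

```python
def vocabulary_coverage(tokens, known_terms):
    """Split tokens into known and unknown ones and return the known share."""
    known = [t for t in tokens if t in known_terms]
    unknown = [t for t in tokens if t not in known_terms]
    share = len(known) / len(tokens) if tokens else 0.0
    return known, unknown, share


# Hypothetical stand-ins for the training vocabulary and the new document's tokens.
known_terms = {"technolog", "environment", "art", "cultur", "ingredi"}
new_tokens = ["culinari", "technolog", "art", "gastronomi", "ingredi"]

known, unknown, share = vocabulary_coverage(new_tokens, known_terms)
print("known:  ", known)
print("unknown:", unknown)
print(f"coverage: {share:.0%}")
```

In the notebook, the set of known terms could be taken from the keys of `dictionary.token2id`, and the tokens from `new_texts[0]`.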


# Discussion

In this task, we employed Latent Dirichlet Allocation (LDA), a popular topic modeling technique, to analyze and uncover underlying themes in a set of documents. The primary goal was to identify distinct topics and understand how different documents relate to these topics.

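Findings (terms are listed in unstemmed form; each percentage is the term's probability within the topic):

Topic 0: Characterized by words such as "technology" (3.30%), "environment" (2.40%), "climate" (2.40%), "vehicle" (1.40%), and "power" (1.40%). These terms come from the document on technology and the environment.

Topic 1: Defined by terms like "culture" (4.30%), "art" (2.40%), "seen" (1.40%), "many" (1.40%), and "window" (1.40%). The first two terms reflect the document on culinary arts and cultural heritage, while "seen", "many", and "window" carry little topical meaning.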

Applying the trained model to a new document gives the following topic distribution:

(0, 0.5712285) indicates that about 57.12% of the document is attributed to Topic 0.

(1, 0.42877153) indicates that about 42.88% of the document is attributed to Topic 1.

The document is a mix of discussions about both technology/environment and cultural arts, but with a slightly greater emphasis on the technology and environment aspect.
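
The reading of the topic distribution above can be sketched in a few lines. The `topic_labels` mapping is an assumption: LDA topics are unlabeled, and the names here are our interpretation of the top terms. The vector is copied from the notebook output.

```python
# Topic distribution copied from the notebook output for the new document.
vector = [(0, 0.5712285), (1, 0.42877153)]

# Hypothetical human-readable labels; LDA itself only yields numbered topics.
topic_labels = {0: "technology/environment", 1: "culture/culinary arts"}

for topic_id, prob in vector:
    print(f"Topic {topic_id} ({topic_labels[topic_id]}): {100 * prob:.2f}%")

# The dominant topic is the one with the highest probability.
dominant_id, dominant_prob = max(vector, key=lambda pair: pair[1])
print("Dominant topic:", dominant_id)
```

Because the two probabilities are fairly close, the dominant topic is best read as a mild tendency rather than a firm classification.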
