<a href="https://colab.research.google.com/github/ummeamunira/NLP-LLM/blob/main/topic-modelling/Topic_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample data
documents = [
    "Oil leak detected in the main pipeline.",
    "Regular maintenance required for the pump.",
    "Environmental issue due to oil spill.",
    "Safety hazard in the drilling operation.",
    "Pump station requires urgent maintenance.",
    "Oil spill causing environmental damage.",
    "Detected hazardous conditions in the pipeline.",
]

# Vectorize the documents
vectorizer = CountVectorizer(stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)

# Train LDA model
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

# Display the topics and their top words
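def display_topics(model, feature_names, num_top_words):
    # Each row of model.components_ holds one topic's word weights
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic #{topic_idx}:")
        # argsort is ascending, so reverse it and keep the first num_top_words indices
        top_indices = topic.argsort()[::-1][:num_top_words]
        print(" ".join(feature_names[i] for i in top_indices))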

num_top_words = 5
feature_names = vectorizer.get_feature_names_out()
display_topics(lda, feature_names, num_top_words)


Topic #0:
oil spill environmental causing damage
Topic #1:
maintenance pump requires station urgent
Topic #2:
detected pipeline main leak hazard


The oil and gas industry generates vast amounts of textual data, including technical reports, maintenance logs, safety incident reports, research papers, and operational manuals. Understanding the main themes or topics within this data can be challenging due to the sheer volume and complexity. Topic modeling can be used to automatically identify and categorize these themes, aiding in better data organization, trend analysis, and decision-making.

**Goal:**
Apply topic modeling to a collection of maintenance logs and safety incident reports to identify the predominant topics. This can help in understanding common issues, identifying areas that need attention, and improving overall safety and maintenance practices.

Data Collection:

Collect a dataset of maintenance logs and safety incident reports. In practice this would typically be a CSV file with a `report` column containing the text. The cell below builds a small in-memory sample with the same column instead.
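For the CSV case described above, loading could look roughly like the sketch below. The file name `reports.csv` and the variable `df_csv` are assumptions for illustration. An in-memory buffer stands in for the file so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for a real file: two rows copied from this notebook's sample reports
csv_text = io.StringIO(
    "report\n"
    '"Detected an oil leak in the main pipeline."\n'
    '"Routine maintenance required for the offshore rig."\n'
)

# With a real file, pass its path instead, e.g. pd.read_csv("reports.csv") (hypothetical name)
df_csv = pd.read_csv(csv_text)

# Drop rows with missing text, then cast to str so later string operations are safe
df_csv = df_csv.dropna(subset=["report"])
df_csv["report"] = df_csv["report"].astype(str)

print(df_csv.shape)
```

Dropping empty rows before preprocessing avoids errors when a string function is applied to a missing value.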

In [2]:
import pandas as pd

# Sample data
data = {
    'report': [
        "Detected an oil leak in the main pipeline.",
        "Routine maintenance required for the offshore rig.",
        "Environmental issue due to oil spill near the coast.",
        "Incident report: Safety hazard during drilling operation.",
        "Urgent maintenance needed for the pump station.",
        "Oil spill causing significant environmental damage.",
        "Detected hazardous conditions in the pipeline section.",
        "Scheduled maintenance for the drilling equipment.",
        "Safety incident: Gas leak detected in the refinery.",
        "Maintenance report: Replacing worn-out valves in the pipeline."
    ]
}

df = pd.DataFrame(data)


Data Preprocessing:

Clean and preprocess the text data: lowercase it, split it into tokens, and remove punctuation and stop words. Stemming or lemmatization is a common additional step, but the cell below does not apply it.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk
from nltk.corpus import stopwords
import string

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to preprocess text
def preprocess_text(text):
    # Lowercase, split on whitespace, and strip leading/trailing punctuation from each token
    tokens = [word.strip(string.punctuation) for word in text.lower().split()]
    # Remove stop words and tokens that were pure punctuation (now empty strings)
    tokens = [word for word in tokens if word and word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing
df['cleaned_report'] = df['report'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
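The stemming step mentioned above is not part of this notebook's pipeline. As a minimal sketch of what it could look like, the snippet below uses NLTK's `PorterStemmer`, which needs no extra download. The helper name `stem_text` is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Reduce each whitespace-separated token to its Porter stem."""
    return ' '.join(stemmer.stem(token) for token in text.split())

# Example on an already cleaned report
sample = "detected hazardous conditions pipeline section"
print(stem_text(sample))
```

It could be applied with `df['cleaned_report'].apply(stem_text)` before vectorizing. Stems are not always dictionary words, so the topic keywords may be harder to read afterwards.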


Applying Topic Modeling:

Use Latent Dirichlet Allocation (LDA) to identify topics in the text data.

In [4]:
# Vectorize the cleaned reports
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(df['cleaned_report'])

# Apply LDA
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

# Display the topics and their top words
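def display_topics(model, feature_names, num_top_words):
    # Each row of model.components_ holds one topic's word weights
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic #{topic_idx}:")
        # argsort is ascending, so reverse it and keep the first num_top_words indices
        top_indices = topic.argsort()[::-1][:num_top_words]
        print(" ".join(feature_names[i] for i in top_indices))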

num_top_words = 5
feature_names = vectorizer.get_feature_names_out()
display_topics(lda, feature_names, num_top_words)


Topic #0:
detected oil pipeline causing significant
Topic #1:
maintenance drilling incident safety report
Topic #2:
leak environmental spill issue coast
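Beyond the top words per topic, a fitted LDA model also gives each document a mixture over topics. The sketch below shows one way to read that mixture with scikit-learn's `fit_transform`. It uses six of the sample reports and its own variable names (`reports`, `vec`, `model`, `doc_topics`) so that it runs independently of the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Subset of the sample reports used in this notebook
reports = [
    "Detected an oil leak in the main pipeline.",
    "Urgent maintenance needed for the pump station.",
    "Oil spill causing significant environmental damage.",
    "Scheduled maintenance for the drilling equipment.",
    "Safety incident: Gas leak detected in the refinery.",
    "Environmental issue due to oil spill near the coast.",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(reports)

model = LatentDirichletAllocation(n_components=3, random_state=42)
# One row per document; each row is a topic mixture that sums to 1
doc_topics = model.fit_transform(counts)

for text, mixture in zip(reports, doc_topics):
    dominant = mixture.argmax()  # index of the highest-weighted topic
    print(f"topic {dominant} ({mixture[dominant]:.2f}): {text}")
```

On a corpus this small the assignments are unstable and can change with `random_state`, so they are best treated as illustrative.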


Analyzing and Interpreting Topics:

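Analyze the resulting topics by examining the top words in each one and assigning a descriptive label. In the output above, Topic #1 (maintenance, drilling, incident, safety, report) combines maintenance and safety terms, while Topic #2 (leak, environmental, spill, issue, coast) points to environmental incidents. With only ten short reports the topics overlap, so these labels are tentative.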
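Labeling can be sketched as a small lookup from topic index to a label chosen by a person. The weights and labels below are hypothetical toy values, not output from the model above, and `top_words` is an illustrative helper.

```python
import numpy as np

def top_words(components, feature_names, n):
    """Return the n highest-weighted words for each topic row."""
    return [[feature_names[i] for i in np.argsort(row)[::-1][:n]] for row in components]

# Toy topic-word weights (hypothetical numbers, not taken from a fitted model)
feature_names = np.array(["leak", "maintenance", "oil", "pump", "spill"])
components = np.array([
    [0.2, 3.0, 0.1, 2.5, 0.1],
    [2.0, 0.1, 3.0, 0.2, 2.5],
])

# Labels are assigned manually after reading the top words (hypothetical labels)
topic_labels = {0: "Equipment maintenance", 1: "Oil leaks and spills"}

for idx, words in enumerate(top_words(components, feature_names, 3)):
    print(f"{topic_labels[idx]}: {', '.join(words)}")
```

With a fitted model, `lda.components_` and `vectorizer.get_feature_names_out()` would take the place of the toy arrays.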

**Benefits of using topic modeling**

Insight into Common Issues:

Identify frequently occurring problems in maintenance logs and safety reports, helping to prioritize resources and address recurring issues.

Improved Safety:

Detect patterns in safety incidents, leading to better preventative measures and training programs.

Efficient Data Management:

Organize large volumes of unstructured text data into meaningful categories, making it easier to search, retrieve, and analyze information.

Trend Analysis:

Analyze trends over time to understand how certain issues evolve, aiding in strategic planning and operational improvements.

Enhanced Decision-Making:

Provide management with a clearer understanding of key areas that require attention, supporting data-driven decision-making.