### This code performs the following categorization steps, illustrated with the P-graph SLR database

1. Loads data from a file into a Pandas dataframe.
2. Replaces 'Not a Number' (NaN) values in the dataframe with an empty string.
3. Extracts unique categories in the dataframe and uses them to initialize a dictionary.
4. Initializes the TfidfVectorizer to extract keywords and phrases from the data.
5. Loops through each category, selects the corresponding abstracts, fits the TfidfVectorizer to the data, and stores the feature names (keywords and phrases) and scores in a dictionary.
6. Sorts the dictionary by scores in descending order and adds the keywords and phrases for each category to a dataframe.
7. Maps each keyword to the category in which it has the highest score and stores the mapping in a dataframe.
8. Saves the final dataframes to CSV files.
9. Cleans the keyword mapping with the nltk library (removing stop words, punctuation, digits, and very short tokens), then uses the remaining keywords and their TF-IDF scores to predict the original category of each record from its title, abstract, and keywords, without considering any other information from the original dataframe.

In [4]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

import os

# Change the directory to the desired location
os.chdir("C:/Users/<filepath>")

# Get the current working directory
print("Current working directory:", os.getcwd())

Current working directory: C:\Users\sutaa\Documents\Python Scripts\Pgraph_SLR


In [5]:
# Load the data that was previously categorized in an input variable ("Category") into a Pandas DataFrame
df = pd.read_csv('C:/Users/<filepath>/input_139.txt', sep='\t', engine='python')

In [6]:
# Replace NaN values with an empty string
df = df.fillna('')
print(df)

                                                 Title  \
0    Efficient Design and Sustainability Assessment...   
1    Optimal planning of inter-plant hydrogen integ...   
2    Utilization of process network synthesis and m...   
3    A Hybrid P-Graph And WEKA Approach In Decision...   
4    Conversion technologies: Evaluation of economi...   
..                                                 ...   
134  Environmentally Friendly Heterogeneous Azeotro...   
135  Generating Efficient Wastewater Treatment Netw...   
136  The P-graph approach for systematic synthesis ...   
137  Synthesis of sustainable circular economy in p...   
138  Hierarchical estimation of sustainability-pote...   

                                              Abstract  \
0    In the tannery industry approximately, 30 - 35...   
1    With the rising demand for hydrogen in petroch...   
2    This paper introduces the utilization of two d...   
3    Process system engineering approaches have a c...   
4    The gene

In [7]:
# Get the list of categories
categories = df['Category'].unique().tolist()

# The TfidfVectorizer is created inside the loop in the next cell, once per category,
# and the per-category results are collected there in category_keywords.

In [8]:
# Loop through each category
category_keywords_list = []
for category in categories:
    # Select the documents in the current category
    texts = df[df['Category'] == category]['Abstract'].tolist()
    
    # Initialize the TfidfVectorizer
    tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
    tfidf_matrix = tfidf.fit_transform(texts)

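    # Get the feature names (keywords and phrases).
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0.
    feature_names = tfidf.get_feature_names_out().tolist()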

    # Get the tf-idf scores for each feature
    scores = tfidf_matrix.sum(axis=0).tolist()[0]

    # Combine the feature names and scores into a dictionary
    features = dict(zip(feature_names, scores))

    # Sort the dictionary by tf-idf scores in descending order
    sorted_features = {k: v for k, v in sorted(features.items(), key=lambda item: item[1], reverse=True)}

    # Collect the sorted keywords/phrases and scores for the current category
    category_keywords_list.append({'Category': category, 'Keyword/Phrase': list(sorted_features.keys()), 'Score': list(sorted_features.values())})

# Build the dataframe once, after all categories have been processed
category_keywords = pd.DataFrame(category_keywords_list)

In [9]:
# Reset the index of the final dataframe
category_keywords.reset_index(drop=True, inplace=True)

In [10]:
print(category_keywords)

                                            Category  \
0    2. Reduction/elimination of emissions and waste   
1         5. Process integration and intensification   
2  8. Other techniques for environmental impact r...   
3                                  3. Carbon capture   
4  9. Risk mitigation/criticality analysis on the...   
5     7. Energy saving, distribution, and management   
6     1. Use of renewable resources (for production)   
7                       6. Sustainable supply chains   
8  4. Circular economy/ reuse, recycling of resou...   

                                      Keyword/Phrase  \
0  [waste, process, treatment, graph, energy, min...   
1  [heat, graph, network, process, industrial, op...   
2  [graph, process, ecosystem, network, productio...   
3  [biochar, carbon, power, graph, pathways, co2,...   
4  [bioenergy, economic, process, risk, graph, pa...   
5  [energy, supply, systems, graph, heat, process...   
6  [hydrogen, graph, process, synthesis, ibr, b

In [11]:
category_keywords.to_csv('category_keywords.csv', index=False)

In [12]:
# Initialize a dictionary to store the keyword-category mapping
keyword_category_mapping = {}

# Iterate over the rows in the category_keywords dataframe
for index, row in category_keywords.iterrows():
    category = row['Category']
    keywords = row['Keyword/Phrase']
    scores = row['Score']
    
    # Iterate over the keywords and their corresponding scores in the current category
    for keyword, score in zip(keywords, scores):
        # If the keyword is already in the dictionary, update the mapping if the current score is higher
        if keyword in keyword_category_mapping:
            if score > keyword_category_mapping[keyword][1]:
                keyword_category_mapping[keyword] = (category, score)
        else:
            keyword_category_mapping[keyword] = (category, score)
            
print(keyword_category_mapping)

{'waste': ('2. Reduction/elimination of emissions and waste', 1.746443548617687), 'process': ('7. Energy saving, distribution, and management', 1.5150103428007922), 'treatment': ('2. Reduction/elimination of emissions and waste', 1.246136494827345), 'graph': ('7. Energy saving, distribution, and management', 1.5744418371545366), 'energy': ('7. Energy saving, distribution, and management', 2.606423791164414), 'minimization': ('2. Reduction/elimination of emissions and waste', 0.7656471505269913), 'waste minimization': ('2. Reduction/elimination of emissions and waste', 0.7233996334045321), 'technologies': ('2. Reduction/elimination of emissions and waste', 0.6712656168026434), 'optimal': ('7. Energy saving, distribution, and management', 1.3807269491158225), 'methodology': ('7. Energy saving, distribution, and management', 0.7677084288707203), 'using': ('7. Energy saving, distribution, and management', 0.7029921230100433), 'cost': ('7. Energy saving, distribution, and management', 1.034

In [13]:
# Initialize a list to store the tuples
data = []

# Iterate over the items in the keyword_category_mapping dictionary
for keyword, (category, score) in keyword_category_mapping.items():
    data.append((keyword, category, score))

# Convert the list of tuples to a dataframe with columns "Keyword/Phrase", "Category", and "TF-IDF"
keyword_category_mapping_df = pd.DataFrame(data, columns=["Keyword/Phrase", "Category", "TF-IDF"])

print(keyword_category_mapping_df)
keyword_category_mapping_df.to_csv('keyword_category_mapping_df.csv', index=False)

            Keyword/Phrase                                           Category  \
0                    waste    2. Reduction/elimination of emissions and waste   
1                  process     7. Energy saving, distribution, and management   
2                treatment    2. Reduction/elimination of emissions and waste   
3                    graph     7. Energy saving, distribution, and management   
4                   energy     7. Energy saving, distribution, and management   
...                    ...                                                ...   
15135    treatment country  4. Circular economy/ reuse, recycling of resou...   
15136  treatment structure  4. Circular economy/ reuse, recycling of resou...   
15137           type yield  4. Circular economy/ reuse, recycling of resou...   
15138     urban population  4. Circular economy/ reuse, recycling of resou...   
15139       useful dealing  4. Circular economy/ reuse, recycling of resou...   

         TF-IDF  
0      1.

In [14]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Creating a copy of the original dataframe
cleaned_keyword_category_mapping = keyword_category_mapping_df.copy()

# 1. Tokenizing the words in the "Keyword/Phrase" column
cleaned_keyword_category_mapping['Keyword/Phrase'] = cleaned_keyword_category_mapping['Keyword/Phrase'].apply(word_tokenize)

# 2. Removing stopwords from the tokenized words
stop_words = set(stopwords.words('english'))
cleaned_keyword_category_mapping['Keyword/Phrase'] = cleaned_keyword_category_mapping['Keyword/Phrase'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

# 3. Removing punctuation from the tokenized words
cleaned_keyword_category_mapping['Keyword/Phrase'] = cleaned_keyword_category_mapping['Keyword/Phrase'].apply(lambda x: [word for word in x if word not in string.punctuation])

# 4. Removing digits from the tokenized words
cleaned_keyword_category_mapping['Keyword/Phrase'] = cleaned_keyword_category_mapping['Keyword/Phrase'].apply(lambda x: [word for word in x if not word.isdigit()])

# 5. Removing tokens shorter than three characters
cleaned_keyword_category_mapping['Keyword/Phrase'] = cleaned_keyword_category_mapping['Keyword/Phrase'].apply(lambda x: [word for word in x if len(word) >= 3])

# 6. Converting the tokenized words back to strings
cleaned_keyword_category_mapping['Keyword/Phrase'] = cleaned_keyword_category_mapping['Keyword/Phrase'].apply(lambda x: ' '.join(x))

# 7. Removing rows with empty "Keyword/Phrase" values
cleaned_keyword_category_mapping = cleaned_keyword_category_mapping[cleaned_keyword_category_mapping['Keyword/Phrase'] != '']

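# 8. Removing rows whose phrase contains "et" or "graph"
# (str.contains matches substrings, so phrases such as "network" or "methodology" are dropped too)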
cleaned_keyword_category_mapping = cleaned_keyword_category_mapping[~cleaned_keyword_category_mapping['Keyword/Phrase'].str.contains("et")]
cleaned_keyword_category_mapping = cleaned_keyword_category_mapping[~cleaned_keyword_category_mapping['Keyword/Phrase'].str.contains("graph")]

print(cleaned_keyword_category_mapping)
cleaned_keyword_category_mapping.to_csv('cleaned_keyword_category_mapping.csv', index=True)

            Keyword/Phrase                                           Category  \
0                    waste    2. Reduction/elimination of emissions and waste   
1                  process     7. Energy saving, distribution, and management   
2                treatment    2. Reduction/elimination of emissions and waste   
4                   energy     7. Energy saving, distribution, and management   
5             minimization    2. Reduction/elimination of emissions and waste   
...                    ...                                                ...   
15135    treatment country  4. Circular economy/ reuse, recycling of resou...   
15136  treatment structure  4. Circular economy/ reuse, recycling of resou...   
15137           type yield  4. Circular economy/ reuse, recycling of resou...   
15138     urban population  4. Circular economy/ reuse, recycling of resou...   
15139       useful dealing  4. Circular economy/ reuse, recycling of resou...   

         TF-IDF  
0      1.

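The "et" filter in the cell above matches substrings, which is worth seeing on a small example. The sketch below uses only the standard library and a made-up phrase list; the helper names `contains_substring` and `contains_word` are hypothetical. It contrasts a plain substring test with a word-boundary regex.

```python
import re

# Hypothetical phrases of the kind found in the keyword mapping
phrases = ["network", "methodology", "et al", "p graph", "graph theory", "heat"]

def contains_substring(phrase, term):
    # Same idea as a plain substring filter: matches anywhere in the phrase
    return term in phrase

def contains_word(phrase, term):
    # Matches only when the term stands as a whole word
    return re.search(r"\b" + re.escape(term) + r"\b", phrase) is not None

dropped_by_substring = [p for p in phrases if contains_substring(p, "et")]
dropped_by_word = [p for p in phrases if contains_word(p, "et")]
print(dropped_by_substring)
print(dropped_by_word)
```

If only the standalone token "et" is meant to be removed, one option would be to pass a word-boundary pattern such as `r"\bet\b"` to `str.contains`, which interprets its argument as a regular expression by default. Doing so would change which rows survive, so the outputs shown in this notebook would no longer match.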
In [15]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Lowercase the text, tokenize it, keep only alphabetic tokens, remove stop words,
# tag the remaining words with their part of speech, and keep only the nouns (tags starting with "N").

# Build the stop-word set once instead of reloading the list for every word
english_stop_words = set(stopwords.words("english"))

def clean_keyword_text(text):
    text = text.lower()
    words = word_tokenize(text)
    words = [word for word in words if word.isalpha()]
    words = [word for word in words if word not in english_stop_words]
    tagged = pos_tag(words)
    words = [word for word, pos in tagged if pos.startswith("N")]
    return " ".join(words)

cleaned_keyword_category_mapping_noun = keyword_category_mapping_df.copy()
cleaned_keyword_category_mapping_noun["Keyword/Phrase"] = cleaned_keyword_category_mapping_noun["Keyword/Phrase"].apply(clean_keyword_text)

cleaned_keyword_category_mapping_noun = cleaned_keyword_category_mapping_noun[cleaned_keyword_category_mapping_noun["Keyword/Phrase"] != ""]

print(cleaned_keyword_category_mapping_noun)
cleaned_keyword_category_mapping_noun.to_csv('mapping_noun.csv', index=True)

             Keyword/Phrase  \
0                     waste   
1                   process   
2                 treatment   
3                     graph   
4                    energy   
...                     ...   
15134  treatment approaches   
15135     treatment country   
15136   treatment structure   
15137                 yield   
15138            population   

                                                Category    TF-IDF  
0        2. Reduction/elimination of emissions and waste  1.746444  
1         7. Energy saving, distribution, and management  1.515010  
2        2. Reduction/elimination of emissions and waste  1.246136  
3         7. Energy saving, distribution, and management  1.574442  
4         7. Energy saving, distribution, and management  2.606424  
...                                                  ...       ...  
15134  4. Circular economy/ reuse, recycling of resou...  0.050005  
15135  4. Circular economy/ reuse, recycling of resou...  0.050005  
15136 

In [16]:
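# Prediction for categorization based on words + phrases
# (uses cleaned_keyword_category_mapping from In [14], which is not restricted to nouns)
# Create a dictionary that stores, for each category, its keywords as a list
# (kept for inspection; predict_category below reads the dataframe directly).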

category_keywords = {}
for i, row in cleaned_keyword_category_mapping.iterrows():
    category = row['Category']
    keyword = row['Keyword/Phrase']
    if category not in category_keywords:
        category_keywords[category] = []
    category_keywords[category].append(keyword)

def predict_category(abstract, abstract_title, abstract_keywords, cleaned_keyword_category_mapping):
    phrase_scores = {category: 0 for category in set(cleaned_keyword_category_mapping['Category'])}
    
    abstract = abstract.lower()
    abstract_title = abstract_title.lower()
    abstract_keywords = abstract_keywords.lower()
    
    for phrase, category, score in zip(cleaned_keyword_category_mapping['Keyword/Phrase'], cleaned_keyword_category_mapping['Category'], cleaned_keyword_category_mapping['TF-IDF']):
        # Substring match: the phrase counts if it occurs anywhere in the abstract, title, or keywords
        if phrase in abstract or phrase in abstract_title or phrase in abstract_keywords:
            phrase_scores[category] += score
    return max(phrase_scores, key=phrase_scores.get)

df['Predicted Category_wordsAndPhrases'] = df.apply(lambda x: predict_category(x['Abstract'], x['Title'], x['Keywords'], cleaned_keyword_category_mapping), axis=1)

print(df)

#df.to_csv('df.csv', index=True)

                                                 Title  \
0    Efficient Design and Sustainability Assessment...   
1    Optimal planning of inter-plant hydrogen integ...   
2    Utilization of process network synthesis and m...   
3    A Hybrid P-Graph And WEKA Approach In Decision...   
4    Conversion technologies: Evaluation of economi...   
..                                                 ...   
134  Environmentally Friendly Heterogeneous Azeotro...   
135  Generating Efficient Wastewater Treatment Netw...   
136  The P-graph approach for systematic synthesis ...   
137  Synthesis of sustainable circular economy in p...   
138  Hierarchical estimation of sustainability-pote...   

                                              Abstract  \
0    In the tannery industry approximately, 30 - 35...   
1    With the rising demand for hydrogen in petroch...   
2    This paper introduces the utilization of two d...   
3    Process system engineering approaches have a c...   
4    The gene

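One edge case in the prediction above is a record in which no phrase matches: every category then has a score of 0, and `max` returns whichever category happens to come first when iterating over the set of categories, an order that for strings can differ between interpreter runs. The sketch below is a hypothetical standard-library variant (not the notebook's function) that uses whole-word matching and returns `None` in that case; the function name, the toy rows, and their scores are made up.

```python
import re

def predict_category_sketch(text, phrase_rows):
    """Sum the scores of matching phrases per category and return the best one.

    phrase_rows is an iterable of (phrase, category, score) tuples.
    Returns None when no phrase matches, instead of an arbitrary category.
    """
    totals = {}
    text = text.lower()
    for phrase, category, score in phrase_rows:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            totals[category] = totals.get(category, 0.0) + score
    if not totals:
        return None  # nothing matched: leave the record unclassified
    return max(totals, key=totals.get)

# Toy mapping rows with made-up scores
toy_rows = [
    ("waste", "2. Reduction/elimination of emissions and waste", 1.7),
    ("energy", "7. Energy saving, distribution, and management", 2.6),
    ("heat", "7. Energy saving, distribution, and management", 0.9),
]
print(predict_category_sketch("Waste heat recovery in a tannery", toy_rows))
print(predict_category_sketch("A study of hydrogen networks", toy_rows))
```

Returning `None` makes unmatched records visible, so they can be counted or reviewed separately rather than being assigned silently to a category.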
In [17]:
#Remove words, use only phrases

mapping_noun_phrases = cleaned_keyword_category_mapping_noun[
    cleaned_keyword_category_mapping_noun['Keyword/Phrase'].str.count(' ') + 1 > 1
]

# Remove phrases containing general terms that tend to generate false positives (substring match)
mapping_noun_phrases = mapping_noun_phrases[~mapping_noun_phrases['Keyword/Phrase'].str.contains("et")]
mapping_noun_phrases = mapping_noun_phrases[~mapping_noun_phrases['Keyword/Phrase'].str.contains("graph")]
mapping_noun_phrases = mapping_noun_phrases[~mapping_noun_phrases['Keyword/Phrase'].str.contains("process")]
print(mapping_noun_phrases)
mapping_noun_phrases.to_csv('mapping_noun_phrases.csv', index=True)

              Keyword/Phrase  \
6         waste minimization   
49           decision making   
58                case study   
74               waste water   
78     minimization analysis   
...                      ...   
15131              tool case   
15133               tool msw   
15134   treatment approaches   
15135      treatment country   
15136    treatment structure   

                                                Category    TF-IDF  
6        2. Reduction/elimination of emissions and waste  0.723400  
49       2. Reduction/elimination of emissions and waste  0.410498  
58        7. Energy saving, distribution, and management  0.588197  
74       2. Reduction/elimination of emissions and waste  0.333112  
78       2. Reduction/elimination of emissions and waste  0.330943  
...                                                  ...       ...  
15131  4. Circular economy/ reuse, recycling of resou...  0.050005  
15133  4. Circular economy/ reuse, recycling of resou...  0.050

In [18]:
# Prediction for categorization based on noun Phrases only
#Create a dictionary to store the category and its corresponding keywords as a list of keywords.

category_keywords = {}
for i, row in mapping_noun_phrases.iterrows():
    category = row['Category']
    keyword = row['Keyword/Phrase']
    if category not in category_keywords:
        category_keywords[category] = []
    category_keywords[category].append(keyword)
    
# For each record, search its abstract, title, and keywords for each phrase (whole-word regex match).
# Every matching phrase adds its TF-IDF score to its category; the category with the highest total is the prediction.

import re

def predict_category(abstract, abstract_title, abstract_keywords, mapping_noun_phrases):
    phrase_scores = {category: 0 for category in set(mapping_noun_phrases['Category'])}
    
    # text preprocessing
    abstract = abstract.lower()
    abstract_title = abstract_title.lower()
    abstract_keywords = abstract_keywords.lower()
    
    for phrase, category, score in zip(mapping_noun_phrases['Keyword/Phrase'], mapping_noun_phrases['Category'], mapping_noun_phrases['TF-IDF']):
        # Match the phrase only on word boundaries
        pattern = r'\b' + re.escape(phrase) + r'\b'
        if re.search(pattern, abstract) or re.search(pattern, abstract_title) or re.search(pattern, abstract_keywords):
            phrase_scores[category] += score
    return max(phrase_scores, key=phrase_scores.get)

df['Predicted Category_phrases'] = df.apply(lambda x: predict_category(x['Abstract'], x['Title'], x['Keywords'], mapping_noun_phrases), axis=1)

print(df)

df.to_csv('df.csv', index=True)

                                                 Title  \
0    Efficient Design and Sustainability Assessment...   
1    Optimal planning of inter-plant hydrogen integ...   
2    Utilization of process network synthesis and m...   
3    A Hybrid P-Graph And WEKA Approach In Decision...   
4    Conversion technologies: Evaluation of economi...   
..                                                 ...   
134  Environmentally Friendly Heterogeneous Azeotro...   
135  Generating Efficient Wastewater Treatment Netw...   
136  The P-graph approach for systematic synthesis ...   
137  Synthesis of sustainable circular economy in p...   
138  Hierarchical estimation of sustainability-pote...   

                                              Abstract  \
0    In the tannery industry approximately, 30 - 35...   
1    With the rising demand for hydrogen in petroch...   
2    This paper introduces the utilization of two d...   
3    Process system engineering approaches have a c...   
4    The gene
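
Since the aim is to predict the original categories, a natural follow-up is to check how often a predicted column agrees with the original `Category` column. The sketch below is one possible way to do that with the standard library only. The helper names `agreement` and `confusion_counts` are hypothetical, the labels are toy values, and no result for the actual database is implied.

```python
from collections import Counter

def agreement(original, predicted):
    """Fraction of records whose predicted category equals the original one."""
    pairs = list(zip(original, predicted))
    matches = sum(1 for o, p in pairs if o == p)
    return matches / len(pairs)

def confusion_counts(original, predicted):
    """Count each (original, predicted) pair that disagrees."""
    return Counter((o, p) for o, p in zip(original, predicted) if o != p)

# Toy labels for illustration only
original = ["3. Carbon capture", "6. Sustainable supply chains",
            "3. Carbon capture", "6. Sustainable supply chains"]
predicted = ["3. Carbon capture", "3. Carbon capture",
             "3. Carbon capture", "6. Sustainable supply chains"]

print(agreement(original, predicted))
print(confusion_counts(original, predicted).most_common(1))
```

With the notebook's dataframe, the two arguments would be `df['Category']` and either `df['Predicted Category_wordsAndPhrases']` or `df['Predicted Category_phrases']`. Because the keywords were extracted from the same abstracts, such a figure describes agreement on the data the keywords came from, not performance on unseen records.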