# Project: NLP for Cleantech
### Stage 3: Question answering / Information retrieval
Authors: Esin, Sabrina

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### 1. Text Pre-processing



As a first step, the dataset is loaded and preprocessed as before.

In [None]:
#Load the data
import pandas as pd
cleantech = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 1/cleantech_media_dataset_v1_20231109.csv",delimiter=",",low_memory=False)
cleantech

Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url
0,1280,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1,1281,India Launches Its First 700 MW PHWR,2021-01-15,,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
2,1283,New Chapter for US-China Energy Trade,2021-01-20,,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
3,1284,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
4,1285,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
...,...,...,...,...,...,...,...
9602,82339,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...
9603,82340,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...
9604,82341,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...
9605,82342,"SunEdison: Too Big, Too Fast?",2023-11-08,,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...


In [None]:
#Rename "Unnamed: 0 " column for better accessibility:
cleantech.rename(columns={'Unnamed: 0':'rowID'}, inplace=True)

In [None]:
dupl_title = cleantech[cleantech.duplicated(subset=["title"],keep=False)]

In [None]:
dupl_content = cleantech[cleantech.duplicated(subset=["content"],keep=False)]

In [None]:
#Extract all rows to be removed in dupl title:
contain_sgvoice_dupltitle = dupl_title[dupl_title['url'].str.contains('sgvoice')]
#print(contain_sgvoice_dupltitle)

#Delete the rows in contain_sgvoice_dupltitle from cleantech:
list1 = contain_sgvoice_dupltitle["rowID"].values.tolist()
cleantech_nodupl = cleantech[~cleantech.rowID.isin(list1)]

In [None]:
#Delete all articles with duplicate content:
cleantech_nodupl = cleantech_nodupl.drop_duplicates(subset='content')

#Extract useful columns:
cleantech_con = cleantech_nodupl[["title", "content"]]

#Concatenate title and content:
#cleantech_con['text'] = cleantech_con['title'] + ' ' + cleantech_con['content']

cleantech_con.reset_index(drop=True, inplace= True)

##### 1.1. Content and structure cleaning

Taking a closer look, each entry of the content column is a list that has been stored as a string. To facilitate sentence tokenization, we parse the content back into actual Python lists:

In [None]:
import ast
def true_eval(string):
  x = ast.literal_eval(string)
  return x
listcontent = cleantech_con['content'].apply(true_eval)
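
As a minimal illustration of what the parsing step does, the sketch below applies `ast.literal_eval` to a single made-up string. The value of `raw_content` is hypothetical and only mimics the shape of the entries in the `content` column.

```python
import ast

# Hypothetical raw cell value, shaped like the entries of the `content` column
raw_content = "['First paragraph of the article.', 'Second paragraph.']"

# literal_eval only accepts Python literals, so it is safer than eval() here
paragraphs = ast.literal_eval(raw_content)

print(type(raw_content).__name__)  # type before parsing
print(type(paragraphs).__name__)   # type after parsing
print(len(paragraphs))             # number of list elements
```

After parsing, each article can be iterated element by element, which is what the cleaning and tokenization steps below rely on.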

Inspecting the list elements of an example article shows that some elements stem from the websites' privacy and cookie notices and carry no article-relevant information:

In [None]:
#check example elements:
for x in listcontent[6932]:
  print(x)
  print("\n")

In this pv magazine Webinar, we will discuss with Paul Wormser of CEA and Elias Hinckley of K & L Gates the IRS’ Safe Harbor Provision for Solar Energy Projects and how to take advantage of it.


Now is the time to make sure your current or potential solar project can capture the current 26% solar investment tax credit ( ITC). Unless Congress and the administration intercede, the solar ITC will step down to 22% on January 1, 2021, potentially adding millions of dollars in expenses that would otherwise be nonexistent. If you’ re going to build a $ 100,000,000 project and bring it online before the end of 2023, capturing the 2020 safe harbor brings $ 4 million to your bottom line.


In this pv magazine Webinar, we will discuss with Paul Wormser of CEA and Elias Hinckley of K & L Gates the IRS’ Safe Harbor Provision for Solar Energy Projects and how to take advantage of it. The presenters will also cover the “ 5% rule ” and how minimal upfront investment can qualify a project for the full

Inspecting the cleantech dataset as a text file, we can identify the boilerplate passages that repeat across articles:

In [None]:
"'This website uses cookies to anonymously count visitor numbers. View our privacy policy. ×', ""The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this."""

"'This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com.', 'Please be mindful of our community standards.', 'Your email address will not be published. Required fields are marked *', 'Save my name, email, and website in this browser for the next time I comment.', 'By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment.', 'Your personal data will only be disclosed or otherwise transmitted to third parties for the purposes of spam filtering or if this is necessary for technical maintenance of the website. Any other transfer to third parties will not take place unless this is justified on the basis of applicable data protection regulations or if pv magazine is legally obliged to do so.', 'You may revoke this consent at any time with effect for the future, in which case your personal data will be deleted immediately. Otherwise, your data will be deleted if pv magazine has processed your request or the purpose of data storage is fulfilled.', 'Further information on data privacy can be found in our Data Protection Policy.', 'This website uses cookies to anonymously count visitor numbers. View our privacy policy. ×', ""The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this."""

"'This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com.', 'I am interested to read more articles from the pv magazine.', 'Please be mindful of our community standards.', 'Your email address will not be published. Required fields are marked *', 'Save my name, email, and website in this browser for the next time I comment.', 'By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment.', 'Your personal data will only be disclosed or otherwise transmitted to third parties for the purposes of spam filtering or if this is necessary for technical maintenance of the website. Any other transfer to third parties will not take place unless this is justified on the basis of applicable data protection regulations or if pv magazine is legally obliged to do so.', 'You may revoke this consent at any time with effect for the future, in which case your personal data will be deleted immediately. Otherwise, your data will be deleted if pv magazine has processed your request or the purpose of data storage is fulfilled.', 'Further information on data privacy can be found in our Data Protection Policy.', 'This website uses cookies to anonymously count visitor numbers. View our privacy policy. ×', ""The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this."""

"'This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com.', 'This is an interesting article about the fluctuations in polysilicon prices and how it affects the solar industry. It’ s good to hear that prices will be slightly lower in March, which should stimulate installation demand. I’ m curious to know how these price changes will affect the adoption of solar energy globally. Will it make it more accessible to countries with limited resources, or will it only affect larger markets with established solar industries?', 'Why are prices this high? In the US in 2015 you could get solar panels @ 21% efficiency for 57 cents per watt. Google cost per what solar panel 2015. I bought a bunch of them then. Why is everyone pretending like solar is so cheap now when it’ s double the price?', 'Please be mindful of our community standards.', 'Your email address will not be published. Required fields are marked *', 'Save my name, email, and website in this browser for the next time I comment.', 'By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment.', 'Your personal data will only be disclosed or otherwise transmitted to third parties for the purposes of spam filtering or if this is necessary for technical maintenance of the website. Any other transfer to third parties will not take place unless this is justified on the basis of applicable data protection regulations or if pv magazine is legally obliged to do so.', 'You may revoke this consent at any time with effect for the future, in which case your personal data will be deleted immediately. 
Otherwise, your data will be deleted if pv magazine has processed your request or the purpose of data storage is fulfilled.', 'Further information on data privacy can be found in our Data Protection Policy.', 'This website uses cookies to anonymously count visitor numbers. View our privacy policy. ×', ""The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this."""

In [None]:
#find all articles that contain the cookie notice (literal match, not a regular expression):
cookie_notice = "The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this."
cleantech_con[cleantech_con['content'].str.contains(cookie_notice, regex=False)]

Unnamed: 0,title,content
6932,Is your company capturing the 2020 safe harbor...,"['In this pv magazine Webinar, we will discuss..."
6933,Meyer Burger gets €22.5 million in public subs...,"['The funds come, on the one hand, from the re..."
6934,JinkoSolar claims 24.9% efficiency for n-type ...,['The result was confirmed by Germany’ s Insti...
6935,"Engie, Neoen plan 1 GW solar-plus-storage proj...",['The two French companies have announced Hori...
6936,Seawater aqueous battery based on alloy of zin...,['Scientists in the United States developed a ...
...,...,...
8141,Quasi-solid-state magnesium-ion battery achiev...,['Researchers at the University of Hong Kong (...
8142,"PV, storage may allow 53% of European househol...",['A German-Swiss research team has calculated ...
8143,Key takeaways from China’ s SNEC Energy Storag...,['The latest edition of China’ s SNEC Energy S...
8144,Empowering energy savings at home – pv magazin...,"['Across Europe, rising energy prices and the ..."


The search shows that more than 1,200 articles contain this notice. We therefore remove the boilerplate elements from every article:

In [None]:
elements_to_remove = ["This website uses cookies to anonymously count visitor numbers. View our privacy policy. ×",
                      "The cookie settings on this website are set to `` allow cookies '' to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click `` Accept '' below then you are consenting to this.",
                      "This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com.",
                      "Please be mindful of our community standards.",
                      "Your email address will not be published. Required fields are marked *",
                      "I am interested to read more articles from the pv magazine.",
                      "Save my name, email, and website in this browser for the next time I comment.",
                      "By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment.",
                      "Your personal data will only be disclosed or otherwise transmitted to third parties for the purposes of spam filtering or if this is necessary for technical maintenance of the website. Any other transfer to third parties will not take place unless this is justified on the basis of applicable data protection regulations or if pv magazine is legally obliged to do so.",
                      "You may revoke this consent at any time with effect for the future, in which case your personal data will be deleted immediately. Otherwise, your data will be deleted if pv magazine has processed your request or the purpose of data storage is fulfilled.",
                      "Further information on data privacy can be found in our Data Protection Policy."]

#remove redundant elements
for article in listcontent:
  article[:] = [element for element in article if element not in elements_to_remove]
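
The removal loop can be tried out on a small scale. The sketch below uses a hypothetical two-article stand-in for `listcontent` and a `frozenset` instead of a list for the boilerplate strings, which is an assumption on our side to make the membership test cheaper. It is not a change to the result.

```python
# Hypothetical miniature versions of `elements_to_remove` and `listcontent`
boilerplate = frozenset({
    "Please be mindful of our community standards.",
    "Further information on data privacy can be found in our Data Protection Policy.",
})
articles = [
    ["Solar prices fell in March.", "Please be mindful of our community standards."],
    ["Further information on data privacy can be found in our Data Protection Policy."],
]

for article in articles:
    # Slice assignment mutates the existing list object instead of rebinding the name
    article[:] = [element for element in article if element not in boilerplate]

print(articles)
```

Note that an article consisting only of boilerplate ends up as an empty list, so later steps should be able to handle articles without sentences.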


In [None]:
#check if removed:
for x in listcontent[6932]:
  print(x)
  print("\n")

In this pv magazine Webinar, we will discuss with Paul Wormser of CEA and Elias Hinckley of K & L Gates the IRS’ Safe Harbor Provision for Solar Energy Projects and how to take advantage of it.


Now is the time to make sure your current or potential solar project can capture the current 26% solar investment tax credit ( ITC). Unless Congress and the administration intercede, the solar ITC will step down to 22% on January 1, 2021, potentially adding millions of dollars in expenses that would otherwise be nonexistent. If you’ re going to build a $ 100,000,000 project and bring it online before the end of 2023, capturing the 2020 safe harbor brings $ 4 million to your bottom line.


In this pv magazine Webinar, we will discuss with Paul Wormser of CEA and Elias Hinckley of K & L Gates the IRS’ Safe Harbor Provision for Solar Energy Projects and how to take advantage of it. The presenters will also cover the “ 5% rule ” and how minimal upfront investment can qualify a project for the full

Checking the example confirms that the redundant lines of text were removed successfully.

##### 1.2. Sentence Tokenization

After converting the content to its proper list type, we saw that some list elements still contain several sentences. Using the NLTK sentence tokenizer, we split the remaining sentences in the cleantech articles:

In [None]:
#use sent_tokenizer to split remaining sentences within article elements:
import nltk

nltk.download('punkt')

from nltk import sent_tokenize

# Tokenize sentences within each article
for i, article in enumerate(listcontent):
    tokenized_article = [sent_tokenize(sentence) for sentence in article]
    listcontent[i] = [sentence for sublist in tokenized_article for sentence in sublist]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
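
The split-and-flatten pattern used above can be sketched without NLTK. In the example below, `naive_split` is a hypothetical regex-based stand-in for `sent_tokenize` so that the snippet runs without downloading the punkt data. It is far less robust than the real tokenizer and only serves to show how the nested lists are flattened.

```python
import re
from itertools import chain

def naive_split(paragraph):
    """Stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

# Hypothetical article with one multi-sentence element and one single-sentence element
article = [
    "Now is the time to act. The tax credit steps down in 2021.",
    "The presenters will also cover the 5% rule.",
]

# Split every element, then flatten the list of lists into one list of sentences
sentences = list(chain.from_iterable(naive_split(p) for p in article))
print(sentences)
```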


In [None]:
#check with example
for x in listcontent[6932]:
  print(x)
  print("\n")

In this pv magazine Webinar, we will discuss with Paul Wormser of CEA and Elias Hinckley of K & L Gates the IRS’ Safe Harbor Provision for Solar Energy Projects and how to take advantage of it.


Now is the time to make sure your current or potential solar project can capture the current 26% solar investment tax credit ( ITC).


Unless Congress and the administration intercede, the solar ITC will step down to 22% on January 1, 2021, potentially adding millions of dollars in expenses that would otherwise be nonexistent.


If you’ re going to build a $ 100,000,000 project and bring it online before the end of 2023, capturing the 2020 safe harbor brings $ 4 million to your bottom line.


In this pv magazine Webinar, we will discuss with Paul Wormser of CEA and Elias Hinckley of K & L Gates the IRS’ Safe Harbor Provision for Solar Energy Projects and how to take advantage of it.


The presenters will also cover the “ 5% rule ” and how minimal upfront investment can qualify a project for th

Using the same example as in the previous section, we can now see that the remaining sentences were split correctly.

In [None]:
#add sentence tokens to df:
cleantech_con["sentence_tokens"] = listcontent

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleantech_con["sentence_tokens"] = listcontent


#### 2. Text Summarization with TextRank

In this chapter, we tried several approaches to extract the key sentences of the cleantech data. The first approach aimed to build a single similarity matrix covering all sentences in the dataset, which would have yielded key sentences representative of the whole corpus. However, the dataset contains over 300,000 sentences, and computing a matrix of that size was too time-consuming to implement as intended. <br>
To reduce the scope, the second approach ("Version with Similarity Matrix") computed one similarity matrix per article. In this approach, however, the graph ranking within the TextRank algorithm failed to converge. <br>



#### Version with Similarity Matrix - doesn't work

##### Load GloVe embeddings

In [None]:
#load GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2023-12-29 08:41:46--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-12-29 08:41:46--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-12-29 08:41:47--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [None]:
import numpy as np

# Extract word vectors (the context manager closes the file even if parsing fails)
word_embeddings = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
len(word_embeddings)

400000

##### Generate Sentence Vectors

In [None]:
import numpy as np

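# Function to generate sentence vectors for an article's tokenized sentences using GloVe embeddings.
# Each sentence must be a list of word tokens: iterating over a raw string would look up single characters.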
def generate_sentence_vectors(article_sentences):
    sentence_vectors = []
    for sentence in article_sentences:
        vector_sum = np.zeros((50,))
        num_words = 0
        for word in sentence:
            if word in word_embeddings:
                vector_sum += word_embeddings[word]
                num_words += 1
        # Avoid division by zero
        if num_words != 0:
            sentence_vector = vector_sum / num_words
        else:
            sentence_vector = np.zeros((50,))
        sentence_vectors.append(sentence_vector)
    return sentence_vectors

# Apply sentence vector generation to each article's tokenized sentences
# (the 'tokenized_sentences' column must hold one list of word tokens per sentence)
cleantech_con['sentence_vectors'] = cleantech_con['tokenized_sentences'].apply(generate_sentence_vectors)



In [None]:
#article 1 has 20 sentences, the embedding dimension of each is 50 as selected
print(len(cleantech_con.sentence_vectors[0]))
print(len(cleantech_con.sentence_vectors[0][1]))

20
50
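
The averaging step in `generate_sentence_vectors` can be illustrated with a toy example. The sketch below is a simplified re-implementation in plain Python, not the notebook's function. `toy_embeddings` and `mean_vector` are hypothetical names, and the vectors have 3 dimensions instead of the 50 GloVe dimensions.

```python
# Hypothetical 3-dimensional toy embeddings; the notebook uses 50-dimensional GloVe vectors
toy_embeddings = {
    "solar": [1.0, 0.0, 0.0],
    "power": [0.0, 1.0, 0.0],
}

def mean_vector(tokens, embeddings, dim=3):
    """Average the vectors of all known tokens; return a zero vector if none is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

print(mean_vector(["solar", "power"], toy_embeddings))  # list of word tokens
print(mean_vector("solar power", toy_embeddings))       # raw string: iterates over characters
```

The second call returns a zero vector because none of the single characters is a key in the embedding dictionary. Sentences without any known token behave the same way, which is one plausible explanation for the all-zero entries in the similarity matrix shown further below.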


In [None]:
dftest = cleantech_con.iloc[:1]
dftest

Unnamed: 0,title,content,sentences,tokenized_sentences,sentence_vectors,similarity_matrix
0,Qatar to Slash Emissions as LNG Expansion Adva...,"[""Qatar Petroleum ( QP) is targeting aggressiv...","[[""Qatar Petroleum ( QP) is targeting aggressi...","[[qatar, petroleum, target, aggressive, greenh...","[[0.601856681862652, 0.035504841866592564, 0.4...","[[0.9999999999999999, 0.9177666194864875, 0.88..."


##### Compute Cosine Similarity Matrix

In [None]:
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Function to compute cosine similarity for an article's sentence vectors
def compute_similarity(sent_vectors):
    num_sentences = len(sent_vectors)
    sim_mat = np.zeros((num_sentences, num_sentences))

    for i in range(num_sentences):
        for j in range(num_sentences):
            sim_mat[i][j] = cosine_similarity(
                [sent_vectors[i]], [sent_vectors[j]]
            )[0, 0]

    return sim_mat

# empty list to store similarity matrices for each article
similarity_matrices = []

# Iterate through each article's sentence vectors and compute similarity matrix
for article_sent_vectors in tqdm(cleantech_con['sentence_vectors'], desc='Computing Similarity'):
    similarity_matrices.append(compute_similarity(article_sent_vectors))

# Add the matrices to the df
cleantech_con['similarity_matrix'] = similarity_matrices

cleantech_con['similarity_matrix'].to_pickle('/content/drive/MyDrive/CLT Project/NLP Stage 3/sim_matrices_series.pkl')

Computing Similarity: 100%|██████████| 9590/9590 [2:07:15<00:00,  1.26it/s]
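
The run above took more than two hours, largely because the similarity is computed pair by pair. As a hedged alternative, the matrix for one article can be computed in a single matrix product. The sketch below uses NumPy only, and `cosine_similarity_matrix` is a hypothetical helper name. scikit-learn's `cosine_similarity` also accepts the full list of sentence vectors in one call.

```python
import numpy as np

def cosine_similarity_matrix(sent_vectors):
    """Return the pairwise cosine similarities of all row vectors in one step."""
    vectors = np.asarray(sent_vectors, dtype=float)
    if vectors.size == 0:
        return np.zeros((0, 0))  # article without sentences
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # zero vectors keep similarity 0 instead of dividing by zero
    unit = vectors / norms
    return unit @ unit.T

# Hypothetical 2-dimensional sentence vectors, including one zero vector
example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(np.round(cosine_similarity_matrix(example), 3))
```

We have not timed this variant on the full dataset, so the actual speed-up is an open question.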


In [None]:
cleantech_con.similarity_matrix[0]

array([[ 1.        ,  0.91776662,  0.88434964,  0.90413528,  0.87231117,
         0.88942366,  0.84365261,  0.84048639,  0.90236429,  0.74132322,
         0.85225204,  0.80270568,  0.61029088,  0.79297613,  0.        ,
         0.85226492,  0.55832929,  0.81305097,  0.82670182, -0.037656  ],
       [ 0.91776662,  1.        ,  0.90787639,  0.91942622,  0.90316827,
         0.9036527 ,  0.8762243 ,  0.79936401,  0.90552593,  0.81424426,
         0.86851449,  0.83759618,  0.61940958,  0.82625491,  0.        ,
         0.87104205,  0.51688243,  0.86035711,  0.83121067, -0.05376662],
       [ 0.88434964,  0.90787639,  1.        ,  0.93931711,  0.90756429,
         0.92329905,  0.88779634,  0.87255445,  0.85793755,  0.77019585,
         0.80014473,  0.78002079,  0.56127517,  0.82144727,  0.        ,
         0.82044251,  0.54336904,  0.79846547,  0.79415004, -0.15031595],
       [ 0.90413528,  0.91942622,  0.93931711,  1.        ,  0.96758318,
         0.94485886,  0.81717652,  0.84746249,  

##### Extract Summaries

In [None]:
#read sim_mats from above
sim_matrix = pd.read_pickle(r'/content/drive/MyDrive/CLT Project/NLP Stage 3/sim_matrices_series.pkl')

In [None]:
cleantech_con["sim_mat"] = pd.Series(sim_matrix)



In [None]:
cleantech_con

Unnamed: 0,title,content,sim_mat,sentences
0,Qatar to Slash Emissions as LNG Expansion Adva...,"[""Qatar Petroleum ( QP) is targeting aggressiv...","[[0.9999999999999999, 0.9177666194864875, 0.88...","[[""Qatar Petroleum ( QP) is targeting aggressi..."
1,India Launches Its First 700 MW PHWR,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...","[[0.9999999999999999, 0.7157988165567464, 0.85...","[[""• Nuclear Power Corp. of India Ltd. ( NPCIL..."
2,New Chapter for US-China Energy Trade,"[""New US President Joe Biden took office this ...","[[1.0, 0.853795701450164, 0.8707241111000026, ...","[[""New US President Joe Biden took office this..."
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,"[""The slow pace of Japanese reactor restarts c...","[[0.9999999999999999, 0.8544889961353036, 0.88...","[[""The slow pace of Japanese reactor restarts ..."
4,NYC Pension Funds to Divest Fossil Fuel Shares,"[""Two of New York City's largest pension funds...","[[0.9999999999999998, 0.9287753583009911, 0.83...","[[""Two of New York City's largest pension fund..."
...,...,...,...,...
9585,Strata Clean Energy Nets $ 300 Million in Fund...,['Strata Clean Energy has closed a $ 300 milli...,"[[0.9999999999999999, 0.8089654784962432, 0.81...",[['Strata Clean Energy has closed a $ 300 mill...
9586,Orsted Deploying SparkCognition Renewable Suit...,['Global renewable energy developer Ørsted is ...,"[[1.0000000000000002, 0.9400735006234783, 0.96...",[['Global renewable energy developer Ørsted is...
9587,Veolia Has Plans for 5 MW of Solar in Arkansas,"['Veolia North America, a provider of environm...","[[0.9999999999999999, 0.8608687464453404, 0.82...","[['Veolia North America, a provider of environ..."
9588,"SunEdison: Too Big, Too Fast?",['Once the self-proclaimed “ leading renewable...,"[[1.0000000000000002, 0.716748300272666, 0.818...",[['Once the self-proclaimed “ leading renewabl...


In [None]:
import networkx as nx
print(nx.__version__)

3.2.1


In [None]:
# Does not work: the ranking step fails (see the note below)
import networkx as nx

# Function to get a summary covering 20% of the article's sentences
def get_variable_length_summary(sim_mat, sentences):
    nx_graph = nx.from_numpy_array(sim_mat)
    #scores = nx.pagerank(nx_graph, max_iter= 1000)
    # NOTE: normalized_laplacian_matrix returns a SciPy sparse matrix, but nx.pagerank expects a graph
    nx_graph_normalized = nx.normalized_laplacian_matrix(nx_graph)#.todense()
    scores = nx.pagerank(nx_graph_normalized, max_iter=1000)

    num_sentences = len(sentences)
    summary_length = max(int(num_sentences * 0.2), 1)  # 20%
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    # Extract sentences based on summary length
    selected_sentences = [ranked_sentences[i][1] for i in range(min(summary_length, num_sentences))]
    return selected_sentences

# Apply the function to each article
cleantech_con['summary'] = cleantech_con.apply(lambda row: get_variable_length_summary(row['sim_mat'], row['sentences']), axis=1)
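
One plausible reason for the failure is the input itself: the similarity matrices contain negative entries and all-zero rows (both visible in the output for article 0 above), and every sentence is linked to itself on the diagonal. The following sketch is not the approach used in this notebook. It is a minimal weighted PageRank by power iteration in plain Python that first clips negative similarities to zero and removes the diagonal. `rank_sentences` and the `toy` matrix are hypothetical.

```python
def rank_sentences(sim_mat, damping=0.85, tol=1e-8, max_iter=1000):
    """Weighted PageRank by power iteration on a cleaned similarity matrix."""
    n = len(sim_mat)
    if n == 0:
        return []
    # Clip negative similarities to zero and drop self-similarity on the diagonal
    weights = [[max(sim_mat[i][j], 0.0) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank mass of sentences without any outgoing weight is spread uniformly
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0.0)
        new_scores = []
        for j in range(n):
            incoming = sum(scores[i] * weights[i][j] / row_sums[i]
                           for i in range(n) if row_sums[i] > 0.0)
            new_scores.append((1.0 - damping) / n + damping * (incoming + dangling / n))
        if sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol:
            return new_scores
        scores = new_scores
    raise RuntimeError("power iteration did not converge")

# Hypothetical 4-sentence matrix with a negative entry and an all-zero row
toy = [
    [1.0, 0.9, 0.8, 0.0],
    [0.9, 1.0, -0.1, 0.0],
    [0.8, -0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
scores = rank_sentences(toy)
print(max(range(len(scores)), key=scores.__getitem__))  # index of the top-ranked sentence
```

Whether this cleaning step would have been enough for the full dataset is untested. It only shows that the ranking is well defined once the weights are non-negative.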


#### Text Summarization with TextRank
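Our final approach extracts key sentences from the cleantech articles and combines them into summaries. Instead of embedding-based similarities, it builds a sentence graph whose edges are weighted by Levenshtein distance and ranks the sentences with PageRank: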

In [None]:
# Sub function to build graph within the text summarization function
import editdistance
import itertools
import networkx as nx
import nltk
import os

def build_graph(nodes):
    """Return a networkx graph instance.

    :param nodes: List of hashables that represent the nodes of a graph.
    """
    gr = nx.Graph()  # initialize an undirected graph
    gr.add_nodes_from(nodes)
    nodePairs = list(itertools.combinations(nodes, 2))

    # add edges to the graph (weighted by Levenshtein distance)
    for pair in nodePairs:
        firstString = pair[0]
        secondString = pair[1]
        levDistance = editdistance.eval(firstString, secondString)
        gr.add_edge(firstString, secondString, weight=levDistance)

    return gr
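
Note that `build_graph` uses the raw edit distance as edge weight, so very different sentences are connected by heavier edges, while PageRank treats a heavier edge as a stronger link. A possible variant, which is an assumption on our side and not what this notebook does, converts the distance into a normalised similarity before adding the edge. The sketch below uses a plain dynamic-programming `levenshtein` function as a stand-in for `editdistance.eval`. Both helper names are hypothetical.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (stand-in for editdistance.eval)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Map the edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))
print(edit_similarity("solar", "solar"))
```

With such a weight, sentences that resemble many other sentences would receive more rank, which is closer to the usual TextRank intuition.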

In [None]:
# Function to generate summaries
import nltk
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_summary(sentence_tokens, summary_length=100, clean_sentences=True):

    # Build the graph
    graph = build_graph(sentence_tokens)

    # Calculate PageRank for sentence importance
    calculated_page_rank = nx.pagerank(graph, weight='weight')

    # Sort sentences in descending order of importance
    sentences = sorted(calculated_page_rank, key=calculated_page_rank.get, reverse=True)

    # Return a summary of at most `summary_length` words
    summary = ' '.join(sentences)
    summary_words = summary.split()
    summary_words = summary_words[0:summary_length]
    dot_indices = [idx for idx, word in enumerate(summary_words) if word.find('.') != -1]
    if clean_sentences and dot_indices:
        last_dot = max(dot_indices) + 1
        summary = ' '.join(summary_words[0:last_dot])
    else:
        summary = ' '.join(summary_words)

    return summary
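
The truncation logic at the end of `generate_summary` can be looked at in isolation. The sketch below re-implements only that part as a hypothetical helper `truncate_summary` and applies it to two made-up sentences with a budget of eight words.

```python
def truncate_summary(ranked_sentences, summary_length=100):
    """Join ranked sentences, keep `summary_length` words, cut after the last word containing a period."""
    words = ' '.join(ranked_sentences).split()[:summary_length]
    dot_indices = [idx for idx, word in enumerate(words) if '.' in word]
    if dot_indices:
        words = words[:max(dot_indices) + 1]
    return ' '.join(words)

# Hypothetical ranked sentences
ranked = ["Solar prices fell in March.", "Demand is expected to rise sharply next year."]
print(truncate_summary(ranked, summary_length=8))
```

Because any word containing a period counts as a sentence end, tokens such as "Corp." or "7.5" can also end a summary, so the cut does not always fall on a true sentence boundary.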

In [None]:
#generate summaries
from tqdm import tqdm
tqdm.pandas(desc="Processing...")
cleantech_con["summary"] = cleantech_con["sentence_tokens"].progress_apply(generate_summary)

Processing...: 100%|██████████| 9590/9590 [04:28<00:00, 35.76it/s] 


In [None]:
#save the summaries
cleantech_con.summary.to_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/100_len_summaries.csv")

In [None]:
# read saved summaries and add to cleantech_con
summaries = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/100_len_summaries.csv",delimiter=",",low_memory=False)
cleantech_con["summary"]= summaries.summary



#### 3. Generate Question & Answer Pairs

In this step, we again explored different options for generating questions and answers from our summaries. <br>
The first approach used a pre-trained T5 model from Hugging Face. Unfortunately, it performed poorly: most of its outputs were not actual questions.<br>
For the questions, the best-performing approach was a BERT-based pre-trained question generator ("Bert-based Question Generation Model").<br>
For the answers, we used a T5-based model that is pre-trained to generate answers.


##### Pre-trained T5 Model

In [None]:
# installing modules
%pip install torch
%pip install transformers

In [None]:
#test with pretrained T5 model
#!pip install --no-cache-dir transformers sentencepiece
import torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("p208p2002/t5-squad-qg-hl", use_fast=False)

model = AutoModelForSeq2SeqLM.from_pretrained("p208p2002/t5-squad-qg-hl")

In [None]:
# sentence
sequence = f"Generate a question based on the following summary: {testinput.summary}"
# encoding sentence for model to process
inputs = tokenizer.encode(sequence, return_tensors='pt')
# generating text
outputs = model.generate(inputs, max_length=300, do_sample=True, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
# decoding text
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# printing output (uncomment to inspect the generated text)
#print(output_text)

This model performs poorly on our data. Four of the five outputs printed below are statements rather than questions:

In [None]:
for q in testinput.question:
  print(q)

QP has raised its carbon capture and storage ambitions to 7 million tons/yr by 2027
What did Fluor announce on Jan. 11?
China's imports of US oil and LNG could serve as the foundation for fresh discussions on trade.
Tepco is struggling to bring one of its two advanced boiling water reactors on line but must wait for a lengthy prefectural approval process. Kansai Electric Power Co. is having its own set of struggles: technical, legal and political
New York City pension funds announced a target to become carbon neutral by 2040.
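The share of usable outputs can also be counted programmatically. The sketch below applies a deliberately simple heuristic (an output counts as a question if it ends with a question mark) to the five outputs printed above. The function name `looks_like_question` is an assumption for illustration, and the heuristic will miss edge cases such as questions without punctuation.

```python
def looks_like_question(text: str) -> bool:
    """Heuristic: treat an output as a question if it ends with '?'."""
    return text.strip().endswith("?")


# The five outputs printed above, copied verbatim.
outputs = [
    "QP has raised its carbon capture and storage ambitions to 7 million tons/yr by 2027",
    "What did Fluor announce on Jan. 11?",
    "China's imports of US oil and LNG could serve as the foundation for fresh discussions on trade.",
    "Tepco is struggling to bring one of its two advanced boiling water reactors on line "
    "but must wait for a lengthy prefectural approval process. Kansai Electric Power Co. "
    "is having its own set of struggles: technical, legal and political",
    "New York City pension funds announced a target to become carbon neutral by 2040.",
]

n_questions = sum(looks_like_question(text) for text in outputs)
print(f"{n_questions} of {len(outputs)} outputs look like questions")
```

The same check could be applied to a whole `question` column to flag rows that need regeneration.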


##### 3.1. Bert-based Question Generation Model

In [None]:
import torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

#initialize tokenizer and model once, so they are not reloaded for every summary
#sequence-to-sequence question generator based on a pretrained bert-base model:
qg_tokenizer = AutoTokenizer.from_pretrained("voidful/context-only-question-generator", use_fast=False)
qg_model = AutoModelForSeq2SeqLM.from_pretrained("voidful/context-only-question-generator")

def generate_questions(input_summary):

  #define input sequence:
  sequence = f"Generate a question based on the following summary: {input_summary}"

  # encode sequence for model to process
  inputs = qg_tokenizer.encode(sequence, return_tensors='pt')

  # generate text (beam search with sampling, so outputs can differ between runs)
  outputs = qg_model.generate(inputs, max_length=300, do_sample=True, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

  # decode the generated token ids back into text
  output_text = qg_tokenizer.decode(outputs[0], skip_special_tokens=True)

  return output_text

In [None]:
#create questions for each summary
from tqdm import tqdm
tqdm.pandas(desc="Processing...")

# Function to generate questions
def generate_questions_wrapper(df):
    df["question"] = df["summary"].progress_apply(generate_questions)
    return df

# Calculate the size of each section
section_size = len(cleantech_con) // 10

# Split the dataframe into ten sections
sections = [cleantech_con.iloc[i:i+section_size] for i in range(0, len(cleantech_con), section_size)]

# Process and save each section
for i, section_df in enumerate(sections):
    section_df = generate_questions_wrapper(section_df)
    section_df.question.to_csv(f"/content/drive/MyDrive/CLT Project/NLP Stage 3/section_{i+1}_question.csv")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/331 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/712M [00:00<?, ?B/s]

Processing...: 100%|██████████| 959/959 [1:41:57<00:00,  6.38s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["question"] = df["summary"].progress_apply(generate_questions)
Processing...: 100%|██████████| 959/959 [1:39:19<00:00,  6.21s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["question"] = df["summary"].progress_apply(generate_questions)
Processing...: 100%|██████████| 959/959 [1:39:13<00:00,  6.21s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the cavea
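Because each section takes well over an hour, it can help to make the loop resumable, so that a disconnected session does not repeat finished work. The following is a minimal sketch of that idea, not the code that produced the output above: `split_into_sections` and `process_in_sections` are hypothetical helpers, `str.upper` stands in for the slow model call, and a temporary folder stands in for the Drive path. Taking a `.copy()` of each slice also avoids the `SettingWithCopyWarning` seen in the output.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_into_sections(df, n_sections):
    """Split df into at most n_sections consecutive chunks (independent copies)."""
    size = max(1, -(-len(df) // n_sections))  # ceiling division
    return [df.iloc[start:start + size].copy() for start in range(0, len(df), size)]


def process_in_sections(df, func, out_dir, column, n_sections=10):
    """Apply func to df['summary'] section by section, skipping saved sections."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, section in enumerate(split_into_sections(df, n_sections), start=1):
        out_path = out_dir / f"section_{number}_{column}.csv"
        if out_path.exists():
            continue  # already finished in an earlier session
        section[column] = section["summary"].apply(func)
        section[column].to_csv(out_path)  # the row index is saved with the values
        written.append(out_path.name)
    return written


# Toy data; str.upper is only a placeholder for the question generator.
toy = pd.DataFrame({"summary": ["a", "b", "c", "d", "e"]}, index=[0, 2, 5, 7, 9])
with tempfile.TemporaryDirectory() as tmp:
    first_run = process_in_sections(toy, str.upper, tmp, "question", n_sections=3)
    second_run = process_in_sections(toy, str.upper, tmp, "question", n_sections=3)

print(first_run)   # the three files written on the first pass
print(second_run)  # nothing left to do on the second pass
```

A partially written file would still count as finished here, so a production version might write to a temporary name first and rename on success.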

In [None]:
#import and concat questions
import glob
import os

path = r'/content/drive/MyDrive/CLT Project/NLP Stage 3/questions/'
all_files = glob.glob(os.path.join(path, "*.csv"))

questions_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
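When the section files are concatenated, the order in which `glob` returns them is not guaranteed, and even a sorted file list places `section_10` before `section_2`. A small sketch of one way to keep rows matched to their articles is shown below, assuming each file was written with its row index as the first column; `load_sections` and the toy file contents are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_sections(folder, column):
    """Read every section CSV in folder and return one Series in row-index order."""
    files = sorted(Path(folder).glob("*.csv"))
    parts = [pd.read_csv(path, index_col=0)[column] for path in files]
    return pd.concat(parts).sort_index()


# Toy articles with a non-contiguous index.
articles = pd.DataFrame({"summary": ["s0", "s2", "s5", "s7"]}, index=[0, 2, 5, 7])

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Lexical sorting puts section_10 before section_2, so file order alone
    # would scramble the rows.
    pd.Series(["q0", "q2"], index=[0, 2], name="question").to_csv(
        folder / "section_2_question.csv"
    )
    pd.Series(["q5", "q7"], index=[5, 7], name="question").to_csv(
        folder / "section_10_question.csv"
    )
    combined = load_sections(folder, "question")

# Assignment aligns on the restored index labels.
articles["question"] = combined
print(articles)
```

Keeping the saved index and sorting on it makes the result independent of file naming and listing order.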

In [None]:
#export questions as single column
cleantech_con.question.to_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/all_questions.csv")

In [None]:
# read saved questions and add to cleantech_con
all_questions = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/all_questions.csv",delimiter=",",index_col=0,low_memory=False)  # index_col=0 restores the saved row index
cleantech_con["question"]= all_questions.question

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleantech_con["question"]= all_questions.question


Below, we can see the question generated for the first article of the dataset:

In [None]:
cleantech_con.summary[0]

"The company is also aiming to reduce gas flaring intensity across its upstream facilities by more than 75% and has raised its carbon capture and storage ambitions from 5 million tons/yr to 7 million tons/yr by 2027. In its latest Sustainability Report published on Wednesday, QP said its goals include `` reducing the emissions intensity of Qatar's LNG facilities by 25% and of its upstream facilities by at least 15%. '' QP says it should be able to eliminate routine gas flaring by 2030, with methane emissions limited `` by setting a methane intensity target of 0.2%"

In [None]:
cleantech_con.question[0]

'By what year will QP be able to eliminate routine gas flaring?'

The generated question is well formed and can be answered from the summary. The next step is to generate the answers.

##### 3.2. T5-based Answer Generation Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

#T5 base answer generator, loaded once so it is not re-initialized for every row
#(AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for T5 models)
model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
qa_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = False)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_answers(row):
  question = row["question"]
  context = row["summary"]

  #prompt format used here: "question: ... context: ..."
  input_text = f"question: {question} context: {context}"

  encoded_input = qa_tokenizer([input_text],
                              return_tensors='pt',
                              max_length=512,
                              truncation=True)

  output = qa_model.generate(input_ids = encoded_input.input_ids,
                              attention_mask = encoded_input.attention_mask)
  output = qa_tokenizer.decode(output[0], skip_special_tokens=True)

  return output

In [None]:
#!pip install sentencepiece
#!pip install transformers
#!pip install transformers[sentencepiece]
import sentencepiece
from tqdm import tqdm
tqdm.pandas(desc="Processing...")

# Function to generate answers
def generate_answers_wrapper(df):
    df["answer"] = df.progress_apply(generate_answers,axis=1)
    return df

# Calculate the size of each section
section_size = len(cleantech_con) // 10

# Split the dataframe into ten sections
sections = [cleantech_con.iloc[i:i+section_size] for i in range(0, len(cleantech_con), section_size)]

# Process and save sections 7 to 10 (enumerate starts at 0, hence i+7 in the file name)
for i, section_df in enumerate(sections[6:]):
    section_df = generate_answers_wrapper(section_df)
    section_df.answer.to_csv(f"/content/drive/MyDrive/CLT Project/NLP Stage 3/section_{i+7}_answer.csv")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

The last 5000 lines of the streaming output were truncated.
Processing...: 100%|██████████| 959/959 [1:42:22<00:00,  6.41s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["answer"] = df.progress_apply(generate_answers,axis=1)
Processing...: 100%|██████████| 959/959 [1:46:49<00:00,  6.68s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["answer"] = df.progress_apply(generate_answers,axis=1)


In [None]:
#import and concat answers
import glob
import os

path = r'/content/drive/MyDrive/CLT Project/NLP Stage 3/answers/'
all_files = glob.glob(os.path.join(path, "*.csv"))

answers_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

In [None]:
#export answers as single column
cleantech_con.answer.to_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/all_answers.csv")

In [None]:
# read saved answers and add to cleantech_con
all_answers = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/all_answers.csv",delimiter=",",index_col=0,low_memory=False)  # index_col=0 restores the saved row index
cleantech_con["answer"]= all_answers.answer

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleantech_con["answer"]= all_answers.answer


Displaying the summary, question and answer together shows that the answer is taken directly from the summary text:

In [None]:
cleantech_con.summary[2]

'Energy has come to play a bigger role in that relationship than ever before, and rising Chinese imports of US oil and LNG could serve as the foundation for fresh discussions on trade -- one of the few areas where US-China communications have not completely broken down. Last year’ s Phase 1 trade deal entailed China spending an unrealistically high $ 18.5 billion above the 2017 baseline on US energy products, a goal made all the more impossible by China’ s demand plunge from the coronavirus and sharply lower oil prices.'

In [None]:
cleantech_con.question[2]

'How much did China spend on US energy products in 2017?'

In [None]:
cleantech_con.answer[2]

'$18.5 billion'
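Whether an answer is actually taken from its summary can be checked mechanically. The sketch below is a rough heuristic under stated assumptions: `is_extractive` is a hypothetical helper, and it ignores whitespace and case because the summaries contain tokenization spaces such as `$ 18.5 billion`. It only confirms that the answer string occurs in the context, not that it answers the question correctly.

```python
import re


def _squash(text: str) -> str:
    """Lower-case the text and drop all whitespace."""
    return re.sub(r"\s+", "", text).lower()


def is_extractive(answer: str, context: str) -> bool:
    """Return True if the answer occurs in the context, ignoring spacing and case."""
    if not answer.strip():
        return False
    return _squash(answer) in _squash(context)


# Excerpt of the summary shown above, copied verbatim.
context = (
    "Last year’ s Phase 1 trade deal entailed China spending an unrealistically "
    "high $ 18.5 billion above the 2017 baseline on US energy products"
)

print(is_extractive("$18.5 billion", context))
print(is_extractive("$20 billion", context))
```

Rows where the check fails are candidates for manual review, since the model may have paraphrased or invented the answer.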

However, we can also see that not all answers were generated:

In [None]:
import numpy as np
missing_answers_df = cleantech_con.loc[cleantech_con.answer.isnull()]
missing_answers_df

Unnamed: 0,title,content,sentence_tokens,summary,question,answer
5,Japan: Supreme Court Will Likely Decide on Fuk...,"[""Japan's Supreme Court will likely become the...",[Japan's Supreme Court will likely become the ...,One by the government's own Headquarters for E...,What is the government's Headquarters for quak...,
47,Carolinas Wrestle Over Duke's Future Plans for...,"[""Duke Energy's plans for the Carolinas has hi...",[Duke Energy's plans for the Carolinas has hit...,“ This legislation appears to bind the hands o...,What does the Environmental Defense Fund direc...,
71,Biden Hits Gas Pedal on EV Push,"[""The Biden administration last week rolled ou...",[The Biden administration last week rolled out...,"Biden, in an executive order last week, outlin...",What was Biden's executive order?,
119,US-EU Outline Energy 'Marshall Plan ',['The EU and US have agreed to set up a joint ...,[The EU and US have agreed to set up a joint t...,And the US commitment to swiftly approve new i...,What was the short term difficulty in persuadi...,
124,Big Bank Shareholders Reject Climate Proposals,['Shareholders at three of the top US investme...,[Shareholders at three of the top US investmen...,“ Our board believes that the policy change re...,What did Bank of America say in proxy statements?,
...,...,...,...,...,...,...
9369,ArcVera Expands Renewable Energy Opportunities...,"['ArcVera Renewables, a global provider of con...","[ArcVera Renewables, a global provider of cons...",We are now actively seeking new administrative...,What is the new strategy update for ArcVera?,
9398,Emerson Debuts Ovation Green Single-System Ren...,['Emerson says it is combining its power-secto...,[Emerson says it is combining its power-sector...,“ Countries around the globe are focused on tr...,What is the Ovation Green portfolio?,
9510,Revolution BESS Project Closer to Commercial O...,['Commissioning activities have begun at Revol...,[Commissioning activities have begun at Revolu...,"In December 2022, Spearmint broke ground on Re...",What is the PowerTitan Series battery energy s...,
9566,"Arctech, Technical University of Madrid Unite ...","['Recently, two professors from the Technical ...","[Recently, two professors from the Technical U...",The exclusive development of the 3D Geographic...,What is the purpose of the 3D Geographic Infor...,


In total, 341 answers could not be generated. We therefore clean up the question and answer pairs manually and insert the missing answers, which were generated with ChatGPT.
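The number of missing answers can be read off a boolean mask instead of the table display. The following sketch uses a three-row toy frame (the questions are taken from the outputs above, the whitespace-only answer is invented for illustration) and also counts answers that are empty or contain only whitespace, which `isnull()` alone would not catch.

```python
import pandas as pd

# Toy stand-in for the question/answer columns of cleantech_con.
qa = pd.DataFrame(
    {
        "question": [
            "How much did China spend on US energy products in 2017?",
            "What was Biden's executive order?",
            "What is the Ovation Green portfolio?",
        ],
        "answer": ["$18.5 billion", None, "   "],
    },
    index=[2, 71, 9398],
)

# Missing means: NaN/None, or nothing but whitespace.
is_missing = qa["answer"].isna() | qa["answer"].fillna("").str.strip().eq("")

n_missing = int(is_missing.sum())
missing_rows = qa.index[is_missing].tolist()
print(n_missing, missing_rows)
```

The resulting index list identifies exactly which rows still need an answer.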

##### Cleaning up generated Q&A Pairs

In [None]:
missing_answers_df = missing_answers_df.drop(['title', 'content', 'sentence_tokens'], axis=1)

In [None]:
missing_answers_df

Unnamed: 0,summary,question,answer
5,One by the government's own Headquarters for E...,What is the government's Headquarters for quak...,
47,“ This legislation appears to bind the hands o...,What does the Environmental Defense Fund direc...,
71,"Biden, in an executive order last week, outlin...",What was Biden's executive order?,
119,And the US commitment to swiftly approve new i...,What was the short term difficulty in persuadi...,
124,“ Our board believes that the policy change re...,What did Bank of America say in proxy statements?,
...,...,...,...
9369,We are now actively seeking new administrative...,What is the new strategy update for ArcVera?,
9398,“ Countries around the globe are focused on tr...,What is the Ovation Green portfolio?,
9510,"In December 2022, Spearmint broke ground on Re...",What is the PowerTitan Series battery energy s...,
9566,The exclusive development of the 3D Geographic...,What is the purpose of the 3D Geographic Infor...,


In [None]:
missing_answers_df.to_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/df_missing_answers.csv")

Manually add missing answers:

In [None]:
cleantech_con.answer[5] = "The government's earthquake research office."
cleantech_con.answer[47] = "The director criticizes legislation for favoring fossil fuel power plants."
cleantech_con.answer[71] = "Biden's executive order outlines regulatory steps, including revising tailpipe emissions standards and fuel economy targets."
cleantech_con.answer[119] = "Short-term difficulty: Industry skepticism and senior administration officials' reluctance to look beyond approved LNG export projects."
cleantech_con.answer[124] = "Bank of America opposes the proposed policy change, stating it's unnecessary and would restrict the company's ability to implement its climate strategy."
cleantech_con.answer[125] = "Bank of America expresses belief that the proposed policy change is unnecessary and would restrict the company's ability to implement its climate strategy."
cleantech_con.answer[133] = "Electricity generated from renewable sources can power the most energy-intensive part of the CO2 capture process using solid sorbents."
cleantech_con.answer[154] = "Capacity side of energy storage refers to impacts from the war in Ukraine and economic crises in Europe."
cleantech_con.answer[219] = "The leader of the world's largest economy is not specified in the provided text."
cleantech_con.answer[221] = "Recent declines in global LNG prices are believed to be a net positive, supporting demand growth in key developing markets."
cleantech_con.answer[233] = "Another challenge in meeting ambitious offshore installation targets: Evolution of prices in transport, steel, and energy affecting ZF Wind Power's production."
cleantech_con.answer[254] = "Khambule criticizes Ramaphosa for peddling 20th-century solutions to 21st-century problems."
cleantech_con.answer[256] = "Grim outlook for 2021 includes material production downgrades, uncertainty, and drilling pauses."
cleantech_con.answer[272] = "Housing crisis in mining towns; Eskom has a responsibility to be a catalyst for transformation."
cleantech_con.answer[296] = "New program supports development of floating offshore wind technology."
cleantech_con.answer[313] = "Prime Minister of Eswatini announces 100% hot water provision in government medical clinics."
cleantech_con.answer[317] = "Sword drives business efficiently by coordinating vendors, using specific templates, and understanding technical scopes."
cleantech_con.answer[324] = "CEO describes compression and additional development wells as a 'no brainer'."
cleantech_con.answer[335] = "Waste management specialist ensures waste is never a surprise; forward planning may prevent costs."
cleantech_con.answer[369] = "CEO says the £350m expansion project needs assistance for green infrastructure and hydrogen feasibility studies."
cleantech_con.answer[370] = "First and easiest approach: Bringing more renewables via a private wire or tying into local distilleries for CO2."
cleantech_con.answer[390] = "BP and Orsted believe both projects can co-exist; working closely to resolve issues with the Crown Estate."
cleantech_con.answer[417] = "Chevron delayed CO2 injection, unable to meet agreed targets; made the decision to start the system safely."
cleantech_con.answer[423] = "Mr. van Beurden suggests producing oil and gas in the UK's own backyard; considers various factors before each licensing round."
cleantech_con.answer[434] = "The presumption is for green hydrogen production facilities to be sited close to consumers."
cleantech_con.answer[444] = "New Sectoral Marine Plan for offshore wind energy is a key component of progress to decarbonization and a transition to net zero."
cleantech_con.answer[459] = "Purpose of testing wave energy converter (WEC) technologies: Testing systems for remote applications and generating public data."
cleantech_con.answer[461] = "Contracts for Difference scheme establishes subsidies for renewable projects; supports the next generation of renewable electricity projects."
cleantech_con.answer[507] = "Challenges remain in heating, buildings, transport, and other sectors."
cleantech_con.answer[555] = "NSZ (net zero strategy) contains appropriate detail; public interest in disclosing information is outweighed by public interest in withholding."
cleantech_con.answer[566] = "Historical moment for Vietnam as it becomes the leader of wind installations in Southeast Asia and the second biggest market in Asia Pacific."
cleantech_con.answer[567] = "Second part of the series focuses on net zero data and reducing the carbon emissions footprint of data."
cleantech_con.answer[576] = "New All-Party Parliamentary Group of the Celtic Sea provides visibility at Westminster."
cleantech_con.answer[613] = "Peter Welsh of the GMB union considers the political failure, citing missed opportunities for local job creation."
cleantech_con.answer[619] = "Relevant time to promote geothermal is marked by political and industry will, supportive legislative frameworks, and funding."
cleantech_con.answer[669] = "Innovation awards focus on achievements, growth, supply chain, people strategy, and behind-the-scenes support in the wind sector."
cleantech_con.answer[680] = "Climate crisis at COP26 revolves around the urgent need for action to limit global warming to 1.5 degrees."
cleantech_con.answer[682] = "Diversification of gas supplies involves more LNG, pipeline imports, biomethane, and green hydrogen."
cleantech_con.answer[694] = "Professor Underhill suggests investigating repurposing some gas fields for energy supply security."
cleantech_con.answer[745] = "Steve Edwards discusses geo-political tensions, proactive measures, reshaping portfolios, and addressing stakeholder expectations."
cleantech_con.answer[845] = "Drawing subsurface heat for scalable baseload energy is a feature of the project."
cleantech_con.answer[871] = "Execution template for managing brownfield subsea tie-ins emphasizes collaboration across license holders, FPSO operators, and key service providers."
cleantech_con.answer[889] = "The guide uses existing frameworks for data disclosure tools."
cleantech_con.answer[937] = "Angela Terry criticizes the energy secretary's views on onshore wind and his preference for banning wind farms."
cleantech_con.answer[955] = "Impact of each reduction initiative calculated through measuring and modeling for emissions reduction and strategic cost analysis."
cleantech_con.answer[967] = "Demonstrating fixed income sources over several years helps achieve higher gearing for financial support from banks."
cleantech_con.answer[973] = "Nature tech aims to de-risk investments, boost crop yields, increase transparency, and promote national and global growth."
cleantech_con.answer[1039] = "E & P significantly impacts the carbon footprint, and the UK is considered unfriendly for exploration and production."
cleantech_con.answer[1050] = "Companies transformed from a sustainability perspective include Apple, Google, Intel, Amazon, SAP, HP, and Snowflake."
cleantech_con.answer[1053] = "UNEP will work with willing governments and businesses to shift away from single-use plastics and mobilize private finance."
cleantech_con.answer[1067] = "Morag Watson praises the bold, ambitious, and transformative aspects of the Scottish Government's Energy Strategy and Just Transition Plan."
cleantech_con.answer[1069] = "Scotland's largest electricity infrastructure company stands ready for a potential £15bn investment program for net zero."
cleantech_con.answer[1101] = "Culture Unstained describes a near wholesale rejection of BP's brand across the arts due to climate concerns."
cleantech_con.answer[1103] = "IEA and IPCC warn against new oil fields; major banks finance such projects despite environmental pledges."
cleantech_con.answer[1124] = "Companies in the Flue2Chem initiative include BASF, Tata Steel, Procter & Gamble, UPM-Kymmene, Holmen, Johnson Matthey, Reckitt, Croda, and Carbon Clean."
cleantech_con.answer[1127] = "Werner Hoyer, the president of EIB, emphasizes the need for Europe to keep up with green subsidies."
cleantech_con.answer[1159] = "Benj Sykes highlights the super urgent need to resolve the battle for space between offshore wind and carbon storage technologies."
cleantech_con.answer[1161] = "Thinking too short-term around infrastructure upgrades poses a danger; Mr. Wharton suggests not spending enough on infrastructure."
cleantech_con.answer[1165] = "Association for Renewable Energy and Clean Technology warns of risks to renewables investments over a controversial levy."
cleantech_con.answer[1172] = "Next step of the POB Module is to provide complete visibility and control of site arrivals, departures, and current POB across teams."
cleantech_con.answer[1183] = "Goal of net zero is driving plans to boost renewables, revive nuclear, and build new industries like carbon capture."
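Assignments of the form `df.answer[label] = value` are chained indexing, which can trigger a `SettingWithCopyWarning` and, depending on the pandas version and Copy-on-Write settings, may not update the original frame. A minimal sketch of an alternative is to collect the manual answers in a dictionary and write them with `.loc`. The frame `qa` and the helper `apply_manual_answers` are illustrative names, and only two of the answers above are repeated here.

```python
import pandas as pd

# Toy stand-in for cleantech_con with two missing answers.
qa = pd.DataFrame(
    {"answer": [None, None, "$18.5 billion"]},
    index=[5, 47, 2],
)

# Row label -> manually supplied answer (two entries copied from the cell above).
manual_answers = {
    5: "The government's earthquake research office.",
    47: "The director criticizes legislation for favoring fossil fuel power plants.",
}


def apply_manual_answers(df, answers, column="answer"):
    """Write answers into df[column] by row label, refusing unknown labels."""
    unknown = sorted(set(answers) - set(df.index))
    if unknown:
        # .loc would otherwise silently append new rows for unknown labels.
        raise KeyError(f"row labels not in the dataframe: {unknown}")
    for label, text in answers.items():
        df.loc[label, column] = text
    return df


apply_manual_answers(qa, manual_answers)
print(qa)
```

Keeping the answers in one dictionary also makes it easy to store them in a separate file and to verify that every missing row has been covered.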


In [None]:
cleantech_con.answer[1191] = "Examples of short and complex procurement processes include government contracts, emergency acquisitions, and urgent supply needs."
cleantech_con.answer[1193] = "The pragmatic approach to the UKCS involves keeping it healthy, developing oil and gas finds, and prioritizing essential drilling while increasing focus on low carbon energy."
cleantech_con.answer[1239] = "The North Sea veteran criticized the Tory Party for introducing complicated and penalizing tax changes and the Labour Party for advocating harsher taxation policies that may constrain new investment in the oil and gas industry."
cleantech_con.answer[1264] = "The text is in German, and translation is needed to answer the question about calculating Peak Demand Reduction at the Residential Sector."
cleantech_con.answer[1276] = "The specific bill number or name that Jared Polis vetoed is not mentioned in the provided text."
cleantech_con.answer[1292] = "The BMW i4, BYD Tang, MG Marvel R, Citroen e-C4, and BMW iX3 all grew between 2x and 4x in volume."
cleantech_con.answer[1296] = "The cost of Climeworks carbon credits is expected to drop to $250-$300/mtCO2e by the end of 2030."
cleantech_con.answer[1298] = "There is uncertainty regarding the 3-month total number of the Mini Cooper SE due to data sources being patchy for some models and months."
cleantech_con.answer[1342] = "The aim is to understand when it's optimal to replace infrastructure, considering factors like operational performance, energy consumption, and emissions."
cleantech_con.answer[1346] = "The Energy Department describes the goals of the Wind and Solar Reliability program as supporting technology goals and accelerating job growth and quality."
cleantech_con.answer[1374] = "The new title of the Inflation Reduction Act of 2022 is not provided in the text."
cleantech_con.answer[1436] = "The Mastercard report calculates employee residential charging costs based on electric vehicle charging events, including home charges."
cleantech_con.answer[1443] = "Variable renewable energy and transmission capacity are sized to meet demand during daily stressful periods, with storage filling supply gaps and curtailing excess renewable energy."
cleantech_con.answer[1448] = "The white paper discusses how nature tech can help solve climate and nature crises, emphasizing the need for investment in nature-based solutions."
cleantech_con.answer[1529] = "The Amazon Rainforest is one of the world's largest carbon sinks."
cleantech_con.answer[1578] = "The German word for Schadenfreude is not provided in the text."
cleantech_con.answer[1665] = "The APS (Advanced Photon Source) device was instrumental in supporting the lithium-sulfur battery research."
cleantech_con.answer[1707] = "CleanTechnica is the #1 cleantech-focused news & analysis website globally."
cleantech_con.answer[1708] = "Barr, described as one of the top Republicans on the House Financial Services Committee, stated he would exercise rigorous oversight of regulators and asset managers who politicized capital allocation."
cleantech_con.answer[1732] = "The type of testing infrastructure available at the site is not specified in the provided text."
cleantech_con.answer[1738] = "The purpose of the tariff review proposed by Kenya Power is to incentivize the adoption of electric vehicles and encourage investment in electric vehicle charging infrastructure."
cleantech_con.answer[1740] = "Climate stability is losing stability with climate change, making larger areas less suitable for densely populated regions with deadly effects."
cleantech_con.answer[1741] = "The cost of renewables in America is raised by the increased focus on energy security due to the Russia-Ukraine war, and the seizure of solar panels from the Xinjiang region of China."
cleantech_con.answer[1776] = "Some of the issues with the standards for EV charging stations include disparities in connector types, payment methods, data privacy, speed and power of chargers, reliability, and the overall user experience."
cleantech_con.answer[1794] = "The purpose of GridBeyond's white paper is to look at the vulnerabilities and strengths of the energy sector, identifying renewables and technologies like energy storage, battery, and EVs as essential for a more resilient and flexible grid and a stronger energy market less dependent on fossil fuels."
cleantech_con.answer[1805] = "If the new owners of a 1995 Chevy Cavalier didn't know how to activate the cigarette lighter, it wasn't the end of the world."
cleantech_con.answer[1849] = "The review found that communities across the United States have faced environmental injustices, enduring underinvestment in infrastructure and critical services, and suffering disproportionate impacts from climate change."
cleantech_con.answer[1889] = "To best use public funds so strategic EV charging takes place, you can incentivize charging installations, ideally through a competitive process, such as companies bidding for different projects. Incentivizing installing charging at workplaces can tap into both economic and environmental benefits."
cleantech_con.answer[1904] = "The philosophy of Bold for Nature is emphasized in the center fascia through the application of contrasting materials, while the large, wide panoramic display provides greater levels of content for occupants to enjoy."
cleantech_con.answer[1914] = "One of the enduring pieces of FUD (Fear, Uncertainty, Doubt) misinformation is that wind turbines and other materials used for renewable energy generation will end up in the landfill and destroy the Earth."
cleantech_con.answer[1925] = "The long-term goal of the Paris Agreement is to hold the increase in the global average temperature to well below 2°C above pre-industrial levels and to pursue efforts to limit the temperature increase to 1.5°C above pre-industrial levels."
cleantech_con.answer[1937] = "A sea change for energy security and jobs is ensuring the U.S. solar and storage industry has a reliable supply of solar equipment as it becomes a fundamental part of America's energy supply."
cleantech_con.answer[1952] = "The mining code refers to the comprehensive set of rules, regulations, and procedures issued by the International Seabed Association (ISA) to regulate prospecting, exploration, and exploitation of marine minerals in the international seabed area."
cleantech_con.answer[1967] = "Tesla describes at length the various tax credits that will apply to its vehicles and solar products in a new posting on its website, particularly in light of the Inflation Reduction Act and other clean energy and electric vehicle programs."
cleantech_con.answer[2019] = "The goal of reducing the energy consumption associated with AI algorithms is to minimize energy consumption and improve scalability for wirelessly interconnected devices, thereby reducing the overall environmental impact of these technologies."
cleantech_con.answer[2048] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2091] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2103] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2108] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2116] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2119] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2128] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2129] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2134] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2141] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2146] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2152] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2157] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2162] = "CleanTechnica is the #1 cleantech-focused news & analysis website in the US & the world, focusing primarily on electric cars, solar energy, wind energy, & energy storage."
cleantech_con.answer[2194] = "The project will promote the use of green hydrogen, produced from water using renewable energy, to reduce emissions and displace petroleum fuels in the transportation sector."
cleantech_con.answer[2202] = "The Renewable Energy Community, powered by Enel X's technologies, will have a positive impact on the environment and the community. Ferrari aims to become carbon-neutral by 2030, driving innovation in various areas."
cleantech_con.answer[2220] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, with a primary focus on electric cars, solar energy, wind energy, and energy storage."
cleantech_con.answer[2222] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, emphasizing electric cars, solar energy, wind energy, and energy storage. It provides insights into innovative solutions in the clean energy space."
cleantech_con.answer[2232] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. It plays a crucial role in reporting on initiatives related to decarbonizing transportation."
cleantech_con.answer[2233] = "The NREL (National Renewable Energy Laboratory) is leading efforts in the U.S. aviation industry to find sustainable pathways, including sustainable aviation fuels, electrification, and new transportation options, to reduce carbon dioxide emissions."
cleantech_con.answer[2253] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, with a primary focus on electric cars, solar energy, wind energy, and energy storage."
cleantech_con.answer[2260] = "The policy against articles written with ChatGPT is being broken to allow articles explicitly stated as being from ChatGPT, addressing the ban on ChatGPT in certain machine learning conference submissions."
cleantech_con.answer[2270] = "Sean Moriarty, Deputy Commissioner, emphasizes that the landfill solar redevelopment project transforms an environmental liability into an asset, achieving environmental and economic success while serving the community."
cleantech_con.answer[2305] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. It discusses the potential of waste biomass feedstock for fulfilling transportation needs."
cleantech_con.answer[2310] = "The optimal specifications for the CJPT agreement include a cruising range per charge of about 200 kilometers and a flexible chassis to meet the needs of customers in the delivery industry."
cleantech_con.answer[2318] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. It reports on new research exploring the potential value of implementing EV managed charging."
cleantech_con.answer[2320] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, emphasizing electric cars, solar energy, wind energy, and energy storage. North Carolina State University has received an NSF grant to create the National Research Center for Future Renewable Energy."
cleantech_con.answer[2321] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. U.S. agencies are releasing rules to tighten reporting and procedural requirements for reducing national emissions."
cleantech_con.answer[2328] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage."
cleantech_con.answer[2347] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, concentrating on electric cars, solar energy, wind energy, and energy storage. Teams interested in developing solutions for decarbonizing U.S. housing can apply for a new Building America Program."
cleantech_con.answer[2348] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, emphasizing electric cars, solar energy, wind energy, and energy storage."
cleantech_con.answer[2352] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. The U.S. Environmental Protection Agency has announced new guidelines related to mountaintop removal coal mine permits."
cleantech_con.answer[2359] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, concentrating on electric cars, solar energy, wind energy, and energy storage. It emphasizes understanding and implementing targets and policies to reduce greenhouse gas emissions."
cleantech_con.answer[2362] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. The article discusses camping, van life, and living off the grid with technology."
cleantech_con.answer[2363] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, concentrating on electric cars, solar energy, wind energy, and energy storage. Himiway has gained recognition for its long-range, fat-tire electric mountain bikes."
cleantech_con.answer[2374] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, emphasizing electric cars, solar energy, wind energy, and energy storage. CityRyde's carbon methodology for its Inspire software has been validated under the Verified Carbon Standard."
cleantech_con.answer[2377] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage. Himiway is recognized for its long-range, fat-tire electric mountain bikes."
cleantech_con.answer[2399] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, concentrating on electric cars, solar energy, wind energy, and energy storage. It discusses the unexpected role of red state right-to-work laws in attracting electric vehicle manufacturers."
cleantech_con.answer[2404] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, emphasizing electric cars, solar energy, wind energy, and energy storage. The National Geographic series 'Secrets of the Elephants' is highlighted for its visual storytelling."
cleantech_con.answer[2449] = "CleanTechnica is the #1 cleantech-focused news and analysis website globally, focusing on electric cars, solar energy, wind energy, and energy storage."
cleantech_con.answer[2482] = "The megatrend of decarbonization refers to the overarching trend toward reducing carbon emissions and achieving climate neutrality in various industries. DecarbXpo is an event discussing and presenting solutions related to climate neutrality and decarbonization."
cleantech_con.answer[2491] = "DecarbXpo offers start-ups in the hydrogen sector an opportunity to present their innovations and ideas to an expert audience from different industries and business sectors."
cleantech_con.answer[2505] = "To increase local commitment to protection purposes, organizations can promote long-term environmental practices, raise awareness, and cooperate with other initiatives. This includes implementing quiet ventilation methods, receptive solar design, water recycling systems, rainwater harvesting, and waste control."
cleantech_con.answer[2635] = "The focus of research on renewable energy from biomass includes biomass-based renewable energy, the production of commodity chemicals from renewable sources, and exploring the water-energy-food nexus. The research aims to support South Africa's transition from conventional plastics to more environmentally sustainable alternatives."


In [None]:
cleantech_con.answer[2640] = "Supporting SA's shift to sustainable plastics."
cleantech_con.answer[2686] = "Need solid data for water management."
cleantech_con.answer[2727] = "Analyzed hydrogen-enriched fuel performance."
cleantech_con.answer[2789] = "Design for sustainability in the built environment."
cleantech_con.answer[2815] = "Recycled plastic's diverse applications."
cleantech_con.answer[2870] = "AZoCleantech discusses climate change mitigation."
cleantech_con.answer[2873] = "Examples of farmer-driven funding models."
cleantech_con.answer[2954] = "2010 report on microalgae characteristics."
cleantech_con.answer[3021] = "Supporting SA's transition from plastics."
cleantech_con.answer[3049] = "Pike Research tracks Evotherm M1 impacts."
cleantech_con.answer[3086] = "AZoCleantech on X-ray mapping of aged nanoparticles."
cleantech_con.answer[3087] = "Pike Research report on hydrogen release."
cleantech_con.answer[3180] = "Balanced theory and apps for DME use."
cleantech_con.answer[3292] = "Risks from forward-looking statements."
cleantech_con.answer[3384] = "Challenges in decarbonizing fertilizer production."
cleantech_con.answer[3390] = "Click 'Allow All' for media outlet survival."
cleantech_con.answer[3419] = "Svaty uses tech for soil mineral assessment."
cleantech_con.answer[3561] = "Benefits to rural communities from ArcVera."
cleantech_con.answer[3633] = "Evotherm M1's compliance with standards."
cleantech_con.answer[3693] = "Stugalux's LOOP systems process waste gas."
cleantech_con.answer[3741] = "UNSW-led project with AUD $3.08m funding."
cleantech_con.answer[3751] = "Click 'Allow All' for strong signal."
cleantech_con.answer[3823] = "Summary on climate risk in NEPA reviews."
cleantech_con.answer[3834] = "NOAA's Spring outlook aids climate readiness."
cleantech_con.answer[3958] = "Tool for hard coral diversity by DNA."
cleantech_con.answer[4030] = "Effect of cost on LCOH and CCA."
cleantech_con.answer[4040] = "AZoCleantech discusses fuel cell development."
cleantech_con.answer[4119] = "Teixeira's comment on the study."
cleantech_con.answer[4157] = "Danger of adding sodium hydroxide to coastal waters."
cleantech_con.answer[4204] = "AMETEK's sales and service network."
cleantech_con.answer[4224] = "AZoCleantech discusses cookies and site usage."
cleantech_con.answer[4317] = "Motivation behind the electric platform study."
cleantech_con.answer[4357] = "Topic of the discussion 'Sustainability in Winter Sports'."
cleantech_con.answer[4514] = "Study's discovery about microbial functional and genomic information encoding nitrogen (N)."
cleantech_con.answer[4549] = "Purpose of chemical sorting."
cleantech_con.answer[4576] = "Study's findings according to opinions and comments."
cleantech_con.answer[4631] = "Hyundai IONIQ 5 as a reliable heavy hitter."
cleantech_con.answer[4808] = "Status of Peugeot e-2008 in the top 10 spots."
cleantech_con.answer[4863] = "The American problem of lagging cleantech investment."
cleantech_con.answer[4864] = "CleanTechnica is the #1 cleantech-focused news & analysis website."
cleantech_con.answer[4871] = "Helping school districts understand obstacles."
cleantech_con.answer[4901] = "Explanation of the 25th Amendment."
cleantech_con.answer[4975] = "Shell's commitment to fleet decarbonization."
cleantech_con.answer[4982] = "Patrick Rau's comment on production levels and demand."
cleantech_con.answer[4988] = "Most cost-effective, low-risk pathways to net zero."
cleantech_con.answer[5013] = "Goal of creating an emissions-free power sector."
cleantech_con.answer[5023] = "Purpose of the Net-Zero Producers Forum."
cleantech_con.answer[5037] = "Fallout of the gasoline shortages."
cleantech_con.answer[5056] = "Hill's statement about the combination with Crestone."
cleantech_con.answer[5096] = "Adrián Duhalt's comment to NGI's Mexico Gas Price Index."
cleantech_con.answer[5100] = "Developing a clear and reliable methodology for quantifying LNG supply chain carbon intensity."
cleantech_con.answer[5104] = "ARI President Vello Kuuskraa emphasizes experience in carbon storage for the Heartland system's reliability."
cleantech_con.answer[5117] = "Volatile market ahead with potential for downward pressure on LNG demand."
cleantech_con.answer[5122] = "Federal government alleges inadequacies in federal programs regarding environmental impacts and community involvement."
cleantech_con.answer[5155] = "Risk of insufficient gas supplies due to low European stocks and potential shortfall in gas-to-coal switching."
cleantech_con.answer[5166] = "Fear in the market due to perceived insufficient storage levels, particularly for a cold winter."
cleantech_con.answer[5184] = "Limited fundamental catalysts for recent sell-off; bomb cyclone likely to remain offshore."
cleantech_con.answer[5196] = "Hydrogen Infrastructure Initiative aims to increase clean hydrogen production; potential drastic increase discussed."
cleantech_con.answer[5263] = "Bearish tilt in weather outlook not explicitly mentioned in the provided text."
cleantech_con.answer[5278] = "Impact of Omicron variant on energy demand and recovery; restrictions expected in 1Q2022."
cleantech_con.answer[5309] = "Occidental's new shareholder return framework for 2022 is focused on debt reduction and operational efficiencies."
cleantech_con.answer[5342] = "Ursula von der Leyen mentioned using collective bargaining power instead of outbidding to control costs and attract LNG and pipeline imports."
cleantech_con.answer[5346] = "Operators in the basin being in high gear on their development plans is a clear signal."
cleantech_con.answer[5347] = "LNG Allies CEO Fred Hutchison states that the action by the EXIM board benefits U.S. LNG export projects during the current global LNG shortage."
cleantech_con.answer[5375] = "Callon's spending guidance for the Permian Basin is a product of inflationary service cost pressures."
cleantech_con.answer[5390] = "The likely winter storage shortfall for Nymex suggests a renewal of significant upward pressure."
cleantech_con.answer[5393] = "EBW senior analyst Eli Rubin mentions a renewal of significant upward pressure for Nymex futures."
cleantech_con.answer[5460] = "Shale Daily provides impactful news and transparent pricing for shale and unconventional plays across the U.S. and Canada."
cleantech_con.answer[5483] = "The most notable responses, according to IEA, include the U.S. Inflation Reduction Act."
cleantech_con.answer[5504] = "Shale Daily offers a clear snapshot of natural gas supplies for analysts, investors, and global LNG buyers."
cleantech_con.answer[5534] = "The issue with the resource in the U.S. is not about availability but about transportation to where it's needed."
cleantech_con.answer[5558] = "One step taken to drive down emissions at the port is investments in electric vehicles and LED lighting."
cleantech_con.answer[5560] = "The head shepherd volunteered to make the treacherous journey to the neighboring town to secure supplies."
cleantech_con.answer[5567] = "The issue with the resource in the U.S. is getting it from where it is to where it's needed."
cleantech_con.answer[5582] = "The tipping point of the new paper is when EVs cost the same to manufacture as conventional cars."
cleantech_con.answer[5590] = "Some policy and deployment challenges for the sector include planning conditions, grid connection barriers, and limited support for marine energy."
cleantech_con.answer[5598] = "The spokesperson for the Department for Environment, Food and Rural Affairs emphasized comprehensive plans for flood risk reduction."
cleantech_con.answer[5642] = "The letter emphasizes the critical role of women and girls in the fight against the climate crisis at COP26."
cleantech_con.answer[5660] = "Key actions to accelerate progress include ending the construction of new coal power capacity and reducing the share of coal in global electricity generation."
cleantech_con.answer[5681] = "Actions like slashing overseas aid, cutting EV incentives, and supporting airport expansion have raised concerns about the UK's commitment to net-zero targets."
cleantech_con.answer[5712] = "Co-op's sustainability initiatives include aligning all finance activities in support of low carbon investments."
cleantech_con.answer[5907] = "The WA Environmental Protection Authority confirmed Woodside's plan for a 100MW solar facility in the Maitland Strategic Industrial Area."
cleantech_con.answer[5970] = "The DoC (Department of Commerce) is conducting an inquiry into whether Chinese-branded solar panels are using cheap components to cut costs."
cleantech_con.answer[6000] = "To help establish the H2 trading platform, research and improvements in policies, standards, methodologies, and market-based trading mechanisms for hydrogen are planned."


In [None]:
cleantech_con.answer[6058] = "Skimming 'excessive' revenues from green energy for consumer power bills."
cleantech_con.answer[6132] = "Expert optimistic about nuclear fusion's importance."
cleantech_con.answer[6140] = "G7 meeting implications: Stronger push on renewables, biodiversity, and clean tech."
cleantech_con.answer[6169] = "Benefits: Improved energy transformation, cost reduction, and efficiency."
cleantech_con.answer[6224] = "Petrobras plans 23GW more offshore wind to lead in sea wind energy."
cleantech_con.answer[6385] = "Geothermal proposed for heat decarbonization due to energy savings."
cleantech_con.answer[6398] = "FORGE program: Confirming well connectivity, fracture conductivity, and achieving conformance."
cleantech_con.answer[6428] = "Conference topics: Resources, data, launch, perspectives, startups, and movements."
cleantech_con.answer[6429] = "Key responsibilities: Managing full life cycle, budgeting, and negotiating equipment contracts."
cleantech_con.answer[6464] = "Enhanced Geothermal Shot: Accelerating R&D to reduce drilling costs."
cleantech_con.answer[6495] = "Geothermal excels in land use, materials, and CO2 production per unit of energy."
cleantech_con.answer[6501] = "IGC Webinar focuses on insights into geothermal industry activities."
cleantech_con.answer[6507] = "Off-takers: Desert Community Energy, Clean Energy Alliance, California Choice Energy."
cleantech_con.answer[6526] = "New Lithium Extraction Tax Law in California's 2022-2023 budget provisions."
cleantech_con.answer[6631] = "Geothermal exploration zone: Part of GeoZone initiative in Sonoma and Mendocino counties."
cleantech_con.answer[6645] = "Chile seminar: Experts highlight geothermal's potential in energy transition."
cleantech_con.answer[6656] = "Book 'Our Hidden Powers': Delightful campaign promoting geothermal awareness."
cleantech_con.answer[6671] = "PGE signs agreements with ANZ, Citigroup, HSBC, BNP Paribas, and others."
cleantech_con.answer[6675] = "Next Utah FORGE step: Pumping water for heat absorption in geothermal projects."
cleantech_con.answer[6677] = "ANCHORBIT: Downhole walking system for stability in hot geothermal drilling."
cleantech_con.answer[6722] = "Hungary's new regulatory framework accelerates geothermal development."
cleantech_con.answer[6728] = "Good data management crucial for long-term geothermal development."
cleantech_con.answer[6787] = "Consultant evaluates proposal results and reasons for drilling projects not commencing."
cleantech_con.answer[6795] = "WGC 2026 Tender Committee impressed by Calgary's bid, highlighting Alberta's geothermal expertise."
cleantech_con.answer[6816] = "Explore heating/cooling concepts for industrial decarbonization."
cleantech_con.answer[6853] = "Home Energy Efficiency Team won The Five on Fire award."
cleantech_con.answer[6867] = "Geothermal energy: Permanent, non-weather dependent, sustainable lithium extraction."
cleantech_con.answer[6905] = "Responsible for designing geothermal reservoir models."
cleantech_con.answer[6961] = "Groundwork-like solution uses air and water, limited carbon footprint."
cleantech_con.answer[6980] = "Mylar UVHPET: Sustainable halogen-free backsheet films by DuPont."
cleantech_con.answer[7001] = "IRENA needs many more PV installers for rooftop solar."
cleantech_con.answer[7078] = "pv magazine Webinar: Safety trends in battery storage design and hailstorms."
cleantech_con.answer[7094] = "Oman: Prominent global position due to climate, geography, and renewable expertise."
cleantech_con.answer[7137] = "Raising capital shows business quality and resilience."
cleantech_con.answer[7164] = "Paradox: Policies subsidize smart energy devices, but regulatory framework lacks."
cleantech_con.answer[7168] = "GREAT START for AgriVoltaics (AV) with formal research on AV Panels."
cleantech_con.answer[7172] = "Technology innovation and supportive policies drive solar, storage, and electrification."
cleantech_con.answer[7221] = "Co-solvent dilution strategy for large-scale perovskite module fabrication."
cleantech_con.answer[7245] = "Research topics for improving technology, including re-parametrizing intrinsic recombination."
cleantech_con.answer[7247] = "Government's fact-finding exercise may suggest tapering down subsidy support for CfD-backed projects."
cleantech_con.answer[7312] = "Transition to a new energy model is necessary for competitiveness, local employment, and energy independence."
cleantech_con.answer[7376] = "SolarEdge system simplifies installation, commissioning, and offers app control."
cleantech_con.answer[7483] = "Flywheel energy needs significant land area and strategic proposals for positioning."
cleantech_con.answer[7488] = "Methodology neglects physical properties causing soiling on deposited material."
cleantech_con.answer[7510] = "Driving up development costs due to insufficient interconnection availability."
cleantech_con.answer[7521] = "Modelling approach quantifies the role of materials not directly involved in module's energy generation."
cleantech_con.answer[7551] = "Silver consumption reduction in adhesives, 'matrix layout' technique for module format."
cleantech_con.answer[7570] = "UK police investigate copper disposal routes through the We Don't Buy Crime initiative."
cleantech_con.answer[7582] = "Measures to kick-start energy storage in Europe, including Electricity Market Design revisions."
cleantech_con.answer[7671] = "Focus: TOPCon technology impact on LCOE for DG projects."
cleantech_con.answer[7686] = "Crisis: Russia's invasion of Ukraine."
cleantech_con.answer[7702] = "Zayed Sustainability Prize winners showcase real-world action."
cleantech_con.answer[7732] = "Group examines decommissioned PV array modules' inner workings."
cleantech_con.answer[7832] = "Experts include Adele Zhao, Head of Product Solutions & Marketing, Trina Solar."
cleantech_con.answer[7863] = "Sun Agri and RWE partner to accelerate Agri-PV in France."
cleantech_con.answer[7894] = "Second quality of Tata group: Integrating ideas into methodologies."
cleantech_con.answer[7904] = "Decreased interest in auction support due to matured PV technology."
cleantech_con.answer[7910] = "Researchers identify three typologies of fencing for PV projects."
cleantech_con.answer[8025] = "Key feature of proposed system: Microclimate control for farming."
cleantech_con.answer[8032] = "Researchers from EEB, SSE, and PIK question increased nuclear power investments."
cleantech_con.answer[8070] = "Article aims to identify barriers and strategies for solar neighborhood planning."
cleantech_con.answer[8092] = "Components of battery energy storage system: Architecture, safety, real-world examples."
cleantech_con.answer[8100] = "Risk: Utilities relying on Release instead of upgrading grids."
cleantech_con.answer[8113] = "New approach: Transmission storage connections to speed up connections."
cleantech_con.answer[8142] = "Successful system: PV, short-term battery, long-term hydrogen storage."
cleantech_con.answer[8182] = "Small but sufficient system in the past."
cleantech_con.answer[8198] = "New 0% interest rate helps homeowners achieve affordable solar ownership."
cleantech_con.answer[8259] = "Agrivoltaics preserves land legacy, provides value and retirement."
cleantech_con.answer[8301] = "White Construction's dominance in wind, rising in solar."
cleantech_con.answer[8302] = "Partnership helps small businesses understand and apply for solar loans."
cleantech_con.answer[8334] = "AI emerging as strategic forecasting asset for optimizing energy resources."
cleantech_con.answer[8356] = "FSM arms onsite techs with up-to-date data, optimizing workflows."
cleantech_con.answer[8367] = "New Mexico PRC improved interconnection rules for solar projects."
cleantech_con.answer[8382] = "Practicality: Placing solar arrays on warehouses for land conservation."


In [None]:
cleantech_con.answer[8433] = "Firm commitment to road electrification in the US."
cleantech_con.answer[8441] = "New battery tool for Sunobi: Simplifies solar sales process."
cleantech_con.answer[8448] = "David Smart, CCO of BioStar Renewables, chose Castillo Engineering."
cleantech_con.answer[8462] = "EnerVenue's system is safe for commercial and residential applications."
cleantech_con.answer[8475] = "Key step to combat climate change: Managing energy in the region."
cleantech_con.answer[8499] = "Benefit of solar + storage systems: SEIA updates factsheets."
cleantech_con.answer[8569] = "Franklin Home Power Solution leverages multiple energy sources during outages."
cleantech_con.answer[8607] = "Largest electric cooperative in Arizona: Silicon Ranch."
cleantech_con.answer[8616] = "Use of less energy, material, and lower supply chain risk aids sustainability."
cleantech_con.answer[8627] = "Incentives should be increased for benefits like tax revenue and land remediation."
cleantech_con.answer[8664] = "CAISO identified transmission network needs for a better grid."
cleantech_con.answer[8666] = "Largest renewable energy facility in the southern hemisphere: Clarke Creek farm."
cleantech_con.answer[8670] = "Low albedo effect maximizes solar energy production in desert environments."
cleantech_con.answer[8686] = "SmartDesign 2.0 purpose: Simplify solar and energy storage system design."
cleantech_con.answer[8695] = "New dawn for US solar after challenges: Inflation Reduction Act."
cleantech_con.answer[8698] = "Step to streamline permitting: Increased investment, promotion of dialogues."
cleantech_con.answer[8721] = "Large-scale development helps the population: Improves employment, living conditions."
cleantech_con.answer[8745] = "Generating reference point for cost reduction: Comprehensive product design."
cleantech_con.answer[8776] = "Bifacial perovskite cell aim: Achieve parity with monofacial efficiencies."
cleantech_con.answer[8793] = "Samantha Sloan praises Inflation Reduction Act for growth in solar manufacturing."
cleantech_con.answer[8831] = "Previous net metering legislation: NEM 2, offered higher repayment rates."
cleantech_con.answer[8900] = "Global growth outlook for renewables: Positive, difficult pathway to 1.5°C."
cleantech_con.answer[8949] = "Altus Power's third investment in Hawaii: Energy savings for O’ahu residents."
cleantech_con.answer[8959] = "Developers calculate profitability: HOMER Front analyzes financial metrics."
cleantech_con.answer[9075] = "Encore's experience reduces barriers for solar adoption."
cleantech_con.answer[9079] = "Purpose of 2021 Standard Scenarios Report: Inform investment decisions."
cleantech_con.answer[9138] = "Purpose of Indiana Michigan Power's RFP: Diversify and optimize generation portfolio."
cleantech_con.answer[9187] = "Unity system of Inaccess optimizes renewable power plant operation."
cleantech_con.answer[9228] = "Project developers streamline workflow with instant power purchase agreements."
cleantech_con.answer[9241] = "Examples of Lioness technology elements: Energy Management System, Battery Management System."
cleantech_con.answer[9256] = "Participants stay current through access to industrial capabilities and mentorships."
cleantech_con.answer[9259] = "Recognition: Need to accelerate path to net zero in renewable energy objectives."
cleantech_con.answer[9263] = "Project examines reduction in greenhouse gas emissions, economic benefits, and grid implications."
cleantech_con.answer[9278] = "Xendee's new multi-node feature: Creates advanced interconnected microgrid networks."
cleantech_con.answer[9285] = "Minimum U.S. iron and steel purchase requirement for solar and wind facilities."
cleantech_con.answer[9365] = "Maintenance includes protection of soil and pollinators through native vegetation."
cleantech_con.answer[9369] = "ArcVera's new strategy update: Continued growth in solar and storage services."
cleantech_con.answer[9398] = "Ovation Green portfolio: Standardized, intuitive system supporting different technologies."
cleantech_con.answer[9510] = "PowerTitan Series BESS: One of the largest battery energy storage projects in the U.S."
cleantech_con.answer[9566] = "Purpose of 3D GIS + Numerical Wind Tunnel CFD Simulation: Improve reliability and design of trackers."
cleantech_con.answer[9570] = "Green Glove program: Enhances installer experience and drives quality across solar value chain."

In [None]:
# Fill the two remaining rows; .loc sets the value on the DataFrame itself rather than through chained indexing
cleantech_con.loc[5537, "answer"] = "The issue is transporting the gas supply to where it's needed."
cleantech_con.loc[6403, "answer"] = "FORGE field laboratory program aims to confirm connectivity, adequate fracture conductivity, and achieve conformance for Enhanced Geothermal Systems."
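
Most of the manual fills above use chained indexing of the form `cleantech_con.answer[i] = ...`. Depending on the pandas version this can emit a chained-assignment warning, and with copy-on-write enabled it does not update the DataFrame at all. Below is a minimal sketch of an alternative, assuming the integers are index labels: the answers are collected in a dictionary and written with `.loc`. The toy frame `demo` and the dictionary `manual_answers` are illustrative stand-ins, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for cleantech_con (hypothetical rows, same column name).
demo = pd.DataFrame({"answer": [None, "2030", None]}, index=[5537, 5540, 6403])

# Hypothetical mapping: index label -> manually written answer.
manual_answers = {
    5537: "The issue is transporting the gas supply to where it's needed.",
    6403: "FORGE field laboratory program aims to confirm connectivity.",
}

# One .loc assignment per label writes directly into the DataFrame.
for label, text in manual_answers.items():
    demo.loc[label, "answer"] = text

# Count the rows that still have no answer.
remaining = int(demo["answer"].isna().sum())
print(remaining)
```

Keeping the answers in one dictionary also makes it easier to spot duplicated labels before they are written.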


In [None]:
import numpy as np
# List any rows that still have no answer after the manual fill
missing_answers = cleantech_con.loc[cleantech_con["answer"].isna()]
missing_answers

Unnamed: 0,title,content,sentence_tokens,summary,question,answer
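
The empty result above only shows that no answer is null. A slightly stricter check, sketched here on a toy frame (the frame `demo` is a hypothetical stand-in for `cleantech_con`), also treats empty and whitespace-only strings as missing.

```python
import pandas as pd

# Toy stand-in for cleantech_con with one valid and three unusable answers.
demo = pd.DataFrame({"answer": ["2030", None, "   ", ""]})

# Replace nulls with "", strip whitespace, and flag anything left empty.
is_blank = demo["answer"].fillna("").astype(str).str.strip().eq("")
missing = demo.loc[is_blank]
print(missing.index.tolist())
```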


In [None]:
cleantech_con

Unnamed: 0,title,content,sentence_tokens,summary,question,answer
0,Qatar to Slash Emissions as LNG Expansion Adva...,"[""Qatar Petroleum ( QP) is targeting aggressiv...",[Qatar Petroleum ( QP) is targeting aggressive...,The company is also aiming to reduce gas flari...,By what year will QP be able to eliminate rout...,2030
1,India Launches Its First 700 MW PHWR,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",[• Nuclear Power Corp. of India Ltd. ( NPCIL) ...,• The inaugural US small modular reactor ( SMR...,What is the name of the first US small modular...,SMR
2,New Chapter for US-China Energy Trade,"[""New US President Joe Biden took office this ...",[New US President Joe Biden took office this w...,Energy has come to play a bigger role in that ...,How much did China spend on US energy products...,$18.5 billion
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,"[""The slow pace of Japanese reactor restarts c...",[The slow pace of Japanese reactor restarts co...,Tokyo Electric Power Co. ( Tepco) is strugglin...,How many Ohi and Takahama units are currently ...,Four
4,NYC Pension Funds to Divest Fossil Fuel Shares,"[""Two of New York City's largest pension funds...",[Two of New York City's largest pension funds ...,The announcement by the New York City pension ...,How much is the New York State Common retireme...,$90 million
...,...,...,...,...,...,...
9585,Strata Clean Energy Nets $ 300 Million in Fund...,['Strata Clean Energy has closed a $ 300 milli...,[Strata Clean Energy has closed a $ 300 millio...,Nomura Securities International Inc. led the f...,How many other participating banks were there?,five
9586,Orsted Deploying SparkCognition Renewable Suit...,['Global renewable energy developer Ørsted is ...,[Global renewable energy developer Ørsted is d...,“ Renewable Suite provides renewable energy ow...,What does Renewable Suite provide?,predictive recommendations
9587,Veolia Has Plans for 5 MW of Solar in Arkansas,"['Veolia North America, a provider of environm...","[Veolia North America, a provider of environme...","“ This investment to bring clean, renewable po...",What is the name of VNA's Environmental Soluti...,Bob Cappadona
9588,"SunEdison: Too Big, Too Fast?",['Once the self-proclaimed “ leading renewable...,[Once the self-proclaimed “ leading renewable ...,"“ In the U.S., the timing of these potential s...",What are utilities and independent power produ...,Renewable electricity standards and Clean Powe...


In [None]:
import pandas as pd
#save questions and answers to csv
selected_columns = ['summary', 'question', 'answer']
selected_df = cleantech_con[selected_columns]
selected_df.to_csv('/content/drive/MyDrive/CLT Project/NLP Stage 3/summary_question_answer.csv', index=True)


In [None]:
#load data
import pandas as pd
selected_columns = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/summary_question_answer.csv",delimiter=",",low_memory=False)
selected_columns

In [None]:
selected_columns = selected_columns.drop('Unnamed: 0', axis=1, errors='ignore')

In [None]:
selected_columns = selected_columns[["question","answer"]]
selected_columns
selected_columns.to_csv('/content/drive/MyDrive/CLT Project/NLP Stage 3/question_answer.csv')

#### 4. Fine-Tuning a Pre-trained T5 Model

In [None]:
!pip install torch transformers pandas

In [None]:
!pip uninstall -y transformers
!pip install transformers==3.2.0
!pip install transformers[sentencepiece]
!pip install sentencepiece

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration, AdamW
from sklearn.model_selection import train_test_split

# Extract relevant columns
questions = selected_columns['question'].tolist()
answers = selected_columns['answer'].tolist()

# Define a custom dataset
class CustomDataset(Dataset):
    def __init__(self, questions, answers, tokenizer, max_length=128):
        self.questions = questions
        self.answers = answers
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
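        # Note: the answer is appended to the input text, so the model sees the target
        # in its input; the summary itself is not part of the prompt.
        input_text = f"summary question: {self.questions[idx]} answer: {self.answers[idx]}"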

        # Tokenize input text
        encoding = self.tokenizer(input_text, return_tensors='pt', max_length=self.max_length, truncation=True, padding='max_length')

        # Tokenize answer separately for labels
        labels = self.tokenizer(self.answers[idx], return_tensors='pt', max_length=self.max_length, truncation=True, padding='max_length')

        # Attach the tokenized answer as 'labels'
        encoding['labels'] = labels['input_ids'].squeeze()

        # Drop the batch dimension added by return_tensors='pt'
        encoding = {key: encoding[key].squeeze() for key in encoding.keys()}

        return encoding


# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Create the custom dataset
dataset = CustomDataset(questions, answers, tokenizer)

# Fine-tuning parameters
batch_size = 16
num_epochs = 1
learning_rate = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Define data loader (drop_last=True discards the final incomplete batch)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
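
One detail worth checking in the dataset above: the labels are padded to `max_length`, and the padded positions are handed to the model unchanged. Hugging Face's T5 implementation ignores label positions set to `-100` when computing the loss, so a common adjustment is to replace pad token ids with `-100`. A minimal sketch on plain Python lists, assuming T5's pad token id of 0; `mask_padding` is a hypothetical helper and the token ids are made up:

```python
# Hypothetical helper: replace padding ids in a label sequence with -100,
# the index that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100

def mask_padding(label_ids, pad_token_id=0):
    """Return a copy of label_ids where every pad token id becomes IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

# Example: a short answer (made-up ids) ending with EOS (id 1), padded to length 8
padded_labels = [2274, 18, 632, 1, 0, 0, 0, 0]
masked = mask_padding(padded_labels)
print(masked)  # [2274, 18, 632, 1, -100, -100, -100, -100]
```

On tensors, the same idea would be applied to `labels['input_ids']` inside `__getitem__` before it is stored under `'labels'`.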


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration, AdamW
from sklearn.model_selection import train_test_split


questions = selected_columns['question'].tolist()
answers = selected_columns['answer'].tolist()

# Split data into training and evaluation sets
questions_train, questions_eval, answers_train, answers_eval = train_test_split(
    questions, answers, test_size=0.2, random_state=42
)

# Define a custom dataset for training
train_dataset = CustomDataset(questions_train, answers_train, tokenizer)

# Define a custom dataset for evaluation
eval_dataset = CustomDataset(questions_eval, answers_eval, tokenizer)

# Create DataLoader for training
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Create DataLoader for evaluation
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, shuffle=False)

# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Fine-tuning parameters
num_epochs = 1
learning_rate = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Fine-tune the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in train_dataloader:
        inputs = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        outputs = model(inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        total_loss += loss.item()

    average_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Average Loss: {average_loss}")

# Save the fine-tuned model
model.save_pretrained("fine_tuned_model")
tokenizer.save_pretrained("fine_tuned_model")

# Evaluation on the evaluation set
model.eval()  # Set the model to evaluation mode
total_correct = 0
total_samples = 0

with torch.no_grad():
    for batch in eval_dataloader:
        inputs = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

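        outputs = model(inputs, attention_mask=attention_mask, labels=labels)
        # Teacher-forced predictions: most likely token at each label position
        predictions = torch.argmax(outputs.logits, dim=-1)

        # Token-level comparison over all label positions, padding included
        total_correct += (predictions == labels).sum().item()
        total_samples += labels.numel()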

accuracy = total_correct / total_samples
print(f"Fine-tuned Model Accuracy on Evaluation Set: {accuracy}")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Epoch 1/1, Average Loss: 0.5833480951841921
Fine-tuned Model Accuracy on Evaluation Set: 0.9917190758602711
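
The accuracy above is computed per token over every label position, padding included, and the predictions come from a teacher-forced forward pass rather than from `generate`. For short answers most of the 128 label positions are padding, so the figure may overstate answer quality. A minimal sketch of the same metric restricted to non-padding positions, on plain lists; `token_accuracy` and the example ids are illustrative assumptions:

```python
def token_accuracy(predictions, labels, pad_token_id=0, ignore_padding=True):
    """Fraction of label positions where prediction == label.

    predictions, labels: lists of equal-length lists of token ids.
    With ignore_padding=True, positions whose label is the pad id are skipped.
    """
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if ignore_padding and label == pad_token_id:
                continue
            correct += int(pred == label)
            total += 1
    return correct / total if total else 0.0

# Made-up ids: the answer is 3 tokens plus EOS (id 1), padded to length 10.
labels = [[11, 12, 13, 1, 0, 0, 0, 0, 0, 0]]
predictions = [[11, 99, 13, 1, 0, 0, 0, 0, 0, 0]]  # one wrong answer token

print(token_accuracy(predictions, labels, ignore_padding=False))  # 0.9
print(token_accuracy(predictions, labels, ignore_padding=True))   # 0.75
```

In this toy case a single wrong answer token lowers the score from 0.9 to 0.75 once padding is excluded, which illustrates how strongly padding can dominate the unmasked figure.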


In [None]:
# Save the fine-tuned model
model.save_pretrained("/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model")
tokenizer.save_pretrained("/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model")

('/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model/tokenizer_config.json',
 '/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model/special_tokens_map.json',
 '/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model/spiece.model',
 '/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model/added_tokens.json')

To test whether the fine-tuned model works well, we manually collected some new input data from energyintel.com:

In [None]:
import pandas as pd
df_new = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 3/new_input_cleantech.csv",delimiter=";",low_memory=False)
df_new

In [None]:
df_new = df_new[['summary', 'question']]
df_new

Unnamed: 0,summary,question
0,The US Environmental Protection Agency (EPA) h...,What major commitment did 50 global oil compan...
1,President Joe Biden's 2024 re-election bid foc...,What is the key clean energy policy that Donal...
2,The fate of the Russia-Ukraine gas transit con...,What major obstacle hinders negotiations for t...
3,COP29 in Azerbaijan is shaping up with climate...,"What is the main theme and focus of COP29, as ..."


In [None]:
from transformers import AutoTokenizer, AutoModelWithLMHead

# Function that uses our fine-tuned model to answer one row's question
def generate_answers(row):
  model_path = "/content/drive/MyDrive/CLT Project/NLP Stage 3/fine_tuned_model"
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModelWithLMHead.from_pretrained(model_path)

  question = row["question"]
  context = row["summary"]

  # Build the prompt (named input_text to avoid shadowing the built-in input())
  input_text = f"question: {question} context: {context}"

  encoded_input = tokenizer([input_text],
                            return_tensors='pt',
                            max_length=512,
                            truncation=True)

  output = model.generate(input_ids=encoded_input.input_ids,
                          attention_mask=encoded_input.attention_mask)
  output = tokenizer.decode(output[0], skip_special_tokens=True)

  return output
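
`generate_answers` reloads the tokenizer and model from disk for every row. One way to avoid the repeated loading is to cache the loaded objects. A minimal sketch using `functools.lru_cache`; `load_model` is a stand-in that only counts calls so the snippet runs without the saved model, and `generate_answer` returns the prompt as a placeholder for the real `model.generate` call:

```python
from functools import lru_cache

load_calls = 0  # counts how often the (expensive) load actually runs

@lru_cache(maxsize=None)
def load_model(model_path):
    """Stand-in loader (hypothetical). In the notebook this would return
    AutoTokenizer.from_pretrained(model_path) and
    AutoModelWithLMHead.from_pretrained(model_path)."""
    global load_calls
    load_calls += 1
    return ("tokenizer for " + model_path, "model for " + model_path)

def generate_answer(row, model_path="fine_tuned_model"):
    # Cached after the first call, so later rows reuse the same objects.
    tokenizer, model = load_model(model_path)
    # Placeholder: return the prompt instead of calling model.generate.
    return f"question: {row['question']} context: {row['summary']}"

rows = [
    {"question": "Q1", "summary": "S1"},
    {"question": "Q2", "summary": "S2"},
]
prompts = [generate_answer(row) for row in rows]
print(load_calls)  # 1
```

Loading the model once in a separate cell and passing it to the function would have the same effect; the cache simply keeps the function signature unchanged for use with `apply`.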

In [None]:
#!pip install sentencepiece
#!pip install transformers
#!pip install transformers[sentencepiece]
import sentencepiece
from tqdm import tqdm
tqdm.pandas(desc="Processing...")

# Generate an answer for every row of the new data
df_new["answer"] = df_new.progress_apply(generate_answers,axis=1)
df_new

Processing...:   0%|          | 0/4 [00:00<?, ?it/s]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processing...:  50%|█████     | 2/4 [00:02<00:02,  1.32s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processing...:  75%|███████▌  | 3/4 [00:04<00:01,  1.68s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processing...: 100%|██████████| 4/4 [00:07<00:00,  2.10s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processing...: 100%|██████████| 4/4 [00:13<00:00,  3.42s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#re

Unnamed: 0,summary,question,answer
0,The US Environmental Protection Agency (EPA) h...,What major commitment did 50 global oil compan...,net-zero emissions
1,President Joe Biden's 2024 re-election bid foc...,What is the key clean energy policy that Donal...,Inflation Reduction Act
2,The fate of the Russia-Ukraine gas transit con...,What major obstacle hinders negotiations for t...,war in Ukraine
3,COP29 in Azerbaijan is shaping up with climate...,"What is the main theme and focus of COP29, as ...",climate finance
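
The `SettingWithCopyWarning` in the output above is most likely triggered because `df_new` was created by selecting columns from another DataFrame and then received a new column. Taking an explicit copy when subsetting usually avoids it. A minimal sketch with made-up data:

```python
import pandas as pd

raw = pd.DataFrame({
    "summary": ["text a", "text b"],
    "question": ["q a?", "q b?"],
    "source": ["x", "y"],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and does not touch `raw`.
subset = raw[["summary", "question"]].copy()
subset["answer"] = ["ans a", "ans b"]

print(list(subset.columns))  # ['summary', 'question', 'answer']
print(list(raw.columns))     # ['summary', 'question', 'source']
```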


In [None]:
for i, question in enumerate(df_new.question):
  print('Question: ' + question)
  print('Answer: ' + df_new.answer[i])

Question: What major commitment did 50 global oil companies make at the UN climate conference regarding methane emissions?
Answer: net-zero emissions
Question: What is the key clean energy policy that Donald Trump has promised to dismantle if he defeats Joe Biden in the election?
Answer: Inflation Reduction Act
Question: What major obstacle hinders negotiations for the Russia-Ukraine gas transit contract renewal?
Answer: war in Ukraine
Question: What is the main theme and focus of COP29, as confirmed by Azerbaijan's economy minister, Mikayil Jabbarov?
Answer: climate finance


These are the answers generated by our fine-tuned model. To test the zero-shot capability of the LLM ChatGPT, we gave it the same questions and summaries, and it generated the following answers:



Question: What major commitment did 50 global oil companies make at the UN climate conference regarding methane emissions?

Answer: Most non-state-owned global companies, including 50 major oil firms, signed onto a pledge at the UN climate conference to reduce methane emissions to nearly zero by 2030 and cease routine flaring.
<br>
<br>
Question: What is the key clean energy policy that Donald Trump has promised to dismantle if he defeats Joe Biden in the election?

Answer: Donald Trump has promised to dismantle the Inflation Reduction Act (IRA), a key clean energy policy of the Biden administration, if he defeats Joe Biden in the election.
<br><br>
Question: What major obstacle hinders negotiations for the Russia-Ukraine gas transit contract renewal?

Answer: Negotiations for the Russia-Ukraine gas transit contract renewal are hindered by Russia's ongoing war in Ukraine, with Kyiv opposing talks until the EU mediates.
<br><br>
Question: What is the main theme and focus of COP29, as confirmed by Azerbaijan's economy minister, Mikayil Jabbarov?

Answer: The main theme of COP29, as confirmed by Azerbaijan's economy minister, Mikayil Jabbarov, will be finance, with a focus on addressing the "how to" of clean energy investment.

**Conclusion**

Comparing the two sets of answers shows that the training data and vocabulary of a large proprietary LLM like ChatGPT are far more extensive than those of our fine-tuned T5-small model. Our model returns short phrases, whereas ChatGPT answers in full sentences that carry more context, show stronger semantic relationships and read as more human-like.
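
The comparison above is qualitative. To put a number on it, QA outputs are often scored against a reference answer with exact match and token-level F1, as in SQuAD-style evaluation. A minimal sketch; the normalization is simplified (lowercasing and punctuation stripping only), and the reference answer in the example is an assumption written for illustration, not taken from the dataset:

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def exact_match(prediction, reference):
    """True if both answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Inflation Reduction Act"  # assumed reference answer
short_answer = "Inflation Reduction Act"
long_answer = "Donald Trump has promised to dismantle the Inflation Reduction Act (IRA)."

print(exact_match(short_answer, reference))         # True
print(round(token_f1(short_answer, reference), 2))  # 1.0
print(exact_match(long_answer, reference))          # False
print(round(token_f1(long_answer, reference), 2))   # 0.43
```

Token-overlap metrics of this kind reward concise answers and penalize longer, more explanatory ones on precision, so they complement rather than replace a manual reading of the outputs.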

#### 5. Conclusion


Comparing our training process on the cleantech data with the output of models like ChatGPT showed us how much training and training data it takes to reach the performance of widely used large language models. Although this project stage involved relatively computationally expensive processes, the amount and quality of the output remained limited. This helped us to better understand the building blocks of NLP models and gave us insight into how much data it takes to generate meaningful language.
