# Project: NLP for Cleantech
### Stage 1: Data cleaning, preprocessing, and exploratory data analysis including topic modelling
#### Part 1: Data cleaning and preprocessing
Authors: Esin Isik, Sabrina Rigo

### 1. Data Cleaning

#### 1.1. Loading the dataset that was downloaded from Kaggle

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
cleantech = pd.read_csv("/content/drive/MyDrive/CLT Project/NLP Stage 1/cleantech_media_dataset_v1_20231109.csv",delimiter=",",low_memory=False)
cleantech

Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url
0,1280,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1,1281,India Launches Its First 700 MW PHWR,2021-01-15,,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
2,1283,New Chapter for US-China Energy Trade,2021-01-20,,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
3,1284,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
4,1285,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
...,...,...,...,...,...,...,...
9602,82339,Strata Clean Energy Nets $ 300 Million in Fund...,2023-11-06,,['Strata Clean Energy has closed a $ 300 milli...,solarindustrymag,https://solarindustrymag.com/strata-clean-ener...
9603,82340,Orsted Deploying SparkCognition Renewable Suit...,2023-11-07,,['Global renewable energy developer Ørsted is ...,solarindustrymag,https://solarindustrymag.com/orsted-deploying-...
9604,82341,Veolia Has Plans for 5 MW of Solar in Arkansas,2023-11-07,,"['Veolia North America, a provider of environm...",solarindustrymag,https://solarindustrymag.com/veolia-has-plans-...
9605,82342,"SunEdison: Too Big, Too Fast?",2023-11-08,,['Once the self-proclaimed “ leading renewable...,solarindustrymag,http://www.solarindustrymag.com/online/issues/...


In [None]:
# Rename the "Unnamed: 0" column to "rowID" for easier access:
cleantech.rename(columns={'Unnamed: 0':'rowID'}, inplace=True)

#### 1.2. Check for duplicates

At first glance, the title and content columns are the most relevant ones for this NLP project. The title is kept because it can add value in later steps.
The first step is to check both columns for duplicate values.

In [None]:
dupl_title = cleantech[cleantech.duplicated(subset=["title"],keep=False)]
dupl_title

Unnamed: 0,rowID,title,date,author,content,domain,url
803,6466,Macquarie targets North Sea as the green energ...,2022-08-08,,['Macquarie Group is betting the North Sea – e...,energyvoice,https://www.energyvoice.com/renewables-energy-...
820,6483,Macquarie targets North Sea as the green energ...,2022-08-15,,['Macquarie Group is betting the North Sea – e...,energyvoice,https://sgvoice.energyvoice.com/2022/08/08/mac...
864,6529,GE blocked from selling huge offshore turbine ...,2022-09-08,,['General Electric was blocked by a federal ju...,energyvoice,https://www.energyvoice.com/renewables-energy-...
891,6557,GE blocked from selling huge offshore turbine ...,2022-09-27,,['General Electric was blocked by a federal ju...,energyvoice,https://sgvoice.energyvoice.com/reporting/comp...
910,6576,"Liz Truss opposes solar panels on farmland, Do...",2022-10-10,,['Liz Truss opposes the installation of solar ...,energyvoice,https://www.energyvoice.com/renewables-energy-...
918,6584,"Liz Truss opposes solar panels on farmland, Do...",2022-10-11,,['Liz Truss opposes the installation of solar ...,energyvoice,https://sgvoice.energyvoice.com/2022/10/11/liz...
925,6591,Green hydrogen seen competing with LNG within ...,2022-10-18,,['The cost of clean hydrogen will fall to that...,energyvoice,https://www.energyvoice.com/renewables-energy-...
931,6597,Green hydrogen seen competing with LNG within ...,2022-10-20,,['The cost of clean hydrogen will fall to that...,energyvoice,https://sgvoice.energyvoice.com/investing/mark...
981,6649,"XR goes big on fake oil in protests at SLB, In...",2022-11-21,,['Extinction Rebellion has targeted a number o...,energyvoice,https://www.energyvoice.com/oilandgas/462151/x...
986,6655,Aberdeen’ s NZTC plans national centre for geo...,2022-11-22,,['Aberdeen’ s NZTC is planning a national cent...,energyvoice,https://www.energyvoice.com/renewables-energy-...


In [None]:
dupl_content = cleantech[cleantech.duplicated(subset=["content"],keep=False)]
dupl_content


Unnamed: 0,rowID,title,date,author,content,domain,url
1016,6686,Indonesia seeks investors for giant geothermal...,2022-12-09,,"['Indonesia, home to the world’ s largest geot...",energyvoice,https://www.energyvoice.com/oilandgas/467719/i...
1020,6690,Indonesia seeks investors for giant geothermal...,2022-12-09,,"['Indonesia, home to the world’ s largest geot...",energyvoice,https://sgvoice.energyvoice.com/investing/2002...
6183,78727,Portugal energy transition plan targets massiv...,2023-07-03,,['Portugal has more than doubled its 2030 goal...,rechargenews,https://www.rechargenews.com/energy-transition...
6185,78729,"Wind, hydrogen and solar fused in Portugal's p...",2023-07-03,,['Portugal has more than doubled its 2030 goal...,rechargenews,https://www.rechargenews.com/energy-transition...
6188,78732,China's wind giants are chasing global growth:...,2023-07-06,,['Geopolitics as much as price or quality will...,rechargenews,https://www.rechargenews.com/wind/chinas-wind-...
6189,78733,Why geopolitics will set the limits of China's...,2023-07-06,,['Geopolitics as much as price or quality will...,rechargenews,https://www.rechargenews.com/wind/why-geopolit...
6198,78742,Quest for endless green energy from Earth's co...,2023-07-17,,['One of Japan’ s largest utility groups Chubu...,rechargenews,https://www.rechargenews.com/energy-transition...
6200,78744,Limitless green energy from Earth's core quest...,2023-07-17,,['One of Japan’ s largest utility groups Chubu...,rechargenews,https://www.rechargenews.com/news/2-1-1487279
7988,80594,Sodium-ion battery production capacity to grow...,2023-07-17,,['Global demand for sodium-ion batteries is ex...,pv-magazine,https://www.pv-magazine.com/2023/07/17/sodium-...
7994,80600,Sodium-ion battery fleet to grow to 10 GWh by ...,2023-07-17,,['Global demand for sodium-ion batteries is ex...,pv-magazine,https://www.pv-magazine.com/2023/07/17/sodium-...


Duplicated titles do not necessarily mean duplicated content: the URL column suggests that, in some cases, one article is spread over multiple pages. <br>
The URLs also show that some articles were published on more than one website (sgvoice.energyvoice.com and energyvoice.com).
Although none of the rows shown in dupl_title also appear in dupl_content, some of the rows with duplicated titles contain nearly identical content. <br>
Therefore, it is useful to delete all articles in dupl_title that were published on sgvoice.energyvoice.com. <br> In addition, duplicated entries in the dupl_content DataFrame are removed, keeping the first occurrence of each article.
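
The two removal rules can be tried on a toy frame first. The sketch below uses made-up rows and URL paths; only the column names and the `sgvoice` substring come from this notebook. It also shows a detail that is easy to miss: `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates` keeps the first row of each group by default.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the dataset).
toy = pd.DataFrame({
    "rowID": [1, 2, 3, 4, 5],
    "title": ["A", "A", "B", "C", "D"],
    "content": ["a-text", "a-text (mirror)", "b-text", "shared", "shared"],
    "url": [
        "https://www.energyvoice.com/a",
        "https://sgvoice.energyvoice.com/a",
        "https://www.energyvoice.com/b",
        "https://www.rechargenews.com/c",
        "https://www.rechargenews.com/d",
    ],
})

# keep=False marks every member of a duplicate group, not only the later ones.
dupl_title = toy[toy.duplicated(subset=["title"], keep=False)]

# Among the duplicated titles, select the mirror-site copies and drop them by rowID.
mirror_ids = dupl_title.loc[dupl_title["url"].str.contains("sgvoice"), "rowID"]
no_mirror = toy[~toy["rowID"].isin(mirror_ids)]

# drop_duplicates keeps the first row of each duplicate-content group by default.
deduped = no_mirror.drop_duplicates(subset="content")

print(deduped["rowID"].tolist())
```

In this toy case, row 2 is removed as a mirror-site copy and row 5 is removed as a repeat of the content in row 4.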

In [None]:
# Select the duplicated-title rows that were published on the sgvoice site:
contain_sgvoice_dupltitle = dupl_title[dupl_title['url'].str.contains('sgvoice')]
#print(contain_sgvoice_dupltitle)

# Remove those rows from cleantech by rowID:
sgvoice_ids = contain_sgvoice_dupltitle["rowID"].values.tolist()
cleantech_nodupl = cleantech[~cleantech.rowID.isin(sgvoice_ids)]


In [None]:
# Delete articles with duplicate content (the first occurrence is kept):
cleantech_nodupl = cleantech_nodupl.drop_duplicates(subset='content')

# Extract useful columns; .copy() avoids the SettingWithCopyWarning on later column assignments:
cleantech_con = cleantech_nodupl[["title", "content"]].copy()

# Concatenate title and content:
cleantech_con['text'] = cleantech_con['title'] + ' ' + cleantech_con['content']
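
pandas can emit a SettingWithCopyWarning when a new column is assigned to a frame that was produced by selecting columns from another frame, because it cannot tell whether the assignment targets a view or a copy. A minimal sketch of one common way to avoid this, on a toy frame with made-up values: take an explicit `.copy()` of the column subset before adding columns to it.

```python
import pandas as pd

# Toy frame standing in for cleantech_nodupl (values are made up).
articles = pd.DataFrame({
    "title": ["Title one", "Title two"],
    "content": ["first body", "second body"],
    "url": ["u1", "u2"],
})

# An explicit copy makes the subset an independent DataFrame.
subset = articles[["title", "content"]].copy()

# Adding a column now modifies only the copy, never the source frame.
subset["text"] = subset["title"] + " " + subset["content"]

print(subset["text"].tolist())
print("text" in articles.columns)
```

The source frame is left unchanged, which is also the behaviour the notebook relies on when it builds `cleantech_con`.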


#### 1.3. Remove special characters

For the tokenization process, it is necessary to remove additional noise in the text in the form of special characters. <br> Printing one cell of the text column shows which kinds of characters should be removed:

In [None]:
print(cleantech_con.loc[0, "text"]) #Example shows special characters

Qatar to Slash Emissions as LNG Expansion Advances ["Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. In its latest Sustainability Report published on Wednesday, QP said its goals include `` reducing the emissions intensity of Qatar's LNG facilities by 25% and of its upstream facilities by at least 15%. '' The company is also aiming to reduce gas flaring intensity across its upstream facilities by more than 75% and has raised its carbon capture and storage ambitions from 5 million tons/yr to 7 million tons/yr by 2027. About 2.2 million tons/yr of the carbon capture goal will come from the 32 million ton/yr Phase 1 of the LNG expansion, also known as the North Field East project. A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar's LNG capacity by a further 16 million tons/yr. Qatar currently has an LNG

In [None]:
#Function to remove special characters with regex and replace with space:
import re #import regex
def strip_character(value):
    new_value = re.sub(r'[^a-zA-Z0-9\s]+', ' ', value)
    return new_value

cleantech_con["cleaned_text"] = cleantech_con.text.apply(strip_character)



In [None]:
print(cleantech_con.loc[0, "cleaned_text"]) #Example shows that special characters were removed

Qatar to Slash Emissions as LNG Expansion Advances  Qatar Petroleum   QP  is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion  In its latest Sustainability Report published on Wednesday  QP said its goals include   reducing the emissions intensity of Qatar s LNG facilities by 25  and of its upstream facilities by at least 15    The company is also aiming to reduce gas flaring intensity across its upstream facilities by more than 75  and has raised its carbon capture and storage ambitions from 5 million tons yr to 7 million tons yr by 2027  About 2 2 million tons yr of the carbon capture goal will come from the 32 million ton yr Phase 1 of the LNG expansion  also known as the North Field East project  A further 1 1 million tons yr will come from Phase 2  known as the North Field South project  which will raise Qatar s LNG capacity by a further 16 million tons yr  Qatar currently has an LNG pro
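
The pattern used in `strip_character` keeps only ASCII letters, ASCII digits and whitespace. The sketch below applies it to three short strings to show the side effects: apostrophes and decimal points become spaces, and non-ASCII letters such as the "Ø" in "Ørsted" are removed as well. The function `strip_character_unicode` is a hypothetical variant for comparison and is not used in this notebook.

```python
import re

def strip_character(value):
    # Same pattern as in the cell above: each run of characters that are not
    # ASCII letters, ASCII digits, or whitespace is replaced by one space.
    return re.sub(r'[^a-zA-Z0-9\s]+', ' ', value)

def strip_character_unicode(value):
    # Hypothetical variant (not used in this notebook): \w also matches
    # non-ASCII letters and the underscore, so "Ø" is kept.
    return re.sub(r'[^\w\s]+', ' ', value)

samples = ["Qatar's LNG", "2.2 million tons/yr", "Ørsted"]
ascii_cleaned = [strip_character(s) for s in samples]
unicode_cleaned = [strip_character_unicode(s) for s in samples]

# Optional follow-up: collapse the runs of spaces left by the substitution.
collapsed = [" ".join(s.split()) for s in ascii_cleaned]

print(ascii_cleaned)
print(unicode_cleaned)
print(collapsed)
```

Whether losing non-ASCII letters matters depends on how often company or place names with such letters occur in the corpus; the tokenizer later skips the extra whitespace either way.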

### 2. Data Preprocessing

The functions defined in this section are applied together to the cleaned text to prepare it for the later steps.

#### 2.1. Tokenize content

In [None]:
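import spacy
from spacy.cli.download import download
from spacy.lang.en import English

# Download and load the small English model (the loaded pipeline is not used further below):
download(model="en_core_web_sm")
spacy.load("en_core_web_sm")

# Tokenization uses the blank English pipeline, i.e. only the rule-based tokenizer:
parser = English()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue  # skip whitespace-only tokens
        lda_tokens.append(token.lower_)  # transformation to lowercase
    return lda_tokens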

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


#### 2.2. Look up word forms with NLTK's Wordnet.morphy()

In [None]:
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

#### 2.3. Apply lemmatization to reduce words to their root form

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # create the lemmatizer once instead of on every call
def get_lemma2(word):
    return lemmatizer.lemmatize(word)

[nltk_data] Downloading package wordnet to /root/nltk_data...


#### 2.4. Filter out stop words with NLTK's Stopwords

In [None]:
nltk.download('stopwords')
en_stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#### 2.5. Apply all preprocessing functions

In [None]:
import nltk
nltk.download("words")  # the words corpus is downloaded here but not used in processed_text
from nltk.corpus import words

def processed_text(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]  # keep only tokens with more than 4 characters
    tokens = [token for token in tokens if token not in en_stopwords]  # remove stop words
    tokens = [get_lemma(token) for token in tokens]  # look up the base form with wn.morphy
    tokens = [get_lemma2(token) for token in tokens]  # lemmatize with WordNetLemmatizer
    tokens = [token for token in tokens if not any(c.isdigit() for c in token)]  # remove tokens that contain digits
    return tokens

cleantech_con["tokenized"] = cleantech_con.cleaned_text.apply(processed_text)

print(cleantech_con.loc[0, "tokenized"]) #checking the tokenized output


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


['qatar', 'slash', 'emission', 'expansion', 'advance', 'qatar', 'petroleum', 'target', 'aggressive', 'greenhouse', 'emission', 'prepare', 'launch', 'phase', 'plan', 'million', 'expansion', 'latest', 'sustainability', 'report', 'publish', 'wednesday', 'goal', 'include', 'reducing', 'emission', 'intensity', 'qatar', 'facility', 'upstream', 'facility', 'least', 'company', 'aim', 'reduce', 'flare', 'intensity', 'across', 'upstream', 'facility', 'raise', 'carbon', 'capture', 'storage', 'ambition', 'million', 'million', 'million', 'carbon', 'capture', 'million', 'phase', 'expansion', 'know', 'north', 'field', 'project', 'million', 'phase', 'know', 'north', 'field', 'south', 'project', 'raise', 'qatar', 'capacity', 'million', 'qatar', 'currently', 'production', 'capacity', 'around', 'million', 'eye', 'phase', 'expansion', 'million', 'eliminate', 'routine', 'flare', 'methane', 'emission', 'limited', 'setting', 'methane', 'intensity', 'target', 'across', 'facility', 'company', 'plan', 'build', 
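
The length filter in `processed_text` has a visible side effect: short domain terms are discarded before any other step. The sample output above, for instance, no longer contains "lng" or "gas", although both appear in the article. A minimal sketch in plain Python follows; the token list is made up, and `keep_token` is a hypothetical helper that mirrors only the length and digit filters, not stop-word removal or lemmatization.

```python
def keep_token(token, min_chars=5):
    # Mirrors two filters from processed_text: more than 4 characters
    # and no digit anywhere in the token.
    return len(token) >= min_chars and not any(c.isdigit() for c in token)

# Made-up, already lowercased tokens for illustration.
tokens = ["wind", "solar", "lng", "hydrogen", "coal", "10gwh", "battery", "2027"]

kept = [t for t in tokens if keep_token(t)]
dropped = [t for t in tokens if not keep_token(t)]

print(kept)     # ['solar', 'hydrogen', 'battery']
print(dropped)  # ['wind', 'lng', 'coal', '10gwh', '2027']
```

If terms such as "wind" or "coal" turn out to matter for topic modelling, a lower threshold or a small list of protected domain terms would be one option to consider; this is a suggestion and not something the notebook currently does.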



#### 2.6. Export preprocessed file

In [None]:
# to_parquet writes a Parquet file, so the output uses the .parquet extension instead of .csv:
cleantech_con.to_parquet(r'/content/drive/MyDrive/CLT Project/NLP Stage 1/Stage1_Preprocessed.parquet', index=False)