
#ESG-Insight: NLP-Based Claim Classification and Verification
**What this notebook does (step-by-step):**

1. Upload your CSV file (use Colab's file upload or place the file on Drive).  
2. Detect the text column with claims and (optionally) a company column.  
3. Clean claim text (lowercase, remove punctuation, stopwords, lemmatize).  
4. Convert claims to sentence embeddings using `sentence-transformers` (`all-MiniLM-L6-v2`).  
5. Train two classifiers (Logistic Regression, Random Forest) on embeddings.  
6. Evaluate models (precision, recall, macro-F1) and show confusion matrix.  
7. Flag unverifiable/vague claims using keyword heuristics + model confidence.  
8. (Optional) A minimal Streamlit app script is written to `esg_app.py` for a lightweight dashboard.




In [2]:

# Install required Python packages (run this cell once in Colab).
# The '!' prefix runs shell commands in the notebook environment.
# We install sentence-transformers (for embeddings), xgboost (optional), plotly (visuals), streamlit (dashboard), and nltk (text cleaning).
!pip install -q sentence-transformers xgboost plotly streamlit nltk


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.1/10.1 MB[0m [31m54.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.9/6.9 MB[0m [31m63.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:

# Import pandas for data handling.
import pandas as pd
# Import numpy for numerical arrays.
import numpy as np
# Import train_test_split for creating train/test splits.
from sklearn.model_selection import train_test_split
# Import LabelEncoder to convert string labels to integers.
from sklearn.preprocessing import LabelEncoder
# LogisticRegression is our simple discriminative baseline.
from sklearn.linear_model import LogisticRegression
# RandomForestClassifier is our stronger tree-based baseline.
from sklearn.ensemble import RandomForestClassifier
# Standard evaluation metrics.
from sklearn.metrics import classification_report, confusion_matrix, f1_score
# joblib to save/load trained models.
import joblib
# SentenceTransformer for embeddings.
from sentence_transformers import SentenceTransformer
# Regular expressions for simple text cleaning.
import re
# NLTK and its helpers for stopwords and lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Plotly for interactive charts.
import plotly.express as px

# Download NLTK resources (run once).
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [4]:

# Use Colab's upload helper to upload a single CSV file interactively.
# This cell will open a file picker in Colab; upload your ESG CSV here.
from google.colab import files
uploaded_files = files.upload()




Saving preprocessed_content.csv to preprocessed_content.csv


In [5]:
# Read the CSV into a DataFrame using pandas.
# We let pandas infer separators and encodings; if your file uses a different delimiter, adjust 'sep' accordingly.
claims_df = pd.read_csv('preprocessed_content.csv')

# Show the dataframe shape and the first 5 rows so you can confirm we loaded the intended file.
print('Loaded file:', claims_df)
print('Data shape:', claims_df.shape)
claims_df.head()

Loaded file:      Unnamed: 0              filename ticker  year  \
0             0      ASX_BSX_2020.pdf    BSX  2020   
1             1      ASX_BSX_2022.pdf    BSX  2022   
2             2      ASX_EXR_2022.pdf    EXR  2022   
3             3      LSE_ADM_2019.pdf    ADM  2019   
4             4      LSE_ADM_2020.pdf    ADM  2020   
..          ...                   ...    ...   ...   
861         861  TSX_ABT_2021 (1).pdf    ABT  2021   
862         862      TSX_ABT_2021.pdf    ABT  2021   
863         863      TSX_EFX_2022.pdf    EFX  2022   
864         864      TSX_MSI_2018.pdf    MSI  2018   
865         865      TSX_MSI_2019.pdf    MSI  2019   

                                  preprocessed_content  \
0    style guide colour colour use imagecolour prof...   
1    sustainability report look mining green office...   
2    report environment social governance esg basel...   
3    corporate social responsibilty report introduc...   
4    sustainability admiral commit maintain resp

Unnamed: 0.1,Unnamed: 0,filename,ticker,year,preprocessed_content,ner_entities,e_score,s_score,g_score,total_score
0,0,ASX_BSX_2020.pdf,BSX,2020,style guide colour colour use imagecolour prof...,"['bk%', 'rgb', 'un', 'el ectric mine consortiu...",3.16,18.0,11.83,32.98
1,1,ASX_BSX_2022.pdf,BSX,2022,sustainability report look mining green office...,"['murray street', 'west perth', 'west perth', ...",2.83,12.86,10.32,26.02
2,2,ASX_EXR_2022.pdf,EXR,2022,report environment social governance esg basel...,"['september', 'mongolia', 'australia', 'austra...",3.81,4.28,5.86,13.94
3,3,LSE_ADM_2019.pdf,ADM,2019,corporate social responsibilty report introduc...,"['david stevens', 'csr board', 'just over yea...",16.38,14.2,5.9,36.36
4,4,LSE_ADM_2020.pdf,ADM,2020,sustainability admiral commit maintain respons...,"['year', 'health & wellbeing', 'a -month', 'on...",15.89,13.51,5.38,34.78


In [6]:

# Heuristics to find a reasonable text column that contains claims/statements.
# We'll check common column names first, then fall back to the longest string-like column.
candidate_text_names = ['claim','text','statement','content','claim_text','claim_statement','sentence','description']

# Normalize column names for matching.
lower_cols = [c.lower() for c in claims_df.columns]

# Find the first match among candidate names (case-insensitive).
text_column = None
for cand in candidate_text_names:
    if cand in lower_cols:
        text_column = claims_df.columns[lower_cols.index(cand)]
        break

# If none of the common names match, pick the object dtype column with the largest average string length.
if text_column is None:
    # Get object (string) columns.
    object_columns = claims_df.select_dtypes(include=['object']).columns.tolist()
    if len(object_columns) == 0:
        raise ValueError('No string columns found in uploaded CSV. Ensure your CSV has a column with claim text.')
    # Choose the object column with the largest average length (most likely to be the claim text).
    text_column = max(object_columns, key=lambda c: claims_df[c].dropna().astype(str).map(len).mean())

# Print what we selected and show a sample of values to confirm.
print('Selected text column:', text_column)
claims_df[[text_column]].head(10)


Selected text column: preprocessed_content


Unnamed: 0,preprocessed_content
0,style guide colour colour use imagecolour prof...
1,sustainability report look mining green office...
2,report environment social governance esg basel...
3,corporate social responsibilty report introduc...
4,sustainability admiral commit maintain respons...
5,look future sustainability report customer cle...
6,sustainability report guidance version content...
7,apple develop new alloy enable use percent rec...
8,report apple report content introduction etter...
9,reportenvironmental introduction letter cook r...


In [7]:

# Try to detect an existing label/category column. If not found, we will create labels via keyword mapping.
candidate_label_names = ['category','label','esg_category','claim_category','topic','class']

# Normalize lower-case column names again.
lower_cols = [c.lower() for c in claims_df.columns]
label_column = None
for cand in candidate_label_names:
    if cand in lower_cols:
        label_column = claims_df.columns[lower_cols.index(cand)]
        break

# If a label column exists, show distribution.
if label_column is not None:
    print('Found label column:', label_column)
    display(claims_df[label_column].value_counts().head(20))
else:
    print('No label column detected. Creating labels using keyword heuristics (conservative mapping).')
    # Define a conservative mapping from keywords to standard ESG categories.
    def map_to_category(text):
        # Normalize text.
        t = str(text).lower()
        # Carbon neutrality related.
        if any(k in t for k in ['carbon neutral', 'net zero', 'carbon neutrality', 'net-zero']):
            return 'Carbon Neutrality'
        # Renewable energy related.
        if any(k in t for k in ['renewable', 'solar', 'wind', 'hydro', 'geothermal']):
            return 'Renewable Energy'
        # Emissions and GHG related.
        if any(k in t for k in ['emission', 'ghg', 'greenhouse', 'scope 1', 'scope 2', 'scope 3']):
            return 'Emissions'
        # Social responsibility related.
        if any(k in t for k in ['community', 'employee', 'labor', 'diversity', 'human rights', 'health and safety']):
            return 'Social Responsibility'
        # Governance related.
        if any(k in t for k in ['governance', 'board', 'compliance', 'audit', 'ethics']):
            return 'Governance'
        # Fallback label.
        return 'Other'

    # Apply mapping to create a new column named 'category'.
    claims_df['category'] = claims_df[text_column].apply(map_to_category)
    label_column = 'category'
    display(claims_df[label_column].value_counts())


No label column detected. Creating labels using keyword heuristics (conservative mapping).


Unnamed: 0_level_0,count
category,Unnamed: 1_level_1
Renewable Energy,538
Carbon Neutrality,298
Emissions,24
Social Responsibility,4
Other,1
Governance,1


In [8]:

# Create a simple, reproducible text cleaning pipeline.
# We keep it conservative: lowercase, remove URLs, remove non-alphanumeric characters (keep spaces),
# remove stopwords, and lemmatize words. This is sufficient for sentence embeddings.

# Instantiate a lemmatizer for later use.
lemmatizer = WordNetLemmatizer()
# Load English stopwords from NLTK.
stop_words = set(stopwords.words('english'))

# Define the cleaning function. Comments above each line explain the operation.
def clean_text_for_model(s):
    # Convert the input to string and lowercase it for normalization.
    # This ensures non-string inputs won't crash the pipeline.
    # Example: 'We aim to be Carbon Neutral by 2030.' -> 'we aim to be carbon neutral by 2030.'
    s = str(s).lower()
    # Remove URLs as they don't help classification and can be noisy.
    s = re.sub(r'http\S+|www\.\S+', ' ', s)
    # Remove punctuation/special characters but keep numbers and letters.
    s = re.sub(r'[^a-z0-9\s]', ' ', s)
    # Tokenize on whitespace.
    tokens = s.split()
    # Remove stopwords and single-letter tokens, then lemmatize each token.
    processed = []
    for tok in tokens:
        # Skip stopwords like 'the', 'and', etc.
        if tok in stop_words:
            continue
        # Skip single-character tokens like 'a' or 'i' (not informative for this task).
        if len(tok) == 1:
            continue
        # Lemmatize token to reduce inflectional variance.
        tok = lemmatizer.lemmatize(tok)
        processed.append(tok)
    # Rejoin tokens into a cleaned string to feed into the sentence-transformer model.
    return ' '.join(processed)

# Apply cleaning to the chosen text column and store as 'clean_text'.
claims_df['clean_text'] = claims_df[text_column].astype(str).apply(clean_text_for_model)

# Show examples of original vs cleaned text.
claims_df[[text_column, 'clean_text']].head(10)


Unnamed: 0,preprocessed_content,clean_text
0,style guide colour colour use imagecolour prof...,style guide colour colour use imagecolour prof...
1,sustainability report look mining green office...,sustainability report look mining green office...
2,report environment social governance esg basel...,report environment social governance esg basel...
3,corporate social responsibilty report introduc...,corporate social responsibilty report introduc...
4,sustainability admiral commit maintain respons...,sustainability admiral commit maintain respons...
5,look future sustainability report customer cle...,look future sustainability report customer cle...
6,sustainability report guidance version content...,sustainability report guidance version content...
7,apple develop new alloy enable use percent rec...,apple develop new alloy enable use percent rec...
8,report apple report content introduction etter...,report apple report content introduction etter...
9,reportenvironmental introduction letter cook r...,reportenvironmental introduction letter cook r...


In [9]:

# Load the transformer model. In Colab this will download the model the first time you run it.
# 'all-MiniLM-L6-v2' is small, fast, and gives good sentence-level embeddings for classification tasks.
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the cleaned texts into numeric embeddings. Use show_progress_bar=True to monitor progress.
# Convert to numpy array for scikit-learn compatibility.
texts = claims_df['clean_text'].fillna('').tolist()
embeddings = embedding_model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

# Save embeddings to disk for speed if you need to rerun later in the same session.
import numpy as _np
_np.save('claim_embeddings.npy', embeddings)
print('Embeddings shape:', embeddings.shape)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/28 [00:00<?, ?it/s]

Embeddings shape: (866, 384)


In [10]:
# Prepare labels: use the detected label column.
labels_raw = claims_df[label_column].astype(str).fillna('Other').tolist()

# Convert string labels to integers with LabelEncoder for classifiers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels_raw)

# Use a stratified train/test split to preserve label proportions.
# Test size is 20% by default; set random_state for reproducibility.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    embeddings, y, claims_df.index.values, test_size=0.20, random_state=42
)

# Train a Logistic Regression classifier (fast baseline).
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Train a Random Forest classifier (stronger baseline for many tasks).
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Evaluate both models on the test set and display classification reports.
y_pred_log = logreg.predict(X_test)
y_pred_rf = rf.predict(X_test)

print('Logistic Regression — Macro F1:', f1_score(y_test, y_pred_log, average='macro'))
print(classification_report(y_test, y_pred_log, target_names=label_encoder.classes_, labels=np.unique(y_test)))

print('\nRandom Forest — Macro F1:', f1_score(y_test, y_pred_rf, average='macro'))
print(classification_report(y_test, y_pred_rf, target_names=label_encoder.classes_, labels=np.unique(y_test)))

Logistic Regression — Macro F1: 0.245224171539961
                       precision    recall  f1-score   support

    Carbon Neutrality       0.62      0.13      0.21        63
            Emissions       0.00      0.00      0.00         1
           Governance       0.65      0.95      0.77       109
                Other       0.00      0.00      0.00         1

             accuracy                           0.64       174
            macro avg       0.32      0.27      0.25       174
         weighted avg       0.63      0.64      0.56       174


Random Forest — Macro F1: 0.2643768074716731
                       precision    recall  f1-score   support

    Carbon Neutrality       0.69      0.17      0.28        63
            Emissions       0.00      0.00      0.00         1
           Governance       0.66      0.95      0.78       109
                Other       0.00      0.00      0.00         1

             accuracy                           0.66       174
            macro

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [11]:

# Choose the model with higher macro F1 on the test set automatically.
f1_log = f1_score(y_test, y_pred_log, average='macro')
f1_rf = f1_score(y_test, y_pred_rf, average='macro')

if f1_rf >= f1_log:
    best_model = rf
    best_name = 'random_forest'
else:
    best_model = logreg
    best_name = 'logistic_regression'

print('Best model chosen:', best_name)

# Save the chosen model and label encoder for later inference or the Streamlit app.
joblib.dump(best_model, f'{best_name}_esg_model.joblib')
joblib.dump(label_encoder, 'label_encoder.joblib')
print('Saved model and label encoder to disk.')


Best model chosen: random_forest
Saved model and label encoder to disk.


In [12]:

# Predict on all embeddings (use the chosen best_model).
pred_probs = best_model.predict_proba(embeddings)
# The predicted integer labels.
pred_ints = pred_probs.argmax(axis=1)
# Convert integer labels back to strings via the label encoder.
pred_labels = label_encoder.inverse_transform(pred_ints)
# Max probability per instance (model confidence).
max_probs = pred_probs.max(axis=1)

# Attach predictions to the dataframe for inspection.
claims_df['predicted_category'] = pred_labels
claims_df['predicted_confidence'] = max_probs

# Flagging heuristics for unverifiable/vague claims:
# 1) Model confidence below 0.6 OR 2) claim contains vague words like 'care', 'commit', 'strive' and no numeric evidence.
vague_words = ['care', 'commit', 'strive', 'aim', 'aspire', 'we believe', 'focus on', 'dedicated to', 'supporting']

def is_vague_claim(row):
    text = str(row['clean_text']).lower()
    # If model is not confident, mark as vague.
    if row['predicted_confidence'] < 0.60:
        return True
    # If contains vague words AND no digit present in original text, mark as vague.
    if any(w in text for w in vague_words):
        # Check original (non-cleaned) text for numbers, dates, percentages.
        orig = str(row[text_column])
        if not re.search(r'\d', orig):
            return True
    return False

claims_df['is_vague'] = claims_df.apply(is_vague_claim, axis=1)

# Show a few flagged examples for manual review.
claims_df[claims_df['is_vague']].head(20)[[text_column, 'predicted_category', 'predicted_confidence']]


Unnamed: 0,preprocessed_content,predicted_category,predicted_confidence
0,style guide colour colour use imagecolour prof...,Other,0.93
1,sustainability report look mining green office...,Other,0.87
2,report environment social governance esg basel...,Other,0.84
3,corporate social responsibilty report introduc...,Other,0.88
4,sustainability admiral commit maintain respons...,Carbon Neutrality,0.79
5,look future sustainability report customer cle...,Other,0.61
6,sustainability report guidance version content...,Other,0.89
7,apple develop new alloy enable use percent rec...,Other,0.78
8,report apple report content introduction etter...,Carbon Neutrality,0.73
9,reportenvironmental introduction letter cook r...,Carbon Neutrality,0.72


In [13]:

# Visualize the distribution of predicted categories across the dataset.
# Use Plotly for an interactive bar chart.
category_counts = claims_df['predicted_category'].value_counts().reset_index()
category_counts.columns = ['category', 'count']

# Simple bar chart of claim counts per predicted category.
fig = px.bar(category_counts, x='category', y='count', title='Predicted ESG Category Distribution')
fig.show()

# Detect a company column (optional) to rank companies by claim frequency.
candidate_company_names = ['company','organization','org','company_name','issuer','entity','firm']
company_col = None
for cand in candidate_company_names:
    if cand in lower_cols:
        company_col = claims_df.columns[lower_cols.index(cand)]
        break

# If a company column was found, show top companies by number of claims and a treemap.
if company_col is not None:
    # Count claims per company.
    company_counts = claims_df[company_col].value_counts().reset_index().head(30)
    company_counts.columns = ['company', 'claims_count']
    fig2 = px.bar(company_counts, x='company', y='claims_count', title='Top Companies by Claim Count')
    fig2.update_layout(xaxis_tickangle=-45, height=500)
    fig2.show()
else:
    print('No company/organization column detected; skip company ranking visuals.')


No company/organization column detected; skip company ranking visuals.


In [16]:
# Create a minimal Streamlit app script so you can run a lightweight dashboard locally or in environments that support Streamlit.

# Define the Streamlit app code as a multiline string
app_code = """
import streamlit as st
import pandas as pd
import joblib
from sentence_transformers import SentenceTransformer
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Simple cleaning (same logic as the notebook; keep it conservative).
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text_for_model(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r'http\S+|www\.\S+', ' ', s)
    s = re.sub(r'[^a-z0-9\s]', ' ', s)
    tokens = s.split()
    processed = []
    for tok in tokens:
        if tok in stop_words or len(tok) == 1:
            continue
        processed.append(lemmatizer.lemmatize(tok))
    return ' '.join(processed)

st.title('ESG Claim Classifier (Minimal)')
uploaded = st.file_uploader('Upload a CSV with a claims column', type=['csv'])
model = None
label_encoder = None
embedder = None

if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write('Columns found:', df.columns.tolist())
    # Try to detect text-like column.
    text_col = None
    candidates = ['claim','text','statement','content','description']
    lowcols = [c.lower() for c in df.columns]
    for i, c in enumerate(lowcols):
        if c in candidates:
            text_col = df.columns[i]
            break
    if text_col is None:
        # fallback to first object column
        obj_cols = df.select_dtypes(include=['object']).columns.tolist()
        if len(obj_cols) == 0:
            st.error('No text column found. Please upload a CSV with a text/claim column.')
        else:
            text_col = obj_cols[0]
    st.write('Using text column:', text_col)
    df['clean_text'] = df[text_col].astype(str).apply(clean_text_for_model)
    # Load model artifacts saved by the notebook.
    try:
        model = joblib.load('random_forest_esg_model.joblib')
    except Exception:
        try:
            model = joblib.load('logistic_regression_esg_model.joblib')
        except Exception:
            st.error('Model not found. Run the notebook to train models first and place the model files next to the app.')
    if model is not None:
        label_encoder = joblib.load('label_encoder.joblib')
        embedder = SentenceTransformer('all-MiniLM-L6-v2')
        embeddings = embedder.encode(df['clean_text'].tolist(), convert_to_numpy=True)
        probs = model.predict_proba(embeddings)
        preds = probs.argmax(axis=1)
        df['predicted_category'] = label_encoder.inverse_transform(preds)
        df['predicted_confidence'] = probs.max(axis=1)
        st.dataframe(df[[text_col, 'predicted_category', 'predicted_confidence']].head(200))
"""

# Write the Streamlit app code to disk.
with open('esg_app.py', 'w', encoding='utf-8') as f:
    f.write(app_code)

print('Wrote esg_app.py (a minimal Streamlit app).')
print('To run locally: streamlit run esg_app.py')

Wrote esg_app.py (a minimal Streamlit app).
To run locally: streamlit run esg_app.py



invalid escape sequence '\S'


invalid escape sequence '\S'


invalid escape sequence '\S'



In [None]:
!streamlit run /usr/local/lib/python3.12/dist-packages/colab_kernel_launcher.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
[0m
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Local URL: [0m[1mhttp://localhost:8501[0m
[34m  Network URL: [0m[1mhttp://172.28.0.12:8501[0m
[34m  External URL: [0m[1mhttp://34.53.117.22:8501[0m
[0m


In [None]:
# Gradio if needed

import gradio as gr
import joblib
from sentence_transformers import SentenceTransformer
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# --------------------------
# Load model + encoder
# --------------------------
# Adjust the filename if your best model was Logistic Regression instead
model = joblib.load('random_forest_esg_model.joblib')
label_encoder = joblib.load('label_encoder.joblib')

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# --------------------------
# Text cleaning function
# --------------------------
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text_for_model(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r"http\\S+|www\\.\\S+", " ", s)
    s = re.sub(r"[^a-z0-9\\s]", " ", s)
    tokens = s.split()
    processed = []
    for tok in tokens:
        if tok in stop_words or len(tok) == 1:
            continue
        processed.append(lemmatizer.lemmatize(tok))
    return " ".join(processed)

# --------------------------
# Prediction + vague check
# --------------------------
vague_words = [
    "care", "commit", "strive", "aim", "aspire",
    "we believe", "focus on", "dedicated to", "supporting"
]

def predict_claim(claim_text):
    cleaned = clean_text_for_model(claim_text)
    embedding = embedder.encode([cleaned], convert_to_numpy=True)
    probs = model.predict_proba(embedding)[0]
    pred_index = probs.argmax()
    pred_label = label_encoder.inverse_transform([pred_index])[0]
    confidence = probs[pred_index]

    # Vague flag
    lower_text = cleaned.lower()
    orig = claim_text
    is_vague_flag = False
    if confidence < 0.60:
        is_vague_flag = True
    else:
        if any(w in lower_text for w in vague_words) and not re.search(r"\\d", orig):
            is_vague_flag = True

    return {
        "Predicted Category": pred_label,
        "Confidence": round(float(confidence), 3),
        "Is Vague?": "Yes" if is_vague_flag else "No"
    }

# --------------------------
# Gradio Interface
# --------------------------
iface = gr.Interface(
    fn=predict_claim,
    inputs=gr.Textbox(lines=4, label="Enter an ESG claim"),
    outputs="json",
    title="ESG Claim Classifier",
    description="Enter a single claim and get the predicted category, confidence, and a vague-claim flag."
)

if __name__ == "__main__":
    iface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://94ec856fb97c7a68f1.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)



## Final notes and next steps

- Run this notebook **sequentially** in Google Colab. The cells are intentionally minimal and annotated line-by-line.  
- If your CSV already includes a category/label column, the notebook will use it; otherwise it creates conservative keyword-based labels.  
- The sentence-transformer model will be downloaded when you run the embeddings cell (internet required in Colab).  
- The `esg_app.py` file is a minimal Streamlit app scaffold — place the trained model files (`*_esg_model.joblib` and `label_encoder.joblib`) next to the app before running `streamlit run esg_app.py` locally or in a compatible cloud environment.  
- If you want me to adapt the notebook to a specific CSV after you upload it here, re-run the notebook in Colab and then tell me which column names the file contains; I can then refine the label mapping and heuristics further.
