<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> LLM Llama - Engineering prompt Sentiment analysis climate</b></div>

<div align="center">
    <img src="https://img.freepik.com/fotos-gratis/chamines-contra-uma-paisagem-industrial-de-ceu-limpo_91128-4692.jpg?t=st=1727545980~exp=1727549580~hmac=3ad404a9538b0cff5eed7ac466180b5e6b5fffa776daa61b06c27e8b8e3e6f6e&w=740" />
</div>

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 1 - Business problem</b></div>

### Business Problem: Sentiment Analysis of Tweets on Climate Change

**Objective:**
The goal is to classify tweets about "Climate Change" into three sentiment categories: positive, neutral, and negative, using a pre-trained large language model (LLM), specifically LLaMA. This classification will help identify public opinion trends and reactions to climate change over time. The insights gained can be used by environmental organizations, policymakers, and businesses to better understand public sentiment and develop strategies for communication, engagement, and policy making.

**Key Questions:**
1. What are the predominant sentiments (positive, neutral, or negative) expressed in daily discussions on Twitter about climate change?
2. How does public sentiment about climate change evolve over time?
3. Are there specific events or time periods that trigger noticeable shifts in sentiment?
4. Which factors (e.g., hashtags, keywords, influencers) are associated with positive or negative sentiment in the climate change conversation?

**Dataset Overview:**
- The dataset includes daily tweets containing the keyword "Climate Change" from January 1, 2022, to July 19, 2022.
- It consists of 11 columns, likely containing text data, user information, tweet metadata (e.g., likes, retweets), and timestamps.

**Steps to Solve the Problem:**

1. **Data Preparation and Exploration**:
   - Perform data cleaning (removing duplicates, handling missing values, cleaning up the tweet text by removing links, mentions, hashtags, etc.).
   - Conduct exploratory data analysis to understand the distribution of tweets over time and any potential trends in volume.

2. **Preprocessing for LLaMA**:
   - Tokenize and preprocess the tweet text to be compatible with the LLaMA model.
   - Optionally fine-tune the LLaMA model using labeled data if available, or use it directly for sentiment classification through prompt-based techniques.

3. **Sentiment Classification**:
   - Use LLaMA to classify each tweet into one of three categories: positive, neutral, or negative.
   - Generate prompts that guide the LLaMA model to assess the sentiment of each tweet.

4. **Evaluation and Metrics**:
   - Evaluate the model's performance using metrics like accuracy, precision, recall, F1-score, and confusion matrices.
   - Analyze the performance of the sentiment classification to ensure reliable results.

5. **Visualization and Insights**:
   - Create visualizations that show the sentiment trends over time.
   - Identify spikes in sentiment around significant events or milestones in the climate change discussion.

6. **Business Impact**:
   - Organizations can use the sentiment data to adjust communication strategies based on public opinion trends.
   - Governments and policymakers can use sentiment insights to craft targeted interventions or campaigns.
   - Businesses in the green energy sector can leverage these insights for marketing strategies.

**Expected Outcome**:
By successfully classifying tweets into positive, neutral, or negative sentiments, the analysis will provide a clear picture of public opinion regarding climate change. This will offer actionable insights to various stakeholders involved in climate action.

This business problem aims to harness the power of LLaMA for sentiment analysis in a highly relevant and timely context—public discourse on climate change.
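
The evaluation step above (accuracy, precision, recall, F1-score) can be sketched in plain Python. This is a minimal illustration, assuming a small hand-labelled sample exists; the notebook itself has no ground-truth labels, and the function name `evaluate_predictions` and the example labels are hypothetical.

```python
from collections import Counter

def evaluate_predictions(y_true, y_pred, labels):
    """Return overall accuracy and per-label precision, recall and F1."""
    # Count every (true label, predicted label) pair once
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(other, label)] for other in labels if other != label)
        fn = sum(pairs[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)
    return accuracy, report

# Hypothetical hand-labelled sample (not taken from the dataset)
labels = ["Positive", "Negative", "Neutral"]
y_true = ["Positive", "Negative", "Neutral", "Negative", "Positive", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Negative", "Neutral", "Neutral"]

accuracy, report = evaluate_predictions(y_true, y_pred, labels)
print(f"accuracy = {accuracy:.2f}")
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Predictions outside `labels` (for example an `"Error"` placeholder) count against accuracy but are not attributed to any class as false positives in this sketch.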

In [None]:
# Installing the latest versions of the Hugging Face Transformers library and Accelerate library
# Transformers: a library for natural language processing tasks like text classification, translation, etc.
# Accelerate: used to easily scale models across different hardware setups (CPU, GPU, multi-GPU, etc.)
# Install the bitsandbytes library
# bitsandbytes: a lightweight library that allows running large language models with fewer bits, enabling memory-efficient model training and inference.

# Installing packages
!pip install watermark
!pip install -U transformers accelerate
!pip install bitsandbytes
!pip install torch
!pip install spacy
!pip install langdetect

In [None]:
# Download necessary resources from NLTK
import nltk
import spacy

nltk.download('punkt_tab')
nltk.download('punkt')  # Tokenizer models
nltk.download('stopwords')  # Stopwords data
nltk.download('wordnet')  # WordNet lemmatizer data

# Download Spacy language models for English
# English model
!python -m spacy download en_core_web_sm

In [None]:
# Import of libraries

# System libraries
import re
import unicodedata
import itertools
import string
from collections import Counter

# Library for file manipulation
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import spacy
from spacy.lang.en.stop_words import STOP_WORDS as STOP_WORDS_EN  # English stopwords

# Load the English language model from spaCy
nlp_en = spacy.load('en_core_web_sm')

# Configuration for graph width and layout
sns.set_theme(style='whitegrid')
palette='viridis'

# Importing necessary libraries from PyTorch and Hugging Face Transformers
# PyTorch is a deep learning framework used for model training and inference
import torch

# AutoTokenizer: Automatically loads a pre-trained tokenizer for encoding text
# AutoModelForCausalLM: Loads a pre-trained model for causal language modeling (e.g., for text generation)
# pipeline: Provides an easy-to-use interface to perform tasks like text generation, sentiment analysis, etc.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Warnings remove alerts
import warnings
warnings.filterwarnings("ignore")

# Python version
from platform import python_version
print('Python version in this Jupyter Notebook:', python_version())

# Load library versions
import watermark

# Library versions
%reload_ext watermark
%watermark -a "Library versions" --iversions

# **GPU In LLM**

In [None]:
!nvidia-smi

In [None]:
!nvidia-smi -a

In [None]:
!nvidia-smi -L

In [None]:
# Check if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

In [None]:
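# Print the summary of CUDA memory usage (only possible when a GPU is present)
if torch.cuda.is_available():
    print(torch.cuda.memory_summary(device=torch.device('cuda')))
else:
    print("No CUDA device available; skipping the memory summary.")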

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 2 - Database</b></div>

- Here we load only 1,000 rows. The full dataset has 9,050 rows, which would take the model at least 3 hours to process, and the free Google Colab tier limits GPU hours.

- We therefore limit the dataset to 1,000 rows.


**- Link:** [Base dados - Kaggle](https://www.kaggle.com/datasets/die9origephit/climate-change-tweets)

In [None]:
# Loading the dataset (nrows sets how many rows are read)
df = pd.read_csv("/content/Climate change_2022-1-17_2022-7-19.csv", nrows=100)
df = df[['UserScreenName', 'UserName', 'Text', 'Embedded_text']]
df.head()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.dtypes

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 3 - Preprocessing Text</b></div>

In [None]:
# Ensure that the "Text" column is a string
df['Text'] = df['Text'].astype(str)

# Data info
df.info()

In [None]:
# Function to clean text
def limpar_texto(texto):
    # Remove URLs
    texto = re.sub(r'http\S+|www\.\S+', '', texto)

    # Remove mentions (@user)
    texto = re.sub(r'@\w+', '', texto)

    # Remove hashtags (#hashtag)
    texto = re.sub(r'#\w+', '', texto)

    # Remove emojis and non-ASCII characters
    texto = texto.encode('ascii', 'ignore').decode('ascii')

    # Remove punctuation
    texto = texto.translate(str.maketrans('', '', string.punctuation))

    # Convert to lowercase
    texto = texto.lower()

    # Tokenize the text
    palavras = word_tokenize(texto)

    # Remove English stopwords (the tweets are in English)
    stop_words = set(stopwords.words('english'))
    palavras = [palavra for palavra in palavras if palavra not in stop_words]

    # Join the words back together
    texto_limpo = ' '.join(palavras)

    return texto_limpo

# Apply the cleaning function to the 'Embedded_text' column
# (missing values become empty strings so the regex calls do not fail)
df['Text_Limpo'] = df['Embedded_text'].fillna('').apply(limpar_texto)

# List of columns to be removed
colunas_para_remover = ['UserName',
                        'Timestamp',
                        'Emojis',
                        'Comments',
                        'Likes',
                        'Retweets',
                        'Image link',
                        'Tweet URL']

# Check which columns exist in the dataframe and are in the removal list
colunas_existentes = [col for col in colunas_para_remover if col in df.columns]

# Drop the existing columns from the dataframe
df.drop(columns=colunas_existentes, inplace=True)

# Keep only the original and cleaned text, then preview the cleaned text
df = df[["Text", "Text_Limpo"]]
df.Text_Limpo.head(n=20)
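
To make the cleaning steps above concrete, here is a minimal stand-alone sketch that applies the same regex, ASCII and punctuation steps to one made-up tweet. It deliberately leaves out the NLTK tokenization and stopword removal so that it runs with the standard library only; `clean_tweet` and the sample text are hypothetical.

```python
import re
import string

def clean_tweet(text):
    """Apply the URL, mention, hashtag, emoji and punctuation steps to one tweet."""
    text = re.sub(r'http\S+|www\.\S+', '', text)          # remove URLs
    text = re.sub(r'@\w+', '', text)                      # remove mentions
    text = re.sub(r'#\w+', '', text)                      # remove hashtags
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop emojis / non-ASCII
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Lowercase and collapse the whitespace left behind by the removals
    return ' '.join(text.lower().split())

sample = "RT @user: Climate change is REAL! 🌍 #ClimateAction https://t.co/abc123"
print(clean_tweet(sample))  # rt climate change is real
```

Note that removing hashtags wholesale also discards the words inside them, which may matter when the hashtag carries the sentiment.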

In [None]:
# Function for removing stopwords
def remover_stopwords_nltk(tokens):
    """
    Remove stopwords from a list of tokens using NLTK.

    Parameters:
    tokens (list): List of tokens (words) to be filtered.

    Returns:
    list: List of tokens without stopwords.
    """
    stop_words = set(stopwords.words('english'))
    tokens_filtrados = [token for token in tokens if token.lower() not in stop_words]
    return tokens_filtrados

# Function for tokenization
def tokenizar_texto_spacy(texto, modelo):
    """
    Tokenize text using spaCy.

    Parameters:
    texto (str): Text to be tokenized.
    modelo (spaCy model): spaCy language model.

    Returns:
    list: List of tokens.
    """
    if pd.isnull(texto):
        return []

    # Process the text with the spaCy model
    doc = modelo(texto)

    # Extract alphabetic tokens only
    tokens = [token.text.lower() for token in doc if token.is_alpha]

    return tokens

# Function for Pre-Processing with spacy
def processar_texto_en_spacy(texto):
    """
    Process text by cleaning, tokenizing, removing stopwords, and lemmatizing using spaCy.

    Parameters:
    texto (str): Text to be processed.

    Returns:
    dict: Dictionary containing the tokens and lemmas of the text.
    """
    if pd.isnull(texto):
        return {"tokens_spacy": [], "lemmas_spacy": []}

    # Remove URLs
    texto = re.sub(r'http\S+|www\.\S+', '', texto)

    # Remove mentions (@user)
    texto = re.sub(r'@\w+', '', texto)

    # Remove hashtags (#hashtag)
    texto = re.sub(r'#\w+', '', texto)

    # Remove emojis and non-ASCII characters
    texto = texto.encode('ascii', 'ignore').decode('ascii')

    # Remove punctuation
    texto = texto.translate(str.maketrans('', '', string.punctuation))

    # Convert to lowercase
    texto = texto.lower()

    # Tokenize with spaCy
    tokens_spacy = tokenizar_texto_spacy(texto, nlp_en)

    # Remove stopwords using NLTK
    tokens_filtrados_spacy = remover_stopwords_nltk(tokens_spacy)

    # Lemmatization using spaCy
    doc = nlp_en(' '.join(tokens_filtrados_spacy))
    lemas_spacy = [token.lemma_ for token in doc]

    return {"tokens_spacy": tokens_filtrados_spacy, "lemmas_spacy": lemas_spacy}

# Apply the text processing function using spaCy
df[['Tokens_SpaCy', 'Lemas_SpaCy']] = df['Text_Limpo'].apply(lambda x: pd.Series(processar_texto_en_spacy(x)))

# Display the dataset
df.head()

In [None]:
# Drop any leftover metadata columns (a no-op if they were already removed above)
columns_to_remove = ['UserName', 'Timestamp', 'Emojis', 'Comments', 'Likes', 'Retweets', 'Image link', 'Tweet URL']
existing_columns = [col for col in columns_to_remove if col in df.columns]
df.drop(columns=existing_columns, inplace=True)
df.Text_Limpo.head(n=20)

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 4 - Exploratory data analysis</b></div>

In [None]:
# Joining all texts from the 'Text_Limpo' column into a single string
text = " ".join(review for review in df.Text_Limpo)

# Stopwords for the word cloud (named so it does not shadow the nltk `stopwords` import)
wordcloud_stopwords = set(STOPWORDS)

# Displaying the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      stopwords=wordcloud_stopwords).generate(text)
plt.figure(figsize=(20.5, 10))
plt.title("Word cloud - General clean text")
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
# Combine all tokens from the 'Tokens_SpaCy' column into a single list
all_tokens = [token for tokens in df['Tokens_SpaCy'] for token in tokens]

# Count the frequency of each token using Counter
token_counts = Counter(all_tokens)

# Get the top 20 most common tokens
common_tokens = token_counts.most_common(20)  # Limiting the result to the top 20 tokens

# Separate the tokens and their frequencies for plotting
tokens, frequencies = zip(*common_tokens)

# Create a bar plot to visualize the most frequent tokens
plt.figure(figsize=(12, 6))  # Set figure size
sns.barplot(x=list(frequencies), y=list(tokens), palette='Set2')  # Plot using seaborn with the 'Set2' color palette

# Set title and labels with improved readability
plt.title('Top 20 Most Common Tokens', fontsize=16)  # Title for the plot
plt.xlabel('Frequency', fontsize=14)  # X-axis label
plt.ylabel('Tokens', fontsize=14)  # Y-axis label

# Adjust the layout to ensure the plot elements fit well
plt.tight_layout()

# Remove gridlines
plt.grid(False)

# Display the plot
plt.show()

In [None]:
# Combine all tokens into a single list
all_tokens = [token for tokens in df['Lemas_SpaCy'] for token in tokens]

# Count the frequency of tokens
token_counts = Counter(all_tokens)

# Get the top 35 most common lemmas
common_tokens = token_counts.most_common(35)  # Limiting to top 35

# Separate tokens and their frequencies
tokens, frequencies = zip(*common_tokens)

# Create a bar plot for the most frequent tokens
plt.figure(figsize=(12, 6))
sns.barplot(x=list(frequencies), y=list(tokens), palette='Set2')  # Changed palette

# Improved title and axis labels
plt.title('Top 35 Most Common Lemmas', fontsize=16)
plt.xlabel('Frequency', fontsize=14)
plt.ylabel('Lemmas', fontsize=14)

# Clean up the layout and remove gridlines
plt.tight_layout()  # Ensure layout is clean and labels fit well
plt.grid(False)
plt.show()

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 5 - Model LLM</b></div>

In [None]:
!huggingface-cli login

In [None]:
# Hugging Face Hub identifier of the model to load
# This is a gated model: access must be granted on the Hub, which is why
# the login step above is required before downloading it
model_name = "meta-llama/Llama-3.1-8B-Instruct"

In [None]:
# Check if a GPU is available
# If a GPU is available, it will use "cuda"; otherwise, it will default to "cpu"
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
# Load the model with automatic device mapping to save memory
# The weights are fetched from the Hugging Face Hub (or the local cache) in float16 (half precision) to save memory.
# device_map="auto" distributes the model across the available hardware (e.g., GPU, CPU) to optimize memory usage.
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16,
                                             device_map="auto")
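
As a rough guide to why the precision matters, the sketch below estimates the memory needed for the weights alone at different bit widths. It assumes a nominal 8 billion parameters (taken from the model name, not measured) and ignores activations, the KV cache and framework overhead, so real usage will be higher. `estimate_weight_memory_gib` is a hypothetical helper.

```python
def estimate_weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights only, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30

N_PARAMS = 8e9  # nominal parameter count, an assumption based on the "8B" model name

for bits in (32, 16, 8, 4):
    gib = estimate_weight_memory_gib(N_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

At float16 the weights alone come to roughly 15 GiB, which may not fit comfortably on a smaller GPU; this is the situation where lower-bit loading with bitsandbytes, installed earlier, becomes relevant.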

In [None]:
# Load the tokenizer
# This loads the pre-trained tokenizer that matches the model.
# The tokenizer is responsible for converting text into token IDs that the model can process.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
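# 2. Pipeline Initialization
# Note: the zero-shot-classification pipeline is built around NLI sequence-classification
# checkpoints. A causal LM such as Llama is not one of its supported model types, so
# Transformers may warn here and the resulting scores should be treated with caution.
print("Initializing the zero-shot classification pipeline...")
classifier = pipeline('zero-shot-classification',
                      model=model,
                      tokenizer=tokenizer)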

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 8.1 - Engineering prompt examples</b></div>

In [None]:
# 3. Defining Sentiment Labels
labels = ["Positive", "Negative", "Neutral"]

In [None]:
df.shape

In [None]:
# Function to generate text based on a prompt
# This function takes a text prompt and generates a continuation of the text.
# max_length caps the total sequence length in tokens (prompt plus generated text).
def generate_text(prompt, max_length=150):
    # Tokenize the input prompt and move it to the model's device (e.g., GPU or CPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate text; unpacking the encoding also passes the attention mask
    outputs = model.generate(**inputs, max_length=max_length)

    # Decode the output tokens back into a human-readable string, skipping special tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# prompt the LLM 1

# Input prompt for generating text
prompt = "What is exoplanet discovery ?"

# Generate text based on the prompt
response = generate_text(prompt)

# Print the generated response
print(response)

In [None]:
# prompt the LLM 2

# Input prompt for generating text
prompt = "What was the James Webb Telescope looking for?"

# Generate text based on the prompt
response = generate_text(prompt)

# Print the generated response
print(response)

In [None]:
# prompt the LLM 3

# Input prompt for generating text
prompt = (
    "You are a helpful and professional customer service assistant for a retail company. "
    "Respond to the following customer inquiry in a friendly and efficient manner:\n\n"
    "Customer Inquiry: \"I received my order today, but one of the items is damaged. What should I do?\"\n\n"
    "Response:"
)

# Generate text based on the prompt
response = generate_text(prompt)

# Print the generated response
print(response)

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 9 - Prompt engineering on the dataset</b></div>

In [None]:
%%time

# Check if the "Text_Limpo" column exists
if 'Text_Limpo' not in df.columns:
    raise ValueError("The dataset must contain a column named 'Text_Limpo'.")

# Initialize a list to store the results
results = []

print("Starting sentiment classification...")
for index, row in df.iterrows():
    text = row['Text_Limpo']

    # Create a specific prompt for each text
    # (kept for reference: the zero-shot pipeline below receives the raw text, not this prompt)
    prompt = (
        f"Analyze the sentiment of the following text and classify it as Positive, Negative, or Neutral.\n"
        f"Text: \"{text}\"\n"
        f"Sentiment:"
    )

    try:
        # Perform zero-shot classification
        result = classifier(
            sequences=text,
            candidate_labels=labels,
            hypothesis_template="This text expresses a {} sentiment."
        )

        # Get the label with the highest score
        sentiment = result['labels'][0]
    except Exception as e:
        print(f"Error processing text: {text[:30]}... - {e}")
        sentiment = "Error"

    results.append(sentiment)

    # Optional: Display progress every 100 iterations
    if (index + 1) % 100 == 0:
        print(f"{index + 1} texts processed...")
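
The loop above builds a prompt but classifies with the zero-shot pipeline. A prompt-based alternative would send that prompt to a text-generation function and parse the label out of the completion. The sketch below shows only the prompt building and parsing; `build_sentiment_prompt`, `parse_sentiment` and `classify_with_prompt` are hypothetical helpers, and `fake_generate` is a stand-in so the example runs without a model. In the notebook, `generate_fn` could be the `generate_text` function defined earlier.

```python
import re

LABELS = ["Positive", "Negative", "Neutral"]

def build_sentiment_prompt(text):
    """Build the same style of prompt used in the classification loop."""
    return (
        "Analyze the sentiment of the following text and classify it as "
        "Positive, Negative, or Neutral.\n"
        f'Text: "{text}"\n'
        "Sentiment:"
    )

def parse_sentiment(generated, labels=LABELS, default="Unknown"):
    """Return the first label found after the last 'Sentiment:' marker."""
    # Causal LMs echo the prompt, so only look at what follows the marker;
    # otherwise the label names listed in the instruction would match first.
    completion = generated.rsplit("Sentiment:", 1)[-1]
    match = re.search("|".join(labels), completion, flags=re.IGNORECASE)
    return match.group(0).capitalize() if match else default

def classify_with_prompt(text, generate_fn):
    """Classify one text with any callable that maps a prompt to generated text."""
    return parse_sentiment(generate_fn(build_sentiment_prompt(text)))

def fake_generate(prompt):
    # Stand-in for a real model call: echoes the prompt and appends an answer
    return prompt + " Negative\nExplanation: the text describes worsening conditions."

print(classify_with_prompt("wildfires are getting worse every year", fake_generate))
```

Returning `"Unknown"` when no label is found keeps malformed completions visible instead of silently assigning them to a class.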

In [None]:
# Add a new column for sentiment analysis results to the DataFrame
df['Sentiment_LLM'] = results

# Select only the relevant columns: cleaned text and sentiment
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data = df[["Text_Limpo", "Sentiment_LLM"]].copy()

# Ensure that the "Sentiment_LLM" and "Text_Limpo" columns are treated as strings
data['Sentiment_LLM'] = data['Sentiment_LLM'].astype(str)  # Convert sentiment column to string
data['Text_Limpo'] = data['Text_Limpo'].astype(str)  # Convert cleaned text column to string

# View the resulting dataset with the selected columns
data.head(n=20)

In [None]:
# Create a count plot for the 'Sentiment_LLM' column, specifying the x-axis and DataFrame
sns.countplot(x='Sentiment_LLM', data=df, palette='Set2')  # Use 'Set2' color palette

# Add title and axis labels for clarity (optional)
plt.title('Distribution of Sentiments')  # Set the title of the plot
plt.xlabel('Sentiment')  # Label for the x-axis
plt.ylabel('Count')  # Label for the y-axis

# Display the plot without gridlines
plt.grid(False)  # Remove gridlines
plt.show()  # Show the plot
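
Raw counts are easier to compare across samples of different sizes when expressed as shares. The following is a small sketch using only the standard library; `sentiment_shares` and the example labels are hypothetical, and in the notebook the input could be `df['Sentiment_LLM']`.

```python
from collections import Counter

def sentiment_shares(labels):
    """Return the fraction of items carrying each label, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Made-up labels for illustration only
example_labels = ["Negative", "Negative", "Neutral", "Positive"]
for label, share in sentiment_shares(example_labels).items():
    print(f"{label}: {share:.0%}")
```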

In [None]:
from wordcloud import WordCloud, STOPWORDS

# Create a custom set of stopwords, updating it with additional words
stopwords_customizadas = set(STOPWORDS)
stopwords_customizadas.update(['python', 'twitter', 'rt'])  # Adding custom stopwords

# Filter the DataFrame for each sentiment category
sentimentos = ['Positive', 'Neutral', 'Negative']  # Adjust based on the actual values in your sentiment column

# Create subsets of the DataFrame for each sentiment category
df_positive = df[df['Sentiment_LLM'] == 'Positive']
df_neutral = df[df['Sentiment_LLM'] == 'Neutral']
df_negative = df[df['Sentiment_LLM'] == 'Negative']

# Combine all the text from each sentiment category into a single string
texto_positive = ' '.join(df_positive['Text_Limpo'].astype(str))  # Combine positive sentiment text
texto_neutral = ' '.join(df_neutral['Text_Limpo'].astype(str))  # Combine neutral sentiment text
texto_negative = ' '.join(df_negative['Text_Limpo'].astype(str))  # Combine negative sentiment text



In [None]:
def criar_nuvem_palavras(texto, titulo, stopwords_customizadas, cor='viridis', salvar=False, nome_arquivo='nuvem_palavras.png'):
    """
    Creates and displays a word cloud from the provided text.

    Parameters:
    texto (str): Text from which the word cloud will be generated.
    titulo (str): Title of the word cloud chart.
    stopwords_customizadas (set): Set of words to be excluded from the word cloud.
    cor (str): Color palette for the word cloud.
    salvar (bool): If True, saves the word cloud image.
    nome_arquivo (str): Filename for saving the word cloud image.
    """
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color='white',
        stopwords=stopwords_customizadas,
        max_words=200,  # Maximum number of words to be displayed in the word cloud
        max_font_size=100,  # Maximum font size for the largest words
        scale=3,  # Scale for adjusting the word cloud size
        random_state=42,  # Ensures reproducibility of the word cloud layout
        colormap=cor  # Color map for the word cloud
    ).generate(texto)

    # Set the size of the figure
    plt.figure(figsize=(15, 7.5))
    # Display the word cloud with smooth interpolation
    plt.imshow(wordcloud, interpolation='bilinear')
    # Add a title to the chart
    plt.title(titulo, fontsize=20)
    # Hide the axis
    plt.axis('off')

    # If saving the word cloud, save it in PNG format
    if salvar:
        plt.savefig(nome_arquivo, format='png')

    # Show the word cloud
    plt.show()



In [None]:
# Create and display the word cloud for positive sentiment
criar_nuvem_palavras(texto_positive,
                     'Word Cloud - Positive',
                     stopwords_customizadas,
                     cor='Greens',  # Use green color palette for positive sentiment
                     salvar=True,  # Save the image
                     nome_arquivo='nuvem_positivo.png')  # Save as 'nuvem_positivo.png'

In [None]:
# Create and display the word cloud for negative sentiment
criar_nuvem_palavras(texto_negative,
                     'Word Cloud - Negative',
                     stopwords_customizadas,
                     cor='Reds',  # Use red color palette for negative sentiment
                     salvar=True,  # Save the image
                     nome_arquivo='nuvem_negativo.png')  # Save as 'nuvem_negativo.png'

In [None]:
# Create and display the word cloud for neutral sentiment
criar_nuvem_palavras(texto_neutral,
                     'Word Cloud - Neutral',
                     stopwords_customizadas,
                     cor='Blues',  # Use blue color palette for neutral sentiment
                     salvar=True,  # Save the image
                     nome_arquivo='nuvem_neutro.png')  # Save as 'nuvem_neutro.png'

In [None]:
# Save the DataFrame with the sentiment labels to a new CSV file
df.to_csv("dataset_final.csv")
print("Dataset saved successfully!")

# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> Part 9 - Engineering Prompt on Dataset 2</b></div>

**Objective**: Identify keywords related to climate.

In [None]:
%%time

# Check if the "Text_Limpo" column exists
if 'Text_Limpo' not in df.columns:
    raise ValueError("The dataset must contain a column named 'Text_Limpo'.")

# Initialize a list to store the results
results = []

print("Starting keyword extraction for climate-related terms...")
for index, row in df.iterrows():
    text = row['Text_Limpo']

    # Create a specific prompt for keyword extraction
    prompt = (
        f"Identify and extract keywords related to climate or environmental topics from the following text.\n"
        f"Text: \"{text}\"\n"
        f"Keywords:"
    )

    try:
        # Rank the candidate climate topics for this text with the zero-shot pipeline
        keywords = classifier(
            sequences=prompt,
            candidate_labels=["Climate", "Environment", "Sustainability", "Weather", "Pollution", "Energy"],
            hypothesis_template="This text contains keywords about {}."
        )

        # The pipeline returns every candidate label, sorted from highest to lowest score
        extracted_keywords = keywords['labels']
    except Exception as e:
        print(f"Error processing text: {text[:30]}... - {e}")
        extracted_keywords = ["Error"]

    results.append(extracted_keywords)

    # Optional: Display progress every 100 iterations
    if (index + 1) % 100 == 0:
        print(f"{index + 1} texts processed...")

# Store the extracted keywords in the DataFrame
df['Climate_Keywords'] = results

print("Keyword extraction completed!")
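
Because the pipeline returns all candidate labels for every text, storing `keywords['labels']` as-is gives each row the same six topics in a different order. One way to keep only the relevant ones is to filter on the accompanying scores. The sketch below works on a hand-built dictionary with the pipeline's `labels` and `scores` keys; `select_labels`, the threshold and the example scores are hypothetical.

```python
def select_labels(result, threshold=0.25, max_labels=3):
    """Keep labels whose score reaches the threshold, up to max_labels."""
    pairs = zip(result["labels"], result["scores"])
    kept = [label for label, score in pairs if score >= threshold]
    return kept[:max_labels]

# Hand-built example in the shape returned by the zero-shot pipeline
example_result = {
    "sequence": "solar power cuts emissions",
    "labels": ["Energy", "Climate", "Pollution", "Weather"],
    "scores": [0.55, 0.30, 0.10, 0.05],
}
print(select_labels(example_result))  # ['Energy', 'Climate']
```

A suitable threshold depends on the number of candidate labels, since without `multi_label=True` the scores are normalized across all of them.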

In [None]:
# Add a new column for climate-related keywords to the DataFrame
df['Climate_Keywords'] = results  # 'results' holds the extracted keywords

# Select only the relevant columns: cleaned text and keywords
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data_keywords = df[["Text_Limpo", "Climate_Keywords"]].copy()

# Ensure that the "Climate_Keywords" and "Text_Limpo" columns are treated as strings

# Convert keywords column to string
data_keywords['Climate_Keywords'] = data_keywords['Climate_Keywords'].astype(str)

# Convert cleaned text column to string
data_keywords['Text_Limpo'] = data_keywords['Text_Limpo'].astype(str)

# View the resulting dataset with the selected columns
data_keywords.head(n=20)