---
## 1. Get the data

**Download the data by executing the code below:**

`Notes:`
* The script downloads all PDF files from an AWS S3 bucket, preserving the directory structure, and stores them in a DataFrame.
* Make sure the required AWS credentials and configuration are set in a `.env` file.
* The script uses `boto3` to interact with S3, `pandas` to handle the data, and `re` for string manipulation.
* After downloading, the script filters the files to keep, for each company, the PDFs with the oldest and the most recent year.
* It then extracts the content of the filtered PDFs using the LlamaParse library.
* The extracted content is stored in a new DataFrame that holds the PDF file names and their corresponding text content.
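
The filtering step described in the notes can be sketched as follows. This is a minimal illustration, not the code in `data_utils`: it assumes S3 keys of the form `<company>/<name>_<year>.pdf` (a hypothetical layout), and the names `extract_year` and `filter_oldest_and_newest` are invented for this example. It uses plain Python so it runs without AWS access; the resulting list could then be loaded into a DataFrame.

```python
import re
from collections import defaultdict

# Assumption: the year is a standalone four-digit number (19xx or 20xx) in the file name.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_year(key):
    """Return the last four-digit year found in the file name, or None."""
    filename = key.rsplit("/", 1)[-1]
    matches = YEAR_PATTERN.findall(filename)
    return int(matches[-1]) if matches else None


def filter_oldest_and_newest(keys):
    """Keep, per company, the PDF keys with the oldest and the most recent year."""
    by_company = defaultdict(list)
    for key in keys:
        year = extract_year(key)
        if key.lower().endswith(".pdf") and year is not None:
            company = key.split("/", 1)[0]  # assumption: first path segment is the company
            by_company[company].append((year, key))

    selected = []
    for company in sorted(by_company):
        items = sorted(by_company[company])  # sorted by year, then by key
        oldest, newest = items[0], items[-1]
        selected.append(oldest[1])
        if newest[1] != oldest[1]:
            selected.append(newest[1])
    return selected


keys = [
    "acme/report_2018.pdf",
    "acme/report_2019.pdf",
    "acme/report_2020.pdf",
    "globex/annual_2021.pdf",
]
print(filter_oldest_and_newest(keys))
```

A company with a single report contributes one file, since its oldest and most recent PDF are the same.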


In [1]:
# Importing the necessary libraries
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from src.plots import (
    analyze_text,
    analyze_sentiment,
    generate_word_cloud,
    plot_common_words,
    display_ngrams_with_plot_side_by_side,
)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\egunza\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from src import data_utils

# Call the function to download the PDFs
filtered_pdfs_df = data_utils.download_pdfs_and_convert_to_text()


## 2. Normalize the data

**Normalizing text is crucial for preparing data for further analysis, ensuring the text is consistent and easy to process by removing noise and standardizing the format.**

- **Text Cleaning (`clean_text`):** Convert text to lowercase and remove unwanted content such as punctuation, URLs, HTML tags, and digits.
- **Expand Contractions (`expand_contractions`):** Replace contractions (e.g., "can't" to "cannot") using a predefined dictionary of contractions.
- **Lemmatize Text (`lemmatize_text`):** Tokenize the text and apply lemmatization to convert words to their base form (e.g., "running" to "run").
- **Remove Stopwords (`remove_stopwords`):** Tokenize the text and remove common stop words that do not contribute to the meaning (e.g., "and", "the").
- **Normalize Corpus (`normalize_corpus`):** Combine text chunks into a single string if needed, then apply text cleaning, contraction expansion, lemmatization, and stop word removal in sequence. The processed text is saved to a `.txt` file with a specified prefix, and the function returns the normalized text and the output file name.
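
The steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the implementation in `text_normalizer`: the contraction and stop word tables are tiny hypothetical samples, lemmatization is omitted because it needs an external resource such as NLTK's WordNet data, and contractions are expanded before cleaning here because cleaning strips apostrophes.

```python
import re

# Assumption: tiny illustrative tables; a real pipeline would use fuller ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in"}

CONTRACTION_PATTERN = re.compile(
    "|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def clean_text(text):
    """Lowercase the text and strip URLs, HTML tags, punctuation, and digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation and digits
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


def remove_stopwords(text):
    """Drop tokens that appear in the stop word set."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def normalize(text):
    """Apply the steps in sequence (lemmatization omitted in this sketch)."""
    return remove_stopwords(clean_text(expand_contractions(text)))


print(normalize("It's <b>2023</b> and we can't open https://example.com, the Report!"))
# prints: we cannot open report
```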


In [None]:
from src import text_normalizer

# Text cleanup and normalization
cleaned_text = text_normalizer.normalize_corpus(filtered_pdfs_df)

**Exploratory Data Analysis (EDA)**

1. Number of Words in the Vocabulary

In [None]:
# Call the analyze_text function
X, num_words, vocab_sample = analyze_text(cleaned_text)

print("Number of words in the vocabulary:", num_words)
print("Words in the vocabulary:", vocab_sample)

2. Sentiment of the Text

In [None]:
sentiment = analyze_sentiment(cleaned_text)
print(f"Sentiment of the text: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")

3. Word Cloud

In [None]:
generate_word_cloud(cleaned_text)

4. Common Words Frequency

In [None]:
plot_common_words(cleaned_text)

5. Top 10 Bigrams

In [None]:
display_ngrams_with_plot_side_by_side(cleaned_text, n=2, top_n=10)

---
## 3. Feature Engineering

In this stage, we split the texts into fragments and vectorize them so that machine learning models can process them. We use the `CharacterTextSplitter` class from LangChain to divide the long texts into more manageable fragments, ensuring that each fragment retains enough context.

After splitting the texts, we store the fragments in a new column of the DataFrame. Then, we use OpenAI embeddings to convert these text fragments into numerical vectors. Embeddings are numerical representations that capture the semantics and context of the texts.

Finally, we store these vectors in a `VectorStore` backed by FAISS, a library for efficient similarity search over large collections of vectors. This allows us to retrieve similar text fragments quickly.
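
To make the two ideas concrete, here is a minimal sketch in plain Python. It is a stand-in, not the project's code: `split_text` uses fixed-size overlapping character windows, which is simpler than what `CharacterTextSplitter` does, and `most_similar` is a brute-force cosine search that only illustrates the kind of lookup a FAISS index accelerates. All function names and the example vectors are hypothetical; real vectors would come from the embedding model.

```python
import math


def split_text(text, chunk_size=200, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vector, vectors, top_k=3):
    """Return the indices of the top_k vectors closest to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query_vector, vectors[i]),
        reverse=True,
    )
    return ranked[:top_k]


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
print(most_similar([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], top_k=2))
```

The overlap means consecutive fragments share some characters, so a sentence cut at a boundary still appears whole in at least one fragment when the overlap is large enough.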


In [None]:
import os
from src import text_processing

# Create and save the vectorstore
text_processing.create_and_save_vectorstore(cleaned_text)

## 4. Ragas Evaluation

TODO

In [None]:
from src import ragas_utils

# Evaluation
result = ragas_utils.get_evaluation()
result

In [2]:
import pandas as pd
from datasets import load_dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity
)



In [3]:
ragas_dataset = load_dataset('json', data_files='data1.json')
data = ragas_dataset['train']
ragas_dataset

Generating train split: 2 examples [00:00, 54.34 examples/s]


DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'contexts', 'ground_truth'],
        num_rows: 2
    })
})
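
For reference, a file with the four features listed above could be produced along these lines. This is a hedged sketch: the field names come from the dataset output, but the record values are placeholders invented for illustration, the helper names are hypothetical, and the assumption that `contexts` holds a list of strings is not confirmed by the output shown here. The records are serialized as JSON Lines (one object per line), a layout the `json` loader of `datasets` can read.

```python
import json

REQUIRED_FIELDS = ("question", "answer", "contexts", "ground_truth")


def validate_record(record):
    """Raise ValueError if a record lacks a field or has a non-list 'contexts'."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(record["contexts"], list):
        raise ValueError("'contexts' must be a list of strings")


def to_json_lines(records):
    """Serialize validated records as JSON Lines: one JSON object per line."""
    for record in records:
        validate_record(record)
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


# Placeholder values only; replace them with real pipeline outputs.
records = [
    {
        "question": "placeholder question",
        "answer": "placeholder answer produced by the pipeline",
        "contexts": ["placeholder retrieved passage"],
        "ground_truth": "placeholder reference answer",
    }
]
print(to_json_lines(records))
```

Writing the returned string to a file yields input in the same four-field shape as the dataset above.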

In [4]:
# Metrics
metrics=[
    context_precision,
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_correctness,
    answer_similarity
]

In [5]:
# Evaluation
result = None
try:
    result = evaluate(
        data,
        metrics=metrics,
        raise_exceptions=False
    )
except Exception as e:
    print(f"An error occurred: {e}")

# Display the result; a bare expression inside `try` is not rendered by the notebook
result

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]Runner in Executor raised an exception
Evaluating:   8%|▊         | 1/12 [00:03<00:37,  3.45s/it]Runner in Executor raised an exception
Evaluating:  17%|█▋        | 2/12 [00:03<00:15,  1.53s/it]Runner in Executor raised an exception
Runner in Executor raised an exception
Runner in Executor raised an exception
Runner in Executor raised an exception
Evaluating:  50%|█████     | 6/12 [00:03<00:02,  2.57it/s]Runner in Executor raised an exception
Runner in Executor raised an exception
Runner in Executor raised an exception
Runner in Executor raised an exception
Runner in Executor raised an exception
Evaluating:  92%|█████████▏| 11/12 [00:04<00:00,  5.45it/s]Runner in Executor raised an exception
Evaluating: 100%|██████████| 12/12 [00:04<00:00,  2.99it/s]
  value = np.nanmean(self.scores[cn])


In [None]:
def get_evaluation():
    ragas_dataset = load_dataset('json', data_files='data.json')
    data = ragas_dataset['train']
    #print("-------- DATA ------------")
    #print(data)

    # Metrics
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        answer_correctness,
        answer_similarity
    ]

    # Evaluation
    result = evaluate(
        data,
        metrics=metrics,
        raise_exceptions=False
    )

    # Overall result (built for inspection; not returned)
    #print(result)
    df = pd.DataFrame(result, index=[0])
    res_df = df.transpose()
    res_df.columns = ["Result"]
    #st.dataframe(res_df)

    # Result per question
    result_df = result.to_pandas()
    return result_df

In [None]:
result = get_evaluation()
result