---
## 1. Get the data

**Download the data by executing the code below:**

`Notes:`
* The script downloads all PDF files from an AWS S3 bucket, preserving the directory structure, and records them in a DataFrame.
* AWS credentials and configuration must be set in a `.env` file.
* It uses `boto3` to interact with S3, `pandas` to handle the data, and `re` for string manipulation.
* After downloading, it filters the files to keep, for each company, only the PDFs from the most recent and the oldest year.
* It then extracts the content of the filtered PDFs with the LlamaParse library.
* The extracted content is stored in a new DataFrame that holds each PDF file name and its text content.
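
The year-based filtering step can be sketched as follows. This is a minimal illustration, not the actual `data_utils` implementation: it assumes each S3 key has the hypothetical form `<company>/<name>_<year>.pdf`, and the helper name `select_oldest_and_newest` is invented for the example.

```python
import re
from collections import defaultdict

# Assumed key layout (hypothetical): "<company>/<anything><4-digit year>.pdf"
YEAR_PATTERN = re.compile(r"(19|20)\d{2}")


def select_oldest_and_newest(keys):
    """Return, per company, the PDF keys for its oldest and most recent year."""
    by_company = defaultdict(list)
    for key in keys:
        if not key.lower().endswith(".pdf"):
            continue  # ignore non-PDF objects
        company, _, filename = key.rpartition("/")
        match = YEAR_PATTERN.search(filename)
        if match is None:
            continue  # no year in the file name, so it cannot be ranked
        by_company[company].append((int(match.group()), key))

    selected = []
    for company in sorted(by_company):
        dated = sorted(by_company[company])
        selected.append(dated[0][1])        # oldest year
        if dated[-1][1] != dated[0][1]:
            selected.append(dated[-1][1])   # most recent year
    return selected


example_keys = [
    "acme/acme_2012.pdf",
    "acme/acme_2019.pdf",
    "acme/acme_2022.pdf",
    "globex/globex_2020.pdf",
    "globex/readme.txt",
]
print(select_oldest_and_newest(example_keys))
```

A company with a single report contributes one file, since its oldest and most recent years coincide.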


In [1]:
# Importing the necessary libraries
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from src.plots import (
    analyze_text,
    analyze_sentiment,
    generate_word_cloud,
    plot_common_words,
    display_ngrams_with_plot_side_by_side,
)

[nltk_data] Downloading package punkt to /home/oem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
from src import data_utils

# Download the PDFs and convert them to text
filtered_pdfs_df = data_utils.download_pdfs_and_convert_to_text()

Started parsing the file under job_id 4d611e76-889a-404c-8893-ce7fb5759769
Started parsing the file under job_id faad762b-84ec-47cb-89f3-55ed5f833041



## 2. Normalize the data

**Normalizing text is crucial for preparing data for further analysis, ensuring the text is consistent and easy to process by removing noise and standardizing the format.**

- **Text Cleaning (`clean_text`):** Convert text to lowercase and remove unwanted content such as punctuation, URLs, HTML tags, and digits.
- **Expand Contractions (`expand_contractions`):** Replace contractions (e.g., "can't" to "cannot") using a predefined dictionary of contractions.
- **Lemmatize Text (`lemmatize_text`):** Tokenize the text and apply lemmatization to convert words to their base form (e.g., "running" to "run").
- **Remove Stopwords (`remove_stopwords`):** Tokenize the text and remove common stop words that do not contribute to the meaning (e.g., "and", "the").
- **Normalize Corpus (`normalize_corpus`):** Run the full pipeline: combine text chunks into a single string if needed; apply text cleaning, contraction expansion, lemmatization, and stop word removal in sequence; save the processed text to a `.txt` file with a specified prefix; and return the normalized text together with the output file name.
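
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `src/text_normalizer.py`: the contraction dictionary and stop word set are tiny stand-ins, the function `normalize` is a hypothetical name, and lemmatization is left out because it needs NLTK's WordNet data. The sketch also expands contractions before stripping punctuation, since the lookup relies on the apostrophe.

```python
import re

# Tiny illustrative lookups; a real pipeline would use much larger ones
# (for example NLTK's stop word list and a full contraction dictionary).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOPWORDS = {"and", "the", "a", "is", "it", "to", "of"}


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
    return pattern.sub(lambda m: CONTRACTIONS[m.group()], text)


def clean_text(text):
    """Strip URLs, HTML tags, digits and punctuation, then collapse spaces."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # digits and punctuation
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text):
    """Drop tokens that appear in the stop word set."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def normalize(text):
    # Lowercase first so the contraction lookup matches, and expand
    # contractions before punctuation removal, which deletes apostrophes.
    text = expand_contractions(text.lower())
    return remove_stopwords(clean_text(text))


sample = "It's <b>2019</b>: revenue can't be found at https://example.com and the report."
print(normalize(sample))
```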


In [3]:
from src import text_normalizer

# Text cleanup and normalization
cleaned_text = text_normalizer.normalize_corpus(filtered_pdfs_df)

[nltk_data] Downloading package punkt to /home/oem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/oem/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/oem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Exploratory Data Analysis (EDA)**

1. Number of Words in the Vocabulary

In [4]:
# Call the analyze_text function
X, num_words, vocab_sample = analyze_text(cleaned_text)

print("Number of words in the vocabulary:", num_words)
print("Words in the vocabulary:", vocab_sample)

Number of words in the vocabulary: 4013
Words in the vocabulary: ['american', 'nir2o2c09', '442', 'risk', 'needed', 'forecast', '2012', 'neuhaus', '2769', 'produce']


2. Sentiment of the Text

In [None]:
sentiment = analyze_sentiment(cleaned_text)
print(f"Sentiment of the text: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")

3. Word Cloud

In [None]:
generate_word_cloud(cleaned_text)

4. Common Words Frequency

In [None]:
plot_common_words(cleaned_text)

5. Top 10 Bigrams

In [None]:
display_ngrams_with_plot_side_by_side(cleaned_text, n=2, top_n=10)
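
The word and bigram frequencies behind these plots can be approximated with a few lines of standard-library Python. The helpers in `src.plots` are not shown in this notebook and may work differently; `top_ngrams` is a hypothetical name used only for this sketch.

```python
from collections import Counter


def top_ngrams(text, n=2, top_n=10):
    """Return the `top_n` most frequent n-grams as (ngram, count) pairs."""
    tokens = text.split()
    # Slide a window of length n over the token list.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(top_n)


text = "net revenue grew while net revenue per order fell and net income rose"
print(top_ngrams(text, n=1, top_n=2))  # most common single words
print(top_ngrams(text, n=2, top_n=1))  # most common bigram
```

With `n=1` the function counts single words; with `n=2` it counts adjacent word pairs.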

---
## 3. Feature Engineering

In this stage, we split the texts into fragments and vectorize them so that machine learning models can process them. We use the `CharacterTextSplitter` class from LangChain to divide long texts into more manageable fragments, each of which should retain enough context.
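
The idea of overlapping fragments can be shown with a simplified fixed-width splitter. This is not LangChain's algorithm (`CharacterTextSplitter` splits on a separator and then merges the pieces up to the chunk size); the function name and default values below are assumptions made for illustration.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap repeats the tail of one fragment at the head of the next, so a sentence cut at a boundary still appears whole in at least one fragment when it is shorter than the overlap.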

After splitting the texts, we store the fragments in a new column of the DataFrame. Then, we use OpenAI embeddings to convert these text fragments into numerical vectors. Embeddings are numerical representations that capture the semantics and context of the texts.

Finally, we store these vectors in a `VectorStore` backed by FAISS, a library for efficient similarity search over large collections of vectors. This lets us retrieve the text fragments most similar to a query quickly.
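
What a vector store does at query time can be illustrated with a brute-force search. The sketch below uses hand-written 3-dimensional vectors in place of OpenAI embeddings and a linear scan in place of a FAISS index; every name and value in it is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, store, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    return sorted(scored, reverse=True)[:k]


# Toy "embeddings": each fragment is paired with a made-up vector.
store = [
    ("total net revenues", [0.9, 0.1, 0.0]),
    ("gross profit margin", [0.2, 0.9, 0.1]),
    ("board of directors", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.3, 0.0]
for score, text in search(query, store, k=2):
    print(f"{score:.3f}  {text}")
```

A linear scan compares the query against every stored vector, which is fine for a handful of fragments; an index such as FAISS exists to make the same lookup fast on large collections.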


In [4]:
import os
from src import text_processing

# Create and save the vectorstore
text_processing.create_and_save_vectorstore(cleaned_text)

## 4. Ragas Evaluation

`ragas` is a library for evaluating retrieval-augmented generation (RAG) and question-answering (QA) systems. It provides metrics that measure the quality of the generated answers: answer relevancy, faithfulness to the retrieved context, context precision and recall, and answer correctness and similarity.

When this code is executed:

* A dataset will be loaded from a JSON file.
* A set of evaluation metrics will be defined to measure different aspects of the QA system's performance.
* The dataset will be evaluated using the specified metrics.
* The evaluation results will be processed into two DataFrames: one for the global results and one for the results by question.
* The processed results will be returned for analysis or visualization.

In [2]:
from src import ragas_utils
from src import ragas_model

# Read the questions from questions.txt
data_questions = ragas_utils.process_information()

# Run the model to add an answer and contexts to each question
data_ragas = ragas_model.execute(data_questions)

# Write the results to data.json
ragas_utils.create_ragas_data_file(data_ragas)

The data.json file was created successfully
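
The non-metric columns of `question_result` shown further down (`question`, `answer`, `contexts`, `ground_truth`) suggest the shape of each record in `data.json`. The sketch below writes one such record to a temporary file; the exact schema produced by `ragas_utils.create_ragas_data_file` is an assumption, and all values are placeholders.

```python
import json
import os
import tempfile

# Placeholder record; field names mirror the question_result columns.
records = [
    {
        "question": "What were the company's total net revenues?",
        "answer": "placeholder answer produced by the model",
        "contexts": ["placeholder retrieved passage"],
        "ground_truth": "placeholder reference answer",
    }
]

# Write to a temporary directory so the real data.json is left untouched.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Reading the file back returns the same list of records.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(len(loaded), sorted(loaded[0]))
```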


In [3]:
from src import ragas_evaluate

# Run the evaluation
global_result, question_result = ragas_evaluate.get_evaluation()

Generating train split: 0 examples [00:00, ? examples/s]

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

In [4]:
global_result

| Metric | Result |
|---|---|
| context_precision | 0.75 |
| faithfulness | 0.625 |
| answer_relevancy | 0.490567 |
| context_recall | 0.25 |
| answer_correctness | 0.421659 |
| answer_similarity | 0.936635 |


In [5]:
question_result

| | question | answer | contexts | ground_truth | context_precision | faithfulness | answer_relevancy | context_recall | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What were 1-800-Flowers.com, Inc.'s 2019 tot... | 1-800-Flowers.com, Inc.'s 2019 total net reven... | [\| \| \|total net revenues\|417956\|278776\|605642\|... | 1-800-Flowers.com, Inc.'s 2019 total net reve... | 1.0 | 1.0 | 0.979257 | 1.0 | 0.24727 | 0.989082 |
| 1 | What was 1-800-Flowers.com, Inc.'s gross prof... | 1-800-Flowers.com, Inc.'s gross profit margin ... | [\| \| \|total net revenues\|417956\|278776\|605642\|... | As of July 2019, 1-800-Flowers.com, Inc.'s gr... | 1.0 | 0.0 | 0.98301 | 0.0 | 0.995184 | 0.980734 |
| 2 | What were 1-800-Flowers.com, Inc.'s net reven... | I don't have information on 1-800-Flowers.com,... | [\| \| \|total net revenues\|417956\|278776\|605642\|... | In 2021, 1-800-Flowers.com, Inc.'s net revenu... | 0.0 | 1.0 | 0.0 | 0.0 | 0.221617 | 0.886469 |
| 3 | What was 1-800-Flowers.com, Inc.'s cost of re... | I don't have the information for 1-800-Flowers... | [\| \| \|total net revenues\|417956\|278776\|605642\|... | In 2022, 1-800-Flowers.com, Inc.'s cost of re... | 1.0 | 0.5 | 0.0 | 0.0 | 0.222564 | 0.890256 |
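
The global scores appear to be the unweighted mean of the per-question scores: averaging the four rows of `question_result` reproduces every value in `global_result`. The check below uses only the numbers shown in this notebook; whether `ragas_evaluate` aggregates this way in general is an assumption.

```python
from statistics import mean

# Per-question scores copied from the question_result output above.
per_question = {
    "context_precision": [1.0, 1.0, 0.0, 1.0],
    "faithfulness": [1.0, 0.0, 1.0, 0.5],
    "answer_relevancy": [0.979257, 0.98301, 0.0, 0.0],
    "context_recall": [1.0, 0.0, 0.0, 0.0],
    "answer_correctness": [0.24727, 0.995184, 0.221617, 0.222564],
    "answer_similarity": [0.989082, 0.980734, 0.886469, 0.890256],
}

# Average each metric over the four questions.
global_scores = {metric: mean(scores) for metric, scores in per_question.items()}
for metric, score in global_scores.items():
    print(f"{metric}: {score:.6f}")
```

Reading the two tables together is useful: the low global `context_recall` (0.25) comes from three of the four questions scoring 0.0, and the two questions the model declined to answer account for the `answer_relevancy` scores of 0.0.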
