---
## 1. Get the data

**Download the data by executing the code below:**

`Notes:`
* The script downloads all PDF files from an AWS S3 bucket, preserving the directory structure, and stores them in a DataFrame.
* Make sure the required AWS credentials and configuration are set in a `.env` file.
* It uses `boto3` to interact with S3, `pandas` to handle the data, and `re` for string manipulation.
* After downloading, it filters the files to keep, for each company, only those from the most recent and the oldest year.
* It then extracts the content of the filtered PDFs with the LlamaParse library.
* The extracted content is stored in a new DataFrame that holds each PDF file name and its text content.
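
The "most recent and oldest year per company" filter can be sketched in plain Python. This is only an illustration under assumptions: the key layout (`company/file_YEAR.pdf`), the sample keys, and the helper name `oldest_and_newest` are all hypothetical, and the real logic in `data_utils` may differ (it reportedly works on a DataFrame).

```python
import re
from collections import defaultdict

# Hypothetical S3-style keys; the real bucket layout is not shown in this notebook.
keys = [
    "acme/acme_2015.pdf",
    "acme/acme_2018.pdf",
    "acme/acme_2021.pdf",
    "globex/globex_2019.pdf",
    "globex/globex_2020.pdf",
    "globex/readme.pdf",  # no year in the name, so it is skipped
]

# A four-digit year starting with 19 or 20, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def oldest_and_newest(keys):
    """Return, per top-level folder, the keys for its oldest and newest year."""
    by_company = defaultdict(list)
    for key in keys:
        company, _, filename = key.partition("/")
        match = YEAR_RE.search(filename)
        if match:
            by_company[company].append((int(match.group(0)), key))

    selected = {}
    for company, dated in by_company.items():
        dated.sort()  # sorts by year first
        oldest, newest = dated[0][1], dated[-1][1]
        selected[company] = [oldest] if oldest == newest else [oldest, newest]
    return selected


selected = oldest_and_newest(keys)
print(selected)
```

Treating the top-level folder as the company is an assumption made for the sketch; any other grouping key would work the same way.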


In [1]:
# Importing the necessary libraries
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from src.plots import (
    analyze_text,
    analyze_sentiment,
    generate_word_cloud,
    plot_common_words,
    display_ngrams_with_plot_side_by_side,
)

[nltk_data] Downloading package punkt to /home/oem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
from src import data_utils

# Call the function to download the PDFs
filtered_pdfs_df = data_utils.download_pdfs_and_convert_to_text()

Started parsing the file under job_id be702afd-bc7c-4241-b329-14b452d83978
Started parsing the file under job_id 89e8b081-7cc5-47f3-a9ba-01f8527c6f03



## 2. Normalize the data

**Normalizing text is crucial for preparing data for further analysis, ensuring the text is consistent and easy to process by removing noise and standardizing the format.**

- **Text Cleaning (`clean_text`):** Convert text to lowercase and remove unwanted content such as punctuation, URLs, HTML tags, and digits.
- **Expand Contractions (`expand_contractions`):** Replace contractions (e.g., "can't" to "cannot") using a predefined dictionary of contractions.
- **Lemmatize Text (`lemmatize_text`):** Tokenize the text and apply lemmatization to convert words to their base form (e.g., "running" to "run").
- **Remove Stopwords (`remove_stopwords`):** Tokenize the text and remove common stop words that do not contribute to the meaning (e.g., "and", "the").
- **Normalize Corpus (`normalize_corpus`):** Combine text chunks into a single string if needed, then apply text cleaning, contraction expansion, lemmatization, and stop word removal in sequence. The processed text is saved to a `.txt` file with a specified prefix, and the function returns the normalized text and the output file name.
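
The steps above can be sketched with the standard library alone. This is a simplified stand-in, not the code in `text_normalizer`: the contraction and stop word dictionaries are tiny hand-written samples, lemmatization is left out because it needs the WordNet data, and contractions are expanded before punctuation is stripped (an assumption, so that the apostrophe is still present when matching).

```python
import re

# Tiny illustrative dictionaries; the real module presumably uses larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOPWORDS = {"and", "the", "a", "of", "is", "it", "to", "in"}

CONTRACTION_RE = re.compile(
    "|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def clean_text(text):
    """Lowercase the text and drop URLs, HTML tags, digits, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)      # digits and punctuation
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text):
    """Split on whitespace and drop common stop words."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def normalize(text):
    """Run the steps in sequence (lemmatization omitted in this sketch)."""
    return remove_stopwords(clean_text(expand_contractions(text)))


sample = "<p>It's 2019 and the bank can't lose: see https://example.com</p>"
normalized = normalize(sample)
print(normalized)  # bank cannot lose see
```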


In [3]:
from src import text_normalizer

# Text cleanup and normalization
cleaned_text = text_normalizer.normalize_corpus(filtered_pdfs_df)

[nltk_data] Downloading package punkt to /home/oem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/oem/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/oem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Exploratory Data Analysis (EDA)**

1. Number of Words in the Vocabulary

In [4]:
# Call the analyze_text function
X, num_words, vocab_sample = analyze_text(cleaned_text)

print("Number of words in the vocabulary:", num_words)
print("Words in the vocabulary:", vocab_sample)

Number of words in the vocabulary: 4013
Words in the vocabulary: ['american', 'nir2o2c09', '442', 'risk', 'needed', 'forecast', '2012', 'neuhaus', '2769', 'produce']


2. Sentiment of the Text

In [None]:
sentiment = analyze_sentiment(cleaned_text)
print(f"Sentiment of the text: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")

3. Word Cloud

In [None]:
generate_word_cloud(cleaned_text)

4. Common Words Frequency

In [None]:
plot_common_words(cleaned_text)

5. Top 10 Bigrams

In [None]:
display_ngrams_with_plot_side_by_side(cleaned_text, n=2, top_n=10)
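
For reference, counting n-grams needs nothing more than `collections.Counter`. The helper below is a hypothetical sketch of the counting step only; it is not the implementation behind `display_ngrams_with_plot_side_by_side`, which also draws a plot.

```python
from collections import Counter


def top_ngrams(text, n=2, top_n=10):
    """Return the top_n most frequent n-grams of whitespace-separated tokens."""
    tokens = text.split()
    # zip over n shifted views of the token list yields consecutive n-grams
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(top_n)


sample = "net income rose net income fell total deposits rose"
top = top_ngrams(sample, n=2, top_n=3)
print(top)
```

In this sample, "net income" is the only bigram that occurs twice, so it comes first.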

---
## 3. Feature Engineering

In this stage, we split the texts into fragments and vectorize them so that machine learning models can work with them. We use the `CharacterTextSplitter` class from LangChain to divide the long texts into more manageable fragments, ensuring that each fragment retains enough context.
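
To make the idea of context-preserving fragments concrete, here is a minimal sliding-window splitter in plain Python. It is a simplification and not LangChain's algorithm: `CharacterTextSplitter` also takes a separator into account, and the `chunk_size` and `chunk_overlap` values below are arbitrary assumptions rather than the settings used in `text_processing`.

```python
def split_text(text, chunk_size=200, chunk_overlap=40):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(100))  # 100 characters of dummy text
chunks = split_text(text, chunk_size=40, chunk_overlap=10)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one fragment is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one fragment when it is shorter than the overlap.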

After splitting the texts, we store the fragments in a new column of the DataFrame. Then, we use OpenAI embeddings to convert these text fragments into numerical vectors. Embeddings are numerical representations that capture the semantics and context of the texts.

Finally, we store these vectors in a `VectorStore` backed by FAISS, a library for efficient similarity search over large collections of vectors. This lets us retrieve the text fragments most similar to a query quickly.
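
What the vector store does at query time can be illustrated with a brute-force search. The sketch below is a stand-in under assumptions: the three-dimensional vectors are made up (real OpenAI embeddings have far more dimensions), cosine similarity is chosen for simplicity, and FAISS uses optimized indexes rather than a Python loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    return sorted(scored, reverse=True)[:k]


# Toy store of (fragment, embedding) pairs; the vectors are invented for the example.
store = [
    ("net income grew", [0.9, 0.1, 0.0]),
    ("total deposits increased", [0.1, 0.9, 0.0]),
    ("board of directors", [0.0, 0.1, 0.9]),
]

hits = search([1.0, 0.2, 0.0], store, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```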


In [None]:
import os
from src import text_processing

# Create and save the vectorstore
text_processing.create_and_save_vectorstore(cleaned_text)

## 4. Ragas Evaluation

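This section scores the generated answers with Ragas on six metrics: `context_precision`, `faithfulness`, `answer_relevancy`, `context_recall`, `answer_correctness`, and `answer_similarity`. The evaluation set is loaded from `data.json` and contains 7 questions, each with an answer, its retrieved contexts, and a ground truth.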

In [2]:
from src import ragas_utils

# Evaluation
result = ragas_utils.get_evaluation()
result

Evaluating:   0%|          | 0/42 [00:00<?, ?it/s]

| Metric | Result |
|---|---|
| context_precision | 1.0 |
| faithfulness | 0.488095 |
| answer_relevancy | 0.787714 |
| context_recall | 0.857143 |
| answer_correctness | 0.654904 |
| answer_similarity | 0.926747 |


In [5]:
import pandas as pd
from datasets import load_dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity
)



In [7]:
ragas_dataset = load_dataset('json', data_files='data.json')
data = ragas_dataset['train']
ragas_dataset

Generating train split: 7 examples [00:00, 334.94 examples/s]


DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'contexts', 'ground_truth'],
        num_rows: 7
    })
})

In [8]:
# Metrics
metrics = [
    context_precision,
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_correctness,
    answer_similarity
]

In [9]:
# Evaluation
try:
    result = evaluate(
        data,
        metrics=metrics,
        raise_exceptions=False
    )
except Exception as e:
    print(f"An error occurred: {e}")

Evaluating: 100%|██████████| 42/42 [00:13<00:00,  3.03it/s]


In [10]:
def get_evaluation():
    ragas_dataset = load_dataset('json', data_files='data.json')
    data = ragas_dataset['train']

    # Metrics
    metrics = [
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        answer_correctness,
        answer_similarity
    ]

    # Evaluation
    result = evaluate(
        data,
        metrics=metrics,
        raise_exceptions=False
    )

    # Global result: one aggregate score per metric (built here but not returned)
    df = pd.DataFrame(result, index=[0])
    res_df = df.transpose()
    res_df.columns = ["Result"]

    # Per-question result
    result_df = result.to_pandas()
    return result_df

In [11]:
result = get_evaluation()
result

Evaluating: 100%|██████████| 42/42 [00:16<00:00,  2.58it/s]


| | question | answer | contexts | ground_truth | context_precision | faithfulness | answer_relevancy | context_recall | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What was the net income of 1st Source Corporat... | In 2019, 1st Source Corporation's net income r... | [2019 net income was $91.96 million compared t... | The net income of 1st Source Corporation in 20... | 1.0 | 1.0 | 0.961423 | 1.0 | 0.843516 | 0.974066 |
| 1 | What was the total interest income for 1st Sou... | 1st Source Corporation raked in $282.8 million... | [Total interest income in 2019 was $282.877 , ... | The total interest income for 1st Source Corpo... | 1.0 | 0.0 | 0.983839 | 1.0 | 0.74097 | 0.963879 |
| 2 | What were the total deposits at the end of 201... | Total deposits held by 1st Source Corporation ... | [At year-end, total assets were $6.62 billion,... | At the end of 2019, total deposits were $5.36 ... | 1.0 | 1.0 | 0.875316 | 1.0 | 0.973539 | 0.894156 |
| 3 | What was the amount of total loans and leases ... | The combined value of loans and leases that 1s... | [At year-end, total assets were $6.62 billion,... | The amount of total loans and leases outstandi... | 1.0 | 0.5 | 0.915475 | 1.0 | 0.72628 | 0.905147 |
| 4 | What were the net charge-offs in 2019 and how ... | In 2019, 1st Source Corporation achieved a sub... | [Net charge-offs (recoveries) were $5,048,000 ... | Net charge-offs in 2019 were $5,048,000, compa... | 1.0 | 0.0 | 0.914138 | 1.0 | 0.526317 | 0.905246 |
| 5 | What was the return on average assets (ROAA) f... | 1st Source Corporation's ROAA rose to 1.41% in... | [Return on average assets (as a percent) for t... | The return on average assets (ROAA) for 2019 w... | 1.0 | 0.0 | 0.857697 | 0.5 | 0.556876 | 0.894209 |
| 6 | What was the efficiency ratio in 2019, and how... | 2019 income: Non-interest income ($101.13M) vs... | [The efficiency ratio in 2019 is not explicitl... | Noninterest income was $101,130,000 in 2019, n... | 1.0 | 0.75 | 0.0 | 0.5 | 0.666208 | 0.9506 |
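
As a sanity check, the per-question scores can be averaged by hand. The sketch below assumes the global score of a metric is the plain mean of its per-question scores; that is an assumption about how Ragas aggregates, but it matches the `context_precision` and `context_recall` values in the global table. Metrics judged by an LLM, such as `faithfulness`, can vary between runs, so they are left out here.

```python
from statistics import mean

# Scores copied from the per-question table above (7 questions).
per_question = {
    "context_precision": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "context_recall": [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5],
}

# Average each metric and round to the six decimals shown in the tables.
aggregate = {name: round(mean(scores), 6) for name, scores in per_question.items()}
print(aggregate)  # {'context_precision': 1.0, 'context_recall': 0.857143}
```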
