---
## 1. Get the data

**Download the data by executing the code below:**

`Notes:`
* This script will download all PDF files from an AWS S3 bucket, maintaining the directory structure, and store them in a DataFrame.
* Ensure the necessary AWS credentials and configuration are set in a `.env` file.
* The script uses boto3 to interact with S3, pandas to handle the data, and re for string manipulation.
* The script first downloads all PDF files, then filters these files to obtain those with the most recent and oldest years per company.
* Finally, it extracts the content of the filtered PDFs using the LlamaParse library.
* The extracted content is stored in a new DataFrame, which includes the PDF file names and their corresponding text content.
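
The filtering step described in the notes (keeping only the oldest and most recent report per company) can be sketched as follows. This is a minimal illustration, not the code in `data_utils`: the function name `newest_and_oldest_per_company`, the key layout `<company>/<name>_<year>.pdf`, and the year regex are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: each S3 key looks like "<company>/<name>_<year>.pdf".
YEAR_RE = re.compile(r"(?:19|20)\d{2}")


def newest_and_oldest_per_company(keys):
    """Map each company folder to its (oldest, newest) PDF key by year."""
    by_company = defaultdict(list)
    for key in keys:
        if not key.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        company, _, filename = key.rpartition("/")
        match = YEAR_RE.search(filename)
        if match:
            by_company[company].append((int(match.group()), key))

    selected = {}
    for company, dated in by_company.items():
        dated.sort()  # sorts by year first, then by key
        selected[company] = (dated[0][1], dated[-1][1])
    return selected


# Hypothetical keys, for illustration only.
example_keys = [
    "acme/report_2017.pdf",
    "acme/report_2019.pdf",
    "acme/report_2021.pdf",
    "globex/annual_2020.pdf",
    "acme/notes.txt",
]
selection = newest_and_oldest_per_company(example_keys)
```

A company with a single report maps to the same key twice, so downstream code may want to de-duplicate before parsing.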


In [1]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from src import data_utils

# Download the PDFs and extract their text content
p = data_utils.download_pdfs_and_convert_to_text()

Started parsing the file under job_id cac11eca-bc6a-4862-ae7f-4c68a1c810af
Started parsing the file under job_id cac11eca-09a5-480e-a15e-0fbe5afb417a



## 2. Normalize the data

**Normalizing text is crucial for preparing data for further analysis, ensuring the text is consistent and easy to process by removing noise and standardizing the format.**

- **Text Cleaning (`clean_text`):** Convert text to lowercase and remove unwanted content such as punctuation, URLs, HTML tags, and digits.
- **Expand Contractions (`expand_contractions`):** Replace contractions (e.g., "can't" to "cannot") using a predefined dictionary of contractions.
- **Lemmatize Text (`lemmatize_text`):** Tokenize the text and apply lemmatization to convert words to their base form (e.g., "running" to "run").
- **Remove Stopwords (`remove_stopwords`):** Tokenize the text and remove common stopwords that contribute little to the meaning (e.g., "and", "the").
- **Normalize Corpus (`normalize_corpus`):** Combine text chunks into a single string if needed, then apply text cleaning, contraction expansion, lemmatization, and stopword removal in sequence. Save the processed text to a `.txt` file with a specified prefix, and return the normalized text together with the output file name.
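
To make the sequence above concrete, here is a rough standard-library sketch of three of the steps. It is not the implementation in `text_normalizer`: the `CONTRACTIONS` and `STOPWORDS` tables are tiny illustrative stand-ins, `normalize` is a hypothetical wrapper, and lemmatization is left out because it needs NLTK's WordNet data. One further assumption is that cleaning keeps apostrophes so that contractions can still be expanded afterwards.

```python
import re

# Tiny illustrative tables; a real pipeline would use much fuller ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOPWORDS = {"and", "the", "a", "an", "of", "to", "is", "it"}

CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def clean_text(text):
    """Lowercase, then drop HTML tags, URLs, punctuation, and digits."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z'\s]", " ", text)     # keep letters and apostrophes
    return re.sub(r"\s+", " ", text).strip()


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group()], text)


def remove_stopwords(text):
    """Drop tokens that appear in the stopword set."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def normalize(text):
    """Apply cleaning, contraction expansion, and stopword removal in order."""
    return remove_stopwords(expand_contractions(clean_text(text)))


sample = "<p>I can't open the 2019 report at https://example.com!</p>"
normalized = normalize(sample)
```

Note that this sketch strips all digits and punctuation, so its output will not match the printed notebook output character for character.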


In [3]:
from src import text_normalizer

# Text cleanup and normalization
cleaned_text = text_normalizer.normalize_corpus(p['text'])

[nltk_data] Downloading package punkt to /home/yohana/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yohana/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/yohana/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
print(cleaned_text["Normalized Text"][0])

2019 annual report

|sel||nir2o2c09|
|-- -|-- -|-- -|
|cheddar|shaap|cheese|
|vus|saved|popcorthe|
|factor|ftsa|uhoco|
|cky|dpcor|rowth|
|muse|lpi|munch|
|riving|g|kr|
|pofcornr e|popcorn||
|bu|momgnidil|mutne|

driving growth , building momentum vision inspire human expression , connection celebration mission deliver smile 1-800-flowers.com , inc. leading provider gift designed inspire human expression , connection celebration . company celebration ecosystem feature all-star family brand , including 1-800-flowers.com , 1-800-baskets.com , cheryls cooky , harry david , shari berry , fruitbouquets.com , moose munch , popcorn factory , wolfermans bakerysm , personalization universe , simply chocolate , goodsey . also offer top-quality steak chop stock yard . celebration passport loyalty program , provides member free standard shipping service charge across portfolio brand , 1-800-flowers.com , inc. strives deepen relationship customer . company also operates bloomnet , international flor

---
## 3. Feature Engineering

You already have the pre-processed data; now you must vectorize it, because models only understand numbers. At this stage, choose whether to vectorize with BoW or TF-IDF. Later we will train our own embedding, but for now we go with a more "classic" vectori