# 📄 FinRiskGPT - Day 2: Data Preparation & Text Extraction
This notebook extracts raw text from SEC 10-K PDF reports, splits it into paragraphs, and saves the result as JSON for LLM-based processing.

In [1]:
# ✅ Install required library (only needed once)
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.5-cp39-abi3-macosx_10_9_x86_64.whl (19.4 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.5


In [23]:
# ✅ Import libraries
import fitz  # PyMuPDF
import os
from pathlib import Path

## 📥 Step 1: Define the PDF text extraction function

In [28]:
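def extract_text_from_pdf(pdf_path, output_txt_path):
    """Extract the plain text of every page in a PDF and write it to a UTF-8 text file."""
    # Create the output folder if it does not exist yet
    Path(output_txt_path).parent.mkdir(parents=True, exist_ok=True)
    # Open the PDF in a context manager so the document is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Join the per-page text in one pass instead of repeated string concatenation
        text = "".join(page.get_text() for page in doc)
    with open(output_txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"✅ Extracted to {output_txt_path}")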

## 🚀 Step 2: Run extraction - replace with your PDF file path

In [51]:
# Place your 10-K PDF file in the "raw data" folder
pdf_file = "raw data/tsla-20241231.pdf"
txt_file = "data processed/tsla-20241231.txt"
extract_text_from_pdf(pdf_file, txt_file)

✅ Extracted to data processed/tsla-20241231.txt
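
To process more than one filing, the output path can be derived from each input path instead of being typed by hand. The following is a minimal sketch under stated assumptions: `pair_pdfs_with_outputs` is a hypothetical helper that is not part of the notebook, and it assumes every PDF in the input folder should map to a same-named `.txt` file in the output folder.

```python
from pathlib import Path


def pair_pdfs_with_outputs(raw_dir, processed_dir):
    """Pair every PDF in raw_dir with a same-named .txt path in processed_dir.

    Hypothetical helper for illustration; it only builds paths and
    creates the output folder, it does not read any PDF.
    """
    raw_dir = Path(raw_dir)
    processed_dir = Path(processed_dir)
    # Make sure the output folder exists before anything is written to it
    processed_dir.mkdir(parents=True, exist_ok=True)

    jobs = []
    # Sort so the processing order is stable across runs
    for pdf_path in sorted(raw_dir.glob("*.pdf")):
        txt_path = processed_dir / (pdf_path.stem + ".txt")
        jobs.append((str(pdf_path), str(txt_path)))
    return jobs
```

Each returned pair can then be passed to `extract_text_from_pdf(pdf_file, txt_file)` in a loop, for example over `pair_pdfs_with_outputs("raw data", "data processed")`.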


## 📑 Step 3: Split text into paragraphs for embedding

In [52]:
with open(txt_file, "r", encoding="utf-8") as f:
    sample = f.read(1000)
print(sample)


UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)

ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2024
OR

TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to _________
Commission File Number: 001-34756
Tesla, Inc.
(Exact name of registrant as specified in its charter)
Texas
91-2197729
(State or other jurisdiction of
incorporation or organization)
(I.R.S. Employer
Identification No.)
1 Tesla Road
Austin, Texas
78725
(Address of principal executive offices)
(Zip Code)
(512) 516-8177
(Registrant’s telephone number, including area code)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading Symbol(s)
Name of each exchange on which registered
Common stock
TSLA
The Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:
None
In

In [53]:
import re

def split_paragraphs(text_path, min_len=100):
    with open(text_path, "r", encoding="utf-8") as f:
        text = f.read()
    
    # Use a regex to handle different newline formats; split on blank lines
    raw_paragraphs = re.split(r"\n\s*\n", text)
    
    paragraphs = [p.strip() for p in raw_paragraphs if len(p.strip()) > min_len]
    print(f"✅ Parsed {len(paragraphs)} paragraphs.")
    return paragraphs

In [54]:
# Run the splitter
paragraphs = split_paragraphs(txt_file)

✅ Parsed 37 paragraphs.
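
To see how the blank-line regex and the length filter interact, the same logic can be run on a small in-memory string. This is only an illustrative sketch: `split_text_into_paragraphs` and the sample text are hypothetical stand-ins, and `min_len=10` is chosen just to make the filter visible on short input.

```python
import re


def split_text_into_paragraphs(text, min_len=100):
    """In-memory variant of the splitter: same regex, same length filter."""
    # A paragraph break is a newline, optional whitespace, then another newline.
    # Because \s also matches \r, Windows-style "\r\n\r\n" breaks are covered too.
    raw_paragraphs = re.split(r"\n\s*\n", text)
    # Keep only paragraphs strictly longer than min_len after trimming
    return [p.strip() for p in raw_paragraphs if len(p.strip()) > min_len]


# Hypothetical sample: one Windows-style break, one whitespace-only line
sample = (
    "First paragraph line one.\nline two.\r\n\r\n"
    "Second paragraph.\n   \n"
    "Short"
)
parts = split_text_into_paragraphs(sample, min_len=10)
for part in parts:
    print(repr(part))
```

A single newline inside a paragraph does not trigger a split, and the trailing fragment `"Short"` is dropped because it is not longer than `min_len`. The comparison is strict, so a paragraph of exactly `min_len` characters is also discarded.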


In [55]:
import json

def save_paragraphs_to_json(paragraphs, output_json_path):
    # Collect all paragraphs into a list of dicts
    data = [
        {"id": f"para_{i}", "text": p}
        for i, p in enumerate(paragraphs)
    ]
    
    # Save as a JSON file
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    
    print(f"✅ Saved {len(data)} paragraphs to {output_json_path}")


In [56]:
output_json_path = "data processed/tsla-20241231.json"
save_paragraphs_to_json(paragraphs, output_json_path)


✅ Saved 37 paragraphs to data processed/tsla-20241231.json
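
Before moving on to embedding, it can be useful to read the JSON file back and confirm it has the expected shape. The sketch below assumes the `{"id": "para_<i>", "text": ...}` layout written by `save_paragraphs_to_json`; `load_paragraphs_from_json` is a hypothetical helper, not part of the notebook.

```python
import json


def load_paragraphs_from_json(json_path):
    """Load the saved paragraphs and check each record's id and text fields."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    for i, record in enumerate(data):
        # Each record should be a dict with a sequential id and string text
        if (
            not isinstance(record, dict)
            or record.get("id") != f"para_{i}"
            or not isinstance(record.get("text"), str)
        ):
            raise ValueError(f"Unexpected record at index {i}: {record!r}")
    return data
```

Calling `load_paragraphs_from_json(output_json_path)` should return a list whose length matches the count printed when the file was saved.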
