# Knowledge Point Generator #

### TLDR

This notebook demonstrates the extraction, processing, and refinement of text from a PDF document, saving key information as a CSV file. The steps involved are:

1. **Extract Text from PDF**: Use `PyPDF2` to read and extract text from specified pages.
2. **Split Text into Sections**: Divide the extracted text into manageable sections.
3. **Extract Key Sentences**: Sample every 15th sentence and use regular expressions to filter out headings, numbered lines, and calculator-style terms.
4. **Refine Knowledge Points**: Clean and filter the extracted sentences.
5. **Save Knowledge Points to CSV**: Convert the refined sentences to a DataFrame and save as a CSV file.



In [1]:
!pip install PyPDF2 openai pandas
!pip install anthropic






In [2]:
import PyPDF2
import time
import re
import requests
import pandas as pd
import anthropic 
import csv

## Extracting Text from PDF

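Set the parameters before running this cell: `pdf_path`, `start_page`, and `end_page`. Page indices are zero-based, and `end_page` is exclusive (the page at index `end_page` is not read).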

In [3]:

def extract_text_from_pdf(pdf_path, start_page=0, end_page=None):
    """Extract text from pages start_page (zero-based) up to, but not including, end_page."""
    text = ""
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            # Default to the last page of the PDF. I pass an explicit end_page below
            # to skip the textbook's last section, which covers study habits.
            if end_page is None:
                end_page = len(reader.pages)
            # Never read past the last page
            end_page = min(end_page, len(reader.pages))
            # Loop through the requested pages
            for page_number in range(start_page, end_page):
                page = reader.pages[page_number]
                # Extract the text, falling back to an empty string if none is returned
                text += page.extract_text() or ""
    except Exception as e:
        print(f"Error reading PDF file: {e}")
    return text

pdf_path = 'C:/Users/Tiffany/Desktop/MuseTax/Secret Sauce CFA Level 1.pdf'
start_page = 4
end_page = 230
pdf_text = extract_text_from_pdf(pdf_path, start_page=start_page, end_page=end_page)
print(pdf_text[:1000])  # Print to check


products, making quality and value received more difficult to evaluate than for tangible
products, trust in investment professionals takes on an even greater importance. Failure
to act in a highly ethical manner can damage not only client wealth but also impede the
success of investment firms and investment professionals because potential investors
will be less likely to use their services.
Unethical behavior by financial services professionals can have negative effects for
society as a whole. A lack of trust in financial advisors will reduce the funds entrusted
to them and increase the cost of raising capital for business investment and growth.
Unethical behavior such as providing incomplete, misleading, or false information to
investors can affect the allocation of the capital that is raised.
Ethical vs. Legal Standards
Not all unethical actions are illegal, and not all illegal actions are unethical. Acts of
“whistleblowing” or civil disobedience that may be illegal in some places ar

Sources used: https://pypdf2.readthedocs.io/en/3.x/
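
Because `extract_text_from_pdf` passes `start_page` and `end_page` straight to `range()`, the values are zero-based positions in the PDF with an exclusive end. The sketch below shows one way to convert 1-based, inclusive page numbers into that form. The helper `to_zero_based_range` and the page count of 300 are hypothetical and only for illustration; they are not part of the notebook.

```python
def to_zero_based_range(first_page, last_page, total_pages):
    """Convert 1-based, inclusive page numbers to (start_page, end_page).

    The returned pair is zero-based with an exclusive end, so it can be
    passed directly to range(start_page, end_page).
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    start_page = first_page - 1
    # Clamp the end so it never exceeds the number of pages in the file
    end_page = min(last_page, total_pages)
    return start_page, end_page


# Pages 5 through 230 of an assumed 300-page file
print(to_zero_based_range(5, 230, 300))
```

With these inputs the helper yields the same `start_page = 4` and `end_page = 230` used in the cell above.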

## Splitting the text for even distribution of knowledge points

In [4]:
def split_text_into_sections(text, num_sections=10):
    lines = text.split('\n')
    section_length = len(lines) // num_sections
    sections = []
    for i in range(num_sections):
        start = i * section_length
        # The last section takes any leftover lines so none are dropped
        end = start + section_length if i < num_sections - 1 else len(lines)
        sections.append('\n'.join(lines[start:end]))
    return sections

sections = split_text_into_sections(pdf_text)
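
The number of lines rarely divides evenly by `num_sections`. One way to handle the remainder, sketched here with a hypothetical helper `split_lines_evenly` that is not part of the notebook, is to give the first few sections one extra line each so that section sizes differ by at most one.

```python
def split_lines_evenly(text, num_sections=10):
    """Split text into num_sections line-based chunks of near-equal size."""
    lines = text.split('\n')
    base, extra = divmod(len(lines), num_sections)
    sections = []
    start = 0
    for i in range(num_sections):
        # The first `extra` sections each receive one additional line
        end = start + base + (1 if i < extra else 0)
        sections.append('\n'.join(lines[start:end]))
        start = end
    return sections


sample = '\n'.join(f"line {n}" for n in range(1, 8))  # 7 lines of sample text
parts = split_lines_evenly(sample, num_sections=3)
print([len(p.split('\n')) for p in parts])  # lines per section
```

Joining the parts with newlines reproduces the sample text, so no line is lost in the split.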

## Extract Key Sentences

In [5]:
def extract_key_sentences(section, interval=15):
    sentences = re.split(r'(?<=[.!?]) +', section)  # Split into sentences
    key_sentences = []
    for i, sentence in enumerate(sentences):
        if i % interval == 0:  # Sample every `interval`-th sentence
            sentence = sentence.strip()
            if (len(sentence.split()) > 5 and  # Ensure the sentence is not too short
                # Filters to make sure each point contains information:
                # skip headings, lines starting with a number, and calculator-style terms
                not re.match(r'^(Chapter|Study Session|Section|Unit|Table of Contents|Foreword|Appendix|Index)', sentence, re.IGNORECASE) and
                not re.match(r'^[0-9]+.*$', sentence) and
                not re.search(r'\b(CF\d|NPV|CPT|PV)\b', sentence)):
                key_sentences.append(sentence)
    return key_sentences

knowledge_points = []
for section in sections:
    knowledge_points.extend(extract_key_sentences(section))
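
To see what the filters in `extract_key_sentences` accept and reject, the sketch below applies the same four checks to a few sample sentences. The helper `passes_filters` and most of the samples are made up for illustration; only the regular expressions and the word-count threshold are copied from the function above.

```python
import re

HEADING = re.compile(
    r'^(Chapter|Study Session|Section|Unit|Table of Contents|Foreword|Appendix|Index)',
    re.IGNORECASE,
)
LEADING_NUMBER = re.compile(r'^[0-9]+.*$')
CALC_TERMS = re.compile(r'\b(CF\d|NPV|CPT|PV)\b')


def passes_filters(sentence):
    """Return True if the sentence would be kept as a knowledge point."""
    sentence = sentence.strip()
    return bool(
        len(sentence.split()) > 5            # more than five words
        and not HEADING.match(sentence)      # does not start like a heading
        and not LEADING_NUMBER.match(sentence)  # does not start with a digit
        and not CALC_TERMS.search(sentence)  # no CF1, NPV, CPT, or PV tokens
    )


samples = [
    "Not all unethical actions are illegal, and not all illegal actions are unethical.",
    "Study Session 1 covers ethical and professional standards in detail.",
    "12 months of return data were used in the sample.",
    "Press CPT to compute the present value of the bond.",
    "State the hypothesis.",
    "Index funds usually charge lower fees than actively managed funds.",
]
for s in samples:
    print(passes_filters(s), "|", s)
```

Only the first sample is kept. The last one is rejected because the heading pattern has no word boundary after `Index`, so ordinary sentences that begin with one of the heading words are also filtered out.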

## Refine Knowledge Points

In [6]:
def refine_knowledge_points(knowledge_points):
    refined_points = []
    for point in knowledge_points:
        refined_point = point.strip()  # Remove leading and trailing whitespace
        if refined_point:  # Skip points that are empty after stripping
            refined_points.append(refined_point)
    return refined_points

In [7]:
knowledge_points = knowledge_points[:100]
knowledge_points = refine_knowledge_points(knowledge_points)

In [8]:
for i, point in enumerate(knowledge_points[10:20]):
    print(f"{i+1}: {point}")

1: For example,
“The mean return on the S&P 500 
Index is equal to zero.”
Steps in Hypothesis Testing
State the hypothesis.
Select a test statistic.
Specify the level of significance.
State the decision rule for the hypothesis.
Collect the sample and calculate statistics.
Make a decision about the hypothesis.
Make a decision based on the test results.Null and Alternative Hypotheses
The 
null hypothesis
, designated as H
0
, is the hypothesis the researcher wants to reject.
2: A test that is unlikely to 
reject a false null hypothesis
has little power.
Significance Level (
α
)
The 
significance level
 is the probability of 
making a Type I error (rejecting the null
when it is true) and is designated by the 
Greek letter alpha (
α
).
3: An upward
sloping trendline can be drawn that connects the low 
points for a stock in an uptrend.
A market is in a downtrend if prices are consistently reaching lower 
lows and retracing
to lower highs.
4: Some
technical analysts interpret these indicator
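
The printed points contain line breaks in the middle of sentences, carried over from the extracted PDF text. A possible extra refinement step, not part of the notebook, is to collapse all runs of whitespace into single spaces. The helper `normalize_whitespace` below is a hypothetical sketch; the sample string is taken from the output above.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()


raw = (
    "The \nnull hypothesis\n, designated as H\n0\n, "
    "is the hypothesis the researcher wants to reject."
)
print(normalize_whitespace(raw))
```

This only removes the line breaks. It does not repair spacing around punctuation or rebuild subscripts such as the `0` in H0, and applying it before sentence splitting would likely change which sentences get sampled.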

## Write to CSV

In [9]:
df = pd.DataFrame(knowledge_points, columns=['Knowledge Points'])
df.to_csv('knowledge_points.csv', index=False)
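
The notebook imports `csv` but writes the file with pandas. As a rough check that points containing commas or quotes survive a CSV round trip, the sketch below writes and re-reads a single `Knowledge Points` column using only the standard library. The sample points and the in-memory buffer are illustrative assumptions; the notebook itself writes to `knowledge_points.csv`.

```python
import csv
import io

# Made-up points containing characters that need CSV quoting
points = ["First point, with a comma", 'Second point with "quotes"']

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Knowledge Points"])   # header row, same column name as the DataFrame
writer.writerows([p] for p in points)   # one point per row

buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[0], len(rows) - 1)           # header and number of data rows
```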