<a href="https://colab.research.google.com/github/sahopkin/AP_Standards_PDF_Extraction/blob/main/AP%20Computer%20Science%20-%20Best%20Model%20Yet%20(3).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pulling Standards from AP Course PDFs
This notebook uses the `pdfplumber` package to parse the pages of AP course description guides and pull out all content standards.  The notebook is titled for the AP Computer Science course, but the run recorded below uses the AP Chemistry course and exam description.

*NOTE:  The two AP English courses follow a different format from most of the other courses.  A specialized script must be used for them.*

### Set up the initial run-through
Begin by importing:
* `pdfplumber` to parse pages of the pdf
* `re` to utilize regular expressions for pulling out standard identifiers
* `pandas` to convert the final 2-D list into a dataframe

The `pull_standards_tables()` function checks each page in the given range for the terms 'ENDURING UNDERSTANDING' or 'LEARNING OBJECTIVE'.  If either term appears on the page, it extracts the page's tables and checks each row of each table for 'LEARNING OBJECTIVE' in the first or second cell.  When a row matches, the whole table is added to the list `all_pages_contents`.  The function returns `all_pages_contents` after the last page.
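The row check can be tried without a PDF.  The sketch below runs on a hand-built list shaped like the output of `extract_tables()`; the sample data and the helper name `has_learning_objective` are hypothetical, and it uses an `isinstance` test where the notebook's function relies on `try`/`except`.

```python
# Hand-built stand-in for what page.extract_tables() returns: a list of tables,
# each table a list of rows, each row a list of cells (strings or None).
sample_page_tables = [
    [["Unit", "Pacing"], [None, "Notes"]],
    [
        ["ENDURING UNDERSTANDING\nENE-6\nElectrical energy can be generated by chemical reactions.", None],
        ["LEARNING OBJECTIVE\nENE-6.D\n(objective text)", "ESSENTIAL KNOWLEDGE\nENE-6.D.1\n(knowledge text)"],
    ],
]


def has_learning_objective(table):
    """Return True if any row has 'LEARNING OBJECTIVE' in its first or second cell."""
    for row in table:
        for cell in row[:2]:
            # Empty cells come back as None, so only search real strings.
            if isinstance(cell, str) and "LEARNING OBJECTIVE" in cell:
                return True
    return False


# Keep only the tables that hold standards.
kept_tables = [table for table in sample_page_tables if has_learning_objective(table)]
print(len(kept_tables))
```

Only the second sample table is kept, because it is the only one with a 'LEARNING OBJECTIVE' cell.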

*The code below installs pdfplumber since it is not part of the Python standard library.*

In [1]:
!pip install pdfplumber

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
import pdfplumber
import re
import pandas as pd


def pull_standards_tables(pdf_doc, start_num, end_num):
    """Collect every table that has a 'LEARNING OBJECTIVE' cell in a row."""
    all_pages_contents = []
    table_settings = {"text_x_tolerance": 1, "text_y_tolerance": 5}

    # Open the PDF passed in as pdf_doc once, rather than once per page.
    with pdfplumber.open(pdf_doc) as pdf:
        # As in the original loop, the last 10 pages of the range are skipped.
        for i in range(start_num, end_num - 10):
            outcome_page = pdf.pages[i]
            eu_is_present = outcome_page.search('ENDURING UNDERSTANDING', regex=False, case=True)
            lo_is_present = outcome_page.search('LEARNING OBJECTIVE', regex=False, case=True)
            print(outcome_page.page_number)

            # Only extract tables from pages that mention either heading.
            if not (eu_is_present or lo_is_present):
                continue

            page_tables = outcome_page.extract_tables(table_settings=table_settings)

            for table in page_tables:
                for row in table:
                    try:
                        if 'LEARNING OBJECTIVE' in row[0] or 'LEARNING OBJECTIVE' in row[1]:
                            # Append without popping, so page_tables is not
                            # modified while it is being iterated over.
                            all_pages_contents.append(table)
                            print(f'{outcome_page.page_number} added to all_pages_contents list')
                            break
                    except TypeError:
                        # A cell was None and cannot be searched; try the next row.
                        continue
                    except IndexError:
                        # The row has fewer than two cells; skip this table.
                        break

    return all_pages_contents

### Get the standards
The PDF file path is stored in the variable `ap_pdf`.  The PDF is first opened to count its pages; this count is stored in `num_pdf_pages`.  `pull_standards_tables` is then called with `ap_pdf`, a starting page index of around 20 (20 in this run), and `num_pdf_pages` as arguments.  Note that the function stops 10 pages before `num_pdf_pages`.  The resulting list is stored in the variable `all_content`.

In order to use the scripts on other files:
* **Update `ap_pdf` to include the proper PDF file path**
* **Update `ap_identifier` to represent the appropriate AP Identifier prefix**

In [13]:
ap_pdf = "/content/ap-chemistry-course-and-exam-description.pdf"
ap_identifier = "AP.CH."

# The with block closes the PDF on exit, so no explicit close() is needed.
with pdfplumber.open(ap_pdf) as pdf:
    num_pdf_pages = len(pdf.pages)

all_content = pull_standards_tables(ap_pdf, 20, num_pdf_pages)


21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
36 added to all_pages_contents list
37
38
39
40
41
42
43
43 added to all_pages_contents list
44
44 added to all_pages_contents list
45
45 added to all_pages_contents list
46
46 added to all_pages_contents list
47
47 added to all_pages_contents list
48
48 added to all_pages_contents list
49
49 added to all_pages_contents list
50
50 added to all_pages_contents list
51
51 added to all_pages_contents list
52
53
54
55
56
57
57 added to all_pages_contents list
58
58 added to all_pages_contents list
59
59 added to all_pages_contents list
60
60 added to all_pages_contents list
61
61 added to all_pages_contents list
62
62 added to all_pages_contents list
63
63 added to all_pages_contents list
64
64 added to all_pages_contents list
65
65 added to all_pages_contents list
66
67
68
69
70
71
72
72 added to all_pages_contents list
73
74
74 added to all_pages_contents list
75
75 added to all_pages_contents list
76
76 added to all_pages_contents list
77
7

In [14]:
print(all_content[0:5])

[[['ENDURING UNDERSTAND\nSPQ-1\nThe mole allows different units to', 'ING\nbe compared.'], ['LEARNING OBJECTIVE\nSPQ-1.B\nErmaee nlx ealp da mstl stai eo h i sn nn ep ts t m’eh shce ai ip st sq r ob suu teem oa stn p w oot eei fft s ea at.htnni ev tee hle em ent', 'ESSENTIAL KNOWLEDGE\nSPQ-1.B.1\nTsirSdeih Pnelae Qgnt -ltm ie 1vit . ea Bye s . l a2oes bfm s uthpe nene d tc ai sct nr oa cu tn em o bpo oeef f seu a aos csef hadthm it saop ot ld tee oel ec ptmoe ern eim nt na i ntnin aaei tn n utg dh r e ea t.h e\nTbtisheh Xoe e te I tIcfmoi N rn hossa oop etT notv me n i e tE Ae rtm a a po Rr s Pa i traap npoP n e Egt i mei tRd nec i c xe n igEd i acmi eg ta T m mssf it a mI or o .N uoo r ns ae tlm Gm ss thsl i ea epsi wM c rt ts ls ih ei A t lvm p u lhe ee Ses nal a cw eSi onas n tm t e rbSs s g ae bi u i Pg o nnet onEhh t gf f asd C ltea s sye a oTa n s md n crm eR he c p saa Ap asl eee sv l rea. e gsem dks e r s o oe da f na gn e rte i a sc o ica nfhn g']], [['ENDURING UNDERSTANDING\n


---

### Functioning Script
The function below creates a list of sub-lists from the TOPIC pages of the course PDF.  Each sub-list contains 2 elements: a standard identifier and its descriptor.  The list of sub-lists can then be passed to `pandas` to generate a `DataFrame`.

### Regex Explanations
`'\w{3}-.*'`
This expression is used in `re.search()` and filters out any extraneous starting text like "ENDURING UNDERSTANDING".  It returns only the identifier with its description.  It looks for 3 word characters immediately followed by a dash and then any additional characters after that.
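As a quick check of this expression, the sketch below applies it to one cell string.  The string is assembled from an identifier and descriptor that appear in the notebook's output, with the newline already replaced by a space; the variable names are illustrative.

```python
import re

# Cell text after newlines have been replaced with spaces.
cell_text = "ENDURING UNDERSTANDING ENE-6 Electrical energy can be generated by chemical reactions."

# re.search scans for the first place where 3 word characters precede a dash.
match = re.search(r'\w{3}-.*', cell_text)
outcome_expression = match[0]
print(outcome_expression)
```

The header text is dropped and the result starts at `ENE-6`.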

`'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d+\s)'`
This expression is used in `re.match()` to parse out the standard identifier.  It first looks for 3 word characters followed by a dash and then uses a capture group with OR operators to allow three possibilities:
* a digit followed by a space (Ex: CON-2 {ENDURING UNDERSTANDINGS})
* a digit, period, letter, space (Ex: CON-2.P {LEARNING OBJECTIVES})
* a digit, period, letter, period, one or more digits, space (Ex: CON-2.P.1 {ESSENTIAL KNOWLEDGE})

*NOTE:  `re.search()` and `re.match()` return a match object.  Indexing the match object with zero gives the full matched string (Ex: `identifier[0]` --> `LIM-1`)*
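The three alternatives can be exercised on one string per standard level.  This is a small sketch: the sample strings reuse identifiers and descriptors from the notebook's output table, and the names `IDENTIFIER_PATTERN`, `samples`, and `identifiers` are illustrative.

```python
import re

IDENTIFIER_PATTERN = r'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d+\s)'

samples = [
    "ENE-1 The speed at which a reaction occurs can be influenced by a catalyst.",
    "ENE-2.A Explain the relationship between experimental observations and energy "
    "changes associated with a chemical or physical transformation.",
    "ENE-2.A.1 Temperature changes in a system indicate energy changes.",
]

identifiers = []
for text in samples:
    match = re.match(IDENTIFIER_PATTERN, text)
    # match[0] is the full match, including the trailing space that \s consumed.
    identifiers.append(match[0].replace(' ', ''))

print(identifiers)
```

Each string matches a different alternative in the capture group, and removing the space leaves the bare identifier.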

Because of how the table is organized, ESSENTIAL KNOWLEDGE standards share the same "cell".  As a result, only the first essential knowledge standard was being separated into an identifier and descriptor.  All additional identifiers and descriptors were being lumped into the first standard's description.
```python
    check_desc_stds = re.findall(r'\w{3}\-\d\.\w\.\d+\s', descriptor)
    #print(check_desc_stds)

    if len(check_desc_stds) >= 1:
        string_list = re.split(r'\w{3}\-\d\.\w\.\d+\s', descriptor)
        #print(string_list)
        descriptor = string_list[0]

        for i in range(len(check_desc_stds)):
            ident = check_desc_stds[i]
            desc = string_list[i+1]
            outcomes.append([ident, desc])
```
The code block above remedies this issue as follows:
* It searches the descriptor for additional essential knowledge identifiers using `re.findall(r'\w{3}\-\d\.\w\.\d+\s', descriptor)` and stores all results in the list `check_desc_stds`.
* If there is at least one identifier in the current descriptor, the descriptor is split on the identifier(s) using `re.split(r'\w{3}\-\d\.\w\.\d+\s', descriptor)`.  This creates a list (`string_list`) of descriptors without any identifiers.  Its first element is the text before the first additional identifier, and element `i+1` is the descriptor for identifier `i` in `check_desc_stds`.
* The original descriptor is trimmed to `string_list[0]`, since the first element of the split `descriptor` is the true first descriptor.
* Finally, the code loops through `check_desc_stds`, pairing each identifier `ident` with its descriptor `desc` (`string_list[i+1]`) in a list of 2 elements which is appended to `outcomes`.
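A worked example makes the index offset between the two lists easier to see.  The descriptor below is a made-up placeholder rather than text from the course PDF, and the `.strip()` calls are added here only to tidy the output.

```python
import re

EK_PATTERN = r'\w{3}\-\d\.\w\.\d+\s'

# Placeholder descriptor with two extra identifiers lumped into it.
descriptor = "first descriptor text ENE-1.A.2 second descriptor text ENE-1.A.3 third descriptor text"

check_desc_stds = re.findall(EK_PATTERN, descriptor)
string_list = re.split(EK_PATTERN, descriptor)

# string_list[0] is the text before the first extra identifier.
first_descriptor = string_list[0]

# Identifier i belongs with string_list[i + 1].
pairs = [
    [ident.strip(), string_list[i + 1].strip()]
    for i, ident in enumerate(check_desc_stds)
]

print(check_desc_stds)
print(first_descriptor)
print(pairs)
```

Because the pattern has no capture groups, `re.split` drops the identifiers themselves, which is why `re.findall` is needed to recover them.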

In [15]:
def standards_separator(content):
    """Split extracted table cells into [identifier, descriptor] pairs."""
    outcomes = []

    # Walk the tables from last to first, as the original while loop did.
    for table in reversed(content):
        for row in table:
            for item in row:
                if item is None or item == "":
                    continue

                outcome = item.replace('\n', ' ').replace('\xa0', ' ')
                try:
                    # Drop leading header text such as "ENDURING UNDERSTANDING".
                    outcome_expression = re.search(r'\w{3}-.*', outcome)[0]
                    identifier = re.match(r'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d+\s)', outcome_expression)
                    identifier = identifier[0].replace(' ', '')
                    #identifier = f'{ap_identifier}{identifier}'  #THIS IS NEW
                    descriptor = outcome_expression[len(identifier)+1:]

                    # Look for further essential knowledge identifiers in the same cell.
                    check_desc_stds = re.findall(r'\w{3}\-\d\.\w\.\d+\s', descriptor)

                    if len(check_desc_stds) >= 1:
                        string_list = re.split(r'\w{3}\-\d\.\w\.\d+\s', descriptor)
                        descriptor = string_list[0]

                        for i in range(len(check_desc_stds)):
                            ident = check_desc_stds[i]
                            #ident = f'{ap_identifier}{ident}'  #THIS IS NEW
                            desc = string_list[i+1]
                            outcomes.append([ident, desc])

                    outcomes.append([identifier, descriptor])
                except TypeError:
                    # re.search or re.match returned None: no identifier in this cell.
                    pass

    return outcomes

### Call the function
The `standards_separator()` function is called with `all_content` as an argument, and the result is stored in a variable called `ready_for_df` since the resulting list of sub-lists is formatted properly for a `DataFrame`.

In [16]:
ready_for_df = standards_separator(all_content)
print(ready_for_df[0:10])

[['ENE-6', 'Electrical energy can be generated by chemical reactions.'], ['ENE-6.D', 'Calculate the amount of charge flow based on changes in the amounts of reactants and products in an electrochemical cell.'], ['ENE-6.D.1', 'Faraday’s laws can be used to determine the stoichiometry of the redox reaction occurring in an electrochemical cell with respect to the following: a. Number of electrons transferred b. Mass of material deposited on or removed from an electrode c. Current d. Time elapsed e. Charge of ionic species I = q t EQN: /'], ['ENE-6.C', 'Explain the relationship between deviations from standard cell conditions and changes in the cell potential.'], ['ENE-6.C.4', 'Algorithmic calculations using the Nernst equation are insufficient to demonstrate an understanding of electrochemical cells under nonstandard conditions. However, students should qualitatively understand the effects of concentration on cell potential and use conceptual reasoning, including the qualitative use of th

### List to DataFrame with Pandas
This code transforms the `ready_for_df` list into a dataframe that can be saved as a .csv for later spreadsheet upload.  The columns are renamed to "Identifier" and "Descriptor".  Leading and trailing whitespace is stripped, the table is sorted alphabetically by identifier, and rows with a duplicate identifier or a duplicate descriptor are dropped.  Lastly, the `ap_identifier` is prepended to each identifier to make it as unique as possible, and each `\xa7` character (§) in a descriptor is replaced with a newline and a bullet.
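The same steps can be tried on a tiny hand-built list before running them on the full results.  This is a minimal sketch, not the notebook's cell: `sample_rows` is illustrative data built from two rows of the output table, with one deliberate duplicate that differs only by a trailing space.

```python
import pandas as pd

ap_identifier = "AP.CH."

sample_rows = [
    ["ENE-2.A.1 ", "Temperature changes in a system indicate energy changes."],
    ["ENE-1", "The speed at which a reaction occurs can be influenced by a catalyst."],
    ["ENE-2.A.1", "Temperature changes in a system indicate energy changes."],
]

df = pd.DataFrame(sample_rows, columns=["Identifier", "Descriptor"])

# str.strip() returns a new Series, so the result is assigned back.
df["Identifier"] = df["Identifier"].str.strip()
df["Descriptor"] = df["Descriptor"].str.strip()

# Sort by identifier, drop repeated identifiers, and renumber the rows.
df = (
    df.sort_values(by="Identifier")
      .drop_duplicates(subset="Identifier", keep="first")
      .reset_index(drop=True)
)

# Prepend the course prefix to every identifier.
df["Identifier"] = ap_identifier + df["Identifier"]
print(df)
```

Stripping before deduplicating matters here: without it, `"ENE-2.A.1 "` and `"ENE-2.A.1"` would count as different identifiers.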

In [17]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(ready_for_df)
df = df.rename(columns={0:'Identifier', 1:'Descriptor'})
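# str.strip() returns a new Series, so the result must be assigned back.
df['Identifier'] = df['Identifier'].str.strip()
df['Descriptor'] = df['Descriptor'].str.strip()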
df.sort_values(by=['Identifier'], inplace=True)
no_dups_df = df.drop_duplicates(subset=['Identifier'], keep='first', inplace=False)
no_dups_df = no_dups_df.drop_duplicates(subset=['Descriptor'], keep='first', inplace=False)
no_dups_df['Identifier'] = no_dups_df.Identifier.apply(lambda x: f'{ap_identifier}{x}')
no_dups_df['Descriptor'] = no_dups_df.Descriptor.apply(lambda x: x.replace('\xa7', '\n\u2022'))
# df.drop_duplicates(keep='first', inplace=True)
no_dups_df.reset_index(drop=True, inplace=True)
no_dups_df.head(30)

# left_aligned_df = df.style.set_properties(**{'text-align': 'left'})
# left_aligned_df

| | Identifier | Descriptor |
|---|---|---|
| 0 | AP.CH.ENE-1 | The speed at which a reaction occurs can be influenced by a catalyst. |
| 1 | AP.CH.ENE-1.A | Explain the relationship between the effect of a catalyst on a reaction and changes in the reaction mechanism. |
| 2 | AP.CH.ENE-1.A.1 | In order for a catalyst to increase the rate of a reaction, the addition of the catalyst must increase the number of effective collisions and/ or provide a reaction path with a lower activation energy relative to the original reaction coordinate. |
| 3 | AP.CH.ENE-1.A.2 | In a reaction mechanism containing a catalyst, the net concentration of the catalyst is constant. However, the catalyst will frequently be consumed in the rate-determining step of the reaction, only to be regenerated in a subsequent step in the mechanism. |
| 4 | AP.CH.ENE-1.A.3 | Some catalysts accelerate a reaction by binding to the reactant(s). The reactants are either oriented more favorably or react with lower activation energy. There is often a new reaction intermediate in which the catalyst is bound to the reactant(s). Many enzymes function in this manner. |
| 5 | AP.CH.ENE-1.A.4 | Some catalysts involve covalent bonding between the catalyst and the reactant(s). An example is acid-base catalysis, in which a reactant or intermediate either gains or loses a proton. This introduces a new reaction intermediate and new elementary reactions involving that intermediate. |
| 6 | AP.CH.ENE-1.A.5 | In surface catalysis, a reactant or intermediate binds to, or forms a covalent bond with, the surface. This introduces elementary reactions involving these new bound reaction intermediate(s). |
| 7 | AP.CH.ENE-2 | Changes in a substance’s properties or change into a different substance requires an exchange of energy. |
| 8 | AP.CH.ENE-2.A | Explain the relationship between experimental observations and energy changes associated with a chemical or physical transformation. |
| 9 | AP.CH.ENE-2.A.1 | Temperature changes in a system indicate energy changes. |


### Save resulting dataframe as a .csv
The argument `index=False` removes the index column for the .csv download so that the file only includes Identifiers and Descriptors.
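The effect of `index=False` is easiest to see by comparing the header lines.  In this sketch `to_csv()` is called without a path so that it returns the CSV text instead of writing a file; `sample_df` is illustrative data taken from one row of the output table.

```python
import pandas as pd

sample_df = pd.DataFrame(
    [["AP.CH.ENE-1", "The speed at which a reaction occurs can be influenced by a catalyst."]],
    columns=["Identifier", "Descriptor"],
)

# With no path argument, to_csv() returns the CSV content as a string.
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the index included, the header starts with an empty column name, which spreadsheet tools and `pd.read_csv` typically show as an unnamed column.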

In [18]:
no_dups_df.to_csv('ap_chem_stds.csv', index=False)

In [19]:
from google.colab import files
files.download('ap_chem_stds.csv')
