# Pulling Standards from AP Course PDFs
This notebook uses the `pdfplumber` package to parse pages of AP course and exam description guides and pull out all content standards. This particular notebook works through the AP Computer Science A guide.

*NOTE:  The two AP English courses follow a format that is different than most of the other courses.  A specialized script must be used for them.*

### Set up the initial run-through
Begin by importing:
* `pdfplumber` to parse pages of the pdf
* `re` to utilize regular expressions for pulling out standard identifiers
* `pandas` to convert the final 2-D list into a dataframe

The `pull_standards_tables()` function searches each page in the given range for the terms 'ENDURING UNDERSTANDING' or 'LEARNING OBJECTIVE'. If either term appears on the page, it extracts the page's tables and checks each row for 'LEARNING OBJECTIVE' in its first or second cell. When a row matches, the table is popped from the page's table list and appended to the list `all_pages_contents`, which the function returns.

In [14]:
import pdfplumber
import re
import pandas as pd


def pull_standards_tables(pdf_doc, start_num, end_num):
    
    all_pages_contents = []
    
    # Open the PDF once for the whole loop; the with block closes it on exit
    with pdfplumber.open(pdf_doc) as pdf:
        # NOTE: the range stops 10 pages before end_num
        for i in range(start_num, end_num-10):
            outcome_page = pdf.pages[i]
            eu_is_present = outcome_page.search('ENDURING UNDERSTANDING', regex=False, case=True)
            lo_is_present = outcome_page.search('LEARNING OBJECTIVE', regex=False, case=True)
            print(outcome_page.page_number)

            # search() returns a list of matches; an empty list means the term is absent
            if eu_is_present or lo_is_present:
                page_tables = outcome_page.extract_tables(table_settings={"text_x_tolerance": 1, "text_y_tolerance": 5})

                for table in page_tables:
                    # Each table is a list of rows; each row is a list of cell strings (or None)
                    for row in table:
                        try:
                            if 'LEARNING OBJECTIVE' in row[0] or 'LEARNING OBJECTIVE' in row[1]:
                                # NOTE: popping while iterating over page_tables can skip
                                # the table that immediately follows a match
                                idx = page_tables.index(table)
                                standard = page_tables.pop(idx)
                                all_pages_contents.append(standard)
                                print(f'{outcome_page.page_number} added to all_pages_contents list')
                                break
                        except TypeError:
                            # A None cell cannot be searched; move on to the next row
                            pass
                        except IndexError:
                            # Row has fewer than two cells; stop checking this table
                            break
            
    return all_pages_contents
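
The row check can be tried without a PDF. The following is a minimal sketch, assuming tables shaped like the output of `extract_tables()` (a list of rows, each a list of cell strings or `None`). The helper name `has_learning_objective` and the sample data are hypothetical. Unlike the notebook function, it builds a new list instead of popping from the list being iterated, and it still checks the second cell when the first one is `None`.

```python
def has_learning_objective(table):
    """Return True if any row has 'LEARNING OBJECTIVE' in its first two cells."""
    for row in table:
        for cell in row[:2]:
            # Cells can be None when no text was extracted for them
            if cell and 'LEARNING OBJECTIVE' in cell:
                return True
    return False


# Hypothetical tables in the shape returned by extract_tables()
sample_tables = [
    [['Unit 1', 'Primitive Types']],
    [['ENDURING UNDERSTANDING\nCON-1\nSample text', None],
     ['LEARNING OBJECTIVE\nCON-1.A\nSample text', 'ESSENTIAL KNOWLEDGE\nCON-1.A.1\nSample text']],
    [[None, 'LEARNING OBJECTIVE\nCON-1.B\nSample text']],
]

kept = [table for table in sample_tables if has_learning_objective(table)]
print(len(kept))
```

Filtering with a list comprehension avoids mutating `sample_tables` during iteration, so no table is skipped after a match.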

### Get the standards
The path to the PDF file is stored in the variable `ap_pdf`. The PDF is opened once to count its pages, and the count is stored in `num_pdf_pages`. `pull_standards_tables()` is then called with `ap_pdf`, a starting page index of 20, and `num_pdf_pages` as arguments (the function stops 10 pages before that last value). The resulting list is stored in the variable `all_content`.

In [15]:
ap_pdf = "ap-computer-science-a-course-and-exam-description.pdf"

# The with block closes the PDF automatically
with pdfplumber.open(ap_pdf) as pdf:
    num_pdf_pages = len(pdf.pages)

all_content = pull_standards_tables(ap_pdf, 20, num_pdf_pages)


21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
36 added to all_pages_contents list
37
38
39
40
41
42
43
43 added to all_pages_contents list
43 added to all_pages_contents list
44
44 added to all_pages_contents list
45
45 added to all_pages_contents list
46
46 added to all_pages_contents list
47
47 added to all_pages_contents list
48
48 added to all_pages_contents list
49
50
51
52
53
54
55
55 added to all_pages_contents list
56
56 added to all_pages_contents list
57
57 added to all_pages_contents list
57 added to all_pages_contents list
58
58 added to all_pages_contents list
59
59 added to all_pages_contents list
60
60 added to all_pages_contents list
61
61 added to all_pages_contents list
62
62 added to all_pages_contents list
63
63 added to all_pages_contents list
64
64 added to all_pages_contents list
65
65 added to all_pages_contents list
66
66 added to all_pages_contents list
67
67 added to all_pages_contents list
67 added to all_pages_contents list
68
68 added to all_pages_conten

In [16]:
print(all_content)

[[['ENDURING UNDERSTANDING\nCON-1\nTdehtee wrmayin veasr itahbel ecso manpdu otepde rraetsourlst. are sequenced and combined in an expression', None], ['LEARNING OBJECTIVE\nCON-1.B \nEioansvf  asaa ilnvgu anaertmxieape bwrnelehts  assattis oai sant e s rwmetoisterhuen ladtt .n', 'ESSENTIAL KNOWLEDGE\nC/TofevoCCralhppo=oOOeXreeemmm,NNi    ETieoa rr%ni--xfaa pbXnhe11t  =ptptthe..colCnhBBooerr )eretuuieL .. rres45c.so fi.snU  s (Tsmacerx−dStio hn  ooaoef −Iuaoe nOfrrbnr )resrs is nmNetana sde( e yco iri  uS .g(aervw epieesTn.n. aem,ue elAd vlme.aus,rd Tea tmra+eeneh rElit+u[d ntenexoMo xe a  +tnAr)tpfEn   + o iota(PalsNd].+a n)p  a   EvTcaddi+edsa—xese  )ridacro sn aaoi(rmu isaEt1egnftiobm.K ds dntorileh eede sCrd neoe   dOs(oet t + t ahurtNcoh=o esbprea-rse 1,e tn t sri.r−mhgBc aaaeo=t.nce5rop mrt,n)r ea :s*1 t ey  =  n,t']], [['ENDURING UNDERSTANDING\nMOD-1\nSome objects or concepts are so frequently represented that programmers can \ndraw upon existing code that has already been tested


---

### Functioning Script
The function below creates a list of sub-lists from the TOPIC pages of the course PDF. Each sub-list contains 2 elements: a standard identifier and its descriptor. The list of sub-lists can then be fed into `pandas` to generate a `DataFrame`.

### Regex Explanations
`'\w{3}-.*'`
This expression is used with `re.search()` and filters out any extraneous leading text such as "ENDURING UNDERSTANDING". It returns only the identifier with its description: it looks for exactly 3 word characters immediately followed by a dash, then any characters after that.
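
As a quick illustration, here is a minimal sketch of that search on a hand-written sample string (the sample text is an assumption, not taken from the extracted tables):

```python
import re

# Sample cell text, with newlines already replaced by spaces
cell = 'ENDURING UNDERSTANDING CON-2 Programmers incorporate iteration and selection into code.'

match = re.search(r'\w{3}-.*', cell)

# Index 0 of the match object is the full matched string
print(match[0])
```

Because `re.search()` takes the first place where three word characters precede a dash, a hyphenated word that appears before the identifier would be matched instead.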

`'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d+\s)'`
This expression is used with `re.match()` to parse out the standard identifier. It first looks for exactly 3 word characters followed by a dash and then tries three alternatives, joined by the OR operator (`|`) inside the capture group:
* a number followed by a space (Ex: CON-2 {ENDURING UNDERSTANDINGS})
* a number, period, letter, space (Ex: CON-2.P {LEARNING OBJECTIVES})
* a number, period, letter, period, one or more digits, space (Ex: CON-2.P.1 {ESSENTIAL KNOWLEDGE})

*NOTE: `re.search()` and `re.match()` return a match object (or `None` when nothing matches). Indexing the match object at zero gives the full matched string (Ex: `identifier[0]` --> `LIM-1`).*
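
The three identifier levels can be checked with a short sketch. It uses the `\d+` form of the pattern that appears in `standards_separator()`; the constant name `ID_PATTERN` and the sample strings are assumptions made for illustration.

```python
import re

ID_PATTERN = r'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d+\s)'

samples = [
    'CON-2 Programmers incorporate iteration and selection into code.',
    'CON-2.P Apply recursive search algorithms.',
    'CON-2.P.1 Data must be in sorted order to use the binary search algorithm.',
]

identifiers = []
for text in samples:
    match = re.match(ID_PATTERN, text)
    # match[0] is the full matched string, including the trailing space
    identifiers.append(match[0].strip())

print(identifiers)

# re.match() returns None when the text does not start with an identifier
no_match = re.match(ID_PATTERN, 'ENDURING UNDERSTANDING')
print(no_match)
```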

All ESSENTIAL KNOWLEDGE standards for a learning objective sit in the same table "cell". As a result, only the first essential knowledge standard was being separated into an identifier and descriptor; all additional identifiers and descriptors were lumped into the first standard's description.
```python
    check_desc_stds = re.findall(r'\w{3}\-\d\.\w\.\d+\s', descriptor)
    #print(check_desc_stds)

    if len(check_desc_stds) >= 1:
        string_list = re.split(r'\w{3}\-\d\.\w\.\d+\s', descriptor)
        #print(string_list)
        descriptor = string_list[0]

        for i in range(len(check_desc_stds)):
            ident = check_desc_stds[i]
            desc = string_list[i+1]
            outcomes.append([ident, desc])
```
The code block above remedies this issue by:
* searching the descriptor for additional essential knowledge identifiers with `re.findall(r'\w{3}\-\d\.\w\.\d+\s', descriptor)` and storing all results in the list `check_desc_stds`.
* splitting the descriptor on the identifier(s) with `re.split(r'\w{3}\-\d\.\w\.\d+\s', descriptor)` if at least one identifier was found. This creates a list (`string_list`) of descriptors without identifiers: element 0 is the text before the first extra identifier, and element `i+1` is the text that follows identifier `i` in `check_desc_stds`.
* trimming the original descriptor to `string_list[0]`, since the first element of the split is the true first descriptor.
* looping through `check_desc_stds` and `string_list`, storing each identifier `ident` and its descriptor `desc` in a 2-element list that is appended to `outcomes`.
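
The alignment between the two lists is easier to see on a small input. This is a minimal sketch with a hand-written descriptor; the sample text and the names `EK_PATTERN`, `extra_ids`, `pieces`, and `pairs` are assumptions made for illustration.

```python
import re

EK_PATTERN = r'\w{3}\-\d\.\w\.\d+\s'

# Descriptor of the first standard with two more standards lumped in
descriptor = ('Data must be in sorted order. '
              'CON-2.P.2 The binary search algorithm starts at the middle. '
              'CON-2.P.3 Binary search can be more efficient.')

extra_ids = re.findall(EK_PATTERN, descriptor)
pieces = re.split(EK_PATTERN, descriptor)

# pieces[0] belongs to the first standard; pieces[i+1] follows extra_ids[i]
pairs = list(zip(extra_ids, pieces[1:]))

print(extra_ids)
print(pieces[0])
print(pairs)
```

Since the pattern has no capture group, `re.split()` drops the identifiers from the result, so `pieces` always has exactly one more element than `extra_ids`.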

In [17]:
def standards_separator(content):
    
    # Matches an ESSENTIAL KNOWLEDGE identifier plus its trailing space (Ex: 'CON-2.P.1 ')
    ek_pattern = r'\w{3}\-\d\.\w\.\d+\s'
    
    outcomes = []
    
    # Walk the tables from last to first
    for table in reversed(content):
        for row in table:
            for item in row:
                # Skip empty cells
                if item is None or item == "":
                    continue
                
                outcome = item.replace('\n', ' ')
                outcome = outcome.replace('\xa0', ' ')
                try:
                    # Drop any leading label text, keeping the identifier onward
                    outcome_expression = re.search(r'\w{3}-.*', outcome)[0]
                    identifier = re.match(r'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d+\s)', outcome_expression)
                    identifier = identifier[0].replace(' ', '')
                    descriptor = outcome_expression[len(identifier)+1:]
                    
                    # Split out any additional essential knowledge standards in the same cell
                    check_desc_stds = re.findall(ek_pattern, descriptor)
                    
                    if len(check_desc_stds) >= 1:
                        string_list = re.split(ek_pattern, descriptor)
                        descriptor = string_list[0]

                        for i in range(len(check_desc_stds)):
                            ident = check_desc_stds[i]
                            desc = string_list[i+1]
                            outcomes.append([ident, desc])

                    outcomes.append([identifier, descriptor])
                except TypeError:
                    # re.search() or re.match() returned None: no identifier in this cell
                    pass
        
    return outcomes

### Call the function
The `standards_separator()` function is called with `all_content` as an argument, and the result is stored in a variable called `ready_for_df`, since the resulting list of sub-lists is formatted properly for a `DataFrame`.

In [18]:
ready_for_df = standards_separator(all_content)
print(ready_for_df)

[['CON-2', 'Programmers incorporate iteration and selection into code as a way of providing  instructions for the computer to process each of the many possible input values.'], ['CON-2.P', 'Apply recursive search  algorithms to information  in String, 1D array, or  ArrayList objects.'], ['CON-2.P.2 ', 'The binary search algorithm starts at the  middle of a sorted array or ArrayList and  eliminates half of the array or ArrayList in  each iteration until the desired value is found or  all elements have been eliminated. '], ['CON-2.P.3 ', 'Binary search can be more efficient than  sequential/linear search. X    EXCLUSION STATEMENT—(EK CON-2.P.3):  Search algorithms other than sequential/linear  and binary search are outside the scope of the  course and AP Exam. '], ['CON-2.P.4 ', 'The binary search algorithm can be written  either iteratively or recursively.'], ['CON-2.P.1', 'Data must be in sorted order to use the binary  search algorithm.  '], ['CON-2.Q', ' Apply recursive algorithms  t

### List to DataFrame with Pandas
This code transforms the `ready_for_df` list into a DataFrame that can be saved as a .csv for later spreadsheet upload. The columns are renamed to "Identifier" and "Descriptor", leading and trailing whitespace is stripped, the table is sorted alphabetically by identifier, and rows with a duplicate identifier or a duplicate descriptor are dropped.

In [19]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(ready_for_df)
df = df.rename(columns={0:'Identifier', 1:'Descriptor'})
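# .str.strip() returns a new Series, so assign the result back to the column
df['Identifier'] = df['Identifier'].str.strip()
df['Descriptor'] = df['Descriptor'].str.strip()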
df.sort_values(by=['Identifier'], inplace=True)
no_dups_df = df.drop_duplicates(subset=['Identifier'], keep='first', inplace=False)
no_dups_df = no_dups_df.drop_duplicates(subset=['Descriptor'], keep='first', inplace=False)
# df.drop_duplicates(keep='first', inplace=True)
no_dups_df.reset_index(drop=True, inplace=True)
no_dups_df

# left_aligned_df = df.style.set_properties(**{'text-align': 'left'})
# left_aligned_df

| | Identifier | Descriptor |
|---|---|---|
| 0 | CON-1 | The way variables and operators are sequenced and combined in an expression determines the computed result. |
| 1 | CON-1.A | Evaluate arithmetic expressions in a program code. |
| 2 | CON-1.A.1 | A literal is the source code representation of a fixed value. |
| 3 | CON-1.A.2 | Arithmetic expressions include expressions of type int and double. |
| 4 | CON-1.A.3 | The arithmetic operators consist of +, −, *, /, and %. |
| 5 | CON-1.A.4 | An arithmetic operation that uses two int values will evaluate to an int value. |
| 6 | CON-1.A.5 | An arithmetic operation that uses a double value will evaluate to a double value. |
| 7 | CON-1.A.6 | Operators can be used to construct compound expressions. |
| 8 | CON-1.A.7 | During evaluation, operands are associated with operators according to operator precedence to determine how they are grouped. |
| 9 | CON-1.A.8 | An attempt to divide an integer by zero will result in an ArithmeticException to occur. |
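
The strip, sort, and de-duplicate steps can be tried on a tiny hand-made list. This is a minimal sketch; the rows are hypothetical values shaped like the output of `standards_separator()`, with a trailing space left on one identifier to show why stripping has to happen before duplicates are dropped.

```python
import pandas as pd

# Hypothetical rows shaped like the output of standards_separator()
rows = [
    ['CON-2.P.1 ', 'Data must be in sorted order. '],
    ['CON-2', 'Programmers incorporate iteration and selection into code.'],
    ['CON-2.P.1', 'Data must be in sorted order.'],
]

df = pd.DataFrame(rows, columns=['Identifier', 'Descriptor'])

# .str.strip() returns a new Series, so the result is assigned back
df['Identifier'] = df['Identifier'].str.strip()
df['Descriptor'] = df['Descriptor'].str.strip()

df = (df.sort_values(by='Identifier')
        .drop_duplicates(subset=['Identifier'], keep='first')
        .reset_index(drop=True))

print(df['Identifier'].tolist())
```

Without the assignment, `'CON-2.P.1 '` and `'CON-2.P.1'` would count as different identifiers and both rows would survive `drop_duplicates()`.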


### Save resulting dataframe as a .csv

In [13]:
no_dups_df.to_csv('ap_comp_sci_stds.csv', index=False)


---

#### The commented-out code below was used to develop the `standards_separator` function.

In [108]:
# import pdfplumber
# import re
# import pandas as pd

# with pdfplumber.open("ap-computer-science-a-course-and-exam-description.pdf") as pdf:
#     outcome_page = pdf.pages[45]
#     eu_is_present = outcome_page.search('ENDURING UNDERSTANDING', regex=False, case=True)
#     lo_is_present = outcome_page.search('LEARNING OBJECTIVE', regex=False, case=True)
    
#     if eu_is_present == []:
#         print('empty list')
#     else:
#         print(eu_is_present)
    
#     print('\n')
    
#     if lo_is_present == []:
#         print('empty list')
#     else:
#         print(lo_is_present)
    
#     print(eu_is_present)
#     print('\n')
#     print(lo_is_present)
#     page_36_tables = outcome_page.extract_tables()
    
#     print(page_36_tables)
#     print('\n')
    
#     page_36_content = []
#     for table in page_36_tables:
#         for column in table:
#             if 'ENDURING UNDERSTANDING' in column[0] or 'LEARNING OBJECTIVE' in column[0]:
#                 idx = page_36_tables.index(table)
#                 standard = page_36_tables.pop(idx)
#                 page_36_content.append(standard)
#                 break
#             else:
#                 pass
#     pdf.close()

# print(page_36_content)
    

In [80]:
# import pdfplumber
# import re

# #PAGE 36
# with pdfplumber.open("ap-computer-science-a-course-and-exam-description.pdf") as pdf:
#     outcome_page = pdf.pages[42]
    
#     page_36_tables = outcome_page.extract_tables()
#     page_36_words = outcome_page.extract_words()
    
#     print(page_36_tables)
#     print('\n')
    
#     words_on_page = ''
#     for words_dict in page_36_words:
#         words_on_page += words_dict['text'] + ' '
    
#     print(words_on_page)
#     print('\n')
#     print('TOPIC' in words_on_page)
#     print('\n')
#     print('continued on next page' in words_on_page)
#     print('\n')
#     print('ESSENTIAL KNOWLEDGE' in words_on_page)

In [75]:
# content_length = len(content)

# outcomes = []

# while content_length > 0:
    
#     for all_lists in content[content_length-1]:
#         for item in all_lists:
#             if item == None or item == "":
#                 pass
#             else:
#                 outcome = item.replace('\n', ' ')
#                 outcome_expression = re.search(r'\w{3}-.*', outcome)[0]
#                 identifier = re.match(r'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d\s)', outcome_expression)
#                 identifier = identifier[0].replace(' ', '')
#                 descriptor = outcome_expression[len(identifier)+1:]
#                 outcomes.append([identifier, descriptor])
#                 #print(outcomes)
#     content_length -= 1

# # # outcome = contents_36[1][0][0].replace('\n', ' ')
# # # expression = re.search(r'LIM-.*', outcome)
# print(outcomes)

In [76]:
# outcomes = []

# for all_text in page_36[1]:
#     for content in all_text:
#         if content == None or content == "":
#             pass
#         else:
#             outcome = content.replace('\n', ' ')
#             outcome_expression = re.search(r'\w{3}-.*', outcome)[0]
#             identifier = re.match(r'\w{3}\-(\d\s|\d\.\w\s|\d\.\w\.\d\s)', outcome_expression)
#             identifier = identifier[0].replace(' ', '')
#             descriptor = outcome_expression[len(identifier)+1:]
#             outcomes.append([identifier, descriptor])
#             #print(outcomes)

# # # outcome = contents_36[1][0][0].replace('\n', ' ')
# # # expression = re.search(r'LIM-.*', outcome)
# print(outcomes)