# Overview

`equation_scraper` is a Python package that scrapes Wikipedia pages for mathematical equations and then parses each equation into its components to build prior distributions. Specifically, these priors include information such as the number of times an operation or function appears across all scraped equations. For example, the expression `m*x+b*sin(y)` would be parsed into the prior `{*: 2, +: 1, sin: 1}`. The package records much more than this simple prior, for example conditional priors; a full breakdown of the included metrics is given in the `priors` section of our documentation. The package was designed to provide equation discovery modelling techniques, such as Symbolic Regression and the Bayesian Machine Scientist, with informed priors; however, its applications can extend well beyond this.
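To make the counting idea concrete, here is a minimal sketch that reproduces the example prior using Python's standard `ast` module. It is not the package's parser (which works on LaTeX scraped from Wikipedia); the helper name `count_operations` and the operator table are assumptions made for illustration only.

```python
import ast
from collections import Counter

# Map Python AST operator node types to the symbols used in the example prior.
# This table is an illustrative assumption, not the package's operator list.
OPERATOR_SYMBOLS = {
    ast.Add: "+",
    ast.Sub: "-",
    ast.Mult: "*",
    ast.Div: "/",
    ast.Pow: "**",
}


def count_operations(expression: str) -> Counter:
    """Count binary operators and named function calls in an expression."""
    counts = Counter()
    for node in ast.walk(ast.parse(expression, mode="eval")):
        if isinstance(node, ast.BinOp):
            symbol = OPERATOR_SYMBOLS.get(type(node.op))
            if symbol is not None:
                counts[symbol] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # A call such as sin(y) is counted under the function's name.
            counts[node.func.id] += 1
    return counts


# Counts "*" twice, "+" once, and "sin" once for the example expression.
print(dict(count_operations("m*x+b*sin(y)")))
```

Summing such counts over every scraped equation would give the simple frequency prior described above.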

# Functions

The `equation-scraper` has a main function, `scrape_and_parse_equations()`, that sequentially runs two sub-functions: 1) `scrape_equations()`, which scrapes equations from Wikipedia, and 2) `parse_equations()`, which parses the equations and builds priors. A further function, `plot_priors()`, plots the resulting priors.

All four of these functions (`scrape_and_parse_equations()`, `scrape_equations()`, `parse_equations()`, and `plot_priors()`) take the same input: a list of keywords to search for, for example `['Cognitive_psychology', 'Cognitive_science', 'Neuroscience']`, where the keywords are the Wikipedia topics you want scraped, parsed, and/or plotted. The individual functions `scrape_equations()`, `parse_equations()`, and `plot_priors()` must be run in that order because each produces files (e.g., `.txt`, `.pkl`, `.png`) that the following function uses. Each function reads only the files written by the previous function, so you do not need to keep any variables in memory between calls.

# Keyword tags

You can use keyword tags to change the behaviour of the scraping function. Without a tag (e.g., `Cognitive_psychology`), the scraper searches for equations in all of the links on the corresponding Wikipedia category page. With the `Super:` tag (e.g., `Super:Cognitive_psychology`), it searches three levels deep: the links on the category page, the links found on those pages, and finally the links found on those sublinks. Both modes (no tag and `Super:`) specifically search Wikipedia category [ADD LINK HERE] pages. Further, each keyword you provide must be the end of the URL of the category you want to scrape (capitalization matters), i.e., everything following Wikipedia's `Category:` tag: https://<wbr>en.wikipedia.org/wiki/Category:**Cognitive_psychology**.

The final tag, `Direct:` (for example, `Direct:https://en.wikipedia.org/wiki/Cognitive_psychology`), searches whichever URL you provide and does not assume a Wikipedia page. At this time, it only searches the first level of links, in the same way as when no tag is used. Direct links do not need to be Wikipedia pages, but the parsing was built around Wikipedia and direct links outside of Wikipedia have yet to be tested, so the functionality is more likely to break in these cases.

You can mix these tags in your keywords: `scrape_and_parse_equations(['Cognitive_psychology', 'Super:Cognitive_science', 'Direct:https://en.wikipedia.org/wiki/Neuroscience'])`
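
The tag rules above can be summarised in a small sketch that maps a keyword to a search mode and a starting URL. The function `resolve_keyword` and the mode labels are hypothetical names for illustration; they are not part of the package's API, and the package's internal handling may differ.

```python
from typing import Tuple

# Prefix for Wikipedia category pages, as described in the text above.
CATEGORY_PREFIX = "https://en.wikipedia.org/wiki/Category:"


def resolve_keyword(keyword: str) -> Tuple[str, str]:
    """Return (mode, start_url) for a keyword (hypothetical helper)."""
    if keyword.startswith("Direct:"):
        # Direct keywords carry a full URL that is used as-is.
        return "direct", keyword[len("Direct:"):]
    if keyword.startswith("Super:"):
        # Super keywords start at a category page and follow links more deeply.
        return "super", CATEGORY_PREFIX + keyword[len("Super:"):]
    # Untagged keywords start at a category page and follow its links only.
    return "category", CATEGORY_PREFIX + keyword


for kw in ["Cognitive_psychology",
           "Super:Cognitive_science",
           "Direct:https://en.wikipedia.org/wiki/Neuroscience"]:
    print(resolve_keyword(kw))
```

Because capitalization matters in the category URL, the sketch passes the keyword through unchanged rather than normalising it.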

# Application

## The Main Function

Let's begin with the function that scrapes Wikipedia, parses the equations, and derives priors all at once. First, we will install the package using pip and then use the `scrape_and_parse_equations()` function to scrape a category page using the `Super:` tag.


In [None]:
#Install the equation-scraper
!pip install equation-scraper

In [None]:
#Import the equation-scraper
from equation_scraper import scrape_and_parse_equations

scrape_and_parse_equations(['Super:Cognitive_psychology'])



All data produced will be stored in a `data` folder in your current working directory. If this folder does not exist, the package will create it for you. Each search is further organized into its own sub-folder named after your keywords, separated by underscores. For example, `['Super:Cognitive_psychology', 'Neuroscience']` would create the folder `SUPERCognitivePsychology_Neuroscience` within the `data` folder. Note that keywords using the `Direct:` tag are not included in this folder name.

If you repeat the exact same search, this folder is first deleted and then rebuilt with the new search. After running this function, you will find a series of files in this folder, as well as a `debug` folder containing more files. We will explain what each file does in the next section of the tutorial, when running each of the two sub-functions individually.
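
The folder naming described above could be sketched as follows. The rule is inferred from the single example in this tutorial (`SUPERCognitivePsychology_Neuroscience`), so treat the exact capitalisation logic as an assumption; `search_folder_name` is a hypothetical helper, not a function exported by the package.

```python
def search_folder_name(keywords):
    """Build a sub-folder name from a list of keywords (inferred rule)."""
    parts = []
    for keyword in keywords:
        if keyword.startswith("Direct:"):
            # Direct keywords are left out of the folder name.
            continue
        prefix = ""
        if keyword.startswith("Super:"):
            prefix = "SUPER"
            keyword = keyword[len("Super:"):]
        # Drop underscores within a keyword and capitalise each word.
        words = keyword.split("_")
        parts.append(prefix + "".join(w[:1].upper() + w[1:] for w in words))
    # Keywords are separated by underscores in the final name.
    return "_".join(parts)


print(search_folder_name(["Super:Cognitive_psychology", "Neuroscience"]))
```

A helper like this can be handy for locating the output folder of a search from a script, but check the name against the folder the package actually creates.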

# The Sub-Functions

## Scrape Equations

The `scrape_equations()` function searches Wikipedia and derives a list of equations to be parsed. 

In [None]:
#!pip install equation-scraper #Uncomment this to install the equation-scraper if you didn't do so above

from equation_scraper import scrape_equations

#Each keyword must be its own string in the list
scrape_equations(['Super:Cognitive_psychology', 'Neuroscience'])

This function produces a single file: `equations_*.txt`, where `*` corresponds to the keywords you searched for (i.e., the name of the search's sub-folder), for example `equations_SUPERCognitivePsychology_Neuroscience.txt` in the case of the example code above. The first line of this file lists metadata about the keywords used for the search (e.g., `#CATEGORIES: ['Super:Cognitive_psychology', 'Neuroscience']`). After this, you will see a repeating pattern for each Wikipedia page scraped that looks like this:

`#ROOT: Biological Motion Perception`

`#LINK: /wiki/Biological_motion_perception`

`{\displaystyle \nu _{\psi }(t)={\frac {R_{\psi }(t)-{\bar {R}}}{\bar {R}}}}`

`...`

where `#ROOT:` is the title of the corresponding Wikipedia page, `#LINK:` is the Wikipedia URL without the `https://en.wikipedia.org` prefix (so `/wiki/Biological_motion_perception` corresponds to `https://en.wikipedia.org/wiki/Biological_motion_perception`), and everything after this is a scraped equation (i.e., `{\displaystyle \nu _{\psi }(t)={\frac {R_{\psi }(t)-{\bar {R}}}{\bar {R}}}} ...` in the above example). The format of these equations has not been modified in any way up to this point, so they appear exactly as they were scraped from Wikipedia.
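
If you want to inspect this file yourself, the layout described above can be read with a few lines of Python. The sketch below assumes one equation per line and uses an inline sample instead of a real file; `read_equations_text` is a hypothetical helper and is not how the package itself loads the file.

```python
WIKI_PREFIX = "https://en.wikipedia.org"


def read_equations_text(text):
    """Group scraped equations by page, following the #ROOT/#LINK layout."""
    categories = None
    pages = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#CATEGORIES:"):
            categories = line[len("#CATEGORIES:"):].strip()
        elif line.startswith("#ROOT:"):
            # A new page begins; its link and equations follow.
            title = line[len("#ROOT:"):].strip()
            pages.append({"title": title, "url": None, "equations": []})
        elif line.startswith("#LINK:") and pages:
            # Restore the full URL by adding the Wikipedia prefix.
            pages[-1]["url"] = WIKI_PREFIX + line[len("#LINK:"):].strip()
        elif pages:
            # Assumption: every other non-empty line is one equation.
            pages[-1]["equations"].append(line)
    return categories, pages


sample = r"""#CATEGORIES: ['Super:Cognitive_psychology', 'Neuroscience']
#ROOT: Biological Motion Perception
#LINK: /wiki/Biological_motion_perception
{\displaystyle \nu _{\psi }(t)={\frac {R_{\psi }(t)-{\bar {R}}}{\bar {R}}}}
"""

categories, pages = read_equations_text(sample)
for page in pages:
    print(page["title"], page["url"], len(page["equations"]))
```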


## Parse Equations

The `parse_equations()` function then loads the text file written by the previous function and iterates through the equations, parsing their operators and functions and building priors. You must still pass the same keywords as you passed to the previous function, because the function uses them to determine which folder the data is saved in.

In [None]:
#!pip install equation-scraper #Uncomment this to install the equation-scraper if you didn't do so above

from equation_scraper import parse_equations

#Use the same keywords as for scrape_equations(), each as its own string
parse_equations(['Super:Cognitive_psychology', 'Neuroscience'])

The main file produced by this function is the `priors_*.pkl` file where `*` corresponds to the keywords you searched for (i.e., the name of the search's sub-folder), for example: `priors_SUPERCognitivePsychology_Neuroscience.pkl`. We will look into this file in the `Load Priors` section below. The other file that this function produces is the `parsed_equations_*.txt` file, which includes the parsing results per equation. 

Additionally, there are three files within the `debug` folder: `debug_parsed_*.txt`, `skipped_equations_*.txt`, and `wordsRemoved_equations_*.txt`. `debug_parsed_*.txt` presents the same information as `parsed_equations_*.txt` but with a different organization; `skipped_equations_*.txt` lists all equations that were discarded and not used for priors (these are most often discarded because they are not actual expressions, such as a variable declaration within the text); `wordsRemoved_equations_*.txt` is a list of words that were turned into variables. This occurs when an equation uses a word to represent a single variable. For example, `WEIGHT = HEIGHT * c` would be transformed to `y = x * c`, and the words `WEIGHT` and `HEIGHT` would be added to the list.

# Other Functions
## Plot Priors
The `plot_priors()` function produces a figure, `figure_*.png`, showing a bar plot of the frequencies of operators and functions.

In [None]:
#!pip install equation-scraper #Uncomment this to install the equation-scraper if you didn't do so above

from equation_scraper import plot_priors

#Use the same keywords as for the previous functions, each as its own string
plot_priors(['Super:Cognitive_psychology', 'Neuroscience'])

## Load Priors
The `load_prior()` function lets you load the pickle file containing the prior information through a point-and-click interface.

In [None]:
#!pip install equation-scraper #Uncomment this to install the equation-scraper if you didn't do so above

from equation_scraper import load_prior

es_priors = load_prior()
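
If a point-and-click dialog is inconvenient (for example, in a script or on a remote machine), a pickle file can in general be read with Python's standard `pickle` module. The sketch below assumes the priors file is an ordinary pickle of a dictionary; it writes a small stand-in file with made-up values so that the example runs on its own, and `load_priors_file` is a hypothetical helper, not part of the package.

```python
import pickle
import tempfile
from pathlib import Path


def load_priors_file(path):
    """Load a pickled priors dictionary from an explicit path."""
    # Only unpickle files you trust: pickle can execute code while loading.
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in data with invented values, used only so the sketch is runnable.
stand_in = {
    "metadata": {"number_of_equations": 1},
    "priors": {"operators": {"*": 2, "+": 1}},
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "priors_demo.pkl"
    with open(demo_path, "wb") as handle:
        pickle.dump(stand_in, handle)
    loaded = load_priors_file(demo_path)

print(sorted(loaded))
```

With a real search, you would instead point `load_priors_file` at the `priors_*.pkl` file inside the search's sub-folder.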

# Priors

The priors pickle file has two sub-fields, `metadata` and `priors`. The full structure looks like this:

`es_priors ` <br>
`│`  <br>
`└───metadata` <br>
`│   │   number_of_equations: ` insert <br>
`│   │   unparsed_equations:` insert <br>
`│   │   list_of_operators: ` insert <br>
`│   │   list_of_functions:` insert <br>
`│   │   list_of_constants:` insert <br>
`│   │   list_of_equations:` insert <br>
`│   `  <br>
`└───priors` <br>
&nbsp;&nbsp;`    │   max_depth:` insert <br>
&nbsp;&nbsp;`    │   depth: ` insert <br>
&nbsp;&nbsp;`    │   structures:` insert <br>
&nbsp;&nbsp;`    │   features:   ` insert <br>
&nbsp;&nbsp;`    │   functions:` insert <br>
&nbsp;&nbsp;`    │   operators: ` insert <br>
&nbsp;&nbsp;`    │   function_conditionals:` insert <br>
&nbsp;&nbsp;`    │   operator_conditionals:` insert <br>
&nbsp;&nbsp;`    │   operator_and_functions:` insert <br>



We can access the metadata field:

In [None]:
metadata = es_priors['metadata']

for key in metadata.keys():
    print(f"{key}: {metadata[key]}")

We can access the priors field:

In [None]:
priors = es_priors['priors']

for key in priors.keys():
    print(f"{key}: {priors[key]}")
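
Equation discovery methods often need probabilities rather than raw counts. As a rough sketch, and assuming a prior is stored as a dictionary that maps each operator or function to a count (as in the `{*: 2, +: 1, sin: 1}` example from the overview), the counts can be normalised like this. The helper `counts_to_probabilities` is hypothetical, and the real fields may be structured differently.

```python
def counts_to_probabilities(counts):
    """Turn a {name: count} dictionary into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        # Nothing was counted, so there is nothing to normalise.
        return {}
    return {name: count / total for name, count in counts.items()}


# Counts taken from the overview example, not from a real search.
example_counts = {"*": 2, "+": 1, "sin": 1}
print(counts_to_probabilities(example_counts))
```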

# Conclusion

In conclusion, ...