# Math Textbooks vs The World

### An NLP Investigation about Language Usage in Math Textbooks by Tommy Xu

### Getting the Data

This project relies on several packages:
- pandas was useful for setting up dataframes and plotting
- requests and bs4 gave me the web-scraping tools to pull text off of any website
- re (together with string) was used to ensure standard formatting for all my text
- collections.Counter did the counting of words and letters
- matplotlib.pyplot and wordcloud.WordCloud were primarily used for data visualization

In [1]:
import pandas as pd
import requests
import bs4
import string
import re
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

Using these tools, I developed two functions that read all the words and letters from a website given only its URL, relying on BeautifulSoup and regular expressions. words() produces a list of all words used on the page, and letters() produces one long string containing all letters on the page. They were designed this way because collections.Counter can turn either output into a dictionary of frequencies.

In [2]:
def words(url):
    '''
    Consumes a url of a webpage, produces a list of all the words used in the page
    '''
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, 'lxml')
    
    # Pre-format the text: lowercase, strip punctuation, then split on whitespace
    text = soup.get_text().lower().translate(str.maketrans("", "", string.punctuation))
    lowords = text.split()                          # split() with no argument also discards empty strings
    
    lowords_lst = []
    for word in lowords:
        if word.isdigit():                          ## For this analysis, digits were excluded
            continue
        elif word[0].isdigit():                     ## Due to BeautifulSoup() reading LaTeX, there were many cases where
            lowords_lst.append(word[1:])            ## a normal word was preceded by a single digit.
        else:
            lowords_lst.append(word)
    return lowords_lst
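
The cleaning steps inside words() can be tried without a network connection. The block below is a minimal offline sketch, not the script used for the analysis: `clean_words` is a hypothetical helper that takes already-extracted page text instead of a URL, and the sample sentence is made up.

```python
import string

def clean_words(text):
    """Hypothetical offline helper: same cleaning idea as words(), applied to a plain string."""
    # Lowercase, then delete every ASCII punctuation character
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))

    cleaned = []
    for word in stripped.split():      # split() with no argument drops empty strings
        if word.isdigit():             # pure numbers are excluded
            continue
        if word[0].isdigit():          # e.g. "2vectors" left over from LaTeX markup
            word = word[1:]            # drop only the leading digit, as words() does
        cleaned.append(word)
    return cleaned

sample = "Theorem 3: Let 2vectors span R, then...\nthe span is a plane."
print(clean_words(sample))
```

Working on a string rather than a URL makes it easy to check how a particular piece of text is tokenized before scraping a whole textbook.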

In [3]:
def letters(url):
      
    '''
    Takes in url from any website, produces 1 long string of all the letters used on that page.
    '''
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, 'lxml')
    raw_text = soup.get_text().replace("\n", '').replace(' ', '')
    
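    # \x41-\x5A is A-Z and \x61-\x7A is a-z; every other character (digits, punctuation, accents) is dropped
    clean_letter_upper = ''.join(filter(lambda x: re.match("[\x41-\x5A]", x), raw_text)).lower()
    clean_letter_lower = ''.join(filter(lambda x: re.match("[\x61-\x7A]", x), raw_text)).lower()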
    
    clean_letters = clean_letter_lower + clean_letter_upper
    
    return clean_letters
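
The letter filter can also be written as a single pass over the text. This is a sketch under the assumption that only the 26 unaccented English letters matter; `ascii_letters_only` is a hypothetical name and the sample string is invented. Unlike letters(), it keeps the letters in their original order, which makes no difference once the result is handed to Counter.

```python
import re

def ascii_letters_only(text):
    """Hypothetical helper: keep only a-z / A-Z from text, lowercased, in original order."""
    # re.findall returns every single-character match of the class as a list
    return "".join(re.findall(r"[A-Za-z]", text)).lower()

sample = "Let A = QR, so Ax = b."
print(ascii_letters_only(sample))
```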

Next, I apply collections.Counter() to these outputs, which creates a dictionary with key = word/letter and value = frequency. Lastly, I transform the dictionary into a sorted dataframe with the corresponding column headings.
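
The counting step can be sketched on its own with made-up input. The word list below is invented for illustration; the (key, frequency) pairs it produces have the same row layout as the dataframes in this section, and the pandas conversion is left out to keep the sketch standard-library only.

```python
from collections import Counter

# Invented sample data standing in for the outputs of words() and letters()
sample_words = ["the", "rank", "of", "the", "matrix", "equals",
                "the", "rank", "of", "its", "transpose"]
sample_letters = "".join(sample_words)

word_counter = Counter(sample_words)        # word -> frequency
letter_counter = Counter(sample_letters)    # iterating a string yields single characters

# most_common(n) returns (key, frequency) pairs sorted by descending frequency
top_words = word_counter.most_common(3)
print(top_words)

# Alphabetical (letter, frequency) rows, matching the layout of the letter table
letter_rows = sorted(letter_counter.items())
print(letter_rows[:3])
```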

For example, here are the results of words() and letters() for a simple website, https://books.toscrape.com/. All letters and words are converted to lowercase for ease of comparison.

In [4]:
## Reading all the words off the website
words_on_website = words("https://books.toscrape.com/")
word_counter = Counter(words_on_website)

## Transforming into a dataframe
word_dict = {}
for key in sorted(word_counter.keys()):
    word_dict[key] = word_counter[key]
df_words = pd.DataFrame(list(word_dict.items()), columns = ["Word", "Frequency"]).sort_values(by = 'Frequency', 
                                                                                              ascending = False)
# df_words.to_csv(name_of_file, index = False)


## First ten rows
df_words[:10]


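|     | Word    | Frequency |
| :-: | :-----: | :-------: |
| 124 | to      | 23        |
| 52  | in      | 22        |
| 2   | add     | 21        |
| 117 | stock   | 20        |
| 10  | basket  | 20        |
| 120 | the     | 10        |
| 36  | fiction | 6         |
| 0   | a       | 5         |
| 5   | and     | 5         |
| 17  | books   | 3         |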


In [5]:
## Reading all letters
letters_on_website = letters("https://books.toscrape.com/")
letter_counter = Counter(letters_on_website)

## Setting up as a dataframe
letter_dict = {}
for key in sorted(letter_counter.keys()):
    letter_dict[key] = letter_counter[key]
df_letters = pd.DataFrame(list(letter_dict.items()), columns = ["Letter", "Frequency"])
## df_letters.to_csv(name_of_file, index = False)

## First ten rows
df_letters[:10]

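|     | Letter | Frequency |
| :-: | :----: | :-------: |
| 0   | a      | 118       |
| 1   | b      | 39        |
| 2   | c      | 66        |
| 3   | d      | 68        |
| 4   | e      | 117       |
| 5   | f      | 16        |
| 6   | g      | 22        |
| 7   | h      | 39        |
| 8   | i      | 110       |
| 9   | j      | 1         |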


The last thing left to do was to apply these functions to **every page of a textbook**. I found two sources of online textbooks, PreText and OpenBC, each of which used an identical structure for all of its textbooks. There was structure in the webpages, structure in the naming conventions, and structure in the URL progressions from page to page. That meant that I could keep the HTML tags and for loops the same for every textbook from a given source.

Below is a sample of my algorithm for reading an entire PreText textbook, using Interactive Linear Algebra (UBC Edition) by Dan Margalit, Joseph Rabinoff, and Ben Williams at https://personal.math.ubc.ca/~tbjw/ila/index.html.

I noticed that every page in the textbook shares the same first portion of the URL, and only the ending of the URL differs from page to page. Here are some examples:

| Textbook Section        |                               URL                               |
| :---------------------: | :-------------------------------------------------------------: |
| Index                   | https://personal.math.ubc.ca/~tbjw/ila/index.html               |
| 1.1 Vectors             | https://personal.math.ubc.ca/~tbjw/ila/vectors.html             |
| 3.6 The Rank Theorem    | https://personal.math.ubc.ca/~tbjw/ila/rank-thm.html            |
| 6.5 Complex Eigenvalues | https://personal.math.ubc.ca/~tbjw/ila/complex-eigenvalues.html |

This means that **to find the URL of the next page, I only need to find the tail component of the URL**.

In [6]:
## Additional function for finding the common portion of the textbook's url pages.

def common_string(string1, string2):
    """
    Given two urls, provide the common url that was the base for the webpages. Must end on a / to prevent
    redundant half-strings at the end. E.g:
    
    start_url = 'https://faculty.uml.edu//klevasseur/ads/index-ads.html'
    end_url = 'https://faculty.uml.edu//klevasseur/ads/index-1.html'
    
    Want: 'https://faculty.uml.edu//klevasseur/ads/'
    NOT: 'https://faculty.uml.edu//klevasseur/ads/index-'    
    
    """
    
    common = ''
    for char1, char2 in zip(string1, string2):      # zip stops at the end of the shorter url
        if char1 != char2:
            break
        common += char1
    
    # Trim back to the last / so that no half-string is left at the end.
    # rfind returns -1 when there is no / at all, which makes the result ''
    # (the recursive version would never terminate in that case).
    return common[:common.rfind('/') + 1]
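
As an alternative to computing a common prefix by hand, the standard library's `urllib.parse.urljoin` can resolve a page's tail directly against the URL of the page it was found on. The sketch below assumes the tails are relative links such as `vectors.html`, which is what the URL table above suggests; it was not part of the original scripts.

```python
from urllib.parse import urljoin

page_url = "https://personal.math.ubc.ca/~tbjw/ila/index.html"

# A relative tail replaces the last path segment of the page it was found on
vectors_url = urljoin(page_url, "vectors.html")
print(vectors_url)

# An absolute link is returned unchanged, so both kinds of href are handled
rank_url = urljoin(page_url, "https://personal.math.ubc.ca/~tbjw/ila/rank-thm.html")
print(rank_url)

# Resolving "." gives the directory of the page, ending in "/"
base_url = urljoin(page_url, ".")
print(base_url)
```

With this approach only the current page's URL is needed, so the last page of the textbook does not have to be known in order to build the prefix.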


In [7]:
start_url = "https://personal.math.ubc.ca/~tbjw/ila/index.html"        # first page of the online textbook
end_url = "https://personal.math.ubc.ca/~tbjw/ila/colophon-2.html"     # very last page of the online textbook

## Finding all individual urls of every page in the online textbook 
lo_urls = [start_url]
cons_url_prefix = common_string(start_url, end_url)     #in this case, this is "https://personal.math.ubc.ca/~tbjw/ila/"
url = start_url

while end_url not in lo_urls:                                               #loops until end_url (last page) is in the list
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, 'lxml')
    next_url = soup.select(".next-button.button.toolbar-item")[0]['href']   #there is a URL stored in the "next page" button
    url = cons_url_prefix + next_url
    lo_urls.append(url)
    

print(f"There are {len(lo_urls)} total pages in this Linear Algebra Textbook.\n")
print("The first five URLs for the first five pages are:\n")
for url in lo_urls[:5]:
    print(url)


There are 51 total pages in this Linear Algebra Textbook.

The first five URLs for the first five pages are:

https://personal.math.ubc.ca/~tbjw/ila/index.html
https://personal.math.ubc.ca/~tbjw/ila/colophon-1.html
https://personal.math.ubc.ca/~tbjw/ila/preface-1.html
https://personal.math.ubc.ca/~tbjw/ila/preface-2.html
https://personal.math.ubc.ca/~tbjw/ila/overview.html
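
The page-following loop above runs forever if the last page is never reached, and raises an error if a page has no "next" button. Below is a hedged sketch of the same idea with those cases guarded. Everything in it is illustrative: `collect_urls`, `FAKE_NEXT_HREF`, and the example.org URLs are hypothetical, and the dictionary lookup stands in for the requests and BeautifulSoup call that reads the button's href.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for a site: page URL -> href of its "next" button (None on the last page)
BASE = "https://example.org/book/"
FAKE_NEXT_HREF = {
    BASE + "index.html": "preface.html",
    BASE + "preface.html": "vectors.html",
    BASE + "vectors.html": "colophon.html",
    BASE + "colophon.html": None,
}

def collect_urls(start_url, end_url, get_next_href, max_pages=1000):
    """Follow next links from start_url until end_url, a dead end, a repeat, or max_pages."""
    urls = [start_url]
    seen = {start_url}
    url = start_url
    while url != end_url and len(urls) < max_pages:
        href = get_next_href(url)
        if href is None:          # no next button: stop instead of raising
            break
        url = urljoin(url, href)  # resolve the tail against the current page
        if url in seen:           # a cycle would otherwise loop forever
            break
        seen.add(url)
        urls.append(url)
    return urls

pages = collect_urls(BASE + "index.html", BASE + "colophon.html", FAKE_NEXT_HREF.get)
print(len(pages))
```

Passing the lookup in as a function keeps the traversal logic testable offline; in a real run it would be replaced by a function that downloads the page and selects the button.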


In [8]:
## Uses similar algorithm as above to read all letters from EVERY url in the textbook

all_letters = ''
for url in lo_urls:
    all_letters += letters(url)
letter_counter = Counter(all_letters)

## Transforming into a dataframe
letter_dict = {}
for key in sorted(letter_counter.keys()):
    letter_dict[key] = letter_counter[key]
df_letters = pd.DataFrame(list(letter_dict.items()), columns = ["Letter", "Frequency"])

## first ten letters frequencies
df_letters[:10]

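|     | Letter | Frequency |
| :-: | :----: | :-------: |
| 0   | a      | 30729     |
| 1   | b      | 6028      |
| 2   | c      | 13138     |
| 3   | d      | 9654      |
| 4   | e      | 41817     |
| 5   | f      | 8387      |
| 6   | g      | 5093      |
| 7   | h      | 12575     |
| 8   | i      | 29647     |
| 9   | j      | 726       |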
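
For long textbooks, the counts can also be accumulated page by page instead of first concatenating one very large string. This is a minimal sketch with invented page strings standing in for the outputs of letters(url); the relative-frequency step is an assumption about what is convenient when comparing texts of different lengths, not something taken from the original scripts.

```python
from collections import Counter

# Invented stand-ins for the strings letters(url) would return for each page
page_letters = ["letvbeavector", "thespanofv", "rankplusnullity"]

letter_counter = Counter()
for letters_on_page in page_letters:
    letter_counter.update(letters_on_page)    # adds this page's counts in place

# Convert raw counts to shares of the total so texts of different sizes are comparable
total = sum(letter_counter.values())
relative = {letter: count / total for letter, count in sorted(letter_counter.items())}
print(total)
print(relative["e"])
```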


You can see more about this process, including the exact Python scripts for PreText textbooks and OpenBC textbooks, in the git repository under "*src -> scripts*", or at this link: https://github.com/tommysteryy/LetterAnalyzer/tree/test_branch/src/scripts