# Spotlight on beautifulSoup

--Ruihong Wang

Beautiful soup is a tool for web crawling, which will return you a well parsed document tree so that you can easily find what you want. This spotlight will tell you how to use it, most of which can be find in https://www.crummy.com/software/BeautifulSoup/bs4/doc/. Finally, I will do the PageRank on the page graph crawled by TAMU portal.

First, you should fetch the source html code for the aiming url. Here, we set TAMU potal as an example. Usually, we build the connection through the requests package.

In [2]:
import requests

url = 'https://www.tamu.edu/'

r = requests.get(url)
html_doc = r.text


Here is a part of the source html.

In [3]:
print(html_doc[0:3000])

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
	<head>
	    <link href="https://www.tamu.edu/index.html" rel="canonical"/>
	    <!-- Google Tag Manager -->
<script>// <![CDATA[
(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&amp;l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-53LFZF');
// ]]></script>
<!-- End Google Tag Manager --> 
		<meta charset="utf-8"/>
		<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
		<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
		<title>
            Texas A&amp;M University, College Station, TX
        </title>
		<meta content="texas a&amp;m university, texas, a&amp;m, texas a&amp;m, aggies, texas aggies, university, universities, college, colleges, higher education, college sta

Then we parse and build the document tree for this page by our beautiful soup.From now on, soup represents our document tree.

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

'lxml' here can be replaced by "html.parser", "lxml-xml", "html5lib', which represent python html or xml parsers.

Here, I will show you why the soup is called a document tree and why it is useful for crawling. Generally speaking, beautiful soup have distinguidhed all the html blocks and figure out the relationship between each blocks. you can fetch on particular block according to the html structure like a tree. 
## Tag 
Tag is an important attribute for the html block. 'a' is the tage for &lt; a href="contact.html" title="Contact" &gt;. The beautiful soup enable us to fetch the blocks by their tags.

In [5]:
soup.find('a')

<a class="skipnav js-skipnav sr-only" href="#contentarea">Skip Navigation</a>

In [6]:
soup.find('p')

<p>The event will feature discussions on the new space age, research demonstrations and exhibits.</p>

I think you have noticed that it only return the first block with a tag. To fetch all the blocks from all depth of the tree, you need do as below. By the way, the return of the command is a list. 'a' and 'p' are two important tags. 'a' contains hyperlink, and 'p' contains the paragraph to be fetch.

In [7]:
soup.find_all('a')[0:5]

[<a class="skipnav js-skipnav sr-only" href="#contentarea">Skip Navigation</a>,
 <a href="/index.html">Texas A&amp;M University</a>,
 <a aria-controls="search" aria-expanded="false" aria-label="View search form" class="icon-search large-icon icon-home" href="#search" tabindex="0">
 <span class="home-search">
 <em>Search</em>
 </span>
 </a>,
 <a href="/index.html">
 <img alt="Texas A&amp;M University Logo" src="/assets/images/TAM-Logo-white.png"/>
 </a>,
 <a aria-controls="search" aria-expanded="false" aria-label="View search form" class="icon-search large-icon mobile-search show-for-small-only" tabindex="0">
 <strong>
 <span class="sr-only">Search</span>
 </strong>
 </a>]

In [8]:
soup.find_all('p')[0:5]

[<p>The event will feature discussions on the new space age, research demonstrations and exhibits.</p>,
 <p>The post <a href="https://today.tamu.edu/2020/02/20/texas-am-will-launch-space-lab-at-next-months-sxsw-interactive-conference/" rel="nofollow">Texas A&amp;M Will Launch ‘Space Lab’ At Next Month’s SXSW Interactive Conference</a> appeared first on <a href="https://today.tamu.edu" rel="nofollow">Texas A&amp;M Today</a>.</p>,
 <p>The top five teams from around the world will participate in the final pitch competition at Texas A&amp;M March 31-April 2. </p>,
 <p>The post <a href="https://today.tamu.edu/2020/02/19/students-take-on-global-challenges-at-third-invent-for-the-planet/" rel="nofollow">Students Take On Global Challenges At Third Invent For The Planet</a> appeared first on <a href="https://today.tamu.edu" rel="nofollow">Texas A&amp;M Today</a>.</p>,
 <p>The capital campaign supports Bush School of Government and Public Service scholarships, programming and faculty support, pl

## Attribute
I think you must notice that each block have many attributes,such as class href, title, id and content. 'href' and 'content are two important attributes for crawling. href contain the url link for the pages you need to go to in the next step. 'content' may contain the nested blocks which waiting for your mining. 


In [9]:
soup.a.attrs

{'class': ['skipnav', 'js-skipnav', 'sr-only'], 'href': '#contentarea'}

To fetch specific attributes, you can use 'get'.

In [10]:
soup.find('a').get('href')

'#contentarea'

Oops, it seems the href return us some strange link! Don't worry this means the url is relative url. 
To fix that:

In [11]:
from urllib.parse import urljoin  
url_togo = urljoin(url, soup.find('a').get('href'))
print(url_togo)

https://www.tamu.edu/#contentarea


## make the page graph by beautiful soup


The way to build the page graph is pretty simple, you can first set some depth threshold for your crawling, iteratively go into the link on the source page (like the depth first searching). on each page, store all the hyperlink to a dictionary, whose key represent the source and the value represent the out link pages list. After removing the duplicates from every out link pages list, the dictionary can represernt the webpages graph.  

### Set the configuration for the pagerank algorithm

In [12]:
from urllib.parse import urljoin   
import re
import requests
from bs4 import BeautifulSoup
url = 'https://www.tamu.edu/' #set the start website as tamu portal
graph = {} #the place to store the relation of those webpages
depth = 1 #set link depth


### Algorithm
Notice: sometimes the connection may failed due to the verification or the anti-crawling plugin. we just abort those link. Due to the large time consumption we set the depth as 2.

In [13]:

def crawling(url,depth,graph):
    try:
        r = requests.get(url)
    except:
        print('An error occurred.')
    else:
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'lxml')# obtain the parsed doc object
        if url in graph.keys():
            for link in soup.find_all('a'):
                if re.findall("^http", str(link.get('href'))): #check whether the link is relative address or absolutely address
                    url_togo = str(link.get('href'))
                    
                else:
                    url_togo = urljoin(url, str(link.get('href')))
                graph[url].append(url_togo)
                if depth > 0:
                    crawling(url_togo,depth-1,graph)
        else:
            for link in soup.find_all('a'):
                if re.findall("^http", str(link.get('href'))):
                    url_togo = str(link.get('href'))
                else:
                    url_togo = urljoin(url, str(link.get('href')))
                graph[url] = []
                graph[url].append(url_togo)
                if depth > 0:
                    crawling(url_togo,depth-1,graph)

In [14]:
crawling(url,depth,graph)

An error occurred.


The graph will contain all the link crawled by beautifulsoup. The key is the source, the value is the list of all the targets in that page.

In [15]:
graph.keys()

dict_keys(['https://www.tamu.edu/', 'https://www.tamu.edu/#contentarea', 'https://www.tamu.edu/index.html', 'https://www.tamu.edu/#search', 'https://www.tamu.edu/None', 'https://www.tamu.edu/about/index.html', 'https://www.tamu.edu/admissions/index.html', 'https://www.tamu.edu/academics/index.html', 'https://www.tamu.edu/athletics/index.html', 'https://www.tamu.edu/research/index.html', 'https://www.tamu.edu/student-life/index.html', 'https://www.tamu.edu/future-students/index.html', 'https://www.tamu.edu/current-students/index.html', 'https://www.tamu.edu/faculty-staff/index.html', 'https://www.tamu.edu/parents/index.html', 'https://www.tamu.edu/visitors/index.html', 'https://www.tamu.edu/former-students/index.html', 'http://leadbyexample.tamu.edu/', 'http://leadbyexample.tamu.edu', 'http://majors.tamu.edu/', 'http://ogaps.tamu.edu/Prospective-Students/Programs-and-Degrees', 'http://tuition.tamu.edu/', 'https://financialaid.tamu.edu/', 'http://corps.tamu.edu/', 'https://www.tamu.edu/v

### PageRank algorithm
This is the same algorithm to the HW2.

In [17]:
import pandas as pd
column_names = ["Oid", "Did"]
df = pd.DataFrame(columns = column_names)
for key in graph.keys():
    for des in graph[key]:
        df = df.append({'Oid' : key , 'Did' : des} , ignore_index=True)
        


In [18]:
import numpy as np

def fill_the_matrix(Ori,Des,Trans,linknum,allnode):
    Trans[allnode[allnode == Ori].index[0],allnode[allnode == Des].index[0]] = float(1/linknum.loc[Ori])

def pagerank(df):
    allnode = pd.concat([df.Oid,df.Did],axis = 0).drop_duplicates()
    Trans = np.zeros((allnode.count(),allnode.count()))
    nodenum = allnode.shape[0]
    link_unique = df[['Oid','Did']].drop_duplicates()
    linknum = link_unique.groupby('Oid',as_index = True).size().to_frame('linknum')
    allnode.reset_index(drop=True, inplace = True)
    link_unique.apply(lambda x: fill_the_matrix(x['Oid'],x['Did'],Trans,linknum,allnode), axis = 1)
    check = Trans.sum(axis = 1)
    check = np.where(check == 0)
    for index in check:
        Trans[index] = np.full(nodenum,float(1/nodenum))
    teleport = np.full((allnode.count(),allnode.count()),float(1/1003))
    new_Trans = np.add(0.9*Trans,0.1*teleport)
    node_pro = np.full((nodenum), float(1/nodenum))
    iterchange = np.abs(node_pro).mean()
#     print(type(iterchange))
    threshold = 1/((1000)*nodenum)
#     print(type(threshold))
#     print(type(iterchange > threshold))
#     if iterchange > threshold:
#         print('Nothing happened with if')
    while(iterchange > threshold):
        node_pro_new = np.dot(node_pro,Trans)
        iterchange = np.abs(node_pro_new - node_pro).mean()
        node_pro = node_pro_new
#         print(iterchange > threshold)
#         print(iterchange)
    indexsorted = np.argsort(-node_pro)

    for i in indexsorted[0:10]:
        print("%s - %s" %(allnode.iloc[i],node_pro[i]))
  

## Outcomes

In [19]:
pagerank(df) 

https://www.facebook.com/tamu - 0.429296597574961
http://www.texas.gov/ - 0.4228073017222359
http://admissions.tamu.edu/webmaster - 0.0007172457430466118
mailto:sbs@tamu.edu - 0.0006683872003370104
https://www.addthis.com/tellfriend_v2.php?v=300&winname=addthis&pub=ra-57963d7cf091a805&source=ctbx-300&lng=en&s=email&wid=5ubj&url=http%3A%2F%2Ftx.ag%2Flead&title=Transforming%20lives%20%26%20charting%20impact%20through%20education%2C%20discovery%20and%20innovation.%20%23TAMUleads%20%23LeadByExample&ate=AT-ra-57963d7cf091a805/-/-/59658c12f7344574/7&uid=59658c121654a9c9&description=From%20the%20lab%20to%20the%20field.%20From%20College%20Station%20to%20developing%20countries.%20From%20me%20to%20you.%20Letâs%20stand%20together%20and%20transform%20lives.%20%23TAMUleads%20%23LeadByExample&uud=1&ct=0&ui_email_to=&ui_email_from=&ui_email_note=&tt=0&captcha_provider=recaptcha2&pro=1&ats=imp_url%3D0%26smd%3Drsi%253D%2526gen%253D0%2526rsc%253D%2526dr%253D%2526sta%253DAT-ra-57963d7cf091a805%25252F-%