# Web Scraping with Beautiful Soup and Pandas



## I. Making a Database From Scratch With Beautiful Soup

There are a number of different packages available for web scraping, and one of the most popular is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Beautiful Soup parses web content into a Python object and makes the [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) queryable element by element. Used together with the `requests` package, it makes web scraping straightforward.

---
### Installation of Beautiful Soup (if you haven't done so)
In the `bash` terminal or `Anaconda Prompt`, run:
```bash
conda install beautifulsoup4
```
---

In [5]:
# Standard imports
import pandas as pd

# For web scraping
import requests
from bs4 import BeautifulSoup
import re


### Scrape The Data


In [7]:
# Save the URL of the webpage we want to scrape to a variable
url = 'https://docs.python.org/3/library/random.html#module-random'

When web scraping, the first step is to pull the content of the page down into a Python variable. For simpler web scraping tasks you can do this with the `requests` package, which is what we'll use here. For more complex tasks (e.g., pages whose content is built by JavaScript running in the web browser) you may need a tool that drives a real browser, like [Selenium](https://selenium-python.readthedocs.io/index.html).

In [9]:
# Send a GET request and assign the response to a variable
response = requests.get(url)

Let's take a look at what we have!

In [11]:
response

<Response [200]>
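
A `<Response [200]>` means the server answered successfully, but a request can also time out, hit a bad URL, or come back with an error status. Here is a minimal sketch of a more defensive fetch. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    `fetch_html` is a hypothetical helper written for this example.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic is sent
print(fetch_html("not-a-url"))
```

Calling `fetch_html(url)` with the documentation URL should return the page as a string when the request succeeds.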

In [12]:
response.content



That's a lot to look at! It's also pretty unreadable. This is where Beautiful Soup comes in: it parses the page content into a form that we can use more easily.

In [14]:
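# Turn the raw (undecoded) content into a Beautiful Soup object and assign it to a variable
# Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(response.content, 'html.parser')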
type(soup)

bs4.BeautifulSoup
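
To get a feel for what a soup object offers before facing the full page, it can help to parse a tiny snippet first. The sketch below uses a hand-written HTML string that only imitates the structure of the documentation page; the names `sample_html` and `sample_soup` are made up for this example:

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the docs markup (an assumption for illustration)
sample_html = """
<html><body>
<dl>
  <dt id="random.seed">random.seed(a=None, version=2)</dt>
  <dd><p>Initialize the random number generator.</p></dd>
</dl>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

first_dt = sample_soup.find("dt")            # first matching element, or None
print(first_dt.name)                         # the tag name
print(first_dt["id"])                        # attributes are read like dict keys
print(sample_soup.dd.get_text(strip=True))   # text without the surrounding tags
```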

**Now let's take a look at this.**

In [16]:
# Check soup variable

soup

<!DOCTYPE html>
<html data-content_root="../" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/><meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="random — Generate pseudo-random numbers" property="og:title"/>
<meta content="website" property="og:type"/>
<meta content="https://docs.python.org/3/library/random.html" property="og:url"/>
<meta content="Python documentation" property="og:site_name"/>
<meta content="Source code: Lib/random.py This module implements pseudo-random number generators for various distributions. For integers, there is uniform selection from a range. For sequences, there is uniform s..." property="og:description"/>
<meta content="https://docs.python.org/3/_static/og-image.png" property="og:image"/>
<meta content="Python documentation" property="og:image:alt"/>
<meta content="Source code: Lib/random.py This module implements pseudo-random number generators for various d

**So it looks like we want the `dt` elements with `id="random.___"`. We can retrieve these with Beautiful Soup's `.find_all` method (`.findAll` is its older alias).**

In [18]:
# Find all function names - we specify the name of the element, in this case 'dt'

names = soup.body.find_all('dt')

print(names)

[<dt class="sig sig-object py" id="random.seed">
<span class="sig-prename descclassname"><span class="pre">random.</span></span><span class="sig-name descname"><span class="pre">seed</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">a</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">version</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">2</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#random.seed" title="Link to this definition">¶</a></dt>, <dt class="sig sig-object py" id="random.getstate">
<span class="sig-prename descclassname"><span class="pre">random.</span></span><span class="sig-name descname"><span class="pre">getstate</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a c
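
`find_all` can also filter on attributes, which narrows the result to the elements of interest. The sketch below runs on a made-up snippet rather than the real page, and keeps only the `dt` elements whose `id` starts with `random.`:

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page is much larger
sample_html = """
<dl>
  <dt id="random.seed">random.seed()</dt>
  <dt id="random.Random.seed">Random.seed()</dt>
  <dt>no id here</dt>
  <dt id="os.urandom">os.urandom()</dt>
</dl>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A compiled regex passed as an attribute filter is matched against the attribute value
matching = sample_soup.find_all("dt", id=re.compile(r"^random\."))

# Each result is a Tag, so the id can be read directly
ids = [dt["id"] for dt in matching]
print(ids)  # ['random.seed', 'random.Random.seed']
```

Reading the `id` attribute this way keeps full dotted names such as `random.Random.seed` intact.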

**There is still some work to do! This is where regex comes in.**


In [20]:
# Find all the information we're looking for with regex
# In this case, that is every string that starts with id="random.

function_names = re.findall(r'id="random\.\w+', str(names))  # '\w+' matches the word characters that follow 'random.'

# Let's print the results
print(function_names)

['id="random.seed', 'id="random.getstate', 'id="random.setstate', 'id="random.randbytes', 'id="random.randrange', 'id="random.randint', 'id="random.getrandbits', 'id="random.choice', 'id="random.choices', 'id="random.shuffle', 'id="random.sample', 'id="random.binomialvariate', 'id="random.random', 'id="random.uniform', 'id="random.triangular', 'id="random.betavariate', 'id="random.expovariate', 'id="random.gammavariate', 'id="random.gauss', 'id="random.lognormvariate', 'id="random.normalvariate', 'id="random.vonmisesvariate', 'id="random.paretovariate', 'id="random.weibullvariate', 'id="random.Random', 'id="random.Random', 'id="random.Random', 'id="random.Random', 'id="random.Random', 'id="random.Random', 'id="random.SystemRandom']


**Remove the first four characters (`id="`) from each string.**

In [22]:
# Use a list comprehension to drop the leading 'id="' from each value:

function_names = [item[4:] for item in function_names]

# Let's print the results
print(function_names)

['random.seed', 'random.getstate', 'random.setstate', 'random.randbytes', 'random.randrange', 'random.randint', 'random.getrandbits', 'random.choice', 'random.choices', 'random.shuffle', 'random.sample', 'random.binomialvariate', 'random.random', 'random.uniform', 'random.triangular', 'random.betavariate', 'random.expovariate', 'random.gammavariate', 'random.gauss', 'random.lognormvariate', 'random.normalvariate', 'random.vonmisesvariate', 'random.paretovariate', 'random.weibullvariate', 'random.Random', 'random.Random', 'random.Random', 'random.Random', 'random.Random', 'random.Random', 'random.SystemRandom']
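
The match-then-slice steps above can be folded into one call with a regex capture group. This is a small sketch on a made-up string, not on the real page:

```python
import re

# Made-up sample that imitates str(names)
sample = '<dt id="random.seed"></dt>, <dt id="random.Random.seed"></dt>'

# Parentheses mark the part of the match that findall should return
short_names = re.findall(r'id="(random\.\w+)', sample)
print(short_names)  # ['random.seed', 'random.Random']

# Allowing dots inside the name keeps method ids intact
full_names = re.findall(r'id="(random\.[\w.]+)"', sample)
print(full_names)   # ['random.seed', 'random.Random.seed']
```

Note that `\w` does not match a dot. That is the likely reason `random.Random` shows up several times in the output above: an id such as `random.Random.seed` would be cut off at the second dot.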


In [23]:
# Find all the function descriptions

description = soup.body.find_all('dd')

print(description)

[<dd><p>Initialize the random number generator.</p>
<p>If <em>a</em> is omitted or <code class="docutils literal notranslate"><span class="pre">None</span></code>, the current system time is used.  If
randomness sources are provided by the operating system, they are used
instead of the system time (see the <a class="reference internal" href="os.html#os.urandom" title="os.urandom"><code class="xref py py-func docutils literal notranslate"><span class="pre">os.urandom()</span></code></a> function for details
on availability).</p>
<p>If <em>a</em> is an int, it is used directly.</p>
<p>With version 2 (the default), a <a class="reference internal" href="stdtypes.html#str" title="str"><code class="xref py py-class docutils literal notranslate"><span class="pre">str</span></code></a>, <a class="reference internal" href="stdtypes.html#bytes" title="bytes"><code class="xref py py-class docutils literal notranslate"><span class="pre">bytes</span></code></a>, or <a class="reference internal" hre

In [24]:
# Create an empty list to hold the cleaned descriptions

function_usage = []

# Loop over every description element

for item in description:
    item = item.text      # Keep only the text, without the HTML tags
    item = item.replace('\n', ' ')     # Replace the newline character `\n` with a space
    function_usage.append(item)

#print(function_usage)  # Don't get overwhelmed! These are just the descriptions of the functions named above
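
Replacing `\n` with a space can leave runs of several spaces behind. One common alternative, sketched here on a made-up `dd` element, is to split the text on any whitespace and join it back with single spaces:

```python
from bs4 import BeautifulSoup

# Made-up description with a line break and a double space inside
sample_html = (
    "<dd><p>Initialize the random\nnumber generator.</p>\n"
    "<p>If <em>a</em>  is an int, it is used directly.</p></dd>"
)
sample_dd = BeautifulSoup(sample_html, "html.parser").dd

raw_text = sample_dd.get_text()

# str.split() with no arguments splits on any run of whitespace
clean_text = " ".join(raw_text.split())
print(clean_text)
```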

In [25]:
# Let's check the length of the function_names and function_usage

print(f' Length of function_names: {len(function_names)}')
print(f' Length of function_usage: {len(function_usage)}')

 Length of function_names: 31
 Length of function_usage: 31
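
Equal lengths are a useful sanity check, but they do not prove that each name sits next to its own description: `find_all('dd')` also returns any `dd` nested inside another description. A sketch of a stricter pairing, assuming each `dt` is followed by its `dd` as a sibling (the sample markup and the `random.example` entry are made up):

```python
from bs4 import BeautifulSoup

sample_html = """
<dl>
  <dt id="random.seed">random.seed()</dt>
  <dd><p>Initialize the random number generator.</p></dd>
  <dt id="random.example">random.example()</dt>
  <dd><p>Made-up entry with a nested list.</p>
    <dl>
      <dt>nested term</dt>
      <dd>nested description</dd>
    </dl>
  </dd>
</dl>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for dt in sample_soup.find_all("dt", id=True):   # only entries that carry an id
    dd = dt.find_next_sibling("dd")              # the description that follows this dt
    if dd is not None:
        pairs.append((dt["id"], " ".join(dd.get_text().split())))

print(len(sample_soup.find_all("dd")))  # counts the nested dd as well
print(len(pairs))                       # one pair per named entry
```

Building the name and description together in one loop means the two columns cannot drift out of step.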


### Make A Database

In [27]:
# Create a DataFrame, since both lists have the same length

data = pd.DataFrame( {  'function name': function_names, 
                      'function usage' : function_usage  } )

data

| | function name | function usage |
|---|---|---|
| 0 | random.seed | Initialize the random number generator. If a i... |
| 1 | random.getstate | Return an object capturing the current interna... |
| 2 | random.setstate | state should have been obtained from a previou... |
| 3 | random.randbytes | Generate n random bytes. This method should no... |
| 4 | random.randrange | Return a randomly selected element from range(... |
| 5 | random.randint | Return a random integer N such that a <= N <= ... |
| 6 | random.getrandbits | Returns a non-negative Python integer with k r... |
| 7 | random.choice | Return a random element from the non-empty seq... |
| 8 | random.choices | Return a k sized list of elements chosen from ... |
| 9 | random.shuffle | Shuffle the sequence x in place. To shuffle an... |

In [28]:
# Let's make a CSV file from the DataFrame

data.to_csv('random_function.csv')
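
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference with `index=False`. It uses an in-memory buffer and two sample rows instead of the real file, and the second row is made up:

```python
import io
import pandas as pd

sample = pd.DataFrame(
    {
        "function name": ["random.seed", "random.example"],
        "function usage": ["Initialize the random number generator.", "Made-up description."],
    }
)

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False leaves the index out, so only the two data columns are stored
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
roundtrip = pd.read_csv(without_index)
print(roundtrip.columns.tolist())
```

With a real file the same idea would read `data.to_csv('random_function.csv', index=False)`.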