In [None]:
!pip install bs_ds
!pip install fake_useragent
!pip install lxml

## Using Python's `requests` module:



- Use the `requests` library to initiate a connection to a website.
- Check the returned status code to determine whether the connection was successful (status code 200).

```python
import requests
url = 'https://en.wikipedia.org/wiki/Stock_market'

# Connect to the url using requests.get
response = requests.get(url)
response.status_code
```

 ___
| Status Code | Code Meaning |
| ----------- | ------------ |
| 1xx | Informational |
| 2xx | Success |
| 3xx | Redirection |
| 4xx | Client Error |
| 5xx | Server Error |

___
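As a rough illustration of the table above, the first digit of a status code identifies its class. The helper name `status_class` is made up for this example and is not part of `requests`.

```python
# Hypothetical helper (not part of requests): map a status code to its class
STATUS_CLASSES = {
    1: 'Informational',
    2: 'Success',
    3: 'Redirection',
    4: 'Client Error',
    5: 'Server Error',
}

def status_class(status_code):
    """Return the class of an HTTP status code based on its first digit."""
    return STATUS_CLASSES.get(status_code // 100, 'Unknown')

print(status_class(200))  # Success
print(status_class(404))  # Client Error
```
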
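- **Note: You can add a `timeout` to `requests.get()` to avoid waiting indefinitely**
    - Multiples of 3 seconds work well (`timeout=3` or `6`, `9`, etc.); the `requests` docs suggest a value slightly larger than a multiple of 3, which is the default TCP packet retransmission window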

```python
# Add a timeout to prevent hanging
response = requests.get(url, timeout=3)
response.status_code

```
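A minimal sketch of how the status check and the timeout could be combined with error handling. The function name `fetch_page` is an assumption made for this example; `raise_for_status()` and the exception classes are part of `requests`.

```python
import requests

def fetch_page(url, timeout=3):
    """Return the Response for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises an HTTPError for 4xx/5xx status codes
        response.raise_for_status()
        return response
    except requests.exceptions.Timeout:
        print(f'Request timed out after {timeout} seconds.')
    except requests.exceptions.RequestException as error:
        # Base class for the other requests errors (bad url, connection error, HTTPError)
        print(f'Request failed: {error}')
    return None
```

Calling `fetch_page('https://en.wikipedia.org/wiki/Stock_market')` should then give back either a usable response or `None`, so the rest of a script does not need to check the status code again.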
- **`response` is a `requests.Response` object; its `.headers` attribute is a dictionary-like object with the contents printed below**





In [None]:
import requests
url = 'https://en.wikipedia.org/wiki/Stock_market'

response = requests.get(url, timeout=3)
print('Status code: ',response.status_code)
if response.status_code==200:
    print('Connection successful.\n\n')
else:
    print('Error. Check status code table.\n\n')



# Print out the contents of the response's headers
print(f"{'---'*20}\n\tContents of response.headers:\n{'---'*20}")

for k,v in response.headers.items():
    print(f"{k:{25}}: {v:{40}}") # Note: :{number} inside the braces sets the field width

Status code:  200
Connection successful.


------------------------------------------------------------
	Contents of response.headers:
------------------------------------------------------------
Date                     : Mon, 24 Jun 2019 15:13:12 GMT           
Content-Type             : text/html; charset=UTF-8                
Content-Length           : 64418                                   
Connection               : keep-alive                              
Server                   : mw1274.eqiad.wmnet                      
X-Content-Type-Options   : nosniff                                 
P3P                      : CP="This is not a P3P policy! See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
X-Powered-By             : HHVM/3.18.6-dev                         
Content-language         : en                                      
Last-Modified            : Mon, 24 Jun 2019 13:58:29 GMT           
Backend-Timing           : D=137942 t=1561384720599492 

In [None]:
for k,v in response.headers.items():
    print(f"{k}: {v}") # No width specified this time

Date: Mon, 24 Jun 2019 15:13:12 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 64418
Connection: keep-alive
Server: mw1274.eqiad.wmnet
X-Content-Type-Options: nosniff
P3P: CP="This is not a P3P policy! See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
X-Powered-By: HHVM/3.18.6-dev
Content-language: en
Last-Modified: Mon, 24 Jun 2019 13:58:29 GMT
Backend-Timing: D=137942 t=1561384720599492
Vary: Accept-Encoding,Cookie,Authorization,X-Seven
Content-Encoding: gzip
X-Varnish: 458249968 437952989, 205788053 193766102
Via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
Age: 4471
X-Cache: cp1089 hit/10, cp1083 hit/1
X-Cache-Status: hit-front
Server-Timing: cache;desc="hit-front"
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Set-Cookie: WMF-Last-Access=24-Jun-2019;Path=/;HttpOnly;secure;Expires=Fri, 26 Jul 2019 12:00:00 GMT, WMF-Last-Access-Global=24-Jun-2019;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Fri, 26 J

## Random Tips - Text Printing/Formatting:



- **You can repeat strings by using multiplication**
    - `'---'*20` will repeat the dashed lines 20 times

- **You can determine how much space is allotted for a variable when using f-strings**
    - Add a `:{##}` after the variable to specify the allocated width
    - Add a `>` before the `{##}` to right-align the value (`<` left-aligns, `^` centers)
    - Add another symbol (like '.' or '-') before `>` to add a guiding line/placeholder (like in a table of contents)

```python
print(f"Status code: {response.status_code}")
print(f"Status code: {response.status_code:>{20}}")
print(f"Status code: {response.status_code:->{20}}")
```    
```
# Returns:
Status code: 200
Status code:                  200
Status code: -----------------200
```
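A small sketch combining both tips to print a table-of-contents style listing. The helper name `toc_line` and the section titles/page numbers are made up for illustration.

```python
def toc_line(title, page, width=30):
    """Left-align the title padded with dots, then right-align the page number."""
    # '.<' pads on the right with dots; '>' right-aligns within 4 characters
    return f"{title:.<{width}}{page:>{4}}"

# Made-up sample data
sections = [('Requests', 1), ('HTML Review', 4), ('BeautifulSoup', 7)]

print('---' * 10)  # string multiplication: a 30-character rule
for title, page in sections:
    print(toc_line(title, page))
```
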

___

## Quick Review -  HTML & Tags


- All HTML pages have the following components:
    1. Document declaration followed by the opening html tag<br>
    `<!DOCTYPE html>`<br>
    `<html>`
    2. Head<br>
    `<head> <title></title></head>`
    3. Body, followed by the closing html tag<br>
    `<body>` ... content... `</body>`<br>
    `</html>`

- HTML content is divided into **tags** that specify the type of content.
    - [Basic Tags Reference Table](https://www.w3schools.com/tags/ref_byfunc.asp)
    - [Full Alphabetical Tag Reference Table](https://www.w3schools.com/tags/)
    
    - **tags** have attributes
        - [Tag Attributes](https://www.w3schools.com/html/html_attributes.asp)
        - Attributes are always defined in the start/opening tag.

    - **tags** may have several content-creator-defined attributes such as `class` or `id`
- We will **use the tag and its identifying attributes to isolate content** we want on a web page with BeautifulSoup.
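To make the idea of tags and their attributes concrete, here is a rough sketch that lists every opening tag in a tiny page using Python's built-in `html.parser` module (BeautifulSoup is not needed for this step). The HTML snippet and the class name `TagLister` are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up HTML for illustration
html_doc = """<!DOCTYPE html>
<html>
<head><title>Example page</title></head>
<body>
<div id="content" class="main-text"><p>Hello</p></div>
</body>
</html>"""

class TagLister(HTMLParser):
    """Collect (tag name, attributes) for every opening tag."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute name, value) pairs from the opening tag
        self.tags.append((tag, dict(attrs)))

parser = TagLister()
parser.feed(html_doc)
for name, attributes in parser.tags:
    print(name, attributes)
```

Only the `<div>` carries attributes here (`id` and `class`), which is exactly the kind of identifying information used to isolate content.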

___
___

#  1) Using `BeautifulSoup`



## Cook a soup

- Connect to a website using `response = requests.get(url)`
- Feed `response.content` into BeautifulSoup
- Must specify the parser that will analyze the contents
    - default available is `'html.parser'`
    - recommended is to install and use `lxml` [[lxml documentation](https://lxml.de/3.7/)]
- Use `soup.prettify()` to get a user-friendly version of the content to print

```python
import requests
from bs4 import BeautifulSoup

# Define url and establish connection
url = 'https://en.wikipedia.org/wiki/Stock_market'
response = requests.get(url, timeout=3)

# Feed the response's .content into BeautifulSoup
page_content = response.content
soup = BeautifulSoup(page_content,'lxml') # or 'html.parser' if lxml is not installed

# Preview soup contents using .prettify()
print(soup.prettify()[:2000])

```




## What's in a Soup?
- **A soup is essentially a collection of `tag objects`**
    - each tag from the html is a tag object in the soup
    - the tags maintain the hierarchy of the html page, so tag objects will contain _other_ tag objects that were nested under them in the html tree.

- **Each tag has a:**
    - `.name`
    - `.contents`
    - `.string`
    
- **A tag can be accessed by name using dot notation (like a column in a dataframe)**
    - and then you can access the tags within the new tag-variable just like the first tag
    ```python
    # Access tags by name
    meta = soup.meta
    head = soup.head
    body = soup.body
    # and so on...
    ```
- [!] ***BUT this will only return the FIRST tag of that type. To access all occurrences of a tag type, we will need to navigate the html family tree***
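A short sketch of these tag attributes on a tiny, made-up HTML string (parsed with the built-in `'html.parser'` so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_doc = '<html><head><title>Demo</title></head><body><p>First</p><p>Second</p></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

title = soup.title
print(title.name)          # the tag's name
print(title.string)        # the text inside the tag
print(soup.body.contents)  # list of the body's direct children

# Dot notation only returns the first <p> tag, even though there are two
first_paragraph = soup.p
print(first_paragraph)
```
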



## Navigating the HTML Family Tree: Children, siblings, and parents

- **Each tag is located within a tree-hierarchy of parents, siblings, and children**
    - The family relation is based on how the tags are nested (the indentation level in prettified html).

- **Attributes for reaching the tags related to a tag**
    - `.parent`, `.parents`
    - `.contents`, `.children`
    - `.descendants`
    - `.next_sibling`, `.previous_sibling`

- *Note: a newline character `\n` between tags is stored as a string object, so it also shows up as a sibling/child*

#### Accessing Child Tags

- To get to later occurrences of a tag type (e.g. the 2nd `<p>` tag in a tree), we need to navigate through the parent tag's `children`
    - To access an iterable list of a tag's children use `.children`
        - But, this only returns its *direct children*  (one indentation level down)     
        
    ```python
    # print direct children of the body tag
    body = soup.body
    for child in body.children:
        # print child if its not empty
        print(child if child is not None else ' ', '\n\n')  # '\n\n' for visual separation
    ```
- To access *all children* use `.descendants`
    - Returns all children and children of children
    ```python
    for child in body.descendants:
        # print all children/grandchildren, etc
        print(child if child is not None else ' ','\n\n')  
    ```
    
#### Accessing Parent tags

- To access the parent of a tag use `.parent`
```python
title = soup.head.title
print(title.parent.name)
```

- To get a list of _all parents_ use `.parents`
```python
title = soup.head.title
for parent in title.parents:
    print(parent.name)
```

#### Accessing Sibling tags
- Siblings are tags at the same level of the tree (same indentation level, same parent)
- `.next_sibling`, `.previous_sibling`
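A minimal sketch of sibling navigation on a made-up list. It also shows why the newline between tags matters: the direct `.next_sibling` of the first `<li>` is the newline string, not the second `<li>`. `find_next_sibling()` is a BeautifulSoup method that skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_doc = """<ul>
<li>one</li>
<li>two</li>
</ul>"""
soup = BeautifulSoup(html_doc, 'html.parser')

first_item = soup.li
# The direct next sibling is the newline between the two <li> tags
print(repr(first_item.next_sibling))

# find_next_sibling() skips over strings and returns the next matching tag
second_item = first_item.find_next_sibling('li')
print(second_item.string)

# Going backwards: newline first, then the first <li>
print(second_item.previous_sibling.previous_sibling.string)
```
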


## Searching Through Soup


### Finding the target tags to isolate
Using an example from the [Wikipedia article](https://en.wikipedia.org/wiki/Stock_market) on the stock market,
where we are trying to isolate the body of the article content.


- **Examine the website using Chrome's inspect view.**

    - Press F12 or right-click > inspect

    - Use the mouse selector tool (top left button) to explore the web page content for your desired target
        - the web page element will be highlighted on the page itself and its corresponding entry in the document tree.
        - Note: click on the web page with the selector in order to keep it selected in the document tree

    - Take note of any identifying attributes for the target tag (class, id, etc)
<img src="https://drive.google.com/uc?export-download&id=1KifQ_ukuXFdnCh1Tz1rwzA_cWkB_45mf" width=450>

### Using BeautifulSoup's search functions
Note: while the process below is a decent summary, there is more nuance to html/css tags than I personally have been able to digest.
    - If something doesn't work as expected/explained, please verify in the documentation.
        - [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautiful-soup-documentation)
        - [docs for .find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
    
- **BeautifulSoup has methods for searching through descendant tags**
    - `.find`
    - `.find_all`
    
- **Using `.find_all()`**
    - Searches through all descendant tags and returns a result set (list of tag objects)
```python
# General signature of .find_all() (each parameter is described below)
results = soup.find_all(name, attrs, recursive, string, limit, **kwargs)
```        
    - `.find_all()` parameters:
        - `name` _(type of tags to consider)_
            - only consider tags with this name
                - Ex: 'a',  'div', 'p' ,etc.
        - `attrs` _(css attributes that you are looking for in your target tag)_
            - enter a css class as a string (a plain string is matched against the `class` attribute)

                `attrs='mw-content-ltr'`
            - to match any other attribute (such as `id`), or more than one attribute, use a dictionary:

            `attrs={'class':'mw-content-ltr', 'id':'mw-content-text'}`
        - `recursive`_(Default=True)_
            - search all descendants (`True`)
            - search only direct children (`False`)

        - `string`
            - search for text _inside_ of tags instead of the tags themselves
            - can be regular expression
        - `limit`
            - How many results you want it to return
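The parameters above can be sketched on a small, made-up HTML snippet (the class and id names are borrowed from the Wikipedia example, but the snippet itself is invented for illustration):

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_doc = """
<div id="mw-content-text" class="mw-content-ltr">
  <p class="intro">Stock markets trade shares.</p>
  <p>Bonds are traded too.</p>
  <a href="/wiki/Bond" class="mw-redirect">Bond</a>
</div>
<p>Footer paragraph</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# name: only consider <p> tags
all_paragraphs = soup.find_all('p')
# attrs as a dictionary: <p> tags whose class is 'intro'
intro = soup.find_all('p', attrs={'class': 'intro'})
# attrs as a plain string: matched against the css class
by_class = soup.find_all('div', attrs='mw-content-ltr')
# limit: stop after the first result
first_only = soup.find_all('p', limit=1)
# recursive=False: only direct children of the soup itself
top_level = soup.find_all('p', recursive=False)
# string: search the text inside tags (here with a regular expression)
matching_text = soup.find_all(string=re.compile('traded'))

print(len(all_paragraphs), len(intro), len(first_only), len(top_level))
print(by_class[0]['id'])
print(matching_text)
```
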


    




# 2) Walk-through example/code


- James' functions
- Functional code scraping Wikipedia pages

## Walkthrough - using James' functions

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

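from fake_useragent import UserAgent

# NOTE: cook_soup_from_url, get_all_links, and make_absolute_links are James' functions
# and must be defined or imported before running this cell
url = 'https://en.wikipedia.org/wiki/Stock_market'
soup = cook_soup_from_url(url,sleep_time=1)


## Get all links (uncomment `,kwds` to keep only links with the 'mw-redirect' class, i.e. links to wikipedia redirect pages)
kwds = {'class':'mw-redirect'}
links = get_all_links(soup)#,kwds)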


# preview first 5 links
print(links[:5])


# Turn relative links into absolute links
abs_links = make_absolute_links(url,links)
print(abs_links[:5])

['/wiki/Rent_seeking', '/wiki/Rhine_capitalism', '/wiki/State-sponsored_capitalism', '/wiki/Global_capitalism', '/wiki/Perspectives_on_capitalism']
['https://en.wikipedia.org/wiki/Rent_seeking', 'https://en.wikipedia.org/wiki/Rhine_capitalism', 'https://en.wikipedia.org/wiki/State-sponsored_capitalism', 'https://en.wikipedia.org/wiki/Global_capitalism', 'https://en.wikipedia.org/wiki/Perspectives_on_capitalism']
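For intuition about the relative-to-absolute step, here is a hypothetical stand-in built on `urljoin` from the standard library. It is only a sketch of the idea, not the actual implementation of James' `make_absolute_links`, and the name `make_absolute_links_sketch` is made up.

```python
from urllib.parse import urljoin

def make_absolute_links_sketch(base_url, links):
    """Hypothetical stand-in: join each relative link onto base_url."""
    return [urljoin(base_url, link) for link in links]

base_url = 'https://en.wikipedia.org/wiki/Stock_market'
relative_links = ['/wiki/Rent_seeking', '/wiki/Rhine_capitalism']
absolute_links = make_absolute_links_sketch(base_url, relative_links)
print(absolute_links)
```
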


In [None]:
# Selecting only the first 5 links to test
abs_links_for_soups = abs_links[:5]


# Cooking a batch of soups from those chosen links
batch_of_soups = cook_batch_of_soups(abs_links_for_soups, sleep_time=2)

# batch_of_soups is a list as long as the input link_list
print(f'# of input links: == # of soups in batch:\n{len(abs_links_for_soups)} == {len(batch_of_soups)}\n')

# batch_of_soups is a list of soup-dictionaries
soup_dict = batch_of_soups[0]
print('Each soup_dict has ',soup_dict.keys())

# the page's soup is stored under soup_dict['soup']
soup_from_soup_dict = soup_dict['soup']
type(soup_from_soup_dict)

# of input links: == # of soups in batch:
5 == 5

Each soup_dict has  dict_keys(['_url', 'path', 'soup'])


bs4.BeautifulSoup

#### Notes on extracting content.
- Edit the `extract_target_text` function in James' functions, or uncomment and use the `extract_target_text_custom` function below

In [None]:
## ADDING extract_target_text to precisely target text
# def extract_target_text_custom(soup_or_tag,tag_name='p', attrs_dict=None, join_text =True, save_files=False):
#     """User-specified function to add extraction of specific content during 'cook batch of soups'"""

#     if attrs_dict is None:
#         found_tags = soup_or_tag.find_all(name=tag_name)
#     else:
#         found_tags = soup_or_tag.find_all(name=tag_name,attrs=attrs_dict)


#     # if extracting from multiple tags
#     output=[]
#     output = [tag.text for tag in found_tags if tag.text is not None]

#     if join_text == True:
#         output = ' '.join(output)

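#     ## ADDING SAVING EACH
#     # NOTE: `url_dict_key` and `soup_dict` are not defined inside this function;
#     # define them (or pass them in as arguments) before using save_files=True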
#     if save_files==True:
#         text = output #soup.body.string
#         filename =f"drive/My Drive/text_extract_{url_dict_key}.txt"
#         soup_dict['filename'] = filename
#         with open(filename,'w+') as f:
#             f.write(text)
#         print(f'File  successfully saved as {filename}')

#     return  output

# ####################

## RUN A LOOP TO ADD EXTRACTED TEXT TO EACH SOUP IN THE BATCH
for i, soup_dict in enumerate(batch_of_soups):

    # Get the soup from the dict
    soup = soup_dict['soup']

    # Extract text
    extracted_text = extract_target_text(soup)

    # Add key:value for results of extract
    soup_dict['extracted'] = extracted_text

    # Replace the old soup_dict with the new one with 'extracted'
    batch_of_soups[i] = soup_dict

example_extracted_text=batch_of_soups[0]['extracted']
print(example_extracted_text[:1000])


 Rent-seeking is a concept in public choice theory as well as in economics, that involves seeking to increase one's share of existing wealth without creating new wealth. Rent-seeking results in reduced economic efficiency through misallocation of resources, reduced wealth-creation, lost government revenue, heightened income inequality,[1] and potential national decline.
 Attempts at capture of regulatory agencies to gain a coercive monopoly can result in advantages for the rent seeker in a market while imposing disadvantages on their incorrupt competitors. This is one of many possible forms of rent-seeking behavior.
 The idea of rent-seeking was developed by Gordon Tullock in 1967,[2] while the expression rent-seeking itself was coined in 1974 by Anne Krueger.[3] The word "rent" does not refer specifically to payment on a lease but rather to Adam Smith's division of incomes into profit, wage, and rent.[4] The origin of the term refers to gaining control of land or other natural resour