![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Web scraping in Python

## Robot exclusion protocol

In this project you will perform some web scraping with a sensitivity to robot exclusion rules.

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Finding multiple contact pages 

For this task, you should construct a robot to do the following steps:

1. Peform a search of a popular search engine, such as Google, Yandex, or Baidu, for a term you pass in.
2. Collect all the results for the term (a first page worth of results is fine).
3. Extract the domains from the results
4. Determine if you are permitted to request the "Contact" page for the site.  E.g. `http://example.com/contact`. For the check, determine this permission for at least the following user agents:
   * MyRobot
   * Fetch
   * Microsoft.URL.Control
   * Xenu
5. If you are permitted to request that resource, determine if it exists at all.
6. Create a table of possible answers, with values PROHIBITED, NON-EXISTENT, PERMITTED for each user agent and domain.

Such a table might look something like this:

```python
>>> permissions = term_permissions('robots', 'google.com')
>>> pd.DataFrame(permissions).sample(10)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Agent</th>
      <th>Domain</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>82</th>
      <td>MyRobot</td>
      <td>spectrum.ieee.org</td>
      <td>PERMITTED</td>
    </tr>
    <tr>
      <th>111</th>
      <td>Fetch</td>
      <td>m.imdb.com</td>
      <td>NON-EXISTENT</td>
    </tr>
    <tr>
      <th>24</th>
      <td>Microsoft.URL.Control</td>
      <td>en.wikipedia.org</td>
      <td>PROHIBITED</td>
    </tr>
    <tr>
      <th>92</th>
      <td>Microsoft.URL.Control</td>
      <td>spectrum.ieee.org</td>
      <td>PERMITTED</td>
    </tr>
    <tr>
      <th>81</th>
      <td>Xenu</td>
      <td>spectrum.ieee.org</td>
      <td>PERMITTED</td>
    </tr>
    <tr>
      <th>65</th>
      <td>Xenu</td>
      <td>abcstlouis.com</td>
      <td>NON-EXISTENT</td>
    </tr>
    <tr>
      <th>20</th>
      <td>Microsoft.URL.Control</td>
      <td>builtin.com</td>
      <td>PROHIBITED</td>
    </tr>
    <tr>
      <th>14</th>
      <td>MyRobot</td>
      <td>www.wired.com</td>
      <td>PROHIBITED</td>
    </tr>
    <tr>
      <th>39</th>
      <td>Fetch</td>
      <td>maps.google.com</td>
      <td>PROHIBITED</td>
    </tr>
    <tr>
      <th>38</th>
      <td>MyRobot</td>
      <td>maps.google.com</td>
      <td>PROHIBITED</td>
    </tr>
  </tbody>
</table>


In [16]:
from urllib import robotparser
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [20]:
def google_link_domains(links):
    # Google (and other search engines) force links to go 
    # through their redirect and tracking keys
    domains = set()
    for link in links:
        if link.startswith('/url?q='):
            link = link[7:]
        domain = urlparse(link).hostname
        if domain:   # Might be None
            domains.add(domain)
    return domains

In [49]:
def term_permissions(term, 
                     search_engine="google.com",
                     robots=["Microsoft.URL.Control", 
                             "Xenu", "MyRobot", "Fetch"]):
    # For now we have only implemented Google search support
    if search_engine != "google.com":
        raise NotImplementedError("Non-Google search engines pending")
    
    # Do the initial search on the term
    tuples = []
    parser = robotparser.RobotFileParser()
    query = f"https://{search_engine}/search?q={term}"
    result = BeautifulSoup(requests.get(query).text)
    links = [a['href'] for a in result.find_all('a')]
    domains = google_link_domains(links)
    
    # Figure out which contact page requests are prohibited
    to_contact = []
    for domain in domains:
        parser.set_url(f"https://{domain}/robots.txt")
        parser.read()
        for robot in robots:
            contact = f"https://{domain}/contact"
            if parser.can_fetch(robot, contact):
                to_contact.append(contact)
            else:
                tuples.append((robot, domain, 'PROHIBITED'))
    
    # Of the permitted ones, determine if the page actually exists
    # The server MIGHT give different status per user agent
    for contact in to_contact:
        domain = urlparse(contact).hostname
        for robot in robots:
            header = {'User-agent': robot}
            try:
                # Might take too long to get the contact page
                contact_page = requests.get(contact, 
                                            timeout=0.1, 
                                            headers=header)
            except (requests.exceptions.Timeout, Exception):
                # Not obvious the cause, but let's call it 503 status
                contact_page.status_code = 503
                
            if contact_page.status_code == 200:
                tuples.append((robot, domain, "PERMITTED"))
            else:
                tuples.append((robot, domain, "NON-EXISTENT"))
    
    return tuples

In [50]:
permissions = term_permissions('spiders')
pd.DataFrame(permissions, columns=['Agent', 'Domain', 'Status'])

Unnamed: 0,Agent,Domain,Status
0,Microsoft.URL.Control,en.wikipedia.org,PROHIBITED
1,Xenu,en.wikipedia.org,PROHIBITED
2,Fetch,en.wikipedia.org,PROHIBITED
3,Microsoft.URL.Control,www.livescience.com,PROHIBITED
4,Xenu,www.livescience.com,PROHIBITED
...,...,...,...
142,Fetch,www.nationalgeographic.com,NON-EXISTENT
143,Microsoft.URL.Control,www.nationalgeographic.com,NON-EXISTENT
144,Xenu,www.nationalgeographic.com,NON-EXISTENT
145,MyRobot,www.nationalgeographic.com,NON-EXISTENT


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)