<h1>T1 Data Collection, Preparation and Analysis</h1>

Build a crawler capable of navigating through all the country pages and downloading their HTML.

Vitor Delela, ...

Importing urlopen from the urllib.request module and BeautifulSoup from the bs4 (Beautiful Soup) package

In [50]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

Naturally, not every link is of interest to us. We can keep only the relevant links by looking for a pattern in the address and applying a regular expression as the filter.

Below, we filter only links to other wiki articles, ignoring anchors, links to files, and so on. We do this by taking advantage of how a wiki entry is organized. All article links are always inside the **div** tag whose **id** attribute has the value **'bodyContent'**. In addition, every article link necessarily begins with "/wiki/" and has no ":" in its address.
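As a minimal illustration of this filter, the sketch below applies the "/wiki/ prefix, no colon" rule to a handful of hrefs. The sample list is invented for demonstration and is not real crawl data.

```python
import re

# Article links start with "/wiki/" and contain no ":" anywhere after it.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

# Hypothetical hrefs, invented for illustration only.
sample_hrefs = [
    '/wiki/Python_(programming_language)',
    '/wiki/File:Logo.png',
    '#cite_note-1',
    '/wiki/Category:Languages',
    '/wiki/Web_crawler',
    'https://example.org/wiki/External',
]

# Keep only the hrefs that match the article pattern.
article_links = [href for href in sample_hrefs if ARTICLE_LINK.match(href)]
print(article_links)
```

The negative lookahead `(?!:)` is checked before every character, so a single ":" anywhere after the prefix rejects the link.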

We can generalize this code into a function, **getLinks()**. This lets us fetch the links of any wiki entry.

Such a **getLinks()** function is enough if we want to find all the links on a single page. For an effective crawler, however, we need to look for pages linked from other pages recursively. We can do this by having the link-finding function call itself. The code below applies this idea to the countries site: it searches the **section** whose **id** is **'main'** for links that begin with "/places/", records the country pages, and recurses into the index pages.
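The recursive pattern can be sketched without any network access. In the example below, the `SITE` dictionary and the `crawl` function are hypothetical stand-ins for the real pages and for the link-finding function; they only illustrate how a visited set stops the recursion from looping between pages that link to each other.

```python
# Hypothetical site map: each index page maps to the links found on it.
SITE = {
    '/index/0': ['/view/A-1', '/view/B-2', '/index/1'],
    '/index/1': ['/view/C-3', '/index/0', '/index/2'],
    '/index/2': ['/view/D-4', '/index/1'],
}

def crawl(page, visited=None, found=None):
    """Recursively follow index pages, collecting every view link once."""
    visited = set() if visited is None else visited
    found = set() if found is None else found
    visited.add(page)
    for link in SITE.get(page, []):
        if link.startswith('/view/'):
            found.add(link)              # leaf page: record it, do not follow
        elif link not in visited:
            crawl(link, visited, found)  # unseen index page: recurse into it
    return visited, found

visited, found = crawl('/index/0')
print(sorted(visited))
print(sorted(found))
```

Python's default recursion limit is around 1000 frames, so for a site with very long chains of pages an explicit queue may be a safer choice than recursion.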

In [9]:
import re

webpages = set()
countriesLink = set()

def getLinks(pageUrl):
    global webpages, countriesLink
    html = urlopen('http://127.0.0.1:8000{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # Look only inside the main section, at links that start with /places/ and contain no ":"
    for link in bs.find('section', {'id': 'main'}).find_all('a', href=re.compile(r'^(/places/)((?!:).)*$')):
        newPage = link.attrs['href']
        # Country page: record it; there is nothing further to crawl from it
        if re.match(r'^/places/default/view/\S*-\d*$', newPage):
            countriesLink.add(newPage)
        # Index (pagination) page not seen yet: record it and crawl it recursively
        elif re.match(r'^/places/default/index/\d*$', newPage) and newPage not in webpages:
            webpages.add(newPage)
            getLinks(newPage)



Test to validate the pages found by the program

In [83]:
getLinks('/places/default/index')
print(sorted(webpages))
print(len(webpages))
print(sorted(countriesLink))
print(len(countriesLink))

['/places/default/index/0', '/places/default/index/1', '/places/default/index/10', '/places/default/index/11', '/places/default/index/12', '/places/default/index/13', '/places/default/index/14', '/places/default/index/15', '/places/default/index/16', '/places/default/index/17', '/places/default/index/18', '/places/default/index/19', '/places/default/index/2', '/places/default/index/20', '/places/default/index/21', '/places/default/index/22', '/places/default/index/3', '/places/default/index/4', '/places/default/index/5', '/places/default/index/6', '/places/default/index/7', '/places/default/index/8', '/places/default/index/9']
23
['/places/default/view/Afghanistan-1', '/places/default/view/Aland-Islands-2', '/places/default/view/Albania-3', '/places/default/view/Algeria-4', '/places/default/view/American-Samoa-5', '/places/default/view/Andorra-6', '/places/default/view/Angola-7', '/places/default/view/Anguilla-8', '/places/default/view/Antarctica-9', '/places/default/view/Antigua-and-B

Taking the list of countries and collecting the information for each one

In [84]:
import re
from datetime import datetime
from urllib.error import HTTPError
from urllib.error import URLError

countriesEntity = []

for countryLink in countriesLink:
    print(countryLink)
    try:
        # html = urlopen('http://127.0.0.1:8000{}'.format('/places/default/view/Afghanistan-1'))
        html = urlopen('http://127.0.0.1:8000{}'.format(countryLink))
    except HTTPError as e:
        print("The server returned an HTTP error")
    except URLError as e:
        print("The server could not be found!")
    else:
        bs = BeautifulSoup(html, 'html.parser')

        country = {}
        country['name'] = bs.find('tr',  {'id':'places_country__row'}).find('td',  {'class':'w2p_fw'}).text.replace(","," ")
        country['capital'] = bs.find('tr',  {'id':'places_capital__row'}).find('td',  {'class':'w2p_fw'}).text
        country['area'] = bs.find('tr',  {'id':'places_area__row'}).find('td',  {'class':'w2p_fw'}).text

        neighbors = []  # resolve each neighbor's name from its ISO-code link
        for ng in bs.find('tr',  {'id':'places_neighbours__row'}).find('td',  {'class':'w2p_fw'}).find_all('a'):
            if re.match(r'^/places/default/iso/[A-Z]{2}$', ng.get('href')):
                # use separate names so the country page held in `bs` is not overwritten
                ngHtml = urlopen('http://127.0.0.1:8000{}'.format(ng.get('href')))
                ngBs = BeautifulSoup(ngHtml, 'html.parser')
                neighbors.append(ngBs.find('tr',  {'id':'places_country__row'}).find('td',  {'class':'w2p_fw'}).text)

        country['neighbors'] = neighbors
        country['timestamp'] = datetime.now().strftime('%d-%m-%Y %H:%M:%S')  # current date and time as a string
        countriesEntity.append(country)

/places/default/view/Germany-83
/places/default/view/Bahamas-17
/places/default/view/Mongolia-147
/places/default/view/Rwanda-186
/places/default/view/Barbados-20
/places/default/view/Grenada-88
/places/default/view/Libya-126
/places/default/view/Cyprus-57
/places/default/view/Nicaragua-160
/places/default/view/Spain-213
/places/default/view/Qatar-181
/places/default/view/New-Zealand-159
/places/default/view/Armenia-12
/places/default/view/Falkland-Islands-72
/places/default/view/Christmas-Island-48
/places/default/view/Kyrgyzstan-120
/places/default/view/Comoros-51
/places/default/view/Latvia-122
/places/default/view/Slovakia-205
/places/default/view/Panama-172
/places/default/view/Algeria-4
/places/default/view/Ghana-84
/places/default/view/British-Virgin-Islands-34
/places/default/view/France-76
/places/default/view/Jordan-114
/places/default/view/Ivory-Coast-110
/places/default/view/Anguilla-8
/places/default/view/Bahrain-18
/places/default/view/Mexico-143
/places/default/view/Pitc

Writing the list of countries to a CSV file

In [64]:
import csv

nome_arquivo = "/Users/vitordelela/Documents/PUCRS/7 SEM/ColePrepAnDados/paises.csv"

with open(nome_arquivo, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
    writer.writerow(['Nome', 'Capital', 'Area', 'Vizinhos', 'Obtido em'])  # write the CSV header row

    for country in countriesEntity:
        writer.writerow([country['name'], country['capital'], country['area'], country['neighbors'], country['timestamp']]) 
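Because `country['neighbors']` is a Python list, the writer stores it as the list's string representation, which then has to be parsed back when the file is read. One alternative, which this notebook does not use, is to encode the list as JSON before writing. The sketch below shows the round trip with an in-memory buffer; the record is a hypothetical one shaped like the entries in `countriesEntity`.

```python
import csv
import io
import json

# Hypothetical record, shaped like the entries in countriesEntity.
country = {'name': 'Mongolia', 'capital': 'Ulan Bator',
           'neighbors': ['China', 'Russia']}

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer, delimiter=';')
# Encode the list as a JSON string so it can be decoded unambiguously later.
writer.writerow([country['name'], country['capital'],
                 json.dumps(country['neighbors'])])

# Read the row back and decode the JSON cell into a list again.
buffer.seek(0)
row = next(csv.reader(buffer, delimiter=';'))
restored = json.loads(row[2])
print(restored)
```

The csv module quotes the JSON cell automatically, so the double quotes inside it survive the round trip.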

<h1><b>Crawler that checks for updates</b></h1>

In [88]:
with open(nome_arquivo, newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=';')

    csvCountries = []
    
    for country in csvreader:
        csvCountry = {
            'name': country[0],
            'capital': country[1],
            'area': country[2],
            'neighbors': country[3],
            'timestamp': country[4]
        }
        
        csvCountries.append(csvCountry)
        print(csvCountry)

{'name': 'Nome', 'capital': 'Capital', 'area': 'Area', 'neighbors': 'Vizinhos', 'timestamp': 'Obtido em'}
{'name': 'Germany', 'capital': 'Berlin', 'area': '357,021 square kilometres', 'neighbors': "['Switzerland', 'Poland', 'Netherlands', 'Denmark', 'Belgium', 'Czech Republic', 'Luxembourg', 'France', 'Austria']", 'timestamp': '25/03/24 13:46'}
{'name': 'Bahamas', 'capital': 'Nassau', 'area': '13,940 square kilometres', 'neighbors': '[]', 'timestamp': '25/03/24 13:46'}
{'name': 'Mongolia', 'capital': 'Ulan Bator', 'area': '1,565,000 square kilometres', 'neighbors': "['China', 'Russia', 'Brasil']", 'timestamp': '11/03/24 07:37'}
{'name': 'Rwanda', 'capital': 'Kigali', 'area': '26,338 square kilometres', 'neighbors': "['Tanzania', 'Democratic Republic of the Congo', 'Burundi', 'Uganda']", 'timestamp': '11/03/24 07:38'}
{'name': 'Barbados', 'capital': 'Bridgetown', 'area': '431 square kilometres', 'neighbors': '[]', 'timestamp': '25/03/24 07:38'}
{'name': 'Grenada', 'capital': "St. George
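As the first printed record shows, the header line is read as if it were a country. A possible alternative, sketched here under the assumption that the file keeps the same header and `;` delimiter, is `csv.DictReader`, which consumes the header as field names. The sample text reuses the header and the Bahamas row from the output above.

```python
import csv
import io

# Two lines in the same layout as the CSV file (header plus one data row).
sample = (
    "Nome;Capital;Area;Vizinhos;Obtido em\n"
    "Bahamas;Nassau;13,940 square kilometres;[];25/03/24 13:46\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter=';')
rows = list(reader)  # the header line becomes the dict keys, not a data row
print(len(rows))
print(rows[0]['Nome'], rows[0]['Capital'])
```

If the header is skipped on read in this way, it has to be written again explicitly when the file is saved, since the write-back cell at the end of this notebook emits only the rows held in memory.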

Function that updates a country's record when its data has changed

In [82]:
import ast

def findCountryAndUpdate(lista, country_):
    """Update the record in `lista` whose name matches `country_`."""
    for country in lista:
        if country['name'] == country_['name']:
            print(country)
            print(country_)
            if country['capital'] != country_['capital']:
                country['capital'] = country_['capital']
                country['timestamp'] = country_['timestamp']
                print("updated capital")
            if country['area'] != country_['area']:
                country['area'] = country_['area']
                country['timestamp'] = country_['timestamp']
                print("updated area")
            # The CSV stores the neighbors list as its string representation;
            # ast.literal_eval parses it back without executing arbitrary code, unlike eval.
            stored = country['neighbors']
            if isinstance(stored, str):
                stored = ast.literal_eval(stored)
            if stored != country_['neighbors']:
                country['neighbors'] = country_['neighbors']
                country['timestamp'] = country_['timestamp']
                print("updated neighbors")
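The three comparisons in `findCountryAndUpdate` follow the same pattern, so they can also be expressed as a loop over field names. The sketch below is one possible generalization, not the notebook's own method: `changed_fields` is a hypothetical helper, the stored record mirrors the Mongolia row printed earlier, and the fresh record is an invented example of what a new crawl might return.

```python
import ast

def changed_fields(stored, fresh, fields=('capital', 'area', 'neighbors')):
    """Return the names of the fields whose values differ between two records."""
    changed = []
    for field in fields:
        old = stored[field]
        # Values read from the CSV are strings; a list stored as its repr
        # is parsed back with ast.literal_eval, which only accepts literals.
        if isinstance(fresh[field], list) and isinstance(old, str):
            old = ast.literal_eval(old)
        if old != fresh[field]:
            changed.append(field)
    return changed

# Stored record as read from the CSV (neighbors is still a string).
stored = {'name': 'Mongolia', 'capital': 'Ulan Bator',
          'area': '1,565,000 square kilometres',
          'neighbors': "['China', 'Russia', 'Brasil']"}
# Hypothetical freshly crawled record (neighbors is a real list).
fresh = {'name': 'Mongolia', 'capital': 'Ulan Bator',
         'area': '1,565,000 square kilometres',
         'neighbors': ['China', 'Russia']}

print(changed_fields(stored, fresh))
```

Note that the lists are compared in order, so the same neighbors returned in a different order would count as a change; comparing sorted copies would avoid that if order is not meaningful.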


The crawler visits the pages and looks for information that differs from what is currently in the CSV

In [89]:
for countryLink in countriesLink:
    print(countryLink)
    try:
        # html = urlopen('http://127.0.0.1:8000{}'.format('/places/default/view/Afghanistan-1'))
        html = urlopen('http://127.0.0.1:8000{}'.format(countryLink))
    except HTTPError as e:
        print("The server returned an HTTP error")
    except URLError as e:
        print("The server could not be found!")
    else:
        bs = BeautifulSoup(html, 'html.parser')

        country = {}
        country['name'] = bs.find('tr',  {'id':'places_country__row'}).find('td',  {'class':'w2p_fw'}).text.replace(","," ")
        country['capital'] = bs.find('tr',  {'id':'places_capital__row'}).find('td',  {'class':'w2p_fw'}).text
        country['area'] = bs.find('tr',  {'id':'places_area__row'}).find('td',  {'class':'w2p_fw'}).text

        neighbors = []  # resolve each neighbor's name from its ISO-code link
        for ng in bs.find('tr',  {'id':'places_neighbours__row'}).find('td',  {'class':'w2p_fw'}).find_all('a'):
            if re.match(r'^/places/default/iso/[A-Z]{2}$', ng.get('href')):
                # use separate names so the country page held in `bs` is not overwritten
                ngHtml = urlopen('http://127.0.0.1:8000{}'.format(ng.get('href')))
                ngBs = BeautifulSoup(ngHtml, 'html.parser')
                neighbors.append(ngBs.find('tr',  {'id':'places_country__row'}).find('td',  {'class':'w2p_fw'}).text)

        country['neighbors'] = neighbors
        country['timestamp'] = datetime.now().strftime('%d-%m-%Y %H:%M:%S')  # current date and time as a string

        findCountryAndUpdate(csvCountries, country)
        
        

/places/default/view/Germany-83
{'name': 'Germany', 'capital': 'Berlin', 'area': '357,021 square kilometres', 'neighbors': "['Switzerland', 'Poland', 'Netherlands', 'Denmark', 'Belgium', 'Czech Republic', 'Luxembourg', 'France', 'Austria']", 'timestamp': '25/03/24 13:46'}
{'name': 'Germany', 'capital': 'Berlin', 'area': '357,021 square kilometres', 'neighbors': ['Switzerland', 'Poland', 'Netherlands', 'Denmark', 'Belgium', 'Czech Republic', 'Luxembourg', 'France', 'Austria'], 'timestamp': '25-03-2024 13:56:21'}
/places/default/view/Bahamas-17
{'name': 'Bahamas', 'capital': 'Nassau', 'area': '13,940 square kilometres', 'neighbors': '[]', 'timestamp': '25/03/24 13:46'}
{'name': 'Bahamas', 'capital': 'Nassau', 'area': '13,940 square kilometres', 'neighbors': [], 'timestamp': '25-03-2024 13:56:21'}
/places/default/view/Mongolia-147
{'name': 'Mongolia', 'capital': 'Ulan Bator', 'area': '1,565,000 square kilometres', 'neighbors': "['China', 'Russia', 'Brasil']", 'timestamp': '11/03/24 07:37'

In [90]:
# writing the results back to the CSV file
with open(nome_arquivo, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile,delimiter=';')

    for country in csvCountries:
        writer.writerow([country['name'], country['capital'], country['area'], country['neighbors'], country['timestamp']]) 