# Web Scraper

### Summary

The purpose of this notebook is to collect data from the website of Foxter, a real estate company, using a web scraper. This is the first step in a project to estimate apartment prices in Porto Alegre (Brazil).

In this notebook, the web scraper will:
1. download `sitemap.xml`;
2. visit each page listed in `sitemap.xml`;
3. collect data from each page, skipping the pages excluded by a set of `if` conditions;
4. collect the addresses of new pages from each page visited;
5. save all collected data in a pandas dataframe at the end.
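
The steps above can be sketched as a loop over a frontier of pages still to visit. This is only an illustration under stated assumptions: `FAKE_SITE`, `crawl`, `get_links` and `wanted` are hypothetical names, the "site" is an in-memory dictionary rather than real HTTP requests, and the sketch visits pages in first-in, first-out order while the notebook pops from the end of a list.

```python
from collections import deque

# Hypothetical in-memory "site": page URL -> links found on that page.
FAKE_SITE = {
    "/": ["/imovel/1", "/bairro/x"],
    "/imovel/1": ["/imovel/2", "/"],
    "/imovel/2": ["/imovel/1"],
    "/bairro/x": ["/imovel/2"],
}


def crawl(seeds, get_links, wanted):
    """Visit every reachable page once and return the pages that pass `wanted`."""
    frontier = deque(seeds)   # pages still to visit
    seen = set(seeds)         # pages already queued, to avoid revisiting
    collected = []
    while frontier:
        page = frontier.popleft()
        if not wanted(page):
            continue          # skipped pages are neither scraped nor expanded
        collected.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected


pages = crawl(
    ["/", "/imovel/1"],
    get_links=lambda page: FAKE_SITE.get(page, []),
    wanted=lambda page: page.startswith("/imovel/"),
)
print(pages)
```

Keeping the visited pages in a set makes the "already seen" check cheap even when the sitemap lists tens of thousands of addresses.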


### Main Strategy

The web scraper is built mainly on BeautifulSoup, Requests and Pandas. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) makes it easy to extract the data from each page. The [Requests](https://2.python-requests.org/en/master/) library handles the HTTP connections used to fetch the pages. [Pandas](https://pandas.pydata.org) is used to store the collected data in a single dataframe.

### Import modules

In [1]:
import sys
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import random
from datetime import date
try:
    from urlparse import urljoin  # Python2
except ImportError:
    from urllib.parse import urljoin  # Python3

### Download Sitemap

In [7]:
print('downloading sitemap.xml...')
url="https://www.foxterciaimobiliaria.com.br/sitemap.xml"
response = requests.get(url)
with open('sitemap.xml', 'wb') as file:
    file.write(response.content)
print('sitemap.xml ok!')

downloading sitemap.xml...
sitemap.xml ok!
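
Once the file is downloaded, the page addresses sit in its `<loc>` elements. The notebook reads them later with BeautifulSoup; as an alternative, the sketch below uses the standard library's `xml.etree.ElementTree`. It assumes the file follows the standard sitemap namespace, and `SAMPLE` and `extract_locs` are hypothetical names with made-up URLs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (assumed to be used by the file).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample standing in for the downloaded sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/imovel/1</loc></url>
  <url><loc> http://www.example.com/imovel/2 </loc></url>
</urlset>"""


def extract_locs(xml_text):
    """Return the text of every <loc> element, stripped of whitespace."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(extract_locs(SAMPLE))
```

For the file on disk, `ET.parse('sitemap.xml').getroot()` would give the same kind of root element.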


### Find all links

In [2]:
def union(tocrawl,newlinks):
    for e in newlinks:
        if e not in tocrawl:
            tocrawl.append(e)
    return   

                  
# Find all links in the parsed page and return them as a list
def get_all_links(page):
    links = []
    for link in page_content(page).find_all('a'):
        links.append(str(link.get('href')))
    return links
                  
                  
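def check_links(alllinks):
    # keep only links that belong to the site, as absolute URLs
    links_checked = []

    for link_to_check in alllinks:

        # absolute link that already starts with the seed URL
        if link_to_check.startswith(url_seed):
            links_checked.append(link_to_check)

        # relative link to a property page: make it absolute
        if link_to_check.startswith(url_segment):
            link_to_check_join = urljoin(url_seed, link_to_check)
            links_checked.append(link_to_check_join)

    return links_checked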
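
The link filtering can also be tried in isolation. The sketch below is a hedged variant, not the notebook's function: `normalize_links` is a hypothetical name, it takes the seed and segment as parameters instead of reading globals, and it also drops duplicates, which the notebook leaves to `union`.

```python
from urllib.parse import urljoin

URL_SEED = "http://www.foxterciaimobiliaria.com.br"
URL_SEGMENT = "/imovel/"


def normalize_links(hrefs, seed=URL_SEED, segment=URL_SEGMENT):
    """Keep site links only, as absolute URLs, without duplicates."""
    kept = []
    seen = set()
    for href in hrefs:
        if href.startswith(seed):
            absolute = href                 # already absolute and on the site
        elif href.startswith(segment):
            absolute = urljoin(seed, href)  # relative property link
        else:
            continue                        # external link, "None", other paths
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept


links = normalize_links([
    "/imovel/123",
    "http://www.foxterciaimobiliaria.com.br/imovel/123",
    "https://other.example/imovel/9",
    "None",
    "/bairro/centro",
])
print(links)
```

The string `"None"` appears in the input because `get_all_links` converts a missing `href` with `str()`; both versions discard it.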

### Collect content from one page

Collect all data from a page using BeautifulSoup. The function reports when the connection is refused by the server, then waits and tries again.
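
The wait-and-retry behaviour can be sketched on its own. The variant below is an assumption, not the notebook's logic: it gives up after a fixed number of attempts and doubles the delay each time, whereas the notebook sleeps a random 10 to 2000 seconds and retries indefinitely. `fetch_with_retry`, `flaky_fetch` and the injectable `sleep` are hypothetical names, used so the example runs without a network.

```python
import time


def fetch_with_retry(fetch, url, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a connection error wait and retry, up to max_tries."""
    last_error = None
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except OSError as error:
            # requests.exceptions.RequestException derives from OSError,
            # so this also covers errors raised by requests.get.
            last_error = error
            if attempt < max_tries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error


# Fake fetcher that fails twice, then succeeds (stands in for requests.get).
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refused")
    return "<html>ok</html>"


delays = []  # record the requested delays instead of really sleeping
result = fetch_with_retry(flaky_fetch, "http://example.com", sleep=delays.append)
print(result, delays)
```

Bounding the number of attempts keeps one unreachable page from stalling the whole crawl.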

In [3]:
def page_content(page):
    
    # retry until the server answers
    response = ''
    while response == '':
        try:
            response = requests.get(page)
        except requests.exceptions.RequestException:
            # the server did not respond: wait a random time, then retry
            wait = random.randint(10,2000)
            print("Connection refused by the server..")
            print("Let me sleep for " + str(wait) + " seconds")
            print("ZZzzzz...")
            time.sleep(wait)
            print("Was a nice sleep, now let me continue...")
            continue
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    
    
    # ignore page without id
    if soup.find('span', {'itemprop':"identifier"}) is None:
        return soup
    
    
    #collect id
    id_imovel = soup.find('span', {'itemprop':"identifier"})
    id_clean = id_imovel.get_text()[7:]
    id_list.append(id_clean)
        
        
    #collect price
    try:        
        price = soup.find('span', {'itemprop': "price"})
        price_clean0 = price.get_text()[3:]
        price_clean = price_clean0.replace(".", "")
    except:
        price_clean = 'NA'
    price_list.append(price_clean)
    
        
    #collect private area
    try:
        area = soup.find(string='Área privativa').next_element.next_element
        area_clean = area.get_text()[:-2]
    except:
        area_clean = 'NA'        
    area_list.append(area_clean)
        
        
    #collect neighborhood
    try:
        district = soup.find(string='Bairro').next_element.next_element
        district_clean = district.get_text()
    except:
        district_clean = 'NA'
    district_list.append(district_clean)
        
        
    #collect city
    try:
        city = soup.find(string='Cidade').next_element.next_element
        city_clean = city.get_text()
    except:
        city_clean = 'NA'
    city_list.append(city_clean)

        
    #collect type
    try:
        type1 = soup.find('span', {'itemprop': "category"})
        type1_clean = type1.get_text()
    except:
        type1_clean = 'NA'
    type1_list.append(type1_clean)


    #collect segment
    try:
        segment = soup.find(string='Segmento').next_element.next_element
        segment_clean = segment.get_text()
    except:
        segment_clean = 'NA'
    segment_list.append(segment_clean)

        
    #collect condominium price/value
    try:
        condominium = soup.find(string='Condomínio').next_element.next_element
        condominium_clean0 = condominium.get_text()[3:]
        condominium_clean = condominium_clean0.replace(".", "")
    except:
        condominium_clean = 'NA'
    condominium_list.append(condominium_clean)

        
    #collect tax - IPTU
    try:
        iptu = soup.find(string='IPTU Anual').next_element.next_element
        iptu_clean0 = iptu.get_text()[3:]
        iptu_clean = iptu_clean0.replace(".", "")
    except:
        iptu_clean = 'NA'
    iptu_list.append(iptu_clean)
    
        
    #collect rooms
    try:
        rooms = soup.find(string='Dormitórios').next_element.next_element
        rooms_clean = rooms.get_text()
    except:
        rooms_clean = 'NA'
    rooms_list.append(rooms_clean)

        
    #collect number of parking spaces
    try:
        box = soup.find('span', {'class': "value vagas"})
        box_clean = box.get_text()
    except:
        box_clean = 'NA'
    box_list.append(box_clean)
                  
        
    #collect url
    url_list.append(page)
    
    
    #save date
    date_list.append(date_today)
    
    return soup
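
The slicing used above (`[3:]` for currency values, `[:-2]` for the area) depends on the exact text of each field. A slightly more defensive version is sketched below. It assumes values look like `R$ 1.250.000` and that the area ends with the two-character unit `m²`; both formats are inferred from the slices, not confirmed by the site, and `clean_currency` and `clean_area` are hypothetical helpers.

```python
def clean_currency(text):
    """Turn 'R$ 1.250.000' into '1250000'; return 'NA' for empty input."""
    value = text.strip()
    if value.startswith("R$"):
        value = value[2:].strip()   # drop the currency symbol
    value = value.replace(".", "")  # drop thousands separators
    return value if value else "NA"


def clean_area(text):
    """Turn '120 m²' into '120'; return 'NA' for empty input."""
    value = text.strip()
    if value.endswith("m²"):
        value = value[:-2].strip()  # drop the unit
    return value if value else "NA"


print(clean_currency("R$ 1.250.000"), clean_area("120 m²"))
```

Checking for the prefix and suffix before cutting avoids silently losing digits when a field arrives in a slightly different form.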

### Initializing visit and capture

A group of `if` conditions specifies which pages should not be collected. At the end, all lists are joined into a pandas dataframe.

In [4]:
def crawl_web(url_seed):
    data_base = {}
    tocrawl = [url_seed]
    crawled = []
    
    # reading the xml file
    with open('sitemap.xml') as file_object:
        print('reading sitemap.xml ...')
        for line in file_object:
            soup = BeautifulSoup(line, "html.parser")
        
            for link in soup.find_all('loc'):
                tocrawl.append(str(link.get_text())) 

    print("xml reading finished ... starting search ...\n")
    
    while tocrawl:

        page = tocrawl.pop()

        sys.stdout.write("\rpages analised: {}/{}.".format(len(crawled), (len(crawled) + len(tocrawl))))
        sys.stdout.flush()
    
        if page not in crawled:
            
            # ignore pages whose address does not start with "http:"
            if page[:5] != "http:":
                crawled.append(page)
                continue
            
            # ignore pages from "construtora"
            if "/construtora" in page:
                crawled.append(page)
                continue
            
            # ignore pages from "empreendimento"
            if '/empreendimento' in page: 
                crawled.append(page)
                continue
            
            # ignore pages from "bairro"
            if '/bairro' in page: 
                crawled.append(page)
                continue
            
            # ignore pages with url error
            if '.com.brhttps:' in page: 
                crawled.append(page)
                continue
                       
            # delimiting the search for three neighborhoods of interest:
            # auxiliadora, bela vista, mont serrat
            if "-porto-alegre-auxiliadora-apartamento-" not in page:
                if "-porto-alegre-bela-vista-apartamento-" not in page:
                    if "-porto-alegre-mont-serrat-apartamento-" not in page:
                        crawled.append(page)
                        continue               

            union(tocrawl, check_links(get_all_links(page)))
            crawled.append(page)            
            
            # join list into a dataframe
            data_base = pd.DataFrame({'id': id_list,
                                      'price': price_list,
                                      'area': area_list,
                                      'district': district_list,
                                      'city': city_list,
                                      'type': type1_list,
                                      'segment': segment_list,
                                      'condominium': condominium_list,
                                      'iptu': iptu_list,
                                      'rooms': rooms_list,
                                      'box': box_list,
                                      'url': url_list,
                                      'date': date_list
                                     })
            
            data_base.to_csv((date_today + '-foxter.csv'), sep='\t') # saving csv
            
    print("\nall adresses finished!!")
    print("\ntotal apartaments saved: " + str(data_base.shape[0]))

    return
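
The chain of `if` conditions in `crawl_web` amounts to a single yes/no decision per address, which can be written as one predicate. The sketch below mirrors those conditions; `should_scrape`, `EXCLUDED_PARTS` and `WANTED_PARTS` are hypothetical names, and the example addresses are made up.

```python
# Substrings that mark pages to skip (taken from the conditions above).
EXCLUDED_PARTS = ("/construtora", "/empreendimento", "/bairro", ".com.brhttps:")

# Substrings that mark apartments in the three neighborhoods of interest.
WANTED_PARTS = (
    "-porto-alegre-auxiliadora-apartamento-",
    "-porto-alegre-bela-vista-apartamento-",
    "-porto-alegre-mont-serrat-apartamento-",
)


def should_scrape(page):
    """Return True only for addresses the crawler would collect."""
    if not page.startswith("http:"):
        return False
    if any(part in page for part in EXCLUDED_PARTS):
        return False
    return any(part in page for part in WANTED_PARTS)


base = "http://www.foxterciaimobiliaria.com.br"
print(should_scrape(base + "/imovel/1-porto-alegre-bela-vista-apartamento-3"))
print(should_scrape(base + "/bairro/bela-vista"))
```

Keeping the rules in two tuples makes it easy to add a neighborhood or an exclusion without touching the crawl loop.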

### Web Scraper Initializing

This is where the web scraper starts, from a seed URL.

In [5]:
# date used to identify the collected data file.
date_today = str(date.today())

# data to capture: lists initialization
id_list = []
price_list = []
area_list = []
district_list = []
city_list = []
type1_list = []
segment_list = []
condominium_list = []
iptu_list = []
rooms_list = []
box_list = []
url_list = []
date_list = []



url_seed = ("http://www.foxterciaimobiliaria.com.br") # Initial page - seed

url_segment = '/imovel/' # path prefix that identifies property pages on the site



crawl_web(url_seed)

reading sitemap.xml ...
xml reading finished ... starting search ...

pages analised: 27544/27544.
all adresses finished!!

total apartaments saved: 595
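
The thirteen parallel lists work as long as every field appends exactly once per page. One alternative, sketched here as an assumption rather than as the notebook's design, is to build one dictionary per apartment and collect those in a single list; `pd.DataFrame` also accepts a list of dictionaries. To stay self-contained the sketch writes the tab-separated output with the standard `csv` module; `FIELDS`, `make_row` and the sample values are hypothetical.

```python
import csv
import io

# Column names matching the dataframe built in crawl_web.
FIELDS = ["id", "price", "area", "district", "city", "type", "segment",
          "condominium", "iptu", "rooms", "box", "url", "date"]


def make_row(**values):
    """Build one record, filling any missing field with 'NA'."""
    return {field: values.get(field, "NA") for field in FIELDS}


# Made-up sample records, for illustration only.
rows = [
    make_row(id="1001", price="850000", area="95",
             url="http://example.com/imovel/1001", date="2019-01-01"),
    make_row(id="1002", url="http://example.com/imovel/1002", date="2019-01-01"),
]

# Write tab-separated text to memory; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

With one record per apartment, a field that fails to parse cannot shift the other columns out of alignment.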
