# Capstone 2: Web Scraping #

To build the dataset for the project, we will be scraping the website https://chilepropiedades.cl/. It contains thousands of listings of new and used homes for sale in Chile. To run a search on the website, you choose a locality, type of home (house, apartment, etc.), and the state of the home (new, used, or all).  We will focus on the Metropolitan Region, which is composed of Santiago, Chile’s capital, as well as the surrounding *comunas*. A *comuna* is like a borough or district.  Each listing contains information pertaining to the home such as the total area, number of bedrooms, etc.

The following code is designed to extract the desired information from the website's HTML source. The specific operations involved in web scraping vary considerably and must be tailored to the particular page being scraped, which involves a lot of trial and error. In addition, any change to the site's HTML can break the code, so there is no guarantee that the code below will keep working. The final function, `scrape_to_df`, takes a URL and a number of pages as parameters and returns a dataframe. We simply run a search for a type of home, then pass the URL of the search results to the function along with the number of pages of results.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from random import randint

### Read Webpage into Python ###

Start by getting a sample of HTML to work with. Go to chilepropiedades.cl and run a search. Here we search for houses (*casas*). Once the search is complete, copy the URL of the first page of results. Now use the requests library to fetch the web page by passing the URL to the `get` function, and store the result in a response object called `r`. The response object has a `text` attribute, which can be used to access the HTML.

In [2]:
r = requests.get('https://chilepropiedades.cl/propiedades/venta/casa/region-metropolitana(rm)/0')

In [3]:
# print the first 1000 characters of the HTML
print(r.text[0:1000])

 












<!DOCTYPE HTML>
<html lang="es">
  <head>
    <link rel="alternate" href="https://chilepropiedades.cl" hreflang="es-cl"/>
    <meta name="robots" content="noarchive" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    
    <script type="text/javascript">var callAtTheEnd = [];</script>
    
    

  <title>Venta de Casa en Región Metropolitana(RM) - Avisos ChilePropiedades.cl</title>
  <meta name="description" content="Chile propiedades. Venta, arriendo de departamentos, casas y otras propiedades en Chile. Casa, Venta de Casa en Región Metropolitana(RM)" />
  <meta name="keywords" content="Chile propiedades, portal inmobiliario, Venta, Casa, Venta de Casa en Región Metropolitana(RM), propiedades en Venta, Venta de Casa"/>
  <meta property="og:title" content="Venta de Casa en Región Metropolitana(RM) - Avisos ChilePropiedades.cl"/>
  <meta property="og:description" content

### Parse the HTML ###

Next, parse the HTML using the Beautiful Soup 4 library. This code parses the HTML into an object called `soup`, which Beautiful Soup can search and navigate. (`html.parser` is the parser built into Python's standard library, but others can be used.)

In [4]:
soup = BeautifulSoup(r.text, 'html.parser')

### Collect the Results ###

HTML contains text as well as tags used to mark up the text. The tags are found within angle brackets, and many can be seen in the sample of HTML above. For example, there is an opening tag `<title>`, followed by some text, followed by a closing tag `</title>`. Tags also have attributes; for example, the tag `<html lang="es">` specifies the language of the text as Spanish (*español*). Finally, tags can be nested within one another. Web scraping takes advantage of all of these facts to locate and extract information on a website.
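
To make tags, attributes, and nesting concrete, here is a minimal sketch that uses Python's built-in `html.parser` module on a made-up snippet. The class name `TagLogger` and the sample markup are illustrative assumptions and are not part of the project code.

```python
from html.parser import HTMLParser

# Made-up HTML used only for illustration
SAMPLE_HTML = '<html lang="es"><head><title>Venta de Casa</title></head></html>'


class TagLogger(HTMLParser):
    """Records each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.log = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.log.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagLogger()
parser.feed(SAMPLE_HTML)
for depth, tag, attrs in parser.log:
    # indent each tag according to how deeply it is nested
    print('  ' * depth + tag, attrs)
```

Each tag is printed one level deeper than the tag that contains it, which is the same structure Beautiful Soup lets us navigate.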

From this point on, the code is specific to this particular project. Inspecting the HTML of chilepropiedades.cl reveals that all of the listings for homes are contained within `div` tags with the class `col-sm-12 publication-element`. The following code produces an object called a `ResultSet`, which behaves like a Python list. Use it to access individual listings.

In [5]:
results = soup.find_all('div', attrs={'class':'col-sm-12 publication-element'})

In [6]:
len(results)

10

In [7]:
# the prettify method shows how tags are nested
print(results[0].prettify())

<div class="col-sm-12 publication-element">
 <div class="clp-premium-table-item" id="publication_4227020">
  <div class="margin-left-0 margin-right-0 relative row">
   <div class="col-sm-3 clp-search-image-container" style="position:relative;">
    <div class="new-premium-ribon-clp">
     <i class="glyphicon glyphicon-star">
     </i>
     Destacado
    </div>
    <a href="/ver-publicacion/venta-usada/puente-alto/casa/lugo-sur/4227020">
     <img alt="Venta propiedad usada / Casa / Puente Alto" src="/imagenes/publicacion/venta-usada/casa/puente-alto/0/1971750E39F6814AEAACADA61A0BC8D3FA20799DE44.jpg"/>
    </a>
    <div class="text-center">
     <small>
      10/09/2020
     </small>
    </div>
   </div>
   <div class="col-sm-6">
    <div>
     <h2 class="publication-title-list">
      <a href="/ver-publicacion/venta-usada/puente-alto/casa/lugo-sur/4227020">
       Puente Alto, Lugo sur
      </a>
      <span class="clp-reciente">
       Publicación Reciente
      </span>
     </h2>
   

The HTML code above contains all of the information for the first listing in the results. Some of the information we need is in an `h3` tag, and can be accessed and extracted using the `find` method.

In [8]:
results[0].find('h3').text

'Venta propiedad usada / Casa / Puente Alto'

In [9]:
results[0].find('h3').text.split('/')

['Venta propiedad usada ', ' Casa ', ' Puente Alto']

In [10]:
comuna = results[0].find('h3').text.split('/')[2].strip()
home_type = results[0].find('h3').text.split('/')[1].strip()
state = results[0].find('h3').text.split('/')[0].split(' ')[-2]

print(comuna)
print(home_type)
print(state)

Puente Alto
Casa
usada
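
The indexing above assumes that the header always has exactly three parts separated by slashes. As a hedged sketch, the hypothetical helper `parse_header` below (not part of the original code) performs the same split but returns empty strings when the layout is unexpected, instead of raising an `IndexError`.

```python
def parse_header(text):
    """Split a header such as 'Venta propiedad usada / Casa / Puente Alto'."""
    parts = [part.strip() for part in text.split('/')]
    if len(parts) != 3:
        # Unexpected layout: return empty strings rather than raising IndexError
        return '', '', ''
    operation, home_type, comuna = parts
    # Assumption: the state (usada/nueva) is the last word of the first part
    state = operation.split()[-1] if operation else ''
    return comuna, home_type, state


print(parse_header('Venta propiedad usada / Casa / Puente Alto'))
```

Whether a silent fallback or an exception is preferable depends on how often the site deviates from this layout; the fallback simply keeps a long scrape from stopping on one odd listing.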


There is more useful information contained within `small` tags. Since there are multiple small tags, use the `find_all` method, which again returns a ResultSet.

In [11]:
results[0].find_all('small')

[<small>10/09/2020</small>,
 <small>
 <span class="light-bold">Terreno:</span> 90 m²
     </small>,
 <small>
 <span class="light-bold">Superficie Construida:</span> 60 m²
     </small>,
 <small>
 <span class="light-bold">Habitaciones:</span> 3
     </small>,
 <small>
 <span class="light-bold">Baños:</span> 3
     </small>]

This time it is a little more challenging to access the information needed. Some of the listings have missing fields, so position in the list cannot be relied on for indexing. The label that identifies each field is inside a `span` tag within each `small` tag, but the value itself sits outside the `span`. There is also a field indicating that the home is furnished, which appears only for furnished homes. The function below extracts six values and returns them as a tuple. If a field is missing, an empty string is returned in its place.

In [12]:
def smalls_parse(result):   
    smalls = result.find_all('small')
    
    date = smalls[0].text

    tot_area = ''
    built_area = ''
    bedrooms = ''
    bathrooms = ''
    furnished = ''

    for rec in smalls[1:]:
        span = rec.find('span', class_="light-bold")
        if span is None:
            # skip any small tag that has no field label
            continue
        label = span.text
        if label == 'Terreno:':
            tot_area = rec.contents[-1].split()[0]
        elif label == 'Superficie Construida:':
            built_area = rec.contents[-1].split()[0]
        elif label == 'Habitaciones:':
            bedrooms = rec.contents[-1].split()[0]
        elif label == 'Baños:':
            bathrooms = rec.contents[-1].split()[0]
        elif label == 'Amoblado:':
            furnished = rec.contents[-1].strip()
    return date, tot_area, built_area, bedrooms, bathrooms, furnished

In [13]:
date, tot_area, built_area, bedrooms, bathrooms, furnished = smalls_parse(results[1])
print(date)
print(tot_area)
print(built_area)
print(bedrooms)
print(bathrooms)
print(furnished)

10/09/2020
140
90
4
2
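
An alternative to the chain of comparisons is to collect every label and value into a dictionary and then look up the fields of interest, falling back to an empty string for missing ones. The sketch below runs on a made-up fragment that imitates the structure shown earlier. The function name `smalls_to_dict` and the sample HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure of one listing
SAMPLE = """
<div>
  <small>10/09/2020</small>
  <small><span class="light-bold">Terreno:</span> 90 m²</small>
  <small><span class="light-bold">Baños:</span> 3</small>
</div>
"""


def smalls_to_dict(result):
    """Map each label (e.g. 'Terreno:') to the text that follows its span."""
    fields = {}
    for small in result.find_all('small'):
        label = small.find('span', class_='light-bold')
        if label is None:
            continue  # e.g. the date, which has no label
        # The full text of the tag minus the label leaves the value
        value = small.get_text().replace(label.get_text(), '', 1).strip()
        fields[label.get_text().strip()] = value
    return fields


fields = smalls_to_dict(BeautifulSoup(SAMPLE, 'html.parser'))
print(fields.get('Terreno:', ''))       # field present in the sample
print(fields.get('Habitaciones:', ''))  # missing field falls back to ''
```

Unlike `smalls_parse`, this version keeps the unit (`m²`) in the value, so it would still need to be split off before the value is stored as a number.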



This code extracts the address, which may be useful later, although in many cases it is not specific.

In [14]:
address = results[0].find('h2', class_="publication-title-list").text.split('\n')[1]
print(address)

Puente Alto, Lugo sur


Also scrape the description of the home and assign it to the variable `description`. This will make the final dataset larger in size, but the descriptions contain information that might be helpful.

In [15]:
description = results[0].find('div', class_="font-size-small").text
print(description)

Casa en Hacienda Los Conquistadores (Puente Alto) Acogedora casa con segundo piso, ubicada en barrió residencial en Hacienda Los Conquistadores en Puente Alto. Propiedad inmersa en un sector residencial consolidado, tranquilo, con área verde. Excelente conexión y locomoción a través de Av. Camilo...


Next, locate the value for price, as well as the unit in which the price is expressed, and then assign them to the variables `price` and `unit`.

In [16]:
results[0].find_all('span', class_="clp-value-container")

[<span areaunit="" class="clp-value-container" data-no-value-unit="" value="8.2E7" valueunit="1">
         82.000.000
       </span>,
 <span areaunit="" class="clp-value-container" data-no-value="" valueunit="1">
         CLP
       </span>]

In [17]:
price = results[0].find_all('span', class_="clp-value-container")[0].text.strip()
unit = results[0].find_all('span', class_="clp-value-container")[1].text.strip()

print(price)
print(unit)

82.000.000
CLP
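
Note that the price is scraped as text and uses periods as thousands separators. It is left as text at this stage, but as a rough sketch of how it could later be converted to a number, the hypothetical helper `parse_price` below strips the separators. It assumes that periods never mark decimals and that a comma, if present, does.

```python
def parse_price(text):
    """Convert a string such as '82.000.000' or '4.100' to a float.

    Assumes periods are thousands separators and a comma, if present,
    marks the decimal part.
    """
    cleaned = text.strip().replace('.', '').replace(',', '.')
    return float(cleaned)


print(parse_price('82.000.000'))
print(parse_price('4.100'))
```

Passing such strings directly to `float` would raise an error for `'82.000.000'` and would misread `'4.100'` as 4.1, which is why the separators have to be handled explicitly.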


### Building the Dataset ###

Now we know how to extract all of the information needed to build the dataframe. The function below combines all of the code written so far and loops over the ResultSet, creating a tuple of values for each listing. Each tuple is then appended to a list called `records`.

In [18]:
# Creates a list of tuples for the records on each page
def scrape_page(results):
    records = []
    
    for result in results:
        comuna = result.find('h3').text.split('/')[2].strip()
        home_type = result.find('h3').text.split('/')[1].strip()
        state = result.find('h3').text.split('/')[0].split(' ')[-2]
        address = result.find('h2', class_="publication-title-list").text.split('\n')[1]
        description = result.find('div', class_="font-size-small").text
        price = result.find_all('span', class_="clp-value-container")[0].text.strip()
        unit = result.find_all('span', class_="clp-value-container")[1].text.strip()

        date, tot_area, built_area, bedrooms, bathrooms, furnished = smalls_parse(result)

        records.append((date, comuna, home_type, state, tot_area, built_area, bedrooms, bathrooms, furnished, 
                        address, description, price, unit))
    
    return records

Time to try out the new function on the ResultSet that was scraped from the webpage. After that, convert the list of tuples into a pandas DataFrame.

In [19]:
records = scrape_page(results)

In [20]:
cols_list = ['date', 'comuna', 'home_type', 'state', 'total_area', 'built_area', 'bedrooms', 'bathrooms', 
             'furnished', 'address', 'description', 'price', 'unit']
df = pd.DataFrame(records, columns=cols_list)
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
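
The `dayfirst=True` argument matters because the site writes dates as day/month/year, so `10/09/2020` means 10 September rather than 9 October. The same interpretation can be sketched with the standard library alone. The helper name `parse_listing_date` is an assumption made for this example.

```python
from datetime import datetime


def parse_listing_date(text):
    """Parse a day/month/year string such as '10/09/2020'."""
    return datetime.strptime(text.strip(), '%d/%m/%Y').date()


# Prints the date in ISO order (year-month-day)
print(parse_listing_date('10/09/2020'))
```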

In [21]:
df

Unnamed: 0,date,comuna,home_type,state,total_area,built_area,bedrooms,bathrooms,furnished,address,description,price,unit
0,2020-09-10,Puente Alto,Casa,usada,90.0,60,3.0,3,,"Puente Alto, Lugo sur",Casa en Hacienda Los Conquistadores (Puente Al...,82.000.000,CLP
1,2020-09-10,San Bernardo,Casa,usada,140.0,90,4.0,2,,"San Bernardo, Villa maestranza iii",Excelente chalet ubicado en la calle Eduardo O...,89.500.000,CLP
2,2020-09-09,Lampa,Casa,usada,1.0,110,4.0,1,,"Lampa, Arturo Prat",Oportunidad Inversionistas o particualares! ca...,4.100,UF
3,2020-09-09,El Monte,Casa,usada,1.8,200,,1,,"El Monte, Los Libertadores",Acepta ofertas. Ideal negocio y/o vivir. Cason...,6.600,UF
4,2020-09-09,El Monte,Casa,usada,5.0,360,4.0,4,,"El Monte, Chinigue","Excelente oportunidad, plusvalía. Vendo estupe...",235.000.000,CLP
5,2020-09-09,Quinta Normal,Casa,usada,250.0,130,5.0,1,,"Quinta Normal, Salvador Gutiérrez",Vendo amplia casa. Conversable. Cuenta con 5 d...,130.000.000,CLP
6,2020-09-09,Maipú,Casa,usada,130.0,110,4.0,2,,"Maipú, Ramayana , Villa Las Terrazas",Linda casa de 2 pisos en pasaje cercano al Met...,150.000.000,CLP
7,2020-09-09,Lo Barnechea,Casa,usada,492.0,212,5.0,4,,"Lo Barnechea, Avenida San Josemaría Escrivá de...",San Josemaría Escrivá de Balaguer / La Torment...,13.900,UF
8,2020-09-09,Macul,Casa,usada,550.0,110,4.0,1,,"Macul, Nicanor Molinare",Se vende casa en excelente ubicación de Villa ...,255.000.000,CLP
9,2020-09-09,Peñalolén,Casa,usada,160.0,120,5.0,3,,"Peñalolén, Camino Peñalolén","Peñalolén, Se vende casa de tres pisos en pasa...",180.000.000,CLP


We have a DataFrame! The `scrape_page` function successfully extracted the necessary information from the first page of results. This is great, except that the DataFrame has only 10 rows. There are hundreds of pages of search results, so scraping each page individually would take too long. The solution is to write yet another function that loops over all the pages of search results and returns a DataFrame containing all of the listings. This can be done by examining how the URL changes when displaying each successive page of results, and writing a loop that modifies the URL accordingly. When scraping multiple pages, it is good practice to make the program sleep for a random interval after scraping each page of results. This avoids overwhelming the website's server and possibly getting our IP address blocked.

In [22]:
# Scrapes multiple pages of search results and returns DataFrame

def scrape_to_df(url, num_pages):
    all_records = []
    
    for n in range(num_pages):
        r = requests.get(url[:-1] + str(n))
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', attrs={'class':'col-sm-12 publication-element'})
        records = scrape_page(results)
        all_records += records
        sleep(randint(2,10))
    
    cols_list = ['date', 'comuna', 'home_type', 'state', 'total_area', 'built_area', 'bedrooms', 'bathrooms', 
                 'furnished', 'address', 'description', 'price', 'unit']
    df = pd.DataFrame(all_records, columns=cols_list)
    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    
    return df
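
The expression `url[:-1] + str(n)` removes only the final character of the URL, so it relies on the starting URL ending in a single-digit page number such as `0`. As a hedged alternative, the hypothetical helper `page_urls` below splits on the last slash instead, so it works no matter how many digits the page number in the starting URL has.

```python
def page_urls(first_page_url, num_pages):
    """Build one URL per results page from the URL of any results page."""
    # Everything after the last '/' is the page number; discard it
    base, _page = first_page_url.rsplit('/', 1)
    return [f'{base}/{n}' for n in range(num_pages)]


urls = page_urls(
    'https://chilepropiedades.cl/propiedades/venta/casa/region-metropolitana(rm)/0', 3
)
for page_url in urls:
    print(page_url)
```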

Now that all the functions are defined, it is time to try them out on the website. A search for studio apartments for sale yields only three results. This will take much less time to scrape, so it is a good chance to test the code before using it on larger sets of results. We just enter the URL and the number of pages, which is one.

In [23]:
url = 'https://chilepropiedades.cl/propiedades/venta/estudio/region-metropolitana(rm)/0'
studio_df = scrape_to_df(url, 1)
studio_df.shape

(3, 13)

In [24]:
studio_df.head()

Unnamed: 0,date,comuna,home_type,state,total_area,built_area,bedrooms,bathrooms,furnished,address,description,price,unit
0,2020-09-09,Santiago,Estudio,usada,20,,1,1,Sí,"Santiago, Estación Central","Oportunidad de Inversión, Estación Central, Ve...",1.700,UF
1,2020-09-09,Santiago,Estudio,usada,29,,1,1,Sí,"Santiago, Santiago","Oportunidad de Inversión, Santiago Centro, Ave...",1.780,UF
2,2020-07-10,Recoleta,Estudio,usada,31,,1,1,,"Recoleta, Santos Dumont 867",DISPONIBILIDAD INMEDIATA *Ideal Inversión Inmo...,45.000.000,CLP


Everything looks good, so now we try a slightly larger set of results covering multiple pages. In Spanish, a *parcela* is a plot of land, often with a house on it. The website has 284 listings for *parcelas* on 29 pages.

In [25]:
url = 'https://chilepropiedades.cl/propiedades/venta/parcela/region-metropolitana(rm)/0'
parcela_df = scrape_to_df(url, 29)
parcela_df.shape

(284, 13)

In [26]:
parcela_df.head()

Unnamed: 0,date,comuna,home_type,state,total_area,built_area,bedrooms,bathrooms,furnished,address,description,price,unit
0,2020-08-31,Melipilla,Parcela,nueva,5.0,,,,,"Melipilla, huechun","SE VENDE GRAN PARCELA DE AGRADO, con las sigui...",50.000.000,CLP
1,2020-08-28,Talagante,Parcela,usada,7.191,,4.0,3.0,,"Talagante, Parcela en Lonquén Sur. Talagante",- (CE) Propiedad de 7.191 m2 con 2 Casas - Lot...,290.000.000,CLP
2,2020-08-28,Curacaví,Parcela,usada,11.701,490.0,6.0,5.0,,"Curacaví, Parcela de 11.701 m2 en Condominio L...",- Casa Aislada Sólida con Segundo Piso. (360 m...,190.000.000,CLP
3,2020-08-28,Colina,Parcela,usada,5.787,310.0,,6.0,,"Colina, Chicureo, Excelente Parcela de Agrado ...","- Casas del Alba Pc 18, Chicureo. Colina - Cas...",310.000.000,CLP
4,2020-08-21,San Pedro,Parcela,usada,40.0,,,,,"San Pedro, San pedro","Esta parcela agrícola de más de 4 Há, cuenta c...",140.000.000,CLP


The code has worked again, so it is time to try it on the search results for houses (*casas*), which contain more than two thousand listings.  This step can take quite a while.

In [27]:
url = 'https://chilepropiedades.cl/propiedades/venta/casa/region-metropolitana(rm)/0'
casa_df = scrape_to_df(url, 215)
casa_df.shape

(2142, 13)
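
The number of pages passed to `scrape_to_df` follows from the number of listings. The sketch below assumes every results page holds 10 listings, as the first page did, and uses a hypothetical helper named `pages_needed`. The total number of listings still has to be read from the website.

```python
import math


def pages_needed(num_listings, per_page=10):
    """Number of results pages required to show num_listings listings."""
    return math.ceil(num_listings / per_page)


print(pages_needed(2142))  # listings found for casas
```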

In [28]:
casa_df.head()

Unnamed: 0,date,comuna,home_type,state,total_area,built_area,bedrooms,bathrooms,furnished,address,description,price,unit
0,2020-09-10,Puente Alto,Casa,usada,90.0,60,3.0,3,,"Puente Alto, Lugo sur",Casa en Hacienda Los Conquistadores (Puente Al...,82.000.000,CLP
1,2020-09-10,San Bernardo,Casa,usada,140.0,90,4.0,2,,"San Bernardo, Villa maestranza iii",Excelente chalet ubicado en la calle Eduardo O...,89.500.000,CLP
2,2020-09-09,Lampa,Casa,usada,1.0,110,4.0,1,,"Lampa, Arturo Prat",Oportunidad Inversionistas o particualares! ca...,4.100,UF
3,2020-09-09,El Monte,Casa,usada,1.8,200,,1,,"El Monte, Los Libertadores",Acepta ofertas. Ideal negocio y/o vivir. Cason...,6.600,UF
4,2020-09-09,El Monte,Casa,usada,5.0,360,4.0,4,,"El Monte, Chinigue","Excelente oportunidad, plusvalía. Vendo estupe...",235.000.000,CLP


Finally, we scrape the results for apartments (*departamentos*), which is the category with the most listings, at over three thousand.

In [29]:
url = 'https://chilepropiedades.cl/propiedades/venta/departamento/region-metropolitana(rm)/0'
depto_df = scrape_to_df(url, 344)
depto_df.shape

(3434, 13)

In [30]:
depto_df.head()

Unnamed: 0,date,comuna,home_type,state,total_area,built_area,bedrooms,bathrooms,furnished,address,description,price,unit
0,2020-09-10,Santiago,Departamento,usada,,,,,,"Santiago, Santiago, Región Metropolitana",CODIGO INTERNO: 56117 Se Vende Impecable Depar...,73.000.000,CLP
1,2020-09-10,Puente Alto,Departamento,usada,42.0,,2.0,1.0,,"Puente Alto, Sgto Menadier 2779","Departamento ubicado en segundo piso, en plena...",25.000.000,CLP
2,2020-09-09,Ñuñoa,Departamento,usada,70.0,,2.0,2.0,,"Ñuñoa, Irarrázaval 1401",Depto Vista despejada 2D 2B E B *2 dormitorios...,4.900,UF
3,2020-09-09,Santiago,Departamento,usada,23.0,,1.0,1.0,,"Santiago, Huérfanos",Oportunidad Inversionistas o particulares. Ven...,1.575,UF
4,2020-09-09,Santiago,Departamento,usada,163.0,,3.0,4.0,,"Santiago, Vespucio Norte","Vespucio Norte, Las Condes, Vendo confortable ...",13.500,UF


### Combine Results into a Single DataFrame ###

Now that the web scraping is finished, the next step combines all of the dataframes that have been created into one large dataframe.

In [31]:
# Combine the dataframes
combined_df = pd.concat([depto_df, casa_df, parcela_df, studio_df], axis=0)
combined_df.shape

(5863, 13)

In [32]:
# Reset the index and check the indices of the final rows
combined_df = combined_df.reset_index(drop=True)
combined_df.tail()

Unnamed: 0,date,comuna,home_type,state,total_area,built_area,bedrooms,bathrooms,furnished,address,description,price,unit
5858,2020-02-20,Padre Hurtado,Parcela,usada,5.0,300.0,4,3,,"Padre Hurtado, Camino antiguo de Valparaiso/Ru...",Naturaleza a sólo 35 minutos de Santiago. Prec...,249.000.000,CLP
5859,2020-02-20,Melipilla,Parcela,usada,5.05,370.0,4,6,,"Melipilla, Hijuela 1/___","Excelente parcela, hall de entrad, muy espacio...",14.600,UF
5860,2020-09-09,Santiago,Estudio,usada,20.0,,1,1,Sí,"Santiago, Estación Central","Oportunidad de Inversión, Estación Central, Ve...",1.700,UF
5861,2020-09-09,Santiago,Estudio,usada,29.0,,1,1,Sí,"Santiago, Santiago","Oportunidad de Inversión, Santiago Centro, Ave...",1.780,UF
5862,2020-07-10,Recoleta,Estudio,usada,31.0,,1,1,,"Recoleta, Santos Dumont 867",DISPONIBILIDAD INMEDIATA *Ideal Inversión Inmo...,45.000.000,CLP


### Save DataFrame to a CSV File ###

The final step is to save the new dataframe as a CSV file. Remember that the listings on the website are constantly changing. This is the raw data scraped from the website, before any cleaning. We use semicolons as the field delimiter because the fields themselves might contain commas. (In Spanish, decimals are written with commas rather than periods.)

In [33]:
# Save dataframe as a csv file
combined_df.to_csv('scraped_data.csv', sep=';', index=False)
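
A file written with a semicolon delimiter has to be read back with the same delimiter. The round trip can be sketched with the standard `csv` module and an in-memory buffer. The two rows below are sample values taken from the listings above, and the variable names are illustrative.

```python
import csv
import io

rows = [
    ['address', 'price', 'unit'],
    ['Puente Alto, Lugo sur', '82.000.000', 'CLP'],
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
csv.writer(buffer, delimiter=';').writerows(rows)

# Read the data back using the same delimiter
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter=';'))
print(restored == rows)
```

With pandas, the equivalent when loading the saved file is to pass `sep=';'` to `pd.read_csv`.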