<p align="center"><img src="https://lp.eloscience.com/wp-content/uploads/2021/01/data-hands.png" width="60%"></p>

The housing market is something everyone has to deal with, whatever country they live in, which makes it a great topic for data analysis.<br>
<br>
I will be moving to another city within the next few days, so I took the opportunity to use Python to research some houses and identify potential opportunities in that market.<br>
<br>
I will use Web Scraping to extract data from the real estate market.<br>
<br>

# **Part I - Understanding how Web Scraping works**

## **1. Importing the libraries**

In this first phase we only need a few libraries: **`requests`** fetches the web page, **`bs4`** (more precisely `BeautifulSoup`) parses the HTML so that we can pick out its `tags`, **`pandas`** builds the DataFrame that holds the data at the end, and the standard-library module **`itertools`** helps us clean up the extracted strings.

In [None]:
# create iterators
import itertools 

# create dataframe and analyze data
import pandas       as pd

# inspect html
from bs4            import BeautifulSoup

# make requests to the site
from requests       import get
import requests

We are going to create request headers, because many sites block requests that do not look like they come from a browser. The header we set is `User-Agent`, a string that identifies the browser and its version. We don't need to worry about its internals here: the format is standard, so we simply copy a typical browser value into our variable.

In [None]:
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

## **2. Requesting the site**

The first thing we need to do is create a variable to hold the address of the site we want to search. After that, we make the request using the `requests` library, passing in the headers we created earlier.

In [None]:
sapo = 'https://casa.sapo.pt/alugar-apartamentos/odivelas/'
response = get(sapo, headers=headers)

Here we simply display the response to our request. HTTP status codes that start with 2, such as 200, indicate success.

In [None]:
print(response)

<Response [200]>
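
As a rough sketch of how this check could be automated, the helper below treats any code from 200 to 299 as success. The name `is_success` is an assumption made for this illustration; it is not part of `requests`.

```python
from http import HTTPStatus

def is_success(status_code: int) -> bool:
    """Return True for HTTP status codes in the 2xx (success) range."""
    return 200 <= status_code < 300

print(is_success(HTTPStatus.OK))         # True, since OK is 200
print(is_success(HTTPStatus.NOT_FOUND))  # False, since NOT_FOUND is 404
```

With a real `requests` response, `response.status_code` holds this number, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.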


## **3. HTML analysis**

We will use the BeautifulSoup library. We pass it the body of our response as text, using **`.text`**, together with a second argument that names the parser. A few parsers are available; in this example we use **'lxml'**, which is fast and lenient with imperfect HTML.

In [None]:
soup = BeautifulSoup(response.text, 'lxml')

After inspecting the HTML of the page, we can locate the information we want. We need to identify which tag holds the data; in our example each listing sits in a **`div`** tag. After identifying the tag, we also need its class, which pins down the content more precisely.
<br>
<br>
We pass both to the **`.find_all`** method, which, as the name says, finds every element matching the parameters we give it; in our example, the tag **`div`** and its **`class`**.

In [None]:
house_containers = soup.find_all('div', class_='searchResultProperty item hastitle')
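
One detail worth knowing: when `class_` is given a string with several class names, BeautifulSoup compares it with the exact value of the `class` attribute, so the order of the names matters. A minimal sketch on a made-up HTML fragment (the markup below is invented for illustration, and `html.parser` is used only so the example runs without `lxml`) shows this next to a CSS selector, which ignores the order:

```python
from bs4 import BeautifulSoup

html = """
<div class="searchResultProperty item hastitle">Listing A</div>
<div class="item searchResultProperty hastitle">Listing B</div>
<div class="banner">Advert</div>
"""
sample = BeautifulSoup(html, 'html.parser')

exact = sample.find_all('div', class_='searchResultProperty item hastitle')
by_css = sample.select('div.searchResultProperty.item.hastitle')

print(len(exact))   # matches only the div whose class string is identical
print(len(by_css))  # matches both listing divs, whatever the class order
```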

Next we store the first listing in a variable and call **`.find_all`** again, this time searching only within that listing for the tag **`span`**.

In [None]:
first = house_containers[0]
first.find_all('span')

[<span class="titleG3">T2 em Odivelas recuperado junto ao Mcdonald`s</span>,
 <span class="messengerOFFLINE" id="MC_PropertyInList_repProperties_spanMessenger_1" onclick="goToMessenger('/a30df3c4-5d76-11eb-921c-060000000052.html');" style="cursor: pointer;">
 <i class="fa fa-comment fa-lg"></i>
 </span>,
 <span>
                         Apartamento T2, Odivelas, Lisboa
                     </span>,
 <span class="btnContactPVPI" id="MC_PropertyInList_repProperties_btnContactPVPINormal_1" onclick="ShowContactForm('a30df3c4-5d76-11eb-921c-060000000052', '4', '5', true, false, '0'); return false;" style="z-index: 9999;" title="Contacte Anunciante">Contacte Anunciante</span>,
 <span>800 <strong title="Euro">€</strong></span>]

### **3.1 Getting the Price**

Each call returns only the elements we asked for. The next step is to obtain the property price on its own, using the same pattern as before: we search for the tag **`span`**, pick the price by its position in the list (here the last item), and convert it to text.

In [None]:
price_1 = first.find_all('span')[-1].text
price_1

'800 €'

**3.1.2 If necessary**

We can see at this point that the correct value was returned, but some prices may come back containing a non-breaking space, **`'\xa0'`**, for example as a thousands separator. In that case we need a **`.replace()`** to remove this unwanted character.

In [None]:
price_1 = price_1.replace('\xa0', '')
price_1

'800 €'

Finally we need to turn this text into a number. **`itertools.takewhile()`** yields the characters of the string for as long as they are digits, **`.join()`** concatenates them back into a single string, and **`int()`** converts that string into an integer.

In [None]:
price_1 = int(''.join(itertools.takewhile(str.isdigit, price_1)))
print(price_1, type(price_1))

800 <class 'int'>
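
The cleaning steps above can be folded into one helper. This is only a sketch: `parse_price` is a hypothetical name, and it assumes the amount always comes before the `€` sign.

```python
import re

def parse_price(text):
    """Extract the euro amount from a price string as an int, or None if there is none."""
    amount = text.split('€')[0]          # keep only what precedes the currency sign
    digits = re.sub(r'\D', '', amount)   # drop spaces, '\xa0' and any other non-digit
    return int(digits) if digits else None

print(parse_price('800 €'))                # 800
print(parse_price('1\xa0250 €'))           # 1250
print(parse_price('Contacte Anunciante'))  # None
```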


### **3.2 Location**

The process here is very similar to the previous one. The only difference is that we need to find the tag that holds the location of the property, along with the name of its class. In our example this information is in a paragraph tag, i.e. **`p`**.

In [None]:
location = first.find_all('p', class_='searchPropertyLocation HasFeatures')[0].text
location

'\r\n                    Odivelas, Lisboa\r\n                '

Once again the extracted text comes with surrounding whitespace and line breaks, but this is easily solved with the Python string method **`.strip()`**. We also need to decide which part of the text to keep: since we already know the city and only want the name of the neighborhood, we can cut the string at the first comma (**`,`**). The slice below starts after the leading whitespace and stops where `location.find(',')` finds the comma, leaving only the part we asked for.

In [None]:
location[7:location.find(',')].strip()

'Odivelas'
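
An alternative that does not depend on a fixed start index is to split on the comma. The sketch below assumes the neighborhood is always the first comma-separated field; `extract_zone` is a name chosen for this example.

```python
def extract_zone(raw_location):
    """Return the text before the first comma, without surrounding whitespace."""
    return raw_location.split(',')[0].strip()

raw = '\r\n                    Odivelas, Lisboa\r\n                '
print(extract_zone(raw))  # Odivelas
```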

### **3.3 Size of the Property**

This is the simplest part of our job: we just request the tag **`p`** and pick the element at the position that holds the size of the property.

In [None]:
first.find_all('p')[7]

<p>100m²</p>
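
To use the size as a number, the text of the tag still has to be parsed. A possible sketch, assuming sizes look like `100m²` and that a missing size is shown as `-` (as the loop in Part II assumes); `parse_area` is a hypothetical helper:

```python
import re

def parse_area(text):
    """Return the leading number of a size string such as '100m²' as a float, or None."""
    match = re.match(r'\s*(\d+(?:[.,]\d+)?)', text.replace('\xa0', ''))
    if match is None:
        return None
    # accept a decimal comma as well as a decimal point
    return float(match.group(1).replace(',', '.'))

print(parse_area('100m²'))  # 100.0
print(parse_area('-'))      # None
```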

### **3.4 Property Description**

We use the pattern we are used to: search for the tag, filter by class, convert the result to text, and apply `.strip()` to remove the surrounding whitespace.

In [None]:
first.find_all('p', class_='searchPropertyDescription')[0].text.strip()

'T2 recuperado junto ao Mcdonald`s e escola secundária de Odivelas, com áreas bastante generosas situado num prédio em bom estado de conservação, composto por, cozinha semi-equipada, dispensa, quarto com janelas duplas, casa (...)'

### **3.5 Extract links**

Now we want to extract the links so that we can open each listing simply and quickly. For this we use a **`for` loop** that goes through every link tag, **`a`**, inside the listing and reads its **`href`** attribute, which holds the link target.

In [None]:
for url in first.find_all('a'):
  print(url.get('href'))

/alugar-apartamento-t2-odivelas-perto-escola,centro-da-cidade,policia,transportes-publicos-com-arrecadacao-tem-varandas,sotao,marquise,varanda-a30df3c4-5d76-11eb-921c-060000000052.html
/alugar-apartamento-t2-odivelas-perto-escola,centro-da-cidade,policia,transportes-publicos-com-arrecadacao-tem-varandas,sotao,marquise,varanda-a30df3c4-5d76-11eb-921c-060000000052.html
/agencia/h4-mediacao-imobiliaria,-lda/?cl=14135&sys=5
/alugar-apartamento-t2-odivelas-perto-escola,centro-da-cidade,policia,transportes-publicos-com-arrecadacao-tem-varandas,sotao,marquise,varanda-a30df3c4-5d76-11eb-921c-060000000052.html
/alugar-apartamento-t2-odivelas-perto-escola,centro-da-cidade,policia,transportes-publicos-com-arrecadacao-tem-varandas,sotao,marquise,varanda-a30df3c4-5d76-11eb-921c-060000000052.html
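
The same address appears several times because several `a` tags in a listing point to it. If only the distinct links are needed, one option is to deduplicate while keeping the original order. The `hrefs` list below is a shortened, made-up stand-in for the output above:

```python
# Hypothetical, shortened hrefs for illustration
hrefs = [
    '/alugar-apartamento-t2-odivelas-example.html',
    '/alugar-apartamento-t2-odivelas-example.html',
    '/agencia/example-agency/',
    '/alugar-apartamento-t2-odivelas-example.html',
]

# dict keys keep insertion order (Python 3.7+), so repeats are dropped
# while the first occurrence of each link stays in place
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```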


To top it off, we can build the full address by concatenating the base address of the site with the path that comes after the leading **`/`**. The one thing to watch is the slicing: we drop the leading `/` from the `href` so that the address does not end up with a double slash, and we drop the trailing `.html`; otherwise the resulting link may not open correctly.

In [None]:
'https://casa.sapo.pt/' + first.find_all('a')[0].get('href')[1:-5]

'https://casa.sapo.pt/alugar-apartamento-t2-odivelas-perto-escola,centro-da-cidade,policia,transportes-publicos-com-arrecadacao-tem-varandas,sotao,marquise,varanda-a30df3c4-5d76-11eb-921c-060000000052'
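
As an alternative to slicing and concatenating by hand, the standard library's `urllib.parse.urljoin` resolves a relative `href` against a base address. A small sketch, using a shortened, made-up path in place of the real one; note that it keeps the `.html` ending:

```python
from urllib.parse import urljoin

base = 'https://casa.sapo.pt/'
href = '/alugar-apartamento-t2-odivelas-example.html'  # hypothetical path

full_url = urljoin(base, href)
print(full_url)  # https://casa.sapo.pt/alugar-apartamento-t2-odivelas-example.html
```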

# **Part II - Store the data in a DataFrame**

In this step we bring together everything covered so far and create a DataFrame, so that we can store our data in a simple and clear way. First we create an empty list for each field we are going to extract; the loop will fill these lists, keeping the data organized.

## **4.0 Empty lists**

In [None]:
titles = []
prices = []
areas = []
zone = []
condition = []
descriptions = []
urls = []

## **4.1 Loop to extract data**

Moving towards the end of our project, we need a `for` loop to capture the data from every listing on every page. We simply repeat the steps from before, this time inside the loop.

In [None]:
%%time

n_pages = 0

for page in range(0,10):
    n_pages += 1
    sapo_url = 'https://casa.sapo.pt/alugar-apartamentos/odivelas/'+'&pn='+str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'lxml')
    house_containers = page_html.find_all('div', class_='searchResultProperty item hastitle')
    if house_containers != []:
        for container in house_containers:
            
            # Price            
            price = container.find_all('span')[2].text
            if price == 'Contacte Anunciante':
                price = container.find_all('span')[3].text
                if price.find('/') != -1:
                    price = price[0:price.find('/')-1]
            if price.find('/') != -1:
                price = price[0:price.find('/')-1]
            
            price_ = [int(price[s]) for s in range(0,len(price)) if price[s].isdigit()]
            price = ''
            for x in price_:
                price = price+str(x)
            prices.append(int(price))

            # Zone
            location = container.find_all('p', class_='searchPropertyLocation HasFeatures')[0].text
            location = location[7:location.find(',')]
            zone.append(location)

            # Title
            name = container.find_all('span')[0].text
            titles.append(name)

            # Status
            status = container.find_all('p')[5].text
            condition.append(status)

            # Area
            m2 = container.find_all('p')[9].text
            if m2 != '-':
                m2 = m2.replace('\xa0','')
                m2 = float("".join(itertools.takewhile(str.isdigit, m2)))
                areas.append(m2)
                
            else:
                m2 = container.find_all('p')[7].text
                if m2 != '-':
                    m2 = m2.replace('\xa0','')
                    m2 = float("".join(itertools.takewhile(str.isdigit, m2)))
                    areas.append(m2)
                else:
                    areas.append(m2)

            # Description
            desc = container.find_all('p', class_='searchPropertyDescription')[0].text[:-1].strip()
            descriptions.append(desc)

            # url
            link = 'https://casa.sapo.pt/' + container.find_all('a')[0].get('href')[1:-5]
            urls.append(link)

    else:
        break
    
print('You scraped {} pages containing {} properties.'.format(n_pages, len(titles)))

You scraped 10 pages containing 180 properties.
CPU times: user 573 ms, sys: 13 ms, total: 586 ms
Wall time: 8.77 s
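
The structure of this loop can be separated from the network calls, which makes it easy to try out offline. The sketch below is an illustration under stated assumptions rather than the code used above: `scrape_pages`, `fetch_page` and `parse_page` are hypothetical names, and the fake pages stand in for real responses.

```python
def scrape_pages(fetch_page, parse_page, max_pages=10):
    """Fetch pages one by one, parse each page's own HTML, stop at the first empty page."""
    pages_read = 0
    items = []
    for page in range(1, max_pages + 1):
        html = fetch_page(page)      # in real use: get(url, headers=headers).text
        found = parse_page(html)     # in real use: BeautifulSoup(...).find_all(...)
        if not found:
            break
        pages_read += 1
        items.extend(found)
    return pages_read, items

# Fake site with two pages of listings followed by an empty page
fake_site = {1: 'a,b,c', 2: 'd,e', 3: ''}
pages_read, items = scrape_pages(
    fetch_page=lambda page: fake_site.get(page, ''),
    parse_page=lambda html: [x for x in html.split(',') if x],
)
print(pages_read, items)
```

Counting a page only after it yields listings keeps the final page count in line with the pages that actually contributed data.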


## **4.2 DataFrame**

### **4.2.1 Creating the DataFrame**

Here we create a list containing the column names. Then we create a variable named after the location I chose and assign it the result of **`pd.DataFrame`**, to which we pass a dictionary that maps each column heading to the list holding its data. Indexing with `[cols]` at the end puts the columns in the order we want.

In [None]:
cols = ['Title', 'Zone', 'Price', 'Size (m²)', 'Status', 'Description', 'URL']

odivelas = pd.DataFrame({'Title': titles,
                           'Price': prices,
                           'Size (m²)': areas,
                           'Zone': zone,
                           'Status': condition,
                           'Description': descriptions,
                           'URL': urls})[cols]
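
An alternative to keeping one list per column is to collect one dictionary per listing, which keeps every field of a row together. A brief sketch with two invented rows (the values are placeholders, not scraped data):

```python
import pandas as pd

cols = ['Title', 'Zone', 'Price', 'Size (m²)']

# Placeholder rows for illustration only
rows = [
    {'Title': 'Example T2', 'Price': 800, 'Size (m²)': 100.0, 'Zone': 'Odivelas'},
    {'Title': 'Example T3', 'Price': 950, 'Size (m²)': 120.0, 'Zone': 'Caneças'},
]

# columns= fixes the column order regardless of the key order in each dict
df = pd.DataFrame(rows, columns=cols)
print(df.shape)  # (2, 4)
```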

### **4.2.2 Confirming the DataFrame creation**

In [None]:
odivelas.head()

| | Title | Zone | Price | Size (m²) | Status | Description | URL |
|---|---|---|---|---|---|---|---|
| 0 | Apartamento T2 com Parqueamentos e Arrecadação... | Colinas do Cruzeiro | 1250 | 101.0 | Usado | APARTAMENTO T2 com Parqueamentos e Arrecadação... | https://casa.sapo.pt/alugar-apartamento-t2-odi... |
| 1 | T2 - Centro Casal de Cambra | Caneças | 2 | 90.0 | Usado | T2 C/ Arrecadação Centro de Casal de Cambra Ex... | https://casa.sapo.pt/alugar-apartamento-t2-odi... |
| 2 | Apartamento T2 na Urb. do Jardim da Amoreira | Jardim da Amoreira (Ramada) | 2 | 103.0 | Usado | \|\| Em Exclusivo na PMC Imobiliária \|\| Numa das... | https://casa.sapo.pt/alugar-apartamento-t2-odi... |
| 3 | T2+1 -- Excelentes Condições -- Caneças | Caneças | 3 | 95.0 | Usado | Sala c/ 30m2, Cozinha em Kitchenette Cozinha ... | https://casa.sapo.pt/alugar-apartamento-t3-odi... |
| 4 | Apartamento T2 em Odivelas c/ cozinha equipada | Póvoa de Santo Adrião e Olival ... | 2 | 80.0 | Usado | FAÇA CONNOSCO O MELHOR NEGÓCIO Excelente Opo... | https://casa.sapo.pt/alugar-apartamento-t2-odi... |


In [None]:
odivelas.shape

(180, 7)

In [None]:
odivelas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Title        180 non-null    object 
 1   Zone         180 non-null    object 
 2   Price        180 non-null    int64  
 3   Size (m²)    180 non-null    float64
 4   Status       180 non-null    object 
 5   Description  180 non-null    object 
 6   URL          180 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 10.0+ KB


In [None]:
odivelas.describe()

| | Price | Size (m²) |
|---|---|---|
| count | 180.0 | 180.0 |
| mean | 9173.333333 | 90.722222 |
| std | 36685.537683 | 26.450767 |
| min | 2.0 | 14.0 |
| 25% | 2.0 | 84.0 |
| 50% | 3.0 | 93.5 |
| 75% | 750.0 | 103.0 |
| max | 160000.0 | 125.0 |


### **4.2.3 Exporting the DataFrame**

In [None]:
# EXCEL
odivelas.to_excel('odivelas_excel.xlsx')

In [None]:
# CSV
odivelas.to_csv('odivelas_csv.csv')
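
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. A quick sketch on a tiny placeholder DataFrame, writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'Title': ['Example T2'], 'Price': [800]})  # placeholder data

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # omit the index column
print(buffer.getvalue())
```

The same `index=False` argument is accepted by `to_excel`.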