In [5]:
import pandas as pd
df = pd.read_pickle('hemnet_search_all_data')
# Drop duplicated columns: transpose, drop duplicate rows, transpose back
df = df.T.drop_duplicates().T

# Intro

As a data scientist, I look at large quantities of data to help people make better decisions. One decision that many people make is buying a home. Here it can be economically sound to study the data closely before deciding. In Sweden, there is a marketplace website called Hemnet where many house and apartment ads are posted. So I wanted to get hold of all the data for the area where I live to see if I could spot any trends.

The url I used for my search was the following:

https://www.hemnet.se/bostader?item_types[]=bostadsratt&location_ids[]=17755

When you visit this URL, a map is shown with dots marking where the apartments are located, along with aggregate statistics about your search, for example that the search returned ~2000 apartments.

# How does browsing Hemnet work?

There are multiple ways you can browse Hemnet:

- Make advanced searches using parameters like cost, living area in square meters, number of rooms, etc.
- Click around on a graphical map and select houses or apartments based on their geographical surroundings
- And a lot more

Once you have made your search, a table of listings is shown. If your search contains more than 50 results, you can paginate through them.

# Circumventing Cloudflare

Scraping a website has become much harder by 2023. Cloudflare offers VPN blocking and bot protection out of the box. Hemnet uses Cloudflare and blocks any traffic that cannot prove that it can run JavaScript. Thankfully, someone has tackled this before and written a Python module to circumvent the Cloudflare blocking. It's called Cloudscraper:

https://pypi.org/project/cloudscraper/

This tool is invaluable and saves us from having to write Selenium code and fake a whole web browser just to get past Cloudflare.

# Scraping Method

To get hold of the data, I wrote a script that paginates through the search results and fetches the URL of each listing.
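The pagination loop can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: the `page` query parameter, the absolute `href` pattern, and the helper names `page_url` and `collect_listing_urls` are all assumptions made for the example. The fetch function is passed in so the loop can be tried without touching the network.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://www.hemnet.se/bostader"

# Assumption: listing links appear as absolute hrefs shaped like the
# listing URLs shown in this post. The real markup may differ.
LISTING_RE = re.compile(r'href="(https://www\.hemnet\.se/bostad/[^"]+)"')


def page_url(page):
    """Build the search URL for one result page."""
    # Assumption: pagination is controlled by a `page` query parameter.
    params = [
        ("item_types[]", "bostadsratt"),
        ("location_ids[]", "17755"),
        ("page", page),
    ]
    return BASE_URL + "?" + urlencode(params)


def collect_listing_urls(fetch_html, max_pages):
    """Walk the result pages and collect every listing URL found.

    `fetch_html` is any callable that takes a URL and returns HTML text.
    With Cloudscraper it could be:
        scraper = cloudscraper.create_scraper()
        fetch_html = lambda url: scraper.get(url).text
    """
    urls = []
    for page in range(1, max_pages + 1):
        found = LISTING_RE.findall(fetch_html(page_url(page)))
        if not found:  # stop at the first page without listings
            break
        urls.extend(found)
    return urls
```

Note that the sketch deliberately keeps duplicates, since it simply records what each page shows.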

# Results

Using the above-mentioned method, I was able to extract the following data frame:

In [41]:
print("rows: {} columns: {} ".format(*df.shape))

rows: 2350 columns: 23 


In [42]:
df.head(5)

| | data | url | date | reason | price | address | location | Bostadstyp | Upplåtelseform | Antal rum | ... | Byggår | Förening | Avgift | Driftkostnad | Pris/m² | Unnamed: 17 | Energiklass | Uteplats | Biarea | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<!DOCTYPE html>\n<html lang="sv-SE" dir="ltr">...` | https://www.hemnet.se/bostad/lagenhet-2rum-cen... | 2023-07-12 17:52:28.330152 | OK | 1 950 000 kr | Hisingsgatan 23 | Centrala Hisingen, Göteborgs kommun | Lägenhet | Bostadsrätt | 2 rum | ... | 1932 | \n\nBRF Brämgården\n\nOm föreningen\n | 2 322 kr/mån | 6 000 kr/år | 56 522 kr/m² | \nRäkna på boendet\n | | | | 20234135 |
| 1 | `<!DOCTYPE html>\n<html lang="sv-SE" dir="ltr">...` | https://www.hemnet.se/bostad/lagenhet-3rum-cen... | 2023-07-12 17:52:28.812681 | OK | 3 950 000 kr | Stampgatan 68 A | Centrum - Stampen, Göteborgs kommun | Lägenhet | Bostadsrätt | 3 rum | ... | 1985 | \n\nBRF Breitenfeld\n\nOm föreningen\n | 3 958 kr/mån | 4 200 kr/år | 51 769 kr/m² | \nRäkna på boendet\n | E | | | 20259116 |
| 2 | `<!DOCTYPE html>\n<html lang="sv-SE" dir="ltr">...` | https://www.hemnet.se/bostad/lagenhet-3rum-joh... | 2023-07-12 17:52:29.367351 | OK | 4 800 000 kr | Eklandagatan 16 | Johanneberg / Lorensberg, Göteborgs kommun | Lägenhet | Bostadsrätt | 3 rum | ... | 1929 | | 3 681 kr/mån | 5 400 kr/år | 59 259 kr/m² | \nRäkna på boendet\n | | | | 19626736 |
| 3 | `<!DOCTYPE html>\n<html lang="sv-SE" dir="ltr">...` | https://www.hemnet.se/bostad/lagenhet-2rum-joh... | 2023-07-12 17:52:29.851366 | OK | 2 490 000 kr | Fredriksdalsgatan 5C | Johanneberg - Fredriksdal, Göteborgs kommun | Lägenhet | Bostadsrätt | 2 rum | ... | 1949 | \n\nBRF Fredriksdalsgatan 5\n\nOm föreningen\n | 4 166 kr/mån | 7 440 kr/år | 48 824 kr/m² | \nRäkna på boendet\n | | | | 20259375 |
| 4 | `<!DOCTYPE html>\n<html lang="sv-SE" dir="ltr">...` | https://www.hemnet.se/bostad/lagenhet-2rum-cen... | 2023-07-12 17:52:30.543874 | OK | 178 000 kr | Norra Kungsvägen 30 | Centrum, Tidaholms kommun | Lägenhet | Bostadsrätt | 2 rum | ... | 1961-1962 | \n\nRiksbyggen BRF Tidaholmshus nr 2\n\nOm för... | 4 917 kr/mån | 1 848 kr/år | 3 043 kr/m² | \nRäkna på boendet\n | | | | 20258995 |


## Verifying the data

Before we can do anything interesting with this data, we should clearly state our assumptions about it and check that they hold.

- We assume that the data is from the Västra Götaland region.
- We assume that the listings are "Bostadsrätter", the Swedish term for tenant-owned apartments.
- We assume that the number of data points is 2350.
- We assume that the apartments were for sale at the time the listings were fetched.

For the sake of brevity, I will only check that we actually got all 2350 search results. The other assumptions can be revisited as we study the data further down the line.

## Method: How to determine uniqueness?



If we study the URLs, we see a string of digits at the end, as in the output below:

In [43]:
urls = list(df.url.head(5))
for u in urls: print(u)

https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-3rum-centrum-stampen-goteborgs-kommun-stampgatan-68-a-20259116
https://www.hemnet.se/bostad/lagenhet-3rum-johanneberg-lorensberg-goteborgs-kommun-eklandagatan-16-19626736
https://www.hemnet.se/bostad/lagenhet-2rum-johanneberg-fredriksdal-goteborgs-kommun-fredriksdalsgatan-5c-20259375
https://www.hemnet.se/bostad/lagenhet-2rum-centrum-tidaholms-kommun-norra-kungsvagen-30-20258995


These digits are likely a unique identifier. A simple way to check this hypothesis is to parse out the candidate identifier and print all URLs that contain it. If we see several different URLs with the same identifier, then it is not a unique identifier.

In [44]:
df["id"] = df.url.apply(lambda url: url.split('-')[-1])

In [45]:
df.id.head(5)

0    20234135
1    20259116
2    19626736
3    20259375
4    20258995
Name: id, dtype: object

Let's take the first identifier, `20234135`, and find out!

In [46]:
urls = list(df[df["id"] == "20234135"].url)
for u in urls: print(u)

https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135
https://www.hemnet.se/bostad/lagenhet-2rum-centrala-hisingen-goteborgs-kommun-hisingsgatan-23-20234135


## Hypothesis confirmed

As we can see from the above output, the supposed unique identifier `20234135` returned seven identical URLs. Had the output contained, for example, two URLs pointing to two different apartments with the same identifier, we would have shown that `20234135` does not uniquely identify this listing. Instead, the result came back in favor of our guess. After running through a couple of examples like this, I was convinced of my hypothesis: the trailing digits are likely a unique identifier for the listing.

So why does my data frame contain the same URL multiple times?
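Rather than spot-checking identifiers one at a time, the same check can be run over every identifier at once. The sketch below uses a toy frame with made-up URLs, and `ids_with_conflicting_urls` is a hypothetical helper name; the idea is to group by `id` and count the distinct URLs in each group.

```python
import pandas as pd

# Toy stand-in for the scraped frame (hypothetical URLs, not real listings).
toy = pd.DataFrame({"url": [
    "https://example.com/bostad/lagenhet-a-111",
    "https://example.com/bostad/lagenhet-a-111",
    "https://example.com/bostad/lagenhet-b-222",
]})
# Same idea as the lambda above: keep whatever follows the last hyphen.
toy["id"] = toy["url"].str.rsplit("-", n=1).str[-1]


def ids_with_conflicting_urls(frame):
    """Return the ids that map to more than one distinct URL."""
    urls_per_id = frame.groupby("id")["url"].nunique()
    return urls_per_id[urls_per_id > 1].index.tolist()


# An empty list supports the hypothesis that the digits identify a listing.
print(ids_with_conflicting_urls(toy))
```

On the real data this would be called as `ids_with_conflicting_urls(df)`. An empty result is evidence for the hypothesis rather than proof, since it only covers the listings that were scraped.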

In [28]:
print("Before: ", df.shape[0], " items.")
df_no_duplicates = df.drop_duplicates(subset=["id"])
print("After: ", df_no_duplicates.shape[0], " items.")

Before:  2350  items.
After:  62  items.


Wow! So the search actually returned only 62 unique listings. Interesting!

# What is actually going on?

It looks like the listings are repeated over and over after a certain point. The natural questions are when, and how? Let's answer when first. Keep in mind that 50 listings are displayed per page.

In [54]:
pd.set_option('display.max_rows', 500)
df_no_duplicates[:55][["id", "url"]]

| | id | url |
|---|---|---|
| 0 | 20234135 | https://www.hemnet.se/bostad/lagenhet-2rum-cen... |
| 1 | 20259116 | https://www.hemnet.se/bostad/lagenhet-3rum-cen... |
| 2 | 19626736 | https://www.hemnet.se/bostad/lagenhet-3rum-joh... |
| 3 | 20259375 | https://www.hemnet.se/bostad/lagenhet-2rum-joh... |
| 4 | 20258995 | https://www.hemnet.se/bostad/lagenhet-2rum-cen... |
| 5 | 20258935 | https://www.hemnet.se/bostad/lagenhet-2rum-kal... |
| 6 | 20258815 | https://www.hemnet.se/bostad/lagenhet-3rum-ell... |
| 7 | 20258695 | https://www.hemnet.se/bostad/lagenhet-1,5rum-v... |
| 8 | 18054396 | https://www.hemnet.se/bostad/lagenhet-1rum-tor... |
| 9 | 20257456 | https://www.hemnet.se/bostad/lagenhet-1rum-cen... |


In the above output I have listed the first 55 unique listings. The first column shows the original index from before we removed duplicates. This feature of pandas really helps us determine the order in which items are displayed to the user on Hemnet. We see that the first 50 results are indeed unique, as indicated by the consecutive index numbers in the first column. After the 50th, the next page contains only one listing not previously seen, the 100th.
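Instead of reading index numbers by eye, the number of previously unseen listings on each result page can be counted directly. This is a sketch under two assumptions: the rows are still in the order they were scraped, and every page holds exactly 50 rows. `new_listings_per_page` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

PAGE_SIZE = 50  # listings per result page, as stated above


def new_listings_per_page(ids, page_size=PAGE_SIZE):
    """Count, for each result page, how many ids appear for the first time."""
    ids = pd.Series(list(ids))
    first_seen = ~ids.duplicated()                   # True on the first occurrence of each id
    page = np.arange(len(ids)) // page_size + 1      # 1-based page number from row position
    return first_seen.groupby(page).sum()


# Toy example with a page size of 3: page 1 has three new ids, page 2 only one.
print(new_listings_per_page(["a", "b", "c", "a", "b", "d"], page_size=3))
```

On the real data the call would be `new_listings_per_page(df["id"])`, which gives one row per result page.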

In [55]:
df[["id","url"]].head(101)

| | id | url |
|---|---|---|
| 0 | 20234135 | https://www.hemnet.se/bostad/lagenhet-2rum-cen... |
| 1 | 20259116 | https://www.hemnet.se/bostad/lagenhet-3rum-cen... |
| 2 | 19626736 | https://www.hemnet.se/bostad/lagenhet-3rum-joh... |
| 3 | 20259375 | https://www.hemnet.se/bostad/lagenhet-2rum-joh... |
| 4 | 20258995 | https://www.hemnet.se/bostad/lagenhet-2rum-cen... |
| 5 | 20258935 | https://www.hemnet.se/bostad/lagenhet-2rum-kal... |
| 6 | 20258815 | https://www.hemnet.se/bostad/lagenhet-3rum-ell... |
| 7 | 20258695 | https://www.hemnet.se/bostad/lagenhet-1,5rum-v... |
| 8 | 18054396 | https://www.hemnet.se/bostad/lagenhet-1rum-tor... |
| 9 | 20257456 | https://www.hemnet.se/bostad/lagenhet-1rum-cen... |


# Discussion: "Page 10 of a search is no man's land"

Hardly anyone ever goes to page 10 of a search. What I think is happening is that the clickthrough rate is higher if listings are repeated throughout the pagination. People browsing for apartments are likely to click on an apartment they have seen multiple times. So let's study how listings are repeated through the data frame.