# MagicBricks scraper script

This notebook contains the code for scraping the MagicBricks website (www.magicbricks.com). It relies on three constants. BASE_URL is the first URL to open; it returns the page listing all the properties for sale. SERVICE_URL is the service endpoint that returns the properties in chunks of 30, and SERVICE_PAYLOAD holds the query parameters sent with each service call.

In [None]:
from bs4 import BeautifulSoup
import requests
import re
import json
import pandas as pd
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0
import numpy as np

In [None]:
# Both URLs are requested with GET
BASE_URL = 'https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Bangalore'
# The service URL is used to collect the data for the individual houses
SERVICE_URL = 'https://www.magicbricks.com/mbsearch/propertySearch.html'
# The payload carries the additional parameters required by the service URL
SERVICE_PAYLOAD = {
        'propertyType_new' : '10002_10003_10021_10022',
        'city':3327,
        'searchType':1,
        'propertyType':'10002,10003,10021,10022',
        'disWeb':'',
        'pType':'10002,10003,10021,10022',
        'category':'S',
        'groupstart':30,
        'offset':'',
        'maxOffset':'',
        'attractiveIds':'',
        'page':2,
        'ltrIds':'',
        'preCompiledProp':'',
        'excludePropIds':'',
        'addpropertyDataSet':''
        }

A helper function used in the cells below.

In [None]:
# Return [url, id] from one matched openDetailPage(...) string.
def extractUrl(st):
    l = re.findall(r"'(.*?)'",st)[0].replace('\\','')
    _id = l.split('&')[-1].split('=')[-1]
    return [l,_id]
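
To see what `extractUrl` does, here is a minimal sketch run on a made-up input. The function name `extract_url_and_id`, the sample string and the example.com URL are assumptions for illustration; the real markup on the site may differ.

```python
import re

def extract_url_and_id(st):
    """Mirror of extractUrl above: return [url, id] from one matched string."""
    # Take the first single-quoted chunk and drop any escaping backslashes
    url = re.findall(r"'(.*?)'", st)[0].replace('\\', '')
    # The ID is assumed to be the value of the last query parameter
    prop_id = url.split('&')[-1].split('=')[-1]
    return [url, prop_id]

# Hypothetical matched string, not copied from the live site
sample = "if(openDetailPage(event, 'https:\\/\\/www.example.com\\/propertyDetails\\/flat&id=abc123'))"
url, prop_id = extract_url_and_id(sample)
print(url)
print(prop_id)
```

Note that the ID is only correct when it is the last `&`-separated parameter of the URL, which is what the original function assumes.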

Let's start by requesting the base URL and extracting the required parameters and fields.

In [None]:
# Hitting the base url with get request
req = requests.get(BASE_URL)

# Converting the request response content to the soup parser for easier extraction and use
soup = BeautifulSoup(req.content, 'html.parser')

# Pattern to identify the individual property URL
pat = r'(if\(openDetailPage\(event, .*?\)\))'

#st = str(req.content)

# Extracting all matches for the pattern
li = re.findall(pat, str(req.content))
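
As a rough illustration of how the pattern behaves, the sketch below runs it against a hand-written page fragment. The HTML is hypothetical (the real listing markup is not shown in this notebook); only the pattern itself is taken from the cell above.

```python
import re

# Same pattern as in the cell above
pat = r'(if\(openDetailPage\(event, .*?\)\))'

# Hypothetical page fragment with two listings; the real markup may differ
html = (
    '<div onclick="if(openDetailPage(event, \'https://www.example.com/p?x=1&id=A1\')) return;"></div>'
    '<div onclick="if(openDetailPage(event, \'https://www.example.com/p?x=2&id=B2\')) return;"></div>'
)

# The non-greedy .*? stops at the first "))", so each listing yields its own match
matches = re.findall(pat, html)
print(len(matches))
print(matches[0])
```

Because the match ends at the first `))`, a URL that itself contains `))` would be cut short; that case is assumed not to occur here.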

These parameters are required for the service URL requests that will be fired later.

In [None]:
ltrIds = soup.find('span',{'id':'ltrIds'}).text
scrText = soup.find('div',{'id':'resultDiv'}).find('script').text

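# Turn the `var gsData = {...}` JavaScript object literal into valid JSON by quoting its keys
scrText = (
    scrText.replace('var gsData = ','')    # drop the JS assignment
    .replace('\r\n','').replace('\t','')    # strip line breaks and tabs
    .replace(':','\':')                     # close the quote after each key
    .replace('{','{\'',1)                   # open the quote before the first key
    .replace(',',',\'')                     # open the quote before each following key
    .replace('\'','"')                      # JSON requires double quotes
    .replace(',"}','}')                     # remove the stray quote left by a trailing comma
)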

service_params = json.loads(scrText)
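
The chain of replacements is easier to follow on a small input. The sketch below wraps the same steps in a hypothetical helper, `js_object_to_json`, and applies it to an invented `gsData` block; the field values are assumptions, and only the key names `pageCount`, `maxOffset` and `offset` come from this notebook.

```python
import json

def js_object_to_json(text):
    """Quote the keys of a flat JS object literal so json.loads can read it."""
    return (
        text.replace('var gsData = ', '')        # drop the JS assignment
        .replace('\r\n', '').replace('\t', '')    # strip line breaks and tabs
        .replace(':', "':")                       # close the quote after each key
        .replace('{', "{'", 1)                    # open the quote before the first key
        .replace(',', ",'")                       # open the quote before each later key
        .replace("'", '"')                        # JSON requires double quotes
        .replace(',"}', '}')                      # clean up after a trailing comma
    )

# Invented sample; the real script block may contain other fields
sample = "var gsData = {\r\n\tpageCount:120,\r\n\tmaxOffset:3,\r\n\toffset:0,\r\n}"
params = json.loads(js_object_to_json(sample))
print(params)
```

This approach only works for a flat object whose values contain no colons, commas or braces; a value such as a URL would break it.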

Now we extract the URLs from the matched strings and save them to a file, so that an error in a later step does not cause the data to be lost.

In [None]:
property_URLs = [extractUrl(x) for x in li]
# filename = 'URLs_1'
# ext = '.txt'
# with open(filename+ext,'w') as f:
#     for u in property_URLs:
#         f.write(u[0]+';'+u[1]+'\n')
#     f.close()

Let's turn the file-saving step into a function, changed slightly to append to the file, so that it can be called every time URLs are extracted from the service URL.

In [None]:
def appendURLsToFile(properties,filename):
    # Append one 'url;id' line per property; the with block closes the file
    with open(filename,'a') as f:
        for u in properties:
            f.write(u[0]+';'+u[1]+'\n')
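
The `url;id` line format can be checked with a small round trip. The sketch below is an illustration under assumptions: `append_urls` and `read_urls` are hypothetical names, the URLs are invented, and the file is written to a temporary directory.

```python
import os
import tempfile

def append_urls(properties, filename):
    """Append one 'url;id' line per [url, id] pair."""
    with open(filename, 'a') as f:
        for url, prop_id in properties:
            f.write(url + ';' + prop_id + '\n')

def read_urls(filename):
    """Read the pairs back; rsplit keeps any ';' inside the URL intact."""
    with open(filename) as f:
        return [line.rstrip('\n').rsplit(';', 1) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), 'URLs_demo.txt')
append_urls([['https://www.example.com/p?a=1;b=2&id=A1', 'A1']], path)
append_urls([['https://www.example.com/p?id=B2', 'B2']], path)  # second call appends
rows = read_urls(path)
print(rows)
```

Splitting from the right is a design choice of this sketch: a plain `split(';')` would misread a URL that happens to contain a semicolon.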

Next we fetch the remaining properties through the service URL. The process is the same as above, except that the requests go to the service URL.

In [None]:
max_pages = int(service_params['pageCount'])
SERVICE_PAYLOAD['maxOffset'] = service_params['maxOffset']
SERVICE_PAYLOAD['offset'] = service_params['offset']
SERVICE_PAYLOAD['ltrIds'] = ltrIds

The payload carries the additional parameters required by the service URL. Only a few of them have to be updated on every call: offset, maxOffset, page and groupstart.
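
The relationship between `page` and `groupstart` can be written down explicitly. The sketch below assumes, based on the initial payload (page 2 with groupstart 30) and the chunk size of 30, that page n starts at (n - 1) * 30. The helper name `payload_for_page` and the trimmed payload are hypothetical.

```python
import copy

PAGE_SIZE = 30  # the service returns properties in chunks of 30

def payload_for_page(base_payload, page):
    """Return a copy of base_payload set up for the given page (page >= 2)."""
    payload = copy.deepcopy(base_payload)
    payload['page'] = page
    # Page 2 starts at 30 in this notebook, so page n is assumed to start at (n - 1) * 30
    payload['groupstart'] = (page - 1) * PAGE_SIZE
    return payload

# Trimmed, illustrative payload; the full one has more keys
base = {'page': 2, 'groupstart': 30, 'offset': 0, 'maxOffset': 0}
p5 = payload_for_page(base, 5)
print(p5['page'], p5['groupstart'])
```

Building a fresh payload per page avoids mutating shared state, which makes it possible to retry a single page without disturbing the counters.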

The function below extracts property URLs through the service URL.

In [None]:
def extractProperties(filename):
    # Fetch one page of results from the service URL and save its property URLs
    req = requests.get(SERVICE_URL,params=SERVICE_PAYLOAD)
    esoup = BeautifulSoup(req.content,'html.parser')
    matches = re.findall(pat,str(req.content))
    
    properties = [extractUrl(x) for x in matches]
    appendURLsToFile(properties,filename)
    
    # The first script tag carries the next offset values; they are parsed here
    # but not used, because offset and maxOffset are fixed to 0 in a later cell
    sc = esoup.find('script').text
    t = sc.replace(' ','').replace('\n\t','').replace('\'','').split(';')[:-1]
    # Advance the payload to the next chunk of 30 results
    SERVICE_PAYLOAD['groupstart'] = int(SERVICE_PAYLOAD['groupstart']) + 30
    #SERVICE_PAYLOAD['offset'] = t[1].split('=')[1]
    #SERVICE_PAYLOAD['maxOffset'] = t[2].split('=')[1]
    SERVICE_PAYLOAD['page'] = int(SERVICE_PAYLOAD['page']) + 1
    

The service request is fired for every page from 2 up to max_pages. The offset and maxOffset parameters don't seem to have much effect on the results, so we set them to 0.

In [None]:
SERVICE_PAYLOAD['offset'] = 0
SERVICE_PAYLOAD['maxOffset'] = 0

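ext = '.txt'  # extension of the URL files; the later glob('URLs_*.txt') relies on it
# Pages 2..max_pages inclusive; every 50 pages go to a new file (URLs_1.txt, URLs_2.txt, ...)
for it in range(2, max_pages + 1):
    file_name = 'URLs_' + str(it // 50 + 1)
    extractProperties(file_name+ext)
    print(str(it) + ' pages URLs extracted')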

Up to this point we have extracted the URLs and IDs of the properties available for sale on the platform. Let's concatenate all the files into a master file that stores every URL.

In [None]:
import glob
files = glob.glob('URLs_*.txt')
with open('URLsMaster.txt','a') as m:
    for f in files:
        # The with blocks close both files, so no explicit close() is needed
        with open(f,'r') as e:
            m.writelines(e.readlines())

The master file with all the URLs is now created. From here we extract the data for each individual property.

Now we can collect the property data from the web pages. Because there are many URLs to request, we work file by file: each URL file is processed in one iteration and its property details are loaded into a dataframe. After all the URLs in a file have been processed, the dataframe is saved to a CSV file for later reference.

Before hitting the URLs, we lay out the sequence of steps:
1. Take the list of files to be used.
2. Define the steps of one iteration.
3. For each iteration (one file):
    a. Create a dataframe.
    b. Load the URLs from the file.
    c. For each URL:
        i. Take the ID from the URL line.
        ii. Load the header properties and format them.
        iii. Load the data attributes from the domcache_detailpage element.
        iv. Load these into the dataframe.
    d. After the loop is done, save the dataframe to a file with the same name as the URL file but with the extension '.csv'.
4. Make sure the loop does not stop because of exceptions.
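
Step 4 can be sketched as a small wrapper that records failures instead of letting them end the run. This is one possible approach under assumptions, not the method used in the cells of this notebook; `collect_safely` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real per-URL function so that the example runs without network access.

```python
def collect_safely(fetch, urls):
    """Apply fetch to every URL line, recording failures instead of stopping."""
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(fetch(url))
        except Exception as exc:  # broad on purpose: one bad page must not end the run
            failed.append((url, repr(exc)))
    return rows, failed

def fake_fetch(url):
    """Stand-in for the real per-URL function; fails on lines containing 'bad'."""
    if 'bad' in url:
        raise ValueError('could not parse ' + url)
    return {'propid': url.split(';')[1]}

rows, failed = collect_safely(fake_fetch, ['u1;1', 'bad;2', 'u3;3'])
print(len(rows), 'rows,', len(failed), 'failed')
```

Keeping the failed lines makes it possible to retry just those URLs later rather than repeating the whole file.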

In [None]:
import glob
files = glob.glob('URLs_*.txt')

def forEachURL(url):
    # Each line has the form 'url;id'
    splits = url.split(';')
    urlstring = splits[0]
    _id = splits[1]
    req = requests.get(urlstring)
    soup = BeautifulSoup(req.content, 'html.parser')
    # Header properties: every p_infoColumn block holds a title/value pair
    infos = soup.find_all('div',{'class':'p_infoColumn'})
    details = {}
    for ta in infos:
        t = ta.find('div',{'class':'p_title'}).text.replace('\n','')
        v = ta.find('div',{'class':'p_value'}).text.replace('\n','')
        details[t] = v
    # The attributes of this span carry the remaining property fields
    data_attrs = soup.find('span',{'id':'domcache_detailpage'})
    # soup.find returns None when the span is missing; fall back to no attributes
    domcache = data_attrs.attrs if data_attrs is not None else {}
    #print('Data obtained for URL with ID: ' + _id)
    return { 'propid': _id, **details, **domcache}

In [None]:
import pandas as pd
def forFile(filename):
    # One 'url;id' entry per line; the with block closes the file after reading
    with open(filename) as f:
        lines = [line.rstrip('\n') for line in f]
    props = [forEachURL(url) for url in lines]
    df_cur = pd.DataFrame(props)
    df_cur.to_csv(filename.replace('.txt','.csv'))
    print('File with data created: ' + filename)

In [None]:
# Note: files[1:] skips the first file in the list; use `files` to process every file
for name in files[1:]:
    forFile(name)

In [None]:
data_files = glob.glob('URLs_*.csv')

Create a master data file containing all the data for every URL extracted. This will be our master data store.

In [None]:
# DataFrame.append was removed in pandas 2.0; concatenate all the files in one call instead
df_master = pd.concat([pd.read_csv(f) for f in data_files], sort=False)

df_master.to_csv('URLsDataMaster_1.csv')

We now have all the data that can be collected from the MagicBricks property pages. The next part is the locality ratings. Initial exploration of the network traffic generated by the web pages shows that the ratings for the different localities are fetched from a separate source (https://rating.magicbricks.com/mbRating/getWiget.json). So we fire another set of requests to fetch the ratings of the localities.

In [133]:
df_locs = pd.DataFrame(df_master['data-localityid'].unique())
df_locs.to_csv('LocalityIds.csv')

In [136]:
df_locsload = pd.read_csv('LocalityIds.csv')

In [138]:
df_locsload.columns = ['slno','LocID']

In [171]:
def extractRating(locId):
    key = '5e8e646d535182a72b693a4669ff9e'
    url = 'https://rating.magicbricks.com/mbRating/getWiget.json'
    params = {
        'callback':'newDetailsRatingAndReviewWidget',
        'host':'magicbricks.com',
        'key':key,
        'code':locId,
        'type':'LOCALITY'
    }
    resp = requests.get(url,params=params)
    # The response is JSONP; strip the callback wrapper to leave plain JSON
    resp = resp.text.replace(');','').replace('newDetailsRatingAndReviewWidget(','')
    j = json.loads(resp)
    rat = j['avgRating']
    # Category ratings are optional; fall back to an empty dict when they are absent
    try:
        plcs = j['categoryRatingMap']['Places of Interest']
    except KeyError:
        plcs = {}
    return { 'rating' : rat, **plcs, 'locId' : locId}
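
The rating endpoint wraps its JSON in a callback call (JSONP). As an alternative to the two `replace` calls, the wrapper can be removed with a single regular expression. This is a sketch under assumptions: `unwrap_jsonp` is a hypothetical helper, and the sample response is invented. Only `avgRating`, `categoryRatingMap` and `Places of Interest` appear in this notebook; the `Parks` entry is made up.

```python
import json
import re

def unwrap_jsonp(text):
    """Return the JSON value wrapped in a callback(...) call."""
    # Greedy (.*) runs to the last ')' so parentheses inside the JSON are kept
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))

# Invented response body for illustration
sample = ('newDetailsRatingAndReviewWidget({"avgRating": 4.2, '
          '"categoryRatingMap": {"Places of Interest": {"Parks": 4.0}}});')
data = unwrap_jsonp(sample)
print(data['avgRating'])
```

Unlike a plain `replace(');','')`, this only strips the outer wrapper, so a `);` that happens to appear inside a string value is left untouched.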

In [None]:
# Drop missing locality IDs first; int() raises an error on NaN
dics = [extractRating(int(s)) for s in df_locsload['LocID'].dropna()]
df_locdata = pd.DataFrame(dics)
df_locdata.to_csv('locs_data1.csv')

We now have the required data from the magicbricks.com domain. Other location-based data, such as geospatial data for the property locations, is also needed, but its sources are yet to be found. We will proceed with the data collected so far and create our model.