### Data Acquisition & Preparation:

This notebook contains the code for scraping property listings and processing the retrieved data.

In [1]:
import numpy as np
import pandas as pd
import requests as rq
import re
import os
import os.path
import random
import pickle
from collections import defaultdict

# web scraping:
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as bsoup
from IPython.display import display, HTML
import time

# custom modules:
import sys
sys.path.append('../code/')
from process_listings import featureCounts, ArrayMaker

#### Web Scraping:

Landsoftexas.com provides querying functionality but returns only 30 listings per page. To use the site's query filters while working around the page limit, the scraping code requests each results page separately and saves it to its own text file in the `scraped_listings` folder.
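
The page arithmetic behind this can be sketched as follows. This is a minimal illustration, not the notebook's own code: `build_page_urls` and `n_listings` are hypothetical names, and the `page-<i>` suffix simply mirrors the pattern used in the scraping cell rather than a documented site API.

```python
import math

# Query URL taken from the notebook; n_listings below is a hypothetical input.
BASE_URL = ('https://www.landsoftexas.com/Blacklands-North-Texas-Region/'
            'all-land/50-5000-acres/is-sold/no-house/')
LISTINGS_PER_PAGE = 30  # page size stated in the text


def build_page_urls(base_url, n_listings, per_page=LISTINGS_PER_PAGE):
    """Return one URL per results page needed to cover n_listings."""
    n_pages = math.ceil(n_listings / per_page)
    # The first entry is the bare query URL; later entries append 'page-<i>',
    # following the same pattern as the notebook's loop.
    return [base_url] + [f'{base_url}page-{i}' for i in range(1, n_pages)]


urls = build_page_urls(BASE_URL, n_listings=75)
print(len(urls))  # 3 URLs cover 75 listings at 30 per page
```

Rounding up matters here: a final, partly filled page still has to be requested.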

To preserve each page's raw HTML, the files are written in binary mode as `utf-8`-encoded bytes so that BeautifulSoup can parse them when they are reopened.
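
The round trip can be checked without touching the network. The sketch below is an illustration under stated assumptions: `page_text` is an invented stand-in for `r.text`, and a temporary directory replaces the `scraped_listings` folder. BeautifulSoup accepts raw bytes as well as strings, so the bytes read back here could be handed to it directly.

```python
import os
import tempfile

# Invented stand-in for `r.text`; the markup is made up for the demo.
page_text = '<article><span class="size">50 acres</span> caf\u00e9</article>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results0.txt')

    # Binary mode writes exactly the bytes passed in, with no newline translation.
    with open(path, 'wb') as f:
        f.write(page_text.encode('utf-8'))

    # Reading in binary mode returns the same bytes, ready for a parser.
    with open(path, 'rb') as f:
        raw = f.read()

restored = raw.decode('utf-8')
print(restored == page_text)
```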

In [2]:
# randomize user agent to prevent getting blocked:
ua = UserAgent()
header={'User-Agent':str(ua.random)}
base_url = 'https://www.landsoftexas.com/Blacklands-North-Texas-Region/all-land/50-5000-acres/is-sold/no-house/'

page_list = [base_url]
results_path = '../data/scraped_listings/'

# create a list of 33 urls for site pages (shortened to 3 for demo):
for i in range(1, 3):
    temp = base_url + 'page-' + str(i)
    page_list.append(temp)
    
# get page data for urls in page_list:
for i, page in enumerate(page_list):
    print('Processing...', page)
    r = rq.get(page, headers = header)
    
    # save page data to individual results file in utf-8 format:
    temp_name = os.path.join(results_path, 'results' + str(i) + '.txt') 

    # the `with` block closes the file automatically:
    with open(temp_name, 'wb') as file:
        file.write(r.text.encode('utf-8'))
        
    # randomize the timing to appear less programmatic:
    time.sleep(.63 + 3 * random.random())

Processing... https://www.landsoftexas.com/Blacklands-North-Texas-Region/all-land/50-5000-acres/is-sold/no-house/
Processing... https://www.landsoftexas.com/Blacklands-North-Texas-Region/all-land/50-5000-acres/is-sold/no-house/page-1
Processing... https://www.landsoftexas.com/Blacklands-North-Texas-Region/all-land/50-5000-acres/is-sold/no-house/page-2


#### Code Body

The main code body opens each text file in turn and reads it into a single variable for BeautifulSoup to parse. BeautifulSoup's parsing functionality is then used to pull content from specific sections of each listing. All features are appended to a single `prop_data` list, which is then passed to the `ArrayMaker` function.
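
Because every listing's values go into one flat list, each listing has to contribute the same number of entries, or later fields slide into the wrong columns. The sketch below shows one way such a list could be split back into records. It is an assumption for illustration only: `chunk_records` and the shortened `FIELDS` list are hypothetical and do not describe how `ArrayMaker` is actually implemented.

```python
FIELDS = ['id', 'size', 'price']  # hypothetical, shortened field list


def chunk_records(flat, fields=FIELDS):
    """Split a flat list into one dict per listing, len(fields) values each."""
    width = len(fields)
    if len(flat) % width != 0:
        # A missing or extra value anywhere would misalign every later record.
        raise ValueError(f'{len(flat)} values do not divide into rows of {width}')
    return [dict(zip(fields, flat[i:i + width]))
            for i in range(0, len(flat), width)]


flat = ['1396986', '220.69', 440287.0,
        '1149611', '80', 275000.0]
rows = chunk_records(flat)
print(rows[1]['price'])  # 275000.0
```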

The site's Property Description section is unstructured free text, so a separate function, `featureCounts`, was written to convert this language into structured data.
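
As a rough illustration of the idea, keyword matches in a description can be counted and returned as a dictionary. This is a sketch only: `count_features` and its patterns are hypothetical, chosen to echo a few column names from the output table, and the real `featureCounts` in `process_listings` may work quite differently.

```python
import re

# Hypothetical keyword patterns; the keys echo columns in the output table.
FEATURE_PATTERNS = {
    'auto_gate': r'\bautomatic gate\b|\bauto gate\b',
    'barn': r'\bbarns?\b',
    'cattle': r'\bcattle\b',
    'easement': r'\beasements?\b',
}


def count_features(description):
    """Count how often each feature keyword appears in a description."""
    text = description.lower()
    return {name: len(re.findall(pattern, text))
            for name, pattern in FEATURE_PATTERNS.items()}


counts = count_features(
    'Fenced for cattle, with a hay barn and a second barn near the road.')
print(counts)  # {'auto_gate': 0, 'barn': 2, 'cattle': 1, 'easement': 0}
```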

In [3]:
prop_data = []
total = 0
results_path = '../data/scraped_listings/'
txt_files = [f for f in os.listdir(results_path) if f.endswith('.txt')]    # skip stray non-results files

for fname in txt_files:
    temp_name = os.path.join(results_path, fname) 
    with open(temp_name, 'rb') as file:
        prop_cache = file.read()
        properties = bsoup(prop_cache, 'html.parser')    # name the parser explicitly to avoid bs4's guessed-parser warning
        properties = properties.find_all('article')
        total += len(properties)
        
        for prop in properties:
            temp = prop.select('a[href*="/property"]')
            temp = str(temp).strip('</a>').strip('[<a href"=property/').split('">')
            temp = temp[0]    # flatten list 
            prop_data.append((temp[-7:]))    # prop_id
            prop_data.append(prop.find('span', class_ = 'size').text.replace('acres','').replace(',','').strip())   # size
            
            try:
                s = prop.find('span', class_ = 'price').text.replace(',','').strip('$')
                prop_data.append(float(s))    # price
            except (AttributeError, ValueError):    # price tag missing or not numeric
                prop_data.append(0)    
                
            try:
                prop_data.append(prop.find('span', {'itemprop':'streetAddress'}).text.lower())   # street address  
            except:
                prop_data.append(None)    
            
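            # county and zip get separate fallbacks so a missing county cannot shift later fields:
            try:
                prop_data.append(prop.find('span', class_ = 'county').text.replace(' ','').replace('County','').replace('-','').lower()) # county
            except AttributeError:
                prop_data.append(None)

            try:
                prop_data.append(prop.find('span', {'itemprop':'postalCode'}).text)   # zip code
            except AttributeError:
                prop_data.append(None)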
                
            try:
                prop_data.append(prop.find('h3', class_ = 'panel-title').text.lower())  # listing title
            except:
                prop_data.append(None)
                continue
                
            prop_data.append(temp)  # html link
            
            # process the property description with `featureCounts()` and append the returned dictionary:
            try:                                                    
                z = prop.find('div', class_ = 'invDescription signature-desc').get_text()
                z = z.replace('invDescription signature-desc','').replace('\n', ' ').replace('\r', '').replace('\t', '').lower()
                prop_data.append(featureCounts([z]))    
            except:
                try:
                    z = prop.find('div', class_ = 'invDescription showcase-desc').get_text()
                    z = z.replace('invDescription showcase-desc','').replace('\n', ' ').replace('\r', '').replace('\t', '').lower()
                    prop_data.append(featureCounts([z])) 
                except:
                    try:
                        z = prop.find('div', class_ = 'invDescription premium-desc').get_text()
                        z = z.replace('invDescription premium-desc','').replace('\n', ' ').replace('\r', '').replace('\t', '').lower()
                        prop_data.append(featureCounts([z])) 
                    except:
                        z = 0
                        prop_data.append(featureCounts([z]))
                                         
print(f'Processing {total} listings...')

df = ArrayMaker(prop_data)

# remove properties with a zero or placeholder price (anything under $100):
zeros = df[df.price < 100].shape[0]

if zeros > 0:
    print(f'>> Removed {zeros} properties listed with zero price.\n')
    df = df.drop(df[df.price < 100].index)
    
# save rows containing null values to the DataFrame df_nans if present:
null_vals = df.isnull().values.sum()
if null_vals > 0:
    print(f'Found {null_vals} null values in column(s) {df.columns[df.isna().any()].tolist()} \n')
    print('Saving null values to list df_nans...')
    df_nans = df.loc[df.isnull().any(axis=1)]
else:
    print('Zero null values found.')

df.to_pickle('../data/original_df.pkl')
df.head()

Processing 840 listings...
>> Removed 4 properties listed with zero price.

Found 103 null values in column(s) ['address'] 

Saving null values to list df_nans...


| | id | size | price | address | county | zip | auto_gate | barn | cattle | easement | ... | bosque | coryell | falls | freestone | hill | limestone | mclennan | navarro | title | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1396986 | 220.69 | 440287 | 729 cr 3430 | bosque | 76634 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 220.69 acres clifton, tx | 220.69-acres-in-Bosque-County-Texas/1396986 |
| 1 | 1149611 | 80.0 | 275000 | 0000 vilas rd | bell | 76534 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 80 acres holland, tx | 80-acres-in-Bell-County-Texas/1149611 |
| 2 | 2033595 | 102.0 | 204000 | fm 2838 | limestone | 76667 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | near mexia part woods | 102-acres-in-Limestone-County-Texas/2033595 |
| 3 | 1621214 | 67.55 | 168875 | off fm 933 | hill | 76622 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 67.55 acres aquilla, tx | 67.55-acres-in-Hill-County-Texas/1621214 |
| 4 | /867838 | 50.0 | 147500 | cr 326 | falls | 76680 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 50 acres reagan, tx | 50-acres-in-Falls-County-Texas/867838 |
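
The id in the last row above keeps a leading slash, which is what a fixed seven-character slice of the link (`temp[-7:]` in the parsing cell) yields for a six-digit id. One possible alternative, sketched below, takes the final path segment instead. It assumes the id is always the last segment of the link, and `listing_id` is a hypothetical helper that is not part of the notebook or of `process_listings`.

```python
def listing_id(href):
    """Return the final path segment of a listing link."""
    # Drop any trailing slash, then keep everything after the last '/'.
    return href.rstrip('/').rsplit('/', 1)[-1]


# Link values copied from the table above.
links = ['220.69-acres-in-Bosque-County-Texas/1396986',
         '50-acres-in-Falls-County-Texas/867838']

ids = [listing_id(link) for link in links]
print(ids)            # ['1396986', '867838']

# For comparison, a fixed-width slice keeps the slash on the six-digit id:
print(links[1][-7:])  # /867838
```

Splitting on the separator avoids any dependence on the id's length, so six- and seven-digit ids are handled the same way.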
