<a href="https://colab.research.google.com/github/samcast1/Short-Term-Investments-Model/blob/main/1.0-sc-intro-loading-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
The aim of this project is to predict the most valuable real estate investment options for short-term rentals (e.g., Airbnb). The model will combine NLP sentiment analysis of web-scraped reviews with market trends and economic metrics from Zillow Research data.

# Install and Import Libraries . . .

In [None]:
!pip install pyspark
!pip install findspark
!pip install beautifulsoup4
!pip install requests
!pip install selenium


Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=9c9cb244c5f50f8287adb704fc3ce178d2907d22e8e78a466748fffcbe8f5927
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Collecting selenium
  Downloading selenium-4.21.0-py3-none-any.whl (9.5 MB)

In [None]:
import pyspark
import findspark
from pyspark.sql import SparkSession

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
import requests

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

from google.colab import files

# Acquiring and Loading Data . . .

In [None]:
findspark.init()
spark = SparkSession.builder.appName('ShortTermInvestment').getOrCreate()

In [None]:
uploaded = files.upload()

Saving market_heat_index.csv to market_heat_index.csv
Saving rental_index.csv to rental_index.csv
Saving sales_count.csv to sales_count.csv
Saving sales_median.csv to sales_median.csv
Saving sales_to_list_ratio.csv to sales_to_list_ratio.csv
Saving zhvi_bottom_tier.csv to zhvi_bottom_tier.csv
Saving zhvi_mid_tier.csv to zhvi_mid_tier.csv
Saving zhvi_top_tier.csv to zhvi_top_tier.csv


In [None]:
uploaded_files = uploaded.keys()
print(uploaded_files)

dict_keys(['market_heat_index.csv', 'rental_index.csv', 'sales_count.csv', 'sales_median.csv', 'sales_to_list_ratio.csv', 'zhvi_bottom_tier.csv', 'zhvi_mid_tier.csv', 'zhvi_top_tier.csv'])


In [None]:
zhvi_top = spark.read.csv("zhvi_top_tier.csv", header=True, inferSchema=True)
zhvi_mid = spark.read.csv("zhvi_mid_tier.csv", header=True, inferSchema=True)
zhvi_bot = spark.read.csv("zhvi_bottom_tier.csv", header=True, inferSchema=True)
sales_count = spark.read.csv("sales_count.csv", header=True, inferSchema=True)
sales_med = spark.read.csv("sales_median.csv", header=True, inferSchema=True)
sales_list = spark.read.csv("sales_to_list_ratio.csv", header=True, inferSchema=True)
rental = spark.read.csv("rental_index.csv", header=True, inferSchema=True)
market_heat = spark.read.csv("market_heat_index.csv", header=True, inferSchema=True)

Web-Scraping for Sentiment Analysis:

- Generate a list of cities from the Zillow metro data.
- Use this list to construct Airbnb search URLs of the form:

    'https://www.airbnb.com/s/{city}/homes'

- Use the constructed URLs to scrape reviews with Selenium and BeautifulSoup.
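The first two steps can be sketched in plain Python. This is only an illustration that mirrors the lowercase-and-replace logic used in the SQL queries in this notebook; the helper names `city_slug` and `airbnb_search_url` are hypothetical, and the sketch assumes each `RegionName` looks like `"City, ST"`.

```python
def city_slug(region_name: str) -> str:
    """Turn a region name such as 'New York, NY' into a slug such as 'newyork-ny'.

    Assumes the 'City, ST' format: lowercase the name, replace the ', '
    separator with a dash, then drop any remaining spaces.
    """
    return region_name.lower().replace(", ", "-").replace(" ", "")


def airbnb_search_url(region_name: str) -> str:
    """Build the Airbnb search URL for a region using the pattern shown above."""
    return f"https://www.airbnb.com/s/{city_slug(region_name)}/homes"


print(airbnb_search_url("New York, NY"))
# prints https://www.airbnb.com/s/newyork-ny/homes
```

Keeping the slug logic in one small function makes it easy to spot-check a few region names before generating the full list of URLs.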

In [None]:
# register the zhvi tables as temp views so city names can be gathered with SQL queries

zhvi_top.createOrReplaceTempView("top")
zhvi_mid.createOrReplaceTempView("mid")
zhvi_bot.createOrReplaceTempView("bot")

In [None]:
# make it a little easier to run queries

def sql(query):
  """Run a Spark SQL query and print the first rows of the result."""
  result = spark.sql(query)
  result.show()

In [None]:
sql("""
SELECT *
FROM top
LIMIT 10
""")


In [None]:
# create a list of all cities in the zhvi data as lowercase city-state slugs (e.g., newyork-ny) - this will be the standard set of cities included in all data

sql("""
SELECT REPLACE(REPLACE((LOWER(RegionName)), ", ", "-"), " ", "") AS city
 FROM top
  LEFT JOIN mid
    USING (RegionName)
  LEFT JOIN bot
    USING (RegionName)
WHERE RegionName <> 'United States'
""")

+---------------+
|           city|
+---------------+
|     newyork-ny|
|  losangeles-ca|
|     chicago-il|
|      dallas-tx|
|     houston-tx|
|  washington-dc|
|philadelphia-pa|
|       miami-fl|
|     atlanta-ga|
|      boston-ma|
|     phoenix-az|
|sanfrancisco-ca|
|   riverside-ca|
|     detroit-mi|
|     seattle-wa|
| minneapolis-mn|
|    sandiego-ca|
|       tampa-fl|
|      denver-co|
|   baltimore-md|
+---------------+
only showing top 20 rows



In [None]:
def sql_to_list(query):
  result = spark.sql(query)
  result_list = result.rdd.flatMap(lambda x: x).collect()
  return result_list

In [None]:
city_list = sql_to_list("""
SELECT REPLACE(REPLACE((LOWER(RegionName)), ", ", "-"), " ", "") AS city
 FROM top
  LEFT JOIN mid
    USING (RegionName)
  LEFT JOIN bot
    USING (RegionName)
WHERE RegionName <> 'United States'
""")

In [None]:
city_list[-50:]

['wahpeton-nd',
 'magnolia-ar',
 'elkcity-ok',
 'liberal-ks',
 'worthington-mn',
 'oskaloosa-ia',
 'pampa-tx',
 'clarksdale-ms',
 'sterling-co',
 'beatrice-ne',
 'jamestown-nd',
 'levelland-tx',
 'grenada-ms',
 'maryville-mo',
 'arkadelphia-ar',
 'dumas-tx',
 'guymon-ok',
 'borger-tx',
 'pierre-sd',
 'huron-sd',
 'stormlake-ia',
 'cordele-ga',
 'evanston-wy',
 'raymondville-tx',
 'portlavaca-tx',
 'othello-wa',
 'vineyardhaven-ma',
 'parsons-ks',
 'price-ut',
 'portales-nm',
 'losalamos-nm',
 'hereford-tx',
 'andrews-tx',
 'spiritlake-ia',
 'fitzgerald-ga',
 'winnemucca-nv',
 'maysville-ky',
 'snyder-tx',
 'helena-ar',
 'spencer-ia',
 'atchison-ks',
 'fairfield-ia',
 'vermillion-sd',
 'sweetwater-tx',
 'pecos-tx',
 'zapata-tx',
 'ketchikan-ak',
 'craig-co',
 'vernon-tx',
 'lamesa-tx']

In [None]:
# build one Airbnb search url per city
city_urls = [f'https://www.airbnb.com/s/{city}/homes' for city in city_list]

city_urls[:5]

['https://www.airbnb.com/s/newyork-ny/homes',
 'https://www.airbnb.com/s/losangeles-ca/homes',
 'https://www.airbnb.com/s/chicago-il/homes',
 'https://www.airbnb.com/s/dallas-tx/homes',
 'https://www.airbnb.com/s/houston-tx/homes']

## That worked surprisingly well. Now I need to build the web scraper.

#### The Google Colab environment is not conducive to running a Selenium Chrome driver, so I'll need to do this on my local machine in a standard Jupyter notebook.

I'll need to export the city URLs as a CSV, load them into a Jupyter notebook, build and apply the web scraper, and collect the reviews in my local directory to bring back into Colab for further processing and analysis with the Zillow data.
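As a rough sketch of the local side of that hand-off, the helpers below load the exported URLs, run a scraper over them with a pause between requests, and write the collected reviews to a CSV. All function names, the `reviews.csv` file name, and the `fetch_reviews` callable are hypothetical; `fetch_reviews` stands in for the Selenium and BeautifulSoup scraper, which is not written yet. The only assumption about `city_urls.csv` is that it has a single `url` column.

```python
import csv
import time
from typing import Callable, Iterable


def load_urls(path: str) -> list[str]:
    """Read the 'url' column from the exported CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["url"] for row in csv.DictReader(f)]


def scrape_all(
    urls: Iterable[str],
    fetch_reviews: Callable[[str], Iterable[str]],
    delay_seconds: float = 2.0,
) -> list[tuple[str, str]]:
    """Call fetch_reviews on each url and collect (url, review) pairs.

    fetch_reviews is a placeholder for the real scraper; the delay between
    requests is an assumption meant to avoid hammering the site.
    """
    rows = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        for review in fetch_reviews(url):
            rows.append((url, review))
    return rows


def save_reviews(rows: Iterable[tuple[str, str]], path: str) -> None:
    """Write the (url, review) pairs to a CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "review"])
        writer.writerows(rows)
```

On the local machine this could be wired together as `save_reviews(scrape_all(load_urls('city_urls.csv'), my_scraper), 'reviews.csv')`, where `my_scraper` is whatever scraping function ends up being built. Passing the scraper in as an argument keeps the file handling testable without a browser.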

In [None]:
url_df = pd.DataFrame(city_urls, columns = ['url'])
url_df.head()
url_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 894 entries, 0 to 893
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     894 non-null    object
dtypes: object(1)
memory usage: 7.1+ KB


In [None]:
url_df.to_csv('city_urls.csv', index=False)
files.download('city_urls.csv')
