# Tristn Joseph - IST 718 - Lab 2 (Data Loading and Cleaning)

## Introduction

Real estate investing involves the purchase, ownership, management, rental, and/or sale of real estate for profit. Improving property as part of a real estate investment strategy is generally considered a sub-specialty of real estate investing called real estate development.

In general, to invest means to allocate money (and other resources) with the expectation of a positive return in the future. The investor's problem statement is therefore: as an investor, I want to identify ideal opportunities quickly so that I can invest appropriately and maximize my returns. The question becomes: `how can I identify opportunities quickly`?

One approach is to use `predictive models`. Predictive modeling is the use of statistics to predict outcomes, and it involves analyzing historical data to uncover trends and key inflection points.

Many factors can affect real estate values, including crime rates, school zones, median income levels, geography, and population. However, a useful indicator is the historical trend of real estate values (aggregated at various levels).

In this assignment, the goal is to determine the best locations for the Syracuse Real Estate Investment Trust (SREIT) to invest in. The assumption is that the SREIT will purchase real estate at the current time (the most recent date within the data set, `March 31, 2020`), and that ideal investments will grow in value over time. To determine this, I analyze time series data of housing values and generate predictive models to forecast the median housing value per location.

## Packages

In [1]:
import pandas as pd

In [2]:
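# displaying all rows and columns
# (the full option names are used because the short forms 'max_rows' and
# 'max_columns' match more than one option in recent pandas versions)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)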

## Data Loading

In [3]:
# loading the base data from Zillow (files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv)
zwillow_sfr_data_path = "C:/Users/trist/OneDrive/Desktop/Trist'n/School/Syracuse University/Q4 2021/IST718/Labs/Lab 2/Zip_Zhvi_SingleFamilyResidence.csv"
zwillow_sfr_initial_df = pd.read_csv(zwillow_sfr_data_path)

## Data Cleaning

For appropriate analysis to be conducted, data transformation and cleaning steps are necessary.

In the initial data set, each date is recorded as its own column. This wide structure is inappropriate for time series analysis because the table is not in `tidy` format, where each row holds a single observation (here, one region on one date). To convert the table into a `tidy` format, each date column needs to be converted into a row value.
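To make the reshaping concrete, here is a minimal sketch on a toy table. The two regions and their values are made up for illustration; only the column names `RegionName`, `Date`, and `MedianHousingValue` follow the notebook.

```python
import pandas as pd

# Toy wide table: one row per region, one column per month-end date.
# The values are illustrative, not taken from the Zillow file.
wide_df = pd.DataFrame({
    "RegionName": ["60657", "77494"],
    "1996-01-31": [100.0, 200.0],
    "1996-02-29": [110.0, 210.0],
})

# melt keeps the id columns and stacks every remaining column into
# (Date, MedianHousingValue) pairs: one row per region per date.
tidy_df = wide_df.melt(
    id_vars=["RegionName"],
    var_name="Date",
    value_name="MedianHousingValue",
)

print(tidy_df)
```

The two regions and two date columns become four rows, which is the shape that grouping and time series functions generally expect.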

In [4]:
# converting the date columns into rows (wide to long format)
zwillow_sfr_df = zwillow_sfr_initial_df.melt(
    id_vars=[
        'RegionID', 'SizeRank', 'RegionName',
        'RegionType', 'StateName', 'State',
        'City', 'Metro', 'CountyName'
    ],
    var_name='Date',
    value_name='MedianHousingValue'
)

In [5]:
# converting the data types so that they are appropriate for analysis
zwillow_sfr_df['RegionID'] = zwillow_sfr_df['RegionID'].astype('string')
zwillow_sfr_df['SizeRank'] = zwillow_sfr_df['SizeRank'].astype('category')
zwillow_sfr_df['RegionName'] = zwillow_sfr_df['RegionName'].astype('string')
zwillow_sfr_df['Date'] = zwillow_sfr_df['Date'].astype('datetime64[ns]')
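One thing worth checking after this conversion, assuming `RegionName` holds five-digit ZIP codes and the file was read with default type inference: codes that begin with zero are parsed as integers and lose the leading zero before they are cast to strings. The sketch below uses a made-up two-row CSV (the ZIP `02134` and its value are hypothetical) to show the effect and two possible remedies.

```python
import io

import pandas as pd

# Made-up CSV snippet; the first ZIP code starts with a zero.
csv_text = "RegionName,1996-01-31\n02134,150000.0\n60657,364892.0\n"

# Default parsing infers integers, so the leading zero is dropped.
inferred = pd.read_csv(io.StringIO(csv_text))

# Option 1: read the column as text so the code is preserved as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"RegionName": "string"})

# Option 2: if the column was already parsed as integers, pad it back
# to five characters after casting to string.
restored = inferred["RegionName"].astype("string").str.zfill(5)

print(inferred["RegionName"].tolist())
print(as_text["RegionName"].tolist())
print(restored.tolist())
```

Whether this matters depends on whether ZIP codes are later joined against other data sets or displayed, so treat it as a check rather than a required step.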

In [6]:
# removing every row that contains a null value in any column
zwillow_sfr_df = zwillow_sfr_df.dropna()
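Because `dropna()` with no arguments removes a row when any column is null, it can help to look at where the nulls are before dropping. The following is a sketch on a toy frame (the null pattern is invented for illustration) that counts nulls per column and compares the default behavior with restricting the check to the value column through the `subset` parameter.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in two different columns (illustrative only).
toy_df = pd.DataFrame({
    "RegionName": ["10025", "60657", "77494"],
    "Metro": ["New York", None, "Houston"],
    "MedianHousingValue": [np.nan, 364892.0, 200475.0],
})

# Count nulls per column before deciding what to drop.
null_counts = toy_df.isna().sum()

# Default: a null in any column removes the row.
dropped_any = toy_df.dropna()

# Alternative: only rows missing the housing value are removed.
dropped_value_only = toy_df.dropna(subset=["MedianHousingValue"])

print(null_counts)
print(len(dropped_any), len(dropped_value_only))
```

The notebook keeps the default, which is reasonable if every descriptive column is needed downstream; the `subset` form is an alternative if rows with a missing `Metro` or `City` should be retained.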

In [7]:
zwillow_sfr_df.head()

| | RegionID | SizeRank | RegionName | RegionType | StateName | State | City | Metro | CountyName | Date | MedianHousingValue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 84654 | 1 | 60657 | Zip | IL | IL | Chicago | Chicago-Naperville-Elgin | Cook County | 1996-01-31 | 364892.0 |
| 3 | 91982 | 3 | 77494 | Zip | TX | TX | Katy | Houston-The Woodlands-Sugar Land | Harris County | 1996-01-31 | 200475.0 |
| 4 | 84616 | 4 | 60614 | Zip | IL | IL | Chicago | Chicago-Naperville-Elgin | Cook County | 1996-01-31 | 546663.0 |
| 5 | 91940 | 5 | 77449 | Zip | TX | TX | Katy | Houston-The Woodlands-Sugar Land | Harris County | 1996-01-31 | 97521.0 |
| 7 | 91733 | 7 | 77084 | Zip | TX | TX | Houston | Houston-The Woodlands-Sugar Land | Harris County | 1996-01-31 | 97381.0 |


In [8]:
# cleaned_data_path = "C:/Users/trist/OneDrive/Desktop/Trist'n/School/Syracuse University/Q4 2021/IST718/Labs/Lab 2/cleaned_zwillow_data.csv"
# zwillow_sfr_df.to_csv(cleaned_data_path)
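If the cleaned table is written out, note that `to_csv` also writes the row index by default, and that column dtypes are not stored in a CSV. The sketch below is one possible way to handle both, using an in-memory buffer instead of a file path and a two-row stand-in for the cleaned frame; it is not the notebook's own save step.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned frame (illustrative values).
cleaned = pd.DataFrame({
    "RegionName": pd.Series(["60657", "77494"], dtype="string"),
    "Date": pd.to_datetime(["1996-01-31", "1996-01-31"]),
    "MedianHousingValue": [364892.0, 200475.0],
})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
cleaned.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out; dtype and parse_dates restore the
# column types when the file is read back.
without_index = io.StringIO()
cleaned.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(
    without_index,
    dtype={"RegionName": "string"},
    parse_dates=["Date"],
)

print(reloaded_with_index.columns.tolist())
print(reloaded.dtypes)
```

Passing `index=False` is a design choice: it suits this table because the index carries no information beyond the original row position.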