#By Sina Azartash

#Loading Modules

In [2]:
!pip install prophet

Collecting prophet
  Downloading prophet-1.0.1.tar.gz (65 kB)
Collecting cmdstanpy==0.9.68
  Downloading cmdstanpy-0.9.68-py3-none-any.whl (49 kB)
Collecting ujson
  Downloading ujson-5.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (45 kB)
Building wheels for collected packages: prophet
  Building wheel for prophet (setup.py) ... done
  Created wheel for prophet: filename=prophet-1.0.1-py3-none-any.whl size=6640021 sha256=49afc8aca8c8984797e58e0f024abdf59603f8bb693af1c5663775329e360073
  Stored in directory: /root/.cache/pip/wheels/4e/a0/1a/02c9ec9e3e9de6bdbb3d769d11992a6926889d71567d6b9b67
Successfully built prophet
Installing collected packages: ujson, cmdstanpy, prophet
  Attempting uninstall: cmdstanpy
    Found existing installation: cmdstanpy 0.9.5
    Uninstalling cmdstanpy-0.9

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime
import sys
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
sys.path.append("../")

#Source Code

In [4]:
class Predict_Rent:
    """
    Creates an AI model for each column in dataframe by abstracting the running of Meta's Prophet
    Is currently used to automatically model and then forecast rental prices for each city in a county
    https://facebook.github.io/prophet/
    """
    


    def __init__(self, df):
      """

      Constructor for the Predict_Rent class. Builds a Predict_Rent object from a single dataframe. 
      All csv files must be aggregated and then preprocessed into a single dataframe.
      
      :param df: The dataframe containing the original data, aggregated and preprocessed
      :type df: pandas.core.frame.DataFrame

      """
      self.df = df               #stored for access
      self.df_dict = {}          #stored for access
      self.model_dict = {}       #needs to be stored for plot 
      self.forecasts_dict = {}   #used by plot and also contains trend stats
      self.predicted_df = None   #the final results



    def auto_predict(self, months, confidence=0.95, verbose=False):
      """

      Splits the dataframe by column, fits one model per column, and forecasts each column

      :param months: The number of months into the future to forecast
      :type months: int
      :param confidence: The width of the uncertainty interval. The observed value is expected to fall within the interval this fraction of the time. Must be > 0.5 and < 1
      :type confidence: float
      :param verbose: default = False. False --> quiet mode. True --> displays text reporting running of the model
      :type verbose: bool
      :returns: The predicted values
      :rtype: pandas.core.frame.DataFrame

      """
      if verbose: print(f"splitting dataframe based on columns:\n {self.df.columns[1:]}...")
      self.__split_df()
      if verbose: print(f"generating model with {confidence:.1%} confidence... ")
      self.__generate_models(confidence)
      if verbose: print(f"predicting trend based on model {months} months into the future... ")
      return self.__generate_predictions(months)
      


    def __split_df(self, cols = None):
      """

      Splits the stored DataFrame into several smaller dataframes, one per column

      :param cols: The columns to split out. Each column listed becomes a new dataframe. Defaults to every column after the first (date) column
      :type cols: pandas.core.indexes.base.Index, Example: df.columns[2:]
      :returns: A dictionary of the column name with the corresponding new dataframe 
      :rtype: dict

      """
      # Fall back to every column after the date column when none are given
      split_cols = self.df.columns[1:] if cols is None else cols
      
      for city in split_cols:
        temp_df = (self.df).filter(['index',city], axis=1)
        temp_df.columns = ['ds', 'y']   # Prophet expects the columns 'ds' (date) and 'y' (value)
        temp_df['ds']= pd.to_datetime(temp_df['ds'])
        self.df_dict[city]=temp_df
      return self.df_dict



    def __generate_models(self, confidence=0.95):
      """

      Generates a prophet model for each dataframe stored in self.df_dict.

      :param confidence: The width of the uncertainty interval, passed to Prophet as interval_width. Must be > 0.5 and < 1 
      :type confidence: float
      :returns: A dictionary of the column name with the corresponding new model
      :rtype: dict

      """
      if confidence >= 1 or confidence <= 0.5:
        raise ValueError('ERROR: Confidence argument must be greater than 0.5 and less than 1')
      for index, temp_df in self.df_dict.items():
        m = Prophet(interval_width=confidence, seasonality_mode = 'multiplicative' ,weekly_seasonality=False, daily_seasonality=False)
        m.fit(temp_df)
        self.model_dict[index] = m
      return self.model_dict



    def __generate_predictions(self, months):
      """

      Generates the predicted (median) value for each column for the future months.
      The lower and upper bounds of the uncertainty interval are kept in self.forecasts_dict.
      
      :param months: The number of months into the future. Example 48 months = 4 years.
      :type months: int
      :returns: A df containing the predictions for each column, one row per future month
      :rtype: pandas.core.frame.DataFrame
      
      """
      dummy = list(self.model_dict.values())[0]  #to obtain time column 'ds'
      future = dummy.make_future_dataframe(periods = months, freq ='MS').iloc[len(self.df):]
      forecast = dummy.predict(future)
      self.predicted_df = forecast.filter(['ds'])   #date values of the new dataframe is being calculated
      self.predicted_df.columns = ['index']         #the time value of new dataframe is changed to index
      for city_name, city_model in self.model_dict.items():
        future = city_model.make_future_dataframe(periods = months, freq ='MS')
        forecast = city_model.predict(future)
        self.forecasts_dict[city_name] = forecast.copy() #we want to save the entire forecast for the plotting function
        forecast = forecast.iloc[len(self.df):] #only the future dates are pertinent to the predicted values
        forecast.reset_index(inplace = True, drop = True)
        self.predicted_df[city_name] = (forecast['yhat'].round(0)).apply(np.uintc) 
      return self.predicted_df


    def display(self, name):
      """

      Displays the statistics for the predicted values of a specific column
      
      :param name: The name of the column to display. Example "San Diego"
      :type name: string
      
      :returns: A df containing the median, lower bound, upper bound and interval ratios for each future month, plus a "Total" row
      :rtype: pandas.core.frame.DataFrame
      
      """
      print(name)
      stats = self.forecasts_dict[name].filter(['ds','yhat', 'yhat_lower', 'yhat_upper'])
      stats = stats.iloc[len(self.df):] #only the future dates are pertinent to the predicted values
      stats.reset_index(inplace = True, drop = True)
      stats.loc["Total"] = stats.sum()
      stats[['yhat', 'yhat_lower', 'yhat_upper']] = (stats[['yhat', 'yhat_lower', 'yhat_upper']]).round(0).apply(np.uintc)
      stats['yhat_lower_ratio'] = ((stats['yhat'] - stats['yhat_lower']) / stats['yhat']).round(2)
      stats['yhat_upper_ratio'] = ((stats['yhat_upper'] - stats['yhat']) / stats['yhat']).round(2)
      stats.rename(columns={'ds': 'date', 'yhat': 'median', 'yhat_lower': 'lower', 'yhat_upper': 'upper', 'yhat_lower_ratio': 'lower ratio', 'yhat_upper_ratio': 'upper ratio' }, inplace=True)
      return stats


    def plot(self, name, component = False):
      """
      
      Graphs the original data along with the predictions of a specific column

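      :param name: The name of the specified column
      :type name: string
      :param component: default = False. True --> plot the model components instead of the forecast
      :type component: bool
      :returns: a plotly figure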

      """
      if component:
         return plot_components_plotly(self.model_dict[name], self.forecasts_dict[name])
      return plot_plotly(self.model_dict[name], self.forecasts_dict[name])

#Loading the Data


https://www.deptofnumbers.com/affordability/california/san-diego/ 

In [5]:
url1 = "https://raw.githubusercontent.com/soazarta/Multivariate_time/main/home_price_SD.csv"
df = pd.read_csv(url1) #dataframe with historical data
df.head()

Unnamed: 0,Region,February 2012,March 2012,April 2012,May 2012,June 2012,July 2012,August 2012,September 2012,October 2012,...,February 2021,March 2021,April 2021,May 2021,June 2021,July 2021,August 2021,September 2021,October 2021,November 2021
0,"San Diego County, CA",$300K,$320K,$325K,$330K,$335K,$342K,$345K,$350K,$355K,...,$685K,$700K,$729K,$760K,$770K,$760K,$748K,$758K,$760K,$775K
1,"San Diego, CA",$304K,$330K,$344K,$353K,$360K,$370K,$361K,$360K,$375K,...,$705K,$700K,$740K,$761K,$800K,$789K,$760K,$770K,$801K,$800K
2,"San Diego, CA - 4s Ranch",$640K,$460K,$465K,$456K,$540K,$540K,$570K,$529K,$600K,...,$915K,"$1,100K","$1,185K","$1,360K","$1,400K","$1,405K","$1,381K","$1,336K","$1,425K","$1,192K"
3,"San Diego, CA - Adams North",$410K,$321K,$385K,$395K,$410K,$447K,$465K,$499K,$433K,...,$873K,$855K,$855K,$890K,$928K,$840K,$871K,$902K,"$1,100K","$1,079K"
4,"San Diego, CA - Adams Park",$278K,$302K,$299K,$270K,$270K,$241K,$325K,$359K,$301K,...,$853K,$830K,$770K,$718K,$760K,$818K,$780K,$712K,$685K,$633K


In [6]:
df.replace(['San Diego, CA -'], [''], regex = True, inplace = True)
df.iloc[:,1:] = (df.iloc[:,1:]).replace(to_replace = ['\$','K',','], value = ['','',''], regex = True)
df = df.transpose()
df.columns = df.iloc[0]
df = df.iloc[1:]
cols = df.columns
df[cols] = df[cols].apply(pd.to_numeric)
new_cols = []
for name in df.columns:
  str1 = name.strip()
  new_cols.append(str1)
df.columns = new_cols
NaNull = [df.isnull().any()]
print(df.isnull().sum())
df.reset_index(inplace = True)
df.head()

San Diego County, CA        0
San Diego, CA               0
4s Ranch                    0
Adams North                 0
Adams Park                  0
                           ..
Views                       0
Village                     0
Webster                     0
Western San Diego           0
San Diego, CA metro area    0
Length: 209, dtype: int64




Unnamed: 0,index,"San Diego County, CA","San Diego, CA",4s Ranch,Adams North,Adams Park,Allied Gardens,Alta Vista,Auberge at del Sur,Bankers Hill - Park West,...,University City,University Heights,Upper Hermosa,Uptown,Valencia Park,Views,Village,Webster,Western San Diego,"San Diego, CA metro area"
0,February 2012,300,304,640,410,278,335,240.0,,405,...,290,503,1718.0,435,208,159,433,201,297,300
1,March 2012,320,330,460,321,302,326,256.0,,355,...,309,500,1569.0,411,200,157,620,206,300,320
2,April 2012,325,344,465,385,299,325,253.0,,390,...,320,535,1718.0,420,200,170,685,232,301,326
3,May 2012,330,353,456,395,270,325,268.0,,445,...,324,535,2765.0,420,205,172,720,228,337,330
4,June 2012,335,360,540,410,270,358,268.0,,430,...,327,569,2765.0,415,218,185,638,229,340,335


Converting housing data to rental data by using a linear transform. This increases uncertainty.<br>
1000 --> converts units of $K to dollars<br>
30.5 --> the Price-To-Rent Ratio (discussed further in the Uncertainty and Limitations section)<br>
12 --> converts yearly rent to monthly rent
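
As a quick check of the conversion, here is a minimal sketch of the same arithmetic applied to a single value. The helper name `price_k_to_monthly_rent` is hypothetical and is not part of the notebook; the constants are the ones listed above.

```python
PRICE_TO_RENT_RATIO = 30.5   # ratio used in this notebook
MONTHS_PER_YEAR = 12

def price_k_to_monthly_rent(price_k, ratio=PRICE_TO_RENT_RATIO):
    """Convert a home price given in $K to an estimated monthly rent in dollars."""
    price_dollars = price_k * 1000           # $K -> dollars
    yearly_rent = price_dollars / ratio      # price-to-rent ratio gives yearly rent
    return round(yearly_rent / MONTHS_PER_YEAR)

# $300K is the February 2012 value for San Diego County in the table above
print(price_k_to_monthly_rent(300))  # 820
```

The result matches the first converted value for San Diego County in the output of the next cell.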

In [7]:
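# Only int64 columns are converted. Columns that contain NaN are float64, so they are skipped
# and stay in $K units (for example, Alta Vista and Upper Hermosa in the output below).
df[df.select_dtypes(include=['int64']).columns] = (df[df.select_dtypes(include=['int64']).columns] * 1000/30.5/12).round(0).apply(np.uintc)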
df.head()

Unnamed: 0,index,"San Diego County, CA","San Diego, CA",4s Ranch,Adams North,Adams Park,Allied Gardens,Alta Vista,Auberge at del Sur,Bankers Hill - Park West,...,University City,University Heights,Upper Hermosa,Uptown,Valencia Park,Views,Village,Webster,Western San Diego,"San Diego, CA metro area"
0,February 2012,820,831,1749,1120,760,915,240.0,,1107,...,792,1374,1718.0,1189,568,434,1183,549,811,820
1,March 2012,874,902,1257,877,825,891,256.0,,970,...,844,1366,1569.0,1123,546,429,1694,563,820,874
2,April 2012,888,940,1270,1052,817,888,253.0,,1066,...,874,1462,1718.0,1148,546,464,1872,634,822,891
3,May 2012,902,964,1246,1079,738,888,268.0,,1216,...,885,1462,2765.0,1148,560,470,1967,623,921,902
4,June 2012,915,984,1475,1120,738,978,268.0,,1175,...,893,1555,2765.0,1134,596,505,1743,626,929,915


# Tutorial and Examples

In [8]:
#To view the available cities, use the columns attribute
df.columns

Index(['index', 'San Diego County, CA', 'San Diego, CA', '4s Ranch',
       'Adams North', 'Adams Park', 'Allied Gardens', 'Alta Vista',
       'Auberge at del Sur', 'Bankers Hill - Park West',
       ...
       'University City', 'University Heights', 'Upper Hermosa', 'Uptown',
       'Valencia Park', 'Views', 'Village', 'Webster', 'Western San Diego',
       'San Diego, CA metro area'],
      dtype='object', length=210)

## auto_predict()

As an example, let's say we are only interested in two areas, 'Rolando Park' and 'San Ysdiro North', which are both in San Diego County.

In [9]:
df2 = df.filter(['index','Rolando Park','San Ysdiro North'], axis=1)
df2

Unnamed: 0,index,Rolando Park,San Ysdiro North
0,February 2012,806,508
1,March 2012,820,500
2,April 2012,738,505
3,May 2012,738,500
4,June 2012,746,577
...,...,...,...
113,July 2021,1770,1352
114,August 2021,1803,1634
115,September 2021,1803,1470
116,October 2021,1858,1470


Below, we first instantiate our Predict_Rent class. The class can only handle one dataframe at a time, so the data must be aggregated into a single dataframe first.<br>
Next, we call the auto_predict() method with a confidence of 97.5%, 4 years (48 months) into the future.<br>
97.5% of the time, the rent amount is expected to fall within our prediction interval. This is explained more below in `display`.<br>
Finally, we call the tail method to show only the last few forecast months.

In [10]:
example1 = Predict_Rent(df = df2)
new_df = example1.auto_predict(confidence = .975,months = 48)
new_df.tail()

Unnamed: 0,index,Rolando Park,San Ysdiro North
43,2025-07-01,2386,1644
44,2025-08-01,2371,1728
45,2025-09-01,2395,1666
46,2025-10-01,2449,1683
47,2025-11-01,2472,1668


We can also call the auto_predict method for every column in the dataset by passing in the original dataset without any filtering. This will be a lot slower, so I will choose a lower confidence and fewer months ahead. <br>
The lines below that start with "INFO:prophet:" are notifications about settings that Prophet adjusts automatically, such as disabling yearly seasonality or reducing the number of changepoints for columns with few observations.

In [11]:
example2 = Predict_Rent(df = df)
new_df = example2.auto_predict(confidence = .75,months = 12)
new_df.tail()

INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 1.
INFO:prophet:n_changepoints greater than number of observations. Using 3.
INFO:prophet:n_changepoints greater than number of observations. Using 24.
INFO:prophet:n_changepoints greater than number of observations. Using 20.
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 1.
INFO:prophet:n_changepoints greater than number of observations. Using 3.
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 1.
INFO:prophet:n_changepoints greater than number of observations. Using 11.
INFO:prophet:n_changepoints greater than number of observations. Using 7.


Unnamed: 0,index,"San Diego County, CA","San Diego, CA",4s Ranch,Adams North,Adams Park,Allied Gardens,Alta Vista,Auberge at del Sur,Bankers Hill - Park West,...,University City,University Heights,Upper Hermosa,Uptown,Valencia Park,Views,Village,Webster,Western San Diego,"San Diego, CA metro area"
7,2022-07-01,2318,2324,4122,2450,1882,2259,647,1360,2752,...,1920,2848,2798,2339,1753,1214,3671,1699,2004,2314
8,2022-08-01,2320,2321,4099,2503,1926,2311,663,1426,2814,...,1937,2629,2916,2344,1751,1201,3516,1692,2036,2318
9,2022-09-01,2313,2288,3957,2600,1999,2304,671,1334,3009,...,1945,2755,2832,2390,1814,1220,3016,1678,2033,2310
10,2022-10-01,2319,2323,3941,2735,2032,2289,682,1314,3086,...,1873,2790,2647,2445,1853,1269,3012,1659,2053,2315
11,2022-11-01,2351,2335,3749,2809,1904,2299,670,1327,2974,...,1888,2672,2572,2440,1873,1276,3226,1596,2046,2347


## display()

The auto_predict method only returns the median predicted values. Each prediction also has an upper and a lower bound. We call this range a confidence interval (Prophet calls it an uncertainty interval). The interval is shown by the display method. This method returns a dataframe for one column, so we have to pass the name of the column we are interested in as an argument.<br>
med = median <br>
upper = upper bound on our interval <br>
lower = lower bound on our interval <br>
upper_ratio = how far up the interval moves away from the median <br>
upper_ratio = (upper - med)/med <br>
lower_ratio = how far down the interval moves away from the median <br>
lower_ratio = (med - lower)/med <br>
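
The two ratio formulas can be checked by hand. The sketch below uses a hypothetical helper, `interval_ratios`, and plugs in the December 2021 row of the 'Rolando Park' forecast that appears later in this notebook (median 1753, lower 1638, upper 1870).

```python
def interval_ratios(median, lower, upper):
    """Return (lower_ratio, upper_ratio), each rounded to 2 decimals as in display()."""
    lower_ratio = round((median - lower) / median, 2)
    upper_ratio = round((upper - median) / median, 2)
    return lower_ratio, upper_ratio

print(interval_ratios(1753, 1638, 1870))  # (0.07, 0.07)
```

The two ratios only match when the bounds sit at the same distance from the median, so reporting both makes any asymmetry visible.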

.tail() --> show the last 5 examples <br>
.head() --> show the first 5 examples. <br>
.iloc[start:end] where ':' alone selects all rows and a negative number counts from the end <br>
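
For readers new to pandas, here is a small sketch of these three selectors on a toy dataframe. The `toy` frame and its values are made up for illustration and are not part of the rental data.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the output of display()
toy = pd.DataFrame({"month": range(1, 11), "rent": range(1000, 2000, 100)})

first_five = toy.head()       # rows 0..4
last_five = toy.tail()        # rows 5..9
last_four = toy.iloc[-4:]     # negative start counts from the end
rows_2_to_4 = toy.iloc[2:5]   # start is inclusive, end is exclusive

print(last_four)
```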

In [12]:
example1.display('Rolando Park').head()

Rolando Park




Unnamed: 0,date,median,lower,upper,lower ratio,upper ratio
0,2021-12-01,1753,1638,1870,0.07,0.07
1,2022-01-01,1788,1657,1897,0.07,0.06
2,2022-02-01,1770,1658,1876,0.06,0.06
3,2022-03-01,1820,1706,1946,0.06,0.07
4,2022-04-01,1847,1732,1959,0.06,0.06


The Total row is based only on the future predictions. Above, our first predicted month is December 2021. Below, our last month is November 2025. The total median rent collected over those 48 months would be $101,121.

In [13]:
example1.display('Rolando Park').tail()

Rolando Park




Unnamed: 0,date,median,lower,upper,lower ratio,upper ratio
44,2025-08-01,2371,2241,2510,0.05,0.06
45,2025-09-01,2395,2266,2536,0.05,0.06
46,2025-10-01,2449,2313,2598,0.06,0.06
47,2025-11-01,2472,2331,2612,0.06,0.06
Total,NaT,101121,95190,107038,0.06,0.06


Above, 97.5% of the time, the value is expected to fall within the interval given by our example1 model. The other 2.5% of the time it will fall outside this interval. The interval only applies to the data specified, so as we change or add data, the interval will change. The more confident we want to be, the wider our intervals get. Lowering the confidence makes the intervals narrower.
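
To see why a higher confidence gives a wider interval, here is a minimal sketch that assumes Gaussian errors with a standard deviation of 1 (`sigma` is a hypothetical value). Prophet estimates its intervals by simulation rather than with this closed form, so this only illustrates the direction of the effect.

```python
from statistics import NormalDist

def gaussian_half_width(confidence, sigma=1.0):
    """Half-width of a central interval that covers `confidence` of a Gaussian."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided quantile
    return z * sigma

for confidence in (0.75, 0.90, 0.975):
    print(confidence, round(gaussian_half_width(confidence), 2))
```

Under this assumption, moving from 75% to 97.5% confidence roughly doubles the half-width of the interval.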

In [14]:
example2.display('Rolando Park').iloc[-4:] #last 4 examples

Rolando Park




Unnamed: 0,date,median,lower,upper,lower ratio,upper ratio
9,2022-09-01,1905,1845,1965,0.03,0.03
10,2022-10-01,1961,1899,2021,0.03,0.03
11,2022-11-01,1933,1878,1997,0.03,0.03
Total,NaT,22283,21554,22999,0.03,0.03


Above, with 75% confidence, the value is expected to fall within about 3% below the median and 3% above the median. As we forecast further into the future, the interval gets wider.

## plot()

We can also view our confidence interval and predictions on a graph. Below, the dark line is the median value predicted. The light blue shading is the uncertainty interval.

In [15]:
example1.plot('San Ysdiro North') #with 97.5% confidence

In [16]:
example2.plot('San Ysdiro North') #with 75% confidence

Better quality data and more data decrease the width of our interval. <br>
Forecasting more months ahead increases the width of our interval. <br>
Different cities will have different interval sizes. 
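
The effect of the forecast horizon can be illustrated with a toy simulation. The sketch below is not Prophet's method. It assumes the series follows a simple random walk, and the function name and all parameters are hypothetical. It measures the width of the central 75% band at each month ahead.

```python
import random

def interval_width_by_horizon(months, n_paths=2000, step_sd=1.0, confidence=0.75, seed=0):
    """Simulate random walks and return the central interval width at each month ahead."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        level, path = 0.0, []
        for _ in range(months):
            level += rng.gauss(0.0, step_sd)  # each month adds an independent shock
            path.append(level)
        paths.append(path)

    tail = (1 - confidence) / 2
    widths = []
    for m in range(months):
        values = sorted(p[m] for p in paths)
        lower = values[int(tail * (n_paths - 1))]
        upper = values[int((1 - tail) * (n_paths - 1))]
        widths.append(upper - lower)
    return widths

widths = interval_width_by_horizon(12)
print(round(widths[0], 2), round(widths[-1], 2))  # width at 1 month vs. 12 months ahead
```

Because the shocks accumulate, the band at 12 months ahead is several times wider than the band at 1 month ahead.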

In [17]:
example2.plot("San Diego, CA", component = True)

# Uncertainty and Limitations

**House Prices --> Rent Prices**<br>
The data used in this experiment is housing prices, but rental price predictions are needed. Historical rental price data would be ideal. To convert housing price to rent, I used a Price-To-Rent Ratio. For San Diego, the conversion code above uses 30.5. We divide the price of the house by the ratio to obtain the yearly rent. The Price-To-Rent Ratio changes with time and I was not able to find historical data for it. <br>
https://www.fortunebuilders.com/san-diego-real-estate-market-trends/ <br>
https://www.thanmerrill.com/price-to-rent-ratio/ <br> 
<br>
There are two sources of uncertainty for the Price-To-Rent Ratio. First, we do not know how much the ratio will change in the future. The Price-To-Rent Ratio of 2022 is being applied to all other years, which is an approximation. This can be addressed by obtaining historical data. Second, different cities and zones within San Diego will have different Price-To-Rent Ratios, and within each city and zone each house will have its own ratio. Put simply, a single Price-To-Rent Ratio is an approximation across many types of houses in different areas. This can be mitigated by using a Price-To-Rent function rather than a single ratio. Building such a function would require data on several attributes of houses.
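
To give a sense of how sensitive the rent estimate is to the ratio, here is a small sketch. The ratios 28.0 and 33.0 are hypothetical alternatives chosen only for illustration; 30.5 is the value used in this notebook, and $775K is the November 2021 county value from the data above.

```python
def monthly_rent(price_dollars, ratio):
    """Estimated monthly rent for a home price and a price-to-rent ratio."""
    return price_dollars / ratio / 12

price = 775_000  # San Diego County, November 2021
for ratio in (28.0, 30.5, 33.0):  # 28.0 and 33.0 are hypothetical
    print(ratio, round(monthly_rent(price, ratio)))
```

Since rent is inversely proportional to the ratio, an error of a few percent in the ratio produces an error of about the same relative size in every rent value, before any forecasting uncertainty is added.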

**Implications of Median Housing Prices** <br>
The historical housing price data provided by RedFin and Zillow is based on *median* housing values. RedFin and Zillow did not provide the underlying distribution. Therefore, we cannot tell how far individual housing prices spread above or below the median. We need better data to know the range of values for each year. <br>
For this reason, this model can only report the median expected rent for a specific city. The true rental price may be near, below or above the median. <br>
Furthermore, this model is *univariate*. We only looked at prices and geography. We did not account for the many attributes of the house. There can be an expensive piece of land in a less affluent city and there can be a cheap piece of land in a very affluent city. These unaccounted attributes would help explain why the true housing price is above or below the predicted median. More custom models will be *multivariate*.

**NaN 'Null values'**<br>
Missing values are denoted by "Null". This model can handle missing values, but as the number of missing values increases, the uncertainty interval widens. Cities such as "Auberge at del Sur" with a large number of unknown values will have a wider interval of predicted prices.
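
One way to spot such columns before modeling is to measure the share of missing values per column. The sketch below uses a made-up `toy` frame with hypothetical city names and an arbitrary 25% threshold; on the real data the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "index": ["January 2020", "February 2020", "March 2020", "April 2020"],
    "City A": [1000, 1010, 1020, 1030],        # hypothetical, complete
    "City B": [np.nan, np.nan, 900.0, 910.0],  # hypothetical, half missing
})

# Fraction of missing values in each city column, largest first
missing_share = toy.drop(columns="index").isnull().mean().sort_values(ascending=False)

# Columns above the (arbitrary) threshold deserve extra caution
sparse = missing_share[missing_share > 0.25].index.tolist()
print(missing_share)
print(sparse)
```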

**Specific to San Diego**<br>
Using data specific to San Diego and the cities within it helps the model focus on local patterns and trends and filters out rental price trends that are not unique to San Diego.<br>
**Can it handle other cities?**<br>
As of now, the model uses Prophet with minimal tuning, which is a simple, generalizable model. The only thing that makes this model specific to San Diego is that the data is from San Diego. Therefore, this model can be generalized to other cities. However, this might entail a few adjustments to the code, parameter tuning and resolving some speed and memory issues.
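
As a rough sketch of what data for another city would need to look like, the example below reshapes a small made-up table into the layout that Predict_Rent expects: a first column named 'index' holding the month, followed by one numeric column per area. The region names, prices and the helper `to_model_frame` are all hypothetical.

```python
import pandas as pd

# Hypothetical raw table in the same wide layout as the San Diego csv
raw = pd.DataFrame({
    "Region": ["Example City - North", "Example City - South"],
    "January 2020": ["$500K", "$1,100K"],
    "February 2020": ["$510K", "$1,120K"],
})

def to_model_frame(raw, region_col="Region"):
    """Strip '$', 'K' and ',' from prices and put months in rows, regions in columns."""
    prices = raw.set_index(region_col)
    prices = prices.apply(lambda col: col.str.replace(r"[$K,]", "", regex=True))
    prices = prices.apply(pd.to_numeric)
    wide = prices.transpose()
    wide.columns.name = None
    wide.index.name = "index"   # Predict_Rent filters on a column named 'index'
    return wide.reset_index()

frame = to_model_frame(raw)
print(frame)
```

The month labels can then be parsed with `pd.to_datetime(frame["index"], format="%B %Y")`, which is what the class relies on when it builds the 'ds' column.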

**Disclaimer: The AI model is only as good as the data**<br>
The model fits a trend to the data provided. The trend is used to make future predictions. There is some error between the trend and the data, which can be reduced by data engineering and AI engineering. If the data provided is not fully representative of true rental prices, then the predictions will be unrepresentative to a similar degree.

**What is the quality of this model?** <br>
This is a starter model. Prophet is a library that automatically fits a time series trend using a generalized additive model (GAM). More sophisticated models will be custom and the AI engineer will have to experiment with a variety of methods. A custom Long Short-Term Memory (LSTM) recurrent neural network will likely be the most accurate.

**Why are yhat_lower_ratio and yhat_upper_ratio given if they are equal to each other?**<br>
yhat_lower_ratio and yhat_upper_ratio will not always equal each other. These values are shown as a reminder that this is a starter model. A big problem with Prophet is that it assumes a Gaussian residual error. To explain simply, Prophet assumes that the upper bound and lower bound will both deviate from the median to the same degree. Prophet assumes that the data follows a specific type of distribution, but this is not always the case in real life. Custom models will account for this.
