# Mod 4 Project - Starter Notebook

This notebook has been provided to you so that you can make use of the following starter code to help with the trickier parts of preprocessing the Zillow dataset. 

The notebook contains a rough outline of the general order you'll likely want to follow in this project. You'll notice that most of the areas are left blank. This is so that it's more obvious exactly when you should make use of the starter code provided for preprocessing. 

**_NOTE:_** The number of empty cells is not meant to imply how much or how little code should be involved in any given step--we've just provided a few for your convenience. Add, delete, and change things around in this notebook as needed!

# Some Notes Before Starting

This project will be one of the more challenging projects you complete in this program. This is because working with Time Series data is a bit different than working with regular datasets. In order to make this a bit less frustrating and help you understand what you need to do (and when you need to do it), we'll quickly review the dataset formats that you'll encounter in this project. 

## Wide Format vs Long Format

If you take a look at the format of the data in `zillow_data.csv`, you'll notice that the actual Time Series values are stored as separate columns. Here's a sample: 

<img src='~/../images/df_head.png'>

You'll notice that the first seven columns look like any other dataset you're used to working with. However, column 8 holds the median housing sales values for April 1996, column 9 for May 1996, and so on. This is called **_Wide Format_**, and it makes the dataframe intuitive and easy to read. However, this format causes problems when it comes to actually learning from the data, because a value only makes sense if you know the name of the column it is found in. Since column names are metadata, our algorithms will miss out on which date each value belongs to. This means that before we pass this data to our ARIMA model, we'll need to reshape our dataset to **_Long Format_**. Reshaped into long format, the dataframe above would now look like:

<img src='~/../images/melted1.png'>

There are now many more rows in this dataset--one for each unique time and zipcode combination in the data! Once our dataset is in this format, we'll be able to train an ARIMA model on it. The method used to convert from Wide to Long is `pd.melt()`, and it is common to refer to our dataset as 'melted' after the transition to denote that it is in long format. 
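
As a minimal sketch of the reshape described above, the toy frame below is melted from wide to long. The zipcodes and values are invented for illustration (they are not taken from `zillow_data.csv`), and the names `wide` and `long_df` are hypothetical.

```python
import pandas as pd

# Toy wide-format frame: one row per zipcode, one column per month.
# Zipcodes and values are made up for illustration.
wide = pd.DataFrame({
    'zipcode': ['00001', '00002'],
    '1996-04': [100000.0, 200000.0],
    '1996-05': [101000.0, 202000.0],
    '1996-06': [102000.0, 204000.0],
})

# Melt: every non-id column becomes a (time, value) pair on its own row.
long_df = pd.melt(wide, id_vars=['zipcode'], var_name='time', value_name='value')

# The former column labels are strings; convert them to real datetimes.
long_df['time'] = pd.to_datetime(long_df['time'], format='%Y-%m')

# 2 zipcodes x 3 months gives 6 rows with 3 columns.
print(long_df.shape)
print(long_df.sort_values(['zipcode', 'time']))
```

Each row of `long_df` now carries its own date, so the time information is part of the data rather than hidden in the column names.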

# Helper Functions Provided

Melting a dataset can be tricky if you've never done it before, so you'll see that we have provided a sample function, `melt_data()`, to help you with this step below. Also provided is:

* `get_datetimes()`, a function that converts the date column labels into pandas datetime objects
* Some good parameters for matplotlib to help make your visualizations more readable. 
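
To illustrate the kind of conversion `get_datetimes()` performs, here is a small sketch on a hypothetical list of column labels. It assumes, as the helper does, that every label after the first is a `'YYYY-MM'` string; if a frame has more identifier columns in front, the slice would need to skip those as well.

```python
import pandas as pd

# Hypothetical column labels as they might appear in a wide-format frame.
columns = ['zipcode', '1996-04', '1996-05', '2018-04']

# Skip the leading identifier column, then parse the 'YYYY-MM' labels.
dates = pd.to_datetime(columns[1:], format='%Y-%m')

# Each label is parsed as the first day of its month.
print(dates[0], dates[-1])
```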

Good luck!


# Step 1: Load the Data/Filtering for Chosen Zipcodes

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import statsmodels.api as sm
import itertools
%matplotlib inline
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('zillow_data.csv')
display(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB


None

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500


In [3]:
null_columns = df.columns[df.isnull().any()]
df[null_columns].isnull().sum()

Metro      1043
1996-04    1039
1996-05    1039
1996-06    1039
1996-07    1039
1996-08    1039
1996-09    1039
1996-10    1039
1996-11    1039
1996-12    1039
1997-01    1039
1997-02    1039
1997-03    1039
1997-04    1039
1997-05    1039
1997-06    1039
1997-07    1038
1997-08    1038
1997-09    1038
1997-10    1038
1997-11    1038
1997-12    1038
1998-01    1036
1998-02    1036
1998-03    1036
1998-04    1036
1998-05    1036
1998-06    1036
1998-07    1036
1998-08    1036
           ... 
2012-01     224
2012-02     224
2012-03     224
2012-04     224
2012-05     224
2012-06     224
2012-07     206
2012-08     206
2012-09     206
2012-10     206
2012-11     206
2012-12     206
2013-01     151
2013-02     151
2013-03     151
2013-04     151
2013-05     151
2013-06     151
2013-07     109
2013-08     109
2013-09     109
2013-10     109
2013-11     109
2013-12     109
2014-01      56
2014-02      56
2014-03      56
2014-04      56
2014-05      56
2014-06      56
Length: 220, dtype: int64

In [4]:
df.Metro.value_counts()

New York                          779
Los Angeles-Long Beach-Anaheim    347
Chicago                           325
Philadelphia                      281
Washington                        249
Boston                            246
Dallas-Fort Worth                 217
Minneapolis-St Paul               201
Houston                           187
Pittsburgh                        177
Miami-Fort Lauderdale             162
Portland                          161
Detroit                           153
Atlanta                           152
Seattle                           141
St. Louis                         140
San Francisco                     134
Kansas City                       127
Phoenix                           126
Baltimore                         122
Tampa                             118
Riverside                         116
Cincinnati                        109
Denver                            106
Rochester                         100
Cleveland                          94
Indianapolis

In [5]:
df.Metro = df.Metro.fillna(value=df['City'])
df.Metro.isna().any()

False

In [6]:
df.Metro.value_counts()

New York                          779
Los Angeles-Long Beach-Anaheim    347
Chicago                           325
Philadelphia                      282
Washington                        249
Boston                            246
Dallas-Fort Worth                 217
Minneapolis-St Paul               201
Houston                           188
Pittsburgh                        177
Miami-Fort Lauderdale             162
Portland                          162
Atlanta                           153
Detroit                           153
Seattle                           141
St. Louis                         140
San Francisco                     134
Kansas City                       127
Phoenix                           126
Baltimore                         122
Tampa                             118
Riverside                         116
Cincinnati                        109
Denver                            106
Rochester                         101
Cleveland                          96
Albany      

In [7]:
df.isnull().any(axis=1).sum()

1039

In [8]:
null_rows = df[df.isnull().any(axis=1)]
null_rows.isna().sum(axis=1)

20       105
36       213
105      207
156       93
232      111
272       93
275      107
345       87
469      167
508       93
713      201
796      107
800       87
842      207
854      107
868       87
884      165
1033     207
1252     105
1299     189
1359     183
1413      87
1434      93
1524      87
1534      93
1615      87
1754     111
1768     171
1809     177
1850     123
        ... 
14533    213
14538    123
14543    123
14548    107
14550    167
14558    201
14573    153
14577    219
14585    177
14587    207
14606     87
14618    167
14622    111
14623    213
14624    167
14633     87
14643    177
14651    111
14660    183
14666     87
14669    167
14674    167
14677    135
14682    111
14687    123
14703    117
14705    111
14706    171
14707    219
14708    213
Length: 1039, dtype: int64

In [9]:
df = df.dropna(axis=0)
display(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13684 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 28.5+ MB


None

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500


In [10]:
df.isna().any().sum()

0

In [11]:
df[['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName']].nunique()

RegionID      13684
RegionName    13684
City           7046
State            50
Metro          1290
CountyName     1063
dtype: int64

In [14]:
df.drop(columns=['RegionID'], inplace=True)
df.rename(columns={'RegionName':'zipcode'}, inplace=True)
df.head(1)

Unnamed: 0,zipcode,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,337600.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600


In [15]:
df.zipcode = df.zipcode.astype(str)
df.zipcode.dtype

dtype('O')

In [None]:
df.columns

In [None]:
# Average only the numeric columns; zipcode, City, etc. are strings
States = df.groupby('State').mean(numeric_only=True)
States.head()

In [None]:
display(States)

In [None]:
# Average only the numeric columns; zipcode, City, etc. are strings
Metros = df.groupby('Metro').mean(numeric_only=True)
Metros.head()

In [None]:
# Average only the numeric columns; zipcode, State, etc. are strings
Cities = df.groupby('City').mean(numeric_only=True)
Cities.head()

In [None]:
df.columns

In [None]:
def growth_rates(df):
    """Add fractional growth columns, measured up to 2018-04, to df in place."""
    df['total_growth'] = (df['2018-04'] - df['1996-04']) / df['1996-04']
    df['10yr_growth'] = (df['2018-04'] - df['2008-04']) / df['2008-04']
    df['5yr_growth'] = (df['2018-04'] - df['2013-04']) / df['2013-04']
    df['2yr_growth'] = (df['2018-04'] - df['2016-04']) / df['2016-04']
    df['1yr_growth'] = (df['2018-04'] - df['2017-04']) / df['2017-04']

In [None]:
growth_rates(df)
growth_rates(States)
growth_rates(Metros)
growth_rates(Cities)
display(df.head(2), States.head(2), Metros.head(2), Cities.head(2))

In [None]:
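# Sort states by total growth so the bars appear in descending order
States_total_sort = States.sort_values('total_growth', ascending=False)
fig = plt.figure(figsize=(16,8))

sns.barplot(x=States_total_sort.index, y=States_total_sort.total_growth, data=States_total_sort)
# Horizontal reference line at the mean total growth across states
plt.hlines(y=States.total_growth.mean(), xmin=0, xmax=len(States))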

In [None]:
States_total_sort = States.sort_values('total_growth', ascending=False)
States_total_sort.head()

In [None]:
States_1yr_sort = States.sort_values('1yr_growth', ascending=False)
fig = plt.figure(figsize=(16,8))
sns.barplot(x=States_1yr_sort.index, y=States_1yr_sort['1yr_growth'], data=States_1yr_sort)

In [None]:
States_1yr_sort.index[:10], States_total_sort.index[:10]

In [None]:
Top_2yr_States = States.sort_values('2yr_growth', ascending=False).index[:10]
Top_2yr_States

In [None]:
Top_5yr_States = States.sort_values('5yr_growth', ascending=False).index[:10]
Top_5yr_States

In [None]:
test_list = list(enumerate(States_total_sort.index[:10]))
test_list

# Step 2: Data Preprocessing

In [None]:
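def get_datetimes(df):
    """Parse column labels into datetimes; assumes every column after the first is a 'YYYY-MM' label."""
    return pd.to_datetime(df.columns.values[1:], format='%Y-%m')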

# Step 3: EDA and Visualization

In [None]:
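# 'sans-serif' is a generic family matplotlib always resolves
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 22}

# plt is matplotlib.pyplot, imported in the first cell
plt.rc('font', **font)

# NOTE: if your visualizations are too cluttered to read, try calling 'plt.gcf().autofmt_xdate()'!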

# Step 4: Reshape from Wide to Long Format

In [None]:
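def melt_data(df):
    """Reshape a wide-format frame (one column per month) into long format."""
    # 'zipcode' is the renamed 'RegionName' column
    id_vars = ['zipcode', 'City', 'State', 'Metro', 'CountyName']
    melted = pd.melt(df, id_vars=id_vars, var_name='time')
    # Column labels look like '1996-04', so parse them with an explicit format
    melted['time'] = pd.to_datetime(melted['time'], format='%Y-%m')
    melted = melted.dropna(subset=['value'])
    #return melted.groupby('time').aggregate({'value':'mean'})
    return melted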

In [None]:
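# Keep only the id columns and the monthly value columns before melting
extra_cols = ['RegionID', 'SizeRank', 'total_growth', '10yr_growth', '5yr_growth', '2yr_growth', '1yr_growth']
test_df = df.drop(columns=extra_cols, errors='ignore')
test_df = melt_data(test_df)
test_df.head(20)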

In [None]:
test_df.shape

In [None]:
states = sorted(list(set(test_df.State.values)))
print(states)

In [None]:
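for state in states:
    # Average value across the state's zipcodes for each month
    state_df = test_df[test_df['State'] == state].groupby(['State','time']).mean(numeric_only=True)
    # Month-over-month fractional change
    state_df['growth_rate'] = state_df.value.pct_change()
    state_df.growth_rate.plot(figsize=(12,4))
    plt.title(str(state))
    plt.show()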

In [None]:
test_state_df = test_df.groupby(['State', 'time']).mean(numeric_only=True)
test_state_df.head(10)

In [None]:
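# Compute month-over-month change within each state so values never mix across states
test_state_df['roll'] = test_state_df.groupby(level='State')['value'].pct_change()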
test_state_df.head()
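
A note on percent changes over a `(State, time)` index: a plain `pct_change()` runs straight down the column, so the first month of one state gets compared against the last month of the state before it. The sketch below uses invented numbers and hypothetical names (`toy`, `by_state`) to show grouping by the `State` level first, so that each state's first month comes out as `NaN`.

```python
import pandas as pd

# Toy long-format data for two states; the values are invented.
toy = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'TX', 'TX', 'TX'],
    'time': pd.to_datetime(['2018-01', '2018-02', '2018-03'] * 2, format='%Y-%m'),
    'value': [100.0, 110.0, 121.0, 50.0, 55.0, 66.0],
})

# One row per (State, time) pair, averaging the numeric columns.
by_state = toy.groupby(['State', 'time']).mean(numeric_only=True)

# Group by the State level so each state's first month has no predecessor.
by_state['growth_rate'] = by_state.groupby(level='State')['value'].pct_change()

print(by_state)
```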

In [None]:
# Select California's rows from the State level of the index, then plot over time
test_state_df['roll'].loc['CA'].plot()

# Step 5: ARIMA Modeling

# Step 6: Interpreting Results