
- The `os` module provides `os.listdir` for listing the files in a directory.
- `pandas.json_normalize` could work here, but it is not necessary for converting the JSON data to a dataframe.
- You may need a nested for-loop to access each sale!
- We've put a lot of time into creating the structure of this repository, and it's a good example for future projects. The file `functions_variables.py` contains an example function that you can import and use. Any variables, functions, or classes that you want to create can go in `functions_variables.py` and be imported into a notebook. Note that only `.py` files can be imported into a notebook. To import everything from a `.py` file, use the following:
```python
from functions_variables import *
```
If you only `import functions_variables`, then each object from the file must be prefixed with `functions_variables.` when you use it.
Using this .py file will keep your notebooks very organized and make it easier to reuse code between notebooks.
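
As a rough sketch of the first three hints, the function below lists every `.json` file in a folder and collects one record per sale with a nested loop. The helper name `load_sales` and the key path `data` -> `results` are assumptions made for illustration; adjust them to match the actual layout of the files.

```python
import json
import os

import pandas as pd


def load_sales(data_dir, results_path=("data", "results")):
    """Collect every sale record from the JSON files in data_dir.

    results_path is an ASSUMED key path to the list of sales in each file.
    """
    records = []
    for filename in sorted(os.listdir(data_dir)):  # outer loop: one pass per file
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(data_dir, filename)) as f:
            node = json.load(f)
        for key in results_path:  # walk down to the list of sales
            node = node.get(key) if isinstance(node, dict) else None
        if not isinstance(node, list):
            continue  # no list of sales at the assumed path, skip this file
        for sale in node:  # inner loop: one pass per sale
            records.append(sale)
    return pd.DataFrame(records)
```

Calling `load_sales('../data')` would then return one row per sale, with nested fields still stored as dictionaries or lists until they are flattened.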

In [468]:
# (this is not an exhaustive list of libraries)
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
import os
import json
from pprint import pprint

from modules.JSONFramer import JSONFramer

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [469]:
#use a separate instance name so the JSONFramer class is not overwritten if this cell is re-run
framer = JSONFramer('../data')
df = framer.frame()
df.reset_index(drop=True, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 45 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   branding_name          7910 non-null   object 
 1   baths                  7980 non-null   float64
 2   baths_1qtr             0 non-null      object 
 3   baths_3qr              0 non-null      object 
 4   baths_full             7311 non-null   object 
 5   baths_half             2281 non-null   object 
 6   beds                   7504 non-null   object 
 7   garage                 4448 non-null   object 
 8   lot_sqft               6991 non-null   object 
 9   name                   0 non-null      object 
 10  sold_date              8159 non-null   object 
 11  sold_price             6716 non-null   object 
 12  sqft                   7323 non-null   object 
 13  stories                6260 non-null   object 
 14  sub_type               1427 non-null   object 
 15  type

In [470]:
#count unique property ids
df['property_id'].nunique()
#1795 unique property_ids in a dataframe of 8159 rows indicates that row duplication exists for each property

1795

In [471]:
#display df sorted by 'property_id' column to visually inspect the data
df.sort_values(by=['property_id']).head(6)
#Quick check shows that multiple values for 'branding_name' column exist for each property_id

Unnamed: 0,branding_name,baths,baths_1qtr,baths_3qr,baths_full,baths_half,beds,garage,lot_sqft,name,sold_date,sold_price,sqft,stories,sub_type,type,year_built,is_coming_soon,is_contingent,is_for_rent,is_foreclosure,is_new_construction,is_new_listing,is_pending,is_plan,is_price_reduced,is_subdivision,last_update_date,show_contact_an_agent,list_date,list_price,listing_id,city,lat,lon,line,postal_code,state,state_code,open_houses,price_reduced_amount,brand_name,property_id,status,tags
6194,Compass,3.0,,,3.0,,3.0,5.0,43560.0,,2023-12-15,925000,2980.0,1.0,,single_family,1978.0,,,,,,False,,,,,2023-12-15T21:25:24Z,True,2023-09-25T16:20:27.000000Z,925000,2959966417,Carson City,39.093819,-119.79112,3527 Arcadia Dr,89705,Nevada,NV,,,basic_opt_in,1003442504,sold,"[community_outdoor_space, energy_efficient, fa..."
1837,"Weichert, Realtors - Courtney Valleywide",3.0,,,3.0,,4.0,2.0,95832.0,,2023-12-22,625600,2136.0,1.0,,single_family,1996.0,,,,,,False,,,,,2023-12-23T09:05:36Z,True,2023-11-20T16:22:33.000000Z,600000,2961854149,Phoenix,33.804177,-112.056055,35033 N 12th St,85086,Arizona,AZ,,,basic_opt_in,1007849822,sold,"[single_story, garage_1_or_more, garage_2_or_m..."
7199,My Home Group Real Estate,3.0,,,3.0,,4.0,2.0,95832.0,,2023-12-22,625600,2136.0,1.0,,single_family,1996.0,,,,,,False,,,,,2023-12-23T09:05:36Z,True,2023-11-20T16:22:33.000000Z,600000,2961854149,Phoenix,33.804177,-112.056055,35033 N 12th St,85086,Arizona,AZ,,,basic_opt_in,1007849822,sold,"[single_story, garage_1_or_more, garage_2_or_m..."
571,Cooper Premier Properties Llc,3.0,,,3.0,,4.0,2.0,95832.0,,2023-12-22,625600,2136.0,1.0,,single_family,1996.0,,,,,,False,,,,,2023-12-23T09:05:36Z,True,2023-11-20T16:22:33.000000Z,600000,2961854149,Phoenix,33.804177,-112.056055,35033 N 12th St,85086,Arizona,AZ,,,basic_opt_in,1007849822,sold,"[single_story, garage_1_or_more, garage_2_or_m..."
6386,Keller Williams Realty Sonoran Living,3.0,,,3.0,,4.0,2.0,95832.0,,2023-12-22,625600,2136.0,1.0,,single_family,1996.0,,,,,,False,,,,,2023-12-23T09:05:36Z,True,2023-11-20T16:22:33.000000Z,600000,2961854149,Phoenix,33.804177,-112.056055,35033 N 12th St,85086,Arizona,AZ,,,basic_opt_in,1007849822,sold,"[single_story, garage_1_or_more, garage_2_or_m..."
3479,"Sell Your Home Services, Inc.",3.0,,,3.0,,4.0,2.0,95832.0,,2023-12-22,625600,2136.0,1.0,,single_family,1996.0,,,,,,False,,,,,2023-12-23T09:05:36Z,True,2023-11-20T16:22:33.000000Z,600000,2961854149,Phoenix,33.804177,-112.056055,35033 N 12th St,85086,Arizona,AZ,,,basic_opt_in,1007849822,sold,"[single_story, garage_1_or_more, garage_2_or_m..."


In [472]:
#check shape of the df after dropping duplicated 'branding_name' column and unhashable 'tags' column
temp_dataframe = df.drop(columns =['branding_name','tags'])
temp_dataframe.drop_duplicates().shape
#1795 rows is same number as unique property ids, therefore eliminating the branding_name column will remove duplicates

(1795, 43)

In [473]:
# #Note: the 'tags' column was checked for duplication by converting the list type of each value to a string and then applying '.drop_duplicates'

# def tags_to_string...

# string_data = all_data
# string_data.loc[:, 'tags'] = string_data['tags'].apply(lambda x: tags_to_string(x))

# string_data.drop_duplicates().info()

# # No duplication was found

In [474]:
#check to see how many unique 'branding_name' values there are (just in case there were only a few and we could keep this column)
len(df['branding_name'].unique())
#As expected, there are too many (172) so the column can be dropped

172

**-Drop Columns:**

In [475]:
#drop the 'branding_name' column 
#also drop all columns with 0 non-null values including the following:
df = df.drop(columns = ['branding_name', 'baths_1qtr', 'baths_3qr', 'name', 'is_coming_soon', 'is_contingent', 'is_for_rent', 'is_new_construction', 'is_pending', 'is_plan', 'is_subdivision', 'open_houses'])

# df = dataframe.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 33 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   baths                  7980 non-null   float64
 1   baths_full             7311 non-null   object 
 2   baths_half             2281 non-null   object 
 3   beds                   7504 non-null   object 
 4   garage                 4448 non-null   object 
 5   lot_sqft               6991 non-null   object 
 6   sold_date              8159 non-null   object 
 7   sold_price             6716 non-null   object 
 8   sqft                   7323 non-null   object 
 9   stories                6260 non-null   object 
 10  sub_type               1427 non-null   object 
 11  type                   8125 non-null   object 
 12  year_built             7316 non-null   object 
 13  is_foreclosure         42 non-null     object 
 14  is_new_listing         7752 non-null   object 
 15  is_p

**-Drop rows where target column is null:** 

In [476]:
#Because our model's target variable is 'sold_price', we can drop all rows where this is null
df.dropna(subset=['sold_price'], inplace=True)
df.shape

(6716, 33)

**-Encode 'tags' in columns:**

In [477]:
#we need to explode and one-hot encode the 'tags' column
#(this needs to be done before dropping duplicates, as the list values in 'tags' are unhashable)

In [478]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = df['tags'].apply(lambda x: x if x is not None else [])
binarized_tags = pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)

binarized_tags.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6716 entries, 0 to 8158
Columns: 152 entries, baseball to wrap_around_porch
dtypes: int64(152)
memory usage: 7.8 MB


In [479]:
df = df.merge(binarized_tags, left_index=True, right_index=True).drop(columns='tags')
df = df.drop_duplicates()

In [480]:
df.columns

Index(['baths', 'baths_full', 'baths_half', 'beds', 'garage', 'lot_sqft',
       'sold_date', 'sold_price', 'sqft', 'stories',
       ...
       'views', 'volleyball', 'washer_dryer', 'water_view', 'waterfront',
       'well_water', 'white_kitchen', 'wine_cellar', 'wooded_land',
       'wrap_around_porch'],
      dtype='object', length=184)

In [481]:
#The 'tags' column itself was dropped in the merge step
#Compute each binary tag column's share of rows (column sum / total row count)
#Collect all binary tag columns that are present in less than 5% of the rows so they can be dropped



binarized_tags_dict = (df.loc[:, mlb.classes_.tolist()].sum()/df.shape[0]).to_dict()

binarized_tags_list = []

for key in binarized_tags_dict.keys():
    if binarized_tags_dict[key] < 0.05:
        binarized_tags_list.append(key)

binarized_tags_list

['baseball',
 'basketball',
 'basketball_court',
 'beach',
 'beautiful_backyard',
 'big_bathroom',
 'boat_dock',
 'cathedral_ceiling',
 'clubhouse',
 'coffer_ceiling',
 'community_boat_facilities',
 'community_center',
 'community_clubhouse',
 'community_elevator',
 'community_golf',
 'community_gym',
 'community_horse_facilities',
 'community_park',
 'community_spa_or_hot_tub',
 'community_tennis_court',
 'courtyard_entry',
 'cul_de_sac',
 'den_or_office',
 'detached_guest_house',
 'dual_master_bedroom',
 'efficient',
 'elevator',
 'exposed_brick',
 'fenced_courtyard',
 'first_floor_master_bedroom',
 'fixer_upper',
 'fruit_trees',
 'furniture',
 'game_room',
 'gated_community',
 'golf_course',
 'golf_course_lot_or_frontage',
 'golf_course_view',
 'gourmet_kitchen',
 'granite_kitchen',
 'greenbelt',
 'greenhouse',
 'guest_house',
 'guest_parking',
 'handicap_access',
 'hill_or_mountain_view',
 'hoa',
 'horse_facilities',
 'indoor_basketball_court',
 'investment_opportunity',
 'jack_and

In [482]:
df = df.drop(columns=binarized_tags_list)

In [483]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1475 entries, 0 to 8117
Data columns (total 79 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   baths                        1443 non-null   float64
 1   baths_full                   1329 non-null   object 
 2   baths_half                   434 non-null    object 
 3   beds                         1367 non-null   object 
 4   garage                       764 non-null    object 
 5   lot_sqft                     1275 non-null   object 
 6   sold_date                    1475 non-null   object 
 7   sold_price                   1475 non-null   object 
 8   sqft                         1331 non-null   object 
 9   stories                      1118 non-null   object 
 10  sub_type                     259 non-null    object 
 11  type                         1471 non-null   object 
 12  year_built                   1329 non-null   object 
 13  is_foreclosure         

**-Fill N/A Values**

In [484]:
#fill N/A values with zeros for 'baths_full', 'baths_half', 'garage', 'price_reduced_amount'
df.update(df[['baths_full','baths_half', 'garage', 'price_reduced_amount']].fillna(0))

In [485]:
#fill N/A values with 'False' for 'is_price_reduced'
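#assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['is_price_reduced'] = df['is_price_reduced'].fillna(False)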

**-Drop more unnecessary Columns**

In [486]:
#drop 'sub_type' column as it is either missing or very similar to 'type' in most instances
df.drop(columns='sub_type', inplace=True)

In [487]:
#We can probably drop the 'last_update_date' column as I assume it has to do with website or realtor activity
df.drop(columns='last_update_date', inplace=True)

In [488]:
#Only False values for this column so we can drop it
#It might be the case that None type should be 'True' but I don't think we have enough info to make this inference
df.drop(columns='is_new_listing', inplace=True)

In [489]:
#Not obvious what brand name is - I suggest we drop but we could also just try modelling with it left in
df.drop(columns='brand_name', inplace=True)

In [490]:
#status is always sold so we can drop this
df.drop(columns='status', inplace=True)

**-Change Boolean values to Binary**

In [491]:
df.drop(columns='show_contact_an_agent', inplace=True)

In [492]:
#Do we want to drop the foreclosure column because so few data points exist?
df.drop(columns='is_foreclosure', inplace=True)
#Alternatively we can just fill these null values with False

In [503]:
df[df['beds'].isna()]

Unnamed: 0,baths,baths_full,baths_half,beds,garage,lot_sqft,sold_date,sold_price,sqft,stories,type,year_built,is_price_reduced,list_date,list_price,listing_id,city,lat,lon,line,postal_code,state,state_code,price_reduced_amount,property_id,basement,big_lot,big_yard,carport,central_air,central_heat,city_view,community_outdoor_space,community_security_features,community_swimming_pool,corner_lot,dining_room,disability_features,dishwasher,energy_efficient,ensuite,family_room,farm,fenced_yard,fireplace,floor_plan,forced_air,front_porch,garage_1_or_more,garage_2_or_more,garage_3_or_more,groundscare,hardwood_floors,high_ceiling,laundry_room,master_bedroom,modern_kitchen,new_roof,open_floor_plan,park,ranch,recreation_facilities,rental_property,shopping,single_story,swimming_pool,trails,two_or_more_stories,updated_kitchen,view,views,washer_dryer
41,0.0,,0.0,,0.0,19800.0,2024-01-05,74000.0,,,land,,False,2021-10-06T16:40:40.000000Z,90000.0,2935179051.0,Lincoln,40.711364,-96.617101,7646 Isidore Dr,68516,Nebraska,NE,0.0,9078473372,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
139,0.0,,0.0,,0.0,273992.0,2024-01-12,102000.0,,,land,,False,2023-10-09T22:26:57.000000Z,110000.0,2960391412.0,Spencer,35.540562,-97.351857,6700 N Post Rd,73084,Oklahoma,OK,0.0,7480938180,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
144,0.0,,0.0,,0.0,43560.0,2024-01-12,147500.0,,,land,,False,2023-12-12T17:26:25.000000Z,160000.0,2962409797.0,Edmond,35.595714,-97.411402,12317 Thelmas Way,73013,Oklahoma,OK,0.0,9031371252,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
246,0.0,,0.0,,0.0,43560.0,2023-12-28,7700.0,,,land,,False,2023-09-14T18:27:45.000000Z,12999.0,2959674302.0,Charleston,38.418295,-81.584571,2310 Mile Fork Rd,25312,West Virginia,WV,0.0,9812715777,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
255,0.0,,0.0,,0.0,653400.0,2023-11-14,15400.0,,,land,,False,2023-06-08T21:27:07Z,20000.0,2956543845.0,Charleston,,,Youngs Hollow Rd,25320,West Virginia,WV,0.0,9895387158,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
256,0.0,,0.0,,0.0,196020.0,2023-11-14,6930.0,,,land,,False,2023-08-04T19:17:44Z,19999.0,2958387594.0,Charleston,38.294911,-81.628972,1441 Cane Fork Rd,25314,West Virginia,WV,0.0,3975252838,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
258,0.0,,0.0,,0.0,1348618.0,2023-10-24,20900.0,,,land,,False,2023-07-18T17:41:05Z,35000.0,2957803138.0,Charleston,38.48811,-81.596275,5575 Legg Fork Rd,25320,West Virginia,WV,0.0,9727111037,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
265,0.0,,0.0,,0.0,24394.0,2023-08-22,1122.0,,,land,,False,2023-05-31T16:25:31Z,5000.0,2956242830.0,Charleston,38.332075,-81.497943,1077 Campbells Creek Dr,25306,West Virginia,WV,0.0,3263346436,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
267,0.0,,0.0,,0.0,51401.0,2023-08-22,385.0,,,land,,False,2023-05-31T16:25:31Z,5000.0,2956242804.0,Charleston,,,Maple Hollow Rd,25311,West Virginia,WV,0.0,9500733492,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
268,0.0,,0.0,,0.0,6098.0,2023-08-22,308.0,,,land,,False,2023-05-31T16:25:31Z,5000.0,2956242828.0,Charleston,38.306225,-81.571146,5744 Victory Ave,25304,West Virginia,WV,0.0,3233095968,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**-Time on Market, Month sold, Month listed Columns**

In [494]:
#We could pull just the month (and/or year) from the 'sold date' (and/or list date) columns to make 'month listed', 'month sold' columns
#We could also create a 'time on market' column from the 'sold date' and 'list date' columns
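
One possible way to build these columns is sketched below, using two rows copied from the outputs above. The new column names (`month_listed`, `month_sold`, `days_on_market`) are assumptions, and because `list_date` appears in two timestamp layouts, only the leading `YYYY-MM-DD` part is parsed.

```python
import pandas as pd

# Two rows copied from the outputs above, for illustration only.
sales = pd.DataFrame({
    "list_date": ["2023-09-25T16:20:27.000000Z", "2023-06-08T21:27:07Z"],
    "sold_date": ["2023-12-15", "2023-11-14"],
})

# list_date mixes two timestamp layouts, so keep only the date portion.
listed = pd.to_datetime(sales["list_date"].str[:10], format="%Y-%m-%d")
sold = pd.to_datetime(sales["sold_date"], format="%Y-%m-%d")

sales["month_listed"] = listed.dt.month
sales["month_sold"] = sold.dt.month
# Whole days between listing and sale.
sales["days_on_market"] = (sold - listed).dt.days
```

Parsing only the date portion ignores the time of day, which seems acceptable when the result is measured in whole days.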

**-Drop foreclosure Column?**

**-Create Mean House Sale Value by City Column**

In [495]:
#(based on Samson's suggestion)
#Drop location (city, state) columns
#create new column that lists the mean house sale value based on the row's corresponding city

**-Normalize Data**

In [496]:
#Normalize, scaler?

At this point, ensure that you have all sales in a dataframe.
- Is each cell one value, or do some cells have lists?
- Maybe the "tags" will help create some features.
- What are the data types of each column?
- Some sales may not actually include the sale price.  These rows should be dropped.
- Some sales don't include the property type.
- There are a lot of None values.  Should these be dropped or replaced with something?
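
The questions above can be checked programmatically. The sketch below uses a small toy frame (illustrative values, not the real data) to find columns that hold lists, convert numeric-looking `object` columns, and drop rows without a sale price.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative.
sample = pd.DataFrame({
    "beds": ["3", "4", None],
    "sold_price": ["925000", None, "625600"],
    "tags": [["garage_1_or_more"], None, ["single_story", "view"]],
})

# Which columns hold a list in at least one cell?
list_columns = [
    col for col in sample.columns
    if sample[col].apply(lambda v: isinstance(v, list)).any()
]

# Convert numeric-looking object columns; unparseable entries become NaN.
numeric_columns = ["beds", "sold_price"]
sample[numeric_columns] = sample[numeric_columns].apply(pd.to_numeric, errors="coerce")

# Rows without a sale price cannot be used to train the model.
sample = sample.dropna(subset=["sold_price"])
```

Whether the remaining missing values should be dropped or filled depends on the column; for example, a missing `garage` value may reasonably mean zero, while a missing `sqft` probably does not.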

In [497]:
# load and concatenate data here
# drop or replace values as necessary

Consider the fact that with tags, there are a lot of categorical variables.
- How many columns would we have if we OHE tags, city and state?
- Perhaps we can get rid of tags that have a low frequency.
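
To estimate the column count before encoding anything, one option is to count the distinct values directly. The sketch below runs on a toy frame, and the `min_share` threshold is an illustrative assumption rather than a recommended value.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative.
toy = pd.DataFrame({
    "city": ["Phoenix", "Phoenix", "Carson City"],
    "state": ["Arizona", "Arizona", "Nevada"],
    "tags": [["single_story", "view"], ["view"], None],
})

n_city = toy["city"].nunique()
n_state = toy["state"].nunique()

# explode() yields one row per tag, so value_counts() gives each tag's frequency.
tag_counts = toy["tags"].explode().value_counts()
n_tags = len(tag_counts)

# Number of columns that one-hot encoding all three would create.
total_ohe_columns = n_city + n_state + n_tags

# Tags present in fewer than min_share of the rows are candidates for removal.
min_share = 0.5  # illustrative threshold only
rare_tags = tag_counts[tag_counts / len(toy) < min_share].index.tolist()
```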

In [498]:
# OHE categorical variables here
# tags will have to be done manually

- Sales will vary drastically between cities and states.  Is there a way to keep information about which city it is without OHE such as using central tendency?
- Could we label encode or ordinal encode?  Yes, but this may have undesirable effects, giving nominal data ordinal values.
- If you replace cities or states with numerical values, split the data first so that no information from the test set leaks into training. This is a great time to train test split. Compute the values on the training data, then join them to the test data.
- Drop columns that aren't needed.
- Don't keep the list price because it will be too close to the sale price.
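
A minimal sketch of the central-tendency idea follows, assuming the mean sale price per city is used as the replacement value. The prices are made up, and the column name `city_mean_price` and the fallback to the overall training mean for unseen cities are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up prices, for illustration only.
toy = pd.DataFrame({
    "city": ["Phoenix"] * 4 + ["Charleston"] * 4 + ["Lincoln"],
    "sold_price": [600.0, 620.0, 640.0, 660.0, 10.0, 20.0, 30.0, 40.0, 74.0],
})

train, test = train_test_split(toy, test_size=0.33, random_state=0)
train, test = train.copy(), test.copy()

# Learn the per-city mean from the training rows only.
city_means = train.groupby("city")["sold_price"].mean()
global_mean = train["sold_price"].mean()

# Apply the same lookup to both sets; cities unseen in training
# fall back to the overall training mean.
train["city_mean_price"] = train["city"].map(city_means)
test["city_mean_price"] = test["city"].map(city_means).fillna(global_mean)

train = train.drop(columns="city")
test = test.drop(columns="city")
```

Because `city_means` is computed from `train` alone, the test rows never contribute to the values they are later scored against.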

In [499]:
# perform train test split here
# do something with state and city
# drop any other not needed columns

**STRETCH**

- You're not limited to just using the data provided to you. Think about, or research, other features that might be useful for predicting housing prices.
- Can you import and join this data? Make sure you do any necessary preprocessing and make sure it is joined correctly.
- Example suggestion: could mortgage interest rates in the year of the listing affect the price? 
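
As one hedged illustration of joining external data, the sketch below merges a yearly lookup table onto the sales by listing year. The `rates` values are placeholders and not real mortgage rates, and the column names `list_year` and `mortgage_rate` are assumptions.

```python
import pandas as pd

sales = pd.DataFrame({
    "property_id": [1, 2, 3],
    "list_date": [
        "2021-10-06T16:40:40.000000Z",
        "2023-10-09T22:26:57.000000Z",
        "2023-06-08T21:27:07Z",
    ],
})

# Placeholder numbers, NOT real mortgage rates; replace with a sourced dataset.
rates = pd.DataFrame({"list_year": [2021, 2023], "mortgage_rate": [1.0, 2.0]})

# The first four characters of the timestamp are the year.
sales["list_year"] = sales["list_date"].str[:4].astype(int)

# validate="m:1" raises if a year appears more than once in rates;
# the left join keeps every sale even when no rate is available.
joined = sales.merge(rates, on="list_year", how="left", validate="m:1")
```

Comparing `len(joined)` with `len(sales)` after the merge is a quick check that the join did not duplicate or drop any rows.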

In [500]:
# import, join and preprocess new data here

Remember all of the EDA that you've been learning about?  Now is a perfect time for it!
- Look at distributions of numerical variables to see the shape of the data and detect outliers.
- Scatterplots of a numerical variable and the target go a long way to show correlations.
- A heatmap will help detect highly correlated features, and we don't want these.
- Is there any overlap in any of the features? (redundant information, like number of this or that room...)
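
The correlation check can also be done numerically before plotting. The sketch below lists feature pairs whose absolute correlation exceeds a threshold; the toy values and the 0.9 cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy values; "baths_full" duplicates "baths" on purpose.
toy = pd.DataFrame({
    "baths": [1.0, 2.0, 3.0, 4.0, 5.0],
    "baths_full": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lot_sqft": [5.0, 1.0, 4.0, 2.0, 3.0],
    "sold_price": [100.0, 210.0, 290.0, 405.0, 500.0],
})

features = toy.drop(columns="sold_price")
corr = features.corr().abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9].index.tolist()

# Numeric summary and skew of the target, useful for spotting outliers.
price_summary = toy["sold_price"].describe()
price_skew = toy["sold_price"].skew()
```

The `corr` frame is also the kind of input a heatmap function such as `seaborn.heatmap` expects, if a visual check is preferred.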

In [501]:
# perform EDA here

Now is a great time to scale the data and save it once it's preprocessed.
- You can save it in your data folder, but you may want to make a new `processed/` subfolder to keep it organized
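
A minimal sketch of scaling and saving follows, assuming `StandardScaler` is the chosen scaler and CSV is the output format. The feature values and file names are illustrative, and a temporary folder stands in for the project's data folder.

```python
import os
import tempfile

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative feature values only.
X_train = pd.DataFrame({"sqft": [1000.0, 2000.0, 3000.0], "beds": [2.0, 3.0, 4.0]})
X_test = pd.DataFrame({"sqft": [2000.0], "beds": [5.0]})

scaler = StandardScaler()
# Fit on the training rows only, then reuse the same statistics for the test rows.
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

# A temporary folder stands in for the project's data folder in this sketch.
data_dir = tempfile.mkdtemp()
processed_dir = os.path.join(data_dir, "processed")
os.makedirs(processed_dir, exist_ok=True)
X_train_scaled.to_csv(os.path.join(processed_dir, "X_train.csv"), index=False)
X_test_scaled.to_csv(os.path.join(processed_dir, "X_test.csv"), index=False)
```

Fitting the scaler on the training set alone follows the same no-leakage reasoning as the train test split earlier in the notebook.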