## Intro:

This notebook walks through Python code that extracts raw property sales data and transforms it into a Python list. My own commentary on particular pieces of code appears in comments, denoted by the '#' symbol.

### Raw data files repository and technical documentation:

Source of raw dataset: https://valuation.property.nsw.gov.au/embed/propertySalesInformation

*You can download data files in weekly batches (if the files pertain to property data from the current year), or in yearly increments (for years prior to the current year). The files come in zip format and include multiple .DAT files (thousands if you are downloading yearly zip files).*

Please review the terms and conditions relating to this raw dataset via the above and below links.

NSW property raw data guide: https://www.valuergeneral.nsw.gov.au/__data/assets/pdf_file/0016/216403/Property_Sales_Data_File_-_Data_Elements_V3.pdf

## Setup:

Download as many of the raw data files as you need. In this case I downloaded 12 weeks' worth of raw data, covering 6 Jan 2020 to 23 Mar 2020. I then unzipped the files and saved all of the included DAT files into a single directory on my computer (which takes roughly 30 seconds using WinZip).
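The unzipping step can also be scripted rather than done by hand. The following is a minimal sketch, not the method used for this notebook: `extract_dat_files` is a hypothetical helper, and the folder names in the usage note are assumptions. It copies every `.DAT` member of each zip archive into one flat directory. It does not open archives nested inside other archives, and files that share a name overwrite each other.

```python
import os
import shutil
import zipfile


def extract_dat_files(zip_dir, out_dir):
    """Copy every .DAT member of each zip in zip_dir into out_dir (flat)."""
    os.makedirs(out_dir, exist_ok=True)
    extracted = 0
    for name in sorted(os.listdir(zip_dir)):
        if not name.lower().endswith('.zip'):
            continue  # skip anything that is not a zip archive
        with zipfile.ZipFile(os.path.join(zip_dir, name)) as archive:
            for member in archive.namelist():
                if not member.upper().endswith('.DAT'):
                    continue  # skip folders and non-DAT members
                # Drop any folder structure stored inside the archive.
                target = os.path.join(out_dir, os.path.basename(member))
                with archive.open(member) as src, open(target, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                extracted += 1
    return extracted
```

Assuming the downloaded archives sit in a folder called `zips`, calling `extract_dat_files('zips', 'DAT_Files')` would fill the `DAT_Files` directory and return the number of files copied.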
  

In [15]:
# Importing necessary Python modules:

import os, csv
from datetime import datetime


In [16]:
# Initial variable creation:

properties_list = [] # Creates an empty list. This will temporarily store the semi-processed dataset.
district_codes_list = [] # Creates an empty list used for checking purposes.
record_codes_list = [] # Creates an empty list used for checking purposes.
owd = os.getcwd() #saves original working directory reference 
os.chdir(owd) # has no effect at this point, since owd is already the current directory; owd is used later to return from the DAT_Files directory

## Main script:

In [17]:
# This cell includes the script that automatically goes through each DAT file and extracts and transforms the relevant data points. This code cell takes only a couple of seconds to run. 

os.chdir('./DAT_Files') # Changes the active directory to the location of the DAT files

start_time = datetime.now() 
DATfile_count = 0 #Variable created that will allow us to know exactly how many DAT files are processed

for DATfile in os.listdir('.'): # searches through each file in the active directory
    if not DATfile.endswith('.DAT'):
        continue    # skips non-DAT files

    DATfile_obj = open(DATfile)
    reader_obj = csv.reader(DATfile_obj, delimiter=";") # reads the DAT file in csv format
    ListData = list(reader_obj) # turns the data read from the csv file into a python list object
    DATfile_count += 1 #Counts the DAT files processed

    # Loops through each element of the newly created Python list. Each element
    # equates to a single row of the DAT file as it would appear in Excel, and each
    # index of an element holds one data point. For example, i[0] (the first index)
    # contains an 'A', 'B', 'C', or 'D'. This script only extracts the elements whose
    # first index holds a 'B', as these rows hold the property transaction data we need.
    for i in ListData:
        if (i[0] == 'B' and i[17] == 'R'): # this if statement ensures we only take rows in the DAT files that hold residential property sales information. It ignores commercial (i.e. businesses), parking spaces, and empty land property sales.  
            if i[12] == 'H': # here we are converting hectares into square metres. This is because there is inconsistency in the raw data, where some property sales entries have inputs denominating the area of the property in hectares, while others in square metres.
                floating_prop_area = float(i[11])
                i[11] = floating_prop_area * 10000
                i[12] = 'M' # 'M' denotes metres, as opposed to 'H' for hectares. I later delete this data point as it is superfluous to future analysis. 
            if i[19] != '': # if column 20 (i[19]) is not empty, sets it to '1' to flag the property as a unit rather than a house
                i[19] = '1'
            if i[20] != '': # if column 21 (i[20]) is not empty, sets it to '1' to flag the property as a house rather than a unit
                i[20] = '1'
            
            District_Code = i[1] # used for counting the unique district codes included in our processed dataset.
            Record_Code = i[23] # used for counting the property sales transactions included in our processed dataset.

            # Removing redundant data points:
            del i[24]
            del i[22]
            del i[21]
            del i[18]
            del i[17]
            del i[12]
            del i[5]
            del i[4]
            del i[3]
            del i[0]

            properties_list.append(i) # Adding each row of processed data into a list holding all rows of data processed
            district_codes_list.append(District_Code) # used for counting the unique district codes included in our processed dataset.
            record_codes_list.append(Record_Code) # used for counting the property sales transactions included in our processed dataset.
    DATfile_obj.close()

os.chdir(owd) #changes back to original working directory

print('Time elapsed (hh:mm:ss.ms) {}'.format(datetime.now() - start_time))

Time elapsed (hh:mm:ss.ms) 0:00:01.853459
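
The chain of `del` statements in the cell above works because it deletes from the highest index down, so an earlier deletion never shifts the position of a later one. An equivalent way to express the same selection is to list the positions that are kept. The sketch below is an illustration only: `KEEP_INDICES` and `transform_row` are hypothetical names that are not used elsewhere in this notebook. Unlike the cell above, it works on a copy of the row and returns `None` for rows that are too short instead of raising an error.

```python
# Positions kept from each 'B' row: everything except
# 0, 3, 4, 5, 12, 17, 18, 21, 22 and 24.
KEEP_INDICES = [1, 2, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 23]


def transform_row(row):
    """Return the 15 retained fields of a residential 'B' row, else None."""
    if len(row) < 24 or row[0] != 'B' or row[17] != 'R':
        return None  # not a residential sale row (or too short to be one)
    row = list(row)  # work on a copy so the caller's row is untouched
    if row[12] == 'H':
        row[11] = float(row[11]) * 10000  # hectares to square metres
    for flag_index in (19, 20):
        if row[flag_index] != '':
            row[flag_index] = '1'  # same flagging as in the main script
    return [row[k] for k in KEEP_INDICES]
```

Listing the kept positions makes it easier to check the result against the 15 fields described later, at the cost of diverging slightly from the in-place approach used in the main script.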


## Summarising the processed data:

In [18]:
# Running this cell provides information about the extracted dataset

print('Number of rows of data extracted (AKA the number of property transactions): ' + str(len(district_codes_list)))
print('Number of DAT files processed: ' + str(DATfile_count))

Number of rows of data extracted (AKA the number of property transactions): 32968
Number of DAT files processed: 1400


In [19]:
# Running this cell will return 15 data points (in a python list object) for a single property transaction.

print('Sample of one of the property sale transaction rows (The 763rd entry): \n')

print(properties_list[762])

print('\n' *2)

print('Number of data points (variables/columns) for each property transaction entry: ' + str(len(properties_list[762])))


Sample of one of the property sale transaction rows (The 763rd entry): 

['004', '117039', '', '7', 'MURRAY RD', 'CARDIFF', '2285', '392', '20191206', '20200114', '450000', 'R3', '', '1', 'AP826618']



Number of data points (variables/columns) for each property transaction entry: 15


It is worth understanding what these 15 data points represent.

1. District Code  
*A unique 3 digit numeric identifier applied to every district within the State of New South Wales*

2. Property Id  
*A unique numeric identifier applied to every property within the State of New South Wales*  

3. Property Unit Number 
*The unit number of a property as recorded in the Register of Land Values*

4. Property House Number    
*The house number of a property as recorded in the Register of Land Values*

5. Property Street Name 
*The street name of a property as recorded in the Register of Land Values*

6. Property Locality    
*The name of the locality a property exists within as recorded in the Register of Land Values*

7. Property Post Code     
*The unique 4 digit numeric postal code a property exists within as recorded in the Register of Land Values.*

8. Area                 
*The extent or measurement of land as recorded in the Register of Land Values*

9. Contract Date    
*The calendar date on which contracts were exchanged as recorded in the Register of Land Values and sourced from the Notice of Sale.*

10. Settlement Date      
*The calendar date on which a contract was settled as recorded in the Register of Land Values and sourced from the Notice of Sale*

11. Purchase Price  
*The purchase price of a property as recorded in the Register of Land Values*

12. Zoning  
*The zone classification applied to a property as recorded in the Register of Land Values*

13. Unit    
*A derived boolean variable. A value of 1 indicates that the property is a unit, as opposed to a house.*

14. House   
*A derived boolean variable. A value of 1 indicates that the property is a house, as opposed to a unit.*

15. Dealing number  
*A unique identifier applied to a dealing created within the State of New South Wales.*


    
Definitions and labels are primarily derived from the linked documentation: https://www.valuergeneral.nsw.gov.au/__data/assets/pdf_file/0016/216403/Property_Sales_Data_File_-_Data_Elements_V3.pdf
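
To make the positions easier to work with, each row can be paired with the 15 labels above. This is a minimal sketch: `FIELD_NAMES` and `to_record` are hypothetical names introduced for illustration, and the row is the sample printed earlier. It also relies on the two date fields being in `YYYYMMDD` form, as they are in the sample, so that `datetime.strptime` can parse them.

```python
from datetime import datetime

FIELD_NAMES = [
    'District Code', 'Property Id', 'Property Unit Number',
    'Property House Number', 'Property Street Name', 'Property Locality',
    'Property Post Code', 'Area', 'Contract Date', 'Settlement Date',
    'Purchase Price', 'Zoning', 'Unit', 'House', 'Dealing Number',
]


def to_record(row):
    """Pair one processed row with its field names."""
    if len(row) != len(FIELD_NAMES):
        raise ValueError(
            'expected {} fields, got {}'.format(len(FIELD_NAMES), len(row)))
    return dict(zip(FIELD_NAMES, row))


# The sample row shown earlier in this notebook.
sample_row = ['004', '117039', '', '7', 'MURRAY RD', 'CARDIFF', '2285', '392',
              '20191206', '20200114', '450000', 'R3', '', '1', 'AP826618']
record = to_record(sample_row)

# Dates are stored as YYYYMMDD strings, so they parse with '%Y%m%d'.
contract = datetime.strptime(record['Contract Date'], '%Y%m%d')
settlement = datetime.strptime(record['Settlement Date'], '%Y%m%d')
days_to_settle = (settlement - contract).days

print(record['Property Locality'], record['Purchase Price'], days_to_settle)
```

For the sample row this prints `CARDIFF 450000 39`, the 39 being the number of days between the contract and settlement dates.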



In [20]:
# Running this cell provides further information about the extracted dataset

number_of_datapoints_total = len(district_codes_list) * 15

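print('Number of unique record codes: ' + str(len(record_codes_list))) # Note: len() of the list counts every record code collected, duplicates included; len(set(record_codes_list)) would count only distinct codes.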
print('Number of total data points in dataset (including NaN): ' + str(number_of_datapoints_total))

district_codes_sets = set(district_codes_list) # collates all unique property district codes featured in the processed dataset
district_codes_number_of = len(district_codes_sets)
totalprop_ofpossibledistricts = district_codes_number_of / 130
totalprop_ofpossibledistricts_perc = "{:.0%}".format(totalprop_ofpossibledistricts)

print('Number of unique district codes extracted: ' + str(len(district_codes_sets)))
print('Total number of unique districts in NSW, according to the raw data documentation, is 130.')
# According to this reference there are 130 possible district codes:
# https://www.valuergeneral.nsw.gov.au/__data/assets/pdf_file/0018/216405/Property_Sales_Data_File_District_Codes_and_Names.pdf
# It is entirely reasonable for a few to be missing from our dataset, as some districts in remote areas of NSW may go many months without a residential property sale.
print('Percentage of all possible district codes in the dataset: ' + totalprop_ofpossibledistricts_perc)


Number of unique record codes: 32968
Number of total data points in dataset (including NaN): 494520
Number of unique district codes extracted: 127
Total number of unique districts in NSW, according to the raw data documentation, is 130.
Percentage of all possible district codes in the dataset: 98%
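
Because `len()` on a list counts duplicates, a distinct count needs a `set` (or a `Counter`), as was done for the district codes. The sketch below shows one way to report the total, the distinct count, and any repeated codes together. It uses made-up codes rather than the real dataset, and `summarise_codes` is a hypothetical helper.

```python
from collections import Counter


def summarise_codes(codes):
    """Return (total entries, distinct entries, codes seen more than once)."""
    counts = Counter(codes)
    repeated = sorted(code for code, n in counts.items() if n > 1)
    return len(codes), len(counts), repeated


# Made-up dealing numbers, for illustration only.
example_codes = ['AP000001', 'AP000002', 'AP000001', 'AP000003']
total, distinct, repeated = summarise_codes(example_codes)
print(total, distinct, repeated)
```

With the made-up codes this prints `4 3 ['AP000001']`. Applied to `record_codes_list`, the same call would show whether any record code appears more than once.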


In [21]:
# Run this cell if you wish to save the processed data into a csv file.

os.makedirs('processed_data', exist_ok=True) # Creates a new directory where the collated processed data is saved. 
os.chdir('./processed_data')
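csvfile = 'processed_data.csv' # Name of the csv file that will hold the processed data.
csvfile_obj = open(os.path.join('.', csvfile), 'a', newline='') # Opens the file in append mode ('a'), creating it if needed. Re-running this cell adds the rows again rather than overwriting them.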
csvWriter = csv.writer(csvfile_obj)
for row in properties_list:
    csvWriter.writerow(row)
csvfile_obj.close()

os.chdir(owd)
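
The cell above opens the file in append mode and writes no header row. A variation, sketched below under assumptions, writes a header and overwrites any earlier file, so repeated runs do not duplicate rows. `save_with_header`, `load_records`, and the header labels are hypothetical choices made for illustration; they are not part of the original script.

```python
import csv

# Hypothetical column labels, in the same order as the 15 processed fields.
HEADER = ['district_code', 'property_id', 'unit_number', 'house_number',
          'street_name', 'locality', 'post_code', 'area', 'contract_date',
          'settlement_date', 'purchase_price', 'zoning', 'unit', 'house',
          'dealing_number']


def save_with_header(rows, path):
    """Write rows to path with a header line, replacing any existing file."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        writer.writerows(rows)


def load_records(path):
    """Read the file back as a list of dicts keyed by the header labels."""
    with open(path, newline='') as handle:
        return list(csv.DictReader(handle))
```

Using `with` closes the file even if writing fails part-way, and the header lets the file be read back by column name, for example `load_records('processed_data.csv')[0]['locality']`.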