## Intro

I chose MySQL as the relational database management system to handle my dataset and used the accompanying MySQL Workbench software to run SQL queries. It is possible to run SQL queries via Python code, but the Workbench is much easier to use and includes shortcut tools that speed up the process. Python code is, however, needed to load the dataset into the MySQL database.



### Database and main table creation

In MySQL Workbench I created a database named ‘testdatabase’ and, inside it, a table named ‘property_sales’ to store my dataset. After some tweaking of the SQL data types assigned to each of the 15 data points (columns) per property sales transaction, I wrote and executed the query below to create the table. Remove the triple " symbols if you copy the query.

Also take into account whether you have replicated my removal of transactions with missing data points. In the previous exercise, I added code to remove cases where the street name, contract date, or settlement date was missing. If you are replicating my ETL process but chose not to use the aforementioned code, you will need to edit the SQL query below so that the affected columns accept missing values. Doing this is as simple as finding the relevant column line in the query and replacing 'NOT NULL' with 'NULL'.

In [101]:
"""

CREATE TABLE `testdatabase`.`property_sales` (
  `district_code` SMALLINT(3) NOT NULL,
  `property_id` INT NULL,
  `property_unit_number` VARCHAR(10) NULL,
  `property_house_number` VARCHAR(10) NULL,
  `property_street_name` VARCHAR(150) NOT NULL,
  `property_locality` VARCHAR(100) NOT NULL,
  `property_post_code` INT NOT NULL,
  `area` DECIMAL(16,8) NULL,
  `contract_date` DATE NOT NULL,
  `settlement_date` DATE NOT NULL,
  `purchase_price` INT NOT NULL,
  `zoning_classification` VARCHAR(3) NULL,
  `is_unit` TINYINT(1) NULL,
  `is_house` TINYINT(1) NULL,
  `record_id` VARCHAR(25) NOT NULL,
  `id` INT NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  UNIQUE INDEX `id_UNIQUE` (`id` ASC) VISIBLE);

  """


## Uploading the dataset to the SQL database

In [84]:
# You may need to install these libraries before importing them.
# The connection URL used below also relies on the pymysql driver being installed.

import pandas as pd
import sqlalchemy  

In [85]:
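# Create the engine that allows Python to interact with your MySQL database.
# Replace root, password, hostname and databasename with your own values (the database created above is testdatabase).
# This tutorial explains the line of code below: https://www.youtube.com/watch?v=M-4EpNdlSuY
engine = sqlalchemy.create_engine('mysql+pymysql://root:password@hostname/databasename')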
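If your MySQL password contains characters such as `@` or `/`, they need to be URL-encoded before being placed in the connection string, otherwise the URL can be parsed incorrectly. The following is a minimal sketch of one way to do that with the standard library; the helper name `build_mysql_url` and the example credentials are hypothetical and not part of my original setup.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Build a connection URL for the mysql+pymysql dialect (hypothetical helper)."""
    # quote_plus escapes characters such as '@' and '/' in the credentials.
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Made-up credentials, used only to show the escaping.
url = build_mysql_url("root", "p@ss/word", "localhost", "testdatabase")
print(url)
```

The resulting string can be passed to `sqlalchemy.create_engine` in place of the hand-written URL.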

In [86]:
dataset = pd.read_csv(r'processed_data\processed_data.csv', index_col=False, header=None) #This will read the csv file which holds the dataset created in the previous exercise, and save it into a pandas dataframe object. 

In [87]:
dataset.shape # Running this should return two figures. The first is the number of rows in the dataframe. The second is the number of columns in the dataframe. Run it to make sure the number of rows matches what you expect. 

(32965, 15)

In [88]:
column_names = ['district_code', 'property_id', 'property_unit_number', 'property_house_number', 'property_street_name', 'property_locality', 'property_post_code', 'area', 'contract_date', 'settlement_date', 'purchase_price', 'zoning_classification', 'is_unit', 'is_house', 'record_id']

In [89]:
dataset.columns = column_names
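Before uploading, it can be worth confirming that the columns declared NOT NULL in the table contain no missing values, since such rows would typically be rejected by MySQL. Below is a minimal sketch using a small made-up dataframe in place of `dataset`; the `required_columns` list covers only three of the NOT NULL columns, and the sample rows are illustrative assumptions.

```python
import pandas as pd

# A subset of the columns declared NOT NULL in the CREATE TABLE query.
required_columns = ['property_street_name', 'contract_date', 'settlement_date']

# Hypothetical sample rows standing in for the real dataset.
sample = pd.DataFrame({
    'property_street_name': ['SMITH ST', None, 'KING RD'],
    'contract_date': ['2021-01-05', '2021-02-10', None],
    'settlement_date': ['2021-02-01', '2021-03-01', '2021-04-01'],
})

# Count the missing values in each required column.
missing_counts = sample[required_columns].isna().sum()
print(missing_counts)

# Drop every row in which any required column is missing.
cleaned = sample.dropna(subset=required_columns)
print(cleaned.shape)
```

If any count is above zero for your own data, either drop those rows as shown or change the matching column to 'NULL' in the table definition.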

In [90]:
dataset.to_sql(name='property_sales', index=False, con=engine, if_exists='append')

## Querying the dataset and removing duplicates - using MySQL Workbench



The following query checks the number of property transactions in the SQL table.

In [None]:
""" 

SELECT COUNT(1) as 'Number of rows' FROM testdatabase.property_sales; 

"""


The following query checks for cases where the record_id is duplicated along with the property_id and property_house_number, while property_unit_number is NULL.

In [None]:
"""

SELECT record_id, COUNT(record_id), COUNT(property_id)
FROM property_sales 
GROUP BY record_id, property_unit_number, property_house_number, property_id 
HAVING COUNT(record_id) >1 AND COUNT(property_id) >1 AND COUNT(property_unit_number) = 0 AND COUNT(property_house_number) > 1;

"""
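To see what this kind of check returns, the idea can be tried on a handful of made-up rows. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for MySQL, and a simplified variant of the query (a `WHERE ... IS NULL` filter instead of the `HAVING COUNT(...) = 0` condition). The table is cut down to the relevant columns and the rows are invented.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE property_sales (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        record_id TEXT NOT NULL,
        property_id INTEGER,
        property_unit_number TEXT,
        property_house_number TEXT
    )
""")
conn.executemany(
    "INSERT INTO property_sales "
    "(record_id, property_id, property_unit_number, property_house_number) "
    "VALUES (?, ?, ?, ?)",
    [
        ('R1', 100, None, '12'),  # duplicated, no unit number
        ('R1', 100, None, '12'),
        ('R2', 200, '3', '7'),    # duplicated, but has a unit number
        ('R2', 200, '3', '7'),
        ('R3', 300, None, '9'),   # not duplicated
    ],
)

# Groups of rows without a unit number that occur more than once.
duplicates = conn.execute("""
    SELECT record_id, COUNT(*) AS copies
    FROM property_sales
    WHERE property_unit_number IS NULL
    GROUP BY record_id, property_id, property_house_number
    HAVING COUNT(*) > 1
""").fetchall()
print(duplicates)
```

Only the 'R1' group is reported: 'R2' has a unit number and 'R3' occurs once.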

The following query removes cases where the record_id is duplicated along with property_id and property_house_number, while property_unit_number is NULL. Within each set of duplicates, only the row with the highest id is kept. In my case there were 1,611 duplicates that met these criteria!

In [99]:
"""

DELETE t1 FROM property_sales t1
INNER JOIN property_sales t2
WHERE t1.id < t2.id AND
t1.record_id = t2.record_id AND
t1.property_id = t2.property_id AND
t1.property_house_number = t2.property_house_number AND
t1.property_unit_number IS NULL AND
t2.property_unit_number IS NULL;

"""
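The effect of this kind of DELETE can be rehearsed on toy data before running it against the real table. The sketch below again uses an in-memory `sqlite3` database as a stand-in. SQLite does not support MySQL's `DELETE t1 FROM ... JOIN` form, so the same condition is expressed with a correlated `EXISTS` subquery instead. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE property_sales (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        record_id TEXT NOT NULL,
        property_id INTEGER,
        property_unit_number TEXT,
        property_house_number TEXT
    )
""")
conn.executemany(
    "INSERT INTO property_sales "
    "(record_id, property_id, property_unit_number, property_house_number) "
    "VALUES (?, ?, ?, ?)",
    [
        ('R1', 100, None, '12'),  # id 1, duplicate
        ('R1', 100, None, '12'),  # id 2, duplicate
        ('R1', 100, None, '12'),  # id 3, kept (highest id of the set)
        ('R2', 200, '3', '7'),    # id 4, has a unit number, untouched here
        ('R2', 200, '3', '7'),    # id 5, has a unit number, untouched here
        ('R3', 300, None, '9'),   # id 6, unique
    ],
)

# Delete a row when a matching row with a higher id exists and neither has a unit number.
cursor = conn.execute("""
    DELETE FROM property_sales
    WHERE property_unit_number IS NULL
      AND EXISTS (
        SELECT 1 FROM property_sales AS t2
        WHERE property_sales.id < t2.id
          AND property_sales.record_id = t2.record_id
          AND property_sales.property_id = t2.property_id
          AND property_sales.property_house_number = t2.property_house_number
          AND t2.property_unit_number IS NULL
      )
""")
deleted = cursor.rowcount  # number of rows removed by the DELETE
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM property_sales ORDER BY id")]
print(deleted, remaining_ids)
```

Two of the three 'R1' rows are removed and the one with the highest id survives, which mirrors the `t1.id < t2.id` condition in the MySQL query.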


The following query checks for cases where the record_id is duplicated along with property_id, property_house_number, AND property_unit_number. It differs from the previous check in that the unit number must be present and identical, rather than NULL.

In [100]:
"""

SELECT record_id, COUNT(record_id), COUNT(property_id)
FROM property_sales 
GROUP BY record_id, property_unit_number, property_house_number, property_id 
HAVING COUNT(record_id) >1 AND COUNT(property_id) >1 AND COUNT(property_unit_number) > 1 AND COUNT(property_house_number) > 1;

"""


The following query removes cases where the record_id is duplicated along with property_id, property_house_number, AND property_unit_number. In my case there were 2,703 duplicates that met these criteria!

In [None]:
"""

DELETE t1 FROM property_sales t1
INNER JOIN property_sales t2
WHERE t1.id < t2.id AND
t1.record_id = t2.record_id AND
t1.property_id = t2.property_id AND
t1.property_house_number = t2.property_house_number AND
t1.property_unit_number = t2.property_unit_number;

"""
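Two separate DELETE queries are needed because, in SQL, comparing NULL with NULL using `=` does not evaluate to true, so `t1.property_unit_number = t2.property_unit_number` never matches rows where both unit numbers are missing. The short sketch below shows this behaviour using an in-memory `sqlite3` database as a stand-in. MySQL also offers the NULL-safe equality operator `<=>`; SQLite does not support it, so `IS` is used here for the NULL-safe comparison.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Ordinary equality between two NULLs yields NULL (None in Python), not true.
equal_result = conn.execute("SELECT NULL = NULL").fetchone()[0]

# A NULL-safe comparison treats two NULLs as equal (1 means true in SQLite).
null_safe_result = conn.execute("SELECT NULL IS NULL").fetchone()[0]

# A WHERE clause that evaluates to NULL filters the row out.
matched_rows = conn.execute("SELECT 1 WHERE NULL = NULL").fetchall()

print(equal_result, null_safe_result, matched_rows)
```

This is why the first DELETE tests both sides with `IS NULL`, while the second relies on plain equality for rows that do have a unit number.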