# Nashville Housing Data Cleaning

This project focuses only on data cleaning and transformation. The Nashville Housing dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/tmthyjames/nashville-housing-data/data) and contains home value data for the Nashville housing market between 2013 and 2016.

There are over 56,000 rows in this dataset. However, columns 9 to 19 will be excluded because more than half of their values are missing. For demonstration purposes, only the first 8 columns of this dataset will be cleaned and transformed.

The cleaned data can be downloaded through this [Link](https://github.com/s262680/SQL_Projects/blob/main/Data_Cleaning/Cleaned_Nashville_Housing_Data.csv).

<br>

### Load dataset
The first block of code loads the ipython-sql extension and connects to the local database Hashville_Hosing, where the table Data_2013_2016 is located.

In [1]:
%load_ext sql
%config SqlMagic.displaycon = False
%config SqlMagic.feedback = False
%sql mssql+pyodbc://******

<br>
The first 8 columns will be stored in a temporary table, so the original dataset will not be affected.

In [2]:
%%sql
IF OBJECT_ID('tempdb..#temp_data_cleaning') IS NULL
    BEGIN
        SELECT UniqueID,
               ParcelID,
               LandUse,
               PropertyAddress,
               SaleDate,
               SalePrice,
               LegalReference,
               SoldAsVacant
        INTO #temp_data_cleaning
        FROM data_2013_2016
    END

COMMIT; 

[]

<br>
A quick look at the table structure and format.

Notice that there are some obvious formatting issues, such as the unnecessary time component in the SaleDate column and the decimals in the UniqueID and SalePrice columns. These issues will be addressed later in this project.

In [3]:
%%sql
SELECT TOP(20) *
FROM #temp_data_cleaning 

UniqueID,ParcelID,LandUse,PropertyAddress,SaleDate,SalePrice,LegalReference,SoldAsVacant
2045.0,007 00 0 125.00,SINGLE FAMILY,"1808 FOX CHASE DR, GOODLETTSVILLE",2013-04-09 00:00:00,240000.0,20130412-0036474,No
16918.0,007 00 0 130.00,SINGLE FAMILY,"1832 FOX CHASE DR, GOODLETTSVILLE",2014-06-10 00:00:00,366000.0,20140619-0053768,No
54582.0,007 00 0 138.00,SINGLE FAMILY,"1864 FOX CHASE DR, GOODLETTSVILLE",2016-09-26 00:00:00,435000.0,20160927-0101718,No
43070.0,007 00 0 143.00,SINGLE FAMILY,"1853 FOX CHASE DR, GOODLETTSVILLE",2016-01-29 00:00:00,255000.0,20160129-0008913,No
22714.0,007 00 0 149.00,SINGLE FAMILY,"1829 FOX CHASE DR, GOODLETTSVILLE",2014-10-10 00:00:00,278000.0,20141015-0095255,No
18367.0,007 00 0 151.00,SINGLE FAMILY,"1821 FOX CHASE DR, GOODLETTSVILLE",2014-07-16 00:00:00,267000.0,20140718-0063802,No
19804.0,007 14 0 002.00,SINGLE FAMILY,"2005 SADIE LN, GOODLETTSVILLE",2014-08-28 00:00:00,171000.0,20140903-0080214,No
54583.0,007 14 0 024.00,SINGLE FAMILY,"1917 GRACELAND DR, GOODLETTSVILLE",2016-09-27 00:00:00,262000.0,20161005-0105441,No
36500.0,007 14 0 026.00,SINGLE FAMILY,"1428 SPRINGFIELD HWY, GOODLETTSVILLE",2015-08-14 00:00:00,285000.0,20150819-0083440,No
19805.0,007 14 0 034.00,SINGLE FAMILY,"1420 SPRINGFIELD HWY, GOODLETTSVILLE",2014-08-29 00:00:00,340000.0,20140909-0082348,No


<br>

### Identify null values

Retrieve all rows that contain a null value in any column of this table.

Notice that there are many null values in the PropertyAddress column.

In [4]:
%%sql
SELECT *
FROM #temp_data_cleaning
WHERE UniqueID IS NULL 
    OR ParcelID IS NULL 
    OR LandUse IS NULL 
    OR PropertyAddress IS NULL 
    OR SaleDate IS NULL 
    OR SalePrice IS NULL 
    OR LegalReference IS NULL 
    OR SoldAsVacant IS NULL 


UniqueID,ParcelID,LandUse,PropertyAddress,SaleDate,SalePrice,LegalReference,SoldAsVacant
43076.0,025 07 0 031.00,SINGLE FAMILY,,2016-01-15 00:00:00,179900.0,20160120-0005776,No
39432.0,026 01 0 069.00,VACANT RESIDENTIAL LAND,,2015-10-23 00:00:00,153000.0,20151028-0109602,No
45290.0,026 05 0 017.00,SINGLE FAMILY,,2016-03-29 00:00:00,155000.0,20160330-0029941,No
53147.0,026 06 0A 038.00,RESIDENTIAL CONDO,,2016-08-25 00:00:00,144900.0,20160831-0091567,No
43080.0,033 06 0 041.00,SINGLE FAMILY,,2016-01-04 00:00:00,170000.0,20160107-0001526,No
45295.0,033 06 0A 002.00,SINGLE FAMILY,,2016-03-29 00:00:00,210000.0,20160331-0030709,No
48731.0,033 15 0 123.00,SINGLE FAMILY,,2016-05-05 00:00:00,199900.0,20160506-0045368,No
50927.0,044 05 0 135.00,SINGLE FAMILY,,2016-06-15 00:00:00,160000.0,20160617-0061987,No
3299.0,052 01 0 296.00,SINGLE FAMILY,,2013-05-31 00:00:00,79370.0,20130620-0063114,No
40678.0,042 13 0 075.00,SINGLE FAMILY,,2015-11-30 00:00:00,208000.0,20151209-0123831,No


<br>
Since there could be multiple sales records for the same property, the next step is to check whether other rows with the same parcel ID contain the missing property address. The query retrieves every record whose ParcelID appears on at least one row with a null PropertyAddress, ordered by ParcelID for readability.

In [5]:
%%sql
SELECT UniqueID, ParcelID, PropertyAddress
FROM #temp_data_cleaning
WHERE ParcelID IN(
    SELECT ParcelID
    FROM #temp_data_cleaning
    WHERE PropertyAddress IS NULL)
ORDER BY ParcelID

UniqueID,ParcelID,PropertyAddress
38077.0,025 07 0 031.00,"410 ROSEHILL CT, GOODLETTSVILLE"
43076.0,025 07 0 031.00,
22721.0,026 01 0 069.00,"141 TWO MILE PIKE, GOODLETTSVILLE"
39432.0,026 01 0 069.00,
4521.0,026 05 0 017.00,"208 EAST AVE, GOODLETTSVILLE"
45290.0,026 05 0 017.00,
19828.0,026 06 0A 038.00,"109 CANTON CT, GOODLETTSVILLE"
53147.0,026 06 0A 038.00,
7003.0,033 06 0 041.00,"1129 CAMPBELL RD, GOODLETTSVILLE"
43080.0,033 06 0 041.00,


<br>
Since a parcel ID is assumed to correspond to a single address, the missing PropertyAddress values can be filled in with a self-join. The table is joined to itself under the aliases NullAdd (the rows with a missing address) and PopAdd (the rows that supply the address), matching rows that have the same ParcelID but a different UniqueID. Only rows where NullAdd.PropertyAddress is null are updated.

In [6]:
%%sql
UPDATE NullAdd
SET NullAdd.PropertyAddress=PopAdd.PropertyAddress
FROM #temp_data_cleaning AS NullAdd
INNER JOIN #temp_data_cleaning AS PopAdd
ON NullAdd.ParcelID=PopAdd.ParcelID AND NullAdd.UniqueID!=PopAdd.UniqueID
WHERE NullAdd.PropertyAddress IS NULL
    AND PopAdd.PropertyAddress IS NOT NULL

COMMIT;

[]
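
The fill-in logic can be tried on a few rows without a SQL Server instance. The following is a minimal sketch using Python's built-in sqlite3 module, so the syntax is SQLite rather than T-SQL: a correlated subquery stands in for the UPDATE ... FROM self-join. The table name housing and the parcel 999 99 0 999.00 are made up for illustration; the other rows echo the output above.

```python
import sqlite3

# Hypothetical miniature of the table, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing (UniqueID INTEGER, ParcelID TEXT, PropertyAddress TEXT)"
)
conn.executemany(
    "INSERT INTO housing VALUES (?, ?, ?)",
    [
        (38077, "025 07 0 031.00", "410 ROSEHILL CT, GOODLETTSVILLE"),
        (43076, "025 07 0 031.00", None),
        (22721, "026 01 0 069.00", "141 TWO MILE PIKE, GOODLETTSVILLE"),
        (39432, "026 01 0 069.00", None),
        (99999, "999 99 0 999.00", None),  # made-up parcel with no populated sibling
    ],
)

# Borrow a populated address from a different row with the same ParcelID.
conn.execute(
    """
    UPDATE housing
    SET PropertyAddress = (
        SELECT MIN(p.PropertyAddress)
        FROM housing AS p
        WHERE p.ParcelID = housing.ParcelID
          AND p.UniqueID <> housing.UniqueID
          AND p.PropertyAddress IS NOT NULL
    )
    WHERE PropertyAddress IS NULL
    """
)

filled = dict(conn.execute("SELECT UniqueID, PropertyAddress FROM housing"))
still_null = [uid for uid, addr in filled.items() if addr is None]
```

A parcel that has no populated sibling row stays null, which is why a follow-up null check is still worthwhile after the update.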

<br>
Verify whether there are still any null values in the PropertyAddress column.

No rows are returned, as every null value has been filled in from a matching record.

In [7]:
%%sql
SELECT PropertyAddress
FROM #temp_data_cleaning
WHERE PropertyAddress IS NULL

PropertyAddress


<br>

### Identify duplicates

Select all rows where the UniqueID appears more than once, which would indicate duplicated records.

No result was found as there are no duplicate IDs.

In [8]:
%%sql
SELECT a.*
FROM #temp_data_cleaning AS a
JOIN (
    SELECT UniqueID, Count(*) AS DuplicateCount
    FROM #temp_data_cleaning
    GROUP BY UniqueID
    HAVING Count(*)>1) AS b
ON a.UniqueID = b.UniqueID


UniqueID,ParcelID,LandUse,PropertyAddress,SaleDate,SalePrice,LegalReference,SoldAsVacant


<br>
Select records that share the same combination of parcel ID, property address, sale date, sale price and legal reference.

The query counts the occurrences of each combination and returns only those with a count greater than 1, which indicates duplicated records.

In [9]:
%%sql
SELECT ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference, Count(*) AS DuplicateCount
FROM #temp_data_cleaning
GROUP BY ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference
HAVING Count(*) >1

ParcelID,PropertyAddress,SaleDate,SalePrice,LegalReference,DuplicateCount
081 02 0 144.00,"1728 PECAN ST, NASHVILLE",2015-02-02 00:00:00,57000.0,20150205-0010843,2
081 07 0 265.00,"1806 15TH AVE N, NASHVILLE",2015-02-17 00:00:00,65000.0,20150223-0015122,2
081 10 0 313.00,"1626 25TH AVE N, NASHVILLE",2015-02-20 00:00:00,35000.0,20150224-0015904,2
081 11 0 168.00,"1710 DR D B TODD JR BLVD, NASHVILLE",2015-02-13 00:00:00,44500.0,20150218-0013602,2
081 11 0 495.00,"1718 ARTHUR AVE, NASHVILLE",2015-02-09 00:00:00,36500.0,20150210-0012450,2
081 15 0 263.00,"1520 14TH AVE N, NASHVILLE",2015-02-12 00:00:00,55000.0,20150218-0013742,2
081 15 0 472.00,"1818 B SCOVEL ST, NASHVILLE",2015-02-20 00:00:00,35000.0,20150223-0015257,2
090 08 0 191.00,"743 CROLEY DR, NASHVILLE",2015-02-13 00:00:00,169000.0,20150219-0014430,2
090 11 0A 030.00,"515 BASSWOOD AVE, NASHVILLE",2015-02-05 00:00:00,60000.0,20150209-0011960,2
090 12 0 091.00,"501 FOUNDRY DR, NASHVILLE",2015-02-18 00:00:00,208000.0,20150223-0015576,2


<br>
Identify duplicates using a common table expression (CTE) named DuplicateCTE. ROW_NUMBER() assigns a sequential number, ordered by UniqueID, to each record within a group of records that share the same combination of values.

Then delete the records in DuplicateCTE where DuplicateCount is greater than 1. This removes the duplicates from the #temp_data_cleaning table and keeps the record with the lowest UniqueID in each group.

In [10]:
%%sql
WITH DuplicateCTE AS (
	SELECT *, Row_number() OVER (partition BY ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference ORDER BY UniqueID) AS DuplicateCount
	FROM #temp_data_cleaning
	)

DELETE
FROM DuplicateCTE
WHERE DuplicateCount > 1

COMMIT;


[]
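
The keep-the-first-row rule can be checked on a small sample. This is a hedged sketch with Python's built-in sqlite3 module, and it swaps in a different technique: instead of ROW_NUMBER() in a CTE, it keeps MIN(UniqueID) per group and deletes everything else, which should remove the same rows as long as UniqueID is unique and non-null. The table name housing and the UniqueID values are made up, and PropertyAddress is left out to keep the example short; the other fields echo two rows from the duplicate listing above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing (UniqueID INTEGER, ParcelID TEXT, SaleDate TEXT, "
    "SalePrice INTEGER, LegalReference TEXT)"
)
conn.executemany(
    "INSERT INTO housing VALUES (?, ?, ?, ?, ?)",
    [
        # Rows 1 and 2 are identical apart from the (made-up) UniqueID.
        (1, "081 02 0 144.00", "2015-02-02", 57000, "20150205-0010843"),
        (2, "081 02 0 144.00", "2015-02-02", 57000, "20150205-0010843"),
        (3, "081 07 0 265.00", "2015-02-17", 65000, "20150223-0015122"),
    ],
)

# Keep the lowest UniqueID in each group and delete the rest.
conn.execute(
    """
    DELETE FROM housing
    WHERE UniqueID NOT IN (
        SELECT MIN(UniqueID)
        FROM housing
        GROUP BY ParcelID, SaleDate, SalePrice, LegalReference
    )
    """
)

remaining = [
    row[0] for row in conn.execute("SELECT UniqueID FROM housing ORDER BY UniqueID")
]
```

Row 2 is removed and rows 1 and 3 survive, matching the behaviour of deleting every record whose row number within its group is greater than 1.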

<br>
Verify whether there are still any duplicates.

No result was found, which indicates all duplicates have been removed.

In [11]:
%%sql
SELECT ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference, Count(*) AS DuplicateCount
FROM #temp_data_cleaning
GROUP BY ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference
HAVING Count(*) > 1

ParcelID,PropertyAddress,SaleDate,SalePrice,LegalReference,DuplicateCount


<br>

### Transform Data

Split the PropertyAddress column into two separate columns to make it more usable.

First, add PropertyStreet and PropertyCity columns to hold the separated parts of the address.

Update each record by extracting the substring from the start of PropertyAddress up to the comma and assigning it to PropertyStreet.

Then extract the substring after the comma to the end of PropertyAddress, trim the leading space that follows the comma, and assign it to PropertyCity.

In [12]:
%%sql
ALTER TABLE #temp_data_cleaning
ADD PropertyStreet NVARCHAR(255), PropertyCity NVARCHAR(255)

COMMIT;

UPDATE #temp_data_cleaning
SET
PropertyStreet =
	Substring(PropertyAddress,
	1,
	Charindex(',', PropertyAddress)-1),
PropertyCity =
	Ltrim(Substring(PropertyAddress,
	Charindex(',', PropertyAddress)+1,
	Len(PropertyAddress)))

COMMIT;

[]
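
Two edge cases are worth keeping in mind when splitting on the comma: the text after the comma starts with a space in the sample rows, and an address with no comma at all would make Charindex return 0, so the length passed to Substring becomes -1. The sketch below shows one way to handle both in plain Python. The function name split_address is made up for illustration, and returning None for a missing city is an assumption, not something the dataset dictates.

```python
def split_address(address):
    """Split 'street, city' on the first comma and trim both parts."""
    street, comma, city = address.partition(",")
    if not comma:
        # No comma found: treat the whole string as the street, leave the city empty.
        return address.strip(), None
    return street.strip(), city.strip()


street, city = split_address("1808 FOX CHASE DR, GOODLETTSVILLE")
no_comma = split_address("MONROE ST")
```

The same guard could be expressed in SQL with a CASE on the result of Charindex, if addresses without a comma ever appear in the source data.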

<br>
Verify whether the new columns have been updated correctly.

In [13]:
%%sql
SELECT top(10) PropertyStreet, PropertyCity
FROM #temp_data_cleaning;

PropertyStreet,PropertyCity
1808 FOX CHASE DR,GOODLETTSVILLE
1832 FOX CHASE DR,GOODLETTSVILLE
1864 FOX CHASE DR,GOODLETTSVILLE
1853 FOX CHASE DR,GOODLETTSVILLE
1829 FOX CHASE DR,GOODLETTSVILLE
1821 FOX CHASE DR,GOODLETTSVILLE
2005 SADIE LN,GOODLETTSVILLE
1917 GRACELAND DR,GOODLETTSVILLE
1428 SPRINGFIELD HWY,GOODLETTSVILLE
1420 SPRINGFIELD HWY,GOODLETTSVILLE


<br>
Remove the time component from the SaleDate column, as the column does not hold any actual time values.

In [14]:
%%sql
ALTER TABLE #temp_data_cleaning
ALTER COLUMN SaleDate DATE

COMMIT;

[]

<br>
Verify the result.

In [15]:
%%sql
SELECT TOP(10) SaleDate
FROM #temp_data_cleaning

SaleDate
2013-04-09
2014-06-10
2016-09-26
2016-01-29
2014-10-10
2014-07-16
2014-08-28
2016-09-27
2015-08-14
2014-08-29


<br>
The UniqueID and SalePrice columns have decimals that were not in the original dataset.

Checking the column data types shows that UniqueID and SalePrice were converted to float when the data was imported into the database, which added the decimals to their values.

In [16]:
%%sql
SELECT COLUMN_NAME, DATA_TYPE 
FROM tempdb.INFORMATION_SCHEMA.COLUMNS 
WHERE TABLE_NAME LIKE '#temp_data_cleaning%'

COLUMN_NAME,DATA_TYPE
UniqueID,float
ParcelID,nvarchar
LandUse,nvarchar
PropertyAddress,nvarchar
SaleDate,date
SalePrice,float
LegalReference,nvarchar
SoldAsVacant,nvarchar
PropertyStreet,nvarchar
PropertyCity,nvarchar


<br>
Remove the decimals by altering the UniqueID data type to VARCHAR and the SalePrice data type to INTEGER.

In [17]:
%%sql
ALTER TABLE #temp_data_cleaning
ALTER COLUMN UniqueID VARCHAR(255);

ALTER TABLE #temp_data_cleaning
ALTER COLUMN SalePrice INT;

COMMIT;


[]
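
Converting a float to an integer type drops any fractional part (SQL Server truncates when converting FLOAT to INT), so it can be worth confirming that every value is a whole number before changing the type. A minimal sketch of that pre-check in Python follows; the helper name to_whole_number is made up for illustration, and the two sample values come from the first row of the table.

```python
def to_whole_number(value):
    """Return value as an int, refusing floats that carry a real fractional part."""
    if value != int(value):
        raise ValueError(f"{value!r} is not a whole number")
    return int(value)


unique_id = str(to_whole_number(2045.0))   # identifier kept as text
sale_price = to_whole_number(240000.0)     # price kept as an integer
```

In SQL the equivalent check would be a query for rows where the value differs from its floored value, run before the ALTER COLUMN statements.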

<br>
Verify the result.

In [18]:
%%sql
SELECT TOP (10) UniqueID, SalePrice
FROM #temp_data_cleaning

UniqueID,SalePrice
2045,240000
16918,366000
54582,435000
43070,255000
22714,278000
18367,267000
19804,171000
54583,262000
36500,285000
19805,340000


<br>

### Check data consistency

Go through each column in this table to ensure the data format and structure are consistent.

<br>
Check whether the UniqueID column contains any non-digit characters.

In [19]:
%%sql
SELECT UniqueID
FROM #temp_data_cleaning
WHERE UniqueID LIKE '%[^0-9]%'

UniqueID
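
Two similar-looking LIKE patterns answer different questions: NOT LIKE '%[0-9]%' finds values that contain no digit at all, while LIKE '%[^0-9]%' finds values that contain at least one non-digit character. The sketch below reproduces both with Python regular expressions as rough analogues. The sample values 20A45 and ABC are made up for illustration.

```python
import re


def has_any_digit(value):
    # Rough analogue of LIKE '%[0-9]%': at least one digit somewhere.
    return re.search(r"[0-9]", value) is not None


def has_non_digit(value):
    # Rough analogue of LIKE '%[^0-9]%': at least one character that is not a digit.
    return re.search(r"[^0-9]", value) is not None


samples = ["2045", "20A45", "ABC"]
no_digit_at_all = [s for s in samples if not has_any_digit(s)]
contains_non_digit = [s for s in samples if has_non_digit(s)]
```

Only the second check flags a mixed value such as 20A45, so it is the stricter test for a column that should hold digits only.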


<br>
The values in the ParcelID column consist of a mix of numbers and letters. The length is 15 characters for normal records and 16 characters for records that have missing data after column 9.

Verify the data length in this column to ensure it conforms with the format.

In [20]:
%%sql
SELECT Len(ParcelID) AS ParcelID_Length, Count(DISTINCT ParcelID) AS Distinct_Count
FROM #temp_data_cleaning
GROUP BY Len(ParcelID)

ParcelID_Length,Distinct_Count
15,23370
16,25189


<br>
Verify the distinct values in the LandUse column to identify any inconsistency.

In [21]:
%%sql
SELECT LandUse, Count(*)
FROM #temp_data_cleaning
GROUP BY LandUse
ORDER BY LandUse ASC

LandUse,Unnamed: 1
APARTMENT: LOW RISE (BUILT SINCE 1960),2
CHURCH,33
CLUB/UNION HALL/LODGE,1
CONDO,247
CONDOMINIUM OFC OR OTHER COM CONDO,35
CONVENIENCE MARKET WITHOUT GAS,1
DAY CARE CENTER,2
DORMITORY/BOARDING HOUSE,19
DUPLEX,1372
FOREST,10


<br>
The values GREENBELT/RESGRRENBELT/RES and VACANT RESIENTIAL LAND appear to be typos and should be replaced with GREENBELT/RES and VACANT RESIDENTIAL LAND, respectively.

In [22]:
%%sql
UPDATE #temp_data_cleaning
SET LandUse = 'GREENBELT/RES'
WHERE LandUse LIKE '%GRRENBELT/RES'

COMMIT;

UPDATE #temp_data_cleaning
SET LandUse = 'VACANT RESIDENTIAL LAND'
WHERE LandUse ='VACANT RESIENTIAL LAND'

COMMIT;


[]

<br>
Validate the result.

In [23]:
%%sql
SELECT LandUse, Count(*)
FROM #temp_data_cleaning
WHERE
    LandUse = 'GREENBELT/RES'
    OR LandUse LIKE '%GRRENBELT/RES'
    OR LandUse = 'VACANT RESIDENTIAL LAND'
    OR LandUse = 'VACANT RESIENTIAL LAND'
GROUP BY LandUse
ORDER BY LandUse ASC

LandUse,Unnamed: 1
GREENBELT/RES,3
VACANT RESIDENTIAL LAND,3543


<br>
Verify whether any value in the SaleDate column cannot be converted to a valid date using style 121 (ODBC canonical, which renders a DATE value as YYYY-MM-DD).

No record was found, which indicates that the SaleDate column does not contain any invalid date values.

In [24]:
%%sql
SELECT SaleDate
FROM #temp_data_cleaning
WHERE Isdate(CONVERT(VARCHAR, SaleDate, 121)) = 0

SaleDate
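
A date check of this kind is most informative on raw text, because a column that already has the DATE type can only hold valid dates. As a hedged illustration of validating text before it is converted, the sketch below uses Python's datetime.strptime. The function name is_valid_iso_date is made up, and the two invalid candidates are invented test values, not rows from the dataset.

```python
from datetime import datetime


def is_valid_iso_date(text):
    """Return True when text parses as a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


candidates = ["2013-04-09", "2015-02-30", "2016-13-01"]
invalid = [c for c in candidates if not is_valid_iso_date(c)]
```

The parser rejects both an impossible day (30 February) and an impossible month (13), which is the kind of value an import step could otherwise carry through unnoticed.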


<br>
Verify whether the SalePrice column only contains numeric values.

In [25]:
%%sql
SELECT SalePrice
FROM #temp_data_cleaning
WHERE SalePrice LIKE '%[^0-9]%'

SalePrice


<br>
Verify the character length of the LegalReference column. It appears that 16 characters is the standard format for this column, and 13 records do not conform to it.

In [26]:
%%sql
SELECT Len(LegalReference) AS LegalReferenceLength, Count(DISTINCT ParcelID) AS DistinctCount
FROM #temp_data_cleaning
GROUP BY Len(LegalReference)

LegalReferenceLength,DistinctCount
8,6
15,1
16,48552
17,6


<br>
It appears that these records contain inaccurate or missing values. However, it is impossible to confirm the actual values for these records. In a real-world scenario, they would be reported back to the data owner.

In [27]:
%%sql
SELECT *
FROM #temp_data_cleaning
WHERE Len(LegalReference) !=16

UniqueID,ParcelID,LandUse,PropertyAddress,SaleDate,SalePrice,LegalReference,SoldAsVacant,PropertyStreet,PropertyCity
42865,128 13 0A 158.00,RESIDENTIAL CONDO,"158 WESTFIELD DR, NASHVILLE",2016-01-28,108000,-2016946,No,158 WESTFIELD DR,NASHVILLE
43048,059 15 0A 147.00,SINGLE FAMILY,"912 BORDEAUX PL, NASHVILLE",2016-01-04,264757,20160112-00003053,No,912 BORDEAUX PL,NASHVILLE
20017,083 01 0D 001.00,RESIDENTIAL CONDO,"1118 A SHARPE AVE, NASHVILLE",2014-08-29,419240,-2022158,No,1118 A SHARPE AVE,NASHVILLE
46422,104 10 0 154.00,VACANT RESIDENTIAL LAND,"3206 OVERLOOK DR, NASHVILLE",2016-04-06,410929,20160408-00339999,Yes,3206 OVERLOOK DR,NASHVILLE
48164,104 10 0O 009.00,SINGLE FAMILY,"112 RANSOM AVE, NASHVILLE",2016-05-18,839820,20160519-00503315,No,112 RANSOM AVE,NASHVILLE
2863,104 10 0R 007.00,RESIDENTIAL CONDO,"607 CHESTERFIELD WAY, NASHVILLE",2013-05-17,363320,2013520-0050586,No,607 CHESTERFIELD WAY,NASHVILLE
28415,062 07 0 001.00,VACANT RESIDENTIAL LAND,"2929 WESTERN HILLS DR, NASHVILLE",2015-03-09,50000,20150310 -0020554,No,2929 WESTERN HILLS DR,NASHVILLE
17927,104 16 0 287.00,SINGLE FAMILY,"2402 OAKLAND AVE, NASHVILLE",2014-07-30,695150,-2020988,No,2402 OAKLAND AVE,NASHVILLE
4743,083 14 0 153.00,SINGLE FAMILY,"1801 FATHERLAND ST, NASHVILLE",2013-06-07,449830,20130610-00588852,No,1801 FATHERLAND ST,NASHVILLE
43449,136 07 0 030.00,SINGLE FAMILY,"117 TIMBER RIDGE DR, NASHVILLE",2016-01-14,160000,-2016598,No,117 TIMBER RIDGE DR,NASHVILLE
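
A length check catches these records, but a pattern check is stricter because it also rejects a 16-character value with the wrong shape. The sketch below assumes the standard format is 8 digits, a hyphen, and 7 digits; that format is inferred only from the sample rows shown in this project and is not confirmed by the data owner. The names LEGAL_REFERENCE and conforms are made up for illustration.

```python
import re

# Assumed format, inferred from the sample rows: 8 digits, a hyphen, 7 digits.
LEGAL_REFERENCE = re.compile(r"[0-9]{8}-[0-9]{7}")


def conforms(reference):
    """Return True when the whole string matches the assumed format."""
    return LEGAL_REFERENCE.fullmatch(reference) is not None


references = [
    "20130412-0036474",   # conforming value from the first row of the table
    "-2016946",
    "20160112-00003053",
    "2013520-0050586",
    "20150310 -0020554",
]
non_conforming = [r for r in references if not conforms(r)]
```

All four irregular values taken from the output above are flagged, while the conforming reference passes.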


<br>
Verify the distinct values in the SoldAsVacant column to identify any inconsistency.

In [28]:
%%sql
SELECT DISTINCT SoldAsVacant, Count(*)
FROM #temp_data_cleaning
GROUP BY SoldAsVacant

SoldAsVacant,Unnamed: 1
N,398
Yes,4617
Y,52
No,51306


<br>
It appears that there is some inconsistency in the data in this column. Both Yes and Y refer to properties that were sold as vacant, while both No and N refer to properties that were not sold as vacant.

The format will be standardised by replacing every Y with Yes and every N with No.

In [29]:
%%sql
UPDATE #temp_data_cleaning
SET SoldAsVacant =
CASE
    WHEN SoldAsVacant = 'Y' THEN 'Yes'
    WHEN SoldAsVacant = 'N' THEN 'No'
    ELSE SoldAsVacant
END

COMMIT;

[]
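
The effect of the CASE expression can be checked with simple arithmetic on the counts listed above. This sketch mirrors the mapping in Python and applies it to those counts; the names MAPPING and standardise are made up for illustration.

```python
from collections import Counter

MAPPING = {"Y": "Yes", "N": "No"}


def standardise(value):
    # Mirrors CASE ... ELSE SoldAsVacant: unmapped values pass through unchanged.
    return MAPPING.get(value, value)


# Counts taken from the query output before the update.
before = Counter({"N": 398, "Yes": 4617, "Y": 52, "No": 51306})

after = Counter()
for value, count in before.items():
    after[standardise(value)] += count
```

The merged totals are 4617 + 52 = 4669 for Yes and 51306 + 398 = 51704 for No, and the overall row count is unchanged.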

<br>
Verify the result.

In [30]:
%%sql
SELECT DISTINCT SoldAsVacant, Count(*)
FROM #temp_data_cleaning
GROUP BY SoldAsVacant

SoldAsVacant,Unnamed: 1
Yes,4669
No,51704


<br>
Verify the PropertyStreet format by identifying any street that does not start with a numeric character followed by alphabetic characters.

Only 20 results are displayed for demonstration purposes, as there are too many records with missing street numbers, and some have only a zero as the street number.

In [31]:
%%sql
SELECT PropertyStreet 
FROM #temp_data_cleaning
WHERE PropertyStreet NOT LIKE '[0-9]%[A-Za-z]%'
ORDER BY ParcelID
OFFSET 70 ROWS
FETCH NEXT 20 ROWS ONLY

PropertyStreet
MONROE ST
MONROE ST
MONROE ST
BOSCOBEL ST
RUSSELL ST
RUSSELL ST
RUSSELL ST
PORTER RD
CARTER AVE
CARTER AVE
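
To see which values this pattern does and does not flag, the sketch below uses a Python regular expression as a rough analogue of LIKE '[0-9]%[A-Za-z]%': a leading digit, then a letter somewhere later in the string. The names STREET_PATTERN and is_flagged are made up for illustration; the sample streets come from the outputs in this section.

```python
import re

# Rough analogue of LIKE '[0-9]%[A-Za-z]%': starts with a digit, has a letter later on.
STREET_PATTERN = re.compile(r"[0-9].*[A-Za-z]")


def is_flagged(street):
    """Return True when the street does not match the expected shape."""
    return STREET_PATTERN.match(street) is None


streets = ["MONROE ST", "1304 A 7TH AVE N", "0 RUSSELL ST"]
flagged = [s for s in streets if is_flagged(s)]
```

A street such as 0 RUSSELL ST starts with a digit and therefore passes this pattern, so placeholder zero street numbers would need a separate check.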


<br>
The missing street numbers cannot be populated, as many records show different street names despite sharing the same parcel ID.

Again, this would be reported back to the data owner in a real-world scenario.

In [32]:
%%sql
SELECT ParcelID, PropertyStreet
FROM #temp_data_cleaning
WHERE ParcelID IN (
    SELECT ParcelID
    FROM #temp_data_cleaning
    WHERE PropertyStreet NOT LIKE '[0-9]%[A-Za-z]%')
ORDER BY ParcelID
OFFSET 80 ROWS
FETCH NEXT 20 ROWS ONLY


ParcelID,PropertyStreet
082 09 0X 008.00,MONROE ST
082 09 0X 008.00,1304 A 7TH AVE N
082 09 0X 009.00,MONROE ST
082 09 0X 009.00,1304 B 7TH AVE N
082 09 0X 010.00,MONROE ST
082 09 0X 010.00,1304 C 7TH AVE N
082 16 0I 003.00,BOSCOBEL ST
082 16 0L 001.00,RUSSELL ST
082 16 0L 001.00,0 RUSSELL ST
082 16 0L 003.00,RUSSELL ST


<br>
Verify the distinct values in the PropertyCity column to identify any inconsistency.

Notice there is one unknown value in the PropertyCity column.

In [33]:
%%sql
SELECT DISTINCT PropertyCity, Count(*)
FROM #temp_data_cleaning
GROUP BY PropertyCity

PropertyCity,Unnamed: 1
OLD HICKORY,1415
WHITES CREEK,97
MOUNT JULIET,180
UNKNOWN,1
JOELTON,11
GOODLETTSVILLE,735
ANTIOCH,6286
BELLEVUE,1
FRANKLIN,1
MADISON,2114


<br>
Find out whether there is another record that shares the same parcel ID and has the correct address.

In [34]:
%%sql
SELECT *
FROM #temp_data_cleaning
WHERE ParcelID IN (
    SELECT ParcelID
    FROM #temp_data_cleaning
    WHERE PropertyCity LIKE '%UNKNOWN%')

UniqueID,ParcelID,LandUse,PropertyAddress,SaleDate,SalePrice,LegalReference,SoldAsVacant,PropertyStreet,PropertyCity
12726,093 06 1B 618.00,RESIDENTIAL CONDO,"231 5TH AVE N, NASHVILLE",2014-02-11,255900,20140214-0013082,No,231 5TH AVE N,NASHVILLE
46010,093 06 1B 618.00,RESIDENTIAL CONDO,"0 5TH AVE N, UNKNOWN",2016-03-31,298000,20160404-0031713,No,0 5TH AVE N,UNKNOWN


<br>
Notice the street number is also incorrect for the record that has an unknown city value.

Therefore, both street and city will be updated based on the other records that share the same parcel ID.

In [35]:
%%sql
UPDATE #temp_data_cleaning
SET PropertyStreet = '231 5TH AVE N', PropertyCity = 'NASHVILLE'
WHERE UniqueID = '46010'

COMMIT;

[]

<br>
Verify the result.

In [36]:
%%sql
SELECT * 
FROM #temp_data_cleaning
WHERE ParcelID ='093 06 1B 618.00'

UniqueID,ParcelID,LandUse,PropertyAddress,SaleDate,SalePrice,LegalReference,SoldAsVacant,PropertyStreet,PropertyCity
12726,093 06 1B 618.00,RESIDENTIAL CONDO,"231 5TH AVE N, NASHVILLE",2014-02-11,255900,20140214-0013082,No,231 5TH AVE N,NASHVILLE
46010,093 06 1B 618.00,RESIDENTIAL CONDO,"0 5TH AVE N, UNKNOWN",2016-03-31,298000,20160404-0031713,No,231 5TH AVE N,NASHVILLE


<br>
Drop the PropertyAddress column as it is no longer needed.

In [37]:
%%sql
ALTER TABLE #temp_data_cleaning
DROP COLUMN PropertyAddress

COMMIT;

[]

<br>
Verify the cleaned data.

Only 50 rows are displayed here for demonstration purposes, as there are over 56,000 rows in this table.

The full cleaned data can be downloaded through this [Link](https://github.com/s262680/SQL_Projects/blob/main/Data_Cleaning/Cleaned_Nashville_Housing_Data.csv).

In [38]:
%%sql
SELECT TOP (50) *
FROM #temp_data_cleaning

UniqueID,ParcelID,LandUse,SaleDate,SalePrice,LegalReference,SoldAsVacant,PropertyStreet,PropertyCity
2045,007 00 0 125.00,SINGLE FAMILY,2013-04-09,240000,20130412-0036474,No,1808 FOX CHASE DR,GOODLETTSVILLE
16918,007 00 0 130.00,SINGLE FAMILY,2014-06-10,366000,20140619-0053768,No,1832 FOX CHASE DR,GOODLETTSVILLE
54582,007 00 0 138.00,SINGLE FAMILY,2016-09-26,435000,20160927-0101718,No,1864 FOX CHASE DR,GOODLETTSVILLE
43070,007 00 0 143.00,SINGLE FAMILY,2016-01-29,255000,20160129-0008913,No,1853 FOX CHASE DR,GOODLETTSVILLE
22714,007 00 0 149.00,SINGLE FAMILY,2014-10-10,278000,20141015-0095255,No,1829 FOX CHASE DR,GOODLETTSVILLE
18367,007 00 0 151.00,SINGLE FAMILY,2014-07-16,267000,20140718-0063802,No,1821 FOX CHASE DR,GOODLETTSVILLE
19804,007 14 0 002.00,SINGLE FAMILY,2014-08-28,171000,20140903-0080214,No,2005 SADIE LN,GOODLETTSVILLE
54583,007 14 0 024.00,SINGLE FAMILY,2016-09-27,262000,20161005-0105441,No,1917 GRACELAND DR,GOODLETTSVILLE
36500,007 14 0 026.00,SINGLE FAMILY,2015-08-14,285000,20150819-0083440,No,1428 SPRINGFIELD HWY,GOODLETTSVILLE
19805,007 14 0 034.00,SINGLE FAMILY,2014-08-29,340000,20140909-0082348,No,1420 SPRINGFIELD HWY,GOODLETTSVILLE
