# Get rows with only latest transaction

- Use the Zillow dataset.
- Acquire the data from MySQL, using the Python module to connect and query. You will want to end with a single dataframe. Make sure to include the logerror and all available fields related to the properties. You will end up using all the tables in the database.
- Be sure to use the correct join type (inner, outer, etc.). We do not want to eliminate properties purely because they have a null value for airconditioningtypeid.

- Only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with the zestimate error and the date of the transaction.

- Only include properties that include a latitude and longitude value.
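
To illustrate why the join type matters, here is a minimal sketch on made-up data. The two small dataframes are hypothetical stand-ins for properties_2017 and airconditioningtype, not rows from the real database. An inner join drops the property whose airconditioningtypeid is null, while a left join keeps it with an empty description.

```python
import pandas as pd

# Hypothetical stand-in for properties_2017 (values are made up)
properties = pd.DataFrame({
    "parcelid": [1, 2, 3],
    "airconditioningtypeid": [1.0, None, 1.0],
})

# Hypothetical stand-in for the airconditioningtype lookup table
air = pd.DataFrame({
    "airconditioningtypeid": [1.0],
    "airconditioningdesc": ["Central"],
})

# Inner join: parcel 2 has no matching lookup row, so it is dropped
inner = properties.merge(air, on="airconditioningtypeid", how="inner")

# Left join: every property is kept; parcel 2 gets a null description
left = properties.merge(air, on="airconditioningtypeid", how="left")

print(len(inner), len(left))
```

This is the reason the lookup tables in the query are attached with LEFT JOIN rather than a plain JOIN.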


## Solution
1. Group the dataframe by parcelid
2. Take the max transaction date from the groupby object (the result is a Series indexed by parcelid)
3. Turn that Series of max transaction dates by parcel into a dataframe
4. Merge/join the max transaction date dataframe with the original dataframe on both parcelid and transactiondate to filter out older transaction dates
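
The four steps above can be sketched on a tiny made-up dataframe before running them on the real data. The `toy` dataframe and its values are hypothetical; only the column names parcelid, transactiondate, and logerror come from the dataset.

```python
import pandas as pd

# Hypothetical toy data: parcels 1 and 3 each have two transactions
toy = pd.DataFrame({
    "parcelid": [1, 1, 2, 3, 3],
    "transactiondate": pd.to_datetime(
        ["2017-01-05", "2017-06-01", "2017-03-10", "2017-02-02", "2017-02-20"]
    ),
    "logerror": [0.01, 0.02, 0.03, 0.04, 0.05],
})

# Steps 1 and 2: group by parcelid and take the max transaction date (a Series)
latest = toy.groupby("parcelid").transactiondate.max()

# Step 3: turn the Series into a dataframe with parcelid as a regular column
latest = latest.reset_index()

# Step 4: inner merge on both keys keeps only each parcel's latest row
result = toy.merge(latest, on=["parcelid", "transactiondate"])

print(result)
```

Only the rows whose (parcelid, transactiondate) pair matches the per-parcel maximum survive the merge, so the earlier transactions for parcels 1 and 3 are filtered out.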

In [1]:
import pandas as pd
from env import get_db_url
url = get_db_url('zillow')

In [2]:
sql = """
SELECT *
FROM properties_2017
JOIN predictions_2017 using(parcelid)
LEFT JOIN airconditioningtype air USING (airconditioningtypeid) 
LEFT JOIN architecturalstyletype arch USING (architecturalstyletypeid) 
LEFT JOIN buildingclasstype build USING (buildingclasstypeid) 
LEFT JOIN heatingorsystemtype heat USING (heatingorsystemtypeid) 
LEFT JOIN propertylandusetype landuse USING (propertylandusetypeid) 
LEFT JOIN storytype story USING (storytypeid) 
LEFT JOIN typeconstructiontype construct USING (typeconstructiontypeid) 
WHERE latitude IS NOT NULL AND longitude IS NOT NULL
AND transactiondate < '2018-01-01'
"""

df = pd.read_sql(sql, url)
df

Unnamed: 0,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,parcelid,id,basementsqft,...,id.1,logerror,transactiondate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
0,,,261.0,,,,,14297519,1727539,,...,0,0.025595,2017-01-01,,,,,Single Family Residential,,
1,,,261.0,,,,,17052889,1387261,,...,1,0.055619,2017-01-01,,,,,Single Family Residential,,
2,,,261.0,,,,,14186244,11677,,...,2,0.005383,2017-01-01,,,,,Single Family Residential,,
3,,,261.0,2.0,,,,12177905,2288172,,...,3,-0.103410,2017-01-01,,,,Central,Single Family Residential,,
4,,,266.0,2.0,,,1.0,10887214,1970746,,...,4,0.006940,2017-01-01,Central,,,Central,Condominium,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77574,,,266.0,2.0,,,1.0,10833991,2864704,,...,77608,-0.002245,2017-09-20,Central,,,Central,Condominium,,
77575,,,261.0,2.0,,,,11000655,673515,,...,77609,0.020615,2017-09-20,,,,Central,Single Family Residential,,
77576,,,261.0,,,,,17239384,2968375,,...,77610,0.013209,2017-09-21,,,,,Single Family Residential,,
77577,,,261.0,2.0,,,1.0,12773139,1843709,,...,77611,0.037129,2017-09-21,Central,,,Central,Single Family Residential,,


In [3]:
n_duplicates = df.parcelid.duplicated().sum()
n_duplicates

198

In [4]:
# Convert transaction date into proper datetime format
df.transactiondate = pd.to_datetime(df.transactiondate)

In [5]:
# Obtain the maximum transaction date for each parcelid
# For parcelids that appear once, this returns the only transactiondate
# For parcelids that appear multiple times, this returns the max transactiondate
max_transaction_date_by_parcel = df.groupby('parcelid').transactiondate.max()
max_transaction_date_by_parcel

parcelid
10711855    2017-07-07
10711877    2017-08-29
10711888    2017-04-04
10711910    2017-03-17
10711923    2017-03-24
               ...    
167686999   2017-02-28
167687739   2017-03-03
167687839   2017-05-31
167688532   2017-02-03
167689317   2017-03-14
Name: transactiondate, Length: 77381, dtype: datetime64[ns]

In [6]:
# Convert the above Series into a DataFrame, with parcelid as a regular column instead of the index
max_transaction_date_by_parcel = pd.DataFrame(max_transaction_date_by_parcel)
max_transaction_date_by_parcel['parcelid'] = max_transaction_date_by_parcel.index
max_transaction_date_by_parcel = max_transaction_date_by_parcel.reset_index(drop=True)
max_transaction_date_by_parcel

Unnamed: 0,transactiondate,parcelid
0,2017-07-07,10711855
1,2017-08-29,10711877
2,2017-04-04,10711888
3,2017-03-17,10711910
4,2017-03-24,10711923
...,...,...
77376,2017-02-28,167686999
77377,2017-03-03,167687739
77378,2017-05-31,167687839
77379,2017-02-03,167688532


In [7]:
# Merge/join on both transactiondate AND parcelid to filter out the older transactiondates for each parcel
# The default inner merge keeps only rows matching a parcel's max transactiondate,
# which removes the rows with duplicate parcelids and less recent transactiondates
output = df.merge(max_transaction_date_by_parcel, on=["transactiondate", "parcelid"])
output

Unnamed: 0,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,parcelid,id,basementsqft,...,id.1,logerror,transactiondate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
0,,,261.0,,,,,14297519,1727539,,...,0,0.025595,2017-01-01,,,,,Single Family Residential,,
1,,,261.0,,,,,17052889,1387261,,...,1,0.055619,2017-01-01,,,,,Single Family Residential,,
2,,,261.0,,,,,14186244,11677,,...,2,0.005383,2017-01-01,,,,,Single Family Residential,,
3,,,261.0,2.0,,,,12177905,2288172,,...,3,-0.103410,2017-01-01,,,,Central,Single Family Residential,,
4,,,266.0,2.0,,,1.0,10887214,1970746,,...,4,0.006940,2017-01-01,Central,,,Central,Condominium,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77376,,,266.0,2.0,,,1.0,10833991,2864704,,...,77608,-0.002245,2017-09-20,Central,,,Central,Condominium,,
77377,,,261.0,2.0,,,,11000655,673515,,...,77609,0.020615,2017-09-20,,,,Central,Single Family Residential,,
77378,,,261.0,,,,,17239384,2968375,,...,77610,0.013209,2017-09-21,,,,,Single Family Residential,,
77379,,,261.0,2.0,,,1.0,12773139,1843709,,...,77611,0.037129,2017-09-21,Central,,,Central,Single Family Residential,,


In [8]:
assert n_duplicates == (df.shape[0] - output.shape[0])
print("Number of duplicates is equal to the length of the original dataframe minus the length of the final dataframe")

Number of duplicates is equal to the length of the original dataframe minus the length of the final dataframe


In [9]:
output.parcelid.duplicated().sum()

0

In [10]:
assert output.parcelid.duplicated().sum() == 0
print("No duplicate parcelid values in the final dataframe")

No duplicate parcelid values in the final dataframe
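
As an aside, a different technique than the merge used above is to sort by transactiondate and keep the last row per parcelid with drop_duplicates. The sketch below uses hypothetical toy data. One difference worth noting: if a parcel had two transactions on the same latest date, the merge approach would keep both rows, whereas this approach always keeps exactly one.

```python
import pandas as pd

# Hypothetical toy data: parcel 2 has two transactions on the same date
toy = pd.DataFrame({
    "parcelid": [1, 1, 2, 2],
    "transactiondate": pd.to_datetime(
        ["2017-01-05", "2017-06-01", "2017-03-10", "2017-03-10"]
    ),
    "logerror": [0.01, 0.02, 0.03, 0.04],
})

# Sort so the most recent transaction comes last within each parcel,
# then keep only that last row per parcelid
latest_only = (
    toy.sort_values("transactiondate", kind="stable")
       .drop_duplicates(subset="parcelid", keep="last")
)

print(latest_only)
```

Which of the tied rows should win is a judgment call about the data; the asserts in the cells above confirm that, for this query result, the merge approach already leaves no duplicate parcelid values.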
