# Case Study: New York Annual Property Sale Record

The New York City government has published neighborhood and citywide [property sale records](https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page) every year since 2007. They are an excellent resource for studying how different events affect the real estate sector in NYC. In this case study, we will clean up the property sale data for Queens from 2019 to 2021 in order to observe how the COVID-19 pandemic affected the buying and selling of homes.

## Retrieving the data

From the [property sale record](https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page) page, it is easy to get the URLs of the Excel sheets containing the data. We can load them directly into pandas data frames.

In [1]:
import pandas as pd

url_2021 = "https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_queens.xlsx"
url_2020 = "https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_queens.xlsx"
url_2019 = "https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_queens.xlsx"
data_2021 = pd.read_excel(url_2021, skiprows=6)
data_2020 = pd.read_excel(url_2020, skiprows=6)
data_2019 = pd.read_excel(url_2019, skiprows=4)



Notice that each Excel sheet begins with several rows of explanatory text, which should be skipped. We manually counted these non-data rows (6 in the 2020 and 2021 sheets, 4 in the 2019 sheet) and passed the count to the `skiprows` argument of `pandas.read_excel()`.
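As a rough illustration of what `skiprows` does, the sketch below uses `pd.read_csv` on an in-memory string instead of the real Excel files (both readers accept a `skiprows` argument). The sample text, and the idea of locating the header by looking for a line that starts with `BOROUGH`, are assumptions made for this example and are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical miniature of a sheet with two rows of explanatory text on top.
raw = "\n".join([
    "Illustrative title row",
    "Illustrative note row",
    "BOROUGH,NEIGHBORHOOD,SALE PRICE",
    "4,ARVERNE,397000",
    "4,ASTORIA,800000",
])

# Find the header row instead of counting the preamble rows by hand.
lines = raw.splitlines()
header_row = next(i for i, line in enumerate(lines) if line.startswith("BOROUGH"))

# Skip everything above the header, so the header becomes the column names.
df = pd.read_csv(io.StringIO(raw), skiprows=header_row)
print(header_row)
print(df.columns.tolist())
```

Locating the header row programmatically avoids hard-coding a different count for each year, at the cost of assuming the first header name is known in advance.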

## Cleaning up the column headers

Let's take a closer look at the column headers of each data frame next.

In [2]:
print(data_2021.columns)
print(data_2020.columns)
print(data_2019.columns)

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL\nUNITS', 'COMMERCIAL\nUNITS', 'TOTAL \nUNITS',
       'LAND \nSQUARE FEET', 'GROSS \nSQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS\nAT TIME OF SALE',
       'SALE PRICE', 'SALE DATE'],
      dtype='object')
Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL\nUNITS', 'COMMERCIAL\nUNITS', 'TOTAL \nUNITS',
       'LAND \nSQUARE FEET', 'GROSS \nSQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS\nAT TIME OF SALE',
       'SALE PRICE', 'SALE DATE'],
      dtype='object')
Index(['BOROUGH\n', 'NEIGHBORHOOD\n', 'BUILDING CLASS CATEGORY\n',
       'TAX

Here we see a few minor problems:

1. The column headers contain spaces and newline (`\n`) characters, which make them hard to reference.
2. All headers are in upper case. Our lives will be a bit easier if they are in lower case.
3. The column names in the 2019 data differ slightly from those in the 2020 and 2021 data.

Fortunately, they can be easily fixed using Pandas:

In [3]:
for data in [data_2021, data_2020, data_2019]:
    data.columns = data.columns\
        .str.replace(r'\s+', '_', regex=True)\
        .str.strip('_')\
        .str.lower()

print(data_2021.columns)
print(data_2020.columns)
print(data_2019.columns)

Index(['borough', 'neighborhood', 'building_class_category',
       'tax_class_at_present', 'block', 'lot', 'ease-ment',
       'building_class_at_present', 'address', 'apartment_number', 'zip_code',
       'residential_units', 'commercial_units', 'total_units',
       'land_square_feet', 'gross_square_feet', 'year_built',
       'tax_class_at_time_of_sale', 'building_class_at_time_of_sale',
       'sale_price', 'sale_date'],
      dtype='object')
Index(['borough', 'neighborhood', 'building_class_category',
       'tax_class_at_present', 'block', 'lot', 'ease-ment',
       'building_class_at_present', 'address', 'apartment_number', 'zip_code',
       'residential_units', 'commercial_units', 'total_units',
       'land_square_feet', 'gross_square_feet', 'year_built',
       'tax_class_at_time_of_sale', 'building_class_at_time_of_sale',
       'sale_price', 'sale_date'],
      dtype='object')
Index(['borough', 'neighborhood', 'building_class_category',
       'tax_class_as_of_final_roll_

Let's break it down a bit:
* `data.columns` extracts the column index from the data frame. It is an [index](https://pandas.pydata.org/docs/reference/indexing.html) object.
* `.str` exposes pandas' vectorized string methods on a series/index of strings.
* `.str.<method_name>()` applies the corresponding string method to each element of the series/index.
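To make the chain concrete, here is a small sketch that applies the same three `.str` steps to a hand-written index. The three sample headers are picked from the outputs above purely for illustration.

```python
import pandas as pd

# A few raw headers in the style of the original sheets.
cols = pd.Index(["BOROUGH\n", "TOTAL \nUNITS", "SALE PRICE"])

cleaned = (
    cols.str.replace(r"\s+", "_", regex=True)  # any run of whitespace -> one underscore
        .str.strip("_")                        # drop leading/trailing underscores
        .str.lower()                           # lower-case everything
)
print(cleaned.tolist())  # ['borough', 'total_units', 'sale_price']
```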

Note that the columns of `data_2020` and `data_2021` are identical, but some columns of `data_2019` have different names.  Fortunately, these inconsistently named columns are not part of the columns we will use in this exercise. So we will leave them as they are.

## Selecting columns

Next, let's only keep the columns we will use and drop the rest of the columns in each data frame. Once done, we can combine all data frames together by concatenating the rows.

In [4]:
selected_columns = ["neighborhood", 
                    "building_class_at_time_of_sale",
                    "block", 
                    "lot", 
                    "residential_units", 
                    "tax_class_at_time_of_sale", 
                    "year_built", 
                    "sale_price", 
                    "sale_date"]
data = pd.concat(
    [data_2021[selected_columns], data_2020[selected_columns], data_2019[selected_columns]]
)
data.describe()

| | block | lot | residential_units | tax_class_at_time_of_sale | year_built | sale_price |
|---|---|---|---|---|---|---|
| count | 77934.0 | 77934.0 | 63168.0 | 77934.0 | 74670.0 | 77934.0 |
| mean | 6811.938512 | 213.439333 | 1.710898 | 1.467434 | 1949.775385 | 723359.2 |
| std | 4469.833791 | 454.562783 | 7.183315 | 0.772764 | 29.621333 | 4719585.0 |
| min | 6.0 | 1.0 | 0.0 | 1.0 | 1018.0 | 0.0 |
| 25% | 2875.0 | 17.0 | 1.0 | 1.0 | 1930.0 | 0.0 |
| 50% | 6434.0 | 39.0 | 1.0 | 1.0 | 1945.0 | 365000.0 |
| 75% | 10479.0 | 83.0 | 2.0 | 2.0 | 1960.0 | 750000.0 |
| max | 16350.0 | 8006.0 | 1115.0 | 4.0 | 2021.0 | 369300000.0 |
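One detail of `pd.concat` worth knowing: by default it keeps each frame's original row labels, so labels repeat across the stacked frames. The sketch below shows this on two tiny made-up frames, together with the `ignore_index=True` option that renumbers the rows. The notebook itself keeps the default behavior, so this is only an aside.

```python
import pandas as pd

# Two hypothetical yearly frames, each with its own 0-based index.
a = pd.DataFrame({"sale_price": [100, 200]})
b = pd.DataFrame({"sale_price": [300, 400]})

stacked = pd.concat([a, b])                        # row labels are kept as-is
renumbered = pd.concat([a, b], ignore_index=True)  # row labels are rebuilt

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```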


## Check for missing values

With all the data frames combined into one, let's examine the actual data using `head()`.

In [5]:
data.head()

| | neighborhood | building_class_at_time_of_sale | block | lot | residential_units | tax_class_at_time_of_sale | year_built | sale_price | sale_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 1 | AIRPORT LA GUARDIA | A5 | 949.0 | 24.0 | 1.0 | 1.0 | 1945.0 | 0.0 | 2021-05-27 |
| 2 | AIRPORT LA GUARDIA | A5 | 949.0 | 30.0 | 1.0 | 1.0 | 1945.0 | 935000.0 | 2021-12-08 |
| 3 | AIRPORT LA GUARDIA | A5 | 976.0 | 12.0 | 1.0 | 1.0 | 1950.0 | 800000.0 | 2021-01-11 |
| 4 | AIRPORT LA GUARDIA | A5 | 976.0 | 15.0 | 1.0 | 1.0 | 1950.0 | 0.0 | 2021-04-30 |


The result of `head()` reveals that the data frame contains missing data (marked as `NaN` and `NaT`): the very first row is entirely empty. To get a more complete picture, let's also check using `info()`.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77936 entries, 0 to 26420
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   neighborhood                    77934 non-null  object        
 1   building_class_at_time_of_sale  77934 non-null  object        
 2   block                           77934 non-null  float64       
 3   lot                             77934 non-null  float64       
 4   residential_units               63168 non-null  float64       
 5   tax_class_at_time_of_sale       77934 non-null  float64       
 6   year_built                      74670 non-null  float64       
 7   sale_price                      77934 non-null  float64       
 8   sale_date                       77934 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 5.9+ MB


The output above shows that there are a total of 77936 rows in the data frame. Among them, most of the columns (e.g. `neighborhood`, `building_class_at_time_of_sale`, etc.) contain 77934 non-null entries, which means the remaining 2 entries are missing data. These missing values are probably due to the original excel sheet containing empty rows.

In comparison, `residential_units` and `year_built` contain many more missing values. A missing value in the `residential_units` column suggests that the corresponding property may be a parking lot or garage, for which the number of residential units is not meaningful. A missing value in the `year_built` column may mean that the information is genuinely missing, which could happen due to bookkeeping errors.

Because we are only interested in residential property data in this exercise, we can safely drop the rows that contain missing values.

In [7]:
data = data.dropna()
data.isna().any()

neighborhood                      False
building_class_at_time_of_sale    False
block                             False
lot                               False
residential_units                 False
tax_class_at_time_of_sale         False
year_built                        False
sale_price                        False
sale_date                         False
dtype: bool

Here, `data.isna()` is another way of checking for missing values. It checks every cell and outputs a data frame of `True` (value is missing) and `False` (value is not missing), one for each cell. The `.any()` method then aggregates those values per column, returning `True` if any cell in the column is missing. Since the result for every column is `False`, our data frame is now free of missing values.
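The following sketch shows `isna()`, `.any()` and `.sum()` side by side on a tiny made-up frame, and contrasts a plain `dropna()` with `dropna(subset=...)`, which only looks at the listed columns. The values are invented for illustration, and the `subset` variant is an alternative the notebook does not use.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "residential_units": [1.0, np.nan, 2.0],
    "year_built": [1945.0, 1950.0, np.nan],
    "sale_price": [0.0, 935000.0, 800000.0],
})

has_missing = df.isna().any()  # one True/False per column
n_missing = df.isna().sum()    # number of missing cells per column

dropped_any = df.dropna()                               # drop rows with any missing cell
dropped_units = df.dropna(subset=["residential_units"])  # only check one column

print(has_missing.tolist())                  # [True, True, False]
print(n_missing.tolist())                    # [1, 1, 0]
print(len(dropped_any), len(dropped_units))  # 1 2
```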

## Specify column types

We can check the data types of each column by looking at the `dtypes` property.

In [8]:
data.dtypes

neighborhood                              object
building_class_at_time_of_sale            object
block                                    float64
lot                                      float64
residential_units                        float64
tax_class_at_time_of_sale                float64
year_built                               float64
sale_price                               float64
sale_date                         datetime64[ns]
dtype: object

The `neighborhood` and `building_class_at_time_of_sale` columns contain strings, so their data type shows up as `object`, which is OK. `sale_date` has data type `datetime64[ns]`, which is correct. However, `block`, `lot`, `residential_units`, `tax_class_at_time_of_sale` and `year_built` all show up as `float64`. This is not ideal. `block`, `lot` and `tax_class_at_time_of_sale` are categorical data encoded as numbers, so it does not make sense to add, subtract or interpolate these values. `residential_units` should be an integer, and `year_built` should be a date type. Let's fix them.

In [9]:
data.block = data.block.astype("category")
data.lot = data.lot.astype("category")
data.residential_units = data.residential_units.astype(int)
data.tax_class_at_time_of_sale = data.tax_class_at_time_of_sale.astype("category")
data.dtypes

neighborhood                              object
building_class_at_time_of_sale            object
block                                   category
lot                                     category
residential_units                          int64
tax_class_at_time_of_sale               category
year_built                               float64
sale_price                               float64
sale_date                         datetime64[ns]
dtype: object

In the code above, we use the `astype()` method to change the data type for `block`, `lot`, `residential_units` and `tax_class_at_time_of_sale` columns. `dtypes` reflects this change.
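As a small sketch of what these casts do, the example below converts a made-up `block` column to `category` and a made-up `residential_units` column to integer. It also shows why the earlier `dropna()` matters: casting a float column that still contains a missing value to `int` raises an error. The sample values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "block": [949.0, 976.0, 949.0],
    "residential_units": [1.0, 2.0, 1.0],
})
df["block"] = df["block"].astype("category")
df["residential_units"] = df["residential_units"].astype(int)

# A categorical column stores each distinct value once.
print(df["block"].cat.categories.tolist())  # [949.0, 976.0]

# Casting to int fails if a missing value is still present.
with_gap = pd.Series([1.0, None])
cast_failed = False
try:
    with_gap.astype(int)
except ValueError as exc:
    cast_failed = True
    print("cannot cast:", exc)
```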

Next, let's change the data type of the `year_built` column. Because there are many ways of representing temporal data, we cannot simply use `astype()` to cast a column to a date/time class.  Pandas provides a handy function `pd.to_datetime` just for this purpose. Let's give it a try.

In [10]:
# data.year_built = pd.to_datetime(data.year_built, format="%Y")

If you uncomment the above line and run it, it will produce an error similar to the one below.
```
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1018-01-01 00:00:00
```

This indicates that some property is recorded as built in the year 1018, which is outside the range that pandas date/time objects can represent.

In [11]:
print(f"min: {pd.Timestamp.min}, max: {pd.Timestamp.max}")

min: 1677-09-21 00:12:43.145224193, max: 2262-04-11 23:47:16.854775807


The `pd.Timestamp` class can only represent dates between late 1677 and early 2262. Year 1018 is indeed out of range.

In [12]:
data.year_built.describe()

count    60112.000000
mean      1948.510414
std         31.841573
min       1018.000000
25%       1925.000000
50%       1940.000000
75%       1960.000000
max       2021.000000
Name: year_built, dtype: float64

From the statistics above, it seems year 1018 may be an outlier since both the mean and median year are in the 1940s.

In [13]:
data[data.year_built < 1800]

| | neighborhood | building_class_at_time_of_sale | block | lot | residential_units | tax_class_at_time_of_sale | year_built | sale_price | sale_date |
|---|---|---|---|---|---|---|---|---|---|
| 66 | ARVERNE | B2 | 15915.0 | 71.0 | 2 | 1.0 | 1018.0 | 10.0 | 2020-10-21 |
| 100 | ARVERNE | B2 | 15915.0 | 71.0 | 2 | 1.0 | 1018.0 | 0.0 | 2019-04-11 |
| 20941 | RIDGEWOOD | R1 | 3431.0 | 1003.0 | 1 | 2.0 | 1030.0 | 485000.0 | 2019-04-05 |


Investigating a bit further, only 3 records have a build year before 1800: two list 1018 (both for the same block and lot in Arverne) and one lists 1030. These are most likely typos made when the records were entered. The actual years are probably 1918 and 1930, with a 0 typed in place of the 9.

In any case, we should ignore these entries since we cannot trust their accuracy.
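A boolean mask is a convenient way to separate plausible from implausible years before dropping anything. The sketch below uses `Series.between` on a few made-up values. The bounds 1800 and 2021 mirror the cutoff used in this notebook and the latest year in the data, and they are a judgment call rather than a fixed rule.

```python
import pandas as pd

years = pd.Series([1018.0, 1030.0, 1918.0, 1945.0, 2021.0], name="year_built")

# True where the year falls inside the accepted range (both ends inclusive).
plausible = years.between(1800, 2021)

suspect = years[~plausible]
kept = years[plausible]
print(suspect.tolist())  # [1018.0, 1030.0]
print(kept.tolist())     # [1918.0, 1945.0, 2021.0]
```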

In [14]:
data = data[data.year_built >= 1800]
data.year_built = pd.to_datetime(data.year_built, format="%Y")
data.dtypes

neighborhood                              object
building_class_at_time_of_sale            object
block                                   category
lot                                     category
residential_units                          int64
tax_class_at_time_of_sale               category
year_built                        datetime64[ns]
sale_price                               float64
sale_date                         datetime64[ns]
dtype: object

Now every column has its intended data type.
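If only the year is ever needed, one alternative (not the approach taken in this notebook) is to keep `year_built` as a plain integer. Integers have no lower bound problem, and arithmetic such as the building's age at sale still works. A minimal sketch on made-up rows, where `age_at_sale` is a hypothetical column name:

```python
import pandas as pd

df = pd.DataFrame({
    "year_built": [1018.0, 1945.0, 2012.0],
    "sale_date": pd.to_datetime(["2020-10-21", "2021-05-27", "2021-08-26"]),
})

# Store the year as an integer instead of a timestamp.
df["year_built"] = df["year_built"].astype(int)

# Age of the building in the year it was sold.
df["age_at_sale"] = df["sale_date"].dt.year - df["year_built"]
print(df["age_at_sale"].tolist())  # [1002, 76, 9]
```

The trade-off is that an integer column does not offer the `.dt` accessor, so date-specific operations on `year_built` would need an explicit conversion first.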

## Filter rows corresponding to residential property

As mentioned before, the Excel sheets contain all types of property sale records, including parking lots, garages, factories, office buildings, etc. We are only interested in residential properties, i.e. homes. The `residential_units` column can help us filter the rows.

In [15]:
data[data.residential_units==0]

| | neighborhood | building_class_at_time_of_sale | block | lot | residential_units | tax_class_at_time_of_sale | year_built | sale_price | sale_date |
|---|---|---|---|---|---|---|---|---|---|
| 215 | ARVERNE | H3 | 15852.0 | 60.0 | 0 | 4.0 | 2012-01-01 | 25000000.0 | 2021-08-26 |
| 216 | ARVERNE | F5 | 15857.0 | 1.0 | 0 | 4.0 | 1931-01-01 | 90400000.0 | 2021-11-18 |
| 217 | ARVERNE | F5 | 15857.0 | 7.0 | 0 | 4.0 | 1949-01-01 | 7750000.0 | 2021-04-09 |
| 219 | ARVERNE | G1 | 16078.0 | 6.0 | 0 | 4.0 | 1920-01-01 | 410000.0 | 2021-04-29 |
| 220 | ARVERNE | E9 | 16006.0 | 1.0 | 0 | 4.0 | 1965-01-01 | 1500000.0 | 2021-02-23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26416 | WOODSIDE | RB | 1307.0 | 1101.0 | 0 | 4.0 | 2007-01-01 | 0.0 | 2019-06-27 |
| 26417 | WOODSIDE | RP | 1183.0 | 1033.0 | 0 | 4.0 | 2008-01-01 | 0.0 | 2019-03-15 |
| 26418 | WOODSIDE | RG | 1299.0 | 1015.0 | 0 | 4.0 | 2008-01-01 | 35000.0 | 2019-04-11 |
| 26419 | WOODSIDE | RG | 1307.0 | 1122.0 | 0 | 4.0 | 2007-01-01 | 665000.0 | 2019-10-17 |


It is reasonable to assume that if `residential_units == 0`, the property is not a home. To keep this exercise simple, let's focus only on records where `residential_units == 1`.

Also, we are interested in the buying and selling of homes. So the `sale_price` should not be 0. Zero sale price may indicate some form of ownership change without monetary exchange.

In [16]:
data[data.sale_price <= 0]

| | neighborhood | building_class_at_time_of_sale | block | lot | residential_units | tax_class_at_time_of_sale | year_built | sale_price | sale_date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AIRPORT LA GUARDIA | A5 | 949.0 | 24.0 | 1 | 1.0 | 1945-01-01 | 0.0 | 2021-05-27 |
| 4 | AIRPORT LA GUARDIA | A5 | 976.0 | 15.0 | 1 | 1.0 | 1950-01-01 | 0.0 | 2021-04-30 |
| 7 | AIRPORT LA GUARDIA | B1 | 976.0 | 1.0 | 2 | 1.0 | 1950-01-01 | 0.0 | 2021-08-29 |
| 9 | AIRPORT LA GUARDIA | C0 | 949.0 | 59.0 | 3 | 1.0 | 1945-01-01 | 0.0 | 2021-11-08 |
| 10 | ARVERNE | A5 | 15828.0 | 53.0 | 1 | 1.0 | 2002-01-01 | 0.0 | 2021-03-30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26399 | WOODSIDE | F4 | 1294.0 | 68.0 | 0 | 4.0 | 1958-01-01 | 0.0 | 2019-05-14 |
| 26402 | WOODSIDE | F5 | 2318.0 | 28.0 | 0 | 4.0 | 1929-01-01 | 0.0 | 2019-03-29 |
| 26412 | WOODSIDE | E9 | 2283.0 | 1.0 | 0 | 4.0 | 1931-01-01 | 0.0 | 2019-06-19 |
| 26416 | WOODSIDE | RB | 1307.0 | 1101.0 | 0 | 4.0 | 2007-01-01 | 0.0 | 2019-06-27 |


The last thing we can use for filtering is the `building_class_at_time_of_sale` column. According to the [NYC building classification](https://www.nyc.gov/assets/finance/jump/hlpbldgcode.html), only building codes that start with "A", "B", "C", "D" or "R" can be residential properties. A building code that starts with "R" is a residential unit only if the character following "R" is a digit.

This means we should separate `building_class_at_time_of_sale` into 2 columns: one containing the first letter of the building code, the second containing the rest of the building code.

In [17]:
data["building_class_1"] = data.building_class_at_time_of_sale.str[0]
data["building_class_2"] = data.building_class_at_time_of_sale.str[1:]
data.groupby("building_class_1").building_class_1.count()

building_class_1
A    26616
B    16444
C     5436
D      154
E      331
F      183
G      371
H       15
I       19
J        7
K      866
M       72
N       11
O      247
P       28
Q        5
R     7868
S     1314
V       65
W       23
Y        1
Z       33
Name: building_class_1, dtype: int64

In the code above, we split `building_class_at_time_of_sale` into `building_class_1` (the first letter) and `building_class_2` (the rest). We can then count the number of entries for each category in `building_class_1` using `groupby()`. It shows that most sale records fall in the `A` or `B` building class category.
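The same split-and-count pattern can be tried on a handful of made-up building codes. The sketch also shows `value_counts()`, which gives the same per-letter counts as the `groupby()` approach and is offered here only as an equivalent alternative.

```python
import pandas as pd

codes = pd.Series(["A5", "A5", "B2", "R1", "RG", "F5"], name="building_class")

first = codes.str[0]   # first character of each code
rest = codes.str[1:]   # everything after the first character

by_group = codes.groupby(first).count()        # count per first letter
by_value = first.value_counts().sort_index()   # same counts, different route

print(rest.tolist())      # ['5', '5', '2', '1', 'G', '5']
print(by_group.to_dict()) # {'A': 2, 'B': 1, 'F': 1, 'R': 2}
```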

Now, let's carry out the filtering on all criteria mentioned before.

In [18]:
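# Keep only building classes that can be residential.
data = data[data.building_class_1.isin(["A", "B", "C", "D", "R"])]
# Inspect the sub-classes under "R" (the result is discarded unless this is the last line of a cell).
data[data.building_class_1 == "R"].groupby("building_class_2").count()
# For "R" codes, keep only those followed by a digit from 1 to 9.
valid_class_2 = list(map(str, range(1,10)))
data = data[(data.building_class_1!="R") | data.building_class_2.isin(valid_class_2)]
# Keep single-unit properties with a non-zero sale price in tax class 1 or 2.
data = data[data.residential_units == 1]
data = data[data.sale_price > 0]
data = data[data.tax_class_at_time_of_sale.isin([1, 2])]
data.info()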

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22593 entries, 2 to 26375
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   neighborhood                    22593 non-null  object        
 1   building_class_at_time_of_sale  22593 non-null  object        
 2   block                           22593 non-null  category      
 3   lot                             22593 non-null  category      
 4   residential_units               22593 non-null  int64         
 5   tax_class_at_time_of_sale       22593 non-null  category      
 6   year_built                      22593 non-null  datetime64[ns]
 7   sale_price                      22593 non-null  float64       
 8   sale_date                       22593 non-null  datetime64[ns]
 9   building_class_1                22593 non-null  object        
 10  building_class_2                22593 non-null  object        
dtypes:

In the end, we are left with 22593 residential sale records.
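The filtering criteria can also be expressed as a single boolean mask, which some readers find easier to audit than a chain of reassignments. The sketch below applies the building class, unit count and sale price rules to six made-up rows. It is a restatement for illustration rather than the notebook's own code, and it leaves out the tax class filter for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "building_class": ["A5", "R1", "RG", "F5", "R4", "RB"],
    "residential_units": [1, 1, 0, 0, 1, 0],
    "sale_price": [935000.0, 485000.0, 35000.0, 0.0, 0.0, 0.0],
})

first = df["building_class"].str[0]
second = df["building_class"].str[1:]
digits = [str(d) for d in range(1, 10)]

# A-D are always candidates; "R" only when followed by a digit 1-9.
residential_class = first.isin(["A", "B", "C", "D"]) | ((first == "R") & second.isin(digits))

mask = residential_class & (df["residential_units"] == 1) & (df["sale_price"] > 0)
kept = df[mask]
print(kept["building_class"].tolist())  # ['A5', 'R1']
```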

## Double checking

Now we have completed all the data clean up steps. Let's double check!

In [19]:
data.head()

| | neighborhood | building_class_at_time_of_sale | block | lot | residential_units | tax_class_at_time_of_sale | year_built | sale_price | sale_date | building_class_1 | building_class_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | AIRPORT LA GUARDIA | A5 | 949.0 | 30.0 | 1 | 1.0 | 1945-01-01 | 935000.0 | 2021-12-08 | A | 5 |
| 3 | AIRPORT LA GUARDIA | A5 | 976.0 | 12.0 | 1 | 1.0 | 1950-01-01 | 800000.0 | 2021-01-11 | A | 5 |
| 5 | AIRPORT LA GUARDIA | A5 | 976.0 | 26.0 | 1 | 1.0 | 1950-01-01 | 815000.0 | 2021-06-25 | A | 5 |
| 6 | AIRPORT LA GUARDIA | A5 | 976.0 | 62.0 | 1 | 1.0 | 1950-01-01 | 950000.0 | 2021-08-17 | A | 5 |
| 11 | ARVERNE | A2 | 15830.0 | 43.0 | 1 | 1.0 | 1920-01-01 | 397000.0 | 2021-07-09 | A | 2 |


The first few rows look good! Let's run some summary statistics.

In [20]:
data.groupby("neighborhood").sale_price.count().sort_values(ascending=False)

neighborhood
FLUSHING-NORTH              2577
LONG ISLAND CITY            1653
BAYSIDE                     1153
ST. ALBANS                   971
QUEENS VILLAGE               892
SO. JAMAICA-BAISLEY PARK     828
FLUSHING-SOUTH               784
SPRINGFIELD GARDENS          724
SOUTH OZONE PARK             592
MIDDLE VILLAGE               576
HOWARD BEACH                 559
HOLLIS                       553
ELMHURST                     550
ASTORIA                      550
FOREST HILLS                 535
LAURELTON                    524
RICHMOND HILL                520
SOUTH JAMAICA                484
ROSEDALE                     447
WHITESTONE                   439
OZONE PARK                   387
CAMBRIA HEIGHTS              349
REGO PARK                    347
WOODHAVEN                    334
FLORAL PARK                  327
BELLEROSE                    313
JACKSON HEIGHTS              295
FAR ROCKAWAY                 270
DOUGLASTON                   270
LITTLE NECK                  2

In the code above, we group records by neighborhood and count the number of records in each neighborhood. It shows that Flushing-North has the most active residential real estate market, with over 2,500 records from 2019 to 2021. Long Island City is a distant second.

In [21]:
data.groupby("neighborhood").sale_price.sum().sort_values(ascending=False)

neighborhood
FLUSHING-NORTH              3.353200e+09
ELMHURST                    1.888525e+09
LONG ISLAND CITY            1.833825e+09
BAYSIDE                     9.743075e+08
FOREST HILLS                6.115548e+08
FLUSHING-SOUTH              5.891606e+08
QUEENS VILLAGE              4.808491e+08
ST. ALBANS                  4.572763e+08
WHITESTONE                  4.249463e+08
ASTORIA                     4.156667e+08
MIDDLE VILLAGE              4.003844e+08
SO. JAMAICA-BAISLEY PARK    3.774314e+08
SPRINGFIELD GARDENS         3.454123e+08
HOWARD BEACH                3.344731e+08
SOUTH OZONE PARK            3.114146e+08
HOLLIS                      2.972953e+08
RICHMOND HILL               2.902529e+08
DOUGLASTON                  2.727235e+08
LITTLE NECK                 2.505815e+08
LAURELTON                   2.472694e+08
REGO PARK                   2.429241e+08
FLORAL PARK                 2.285595e+08
ROSEDALE                    2.253296e+08
JACKSON HEIGHTS             2.183549e+08
SOU

Here, instead of counting the number of records, we compute the total sale value of all residential properties in each neighborhood. Again, Flushing-North comes out on top, and Elmhurst is a distant second.
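Counts and totals can be computed in one pass with `.agg()`. The sketch below does this on five made-up sales and adds the median, which is less sensitive than the sum to how many sales a neighborhood has. The rows and prices are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["ARVERNE", "ARVERNE", "ASTORIA", "ASTORIA", "ASTORIA"],
    "sale_price": [397000.0, 410000.0, 800000.0, 935000.0, 950000.0],
})

# One row per neighborhood, one column per statistic.
summary = (
    df.groupby("neighborhood")["sale_price"]
      .agg(["count", "sum", "median"])
      .sort_values("sum", ascending=False)
)
print(summary)
```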

In [22]:
data[data.sale_price > 1e6].groupby("neighborhood").sale_price.count().sort_values(ascending=False)

neighborhood
LONG ISLAND CITY            769
FLUSHING-NORTH              500
FOREST HILLS                258
BAYSIDE                     205
WHITESTONE                  122
ELMHURST                    107
DOUGLASTON                  100
JAMAICA ESTATES              93
ASTORIA                      77
FLUSHING-SOUTH               77
LITTLE NECK                  67
BEECHHURST                   67
FRESH MEADOWS                62
HOLLISWOOD                   43
KEW GARDENS                  38
NEPONSIT                     31
OAKLAND GARDENS              30
REGO PARK                    29
BELLE HARBOR                 26
JACKSON HEIGHTS              26
HOLLIS HILLS                 25
SUNNYSIDE                    23
HOWARD BEACH                 22
FAR ROCKAWAY                 20
MIDDLE VILLAGE               18
FLORAL PARK                  17
ROSEDALE                      9
COLLEGE POINT                 9
GLEN OAKS                     8
SPRINGFIELD GARDENS           8
ROCKAWAY PARK              

This time, we focus only on sales priced above 1 million USD (i.e. very expensive homes, or average-priced homes by NYC standards). Here, Long Island City comes out on top.

Lastly, let's see how the pandemic affected the sale records.

In [23]:
data.groupby(data.sale_date.dt.year).sale_price.count()

sale_date
2019    7331
2020    6130
2021    9132
Name: sale_price, dtype: int64

Note the number of property sale records visibly decreased in 2020, but it bounced back in 2021.

In [24]:
data.groupby([data.sale_date.dt.year, data.sale_date.dt.quarter])\
    .sale_price.count()

sale_date  sale_date
2019       1            1610
           2            1759
           3            1965
           4            1997
2020       1            1753
           2             981
           3            1450
           4            1946
2021       1            1982
           2            2322
           3            2509
           4            2319
Name: sale_price, dtype: int64

We can also aggregate sales by quarter, as shown above. The second quarter of 2020 took the hardest hit: its 981 sale records are roughly 56% of the 1759 recorded in Q2 2019.
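To quantify the quarterly comparison, the sketch below re-enters the quarterly counts from the output above by hand and divides each 2020 quarter by the same quarter of 2019. The table layout (quarters as rows, years as columns) is a choice made for this example.

```python
import pandas as pd

# Quarterly record counts copied from the output above.
counts = pd.DataFrame(
    {
        2019: [1610, 1759, 1965, 1997],
        2020: [1753, 981, 1450, 1946],
        2021: [1982, 2322, 2509, 2319],
    },
    index=pd.Index([1, 2, 3, 4], name="quarter"),
)

# Ratio of 2020 to 2019 for the same quarter; below 1 means fewer sales.
ratio_2020 = (counts[2020] / counts[2019]).round(2)
print(ratio_2020.tolist())  # [1.09, 0.56, 0.74, 0.97]
```

Comparing the same quarter across years, rather than consecutive quarters, keeps seasonal effects out of the comparison.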

## Saving data

With everything looking good, let's save the cleaned up data for later usage.

In [25]:
data.to_excel("queens_residential_property_sale_record.xlsx")
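If a plain-text format is preferred, CSV is an alternative to Excel. The sketch below writes a tiny made-up frame to an in-memory buffer and reads it back, assuming the date column is named `sale_date` as in this notebook. Note that CSV does not store data types, so dates have to be re-parsed on load and `category` columns would need to be cast again.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["ARVERNE", "ASTORIA"],
    "sale_price": [397000.0, 800000.0],
    "sale_date": pd.to_datetime(["2021-07-09", "2021-01-11"]),
})

# Write without the row index, then read back from the same buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["sale_date"])

print(restored.dtypes)
```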

## Summary

In this exercise, we walked through the process of cleaning up the New York property sale record dataset. Here is a summary of the pandas functionality we used, with links to the documentation:

* `pd.read_excel` [doc](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
* `pd.DataFrame` [doc](https://pandas.pydata.org/docs/reference/frame.html)
  * `pd.DataFrame.describe` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe)
  * `pd.DataFrame.info` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info)
  * `pd.DataFrame.columns` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html#pandas.DataFrame.columns)
  * `pd.DataFrame.head` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head)
  * `pd.DataFrame.dtypes` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes)
  * `pd.DataFrame.isna` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html?highlight=isna#pandas.DataFrame.isna)
  * `pd.DataFrame.dropna` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna)
  * `pd.DataFrame.groupby` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)
* `pd.Series` [doc](https://pandas.pydata.org/docs/reference/series.html)
  * `pd.Series.str` [doc](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html)
* `pd.concat` [doc](https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat)
* `pd.to_datetime` [doc](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime)