# The Analysis

Begin by opening `cleaned.csv`. Make sure we can see all of the columns.
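
A minimal sketch of this step. The in-memory CSV and its rows are made up so the example runs anywhere; in the notebook you would pass `"cleaned.csv"` to `read_csv` instead.

```python
import io
import pandas as pd

# Show every column instead of hiding the middle ones behind "..."
pd.set_option("display.max_columns", None)

# Stand-in for cleaned.csv (made-up rows, assumed column names).
# In the notebook: df = pd.read_csv("cleaned.csv")
sample_csv = io.StringIO(
    "BOROUGH,GROSS SQUARE FEET,SALE PRICE,SALE DATE\n"
    "3,1200,450000,2012-09-14\n"
    "3,0,0,2012-09-30\n"
)
df = pd.read_csv(sample_csv)
print(df.head())
```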

## Fix the column names

Sigh, I don't like the whole "GROSS SQUARE FEET" kind of thing. Let's turn that into `gross_square_feet` (along with all the other columns). 
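
One way to do it, sketched on a tiny made-up dataframe: the `.str` methods work on the column index just like they do on a column of strings.

```python
import pandas as pd

# Made-up row, using the shouty column style from the raw file
df = pd.DataFrame({
    "GROSS SQUARE FEET": [1200],
    "SALE PRICE": [450000],
    "SALE DATE": ["2012-09-14"],
})

# Trim stray whitespace, lower-case, and swap spaces for underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(df.columns.tolist())
```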

## Convert the sale date to a datetime using `pd.to_datetime`

You could have also done it with `read_csv`! You can do `parse_dates=[...]` with a list of the dates you'd like to be turned into datetimes.
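
Both routes side by side, as a sketch. The CSV text is a made-up stand-in for the real file, and `sale_date` assumes the columns have already been renamed.

```python
import io
import pandas as pd

# Made-up stand-in for the real file
csv_text = "sale_price,sale_date\n450000,2012-09-14\n0,2012-09-30\n"

# Option 1: read first, convert afterwards
df = pd.read_csv(io.StringIO(csv_text))
df["sale_date"] = pd.to_datetime(df["sale_date"])

# Option 2: ask read_csv to parse the column while loading
df2 = pd.read_csv(io.StringIO(csv_text), parse_dates=["sale_date"])

print(df.dtypes)
print(df2.dtypes)
```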

## I've heard house buying is seasonal, with no sales in the winter. Which months of the year have the highest number of home sales?

Show me on a graph. Is my secondhand knowledge correct?
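
A possible approach, assuming `sale_date` is already a datetime: pull the month number out with `.dt.month`, count how often each one shows up, and plot it. The dates below are invented purely for illustration.

```python
import pandas as pd

# Invented dates, just enough to have something to count
df = pd.DataFrame({"sale_date": pd.to_datetime([
    "2010-01-05", "2010-01-20", "2010-06-10",
    "2011-06-15", "2011-06-18", "2011-12-01",
])})

# Count sales per month-of-year (1 = January), in calendar order
sales_by_month = df["sale_date"].dt.month.value_counts().sort_index()

ax = sales_by_month.plot(kind="bar")
ax.set_xlabel("month of year")
ax.set_ylabel("number of sales")
```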

## Hm. How about years? Is there a pattern there?

I'd like to see a bar chart (columns) of number of homes sold each year.
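
The same idea works for years. This sketch groups by `.dt.year` and counts rows with `.size()`; the dates are made up.

```python
import pandas as pd

# Made-up dates spread across three years
df = pd.DataFrame({"sale_date": pd.to_datetime([
    "2008-02-01", "2008-05-11", "2008-11-30",
    "2009-07-04",
    "2010-03-15", "2010-09-09",
])})

# Number of rows (sales) in each calendar year
sales_by_year = df.groupby(df["sale_date"].dt.year).size()

ax = sales_by_year.plot(kind="bar")
ax.set_ylabel("number of sales")
```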

## That really reminds me of the [Great Recession](https://en.wikipedia.org/wiki/Great_Recession).

I'd like a little more detail, though. The years are just so *lumpy*. Can I see the number of house sales on a monthly basis through the whole dataset?

> I kept using `colname.count()` when we did this in class, which was a horrible habit of mine in like 2016. I got smarter later, and realized you can just use `.size()` to count the number of rows in each group.

And please make sure the y axis starts at zero. It's really a misrepresentation of the truth if it starts at 750 or whatever matplotlib wants to do by default.
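
Here is a sketch of one way to get there with `resample` and `.size()`, on invented dates. It uses `"MS"` (month start) because newer pandas versions spell the month-end code `"ME"` rather than `"M"`; with `"MS"` each month is labeled by its first day rather than its last.

```python
import pandas as pd

# Invented dates; note there are no sales in February
df = pd.DataFrame({"sale_date": pd.to_datetime([
    "2010-01-05", "2010-01-20", "2010-03-10",
])})

# One row per calendar month, counting the sales that fall in it
monthly_sales = df.resample("MS", on="sale_date").size()

ax = monthly_sales.plot()
# Force the vertical axis (the counts) to start at zero
ax.set_ylim(bottom=0)
```

Months with no sales at all show up as zeros instead of disappearing, which a plain `groupby` on year and month would not give you.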

## That's still kind of rough. Can we smooth it out some more?

I'd like to move from monthly house sales to something a little more spread out. Here are some frequency options from the pandas documentation:

```
B         business day frequency
C         custom business day frequency (experimental)
D         calendar day frequency
W         weekly frequency
M         month end frequency
SM        semi-month end frequency (15th and end of month)
BM        business month end frequency
CBM       custom business month end frequency
MS        month start frequency
SMS       semi-month start frequency (1st and 15th)
BMS       business month start frequency
CBMS      custom business month start frequency
Q         quarter end frequency
BQ        business quarter end frequency
QS        quarter start frequency
BQS       business quarter start frequency
A         year end frequency
BA, BY    business year end frequency
AS, YS    year start frequency
BAS, BYS  business year start frequency
BH        business hour frequency
H         hourly frequency
T, min    minutely frequency
S         secondly frequency
L, ms     milliseconds
U, us     microseconds
N         nanoseconds
```

On top of just picking a frequency by itself, you can also add numbers! For example, `3A` is three years at a time, and `15D` is 15 days, etc. Maybe you could put that to use?

Can you think of a better option than bundling this information in groups like this? Maybe from the cherry trees homework?
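
Two sketches on made-up weekly sales. The first puts a number in front of a frequency code to get six-month buckets. The second is only a guess at what the hint is pointing to: a rolling average over the monthly counts, which smooths the line without throwing away the monthly detail.

```python
import pandas as pd

# Made-up data: one sale every 7 days for three years
dates = pd.date_range("2010-01-01", "2012-12-31", freq="7D")
df = pd.DataFrame({"sale_date": dates})

# Bigger buckets: six months at a time ("MS" = month start)
half_yearly = df.resample("6MS", on="sale_date").size()

# Assumed alternative: monthly counts, each averaged with the 11 months before it
monthly = df.resample("MS", on="sale_date").size()
rolling = monthly.rolling(12).mean()

print(half_yearly)
print(rolling.dropna().head())
```

The rolling version leaves the first 11 months empty, because there isn't a full 12-month window to average yet.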

## But what about prices?

I'm also pretty confident that 

W... wait, what? What are those months where it seems like nothing was sold, or it was all sold for zero dollars? Let's find the top 20 months for lowest median sale price.
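
A sketch of the "lowest median months" lookup, on invented prices. It assumes the columns are named `sale_date` and `sale_price`, and it labels months by their first day (`"MS"`), so September 2012 shows up as `2012-09-01` rather than `2012-09-30`.

```python
import pandas as pd

# Invented sales, including a couple of suspicious zero-dollar ones
df = pd.DataFrame({
    "sale_date": pd.to_datetime([
        "2012-08-03", "2012-08-20", "2012-09-10", "2012-09-30", "2012-10-02",
    ]),
    "sale_price": [500000, 700000, 0, 0, 650000],
})

# Median sale price for each calendar month
median_by_month = df.resample("MS", on="sale_date")["sale_price"].median()

# The 20 months with the lowest median (fewer here, since the sample is tiny)
lowest = median_by_month.sort_values().head(20)
print(lowest)
```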

**??WHAT??**

Did any houses even sell then???? Let's look at the sales on `2012-09-30`.
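
If `2012-09-30` came out of a month-end resample, it is a label for all of September rather than a single day, so it may be worth looking both ways. A sketch on invented rows:

```python
import pandas as pd

# Invented rows around the month in question
df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2012-09-10", "2012-09-30", "2012-10-02"]),
    "sale_price": [0, 0, 650000],
})

# Sales on that exact day (pandas parses the string for the comparison)
on_that_day = df[df["sale_date"] == "2012-09-30"]

# Everything in the month that label stands for
in_that_month = df[(df["sale_date"] >= "2012-09-01") & (df["sale_date"] < "2012-10-01")]

print(on_that_day)
print(in_that_month)
```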

## We've got a LOT of weird stuff going on here! There are all of these FREE houses?

Let's calculate what percent of the time sale prices are zero.

**Two approaches:**

1. The low-tech approach is calculating the size of the entire dataset, then filtering for where sale price is zero and getting the size of that subset.
2. The other approach (THAT I LOVE) takes two steps:
    - Write the "is your sale price 0?" code, but don't feed it to `df[...]` yet. It should be giving you Trues and Falses.
    - Put parens around your statement, then add `.value_counts()`. This does a `value_counts()` on your Trues and Falses, thus calculating how often it's zero and how often it isn't!!!
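
Both approaches, sketched on four made-up prices. Passing `normalize=True` is an extra not mentioned above; it makes `value_counts()` return shares instead of raw counts.

```python
import pandas as pd

# Made-up prices: one of the four is zero
df = pd.DataFrame({"sale_price": [0, 450000, 800000, 1200000]})

# Approach 1: size of the zero-price subset over the size of everything
pct_zero = len(df[df["sale_price"] == 0]) / len(df) * 100

# Approach 2: count the Trues and Falses directly
counts = (df["sale_price"] == 0).value_counts()
shares = (df["sale_price"] == 0).value_counts(normalize=True)

print(pct_zero)
print(counts)
print(shares)
```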

## Also, what's the "building class" thing?

I always assumed we were talking about houses, now there's this `41 TAX CLASS 4 - OTHER` thing? And when you google the addresses, they turn out to be parking lots???

**I guess we were really just making some dumb assumptions!** Let's narrow things down a bit. We'll start by looking at what the different building classes are.

Look at what the most common building classes are in the dataset. There are a few columns to choose from, so take a look at a couple among `building_class_category`, `building_class_at_present`, and `building_class_at_time_of_sale`. Maybe just check out the top 20 building classes in a sale?
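
A quick sketch with `value_counts()`, which already sorts from most to least common. The rows are made up, and picking `building_class_at_time_of_sale` over the other columns is just an assumption for the example.

```python
import pandas as pd

# Made-up rows using codes that show up in the data
df = pd.DataFrame({
    "building_class_at_time_of_sale": ["R4", "R4", "C0", "B1", "R4", "C0"],
})

# Most common classes first; head(20) keeps at most the top 20
top_classes = df["building_class_at_time_of_sale"].value_counts().head(20)
print(top_classes)
```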

While R4, C0 and B1 seem like wonderful building classes, *I have no idea what they mean*. Luckily there's [a website we can go to](https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html) that will tell us what each of them means.

But also: I don't want to have to look them up all the time.

## Making those descriptions a column in our dataset

### Use `pd.read_html` to download the columns into a dataframe.

The first few rows should look like this:
    
| |Building Code|Description|
|---|---|---|
|0|A|ONE FAMILY DWELLINGS|
|1|A0|CAPE COD|
|2|A1|TWO STORIES - DETACHED SM OR MID|
|3|A2|ONE STORY - PERMANENT LIVING QUARTER|

* **Tip:** It doesn't involve any fancy CSS selectors or anything.
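
A sketch of how `pd.read_html` behaves. It returns a *list* of dataframes, one per `<table>` it finds, so you pick the one you want by position. To keep the example runnable offline it parses a small hand-written HTML snippet (an assumption standing in for the real page); in the notebook you would pass the page's URL instead. `read_html` also needs an HTML parser such as `lxml` installed.

```python
import io
import pandas as pd

# Hand-written stand-in for the building codes page.
# In the notebook: tables = pd.read_html("<the URL of the codes page>")
html = """
<table>
  <tr><th>Building Code</th><th>Description</th></tr>
  <tr><td>A</td><td>ONE FAMILY DWELLINGS</td></tr>
  <tr><td>A0</td><td>CAPE COD</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))

# Which table holds the codes on the real page is something to check by eye
codes = tables[0]
print(codes.head())
```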

### Combine the codes dataframe with our original dataframe

Feel free to fix up the new column headers so they're lower-case with `_` instead of spaces, if you want.
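
One way the combining could look, on tiny made-up frames. Matching on `building_class_at_time_of_sale` is an assumption (the "at present" column would work the same way), and `ZZ` is a made-up code included to show what happens when there is no match.

```python
import pandas as pd

# Made-up sales; "ZZ" is a fake code with no description
df = pd.DataFrame({
    "building_class_at_time_of_sale": ["A0", "A1", "ZZ"],
    "sale_price": [450000, 0, 800000],
})

# A couple of rows shaped like the codes table
codes = pd.DataFrame({
    "Building Code": ["A0", "A1"],
    "Description": ["CAPE COD", "TWO STORIES - DETACHED SM OR MID"],
})

# Tidy the new headers: lower-case, underscores instead of spaces
codes.columns = codes.columns.str.lower().str.replace(" ", "_")

# A left merge keeps every sale, even ones whose code has no description
merged = df.merge(
    codes,
    how="left",
    left_on="building_class_at_time_of_sale",
    right_on="building_code",
)
print(merged)
```

With `how="left"` the unmatched row gets `NaN` for its description instead of silently vanishing, which makes missing codes easy to spot.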

## Let's save again, just for safety's sake

Take your merged dataset and save it as, I don't know, `merged.csv` I guess. Remember to add `index=False` so the index doesn't get saved as a column!
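
A sketch of the save, with a made-up two-row dataframe standing in for the real merged data. Reading the file straight back is a cheap way to confirm no stray index column snuck in.

```python
import pandas as pd

# Made-up stand-in for the real merged dataframe
merged = pd.DataFrame({
    "building_class_at_time_of_sale": ["A0", "A1"],
    "sale_price": [450000, 0],
    "description": ["CAPE COD", "TWO STORIES - DETACHED SM OR MID"],
})

# index=False keeps the 0, 1, 2... index out of the file
merged.to_csv("merged.csv", index=False)

# Read it back to check the columns survived as-is
check = pd.read_csv("merged.csv")
print(check.columns.tolist())
```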