## Make a list of all of the filenames you want to open

You _could_ do this manually, but I suggest using my favorite-named tool: **glob**! It works like this:

```python
# Get a list of all CSV files in the current directory 
# that start with "sales," e.g. sales-2020.csv, sales-2015.csv, etc
import glob
filenames = glob.glob("sales-*.csv")
```

* _**Tip:** `*` means "match anything." It's different from the `.*` we used in class, but it's the same idea._
* _**Tip:** Make sure your list includes both 2015 *and* 2019. Remember, some are `xls` and some are `xlsx`!_
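
If you want to see how the wildcard behaves before pointing it at the real folder, here is a small sketch. The file names are made up and get created as empty files in a temporary folder, so nothing here touches your actual data. `glob.glob` returns matches in arbitrary order, which is why the sketch sorts them.

```python
import glob
import os
import tempfile

# Hypothetical file names, created as empty files in a throwaway folder
demo_dir = tempfile.mkdtemp()
for name in ["2015_brooklyn.xls", "2019_brooklyn.xlsx",
             "sales_2007_brooklyn.xls", "merged.csv"]:
    open(os.path.join(demo_dir, name), "w").close()

# The leading * allows any prefix; the trailing * lets "xls" match .xls and .xlsx
pattern = os.path.join(demo_dir, "*brooklyn.xls*")
matches = sorted(os.path.basename(path) for path in glob.glob(pattern))
print(matches)
```

`merged.csv` is left out because it does not contain `brooklyn.xls`.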

In [115]:
!ls

01 - Data Acquisition.ipynb          2018_brooklyn.xlsx
02 - Data Compilation.ipynb          2019_brooklyn.xlsx
03 - Data Analysis.ipynb             2020_brooklyn.xlsx
04 - Data Exploration.ipynb          2021_brooklyn.xlsx
2009_brooklyn.xls                    cleaned.csv
2010_brooklyn.xls                    merged.csv
2011_brooklyn.xls                    merged_df.csv
2012_brooklyn.xls                    nyc_annualized_sales_links_excel.rtf
2013_brooklyn.xls                    nyc_annualized_sales_links_excel.txt
2014_brooklyn.xls                    sales_2007_brooklyn.xls
2015_brooklyn.xls                    sales_2008_brooklyn.xls
2016_brooklyn.xls                    urls.txt
2017_brooklyn.xls


In [116]:
import glob

In [139]:
filenames = glob.glob("*brooklyn.xls*")
filenames[:5]

['sales_2007_brooklyn.xls',
 '2014_brooklyn.xls',
 '2013_brooklyn.xls',
 '2012_brooklyn.xls',
 '2015_brooklyn.xls']

## Open one of them with pandas just to test it out. Any of them!

You'll need to use `skiprows=` to skip the first few rows, as they're informational and not actual data.

* _**Tip:** Yes, the column names are awful right now, but you'll fix them later_
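
Here is a rough sketch of what `skiprows=` does. It uses `pd.read_csv` on a made-up in-memory string instead of a real Excel file so it runs anywhere; `read_excel` takes the same `skiprows=` argument. The contents are invented for illustration.

```python
import io
import pandas as pd

# Made-up contents: two note lines, then the real header, then data
raw = io.StringIO(
    "Note about the file\n"
    "Another note\n"
    "BOROUGH,NEIGHBORHOOD,BLOCK\n"
    "3,BATH BEACH,6364\n"
    "3,BATH BEACH,6367\n"
)

# skiprows=2 throws away the two note lines, so the third line becomes the header
demo = pd.read_csv(raw, skiprows=2)
print(demo.columns.tolist())
print(len(demo))
```

If the number is too small, a note line ends up as the header; if it is too big, real data does.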

In [140]:
import pandas as pd

df_2021 = pd.read_excel("2021_brooklyn.xlsx", skiprows=5)

In [141]:
df_2021.head()

Unnamed: 0,Note: Condominium and cooperative sales are on the unit level and understood to have a count of one.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,...,RESIDENTIAL\nUNITS,COMMERCIAL\nUNITS,TOTAL \nUNITS,LAND \nSQUARE FEET,GROSS \nSQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS\nAT TIME OF SALE,SALE PRICE,SALE DATE
1,,,,,,,,,,,...,,,,,,,,,,
2,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6364,74,,A5,72 BAY 14TH ST.,,...,1,0,1,2492,972,1950,1,A5,0,2021-05-21 00:00:00
3,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6364,74,,A5,72 BAY 14TH STREET,,...,1,0,1,2492,972,1950,1,A5,890000,2021-10-08 00:00:00
4,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6367,24,,A9,8645 BAY 16 STREE,,...,1,0,1,1571,1456,1935,1,A9,925000,2021-11-03 00:00:00


## Now open another one.

Keep opening them with the same `.read_excel` options until you find one with bad headers. **UGH!!!** They all have different `skiprows=` values!

In [142]:
df_2020 = pd.read_excel("2020_brooklyn.xlsx", skiprows=5)
df_2020.head()

Unnamed: 0,Note: Condominium and cooperative sales are on the unit level and understood to have a count of one.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,...,RESIDENTIAL\nUNITS,COMMERCIAL\nUNITS,TOTAL \nUNITS,LAND \nSQUARE FEET,GROSS \nSQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS\nAT TIME OF SALE,SALE PRICE,SALE DATE
1,,,,,,,,,,,...,,,,,,,,,,
2,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6359,70,,S1,8684 15TH AVENUE,,...,1,1,2,1933,4080,1930,1,S1,1300000,2020-04-28 00:00:00
3,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6360,48,,A5,14 BAY 10TH STREET,,...,1,0,1,2513,1428,1930,1,A5,849000,2020-03-18 00:00:00
4,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6360,56,,A5,30 BAY 10TH STREET,,...,1,0,1,1547,1428,1930,1,A5,75000,2020-11-30 00:00:00


In [143]:
df_2019 = pd.read_excel("2019_brooklyn.xlsx", skiprows=3)

In [144]:
df_2019.head()

Unnamed: 0,Building Class Category is based on Building Class at Time of Sale. Note: Condominium and cooperative sales are on the unit level and understood to have a count of one.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,BOROUGH\n,NEIGHBORHOOD\n,BUILDING CLASS CATEGORY\n,TAX CLASS AS OF FINAL ROLL 18/19,BLOCK\n,LOT\n,EASE-MENT\n,BUILDING CLASS AS OF FINAL ROLL 18/19,ADDRESS\n,APARTMENT NUMBER\n,...,RESIDENTIAL UNITS\n,COMMERCIAL UNITS\n,TOTAL UNITS\n,LAND SQUARE FEET\n,GROSS SQUARE FEET\n,YEAR BUILT\n,TAX CLASS AT TIME OF SALE\n,BUILDING CLASS AT TIME OF SALE\n,SALE PRICE\n,SALE DATE\n
1,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6363,22,,A9,8645 16TH AVENUE,,...,1,0,1,2058,1492,1930,1,A9,0,2019-04-23 00:00:00
2,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6363,48,,A9,12 BAY 13TH STREET,,...,1,0,1,3142,3200,1999,1,A9,0,2019-02-27 00:00:00
3,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6363,48,,A9,12 BAY 13TH STREET,,...,1,0,1,3142,3200,1999,1,A9,0,2019-02-11 00:00:00
4,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6364,74,,A5,72 BAY 14TH STREET,,...,1,0,1,2492,972,1950,1,A5,0,2019-08-15 00:00:00


In [145]:
df_2018 = pd.read_excel("2018_brooklyn.xlsx")

In [146]:
df_2018.head()

Unnamed: 0,"BROOKLYN ANNUALIZE SALE FOR 2018. (All Sales From January 1, 2018 - December 31, 2018)",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,Sales File as of 4/11/2019. Coop Sales Files ...,,,,,,,,,,...,,,,,,,,,,
1,Neighborhood Name and Descriptive Data is as o...,,,,,,,,,,...,,,,,,,,,,
2,Building Class Category is based on Building C...,,,,,,,,,,...,,,,,,,,,,
3,BOROUGH\n,NEIGHBORHOOD\n,BUILDING CLASS CATEGORY\n,TAX CLASS AS OF FINAL ROLL 18/19,BLOCK\n,LOT\n,EASE-MENT\n,BUILDING CLASS AS OF FINAL ROLL 18/19,ADDRESS\n,APARTMENT NUMBER\n,...,RESIDENTIAL UNITS\n,COMMERCIAL UNITS\n,TOTAL UNITS\n,LAND SQUARE FEET\n,GROSS SQUARE FEET\n,YEAR BUILT\n,TAX CLASS AT TIME OF SALE\n,BUILDING CLASS AT TIME OF SALE\n,SALE PRICE\n,SALE DATE\n
4,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6360,23,,A5,8645 15TH AVENUE,,...,1,0,1,1547,1428,1930,1,A5,750000,2018-05-18 00:00:00


In [147]:
df_2017 = pd.read_excel("2017_brooklyn.xls", header=None)
df_2016 = pd.read_excel("2016_brooklyn.xls", header=None)
df_2015 = pd.read_excel("2015_brooklyn.xls", header=None)
df_2014 = pd.read_excel("2014_brooklyn.xls", header=None)
df_2013 = pd.read_excel("2013_brooklyn.xls", header=None)
df_2012 = pd.read_excel("2012_brooklyn.xls", header=None)
df_2011 = pd.read_excel("2011_brooklyn.xls", header=None)
df_2010 = pd.read_excel("2010_brooklyn.xls", header=None)
df_2009 = pd.read_excel("2009_brooklyn.xls", header=None)
df_2008 = pd.read_excel("sales_2008_brooklyn.xls", header=None)
df_2007 = pd.read_excel("sales_2007_brooklyn.xls", header=None)

## Ignoring headers

We're going to fix this by getting rid of `skiprows=` and using `header=None`. That way NONE of them will have ANY headers.

Try `header=None` on one of them.

(After we combine them all we'll update them with the right header rows.)
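
A quick sketch of what `header=None` gives you, again using `pd.read_csv` on an invented in-memory string so it runs without the Excel files (`read_excel` accepts `header=None` the same way). Every line, including the note and the would-be header, comes in as an ordinary row, and the columns are just numbered.

```python
import io
import pandas as pd

# Made-up contents: one note line, the real header, one data row
raw = io.StringIO(
    "Note about the file,,\n"
    "BOROUGH,NEIGHBORHOOD,BLOCK\n"
    "3,BATH BEACH,6364\n"
)

# header=None: no line is treated as the header, so columns are 0, 1, 2
demo = pd.read_csv(raw, header=None)
print(demo.columns.tolist())
print(demo.shape)
```

Since every file now has the same numbered columns, they will line up when stacked, no matter how many note rows each one has.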

In [148]:
df_2018 = pd.read_excel("2018_brooklyn.xlsx", header=None)

In [149]:
df_2019 = pd.read_excel("2019_brooklyn.xlsx", header=None)
df_2020 = pd.read_excel("2020_brooklyn.xlsx", header=None)
df_2021 = pd.read_excel("2021_brooklyn.xlsx", header=None)

## Open them all at the same time!

Starting from your list of filenames, use a list comprehension (similar to how we did with the Excel sheets) to create a list of dataframes.

You'll probably want to cut and paste your `.read_excel` from above so that none of them come in with headers. We'll add them in later!

* _**Tip:** Make sure you have 15 years of data (aka fifteen years of dataframes)_
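
The pattern, sketched on throwaway data: the snippet below writes three tiny made-up CSV files to a temporary folder, globs for them, and reads each one inside a list comprehension. The file names and contents are hypothetical stand-ins for the Excel files, and `read_csv` stands in for `read_excel`.

```python
import glob
import os
import tempfile
import pandas as pd

# Write three small made-up files to a temporary folder
demo_dir = tempfile.mkdtemp()
for year in [2019, 2020, 2021]:
    with open(os.path.join(demo_dir, f"{year}_demo.csv"), "w") as f:
        f.write(f"Sales for {year},\nBOROUGH,BLOCK\n3,6364\n")

# One dataframe per filename, all read with the same options
demo_files = sorted(glob.glob(os.path.join(demo_dir, "*_demo.csv")))
demo_frames = [pd.read_csv(name, header=None) for name in demo_files]

# The length of the list is the number of files that were read
print(len(demo_frames))
```

Checking `len()` on your own list is an easy way to confirm you got all fifteen years.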

In [150]:
dataframes = [pd.read_excel(filename, header=None) for filename in filenames]
dataframes

#don't know 

[                                                      0   \
 0      Brooklyn All Sales For 2007 (January 2007 - De...   
 1      Neighborhood Name 05/06/12, Descriptive Data i...   
 2      Building Class Category is based on Building C...   
 3                                                BOROUGH   
 4                                                      3   
 ...                                                  ...   
 28642                                                  3   
 28643                                                  3   
 28644                                                  3   
 28645                                                  3   
 28646                                                  3   
 
                               1   \
 0                            NaN   
 1                            NaN   
 2                            NaN   
 3                   NEIGHBORHOOD   
 4      BATH BEACH                  
 ...                          ...   
 28642 

In [151]:
dataframes[0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,Brooklyn All Sales For 2007 (January 2007 - De...,,,,,,,,,,...,,,,,,,,,,
1,"Neighborhood Name 05/06/12, Descriptive Data i...",,,,,,,,,,...,,,,,,,,,,
2,Building Class Category is based on Building C...,,,,,,,,,,...,,,,,,,,,,
3,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
4,3,BATH BEACH,01 ONE FAMILY HOMES,1,6361,19,,A5,51 BAY 10TH STREET,,...,1,0,1,1933,1660,1930,1,A5,649000,2007-08-31 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28642,3,WYCKOFF HEIGHTS,30 WAREHOUSES,4,3199,34,,E9,40 WYCKOFF AVENUE,,...,0,1,1,4250,4250,1931,4,E9,0,2007-12-17 00:00:00
28643,3,WYCKOFF HEIGHTS,31 COMMERCIAL VACANT LAND,2B,3259,13,,C1,296 STOCKHOLM STREET,,...,8,0,8,2500,7534,2008,4,V1,360000,2007-01-05 00:00:00
28644,3,WYCKOFF HEIGHTS,31 COMMERCIAL VACANT LAND,,3328,16,,,358 GROVE STREET,,...,0,0,0,0,0,0,4,V1,301990,2007-12-12 00:00:00
28645,3,WYCKOFF HEIGHTS,41 TAX CLASS 4 - OTHER,4,3188,30,,V1,386 TROUTMAN STREET,,...,0,0,0,2500,0,0,4,Z9,0,2007-07-13 00:00:00


## Combine them with `pd.concat`

Confirm that you have 358,054 rows and 21 columns. If your numbers are a *little* off you probably didn't ignore headers! (In which case, go back and do that.)

Your headers should just be numbers - 0, 1, 2, 3, 4.... etc.

* _**Tip:** Be sure to `ignore_index=True`_
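
To see why `ignore_index=True` matters, here is a small sketch with two made-up dataframes. Without it, each dataframe keeps its own row numbers and you end up with repeated index values.

```python
import pandas as pd

# Two tiny made-up dataframes with numbered columns, like the header=None ones
first = pd.DataFrame([["3", "BATH BEACH"], ["3", "BAY RIDGE"]])
second = pd.DataFrame([["3", "BUSHWICK"]])

# ignore_index=True renumbers the rows 0..n-1 after stacking
stacked = pd.concat([first, second], ignore_index=True)
print(stacked.shape)
print(stacked.index.tolist())

# Without it, the original row numbers are kept, so 0 shows up twice
kept_index = pd.concat([first, second])
print(kept_index.index.tolist())
```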

In [152]:
# to combine a bunch of dataframes, you use pd.concat
# pd.concat([df1, df2, df3, df4], ignore_index=True)

In [153]:
df = pd.concat([df_2021, df_2020, df_2019, df_2018, df_2017, df_2016,
                df_2015, df_2014, df_2013, df_2012, df_2011, df_2010,
                df_2009, df_2008, df_2007], ignore_index=True)
df.shape

(358054, 21)

In [154]:
df = pd.concat(dataframes, ignore_index=True)
df.shape

(358054, 21)

## Add in the headers

The fourth row seems to be the headers. You can update the headers to be the info from the 4th row (index `3`, since counting starts at 0).

```python
df.columns = df.loc[3].tolist()
```
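
Here is the same move on a tiny made-up dataframe, where the header sits in row `1` instead of row `3`. Note that the row itself does not go anywhere: it just gets copied up into the column names, which is why the duplicate header rows still need removing later.

```python
import pandas as pd

# Made-up data: a note row, the would-be header row, then one data row
demo = pd.DataFrame([
    ["Note about the file", None],
    ["BOROUGH", "NEIGHBORHOOD"],
    ["3", "BATH BEACH"],
])

# Copy the values of row 1 into the column names
demo.columns = demo.loc[1].tolist()
print(demo.columns.tolist())
print(len(demo))
```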

In [155]:
df.columns = df.loc[3].tolist()
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
0,Brooklyn All Sales For 2007 (January 2007 - De...,,,,,,,,,,...,,,,,,,,,,
1,"Neighborhood Name 05/06/12, Descriptive Data i...",,,,,,,,,,...,,,,,,,,,,
2,Building Class Category is based on Building C...,,,,,,,,,,...,,,,,,,,,,
3,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
4,3,BATH BEACH,01 ONE FAMILY HOMES,1,6361,19,,A5,51 BAY 10TH STREET,,...,1,0,1,1933,1660,1930,1,A5,649000,2007-08-31 00:00:00


## Remove the notation rows from the top of the Excel sheets

We used `dropna` in class on Monday to remove rows that were missing a `Treatment Date`. Let's do the same thing here to help remove some of the garbage - it seems like we can probably rely on `NEIGHBORHOOD` or `BLOCK` missing to mean that it's a garbage row.
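
A small sketch of `dropna` with `subset=`, on invented rows. Passing a list of columns drops any row that is missing a value in at least one of them, and leaves missing values in other columns alone.

```python
import pandas as pd

# Made-up rows: a note row, a good row, and a row that is missing its block
demo = pd.DataFrame({
    "neighborhood": [None, "BATH BEACH", "BATH BEACH"],
    "block": [None, 6364, None],
    "address": ["Note about the file", "72 BAY 14TH STREET", None],
})

# Drop rows where neighborhood OR block is missing
kept = demo.dropna(subset=["neighborhood", "block"])
print(len(kept))
```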

In [156]:
df.columns = df.columns.str.lower().str.replace(" ", "_")
df.head(1)

Unnamed: 0,borough,neighborhood,building_class_category,tax_class_at_present,block,lot,ease-ment,building_class_at_present,address,apartment_number,...,residential_units,commercial_units,total_units,land_square_feet,gross_square_feet,year_built,tax_class_at_time_of_sale,building_class_at_time_of_sale,sale_price,sale_date
0,Brooklyn All Sales For 2007 (January 2007 - De...,,,,,,,,,,...,,,,,,,,,,


In [157]:
# We want to get rid of every row that is missing a neighborhood or a block
print("Before dropping", df.shape)
df = df.dropna(subset=['neighborhood', 'block'])
print("After dropping", df.shape)

Before dropping (358054, 21)
After dropping (357992, 21)


Confirm that you have **357992** rows remaining.

## Clean up the data, then remove the duplicated header rows

Every Excel sheet brought in a new 'BOROUGH' and 'NEIGHBORHOOD', etc, that were supposed to be headers.

Let's look at `df.BOROUGH`. Do a `value_counts()` to see whether you notice anything unexpected.

In [158]:
df.borough.value_counts()

3            285837
3             72140
BOROUGH\n         8
BOROUGH           7
Name: borough, dtype: int64

Looks like there's all sorts of spaces or newlines – instead of `3` sometimes it's `3 ` (and probably other garbage like that). In theory we could get rid of it easily using `.str.strip()`, which removes whitespace from before/after a string.

```python
df.BOROUGH = df.BOROUGH.str.strip()
```

The catch is that this is probably an issue in *all of the columns*. [This StackOverflow answer sets you up with a pretty good option,](https://stackoverflow.com/a/45270483) but it doesn't work in some edge cases. And of course our dataset is one of them! So try this out:

```python
df = df.apply(lambda col: col.astype(str).str.strip())
```

`.apply` is like a for loop for pandas - this loops through every column and runs `.str.strip()` on it.
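
To watch it work on something small, here is a sketch with made-up values that mix real numbers, padded strings, and header text with trailing newlines. Converting to strings first means the numbers get cleaned along with everything else.

```python
import pandas as pd

# Made-up column values: an integer, padded strings, and stray newlines
demo = pd.DataFrame({
    "borough": [3, "3 ", "BOROUGH\n", "BOROUGH"],
    "block": [6364, " 6367", "BLOCK\n", "BLOCK"],
})

# For every column: turn values into strings, then strip the whitespace
cleaned = demo.apply(lambda col: col.astype(str).str.strip())
counts = cleaned["borough"].value_counts()
print(counts.to_dict())
```

One thing to keep in mind: after this step every value is a string, so `3` is now `"3"`.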

In [159]:
df = df.apply(lambda col: col.astype(str).str.strip())
df.borough.value_counts()

3          357977
BOROUGH        15
Name: borough, dtype: int64

Try your `value_counts()` again and let's see if it worked! It should look something like this:

```
3          72140
BOROUGH       15
Name: BOROUGH, dtype: int64
```

In [160]:
# why is my result different?
# df1 = df.loc[df["Discount"] >= 1500]

*Now* we can finally remove all of the rows where the column `df.BOROUGH` is the string `"BOROUGH"`.
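
One way to do this is a boolean filter that keeps everything *except* the header text. This is a minimal sketch on made-up rows, using a lowercase `borough` column as an assumption; swap in whatever your column is actually called.

```python
import pandas as pd

# Made-up rows: two data rows with a leftover header row in between
demo = pd.DataFrame({
    "borough": ["3", "BOROUGH", "3"],
    "block": ["6364", "BLOCK", "6367"],
})

# Keep only the rows where borough is NOT the literal text "BOROUGH"
body = demo[demo["borough"] != "BOROUGH"]
print(len(body))
```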

In [161]:
# To drop rows, keep only the rows you want.
# After .astype(str) every value is a string, so compare against "BOROUGH"
# (or "3"), not the integer 3, which matches nothing.
df = df[df["borough"] != "BOROUGH"]
df.shape

(357977, 21)

Confirm you now have **357,977 rows**.

## Save the cleaned file

It's good practice to save your cleaned data before you start your analysis. Use `.to_csv` to save the cleaned data, passing `index=False` so it doesn't save the index.
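
If you are curious what `index=False` changes, this sketch writes a tiny made-up dataframe into an in-memory buffer instead of a file, so you can look at the first line of the output directly.

```python
import io
import pandas as pd

# A tiny made-up dataframe
demo = pd.DataFrame({"borough": ["3", "3"], "block": ["6364", "6367"]})

# Write to an in-memory buffer so no file is created
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# With index=False the header line is just the column names
first_line = buffer.getvalue().splitlines()[0]
print(first_line)
```

Without `index=False`, that first line would start with an extra empty column for the row numbers.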

In [162]:
df.to_csv('nychousingdata.csv', index=False)