## Make a list of all of the filenames you want to open

You _could_ do this manually, but I suggest using my favorite-named tool: **glob**! It works like this:

```python
# Get a list of all CSV files in the current directory 
# that start with "sales," e.g. sales-2020.csv, sales-2015.csv, etc
import glob
filenames = glob.glob("sales-*.csv")
```

* _**Tip:** `*` means "match anything." It's different from the `.*` we used in class, but it's the same idea._
* _**Tip:** Make sure your list includes both 2015 *and* 2019. Remember, some are `xls` and some are `xlsx`!_
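
Here's one way you might catch both extensions at once. It's only a sketch: the pattern `*brooklyn.xls*` is an assumption based on the Brooklyn filenames used later in this notebook, and the placeholder files are made up so the example runs anywhere.

```python
import glob
import os
import tempfile

# Demo only: create empty placeholder files in a temporary folder.
# These names are hypothetical stand-ins for the real downloads.
demo_dir = tempfile.mkdtemp()
for name in ["2015_brooklyn.xls", "2019_brooklyn.xlsx", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

# "xls*" matches both .xls and .xlsx because * can also match nothing at all
pattern = os.path.join(demo_dir, "*brooklyn.xls*")

# glob returns files in arbitrary order, so sort for a predictable list
demo_filenames = sorted(glob.glob(pattern))
print([os.path.basename(f) for f in demo_filenames])
```

In your own notebook you'd skip the temporary folder and glob the folder your Excel files are actually in.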

In [9]:
import glob
filenames = glob.glob('*brooklyn.xlsx') # filenames = glob.glob('*.xls')
filenames

['2018_brooklyn.xlsx',
 '2021_brooklyn.xlsx',
 '2020_brooklyn.xlsx',
 '2019_brooklyn.xlsx']

## Open one of them with pandas just to test it out. Any of them!

You'll need to use `skiprows=` to skip the first few rows, as they're informational and not actual data.

* _**Tip:** Yes, the column names are awful right now, but you'll fix them later_

In [18]:
import pandas as pd
df = pd.read_excel(filenames[0], skiprows=4)
df.head()

Unnamed: 0,BOROUGH\n,NEIGHBORHOOD\n,BUILDING CLASS CATEGORY\n,TAX CLASS AS OF FINAL ROLL 18/19,BLOCK\n,LOT\n,EASE-MENT\n,BUILDING CLASS AS OF FINAL ROLL 18/19,ADDRESS\n,APARTMENT NUMBER\n,...,RESIDENTIAL UNITS\n,COMMERCIAL UNITS\n,TOTAL UNITS\n,LAND SQUARE FEET\n,GROSS SQUARE FEET\n,YEAR BUILT\n,TAX CLASS AT TIME OF SALE\n,BUILDING CLASS AT TIME OF SALE\n,SALE PRICE\n,SALE DATE\n
0,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6360,23,,A5,8645 15TH AVENUE,,...,1.0,0.0,1.0,1547.0,1428.0,1930.0,1,A5,750000,2018-05-18
1,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6366,69,,A1,8658 BAY 16TH STREET,,...,1.0,0.0,1.0,4833.0,1724.0,1930.0,1,A1,0,2018-10-25
2,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6366,72,,A1,8664 BAY 16TH STREET,,...,1.0,0.0,1.0,4833.0,2300.0,1925.0,1,A1,1720000,2018-12-12
3,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6367,41,,S1,1728 86TH STREET,,...,1.0,1.0,2.0,1342.0,1920.0,1931.0,1,S1,1380000,2018-07-26
4,3,BATH BEACH,01 ONE FAMILY DWELLINGS,1,6371,20,,A9,75 BAY 20TH STREET,,...,1.0,0.0,1.0,2417.0,1742.0,1930.0,1,A9,710000,2018-02-21


## Now open another one.

Keep opening them with the same `.read_excel` options until you find one with bad headers. **UGH!!!** They all have different `skiprows=` values!

In [20]:
df1 = pd.read_excel(filenames[1], skiprows=6)
df1.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,APARTMENT NUMBER,...,RESIDENTIAL\nUNITS,COMMERCIAL\nUNITS,TOTAL \nUNITS,LAND \nSQUARE FEET,GROSS \nSQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS\nAT TIME OF SALE,SALE PRICE,SALE DATE
0,,,,,,,,,,,...,,,,,,,,,,NaT
1,3.0,BATH BEACH,01 ONE FAMILY DWELLINGS,1.0,6364.0,74.0,,A5,72 BAY 14TH ST.,,...,1.0,0.0,1.0,2492.0,972.0,1950.0,1.0,A5,0.0,2021-05-21
2,3.0,BATH BEACH,01 ONE FAMILY DWELLINGS,1.0,6364.0,74.0,,A5,72 BAY 14TH STREET,,...,1.0,0.0,1.0,2492.0,972.0,1950.0,1.0,A5,890000.0,2021-10-08
3,3.0,BATH BEACH,01 ONE FAMILY DWELLINGS,1.0,6367.0,24.0,,A9,8645 BAY 16 STREE,,...,1.0,0.0,1.0,1571.0,1456.0,1935.0,1.0,A9,925000.0,2021-11-03
4,3.0,BATH BEACH,01 ONE FAMILY DWELLINGS,1.0,6370.0,65.0,,A9,26 BAY 20TH STREET,,...,1.0,0.0,1.0,2417.0,1584.0,1930.0,1.0,A9,0.0,2021-09-04


## Ignoring headers

We're going to fix this by getting rid of `skiprows=` and using `header=None`. That way NONE of them will have ANY headers.

Try `header=None` on one of them.

(After we combine them all we'll update them with the right header rows.)

In [21]:
df2 = pd.read_excel(filenames[2], header=None)
df2.head()  # this looks awful, pretty sure I didn't do this right

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,BROOKLYN ANNUAL SALES FOR CALENDAR YEAR 2020,,,,,,,,,,...,,,,,,,,,,
1,All Sales From January 2020- December 2020. Pr...,,,,,,,,,,...,,,,,,,,,,
2,"For sales prior to the Final, Neighborhood Nam...",,,,,,,,,,...,,,,,,,,,,
3,"Sales after the Final Roll, Neighborhood Name ...",,,,,,,,,,...,,,,,,,,,,
4,Building Class Category is based on Building C...,,,,,,,,,,...,,,,,,,,,,


## Open them all at the same time!

Starting from your list of filenames, use a list comprehension (similar to how we did with the Excel sheets) to create a list of dataframes.

You'll probably want to cut and paste your `.read_excel` from above so that none of them come in with headers. We'll add them in later!

* _**Tip:** Make sure you have 15 years of data (aka fifteen years of dataframes)_
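
A minimal sketch of that list comprehension. The pattern `*brooklyn.xls*` and the name `dfs` are assumptions, so swap in whatever matches your files. Reading `.xls` and `.xlsx` usually needs extra packages (typically `xlrd` and `openpyxl`) installed alongside pandas.

```python
import glob
import pandas as pd

# Assumed pattern: adjust it to however your files are named
filenames = sorted(glob.glob("*brooklyn.xls*"))

# One dataframe per file, each read with no header row
dfs = [pd.read_excel(filename, header=None) for filename in filenames]

# Should match the number of years you expect
print(len(dfs))
```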

## Combine them with `pd.concat`

Confirm that you have 358,054 rows and 21 columns. If your numbers are a *little* off you probably didn't ignore headers! (In which case, go back and do that.)

Your headers should just be numbers - 0, 1, 2, 3, 4.... etc.

* _**Tip:** Be sure to `ignore_index=True`_
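
To see what `ignore_index=True` does, here's a small sketch with two made-up dataframes standing in for your per-year files. The names `df_a`, `df_b` and `combined` are just for illustration.

```python
import pandas as pd

# Toy stand-ins for the per-year dataframes (read with header=None,
# so the columns are just numbers)
df_a = pd.DataFrame([["BOROUGH", "BLOCK"], ["3", "6360"]])
df_b = pd.DataFrame([["BOROUGH", "BLOCK"], ["3", "6364"], ["3", "6367"]])

# ignore_index=True renumbers the rows from 0 up,
# instead of restarting at 0 for every file
combined = pd.concat([df_a, df_b], ignore_index=True)

print(combined.shape)  # (rows, columns)
```

With your real data you'd pass your whole list of dataframes to `pd.concat` instead of the two toy ones.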

## Add in the headers

The fourth row seems to be the headers. You can update the headers to be the info from the fourth row.

```python
df.columns = df.loc[3].tolist()
```

## Remove the notation rows from the top of the Excel sheets

We used `dropna` in class on Monday to remove rows that were missing a `Treatment Date`. Let's do the same thing here to help remove some of the garbage - it seems like we can probably rely on `NEIGHBORHOOD` or `BLOCK` missing to mean that it's a garbage row.

Confirm that you have **357,992** rows remaining.
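
Here's a rough sketch of `dropna` with `subset=`, using a tiny made-up dataframe. The first row plays the part of a notes row from the top of a sheet, and `toy` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy data: the first row stands in for a notes row from the top of a sheet
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN ANNUAL SALES", "3", "3"],
    "NEIGHBORHOOD": [None, "BATH BEACH", "BATH BEACH"],
    "BLOCK": [None, 6360, 6366],
})

# Drop any row that is missing NEIGHBORHOOD or BLOCK
cleaned = toy.dropna(subset=["NEIGHBORHOOD", "BLOCK"])

print(len(cleaned))
```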

## Clean up the data, then remove the duplicated header rows

Every Excel sheet brought in a new 'BOROUGH' and 'NEIGHBORHOOD', etc, that were supposed to be headers.

Let's look at `df.BOROUGH`. Do a `value_counts()` to see whether you notice anything unexpected.

Looks like there's all sorts of spaces or newlines – instead of `3` sometimes it's `3 ` (and probably other garbage like that). In theory we could get rid of it easily using `.str.strip()`, which removes whitespace from before/after a string.

```python
df.BOROUGH = df.BOROUGH.str.strip()
```

The trouble is that *all of the columns* probably have this problem. [This StackOverflow answer sets you up with a pretty good option,](https://stackoverflow.com/a/45270483) but it doesn't work in some edge cases. And of course our dataset is one of them! So try this out:

```python
df = df.apply(lambda col: col.astype(str).str.strip())
```

`.apply` is like a for loop for pandas - this loops through every column and runs `.str.strip()` on it.
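
If the for loop comparison feels abstract, this sketch writes the same cleanup both ways on a small made-up dataframe. The names `toy`, `with_apply` and `with_loop` are just for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "BOROUGH": ["3", "3 ", "BOROUGH\n"],
    "BLOCK": [6360, 6366, "BLOCK\n"],
})

# The .apply version: run the function once per column
with_apply = toy.apply(lambda col: col.astype(str).str.strip())

# The same thing written as an explicit loop over the columns
with_loop = toy.copy()
for column_name in with_loop.columns:
    with_loop[column_name] = with_loop[column_name].astype(str).str.strip()

# Both approaches should give identical results
print(with_apply.equals(with_loop))
```

Note that `.astype(str)` turns every value into text, including the numbers, which is why `.str.strip()` can then run on every column.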

Try your `value_counts()` again and let's see if it worked! It should look something like this:

```
3          72140
BOROUGH       15
Name: BOROUGH, dtype: int64
```

*Now* we can finally remove all of the rows where the column `df.BOROUGH` is the string `"BOROUGH"`.

Confirm you now have **357,977 rows**.
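
One way to do that filtering, sketched on a made-up dataframe where the middle row is a leftover header. `toy` and `no_headers` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "BOROUGH": ["3", "BOROUGH", "3"],
    "NEIGHBORHOOD": ["BATH BEACH", "NEIGHBORHOOD", "BATH BEACH"],
})

# Keep only the rows where BOROUGH is NOT the literal text "BOROUGH"
no_headers = toy[toy["BOROUGH"] != "BOROUGH"]

print(len(no_headers))
```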

## Save the cleaned file

It's good practice to save your cleaned data before you start your analysis. Use `.to_csv` to save the cleaned data, passing `index=False` so it doesn't save the index.
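
A small sketch of the save step on made-up data. The filename `brooklyn-sales-cleaned.csv` is an assumption, and it's written to a temporary folder here only so the example doesn't clutter your project.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"BOROUGH": ["3", "3"], "SALE PRICE": [750000, 0]})

# Hypothetical output filename, written to a temporary folder for the demo
out_path = os.path.join(tempfile.mkdtemp(), "brooklyn-sales-cleaned.csv")

# index=False leaves out the 0, 1, 2... row labels
toy.to_csv(out_path, index=False)

# Reading it back should give the same columns, with no extra "Unnamed: 0"
check = pd.read_csv(out_path)
print(check.columns.tolist())
```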