## Make a list of all of the filenames you want to open

You _could_ do this manually, but I suggest using my favorite-named tool: **glob**! It works like this:

```python
# Get a list of all CSV files in the current directory 
# that start with "sales," e.g. sales-2020.csv, sales-2015.csv, etc
import glob
filenames = glob.glob("sales-*.csv")
```

* _**Tip:** `*` means "match anything." It's different from the `.*` we used in class, but it's the same idea._
* _**Tip:** Make sure your list includes both 2015 *and* 2019. Remember, some are `xls` and some are `xlsx`!_
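If you're wondering how one pattern can catch both `xls` and `xlsx`, here's a small sketch. The filenames are made up (hypothetical, yours will look different) and the files are created empty in a temporary folder just so the example runs anywhere.

```python
import glob
import os
import tempfile

# Hypothetical filenames, created empty just to demonstrate the pattern
with tempfile.TemporaryDirectory() as folder:
    for name in ["2015_brooklyn.xls", "2019_brooklyn.xlsx", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()

    # "*.xls*" matches anything ending in .xls, plus anything after that (like the x in .xlsx)
    matches = sorted(glob.glob(os.path.join(folder, "*.xls*")))
    matched_names = [os.path.basename(path) for path in matches]

print(matched_names)
```

The trailing `*` is what lets `.xlsx` files through; `notes.txt` is left out because it never contains `.xls`.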

## Open one of them with pandas just to test it out. Any of them!

You'll need to use `skiprows=` to skip the first few rows, as they're informational and not actual data.

* _**Tip:** Yes, the column names are awful right now, but you'll fix them later_
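Here's a rough sketch of what `skiprows=` does. To keep it runnable without a real spreadsheet, it uses `pd.read_csv` on some made-up text as a stand-in (that swap is an assumption for the demo); `pd.read_excel` accepts the same `skiprows=` argument. The number of rows to skip in your files will probably be different.

```python
import io
import pandas as pd

# Made-up stand-in for a spreadsheet: two informational rows, then headers, then data
raw = io.StringIO(
    "Rolling sales report,,\n"
    "Some note about the data,,\n"
    "BOROUGH,NEIGHBORHOOD,BLOCK\n"
    "3,BATH BEACH,6361\n"
    "3,BAY RIDGE,5885\n"
)

# Skip the two informational rows so the third row becomes the header
sample = pd.read_csv(raw, skiprows=2)
print(sample.columns.tolist())
print(len(sample))
```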

## Now open another one.

Keep opening them with the same `.read_excel` options until you find one with bad headers. **UGH!!!** They all have different `skiprows=` values!

## Ignoring headers

We're going to fix this by getting rid of `skiprows=` and using `header=None`. That way NONE of them will have ANY headers.

Try `header=None` on one of them.

(After we combine them all we'll update them with the right header rows.)
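As a minimal sketch of what `header=None` does: every row, including the informational ones and the real header row, comes in as plain data, and the columns are just numbered. This again uses `pd.read_csv` on made-up text as a stand-in for an Excel file (an assumption for the demo); `pd.read_excel` takes the same `header=` argument.

```python
import io
import pandas as pd

# Made-up stand-in for a spreadsheet with notes above the real header row
raw = io.StringIO(
    "Rolling sales report,,\n"
    "Some note about the data,,\n"
    "BOROUGH,NEIGHBORHOOD,BLOCK\n"
    "3,BATH BEACH,6361\n"
    "3,BAY RIDGE,5885\n"
)

# header=None: no row is treated as a header, so nothing needs to be skipped
no_header = pd.read_csv(raw, header=None)
print(no_header.columns.tolist())
print(no_header.shape)
```

Since nothing is skipped, it no longer matters that each file has a different number of junk rows at the top.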

## Open them all at the same time!

Starting from your list of filenames, use a list comprehension (similar to how we did with the Excel sheets) to create a list of dataframes.

You'll probably want to cut and paste your `.read_excel` from above so that none of them come in with headers. We'll add them in later!

* _**Tip:** Make sure you have 15 years of data (aka fifteen years of dataframes)_
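One way the list comprehension might look, wrapped in a small helper. The function name `read_without_headers` is hypothetical, and `filenames` is assumed to be the list you built with glob.

```python
import pandas as pd

def read_without_headers(filenames):
    """Read each Excel file with header=None and return a list of dataframes."""
    return [pd.read_excel(filename, header=None) for filename in filenames]

# Usage, assuming `filenames` is your list from the glob step:
# dataframes = read_without_headers(filenames)
# len(dataframes)  # should match the number of files you expect
```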

## Combine them with `pd.concat`

Confirm that you have 358,054 rows and 21 columns. If your numbers are a *little* off you probably didn't ignore headers! (In which case, go back and do that.)

Your headers should just be numbers: 0, 1, 2, 3, 4, etc.

* _**Tip:** Be sure to `ignore_index=True`_
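Here's a tiny sketch of why `ignore_index=True` matters, using two made-up dataframes in place of your real ones.

```python
import pandas as pd

# Two made-up dataframes standing in for two of the Excel files
first = pd.DataFrame([["3", "BATH BEACH"], ["3", "BAY RIDGE"]])
second = pd.DataFrame([["3", "BOERUM HILL"]])

# ignore_index=True renumbers the rows 0, 1, 2, ... across all of the pieces
combined = pd.concat([first, second], ignore_index=True)
print(combined.shape)
print(combined.index.tolist())

# Without it, each piece keeps its own row labels, so labels repeat
repeated = pd.concat([first, second])
print(repeated.index.tolist())
```

With repeated labels, asking for a single label like `3` would give you one row per file instead of just one row.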

## Add in the headers

The fourth row (index `3`) seems to be the headers. You can update the headers to be the info from that row.

```python
df.columns = df.loc[3].tolist()
```

## Remove the notation rows from the top of the Excel sheets

We used `dropna` in class on Monday to remove rows that were missing a `Treatment Date`. Let's do the same thing here to help remove some of the garbage - it seems like we can probably rely on `NEIGHBORHOOD` or `BLOCK` missing to mean that it's a garbage row.

Confirm that you have **357,992** rows remaining.
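A minimal sketch of the idea on made-up data: `dropna` with `subset=` only looks at the columns you name, so a row is dropped when `NEIGHBORHOOD` is missing, no matter what is in the other columns. Which column you rely on for the real data is up to you.

```python
import pandas as pd

# Made-up data: the first row is a leftover note with no neighborhood or block
demo = pd.DataFrame({
    "BOROUGH": ["Some note about the data", "3", "3"],
    "NEIGHBORHOOD": [None, "BATH BEACH", "BAY RIDGE"],
    "BLOCK": [None, 6361, 5885],
})

# Drop rows where NEIGHBORHOOD is missing
demo = demo.dropna(subset=["NEIGHBORHOOD"])
print(len(demo))
```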

## Clean up the data, then remove the duplicated header rows

Every Excel sheet brought in a new 'BOROUGH' and 'NEIGHBORHOOD', etc, that were supposed to be headers.

Let's look at `df.BOROUGH`. Do a `value_counts()` to see whether you notice anything unexpected.

Looks like there's all sorts of spaces or newlines – instead of `3` sometimes it's `3 ` (and probably other garbage like that). In theory we could get rid of it easily using `.str.strip()`, which removes whitespace from before/after a string.

```python
df.BOROUGH = df.BOROUGH.str.strip()
```

The catch is that this is probably an issue in *all of the columns*. [This StackOverflow answer sets you up with a pretty good option,](https://stackoverflow.com/a/45270483) but it doesn't work in some edge cases. And of course our dataset is one of them! So try this out:

```python
df = df.apply(lambda col: col.astype(str).str.strip())
```

`.apply` is like a for loop for pandas - this loops through every column and runs `.str.strip()` on it.
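To make the "for loop" comparison concrete, here's a sketch on made-up data that runs the `.apply` version next to a hand-written loop and checks that they give the same result.

```python
import pandas as pd

# Made-up data with stray spaces in one column and numbers in another
messy = pd.DataFrame({"BOROUGH": ["3 ", " 3"], "BLOCK": [6361, 5885]})

# The .apply version: runs the lambda once per column
stripped = messy.apply(lambda col: col.astype(str).str.strip())

# The same thing written as an explicit loop over the columns
looped = messy.copy()
for name in looped.columns:
    looped[name] = looped[name].astype(str).str.strip()

print(stripped.equals(looped))
print(stripped["BOROUGH"].tolist())
```

Note that `.astype(str)` turns *every* column into strings, so numeric columns like `BLOCK` become text too.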

Try your `value_counts()` again and let's see if it worked! It should look something like this:

```
3          72140
BOROUGH       15
Name: BOROUGH, dtype: int64
```

*Now* we can finally remove all of the rows where the column `df.BOROUGH` is the string `"BOROUGH"`.

Confirm you now have **357,977 rows**.
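One way to do that kind of removal is a boolean filter that keeps everything *except* the leftover header rows. This is a sketch on made-up data, not the only way to do it.

```python
import pandas as pd

# Made-up data: the middle row is a header row that came along with a file
sales = pd.DataFrame({
    "BOROUGH": ["3", "BOROUGH", "3"],
    "NEIGHBORHOOD": ["BATH BEACH", "NEIGHBORHOOD", "BAY RIDGE"],
})

# Keep only the rows where BOROUGH is not the literal string "BOROUGH"
sales = sales[sales["BOROUGH"] != "BOROUGH"]
print(len(sales))
```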

## Save the cleaned file

It's good practice to save your cleaned data before you start your analysis. Use `.to_csv` to save the cleaned data, passing `index=False` so it doesn't save the index.
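Here's a small sketch of what `index=False` changes. It writes to an in-memory buffer instead of a real file so you can see the output; with a real file you'd pass a filename instead (any filename you pick is up to you).

```python
import io
import pandas as pd

# Made-up cleaned data
cleaned = pd.DataFrame({"BOROUGH": ["3", "3"], "BLOCK": ["6361", "5885"]})

# index=False leaves out the 0, 1, 2, ... row labels
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Without `index=False`, the saved file gets an extra unnamed first column holding the row numbers, which you'd have to deal with when you read it back in.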