# <p style="background-color: #f5df18; padding: 10px;">Looping Over Data Sets | **Variables and Assignments** </p>



### <strong>Instructor: <span style="color: darkblue;">Name (Affiliation)</span></strong>

Estimated completion time: 🕚 15 minutes


<div style="display: flex;">
    <div style="flex: 1; margin-right: 20px;">
        <h2>Questions</h2>
        <ul>
            <li>How can I process many data sets with a single command?</li>
        </ul>
    </div>
    <div style="flex: 1;">
        <h2>Learning Objectives</h2>
        <ul> 
            <li>Be able to read and write globbing expressions that match sets of files.</li>
            <li>Use glob to create lists of files.</li>
            <li>Write for loops to perform operations on files given their names in a list.</li>
        </ul>
    </div>
</div>


## Use a `for` loop to process files given a list of their names.

- A filename is a character string.
- Lists can contain character strings, so a list of filenames can drive a `for` loop.

In [None]:
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

## Use [`glob.glob`](https://docs.python.org/3/library/glob.html#glob.glob) to find sets of files whose names match a pattern.

- In Unix, the term "globbing" means "matching a set of files with a pattern".
- The most common patterns are:
  - `*` meaning "match zero or more characters"
  - `?` meaning "match exactly one character"
- Python's standard library contains the [`glob`](https://docs.python.org/3/library/glob.html) module, which provides pattern-matching functionality.
- The [`glob`](https://docs.python.org/3/library/glob.html) module contains a function, also called `glob`, that matches file patterns.
- E.g., `glob.glob('*.txt')` matches all files in the current directory
  whose names end with `.txt`.
- The result is a (possibly empty) list of character strings.
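
The two wildcards can be tried out without touching the file system. The sketch below assumes a made-up list of file names and uses the standard library's `fnmatch.filter`, which applies the same shell-style wildcard rules to a list of strings that `glob` applies to names on disk. The names in `names` are hypothetical and exist only for this illustration.

```python
import fnmatch

# Hypothetical file names, invented for this illustration.
names = ['notes.txt', 'data1.csv', 'data2.csv', 'data10.csv', 'readme.md']

# '*' matches zero or more characters.
star_matches = fnmatch.filter(names, '*.csv')
print(star_matches)

# '?' matches exactly one character, so 'data10.csv' is left out.
question_matches = fnmatch.filter(names, 'data?.csv')
print(question_matches)

# A pattern that matches nothing gives an empty list, not an error.
no_matches = fnmatch.filter(names, '*.pdf')
print(no_matches)
```

Swapping `fnmatch.filter(names, pattern)` for `glob.glob(pattern)` applies the same kind of pattern to real files instead of an in-memory list.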


In [None]:
## Use glob to print all `csv` files in 'data' directory  ###



In [None]:
### Print all `PDF` files



## Use `glob` and `for` to process batches of files.

- It helps a lot if the files are named and stored systematically and consistently,
  so that simple patterns will find the right data.
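
As a minimal sketch of the batch pattern, the example below creates a few small files in a temporary directory and then processes every file that matches a pattern. The file names (`survey_*.txt`, `notes.md`) and their contents are made up for this illustration and are not part of the lesson data. `sorted` is applied because `glob.glob` does not promise any particular order.

```python
import glob
import os
import tempfile

# Made-up file contents, keyed by a hypothetical region name.
samples = {'north': 'a\nb\n', 'south': 'a\n', 'east': 'a\nb\nc\n'}

with tempfile.TemporaryDirectory() as tmpdir:
    # Write one systematically named file per region.
    for region, text in samples.items():
        with open(os.path.join(tmpdir, f'survey_{region}.txt'), 'w') as handle:
            handle.write(text)

    # Write an unrelated file that the pattern should not match.
    with open(os.path.join(tmpdir, 'notes.md'), 'w') as handle:
        handle.write('ignore me\n')

    # Loop over every matching file and count its lines.
    line_counts = {}
    pattern = os.path.join(tmpdir, 'survey_*.txt')
    for filename in sorted(glob.glob(pattern)):
        with open(filename) as handle:
            line_counts[os.path.basename(filename)] = len(handle.readlines())

print(line_counts)
```

Because the names follow one scheme, a single pattern picks up exactly the three survey files and skips `notes.md`.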

In [None]:
### Loop over the gapminder files and print the minimum GDP per capita for the year 1952, i.e., 'gdpPercap_1952'



- A pattern that matches every gapminder file includes the file for the whole data set as well as the per-region files.
- Use a more specific pattern in the exercises to exclude the whole data set.
- But note that the minimum of the entire data set is also the minimum of one of the regional data sets,
  which is a nice check on correctness.
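
The correctness check mentioned above can be illustrated with plain Python lists. The numbers and region names below are invented stand-ins for one column read from several regional files; they are not values from the gapminder data.

```python
# Made-up values standing in for one column from three regional files.
regional = {
    'region_a': [410.5, 1200.0, 980.2],
    'region_b': [330.0, 2750.9],
    'region_c': [615.3, 505.7, 1890.1],
}

# Minimum of each regional data set.
regional_minima = {name: min(values) for name, values in regional.items()}

# Minimum of everything combined, as if all files were one data set.
combined = [value for values in regional.values() for value in values]
overall_minimum = min(combined)

# The overall minimum must equal the smallest regional minimum.
print(overall_minimum == min(regional_minima.values()))
```

If the comparison ever printed `False`, it would point to a file being skipped or read incorrectly.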

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Determining Matches </p>

---

Which of these files is *not* matched by the expression `glob.glob('data/*as*.csv')`?

1. `data/gapminder_gdp_africa.csv`
2. `data/gapminder_gdp_americas.csv`
3. `data/gapminder_gdp_asia.csv`

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Minimum File Size </p>

---

Modify this program so that it prints the number of records in
the file that has the fewest records.

```python
import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
```

Note that the [`DataFrame.shape` attribute][shape-method]
is a tuple holding the number of rows and columns of the data frame.
It is accessed without parentheses, as in `dataframe.shape[0]` above.
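
As a quick illustration of `shape`, the sketch below builds a small data frame in memory instead of reading a file. The column names and values are made up. Note that `shape` is used without parentheses, matching `dataframe.shape[0]` in the program above.

```python
import pandas as pd

# A small made-up data frame with 3 rows and 2 columns.
frame = pd.DataFrame({'country': ['A', 'B', 'C'], 'gdp': [1.0, 2.0, 3.0]})

# shape is a (rows, columns) tuple, so it can be unpacked or indexed.
n_rows, n_cols = frame.shape
print(frame.shape)
print(n_rows, n_cols)
```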

In [None]:
### your answer here ###

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Comparing Data </p>

---

Write a program that reads in the regional data sets
and plots the average GDP per capita for each region over time
in a single chart. Pandas will raise an error if it encounters
non-numeric columns in a dataframe computation, so you may need
to either filter out those columns or tell pandas to ignore them.

In [7]:
### your answer here ###

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Dealing with File Paths </p>


The [`pathlib` module][pathlib-module] provides useful abstractions for file and path manipulation, such as
returning the name of a file without its extension. This is very useful when looping over files and
directories. In the example below, we create a `Path` object and inspect its attributes.

```python
from pathlib import Path

p = Path("data/gapminder_gdp_africa.csv")
print(p.parent)
print(p.stem)
print(p.suffix)
```

```output
data
gapminder_gdp_africa
.csv
```

**Hint:** Check all available attributes and methods on the `Path` object with the `dir()`
function.
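
One way these attributes can help inside a loop is sketched below. It takes the two filenames used at the start of this lesson and derives a short label and an output name from each. The files do not need to exist for this to run, because `Path` only manipulates the text of the path. The idea of splitting the stem on `_` and the `.png` output names are assumptions made for illustration, not part of the lesson.

```python
from pathlib import Path

filenames = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

labels = []
image_names = []
for name in filenames:
    path = Path(name)
    # The stem is the file name without its extension;
    # the last '_'-separated piece is the region.
    labels.append(path.stem.split('_')[-1])
    # with_suffix swaps the extension; .name drops the directory part.
    image_names.append(path.with_suffix('.png').name)

print(labels)
print(image_names)
```

Labels derived this way could be used, for example, as legend entries when plotting one line per file.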

In [None]:
### your answer here ###

# <p style="background-color: #f5df18; padding: 10px;"> 🗝️ Key points</p>

---

- Use a `for` loop to process files given a list of their names.
- Use `glob.glob` to find sets of files whose names match a pattern.
- Use `glob` and `for` to process batches of files.
