## Use a `for` loop to process files given a list of their names.

* A filename is just a character string.
* And lists can contain character strings.

In [1]:
import pandas
for filename in ['../data/gapminder_gdp_africa.csv', '../data/gapminder_gdp_asia.csv']:
    data = pandas.read_csv(filename, index_col='country')
    print(filename, data.min())

../data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
../data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64


## Use `glob.glob` to find sets of files whose names match a pattern.

* In Unix, the term "globbing" means "matching a set of files with a pattern".
* The most common patterns are:
    * `*` meaning "match zero or more characters"
    * `?` meaning "match exactly one character"
* Python's standard library includes the `glob` module, which provides pattern matching functionality.
* The `glob` module contains a function, also called `glob`, that matches file patterns.
* E.g., `glob.glob('*.txt')` matches all files in the current directory whose names end with `.txt`.
* The result is a (possibly empty) list of character strings.
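
A minimal sketch of how the two wildcards behave, run against a throwaway directory so that it does not depend on the lesson's data files. The file names are hypothetical and exist only for illustration. Results are sorted because `glob.glob` does not promise any particular order.

```python
import glob
import os
import tempfile

# Hypothetical file names, created in a temporary directory for illustration.
names = ['a.txt', 'ab.txt', 'abc.txt', 'notes.csv']

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        # Create an empty file with this name.
        open(os.path.join(tmp, name), 'w').close()

    def match(pattern):
        # Return the sorted base names of the files matching the pattern.
        # glob.escape protects any special characters in the directory path.
        paths = glob.glob(os.path.join(glob.escape(tmp), pattern))
        return sorted(os.path.basename(p) for p in paths)

    star = match('a*.txt')       # '*' matches zero or more characters
    question = match('a?.txt')   # '?' matches exactly one character
    missing = match('*.pdb')     # a pattern with no matches gives an empty list

print(star)      # ['a.txt', 'ab.txt', 'abc.txt']
print(question)  # ['ab.txt']
print(missing)   # []
```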

In [3]:
import glob
print('all csv files in data directory:', glob.glob('../data/*.csv'))

all csv files in data directory: ['../data\\asia_gdp_per_capita.csv', '../data\\gapminder_all.csv', '../data\\gapminder_gdp_africa.csv', '../data\\gapminder_gdp_americas.csv', '../data\\gapminder_gdp_asia.csv', '../data\\gapminder_gdp_europe.csv', '../data\\gapminder_gdp_oceania.csv']


In [4]:
print('all PDB files:', glob.glob('*.pdb'))

all PDB files: []


## Use `glob` and `for` to process batches of files.

* It helps a lot if the files are named and stored systematically and consistently, so that simple patterns will find the right data.

In [5]:
for filename in glob.glob('../data/gapminder_*.csv'):
    data = pandas.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

../data\gapminder_all.csv 298.8462121
../data\gapminder_gdp_africa.csv 298.8462121
../data\gapminder_gdp_americas.csv 1397.7171369999999
../data\gapminder_gdp_asia.csv 331.0
../data\gapminder_gdp_europe.csv 973.5331947999999
../data\gapminder_gdp_oceania.csv 10039.595640000001


* The pattern matches `gapminder_all.csv`, which holds the whole data set, as well as the per-region files.
* Use a more specific pattern in the exercises to exclude the whole data set.
* But note that the minimum of the entire data set is also the minimum of one of the regional data sets, which is a nice check on correctness.
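
The correctness check in the last bullet can be sketched with two tiny in-memory tables instead of the real files. The region names and values below are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas

# Hypothetical miniature "regional" tables; the values are invented.
regions = {
    'north': 'country,gdpPercap_1952\nA,500.0\nB,300.0\n',
    'south': 'country,gdpPercap_1952\nC,250.0\nD,400.0\n',
}

minima = {}
frames = []
for name, text in regions.items():
    data = pandas.read_csv(io.StringIO(text), index_col='country')
    frames.append(data)
    # Minimum of one region, converted to a plain Python float.
    minima[name] = float(data['gdpPercap_1952'].min())

# The concatenated table plays the role of the whole data set.
combined = pandas.concat(frames)
overall = float(combined['gdpPercap_1952'].min())

# The overall minimum should equal the smallest of the regional minima.
print(minima)
print(overall, overall == min(minima.values()))
```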

## Questions

#### Q1: Determining Matches

Which of these files is not matched by the expression `glob.glob('data/*as*.csv')`?

1. `data/gapminder_gdp_africa.csv`
2. `data/gapminder_gdp_americas.csv`
3. `data/gapminder_gdp_asia.csv`
4. 1 and 2 are not matched.


#### [Answer](#answer_key)

#### Q2: Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.

In [None]:
import glob
import pandas
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pandas.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')

Notice that the `shape` attribute of a data frame is a tuple holding its number of rows and its number of columns.

#### [Answer](#answer_key)

#### Q3: Comparing Data

Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.

#### [Answer](#answer_key)

******

## <a id='answer_key'> Answers </a>

#### Q1: Determining Matches

1 is not matched by the glob: `data/gapminder_gdp_africa.csv` does not contain the characters `as`.
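
One way to check this without any files on disk is the standard library's `fnmatch` module, which applies the same wildcard rules to plain strings. This is only a sketch: unlike `glob.glob`, `fnmatch` lets `*` match `/` as well, which makes no difference for these three names.

```python
import fnmatch

pattern = 'data/*as*.csv'
candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]

for name in candidates:
    # fnmatchcase compares the whole string against the pattern, case-sensitively.
    print(name, fnmatch.fnmatchcase(name, pattern))

# Keep only the names that the pattern matches.
matched = [name for name in candidates if fnmatch.fnmatchcase(name, pattern)]
```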

#### Q2: Minimum File Size

In [None]:
import glob
import pandas
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pandas.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')
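
The reason for starting from `float('Inf')` can be seen in isolation: infinity is larger than any real count, so the first value always replaces it. The sketch below uses a hypothetical list of row counts in place of the files.

```python
# Hypothetical numbers of records, standing in for dataframe.shape[0] of each file.
record_counts = [52, 25, 33, 30, 2]

fewest = float('inf')  # larger than any possible count
for count in record_counts:
    # Keep whichever is smaller: the best so far or the current count.
    fewest = min(fewest, count)

print('smallest file has', fewest, 'records')
```

If no files match the pattern, the loop body never runs and `fewest` stays infinite, which is a visible sign that something went wrong.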

#### Q3: Comparing Data

In [None]:
import glob
import pandas 
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
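    # use the country names as the index so that only numeric columns are averaged
    dataframe = pandas.read_csv(filename, index_col='country')
    # extract region from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'
    region = filename.rpartition('_')[2][:-4]
    # plot the mean GDP per capita of each year for this region
    dataframe.mean().plot(ax=ax, label=region)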
plt.legend()
plt.show()
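
The region extraction in the loop can be traced step by step on a single sample path. This is a small sketch that assumes the file name follows the `gapminder_gdp_<region>.csv` format.

```python
filename = 'data/gapminder_gdp_asia.csv'   # sample path in the expected format

# rpartition splits at the LAST underscore into (before, separator, after).
before, separator, after = filename.rpartition('_')
print(before)   # data/gapminder_gdp
print(after)    # asia.csv

# Dropping the last four characters removes the '.csv' extension.
region = after[:-4]
print(region)   # asia
```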