# Looping Over Data Files

---

[Watch a walk-through of this lesson on YouTube](https://youtu.be/zcg28Qz0ToY)



## Questions:
- How can I efficiently read in many data sets from different files?
- How can I combine data from different files into one pandas DataFrame?

## Learning Objectives:
- Be able to write "globbing" expressions that match sets of files
- Use `glob` to create lists of files
- Write `for` loops to perform operations on many files 
- Write list comprehensions to perform operations on many files
- Combine pandas DataFrames

---

## Use a `for` loop to process files given a list of their names

We can use a `for` loop to read in a set of data files and perform the same operation on each one. In this case, we'll print the minimum value of each column in each file:

~~~python
import pandas as pd

data_files = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

for filename in data_files:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
~~~

data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165876
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64


## Use [`glob.glob`](https://docs.python.org/3/library/glob.html#glob.glob) to find sets of files whose names match a pattern.

*   In Unix, the term ***globbing*** means matching a set of files with a pattern.
*   The most common patterns are:
    *   `*` meaning match zero or more characters
    *   `?` meaning match exactly one character
*   Python's standard library contains the [`glob`](https://docs.python.org/3/library/glob.html) module to provide pattern matching functionality
*   The [`glob`](https://docs.python.org/3/library/glob.html) module contains a function also called `glob` to match file patterns
*   E.g., `glob.glob('*.txt')` matches all files in the current directory 
    whose names end with `.txt`.
*   Result is a list of strings.

~~~python
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

~~~

In [38]:
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

# in glob: * means match any text

all csv files in data directory: ['data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_europe.csv', 'data/gapminder_all.csv', 'data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv', 'data/s2.csv', 'data/s3.csv', 'data/s1.csv', 'data/gapminder_life_expectancy_years.csv', 'data/gapminder_gdp_asia.csv']
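
To see how `*` and `?` differ, the sketch below builds a throwaway directory containing a few empty files and globs it. The file names (`s1.csv`, `s10.csv`, `notes.txt`) and the temporary directory are made up for illustration and are not part of the lesson's `data` folder. Because `glob.glob` returns matches in arbitrary order, the results are sorted before printing.

```python
import glob
import os
import tempfile

# Hypothetical file names, created empty in a temporary directory
names = ['s1.csv', 's2.csv', 's10.csv', 'notes.txt']

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        open(os.path.join(tmp, name), 'w').close()

    # `*` matches zero or more characters, so s10.csv is included
    star = sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(tmp, 's*.csv')))

    # `?` matches exactly one character, so s10.csv is excluded
    question = sorted(os.path.basename(p)
                      for p in glob.glob(os.path.join(tmp, 's?.csv')))

print(star)      # ['s1.csv', 's10.csv', 's2.csv']
print(question)  # ['s1.csv', 's2.csv']
```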


## Use `glob` and `for` to process batches of files.

It's good practice to name your files systematically. As you've learned, Python is very precise about things like capitalization, so if your file names are inconsistent (e.g., `Gapminder_Europe.csv`, `gapminder_americas.csv`, `gapminder_Oceania.csv`), then it is harder to write code with `glob` that works correctly. 
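
As a rough illustration of the problem, the sketch below tests the three inconsistent names from the paragraph above against one pattern. It uses `fnmatch.fnmatchcase` from the standard library, which applies the same wildcard rules to plain strings and always compares case-sensitively; whether `glob.glob` itself ignores case can depend on the operating system. The pattern `gapminder_*.csv` is chosen here only for the example.

```python
from fnmatch import fnmatchcase

# The inconsistent file names mentioned in the text
names = ['Gapminder_Europe.csv', 'gapminder_americas.csv', 'gapminder_Oceania.csv']

# Example pattern (an assumption for this sketch)
pattern = 'gapminder_*.csv'

matched = [name for name in names if fnmatchcase(name, pattern)]
missed = [name for name in names if not fnmatchcase(name, pattern)]

print(matched)  # ['gapminder_americas.csv', 'gapminder_Oceania.csv']
print(missed)   # ['Gapminder_Europe.csv']
```

The capital `G` is enough for one file to be skipped silently, which is why consistent naming matters.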

For the Gapminder data, fortunately the file names are quite systematic and consistent (as are the names of the columns inside each file), so we can use the following to read in each one and print the minimum GDP from 1952:

~~~python
for filename in glob.glob('data/gapminder_gdp*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
~~~

In [39]:
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    try:
        print(filename, data['gdpPercap_1952'].min())
    except KeyError:
        print(f"'gdpPercap_1952' not found in {filename}")

# The YouTube walk-through uses 'data/gapminder_*.csv', while the lesson text here uses 'data/gapminder_gdp*.csv'.
# Aaron did not get a KeyError in the video, so his data folder presumably did not contain data/gapminder_life_expectancy_years.csv.
# Here that file has no 'gdpPercap_1952' column, so I used a try-except block to catch the resulting KeyError.

data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_oceania.csv 10039.59564
data/gapminder_gdp_africa.csv 298.8462121
'gdpPercap_1952' not found in data/gapminder_life_expectancy_years.csv
data/gapminder_gdp_asia.csv 331.0
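
An alternative to catching the `KeyError` is to check for the column before using it. The following is a minimal sketch under stated assumptions: the two "files" are made-up CSV text held in memory with `io.StringIO` (their names and values are invented), so it runs without the lesson's `data` folder.

```python
import io
import pandas as pd

# Hypothetical miniature files, held in memory instead of on disk
files = {
    'gdp_example.csv': 'country,gdpPercap_1952\nA,298.8\nB,331.0\n',
    'life_example.csv': 'country,1952\nA,40.1\nB,52.3\n',
}

column = 'gdpPercap_1952'
minimums = {}

for filename, text in files.items():
    data = pd.read_csv(io.StringIO(text))
    if column in data.columns:
        # Only compute the minimum when the column is present
        minimums[filename] = data[column].min()
        print(filename, minimums[filename])
    else:
        print(f"'{column}' not found in {filename}")
```

Either approach works; checking first makes the expected case explicit, while `try`/`except` keeps the loop body shorter.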


## Appending Files to a Single DataFrame

Often we don't just want to open a file and extract a small bit of data (such as the minimum value in examples above). Rather, we might want to open a set of related data files and combine them into one big DataFrame. For example, in psychology and neuroscience most experiments involve multiple participants. For each participant, when we run the experiment we get a data file. To analyze the data across participants, we would want to read in all participants' data files and combine them into one DataFrame.

pandas has a few functions and methods that allow us to combine DataFrames, including:
- [`.concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
- [`.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=merge#)
- [`.append()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append) (deprecated in pandas 1.4 and removed in pandas 2.0; use `.concat()` instead)
- [`.join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join)

We will focus here on the first one. `concat` stands for "concatenate" which essentially means combine files by "stacking" them. That is, start with one DataFrame, and add a new data frame to the bottom of it, creating additional rows. In what we'll do here, we assume that all of the data files we're reading have the same columns. For example, in the Gapminder GDP data sets, each file has a column for `country` plus a series of columns for GDP in different years — and the same years are in the columns of all the data sets. 
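
A minimal sketch of this "stacking" behaviour, using two tiny made-up tables (the country labels and GDP values are placeholders, not Gapminder data):

```python
import pandas as pd

# Two invented tables with identical columns
top = pd.DataFrame({'country': ['Country A', 'Country B'],
                    'gdpPercap_1952': [1000.0, 2000.0]})
bottom = pd.DataFrame({'country': ['Country C'],
                       'gdpPercap_1952': [3000.0]})

# Stack `bottom` underneath `top`: rows are added, columns stay the same
stacked = pd.concat([top, bottom])

print(stacked.shape)          # (3, 2)
print(list(stacked.columns))  # ['country', 'gdpPercap_1952']
```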

### Reading in data from multiple experimental participants

Let's say we have data from an experiment in which we ran three human participants (sometimes called "people") on different days. For each participant, we have a data file. The columns in all the files are the same, because the files were generated by a computer program that ran the experiment.

We give the participants anonymized ID codes to protect their privacy, and allow for a simple, systematic naming convention for the files. The first participant's data is saved in a file called `s1.csv`, the second's in `s2.csv`, etc.

We can glob the data folder in which the files are stored, to find all the CSV files whose names start with an `s` followed by a single character, followed by `.csv`. We'll save the result to a list that we can loop through later:

~~~python
filenames = glob.glob('data/s?.csv')
~~~

In [45]:
filenames = glob.glob('data/s?.csv')
filenames = sorted(filenames)
# ? works for only 1 character. use ?? for two characters. file names would need to be named s01.csv, s02.csv, etc.
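
The note about `?` can be checked without touching the file system. This sketch uses `fnmatch.fnmatchcase`, which applies the same wildcard rules to plain strings; the names `s10.csv` and `s01.csv` are hypothetical and not among the lesson's files. It also shows why zero-padded names are convenient: `sorted()` orders strings character by character, not numerically.

```python
from fnmatch import fnmatchcase

# Hypothetical participant file names
names = ['s1.csv', 's2.csv', 's10.csv', 's01.csv']

# One `?` matches exactly one character after the `s`
one_char = [n for n in names if fnmatchcase(n, 's?.csv')]

# Two `?` match exactly two characters after the `s`
two_char = [n for n in names if fnmatchcase(n, 's??.csv')]

print(one_char)  # ['s1.csv', 's2.csv']
print(two_char)  # ['s10.csv', 's01.csv']

# String sorting puts s10 before s2, which zero padding (s02, s10) avoids
print(sorted(['s1.csv', 's2.csv', 's10.csv']))  # ['s1.csv', 's10.csv', 's2.csv']
```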

Next, we create an empty list that we will store the DataFrames from each participant in. It will end up being a list of DataFrames (remember, lists can contain just about any other Python data type), and once we have read in all the data, we will combine them into one DataFrame. This is a trick that's important to use in pandas. The reason has to do with how pandas combines DataFrames and stores them in memory. In simple terms, each time we concatenate DataFrames, pandas does a lot of internal checking to make sure there are no errors. Doing this checking once, when combining many DataFrames, is far more efficient (and thus faster) than doing it many times. Likewise, when a DataFrame is created, an appropriate amount of memory space is allocated for it on the computer. Each time we append additional data, we have to create a new, bigger block of memory. Allocating new blocks of memory, many times, takes more time than just doing it once.

~~~python
df_list = []
~~~
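
To make the "collect first, concatenate once" idea concrete, here is a small sketch that contrasts it with concatenating inside the loop. The three tables are invented stand-ins for participant files. Both approaches produce the same rows; the difference is only in how much copying pandas has to do along the way.

```python
import pandas as pd

# Three small made-up tables standing in for per-participant files
frames = [
    pd.DataFrame({'trial': [1, 2], 'RT': [0.1 * i, 0.2 * i]})
    for i in range(1, 4)
]

# Pattern to avoid: concatenating inside the loop copies the growing
# DataFrame on every pass
grown = frames[0]
for frame in frames[1:]:
    grown = pd.concat([grown, frame])

# Pattern used in this lesson: collect in a list, concatenate once
combined = pd.concat(frames)

print(grown.shape, combined.shape)                      # (6, 2) (6, 2)
print(grown['RT'].tolist() == combined['RT'].tolist())  # True
```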

Finally, use a `for` loop to read the files in. This will cycle through the items in the `filenames` list; each time through the loop, `f` has the value of the current file name, and we use the list `append()` method to add the data from that file to `df_list`:

~~~python
for f in filenames:
    df_list.append(pd.read_csv(f))
~~~

When we view the contents of the list, we see each data set, with its three columns (with headers saying what they are), and commas separating the list entries, as is typical of a list.

~~~python
df_list
~~~

[  participantID  trial        RT
 0            s1      1  0.508971
 1            s1      2  0.389858
 2            s1      3  0.404175
 3            s1      4  0.269520
 4            s1      5  0.437765
 5            s1      6  0.368142
 6            s1      7  0.400544
 7            s1      8  0.335198
 8            s1      9  0.341722
 9            s1     10  0.439583,
   participantID  trial        RT
 0            s2      1  0.433094
 1            s2      2  0.392526
 2            s2      3  0.396831
 3            s2      4  0.417988
 4            s2      5  0.371810
 5            s2      6  0.659228
 6            s2      7  0.411051
 7            s2      8  0.409580
 8            s2      9  0.486828
 9            s2     10  0.468912,
   participantID  trial        RT
 0            s3      1  0.322099
 1            s3      2  0.396106
 2            s3      3  0.384297
 3            s3      4  0.364524
 4            s3      5  0.454075
 5            s3      6  0.494156
 6          

## Reading multiple files using list comprehension

While the `for` loop above works fine, there is an alternative way to do this, using [**list comprehension**](https://neuraldatascience.io/3/for-loops.html#list-comprehension). Recall that list comprehensions are basically just a compact version of a `for` loop, but they have some advantages:
- they are *more pythonic*: they only require one line of code, whereas the `for` loop above required two
- they are *more efficient*: list comprehensions typically run somewhat faster than an equivalent `for` loop that calls `.append()`. This may not be an issue in the small examples here, but can make a difference when working with real, large data sets

~~~python
df_list = [pd.read_csv(f) for f in filenames]
df_list
~~~

[  participantID  trial        RT
 0            s1      1  0.508971
 1            s1      2  0.389858
 2            s1      3  0.404175
 3            s1      4  0.269520
 4            s1      5  0.437765
 5            s1      6  0.368142
 6            s1      7  0.400544
 7            s1      8  0.335198
 8            s1      9  0.341722
 9            s1     10  0.439583,
   participantID  trial        RT
 0            s2      1  0.433094
 1            s2      2  0.392526
 2            s2      3  0.396831
 3            s2      4  0.417988
 4            s2      5  0.371810
 5            s2      6  0.659228
 6            s2      7  0.411051
 7            s2      8  0.409580
 8            s2      9  0.486828
 9            s2     10  0.468912,
   participantID  trial        RT
 0            s3      1  0.322099
 1            s3      2  0.396106
 2            s3      3  0.384297
 3            s3      4  0.364524
 4            s3      5  0.454075
 5            s3      6  0.494156
 6          
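
As a quick check that the two approaches are interchangeable, the sketch below reads the same inputs with a `for` loop and with a list comprehension and compares the results. The CSV text is invented and read from memory with `io.StringIO`, so the example does not depend on the lesson's data files.

```python
import io
import pandas as pd

# Made-up CSV text standing in for files on disk
csv_texts = [
    'participantID,trial,RT\ns1,1,0.50\ns1,2,0.40\n',
    'participantID,trial,RT\ns2,1,0.45\ns2,2,0.35\n',
]

# for loop with append
loop_list = []
for text in csv_texts:
    loop_list.append(pd.read_csv(io.StringIO(text)))

# Equivalent list comprehension
comp_list = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# Compare the DataFrames pairwise
same = all(a.equals(b) for a, b in zip(loop_list, comp_list))
print(len(loop_list), len(comp_list), same)  # 2 2 True
```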

## Combining DataFrames

At this point, we've read each input file in and stored it as a DataFrame, but we have a list of three distinct DataFrames. In most cases, we'll want to combine these in some way. Having built our list of DataFrames through reading a set of files, we can combine them into a single DataFrame using the pandas `pd.concat()` function:

~~~python
df = pd.concat(df_list)
~~~

Confirm this worked by viewing a random sample of rows:

~~~python
df.sample(8)
~~~

| | participantID | trial | RT |
|---|---|---|---|
| 4 | s3 | 5 | 0.454075 |
| 4 | s2 | 5 | 0.37181 |
| 5 | s3 | 6 | 0.494156 |
| 7 | s2 | 8 | 0.40958 |
| 1 | s3 | 2 | 0.396106 |
| 9 | s2 | 10 | 0.468912 |
| 8 | s2 | 9 | 0.486828 |
| 6 | s2 | 7 | 0.411051 |
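
The sample above shows the same row label (for example `4`) appearing for more than one participant, because `pd.concat()` keeps each file's original row labels by default. If unique row numbers are preferred, one option is the `ignore_index=True` argument of `pd.concat()`. The sketch below uses two tiny invented tables rather than the lesson's files.

```python
import pandas as pd

# Invented two-trial tables for two participants
s1 = pd.DataFrame({'participantID': ['s1', 's1'],
                   'trial': [1, 2],
                   'RT': [0.50, 0.40]})
s2 = pd.DataFrame({'participantID': ['s2', 's2'],
                   'trial': [1, 2],
                   'RT': [0.45, 0.35]})

# Default: each table keeps its own row labels, so labels repeat
stacked = pd.concat([s1, s2])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 in the combined table
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```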


## Setting the index column

Recall that row labels in pandas are called *indexes*. We can convert any column to an index using the `.set_index()` method. For this data, an appropriate index is the participant ID, which is in the column `participantID`. Note that we need to assign the result of the `.set_index()` operation back to `df` for the change to be stored:

~~~python
df = df.set_index('participantID')
df.sample(8)
~~~

| participantID | trial | RT |
|---|---|---|
| s3 | 5 | 0.454075 |
| s2 | 3 | 0.396831 |
| s3 | 9 | 0.340722 |
| s1 | 7 | 0.400544 |
| s3 | 3 | 0.384297 |
| s2 | 9 | 0.486828 |
| s3 | 8 | 0.506836 |
| s1 | 4 | 0.26952 |
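
One practical benefit of using the participant ID as the index is that all rows for a participant can be selected by label with `.loc`. The following is a small sketch with invented reaction times, not the lesson's data.

```python
import pandas as pd

# Invented data for two participants, two trials each
df = pd.DataFrame({'participantID': ['s1', 's1', 's2', 's2'],
                   'trial': [1, 2, 1, 2],
                   'RT': [0.50, 0.40, 0.45, 0.35]})

# set_index returns a new DataFrame, so assign the result back
df = df.set_index('participantID')

# Select every row whose index label is 's2'
s2_rows = df.loc['s2']
print(s2_rows)

# Summaries can then be computed per participant
print(s2_rows['RT'].mean())
```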


---
# Exercises
## Determining Matches

Which of these files is *not* matched by the expression `glob.glob('data/*as*.csv')`?

1. `data/gapminder_gdp_africa.csv`
2. `data/gapminder_gdp_americas.csv`
3. `data/gapminder_gdp_asia.csv`

```{admonition} Click the button to reveal the answer
:class: dropdown

1 is not matched. The string `as` occurs in both americ**as** and **as**ia

```

In [60]:
# The string 'as' occurs in both americ**as** and **as**ia.

## Globbing files

Fill in the blanks so that the code below does the following: 
- Find all of the CSV files in the data folder that contain GDP data
- Read these files in using a `for` loop
- Concatenate the data files into a single pandas DataFrame
- Print out the first 10 lines of the final combined DataFrame

*Note* that not all the Gapminder data files contain GDP data, but the file names will indicate which ones do. 

In [22]:
import glob
import pandas as pd

# List of data files
data_files = glob.glob('data/*gdp*.csv')

# Create an empty list to hold the DataFrames
df_list = []

for f in data_files:
    df_list.append(pd.read_csv(f))
    
df = pd.concat(df_list)

df.head(10)

Unnamed: 0,continent,country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
0,Americas,Argentina,5911.315053,6856.856212,7133.166023,8052.953021,9443.038526,10079.02674,8997.897412,9139.671389,9308.41871,10967.28195,8797.640716,12779.37964
1,Americas,Bolivia,2677.326347,2127.686326,2180.972546,2586.886053,2980.331339,3548.097832,3156.510452,2753.69149,2961.699694,3326.143191,3413.26269,3822.137084
2,Americas,Brazil,2108.944355,2487.365989,3336.585802,3429.864357,4985.711467,6660.118654,7030.835878,7807.095818,6950.283021,7957.980824,8131.212843,9065.800825
3,Americas,Canada,11367.16112,12489.95006,13462.48555,16076.58803,18970.57086,22090.88306,22898.79214,26626.51503,26342.88426,28954.92589,33328.96507,36319.23501
4,Americas,Chile,3939.978789,4315.622723,4519.094331,5106.654313,5494.024437,4756.763836,5095.665738,5547.063754,7596.125964,10118.05318,10778.78385,13171.63885
5,Americas,Colombia,2144.115096,2323.805581,2492.351109,2678.729839,3264.660041,3815.80787,4397.575659,4903.2191,5444.648617,6117.361746,5755.259962,7006.580419
6,Americas,Costa Rica,2627.009471,2990.010802,3460.937025,4161.727834,5118.146939,5926.876967,5262.734751,5629.915318,6160.416317,6677.045314,7723.447195,9645.06142
7,Americas,Cuba,5586.53878,6092.174359,5180.75591,5690.268015,5305.445256,6380.494966,7316.918107,7532.924763,5592.843963,5431.990415,6340.646683,8948.102923
8,Americas,Dominican Republic,1397.717137,1544.402995,1662.137359,1653.723003,2189.874499,2681.9889,2861.092386,2899.842175,3044.214214,3614.101285,4563.808154,6025.374752
9,Americas,Ecuador,3522.110717,3780.546651,4086.114078,4579.074215,5280.99471,6679.62326,7213.791267,6481.776993,7103.702595,7429.455877,5773.044512,6873.262326


In [40]:
import pandas as pd
import glob

data_files = glob.glob('data/*gdp*.csv')

data = []

# Iterate over the data files
for file in data_files:
    continent = file.split('_')[2].split('.')[0] # Extract the continent from the filename
    df = pd.read_csv(file, index_col='country')
    df['continent'] = continent
    data.append(df)

df = pd.concat(data)

df.sample(10)


country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Somalia,africa,1135.749842,1258.147413,1369.488336,1284.73318,1254.576127,1450.992513,1176.807031,1093.244963,926.960296,930.596428,882.081822,926.141068
Puerto Rico,americas,3081.959785,3907.156189,5108.34463,6929.277714,9123.041742,9770.524921,10330.98915,12281.34191,14641.58711,16999.4333,18855.60618,19328.70901
Bolivia,americas,2677.326347,2127.686326,2180.972546,2586.886053,2980.331339,3548.097832,3156.510452,2753.69149,2961.699694,3326.143191,3413.26269,3822.137084
Tunisia,africa,1468.475631,1395.232468,1660.30321,1932.360167,2753.285994,3120.876811,3560.233174,3810.419296,4332.720164,4876.798614,5722.895655,7092.923025
Malawi,africa,369.16508,416.369806,427.901086,495.514781,584.621971,663.223677,632.803921,635.517363,563.200014,692.27581,665.423119,759.34991
Sierra Leone,africa,879.787736,1004.484437,1116.639877,1206.043465,1353.759762,1348.285159,1465.010784,1294.447788,1068.696278,574.648158,699.489713,862.540756
Mauritius,africa,1967.955707,2034.037981,2529.067487,2475.387562,2575.484158,3710.982963,3688.037739,4783.586903,6058.253846,7425.705295,9021.815894,10956.99112
Equatorial Guinea,africa,375.643123,426.096408,582.841971,915.596003,672.412257,958.566812,927.825343,966.896815,1132.055034,2814.480755,7703.4959,12154.08975
Ghana,africa,911.298937,1043.561537,1190.041118,1125.69716,1178.223708,993.223957,876.032569,847.006113,925.060154,1005.245812,1111.984578,1327.60891
Botswana,africa,851.241141,918.232535,983.653976,1214.709294,2263.611114,3214.857818,4551.14215,6205.88385,7954.111645,8647.142313,11003.60508,12569.85177


```{admonition} Click the button to reveal!
:class: dropdown

~~~python
import glob
import pandas as pd

data_files = glob.glob('data/*gdp*.csv')

df_list = []

for f in data_files:
    df_list.append(pd.read_csv(f))
    
df = pd.concat(df_list)

df.head(10)
~~~
```

### List comprehension

Now rewrite the code above to use list comprehension rather than a `for` loop, and only *two* lines of code total (excluding the `import` commands and viewing the first 10 lines of the result). 

In [46]:
df_list = [pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')]
df = pd.concat(df_list)

df.sample(10)

Unnamed: 0,continent,country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
9,,Comoros,1102.990936,1211.148548,1406.648278,1876.029643,1937.577675,1172.603047,1267.100083,1315.980812,1246.90737,1173.618235,1075.811558,986.147879
10,,Germany,7144.114393,10187.82665,12902.46291,14745.62561,18016.18027,20512.92123,22031.53274,24639.18566,26505.30317,27788.88416,30035.80198,32170.37442
16,,Eritrea,328.940557,344.161886,380.995843,468.79497,514.324208,505.753808,524.875849,521.134133,582.85851,913.47079,765.350001,641.369524
32,,Morocco,1688.20357,1642.002314,1566.353493,1711.04477,1930.194975,2370.619976,2702.620356,2755.046991,2948.047252,2982.101858,3258.495584,3820.17523
29,,Mali,452.336981,490.382187,496.174343,545.009887,581.368876,686.395269,618.014064,684.171558,739.014375,790.257985,951.409752,1042.581557
10,,Israel,4086.522128,5385.278451,7105.630706,8393.741404,12786.93223,13306.61921,15367.0292,17122.47986,18051.52254,20896.60924,21905.59514,25523.2771
13,,Iceland,7267.688428,9244.001412,10350.15906,13319.89568,15798.06362,19654.96247,23269.6075,26923.20628,25144.39201,28061.09966,31163.20196,36180.78919
5,,Hong Kong China,3054.421209,3629.076457,4692.648272,6197.962814,8315.928145,11186.14125,14560.53051,20038.47269,24757.60301,28377.63219,30209.01516,39724.97867
3,Americas,Canada,11367.16112,12489.95006,13462.48555,16076.58803,18970.57086,22090.88306,22898.79214,26626.51503,26342.88426,28954.92589,33328.96507,36319.23501
49,,Uganda,734.753484,774.371069,767.27174,908.918522,950.735869,843.733137,682.266227,617.724406,644.170797,816.559081,927.721002,1056.380121


For an even bigger challenge, see if you can reduce the code to a single line!

In [47]:
df = pd.concat([pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')])

df.head(10)

Unnamed: 0,continent,country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
0,Americas,Argentina,5911.315053,6856.856212,7133.166023,8052.953021,9443.038526,10079.02674,8997.897412,9139.671389,9308.41871,10967.28195,8797.640716,12779.37964
1,Americas,Bolivia,2677.326347,2127.686326,2180.972546,2586.886053,2980.331339,3548.097832,3156.510452,2753.69149,2961.699694,3326.143191,3413.26269,3822.137084
2,Americas,Brazil,2108.944355,2487.365989,3336.585802,3429.864357,4985.711467,6660.118654,7030.835878,7807.095818,6950.283021,7957.980824,8131.212843,9065.800825
3,Americas,Canada,11367.16112,12489.95006,13462.48555,16076.58803,18970.57086,22090.88306,22898.79214,26626.51503,26342.88426,28954.92589,33328.96507,36319.23501
4,Americas,Chile,3939.978789,4315.622723,4519.094331,5106.654313,5494.024437,4756.763836,5095.665738,5547.063754,7596.125964,10118.05318,10778.78385,13171.63885
5,Americas,Colombia,2144.115096,2323.805581,2492.351109,2678.729839,3264.660041,3815.80787,4397.575659,4903.2191,5444.648617,6117.361746,5755.259962,7006.580419
6,Americas,Costa Rica,2627.009471,2990.010802,3460.937025,4161.727834,5118.146939,5926.876967,5262.734751,5629.915318,6160.416317,6677.045314,7723.447195,9645.06142
7,Americas,Cuba,5586.53878,6092.174359,5180.75591,5690.268015,5305.445256,6380.494966,7316.918107,7532.924763,5592.843963,5431.990415,6340.646683,8948.102923
8,Americas,Dominican Republic,1397.717137,1544.402995,1662.137359,1653.723003,2189.874499,2681.9889,2861.092386,2899.842175,3044.214214,3614.101285,4563.808154,6025.374752
9,Americas,Ecuador,3522.110717,3780.546651,4086.114078,4579.074215,5280.99471,6679.62326,7213.791267,6481.776993,7103.702595,7429.455877,5773.044512,6873.262326


```{admonition} Click the button to reveal!
:class: dropdown

Done in two lines of code:
~~~python
df_list = [pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')]
    
df = pd.concat(df_list)

df.head(10)
~~~

Done in one line of code: 

~~~python
df = pd.concat([pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')])
    
df.head(10)
~~~

```

## Summary of Key Points:
- Use a `for` loop to process files given a list of their names
- Use `glob.glob` to find sets of files whose names match a pattern
- List comprehension can replace a `for` loop, resulting in more compact and efficient code
- Naming your files consistently is just as important in data science as writing the code to read them
- When you want to combine multiple files into one pandas DataFrame, read each one in to a list of DataFrames, then run `pd.concat()` only once

---
This lesson is adapted from the [Software Carpentry](https://software-carpentry.org/lessons/) [Plotting and Programming in Python](http://swcarpentry.github.io/python-novice-gapminder/) workshop. 