# WHAT:

Your dataset is split across multiple files, but you need to combine them into a single DataFrame.

# HOW:

Use the glob module to list the files, a generator expression to read each one with read_csv(), and concat() to combine the results.

### But first, we need to obtain some sample data and split it up:

In [1]:
import pandas as pd
from glob import glob

The sample data is the iris data set:

In [2]:
df = pd.read_csv(r'https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv')

In [3]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [4]:
df.tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


### Now let's break up this DataFrame by iris species, saving each species' data to its own csv file in the ```data``` folder:

In [5]:
for species in df['species'].unique():
    df.query("species == @species").to_csv('data/' + species + '_data.csv')
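
The same split can also be written with ```groupby()```. The following is a minimal sketch rather than the method used in this notebook: it uses a tiny stand-in DataFrame with made-up values and a temporary output folder (both are assumptions for illustration), creates the folder first because ```to_csv()``` does not create missing directories, and passes ```index=False``` so the row index is not written to the files.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the iris DataFrame (values are illustrative only)
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Hypothetical output folder; a temp directory keeps the sketch self-contained
out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)  # to_csv() does not create missing folders

# groupby() yields (key, sub-DataFrame) pairs, one per species
for species, group in sample.groupby("species"):
    # index=False keeps the row index out of the file
    group.to_csv(os.path.join(out_dir, f"{species}_data.csv"), index=False)

written = sorted(os.listdir(out_dir))
print(written)
```

Note that this notebook's own cell writes the index, which is why an ```Unnamed: 0``` column shows up when the files are read back later.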

### Now let's confirm our files are in the ```data``` folder by running bash's ```ls``` command on it:

In [6]:
!ls data/

setosa_data.csv  versicolor_data.csv  virginica_data.csv


Awesome, so we have our data split up into 3 separate files.

# Now let's combine the files into a single DataFrame:

#### Let's create a list of files using ```glob()```, which supports wildcard pattern matching:

In [7]:
species_files = sorted(glob('data/*.csv'))
species_files

['data/setosa_data.csv', 'data/versicolor_data.csv', 'data/virginica_data.csv']
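
To see what the wildcard is doing, here is a small sketch that runs ```glob()``` against a temporary folder containing two csv files and one unrelated file. The folder and file names are assumptions made up for this example. ```glob()``` does not promise any particular order, which is why the result is wrapped in ```sorted()```.

```python
import os
import tempfile
from glob import glob

# Hypothetical folder with a mix of files, created only for this sketch
folder = tempfile.mkdtemp()
for name in ["virginica_data.csv", "setosa_data.csv", "notes.txt"]:
    open(os.path.join(folder, name), "w").close()

# '*' matches any run of characters within a file name,
# so only the names ending in .csv are returned
pattern = os.path.join(folder, "*.csv")
matches = sorted(glob(pattern))  # sorted() makes the order predictable

names = [os.path.basename(path) for path in matches]
print(names)
```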

#### Then use pandas' ```concat()``` function to concatenate the contents of the files.  Wrapping the ```pd.read_csv()``` call in a set of parentheses turns it into a Python generator expression that reads one file at a time.  You could instead use square brackets to build a Python list with list comprehension syntax:

In [8]:
pd.concat((pd.read_csv(file) for file in species_files), ignore_index=True)

Unnamed: 0.1,Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,0,5.1,3.5,1.4,0.2,setosa
1,1,4.9,3.0,1.4,0.2,setosa
2,2,4.7,3.2,1.3,0.2,setosa
3,3,4.6,3.1,1.5,0.2,setosa
4,4,5.0,3.6,1.4,0.2,setosa
5,5,5.4,3.9,1.7,0.4,setosa
6,6,4.6,3.4,1.4,0.3,setosa
7,7,5.0,3.4,1.5,0.2,setosa
8,8,4.4,2.9,1.4,0.2,setosa
9,9,4.9,3.1,1.5,0.1,setosa
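
The ```Unnamed: 0``` column in the output above is the original row index, which ```to_csv()``` wrote into each file. One way to keep it out of the combined data is to pass ```index_col=0``` to ```read_csv()```. The sketch below also uses the list comprehension form mentioned earlier. It recreates the situation on a tiny scale in a temporary folder; the values and folder are assumptions for illustration, not the real iris files.

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Recreate the situation above on a tiny scale (values are illustrative only)
folder = tempfile.mkdtemp()
parts = {
    "setosa": pd.DataFrame(
        {"petal_width": [0.2, 0.2], "species": ["setosa"] * 2}),
    "virginica": pd.DataFrame(
        {"petal_width": [2.3, 1.8], "species": ["virginica"] * 2},
        index=[2, 3]),
}
for name, part in parts.items():
    # The index is written as an unnamed first column, as in the notebook
    part.to_csv(os.path.join(folder, name + "_data.csv"))

files = sorted(glob(os.path.join(folder, "*.csv")))

# index_col=0 reads the first column back as the index instead of as data
frames = [pd.read_csv(path, index_col=0) for path in files]

# ignore_index=True then replaces the per-file indexes with 0..n-1
combined = pd.concat(frames, ignore_index=True)

print(combined.columns.tolist())
```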


# Alternatively, we could have used Dask, whose ```read_csv()``` function accepts a wildcard pattern and reads multiple files at once:

In [9]:
import dask.dataframe as dd

In [10]:
# Read multiple files as a single Dask DataFrame
ddf = dd.read_csv('data/*.csv')

In [11]:
type(ddf)

dask.dataframe.core.DataFrame

In [12]:
# Convert it back to a pandas DataFrame
df = ddf.compute()

In [13]:
type(df)

pandas.core.frame.DataFrame

In [14]:
df

Unnamed: 0.1,Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,0,5.1,3.5,1.4,0.2,setosa
1,1,4.9,3.0,1.4,0.2,setosa
2,2,4.7,3.2,1.3,0.2,setosa
3,3,4.6,3.1,1.5,0.2,setosa
4,4,5.0,3.6,1.4,0.2,setosa
5,5,5.4,3.9,1.7,0.4,setosa
6,6,4.6,3.4,1.4,0.3,setosa
7,7,5.0,3.4,1.5,0.2,setosa
8,8,4.4,2.9,1.4,0.2,setosa
9,9,4.9,3.1,1.5,0.1,setosa
