[ENH] Add read_folderfiles() function #104
Comments
@jcvall this sounds like a great idea! Before you go too deep into this, have you seen any nice implementations of the functionality? For example, from a cursory search, I saw this implementation, which looks nice! As for the function signature, what do you think about the following design?

```python
import pandas as pd
from typing import Union
from pathlib import Path


def read_csvs(df: pd.DataFrame, directory: Union[str, Path], pattern: str, filetype: str, **kwargs):
    """
    :param df: A pandas dataframe.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
```

What are your thoughts? Naturally, happy to leave the implementation details to you. Don't forget that a test for such a function would require multiple dummy CSV files (which could be really dummy: 3 rows per file kind of data). |
Thanks, sounds great. Just going to add: the idea generally came to me because when I use R, I find myself using the readbulk library. And yes, I do like your idea for read_csvs; maybe later I can work on read_xlsxs. Sent with GitHawk |
Ok! Looking forward to your contribution. 😄 Thanks for being active with the project! |
Don’t want you to think I dropped off this one. My personal goal is to have this in good shape before the month’s end. Sent with GitHawk |
Sorry I have been away for a while. I am happy to say I was hired as a data analyst just last week and will be coding in Python and R full time. I am catching up on the conversations and want to let you know I will go with what you think is best for releases. I’ve been working on the read_csvs() function and will submit it soon. Should I do anything special for the branch when I do? Sent with GitHawk |
Congratulations! This is a wonderful opportunity to continue in the data world 😄.
Yes, be sure to update your fork! There are a few ways to do this; the easiest is to delete your fork (on GitHub and locally) and then fork from my master again. We recently made a few changes, and I hope you've been keeping up with them. @zbarry updated the contribution guide. The key changes are: […]
Anyways, we'll knock out what needs to be done when we get to that stage. For now, congrats again on the new job! And looking forward to seeing your contribution! |
Something to think about... is the end goal for this method to read all the CSVs into one `DataFrame`, or to read each one in as its own `DataFrame`? I thought it was the former, but I've noticed that I perform the latter constantly, and it might be useful. For example, a directory with […]

Note: for the former implementation, unfortunately there's going to need to be a ton of error checking on the files to ensure that one of the CSVs isn't malformed and destroys the entire `DataFrame`.

@ericmjl I think this function actually brings up an interesting issue for pyjanitor, in the sense that it's not really a method on a `DataFrame`. Should it be called like this:

```python
import janitor

pd.DataFrame().read_folderfiles()
```

or like this?

```python
import janitor

janitor.read_folderfiles()
```
|
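For context on those two call styles: pyjanitor registers its `DataFrame` methods via pandas_flavor, so the same underlying function could in principle be exposed both ways. A rough, hypothetical sketch (none of these names are settled):

```python
import pandas as pd
import pandas_flavor as pf


def read_folderfiles(directory, pattern="*.csv"):
    # Module-level style: janitor.read_folderfiles(...)
    ...  # implementation under discussion


@pf.register_dataframe_method
def read_folderfiles_method(df: pd.DataFrame, directory, pattern="*.csv"):
    # Method style: pd.DataFrame().read_folderfiles_method(...)
    # The incoming df goes unused, which is exactly the awkwardness
    # @szuckerman points out: reading files is not a DataFrame operation.
    return read_folderfiles(directory, pattern)
```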
Yes, that's definitely true, @szuckerman. @jcvall, I'd probably add a new […] |
Sounds good. Hope to get you something this weekend to look over. Sent with GitHawk |
Wanted your thoughts. Certainly not finished, but I just want some feedback.

```python
import glob
import os
from pathlib import Path
from typing import Union

import pandas as pd


def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str = "*.csv",
    sep: str = ",",
    skiprows: int = 0,
    compression: str = "infer",
    encoding: str = "latin1",
    low_memory: bool = True,
    seperate_df: bool = False,
    **kwargs,
):
    """
    :param df: A pandas dataframe.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param sep: Delimiter to use; default is ",".
    :param skiprows: Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of each file.
    :param compression: For on-the-fly decompression of on-disk data.
        If 'infer' and filepath_or_buffer is path-like, detect the compression
        from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression).
        If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.
    :param encoding: Encoding to use when reading (e.g. 'utf-8'). Default is 'latin1'.
    :param low_memory: Internally process each file in chunks, resulting in lower memory use while parsing,
        but possibly mixed type inference. To ensure no mixed types, either set this to False or specify the type with the dtype parameter.
    :param seperate_df: If True, returns a dictionary of separate dataframes, one per CSV file read.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
    if seperate_df:
        dfs = {
            os.path.basename(f): pd.read_csv(f, sep=sep, compression=compression, low_memory=low_memory, encoding=encoding, skiprows=skiprows)
            for f in glob.glob(os.path.join(directory, pattern))
        }
        print("Use dfs.get(key) to get dataframes.")
        print("List of keys:")
        for key in dfs:
            print(key)
        return dfs
    else:
        df = pd.concat(
            [
                pd.read_csv(f, sep=sep, compression=compression, low_memory=low_memory, encoding=encoding, skiprows=skiprows).assign(filename=os.path.basename(f))
                for f in glob.glob(os.path.join(directory, pattern))
            ],
            ignore_index=True,
            sort=False,
        )
        return df
```
|
One of the thoughts I have on the pattern feature: it will usually take on ".csv", but it may have to change if the files are compressed; in that case the user would use ".gz", for example. I wanted to keep some of the features of the original pd.read_csv, as I often run into errors with encoding, skiprows, or low_memory that need to be addressed. I am still new to this but will be learning in earnest as I go. As a side note, I really wanted to include a tqdm_notebook() progress bar, from the tqdm package, as a feature. I loved this, as it gave me an estimated time and a progress bar that advanced toward 100% as each file loaded. It looks like you need a Jupyter notebook JavaScript extension, which is not a big deal, but I was afraid it would turn people off. If you want me to add it I can, just to see what it looks like, and if you don't like it I can take it away. |
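On the progress bar idea, here is a minimal sketch of how tqdm could wrap the file loop. This uses the plain terminal `tqdm`; `tqdm_notebook` would be a drop-in swap inside Jupyter (the directory is a placeholder):

```python
import glob
import os

import pandas as pd
from tqdm import tqdm

files = glob.glob(os.path.join("some_directory", "*.csv"))  # placeholder path
# Wrapping the iterable in tqdm() prints a bar that advances per file read.
df = pd.concat(
    [pd.read_csv(f) for f in tqdm(files)],
    ignore_index=True,
    sort=False,
)
```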
Nice work, @jcvall! Looking carefully at the code you wrote, I think there are some places that can be shortened. The first thing I noticed was that you had kwargs specified in there. I think that's a great start! We could condense the explicit pandas options into `**kwargs` and forward them to `read_csv`. The second thing I did was remove the printing (I'm guessing you may have been using those as debugging statements). The third thing I did was manually apply some Black-style formatting.

```python
def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str = "*.csv",
    seperate_df: bool = False,
    **kwargs
):
    """
    :param df: A pandas dataframe.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param seperate_df: If True, returns a dictionary of separate dataframes, one per CSV file read.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
    if seperate_df:
        dfs = {
            os.path.basename(f): pd.read_csv(f, **kwargs)
            for f in glob.glob(os.path.join(directory, pattern))
        }
        return dfs
    else:
        df = pd.concat(
            [
                pd.read_csv(f, **kwargs).assign(filename=os.path.basename(f))
                for f in glob.glob(os.path.join(directory, pattern))
            ],
            ignore_index=True,
            sort=False,
        )
        return df
```

On using tqdm, I think you can take a look at chemistry.py, which contains an example usage of tqdm. Because the primary use of pyjanitor has been in the notebook (for me at least), I made it an optional kwarg by setting a default value. |
I would like to work on this task! |
Sure! Sent with GitHawk |
I like where the proposed implementations are going; however, I would like to point out what was mentioned by @szuckerman again. Although janitor favors method chaining, it does not seem intuitive to me that the function be implemented as a method on a `DataFrame`. |
For inspiration: dask already has this functionality, in that you can use wildcards to read all the files with similar filenames. The difference with dask, though is that since it processes on a distributed system, all the files "live separately" when running a function on the dask dataframe. In our case they would need to be combined into one |
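For concreteness, the dask behavior being referenced looks roughly like this (assuming dask is installed; the pattern is a placeholder):

```python
import dask.dataframe as dd

# Wildcards match many CSVs; each file stays a separate partition...
ddf = dd.read_csv("data/2018-*.csv")
# ...until .compute() materializes everything into one pandas DataFrame.
df = ddf.compute()
```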
This is my current implementation proposal:

```python
import os
from glob import glob

import pandas as pd


def read_csvs(
    filespath: str,
    seperate_df: bool = False,
    **kwargs
):
    """
    :param filespath: The string pattern matching the CSV files. Accepts wildcard patterns, with or without the .csv extension.
    :param seperate_df: If False (default), returns a single DataFrame with the concatenation of the CSV files.
        If True, returns a dictionary of separate DataFrames, one per CSV file.
    :param kwargs: Keyword arguments to pass into the original pandas `read_csv`.
    """
    # Sanitize input
    assert filespath is not None
    assert len(filespath) != 0
    # Check if the original filespath contains .csv
    if not filespath.endswith(".csv"):
        filespath += ".csv"
    # Read the csv files
    dfs = {
        os.path.basename(f): pd.read_csv(f, **kwargs)
        for f in glob(filespath)
    }
    # Check if dataframes have been read
    if len(dfs) == 0:
        raise ValueError("No CSV files to read with the given filespath")
    # Concatenate the dataframes if requested (default)
    if seperate_df:
        return dfs
    else:
        try:
            return pd.concat(
                list(dfs.values()),
                ignore_index=True,
                sort=False,
            )
        except Exception as e:
            raise ValueError("Input CSV files cannot be concatenated") from e
```

It takes a single argument for the file path, which accepts wildcard patterns (via the glob package). |
Looks good! A few comments:

1.
```python
if not filespath.endswith(".csv"):
    filespath += ".csv"
```
I'm not sure we need to append ".csv" to every file. There are many instances where files may not have a .csv filename but will be comma-delimited. It's more common for tab-separated files, though. In that case, I would propose adding a […]

2.
```python
dfs = {
    os.path.basename(f): pd.read_csv(f, **kwargs)
    for f in glob(filespath)
}
```
I like that you want to keep the filename to reference the […] |
@dave-frazzetto, would you be kind enough to help me regain context here: has a PR been made, and if not, would you like to put one in for this issue? |
Ah, I just realized, the […] |
I find myself loading a lot of files from a folder. I would use the glob library for this, but it is a lot to write out. For example, I will write:

```python
import glob

import pandas as pd

path = "C:/Finance/Month End/2018/CSV Imports YTD"
files_csv = glob.glob(path + "/*.csv")
df = pd.DataFrame()
for f in files_csv:
    data1 = pd.read_csv(f, skiprows=0, low_memory=False, encoding="cp1252")
    data1["File_Name"] = f
    df = df.append(data1, ignore_index=True)
```

Something like this would be easier:

```python
read_folderfiles(path="", extension="", encoding="", add_filenames=True)
```

Thoughts? I can try to create this if you like.
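For reference, the repeated `append` can also be tightened into a single `pd.concat`, which is roughly the shape such a helper would wrap (same placeholder path as above):

```python
import glob

import pandas as pd

path = "C:/Finance/Month End/2018/CSV Imports YTD"
frames = []
for f in glob.glob(path + "/*.csv"):
    data = pd.read_csv(f, skiprows=0, low_memory=False, encoding="cp1252")
    data["File_Name"] = f  # tag each row with its source file
    frames.append(data)
df = pd.concat(frames, ignore_index=True)
```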