[ENH] Add read_folderfiles() function #104
Comments
@jcvall this sounds like a great idea! Before you go too deep into this, have you seen any nice implementations of the functionality? For example, from a cursory search, I saw this implementation, which looks nice! As for the function signature, what do you think about the following design?

```python
import pandas as pd
from typing import Union
from pathlib import Path


def read_csvs(df: pd.DataFrame, directory: Union[str, Path], pattern: str, filetype: str, **kwargs):
    """
    :param df: A pandas dataframe.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
```

What are your thoughts? Naturally, happy to leave the implementation details to you. Don't forget that a test for such a function would require multiple dummy CSV files (which could be really dummy: 3 rows per file kind of data). |
Thanks, sounds great. Just going to add: the idea generally came to me because when I use R, I find myself using the readbulk library. And yes, I do like your idea for read_csvs; maybe later I can work on read_xlsxs. Sent with GitHawk |
Ok! Looking forward to your contribution. 😄 Thanks for being active with the project! |
Don’t want you to think I dropped off this one. My personal goal is to have this in good shape before the month’s end. Sent with GitHawk |
Sorry I have been away for a while. I am happy to say I was hired as a data analyst just last week and will be coding in Python and R full time. I am catching up on the conversations and want to let you know I will go with what you think is best for releases. I’ve been working on the read_csvs() function and will submit it soon. Should I do anything special for the branch when I do? Sent with GitHawk |
Congratulations! This is a wonderful opportunity to continue in the data world 😄.
Yes, be sure to update your fork! There are a few ways to do this; the easiest is to delete your fork (on GitHub and locally) and then fork from my master again. We recently made a few changes, and I hope you've been keeping up with them. @zbarry updated the contribution guide. The key changes are: […]
Anyways, we'll knock out what needs to be done when we get to that stage. For now, congrats again on the new job! And looking forward to seeing your contribution! |
Something to think about... is the end goal for this method to read all the CSVs into one `DataFrame`, or to read each one in as its own `DataFrame`? I thought it was the former, but I've noticed that I perform the latter constantly, and it might be useful. For example, a directory with […]

Note: for the former implementation, unfortunately there's going to need to be a ton of error checking on the files to ensure that one of the CSVs isn't malformed and destroys the entire `DataFrame`.

@ericmjl I think this function actually brings up an interesting issue for pyjanitor, in the sense that it's not really a method on a `DataFrame`. Should it be called like this:

```python
import janitor

pd.DataFrame().read_folderfiles()
```

or like this?

```python
import janitor

janitor.read_folderfiles()
```
|
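For context on those two call styles: pyjanitor registers its `DataFrame` methods via pandas_flavor, so the same underlying function could in principle be exposed both ways. A rough, hypothetical sketch (none of these names are settled):

```python
import pandas as pd
import pandas_flavor as pf


def read_folderfiles(directory, pattern="*.csv"):
    # Module-level style: janitor.read_folderfiles(...)
    ...  # implementation under discussion


@pf.register_dataframe_method
def read_folderfiles_method(df: pd.DataFrame, directory, pattern="*.csv"):
    # Method style: pd.DataFrame().read_folderfiles_method(...)
    # The incoming df goes unused, which is exactly the awkwardness
    # @szuckerman points out: reading files is not a DataFrame operation.
    return read_folderfiles(directory, pattern)
```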
Yes, that's definitely true, @szuckerman. @jcvall, I'd probably add a new […] |
Sounds good. Hope to get you something this weekend to look over. Sent with GitHawk |
Wanted your thoughts. Certainly not finished, but I just want some feedback.

```python
import glob
import os
from pathlib import Path
from typing import Union

import pandas as pd


def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str = "*.csv",
    sep: str = ",",
    skiprows: int = 0,
    compression: str = "infer",
    encoding: str = "latin1",
    low_memory: bool = True,
    seperate_df: bool = False,
    **kwargs,
):
    """
    :param df: A pandas dataframe.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param sep: Delimiter to use; default is ",".
    :param skiprows: Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of each file.
    :param compression: For on-the-fly decompression of on-disk data.
        If 'infer' and filepath_or_buffer is path-like, detect the compression
        from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression).
        If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.
    :param encoding: Encoding to use when reading (e.g. 'utf-8'). Default is 'latin1'.
    :param low_memory: Internally process each file in chunks, resulting in lower memory use while parsing,
        but possibly mixed type inference. To ensure no mixed types, either set this to False or specify the type with the dtype parameter.
    :param seperate_df: If True, returns a dictionary of separate dataframes, one per CSV file read.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
    if seperate_df:
        dfs = {
            os.path.basename(f): pd.read_csv(f, sep=sep, compression=compression, low_memory=low_memory, encoding=encoding, skiprows=skiprows)
            for f in glob.glob(os.path.join(directory, pattern))
        }
        print("Use dfs.get(key) to get dataframes.")
        print("List of keys:")
        for key in dfs:
            print(key)
        return dfs
    else:
        df = pd.concat(
            [
                pd.read_csv(f, sep=sep, compression=compression, low_memory=low_memory, encoding=encoding, skiprows=skiprows).assign(filename=os.path.basename(f))
                for f in glob.glob(os.path.join(directory, pattern))
            ],
            ignore_index=True,
            sort=False,
        )
        return df
```
|
One of the thoughts I have on the pattern feature: it will usually take on ".csv", but it may have to change if the files are compressed; in that case the user would use ".gz", for example. I wanted to keep some of the features of the original pd.read_csv, as I often run into errors with encoding, skiprows, or low_memory that need to be addressed. I am still new to this but will be learning in earnest as I go. As a side note, I really wanted to include a tqdm_notebook() progress bar, from the tqdm package, as a feature. I loved this, as it gave me an estimated time and a progress bar that advanced toward 100% as each file loaded. It looks like you need a Jupyter notebook JavaScript extension, which is not a big deal, but I was afraid it would turn people off. If you want me to add it I can, just to see what it looks like, and if you don't like it I can take it away. |
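On the progress bar idea, here is a minimal sketch of how tqdm could wrap the file loop. This uses the plain terminal `tqdm`; `tqdm_notebook` would be a drop-in swap inside Jupyter (the directory is a placeholder):

```python
import glob
import os

import pandas as pd
from tqdm import tqdm

files = glob.glob(os.path.join("some_directory", "*.csv"))  # placeholder path
# Wrapping the iterable in tqdm() prints a bar that advances per file read.
df = pd.concat(
    [pd.read_csv(f) for f in tqdm(files)],
    ignore_index=True,
    sort=False,
)
```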
Nice work, @jcvall! Looking carefully at the code you wrote, I think there are some places that can be shortened. The first thing I noticed was that you had kwargs specified in there. I think that's a great start! We could condense the explicit pandas options into `**kwargs` and forward them to `read_csv`. The second thing I did was remove the printing (I'm guessing you may have been using those as debugging statements). The third thing I did was manually apply some Black-style formatting.

```python
def read_csvs(
    df: pd.DataFrame,
    directory: Union[str, Path],
    pattern: str = "*.csv",
    seperate_df: bool = False,
    **kwargs
):
    """
    :param df: A pandas dataframe.
    :param directory: The directory that contains the CSVs.
    :param pattern: The pattern of CSVs to match.
    :param seperate_df: If True, returns a dictionary of separate dataframes, one per CSV file read.
    :param kwargs: Keyword arguments to pass into `read_csv`.
    """
    if seperate_df:
        dfs = {
            os.path.basename(f): pd.read_csv(f, **kwargs)
            for f in glob.glob(os.path.join(directory, pattern))
        }
        return dfs
    else:
        df = pd.concat(
            [
                pd.read_csv(f, **kwargs).assign(filename=os.path.basename(f))
                for f in glob.glob(os.path.join(directory, pattern))
            ],
            ignore_index=True,
            sort=False,
        )
        return df
```

On using tqdm, I think you can take a look at chemistry.py, which contains an example usage of tqdm. Because the primary use of pyjanitor has been in the notebook (for me at least), I made it an optional kwarg by setting a default value. |
I would like to work on this task! |
Sure! Sent with GitHawk |
I like where the proposed implementations are going; however, I would like to point out what was mentioned by @szuckerman again. Although janitor favors method chaining, it does not seem intuitive to me that the function be implemented as a method on a `DataFrame`. |
For inspiration: dask already has this functionality, in that you can use wildcards to read all the files with similar filenames. The difference with dask, though is that since it processes on a distributed system, all the files "live separately" when running a function on the dask dataframe. In our case they would need to be combined into one |
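For concreteness, the dask behavior being referenced looks roughly like this (assuming dask is installed; the pattern is a placeholder):

```python
import dask.dataframe as dd

# Wildcards match many CSVs; each file stays a separate partition...
ddf = dd.read_csv("data/2018-*.csv")
# ...until .compute() materializes everything into one pandas DataFrame.
df = ddf.compute()
```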
This is my current implementation proposal:

```python
import os
from glob import glob

import pandas as pd


def read_csvs(
    filespath: str,
    seperate_df: bool = False,
    **kwargs
):
    """
    :param filespath: The string pattern matching the CSV files. Accepts wildcard patterns, with or without the .csv extension.
    :param seperate_df: If False (default), returns a single DataFrame with the concatenation of the CSV files.
        If True, returns a dictionary of separate DataFrames, one per CSV file.
    :param kwargs: Keyword arguments to pass into the original pandas `read_csv`.
    """
    # Sanitize input
    assert filespath is not None
    assert len(filespath) != 0
    # Check if the original filespath contains .csv
    if not filespath.endswith(".csv"):
        filespath += ".csv"
    # Read the csv files
    dfs = {
        os.path.basename(f): pd.read_csv(f, **kwargs)
        for f in glob(filespath)
    }
    # Check if dataframes have been read
    if len(dfs) == 0:
        raise ValueError("No CSV files to read with the given filespath")
    # Concatenate the dataframes if requested (default)
    if seperate_df:
        return dfs
    else:
        try:
            return pd.concat(
                list(dfs.values()),
                ignore_index=True,
                sort=False,
            )
        except Exception as e:
            raise ValueError("Input CSV files cannot be concatenated") from e
```

It takes a single argument for the file path, which accepts wildcard patterns (via the glob package). |
Looks good! A few comments:

1.
```python
if not filespath.endswith(".csv"):
    filespath += ".csv"
```
I'm not sure we need to append ".csv" to every file. There are many instances where files may not have a .csv filename but will be comma-delimited. It's more common for tab-separated files, though. In that case, I would propose adding a […]

2.
```python
dfs = {
    os.path.basename(f): pd.read_csv(f, **kwargs)
    for f in glob(filespath)
}
```
I like that you want to keep the filename to reference the […] |
@dave-frazzetto, would you be kind enough to help me regain context here: has a PR been made, and if not, would you like to put one in for this issue? |
Ah, I just realized, the […] |
I find myself loading a lot of files from a folder. I would use the glob library for this, but it is a lot to write out. For example, I will write:

```python
import glob

import pandas as pd

path = "C:/Finance/Month End/2018/CSV Imports YTD"
files_csv = glob.glob(path + "/*.csv")
df = pd.DataFrame()
for f in files_csv:
    data1 = pd.read_csv(f, skiprows=0, low_memory=False, encoding="cp1252")
    data1["File_Name"] = f
    df = df.append(data1, ignore_index=True)
```

Something like this would be easier:

```python
read_folderfiles(path="", extension="", encoding="", add_filenames=True)
```

Thoughts? I can try to create this if you like.
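For reference, the repeated `append` can also be tightened into a single `pd.concat`, which is roughly the shape such a helper would wrap (same placeholder path as above):

```python
import glob

import pandas as pd

path = "C:/Finance/Month End/2018/CSV Imports YTD"
frames = []
for f in glob.glob(path + "/*.csv"):
    data = pd.read_csv(f, skiprows=0, low_memory=False, encoding="cp1252")
    data["File_Name"] = f  # tag each row with its source file
    frames.append(data)
df = pd.concat(frames, ignore_index=True)
```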