dumping large pandas dataframe (~ >10k rows) times out #77

Closed
notacoat opened this issue Sep 28, 2014 · 10 comments · Fixed by #1563

Comments

@notacoat

A check to see how large the array is, followed by some code that chunks it into <=10k sections, would work. I wrote a function that does that and it is working fine.

@fzumstein
Member

Great - can you open a pull request, or do you want to paste the function here? Although the issue will not be DataFrame-specific: it also applies to lists and NumPy arrays (they are all transformed into lists before being transferred over).

@fzumstein fzumstein added the bug label Oct 6, 2014
@fzumstein fzumstein modified the milestone: v0.2.3 Oct 6, 2014
@notacoat
Author

Here's the function I wrote:

import re
from xlwings import Range

def dump_largeDF(df, startcell='A1', chunk_size=10000):
    if len(df) <= (chunk_size + 1):
        Range(startcell, index=False).value = df
    else:  # chunk df and dump each piece (default is 10k rows)
        c = re.match(r"([a-z]+)([0-9]+)", startcell, re.I)
        cL, cN = c.group(1), int(c.group(2))  # column letter, start row
        n = 0
        for chunk in (df[rw:rw + chunk_size] for rw in xrange(0, len(df), chunk_size)):
            if n == 0:
                Range(cL + str(cN + n), index=False).value = chunk
            else:
                Range(cL + str(cN + n)).value = chunk.values
            cN += chunk_size
            n += 1

@fzumstein fzumstein modified the milestones: v0.2.3, v0.2.4 Oct 13, 2014
@fzumstein fzumstein modified the milestones: v0.2.4, v0.2.5 Nov 3, 2014
@Zenadix

Zenadix commented Jan 7, 2015

The function posted by EmilyBarbour is buggy and causes an empty row to appear between chunks.

I modified the function and now it works fine. I posted it on StackOverflow, but I'm posting it here too for easy access.

import re
from xlwings import Range

# Dumps a large DataFrame into Excel via xlwings.
# Does not include headers.
def dump_largeDF(df, startcell='A1', chunk_size=100000):
    if len(df) <= (chunk_size + 1):
        Range(startcell, index=False, header=False).value = df
    else:  # chunk df and dump each piece
        c = re.match(r"([a-z]+)([0-9]+)", startcell, re.I)  # e.g. 'A1'
        col_letter = c.group(1)    # 'A'
        row_num = int(c.group(2))  # 1
        for chunk in (df[rw:rw + chunk_size] for rw in
                      range(0, len(df), chunk_size)):
            print("Dumping chunk at %s%s" % (col_letter, row_num))
            Range(col_letter + str(row_num), index=False, header=False).value = chunk
            row_num += chunk_size

For me, a chunk size of 100k is fine; you may change it depending on your needs.
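
For anyone trying this out, here is a minimal usage sketch; it assumes the pre-0.9 xlwings API used above, where the module-level Range writes to the active workbook (the sample DataFrame is illustrative):

import pandas as pd
from xlwings import Workbook  # pre-0.9 API; Range is imported where dump_largeDF is defined

wb = Workbook()  # the new workbook becomes the active one
df = pd.DataFrame({'a': range(250000), 'b': range(250000)})
dump_largeDF(df, startcell='A1')  # writes in three 100k-row chunks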

@fzumstein
Member

Great, thanks! I still hope it'll make it into one of the next releases, but as pointed out above, we'll need to implement this further down the stack to cover NumPy arrays and lists as well.

@fzumstein fzumstein modified the milestones: v0.3.1, v0.3.2, v0.3.3, v0.3.4 Jan 16, 2015
@aschleg

aschleg commented Jan 29, 2015

Thanks a lot for the function, Emily and Zenadix - really helpful for moving large datasets between Excel and Python.

I wanted to share a slight modification of Zenadix's function that copies in the column headers as well. It seems to be working well for my purposes, so I thought it might help others too :)

import re
from xlwings import Range

def dump_largeDF(df, sheet, startcell, chunk_size=10000):
    if len(df) <= (chunk_size + 1):
        Range(sheet, startcell, index=False, header=True).value = df
    else:
        # write the header row first, then dump the data in chunks below it
        Range(sheet, startcell, index=False).value = list(df.columns)
        c = re.match(r"([a-z]+)([0-9]+)", startcell, re.I)
        col_letter = c.group(1)
        row_num = int(c.group(2)) + 1  # first data row sits below the header
        for chunk in (df[rw:rw + chunk_size] for rw in
                      range(0, len(df), chunk_size)):
            Range(sheet, col_letter + str(row_num), index=False, header=False).value = chunk
            row_num += chunk_size

@fzumstein fzumstein modified the milestones: v0.3.3, v0.3.4 Mar 1, 2015
@fzumstein fzumstein modified the milestones: v0.3.3, v0.3.4 Mar 8, 2015
@fzumstein
Member

The Excel cursor might need to be switched to "busy" to prevent editing cells while xlwings is still trying to write out data.
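
A minimal sketch of that idea on Windows, assuming the later xlwings API and that Excel's COM object is reachable via app.api (xlWait = 2 and xlDefault = -4143 in the Excel object model); dump_largeDF stands for any of the chunked writers above, with df being the DataFrame to write:

import xlwings as xw

app = xw.apps.active
app.api.Cursor = 2             # xlWait: show the busy cursor while writing
app.screen_updating = False    # also stop Excel from repainting mid-dump
try:
    dump_largeDF(df, startcell='A1')
finally:
    app.screen_updating = True
    app.api.Cursor = -4143     # xlDefault: restore the normal cursor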

@fzumstein
Member

see also: ericremoreynolds/excelpython#60

@fzumstein fzumstein added this to the short list milestone Aug 25, 2016
@fzumstein fzumstein modified the milestones: short list, v0.10.1 Oct 16, 2016
@fzumstein fzumstein changed the title from "dumping large pandas dataframe (~ >10k rows) times out on mac osx" to "dumping large pandas dataframe (~ >10k rows) times out" Oct 17, 2016
@fzumstein fzumstein removed this from the v0.10.1 milestone Oct 26, 2016
@robertschulze
Contributor

For everyone who is keen on also handling the header correctly without relying on the module-level Range, I modified the code a bit:

import re

def dumpLargeDf(wb, df, startcell='A1', chunk_size=50000):
    # Dumps a large DataFrame into Excel via xlwings. Takes care of the header.
    if len(df) <= (chunk_size + 1):
        wb.sheets.active.range(startcell).options(index=False).value = df
    else:  # chunk df and dump each piece
        c = re.match(r"([a-z]+)([0-9]+)", startcell, re.I)  # e.g. 'A1'
        col_letter = c.group(1)    # 'A'
        row_num = int(c.group(2))  # 1
        use_header = True
        for chunk in (df[rw:rw + chunk_size] for rw in
                      range(0, len(df), chunk_size)):
            print("Dumping chunk at %s%s" % (col_letter, row_num))
            wb.sheets.active.range(col_letter + str(row_num)) \
                .options(index=False, header=use_header).value = chunk
            row_num += chunk_size + (1 if use_header else 0)  # header takes one extra row
            use_header = False

@fzumstein
Member

With 0.23.1 you can now do: myrange.options(chunksize=10_000).value = df
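
For completeness, a minimal sketch of that built-in option (the workbook and DataFrame here are illustrative):

import pandas as pd
import xlwings as xw

wb = xw.Book()
df = pd.DataFrame({'a': range(1_000_000)})

# xlwings splits the transfer into 10,000-row blocks internally,
# so no single cross-process call has to move the whole frame at once
wb.sheets[0].range('A1').options(chunksize=10_000).value = df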
