DB consumes lots of disk #159

Closed
jkralik opened this issue Jan 20, 2020 · 15 comments

Comments

@jkralik
Contributor

jkralik commented Jan 20, 2020

Subject of the issue

The server has been running since December 5, and the DB has a lot of "*.vlog" files (314) that consume 309GB.

  • How can I reduce it?
  • Can I remove old "*.vlog" files?
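
For anyone who wants to track the growth described above, here is a minimal sketch that counts the `*.vlog` files in a database directory and sums their sizes. The `./db` path and the `vlogUsage` helper are assumptions made for illustration, not part of step-ca; the sketch only measures usage and does not reduce it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// vlogUsage returns the number of *.vlog files directly inside dir and
// their combined size in bytes.
func vlogUsage(dir string) (int, int64, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "*.vlog"))
	if err != nil {
		return 0, 0, err
	}
	var count int
	var total int64
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			return 0, 0, err
		}
		count++
		total += info.Size()
	}
	return count, total, nil
}

func main() {
	// "./db" is a placeholder; point it at the directory that holds the CA database.
	count, total, err := vlogUsage("./db")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d vlog files, %.2f GB\n", count, float64(total)/1e9)
}
```

Running it periodically (for example from a cron job) would show how fast the value log grows between certificate renewals.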

Your environment

  • OS Ubuntu
  • Version 18.04
    We are using step-ca with its ACME server and 8 services as ACME clients.

Expected behavior

I expected it to consume at most 1GB of disk.

Actual behavior

It consumes 309GB and it's growing.

@dopey
Contributor

dopey commented Jan 21, 2020

Hey, apologies for the delayed response.

First off, I assume you are using the default Badger DB? That size is definitely not expected. I'm curious about your usage patterns: ballpark how many certificates have you created? You have 8 services, how often are they regenerating certificates?

Are you using the revocation feature? If not, then you can probably just start the database over entirely (meaning move the old one and stop using it and just start a new database).

More importantly, we'd definitely like to understand what's happening here. Unfortunately, we're storing the data as nosql key-value which makes it difficult to analyze without writing specific code to do that. Would you be open to sending us the database so that we can attempt to analyze it on our end?

@jkralik
Contributor Author

jkralik commented Jan 23, 2020

Hi

  1. We are using Badger DB. Every service has two ACME clients:
  • one for the listen socket
  • a second one that is used to connect to other services.

For example, gateways are exposed to the world, so they use Let's Encrypt for the listen socket, while for internal communication over mutually authenticated TLS we use step-ca (connect). Some services are internal, and in that case both ACME clients have the same configuration and point to the same ACME server (step-ca). All services run in k8s.

We use an ACME cert manager that handles renewal: https://github.com/go-ocf/kit/blob/5cad919232f614458aaae356353192d6a0e89706/security/acme/certManager.go#L123
Renew is called when the certificate's age is more than 2/3 of its lifetime.
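
The 2/3 rule can be sketched as follows. This is an illustration of the rule as described in this comment, not the code from the linked cert manager; the names `renewDelay` and `shouldRenew` and the 24-hour certificate are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// renewDelay returns how long after issuance a certificate with the given
// lifetime reaches two thirds of that lifetime.
func renewDelay(lifetime time.Duration) time.Duration {
	return lifetime * 2 / 3
}

// shouldRenew reports whether a certificate of the given age has passed
// the two-thirds point of its lifetime.
func shouldRenew(age, lifetime time.Duration) bool {
	return age > renewDelay(lifetime)
}

func main() {
	// Hypothetical certificate that is valid for 24 hours.
	notBefore := time.Date(2020, time.January, 1, 0, 0, 0, 0, time.UTC)
	notAfter := notBefore.Add(24 * time.Hour)
	lifetime := notAfter.Sub(notBefore)

	now := notBefore.Add(17 * time.Hour)
	fmt.Println("renew after:", renewDelay(lifetime))
	fmt.Println("renew now:", shouldRenew(now.Sub(notBefore), lifetime))
}
```

With this rule a healthy client contacts the CA only once per certificate lifetime, so the renewal schedule alone would not be expected to produce heavy database traffic.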

  2. We don't use revocation.

  3. We are happy to provide the database. Do you have an endpoint (access) where we can upload the DB? It is now 345GB.

@jkralik
Contributor Author

jkralik commented Jan 24, 2020

I compressed the DB with 7z and it is now 7.6GB.

@dopey
Contributor

dopey commented Jan 24, 2020

Hey @jkralik that's awesome. Sorry for the delayed response, I've been chugging away at a late deadline all day.

I was thinking of easy ways for you to upload that. If you send me an ssh pub key I can give you access to a test box and then you can scp it over there. Would that work for you?

Also, follow up question, would you mind sending a snippet of logs from the CA? The rate at which the db is growing makes me think that something is pummeling the CA with requests.

@jkralik
Contributor Author

jkralik commented Jan 24, 2020

Sure.
ssh pub key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFg88ZxptKmiOJQhG6jK/96+psDz6joEpe4+/Bd2ZFR9

I restarted the step-ca pod and all logs were lost. But I think this issue may be related to #149. We are now using a version patched with #162, and after cleaning the DB it is only 2.7MB after 12 hours.

@dopey
Contributor

dopey commented Jan 24, 2020

Whoa! That's super interesting. hmmm.

Ok, I'm gonna try and pm you the host address. Also, first time I've seen an ssh ed25519 key out in the wild. cool.

@dopey
Contributor

dopey commented Jan 24, 2020

Actually I don't know how to do that with github 😬. My email is max@smallstep.com. Wanna email me and then I'll email you the host address. Sorry, hate to make this complicated.

@jkralik
Contributor Author

jkralik commented Jan 27, 2020

I found that #162 is not related. The DB again takes 3GB after three days of running ... I will provide the smaller one for you.

@jkralik
Contributor Author

jkralik commented Jan 27, 2020

I found where the issue was. I expected the lego client to fill the resource with the PrivateKey when Renew is called with a CSR, but it only sets the certificate and CSR, without the PrivateKey .... Sorry, my fault.

@jkralik jkralik closed this as completed Jan 27, 2020
@dopey
Contributor

dopey commented Jan 27, 2020

@jkralik I don't know the Lego client well enough. But, did this cause some sort of loop that continuously hit the db? I guess I'm not understanding why this was causing the DB to expand so rapidly.

@jkralik
Contributor Author

jkralik commented Jan 28, 2020

I have a loop in my cert manager that renews the certificate, and when any call fails it tries again in 15 seconds. That means renew was called every 15 seconds. In my case the problem was in https://github.com/go-ocf/kit/blob/cbf12801499b2699b37d72c79f66d8c261d7767e/security/certManager/acme/certManager.go#L238 - this function failed because PrivateKey was empty. I then fixed it with commit plgd-dev/kit@cbf1280#diff-b1659f964b8384a232f0aec94303c811
-> it sets the private key from the previous certificate if it is not set in the new one.
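
A minimal sketch of the kind of fix described here, assuming a simplified resource type. The `resource` struct and `withPrivateKey` function are hypothetical stand-ins for illustration; this is not the code from the linked commit, and the struct only loosely mirrors what an ACME client library returns.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for the certificate resource returned
// by an ACME client; only the fields needed for this sketch are included.
type resource struct {
	Certificate []byte
	PrivateKey  []byte
	CSR         []byte
}

// withPrivateKey returns next, filling in the private key from prev when
// the renewal result did not include one.
func withPrivateKey(prev, next resource) resource {
	if len(next.PrivateKey) == 0 {
		next.PrivateKey = prev.PrivateKey
	}
	return next
}

func main() {
	prev := resource{
		Certificate: []byte("old certificate"),
		PrivateKey:  []byte("private key"),
		CSR:         []byte("csr"),
	}
	// A renewal made with a CSR comes back without a private key.
	renewed := resource{Certificate: []byte("new certificate"), CSR: []byte("csr")}

	renewed = withPrivateKey(prev, renewed)
	fmt.Printf("private key present: %t\n", len(renewed.PrivateKey) > 0)
}
```

Without such a fallback, the step that consumes the renewed resource fails on the empty key, and the 15-second retry loop issues another renewal each time.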

@dopey
Contributor

dopey commented Jan 28, 2020

Interesting. Even if you were renewing every 15 seconds, it's hard for me to understand how you could possibly be generating that much data. If you still have access to the 3GB database, I'd love to take a look at it.
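
A rough back-of-the-envelope check of that intuition, using only figures from this thread (309GB reported on Jan 20, server running since December 5, one retry every 15 seconds). The client count of 16 (8 services with two ACME clients each, all assumed to be stuck in the retry loop for the whole period) is an assumption and probably an upper bound, since some gateways use Let's Encrypt for one of their clients.

```go
package main

import (
	"fmt"
	"time"
)

// attempts returns how many calls fit into elapsed when one call is made
// every interval.
func attempts(elapsed, interval time.Duration) int64 {
	return int64(elapsed / interval)
}

func main() {
	elapsed := 46 * 24 * time.Hour // Dec 5 to Jan 20
	interval := 15 * time.Second   // retry interval from the thread
	const dbBytes = 309e9          // 309GB, decimal gigabytes assumed
	const clients = 16             // assumption: every client retried the whole time

	perClient := attempts(elapsed, interval)
	total := perClient * clients
	fmt.Printf("renew attempts per client: %d\n", perClient)
	fmt.Printf("renew attempts in total:   %d\n", total)
	fmt.Printf("bytes per attempt:         %.0f\n", dbBytes/float64(total))
}
```

Under these assumptions each client makes roughly 265,000 attempts, which puts the growth at some tens of kilobytes per attempt; with fewer clients looping, the per-attempt figure would be proportionally larger.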

@jkralik
Contributor Author

jkralik commented Jan 30, 2020

Sure. I uploaded a new archive at /newvol/smallstep-private.bck.1.7z. It contains the DB and the log; I have extended the logs with output from the logging middleware and challenges.

@ki-pete

ki-pete commented Sep 15, 2021

Hi @dopey,
did you have a look at the attached DB file? I'm currently investigating a similar issue and it's hard for me to find the root cause.
BTW: Do you have a hint on how to open the Badger DB and take a look into it?
BR Kim

@dopey
Contributor

dopey commented Sep 15, 2021

Hey @ki-pete want to hop in our Discord? It might be easier to debug in real time. https://discord.gg/fX5VJZAc

Here is a script you can use to count the rows in each table in your DB: https://gist.github.com/dopey/8e9206073e2cb052b6f633c0b7d4d8df. We'll want that info to help with debugging.
