Migrate Postgres to Neon.tech #492

Merged (20 commits) on Jan 13, 2024

gerhard
Member

@gerhard gerhard commented Dec 27, 2023

This is a follow-up to:

https://changelog-2024-01-12.fly.dev/ is configured to use https://console.neon.tech/app/projects/orange-sound-86604986

image

Initial observations

1/3. P99 latency for /feed increased by 3x - 3s vs 1s

image

2/3. Ecto SSL config doesn't seem to be working with :verify_peer

See this commit for more details. FTR:

We figured it out with @brendan-stephens via #492 (comment)

3/3. We are doing 70+ SELECTs when serving /

This seems a lot, but maybe it's necessary. Since each SELECT adds an extra 2-10ms due to the network latency, it is reasonable to expect an extra 500ms latency across 70+ SELECT statements.

See changelog-2024-01-12.fly.dev/ vs changelog-2022-03-13.fly.dev/. Now addressed in #492 (comment)

I expect other pages such as /feed, which uses 473 SELECT statements, to result in even slower responses. Initial observations suggest 4.7s vs the typical 1.7s.
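
For a rough sense of scale, here is a minimal Elixir sketch of the arithmetic above. It assumes the queries run sequentially and that each one pays the full 2-10ms network round trip; `LatencyEstimate` and `added_latency_ms/3` are hypothetical names used for illustration, not part of the changelog app.

```elixir
defmodule LatencyEstimate do
  @moduledoc """
  Back-of-the-envelope estimate of the extra latency that sequential
  queries add when each one pays a network round trip.
  Hypothetical helper, for illustration only.
  """

  # Returns {min_ms, max_ms} of added latency for `queries` sequential
  # SELECTs, given the per-query round-trip bounds in milliseconds.
  def added_latency_ms(queries, min_rtt_ms, max_rtt_ms) do
    {queries * min_rtt_ms, queries * max_rtt_ms}
  end
end

# 70 SELECTs when serving /, 2-10ms each
IO.inspect(LatencyEstimate.added_latency_ms(70, 2, 10))
# => {140, 700}

# 473 SELECTs when serving /feed, 2-10ms each
IO.inspect(LatencyEstimate.added_latency_ms(473, 2, 10))
# => {946, 4730}
```

The 70-query range brackets the ~500ms estimate, and the ~3s gap observed on /feed (4.7s vs 1.7s) falls inside the 473-query range.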

Next steps

  • @gerhard to reach out to Neon Support & figure out what he's doing wrong with :verify_peer
  • @gerhard to merge Send all app traces to Honeycomb #496 and rebase this on top so that we can see all traces on changelog-2022-03-13 app instance
  • @jerodsanto, @adamstac & @gerhard to decide if the extra latency is worth the migration
  • Maybe figure out if we want to fix the podcast & post URLs. They are all defaulting to changelog.com which makes comparing the two origins side-by-side difficult:
  • Get :verify_peer working in ssl_opts
  • Deploy new changelog-2024-01-12 app instance
  • Prepare Fastly config (remember the failed v176!)
  • Stop current instance
  • Restore db on new instance & ensure everything works as expected
  • Promote Fastly config
  • Ensure that everything looks good in Honeycomb.io & nothing triggers

After a few days, when we confirm that everything looks right, delete changelog-2022-03-13 app instance + changelog-postgres-2023-07-31 & party 🎉

@gerhard gerhard mentioned this pull request Dec 28, 2023
@gerhard
Member Author

gerhard commented Dec 30, 2023

I have decided to hook up the 1Password Service Account part of this so that we only need to configure 1 secret in the new app, OP_SERVICE_ACCOUNT_TOKEN, and then op CLI configures all app secrets just-in-time during boot.

The biggest benefit is that secrets will be version controlled, in env.op, same as we do in nightly.

As I'm doing this, I would like to double-check that we still use these. This is what I think we need:

  • AWS_ACCESS_KEY_ID
  • AWS_API_HOST
  • AWS_SECRET_ACCESS_KEY
  • BUFFER_TOKEN
  • CM_API_TOKEN
  • CM_SMTP_TOKEN
  • DATABASE_URL
  • DB_HOST
  • DB_PASS
  • DB_USER
  • FASTLY_TOKEN
  • GITHUB_API_TOKEN
  • GITHUB_CLIENT_ID
  • GITHUB_CLIENT_SECRET
  • HONEYCOMB_API_KEY
  • MASTODON_API_TOKEN
  • MASTODON_CLIENT_ID
  • MASTODON_CLIENT_SECRET
  • NOTION_API_TOKEN
  • PLUSPLUS_SLUG
  • R2_ACCESS_KEY_ID
  • R2_API_HOST
  • R2_SECRET_ACCESS_KEY
  • SECRET_KEY_BASE
  • SHOPIFY_API_KEY
  • SHOPIFY_API_PASSWORD
  • SIGNING_SALT
  • SLACK_APP_API_TOKEN
  • SLACK_INVITE_API_TOKEN
  • TURNSTILE_SECRET_KEY
  • TWITTER_CONSUMER_KEY
  • TWITTER_CONSUMER_SECRET
  • TYPESENSE_API_KEY
  • TYPESENSE_URL

Did I cross out all secrets that we no longer use @jerodsanto?

@jerodsanto
Member

Looks correct.

@gerhard gerhard force-pushed the migrate-postgres-to-neon branch 2 times, most recently from 09186d5 to e812ebc Compare December 30, 2023 18:40
@gerhard
Member Author

gerhard commented Dec 30, 2023

Thanks for confirming @jerodsanto!

I now have all of this wired up & working as https://changelog-2023-12-17.fly.dev/ . See the PR description for initial observations & next steps.

Leaving something here for Neon Support:


I am unable to make https://neon.tech/docs/guides/elixir-ecto#configure-ecto work for our Elixir application.

If I use the following config with our credentials:

config :friends, Friends.Repo,
  database: "friends",
  username: "alex",
  password: "AbC123dEf",
  hostname: "ep-cool-darkness-123456.us-west-2.aws.neon.tech",
  ssl: true,
  ssl_opts: [
    server_name_indication: 'ep-cool-darkness-123456.us-west-2.aws.neon.tech',
    verify: :verify_none
  ]

I get the following errors:

** (Postgrex.Error) ERROR 26000 (invalid_sql_statement_name) prepared statement "ecto_1922" does not exist
Screenshot 2023-12-30 at 17 32 26

If I follow the https://neon.tech/docs/guides/elixir-ecto#configure-ecto documentation further and configure verify: :verify_peer:

  ssl_opts: [
    cacerts: :public_key.cacerts_get(), # available since OTP26
    verify: :verify_peer,
    server_name_indication: String.to_charlist(System.get_env("DB_HOST", "db")),
    customize_hostname_check: [
      match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
    ]
  ],

  ssl_opts: [
    verify: :verify_peer,
    cacerts: :public_key.cacerts_get(), # available since OTP26
    versions: [:"tlsv1.3"],
    depth: 3,
    server_name_indication: String.to_charlist(System.get_env("DB_HOST", "db")),
    customize_hostname_check: [
      match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
    ]
  ],

Ecto is not even able to connect to Neon:

17:41:53.361 [notice] Application changelog exited: Changelog.Application.start(:normal, []) returned an error: shutdown: failed to start child: Changelog.EpisodeTracker
    ** (EXIT) an exception was raised:
        ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2955ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information

            (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:1047: Ecto.Adapters.SQL.raise_sql_call_error/1
            (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:945: Ecto.Adapters.SQL.execute/6
            (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
            (ecto 3.10.3) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
            (changelog 0.0.1) lib/changelog/schema/episode/episode.ex:421: Changelog.Episode.flatten_for_filtering/1
            (changelog 0.0.1) lib/changelog/episode_tracker.ex:130: Changelog.EpisodeTracker.refresh_episodes/0
            (changelog 0.0.1) lib/changelog/episode_tracker.ex:57: Changelog.EpisodeTracker.init/1
            (stdlib 5.2) gen_server.erl:980: :gen_server.init_it/2
Screenshot 2023-12-30 at 17 44 14

This is the only config option that works - in combination with specifying the endpoint id in the password field

  ssl_opts: [ verify: :verify_none ],

While the above workaround works, I am not comfortable skipping remote peer verification in production.

What am I doing wrong?

@gerhard
Member Author

gerhard commented Dec 30, 2023

Leaving this fun gotcha here:

Screen.Recording.2023-12-30.at.17.13.44.mp4

@jerodsanto
Member

Rebasing master should reduce the number of SELECTs on the home page and podcast pages. It should also fix the URLs on podcasts, but I haven't done the posts URL one yet...

@gerhard
Member Author

gerhard commented Jan 5, 2024

That was helpful @jerodsanto!

I just rebased on top of master, re-deployed & also re-imported the db. This is what I'm seeing now:


https://changelog-2022-03-13.fly.dev/ uses 15 queries & resolves in 134.6ms

image

https://changelog-2023-12-17.fly.dev/ uses 15 queries & resolves in 151.9ms, meaning roughly 13% (about 1.13x) slower, which is a 20-30x improvement from what we had when this PR was opened.

image

That is a sweet improvement @jerodsanto 💪

I am going to compare the /feed URL next.

@gerhard
Member Author

gerhard commented Jan 5, 2024

@jerodsanto
Member

Worth noting that /feed is not actually served from the app in production because Fastly fetches it from R2. Same goes for all feed endpoints.

@gerhard
Member Author

gerhard commented Jan 5, 2024

@gerhard
Member Author

gerhard commented Jan 9, 2024

ssl_opts with :verify_peer are still failing as initially documented:

Screenshot 2024-01-09 at 07 56 10

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Same version that we are running in Neon.tech.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
I found this helpful when testing a new production setup.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
We have not used this in years, pretty sure that we will not need it
anytime soon. Making it easy to git revert if I'm wrong about it.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
This adds the op CLI & versions the dagger CLI so that I can run the
following locally & publish a prod image:

    op inject -i envrc.op -o .envrc
    direnv allow
    dagger run mage image:production

Also check that a few more required env variables are present, otherwise
we will continue being surprised why things don't work...

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
This looks better:

    Targets:
      cd                  Run the CD pipeline
      ci                  Run the CI pipeline
      fly:daggerStart     Start Dagger Engine on Fly.io
      fly:daggerStop      Stop Dagger Engine on Fly.io
      fly:deploy          Push app container image to Fly.io
      image:production    Build & publish the production image
      image:runtime       Build & publish the runtime image
      test                Run tests

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
This will deploy https://changelog-2023-12-17.fly.dev

We are setting up a new production app so that we can test the Neon.tech
Postgres integration before promoting this to the new production. There
is one more commit missing to get this integration going...

Meanwhile, the 1Password Service Account integration allows us to set a
single secret in the app - OP_SERVICE_ACCOUNT_TOKEN - and then `op`
takes care of templating all other secrets just-in-time, when specific
commands are run, i.e. `db.migrate` or `app.start`. This simplifies the
app configuration considerably, and also makes rotating secrets super
simple - just modify them in 1Password, the `changelog` vault, and
restart the app 😉

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
This didn't work as documented, but I will add more context to the PR so
that we can go over it with Neon.tech Support...

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
So that it is close to Neon AWS us-east-1 (lower db latency).

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Same as the current production config.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Just did this again & captured what worked today.

Also made a few more changes to INFRASTRUCTURE so that it reflects the
upcoming changes more accurately.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Still fails as documented initially.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Especially useful when iterating locally, and the git sha doesn't change.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
We are in 2024 baby!

While at it, capture the step-by-step instructions.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
While this app is still the current production, we will no longer be
deploying to it after we merge this.

I also updated all references to this app instance in our internal docs.

I also removed the other app which we were using to debug various Fastly
& Fly.io proxying issues. No longer needed, cleaning all of 2022.fly.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
@gerhard gerhard marked this pull request as ready for review January 13, 2024 13:19
@gerhard
Member Author

gerhard commented Jan 13, 2024

https://changelog.com Postgres is now on Neon.tech.

There were a few issues, but nothing major: https://ui.honeycomb.io/changelog/datasets/fastly/result/Eb7gHnHDygg

image

P99 latency is 6% higher - 1.57s vs 1.48s - which is hardly note-worthy: https://ui.honeycomb.io/changelog/datasets/fastly/result/cSinitLo7TX
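
A small sketch of how that percentage is derived, in case we want to repeat the comparison later. `RelativeChange.percent/2` is a hypothetical helper written for this note, not something in the repo.

```elixir
defmodule RelativeChange do
  # Percentage change from `baseline` to `current`, rounded to one decimal.
  # Hypothetical helper, for illustration only.
  def percent(baseline, current) do
    Float.round((current - baseline) / baseline * 100, 1)
  end
end

# P99 latency: 1.48s before vs 1.57s after the migration
IO.inspect(RelativeChange.percent(1.48, 1.57))
# => 6.1

# Homepage timings from the Jan 5 comparison: 134.6ms vs 151.9ms
IO.inspect(RelativeChange.percent(134.6, 151.9))
# => 12.9
```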

image

@gerhard
Member Author

gerhard commented Jan 13, 2024

This looks good to me: last 30 minutes compared to a day before

image

@gerhard gerhard merged commit 2dd8a59 into thechangelog:master Jan 13, 2024
4 checks passed
@gerhard gerhard deleted the migrate-postgres-to-neon branch January 13, 2024 14:00
gerhard added a commit that referenced this pull request Mar 28, 2024
This is the next logical step after migrating to Neon.tech part of
#492

    cd changelog
    dagger call db-branch --neon-api-key=env:NEON_API_KEY

To learn more, see `changelog/README.md`. Part of this, we also deployed
a Dagger Engine v0.10.3 on Fly.io so that we don't need any sort of
container runtime running locally. I know that Jerod will appreciate
this.

The beginning of a new generation of tooling, I'm sure of it.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>
jerodsanto pushed a commit that referenced this pull request Mar 29, 2024
…nd (#508)

* Downgrade Erlang to 26.2.2

Elixir 1.14.5 installed was segfaulting on macOS 12.7.3 ARM. Installing
the Elixir `otp26` variant didn't fix it. Maybe an asdf issue... Will
try again another time.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>

* Enable changelog.com devs to create prod db forks with a single command

This is the next logical step after migrating to Neon.tech part of
#492

    cd changelog
    dagger call db-branch --neon-api-key=env:NEON_API_KEY

To learn more, see `changelog/README.md`. Part of this, we also deployed
a Dagger Engine v0.10.3 on Fly.io so that we don't need any sort of
container runtime running locally. I know that Jerod will appreciate
this.

The beginning of a new generation of tooling, I'm sure of it.

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>

---------

Signed-off-by: Gerhard Lazu <gerhard@changelog.com>