Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.5.0 - High MariaDB usage, Not working tags #1688

Closed
jonathandhn opened this issue Jan 30, 2023 · 12 comments · Fixed by #1700
Closed

3.5.0 - High MariaDB usage, Not working tags #1688

jonathandhn opened this issue Jan 30, 2023 · 12 comments · Fixed by #1700
Labels
Milestone

Comments

@jonathandhn
Copy link

How Shlink is set up

  • Shlink Version: 3.5.0
  • How do you serve Shlink: Openswoole Docker image
  • Database engine used: MariaDB 10.3

Summary

Since the last docker image upgrade, shlink is really slow on one & tags stop working (109 urls, 160k visits in two years), and multi-domains stop working on the other (main domain 404 redirect for everything), both were with redis, both get redis removed, but still events like Shlinkio\Shlink\Core\EventDispatcher\RedisPubSub\NotifyVisitToRedis (even 4) are run, redis removal improved a little on the perf side, two different servers, (Azure & OVH), Two different databases (10.3 remote and 10.6 local), Same ubuntu 20.04, and same docker verison (from docker repos) on both.

Current behavior

shlink web client browser console https://exemple.com//rest/v3/mercure-info code 501

docker run \
     --name exemple \
     -p 8080:8080 \
     --restart=always \
     -d \
      -e PORT=8080 \
     -e     DB_DRIVER=mariadb \
     -e     DB_NAME=exemple \
     -e     DB_HOST=exemple.mariadb.database.azure.com \
     -e     DB_PORT=3306 \
     -e     DB_USER=exemple@exemple \
     -e     DB_PASSWORD='exemple' \
     -e     DEFAULT_DOMAIN=exemple.com \
     -e     IS_HTTPS_ENABLED=true \
     -e     DEFAULT_SHORT_CODES_LENGTH=5 \
     -e     AUTO_RESOLVE_TITLES=true \
     -e     REDIRECT_APPEND_EXTRA_PATH=true \
     -e     GEOLITE_LICENSE_KEY='exemple' \
     -e     REDIRECT_STATUS_CODE=301 \
     -e     REDIRECT_CACHE_LIFETIME=30 \
     -e     DISABLE_TRACKING=false \
     -e     TRACK_ORPHAN_VISITS=true \
     -e     DISABLE_TRACK_PARAM=libre \
     -e     DISABLE_TRACKING_FROM= \
     -e     DISABLE_IP_TRACKING=false \
     -e     ANONYMIZE_REMOTE_ADDR=true \
     -e     DISABLE_UA_TRACKING=false \
     -e     DISABLE_REFERRER_TRACKING=false \
     -e     DEFAULT_BASE_URL_REDIRECT='https://www.exemple.com' \
     -e     DEFAULT_INVALID_SHORT_URL_REDIRECT='https://www.exemple.com/404' \
     -e     DEFAULT_REGULAR_404_REDIRECT='https://www.exemple.com/404' \
     -e     DEFAULT_QR_CODE_SIZE=300 \
     -e     DEFAULT_QR_CODE_MARGIN=0 \
     -e     DEFAULT_QR_CODE_FORMAT=png \
     -e     DEFAULT_QR_CODE_ERROR_CORRECTION=Q \
     -e     DEFAULT_QR_CODE_ROUND_BLOCK_SIZE=true \
     -e     DELETE_SHORT_URL_THRESHOLD=15 \
     -e     BASE_PATH= \
     -e     SHORT_URL_MODE=loosely \
     -e     SKIP_INITIAL_GEOLITE_DOWNLOAD=true \
     -e     ENABLE_PERIODIC_VISIT_LOCATE=true \
     -e WEB_WORKER_NUM=64 \
     -e TASK_WORKER_NUM=32 \
     shlinkio/shlink:stable
[2023-01-30T16:17:09.821637+00:00] [6ae1c565-793a-42df-b48d-566b74fbfbaa] Access.INFO - - "GET //rest/v3/health HTTP/1.1" 200 120
[2023-01-30T16:17:10.021823+00:00] [44e07a25-573c-49dc-9ec7-06bcdc434f3c] Shlink.WARNING - This Shlink instance is not integrated with a mercure hub.
[2023-01-30T16:17:10.022492+00:00] [44e07a25-573c-49dc-9ec7-06bcdc434f3c] Access.ERROR - - "GET //rest/v3/mercure-info HTTP/1.1" 501 189
[2023-01-30T16:17:11.051518+00:00] [5e9ac048-9e1e-41be-88f1-8d87a935bc98] Access.INFO - - "GET //rest/v3/visits HTTP/1.1" 200 183
[2023-01-30T16:17:20.662657+00:00] [b763b510-259d-4016-91ce-b081593ba1da] Access.INFO - - "GET //rest/v3/short-urls?page=1&orderBy=dateCreated-DESC HTTP/1.1" 200 4943
[2023-01-30T16:17:21.287931+00:00] [f59122a8-f7ee-43e6-bf73-a6bb1c50deff] Access.INFO - - "GET //rest/v3/domains HTTP/1.1" 200 443
[2023-01-30T16:17:26.333301+00:00] [a9a6151e-50b1-49d4-ab32-e29698845e94] Access.INFO - - "GET //rest/v3/short-urls?itemsPerPage=5&orderBy=dateCreated-DESC HTTP/1.1" 200 2507
[2023-01-30T16:17:30.117024+00:00] [c72b4429-101a-4c1d-a158-614848460096] Access.INFO - - "GET //rest/v3/short-urls?page=3&orderBy=dateCreated-DESC HTTP/1.1" 200 4758
[2023-01-30T16:17:34.211027+00:00] [b6b4c121-ac96-4892-af48-7e806e051dae] Access.INFO - - "GET //rest/v3/short-urls?page=2&orderBy=dateCreated-DESC HTTP/1.1" 200 4896
[2023-01-30T16:17:38.126902+00:00] [42f5bd79-f03c-4cd8-9bc6-d5f30b29d407] Access.INFO - - "GET //rest/v3/short-urls?page=11&orderBy=dateCreated-DESC HTTP/1.1" 200 4489
[2023-01-30T16:18:27.078922+00:00] [d4c98db7-f6b6-47f2-86c0-32edfac7cfcc] Access.INFO - - "GET //rest/v3/short-urls?page=9&orderBy=dateCreated-DESC HTTP/1.1" 200 4896
[2023-01-30T16:19:13.560711+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Shlink.WARNING - Tried to locate visit with id "229186", but a GeoLite2 db was not found.
[2023-01-30T16:19:13.565111+00:00] [NULL] Shlink.NOTICE - Starting work on task 0 using: {"handler":"Shlinkio\\Shlink\\Core\\EventDispatcher\\Mercure\\NotifyVisitToMercure","arguments":[{"visitId":"229186"}]}
[2023-01-30T16:19:13.574907+00:00] [NULL] Shlink.NOTICE - Starting work on task 1 using: {"handler":"Shlinkio\\Shlink\\Core\\EventDispatcher\\RabbitMq\\NotifyVisitToRabbitMq","arguments":[{"visitId":"229186"}]}
[2023-01-30T16:19:13.579229+00:00] [NULL] Shlink.NOTICE - Starting work on task 2 using: {"handler":"Shlinkio\\Shlink\\Core\\EventDispatcher\\RedisPubSub\\NotifyVisitToRedis","arguments":[{"visitId":"229186"}]}
[2023-01-30T16:19:13.593418+00:00] [NULL] Shlink.NOTICE - Starting work on task 3 using: {"handler":"Shlinkio\\Shlink\\Core\\EventDispatcher\\NotifyVisitToWebHooks","arguments":[{"visitId":"229186"}]}
[2023-01-30T16:19:13.608227+00:00] [NULL] Shlink.NOTICE - Starting work on task 4 using: {"handler":"Shlinkio\\Shlink\\Core\\EventDispatcher\\UpdateGeoLiteDb","arguments":[{"visitId":"229186"}]}
[2023-01-30T16:19:13.631068+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Access.INFO - - "GET /kvyi2/ HTTP/1.1" 301 0
[2023-01-30T16:19:13.671280+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Shlink.NOTICE - Task 2 has finished
[2023-01-30T16:19:13.695478+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Shlink.NOTICE - Task 3 has finished
[2023-01-30T16:19:13.712332+00:00] [NULL] Shlink.NOTICE - Downloading GeoLite2 db file...
[2023-01-30T16:19:13.715106+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Shlink.NOTICE - Task 1 has finished
[2023-01-30T16:19:13.722203+00:00] [NULL] Shlink.NOTICE - Finished downloading GeoLite2 db file
[2023-01-30T16:19:15.852126+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Shlink.NOTICE - Task 0 has finished
[2023-01-30T16:19:19.055039+00:00] [09ac4f1e-fe96-4fa3-8f47-e52454a2f159] Shlink.NOTICE - Task 4 has finished
[2023-01-30T16:28:04.037763+00:00] [77f54eea-1aa2-4447-9c9e-d9cfbac90844] Shlink.WARNING - This Shlink instance is not integrated with a mercure hub.
[2023-01-30T16:28:04.039250+00:00] [77f54eea-1aa2-4447-9c9e-d9cfbac90844] Access.ERROR - - "GET //rest/v3/mercure-info HTTP/1.1" 501 189
[2023-01-30T16:28:11.518291+00:00] [1d0ca9b4-1717-4cda-9750-80d6c8a182e3] Access.INFO - - "GET //rest/v3/short-urls?page=4&orderBy=dateCreated-DESC HTTP/1.1" 200 5228
[2023-01-30T16:28:34.725067+00:00] [296a7eeb-9d42-4d01-9797-56934f9dafc0] Access.INFO - - "GET //rest/v3/short-urls?page=7&orderBy=dateCreated-DESC HTTP/1.1" 200 5330
[2023-01-30T16:28:38.264190+00:00] [61711c93-ff33-45db-b139-f93c49dd5933] Access.INFO - - "GET //rest/v3/short-urls?page=8&orderBy=dateCreated-DESC HTTP/1.1" 200 5306
[2023-01-30T16:28:40.127875+00:00] [0ffd09f3-233e-49c3-83ab-9c3f18f3c663] Access.INFO - - "GET //rest/v3/short-urls?page=10&orderBy=dateCreated-DESC HTTP/1.1" 200 4919

Expected behavior

Tags, redis, and multi-domains keeps working.

How to reproduce

@jonathandhn
Copy link
Author

4 querys are run on mysql too, never ending, nor achieving, then MariaDB for Azure general purpose with 100Gb disk (340 max cons & 300 iops at least)& 2 Core top at 100% cpus for days, and the whole server slow to death.

On the second server on local MariaDB 10.6 it doesn't seam to trigger that.

Creating sort index

SELECT t.id_0 AS id, t.name_1 AS name, COUNT(DISTINCT s.id) AS short_urls_count, COUNT(DISTINCT v.id) AS visits, COUNT(DISTINCT v2.id) AS non_bot_visits FROM (SELECT t0_.id AS id_0, t0_.name AS name_1 FROM tags t0_ ORDER BY t0_.name ASC LIMIT 32) t LEFT JOIN short_urls_in_tags st ON t.id_0 = st.tag_id LEFT JOIN short_urls s ON s.id = st.short_url_id LEFT JOIN visits v ON st.short_url_id = v.short_url_id LEFT JOIN visits v2 ON (st.short_url_id = v2.short_url_id) AND (v2.potential_bot = '0') GROUP BY t.id_0, t.name_1 ORDER BY t.name_1 ASC

@tedo-develop
Copy link

We can confirm this behavior on version 3.5.0 Our server stuck with 100% CPU usage.
Downgrade to shlink version 3.4.0 and everything works again.

Setup

  • Docker image shlinkio/shlink:stable (v3.5.0)
  • MariaDB 10.5.18
  • 8000 Tags

@jonathandhn jonathandhn changed the title 3.5.0—Slowness, mutli-domains, and tags not working 3.5.0 - High MariaDB usage, Not working tags Feb 6, 2023
@jonathandhn
Copy link
Author

jonathandhn commented Feb 6, 2023

@tedo-develop

We can confirm this behavior on version 3.5.0 Our server stuck with 100% CPU usage. Downgrade to shlink version 3.4.0 and everything works again.

Setup

  • Docker image shlinkio/shlink:stable (v3.5.0)
  • MariaDB 10.5.18
  • 8000 Tags

Could you run show processlist; on your MariaDB Server while being connected to the web client in tags section ?

@marcelojuventino
Copy link

marcelojuventino commented Feb 6, 2023

Hi guys!

Just to let you know that we faced the same problem, 100% CPU on these queries, making the service unavailable. It was only possible to reestablish it when going back to version v3.4.0.

Running on MySQL.

@acelaya acelaya added this to the 3.5.2 milestone Feb 6, 2023
@tedo-develop
Copy link

@jonathandhn unfortunately we only have one production server with more than 100 requests per hour. So no i can't downgrade again, sorry.

@acelaya
Copy link
Member

acelaya commented Feb 11, 2023

I think this might be the culprit of the tags endpoint being so slow on version >=3.5.0:

@acelaya
Copy link
Member

acelaya commented Feb 11, 2023

I can confirm the PR above is what caused this.

I have been trying to understand why the query is now so much slower, but the only clear thing via EXPLAIN SELECT is that there was a missing index.

However, adding this index is barely noticible in the resulting execution time.

After a lot of tests I have come to the conclusion that there's no way to optimize this query any further (or I lack the knowledge to do it). However, counting the amount of non-bot visits in a separate query is way faster (even faster than the original query).

I'm experimenting with a workaround in which the original query is left as it was, and then there's another query fetching the non-bot visits amounts per tag.

Then, the results of those two are "merged".

It's not super nice, and it forces me to somehow extract all the duplicated logic from both queries. Also, this query was already very problematic and hard to maintain, so this just adds an extra layer of complexity.

I want to test this with a large enough dataset and every supported database engine, and hopefully have a fix by the end of the weekend.

I will then iterate over the solution in a hopefully near future, to make it more maintainable, and probably explore some way to detect performance regressions in the future.

@acelaya
Copy link
Member

acelaya commented Feb 11, 2023

Found a problem. Everything needed to be in the same query in order to support ordering by non-bot visits on paginated results.

I could just remove support to order by non-bot visits, as shlink-web-client does not support it and I don't expect many people to be using the API directly and be ordering tags lists by non-bot visits.

However, the idea is to allow displaying non-bot visits in shlink-web-client. If then sorting by visits is selected, unexpected results could be thrown when a tag has more bot visits than non-bot visits.

@acelaya
Copy link
Member

acelaya commented Feb 12, 2023

I may have found another way to keep everything in one query with a decent performance, by combining a couple of nested queries instead of a big chain of JOINS.

With my production instance the query used to take 20ms in Shlink 3.4, around 30 secons with Shlink 3.5, and this query takes 60ms producing the same result as the one from Shlink 3.5.0

I still want to test it with a larger dataset (more short URLs, more tags and more visits), and there's a couple of tests failing only for postgres that I need to investigate, but I would say I'm close.

@acelaya
Copy link
Member

acelaya commented Feb 12, 2023

I have tested the solution with 14k tags, 36k short URLs and 70k visits, and the performance issue seems to be solved.

I'm going to merge the solution and probably publish Shlink 3.5.2-beta.1, as I still want to fix one more bug for v3.5.2, but if any of you could test this beta version, there would still be time to apply further fixes if needed.

@jonathandhn I have only addressed the issue with tags, as I don't think the rest are related. Feel free to open new issues if you still experience other problems after the update.

@jonathandhn
Copy link
Author

Thank you, sorry for the delay @acelaya, I did miss the notification. I did release 3.5.2 into prod in the same setup, it works fine ! Still ongoing test, but it works. Thank you for your time.

@acelaya
Copy link
Member

acelaya commented Feb 17, 2023

Shlink 3.5.2 was released yesterday, including this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants