Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow most kinds of characters in URL query (fixes #8408) #8447

Merged
merged 3 commits into from Feb 2, 2019

Conversation

Projects
None yet
2 participants
@JMendyk
Copy link
Contributor

JMendyk commented Aug 25, 2018

Similiary to #4941, regexes from twitter-text were weakened to allow most kinds of characters in URL query string. Includes tests with unicode character in query string to avoid potential regressions in future.

Fixes #8408

@Gargron

This comment has been minimized.

Copy link
Member

Gargron commented Aug 25, 2018

I'm worried this might break something 🤔

@JMendyk

This comment has been minimized.

Copy link
Contributor Author

JMendyk commented Aug 26, 2018

I think I've rushed too much with this one 😞.

By "might break something" do you mean that it will be to eager/forgiving what a valid URL is?

@Gargron

This comment has been minimized.

Copy link
Member

Gargron commented Aug 26, 2018

@JMendyk It might affect how people use URLs in messages if it "eats up" characters that used to separate text from the URL... I don't know how this affects non-ASCII languages especially, Japanese doesn't use spaces between words, do they use spaces around URLs?

@JMendyk

This comment has been minimized.

Copy link
Contributor Author

JMendyk commented Aug 26, 2018

@Gargron Yes, that would be troublesome for users of non-ASCII languages.

One idea that comes to my mind is to get a sample of toots from instances used mostly by non-ASCII language users, run them through both versions of regexes and check how many urls are detected differently.

Edit: Well, I didn't know that Mastodon API returns toots with already formatted text, which makes my idea pointless as the text (including links) has already been altered in some way.

Alternative approach to unicode support in urls
Adds PoC/idea to approch this problem.
@JMendyk

This comment has been minimized.

Copy link
Contributor Author

JMendyk commented Sep 13, 2018

I created an alternative approach, in which before calling TwitterText, source text is url escaped to remove unicode characters; then when processed by TwitterText, brought back to old form. I am aware that it certainly has issues/unexpected consequences but I decided to share it because current match-all regex is obviously no good of a solution and as a PoC/food for thought.

@Gargron

Gargron approved these changes Feb 2, 2019

@Gargron Gargron merged commit 6a5e3da into tootsuite:master Feb 2, 2019

11 checks passed

ci/circleci: build Your tests passed on CircleCI!
Details
ci/circleci: check-i18n Your tests passed on CircleCI!
Details
ci/circleci: install Your tests passed on CircleCI!
Details
ci/circleci: install-ruby2.4 Your tests passed on CircleCI!
Details
ci/circleci: install-ruby2.5 Your tests passed on CircleCI!
Details
ci/circleci: install-ruby2.6 Your tests passed on CircleCI!
Details
ci/circleci: test-ruby2.4 Your tests passed on CircleCI!
Details
ci/circleci: test-ruby2.5 Your tests passed on CircleCI!
Details
ci/circleci: test-ruby2.6 Your tests passed on CircleCI!
Details
ci/circleci: test-webui Your tests passed on CircleCI!
Details
codeclimate All good!
Details

masanbol added a commit to masanbol/snoutsonline that referenced this pull request Feb 3, 2019

Gargron added a commit that referenced this pull request Feb 17, 2019

Allow most kinds of characters in URL query (fixes #8408) (#8447)
* Allow unicode characters in URL query strings

Fixes #8408

* Alternative approach to unicode support in urls

Adds PoC/idea to approch this problem.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.