-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy pathindex.html
105 lines (105 loc) Β· 30.1 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
<!doctype html><html lang=en><head><meta content="IE=edge" http-equiv=X-UA-Compatible><meta content="text/html; charset=utf-8" http-equiv=content-type><meta content="width=device-width,initial-scale=1.0,maximum-scale=1" name=viewport><title>Python 3.11: possessive quantifiers and atomic grouping added to re module</title><link href=https://learnbyexample.github.io/atom.xml rel=alternate title=RSS type=application/atom+xml><script src=https://cdnjs.cloudflare.com/ajax/libs/slideout/1.0.1/slideout.min.js></script><link href=https://learnbyexample.github.io/site.css rel=stylesheet><meta content="Python 3.11: possessive quantifiers and atomic grouping added to re module" property=og:title><meta content=website property=og:type><meta content="Python possessive quantifiers and atomic grouping, and how they help prevent catastrophic backtracking in regular expressions." property=og:description><meta content=https://learnbyexample.github.io/python-regex-possessive-quantifier/ property=og:url><meta content=https://learnbyexample.github.io/images/python_possessive_quantifiers.png property=og:image><meta content=1280 property=og:image:width><meta content=640 property=og:image:height><meta content=summary_large_image property=twitter:card><meta content=@learn_byexample property=twitter:site><link href=https://learnbyexample.github.io/favicon.svg rel=icon><link rel="shortcut icon" href=https://learnbyexample.github.io/favicon.png><body><div class=container><div class=mobile-navbar id=mobile-navbar><div class=mobile-header-logo><a class=logo href=/>learnbyexample</a></div><div class="mobile-navbar-icon icon-out"><span></span><span></span><span></span></div></div><nav class="mobile-menu slideout-menu slideout-menu-left" id=mobile-menu><ul class=mobile-menu-list><li class=mobile-menu-item><a href=https://learnbyexample.github.io/books> Books </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/mini> Mini </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/tips> Tips </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/tags> Tags </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/about> About </a></ul></nav><header id=header><div class=logo><a href=https://learnbyexample.github.io>learnbyexample</a></div><nav class=menu><ul><li><a href=https://learnbyexample.github.io/books> Books </a><li><a href=https://learnbyexample.github.io/mini> Mini </a><li><a href=https://learnbyexample.github.io/tips> Tips </a><li><a href=https://learnbyexample.github.io/tags> Tags </a><li><a href=https://learnbyexample.github.io/about> About </a></ul></nav></header><main><div class=content id=mobile-panel><div class=post-toc id=post-toc><h2 class=post-toc-title>Contents</h2><div class="post-toc-content always-active"><nav id=TableOfContents><ul><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-possessive-quantifier/#backtracking>Backtracking</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-possessive-quantifier/#possessive-quantifiers>Possessive quantifiers</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-possessive-quantifier/#atomic-grouping>Atomic grouping</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-possessive-quantifier/#catastrophic-backtracking>Catastrophic Backtracking</a></ul></nav></div></div><article class=post><header class=post__header><h1 class=post__title><a href=https://learnbyexample.github.io/python-regex-possessive-quantifier/>Python 3.11: possessive quantifiers and atomic grouping added to re module</a></h1><div class=post__meta><span class=post__time>2022-05-07</span></div></header><div class=post-content><p>Quoting from <a href=https://docs.python.org/3.11/whatsnew/3.11.html>What's New In Python 3.11</a>:<blockquote><p>Atomic grouping (<code>(?>...)</code>) and possessive quantifiers (<code>*+</code>, <code>++</code>, <code>?+</code>, <code>{m,n}+</code>) are now supported in regular expressions. (Contributed by Jeffrey C. Jacobs and Serhiy Storchaka in <a href=https://github.com/python/cpython/issues/34627>bpo-433030</a>.)</blockquote><p align=center><img alt="Python possessive quantifiers and atomic grouping" src=/images/python_possessive_quantifiers.png><p align=center><i>Poster created using <a href=https://www.canva.com/>Canva</a></i><p><img alt=info src=/images/info.svg> If you are not familiar with regular expressions, see my <a href=https://github.com/learnbyexample/py_regular_expressions>Understanding Python re(gex)?</a> ebook to get started.</p><span id=continue-reading></span><br><h2 id=backtracking>Backtracking<a aria-label="Anchor link for: backtracking" class=zola-anchor href=#backtracking>π</a></h2><p>Greedy quantifiers match as much as possible, provided the overall regex is satisfied. For example, <code>:.*</code> will match <code>:</code> followed by rest of the input line. However, if you change the pattern to <code>:.*apple</code>, the <code>.*</code> portion cannot simply consume the rest of the input line. The regex engine will have to find the largest portion such that <code>apple</code> is also part of the match (provided the input has such a string, of course).<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> import </span><span>re
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'fig:mango:pineapple:guava:apples:orange'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>:</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>':mango:pineapple:guava:apples:orange'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>:</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*</span><span style=color:#7c8f4c;>apple</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>':mango:pineapple:guava:apple'
</span></code></pre><p>For the <code>:.*apple</code> case, the Python regular expression engine actually does consume all the characters on seeing <code>.*</code>. Then realizing that the overall match failed, it gives back one character from the end of line and checks again. This process is repeated until a match is found or failure is confirmed. In regular expression parlance, this is called <strong>backtracking</strong>.<p>This type of exploring matches to satisfy overall regex also applies to non-greedy quantifiers. <code>.*?</code> will start with zero characters followed by one, two, three and so on until a match is found.<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'fig:mango:pineapple:guava:apples:orange'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>:</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*?</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>':'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>:</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*?</span><span style=color:#7c8f4c;>apple</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>':mango:pineapple'
</span></code></pre><blockquote><p><img alt=info src=/images/info.svg> Note that some regex engines like <a href=https://github.com/google/re2>re2</a> do not use backtracking.</blockquote><br><h2 id=possessive-quantifiers>Possessive quantifiers<a aria-label="Anchor link for: possessive-quantifiers" class=zola-anchor href=#possessive-quantifiers>π</a></h2><p>Until Python 3.10, you had to use alternatives like the third-party <a href=https://pypi.org/project/regex/>regex module</a> for possessive quantifiers. The <code>re</code> module supports possessive quantifiers from Python 3.11 version.<p>The difference between greedy and possessive quantifiers is that possessive will not backtrack to find a match. In other words, possessive quantifiers will always consume every character that matches the pattern on which it is applied. Syntax wise, you need to append <code>+</code> to greedy quantifiers to make it possessive, similar to adding <code>?</code> for non-greedy case.<p>Unlike greedy or non-greedy quantifiers, <code>:.*+apple</code> will never match, because <code>.*+</code> will consume rest of the line, leaving no way to match <code>apple</code>.<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span>$ python3</span><span style=color:#b3933a;>.11 </span><span style=color:#72ab00;>-</span><span>q
</span><span style=color:#72ab00;>>>> import </span><span>re
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'fig:mango:pineapple:guava:apples:orange'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>:</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*+</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>':mango:pineapple:guava:apples:orange'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>:</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*+</span><span style=color:#7c8f4c;>apple</span><span style=color:#d07711;>'</span><span>, ip))
</span><span style=color:#b3933a;>False
</span></code></pre><p>Here's a more practical example. Suppose you want to match integer numbers greater than or equal to <code>100</code> where these numbers can optionally have leading zeros.<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>numbers </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'42 314 001 12 00984'
</span><span>
</span><span style=color:#7f8989;># this solution fails because 0* and \d{3,} can both match leading zeros
</span><span style=color:#7f8989;># and greedy quantifiers will give up characters to help overall regex succeed
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'001'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
</span><span>
</span><span style=color:#7f8989;># here 0*+ will not give back leading zeros after they are consumed
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*+</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
</span><span>
</span><span style=color:#7f8989;># workaround if possessive quantifiers are not supported
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>[</span><span style=color:#b3933a;>1-9</span><span style=color:#aeb52b;>]\d</span><span style=color:#72ab00;>{2,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
</span></code></pre><p>Here's another example. The goal is to match lines whose first non-whitespace character is not a <code>#</code> character. A matching line should have at least one non-<code>#</code> character, so empty lines and those with only whitespace characters should not match.<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>lines </span><span style=color:#72ab00;>= </span><span>[</span><span style=color:#d07711;>'#comment'</span><span>, </span><span style=color:#d07711;>'c = "#"'</span><span>, </span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\t</span><span style=color:#d07711;> #comment'</span><span>, </span><span style=color:#d07711;>'abc'</span><span>, </span><span style=color:#d07711;>''</span><span>, </span><span style=color:#d07711;>' </span><span style=color:#aeb52b;>\t </span><span style=color:#d07711;>'</span><span>]
</span><span>
</span><span style=color:#7f8989;># this solution fails because \s* can backtrack
</span><span style=color:#7f8989;># and [^#] can match a whitespace character as well
</span><span style=color:#72ab00;>>>> </span><span>[e </span><span style=color:#72ab00;>for </span><span>e </span><span style=color:#72ab00;>in </span><span>lines </span><span style=color:#72ab00;>if </span><span>re.</span><span style=color:#5597d6;>match</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\s</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>#]</span><span style=color:#d07711;>'</span><span>, e)]
</span><span>[</span><span style=color:#d07711;>'c = "#"'</span><span>, </span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\t</span><span style=color:#d07711;> #comment'</span><span>, </span><span style=color:#d07711;>'abc'</span><span>, </span><span style=color:#d07711;>' </span><span style=color:#aeb52b;>\t </span><span style=color:#d07711;>'</span><span>]
</span><span>
</span><span style=color:#7f8989;># this works because \s*+ will not give back any whitespace characters
</span><span style=color:#72ab00;>>>> </span><span>[e </span><span style=color:#72ab00;>for </span><span>e </span><span style=color:#72ab00;>in </span><span>lines </span><span style=color:#72ab00;>if </span><span>re.</span><span style=color:#5597d6;>match</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\s</span><span style=color:#72ab00;>*+</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>#]</span><span style=color:#d07711;>'</span><span>, e)]
</span><span>[</span><span style=color:#d07711;>'c = "#"'</span><span>, </span><span style=color:#d07711;>'abc'</span><span>]
</span><span>
</span><span style=color:#7f8989;># workaround if possessive quantifiers are not supported
</span><span style=color:#72ab00;>>>> </span><span>[e </span><span style=color:#72ab00;>for </span><span>e </span><span style=color:#72ab00;>in </span><span>lines </span><span style=color:#72ab00;>if </span><span>re.</span><span style=color:#5597d6;>match</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\s</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>#</span><span style=color:#b3933a;>\s</span><span style=color:#aeb52b;>]</span><span style=color:#d07711;>'</span><span>, e)]
</span><span>[</span><span style=color:#d07711;>'c = "#"'</span><span>, </span><span style=color:#d07711;>'abc'</span><span>]
</span></code></pre><p align=center><iframe allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" title="YouTube video player" allowfullscreen frameborder=0 height=315 loading=lazy src=https://www.youtube.com/embed/Y7XuZOLdG0o width=560></iframe></p><br><h2 id=atomic-grouping>Atomic grouping<a aria-label="Anchor link for: atomic-grouping" class=zola-anchor href=#atomic-grouping>π</a></h2><p><code>(?>pat)</code> is an atomic group, where <code>pat</code> is the pattern you want to safeguard from further backtracking by isolating it from other parts of the regex.<p>Here's an example with greedy quantifier:<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>numbers </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'42 314 001 12 00984'
</span><span>
</span><span style=color:#7f8989;># 0* is greedy and the (?>) grouping prevents backtracking
</span><span style=color:#7f8989;># same as: re.findall(r'0*+\d{3,}', numbers)
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>?</span><span style=color:#7c8f4c;>>0</span><span style=color:#72ab00;>*</span><span style=color:#7c8f4c;>)</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
</span></code></pre><p>Here's an example with non-greedy quantifier:<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'fig::mango::pineapple::guava::apples::orange'
</span><span>
</span><span style=color:#7f8989;># this matches from the first '::' to the first occurrence of '::apple'
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>::</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*?</span><span style=color:#7c8f4c;>::apple</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>'::mango::pineapple::guava::apple'
</span><span>
</span><span style=color:#7f8989;># '(?>::.*?::)' will match only from '::' to the very next '::'
</span><span style=color:#7f8989;># '::mango::' fails because 'apple' isn't found afterwards
</span><span style=color:#7f8989;># similarly '::pineapple::' fails
</span><span style=color:#7f8989;># '::guava::' succeeds because it is followed by 'apple'
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>?</span><span style=color:#7c8f4c;>>::</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*?</span><span style=color:#7c8f4c;>::)apple</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>'::guava::apple'
</span></code></pre><p align=center><iframe allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" title="YouTube video player" allowfullscreen frameborder=0 height=315 loading=lazy src=https://www.youtube.com/embed/DQfcST_XN_E width=560></iframe><blockquote><p><img alt=info src=/images/info.svg> The <a href=https://pypi.org/project/regex/>regex module</a> has a <code>regex.REVERSE</code> flag to match from right-to-left making it better suited than atomic grouping for certain cases.<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> import </span><span>regex
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'fig::mango::pineapple::guava::apples::orange'
</span><span style=color:#72ab00;>>>> </span><span>regex.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>?</span><span style=color:#7c8f4c;>r)::</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*?</span><span style=color:#7c8f4c;>::apple</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>'::guava::apple'
</span><span>
</span><span style=color:#7f8989;># this won't be possible with just atomic grouping
</span><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'and this book is good and those are okay and that movie is bad'
</span><span style=color:#72ab00;>>>> </span><span>regex.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>?</span><span style=color:#7c8f4c;>r)th</span><span style=color:#aeb52b;>.</span><span style=color:#72ab00;>*?\b</span><span style=color:#7c8f4c;>is bad</span><span style=color:#d07711;>'</span><span>, ip)[</span><span style=color:#b3933a;>0</span><span>]
</span><span style=color:#d07711;>'that movie is bad'
</span></code></pre></blockquote><br><h2 id=catastrophic-backtracking>Catastrophic Backtracking<a aria-label="Anchor link for: catastrophic-backtracking" class=zola-anchor href=#catastrophic-backtracking>π</a></h2><p>Backtracking can become significantly time consuming for certain corner cases. Which is why some regex engines do not use them, at the cost of not supporting some features like lookarounds. If your application accepts user defined regex, you might need to protect against such catastrophic patterns. From <a href=https://en.wikipedia.org/wiki/Redos>wikipedia: ReDoS</a>:<blockquote><p>A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or becoming unresponsive.</blockquote><p>Here's an example:<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> from </span><span>timeit </span><span style=color:#72ab00;>import </span><span>timeit
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>greedy </span><span style=color:#72ab00;>= </span><span>re.</span><span style=color:#5597d6;>compile</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(a</span><span style=color:#72ab00;>+|</span><span style=color:#aeb52b;>\w</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>)</span><span style=color:#72ab00;>*</span><span style=color:#7c8f4c;>:</span><span style=color:#d07711;>'</span><span>)
</span><span style=color:#72ab00;>>>> </span><span>possessive </span><span style=color:#72ab00;>= </span><span>re.</span><span style=color:#5597d6;>compile</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(a</span><span style=color:#72ab00;>+|</span><span style=color:#aeb52b;>\w</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>)</span><span style=color:#72ab00;>*+</span><span style=color:#7c8f4c;>:</span><span style=color:#d07711;>'</span><span>)
</span><span>
</span><span style=color:#7f8989;># string that'll match the above patterns
</span><span style=color:#72ab00;>>>> </span><span>s1 </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'aaaaaaaaaaaaaaaa:123'
</span><span style=color:#7f8989;># string that does NOT match the above patterns
</span><span style=color:#72ab00;>>>> </span><span>s2 </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'aaaaaaaaaaaaaaaa-123'
</span><span>
</span><span style=color:#7f8989;># no issues when input string has a match
</span><span style=color:#72ab00;>>>> </span><span style=color:#5597d6;>timeit</span><span>(</span><span style=color:#d07711;>'greedy.search(s1)'</span><span>, </span><span style=color:#5597d6;>number</span><span style=color:#72ab00;>=</span><span style=color:#b3933a;>10000</span><span>, </span><span style=color:#5597d6;>globals</span><span style=color:#72ab00;>=</span><span style=color:#b39f04;>globals</span><span>())
</span><span style=color:#b3933a;>0.016464739997900324
</span><span style=color:#72ab00;>>>> </span><span style=color:#5597d6;>timeit</span><span>(</span><span style=color:#d07711;>'possessive.search(s1)'</span><span>, </span><span style=color:#5597d6;>number</span><span style=color:#72ab00;>=</span><span style=color:#b3933a;>10000</span><span>, </span><span style=color:#5597d6;>globals</span><span style=color:#72ab00;>=</span><span style=color:#b39f04;>globals</span><span>())
</span><span style=color:#b3933a;>0.016358205997676123
</span><span>
</span><span style=color:#7f8989;># if input doesn't match, greedy version suffers from catastrophic backtracking
</span><span style=color:#7f8989;># note that 'number' parameter is reduced to 10 since it takes a long time
</span><span style=color:#72ab00;>>>> </span><span style=color:#5597d6;>timeit</span><span>(</span><span style=color:#d07711;>'greedy.search(s2)'</span><span>, </span><span style=color:#5597d6;>number</span><span style=color:#72ab00;>=</span><span style=color:#b3933a;>10</span><span>, </span><span style=color:#5597d6;>globals</span><span style=color:#72ab00;>=</span><span style=color:#b39f04;>globals</span><span>())
</span><span style=color:#b3933a;>53.71723825200024
</span><span style=color:#72ab00;>>>> </span><span style=color:#5597d6;>timeit</span><span>(</span><span style=color:#d07711;>'possessive.search(s2)'</span><span>, </span><span style=color:#5597d6;>number</span><span style=color:#72ab00;>=</span><span style=color:#b3933a;>10</span><span>, </span><span style=color:#5597d6;>globals</span><span style=color:#72ab00;>=</span><span style=color:#b39f04;>globals</span><span>())
</span><span style=color:#b3933a;>0.00019008600065717474
</span></code></pre><p><code>(a+|\w+)*:</code> is a silly regex pattern, since it can be rewritten as <code>\w*:</code> which will not suffer from catastrophic backtracking. But this example shows how quantifiers applied to a group with multiple alternatives using quantifiers can lead to explosive results. More such patterns and mitigation strategies can be found in the following links:<ul><li><a href=https://www.rexegg.com/regex-explosive-quantifiers.html>The Explosive Quantifier Trap</a><li><a href=https://www.regular-expressions.info/catastrophic.html>Runaway Regular Expressions: Catastrophic Backtracking</a><li><a href=https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/>Details of the Cloudflare outage on July 2, 2019</a></ul></div><div class=post-footer><div class=post-tags><a href=https://learnbyexample.github.io/tags/python/>#python</a><a href=https://learnbyexample.github.io/tags/regular-expressions/>#regular-expressions</a><a href=https://learnbyexample.github.io/tags/possessive-quantifiers/>#possessive-quantifiers</a><a href=https://learnbyexample.github.io/tags/atomic-grouping/>#atomic-grouping</a></div><hr color=#e6e6e6><div class=post-nav><p><a class=previous href=https://learnbyexample.github.io/textual-first-impressions/>β Building TUIs with textual: first impressions</a><br><p><a class=next href=https://learnbyexample.github.io/duplicates-irrespective-field-order/>Removing duplicates irrespective of field order β</a><br></div><hr color=#e6e6e6><p>π° Use <a href=https://learnbyexample.github.io/atom.xml>this link</a> for the Atom feed. <br> β
Follow me on <a href=https://twitter.com/learn_byexample>Twitter</a>, <a href=https://github.com/learnbyexample>GitHub</a> and <a href=https://www.youtube.com/c/learnbyexample42>Youtube</a> for interesting tech nuggets. <br> π§ Subscribe to <a href=https://learnbyexample.gumroad.com/l/learnbyexample-weekly>learnbyexample weekly</a> for programming resources, tips, tools, free ebooks and more (free newsletter, delivered every Friday).<hr color=#e6e6e6></div></article></div></main></div><script src=https://learnbyexample.github.io/even.js></script>