# Regular Expressions

- re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.
- re.findall() returns a list of all non-overlapping substrings that match a pattern (or of the captured groups, if the pattern contains groups).
- re.finditer() returns an iterator of match objects, one for each non-overlapping match of the pattern.
- re.split() splits the string at each match of the pattern and returns the list of resulting pieces.
- re.search() returns a match object for the first location where the regex matches, otherwise it returns None.
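To make the differences between these functions concrete, here is a minimal sketch. The sample string `text` and the variable names are made up for illustration; only the `re` functions come from the list above.

```python
import re

text = "cat bat rat"  # hypothetical sample string

# re.match only tries position 0, so 'bat' is not found there.
at_start = re.match(r"bat", text)
print(at_start)                      # None

# re.search scans the whole string and stops at the first hit.
anywhere = re.search(r"bat", text)
print(anywhere.span())               # (4, 7)

# re.findall returns the matched substrings as plain strings.
all_words = re.findall(r"[a-z]at", text)
print(all_words)                     # ['cat', 'bat', 'rat']

# re.finditer yields match objects, so positions are available too.
positions = [(m.group(), m.start()) for m in re.finditer(r"[a-z]at", text)]
print(positions)                     # [('cat', 0), ('bat', 4), ('rat', 8)]

# re.split cuts the string at every match of the pattern.
pieces = re.split(r"\s", text)
print(pieces)                        # ['cat', 'bat', 'rat']
```

Use re.finditer() rather than re.findall() when you need the position of each match and not just its text.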

<details>
<summary><Strong>More functions</Strong></summary>

re.subn() does the same thing as re.sub(), but returns a tuple of the new string and the number of substitutions made.

re.sub() replaces every non-overlapping occurrence of the pattern in string with repl, unless count is given, in which case at most count occurrences are replaced. It returns the modified string.

re.compile() compiles a regex into a regex object, which can be called with methods such as search() and findall().

re.IGNORECASE or re.I makes the regex case-insensitive.

re.VERBOSE or re.X allows whitespace and comments within the regex string to make it more readable.

re.MULTILINE or re.M makes ^ and $ match at the start and end of every line, not just at the start and end of the whole string.

re.DOTALL or re.S makes the . character match every character including the newline character.

re.ASCII or re.A makes \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only, instead of the full Unicode range.

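re.UNICODE or re.U makes \w, \d, \s and related classes follow Unicode rules. In Python 3 this is already the default for str patterns, so the flag is redundant and kept only for backward compatibility.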

re.DEBUG displays debugging information about the compiled regex. It has no one-letter alias (re.X is the alias of re.VERBOSE).

re.escape() escapes all special characters in a string so that the string can be used in a regex pattern.

re.purge() clears the regular expression cache.

re.fullmatch() returns a match object if the regex matches the string completely, otherwise it returns None.
</details>
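The functions and flags listed under "More functions" can be tried out together. This is only a sketch: the sample strings and all variable names (`currency`, `phone`, and so on) are assumptions made for the example.

```python
import re

text = "Price: 10 USD, 20 usd, 30 Usd"  # hypothetical sample string

# Compile once with a flag, then reuse the pattern object's methods.
currency = re.compile(r"usd", flags=re.IGNORECASE)

replaced_all = currency.sub("EUR", text)            # every occurrence
replaced_once = currency.sub("EUR", text, count=1)  # at most one occurrence
new_text, n_subs = currency.subn("EUR", text)       # (new string, number of substitutions)
print(replaced_all)   # Price: 10 EUR, 20 EUR, 30 EUR
print(replaced_once)  # Price: 10 EUR, 20 usd, 30 Usd
print(n_subs)         # 3

lines = "one\ntwo\nthree"

# Without re.MULTILINE, '^' matches only at the very start of the string.
first_only = re.findall(r"^\w", lines)                       # ['o']
# With re.MULTILINE, '^' also matches right after each newline.
each_line = re.findall(r"^\w", lines, flags=re.MULTILINE)    # ['o', 't', 't']

# '.' skips newlines unless re.DOTALL is set.
no_dotall = re.search(r"one.two", lines)                     # None
with_dotall = re.search(r"one.two", lines, flags=re.DOTALL)  # matches 'one\ntwo'

# re.VERBOSE ignores whitespace in the pattern and allows '#' comments.
phone = re.compile(r"""
    \d{3}     # three-digit prefix
    -         # literal hyphen
    \d{3,8}   # three to eight more digits
""", flags=re.VERBOSE)
phone_ok = phone.fullmatch("010-12345")

# re.escape makes '+' literal; re.fullmatch requires the whole string to match.
escaped_ok = re.fullmatch(re.escape("1+1=2"), "1+1=2")  # match object
unescaped = re.fullmatch(r"1+1=2", "1+1=2")             # None: '+' acts as a quantifier
```

Compiling a pattern is mostly worthwhile when the same regex is reused many times; for one-off calls the module-level functions behave the same way.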

In [1]:
import re

# re.match(pattern, string, flags=0) # flags set the matching mode, e.g. re.IGNORECASE, re.MULTILINE etc.
re.match('abc', 'abcde') # match

<re.Match object; span=(0, 3), match='abc'>

In [2]:
re.match('abc', 'abxde')  # No match

In [3]:
re.match(r'^\d{3}\-\d{3,8}$', '010-12345') # match

<re.Match object; span=(0, 9), match='010-12345'>

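Inside a character class [...], '^' negates the class, but only when it is the first character. Outside a class, as in the pattern above, it anchors the match to the start of the string.<br>
'-' specifies a range of characters when placed between two characters, e.g. [a-z]. To match a literal hyphen inside a class, put it first or last, or escape it.<br>
'\' escapes the character that follows it so that the character is treated literally. It can appear anywhere in the pattern, inside or outside a character class.<br>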
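The roles of '^', '-' and '\' inside a character class are easiest to see side by side. The sketch below uses a made-up string `sample` that contains all three characters; the variable names are illustrative only.

```python
import re

sample = "a-b^c\\d 42"  # hypothetical string: a - b ^ c \ d space 4 2

# '^' as the first character negates the class: everything except a-z.
non_letters = re.findall(r"[^a-z]", sample)
print(non_letters)   # ['-', '^', '\\', ' ', '4', '2']

# '^' anywhere else in the class is a literal caret.
carets = re.findall(r"[a^]", sample)
print(carets)        # ['a', '^']

# '-' between two characters defines a range.
in_range = re.findall(r"[a-c]", sample)
print(in_range)      # ['a', 'b', 'c']

# '-' in last position is a literal hyphen.
with_hyphen = re.findall(r"[ac-]", sample)
print(with_hyphen)   # ['a', '-', 'c']

# '\' escapes the next character wherever it appears in the class;
# this class contains a literal backslash and a literal hyphen.
escaped = re.findall(r"[\\\-]", sample)
print(escaped)       # ['-', '\\']

# Outside a class, '^' and '$' are anchors for the start and end of the string.
anchored = re.match(r"^\d{3}-\d{3,8}$", "010-12345")
print(anchored.group())  # 010-12345
```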

In [4]:
word_regex = r'\w+'  # One or more word characters: a-z, A-Z, 0-9 and underscore (plus other Unicode letters and digits for str patterns).
digit_regex = r'\d+' # One or more digits.
space_regex = r'\s' # A single whitespace character: space, tab, newline, carriage return, etc.
wildcard_regex = r'.' # Any single character except newline.
greedymatch_regex = r'.+' # Greedy: '+' (or '*') matches as many characters as possible. A bare r'+' is not a valid pattern.
non_greedy_regex = r'.+?' # Non-greedy: '+?' (or '*?') matches as few characters as possible.
nonspace_regex = r'\S' # A single character that is not whitespace.
nonword_regex = r'\W' # A single character that is not a word character.
non_digit_regex = r'\D' # A single character that is not a digit.
lowercase_regex = r'[a-z]' # A single lowercase letter a-z.
uppercase_regex = r'[A-Z]' # A single uppercase letter A-Z.
letter_regex = r'[a-zA-Z]' # A single letter a-z or A-Z.
alphanumeric_regex = r'[a-zA-Z0-9]' # A single character from a-z, A-Z or 0-9.
not_letter_regex = r'[^a-zA-Z]' # A single character that is not a letter a-z or A-Z.
not_alphanumeric_regex = r'[^a-zA-Z0-9]' # A single character that is not a-z, A-Z or 0-9.
beginning_regex = r'^Hello' # The string must begin with Hello.
end_regex = r'world!$' # The string must end with world!.
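Two distinctions in the cell above are easy to miss: a class such as \d matches one character unless a quantifier follows it, and quantifiers are greedy unless followed by '?'. A small sketch, with sample strings invented for the illustration:

```python
import re

# A bare class matches exactly one character; '+' repeats it.
single_digits = re.findall(r"\d", "a12b345")
digit_runs = re.findall(r"\d+", "a12b345")
print(single_digits)  # ['1', '2', '3', '4', '5']
print(digit_runs)     # ['12', '345']

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample string

# Greedy: '.+' grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)
print(greedy)         # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: '.+?' stops at the first '>' it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", html)
print(lazy)           # ['<b>', '</b>', '<i>', '</i>']
```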

In [5]:
re.match(word_regex, 'hi there!') # match

<re.Match object; span=(0, 2), match='hi'>

In [6]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))


["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
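The split on sentence endings above leaves leading spaces and a trailing empty string in the result. One possible way to tidy this up, shown here only as a sketch with patterns that are not part of the original exercise, is to let the split pattern consume the whitespace after the punctuation and to drop empty pieces:

```python
import re

my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Consume the punctuation mark plus any whitespace after it, then discard empty pieces.
sentences = [s for s in re.split(r"[.?!]\s*", my_string) if s]
print(sentences)
# ["Let's write RegEx", "Won't that be fun", 'I sure think so', 'Can you find 4 sentences', 'Or perhaps, all 19 words']

# Treat the apostrophe as part of a word so "Let's" and "Won't" stay whole.
words = re.findall(r"[\w']+", my_string)
print(len(words))  # 21
```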


# NLTK Library

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\ghimi/nltk_data'
    - 'c:\\Users\\ghimi\\AppData\\Local\\Programs\\Python\\Python312\\nltk_data'
    - 'c:\\Users\\ghimi\\AppData\\Local\\Programs\\Python\\Python312\\share\\nltk_data'
    - 'c:\\Users\\ghimi\\AppData\\Local\\Programs\\Python\\Python312\\lib\\nltk_data'
    - 'C:\\Users\\ghimi\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
