# Advanced tokenization with regex
---------------
1. Alternation (OR) is written with `|`.
2. A group is defined with `()`.
3. An explicit character range is defined with `[]`.
4. Ranges `[]` and groups `()` can be combined with `|` in a single pattern.
![advanced regex](https://github.com/rritec/datahexa/blob/dev/images/ds_advanced.png?raw=true)
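
The three building blocks listed above can be tried out with the standard `re` module. This is a minimal sketch; the sample sentence and the variable names are chosen only for illustration.

```python
import re

text = "he has 8 dogs and 11 cats"

# "|" means OR: match either of the two words.
either_word = re.findall(r"dogs|cats", text)

# "[]" defines an explicit character range: here, any run of the digits 0-9.
digit_runs = re.findall(r"[0-9]+", text)

# "()" defines a group; with two groups, findall returns one tuple per match.
pairs = re.findall(r"(\d+) (dogs|cats)", text)

print(either_word)  # ['dogs', 'cats']
print(digit_runs)   # ['8', '11']
print(pairs)        # [('8', 'dogs'), ('11', 'cats')]
```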


## Example 1: Capturing digits and letters

In [2]:
import re
# Only digits (raw string, so that \d is passed to the regex engine unchanged)
tokenize_digits = r'\d+'

In [3]:
re.findall(tokenize_digits,"he has 8 dogs and 11 cats")

['8', '11']

In [4]:
# Only lowercase letters
tokenize_words = r'[a-z]+'

In [5]:
re.findall(tokenize_words,"he has 8 dogs and 11 cats")

['he', 'has', 'dogs', 'and', 'cats']

In [6]:
# Both digits and lowercase letters
tokenize_digits_and_words = r'[a-z]+|\d+'

In [7]:
re.findall(tokenize_digits_and_words,"he has 8 dogs and 11 cats")

['he', 'has', '8', 'dogs', 'and', '11', 'cats']
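
The range `[a-z]` covers lowercase letters only, which works here because the sample sentence has no capitals. The sketch below, using a hypothetical mixed-case sentence, shows what happens otherwise and two possible ways to handle it.

```python
import re

# Hypothetical mixed-case variant of the sentence used above.
sentence = "He has 8 Dogs and 11 cats"

# Lowercase-only range: capital letters are skipped, so words get truncated.
lower_only = re.findall(r"[a-z]+|\d+", sentence)

# Option 1: widen the range to cover both cases.
both_cases = re.findall(r"[A-Za-z]+|\d+", sentence)

# Option 2: keep the range and ask the engine to ignore case.
ignore_case = re.findall(r"[a-z]+|\d+", sentence, flags=re.IGNORECASE)

print(lower_only)   # ['e', 'has', '8', 'ogs', 'and', '11', 'cats']
print(both_cases)   # ['He', 'has', '8', 'Dogs', 'and', '11', 'cats']
print(ignore_case)  # ['He', 'has', '8', 'Dogs', 'and', '11', 'cats']
```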

## Example 2: Capture up to the comma

In [8]:
import re

In [9]:
my_str = 'Match lowercase spaces nums like 12, but no commas'

In [10]:
re.match?

In [11]:
re.match('[A-Za-z0-9 ]+', my_str)

<re.Match object; span=(0, 35), match='Match lowercase spaces nums like 12'>
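
The match object carries the matched text and its position. As a small sketch, the snippet below pulls both out and also contrasts `re.match`, which only tries at the start of the string, with `re.search`, which scans forward. The variable names are illustrative.

```python
import re

my_str = 'Match lowercase spaces nums like 12, but no commas'

# The character class has no comma, so the match stops right before it.
m = re.match(r'[A-Za-z0-9 ]+', my_str)
matched_text = m.group()  # the text that matched
matched_span = m.span()   # (start, end) indices into my_str

# re.match anchors at position 0; the string does not start with a digit.
digits_at_start = re.match(r'\d+', my_str)

# re.search looks for the first match anywhere in the string.
first_number = re.search(r'\d+', my_str).group()

print(matched_text)     # Match lowercase spaces nums like 12
print(matched_span)     # (0, 35)
print(digits_at_start)  # None
print(first_number)     # 12
```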

## Example 3: Regexp Tokenize

In [12]:
from nltk import regexp_tokenize

In [13]:
regexp_tokenize?

In [14]:
my_string = "SOLDIER #10: Found them? In Mercea? The coconut's tropical!"

In [15]:
pattern1 = r"(\w+|#\d+|\?|!)"

In [16]:
regexp_tokenize(my_string,pattern=pattern1)

['SOLDIER',
 '#10',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']
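
In the output above, `coconut's` is split into `coconut` and `s` because the apostrophe is not covered by any alternative. One possible adjustment is sketched below. It uses `re.findall` from the standard library as a stand-in, on the assumption that `regexp_tokenize` returns the same tokens for a pattern like this one. The name `pattern_contractions` is hypothetical.

```python
import re

my_string = "SOLDIER #10: Found them? In Mercea? The coconut's tropical!"

# Alternatives are tried left to right, so the contraction form must come
# before the plain \w+ alternative.
pattern_contractions = r"\w+'\w+|\w+|#\d+|\?|!"
tokens = re.findall(pattern_contractions, my_string)

# With the order reversed, \w+ wins first and the word is split again.
pattern_reversed = r"\w+|\w+'\w+|#\d+|\?|!"
tokens_reversed = re.findall(pattern_reversed, my_string)

print(tokens)
# ['SOLDIER', '#10', 'Found', 'them', '?', 'In', 'Mercea', '?', 'The',
#  "coconut's", 'tropical', '!']
print("coconut's" in tokens_reversed)  # False
```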

## Example 4: Regex with NLTK tweet tokenization

In [17]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

In [18]:
regexp_tokenize?

In [1]:
tweets = ['This is the best #nlp exercise ive found online! #python',
          '#NLP is super fun! <3 #learning',
          'Thanks @RRITEC :) #nlp #python']

In [2]:
len(tweets)

3

In [20]:
tweets[0]

'This is the best #nlp exercise ive found online! #python'

In [21]:
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

In [22]:
# Use the pattern on the first tweet in the tweets list
regexp_tokenize(tweets[0], pattern1)

['#nlp', '#python']

In [23]:
tweets[-1]

'Thanks @RRITEC :) #nlp #python'

In [24]:
# Write a pattern that matches both mentions (@) and hashtags (#)
pattern2 = r"[#@]\w+"

In [25]:
# Use the pattern on the last tweet in the tweets list
regexp_tokenize(tweets[-1], pattern2)

['@RRITEC', '#nlp', '#python']
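
Inside a character class `[]`, the `|` character is a literal pipe and not an OR, so `[#@]` is enough to accept either symbol. The sketch below uses the standard `re` module and a hypothetical string containing a pipe to show the difference.

```python
import re

# Hypothetical text: the last tweet with an extra "a|b" appended.
text = "Thanks @RRITEC :) #nlp a|b"

# The pipe is part of the class, so "|b" is picked up as well.
with_pipe = re.findall(r"[#|@]\w+", text)

# Listing only the two wanted symbols avoids the stray match.
without_pipe = re.findall(r"[#@]\w+", text)

print(with_pipe)     # ['@RRITEC', '#nlp', '|b']
print(without_pipe)  # ['@RRITEC', '#nlp']
```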

In [26]:
# Use the TweetTokenizer to tokenize every tweet (one token list per tweet)
tknzr = TweetTokenizer()

In [27]:
all_tokens = [tknzr.tokenize(t) for t in tweets]

In [28]:
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@RRITEC', ':)', '#nlp', '#python']]
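
`all_tokens` holds one list per tweet. If a single flat list is wanted, for example to count hashtags across all tweets, the nested lists can be chained together. This is a sketch using only the standard library. The token lists are copied from the output above, and `flat_tokens` and `hashtag_counts` are illustrative names.

```python
from collections import Counter
from itertools import chain

# Token lists copied from the TweetTokenizer output above.
all_tokens = [
    ['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found',
     'online', '!', '#python'],
    ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'],
    ['Thanks', '@RRITEC', ':)', '#nlp', '#python'],
]

# Flatten the list of lists into one list of tokens.
flat_tokens = list(chain.from_iterable(all_tokens))

# Count hashtags, lowercased so that '#NLP' and '#nlp' are counted together.
hashtag_counts = Counter(t.lower() for t in flat_tokens if t.startswith('#'))

print(len(flat_tokens))                # 23
print(hashtag_counts.most_common())
# [('#nlp', 3), ('#python', 2), ('#learning', 1)]
```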


## Example 5: Non-ASCII tokenization

![non ascii](https://github.com/rritec/datahexa/blob/dev/images/ds%20nonascii.png?raw=true)

In [29]:
from nltk.tokenize import word_tokenize

In [30]:
# Create a string
german_text="Wann gehen wir zur Pizza? 🍕 Und fahren Sie mit vorbei? 🚕 "

In [31]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)

In [32]:
print(all_words)

['Wann', 'gehen', 'wir', 'zur', 'Pizza', '?', '🍕', 'Und', 'fahren', 'Sie', 'mit', 'vorbei', '?', '🚕']


In [34]:
# Tokenize and print only capitalized words
capital_words = r"[A-Z]\w+"
print(regexp_tokenize(german_text, capital_words))

['Wann', 'Pizza', 'Und', 'Sie']
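
The range `[A-Z]` covers ASCII capitals only, so German words that start with an umlaut would be missed. The sketch below uses a hypothetical sentence and the standard `re` module to show one way to extend the range. It relies on `\w` matching Unicode letters in Python 3 string patterns.

```python
import re

# Hypothetical sentence with capitalized words that start with umlauts.
text = "Wann fahren wir nach Österreich? Über München."

# ASCII-only range: 'Österreich' and 'Über' are not found.
ascii_capitals = re.findall(r"[A-Z]\w+", text)

# Range extended with the German capital umlauts.
german_capitals = re.findall(r"[A-ZÄÖÜ]\w+", text)

print(ascii_capitals)   # ['Wann', 'München']
print(german_capitals)  # ['Wann', 'Österreich', 'Über', 'München']
```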


In [35]:
# Tokenize and print only emoji
# Inside [] the ranges are simply listed one after another; quotes and "|" would be matched literally
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

['🍕', '🚕']
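
A character class matches one character at a time, so two emoji in a row come back as two tokens. If a run of emoji should stay together, a `+` can be added after the class. The sketch below uses the standard `re` module and a hypothetical sample string. The listed ranges are the ones used in this example and do not cover every emoji block.

```python
import re

# Same ranges as in the example, followed by "+" to match runs of emoji.
EMOJI_RUN = (
    "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]+"
)

# Hypothetical sample with two pizza emoji next to each other.
sample = "Pizza? 🍕🍕 Taxi! 🚕"

emoji_runs = re.findall(EMOJI_RUN, sample)
print(emoji_runs)  # ['🍕🍕', '🚕']
```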
