# How to Create a Regular Expression to Extract Emoji in Python
(Updated with version 13.1)

A quick journey from the [raw emoji-test](https://unicode.org/Public/emoji/13.1/emoji-test.txt) text file, to a Python regular expression to extract all emoji. And yes, a CSV file that can be imported as a DataFrame for general use.

The dataset also provides additional functionality for emoji for the [advertools online marketing package](https://github.com/eliasdabbas/advertools): 
* As a DataFrame `emoji_df`
* As a search option to search for emoji [`advertools.emoji_search`](https://advertools.readthedocs.io/en/master/advertools.emoji.html)
* One of the `extract_` functions that [extract emoji](https://advertools.readthedocs.io/en/master/advertools.extract.html#advertools.extract.extract_emoji) from a text list, together with statistics about their occurrences, categories, and sub-categories.

How they were extracted...

I manually downloaded the file, and here we can open and inspect the first rows.

In [None]:
import re
from collections import namedtuple, Counter

with open('../input/emoji-data-descriptions-codepoints/emoji-test.txt', 'rt') as file:
    emoji_raw = file.read()
print(emoji_raw[:2800])

The first few lines explain some details about the file and how the data are represented. The remainder looks like the last lines printed above. Each line represents an emoji, and whenever a new group and/or sub-group starts, it is listed on a line starting with # and the name of the group/sub-group, to show which group/sub-group the following emoji belong to. 

We will go through the lines, one by one, and extract the information that we need and then put them in an easy-to-use format (`namedtuple`) so we can then use them to create the regex and the CSV file.  

A few things about emoji need to be understood first in order to get what we want done. 

# Single and multi code point emoji

Some emoji can simply be thought of as regular characters.

In [None]:
print('\U00000063')  # the lower-case letter "c" for example

In [None]:
print('\U0001F44D')

But what about the similar emoji 👍🏿?  
Let's first compare the two.

In [None]:
len('👍'), len('👍🏿')

# 🤔

In [None]:
print('👍🏿'[0], '👍🏿'[1])

In [None]:
import unicodedata
unicodedata.name('👍'), unicodedata.name('👍🏿'[0]), unicodedata.name('👍🏿'[1])

The generic yellow-colored emoji is basically one character. The others (with different skin tones), are two characters; the first is the same generic emoji, and the second is simply a coloring square. There are five skin tones available. 
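
To make the one-versus-two code point distinction concrete, here is a small sketch that lists the code points behind a string. The helper name `describe` is made up for this illustration and is not part of the notebook's pipeline.

```python
import unicodedata

def describe(text):
    """Return (hex code point, Unicode name) pairs for each character in text."""
    return [(f'{ord(char):04X}', unicodedata.name(char, 'UNKNOWN')) for char in text]

thumbs_up = '\U0001F44D'
thumbs_up_dark = '\U0001F44D\U0001F3FF'  # base emoji followed by a skin tone modifier

print(describe(thumbs_up))
print(describe(thumbs_up_dark))
```

The second call returns two pairs: the same base code point as the first call, followed by the modifier.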

# Long and short words in regular expressions 
An important aspect of how regular expressions find their matches is that alternation is "eager" (this applies to regex-directed engines, which is what Python uses, and not to text-directed ones). This means that when presented with several options, the regex returns the first option that matches, even if a later option would match more text.  
Let's say you want to find the words "rest", and "restaurant" in a document.  
The regex is straightforward: `rest|restaurant`. 
Let's see eagerness in action:

In [None]:
s = 'The rest of my friends are at the restaurant.'
regex = re.compile('rest|restaurant')
regex.findall(s)

The regex goes through the options from left to right, and returns the match if it finds one immediately. In this example, it found a match for "rest" in the second word of the sentence, and then found another match for "rest" in the last word. After finding the second match, the regex is now at the first "a" in "restaurant", because the regex has already 'consumed' the "rest" part of "restaurant".  
In a large text, you would get the false impression that there is no occurrence of the word "restaurant". The fix is easy. We simply put the long word(s) first, so the regex can check for their matches first. If it doesn't find a match (as it won't find in the first "rest"), then the regex will go on to try to match the second available option.

In [None]:
regex2 = re.compile('restaurant|rest')
regex2.findall(s)

In [None]:
thumbs_sentence = 'This is thumbs up: 👍, and this is thumbs up with dark skin tone: 👍🏿'
thumbs_regex = re.compile('👍|👍🏿')

thumbs_regex.findall(thumbs_sentence)

In [None]:
thumbs_regex2 = re.compile('👍🏿|👍')
thumbs_regex2.findall(thumbs_sentence)

Since the dark tone thumbs up emoji is made up of two code points, and since the first one is made of one, we are faced with the same case of "rest" and "restaurant". The regex finds the first word from left to right, and returns it. As in the previous example, putting the longer word first, made sure that we check for it first, and solves the issue. 
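
The "longer word first" rule can also be applied automatically. The following is a minimal sketch, assuming all alternatives are literal strings; `build_alternation` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import re

def build_alternation(words):
    """Join literal alternatives into one regex pattern, longest first."""
    unique = sorted(set(words), key=lambda word: (-len(word), word))
    # re.escape keeps characters like "*" or "#" from being read as regex syntax
    return '|'.join(re.escape(word) for word in unique)

sentence = 'The rest of my friends are at the restaurant.'
pattern = re.compile(build_alternation(['rest', 'restaurant']))
print(pattern.pattern)
print(pattern.findall(sentence))
```

The same helper works for emoji, because a two code point emoji is simply a longer "word" than its one code point base.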


Here are the two emoji represented by code points. You can see that the first part of each of the 'words' is the same. 

In [None]:
print('\U0001F44D', '\U0001F44D\U0001F3FF')  # the U0001F44D code point exists in both

There are five skin tones, as well as four hair types. All of those fall under the group "Component". Those emoji are not supposed to appear on their own, because they don't mean anything by themselves. They function mainly as modifiers for the previous emoji, appearing right after it.  
Here they are, and we will be skipping them when creating the final regex. 

In [None]:
for i, line in enumerate(emoji_raw.splitlines()):
    if '; component' in line:
        print(i, line)

Now we create the data structure that will hold our emoji entries. We will use the `namedtuple` because it has a nice representation, telling us exactly what each element means, as well as giving us the ability to extract those elements by name, using dot notation `entry.name` or `entry.group` for example. 

In [None]:
EmojiEntry = namedtuple('EmojiEntry', ['codepoint', 'status', 'emoji', 'name', 'group', 'sub_group'])
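
As a quick illustration of why a `namedtuple` is convenient here, the sketch below builds one entry by hand (the values are typed in for the example, not read from the file) and accesses it in three ways.

```python
from collections import namedtuple

EmojiEntry = namedtuple('EmojiEntry', ['codepoint', 'status', 'emoji', 'name', 'group', 'sub_group'])

# an illustrative entry, typed by hand for this example
entry = EmojiEntry(codepoint='1F44D', status='fully-qualified', emoji='\U0001F44D',
                   name='thumbs up', group='People & Body', sub_group='hand-fingers-closed')

print(entry.name, entry.group)  # access by field name
print(entry[0])                 # still behaves like a regular tuple
print(entry._asdict())          # convert to a dict, handy when building a DataFrame
```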

The following code goes through lines one by one, extracting the information that is needed, and appending each entry to `emoji_entries` which will be a list containing all of them.  
I have annotated the code with some comments, and below elaborated a little more to clarify.

In [None]:
E_regex = re.compile(r' ?E\d+\.\d+ ') # remove the pattern E<digit(s)>.<digit(s)>
emoji_entries = []

for line in emoji_raw.splitlines()[32:]:  # skip the explanation lines
    if line == '# Status Counts':  # start of the summary at the end of the file
        break
    if 'subtotal:' in line:  # these are lines showing statistics about each group, not needed
        continue
    if not line:  # if it's a blank line
        continue
    if line.startswith('#'):  # these lines contain group and/or sub-group names
        if '# group:' in line:
            group = line.split(':')[-1].strip()
        if '# subgroup:' in line:
            subgroup = line.split(':')[-1].strip()
    if group == 'Component':  # skin tones, and hair types, skip, as mentioned above
        continue
    if re.search('^[0-9A-F]{3,}', line):  # if the line starts with a hexadecimal number (an emoji code point)
        # here we define all the elements that will go into emoji entries
        codepoint = line.split(';')[0].strip()  # in some cases it is one and in others multiple code points
        status = line.split(';')[-1].split()[0].strip() # status: fully-qualified, minimally-qualified, unqualified
        if line[-1] == '#':
            # The special case where the emoji is actually the hash sign "#". In this case manually assign the emoji
            if 'fully-qualified' in line:
                emoji = '#\uFE0F\u20E3'  # "#" + variation selector-16 + combining enclosing keycap
            else:
                emoji = '#\u20E3'  # they look the same, but this one lacks the variation selector
        else:  # the default case
            emoji = line.split('#')[-1].split()[0].strip()  # the emoji character itself
        if line[-1] == '#':  # (the special case)
            name = '#'
        else:  # extract the emoji name
            split_hash = line.split('#')[1]
            rm_capital_E = E_regex.split(split_hash)[1]
            name = rm_capital_E
        templine = EmojiEntry(codepoint=codepoint,
                              status=status,
                              emoji=emoji,
                              name=name,
                              group=group,
                              sub_group=subgroup)
        emoji_entries.append(templine)
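
As an alternative way to look at the parsing of a single data line, here is a sketch that uses one regular expression with named groups instead of several `split` calls. The pattern `LINE_REGEX` and the helper `parse_line` are assumptions made for this illustration, based on the `code points; status # emoji E<version> name` layout described in the file header. Unlike the loop above, this version keeps the full name `keycap: #` for the hash sign emoji.

```python
import re

# hypothetical pattern for one data line: code points ; status # emoji E<version> name
LINE_REGEX = re.compile(
    r'^(?P<codepoint>[0-9A-F]+(?: [0-9A-F]+)*)\s*;\s*'
    r'(?P<status>[a-z-]+)\s*# '
    r'(?P<emoji>\S+) E\d+\.\d+ (?P<name>.+)$'
)

def parse_line(line):
    """Return a dict of fields for an emoji data line, or None for any other line."""
    match = LINE_REGEX.match(line)
    return match.groupdict() if match else None

sample = '1F44D 1F3FF    ; fully-qualified     # \U0001F44D\U0001F3FF E1.0 thumbs up: dark skin tone'
keycap = '0023 FE0F 20E3 ; fully-qualified     # #\uFE0F\u20E3 E0.6 keycap: #'

print(parse_line(sample))
print(parse_line(keycap))
print(parse_line('# group: People & Body'))  # not a data line
```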


In [None]:
emoji_dict = {x.emoji: x for x in emoji_entries}

In [None]:
emoji_dict['😆'].emoji

In [None]:
emoji_entries[0]

In [None]:
emoji_entries[0].emoji

In [None]:
emoji_entries[0].group, emoji_entries[0].sub_group

Here is a quick summary of the counts of the groups, sub-groups, and all group/sub-group combinations:

In [None]:
Counter([x.group for x in emoji_entries])

In [None]:
sorted(Counter([x.sub_group for x in emoji_entries]).items(), key=lambda x: x[1], reverse=True)[:30]

In [None]:
Counter([' | '.join([x.group, x.sub_group]) for x in emoji_entries])

## Emoji status
In case you are wondering about the status column, this is the explanation from the
[Unicode official documentation:](http://unicode.org/reports/tr51/#def_qualified_emoji_character) 

>ED-17a. qualified emoji character: An emoji character in a string that (a) has default emoji presentation or (b) is the first character in an emoji modifier sequence or (c) is not a default emoji presentation character, but is the first character in an emoji presentation sequence.  
>ED-18. fully-qualified emoji: A qualified emoji character, or an emoji sequence in which each emoji character is qualified.  
>ED-18a. minimally-qualified emoji: An emoji sequence in which the first character is qualified but the sequence is not fully qualified.  
>ED-19. unqualified emoji: An emoji that is neither fully-qualified nor minimally qualified.

As mentioned above, we need to handle single and multiple code point emoji slightly differently.  
We start by extracting the multi code points.

In [None]:
multi_codepoint_emoji = []

for code in [c.codepoint.split() for c in emoji_entries]:
    if len(code) > 1:
        # turn each code point into an escape sequence zero-padded to 8 hex digits, e.g. '\U0001F44D'
        hexified_codes = [r'\U' + x.zfill(8) for x in code]  
        hexified_codes = ''.join(hexified_codes)  # join all hexadecimal components 
        multi_codepoint_emoji.append(hexified_codes)

# sorting by length in decreasing order is extremely important as demonstrated above
multi_codepoint_emoji_sorted = sorted(multi_codepoint_emoji, key=len, reverse=True)

# join with a "|" to function as an "or" in the regex
multi_codepoint_emoji_joined = '|'.join(multi_codepoint_emoji_sorted)  
multi_codepoint_emoji_joined[:400]  # sample

In [None]:
single_codepoint_emoji = []

for code in [c.codepoint.split() for c in emoji_entries]:
    if len(code) == 1:
        single_codepoint_emoji.append(code[0])
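
The conversion from space-separated hexadecimal code points to regex escape text can be checked in isolation. The sketch below uses a hypothetical helper name, `codepoints_to_pattern`, and relies on the fact that Python's `re` module understands `\UXXXXXXXX` escapes inside a pattern.

```python
import re

def codepoints_to_pattern(codepoints):
    """Turn space-separated hex code points into regex escape text."""
    return ''.join(r'\U' + code.zfill(8) for code in codepoints.split())

pattern = codepoints_to_pattern('1F44D 1F3FF')
print(pattern)  # plain text made of backslashes, letters and digits

# the escapes are interpreted by the regex engine, not by the Python parser
print(bool(re.fullmatch(pattern, '\U0001F44D\U0001F3FF')))

# escapes are treated as literal characters, so the "*" keycap needs no extra escaping
star_keycap = codepoints_to_pattern('002A FE0F 20E3')
print(bool(re.fullmatch(star_keycap, '*\uFE0F\u20E3')))
```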

# Regex character ranges

Since the single code point emoji are basically one character each, they can be treated as normal letters or numbers in the regex.  
One important feature of character classes is their ability to contain character ranges. 
If I want to match a character that falls between A and F, there are two ways to define the character class: 

- `[ABCDEF]`
- `[A-F]`

They effectively mean the same thing. The advantage of the second is that it is much more readable. Imagine wanting to match the letters from A to T: listing all of them would be difficult to read and check, while `[A-T]` is very easy to read.  
I also believe there might be a slight performance boost with character ranges. Some regex engines do certain optimizations on their own, and I'm not aware of those details. But in general, making two comparisons is more efficient than making fifty.  
For example, you have the number 42, and want to check if it falls between 1 and 100.  
In the character range case, you make two comparisons: you check if 42 >= 1 and 42 <= 100.  
If you have all the numbers listed from 1 to 100, then you will have to make 42 comparisons to find out. On average, with a range of 100 numbers, you will be making about fifty comparisons. With larger ranges, this number grows in proportion to the size of the range.  
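
A small sketch of both points: the two ways of writing the character class find the same matches, and a range check needs only two comparisons. The sample text and variable names are made up for this example, and it says nothing about how the regex engine works internally.

```python
import re

text = 'A1 B2 G7 F6 T0'

listed = re.compile('[ABCDEF]')
ranged = re.compile('[A-F]')

print(listed.findall(text))
print(ranged.findall(text))

number = 42
in_range = 1 <= number <= 100            # two comparisons
in_list = number in list(range(1, 101))  # scans the list until it reaches 42
print(in_range, in_list)
```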

Below is the function `get_ranges`. It takes a list of integers, and returns a list of tuples, each representing the local minimum and maximum for any number of contiguous integers (numbers differing by 1).  
For example, if I have the list `[1, 2, 3, 4, 6, 7, 8, 10, 20]`, it will return `[(1, 4), (6, 8), (10, 10), (20, 20)]`

The numbers 1, 2, 3, and 4 can be converted into a character range `[1-4]`, and so can the numbers 6, 7, and 8. 10 and 20 are not part of a series of integers differing by one, so they are represented as single-number ranges. Later they will be used as single characters in the regex.

In [None]:
def get_ranges(nums):
    """Reduce a list of integers to tuples of local maximums and minimums.

    :param nums: List of integers.
    :return ranges: List of tuples showing local minimums and maximums
    """
    nums = sorted(nums)
    lows = [nums[0]]
    highs = []
    if nums[1] - nums[0] > 1:
        highs.append(nums[0])
    for i in range(1, len(nums)-1):
        if (nums[i] - nums[i-1]) > 1:
            lows.append(nums[i])
        if (nums[i + 1] - nums[i]) > 1:
            highs.append(nums[i])
    highs.append(nums[-1])
    if len(highs) > len(lows):
        lows.append(highs[-1])
    return [(l, h) for l, h in zip(lows, highs)]
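
For comparison, here is an alternative sketch of the same idea that tracks the current run in a single pass. It is not the function used in the rest of this notebook, and the name `collapse_ranges` is made up. It also handles an empty list, a one-item list, and duplicate numbers.

```python
def collapse_ranges(nums):
    """Collapse integers into (low, high) tuples of consecutive runs."""
    ranges = []
    for num in sorted(set(nums)):
        if ranges and num == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], num)  # extend the current run
        else:
            ranges.append((num, num))  # start a new run
    return ranges

print(collapse_ranges([1, 2, 3, 4, 6, 7, 8, 10, 20]))
```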

In [None]:
# We first convert single_codepoint_emoji to integers to make calculations easier
single_codepoint_emoji_int = [int(x, base=16) for x in single_codepoint_emoji]
single_codepoint_emoji_ranges = get_ranges(single_codepoint_emoji_int)
single_codepoint_emoji_ranges[:10]

In [None]:
single_codepoint_emoji_raw = r''  # start with an empty raw string
for code in single_codepoint_emoji_ranges:
    if code[0] == code[1]:  # in this case make it a single hexadecimal character
        temp_regex =  r'\U' + hex(code[0])[2:].zfill(8)
        single_codepoint_emoji_raw += temp_regex
    else:
        # otherwise create a character range, joined by '-'
        temp_regex = '-'.join([r'\U' + hex(code[0])[2:].zfill(8), r'\U' + hex(code[1])[2:].zfill(8)])
        single_codepoint_emoji_raw += temp_regex

single_codepoint_emoji_raw[:100]  # sample

# Final regex
Now that we have created our sorted multi-code point characters, and generated the ranges for the single-code point emoji, we need to combine them together.  
The regex will start with the longer 'words', which are emoji represented by more than one character. These have already been sorted by length, in descending order. 
Single-code point emoji have already been made into a character class, where some values are single characters, and some are character ranges. 

The final regex will look something like this: 

`multi_code_point_emoji|[character_class_of_single_code_points]`

In more detail, this is what the first `multi_code_point_emoji` part will look like:

`longest_multi_code_point|shorter_multiple_code_point|...|shortest_multiple_code_point`

This is what the character class part `[character_class_of_single_code_points]` will look like: 
For simplicity I refer to `single_code_point` as `sp`. 

`[sp1sp2sp3sp4-sp20sp25sp500-sp600]` and so on. 

Below we concatenate both regexes into one, and show the first and last 500 characters as a sample. 

In [None]:
all_emoji_regex = re.compile(multi_codepoint_emoji_joined + '|' +  r'[' + single_codepoint_emoji_raw + r']')
all_emoji_regex.pattern[:500], all_emoji_regex.pattern[-500:]
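
To see the overall structure at a readable size, here is a miniature version built from six hand-picked code point strings. It is only a sketch of the same layout (multi code point alternatives first, then one character class); the names `to_escape` and `mini_regex` are made up for this example.

```python
import re

# a hand-picked miniature of the data: space-separated hex code points per emoji
codepoints = ['1F600', '1F601', '1F602', '1F44D', '1F44D 1F3FF', '1F468 200D 1F469 200D 1F466']

def to_escape(code):
    """Turn one hex code point into regex escape text."""
    return r'\U' + code.zfill(8)

multi = [''.join(to_escape(c) for c in cp.split()) for cp in codepoints if ' ' in cp]
multi_part = '|'.join(sorted(multi, key=len, reverse=True))  # longest first

# single code points, with the consecutive run 1F600-1F602 written as a range by hand
single_part = to_escape('1F600') + '-' + to_escape('1F602') + to_escape('1F44D')

mini_regex = re.compile(multi_part + '|[' + single_part + ']')
print(mini_regex.pattern)

text = ('family \U0001F468\u200D\U0001F469\u200D\U0001F466, dark \U0001F44D\U0001F3FF, '
        'plain \U0001F44D, grin \U0001F601')
print(mini_regex.findall(text))
```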

# Testing
We need to know that our work is correct. It is easy to get it wrong, especially when we are talking about several thousand emoji, many of which are combinations of others. 

As a quick sanity check, let's see how many emoji entries were actually in the initial text file. Each emoji entry contains a semicolon, so let's count those: 

![](https://drive.google.com/uc?id=1cR0fsIlSFjT5yNz9QbJ-_BpcoqbSuWgE)

* There are 4,591 semicolons in the file. One of them is part of the explanation at the top of the file, and remember that there were nine characters that we omitted, because they were basically modifiers. So the final number should be 4,591 - 1 - 9 = 4,581. 

Now we run `findall` with the combined final regex on a string that we create.  
This string is all the emoji characters in `emoji_entries` separated by spaces. Their number needs to be exactly 4,581. 

In [None]:
len(all_emoji_regex.findall(' '.join([x.emoji for x in emoji_entries])))

So far so good. Let's get some more assurance.

The code below goes through all the lines of the raw text file, as downloaded from the Unicode site.  
First we define `count` as zero, and increment its value every time we find a new match. This should add up to the same number, 4,581.  
We also create a set `found_emoji`, to which we add every emoji we find. If we match a certain emoji more than once, the duplicate is discarded, because sets only contain unique values. Again, the length of this set should be equal to our magic number. A lower number means we found duplicates, and a higher number means we are matching other things. 

The `if len(match) > 1` check stops the loop when the regex finds more than one match in a line, which would mean we are wrongly matching something more than once. It actually broke a few times when I first ran it, until I fixed the issues.  
One final test asserts that the name of the emoji (which we extract from `emoji_entries`) is contained in the line in the raw text file, making sure that the names correspond to the correct emoji and were extracted correctly. 

In [None]:
count = 0
found_emoji = set()
for line in emoji_raw.splitlines()[30:]:
    match = all_emoji_regex.findall(line)
    if match:
        if len(match) > 1:
            break
        count += 1
        found_emoji.add(match[0])
        temp_name = [x.name for x in emoji_entries if x.emoji == match[0]][0]
        assert temp_name in line

count, len(found_emoji)
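
Another check that catches ordering mistakes directly is to run the regex on each emoji by itself and require exactly one match, equal to the whole emoji. The sketch below shows the idea on two emoji with a deliberately mis-ordered regex; `mismatches` is a hypothetical helper name.

```python
import re

def mismatches(regex, emoji_list):
    """Return the emoji that the regex does not extract as exactly one whole match."""
    return [emoji for emoji in emoji_list if regex.findall(emoji) != [emoji]]

emoji_list = ['\U0001F44D', '\U0001F44D\U0001F3FF']

good_regex = re.compile('\U0001F44D\U0001F3FF|\U0001F44D')  # longer alternative first
bad_regex = re.compile('\U0001F44D|\U0001F44D\U0001F3FF')   # shorter alternative first

print(mismatches(good_regex, emoji_list))
print(mismatches(bad_regex, emoji_list))
```

Note that `fullmatch` would not be a good substitute here: it backtracks into the later alternative and succeeds even with the wrong ordering, which would hide the problem that `findall` exposes.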

## 🎉 🎉 🎉 🎊 🎊 🎊 👍 👏 😉

To save as a DataFrame, we can run the following code.  
I made it semicolon-separated, as there were commas in the descriptions, so this is easier. Then I let `pandas` do the heavy lifting of converting back to comma-separated format. 

In [None]:
with open('emoji_df.csv', 'wt') as file:
    print('emoji;name;group;sub_group;codepoints', file=file)
    for em in emoji_entries:
        print(f"{em.emoji};{em.name};{em.group};{em.sub_group};{em.codepoint}", file=file)

In [None]:
import pandas as pd
pd.options.display.max_columns = None

emoji_df = pd.read_csv('emoji_df.csv', sep=';')
emoji_df.to_csv('emoji_df.csv', index=False)
emoji_df = pd.read_csv('emoji_df.csv')
emoji_df[:35]
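
A possible alternative to the semicolon round trip is the standard library `csv` module, which quotes any field that contains a comma. This is only a sketch: the two rows are typed by hand for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# illustrative rows typed by hand; the second name contains commas
rows = [
    ('\U0001F44D', 'thumbs up', 'People & Body', 'hand-fingers-closed', '1F44D'),
    ('\U0001F468\u200D\U0001F469\u200D\U0001F466', 'family: man, woman, boy',
     'People & Body', 'family', '1F468 200D 1F469 200D 1F466'),
]
header = ['emoji', 'name', 'group', 'sub_group', 'codepoints']

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[2][1])  # the name survives with its commas intact
```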

# Emoji in Real-life Data
Let's see how we can use this regex on a tweet dataset containing five thousand tweets that contain the hashtag #JustDoIt.

In [None]:
justdoit = pd.read_csv('../input/5000-justdoit-tweets-dataset/justdoit_tweets_2018_09_07_2.csv')
justdoit.head(3)

The `word_frequency` function in `advertools` extracts words and counts their occurrences on an absolute and weighted basis. The function takes an optional `regex` parameter, whereby the function counts occurrences of matches of the regex (and not all words).  
We can now use the regex created, to extract and count emoji in our dataset. 

In [None]:
import advertools as adv
justdoit_emoji_freq = (adv.word_frequency(justdoit['tweet_full_text'],
                                          justdoit['user_followers_count'],
                                          regex=all_emoji_regex.pattern))
justdoit_emoji_freq.head(15)

The `abs_freq` column shows how many times each emoji was used (a simple count), while `wtd_freq` adds up the number of followers of the person who tweeted, for each occurrence.  
In the sample above you can see the monkey emoji being used only once, but since the user who tweeted it has 2.9M followers, it has the highest `wtd_freq` of all emoji.  
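
The difference between the two columns can be sketched in a few lines of plain Python. This is not the `advertools` implementation, only an illustration of the idea with made-up tweets and follower counts; `emoji_frequency` is a hypothetical helper.

```python
import re
from collections import Counter

def emoji_frequency(texts, weights, regex):
    """Count regex matches across texts, both absolutely and weighted."""
    abs_freq, wtd_freq = Counter(), Counter()
    for text, weight in zip(texts, weights):
        for match in regex.findall(text):
            abs_freq[match] += 1       # one per occurrence
            wtd_freq[match] += weight  # the author's follower count per occurrence
    return abs_freq, wtd_freq

# made-up tweets and follower counts
tweets = ['Go \U0001F44D\U0001F44D', 'Nice \U0001F44D', 'Wow \U0001F412']
followers = [100, 50, 2_900_000]
regex = re.compile('[\U0001F44D\U0001F412]')

abs_freq, wtd_freq = emoji_frequency(tweets, followers, regex)
print(abs_freq.most_common())
print(wtd_freq.most_common())
```

The thumbs up is the most frequent emoji by count, while the monkey, used once by the account with the most followers, leads on a weighted basis.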

Using the `emoji_dict` that we created, we can show the names, groups, and sub-groups of each emoji:

In [None]:
justdoit_emoji_freq['name'] = [emoji_dict[word].name if word != '\ufe0f' else '' for word in justdoit_emoji_freq['word']]
justdoit_emoji_freq['group'] = [emoji_dict[word].group if word != '\ufe0f' else '' for word in justdoit_emoji_freq['word']]
justdoit_emoji_freq['sub_group'] = [emoji_dict[word].sub_group if word != '\ufe0f' else '' for word in justdoit_emoji_freq['word']]
justdoit_emoji_freq[:40]

The previous table shows the frequencies per emoji.  
What about the groups and sub-groups? 

We do this next: 

In [None]:
(justdoit_emoji_freq
 .groupby('group')
 .agg({'abs_freq': 'sum', 'wtd_freq': 'sum'})
 .sort_values('wtd_freq', ascending=False)
 .style.format({'wtd_freq': '{:,.0f}'}))

Note here that again, even though "Smileys & Emotion" emoji have been used 1,440 times and "Animals & Nature" only 38, the latter still ranks higher on a weighted basis.  
This is typical on social media. We often get a dataset that gets skewed by one tweet/user. 

In [None]:
(justdoit_emoji_freq
 .groupby('sub_group')
 .agg({'abs_freq': 'sum', 'wtd_freq': 'sum'})
 .sort_values('wtd_freq', ascending=False)
 .head(20)
 .style.format({'wtd_freq': '{:,.0f}'}))