## Hashtags and Emoji on Social Media
- Social media posts, especially tweets, can be difficult to work with. They're very short and often contain various emoji and hashtags. By only looking at the plain text, you'll lose a lot of valuable semantic information.
- Let's say you've extracted a large sample of social media posts on a specific topic, for example posts mentioning a brand name or product. As the first step of your data exploration, you want to pick out posts containing certain emoji and use them to assign a general sentiment score, based on whether the expressed emotion is positive or negative, e.g. 😊 or 😢. You also want to find, merge and label hashtags like #MondayMotivation, so that you can ignore or analyze them later.
- By default, spaCy's tokenizer will split emoji into separate tokens. This means that you can create a pattern for one or more emoji tokens. Valid hashtags usually consist of a #, plus a sequence of ASCII characters with no whitespace, making them easy to match as well.
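
To see why a hashtag pattern needs two tokens, it can help to inspect the tokenizer output directly. The snippet below is a minimal sketch: it assumes a blank English pipeline (`spacy.blank("en")`), which applies the default English tokenizer rules without loading a trained model, and the names `nlp_blank` and `sample` are introduced here only for illustration.

```python
import spacy

# Assumption: only the tokenizer is needed here, so a blank English
# pipeline is used instead of a trained model.
nlp_blank = spacy.blank("en")

sample = nlp_blank("Hello world 😊 #MondayMotivation")

# Collect the text of every token to see how the post was split
tokens = [token.text for token in sample]
print(tokens)
```

The emoji comes out as a token of its own, and the `#` is separated from the tag text that follows it. This is why the hashtag pattern used later matches a `#` token followed by a second token.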

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
pos_emoji = ["😊", "😃", "😄", "😁", "😆", "😅"] #positive_emoji
neg_emoji = ["😞", "😔", "😟", "😕", "🙁", "☹️"] #negative_emoji

In [4]:
print('positive emoji:',pos_emoji)
print('negative emoji:',neg_emoji)

positive emoji: ['😊', '😃', '😄', '😁', '😆', '😅']
negative emoji: ['😞', '😔', '😟', '😕', '🙁', '☹️']


In [5]:
# build one single-token pattern per emoji; each pattern matches that exact token text
pos_patterns = [[{"ORTH": emoji}] for emoji in pos_emoji]
neg_patterns = [[{"ORTH": emoji}] for emoji in neg_emoji]

In [6]:
pos_patterns

[[{'ORTH': '😊'}],
 [{'ORTH': '😃'}],
 [{'ORTH': '😄'}],
 [{'ORTH': '😁'}],
 [{'ORTH': '😆'}],
 [{'ORTH': '😅'}]]

In [7]:
neg_patterns

[[{'ORTH': '😞'}],
 [{'ORTH': '😔'}],
 [{'ORTH': '😟'}],
 [{'ORTH': '😕'}],
 [{'ORTH': '🙁'}],
 [{'ORTH': '☹️'}]]

In [8]:
def label_sentiment(matcher, doc, i, matches):
    # on_match callback: adjust doc.sentiment by 0.1 for each emoji match
    match_id, start, end = matches[i]
    label = doc.vocab.strings[match_id]  # rule name, e.g. "HAPPY"
    if label == "HAPPY":
        doc.sentiment += 0.1
    elif label == "SAD":
        doc.sentiment -= 0.1

In [9]:
matcher = Matcher(nlp.vocab)

In [10]:
matcher.add("HAPPY", pos_patterns, on_match=label_sentiment)
matcher.add("SAD", neg_patterns, on_match=label_sentiment)

In [11]:
# a hashtag is a "#" token followed by one token made up of ASCII characters
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

In [12]:
doc = nlp("Hello world 😊 #MondayMotivation")

In [13]:
matches = matcher(doc)

In [14]:
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # look up the rule name for the hash ID
    span = doc[start:end]
    print(string_id, span.text)

HAPPY 😊
HASHTAG #MondayMotivation
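
The matcher finds the hashtag, but the `Doc` still holds it as two tokens (`#` and the tag text). One way to merge and label hashtags, as mentioned in the introduction, is sketched below. It is not part of the notebook above: it assumes a blank English pipeline so that it runs standalone, the custom attribute `is_hashtag` is a name chosen here for illustration, and `filter_spans` is added as a precaution against overlapping matches.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
from spacy.util import filter_spans

# Assumption: a blank English pipeline, since only tokenization is needed
nlp_blank = spacy.blank("en")

hashtag_matcher = Matcher(nlp_blank.vocab)
hashtag_matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

# Hypothetical custom attribute used to label hashtag tokens;
# force=True allows the cell to be re-run without an error.
Token.set_extension("is_hashtag", default=False, force=True)

tweet = nlp_blank("Hello world 😊 #MondayMotivation")

# Collect the matched spans first and drop any overlapping ones,
# because overlapping spans cannot be merged in one pass.
hashtag_spans = filter_spans(
    [tweet[start:end] for _, start, end in hashtag_matcher(tweet)]
)

# Merge each "#" + text pair into one token and flag it as a hashtag
with tweet.retokenize() as retokenizer:
    for span in hashtag_spans:
        retokenizer.merge(span)
        for token in span:
            token._.is_hashtag = True

merged_tokens = [token.text for token in tweet]
flagged = [token.text for token in tweet if token._.is_hashtag]
print(merged_tokens)
print(flagged)
```

After the merge, downstream code can skip or single out hashtags by checking `token._.is_hashtag` instead of re-running the matcher.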
