# **Mini-project \#2 - RegEx Tokenizer**

Name: **Ryan Clemence Vasquez**

More information on the assessment is found in our Canvas course.

# **Loading Data into Memory**

This provided code block loads the data for this assessment from the supplied file. Specifically, it loads tweets into a DataFrame. The file contains two columns: (1) text and (2) created_at. The second column isn't very useful here and is only meant to give you an idea of when these tweets were created. Focus on the first column, but if the time frame helps your analysis, you're free to use it. This code also assumes you've placed the tweet CSV in the same folder as the Python notebook.

**Note**: The tweets collected here are public tweets, gathered through Twitter's API.

In [11]:
import pandas as pd
import csv

tweets = pd.read_csv(
  "tweets_for_mp2.csv",
  index_col=False,
  quoting=csv.QUOTE_ALL)

In [None]:
tweets.head()


# **Your Solution**

Kindly place your solution in the code block below. Aside from loading and tokenizing the data, your code should also display the total number of tokens, the vocabulary size, and the top 25 tokens with their respective counts.

In [13]:
import regex
from collections import Counter

## **Defining the RegEx-based Tokenizer**

In [356]:
class RegExTokenizer:
    def __init__(self, pattern=""):
        self.pattern = pattern
    
    def set_pattern(self, pattern):
        self.pattern = pattern
    
    def tokenize(self, text):
        if not self.pattern:
            raise ValueError("No regex pattern provided")

        # VERBOSE allows whitespace and comments in the pattern; I makes matching case-insensitive
        tokens = regex.findall(self.pattern, text, regex.VERBOSE | regex.I | regex.UNICODE)
        return list(tokens)
    
    def tokenize_df(self, df):
        if not self.pattern: 
            raise ValueError("No regex pattern provided")
        
        return df.apply(self.tokenize)

class TokenCounter: 
    def __init__(self, source=None):
        self.source = source
        self.has_processed = False
    
    def set_tokens(self, new_source):
        self.source = new_source
        self.has_processed = False

    def generate_analysis(self):
        # Guard against both a missing source (None) and an empty one
        if self.source is None or len(self.source) == 0:
            raise ValueError("No token series provided")

        # Flatten the list of tokens and count the total tokens
        self.all_tokens = [token for tokens_list in self.source for token in tokens_list]
        self.token_count = len(self.all_tokens)
        
        # Lowercase every token except those written fully in uppercase (e.g. "I", acronyms)
        normalized_tokens = [
            token.lower() if not token.isupper() else token 
            for token in self.all_tokens
        ]

        # Create the vocabulary from the normalized tokens
        self.vocab = set(normalized_tokens)
        self.vocab_count = len(self.vocab)

        # Count occurrences per unique token in the normalized list
        self.word_counts = Counter(normalized_tokens)
        self.has_processed = True
        print("Finished analyzing token series...")

        return self.word_counts

    def display_counts(self, k=25): 
        if not self.has_processed:
            raise ValueError("Token series not analyzed yet.")

        top_tokens = self.word_counts.most_common(k)
        
        print(f"Total tokens: {self.token_count}")
        print(f"Total vocabulary: {self.vocab_count}")
        print(f"All words + counts (Top {k} only):")
        
        for word, count in top_tokens:
            print(f"{word} : {count}")
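The normalization rule in `generate_analysis` keeps fully uppercase tokens as they are and lowercases everything else, which is why `I` can appear as its own vocabulary entry. A minimal sketch of that rule on a made-up token list (the `normalize` helper and the sample tokens are illustrative, not part of the solution):

```python
from collections import Counter

def normalize(token):
    # Same rule as generate_analysis: keep fully uppercase tokens, lowercase the rest.
    return token if token.isupper() else token.lower()

# Hypothetical sample tokens, chosen to show the edge cases.
sample = ["The", "the", "I", "NASA", "Nasa", "#MP2", "lol", "LOL"]
counts = Counter(normalize(t) for t in sample)

# "The" and "the" merge, while "LOL"/"lol" and "NASA"/"nasa" stay distinct.
print(counts.most_common())
```

One consequence worth keeping in mind is that shouted words such as `LOL` are counted separately from `lol`, so the vocabulary size is slightly larger than a plain `lower()` would give.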


## **Defining the RegEx Pattern to be Used**

In [368]:
emoticon_string = r"""
(?:
  [<>]?
  [:;=8]                     # eyes
  [\-o\*\']?                 # optional nose
  [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth      
  |
  [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
  [\-o\*\']?                 # optional nose
  [:;=8]                     # eyes
  [<>]?
)"""

# Unicode emoji
emoji_string = (
              "["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
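                u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
                u"\U00002702-\U000027B0"  # dingbats
                u"\U0001f926-\U0001f937"  # supplemental symbols (face palm to juggling)
                u"\U00010000-\U0010ffff"  # all supplementary planes (already covers the U+1Fxxx ranges above)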
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u200d"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                # u"\ufe0f"  # variation selector-16 (disabled)
                # u"\u3030"
              "]"
)

# u"\U000024C2-\U0001F251"

regex_strings = (
# URL:
r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+""",
# Twitter username:
r"""(?:@[\w_]+)""",
# Hashtags:
r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Cashtags:
r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Remaining word types, respectively:
    # Numbers, including fractions, decimals.
    # Words, including contractions, and words with numbers too
r"""
(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)                                 
"""
)

# Note: emoticon_string is defined above but is not included in the combined pattern.
combined_regex_pattern = r"|".join([emoji_string] + list(regex_strings))

In [369]:
print(combined_regex_pattern)

[üòÄ-üôèüåÄ-üóøüöÄ-üõøüá†-üáø‚îÄ-‚ØØ‚úÇ-‚û∞ü§¶-ü§∑êÄÄ-Ùèøø‚ôÄ-‚ôÇ‚òÄ-‚≠ï‚Äç‚èè‚è©‚åö]|http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:@[\w_]+)|(?:\#+[\w_]+[\w\'_\-]*[\w_]+)|(?:\$+[\w_]+[\w\'_\-]*[\w_]+)|
(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)                                 
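The combined pattern printed above lists the URL alternative before the generic word alternative. The order matters because the regex engine tries alternatives left to right at each position. A small sketch of the effect, using the standard-library `re` module and two simplified stand-in patterns (both are assumptions for illustration, not the patterns used in this notebook):

```python
import re

# Simplified stand-ins for the URL and word alternatives (illustrative only).
URL = r"https?://\S+"
WORD = r"[a-zA-Z0-9'-]+"

url_first = re.compile(f"{URL}|{WORD}")
word_first = re.compile(f"{WORD}|{URL}")

text = "read https://example.com/a-b now"

# With the URL alternative first, the link survives as one token.
print(url_first.findall(text))

# With the word alternative first, the link is split into word-like pieces.
print(word_first.findall(text))
```

This is the reason the more specific alternatives (URLs, usernames, hashtags, cashtags) are joined ahead of the catch-all word pattern.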



## **Running the Tokenizer on the Tweets Dataset**

In [370]:
tokenizer = RegExTokenizer(combined_regex_pattern)

In [379]:
tokenizer.tokenize(text="I am New York")

['I', 'am', 'New', 'York']
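The result above is a flat list of strings because every group in the pattern is non-capturing (`(?:...)`). With capturing groups, `findall` returns one tuple per match instead. A brief sketch using the standard-library `re` module (an assumption; the notebook uses the third-party `regex` package, which should behave the same way here) and two illustrative patterns that are not part of the solution:

```python
import re

text = "I am New York"

# Two hypothetical patterns that differ only in whether the groups capture.
non_capturing = r"(?:[A-Z][a-z]+)|(?:[A-Za-z]+)"
capturing = r"([A-Z][a-z]+)|([A-Za-z]+)"

# Non-capturing groups: findall returns the whole match as a string.
print(re.findall(non_capturing, text))

# Capturing groups: findall returns a tuple with one slot per group.
print(re.findall(capturing, text))
```

Keeping the groups non-capturing lets `tokenize` return the matches directly without any post-processing.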

In [372]:
tokens = tokenizer.tokenize_df(df=tweets['text'])

In [382]:
ctr = TokenCounter(tokens)
ctr.generate_analysis()

Finished analyzing token series...


In [374]:
ctr.display_counts(25)

Total tokens: 2307876
Total vocabulary: 272067
All words + counts (Top 25 only):
na : 43112
ko : 32507
sa : 31016
ng : 19363
ako : 17934
the : 17658
to : 17243
ang : 16232
I : 14046
you : 13494
lang : 13478
😂 : 13068
and : 12289
ka : 11914
❤ : 11854
😭 : 11536
pa : 11149
a : 11097
mo : 10503
naman : 9893
🥺 : 9522
of : 8819
for : 8556
mga : 8281
yung : 8198


In [375]:
from nltk import TweetTokenizer
twt_tokenizer = TweetTokenizer()

In [376]:
tokens2 = tweets['text'].apply(twt_tokenizer.tokenize)

In [377]:
ctr.set_tokens(tokens2)
ctr.generate_analysis()

Finished analyzing token series...


In [378]:
ctr.display_counts(50)

Total tokens: 2564983
Total vocabulary: 274958
All words + counts (Top 50 only):
. : 75105
! : 51925
na : 43103
, : 41423
ko : 32540
sa : 31012
? : 27509
ng : 19360
ako : 17920
️ : 17682
the : 17662
to : 17464
ang : 16203
I : 14047
you : 13495
lang : 13476
😂 : 12517
and : 12287
ka : 11908
❤ : 11640
pa : 11141
a : 11020
mo : 10497
😭 : 9975
naman : 9893
🥺 : 9071
of : 8819
for : 8554
yung : 8292
mga : 8277
di : 8255
my : 7991
is : 7894
hahaha : 7283
may : 7257
🤣 : 7207
@ : 7120
in : 7026
at : 6773
... : 6617
’ : 6487
mag : 5948
it : 5932
ni : 5817
me : 5784
ba : 5593
pero : 5546
😍 : 5378
this : 5329
nga : 5295
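The two runs report different totals (2307876 tokens for the custom pattern, 2564983 for `TweetTokenizer`). One plausible contributor is punctuation: tokens such as `.`, `!`, `,` and `?` rank near the top of the NLTK output, while the custom pattern has no alternative that matches bare punctuation. A rough sketch of how that could be checked, using only a handful of counts copied from the two outputs above (the `is_punctuation` helper is hypothetical, and the sample is far too small to explain the full gap):

```python
import string
from collections import Counter

# A few counts copied from the two outputs above (not the full lists).
custom_counts = Counter({"na": 43112, "ko": 32507, "sa": 31016, "ng": 19363})
nltk_counts = Counter({".": 75105, "!": 51925, "na": 43103, ",": 41423,
                       "ko": 32540, "sa": 31012, "?": 27509, "ng": 19360})

def is_punctuation(token):
    # True when every character of the token is ASCII punctuation.
    return all(ch in string.punctuation for ch in token)

# Tokens that only the NLTK tokenizer emits in this sample.
punct_only = {tok: n for tok, n in nltk_counts.items() if is_punctuation(tok)}
print(punct_only)
print("Punctuation tokens in sample:", sum(punct_only.values()))

# Word counts are close but not identical between the two tokenizers.
for tok in custom_counts:
    print(tok, custom_counts[tok] - nltk_counts[tok])
```

On the real data, the same check could be run against the full `word_counts` of each analysis rather than these copied values.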
