# **Mini-project \#2 - RegEx Tokenizer**

Name: **Ryan Clemence Vasquez**

More information on the assessment is found in our Canvas course.

# **Loading Data into Memory**

This code block is provided and loads the data for this assessment from the supplied file. Specifically, it loads tweets into a DataFrame. The file contains two columns: `text` and `created_at_utc+8`. The timestamp column isn't very useful here and is only meant to give you an idea of the time frame in which these tweets were created. Focus on the `text` column, but if the time frame helps your analysis, you're free to use it. This code also assumes you've placed the tweet CSV in the same folder as the Python notebook.

**Note**: The tweets collected here are public tweets, gathered through Twitter's API.

In [187]:
import pandas as pd
import csv

tweets = pd.read_csv(
  "tweets_for_mp2.csv",
  index_col=False,
  quoting=csv.QUOTE_ALL)

In [188]:
tweets.head()


| | created_at_utc+8 | text |
|---|---|---|
| 0 | 2020-11-17 23:06:38 | HOY YUNG TEASER KINAKABAHAN AKO HAHAHAHAHA |
| 1 | 2020-11-17 23:06:38 | Ay unahan po ser? https://t.co/6gdMT3SDwK |
| 2 | 2020-11-17 23:06:39 | Huyy wag lang ha hahahaha https://t.co/ob7Ed7MLny |
| 3 | 2020-11-17 23:06:39 | Labyuuuuu ol\n\nREQUEST @SB19Official @MTV #Fr... |
| 4 | 2020-11-17 23:06:39 | gusto kaayo nako i share sa page akong letteee... |


# **Your Solution**

Kindly place your solution in the code block below. Aside from loading and tokenizing the data, your code should also display the total number of tokens, the vocabulary size, and the top 25 tokens with their respective counts.

In [189]:
import regex
from collections import Counter

## **Defining the RegEx-based Tokenizer**

In [190]:
class RegExTokenizer:
    def __init__(self, pattern=""):
        self.pattern = pattern
    
    def set_pattern(self, pattern):
        self.pattern = pattern
    
    def tokenize(self, text):
        if not self.pattern:
            raise ValueError("No regex pattern provided")

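        # findall returns the full matches as a list of strings, provided the
        # pattern contains no capturing groups
        tokens = regex.findall(self.pattern, text, regex.VERBOSE | regex.I | regex.UNICODE)
        return tokens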
    
    def tokenize_df(self, df):
        if not self.pattern: 
            raise ValueError("No regex pattern provided")
        
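        # Despite the parameter name, this expects a Series of strings
        # (e.g. tweets['text']); each row becomes a list of tokens
        return df.apply(self.tokenize)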
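
The `tokenize` method relies on two details of the regex API: under the verbose flag, whitespace and `#` comments inside the pattern are ignored, and `findall` returns whole matches only when the pattern has no capturing groups. The following is a minimal sketch of that behavior, not the notebook's actual pattern. It assumes the standard-library `re` module as a stand-in for `regex`, and `toy_pattern` and `toy_tokenize` are hypothetical names introduced for illustration.

```python
import re  # stdlib stand-in for the third-party `regex` module used in this notebook

# Hypothetical toy pattern (not the notebook's full pattern): mentions, hashtags, then words.
# Under re.VERBOSE, whitespace and "# ..." comments inside the pattern are ignored,
# so a literal hash sign has to be written as \#.
toy_pattern = r"""
    (?:@\w+)          # Twitter username
  | (?:\#\w+)         # hashtag
  | (?:[a-z0-9'-]+)   # plain word
"""

def toy_tokenize(text):
    # Every group is non-capturing, so findall returns the full matches as strings.
    return re.findall(toy_pattern, text, re.VERBOSE | re.IGNORECASE)

tokens = toy_tokenize("REQUEST @SB19Official on #MTV pls")
print(tokens)
```

If any of the groups were written as `(...)` instead of `(?:...)`, `findall` would return the captured groups rather than the full tokens, which is why every group in the sketch is non-capturing.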

In [341]:
class TokenCounter: 
    def __init__(self, source=None):
        self.source = source
        self.has_processed = False
    
    def set_tokens(self, new_source):
        self.source = new_source
        self.has_processed = False

    def generate_analysis(self):
        # Guard against both a missing source (the default None) and an empty one
        if self.source is None or len(self.source) == 0:
            raise ValueError("No token series provided")

        # Flatten the list of tokens and count the total tokens
        self.all_tokens = [token for tokens_list in self.source for token in tokens_list]
        self.token_count = len(self.all_tokens)
    
        # Lower-case every token except fully upper-case ones (e.g. "I", "HOY"), which
        # keep their case; collapse haha-sequences into a single "haha" token
        def normalize_token(token):
            token = token.lower() if not token.isupper() else token
            
            if regex.fullmatch(r'((haha|ahah)+)[a-z0-9]*', token.lower()):
                return "haha"
            
            return token

        # Normalize all tokens
        normalized_tokens = [normalize_token(token) for token in self.all_tokens]

        # Create the vocabulary from the normalized tokens
        self.vocab = set(normalized_tokens)
        self.vocab_count = len(self.vocab)

        # Count occurrences per unique token in the normalized list
        self.word_counts = Counter(normalized_tokens)
        self.has_processed = True
        print("Finished analyzing token series...")

        return self.word_counts

    def display_counts(self, k=25): 
        if not self.has_processed:
            raise ValueError("Token series not analyzed yet.")

        top_tokens = self.word_counts.most_common(k)
        
        print(f"Total tokens: {self.token_count}")
        print(f"Total vocabulary: {self.vocab_count}")
        print( "------------------------------------")
        print(f"All words + counts (Top {k} only):")
        
        count_df = pd.DataFrame(top_tokens, columns=["word", "count"]) 
        count_df.index = pd.RangeIndex(start=1, stop=len(count_df)+1)

        return count_df
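
To make the effect of the normalization step concrete, the sketch below applies the same rule as `normalize_token` to a handful of made-up tokens. It assumes the standard-library `re` module in place of `regex`, and `sample_tokens` is a hypothetical list rather than data from the tweet file.

```python
import re  # stdlib stand-in for `regex`
from collections import Counter

def normalize_token(token):
    # Same rule as in generate_analysis: fully upper-case tokens keep their case.
    token = token.lower() if not token.isupper() else token
    # Laughter such as "Hahaha" or "AHAHAH" collapses into a single "haha" entry.
    if re.fullmatch(r"((haha|ahah)+)[a-z0-9]*", token.lower()):
        return "haha"
    return token

# Hypothetical sample tokens chosen to exercise each branch.
sample_tokens = ["The", "the", "THE", "I", "Hahaha", "HAHAHAHAHA", "ahahah", "ha"]
normalized = [normalize_token(t) for t in sample_tokens]
counts = Counter(normalized)
print(counts.most_common())
```

Two consequences follow from the rule as written. Fully upper-case tokens such as "THE" are counted separately from "the", which is also why "I" shows up in the results with its capital letter. In addition, any token that merely begins with "haha" or "ahah" is folded into "haha", so a word such as "hahanap" would be counted as laughter.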


## **Defining the RegEx Pattern to be Used**

In [342]:
# Unicode codes for emoji
emoji_string = (
              "["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
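                u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows (not Chinese characters)
                u"\U00002702-\U000027B0"  # dingbats
                u"\U0001f926-\U0001f937"  # gesture emoji (face palm to shrug)
                u"\U00010000-\U0010ffff"  # every supplementary-plane code point (superset of the emoji ranges above)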
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u200d"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
              "]"
)

# u"\U000024C2-\U0001F251"

regex_strings = (
# URL:
r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+""",
# Twitter username:
r"""(?:@[\w_]+)""",
# Hashtags:
r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Cashtags:
r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Remaining word types, respectively:
    # Numbers, including fractions, decimals.
    # Words, including contractions, and words with numbers too
r"""
(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)                                 
"""
)

combined_regex_pattern = r"|".join([emoji_string] + list(regex_strings))
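
The order in which the sub-patterns are joined matters, because the regex engine tries alternatives from left to right at each position. The sketch below illustrates this with two simplified, hypothetical sub-patterns (`url` and `word`), not the notebook's real ones, and uses the standard-library `re` module as a stand-in for `regex`.

```python
import re  # stdlib stand-in for `regex`

# Simplified, hypothetical sub-patterns; the notebook's real ones are more permissive.
url = r"https?://\S+"
word = r"[a-zA-Z0-9'-]+"

text = "Ay unahan po ser? https://t.co/6gdMT3SDwK"

# Alternatives are tried left to right at each position, so the more specific
# URL pattern has to come before the generic word pattern.
url_first = re.findall("|".join([url, word]), text)
word_first = re.findall("|".join([word, url]), text)

print(url_first)   # URL kept as one token
print(word_first)  # URL broken into word-like pieces
```

This is the reason the generic word pattern sits last in `regex_strings`: placed earlier, it would consume the leading characters of URLs, mentions, and hashtags before the specific patterns get a chance to match.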

In [343]:
print(combined_regex_pattern)

[üòÄ-üôèüåÄ-üóøüöÄ-üõøüá†-üáø‚îÄ-‚ØØ‚úÇ-‚û∞ü§¶-ü§∑êÄÄ-Ùèøø‚ôÄ-‚ôÇ‚òÄ-‚≠ï‚Äç‚èè‚è©‚åö]|http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:@[\w_]+)|(?:\#+[\w_]+[\w\'_\-]*[\w_]+)|(?:\$+[\w_]+[\w\'_\-]*[\w_]+)|
(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)                                 



## **Simulating the Tokenizer with the Tweets Dataset**

In [344]:
tokenizer = RegExTokenizer(combined_regex_pattern)

In [345]:
tokens = tokenizer.tokenize_df(df=tweets['text'])

In [346]:
ctr = TokenCounter(tokens)
tokens2 = ctr.generate_analysis()

Finished analyzing token series...


In [348]:
ctr.display_counts(k=25)

Total tokens: 2307876
Total vocabulary: 269368
------------------------------------
All words + counts (Top 25 only):


| | word | count |
|---|---|---|
| 1 | na | 43112 |
| 2 | haha | 41303 |
| 3 | ko | 32507 |
| 4 | sa | 31016 |
| 5 | ng | 19363 |
| 6 | ako | 17934 |
| 7 | the | 17658 |
| 8 | to | 17243 |
| 9 | ang | 16232 |
| 10 | I | 14046 |
