# **Mini-project \#2 - RegEx Tokenizer**

Name: **Ryan Clemence Vasquez**

More information on the assessment is found in our Canvas course.

# **Loading Data into Memory**

This code block is provided and loads the data for this assessment from the provided file. Specifically, we are loading tweets into a DataFrame. The file contains two columns: (1) text and (2) created_at. The second column isn't very useful here and is only meant to give you an idea of when these tweets were created. Focus on the first column, but if the time frame helps your analysis, you're free to use it. Also, this code assumes you've placed the tweet CSV in the same folder as the Python notebook.

**Note**: The tweets collected here are public tweets, gathered through Twitter's API.

In [79]:
import pandas as pd
import csv

tweets = pd.read_csv(
  "tweets_for_mp2.csv",
  index_col=False,
  quoting=csv.QUOTE_ALL)

In [80]:
tweets.head()


Unnamed: 0,created_at_utc+8,text
0,2020-11-17 23:06:38,HOY YUNG TEASER KINAKABAHAN AKO HAHAHAHAHA
1,2020-11-17 23:06:38,Ay unahan po ser? https://t.co/6gdMT3SDwK
2,2020-11-17 23:06:39,Huyy wag lang ha hahahaha https://t.co/ob7Ed7MLny
3,2020-11-17 23:06:39,Labyuuuuu ol\n\nREQUEST @SB19Official @MTV #Fr...
4,2020-11-17 23:06:39,gusto kaayo nako i share sa page akong letteee...


# **Your Solution**

Kindly place your solution in the code block below. Aside from loading and tokenizing the data, your code should also display the total tokens, the total vocabulary, and the top 25 tokens with their respective counts.

In [81]:
import regex
from collections import Counter

## **Defining the RegEx-based Tokenizer**

In [102]:
class RegExTokenizer:
    def __init__(self, pattern=""):
        self.pattern = pattern
    
    def set_pattern(self, pattern):
        self.pattern = pattern
    
    def tokenize(self, text):
        if not self.pattern:
            raise ValueError("No regex pattern provided")

        tokens = regex.findall(self.pattern, text, regex.VERBOSE | regex.I | regex.UNICODE)
        return list(tokens)
    
    def tokenize_df(self, df):
        if not self.pattern: 
            raise ValueError("No regex pattern provided")
        
        return df.apply(self.tokenize)

class TokenCounter: 
    def __init__(self, source=None):
        self.source = source
        self.has_processed = False
    
    def set_tokens(self, new_source):
        self.source = new_source
        self.has_processed = False

    def generate_analysis(self):
        if self.source is None or len(self.source) == 0:
            raise ValueError("No token series provided")

        # Flatten the list of tokens and count the total tokens
        self.all_tokens = [token for tokens_list in self.source for token in tokens_list]
        self.token_count = len(self.all_tokens)

        # Make the vocabulary 
        self.vocab = set(self.all_tokens)
        self.vocab_count = len(self.vocab)

        # Count occurrences per unique token
        self.word_counts = Counter(self.all_tokens)
        self.has_processed = True
        print("Finished analyzing token series...")

    def display_counts(self, k=25): 
        if not self.has_processed:
            raise ValueError("Token series not analyzed yet.")

        top_tokens = self.word_counts.most_common(k)
        
        print(f"Total tokens: {self.token_count}")
        print(f"Total vocabulary: {self.vocab_count}")
        print(f"All words + counts (Top {k} only):")
        
        for word, count in top_tokens:
            print(f"{word} : {count}")


## **Defining the RegEx Pattern to be Used**

In [124]:
emoticon_string = r"""
(?:
  [<>]?
  [:;=8]                     # eyes
  [\-o\*\']?                 # optional nose
  [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth      
  |
  [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
  [\-o\*\']?                 # optional nose
  [:;=8]                     # eyes
  [<>]?
)"""

regex_strings = (
# URL:
r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+""",
# Twitter username:
r"""(?:@[\w_]+)""",
# Hashtags:
r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Cashtags:
r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Remaining word types, respectively:
    # Numbers, including fractions, decimals.
    # Words, including contractions, and words with numbers too
r"""
(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)                                 
"""
)

word_pattern = r"""(%s)""" % "|".join(regex_strings)
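
Note that `emoticon_string` is defined in the cell above but is not one of the entries in `regex_strings`, so `word_pattern` does not match emoticons as single tokens. As a rough sketch of what the emoticon pattern matches on its own, the snippet below compiles a copy of it with the standard-library `re` module (an assumption made for portability; the notebook uses `regex`) and runs it on a made-up string.

```python
import re

# Copy of the notebook's emoticon pattern (verbose syntax, so inline comments are allowed)
emoticon_string = r"""
(?:
  [<>]?
  [:;=8]                     # eyes
  [\-o\*\']?                 # optional nose
  [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
  |
  [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
  [\-o\*\']?                 # optional nose
  [:;=8]                     # eyes
  [<>]?
)"""

emoticon_re = re.compile(emoticon_string, re.VERBOSE)

sample = "ok :-) :D"  # hypothetical text, not taken from the dataset
found = emoticon_re.findall(sample)
print(found)
```

If emoticons should be counted as tokens, one option would be to place `emoticon_string` in `regex_strings` ahead of the generic word pattern; that would change the token counts reported later in this notebook.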

In [125]:
print(word_pattern)

(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:@[\w_]+)|(?:\#+[\w_]+[\w\'_\-]*[\w_]+)|(?:\$+[\w_]+[\w\'_\-]*[\w_]+)|
(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)                                 
)
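
The order of the alternatives in `word_pattern` matters: at each position a regex alternation tries its branches from left to right, so the URL, username, hashtag, and cashtag branches get a chance before the generic word branch. The sketch below illustrates this with two branches only. Here `url` is a simplified stand-in (an assumption), not the notebook's URL pattern, and `re` is used in place of `regex`.

```python
import re

url = r"https?://\S+"     # simplified stand-in for the URL branch (assumption)
word = r"[a-zA-Z0-9'-]+"  # the generic word branch from the notebook

text = "wow https://t.co/abc"  # hypothetical text

# URL branch listed first: the link survives as one token
url_first = re.findall(f"{url}|{word}", text)

# Word branch listed first: the link is broken into word-like pieces
word_first = re.findall(f"{word}|{url}", text)

print(url_first)
print(word_first)
```

With the word branch first, `https` is consumed as an ordinary word before the URL branch is ever tried, which is why the more specific patterns are listed first.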


## **Simulating the Tokenizer given the Tweets dataset**

In [126]:
tokenizer = RegExTokenizer(word_pattern)

In [127]:
tokenizer.tokenize("I ain't lying.")

['I', "ain't", 'lying']
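
A slightly richer check uses a made-up tweet that mixes a username, a hashtag, and a URL. This is a sketch that rebuilds the same pattern pieces with the standard-library `re` module rather than `regex` (an assumption made for portability), passing the same flags used in `tokenize`. The `sample` string is hypothetical.

```python
import re

# Same pattern pieces as in the notebook, in the same order
regex_strings = (
    r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+""",  # URL
    r"""(?:@[\w_]+)""",                                        # Twitter username
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)""",                      # hashtag
    r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)""",                      # cashtag
    r"""(?:[+\-]?\d+[,/.:-]\d+[+\-]?)|(?:[a-zA-Z0-9\'-]+)""",  # numbers, then words
)
word_pattern = r"(%s)" % "|".join(regex_strings)

sample = "Ay unahan po ser? @SB19Official #mp2 https://t.co/6gdMT3SDwK"
tokens = re.findall(word_pattern, sample, re.VERBOSE | re.I | re.UNICODE)
print(tokens)
```

Punctuation such as `?` is not covered by any branch, so it is skipped rather than emitted as a token.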

In [128]:
tokens = tokenizer.tokenize_df(df=tweets['text'])

In [129]:
ctr = TokenCounter(tokens)
ctr.generate_analysis()

Finished analyzing token series...


In [130]:
ctr.display_counts()

Total tokens: 2115372
Total vocabulary: 295883
All words + counts (Top 25 only):
na : 42657
ko : 32465
sa : 30001
ng : 19218
ako : 17316
to : 16982
the : 15593
I : 14046
ang : 13480
lang : 13305
you : 12156
ka : 11543
and : 11501
a : 11097
pa : 10896
mo : 10423
naman : 9774
of : 8727
for : 8143
mga : 7925
is : 7630
yung : 7159
my : 6855
in : 6662
at : 6322
