Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider a trie based approach to maybe increase the overall performance of this package #54

Open
mourad1081 opened this issue Aug 19, 2023 · 0 comments

Comments

@mourad1081
Copy link

mourad1081 commented Aug 19, 2023

Hello,

In the context of my application, I developed a simple profanity checker. I compared it to yours and mine runs 10,000X faster.

Note: What I do not take into consideration are varying swear words (sex, s3x, etc.). However, building the Trie can incorporate these variations. In my context, I do not do it because I check well written book titles and descriptions.

Note 2: I only use a contains-based approach. However, this approach can also use a censor method.

Here is the implementation:

import re

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


class ProfanityChecker:
    def __init__(self):
        with open("profanity.txt", "r") as f:
            self.root = self.build_trie(f.read().splitlines())

    # Building the trie: O(m * k),
    # where m is the number of words in the dictionary and k is the average length of words.
    def build_trie(self, dictionary):
        root = TrieNode()
        for word in dictionary:
            node = root
            for char in word:
                if char not in node.children:
                    node.children[char] = TrieNode()
                node = node.children[char]
            node.is_end_of_word = True
        return root

    # Searching each word in the text: O(n * k), where n is the number of words in the text.
    def contains_profanity(self, text):
        def search(remaining_text, node=self.root):
            for i, char in enumerate(remaining_text):
                if char in node.children:
                    node = node.children[char]
                    if node.is_end_of_word and (i == len(remaining_text) - 1 or remaining_text[i + 1] in (' ', '-')):
                        return True
                else:
                    break  # Stop searching if the character is not in the trie
            return False

        # Remove punctuation and convert to lowercase before tokenizing
        text = re.sub(r'[^\w\s]', '', text)
        words = text.split()

        for word in words:
            if search(word.lower()):
                return True
        return False


if __name__ == '__main__':
    pc = ProfanityChecker()
    print(pc.contains_profanity("my assessment"))  # Output: False
    print(pc.contains_profanity("my ass essment"))  # Output: True
    print(pc.contains_profanity("my ass-essment"))  # Output: True
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant