# Advanced Text Cleaning V01

One primary issue with text is that in its unstructured format, it contains a lot of information that is not directly useful. This is partly why normalization via LLM has been so useful. 

Pay-to-play LLMs have been trained on enormous amounts of data, so they cope well with largely unstructured text. However, if you are building your own LLM, or simply want more consistent results, you may need to strip as much as you can from the text to cut down on the noise in your training data.

My goal with this mini project was to come up with a process that cleans text given the following requirements:
1. It needs to be relatively efficient; it can't take 5 minutes to clean 1000 sentences.
2. It needs to produce consistent results whether it is cleaning 1 sentence or 100.
3. It needs to have the option to filter out stop words, specified parts of speech (POS), and misspelled words.
4. It needs to stem words in a manner that does not take the word out of context.

I came up with the following process. 

I am calling this version 01 because the next will have 'smart' re-spelling of misspelled words (hopefully).

### Import Libs

Our first step is to import the libraries we will need. We will be combining features from several NLP libraries.

One thing to note is that I had some trouble installing 'hunspell'. I ended up having to download the zip from GitHub and install the package locally. Here is where I went: https://github.com/MSeal/cython_hunspell

In [1]:
# =============================================================================
# import libraries and data
# =============================================================================
import os
import pandas as pd
import re
from nltk import word_tokenize, everygrams, FreqDist, pos_tag, stem
import stop_words
import hunspell
import collections
import spacy
import numpy as np
import random as rd
import time



### Practice Data

This is the practice data we will be using. It is a combination of product descriptions and sentences. 

In [2]:

# --- grab tbl and modify
df_01 = pd.DataFrame(
    {
     'str_orig': [
    "Lily & Dan Toddler or Childrens Footbed Sandals",
    "Two men are standing in the street behind a table that has a laptop and a TV monitor on it .",
    "$100 Sullivan's Steakhouse eGift Card",
    "A man is throwing freesbies into the air and the border collie is catching them in the air .",
    "Bissell PowerFresh Pet Steam Mop",
    "A golden retreiver standing outside in the snow with a person standing with skis and poles .",
    "Huggies Little Snugglers Diapers",
    "A man and two little boys walk around the outside of what appears to be some kind of store .",
    "Bluey Jumbo Halloween Plush",
    "A man and woman with their hands on their chins are standing next to a Christmas tree .",
    "Joyhound Racoon Snuffle Mat",
    "A man in a wetsuit is throwing a toddler up in the air and is ready to catch him .",
    "Swiffer Wetjet Mopping Pads",
    "Three women standing next to each other are smiling in front of a Christmas tree .",
    "Alessi Italian Bread Crumbs",
    "One man looks into the distance next to another who has his hand in the air .",
    "Kerrygold Pure Irish Butter",
    "A woman and a dog both kneel with their heads down and butt in the air .",
    "Safavieh Bahama Outdoor Rug",
    "A man rides on his bike with one hand and holds a drink with the other .",
    "Nutribullet Juicer Pro",
    "The two dogs , one holding a ball in its mouth , run through the grass .",
    "Sterilite Latching Box",
    "A boy is on a ramp with his hands down and his feet up on a skateboard .",
    "Driscoll's Raspberries",
    "A man has his hands up while standing next to another man with a drink .",
    "Mrs. Meyer's Hand Soap",
    "Three woman stand on a beach below with their shadows long behind them .",
    "Driscoll's Blueberries",
    "Two men toss a ball while standing in the water and a dog is with them .",
    "Casa Mamita Mini Tacos",
    "Two people are dancing with drums on the right and a crowd behind them .",
    "Ninja Foodi PossibleCooker Multi-Cooker",
    "a large group of people rasing their hands in the air all at once .",
    "Bioblender by EcoTools Cleansing Sponge",
    "The dog that the two children are walking is looking up at a tree .",
    "Sunvilla Harrington Outdoor Seating Set",
    "The dog is jumping for the item the man is holding above his head .",
    "Moose Toys Cookeez Makery Toasty Treatz",
    "A girl in a dress is blowing bubbles with another girl behind her .",
    "Hershey's Dog Toy",
    "A dog can be seen from the inside of a car as it crosses the road .",
    "Halloween Doormat",
    "A player on the Boston Red Sox prepares for a pitch during a game .",
    "Starbucks Tumbler",
    "Three dogs playing in a field , one with a ball in its mouth .",
    "Duncan Hines Dolly Parton Frosting",
    "a group of teenagers standing outside of a convienance store .",
    "Smartfood Popcorn",
    "The woman is wearing a black hat with an American flag on it .",
    "Popsicle Ice Pops",
    "A jug is jumping up it is being squirted with a jet of water .",
    "Dr. Seuss Erasers",
    "A girl is doing the splits in the air in front of some trees .",
    "Gwaltney Hot Dogs",
    "A man standing on a rock in a river with a spash next to him .",
    "liveGfree Ravioli",
    "The man is jumping through the air , while holding a bicycle .",
    "Covergirl Mascara",
    "A man is doing a trick on a bike in the air over a ramp .",
    "SmartFood Popcorn",
    "A dog is running across a desert with bushes around him .",
    "Sabatasso's Four Cheese Pizza",
    "A group of people put two of their fingers on a Frisbee .",
    "Serra Women's Footbed Sandals",
    "Two players are on a wet field and one is on the ground .",
    "Cheez-It Snap'd Cracker Chips",
    "One dog climbing on the back of another dog in the snow .",
    "Trolli Candy",
    "There are three women with cameras talking on the beach .",
    "Johnsonville Smoke Brats",
    "Two dogs wrestle with each other with their teeth bared .",
    "Johnsonville Fresh Brats",
    "One dog wades in the water while another dog is on land .",
    "Kroger Water",
    "three young african-american girls smile for the camera .",
    "Nissin Top Ramen Noodles",
    "The man on the left is looking at the woman and the man .",
    "Nicole Miller Danica Rug",
    "The person is fishing , with waves splashing around him .",
    "Sharwood's Cooking Sauce",
    "a group of men are fighting in front of a bar in a European city , one of them hitting another with a stick .",
    "Arnold Bread",
    "four Asian girls are walking through a locker room .",
    "Ninja Creami",
    "The girl is being pushed on the swing by the woman .",
    "Ninja CREAMi",
    "Two girls are walking along the street and talking .",
    "Evian Waters",
    "Two dogs fighting , one is black , the other beige .",
    "Wonderfold W4 Quad Wagon",
    "Three small dogs , two of which are sniffing noses .",
    "Gardenline 6-in-1 Tool or Pruner Set",
    "The male kayaker is moving through the rough water .",
    "Vornado 279TR Whole Room Air Circulator Fan",
    "A German Shephard carries a small log in the water .",
    "Marie Callender's Frozen Dinner",
    "There are two children in the water and one balancing on a float and they are all wearing helmets .",
    "Pillsbury Crescents",
    "Two dogs play with each other on a wood floor ."
    ]
    }
)

### Functions

Next we will be defining several functions. I won't be going into great detail because they are fairly well commented. But always feel free to reach out and ask questions.

Our first function cleans the text by removing apostrophes, replacing every other non-letter character with a space, collapsing extra white space, and lower-casing the result.

In [3]:
# --- cleans post content data
def string_mod(StrInput):
    '''
    Parameters
    ----------
    StrInput : TYPE string
        DESCRIPTION. a string that needs to be cleaned
        
    Returns
    -------
    str_mod : TYPE string
        DESCRIPTION. a cleaned version of the string
    '''   
    # -- remove apostrophes so possessives and contractions stay one word
    str_mod = re.sub('\'', '', StrInput)
    # -- remove everything but space and alpha
    str_mod = re.sub(r'[^a-zA-Z ]', ' ', str_mod)
    # -- any space 2 or greater change to one
    str_mod = re.sub(r' {2,}', ' ', str_mod)
    # -- strip and lower
    str_mod = str_mod.strip().lower()   
    # -- return final string
    return str_mod
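
To make the effect of each step concrete, here is a small standalone sketch that repeats the same regex steps on two of the product descriptions from the practice data. The helper name `clean_text` is only for illustration; it mirrors `string_mod` above rather than replacing it.

```python
import re

def clean_text(text):
    # illustrative stand-in that mirrors the steps in string_mod
    text = re.sub(r"'", "", text)            # drop apostrophes: "Sullivan's" -> "Sullivans"
    text = re.sub(r"[^a-zA-Z ]", " ", text)  # every other non-letter becomes a space
    text = re.sub(r" {2,}", " ", text)       # collapse runs of spaces into one
    return text.strip().lower()

example = clean_text("$100 Sullivan's Steakhouse eGift Card")
print(example)                                      # sullivans steakhouse egift card
print(clean_text("Cheez-It Snap'd Cracker Chips"))  # cheez it snapd cracker chips
```

Two side effects are worth keeping in mind: digits disappear entirely, and hyphenated words are split into separate tokens.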

This next function takes all of the words that are present in the text and outputs a dictionary.

In [4]:
# --- function to get base dictionary
def base_dictionary(StringListInput, DictInput, StopWordInput):
    '''
    Parameters
    ----------
    StringListInput : TYPE list
        DESCRIPTION. a list of strings
    DictInput : TYPE hunspell.HunspellWrap
        DESCRIPTION. a dictionary from the hunspell library
    StopWordInput : TYPE list
        DESCRIPTION. a list of stop words
        
    Returns
    -------
    wrd_dct : TYPE dictionary
        DESCRIPTION. a dictionary keyed by word; each value is a dictionary of that word's attributes
    '''
    # -- list to append to
    tmp_word_lst = []
    # -- get all the words and place in list
    for stc in [string_mod(i) for i in StringListInput]:
        try:
            tmp_word_lst.extend(word_tokenize(stc))
        except Exception:
            continue
    # -- this is our first word dictionary with counts
    wrd_dct_counts = collections.Counter(tmp_word_lst)
    # -- our primary word dictionary
    wrd_dct = {}
    # -- loop through all words
    for wrd in wrd_dct_counts:
        # -- default values
        wrd_dct_loop = {
            'count':wrd_dct_counts[wrd],
            'pos':'',
            'all_stem':'',
            'short_stem':'',
            'can_spell':False,
            'suggested_spell':'',
            'nchar':len(wrd),
            'is_stop_word':False
        }
        # -- try to get pos
        try:
            tmp_pos = pos_tag([wrd])[0][1]
            wrd_dct_loop['pos'] = tmp_pos
        except:
            pass
        # -- try to get all stem and short stem
        try:
            tmp_stem_lst = list(DictInput.stem(wrd))
            # -- shortest stem; the first one wins when several tie
            tmp_short_stem = min(tmp_stem_lst, key=len)
            wrd_dct_loop['all_stem'] = tmp_stem_lst
            wrd_dct_loop['short_stem'] = tmp_short_stem
        except:
            pass
        # -- see if spellable
        try:
            wrd_dct_loop['can_spell'] = DictInput.spell(wrd)                           
        except:
            pass
        # -- check if stop word
        try:
            if wrd in StopWordInput:
                wrd_dct_loop['is_stop_word'] = True
        except:
            pass
        # -- set value of wrd dct
        wrd_dct[wrd]=wrd_dct_loop
    # -- output dictionary
    return wrd_dct
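
The `short_stem` value is simply the shortest of the stems Hunspell returns for a word. Here is a minimal sketch of that selection in isolation. The helper name `shortest_stem` and the candidate lists are made up for illustration; they are not actual Hunspell output, which depends on the installed dictionary.

```python
def shortest_stem(candidates):
    # return the shortest candidate, or '' when there are none
    # (min returns the first of several equally short candidates)
    return min(candidates, key=len) if candidates else ''

# hypothetical stem candidates, not real Hunspell output
print(shortest_stem(['standing', 'stand']))  # stand
print(repr(shortest_stem([])))               # ''
```

In `base_dictionary` the empty case is handled by the `try`/`except` instead, which leaves `short_stem` at its default of `''`.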

This is a simple function that takes two words and outputs their cosine similarity, computed from pretrained word vectors. VERY USEFUL!

In [5]:
# --- function to get cosine similarity based on pretrained vals
def get_cos_sim(WordX, WordY, NlpInput):
    '''
    Parameters
    ----------
    WordX : TYPE string
        DESCRIPTION. one word string
    WordY : TYPE string
        DESCRIPTION. one word string
    NlpInput : TYPE spacy.lang
        DESCRIPTION. word vectorizer
    Returns
    -------
    similarity : TYPE float
        DESCRIPTION. similarity between two words
    '''
    vec_x = NlpInput(WordX).vector 
    vec_y = NlpInput(WordY).vector
    similarity = vec_x.dot(vec_y) / (np.linalg.norm(vec_x) * np.linalg.norm(vec_y))
    return similarity
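
For intuition, the same formula can be tried on toy vectors without loading a model. This is a sketch in plain Python rather than the notebook's numpy code, and the zero-vector guard is my own assumption about how to handle a word that has no vector; `get_cos_sim` above has no such guard, so the division is undefined in that case.

```python
import math

def cosine_similarity(vec_x, vec_y):
    # dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(vec_x, vec_y))
    norm_x = math.sqrt(sum(x * x for x in vec_x))
    norm_y = math.sqrt(sum(y * y for y in vec_y))
    # assumption: treat an all-zero vector as "no similarity"
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 0.0]))            # 0.0
```

Vectors pointing the same way score 1, unrelated (orthogonal) vectors score 0, which is why a threshold such as 0.5 can serve as a rough "still means the same thing" check.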

Next we have a function that goes through each word in the dictionary we made earlier and decides which sub word should be assigned to it, based on the conditions passed in.

In [6]:
# --- gets sub word from dictionary
def get_sub_word(
        WordInput, 
        BaseDictionaryInput,
        NlpInput,
        CosineSimThold = 0.01,
        NcharThold = 0,
        OmitStopWords = False,
        OmitNonSpell = False,
        OmitPosList = []
    ):
    '''
    Parameters
    ----------
    WordInput : TYPE String
        DESCRIPTION. a single word
    BaseDictionaryInput : TYPE Dictionary
        DESCRIPTION. the base dictionary that was generated earlier
    NlpInput : TYPE spacy object
        DESCRIPTION. used to find cosine similarity from word vectors
    CosineSimThold : TYPE, optional float
        DESCRIPTION. The default is 0.01. minimum cosine similarity needed to swap a word for its short stem
    NcharThold : TYPE, optional integer
        DESCRIPTION. The default is 0. words with this many characters or fewer are dropped
    OmitStopWords : TYPE, optional boolean
        DESCRIPTION. The default is False. whether or not to get rid of stop words
    OmitNonSpell : TYPE, optional boolean
        DESCRIPTION. The default is False. whether to get rid of words the spell checker does not recognize
    OmitPosList : TYPE, optional list
        DESCRIPTION. The default is []. the list of pos tags that you want to get rid of

    Returns
    -------
    tmp_wrd_dct : TYPE dictionary
        DESCRIPTION. the word's dictionary entry with a 'sub' key added
    '''
    # --- dictionary we are adding sub word to
    tmp_wrd_dct = BaseDictionaryInput[WordInput]
    # --- set default if no conditions met
    tmp_wrd_dct['sub'] = WordInput
    # --- default sub value is short stem word 
    if tmp_wrd_dct['short_stem']:
        if tmp_wrd_dct['short_stem'] == WordInput:
            tmp_wrd_dct['sub'] = WordInput
        else:
            try:
                cos_sim = get_cos_sim(WordInput, tmp_wrd_dct['short_stem'], NlpInput)
            except:
                cos_sim = 0.0
            if cos_sim >= CosineSimThold:
                tmp_wrd_dct['sub'] = tmp_wrd_dct['short_stem']
    # --- if its stop word
    if OmitStopWords and tmp_wrd_dct['is_stop_word']:
        tmp_wrd_dct['sub'] = ''
    # --- if its non spellable
    if OmitNonSpell and not tmp_wrd_dct['can_spell']:
        tmp_wrd_dct['sub'] = ''
    # --- if its below char threshold
    if tmp_wrd_dct['nchar'] <= NcharThold:
        tmp_wrd_dct['sub'] = ''
    # --- if its in omit pos list
    if tmp_wrd_dct['pos'] in OmitPosList:
        tmp_wrd_dct['sub'] = ''
    # --- return dictionary with sub word
    return tmp_wrd_dct
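
The order of the rules matters: the stem decision comes first, and any single omit rule then blanks the word regardless of that decision. The sketch below restates that logic as a small pure function so it can be tried without Hunspell or spaCy. The name `choose_sub`, the precomputed `cos_sim` argument, and the hand-written entries are all assumptions for illustration.

```python
def choose_sub(word, entry, cos_sim, cos_thold=0.5, nchar_thold=2,
               omit_stop=True, omit_non_spell=True,
               omit_pos=('CC', 'CD', 'DT', 'IN')):
    # start from the word itself; swap in the stem only if it is similar enough
    sub = word
    stem = entry['short_stem']
    if stem and stem != word and cos_sim >= cos_thold:
        sub = stem
    # any single omit rule blanks the word, whatever the stem decision was
    if ((omit_stop and entry['is_stop_word'])
            or (omit_non_spell and not entry['can_spell'])
            or entry['nchar'] <= nchar_thold
            or entry['pos'] in omit_pos):
        sub = ''
    return sub

# hand-written entries; the values are illustrative, not computed
standing = {'short_stem': 'stand', 'is_stop_word': False,
            'can_spell': True, 'nchar': 8, 'pos': 'VBG'}
the = {'short_stem': 'the', 'is_stop_word': True,
       'can_spell': True, 'nchar': 3, 'pos': 'DT'}

print(choose_sub('standing', standing, cos_sim=0.7))  # stand
print(choose_sub('standing', standing, cos_sim=0.2))  # standing
print(repr(choose_sub('the', the, cos_sim=1.0)))      # ''
```

A low similarity keeps the original word rather than dropping it, which is how the process avoids stemming a word out of context.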

Our last function is a simple tool allowing us to sub each word in the unstructured text for its designated sub word.

In [7]:
# --- use sub word dictionary and output new string
def use_sub_dct(StrInput, SubWrdDctInput):
    '''
    Parameters
    ----------
    StrInput : TYPE String
        DESCRIPTION. a description or sentence that needs to be cleaned
    SubWrdDctInput : TYPE Dictionary
        DESCRIPTION. a dictionary with the word and sub word

    Returns
    -------
    string_out : TYPE string
        DESCRIPTION. the cleaned string
    '''
    # -- clean string
    new_string = string_mod(StrInput)
    # -- split to list
    new_string_split = word_tokenize(new_string)
    # -- sub
    new_string = ' '.join([SubWrdDctInput[i]['sub'] for i in new_string_split])
    # -- re clean to get rid of white space
    string_out = string_mod(new_string)
    # -- return string
    return string_out
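
Here is a toy version of the substitution step. It is a sketch under a few assumptions: the text is already cleaned and lower-cased, `str.split` stands in for `word_tokenize`, and the mapping goes straight from word to sub word instead of through the nested dictionary. It also uses `.get` so that a word missing from the mapping is dropped, whereas `use_sub_dct` above would raise a `KeyError`.

```python
def apply_subs(text, sub_map):
    # look up each token; unknown tokens fall back to '' and are dropped
    kept = [sub_map.get(tok, '') for tok in text.split()]
    # joining only the non-empty subs avoids leftover double spaces
    return ' '.join(w for w in kept if w)

# toy mapping from word to sub word, written by hand for illustration
sub_map = {
    'two': '', 'men': 'men', 'are': '', 'standing': 'stand',
    'in': '', 'the': '', 'street': 'street',
}

print(apply_subs('two men are standing in the street', sub_map))  # men stand street
```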

### Apply Functions

Now that we have all the tools built, we can apply them to the text and clean it accordingly.

In [8]:
# --- get base dictionary to modify as sub dictionary
base_dct = base_dictionary(
    StringListInput=list(df_01['str_orig']),
    DictInput=hunspell.Hunspell('en_US'),
    StopWordInput=[string_mod(i) for i in stop_words.get_stop_words('en')]
)

# --- the nlp input we will use
nlp_input = spacy.load("en_core_web_lg")

# --- sub dictionary
sub_dct = {}

# --- set values in base dictionary
for word in base_dct:
    try:
        tmp_wrd_dct = get_sub_word(
            WordInput=word, 
            BaseDictionaryInput=base_dct,
            NlpInput=nlp_input,
            CosineSimThold = 0.5,
            NcharThold = 2,
            OmitStopWords = True,
            OmitNonSpell = True,
            # Penn Treebank tags: conjunctions, cardinal numbers, determiners, prepositions
            OmitPosList =['CC', 'CD', 'DT', 'IN']
        )
        sub_dct[word] = tmp_wrd_dct
    except:
        continue

# --- use sub dictionary
df_01['str_mod'] = df_01['str_orig'].apply(lambda x: use_sub_dct(x, sub_dct))
df_01 = df_01[df_01['str_mod'] != '']
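
To check the efficiency requirement, a step can be wrapped in a small timer. This is a sketch with a hypothetical helper, `time_call`, and a stand-in workload; no timings were recorded in this notebook, so none are shown here.

```python
import time

def time_call(func, *args, **kwargs):
    # run func once and return (result, elapsed seconds)
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# stand-in workload; in the notebook this could wrap the cleaning step, e.g.
# time_call(df_01['str_orig'].apply, lambda x: use_sub_dct(x, sub_dct))
result, seconds = time_call(sorted, ['b', 'a', 'c'])
print(result, f'{seconds:.6f}s')
```

Most of the cost is likely to sit in building the dictionaries rather than in the final substitution, so timing those steps separately would show where the time goes.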

Here are the results. It's not perfect, but I am excited to see how this will help with training data, specifically for normalization. Let me know what you think!

In [9]:
# --- print results
for inx, row in df_01.iterrows():
    orig_string = row['str_orig']
    new_string = row['str_mod']
    str_out = f'''

---------------------------------------------------------------
OLD:
    {orig_string}


NEW: 
    {new_string}
'''
    print(str_out)



---------------------------------------------------------------
OLD:
    Lily & Dan Toddler or Childrens Footbed Sandals


NEW: 
    lily toddler sandal



---------------------------------------------------------------
OLD:
    Two men are standing in the street behind a table that has a laptop and a TV monitor on it .


NEW: 
    men stand street table laptop monitor



---------------------------------------------------------------
OLD:
    $100 Sullivan's Steakhouse eGift Card


NEW: 
    steakhouse card



---------------------------------------------------------------
OLD:
    A man is throwing freesbies into the air and the border collie is catching them in the air .


NEW: 
    man throw air border collie catch air



---------------------------------------------------------------
OLD:
    Bissell PowerFresh Pet Steam Mop


NEW: 
    pet steam mop



---------------------------------------------------------------
OLD:
    A golden retreiver standing outside in the snow with a