Now we want to enhance the `get_bow_from_docs` function so that it works with HTML webpages. HTML contains a lot of noise, such as HTML tags, JavaScript, and [Unicode characters](https://www.w3schools.com/charsets/ref_utf_misc_symbols.asp), that will pollute your bag of words. We need to clean up this junk before generating the BoW.

Next, define several new functions, each specialized to clean up one aspect of the HTML. For instance, you can have a `strip_html_tags` function to remove all HTML tags, a `remove_punctuation` function to remove all punctuation, a `to_lower_case` function to convert a string to lowercase, and a `remove_unicode` function to remove all Unicode characters.

Then, in your `get_bow_from_docs` function, call each of the functions you created to clean up the HTML before you generate the corpus.

Note: Please use only Python string operations and regular expressions in this lab. Do not use extra libraries such as `beautifulsoup`, because that would defeat the purpose of the practice.

In [5]:
# Define your string handling functions below
import re
# Define at least 3 functions

def strip_html_tags(docs):
    '''
    Description: This function takes an HTML document and removes all tags.
    Input:
        * docs: String containing the HTML document
    
    Output:
        * output: String with every HTML tag replaced by a space
    '''
    output = re.sub(r'<.*?>', ' ', docs)
    return output

def remove_punctuation(docs):
    '''
    Description: This function takes a document and removes all punctuation.
    Input:
        * docs: a string document
    
    Output:
        * output: String with every non-word character (anything other than
          letters, digits, and underscores) replaced by a space.
    '''
    output = re.sub(r'[^\w]',' ', docs)
    return output

def remove_special_characters(docs):
    '''
    Description: This function takes a document and removes numeric HTML
    character references such as &#8217; from it.
    Input:
        * docs: a string document
    
    Output:
        * output: String with all numeric character references removed
    '''
    output = re.sub(r'&#\d+;?', ' ', docs)
    return output

def remove_numbers(docs):
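    '''
    Description: This function takes a document and removes numbers and words containing digits.
    Input:
        * docs: a string document
    
    Output:
        * output: String with numbers and words containing digits removed
    '''
    # Note: a lone digit (e.g. '3') or a word whose only digit is its last
    # character (e.g. 'h2') is not matched by this pattern.
    output = re.sub(r'\d+\w+?|\w+?\d+\w+', ' ', docs)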
    return output

def remove_unicode(docs):
    '''
    Description: This function takes a document and removes all non-ASCII characters.
    Input:
        * docs: a string document
    
    Output:
        * output: String with all non-ASCII characters removed
    '''
    
    # Encoding with errors='ignore' drops every character outside ASCII.
    output = docs.encode('ascii', 'ignore').decode('ascii')
    return output

def to_lower_case(docs):
    '''
    Description: This function takes a document and converts it to lower case.
    Input:
        * docs: a string document
    
    Output:
        * output: String with all words in lower case
    '''
    output = docs.lower()
    return output
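
The order in which these functions are chained matters. As a small illustration (the sample string below is made up, not taken from the lab's HTML files), removing character references before punctuation makes the whole reference disappear, while the reverse order leaves its digits behind:

```python
import re

# Made-up sample containing a numeric character reference (&#8217;).
sample = "<p>Don&#8217;t stop!</p>"

# Same patterns as strip_html_tags, remove_special_characters and remove_punctuation.
no_tags = re.sub(r'<.*?>', ' ', sample)

# Character references first, then punctuation: the whole reference is removed.
entities_first = re.sub(r'[^\w]', ' ', re.sub(r'&#\d+;?', ' ', no_tags))

# Punctuation first: '&', '#' and ';' are gone, so the reference pattern
# no longer matches and the digits 8217 stay in the text.
punctuation_first = re.sub(r'&#\d+;?', ' ', re.sub(r'[^\w]', ' ', no_tags))

print(entities_first.split())     # ['Don', 't', 'stop']
print(punctuation_first.split())  # ['Don', '8217', 't', 'stop']
```

This is one reason to call `remove_special_characters` before `remove_punctuation` when you combine the functions.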

Next, paste your previously written `get_bow_from_docs` function below. Call your functions above at the appropriate place.

In [6]:
def get_bow_from_docs(docs, stop_words=()):
    
    # In the function, first define the variables you will use such as `corpus`, `bag_of_words`, and `term_freq`.
    
    '''
    Description: This function takes an array of document paths and computes the unique words (bag_of_words)
    found in them and their term frequencies (term_freq). Optionally, it filters out the words listed in
    stop_words.
    
    Input:
        * docs: Array of document paths
        * stop_words: (Optional) Array of strings to filter.
        
    Output:
        * bag_of_words: Array of unique lowercase words.
        * term_freq: Array with one term-frequency array per document.
        
    Local variables:
        * corpus: Array containing the cleaned text of each document. Needed to extract the unique words and their term frequencies.
        * words: Array containing the words in each document of corpus.
    '''
    bag_of_words = []
    term_freq = []
    
    """
    Loop `docs` and read the content of each doc into a string in `corpus`.
    Remember to convert the doc content to lowercases and remove punctuation.
    """
    t0 = []
    for path in docs:
        # Use a context manager so each file is closed after reading.
        with open(path, 'r', encoding='utf-8') as f:
            t0.append(f.read())

    '''
    Use the functions defined above to filter the text.
    ''' 
    t1 = [strip_html_tags(f) for f in t0]
    t2 = [remove_special_characters(f) for f in t1]
    t3 = [remove_punctuation(f) for f in t2]
    t4 = [remove_numbers(f) for f in t3]
    t5 = [remove_unicode(f) for f in t4]
    corpus = [to_lower_case(f) for f in t5]
    print(f'Corpus = {corpus}')
    print("-----------------------------------------------------------")
    
    """
    Loop `corpus`. Append the terms in each doc into the `bag_of_words` array. The terms in `bag_of_words` 
    should be unique which means before adding each term you need to check if it's already added to the array.
    In addition, check if each term is in the `stop_words` array. Only append the term to `bag_of_words`
    if it is not a stop word.
    """
    for s in corpus:
        words = re.split(r'\s', s)  # raw string avoids an invalid escape sequence warning
        for word in words:
            if word not in bag_of_words and word not in stop_words: #if word is not present and not in stop_words
                bag_of_words.append(word)

    """
    Loop `corpus` again. For each doc string, count the number of occurrences of each term in `bag_of_words`. 
    Create an array for each doc's term frequency and append it to `term_freq`.
    """
    #Calculate term frequency

    for s in corpus:
        words = re.split(r'\s', s)  # raw string avoids an invalid escape sequence warning
        bag_vector = len(bag_of_words)*[0]
        for w in words:
            for i, word in enumerate(bag_of_words):
                if word == w:
                    bag_vector[i] += 1
        term_freq.append(bag_vector)
    
    # Now return your output as an object
    return {
        "bag_of_words": bag_of_words,
        "term_freq": term_freq
    }
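
The nested loops above compare every word of a document against every term in `bag_of_words`. As a possible alternative, here is a minimal sketch that counts each document once with `collections.Counter` from the standard library and should give the same vectors. The helper name `term_freq_with_counter` and the tiny corpus are hypothetical, chosen only for illustration:

```python
import re
from collections import Counter

def term_freq_with_counter(corpus, bag_of_words):
    """Hypothetical helper: build one frequency vector per document in `corpus`."""
    term_freq = []
    for doc in corpus:
        # Same tokenization as in get_bow_from_docs.
        counts = Counter(re.split(r'\s', doc))
        # A Counter returns 0 for terms that do not occur in the document.
        term_freq.append([counts[word] for word in bag_of_words])
    return term_freq

# Made-up example data.
corpus = ['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']
bag_of_words = ['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']

print(term_freq_with_counter(corpus, bag_of_words))
```

Each document is scanned only once, which may help on large pages such as the Wikipedia article.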
    

Next, read the content from the three HTML webpages in the `your-codes` directory to test your function.

In [7]:
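# Recent scikit-learn releases no longer expose `sklearn.feature_extraction.stop_words`;
# the same list can be imported from `sklearn.feature_extraction.text`.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
bow = get_bow_from_docs([
        'www.coursereport.com_ironhack.html',
        'en.wikipedia.org_Data_analysis.html',
        'www.lipsum.com.html'
    ],
    ENGLISH_STOP_WORDS
)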

print(bow)

Corpus = ['                ironhack reviews   course report            try  typekit load     catch e                  javascript_include_tag    oss maxcdn com libs   3 7 0   js    javascript_include_tag    oss maxcdn com libs respond js 1 4 2 respond min js           toggle navigation                browse schools    full stack web development    mobile development    front end web development    data science    ux design    digital marketing    product management    security    other      blog    advice    ultimate guide  choosing a school    best coding bootcamps    best in data science    best in ui ux design    best in cybersecurity      write a review    sign in                  ironhack     amsterdam  barcelona  berlin  madrid  mexico city  miami  paris  sao paulo        ironhack   ironhack          avg rating  4             reviews             about    courses    reviews    news          contact alex williams from ironhack       about     about    ironhack is a 9 week  full time

{'bag_of_words': ['', 'ironhack', 'reviews', 'course', 'report', 'try', 'typekit', 'load', 'catch', 'e', 'javascript_include_tag', 'oss', 'maxcdn', 'com', 'libs', '3', '7', '0', 'js', 'respond', '1', '4', '2', 'min', 'toggle', 'navigation', 'browse', 'schools', 'stack', 'web', 'development', 'mobile', 'end', 'data', 'science', 'ux', 'design', 'digital', 'marketing', 'product', 'management', 'security', 'blog', 'advice', 'ultimate', 'guide', 'choosing', 'school', 'best', 'coding', 'bootcamps', 'ui', 'cybersecurity', 'write', 'review', 'sign', 'amsterdam', 'barcelona', 'berlin', 'madrid', 'mexico', 'city', 'miami', 'paris', 'sao', 'paulo', 'avg', 'rating', 'courses', 'news', 'contact', 'alex', 'williams', '9', 'week', 'time', 'bootcamp', 'florida', 'spain', 'france', 'germany', 'uses', 'customized', 'approach', 'education', 'allowing', 'students', 'shape', 'experience', 'based', 'personal', 'goals', 'admissions', 'process', 'includes', 'submitting', 'written', 'application', 'interview',

Do you see any problems in the output? How would you improve it?
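
One visible problem is the empty string `''` at the start of `bag_of_words`. A short sketch of where it likely comes from, using a made-up string with repeated spaces like the ones in the corpus output:

```python
import re

# Made-up text with runs of spaces, similar to the printed corpus.
text = '  ironhack   reviews  '

# Splitting on every single whitespace character yields empty strings
# between consecutive spaces.
print(re.split(r'\s', text))  # ['', '', 'ironhack', '', '', 'reviews', '', '']

# str.split() with no argument splits on runs of whitespace and drops empties.
print(text.split())           # ['ironhack', 'reviews']
```

Splitting on `r'\s+'` and discarding empty tokens would be another option that stays within regular expressions.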

A good way to improve your code is to look into the HTML data sources and try to understand where the messy output comes from. A good data analyst always studies the data in depth in order to do the job well.
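
For example, tokens such as `try`, `typekit`, `load`, and `catch` in the output suggest that JavaScript inside `<script>` elements survives tag stripping, because only the tags themselves are removed and not the code between them. A minimal sketch of one way to handle this, assuming a hypothetical helper `strip_scripts_and_styles` and a made-up HTML snippet:

```python
import re

def strip_scripts_and_styles(html):
    """Hypothetical helper: drop <script> and <style> blocks, contents included."""
    # \1 refers back to whichever tag name opened the block.
    # DOTALL lets '.' span line breaks; IGNORECASE also covers <SCRIPT>.
    pattern = r'<(script|style)\b.*?</\1\s*>'
    return re.sub(pattern, ' ', html, flags=re.DOTALL | re.IGNORECASE)

# Made-up page fragment for illustration only.
html = '''<html><head><style>p { color: red; }</style>
<script type="text/javascript">try { Typekit.load(); } catch (e) {}</script>
</head><body><p>Ironhack reviews</p></body></html>'''

cleaned = strip_scripts_and_styles(html)
# Same pattern as strip_html_tags, with DOTALL so multi-line tags also match.
text = re.sub(r'<.*?>', ' ', cleaned, flags=re.DOTALL)

print(text.split())  # ['Ironhack', 'reviews']
```

Calling such a helper before `strip_html_tags` is a design choice: once the tags are gone, there is no longer a reliable way to tell script code from page text.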

Spend 20-30 minutes improving your functions, or continue until you feel comfortable with string operations. This lab is just practice, so there is no need to stress. If you feel you have practiced enough, you can stop and move on to the next challenge question.