# Numerical Uniformity

The code below replaces every run of digits with a single zero. We do this in both the title and the body text because the specific numerical values are not important; they add noise to the data and may prevent the model from extracting meaningful features from the text.

In [None]:
import re

In [None]:
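def replace_with_zeroes(df, col):
    '''replaces every run of digits in df[col] with a single "0"'''
    # \d+ matches one or more consecutive digits, so "2023" becomes "0", not "0000"
    pattern = re.compile(r'\d+')
    df[col] = df[col].apply(lambda x: pattern.sub('0', x))

    return df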
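
To make the effect of the substitution concrete, here is a minimal sketch that applies the same `\d+` pattern to a few plain strings instead of a DataFrame column. The sample titles are made up for illustration and do not come from the dataset.

```python
import re

# Same pattern as in replace_with_zeroes: one or more consecutive digits.
pattern = re.compile(r"\d+")

# Hypothetical titles, made up for illustration.
titles = [
    "Error 404 when calling API v2",
    "Python 3.10 vs 3.11 performance",
    "No digits here",
]

# Each run of digits collapses to a single "0"; other characters are untouched.
normalized = [pattern.sub("0", t) for t in titles]

for before, after in zip(titles, normalized):
    print(f"{before!r} -> {after!r}")
```

Because the pattern matches whole runs of digits, a version string such as `3.10` becomes `0.0`: the dot splits it into two separate runs.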

# HTML Parsing and Separating Natural Language from Programming Language

In [None]:
import pandas as pd
from bs4 import BeautifulSoup

The function below separates text from code blocks so that they can be passed as separate features to our bimodal NL/PL models. It also cleans the data by removing all HTML tags.

In [None]:
def separate_code_text(*bodies):
    '''
    returns: cleaned text, code for each of the body inputs
    '''
    results = {}
    for i, body in enumerate(bodies):
        soup = BeautifulSoup(body, "html.parser")
        blockquotes = soup.find_all("blockquote")
        
        # remove duplicate-notice blockquote elements and their contents
        for bq in blockquotes:
            # `in` on a Tag tests its direct children, so search the text instead
            if "Duplicate" in bq.get_text():
                bq.decompose()

        code_blocks = soup.find_all("code")
        code = "\n".join([cb.get_text() for cb in code_blocks])
        text = soup.get_text()
        
        results[f"BodyText{i+1}"] = text
        results[f"BodyCode{i+1}"] = code
    
    return pd.Series(results)
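
As a rough illustration of the two extraction calls used above, the sketch below runs `find_all("code")` and `get_text()` on a single post body. The HTML string is a hypothetical example, not taken from the dataset.

```python
from bs4 import BeautifulSoup

# Hypothetical post body, made up for illustration.
sample_body = (
    "<p>How do I reverse a list?</p>"
    "<pre><code>xs = [1, 2, 3]</code></pre>"
    "<p>I tried <code>xs.reverse()</code> already.</p>"
)

soup = BeautifulSoup(sample_body, "html.parser")

# Collect the contents of every <code> element, one per line.
code = "\n".join(cb.get_text() for cb in soup.find_all("code"))

# get_text() drops all tags and returns the remaining strings,
# including the strings that sit inside <code> elements.
text = soup.get_text()

print(code)
print(text)
```

Note that in this sketch `get_text()` on the full document still returns the contents of the `<code>` elements. Extracting those elements before calling `get_text()` would be one way to keep the text feature purely natural language.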

The function below removes every indication that a post has a duplicate, since such markers would not be present in a real-life test set.

In [None]:
def remove_duplicate_blockquote(body_html_1, body_html_2):
    '''returns: cleaned body texts'''
    soup1 = BeautifulSoup(body_html_1, "html.parser")
    soup2 = BeautifulSoup(body_html_2, "html.parser")

    for soup in [soup1, soup2]:
        blockquotes = soup.find_all("blockquote")
        
        # remove the duplicate-notice blockquote element and its contents
        for bq in blockquotes:
            # `in` on a Tag tests its direct children, so search the text instead
            if "Duplicate" in bq.get_text():
                bq.decompose()

    # serialise the whole cleaned document; soup.table would return only the first <table>, or None
    return pd.Series({"Body1": str(soup1), "Body2": str(soup2)})
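
The sketch below shows the intended effect on a single body: a blockquote whose visible text mentions "Duplicate" is removed, and the rest of the markup is kept. The HTML is a hypothetical example, and matching on `get_text()` is one possible way to detect the banner, assumed here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical post body with a duplicate banner, made up for illustration.
html = (
    "<blockquote><p><strong>Possible Duplicate:</strong> "
    "<a href='#'>Reverse a list</a></p></blockquote>"
    "<p>How do I reverse a list?</p>"
)

soup = BeautifulSoup(html, "html.parser")

for bq in soup.find_all("blockquote"):
    # Match on the blockquote's visible text, not on the Tag object itself.
    if "Duplicate" in bq.get_text():
        bq.decompose()  # removes the element and everything inside it

cleaned = str(soup)
print(cleaned)
```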

# Similarity Heuristics for Challenging Examples

Using a similarity heuristic when selecting non-duplicate posts (negative examples) helps build a more challenging training set. Negative examples chosen at random tend to be very dissimilar to the positive examples, so a model trained on them could learn to separate the classes from superficial cues and end up biased. Requiring non-duplicate pairs to share enough tags keeps the negatives close to the duplicate pairs, which should improve the overall accuracy of the model.

In [None]:
def similarity_heuristic(tags1, tags2):
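    '''
    tags1, tags2 -> both are strings formatted like "<tagA><tagB><etc>"
    returns: Similarity1 (shared tags / number of tags in tags2) and
             Similarity2 (shared tags / number of tags in tags1)
    '''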
    
    if not tags1 or not tags2:
        return pd.Series({"Similarity1": 0, "Similarity2": 0})

    tags1, tags2 = tags1[1:-1], tags2[1:-1]

    set1 = set(tags1.split("><"))
    set2 = set(tags2.split("><"))

    intersection = set1 & set2
    sim1 = len(intersection) / len(set2) if set2 else 0
    sim2 = len(intersection) / len(set1) if set1 else 0

    return pd.Series({"Similarity1": sim1, "Similarity2": sim2})
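
A small worked example may help in reading the two scores. The sketch below repeats the same set arithmetic on two hypothetical tag strings, without pandas, so that the numbers can be checked by hand.

```python
# Hypothetical tag strings in the "<tagA><tagB>" format, made up for illustration.
tags1 = "<python><pandas><dataframe>"
tags2 = "<python><pandas><numpy><regex>"

# Strip the outer "<" and ">" and split on the "><" between tags.
set1 = set(tags1[1:-1].split("><"))  # 3 tags
set2 = set(tags2[1:-1].split("><"))  # 4 tags

shared = set1 & set2  # tags present in both posts: python, pandas

sim1 = len(shared) / len(set2)  # 2 shared out of 4 tags in the second post
sim2 = len(shared) / len(set1)  # 2 shared out of 3 tags in the first post

print(sorted(shared), sim1, round(sim2, 3))
```

Here the first score is exactly 0.5 and the second is about 0.667, so a strict `> 0.5` test on both scores would reject this pair.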

In [None]:
def apply_sim_heuristic(df, col_1, col_2, threshold=0.5):
    df[["Similarity1", "Similarity2"]] = df.apply(lambda x: similarity_heuristic(x[col_1], x[col_2]), axis=1)

    # filter on the computed similarity scores, not on the raw tag columns
    df_similar_tags = df[(df["Similarity1"] > threshold) & (df["Similarity2"] > threshold)]

    return df_similar_tags