# Data Cleaning Pipeline for StackOverflow Datasets

This notebook cleans the text of StackOverflow questions in Spanish, English, and Portuguese. It replaces the content of `<code>` tags with a placeholder, strips the remaining HTML tags, removes punctuation and digits, and normalizes case and whitespace to prepare the data for further analysis.

## Goals

1. Clean and preprocess the text data from StackOverflow questions in three different languages (Spanish, English, and Portuguese).
2. Replace the content of `<code>` tags with a placeholder and remove the remaining HTML tags.
3. Sanitize line breaks, spaces, punctuation, and digits.
4. Prepare the cleaned data for further analysis or machine learning tasks.

---

## Data Cleaning Steps


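The following steps will be taken to preprocess the data. In the pipeline itself, `<code>` content is replaced before the other HTML tags are stripped, because stripping every tag first would erase the `<code>` markers that delimit the snippets.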

### 1. Removing HTML Tags

- HTML tags can appear in the dataset (e.g., `<a>`, `<b>`, `<code>`) and need to be removed to prevent issues in data processing.

### 2. Handling Content Inside `<code>` Tags

- Code snippets are wrapped inside `<code></code>` tags. We will replace them with the placeholder "CODE" to focus on the actual question text.

### 3. Removing Excessive Line Breaks

- Line breaks inside and outside `<code>` tags can cause issues in the dataset. We will handle multiple line breaks by replacing them with a single space.

### 4. Removing Punctuation and Digits

- Any punctuation or digits will be removed to ensure that the data focuses on the textual content.

### 5. Handling Word Fragmentation

- Words like "co de" (from split code tags) need to be corrected back to "code". We'll fix these kinds of issues.

### 6. Normalization

- Convert all text to lowercase and remove extra spaces to standardize the data.

---

## Loading and Preprocessing the Data


### Loading Data

We will read the CSV files containing StackOverflow questions in Spanish, English, and Portuguese.

In [1]:
import pandas as pd

# Paths to the Spanish, English, and Portuguese CSV files
path_es = "../data/stackoverflow_espanhol.csv"
path_en = "../data/stackoverflow_ingles.csv"
path_pt = "../data/stackoverflow_portugues.csv"

# The Spanish file is read as ISO-8859-1 with ";" as separator;
# the English and Portuguese files use the pandas defaults (UTF-8, comma-separated).
df_es = pd.read_csv(path_es, encoding="ISO-8859-1", sep=";")
df_en = pd.read_csv(path_en)
df_pt = pd.read_csv(path_pt)
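
The effect of the `encoding` and `sep` arguments can be checked without the real files. The snippet below is a minimal sketch using only the standard library; the two-line sample and its `Id` column are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up two-line sample, encoded the way the Spanish file is assumed to be.
raw = "Id;Questão\n1;¿Cómo leer un archivo?\n".encode("ISO-8859-1")

# Decoding with the matching encoding and splitting on ";" recovers both columns.
rows = list(csv.reader(io.StringIO(raw.decode("ISO-8859-1")), delimiter=";"))
print(rows)

# The same bytes are not valid UTF-8, so a default UTF-8 read would raise an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

If a file loads with garbled accents or as a single wide column, the encoding or separator is the first thing to check.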

### Preprocessing Functions

#### 1. Replacing Content Inside `<code>` Tags
We define a function to replace the content inside `<code>` tags with a placeholder "CODE".

In [2]:
import re

# Function to replace the content inside <code> tags with "CODE"
def replace_code(texts, regex):
    if isinstance(texts, str):
        return regex.sub("CODE", texts)
    else:
        return [regex.sub("CODE", text) for text in texts]
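
The pattern passed to `replace_code` later in the pipeline is `<code>.*?</code>` compiled with `re.DOTALL`. The sketch below, on a made-up sample string, compares it with two variants to show why both the non-greedy quantifier and the flag matter; the variant names are only illustrative.

```python
import re

sample = "Before <code>a = 1\nb = 2</code> middle <code>c = 3</code> after"

# Pattern used in the pipeline: non-greedy, with DOTALL so "." also matches "\n".
lazy_dotall = re.compile(r"<code>.*?</code>", re.DOTALL)
# Variants for comparison only.
lazy_no_dotall = re.compile(r"<code>.*?</code>")
greedy_dotall = re.compile(r"<code>.*</code>", re.DOTALL)

out_lazy = lazy_dotall.sub("CODE", sample)
out_no_dotall = lazy_no_dotall.sub("CODE", sample)
out_greedy = greedy_dotall.sub("CODE", sample)

print(out_lazy)       # each snippet becomes its own placeholder
print(out_no_dotall)  # the multi-line snippet is left untouched
print(out_greedy)     # everything between the first and last tag is swallowed
```

Note that the pattern only matches a bare `<code>` opening tag; a tag carrying attributes would need a looser pattern.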


#### 2. Removing HTML Tags

This function removes all HTML tags from the text using regular expressions.

In [3]:
# Function to remove HTML tags
regex_html = re.compile(r"<.*?>")  # Matches any HTML tag, e.g. <p> or </a>; compiled once and reused

def remove_html_tags(text):
    return regex_html.sub("", text)
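
A few made-up inputs show what the `<.*?>` pattern does and where it can surprise. This is a small sketch, not part of the original notebook.

```python
import re

regex_html = re.compile(r"<.*?>")

# Ordinary markup is stripped and the text content is kept.
plain = regex_html.sub("", "<p>Use <b>pandas</b> here</p>")

# Tags are replaced by an empty string, so words separated only by a tag get joined.
joined = regex_html.sub("", "first line<br>second line")

# A literal "<" followed later by ">" in plain text is treated like a tag.
comparison = regex_html.sub("", "if a < b and b > c")

print(plain)
print(joined)
print(comparison)
```

Replacing tags with a space instead of an empty string would avoid the joined words, at the cost of extra whitespace that the later normalization steps would then remove.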


#### 3. Fixing Fragmented Words

Some words may be fragmented due to code tag processing (e.g., "co de" for "code"). This function corrects those fragments.

In [4]:
def fix_code_fragment(text):
    # Repair fragmented spellings of "SQL", "PHP", and "code" (e.g. "s q l", "co de").
    # Each pattern matches only the exact letter sequence with optional whitespace
    # between the letters, so unrelated short words such as "cedo" or "codo" are not rewritten.
    regex_sql_fragment = re.compile(r"\bs\s*q\s*l\b", re.IGNORECASE)  # Variations of "SQL"
    regex_php_fragment = re.compile(r"\bp\s*h\s*p\b", re.IGNORECASE)  # Variations of "PHP"
    regex_code_fragment = re.compile(r"\bc\s*o\s*d\s*e\b", re.IGNORECASE)  # Variations of "code"
    
    # Replace the fragments with the complete words
    text = regex_sql_fragment.sub("SQL", text)
    text = regex_php_fragment.sub("PHP", text)
    text = regex_code_fragment.sub("code", text)
    
    return text
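
The spaced-letter idea can be generalized to any keyword. The sketch below builds such a pattern from a word; `fragment_pattern` is a hypothetical helper introduced here for illustration, not something defined in the notebook.

```python
import re

def fragment_pattern(word):
    # Hypothetical helper: the letters of `word` in order, with optional
    # whitespace between consecutive letters, bounded by word boundaries.
    return re.compile(r"\b" + r"\s*".join(map(re.escape, word)) + r"\b", re.IGNORECASE)

code_fragment = fragment_pattern("code")
sql_fragment = fragment_pattern("sql")

fixed_code = code_fragment.sub("code", "please format the co de block")
fixed_sql = sql_fragment.sub("SQL", "a s q l query")
# An ordinary Portuguese word built from similar letters is not affected.
untouched = code_fragment.sub("code", "cedo ou tarde")

print(fixed_code)
print(fixed_sql)
print(untouched)
```

Because the letters must appear in order, this kind of pattern stays narrow: it repairs split keywords without rewriting other short words.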


#### 4. Removing Excessive Line Breaks

We need to handle excessive line breaks both inside and outside the `<code>` tags.

In [5]:
# Function to collapse line breaks and any other run of whitespace
def remove_extra_newlines(text):
    return re.sub(r"\s+", " ", text).strip()  # \s also matches \n, so every whitespace run becomes a single space
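
As a quick check on a made-up string, the sketch below shows that `\s` covers line breaks, carriage returns, and tabs, so a character class that lists `\n` next to `\s` behaves the same as `\s` alone.

```python
import re

text = "First paragraph.\r\n\r\n\tSecond   paragraph.\n"

collapsed = re.sub(r"\s+", " ", text).strip()
collapsed_explicit_newline = re.sub(r"[\n\s]+", " ", text).strip()

print(collapsed)
print(collapsed == collapsed_explicit_newline)
```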


#### 5. Removing Punctuation and Digits

This function removes punctuation and digits from the text.

In [6]:
# Function to remove punctuation and digits
def remove_punctuation_and_digits(text):
    regex_punctuation_digits = re.compile(r"[^\w\s]|\d")  # Anything that is neither a word character nor whitespace, plus digits
    return regex_punctuation_digits.sub("", text)
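
Since the datasets contain Spanish and Portuguese text, it helps to confirm how the pattern treats accented letters. In Python 3, `\w` is Unicode-aware for `str` patterns, so accented letters are kept; the underscore also counts as a word character. The sample strings below are made up.

```python
import re

regex_punctuation_digits = re.compile(r"[^\w\s]|\d")

# Accented letters and the underscore survive; "¿", "?" and the digit are removed.
spanish = regex_punctuation_digits.sub("", "¿Cómo usar snake_case en Python3?")

# Removing ":" and "2" leaves two adjacent spaces behind.
portuguese = regex_punctuation_digits.sub("", "não é fácil: 2 erros!")

# Punctuation is replaced by nothing, so dotted names collapse into one token.
dotted = regex_punctuation_digits.sub("", "os.path.join(a, b)")

print(spanish)
print(portuguese)
print(dotted)
```

The leftover double space in the second result is one reason the whitespace normalization runs after this step.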


#### 6. Removing Extra Spaces

Extra spaces are often present, and we will ensure that there is only one space between words.

In [7]:
# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


#### 7. Converting to Lowercase

Converting the entire text to lowercase helps normalize the data.

In [8]:
# Function to convert text to lowercase
def convert_to_lowercase(text):
    return text.lower()


---

### The Cleaning Pipeline

The following function applies all the preprocessing steps to the question column of each dataset (`Questão` in these files) and stores the result in a new `cleaned_code_tag` column.

In [9]:
def clean_text(df, text_column):
    # Regex patterns
    regex_code = re.compile(r"<code>.*?</code>", re.DOTALL)  # Captures <code> tags and their content, including multi-line snippets
    regex_joint_words = re.compile(r"([a-z])([A-Z])")  # Lowercase letter followed by an uppercase one, as in "ArrayList"; acronyms like "SQL" stay intact
    
    # Apply cleaning functions
    df["cleaned_code_tag"] = df[text_column].apply(lambda text: replace_code(text, regex_code))  # Replace content inside <code> tags
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(remove_html_tags)  # Remove HTML tags
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(fix_code_fragment)  # Fix fragmented words like "co de" to "code", "s q l" to "SQL" and "p h p" to "PHP"
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(remove_punctuation_and_digits)  # Remove punctuation and digits
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(lambda text: regex_joint_words.sub(r"\1 \2", text))  # Split joined words: "ArrayList" becomes "Array List"
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(convert_to_lowercase)  # Convert to lowercase
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(remove_extra_spaces)  # Remove extra spaces
    df["cleaned_code_tag"] = df["cleaned_code_tag"].apply(remove_extra_newlines)  # Collapse any remaining line breaks
    
    return df  # The input DataFrame is modified in place and also returned
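
To check the order of the steps on a single string without loading a CSV, the chain can be restated as one standalone function. This is a simplified sketch: `clean_one` is a hypothetical name, the keyword-fragment repair step is left out, the joined-word pattern here splits only at a lowercase-to-uppercase boundary, and the sample question is made up.

```python
import re

REGEX_CODE = re.compile(r"<code>.*?</code>", re.DOTALL)
REGEX_HTML = re.compile(r"<.*?>")
REGEX_PUNCT_DIGITS = re.compile(r"[^\w\s]|\d")
REGEX_JOINT_WORDS = re.compile(r"([a-z])([A-Z])")

def clean_one(text):
    text = REGEX_CODE.sub("CODE", text)            # 1. code snippets -> placeholder
    text = REGEX_HTML.sub("", text)                # 2. strip remaining tags
    text = REGEX_PUNCT_DIGITS.sub("", text)        # 3. drop punctuation and digits
    text = REGEX_JOINT_WORDS.sub(r"\1 \2", text)   # 4. "ArrayList" -> "Array List"
    text = text.lower()                            # 5. lowercase
    return re.sub(r"\s+", " ", text).strip()       # 6. collapse whitespace

sample = (
    "<p>How do I convert an array to an ArrayList?</p>\n\n"
    "<pre><code>int[] a = {1, 2};</code></pre>"
)
result = clean_one(sample)
print(result)
```

Replacing `<code>` content first matters: if step 2 ran before step 1, the tags would be gone and the snippet text would leak into the cleaned question.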


---

### Applying the Pipeline to Each Dataset

In [11]:
# Apply the cleaning pipeline to each dataset
df_es_cleaned = clean_text(df_es, "Questão")
df_en_cleaned = clean_text(df_en, "Questão")
df_pt_cleaned = clean_text(df_pt, "Questão")

# Save the cleaned datasets to new CSV files
df_es_cleaned.to_csv("../cleaned_data/stackoverflow_spanish_clean.csv", index=False)
df_en_cleaned.to_csv("../cleaned_data/stackoverflow_english_clean.csv", index=False)
df_pt_cleaned.to_csv("../cleaned_data/stackoverflow_portuguese_clean.csv", index=False)


In [12]:
print(df_en_cleaned.cleaned_code_tag[95])

i have an array that is initialized like code i would like to convert this array into an object of the array list class code


In [13]:
print(df_pt_cleaned.cleaned_code_tag[94])

já li alguns comentários na web a respeito de utilizar ou não o code no final das linhas quando se escreve java script alguns dizem que sim outros dizem não ter necessidade mas nenhum sabe explicar bem os motivos das divergências exemplo code o interessante é que mesmo esquecendo o code no meu código as vezes ele continua rodando sem problemas e sem disparar erros então o correto é usar ou não o famoso ponto e vírgula


In [14]:
print(df_es_cleaned.cleaned_code_tag[99])

como parte de un trabajo de compiladores debo programar un editor de texto que reciba como entrada el lenguaje visual basic y lo transforme a otro lenguaje a mi eleccion actualmente estoy utilizando visual basic para programar lo anterior lo primero que hice fue recibir la cadena de entrada y separarla en tokens clasificandolos en palabras reservadas variables simbolos de agrupacion y operadores code metodo para evaluar tokens code automata para evaluar si token es numero code los tokens despues de ser evaluados por automatas los almaceno en list views que despues muestro en una ventana code para este traductor no debo hacer validaciones de signos de agrupación que verifiquen si faltan paréntesis o llaves solamente transformar los tokens obtenidos a otro lenguaje el lenguaje final puede ser cualquiera asimismo no es todo el lenguaje el que debo convertir solo la estructura de un for while e if por ejemplo si tengo code debo convertirlo a code en este caso los tokens que tengo son linea

---

### Conclusion

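The pipeline cleaned the Spanish, English, and Portuguese StackOverflow datasets: code snippets were replaced with a `code` placeholder, HTML tags, punctuation, and digits were removed, and case and whitespace were normalized. The cleaned text is stored in the `cleaned_code_tag` column and saved to new CSV files, as the sample outputs above show.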