Encoding issues (mojibake) occur when text that was originally UTF-8 is decoded or re-opened as Windows-1252/Latin-1, for example when you write a UTF-8 CSV and Excel guesses ANSI.
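
As a rough illustration of that failure mode, the sketch below garbles a made-up sample string by decoding its UTF-8 bytes as cp1252, then reverses the mistake. The sample text and variable names are illustrative only.

```python
# Illustrative sample text (not taken from the dataset).
original = "Isbell’s café"

# The failure: UTF-8 bytes are interpreted as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # Isbellâ€™s cafÃ©

# The repair: re-encode with the wrong codec, then decode as UTF-8.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == original)  # True
```

The repair only works when every garbled character can be mapped back to a byte in the wrongly assumed codec, which is why a dedicated library is useful for messier cases.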

Best practice going forward: standardize on UTF-8, always pass encoding="utf-8" when reading text, write CSV as utf-8-sig for Excel (or better, write .xlsx), and avoid re-saving files in editors that default to ANSI.
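
A minimal sketch of the CSV advice, using only the standard library and a temporary file (the file name and rows are made up for illustration). It shows that `utf-8-sig` writes a byte-order mark, which Excel generally uses to recognize the file as UTF-8, and that reading with plain `utf-8` leaves that mark in the first field.

```python
import csv
import tempfile
from pathlib import Path

rows = [["id", "text"], ["1", "Isbell’s café"]]  # illustrative data

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"  # hypothetical file name

    # utf-8-sig prepends the BOM bytes EF BB BF to the file.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
    has_bom = path.read_bytes().startswith(b"\xef\xbb\xbf")

    # Reading with utf-8-sig strips the BOM again.
    with open(path, encoding="utf-8-sig", newline="") as f:
        read_back = list(csv.reader(f))

    # Reading with plain utf-8 keeps the BOM as U+FEFF in the first cell.
    with open(path, encoding="utf-8", newline="") as f:
        first_cell_plain = next(csv.reader(f))[0]

print(has_bom, read_back == rows, repr(first_cell_plain))
```

With pandas, the equivalent write is `df.to_csv(path, encoding="utf-8-sig")`.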

In [1]:
%pip install ftfy

Note: you may need to restart the kernel to use updated packages.




In [8]:
%pip install --upgrade openpyxl

Note: you may need to restart the kernel to use updated packages.
Collecting openpyxl
  Using cached openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Collecting et-xmlfile
  Using cached et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5





In [2]:
import logging
import re
from typing import Sequence
from unicodedata import normalize

import pandas as pd

try:
    from ftfy import fix_text as _ftfy_fix
    _HAVE_FTFY = True
except ImportError:  # ftfy not installed; use the re-decode fallback only
    _HAVE_FTFY = False
    _ftfy_fix = None

In [3]:


# Heuristic: patterns that strongly suggest mojibake
_MOJIBAKE_RE = re.compile(r"(Ã.|Â|â€™|â€œ|â€”|â€“|â€)")

def _looks_mojibake(s: str) -> bool:
    return bool(_MOJIBAKE_RE.search(s))

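def _redecode_latin1_to_utf8(s: str) -> str:
    """
    Reverse the common failure mode: UTF-8 bytes decoded as Latin-1/cp1252.
    Example: 'Isbellâ€™s' -> 'Isbell’s'
    """
    # Try cp1252 first: characters such as '€' and '™' exist in cp1252 but not
    # in Latin-1, so a Latin-1-only encode would raise and skip the repair.
    for codec in ("cp1252", "latin1"):
        try:
            return s.encode(codec).decode("utf-8")
        except UnicodeError:
            continue
    return s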

def _repair_string(s: str, *, use_ftfy: bool = True, max_passes: int = 2) -> str:
    """
    Repair mojibake in a single string using:
      1) ftfy (if installed),
      2) latin1->utf8 re-decode,
      3) NFC normalization.
    Runs a couple of passes to handle multi-step corruption.
    """
    if not s:
        return s
    t = s
    for _ in range(max_passes):
        changed = False

        if use_ftfy and _HAVE_FTFY:
            t2 = _ftfy_fix(t)
            if t2 != t:
                t, changed = t2, True

        if _looks_mojibake(t):
            t2 = _redecode_latin1_to_utf8(t)
            if t2 != t:
                t, changed = t2, True

        if not changed:
            break

    return normalize("NFC", t)

def fix_mojibake_in_columns(
    df: pd.DataFrame,
    columns: Sequence[str],
    *,
    use_ftfy: bool = True,
    copy: bool = True,
) -> pd.DataFrame:
    """
    Fix mojibake/encoding artifacts in the specified text columns.

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe.
    columns : Sequence[str]
        Column names to clean.
    use_ftfy : bool, optional
        If True and 'ftfy' is installed, use it as the first pass, by default True.
    copy : bool, optional
        If True, operate on a copy and return it; else modify in place, by default True.

    Returns
    -------
    pd.DataFrame
        Cleaned dataframe (or the same object if copy=False).
    """
    if copy:
        df = df.copy()

    for col in columns:
        if col not in df.columns:
            raise KeyError(f"Column not found: {col}")
        # Only transform strings; leave non-strings as-is
        df[col] = df[col].map(
            lambda v: _repair_string(v, use_ftfy=use_ftfy) if isinstance(v, str) else v
        )

    return df
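
The multi-pass idea in `_repair_string` can be seen in isolation with a small standalone sketch. The helpers `redecode_once` and `repair` below are hypothetical stand-ins, not the functions defined above, and they assume the corruption was UTF-8 read as cp1252. The sample string is garbled twice, so a single pass is not enough.

```python
import unicodedata

def redecode_once(s: str) -> str:
    """Hypothetical helper: undo one layer of UTF-8-read-as-cp1252 corruption."""
    try:
        return s.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return s  # not reversible this way; leave unchanged

def repair(s: str, max_passes: int = 2) -> str:
    """Hypothetical helper: repeat the re-decode until nothing changes."""
    for _ in range(max_passes):
        t = redecode_once(s)
        if t == s:
            break
        s = t
    # NFC merges a base letter plus combining accent into one code point.
    return unicodedata.normalize("NFC", s)

original = "caf\u00e9"                              # 'café'
once = original.encode("utf-8").decode("cp1252")    # 'cafÃ©'
twice = once.encode("utf-8").decode("cp1252")       # 'cafÃƒÂ©'

print(repair(twice, max_passes=1))  # cafÃ©  (one layer removed)
print(repair(twice))                # café
print(repair("cafe\u0301") == original)  # True (NFC composes e + U+0301)
```

Capping the number of passes keeps the loop from running indefinitely on unusual input, at the cost of leaving more deeply nested corruption only partly repaired.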


In [5]:
INPUT_CSV = "./some_recorded_output/source_texts.csv"
OUTPUT_XLSX = "./some_recorded_output/source_texts_cleaned.xlsx"
df = pd.read_csv(INPUT_CSV, encoding="utf-8")  # be explicit about the encoding



In [6]:
fixed = fix_mojibake_in_columns(df, ["text"])

In [9]:
fixed.to_excel(OUTPUT_XLSX, index=False)

In [36]:
baseline_input = "./some_recorded_output/baseline_questions.csv"
requesta_input = "./some_recorded_output/requesta_mcqs.csv"
baseline_output = "./some_recorded_output/baseline_questions_cleaned_uuid.xlsx"
requesta_output = "./some_recorded_output/requesta_mcqs_cleaned_uuid.xlsx"

In [24]:
baseline_input_df = pd.read_csv(baseline_input, encoding="utf-8")
requesta_input_df = pd.read_csv(requesta_input, encoding="utf-8")

In [25]:
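def strip_tag(s, tag):
    """Remove opening/closing <tag ...> markup (case-insensitive) and trim whitespace."""
    # re.escape guards against tag names containing regex metacharacters.
    pattern = rf"(?is)</?\s*{re.escape(tag)}\b[^>]*>"
    return s.str.replace(pattern, "", regex=True).str.strip()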

baseline_input_df["baseline_question"] = strip_tag(baseline_input_df["baseline_question"], "QUESTION")
baseline_input_df["baseline_answer"]  = strip_tag(baseline_input_df["baseline_answer"],  "ANSWER")
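
To see what the tag-stripping pattern matches, here is a single-string sketch using `re.sub`. The function `strip_tag_text` is a hypothetical counterpart of `strip_tag` for plain strings; escaping the tag name with `re.escape` is an added precaution, and the sample inputs are made up.

```python
import re

def strip_tag_text(text: str, tag: str) -> str:
    """Hypothetical single-string version: drop <tag ...> and </tag>, then trim."""
    pattern = rf"(?is)</?\s*{re.escape(tag)}\b[^>]*>"
    return re.sub(pattern, "", text).strip()

# Opening and closing tags are removed regardless of case.
print(strip_tag_text("<QUESTION> What is 2 + 2? </question>", "QUESTION"))
# What is 2 + 2?

# The \b word boundary keeps longer tag names such as <QUESTIONS> intact.
print(strip_tag_text("<QUESTIONS>kept</QUESTIONS>", "QUESTION"))
# <QUESTIONS>kept</QUESTIONS>
```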

In [26]:
baseline_input_df["baseline_answer"]

0     C) They are more social and use a wide variety...
1     B) They contributed to the development of huma...
2            C) The survival of primates is threatened.
3     D) To learn how certain genes and traits are d...
4     C) The exploration of human origins, evolution...
                            ...                        
95                                    C) Studying twins
96                                              D) 2003
97    B) Genetics plays a significant role in shapin...
98    D) Socialization reproduces inequality by conv...
99    C) Both nature and nurture play significant ro...
Name: baseline_answer, Length: 100, dtype: object

In [27]:
baseline_fixed = fix_mojibake_in_columns(baseline_input_df, ["text", "baseline_question", "baseline_answer"])
requesta_fixed = fix_mojibake_in_columns(requesta_input_df, ["text", "requesta_question", "requesta_answer"])

In [33]:
# Add a question_uuid column to requesta_fixed,
# using uuid4 to generate unique IDs.

import uuid

question_uuids = [str(uuid.uuid4()) for _ in range(len(requesta_fixed))]
requesta_fixed.insert(0, "question_uuid", question_uuids)


In [34]:
requesta_fixed.head()

Unnamed: 0,question_uuid,question_id,textID,text,requesta_question_type,requesta_question,requesta_answer
0,7702d5f4-57a7-48da-abc2-2fe7c197a530,requesta_anthropology_1_2_1,anthropology_1_2,Biological anthropology focuses on the earlies...,fact,Q1: Which combination of research approaches i...,"Q1: B) Fossil record, genetic data, primate st..."
1,97dec8f5-d6ef-4566-9637-5af9ea6db9dd,requesta_anthropology_1_2_2,anthropology_1_2,Biological anthropology focuses on the earlies...,fact,Q2: What adaptations does Isbell's snake detec...,Q2: D) Refined vision and communication to war...
2,b1059eba-5c49-4c6d-98eb-2748d2394fe9,requesta_anthropology_1_2_3,anthropology_1_2,Biological anthropology focuses on the earlies...,inference,Q3: Which inference is best supported by van S...,Q3: A) Cultural learning and problem-solving l...
3,5ed6ec45-7a03-4b1a-96d4-88745e3efcd7,requesta_anthropology_1_2_4,anthropology_1_2,Biological anthropology focuses on the earlies...,inference,"Q4: If the pressures of habitat loss, illegal ...",Q4: B) Diminished access to living comparative...
4,7127c216-c5ab-4670-8866-29bf76d31341,requesta_anthropology_1_2_5,anthropology_1_2,Biological anthropology focuses on the earlies...,main_idea,Q5: Which statement best summarizes the text?\...,Q5: A) Biological anthropology integrates evid...


In [35]:
# Add a question_uuid column to baseline_fixed,
# using uuid4 to generate unique IDs.
baseline_fixed.insert(0, "question_uuid",
                      [str(uuid.uuid4()) for _ in range(len(baseline_fixed))])


In [37]:
baseline_fixed.head()

Unnamed: 0,question_uuid,question_id,textID,text,baseline_question_type,baseline_question,baseline_answer
0,2d1c8f3b-187a-4a2a-b326-c7ac0206ef75,baseline_anthropology_1_2_1,anthropology_1_2,Biological anthropology focuses on the earlies...,factual,What did Carel van Schaik discover about orang...,C) They are more social and use a wide variety...
1,eeff2c44-7c35-4561-a41a-b200e295bfde,baseline_anthropology_1_2_2,anthropology_1_2,Biological anthropology focuses on the earlies...,factual,"According to Lynne Isbell, what role did snake...",B) They contributed to the development of huma...
2,e7152717-7f6e-464f-8dca-89b2ed2097ba,baseline_anthropology_1_2_3,anthropology_1_2,Biological anthropology focuses on the earlies...,inferential,What might be a consequence of habitat loss an...,C) The survival of primates is threatened.
3,927b3c6e-25a6-42a7-8c86-a9fd0962cb87,baseline_anthropology_1_2_4,anthropology_1_2,Biological anthropology focuses on the earlies...,inferential,Why might biological anthropologists study the...,D) To learn how certain genes and traits are d...
4,640a7dce-9c07-441e-a806-957d52dc8d09,baseline_anthropology_1_2_5,anthropology_1_2,Biological anthropology focuses on the earlies...,main idea,What is the main focus of the text on biologic...,"C) The exploration of human origins, evolution..."


In [38]:
# write out
baseline_fixed.to_excel(baseline_output, index=False)
requesta_fixed.to_excel(requesta_output, index=False)