Sreejoni Roy
*Cleaning Mona’s Dataset: Country Normalization and Likert Conversion*

I first cleaned Mona’s raw survey dataset to make sure the data was consistent and ready for modeling. The first part of the cleaning focused on standardizing the country-of-origin field, because people wrote “Netherlands” in many different ways (e.g., Netherlands, Nederland, NL, Holland, The Netherlands, and even versions with typos). To avoid mismatches later on, I wrote a simple normalization function that converts all of these variants into a single label: “netherlands”.

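After normalizing the country values, I dropped rows with a missing country of origin and filtered the dataset so that only participants whose country of origin is the Netherlands were kept. This left 132 of the 225 respondents.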

Next, I worked on the survey’s Likert-scale questions. These items were answered using text labels such as Strongly disagree, Disagree, Neutral, Agree, and Strongly agree, as well as other worded scales (frequency, duration, and exposure). I mapped each scale to ordered integers (1 to 5 for the five-point scales and 1 to 4 for the exposure scale), keeping the original text responses intact but adding new columns with the numeric equivalents (using a “_num” suffix). “I don’t know” answers on the duration scale are treated as missing.
The Likert questions were detected automatically: a column counts as a Likert item when all of its cleaned, non-missing responses belong to one of the known scales, and the matching numeric mapping is then applied.

Overall, these steps produced a clean, consistent dataset containing only respondents whose country of origin is the Netherlands, with numeric versions of all detected Likert-scale variables.

In [None]:
import pandas as pd
import numpy as np
#imports

In [None]:
df = pd.read_csv("Mona Dataset.csv")
#reading mona dataset file

In [3]:
#checking the shape and columns
print("Shape:", df.shape)
print("First 10 columns:\n", list(df.columns)[:10])

Shape: (225, 57)
First 10 columns:
 ['ID', 'Start time', 'Completion time', 'Email', 'Name', 'Last modified time', 'Please share your email address', 'In which country are you located?', 'What is your country of origin?', "Which description fits your parents' country of origin the best?"]


In [4]:
# standardize the country of origin field and remove anyone whose origin is not the Netherlands

#picking the correct column/question
ORIGIN_COL = "What is your country of origin?"

def normalize_country(x):
    
    # if a value is missing, return NaN so that I can drop it later
    if pd.isna(x):
        return np.nan
    
    # basic lowercasing and trimming (lowercasing so it's easier)
    s = str(x).strip().lower()

    # all the common ways people typed Netherlands (lowercase only, because the value was lowercased above)
    direct_map = {
        "the netherlands": "netherlands",
        "netherlands": "netherlands",
        "nederland": "netherlands",
        "holland": "netherlands",
        "nl": "netherlands",
        "the netherland": "netherlands",
    }

    # this checks if the cleaned string is one of the known variants
    if s in direct_map:
        return direct_map[s]
    
    # removing any punctuation and keeping only letters/numbers/spaces
    s_simple = "".join(ch for ch in s if ch.isalnum() or ch.isspace()).strip()

    # checking again using the simplified version
    if s_simple in direct_map:
        return direct_map[s_simple]

    # finding things like "netherlands amsterdam"
    if "netherland" in s_simple or "nederland" in s_simple:
        return "netherlands"
    
    # returning the cleaned value so I can inspect it later
    return s_simple
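
The substring check above only catches spellings that still contain "netherland" or "nederland", so a transposed-letter typo would be returned unchanged. A minimal sketch of one way to flag such near-misses for manual review, using `difflib` from the standard library. The helper `flag_near_misses`, the toy values, and the 0.8 cutoff are assumptions for illustration and are not part of the original cleaning.

```python
import difflib

import pandas as pd

# Hypothetical normalised values (not taken from the survey data).
toy_origin_norm = pd.Series(
    ["netherlands", "netherlnads", "germany", "nederlnad", "belgium"]
)

# Spellings that count as "Netherlands" in this sketch.
KNOWN_SPELLINGS = ["netherlands", "nederland", "holland"]


def flag_near_misses(values, known=KNOWN_SPELLINGS, cutoff=0.8):
    """Return {value: closest known spelling} for values that look like typos."""
    flagged = {}
    for value in values.dropna().unique():
        if value == "netherlands":
            continue  # already normalised, nothing to review
        match = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
        if match:
            flagged[value] = match[0]
    return flagged


near_misses = flag_near_misses(toy_origin_norm)
print(near_misses)
```

Anything this returns could then be added to `direct_map` by hand, which keeps the final decision with the analyst rather than with the similarity score.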



In [5]:
#  cleaning the origin column
# I'm adding a new column called "origin_norm" which stores the cleaned and normalized version of each person’s origin.
df["origin_norm"] = df[ORIGIN_COL].apply(normalize_country)


In [6]:
# Removing any missing country-of-origin values

df_clean_origin = df.dropna(subset=["origin_norm"]).copy()

In [None]:
#  keeping only the respondents whose origin is Netherlands, therefore removing non-netherlands people

df_origin_nl = df_clean_origin[df_clean_origin["origin_norm"] == "netherlands"].copy()

In [8]:
print("Rows kept (Netherlands origin only):", 
      df_origin_nl.shape[0], "out of", df.shape[0])

print("\nUnique cleaned origin values still present:")
print(df_origin_nl["origin_norm"].value_counts())

# showing how many rows remain after cleaning

Rows kept (Netherlands origin only): 132 out of 225

Unique cleaned origin values still present:
origin_norm
netherlands    132
Name: count, dtype: int64


In [9]:
# Likert-style scales to numbers

import numpy as np
import pandas as pd

likert_df = df_origin_nl.copy()


In [10]:
#  cleaning each cell 
def _clean_cell(x):
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    s = s.replace("\u00a0", " ")  # replacing non-breaking spaces with regular spaces
    s = " ".join(s.split())       # collapsing repeated whitespace into single spaces
    return s
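
A quick check of what the cell cleaner does to differently formatted answers. The function is repeated here only so the snippet runs on its own, and the example strings are made up for illustration.

```python
import numpy as np
import pandas as pd


def _clean_cell(x):
    # same logic as the cell above, repeated so this snippet is self-contained
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    s = s.replace("\u00a0", " ")
    s = " ".join(s.split())
    return s


# made-up answers: extra spaces, a non-breaking space, and a missing value
examples = ["  Strongly   Agree ", "Strongly\u00a0agree", None]
cleaned = [_clean_cell(v) for v in examples]
print(cleaned)
```

The two text variants end up as the same string, which is what lets them match a single key in the scale dictionaries.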


In [None]:
# defining the scales 

# 1) Agreement scale (5-points)
agree_options = [
    "strongly disagree",
    "disagree",
    "neutral",
    "agree",
    "strongly agree",
]
agree_map = {opt: i+1 for i, opt in enumerate(agree_options)}  # 1..5

In [12]:
# 2) Frequency scale (5-points) 
freq_options = [
    "never",
    "rarely",
    "sometimes",
    "often",
    "always",
]
freq_map = {
    "never": 1,
    "rarely": 2,
    "sometimes": 3,
    "often": 4,
    "always": 5,
}

In [13]:
# 3) Duration/Recency scale (5-points); "I don't know" is mapped to NaN
dur_options = [
    "less than a day",
    "1-2 days",
    "3-5 days",
    "about a week",
    "more than a week",
    "i don't know",
]
dur_map = {
    "less than a day": 1,
    "1-2 days": 2,
    "3-5 days": 3,
    "about a week": 4,
    "more than a week": 5,
    "i don't know": np.nan,   # treat it as missing
}

In [14]:
# 4) Exposure scale (4-points)
expose_options = [
    "no, never",
    "i've heard of it but haven't seen it",
    "yes, a few times",
    "yes, many times",
]
expose_map = {
    "no, never": 1,
    "i've heard of it but haven't seen it": 2,
    "yes, a few times": 3,
    "yes, many times": 4,
}

In [15]:
# packaging the scales so that I can loop 
scales = [
    ("agreement_5", set(agree_options), agree_map),
    ("frequency_5", set(freq_options),  freq_map),
    ("duration_5",  set(dur_options),   dur_map),
    ("exposure_4",  set(expose_options), expose_map),
]

mapped_cols = []


In [16]:
# finding out and mapping each column if its unique cleaned values are a subset of one of the known scales
for col in likert_df.columns:
    # collecting the unique cleaned values, ignoring NaN
    uniq = (
        likert_df[col]
        .dropna()
        .map(_clean_cell)
        .dropna()
        .unique()
        .tolist()
    )
    if len(uniq) == 0:
        continue

    uniq_set = set(uniq)

    # trying out each scale
    for scale_name, opt_set, opt_map in scales:
        # allowing the columns to contain a subset of the options
        if uniq_set.issubset(opt_set):
            num_col = col + "_num"
            likert_df[num_col] = likert_df[col].map(_clean_cell).map(opt_map)
            mapped_cols.append((col, scale_name))
            break  # stopping at first matching scale
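
The subset rule used in the loop above can be tried on a small made-up table. This is a minimal sketch, not the code that produced the results: `clean_text`, `detect_scale`, `toy_scales`, and the toy columns are illustrative names, and only two of the four scales are included.

```python
import numpy as np
import pandas as pd


def clean_text(x):
    """Lowercase, trim, and collapse whitespace; keep missing values as NaN."""
    if pd.isna(x):
        return np.nan
    return " ".join(str(x).lower().split())


# two of the scales from this notebook, written as plain dictionaries
toy_scales = {
    "agreement_5": {"strongly disagree": 1, "disagree": 2, "neutral": 3,
                    "agree": 4, "strongly agree": 5},
    "frequency_5": {"never": 1, "rarely": 2, "sometimes": 3,
                    "often": 4, "always": 5},
}


def detect_scale(series, scales):
    """Return the first scale whose options cover every non-missing answer."""
    answers = set(series.dropna().map(clean_text).dropna())
    if not answers:
        return None  # nothing to judge in an empty column
    for name, mapping in scales.items():
        if answers.issubset(mapping.keys()):
            return name
    return None


# made-up responses, not survey data
toy = pd.DataFrame({
    "q_agree": ["Agree", " neutral ", None, "Strongly  Agree"],
    "q_freq": ["Often", "Never", "often", "Always"],
    "q_free_text": ["I like it", "Agree", None, "no idea"],
})

detected = {col: detect_scale(toy[col], toy_scales) for col in toy.columns}
print(detected)

# applying the mapping to one detected column, with a "_num" suffix
toy["q_agree_num"] = toy["q_agree"].map(clean_text).map(toy_scales["agreement_5"])
print(toy["q_agree_num"].tolist())
```

A column is only mapped when every answer fits the scale, so a free-text column that happens to contain the word "Agree" once is left alone.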

In [17]:
# seeing which columns were mapped, and to which scale
print("Mapped columns (original -> scale):")
for c, s in mapped_cols:
    print(f" - {c}  ->  {s}")

# seeing the counts for a couple of mapped columns
for c, _ in mapped_cols[:3]:
    print(f"\n{c}_num value counts:")
    print(likert_df[c + "_num"].value_counts(dropna=False))

# putting it back to the main dataframe 
df_origin_nl = likert_df

Mapped columns (original -> scale):
 - I use social media  ->  frequency_5
 - I think social media is fair  ->  agreement_5
 - I see repeated topics on my feed  ->  agreement_5
 - I see different opinions on the same topic  ->  frequency_5
 - My feed varies within my social circle   ->  agreement_5
 - I often notice that someone didn't know about a trend, which I thought everybody had seen   ->  agreement_5
 - It is harder for me to take part in a discussion regarding a topic when I have not seen the topic on my feed yet   ->  agreement_5
 - I notice that people mostly agree with what's trending, they rarely add their own personal views - add happens to options  ->  frequency_5
 - A topic/trend usually stays on my feed for ...  ->  duration_5
 - When I see a post and open the comments, the top comments reflect what I am already thinking about  ->  frequency_5
 - I scroll through the comments to see if someone has mentioned what I am thinking about  ->  frequency_5
 - I notice a lot of 