Sreejoni Roy
**Cleaning our datasets**

For this sub-assignment, I am working with survey data that was originally created by Mona Zakkour for her bachelor thesis. Mona developed a theoretical framework and designed a questionnaire based on it. She collected her data in the Netherlands using Google Forms.

As part of the assignment, each member of our group had to roll out the same survey to at least 20 participants in one assigned country: Greece, Germany, or Spain. Because of this, we now have multiple separate datasets. The four datasets I am working with are:

-Two datasets from Spain

-One dataset from Greece

-One dataset from Germany

These files were exported as CSV files from Google Forms, which means the structure is mostly the same, but the responses contain differences such as:

-variations in language (e.g., España, spain, Spain)

-different capitalisation and spacing

-inconsistent country names across datasets

-extra spaces or symbols inside text entries

Before I can merge everything into one complete dataset for analysis, I need to clean each file individually.

This includes:

-removing accidental spaces and formatting issues

-standardising country names (e.g., España → Spain, Deutschland → Germany)

-converting missing values into a consistent format

-making sure all columns are aligned so the files can be merged properly

Once all four datasets are cleaned, I will merge them into one combined dataset.
This final dataset will allow us to run the statistical tests required for the assignment (Mona’s original Netherlands dataset + Greece + Spain + Germany).

This cleaning and merging step is essential because combining uncleaned data from different countries would create inconsistencies, reduce data quality, and affect the reliability of the analysis.
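
To make the problem concrete, here is a minimal sketch with made-up answers (not taken from the real survey files). It only shows how spelling variants get counted as separate categories until they are mapped to one label. The names `answers` and `variants` are my own for this illustration.

```python
import pandas as pd

# Hypothetical toy answers; not taken from the real survey files.
answers = pd.Series(["Spain", "spain", "Spain ", "SPAIN", "Germany"])

# Without cleaning, every spelling is counted as its own category.
raw_counts = answers.value_counts()

# A minimal clean-up: trim, lowercase, then map known variants to one label.
variants = {"spain": "Spain", "germany": "Germany"}
cleaned = answers.str.strip().str.lower().map(variants)
clean_counts = cleaned.value_counts()

print(len(raw_counts), "categories before cleaning")
print(len(clean_counts), "categories after cleaning")
```

Any group comparison by country would run into the same split if the variants were left in place.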

**Spain**

In [None]:
# Cleaning the two Spain datasets
# In this code I clean the two Spain survey files:
# I fix spelling differences in country names and remove extra spaces.

import pandas as pd
import numpy as np
import unicodedata
from pathlib import Path


In [None]:
# I put all files in the same folder so I just point to the current directory.
DATA_PATH = Path(".")

# These are the two Spain files.
SPAIN1 = DATA_PATH / "spain1dataset.csv"
SPAIN2 = DATA_PATH / "spain2dataset.csv"


In [4]:
# This function tries to read a CSV safely even when the file has an unexpected encoding.
# Some CSV files exported from Qualtrics, Excel, or Google Sheets do not read properly with
# the standard UTF-8 encoding. They might contain hidden characters (like BOM markers) or
# use different text encodings depending on the user's system.
def safe_read_csv(path):

    # I loop through a small list of common encodings.
    # "utf-8" is the normal one,
    # "utf-8-sig" handles files that start with a BOM,
    # "cp1252" and "latin-1" are common on European Windows systems.
    # "latin-1" comes last because it accepts any byte sequence and so never fails.
    for enc in ["utf-8", "utf-8-sig", "cp1252", "latin-1"]:
        try:
            # Reading the file with the current encoding.
            # If this works, the function returns the DataFrame immediately.
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            # This encoding did not fit the file, so try the next one.
            pass

    # Fallback: let pandas read the file with its default settings.
    return pd.read_csv(path)
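
The order of the encodings matters, and the small sketch below tries to show why. It uses plain Python strings instead of a real file, so it is only an illustration: "latin-1" can decode any bytes without an error (even when the result is wrong), while "utf-8" raises an error on bytes that were written as cp1252.

```python
text = "Espa\u00f1a"  # "España", written with an escape so the example does not depend on the file encoding

# Bytes as a UTF-8 export would store them.
utf8_bytes = text.encode("utf-8")

# latin-1 maps every possible byte to a character, so decoding always "succeeds",
# even though the result is not the original text.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Bytes as a cp1252 (Windows) export would store them.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print("utf-8 rejected the cp1252 bytes:", utf8_failed)
```

Because of this, a "latin-1" attempt works best as the last resort in a fallback list. Any encoding placed after it would never be tried.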


In [5]:
# This code helps me compare the text regardless of accents or capital letters.
# For example "España", "espana", "ESPAÑA" should all detect as the same thing.
def normalize_string(s):
    if pd.isna(s):
        return ""
    if not isinstance(s, str):
        s = str(s)
    s = " ".join(s.split()).strip()     # remove weird spacing
    s = s.lower()                       # make everything lowercase
    # remove accents so "España" becomes "espana"
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return s
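
As a quick check of the idea, the sketch below applies the same three steps (collapse spaces, lowercase, strip accents) to a few spellings. The helper name `to_ascii_key` is hypothetical and only used for this illustration. It is not a replacement for the function above.

```python
import unicodedata

def to_ascii_key(value):
    """Hypothetical helper mirroring the steps described above."""
    collapsed = " ".join(str(value).split()).lower()
    # NFKD splits "ñ" into "n" plus a combining tilde; the ASCII step drops the tilde.
    decomposed = unicodedata.normalize("NFKD", collapsed)
    return decomposed.encode("ascii", "ignore").decode("ascii")

variants = ["España", "espana", "  ESPAÑA ", "Spain "]
keys = [to_ascii_key(v) for v in variants]
print(keys)
```

The three Spanish spellings end up with the same key, while "Spain " gets its own key, which is why the mapping dictionary needs an entry for both.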


In [6]:
#country names and possible typos
COUNTRY_MAP = {
    "spain": "Spain",
    "espana": "Spain",
    "es": "Spain",

    "netherlands": "Netherlands",
    "the netherlands": "Netherlands",
    "nederland": "Netherlands",
    "holland": "Netherlands",

    "argentina": "Argentina",

    "": np.nan,
    "nan": np.nan
}


In [7]:
# This code cleans all text columns.
# It removes leading/trailing spaces that people often leave in by accident
# and collapses repeated spaces inside a value.
def clean_text_columns(df):
    df = df.copy()
    df.columns = [c.strip() for c in df.columns]   # clean column names
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].apply(lambda x: " ".join(x.split()).strip()
                                if isinstance(x, str) else x)
    return df
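
The same whitespace clean-up can also be written with pandas string methods instead of `apply`. This is only a sketch on a made-up column. One caveat I am aware of: on a column that mixes text with other types, the `.str` methods turn the non-text entries into NaN, so the `apply` version above is the safer choice for mixed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with stray spaces and one missing answer.
col = pd.Series(["  Spain ", "The   Netherlands", np.nan])

# Split on any run of whitespace, then re-join the pieces with single spaces.
tidy = col.str.split().str.join(" ")
print(tidy.tolist())
```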


In [8]:

# This code fixes all the country columns.
# It looks for any column whose name contains the word "country".
def standardize_country_columns(df):
    df = df.copy()
    country_cols = [c for c in df.columns if "country" in c.lower()]

    for col in country_cols:
        # I turned the column values into "normalized keys"
        # so I could map them properly to the COUNTRY_MAP
        keys = df[col].apply(lambda x: normalize_string(x) if not pd.isna(x) else x)

        # If the normalized version is in the dictionary, use the official version.
        # If not, keep the original.
        mapped_values = keys.map(COUNTRY_MAP)
        df[col] = np.where(mapped_values.notna(), mapped_values, df[col])

        # second pass: catches leftovers like "Spain " or "spain" and turns
        # empty or "nan" text into a real NaN.
        df[col] = df[col].apply(
            lambda x: "Spain" if normalize_string(x) == "spain"
            else ("Netherlands" if normalize_string(x) in
                  {"netherlands", "the netherlands", "nederland", "holland"}
                  else ("Argentina" if normalize_string(x) == "argentina"
                        else (np.nan if normalize_string(x) in {"", "nan"} else x)))
        )

    return df
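
For comparison, the lookup and the missing-value handling could probably be done in one pass. The sketch below is an alternative I am assuming would behave the same way on simple cases. It is not the code I used for the cleaned files. `TOY_MAP`, `to_key`, and `standardize` are hypothetical names.

```python
import unicodedata

import numpy as np
import pandas as pd

# Small made-up mapping for this illustration only.
TOY_MAP = {"spain": "Spain", "espana": "Spain", "": np.nan, "nan": np.nan}

def to_key(value):
    """Collapse spaces, lowercase, and strip accents; missing values become ""."""
    if pd.isna(value):
        return ""
    text = " ".join(str(value).split()).lower()
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def standardize(series, mapping):
    keys = series.map(to_key)
    known = keys.isin(list(mapping))   # True where the key exists in the mapping
    mapped = keys.map(mapping)         # canonical label, or NaN for missing-value keys
    # Use the mapped value for known keys (including NaN); keep the original otherwise.
    return mapped.where(known, series)

raw = pd.Series(["España", "Spain ", "spain", "nan", None, "Argentina"])
result = standardize(raw, TOY_MAP)
print(result.tolist())
```

The `known` mask is what separates "the key maps to NaN on purpose" from "the key is not in the dictionary", which a plain `.map()` cannot tell apart.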


In [None]:
def show_before_after_uniques(before, after, title):
    print(f"\n===== {title} =====")
    country_cols = [c for c in before.columns if "country" in c.lower()]
    for col in country_cols:
        print(f"\nColumn: {col}")
        print("Before:", sorted(before[col].astype(str).fillna("NaN").unique().tolist()))
        print("After: ", sorted(after[col].astype(str).fillna("NaN").unique().tolist()))

#to see a sample.

In [None]:
# Loading both the Spain files
spain1_raw = safe_read_csv(SPAIN1)
spain2_raw = safe_read_csv(SPAIN2)


In [11]:
# cleaning the basic text formatting
spain1_step1 = clean_text_columns(spain1_raw)
spain2_step1 = clean_text_columns(spain2_raw)


In [12]:
#fixing the country names
spain1_clean = standardize_country_columns(spain1_step1)
spain2_clean = standardize_country_columns(spain2_step1)

In [13]:
# printing the before/after so I can visually confirm everything
show_before_after_uniques(spain1_raw, spain1_clean, "SPAIN 1")
show_before_after_uniques(spain2_raw, spain2_clean, "SPAIN 2")


===== SPAIN 1 =====

Column: In which country are you located?
Before: ['España', 'Spain', 'Spain ', 'spain']
After:  ['Spain']

Column: What is your country of origin?
Before: ['Argentina', 'España', 'Spain', 'Spain ', 'spain']
After:  ['Argentina', 'Spain']

Column: Which description fits your parents' country of origin the best?
Before: ['Both parents are from outside the EU', 'Both parents are from the EU']
After:  ['Both parents are from outside the EU', 'Both parents are from the EU']

===== SPAIN 2 =====

Column: In which country are you located?
Before: ['España', 'Netherlands', 'Netherlands ', 'Spain', 'Spain ', 'nan', 'spain']
After:  ['Netherlands', 'Spain', 'nan']

Column: What is your country of origin?
Before: ['España', 'Spain', 'nan', 'spain']
After:  ['Spain', 'nan']

Column: Which description fits your parents' country of origin the best?
Before: ['Both parents are from the EU', 'One parent is from the EU and one parent is outside of the EU', 'nan']
After:  ['Both pa

In [14]:
# saving both cleaned versions
spain1_clean.to_csv("spain1dataset_clean.csv", index=False)
spain2_clean.to_csv("spain2dataset_clean.csv", index=False)

print("\nSaved cleaned files: spain1dataset_clean.csv and spain2dataset_clean.csv")



Saved cleaned files: spain1dataset_clean.csv and spain2dataset_clean.csv


**Greece**

In [15]:

import pandas as pd
import numpy as np
import unicodedata
from pathlib import Path


In [16]:
# folder where everything is saved
DATA_PATH = Path(".")
GREECE = DATA_PATH / "greekdataset.csv"

In [17]:
def safe_read_csv(path):
    # I put in Greek encodings (cp1253, iso-8859-7).
    # "latin-1" comes last because it accepts any byte sequence and so never fails.
    for enc in ["utf-8", "utf-8-sig", "cp1253", "iso-8859-7", "cp1252", "latin-1"]:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            pass

    return pd.read_csv(path)

In [18]:
# This code normalises text so comparisons are easier.
# I removed spacing and accents, but I kept the original letters (so Greek stays Greek).
def normalize_basic(s):
    if pd.isna(s):
        return ""
    if not isinstance(s, str):
        s = str(s)
    # removed extra internal spaces and trim
    s = " ".join(s.split()).strip().lower()
    # removed diacritics (accents) but keep letters 
    nf = unicodedata.normalize("NFKD", s)
    s_no_marks = "".join(ch for ch in nf if unicodedata.category(ch) != "Mn")
    return s_no_marks
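
The reason for filtering combining marks here, instead of encoding to ASCII as in the Spain section, can be seen in a small comparison. Both helper names below are hypothetical and exist only for this sketch.

```python
import unicodedata

def ascii_key(text):
    """Accent stripping by encoding to ASCII (drops every non-ASCII letter)."""
    nfkd = unicodedata.normalize("NFKD", text.lower())
    return nfkd.encode("ascii", "ignore").decode("ascii")

def mark_free_key(text):
    """Accent stripping by removing only combining marks (category "Mn")."""
    nfkd = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in nfkd if unicodedata.category(ch) != "Mn")

word = "Ελλάδα"
print(repr(ascii_key(word)))      # the Greek letters are lost
print(repr(mark_free_key(word)))  # the Greek letters stay, only the accent goes
```

With the ASCII approach every Greek answer would collapse to an empty string and be treated as missing, so the mark-filtering version is the one that fits this dataset.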

In [19]:
COUNTRY_MAP = {
    # Greece variants
    "ελλαδα": "Greece",   # Ελλάδα without accent after normalize_basic
    "ελλας": "Greece",    # Ελλας (Hellas in Greek letters)
    "greece": "Greece",
    "gr": "Greece",
    "ellada": "Greece",   
    "hellas": "Greece",

    # Also includes other countries just in case they appear in these columns
    "germany": "Germany",
    "deutschland": "Germany",
    "alemania": "Germany",

    "spain": "Spain",
    "espana": "Spain",

    "netherlands": "Netherlands",
    "the netherlands": "Netherlands",
    "nederland": "Netherlands",
    "holland": "Netherlands",

    # missing data will become real NaN
    "": np.nan,
    "nan": np.nan,
}

In [20]:
# I cleaned the text in the whole table:
# stripped the spaces from column names
# trimmed the spaces inside text cells
def clean_text_columns(df):
    df = df.copy()
    df.columns = [c.strip() for c in df.columns]
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].apply(lambda x: " ".join(x.split()).strip()
                                if isinstance(x, str) else x)
    return df


In [21]:
# fixing any column that looks like it stores a country.
# matching by normalising the text and looking it up in COUNTRY_MAP.
def standardize_country_columns(df):
    df = df.copy()
    country_cols = [c for c in df.columns if "country" in c.lower()]

    for col in country_cols:
        # creating a "key" version of each value for mapping
        keys = df[col].apply(lambda x: normalize_basic(x) if not pd.isna(x) else x)
        mapped = keys.map(COUNTRY_MAP)

        # if a canonical label is found, the code uses it; otherwise it keeps the original.
        df[col] = np.where(mapped.notna(), mapped, df[col])

        # a second pass to catch super-common loose cases
        df[col] = df[col].apply(
            lambda x: (
                "Greece" if normalize_basic(x) in {"ελλαδα","ελλας","greece","ellada","hellas","gr"}
                else ("Germany" if normalize_basic(x) in {"germany","deutschland","alemania"}
                else ("Spain" if normalize_basic(x) in {"spain","espana"}
                else ("Netherlands" if normalize_basic(x) in {"netherlands","the netherlands","nederland","holland"}
                else (np.nan if normalize_basic(x) in {"", "nan"} else x))))
            )
        )

    return df


In [22]:
# printing the unique values before and after for the country columns.
def show_before_after_uniques(before, after, title):
    print(f"\n===== {title} =====")
    country_cols = [c for c in before.columns if "country" in c.lower()]
    for col in country_cols:
        b = sorted(pd.Series(before[col].astype(str).fillna("NaN").unique()).tolist())
        a = sorted(pd.Series(after[col].astype(str).fillna("NaN").unique()).tolist())
        print(f"\nColumn: {col}")
        print("Before:", b)
        print("After: ", a)

In [23]:
# loading the Greece file safely
gr_raw = safe_read_csv(GREECE)

In [24]:
# Cleaning general text issues first (spaces, weird formatting)
gr_step1 = clean_text_columns(gr_raw)

In [25]:
# Standardizing country names (Ελλάδα → Greece etc.)
gr_clean = standardize_country_columns(gr_step1)

In [26]:
# Showing the change in country columns
show_before_after_uniques(gr_raw, gr_clean, "GREECE")


===== GREECE =====

Column: In which country are you located?
Before: ['Amsterdam ', 'Canada', 'Canada ', 'France', 'Greece', 'Greece ', 'Netherlands ', 'The Netherlands', 'The Netherlands ', 'UK', 'cairo']
After:  ['Amsterdam', 'Canada', 'France', 'Greece', 'Netherlands', 'UK', 'cairo']

Column: What is your country of origin?
Before: ['Canada', 'China ', 'Egypt ', 'Greece', 'Greece ', 'Greek']
After:  ['Canada', 'China', 'Egypt', 'Greece', 'Greek']

Column: Which description fits your parents' country of origin the best?
Before: ['Both parents are born and brought up in the Greece', 'Both parents are from outside the EU', 'Both parents are from the EU']
After:  ['Both parents are born and brought up in the Greece', 'Both parents are from outside the EU', 'Both parents are from the EU']
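
The "After" lists above still contain values such as 'Amsterdam', 'cairo', and 'Greek' that the mapping does not cover. A small report of such leftovers could help decide what to add to the mapping or fix by hand. The sketch below is one possible way to do that; `KNOWN_LABELS` and `report_unmapped` are hypothetical names, and the frequencies in the toy column are made up.

```python
import pandas as pd

# Labels the cleaning step is expected to produce (an assumption for this sketch).
KNOWN_LABELS = {"Greece", "Germany", "Spain", "Netherlands"}

def report_unmapped(series, known_labels):
    """Hypothetical helper: count values that are not one of the known labels."""
    present = series.dropna()
    leftovers = present[~present.isin(known_labels)]
    return leftovers.value_counts()

# Values copied from the "After" list of the first column; the frequencies are invented.
located = pd.Series(["Greece", "Amsterdam", "cairo", "UK", "Greece", "Netherlands"])
leftover_counts = report_unmapped(located, KNOWN_LABELS)
print(leftover_counts)
```

Not every leftover is an error (for example 'UK' is a valid answer), so the report is meant for manual review rather than automatic replacement.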


In [27]:
# saving the cleaned dataset
gr_clean.to_csv("greekdataset_clean.csv", index=False)
print("\nSaved: greekdataset_clean.csv")


Saved: greekdataset_clean.csv


**Germany**

In [28]:
import pandas as pd
import numpy as np
import unicodedata
from pathlib import Path


In [29]:
DATA_PATH = Path(".")
GERMANY = DATA_PATH / "Germany dataset.csv"   # original file has a space in the name


In [30]:
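# reading the Google Forms export with a fallback over several encodings
def safe_read_csv(path):
    # "latin-1" comes last because it accepts any byte sequence and so never fails.
    for enc in ["utf-8", "utf-8-sig", "cp1252", "latin-1"]:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            pass
    # Fallback: let pandas read the file with its default settings.
    return pd.read_csv(path)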


In [31]:
#  normalizing text so comparisons are easier.
# This code removes the extra spaces, lowers case, and strips accents.
def normalize_basic(s):
    if pd.isna(s):
        return ""
    if not isinstance(s, str):
        s = str(s)
    s = " ".join(s.split()).strip().lower()
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return s


In [32]:
COUNTRY_MAP = {
    # Germany variants
    "germany": "Germany",
    "deutschland": "Germany",
    "ger": "Germany",
    "de": "Germany",
    "alemania": "Germany",

    # Spain variants (appears in origin/parents columns sometimes)
    "spain": "Spain",
    "espana": "Spain",

    # Greece variants 
    "greece": "Greece",
    "ellada": "Greece",
    "hellas": "Greece",
    "gr": "Greece",

    # Netherlands (sometimes appears in 'located in')
    "netherlands": "Netherlands",
    "the netherlands": "Netherlands",
    "nederland": "Netherlands",
    "holland": "Netherlands",

    # missing info will turn into real NaN
    "": np.nan,
    "nan": np.nan,
}

In [33]:
# cleaning the column names and trimming the spaces inside every text cell.
def clean_text_columns(df):
    df = df.copy()
    df.columns = [c.strip() for c in df.columns]
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].apply(lambda x: " ".join(x.split()).strip()
                                if isinstance(x, str) else x)
    return df

In [34]:
# Standardizing any column that looks like a country column.
# I searched for 'country' in the column name to keep this flexible.
def standardize_country_columns(df):
    df = df.copy()
    country_cols = [c for c in df.columns if "country" in c.lower()]

    for col in country_cols:
        # Creating normalized keys for mapping
        keys = df[col].apply(lambda x: normalize_basic(x) if not pd.isna(x) else x)
        mapped = keys.map(COUNTRY_MAP)

        # If there is a mapped value, this code will use it; otherwise it will keep the original.
        df[col] = np.where(mapped.notna(), mapped, df[col])

        #  catching very common loose cases (e.g., "Germany " or "ger")
        df[col] = df[col].apply(
            lambda x: (
                "Germany" if normalize_basic(x) in {"germany","deutschland","ger","de","alemania"}
                else ("Spain" if normalize_basic(x) in {"spain","espana"}
                else ("Greece" if normalize_basic(x) in {"greece","ellada","hellas","gr"}
                else ("Netherlands" if normalize_basic(x) in {"netherlands","the netherlands","nederland","holland"}
                else (np.nan if normalize_basic(x) in {"", "nan"} else x))))
            )
        )

    return df


In [35]:
# showing the unique country values before and after cleaning
def show_before_after_uniques(before, after, title):
    print(f"\n===== {title} =====")
    country_cols = [c for c in before.columns if "country" in c.lower()]
    for col in country_cols:
        b = sorted(pd.Series(before[col].astype(str).fillna("NaN").unique()).tolist())
        a = sorted(pd.Series(after[col].astype(str).fillna("NaN").unique()).tolist())
        print(f"\nColumn: {col}")
        print("Before:", b)
        print("After: ", a)

In [36]:
de_raw = safe_read_csv(GERMANY) #loading the file

In [None]:
# Cleaning the general text issues first 
de_step1 = clean_text_columns(de_raw)

In [38]:
#  Standardizing the country names 
de_clean = standardize_country_columns(de_step1)

In [39]:
# Showing before/after
show_before_after_uniques(de_raw, de_clean, "GERMANY")



===== GERMANY =====

Column: In which country are you located?
Before: ['Germany', 'Germany ', 'Netherlands', 'Netherlands ', 'Netherworld ', 'Singapore', 'The Netherlands', 'The Netherlands ', 'the Netherlands']
After:  ['Germany', 'Netherlands', 'Netherworld', 'Singapore']

Column: What is your country of origin?
Before: ['Germany', 'Germany ', 'I was Born in Germany, but my family is originally from Sri Lanka.', 'Portugal', 'South Africa ']
After:  ['Germany', 'I was Born in Germany, but my family is originally from Sri Lanka.', 'Portugal', 'South Africa']

Column: Which description fits your parents' country of origin the best?
Before: ['Both parents are born and brought up in the Germany', 'Both parents are from outside the EU', 'Both parents are from the EU', 'One parent is from the EU and one parent is outside of the EU']
After:  ['Both parents are born and brought up in the Germany', 'Both parents are from outside the EU', 'Both parents are from the EU', 'One parent is from th
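
The value 'Netherworld' in the output above looks like a typo that an exact dictionary lookup cannot catch. One option, which I did not use for the cleaned file, is to ask Python's standard `difflib` module for the closest known label and then confirm the suggestion by hand. `CANONICAL`, `suggest_country`, and the cutoff of 0.7 are assumptions made for this sketch.

```python
import difflib

# Canonical labels used in this notebook's mappings.
CANONICAL = ["Germany", "Netherlands", "Spain", "Greece"]

def suggest_country(value, cutoff=0.7):
    """Hypothetical helper: return the closest canonical label, or None."""
    lookup = {name.lower(): name for name in CANONICAL}
    matches = difflib.get_close_matches(
        value.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

print(suggest_country("Netherworld "))
print(suggest_country("Singapore"))
```

A suggestion is only a hint. Answers like 'Singapore' are real countries outside the mapping, so they should stay as they are.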

In [40]:
#saving the cleaned file
de_clean.to_csv("germanydataset_clean.csv", index=False)
print("\nSaved: germanydataset_clean.csv")


Saved: germanydataset_clean.csv


**Merge**

In [41]:

import pandas as pd
from pathlib import Path

In [42]:
DATA_PATH = Path(".")

In [None]:
# all four cleaned files in one list
files = [
    DATA_PATH / "spain1dataset_clean.csv",
    DATA_PATH / "spain2dataset_clean.csv",
    DATA_PATH / "greekdataset_clean.csv",
    DATA_PATH / "germanydataset_clean.csv"
]

dfs = []

In [44]:
# reading each cleaned file and putting the DataFrames in a list
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)


In [45]:
# Before merging, I need to check if all files have the same columns
# This prevents alignment issues later.
for i, df in enumerate(dfs):
    print(f"\nDataset {i+1} columns ({files[i].name}):")
    print(list(df.columns))



Dataset 1 columns (spain1dataset_clean.csv):
['ID', 'Start time', 'Completion time', 'Email', 'Name', 'Last modified time', 'Please share your email address', 'In which country are you located?', 'What is your country of origin?', "Which description fits your parents' country of origin the best?", "What's your highest/current level of education?", 'How many languages do you speak?', 'Which languages do you use most online? (select all that applies)', 'What age group are you in?', 'What is your gender?', 'I use social media', 'I think social media is fair', 'I see repeated topics on my feed', 'I see different opinions on the same topic', 'My feed varies within my social circle', "I often notice that someone didn't know about a trend, which I thought everybody had seen", 'It is harder for me to take part in a discussion regarding a topic when I have not seen the topic on my feed yet', "I notice that people mostly agree with what's trending, they rarely add their own personal views - add
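
With 59 columns, comparing the printed lists by eye is easy to get wrong. A small helper could do the comparison instead. The sketch below uses toy frames with made-up headers, and `compare_columns` is a hypothetical name.

```python
import pandas as pd

def compare_columns(frames, names):
    """Hypothetical helper: list columns each frame is missing or adds relative to the first."""
    reference = list(frames[0].columns)
    report = {}
    for name, frame in zip(names, frames):
        missing = [c for c in reference if c not in frame.columns]
        extra = [c for c in frame.columns if c not in reference]
        report[name] = {"missing": missing, "extra": extra}
    return report

# Toy frames; the second one has a trailing space in one header.
a = pd.DataFrame({"ID": [1], "What is your gender?": ["F"]})
b = pd.DataFrame({"ID": [2], "What is your gender? ": ["M"]})

report = compare_columns([a, b], ["a.csv", "b.csv"])
print(report)
```

Empty "missing" and "extra" lists for every file would mean the files can be stacked without creating extra columns.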

In [46]:
# If all column names match, then this code will merge them by stacking rows on top of each other.
# pd.concat is the easiest way to do this.
combined = pd.concat(dfs, ignore_index=True)
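
To illustrate what `pd.concat` does when headers do not match exactly, and how a source label could be kept for each row, here is a minimal sketch with made-up frames. The column name `source_dataset` is my own assumption. The merged file in this notebook does not contain it.

```python
import pandas as pd

spain = pd.DataFrame({"ID": [1, 2], "I use social media": [5, 4]})
greece = pd.DataFrame({"ID": [1], "I use social media ": [3]})  # note the trailing space

# Tag each frame with where it came from before stacking (hypothetical column name).
frames = {"spain": spain, "greece": greece}
stacked = pd.concat(
    [df.assign(source_dataset=name) for name, df in frames.items()],
    ignore_index=True,
)

print(stacked.shape)
print(stacked.isna().sum())
```

The mismatched header produces two separate columns filled with NaN where the other file has no value, which is the alignment issue the column check is meant to prevent. A source label also keeps rows apart when the ID values restart in each file.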

In [47]:
# saving the final merged file.
combined.to_csv("combined_clean_dataset.csv", index=False)

In [48]:
print("\nSaved: combined_clean_dataset.csv")
print("Final shape:", combined.shape)


Saved: combined_clean_dataset.csv
Final shape: (81, 59)


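All four datasets from the three countries (two from Spain, one from Greece, one from Germany) are now cleaned and merged into one combined file with 81 rows and 59 columns. Country-name variants and spacing issues covered by the mappings are corrected so that the data aligns properly. A few free-text answers that are not in the mappings (for example 'Netherworld', 'Greek', and 'cairo') are left unchanged and still need a manual check. After that, the merged dataset is ready for statistical analysis in the next steps of the assignment.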