### Tasks to Complete

1. Review the descriptive statistics of the sample:

   - Characteristics such as year of training, age, gender, race, and prior education.
   - Mean scores of PPOS and PPOS subscales.

2. Perform comparisons of PPOS and subscales:

   - By year of training.
   - By gender (Male/Female).
   - By prior education (e.g., CEGEP, Bachelor's, Graduate).
   - By intended specialization:
     - Primary vs. non-primary care.
     - Surgical vs. non-surgical.

3. Clean and preprocess the dataset:

   - Ensure all columns are renamed appropriately using the `column_rename` dictionary.
   - Drop columns that are empty or not needed for analysis (`gender_other`, `race_other`, `record_id`).

4. Explore and visualize the data:

   - Generate plots to visualize the distribution of key variables (e.g., age, PPOS scores).
   - Create comparison plots for PPOS and subscales based on the specified categories.

5. Document findings and insights:

   - Summarize the key characteristics of the sample.
   - Highlight significant differences in PPOS and subscales across the specified categories.
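
The comparisons in task 2 can be sketched as a grouped summary. This is a minimal illustration rather than the analysis itself: the `toy` frame, its values, and the `summarize_by` helper are hypothetical, and only the column names (`timepoint`, `gender`, `ppos`, `ppos_s`, `ppos_c`) come from the renamed dataset.

```python
import pandas as pd

# Toy data standing in for the survey export; the values are illustrative only.
toy = pd.DataFrame({
    "timepoint": ["M2 (TCP)", "M3", "M3", "M4", "M4", "M4"],
    "gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
    "ppos": [88, 76, 61, 75, 41, 56],
    "ppos_s": [40, 38, 33, 29, 28, 28],
    "ppos_c": [48, 38, 28, 46, 13, 28],
})

SCORE_COLS = ["ppos", "ppos_s", "ppos_c"]


def summarize_by(frame, group_col, score_cols=SCORE_COLS):
    """Return count, mean, and standard deviation of each score per group."""
    return frame.groupby(group_col)[score_cols].agg(["count", "mean", "std"])


by_year = summarize_by(toy, "timepoint")
by_gender = summarize_by(toy, "gender")
print(by_year.round(2))
print(by_gender.round(2))
```

The same helper could be reused for prior education and intended specialization once those free-text columns have been grouped into categories. A grouped summary only describes the data; judging whether differences are significant would need a separate statistical test.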


In [34]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import re

sns.set_theme()

In [35]:
column_rename = {
    "Record ID": "record_id",
    "Timepoint": "timepoint",
    "PPOS": "ppos",
    "PPOS-S": "ppos_s",
    "PPOS-C": "ppos_c",
    "Age:": "age",
    "Gender Identity": "gender",
    "Other: Specify": "gender_other",
    "How proficient are you in French?": "french_proficiency",
    "How proficient are you in English?": "english_proficiency",
    "Do you speak any other languages? Please specify your proficiency (Fluent, Moderate, Basic). Please answer in format: [Language, Proficiency] for all languages.    ": "other_languages",
    "Please specify if other level of training": "other_training",
    "Please list all previous education you have completed (e.g., CEGEP, BSc, MSc, PHD, other professional certification). Please answer in format [degree, year of completion]": "prior_education",
    "Site of study": "site_of_study",
    "Do you have an intended specialization (including Family Medicine)?": "intended_specialization",
    "What is your intended specialization?": "specialty",
    "Other (Please specify):": "race_other",
    # Cultural identity columns
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=Black/African Canadian)": "cultural_black",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=East Asian (e.g., Chinese, Japanese, Korean))": "cultural_east_asian",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=Indigenous (First Nations, Métis, Inuit))": "cultural_indigenous",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=Middle Eastern/North African (e.g., Arab, Persian))": "cultural_middle_eastern",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=Latin American (e.g., Mexican, Brazilian, Coloumbian))": "cultural_latin_american",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=South Asian (e.g., Indian, Bangladeshi, Sri Lankan))": "cultural_south_asian",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=South East Asian (e.g. Filipino, Vietnamese, Thai))": "cultural_southeast_asian",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=White/Caucasian)": "cultural_white",
    "Which Cultural Identities do you identify as? (Select that apply)   (choice=Other (please specify))": "cultural_other",
}

df = pd.read_excel("MSHumanism_CleanQuantData_250625_AC.xlsx", sheet_name="Full Data")
df = df.loc[:, ~df.columns.str.contains("^Unnamed")].rename(columns=column_rename)
df

,record_id,timepoint,ppos,ppos_s,ppos_c,age,gender,gender_other,cultural_black,cultural_east_asian,...,cultural_other,race_other,french_proficiency,english_proficiency,other_languages,other_training,prior_education,site_of_study,intended_specialization,specialty
0,1,M2 (TCP),88,40,48,21,Female,,Unchecked,Unchecked,...,Unchecked,,Fluent,Fluent,"Spanish (moderate), Arab (moderate)",,"CEGEP, 2022",Montreal,Yes,Internal medicine
1,2,M4,75,29,46,42,Female,,Unchecked,Unchecked,...,Unchecked,,Fluent,Fluent,"Hindi, Punjabi, Urdu",,Graduate studies in Experimental Medicine,Montreal,No,
2,3,M3,76,38,38,30,Male,,Unchecked,Unchecked,...,Unchecked,,Moderate,Fluent,"Arabic, proficient",,"Bsc, jd, llm",Montreal,Yes,Dermatology
3,5,M2 (TCP),76,42,34,25,Male,,Unchecked,Unchecked,...,Unchecked,,Moderate,Fluent,none,,"CEGEP, Bsc, Msc",Montreal,Yes,internal medicine - maybe medical oncology
4,7,M3,61,33,28,27,Male,,Checked,Unchecked,...,Unchecked,,Fluent,Fluent,Basic Japanese,,BSc Human Kinetics 2020 BSc Translational and...,Montreal,Yes,Anesthesia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,207,M3,41,20,21,28,Female,,Checked,Unchecked,...,Unchecked,,Moderate,Fluent,"Arabic, fluent",,Bsc Msc,Montreal,No,
134,208,M4,41,28,13,31,Female,,Unchecked,Unchecked,...,Unchecked,,Fluent,Fluent,"tamil, moderate",,"cegep, 2013 bsc, 2015 MSc, 2017 md 2025",Montreal,Yes,family medicine
135,210,M4,56,28,28,28,Female,,Unchecked,Unchecked,...,Unchecked,,Fluent,Fluent,basic spanish,,"cegep in health sciences, 2016 1 year in cine...",Montreal,Yes,I am starting residency in family medicine
136,212,M3,50,36,14,27,Male,,Unchecked,Unchecked,...,Unchecked,,Moderate,Fluent,Arabic moderate Spanish basic,,"BSc, MSc studies,",Montreal,Yes,Family Medicine + 1 Emergency Medicine


In [36]:
# Drop columns that are empty or not needed for the analysis
df = df.drop(columns=["gender_other", "race_other", "record_id"])
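
Before dropping columns it may be worth confirming that they really are empty. The sketch below shows one way to check this; the `non_null_counts` helper and the `toy` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def non_null_counts(frame, columns):
    """Count non-missing entries per candidate column; absent columns are skipped."""
    present = [c for c in columns if c in frame.columns]
    return frame[present].notna().sum()


# Toy frame: one fully empty column and one with a single response.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "gender_other": [None, None, None],
    "race_other": [None, "example", None],
})

counts = non_null_counts(toy, ["gender_other", "race_other", "record_id"])
print(counts)
```

A non-zero count suggests the column holds free-text responses that might deserve a look before being discarded.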

In [37]:
def checked(value):
    # Applied to each cell; str() keeps missing (NaN) cells from raising.
    return str(value).strip().lower() == "checked"


checked_cols = [
    "cultural_black",
    "cultural_east_asian",
    "cultural_indigenous",
    "cultural_middle_eastern",
    "cultural_latin_american",
    "cultural_south_asian",
    "cultural_southeast_asian",
    "cultural_white",
    "cultural_other",
]
df[checked_cols] = df[checked_cols].map(checked)
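
Once the checkbox columns are boolean, tallying them is straightforward. The following sketch uses a hypothetical `toy` frame and `to_bool` helper; it relies on `Series.map`, which may be a safer choice than `DataFrame.map` on older pandas versions where that method is not available.

```python
import pandas as pd

# Toy checkbox data; the values are illustrative only.
toy = pd.DataFrame({
    "cultural_white": ["Checked", "Unchecked", "Checked", None],
    "cultural_black": ["Unchecked", "checked", "Checked", "Unchecked"],
})
cols = ["cultural_white", "cultural_black"]


def to_bool(series):
    """Map each cell to True only when it reads 'checked' (case-insensitive)."""
    return series.map(lambda v: str(v).strip().lower() == "checked")


flags = toy[cols].apply(to_bool)

# Respondents per identity, and respondents who selected more than one.
counts = flags.sum()
multi = int((flags.sum(axis=1) > 1).sum())
print(counts)
print("selected more than one:", multi)
```

Because respondents can select several identities, the per-column counts can add up to more than the number of respondents.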

In [38]:
df

,timepoint,ppos,ppos_s,ppos_c,age,gender,cultural_black,cultural_east_asian,cultural_indigenous,cultural_middle_eastern,...,cultural_white,cultural_other,french_proficiency,english_proficiency,other_languages,other_training,prior_education,site_of_study,intended_specialization,specialty
0,M2 (TCP),88,40,48,21,Female,False,False,False,True,...,False,False,Fluent,Fluent,"Spanish (moderate), Arab (moderate)",,"CEGEP, 2022",Montreal,Yes,Internal medicine
1,M4,75,29,46,42,Female,False,False,False,False,...,False,False,Fluent,Fluent,"Hindi, Punjabi, Urdu",,Graduate studies in Experimental Medicine,Montreal,No,
2,M3,76,38,38,30,Male,False,False,False,True,...,False,False,Moderate,Fluent,"Arabic, proficient",,"Bsc, jd, llm",Montreal,Yes,Dermatology
3,M2 (TCP),76,42,34,25,Male,False,False,False,False,...,True,False,Moderate,Fluent,none,,"CEGEP, Bsc, Msc",Montreal,Yes,internal medicine - maybe medical oncology
4,M3,61,33,28,27,Male,True,False,False,False,...,False,False,Fluent,Fluent,Basic Japanese,,BSc Human Kinetics 2020 BSc Translational and...,Montreal,Yes,Anesthesia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,M3,41,20,21,28,Female,True,False,False,False,...,False,False,Moderate,Fluent,"Arabic, fluent",,Bsc Msc,Montreal,No,
134,M4,41,28,13,31,Female,False,False,False,False,...,False,False,Fluent,Fluent,"tamil, moderate",,"cegep, 2013 bsc, 2015 MSc, 2017 md 2025",Montreal,Yes,family medicine
135,M4,56,28,28,28,Female,False,False,False,False,...,True,False,Fluent,Fluent,basic spanish,,"cegep in health sciences, 2016 1 year in cine...",Montreal,Yes,I am starting residency in family medicine
136,M3,50,36,14,27,Male,False,False,False,True,...,False,False,Moderate,Fluent,Arabic moderate Spanish basic,,"BSc, MSc studies,",Montreal,Yes,Family Medicine + 1 Emergency Medicine
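
The `prior_education` column is free text, so comparing PPOS by prior education requires bucketing the responses first. The sketch below is one possible approach: the patterns, the category labels, and the `education_level` helper are assumptions and would need tuning against the actual responses.

```python
import re

# Hypothetical patterns, checked from highest to lowest level of education.
LEVEL_PATTERNS = [
    ("Graduate", re.compile(r"\b(?:msc|phd|llm|master'?s?|graduate)\b", re.I)),
    ("Bachelor's", re.compile(r"\b(?:bsc|ba|bachelor'?s?)\b", re.I)),
    ("CEGEP", re.compile(r"\bcegep\b", re.I)),
]


def education_level(text):
    """Return the highest matching level, 'Other' if none match, 'Unknown' if missing."""
    if not isinstance(text, str) or not text.strip():
        return "Unknown"
    for label, pattern in LEVEL_PATTERNS:
        if pattern.search(text):
            return label
    return "Other"


examples = ["CEGEP, 2022", "Bsc Msc", "BSc Human Kinetics 2020", None]
levels = [education_level(e) for e in examples]
print(levels)
```

With a DataFrame, this could be applied as `df["prior_education"].map(education_level)`. Checking the highest level first means a response listing several degrees falls into a single bucket.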
