## **PULSE SURVEY 12 MULTI SELECT VALIDATION** 

1. Check the Question Stem Total column for at least 3 single-select questions.
2. Check the Count and Demographic Value Total columns for each demographic for at least 2 single-select questions.
    - Note that Reporting College and Multiple Ethnicities are double-counting demographics: a student with multiple majors or ethnicities is counted once in each unique category. For example, a student in both L&S and CNR is counted as 2 responses, one from L&S and one from CNR. As a result, the Demographic Value Totals for these demographics add up to more than the Question Stem Total.
3. Check the Count and "Demographic Value Total, by Undergrad Grad" columns for one non-double-counting demographic and one double-counting demographic for at least 2 single-select questions (preferably questions that haven't been checked yet).
4. Check that each Question Stem Id matches its Question Stem/Item and Question Response.
    - Use the Pulse Survey Content documents for this (you must download the .docx files to view them).
    - While doing this, make sure the text looks correct.
5. Repeat the same checks for multi-select questions.

### **Demographic Categories**
**Double counting**
- Reporting College
- Multiple Ethnicities

**Non-double-counting**
- Undergrad Grad
- Derived Residency Desc
- Entry Status Desc
- Ucb Level1 Ethnic Rollup Desc
- Ucb Level2 Ethnic Rollup Desc
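
The double-counting note can be made concrete with a toy example. The sketch below uses made-up respondents and assumes that a double-counting demographic such as Reporting College is stored as a list per respondent (as in the cleaned survey output further down). It illustrates the arithmetic only and is not the production counting code.

```python
import pandas as pd

# Toy data (hypothetical): three respondents, one enrolled in two colleges.
toy = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2', 'R_3'],
    'Undergrad Grad': ['U', 'U', 'G'],
    'Reporting College': [['L&S', 'CNR'], ['L&S'], ['CNR']],
    'HOUSING_NEW': ['Yes', 'No', 'Yes'],
})

# Question Stem Total: each respondent who answered is counted once.
stem_total = toy['HOUSING_NEW'].notna().sum()

# Demographic Value Totals: explode yields one row per (respondent, college),
# so the respondent with two colleges contributes to both categories.
exploded = toy.explode('Reporting College')
value_totals = exploded.dropna(subset=['HOUSING_NEW'])['Reporting College'].value_counts()

print(stem_total)          # 3
print(value_totals.sum())  # 4
```

The sum of the Demographic Value Totals (4) exceeds the Question Stem Total (3) by exactly the number of extra category memberships.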


In [1]:
from sklearn.pipeline import Pipeline, FeatureUnion
import pandas as pd
import numpy as np
from IPython.display import display

In [2]:
%run cleaning_transformers.ipynb

In [3]:
%run multiselect_counter_transformers.ipynb

In [4]:
DATA_SOURCE = pd.read_csv('12_ps_data_source.csv')
RAW_SURVEY = pd.read_csv('pulse_survey_12_raw_data.csv')


## 1. Clean raw data

In [5]:
# data cleaning variables
COLUMNS_TO_REMOVE = ['RecordedDate'] ## may need to add:'PHQ2SCORE', 'GAD2SCORE', 'PHQ2', 'GAD2'
UNGRAD_GRAD_COL = 'UNGRADGRADCD' ## may need to replace
RESIDENCY_COL = 'RESIDENCY' ## may need to replace
ENTRY_STATUS_COL = 'ENTRYSTATUSDESC' ## may need to replace
ETH_LEVEL1_COL = 'LEVEL1ETH' ## may need to replace
ETH_LEVEL2_COL = 'LEVEL2ETH' ## may need to replace
VALUES_TO_NULLIFY = [-99, '-99', -1, '-1', -999, '-999', 'Not selected'] ## may need to replace

############# OPTIONAL: use ONLY if Reporting College cols look like a stem id #############
# rename reporting college columns to avoid them getting treated as a question
RAW_SURVEY = RAW_SURVEY.rename(columns={'REPORTCOLLEGE1':'Reporting College - First Plan',
                                        'REPORTCOLLEGE2':'Reporting College - Second Plan',
                                        'REPORTCOLLEGE3':'Reporting College - Third Plan'})
############################################################################################
COLLEGE_COLS = RAW_SURVEY.columns[RAW_SURVEY.columns.str.contains('Reporting College')]
MULTI_ETH_COLS = ['African American / Black',
                  'Asian / Asian American',
                  'Hispanic / Latinx',
                  'International',
                  'American Indian / Alaska Native',
                  'Pacific Islander',
                  'Southwest Asian / North African',
                  'White / Caucasian',
                  'No Response']

# counting variables
QUESTION_DESC = RAW_SURVEY.loc[[0]] 
DATA = RAW_SURVEY[1:] 
DEMOGRAPHIC_COLUMNS = ['Undergrad Grad',
                       'Derived Residency Desc',
                       'Entry Status Desc',
                       'Ucb Level1 Ethnic Rollup Desc',
                       'Ucb Level2 Ethnic Rollup Desc',
                       'Low-income Status',
                       'First Gen College',
                       'Person Gender Desc',
                       'Reporting College',
                       'Multiple Ethnicities']

cleaning_pipeline = Pipeline([
    # drop null responses, remove duplicates and columns, make all missing/irrelevant values nan
    ('null rows remover', RemoveNullRowsTransformer()),
    ('values nullifier', ReplaceValuesTransformer(values_to_replace=VALUES_TO_NULLIFY)),
    ('duplicates remover', RemoveFirstDuplicateTransformer()),
    ('irrelevant columns remover', RemoveColumnsTransformer(columns_to_remove=COLUMNS_TO_REMOVE)),
    # rename column names
    ('undergrad grad col renamer', RenameColumnTransformer(UNGRAD_GRAD_COL, 'Undergrad Grad')),
    ('residency col renamer', RenameColumnTransformer(RESIDENCY_COL, 'Derived Residency Desc')),
    ('entry status col renamer', RenameColumnTransformer(ENTRY_STATUS_COL, 'Entry Status Desc')),
    ('ethnic lvl1 col renamer', RenameColumnTransformer(ETH_LEVEL1_COL, 'Ucb Level1 Ethnic Rollup Desc')),
    ('ethnic lvl2 col renamer', RenameColumnTransformer(ETH_LEVEL2_COL, 'Ucb Level2 Ethnic Rollup Desc')),
    # rename dataframe values
    ('undergrad value renamer', RelabelColumnTransformer(column_to_relabel='Undergrad Grad', new_label='U')),
    ('grad value renamer', RelabelColumnTransformer(column_to_relabel='Undergrad Grad', new_label='G')),
    ('first-year entry value renamer', RelabelColumnTransformer(column_to_relabel='Entry Status Desc', new_label='First-year')),
    # replace ADVANCED STANDING with NaN for all grad students
    ('advanced standing grad nullifier', ReplaceStringWithNaNTransformer(standing_col='Entry Status Desc')),
    # create columns for double counting demographics & mental health scores
    ('reporting clg col generator', UniqueStringListTransformer(columns_to_list=COLLEGE_COLS, unique_col_list='Reporting College')),
    ('multiple eth col generator', UniqueStringListTransformer(columns_to_list=MULTI_ETH_COLS, unique_col_list='Multiple Ethnicities')),
    ('depression col generator', AddColumnsTransformer(column_1='MHLTH1', column_2='MHLTH2', new_column='PHQ2', binary_column='DEPRESSION')),
    ('anxiety col generator', AddColumnsTransformer(column_1='MHLTH3', column_2='MHLTH4', new_column='GAD2', binary_column='ANXIETY'))
])

In [6]:
RAW_SURVEY = cleaning_pipeline.fit_transform(DATA)
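
The transformers come from `cleaning_transformers.ipynb`, which is not shown here, so the following is only a guess at the effect of the `'values nullifier'` step. It assumes the step behaves like `DataFrame.replace(VALUES_TO_NULLIFY, np.nan)` and uses made-up rows to show why that matters for the totals checked later.

```python
import numpy as np
import pandas as pd

VALUES_TO_NULLIFY = [-99, '-99', -1, '-1', -999, '-999', 'Not selected']

# Toy multi-select items (hypothetical rows).
toy = pd.DataFrame({
    'HOUSING_DIFF_1': ['Selected', 'Not selected', '-99'],
    'HOUSING_DIFF_2': ['Not selected', 'Not selected', 'Selected'],
})

# Assumed behavior of the nullifier: sentinel values become NaN.
cleaned = toy.replace(VALUES_TO_NULLIFY, np.nan)

# Respondents with at least one non-null item, before and after cleaning.
answered_before = len(toy.dropna(how='all'))
answered_after = len(cleaned.dropna(how='all'))

print(answered_before)  # 3
print(answered_after)   # 2
```

Under this assumption, a respondent whose items are all `'Not selected'` no longer counts toward a stem total computed with `dropna(how='all')`.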

In [7]:
DATA_SOURCE['Count'] = pd.to_numeric(DATA_SOURCE['Count'], downcast="integer")
DATA_SOURCE.head(2)

Unnamed: 0,Question Stem Id,Question Item Id,Demographic Category,Demographic Value,Undergrad Grad,Question Response,Count,Question Item,Question Stem,Demographic Value Total,"Demographic Value Total, by Undergrad Grad",Question Stem Total,Question Item Total
0,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,Cooperative (Co-op) housing,70,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299
1,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,I have not yet secured housing for Fall 2022,20,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299


In [8]:
RAW_SURVEY.head(2)

Unnamed: 0,ResponseId,Duration,EDUCNONEXAMLEVEL,EDUCNONEXAMLEVELCD,UGENTRYSTATUS,REGSTATUSDESC,GENDER,SHORTETHNICDESC,TYPE,Undergrad Grad,LowSocioEconomicStatusFlg,NeitherParent4yrClgDegFlg,ACADPLANNM1,ACADPLANNM2,ACADPLANNM3,CNR,CHE,COE,CED,CLS,BUS,GSE,GSJ,SPP,SOI,LAW,OPT,SPH,SSW,HOUSING_RESIDE,HOUSING_SAT_1,HOUSING_SAT_2,HOUSING_SAT_3,HOUSING_SAT_4,HOUSING_SAT_5,HOUSING_NEW,HOUSING_SEARCH,HOUSING_DIFF_1,HOUSING_DIFF_2,HOUSING_DIFF_3,HOUSING_DIFF_4,HOUSING_DIFF_5,HOUSING_DIFF_6,COURSES_REQU,COURSES_ELEC,COURSES_TTD,SERVICES_1,SERVICES_2,SERVICES_3,SERVICES_4,SERVICES_5,SERVICES_6,SERVICES_7,SERVICES_8,SERVICES_9,SERVICES_10,SERVICES_11,SERVICES_12,SERVICES_TIME_1,SERVICES_TIME_2,SERVICES_TIME_3,SERVICES_TIME_4,SERVICES_TIME_5,SERVICES_TIME_6,SERVICES_TIME_7,SERVICES_TIME_8,SERVICES_TIME_9,SERVICES_TIME_10,SERVICES_TIME_11,SERVICES_TIME_12,SERVICES_SAT_1,SERVICES_SAT_2,SERVICES_SAT_3,SERVICES_SAT_4,SERVICES_SAT_5,SERVICES_SAT_6,SERVICES_SAT_7,SERVICES_SAT_8,SERVICES_SAT_9,SERVICES_SAT_10,SERVICES_SAT_11,SERVICES_SAT_12,WIFI_SAT,WIFI_KEYSUSE,WIFI_KEYSSAT,CAPS_1,CAPS_2,CAPS_3,CAPS_4,RESOURCES,RESOURCES_MORE_1,RESOURCES_MORE_2,RESOURCES_MORE_3,RESOURCES_MORE_4,RESOURCES_MORE_5,RESOURCES_MORE_6,RESOURCES_MORE_7,RESOURCES_MORE_8,MHLTH1,MHLTH2,MHLTH3,MHLTH4,PHQ2SCORE,GAD2SCORE,PHQ2,GAD2,Semester Year Name Concat,African American / Black,Asian / Asian American,Hispanic / Latinx,International,American Indian / Alaska Native,Pacific Islander,Southwest Asian / North African,White / Caucasian,No Response,First Gen College,Person Gender Desc,Entry Status Desc,Derived Residency Desc,Ucb Level1 Ethnic Rollup Desc,Ucb Level2 Ethnic Rollup Desc,Reporting College - First Plan,Reporting College - Second Plan,Reporting College - Third Plan,Low-income Status,Reporting College,Multiple Ethnicities,DEPRESSION,ANXIETY
1,R_3PBs3Lj4wGkssfe,192.0,Doctoral not advanced to candidacy,Doctoral (not advanced to candidacy),,Continuing Student,Male,South Asian,Graduate Student,G,,,Chemical Engineering PhD,,,,,,,,,,,,,,,,,"Renting an apartment, house or condominium (not UCB-owned)",Very satisfied,Somewhat satisfied,Somewhat satisfied,Very satisfied,Very satisfied,No,Somewhat difficult,Selected,Selected,Selected,Selected,Selected,,Not applicable,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Somewhat dissatisfied,No,,Very well,Very well,Very well,Well,Very confident,,,,,,,,,Not at all,Not at all,Not at all,Not at all,0.0,0.0,0.0,0.0,2022 Fall,,Asian / Asian American,,,,,,White / Caucasian,,Not first-generation college,Man,FIRST TIME IN PROGRAM,Out of State Domestic,Asian,Asian,College of Chemistry,,,,[College of Chemistry],"[Asian / Asian American, White / Caucasian]",NO,NO
2,R_3hzM0hhcUj8G6E2,171.0,Freshman,Freshman,New Freshman,New Student,Female,,Undergraduate,U,,,Letters & Sci Undeclared UG,,,,,,,,,,,,,,,,,University Residence Halls or other University-managed housing,Very satisfied,Somewhat satisfied,Somewhat satisfied,Very dissatisfied,Somewhat satisfied,Not applicable or was not here last semester,Somewhat easy,,,,,,,Yes,No,,Selected,,Selected,,Selected,Selected,Selected,,,,,,Very satisfied,,Less than 1 day,,Less than 1 day,Less than 1 day,Less than 1 day,,,,,,Somewhat satisfied,,Somewhat satisfied,,Very satisfied,Very satisfied,Very satisfied,,,,,,Very dissatisfied,Yes,Somewhat dissatisfied,Well,Well,Well,Poorly,Somewhat confident,,,,,,,,,Several days,Several days,Several days,Not at all,2.0,1.0,2.0,1.0,2022 Fall,,,,,,,,,No Response,Unknown,Woman,First-year,CA Resident,,,College of Letters and Science,,,Not low-income,[College of Letters and Science],[No Response],NO,NO


In [9]:
STEM_ID = DATA_SOURCE['Question Stem Id'].unique()
STEM_ID

array(['HOUSING_RESIDE', 'HOUSING_SAT_1', 'HOUSING_SAT_2',
       'HOUSING_SAT_3', 'HOUSING_SAT_4', 'HOUSING_SAT_5', 'HOUSING_NEW',
       'HOUSING_SEARCH', 'COURSES_REQU', 'COURSES_ELEC', 'COURSES_TTD',
       'SERVICES_TIME_1', 'SERVICES_TIME_2', 'SERVICES_TIME_3',
       'SERVICES_TIME_4', 'SERVICES_TIME_5', 'SERVICES_TIME_6',
       'SERVICES_TIME_7', 'SERVICES_TIME_8', 'SERVICES_TIME_9',
       'SERVICES_TIME_10', 'SERVICES_TIME_11', 'SERVICES_TIME_12',
       'SERVICES_SAT_1', 'SERVICES_SAT_2', 'SERVICES_SAT_3',
       'SERVICES_SAT_4', 'SERVICES_SAT_5', 'SERVICES_SAT_6',
       'SERVICES_SAT_7', 'SERVICES_SAT_8', 'SERVICES_SAT_9',
       'SERVICES_SAT_10', 'SERVICES_SAT_11', 'SERVICES_SAT_12',
       'WIFI_SAT', 'WIFI_KEYSUSE', 'WIFI_KEYSSAT', 'CAPS_1', 'CAPS_2',
       'CAPS_3', 'CAPS_4', 'RESOURCES', 'MHLTH1', 'MHLTH2', 'MHLTH3',
       'MHLTH4', 'PHQ2', 'DEPRESSION', 'GAD2', 'ANXIETY', 'HOUSING_DIFF',
       'SERVICES', 'RESOURCES_MORE'], dtype=object)

In [11]:
MULTI_SELECT = ['HOUSING_DIFF', 'SERVICES', 'RESOURCES_MORE']

# avoid shadowing the built-in id() with the loop variable
SINGLE_SELECT = [stem_id for stem_id in STEM_ID if stem_id not in MULTI_SELECT]

SINGLE_DEMOS = ['Undergrad Grad', 'Derived Residency Desc', 
             'Entry Status Desc', 'Ucb Level1 Ethnic Rollup Desc',
             'Ucb Level2 Ethnic Rollup Desc', 'Low-income Status', 
             'First Gen College', 'Person Gender Desc']

DOUBLE_DEMOS = ['Multiple Ethnicities', 'Reporting College']
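
Labels like these are compared by exact string equality in the checks below (for example `ds_counts['Demographic Category'] == demo`), so a stray space or typo produces empty results without raising an error. A minimal guard could look like the following; `missing_labels` is a hypothetical helper and the column list is a made-up subset.

```python
def missing_labels(expected, available):
    """Return the expected labels that have no exact match in available."""
    available = set(available)
    return [label for label in expected if label not in available]

# Toy column list (hypothetical subset of the cleaned survey's columns).
columns = ['Undergrad Grad', 'Reporting College', 'Multiple Ethnicities']

print(missing_labels(['Reporting College', 'Multiple Ethnicities'], columns))
# []
print(missing_labels(['Reporting College', 'Multiple Ethnicities '], columns))
# ['Multiple Ethnicities ']
```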

## 2. Check the Question Stem Total column for at least 3 single select questions

In [12]:
# completed function 
def check_qstem_total(qstems): 
    """Compare the data source's Question Stem Total with a count from the raw survey."""
    for qstem in qstems: 
        print('_____', qstem, '_____')
        # finding data source value for question stem total 
        # NOTE: str.contains is a substring match, so 'SERVICES' also pulls in
        # the 'SERVICES_TIME_*' and 'SERVICES_SAT_*' stems
        allstemtotal = DATA_SOURCE[DATA_SOURCE['Question Stem Id'].str.contains(qstem, case=False)]
        stemtotal = allstemtotal[['Question Item Id', 'Question Stem Total']]
        stemtotal = stemtotal.drop_duplicates(ignore_index=True)
        if stemtotal['Question Stem Total'].nunique() > 1: 
            print('ERROR: DATA SOURCE HAS MULTIPLE STEM TOTAL VALUES -- CHECK .value_counts()')
            display(allstemtotal[['Question Item Id', 'Demographic Category', 'Question Stem Total']].drop_duplicates(ignore_index=True))
            
        # only the first listed total is compared, even when several values exist
        data_source_val = stemtotal['Question Stem Total'].iloc[0]

        # finding raw survey value for question stem total:
        # respondents who answered at least one column whose name contains qstem
        stems = [stem for stem in RAW_SURVEY.columns if qstem in stem]
        raw_survey_val = len(RAW_SURVEY[stems].dropna(how='all'))

        print('DATA SOURCE:', data_source_val)
        print('RAW SURVEY:', raw_survey_val)
        print('Equal?:', data_source_val == raw_survey_val) 
        print("\n")
        

# check multiple stem totals 
qstems = MULTI_SELECT
check_qstem_total(qstems)

_____ HOUSING_DIFF _____
ERROR: DATA SOURCE HAS MULTIPLE STEM TOTAL VALUES -- CHECK .value_counts()


Unnamed: 0,Question Item Id,Demographic Category,Question Stem Total
0,HOUSING_DIFF_1,Reporting College,5905
1,HOUSING_DIFF_2,Reporting College,5905
2,HOUSING_DIFF_3,Reporting College,5905
3,HOUSING_DIFF_4,Reporting College,5905
4,HOUSING_DIFF_5,Reporting College,5905
5,HOUSING_DIFF_6,Reporting College,5905
6,HOUSING_DIFF_1,Multiple Ethnicities,8190
7,HOUSING_DIFF_2,Multiple Ethnicities,8190
8,HOUSING_DIFF_3,Multiple Ethnicities,8190
9,HOUSING_DIFF_4,Multiple Ethnicities,8190


DATA SOURCE: 5905
RAW SURVEY: 5849
Equal?: False


_____ SERVICES _____
ERROR: DATA SOURCE HAS MULTIPLE STEM TOTAL VALUES -- CHECK .value_counts()


Unnamed: 0,Question Item Id,Demographic Category,Question Stem Total
0,SERVICES_TIME_1,Undergrad Grad,6421
1,SERVICES_TIME_2,Undergrad Grad,453
2,SERVICES_TIME_3,Undergrad Grad,6187
3,SERVICES_TIME_4,Undergrad Grad,1018
4,SERVICES_TIME_5,Undergrad Grad,8221
...,...,...,...
353,SERVICES_12,Ucb Level1 Ethnic Rollup Desc,12831
354,SERVICES_12,Ucb Level2 Ethnic Rollup Desc,12831
355,SERVICES_12,Low-income Status,12831
356,SERVICES_12,First Gen College,12831


DATA SOURCE: 6421
RAW SURVEY: 12831
Equal?: False


_____ RESOURCES_MORE _____
ERROR: DATA SOURCE HAS MULTIPLE STEM TOTAL VALUES -- CHECK .value_counts()


Unnamed: 0,Question Item Id,Demographic Category,Question Stem Total
0,RESOURCES_MORE_1,Reporting College,1862
1,RESOURCES_MORE_2,Reporting College,1862
2,RESOURCES_MORE_3,Reporting College,1862
3,RESOURCES_MORE_4,Reporting College,1862
4,RESOURCES_MORE_5,Reporting College,1862
...,...,...,...
75,RESOURCES_MORE_8,Ucb Level1 Ethnic Rollup Desc,1845
76,RESOURCES_MORE_8,Ucb Level2 Ethnic Rollup Desc,1845
77,RESOURCES_MORE_8,Low-income Status,1845
78,RESOURCES_MORE_8,First Gen College,1845


DATA SOURCE: 1862
RAW SURVEY: 1845
Equal?: False
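
The SERVICES result above mixes in `SERVICES_TIME_*` and `SERVICES_SAT_*` rows because `str.contains` and `qstem in stem` are substring tests. One possible stricter alternative is sketched below. `item_columns` is a hypothetical helper, and it assumes that the items of a multi-select stem are named `<stem>_<number>`, which matches the column names visible in this notebook but is not guaranteed for other surveys.

```python
import re
import pandas as pd

def item_columns(columns, qstem):
    """Columns named exactly qstem or qstem_<number> (assumed naming scheme)."""
    pattern = re.compile(rf'^{re.escape(qstem)}(_\d+)?$')
    return [col for col in columns if pattern.match(col)]

stem_ids = pd.Series(['SERVICES', 'SERVICES_TIME_1', 'SERVICES_SAT_1'])
raw_columns = ['SERVICES_1', 'SERVICES_12', 'SERVICES_TIME_1', 'SERVICES_SAT_1']

# Substring match: also returns the TIME and SAT stems.
print(stem_ids[stem_ids.str.contains('SERVICES', case=False)].tolist())
# ['SERVICES', 'SERVICES_TIME_1', 'SERVICES_SAT_1']

# Exact match: returns only the multi-select stem itself.
print(stem_ids[stem_ids == 'SERVICES'].tolist())
# ['SERVICES']

# Anchored pattern for the raw survey's item columns.
print(item_columns(raw_columns, 'SERVICES'))
# ['SERVICES_1', 'SERVICES_12']
```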




In [13]:
qstem = MULTI_SELECT[0]
qstem

'HOUSING_DIFF'

#### DATA SOURCE STEM TOTAL 

In [14]:
allstemtotal = DATA_SOURCE[DATA_SOURCE['Question Stem Id'].str.contains(qstem, case=False)]
allstemtotal.head(2)

Unnamed: 0,Question Stem Id,Question Item Id,Demographic Category,Demographic Value,Undergrad Grad,Question Response,Count,Question Item,Question Stem,Demographic Value Total,"Demographic Value Total, by Undergrad Grad",Question Stem Total,Question Item Total
22273,HOUSING_DIFF,HOUSING_DIFF_1,Reporting College,Berkeley School of Education,G,Selected,25,,You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Select all that apply. - High cost of housing,26,26,5905,5402
22274,HOUSING_DIFF,HOUSING_DIFF_1,Reporting College,Berkeley School of Education,U,Selected,0,,You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Select all that apply. - High cost of housing,26,-1,5905,5402


In [16]:
qstem = MULTI_SELECT[0]
allstemtotal = DATA_SOURCE[DATA_SOURCE['Question Stem Id'].str.contains(qstem, case=False)]
stemtotal = allstemtotal[['Question Item Id', 'Question Stem Total']]
stemtotal = stemtotal.drop_duplicates(ignore_index=True)
if len(stemtotal['Question Stem Total'].value_counts()) > 1: 
    print('ERROR: DATA SOURCE HAS MULTIPLE STEM TOTAL VALUES -- CHECK .value_counts()')
            
stemtotal

ERROR: DATA SOURCE HAS MULTIPLE STEM TOTAL VALUES -- CHECK .value_counts()


Unnamed: 0,Question Item Id,Question Stem Total
0,HOUSING_DIFF_1,5905
1,HOUSING_DIFF_2,5905
2,HOUSING_DIFF_3,5905
3,HOUSING_DIFF_4,5905
4,HOUSING_DIFF_5,5905
5,HOUSING_DIFF_6,5905
6,HOUSING_DIFF_1,8190
7,HOUSING_DIFF_2,8190
8,HOUSING_DIFF_3,8190
9,HOUSING_DIFF_4,8190


#### RAW SURVEY STEM TOTAL

In [17]:
# get all column names that have qstem 
stems = [stem for stem in RAW_SURVEY.columns if qstem in stem]

In [18]:
len(RAW_SURVEY[stems].dropna(how='all'))

5849

## 3. Check Count and Demographic Value Totals column for each demographic

In [19]:
# completed function (one demographic value) 
def check_count_onedemo(qstem, demo, double_count_demo=False): 
    sort_cols = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']

    # finding data source values #
    ds_counts = DATA_SOURCE[DATA_SOURCE['Question Stem Id'] == qstem]
    ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total', 'Undergrad Grad', 'Count', 'Question Response', 'Question Item Id']]
    ds_counts = ds_counts.sort_index(axis=1).sort_values(by=sort_cols).reset_index(drop=True)
    
    # finding raw survey values #
    stems = [stem for stem in RAW_SURVEY.columns if qstem in stem]
    # double-counting demographics hold lists, so explode them into one row per value;
    # otherwise work on a copy so the helper column below is not added to RAW_SURVEY
    raw = RAW_SURVEY.explode(demo) if double_count_demo else RAW_SURVEY.copy()

    # demographic value total: rows that answered at least one item of this stem
    selected = raw[stems + [demo]].dropna(subset=stems, thresh=1)
    demoval_total = selected[demo].value_counts().to_dict()

    raw_final = pd.DataFrame() 
    for stem in stems: 
        # duplicate the item column so it can be both the pivot index and the counted value
        raw['ID DUPLICATE'] = raw[stem]
        raw_piv = pd.pivot_table(raw, values=stem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')
        raw_piv = raw_piv.reset_index().rename(columns={demo: 'Demographic Value', stem: 'Count', 'ID DUPLICATE': 'Question Response'})
        raw_piv['Demographic Value Total'] = raw_piv['Demographic Value'].map(demoval_total)
        raw_piv['Question Item Id'] = stem
        raw_final = pd.concat([raw_final, raw_piv], ignore_index=True).sort_index(axis=1).sort_values(by=sort_cols).reset_index(drop=True)
    

    print('DATA SOURCE: ')
    display(ds_counts)
    print("\n")
    print('RAW SURVEY: ')
    display(raw_final)


# completed function (all demographic values for ONE QUESTION ITEM) 
def check_count_alldemo(qitem, demo_vals): 
    for demo in demo_vals:
        print('DEMOGRAPHIC VALUE:', demo) 
        if demo in ['Reporting College', 'Multiple Ethnicities']:
            check_count_onedemo(qitem, demo, double_count_demo = True)
        else:
            check_count_onedemo(qitem, demo) 
        print("\n")
        
demo_cat = [#'Undergrad Grad',
            'Derived Residency Desc',
            'Entry Status Desc',
            'Ucb Level1 Ethnic Rollup Desc',
            'Ucb Level2 Ethnic Rollup Desc',
            'Low-income Status',
            'First Gen College',
            'Person Gender Desc',
            'Reporting College',
            'Multiple Ethnicities']

# if the two tables do not match, check the dataframes below by replacing the variables qstem and demo (mismatches are typically due to cleaning or low counts)

In [20]:
check_count_alldemo(MULTI_SELECT[0], demo_cat) 

DEMOGRAPHIC VALUE: Derived Residency Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,36,CA Resident,3354,HOUSING_DIFF_6,Selected,G
1,150,CA Resident,3354,HOUSING_DIFF_5,Selected,G
2,154,CA Resident,3354,HOUSING_DIFF_2,Selected,G
3,236,CA Resident,3354,HOUSING_DIFF_4,Selected,G
4,273,CA Resident,3354,HOUSING_DIFF_3,Selected,G
5,414,CA Resident,3354,HOUSING_DIFF_1,Selected,G
6,263,CA Resident,3354,HOUSING_DIFF_6,Selected,U
7,591,CA Resident,3354,HOUSING_DIFF_5,Selected,U
8,896,CA Resident,3354,HOUSING_DIFF_2,Selected,U
9,1322,CA Resident,3354,HOUSING_DIFF_4,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,36,CA Resident,3354,HOUSING_DIFF_6,Selected,G
1,150,CA Resident,3354,HOUSING_DIFF_5,Selected,G
2,154,CA Resident,3354,HOUSING_DIFF_2,Selected,G
3,236,CA Resident,3354,HOUSING_DIFF_4,Selected,G
4,273,CA Resident,3354,HOUSING_DIFF_3,Selected,G
5,414,CA Resident,3354,HOUSING_DIFF_1,Selected,G
6,263,CA Resident,3354,HOUSING_DIFF_6,Selected,U
7,591,CA Resident,3354,HOUSING_DIFF_5,Selected,U
8,896,CA Resident,3354,HOUSING_DIFF_2,Selected,U
9,1322,CA Resident,3354,HOUSING_DIFF_4,Selected,U




DEMOGRAPHIC VALUE: Entry Status Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,134,ADVANCED STANDING,1389,HOUSING_DIFF_6,Selected,U
1,319,ADVANCED STANDING,1389,HOUSING_DIFF_5,Selected,U
2,492,ADVANCED STANDING,1389,HOUSING_DIFF_2,Selected,U
3,652,ADVANCED STANDING,1389,HOUSING_DIFF_4,Selected,U
4,860,ADVANCED STANDING,1389,HOUSING_DIFF_3,Selected,U
...,...,...,...,...,...,...
61,-1,UNKNOWN,-1,HOUSING_DIFF_2,Selected,U
62,-1,UNKNOWN,-1,HOUSING_DIFF_3,Selected,U
63,-1,UNKNOWN,-1,HOUSING_DIFF_4,Selected,U
64,-1,UNKNOWN,-1,HOUSING_DIFF_5,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,134,ADVANCED STANDING,1389,HOUSING_DIFF_6,Selected,U
1,319,ADVANCED STANDING,1389,HOUSING_DIFF_5,Selected,U
2,492,ADVANCED STANDING,1389,HOUSING_DIFF_2,Selected,U
3,652,ADVANCED STANDING,1389,HOUSING_DIFF_4,Selected,U
4,860,ADVANCED STANDING,1389,HOUSING_DIFF_3,Selected,U
5,1292,ADVANCED STANDING,1389,HOUSING_DIFF_1,Selected,U
6,1,DOCTORAL,6,HOUSING_DIFF_2,Selected,G
7,1,DOCTORAL,6,HOUSING_DIFF_6,Selected,G
8,3,DOCTORAL,6,HOUSING_DIFF_3,Selected,G
9,3,DOCTORAL,6,HOUSING_DIFF_4,Selected,G




DEMOGRAPHIC VALUE: Ucb Level1 Ethnic Rollup Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,19,Asian,1793,HOUSING_DIFF_6,Selected,G
1,90,Asian,1793,HOUSING_DIFF_5,Selected,G
2,113,Asian,1793,HOUSING_DIFF_2,Selected,G
3,126,Asian,1793,HOUSING_DIFF_4,Selected,G
4,184,Asian,1793,HOUSING_DIFF_3,Selected,G
5,263,Asian,1793,HOUSING_DIFF_1,Selected,G
6,108,Asian,1793,HOUSING_DIFF_6,Selected,U
7,321,Asian,1793,HOUSING_DIFF_5,Selected,U
8,500,Asian,1793,HOUSING_DIFF_2,Selected,U
9,666,Asian,1793,HOUSING_DIFF_4,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,19,Asian,1793,HOUSING_DIFF_6,Selected,G
1,90,Asian,1793,HOUSING_DIFF_5,Selected,G
2,113,Asian,1793,HOUSING_DIFF_2,Selected,G
3,126,Asian,1793,HOUSING_DIFF_4,Selected,G
4,184,Asian,1793,HOUSING_DIFF_3,Selected,G
5,263,Asian,1793,HOUSING_DIFF_1,Selected,G
6,108,Asian,1793,HOUSING_DIFF_6,Selected,U
7,321,Asian,1793,HOUSING_DIFF_5,Selected,U
8,500,Asian,1793,HOUSING_DIFF_2,Selected,U
9,666,Asian,1793,HOUSING_DIFF_4,Selected,U




DEMOGRAPHIC VALUE: Ucb Level2 Ethnic Rollup Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,10,African American,242,HOUSING_DIFF_6,Selected,G
1,22,African American,242,HOUSING_DIFF_2,Selected,G
2,30,African American,242,HOUSING_DIFF_5,Selected,G
3,35,African American,242,HOUSING_DIFF_4,Selected,G
4,42,African American,242,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
79,109,White,1083,HOUSING_DIFF_5,Selected,U
80,199,White,1083,HOUSING_DIFF_2,Selected,U
81,315,White,1083,HOUSING_DIFF_4,Selected,U
82,409,White,1083,HOUSING_DIFF_3,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,10,African American,242,HOUSING_DIFF_6,Selected,G
1,22,African American,242,HOUSING_DIFF_2,Selected,G
2,30,African American,242,HOUSING_DIFF_5,Selected,G
3,35,African American,242,HOUSING_DIFF_4,Selected,G
4,42,African American,242,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
77,109,White,1083,HOUSING_DIFF_5,Selected,U
78,199,White,1083,HOUSING_DIFF_2,Selected,U
79,315,White,1083,HOUSING_DIFF_4,Selected,U
80,409,White,1083,HOUSING_DIFF_3,Selected,U




DEMOGRAPHIC VALUE: Low-income Status
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,99,Low-income,1266,HOUSING_DIFF_6,Selected,U
1,310,Low-income,1266,HOUSING_DIFF_5,Selected,U
2,405,Low-income,1266,HOUSING_DIFF_2,Selected,U
3,594,Low-income,1266,HOUSING_DIFF_4,Selected,U
4,776,Low-income,1266,HOUSING_DIFF_3,Selected,U
5,1165,Low-income,1266,HOUSING_DIFF_1,Selected,U
6,262,Not low-income,2722,HOUSING_DIFF_6,Selected,U
7,494,Not low-income,2722,HOUSING_DIFF_5,Selected,U
8,810,Not low-income,2722,HOUSING_DIFF_2,Selected,U
9,1159,Not low-income,2722,HOUSING_DIFF_4,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,99,Low-income,1266,HOUSING_DIFF_6,Selected,U
1,310,Low-income,1266,HOUSING_DIFF_5,Selected,U
2,405,Low-income,1266,HOUSING_DIFF_2,Selected,U
3,594,Low-income,1266,HOUSING_DIFF_4,Selected,U
4,776,Low-income,1266,HOUSING_DIFF_3,Selected,U
5,1165,Low-income,1266,HOUSING_DIFF_1,Selected,U
6,262,Not low-income,2722,HOUSING_DIFF_6,Selected,U
7,494,Not low-income,2722,HOUSING_DIFF_5,Selected,U
8,810,Not low-income,2722,HOUSING_DIFF_2,Selected,U
9,1159,Not low-income,2722,HOUSING_DIFF_4,Selected,U




DEMOGRAPHIC VALUE: First Gen College
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,19,First-generation college,1657,HOUSING_DIFF_6,Selected,G
1,64,First-generation college,1657,HOUSING_DIFF_5,Selected,G
2,90,First-generation college,1657,HOUSING_DIFF_2,Selected,G
3,95,First-generation college,1657,HOUSING_DIFF_4,Selected,G
4,142,First-generation college,1657,HOUSING_DIFF_3,Selected,G
5,201,First-generation college,1657,HOUSING_DIFF_1,Selected,G
6,115,First-generation college,1657,HOUSING_DIFF_6,Selected,U
7,342,First-generation college,1657,HOUSING_DIFF_5,Selected,U
8,466,First-generation college,1657,HOUSING_DIFF_2,Selected,U
9,659,First-generation college,1657,HOUSING_DIFF_4,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,19,First-generation college,1651,HOUSING_DIFF_6,Selected,G
1,63,First-generation college,1651,HOUSING_DIFF_5,Selected,G
2,88,First-generation college,1651,HOUSING_DIFF_2,Selected,G
3,93,First-generation college,1651,HOUSING_DIFF_4,Selected,G
4,138,First-generation college,1651,HOUSING_DIFF_3,Selected,G
5,197,First-generation college,1651,HOUSING_DIFF_1,Selected,G
6,115,First-generation college,1651,HOUSING_DIFF_6,Selected,U
7,342,First-generation college,1651,HOUSING_DIFF_5,Selected,U
8,466,First-generation college,1651,HOUSING_DIFF_2,Selected,U
9,659,First-generation college,1651,HOUSING_DIFF_4,Selected,U




DEMOGRAPHIC VALUE: Person Gender Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,1,Decline to State,222,HOUSING_DIFF_6,Selected,G
1,4,Decline to State,222,HOUSING_DIFF_2,Selected,G
2,5,Decline to State,222,HOUSING_DIFF_4,Selected,G
3,6,Decline to State,222,HOUSING_DIFF_5,Selected,G
4,8,Decline to State,222,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
61,493,Woman,3178,HOUSING_DIFF_5,Selected,U
62,749,Woman,3178,HOUSING_DIFF_2,Selected,U
63,1009,Woman,3178,HOUSING_DIFF_4,Selected,U
64,1363,Woman,3178,HOUSING_DIFF_3,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,1,Decline to State,222,HOUSING_DIFF_6,Selected,G
1,4,Decline to State,222,HOUSING_DIFF_2,Selected,G
2,5,Decline to State,222,HOUSING_DIFF_4,Selected,G
3,6,Decline to State,222,HOUSING_DIFF_5,Selected,G
4,8,Decline to State,222,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
58,493,Woman,3178,HOUSING_DIFF_5,Selected,U
59,749,Woman,3178,HOUSING_DIFF_2,Selected,U
60,1009,Woman,3178,HOUSING_DIFF_4,Selected,U
61,1363,Woman,3178,HOUSING_DIFF_3,Selected,U




DEMOGRAPHIC VALUE: Reporting College
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,2,Berkeley School of Education,26,HOUSING_DIFF_6,Selected,G
1,10,Berkeley School of Education,26,HOUSING_DIFF_2,Selected,G
2,10,Berkeley School of Education,26,HOUSING_DIFF_5,Selected,G
3,11,Berkeley School of Education,26,HOUSING_DIFF_4,Selected,G
4,15,Berkeley School of Education,26,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
175,11,Walter A. Haas School of Business,220,HOUSING_DIFF_5,Selected,U
176,28,Walter A. Haas School of Business,220,HOUSING_DIFF_2,Selected,U
177,41,Walter A. Haas School of Business,220,HOUSING_DIFF_4,Selected,U
178,47,Walter A. Haas School of Business,220,HOUSING_DIFF_3,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,2,Berkeley School of Education,26,HOUSING_DIFF_6,Selected,G
1,10,Berkeley School of Education,26,HOUSING_DIFF_2,Selected,G
2,10,Berkeley School of Education,26,HOUSING_DIFF_5,Selected,G
3,11,Berkeley School of Education,26,HOUSING_DIFF_4,Selected,G
4,15,Berkeley School of Education,26,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
127,11,Walter A. Haas School of Business,220,HOUSING_DIFF_5,Selected,U
128,28,Walter A. Haas School of Business,220,HOUSING_DIFF_2,Selected,U
129,41,Walter A. Haas School of Business,220,HOUSING_DIFF_4,Selected,U
130,47,Walter A. Haas School of Business,220,HOUSING_DIFF_3,Selected,U




DEMOGRAPHIC VALUE: Multiple Ethnicities
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,10,African American / Black,250,HOUSING_DIFF_6,Selected,G
1,22,African American / Black,250,HOUSING_DIFF_2,Selected,G
2,31,African American / Black,250,HOUSING_DIFF_5,Selected,G
3,35,African American / Black,250,HOUSING_DIFF_4,Selected,G
4,43,African American / Black,250,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
103,195,White / Caucasian,1591,HOUSING_DIFF_5,Selected,U
104,318,White / Caucasian,1591,HOUSING_DIFF_2,Selected,U
105,488,White / Caucasian,1591,HOUSING_DIFF_4,Selected,U
106,665,White / Caucasian,1591,HOUSING_DIFF_3,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,10,African American / Black,250,HOUSING_DIFF_6,Selected,G
1,22,African American / Black,250,HOUSING_DIFF_2,Selected,G
2,31,African American / Black,250,HOUSING_DIFF_5,Selected,G
3,35,African American / Black,250,HOUSING_DIFF_4,Selected,G
4,43,African American / Black,250,HOUSING_DIFF_3,Selected,G
...,...,...,...,...,...,...
102,195,White / Caucasian,1591,HOUSING_DIFF_5,Selected,U
103,318,White / Caucasian,1591,HOUSING_DIFF_2,Selected,U
104,488,White / Caucasian,1591,HOUSING_DIFF_4,Selected,U
105,665,White / Caucasian,1591,HOUSING_DIFF_3,Selected,U






In [21]:
qstem = MULTI_SELECT[0]
demo = 'Derived Residency Desc'

####
#### DATA SOURCE COUNTS DF

In [22]:
DATA_SOURCE.head(3)

Unnamed: 0,Question Stem Id,Question Item Id,Demographic Category,Demographic Value,Undergrad Grad,Question Response,Count,Question Item,Question Stem,Demographic Value Total,"Demographic Value Total, by Undergrad Grad",Question Stem Total,Question Item Total
0,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,Cooperative (Co-op) housing,70,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299
1,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,I have not yet secured housing for Fall 2022,20,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299
2,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,Other (please describe),107,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299


In [23]:
ds_counts = DATA_SOURCE[DATA_SOURCE['Question Stem Id']== qstem]
ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total', 'Undergrad Grad', 'Count', 'Question Response', 'Question Item Id']]
ds_counts = ds_counts.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
ds_counts

Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,36,CA Resident,3354,HOUSING_DIFF_6,Selected,G
1,150,CA Resident,3354,HOUSING_DIFF_5,Selected,G
2,154,CA Resident,3354,HOUSING_DIFF_2,Selected,G
3,236,CA Resident,3354,HOUSING_DIFF_4,Selected,G
4,273,CA Resident,3354,HOUSING_DIFF_3,Selected,G
5,414,CA Resident,3354,HOUSING_DIFF_1,Selected,G
6,263,CA Resident,3354,HOUSING_DIFF_6,Selected,U
7,591,CA Resident,3354,HOUSING_DIFF_5,Selected,U
8,896,CA Resident,3354,HOUSING_DIFF_2,Selected,U
9,1322,CA Resident,3354,HOUSING_DIFF_4,Selected,U


####
#### RAW SURVEY COUNTS DF

In [24]:
stems = [stem for stem in RAW_SURVEY.columns if qstem in stem]
raw_final = pd.DataFrame() 
for stem in stems: 
    # uncomment line below if double counting
    # RAW_SURVEY = RAW_SURVEY.explode(# insert double counting demographic value)
    RAW_SURVEY['ID DUPLICATE'] = RAW_SURVEY[stem]
    raw_piv = pd.pivot_table(RAW_SURVEY, values=stem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')

    raw_piv = raw_piv.reset_index().rename(columns={'Ungrad Grad Cd': 'Undergrad Grad', demo: 'Demographic Value', stem: 'Count', 'ID DUPLICATE': 'Question Response'})

    # make demographic value total col: respondents who answered at least one item of the stem
    select = stems + [demo]
    selected = RAW_SURVEY[select]
    selected = selected.dropna(subset=stems, thresh = 1)
    demoval_total = selected[demo].value_counts().to_dict()
    raw_piv['Demographic Value Total'] = raw_piv['Demographic Value'].map(demoval_total)

    #replace low counts with -1
    #raw_piv['Count'] = raw_piv['Count'].apply(lambda x: -1 if x < 11 else x)
    
    raw_piv['Question Item Id'] = [stem] * len(raw_piv) 
    raw_piv = raw_piv.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
    raw_final = pd.concat([raw_final, raw_piv], ignore_index=True) 
    
raw_final

Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Item Id,Question Response,Undergrad Grad
0,414,CA Resident,3354,HOUSING_DIFF_1,Selected,G
1,2668,CA Resident,3354,HOUSING_DIFF_1,Selected,U
2,834,International,1499,HOUSING_DIFF_1,Selected,G
3,519,International,1499,HOUSING_DIFF_1,Selected,U
4,430,Out of State Domestic,917,HOUSING_DIFF_1,Selected,G
5,417,Out of State Domestic,917,HOUSING_DIFF_1,Selected,U
6,154,CA Resident,3354,HOUSING_DIFF_2,Selected,G
7,896,CA Resident,3354,HOUSING_DIFF_2,Selected,U
8,331,International,1499,HOUSING_DIFF_2,Selected,G
9,175,International,1499,HOUSING_DIFF_2,Selected,U
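
The `Demographic Value Total` column above counts a respondent once if they answered at least one item of the multi-select stem. A minimal sketch of that rule on invented data follows; the item columns `Q_1` and `Q_2` and every row are hypothetical stand-ins, not values from the survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for RAW_SURVEY; all values are invented for illustration.
toy = pd.DataFrame({
    'Derived Residency Desc': ['CA Resident', 'CA Resident', 'International', 'International'],
    'Q_1': ['Selected', np.nan, 'Selected', np.nan],
    'Q_2': [np.nan, 'Selected', 'Selected', np.nan],
})
items = ['Q_1', 'Q_2']

# Keep respondents with at least one non-missing item (thresh=1),
# then count the remaining respondents per demographic value.
answered = toy.dropna(subset=items, thresh=1)
demoval_total = answered['Derived Residency Desc'].value_counts().to_dict()
print(demoval_total)
```

The last toy respondent skipped every item, so they are excluded from the International total even though they appear in the raw file.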


In [25]:
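# compare against raw_final (all question items), not raw_piv (last item only);
# note that equals() also requires identical row order, so a sort-order difference yields False
ds_counts.astype(str).equals(raw_final.astype(str))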

False
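
`DataFrame.equals` is sensitive to row order, index, and dtypes, so a `False` here does not by itself mean the counts disagree. One possible alternative is an order-insensitive comparison that merges the two tables on their key columns and lists only the rows that differ. The helper name `diff_counts` and the toy frames below are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def diff_counts(ds, raw, keys, value_col='Count'):
    """Return rows whose value differs between ds and raw, or that exist in only one of them."""
    merged = ds.merge(raw, on=keys, how='outer', suffixes=('_ds', '_raw'), indicator=True)
    mismatch = (merged['_merge'] != 'both') | (merged[value_col + '_ds'] != merged[value_col + '_raw'])
    return merged[mismatch].reset_index(drop=True)

keys = ['Demographic Value', 'Undergrad Grad', 'Question Item Id', 'Question Response']

# Invented rows: same content, different row order.
ds_toy = pd.DataFrame({
    'Demographic Value': ['CA Resident', 'CA Resident'],
    'Undergrad Grad': ['G', 'G'],
    'Question Item Id': ['Q_6', 'Q_5'],
    'Question Response': ['Selected', 'Selected'],
    'Count': [36, 150],
})
raw_toy = ds_toy.iloc[::-1].reset_index(drop=True)

print(ds_toy.equals(raw_toy))                    # row order differs
print(diff_counts(ds_toy, raw_toy, keys).empty)  # no real disagreement
```

An empty result means every key combination appears in both tables with the same count; any returned row shows the data source value and the raw value side by side.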

##
## 4. Check the Count and Demographic Value Total, by Undergrad Grad columns for one non-double-counting demographic and one double-counting demographic for at least 2 multi-select questions 
Preferably questions that haven’t been checked

In [26]:
# completed function (one demographic category) 
def check_count_ug_onedemo(qstem, demo, double_count_demo = False): 
    total_col = 'Demographic Value Total, by Undergrad Grad'
    sort_cols = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']

    # finding data source values #
    ds_counts = DATA_SOURCE[DATA_SOURCE['Question Stem Id'] == qstem]
    ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', total_col, 'Undergrad Grad', 'Count', 'Question Response', 'Question Item Id']]
    ds_counts = ds_counts.sort_index(axis=1).sort_values(by = sort_cols).reset_index(drop=True)
    
    # finding raw survey values #
    stems = [stem for stem in RAW_SURVEY.columns if qstem in stem]
    # work on a separate frame so the global RAW_SURVEY is never modified;
    # explode double-counting demographics so each student counts once per category
    raw = RAW_SURVEY.explode(demo) if double_count_demo else RAW_SURVEY.copy()

    # make demographic value total by ug col: respondents who answered at least one item of the stem
    # (computed from `raw` so double-counting demographics are already exploded)
    totals = raw.dropna(subset=stems, thresh = 1)[[demo, 'Undergrad Grad']].value_counts()
    # reset_index(name=...) names the count column explicitly, whatever pandas calls it by default
    totals = totals.reset_index(name=total_col).rename(columns={demo: 'Demographic Value'})

    pieces = []
    for stem in stems: 
        raw['ID DUPLICATE'] = raw[stem]
        raw_piv = pd.pivot_table(raw, values=stem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')
        raw_piv = raw_piv.reset_index().rename(columns={demo: 'Demographic Value', stem: 'Count', 'ID DUPLICATE': 'Question Response'})
        raw_piv = totals.merge(raw_piv, 'right', on=['Demographic Value', 'Undergrad Grad'])
        raw_piv['Question Item Id'] = stem
        pieces.append(raw_piv.sort_index(axis=1))
    raw_final = pd.concat(pieces, ignore_index=True).sort_values(by = sort_cols).reset_index(drop=True)

    print('DATA SOURCE: ')
    display(ds_counts)
    print("\n")
    print('RAW SURVEY: ')
    display(raw_final)
    

# completed function (demographic values for ONE QUESTION ITEM) 
def check_count_ug_alldemo(qstem, demo_vals): 
    for demo in demo_vals:
        print('DEMOGRAPHIC VALUE:', demo) 
        if demo in ['Reporting College', 'Multiple Ethnicities']:
            check_count_ug_onedemo(qstem, demo, double_count_demo = True)
        else:
            check_count_ug_onedemo(qstem, demo) 
        print("\n")
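
The `double_count_demo` flag relies on `DataFrame.explode` to count a student once in each of their categories. A minimal sketch of that effect is below, assuming the double-counting column holds Python lists (the raw file may store it differently and need splitting first). The rows and the item column `Q_1` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for RAW_SURVEY; the first student belongs to two colleges.
toy = pd.DataFrame({
    'Undergrad Grad': ['U', 'U', 'G'],
    'Reporting College': [['L&S', 'CNR'], ['L&S'], ['CNR']],
    'Q_1': ['Selected', 'Selected', 'Selected'],
})

# explode() emits one row per list element, repeating the other columns.
exploded = toy.explode('Reporting College')
counts = exploded.groupby(['Reporting College', 'Undergrad Grad'])['Q_1'].count()

print(counts)
print('respondents:', len(toy), 'summed category counts:', counts.sum())
```

Three respondents produce four category counts, which is why totals for Reporting College and Multiple Ethnicities add up to more than the Question Stem Total.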

In [27]:
check_count_ug_onedemo(qstem, demo)

DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Item Id,Question Response,Undergrad Grad
0,36,CA Resident,428,HOUSING_DIFF_6,Selected,G
1,150,CA Resident,428,HOUSING_DIFF_5,Selected,G
2,154,CA Resident,428,HOUSING_DIFF_2,Selected,G
3,236,CA Resident,428,HOUSING_DIFF_4,Selected,G
4,273,CA Resident,428,HOUSING_DIFF_3,Selected,G
5,414,CA Resident,428,HOUSING_DIFF_1,Selected,G
6,263,CA Resident,2926,HOUSING_DIFF_6,Selected,U
7,591,CA Resident,2926,HOUSING_DIFF_5,Selected,U
8,896,CA Resident,2926,HOUSING_DIFF_2,Selected,U
9,1322,CA Resident,2926,HOUSING_DIFF_4,Selected,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Item Id,Question Response,Undergrad Grad
0,36,CA Resident,428,HOUSING_DIFF_6,Selected,G
1,150,CA Resident,428,HOUSING_DIFF_5,Selected,G
2,154,CA Resident,428,HOUSING_DIFF_2,Selected,G
3,236,CA Resident,428,HOUSING_DIFF_4,Selected,G
4,273,CA Resident,428,HOUSING_DIFF_3,Selected,G
5,414,CA Resident,428,HOUSING_DIFF_1,Selected,G
6,263,CA Resident,2926,HOUSING_DIFF_6,Selected,U
7,591,CA Resident,2926,HOUSING_DIFF_5,Selected,U
8,896,CA Resident,2926,HOUSING_DIFF_2,Selected,U
9,1322,CA Resident,2926,HOUSING_DIFF_4,Selected,U


In [28]:
qstem = MULTI_SELECT[1]
demo = 'Person Gender Desc'#np.random.choice(demo)


####
#### DATA SOURCE COUNTS DF BY UG

In [29]:
ds_counts = DATA_SOURCE[DATA_SOURCE['Question Stem Id']== qstem]
ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total, by Undergrad Grad', 'Undergrad Grad', 'Count', 'Question Response', 'Question Item Id']]
ds_counts = ds_counts.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
ds_counts

Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Item Id,Question Response,Undergrad Grad
0,0,Decline to State,25,SERVICES_8,Selected,G
1,1,Decline to State,25,SERVICES_6,Selected,G
2,2,Decline to State,25,SERVICES_4,Selected,G
3,2,Decline to State,25,SERVICES_12,Selected,G
4,4,Decline to State,25,SERVICES_2,Selected,G
...,...,...,...,...,...,...
127,2050,Woman,5480,SERVICES_7,Selected,U
128,2349,Woman,5480,SERVICES_6,Selected,U
129,3076,Woman,5480,SERVICES_3,Selected,U
130,3172,Woman,5480,SERVICES_1,Selected,U


####
#### RAW SURVEY COUNTS DF BY UG

In [30]:
stems = [stem for stem in RAW_SURVEY.columns if qstem in stem]
raw_final = pd.DataFrame() 
for stem in stems: 
    # uncomment line below if double counting
    # RAW_SURVEY = RAW_SURVEY.explode(# insert double counting demographic value)
    RAW_SURVEY['ID DUPLICATE'] = RAW_SURVEY[stem]
    raw_piv = pd.pivot_table(RAW_SURVEY, values=stem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')

    raw_piv = raw_piv.reset_index().rename(columns={'Ungrad Grad Cd': 'Undergrad Grad', demo: 'Demographic Value', stem: 'Count', 'ID DUPLICATE': 'Question Response'})

    #make demographic value total col 
    select = [stem for stem in RAW_SURVEY.columns if qstem in stem] + [demo, 'Undergrad Grad']
    selected = RAW_SURVEY[select]
    selected = selected.dropna(subset=stems, thresh = 1)
    # reset_index(name=...) names the count column explicitly, whatever pandas calls it by default
    selected = selected[[demo, 'Undergrad Grad']].value_counts().reset_index(name='Demographic Value Total, by Undergrad Grad').rename(columns={demo: 'Demographic Value'})
    raw_piv = selected.merge(raw_piv, 'right', on=['Demographic Value', 'Undergrad Grad'])
    
    
    raw_piv['Question Item Id'] = [stem] * len(raw_piv) 
    raw_piv = raw_piv.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
    raw_final = pd.concat([raw_final, raw_piv], ignore_index=True).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
    
raw_final

Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Item Id,Question Response,Undergrad Grad
0,1,Decline to State,25,SERVICES_TIME_4,1-2 weeks,G
1,1,Decline to State,25,SERVICES_TIME_11,1-2 weeks,G
2,1,Decline to State,25,SERVICES_TIME_2,1-3 days,G
3,1,Decline to State,25,SERVICES_TIME_7,1-3 days,G
4,1,Decline to State,25,SERVICES_TIME_11,1-3 days,G
...,...,...,...,...,...,...
998,2349,Woman,5480,SERVICES_6,Selected,U
999,2491,Woman,5480,SERVICES_TIME_5,Less than 1 day,U
1000,3076,Woman,5480,SERVICES_3,Selected,U
1001,3172,Woman,5480,SERVICES_1,Selected,U
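
The raw table above has 1002 rows against 131 in the data source because `qstem in stem` is a substring test: it picks up the `SERVICES_TIME_*` columns along with the `SERVICES_*` items. A possible stricter match is sketched below, assuming item columns are named `<stem id>_<number>` and that `qstem` is `SERVICES`, as the item ids in the data source table suggest. The helper name `item_columns` and the column list are hypothetical.

```python
import re

def item_columns(columns, qstem):
    """Return only the columns named exactly '<qstem>_<digits>'."""
    pattern = re.compile(rf'^{re.escape(qstem)}_\d+$')
    return [col for col in columns if pattern.match(col)]

# Invented column names following the pattern seen in the outputs above.
cols = ['SERVICES_1', 'SERVICES_2', 'SERVICES_TIME_1', 'SERVICES_TIME_2', 'HOUSING_DIFF_1']

print([col for col in cols if 'SERVICES' in col])  # substring match
print(item_columns(cols, 'SERVICES'))              # exact stem match
```

With the stricter match, the follow-up `SERVICES_TIME` questions would need to be validated under their own stem id rather than being mixed into the `SERVICES` comparison.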


##
## 5. Check that each Question Stem Id matches its Question Stem/Item & Question Response

In [31]:
def check_qstem_qitem(): 
    for qstem in MULTI_SELECT: 
        # rows whose Question Item Id contains this stem id (case-insensitive)
        rows = DATA_SOURCE[DATA_SOURCE['Question Item Id'].str.contains(qstem, case=False)]
        qstem_str = rows['Question Stem'].unique()
        qitem_str = rows['Question Item'].unique()

        print('########', qstem, '########')
        print('QUESTION STEM:', qstem_str)
        print("\n")
        print('QUESTION ITEM:', qitem_str)
        print("\n")
    
check_qstem_qitem()

######## HOUSING_DIFF ########
QUESTION STEM: ['You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Select all that apply. -  High cost of housing'
 'You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Select all that apply. -  Available housing was not safe'
 'You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Select all that apply. -  Available housing was too far from campus'
 'You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Select all that apply. -  Available housing was in poor condition'
 'You said that it was difficult to search for housing for Fall 222. How would you characterize the difficulty you had searching for housing? Se

In [324]:
qstem_str = DATA_SOURCE[DATA_SOURCE['Question Item Id'].str.contains(qstem, case=False)]['Question Stem'].unique()
qitem_str = DATA_SOURCE[DATA_SOURCE['Question Item Id'].str.contains(qstem, case=False)]['Question Item'].unique()[0]

# make sure there is only one question stem for each question stem id 
if len(qstem_str) != 1: 
    print('!!!! ERROR: MULTIPLE QUESTION STEMS FOR ONE QUESTION STEM ID !!!!')
    # ex: the question item is not properly separated from stem 
    # ex: 'During this academic year (since the beginning of the Fall 21 semester), have you consulted with an academic advisor in your major or college? -  Help/reception desk, in-person'
    # instead of: 'During this academic year (since the beginning of the Fall 21 semester), have you consulted with an academic advisor in your major or college?' 

print(qstem_str)
print(qitem_str)

['During this academic year (since the beginning of the Fall 21 semester), have you consulted with an academic advisor in your major or college?']
nan