## **PULSE SURVEY 12 VALIDATION** 

1. Check the Question Stem Total column for at least 3 single-select questions
2. Check the Count and Demographic Value Total columns for each demographic for at least 2 single-select questions
    - Note that Reporting College and Multiple Ethnicities are double-counting demographics: each student with multiple majors / ethnicities is counted once in each unique category. So a student in both L&S and CNR is counted as 2 responses, one from L&S and one from CNR. As a result, the Demographic Value Totals for these demographics add up to more than their Question Stem Totals
3. Check the Count and "Demographic Value Total, by Undergrad Grad" columns for one non-double-counting demographic and one double-counting demographic for at least 2 single-select questions (preferably questions that haven’t been checked)
4. Check that each Question Stem Id matches its Question Stem/Item & Question Response
    - Use the Pulse Survey Content documents for this (you must download the .docx files to view them)
    - While you’re doing this, make sure the text looks correct
5. Repeat the same checks for multi-select questions

### **Demographic Categories**
**Double counting**
- Reporting College
- Multiple Ethnicities

**Non-double-counting**
- Undergrad Grad
- Derived Residency Desc
- Entry Status Desc
- Ucb Level1 Ethnic Rollup Desc
- Ucb Level2 Ethnic Rollup Desc
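
A minimal sketch of how double counting inflates Demographic Value Totals, using a made-up three-student frame (the `toy` data and its values are illustrative, not survey data). `DataFrame.explode` turns each list entry into its own row, so a student in two colleges contributes two responses:

```python
import pandas as pd

# Hypothetical toy data: three students, one of whom has two colleges.
toy = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2', 'R_3'],
    'Reporting College': [['L&S', 'CNR'], ['L&S'], ['CNR']],
    'HOUSING_NEW': ['Yes', 'No', 'Yes'],
})

# Question Stem Total: one response per student who answered.
question_stem_total = toy['HOUSING_NEW'].count()

# Demographic Value Totals: one response per (student, college) pair.
exploded = toy.explode('Reporting College')
demographic_value_totals = exploded.groupby('Reporting College')['HOUSING_NEW'].count()

print(question_stem_total)                  # 3
print(demographic_value_totals.to_dict())   # {'CNR': 2, 'L&S': 2}
print(demographic_value_totals.sum())       # 4, larger than the stem total
```

The gap between 4 and 3 comes entirely from the one student listed under both colleges.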


In [2]:
from sklearn.pipeline import Pipeline, FeatureUnion
import pandas as pd
import numpy as np
from IPython.display import display
import os 
os.chdir('/Users/roselee/VCUEDataTeam/Pulse Survey Data Source Generation/')

In [45]:
%run cleaning_transformers.ipynb

In [46]:
%run multiselect_counter_transformers.ipynb

In [47]:
DATA_SOURCE = pd.read_csv('12_ps_data_source.csv')
RAW_SURVEY = pd.read_csv('pulse_survey_12_raw_data.csv')



## 1. Clean raw data

In [48]:
# data cleaning variables
COLUMNS_TO_REMOVE = ['RecordedDate'] ## may need to add:'PHQ2SCORE', 'GAD2SCORE', 'PHQ2', 'GAD2'
UNGRAD_GRAD_COL = 'UNGRADGRADCD' ## may need to replace
RESIDENCY_COL = 'RESIDENCY' ## may need to replace
ENTRY_STATUS_COL = 'ENTRYSTATUSDESC' ## may need to replace
ETH_LEVEL1_COL = 'LEVEL1ETH' ## may need to replace
ETH_LEVEL2_COL = 'LEVEL2ETH' ## may need to replace
VALUES_TO_NULLIFY = [-99, '-99', -1, '-1', -999, '-999', 'Not selected'] ## may need to replace

############# OPTIONAL: use ONLY if Reporting College cols look like a stem id #############
# rename reporting college columns so they are not treated as a question
RAW_SURVEY = RAW_SURVEY.rename(columns={'REPORTCOLLEGE1':'Reporting College - First Plan',
                                        'REPORTCOLLEGE2':'Reporting College - Second Plan',
                                        'REPORTCOLLEGE3':'Reporting College - Third Plan'})
############################################################################################
COLLEGE_COLS = RAW_SURVEY.columns[RAW_SURVEY.columns.str.contains('Reporting College')]
MULTI_ETH_COLS = ['African American / Black',
                  'Asian / Asian American',
                  'Hispanic / Latinx',
                  'International',
                  'American Indian / Alaska Native',
                  'Pacific Islander',
                  'Southwest Asian / North African',
                  'White / Caucasian',
                  'No Response']

# counting variables
QUESTION_DESC = RAW_SURVEY.loc[[0]] 
DATA = RAW_SURVEY[1:] 
DEMOGRAPHIC_COLUMNS = ['Undergrad Grad',
                       'Derived Residency Desc',
                       'Entry Status Desc',
                       'Ucb Level1 Ethnic Rollup Desc',
                       'Ucb Level2 Ethnic Rollup Desc',
                       'Low-income Status',
                       'First Gen College',
                       'Person Gender Desc',
                       'Reporting College',
                       'Multiple Ethnicities']

cleaning_pipeline = Pipeline([
    # drop null responses, remove duplicates and columns, make all missing/irrelevant values nan
    ('null rows remover', RemoveNullRowsTransformer()),
    ('values nullifier', ReplaceValuesTransformer(values_to_replace=VALUES_TO_NULLIFY)),
    ('duplicates remover', RemoveFirstDuplicateTransformer()),
    ('irrelevant columns remover', RemoveColumnsTransformer(columns_to_remove=COLUMNS_TO_REMOVE)),
    # rename column names
    ('undergrad grad col renamer', RenameColumnTransformer(UNGRAD_GRAD_COL, 'Undergrad Grad')),
    ('residency col renamer', RenameColumnTransformer(RESIDENCY_COL, 'Derived Residency Desc')),
    ('entry status col renamer', RenameColumnTransformer(ENTRY_STATUS_COL, 'Entry Status Desc')),
    ('ethnic lvl1 col renamer', RenameColumnTransformer(ETH_LEVEL1_COL, 'Ucb Level1 Ethnic Rollup Desc')),
    ('ethnic lvl2 col renamer', RenameColumnTransformer(ETH_LEVEL2_COL, 'Ucb Level2 Ethnic Rollup Desc')),
    # rename dataframe values
    ('undergrad value renamer', RelabelColumnTransformer(column_to_relabel='Undergrad Grad', new_label='U')),
    ('grad value renamer', RelabelColumnTransformer(column_to_relabel='Undergrad Grad', new_label='G')),
    ('first-year entry value renamer', RelabelColumnTransformer(column_to_relabel='Entry Status Desc', new_label='First-year')),
    # replace ADVANCED STANDING with NaN for all grad students
    ('advanced standing grad nullifier', ReplaceStringWithNaNTransformer(standing_col='Entry Status Desc')),
    # create columns for double counting demographics & mental health scores
    ('reporting clg col generator', UniqueStringListTransformer(columns_to_list=COLLEGE_COLS, unique_col_list='Reporting College')),
    ('multiple eth col generator', UniqueStringListTransformer(columns_to_list=MULTI_ETH_COLS, unique_col_list='Multiple Ethnicities')),
    ('depression col generator', AddColumnsTransformer(column_1='MHLTH1', column_2='MHLTH2', new_column='PHQ2', binary_column='DEPRESSION')),
    ('anxiety col generator', AddColumnsTransformer(column_1='MHLTH3', column_2='MHLTH4', new_column='GAD2', binary_column='ANXIETY'))
])

In [49]:
RAW_SURVEY = cleaning_pipeline.fit_transform(DATA)

In [50]:
DATA_SOURCE['Count'] = pd.to_numeric(DATA_SOURCE['Count'], downcast="integer")
DATA_SOURCE.head(2)

Unnamed: 0,Question Stem Id,Question Item Id,Demographic Category,Demographic Value,Undergrad Grad,Question Response,Count,Question Item,Question Stem,Demographic Value Total,"Demographic Value Total, by Undergrad Grad",Question Stem Total,Question Item Total
0,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,Cooperative (Co-op) housing,70,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299
1,HOUSING_RESIDE,HOUSING_RESIDE,Undergrad Grad,G,G,I have not yet secured housing for Fall 2022,20,,Which of the following best describes your housing for Fall 2022?,4152,4152,15299,15299


In [51]:
RAW_SURVEY.head(2)

Unnamed: 0,ResponseId,Duration,EDUCNONEXAMLEVEL,EDUCNONEXAMLEVELCD,UGENTRYSTATUS,REGSTATUSDESC,GENDER,SHORTETHNICDESC,TYPE,Undergrad Grad,LowSocioEconomicStatusFlg,NeitherParent4yrClgDegFlg,ACADPLANNM1,ACADPLANNM2,ACADPLANNM3,CNR,CHE,COE,CED,CLS,BUS,GSE,GSJ,SPP,SOI,LAW,OPT,SPH,SSW,HOUSING_RESIDE,HOUSING_SAT_1,HOUSING_SAT_2,HOUSING_SAT_3,HOUSING_SAT_4,HOUSING_SAT_5,HOUSING_NEW,HOUSING_SEARCH,HOUSING_DIFF_1,HOUSING_DIFF_2,HOUSING_DIFF_3,HOUSING_DIFF_4,HOUSING_DIFF_5,HOUSING_DIFF_6,COURSES_REQU,COURSES_ELEC,COURSES_TTD,SERVICES_1,SERVICES_2,SERVICES_3,SERVICES_4,SERVICES_5,SERVICES_6,SERVICES_7,SERVICES_8,SERVICES_9,SERVICES_10,SERVICES_11,SERVICES_12,SERVICES_TIME_1,SERVICES_TIME_2,SERVICES_TIME_3,SERVICES_TIME_4,SERVICES_TIME_5,SERVICES_TIME_6,SERVICES_TIME_7,SERVICES_TIME_8,SERVICES_TIME_9,SERVICES_TIME_10,SERVICES_TIME_11,SERVICES_TIME_12,SERVICES_SAT_1,SERVICES_SAT_2,SERVICES_SAT_3,SERVICES_SAT_4,SERVICES_SAT_5,SERVICES_SAT_6,SERVICES_SAT_7,SERVICES_SAT_8,SERVICES_SAT_9,SERVICES_SAT_10,SERVICES_SAT_11,SERVICES_SAT_12,WIFI_SAT,WIFI_KEYSUSE,WIFI_KEYSSAT,CAPS_1,CAPS_2,CAPS_3,CAPS_4,RESOURCES,RESOURCES_MORE_1,RESOURCES_MORE_2,RESOURCES_MORE_3,RESOURCES_MORE_4,RESOURCES_MORE_5,RESOURCES_MORE_6,RESOURCES_MORE_7,RESOURCES_MORE_8,MHLTH1,MHLTH2,MHLTH3,MHLTH4,PHQ2SCORE,GAD2SCORE,PHQ2,GAD2,Semester Year Name Concat,African American / Black,Asian / Asian American,Hispanic / Latinx,International,American Indian / Alaska Native,Pacific Islander,Southwest Asian / North African,White / Caucasian,No Response,First Gen College,Person Gender Desc,Entry Status Desc,Derived Residency Desc,Ucb Level1 Ethnic Rollup Desc,Ucb Level2 Ethnic Rollup Desc,Reporting College - First Plan,Reporting College - Second Plan,Reporting College - Third Plan,Low-income Status,Reporting College,Multiple Ethnicities,DEPRESSION,ANXIETY
1,R_3PBs3Lj4wGkssfe,192.0,Doctoral not advanced to candidacy,Doctoral (not advanced to candidacy),,Continuing Student,Male,South Asian,Graduate Student,G,,,Chemical Engineering PhD,,,,,,,,,,,,,,,,,"Renting an apartment, house or condominium (not UCB-owned)",Very satisfied,Somewhat satisfied,Somewhat satisfied,Very satisfied,Very satisfied,No,Somewhat difficult,Selected,Selected,Selected,Selected,Selected,,Not applicable,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Somewhat dissatisfied,No,,Very well,Very well,Very well,Well,Very confident,,,,,,,,,Not at all,Not at all,Not at all,Not at all,0.0,0.0,0.0,0.0,2022 Fall,,Asian / Asian American,,,,,,White / Caucasian,,Not first-generation college,Man,FIRST TIME IN PROGRAM,Out of State Domestic,Asian,Asian,College of Chemistry,,,,[College of Chemistry],"[Asian / Asian American, White / Caucasian]",NO,NO
2,R_3hzM0hhcUj8G6E2,171.0,Freshman,Freshman,New Freshman,New Student,Female,,Undergraduate,U,,,Letters & Sci Undeclared UG,,,,,,,,,,,,,,,,,University Residence Halls or other University-managed housing,Very satisfied,Somewhat satisfied,Somewhat satisfied,Very dissatisfied,Somewhat satisfied,Not applicable or was not here last semester,Somewhat easy,,,,,,,Yes,No,,Selected,,Selected,,Selected,Selected,Selected,,,,,,Very satisfied,,Less than 1 day,,Less than 1 day,Less than 1 day,Less than 1 day,,,,,,Somewhat satisfied,,Somewhat satisfied,,Very satisfied,Very satisfied,Very satisfied,,,,,,Very dissatisfied,Yes,Somewhat dissatisfied,Well,Well,Well,Poorly,Somewhat confident,,,,,,,,,Several days,Several days,Several days,Not at all,2.0,1.0,2.0,1.0,2022 Fall,,,,,,,,,No Response,Unknown,Woman,First-year,CA Resident,,,College of Letters and Science,,,Not low-income,[College of Letters and Science],[No Response],NO,NO


In [52]:
STEM_ID = DATA_SOURCE['Question Stem Id'].unique()
STEM_ID

array(['HOUSING_RESIDE', 'HOUSING_SAT_1', 'HOUSING_SAT_2',
       'HOUSING_SAT_3', 'HOUSING_SAT_4', 'HOUSING_SAT_5', 'HOUSING_NEW',
       'HOUSING_SEARCH', 'COURSES_REQU', 'COURSES_ELEC', 'COURSES_TTD',
       'SERVICES_TIME_1', 'SERVICES_TIME_2', 'SERVICES_TIME_3',
       'SERVICES_TIME_4', 'SERVICES_TIME_5', 'SERVICES_TIME_6',
       'SERVICES_TIME_7', 'SERVICES_TIME_8', 'SERVICES_TIME_9',
       'SERVICES_TIME_10', 'SERVICES_TIME_11', 'SERVICES_TIME_12',
       'SERVICES_SAT_1', 'SERVICES_SAT_2', 'SERVICES_SAT_3',
       'SERVICES_SAT_4', 'SERVICES_SAT_5', 'SERVICES_SAT_6',
       'SERVICES_SAT_7', 'SERVICES_SAT_8', 'SERVICES_SAT_9',
       'SERVICES_SAT_10', 'SERVICES_SAT_11', 'SERVICES_SAT_12',
       'WIFI_SAT', 'WIFI_KEYSUSE', 'WIFI_KEYSSAT', 'CAPS_1', 'CAPS_2',
       'CAPS_3', 'CAPS_4', 'RESOURCES', 'MHLTH1', 'MHLTH2', 'MHLTH3',
       'MHLTH4', 'PHQ2', 'DEPRESSION', 'GAD2', 'ANXIETY', 'HOUSING_DIFF',
       'SERVICES', 'RESOURCES_MORE'], dtype=object)

In [53]:
MULTI_SELECT = ['HOUSING_DIFF', 'SERVICES', 'RESOURCES_MORE']

SINGLE_SELECT = [stem_id for stem_id in STEM_ID if stem_id not in MULTI_SELECT]

SINGLE_DEMOS = ['Undergrad Grad', 'Derived Residency Desc', 
             'Entry Status Desc', 'Ucb Level1 Ethnic Rollup Desc',
             'Ucb Level2 Ethnic Rollup Desc', 'Low-income Status', 
             'First Gen College', 'Person Gender Desc']

DOUBLE_DEMOS = ['Multiple Ethnicities', 'Reporting College']

## 2. Check the Question Stem Total column for at least 3 single select questions

In [54]:
# completed function 
def check_qstem_total(qstems): 
    for qstem in qstems: 
        print('_____', qstem, '_____')
        # finding data source value for question stem total
        # (exact match, so that e.g. SERVICES_TIME_1 does not also select SERVICES_TIME_10)
        allstemtotal = DATA_SOURCE[DATA_SOURCE['Question Stem Id'] == qstem]
        stemtotal = allstemtotal[['Question Item Id', 'Question Stem Total']]
        stemtotal = stemtotal.loc[allstemtotal['Demographic Category'] == 'Undergrad Grad'].drop_duplicates(ignore_index=True)
        data_source_val = stemtotal['Question Stem Total'][0]

        #finding raw survey value for question stem total 
        raw_survey_val = RAW_SURVEY[qstem].count()

        print('DATA SOURCE:', data_source_val)
        print('RAW SURVEY:', raw_survey_val)
        print('Equal?:', data_source_val == raw_survey_val) 
        print("\n")

# check multiple stem totals 
qstems = SINGLE_SELECT[:8]
check_qstem_total(qstems)

_____ HOUSING_RESIDE _____
DATA SOURCE: 15299
RAW SURVEY: 15299
Equal?: True


_____ HOUSING_SAT_1 _____
DATA SOURCE: 15133
RAW SURVEY: 15133
Equal?: True


_____ HOUSING_SAT_2 _____
DATA SOURCE: 15080
RAW SURVEY: 15080
Equal?: True


_____ HOUSING_SAT_3 _____
DATA SOURCE: 15057
RAW SURVEY: 15057
Equal?: True


_____ HOUSING_SAT_4 _____
DATA SOURCE: 15065
RAW SURVEY: 15065
Equal?: True


_____ HOUSING_SAT_5 _____
DATA SOURCE: 15094
RAW SURVEY: 15094
Equal?: True


_____ HOUSING_NEW _____
DATA SOURCE: 15226
RAW SURVEY: 15226
Equal?: True


_____ HOUSING_SEARCH _____
DATA SOURCE: 11313
RAW SURVEY: 11313
Equal?: True
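
One caveat when looking up a stem: a substring match on `Question Stem Id` can pick up more than one stem, since `SERVICES_TIME_1` is also a prefix of `SERVICES_TIME_10` through `SERVICES_TIME_12`. A small sketch on made-up rows (the `toy_source` frame and its totals are hypothetical) shows the difference between `str.contains` and an exact comparison:

```python
import pandas as pd

# Hypothetical rows standing in for the data source.
toy_source = pd.DataFrame({
    'Question Stem Id': ['SERVICES_TIME_1', 'SERVICES_TIME_10', 'SERVICES_TIME_11'],
    'Question Stem Total': [100, 200, 300],
})

# Substring match: also selects SERVICES_TIME_10 and SERVICES_TIME_11.
substring_hits = toy_source[toy_source['Question Stem Id'].str.contains('SERVICES_TIME_1', case=False)]

# Exact match: selects only the requested stem.
exact_hits = toy_source[toy_source['Question Stem Id'] == 'SERVICES_TIME_1']

print(len(substring_hits))  # 3
print(len(exact_hits))      # 1
```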




### TESTING 

In [38]:
qstem = 'SERVICES_TIME_5'

#### DATA SOURCE STEM TOTAL 

In [16]:
allstemtotal = DATA_SOURCE[DATA_SOURCE['Question Stem Id'] == qstem]
allstemtotal.head(2)

Unnamed: 0,Question Stem Id,Question Item Id,Demographic Category,Demographic Value,Undergrad Grad,Question Response,Count,Question Item,Question Stem,Demographic Value Total,"Demographic Value Total, by Undergrad Grad",Question Stem Total,Question Item Total
136,SERVICES_TIME_5,SERVICES_TIME_5,Undergrad Grad,G,G,1-2 weeks,73,,"For each of the services that you used or tried to use this semester, how long did you have to wait? - Cal-1 Card",1866,1866,8221,8221
137,SERVICES_TIME_5,SERVICES_TIME_5,Undergrad Grad,G,G,1-3 days,380,,"For each of the services that you used or tried to use this semester, how long did you have to wait? - Cal-1 Card",1866,1866,8221,8221


In [17]:
stemtotal = allstemtotal[['Question Item Id', 'Question Stem Total']]
stemtotal = stemtotal.loc[DATA_SOURCE['Demographic Category'] == 'Undergrad Grad'].drop_duplicates(ignore_index=True)
stemtotal['Question Stem Total'][0]

8221

#### RAW SURVEY STEM TOTAL

In [18]:
RAW_SURVEY[qstem].count()

8221

In [19]:
# make sure above number is accurate and is not counting unnecessary values 
RAW_SURVEY[qstem].value_counts()

Less than 1 day                   5552
1-3 days                          1543
4-7 days                           412
More than 2 weeks                  253
1-2 weeks                          237
I haven't received service yet     224
Name: SERVICES_TIME_5, dtype: int64
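
The check above can also be written as assertions. The sketch below assumes a toy column (`toy_col` and its values are made up) and reuses the sentinel list from the cleaning step; `count()` skips NaN, so it should equal the sum of `value_counts()`, and no sentinel value should survive cleaning:

```python
import numpy as np
import pandas as pd

VALUES_TO_NULLIFY = [-99, '-99', -1, '-1', -999, '-999', 'Not selected']

# Hypothetical cleaned column with one missing response.
toy_col = pd.Series(['1-3 days', 'Less than 1 day', np.nan, '1-3 days'],
                    name='SERVICES_TIME_5')

counts = toy_col.value_counts()

# count() and value_counts() both ignore NaN, so the totals should agree.
assert toy_col.count() == counts.sum()
# None of the sentinel values should remain after cleaning.
assert not toy_col.isin(VALUES_TO_NULLIFY).any()

print(counts.to_dict())  # {'1-3 days': 2, 'Less than 1 day': 1}
```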

## 3. Check Count and Demographic Value Totals column for each demographic for at least 2 single select questions

In [21]:
# completed function (one demographic value) 
def check_count_onedemo(qitem, demo, double_count_demo = False): 
    # finding data source values #
    ds_counts = DATA_SOURCE[DATA_SOURCE['Question Item Id']== qitem]
    ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total', 'Undergrad Grad', 'Count', 'Question Response']]
    ds_counts = ds_counts.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
    
    
    # finding raw survey values #
    # work on a copy so the helper column is not added to the global RAW_SURVEY
    raw = RAW_SURVEY.explode(demo) if double_count_demo else RAW_SURVEY.copy()
    raw['ID DUPLICATE'] = raw[qitem]
    raw_piv = pd.pivot_table(raw, values=qitem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')
    raw_piv = raw_piv.reset_index().rename(columns={'Ungrad Grad Cd': 'Undergrad Grad', demo: 'Demographic Value', qitem: 'Count', 'ID DUPLICATE': 'Question Response'})

    # make demographic value total col: sum of counts within each demographic value
    raw_piv['Demographic Value Total'] = raw_piv.groupby('Demographic Value')['Count'].transform('sum')

    raw_piv = raw_piv.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)

    print('DATA SOURCE: ')
    display(ds_counts)
    print("\n")
    print('RAW SURVEY: ')
    display(raw_piv)


# completed function (all demographic values for ONE QUESTION ITEM) 
def check_count_alldemo(qitem, demo_vals): 
    for demo in demo_vals:
        print('DEMOGRAPHIC VALUE:', demo) 
        if demo in ['Reporting College', 'Multiple Ethnicities']:
            check_count_onedemo(qitem, demo, double_count_demo = True)
        else:
            check_count_onedemo(qitem, demo) 
        print("\n")
        
demo_cat = [#'Undergrad Grad',
            'Derived Residency Desc',
            'Entry Status Desc',
            'Ucb Level1 Ethnic Rollup Desc',
            'Ucb Level2 Ethnic Rollup Desc',
            'Low-income Status',
            'First Gen College',
            'Person Gender Desc',
            'Reporting College',
            'Multiple Ethnicities']

# if FALSE, check dataframes below by replacing the variables qitem and demo (typically because of cleaning/low counts) 

In [39]:
check_count_alldemo('SERVICES_TIME_2', demo_cat) 

DEMOGRAPHIC VALUE: Derived Residency Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,11,CA Resident,157,More than 2 weeks,G
1,15,CA Resident,157,1-2 weeks,G
2,18,CA Resident,157,I haven't received service yet,G
3,27,CA Resident,157,4-7 days,G
4,42,CA Resident,157,Less than 1 day,G
5,44,CA Resident,157,1-3 days,G
6,11,International,148,1-2 weeks,G
7,12,International,148,I haven't received service yet,G
8,16,International,148,More than 2 weeks,G
9,24,International,148,4-7 days,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,11,CA Resident,158,More than 2 weeks,G
1,15,CA Resident,158,1-2 weeks,G
2,18,CA Resident,158,I haven't received service yet,G
3,27,CA Resident,158,4-7 days,G
4,42,CA Resident,158,Less than 1 day,G
5,45,CA Resident,158,1-3 days,G
6,11,International,148,1-2 weeks,G
7,12,International,148,I haven't received service yet,G
8,16,International,148,More than 2 weeks,G
9,24,International,148,4-7 days,G




DEMOGRAPHIC VALUE: Entry Status Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,-1,DOCTORAL,-1,1-2 weeks,G
1,-1,DOCTORAL,-1,1-3 days,G
2,-1,DOCTORAL,-1,4-7 days,G
3,-1,DOCTORAL,-1,I haven't received service yet,G
4,-1,DOCTORAL,-1,Less than 1 day,G
5,-1,DOCTORAL,-1,More than 2 weeks,G
6,37,FIRST TIME IN PROGRAM,443,1-2 weeks,G
7,38,FIRST TIME IN PROGRAM,443,I haven't received service yet,G
8,40,FIRST TIME IN PROGRAM,443,More than 2 weeks,G
9,67,FIRST TIME IN PROGRAM,443,4-7 days,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,1,DOCTORAL,6,1-3 days,G
1,1,DOCTORAL,6,4-7 days,G
2,2,DOCTORAL,6,I haven't received service yet,G
3,2,DOCTORAL,6,Less than 1 day,G
4,37,FIRST TIME IN PROGRAM,445,1-2 weeks,G
5,38,FIRST TIME IN PROGRAM,445,I haven't received service yet,G
6,40,FIRST TIME IN PROGRAM,445,More than 2 weeks,G
7,67,FIRST TIME IN PROGRAM,445,4-7 days,G
8,112,FIRST TIME IN PROGRAM,445,Less than 1 day,G
9,150,FIRST TIME IN PROGRAM,445,1-3 days,G




DEMOGRAPHIC VALUE: Ucb Level1 Ethnic Rollup Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,7,Asian,76,4-7 days,G
1,7,Asian,76,More than 2 weeks,G
2,8,Asian,76,1-2 weeks,G
3,8,Asian,76,I haven't received service yet,G
4,15,Asian,76,Less than 1 day,G
5,31,Asian,76,1-3 days,G
6,11,International,148,1-2 weeks,G
7,12,International,148,I haven't received service yet,G
8,16,International,148,More than 2 weeks,G
9,24,International,148,4-7 days,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,7,Asian,76,4-7 days,G
1,7,Asian,76,More than 2 weeks,G
2,8,Asian,76,1-2 weeks,G
3,8,Asian,76,I haven't received service yet,G
4,15,Asian,76,Less than 1 day,G
5,31,Asian,76,1-3 days,G
6,11,International,148,1-2 weeks,G
7,12,International,148,I haven't received service yet,G
8,16,International,148,More than 2 weeks,G
9,24,International,148,4-7 days,G




DEMOGRAPHIC VALUE: Ucb Level2 Ethnic Rollup Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,3,African American,45,1-2 weeks,G
1,4,African American,45,More than 2 weeks,G
2,5,African American,45,4-7 days,G
3,5,African American,45,I haven't received service yet,G
4,13,African American,45,1-3 days,G
5,15,African American,45,Less than 1 day,G
6,7,Asian,76,4-7 days,G
7,7,Asian,76,More than 2 weeks,G
8,8,Asian,76,1-2 weeks,G
9,8,Asian,76,I haven't received service yet,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,3,African American,46,1-2 weeks,G
1,4,African American,46,More than 2 weeks,G
2,5,African American,46,4-7 days,G
3,5,African American,46,I haven't received service yet,G
4,14,African American,46,1-3 days,G
5,15,African American,46,Less than 1 day,G
6,7,Asian,76,4-7 days,G
7,7,Asian,76,More than 2 weeks,G
8,8,Asian,76,1-2 weeks,G
9,8,Asian,76,I haven't received service yet,G




DEMOGRAPHIC VALUE: Low-income Status
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,-1,Not low-income,-1,1-3 days,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,1,Not low-income,1,1-3 days,U




DEMOGRAPHIC VALUE: First Gen College
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,3,First-generation college,87,More than 2 weeks,G
1,9,First-generation college,87,1-2 weeks,G
2,12,First-generation college,87,4-7 days,G
3,13,First-generation college,87,I haven't received service yet,G
4,19,First-generation college,87,Less than 1 day,G
5,31,First-generation college,87,1-3 days,G
6,14,Not first-generation college,213,I haven't received service yet,G
7,19,Not first-generation college,213,1-2 weeks,G
8,22,Not first-generation college,213,More than 2 weeks,G
9,33,Not first-generation college,213,4-7 days,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,3,First-generation college,87,More than 2 weeks,G
1,9,First-generation college,87,1-2 weeks,G
2,12,First-generation college,87,4-7 days,G
3,13,First-generation college,87,I haven't received service yet,G
4,19,First-generation college,87,Less than 1 day,G
5,31,First-generation college,87,1-3 days,G
6,1,N,3,Less than 1 day,G
7,2,N,3,1-3 days,G
8,14,Not first-generation college,211,I haven't received service yet,G
9,19,Not first-generation college,211,1-2 weeks,G




DEMOGRAPHIC VALUE: Person Gender Desc
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,-1,Decline to State,-1,1-2 weeks,G
1,-1,Decline to State,-1,1-3 days,G
2,-1,Decline to State,-1,4-7 days,G
3,-1,Decline to State,-1,I haven't received service yet,G
4,-1,Decline to State,-1,Less than 1 day,G
5,-1,Decline to State,-1,More than 2 weeks,G
6,-1,Decline to State,-1,1-2 weeks,U
7,-1,Decline to State,-1,1-3 days,U
8,-1,Decline to State,-1,4-7 days,U
9,-1,Decline to State,-1,I haven't received service yet,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,1,Decline to State,5,1-3 days,G
1,1,Decline to State,5,I haven't received service yet,G
2,2,Decline to State,5,Less than 1 day,G
3,1,Decline to State,5,1-3 days,U
4,16,Man,211,I haven't received service yet,G
5,17,Man,211,1-2 weeks,G
6,19,Man,211,More than 2 weeks,G
7,28,Man,211,4-7 days,G
8,58,Man,211,Less than 1 day,G
9,73,Man,211,1-3 days,G




DEMOGRAPHIC VALUE: Reporting College
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,1,Berkeley School of Education,24,I haven't received service yet,G
1,2,Berkeley School of Education,24,1-2 weeks,G
2,5,Berkeley School of Education,24,1-3 days,G
3,5,Berkeley School of Education,24,4-7 days,G
4,5,Berkeley School of Education,24,More than 2 weeks,G
...,...,...,...,...,...
91,4,Walter A. Haas School of Business,60,1-2 weeks,G
92,4,Walter A. Haas School of Business,60,More than 2 weeks,G
93,9,Walter A. Haas School of Business,60,4-7 days,G
94,10,Walter A. Haas School of Business,60,Less than 1 day,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,1,Berkeley School of Education,24,I haven't received service yet,G
1,2,Berkeley School of Education,24,1-2 weeks,G
2,5,Berkeley School of Education,24,1-3 days,G
3,5,Berkeley School of Education,24,4-7 days,G
4,5,Berkeley School of Education,24,More than 2 weeks,G
...,...,...,...,...,...
74,4,Walter A. Haas School of Business,60,1-2 weeks,G
75,4,Walter A. Haas School of Business,60,More than 2 weeks,G
76,9,Walter A. Haas School of Business,60,4-7 days,G
77,10,Walter A. Haas School of Business,60,Less than 1 day,G




DEMOGRAPHIC VALUE: Multiple Ethnicities
DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,3,African American / Black,48,1-2 weeks,G
1,4,African American / Black,48,More than 2 weeks,G
2,5,African American / Black,48,I haven't received service yet,G
3,8,African American / Black,48,4-7 days,G
4,13,African American / Black,48,1-3 days,G
...,...,...,...,...,...
61,9,White / Caucasian,121,I haven't received service yet,G
62,18,White / Caucasian,121,More than 2 weeks,G
63,19,White / Caucasian,121,4-7 days,G
64,29,White / Caucasian,121,Less than 1 day,G




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,3,African American / Black,49,1-2 weeks,G
1,4,African American / Black,49,More than 2 weeks,G
2,5,African American / Black,49,I haven't received service yet,G
3,8,African American / Black,49,4-7 days,G
4,14,African American / Black,49,1-3 days,G
5,15,African American / Black,49,Less than 1 day,G
6,1,American Indian / Alaska Native,15,I haven't received service yet,G
7,2,American Indian / Alaska Native,15,Less than 1 day,G
8,3,American Indian / Alaska Native,15,1-2 weeks,G
9,3,American Indian / Alaska Native,15,4-7 days,G






### TESTING 

In [40]:
qitem = 'SERVICES_TIME_2'
demo = 'First Gen College'

#### DATA SOURCE COUNTS DF

In [26]:
ds_counts = DATA_SOURCE[DATA_SOURCE['Question Item Id']== qitem]
ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total', 'Undergrad Grad', 'Count', 'Question Response']]
ds_counts = ds_counts.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
ds_counts

Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,4,First-generation college,2271,I haven't received service yet,G
1,8,First-generation college,2271,More than 2 weeks,G
2,13,First-generation college,2271,1-2 weeks,G
3,15,First-generation college,2271,4-7 days,G
4,47,First-generation college,2271,1-3 days,G
5,137,First-generation college,2271,Less than 1 day,G
6,66,First-generation college,2271,1-2 weeks,U
7,68,First-generation college,2271,More than 2 weeks,U
8,74,First-generation college,2271,I haven't received service yet,U
9,115,First-generation college,2271,4-7 days,U


#### RAW SURVEY COUNTS DF

In [27]:
# uncomment line below if double counting
# RAW_SURVEY = RAW_SURVEY.explode(# insert double counting demographic value)
RAW_SURVEY['ID DUPLICATE'] = RAW_SURVEY[qitem]
raw_piv = pd.pivot_table(RAW_SURVEY, values=qitem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')

raw_piv = raw_piv.reset_index().rename(columns={'Ungrad Grad Cd': 'Undergrad Grad', demo: 'Demographic Value', qitem: 'Count', 'ID DUPLICATE': 'Question Response'})

# make demographic value total col: sum of counts within each demographic value
raw_piv['Demographic Value Total'] = raw_piv.groupby('Demographic Value')['Count'].transform('sum')

#replace low counts with -1
#raw_piv['Count'] = raw_piv['Count'].apply(lambda x: -1 if x < 11 else x)

raw_piv = raw_piv.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
raw_piv

Unnamed: 0,Count,Demographic Value,Demographic Value Total,Question Response,Undergrad Grad
0,4,First-generation college,2266,I haven't received service yet,G
1,8,First-generation college,2266,More than 2 weeks,G
2,13,First-generation college,2266,1-2 weeks,G
3,15,First-generation college,2266,4-7 days,G
4,46,First-generation college,2266,1-3 days,G
5,133,First-generation college,2266,Less than 1 day,G
6,66,First-generation college,2266,1-2 weeks,U
7,68,First-generation college,2266,More than 2 weeks,U
8,74,First-generation college,2266,I haven't received service yet,U
9,115,First-generation college,2266,4-7 days,U


In [28]:
ds_counts.astype(str).equals(raw_piv.astype(str))

False
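
When `equals` returns `False`, it helps to see which rows differ. One possible approach, sketched here on made-up frames (`ds_toy` and `raw_toy` are hypothetical stand-ins for `ds_counts` and `raw_piv`), is an outer merge with `indicator=True`, which labels rows present on only one side. The helper `suppress_low_counts` is an assumption based on the commented-out line above that replaces counts below 11 with -1; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for ds_counts and raw_piv.
ds_toy = pd.DataFrame({
    'Demographic Value': ['A', 'A', 'B'],
    'Question Response': ['1-3 days', '4-7 days', '1-3 days'],
    'Count': [47, 15, -1],
})
raw_toy = pd.DataFrame({
    'Demographic Value': ['A', 'A', 'B'],
    'Question Response': ['1-3 days', '4-7 days', '1-3 days'],
    'Count': [46, 15, 3],
})

def suppress_low_counts(df, threshold=11):
    """Replace counts below the threshold with -1 (assumed suppression rule)."""
    out = df.copy()
    out['Count'] = out['Count'].where(out['Count'] >= threshold, -1)
    return out

keys = ['Demographic Value', 'Question Response']
merged = ds_toy.merge(suppress_low_counts(raw_toy), on=keys, how='outer',
                      suffixes=('_ds', '_raw'), indicator=True)

# Keep rows that exist on only one side or whose counts disagree.
mismatches = merged[(merged['_merge'] != 'both') |
                    (merged['Count_ds'] != merged['Count_raw'])]
print(mismatches[keys + ['Count_ds', 'Count_raw']])
```

In this toy case the suppressed row for `B` matches on both sides, so only the `A` / `1-3 days` row (47 vs. 46) is reported.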

## 4. Check Count and Demographic Value Totals, by Undergrad Grad column for one non-double-counting demographic and one double-counting demographic for at least 2 single-select questions 
Preferably questions that haven’t been checked

In [42]:
# completed function (one demographic value) 
def check_count_ug_onedemo(qitem, demo, double_count_demo = False): 
    # finding data source values #
    ds_counts = DATA_SOURCE[DATA_SOURCE['Question Item Id']== qitem]
    ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total, by Undergrad Grad', 'Undergrad Grad', 'Count', 'Question Response']]
    ds_counts = ds_counts.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
    
    
    # finding raw survey values #
    # work on a copy so the helper column is not added to the global RAW_SURVEY
    raw = RAW_SURVEY.explode(demo) if double_count_demo else RAW_SURVEY.copy()
    raw['ID DUPLICATE'] = raw[qitem]
    raw_piv = pd.pivot_table(raw, values=qitem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')
    raw_piv = raw_piv.reset_index().rename(columns={'Ungrad Grad Cd': 'Undergrad Grad', demo: 'Demographic Value', qitem: 'Count', 'ID DUPLICATE': 'Question Response'})

    # make demographic value total by ug col: sum of counts within each (demographic value, undergrad/grad) pair
    raw_piv['Demographic Value Total, by Undergrad Grad'] = raw_piv.groupby(['Demographic Value', 'Undergrad Grad'])['Count'].transform('sum')

    # sort columns 
    raw_piv = raw_piv.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)

    print('DATA SOURCE: ')
    display(ds_counts)
    print("\n")
    print('RAW SURVEY: ')
    display(raw_piv)

# completed function (all demographic categories for ONE QUESTION ITEM) 
def check_count_ug_alldemo(qitem, demo_vals): 
    # demo_vals: list of demographic category column names
    for demo in demo_vals:
        print('_____', qitem, '_____')
        print('DEMOGRAPHIC CATEGORY:', demo) 
        # Reporting College and Multiple Ethnicities are the double counting demographics
        is_double = demo in ['Reporting College', 'Multiple Ethnicities']
        check_count_ug_onedemo(qitem, demo, double_count_demo=is_double)
        print("\n")

In [43]:
check_count_ug_onedemo(qitem, demo, double_count_demo = False)

DATA SOURCE: 


Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Response,Undergrad Grad
0,4,First-generation college,224,I haven't received service yet,G
1,8,First-generation college,224,More than 2 weeks,G
2,13,First-generation college,224,1-2 weeks,G
3,15,First-generation college,224,4-7 days,G
4,47,First-generation college,224,1-3 days,G
5,137,First-generation college,224,Less than 1 day,G
6,66,First-generation college,2047,1-2 weeks,U
7,68,First-generation college,2047,More than 2 weeks,U
8,74,First-generation college,2047,I haven't received service yet,U
9,115,First-generation college,2047,4-7 days,U




RAW SURVEY: 


Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Response,Undergrad Grad
0,4,First-generation college,222,I haven't received service yet,G
1,9,First-generation college,222,More than 2 weeks,G
2,14,First-generation college,222,1-2 weeks,G
3,16,First-generation college,222,4-7 days,G
4,46,First-generation college,222,1-3 days,G
5,133,First-generation college,222,Less than 1 day,G
6,66,First-generation college,2057,1-2 weeks,U
7,68,First-generation college,2057,More than 2 weeks,U
8,74,First-generation college,2057,I haven't received service yet,U
9,115,First-generation college,2057,4-7 days,U
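
For the double counting demographics, the raw side of this check relies on `explode`, which gives a student one row per category. Below is a minimal sketch of that idea on toy data, assuming the demographic column already holds lists after cleaning. The respondents are hypothetical, and `size()` is used here in place of the notebook's pivot table.

```python
import pandas as pd

# toy respondents (hypothetical); the college column already holds lists
raw = pd.DataFrame({
    'Undergrad Grad': ['U', 'U', 'G'],
    'Reporting College': [['L&S', 'CNR'], ['L&S'], ['CNR']],
    'SERVICES_TIME_7': ['1-3 days', '1-3 days', '4-7 days'],
})

# one row per (student, college): a double major now appears twice
exploded = raw.explode('Reporting College')

# count responses per college / level / answer
counts = (exploded
          .groupby(['Reporting College', 'Undergrad Grad', 'SERVICES_TIME_7'])
          .size()
          .reset_index(name='Count'))

# total per college and level, repeated on every row of that group
counts['Demographic Value Total, by Undergrad Grad'] = (
    counts.groupby(['Reporting College', 'Undergrad Grad'])['Count'].transform('sum'))
print(counts)
```

The counts sum to 4 even though there are only 3 respondents, which is the expected effect of double counting.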


### TESTING 

In [44]:
qitem = 'SERVICES_TIME_7'
demo = 'Reporting College'  # np.random.choice(demo)


####
#### DATA SOURCE COUNTS DF BY UG

In [32]:
ds_counts = DATA_SOURCE[DATA_SOURCE['Question Item Id']== qitem]
ds_counts = ds_counts[ds_counts['Demographic Category'] == demo][['Demographic Value', 'Demographic Value Total, by Undergrad Grad', 'Undergrad Grad', 'Count', 'Question Response']]
ds_counts = ds_counts.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
ds_counts

Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Response,Undergrad Grad
0,0,Berkeley School of Education,15,4-7 days,G
1,0,Berkeley School of Education,15,I haven't received service yet,G
2,0,Berkeley School of Education,15,More than 2 weeks,G
3,1,Berkeley School of Education,15,1-2 weeks,G
4,3,Berkeley School of Education,15,1-3 days,G
...,...,...,...,...,...
133,2,Walter A. Haas School of Business,58,1-2 weeks,U
134,3,Walter A. Haas School of Business,58,More than 2 weeks,U
135,4,Walter A. Haas School of Business,58,I haven't received service yet,U
136,11,Walter A. Haas School of Business,58,1-3 days,U


####
#### RAW SURVEY COUNTS DF BY UG

In [33]:
# explode into a separate frame so the global RAW_SURVEY stays un-exploded for later checks
raw = RAW_SURVEY.explode(demo)
raw['ID DUPLICATE'] = raw[qitem]
raw_piv = pd.pivot_table(raw, values=qitem, index=['Undergrad Grad', demo, 'ID DUPLICATE'], aggfunc='count')

raw_piv = raw_piv.reset_index().rename(columns={demo: 'Demographic Value', qitem: 'Count', 'ID DUPLICATE': 'Question Response'})

# make demographic value total by ug col 
demo_vals = raw_piv.groupby(['Demographic Value', 'Undergrad Grad'], as_index=False)['Count'].sum()
raw_piv = demo_vals.merge(raw_piv, 'right', on=['Demographic Value', 'Undergrad Grad'], suffixes=('_dvt by ug', '')).rename(columns={'Count_dvt by ug': 'Demographic Value Total, by Undergrad Grad'})

raw_piv = raw_piv.sort_index(axis=1).sort_values(by = ['Demographic Value', 'Undergrad Grad', 'Count', 'Question Response']).reset_index(drop=True)
raw_piv

Unnamed: 0,Count,Demographic Value,"Demographic Value Total, by Undergrad Grad",Question Response,Undergrad Grad
0,1,Berkeley School of Education,15,1-2 weeks,G
1,3,Berkeley School of Education,15,1-3 days,G
2,11,Berkeley School of Education,15,Less than 1 day,G
3,1,Berkeley School of Education,1,Less than 1 day,U
4,1,College of Chemistry,38,I haven't received service yet,G
...,...,...,...,...,...
107,2,Walter A. Haas School of Business,58,1-2 weeks,U
108,3,Walter A. Haas School of Business,58,More than 2 weeks,U
109,4,Walter A. Haas School of Business,58,I haven't received service yet,U
110,11,Walter A. Haas School of Business,58,1-3 days,U
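
The data source table above lists responses with a Count of 0, while the raw survey pivot only has rows for responses that actually occur. The two frames therefore differ in length (the displayed indexes end at 136 and 110). One possible way to make them comparable, sketched here on hypothetical toy frames, is to drop the zero-count rows before comparing. The helper `normalize` is an assumption, not part of this notebook.

```python
import pandas as pd

keys = ['Demographic Value', 'Undergrad Grad', 'Question Response']

# toy data source: includes a zero-count response (hypothetical values)
ds = pd.DataFrame({
    'Demographic Value': ['Education'] * 3,
    'Undergrad Grad': ['G'] * 3,
    'Question Response': ['4-7 days', '1-2 weeks', '1-3 days'],
    'Count': [0, 1, 3],
})
# toy raw pivot: responses nobody chose are simply absent
raw = pd.DataFrame({
    'Demographic Value': ['Education'] * 2,
    'Undergrad Grad': ['G'] * 2,
    'Question Response': ['1-2 weeks', '1-3 days'],
    'Count': [1, 3],
})

def normalize(df):
    # drop zero-count rows, then put rows and columns in a stable order
    out = df[df['Count'] > 0]
    return out.sort_index(axis=1).sort_values(by=keys).reset_index(drop=True)

same = normalize(ds).equals(normalize(raw))
print(same)
```

Filling the missing combinations in the raw pivot with 0 would be the alternative. Either way, the comparison only works if both frames follow the same rule.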


##
## 5. Check that each Question Stem Id matches its Question Stem/Item & Question Response

In [34]:
def check_qstem_qitem(): 
    STEM_IDS = DATA_SOURCE['Question Stem Id'].unique()
    for qstem in STEM_IDS: 
        # rows whose item id contains the stem id (literal, case-insensitive match; missing ids are skipped)
        rows = DATA_SOURCE[DATA_SOURCE['Question Item Id'].str.contains(qstem, case=False, regex=False, na=False)]
        qstem_str = rows['Question Stem'].unique()
        qitem_str = rows['Question Item'].unique()

        print('########', qstem, '########')
        print('QUESTION STEM:', qstem_str)
        print("\n")
        print('QUESTION ITEM:', qitem_str)
        print("\n")
    
check_qstem_qitem()

######## HOUSING_RESIDE ########
QUESTION STEM: ['Which of the following best describes your housing for Fall 2022?']


QUESTION ITEM: [nan]


######## HOUSING_SAT_1 ########
QUESTION STEM: ['For each of the following, how satisfied are you with your housing for Fall 2022? -  Distance from or commute to campus']


QUESTION ITEM: [nan]


######## HOUSING_SAT_2 ########
QUESTION STEM: ['For each of the following, how satisfied are you with your housing for Fall 2022? -  Safety']


QUESTION ITEM: [nan]


######## HOUSING_SAT_3 ########
QUESTION STEM: ['For each of the following, how satisfied are you with your housing for Fall 2022? -  Cost']


QUESTION ITEM: [nan]


######## HOUSING_SAT_4 ########
QUESTION STEM: ['For each of the following, how satisfied are you with your housing for Fall 2022? -  Condition of housing']


QUESTION ITEM: [nan]


######## HOUSING_SAT_5 ########
QUESTION STEM: ['For each of the following, how satisfied are you with your housing for Fall 2022? -  Overall sat

In [35]:
qstem_str = DATA_SOURCE[DATA_SOURCE['Question Item Id'].str.contains(qstem, case=False)]['Question Stem'].unique()
qitem_str = DATA_SOURCE[DATA_SOURCE['Question Item Id'].str.contains(qstem, case=False)]['Question Item'].unique()[0]

# make sure there is only one question stem for each question item 
if len(qstem_str) != 1: 
    print('!!!! ERROR: MULTIPLE QUESTION STEMS FOR ONE QUESTION STEM ID !!!!')
    # ex: the question item is not properly separated from stem 
    # ex: 'During this academic year (since the beginning of the Fall 21 semester), have you consulted with an academic advisor in your major or college? -  Help/reception desk, in-person'
    # instead of: 'During this academic year (since the beginning of the Fall 21 semester), have you consulted with an academic advisor in your major or college?' 

print(qstem_str)
print(qitem_str)

['For each of the services that you used or tried to use this semester, how long did you have to wait? -  Cal-1 Card']
nan
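
The one-stem-per-id rule can also be checked for every id at once instead of one `qstem` at a time. Below is a minimal sketch, assuming the data source has `Question Stem Id` and `Question Stem` columns as used above. The toy rows and the shortened stem text are hypothetical.

```python
import pandas as pd

# toy data source (hypothetical rows); the last stem still carries its item text
data_source = pd.DataFrame({
    'Question Stem Id': ['HOUSING_RESIDE', 'HOUSING_RESIDE', 'ADVISING', 'ADVISING'],
    'Question Stem': [
        'Which of the following best describes your housing?',
        'Which of the following best describes your housing?',
        'Have you consulted with an academic advisor?',
        'Have you consulted with an academic advisor? -  Help/reception desk',
    ],
})

# number of distinct stem texts recorded under each stem id
stems_per_id = data_source.groupby('Question Stem Id')['Question Stem'].nunique()

# ids that map to anything other than exactly one stem text need a manual look
bad_ids = stems_per_id[stems_per_id != 1].index.tolist()
print(bad_ids)
```

Grouping on the exact `Question Stem Id` avoids the substring matching of `str.contains`, which can pull in rows from ids that merely share a prefix.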