# Demographic Analysis for Aging extract

# Notebook description
This notebook performs a univariate demographic analysis of the Aging extract. It reports the mean and standard deviation for numerical variables, and value counts (n) and percentage counts (%) for categorical variables.

Statistical tests used:
1. Chi-squared test of independence  
For categorical variables  
The chi-squared approximation is only reliable when each category has at least five instances.  
To meet this requirement, categories with fewer than 5 instances are excluded from the analysis.
2. Mann-Whitney U test  
For numerical variables  
Non-parametric test  
The Mann-Whitney U test does not assume normality or homogeneity of variance.

Assumptions:
1. Alpha or significance level: 0.05
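
As a minimal sketch of how the two tests and the significance level fit together, the snippet below runs both on made-up numbers (the counts and samples are illustrative, not taken from the Aging extract). It also inspects the expected cell frequencies returned by `chi2_contingency`, since the at-least-five rule is usually stated in terms of expected counts.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

ALPHA = 0.05  # significance level stated above

# Hypothetical 2 x 3 contingency table: one row per population,
# one column per category (illustrative numbers only).
observed = np.array([[30, 25, 20],
                     [28, 27, 18]])
chi2_stat, chi2_p, dof, expected = chi2_contingency(observed)
print("expected counts all >= 5:", bool((expected >= 5).all()))
print("chi-squared: stat=%.2f, p=%.2f, dof=%d" % (chi2_stat, chi2_p, dof))

# Hypothetical numerical samples for the Mann-Whitney U test.
group_a = [5, 7, 9, 11, 8, 6]
group_b = [6, 7, 10, 12, 9, 5]
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print("Mann-Whitney U: stat=%.1f, p=%.2f" % (u_stat, u_p))

# Same decision rule for both tests: reject only when p < ALPHA.
for name, p in [("chi-squared", chi2_p), ("Mann-Whitney", u_p)]:
    verdict = "reject" if p < ALPHA else "do not reject"
    print(name + ": " + verdict + " the null hypothesis")
```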

# Input data

In [1]:
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pylab
from scipy import stats
from datetime import date
from statsmodels.iolib.table import SimpleTable
from scipy.stats import chi2_contingency
import textwrap

pd.set_option('display.max_columns', 700)

# Read data (only qualified data)

Filter only qualified data for the demographics analysis

In [2]:
data_nopop = pd.read_csv("gs://hpn_cte_advanced_stats_data_repository/yahui_data_repository/Aging 202312/AGING_DATA_EXTRACT__NEW_NQSD_20231214.csv")
population = pd.read_csv("gs://hpn_cte_advanced_stats_data_repository/yahui_data_repository/Aging 202312/population_updates_20231214.csv")



In [3]:
data = pd.merge(data_nopop, population, on=['PET_ID'], how='left')

In [4]:
data1 = data[data['IS_QUALIFIED_NEW']=='YES']
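
The merge-then-filter step can be checked on a toy example. This is a sketch under stated assumptions: the two small frames below are hypothetical stand-ins for the extract and the population file, and the `validate` and `indicator` arguments are optional safeguards that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical stand-ins for the extract and the population file.
extract = pd.DataFrame({"PET_ID": [1, 1, 2, 3],
                        "IS_QUALIFIED_NEW": ["YES", "NO", "YES", "YES"]})
population = pd.DataFrame({"PET_ID": [1, 2],
                           "wearables_itt": [True, False]})

# how="left" keeps every extract row; validate="many_to_one" raises if a
# PET_ID appears more than once in the population table.
merged = pd.merge(extract, population, on=["PET_ID"], how="left",
                  validate="many_to_one", indicator=True)

# Rows tagged "left_only" had no match in the population table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PET_ID"].tolist()
print(unmatched)

# Keep only qualified rows; .copy() makes the result an independent frame.
qualified = merged[merged["IS_QUALIFIED_NEW"] == "YES"].copy()
print(len(qualified))
```

Pets listed in `unmatched` would carry NaN in `wearables_itt` and `wearables_pp`, so they drop out of both subpopulation filters later on.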

In [5]:
data1.columns[-35:]

Index(['questionnaire_itt_extract', 'walking_900', 'running_900', 'active_900',
       'walking_300', 'running_300', 'active_300', 'walking_60', 'running_60',
       'active_60', 'longest_running_event', 'longest_walking_event',
       'active_periods_count_rolling', 'most_common_hour_rolling',
       'active_periods_count_fixed', 'most_common_hour_fixed', 'patient_id',
       'dog_and_owner_name', 'study_food', 'weight', 'mixed_breed', 'intact',
       'sex', 'date_of_birth', 'baseline_mean_soft_stiction',
       'treatment_mean_soft_stiction', 'IS_QUALIFIED_NEW',
       'Qualifying_Data_Day_NEW', 'NORMALIZED_QUALIFYING_STUDY_DAY_NEW',
       'wearables_itt', 'wearables_pp', 'wearables_itt_extract',
       'wearables_pp_extract', 'itt_failure_reason', 'pp_failure_reason'],
      dtype='object')

# Demographics mapper

In [6]:
demographics_variable_list = \
{
    "DEMOGRAPHICS-gender": 'Gender',
    "DEMOGRAPHICS-day0_age":'Day 0 age',
    "weight": 'Day 0 body weight',
    "sitename": 'Site',
    "MEDICATION - medication": 'Baseline period pain medications',
    "OWNER_DAILY_DAIRY - total_amount_time_walked": 'Owner Daily Diary',
    'IS_WEEKEND': 'weekend vs. weekday'
    
}

# Helper functions

## Statistics for numerical & categorical (together)

In [7]:
def statistics_numerical_categorical(data, attribute_name, numerical_col, categorical_col): 
    mean = data[numerical_col].mean()
    std = data[numerical_col].std()
    
    value_counts_abs = data[categorical_col].value_counts()
    value_counts_norm = data[categorical_col].value_counts(normalize=True)
    
    abs_val_counts_df = pd.DataFrame(value_counts_abs).T.reset_index(drop=True).rename(index={0: attribute_name}).round(2)
    abs_val_counts_df.columns = [x+' (n)' for x in abs_val_counts_df.columns]
    
    norm_val_counts_df = pd.DataFrame(value_counts_norm).T.reset_index(drop=True).rename(index={0: attribute_name}).round(2)
    norm_val_counts_df.columns = [x+' (%)' for x in norm_val_counts_df.columns]
    
    numerical_df = pd.DataFrame({"mean":mean,
                                 "std":std},
                                index=[attribute_name]).round(2)
    
    numerical_cat_merge_df = pd.merge(numerical_df,
                                        abs_val_counts_df,
                                        left_index=True,right_index=True)
    numerical_cat_merge_df = pd.merge(numerical_cat_merge_df,
                                        norm_val_counts_df,
                                        left_index=True,right_index=True)
    return numerical_cat_merge_df

def statistics_numerical_categorical_itt_pp(demographics_df, attribute_name, numerical_col, categorical_col):
    demog_by_subpop = []
    demog_by_subpop.append(statistics_numerical_categorical(demographics_df[demographics_df['wearables_itt'] == True], 
                                                   attribute_name+' (ITT Population)', numerical_col, categorical_col))
    demog_by_subpop.append(statistics_numerical_categorical(demographics_df[demographics_df['wearables_pp'] == True], 
                                                   attribute_name+' (PP Population)', numerical_col, categorical_col))
    demog_by_subpop.append(statistics_numerical_categorical(demographics_df, 
                                                attribute_name+' (overall population)', numerical_col, categorical_col))
    demog_op_df = pd.concat(demog_by_subpop)
    demog_op_df.index = pd.MultiIndex.from_tuples([(attribute_name, 'ITT Population'),
                                                     (attribute_name, 'PP Population'),
                                                      (attribute_name,'Overall')],
                                 names=['Demographic','Subpopulation'])
    return demog_op_df

### Unit tests

In [8]:
sample_data = pd.DataFrame({'Age':[1,2,7,7],
                            'Age.category':['Adult','Adult', 'Senior','Senior']})
statistics_numerical_categorical(sample_data, 'Age', 'Age', 'Age.category')

| | mean | std | Senior (n) | Adult (n) | Senior (%) | Adult (%) |
|---|---|---|---|---|---|---|
| Age | 4.25 | 3.2 | 2 | 2 | 0.5 | 0.5 |


In [9]:
sample_data = pd.DataFrame({'Age':[2,7,2,7,2,7,2,7,10,2],
                            'Age.category':['Adult', 'Senior','Adult','Senior','Adult', 'Senior','Adult','Senior','Senior','Adult'],
                           'wearables_itt':[True, True, True, False, False, False, False, True, True, True],
                           'wearables_pp':[True, True, True, False, False, True, False, True, True, True]})
statistics_numerical_categorical_itt_pp(sample_data, 'Age', 'Age', 'Age.category')

| Demographic | Subpopulation | mean | std | Senior (n) | Adult (n) | Senior (%) | Adult (%) |
|---|---|---|---|---|---|---|---|
| Age | ITT Population | 5.0 | 3.46 | 3 | 3 | 0.5 | 0.5 |
| Age | PP Population | 5.29 | 3.25 | 4 | 3 | 0.57 | 0.43 |
| Age | Overall | 4.8 | 3.08 | 5 | 5 | 0.5 | 0.5 |


## Statistics for categorical only

In [10]:
def statistics_for_categorical(data, attribute_name, categorical_col): 
    value_counts_abs = data[categorical_col].value_counts()
    value_counts_norm = data[categorical_col].value_counts(normalize=True)
    
    abs_val_counts_df = pd.DataFrame(value_counts_abs).T.reset_index(drop=True).rename(index={0: attribute_name}).round(2)
    abs_val_counts_df.columns = [x+' ' for x in abs_val_counts_df.columns]
    
    norm_val_counts_df = pd.DataFrame(value_counts_norm).T.reset_index(drop=True).rename(index={0: attribute_name}).round(2)
    norm_val_counts_df.columns = [x+' ' for x in norm_val_counts_df.columns] 
    
    abs_val_counts_df.index = [attribute_name+' (n)']
    norm_val_counts_df.index = [attribute_name+' (%)']
    
    return pd.merge(abs_val_counts_df.T,
                    norm_val_counts_df.T,
                    left_index=True,right_index=True)

def statistics_by_categorical_itt_pp(demographics_df, attribute_name, categorical_col):
    demog_by_subpop = []
    # ITT subpopulation
    itt_sub_pop = statistics_for_categorical(demographics_df[demographics_df['wearables_itt'] == True], 
                                             attribute_name, categorical_col)
    tuples_list = []
    for i in itt_sub_pop.index:
        tuples_list.append((attribute_name,'ITT'+' subpopulation',i))
    itt_sub_pop.index =  pd.MultiIndex.from_tuples(tuples_list,
                                 names=['Demographic','Subpopulation',''])
    demog_by_subpop.append(itt_sub_pop)
    
    # PP subpopulation
    pp_sub_pop = statistics_for_categorical(demographics_df[demographics_df['wearables_pp'] == True], 
                                           attribute_name, categorical_col)
    tuples_list = []
    for i in pp_sub_pop.index:
        tuples_list.append((attribute_name,'PP'+' subpopulation',i))
    pp_sub_pop.index =  pd.MultiIndex.from_tuples(tuples_list,
                                 names=['Demographic','Subpopulation',''])
    demog_by_subpop.append(pp_sub_pop)
    
    # Overall population metrics
    overall_pop = statistics_for_categorical(demographics_df, attribute_name, categorical_col)

    tuples_list = []
    for i in overall_pop.index:
        tuples_list.append((attribute_name,'Overall'+' subpopulation',i))
    overall_pop.index =  pd.MultiIndex.from_tuples(tuples_list,
                                 names=['Demographic','Subpopulation',''])
    demog_by_subpop.append(overall_pop)
    
    return pd.concat(demog_by_subpop)

### Unit Tests

In [11]:
statistics_for_categorical(sample_data, 'Age', 'Age.category')

| | Age (n) | Age (%) |
|---|---|---|
| Senior | 5 | 0.5 |
| Adult | 5 | 0.5 |


In [12]:
statistics_by_categorical_itt_pp(sample_data, 'Age', 'Age.category')

| Demographic | Subpopulation | | Age (n) | Age (%) |
|---|---|---|---|---|
| Age | ITT subpopulation | Senior | 3 | 0.5 |
| Age | ITT subpopulation | Adult | 3 | 0.5 |
| Age | PP subpopulation | Senior | 4 | 0.57 |
| Age | PP subpopulation | Adult | 3 | 0.43 |
| Age | Overall subpopulation | Senior | 5 | 0.5 |
| Age | Overall subpopulation | Adult | 5 | 0.5 |


## Statistical tests for numerical and categorical data

In [13]:
def chi_squared_report(p_value, test_statistic, outcome_str, alpha=0.05):
    # Decide on the unrounded p-value; values are rounded only for display
    conclusion_flag = {True: 'Reject the null hypothesis',
                       False: 'Do not reject the null hypothesis'}
    reject_null = bool(p_value < alpha)
    
    table_data = [
        [round(test_statistic, 2), round(p_value, 2), conclusion_flag[reject_null]]]
    table_headers = ["Statistic", 'P-value', 'Conclusion']
    table_stubs = ["Results"]
    results_table = SimpleTable(table_data, 
                                table_headers,
                                table_stubs,
                                title="Chi-squared test results")
    print(results_table)
    print("Outcome:", outcome_str)
    print('=' * 59)
    
def statistical_test_difference(demographics_df, attribute_name, categorical_col, alpha=0.05):

    print('ITT and PP Population')
    # Category counts for the ITT population
    itt_df = pd.DataFrame(demographics_df[demographics_df['wearables_itt'] == True][categorical_col].value_counts())
    itt_df.columns = ['ITT population']
    # Category counts for the PP population
    pp_df = pd.DataFrame(demographics_df[demographics_df['wearables_pp'] == True][categorical_col].value_counts())
    pp_df.columns = ['PP population']
        
    # Contingency table: one row per population, one column per category
    contingency_table_for_test = pd.concat([itt_df, pp_df], axis=1).fillna(0)
    contingency_table = np.array([contingency_table_for_test['ITT population'],
                                  contingency_table_for_test['PP population']])
        
    # Drop categories with fewer than 5 observations in either population
    has_low_count = np.any(contingency_table < 5, axis=0)
    contingency_wo_less_counts = contingency_table[:, ~has_low_count].astype(int)
        
    if contingency_wo_less_counts.size == 0:
        print("Can't perform the test as the categories have less than 5 counts (based on the subpopulation)")
        return
        
    # Perform the chi-square test of independence
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency_wo_less_counts)
    # Compare the unrounded p-value against the alpha passed by the caller
    if p_value < alpha:
        outcome_str = f"The p-value is less than {alpha}, indicating strong evidence to reject the null hypothesis. There is a significant difference between the ITT and PP population using " +categorical_col+' as the categorical variable.'
    else:
        outcome_str = f"The p-value is greater than or equal to {alpha}, suggesting that there is not enough evidence to reject the null hypothesis. There is no significant difference between the ITT and PP population using "+categorical_col+' as the categorical variable.'
    outcome_str = '\n'.join(textwrap.wrap(outcome_str, 54))
    # report results (rounding happens inside chi_squared_report)
    chi_squared_report(p_value, chi2_stat, outcome_str, alpha=alpha)
    print('\n')

def numerical_stat_test_report(demographics_df, numerical_col, alpha=0.05):
    # Two-sided Mann-Whitney U test comparing the ITT and PP populations
    itt_values = demographics_df[demographics_df['wearables_itt'] == True][numerical_col].dropna().values
    pp_values = demographics_df[demographics_df['wearables_pp'] == True][numerical_col].dropna().values
    stat, p_val = stats.mannwhitneyu(itt_values, pp_values, alternative='two-sided')

    print('=' * 67)
    print('Mann Whitney test for '+numerical_col)
    print('=' * 67)
    print('ITT vs PP:')
    print('-' * 67)
    print('Statistic: %.2f, p-value: %.2f' %(stat, p_val))
    print('-' * 67)
    print('Outcome:')
    if p_val < alpha:
        print('Reject the NULL hypothesis')
        print('Significant difference between the two groups based on '+numerical_col)
    else:
        print('Cannot reject the NULL hypothesis')
        print('No significant difference between the two groups based on '+numerical_col)


### Unit Tests

In [14]:
chi_squared_report(0.05, 21.4, 'No significant difference', alpha=0.05)

                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      21.4    0.05 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: No significant difference


In [15]:
sample_data = pd.DataFrame({'Age':[2,7,2,7,2,7,2,7,10,np.nan],
                            'Age.category':['Adult', 'Senior','Adult','Senior','Adult', 'Senior','Adult','Senior','Senior',np.nan],
                           'wearables_itt':[True, True, True, False, False, False, False, True, True, np.nan],
                           'wearables_pp':[True, True, True, False, False, True, False, True, True, np.nan]})
statistics_numerical_categorical_itt_pp(sample_data, 'Age', 'Age', 'Age.category')

| Demographic | Subpopulation | mean | std | Senior (n) | Adult (n) | Senior (%) | Adult (%) |
|---|---|---|---|---|---|---|---|
| Age | ITT Population | 5.6 | 3.51 | 3 | 2 | 0.6 | 0.4 |
| Age | PP Population | 5.83 | 3.19 | 4 | 2 | 0.67 | 0.33 |
| Age | Overall | 5.11 | 3.1 | 5 | 4 | 0.56 | 0.44 |


In [16]:
statistical_test_difference(sample_data, 'Age', 'Age.category', alpha=0.05)

ITT and PP Population
Can't perform the test as the categories have less than 5 counts (based on the subpopulation)


In [17]:
numerical_stat_test_report(sample_data, 'Age')

Mann Whitney test for Age
ITT vs PP:
-------------------------------------------------------------------
Statistic: 14.50, p-value: 1.00
-------------------------------------------------------------------
Outcome:
Cannot reject the NULL hypothesis
No significant difference between the two groups based on Age


In [18]:
stats.mannwhitneyu(sample_data[sample_data['wearables_itt']==True]['Age'].values,
                   sample_data[sample_data['wearables_pp']==True]['Age'].values)

MannwhitneyuResult(statistic=14.5, pvalue=1.0)

# Demographics dataset

Select the demographic columns and rename them to their display names.

In [19]:
demographics_study_cols = list(data1.columns[:20]) + ['wearables_itt', 'wearables_pp'] + list(demographics_variable_list.keys())
# rename() returns a new DataFrame, which avoids modifying a slice of data1 in place
demog_df = data1[demographics_study_cols].rename(columns=demographics_variable_list)


In [20]:
demog_df.columns[20:]

Index(['wearables_itt', 'wearables_pp', 'Gender', 'Day 0 age',
       'Day 0 body weight', 'Site', 'Baseline period pain medications',
       'Owner Daily Diary', 'weekend vs. weekday'],
      dtype='object')

## Age helper functions

In [21]:
def calculate_age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

def age_category_converter(age):
    if (age < 7):
        return "Adult"
    elif (age>=7 and age < 11):
        return "Senior younger than 11"
    elif (age>= 11):
        return "Senior older than 11"
    else:
        return np.nan


categories_less_than_55 = ['Adult (less than 55lbs)', 'Senior younger than 11 (less than 55lbs)', 'Senior older than 11 (less than 55lbs)']
categories_greater_than_55 = ['Adult', 'Senior younger than 7', 'Senior younger than 11 (7+)', 'Senior older than 11']

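# Note: comparisons with NaN are always False, so a missing weight takes the
# outer else branch and a missing age takes the final else of its branch.
def age_stratification(weight, age):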
    if (weight < 55):
        if (age < 7):
            return categories_less_than_55[0]
        elif (age>=7 and age < 11):
            return categories_less_than_55[1]
        else:
            return categories_less_than_55[2]
    else:
        if (age < 5):
            return categories_greater_than_55[0]
        elif (age < 7):
            return categories_greater_than_55[1]
        elif (age < 11):
            return categories_greater_than_55[2]
        else:
            return categories_greater_than_55[3]
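
Because every ordering comparison against NaN is False, a pet with a missing age or weight is still assigned a category by `age_stratification`. The sketch below shows that behaviour and one possible guarded variant. The name `age_stratification_safe` is hypothetical, and returning NaN for missing inputs is an assumption about the desired behaviour, not something the notebook states.

```python
import math

nan = float("nan")
# Ordering comparisons with NaN are always False.
print(nan < 55, nan < 11)

# Same labels as the notebook's two category lists.
LIGHT = ['Adult (less than 55lbs)',
         'Senior younger than 11 (less than 55lbs)',
         'Senior older than 11 (less than 55lbs)']
HEAVY = ['Adult', 'Senior younger than 7',
         'Senior younger than 11 (7+)', 'Senior older than 11']

def age_stratification_safe(weight, age):
    """Hypothetical guarded variant: missing inputs yield NaN."""
    if math.isnan(weight) or math.isnan(age):
        return nan
    if weight < 55:
        if age < 7:
            return LIGHT[0]
        return LIGHT[1] if age < 11 else LIGHT[2]
    if age < 5:
        return HEAVY[0]
    if age < 7:
        return HEAVY[1]
    return HEAVY[2] if age < 11 else HEAVY[3]

print(age_stratification_safe(60, 12))
print(age_stratification_safe(60, nan))
```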



## Age

In [22]:
demog_df['Day 0 age'].unique()

array([ 5., 11.,  6., nan, 10.,  2.,  8.,  7.,  9., 13., 12., 15.])

For each pet, take the first non-null value of each field.

In [23]:
demographics_age = demog_df.groupby(['PET_ID']).agg({'Day 0 age':'first',
                                                         'wearables_itt':'first',
                                                         'wearables_pp':'first'}).reset_index()

demographics_age['Age.category'] = [age_category_converter(x) for x in demographics_age['Day 0 age'].values]
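
A small sketch of the per-pet aggregation, using made-up rows: the `'first'` aggregation in a pandas groupby skips missing values, so each pet gets its first non-null age even when the earliest row is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: several records per pet, some with a missing age.
visits = pd.DataFrame({
    "PET_ID": [1, 1, 2, 2],
    "Day 0 age": [np.nan, 8.0, 5.0, np.nan],
    "wearables_itt": [True, True, False, False],
})

# 'first' returns the first non-null entry within each group.
per_pet = visits.groupby(["PET_ID"]).agg({"Day 0 age": "first",
                                          "wearables_itt": "first"}).reset_index()
print(per_pet)
```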

### Age table

In [24]:
display(statistics_numerical_categorical_itt_pp(demographics_age, 'Age', 'Day 0 age', 'Age.category'))

| Demographic | Subpopulation | mean | std | Senior younger than 11 (n) | Senior older than 11 (n) | Adult (n) | Senior younger than 11 (%) | Senior older than 11 (%) | Adult (%) |
|---|---|---|---|---|---|---|---|---|---|
| Age | ITT Population | 9.02 | 2.45 | 64 | 29 | 18 | 0.58 | 0.26 | 0.16 |
| Age | PP Population | 9.0 | 2.39 | 23 | 7 | 8 | 0.61 | 0.18 | 0.21 |
| Age | Overall | 9.0 | 2.45 | 65 | 29 | 18 | 0.58 | 0.26 | 0.16 |


### Age (numerical) Mann Whitney test

In [25]:
numerical_stat_test_report(demographics_age, 'Day 0 age')

Mann Whitney test for Day 0 age
ITT vs PP:
-------------------------------------------------------------------
Statistic: 2121.50, p-value: 0.96
-------------------------------------------------------------------
Outcome:
Cannot reject the NULL hypothesis
No significant difference between the two groups based on Day 0 age


In [26]:
stats.mannwhitneyu(demographics_age[demographics_age['wearables_itt']==True]['Day 0 age'].dropna().values,
                   demographics_age[demographics_age['wearables_pp']==True]['Day 0 age'].dropna().values)

MannwhitneyuResult(statistic=2121.5, pvalue=0.9579153530464198)

### Age statistical test

In [27]:
def get_value_counts_as_per_chi_2(demographics_df, categorical_col):
    # Keep categories with at least 5 instances (fewer than 5 are excluded)
    val_counts_greater_than_5 = demographics_df[categorical_col].value_counts()[\
                                demographics_df[categorical_col].value_counts()>=5].index
    demographics_for_chi_df = demographics_df[demographics_df[categorical_col].isin(val_counts_greater_than_5)]
    return demographics_for_chi_df
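
The notebook description excludes categories with fewer than 5 instances. The sketch below applies that rule to made-up data so the boundary case is visible: a category with exactly 5 rows is kept and one with 2 rows is dropped. The column name and values are illustrative only.

```python
import pandas as pd

# Hypothetical categorical column with counts 6, 5 and 2.
df = pd.DataFrame({"category": ["A"] * 6 + ["B"] * 5 + ["C"] * 2})

counts = df["category"].value_counts()
keep = counts[counts >= 5].index          # at least 5 instances
filtered = df[df["category"].isin(keep)]
print(filtered["category"].value_counts().to_dict())
```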

In [28]:
op = statistical_test_difference(get_value_counts_as_per_chi_2(demographics_age, 'Age.category'), 'Age', 'Age.category', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      1.11    0.57 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Age.category as the categorical variable.




## Age stratified by weight

In [29]:
demographics_age_byweight = demog_df.groupby(['PET_ID']).agg({'Day 0 age':'first',
                                                             'wearables_itt':'first',
                                                             'wearables_pp':'first',
                                                             'Day 0 body weight':'first'}).reset_index()

demographics_age_byweight['Age.byweight'] = demographics_age_byweight.apply(lambda row: age_stratification(row['Day 0 body weight'], 
                                                                                                           row['Day 0 age']), axis=1)

In [30]:
display(statistics_numerical_categorical_itt_pp(demographics_age_byweight, 'Age', 'Day 0 age', 'Age.byweight'))

| Demographic | Subpopulation | mean | std | Senior younger than 11 (7+) (n) | Senior older than 11 (n) | Senior younger than 11 (less than 55lbs) (n) | Senior older than 11 (less than 55lbs) (n) | Senior younger than 7 (n) | Adult (less than 55lbs) (n) | Adult (n) | Senior younger than 11 (7+) (%) | Senior older than 11 (%) | Senior younger than 11 (less than 55lbs) (%) | Senior older than 11 (less than 55lbs) (%) | Senior younger than 7 (%) | Adult (less than 55lbs) (%) | Adult (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | ITT Population | 9.02 | 2.45 | 38 | 29 | 26 | 24 | 14 | 3 | 1.0 | 0.28 | 0.21 | 0.19 | 0.18 | 0.1 | 0.02 | 0.01 |
| Age | PP Population | 9.0 | 2.39 | 10 | 6 | 13 | 5 | 7 | 1 | NaN | 0.24 | 0.14 | 0.31 | 0.12 | 0.17 | 0.02 | NaN |
| Age | Overall | 9.0 | 2.45 | 38 | 29 | 27 | 24 | 14 | 3 | 1.0 | 0.28 | 0.21 | 0.2 | 0.18 | 0.1 | 0.02 | 0.01 |


In [31]:
op = statistical_test_difference(get_value_counts_as_per_chi_2(demographics_age_byweight, 'Age.byweight'), 'Age', 'Age.byweight', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      4.78    0.31 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Age.byweight as the categorical variable.




## Gender

In [32]:
data1['DEMOGRAPHICS-gender'].value_counts()

Spayed Female    5496
Neutered Male    5020
Intact Male       222
Name: DEMOGRAPHICS-gender, dtype: int64

In [33]:
def get_gender_name(x):
    # 'Spayed Female' -> 'Female'; missing values (float NaN) stay NaN
    try:
        if isinstance(x,float):
            return np.nan
        elif isinstance(x,str):
            return x.split(' ')[1]
    except Exception as e:
        print(x)
        print(e)
def get_gender_status(x):
    # 'Spayed Female' -> 'Spayed'; missing values (float NaN) stay NaN
    try:
        if isinstance(x,float):
            return np.nan
        elif isinstance(x,str):
            return x.split(' ')[0]
    except Exception as e:
        print(x)
        print(e)

In [34]:
demographics_gender = demog_df.groupby(['PET_ID']).agg({'Gender':'first',
                                                             'wearables_itt':'first',
                                                             'wearables_pp':'first'}).reset_index()
demographics_gender['Gender'] = demographics_gender['Gender'].replace('0',np.nan)
demographics_gender['Gender.name'] = [get_gender_name(x) for x in demographics_gender['Gender'].values]
demographics_gender['Gender.sterilization'] = [get_gender_status(x) for x in demographics_gender['Gender'].values]

### Gender table

In [35]:
demographics_gender['Gender.name'].value_counts()

Female    70
Male      64
Name: Gender.name, dtype: int64

In [36]:
display(statistics_by_categorical_itt_pp(demographics_gender, 'Gender', 'Gender.name'))

| Demographic | Subpopulation | | Gender (n) | Gender (%) |
|---|---|---|---|---|
| Gender | ITT subpopulation | Female | 69 | 0.52 |
| Gender | ITT subpopulation | Male | 64 | 0.48 |
| Gender | PP subpopulation | Male | 21 | 0.51 |
| Gender | PP subpopulation | Female | 20 | 0.49 |
| Gender | Overall subpopulation | Female | 70 | 0.52 |
| Gender | Overall subpopulation | Male | 64 | 0.48 |


### Gender statistical test

In [37]:
statistical_test_difference(get_value_counts_as_per_chi_2(demographics_gender, 'Gender.name'), 'Gender', 'Gender.name', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      0.03    0.87 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Gender.name as the categorical variable.




### Gender sterilization table

In [38]:
statistics_by_categorical_itt_pp(demographics_gender, 'Gender', 'Gender.sterilization')

| Demographic | Subpopulation | | Gender (n) | Gender (%) |
|---|---|---|---|---|
| Gender | ITT subpopulation | Spayed | 69 | 0.52 |
| Gender | ITT subpopulation | Neutered | 61 | 0.46 |
| Gender | ITT subpopulation | Intact | 3 | 0.02 |
| Gender | PP subpopulation | Neutered | 21 | 0.51 |
| Gender | PP subpopulation | Spayed | 20 | 0.49 |
| Gender | Overall subpopulation | Spayed | 70 | 0.52 |
| Gender | Overall subpopulation | Neutered | 61 | 0.46 |
| Gender | Overall subpopulation | Intact | 3 | 0.02 |


### Gender sterilization statistical test

In [39]:
statistical_test_difference(get_value_counts_as_per_chi_2(demographics_gender, 'Gender.sterilization'), 'Gender', 'Gender.sterilization', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      0.09    0.76 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Gender.sterilization as the categorical variable.




## Body weight

In [40]:
def body_weight_category(x):
    # pd.isna also catches NaN values that are not the np.nan object itself
    if pd.isna(x):
        return np.nan
    elif x>0 and x<=20:
        return 'up to 20 lbs'
    elif x>20 and x<=50:
        return '>20 lbs & <=50 lbs'
    elif x>50:
        return '>50 lbs'
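
The same three weight bands can also be produced with `pd.cut`, shown here as an alternative sketch rather than the notebook's method. The sample weights are made up; the bin edges mirror the conditions in `body_weight_category`, and missing weights stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical weights in lbs, including boundary values and a missing one.
weights = pd.Series([15.0, 20.0, 20.9, 50.0, 50.6, np.nan])
labels = ['up to 20 lbs', '>20 lbs & <=50 lbs', '>50 lbs']

# Bins are right-inclusive by default: (0, 20], (20, 50], (50, inf)
categories = pd.cut(weights, bins=[0, 20, 50, np.inf], labels=labels)
print(categories.tolist())
```

Note that `pd.cut` returns a categorical column, so `value_counts()` on it also lists bands with zero members.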

In [41]:
demographics_body_weight = demog_df.groupby(['PET_ID']).agg({'Day 0 body weight':'first',
                                                                  'wearables_itt':'first',
                                                                  'wearables_pp':'first'}).reset_index()
demographics_body_weight['Day 0 body weight'].unique()

array([ 62.8,  63. ,  55. ,  65. ,  71. ,  52.8,  47. ,  63.8,  95.8,
         nan,  68.4, 108. ,  33. ,  45. ,  92. ,  48. ,  50. ,  61. ,
        47.4,  15. ,  57.2,  30.2,  73.4,  50.6,  59. ,  36.8,  25.4,
        43.2,  70.5,  25. ,  53.8,  97.3,  57.4,  52. ,  64.2,  63.6,
        35.6,  37.5,  53.9,  40. ,  75. ,  72.4,  36.4,  20.9,  87.6,
        58. ,  51.6,  40.7,  73.6,  71.5,  58.8,  47.5,  56. ,  72. ,
        61.5, 135.8,  29.6,  35.2,  88.5,  24.3,  46. ,  96. ,  61.8,
        24.8,  60.5,  68.8,  27. ,  95. ,  23. ,  67. ,  16. ,  58.5,
        85. ,  70. ,  28. ,  36. ,  25.5,  62.5,  89. ,  69.3,  75.8,
        60.4, 124.6,  51. ,  35. ,  60. ,  19. ,  32.5,  49.3,  74.5,
       115. ,  53.7,  17. ,  34.9,  68.7,  81.5])

In [42]:
demographics_body_weight['Day 0 body weight'] = demographics_body_weight['Day 0 body weight'].astype(float)
demographics_body_weight['Day 0 body weight (category)'] = [body_weight_category(x) for x in demographics_body_weight['Day 0 body weight'].values]

In [43]:
demographics_body_weight['Day 0 body weight (category)'].value_counts()

>50 lbs               72
>20 lbs & <=50 lbs    38
up to 20 lbs           8
Name: Day 0 body weight (category), dtype: int64

### Body weight table

In [44]:
statistics_by_categorical_itt_pp(demographics_body_weight, 'Day 0 body weight', 'Day 0 body weight (category)')

| Demographic | Subpopulation | | Day 0 body weight (n) | Day 0 body weight (%) |
|---|---|---|---|---|
| Day 0 body weight | ITT subpopulation | >50 lbs | 72 | 0.62 |
| Day 0 body weight | ITT subpopulation | >20 lbs & <=50 lbs | 37 | 0.32 |
| Day 0 body weight | ITT subpopulation | up to 20 lbs | 8 | 0.07 |
| Day 0 body weight | PP subpopulation | >50 lbs | 24 | 0.62 |
| Day 0 body weight | PP subpopulation | >20 lbs & <=50 lbs | 14 | 0.36 |
| Day 0 body weight | PP subpopulation | up to 20 lbs | 1 | 0.03 |
| Day 0 body weight | Overall subpopulation | >50 lbs | 72 | 0.61 |
| Day 0 body weight | Overall subpopulation | >20 lbs & <=50 lbs | 38 | 0.32 |
| Day 0 body weight | Overall subpopulation | up to 20 lbs | 8 | 0.07 |


### Body weight numerical statistical test

In [45]:
numerical_stat_test_report(demographics_body_weight, 'Day 0 body weight')

Mann Whitney test for Day 0 body weight
ITT vs PP:
-------------------------------------------------------------------
Statistic: 2296.50, p-value: 0.95
-------------------------------------------------------------------
Outcome:
Cannot reject the NULL hypothesis
No significant difference between the two groups based on Day 0 body weight


### Body weight statistical test

In [46]:
statistical_test_difference(get_value_counts_as_per_chi_2(demographics_body_weight, 'Day 0 body weight (category)'), 'Day 0 body weight', 'Day 0 body weight (category)', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      0.02     0.9 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using Day
0 body weight (category) as the categorical variable.




## Medication

In [47]:
demog_df['medication_split'] = demog_df['Baseline period pain medications'].str.split(',')

# Define a function to extract the value before the last '-' from each item and return as a joined string
def extract_medication_class(item):
    if isinstance(item, list):
        return ', '.join([x.strip().rsplit('-', 1)[0] for x in item])
    else:
        return ''

# Apply the function to create a new column 'medication.class'
demog_df['medication.class'] = demog_df['medication_split'].apply(extract_medication_class)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
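To make the parsing rule concrete, the sketch below applies the same split-and-`rsplit` logic to a few strings. The sample strings are illustrative only (the second one is invented), and `medication_class_from_string` is a hypothetical standalone helper, not a function defined in this notebook.

```python
def medication_class_from_string(raw):
    """Return the text before the last '-' of each comma-separated entry."""
    if not isinstance(raw, str):
        return ''  # mirrors the notebook's behaviour for missing values
    return ', '.join(item.strip().rsplit('-', 1)[0] for item in raw.split(','))

# Illustrative inputs (assumed shape: '<class>-<free text>' per entry)
samples = [
    'Other-Nexgard 24.1, Other-Interceptor 50',
    'Non-steroidal anti-inflammatory (NSAID)-Example 10',
    None,
]
classes = [medication_class_from_string(s) for s in samples]
print(classes)

# Caveat: an entry with no trailing '-<text>' is cut at its last hyphen
truncated = medication_class_from_string('Non-steroidal anti-inflammatory (NSAID)')
print(truncated)
```

Because `rsplit('-', 1)` always cuts at the last hyphen, the rule only returns the full class name when every entry carries a trailing `-<text>` suffix.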


In [48]:
def statistics_for_subpopulation(df, attribute_name, categorical_col, subpopulation_name):
    
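    # Filter rows whose medication class mentions 'Pain medication' or '(NSAID)'.
    # regex=False matches the text literally; as a regex, '(NSAID)' is a capture group and pandas warns about it.
    pain_medication_df = df[df[categorical_col].str.contains('Pain medication', case=False, regex=False)]
    nsaid_df = df[df[categorical_col].str.contains('(NSAID)', case=False, regex=False)]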

    # Count of 'Pain medications' and 'NSAID Medications'
    count_pain_medication = len(pain_medication_df)
    count_nsaid = len(nsaid_df)

    # Percentage using 'Pain medications' and 'NSAID Medications'
    total_records = len(df)
    percentage_pain_medication = (count_pain_medication / total_records) * 100
    percentage_nsaid = (count_nsaid / total_records) * 100

    # Creating DataFrames with the results
    result_df_pain_medication = pd.DataFrame({
        'Demographic': [attribute_name],
        'Subpopulation': [subpopulation_name],
        'Medication': ['Pain Medications'],
        'Medications(n)': [count_pain_medication],
        'Medications (%)': [percentage_pain_medication],
    })

    result_df_nsaid = pd.DataFrame({
        'Demographic': [attribute_name],
        'Subpopulation': [subpopulation_name],
        'Medication': ['NSAID Medications'],
        'Medications(n)': [count_nsaid],
        'Medications (%)': [percentage_nsaid],
    })

    return result_df_pain_medication, result_df_nsaid
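When a pattern containing parentheses such as `(NSAID)` is passed to `Series.str.contains` with the default `regex=True`, pandas treats the parentheses as a capture group and warns about it. Passing `regex=False` matches the literal text instead. A minimal sketch of the counting and percentage logic on made-up values (the series below is not taken from the study data):

```python
import pandas as pd

# Made-up medication-class strings, one per record
classes = pd.Series([
    'Other, Non-steroidal anti-inflammatory (NSAID), Other',
    'Pain medication',
    'Other',
    '',
])

# regex=False treats '(NSAID)' as literal text, parentheses included
is_nsaid = classes.str.contains('(NSAID)', case=False, regex=False)
is_pain = classes.str.contains('Pain medication', case=False, regex=False)

# Share of records in each group, expressed as a percentage of all records
nsaid_pct = is_nsaid.mean() * 100
pain_pct = is_pain.mean() * 100
print(int(is_nsaid.sum()), nsaid_pct, int(is_pain.sum()), pain_pct)
```

Note that the denominator is the number of records passed in, so the percentages depend on whether the frame holds one row per pet or several.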

In [49]:
# Build the summary dataset for medication class
demog_by_subpop = []
    
# ITT subpopulation
itt_sub_pop_df = demog_df[demog_df['wearables_itt'].eq(True)]
itt_sub_pop_pain, itt_sub_pop_nsaid = statistics_for_subpopulation(itt_sub_pop_df, 'Medication class', 'medication.class', 'ITT Population')
demog_by_subpop.append(itt_sub_pop_pain)
demog_by_subpop.append(itt_sub_pop_nsaid)

# PP subpopulation
pp_sub_pop_df = demog_df[demog_df['wearables_pp'].eq(True)]
pp_sub_pop_pain, pp_sub_pop_nsaid = statistics_for_subpopulation(pp_sub_pop_df, 'Medication class', 'medication.class', 'PP Population')
demog_by_subpop.append(pp_sub_pop_pain)
demog_by_subpop.append(pp_sub_pop_nsaid)

# Overall population metrics
overall_pop_pain, overall_pop_nsaid = statistics_for_subpopulation(demog_df, 'Medication class', 'medication.class', 'Overall Population')
demog_by_subpop.append(overall_pop_pain)
demog_by_subpop.append(overall_pop_nsaid)

result_df = pd.concat(demog_by_subpop, ignore_index=True)




In [50]:
# Set Multilevel Index for the medication class dataframe

rows_level1 = ['Medication class', 'Medication class', 'Medication class', 'Medication class', 'Medication class', 
               'Medication class'] 
rows_level2 = ['ITT Population', 'ITT Population', 'PP Population', 'PP Population', 'Overall Population', 'Overall Population']  
result_df.index = pd.MultiIndex.from_arrays([rows_level1, rows_level2], names=['Demographic', 'Subpopulation'])

# Create a boolean mask for the first occurrence of each 'Subpopulation' within each 'Demographic'
mask = ~result_df.duplicated(subset=['Demographic', 'Subpopulation'], keep='first')

# Replace duplicated 'Subpopulation' values with an empty string
result_df['Subpopulation'] = result_df['Subpopulation'].where(mask, '')

# Set the MultiIndex back
result_df.set_index(['Demographic', 'Subpopulation'], inplace=True)

result_df

Demographic,Subpopulation,Medication,Medications(n),Medications (%)
Medication class,ITT Population,Pain Medications,1,0.00923
Medication class,,NSAID Medications,4,0.036921
Medication class,PP Population,Pain Medications,0,0.0
Medication class,,NSAID Medications,2,0.041771
Medication class,Overall Population,Pain Medications,1,0.009216
Medication class,,NSAID Medications,4,0.036863


## Unit Test

In [51]:
demog_df[demog_df['wearables_itt'] == True]['medication.class'].value_counts()

                                                                                              10798
Other                                                                                             6
Other, Other, Other, Other, Other                                                                 4
Other, Other                                                                                      4
Other, Other, Other                                                                               3
Other, Other, Non-steroidal anti-inflammatory (NSAID), Fatty acid supplement, Other, Other        1
Other-Nexgard 24.1, Other-Interceptor 50                                                          1
Other-nexguard 60                                                                                 1
Other-Bravecto 22, Other-Heartgard plus 25                                                        1
Other, Other, Other, Non-steroidal anti-inflammatory (NSAID), Emollient, Other                    1


In [52]:
demog_df[demog_df['wearables_pp'] == True]['medication.class'].value_counts()

                                                                                              4774
Other, Other, Other                                                                              3
Other-Bravecto 22, Other-Heartgard plus 25                                                       1
Other, Other, Other, Other-heartgard 26, Other-nexgard 24                                        1
Other, Other, Other, Other-Sentinel 51                                                           1
Other, Other                                                                                     1
Other-bravecto 44                                                                                1
Other, Other, Other-heartgard plus 26                                                            1
Other-Nexgard 24.1, Other-Interceptor 50                                                         1
Other, Other, Non-steroidal anti-inflammatory (NSAID), Fatty acid supplement, Other, Other       1
Other-Hear

## Site

In [53]:
demographics_site = demog_df.groupby(['PET_ID']).agg({'Site':'first',
                                                               'wearables_itt':'first',
                                                               'wearables_pp':'first'}).reset_index().replace(float('nan'),np.nan)
demographics_site['Site'] = demographics_site['Site'].astype(str)

In [54]:
# Keep only sites that contribute more than one pet
site_counts = demographics_site['Site'].value_counts()
sites_with_multiple_pets = site_counts[site_counts > 1].index
demographics_site_greater_than_1 = demographics_site[demographics_site['Site'].isin(sites_with_multiple_pets)].reset_index(drop=True)

In [55]:
display(statistics_by_categorical_itt_pp(demographics_site_greater_than_1, 'Site', 'Site'))

Demographic,Subpopulation,,Site (n),Site (%)
Site,ITT subpopulation,Raintree,32,0.24
Site,ITT subpopulation,Quakertown Phase 2,20,0.15
Site,ITT subpopulation,Quakertown,12,0.09
Site,ITT subpopulation,Chippens Hill Phase 2,10,0.08
Site,ITT subpopulation,Braden River,8,0.06
Site,ITT subpopulation,Northgate,8,0.06
Site,ITT subpopulation,Blue Hills,6,0.05
Site,ITT subpopulation,Eastside Animal Hospital,6,0.05
Site,ITT subpopulation,Stanley Vet Clnic,5,0.04
Site,ITT subpopulation,Chippens Hill Veterinary Hospital,5,0.04


In [56]:
demographics_site_greater_than_1.head()

,PET_ID,Site,wearables_itt,wearables_pp
0,3347,Quakertown,True,True
1,3349,Wakarusa Veterinary Hospital,True,False
2,3350,Wakarusa Veterinary Hospital,True,True
3,3394,Chippens Hill Veterinary Hospital,True,False
4,3395,Chippens Hill Veterinary Hospital,True,False


In [57]:
statistical_test_difference(get_value_counts_as_per_chi_2(demographics_site_greater_than_1, 'Site'), 'Site', 'Site', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      2.23    0.33 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Site as the categorical variable.




## Owner Daily Diary time spent walking

In [58]:
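def diary_category_converter(df):
    """Add mean- and median-based category columns for 'Owner Daily Diary' to df (modifies df in place)."""
    # Calculate the mean and median of 'Owner Daily Diary' (missing values are skipped)
    mean_diary = df['Owner Daily Diary'].mean()
    median_diary = df['Owner Daily Diary'].median()

    # Create a new column 'Owner Daily Diary.mean.categorical'.
    # Values equal to the mean, and missing values, fall into 'Below Mean'
    # because `x > mean_diary` is False for both.
    df['Owner Daily Diary.mean.categorical'] = df['Owner Daily Diary'].apply(
        lambda x: 'Above Mean' if x > mean_diary else 'Below Mean'
    )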

    df['Owner Daily Diary.median.categorical'] = df['Owner Daily Diary'].apply(
        lambda x: 'Above Median' if x > median_diary else 'Below Median'
    )




In [59]:
demographics_owner_daily_diary = demog_df.groupby(['PET_ID']).agg({'Owner Daily Diary':'first',
                                                         'wearables_itt':'first',
                                                         'wearables_pp':'first'}).reset_index()

diary_category_converter(demographics_owner_daily_diary)

### Owner Daily Diary table

In [60]:
display(statistics_numerical_categorical_itt_pp(demographics_owner_daily_diary, 'Owner Daily Diary', 'Owner Daily Diary', 'Owner Daily Diary.mean.categorical'))

Demographic,Subpopulation,mean,std,Below Mean (n),Above Mean (n),Below Mean (%),Above Mean (%)
Owner Daily Diary,ITT Population,23.46,13.22,92,43,0.68,0.32
Owner Daily Diary,PP Population,25.88,14.36,21,21,0.5,0.5
Owner Daily Diary,Overall,23.46,13.22,93,43,0.68,0.32


In [61]:
display(statistics_numerical_categorical_itt_pp(demographics_owner_daily_diary, 'Owner Daily Diary', 'Owner Daily Diary', 'Owner Daily Diary.median.categorical'))

Demographic,Subpopulation,mean,std,Below Median (n),Above Median (n),Below Median (%),Above Median (%)
Owner Daily Diary,ITT Population,23.46,13.22,87,48,0.64,0.36
Owner Daily Diary,PP Population,25.88,14.36,20,22,0.48,0.52
Owner Daily Diary,Overall,23.46,13.22,88,48,0.65,0.35


### Owner Daily Diary (numerical) Mann Whitney test

In [62]:
numerical_stat_test_report(demographics_owner_daily_diary, 'Owner Daily Diary')

Mann Whitney test for Owner Daily Diary
ITT vs PP:
-------------------------------------------------------------------
Statistic: 1817.00, p-value: 0.38
-------------------------------------------------------------------
Outcome:
Cannot reject the NULL hypothesis
No significant difference between the two groups based on Owner Daily Diary


In [63]:
stats.mannwhitneyu(demographics_owner_daily_diary[demographics_owner_daily_diary['wearables_itt']==True]['Owner Daily Diary'].dropna().values,
                   demographics_owner_daily_diary[demographics_owner_daily_diary['wearables_pp']==True]['Owner Daily Diary'].dropna().values)

MannwhitneyuResult(statistic=1817.0, pvalue=0.37505027580231776)
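The direct call above relies on SciPy's default for the `alternative` argument, which differs between older and newer SciPy releases. A minimal sketch that states the alternative explicitly and applies the notebook's alpha of 0.05 is shown below. The two samples are synthetic (random draws sized like the ITT and PP groups), so the resulting numbers do not correspond to the study data.

```python
import numpy as np
from scipy import stats

# Synthetic samples; sizes and rough scale chosen only for illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=23.0, scale=13.0, size=135)
group_b = rng.normal(loc=26.0, scale=14.0, size=42)

# Two-sided test: H0 is that neither group tends to have larger values
result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

alpha = 0.05
decision = 'reject H0' if result.pvalue < alpha else 'cannot reject H0'
print(f'Statistic: {result.statistic:.2f}, p-value: {result.pvalue:.2f}, {decision}')
```

Stating `alternative='two-sided'` makes the reported p-value unambiguous regardless of the installed SciPy version.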

### Owner Daily Diary statistical test

In [64]:
op = statistical_test_difference(get_value_counts_as_per_chi_2(demographics_owner_daily_diary, 'Owner Daily Diary.mean.categorical'), 'Owner Daily Diary', 'Owner Daily Diary.mean.categorical', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      3.82    0.05 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Owner Daily Diary.mean.categorical as the categorical
variable.




In [65]:
op = statistical_test_difference(get_value_counts_as_per_chi_2(demographics_owner_daily_diary, 'Owner Daily Diary.median.categorical'), 'Owner Daily Diary', 'Owner Daily Diary.median.categorical', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results      3.12    0.08 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
Owner Daily Diary.median.categorical as the
categorical variable.




## Weekend vs. weekday

In [66]:
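def map_weekend_weekday(value):
    """Map the 0/1 weekend flag to a label; any other value (including NaN) maps to None."""
    if value == 0:
        return 'Weekday'
    elif value == 1:
        return 'Weekend'
    return None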

In [67]:
demographics_weekend = demog_df.groupby(['PET_ID']).agg({'weekend vs. weekday':'first',
                                                               'wearables_itt':'first',
                                                               'wearables_pp':'first'}).reset_index().replace(float('nan'),np.nan)
demographics_weekend['ISweekend.categorical'] = demographics_weekend['weekend vs. weekday'].apply(map_weekend_weekday)

### Weekend vs. weekday table

In [68]:
display(statistics_by_categorical_itt_pp(demographics_weekend, 'Weekend vs. weekday', 'ISweekend.categorical'))

Demographic,Subpopulation,,Weekend vs. weekday (n),Weekend vs. weekday (%)
Weekend vs. weekday,ITT subpopulation,Weekday,100,0.74
Weekend vs. weekday,ITT subpopulation,Weekend,35,0.26
Weekend vs. weekday,PP subpopulation,Weekday,31,0.74
Weekend vs. weekday,PP subpopulation,Weekend,11,0.26
Weekend vs. weekday,Overall subpopulation,Weekday,100,0.74
Weekend vs. weekday,Overall subpopulation,Weekend,36,0.26


### Weekend vs. weekday statistical test

In [69]:
statistical_test_difference(get_value_counts_as_per_chi_2(demographics_weekend, 'ISweekend.categorical'), 'Weekend vs. weekday', 'ISweekend.categorical', alpha=0.05)

ITT and PP Population
                  Chi-squared test results                 
        Statistic P-value             Conclusion           
-----------------------------------------------------------
Results       0.0     1.0 Do not reject the null hypothesis
-----------------------------------------------------------
Outcome: The p-value is greater than or equal to 0.05,
suggesting that there is not enough evidence to reject
the null hypothesis. There is no significant
difference between the ITT and PP population using
ISweekend.categorical as the categorical variable.
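
As a cross-check, the test can be run directly on the ITT and PP counts from the weekend vs. weekday table. This sketch assumes that `get_value_counts_as_per_chi_2` produces a 2x2 table of ITT and PP counts per category; that helper is not shown in this section, so the layout below is an assumption.

```python
from scipy.stats import chi2_contingency

# Rows: ITT, PP; columns: Weekday, Weekend (counts copied from the table above)
observed = [
    [100, 35],
    [31, 11],
]

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
conclusion = 'Reject the null hypothesis' if p_value < alpha else 'Do not reject the null hypothesis'
print(round(chi2, 2), round(p_value, 2), dof, conclusion)
```

Both groups show the same 74% / 26% split, so the observed counts sit very close to the expected counts and the test finds no evidence of a difference.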


