In [24]:
import numpy as np

import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf

import matplotlib.pyplot as plt
import seaborn as sns

In the dataset, a large sample of adult participants from the general community completed a five-factor model personality questionnaire and three self-report measures related to mood disorders: the Mood Disorder Questionnaire (MDQ), the Hypomania Checklist (HCL), and the Quick Inventory of Depressive Symptomatology (QIDS). Each of the three mood measures was scored as either a positive screen (1) or a negative screen (0) for being considered 'at-risk' for developing a mood disorder. The researchers were interested in the extent to which the personality variables could predict risk for mood disorders in this dataset.

In [6]:
df = pd.read_csv('data_lab.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4464 entries, 0 to 4463
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Age                     4464 non-null   int64  
 1   Gender                  4464 non-null   int64  
 2   OpenessTotal            4464 non-null   float64
 3   ConscientiousnessTotal  4464 non-null   float64
 4   ExtraversionTotal       4464 non-null   float64
 5   AgreeablenessTotal      4464 non-null   float64
 6   NeuroticismTotal        4464 non-null   float64
 7   MDQ_screen              4464 non-null   int64  
 8   QIDS_screen             4464 non-null   int64  
 9   HCL_screen              4464 non-null   int64  
dtypes: float64(5), int64(5)
memory usage: 348.9 KB


In [25]:
questionnaires = [
    'OpenessTotal', 
    'ConscientiousnessTotal', 
    'ExtraversionTotal', 
    'AgreeablenessTotal', 
    'NeuroticismTotal'
]
screening_variables = ['MDQ_screen', 'QIDS_screen', 'HCL_screen']

Q1. Check that the three outcome variables have been labeled correctly (1 = positive screen, 0 = negative screen). In this research scenario, which category should we consider the target group and which should we consider the reference group?

In [10]:
df[screening_variables].describe()

        MDQ_screen  QIDS_screen   HCL_screen
count  4464.000000  4464.000000  4464.000000
mean      0.163978     0.173611     0.152554
std       0.370298     0.378817     0.359597
min       0.000000     0.000000     0.000000
25%       0.000000     0.000000     0.000000
50%       0.000000     0.000000     0.000000
75%       0.000000     0.000000     0.000000
max       1.000000     1.000000     1.000000


Q2. What proportion of the sample have a positive screen for each of the three outcome variables?

In [11]:
df[screening_variables].sum() / len(df)

MDQ_screen     0.163978
QIDS_screen    0.173611
HCL_screen     0.152554
dtype: float64

Q3. Is receiving a positive or negative screen on the QIDS associated with gender? What proportion of males and of females received a positive screen, relative to the total number of males and females, respectively?

In [21]:
df.groupby(['Gender', 'QIDS_screen']).size()

Gender  QIDS_screen
0       0              1963
        1               295
1       0              1726
        1               480
dtype: int64

In [22]:
# The mean of a 0/1 column is the proportion of positive screens in each group
df.groupby('Gender')['QIDS_screen'].mean()

Gender
0    0.130647
1    0.217588
Name: QIDS_screen, dtype: float64
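The first part of Q3 asks whether screen status is associated with gender, which the counts alone do not answer. One common way to check is a Pearson chi-square test of independence on the 2 x 2 table. The sketch below is not part of the original lab; it re-enters the four counts from the groupby output above by hand, applies no continuity correction, and uses the fact that a chi-square p-value with 1 degree of freedom can be written with the complementary error function.

```python
import math

# Counts copied from the Gender x QIDS_screen output above:
# rows are Gender 0 and 1, columns are negative (0) and positive (1) screens.
observed = [[1963, 295],
            [1726, 480]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

# With 1 degree of freedom, the upper-tail p-value equals erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

# Proportion of positive screens within each gender (row-wise)
prop_positive = [row[1] / total for row, total in zip(observed, row_totals)]

print(f"chi2(1) = {chi2:.2f}, p = {p_value:.2e}")
print("proportion positive by gender:", [round(p, 4) for p in prop_positive])
```

A small p-value here would indicate that the proportion of positive screens differs between the two gender groups by more than chance alone would suggest.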

Q4. What is the odds ratio for being a positive screen for females compared to males? 

In [23]:
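# Cross-product odds ratio from the counts above:
# (positive females * negative males) / (positive males * negative females)
round((480 * 1963) / (295 * 1726), 4)

1.8505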
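To make the odds ratio easier to interpret, it can help to build it from the two group odds and attach an interval. The sketch below uses the four counts from the Gender by QIDS_screen table (1963, 295, 1726, 480) and assumes that Gender 1 codes females, which is how the calculation above treats it. The Wald interval on the log scale is one standard choice and is an addition for illustration, not something the lab asks for.

```python
import math

# Cell counts from the Gender x QIDS_screen table.
# Assumption: Gender 1 = female, Gender 0 = male.
female_pos, female_neg = 480, 1726
male_pos, male_neg = 295, 1963

# Odds of a positive screen within each group, then their ratio
odds_female = female_pos / female_neg
odds_male = male_pos / male_neg
odds_ratio = odds_female / odds_male

# Wald 95% interval, computed on the log odds ratio scale
log_or = math.log(odds_ratio)
se_log_or = math.sqrt(1 / female_pos + 1 / female_neg + 1 / male_pos + 1 / male_neg)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"odds (female) = {odds_female:.4f}, odds (male) = {odds_male:.4f}")
print(f"odds ratio = {odds_ratio:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With these counts the ratio comes out at about 1.85, meaning the odds of a positive screen are roughly 85% higher in the group coded 1 than in the group coded 0.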

Q5. The research team initially wanted to examine which of the five personality traits significantly relate to the screening outcome on the QIDS. Run a binomial logistic model in Jamovi that will address this aim. First, check that the reference level is set correctly for the outcome variable in the menu (it should say 'negative'). Why does the overall model test have 5 df? Does the model fit information in the output suggest that this set of predictors provides a reasonable fit to the outcome variable? Provide the key parts of the output below. What is the deviance value for the null model?

In [33]:
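endog = df['QIDS_screen'].values

# .values drops the column names, so x1..x5 in the summary follow the order of `questionnaires`
exog = df[questionnaires].values
# Note: mean() and std() without an axis argument use the grand mean and SD over all
# five columns, not per-trait values; pass axis=0 to standardize each trait separately
exog = (exog - exog.mean()) / exog.std()
exog = sm.add_constant(exog)

logit_mod = sm.Logit(endog, exog)
results = logit_mod.fit()  # fit() returns a results object, not residuals

results.summary()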

Optimization terminated successfully.
         Current function value: 0.369630
         Iterations 7


==============================================================================
Dep. Variable:                      y   No. Observations:                 4464
Model:                          Logit   Df Residuals:                     4458
Method:                           MLE   Df Model:                            5
Date:                Thu, 25 Feb 2021   Pseudo R-squ.:                  0.1992
Time:                        22:42:37   Log-Likelihood:                -1650.0
converged:                       True   LL-Null:                       -2060.4
Covariance Type:            nonrobust   LLR p-value:                3.649e-175
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.3994      0.177    -19.227      0.000      -3.746      -3.053
x1             0.3833      0.106      3.605      0.000       0.175       0.592
x2            -0.6065      0.108     -5.633      0.000      -0.818      -0.395
x3            -0.1713      0.080     -2.139      0.032      -0.328      -0.014
x4            -0.1878      0.106     -1.771      0.077      -0.396       0.020
x5             2.1156      0.097     21.907      0.000       1.926       2.305
==============================================================================
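The deviance and degrees-of-freedom questions in Q5 can be worked out from the two log-likelihoods in the summary. For a binary outcome with one row per person, the deviance is -2 times the log-likelihood. The sketch below re-enters the reported `Log-Likelihood` and `LL-Null` values by hand (they are rounded in the summary, so the results are approximate) and counts parameters as one intercept plus five trait coefficients.

```python
# Values copied from the summary output (rounded as reported there)
ll_model = -1650.0   # Log-Likelihood of the fitted model
ll_null = -2060.4    # LL-Null: intercept-only model

# Parameter counts: intercept + 5 traits versus intercept only
n_params_model = 6
n_params_null = 1

# Deviance = -2 * log-likelihood for ungrouped binary data
null_deviance = -2 * ll_null
model_deviance = -2 * ll_model

# Overall model test: drop in deviance, with df equal to the number of added parameters
lr_chi2 = null_deviance - model_deviance
df_model = n_params_model - n_params_null

# McFadden's pseudo R-squared, the quantity reported as "Pseudo R-squ."
pseudo_r2 = 1 - ll_model / ll_null

print(f"null deviance  = {null_deviance:.1f}")
print(f"model deviance = {model_deviance:.1f}")
print(f"LR chi2({df_model}) = {lr_chi2:.1f}")
print(f"McFadden pseudo R2 = {pseudo_r2:.4f}")
```

The overall test has 5 df because the fitted model estimates five more parameters than the null model, one per personality trait.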
