# Summary statistics 

This notebook computes summary statistics for the data used in the project: Public attitudes and ethical guidelines in digital field experiments (digex).

# TO DO

Taken from [2022-digex-study-design](https://docs.google.com/document/d/1nsaXEn04s9LTsjqrpbUpZlyQ3CAcuqkOwy1ZM0d-kKY/edit#):

- To describe the general behavior of participants when filling out the survey, we will calculate the means, standard deviations, and ranges (M, SD, range) for the response rate, the number of screened-out participants, the number of complete and incomplete survey participations, and the completion time.

- We will further describe the composition of the obtained sample by providing frequency tables or plots and descriptive statistics (M, SD, range) for the demographic variables of gender, age, ethnic background, education level, and political viewpoint. 


## Set working directory

In [1]:
import pathlib   # Change cwd
import os 

path = pathlib.Path.cwd().parent
os.chdir(path)

## Imports

In [2]:
import pathlib   # Standard library

import yaml   # 3rd party packages
import joypy
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from digex_src import config    # Local imports
from digex_src import preprocess
from digex_src import get_summary_statistics
from digex_src.load_data import get_data_filepath

## Load processed data

In [8]:
processed_data_path = get_data_filepath(
    file=config.PROCESSED_DATA_FILEPATH, 
    data_path=config.PROCESSED_DATA_DIR,
    main=False
) 

digex_df = pd.read_csv(processed_data_path, index_col=0)

digex_df.head()

Unnamed: 0,duration_sec,finished,sm_use,age,gender_id,ethnic_id,edu,politic_views,aware_sm_res,aware_sm_advan,...,rank_pub_interst,rank_add_fac_1,rank_add_fac_1_pos,rank_add_fac_2,rank_add_fac_2_pos,rank_add_fac_3,rank_add_fac_3_pos,aware_sm_advan_score,aware_sm_interact_score,aware_sm_use_score
1,912.0,True,Facebook,29.0,Male,Asian - Eastern,Highschool,Slightly liberal,Extremely aware,['… are large and can contain millions of data...,...,1.0,,,,,,,4,0,9
2,720.0,True,Twitter,33.0,Male,Mixed race,Highschool,Neutral/ Neither conservative or liberal,Moderately aware,['… are large and can contain millions of data...,...,4.0,,,,,,,1,1,9
3,1874.0,True,Facebook,33.0,Female,Pacific Islander,Bachelor's degree,Very liberal,Extremely aware,['… are large and can contain millions of data...,...,1.0,,,,,,,2,2,5
4,1264.0,True,Facebook,73.0,Female,White / Caucasian,Highschool,Slightly conservative,Moderately aware,['… are large and can contain millions of data...,...,1.0,,8.0,,,,,1,1,6
5,556.0,True,Twitter,27.0,Female,Native-American,Highschool,Very liberal,Extremely aware,['… often capture social relationships not fou...,...,7.0,,,,,,,0,3,9


In [9]:
digex_raw_data_path = get_data_filepath(main=False) 
digex_raw_data_df = pd.read_excel(digex_raw_data_path)

digex_raw_data_df.head()

Unnamed: 0,start_date,end_date,status,progress,duration_sec,finished,date,q_recaptcha_scor,consent,sm_use,...,rank_anony,rank_harms,rank_balance,rank_pub_interst,rank_add_fac_1,rank_add_fac_1_pos,rank_add_fac_2,rank_add_fac_2_pos,rank_add_fac_3,rank_add_fac_3_pos
0,Start Date,End Date,Response Type,Progress,Duration (in seconds),Finished,Recorded Date,Q_RecaptchaScore,Welcome to our survey! \n\nStudy Information a...,Which of the following social media sites do y...,...,To run experiments like the ones described in ...,To run experiments like the ones described in ...,To run experiments like the ones described in ...,To run experiments like the ones described in ...,Are there any additional factors that you thin...,Are there any additional factors that you thin...,Are there any additional factors that you thin...,Are there any additional factors that you thin...,Are there any additional factors that you thin...,Are there any additional factors that you thin...
1,2022-09-09 04:23:23,2022-09-09 04:38:35,IP Address,100,912,True,2022-09-09 04:38:37.488000,0.9,I consent,Facebook,...,6,4,3,1,,,,,,
2,2022-09-09 04:34:03,2022-09-09 04:46:03,IP Address,100,720,True,2022-09-09 04:46:04.146000,0.8,I consent,Twitter,...,6,1,7,4,,,,,,
3,2022-09-09 04:34:26,2022-09-09 05:05:41,IP Address,100,1874,True,2022-09-09 05:05:43.065000,0.9,I consent,Facebook,...,3,2,4,1,Na,,Na,,Na,
4,2022-09-09 04:51:43,2022-09-09 05:12:47,IP Address,100,1264,True,2022-09-09 05:12:47.820000,0.9,I consent,Facebook,...,3,4,5,1,Offer results to participants,8,,,,


## Summary statistics

In [10]:
df = digex_df.copy()

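TODO: call the functions below through the `get_summary_statistics` module everywhere (for example `get_summary_statistics.completed_participants`).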

### Survey experience (metadata)

#### Number of complete survey participants

In [32]:
completed_p = completed_participants(digex_df)
print(completed_p)

499


#### Response rate

In [33]:
response_r = response_rate(digex_df, as_percentage=False)
print(response_r)

499 per 500


In [36]:
response_perc = response_rate(digex_df, as_percentage=True)
print(response_perc,'%')

99.8 %


#### Number of screened out participants

In [13]:
print(PARTICIPANT_COUNT - completed_p)

1
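The study design also asks for the number of incomplete participations. A minimal sketch of how both counts could be reported together, assuming `finished` is a boolean column as in the `head()` output above. The helper `participation_counts` and the toy DataFrame are hypothetical and not part of `digex_src`:

```python
import pandas as pd


def participation_counts(df, col="finished"):
    """Hypothetical helper: count complete and incomplete participations."""
    # Rows where `col` is True are complete; False or missing count as incomplete.
    finished = df[col].eq(True)
    return pd.Series(
        {"complete": int(finished.sum()), "incomplete": int((~finished).sum())},
        name="participants",
    )


# Toy data for illustration only; not the digex data.
toy_df = pd.DataFrame({"finished": [True, True, False, True, None]})
counts = participation_counts(toy_df)
print(counts)
```

Treating missing values as incomplete is a design choice made for this sketch; the project may prefer to report them separately.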


Per the study design, we will calculate the means, standard deviations, and ranges (M, SD, range) for the response rate, the number of screened-out participants, the number of complete and incomplete survey participations, and the completion time.

#### Completion time

In [60]:
completion_time(digex_df, time_unit='min')

count                          499
mean     0 days 00:16:50.851703406
std      0 days 00:09:43.182528457
min                0 days 00:02:30
25%                0 days 00:10:08
50%                0 days 00:14:48
75%                0 days 00:20:39
max                0 days 01:23:47
Name: duration_sec, dtype: object
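To report the completion time as M, SD, and range in minutes rather than as timedeltas, the seconds can be converted before aggregating. A minimal sketch on toy values (not the digex data); the helper `completion_time_minutes` is hypothetical and only assumes a numeric `duration_sec` column:

```python
import pandas as pd


def completion_time_minutes(df, col="duration_sec", decimals=2):
    """Hypothetical helper: mean, SD, min, and max of completion time in minutes."""
    minutes = df[col] / 60  # seconds -> minutes
    summary = minutes.agg(["mean", "std", "min", "max"])
    return summary.round(decimals)


# Toy data for illustration only.
toy_df = pd.DataFrame({"duration_sec": [600.0, 900.0, 1200.0]})
summary = completion_time_minutes(toy_df)
print(summary)
```

Returning the summary instead of printing it makes the values easy to reuse in a table or in running text.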


### Demographics

TODO: reproduce the demographics table from https://journals.sagepub.com/doi/pdf/10.1177/20563051211033824, with a cleaner layout.
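A frequency table for one categorical demographic variable could be built from counts and percentages. This is a minimal sketch on toy data, not the project's implementation; the helper `frequency_table` is hypothetical, and `gender_id` is used only because it appears in the processed data above:

```python
import pandas as pd


def frequency_table(df, col, decimals=1):
    """Hypothetical helper: counts and percentages for one categorical column."""
    # dropna=False keeps missing answers visible as their own row.
    counts = df[col].value_counts(dropna=False)
    percent = (counts / counts.sum() * 100).round(decimals)
    return pd.DataFrame({"n": counts, "percent": percent})


# Toy data for illustration only; not the digex data.
toy_df = pd.DataFrame({"gender_id": ["Female", "Male", "Female", "Female"]})
table = frequency_table(toy_df, "gender_id")
print(table)
```

The same helper could be applied in a loop over `ethnic_id`, `edu`, and `politic_views` to assemble the full demographics table.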

In [3]:
# https://aeturrell.github.io/coding-for-economists/data-numbers.html#counts
# count() and value_counts() and mode()

In [None]:
# Summarise numerical values with .describe()


In [None]:
# table = df[["mass", "height"]].agg([np.mean, np.std])
# table

In [None]:
# Aggregation: https://aeturrell.github.io/coding-for-economists/data-transformation.html#aggregation

# Groupby and aggregate: https://aeturrell.github.io/coding-for-economists/data-transformation.html#groupby-and-then-aggregate-aka-split-apply-combine

In [None]:
## Spread and Distribution: https://aeturrell.github.io/coding-for-economists/data-numbers.html#counts

In [None]:
from skimpy import skim

skim(df)

In [None]:
# Make quick charts with .plot.*

## Create script

In [59]:
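# %%writefile digex_src/get_summary_statistics.py

import pandas as pd

from digex_src import config


PARTICIPANT_COUNT = config.PARTICIPANT_COUNT
DECIMAL_PLACES = config.DECIMAL_PLACES


def completed_participants(df, col='finished'):
    """Return the number of participants who finished the survey."""
    # Count rows where `col` is True; missing values count as not finished.
    return df[col].eq(True).sum()


def response_rate(df,
                  col='finished',
                  as_percentage=True,
                  total=PARTICIPANT_COUNT,
):
    """Return the response rate as a percentage or as an 'n per total' string."""
    completed = completed_participants(df, col=col)
    if as_percentage:
        return (completed / total) * 100
    return f'{completed} per {total}'


def completion_time(df, col='duration_sec', time_unit='sec'):
    """Print descriptive statistics of the completion time.

    time_unit='sec' describes the raw seconds; time_unit='min' describes the
    durations as timedeltas (days hh:mm:ss).
    """
    if time_unit == 'sec':
        print(df[col].describe())
    elif time_unit == 'min':
        print(pd.to_timedelta(df[col], unit='s').describe())
    else:
        raise ValueError(f"time_unit must be 'sec' or 'min', got {time_unit!r}")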