# Summary statistics 

This notebook computes summary statistics for the data used in the project "Public attitudes and ethical guidelines in digital field experiments" (digex).

## To do

Taken from [2022-digex-study-design](https://docs.google.com/document/d/1nsaXEn04s9LTsjqrpbUpZlyQ3CAcuqkOwy1ZM0d-kKY/edit#):

- To describe the general behavior of participants when filling out the survey, we will calculate the mean, standard deviation, and range (M, SD, range) for the response rate, the number of screened-out participants, the number of complete and incomplete survey participations, and the completion time.

- We will further describe the composition of the obtained sample by providing frequency tables or plots and descriptive statistics (M, SD, range) for the demographic variables of gender, age, ethnic background, education level, and political viewpoint. 
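
A minimal sketch of how the (M, SD, range) summaries listed above could be computed with pandas. The helper name `describe_numeric` and the toy values are illustrative assumptions; they are not part of `digex_src` or the study data.

```python
import pandas as pd


def describe_numeric(values: pd.Series) -> pd.Series:
    """Return mean (M), sample standard deviation (SD), min, and max."""
    # Series.agg with a list of function names returns a Series
    # indexed by those names; pandas' std uses ddof=1 by default.
    stats = values.agg(["mean", "std", "min", "max"])
    return stats.rename({"mean": "M", "std": "SD"})


# Toy completion times in seconds (made-up values, not study data).
toy_durations = pd.Series([600.0, 900.0, 1200.0], name="duration_sec")
summary = describe_numeric(toy_durations)
print(summary)
```

The range is reported here as the pair (min, max); a single-number range would be `summary["max"] - summary["min"]`.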


## Set working directory

In [1]:
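import os
import pathlib

# Change the current working directory to the parent of the notebook folder.
# Note: re-running this cell moves up one more level each time.
path = pathlib.Path.cwd().parent
os.chdir(path)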

## Imports

In [2]:
import pathlib   # Standard library
import statistics

import yaml   # 3rd party packages
import joypy
import pandas as pd
import numpy as np
from skimpy import skim
from matplotlib import pyplot as plt

from digex_src import config    # Local imports
from digex_src import preprocess
from digex_src import get_summary_statistics
from digex_src.load_data import get_data_filepath

## Load processed data

In [3]:
processed_data_path = get_data_filepath(
    file=config.PROCESSED_DATA_FILEPATH, 
    data_path=config.PROCESSED_DATA_DIR,
    main=False
) 

digex_df = pd.read_csv(processed_data_path, index_col=0)

digex_df.head()

| | duration_sec | finished | sm_use | age | gender_id | ethnic_id | edu | politic_views | aware_sm_res | aware_sm_advan | ... | rank_pub_interst | rank_add_fac_1 | rank_add_fac_1_pos | rank_add_fac_2 | rank_add_fac_2_pos | rank_add_fac_3 | rank_add_fac_3_pos | aware_sm_advan_score | aware_sm_interact_score | aware_sm_use_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 912.0 | True | Facebook | 29.0 | Male | Asian - Eastern | Highschool | Slightly liberal | Extremely aware | ['… are large and can contain millions of data... | ... | 1.0 | | | | | | | 4 | 0 | 9 |
| 2 | 720.0 | True | Twitter | 33.0 | Male | Mixed race | Highschool | Neutral/ Neither conservative or liberal | Moderately aware | ['… are large and can contain millions of data... | ... | 4.0 | | | | | | | 1 | 1 | 9 |
| 3 | 1874.0 | True | Facebook | 33.0 | Female | Pacific Islander | Bachelor's degree | Very liberal | Extremely aware | ['… are large and can contain millions of data... | ... | 1.0 | | | | | | | 2 | 2 | 5 |
| 4 | 1264.0 | True | Facebook | 73.0 | Female | White / Caucasian | Highschool | Slightly conservative | Moderately aware | ['… are large and can contain millions of data... | ... | 1.0 | | 8.0 | | | | | 1 | 1 | 6 |
| 5 | 556.0 | True | Twitter | 27.0 | Female | Native-American | Highschool | Very liberal | Extremely aware | ['… often capture social relationships not fou... | ... | 7.0 | | | | | | | 0 | 3 | 9 |


## Summary statistics

Variables examined: 0-14 (see variable-table.html)

### Overview

In [4]:
skim(digex_df)

### Survey experience 

#### Number of complete survey participants

In [7]:
completed_p = completed_participants(digex_df)
print(completed_p)

499
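
The implementation of `completed_participants` is not shown in this notebook. One plausible way to obtain such a count, assuming the `finished` column marks completed surveys with `True`, is sketched below. The helper `count_completed` and the toy frame are hypothetical.

```python
import pandas as pd


def count_completed(df: pd.DataFrame, finished_col: str = "finished") -> int:
    """Count rows whose completion flag is True (hypothetical helper)."""
    # eq(True) yields False for missing values, so NaN rows are not counted.
    return int(df[finished_col].eq(True).sum())


# Toy data (not study data): three finished surveys, one unfinished.
toy_df = pd.DataFrame({"finished": [True, True, False, True]})
n_completed = count_completed(toy_df)
print(n_completed)
```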


#### Response rate

In [8]:
response_r = response_rate(digex_df, as_percentage=False)
print(response_r)

499 per 500


In [9]:
response_perc = response_rate(digex_df, as_percentage=True)
print(response_perc,'%')

99.8 %
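
The two output formats above follow from simple arithmetic on the number of completed surveys and the number of recruited participants. The sketch below reproduces them for the 499 of 500 reported in this notebook. The function `response_rate_sketch` and its signature are assumptions; the project's own `response_rate` helper may be implemented differently.

```python
def response_rate_sketch(n_completed: int, n_recruited: int,
                         as_percentage: bool = False):
    """Return the response rate as a percentage or as an 'x per y' string."""
    if n_recruited <= 0:
        raise ValueError("n_recruited must be positive")
    if as_percentage:
        # Share of recruited participants who completed, in percent.
        return round(100 * n_completed / n_recruited, 1)
    return f"{n_completed} per {n_recruited}"


rate_text = response_rate_sketch(499, 500)
rate_perc = response_rate_sketch(499, 500, as_percentage=True)
print(rate_text)
print(rate_perc, "%")
```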


#### Number of screened-out participants

In [10]:
print(PARTICIPANT_COUNT - completed_p)

1


#### Completion time

In [11]:
times_min = completion_time(digex_df, time_unit='min')
print(times_min)

count                          499
mean     0 days 00:16:50.851703406
std      0 days 00:09:43.182528457
min                0 days 00:02:30
25%                0 days 00:10:08
50%                0 days 00:14:48
75%                0 days 00:20:39
max                0 days 01:23:47
Name: duration_sec, dtype: object
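
A summary of this shape can be produced by converting the durations from seconds to timedeltas and calling `describe()`. The sketch below is an assumption about the approach, not the actual `completion_time` helper; the function name and toy values are hypothetical.

```python
import pandas as pd


def completion_time_sketch(duration_sec: pd.Series) -> pd.Series:
    """Summarize durations given in seconds as Timedelta statistics."""
    # Timedelta values print as 'days HH:MM:SS', which is easier to read
    # than raw seconds.
    durations = pd.to_timedelta(duration_sec, unit="s")
    return durations.describe()


# Toy durations in seconds (not study data).
toy_sec = pd.Series([300.0, 600.0, 900.0, 1200.0], name="duration_sec")
time_summary = completion_time_sketch(toy_sec)
print(time_summary)
```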


### Demographic information

In [42]:
demographic_df = demographic_information(digex_df[[
    'age', 'gender_id', 'ethnic_id', 'edu',
    'politic_views', 'sm_use']])
display(demographic_df)

| | age | age_vals | gender_id | gender_id_vals | gender_id_perc | ethnic_id | ethnic_id_vals | ethnic_id_perc | edu | edu_vals | edu_perc | politic_views | politic_views_vals | politic_views_perc | sm_use | sm_use_vals | sm_use_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Average | 41.663327 | Male | 282.0 | 56.513 | White / Caucasian | 397 | 79.559 | Bachelor's degree | 222.0 | 44.489 | Very liberal | 150.0 | 30.06 | Facebook | 258.0 | 51.703 |
| 1 | Standard deviation | 13.635932 | Female | 207.0 | 41.483 | African-American | 32 | 6.413 | Highschool | 153.0 | 30.661 | Slightly liberal | 126.0 | 25.251 | Reddit | 133.0 | 26.653 |
| 2 | Min | 18.0 | Non-binary / third gender | 8.0 | 1.603 | Mixed race | 20 | 4.008 | Master's degree or above | 87.0 | 17.435 | Slightly conservative | 96.0 | 19.238 | Twitter | 108.0 | 21.643 |
| 3 | Max | 78.0 | Prefer not to say | 2.0 | 0.401 | Hispanic | 19 | 3.808 | Associate's degree | 22.0 | 4.409 | Neutral/ Neither conservative or liberal | 89.0 | 17.836 | | | |
| 4 | | | | | | Asian - Eastern | 16 | 3.206 | Some college | 7.0 | 1.403 | Very conservative | 35.0 | 7.014 | | | |
| 5 | | | | | | Asian - Indian | 7 | 1.403 | Prefer not to say | 4.0 | 0.802 | Prefer not to say | 3.0 | 0.601 | | | |
| 6 | | | | | | Native-American | 3 | 0.601 | Vocational training | 4.0 | 0.802 | | | | | | |
| 7 | | | | | | Pacific Islander | 1 | 0.2 | | | | | | | | | |
| 8 | | | | | | Prefer not to say | 1 | 0.2 | | | | | | | | | |
| 9 | | | | | | Asian - Southeast | 1 | 0.2 | | | | | | | | | |
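
The `_vals` and `_perc` columns above pair each category with its count and its share of the sample; the reported percentages are consistent with dividing each count by 499 (for example, 282 / 499 ≈ 56.513 %). A frequency table of this kind for a single variable can be sketched as follows. The helper `frequency_table` and the toy data are hypothetical and do not reflect how `demographic_information` is implemented.

```python
import pandas as pd


def frequency_table(values: pd.Series) -> pd.DataFrame:
    """Return counts and percentages per category, most frequent first."""
    # value_counts sorts by frequency (descending) and skips missing values.
    counts = values.value_counts(dropna=True)
    percent = (100 * counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy answers (not study data).
toy_gender = pd.Series(
    ["Male", "Female", "Male", "Female", "Male", "Non-binary / third gender"],
    name="gender_id",
)
gender_table = frequency_table(toy_gender)
print(gender_table)
```

Computing one such table per demographic variable keeps each variable's categories separate, which avoids the empty padding cells that appear when tables of different lengths are combined into one frame.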
