# Summary statistics 

This notebook contains the code for computing summary statistics on the data used in the project: Public attitudes and ethical guidelines in digital field experiments (digex).

# TO DO

Taken from [2022-digex-study-design](https://docs.google.com/document/d/1nsaXEn04s9LTsjqrpbUpZlyQ3CAcuqkOwy1ZM0d-kKY/edit#):

### Data Analysis Methods

- Following the methodology of Fiesler and Proferes 2018 and other prior studies, we will analyze responses both qualitatively and quantitatively. Since the majority of questions in the survey will elicit responses on an ordinal scale of measurement, we will primarily use descriptive statistics:

- To describe the general behavior of participants when filling out the survey, we will calculate the means, standard deviations, and ranges (M, SD, range) for the response rate, the number of screened-out participants, the number of complete and incomplete survey responses, and the completion time.

- We will further describe the composition of the obtained sample by providing frequency tables or plots and descriptive statistics (M, SD, range) for the demographic variables of gender, age, ethnic background, education level, and political viewpoint. 
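
As a rough illustration of the descriptive statistics listed above, the sketch below computes M, SD, and range for a numeric variable and a frequency table for a categorical one. The values are copied from the five rows that `digex_df.head()` prints further down, so the numbers describe those five rows only, not the sample. The helper `describe_numeric` is a hypothetical name introduced here and is not part of `digex_src`.

```python
import pandas as pd

# Five example rows copied from the digex_df.head() output in this notebook.
toy_df = pd.DataFrame(
    {
        "age": [29, 33, 33, 73, 27],
        "gender_id": ["Male", "Male", "Female", "Female", "Female"],
    }
)


def describe_numeric(series: pd.Series) -> pd.Series:
    """Return mean (M), sample standard deviation (SD), min, max, and range."""
    return pd.Series(
        {
            "M": series.mean(),
            "SD": series.std(),  # pandas defaults to ddof=1, i.e. the sample SD
            "min": series.min(),
            "max": series.max(),
            "range": series.max() - series.min(),
        }
    )


age_stats = describe_numeric(toy_df["age"])
gender_freq = toy_df["gender_id"].value_counts()  # frequency table

print(age_stats.round(2))
print(gender_freq)
```

The same helper could be applied to the completion time once that column is available in the processed data.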


## Set working directory

In [1]:
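import os
import pathlib

# Change the working directory to the parent of the current directory
# (assumed here to be the project root) so that project-relative paths resolve.
path = pathlib.Path.cwd().parent
os.chdir(path)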

## Imports

In [3]:
import pathlib   # Standard library

import yaml   # 3rd party packages
import joypy
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from digex_src import config    # Local imports
from digex_src import preprocess
from digex_src.load_data import get_data_filepath

## Load processed data

In [7]:
processed_data_path = get_data_filepath(
    file=config.PROCESSED_DATA_FILEPATH, 
    data_path=config.PROCESSED_DATA_DIR,
    main=False
) 

digex_df = pd.read_csv(processed_data_path, index_col=0)

digex_df.head()

Unnamed: 0,sm_use,age,gender_id,ethnic_id,edu,politic_views,aware_sm_res,aware_sm_advan,aware_sm_interact,aware_sm_use,...,rank_pub_interst,rank_add_fac_1,rank_add_fac_1_pos,rank_add_fac_2,rank_add_fac_2_pos,rank_add_fac_3,rank_add_fac_3_pos,aware_sm_advan_score,aware_sm_interact_score,aware_sm_use_score
1,Facebook,29.0,Male,Asian - Eastern,Highschool,Slightly liberal,Extremely aware,['… are large and can contain millions of data...,"['Creating fake accounts (""bots"")', 'Secretly ...","['Political elections (e.g. voting behavior)',...",...,1.0,,,,,,,4,0,9
2,Twitter,33.0,Male,Mixed race,Highschool,Neutral/ Neither conservative or liberal,Moderately aware,['… are large and can contain millions of data...,"['Privately messaging users', ""Publicly postin...","['Political elections (e.g. voting behavior)',...",...,4.0,,,,,,,1,1,9
3,Facebook,33.0,Female,Pacific Islander,Bachelor's degree,Very liberal,Extremely aware,['… are large and can contain millions of data...,"['Privately messaging users', ""Publicly postin...","['Political elections (e.g. voting behavior)',...",...,1.0,,,,,,,2,2,5
4,Facebook,73.0,Female,White / Caucasian,Highschool,Slightly conservative,Moderately aware,['… are large and can contain millions of data...,"['Creating fake accounts (""bots"")']","['Political elections (e.g. voting behavior)',...",...,1.0,,8.0,,,,,1,1,6
5,Twitter,27.0,Female,Native-American,Highschool,Very liberal,Extremely aware,['… often capture social relationships not fou...,"['Privately messaging users', ""Publicly postin...","['Political elections (e.g. voting behavior)',...",...,7.0,,,,,,,0,3,9


## Describe data

In [None]:
# Metadata
# - duration
#
# Demographics
# - LocationLatitude, LocationLongitude

In [3]:
# https://aeturrell.github.io/coding-for-economists/data-numbers.html#counts
# count() and value_counts() and mode()
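
A minimal sketch of the three methods named in the cell above, applied to one categorical column. The `sm_use` values are copied from the `digex_df.head()` output; the trailing `None` is an assumption added only to show how missing answers are treated.

```python
import pandas as pd

# Values from the digex_df.head() output, plus one artificial missing answer.
sm_use = pd.Series(
    ["Facebook", "Twitter", "Facebook", "Facebook", "Twitter", None],
    name="sm_use",
)

n_answered = sm_use.count()                    # counts non-missing values only
counts = sm_use.value_counts()                 # absolute frequencies, NaN dropped
shares = sm_use.value_counts(normalize=True)   # relative frequencies
most_common = sm_use.mode()                    # a Series, because ties are possible

print(n_answered)
print(counts)
print(shares)
print(most_common.tolist())
```

Passing `dropna=False` to `value_counts()` would list the missing answers as their own category, which may be useful when reporting incomplete responses.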

In [None]:
# Summarise numerical values with .describe()


In [None]:
# Template adapted from the tutorial linked above; "age" is a numeric column in digex_df
# table = digex_df[["age"]].agg(["mean", "std"])
# table

In [None]:
# Aggregation: https://aeturrell.github.io/coding-for-economists/data-transformation.html#aggregation

# Groupby and aggregate: https://aeturrell.github.io/coding-for-economists/data-transformation.html#groupby-and-then-aggregate-aka-split-apply-combine
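
A small sketch of the split-apply-combine pattern the links above describe, here grouping age by social media platform. The rows are copied from the `digex_df.head()` output; grouping by `sm_use` is only an example choice, not an analysis the study design prescribes.

```python
import pandas as pd

# Five example rows copied from the digex_df.head() output.
toy_df = pd.DataFrame(
    {
        "sm_use": ["Facebook", "Twitter", "Facebook", "Facebook", "Twitter"],
        "age": [29.0, 33.0, 33.0, 73.0, 27.0],
    }
)

# Split by platform, then compute several statistics per group.
# String names are used for the aggregations instead of NumPy functions.
age_by_platform = toy_df.groupby("sm_use")["age"].agg(
    ["count", "mean", "std", "min", "max"]
)

print(age_by_platform)
```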

In [None]:
## Spread and Distribution: https://aeturrell.github.io/coding-for-economists/data-numbers.html#counts
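
One possible way to look at spread is through quartiles and the interquartile range (IQR). The sketch below uses the five ages from the `digex_df.head()` output; flagging values more than 1.5 × IQR beyond the quartiles is a common rule of thumb chosen here for illustration, not a method taken from the study design.

```python
import pandas as pd

# Ages copied from the digex_df.head() output.
age = pd.Series([29.0, 33.0, 33.0, 73.0, 27.0], name="age")

quartiles = age.quantile([0.25, 0.5, 0.75])  # linear interpolation by default
q1, q3 = quartiles.loc[0.25], quartiles.loc[0.75]
iqr = q3 - q1

# Rule-of-thumb fences for unusually small or large values.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = age[(age < lower_fence) | (age > upper_fence)]

print(quartiles)
print(iqr)
print(outliers.tolist())
```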

In [None]:
from skimpy import skim

# Print a compact summary of every column in the processed data
skim(digex_df)

In [None]:
# Make quick charts with .plot.*
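
A quick chart of a frequency table could look roughly like this. The `edu` values are copied from the `digex_df.head()` output, and the axis labels are assumptions made for the example. The `Agg` backend line is only there so the sketch runs without a display; in a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd
from matplotlib import pyplot as plt

# Education levels copied from the digex_df.head() output.
edu = pd.Series(
    ["Highschool", "Highschool", "Bachelor's degree", "Highschool", "Highschool"],
    name="edu",
)

fig, ax = plt.subplots(figsize=(6, 3))

# value_counts() sorts by frequency, so the tallest bar comes first.
edu.value_counts().plot.bar(ax=ax, rot=0)
ax.set_xlabel("Education level")
ax.set_ylabel("Number of respondents")
fig.tight_layout()
```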

## Create script

In [5]:
%%writefile scripts/get_summary_statistics.py

'Hello'

Overwriting scripts/get_summary_statistics.py
