# Exploratory analysis 

This notebook contains code for exploratory data analysis of data used in the project: Public attitudes and ethical guidelines in digital field experiments (digex).

# TO DO

**We can roughly use the below bullet points taken from [2022-digex-study-design](https://docs.google.com/document/d/1nsaXEn04s9LTsjqrpbUpZlyQ3CAcuqkOwy1ZM0d-kKY/edit#) as section headings for this notebook to conduct the exploratory analysis**:

- To address question 1, we will provide frequency tables or plots and descriptive statistics (M, SD, range) for the variables: awareness of the fact that academic researchers use social media data, awareness of the advantages of social media data that account for why academics collect them, awareness of social media data use, and awareness of social media interaction methods. 

- To address question 2, we will provide frequency tables or plots and descriptive statistics (M, SD, range) for each of the 4 vignette studies presented in section 2 of the survey for the variable: attitudes towards actual research studies. 

- To address question 3, we will provide frequency tables or plots for the variables attitudes towards study design factors and attitudes towards ethical principles. Moreover, we will also use a mixed-methods approach to analyze open-ended free-text responses (see below).

- To further address questions 1-3, open-ended free-text inputs from both the open “other” options of selection items and the open-ended free-text answers (i.e., what do you think it means for an academic study to receive “ethical approval”, describe any concerns you might have, what additional information about the study or the researchers would influence your level of concern, are there any other features of research that are important for determining your level of concern, are there any additional factors that you think should be considered) will be analyzed using a mixed-methods approach to detect common topics, sentiments, and themes.
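The frequency tables and descriptive statistics (M, SD, range) named in the bullets could be produced by one small helper. This is a minimal sketch only: the function name `describe_likert`, the toy responses, and the numeric coding (including the label "Not at all aware") are illustrative assumptions, not part of the project code.

```python
import pandas as pd


def describe_likert(responses, coding):
    """Return a frequency table and M / SD / min / max for one survey item."""
    # Frequency table: counts and proportions, missing answers included
    counts = responses.value_counts(dropna=False)
    freq = pd.DataFrame({"n": counts, "prop": counts / len(responses)})

    # Descriptive statistics on the numerically coded answers
    coded = responses.map(coding)
    stats = pd.Series(
        {"M": coded.mean(), "SD": coded.std(), "min": coded.min(), "max": coded.max()}
    )
    return freq, stats


# Hypothetical coding and toy data, for illustration only
coding = {"Not at all aware": 0, "Moderately aware": 1, "Extremely aware": 2}
toy = pd.Series(
    ["Extremely aware", "Moderately aware", "Extremely aware", "Not at all aware"]
)

freq, stats = describe_likert(toy, coding)
print(freq)
print(stats)
```

Applied to the real data, `responses` would be a column such as `digex_df['aware_sm_res']`, with a coding that matches the actual response labels.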

## Set working directory

In [1]:
import pathlib   # Change cwd
import os 

path = pathlib.Path.cwd().parent
os.chdir(path)

## Imports

In [2]:
import joypy
import pandas as pd
import numpy as np
import seaborn as sns  # used for the colour palette under "Plotting presets"
from matplotlib import pyplot as plt

## Plotting presets

In [None]:
digex_style = config.MPL_STYLE_FILEPATH
digex_palette = config.PALETTE

plt.style.use(digex_style)
sns.set_palette(digex_palette)  # set_palette applies the palette; color_palette only returns it

## Load processed data

In [None]:
processed_data_path = get_data_filepath(
    file=config.PROCESSED_DATA_FILEPATH, 
    data_path=config.PROCESSED_DATA_DIR,
    main=False
) 

digex_df = pd.read_csv(processed_data_path, index_col=0)

digex_df.head()

The cell above didn't work for me (Jason), so I'm loading the file from an absolute path instead.

In [3]:
digex_df = pd.read_csv('/Users/jasonburton/Documents/GitHub/article-digex-survey/data/processed/digex-survey-responses-processed.csv', index_col=0)

digex_df.head()

|  | duration_sec | finished | sm_use | age | gender_id | ethnic_id | edu | politic_views | aware_sm_res | aware_sm_advan | ... | rank_pub_interst | rank_add_fac_1 | rank_add_fac_1_pos | rank_add_fac_2 | rank_add_fac_2_pos | rank_add_fac_3 | rank_add_fac_3_pos | aware_sm_advan_score | aware_sm_interact_score | aware_sm_use_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 912.0 | True | Facebook | 29.0 | Male | Asian - Eastern | Highschool | Slightly liberal | Extremely aware | ['… are large and can contain millions of data... | ... | 1.0 |  |  |  |  |  |  | 4 | 0 | 9 |
| 2 | 720.0 | True | Twitter | 33.0 | Male | Mixed race | Highschool | Neutral/ Neither conservative or liberal | Moderately aware | ['… are large and can contain millions of data... | ... | 4.0 |  |  |  |  |  |  | 1 | 1 | 9 |
| 3 | 1874.0 | True | Facebook | 33.0 | Female | Pacific Islander | Bachelor's degree | Very liberal | Extremely aware | ['… are large and can contain millions of data... | ... | 1.0 |  |  |  |  |  |  | 2 | 2 | 5 |
| 4 | 1264.0 | True | Facebook | 73.0 | Female | White / Caucasian | Highschool | Slightly conservative | Moderately aware | ['… are large and can contain millions of data... | ... | 1.0 |  | 8.0 |  |  |  |  | 1 | 1 | 6 |
| 5 | 556.0 | True | Twitter | 27.0 | Female | Native-American | Highschool | Very liberal | Extremely aware | ['… often capture social relationships not fou... | ... | 7.0 |  |  |  |  |  |  | 0 | 3 | 9 |
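A portable alternative to the absolute path is to build it from the working directory set at the top of the notebook. This is a minimal sketch, assuming the notebook's parent directory is the repository root and that the file sits under `data/processed/` as in the absolute path; the helper name `processed_csv_path` is hypothetical.

```python
from pathlib import Path


def processed_csv_path(repo_root):
    """Build the processed-data path relative to the repository root."""
    return Path(repo_root) / "data" / "processed" / "digex-survey-responses-processed.csv"


# After the os.chdir(...) call, Path.cwd() should be the repository root (assumption)
csv_path = processed_csv_path(Path.cwd())
status = "found" if csv_path.exists() else "not found; check the working directory"
print(csv_path, "->", status)
```

If the file is found, `pd.read_csv(csv_path, index_col=0)` would load it in the same way on any collaborator's machine.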


## Exploratory data analysis

Resources:
- https://aeturrell.github.io/coding-for-economists/data-exploratory-analysis.html#the-pandas-profiling-package
- https://deepnote.com/@deepnote/Joyplot-Introduction-4666e1a3-3249-442e-9a94-2bbcc5cb1b1d

### Demographics

see `02-digex-summary-statistics.ipynb`

(VS: I've added both our code there; apologies for not making it clear that I had already done some demographic summary stats!)

### Prior awareness

**Variables to examine: 8-15**

JB: struggling here... trying to parse the values into dummy variables so that there's a separate column for `… are always representative of people’s offline behavior`, `… are naturalistic in that they do not require researchers to directly interact with research volunteers`, etc., with a 1 indicating that the participant checked the box and a 0 indicating that they didn't.

In [4]:
# remove brackets from the column values (raw-string regex avoids an invalid-escape warning)
digex_df['aware_sm_advan'] = digex_df['aware_sm_advan'].replace(r'[\[\]]', '', regex=True)

# parse by comma separators
digex_df['aware_sm_advan'] = digex_df['aware_sm_advan'].apply(lambda x: x.split(','))

In [5]:
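# get dummies: one indicator row per selected option, summed back to one row per participant
# (.sum(level=0) is deprecated and removed in pandas 2.0; groupby(level=0).sum() is the replacement)
dummy_df = pd.get_dummies(digex_df['aware_sm_advan'].apply(pd.Series).stack()).groupby(level=0).sum()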



In [6]:
# view dummies
dummy_df

Unnamed: 0,'… are always representative of people’s offline behavior','… are naturalistic in that they do not require researchers to directly interact with research volunteers','… are readily accessible to researchers and easy to collect','… are unaffected by the way social media platforms work','… are well formatted and never contain any missing data','… often capture social relationships not found using traditional methods (e.g. surveys)','… reflect events in real-time and can be collected continuously over time','None of the above','… are always representative of people’s offline behavior'.1,'… are large and can contain millions of data points','… are naturalistic in that they do not require researchers to directly interact with research volunteers'.1,'… are readily accessible to researchers and easy to collect'.1,'… are unaffected by the way social media platforms work'.1,'… often capture social relationships not found using traditional methods (e.g. surveys)'.1,'… reflect events in real-time and can be collected continuously over time'.1
1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0
4,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0
5,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0,0,1,0,0,1,1,0,0,1,0,0,0,0,0
496,1,1,1,1,0,1,0,0,0,0,0,0,0,0,1
497,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
498,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0


Why does `dummy_df` have 15 columns, with some column names looking identical?
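A likely explanation: splitting on `','` leaves a leading space on every option except the first in each list, and the quote characters are kept, so `'x'` and ` 'x'` count as different categories and each gets its own dummy column. Below is a minimal sketch of one way around this. It assumes the cells are stringified Python lists (as the `['…` prefix in the preview suggests), and the toy series with shortened option labels is hypothetical.

```python
import ast

import pandas as pd

# Hypothetical toy data standing in for digex_df['aware_sm_advan']
raw = pd.Series(
    [
        "['are large', 'are naturalistic']",
        "['are naturalistic']",
        "['None of the above']",
    ],
    index=[1, 2, 3],
)

# What the cells above do: strip brackets, then split on commas.
# The second item of row 1 becomes " 'are naturalistic'" (note the leading space).
naive = raw.replace(r"[\[\]]", "", regex=True).apply(lambda x: x.split(","))
naive_cols = pd.get_dummies(naive.explode()).columns.tolist()
print(len(naive_cols), "columns from naive parsing:", naive_cols)

# Alternative: parse each cell as a Python literal, which yields clean strings
parsed = raw.apply(ast.literal_eval)
dummies = pd.get_dummies(parsed.explode()).groupby(level=0).sum()
print(dummies)
```

With the naive parsing, the toy data yields four columns for three real options; parsing with `ast.literal_eval` yields exactly three. If the real column contains missing values, they would need handling before `ast.literal_eval` is applied.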

### Study descriptions

**Variables to examine: 16-27**

#### Study 1

In [None]:
digex_df['study_1_ethic_acc'].value_counts(normalize=True, dropna=False)

#### Study 2

In [None]:
digex_df['study_2_ethic_acc'].value_counts(normalize=True, dropna=False)

#### Study 3

In [None]:
digex_df['study_3_ethic_acc'].value_counts(normalize=True, dropna=False)

#### Study 4

In [None]:
digex_df['study_4_ethic_acc'].value_counts(normalize=True, dropna=False)
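The per-study tables can also be viewed side by side, with the response options in scale order rather than sorted by frequency. This is a minimal sketch on hypothetical toy data: the column names follow the notebook, the responses are made up, and only two studies are shown.

```python
import pandas as pd

# Response options in scale order (five-point acceptability scale)
scale = [
    "Completely unacceptable",
    "Somewhat unacceptable",
    "Neutral",
    "Somewhat acceptable",
    "Completely acceptable",
]

# Hypothetical toy responses standing in for digex_df
toy_df = pd.DataFrame(
    {
        "study_1_ethic_acc": ["Neutral", "Somewhat acceptable", "Somewhat acceptable", "Completely acceptable"],
        "study_2_ethic_acc": ["Completely unacceptable", "Neutral", "Neutral", "Somewhat acceptable"],
    }
)

# One column of proportions per study; rows reindexed into scale order,
# with options nobody chose filled in as 0
freq_table = pd.DataFrame(
    {
        col: toy_df[col].value_counts(normalize=True).reindex(scale, fill_value=0.0)
        for col in toy_df.columns
    }
)
print(freq_table.round(2))
```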

#### Study comparison

In [None]:
# Numeric coding of the five-point acceptability scale
likert_map = {'Completely unacceptable': -2,
              'Somewhat unacceptable': -1,
              'Neutral': 0,
              'Somewhat acceptable': 1,
              'Completely acceptable': 2}

study_1_ethic_acc_num = digex_df['study_1_ethic_acc'].map(likert_map)
study_2_ethic_acc_num = digex_df['study_2_ethic_acc'].map(likert_map)
study_3_ethic_acc_num = digex_df['study_3_ethic_acc'].map(likert_map)
study_4_ethic_acc_num = digex_df['study_4_ethic_acc'].map(likert_map)

In [None]:
print("Mean Study 1 acceptance:", study_1_ethic_acc_num.mean())
print("Mean Study 2 acceptance:", study_2_ethic_acc_num.mean())
print("Mean Study 3 acceptance:", study_3_ethic_acc_num.mean())
print("Mean Study 4 acceptance:", study_4_ethic_acc_num.mean())

In [None]:
fig, ax = plt.subplots()

x = [1,2,3,4]
y = [study_1_ethic_acc_num.mean(), 
     study_2_ethic_acc_num.mean(),
     study_3_ethic_acc_num.mean(),
     study_4_ethic_acc_num.mean()]
yerr = [study_1_ethic_acc_num.std(), 
     study_2_ethic_acc_num.std(),
     study_3_ethic_acc_num.std(),
     study_4_ethic_acc_num.std()] 

ax.errorbar(x, y, yerr, fmt='o', linewidth=2, capsize=6)

ax.set(xlim=(0, 5), xticks=np.arange(1, 5),
       ylim=(-2, 3), yticks=np.arange(-2, 3))

plt.xlabel('Study Number') 
plt.ylabel('Ethical Acceptance ±1 SD') 

plt.show()

In [None]:
# Split respondents by self-reported political views (neutral respondents are excluded)
cons = digex_df[digex_df['politic_views'].isin(['Very conservative', 'Slightly conservative'])]
libs = digex_df[digex_df['politic_views'].isin(['Very liberal', 'Slightly liberal'])]


In [None]:
# Numeric coding of the five-point acceptability scale
likert_map = {'Completely unacceptable': -2,
              'Somewhat unacceptable': -1,
              'Neutral': 0,
              'Somewhat acceptable': 1,
              'Completely acceptable': 2}

study_1_ethic_acc_cons = cons['study_1_ethic_acc'].map(likert_map)
study_2_ethic_acc_cons = cons['study_2_ethic_acc'].map(likert_map)
study_3_ethic_acc_cons = cons['study_3_ethic_acc'].map(likert_map)
study_4_ethic_acc_cons = cons['study_4_ethic_acc'].map(likert_map)

In [None]:
print("Conservatives mean Study 1 acceptance:", study_1_ethic_acc_cons.mean())
print("Conservatives mean Study 2 acceptance:", study_2_ethic_acc_cons.mean())
print("Conservatives mean Study 3 acceptance:", study_3_ethic_acc_cons.mean())
print("Conservatives mean Study 4 acceptance:", study_4_ethic_acc_cons.mean())

In [None]:
fig, ax = plt.subplots()

x = [1,2,3,4]
y = [study_1_ethic_acc_cons.mean(), 
     study_2_ethic_acc_cons.mean(),
     study_3_ethic_acc_cons.mean(),
     study_4_ethic_acc_cons.mean()]
yerr = [study_1_ethic_acc_cons.std(), 
     study_2_ethic_acc_cons.std(),
     study_3_ethic_acc_cons.std(),
     study_4_ethic_acc_cons.std()] 

ax.errorbar(x, y, yerr, fmt='o', linewidth=2, capsize=6)

ax.set(xlim=(0, 5), xticks=np.arange(1, 5),
       ylim=(-2, 3), yticks=np.arange(-2, 3))

plt.title('Conservatives (n = 131)') 
plt.xlabel('Study Number') 
plt.ylabel('Ethical Acceptance ±1 SD') 

plt.show()

In [None]:
# Numeric coding of the five-point acceptability scale
likert_map = {'Completely unacceptable': -2,
              'Somewhat unacceptable': -1,
              'Neutral': 0,
              'Somewhat acceptable': 1,
              'Completely acceptable': 2}

study_1_ethic_acc_libs = libs['study_1_ethic_acc'].map(likert_map)
study_2_ethic_acc_libs = libs['study_2_ethic_acc'].map(likert_map)
study_3_ethic_acc_libs = libs['study_3_ethic_acc'].map(likert_map)
study_4_ethic_acc_libs = libs['study_4_ethic_acc'].map(likert_map)

In [None]:
print("Liberals mean Study 1 acceptance:", study_1_ethic_acc_libs.mean())
print("Liberals mean Study 2 acceptance:", study_2_ethic_acc_libs.mean())
print("Liberals mean Study 3 acceptance:", study_3_ethic_acc_libs.mean())
print("Liberals mean Study 4 acceptance:", study_4_ethic_acc_libs.mean())

In [None]:
fig, ax = plt.subplots()

x = [1,2,3,4]
y = [study_1_ethic_acc_libs.mean(), 
     study_2_ethic_acc_libs.mean(),
     study_3_ethic_acc_libs.mean(),
     study_4_ethic_acc_libs.mean()]
yerr = [study_1_ethic_acc_libs.std(), 
     study_2_ethic_acc_libs.std(),
     study_3_ethic_acc_libs.std(),
     study_4_ethic_acc_libs.std()] 

ax.errorbar(x, y, yerr, fmt='o', linewidth=2, capsize=6)

ax.set(xlim=(0, 5), xticks=np.arange(1, 5),
       ylim=(-2, 3), yticks=np.arange(-2, 3))

plt.title('Liberals (n = 276)') 
plt.xlabel('Study Number') 
plt.ylabel('Ethical Acceptance ±1 SD') 

plt.show()
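The per-group means and standard deviations computed above could also be obtained in one grouped summary, which makes the conservative and liberal figures easier to compare. This is a minimal sketch on hypothetical toy data: the labels follow the notebook, while the `group` column, the `study_1_num` column, and the responses are illustrative assumptions.

```python
import pandas as pd

# Numeric coding of the five-point acceptability scale
likert_map = {
    "Completely unacceptable": -2,
    "Somewhat unacceptable": -1,
    "Neutral": 0,
    "Somewhat acceptable": 1,
    "Completely acceptable": 2,
}

# Collapse the four partisan labels into two groups; any other label maps to NaN
group_map = {
    "Very conservative": "Conservative",
    "Slightly conservative": "Conservative",
    "Very liberal": "Liberal",
    "Slightly liberal": "Liberal",
}

# Hypothetical toy responses standing in for digex_df
toy_df = pd.DataFrame(
    {
        "politic_views": [
            "Very conservative",
            "Slightly conservative",
            "Very liberal",
            "Slightly liberal",
            "Neutral/ Neither conservative or liberal",
        ],
        "study_1_ethic_acc": [
            "Neutral",
            "Somewhat acceptable",
            "Completely acceptable",
            "Somewhat acceptable",
            "Neutral",
        ],
    }
)

toy_df["group"] = toy_df["politic_views"].map(group_map)
toy_df["study_1_num"] = toy_df["study_1_ethic_acc"].map(likert_map)

# groupby drops rows whose group is NaN, so neutral respondents are excluded
summary = toy_df.groupby("group")["study_1_num"].agg(["mean", "std", "count"])
print(summary)
```

The `count` column gives each group's n directly, which could replace the hard-coded values in the plot titles.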

### Study design features

### Ethical priorities

In [None]:
ranks = digex_df[['rank_sci_repro', 'rank_resp', 'rank_just', 'rank_anony', 'rank_harms', 'rank_balance', 'rank_pub_interst']]


Ranks by means (lower value = higher priority).

In [None]:
ranks_means = ranks.mean()
ranks_means.sort_values(inplace=True, ascending=True)
ranks_means

Ranks by Borda count (higher value = higher priority).

In [None]:
ranks_borda = ranks.replace({1:6, 2:5, 3:4, 4:3, 5:2, 6:1, 7:0})    
ranks_borda = ranks_borda.sum()
ranks_borda.sort_values(inplace=True, ascending=False)
ranks_borda
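For reference, the Borda scoring used above gives an item ranked r among n items a score of n − r, so with seven items rank 1 earns 6 points and rank 7 earns 0, which is what the replacement dictionary encodes. Below is a minimal sketch of the same computation on hypothetical toy rankings of three items; the column names are borrowed from the notebook and the ranks are made up.

```python
import pandas as pd

# Hypothetical toy rankings: three respondents each rank three items (1 = highest priority)
toy_ranks = pd.DataFrame(
    {
        "rank_anony": [1, 2, 1],
        "rank_harms": [2, 1, 3],
        "rank_just": [3, 3, 2],
    }
)

n_items = toy_ranks.shape[1]

# Borda points per response: n_items - rank, so rank 1 earns n_items - 1 and the last rank earns 0
borda = (n_items - toy_ranks).sum().sort_values(ascending=False)
print(borda)
```

Missing ranks stay missing after the subtraction and are skipped by `sum()`, so items that some respondents left unranked receive fewer points in total.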