# Exploratory analysis 

This notebook contains exploratory data analysis code for the Digital experiments survey project.

# TO DO

Taken from [2022-digex-study-design](https://docs.google.com/document/d/1nsaXEn04s9LTsjqrpbUpZlyQ3CAcuqkOwy1ZM0d-kKY/edit#):

### Data Analysis Methods

- To address question 1, we will provide frequency tables or plots and descriptive statistics (M, SD, range) for the variables: awareness of the fact that academic researchers use social media data, awareness of the advantages of social media data that account for why academics collect them, awareness of social media data use, and awareness of social media interaction methods. 

- To address question 2, we will provide frequency tables or plots and descriptive statistics (M, SD, range) for each of the 4 vignette studies presented in section 2 of the survey for the variable: attitudes towards actual research studies. 

- To address question 3, we will provide frequency tables or plots for the variables attitudes towards study design factors and attitudes towards ethical principles. Moreover, we will also use a mixed-methods approach to analyze open-ended free-text responses (see below).

- To further address questions 1-3, open-ended free-text inputs from both the open “other” options of selection items and the open-ended free-text answers (i.e., what do you think it means for an academic study to receive “ethical approval”, describe any concerns you might have, what additional information about the study or the researchers would influence your level of concern, are there any other features of research that are important for determining your level of concern, are there any additional factors that you think should be considered) will be analyzed using a mixed-methods approach to detect common topics, sentiments, and themes.
 
- The mixed-methods analysis approach will combine quantitative textual analysis, specifically topic (detection) modeling and sentiment analysis, with qualitative coding, specifically structured tabular thematic analysis (ST-TA). Topic modeling will involve applying Latent Dirichlet Allocation to identify the words within each open-ended free-text response that are most frequently associated with each of the k identified topics, and to produce the probability that each open-ended free-text response in the corpus is associated with each of the k topics (Jelodar et al. 2019). Sentiment analysis will involve assigning a polarity score in the range [−1, 1] using the Valence Aware Dictionary for sEntiment Reasoning (VADER) method. VADER is a sentiment analysis engine that combines lexicon-based methods with rule-based modeling built on human-validated rules (Elbagir and Yang 2019). Models will be implemented using the R package topicmodels and the Python package Natural Language Toolkit (NLTK).

- The purpose of manual coding is to validate the topic models, specifically the representativeness of the detected topics and the number of topics k, and to identify further themes (Züll 2016). To implement ST-TA we will follow the method described by Robinson (2021). One to two coders will qualitatively identify themes mentioned by the participants and subsequently categorize all responses accordingly to obtain a frequency measure. Specifically, coders will read the responses and add each theme as a column in a coding sheet. For each of the other responses, coders will record whether the respective theme appears (= 1) or not (= 0). If a coder encounters new relevant themes, they will be added and coded later. Nonsense responses will be excluded during coding. Ambiguities will be discussed and resolved in pairs. If no solution can be found, a third coder will be consulted. Results from the open-ended free-text items will be displayed in frequency tables or plots and word clouds, and summarized in quotes.

- To statistically explore the effect of demographic variables and of awareness and understanding of research using social media data, and to test whether response choices are independent of theme, topic, and sentiment prevalence, we will use Chi-square tests when variables are categorical and Spearman’s rho for correlational analysis with categorical and scale responses. The results of the analysis of “other” options will be added to the individual items’ response presentations (see above). All analysis scripts will be included in an open-access repository along with the replication data.
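
The descriptive and inferential steps listed above could be sketched in Python roughly as follows. This is a minimal illustration on made-up data, not the project's analysis script: the column names `sm_use`, `age`, and `aware_sm_use_score` mirror the processed data, while `theme_privacy` is a hypothetical 0/1 column standing in for one theme from an ST-TA coding sheet. It assumes `scipy` is installed alongside `pandas`.

```python
import pandas as pd
from scipy import stats

# Made-up toy data for illustration only; theme_privacy is a hypothetical
# binary code (1 = theme present in the response, 0 = absent).
toy_df = pd.DataFrame({
    "sm_use": ["Facebook", "Twitter", "Facebook", "Facebook",
               "Twitter", "Twitter", "Facebook", "Twitter"],
    "age": [29, 33, 33, 73, 27, 45, 51, 38],
    "aware_sm_use_score": [9, 9, 5, 6, 9, 7, 4, 8],
    "theme_privacy": [1, 0, 1, 1, 0, 1, 1, 0],
})

# Descriptive statistics (M, SD, range) for a scale variable.
scores = toy_df["aware_sm_use_score"]
summary = {
    "M": scores.mean(),
    "SD": scores.std(),  # pandas default: sample SD (ddof=1)
    "range": (scores.min(), scores.max()),
}

# Frequency table for a categorical variable.
freq = toy_df["sm_use"].value_counts()

# Frequency measure for a coded theme: number of responses coded 1.
theme_count = int(toy_df["theme_privacy"].sum())

# Chi-square test of independence between a response choice and theme presence.
contingency = pd.crosstab(toy_df["sm_use"], toy_df["theme_privacy"])
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency)

# Spearman's rho between two ordinal/scale variables.
rho, rho_p = stats.spearmanr(toy_df["age"], toy_df["aware_sm_use_score"])

print(summary)
print(freq)
print(contingency)
print(f"chi2={chi2:.3f}, p={chi2_p:.3f}, dof={dof}")
print(f"rho={rho:.3f}, p={rho_p:.3f}")
```

With only a handful of rows the expected cell counts are very small, so the chi-square p-value here is not meaningful; the sketch only shows the shape of the calls.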



## Set working directory

In [1]:
import os
import pathlib

# Change the working directory to the parent of the current working directory
path = pathlib.Path.cwd().parent
os.chdir(path)

## Imports

In [2]:
import joypy
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

## Load processed data

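In [7]:
# NOTE: get_data_filepath and config are project-specific helpers that are not
# imported in the cells shown here; import them before running this cell.
processed_data_path = get_data_filepath(
    file=config.PROCESSED_DATA_FILEPATH,
    data_path=config.PROCESSED_DATA_DIR,
    main=False,
)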

digex_df = pd.read_csv(processed_data_path, index_col=0)

digex_df.head()

| Unnamed: 0 | sm_use | age | gender_id | ethnic_id | edu | politic_views | aware_sm_res | aware_sm_advan | aware_sm_interact | aware_sm_use | ... | rank_pub_interst | rank_add_fac_1 | rank_add_fac_1_pos | rank_add_fac_2 | rank_add_fac_2_pos | rank_add_fac_3 | rank_add_fac_3_pos | aware_sm_advan_score | aware_sm_interact_score | aware_sm_use_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Facebook | 29.0 | Male | Asian - Eastern | Highschool | Slightly liberal | Extremely aware | ['… are large and can contain millions of data... | ['Creating fake accounts ("bots")', 'Secretly ... | ['Political elections (e.g. voting behavior)',... | ... | 1.0 |  |  |  |  |  |  | 4 | 0 | 9 |
| 2 | Twitter | 33.0 | Male | Mixed race | Highschool | Neutral/ Neither conservative or liberal | Moderately aware | ['… are large and can contain millions of data... | ['Privately messaging users', "Publicly postin... | ['Political elections (e.g. voting behavior)',... | ... | 4.0 |  |  |  |  |  |  | 1 | 1 | 9 |
| 3 | Facebook | 33.0 | Female | Pacific Islander | Bachelor's degree | Very liberal | Extremely aware | ['… are large and can contain millions of data... | ['Privately messaging users', "Publicly postin... | ['Political elections (e.g. voting behavior)',... | ... | 1.0 |  |  |  |  |  |  | 2 | 2 | 5 |
| 4 | Facebook | 73.0 | Female | White / Caucasian | Highschool | Slightly conservative | Moderately aware | ['… are large and can contain millions of data... | ['Creating fake accounts ("bots")'] | ['Political elections (e.g. voting behavior)',... | ... | 1.0 |  | 8.0 |  |  |  |  | 1 | 1 | 6 |
| 5 | Twitter | 27.0 | Female | Native-American | Highschool | Very liberal | Extremely aware | ['… often capture social relationships not fou... | ['Privately messaging users', "Publicly postin... | ['Political elections (e.g. voting behavior)',... | ... | 7.0 |  |  |  |  |  |  | 0 | 3 | 9 |
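
The multi-select columns in the preview above (for example `aware_sm_interact`) appear to hold Python list literals stored as strings, so after `pd.read_csv` each cell would be a single string rather than a list. Assuming that is how they were serialized, a minimal sketch for turning them back into lists and counting how often each option was selected might look like this. The toy frame below is made up for illustration and only reuses option labels that are fully visible in the preview.

```python
import ast

import pandas as pd

# Made-up stand-in for one list-valued column of digex_df (assumption: cells
# are string representations of Python lists, as the preview suggests).
toy_df = pd.DataFrame({
    "aware_sm_interact": [
        "['Privately messaging users', 'Creating fake accounts (\"bots\")']",
        "['Privately messaging users']",
        "[]",
    ]
})

# Safely parse each string into a real Python list (literal_eval does not execute code).
parsed = toy_df["aware_sm_interact"].apply(ast.literal_eval)

# One row per selected option, then a frequency table; empty lists become NaN
# after explode() and are dropped by value_counts().
option_counts = parsed.explode().value_counts()

print(option_counts)
```

The same pattern could be applied to the other list-like columns before building frequency tables or plots, provided they follow the same string format.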
