## Prep for raking: FB data

The goal is to prepare the data for raking on four parameters: 1) age group, 2) gender, 3) marital status, 4) education level. <br>
Rows with missing values in the raking parameters need to be removed, and the result is written out as five separate dataframes (one per image condition) for raking.
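
For context on why the raking parameters must be free of missing values, here is a minimal sketch of raking (iterative proportional fitting). It is not the weighting code used downstream of this notebook: the `rake` helper, the toy rows, and the target shares are all hypothetical and chosen only for illustration.

```python
import pandas as pd

def rake(df, targets, max_iter=50, tol=1e-6):
    """Hypothetical helper: adjust row weights until weighted margins match targets."""
    weights = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_shift = 0.0
        for col, target in targets.items():
            # current weighted share of each category in this column
            current = weights.groupby(df[col]).sum() / weights.sum()
            # per-category adjustment factor, mapped back onto the rows
            factor = df[col].map(pd.Series(target) / current)
            weights = weights * factor
            max_shift = max(max_shift, (factor - 1).abs().max())
        if max_shift < tol:
            break
    return weights

# Toy data with made-up codes; every gender x marital cell is non-empty
toy = pd.DataFrame({
    'gender':  [1, 1, 1, 1, 1, 2, 2, 2],
    'marital': [1, 1, 1, 2, 2, 1, 2, 2],
})
# Made-up target shares, not real population margins
targets = {'gender': {1: 0.5, 2: 0.5}, 'marital': {1: 0.6, 2: 0.4}}

w = rake(toy, targets)
gender_share = w.groupby(toy['gender']).sum() / w.sum()
marital_share = w.groupby(toy['marital']).sum() / w.sum()
```

A row with a missing category would receive a missing adjustment factor in this sketch, which is one reason to drop such rows before weighting.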

In [1]:
import numpy as np
import pandas as pd

In [2]:
fb = pd.read_csv('../output/fb_numeric.csv')

In [3]:
fb.shape

(1777, 43)

In [4]:
fb.columns

Index(['Unnamed: 0', 'StartDate', 'EndDate', 'Status', 'IPAddress', 'Progress',
       'Duration__in_seconds_', 'Finished', 'RecordedDate', 'ResponseId',
       'RecipientLastName', 'RecipientFirstName', 'RecipientEmail',
       'ExternalReference', 'LocationLatitude', 'LocationLongitude',
       'DistributionChannel', 'UserLanguage', 'timer_First_Click',
       'timer_Last_Click', 'timer_Page_Submit', 'timer_Click_Count', 'Q1',
       'Q2', 'Q3', 'Q3_1', 'Q4', 'Q5', 'Q6', 'Q7_1', 'Q8', 'Q9', 'Q9_6_TEXT',
       'Q10', 'Q11', 'Q12', 'SC0', 'timeload', 'DeviceIdentifier',
       'ipaddress_0', 'ResponseID_0', 'Week', 'Image'],
      dtype='object')

In [5]:
fb = fb[['Finished', 'RecordedDate', 'Q1', 'Q2', 'Q3', 'Q3_1', 'Q4', 'Q5', 'Q6',
         'Q7_1', 'Q8', 'Q9', 'Q9_6_TEXT', 'Q10', 'Q11', 'Q12', 'Week', 'Image']]

In [6]:
fb.columns = ['Finished', 'RecordedDate', 'Q1', 'Q2', 'Q3', 'Q3_1', 'Q4', 'Q5', 'Q6',
              'Q7_1', 'Q8', 'gender', 'gender_text', 'marital', 'age_group', 'education', 'Week', 'Image']

In [7]:
fb['Finished'].value_counts()

True     1359
False     418
Name: Finished, dtype: int64

In [8]:
# remove breakoffs (responses where Finished is False)
fb_complete = fb.loc[fb['Finished']==True, :]

In [9]:
demographic_cols = ['gender', 'marital', 'age_group', 'education']
for col in demographic_cols:
    print(fb_complete[col].value_counts(dropna=False).sort_values())

4.0      3
3.0      5
5.0     14
NaN     18
6.0     23
2.0    503
1.0    793
Name: gender, dtype: int64
3.0      8
NaN     22
2.0    438
1.0    891
Name: marital, dtype: int64
1.0      4
NaN     19
2.0     32
3.0     51
4.0     98
5.0    219
6.0    451
7.0    485
Name: age_group, dtype: int64
1.0     12
2.0     98
4.0    140
5.0    157
3.0    244
NaN    708
Name: education, dtype: int64


*Data cleaning strategy: <br>
'gender': keep only values 1 and 2; drop all other values and missing <br>
'marital': keep only values 1 and 2; drop value 3 and missing <br>
'age_group': drop value 1 and missing <br>
'education': 708 of 1359 values are missing; find out what the missingness correlates with, keep the column for now, it may not be usable for raking*
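
The strategy above can be sketched on a few toy rows. This is only an illustration: the `toy` dataframe is hypothetical, and the notebook itself applies the rules in two steps (a `dropna` followed by a comparison filter) rather than with `isin`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that reuse the survey's column names and codes
toy = pd.DataFrame({
    'gender':    [1.0, 2.0, 4.0, np.nan, 1.0, 2.0],
    'marital':   [1.0, 2.0, 1.0, 1.0, 3.0, 2.0],
    'age_group': [7.0, 1.0, 5.0, 6.0, 4.0, 2.0],
    'education': [np.nan, 3.0, 2.0, 1.0, np.nan, 5.0],
})

keep = (
    toy['gender'].isin([1, 2])        # NaN is not in the list, so missing is dropped too
    & toy['marital'].isin([1, 2])
    & toy['age_group'].notna()
    & (toy['age_group'] != 1)
)
toy_clean = toy[keep]                 # education is left untouched, missing values included
```

Only the first and last toy rows pass all three rules, and the first one still has a missing `education` value, mirroring the decision to keep that column for now.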

In [10]:
# check missing data in education: recode NaN as 99 so it appears as its own row in the crosstab
fb_educ_missing = fb_complete[['education', 'RecordedDate']].fillna(99)

In [11]:
# education is missing for every response in the first week (2020-05-14 to 2020-05-20)
# and for 28 responses on 2020-05-21; it is rarely missing afterwards
pd.crosstab(fb_educ_missing['education'], fb_educ_missing['RecordedDate'])

| education (rows) / RecordedDate (columns) | 2020-05-14 | 2020-05-15 | 2020-05-16 | 2020-05-17 | 2020-05-18 | 2020-05-19 | 2020-05-20 | 2020-05-21 | 2020-05-22 | 2020-05-23 | 2020-05-24 | 2020-05-25 | 2020-05-26 | 2020-05-27 | 2020-05-28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 2 | 3 | 1 | 1 |
| 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 15 | 9 | 21 | 17 | 19 | 8 | 4 |
| 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 36 | 37 | 40 | 31 | 47 | 35 | 5 |
| 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 23 | 20 | 23 | 20 | 27 | 17 | 2 |
| 5.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 28 | 24 | 17 | 27 | 19 | 25 | 7 |
| 99.0 | 60 | 96 | 101 | 114 | 124 | 85 | 88 | 28 | 2 | 2 | 1 | 3 | 1 | 3 | 0 |
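
An alternative way to look at the same pattern is the share of missing `education` values per date, which avoids the placeholder code 99. The sketch below uses a hypothetical four-row `toy` dataframe rather than the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: two dates, with education missing more often on the first
toy = pd.DataFrame({
    'RecordedDate': ['2020-05-14', '2020-05-14', '2020-05-21', '2020-05-21'],
    'education':    [np.nan, np.nan, 3.0, np.nan],
})

# isna() gives True/False per row; the mean of booleans per date is the missing share
missing_share = toy['education'].isna().groupby(toy['RecordedDate']).mean()
print(missing_share)
```

Applied to `fb_complete`, the same expression should give one value per `RecordedDate`, assuming that column holds dates without a time component, as the crosstab above suggests.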


In [12]:
# remove rows where any of gender, marital, or age_group is missing
fb_dropna = fb_complete.dropna(subset=['gender', 'marital', 'age_group'])
fb_dropna.shape

(1333, 18)

In [13]:
# apply the cleaning strategy: keep gender in {1, 2}, marital in {1, 2}, and age_group >= 2
fb_cleandemo = fb_dropna[(fb_dropna['gender'] < 3) & (fb_dropna['marital'] < 3) & (fb_dropna['age_group'] >= 2)]

In [14]:
# split into 5 dataframes for weighting, one per image condition
dflist = list(fb_cleandemo.groupby('Image'))

In [15]:
condition_codebook = {1: 'control',
                      2: 'covid',
                      3: 'privacy',
                      4: 'finance',
                      5: 'mental'}

In [16]:
condition_dfs = {}

for image_number, condition_df in dflist:
    image_text = condition_codebook[image_number]
    condition_dfs[image_text] = condition_df
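
The lookup `condition_codebook[image_number]` raises a `KeyError` if `Image` contains a code that is not in the codebook. A small guard before the split can make that failure easier to read. This is a sketch on a hypothetical `toy` dataframe, and the name `condition_dfs_toy` is made up for the example.

```python
import pandas as pd

condition_codebook = {1: 'control', 2: 'covid', 3: 'privacy', 4: 'finance', 5: 'mental'}

# Hypothetical toy rows with three of the five conditions present
toy = pd.DataFrame({'Image': [1, 2, 2, 5], 'gender': [1.0, 2.0, 1.0, 2.0]})

# Fail early, with a readable message, if an unexpected code shows up
unknown = set(toy['Image'].dropna().unique().tolist()) - set(condition_codebook)
assert not unknown, f'Image codes missing from codebook: {unknown}'

# Same split as the loop above, written as a dict comprehension
condition_dfs_toy = {
    condition_codebook[image_number]: condition_df
    for image_number, condition_df in toy.groupby('Image')
}
```

Note that `groupby` skips rows whose key is missing by default, so any row with a missing `Image` value would not land in any of the output dataframes.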

In [17]:
# check data shape
for image_text, condition_df in condition_dfs.items():
    print(image_text)
    print(condition_df.shape)

control
(331, 18)
covid
(554, 18)
privacy
(59, 18)
finance
(142, 18)
mental
(196, 18)


In [18]:
#for image_text, condition_df in condition_dfs.items():
    #for col in demographic_cols:
        #print(image_text)
        #print(condition_df[col].value_counts(dropna=False).sort_values())

In [19]:
# output 5 datasets
for image_text, condition_df in condition_dfs.items():
    condition_df.to_csv(f'../output/{image_text}.csv', index=False)
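
Writing with `index=False` keeps the row index out of the CSV. The `'Unnamed: 0'` column in the input file read at the top of this notebook is the kind of column that appears when a file is written with its index and then read back. The sketch below shows the difference on a hypothetical two-row dataframe, using in-memory buffers instead of files.

```python
import io
import pandas as pd

# Hypothetical toy dataframe with a non-default index
toy = pd.DataFrame({'gender': [1.0, 2.0]}, index=[10, 20])

# Round-trip through CSV text, with and without the index
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```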