## Consumption of Pornographic Content by Age and Education

In [1]:
import pandas as pd
import gc

### Load Data

The 2004 comScore data are already grouped by machine_id and domain_name. The data have four columns:

- machine_id
- domain_name
- total_time: total duration of visits to the site
- total_visits: number of visits

Each row gives the total visits to and total duration spent on a domain.

We merge this with data from pydomains and the Trusted Source API. We then calculate the time spent on and total visits to pornographic domains based on a variety of measures. We also create two other columns that track the proportion of visits and the proportion of time.

In [2]:
YEAR = 2004

In [4]:
# Read in data
idf = pd.read_csv('/opt/data/comscore/pydomains/app2/cs%04d_grp_machine_domain.csv.bz2' % YEAR)

In [5]:
# Load the pydomains data
pydom_df = pd.read_csv('/opt/data/comscore/pydomains/cs%04d_unique_domains_pydomains.csv.bz2' % YEAR, usecols=['domain_names', 'shalla_2017_cat', 'pred_shalla_2017_lab', 'pred_shalla_2017_prob_porn', 'pred_toulouse_2017_prob_adult'], encoding='latin1')
# rename column
pydom_df.rename(columns={'domain_names': 'domain_name'}, inplace=True)

In [6]:
# Load trusted data
tdf = pd.read_csv('/opt/data/comscore/pydomains/comScore_unique_2004.csv')
# rename column
tdf.rename(columns={'unique_url': 'domain_name', 'url_class': 'trusted_cat'}, inplace=True)

### Left join Trusted Source and PyDomains

In [7]:
# Left join with pydomain
pdf = idf.merge(pydom_df, how='left', on='domain_name')

# Left join with Trusted
pdf = pdf.merge(tdf, how='left', on='domain_name')

For domains that are already in the labeled datasets, we use the labels from there.

In [8]:
# -1 for Unknown, 0 - No, 1 - Yes
pdf['shalla_trusted_porn'] = 0
pdf.loc[pdf.shalla_2017_cat.isnull() & (pdf.trusted_cat.isnull() | (pdf.trusted_cat == 'UNKNOWN')), 'shalla_trusted_porn'] = -1
pdf.loc[pdf.shalla_2017_cat.notnull() & pdf.shalla_2017_cat.str.contains('porn', case=False), 'shalla_trusted_porn'] = 1
pdf.loc[pdf.trusted_cat.notnull() & pdf.trusted_cat.str.contains('porn', case=False), 'shalla_trusted_porn'] = 1

pdf[['shalla_2017_cat', 'trusted_cat', 'shalla_trusted_porn']].head()

| | shalla_2017_cat | trusted_cat | shalla_trusted_porn |
|---|---|---|---|
| 0 | | UNKNOWN | -1 |
| 1 | porn\|hobby/games-online | Games | 1 |
| 2 | | Internet Services | 0 |
| 3 | | Online Shopping | 0 |
| 4 | porn | Pornography | 1 |
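
The labeling rule can be checked on a few hand-made rows. The sketch below is illustrative only: the `toy` frame and the `combine_labels` helper are hypothetical names, and the rows are made up to exercise each branch (no label in either source, porn in Shallalist, porn in Trusted Source only, and a non-porn label).

```python
import pandas as pd

# Toy rows mirroring the two label columns (values are made up for illustration).
toy = pd.DataFrame({
    'shalla_2017_cat': [None, 'porn|hobby/games-online', None, 'shopping'],
    'trusted_cat': ['UNKNOWN', 'Games', 'Pornography', None],
})

def combine_labels(df):
    """Return 1 if either source says porn, -1 if neither source has a label, else 0."""
    shalla_porn = df['shalla_2017_cat'].str.contains('porn', case=False, na=False)
    trusted_porn = df['trusted_cat'].str.contains('porn', case=False, na=False)
    no_label = df['shalla_2017_cat'].isnull() & (
        df['trusted_cat'].isnull() | (df['trusted_cat'] == 'UNKNOWN'))
    out = pd.Series(0, index=df.index)
    out[no_label] = -1                      # nothing known about the domain
    out[shalla_porn | trusted_porn] = 1     # a porn label from either source wins
    return out

toy['shalla_trusted_porn'] = combine_labels(toy)
print(toy)
```

Passing `na=False` to `str.contains` makes the separate `notnull()` guard unnecessary, which is the only design difference from the cell above.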


### Unique domain names

In [9]:
udf = pdf.drop_duplicates(subset='domain_name').copy()
# Total unique domains
len(udf)

1011145

### Total number of domains for which the label must be imputed

Curated lists generally only carry information about the kind of content hosted by a small fraction of domains. Commercial APIs are generally a lot better but still miss a sizable chunk. In Shallalist, for instance, only about 22% of the domains in the data have a category assigned to them (see below). For Trusted Source, the corresponding number is about 84%. In all, we know the category of about 85% of the domains.

In [10]:
# -1 for Unknown, 0 - No, 1 - Yes
udf['shalla_cat_porn'] = -1
udf.loc[udf.shalla_2017_cat.notnull() & (udf.shalla_2017_cat.str.contains('unknown', case=False) == False), 'shalla_cat_porn'] = 0
udf.loc[udf.shalla_2017_cat.notnull() & udf.shalla_2017_cat.str.contains('porn', case=False), 'shalla_cat_porn'] = 1
udf.groupby('shalla_cat_porn').agg({'domain_name': 'count'})/udf.shape[0]

| shalla_cat_porn | domain_name |
|---|---|
| -1 | 0.780725 |
| 0 | 0.096059 |
| 1 | 0.123216 |


In [11]:
# -1 for Unknown, 0 - No, 1 - Yes
udf['trusted_cat_porn'] = -1
udf.loc[udf.trusted_cat.notnull() & (udf.trusted_cat.str.contains('unknown', case=False) == False), 'trusted_cat_porn'] = 0
udf.loc[udf.trusted_cat.notnull() & udf.trusted_cat.str.contains('porn', case=False), 'trusted_cat_porn'] = 1
udf.groupby('trusted_cat_porn').agg({'domain_name': 'count'})/udf.shape[0]

| trusted_cat_porn | domain_name |
|---|---|
| -1 | 0.156131 |
| 0 | 0.691121 |
| 1 | 0.152748 |


In [12]:
udf.groupby('shalla_trusted_porn').agg({'domain_name': 'count'})/udf.shape[0]

| shalla_trusted_porn | domain_name |
|---|---|
| -1 | 0.154937 |
| 0 | 0.674917 |
| 1 | 0.170146 |
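
The share of domains with a known category is one minus the `-1` share in each of the three outputs. A small arithmetic sketch, with the unknown shares copied from those outputs (the dictionary names are hypothetical):

```python
# Share of domains labeled -1 (unknown) under each source, copied from the outputs.
unknown_share = {
    'shalla': 0.780725,
    'trusted': 0.156131,
    'combined': 0.154937,
}

# Known share = 1 - unknown share, rounded to three decimals.
known_share = {name: round(1 - share, 3) for name, share in unknown_share.items()}
print(known_share)
```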


### Impact of Different Cut-offs

Next, we use the labeled data (from Trusted Source and Shallalist) to pick different probability cut-offs and test how inferences change. We choose three: one that minimizes FP+FN, one that gives us far fewer false positives (FP), and one that gives us far fewer false negatives (FN). (We cast a wide net.)

To get the value that minimizes FP+FN for a particular category in a multi-class prediction problem, we [run an optimization algorithm](https://github.com/soodoku/optimal_softmax_cutoffs).
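
To make the idea concrete, here is a minimal sketch of picking a cutoff that minimizes FP+FN. It is not the linked repository's algorithm: it is a plain grid scan for the binary case, and `min_error_cutoff` and the toy labels and probabilities are hypothetical.

```python
import numpy as np

def min_error_cutoff(y_true, prob, grid=None):
    """Return (cutoff, errors) where cutoff minimizes FP + FN over a fixed grid."""
    y_true = np.asarray(y_true).astype(bool)
    prob = np.asarray(prob, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)   # candidate cutoffs 0.00, 0.01, ..., 1.00
    best_cut, best_err = None, None
    for cut in grid:
        pred = prob > cut                   # same rule as the notebook: strictly greater
        fp = int(np.sum(pred & ~y_true))    # predicted porn, labeled not porn
        fn = int(np.sum(~pred & y_true))    # predicted not porn, labeled porn
        if best_err is None or fp + fn < best_err:
            best_cut, best_err = float(cut), fp + fn
    return best_cut, best_err

# Toy labels (1 = porn) and predicted probabilities, made up for illustration.
y = [0, 0, 0, 1, 1, 0, 1]
p = [0.10, 0.40, 0.85, 0.95, 0.97, 0.60, 0.92]
cutoff, errors = min_error_cutoff(y, p)
print(cutoff, errors)
```

When several cutoffs tie, the scan keeps the smallest one; a different tie-breaking rule would shift the reported cutoff without changing the error count.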

In [13]:
# Filter out unknown (-1)
odf = udf[udf.shalla_trusted_porn != -1][['shalla_trusted_porn', 'pred_shalla_2017_prob_porn', 'pred_toulouse_2017_prob_adult']].copy()
odf.head()

Unnamed: 0,shalla_trusted_porn,pred_shalla_2017_prob_porn,pred_toulouse_2017_prob_adult
1,1,0.224025,0.411841
2,0,0.906646,0.99727
3,0,0.147073,0.330262
4,1,0.987619,0.953945
5,0,0.159574,0.322767


In [14]:
prob_shalla = {}
# the prob. threshold that gives the minimum FN+FP
prob_shalla['prob_shalla_min_fn_fp'] = 0.91

In [15]:
# FIXME: cutoffs picked by hand, a lower one to reduce FN and a higher one to reduce FP
prob_shalla['prob_shalla_reduce_fn'] = 0.5
prob_shalla['prob_shalla_reduce_fp'] = 0.99

In [16]:
prob_toulouse = {}
# the prob. threshold that gives the minimum FN+FP
prob_toulouse['prob_toulouse_min_fn_fp'] = 0.91

In [17]:
# We try different probability cutoffs to show how inferences change based on trading false positives for false negatives.
# Shalla model
for c in prob_shalla:
    prob = prob_shalla[c]
    print(c, prob)
    pdf.loc[pdf.pred_shalla_2017_prob_porn <= prob, c] = False
    pdf.loc[pdf.pred_shalla_2017_prob_porn > prob, c] = True
    pdf.loc[pdf.shalla_trusted_porn == 1, c] = True
    pdf.loc[pdf.shalla_trusted_porn == 0, c] = False

prob_shalla_min_fn_fp 0.91
prob_shalla_reduce_fn 0.5
prob_shalla_reduce_fp 0.99


In [18]:
# Toulouse model
for c in prob_toulouse:
    prob = prob_toulouse[c]
    print(c, prob)
    pdf.loc[pdf.pred_toulouse_2017_prob_adult <= prob, c] = False
    pdf.loc[pdf.pred_toulouse_2017_prob_adult > prob, c] = True
    pdf.loc[pdf.shalla_trusted_porn == 1, c] = True
    pdf.loc[pdf.shalla_trusted_porn == 0, c] = False

prob_toulouse_min_fn_fp 0.91


In [19]:
agg_sum = {'total_time': sum, 'total_visits': sum}
for c in prob_shalla:
    pdf.loc[pdf[c], 'total_time_porn_{}'.format(c)] = pdf['total_time']
    pdf.loc[pdf[c] == False, 'total_time_porn_{}'.format(c)] = 0
    pdf.loc[pdf[c], 'total_visits_porn_{}'.format(c)] = pdf['total_visits']
    pdf.loc[pdf[c] == False, 'total_visits_porn_{}'.format(c)] = 0
    agg_sum['total_time_porn_{}'.format(c)] = sum
    agg_sum['total_visits_porn_{}'.format(c)] = sum

for c in prob_toulouse:
    pdf.loc[pdf[c], 'total_time_porn_{}'.format(c)] = pdf['total_time']
    pdf.loc[pdf[c] == False, 'total_time_porn_{}'.format(c)] = 0
    pdf.loc[pdf[c], 'total_visits_porn_{}'.format(c)] = pdf['total_visits']
    pdf.loc[pdf[c] == False, 'total_visits_porn_{}'.format(c)] = 0
    agg_sum['total_time_porn_{}'.format(c)] = sum
    agg_sum['total_visits_porn_{}'.format(c)] = sum

In [20]:
# Sum total_time, total_visits, and the porn-specific columns by machine_id
gdf = pdf.groupby(['machine_id']).agg(agg_sum)
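
The mask-then-sum step can be illustrated on a tiny frame. This is a sketch with hypothetical column names (`is_porn`, `porn_visits`); it uses `Series.where` in place of the paired `.loc` assignments, which should give the same result when the flag column has no missing values.

```python
import pandas as pd

toy = pd.DataFrame({
    'machine_id': [1, 1, 2, 2],
    'total_visits': [5, 3, 7, 1],
    'is_porn': [True, False, False, False],
})

# Keep the visit count on rows flagged as porn, zero elsewhere.
toy['porn_visits'] = toy['total_visits'].where(toy['is_porn'], 0)

# One row per machine with overall and porn-specific totals.
per_machine = toy.groupby('machine_id')[['total_visits', 'porn_visits']].sum()
print(per_machine)
```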

### Join with Household-Level Demographic Data

We join the data with demographic data at the household level, recoding demographic codes to semantic labels.

In [21]:
# Load household level demographics data
dem_df = pd.read_csv('/opt/data/comscore/demographics_by_machine_id/demographics%d.csv' % YEAR, usecols = ['machine_id', 'hoh_oldest_age', 'hoh_most_education'])

dem_df['hoh_oldest_age'] = dem_df['hoh_oldest_age'].replace({1: '18-20', 
                                                             2: '21-24',
                                                             3: '25-29', 
                                                             4: '30-34',
                                                             5: '35-39',
                                                             6: '40-44',
                                                             7: '45-49',
                                                             8: '50-54',
                                                             9: '55-59',
                                                             10: '60-64',
                                                             11: '65 and over',
                                                             99: 'Missing'})

# FIXME: recode the '**' placeholder as 99 ('Missing')
dem_df['hoh_most_education'] = dem_df['hoh_most_education'].astype(str).replace({'**': 99})

edu = {0: 'Less than a high school diploma',
           1: 'High school diploma or equivalent', 
           2: 'Some college but no degree', 
           3: 'Associate degree', 
           4: 'Bachelor’s degree',
           5: 'Graduate degree',
           99: 'Missing'}

dem_df['hoh_most_education'] = dem_df['hoh_most_education'].astype(int).replace(edu)

# Merge browsing data with demographics data
df = gdf.merge(dem_df, how = 'left', on = 'machine_id')
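
A possible alternative for the education recode is sketched below. It assumes that any non-numeric code, not just `'**'`, should be treated as missing, which the source does not state. The `raw` series is made up, and `edu_labels` repeats the `edu` mapping so the block runs on its own.

```python
import pandas as pd

edu_labels = {0: 'Less than a high school diploma',
              1: 'High school diploma or equivalent',
              2: 'Some college but no degree',
              3: 'Associate degree',
              4: 'Bachelor’s degree',
              5: 'Graduate degree',
              99: 'Missing'}

# Made-up raw codes, including the '**' placeholder.
raw = pd.Series(['0', '3', '**', '5', '1'])

# Non-numeric codes become NaN, then 99 (assumption: they all mean "missing").
codes = pd.to_numeric(raw, errors='coerce').fillna(99).astype(int)
labels = codes.map(edu_labels)

# An ordered categorical keeps groupby output in education order without reindexing.
ordered = pd.Categorical(labels, categories=list(edu_labels.values()), ordered=True)
print(labels.tolist())
```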

### Total time spent (total number of visits) on pornographic domains

Given the potential skew in these numbers, we also show quartiles.

In [22]:
df.head()

Column headers drop the `_porn_prob_` infix: `total_time_shalla_min_fn_fp` stands for `total_time_porn_prob_shalla_min_fn_fp`, and so on.

| | machine_id | total_time | total_visits | total_time_shalla_min_fn_fp | total_visits_shalla_min_fn_fp | total_time_shalla_reduce_fn | total_visits_shalla_reduce_fn | total_time_shalla_reduce_fp | total_visits_shalla_reduce_fp | total_time_toulouse_min_fn_fp | total_visits_toulouse_min_fn_fp | hoh_most_education | hoh_oldest_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 4245 | 615 | 1157.0 | 265.0 | 1165.0 | 270.0 | 1157.0 | 265.0 | 1159.0 | 267.0 | Some college but no degree | 45-49 |
| 1 | 2715 | 9419 | 920 | 416.0 | 144.0 | 488.0 | 153.0 | 416.0 | 144.0 | 417.0 | 145.0 | Missing | 25-29 |
| 2 | 3086 | 421 | 137 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | Missing | 35-39 |
| 3 | 3325 | 21892 | 3302 | 169.0 | 76.0 | 415.0 | 200.0 | 116.0 | 60.0 | 188.0 | 91.0 | High school diploma or equivalent | 65 and over |
| 4 | 3939 | 25430 | 4779 | 211.0 | 73.0 | 480.0 | 182.0 | 203.0 | 65.0 | 216.0 | 77.0 | High school diploma or equivalent | 45-49 |


We select the columns that we intend to show.

In [23]:
sel_cols = pd.IndexSlice[:, ['mean', '25%', '50%', '75%']]
grp_visits = ['total_visits_porn_prob_shalla_reduce_fn', 'total_visits_porn_prob_shalla_reduce_fp', 'total_visits_porn_prob_shalla_min_fn_fp', 'total_visits_porn_prob_toulouse_min_fn_fp']
grp_time = ['total_time_porn_prob_shalla_reduce_fn', 'total_time_porn_prob_shalla_reduce_fp', 'total_time_porn_prob_shalla_min_fn_fp', 'total_time_porn_prob_toulouse_min_fn_fp']
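
`describe()` on a grouped frame returns columns with two levels (variable, statistic), and `pd.IndexSlice` picks statistics across all variables at once. A minimal sketch on made-up data, with hypothetical names `toy`, `stats`, and `subset`:

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'visits': [1, 2, 300, 4, 5, 6],
    'time': [10, 20, 3000, 40, 50, 60],
})

# Columns are a MultiIndex: ('visits', 'count'), ('visits', 'mean'), ...
stats = toy.groupby('group')[['visits', 'time']].describe()

# Keep only the mean and the median for every variable.
sel = pd.IndexSlice[:, ['mean', '50%']]
subset = stats.loc[:, sel]
print(subset)
```

In group `a` the single large value pulls the mean of `visits` far above the median, which is why the notebook reports quartiles next to the mean.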

### Average Number of Visits to Pornographic Sites by Age

Given the skew, we focus our discussion on the medians. A consistent pattern emerges across all four versions of our measure: households in the 18-20 bracket (by `hoh_oldest_age`) visit pornographic domains most often. After that, there is a sharp decline, followed by a modest upward trend that peaks at 40-44, after which the number of visits declines roughly monotonically. The same rough pattern holds for time spent.

Perhaps more importantly, the number of visits may seem pretty low. We concur, and that means the absolute size of the differences is pretty small too, even though the relative size may look big. The more serious concern is about the data, and we don't have a lot to say about it.
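
The gap between absolute and relative differences can be seen with two medians from the by-age visits table (the `shalla_reduce_fn` measure, 18-20 versus 60-64). The variable names are hypothetical; the two numbers are copied from that table.

```python
# Median visits under the shalla_reduce_fn measure, copied from the by-age table.
median_18_20 = 101
median_60_64 = 62

absolute_gap = median_18_20 - median_60_64   # difference in number of visits
relative_gap = median_18_20 / median_60_64   # ratio of the two medians

print(f"absolute gap: {absolute_gap} visits, ratio: {relative_gap:.2f}x")
```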

In [24]:
df.groupby(['hoh_oldest_age'])[grp_visits].describe().round(1).loc[:, sel_cols].astype('int')

Columns give the mean and quartiles of `total_visits_porn_prob_<measure>`; the `total_visits_porn_prob_` prefix is dropped from the headers.

| hoh_oldest_age | shalla_reduce_fn mean | shalla_reduce_fn 25% | shalla_reduce_fn 50% | shalla_reduce_fn 75% | shalla_reduce_fp mean | shalla_reduce_fp 25% | shalla_reduce_fp 50% | shalla_reduce_fp 75% | shalla_min_fn_fp mean | shalla_min_fn_fp 25% | shalla_min_fn_fp 50% | shalla_min_fn_fp 75% | toulouse_min_fn_fp mean | toulouse_min_fn_fp 25% | toulouse_min_fn_fp 50% | toulouse_min_fn_fp 75% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18-20 | 410 | 17 | 101 | 358 | 344 | 7 | 55 | 255 | 353 | 8 | 60 | 263 | 379 | 10 | 76 | 321 |
| 21-24 | 322 | 14 | 75 | 266 | 266 | 4 | 34 | 172 | 272 | 5 | 38 | 178 | 294 | 8 | 52 | 228 |
| 25-29 | 346 | 19 | 82 | 269 | 280 | 5 | 30 | 160 | 286 | 6 | 34 | 173 | 315 | 9 | 54 | 220 |
| 30-34 | 341 | 22 | 85 | 272 | 269 | 5 | 30 | 151 | 276 | 6 | 35 | 163 | 306 | 11 | 55 | 219 |
| 35-39 | 373 | 23 | 89 | 283 | 294 | 5 | 30 | 151 | 302 | 7 | 35 | 166 | 335 | 11 | 57 | 231 |
| 40-44 | 363 | 25 | 100 | 314 | 281 | 6 | 38 | 175 | 290 | 7 | 43 | 187 | 324 | 13 | 67 | 256 |
| 45-49 | 359 | 20 | 87 | 312 | 281 | 5 | 32 | 176 | 289 | 6 | 37 | 188 | 321 | 10 | 56 | 254 |
| 50-54 | 368 | 17 | 74 | 267 | 292 | 4 | 24 | 140 | 300 | 4 | 28 | 155 | 331 | 7 | 47 | 208 |
| 55-59 | 331 | 17 | 71 | 238 | 252 | 3 | 19 | 108 | 259 | 4 | 23 | 120 | 292 | 7 | 40 | 178 |
| 60-64 | 264 | 14 | 62 | 212 | 186 | 2 | 14 | 79 | 194 | 3 | 18 | 90 | 224 | 5 | 32 | 152 |


### Average Time Spent on Pornographic Sites by Age

In [25]:
df.groupby(['hoh_oldest_age'])[grp_time].describe().round(1).loc[:, sel_cols].astype('int')

Columns give the mean and quartiles of `total_time_porn_prob_<measure>`; the `total_time_porn_prob_` prefix is dropped from the headers.

| hoh_oldest_age | shalla_reduce_fn mean | shalla_reduce_fn 25% | shalla_reduce_fn 50% | shalla_reduce_fn 75% | shalla_reduce_fp mean | shalla_reduce_fp 25% | shalla_reduce_fp 50% | shalla_reduce_fp 75% | shalla_min_fn_fp mean | shalla_min_fn_fp 25% | shalla_min_fn_fp 50% | shalla_min_fn_fp 75% | toulouse_min_fn_fp mean | toulouse_min_fn_fp 25% | toulouse_min_fn_fp 50% | toulouse_min_fn_fp 75% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18-20 | 2381 | 66 | 456 | 1912 | 1896 | 19 | 230 | 1350 | 1956 | 24 | 251 | 1429 | 2086 | 29 | 304 | 1574 |
| 21-24 | 1769 | 50 | 316 | 1372 | 1294 | 9 | 129 | 871 | 1349 | 12 | 146 | 933 | 1501 | 19 | 204 | 1048 |
| 25-29 | 1889 | 73 | 342 | 1394 | 1389 | 12 | 122 | 845 | 1457 | 16 | 138 | 904 | 1580 | 29 | 201 | 1055 |
| 30-34 | 1935 | 84 | 379 | 1396 | 1446 | 13 | 126 | 792 | 1514 | 18 | 150 | 851 | 1659 | 32 | 213 | 1015 |
| 35-39 | 2263 | 92 | 381 | 1382 | 1699 | 13 | 126 | 769 | 1758 | 19 | 143 | 842 | 1896 | 32 | 211 | 1013 |
| 40-44 | 2228 | 97 | 427 | 1571 | 1592 | 16 | 153 | 906 | 1674 | 21 | 177 | 995 | 1838 | 38 | 255 | 1153 |
| 45-49 | 2071 | 74 | 367 | 1545 | 1587 | 11 | 125 | 897 | 1655 | 16 | 144 | 962 | 1779 | 28 | 214 | 1165 |
| 50-54 | 2237 | 59 | 308 | 1340 | 1630 | 7 | 95 | 721 | 1712 | 11 | 112 | 790 | 1835 | 19 | 165 | 927 |
| 55-59 | 1944 | 58 | 287 | 1158 | 1420 | 6 | 67 | 534 | 1489 | 8 | 80 | 603 | 1627 | 18 | 133 | 762 |
| 60-64 | 1601 | 48 | 252 | 1015 | 1120 | 3 | 48 | 389 | 1182 | 5 | 62 | 464 | 1286 | 11 | 110 | 639 |


### Average Number of Visits to Pornographic Sites by Education

As education levels increase, the average number of visits goes down. Households where the most educated person has a graduate degree visit pornographic sites less often and spend less time on them than households where the most educated person has less than a high school diploma.

In [26]:
df.groupby(['hoh_most_education'])[grp_visits].describe().round(1).loc[:, sel_cols].astype('int').reindex(edu.values())

Columns give the mean and quartiles of `total_visits_porn_prob_<measure>`; the `total_visits_porn_prob_` prefix is dropped from the headers.

| hoh_most_education | shalla_reduce_fn mean | shalla_reduce_fn 25% | shalla_reduce_fn 50% | shalla_reduce_fn 75% | shalla_reduce_fp mean | shalla_reduce_fp 25% | shalla_reduce_fp 50% | shalla_reduce_fp 75% | shalla_min_fn_fp mean | shalla_min_fn_fp 25% | shalla_min_fn_fp 50% | shalla_min_fn_fp 75% | toulouse_min_fn_fp mean | toulouse_min_fn_fp 25% | toulouse_min_fn_fp 50% | toulouse_min_fn_fp 75% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Less than a high school diploma | 374 | 25 | 108 | 374 | 294 | 7 | 41 | 212 | 303 | 9 | 48 | 237 | 335 | 13 | 77 | 308 |
| High school diploma or equivalent | 333 | 20 | 80 | 268 | 256 | 4 | 28 | 142 | 264 | 6 | 33 | 156 | 296 | 9 | 52 | 214 |
| Some college but no degree | 344 | 18 | 78 | 266 | 268 | 4 | 24 | 132 | 275 | 5 | 29 | 144 | 306 | 8 | 47 | 205 |
| Associate degree | 334 | 19 | 79 | 254 | 261 | 4 | 24 | 134 | 268 | 5 | 29 | 142 | 299 | 8 | 48 | 202 |
| Bachelor’s degree | 349 | 16 | 70 | 244 | 274 | 3 | 19 | 116 | 281 | 4 | 23 | 127 | 309 | 6 | 40 | 186 |
| Graduate degree | 305 | 13 | 60 | 223 | 233 | 2 | 14 | 92 | 239 | 2 | 18 | 100 | 267 | 4 | 32 | 161 |
| Missing | 347 | 21 | 89 | 297 | 271 | 5 | 32 | 166 | 279 | 6 | 37 | 178 | 311 | 10 | 59 | 242 |


### Average Time Spent on Pornographic Sites by Education

In [27]:
df.groupby(['hoh_most_education'])[grp_time].describe().round(1).loc[:, sel_cols].astype('int').reindex(edu.values())

Columns give the mean and quartiles of `total_time_porn_prob_<measure>`; the `total_time_porn_prob_` prefix is dropped from the headers.

| hoh_most_education | shalla_reduce_fn mean | shalla_reduce_fn 25% | shalla_reduce_fn 50% | shalla_reduce_fn 75% | shalla_reduce_fp mean | shalla_reduce_fp 25% | shalla_reduce_fp 50% | shalla_reduce_fp 75% | shalla_min_fn_fp mean | shalla_min_fn_fp 25% | shalla_min_fn_fp 50% | shalla_min_fn_fp 75% | toulouse_min_fn_fp mean | toulouse_min_fn_fp 25% | toulouse_min_fn_fp 50% | toulouse_min_fn_fp 75% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Less than a high school diploma | 2570 | 86 | 447 | 1880 | 1855 | 18 | 174 | 1118 | 1919 | 24 | 211 | 1193 | 2117 | 41 | 272 | 1335 |
| High school diploma or equivalent | 2044 | 75 | 338 | 1357 | 1471 | 10 | 107 | 748 | 1545 | 14 | 128 | 818 | 1696 | 25 | 196 | 980 |
| Some college but no degree | 1986 | 63 | 319 | 1260 | 1496 | 8 | 94 | 672 | 1546 | 11 | 112 | 740 | 1671 | 22 | 171 | 908 |
| Associate degree | 1878 | 73 | 327 | 1280 | 1443 | 8 | 92 | 689 | 1502 | 11 | 109 | 740 | 1635 | 22 | 177 | 900 |
| Bachelor’s degree | 1989 | 57 | 296 | 1185 | 1483 | 5 | 67 | 565 | 1552 | 8 | 83 | 621 | 1686 | 16 | 139 | 801 |
| Graduate degree | 1911 | 43 | 227 | 1061 | 1392 | 3 | 44 | 430 | 1429 | 4 | 59 | 495 | 1574 | 10 | 105 | 657 |
| Missing | 2036 | 78 | 376 | 1498 | 1469 | 12 | 131 | 850 | 1552 | 17 | 152 | 920 | 1679 | 30 | 218 | 1104 |


### Proportion of Time Spent on (Proportion of Visits to) Pornographic Domains

Do we see these patterns simply because certain people spend more time online? To check, we look at proportions.

The data are clear: as people get older, they generally spend a smaller proportion of their time on pornographic websites, with perceptible drop-offs after 50-54. Splitting by education shows a similar decline: households with higher education levels spend a smaller proportion of their time on pornographic domains.

In [28]:
grp_prop_visits = []
for g in grp_visits:
    df['prop_' + g] = df[g]/df['total_visits'] 
    grp_prop_visits.append('prop_'  + g)

grp_prop_time = []
for g in grp_time:
    df['prop_' + g] = df[g]/df['total_time'] 
    grp_prop_time.append('prop_'  + g)    
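
Two details of the proportion step are worth illustrating: a machine with a zero denominator, and the difference between the mean of per-machine proportions (what the tables report) and a pooled ratio of totals. The sketch uses made-up numbers and hypothetical column names; whether the real data contain zero totals is not stated in the source.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'total_visits': [100, 10, 0],
    'porn_visits': [10, 5, 0],
})

# Replace zero denominators with NaN so the proportion is missing, not inf or 0/0.
denom = toy['total_visits'].replace(0, np.nan)
toy['prop'] = toy['porn_visits'] / denom

# Mean of per-machine proportions: each machine counts equally (NaN is skipped).
mean_of_props = toy['prop'].mean()

# Pooled ratio: heavy users dominate.
pooled = toy['porn_visits'].sum() / toy['total_visits'].sum()

print(round(mean_of_props, 3), round(pooled, 3))
```

The two summaries answer different questions, so a gap between them is expected when heavy and light users differ in their mix of sites.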

#### By Age

In [29]:
df.groupby(['hoh_oldest_age'])[grp_prop_visits].mean().round(3)

| hoh_oldest_age | prop_total_visits_porn_prob_shalla_reduce_fn | prop_total_visits_porn_prob_shalla_reduce_fp | prop_total_visits_porn_prob_shalla_min_fn_fp | prop_total_visits_porn_prob_toulouse_min_fn_fp |
|---|---|---|---|---|
| 18-20 | 0.111 | 0.089 | 0.091 | 0.1 |
| 21-24 | 0.105 | 0.083 | 0.085 | 0.094 |
| 25-29 | 0.103 | 0.078 | 0.08 | 0.091 |
| 30-34 | 0.098 | 0.072 | 0.074 | 0.085 |
| 35-39 | 0.095 | 0.069 | 0.071 | 0.083 |
| 40-44 | 0.096 | 0.07 | 0.072 | 0.084 |
| 45-49 | 0.097 | 0.071 | 0.073 | 0.085 |
| 50-54 | 0.093 | 0.068 | 0.07 | 0.081 |
| 55-59 | 0.086 | 0.059 | 0.061 | 0.073 |
| 60-64 | 0.08 | 0.052 | 0.054 | 0.066 |


In [30]:
df.groupby(['hoh_oldest_age'])[grp_prop_time].mean().round(3)

| hoh_oldest_age | prop_total_time_porn_prob_shalla_reduce_fn | prop_total_time_porn_prob_shalla_reduce_fp | prop_total_time_porn_prob_shalla_min_fn_fp | prop_total_time_porn_prob_toulouse_min_fn_fp |
|---|---|---|---|---|
| 18-20 | 0.064 | 0.052 | 0.053 | 0.057 |
| 21-24 | 0.065 | 0.051 | 0.053 | 0.056 |
| 25-29 | 0.065 | 0.05 | 0.052 | 0.055 |
| 30-34 | 0.061 | 0.047 | 0.048 | 0.052 |
| 35-39 | 0.059 | 0.044 | 0.045 | 0.049 |
| 40-44 | 0.059 | 0.044 | 0.046 | 0.05 |
| 45-49 | 0.06 | 0.046 | 0.047 | 0.051 |
| 50-54 | 0.06 | 0.045 | 0.047 | 0.05 |
| 55-59 | 0.055 | 0.041 | 0.042 | 0.046 |
| 60-64 | 0.051 | 0.035 | 0.036 | 0.04 |


#### By Education

In [31]:
df.groupby(['hoh_most_education'])[grp_prop_visits].mean().round(3).reindex(edu.values())

| hoh_most_education | prop_total_visits_porn_prob_shalla_reduce_fn | prop_total_visits_porn_prob_shalla_reduce_fp | prop_total_visits_porn_prob_shalla_min_fn_fp | prop_total_visits_porn_prob_toulouse_min_fn_fp |
|---|---|---|---|---|
| Less than a high school diploma | 0.105 | 0.08 | 0.082 | 0.093 |
| High school diploma or equivalent | 0.093 | 0.067 | 0.07 | 0.081 |
| Some college but no degree | 0.092 | 0.066 | 0.068 | 0.079 |
| Associate degree | 0.092 | 0.066 | 0.068 | 0.079 |
| Bachelor’s degree | 0.089 | 0.063 | 0.065 | 0.075 |
| Graduate degree | 0.083 | 0.057 | 0.059 | 0.069 |
| Missing | 0.097 | 0.07 | 0.073 | 0.084 |


In [32]:
df.groupby(['hoh_most_education'])[grp_prop_time].mean().round(3).reindex(edu.values())

| hoh_most_education | prop_total_time_porn_prob_shalla_reduce_fn | prop_total_time_porn_prob_shalla_reduce_fp | prop_total_time_porn_prob_shalla_min_fn_fp | prop_total_time_porn_prob_toulouse_min_fn_fp |
|---|---|---|---|---|
| Less than a high school diploma | 0.064 | 0.049 | 0.051 | 0.054 |
| High school diploma or equivalent | 0.059 | 0.045 | 0.046 | 0.05 |
| Some college but no degree | 0.056 | 0.043 | 0.044 | 0.047 |
| Associate degree | 0.057 | 0.043 | 0.044 | 0.048 |
| Bachelor’s degree | 0.057 | 0.043 | 0.044 | 0.048 |
| Graduate degree | 0.054 | 0.04 | 0.041 | 0.045 |
| Missing | 0.059 | 0.044 | 0.046 | 0.05 |
