## Bad Domains: Visits to Malware/Phishing Sites by Age, Education, and Race

We pair passively observed, domain-level browsing data from comScore with [pydomains](https://github.com/themains/pydomains), a Python package that provides multiple ways to infer the kind of content a domain hosts. We use the combined data to examine whether the old, the less well educated, and minorities visit (and spend more time on) websites implicated in distributing malware or engaged in phishing more frequently than their complementary groups.

Two caveats: the browsing data are at the machine level, and the demographics data are at the household level.

Topline: The most educated most frequently visit (spend the most time on) phishing/malware websites. Part of the reason is that they are online more often. When we split the sample by race, Asians and Whites visit (spend more time on) malware/phishing websites more frequently than other racial groups. Again, part of the reason seems to be that Asians and Whites spend more time online. When we split by age, older people visit (spend more time on) phishing/malware sites more frequently. Here there is some evidence that it is because they are choosing worse than younger people.

In [1]:
import pandas as pd
import gc

### Load browsing data for 2016 grouped by domain and machine ID

In [2]:
YEAR = 2016
gdf = pd.read_csv('/opt/comscore/pydomains/app2/cs%04d_grp_machine_domain.csv.bz2' % YEAR)
gdf.head()

Unnamed: 0,machine_id,domain_name,total_time,total_visits
0,17549714,100dayloans.com,0,1
1,17549714,1fbusa.com,43,18
2,17549714,2020panel.com,91,22
3,17549714,247-inc.net,46,32
4,17549714,4salelocal.net,1,1


How many machines do we have data from?

In [3]:
len(gdf.machine_id.unique())

81407

### Get the Kind of Content Hosted by a Domain

We use [pydomains](https://github.com/themains/pydomains) to get the kind of content hosted by each of the domains in comScore. (We make the data freely available [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DXSNFA).) We only load the relevant columns: predictions about whether a domain is engaged in phishing from an LSTM model based on PhishTank data, and predictions about whether a domain distributes malware from two LSTM models, one based on Toulouse data and one based on Malware data.

In [4]:
pydom_df = pd.read_csv('/opt/comscore/pydomains/cs%04d_unique_domains_pydomains.csv.bz2' % YEAR, usecols=['domain_names', 'pred_phish_2017_prob', 'pred_toulouse_2017_lab', 'pred_malware_2017_prob'])
# rename column
pydom_df.rename(columns={'domain_names': 'domain_name'}, inplace=True)
pydom_df.head()

Unnamed: 0,domain_name,pred_phish_2017_prob,pred_malware_2017_prob,pred_toulouse_2017_lab
0,realmadridvsbarcelonalivestream.com,0.88454,0.896531,adult
1,smartphonerankings.com,0.838993,0.367245,adult
2,sdale.org,0.17217,0.100568,others
3,twentyfoursevenrp.com,0.995628,0.876898,phishing
4,beachhousepublishing.com,0.932028,0.638768,press


In [5]:
# Left join
pdf = gdf.merge(pydom_df, how='left', on='domain_name')
pdf.head()

Unnamed: 0,machine_id,domain_name,total_time,total_visits,pred_phish_2017_prob,pred_malware_2017_prob,pred_toulouse_2017_lab
0,17549714,100dayloans.com,0,1,0.463685,0.266254,adult
1,17549714,1fbusa.com,43,18,0.262803,0.262831,phishing
2,17549714,2020panel.com,91,22,0.008066,0.093134,adult
3,17549714,247-inc.net,46,32,0.306593,0.111548,phishing
4,17549714,4salelocal.net,1,1,0.507583,0.271384,adult
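Because this is a left join, any domain without a pydomains prediction keeps its browsing row but gets missing values in the prediction columns. A minimal sketch of how one might count such rows, using toy frames (`visits` and `preds` are made-up stand-ins for `gdf` and `pydom_df`, and the values are illustrative only):

```python
import pandas as pd

# Toy stand-ins for gdf and pydom_df (hypothetical values)
visits = pd.DataFrame({
    'machine_id': [1, 1, 2],
    'domain_name': ['a.com', 'b.com', 'c.com'],
    'total_visits': [3, 5, 2],
})
preds = pd.DataFrame({
    'domain_name': ['a.com', 'b.com'],
    'pred_phish_2017_prob': [0.95, 0.10],
})

# indicator=True adds a '_merge' column; for a left join it is
# 'both' (matched) or 'left_only' (no prediction found)
joined = visits.merge(preds, how='left', on='domain_name', indicator=True)
n_unmatched = int((joined['_merge'] == 'left_only').sum())
share_unmatched = n_unmatched / len(joined)
print(n_unmatched, round(share_unmatched, 3))
```

If the unmatched share is non-trivial, it is worth remembering that those rows can never be flagged as phishing or malware in the steps that follow.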


In [6]:
pdf.columns

Index(['machine_id', 'domain_name', 'total_time', 'total_visits',
       'pred_phish_2017_prob', 'pred_malware_2017_prob',
       'pred_toulouse_2017_lab'],
      dtype='object')

In [7]:
# Delete to keep the memory from filling up
%xdel gdf
gc.collect()

21

### Phishing Model

We predict that a website is engaged in phishing if $prob > 0.9$ under the 2017 model. We choose this threshold to reduce the number of false positives. For inferences across race, age, and education to hold, measurement error should be orthogonal to race, age, and education.

In [8]:
pdf.loc[pdf.pred_phish_2017_prob > 0.9, 'total_time_phishing'] = pdf['total_time']
pdf.loc[pdf.pred_phish_2017_prob <= 0.9, 'total_time_phishing'] = 0
pdf.loc[pdf.pred_phish_2017_prob > 0.9, 'total_visits_phishing'] = pdf['total_visits']
pdf.loc[pdf.pred_phish_2017_prob <= 0.9, 'total_visits_phishing'] = 0
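The four assignments above can also be written as one reusable step. The sketch below is an alternative formulation, not the code used in this notebook: `add_flagged_totals` is a hypothetical helper, and the toy values are made up. One difference to note: rows with a missing probability get 0 here, whereas the `.loc` assignments above leave them as NaN. Because a grouped sum skips NaN, the per-machine totals should come out the same either way.

```python
import pandas as pd

def add_flagged_totals(df, prob_col, suffix, threshold=0.9):
    """Copy total_time/total_visits where prob_col > threshold, else 0.

    Hypothetical helper for illustration; returns a new DataFrame.
    """
    flagged = df[prob_col] > threshold  # NaN > threshold evaluates to False
    out = df.copy()
    for col in ('total_time', 'total_visits'):
        out['{0:s}_{1:s}'.format(col, suffix)] = out[col].where(flagged, 0)
    return out

# Toy data: one flagged row, one unflagged row, one row with no prediction
toy = pd.DataFrame({
    'total_time': [10, 20, 30],
    'total_visits': [1, 2, 3],
    'pred_phish_2017_prob': [0.95, 0.50, None],
})
toy = add_flagged_totals(toy, 'pred_phish_2017_prob', 'phishing')
print(toy[['total_time_phishing', 'total_visits_phishing']])
```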

### Malware Model

We predict that a website distributes malware if $prob > 0.9$ under the 2017 model.

In [9]:
pdf.loc[pdf.pred_malware_2017_prob > 0.9, 'total_time_malware'] = pdf['total_time']
pdf.loc[pdf.pred_malware_2017_prob <= 0.9, 'total_time_malware'] = 0
pdf.loc[pdf.pred_malware_2017_prob > 0.9, 'total_visits_malware'] = pdf['total_visits']
pdf.loc[pdf.pred_malware_2017_prob <= 0.9, 'total_visits_malware'] = 0
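The 0.9 cutoff is a judgment call, so it can be useful to see how the share of flagged domains moves with the threshold. A small sketch on made-up probabilities (the `probs` values and the list of thresholds are illustrative assumptions, not results from the comScore data):

```python
import pandas as pd

# Hypothetical predicted probabilities for five domains
probs = pd.Series([0.99, 0.92, 0.85, 0.40, 0.05])

# Share of domains flagged at each candidate threshold
thresholds = [0.5, 0.8, 0.9, 0.95]
share_flagged = {t: float((probs > t).mean()) for t in thresholds}
for t, share in share_flagged.items():
    print(t, share)
```

Running the same loop on `pred_phish_2017_prob` or `pred_malware_2017_prob` would show how much the downstream totals depend on the cutoff.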

### Toulouse Model (for malware only)

Since we have two measures of malware, we use the Toulouse model for comparison. For Toulouse, we just use the predicted label.

In [10]:
c = 'malware'
pdf.loc[pdf.pred_toulouse_2017_lab == c, 'total_time_tl_{0:s}'.format(c)] = pdf['total_time']
pdf.loc[pdf.pred_toulouse_2017_lab != c, 'total_time_tl_{0:s}'.format(c)] = 0
pdf.loc[pdf.pred_toulouse_2017_lab == c, 'total_visits_tl_{0:s}'.format(c)] = pdf['total_visits']
pdf.loc[pdf.pred_toulouse_2017_lab != c, 'total_visits_tl_{0:s}'.format(c)] = 0
    
pdf.head()

Unnamed: 0,machine_id,domain_name,total_time,total_visits,pred_phish_2017_prob,pred_malware_2017_prob,pred_toulouse_2017_lab,total_time_phishing,total_visits_phishing,total_time_malware,total_visits_malware,total_time_tl_malware,total_visits_tl_malware
0,17549714,100dayloans.com,0,1,0.463685,0.266254,adult,0.0,0.0,0.0,0.0,0.0,0.0
1,17549714,1fbusa.com,43,18,0.262803,0.262831,phishing,0.0,0.0,0.0,0.0,0.0,0.0
2,17549714,2020panel.com,91,22,0.008066,0.093134,adult,0.0,0.0,0.0,0.0,0.0,0.0
3,17549714,247-inc.net,46,32,0.306593,0.111548,phishing,0.0,0.0,0.0,0.0,0.0,0.0
4,17549714,4salelocal.net,1,1,0.507583,0.271384,adult,0.0,0.0,0.0,0.0,0.0,0.0


Our final dataset is at the machine_id level. For each machine, we want to know how much time and how many visits, and what proportion of time and visits, go to websites implicated in phishing and distributing malware.

The cells above set time and visits to zero for domains that are not flagged (for each measure), so we can simply group by machine_id and sum.

In [11]:
cats = ['phishing', 'malware', 'tl_malware']
# Use the string 'sum' so pandas dispatches to its own grouped sum
aggs = {'total_time': 'sum', 'total_visits': 'sum'}
for c in cats:
    aggs['total_time_{0:s}'.format(c)] = 'sum'
    aggs['total_visits_{0:s}'.format(c)] = 'sum'
gdf = pdf.groupby(['machine_id']).agg(aggs)
gdf.head()

Unnamed: 0_level_0,total_time_phishing,total_time_malware,total_visits_malware,total_visits_tl_malware,total_time_tl_malware,total_visits,total_visits_phishing,total_time
machine_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
17549714,88.0,62.0,11.0,1.0,28.0,4147,30.0,36193
66614909,19.0,5.0,5.0,6.0,8.0,1540,16.0,8240
66859433,0.0,0.0,0.0,0.0,0.0,9,0.0,138
69370447,0.0,0.0,0.0,0.0,0.0,105,0.0,1810
70605319,6.0,1.0,2.0,0.0,0.0,288,3.0,3099


### Load household level demographics data

In [12]:
dem_df = pd.read_csv('/opt/comscore/demographics_by_machine_id/demographics2016.csv', usecols = ['machine_id', 'racial_background', 'country_of_origin', 'hoh_oldest_age', 'hoh_most_education'])
dem_df.describe()

Unnamed: 0,machine_id,hoh_most_education,hoh_oldest_age,racial_background,country_of_origin
count,81417.0,81417.0,81417.0,81417.0,81417.0
mean,182455900.0,37.414508,7.233919,2.162509,0.115676
std,13498110.0,46.21562,2.783556,1.916148,0.319838
min,17549710.0,1.0,1.0,-88.0,0.0
25%,173181800.0,2.0,5.0,1.0,0.0
50%,185428300.0,4.0,8.0,1.0,0.0
75%,193360400.0,99.0,9.0,3.0,0.0
max,201100200.0,99.0,99.0,5.0,1.0
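The summary statistics above are computed on the raw codes, so sentinel values such as 99 and -88 (mapped to 'Missing' in the cells that follow) pull the means and standard deviations around. A minimal sketch of masking the sentinels before summarizing, on toy data (the `sentinels` mapping is an assumption based on the codes used in this notebook):

```python
import pandas as pd

# Toy frame with the same kind of sentinel codes (hypothetical rows)
toy = pd.DataFrame({
    'hoh_oldest_age': [5, 8, 99],
    'racial_background': [1, -88, 3],
})

# Assumed sentinel codes per column
sentinels = {'hoh_oldest_age': [99], 'racial_background': [-88]}

clean = toy.copy()
for col, codes in sentinels.items():
    # mask() replaces rows where the condition is True with NaN
    clean[col] = clean[col].mask(clean[col].isin(codes))

print(toy['hoh_oldest_age'].mean(), clean['hoh_oldest_age'].mean())
```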


### Convert Demographic Codes to Semantic Labels

Let's translate the numerical codes to semantic labels.

In [13]:
dem_df['racial_background'] = dem_df['racial_background'].replace({1: 'White', 
                                                                   2: 'Black', 
                                                                   3: 'Asian', 
                                                                   5: 'Other',
                                                                 -88: 'Missing'})
(dem_df['racial_background'].value_counts()/dem_df['racial_background'].value_counts().sum()).round(2)

White      0.58
Other      0.23
Black      0.12
Asian      0.07
Missing    0.00
Name: racial_background, dtype: float64
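The shares above divide the counts by their sum. As a shorter equivalent, `value_counts` accepts `normalize=True`; a sketch on a toy series (values are made up):

```python
import pandas as pd

# Hypothetical labels for four households
s = pd.Series(['White', 'White', 'Black', 'Asian'])

# normalize=True returns relative frequencies instead of counts
shares = s.value_counts(normalize=True).round(2)
print(shares)
```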

In [14]:
dem_df['country_of_origin'] = dem_df['country_of_origin'].replace({0: 'Non-Hispanic', 
                                                                   1: 'Hispanic'})
(dem_df['country_of_origin'].value_counts()/dem_df['country_of_origin'].value_counts().sum()).round(2)

Non-Hispanic    0.88
Hispanic        0.12
Name: country_of_origin, dtype: float64

In [15]:
dem_df['hoh_oldest_age'] = dem_df['hoh_oldest_age'].replace({1: '18-20', 
                                                             2: '21-24',
                                                             3: '25-29', 
                                                             4: '30-34',
                                                             5: '35-39',
                                                             6: '40-44',
                                                             7: '45-49',
                                                             8: '50-54',
                                                             9: '55-59',
                                                             10: '60-64',
                                                             11: '65 and over',
                                                             99: 'Missing'})
(dem_df['hoh_oldest_age'].value_counts()/dem_df['hoh_oldest_age'].value_counts().sum()).round(2)

65 and over    0.15
50-54          0.15
45-49          0.13
55-59          0.11
40-44          0.10
35-39          0.08
60-64          0.08
30-34          0.07
25-29          0.06
21-24          0.03
18-20          0.02
Missing        0.00
Name: hoh_oldest_age, dtype: float64

In [16]:
dem_df['hoh_most_education'] = dem_df['hoh_most_education'].replace({0: 'Less than a high school diploma',
                                                                     1: 'High school diploma or equivalent', 
                                                                     2: 'Some college but no degree', 
                                                                     3: 'Associate degree', 
                                                                     4: 'Bachelor’s degree',
                                                                     5: 'Graduate degree',
                                                                     99: 'Missing'})
(dem_df['hoh_most_education'].value_counts()/dem_df['hoh_most_education'].value_counts().sum()).round(2)

Missing                              0.36
Some college but no degree           0.26
Associate degree                     0.21
Bachelor’s degree                    0.14
High school diploma or equivalent    0.03
Graduate degree                      0.01
Name: hoh_most_education, dtype: float64

### Merge browsing data with demographics data

In [17]:
mdf = gdf.merge(dem_df, how='left', on='machine_id')
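The demographics file has 81,417 rows while the browsing data covers 81,407 machines, so a few households have no browsing record. A sketch of how one might guard the merge and list the extra IDs, on toy frames (`browsing` and `demo` are hypothetical stand-ins for `gdf` and `dem_df`):

```python
import pandas as pd

# Toy stand-ins: browsing totals indexed by machine_id, demographics as a column
browsing = pd.DataFrame(
    {'total_visits': [10, 4]},
    index=pd.Index([1, 2], name='machine_id'),
)
demo = pd.DataFrame({
    'machine_id': [1, 2, 3],
    'racial_background': ['White', 'Asian', 'Black'],
})

# validate='one_to_one' raises MergeError if machine_id repeats on either side
merged = browsing.reset_index().merge(
    demo, how='left', on='machine_id', validate='one_to_one'
)

# Machines that appear in demographics but not in the browsing data
extra_in_demo = set(demo['machine_id']) - set(browsing.index)
print(len(merged), extra_in_demo)
```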

In [18]:
# Given the data are pretty big, we delete gdf and call the garbage collector
%xdel gdf
gc.collect()

35

In [19]:
mdf.head()

Unnamed: 0,machine_id,total_time_phishing,total_time_malware,total_visits_malware,total_visits_tl_malware,total_time_tl_malware,total_visits,total_visits_phishing,total_time,hoh_most_education,hoh_oldest_age,racial_background,country_of_origin
0,17549714,88.0,62.0,11.0,1.0,28.0,4147,30.0,36193,Missing,40-44,White,Non-Hispanic
1,66614909,19.0,5.0,5.0,6.0,8.0,1540,16.0,8240,Some college but no degree,60-64,White,Hispanic
2,66859433,0.0,0.0,0.0,0.0,0.0,9,0.0,138,Missing,25-29,White,Non-Hispanic
3,69370447,0.0,0.0,0.0,0.0,0.0,105,0.0,1810,Missing,55-59,Other,Non-Hispanic
4,70605319,6.0,1.0,2.0,0.0,0.0,288,3.0,3099,Missing,21-24,Other,Non-Hispanic


In [20]:
mdf.describe().astype(int)

Unnamed: 0,machine_id,total_time_phishing,total_time_malware,total_visits_malware,total_visits_tl_malware,total_time_tl_malware,total_visits,total_visits_phishing,total_time
count,81407,81407,81407,81407,81407,81407,81407,81407,81407
mean,182457976,280,220,9,2,36,981,16,14915
std,13496576,4319,4157,33,11,585,1546,49,29915
min,17549714,0,0,0,0,0,1,0,0
25%,173182743,1,0,0,0,0,146,1,1286
50%,185429062,11,3,2,0,0,469,4,4998
75%,193360555,69,31,7,2,3,1242,14,15719
max,201100249,425190,425190,1818,415,98263,99484,4190,1383660


### Analysis

#### By Education

We start by tracking total time spent and total number of visits to phishing and malware sites by education.

In [21]:
# Select multiple columns with a list; tuple-style selection fails in pandas >= 2.0
mdf.groupby(['hoh_most_education'])[['total_time_phishing', 'total_visits_phishing']].describe().astype(int)

Unnamed: 0_level_0,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
hoh_most_education,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Associate degree,16810,352,6172,0,2,21,88,425190,16810,19,54,0,2,6,19,3189
Bachelor’s degree,11342,244,2382,0,2,22,98,139112,11342,20,46,0,2,7,21,1232
Graduate degree,539,355,2841,0,2,22,102,52898,539,25,49,0,2,8,24,363
High school diploma or equivalent,2489,277,2450,0,1,14,66,79276,2489,14,46,0,1,5,13,1562
Missing,29319,187,2460,0,0,2,35,203687,29319,10,38,0,0,2,7,3242
Some college but no degree,20908,368,5432,0,2,18,85,396946,20908,19,59,0,1,6,18,4190


The first thing that jumps out is the sharp right skew. Given the skew, we focus on the medians. A slightly surprising pattern emerges: the greater the education level of the most educated person in the household, the more frequent the visits to phishing sites (judged by the median, though the pattern also holds for the 75th percentile). For instance, households where a graduate degree is the highest level of education visit phishing-related sites more often (median = 8) than households where the most educated person has just a bachelor's degree (median = 7). For time spent, the pattern is slightly less clear but consistent.

Looking at visits and time spent on malware sites (see below), the pattern is broadly the same. This is true regardless of how we measure malware sites, whether with the Toulouse or the Malware data.
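When the focus is on the median and the 75th percentile, it can be easier to request just those quantiles rather than the full `describe()` table. A minimal sketch on toy data (the rows are made up; only the column names follow this notebook):

```python
import pandas as pd

# Hypothetical households with a heavy right tail in one group
toy = pd.DataFrame({
    'hoh_most_education': ['Graduate degree'] * 3 + ['Associate degree'] * 3,
    'total_visits_phishing': [0, 8, 400, 1, 6, 30],
})

# One row per group, one column per requested quantile
q = (toy.groupby('hoh_most_education')['total_visits_phishing']
        .quantile([0.5, 0.75])
        .unstack())
print(q)
```

The outlier of 400 moves the upper quantile but leaves the median at the middle observation, which is why medians are the safer summary here.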

In [22]:
mdf.groupby(['hoh_most_education'])[['total_time_malware', 'total_visits_malware']].describe().astype(int)

Unnamed: 0_level_0,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
hoh_most_education,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Associate degree,16810,285,6048,0,0,6,41,425190,16810,10,33,0,1,3,9,1127
Bachelor’s degree,11342,173,2292,0,0,6,43,139051,11342,11,33,0,1,3,10,1230
Graduate degree,539,318,2833,0,0,7,47,52810,539,15,42,0,1,3,13,464
High school diploma or equivalent,2489,215,2320,0,0,4,32,79221,2489,8,40,0,0,2,7,1588
Missing,29319,149,2379,0,0,1,15,203648,29319,6,26,0,0,1,4,1818
Some college but no degree,20908,293,5123,0,0,5,39,396919,20908,11,41,0,0,2,8,1767


In [23]:
mdf.groupby(['hoh_most_education'])[['total_time_tl_malware', 'total_visits_tl_malware']].describe().astype(int)

Unnamed: 0_level_0,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
hoh_most_education,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Associate degree,16810,43,736,0,0,0,5,82125,16810,2,13,0,0,0,2,385
Bachelor’s degree,11342,47,686,0,0,0,5,49598,11342,3,14,0,0,0,2,405
Graduate degree,539,46,234,0,0,0,9,2582,539,4,16,0,0,1,3,297
High school diploma or equivalent,2489,39,269,0,0,0,4,7946,2489,2,11,0,0,0,2,381
Missing,29319,26,214,0,0,0,1,12846,29319,1,9,0,0,0,1,376
Some college but no degree,20908,38,752,0,0,0,4,98263,20908,2,11,0,0,0,2,415


#### By Race

Next, we track total time spent on and total number of visits to phishing and malware sites by racial background. The results are reasonably consistent and follow a broad pattern: Asians visit phishing and malware sites most frequently, followed by Whites, Blacks, and "Others".

In [24]:
mdf.groupby(['racial_background'])[['total_time_phishing', 'total_visits_phishing']].describe().astype(int)

Unnamed: 0_level_0,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
racial_background,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Asian,5769,269,2961,0,2,23,105,166643,5769,18,38,0,2,7,19,1208
Black,9775,231,2546,0,1,14,78,203687,9775,14,58,0,1,5,14,3189
Missing,10,98,217,0,0,0,5,640,10,5,10,0,0,0,2,28
Other,18553,130,1643,0,0,2,31,139112,18553,8,27,0,0,2,7,1334
White,47300,350,5351,0,1,15,80,425190,47300,19,54,0,1,5,17,4190


In [25]:
mdf.groupby(['racial_background'])[['total_time_malware', 'total_visits_malware']].describe().astype(int)

Unnamed: 0_level_0,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
racial_background,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Asian,5769,209,2932,0,1,7,52,166637,5769,11,29,0,1,3,10,806
Black,9775,173,2444,0,0,4,36,203648,9775,8,26,0,0,2,7,1127
Missing,10,97,217,0,0,0,0,640,10,4,9,0,0,0,0,28
Other,18553,102,1607,0,0,1,12,139051,18553,4,17,0,0,1,3,477
White,47300,278,5142,0,0,4,36,425190,47300,10,39,0,0,2,8,1818


In [26]:
mdf.groupby(['racial_background'])[['total_time_tl_malware', 'total_visits_tl_malware']].describe().astype(int)

Unnamed: 0_level_0,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
racial_background,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Asian,5769,63,538,0,0,1,11,25505,5769,3,13,0,0,1,3,338
Black,9775,46,878,0,0,0,5,82125,9775,2,8,0,0,0,2,345
Missing,10,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0
Other,18553,26,737,0,0,0,1,98263,18553,1,6,0,0,0,1,297
White,47300,35,425,0,0,0,3,49598,47300,2,13,0,0,0,2,415


#### By Age

Next, we track things by age. Here we have a fairly consistent pattern. People 60 and over most frequently visit (spend the most time on) phishing and malware sites. People under 30 lie at the other end of the spectrum. People in their thirties, forties, and fifties generally fall in between.

In [27]:
mdf.groupby(['hoh_oldest_age'])[['total_time_phishing', 'total_visits_phishing']].describe().astype(int)

Unnamed: 0_level_0,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
hoh_oldest_age,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
18-20,1560,113,727,0,0,3,32,21279,1560,8,22,0,0,2,7,414
21-24,2812,117,930,0,0,4,39,40852,2812,9,27,0,0,2,9,517
25-29,4958,249,4614,0,0,6,53,273741,4958,11,28,0,1,3,10,505
30-34,5714,201,2176,0,1,9,58,92960,5714,12,29,0,1,4,12,669
35-39,6820,327,6261,0,1,10,61,391783,6820,13,37,0,1,4,12,1562
40-44,7919,329,6841,0,1,11,66,425190,7919,14,36,0,1,4,13,1232
45-49,10900,335,5725,0,1,10,63,389279,10900,14,43,0,1,4,13,1788
50-54,12461,223,2362,0,0,7,58,152794,12461,14,58,0,0,3,13,3242
55-59,9201,296,3870,0,1,14,80,231801,9201,17,44,0,1,5,16,1334
60-64,6463,337,4087,0,2,21,97,185870,6463,22,64,0,1,6,21,3189


In [28]:
mdf.groupby(['hoh_oldest_age'])[['total_time_malware', 'total_visits_malware']].describe().astype(int)

Unnamed: 0_level_0,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
hoh_oldest_age,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
18-20,1560,81,651,0,0,1,14,21277,1560,5,19,0,0,1,4,414
21-24,2812,72,424,0,0,1,20,11441,2812,5,24,0,0,1,4,699
25-29,4958,207,4599,0,0,1,22,273741,4958,6,20,0,0,1,5,442
30-34,5714,158,2152,0,0,2,26,92933,5714,6,19,0,0,2,6,319
35-39,6820,245,5597,0,0,3,27,391777,6820,7,29,0,0,2,6,1588
40-44,7919,269,6788,0,0,2,31,425190,7919,8,28,0,0,2,6,1230
45-49,10900,290,5723,0,0,2,28,389278,10900,8,36,0,0,2,6,1767
50-54,12461,159,2203,0,0,2,26,152794,12461,8,34,0,0,1,6,1818
55-59,9201,227,3694,0,0,4,36,231700,9201,10,31,0,0,2,8,976
60-64,6463,265,3666,0,0,6,45,185904,6463,12,42,0,0,3,10,1157


In [29]:
mdf.groupby(['hoh_oldest_age'])[['total_time_tl_malware', 'total_visits_tl_malware']].describe().astype(int)

Unnamed: 0_level_0,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
hoh_oldest_age,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
18-20,1560,25,200,0,0,0,1,5723,1560,1,3,0,0,0,1,71
21-24,2812,24,171,0,0,0,2,6130,2812,1,7,0,0,0,1,225
25-29,4958,27,196,0,0,0,2,6975,4958,2,10,0,0,0,2,309
30-34,5714,45,1113,0,0,0,3,82125,5714,2,9,0,0,0,2,338
35-39,6820,43,1223,0,0,0,3,98263,6820,2,9,0,0,0,2,385
40-44,7919,25,180,0,0,0,2,11334,7919,2,7,0,0,0,2,165
45-49,10900,37,429,0,0,0,2,25505,10900,2,12,0,0,0,2,367
50-54,12461,32,500,0,0,0,2,49598,12461,2,10,0,0,0,1,372
55-59,9201,35,491,0,0,0,4,40966,9201,2,11,0,0,0,2,384
60-64,6463,44,400,0,0,0,6,16959,6463,3,16,0,0,0,2,415


### Proportion of time, visits

We think some of the patterns we see reflect the total time people spend online: more time online simply means more exposure. That is an important insight in itself. We also believe, though, that the more educated, for instance, are less likely (adjusted for frequency) to go to phishing and malware sites because they have greater skills. So we now look at proportions. The data are roughly in line with expectations for education, with the least educated (remember this is at the household level) spending the largest share of their time (visits) on phishing and malware sites and the most educated spending the least. For race as well, the pattern is flipped. For age, we don't expect such a pattern, and that is indeed what we find.

In [30]:
mdf['prop_phishing_visits'] = mdf['total_visits_phishing']/mdf['total_visits']
mdf['prop_phishing_time'] = mdf['total_time_phishing']/mdf['total_time']

mdf['prop_malware_visits'] = mdf['total_visits_malware']/mdf['total_visits']
mdf['prop_malware_time']   = mdf['total_time_malware']/mdf['total_time']

mdf['prop_tl_malware_visits'] = mdf['total_visits_tl_malware']/mdf['total_visits']
mdf['prop_tl_malware_time']   = mdf['total_time_tl_malware']/mdf['total_time']

mdf.groupby(['hoh_most_education'])[['prop_phishing_visits', 'prop_phishing_time',
                                     'prop_malware_visits', 'prop_malware_time',
                                     'prop_tl_malware_visits', 'prop_tl_malware_time']].mean().round(3)

Unnamed: 0_level_0,prop_phishing_visits,prop_phishing_time,prop_malware_visits,prop_malware_time,prop_tl_malware_visits,prop_tl_malware_time
hoh_most_education,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Associate degree,0.016,0.014,0.009,0.01,0.003,0.003
Bachelor’s degree,0.015,0.013,0.009,0.009,0.003,0.003
Graduate degree,0.015,0.01,0.009,0.008,0.003,0.003
High school diploma or equivalent,0.016,0.018,0.011,0.013,0.004,0.005
Missing,0.019,0.021,0.012,0.016,0.005,0.007
Some college but no degree,0.017,0.017,0.01,0.012,0.003,0.003
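Some machines have zero total time (the minimum of `total_time` is 0), so the time proportions are undefined for them. A sketch of making that explicit and counting the affected rows, on toy data (values are made up):

```python
import pandas as pd

# Hypothetical machines; the second has no recorded time at all
toy = pd.DataFrame({
    'total_time': [100, 0, 50],
    'total_time_phishing': [10.0, 0.0, 5.0],
})

# Turn zero denominators into NaN so the ratio is NaN rather than 0/0
denom = toy['total_time'].where(toy['total_time'] > 0)
toy['prop_phishing_time'] = toy['total_time_phishing'] / denom

n_undefined = int(toy['prop_phishing_time'].isna().sum())
print(n_undefined, toy['prop_phishing_time'].mean())
```

Since `mean()` skips NaN by default, the group means are averages over machines with positive total time only.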


In [31]:
mdf.groupby(['racial_background'])[['prop_phishing_visits', 'prop_phishing_time',
                                    'prop_malware_visits', 'prop_malware_time',
                                    'prop_tl_malware_visits', 'prop_tl_malware_time']].mean().round(3)

Unnamed: 0_level_0,prop_phishing_visits,prop_phishing_time,prop_malware_visits,prop_malware_time,prop_tl_malware_visits,prop_tl_malware_time
racial_background,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Asian,0.015,0.014,0.01,0.01,0.003,0.003
Black,0.018,0.02,0.011,0.015,0.005,0.006
Missing,0.031,0.052,0.026,0.052,0.0,0.0
Other,0.017,0.018,0.01,0.014,0.004,0.006
White,0.017,0.017,0.01,0.012,0.003,0.004


Given the plausible skew in total visits and time spent, medians would also be worth checking across age, education, and race. The cell below reports the mean proportions by age.

In [32]:
mdf.groupby(['hoh_oldest_age'])[['prop_phishing_visits', 'prop_phishing_time',
                                 'prop_malware_visits', 'prop_malware_time',
                                 'prop_tl_malware_visits', 'prop_tl_malware_time']].mean().round(3)

Unnamed: 0_level_0,prop_phishing_visits,prop_phishing_time,prop_malware_visits,prop_malware_time,prop_tl_malware_visits,prop_tl_malware_time
hoh_oldest_age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
18-20,0.015,0.015,0.009,0.01,0.004,0.005
21-24,0.015,0.015,0.01,0.011,0.004,0.005
25-29,0.016,0.016,0.01,0.012,0.004,0.004
30-34,0.016,0.016,0.009,0.011,0.004,0.005
35-39,0.017,0.017,0.01,0.012,0.003,0.004
40-44,0.017,0.018,0.01,0.013,0.003,0.004
45-49,0.017,0.018,0.01,0.013,0.004,0.005
50-54,0.018,0.019,0.011,0.014,0.005,0.006
55-59,0.018,0.018,0.011,0.014,0.004,0.004
60-64,0.017,0.016,0.01,0.012,0.003,0.004
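The table above reports means. As a rough sketch of how the mean and median of a proportion could be compared side by side within each age group, here is a toy example (the rows are made up and are not results from the comScore data):

```python
import pandas as pd

# Hypothetical per-machine proportions for two age groups
toy = pd.DataFrame({
    'hoh_oldest_age': ['18-20', '18-20', '18-20', '60-64', '60-64', '60-64'],
    'prop_phishing_visits': [0.00, 0.01, 0.20, 0.00, 0.02, 0.03],
})

# Mean and median in adjacent columns, one row per age group
by_age = (toy.groupby('hoh_oldest_age')['prop_phishing_visits']
             .agg(['mean', 'median'])
             .round(3))
print(by_age)
```

In this toy example a single heavy user raises the mean for the younger group while the median stays low, which is the kind of divergence the median check is meant to catch.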
