## Bad Domains: Visits to Malware/Phishing Sites by Age, Education, and Race

We pair passively observed, domain-level browsing data from comScore with [pydomains](https://github.com/themains/pydomains), a Python package that provides multiple ways to infer the kind of content a domain hosts. We use the two to examine whether the old, the less well educated, and minorities visit websites implicated in distributing malware or engaged in phishing more frequently (or spend more time on them) than their complementary groups.

Two caveats: the browsing data are at the machine level, and the demographics data are at the household level.

Topline: The most educated visit phishing/malware websites most frequently (and spend the most time on them). Part of the reason is that they are online more often. When we split the sample by race, Asians and Whites visit malware/phishing websites more frequently (and spend more time on them) than other racial groups. Again, part of the reason seems to be that Asians and Whites spend more time online. When we split by age, older people visit phishing/malware sites more frequently (and spend more time on them). Here there is some evidence that it is because they choose worse than younger people.

In [1]:
import pandas as pd
import gc

### Load browsing data for 2016 grouped by domain and machine ID

In [2]:
YEAR = 2016
gdf = pd.read_csv('/opt/data/comscore/pydomains/app2/cs%04d_grp_machine_domain.csv.bz2' % YEAR)
gdf.head()

Unnamed: 0,machine_id,domain_name,total_time,total_visits
0,17549714,100dayloans.com,0,1
1,17549714,1fbusa.com,43,18
2,17549714,2020panel.com,91,22
3,17549714,247-inc.net,46,32
4,17549714,4salelocal.net,1,1


How many machines do we have data from?

In [3]:
len(gdf.machine_id.unique())

81407

### Get the Kind of Content Hosted by a Domain

We use [pydomains](https://github.com/themains/pydomains) to get the kind of content hosted by each of the domains in comScore. (We make the data freely available [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DXSNFA).) We only load the relevant columns: the predicted probability that a domain is engaged in phishing, from an LSTM model based on PhishTank data; the predicted probability that a domain distributes malware, from an LSTM model based on Malware data; and the predicted content label from an LSTM model based on Toulouse data.

In [4]:
pydom_df = pd.read_csv('/opt/data/comscore/pydomains/cs%04d_unique_domains_pydomains.csv.bz2' % YEAR, usecols=['domain_names', 'pred_phish_2017_prob', 'pred_toulouse_2017_lab', 'pred_malware_2017_prob'])
# rename column
pydom_df.rename(columns={'domain_names': 'domain_name'}, inplace=True)
pydom_df.head()

Unnamed: 0,domain_name,pred_phish_2017_prob,pred_malware_2017_prob,pred_toulouse_2017_lab
0,realmadridvsbarcelonalivestream.com,0.999508,0.907737,adult
1,smartphonerankings.com,0.787181,0.252708,adult
2,sdale.org,0.039262,0.00785,adult
3,twentyfoursevenrp.com,0.911047,0.979689,adult
4,beachhousepublishing.com,0.818236,0.579768,shopping


In [5]:
# Left join
pdf = gdf.merge(pydom_df, how='left', on='domain_name')
pdf.head()

Unnamed: 0,machine_id,domain_name,total_time,total_visits,pred_phish_2017_prob,pred_malware_2017_prob,pred_toulouse_2017_lab
0,17549714,100dayloans.com,0,1,0.608718,0.130488,adult
1,17549714,1fbusa.com,43,18,0.093295,0.148021,adult
2,17549714,2020panel.com,91,22,0.030941,0.041042,adult
3,17549714,247-inc.net,46,32,0.172064,0.112687,phishing
4,17549714,4salelocal.net,1,1,0.202896,0.782387,adult
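
Because this is a left join, any browsing row whose domain has no prediction ends up with missing probabilities. A minimal sketch of how one might measure that, using made-up toy frames in place of `gdf` and `pydom_df` (the toy values are assumptions, not the comScore data):

```python
import pandas as pd

# Toy stand-ins for gdf and pydom_df; all values are invented for illustration.
visits = pd.DataFrame({
    "machine_id": [1, 1, 2],
    "domain_name": ["a.com", "b.com", "c.com"],
    "total_visits": [3, 5, 2],
})
preds = pd.DataFrame({
    "domain_name": ["a.com", "b.com"],
    "pred_phish_2017_prob": [0.95, 0.10],
})

# indicator=True adds a `_merge` column recording where each row came from.
joined = visits.merge(preds, how="left", on="domain_name", indicator=True)

# Share of browsing rows with no matching prediction (NaN probabilities).
unmatched_share = (joined["_merge"] == "left_only").mean()
print(round(unmatched_share, 3))
```

On the real data, the same check would show how many machine-domain rows the later threshold rules silently skip.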


In [6]:
pdf.columns

Index(['machine_id', 'domain_name', 'total_time', 'total_visits',
       'pred_phish_2017_prob', 'pred_malware_2017_prob',
       'pred_toulouse_2017_lab'],
      dtype='object')

In [7]:
# Delete to keep the memory from filling up
%xdel gdf
gc.collect()

60

### Phishing Model

We predict that a website is engaged in phishing if $prob > 0.9$ under the 2017 model. We choose this high threshold to reduce the number of false positives. For inferences across race, age, and education to hold, measurement error should be orthogonal to race, age, and education.

In [8]:
pdf.loc[pdf.pred_phish_2017_prob > 0.9, 'total_time_phishing'] = pdf['total_time']
pdf.loc[pdf.pred_phish_2017_prob <= 0.9, 'total_time_phishing'] = 0
pdf.loc[pdf.pred_phish_2017_prob > 0.9, 'total_visits_phishing'] = pdf['total_visits']
pdf.loc[pdf.pred_phish_2017_prob <= 0.9, 'total_visits_phishing'] = 0
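
The four assignments above can also be written with `Series.where`. The sketch below runs on a toy frame (values invented). One difference worth noting: rows with a missing probability get 0 here, whereas the cell above leaves them as NaN. Since sums skip NaN, the per-machine totals come out the same either way.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pdf; the third row has no prediction.
toy = pd.DataFrame({
    "total_time": [10, 20, 30],
    "total_visits": [1, 2, 3],
    "pred_phish_2017_prob": [0.95, 0.40, np.nan],
})

THRESHOLD = 0.9  # same cutoff as in the text

# Comparisons with NaN evaluate to False, so unmatched domains are not flagged.
flagged = toy["pred_phish_2017_prob"] > THRESHOLD

# Keep the original value where flagged, 0 otherwise.
toy["total_time_phishing"] = toy["total_time"].where(flagged, 0)
toy["total_visits_phishing"] = toy["total_visits"].where(flagged, 0)
print(toy)
```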

### Malware Model

We predict that a website distributes malware if $prob > 0.9$ under the 2017 model.

In [9]:
pdf.loc[pdf.pred_malware_2017_prob > 0.9, 'total_time_malware'] = pdf['total_time']
pdf.loc[pdf.pred_malware_2017_prob <= 0.9, 'total_time_malware'] = 0
pdf.loc[pdf.pred_malware_2017_prob > 0.9, 'total_visits_malware'] = pdf['total_visits']
pdf.loc[pdf.pred_malware_2017_prob <= 0.9, 'total_visits_malware'] = 0
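
Since the 0.9 cutoff is a choice, it can help to see how the share of flagged domains moves with the threshold. A small sketch on invented probabilities (the alternative thresholds 0.5 and 0.7 are assumptions for illustration, not values used in this analysis):

```python
import pandas as pd

# Invented probabilities standing in for pred_malware_2017_prob.
toy = pd.DataFrame({"pred_malware_2017_prob": [0.99, 0.92, 0.85, 0.60, 0.10]})

thresholds = [0.5, 0.7, 0.9]

# Share of rows flagged as malware at each cutoff.
share_flagged = {
    t: float((toy["pred_malware_2017_prob"] > t).mean()) for t in thresholds
}
print(share_flagged)
```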

### Toulouse Model (for malware only)

We have two measures of malware. For the Toulouse model, which we include for comparison, we simply use the predicted label.

In [10]:
c = 'malware'
pdf.loc[pdf.pred_toulouse_2017_lab == c, 'total_time_tl_{0:s}'.format(c)] = pdf['total_time']
pdf.loc[pdf.pred_toulouse_2017_lab != c, 'total_time_tl_{0:s}'.format(c)] = 0
pdf.loc[pdf.pred_toulouse_2017_lab == c, 'total_visits_tl_{0:s}'.format(c)] = pdf['total_visits']
pdf.loc[pdf.pred_toulouse_2017_lab != c, 'total_visits_tl_{0:s}'.format(c)] = 0
    
pdf.head()

Unnamed: 0,machine_id,domain_name,total_time,total_visits,pred_phish_2017_prob,pred_malware_2017_prob,pred_toulouse_2017_lab,total_time_phishing,total_visits_phishing,total_time_malware,total_visits_malware,total_time_tl_malware,total_visits_tl_malware
0,17549714,100dayloans.com,0,1,0.608718,0.130488,adult,0.0,0.0,0.0,0.0,0.0,0.0
1,17549714,1fbusa.com,43,18,0.093295,0.148021,adult,0.0,0.0,0.0,0.0,0.0,0.0
2,17549714,2020panel.com,91,22,0.030941,0.041042,adult,0.0,0.0,0.0,0.0,0.0,0.0
3,17549714,247-inc.net,46,32,0.172064,0.112687,phishing,0.0,0.0,0.0,0.0,0.0,0.0
4,17549714,4salelocal.net,1,1,0.202896,0.782387,adult,0.0,0.0,0.0,0.0,0.0,0.0
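
With two malware measures in hand, a cross-tabulation is one simple way to see how often they agree. The sketch below uses an invented four-row frame, so the counts say nothing about the real data:

```python
import pandas as pd

# Invented rows standing in for pdf.
toy = pd.DataFrame({
    "pred_malware_2017_prob": [0.95, 0.97, 0.20, 0.30],
    "pred_toulouse_2017_lab": ["malware", "adult", "malware", "shopping"],
})

lstm_flag = toy["pred_malware_2017_prob"] > 0.9          # probability-based measure
toulouse_flag = toy["pred_toulouse_2017_lab"] == "malware"  # label-based measure

# 2x2 table of agreement between the two measures.
agreement = pd.crosstab(
    lstm_flag, toulouse_flag,
    rownames=["lstm_malware"], colnames=["toulouse_malware"],
)
both = int((lstm_flag & toulouse_flag).sum())    # flagged by both measures
either = int((lstm_flag | toulouse_flag).sum())  # flagged by at least one
print(agreement)
```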


Our final dataset is at the machine_id level. For each machine, we want to know how much time was spent on, how many visits were made to, and what proportion of total time and visits went to websites implicated in phishing and in distributing malware.

The columns created above already zero out time and visits for domains that are not flagged as phishing or malware (under each measure), so we simply group by machine_id and sum.

In [11]:
cats = ['phishing', 'malware', 'tl_malware']
aggs = {'total_time': sum, 'total_visits': sum}
for c in cats:
    aggs['total_time_{0:s}'.format(c)] = sum
    aggs['total_visits_{0:s}'.format(c)] = sum
gdf = pdf.groupby(['machine_id']).agg(aggs)
gdf.head()

machine_id,total_time,total_visits,total_time_phishing,total_visits_phishing,total_time_malware,total_visits_malware,total_time_tl_malware,total_visits_tl_malware
17549714,36193,4147,63.0,43.0,70.0,43.0,0.0,0.0
66614909,8240,1540,4.0,6.0,12.0,10.0,0.0,0.0
66859433,138,9,0.0,0.0,0.0,0.0,0.0,0.0
69370447,1810,105,0.0,0.0,0.0,0.0,0.0,0.0
70605319,3099,288,0.0,1.0,1.0,1.0,0.0,0.0


### Load household level demographics data

In [12]:
dem_df = pd.read_csv('/opt/data/comscore/demographics_by_machine_id/demographics2016.csv', usecols = ['machine_id', 'racial_background', 'country_of_origin', 'hoh_oldest_age', 'hoh_most_education'])
dem_df.describe()

Unnamed: 0,machine_id,hoh_most_education,hoh_oldest_age,racial_background,country_of_origin
count,81417.0,81417.0,81417.0,81417.0,81417.0
mean,182455900.0,37.414508,7.233919,2.162509,0.115676
std,13498110.0,46.21562,2.783556,1.916148,0.319838
min,17549710.0,1.0,1.0,-88.0,0.0
25%,173181800.0,2.0,5.0,1.0,0.0
50%,185428300.0,4.0,8.0,1.0,0.0
75%,193360400.0,99.0,9.0,3.0,0.0
max,201100200.0,99.0,99.0,5.0,1.0


### Convert Demographic Codes to Semantic Labels

Let's translate the numerical codes to semantic labels.

In [13]:
dem_df['racial_background'] = dem_df['racial_background'].replace({1: 'White', 
                                                                   2: 'Black', 
                                                                   3: 'Asian', 
                                                                   5: 'Other',
                                                                 -88: 'Missing'})
(dem_df['racial_background'].value_counts()/dem_df['racial_background'].value_counts().sum()).round(2)

White      0.58
Other      0.23
Black      0.12
Asian      0.07
Missing    0.00
Name: racial_background, dtype: float64

In [14]:
dem_df['country_of_origin'] = dem_df['country_of_origin'].replace({0: 'Non-Hispanic', 
                                                                   1: 'Hispanic'})
(dem_df['country_of_origin'].value_counts()/dem_df['country_of_origin'].value_counts().sum()).round(2)

Non-Hispanic    0.88
Hispanic        0.12
Name: country_of_origin, dtype: float64

In [15]:
dem_df['hoh_oldest_age'] = dem_df['hoh_oldest_age'].replace({1: '18-20', 
                                                             2: '21-24',
                                                             3: '25-29', 
                                                             4: '30-34',
                                                             5: '35-39',
                                                             6: '40-44',
                                                             7: '45-49',
                                                             8: '50-54',
                                                             9: '55-59',
                                                             10: '60-64',
                                                             11: '65 and over',
                                                             99: 'Missing'})
(dem_df['hoh_oldest_age'].value_counts()/dem_df['hoh_oldest_age'].value_counts().sum()).round(2)

65 and over    0.15
50-54          0.15
45-49          0.13
55-59          0.11
40-44          0.10
35-39          0.08
60-64          0.08
30-34          0.07
25-29          0.06
21-24          0.03
18-20          0.02
Missing        0.00
Name: hoh_oldest_age, dtype: float64

In [16]:
dem_df['hoh_most_education'] = dem_df['hoh_most_education'].replace({0: 'Less than a high school diploma',
                                                                     1: 'High school diploma or equivalent', 
                                                                     2: 'Some college but no degree', 
                                                                     3: 'Associate degree', 
                                                                     4: 'Bachelor’s degree',
                                                                     5: 'Graduate degree',
                                                                     99: 'Missing'})
(dem_df['hoh_most_education'].value_counts()/dem_df['hoh_most_education'].value_counts().sum()).round(2)

Missing                              0.36
Some college but no degree           0.26
Associate degree                     0.21
Bachelor’s degree                    0.14
High school diploma or equivalent    0.03
Graduate degree                      0.01
Name: hoh_most_education, dtype: float64
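
`replace` leaves any code that is not in the dictionary untouched, so a stray numeric code would sit unnoticed among the labels. A hedged sketch of one way to surface such codes with `Series.map`, using an abbreviated copy of the mapping above and a made-up code 7:

```python
import pandas as pd

# Toy codes; 7 is deliberately absent from the mapping.
codes = pd.Series([1, 2, 99, 7], name="hoh_most_education")
labels = {
    1: "High school diploma or equivalent",
    2: "Some college but no degree",
    99: "Missing",
}

# Unlike replace, map turns codes missing from the dictionary into NaN.
mapped = codes.map(labels)

# Codes that did not receive a label.
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())
print(unmapped_codes)
```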

### Merge browsing data with demographics data

In [17]:
mdf = gdf.merge(dem_df, how='left', on='machine_id')

In [18]:
# Given the data are pretty big, we delete gdf and call the garbage collector
%xdel gdf
gc.collect()

60

In [19]:
mdf.head()

Unnamed: 0,machine_id,total_time,total_visits,total_time_phishing,total_visits_phishing,total_time_malware,total_visits_malware,total_time_tl_malware,total_visits_tl_malware,hoh_most_education,hoh_oldest_age,racial_background,country_of_origin
0,17549714,36193,4147,63.0,43.0,70.0,43.0,0.0,0.0,Missing,40-44,White,Non-Hispanic
1,66614909,8240,1540,4.0,6.0,12.0,10.0,0.0,0.0,Some college but no degree,60-64,White,Hispanic
2,66859433,138,9,0.0,0.0,0.0,0.0,0.0,0.0,Missing,25-29,White,Non-Hispanic
3,69370447,1810,105,0.0,0.0,0.0,0.0,0.0,0.0,Missing,55-59,Other,Non-Hispanic
4,70605319,3099,288,0.0,1.0,1.0,1.0,0.0,0.0,Missing,21-24,Other,Non-Hispanic
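
The demographics file has 81,417 rows while the browsing data covers 81,407 machines, so it is worth confirming that the merge is one row per machine and that every machine finds its demographics. A minimal sketch on toy frames (values invented), using the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for gdf (after reset_index) and dem_df.
browsing = pd.DataFrame({"machine_id": [1, 2, 3], "total_visits": [10, 20, 30]})
demo = pd.DataFrame({
    "machine_id": [1, 2, 4],
    "racial_background": ["White", "Asian", "Black"],
})

# validate raises MergeError if machine_id is duplicated on either side.
merged = browsing.merge(
    demo, how="left", on="machine_id",
    validate="one_to_one", indicator=True,
)

# Machines that found no demographics row.
n_without_demo = int((merged["_merge"] == "left_only").sum())
print(len(merged), n_without_demo)
```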


In [20]:
mdf.describe().astype(int)

Unnamed: 0,machine_id,total_time,total_visits,total_time_phishing,total_visits_phishing,total_time_malware,total_visits_malware,total_time_tl_malware,total_visits_tl_malware
count,81407,81407,81407,81407,81407,81407,81407,81407,81407
mean,182457976,14915,981,245,11,270,12,31,1
std,13496576,29915,1546,4186,40,4383,41,705,14
min,17549714,0,1,0,0,0,0,0,0
25%,173182743,1286,146,0,0,0,1,0,0
50%,185429062,4998,469,5,3,7,3,0,0
75%,193360555,15719,1242,44,9,52,10,1,1
max,201100249,1383660,99484,425190,3442,425190,3054,98263,1540


### Analysis

#### By Education

We start by tracking total time spent and total number of visits to phishing and malware sites by education.

In [21]:
mdf.groupby(['hoh_most_education'])[['total_time_phishing', 'total_visits_phishing']].describe().astype(int)

,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing
hoh_most_education,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Associate degree,16810,298,6008,0,1,10,58,425190,16810,13,42,0,1,4,12,2351
Bachelor’s degree,11342,217,2509,0,1,12,63,139062,11342,14,37,0,1,5,14,1228
Graduate degree,539,337,2840,0,1,12,71,52860,539,19,46,0,1,5,17,474
High school diploma or equivalent,2489,249,2432,0,1,7,45,79224,2489,10,41,0,1,3,9,1534
Missing,29319,164,2367,0,0,1,22,203668,29319,7,30,0,0,1,5,2560
Some college but no degree,20908,328,5197,0,1,9,55,396920,20908,14,50,0,1,4,12,3442


The first thing that jumps out is the sharp right skew. Given the skew, we focus on the medians. A slightly surprising pattern emerges: the higher the education level of the most educated person in the household, the more frequent the visits to phishing sites (judged by the median, though the pattern also holds at the 75th percentile). For instance, households where the highest level of education is a graduate or bachelor's degree visit phishing sites more often (median = 5) than households where it is a high school diploma (median = 3), and at the 75th percentile graduate-degree households (17) are ahead of bachelor's-degree households (14). For time spent, the pattern is slightly less clear but consistent.

Looking at visits and time spent on malware sites (see below), the pattern is broadly the same. This holds regardless of how we measure malware sites, with Toulouse or Malware data, although under the Toulouse measure the median is 0 for every group and the differences show up only in the means and upper percentiles.

In [22]:
mdf.groupby(['hoh_most_education'])[['total_time_malware', 'total_visits_malware']].describe().astype(int)

,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware
hoh_most_education,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Associate degree,16810,303,5938,0,1,13,68,425190,16810,14,43,0,1,4,13,2248
Bachelor’s degree,11342,261,2906,0,1,15,77,144890,11342,16,40,0,1,5,16,1230
Graduate degree,539,409,2950,0,1,18,89,52817,539,22,52,0,1,6,20,473
High school diploma or equivalent,2489,272,2482,0,1,9,53,79221,2489,11,44,0,1,3,10,1596
Missing,29319,191,2853,0,0,2,27,204410,29319,7,31,0,0,1,6,2382
Some college but no degree,20908,355,5429,0,1,11,63,396953,20908,14,49,0,1,4,13,3054


In [23]:
mdf.groupby(['hoh_most_education'])[['total_time_tl_malware', 'total_visits_tl_malware']].describe().astype(int)

,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware
hoh_most_education,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Associate degree,16810,39,826,0,0,0,1,82125,16810,2,17,0,0,0,1,849
Bachelor’s degree,11342,36,660,0,0,0,1,49598,11342,2,12,0,0,0,1,366
Graduate degree,539,43,249,0,0,0,2,2582,539,3,21,0,0,0,1,302
High school diploma or equivalent,2489,24,165,0,0,0,1,4250,2489,1,10,0,0,0,1,366
Missing,29319,27,651,0,0,0,1,76056,29319,1,13,0,0,0,1,938
Some college but no degree,20908,29,742,0,0,0,1,98263,20908,1,15,0,0,0,1,1540
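
The education comparison leans on medians and 75th percentiles rather than means because of the right skew. A compact way to pull just those statistics per group is named aggregation, sketched here on an invented frame with placeholder group labels "A" and "B":

```python
import pandas as pd

# Invented data; group "A" has one extreme value to mimic the right skew.
toy = pd.DataFrame({
    "hoh_most_education": ["A", "A", "A", "B", "B", "B"],
    "total_visits_phishing": [0, 2, 100, 1, 3, 5],
})

summary = toy.groupby("hoh_most_education")["total_visits_phishing"].agg(
    mean="mean",
    median="median",
    p75=lambda s: s.quantile(0.75),
)
print(summary)
```

In the toy data, group "A" has the larger mean but the smaller median, which is the kind of reversal that makes the mean a poor guide under heavy skew.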


#### By Race

Next, we track total time spent on and total number of visits to phishing and malware sites by racial background. The results are reasonably consistent and show the following broad pattern: Asians visit phishing and malware sites most frequently, followed by Whites, Blacks, and "Others".

In [24]:
mdf.groupby(['racial_background'])[['total_time_phishing', 'total_visits_phishing']].describe().astype(int)

,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing
racial_background,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Asian,5769,199,2618,0,1,12,66,166637,5769,12,29,0,1,4,13,872
Black,9775,210,2560,0,0,7,50,203668,9775,10,44,0,1,3,10,2351
Missing,10,98,216,0,0,0,12,640,10,4,9,0,0,0,2,28
Other,18553,116,1757,0,0,1,20,139062,18553,6,20,0,0,1,5,960
White,47300,308,5172,0,0,8,52,425190,47300,13,45,0,1,3,11,3442


In [25]:
mdf.groupby(['racial_background'])[['total_time_malware', 'total_visits_malware']].describe().astype(int)

,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware
racial_background,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Asian,5769,249,2973,0,1,17,88,166637,5769,14,34,0,1,5,15,1030
Black,9775,217,2578,0,1,9,60,204410,9775,11,42,0,1,3,11,2248
Missing,10,0,0,0,0,0,1,2,10,0,0,0,0,0,1,2
Other,18553,145,2103,0,0,2,25,144890,18553,6,21,0,0,1,5,977
White,47300,332,5373,0,1,9,59,425190,47300,14,46,0,1,4,12,3054


In [26]:
mdf.groupby(['racial_background'])[['total_time_tl_malware', 'total_visits_tl_malware']].describe().astype(int)

,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware
racial_background,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Asian,5769,82,1191,0,0,0,4,76056,5769,5,39,0,0,0,2,1540
Black,9775,35,850,0,0,0,1,82125,9775,1,7,0,0,0,1,264
Missing,10,0,0,0,0,0,0,1,10,0,0,0,0,0,0,1
Other,18553,20,734,0,0,0,0,98263,18553,1,10,0,0,0,1,849
White,47300,29,567,0,0,0,1,71491,47300,1,10,0,0,0,1,366


#### By Age

Next, we split by age. The pattern is fairly consistent. People 60 and over visit phishing and malware sites most frequently (and spend the most time on them). People under 30 lie at the other end of the spectrum. People in their thirties, forties, and fifties generally fall in between.

In [27]:
mdf.groupby(['hoh_oldest_age'])[['total_time_phishing', 'total_visits_phishing']].describe().astype(int)

,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_time_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing,total_visits_phishing
hoh_oldest_age,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
18-20,1560,95,708,0,0,2,22,21277,1560,6,19,0,0,1,5,414
21-24,2812,102,924,0,0,2,24,40849,2812,6,20,0,0,2,6,336
25-29,4958,231,4616,0,0,3,33,273741,4958,8,24,0,0,2,7,454
30-34,5714,171,1863,0,0,4,39,71787,5714,8,22,0,0,2,8,484
35-39,6820,269,5613,0,0,5,38,391777,6820,9,32,0,0,2,8,1534
40-44,7919,297,6814,0,0,5,42,425190,7919,10,30,0,0,3,9,1228
45-49,10900,294,5659,0,0,4,41,389279,10900,10,37,0,0,2,9,1796
50-54,12461,198,2361,0,0,3,37,152794,12461,10,43,0,0,2,8,2560
55-59,9201,255,3775,0,0,7,52,231685,9201,12,36,0,0,3,11,969
60-64,6463,286,3668,0,1,11,63,185890,6463,15,51,0,1,4,14,2300


In [28]:
mdf.groupby(['hoh_oldest_age'])[['total_time_malware', 'total_visits_malware']].describe().astype(int)

,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_time_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware,total_visits_malware
hoh_oldest_age,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
18-20,1560,101,692,0,0,2,26,21275,1560,6,20,0,0,2,6,416
21-24,2812,108,914,0,0,3,33,40849,2812,7,26,0,0,2,7,707
25-29,4958,260,4885,0,0,4,39,273741,4958,8,23,0,0,2,8,387
30-34,5714,261,3578,0,0,6,42,179431,5714,9,24,0,0,3,9,405
35-39,6820,295,5841,0,0,6,48,391778,6820,9,32,0,1,3,9,1596
40-44,7919,328,6961,0,0,6,51,425190,7919,11,32,0,1,3,10,1230
45-49,10900,331,5794,0,0,6,48,389278,10900,11,40,0,0,3,9,1795
50-54,12461,228,3099,0,0,5,45,216122,12461,11,46,0,0,2,9,2382
55-59,9201,280,3608,0,0,9,59,204410,9201,13,36,0,1,3,12,977
60-64,6463,273,3480,0,1,13,74,185903,6463,17,52,0,1,5,15,2248


In [29]:
mdf.groupby(['hoh_oldest_age'])[['total_time_tl_malware', 'total_visits_tl_malware']].describe().astype(int)

,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_time_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware,total_visits_tl_malware
hoh_oldest_age,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
18-20,1560,18,144,0,0,0,1,3219,1560,0,4,0,0,0,1,136
21-24,2812,44,1440,0,0,0,1,76056,2812,1,10,0,0,0,1,469
25-29,4958,18,152,0,0,0,1,5075,4958,1,9,0,0,0,1,339
30-34,5714,39,1121,0,0,0,1,82125,5714,1,13,0,0,0,1,673
35-39,6820,36,1210,0,0,0,1,98263,6820,1,7,0,0,0,1,338
40-44,7919,19,287,0,0,0,1,23959,7919,1,11,0,0,0,1,654
45-49,10900,27,315,0,0,0,1,19321,10900,1,14,0,0,0,1,848
50-54,12461,36,841,0,0,0,1,71491,12461,1,15,0,0,0,1,938
55-59,9201,29,539,0,0,0,1,40966,9201,1,11,0,0,0,1,416
60-64,6463,34,387,0,0,0,1,16705,6463,2,18,0,0,0,1,849


### Proportion of time, visits

We think some of the patterns above reflect the total time people spend online: exposure to bad domains may be greater simply because of heavier use. That is an important insight in itself. But we also believe that the more educated, for instance, are less likely, adjusted for frequency, to go to phishing and malware sites because they have greater skills. So we now look at proportions. The data are roughly in line with expectations for education: the least educated (remember this is at the household level) spend the largest share of their time (visits) on phishing and malware sites, and the most educated the smallest. For race as well, the pattern is flipped. For age, we don't expect such a reversal, and that is indeed what we find.

In [30]:
mdf['prop_phishing_visits'] = mdf['total_visits_phishing']/mdf['total_visits']
mdf['prop_phishing_time'] = mdf['total_time_phishing']/mdf['total_time']

mdf['prop_malware_visits'] = mdf['total_visits_malware']/mdf['total_visits']
mdf['prop_malware_time']   = mdf['total_time_malware']/mdf['total_time']

mdf['prop_tl_malware_visits'] = mdf['total_visits_tl_malware']/mdf['total_visits']
mdf['prop_tl_malware_time']   = mdf['total_time_tl_malware']/mdf['total_time']

mdf.groupby(['hoh_most_education'])[['prop_phishing_visits', 'prop_phishing_time',
                                     'prop_malware_visits', 'prop_malware_time',
                                     'prop_tl_malware_visits', 'prop_tl_malware_time']].mean().round(3)

hoh_most_education,prop_phishing_visits,prop_phishing_time,prop_malware_visits,prop_malware_time,prop_tl_malware_visits,prop_tl_malware_time
Associate degree,0.011,0.012,0.012,0.013,0.002,0.002
Bachelor’s degree,0.011,0.011,0.012,0.013,0.002,0.002
Graduate degree,0.011,0.009,0.012,0.012,0.002,0.003
High school diploma or equivalent,0.012,0.015,0.013,0.017,0.002,0.004
Missing,0.014,0.018,0.016,0.022,0.004,0.006
Some college but no degree,0.012,0.014,0.013,0.015,0.002,0.002


In [31]:
mdf.groupby(['racial_background'])[['prop_phishing_visits', 'prop_phishing_time',
                                    'prop_malware_visits', 'prop_malware_time',
                                    'prop_tl_malware_visits', 'prop_tl_malware_time']].mean().round(3)

racial_background,prop_phishing_visits,prop_phishing_time,prop_malware_visits,prop_malware_time,prop_tl_malware_visits,prop_tl_malware_time
Asian,0.011,0.011,0.013,0.013,0.004,0.004
Black,0.014,0.017,0.015,0.02,0.003,0.005
Missing,0.027,0.052,0.006,0.001,0.0,0.0
Other,0.013,0.015,0.014,0.019,0.003,0.004
White,0.013,0.014,0.013,0.016,0.002,0.003


Next are the mean proportions by age. Given the plausible skew in total visits and time spent, it is also worth checking how the medians look across age, education, and race.

In [32]:
mdf.groupby(['hoh_oldest_age'])[['prop_phishing_visits', 'prop_phishing_time',
                                 'prop_malware_visits', 'prop_malware_time',
                                 'prop_tl_malware_visits', 'prop_tl_malware_time']].mean().round(3)

hoh_oldest_age,prop_phishing_visits,prop_phishing_time,prop_malware_visits,prop_malware_time,prop_tl_malware_visits,prop_tl_malware_time
18-20,0.012,0.013,0.013,0.016,0.003,0.004
21-24,0.012,0.013,0.013,0.015,0.003,0.004
25-29,0.012,0.014,0.013,0.015,0.003,0.003
30-34,0.012,0.013,0.013,0.016,0.003,0.004
35-39,0.012,0.014,0.013,0.016,0.002,0.004
40-44,0.012,0.015,0.013,0.017,0.002,0.003
45-49,0.012,0.015,0.013,0.017,0.003,0.004
50-54,0.013,0.016,0.015,0.019,0.004,0.006
55-59,0.013,0.016,0.014,0.017,0.002,0.003
60-64,0.013,0.013,0.013,0.014,0.002,0.003
