## Analysis of HIBP/YouGov Data

In [1]:
import pandas as pd

In [2]:
# How many breaches does HIBP have data for?
breaches = pd.read_json("../data/breaches.json")

print("Number of breaches:", breaches.shape[0])
print("Total number of breached accounts: ", breaches['PwnCount'].sum())
print("Number of Unique Domains: ", breaches['Domain'].nunique())

Number of breaches: 293
Total number of breached accounts:  5235843322
Number of Unique Domains:  278


In [3]:
# Read in the data
profile = pd.read_csv("../data/YGOV1058_profile.csv", low_memory = False)
pwned   = pd.read_csv("../data/YGOV1058_pwned.csv", low_memory = False)

print("Number of people:", profile.shape[0])
print("Number of people whose info. was breached:", pwned['id'].nunique())
print("Number of breaches:", pwned.shape[0])

# Merge the two files
fin_dat = pd.merge(profile, pwned, on = 'id', how = 'left')
print("Number of rows in the final dataset: ", fin_dat.shape[0])

# Sanity check = Number of unique IDs 
print("Sanity check: number of unique IDs in the final dataset:", fin_dat['id'].nunique())

Number of people: 5000
Number of people whose info. was breached: 4142
Number of breaches: 14979
Number of rows in the final dataset:  15837
Sanity check: number of unique IDs in the final dataset: 5000


Already our heads are spinning. 82.84% of the panelists (4,142 of 5,000) have had their credentials exposed in at least one of the big breaches we have public data on. And those 4,142 panelists account for 14,979 breach records between them.
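The printed counts can be cross-checked with a little arithmetic. The sketch below copies the numbers from the output above and assumes (this is not verified here) that the profile file has one row per panelist and the pwned file has one row per panelist-breach pair.

```python
# Counts copied from the notebook output above.
n_panelists = 5000
n_breached = 4142
n_breach_records = 14979
n_merged_rows = 15837

share_breached = n_breached / n_panelists
records_per_breached_panelist = n_breach_records / n_breached

# A left merge keeps one row per breach record, plus one all-NaN row
# for every panelist who has no breach record at all.
expected_rows = n_breach_records + (n_panelists - n_breached)

print(f"Share breached: {share_breached:.2%}")
print(f"Records per breached panelist: {records_per_breached_panelist:.2f}")
print("Merged row count consistent:", expected_rows == n_merged_rows)
```

Under those assumptions the merged row count is exactly what a left join should produce.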

In [4]:
list(fin_dat)

['id',
 'gender',
 'birthyr',
 'race',
 'educ',
 'faminc',
 'inputstate',
 'pid3',
 'pid7',
 'votereg',
 'ideo5',
 'newsint',
 'marstat',
 'child18',
 'employ',
 'presvote16post',
 'region',
 'Title',
 'Name',
 'Domain',
 'BreachDate',
 'AddedDate',
 'ModifiedDate',
 'PwnCount',
 'DataClasses',
 'IsVerified',
 'IsFabricated',
 'IsSensitive',
 'IsActive',
 'IsRetired',
 'IsSpamList',
 'LogoType']

In [5]:
fin_dat.head()

| | id | gender | birthyr | race | educ | faminc | inputstate | pid3 | pid7 | votereg | ... | ModifiedDate | PwnCount | DataClasses | IsVerified | IsFabricated | IsSensitive | IsActive | IsRetired | IsSpamList | LogoType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 371823339 | 1 | 1993 | 1 | 2 | 4 | 39 | 3 | 5 | 1 | ... | 2017-03-08T23:49:53Z | 393430309.0 | Email addresses, IP addresses, Names, Physical... | True | False | False | True | False | True | png |
| 1 | 371823339 | 1 | 1993 | 1 | 2 | 4 | 39 | 3 | 5 | 1 | ... | 2017-08-07T02:51:12Z | 85176234.0 | Email addresses, Passwords, Usernames | True | False | False | True | False | False | svg |
| 2 | 398212310 | 1 | 2000 | 1 | 2 | 97 | 51 | 5 | 8 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 392933925 | 1 | 2000 | 1 | 1 | 1 | 34 | 2 | 7 | 1 | ... | 2017-08-07T02:51:12Z | 85176234.0 | Email addresses, Passwords, Usernames | True | False | False | True | False | False | svg |
| 4 | 392933925 | 1 | 2000 | 1 | 1 | 1 | 34 | 2 | 7 | 1 | ... | 2017-03-25T23:43:45Z | 29396116.0 | Email addresses, IP addresses, Passwords, User... | True | False | False | True | False | False | png |


Let's start by checking how frequently panelists' emails are part of a breach.

In [6]:
# Panelists whose emails are in no breach get NaNs from the left join.
# Flag each row as True if it is a real breach record, False otherwise.
fin_dat['pwn'] = pd.notna(fin_dat['PwnCount'])

fin_dat.groupby(['id'])['pwn'].sum().describe().round(2)

count    5000.00
mean        3.00
std         2.62
min         0.00
25%         1.00
50%         3.00
75%         4.00
max        22.00
Name: pwn, dtype: float64

On average, panelists' emails are part of 3 breaches (the median is also 3). The number of breaches a panelist's email is part of ranges from 0 to 22, with a standard deviation of 2.62.
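To see why panelists with no breaches end up with a count of 0 rather than dropping out, here is a minimal sketch on hypothetical toy data (the ids and `PwnCount` values are made up). It mirrors the left merge and the `pwn` flag used above.

```python
import pandas as pd

# Hypothetical toy data, not the YouGov panel.
profile = pd.DataFrame({"id": [1, 2, 3]})
pwned = pd.DataFrame({"id": [1, 1, 3], "PwnCount": [100.0, 250.0, 75.0]})

# Left merge: panelist 2 has no breach record, so PwnCount is NaN for them.
merged = pd.merge(profile, pwned, on="id", how="left")

# True for real breach rows, False for the NaN placeholder row.
merged["pwn"] = merged["PwnCount"].notna()

# Summing booleans counts the True values, so panelist 2 gets 0.
breaches_per_person = merged.groupby("id")["pwn"].sum()
print(breaches_per_person)
```

Summing the flag, rather than counting rows, is what keeps the placeholder row from being counted as a breach.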

So how does this exposure vary by gender, race, education, and age? We answer those questions next, starting with gender.

In [7]:
# Let's first recode gender (see the codebook)
fin_dat['sex'] = fin_dat['gender'].replace({1: 'male', 2: 'female'})

print((fin_dat['sex'].value_counts()/fin_dat['sex'].value_counts().sum()).round(2))

fin_dat.groupby(['id', 'sex'])['pwn'].sum().groupby(['sex']).mean().round(2)

female    0.54
male      0.46
Name: sex, dtype: float64


sex
female    3.17
male      2.82
Name: pwn, dtype: float64

Women's emails are part of breaches a bit more frequently than men's emails (Diff = .35). But what does that mean? On average, the emails of 100 men would be part of 282 breaches. For 100 women, the number is 317. So women's emails appear in about 12% more breaches than men's. This is a bit surprising if, as seems plausible, men have more accounts online.
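The percentage comparison is a relative difference between two group means. A small helper (the function name is ours, introduced only for illustration) makes the arithmetic explicit, using the means printed above.

```python
def relative_difference(a, b):
    """Return how much larger a is than b, as a fraction of b."""
    return (a - b) / b

# Group means copied from the output above.
female_mean = 3.17
male_mean = 2.82

print(f"Absolute difference: {female_mean - male_mean:.2f}")
print(f"Relative difference: {relative_difference(female_mean, male_mean):.1%}")
```

The same helper applies to any pair of group means reported in this notebook.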

Next, let's look at race.

In [8]:
# Again, we start by recoding ints to something that is human readable
fin_dat['race_eth'] = fin_dat['race'].replace({1: 'White', 
                                               2: 'Black', 
                                               3: 'Hispanic/Latino', 
                                               4: 'Asian', 
                                               5: 'Native American', 
                                               6: 'Middle Eastern', 
                                               7: 'Mixed Race', 
                                               8: 'Other'})

# Let's first check how many of each we got
(fin_dat['race_eth'].value_counts()/fin_dat['race_eth'].value_counts().sum()).round(2)

White              0.67
Hispanic/Latino    0.13
Black              0.12
Asian              0.03
Middle Eastern     0.02
Mixed Race         0.01
Native American    0.01
Other              0.00
Name: race_eth, dtype: float64
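One caveat on shares like these: `value_counts()` on the merged table counts rows, and a panelist with several breach records contributes several rows. A sketch on hypothetical toy data shows how row-level and person-level shares can differ. Deduplicating on `id` first is one possible way to get person-level shares, not necessarily what the analysis intends.

```python
import pandas as pd

# Hypothetical toy data: panelist 1 has three breach rows, panelist 2 has one.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "race_eth": ["White", "White", "White", "Black"],
})

# Shares over rows: panelist 1 is counted three times.
row_shares = toy["race_eth"].value_counts(normalize=True)

# Shares over people: keep one row per panelist before counting.
person_shares = toy.drop_duplicates("id")["race_eth"].value_counts(normalize=True)

print(row_shares)
print(person_shares)
```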

In [9]:
# Mean number of breaches the emails of people of diff. race/ethnicity are part of
fin_dat.groupby(['id', 'race_eth'])['pwn'].sum().groupby(['race_eth']).mean().round(2)

race_eth
Asian              2.82
Black              3.16
Hispanic/Latino    2.50
Middle Eastern     2.66
Mixed Race         2.45
Native American    2.96
Other              2.92
White              3.12
Name: pwn, dtype: float64

This is compelling. African Americans' and Whites' emails are most frequently part of breaches. The means are 3.16 and 3.12 for African Americans and Whites, respectively. For Hispanics/Latinos, the corresponding number is just 2.50. For Asians, the mean is 2.82, about 9.6% lower than for Whites.

We next check how the frequency with which a person's email is part of breaches varies with education.

In [10]:
# We start again by changing numbers to semantic labels
fin_dat['educat'] = fin_dat['educ'].replace({1: 'No HS', 
                                             2: 'HS Grad.', 
                                             3: 'Some College', 
                                             4: '2-year College Degree', 
                                             5: '4-year College Degree', 
                                             6: 'Postgrad Degree'})

# Let's check how many of each we got
(fin_dat['educat'].value_counts()/fin_dat['educat'].value_counts().sum()).round(2)

HS Grad.                 0.32
Some College             0.20
4-year College Degree    0.19
Postgrad Degree          0.11
2-year College Degree    0.11
No HS                    0.06
Name: educat, dtype: float64

In [11]:
# Average by education
fin_dat.groupby(['id', 'educat'])['pwn'].sum().groupby(['educat']).mean().round(2)

educat
2-year College Degree    3.07
4-year College Degree    3.22
HS Grad.                 2.89
No HS                    2.35
Postgrad Degree          3.20
Some College             3.04
Name: pwn, dtype: float64

The numbers are once again compelling. The relationship between the average number of breaches a panelist's email is part of and their education is roughly monotonic. The average number of breaches for people with no HS is just 2.35. Compare this to postgrads, whose mean of 3.20 is over 36% greater!

Lastly, we check the relationship with age. 

In [12]:
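# Age in 2018, binned into right-closed intervals: (18, 25], (25, 35], ...
fin_dat['agecat'] = pd.cut(2018 - fin_dat['birthyr'], [18, 25, 35, 50, 65, 100])
print((fin_dat['agecat'].value_counts()/fin_dat['agecat'].value_counts().sum()).round(2))
# Note: .size() counts rows, so a panelist with no breach record still contributes 1
fin_dat.groupby(['id', 'agecat']).size().groupby(['agecat']).mean().round(2)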

(50, 65]     0.28
(35, 50]     0.26
(25, 35]     0.19
(65, 100]    0.18
(18, 25]     0.09
Name: agecat, dtype: float64


agecat
(18, 25]     2.29
(25, 35]     3.29
(35, 50]     3.49
(50, 65]     3.41
(65, 100]    3.07
dtype: float64

We see an interesting curvilinear pattern with age. Young people (in part because they may not have accounts with some of the compromised websites) have their emails as part of the fewest breaches (Mean = 2.29). There is a steep jump to 3.29 for people between ages 25 and 35, and another jump for people between 35 and 50 (Mean = 3.49). People over 65 have emails that are part of somewhat fewer breaches (Mean = 3.07).
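A note on the age bins: `pd.cut` builds right-closed intervals by default, so with a lowest edge of 18 an age of exactly 18 falls outside every bin and becomes NaN. The sketch below uses a few hypothetical ages to show this, along with `include_lowest=True` as one possible way to keep the lowest edge. Whether 18-year-olds should be included is a choice for the analyst.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([18, 19, 25, 26, 65, 66])
bins = [18, 25, 35, 50, 65, 100]

# Default: intervals are (18, 25], (25, 35], ... so 18 itself is excluded.
default_cut = pd.cut(ages, bins)

# include_lowest=True extends the first interval to cover the lowest edge.
inclusive_cut = pd.cut(ages, bins, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_cut, "inclusive": inclusive_cut}))
```

Rows with a NaN category are silently dropped by a later `groupby` on that category.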

### Now let's check share of different breaches in the data. 

In [13]:
fin_dat.groupby(['Domain']).size().sort_values(ascending = False)

Domain
rivercitymediaonline.com    2913
linkedin.com                1089
modbsolutions.com           1067
myspace.com                 1059
data4marketers.com           996
cashcrate.com                856
adobe.com                    609
disqus.com                   570
ticketfly.com                393
tumblr.com                   340
dropbox.com                  288
dailymotion.com              255
last.fm                      248
evony.com                    171
clixsense.com                150
cafemom.com                  145
imesh.com                    144
kickstarter.com              140
edmodo.com                   130
zomato.com                   112
neopets.com                  108
reverbnation.com              96
forum.btcsec.com              77
bitly.com                     77
r2games.com                   66
8tracks.com                   52
funimation.com                48
diet.com                      45
patreon.com                   43
yahoo.com                     36
   

In [14]:
# Number of domains with more than 100 records in the panel
(fin_dat.groupby(['Domain']).size() > 100).sum()

21

In [15]:
fin_dat.groupby(['Domain']).size()[fin_dat.groupby(['Domain']).size() > 100].sum()

11783
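When counting how many domains clear a threshold, it helps to remember that `Series.nunique()` counts distinct values while a boolean sum counts entries. The two agree only when no two domains have the same record count. The sketch below uses hypothetical counts with a tie to show the difference, and then expresses the 11,783 records above as a share of all 14,979 breach records.

```python
import pandas as pd

# Hypothetical record counts; two domains tie at 150.
sizes = pd.Series({"a.com": 300, "b.com": 150, "c.com": 150, "d.com": 40})

above = sizes[sizes > 100]
n_distinct_values = above.nunique()   # distinct counts, so the tie collapses
n_domains = int((sizes > 100).sum())  # one per domain above the threshold

print("Distinct counts above 100:", n_distinct_values)
print("Domains above 100:", n_domains)

# Share of breach records from domains with more than 100 records,
# using the totals printed in the notebook.
share_top = 11783 / 14979
print(f"Share of records from those domains: {share_top:.1%}")
```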

Lastly, we investigate the kind of breaches. HIBP uses the code 'SpamList' for cases where personal data is being used for spamming people. Here's HIBP: "Occasionally, large volumes of personal data are found being utilised for the purposes of sending targeted spam. This often includes many of the same attributes frequently found in data breaches such as names, addresses, phones numbers and dates of birth. The lists are often aggregated from multiple sources, frequently by eliciting personal information from people with the promise of a monetary reward. Whilst the data may not have been sourced from a breached system, the personal nature of the information and the fact that it's redistributed in this fashion unbeknownst to the owners warrants inclusion here."

In [16]:
fin_dat['IsSpamList'].sum(skipna = True)

5649

Is there any data from 'fabricated breaches'? More on what HIBP means by fabricated breaches: "Some breaches may be flagged as "fabricated". In these cases, it is highly unlikely that the breach contains legitimate data sourced from the alleged site but it may still be sold or traded under the auspices of legitimacy. Often these incidents are comprised of data aggregated from other locations (or may be entirely fabricated), yet still contain actual email addresses unbeknownst to the account holder. Fabricated breaches are still included in the system because regardless of their legitimacy, they still contain personal information about individuals who want to understand their exposure on the web."

In [17]:
fin_dat['IsFabricated'].sum(skipna = True)

0

What proportion comes from "unverified" breaches? Here's HIBP on what it means by unverified breaches: "Some breaches may be flagged as "unverified". In these cases, whilst there is legitimate data within the alleged breach, it may not have been possible to establish legitimacy beyond reasonable doubt. Unverified breaches are still included in the system because regardless of their legitimacy, they still contain personal information about individuals who want to understand their exposure on the web." Suggested reading: https://www.troyhunt.com/introducing-unverified-breaches-to-have-i-been-pwned/

In [18]:
fin_dat['IsVerified'].sum(skipna = True)

14979

All 14,979 breach records come from verified breaches.

It is useful to see if associations with socio-economic indicators hold up when we subset on verified, non-spam breaches.

In [19]:
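# Keep rows with no breach record (NaN) plus records that are not from spam lists
fin_small_dat = fin_dat[pd.isna(fin_dat['IsSpamList']) | (fin_dat['IsSpamList'] == False)]
# Of those, keep rows with no breach record plus records from verified breaches
fin_small_dat = fin_small_dat[pd.isna(fin_small_dat['IsVerified']) | (fin_small_dat['IsVerified'] == True)]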

fin_small_dat.shape[0]

10188
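A note on combining boolean masks: in Python, the bitwise operators `|` and `&` bind more tightly than comparisons such as `==`. An expression of the form `a | b == True` is therefore evaluated as `(a | b) == True`, which is why wrapping each comparison in its own parentheses is the safer habit. The plain-integer sketch below illustrates the precedence rule without any pandas objects.

```python
# Plain-integer illustration of operator precedence.
implicit = 1 | 2 == 3         # parsed as (1 | 2) == 3, i.e. 3 == 3
grouped_left = (1 | 2) == 3   # same grouping, written explicitly
grouped_right = 1 | (2 == 3)  # 1 | False, which is the integer 1

print(implicit, grouped_left, grouped_right)
```

With pandas masks the two groupings can happen to give the same rows, but explicit parentheses make the intended filter unambiguous.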

In [20]:
fin_small_dat.groupby(['id', 'educat'])['pwn'].sum().groupby(['educat']).mean().round(2)

educat
2-year College Degree    2.10
4-year College Degree    2.37
HS Grad.                 1.91
No HS                    1.53
Postgrad Degree          2.30
Some College             2.22
Name: pwn, dtype: float64

The pattern holds up. Again, the mean number of breaches for people with a four-year college degree or more is higher than for people who only got as far as high school.

How about men versus women?

In [21]:
fin_small_dat.groupby(['id', 'sex'])['pwn'].sum().groupby(['sex']).mean().round(2)

sex
female    2.15
male      2.05
Name: pwn, dtype: float64

The gap is attenuated: the difference between the averages falls from .35 to .10.

How about when we split by race and ethnicity?

In [22]:
fin_small_dat.groupby(['id', 'race_eth'])['pwn'].sum().groupby(['race_eth']).mean().round(2)

race_eth
Asian              2.16
Black              2.03
Hispanic/Latino    1.73
Middle Eastern     2.05
Mixed Race         1.70
Native American    1.85
Other              2.69
White              2.21
Name: pwn, dtype: float64

Again, things look a bit different than above. Asians join Whites near the top of the pile. (We have too few Others to say anything with confidence.) African Americans' and Hispanics' accounts are less frequently breached.

In [23]:
fin_small_dat.groupby(['id', 'agecat']).size().groupby(['agecat']).mean().round(2)

agecat
(18, 25]     1.99
(25, 35]     2.63
(35, 50]     2.53
(50, 65]     2.30
(65, 100]    1.93
dtype: float64

The general pattern for age remains roughly similar, with people between 25 and 50 more likely to have their accounts breached than those under 25 and those over 65.
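One thing to keep in mind when reading the age tables: they are built with `.size()`, which counts rows, whereas the gender, race, and education tables sum the `pwn` flag. In a left-merged table a panelist with no breach record still has one placeholder row, so the two approaches can differ. A sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows in the shape of the merged table.
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "agecat": ["(25, 35]", "(25, 35]", "(18, 25]"],
    "pwn": [True, True, False],  # panelist 2 has no breach record
})

# Row counts: the placeholder row for panelist 2 counts as 1.
rows_per_person = toy.groupby(["id", "agecat"]).size()

# Breach counts: summing the flag gives panelist 2 a count of 0.
breaches_per_person = toy.groupby(["id", "agecat"])["pwn"].sum()

print(rows_per_person)
print(breaches_per_person)
```

If breach counts are what is wanted, summing `pwn` would make the age tables directly comparable with the others. We have not rerun the notebook with that change, so the effect on the reported means is not shown here.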