# I. Motivation and problem statement:
Ever since private companies began storing and using data at scale, data breaches have been a critical threat, imposing direct financial costs on industries and indirect costs on the users whose personal information is stolen. I am therefore planning this analysis to investigate which types of industry are most targeted by data breaches, whether the number of records lost differs considerably between types of breach, and how these trends have changed over time. I believe deeper insight into these questions would help industries become more conscious of the data breach risks they face and implement appropriate legal and security procedures quickly enough to protect victims' personal information in times of crisis.

# II. Data selected for analysis:
For this analysis, I'm going to use the data breach report dataset from the Privacy Rights Clearinghouse (PRC), which can be found at this link: https://privacyrights.org/data-breaches. This dataset specifies the type of breach and the type of business or organization associated with each incident from 2005 to 2018, with more than 9000 breach events. The data is licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), which allows me to copy the material in any format and to transform and build upon it for non-commercial purposes. This dataset is suitable for my research goal because it records the exact type of breach and the organization for each event, allowing me to analyze the trend of data breaches by type and by industry over time. The second dataset that I'm going to use comes from Information is Beautiful and contains over 370 events selected from 30,000 data breach events from 2004 to 2021. It can be found at this link: https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/ This dataset contains more up-to-date data breach events than the first one. It is also under a Creative Commons (CC) license, allowing me to use and build upon the dataset. This dataset provides the number of records lost in later events and helps me identify more recent trends in data breaches. One possible ethical consideration for both datasets is that they include the company's name for each event. I therefore won't directly name any company in this report, in order to avoid damaging their reputation without their knowledge.

# III. Unknowns and dependencies:
Some factors outside of my control might affect my ability to complete this project by the end of the quarter. One is that I may have a difficult time using Python in Jupyter Notebook for my analysis, which would take more time than I expect. Another is that I would have to find an additional dataset about data breach costs to answer one of my research questions, and I might not find a suitable one.


# IV. Research questions (or hypotheses):
My first hypothesis is that data breaches with external causes are more detrimental than those with internal causes, in the sense that they account for more lost records. The second hypothesis is that hacking makes up most of the total data breach events. The third hypothesis is that these breaches most often target businesses such as financial and insurance services, since these businesses make the most profit compared to other types of businesses.
The 5 Research Questions that I'm attempting to answer are:
1. Between external reasons and internal reasons for data breaches, which type caused the greater loss of data records from 2004 to 2021?
2. How have the trends for internal attacks and external attacks for data breaches progressed from 2004 to 2021?
3. What is the distribution of breaches by specific types? 
4. What types of businesses are most targeted by hacking breaches?
5. How have the trends for hacking breaches aimed at organizations changed from 2005 to 2018?

# V. Background and/or Related Work:
The loss of personal and sensitive information belonging to users and companies has led to significant reputational damage as well as financial losses. According to IBM’s 2016 Cost of Data Breach Study, the average cost of a data breach has reached $4 million. One article (https://onlinelibrary.wiley.com/doi/full/10.1002/widm.1211) analysed data leak incident statistics from 2011 to 2016 from the Identity Theft Resource Center and concluded that data leakage has been increasing, with business and medical/healthcare leaks making up the majority. Moreover, according to the figure the authors included, breaches caused by outsiders (hackers, etc.) accounted for around 55% of all incidents in 2016. The article also stated that, according to other reports, insiders have also posed a data leak threat in recent years, with more than 40% of breaches perpetrated from inside a company. Another report, by the Verizon DBIR team in 2020, analysed both incidents and breaches collected from a variety of sources using the VERIS framework (https://enterprise.verizon.com/resources/reports/dbir/2020/introduction/) and confirmed that external attackers are considerably more common than internal attackers, with financial gain being the primary motive. These breaches rely primarily on phishing and credential theft, which fall under the “Social” and “Hacking” categories respectively. These findings informed the hypotheses and research questions stated above, which investigate the recent overall trend of data breaches in terms of the types of businesses most targeted and the types of attackers behind them.

# VI. Methodology:
First, the data are compiled from PRC and Information is Beautiful into two separate .tsv files and loaded into two lists of dictionaries. Second, sums, percentages, and averages are computed from this data to answer the research questions stated above. I then use a bar graph and a pie chart to compare the numbers directly against each other, particularly for questions 1, 3 and 4. For identifying trends over the years in questions 2 and 5, I use time series graphs covering 2004 to 2021 and 2005 to 2018 respectively, in order to identify the patterns for internal versus external attacks as well as the trends in hacking breaches for each type of organization. Presenting the results as graphs is useful because it allows me to visualize and analyze the trends over time.

## Part 1a: Load the data into the notebook

First I'll load each of the data files into my notebook and save them as lists of dictionaries, since that's fairly standard and makes it easy to check my work as I go.

In [1]:
#import the csv module
import csv

I'll load my .tsv files into Python and store each one in a variable with a relevant name. Since the first row of the Balloon Race dataset only explains what each column contains, I'm going to remove that row for clarity in the analysis later on.

In [2]:
#load the data from the flat files into two lists-of-dictionaries
with open('Balloon Race.tsv', newline='') as f:
    reader = csv.DictReader(f, delimiter='\t')
    balloon_data_breach = [dict(r) for r in reader]
    
with open('PRC Data Breach.tsv', newline='') as f:
    reader = csv.DictReader(f, delimiter='\t')
    prc_data_breach = [dict(r) for r in reader] 
    
#remove the first row of the Balloon Race dataset, which only describes the columns
balloon_data_breach.pop(0)
print(balloon_data_breach[0])
print(prc_data_breach[0])
print(len(prc_data_breach))
print(len(balloon_data_breach))

{'organisation': 'Ubiquiti', 'alternative name': '', 'records lost': '16,000,000', 'year   ': '2021', 'date': 'Feb 2021', 'story': 'Unknown amount of user data breached', 'sector': 'tech', 'method': 'hacked', 'interesting story': '', 'data sensitivity': '2', 'displayed records': '', '': '', 'source name': 'ZDNet', '1st source link': 'https://www.zdnet.com/article/ubiquiti-tells-customers-to-change-passwords-after-security-breach/', '2nd source link': '', 'ID': ''}
{'Date Made Public': '3/3/2006', 'Company': 'PayDay OK LLC', 'City': '', 'State': 'New Jersey', 'Type of breach': 'HACK', 'Type of organization': 'BSF', 'Total Records': '88', 'Description of incident': "The company's website was breached sometime around February 19 by a hacker in an attempt to gain access to certain customers' private information. Social Security numbers, names, addresses, bank account names and bank account numbers may have been compromised. At least 88 individuals were affected.", 'Information Source': 'Ca
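One detail visible in the first printed row is that the year column is stored under the key ``'year   '``, with trailing spaces. As a minimal sketch (not part of the original notebook), a loader could strip whitespace from the column names while reading, so that later cells could refer to ``'year'`` instead. The function name ``load_tsv_rows`` and the in-memory sample are assumptions made for illustration only.

```python
import csv
import io

def load_tsv_rows(handle):
    """Read a tab-separated file object into a list of dicts with stripped column names."""
    reader = csv.DictReader(handle, delimiter='\t')
    rows = []
    for row in reader:
        # strip stray whitespace from column names such as 'year   '
        rows.append({(key or '').strip(): value for key, value in row.items()})
    return rows

# tiny made-up sample standing in for the real .tsv file
sample = io.StringIO("organisation\tyear   \tmethod\nExample Org\t2021\thacked\n")
sample_rows = load_tsv_rows(sample)
print(sample_rows[0]['year'])
```

The rest of the notebook keeps the original ``'year   '`` key, so this sketch would only apply if the loading cell were changed as well.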

## Question 1: Between external reasons and internal reasons for data breaches, which type caused the greater loss of data records from 2004 to 2021?

For my first step, I need to decide what I mean by external reasons and internal reasons. I looked through the method column of the Balloon Race dataset and decided that external reasons would include "hacked" and "lost device", whereas internal reasons would include "poor security" and "inside job". The method "oops" is left out because the dataset doesn't give a direct explanation for it.

I'll store these values in two lists, so that as I loop through each data breach event in the dataset later I can check whether the event falls within the external or internal reasons. 

In [3]:
internal_reason = ['poor security','inside job']
external_reason = ['hacked', 'lost device']

As I loop through the dataset, I'll use a dictionary with keys for 'internal' and 'external'. The values for each of these will start at 0 and increase as I loop through the dataset and add the number of records lost for each type I see.

In [4]:
#create an empty dictionary to hold the counts
breach_types = {'external':0, 'internal':0}

Now I will loop through my ``balloon_data_breach`` list and examine each dictionary in that list. If the method in that dictionary matches one of the methods in the internal and external reasons lists I made above, I take the value of ``records lost`` for that dictionary and add it to the matching total in ``breach_types``. If it doesn't match, I move on to the next one and do the same thing.

In [5]:
#create a function that strips the commas from the records lost column so the value can be turned into an int
def filter_commas(value):
    return value.replace(',', '')
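A hedged variation on this helper: if any row had a blank or non-numeric ``records lost`` value, ``int()`` would raise a ``ValueError``. The sketch below, using the hypothetical name ``parse_record_count``, returns 0 in that case. Whether such rows exist in the real dataset is not known from the notebook, so this is only a defensive option.

```python
def parse_record_count(value):
    """Turn a string such as '16,000,000' into an int; return 0 if it is blank or not numeric."""
    cleaned = value.replace(',', '').strip()
    # str.isdigit() is False for empty strings and for text such as 'unknown'
    return int(cleaned) if cleaned.isdigit() else 0

print(parse_record_count('16,000,000'))
print(parse_record_count(''))
```

Treating unparseable values as 0 keeps the loop running, but it also hides missing data, so counting how many rows fall into that case would be a sensible follow-up.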

In [6]:
for breach_event in balloon_data_breach:

    breach_method = breach_event['method'] #get the method used for that breach event
    
    if breach_method in internal_reason: #if it matches internal reasons, add the count to our internal total
        breach_types['internal'] += int(filter_commas(breach_event['records lost']))
    
    elif breach_method in external_reason: #if it matches external reasons, add the count to our external total
        breach_types['external'] += int(filter_commas(breach_event['records lost']))
    
    else:
        pass

In [7]:
print(breach_types)

{'external': 7325077930, 'internal': 4285858153}


From the data above, we can conclude that data breaches caused by external reasons such as hacking and lost devices caused more data loss than internal reasons such as inside jobs and poor security, with external reasons accounting for 7,325,077,930 records lost and internal reasons for 4,285,858,153 records lost from 2004 to 2021. To show the difference more clearly, I created a bar graph comparing these numbers side by side, called ``The number of records data lost in terms of external vs internal reasons.png``, which shows that external reasons caused roughly 1.7 times as many lost records as internal reasons.
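The comparison can be checked with a short calculation. This sketch copies the two totals from the printed ``breach_types`` output and derives the external share of lost records and the external-to-internal ratio. The variable names are illustrative.

```python
# totals copied from the printed breach_types output above
breach_types = {'external': 7325077930, 'internal': 4285858153}

total_records = breach_types['external'] + breach_types['internal']

# share of all categorized lost records that came from external reasons, in percent
external_share = breach_types['external'] * 100 / total_records

# how many times larger the external total is than the internal total
external_to_internal = breach_types['external'] / breach_types['internal']

print(round(external_share, 1))
print(round(external_to_internal, 2))
```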

## Question 2: How have the trends for internal attacks and external attacks for data breaches progressed from 2004 to 2021?

I want to explore further on whether the number of data breaches caused by outsiders versus those caused by insiders has been increasing or decreasing in recent years using the dataset from Balloon Race.tsv since it has more up-to-date data continuing to 2021.

The first step is to create an empty dictionary and fill it with the years from 2004 to 2021, along with a counter for external attacks and a counter for internal attacks in each year.

In [8]:
#create an empty dictionary
breaches_by_year_ei = {}

for breach_event in balloon_data_breach:
    
    breach_year = breach_event['year   '] #get the year for that breach event
    
    #if haven't seen this year yet, create an item for it so I can use to track yearly counts later
    if breach_year not in breaches_by_year_ei.keys():
        
        #this time, we create a dictionary-in-a-dictionary, to hold our external and internal attacks counts separately
        breaches_by_year_ei[breach_year] = {'external attacks' : 0, 'internal attacks' : 0}
    
    else:
        pass
        #if I've already seen this year and stored it in my dictionary, ignore it and move on

print(breaches_by_year_ei)

{'2021': {'external attacks': 0, 'internal attacks': 0}, '2020': {'external attacks': 0, 'internal attacks': 0}, '2019': {'external attacks': 0, 'internal attacks': 0}, '2018': {'external attacks': 0, 'internal attacks': 0}, '2016': {'external attacks': 0, 'internal attacks': 0}, '2017': {'external attacks': 0, 'internal attacks': 0}, '2015': {'external attacks': 0, 'internal attacks': 0}, '2014': {'external attacks': 0, 'internal attacks': 0}, '2013': {'external attacks': 0, 'internal attacks': 0}, '2012': {'external attacks': 0, 'internal attacks': 0}, '2011': {'external attacks': 0, 'internal attacks': 0}, '2010': {'external attacks': 0, 'internal attacks': 0}, '2009': {'external attacks': 0, 'internal attacks': 0}, '2008': {'external attacks': 0, 'internal attacks': 0}, '2007': {'external attacks': 0, 'internal attacks': 0}, '2006': {'external attacks': 0, 'internal attacks': 0}, '2005': {'external attacks': 0, 'internal attacks': 0}, '2004': {'external attacks': 0, 'internal attac

In [9]:
for breach_event in balloon_data_breach:
    
    breach_method = breach_event['method'] #get the method used for that breach event
    breach_year = breach_event['year   ']  #get the year for that breach event
    
    if breach_method in external_reason: #if it matches external reasons, add into external attack counts
        breaches_by_year_ei[breach_year]['external attacks'] += 1 
    elif breach_method in internal_reason: #if it matches internal reasons, add into internal attack counts
        breaches_by_year_ei[breach_year]['internal attacks'] += 1 
    else:
        pass

print(breaches_by_year_ei)

{'2021': {'external attacks': 6, 'internal attacks': 0}, '2020': {'external attacks': 18, 'internal attacks': 6}, '2019': {'external attacks': 30, 'internal attacks': 12}, '2018': {'external attacks': 23, 'internal attacks': 15}, '2016': {'external attacks': 29, 'internal attacks': 3}, '2017': {'external attacks': 16, 'internal attacks': 6}, '2015': {'external attacks': 21, 'internal attacks': 3}, '2014': {'external attacks': 14, 'internal attacks': 4}, '2013': {'external attacks': 25, 'internal attacks': 3}, '2012': {'external attacks': 16, 'internal attacks': 2}, '2011': {'external attacks': 24, 'internal attacks': 1}, '2010': {'external attacks': 4, 'internal attacks': 2}, '2009': {'external attacks': 6, 'internal attacks': 0}, '2008': {'external attacks': 4, 'internal attacks': 4}, '2007': {'external attacks': 3, 'internal attacks': 2}, '2006': {'external attacks': 1, 'internal attacks': 1}, '2005': {'external attacks': 1, 'internal attacks': 0}, '2004': {'external attacks': 0, 'in
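The two loops above (one to create the year entries, one to count) could also be combined. As a sketch under the same category lists, ``collections.defaultdict`` creates the per-year counters the first time a year is seen. The sample events below are made up for illustration and are not rows from the dataset.

```python
from collections import defaultdict

internal_reason = ['poor security', 'inside job']
external_reason = ['hacked', 'lost device']

# made-up events using the same keys as the Balloon Race rows
sample_events = [
    {'year   ': '2020', 'method': 'hacked'},
    {'year   ': '2020', 'method': 'inside job'},
    {'year   ': '2021', 'method': 'hacked'},
    {'year   ': '2021', 'method': 'oops'},  # neither list, so it is not counted
]

# each new year starts with both counters at zero
yearly_counts = defaultdict(lambda: {'external attacks': 0, 'internal attacks': 0})

for event in sample_events:
    bucket = yearly_counts[event['year   ']]  # creates the year entry on first sight
    if event['method'] in external_reason:
        bucket['external attacks'] += 1
    elif event['method'] in internal_reason:
        bucket['internal attacks'] += 1

print(dict(yearly_counts))
```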

## Part 1b: Exporting the dataset for time graph visualization

Now I have the data that I can make into a time series graph to analyze the trend over time in a spreadsheet program like Google Sheets or Microsoft Excel. But to do that, I will convert the data to CSV format and export it into a file. 
 
The first step is to convert my data from a nested dictionary to a nested list, where each sub-list (which will be a row in my CSV file) contains the values in a consistent order:

``[year, external breach attacks, internal breach attacks]``

In [11]:
#create a new empty 'master list'
data_breach_attack_year = []

#for each year in my breaches_by_year_ei dictionary
for year, counts in breaches_by_year_ei.items():
    
    #create a new sub-list that will store the year, external, and internal attacks data in a consistent order
    list_element = [] #new empty sub-list
    list_element.append(year) #add the year in, e.g. ['2021']
    list_element.append(counts['external attacks']) #append external attacks, e.g. ['2021', 6]
    list_element.append(counts['internal attacks']) #append internal attacks, e.g. ['2021', 6, 0]
    
    data_breach_attack_year.append(list_element) #add this list to the end of our growing master list
    
print(data_breach_attack_year)

[['2021', 6, 0], ['2020', 18, 6], ['2019', 30, 12], ['2018', 23, 15], ['2016', 29, 3], ['2017', 16, 6], ['2015', 21, 3], ['2014', 14, 4], ['2013', 25, 3], ['2012', 16, 2], ['2011', 24, 1], ['2010', 4, 2], ['2009', 6, 0], ['2008', 4, 4], ['2007', 3, 2], ['2006', 1, 1], ['2005', 1, 0], ['2004', 0, 1]]


In [12]:
#sort the list in 'numerical order' by the first sub-item, which is the year stamp
data_breach_attack_year.sort() 
print(data_breach_attack_year)

[['2004', 0, 1], ['2005', 1, 0], ['2006', 1, 1], ['2007', 3, 2], ['2008', 4, 4], ['2009', 6, 0], ['2010', 4, 2], ['2011', 24, 1], ['2012', 16, 2], ['2013', 25, 3], ['2014', 14, 4], ['2015', 21, 3], ['2016', 29, 3], ['2017', 16, 6], ['2018', 23, 15], ['2019', 30, 12], ['2020', 18, 6], ['2021', 6, 0]]


Now that I have my data in a list, I can export to a CSV file!

In [13]:
with open('data_breach_internal_external_trends.csv', 'w', newline='', encoding='utf-8') as f: #newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    #write a header row
    writer.writerow(('year', 
                     'external breach attacks', 
                     'internal breach attacks'))
    
    for i in data_breach_attack_year:
        writer.writerow((i[0], i[1], i[2]))
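As a small sketch of the same export step, ``writer.writerows`` can write every row in one call, and reading the result back is a quick way to confirm the layout before opening it in a spreadsheet. To keep the example self-contained it writes to an in-memory ``io.StringIO`` buffer instead of a file and uses only the first three rows of the sorted list.

```python
import csv
import io

header = ['year', 'external breach attacks', 'internal breach attacks']
rows = [['2004', 0, 1], ['2005', 1, 0], ['2006', 1, 1]]  # first three sorted rows

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)  # writes all remaining rows at once

buffer.seek(0)
read_back = list(csv.reader(buffer))  # every value comes back as a string
print(read_back[0])
print(read_back[1])
```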

I used this CSV file to make a time series graph in order to analyze the trends of data breach attacks by type over time, and found that external attacks have outnumbered internal attacks in most years, making them the bigger threat to companies and industries. Both external and internal attacks have decreased in the last few years, possibly because industries have become more aware of the need for tighter security, although the 2021 figures may cover only part of the year. The graph is named ``Data Breach Attacks by Types Trends from 2004 to 2021.png``.
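The yearly counts printed earlier can also be compared directly in Python, to see in how many years external attacks outnumbered internal ones. This is a minimal sketch that copies the sorted list from the output above. The names ``external_lead_years`` and ``other_years`` are illustrative.

```python
# sorted rows copied from the printed output: [year, external, internal]
data_breach_attack_year = [
    ['2004', 0, 1], ['2005', 1, 0], ['2006', 1, 1], ['2007', 3, 2], ['2008', 4, 4],
    ['2009', 6, 0], ['2010', 4, 2], ['2011', 24, 1], ['2012', 16, 2], ['2013', 25, 3],
    ['2014', 14, 4], ['2015', 21, 3], ['2016', 29, 3], ['2017', 16, 6], ['2018', 23, 15],
    ['2019', 30, 12], ['2020', 18, 6], ['2021', 6, 0],
]

# years in which external attacks strictly outnumber internal attacks
external_lead_years = [year for year, external, internal in data_breach_attack_year
                       if external > internal]

# years with a tie or with more internal attacks
other_years = [year for year, external, internal in data_breach_attack_year
               if external <= internal]

print(len(external_lead_years), 'of', len(data_breach_attack_year))
print(other_years)
```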

## Question 3: What is the distribution of breaches by specific types? 

Based on the records provided by Privacy Rights Clearinghouse (PRC), each breach falls into one of the following types: an insider who intentionally breaches data (INSD), payment card fraud (CARD), physical loss (PHYS), lost or stolen portable devices (PORT), being hacked (HACK), stationary equipment loss (STAT), unintended disclosure such as sending email to the wrong addresses (DISC), or unknown methods (UNKN). To find the distribution of breaches across these types, I'll first create a dictionary with an entry for each type of data breach method.

In [44]:
#create an empty dictionary that I'll fill with breaches types
breaches_by_types = {}

for breach_event in prc_data_breach:
    
    breach_type = breach_event['Type of breach'] #get the type of attack method for that breach event
    
    #if haven't seen this type yet, create an item for it so I can use to track each type counts later
    if breach_type not in breaches_by_types.keys():
        
        #this time, I create a dictionary-in-a-dictionary, to hold our counts and percentage separately
        breaches_by_types[breach_type] = {'counts' : 0, 'proportion' : 0}
    
    else:
        pass
        #if I've already seen this type and stored it in my dictionary, ignore it and move on

print(breaches_by_types)

{'HACK': {'counts': 0, 'proportion': 0}, 'PORT': {'counts': 0, 'proportion': 0}, 'DISC': {'counts': 0, 'proportion': 0}, 'PHYS': {'counts': 0, 'proportion': 0}, 'UNKN': {'counts': 0, 'proportion': 0}, 'INSD': {'counts': 0, 'proportion': 0}, 'STAT': {'counts': 0, 'proportion': 0}, 'CARD': {'counts': 0, 'proportion': 0}, '#N/A': {'counts': 0, 'proportion': 0}}


Now that I have a dictionary ready to hold the count and proportion for each type of data breach method, I'll loop through the raw data again and add up the counts for each type.

In [45]:
for breach_event in prc_data_breach:
    
    breach_type = breach_event['Type of breach'] #get the type of attack method for that breach event
    
    breaches_by_types[breach_type]['counts'] += 1 
    breaches_by_types[breach_type]['proportion'] = round((breaches_by_types[breach_type]['counts'] * 100/len(prc_data_breach)),2)
    
    
print(breaches_by_types)

{'HACK': {'counts': 2510, 'proportion': 28.0}, 'PORT': {'counts': 1169, 'proportion': 13.04}, 'DISC': {'counts': 1842, 'proportion': 20.55}, 'PHYS': {'counts': 1728, 'proportion': 19.27}, 'UNKN': {'counts': 704, 'proportion': 7.85}, 'INSD': {'counts': 606, 'proportion': 6.76}, 'STAT': {'counts': 249, 'proportion': 2.78}, 'CARD': {'counts': 68, 'proportion': 0.76}, '#N/A': {'counts': 89, 'proportion': 0.99}}


As the pie chart named ``Distribution of Data Breaches by Types.png`` shows, the most prevalent types of breaches are HACK (28%), DISC (20.55%), PHYS (19.27%), and PORT (13.04%), which together represent more than 80% of the total number of breaches. This leads me to the next question: which types of businesses are most targeted by hacking?
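The percentages can be reproduced from the printed counts alone. This sketch copies the counts from the output above, computes each proportion once after counting (rather than on every loop iteration), and adds up the share of the four most common types. The variable names are illustrative.

```python
# counts copied from the printed breaches_by_types output
counts = {'HACK': 2510, 'PORT': 1169, 'DISC': 1842, 'PHYS': 1728, 'UNKN': 704,
          'INSD': 606, 'STAT': 249, 'CARD': 68, '#N/A': 89}

total_events = sum(counts.values())

# percentage of all events for each breach type, rounded to two decimals
proportions = {name: round(count * 100 / total_events, 2)
               for name, count in counts.items()}

# the four breach types with the highest counts, and their combined share
top_four = sorted(counts, key=counts.get, reverse=True)[:4]
top_four_share = round(sum(counts[name] for name in top_four) * 100 / total_events, 2)

print(total_events)
print(top_four, top_four_share)
```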

## Question 4: What types of businesses are most targeted by hacking breaches?

I want to look into the types of businesses that are most targeted by external attacks, specifically through hacking. Each breach record is assigned to one of 7 sectors: Businesses - Financial and Insurance Services (BSF), Businesses - Other (BSO), Businesses - Retail/Merchant - Including Online Retail (BSR), Educational Institutions (EDU), Government & Military (GOV), Healthcare, Medical Providers & Medical Insurance Services (MED), and Nonprofits (NGO).

First I'm going to filter the original dataset into a list that only contains breach events caused by hacking. To do this, I'll create an empty list and then loop through the original dataset. If the method for an event is HACK, I'll add it to my list.

In [13]:
filtered_prc_hack = []

for breach_event in prc_data_breach:
    breach_method = breach_event['Type of breach'] #get the method used for that breach event
    
    if breach_method == 'HACK':
        filtered_prc_hack.append(breach_event)

print(len(filtered_prc_hack))
print(filtered_prc_hack)

2510
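The same filter can be written as a list comprehension. This is a sketch on a few made-up rows that use the PRC column names. It is not the notebook's actual data.

```python
# made-up rows with the same keys as the PRC records
sample_breaches = [
    {'Type of breach': 'HACK', 'Type of organization': 'MED'},
    {'Type of breach': 'PORT', 'Type of organization': 'EDU'},
    {'Type of breach': 'HACK', 'Type of organization': 'BSF'},
]

# keep only the events whose breach type is HACK
sample_hacks = [event for event in sample_breaches
                if event['Type of breach'] == 'HACK']

print(len(sample_hacks))
```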


Then I'll create an empty dictionary and fill it with the types of organizations, along with a hacking count for each type, while looping through the filtered dataset that I just created.

In [14]:
#create an empty dictionary
breaches_by_industry = {}

for breach_event in filtered_prc_hack:
    
    breach_industry = breach_event['Type of organization'] #get the organization attacked for that breach event
    
    #if haven't seen this organization yet, create an item for it so I can use to track hacking counts later
    if breach_industry not in breaches_by_industry.keys():
        
        #this time, we create a dictionary-in-a-dictionary, to hold our hacking counts and proportion separately
        breaches_by_industry[breach_industry] = {'number of hacking' : 0, 'proportion' : 0}
    
    else:
        pass
        #if we've already seen this organization and stored it in our dictionary, ignore it and move on

print(breaches_by_industry)

{'BSF': {'number of hacking': 0, 'proportion': 0}, 'BSR': {'number of hacking': 0, 'proportion': 0}, 'BSO': {'number of hacking': 0, 'proportion': 0}, 'GOV': {'number of hacking': 0, 'proportion': 0}, 'EDU': {'number of hacking': 0, 'proportion': 0}, 'MED': {'number of hacking': 0, 'proportion': 0}, 'NGO': {'number of hacking': 0, 'proportion': 0}}


In order to assess the number of hacking counts for each type of organization in terms of the total number of hacking events, I also included ``proportion`` to calculate the percentage of each type's hacking events. 

In [15]:
for breach_event in filtered_prc_hack:
    
    breach_industry = breach_event['Type of organization'] #get the organization attacked for that breach event
    
    breaches_by_industry[breach_industry]['number of hacking'] += 1 
    breaches_by_industry[breach_industry]['proportion'] = round((breaches_by_industry[breach_industry]['number of hacking'] * 100/len(filtered_prc_hack)),2)
    
    
print(breaches_by_industry)

{'BSF': {'number of hacking': 213, 'proportion': 8.49}, 'BSR': {'number of hacking': 301, 'proportion': 11.99}, 'BSO': {'number of hacking': 609, 'proportion': 24.26}, 'GOV': {'number of hacking': 147, 'proportion': 5.86}, 'EDU': {'number of hacking': 290, 'proportion': 11.55}, 'MED': {'number of hacking': 912, 'proportion': 36.33}, 'NGO': {'number of hacking': 38, 'proportion': 1.51}}


Based on the result, I found that, contrary to my hypothesis, most of the hacking attacks target health organizations (MED), while Businesses - Other (BSO) comes in second place. NGOs are the least attacked type of organisation with 1.51%, and Government and Military represents only 5.86% of all the hacking breaches. My tentative explanation is that NGO data is predominantly public and that Government and Military organizations have always maintained tight software security.
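To make the ranking explicit, the printed hacking counts can be sorted from most to least targeted. This sketch copies the counts from the output above. ``hack_counts`` and ``ranked`` are illustrative names.

```python
# hacking counts per sector, copied from the printed breaches_by_industry output
hack_counts = {'BSF': 213, 'BSR': 301, 'BSO': 609, 'GOV': 147,
               'EDU': 290, 'MED': 912, 'NGO': 38}

# sort the (sector, count) pairs by count, largest first
ranked = sorted(hack_counts.items(), key=lambda item: item[1], reverse=True)

for sector, count in ranked:
    print(sector, count)
```

In this ordering BSF, the sector named in the third hypothesis, comes fifth out of seven.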

## Question 5: How have the trends for hacking breaches aimed at organizations changed from 2005 to 2018? 

To investigate how hacking breaches targeted at different organisation types developed over time, I'm going to loop through the filtered dataset that I created above, which only contains hacking events, and build a dictionary that holds every year together with a count for each organization type in that year.

In [25]:
#create an empty dictionary
breaches_by_year_prc = {}

for breach_event in filtered_prc_hack:
    
    breach_year = breach_event['Year of Breach'] #get the year for that breach event
    
    #if haven't seen this year yet, create an item for it so I can use to track yearly counts later
    if breach_year not in breaches_by_year_prc.keys():
        
        #this time, we create a dictionary-in-a-dictionary, to hold the counts for each organization type separately
        breaches_by_year_prc[breach_year] = {'BSF':0, 'BSR':0, 'BSO':0, 'GOV':0, 'EDU':0, 'MED':0, 'NGO':0}
    
    else:
        pass
        #if we've already seen this year and stored it in our dictionary, ignore it and move on

print(breaches_by_year_prc)

{'2006': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2012': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2013': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2014': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2015': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2016': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2017': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2009': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2010': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2011': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2005': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2007': {'BSF': 0, 'BSR': 0, 'BSO': 0, 'GOV': 0, 'EDU': 0, 'MED': 0, 'NGO': 0}, '2008': {'BSF': 0, 'BSR': 0, 'BSO': 0, 

In [27]:
for breach_event in filtered_prc_hack:
    
    breach_organization = breach_event['Type of organization']
    breach_year = breach_event['Year of Breach']  
    
    if breach_organization == 'BSF':
        breaches_by_year_prc[breach_year]['BSF'] += 1 #add Businesses - Financial and Insurance Services counts
    elif breach_organization == 'BSR':
        breaches_by_year_prc[breach_year]['BSR'] += 1 #add Businesses- Retail counts
    elif breach_organization == 'BSO':
        breaches_by_year_prc[breach_year]['BSO'] += 1 #add Businesses - Other counts
    elif breach_organization == 'GOV':
        breaches_by_year_prc[breach_year]['GOV'] += 1 #add Government & Military counts
    elif breach_organization == 'EDU':
        breaches_by_year_prc[breach_year]['EDU'] += 1 #add Educational Institutions counts
    elif breach_organization == 'MED':
        breaches_by_year_prc[breach_year]['MED'] += 1 #add Medical Providers and services counts
    else:
        breaches_by_year_prc[breach_year]['NGO'] += 1#add Nonprofits counts

print(breaches_by_year_prc)

{'2006': {'BSF': 11, 'BSR': 10, 'BSO': 8, 'GOV': 12, 'EDU': 31, 'MED': 3, 'NGO': 0}, '2012': {'BSF': 16, 'BSR': 54, 'BSO': 59, 'GOV': 34, 'EDU': 40, 'MED': 33, 'NGO': 9}, '2013': {'BSF': 23, 'BSR': 47, 'BSO': 50, 'GOV': 18, 'EDU': 22, 'MED': 51, 'NGO': 3}, '2014': {'BSF': 14, 'BSR': 37, 'BSO': 62, 'GOV': 12, 'EDU': 16, 'MED': 177, 'NGO': 5}, '2015': {'BSF': 12, 'BSR': 10, 'BSO': 79, 'GOV': 3, 'EDU': 11, 'MED': 72, 'NGO': 1}, '2016': {'BSF': 21, 'BSR': 5, 'BSO': 126, 'GOV': 11, 'EDU': 11, 'MED': 214, 'NGO': 2}, '2017': {'BSF': 46, 'BSR': 22, 'BSO': 131, 'GOV': 11, 'EDU': 8, 'MED': 198, 'NGO': 2}, '2009': {'BSF': 7, 'BSR': 6, 'BSO': 10, 'GOV': 4, 'EDU': 20, 'MED': 5, 'NGO': 1}, '2010': {'BSF': 11, 'BSR': 32, 'BSO': 15, 'GOV': 7, 'EDU': 24, 'MED': 16, 'NGO': 1}, '2011': {'BSF': 9, 'BSR': 40, 'BSO': 40, 'GOV': 18, 'EDU': 20, 'MED': 26, 'NGO': 9}, '2005': {'BSF': 2, 'BSR': 2, 'BSO': 1, 'GOV': 2, 'EDU': 40, 'MED': 0, 'NGO': 1}, '2007': {'BSF': 15, 'BSR': 12, 'BSO': 14, 'GOV': 9, 'EDU': 18, '
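Because the organization code is itself a key of the inner dictionary, the chain of ``if``/``elif`` branches could be replaced by indexing with the code directly. The sketch below, on made-up rows, also collects any code that is not one of the seven sectors instead of counting it as NGO, which is what the final ``else`` branch above would do. The list ``unexpected_codes`` and the code ``'XYZ'`` are hypothetical.

```python
sectors = ['BSF', 'BSR', 'BSO', 'GOV', 'EDU', 'MED', 'NGO']

# made-up hacking events using the PRC column names
sample_hacks = [
    {'Year of Breach': '2016', 'Type of organization': 'MED'},
    {'Year of Breach': '2016', 'Type of organization': 'MED'},
    {'Year of Breach': '2016', 'Type of organization': 'BSO'},
    {'Year of Breach': '2017', 'Type of organization': 'XYZ'},  # hypothetical unknown code
]

counts_by_year = {}
unexpected_codes = []

for event in sample_hacks:
    year = event['Year of Breach']
    sector = event['Type of organization']
    # create the zeroed per-sector counters the first time a year appears
    counts_by_year.setdefault(year, {code: 0 for code in sectors})
    if sector in counts_by_year[year]:
        counts_by_year[year][sector] += 1
    else:
        unexpected_codes.append(sector)  # record it rather than miscount it

print(counts_by_year['2016'])
print(unexpected_codes)
```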

## Part 2: Exporting the dataset for time graph visualization

Now I have data that I can turn into a time series graph to analyze the trend over time in a spreadsheet program like Google Sheets or Microsoft Excel. To do that, I will convert the data to CSV format and export it to a file.
 
The first step is to convert my data from a nested dictionary to a nested list, where each sub-list (which will be a row in my CSV file) contains the values in a consistent order:

``[year, BSF, BSR, BSO, GOV, EDU, MED, NGO]``

In [29]:
#create a new empty 'master list'
data_breach_hack_year = []

#for each year in my breaches_by_year_prc dictionary
for year, counts in breaches_by_year_prc.items():
    
    #create a new sub-list that will store the year and each organization types in a consistent order
    list_element = [] #new empty sub-list
    list_element.append(year) #add the year in, e.g. ['2006']
    list_element.append(counts['BSF']) #append BSF 
    list_element.append(counts['BSR']) #append BSR
    list_element.append(counts['BSO']) #append BSO
    list_element.append(counts['GOV']) #append GOV
    list_element.append(counts['EDU']) #append EDU
    list_element.append(counts['MED']) #append MED
    list_element.append(counts['NGO']) #append NGO
    
    data_breach_hack_year.append(list_element) #add this list to the end of our growing master list
    
print(data_breach_hack_year)

[['2006', 11, 10, 8, 12, 31, 3, 0], ['2012', 16, 54, 59, 34, 40, 33, 9], ['2013', 23, 47, 50, 18, 22, 51, 3], ['2014', 14, 37, 62, 12, 16, 177, 5], ['2015', 12, 10, 79, 3, 11, 72, 1], ['2016', 21, 5, 126, 11, 11, 214, 2], ['2017', 46, 22, 131, 11, 8, 198, 2], ['2009', 7, 6, 10, 4, 20, 5, 1], ['2010', 11, 32, 15, 7, 24, 16, 1], ['2011', 9, 40, 40, 18, 20, 26, 9], ['2005', 2, 2, 1, 2, 40, 0, 1], ['2007', 15, 12, 14, 9, 18, 0, 3], ['2008', 11, 8, 6, 2, 25, 4, 1], ['2018', 15, 16, 8, 4, 4, 81, 0], ['2019', 0, 0, 0, 0, 0, 32, 0]]


In [30]:
#sort the list by the first sub-item, which is the year; the years are four-digit strings, so string order matches chronological order
data_breach_hack_year.sort() 
print(data_breach_hack_year)

[['2005', 2, 2, 1, 2, 40, 0, 1], ['2006', 11, 10, 8, 12, 31, 3, 0], ['2007', 15, 12, 14, 9, 18, 0, 3], ['2008', 11, 8, 6, 2, 25, 4, 1], ['2009', 7, 6, 10, 4, 20, 5, 1], ['2010', 11, 32, 15, 7, 24, 16, 1], ['2011', 9, 40, 40, 18, 20, 26, 9], ['2012', 16, 54, 59, 34, 40, 33, 9], ['2013', 23, 47, 50, 18, 22, 51, 3], ['2014', 14, 37, 62, 12, 16, 177, 5], ['2015', 12, 10, 79, 3, 11, 72, 1], ['2016', 21, 5, 126, 11, 11, 214, 2], ['2017', 46, 22, 131, 11, 8, 198, 2], ['2018', 15, 16, 8, 4, 4, 81, 0], ['2019', 0, 0, 0, 0, 0, 32, 0]]
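Calling `.sort()` with no arguments compares whole sub-lists, and it compares the years as strings. That works here because every year has four digits. As a hedged alternative, the sketch below sorts on the year explicitly as an integer. The variable names are my own and the three rows are copied from the output above.

```python
# Three unsorted rows copied from the printed output above.
unsorted_rows = [
    ['2012', 16, 54, 59, 34, 40, 33, 9],
    ['2005', 2, 2, 1, 2, 40, 0, 1],
    ['2009', 7, 6, 10, 4, 20, 5, 1],
]

# Sort only on the year, converted to an integer, and return a new list
# instead of modifying the original in place.
sorted_rows = sorted(unsorted_rows, key=lambda row: int(row[0]))

print([row[0] for row in sorted_rows])
```

Using `sorted` with a `key` leaves the original list untouched and makes it obvious which column drives the ordering.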


In [33]:
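import csv #needed for csv.writer (harmless if already imported in an earlier cell)

#newline='' stops the csv module from writing blank lines between rows on Windows
with open('data_breach_hacking_trends.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    #write a header row
    writer.writerow(('year', 
                     'BSF', 
                     'BSR',
                     'BSO',
                     'GOV',
                     'EDU',
                     'MED',
                     'NGO'))
    
    #each sub-list is already in the same order as the header, so write it as one row
    for row in data_breach_hack_year:
        writer.writerow(row)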
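Before opening the file in a spreadsheet, it can help to read it back and confirm that the header and rows line up. This is a minimal sketch under my own assumptions: it writes two rows (copied from the sorted output above) to a temporary file called `check_export.csv`, which is a hypothetical name, rather than touching the real export.

```python
import csv
import os
import tempfile

header = ['year', 'BSF', 'BSR', 'BSO', 'GOV', 'EDU', 'MED', 'NGO']
sample_rows = [
    ['2005', 2, 2, 1, 2, 40, 0, 1],
    ['2006', 11, 10, 8, 12, 31, 3, 0],
]

# Write to a throwaway file so the real export is not overwritten.
path = os.path.join(tempfile.mkdtemp(), 'check_export.csv')

with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(sample_rows)

# Read the file back; DictReader uses the header row as the keys.
with open(path, newline='', encoding='utf-8') as f:
    records = list(csv.DictReader(f))

print(len(records))
print(records[0]['year'], records[0]['EDU'])
```

Note that everything read back from a CSV file is a string, so the counts would need `int(...)` before any arithmetic.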

I used this CSV file to make a time series graph in Google Sheets in order to analyze how hacking breaches evolved from 2005 to 2019. I found that from 2005 to 2013, the number of breaches remained relatively stable for all types of organizations. After 2013, however, there was a sharp increase in attacks against the Medical (MED) sector, followed by a decrease in 2015 and another increase in 2016. These fluctuations reflect the different crypto-attacks that Medical services have reported in the past, for example CryptoLocker in 2014 and TeslaCrypt in 2016. For hacking breaches targeting the Business (BSO) sector, I found a substantial increase from 2013 to 2017, followed by a sharp decrease in 2018. The graph is saved as ``Hacking Breaches Trends by Organization Types from 2005 to 2019.png``.
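The rises and falls described above can also be checked numerically. The sketch below is only an illustration: `yearly_changes` is a hypothetical helper of my own, and the MED and BSO counts for 2013 to 2018 are copied from the sorted output earlier in this part.

```python
# (year, MED, BSO) hacking-breach counts copied from the sorted output above.
med_bso = [
    ('2013', 51, 50),
    ('2014', 177, 62),
    ('2015', 72, 79),
    ('2016', 214, 126),
    ('2017', 198, 131),
    ('2018', 81, 8),
]


def yearly_changes(series):
    """Return (year, change from the previous year) for a list of (year, count)."""
    return [(cur_year, cur - prev)
            for (_, prev), (cur_year, cur) in zip(series, series[1:])]


med_changes = yearly_changes([(year, med) for year, med, _ in med_bso])
bso_changes = yearly_changes([(year, bso) for year, _, bso in med_bso])

print(med_changes)
print(bso_changes)
```

A positive number means more hacking breaches than the year before, so the MED jumps in 2014 and 2016 and the BSO drop in 2018 appear directly in the output.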

## VIII. Discussions:

## Limitations: 
My study has certain limitations. First, I had to use data from two different, overlapping datasets to test my hypotheses: one has records from 2005 to 2018 and the other has records from 2004 to 2021. These datasets contain different information and different volumes of records about hacking breach events, so the choice of dataset for a given research question can influence the findings I draw from the statistics and patterns. Second, the Information is Beautiful dataset may not contain enough breach events, since it has only a little more than 300 events compared to more than 9000 events in the PRC dataset. It could also be biased because, unlike the PRC, it does not come from a non-profit organization.

## Implications:
However, the findings from my study are in line with various previous studies and reports. My analysis revealed that, in contrast to what companies are often afraid of, external attacks have always outnumbered internal attacks in terms of lost records and data. The time graph showed that even though the number of internal attacks has increased during the last few years, the number of external attacks has consistently been more than double the number of internal attacks from 2004 to 2021. Thus, the implication is that companies should focus more on strengthening their security standards to protect against data breaches by outsiders, but should also consider stricter rules with serious consequences for internal breaches, whether intentional or unintentional, so that employees are more conscious of protecting their company's data. The second finding is that hacking has always been among the top methods behind these data breaches. However, contrary to my theory, hacking attacks have targeted not only Business organizations but also Medical institutions. This can be explained by the fact that health organizations hold a lot of confidential and sensitive data, which is often sold and used for spamming in commercial campaigns. The implication is that, as they transition from a paper-based system to digitized and distributed healthcare data, healthcare organizations shouldn't be complacent just because businesses hold more financial value. Nowadays the value lies more in personalized data than in money. Thus, they should also invest in a secure database and in proper technical training for medical employees.

## IX. Conclusions:

From my analysis of data breach records from 2004 to 2021 across multiple organizations, one of my main findings is that external attacks caused more detrimental losses of records and data than internal attacks. Another finding is that hacking is the method most commonly used by these attackers, and that the most targeted types of organization are medical organizations and BSOs, since they possess the most sensitive personal data. These findings imply that organizations, especially MED and BSO organizations, should take serious action to secure personal data and to train their employees properly.

Future research building on this study includes identifying the main locations from which confidential data is breached through hacking, for example network servers, emails, etc. We should also research the preventive measures that organizations could take, depending on their financial and technical capacity, to avoid as many data breaches as possible.