As per California state law, notification files are required to have specific headings such as What Happened?, What Information Was Involved?, etc.

In this notebook, I used regex to separate the text into its different sections and created a new column for each section.

However, some of the older files, which were released before this law was passed, did not include these headings. I did not split those files.
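
The splitting used throughout relies on a property of `re.split`: when the pattern is wrapped in a capturing group, the matched headings are kept in the result list. A minimal sketch on a made-up notice (the sample text is an assumption, not taken from the dataset) shows why a text with one heading yields 3 pieces and a text with two headings yields 5:

```python
import re

# Made-up, lower-cased notice text for illustration only.
sample = ("dear customer, what happened? our server was accessed. "
          "what information was involved? names and emails.")

pattern = r'(what\s*happened|what\s*information\s*was\s*involved)'
parts = re.split(pattern, sample)

# parts[0] -> text before the first heading (the introduction)
# parts[1] -> the first heading itself
# parts[2] -> text between the first and second headings
# parts[3] -> the second heading
# parts[4] -> text after the second heading
for position, piece in enumerate(parts):
    print(position, repr(piece))
```

This is why the cells below test for `len(text_split) >= 3` when one heading is enough and `>= 5` when two headings are needed to delimit a section.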

Importing libraries

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
from openpyxl import load_workbook
from itertools import islice

Importing the Excel workbook and converting it to a pandas DataFrame.

In [3]:
wb = load_workbook('Updated_Melted_DF.xlsx')
print(wb.sheetnames)

['Sheet1']


In [4]:
sheet = wb['Sheet1']

In [5]:
data = sheet.values

In [6]:
cols = next(data)[1:]

In [7]:
data = list(data)

In [8]:
index = [r[0] for r in data]

In [9]:
data = (islice(r, 1, None) for r in data)

In [10]:
df = pd.DataFrame(data, index = index, columns = cols)
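
The cells above peel the header row and the index column off the raw worksheet rows before handing the rest to `pd.DataFrame`. The same steps can be traced on a tiny stand-in for `sheet.values`; the three rows below are hypothetical and only illustrate the shape of the data:

```python
from itertools import islice

# Hypothetical stand-in for sheet.values: an iterator of row tuples,
# header row first, index values in the first column.
rows = iter([
    (None, 'ID_Number', 'PDF_Text'),
    (0, 1, 'text a'),
    (1, 2, 'text b'),
])

cols = next(rows)[1:]                 # header row without the index cell
records = list(rows)                  # the remaining data rows
index = [r[0] for r in records]       # first cell of each row becomes the index
body = [tuple(islice(r, 1, None)) for r in records]  # everything after it is data

print(cols)
print(index)
print(body)
```

If the workbook has a single header row and the index in its first column, `pd.read_excel('Updated_Melted_DF.xlsx', index_col=0)` would likely produce an equivalent frame in one call.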

In [11]:
df.head()

Unnamed: 0,ID_Number,Organization_Name,Date(s)_of_Breach,Reported_Date,Hyperlinks,Total_PDFs,PDF Number,PDF Link,PDF_Text,Length
0,1,"Mindlance, Inc.",12/28/2017,01/19/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/California%20T...,"Re: Notice of Data Breach Dear : Mindlance, In...",9181.0
1,2,Rosewood Hotel Group,"05/29/2016, 01/11/2017",01/19/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Rosewood%20-%2...,T8321 v.03 01.16.2018Return Mail Processing Ce...,26634.0
2,3,Corovan Corporation,09/14/2017,01/18/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Corovan%20-%20...,"Exhibit 1 By providing this notice, Corovan Co...",25594.0
3,4,Employer Leasing Company,"09/14/2017, 09/18/2018",01/18/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Employer%20Lea...,"By providing this notice, Employer Leasing Com...",25858.0
4,5,American Golf Corporation,"12/12/2017, 12/15/2017",01/18/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Sample%20Notic...,"909 North Sepulveda Blvd., Suite 650 ! El Seg...",8583.0


In [12]:
df.shape

(1500, 10)

In [13]:
df['PDF_Text'] = df['PDF_Text'].str.lower()

Introduction

In [14]:
intro = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*happened)', str(df['PDF_Text'][i]))
    intro.append(len(text_split))

In [15]:
sum(1 for i in intro if i >= 3)

628
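
The pattern uses `\s*` rather than a literal space, so it also matches headings whose spacing was lost or turned into line breaks. That this happens in the extracted PDF text is an assumption here; the three strings below are invented to show the behaviour:

```python
import re

heading = re.compile(r'what\s*happened')

# Invented variants of the same heading.
variants = ['what happened', 'whathappened', 'what\nhappened']
matches = [bool(heading.search(v)) for v in variants]
print(matches)
```

The same pattern also matches the phrase when it occurs inside an ordinary sentence, so the count above is the number of texts containing the phrase, which may be somewhat higher than the number of texts that use it as a heading.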

In [16]:
introduction = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*happened)', str(df['PDF_Text'][i]))
    if len(text_split) >= 3:
        introduction.append(text_split[0])
    else:
        introduction.append('')

In [17]:
introduction[0]

're: notice of data breach dear : mindlance, inc. (“mindlance” or “company”) has numerous safeguards in place to protect its employees’ personal information. unfortunately, we need to inform you of an information security incident that recently affected some employees and which may affect you. we also want to tell you about the actions that mindlance is taking to address this incident and to assure you that we have taken steps to prevent a recurrence. '

What Happened?

In [76]:
list_of_lengths = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*happened|what\s*information\s*was\s*involved)', str(df['PDF_Text'][i]))
    list_of_lengths.append(len(text_split))

In [77]:
len(list_of_lengths)

1500

In [78]:
sum(1 for i in list_of_lengths if i >= 5)

549

In [75]:
what_happened = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*happened|what\s*information\s*was\s*involved)', str(df['PDF_Text'][i]))
    if len(text_split) >= 5:
        what_happened.append(text_split[2])    
    else:
        what_happened.append('')

In [79]:
len(what_happened)

1500
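
The count-then-extract loop is repeated for every section, with only the pattern and the piece index changing. A sketch of a small helper that captures the shared logic is below; `extract_piece` and its parameter names are hypothetical and not part of the original notebook, and the two sample texts are made up:

```python
import re

def extract_piece(text, pattern, piece=2, min_parts=5):
    """Split text on a capturing pattern and return one piece,
    or '' if the split produced fewer than min_parts pieces."""
    parts = re.split(pattern, str(text))
    return parts[piece] if len(parts) >= min_parts else ''

pattern = r'(what\s*happened|what\s*information\s*was\s*involved)'
texts = [
    'intro. what happened? a breach. what information was involved? names.',
    'an older letter without any of the headings.',
]
what_happened_demo = [extract_piece(t, pattern) for t in texts]
print(what_happened_demo)
```

On the real data this could be applied as `df['PDF_Text'].apply(extract_piece, pattern=pattern)`, which should give the same values as the loop above.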

What Information Was Involved?

In [70]:
list_1 = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*information\s*was\s*involved|what\s*we\s*are\s*doing|what\s*are\s*we\s*doing)', 
                          str(df['PDF_Text'][i]))
    list_1.append(len(text_split))

In [71]:
sum(1 for i in list_1 if i >= 5)

450

In [69]:
check = re.split(r'(what\s*information\s*was\s*involved|what\s*are\s*we\s*doing|what\s*we\s*are\s*doing)', 
                 str(df['PDF_Text'][484]))
check[2]

'.  we began investigating the incident as soon as we learned of it.  we have determined that the personal information involved in this incident included a copy of your 2015 form w-2, which includes your name, address, 2015 income information and social security number or individual taxpayer identification number.  '

In [80]:
what_info_was_involved = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*information\s*was\s*involved|what\s*are\s*we\s*doing|what\s*we\s*are\s*doing)', 
                          str(df['PDF_Text'][i]))
    if len(text_split) >= 5:
        what_info_was_involved.append(text_split[2])    
    else:
        what_info_was_involved.append('')

In [81]:
len(what_info_was_involved)

1500

In [101]:
what_info_was_involved[0]

'? an attachment to the december 29, 2017 e-mail contained the name and social security number, related only to a limited number of mindlance employees. the stolen personal information attached to the e-mail did not contain driver’s license number or state identification card number, date of birth, any financial account number, pay card number, credit or debit card number, or medical or health insurance information. '
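
Because the split happens right after the heading words, each extracted piece starts with whatever punctuation followed the heading, as the leading `'? '` above shows. One possible clean-up, not applied in this notebook, is to strip that punctuation and the surrounding whitespace. The string below is a shortened copy of the output above:

```python
# Shortened copy of the extracted text shown above.
raw = '? an attachment to the december 29, 2017 e-mail contained the name and social security number. '

# Drop leading heading punctuation (?, :, .) plus leading and trailing whitespace.
cleaned = raw.lstrip('?:. \n\t').rstrip()
print(cleaned)
```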

What We Are Doing

In [127]:
list_2 = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*we\s*are\s*doing|what\s*are\s*we\s*doing|what\s*you\s*can\s*do|what\s*can\s*you\s*do|what\s*else\s*you\s*can\s*do|what\s*else\s*can\s*you\s*do)', 
                          str(df['PDF_Text'][i]))
    list_2.append(len(text_split))

In [128]:
sum(1 for i in list_2 if i >= 5)

547

In [129]:
what_we_are_doing = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*we\s*are\s*doing|what\s*are\s*we\s*doing|what\s*you\s*can\s*do|what\s*can\s*you\s*do|what\s*else\s*you\s*can\s*do|what\s*else\s*can\s*you\s*do)',
                          str(df['PDF_Text'][i]))
    if len(text_split) >= 5:
        what_we_are_doing.append(text_split[2])    
    else:
        what_we_are_doing.append('')

In [125]:
text_split = re.split(r'(what\s*we\s*are\s*doing|what\s*are\s*we\s*doing|what\s*you\s*can\s*do|what\s*can\s*you\s*do|what\s*else\s*you\s*can\s*do|what\s*else\s*can\s*you\s*do)',
                      str(df['PDF_Text'][0]))

In [132]:
what_we_are_doing[0]

' first, we want to emphasize that mindlance has no information suggesting that any of your personal information has been misused. while mindlance has notified law enforcement about this incident, mindlance has not delayed notifying you as a result of a request from any law enforcement agency. second, mindlance promptly took steps to confirm that unauthorized recipients of the december 29, 2017 e-mail do not retain possession of the stolen information. within the mindlance electronic network, mindlance has quarantined the e-mails and restricted access to senior management responsible for responding to this incident. third, out of an abundance of caution, mindlance is offering one year of identity protection services at no cost to you through experian, one of the three nationwide credit bureaus. your free, one-year membership in experian’s identityworkssm product provides identity restoration services, fraud detection tools, and other benefits which include monitoring your credit file. 

What You Can Do

In [133]:
list_3 = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*you\s*can\s*do|what\s*can\s*you\s*do|what\s*else\s*you\s*can\s*do|what\s*else\s*can\s*you\s*do|for\s*more\s*information|other\s*important\s*information)', 
                          str(df['PDF_Text'][i]))
    list_3.append(len(text_split))

In [134]:
sum(1 for i in list_3 if i >= 5)

552

In [135]:
what_you_can_do = []
for i in range(len(df)):
    text_split = re.split(r'(what\s*you\s*can\s*do|what\s*can\s*you\s*do|what\s*else\s*you\s*can\s*do|what\s*else\s*can\s*you\s*do|for\s*more\s*information|other\s*important\s*information)', 
                          str(df['PDF_Text'][i]))
    if len(text_split) >= 5:
        what_you_can_do.append(text_split[2])    
    else:
        what_you_can_do.append('')

Other Important Information

In [20]:
list_4 = []
for i in range(len(df)):
    text_split = re.split(r'(other\s*important\s*information|for\s*more\s*information)', 
                          str(df['PDF_Text'][i]))
    list_4.append(len(text_split))

In [22]:
sum(1 for i in list_4 if i >= 5)

281

In [23]:
other_important_information = []
for i in range(len(df)):
    text_split = re.split(r'(other\s*important\s*information|for\s*more\s*information)',
                         str(df['PDF_Text'][i]))
    if len(text_split) >= 5:
        other_important_information.append(text_split[2])    
    else:
        other_important_information.append('')

In [24]:
other_important_information[0]

'. to help prevent a recurrence of this information security incident, mindlance is conducting a thorough review of its current policies and procedures. based on that review, we will evaluate what additional steps are needed to enhance the strong protections we already have in place for safeguarding personal information. '

For More Information

In [25]:
list_5 = []
for i in range(len(df)):
    text_split = re.split(r'(other\s*important\s*information|for\s*more\s*information)', 
                          str(df['PDF_Text'][i]))
    list_5.append(len(text_split))

In [28]:
sum(1 for i in list_5 if i >= 5)

281
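
Put against the 1,500 rows in `df`, the counts above give the share of texts in which each split succeeded. The counts are copied from the cell outputs; only the percentages are computed here:

```python
total = 1500

# Counts taken from the cell outputs in the sections above.
counts = {
    'Introduction': 628,
    'What Happened': 549,
    'What Information Was Involved': 450,
    'What We Are Doing': 547,
    'What You Can Do': 552,
    'Other Important Information / For More Information': 281,
}

shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
for name, pct in shares.items():
    print(f'{name}: {pct}%')
```

Roughly 19% to 42% of the texts could be split, depending on the section, which is consistent with many files predating the heading requirement.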

In [29]:
for_more_information = []
for i in range(len(df)):
    text_split = re.split(r'(other\s*important\s*information|for\s*more\s*information)',
                         str(df['PDF_Text'][i]))
    if len(text_split) >= 5:
        for_more_information.append(text_split[4])    
    else:
        for_more_information.append('')

In [30]:
for_more_information[0]

'. mindlance sincerely regrets any inconvenience this incident may cause you. if you have any questions concerning the incident, please contact our dedicated call center at 855-559-9708. our call center is available to you monday through friday (except for major u.s. holidays) from 9:00 am est through 7:00 pm est. sincerely, paul rajat\nmanaging director\nsteps to protect the security of your personal information\nby taking the following steps, you can help reduce the risk that your personal information may be misused.\n1. enroll in identityworkssm. you must personally activate identity monitoring for it to be effective. the notice\nletter contains instructions and information on how to activate your identityworkssm membership. if you need assistance or if you want to enroll by telephone, you should contact experian directly at 1-877-890-9332. experian’s identityworkssm product will provide the following: l experian credit report at signup: see what information is associated with your 
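
The splits above assume that the chosen pair of headings both occur and that they appear in the expected order. A possible alternative, sketched here and not used in this notebook, locates every heading with `re.finditer` and slices the text between consecutive headings, so each section is keyed by the heading that actually precedes it. The function name `split_by_headings` and the sample notice are made up for illustration:

```python
import re

HEADINGS = re.compile(
    r'what\s*happened'
    r'|what\s*information\s*was\s*involved'
    r'|what\s*we\s*are\s*doing'
    r'|what\s*you\s*can\s*do'
    r'|other\s*important\s*information'
    r'|for\s*more\s*information'
)

def split_by_headings(text):
    """Return {'introduction': ..., heading: body, ...} for one notice,
    or {} if none of the headings is found."""
    matches = list(HEADINGS.finditer(text))
    if not matches:
        return {}
    sections = {'introduction': text[:matches[0].start()]}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following is not None else len(text)
        # Collapse whitespace runs inside the heading so it can serve as a key.
        key = ' '.join(current.group().split())
        # Keep the first occurrence if a heading phrase repeats later on.
        sections.setdefault(key, text[current.end():end])
    return sections

sample = ('dear customer, what happened? a laptop was stolen. '
          'what we are doing: we reset passwords. '
          'for more information call us.')
sections = split_by_headings(sample)
for heading, body in sections.items():
    print(repr(heading), '->', repr(body))
```

A design note: this keeps whatever headings a letter happens to contain, at the cost of treating any in-sentence use of a heading phrase as a section boundary.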

Creating a dataframe with the sections and concatenating it with the original dataframe.

In [142]:
final_list = [('Introduction', introduction), ('What Happened', what_happened), 
              ('What Information Was Involved', what_info_was_involved),
             ('What We Are Doing', what_we_are_doing), ('What You Can Do', what_you_can_do), 
             ('Other Important Information', other_important_information),
             ('For More Information', for_more_information)]
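
Every list has to hold exactly one entry per row of `df` for the new columns to line up with the original ones. A quick check before building the frame can catch a list left over from a stale or partially run cell; `check_lengths` is a hypothetical helper and the short lists below stand in for the real ones:

```python
def check_lengths(named_lists, expected):
    """Return the names of the lists whose length differs from expected."""
    return [name for name, values in named_lists if len(values) != expected]

# Stand-ins for final_list and len(df).
demo_list = [('Introduction', ['a', 'b', 'c']),
             ('What Happened', ['d', 'e', 'f']),
             ('For More Information', ['g', 'h'])]
mismatched = check_lengths(demo_list, expected=3)
print(mismatched)
```

On the real data, `check_lengths(final_list, expected=len(df))` should return an empty list.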

In [32]:
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order.
headings = pd.DataFrame(dict(final_list))

In [144]:
final_df = pd.concat([df, headings], axis=1)

In [146]:
final_df.shape

(1500, 14)

Exporting the headings DataFrame to an xlsx file.

In [33]:
writer = pd.ExcelWriter('Headings_2.xlsx', engine='xlsxwriter')

In [34]:
headings.to_excel(writer)

In [35]:
writer.close()  # ExcelWriter.save() was removed in pandas 2.0