In this notebook, I build a word frequency table for each section (heading) of the data breach notices.

Importing the modules

In [2]:
import pandas as pd
import numpy as np
import re

In [3]:
from openpyxl import load_workbook
from itertools import islice

Importing the Excel file

In [17]:
wb = load_workbook('Headings.xlsx')
print(wb.sheetnames)  # get_sheet_names() is deprecated in openpyxl

['Sheet1']


In [18]:
sheet = wb['Sheet1']  # get_sheet_by_name() is deprecated in openpyxl

In [19]:
data = sheet.values

In [20]:
cols = next(data)[1:]

In [21]:
data = list(data)

In [22]:
index = [r[0] for r in data]

In [23]:
data = (islice(r, 1, None) for r in data)

In [24]:
df = pd.DataFrame(data, index = index, columns = cols)
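
The cells above split `sheet.values` into three parts: the first row supplies the column names, the first cell of each remaining row supplies the index, and the rest of each row is the data. A minimal sketch of that split follows, using a plain list of tuples as a stand-in for `sheet.values`. The rows are made up for illustration and are not taken from `Headings.xlsx`.

```python
from itertools import islice

# Hypothetical stand-in for sheet.values: an iterator over row tuples,
# where the first row is the header and the first column is the index.
rows = iter([
    (None, "ID_Number", "Organization_Name"),
    (0, 1, "Example Org A"),
    (1, 2, "Example Org B"),
])

cols = next(rows)[1:]                             # header row minus the index column
body = list(rows)                                 # remaining rows, materialised once
index = [r[0] for r in body]                      # first cell of each row
data = [tuple(islice(r, 1, None)) for r in body]  # everything after the first cell

print(cols)
print(index)
print(data)
```

Materialising the rows with `list` before building `index` matters: `sheet.values` is a generator, so it can only be walked once.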

In [17]:
df.head()

Unnamed: 0,ID_Number,Organization_Name,Date(s)_of_Breach,Reported_Date,Hyperlinks,Total_PDFs,PDF Number,PDF Link,PDF_Text,Length,Introduction,What Happened,What Information Was Involved,What Are We Doing,What You Can Do,Other Important Information,For More Information,Able to Split?
0,1,"Mindlance, Inc.",12/28/2017,01/19/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/California%20T...,"re: notice of data breach dear : mindlance, in...",9181.0,"re: notice of data breach dear : mindlance, in...",? certain mindlance confidential and proprieta...,"? an attachment to the december 29, 2017 e-mai...","first, we want to emphasize that mindlance ha...",\nin addition to the steps mindlance has taken...,. to help prevent a recurrence of this informa...,. mindlance sincerely regrets any inconvenienc...,Yes-All Headings
1,2,Rosewood Hotel Group,"05/29/2016, 01/11/2017",01/19/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Rosewood%20-%2...,t8321 v.03 01.16.2018return mail processing ce...,26634.0,t8321 v.03 01.16.2018return mail processing ce...,?sabre notified us in late december 2017 that ...,?sabre has indicated to us that the affected r...,"after learning of the issue, we quickly began ...",we take our obligation to safeguard our guests...,we regret that this issue at sabre may affect ...,"on fraud alerts, you also may contact the ftc...",Yes-All Headings
2,3,Corovan Corporation,09/14/2017,01/18/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Corovan%20-%20...,"exhibit 1 by providing this notice, corovan co...",25594.0,"exhibit 1 by providing this notice, corovan co...","? on october 17, 2017, we became aware that c...",? as part of the investigation into this inci...,. we take the security of your personal infor...,. we encourage you to enroll and receive the c...,,,Yes-Some Headings
3,4,Employer Leasing Company,"09/14/2017, 09/18/2018",01/18/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Employer%20Lea...,"by providing this notice, employer leasing com...",25858.0,"by providing this notice, employer leasing com...","? on october 17, 2017, we became aware that c...",? as part of the investigation into this inci...,. we take the security of your personal infor...,. we encourage you to enroll and receive the c...,,,Yes-Some Headings
4,5,American Golf Corporation,"12/12/2017, 12/15/2017",01/18/2018,https://oag.ca.gov/ecrime/databreach/reports/s...,1,PDF1,https://oag.ca.gov/system/files/Sample%20Notic...,909 North Sepulveda Blvd Suite 650 El Segundo ...,8583.0,,We were recently informed by the company that ...,We believe that the incident could have affect...,We take the privacy of personal information se...,We recommend that you review credit and debit ...,,,Yes-Some Headings


Create copy of DataFrame

In [20]:
new_df = df.copy()

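Importing methods from NLTK

In [19]:
# stopwords and word_tokenize need the matching NLTK data (see nltk.download)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, word_tokenize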

In [None]:
new_df['Introduction'].str.split(expand=True).stack().value_counts()

In [68]:
my_words_intro = new_df['Introduction'].str.lower().str.cat(sep = ' ')

Creating a function which returns the word frequencies as a dataframe

In [84]:
def wordfrequency(dataframe, heading):
    ### This function concatenates all the strings in the column 'heading' of the dataframe 'dataframe',
    ### removes punctuation, removes stopwords, and returns the word frequencies as a dataframe
    
    all_words = dataframe[heading].str.lower().str.cat(sep = ' ')
    # Removing punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    without_punc = tokenizer.tokenize(all_words)
    
    # Create a string object
    new_word_string = ' '.join(without_punc)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words_tokens = word_tokenize(new_word_string)
    filtered_words = [word for word in words_tokens if word not in stop_words]
    
    # Get the word frequencies
    word_freq = nltk.FreqDist(filtered_words)
    
    # Return the frequencies in a dataframe
    return pd.DataFrame(word_freq.most_common(), columns = ['Word', 'Frequency'])
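
The same steps (lowercase, concatenate, strip punctuation, drop stop words, count) can be sketched with only the standard library. This is an illustrative approximation, not the notebook's implementation: `word_frequency_sketch` and the tiny `STOP_WORDS` set are hypothetical, and the notebook itself uses NLTK's English stop word list and `FreqDist`.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration; the notebook uses
# NLTK's English stop words instead.
STOP_WORDS = {"the", "of", "a", "to", "we", "your", "and"}

def word_frequency_sketch(texts, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first."""
    # Skip missing values, then lowercase and concatenate,
    # as Series.str.cat(sep=' ') does by default.
    all_words = " ".join(t.lower() for t in texts if t is not None)
    # \w+ keeps runs of letters, digits and underscores, dropping punctuation.
    tokens = re.findall(r"\w+", all_words)
    filtered = [w for w in tokens if w not in stop_words]
    return Counter(filtered).most_common()

# Made-up sample text, including one missing value.
sample = [
    "We take the security of your information seriously.",
    None,
    "Notice of data breach: your information, and the incident.",
]
freq = word_frequency_sketch(sample)
print(freq[:1])
```

The sketch tokenizes once. The function above tokenizes twice (`RegexpTokenizer`, then `word_tokenize` on the re-joined string), which should give the same tokens, since the text no longer contains punctuation at that point.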
    

Introduction

In [None]:
introduction = wordfrequency(new_df, 'Introduction')

In [86]:
introduction.head()

Unnamed: 0,Word,Frequency
0,information,1307
1,data,708
2,incident,673
3,security,633
4,notice,620


In [102]:
introduction.to_csv('Introduction.csv')

What Happened? 

In [88]:
what_happened = wordfrequency(new_df, 'What Happened')

In [89]:
what_happened.head()

Unnamed: 0,Word,Frequency
0,information,862
1,card,534
2,unauthorized,398
3,access,369
4,investigation,365


In [103]:
what_happened.to_csv('What Happened.csv')

What Information Was Involved?

In [90]:
what_info_was_involved = wordfrequency(new_df, 'What Information Was Involved')

In [91]:
what_info_was_involved.head()

Unnamed: 0,Word,Frequency
0,information,1141
1,card,773
2,number,676
3,security,484
4,name,439


In [104]:
what_info_was_involved.to_csv('What Information Was Involved.csv')

What Are We Doing?

In [92]:
what_are_we_doing = wordfrequency(new_df, 'What Are We Doing')

In [93]:
what_are_we_doing.head()

Unnamed: 0,Word,Frequency
0,information,941
1,identity,662
2,card,560
3,credit,518
4,incident,479


In [105]:
what_are_we_doing.to_csv('What Are We Doing.csv')

What You Can Do

In [95]:
what_you_can_do = wordfrequency(new_df, 'What You Can Do')

In [96]:
what_you_can_do.head()

Unnamed: 0,Word,Frequency
0,credit,2392
1,information,1123
2,report,1097
3,identity,1091
4,theft,850


In [106]:
what_you_can_do.to_csv('What You Can Do.csv')

Other Important Information

In [97]:
other_important_info = wordfrequency(new_df, 'Other Important Information')

In [98]:
other_important_info.head()

Unnamed: 0,Word,Frequency
0,credit,2411
1,report,1286
2,information,1226
3,identity,1131
4,security,1023


In [107]:
other_important_info.to_csv('Other Important Information.csv')

For More Information

In [99]:
for_more_info = wordfrequency(new_df, 'For More Information')

In [100]:
for_more_info.head()

Unnamed: 0,Word,Frequency
0,credit,3274
1,report,1540
2,identity,1478
3,information,1386
4,freeze,1285


In [108]:
for_more_info.to_csv('For More Information.csv')
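
The seven sections above repeat the same steps: compute the table for one heading, inspect it, and write it to `<heading>.csv`. A loop over the headings would do the same with less repetition. The sketch below assumes a hypothetical `count_words` helper standing in for `wordfrequency`, and it uses made-up text and a temporary directory so that it runs on its own.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

def count_words(texts):
    # Hypothetical stand-in for wordfrequency(): whitespace tokens only.
    return Counter(" ".join(texts).lower().split()).most_common()

def write_tables(columns, out_dir):
    """Write one '<heading>.csv' file per heading and return the paths."""
    paths = []
    for heading, texts in columns.items():
        path = Path(out_dir) / f"{heading}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Word", "Frequency"])
            writer.writerows(count_words(texts))
        paths.append(path)
    return paths

# Made-up text for two of the seven headings.
toy = {
    "What Happened": ["card data", "card access"],
    "What You Can Do": ["credit report", "credit freeze"],
}
out_dir = tempfile.mkdtemp()
paths = write_tables(toy, out_dir)
print([p.name for p in paths])
```

In the notebook itself, the equivalent would be a loop over the heading column names that calls `wordfrequency(new_df, heading)` and then `to_csv` on the result.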