# Introduction
The subject of this notebook, the [Full Disclosure (FD) mailing list](http://seclists.org/fulldisclosure/), is a "public, vendor-neutral forum for detailed discussion of vulnerabilities and exploitation techniques, as well as tools, papers, news, and events of interest to the community."

Let's start by exploring the corpus of message bodies extracted from the Full Disclosure mailing list and deriving insights from it.


# Number of Words per Document Histogram for Full Disclosure Mailing List
We create this histogram to see how the number of words per document is distributed for a given year, which is supplied as input.
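
As a minimal sketch of what "number of words" means here, assuming a word is any whitespace-delimited token (the same rule the `line.split()` call in the cell below applies), punctuation stays attached to its word and blank lines contribute nothing. The helper name `count_words` and the sample text are hypothetical and used only for illustration.

```python
import io

def count_words(lines):
    """Count whitespace-delimited tokens across an iterable of lines."""
    return sum(len(line.split()) for line in lines)

# Hypothetical message body, used only for illustration
sample = io.StringIO(u"Hello list,\n\nPoC attached: see advisory-01.txt\n")
print(count_words(sample))
```

Because the tokenizer is this simple, quoted replies, signatures, and pasted code all count toward a document's total.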

In [14]:
#import the libraries
import pandas as pd
import glob
import os
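#bokeh.charts requires an older Bokeh release; it was later split out into the separate bkcharts package
from bokeh.charts import Histogram, output_file, show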
from bokeh.io import output_notebook
import numpy as np

In [15]:
#Input year for which the word count histogram is being plotted
year='2012'
#Directory with the files
path = 'data/input/bodymessage_corpus/'+year

file_name_list=[]
file_wc_list=[]
file_wc_df = pd.DataFrame(columns = ["file_name","word_count"])

#function that computes the word count of one document and appends it, with the file name, to the lists above
def doc_wordcnt(path_file_name, file_name):
    #use open() for opening file.
    with open(path_file_name) as f:
        #create a list of all words fetched from the file using a list comprehension
        words = [word for line in f for word in line.split()]        
        #append the file name and the word count to a list
        file_name_list.append(file_name)
        file_wc_list.append(len(words))

#loop within the directory to get all the file names
for filename in os.listdir(path):
    #get the word count for every file
    doc_wordcnt(path+'/'+filename, filename)
    
#populate the column of the dataframe using the values populated in the list
file_wc_df["file_name"] = file_name_list
file_wc_df["word_count"] = file_wc_list

In [16]:
#Sort in descending order of word count (despite the "asc" in the variable name) to split and plot multiple histograms
file_wc_asc_df = file_wc_df.sort_values("word_count", ascending=False)

In [17]:
#function call to display the histogram in the ipython notebook
output_notebook()

# Number of Words per Document Histogram

In this notebook, 2012 will be used for single-year analysis.
The histograms below show the number of words per document for 2012. The data is split into three parts because the word counts span a wide range, and a single histogram would hide most of the detail.

To achieve this, the dataframe, sorted by word count, is split into three roughly equal parts and a histogram is plotted for each subset.
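
The split can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's own code: the helper `split_near_equal` is hypothetical and mirrors how `np.array_split` sizes its parts (when the row count is not divisible by the number of parts, the leading parts get one extra row), and the word counts are made-up values.

```python
def split_near_equal(items, parts):
    """Split a list into `parts` consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # the first `extra` chunks take one additional item
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Hypothetical word counts, sorted in descending order like file_wc_asc_df
word_counts = sorted([16, 120, 45, 8279, 300, 77, 950, 60, 210, 33], reverse=True)
chunks = split_near_equal(word_counts, 3)
for chunk in chunks:
    # number of documents, then the smallest and largest word count in the chunk
    print(len(chunk), chunk[-1], chunk[0])
```

Because the input is sorted in descending order, the first chunk holds the longest documents and the last chunk the shortest, so each histogram covers a narrower range of word counts.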

In [63]:
#function to plot multiple histograms
def plot_hist(plot_index, plot_df, plot_year):
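    #the dataframe is sorted in descending order, so the first row holds the
    #largest word count of this subset and the last row the smallest
    range_max=plot_df["word_count"].iloc[0]
    range_min=plot_df["word_count"].iloc[-1]
    #plot the histogram based on values in the dataframe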
    p = Histogram(plot_df, values='word_count',title="Word Frequency across Documents in "+plot_year+"; Range: "+str(range_min)+" to "+str(range_max))
    
    p.xaxis.axis_label = 'Word Frequency per Document'
    p.xaxis.axis_label_text_font_size = '10pt'
    p.xaxis.major_label_text_font_size = '10pt'

    p.yaxis.axis_label = 'Frequency'
    p.yaxis.axis_label_text_font_size = '10pt'
    p.yaxis.major_label_text_font_size = '10pt'
    
    show(p)

#Get the year from the first four characters of the first file name
year=file_wc_asc_df.file_name.iloc[0][:4]
#split the sorted dataframe into 3 and call the histogram function for each dataframe
for idx,val in enumerate(np.array_split(file_wc_asc_df, 3)):
    plot_hist(idx, val, year)
    

In [19]:
max_idx = file_wc_df['word_count'].idxmax()
min_idx = file_wc_df['word_count'].idxmin()
print("File name : {} , Maximum (word count) : {}".format(file_wc_df.file_name[max_idx], file_wc_df.word_count[max_idx]))
print("File name : {} , Minimum (word count) : {}".format(file_wc_df.file_name[min_idx], file_wc_df.word_count[min_idx]))

File name : 2012_Mar_341.txt , Maximum (word count) : 8279
File name : 2012_Apr_241.txt , Minimum (word count) : 16
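
The same lookup can be sketched without pandas, assuming the results are held as `(file_name, word_count)` pairs. The first and last records below are the extremes reported above; the middle record and the helper `by_count` are hypothetical placeholders added for illustration.

```python
# (file_name, word_count) pairs; the middle entry is a hypothetical placeholder
records = [
    ("2012_Mar_341.txt", 8279),
    ("hypothetical_example.txt", 500),
    ("2012_Apr_241.txt", 16),
]

def by_count(record):
    """Sort key: the word count stored in the second position of the pair."""
    return record[1]

longest = max(records, key=by_count)
shortest = min(records, key=by_count)
print("Longest  : {} ({} words)".format(longest[0], longest[1]))
print("Shortest : {} ({} words)".format(shortest[0], shortest[1]))
```

Like `idxmax` and `idxmin`, `max` and `min` return the first matching record when several documents tie for the extreme value.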
