# Challenge 3: Metadata Mining from PDF dataset

### Statement

A collection of .pdf files is provided in PDF directory. Use this collection to perform the following tasks.

**Note:** PyPDF2 library is available on the system

### Tasks


**Task 1:** Print the names of all files in the directory.

In [23]:
from PyPDF2 import PdfFileReader, PdfFileWriter
import os

file_list =  os.listdir("PDF")

counter = 1
for f in file_list:
    print str(counter)+"  "+f
    counter = counter + 1

1  7.pdf
2  9.pdf
3  1.pdf
4  4.pdf
5  10.pdf
6  6.pdf
7  8.pdf
8  3.pdf
9  5.pdf
10  2.pdf


**Task 2.** Print the superset of all attributes present in PDF metsadata.

In [15]:
from PyPDF2 import PdfFileReader, PdfFileWriter
import os

directory = "PDF/"
file_list =  os.listdir(directory)

attr_superset = []

for f in file_list:
    # Opening file
    f = directory+f
    with open(f, 'rb') as input_file:
        reader = PdfFileReader(input_file)
        metadata = reader.getDocumentInfo()
        for item, value in metadata.iteritems():
            item = item.replace('/','')
            if item not in attr_superset:
                attr_superset.append(item)
            
print attr_superset
    

['ModDate', 'Trapped', 'Producer', 'CreationDate', 'Creator', 'Title', 'Author', 'Keywords', 'Subject']


**Task 3.** Print superset of softwares used to create the PDF files.

In [12]:
from PyPDF2 import PdfFileReader, PdfFileWriter
import os

def superset(target_field):
    directory = "PDF/"
    file_list =  os.listdir(directory)

    result_supterset = []

    for f in file_list:
        # Opening file
        f = directory+f
        with open(f, 'rb') as input_file:
            reader = PdfFileReader(input_file)
            metadata = reader.getDocumentInfo()
            if target_field in metadata:
                field_value = metadata.get(target_field)
                if field_value not in attr_superset:
                    result_supterset.append(field_value)

    return result_supterset
    
print superset('/Creator')

[u'Adobe InDesign CC (Macintosh)', u'Adobe InDesign CC 2015 (Macintosh)', u'CorelDRAW X5', u'Adobe InDesign CC 2017 (Macintosh)', u'Adobe InDesign CS6 (Macintosh)', u'QuarkXPress(R) 4.1', u'Adobe InDesign CC 2014 (Macintosh)']


**Task 4.** Print superset of authors who created these PDF files.

In [19]:
print superset('/Author')

[u'Amol', u'stanojev', u'United Energy Heating & Cooling']


**Task 5.** Print superset of values for all fields present in the attribute superset list.

In [14]:

for attr in attr_superset:
    attr = '/'+attr
    print "==========> "+attr
    print superset(attr)

[u"D:20140131140820+01'00'", u"D:20160301145326-05'00'", u"D:20160616191215+05'30'", u"D:20180116110453+11'00'", u"D:20160318175247+05'30'", u"D:20020811085301-04'00'", u"D:20120119165208+05'30'", u"D:20150924094404-05'00'", u"D:20170405161457-04'00'"]
['/False', '/False', '/False', '/False']
[u'Adobe PDF Library 10.0.1', u'Adobe PDF Library 15.0', u'Corel PDF Engine Version 15.0.0.486', u'Acrobat Distiller 17.0 (Macintosh)', u'Adobe PDF Library 10.0.1', u'Acrobat Distiller 3.01 for Power Macintosh', u'Acrobat Distiller 7.0 (Windows)', u'Adobe PDF Library 11.0']
[u"D:20140131134324+01'00'", u"D:20160301145310-05'00'", u"D:20160616191215+05'30'", u"D:20171222101252+11'00'", u"D:20160318174858+05'30'", u'D:20020404165209Z', u"D:20120119164724+05'30'", u"D:20150924094404-05'00'", u"D:20170405160641-04'00'"]
[u'Adobe InDesign CC (Macintosh)', u'Adobe InDesign CC 2015 (Macintosh)', u'CorelDRAW X5', u'Adobe InDesign CC 2017 (Macintosh)', u'Adobe InDesign CS6 (Macintosh)', u'QuarkXPress(R) 4.

**Task 6.** List the titles of all books from author "Disney".

In [20]:
from PyPDF2 import PdfFileReader, PdfFileWriter
import os

def search_pdfs(target_field, target_value):
    directory = "PDF/"
    file_list =  os.listdir(directory)

    files_matched = {}

    for f in file_list:
        # Opening file
        f = directory+f
        with open(f, 'rb') as input_file:
            reader = PdfFileReader(input_file)
            metadata = reader.getDocumentInfo()
            if target_field in metadata:
                field_value = metadata.get(target_field)
                if target_value in field_value:
                    files_matched[f.replace(directory, '')] = field_value
                    
    return files_matched

print search_pdfs('/Author','Disney')
    

{}


**Task 7.** Print the list of files which were generated using **Adobe** products.

In [21]:
print search_pdfs('/Creator','Adobe')

{'9.pdf': u'Adobe InDesign CC (Macintosh)', '1.pdf': u'Adobe InDesign CC 2015 (Macintosh)', '6.pdf': u'Adobe InDesign CS6 (Macintosh)', '2.pdf': u'Adobe InDesign CC 2014 (Macintosh)', '10.pdf': u'Adobe InDesign CC 2017 (Macintosh)'}
