## In this notebook, I check for duplicates and then create 3-tuples of (ECCO-category folder, filename, class label). The output is needed later when pickling the three models.

#### Since this is one of the initial iterations, there are 19 categories here; the final version ends up with 17 classes.

In [4]:
import pandas as pd
import os

path = './ECCO_TrainingData_Trial4/'
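# NOTE: [2:] skips the first two directory entries, presumably non-CSV files;
# this relies on the listing order, which os.listdir does not guarantee.
print len(os.listdir(path)[2:])
print os.listdir(path)[2:]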

19
['Agriculture.csv', 'Biography.csv', 'Botany.csv', 'Church.csv', 'Commerce.csv', 'Dictionaries.csv', 'Drama.csv', 'Fiction.csv', 'History.csv', 'History_Natural.csv', 'Law.csv', 'Mathematics.csv', 'Medicine.csv', 'Physics.csv', 'Poetry.csv', 'Politics.csv', 'Rhetoric.csv', 'Sermons.csv', 'Travels.csv']
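
The `[2:]` slice used above depends on the order of the directory listing, which `os.listdir` does not guarantee. A minimal sketch of a sturdier alternative, assuming the intent is simply to pick up every `.csv` file in the folder; `list_csv_files` is a hypothetical helper, not part of the original notebook:

```python
import os


def list_csv_files(directory):
    """Return the sorted names of all .csv files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.csv')
        and os.path.isfile(os.path.join(directory, name))
    )
```

Calling `list_csv_files(path)` in place of `os.listdir(path)[2:]` would ignore hidden or stray files wherever they fall in the listing, and sorting keeps the table order stable between runs.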


#### That leaves 18 categories once Dictionaries is excluded (it will be dropped), but I'm building the master tuple for all 19 categories in case I need it later.

In [5]:
df_s = []
total_rows = 0
for name in os.listdir(path)[2:]:
    df = pd.read_csv(path+name)
    df['TableName'] = name[:-4]
    print name + ": " + str(df.shape[0])
    total_rows += df.shape[0]
    df_s.append(df)
print "\nThere are a total of " + str(total_rows) + " rows."

Agriculture.csv: 1403
Biography.csv: 2650
Botany.csv: 494
Church.csv: 5077
Commerce.csv: 2530
Dictionaries.csv: 6586
Drama.csv: 4708
Fiction.csv: 4557
History.csv: 5619
History_Natural.csv: 501
Law.csv: 4402
Mathematics.csv: 1282
Medicine.csv: 4043
Physics.csv: 852
Poetry.csv: 6159
Politics.csv: 4182
Rhetoric.csv: 1334
Sermons.csv: 7410
Travels.csv: 2683

There are a total of 66472 rows.


In [6]:
# Checking if the count matches the numbers in count_table:
count_df = pd.read_csv('./ECCO_Training_Table.csv')
ROWS = count_df['no.docs'].sum()
count_df

Unnamed: 0.1,Unnamed: 0,category,no.docs
0,1,Agriculture,1403
1,2,Biography,2650
2,3,Botany,494
3,4,Church,5077
4,5,Commerce,2530
5,6,Dictionaries,6586
6,7,Drama,4708
7,8,Fiction,4557
8,9,History_Natural,501
9,10,History,5619


In [7]:
print ROWS
print total_rows

66472
66472
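
Matching totals do not rule out two categories being off by offsetting amounts. A sketch of a per-category comparison, assuming the row counts and the count table have each been reduced to a plain `{category: count}` dict; `diff_counts` is a hypothetical helper, and the sample numbers are copied from the outputs above:

```python
def diff_counts(observed, expected):
    """Return {category: (observed, expected)} for every category whose counts differ.

    A category missing from one side is reported with None on that side.
    """
    mismatches = {}
    for category in set(observed) | set(expected):
        seen = observed.get(category)
        wanted = expected.get(category)
        if seen != wanted:
            mismatches[category] = (seen, wanted)
    return mismatches


# Sample values copied from the outputs above (first three categories only).
observed = {'Agriculture': 1403, 'Biography': 2650, 'Botany': 494}
expected = {'Agriculture': 1403, 'Biography': 2650, 'Botany': 494}
print(diff_counts(observed, expected))
```

An empty dict means every category agrees, which is a stronger statement than the two sums being equal.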


#### The numbers match the count table :)
## Now, checking for duplicates:

In [9]:
_19tables = pd.concat(df_s, ignore_index=True)
print _19tables.nunique()
_19tables

Filename          66472
DocumentID        66472
ESTC_ID           56705
Date                125
Title             53547
Vol_Number          156
Author            14596
Imprint           47673
Field_Headings    28474
TableName            19
dtype: int64


Unnamed: 0,Filename,DocumentID,ESTC_ID,Date,Title,Vol_Number,Author,Imprint,Field_Headings,TableName
0,0604100200.xml,604100200,T122642,1771,"A six weeks tour, through the southern countie...",0,"Young, Arthur","Dublin : printed for J. Milliken, 1771.","Agriculture, Wales, Agriculture, England",Agriculture
1,0623400500.xml,623400500,T118425,1771,"An abridgment of the six weeks, and six months...",0,"Young, Arthur","Dublin : printed by S. Powell, 1771.","Agriculture, England",Agriculture
2,0651800800.xml,651800800,T040653,1794,General view of the agriculture of the county ...,0,"Foot, Peter, land-surveyor","London : printed by John Nichols, 1794.","Agriculture, England, London Region, London (E...",Agriculture
3,0680600100.xml,680600100,T040647,1799,General view of the agriculture of the county ...,0,"Young, Arthur",London : printed by W. Bulmer and Co. for G. N...,"Agriculture, England, Lincolnshire, Lincolnshi...",Agriculture
4,0810200300.xml,810200300,T138847,1767,Certain ancient tracts concerning the manageme...,0,Anon,"London : printed for C. Bathurst, at the Cross...","Farm management, Surveying, Early works to 180...",Agriculture
5,0816800300.xml,816800300,T147158,1772,"A six weeks tour, through the southern countie...",0,"Young, Arthur","London : printed for W. Strahan; W. Nicoll, No...","Agriculture, Wales, Agriculture, England",Agriculture
6,1072000601.xml,1072000601,N004781,1753,The country gentleman's companion. In two volu...,Volume 1,Country gentleman,"London : printed for the author, and sold by T...","Fishing, Agriculture, Fish culture, Fish-culture",Agriculture
7,1072000602.xml,1072000602,N004781,1753,The country gentleman's companion. In two volu...,Volume 2,Country gentleman,"London : printed for the author, and sold by T...","Fishing, Agriculture, Fish culture, Fish-culture",Agriculture
8,1107400101.xml,1107400101,T078928,1771,The farmer's tour through the East of England....,Volume 1,"Young, Arthur","London : printed for W. Strahan; W. Nicoll, No...","Agriculture, Early works to 1800, Agriculture,...",Agriculture
9,1107400102.xml,1107400102,T078928,1771,The farmer's tour through the East of England....,Volume 2,"Young, Arthur","London : printed for W. Strahan; W. Nicoll, No...","Agriculture, Early works to 1800, Agriculture,...",Agriculture


In [10]:
# Things look good so far, since there are 66472 unique filenames.
# Just re-checking if any filename appears more than once.

key_listoftuples = []
for name in os.listdir(path)[2:]:
    df = pd.read_csv(path+name)
    key_listoftuples.append((name, df['Filename'].tolist()))
#     print name + ": " + str(df.shape[0])
    
common_ = {}

for (name, fn_list) in key_listoftuples:
#     print "Running for ", name
    for (n, fn) in key_listoftuples:
        if n == name:
            continue
        shared = set(fn).intersection(fn_list)
        if len(shared) != 0:  # `is not 0` compared identity, not value
            print name[:-4] + " and " + n[:-4] + " have: " + str(shared)
            print len(shared)

            first = n[:-4] + " and " + name[:-4]

            if first not in common_:
                common_[first] = list(shared)
print common_

{}
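
The pairwise comparison above examines every pair of tables twice. As an alternative sketch, assuming each table has been reduced to a list of filenames, a single pass can record which tables each filename appears in; `find_shared_filenames` and the toy input are hypothetical:

```python
from collections import defaultdict


def find_shared_filenames(table_to_filenames):
    """Map each filename that occurs in more than one table to the sorted tables holding it."""
    tables_by_filename = defaultdict(set)
    for table, filenames in table_to_filenames.items():
        for filename in filenames:
            tables_by_filename[filename].add(table)
    return {
        filename: sorted(tables)
        for filename, tables in tables_by_filename.items()
        if len(tables) > 1
    }


# Toy input (made-up filenames) with one deliberate overlap.
toy = {
    'Agriculture': ['a.xml', 'b.xml'],
    'Botany': ['b.xml', 'c.xml'],
    'Law': ['d.xml'],
}
print(find_shared_filenames(toy))
```

An empty result corresponds to the `{}` printed above. Like the original loop, this only reports overlaps between tables; a filename repeated inside a single table would need a separate check.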


#### Now that we're sure there are no duplicates, moving on to create the master tuple:

In [11]:
path = '../Dataset/'

allFilenamesInDataset = []
# NOTE: [1:] skips the first directory entry, presumably a non-folder file.
folders = os.listdir(path)[1:]

for folder_name in folders:
#     print folder_name
#     print os.listdir(path+folder_name)
    allFilenamesInDataset.extend(os.listdir(path+folder_name))
print len(allFilenamesInDataset)

154924


In [12]:
# Checking if I have the txts for all filenames in _19tables
fn_19tables = [f+'.txt' for f in _19tables['Filename'].tolist()]

for fname in fn_19tables:
    if fname not in allFilenamesInDataset:
        print fname
print "DONE!"

DONE!
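
Testing `fname not in allFilenamesInDataset` scans a list of 154924 entries for each of the 66472 filenames. A sketch of the same check using sets, where a membership test takes constant time on average; `missing_files` and the toy lists are hypothetical:

```python
def missing_files(required, available):
    """Return the sorted names in `required` that are absent from `available`."""
    return sorted(set(required) - set(available))


# Toy data: one required file has no matching .txt in the dataset.
required = ['0001.xml.txt', '0002.xml.txt', '0003.xml.txt']
available = ['0001.xml.txt', '0003.xml.txt', '0004.xml.txt']
print(missing_files(required, available))
```

An empty list plays the same role as the loop above printing nothing before `DONE!`.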


In [13]:
# Confirmed: I have all the txt's.

key_to_txts = {}

for folder_name in folders:
    temp = os.listdir(path+folder_name)
    key_to_txts[folder_name] = temp
print key_to_txts['GenRef']

['0031300100.xml.txt', '0031300200.xml.txt', '0031300300.xml.txt', '0031300400.xml.txt', '0031300500.xml.txt', '0031300600.xml.txt', '0031300700.xml.txt', '0031300800.xml.txt', '0031300900.xml.txt', '0031301000.xml.txt', '0031301100.xml.txt', '0031301200.xml.txt', '0031301300.xml.txt', '0031301400.xml.txt', '0031301500.xml.txt', '0031301700.xml.txt', '0031301800.xml.txt', '0031301900.xml.txt', '0031302000.xml.txt', '0031302100.xml.txt', '0031302200.xml.txt', '0031302300.xml.txt', '0031302400.xml.txt', '0031302500.xml.txt', '0031400101.xml.txt', '0031400102.xml.txt', '0031400103.xml.txt', '0031400104.xml.txt', '0031400105.xml.txt', '0031400106.xml.txt', '0031400107.xml.txt', '0031400108.xml.txt', '0031400200.xml.txt', '0031400300.xml.txt', '0031400400.xml.txt', '0031400500.xml.txt', '0031400600.xml.txt', '0031500100.xml.txt', '0031500200.xml.txt', '0031500300.xml.txt', '0031500400.xml.txt', '0031500500.xml.txt', '0031500600.xml.txt', '0031500700.xml.txt', '0031500800.xml.txt', '00315009

In [14]:
for folder in key_to_txts.keys():
    print folder + ": " + str(len(key_to_txts[folder]))

GenRef: 4182
RelAndPhil: 39079
SSAndFineArt: 31450
HistAndGeo: 14570
MedSciTech: 12448
Law: 9966
LitAndLang_2: 11249
LitAndLang_1: 31980


In [15]:
# .copy() makes an independent DataFrame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning.
useful_cleantables = _19tables[['Filename', 'TableName']].copy()
useful_cleantables['Filename'] = useful_cleantables['Filename'] + '.txt'
useful_cleantables


Unnamed: 0,Filename,TableName
0,0604100200.xml.txt,Agriculture
1,0623400500.xml.txt,Agriculture
2,0651800800.xml.txt,Agriculture
3,0680600100.xml.txt,Agriculture
4,0810200300.xml.txt,Agriculture
5,0816800300.xml.txt,Agriculture
6,1072000601.xml.txt,Agriculture
7,1072000602.xml.txt,Agriculture
8,1107400101.xml.txt,Agriculture
9,1107400102.xml.txt,Agriculture


In [17]:
# Converting dataframe into a list of tuples:
fn_label_tuples = [tuple(x) for x in useful_cleantables.values]
# print fn_label_tuples

# Creating the Master-Tuple that has everything I'll need:

folder_fn_label_tuples = [] # in that order

k = 0
for (fname, label) in fn_label_tuples:
    if k % 1000 == 0:
        print k
        
    for folder in key_to_txts.keys():
        if fname in key_to_txts[folder]:
            folder_fn_label_tuples.append((folder, fname, label))
            
    k += 1
    
print "\n\n", len(folder_fn_label_tuples)
print folder_fn_label_tuples[245:250]

0
1000
2000
...
66000


66472
[('MedSciTech', '0657500201.xml.txt', 'Agriculture'), ('MedSciTech', '0657500202.xml.txt', 'Agriculture'), ('MedSciTech', '0657600400.xml.txt', 'Agriculture'), ('MedSciTech', '0660700700.xml.txt', 'Agriculture'), ('MedSciTech', '0660800200.xml.txt', 'Agriculture')]
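
The loop above tests every filename against each folder's list in turn, which is why a progress counter was useful. A sketch of a faster variant that first builds a filename-to-folder lookup, assuming each filename lives in exactly one folder (the 66472 tuples above are consistent with this, but it is not verified here); `build_master_tuples` and the toy data are hypothetical:

```python
def build_master_tuples(fn_label_tuples, key_to_txts):
    """Return (folder, filename, label) tuples using a one-off filename -> folder index."""
    folder_of = {}
    for folder, filenames in key_to_txts.items():
        for filename in filenames:
            # Assumption: a filename appears in only one folder; a later folder would overwrite.
            folder_of[filename] = folder
    return [
        (folder_of[fname], fname, label)
        for fname, label in fn_label_tuples
        if fname in folder_of  # filenames with no .txt on disk are skipped
    ]


# Toy data with made-up filenames.
toy_folders = {'Law': ['1.xml.txt'], 'MedSciTech': ['2.xml.txt', '3.xml.txt']}
toy_labels = [('2.xml.txt', 'Agriculture'), ('1.xml.txt', 'Law'), ('9.xml.txt', 'Botany')]
print(build_master_tuples(toy_labels, toy_folders))
```

The result keeps the order of the input tuples, so it would line up with the list built by the original loop whenever the one-folder assumption holds.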


In [18]:
# Pickling for further use:
import pickle

with open('./5.0_folder_fn_label_tuples.pickle', 'wb') as f:
    pickle.dump(folder_fn_label_tuples, f)

In [21]:
# Checking it works properly:
with open('./5.0_folder_fn_label_tuples.pickle', 'rb') as f:
    loaded_tuples = pickle.load(f)
print loaded_tuples[:5]

[('HistAndGeo', '0604100200.xml.txt', 'Agriculture'), ('HistAndGeo', '0623400500.xml.txt', 'Agriculture'), ('HistAndGeo', '0651800800.xml.txt', 'Agriculture'), ('HistAndGeo', '0680600100.xml.txt', 'Agriculture'), ('HistAndGeo', '0810200300.xml.txt', 'Agriculture')]
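
Printing the first five entries only spot-checks the file. A sketch of a stricter round-trip check that compares the reloaded object with the original; `roundtrip_ok` is a hypothetical helper, and the sample tuples are copied from the output above:

```python
import os
import pickle
import tempfile


def roundtrip_ok(obj, pickle_path):
    """Write `obj` to `pickle_path`, reload it, and report whether the two are equal."""
    with open(pickle_path, 'wb') as f:
        pickle.dump(obj, f)
    with open(pickle_path, 'rb') as f:
        return pickle.load(f) == obj


sample = [
    ('HistAndGeo', '0604100200.xml.txt', 'Agriculture'),
    ('HistAndGeo', '0623400500.xml.txt', 'Agriculture'),
]
# A temporary directory keeps the sketch from overwriting the real pickle.
tmp_path = os.path.join(tempfile.mkdtemp(), 'roundtrip_check.pickle')
print(roundtrip_ok(sample, tmp_path))
```

For the real list, comparing the reloaded object against `folder_fn_label_tuples` directly would cover all 66472 entries instead of the first five.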
