## MCC

The goal of these notebooks is to evaluate simple classification approaches on profile data. Can we learn a classifier that labels profiles with the MCC (the `Matl Category Code` field), given some labeled data?

My process will be:
* Do a little exploration - what does the data look like?  What labeled data do I have?
* Build a crappy pipeline from data pre-processing to evaluating a classifier with a confusion matrix
* Try to build a decent word feature extractor to see if that will improve performance

This notebook starts the exploratory analysis.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import base_dir as bd
from collections import defaultdict
base_dir=bd.get_base_dir()

In [3]:
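# basic word frequency analysis from detailed composition data

word_counts=defaultdict(int)

with open(base_dir+'Profile_Composition_6-29-2017.txt') as f:
    # skip the first two lines of the file (presumably header rows)
    for record in f.readlines()[2:]:
        # fixed-width record: the text of interest sits in characters 87-137
        text=record[87:138]
        for word in text.split():
            word_counts[word]+=1

print 'There are %d words\n'%len(word_counts)

# sort (count, word) pairs in descending order and show the top 200
print 'Top words:'
for count,word in sorted([(v,k) for (k,v) in word_counts.iteritems()],reverse=True)[:200]:
    print count, word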

There are 22072 words

Top words:
16180 SODIUM
14407 POTASSIUM
14089 ALUMINUM
13859 ZINC
13787 MAGNESIUM
13759 SILVER
13736 BARIUM
13729 LITHIUM
13677 LEAD
13666 SELENIUM
13658 CHROMIUM
13644 COPPER
13599 NICKEL
13582 SILICON
13516 CADMIUM
13509 ARSENIC
13495 ANTIMONY
13451 THALLIUM
13450 BERYLLIUM
13445 MOLYBDENUM
6620 OF:
6470 WATER
5736 CONSISTING
3819 NON-HALOGENATED
3224 ACID
3100 SOLVENTS:
3038 ETHYL
2601 METHYL
2275 CONTAINERS
2264 SMALL
2058 ACETONE,
1881 CARBON
1837 TOLUENE,
1706 IN
1635 PETROLEUM
1618 OIL
1508 AND
1480 CHLORIDE
1459 ACETATE,
1454 METHANOL,
1444 BENZENE
1425 PPE,
1393 PAPER,
1357 ETHANOL,
1315 CALCIUM
1312 MERCURY
1299 XYLENE,
1294 OF
1289 KETONE,
1234 CONTAINING:
1214 PLASTIC,
1209 TITANIUM
1156 MINERAL
1148 OIL,
1110 PLASTIC
1105 TALC,
1047 CLAY,
1033 PAINT
1004 GLYCOL
1002 RAGS,
987 DEBRIS
961 OR
959 HYDROXIDE
958 SILICA,
929 SOLVENTS
913 METHYLENE
860 METHANOL
854 ***********************************
817 HALOGENATED
815 DISTILLATES,
806 RESIN
764 CARBONATE,
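
The counts above treat `OIL` and `OIL,`, or `SOLVENTS` and `SOLVENTS:`, as different words, and the separator row of asterisks shows up as a word of its own. A minimal sketch of one way to merge such variants follows. It assumes that stripping punctuation from both ends of a token is acceptable for this data; the `normalize` and `count_words` helpers and the sample strings are hypothetical and not part of the notebook. Chemical names that legitimately start or end with punctuation would need more careful handling.

```python
from collections import Counter
import string


def normalize(token):
    """Upper-case a token and strip punctuation from both ends only."""
    return token.upper().strip(string.punctuation)


def count_words(texts):
    """Count normalized tokens, dropping tokens that normalize to ''."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            word = normalize(token)
            if word:  # '***' and similar separators become '' and are skipped
                counts[word] += 1
    return counts


# Hypothetical sample strings in the style of the composition text
sample = [
    "WATER, OIL, ACETONE,",
    "OIL AND WATER",
    "*** NON-HALOGENATED SOLVENTS:",
]
counts = count_words(sample)
print(counts.most_common(3))
```

Internal punctuation is left alone, so `NON-HALOGENATED` stays a single token.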

In [4]:
# basic word frequency analysis from profile data (standard metals not included)
# the composition data has already been processed by the data extraction SQL query

word_counts=defaultdict(int)

with open(base_dir+'Profile_Data_6-29-2017.txt') as f:
    # uncomment to list the field indices from the header row
    #for idx,field in enumerate(f.readlines()[0].split('|')):
    #    print idx,field
    # skip the header row; the text analysed here is field index 22 of each pipe-delimited record
    for record in f.readlines()[1:]:
        text=record.split('|')[22]
        for word in text.split():
            word_counts[word]+=1

print 'There are %d words\n'%len(word_counts)

# sort (count, word) pairs in descending order and show the top 200
print 'Top words:'
for count,word in sorted([(v,k) for (k,v) in word_counts.iteritems()],reverse=True)[:200]:
    print count, word

There are 18388 words

Top words:
5881 OF:
5695 WATER
4932 CONSISTING
3613 NON-HALOGENATED
2843 SOLVENTS:
2773 ACID
2568 ETHYL
2196 SODIUM
2191 SMALL
2176 CONTAINERS
2117 METHYL
1885 ACETONE,
1575 TOLUENE,
1499 OIL
1429 PETROLEUM
1376 IN
1298 METHANOL,
1203 ACETATE,
1202 ETHANOL,
1190 PAPER,
1188 PPE,
1176 CHLORIDE
1117 CONTAINING:
1109 XYLENE,
1056 AND
1043 CARBON
1006 KETONE,
1003 PLASTIC,
983 OF
978 MINERAL
974 OIL,
942 BENZENE
925 PAINT
923 PLASTIC
920 DEBRIS
899 RAGS,
891 OR
832 HYDROXIDE
811 GLYCOL
794 METHANOL
767 POTASSIUM
737 TITANIUM
728 SOLVENTS
727 DISTILLATES,
704 RESIN
686 METHYLENE
659 ***********************************
642 ISOPROPANOL,
623 CALCIUM
623 BE
617 MAY
611 XYLENE
607 HALOGENATED
604 SOLIDS
585 TOLUENE
584 DEBRIS:
565 ACETIC
557 LATEX
555 DIOXIDE
543 CLAY,
528 PEROXIDE
528 NOT
528 ACID,
497 ACETONITRILE
492 ETHER
487 PAPER
486 BASED
486 ALUMINUM
486 ACETATE
472 ***
471 PADS,
466 ETHYLENE
464 AMMONIUM
453 DIRT,
447 OXIDE
447 ORGANICS:
447 METAL
446 ETHANOL
442 
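
Comparing the two frequency tables by eye suggests that the metal names dominate the composition file but not the profile file. A small sketch of how that comparison could be made explicit is below. The `count_ratio` helper is hypothetical, the counts are copied from the two outputs above, and it is only an assumption that the two files describe comparable sets of profiles.

```python
def count_ratio(counts_a, counts_b, min_count=1):
    """Rank words shared by both tables by the ratio count_a / count_b."""
    rows = []
    for word, a in counts_a.items():
        b = counts_b.get(word, 0)
        if a >= min_count and b > 0:  # skip words missing from table b
            rows.append((float(a) / b, word, a, b))
    return sorted(rows, reverse=True)


# Counts copied from the composition and profile outputs above
composition = {"SODIUM": 16180, "POTASSIUM": 14407, "ALUMINUM": 14089,
               "WATER": 6470, "ACID": 3224}
profile = {"SODIUM": 2196, "POTASSIUM": 767, "ALUMINUM": 486,
           "WATER": 5695, "ACID": 2773}

rows = count_ratio(composition, profile)
for ratio, word, a, b in rows:
    print("%-10s %6d %6d %5.1f" % (word, a, b, ratio))
```

Words with a ratio near 1 (`WATER`, `ACID`) appear about equally often in both files, while the metals are many times more frequent in the composition data.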

In [5]:
# load the pipe-delimited profile data into a DataFrame and summarize its columns
profiles=pd.read_csv(base_dir+'Profile_Data_6-29-2017.txt',sep='|')
profiles.info()
#profiles['Matl Category Code'].value_counts()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13573 entries, 0 to 13572
Data columns (total 31 columns):
WPS                     13573 non-null int64
EPA Waste Name          13573 non-null object
Waste Material Name     13573 non-null object
Matl Category Code      13383 non-null object
Physical Description    13573 non-null object
RCRA                    13573 non-null object
DOT Haz Class           13573 non-null object
Lab Pack/Loospack/No    13573 non-null object
SWRC Qty Restriction    4611 non-null object
Low BTU                 13573 non-null int64
High BTU                13573 non-null int64
Low PH                  13573 non-null float64
High PH                 13573 non-null float64
High ASH                13573 non-null int64
total halogen-low       13478 non-null float64
total halogen-high      13478 non-null float64
Physical States         13573 non-null object
CSF                     13573 non-null object
CSF Storage Code        13573 non-null int64
Reactive           

In [15]:
# label counts - any label that is not purely numeric is counted as UNLABELED
label_counts=defaultdict(int)

with open(base_dir+'Profile_Data_6-29-2017.txt') as f:
    # skip the header row; field index 3 is the 'Matl Category Code' column
    for record in f.readlines()[1:]:
        label=record.split('|')[3].strip()
        if not label.isdigit():
            label='UNLABELED'
        label_counts[label]+=1

total_labeled_profiles=sum([v for (k,v) in label_counts.iteritems() if k!='UNLABELED'])
# count distinct labels without assuming an UNLABELED entry exists
n_labels=len([k for k in label_counts if k!='UNLABELED'])
print 'There are %d labeled profiles'%total_labeled_profiles
print 'There are %d labels\n'%n_labels

# print each label with its count and the cumulative fraction of labeled profiles
cumulative=0
print 'Labels:'
for count,label in sorted([(v,k) for (k,v) in label_counts.iteritems() if k!='UNLABELED'],reverse=True):
    cumulative+=count
    print label,count,float(cumulative)/total_labeled_profiles

There are 4084 labeled profiles
There are 138 labels

Labels:
7001 405 0.0991674828599
9502 378 0.191723800196
7700 296 0.264201762977
7401 165 0.304603330069
7010 156 0.342801175318
5501 112 0.370225269344
9505 109 0.396914789422
6501 105 0.422624877571
9401 98 0.446620959843
4001 97 0.470372184133
7006 92 0.492899118511
7003 91 0.515181194907
9501 89 0.536973555338
9201 79 0.556317335945
7008 77 0.575171400588
9050 72 0.592801175318
9009 67 0.609206660137
3001 63 0.624632713026
9008 59 0.639079333986
7103 55 0.652546523017
7007 52 0.6652791381
3002 50 0.677522037218
8003 44 0.688295788443
7503 43 0.698824681685
9006 41 0.708863858962
6301 41 0.718903036239
8201 38 0.728207639569
7012 38 0.737512242899
7501 37 0.746571988247
5201 34 0.754897159647
9007 33 0.762977473066
9552 31 0.770568070519
2103 30 0.77791380999
1101 29 0.785014691479
3004 28 0.791870714985
7504 27 0.798481880509
5502 27 0.805093046033
8002 25 0.811214495593
9799 24 0.817091087169
9011 24 0.822967678746
7005 24 0.82
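
The cumulative column shows how skewed the label distribution is. A short sketch of how to read off "how many labels cover a given share of the labeled profiles" is below. The `labels_to_cover` helper is hypothetical; the counts are the twelve largest from the output above and the total of 4084 is the reported number of labeled profiles.

```python
def labels_to_cover(counts, total, target):
    """Return how many of the largest labels reach `target` coverage.

    Returns None if the supplied counts never reach the target.
    """
    cumulative = 0
    for n, count in enumerate(sorted(counts, reverse=True), 1):
        cumulative += count
        if float(cumulative) / total >= target:
            return n
    return None


# Twelve largest label counts from the output above
top_counts = [405, 378, 296, 165, 156, 112, 109, 105, 98, 97, 92, 91]
total_labeled = 4084

print(labels_to_cover(top_counts, total_labeled, 0.5))  # prints 12
```

So 12 of the 138 labels account for just over half of the labeled profiles, which suggests that many of the remaining classes will have few training examples.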