In [1]:
import numpy as np
import pandas as pd
import nltk
PATH = "../../classification_dataset.csv"

In [2]:
df = pd.read_csv(PATH)
print(df.shape)

(342781, 4)


#### Understand the data format
- There are 342781 rows and 4 columns.
- The text column is the model input.
- l1, l2 and l3 are the labels at three levels of a hierarchy, from coarsest (l1) to finest (l3).

In [3]:
df.head(4)

Unnamed: 0,text,l1,l2,l3
0,"Ronald \""Ron\"" D. Boire is an American busines...",Agent,Person,BusinessPerson
1,Astra 1KR is one of the Astra geostationary sa...,Place,Satellite,ArtificialSatellite
2,Cycleryon is an extinct genus of decapod crust...,Species,Animal,Crustacean
3,"Angela Maria of the Heart of Jesus, also calle...",Agent,Cleric,Saint


- The data is not preprocessed yet: the text still contains many stopwords and has not been tokenized or lemmatized.
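As a rough illustration of the missing preprocessing, here is a minimal sketch that lowercases, tokenizes and removes stopwords. The `preprocess` helper and the tiny `STOPWORDS` set are assumptions made for this example, not part of the notebook; a real pipeline would likely use a fuller stopword list and add lemmatization (for instance with nltk, which is imported above but not yet used).

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "and", "as", "for", "has", "he", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


# Second sentence of the first row, shortened for readability.
sample = "He has served as an executive for several companies."
tokens = preprocess(sample)
print(tokens)
```

Lemmatization is left out on purpose so that the sketch runs without downloading any language resources.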

In [4]:
print(df['text'].iloc[0])

Ronald \"Ron\" D. Boire is an American businessman. He has served as an executive for several companies, including Barnes & Noble, Brookstone, Sears Canada and Toys R Us.


In [5]:
print(df['text'].describe())

count                                                342781
unique                                               342781
top       The following 183 genera within the Dothideomy...
freq                                                      1
Name: text, dtype: object


In [6]:
# Number of whitespace-separated words in each text (vectorized, same result as a row-by-row loop)
df['textcount'] = df['text'].str.split().str.len()

In [7]:
df['textcount'].describe()

count    342781.000000
mean        105.525195
std          96.220057
min           2.000000
25%          40.000000
50%          74.000000
75%         138.000000
max         732.000000
Name: textcount, dtype: float64

- The number of words per text ranges from 2 to 732, with a median of 74. That is quite long as input for a deep learning model, so a preprocessing step that filters out some words or caps the length may be needed.
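One simple way to bound the input length is to keep only the first N words of each text. The sketch below is only one possible approach; `truncate_text`, `share_within` and the cap of 138 are assumptions chosen for illustration. The toy list holds the five summary values from the describe() output above (min, quartiles, max), not real samples.

```python
def truncate_text(text, max_len):
    """Keep only the first max_len whitespace-separated words."""
    return " ".join(text.split()[:max_len])


def share_within(lengths, max_len):
    """Fraction of texts whose word count is at most max_len."""
    return sum(n <= max_len for n in lengths) / len(lengths)


# Toy stand-in for df['textcount']: min, 25%, 50%, 75% and max from above.
toy_lengths = [2, 40, 74, 138, 732]
print(share_within(toy_lengths, 138))

print(truncate_text("He has served as an executive for several companies", 4))
```

On the real column, a cap at the 75th percentile (138 words) would leave about three quarters of the texts untouched and shorten the rest.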

- The first level has 9 classes: 'Agent', 'Device', 'Event', 'Place', 'Species', 'SportsSeason', 'TopicalConcept', 'UnitOfWork', 'Work'. Their frequencies differ widely.

In [8]:
print(df['l1'].describe())
print(np.unique(df['l1']))
print(df.groupby('l1').count())

count     342781
unique         9
top        Agent
freq      177341
Name: l1, dtype: object
['Agent' 'Device' 'Event' 'Place' 'Species' 'SportsSeason'
 'TopicalConcept' 'UnitOfWork' 'Work']
                  text      l2      l3  textcount
l1                                               
Agent           177341  177341  177341     177341
Device             353     353     353        353
Event            27059   27059   27059      27059
Place            65128   65128   65128      65128
Species          31149   31149   31149      31149
SportsSeason      8307    8307    8307       8307
TopicalConcept    1115    1115    1115       1115
UnitOfWork        2497    2497    2497       2497
Work             29832   29832   29832      29832
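To make the imbalance at level 1 concrete, the counts printed above can be turned into shares. This is a small sketch using only those printed numbers; the variable names are chosen for this example.

```python
# Level-1 class counts copied from the groupby output above.
l1_counts = {
    "Agent": 177341,
    "Device": 353,
    "Event": 27059,
    "Place": 65128,
    "Species": 31149,
    "SportsSeason": 8307,
    "TopicalConcept": 1115,
    "UnitOfWork": 2497,
    "Work": 29832,
}

total = sum(l1_counts.values())
shares = {label: count / total for label, count in l1_counts.items()}

# Print classes from most to least frequent with their share of all rows.
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:15s} {share:7.2%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(l1_counts.values()) / min(l1_counts.values())
print(round(imbalance_ratio, 1))
```

'Agent' alone covers a little over half of the rows, while 'Device' has only a few hundred examples, so plain accuracy would say little about the rare classes.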


- Each level-1 class has its own set of level-2 classes.
- Each level-2 class in turn has its own set of level-3 classes.
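The parent-child structure described above can be held in a nested mapping. The sketch below uses only the four label triples shown by df.head(4); `hierarchy` is a name chosen for this example, and on the full data the rows would come from the l1, l2 and l3 columns.

```python
from collections import defaultdict

# (l1, l2, l3) triples taken from the first four rows shown earlier.
rows = [
    ("Agent", "Person", "BusinessPerson"),
    ("Place", "Satellite", "ArtificialSatellite"),
    ("Species", "Animal", "Crustacean"),
    ("Agent", "Cleric", "Saint"),
]

# hierarchy[l1][l2] is the set of l3 labels seen under that l1/l2 pair.
hierarchy = defaultdict(lambda: defaultdict(set))
for l1, l2, l3 in rows:
    hierarchy[l1][l2].add(l3)

for l1, children in sorted(hierarchy.items()):
    for l2, leaves in sorted(children.items()):
        print(l1, ">", l2, ">", sorted(leaves))
```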

In [9]:
df1 = df[df['l1']=='Agent']
print(df1['l2'].describe())
print(df1.groupby('l2').count())

count      177341
unique         30
top       Athlete
freq        44163
Name: l2, dtype: object
                         text     l1     l3  textcount
l2                                                    
Actor                    1667   1667   1667       1667
Artist                   7091   7091   7091       7091
Athlete                 44163  44163  44163      44163
Boxer                     403    403    403        403
BritishRoyalty            685    685    685        685
Broadcaster              6549   6549   6549       6549
Cleric                   6420   6420   6420       6420
Coach                    2691   2691   2691       2691
ComicsCharacter           203    203    203        203
Company                 11777  11777  11777      11777
EducationalInstitution   6306   6306   6306       6306
FictionalCharacter       3062   3062   3062       3062
GridironFootballPlayer   2696   2696   2696       2696
Group                    2659   2659   2659       2659
MotorcycleRider         

In [10]:
df2 = df1[df1['l2']=='Athlete']
print(df2['l3'].describe())
print(df2.groupby('l3').count())

count          44163
unique            27
top       GolfPlayer
freq            2700
Name: l3, dtype: object
                               text    l1    l2  textcount
l3                                                        
AustralianRulesFootballPlayer  2691  2691  2691       2691
BadmintonPlayer                1289  1289  1289       1289
BaseballPlayer                 2652  2652  2652       2652
BasketballPlayer               2696  2696  2696       2696
Bodybuilder                     243   243   243        243
Canoeist                        410   410   410        410
ChessPlayer                    1302  1302  1302       1302
Cricketer                      2664  2664  2664       2664
Cyclist                        2697  2697  2697       2697
DartsPlayer                     525   525   525        525
GaelicGamesPlayer              2694  2694  2694       2694
GolfPlayer                     2700  2700  2700       2700
Gymnast                        2698  2698  2698       2698
Handbal

In [11]:
total = 0
total2 = 0
for l1label in np.unique(df['l1']):
    tmpdf1 = df[df['l1']==l1label]
    print(l1label,"l2: ",end="")
    tmplabels = np.unique(tmpdf1['l2'])
    tmpn = len(tmplabels)
    print(tmpn)
    print("  l3: ",end="")
    subtotal = 0
    for l2label in tmplabels:
        tmpdf2 = tmpdf1[tmpdf1['l2']==l2label]
        tmplabels2 = np.unique(tmpdf2['l3'])
        print(len(tmplabels2),end=" ")
        subtotal+=len(tmplabels2)
    print()
    print("  subtotal l3: ",subtotal)
    total2+=subtotal
    total+=tmpn
print("Total in l2: ",total)
print("Total in l3: ",total2)

Agent l2: 30
  l3: 2 5 27 1 1 3 4 1 1 8 3 2 1 1 1 1 5 1 19 7 1 2 2 5 1 8 1 5 1 3 
  subtotal l3:  123
Device l2: 1
  l3: 1 
  subtotal l3:  1
Event l2: 6
  l3: 2 1 2 5 4 4 
  subtotal l3:  18
Place l2: 16
  l3: 1 1 8 2 1 2 6 1 4 1 2 3 1 2 1 1 
  subtotal l3:  37
Species l2: 5
  l3: 8 1 1 1 6 
  subtotal l3:  17
SportsSeason l2: 2
  l3: 1 3 
  subtotal l3:  4
TopicalConcept l2: 1
  l3: 1 
  subtotal l3:  1
UnitOfWork l2: 1
  l3: 1 
  subtotal l3:  1
Work l2: 8
  l3: 2 2 1 5 3 1 1 2 
  subtotal l3:  17
Total in l2:  70
Total in l3:  219
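The same per-parent counts can probably be obtained more compactly with groupby and nunique. This is a sketch on a toy frame built from label names that appear in the outputs above; the frame and variable names are assumptions for illustration, and on the real data `toy` would be replaced by df.

```python
import pandas as pd

# Toy frame; every label name here appears somewhere in the outputs above.
toy = pd.DataFrame({
    "l1": ["Agent", "Agent", "Agent", "Place", "Species"],
    "l2": ["Athlete", "Athlete", "Cleric", "Satellite", "Animal"],
    "l3": ["GolfPlayer", "Gymnast", "Saint", "ArtificialSatellite", "Crustacean"],
})

# Number of distinct l2 labels under each l1 label.
l2_per_l1 = toy.groupby("l1")["l2"].nunique()

# Number of distinct l3 labels under each (l1, l2) pair.
l3_per_l2 = toy.groupby(["l1", "l2"])["l3"].nunique()

print(l2_per_l1)
print(l3_per_l2)
print("Total in l2: ", l2_per_l1.sum())
print("Total in l3: ", l3_per_l2.sum())
```

As with the nested loop, a label name that occurred under two different parents would be counted once per parent in these totals.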


- There are many labels at each level: level 1 has 9, level 2 has 70 and level 3 has 219. A cascade of flat classifiers would need one classifier to choose among the 9 level-1 labels, 9 classifiers (one per level-1 class) to separate the 70 level-2 labels, and 70 classifiers (one per level-2 class) to distinguish the 219 level-3 labels, 80 classifiers in total. This is repetitive and impractical.
- A sequence-to-sequence encoder-decoder network might be more practical, since it can decode the three labels one after another, each conditioned on the previous level.
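For the decoder idea, each row's labels would need to become a short target sequence. The sketch below shows one way that might look; the `<sos>` and `<eos>` markers, `build_vocab` and `encode_path` are hypothetical names for this example, and no model is defined here.

```python
# Hypothetical start/end markers for the decoder target sequence.
SPECIAL = ["<sos>", "<eos>"]


def build_vocab(label_paths):
    """Assign an integer id to the special tokens and to every label seen."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL)}
    for path in label_paths:
        for label in path:
            vocab.setdefault(label, len(vocab))
    return vocab


def encode_path(path, vocab):
    """Turn an (l1, l2, l3) triple into ids: <sos> l1 l2 l3 <eos>."""
    return [vocab["<sos>"]] + [vocab[label] for label in path] + [vocab["<eos>"]]


# Two label triples from the first rows of the dataset.
paths = [
    ("Agent", "Person", "BusinessPerson"),
    ("Agent", "Cleric", "Saint"),
]
vocab = build_vocab(paths)
target = encode_path(paths[1], vocab)
print(target)
```

In this sketch a label name that occurs at more than one level shares a single id, which may or may not be desirable; separate vocabularies per level would be an alternative.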