## Group Assignment 5: Text-based Industry Classification using Doc2Vec

#### Group 1: Tara Bode and Hankun Li

### Assignment Specifics

Cluster the business sections (/blue/acg7849/share/BS) using Doc2Vec (50 clusters) in two ways:

- Using a counter as the ‘tag’ (as in 5.1.5)
- Using a counter as the ‘tag’, and the industry code as an additional tag (yield TaggedDocument(words=file_tokens, tags=[i, SIC]), where SIC is a string holding the tag, for example ‘1740’)

Extract the 4-digit SIC industry code from the annual report header (STANDARD INDUSTRIAL CLASSIFICATION).

Required: Evaluate whether adding the industry code as an additional tag improves the clustering. Use the standard deviation of profitability to evaluate this. (Firms that are more similar should have similar performance. Therefore, a better clustering should yield lower standard deviations within each cluster than a worse clustering.)

Do this for the filings for the year 2019 only. Calculate the standard deviation of performance for each cluster (use the year of CONFORMED END OF PERIOD, which is the first 4 digits of ‘date’ in summary.txt).

For 50 clusters that means you will have 2 standard deviations for each cluster (one for each approach, with the extra SIC tag vs not adding the extra SIC tag). Use a t-test to test for a difference between the two sets of 50 standard deviations.

### Alternative Method

In [1]:
# init summary file
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster import hierarchy

with open('/blue/acg7849/share/BS/summary.txt') as f:
    files = [dict(row) for row in csv.DictReader(f, skipinitialspace=True, delimiter="|")]

In [2]:
# files with length > 1000
files = [f for f in files if int(f["length"])> 1000]

In [3]:
# generator function that returns one file at a time (just a string)
# note that fit_transform expects one string for each file, so do not tokenize it
# this would be different for doc2vec, which expects a taggeddocument element
def readBSGen():
    for f in files[0:500]: # first 500 
        with open ( '/blue/acg7849/share/BS/item1/{}'.format(f['filename']) , encoding='utf-8') as b:
            BS = b.read()
        yield BS

In [4]:
# set up vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# read documents using generator
tfidf = vectorizer.fit_transform( readBSGen(  ) )

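# convert to a dense array (toarray returns a plain ndarray rather than np.matrix)
tfidf = tfidf.toarray()

# clustering
threshold = 0.9

# Z is the linkage matrix from average-linkage clustering on cosine distances
Z = hierarchy.linkage(tfidf, "average", metric="cosine")

# C holds the cluster label assigned to each document
C = hierarchy.fcluster(Z, threshold, criterion="distance")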

In [5]:
max(C)

56

### Setup

In [6]:
pip install gensim

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### Define Generator

In [7]:
import gensim
import os, string, re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english') )

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# add some punctuation to string.punctuation
punc = string.punctuation + '“”'

# documents get tagged by an index (number), while filenames have different numbers
# keep track of this
fileIdToIndex = {} # given a fileId -> tag
indexToFileId=[] # given a tag -> fileId

class BusinessSection(object):
    def __init__(self, dirname, tokens_only = False):
        self.dirname = dirname
        self.tokens_only = tokens_only
 
    def __iter__(self):
        for i, fname in enumerate(os.listdir(self.dirname)[0:200]):
        #for fname in os.listdir(self.dirname):
            with open( os.path.join(self.dirname, fname), encoding='utf-8') as f:
                content = f.read()
            
            # grab id from filename
            myCounter = int(re.findall(r'(\d+)\.txt', fname)[0])
            # record the mapping only once; __iter__ runs again on every training pass
            if myCounter not in fileIdToIndex:
                fileIdToIndex[myCounter] = i
                indexToFileId.append(myCounter)
            #print('fname', fname, 'tag', myCounter)
            file_tokens = [x for x in word_tokenize(content) if x.isalpha() and x.lower() not in stopWords and x not in punc]

            if self.tokens_only:
                yield file_tokens
            else:
                yield TaggedDocument(words=file_tokens, tags=[i])

[nltk_data] Downloading package stopwords to /home/tbode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/tbode/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
businessLists = BusinessSection(r'/blue/acg7849/share/BS/item1/') 

### Develop Doc2Vec Model

In [9]:
# create a model, build vocabulary
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(businessLists)

fileIdToIndex

indexToFileId[ 50  ]

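# isalpha() is False for tokens containing digits or hyphens,
# so a token such as '10-K' is dropped by the filter in the generator
myText = '10-K'
print( myText.isalpha() )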

# train it
model.train(businessLists, total_examples=model.corpus_count, epochs=model.epochs)

# Hipergator
def tokenizeFile(file_id):
    """Read one business section by file id and return its filtered tokens."""
    with open(r'/blue/acg7849/share/BS/item1/' + str(file_id) + '.txt', encoding='utf-8') as f:
        content = f.read()
    return [x for x in word_tokenize(content) if x.isalpha() and x.lower() not in stopWords and x not in punc]

False


In [10]:
# file ids are not contiguous (there is no 1.txt in this directory, so tokenizeFile(1)
# raises FileNotFoundError); use the id of the first document that was actually read
t = tokenizeFile(indexToFileId[0])
inferred_vector = model.infer_vector( t )
# dv is short for docvecs
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
#sims = model.dv.most_similar([inferred_vector], topn=4)
sims

# the first document read has tag 0
similar_doc = model.dv.most_similar(0)
similar_doc

print('number of documents', model.corpus_count)
print('len(model.dv)', len(model.dv))

In [11]:
# Hipergator
# reread the files, and get the vector for each file
# feed vector into k-means algorithm to make clusters
businessLists = BusinessSection(r'/blue/acg7849/share/BS/item1/', tokens_only = True) # a memory-friendly iterator
vectors = [ model.infer_vector( w ) for w in businessLists]
len(vectors)

vectors[0]

array([ 0.8111465 , -6.8363056 , -1.1089574 , -1.7763877 ,  6.0939803 ,
       -5.2874627 ,  1.559156  , -5.5606747 , -0.59152114,  1.7936459 ,
        3.2701406 ,  8.686971  , -3.6097493 , -1.1391431 , -7.6438556 ,
        3.1278439 ,  3.0566885 , -3.993789  , -0.40114927, -2.758942  ,
       -4.3644915 , -1.2638062 , -1.3739427 ,  4.263918  ,  6.5339866 ,
        3.9380407 ,  5.292408  , -1.8500096 , -3.8596354 , -4.229199  ,
        5.1704273 ,  0.92813313, -2.249856  ,  1.4273007 , -5.2768483 ,
        0.46173948,  0.44336426, -2.4169915 , -4.8872075 ,  4.8238635 ,
       -1.9600277 , -0.6680041 , -2.4193633 , -2.2200668 , -3.2014844 ,
       -0.19226287, -0.9528848 , -1.4524802 ,  3.9892545 , -3.86931   ,
        2.629111  , -1.840102  , -6.750227  ,  4.595751  ,  5.262988  ,
        5.5831966 , -0.6511026 ,  1.1891615 ,  5.7411895 ,  8.221103  ,
        2.5684223 ,  1.055465  , -2.0501633 , -4.6277733 , -1.8617773 ,
        2.5733612 ,  1.2462313 ,  5.3163466 ,  2.436893  ,  3.06

### Cluster 1: Using a Counter as the Tag

In [12]:
import nltk
from nltk.cluster import KMeansClusterer
# 10 clusters for this 200-document test run; the assignment calls for 50
num_clusters = 10
kclusterer = KMeansClusterer(num_clusters, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)

In [13]:
assigned_clusters[0:20]

[6, 5, 7, 5, 5, 5, 8, 6, 5, 6, 8, 0, 2, 8, 2, 7, 8, 1, 8, 7]

In [14]:
import collections

print(collections.Counter(assigned_clusters))

Counter({8: 52, 9: 24, 5: 23, 7: 18, 6: 17, 2: 17, 3: 16, 0: 11, 1: 11, 4: 11})


### Cluster 2: Using a Counter as the Tag and the Industry Code as an Additional Tag (SIC)
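
The second approach differs from the first only in the tag list passed to `TaggedDocument`. As a minimal sketch of that one step, the helper below builds the tag list for a document. The name `build_tags` and the sample codes are hypothetical, and the block is kept free of gensim so it runs on its own; in the generator above, the result would replace `tags=[i]`.

```python
def build_tags(index, sic=None):
    """Return the tag list for one document: the counter, plus the SIC code if known."""
    tags = [index]
    if sic:  # skip missing or empty codes
        tags.append(str(sic))
    return tags

# Hypothetical example: three documents, the second has no SIC code in its header
sics = ["1740", None, "7372"]
all_tags = [build_tags(i, sic) for i, sic in enumerate(sics)]
print(all_tags)  # [[0, '1740'], [1], [2, '7372']]
```

In the generator this would read `yield TaggedDocument(words=file_tokens, tags=build_tags(i, SIC))`. One thing to check against the gensim documentation: string tags appear to get their own vectors in `model.dv`, so `len(model.dv)` may exceed the number of documents once SIC tags are added.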

### Extract 4-digit SIC Industry Code
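
A possible way to pull the code out of the filing header is a regular expression on the STANDARD INDUSTRIAL CLASSIFICATION line. The sketch below assumes the header has already been read into a string and that the code appears in square brackets at the end of that line; the function name `extract_sic` and the sample header are illustrative, not taken from the course files.

```python
import re

# Matches a line such as:
# STANDARD INDUSTRIAL CLASSIFICATION:  SERVICES-PREPACKAGED SOFTWARE [7372]
SIC_PATTERN = re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:[^\n]*?\[(\d{4})\]")

def extract_sic(header_text):
    """Return the 4-digit SIC code as a string, or None if the header has none."""
    match = SIC_PATTERN.search(header_text)
    return match.group(1) if match else None

# Made-up header fragment for illustration
sample_header = (
    "COMPANY CONFORMED NAME:\t\tEXAMPLE CORP\n"
    "STANDARD INDUSTRIAL CLASSIFICATION:\tSERVICES-PREPACKAGED SOFTWARE [7372]\n"
    "FISCAL YEAR END:\t\t\t1231\n"
)
print(extract_sic(sample_header))
```

Returning `None` for a missing code lets the caller decide whether to drop the filing or tag it with the counter only.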

### Evaluation of Clustering Improvement with Use of Additional Tag: Measured by Standard Deviation of Profitability

### Calculate Standard Deviation of Performance for Each Cluster for 2019 Filings
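
As a rough sketch of this step, the function below keeps only rows whose `date` starts with the target year, groups them by cluster, and returns the sample standard deviation of performance per cluster. The column name `roa`, the `cluster` key, and the YYYYMMDD date layout are assumptions for illustration; the real profitability measure and field names would need to be substituted.

```python
from collections import defaultdict
from statistics import stdev

def cluster_stdevs(rows, year="2019", perf_key="roa"):
    """Sample standard deviation of performance per cluster, for one fiscal year.

    rows: dicts with 'date' (year in the first 4 characters), 'cluster',
    and a performance field. perf_key is a hypothetical column name.
    """
    groups = defaultdict(list)
    for row in rows:
        if row["date"][:4] == year and row.get(perf_key) is not None:
            groups[row["cluster"]].append(float(row[perf_key]))
    # stdev needs at least two observations, so singleton clusters are skipped
    return {c: stdev(v) for c, v in sorted(groups.items()) if len(v) >= 2}

# Made-up rows for illustration only
rows = [
    {"date": "20191231", "cluster": 0, "roa": 0.10},
    {"date": "20191231", "cluster": 0, "roa": 0.12},
    {"date": "20190630", "cluster": 1, "roa": -0.05},
    {"date": "20191231", "cluster": 1, "roa": 0.00},
    {"date": "20190930", "cluster": 1, "roa": 0.05},
    {"date": "20181231", "cluster": 1, "roa": 0.90},  # wrong year, ignored
    {"date": "20191231", "cluster": 2, "roa": 0.30},  # only member of cluster 2, skipped
]
sd_by_cluster = cluster_stdevs(rows)
print(sd_by_cluster)
```

Running this once per approach (with and without the SIC tag) would give the two sets of standard deviations that the assignment asks to compare.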

### Use T-test to Evaluate a Difference between 2 Sets of 50 Standard Deviations
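
The assignment does not say which t-test variant to use. Because cluster labels are arbitrary (cluster 3 under one approach need not correspond to cluster 3 under the other), the sketch below assumes an unpaired Welch test and computes the statistic and degrees of freedom with the standard library only. The function name `welch_t` and the sample numbers are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up numbers standing in for the two sets of cluster standard deviations
sd_counter_only = [1.0, 2.0, 3.0, 4.0, 5.0]
sd_with_sic = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(sd_counter_only, sd_with_sic)
print(t_stat, dof)
```

A negative statistic here would mean the first set has the lower average standard deviation. For a p-value, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test.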