## Group Assignment 5: Text-based Industry Classification using Doc2Vec

#### Group 1: Tara Bode and Hankun Li

### Assignment Specifics

Cluster the business sections (/blue/acg7849/share/BS) using Doc2Vec (50 clusters) in two ways:

- Using a counter as the ‘tag’ (as in 5.1.5)
- Using a counter as the ‘tag’, and the industry code as an additional tag: `yield TaggedDocument(words=file_tokens, tags=[i, SIC])`, where SIC is a string holding the tag (for example ‘1740’)

Extract the 4-digit SIC industry code from the annual report header (STANDARD INDUSTRIAL CLASSIFICATION).

Required: Evaluate whether adding the industry code as an additional tag improves the clustering. Use the standard deviation of profitability to evaluate this. (Firms that are more similar should have similar performance. Therefore, a better clustering should yield lower standard deviations within each cluster than a worse clustering.)

Do this for the filings for the year 2019 only. Calculate the standard deviation of performance for each cluster (use the year of CONFORMED END OF PERIOD, which is the first 4 digits of ‘date’ in summary.text).

For 50 clusters that means you will have 2 standard deviations for each cluster (one for each approach, with the extra SIC tag vs not adding the extra SIC tag). Use a t-test to test for a difference between the two sets of 50 standard deviations.

### Setup

In [None]:
pip install gensim

### Define Generator

In [None]:
import os, re, string

import gensim
import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stopWords = set(stopwords.words('english'))

# add curly quotes to string.punctuation
punc = string.punctuation + '“”'

# documents get tagged by an index (number), while filenames have different numbers
# keep track of this
fileIdToIndex = {} # given a fileId -> tag
indexToFileId=[] # given a tag -> fileId

class BusinessSection(object):
    def __init__(self, dirname, tokens_only = False):
        self.dirname = dirname
        self.tokens_only = tokens_only
 
    def __iter__(self):
        # sort the listing so tags are assigned in a reproducible order; only the first 200 files are used
        for i, fname in enumerate(sorted(os.listdir(self.dirname))[0:200]):
        #for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as f:
                content = f.read()

            # grab id from filename
            myCounter = int(re.findall(r'(\d+)\.txt', fname)[0])
            # record the mapping only once: the corpus is iterated many times
            # (vocabulary build, every training epoch, vector inference)
            if myCounter not in fileIdToIndex:
                fileIdToIndex[myCounter] = i
                indexToFileId.append(myCounter)
            #print('fname', fname, 'tag', myCounter)
            # keep alphabetic tokens that are neither stop words nor punctuation
            file_tokens = [x for x in word_tokenize(content) if x.isalpha() and x.lower() not in stopWords and x not in punc]

            if self.tokens_only:
                yield file_tokens
            else:
                yield TaggedDocument(words=file_tokens, tags=[i])

In [None]:
businessLists = BusinessSection(r'/blue/acg7849/tbode/BS/item1/') 

### Develop Doc2Vec Model

In [None]:
# create a model, build vocabulary
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(businessLists)

fileIdToIndex

indexToFileId[ 50  ]

# tokens that contain non-letters, such as '10-K', fail isalpha() and are dropped by the tokenizer
myText = '10-K'
print( myText.isalpha() )

# train it
model.train(businessLists, total_examples=model.corpus_count, epochs=model.epochs)

# HiPerGator
def tokenizeFile(file_id):
    # tokenize one business section the same way the training generator does
    with open(r'/blue/acg7849/tbode/BS/item1/' + str(file_id) + '.txt', encoding='utf-8') as f:
        content = f.read()
    return [x for x in word_tokenize(content) if x.isalpha() and x.lower() not in stopWords and x not in punc]

In [None]:
t = tokenizeFile(1)
model.infer_vector( t )

t = tokenizeFile(1)
inferred_vector = model.infer_vector( t )
# dv is short for docvecs
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
#sims = model.dv.most_similar([inferred_vector], topn=4)
sims

# the document with filename 1.txt is expected to be the first document, so its tag is 0 (check with fileIdToIndex[1])
similar_doc = model.dv.most_similar(0)
similar_doc

print('number of documents', model.corpus_count)
print('model.dv', len(model.dv))

In [None]:
# HiPerGator
# reread the files, and get the vector for each file
# feed vector into k-means algorithm to make clusters
businessLists = BusinessSection(r'/blue/acg7849/tbode/BS/item1/', tokens_only = True) # a memory-friendly iterator
vectors = [ model.infer_vector( w ) for w in businessLists]
len(vectors)

vectors[0]

### Cluster 1: Using a Counter as the Tag

In [None]:
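import nltk
from nltk.cluster import KMeansClusterer
# the assignment asks for 50 clusters
num_clusters = 50
# avoid_empty_clusters guards against clusters that end up with no documents
kclusterer = KMeansClusterer(num_clusters, distance=nltk.cluster.util.cosine_distance, repeats=25, avoid_empty_clusters=True)
assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)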

In [None]:
assigned_clusters[0:20]

In [None]:
import collections

print(collections.Counter(assigned_clusters))

### Cluster 2: Using a Counter as the Tag and the Industry Code as an Additional Tag (SIC)
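
A minimal sketch of how the tag list for this second approach might be built, assuming the SIC code for each file is already available as a string. The helper name `build_tags` is hypothetical and not part of gensim; its result would be passed as `tags=` to `TaggedDocument` in the generator defined earlier. Documents without a SIC code fall back to the counter alone.

```python
def build_tags(index, sic=None):
    """Return the Doc2Vec tags for one document: the counter, plus the SIC code if known."""
    tags = [index]
    if sic:
        # keep the SIC code as a string so it cannot be confused with the integer counters
        tags.append(str(sic))
    return tags


# illustrative calls with a made-up SIC code
print(build_tags(0, '1740'))  # [0, '1740']
print(build_tags(1))          # [1]
```

Each distinct SIC tag should receive its own document vector, so `len(model.dv)` is expected to exceed the number of documents once this is used. The clustering step is unaffected, because it works on vectors inferred per file.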

### Extract 4-digit SIC Industry Code
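
One possible way to pull the code out of a filing header, sketched under the assumption that the header contains a line of the form `STANDARD INDUSTRIAL CLASSIFICATION: <description> [1234]`, with the code in square brackets. The function name `extract_sic` and the sample header text are illustrative only.

```python
import re

# assumed header layout: "STANDARD INDUSTRIAL CLASSIFICATION:  SOME DESCRIPTION [1234]"
SIC_PATTERN = re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:[^\[\n]*\[(\d{4})\]")


def extract_sic(header_text):
    """Return the 4-digit SIC code as a string, or None if the header has no such line."""
    match = SIC_PATTERN.search(header_text)
    return match.group(1) if match else None


# made-up header fragment for illustration
sample_header = (
    "COMPANY CONFORMED NAME:\tEXAMPLE CORP\n"
    "STANDARD INDUSTRIAL CLASSIFICATION:\tSERVICES-PREPACKAGED SOFTWARE [7372]\n"
    "FISCAL YEAR END:\t1231\n"
)
print(extract_sic(sample_header))
```

Returning `None` for headers without the line lets the caller decide whether to skip the filing or tag it with the counter only.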

### Evaluation of Clustering Improvement with Use of Additional Tag: Measured by Standard Deviation of Profitability

### Calculate Standard Deviation of Performance for Each Cluster for 2019 Filings
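
A rough sketch of the per-cluster calculation, assuming three parallel lists are available: the cluster assigned to each filing, the filing year (the first 4 digits of ‘date’), and a profitability measure. The function name `cluster_std` and the numbers below are made up for illustration; which profitability measure to use is left open here.

```python
import statistics
from collections import defaultdict


def cluster_std(clusters, years, profitability, year=2019):
    """Sample standard deviation of profitability per cluster, for one filing year."""
    by_cluster = defaultdict(list)
    for cluster, filing_year, value in zip(clusters, years, profitability):
        # keep only filings from the requested year that have a profitability value
        if filing_year == year and value is not None:
            by_cluster[cluster].append(value)
    # a standard deviation needs at least two observations, so smaller clusters are skipped
    return {c: statistics.stdev(v) for c, v in sorted(by_cluster.items()) if len(v) >= 2}


# toy data: seven filings in three clusters, one of them from 2018
toy_clusters = [0, 0, 1, 1, 1, 2, 0]
toy_years = [2019, 2019, 2019, 2019, 2019, 2019, 2018]
toy_profit = [0.1, 0.3, 0.2, 0.2, 0.2, 0.5, 9.0]
print(cluster_std(toy_clusters, toy_years, toy_profit))
```

Running this once with the counter-only cluster assignments and once with the counter-plus-SIC assignments would give the two sets of standard deviations to compare.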

### Use T-test to Evaluate a Difference between 2 Sets of 50 Standard Deviations
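
A sketch of the comparison using only the standard library. It assumes an unpaired test with unequal variances (Welch's t-test), since cluster 7 under one approach does not correspond to cluster 7 under the other; a paired test would need a different formula. The function name `welch_t` and the toy inputs are illustrative. If SciPy is installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic and also reports a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    # squared standard error contributed by each sample
    se_a = statistics.variance(a) / na
    se_b = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df


# toy stand-ins for the two sets of per-cluster standard deviations
std_counter_only = [1.0, 2.0, 3.0, 4.0]
std_with_sic = [2.0, 3.0, 4.0, 5.0]
t_stat, dof = welch_t(std_counter_only, std_with_sic)
print(t_stat, dof)
```

A negative t statistic here would mean the first set (counter only) has the lower average standard deviation; the sign flips if the arguments are swapped.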