## Group Assignment 5: Text-based Industry Classification using Doc2Vec

#### Group 1: Tara Bode and Hankun Li
          

### Assignment Specifics 

Cluster the business sections (/blue/acg7849/share/BS) using Doc2Vec (50 clusters) in two ways:

- Using a counter as the ‘tag’ (as in 5.1.5)
- Using a counter as the ‘tag’, and the industry code as an additional tag (yield TaggedDocument(words=file_tokens, tags=[i, SIC]), where SIC is a string holding the tag, for example ‘1740’)
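
The two tagging schemes in the list above can be sketched as a single generator. This is a minimal illustration, not the notebook's implementation: the function name `tag_documents` and the `sic_codes` argument (a list of SIC strings aligned with the documents) are hypothetical, and the fallback `namedtuple` only exists so the sketch runs where gensim is not installed.

```python
try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback so the sketch runs without gensim; same two fields as gensim's class.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def tag_documents(token_lists, sic_codes=None):
    """Yield one TaggedDocument per token list.

    token_lists: iterable of token lists, one per business section.
    sic_codes:   optional list of SIC strings (hypothetical input), aligned
                 with token_lists; None or '' means no code was found.
    """
    for i, tokens in enumerate(token_lists):
        tags = [i]                       # approach 1: counter only
        if sic_codes is not None and sic_codes[i]:
            tags.append(sic_codes[i])    # approach 2: counter plus SIC string
        yield TaggedDocument(words=tokens, tags=tags)


# Toy input: two tiny "documents", the second without a SIC code.
docs = list(tag_documents([["restaurant", "sushi"], ["software"]], ["5812", None]))
for d in docs:
    print(d.tags, d.words)
```

Keeping the SIC code as a string, as the assignment asks, prevents it from being confused with the integer counter tags.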

Extract the 4-digit SIC industry code from the annual report header (STANDARD INDUSTRIAL CLASSIFICATION).
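
A minimal sketch of that extraction, assuming the header line has the form `STANDARD INDUSTRIAL CLASSIFICATION: <description> [NNNN]`. The helper name `extract_sic` and the sample header text are made up for illustration.

```python
import re

# Match the 4-digit code in square brackets on the classification line.
SIC_PATTERN = re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:.*?\[(\d{4})\]")


def extract_sic(filing_text):
    """Return the SIC code as a string, or None if the header has none."""
    match = SIC_PATTERN.search(filing_text)
    return match.group(1) if match else None


# Invented header fragment, used only to exercise the function.
sample_header = (
    "COMPANY CONFORMED NAME:\t\tEXAMPLE CORP\n"
    "STANDARD INDUSTRIAL CLASSIFICATION:\tSERVICES-PREPACKAGED SOFTWARE [7372]\n"
    "FISCAL YEAR END:\t\t1231\n"
)
print(extract_sic(sample_header))
print(extract_sic("header without a classification line"))
```

Returning `None` for a missing code makes it easy to skip the extra tag for filings whose header lacks the classification line.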

Required: Evaluate whether adding the industry code as an additional tag improves the clustering. Use the standard deviation of profitability to evaluate this: firms that are more similar should have similar performance, so a better clustering should yield lower within-cluster standard deviations than a worse clustering.
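
The evaluation idea can be illustrated without any dependencies. The sketch below uses invented performance numbers and cluster labels; `std_by_cluster` is a hypothetical helper, and it uses the sample standard deviation (n - 1 in the denominator), which is also the pandas default.

```python
from collections import defaultdict
from statistics import stdev


def std_by_cluster(performance, clusters):
    """Sample standard deviation of performance within each cluster.

    Clusters with a single member are skipped because their sample
    standard deviation is undefined.
    """
    groups = defaultdict(list)
    for perf, label in zip(performance, clusters):
        groups[label].append(perf)
    return {label: stdev(values) for label, values in groups.items() if len(values) > 1}


# Toy data: cluster 0 is homogeneous, cluster 1 is not.
performance = [0.10, 0.12, 0.11, -0.30, 0.40]
clusters = [0, 0, 0, 1, 1]
result = std_by_cluster(performance, clusters)
for label in sorted(result):
    print(label, round(result[label], 4))
```

In this toy example the tighter cluster has the much smaller standard deviation, which is the pattern a better clustering is expected to show on average.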

Do this for the filings for the year 2019 only. Calculate the standard deviation of performance for each cluster (use the year of CONFORMED END OF PERIOD, which is the first 4 digits of ‘date’ in summary.text).

For 50 clusters that means you will have 2 standard deviations for each cluster (one for each approach, with the extra SIC tag vs not adding the extra SIC tag). Use a t-test to test for a difference between the two sets of 50 standard deviations.
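
Under the paired reading used later in this notebook, where cluster $k$ of one run is compared with cluster $k$ of the other, the test statistic can be written as follows. The symbols $s_k^{(1)}$ (standard deviation of cluster $k$ without the SIC tag), $s_k^{(2)}$ (with the SIC tag) and $d_k$ are notation introduced here for illustration.

$$
d_k = s_k^{(1)} - s_k^{(2)}, \qquad
\bar d = \frac{1}{n}\sum_{k=1}^{n} d_k, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\left(d_k - \bar d\right)^2}, \qquad
t = \frac{\bar d}{s_d/\sqrt{n}}
$$

With $n = 50$ clusters the statistic has $n - 1 = 49$ degrees of freedom. Note that cluster labels from two separate k-means runs are arbitrary, so pairing by label is an assumption; an unpaired two-sample t-test on the two sets of 50 values is a defensible alternative.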

### Setup

In [1]:
# all imports
import os
import re
import string
import glob
import csv
import html
from pathlib import Path

import pandas as pd
import nltk
import gensim
from w3lib.html import replace_entities
from nltk import FreqDist
from nltk.collocations import *
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from scipy.cluster import hierarchy
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

stopWords = set(stopwords.words('english'))



### Define Generator

In [2]:
# add some punctuation to string.punctuation
punc = string.punctuation + '“”'

# documents get tagged by an index (number), while filenames have different numbers
# keep track of this
fileIdToIndex = {} # given a fileId -> tag
indexToFileId=[] # given a tag -> fileId

class MyFiles(object):
    def __init__(self, dirname, tokens_only = False):
        self.dirname = dirname
        self.tokens_only = tokens_only
 
    def __iter__(self):
        # os.listdir returns the file names in the directory (in arbitrary order);
        # the slice keeps all 4604 files and enumerate supplies the integer tag i
        for i, fname in enumerate(os.listdir(self.dirname)[0:4604]):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as f:
                content = f.read()
            # grab the numeric file id from the filename, e.g. 277350.txt -> 277350
            myCounter = int(re.findall(r'(\d+)\.txt', fname)[0])
            # record the mapping only once: the iterator is re-run for build_vocab
            # and for every training epoch, which would otherwise append duplicates
            if myCounter not in fileIdToIndex:
                fileIdToIndex[myCounter] = i
                indexToFileId.append(myCounter)
            #print('fname', fname, 'tag', myCounter)
            # keep alphabetic tokens that are neither stop words nor punctuation
            file_tokens = [x for x in word_tokenize(content) if x.isalpha() and x.lower() not in stopWords and x not in punc]

            if self.tokens_only:
                yield file_tokens
            else:
                yield TaggedDocument(words=file_tokens, tags=[i])

In [3]:
# Hipergator
# a memory-friendly iterator; dirname must be a str, bytes or os.PathLike object
ffiles = MyFiles(r'/blue/acg7849/hli1/item1')
#ffiles = MyFiles(r'/blue/acg7849/share/BS/item1/') # alternative: the shared directory

In [4]:
# create a model, build vocabulary
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=10)
model.build_vocab(ffiles) 

In [5]:
# train it
model.train(ffiles, total_examples=model.corpus_count, epochs=model.epochs)
print('FINISH')

FINISH


In [6]:
# Hipergator
def tokenizeFile(file_id):
    with open( r'/blue/acg7849/hli1/item1/'+str(file_id)+'.txt', encoding='utf-8') as f:
            content = f.read()
    return ([x for x in word_tokenize(content) if x.isalpha() and x.lower() not in stopWords and x not in string.punctuation] )

In [7]:
t = tokenizeFile(277350) # test file = 277350
#model.infer_vector( t )

In [8]:
t = tokenizeFile(277350)
inferred_vector = model.infer_vector( t )
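# dv is short for docvecs (the trained document vectors)
# rank every training document by cosine similarity to the inferred vector
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
#sims = model.dv.most_similar([inferred_vector], topn=4)
# each entry is (tag, similarity); show the five most similar documents
sims[0:5]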

[(3711, 0.9844865798950195),
 (2789, 0.6870694160461426),
 (3302, 0.6644330620765686),
 (3647, 0.652461588382721),
 (26, 0.6479239463806152)]
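
The second element of each tuple above is a cosine similarity between the inferred vector and a stored document vector. As a reminder of what that score measures, here is a small dependency-free sketch; the function name and the toy vectors are illustrative only.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, orthogonal, and opposite direction.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because the score depends only on direction and not on vector length, it pairs naturally with the cosine distance used for k-means later in the notebook.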

In [9]:
# tag 0 is the first file returned by the iterator
similar_doc = model.dv.most_similar(0)
similar_doc


[(4321, 0.7653406262397766),
 (1806, 0.7328099608421326),
 (2554, 0.727522075176239),
 (4225, 0.7245810031890869),
 (3588, 0.7169573307037354),
 (1656, 0.716842532157898),
 (4586, 0.7161878347396851),
 (3621, 0.7130987644195557),
 (324, 0.7120179533958435),
 (103, 0.7078105211257935)]

In [10]:
print('number of documents', model.corpus_count)
print('model.docvecs', len(model.dv))

number of documents 4604
model.docvecs 4604


In [11]:
# reread the files, and get the vector for each file
# feed vector into k-means algorithm to make clusters
wordLists = MyFiles(r'/blue/acg7849/hli1/item1', tokens_only = True) # a memory-friendly iterator
vectors = [ model.infer_vector( w ) for w in wordLists]
print(len(vectors))
vectors[0]

4604


array([ 0.13251027,  1.7713019 , -1.2070944 ,  0.5877362 ,  3.2593195 ,
        2.5406146 , -4.9172263 , -0.6247052 , -0.3348404 ,  0.6904445 ,
        3.0260167 ,  1.055489  , -2.2696917 , -2.0256064 , -1.3583243 ,
       -1.1576496 ,  0.74752647, -0.47880083, -2.6639798 , -2.8420856 ,
       -0.9892699 ,  2.11443   , -2.0620131 ,  0.5325125 ,  1.0371944 ,
       -0.02256046, -5.4824047 ,  0.52058345, -0.01334684, -0.41951287,
        1.436628  ,  0.77188253, -0.9873223 , -0.2342852 , -1.8224778 ,
       -0.83203405,  0.36926478, -0.44015148,  1.9534036 ,  1.2269791 ,
       -1.4412913 , -1.4802396 ,  2.219955  ,  1.4758697 ,  1.6850762 ,
       -4.326427  , -0.48489276, -1.3180994 ,  1.9214945 ,  0.01527684,
        0.09991759,  1.133366  , -3.1863654 , -0.4483956 , -1.5857337 ,
       -1.8974037 ,  1.2719195 ,  2.1387637 ,  0.3302084 ,  1.0655164 ,
        0.43528807,  2.2369564 , -0.1382547 , -0.45128253,  0.6500001 ,
       -0.6549607 ,  0.8065622 , -0.05199064, -2.0942473 , -4.12

### Cluster 1: Using a Counter as the Tag

In [12]:
from nltk.cluster import KMeansClusterer
num_clusters = 50
kclusterer = KMeansClusterer(num_clusters, distance=nltk.cluster.util.cosine_distance, repeats=25)

# assigned_clusters[i] is the cluster label of the i-th vector (file)
assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)
assigned_clusters[0:10]

[36, 21, 43, 24, 1, 14, 5, 29, 43, 9]

In [13]:
import collections
print(collections.Counter(assigned_clusters))
# give length of each cluster

Counter({33: 311, 12: 244, 24: 218, 37: 155, 45: 151, 27: 139, 42: 127, 23: 121, 46: 117, 40: 114, 49: 114, 41: 114, 29: 110, 34: 106, 7: 105, 36: 103, 4: 99, 9: 98, 3: 98, 22: 94, 5: 93, 39: 86, 20: 85, 35: 83, 0: 79, 13: 75, 43: 74, 28: 74, 15: 72, 11: 71, 21: 69, 44: 69, 48: 66, 30: 65, 31: 64, 38: 64, 19: 62, 47: 61, 16: 60, 6: 60, 1: 57, 18: 55, 2: 52, 17: 51, 32: 49, 26: 43, 14: 41, 25: 37, 8: 26, 10: 23})


### Add Performance and Compute Its Standard Deviation for Each Cluster

In [15]:
pfm = pd.read_csv (r'performance.csv')  
pfm = pfm.loc[pfm["year"] == 2019]
#backup new csv
#pfm.to_csv('pfm19.csv', index = False)

In [16]:
# need: left join (asg5 csv, pfm19)
asg5 = pd.read_csv (r'ASG5.csv') 
pfm = pd.read_csv (r'pfm19.csv') 
pfm = pfm.rename(columns={'cik': 'CIK'})
pfm[0:5]

Unnamed: 0,CIK,performance,year
0,1750,0.011929,2019
1,6201,0.028102,2019
2,3197,0.043332,2019
3,1230869,0.319006,2019
4,764622,0.029131,2019


In [32]:
table1 = asg5.join(pfm.set_index('CIK'), on='CIK')
#df.join(other.set_index('key'), on='key')

In [33]:
# drop the last column (year, which is 2019 for every row)
table1 = table1.iloc[:, :-1]

In [None]:
# need to drop all NA on performance later
#table1.to_csv('table1.csv', index = False)

### Display of Clusters

In [None]:
#table1_test = table1[0:200]

In [None]:
#table1_test.to_csv('table1_test.csv', index = False)

In [31]:
len(assigned_clusters)

4604

In [34]:
# assumes the rows of ASG5.csv are in the same order as the files were iterated
table1['clstr'] = assigned_clusters

In [35]:
# drop rows with any missing value, e.g. filings with no matching performance record
table1 = table1.dropna()
len(table1)

3753

In [36]:
table1[0:5]

Unnamed: 0,CIK,coName,formtype,date,fName,length,full_link,performance,clstr
0,1345016,YELP INC,10-K,2019-12-31,280603.txt,152323,/blue/acg7849/share/BS/item1/280603.txt,0.038182,36
1,1772177,"KURA SUSHI USA, INC.",10-K,2019-08-31,275323.txt,38468,/blue/acg7849/share/BS/item1/275323.txt,0.019055,21
2,1341766,"Celsius Holdings, Inc.",10-K,2019-12-31,280595.txt,132197,/blue/acg7849/share/BS/item1/280595.txt,0.110321,43
3,1679826,Ping Identity Holding Corp.,10-K,2019-12-31,283125.txt,100202,/blue/acg7849/share/BS/item1/283125.txt,-0.001723,24
4,1650445,Quorum Health Corp,10-K,2019-12-31,282859.txt,146425,/blue/acg7849/share/BS/item1/282859.txt,-0.109879,1


In [37]:
table1.to_csv('table1.csv', index = False)

In [39]:
# standard deviation of performance within each cluster, indexed by cluster label
# (select the column before calling std so non-numeric columns are not aggregated)
table1_sd = table1.groupby('clstr')['performance'].std()
#print(table1_sd)
table1_sd.to_csv('table1_sd.csv', index = True)

### Extract 4-digit SIC Industry Code

In [1]:
# Install packages (w3lib needs to be installed first)
!pip install w3lib

import html, re
from w3lib.html import replace_entities

import csv
import os
import pandas as pd
import glob

Defaulting to user installation because normal site-packages is not writeable


In [2]:
mydirectory = '/blue/acg7849/share/10Ks'
file_list = glob.glob(mydirectory + '/*.txt')
#print(file_list)

In [3]:
SIC_list = []
fileName_list = []

# the code is the 4-digit number in square brackets on the classification line
SICregex = re.compile(r'CLASSIFICATION:.*?\[(\d{4})')

for file_path in file_list:
    # read the file
    with open(file_path) as f:
        filing = f.read()

    # keep the first match as a string (None when the header has no SIC code)
    match = SICregex.search(filing)
    SIC_list.append(match.group(1) if match else None)

    # file name without the directory part
    fileName_list.append(os.path.basename(file_path))

# build the table once, after the loop
DF = pd.DataFrame(list(zip(fileName_list, SIC_list)), columns = ['FileName','Standard Industrial Classification'])

In [4]:
#DF
DF.to_csv('SIC.csv', index = False)

### Use T-test to Evaluate Difference between 2 Sets of 50 Standard Deviations

In [None]:
from math import sqrt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from scipy.stats import t

# function for calculating the t-test for two dependent (paired) samples
# data1: standard deviations per cluster without the SIC tag (table1_sd)
# data2: the same for the run that includes the SIC tag
def dependent_ttest(data1, data2, alpha):
    # calculate means
    mean1, mean2 = mean(data1), mean(data2)
    # number of paired samples
    n = len(data1)
    # sum squared difference between observations
    d1 = sum([(data1[i]-data2[i])**2 for i in range(n)])
    # sum difference between observations
    d2 = sum([data1[i]-data2[i] for i in range(n)])
    # standard deviation of the paired differences
    sd = sqrt((d1 - (d2**2 / n)) / (n - 1))
    # standard error of the mean difference
    sed = sd / sqrt(n)
    # calculate the t statistic
    t_stat = (mean1 - mean2) / sed
    # degrees of freedom
    df = n - 1
    # two-sided critical value, consistent with the two-sided p-value below
    cv = t.ppf(1.0 - alpha / 2.0, df)
    # two-sided p-value
    p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0
    # return everything
    return t_stat, df, cv, p

# seed the random number generator
seed(1)
# placeholder data: two independent random samples standing in for the two sets of
# standard deviations (separate names, so the real table1_sd is not overwritten)
demo1 = 5 * randn(100) + 50
demo2 = 5 * randn(100) + 51
# calculate the t test
alpha = 0.05
t_stat, df, cv, p = dependent_ttest(demo1, demo2, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))
# interpret via critical value
if abs(t_stat) <= cv:
    print('Fail to reject the null hypothesis that the means are equal.')
else:
    print('Reject the null hypothesis that the means are equal.')
# interpret via p-value
if p > alpha:
    print('Fail to reject the null hypothesis that the means are equal.')
else:
    print('Reject the null hypothesis that the means are equal.')