In [1]:
import sys
sys.path.insert(0, '..')
%load_ext autoreload
%autoreload 2
%aimport std_func

# Hide warnings
import warnings
warnings.filterwarnings("ignore")

ModuleNotFoundError: No module named 'pattern'

## Part-of-Speech (POS) Tagging - Cosine Similarity Analysis
Part-of-speech (POS) tagging assigns each word in a sentence a tag (label) that names its part of speech (e.g. noun, pronoun, verb, adjective, adverb), so that a text becomes a list of (word, tag) tuples. Asoka Diggs, a Data Scientist at Intel, reports in the source cited below that nouns work better than n-grams, so we use POS tagging to extract only the nouns. We also examined multi-gram nouns, but the results did not show a distinct difference between documents, which may be caused by overfitting. Here we therefore consider only 1-gram nouns and compute the cosine similarity measure on the word counts obtained from POS tagging.

Source: https://databricks.com/session/nouns-are-better-than-n-grams
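
As a minimal sketch of the noun-extraction step, the snippet below filters a list of (word, tag) tuples down to nouns. It assumes Penn Treebank style tags, where every noun tag starts with `NN`; the tagged sentence and the helper `keep_nouns` are hypothetical, and the tagger actually used to build `coDescription_pos` may differ.

```python
# Hypothetical pre-tagged sentence: (word, tag) tuples in Penn Treebank style.
# A tagger such as nltk.pos_tag returns a list in this shape.
tagged = [
    ("the", "DT"), ("company", "NN"), ("develops", "VBZ"),
    ("cloud", "NN"), ("databases", "NNS"), ("quickly", "RB"),
]

def keep_nouns(tagged_tokens):
    """Return the words whose tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

nouns = keep_nouns(tagged)
print(" ".join(nouns))
```

Joining the surviving words back into one string per document gives the kind of noun-only text that a count vectorizer can consume.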

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string

### Cosine Similarity Analysis

#### Word Counts from POS tagging
We select the words tagged as `Noun`, use `CountVectorizer` from `sklearn.feature_extraction.text` to count the term frequency of each 1-gram noun, and keep the top 600 nouns by frequency.
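
To make the "top nouns by frequency" selection concrete, here is a small standard-library sketch of the same idea: count every term across the corpus, keep the `TOP_K` most frequent, and build one count row per document over that vocabulary. The documents, the names `docs` and `TOP_K`, and the whitespace tokenization are illustrative assumptions; `CountVectorizer` applies its own tokenizer and lowercasing.

```python
from collections import Counter

# Hypothetical noun-only descriptions, one string per company.
docs = {
    "A": "software cloud software data",
    "B": "oil gas oil well oil oil",
    "C": "software data gas software data",
}
TOP_K = 3  # the notebook keeps 600

# Term frequency summed over the whole corpus.
total = Counter()
for text in docs.values():
    total.update(text.split())

# Keep the TOP_K most frequent terms, in alphabetical column order.
vocab = sorted(word for word, _ in total.most_common(TOP_K))

# One row of counts per document, restricted to the kept vocabulary.
counts = {
    name: [text.split().count(word) for word in vocab]
    for name, text in docs.items()
}

print(vocab)
for name, row in counts.items():
    print(name, row)
```

Terms outside the kept vocabulary (here `cloud`, `gas` and `well`) are simply dropped from every row.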

In [3]:
df = pd.read_csv('data/nouns_only.csv',
                 usecols = ['reportingDate', 'name', 'CIK', 'coDescription_lemmatized',
                           'coDescription_stopwords', 'coDescription_pos', 'SIC', 'SIC_desc'])
df = df.set_index(df.name)

In [4]:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

Vectorizer = CountVectorizer(ngram_range = (1,1), 
                             max_features = 600)

count_data = Vectorizer.fit_transform(df['coDescription_pos'])
wordsCount_pos_tag = pd.DataFrame(count_data.toarray(), columns=Vectorizer.get_feature_names_out())
wordsCount_pos_tag = wordsCount_pos_tag.set_index(df['name'])
wordsCount_pos_tag



name,ability,acces,accordance,account,accounting,acquisition,acre,acreage,act,action,...,voting,waste,water,way,website,week,weight,work,year,york
"MONGODB, INC.",4,2,0,0,0,2,0,0,2,0,...,0,0,0,6,5,0,0,6,9,2
SALESFORCE COM INC,0,3,0,1,0,3,0,0,0,2,...,0,0,0,3,3,0,0,2,4,0
SPLUNK INC,2,13,0,2,0,2,0,0,1,1,...,0,0,0,2,12,0,0,3,1,0
"OKTA, INC.",0,26,0,2,0,2,0,0,0,0,...,0,0,0,6,10,0,0,2,1,0
VEEVA SYSTEMS INC,46,12,21,34,68,34,0,0,48,21,...,9,0,0,4,5,0,2,4,226,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.",2,1,2,0,2,3,0,0,0,0,...,0,0,1,0,4,0,0,0,5,15
"CYCLACEL PHARMACEUTICALS, INC.",0,0,3,0,0,0,0,0,2,5,...,0,0,0,2,4,9,0,1,8,0
ZOETIS INC.,11,3,16,20,99,123,0,0,39,11,...,0,7,0,3,0,0,1,2,159,0
"STAG INDUSTRIAL, INC.",1,5,2,0,3,12,0,0,3,1,...,0,2,0,0,5,0,0,1,5,1


#### Cosine Similarity Computation on 1-Gram Nouns
To measure how similar the companies' business descriptions are, we compute the pairwise cosine similarity between the noun-only count vectors obtained from POS tagging.
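
For reference, the cosine similarity of two count vectors $u$ and $v$ is $u \cdot v / (\lVert u \rVert \, \lVert v \rVert)$. The sketch below computes it by hand on three hypothetical count rows; the `cosine` helper and the toy data are assumptions for illustration, and it presumes no row is all zeros.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical noun counts for three companies over a 3-word vocabulary.
counts = {"A": [1, 0, 2], "B": [0, 4, 0], "C": [2, 0, 2]}
names = list(counts)

# Pairwise similarity matrix: entry [i][j] compares company i with company j.
sim = [[cosine(counts[r], counts[c]) for c in names] for r in names]

for name, row in zip(names, sim):
    print(name, [round(x, 3) for x in row])
```

Because counts are non-negative, every entry lies between 0 and 1; the diagonal is 1 and the matrix is symmetric, which matches the shape of the output table in this section.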

In [6]:
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim_pos_tag = pd.DataFrame(cosine_similarity(wordsCount_pos_tag, wordsCount_pos_tag))
cosine_sim_pos_tag = cosine_sim_pos_tag.set_index(df['name'])
cosine_sim_pos_tag.columns = df['name']
cosine_sim_pos_tag

name,"MONGODB, INC.",SALESFORCE COM INC,SPLUNK INC,"OKTA, INC.",VEEVA SYSTEMS INC,AUTODESK INC,"INTERNATIONAL WESTERN PETROLEUM, INC.","DAYBREAK OIL & GAS, INC.","ETERNAL SPEECH, INC.","ETERNAL SPEECH, INC.",...,OMEGA HEALTHCARE INVESTORS INC,TABLEAU SOFTWARE INC,HORIZON PHARMA PLC,MERRIMACK PHARMACEUTICALS INC,"REVEN HOUSING REIT, INC.","AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.","CYCLACEL PHARMACEUTICALS, INC.",ZOETIS INC.,"STAG INDUSTRIAL, INC.",EQUINIX INC
"MONGODB, INC.",1.000000,0.721505,0.712640,0.768272,0.469290,0.406578,0.127653,0.141367,0.077973,0.077973,...,0.123737,0.601398,0.143517,0.236362,0.174593,0.134490,0.149307,0.168299,0.156941,0.499246
SALESFORCE COM INC,0.721505,1.000000,0.696715,0.696564,0.626009,0.450280,0.208104,0.205340,0.232427,0.232427,...,0.135028,0.609349,0.158332,0.213185,0.176697,0.179007,0.143488,0.208971,0.131269,0.680443
SPLUNK INC,0.712640,0.696715,1.000000,0.713106,0.501298,0.445346,0.161575,0.146188,0.106176,0.106176,...,0.112114,0.875391,0.140879,0.235275,0.159207,0.140506,0.167965,0.158579,0.137773,0.702077
"OKTA, INC.",0.768272,0.696564,0.713106,1.000000,0.471704,0.476666,0.157277,0.158698,0.111659,0.111659,...,0.113043,0.711460,0.158162,0.338571,0.171273,0.149249,0.168510,0.186875,0.149740,0.540787
VEEVA SYSTEMS INC,0.469290,0.626009,0.501298,0.471704,1.000000,0.358283,0.243000,0.287512,0.310724,0.310724,...,0.497539,0.472323,0.233799,0.262623,0.335537,0.386472,0.181338,0.630742,0.324082,0.488382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.",0.134490,0.179007,0.140506,0.149249,0.386472,0.121649,0.263172,0.288564,0.367285,0.367285,...,0.415974,0.136735,0.132879,0.116195,0.644117,1.000000,0.097014,0.402502,0.610487,0.124585
"CYCLACEL PHARMACEUTICALS, INC.",0.149307,0.143488,0.167965,0.168510,0.181338,0.196186,0.115146,0.125728,0.067980,0.067980,...,0.095797,0.222415,0.479660,0.660791,0.106048,0.097014,1.000000,0.139846,0.097144,0.128518
ZOETIS INC.,0.168299,0.208971,0.158579,0.186875,0.630742,0.224314,0.248238,0.232798,0.263805,0.263805,...,0.502115,0.217181,0.192793,0.233262,0.359660,0.402502,0.139846,1.000000,0.335454,0.165691
"STAG INDUSTRIAL, INC.",0.156941,0.131269,0.137773,0.149740,0.324082,0.106612,0.237217,0.310193,0.122410,0.122410,...,0.312032,0.139664,0.144205,0.123239,0.686805,0.610487,0.097144,0.335454,1.000000,0.106782


### Performance Evaluation
#### Predictions Based on the Closest Cosine Similarity Distance
We predict each company's SIC classification from its nearest neighbor in terms of cosine similarity, and use these predictions to evaluate the accuracy of the classification based on 1-gram noun counts from POS tagging.
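
The implementation of `std_func.get_accuracy` is not shown here, so the following is only a sketch of one plausible reading of the idea: each company is assigned the label of its most similar other company, and accuracy is the share of companies whose assigned label matches their true one. The function `nearest_neighbour_labels`, the similarity values and the labels are all hypothetical.

```python
def nearest_neighbour_labels(sim, labels):
    """For each row, return the label of the most similar *other* row."""
    preds = []
    for i, row in enumerate(sim):
        # Skip the diagonal, where every item has similarity 1 with itself.
        j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
        preds.append(labels[j])
    return preds

# Hypothetical symmetric cosine similarity matrix for five companies.
sim = [
    [1.00, 0.72, 0.13, 0.14, 0.40],
    [0.72, 1.00, 0.21, 0.20, 0.35],
    [0.13, 0.21, 1.00, 0.65, 0.30],
    [0.14, 0.20, 0.65, 1.00, 0.25],
    [0.40, 0.35, 0.30, 0.25, 1.00],
]
labels = ["Software", "Software", "Oil", "Oil", "Oil"]

preds = nearest_neighbour_labels(sim, labels)
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

In this toy data the fifth company is closest to a software company, so it is the one misclassified case.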

In [7]:
prediction, accuracy, cm = std_func.get_accuracy(cosine_sim_pos_tag, df)

NameError: name 'std_func' is not defined

In [8]:
cosine_sim_pos_tag_conf = std_func.conf_mat_cosine(cosine_sim_pos_tag, df)
cosine_sim_pos_tag_conf

NameError: name 'std_func' is not defined

From the confusion matrix above, cosine similarity analysis on the 1-gram noun counts from POS tagging gives an accuracy of 94% on average. For the industries `Crude Petroleum and Natural Gas`, `Real Estate Investment Trusts` and `State Commercial Banks (commercial banking)`, the accuracy is above 95%. `Prepackaged Software` has the lowest accuracy, at 86%. Because these accuracies are extremely high, we next look at the 2-D and 3-D plots to see whether the industries are well clustered.

### Plotting
#### Plotting on the Cosine Similarity Matrix
We use PCA to perform dimensionality reduction. First, we draw a 2-D plot of the cosine similarity matrix.

In [9]:
plot_cos_pos_tag = std_func.pca_visualize_2d(cosine_sim_pos_tag, df.loc[:,["name","SIC_desc"]])

NameError: name 'std_func' is not defined

Here we have a 3-D plot of the first three principal components, which capture the most variance.

In [10]:
std_func.pca_visualize_3d(plot_cos_pos_tag)

NameError: name 'std_func' is not defined

The 3-D plot above shows that three industries are not well clustered. `Pharmaceutical Preparations` and `State Commercial Banks (commercial banking)` appear more spread out than the others, while the other three industries, `Crude Petroleum and Natural Gas`, `Real Estate Investment Trusts` and `Prepackaged Software`, sit close to one another.

The explained variance ratio of each dimension of the PCA embedding of our cosine similarity matrix, generated from the 1-gram noun counts from POS tagging, is shown below:

In [11]:
plot_cos_pos_tag[0].explained_variance_ratio_

NameError: name 'plot_cos_pos_tag' is not defined

The total explained variance of the first three dimensions is:

In [12]:
plot_cos_pos_tag[0].explained_variance_ratio_[0:3].sum()

NameError: name 'plot_cos_pos_tag' is not defined

The first three dimensions explain 78% of the total variance in the data.
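
As a reminder of what this number measures: each component's explained variance ratio is its variance divided by the total variance over all components, and the figure for the first three dimensions is the sum of the first three ratios. The sketch below works through that arithmetic with hypothetical component variances, not the values from this notebook.

```python
# Hypothetical variances along each principal component, largest first.
component_variances = [5.0, 3.0, 1.0, 0.5, 0.5]

total_variance = sum(component_variances)

# Share of the total variance captured by each component.
ratios = [v / total_variance for v in component_variances]

# Share captured jointly by the first three components.
first_three = sum(ratios[:3])

print([round(r, 3) for r in ratios])
print(round(first_three, 3))
```

The ratios always sum to 1 across all components, so the remainder after the first three is the variance that a 3-D plot cannot show.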

### Conclusion Reporting

In [13]:
from sklearn.metrics import classification_report
print(classification_report(prediction["y_true"], prediction["y_pred"], target_names=df["SIC_desc"].unique()))

NameError: name 'prediction' is not defined

From the classification report above, we conclude that cosine similarity analysis on 1-gram noun counts from POS tagging gives good results for SIC classification, particularly for the industries `Crude Petroleum and Natural Gas`, `Real Estate Investment Trusts` and `State Commercial Banks (commercial banking)`.
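
For readers who want to check the per-industry figures by hand, the sketch below computes precision and recall for a single class, the two main quantities a classification report lists per label. The helper `precision_recall` and the toy labels are hypothetical and are not the notebook's predictions.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true and predicted industries for five companies.
y_true = ["Software", "Software", "Oil", "Oil", "Oil"]
y_pred = ["Software", "Software", "Oil", "Oil", "Software"]

for label in sorted(set(y_true)):
    p, r = precision_recall(y_true, y_pred, label)
    print(label, round(p, 3), round(r, 3))
```

A class can have perfect recall but lower precision, as `Software` does here, when companies from another industry are assigned to it.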