# Culture Measures Based on Company Reviews

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk, re, math, csv
# nltk.download('wordnet')
# nlkt.download('punkt')

import koolture as kt

from string import punctuation
from functools import partial
import concurrent.futures as cf
from collections import defaultdict

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

%load_ext autoreload
%autoreload 2

Load your dataset.

In [2]:
df = pd.read_csv('../data/clean_gs.csv')
df.head()

Unnamed: 0,employer,id,pros,cons
0,American Express,44001,Still not big enough in market place,"Great brand , Good leadership , Clear business..."
1,Eventum IT Solutions,44004,Nothing important on my point of view.,"Learn new technologies, helpful people, good m..."
2,Eventum IT Solutions,44004,Alot of friends working together which isn't v...,Very good opportunities to learn technologies
3,Eventum IT Solutions,44004,Working hours are not good and need to add the...,You can learn technically a lot in this company.
4,Eventum IT Solutions,44004,No Real Cons at all,- Very friendly environment.\r\n- Highly exper...


In [3]:
df.shape

(78420, 4)

The following function will remove the company names from their respective reviews.

Function to get the root of the word. You can get all three (lemma, stem, and snow) or use them separately with the partial functions below.

The following function helps with the preprocessing of the data. It runs after the lemmatizer, stemmer, snowball, etc. If you want to include stopwords and take them out at a later stage, uncomment the first `filtered_tokens` below and comment out the second one.

Functions to get the words forming top topics and to run the LDA models.

In [4]:
our_range = 2, 10, 50, 100, 150, 200, 250, 300

Create an array with the unique employers in the dataset.

Remove the company names from the reviews, and extract the reviews into a numpy array.

In [5]:
comps_of_interest = df.employer.value_counts()
comps_of_interest.head(8)

Amazon               561
Oracle               422
Microsoft            349
Siemens              343
Dell Technologies    338
IBM                  337
EY                   324
PwC                  300
Name: employer, dtype: int64

In [6]:
#comps_of_interest = (comps_of_interest).index
comps_of_interest = (comps_of_interest[(comps_of_interest == 28)]).index
len(comps_of_interest), comps_of_interest

(14,
 Index(['Tata Consultancy Services', 'Zendesk', 'Vestas Wind Systems',
        'YoungCapital', 'Bombardier', 'Nanyang Technological University',
        'Huawei Technologies', 'Westpac Group', 'Boston Scientific',
        'Bank of America', 'Tech Data', 'SITA', 'Fiserv', 'HARMAN'],
       dtype='object'))

In [7]:
cond2 = df['employer'].isin(comps_of_interest) # create the condition
df_interest = df[cond2].copy() # get the new dataset
unique_ids = df_interest['employer'].unique() # get the unique IDs or unique employers in the dataset

In [8]:
df_interest.shape

(392, 4)

In [9]:
%%time

df = kt.comp_name_out(df, 'employer', 'pros', comps_of_interest)
data_pros = df_interest['pros'].values

CPU times: user 113 ms, sys: 2.52 ms, total: 116 ms
Wall time: 117 ms


The text preprocessing of the corpus takes place in parallel. You first normalize the reviews and then take the root of the words.

In [10]:
%%time

with cf.ProcessPoolExecutor() as e:
    data_pros_cleaned = e.map(kt.normalize_doc, data_pros)
    data_pros_cleaned = list(e.map(kt.root_of_word, data_pros_cleaned))

df_interest['pros_clean'] = data_pros_cleaned

CPU times: user 200 ms, sys: 72 ms, total: 272 ms
Wall time: 519 ms


Here you create an array with all of the companies and the amount of reviews they have. So far, only companies with at least 2 reviews make it to the modeling stage.

Select only the employers that meet the condition above by creating a boolean with True for yes and False for no.

The following loop will create sparse matrices for all companies and return a list of tuples with the name of the company, its sparse matrix, and the fitted vectorizer.

In [11]:
%%time

vectorizers_dicts = kt.get_vectorizers(data=df_interest, unique_ids=unique_ids,
                                      company_col='employer', reviews_col='pros', 
                                      vrizer=CountVectorizer())

CPU times: user 22 ms, sys: 1.88 ms, total: 23.8 ms
Wall time: 23.6 ms


Calculate the total words in the dictionary of review words, and get the percentage of words in the final dictionary that can be found in the full corpus.

The following block run the models in parallel over the range of topics specified in our_range variable and return a dictionary with the output of the get_models function for each company. It is used to identify the interval to search further for optimal topic number.

In [12]:
%%time

partial_func = partial(kt.get_models, topics=our_range, vrizer_dicts=vectorizers_dicts, unique_ids=unique_ids)

with cf.ProcessPoolExecutor() as e:
    output = list(e.map(partial_func, unique_ids))

CPU times: user 231 ms, sys: 269 ms, total: 500 ms
Wall time: 1min 41s


The next block of code will now iterate over the dictionary output from above, add each dataset into a list, and then concatenate them all into one dataset (output df contains exactly same information, but more readable, and used in next blocks).

In [13]:
output_df = kt.build_dataframe(output)
output_df.head()

Unnamed: 0,company,topics,coherence,models
0,SITA,2,0.122767,LatentDirichletAllocation(learning_method='onl...
1,SITA,10,0.234603,LatentDirichletAllocation(learning_method='onl...
2,SITA,50,0.20901,LatentDirichletAllocation(learning_method='onl...
3,SITA,100,0.145389,LatentDirichletAllocation(learning_method='onl...
4,SITA,150,0.106866,LatentDirichletAllocation(learning_method='onl...


The following loop iterates over the new dataframe, searches for the top 2 topics based on highest coherence, and appends to a list a tuple containing the company, a tuple with the top two topic numbers, and the fitted vectorizer from the original `vectorizers_list`.

In [14]:
%%time

topics_sorted, comps, tops = kt.top_two_topics(data=output_df, companies_var='company',
                               coherence_var='coherence', topics_var='topics',
                               unique_ids=unique_ids, vrizers_list=vectorizers_dicts.values())

CPU times: user 18.5 ms, sys: 1.62 ms, total: 20.1 ms
Wall time: 19 ms


Now run the `get_models` function again over the new space of topics. You will  need to
1. sort the tuple with the top two topics.
2. create a linearly spaced array with 10 elements between the top 2 topics, turn it into integers, make the array a set to eliminate any duplicates that might arise if there is a 2 in the top two topics, and then turn that into a list.
3. get your fixed partial function again
4. the output is the same as before

In [17]:
%%time


partial_func = partial(kt.get_models, vrizer_dicts=vectorizers_dicts, unique_ids=unique_ids)

with cf.ProcessPoolExecutor() as e:
    output2 = list(e.map(partial_func, comps, tops))

CPU times: user 120 ms, sys: 80.7 ms, total: 200 ms
Wall time: 18.1 s


Create multiple dataframes from dictionaries again and collapse them into 1.

In [18]:
output_df2 = kt.build_dataframe(output2)
output_df2.head()

Unnamed: 0,company,topics,coherence,models
0,SITA,32,0.225976,LatentDirichletAllocation(learning_method='onl...
1,SITA,36,0.207399,LatentDirichletAllocation(learning_method='onl...
2,SITA,41,0.20235,LatentDirichletAllocation(learning_method='onl...
3,SITA,10,0.234603,LatentDirichletAllocation(learning_method='onl...
4,SITA,45,0.210976,LatentDirichletAllocation(learning_method='onl...


Search for the best topic based on the new output, and get the top 10 words per topic. At the moment, you are only adding 1 of the topics for each company but you can change this by removing the indexing in `top_topics` below.

In [19]:
%%time

best_topics = kt.absolute_topics(output_df2, 'company', 'coherence', 
                                 'topics', 'models', vectorizers_dicts.values())

CPU times: user 11.8 ms, sys: 828 µs, total: 12.6 ms
Wall time: 12.3 ms


Check out your output.

Get the probabilities dataframes for each company and add them to a dictionary.

In [20]:
#generate matrix summarizing distribution of docs (reviews) over topics
docs_of_probas = defaultdict(pd.DataFrame)

for tup in vectorizers_dicts.values():
    docs_of_probas[tup[0]] = pd.DataFrame(best_topics[tup[0]][1].transform(tup[1]))

# Calculate the measures of interest

In [21]:
%%time

comP_h_results = defaultdict(float)
comT_h_results = defaultdict(float)
entropy_avg_results = defaultdict(float)
cross_entropy_results = defaultdict(float)

for company, proba_df in docs_of_probas.items():
    comP_h_results[company] = kt.comph(proba_df.values)
    comT_h_results[company] = kt.conth(proba_df)
    entropy_avg_results[company] = kt.ent_avg(proba_df.values)
    cross_entropy_results[company] = kt.avg_crossEnt(proba_df.values)

CPU times: user 1.28 s, sys: 12 ms, total: 1.29 s
Wall time: 1.3 s


In [22]:
comph_df = pd.DataFrame.from_dict(comP_h_results.items())
conth_df = pd.DataFrame.from_dict(comT_h_results.items())
crossEnt_df = pd.DataFrame.from_dict(cross_entropy_results.items())
cultureMetrics = comph_df.merge(conth_df, how = 'inner', right_on = 0, left_on = 0)
cultureMetrics = cultureMetrics.merge(crossEnt_df, how = 'inner', right_on = 0, left_on = 0)
cultureMetrics.columns = ['employerID', 'comph', 'conth', 'avgCrossEnt']
cultureMetrics.head()

Unnamed: 0,employerID,comph,conth,avgCrossEnt
0,SITA,0.596506,1.155615,7.833806
1,HARMAN,0.588545,1.186593,7.724157
2,Zendesk,0.606455,1.142182,8.30978
3,Fiserv,0.58688,1.205923,7.836984
4,Huawei Technologies,0.596746,1.201933,8.148887


In [None]:
cultureMetrics.to_csv('CultureMetrics_TestSample_1000.csv')