# Script content

## Here we train a 20 topics model on all of the data. The goal is to build a recommendation engine and we do that as follows.

### 1) Extract topics distribution per document from the model object. To compute expertise on topics we sum these probabilities up to folder level over all outbox emails (now it may be clear why we removed forward emails texts). Call this value OUT_sum (derived per folder and topic ).

### 2) To compute level s of interest in a topic we first sum over all inbox mails, at folder topic level. Call this column IN_sum.

### 3) We consider OUT_sum to be a good measure of knowledge on a topic, as it accounts for both the odds that an email is indeed about a topic as well as how often folder owner speaks about this topic.

## 4) Interest in a topic is (considering recommendation engine as the goal!) is measured by the ration OUT_sum/IN_sum. This measure also accounts for situations where expert talks to other experts, thus needs no mentor there.  

### 5) For a mentor and mentee pair of executives (i,j) (ordered pair!) with corresponding values for a topic t, we then compute the following metric

#               M(t,i,j):=OUT_sum(i,t)* ((OUT_sum(j,t) + beta)/(IN_sum(j,t) + beta))

### as our measure of the goodness of the match for topic t and pair (I,j). Summing over all topics gives

# M(i,j)=Sum_{i,j} M(t,i,j)

### as overall measure of match. This is in fact an inner product and in is high when the mentor and mentee have both high values in many topics.

## Our recommendation engine will propose per mentor 3 mentees with maximal values of M(i,.). Choice for 3 is bit arbitrary and implementing in a business would require considering further specifics to come from the stakeholders.

# Notice that we match based on expertise  and preference levels as well as interest in receiving mentorship (according to our interpretation) over all topics jointly!!!
#### Off course further analysis will yield what the actual topics per pair are that really matter.

# How to test this unsupervised engine?

## Well a complex machine like  car or plane is not tested by a single test for a good reason. Therefore our model which consists of various steps also need to be addressed in various ways. LDA topics model has already been tested thus we only need to test the method developed here. We propose the following three metrics.


# Method 1

### We randomly assign emails to folders, with distribution as observed across folders. That is we sample folder ids with probability equal to proportion of emails in it (in and outbox). 
### We then compute the above defined metrics and at the end sum over all mentors  + 3 mentees with highest matching score.

### Repeating this say 5000 times will give us a distribution and we are interested in the p-value of observing what we observed. If it is highly unlikely (in the tail) then that obviously votes in favor of our matching system being meaningful. Such a permutation test type of approach is often used in applied statistics---say determining whether some quantity (height etc) in one group is significantly different from that of the other.


# Method 2

### Instead of picking top three mentees per mentor we sample three random mentees to see where about is our best choice in the distribution generated in this way. That is we estimate p-value of the distribution generated by randomly assigning mentees, but with same metric as above.

# Method 3

### Instead of resigning emails to folders we pick random pairs of mentors as well as mentees (without replacement so that each mentor is in exactly one pair and same holds at the mentees side), and in addition we pick at random a percentage of topics (say 50% of topics) and swap the corresponding OUT_sum (mentors side ) and IN_sum values (mentees side), and then re-compute everything up to the final score per mentor mentee combination, pick the top three values for each mentor and sum all of those over all mentors.. 

### Here too we wish to observe that out true value is very high up in that distribution. That is if one reassigns these preferences in an arbitrary way we get much lower top scores.

### Let us now get to coding.

### Load cleaned email texts 

In [1]:

import cPickle 
matrixpath = file('/notebooks/LDA models and data/Data Frames and lists/text_term_matrix_clean.pkl', 'rb')
text_term_matrix=cPickle.load(matrixpath )

matrix0path = file('/notebooks/LDA models and data/Data Frames and lists/text_term_matrix_clean.pkl', 'rb')
text_clean=cPickle.load(matrix0path)

dictfilepath=file('/notebooks/LDA models and data/Data Frames and lists/Dictionary.pkl', 'rb')
Dictionary=cPickle.load(dictfilepath)



Using TensorFlow backend.


In [5]:
import gensim
from gensim.models import ldamodel
import pandas as pd
import numpy as np

### Fit a LDA model with 20 topics

In [6]:
ldamodel_20 =gensim.models.ldamulticore.LdaMulticore(corpus=text_term_matrix, num_topics=20, id2word=Dictionary, workers=10,\
chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0,\
eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None, minimum_probability=0.01,\
minimum_phi_value=0.01, per_word_topics=False)

### Collect inferred document topic probabilities from the model object

In [14]:
from gensim.models import ldamodel
doc_topic_prob=[]
for bowplus in text_term_matrix0:
    doc_topic_prob.append([bowplus[0] ,bowplus[1] ,ldamodel_20.get_document_topics(bowplus[2])])

### Save to dick

In [289]:
import cPickle
with open('/notebooks/LDA models and data/Data Frames and lists/doc_topic_prob.pkl', 'wb') as pickle_file:
    cPickle.dump(obj=doc_topic_prob, file=pickle_file, protocol=cPickle.HIGHEST_PROTOCOL)

### Unpack doc topic probabilities list to an array and then data frame with columns: 

## 'dirpath', 'inout_id','emailid', 'topicid', 'prob'

### That file is sufficient to build and evaluate our recommendation engine as described above

In [283]:
k=0
resarray=np.array(['dirpath', 'inout_id','emailid' ,'topicid', 'prob'], dtype=np.dtype('a16'), ndmin=2)
for bowplus in doc_topic_prob:
    if k % 10000==0:
        print k
    comunalia=np.array([bowplus[0],bowplus[1],str(k)], ndmin=2)
    probs =np.array([list(i) for i in bowplus[2]], dtype=np.dtype('a16'))
    rep=len(bowplus[2])
    comunalia_rep=np.repeat(a=comunalia, repeats=rep, axis=0)
    thisbow=np.ma.concatenate([comunalia_rep, probs], axis=-1)
    resarray=np.ma.concatenate([resarray, thisbow], axis=0)
    k=k+1

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
230000
240000
250000
260000
270000
280000
290000
300000
310000
320000
330000
340000
350000
360000
370000
380000
390000
400000
410000
420000
430000
440000
450000
460000
470000
480000
490000
500000


### Running of the code lines in the previous cell took very long (app. 4 hours) and without a doubt this can be handled more efficiently. However due to time constraints we shall leave it as is for now.

In [288]:
# Save!!
pickle_file=file('/notebooks/LDA models and data/Data Frames and lists/resarray.pkl', 'wb')
cPickle.dump(obj=resarray, file=pickle_file, protocol=cPickle.HIGHEST_PROTOCOL)

### Check outpout

In [290]:
resarray[0:10,:]

masked_array(data =
 [['dirpath' 'inout_id' 'emailid' 'topicid' 'prob']
 ['hain-m' '0' '0' '0' '0.0250000000244']
 ['hain-m' '0' '0' '1' '0.0250000000762']
 ['hain-m' '0' '0' '2' '0.0250000000577']
 ['hain-m' '0' '0' '3' '0.0250000000608']
 ['hain-m' '0' '0' '4' '0.0250000000233']
 ['hain-m' '0' '0' '5' '0.0250000002241']
 ['hain-m' '0' '0' '6' '0.0250000000178']
 ['hain-m' '0' '0' '7' '0.0250000000259']
 ['hain-m' '0' '0' '8' '0.025000000075']],
             mask =
 False,
       fill_value = N/A)

### Make a data frame so that we can aggregate

In [292]:
columnnames=['dirpath', 'inout_id','emailid' ,'topicid', 'prob']

df_recommend = pd.DataFrame(resarray[1:,:], columns=columnnames)


In [294]:
print df_recommend.shape
print resarray.shape

(1951525, 5)
(1951526, 5)


### Cast probabilities back to float. We needed them as chars to unpack as array doesn't accept mixed data types and we have a dirpath there

In [479]:
df_recommend['prob']=df_recommend['prob'].apply(lambda x : float(x))

In [296]:
df_recommend[0:10]

Unnamed: 0,dirpath,inout_id,emailid,topicid,prob
0,hain-m,0,0,0,0.025
1,hain-m,0,0,1,0.025
2,hain-m,0,0,2,0.025
3,hain-m,0,0,3,0.025
4,hain-m,0,0,4,0.025
5,hain-m,0,0,5,0.025
6,hain-m,0,0,6,0.025
7,hain-m,0,0,7,0.025
8,hain-m,0,0,8,0.025
9,hain-m,0,0,9,0.025


### Check precision since we are going to sum

### OK!

In [373]:
df_recommend.iloc[2345,4]

0.32674993661000001

### Add counter columns and aggregate to get IN_sum and OUT_sum as well as number of emails per folder (plus inout_id)

In [371]:
df_recommend['count_emails']= 1
df_grouped =df_recommend.groupby(['dirpath','inout_id', 'topicid'], as_index=False)[['prob', 'count_emails']].sum()

In [380]:
print df_grouped.shape
df_grouped[0:4]

(5796, 5)


Unnamed: 0,dirpath,inout_id,topicid,prob,count_emails
0,allen-p,0,0,120.038677,635
1,allen-p,0,1,64.01827,462
2,allen-p,0,10,70.678547,427
3,allen-p,0,11,212.328779,832


### OK! We have about 150 folders here, 20 topics, and in and out box, which amounts to 150*20*2= 6000. However not all folders may have a score for all topics with values above threshold set in the gensim  software to return by a model object---as the gensim code author explained in a blog

### Split frame to in and out box parts

In [389]:
df_grouped_in=df_grouped[df_grouped['inout_id']=='0']

df_grouped_out=df_grouped[df_grouped['inout_id']=='1']


### Makes sense, we certainly have one value per outbox topic combination and a few less for the outbox part as we miss few out boxes


In [385]:
print df_grouped_out.shape
print df_grouped_in.shape

(2796, 5)
(3000, 5)


In [390]:
df_grouped_out.rename(columns={'dirpath': 'mentor', 'prob': 'OUT_sum'}, inplace=True)
df_grouped_in.rename(columns={'dirpath': 'mentee', 'prob': 'IN_sum'}, inplace=True)

In [396]:
df_joined = df_grouped_out.merge(right=df_grouped_in, how='inner', left_on=['mentor', 'topicid'],\
                                 right_on=['mentee', 'topicid'], suffixes=('_mentor', '_mentee'))

In [398]:
df_joined[0:4] 

Unnamed: 0,mentor,inout_id_mentor,topicid,OUT_sum,count_emails_mentor,mentee,inout_id_mentee,IN_sum,count_emails_mentee
0,allen-p,1,0,20.295934,65,allen-p,0,120.038677,635
1,allen-p,1,1,4.660352,24,allen-p,0,64.01827,462
2,allen-p,1,10,0.527964,14,allen-p,0,70.678547,427
3,allen-p,1,11,24.096311,82,allen-p,0,212.328779,832


In [400]:
# df_joined['INsum_OUTsum']=df_joined[['OUT_sum', 'IN_sum']].apply(lambda x: x[0]/x[1])

### Add ratio (IN_sum+beta) /(OUT_sum+beta)

In [441]:
beta=0.1
def in_over_out(x):
    return (x['IN_sum']+beta)/(x['OUT_sum']+beta)
df_joined['INsum_OUTsum']=df_joined.apply(in_over_out, axis=1)

In [744]:
 df_joined_save=df_joined[['mentor', 'topicid', 'OUT_sum', 'IN_sum', 'INsum_OUTsum']]
filedfjoined= file('/notebooks/LDA models and data/Data Frames and lists/expertise_preferance_scores.pkl', 'wb')
cPickle.dump(obj=df_joined_save, file=filedfjoined, protocol=cPickle.HIGHEST_PROTOCOL)

In [442]:
df_joined[0:4]

Unnamed: 0,mentor,inout_id_mentor,topicid,OUT_sum,count_emails_mentor,mentee,inout_id_mentee,IN_sum,count_emails_mentee,INsum_OUTsum
0,allen-p,1,0,20.295934,65,allen-p,0,120.038677,635,5.890325
1,allen-p,1,1,4.660352,24,allen-p,0,64.01827,462,13.469228
2,allen-p,1,10,0.527964,14,allen-p,0,70.678547,427,112.711174
3,allen-p,1,11,24.096311,82,allen-p,0,212.328779,832,8.779387


In [443]:
df_mentors=df_joined[['mentor', 'topicid', 'OUT_sum']]
df_mentees=df_joined[['mentee', 'topicid', 'INsum_OUTsum']]

### As already mentioned the structure of data is such that there are folder topic combinations with no emails that ever had a topic probability above the gensim threshold for returning. In such cases there is obviously no basics for match on that topic (if either mentor or mentee has no sum variable. 

### This remark actually applies already when defining IN_sum/OUT_sum. One might argue that such case should be handled with more care, however if one receives emails on a topic and never sends any then it s a valid argument  that this person actually has no interest in this topic. This should be confirmed by the stakeholders in a real project.


In [444]:
df_cartesian=df_mentors.merge(right=df_mentees, how='inner', left_on='topicid', right_on='topicid' ,\
                              suffixes=('_mentor', '_mentee'))


def mult_cols(x):
    return x['OUT_sum']*x['INsum_OUTsum']

df_cartesian['topic_match_score']=df_cartesian.apply(mult_cols, axis=1)

In [454]:
df_cartesian_agg =df_cartesian.groupby(['mentor', 'mentee'], as_index=False)['topic_match_score'].sum()

# The first result of the final part if our assignment is now obtained.

# The table with top 3 recommendations for each folder (as mentor) is produced in the next cell:

In [472]:
df_recommendation_engine=df_cartesian_agg.groupby('mentor').head(3).reset_index(drop=True)
df_recommendation_engine[0:30]

Unnamed: 0,index,mentor,mentee,topic_match_score
0,63,allen-p,linder-e,81429.37
1,54,allen-p,keavey-p,43185.18
2,139,allen-p,ybarbo-p,43056.24
3,205,arnold-j,linder-e,647838.1
4,204,arnold-j,lewis-a,480609.5
5,196,arnold-j,keavey-p,396960.6
6,347,arora-h,linder-e,31506.37
7,338,arora-h,keavey-p,10827.59
8,423,arora-h,ybarbo-p,10680.32
9,489,badeer-r,linder-e,101607.9


### And the required total score is (the higher the better):

In [740]:
recommendation_score =df_recommendation_engine.topic_match_score.sum()
print recommendation_score

88137833.5072


# In the remainder we collect all required statistics for our 3 valuation methods

# As simple as it may sound to speak out, writing code for such schema requires some lines

### The following 7 cells do the preparatory work, that only needs to be done once. The part that needs be repeated per random sample of folders for each email is partly repetition of what we did to compute recommendation_score, and is all put in a loop in cell 8 from here.

In [490]:
df_temp1=df_recommend[['dirpath','inout_id', 'emailid']].drop_duplicates()
df_temp1['count_emails']=1
# determine p for np.random_choice call
df_p4rc = df_temp1.groupby(['dirpath','inout_id'], as_index=False)['count_emails'].sum()

In [497]:
df_p4rc_in=df_p4rc[df_p4rc['inout_id']=='0']
df_p4rc_out=df_p4rc[df_p4rc['inout_id']=='1']
print df_p4rc_out.shape
print df_p4rc_in.shape

(142, 3)
(150, 3)


In [574]:
# prepare data for sampling with replacement
# inbox
#
# prepare
email_count_in=df_p4rc_in.count_emails.sum()
p_in=df_p4rc_in.count_emails/df_p4rc_in.count_emails.sum()
nr_in_folders=df_p4rc_in.shape[0]

# sample from inbox id' s with replacement with same distribution p_in
rc_in =np.random.choice(a=nr_in_folders, size=email_count_in , replace=True, p=p_in)


# outbox
#
# prepare
email_count_out=df_p4rc_out.count_emails.sum()
p_out=df_p4rc_out.count_emails/df_p4rc_out.count_emails.sum()
nr_out_folders=df_p4rc_out.shape[0]

# # sample from outbox id's with replacement with same distribution p_out
# rc_out =np.random.choice(a=nr_out_folders, size=email_count_out , replace=True, p=p_out)

In [562]:
# define a dirpath_id needed
# prep for in folder
df_tmpin = df_recommend[df_recommend['inout_id']=='0'].iloc[:,0:2].drop_duplicates()
tmpin = np.array(range(0,nr_in_folders))
df_tmpin['dirpath_id']=tmpin

# prep for out folder
df_tmpout = df_recommend[df_recommend['inout_id']=='1'].iloc[:,0:2].drop_duplicates()
tmpout = np.array(range(0,nr_out_folders))
df_tmpout['dirpath_id']=tmpout

# concatenate so we have an id per folder name and inout_id value
df_tmp = pd.concat([df_tmpin, df_tmpout])

In [673]:
# add this column to full frame df_recommend where one rwo stands for dirpath, inout_id, topic and corresponding 
# sum of probbailities
df_recommend1 = df_recommend.merge(right=df_tmp, how='left', left_on=['dirpath', 'inout_id'],\
                                 right_on=['dirpath', 'inout_id'])
df_recommend1.head()



Unnamed: 0,dirpath,inout_id,emailid,topicid,prob,count_emails,dirpath_id
0,hain-m,0,0,0,0.025,1,0
1,hain-m,0,0,1,0.025,1,0
2,hain-m,0,0,2,0.025,1,0
3,hain-m,0,0,3,0.025,1,0
4,hain-m,0,0,4,0.025,1,0


In [674]:
def to_int(x):
    return int(x['emailid'])
df_recommend1['emailid']=df_recommend1.apply(to_int, axis=1)
# df_recommend1[0:1]

In [707]:
# rc_in[df_recommend1.dirpath_id[0]]
# df_recommend1.dirpath_id[0]

df_tmpin2 = df_recommend1[df_recommend1['inout_id']=='0'][['dirpath','inout_id', 'emailid', 'dirpath_id']].drop_duplicates()
df_tmpout2 = df_recommend1[df_recommend1['inout_id']=='1'][['dirpath','inout_id', 'emailid', 'dirpath_id']].drop_duplicates()
print df_tmpout2.shape



(149379, 4)


# From here everything needs to be recomputed per random reassignment of directories to emails.

### We nonetheless provide code with comments but leave then commented (in case a reader wishes to execute the pieces of code that happen in that loop), yet for readability we comment them out of executable code.



In [718]:
# def random_dir_in():
#     return np.random.choice(a=nr_in_folders, size=1 , replace=True, p=p_in)

# # =np.random.choice(a=nr_in_folders, size=1 , replace=True, p=p_in)
# def random_dir_out():
#     return np.random.choice(a=nr_out_folders, size=1 , replace=True, p=p_out) 

# df_tmpin2['random_dirpath_id']=df_tmpin2['emailid'].apply(lambda x: random_dir_in()[0])
# df_tmpout2['random_dirpath_id']=df_tmpout2['emailid'].apply(lambda x: random_dir_out()[0])


### Data Frame df_recommend2 has now randomly resigned folder id's and we can repeat the same thing we did above, in a loop and each time compute total score. Instead of aggregating on ' dirpath' we need to aggregate on 'random_dirpath_id' here metrics.

In [732]:
df_recommend1_in = df_recommend1[df_recommend1['inout_id']=='0']
df_recommend1_in= df_recommend1_in.merge(right=df_tmpin2,  how='inner',  left_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'],\
                                 right_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'])



df_recommend1_out = df_recommend1[df_recommend1['inout_id']=='1']
df_recommend1_out= df_recommend1_out.merge(right=df_tmpout2,  how='inner',  left_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'],\
                                 right_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'])

df_recommend2=pd.concat([df_recommend1_out, df_recommend1_in])

In [771]:
# df_recommend2=df_tmp = pd.concat([df_recommend1_out, df_recommend1_in]).ipynb_checkpoints/
df_recommend2=df_recommend2[['inout_id', 'emailid', 'topicid', 'prob', 'random_dirpath_id']].\
rename(columns={'random_dirpath_id': 'dirpath'})

### Helper functions:

#### 1) for IN_sum/OUT_sum computation

#### 2) Multiply two columns

#### 3) sample directory id's at random with distributiin as generated by in and out folders (probability of guessing

In [None]:
beta=0.1
def in_over_out(x):
    return (x['IN_sum']+beta)/(x['OUT_sum']+beta)

#####################################

def mult_cols(x):
    return x['OUT_sum']*x['INsum_OUTsum']
####################################

def random_dir_in():
    return np.random.choice(a=nr_in_folders, size=1 , replace=True, p=p_in)

def random_dir_out():
    return np.random.choice(a=nr_out_folders, size=1 , replace=True, p=p_out) 


In [769]:
df_recommend1_in = df_recommend1[df_recommend1['inout_id']=='0']
df_recommend1_in= df_recommend1_in.merge(right=df_tmpin2,  how='inner',  left_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'],\
                                 right_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'])



df_recommend1_out = df_recommend1[df_recommend1['inout_id']=='1']
df_recommend1_out= df_recommend1_out.merge(right=df_tmpout2,  how='inner',  left_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'],\
                                 right_on=['dirpath', 'inout_id', 'emailid', 'dirpath_id'])

df_recommend2=df_tmp = pd.concat([df_recommend1_out, df_recommend1_in])

In [772]:
df_recommend2.head()

Unnamed: 0,inout_id,emailid,topicid,prob,dirpath
0,1,8,0,0.362389,7
1,1,8,10,0.05538,7
2,1,8,11,0.19575,7
3,1,8,17,0.373782,7
4,1,14,13,0.105881,70


In [None]:
number_of_samples=100
matching_scores = np.zeros(shape=(number_of_samples,), dtype=float)
# k=0
for j in range(0, number_of_samples):
#     k=k+1
    df_tmpin2['random_dirpath_id']=df_tmpin2['emailid'].apply(lambda x: random_dir_in()[0])
    df_tmpout2['random_dirpath_id']=df_tmpout2['emailid'].apply(lambda x: random_dir_out()[0])

    df_grouped2 =df_recommend2.groupby(['dirpath','inout_id', 'topicid'], as_index=False)['prob'].sum()

    df_grouped_in2=df_grouped2[df_grouped2['inout_id']=='0']
    df_grouped_out2=df_grouped2[df_grouped2['inout_id']=='1']

    df_grouped_out2.rename(columns={'dirpath': 'mentor', 'prob': 'OUT_sum'}, inplace=True)
    df_grouped_in2.rename(columns={'dirpath': 'mentee', 'prob': 'IN_sum'}, inplace=True)

    df_joined2 = df_grouped_out2.merge(right=df_grouped_in2, how='inner', left_on=['mentor', 'topicid'],\
                                     right_on=['mentee', 'topicid'], suffixes=('_mentor', '_mentee'))


    df_joined2['INsum_OUTsum']=df_joined.apply(in_over_out, axis=1)

    df_mentors2=df_joined2[['mentor', 'topicid', 'OUT_sum']]
    df_mentees2=df_joined2[['mentee', 'topicid', 'INsum_OUTsum']]

    df_cartesian2=df_mentors.merge(right=df_mentees2, how='inner', left_on='topicid', right_on='topicid' ,\
                                  suffixes=('_mentor', '_mentee'))


    df_cartesian2['topic_match_score']=df_cartesian2.apply(mult_cols, axis=1)

    df_cartesian2=df_mentors2.merge(right=df_mentees2, how='inner', left_on='topicid', right_on='topicid' ,\
                                  suffixes=('_mentor', '_mentee'))


    df_cartesian2['topic_match_score']=df_cartesian2.apply(mult_cols, axis=1)

    df_cartesian_agg2 =df_cartesian2.groupby(['mentor', 'mentee'], as_index=False)['topic_match_score'].sum()

    df_recommendation_engine2=df_cartesian_agg2.groupby('mentor').head(3).reset_index(drop=True)
    recommendation_score2 =df_recommendation_engine2.topic_match_score.sum()
    matching_scores[j]=recommendation_score2
#     print recommendation_score2
#     print k

In [758]:
print recommendation_score 
print recommendation_score2

88137833.5072
2232129.58661
