## Ikigai - A Career Village RecSys

by Marsh [ @vbookshelf ]<br>
9 April 2019

<img src="http://bee.test.woza.work/assets/student.jpg" width="500"></img>

## Contents

<a href='#Introduction'>1. Introduction</a><br>
<a href='#Prepare_the_Data'>2. Prepare the Data</a><br>
<a href='#Ask_a_Question'>3. Ask a Question</a><br>
<a href='#Model_1'>4. Model 1 -  Tags and Profiles</a><br>
<a href='#Model_2'>5. Model 2 - Profiles and Answers</a><br>
<a href='#Model_3'>6. Model 3 - TruncatedSVD</a><br>
<a href='#Model_4'>7. Model 4 - GloVe Embeddings</a><br>
<a href='#Select_professionals'>8. Select professionals who are most likely to answer the question</a><br>
<a href='#Final_Output'>9. Final Output - The chosen ones</a><br>
<a href='#Testing'>10. Testing and Results</a><br>
<a href='#Things'>11. Things to keep in mind</a><br>
<a href='#Ideas'>12. Ideas for sharpening this system</a><br>

<a href='#Citations'>Citations</a><br>
<a href='#Reference_Kernels'>Reference Kernels</a><br>
<a href='#Helpful_Resources'>Helpful Resources</a><br>
<a href='#Conclusion'>Conclusion</a><br>


In [1]:
# Set a seed value
from numpy.random import seed
seed(101)

import pandas as pd
import numpy as np
import os

import pickle
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Don't Show Warning Messages
import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


In [2]:
# Read the data

df_questions = \
pd.read_csv('../input/data-science-for-good-careervillage/questions.csv')
df_answers = \
pd.read_csv('../input/data-science-for-good-careervillage/answers.csv')
df_professionals = \
pd.read_csv('../input/data-science-for-good-careervillage/professionals.csv')

df_comments = \
pd.read_csv('../input/data-science-for-good-careervillage/comments.csv')
df_tags = \
pd.read_csv('../input/data-science-for-good-careervillage/tags.csv')
df_tag_users = \
pd.read_csv('../input/data-science-for-good-careervillage/tag_users.csv')

#print(df_questions.shape)
#print(df_answers.shape)
#print(df_professionals.shape)
#print(df_comments.shape)
#print(df_tags.shape)
#print(df_tag_users.shape)

<a id='Introduction'></a>

## 1. Introduction

If you're like me then a lot of your emails go unopened. You know from reading the title, or just from the sender's name, that you have no interest in the content. Who cares if a city that you visited 6 months ago now has "some great deals on hotel rooms"? Our brains are becoming efficient content filters. Anything that's not relevant gets ignored.

That's why it's important for CareerVillage to have a good recommender system (RecSys). Professionals need to feel that the questions sent to them are relevant. Creating a personalized experience will make them feel valued. This will lead to more questions being answered and faster answers.

The objective of this competition is to develop a method to recommend relevant questions to the professionals who are most likely to answer them.

This solution follows two steps:<br>

*Step 1*: Develop a method to recommend relevant questions to professionals.<br>
*Step 2*: Identify those professionals who are most likely to answer a relevant question.

It's not practical to measure the quality of this solution using an evaluation metric. To overcome this we will assess the results qualitatively using a simple run and read approach. 

In step 1 you'll ask a question. Then you'll look at the professionals that each model recommends and ask: "Based on this person's profile or past answer, will this recommended professional be **capable** of answering this question?"<br>
This is an intuitive way of assessing the relevance of your question to a particular professional.

In step 2, to assess whether or not a professional will respond to a relevant question, this system will look at four indicators:

1. Is this professional a new member?
2. Has this professional answered a past question from this student?
3. Did this professional answer a question recently?
4. Did this professional make a comment recently?

Using this filter, this system will generate a final list of professionals that are likely to respond to your question.
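
A minimal sketch of how the four indicators above could be combined into a filter. The class name, the field names, both day thresholds and the rule that any single indicator is enough are assumptions made for illustration; they are not taken from the selection code used later in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for this illustration.
NEW_MEMBER_DAYS = 30
RECENT_DAYS = 90


@dataclass
class Activity:
    """Hypothetical activity summary for one professional."""
    num_days_member: int
    answered_this_student: bool
    days_since_last_answer: Optional[int]   # None = has never answered
    days_since_last_comment: Optional[int]  # None = has never commented


def is_recent(days):
    """True if the activity happened within RECENT_DAYS."""
    return days is not None and days <= RECENT_DAYS


def likely_to_respond(p):
    """Assumed rule: one positive indicator is enough."""
    return any([
        p.num_days_member <= NEW_MEMBER_DAYS,    # 1. new member
        p.answered_this_student,                 # 2. answered this student before
        is_recent(p.days_since_last_answer),     # 3. answered recently
        is_recent(p.days_since_last_comment),    # 4. commented recently
    ])


newcomer = Activity(5, False, None, None)
dormant = Activity(900, False, 400, None)
print(likely_to_respond(newcomer), likely_to_respond(dormant))
```

A stricter variant could require two or more indicators, at the cost of a shorter final list.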

A central feature of this system is that it compares your question to each professional's background info or to all answers in the dataset. If the question is similar to a particular professional's background or to a past answer that he or she gave - then it's likely that that professional is able to answer your question. In other words, your question is relevant to that professional.


**RecSys Architecture**

This recommender system is made up of four models and a filter:<br>

( A professional's profile is made up of the industry they work in and their title. )

- Model 1 uses tags followed, professional profiles and tfidf (Term Frequency Inverse Document Frequency)
- Model 2 uses professional profiles, past answers and tfidf
- Model 3 uses professional profiles, past answers, tfidf and Truncated SVD (Singular Value Decomposition)
- Model 4 uses professional profiles, past answers and GloVe pre-trained word embeddings  (Global Vectors for Word Representation)

Why do we need four models? Because not all models perform equally well on all questions. This is a small dataset and different careers are not equally represented. For example, there are a lot of professionals with a computer science background but few firefighters. Professions are well represented. Trades are not. Also, not all questions are about specific careers; some are about life. Therefore, having a diversity of models that tackle the problem in different ways is the most prudent approach.

**Interactive Notebook**

This notebook is set up in such a way that you'll be able to select a question from the dataset, run the entire notebook and then see a printout of results for each model, and for the filter. There's also an option to type in your own question as you would on the CareerVillage website.

**Testing and Results**

Does this recommender system work? Yes it does. 

To demonstrate this fact I've tested it on nine questions that are representative of the data. The number of recommendations produced and the number of false positives in those recommendations is tabulated in section 10.


My goal here is to build a working prototype. Let's start by preparing the data.

<a id='Prepare_the_Data'></a>

## 2. Prepare the Data

In [3]:
# Check what folders are available

os.listdir('../input')

['data-prep-for-career-village-recsys',
 'glove-global-vectors-for-word-representation',
 'data-science-for-good-careervillage']

I've done the data preparation in a separate kernel. This is the link:<br>
https://www.kaggle.com/vbookshelf/data-prep-for-careervillage-recsys


Here we'll simply load the prepared data from that kernel's output.

df_qa_prof.pickle is the pre-processed dataframe that we will use in the three answer-based models (Models 2, 3 and 4). This is a merged dataframe that includes questions, answers and professionals. Professionals who didn't answer any questions are not included.

There's a new column called quest_text where each cell contains both the question title and the question body. There's also a new column called answers_text where each cell contains the combined content of the following columns: professionals_headline, professionals_industry and answers_body.

In [4]:
# Load the pickled dataframe

path_1 = '../input/data-prep-for-career-village-recsys/df_qa_prof.pickle'

df_qa_prof = pickle.load(open(path_1,'rb'))

# check the shape
df_qa_prof.shape

(51123, 18)

In [5]:
df_qa_prof.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,answers_id,answers_author_id,answers_question_id,answers_date_added,answers_body,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,quest_text,answers_text
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,2015-10-19 20:56:49,Assist with Recognizing and Developing Potential,36ff3b3666df400f956f8335cf53e09e,Mental Health Care,"Cleveland, Ohio",1200,teacher maths teacher maths teacher useful col...,assist recognizing developing potential mental...
1,0f1d6a4f276c4a05878dd48e03e52289,585ac233015447cc9e9a217044e515e1,2016-05-19 22:16:25 UTC+0000,what kind of college could i go to for a soc...,I like soccer because i been playing sense i w...,f3519ab99a1a4a13a8a9ecb814287d2a,36ff3b3666df400f956f8335cf53e09e,0f1d6a4f276c4a05878dd48e03e52289,2016-07-31 15:35:54 UTC+0000,<p>Hi Rodrigo!</p>\n<p>The important thing to ...,2015-10-19 20:56:49,Assist with Recognizing and Developing Potential,36ff3b3666df400f956f8335cf53e09e,Mental Health Care,"Cleveland, Ohio",1200,kind college could go soccer player like socce...,assist recognizing developing potential mental...


In [6]:
# Define a function to clean the text

def process_text(x):
    
    # remove the hash sign
    x = x.replace("#", "")
    
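    # replace the dash sign with a space (currently disabled)
    #x = x.replace("-", " ")
    
    # Remove HTML. Naming the parser explicitly avoids the
    # "no parser was explicitly specified" warning.
    x = BeautifulSoup(x, "html.parser").get_text()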
    
    # convert words to lower case
    x = x.lower()
    
    # remove the word question
    x = x.replace("question", "")
    
    # remove the word career
    x = x.replace("career", "")
    
    # remove the word study
    x = x.replace("study", "")
    
    # remove the word student
    x = x.replace("student", "")
    
    # remove the word school
    x = x.replace("school", "")
    
    # Remove non-letters
    x = re.sub("[^a-zA-Z]"," ", x)
    
    # Remove stop words
    # Split the text into words
    words = x.split()
    # use a set so that each membership test is fast
    stops = set(stopwords.words("english"))
    x_list = [w for w in words if w not in stops]
    # convert the list back to a string
    x = ' '.join(x_list)
    
    return x
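
One thing to keep in mind about process_text(): `str.replace` works on substrings, so removing "study" also turns "studying" into "ing". The sketch below shows an alternative that removes whole words only, using a regex word boundary. The names `NOISE_WORDS`, `NOISE_PATTERN` and `remove_whole_words` are hypothetical and this is not what the notebook does; it's only here to illustrate the difference.

```python
import re

# The same five words that process_text() removes.
NOISE_WORDS = ["question", "career", "study", "student", "school"]

# \b matches a word boundary, so only complete words are matched.
NOISE_PATTERN = re.compile(r"\b(?:" + "|".join(NOISE_WORDS) + r")\b")


def remove_whole_words(text):
    """Lower-case the text and drop the noise words, leaving longer words intact."""
    text = NOISE_PATTERN.sub(" ", text.lower())
    # collapse the extra spaces left behind by the substitution
    return " ".join(text.split())


text = "I am studying law at school"

# Substring removal, as done in process_text()
substring_version = text.lower().replace("study", "").replace("school", "")

# Whole-word removal
whole_word_version = remove_whole_words(text)

print(substring_version)
print(whole_word_version)
```

Whole-word removal leaves plurals such as "careers" in the text, so which behaviour is preferable depends on the data.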

<hr>
<a id='Ask_a_Question'></a>

## 3. Ask a Question

Please select any question in the dataset or type your own question. Then run all cells in this kernel.


 ### ~ Option 1: Choose a question from the CareerVillage dataset ~
Please set QUESTION_INDEX equal to any row index between 0 and 51000.<br>
<br>
For your first try I suggest using QUESTION_INDEX = 777. It's a computer science related question that nicely demonstrates the performance of each of the 4 models and the filter.

If you'd like to choose Option 2 then please set QUESTION_INDEX = None

In [7]:
###################################

QUESTION_INDEX = 1710

###################################

### ~ Option 2: Ask a question as you would on the CareerVillage site ~
Please type your text within inverted commas - " i am a string "

In [8]:
# =========================================== #
# Please check that QUESTION_INDEX = None in the above cell before entering
# your own question.

my_question_title = "How do I become a data scientist?"

my_question_body = "I want to be a data scientist. What subjects should I study? #data-science"

# =========================================== #

==> After selecting one of the above options please Run all cells in this kernel. <==

### ~ This is your Question ~

In [9]:
# Code to process the question

# if Option 1 is chosen
if QUESTION_INDEX is not None:
    
    QUESTION_INDEX = int(QUESTION_INDEX)
    
    student_id = df_qa_prof.loc[QUESTION_INDEX, 'questions_author_id']
    # Get the question info from the dataset.
    # The quest_text column was already cleaned in the data prep kernel.
    question_id = df_qa_prof.loc[QUESTION_INDEX, 'questions_id']
    question_title = df_qa_prof.loc[QUESTION_INDEX, 'questions_title']
    question_body = df_qa_prof.loc[QUESTION_INDEX, 'questions_body']
    # question_text is clean text that is used in the models
    question_text = df_qa_prof.loc[QUESTION_INDEX, 'quest_text'] 

# if Option 2 is chosen
else:
    student_id = 33333333 # dummy id that's needed for the final selection code
    # get the input question
    question_id = 'My Question'
    question_title = my_question_title
    question_body = my_question_body
    # Clean the text using the process_text() function.
    # question_text is clean text that is used in the models
    question_text = process_text(question_title) + ' ' + process_text(question_body)
    

# Print the question
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

Question id:  eb80205482e4424cad8f16bc25aa2d9c
Question Title:  I want to become an army officer. What can I do to become an army officer?


Question Body:
  I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


<hr>
<a id='Model_1'></a>

## 4. Model 1 -  Tags, Profiles, Tfidf and Cosine Similarity


This model considers every professional in the dataset irrespective of whether or not they have answered a past question. 

**How does this model  work?**

It compares your question to each professional's background info. Background info is made up of a professional's title, the industry they work in and the tags they follow. The hash symbols are removed from the tags - the model sees the tags as words. The similarity is measured by comparing the vector encoding of the question to the vector encoding of each professional's background info. The encodings are created using Tfidf (Term Frequency Inverse Document Frequency). The vectors are compared using cosine similarity.

> **The idea is that if a question is similar to a professional's background then there's a good chance that he or she will be able to answer the question.**

**How can we assess how well the model is working?**

We will "run and read". We will run the model and then look at the results to see if they make sense. The model will print out the background info of each professional it has selected. By reading this and comparing it to the question we'll be able to tell whether the model is making reasonable choices.
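
The comparison described above (tf-idf vectors compared with cosine similarity) can be sketched on toy data. The question and the two profiles below are made up for illustration, and the vectorizer uses default settings rather than the ones chosen later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical): one cleaned question and two cleaned profiles
question = "want become army officer"
profiles = [
    "retired army officer human resources military",
    "assurance associate accounting audit",
]

# Encode the question and the profiles with the same vocabulary
vect = TfidfVectorizer()
dtm = vect.fit_transform([question] + profiles)

# Compare the question (row 0) to each profile (rows 1 onwards)
scores = cosine_similarity(dtm[0:1], dtm[1:]).ravel()

for profile, score in zip(profiles, scores):
    print(round(float(score), 3), '->', profile)
```

The first profile shares the words "army" and "officer" with the question, so it gets a positive score. The second profile shares no words with the question, so its score is zero.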




## 4.1. Prepare the data

In [10]:
# load df_professionals
path_2 = '../input/data-prep-for-career-village-recsys/df_professionals.pickle'
df_professionals = pickle.load(open(path_2,'rb'))

# replace all missing values with nothing
df_professionals = df_professionals.fillna('')

# Create a dictionary of tag id's and tag names
keys = list(df_tags['tags_tag_id'])
values = list(df_tags['tags_tag_name'])
tags_dict = dict(zip(keys, values))

# Change the tag id numbers to tag names that we can read
df_tag_users['tag_name'] = df_tag_users['tag_users_tag_id'].map(tags_dict)

df_tag_users.head()

Unnamed: 0,tag_users_tag_id,tag_users_user_id,tag_name
0,593,c72ab38e073246e88da7e9a4ec7a4472,computer-software
1,1642,8db519781ec24f2e8bdc67c2ac53f614,programming
2,638,042d2184ee3e4e548fc3589baaa69caf,running
3,11093,c660bd0dc1b34224be78a58aa5a84a63,life-coach
4,21539,8ce1dca4e94240239e4385ed22ef43ce,art


Because anyone is able to follow a tag, there could be a mixture of students and professionals in df_tag_users. We need to keep only the professionals.

In [11]:
# get a list of professionals
prof_list = list(df_professionals['professionals_id'])
# keep only the rows in df_tag_users that belong to professionals
df_prof_tag_users = df_tag_users[df_tag_users['tag_users_user_id'].isin(prof_list)]

df_prof_tag_users.shape

(117520, 3)

Now that we've kept only the professionals, let's see which tags each professional follows. The hash sign # has been removed from the tags.

In [12]:
# drop the tag_users_tag_id column
df_prof_tag_users = df_prof_tag_users.drop('tag_users_tag_id', axis=1)

# replace missing values with nothing - just to be safe
df_prof_tag_users = df_prof_tag_users.fillna('')

# add a space to the end of each tag name
def add_space(x):
    x = x + ' '
    
    return x

df_prof_tag_users['tag_name'] = df_prof_tag_users['tag_name'].apply(add_space)

# groupby tag_users_user_id and sum() the tags
df_prof_tag_users = df_prof_tag_users.groupby('tag_users_user_id').sum()

# reset the index
df_prof_tag_users = df_prof_tag_users.reset_index()

# check how many professionals follow tags
num_followers = len(df_prof_tag_users['tag_users_user_id'])

# Are there professionals who don't follow any tags?

num_profs = df_professionals['professionals_id'].nunique()
num_tag_followers = df_prof_tag_users['tag_users_user_id'].nunique()

num_not_followers = num_profs - num_tag_followers

print(num_followers, 'professionals follow tags.')
print(num_not_followers, 'professionals do not follow tags.')

df_prof_tag_users.head()

25605 professionals follow tags.
2649 professionals do not follow tags.


Unnamed: 0,tag_users_user_id,tag_name
0,00009a0f9bda43eba47104e9ac62aff5,digital-media script-writing content-creation
1,000196ef8db54b9a86ae70ad31745d04,accounting
2,0008138be908438e8944b21f7f57f2c1,real-estate
3,000d4635e5da41e3bfd83677ee11dda4,university information-technology college
4,000e2b5714444d79a672bf927905135c,financial-services


2649 professionals don't follow tags.
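
The tag strings above were built by adding a space to each tag and then summing the strings within each group. An equivalent way to get one space-separated string per user is to join the tags during the groupby. This is a sketch on a small made-up dataframe (`df_demo` is hypothetical), not a change to the code above.

```python
import pandas as pd

# Toy version of df_prof_tag_users (hypothetical values)
df_demo = pd.DataFrame({
    'tag_users_user_id': ['a', 'a', 'b'],
    'tag_name': ['army', 'military', 'accounting'],
})

# Join the tags of each user into a single space-separated string
tags_per_user = (
    df_demo.groupby('tag_users_user_id')['tag_name']
    .agg(' '.join)
    .reset_index()
)

print(tags_per_user)
```

Unlike the sum() approach, this leaves no trailing space at the end of each string.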

### Add a new column to the df_professionals dataframe that shows the tags that each professional follows.

Because there are 2649 professionals that don't follow any tags, a default (inner) merge of df_professionals and df_prof_tag_users would silently drop them. To keep them we will do a left join. This keeps every row from the left dataframe (df_professionals) and adds the tag info wherever a match exists. Please refer to the tutorial video referenced in the 'Helpful Resources' section if you'd like to learn more about merging dataframes.
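
To make the difference concrete, here is a small sketch with made-up data. The two toy dataframes are hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Three professionals, only two of whom follow tags (toy data)
professionals = pd.DataFrame({'professionals_id': ['p1', 'p2', 'p3']})
tags = pd.DataFrame({
    'professionals_id': ['p1', 'p3'],
    'tags_followed': ['army military', 'accounting'],
})

# The default merge is an inner join: p2 disappears
inner = pd.merge(professionals, tags, on='professionals_id')

# A left join keeps p2 and fills its tags with NaN
left = pd.merge(professionals, tags, on='professionals_id', how='left')

print(len(inner), 'rows after the inner join')
print(len(left), 'rows after the left join')
print(left['tags_followed'].isna().sum(), 'row(s) with missing tags')
```

The missing values produced by the left join are the reason a fillna('') step is needed after merging.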

In [13]:
# https://www.youtube.com/watch?v=h4hOPGo4UVU

# Change column name in df_prof_tag_users. 
# For the merge to work the column called professionals_id needs to be in
# both dataframes.
new_names = ['professionals_id', 'tags_followed']
df_prof_tag_users.columns = new_names

# perform the left merge
df_profs = pd.merge(df_professionals,df_prof_tag_users, 
                   on='professionals_id', how='left')

# replace missing values with nothing
df_profs = df_profs.fillna('')

print('We now have a combined dataframe containing the tag info and profile info for all professionals.')

df_profs.head()

We now have a combined dataframe containing the tag info and profile info for all professionals.


Unnamed: 0,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,tags_followed
0,2011-10-05 20:35:19,,9ced4ce7519049c0944147afb75a8ce3,,,2675,
1,2011-10-05 20:49:21,,f718dcf6d2ec4cb0a52a9db59d7f9e67,,,2675,
2,2011-10-18 17:31:26,,0c673e046d824ec0ad0ebe012a0673e4,,"New York, New York",2662,consulting education consulting education cons...
3,2011-11-09 20:39:29,,977428d851b24183b223be0eb8619a8c,,"Boston, Massachusetts",2640,
4,2011-12-10 22:14:44,,e2d57e5041a44f489288397c9904c2b2,,,2609,


### Create a new column that contains the background info of each professional. Then clean the text in this new column.


In [14]:
# Create the new column by summing the strings from each separate column.
df_profs['prof_info'] = df_profs['professionals_headline'] + ' ' \
+ df_profs['professionals_industry'] + ' ' + df_profs['tags_followed']

# clean the text using the process_text() function defined above
df_profs['prof_info'] = df_profs['prof_info'].apply(process_text)

print('The prof_info column contains the combined profile info of each professional.')
df_profs.head()

The prof_info column contains the combined profile info of each professional.


Unnamed: 0,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,tags_followed,prof_info
0,2011-10-05 20:35:19,,9ced4ce7519049c0944147afb75a8ce3,,,2675,,
1,2011-10-05 20:49:21,,f718dcf6d2ec4cb0a52a9db59d7f9e67,,,2675,,
2,2011-10-18 17:31:26,,0c673e046d824ec0ad0ebe012a0673e4,,"New York, New York",2662,consulting education consulting education cons...,consulting education consulting education cons...
3,2011-11-09 20:39:29,,977428d851b24183b223be0eb8619a8c,,"Boston, Massachusetts",2640,,
4,2011-12-10 22:14:44,,e2d57e5041a44f489288397c9904c2b2,,,2609,,


## 4.2. Process the question

Here we are inserting the question into the top row of the prof_info column. This column contains the background info of each professional. Inserting the question at the top of this column will make the cosine comparison code easier to write.
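
For comparison, here is a sketch of a different design that this notebook does not use: fit the vectorizer on the profiles only and transform the question separately, so no extra row is needed. The toy profiles are made up. Note that with this design any word that appears only in the question is not part of the vocabulary, so the scores can differ slightly from those produced by the insert-a-row approach.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy profiles (hypothetical) and a cleaned question
profiles = [
    "retired army officer military",
    "accounting audit",
    "software engineer programming",
]
question = "want become army officer"

# Learn the vocabulary from the profiles only
vect = TfidfVectorizer()
profile_dtm = vect.fit_transform(profiles)

# Encode the question with that same vocabulary
question_vec = vect.transform([question])

# One score per profile
scores = linear_kernel(question_vec, profile_dtm).ravel()
best = int(scores.argmax())

print(scores)
print('Best match:', profiles[best])
```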

In [15]:
# copy a row from df_profs (.copy() gives us an independent dataframe
# that is safe to modify)
df_row1 = df_profs[df_profs.index == 0].copy()
# set all values to nothing
df_row1.loc[:,:] = ''
# reset the index
df_row1 = df_row1.reset_index(drop=True)
    
# Assign the prof_info in this row to be the same as the question.
# We do this because later we will compare this question to all other rows
# in the prof_info column.
df_row1.loc[0, 'prof_info'] = question_text

# Concat df_row1 to df_profs
# The question will be the first row
df_profs = pd.concat([df_row1, df_profs], axis=0).reset_index(drop=True)

print('The Question, in processed form, is now located at the top of the prof_info column.')

df_profs.head(2)

The Question, in processed form, is now located at the top of the prof_info column.


Unnamed: 0,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,tags_followed,prof_info
0,,,,,,,,want become army officer become army officer p...
1,2011-10-05 20:35:19,,9ced4ce7519049c0944147afb75a8ce3,,,2675.0,,


## 4.3. Vectorize the data

In [16]:

# Select the data we want to use. 
# This column has our new question at the top.
data = df_profs['prof_info']

# instantiate vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5)

# learn the 'vocabulary' of the data
vect.fit(data)

# Transform the data into a document term matrix.
# Keep in mind that the output type is a sparse matrix.
prof_dtm = vect.transform(data)

#prof_dtm.shape

In [17]:
# check what features have been created
#vect.get_feature_names()

## 4.4. Calculate the Cosine Similarity
We are calculating the similarity of your question to each professional's background info. This profile is made up of a professional's headline, industry and the tags that he or she follows.

Because TfidfVectorizer L2-normalizes each row by default (norm='l2'), every vector already has unit length. Therefore, in order to get the cosine similarity we only need to take the dot product. Scikit-learn calls this dot product the linear kernel.
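
The reason the dot product is enough: cosine similarity is the dot product divided by the product of the two vector lengths, and for unit-length vectors that product is 1. A small numeric check, using two arbitrary vectors chosen for illustration:

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Two arbitrary example vectors
u = np.array([3.0, 4.0, 0.0])
v = np.array([1.0, 2.0, 2.0])

# Scale each vector to unit length (this is what L2 normalization does)
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)

full_cosine = cosine(u, v)
dot_of_units = float(np.dot(u_unit, v_unit))

print(full_cosine)
print(dot_of_units)
```

Both numbers agree, up to floating point rounding.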

In [18]:
# https://stackoverflow.com/questions/12118720/
# python-tf-idf-cosine-to-find-document-similarity

# prof_dtm[0:1] This selects the first row of prof_info column.
# We are saying: Tell me how similar every row is to the first row.
cosine_similarities = linear_kernel(prof_dtm[0:1], prof_dtm)

# The line of code commented out below would give us the cosine similarity score
# of every row to every other row, just like a correlation matrix.
# But there's no need for this and the RAM needed for this calculation
# would cause this kernel to crash.

# cosine_similarities = linear_kernel(prof_dtm, prof_dtm)

# Quick check: The first value should be 1.0 because it's the
# comparison of the question to itself.
cosine_similarities

array([[1., 0., 0., ..., 0., 0., 0.]])

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [19]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# get the column names from df_profs
cols = list(df_profs['professionals_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Add the professionals id values as a new column.
# This is identical to answers_author_id.
#df_cosine_matrix['answers_id'] = df_train['answers_author_id']

# set the answers_id column as the index
#df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_prof_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_prof_id',axis=0, ascending=False)

# check the top 10 cosine scores
df.head(10)

Unnamed: 0,cosine_score_for_each_prof_id
question_cosine_score,1.0
6847e217f0d942a5b7c492131e47aa84,0.331196
2d0e6cc6f02d40f0804698c51fc4d583,0.284776
cb7141606c7b4a00ab42547b55091978,0.271799
1a67257ce4164a27a83840007c0903ce,0.27083
ffdcc03fc51c4621819601fa36acd354,0.229441
1713e8b9fe3b471e84567b5c0a2c4b45,0.226505
e298ba34062748459040f021142030a9,0.208444
5f77e0a2c3a144dda336df2294a64530,0.205595
8c4ce39ccdb34558af9419673e378d22,0.183295


## 4.5. Select the Professionals
Here we'll select those professionals whose background info has a cosine similarity (to the question) that is greater than or equal to a threshold. I established this threshold by trial and error. I asked several questions and looked for a cosine similarity value below which the recommended professionals were no longer relevant to the question. 0.13 seems to be a reasonable value for this data.
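
A small sketch of what this kind of trial and error can look like: try a few thresholds and see how many professionals each one lets through. The scores and ids below are made up for illustration; only the 0.13 value comes from this notebook.

```python
import pandas as pd

# Hypothetical cosine scores, indexed by professional id
scores = pd.Series({
    'prof_a': 0.33,
    'prof_b': 0.21,
    'prof_c': 0.14,
    'prof_d': 0.05,
    'prof_e': 0.00,
})

# Try a few candidate thresholds and record who is selected
selected_by_threshold = {}
for threshold in (0.05, 0.13, 0.20):
    selected = scores[scores >= threshold]
    selected_by_threshold[threshold] = list(selected.index)
    print(threshold, len(selected), list(selected.index))
```

A lower threshold selects more professionals but lets in more false positives. After each run, reading the profiles of the borderline selections shows whether the threshold is too loose.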

In [20]:
# Set the cosine similarity threshold
MODEL_1_THRESHOLD = 0.13

# keep only the rows that have a cosine_score >= THRESHOLD
df = df[df['cosine_score_for_each_prof_id'] >= MODEL_1_THRESHOLD]

# remove the first row because this row is the question we asked
df = df[1:]

num_professionals = len(df)

print('Number of professionals chosen: ', num_professionals)

print('This is a sample of the professionals the model has selected.')

# Print the id's of the professionals who have been 
# selected as well as the associated cosine scores
df.head(10)

Number of professionals chosen:  21
This is a sample of the professionals the model has selected.


Unnamed: 0,cosine_score_for_each_prof_id
6847e217f0d942a5b7c492131e47aa84,0.331196
2d0e6cc6f02d40f0804698c51fc4d583,0.284776
cb7141606c7b4a00ab42547b55091978,0.271799
1a67257ce4164a27a83840007c0903ce,0.27083
ffdcc03fc51c4621819601fa36acd354,0.229441
1713e8b9fe3b471e84567b5c0a2c4b45,0.226505
e298ba34062748459040f021142030a9,0.208444
5f77e0a2c3a144dda336df2294a64530,0.205595
8c4ce39ccdb34558af9419673e378d22,0.183295
314e0b063a204c1ca05c27406874ee49,0.16834


Here we create a python list containing the id's of each professional selected.

In [21]:
# reset the index
df.reset_index(inplace=True)

# rename the columns
new_names = ['prof_id', 'cosine_score_for_each_prof_id']
df.columns = new_names

# create a list with all professional id values from df
prof_list = list(df['prof_id'])

# display the list
#prof_list

## 4.6. Why did the model select these professionals?

Next we'll print out the background info of the selected professionals. By looking at this we'll be able to tell if a professional was a reasonable choice or a bad choice. 

**First, let's print the question again.**

In [22]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

Question id:  eb80205482e4424cad8f16bc25aa2d9c
Question Title:  I want to become an army officer. What can I do to become an army officer?


Question Body:
  I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


**Now let's print the background info. You'll need to scroll through the output. When looking at the printout in a forked kaggle kernel you could mistakenly think that what you see is all there is. Scroll to see more.**<br>
<a id='model_1_prof_printout'></a>

In [23]:
# Print the profiles of the professionals who can answer this question. 
# Note: If you are running this kernel you may need to scroll the output otherwise
# you might mistakenly think that the text shown is all there is.

print('\n')
print('Model 1')
print('Number of professionals selected: ', len(prof_list))
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')

# set the index of df_profs to be the professional id
df_profs = df_profs.set_index('professionals_id')

# Create an empty list to store the id's of the professionals
# that Model 1 has selected.
model_1_list = []


for prof_id in prof_list:
    
    print('\n')
    
    # print the professional's id (i.e. their name)
    # and add it to the list of Model 1 selections
    
    print('==> Professional id: ', prof_id)
    model_1_list.append(prof_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the tags that are followed
    tags = df_profs.loc[prof_id,'tags_followed']
    print('==Tags being followed:\n',tags)
    



Model 1
Number of professionals selected:  21
== Printing info on each professional who was selected ==


==> Professional id:  6847e217f0d942a5b7c492131e47aa84
Title:  HR Business Partner | HR Manager | HR Consultant | Retired Army Officer 
Industry:  Human Resources
==Tags being followed:
 human-resources military military-service army army-officer arm armed-forces profes professional-development 


==> Professional id:  2d0e6cc6f02d40f0804698c51fc4d583
Title:  Sr. Marketing Manager
Industry:  entertainment
==Tags being followed:
 military pilot entertainment army army-officer military-pilot helicopter-pilot united-states-army 


==> Professional id:  cb7141606c7b4a00ab42547b55091978
Title:  Assurance Associate
Industry:  Accounting
==Tags being followed:
 accounting college-advice college-selection careers changing-careers army-officer army armed-forces veterans audit auditing leadership physical-security 


==> Professional id:  1a67257ce4164a27a83840007c0903ce
Title:  Registered

What do you think? Just by looking at their backgrounds would you have chosen these professionals to answer your question?

This is only a preliminary list. Once we have the recommendations from all four models we'll make a final selection of professionals based on who is most likely to respond to an email containing the question. 

These are the id's of the professionals that Model_1 has selected:

In [24]:
model_1_list

['6847e217f0d942a5b7c492131e47aa84',
 '2d0e6cc6f02d40f0804698c51fc4d583',
 'cb7141606c7b4a00ab42547b55091978',
 '1a67257ce4164a27a83840007c0903ce',
 'ffdcc03fc51c4621819601fa36acd354',
 '1713e8b9fe3b471e84567b5c0a2c4b45',
 'e298ba34062748459040f021142030a9',
 '5f77e0a2c3a144dda336df2294a64530',
 '8c4ce39ccdb34558af9419673e378d22',
 '314e0b063a204c1ca05c27406874ee49',
 '5a9bb75b7e6345208fa3e82b39747253',
 'bbbfccbc65b641a3982d84323db7b45a',
 '7f7ab25253504a37856b49835af2ebb6',
 'e3e0ef78006d440c892e26cfca963dd0',
 'f9eed631e47b4a2f8a9791c2b89e47c7',
 '8963293a4dbe49619114a7cb7d76fa51',
 'ee22a21283674d348b90843f88e10f7d',
 'ed9c89c3bc6c4f48a34ec1ea1e654a8c',
 '8d3b320a845b424eae433585dc732b4b',
 '12e8dea2f3534f6dbc31701cf64f9a4f',
 'ba78827657f94f50a8487126f95168ef']

Model 1 gives every professional in the dataset a chance to be selected, including those who've joined recently and haven't yet answered any questions. However, the three models that follow will only consider professionals who have answered past questions. 

Also, Model 1 uses the hashtags that professionals follow. The other three models don't use these hashtags.

<hr>
<a id='Model_2'></a>

## 5. Model 2 - Answers, Tfidf and Cosine Similarity

This model only considers professionals who've answered a past question.

**How does this model work?**

This model compares your question to every answer in the dataset.

> **The idea is that if your question is similar to a past answer then there's a good chance that the professional who gave that answer will be able to answer your question.**
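
To make the idea concrete, here is a minimal sketch on a made-up four-document corpus. The toy texts and the names `toy_docs`, `toy_vect`, `toy_dtm` and `toy_scores` are assumptions for illustration only; the real model runs the same steps on the `answers_text` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy corpus: row 0 is the new question, the remaining rows are past answers.
toy_docs = [
    "want become army officer",                # the question
    "join army rotc program become officer",   # relevant answer
    "study accounting become auditor",         # mostly unrelated answer
    "maths teacher needs teaching degree",     # unrelated answer
]

toy_vect = TfidfVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# TfidfVectorizer L2-normalizes each row by default, so the plain dot
# product returned by linear_kernel equals the cosine similarity.
toy_scores = linear_kernel(toy_dtm[0:1], toy_dtm).flatten()

# Rank the past answers (skip index 0, which is the question itself).
best_answer = int(toy_scores[1:].argmax()) + 1

print(toy_scores.round(3))
print("Most similar past answer:", toy_docs[best_answer])
```

The answer that shares the most words with the question gets the highest score, and an answer with no words in common scores zero.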


**How can we assess how well the model is working?**

At the end the model will print both the past answer that was matched, and the profile of the professional who gave that answer. By reading this information you'll be able to judge whether that professional was a good choice to answer your question. 

## 5.1. Load the data

In [25]:
# load df_qa_prof
df_qa_prof = pickle.load(open(path_1,'rb'))
# load df_professionals
df_professionals = pickle.load(open(path_2,'rb'))


print(df_qa_prof.shape)
print(df_professionals.shape)

(51123, 18)
(28254, 6)


## 5.2. Process the Question

In [26]:
# copy the first row of df_qa_prof (.copy() avoids modifying a view)
df_row2 = df_qa_prof[df_qa_prof.index == 0].copy()
# set all values to nothing
df_row2.loc[:,:] = ''
# reset the index
df_row2 = df_row2.reset_index(drop=True)
    
# Set answers_text in this row to the question text.
# We do this because later we will compare this question to all other rows
# in the answers_text column.
df_row2.loc[0, 'answers_text'] = question_text

# Concat df_row2 to df_qa_prof.
# The question will be the first row.
df_qa_prof = pd.concat([df_row2, df_qa_prof], axis=0).reset_index(drop=True)

df_qa_prof.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,answers_id,answers_author_id,answers_question_id,answers_date_added,answers_body,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,quest_text,answers_text
0,,,,,,,,,,,,,,,,,,want become army officer become army officer p...
1,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,2015-10-19 20:56:49,Assist with Recognizing and Developing Potential,36ff3b3666df400f956f8335cf53e09e,Mental Health Care,"Cleveland, Ohio",1200.0,teacher maths teacher maths teacher useful col...,assist recognizing developing potential mental...


## 5.3. Vectorize the data

In [27]:
# Select the data we want to use. Note we are comparing the question to answers.
# We need to vectorize the answers_text column.
# This column has our new question at the top.
data = df_qa_prof['answers_text']

# instantiate vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5)

# learn the vocabulary of the data
vect.fit(data)

# Transform the data to a document term matrix.
# The output type is a sparse matrix.
prof_dtm = vect.transform(data)

prof_dtm.shape

(51124, 1522504)

## 5.4. Calculate the Cosine Similarity
We are calculating the similarity of your question to all past answers in the dataset.

In [28]:
# https://stackoverflow.com/questions/12118720/
# python-tf-idf-cosine-to-find-document-similarity

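# prof_dtm[0:1] selects the first row of the matrix, i.e. the question.
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# computed by linear_kernel is the cosine similarity.
# We are saying: Tell me how similar every row is to the first row.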
cosine_similarities = linear_kernel(prof_dtm[0:1], prof_dtm)

# The line below would give us the cosine similarity of every row to every other row,
# just like a correlation matrix.
# But there's no need for this and the RAM needed for this calculation
# would cause this kernel to crash.
# cosine_similarities = linear_kernel(prof_dtm, prof_dtm)

# Quick check: The first value should be 1.0 because it's the
# comparison of the question to itself.
cosine_similarities

array([[1.        , 0.00566799, 0.        , ..., 0.        , 0.00264203,
        0.        ]])

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [29]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# get the column names from df_qa_prof
cols = list(df_qa_prof['answers_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Label the single row of scores so it can be used as the index.
# Only the value at index 0 (the blank question row) is aligned here.
df_cosine_matrix['answers_id'] = df_qa_prof['answers_author_id']

# set the answers_id column as the index
df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_answer_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_answer_id',axis=0, ascending=False)

# check the top 20 cosine scores
df.head(20)


Unnamed: 0,cosine_score_for_each_answer_id
question_cosine_score,1.0
f16c227f7c8c4c10b55dd29d425dc48d,0.209113
e9ebfa69514243e19c3790f8118cc820,0.20706
5410c56b923c4a29b88e4ae4d7a93958,0.206751
6962d99d8a71492a98d940b33bcb2787,0.192097
eaa66ef919bc408ab5296237440e323f,0.171462
a898e7728f064b0dae5fed73281ab48b,0.159091
f8b9c41cda914a219895cf2ca26890a2,0.157744
036b8034e2b64d16aa232111deeba63c,0.157687
256450e27f564449885640cda9f7a69c,0.152916


## 5.5. Select the Answers
Here we select the answers that have a cosine similarity that is greater than or equal to a threshold. 

In [30]:
# Set the cosine similarity threshold
MODEL_2_THRESHOLD = 0.1

# keep only the rows that have a cosine_score >= THRESHOLD
df = df[df['cosine_score_for_each_answer_id'] >= MODEL_2_THRESHOLD]

# remove the first row because this row is the question we asked
df = df[1:]

num_answers = len(df)

print('Number of answers chosen: ', num_answers)

print('This is a sample of the answers the model has selected.')

# print the answers that have been selected as well as the associated cosine scores
df.head(10)

Number of answers chosen:  32
This is a sample of the answers the model has selected.


Unnamed: 0,cosine_score_for_each_answer_id
f16c227f7c8c4c10b55dd29d425dc48d,0.209113
e9ebfa69514243e19c3790f8118cc820,0.20706
5410c56b923c4a29b88e4ae4d7a93958,0.206751
6962d99d8a71492a98d940b33bcb2787,0.192097
eaa66ef919bc408ab5296237440e323f,0.171462
a898e7728f064b0dae5fed73281ab48b,0.159091
f8b9c41cda914a219895cf2ca26890a2,0.157744
036b8034e2b64d16aa232111deeba63c,0.157687
256450e27f564449885640cda9f7a69c,0.152916
04473602a91b4c7593cbedb4834cbd02,0.151033


## 5.6. Identify the professionals that gave each answer

Here we will identify the professionals that gave each answer. These will be the professionals that this model thinks are best able to answer your question. 

We start by creating a Python list containing the id's of each answer selected.

In [31]:
# reset the index
df.reset_index(inplace=True)

# rename the columns
new_names = ['answers_id', 'cosine_score_for_each_answer_id']
df.columns = new_names

# create a list with all answer id values from df
answer_list = list(df['answers_id'])

# display the list
#answer_list

**Let's print the question again.**

In [32]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

Question id:  eb80205482e4424cad8f16bc25aa2d9c
Question Title:  I want to become an army officer. What can I do to become an army officer?


Question Body:
  I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


First we'll get the id of the professional who gave each answer. Then we'll print the profile of that professional and their answer. There could be duplicate professionals in this list because the same professional could have given several past answers.<br>

<a id='model_2_prof_printout'></a>

In [33]:
# Print info on the professionals who can answer this question

#print('\n')
print('Model 2')
print('Number of professionals selected: ', len(answer_list))
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')



# Create an empty list to store the professional id's that are
# associated with the answers that have been selected.
model_2_list = []

# set the index of df_qa_prof to be the answer id
df_qa_prof = df_qa_prof.set_index('answers_id')

for ans_id in answer_list:
    
    # print the professional's id (i.e. their name)
    # get the prof id of the person who wrote the answer
    prof_id = df_qa_prof.loc[ans_id, 'answers_author_id']
    print('\n')
    print('==> Professional id: ', prof_id)
    model_2_list.append(prof_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the answer that they wrote which was similar to the question being asked
    answer = df_qa_prof.loc[ans_id,'answers_body']
    print('==Answer given to similar question:\n',answer)
    

Model 2
Number of professionals selected:  32
== Printing info on each professional who was selected ==


==> Professional id:  3d80e6bbe3e848729a23b1fb59b456c2
Title:  Real Estate Investor. Formerly a Military Strategic Planner and still a Sailor, Communicator WD4USA and Aviator.
Industry:  Executive Office
==Answer given to similar question:
 Preethi, 
It appears to me that you are interested in becoming an officer in the U.S. Army.  I retired from the U.S. Army as a Lieutenant Colonel after 30 years of service.   There are at least two ways for you to become an Army officer.

After you finish high school,  you should attend a university in the U.S. that has an Army R.O.T.C. (Reserve Officers Training Corps) program.  In this way, you work towards your degree and meeting your Army Officer training simultaneously.  You'll then be commissioned at the same time you receive your Bachelors Degree.  There are ROTC scholarships for which you can apply too.  Start the application process in 

We now have a list of professionals that model 2 has chosen to answer your question. Based on their past history printed above, would you say that your question is relevant to them?

Once again this is just a preliminary list. Later we'll keep only those professionals who are most likely to actually submit an answer.

These are the id's of the professionals that Model_2 has selected:

In [34]:
# uncomment the next line to print the list of professional id's

# model_2_list
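
Because one professional may have written several of the matched answers, the list can contain repeated ids. If a list of unique ids is needed, one possible approach (a sketch, not necessarily what the later steps of this kernel do) is shown below. The ids in `example_ids` are hypothetical stand-ins for the contents of `model_2_list`.

```python
# Hypothetical ids standing in for the contents of model_2_list.
example_ids = ["prof_a", "prof_b", "prof_a", "prof_c", "prof_b"]

# dict.fromkeys keeps the first occurrence of each key, and dicts
# preserve insertion order (Python 3.7+), so the ranking is retained.
unique_ids = list(dict.fromkeys(example_ids))

print(unique_ids)
```

Unlike `set(example_ids)`, this keeps the professionals in order of their best-matching answer.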

<hr>
<a id='Model_3'></a>

## 6. Model 3 - Answers, Tfidf, TruncatedSVD and Cosine Similarity


Singular Value Decomposition (SVD) is commonly understood as a dimensionality reduction technique. However, it can also be seen as a way of creating a new set of features. These are called latent features. It's not clear what each latent feature represents but they're very effective in capturing the essence of the data. The previous model used more than 1.5 million features when calculating cosine similarity. This model will use 200.

The workflow is almost identical to model 2. The difference is that we'll take the output from tfidf and transform it using TruncatedSVD.
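
The workflow can be sketched on a tiny made-up corpus as follows. The toy texts, the variable names and the choice of 2 components are assumptions for illustration; the real model uses 200 components on the full `answers_text` column.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = [
    "want become army officer",                    # the question
    "join army rotc program become officer",
    "army officer training after college degree",
    "study accounting become auditor",
    "accounting degree helps audit career",
    "maths teacher needs teaching degree",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Reduce the sparse tf-idf matrix to 2 latent features.
toy_svd = TruncatedSVD(n_components=2, random_state=101)
toy_latent = toy_svd.fit_transform(toy_tfidf)

# The rows of the SVD output are not unit length, so use
# cosine_similarity rather than a plain dot product.
toy_scores = cosine_similarity(toy_latent[0:1], toy_latent).flatten()

print(toy_tfidf.shape, "->", toy_latent.shape)
print(toy_scores.round(3))
```

Each document is now described by 2 dense numbers instead of one sparse column per word, and similarity is measured in that smaller space.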


## 6.1. Load the data

In [35]:
# load df_qa_prof
df_qa_prof = pickle.load(open(path_1,'rb'))
# load df_professionals
df_professionals = pickle.load(open(path_2,'rb'))


print(df_qa_prof.shape)
print(df_professionals.shape)

(51123, 18)
(28254, 6)


## 6.2. Process the Question

In [36]:
# copy the first row of df_qa_prof (.copy() avoids modifying a view)
df_row2 = df_qa_prof[df_qa_prof.index == 0].copy()
# set all values to nothing
df_row2.loc[:,:] = ''
# reset the index
df_row2 = df_row2.reset_index(drop=True)
    
# Set answers_text in this row to the question text.
# We do this because later we will compare this question to all other rows
# in the answers_text column.
df_row2.loc[0, 'answers_text'] = question_text

# Concat df_row2 to df_qa_prof.
# The question will be the first row.
df_qa_prof = pd.concat([df_row2, df_qa_prof], axis=0).reset_index(drop=True)

df_qa_prof.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,answers_id,answers_author_id,answers_question_id,answers_date_added,answers_body,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,quest_text,answers_text
0,,,,,,,,,,,,,,,,,,want become army officer become army officer p...
1,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,2015-10-19 20:56:49,Assist with Recognizing and Developing Potential,36ff3b3666df400f956f8335cf53e09e,Mental Health Care,"Cleveland, Ohio",1200.0,teacher maths teacher maths teacher useful col...,assist recognizing developing potential mental...


## 6.3. Vectorize the data

In [37]:
# Select the data we want to use. Note we are comparing the question to answers.
# We need to vectorize the answers_text column.
# This column has our new question at the top.
data = df_qa_prof['answers_text']

# instantiate vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5)

# learn the vocabulary of the data
vect.fit(data)

# transform the data to a document term matrix
prof_dtm = vect.transform(data)

prof_dtm.shape

(51124, 1522504)

## 6.4. Transform prof_dtm using TruncatedSVD

In [38]:
from sklearn.decomposition import TruncatedSVD

# Initialize
tsvd = TruncatedSVD(n_components=200, random_state=101)

# Fit
tsvd.fit(prof_dtm)

# Transform
# This returns a dense NumPy array, not a sparse matrix as with tfidf.
prof_dtm = tsvd.transform(prof_dtm)

prof_dtm.shape

(51124, 200)

Let's put the output into a dataframe so we can see what it is. Initially there were 1,522,504 features. Now there are just 200 features. The number of rows is still 51,124.

In [39]:
# create a dataframe
df = pd.DataFrame(prof_dtm)

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199
0,0.024012,-0.005233,-0.009748,-0.005342,0.002909,0.005126,-0.010004,-0.002907,-0.000924,-0.006485,0.008983,0.011888,-0.016142,0.002696,0.014321,-0.012303,-0.008312,-0.010187,-0.000344,-0.000979,0.013771,0.00179,-0.003558,0.003595,0.006147,0.010667,0.007663,-0.00971,0.008002,0.011751,0.008605,1.5e-05,0.012617,-0.025816,0.015445,-0.025654,0.002745,-0.012761,0.006431,0.022957,...,-0.008735,-0.00121,0.025941,0.000136,0.00146,0.018899,-0.008228,-0.03237,0.015502,0.007517,0.03322,0.016331,0.029766,0.005713,-0.017217,0.012513,-0.001755,-0.013422,-0.005775,0.016536,-0.037061,0.010874,0.039462,0.021333,-0.014037,0.006142,0.011391,-0.021098,-0.016897,0.027584,-0.014076,0.003823,0.008722,-0.006744,0.011348,-0.01128,-0.023942,-0.02238,0.00587,0.00735
1,0.091275,0.066991,0.001929,0.053959,0.032376,0.177628,0.075062,0.009162,-0.048245,0.080048,-0.045129,0.004902,-0.013384,-0.007065,-0.026235,-0.007314,0.026981,0.010917,0.052557,0.008371,0.005669,-0.054713,0.017955,-0.04494,-0.0006,-0.004533,-0.03424,-0.054853,0.001047,0.035031,0.025327,-0.026025,0.011667,0.023136,0.013855,-0.007803,-0.001012,-0.015788,-0.024928,0.006101,...,0.00046,-0.002158,-0.001045,0.001147,-0.010277,0.011202,-0.00846,0.007301,0.006969,0.0073,0.014034,-0.005858,-0.009415,-0.012237,-0.00278,-0.002628,0.025951,0.000349,0.002894,0.001052,0.013626,0.007154,-0.01128,0.007826,-0.012368,-0.015932,-0.007734,0.00147,0.012151,-0.011608,0.006183,0.006614,-0.00017,-0.004091,0.001745,0.003296,0.001353,-0.006141,-0.001645,0.003551
2,0.132012,0.072808,-0.004845,0.065047,0.037348,0.207049,0.072997,0.046877,-0.06,0.090972,-0.084411,0.025868,-0.016241,-0.001757,-0.020689,-0.007035,0.018215,0.004865,-0.001775,-0.033415,-0.013262,-0.049306,0.03238,0.009741,-0.009819,0.01528,0.031648,-0.008183,0.00223,0.016318,0.016858,-0.010464,-0.00571,-0.000111,-0.006515,-0.006015,-0.015076,0.015498,-0.001207,-0.008783,...,0.001121,0.052119,-0.027918,0.004661,-0.014729,0.039925,-0.024854,-0.005238,-0.001309,0.021034,0.024079,-0.02535,-0.033112,0.008616,-0.006941,0.017449,-0.004575,0.003917,-0.016406,0.000977,-0.011891,-0.005826,-0.004835,0.012854,-0.028004,-0.01125,0.036943,-0.00758,0.025207,0.021427,0.035351,-0.001291,0.008332,-0.001758,0.012118,-0.033493,-0.00028,-0.034146,0.007145,0.021763
3,0.249143,0.291359,0.088074,0.003244,0.027115,0.40381,0.291359,-0.284397,0.475649,-0.386703,0.240129,0.086612,0.016843,0.038012,0.0435,0.055217,-0.059609,0.026254,-0.103907,-0.046098,0.000173,0.068979,-0.020511,0.038844,-0.007911,0.009164,0.008877,-0.020112,-0.010487,0.002063,-0.008487,-0.002851,0.001319,0.018081,-0.00201,0.005785,-0.004125,0.001761,0.003083,0.015351,...,-0.002996,0.00545,-0.005741,-0.007909,0.007214,0.009298,0.003164,0.014428,-0.008382,0.002802,-0.010361,0.002144,0.000279,0.00949,0.004151,-0.000868,-0.010834,-0.009119,0.008648,-0.00236,-0.002897,-7e-06,0.01367,-0.003865,-0.005843,-0.003034,0.020443,0.006594,0.006951,0.013695,0.006646,0.007947,-0.015653,0.003742,0.008901,0.005721,-0.003901,-0.005946,-0.009372,-0.018388
4,0.345648,0.566397,0.229726,0.031798,-0.038559,-0.134735,-0.047582,0.033388,-0.02961,0.012116,-0.008063,-0.006647,-0.006184,0.002452,0.005754,0.005784,-0.006474,0.005925,-0.014835,-0.019913,0.005958,-0.015204,-0.004945,-0.000229,-0.004074,-0.010286,0.00403,-0.003012,-0.000403,0.004274,0.001178,-0.00365,0.011628,0.007091,-0.000875,0.00436,-0.004776,-0.003476,-0.000628,-0.006347,...,-0.005799,0.002683,-0.005213,0.005423,0.009991,0.011323,-0.001511,-0.003799,0.00209,0.002235,-0.00912,-0.008038,0.001841,0.0198,-0.00541,0.000212,-0.009536,0.002728,-0.001387,0.003661,-0.00761,-0.00822,-0.001343,0.015161,0.009678,-0.002317,0.007576,-0.003283,-0.001285,0.010946,-0.011389,-0.010501,-0.002826,0.005568,0.009961,0.011338,0.004947,-0.00541,-0.011453,-0.014694


## 6.5. Calculate the Cosine Similarity
We are calculating the similarity of your question to all past answers in the dataset.
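
A plain dot product only equals cosine similarity when both vectors have unit length. Tf-idf rows are L2-normalized by default, but rows produced by TruncatedSVD are not. The small numeric sketch below shows the difference; the vectors `a` and `b` and the helper `cosine` are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([3.0, 4.0])   # length 5
b = np.array([6.0, 8.0])   # same direction, length 10

raw_dot = float(np.dot(a, b))   # grows with vector length
cos_ab = cosine(a, b)           # depends on direction only

# After scaling both vectors to unit length the two measures agree.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_dot = float(np.dot(a_unit, b_unit))

print(raw_dot, cos_ab, unit_dot)
```

The raw dot product is 50, while the cosine similarity and the dot product of the unit vectors are both about 1, since the two vectors point in the same direction.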

In [40]:
# https://stackoverflow.com/questions/12118720/
# python-tf-idf-cosine-to-find-document-similarity

# prof_dtm[0:1] selects the first row, i.e. the question. After TruncatedSVD
# prof_dtm is a dense NumPy array, and the [0:1] slice keeps it 2D.
# We are saying: Tell me how similar every row is to the first row.
# Note that we are using cosine_similarity here and not linear_kernel,
# because the rows of the SVD output are not normalized to unit length.
cosine_similarities = cosine_similarity(prof_dtm[0:1], prof_dtm)

# The line below would give us the cosine similarity of every row to every other row,
# just like a correlation matrix.
# But there's no need for this and the RAM needed for this calculation
# would cause this kernel to crash.
# cosine_similarities = cosine_similarity(prof_dtm, prof_dtm)

# Quick check: The first value should be 1.0 because it's the
# comparison of the question to itself.
cosine_similarities

array([[ 1.        , -0.00602565,  0.05804452, ...,  0.01112067,
         0.02519041, -0.00443636]])

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [41]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# get the column names from df_qa_prof
cols = list(df_qa_prof['answers_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Label the single row of scores so it can be used as the index.
# Only the value at index 0 (the blank question row) is aligned here.
df_cosine_matrix['answers_id'] = df_qa_prof['answers_author_id']

# set the answers_id column as the index
df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_answer_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_answer_id',axis=0, ascending=False)

# check the top 20 cosine scores
df.head(20)


Unnamed: 0,cosine_score_for_each_answer_id
question_cosine_score,1.0
ab662ff2b8f140cc95684c91e5b7e1d4,0.882289
daa59219e113448380b88e2f5d750cfa,0.824792
1b3fb9d89f094e5da67631da462a112b,0.819852
19faf6e6859d4b02bd702b03dd393cf6,0.782701
e1a32bcb80ba46ce9c0bd5b50df21fc9,0.775577
639478160b7648509e3c1cb025031796,0.758128
3fbca8e88b2f4eedbf70b91ce02cf336,0.756315
2581fe16c5274d30858279b7a7aa151b,0.747558
5ad1524b55b644bd8a860bfecdc9e03e,0.741275


## 6.6. Select the Answers
Here we select the answers that have a cosine similarity that is greater than or equal to a threshold. 

In [42]:
# Set the cosine similarity threshold
MODEL_3_THRESHOLD = 0.65

# keep only the rows that have a cosine_score >= THRESHOLD
df = df[df['cosine_score_for_each_answer_id'] >= MODEL_3_THRESHOLD]

# remove the first row because this row is the question we asked
df = df[1:]

num_answers = len(df)

print('Number of answers chosen: ', num_answers)

print('This is a sample of the answers the model has selected.')

# print the answers that have been selected as well as the associated cosine scores
df.head(10)

Number of answers chosen:  46
This is a sample of the answers the model has selected.


Unnamed: 0,cosine_score_for_each_answer_id
ab662ff2b8f140cc95684c91e5b7e1d4,0.882289
daa59219e113448380b88e2f5d750cfa,0.824792
1b3fb9d89f094e5da67631da462a112b,0.819852
19faf6e6859d4b02bd702b03dd393cf6,0.782701
e1a32bcb80ba46ce9c0bd5b50df21fc9,0.775577
639478160b7648509e3c1cb025031796,0.758128
3fbca8e88b2f4eedbf70b91ce02cf336,0.756315
2581fe16c5274d30858279b7a7aa151b,0.747558
5ad1524b55b644bd8a860bfecdc9e03e,0.741275
0214f6fd41e546a991a06b3f622c3c1c,0.741232


## 6.7. Identify the professionals that gave each answer

Here we will identify the professionals that gave each answer. These will be the professionals that this model thinks are best able to answer your question. 

We start by creating a Python list containing the id's of each answer selected.

In [43]:
# reset the index
df.reset_index(inplace=True)

# rename the columns
new_names = ['answers_id', 'cosine_score_for_each_answer_id']
df.columns = new_names

# create a list with all answer id values from df
answer_list = list(df['answers_id'])

# display the list
#answer_list

**Let's print the question again.**

In [44]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

Question id:  eb80205482e4424cad8f16bc25aa2d9c
Question Title:  I want to become an army officer. What can I do to become an army officer?


Question Body:
  I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


We'll get the id of the professional who gave each answer. Then we'll print the profile of that professional and their answer. There could be duplicate professionals in this list because the same professional could have given several past answers.<br>

In [45]:
# Print info on the professionals who can answer this question

#print('\n')
print('Model 3')
print('Number of professionals selected: ', len(answer_list))
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')



# Create an empty list to store the professional id's that are
# associated with the answers that have been selected.
model_3_list = []

# set the index of df_qa_prof to be the answer id
df_qa_prof = df_qa_prof.set_index('answers_id')

for ans_id in answer_list:
    
    # print the professional's id (i.e. their name)
    # get the prof id of the person who wrote the answer
    prof_id = df_qa_prof.loc[ans_id, 'answers_author_id']
    print('\n')
    print('==> Professional id: ', prof_id)
    model_3_list.append(prof_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the answer that they wrote which was similar to the question being asked
    answer = df_qa_prof.loc[ans_id,'answers_body']
    print('==Answer given to similar question:\n',answer)

Model 3
Number of professionals selected:  46
== Printing info on each professional who was selected ==


==> Professional id:  9990d7c9c7374381a6b5a79e5868f85b
Title:  College Student/Military
Industry:  
==Answer given to similar question:
 <p>It depends on whether or not you want to be enlisted or an officer. If you want to go the enlisted route there are many online schools that offer a great education to military personal or you can also join the national guard if you would rather go to school in your state and still serve your country. On the other hand, you if you want to be an officer I would advise you to follow one of the suggestions that Jenn pointed out because they are both very good options for someone who wants to become an officer.</p>


==> Professional id:  5c48a65be129431eb2669079b7941c21
Title:  Officer in Charge, Naval Oceanography and Anti-submarine Warfare Detachment San Diego 
Industry:  Military
==Answer given to similar question:
 <p>Devetra,</p>
<p>If you wan

These are the id's of the professionals that Model_3 has selected:

In [46]:
# uncomment the next line to print the list of professional id's

# model_3_list

Is your question relevant to these professionals? 

In the next model we're going to add some AI magic in the form of pre-trained word embeddings.

<hr>
<a id='Model_4'></a>

## 7. Model 4 - Answers, GloVe Embeddings and Cosine Similarity

Pre-trained word embeddings are dense vectors that have been trained on text corpora containing millions of words. Tfidf encodes word frequency, but embedding vectors encode the "meaning" of words and can capture analogies such as: man is to woman as king is to queen. In other words, embedding vectors help a model recognize that the words man, woman, king and queen are all gender related, and that king and queen are both royalty.

In addition to trying to create relevant matches, here I'm using embedding vectors to increase the diversity of answers (more viewpoints) that students receive. For example, if a student asks how to become a film star, it would be good if that student also received advice from people involved in theatre. This would be possible because the model would know that the words film and theatre are closely related.

This model will encode words using pre-trained GloVe word embeddings. We will use 200-dimensional English word vectors. These were pre-trained on the combined Wikipedia 2014 + Gigaword 5th Edition corpora (6B tokens, 400K vocab). Because GloVe vectors are available as a Kaggle dataset I've simply imported them into this kernel.

The embedding vector length is 200. For each answer we'll consider only the first 500 words (max_length = 500). Shorter answers will be padded with zeros. For long answers, all words beyond the first 500 will be thrown away. To create a vector for a given answer, we will average all the 200-dimensional word vectors that make up that answer.

Again, this model will compare your question to all answers in the dataset. If your question is similar to a past answer then there's a good chance that the professional who gave that answer will be able to answer your question.
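
The averaging step described above can be sketched as follows. The 3-dimensional vectors in `toy_embeddings` are invented for illustration (the real GloVe vectors have 200 dimensions), and skipping words that have no vector is an assumption of this sketch rather than a statement about how the kernel handles them.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; the real GloVe vectors used
# in this kernel have 200 dimensions.
toy_embeddings = {
    "army":    np.array([0.9, 0.1, 0.0]),
    "officer": np.array([0.7, 0.3, 0.0]),
    "film":    np.array([0.0, 0.2, 0.9]),
}
EMBED_DIM = 3

def average_vector(words, embeddings, dim):
    """Average the vectors of the words that have an embedding.

    Words without a vector are skipped (an assumption made for this
    sketch); a document with no known words maps to all zeros.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

doc_vector = average_vector("want become army officer".split(),
                            toy_embeddings, EMBED_DIM)
empty_vector = average_vector(["unknown"], toy_embeddings, EMBED_DIM)

print(doc_vector)
print(empty_vector)
```

Every document, whatever its length, ends up as a single vector of the same size, which is what makes the cosine comparison between the question and each answer possible.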


## 7.1. Define the document corpus


> The answers_text column in dataframe df_qa_prof will be the corpus of documents that we'll use to create this model. Whenever we refer to a document corpus, this is the column that we'll be referring to.


## 7.2. Load the Data

In [47]:
# load df_qa_prof
df_qa_prof = pickle.load(open(path_1,'rb'))
# load df_professionals
df_professionals = pickle.load(open(path_2,'rb'))


print(df_qa_prof.shape)
print(df_professionals.shape)

(51123, 18)
(28254, 6)


In [48]:
# We will use GloVe vectors that have a standard length of 200
EMBED_LENGTH = 200

## 7.3. Process the question

In [49]:
# copy the first row of df_qa_prof (.copy() avoids modifying a view)
df_row3 = df_qa_prof[df_qa_prof.index == 0].copy()
# set all values to nothing
df_row3.loc[:,:] = ''
# reset the index
df_row3 = df_row3.reset_index(drop=True)
    
# Set answers_text in this row to the question text.
# We do this because later we will compare this question to all other rows
# in the answers_text column.
df_row3.loc[0, 'answers_text'] = question_text

# Concat df_row3 to df_qa_prof.
# The question will be the first row.
df_qa_prof = pd.concat([df_row3, df_qa_prof], axis=0).reset_index(drop=True)


## 7.4. Pre-process the data

In [50]:
# Create a new column showing the length, in characters, of each raw answer
# (answers_body, which still contains HTML)
df_qa_prof['answer_length'] = df_qa_prof['answers_body'].apply(len)

print('The answers_text column is the document corpus.')
df_qa_prof.head(2)

The answers_text column is the document corpus.


Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,answers_id,answers_author_id,answers_question_id,answers_date_added,answers_body,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,num_days_member,quest_text,answers_text,answer_length
0,,,,,,,,,,,,,,,,,,want become army officer become army officer p...,0
1,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,2015-10-19 20:56:49,Assist with Recognizing and Developing Potential,36ff3b3666df400f956f8335cf53e09e,Mental Health Care,"Cleveland, Ohio",1200.0,teacher maths teacher maths teacher useful col...,assist recognizing developing potential mental...,327


## 7.5. Assemble the GloVe Embedding Matrix for our text corpus
We'll use pre-trained GloVe embeddings that are available in Kaggle datasets.

In [51]:
# Create a corpus of documents
corpus_text_list = list(df_qa_prof['answers_text'])


### Tokenize the corpus of documents (i.e. extract the vocabulary)

In [52]:
# Instantiate the tokenizer.
# Note that this is a word tokenizer.
t = Tokenizer()

# create a dictionary where the word is the key and a number is the value
t.fit_on_texts(corpus_text_list)

# How many words are there in our corpus vocabulary?

vocab_size = len(t.word_index)
print('Vocab size: ', vocab_size)

Vocab size:  56046


In [53]:
# These are all the words in the vocabulary of our corpus.
# Each word is assigned an index starting at 1.

t.word_index

{'work': 1,
 'get': 2,
 'college': 3,
 'good': 4,
 'would': 5,
 'time': 6,
 'also': 7,
 'like': 8,
 'want': 9,
 'job': 10,
 'people': 11,
 'many': 12,
 'one': 13,
 'degree': 14,
 'best': 15,
 'make': 16,
 'need': 17,
 'help': 18,
 'take': 19,
 'may': 20,
 'experience': 21,
 'know': 22,
 'education': 23,
 'great': 24,
 'business': 25,
 'go': 26,
 'find': 27,
 'engineering': 28,
 'well': 29,
 'years': 30,
 'hi': 31,
 'field': 32,
 'luck': 33,
 'think': 34,
 'information': 35,
 'computer': 36,
 'way': 37,
 'first': 38,
 'different': 39,
 'important': 40,
 'services': 41,
 'learn': 42,
 'skills': 43,
 'see': 44,
 'major': 45,
 'research': 46,
 'lot': 47,
 'start': 48,
 'management': 49,
 'working': 50,
 'look': 51,
 'program': 52,
 'really': 53,
 'much': 54,
 'things': 55,
 'technology': 56,
 'year': 57,
 'might': 58,
 'science': 59,
 'could': 60,
 'life': 61,
 'even': 62,
 'etc': 63,
 'classes': 64,
 'engineer': 65,
 'new': 66,
 'high': 67,
 'com': 68,
 'www': 69,
 'software': 70,
 'healt

In [54]:
# Add 1 to the number of words in the vocabulary
vocab_size = len(t.word_index) + 1
vocab_size

56047

### Integer encode the corpus documents

Here we are replacing every word with its corresponding index. Each row in our text corpus becomes a separate list of these index values. Take note that the lists have different lengths.

- encoded_docs is a list of lists. [[2, 101, 605], [33, 77],...]
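To make this step concrete, here is a toy stand-in for what the tokenizer does, written in plain Python. It is a simplified sketch and not the Keras implementation: `build_word_index` and `encode` are hypothetical helper names, the three documents are made up, and the real `Tokenizer` also lowercases text and strips punctuation.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for doc in docs for word in doc.split())
    # most_common() sorts by descending count; index 0 is kept free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(docs, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in doc.split() if w in word_index] for doc in docs]

toy_docs = ['army officer college', 'college degree', 'college army']
toy_index = build_word_index(toy_docs)
toy_encoded = encode(toy_docs, toy_index)

print(toy_index)
print(toy_encoded)
```

The most frequent word ('college') gets index 1, and each document turns into a list whose length equals its number of words.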

In [55]:
# convert the text to sequences of numbers
encoded_docs = t.texts_to_sequences(corpus_text_list)

# Print the list of lists
#print(encoded_docs)

### Pad each list so they all have the same length

Here we'll pad each list with zeros so that they all have the same length.
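The sketch below mimics the padding step in plain Python so the effect is easy to see. `pad_post` is a hypothetical helper, not part of Keras. It assumes the behaviour of `pad_sequences` when `padding='post'` is set and `truncating` is left at its default of `'pre'`: short lists get zeros at the end, and lists longer than `maxlen` lose their first items.

```python
def pad_post(seq, maxlen, value=0):
    """Pad at the end with `value`; if too long, keep the LAST `maxlen` items.

    Assumes maxlen > 0.
    """
    seq = list(seq)[-maxlen:]                      # mimics truncating='pre'
    return seq + [value] * (maxlen - len(seq))     # mimics padding='post'

toy_sequences = [[2, 3, 1], [1, 4], [5, 6, 7, 8, 9]]
toy_padded = [pad_post(s, maxlen=4) for s in toy_sequences]

print(toy_padded)
```

If the beginning of a long answer matters more than its end, `pad_sequences` also accepts `truncating='post'`.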

In [56]:

# Let's look at the text lengths to decide what max_length to use
print('Min length: ', df_qa_prof['answer_length'].min())
print('Max length: ',df_qa_prof['answer_length'].max())
print('Mean length: ',df_qa_prof['answer_length'].mean())
print('Median length: ',df_qa_prof['answer_length'].median())
print('Mode lengths: ',df_qa_prof['answer_length'].mode()) # value that appears most often

# Set the max_length 
max_length = 500

# Pad each list so they all have the same length
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

# (num_answers, max_length)
padded_docs.shape  

Min length:  0
Max length:  35695
Mean length:  893.0522259604099
Median length:  641.0
Mode lengths:  0    1383
dtype: int64


(51124, 500)

### Create a GloVe embedding matrix specific to our corpus vocab

In [57]:
# source: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

# We will use pre-trained GloVe embedding vectors from Kaggle Datasets that
# have been imported into this kernel.
# https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation

# Load the pre-trained GloVe vectors
# Set the path to glove.6B.200d.txt
path = '../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt'

embeddings_index = dict()

# Open the file as UTF-8. The with-block closes it automatically.
with open(path, encoding='utf-8') as f:
    for line in f:
        # Note: use split(' ') instead of split() if you get an error.
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix
embedding_matrix = np.zeros((vocab_size, EMBED_LENGTH))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print('The result is a matrix of embeddings.')
print('Words are the rows, the features are the columns.')

# The result is a matrix of embeddings only for words in our data.
# Words are the rows, the features are the columns.

Loaded 400000 word vectors.
The result is a matrix of embeddings.
Words are the rows, the features are the columns.


Let's put the embedding matrix into a dataframe so we can see more clearly what it is. This matrix holds one GloVe vector for each word in our corpus vocabulary, with one row per word index. Words that GloVe does not cover keep an all-zero row. Later we'll use these rows to encode each answer.
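Because words that are missing from GloVe end up as all-zero rows, it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the idea on made-up data: `fake_glove_file` imitates two lines of the GloVe text format and `toy_vocab` is a hypothetical corpus vocabulary. In the notebook the same check could be run with `embeddings_index` and `t.word_index`.

```python
import io

# Two made-up lines in the GloVe text format: a word followed by its vector.
fake_glove_file = io.StringIO('college 0.1 0.2 0.3\narmy 0.4 0.5 0.6\n')

toy_embeddings = {}
for line in fake_glove_file:
    parts = line.rstrip().split(' ')
    toy_embeddings[parts[0]] = [float(v) for v in parts[1:]]

# Hypothetical corpus vocabulary; the last word has no pre-trained vector.
toy_vocab = ['college', 'army', 'careervillage']

missing = [w for w in toy_vocab if w not in toy_embeddings]
coverage = 1 - len(missing) / len(toy_vocab)

print('Words without a GloVe vector:', missing)
print('Coverage: {:.0%}'.format(coverage))
```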

In [58]:


# Note that the words are on the index column
df_glove_embeddings = pd.DataFrame(embedding_matrix)

# get all the dictionary keys as a list
word_dict = t.word_index

# get a list of keys
keys = list(word_dict.keys())

# Insert a dummy_word at the first position.
# The dummy_word exists because our dict key:value pairs
# start from word:1 and not word:0.
keys.insert(0, 'dummy_word')

# transpose the dataframe so that the words become the columns
df_glove_embeddings = df_glove_embeddings.T

# set the names of the columns
df_glove_embeddings.columns = keys


# convert the dataframe back to the original form
df_glove_embeddings = df_glove_embeddings.T

# reset the index
df_glove_embeddings = df_glove_embeddings.reset_index(drop=False)

# change the name of the first column to 'words'
column_names = list(df_glove_embeddings.columns)
column_names[0] = 'words'
df_glove_embeddings.columns = column_names

print('This is the embeddings in a dataframe.')
df_glove_embeddings.head(10)

This is the embeddings in a dataframe.


Unnamed: 0,words,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,...,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199
0,dummy_word,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,work,0.005679,0.22325,-0.097926,-0.16128,0.47453,-0.33332,-0.37491,-0.041808,-0.059711,0.23397,0.57158,0.28719,-0.11798,0.35308,0.27206,0.007182,-0.38106,0.357,0.16333,0.3281,-0.019585,2.8545,0.30997,-0.17071,0.65618,0.63599,0.22558,-0.03727,0.36916,0.21133,-0.20398,-0.22599,-0.000351,-0.26588,-0.18939,-0.41834,-0.4714,-0.41733,0.27964,...,0.4745,0.055277,0.20737,-0.15987,-0.17818,0.23228,-0.15886,-0.16294,0.066815,0.64801,-0.22142,0.2822,0.047239,0.15179,0.42934,-0.25535,-0.72418,-0.2496,0.54558,0.24107,0.9555,-0.11706,-0.008816,-0.072432,0.19031,0.10898,-0.14574,-0.28885,0.083009,0.44254,-0.558,0.054495,-0.37817,-0.07831,0.51383,-0.382,-0.046936,0.39707,0.078734,0.023301
2,get,0.38331,0.16071,-0.3276,0.21097,-0.12469,0.3394,-0.5098,-0.21631,-0.008541,0.74978,-0.18249,0.04785,0.53559,-0.39374,0.14918,-0.022363,-0.27996,0.20976,0.14072,0.049913,0.38384,3.229,-0.34841,-0.098795,0.28231,-0.65828,-0.044889,0.019791,0.17554,0.012565,0.099094,-0.23645,-0.33309,-0.11629,-0.023981,-0.30039,-0.35389,-0.18019,-0.18452,...,-0.36907,0.1962,0.37949,-0.078997,0.039151,-0.017539,-0.1155,0.223,-0.59291,-0.17492,-0.1306,-0.072993,0.30784,0.65795,0.45429,-0.0243,-0.4739,-0.47814,0.33045,0.45529,0.91337,-0.26402,-0.11176,0.12379,-0.1498,-0.077251,0.098572,-0.14011,-0.29891,-0.029814,-0.32825,-0.25695,-0.36299,-0.084002,-0.022744,0.14506,0.11416,0.47199,-0.49446,0.36881
3,college,-0.019945,-0.31487,-0.087167,-0.1272,0.76309,0.09103,0.54087,0.16012,-0.040687,0.03463,-0.28407,-0.22656,0.29737,-0.73894,0.39638,-0.00554,-0.22885,-0.5244,0.23098,-0.205,0.46579,3.0511,0.15189,0.40687,0.35375,-0.12477,0.30486,0.49723,0.40127,-0.58412,0.70035,0.35002,-0.27727,0.14976,-0.37709,-0.39735,-0.69991,0.149,0.60205,...,0.49502,-0.15489,0.60184,0.37685,0.27022,0.63644,-0.48914,0.072821,-0.41599,0.64418,-0.036089,0.50934,0.08383,0.60195,0.21546,-0.27606,-0.42249,0.014857,0.41405,0.75514,0.014259,0.61088,-0.7707,0.09145,-0.41804,-0.87801,-0.86657,0.17904,0.15552,0.11132,0.65916,0.27482,0.7951,-0.23382,-0.44453,0.28255,-0.8321,-0.13113,0.8218,0.24142
4,good,0.51507,0.35596,0.1571,-0.074075,-0.25446,-0.11357,-0.49943,-0.12626,0.38851,0.54204,0.10479,0.44099,-0.06549,0.058463,0.4115,0.56709,-0.11869,0.25107,0.2564,-0.21615,0.6417,2.7875,0.12036,0.049481,0.24843,-0.6739,0.001196,0.35802,-0.17588,-0.39135,-0.014093,0.2361,-0.43184,-0.027045,0.022829,-0.28283,-0.50008,-0.11275,-0.45002,...,0.36631,0.36358,0.35067,-0.21211,0.10592,-0.14738,-0.10271,0.46204,-0.54369,0.026473,-0.34436,-0.28099,0.27903,0.22506,0.13079,-0.12761,-0.74556,-0.14482,-0.1178,0.46916,1.0226,-0.12157,-0.38652,0.20441,-0.38827,-0.18671,0.36354,-0.33577,-0.039282,0.33316,-0.048109,-0.38057,-0.35258,-0.006266,0.27227,-0.16222,-0.31979,0.14338,-0.072859,0.17815
5,would,0.28405,0.29386,-0.12101,-0.10605,0.39364,0.3154,-0.36304,-0.27793,0.21493,0.3914,0.077963,0.17886,0.074721,-0.24493,0.13675,-0.25969,-0.17564,0.93516,0.20657,-0.34136,-0.10694,3.4751,-0.54822,0.2364,-0.15247,0.21666,0.042201,-0.16206,-0.12015,-0.12456,-0.12575,-0.42431,0.066287,-0.21297,0.074132,-0.37839,-0.61262,-0.71189,0.12575,...,-0.47062,-0.015481,0.29794,-0.29518,-0.10978,0.36344,-0.096705,-0.26986,-0.11236,0.41221,0.52284,0.18203,0.34269,-0.13928,0.36973,0.11779,-0.49141,-0.32003,0.23044,0.18443,1.2937,-0.22258,0.003193,0.028787,-0.2969,0.36411,-0.125,0.25003,0.23086,0.36244,-0.27102,0.26138,-0.099219,-0.020817,-0.041567,-0.025865,-0.094608,0.36582,-0.088731,0.3607
6,time,0.19674,0.56518,-0.066079,-0.41694,-0.13375,0.47872,-0.32995,-0.1194,0.26947,0.2089,0.4236,0.32995,0.13164,0.20726,0.17406,0.27936,-0.11503,0.093967,0.36322,-0.53247,0.2175,3.1337,0.20181,-0.27973,0.27848,0.20376,-0.33474,0.62662,-0.024693,0.20079,-0.078253,0.039856,0.22083,-0.14034,-0.30905,-0.078472,-0.63928,-0.18789,-0.20267,...,-0.058402,0.058443,-0.001465,-0.002148,-0.092231,0.10477,-0.046672,0.11758,-0.47927,0.39629,-0.15032,-0.24826,0.30486,-0.19384,0.18508,0.042527,-0.37978,-0.45501,0.20884,-0.005877,0.84409,-0.44488,-0.55597,0.42431,-0.36427,-0.27076,0.29951,0.21196,-0.14311,0.255,-0.58994,0.073674,-0.18871,0.24933,0.24808,-0.21876,-0.058454,0.52523,0.018351,-0.11299
7,also,0.067532,0.18657,-0.006122,0.022128,0.2729,-0.016634,-0.30398,0.24975,-0.088128,-0.29102,0.064569,0.60796,-0.022705,-0.092373,0.36451,0.049544,-0.47105,0.53116,-0.18828,-0.37185,0.25309,2.6955,-0.19216,-0.28393,-0.053655,0.16175,0.10098,-0.12152,-0.043732,-0.31838,-0.10674,-0.003701,-0.29602,-0.23761,0.39797,-0.61881,-0.8136,-0.81163,-0.033008,...,0.1362,0.3262,-0.12225,0.10418,-0.092845,0.27035,0.13148,-0.013678,0.14382,0.78509,0.1217,0.013614,0.69032,-0.15573,-0.056398,0.32858,-0.59882,-0.43602,0.21874,-0.079901,0.90635,-0.20372,-0.096026,-0.094772,-0.073139,-0.1524,-0.35996,0.016672,0.040921,0.096236,0.2208,0.11753,-0.56885,0.24838,0.25087,0.24543,-0.48018,-0.000395,-0.21817,0.039332
8,like,0.25527,0.33678,-0.52359,-0.24037,0.10562,0.11899,-0.55253,0.36645,-0.40646,0.37358,-0.21459,0.52908,0.44046,0.087591,-0.14473,-0.16494,-0.27365,0.25612,-0.055087,0.090737,0.18271,2.5233,0.24048,-0.32437,0.55388,-0.20451,0.19837,-0.17136,-0.14982,0.12071,0.090739,-0.076308,-0.47191,0.21234,-0.31174,-0.067683,-0.28015,-0.051859,-0.050429,...,0.028264,0.32992,0.053101,-0.17984,0.29414,-0.13556,0.089491,0.22628,-0.058628,0.11444,-0.32945,-0.04282,0.33654,0.58262,0.048651,0.24395,-0.10776,-0.19581,0.029991,0.46667,0.84119,-0.48545,-0.49973,0.38383,0.024319,-0.030888,-0.085667,0.50983,-0.01989,-0.004425,-0.29027,-0.025755,-0.068046,0.34536,0.24535,-0.28286,-0.49124,-0.051566,0.31835,0.34844
9,want,0.43249,0.53471,-0.018324,0.15637,0.066969,0.076314,-0.7165,0.24152,0.17092,0.89622,-0.20074,0.18837,0.45357,-0.18642,-0.34106,-0.031057,-0.049072,0.65583,-0.007434,0.15477,0.15585,3.1113,-0.41215,0.18944,0.18664,-0.20054,0.066307,0.039018,0.060674,-0.31788,0.018229,-0.68072,-0.013336,0.2321,-0.082346,-0.26842,-0.35187,-0.20386,0.066975,...,-0.37762,-0.09862,0.20412,-0.34287,-0.20151,-0.11827,-0.16011,0.16357,-0.37903,-0.24553,-0.05317,-0.042,0.1821,0.22396,0.4508,-0.20303,-0.61633,-0.20274,0.21842,0.36694,0.76844,-0.29219,-0.098009,-0.29294,-0.19319,0.14572,0.24515,-0.071184,0.39793,-0.00333,-0.40118,0.13976,0.082754,-0.1296,-0.3055,0.008989,0.22656,0.32141,-0.42978,0.47478


Now let's encode every document in our corpus as a single vector. This is simply a look-up: for every word index in a padded document, the code looks up the matching row of embedding_matrix and adds it to a running total. The result for each document is stored as one row of encoding_mat.

Take note that here we are **averaging** all the word vectors that make up a given answer.
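The averaging can also be written without Python loops. The sketch below is a vectorized version on toy data and is not the code this kernel runs: all `toy_` names and the `cosine` helper are made up for illustration. It also compares two ways of averaging, dividing by the padded length and dividing by the number of real (non-zero) tokens, and shows that the choice does not change the cosine similarity because cosine similarity ignores vector length.

```python
import numpy as np

# Toy stand-ins: 4 "words" (row 0 is the padding row) with 3 features each.
toy_embedding_matrix = np.array([[0., 0., 0.],
                                 [1., 0., 0.],
                                 [0., 2., 0.],
                                 [0., 0., 3.]])
toy_padded_docs = np.array([[1, 2, 0, 0],     # two real tokens, two pads
                            [3, 3, 1, 2]])    # four real tokens
toy_max_length = toy_padded_docs.shape[1]

# Fancy indexing looks up every token's vector: shape (docs, tokens, features).
summed = toy_embedding_matrix[toy_padded_docs].sum(axis=1)

# Option 1: divide by the padded length.
fixed_avg = summed / toy_max_length

# Option 2: divide by the number of real tokens (at least 1, to avoid 0/0).
n_tokens = np.maximum((toy_padded_docs != 0).sum(axis=1), 1)
masked_avg = summed / n_tokens[:, None]

def cosine(a, b):
    """Cosine similarity of two non-zero 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_fixed = cosine(fixed_avg[0], fixed_avg[1])
cos_masked = cosine(masked_avg[0], masked_avg[1])
print(cos_fixed, cos_masked)
```

On a corpus of this size the intermediate array produced by the fancy indexing can be large, so the documents may need to be processed in batches.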



In [59]:
# create an empty matrix
encoding_mat = np.zeros((len(padded_docs), EMBED_LENGTH))

for i in range(0,len(padded_docs)):
    # select the document
    padded_doc = padded_docs[i]
    # start with an all-zero encoding vector
    encoding = np.zeros(EMBED_LENGTH)
    # loop over the word indexes in the document
    for item in padded_doc:
        # Here we are adding the vectors together.
        # This selects a row from embedding_matrix.
        # The row is a numpy array of length EMBED_LENGTH.
        encoding = encoding + embedding_matrix[item] # item is an integer value
        
    # Insert the encoding into encoding_mat.
    # Here we are averaging by dividing by max_length (the padded length),
    # so the padding zeros are counted too. This only rescales each vector
    # and does not change the cosine similarity.
    encoding_mat[i] = encoding/max_length

# check the shape of the matrix
encoding_mat.shape

(51124, 200)

In [60]:
# Display the encoding matrix
# The answers are the rows and the features are the columns.

# Every row represents one answer that has been encoded as a vector
df_encoding_mat = pd.DataFrame(encoding_mat)

print('This is the embedding matrix. Each row represents one answer that has been encoded as a vector.')
print('Row 0 is the question.')
df_encoding_mat.head()

This is the embedding matrix. Each row represents one answer that has been encoded as a vector.
Row 0 is the question.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199
0,0.009525,-0.002975,0.003757,-0.002923,0.004781,0.000607,-0.009669,0.009326,0.016464,0.01457,-0.004031,0.003123,-0.000843,0.000958,0.011874,-0.007913,-0.003926,0.008817,-0.00057,-0.000914,0.005643,0.107928,0.007482,0.005282,0.0053,-0.004403,-0.0045,-0.002869,-0.002073,0.002266,0.002272,-0.000959,-0.001708,-0.005477,0.003198,0.00387,-0.031066,-0.001086,0.001209,0.015957,...,-0.007056,0.004755,0.003335,0.00385,-0.003248,0.008779,-0.005595,0.004519,-0.013246,0.009939,0.013512,-0.005664,-0.0006,-0.000889,0.001338,0.000195,-0.00806,-0.013073,0.011545,-0.002982,0.023639,-0.002482,-0.003857,0.005492,-0.009869,0.000828,-0.017233,0.007699,-0.002773,0.008226,0.001573,0.002676,0.008716,-0.000836,-0.001089,0.009642,0.01198,-0.003493,0.005941,0.010331
1,0.016543,0.025232,0.017283,-0.004585,0.025737,0.003801,-0.029428,0.006048,-0.01867,0.025888,-0.013804,0.021663,0.016008,0.000548,0.002324,0.007336,-0.001639,0.005294,-0.016542,0.004426,0.00049,0.099559,-0.013248,0.008171,0.005808,0.004313,0.000734,7.8e-05,0.007173,0.001412,-0.008392,-0.00104,-0.005931,0.000316,0.006988,-0.017887,-0.01449,-0.011015,-0.008418,-0.00272,...,-0.005718,-0.004124,0.001664,-0.005227,-0.002296,-0.005011,0.002016,0.006601,-0.013967,0.009616,0.009893,0.003344,0.013021,-0.007627,0.013631,0.005102,-0.011596,0.003344,0.012787,0.003464,0.024496,-0.004431,0.005644,-0.000814,-0.004268,-0.007382,0.006024,-0.001454,-0.007773,-0.007336,0.004097,-0.000252,0.004588,0.011327,-0.006434,-0.018766,-0.006398,0.000393,7e-05,-0.010316
2,0.020799,0.039901,0.005123,-0.01187,0.023158,0.006242,-0.051353,0.00996,-0.012436,0.045081,-0.014536,0.033084,0.022482,0.002429,0.007159,0.00637,-0.009974,0.026578,-0.019543,-0.005948,0.000548,0.208865,-0.019608,0.014621,0.009552,-0.002505,0.004153,0.009517,-0.003038,-0.000641,-0.009367,-0.007758,-0.010607,0.001862,0.00665,-0.013456,-0.036405,-0.00919,-0.010718,0.008697,...,-0.006129,0.00317,0.00263,-0.013369,-0.000122,-0.005389,0.001581,0.003779,-0.014215,0.029511,-0.002579,0.000818,0.020471,-0.007966,0.027703,0.019623,-0.020107,0.002465,0.016768,0.002841,0.065616,-0.017344,0.00673,-0.002659,0.002597,-0.000979,0.013834,0.00231,-0.015256,0.008648,-0.00203,0.002699,-0.00735,0.017602,0.001584,-0.021291,-0.008981,-0.002192,-0.00392,0.00145
3,0.022964,0.026646,-0.005306,-0.010258,0.004238,-0.002894,-0.059486,0.010145,-0.00021,0.041355,-0.010758,0.023794,0.014482,-0.00145,0.022851,0.007962,-0.012437,0.035309,0.000658,-0.005614,0.028506,0.335965,-0.006318,0.01612,0.026103,-0.003068,0.001431,0.002658,-0.001322,-0.017986,-0.006046,-0.030309,-0.014613,0.001136,0.003161,-0.018724,-0.063218,-0.022232,-0.005708,0.023743,...,-0.003203,0.004423,0.010686,-0.013882,0.002003,-0.00457,-0.002383,-0.016191,-0.043347,0.043394,-0.001937,-0.01549,0.00841,0.0075,0.027237,-0.008691,-0.021924,-0.017845,0.015481,0.004915,0.112371,-0.01324,-0.005895,-0.00647,-0.003249,-0.002865,0.010255,0.005883,-0.012372,0.019924,-0.009175,0.00441,-0.003365,0.01335,0.01302,-0.005333,-0.004279,0.015542,-0.008396,0.008437
4,0.031751,0.034708,-0.014375,-0.022872,0.013888,0.006212,-0.102788,0.019331,-0.019415,0.054042,-0.00551,0.043605,0.006279,0.015367,0.024504,0.029838,-0.02438,0.041458,0.003468,-0.020259,0.035523,0.50039,-0.011342,0.004195,0.036581,-0.00709,0.005571,0.013478,-0.005138,-0.020297,0.000351,-0.042206,-0.015001,-0.007901,0.004272,-0.023913,-0.081038,-0.040811,0.006177,0.026174,...,0.009663,0.005652,0.013551,-0.014056,0.001993,0.001223,0.007068,-0.021372,-0.050492,0.079773,-0.010477,-0.010797,0.014472,0.010976,0.05218,-0.007319,-0.034081,-0.022496,0.009412,-0.00082,0.160251,-0.029409,-0.015949,-0.00873,-0.002354,0.010781,0.007174,0.004915,-0.017579,0.044088,-0.01584,-2.2e-05,-0.011126,0.037537,0.028219,-0.009331,-0.00057,0.021783,-0.014707,0.018445


In [61]:
# check the shape of the embedding matrix
encoding_mat.shape

(51124, 200)

## 7.6. Calculate the cosine similarity of the question (first row) to every answer (all other rows)

In [62]:
# encoding_mat already has the shape (num_samples, num_features),
# so it does not need to be reshaped.
# reshape the base_document i.e. the one we will compare to all others
base_doc = encoding_mat[0].reshape(1,EMBED_LENGTH)

# calculate the cosine similarity
cosine_similarities = cosine_similarity(base_doc, encoding_mat)

# The following would compute a cosine similarity matrix comparing every
# doc to every other doc, like a correlation matrix.
# This uses a lot of RAM.
#cosine_similarities = cosine_similarity(encoding_mat, encoding_mat)

cosine_similarities.shape

(1, 51124)

In [63]:
# flatten the matrix
cosine_similarities = cosine_similarities.flatten()

#Check: The first value should be 1.0 because the 
# question is being compared to itself.
cosine_similarities

array([1.        , 0.63717302, 0.75057081, ..., 0.68013869, 0.76587238,
       0.77467368])

Let's put everything into a dataframe and sort the similarities from highest to lowest.
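For comparison, the same kind of ranking can be built in a single step from a pandas Series. This is only a sketch on toy data, not the cell that follows: `toy_scores` and `toy_ids` stand in for `cosine_similarities` and the list of answer ids.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for cosine_similarities and the answers_id column.
toy_scores = np.array([1.0, 0.64, 0.92, 0.75])
toy_ids = ['question_cosine_score', 'ans_a', 'ans_b', 'ans_c']

# Use the ids as the index, sort from highest to lowest score,
# and convert to a one-column dataframe.
toy_ranking = (pd.Series(toy_scores, index=toy_ids,
                         name='cosine_score_for_each_answer_id')
                 .sort_values(ascending=False)
                 .to_frame())

print(toy_ranking)
```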

In [64]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# transpose the dataframe
df_cosine_matrix = df_cosine_matrix.T

# get the answer ids from df_qa_prof to use as column names
cols = list(df_qa_prof['answers_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Add a helper column that will become the index, so that after
# transposing the answer ids are the rows and the scores form one column.
df_cosine_matrix['answers_id'] = df_qa_prof['answers_author_id']

# set the answers_id column as the index
df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_answer_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_answer_id',axis=0, ascending=False)

# check the top 20 cosine scores
df.head(20)


Unnamed: 0,cosine_score_for_each_answer_id
question_cosine_score,1.0
e9ebfa69514243e19c3790f8118cc820,0.924699
b3efc307b94d44b6b82eaf9a325e2fd6,0.918611
f8b9c41cda914a219895cf2ca26890a2,0.915846
036b8034e2b64d16aa232111deeba63c,0.911662
573b02b214bc445db3004dfb0c842283,0.910685
5410c56b923c4a29b88e4ae4d7a93958,0.908284
c57ef07bdff34b8cb4f7421a0fc4437f,0.908248
a898e7728f064b0dae5fed73281ab48b,0.90801
74c89f47b05741bba95122f1cf43d196,0.907588


§ Pause here for a moment. Please take a look at the previous dataframe. You'll notice that Model 4 has identified quite a few professionals. However, for many questions you're going to find that the final printout for this model does not contain any recommended professionals. This is because the threshold is set quite high. It's at 0.94 at the moment. Later I'll explain why this threshold is set high.

## 7.7. Select the Answers
Select the answers that have a cosine similarity that is greater than or equal to a threshold value.
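The cell below removes the question by position, which relies on the question being sorted into the first row. A slightly more defensive variant, sketched here on toy data, drops the question by its label before applying the threshold. The `toy_` names and `TOY_THRESHOLD` are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'cosine_score_for_each_answer_id': [1.0, 0.95, 0.93, 0.91]},
    index=['question_cosine_score', 'ans_b', 'ans_c', 'ans_a'])

TOY_THRESHOLD = 0.94

# Drop the question by its label, so the result does not depend on sort order.
toy_candidates = toy_df.drop(index='question_cosine_score')

# Keep only the answers at or above the threshold.
toy_selected = toy_candidates[
    toy_candidates['cosine_score_for_each_answer_id'] >= TOY_THRESHOLD]

print(len(toy_selected), 'answer(s) at or above', TOY_THRESHOLD)
```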

In [65]:
# Set the cosine similarity threshold
MODEL_4_THRESHOLD = 0.94


# keep only the rows that have a cosine_score >= THRESHOLD
df_selected = df[df['cosine_score_for_each_answer_id'] >= MODEL_4_THRESHOLD]

# remove the first row because this row is the question we asked
df_selected = df_selected[1:]

num_answers = len(df_selected)

print('Number of answers chosen: ', num_answers)

print('This is a sample of the answers the model has selected.')

# print the answers that have been selected as well as the associated cosine scores
df_selected.head(10)

Number of answers chosen:  0
This is a sample of the answers the model has selected.


Unnamed: 0,cosine_score_for_each_answer_id


This is a list of id's that correspond to the answers that are similar to the question you asked. We'll identify the professional that gave each answer. These will be the professionals that are best able to answer your question.

In [66]:
# reset the index
df_selected.reset_index(inplace=True)

# rename the columns
new_names = ['answers_id', 'cosine_score_for_each_answer_id']
df_selected.columns = new_names

# create a list with all answer id values from df_selected
answer_list = list(df_selected['answers_id'])

# display the list
# answer_list

**Let's print the question again.**

In [67]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

Question id:  eb80205482e4424cad8f16bc25aa2d9c
Question Title:  I want to become an army officer. What can I do to become an army officer?


Question Body:
  I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


Next we'll print the profile of each professional as well as the answer they gave to a similar question. Again, there could be duplicate professionals in this list because the same professional could have given several past answers.<br>
| <a id='model_3_prof_printout'></a>

In [68]:
# Print info on the professionals who can answer this question

#print('\n')
print('Model 4')
print('Number of professionals selected: ', len(answer_list)) # correct this. there could be duplicates
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')

# Create an empty list to store the professional id's that are
# associated with the answers that have been selected.
model_4_list = []

# set the index of df_qa_prof to be the answer id
df_qa_prof = df_qa_prof.set_index('answers_id')


for ans_id in answer_list:
    
    # print the professional's id (i.e. their name)
    # get the prof id of the person who wrote the answer
    prof_id = df_qa_prof.loc[ans_id, 'answers_author_id']
    print('\n')
    print('==> Professional id: ', prof_id)
    model_4_list.append(prof_id)
    
    print('Answer id: ', ans_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the answer that they wrote which was similar to the question being asked
    answer = df_qa_prof.loc[ans_id,'answers_body']
    
    print('==Answer given to similar question:\n',answer)

Model 4
Number of professionals selected:  0
== Printing info on each professional who was selected ==


These are the id's of the professionals that Model_4 has selected:

In [69]:
# uncomment the next line to print the list of professional id's

# model_4_list

<hr>
| <a id='Select_professionals'></a>

## 8. Select those professionals who are most likely to answer the question

We have recommendations from four models. Now we need to keep only those professionals who are likely to answer when they are sent this question via email.

**What selection criteria are we going to use?**

> We will ask four questions:
> 1. Has this professional joined CareerVillage within 30 days of the most recent sign-up?
> 2. Has this professional answered a past question from the student who posted this question?
> 3. Has this professional answered a question within 30 days of the most recent answer posted on CareerVillage?
> 4. Has this professional made a comment within 30 days of the most recent answer posted on CareerVillage?

If the answer is yes to any one of these questions then we will conclude that there is a high possibility that this professional will respond to the email.

This is the logic behind this filter:

New users are more likely to be highly active. Also, having recently answered a question or made a comment is an indicator that a person is contributing to the community and is therefore likely to respond to a relevant email. Lastly, if a professional answered a past question from the student posting this question, then there is already a kind of "connection" between them. There's a good chance of that professional also answering this new question.
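The rule "yes to any one of the four questions" can be sketched in a few lines. The function name `likely_to_respond` and the three toy professionals are made up for illustration. The notebook itself stores the four answers as 0/1 columns and sums them.

```python
def likely_to_respond(new_member, past_interaction, recent_answer, recent_comment):
    """Return True if at least one of the four yes/no checks is a yes."""
    return any([new_member, past_interaction, recent_answer, recent_comment])

# Flags are in the order: new_member, past_interaction, recent_answer, recent_comment
toy_professionals = {
    'prof_a': (0, 0, 0, 0),   # no signal at all, so no email
    'prof_b': (0, 1, 0, 0),   # answered this student before
    'prof_c': (1, 0, 1, 1),   # new member who is also active
}

to_email = [pid for pid, flags in toy_professionals.items()
            if likely_to_respond(*flags)]

print(to_email)
```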


## 8.1. Get a summary of how many professionals each model has selected
There could be duplicate professional id's here. We'll remove those duplicates later.
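As a small illustration of what removing duplicates means here, the sketch below joins two toy lists and drops repeated id's while keeping the order in which they first appeared. The list names are hypothetical. It relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
toy_model_1 = ['prof_a', 'prof_b']
toy_model_2 = ['prof_b', 'prof_c', 'prof_c']   # same professional, several answers

toy_combined = toy_model_1 + toy_model_2

# dict keys are unique, so this keeps only the first occurrence of each id
toy_unique = list(dict.fromkeys(toy_combined))

print(len(toy_combined), 'selections,', len(toy_unique), 'unique professionals')
```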

In [70]:
# Note that there could be duplicate professional id's
# in these lists.

print('Model 1 Tags: ', len(model_1_list))
print('Model 2 Tfidf: ',len(model_2_list))
print('Model 3 TSVD: ',len(model_3_list))
print('Model 4 GloVe: ',len(model_4_list))


Model 1 Tags:  21
Model 2 Tfidf:  32
Model 3 TSVD:  46
Model 4 GloVe:  0


## 8.2. What is the total number of professionals the models have selected?

In [71]:
# Join all the lists
combined_list = model_1_list + model_2_list + model_3_list + model_4_list
# Create a dataframe containing all professionals
df_selected = pd.DataFrame(combined_list, columns=['professionals_id'])

# Drop any duplicate id's.
# Because models 2, 3 and 4 select professionals based on answers, 
# there is a possibility that the same professional could be selected 
# multiple times because they gave several answers that matched the Question.

# remove the duplicates
df_selected = df_selected.drop_duplicates('professionals_id')
# get the total number of professionals
total = len(df_selected)

print(total, 'professionals are able to answer the Question.')

65 professionals are able to answer the Question.


## 8.3. Apply the final selection criteria

### ~ Has this professional joined CareerVillage within 30 days of the most recent sign-up?

In [72]:

def new_member(x):
    # get the value from df_professionals
    num_days_member = df_professionals.loc[x, 'num_days_member']

    if num_days_member <= 30:
        return 1
    else:
        return 0

df_selected['new_member'] = df_selected['professionals_id'].apply(new_member)

### ~ Has this professional answered a past question from the student asking this question?

In [73]:
# Get the id of the student asking the question.
# student_id variable was captured above.

def past_interaction(x):
    # Select all the questions this professional has answered in the past
    df_past = df_qa_prof[df_qa_prof['answers_author_id'] == x]

    # Get a list of students who've asked the above questions
    student_list = list(df_past['questions_author_id'])

    # Check if the student asking this question is in student_list
    if student_id in student_list:
        return 1 # there was a past interaction
    else:
        return 0 # there has been no past interaction

# create a new column that shows if there was a past interaction
df_selected['past_interaction'] = \
df_selected['professionals_id'].apply(past_interaction)

In [74]:
df_selected.head()

Unnamed: 0,professionals_id,new_member,past_interaction
0,6847e217f0d942a5b7c492131e47aa84,0,0
1,2d0e6cc6f02d40f0804698c51fc4d583,0,0
2,cb7141606c7b4a00ab42547b55091978,0,0
3,1a67257ce4164a27a83840007c0903ce,0,0
4,ffdcc03fc51c4621819601fa36acd354,0,0


### ~ Has this professional answered a question within 30 days of the most recent answer posted on CareerVillage?
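Here is the idea on toy data, using only the standard library. The reference point is the newest answer in the data rather than today's date, because the dataset is a snapshot. `toy_answer_log` and its dates are made up, and collecting the authors in a set is one possible choice that keeps the later membership check fast.

```python
from datetime import datetime

# (author id, date the answer was posted), all values are made up
toy_answer_log = [
    ('prof_a', datetime(2019, 1, 30)),
    ('prof_b', datetime(2019, 1, 5)),
    ('prof_a', datetime(2018, 6, 1)),
    ('prof_c', datetime(2018, 12, 1)),
]

# the most recent answer in the data is the reference point
newest = max(date for _, date in toy_answer_log)

# authors with at least one answer no more than 30 days before the newest one
recent_authors = {author for author, date in toy_answer_log
                  if (newest - date).days <= 30}

print(sorted(recent_authors))
```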

In [75]:
# Has this professional answered a question within
# 30 days of the most recent answer posted on CareerVillage?
# Yes --> send email

# convert the answers_date_added to pandas datetime
df_answers['answers_date_added'] = \
pd.to_datetime(df_answers['answers_date_added'])

# get the date of the most recent answer
newest_answer_date = df_answers['answers_date_added'].max()

# Get the number of days between each answer and the most recent answer posted
# on CareerVillage.

def days_from_newest_answer(x):
    
    num_days = (newest_answer_date - x).days
    
    return num_days

# create a new column
df_answers['days_from_newest_answer'] = \
df_answers['answers_date_added'].apply(days_from_newest_answer)

# keep only the rows where days_from_newest_answer <= 30
df_filtered = df_answers[df_answers['days_from_newest_answer'] <= 30]

# Drop duplicate professional id's because some professionals
# may have answered multiple questions in that time period.
df_filtered = df_filtered.drop_duplicates('answers_author_id')

# get a list of professionals that made these recent answers
prof_list = list(df_filtered['answers_author_id'])


def recent_answer(x):
    if x in prof_list:
        return 1
    else:
        return 0

# create a new column
df_selected['recent_answer'] = \
df_selected['professionals_id'].apply(recent_answer)

### ~ Has this professional made a comment within 30 days of the most recent answer posted on CareerVillage?

In [76]:
# Has this professional made a comment within
# 30 days of the most recent answer posted on CareerVillage?
# Yes --> send email

# convert comments_date_added to pandas datetime
df_comments['comments_date_added'] = pd.to_datetime(df_comments['comments_date_added'])

# Get the number of days between each comment and the most recent answer posted
# on CareerVillage.

def days_from_newest_answer(x):
    
    num_days = (newest_answer_date - x).days
    
    return num_days

# create a new column
df_comments['days_from_newest_answer'] = \
df_comments['comments_date_added'].apply(days_from_newest_answer)

# keep only the rows where days_from_newest_answer <= 30
df_filtered = df_comments[df_comments['days_from_newest_answer'] <= 30]

# Drop duplicate professional id's because some professionals
# may have made multiple comments in that time period.
df_filtered = df_filtered.drop_duplicates('comments_author_id')

# get a list of professionals that made these recent comments
prof_list = list(df_filtered['comments_author_id'])

# add a new column to df_selected
def recent_comment(x):
    if x in prof_list:
        return 1
    else:
        return 0

df_selected['recent_comment'] = df_selected['professionals_id'].apply(recent_comment)


<hr>
| <a id='Final_Output'></a>

## 9. Final Output - The chosen ones

### ~ Select those professionals who've met at least one of the selection criteria ~

In [77]:
# sum up the row scores for each professional in df_selected
def sum_rows(row):
    
    total = row['new_member'] + row['recent_answer'] + \
    row['recent_comment'] + row['past_interaction']
    
    return total
    
df_selected['total_score'] = df_selected.apply(sum_rows, axis=1)


# keep only the rows where the score > 0
df_send_email = df_selected[df_selected['total_score'] > 0]



final_selection_list = list(df_send_email['professionals_id'])

num_selected = len(final_selection_list)

print('=== Final Results ===\n')

print(num_selected, 'professionals are likely to respond to the email.')

#print('These are their names:\n', final_selection_list)

print('These are their scores.')

# Print the list of professionals that have a high likelihood of
# responding to an email notification
df_send_email.head(20)

=== Final Results ===

9 professionals are likely to respond to the email.
These are their scores.


| | professionals_id | new_member | past_interaction | recent_answer | recent_comment | total_score |
|---|---|---|---|---|---|---|
| 25 | cbd8f30613a849bf918aed5c010340be | 0 | 1 | 0 | 0 | 1 |
| 36 | 327e3aeb0e174b8bb49ab6bc75ca3f89 | 0 | 1 | 0 | 0 | 1 |
| 41 | f1cc078488fa49b2827a9671ab1cc582 | 0 | 0 | 1 | 0 | 1 |
| 51 | 258cade180394e3c80559a75f5bd874b | 1 | 0 | 1 | 1 | 3 |
| 59 | 9a6c9a45fc384da9a48555b9ea794b7e | 0 | 0 | 1 | 0 | 1 |
| 62 | fcc808fdb24147f6972e5cd5ab785173 | 0 | 0 | 1 | 0 | 1 |
| 85 | cbe6fda957e741c19e83d13cf7b7fa89 | 0 | 0 | 1 | 0 | 1 |
| 91 | be5d23056fcb4f1287c823beec5291e1 | 0 | 0 | 1 | 0 | 1 |
| 92 | 2aa47af241bf42a4b874c453f0381bd4 | 0 | 1 | 0 | 0 | 1 |


### Show which model selected each professional
These professional ids will help you go back to a specific model's output and find a profile. There could be duplicates here because the same professional could have been selected by more than one model.

In [78]:
print('This shows which model selected each chosen professional:\n')

for prof_id in final_selection_list:
    if prof_id in model_1_list:

        print('Model 1 Tags: ', prof_id)

for prof_id in final_selection_list:
    if prof_id in model_2_list:

        print('Model 2 Tfidf: ', prof_id)
        
for prof_id in final_selection_list:
    if prof_id in model_3_list:

        print('Model 3 TSVD: ', prof_id)

for prof_id in final_selection_list:
    if prof_id in model_4_list:

        print('Model 4 GloVe: ', prof_id)

This shows which model selected each chosen professional:

Model 2 Tfidf:  cbd8f30613a849bf918aed5c010340be
Model 2 Tfidf:  327e3aeb0e174b8bb49ab6bc75ca3f89
Model 2 Tfidf:  f1cc078488fa49b2827a9671ab1cc582
Model 2 Tfidf:  258cade180394e3c80559a75f5bd874b
Model 3 TSVD:  9a6c9a45fc384da9a48555b9ea794b7e
Model 3 TSVD:  fcc808fdb24147f6972e5cd5ab785173
Model 3 TSVD:  cbe6fda957e741c19e83d13cf7b7fa89
Model 3 TSVD:  be5d23056fcb4f1287c823beec5291e1
Model 3 TSVD:  2aa47af241bf42a4b874c453f0381bd4
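
Because one professional can be picked by several models, it can be handy to group the model names per professional instead of printing one line per match. The following is only a sketch with toy ids: `model_lists`, `final_selection` and `selected_by` are illustrative names standing in for the notebook's `model_1_list` to `model_4_list` and `final_selection_list`.

```python
# Toy stand-ins for the notebook's model_1_list ... model_4_list
# and final_selection_list (the ids here are made up).
model_lists = {
    'Model 1 Tags': ['p1'],
    'Model 2 Tfidf': ['p1', 'p2'],
    'Model 3 TSVD': ['p3'],
    'Model 4 GloVe': [],
}
final_selection = ['p1', 'p2', 'p3']

# Convert each list to a set once so that membership tests are fast.
model_sets = {name: set(ids) for name, ids in model_lists.items()}

# For every chosen professional, collect the models that selected them.
selected_by = {
    prof_id: [name for name, ids in model_sets.items() if prof_id in ids]
    for prof_id in final_selection
}

for prof_id, names in selected_by.items():
    print(prof_id, '<-', ', '.join(names))
```

Each professional then appears once, followed by every model that recommended them.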


In [79]:
# End of Recommender System
#====================================================================#

<a id='Testing'></a>

## 10. Testing and Results

I used nine questions for testing and tuning. I tried to choose questions that reflect the strengths, limitations and quirks of this data. These include: 

1. Questions relating to careers that are well represented - e.g. computer science
2. Questions relating to careers that have a low representation - firefighting, plumbing
3. General questions that are more life-skills related
 
The test results for each model and the filter are summarised in a dataframe below.

These are the questions that I used for testing:

**Index**: 777<br>
**Question id**:  11ce7c537cd84db0bd7840ad3ca04004<br>
**Question Title**:  I want to major in computer science. What classes should I take?<br>
**Question Body**:<br>
 I want to do something like cyber security or write code for companys. #computer-science #programming #computer-engineering #computer-software 
 
 
**Index**: 999<br>
**Question id**:  0601a843065945ac86e473f421774952<br>
**Question Title**:  What is required to become a firefighter?<br>
**Question Body**:<br>
I want to know because so I can get the things I need to become a firefighter. #fireman 


**Index**: 2043<br>
**Question id**:  8b469efa88284afb907179e4c73a99af<br>
**Question Title**:  What exactly, is the difference between a psychologist and psychiatrist?<br>
**Question Body**:<br>
i dont know whether i want to be a psychologist or psychiatrist. I want to work with married people, going through a divorce. Do both of them work with people who are going through a divorce? #psychology #psychiatry


**Index**: 2487<br>
**Question id**:  1aaa4249d4ea41a4b2d196313e4e930e<br>
**Question Title**:  How do I decide what career I want to choose?<br>
**Question Body**:<br>
I am finding it very difficult to decide what I want to spend the rest of my life doing, and I would like to know what process others took to find their paths who were as lost as me. #undecided #unsure #searching

**Index**: 3618<br>
**Question id**:  e14626d53e5d44ac98e4e1c57404aa9d<br>
**Question Title**:  What are the best ways to maintain a work and school balance?<br>
**Question Body**:<br>
I think many people struggle with this and any type of advice towards this question is valuable.  #business #leadership #organization

**Index**: 1710<br>
**Question id**:  eb80205482e4424cad8f16bc25aa2d9c<br>
**Question Title**:  I want to become an army officer. What can I do to become an army officer?<br>
**Question Body**:<br>
I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


**Index**: 1000<br>
**Question id**:  a5dfa070a89c47c28460557b1f4dabb8<br>
**Question Title**:  What are the challenges I may face pursuing a career in science, and how can I stand out among the rest?<br>
**Question Body**:<br>
For my first two years in high school I have been studying biotechnology and I have been working diligently in preparing myself for college and the workforce. Though I have not decided on an exact career, I am deeply considering one within forensics, pharmaceuticals, or in the space exploration fields.  What are some of the challenges students may face when applying for college, throughout college, and in the workforce? How can I stand out from the rest? #high-school #student #biotechnology #workforce #science #career #forensics #medicine #pharmaceuticals #space-exploration #future #technology #stem #steam #nasa #astrophysics #planetary-science #women-in-stem


**Index**: custom question<br>
**Question id**:  custom question<br>
**Question Title**:  How do I become a data scientist?<br>
**Question Body**:<br>
I want to be a data scientist. What subjects should I study? #data-science

**Index**: custom question<br>
**Question id**:  custom question<br>
**Question Title**:  How do I become a plumber?<br>
**Question Body**:<br>
I want to be a plumber. What subjects should I study? #plumber #plumbing



In [80]:
import pandas as pd
# The next three lines cause all the text to appear. Sentences are not truncated.
# All columns and all rows are displayed. Nothing is hidden.
# Note: this must be in the same cell as import pandas as pd
# None disables truncation in pandas >= 1.0 (older versions used -1).
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)


results_dict = {
'question_index': [777,999,2043,2487,3618,1710,1000,'custom question','custom question'],
'question_title': ['I want to major in computer science. What classes should I take?',
                 'What is required to become a firefighter?',
                 'What exactly, is the difference between a psychologist and psychiatrist?',
                 'How do I decide what career I want to choose?',
                 'What are the best ways to maintain a work and school balance?',
                 'I want to become an army officer. What can I do to become an army officer?',
                 'What are the challenges I may face pursuing a career in science, and how can I stand out among the rest?',
                 'How do I become a data scientist?', 'How do I become a plumber?'],
'Model_1_Tags': ['rec: 481 fp: 0','rec: 2 fp: 0','rec: 1 fp: 1','rec: 0 fp: 0','rec: 1 fp: 1','rec: 21 fp: 2','rec: 0 fp: 0','rec: 105 fp: 1','rec: 4 fp: 0'],
'Model_2_Tfidf': ['rec: 392 fp: 12','rec: 10 fp: 1','rec: 9 fp: 0','rec: 2 fp: 1','rec: 1 fp: 1','rec: 32 fp: 0','rec: 0 fp: 0','rec: 99 fp: 10','rec: 4 fp: 4'],
'Model_3_TSVD': ['rec: 140 fp: 1','rec: 0 fp: 0','rec: 0 fp: 0','rec: 6 fp: 1','rec: 0 fp: 0','rec: 46 fp: 0','rec: 0 fp: 0','rec: 108 fp: 8','rec: 7 fp: 7'],
'Model_4_GloVe': ['rec: 53 fp: 0','rec: 0 fp: 0','rec: 0 fp: 0','rec: 34 fp: 4','rec: 1588 fp: 0','rec: 0 fp: 0','rec: 26 fp: 0','rec: 0 fp: 0','rec: 0 fp: 0'],
'Final_Filter_Output': ['rec: 29 fp: 1','rec: 2 fp: 1','rec: 1 fp: 1','rec: 2 fp: 0','rec: 54 fp: 0','rec: 9 fp: 0','rec: 3 fp: 0','rec: 14 fp: 1','rec: 2 fp: 1']
  
}

    
df_results = pd.DataFrame(results_dict)

#df_results.head(10)

### Results

In each cell there are two numbers. One shows the number of recommendations, the other shows the number of false positives (bad matches) in those recommendations. To determine the number of false positives I looked at the output of each model and counted how many recommended professionals I believed were not well matched to the question. This is a subjective way of assessing performance, but it gives us a rough idea of how each model is performing.

For example, Model_3_TSVD recommended 140 professionals to answer the question: "I want to major in computer science. What classes should I take?" There was one false positive in the output. In the dataframe the results are displayed like this - rec: 140 fp: 1
After filtering, the system concluded that 29 professionals are likely to respond to the email. These final recommendations contain one false positive - rec: 29 fp: 1
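
To compare cells more easily, the `rec: N fp: M` strings can be turned into a rough share of good matches, (rec - fp) / rec. This is only a sketch: `parse_cell` and `good_match_share` are hypothetical helpers that are not part of the notebook, and the share inherits the subjectivity of the false positive counts.

```python
import re

def parse_cell(cell):
    """Split a results string such as 'rec: 140 fp: 1' into two integers."""
    match = re.fullmatch(r'rec:\s*(\d+)\s+fp:\s*(\d+)', cell.strip())
    if match is None:
        raise ValueError('unexpected cell format: ' + repr(cell))
    return int(match.group(1)), int(match.group(2))

def good_match_share(cell):
    """Share of recommendations judged to be good matches.

    Returns None when the model made no recommendations.
    """
    rec, fp = parse_cell(cell)
    if rec == 0:
        return None
    return (rec - fp) / rec

# Cells taken from the computer science question in the results dataframe.
for cell in ['rec: 140 fp: 1', 'rec: 29 fp: 1', 'rec: 0 fp: 0']:
    print(cell, '->', good_match_share(cell))
```

A cell with zero recommendations is reported as `None` rather than as a perfect score, since no matches were made.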


These are the cosine similarity thresholds for each model:

MODEL_1_THRESHOLD = 0.13<br>
MODEL_2_THRESHOLD = 0.1<br>
MODEL_3_THRESHOLD = 0.65<br>
MODEL_4_THRESHOLD = 0.94

In [81]:
df_results.head(10)

| | question_index | question_title | Model_1_Tags | Model_2_Tfidf | Model_3_TSVD | Model_4_GloVe | Final_Filter_Output |
|---|---|---|---|---|---|---|---|
| 0 | 777 | I want to major in computer science. What classes should I take? | rec: 481 fp: 0 | rec: 392 fp: 12 | rec: 140 fp: 1 | rec: 53 fp: 0 | rec: 29 fp: 1 |
| 1 | 999 | What is required to become a firefighter? | rec: 2 fp: 0 | rec: 10 fp: 1 | rec: 0 fp: 0 | rec: 0 fp: 0 | rec: 2 fp: 1 |
| 2 | 2043 | What exactly, is the difference between a psychologist and psychiatrist? | rec: 1 fp: 1 | rec: 9 fp: 0 | rec: 0 fp: 0 | rec: 0 fp: 0 | rec: 1 fp: 1 |
| 3 | 2487 | How do I decide what career I want to choose? | rec: 0 fp: 0 | rec: 2 fp: 1 | rec: 6 fp: 1 | rec: 34 fp: 4 | rec: 2 fp: 0 |
| 4 | 3618 | What are the best ways to maintain a work and school balance? | rec: 1 fp: 1 | rec: 1 fp: 1 | rec: 0 fp: 0 | rec: 1588 fp: 0 | rec: 54 fp: 0 |
| 5 | 1710 | I want to become an army officer. What can I do to become an army officer? | rec: 21 fp: 2 | rec: 32 fp: 0 | rec: 46 fp: 0 | rec: 0 fp: 0 | rec: 9 fp: 0 |
| 6 | 1000 | What are the challenges I may face pursuing a career in science, and how can I stand out among the rest? | rec: 0 fp: 0 | rec: 0 fp: 0 | rec: 0 fp: 0 | rec: 26 fp: 0 | rec: 3 fp: 0 |
| 7 | custom question | How do I become a data scientist? | rec: 105 fp: 1 | rec: 99 fp: 10 | rec: 108 fp: 8 | rec: 0 fp: 0 | rec: 14 fp: 1 |
| 8 | custom question | How do I become a plumber? | rec: 4 fp: 0 | rec: 4 fp: 4 | rec: 7 fp: 7 | rec: 0 fp: 0 | rec: 2 fp: 1 |


<hr>

**Observations**<br>

This system is performing well on careers for which a lot of data is available. Examples include the computer science, military and data science questions. 

For questions about firefighting, psychology and plumbing the number of final recommendations is low. The models are also generating a higher ratio of false positives. There are professionals in the dataset who are qualified to answer these questions and the system is detecting them. However, the final filter is rejecting these professionals because they are not active. Therefore, I believe that the problem is not with this recommendation system. There's just not enough data representing these careers.

Still, this highlights a weakness of this recommender system: For certain questions, especially those where a career has a low representation in the data, model 4 has a tendency to generate many false positives. To see a demonstration of this try entering index 2018 (question id:  b53c5e9b7436453fa13a416a23b512cc). This is a question about becoming a Politician. 

For the more general/life-skills related questions...

~ How do I decide what career I want to choose?<br>
~ What are the challenges I may face pursuing a career in science, and how can I stand out among the rest?<br>
~ What are the best ways to maintain a work and school balance?

...Model 4 (GloVe) is performing well. It's nicely making up for the weakness that the other three models are showing. 

**Is this a good recommender system?**<br>

Based on these tests my preliminary conclusion is that this recommender system works. However, one should be careful about drawing conclusions after only a quick test like this. This system needs to be field tested. Only then will we know how robust it really is.

**Now just a few thoughts on setting thresholds...**

I've tried to reduce the number of irrelevant recommendations (i.e. bad matches between question and professional) by setting the model 3 and model 4 thresholds quite high. Reducing the number of false positives is important to inspire confidence in this system and to reduce the number of irrelevant questions sent to professionals. 

As an example, consider the question: 
"I want to become an army officer. What can I do to become an army officer?"

At the current threshold of 0.94 Model 4 is generating 0 recommendations. However, if this threshold were lowered to 0.85 Model 4 would generate 305 recommendations. This would cause an increase in the number of false positives. In addition to recommending more professionals with a military background, Model 4 would also recommend police officers.

Moreover, if the threshold were set at 0.85 then model 4 would generate 38,220 recommendations when given the very general question: "What are the best ways to maintain a work and school balance?" 

38,220 (contains duplicates) is a lot of professionals and one might guess that many false positives would be generated... but think again - would you or I consider such a question irrelevant?

That said, when choosing these thresholds it's important to consider **business priorities**. One of CareerVillage's priorities is 'No question left behind' meaning that no question should be left unanswered. In this case an executive decision could be made to use lower thresholds. The benefit would be that there would be a higher possibility that a question would get an answer because the system would output a longer list of recommended professionals. The risk is that the number of irrelevant questions sent to professionals will increase. One way to manage this risk would be to warn professionals ahead of time that a new system is being tested and request their patience (and possibly their feedback) as the system is being tuned. 

Another point to consider is that these thresholds are dictated by the amount and quality of the data. As the amount of data increases, these thresholds could be adjusted. This system is not static. Its ongoing performance will need to be monitored. With time it could get better as data quantity increases or it could get worse if the data becomes polluted, with spam for example.
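
The trade-off between a threshold and the length of the recommendation list can be illustrated with a few lines of NumPy. This is a minimal sketch on made-up similarity scores, not the notebook's actual model output, and it assumes a professional is recommended when the score is strictly above the threshold. `count_recommendations` is a hypothetical helper.

```python
import numpy as np

# Made-up cosine similarity scores between one question and eight professionals.
scores = np.array([0.97, 0.95, 0.91, 0.88, 0.86, 0.80, 0.62, 0.40])

def count_recommendations(similarities, threshold):
    """Number of professionals whose similarity is strictly above the threshold."""
    return int((similarities > threshold).sum())

# Lowering the threshold lengthens the list of recommended professionals.
for threshold in (0.94, 0.85, 0.65):
    print('threshold', threshold, '->',
          count_recommendations(scores, threshold), 'recommendations')
```

Running the same count over a grid of thresholds for a handful of test questions is one way to see where the list size starts to grow sharply.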


<hr>
<a id='Things'></a>

## 11. Things to keep in mind

**Domain knowledge is like the force, use it...**

If you are a domain expert, i.e. you work for CareerVillage or you're an experienced career counselor or child psychologist, then you may be able to use the practical insights you have to improve this solution. 

I suggest starting by looking at the filter. Are the conditions too strict? Are there better conditions that can be added?<br>
You can then try tuning the threshold values or tuning other model parameters to see if the quality and quantity of the recommendations improve. Also, this system is set up so that the models work independently. Therefore, it's possible to experiment by including and excluding certain models.

**Be careful...**

However, as mentioned above, please be careful when lowering the threshold value for model 4 (GloVe). If a question is very general then this model may cast a wide net and recommend thousands of professionals. This could crash the system.

**What steps can be taken to address the cold start problem?**

Say you've just signed up as a shopper on Amazon. The site won't know the most relevant products to recommend to you because you've never bought anything i.e. you don't have a shopping history. This is called the cold start problem. 

Here the new professionals are the "shoppers" and the "products" are the questions.

In this recommender system Model 1 gives every professional a chance to be paired with a question. It relies on tags and professional profiles. 

(Reminder: A Professional's profile includes a professional's industry and title.)

Therefore, one way of tackling the cold start problem is to encourage new professionals to complete their tag and profile information. There are many professionals whose info is incomplete or sparse. New professionals will have a better chance of being matched if they provide complete and detailed information about themselves. 

Another way of addressing the cold start problem is to simply do nothing. Professionals are currently able to scroll the forums to find questions to answer. With time cold start issues will resolve themselves as these professionals find and answer questions. Once they answer a question Model 2, 3 or 4 will automatically match them to more questions.

**Fairness**

One of the requirements for fairness is a diverse dataset. This isn't always easy to create for many reasons, one being that the perspective of those creating it may be limited. Let me explain using a professions vs trades example. (Please note that the term bias here means one-sidedness. Not bias in the sense of bias/variance.)

Society tends to esteem professions above trades. Those of us who've reaped the social and financial rewards of a university education would naturally want to encourage all children to follow this path. But not all children want to be doctors. Many can find secure, fulfilling and often lucrative careers as plumbers and electricians. Not everyone knows that such opportunities exist outside the university system.

If CareerVillage's marketing team were to target a predominantly white-collar demographic when trying to attract new professionals, then the number of university educated professionals in the dataset would be higher, and growing faster, than the number of tradesmen or tradeswomen. This would lead to a data bias in favour of professions. Any algorithms constructed using this data would reflect this bias. The consequence - students who submit questions about learning a trade won't get answers. This would be an "exclusionary user experience", also known as discrimination. All because the data is not diverse.

Here's an enlightening TED talk on algorithmic bias:<br>
https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms?language=en#t-232473



<hr>
<a id='Ideas'></a>

## 12. Ideas for sharpening this system

**Include a Stoplist**<br>

A stoplist is a pre-defined list of root words. It can be used to blacklist professionals who've given answers or have profiles that contain words that are in the stoplist. The recommender system shouldn't know that these professionals exist.

Why? CareerVillage meets a very real need. It's certain to become popular with more students and professionals, in more countries. Unfortunately this increase in popularity will attract both the good and the bad - human and bot. It's important to build a stoplist into the recommender system that will exclude certain professionals from the final output when that person is promoting an agenda that's contrary to the CareerVillage value system.
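
A stoplist filter could look something like the sketch below. Everything in it is illustrative: the root words, the toy dataframe and the `text` column (assumed to hold a professional's combined profile and answer text) are made up, and `contains_stop_word` is a hypothetical helper. Only the `professionals_id` column name comes from this notebook.

```python
import re
import pandas as pd

# Hypothetical root words. A real stoplist would be curated by CareerVillage.
STOPLIST = ['casino', 'bitcoin']

# One case-insensitive pattern. Matching on roots means 'casinos' is caught too.
stop_pattern = re.compile('|'.join(re.escape(word) for word in STOPLIST),
                          flags=re.IGNORECASE)

def contains_stop_word(text):
    """True if the text contains any stoplist root word."""
    return bool(stop_pattern.search(str(text)))

# Toy data: 'text' stands for a professional's profile and answer text.
df_toy = pd.DataFrame({
    'professionals_id': ['a1', 'b2', 'c3'],
    'text': ['Software engineer who mentors students',
             'Win big at online CASINOS today',
             'Registered nurse in paediatrics'],
})

# Remove flagged professionals before any model sees them.
blocked = df_toy['text'].apply(contains_stop_word)
df_allowed = df_toy[~blocked]

print(list(df_allowed['professionals_id']))
```

Plain substring matching on roots is deliberately blunt, so it can also flag harmless words that happen to contain a root. A production list would need review for such collisions.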

**Train a Neural Network to automatically rate the quality of answers**<br>

The CareerVillage vision is to give students answers that are tailored, reliable, encouraging and inspirational. There are many answers in the dataset that meet these requirements. Unfortunately, there are also answers that don't. Being able to automatically rate the quality of answers may help improve them.

How might this be done?<br>

The main pre-requisite is a labeled dataset. Once this is in place one could train a deep neural network to read each answer and then rate it on a scale of 1 to 10 according to the CareerVillage [Pro Tips](https://medium.com/@careervillage/introducing-protips-2d4ad51c445a) guidelines. This neural network could have one of several architectures - DNN, CNN, RNN or even BERT, which is a state of the art pre-trained network created by Google. 

However, factors such as memory requirements, training time, inference time, web page load time, maintenance complexity and hosting costs will need to be considered when deciding if this idea is feasible.


**Add "Recent Login" as a criterion for selecting professionals**

If a professional visited CareerVillage recently then he or she was probably scrolling through the forums reading the questions and answers. There's a good chance that this person will respond to an email notification to answer a relevant question. Login information is not part of this dataset. If it becomes available, the following condition should be added to the filter:<br>
"Has this professional visited the site in the last 14 days?"
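
Such a condition could be computed in the same way as the other filter flags. The sketch below is an assumption-laden illustration: `last_login` is a hypothetical column (login data is not in the dataset), `LOGIN_WINDOW_DAYS` is an illustrative constant, and the reference date is fixed only to keep the example reproducible.

```python
import pandas as pd

LOGIN_WINDOW_DAYS = 14

# Toy data. 'last_login' is a hypothetical column that the dataset lacks.
df_logins = pd.DataFrame({
    'professionals_id': ['a1', 'b2', 'c3'],
    'last_login': pd.to_datetime(['2019-01-30', '2019-01-10', '2019-01-17']),
})

# Reference date, fixed here so that the example gives the same result every time.
reference_date = pd.Timestamp('2019-01-31')

# Whole days since each professional's last visit.
days_since_login = (reference_date - df_logins['last_login']).dt.days

# 1 if the professional visited within the window, else 0.
df_logins['recent_login'] = (days_since_login <= LOGIN_WINDOW_DAYS).astype(int)

print(df_logins[['professionals_id', 'recent_login']])
```

The resulting 0/1 column could then be added to the row sum that produces `total_score`.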

**Encourage all professionals to complete their profile information**

There are many professionals who have incomplete or sparse profile information. This recommender system relies on profile information. If more professionals provide complete profiles then the performance of this system will improve.
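
One way to measure how widespread incomplete profiles are is to count missing fields per professional. The sketch below uses a toy dataframe. It assumes that the industry and title live in columns named `professionals_industry` and `professionals_headline`, which should be checked against the real `df_professionals` before use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_professionals. The two profile column names are assumptions.
df_toy = pd.DataFrame({
    'professionals_id': ['a1', 'b2', 'c3', 'd4'],
    'professionals_industry': ['Telecommunications', np.nan, 'Healthcare', np.nan],
    'professionals_headline': ['Network engineer', 'Teacher', np.nan, np.nan],
})

profile_cols = ['professionals_industry', 'professionals_headline']

# Number of missing profile fields for each professional.
df_toy['missing_fields'] = df_toy[profile_cols].isna().sum(axis=1)

# Share of professionals with at least one missing profile field.
share_incomplete = (df_toy['missing_fields'] > 0).mean()

print(df_toy[['professionals_id', 'missing_fields']])
print('Share with incomplete profiles:', share_incomplete)
```

Tracking this share over time would show whether reminders to complete a profile are having an effect.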

**Add 'About me' and 'Career Stories' to the profile information**

On some professionals' CareerVillage profile pages there are sections called "About me" and "Career Stories". In these sections they share what their career path has been, life experiences, past mistakes, what they do on a typical work day and other useful personal information and experiences. Here's a good example:<br>
https://www.careervillage.org/users/9852/kim/

This data is not included in this dataset, possibly because only a few professionals choose to share this information. "About me" and "Career Stories" could be a valuable data source for this recommender system. Including them will also help address the cold start problem.

<a id='Citations'></a>

## Citations

1. GloVe: [Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)<br>
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. 

2. Photo by jeshoots.com on Pixabay

<a id='Reference_Kernels'></a>

## Reference Kernels

1. Rounak Banik - Movie Recommender Systems<br>
https://www.kaggle.com/rounakbanik/movie-recommender-systems

2. Chris Crawford - Starter kernel<br>
https://www.kaggle.com/crawford/starter-kernel

3. wjsheng - UPDATE 5: text processing<br>
https://www.kaggle.com/wjshenggggg/update-5-text-processing

4. RodH - Recommender: things to consider<br>
https://www.kaggle.com/rdhnw1/recommender-things-to-consider

5. Marsh - Keras cnn + GloVe + Early Stopping<br>
https://www.kaggle.com/vbookshelf/keras-cnn-glove-early-stopping-0-048-lb


<a id='Helpful_Resources'></a>

## Helpful Resources

1. Frank Kane course on Building Recommender Systems<br>
https://www.udemy.com/building-recommender-systems-with-machine-learning-and-ai

2. Andrew Ng Deep Learning Specialization, Sequence Models, Week 2<br>
https://www.coursera.org/learn/nlp-sequence-models

3. What are word embeddings?<br>
https://www.youtube.com/watch?v=Eku_pbZ3-Mw

4. Blog post with a simple example explaining how to use pre trained embeddings:<br>
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

5. Machine learning with text<br>
https://www.youtube.com/watch?v=ZiKMIuYidY0

6. NLTK Tutorial series<br>
https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/

7. Blog post by Kaggle Grandmaster Abhishek Thakur<br>
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/

8. Tutorial on merging dataframes<br>
https://www.youtube.com/watch?v=h4hOPGo4UVU

9. Blog post by William Zinsser<br>
https://theamericanscholar.org/writing-english-as-a-second-language/#.XJ8oJhMzYWo



<hr>

<a id='Conclusion'></a>

## Conclusion

Ikigai is a formula for happiness and fulfillment. It's a Japanese word that roughly means "a reason for being" or "the reason you wake up in the morning". It's the area of intersection of four overlapping circles: what you love to do, what you're good at, what you get paid to do and what the world needs.

Thank you CareerVillage and Kaggle for hosting this challenging competition.