## Ikigai - A Career Village RecSys

by Marsh [ @vbookshelf ]<br>
9 April 2019

<img src="http://bee.test.woza.work/assets/student.jpg" width="500"></img>

## Contents

<a href='#Introduction'>1. Introduction</a><br>
<a href='#Prepare_the_Data'>2. Prepare the Data</a><br>
<a href='#Ask_a_Question'>3. Ask a Question</a><br>
<a href='#Model_1'>4. Model 1 -  Tags and Profiles</a><br>
<a href='#Model_2'>5. Model 2 - Profiles and Answers</a><br>
<a href='#Model_3'>6. Model 3 - TruncatedSVD</a><br>
<a href='#Model_4'>7. Model 4 - GloVe Embeddings</a><br>
<a href='#Select_professionals'>8. Select professionals who are most likely to answer the question</a><br>
<a href='#Final_Output'>9. Final Output - The chosen ones</a><br>
<a href='#Testing'>10. Testing and Results</a><br>
<a href='#Things'>11. Things to keep in mind</a><br>
<a href='#Ideas'>12. Ideas for sharpening this system</a><br>

<a href='#Citations'>Citations</a><br>
<a href='#Reference_Kernels'>Reference Kernels</a><br>
<a href='#Helpful_Resources'>Helpful Resources</a><br>
<a href='#Conclusion'>Conclusion</a><br>


In [None]:
# Set a seed value
from numpy.random import seed
seed(101)

import pandas as pd
import numpy as np
import os

import pickle
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Don't Show Warning Messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Read the data

df_questions = \
pd.read_csv('../input/data-science-for-good-careervillage/questions.csv')
df_answers = \
pd.read_csv('../input/data-science-for-good-careervillage/answers.csv')
df_professionals = \
pd.read_csv('../input/data-science-for-good-careervillage/professionals.csv')

df_comments = \
pd.read_csv('../input/data-science-for-good-careervillage/comments.csv')
df_tags = \
pd.read_csv('../input/data-science-for-good-careervillage/tags.csv')
df_tag_users = \
pd.read_csv('../input/data-science-for-good-careervillage/tag_users.csv')

#print(df_questions.shape)
#print(df_answers.shape)
#print(df_professionals.shape)
#print(df_comments.shape)
#print(df_tags.shape)
#print(df_tag_users.shape)

| <a id='Introduction'></a>

## 1. Introduction

If you're like me then a lot of your emails go unopened. It's because you know from reading the title or just from the sender's name that you've got no interest in the content. Who cares if a city that you visited 6 months ago now has "some great deals on hotel rooms"? Our brains are becoming efficient content filters. Anything that's not relevant will be ignored.

That's why it's important for CareerVillage to have a good recommender system (RecSys). Professionals need to feel that the questions sent to them are relevant. Creating a personalized experience will make them feel valued. This will lead to more questions being answered and faster answers.

The objective of this competition is to develop a method to recommend relevant questions to the professionals who are most likely to answer them.

This solution follows two steps:<br>

*Step 1*: Develop a method to recommend relevant questions to professionals.<br>
*Step 2*: Identify those professionals who are most likely to answer a relevant question.

It's not practical to measure the quality of this solution using an evaluation metric. To overcome this we will assess the results qualitatively using a simple run and read approach. 

In step 1 you'll ask a question. Then you'll look at the professionals that each model recommends and ask: "Based on this person's profile or past answer, will this recommended professional be **capable** of answering this question?"<br>
This is an intuitive way of assessing the relevance of your question to a particular professional.

In step 2, to assess whether or not a professional will respond to a relevant question, this system will look at four indicators:

1. Is this professional a new member?
2. Has this professional answered a past question from this student?
3. Did this professional answer a question recently?
4. Did this professional make a comment recently?

Using this filter, this system will generate a final list of professionals that are likely to respond to your question.
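
To make the idea of the filter concrete, here is a minimal sketch of how the four indicators could be combined. This is an illustration only: the `ProfessionalActivity` class, the 30 day window and the choice to combine the indicators with a simple "or" are all my assumptions for this example, not the actual filter code used later in this notebook.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ProfessionalActivity:
    """Hypothetical summary of one professional's activity."""
    date_joined: date
    answered_this_student: bool
    last_answer_date: Optional[date]
    last_comment_date: Optional[date]


def is_recent(event_date, today, window):
    """True if the event happened within the window ending today."""
    return event_date is not None and (today - event_date) <= window


def is_likely_to_respond(prof, today, window_days=30):
    """Apply the four indicators. Any single one is enough (assumption)."""
    window = timedelta(days=window_days)
    is_new_member = is_recent(prof.date_joined, today, window)
    answered_recently = is_recent(prof.last_answer_date, today, window)
    commented_recently = is_recent(prof.last_comment_date, today, window)
    return (is_new_member or prof.answered_this_student
            or answered_recently or commented_recently)


today = date(2019, 4, 9)
newcomer = ProfessionalActivity(date(2019, 4, 1), False, None, None)
inactive = ProfessionalActivity(date(2016, 1, 1), False, date(2017, 1, 1), None)

print(is_likely_to_respond(newcomer, today))
print(is_likely_to_respond(inactive, today))
```

The newcomer passes because of indicator 1 alone, while the long-standing but inactive member fails all four checks.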

A central feature of this system is that it compares your question to each professional's background info or to all answers in the dataset. If the question is similar to a particular professional's background or to a past answer that he or she gave - then it's likely that that professional is able to answer your question. In other words, your question is relevant to that professional.


**RecSys Architecture**

This recommender system is made up of four models and a filter:<br>

( A professional's profile is made up of the industry they work in and their title. )

- Model 1 uses tags followed, professional profiles and tfidf (Term Frequency Inverse Document Frequency)
- Model 2 uses professional profiles, past answers and tfidf
- Model 3 uses professional profiles, past answers, tfidf and Truncated SVD (Singular Value Decomposition)
- Model 4 uses professional profiles, past answers and GloVe pre-trained word embeddings  (Global Vectors for Word Representation)

Why do we need four models? It's because not all models perform equally well on all questions. This is a small dataset and different careers are not equally represented. For example, there are a lot of professionals with a computer science background but few firefighters. Professions are well represented. Trades are not. Also, not all questions are about specific careers; some are about life. Therefore, having a diversity of models that tackle the problem in different ways is the most prudent approach.

**Interactive Notebook**

This notebook is set up in such a way that you'll be able to select a question from the dataset, run the entire notebook and then see a printout of results for each model, and for the filter. There's also an option to type in your own question as you would on the CareerVillage website.

**Testing and Results**

Does this recommender system work? Yes it does. 

To demonstrate this fact I've tested it on nine questions that are representative of the data. The number of recommendations produced and the number of false positives in those recommendations is tabulated in section 10.


My goal here is to build a working prototype. Let's start by preparing the data.

| <a id='Prepare_the_Data'></a>

## 2. Prepare the Data

In [None]:
# Check what folders are available

os.listdir('../input')

I've done the data preparation in a separate kernel. This is the link:<br>
https://www.kaggle.com/vbookshelf/data-prep-for-careervillage-recsys


Here we'll simply load the prepared data from that kernel's output.

df_qa_prof.pickle is the pre-processed dataframe that we will use in the three answer-based models (Models 2, 3 and 4). This is a merged dataframe that includes questions, answers and professionals. Professionals who didn't answer any questions are not included.

There's a new column called quest_text where each cell contains both the question title and the question body. There's also a new column called answers_text where each cell contains the combined content of the following columns: professionals_headline, professionals_industry and answers_body.
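
As a rough illustration of how these two combined columns could be built, here is a minimal sketch on a one-row toy dataframe. The row contents are made up, and the real data prep kernel also cleans the text, which this sketch leaves out.

```python
import pandas as pd

# Toy stand-in for the merged dataframe (hypothetical values).
df = pd.DataFrame({
    "questions_title": ["How do I become a nurse?"],
    "questions_body": ["What should I major in? #nursing"],
    "professionals_headline": ["Registered Nurse"],
    "professionals_industry": [None],  # missing values do occur
    "answers_body": ["<p>Start with a BSN degree.</p>"],
})

# Replace missing values first, otherwise the string sum would give NaN.
text_cols = list(df.columns)
df[text_cols] = df[text_cols].fillna("")

# Question title + question body
df["quest_text"] = df["questions_title"] + " " + df["questions_body"]

# Headline + industry + answer body
df["answers_text"] = (df["professionals_headline"] + " "
                      + df["professionals_industry"] + " "
                      + df["answers_body"])

print(df.loc[0, "quest_text"])
print(df.loc[0, "answers_text"])
```

Filling the missing values before concatenating matters because a single missing headline or industry would otherwise wipe out the whole combined string.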

In [None]:
# Load the pickled dataframe

path_1 = '../input/data-prep-for-career-village-recsys/df_qa_prof.pickle'

df_qa_prof = pickle.load(open(path_1,'rb'))

# check the shape
df_qa_prof.shape

In [None]:
df_qa_prof.head(2)

In [None]:
# Define a function to clean the text

def process_text(x):
    
    # remove the hash sign
    x = x.replace("#", "")
    
    # replace the dash sign with a space (currently disabled)
    #x = x.replace("-", " ")
    
    # Remove HTML. Naming the parser explicitly avoids a bs4 warning.
    x = BeautifulSoup(x, "html.parser").get_text()
    
    # convert words to lower case
    x = x.lower()
    
    # remove the word question
    x = x.replace("question", "")
    
    # remove the word career
    x = x.replace("career", "")
    
    # remove the word study
    x = x.replace("study", "")
    
    # remove the word student
    x = x.replace("student", "")
    
    # remove the word school
    x = x.replace("school", "")
    
    # Remove non-letters
    x = re.sub("[^a-zA-Z]"," ", x)
    
    # Remove stop words
    # Split the text (already lower case) into words
    words = x.split()
    # a set makes the membership check fast
    stops = set(stopwords.words("english"))
    x_list = [w for w in words if w not in stops]
    # convert the list to a string
    x = ' '.join(x_list)
    
    return x
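
One thing to be aware of in the function above: `str.replace` removes substrings, so removing "study" also turns "studying" into "ing". If you'd prefer to remove whole words only, a regex with word boundaries is one option. The sketch below is a possible variant, not what this notebook uses, and the helper names are my own.

```python
import re

DOMAIN_WORDS = ["question", "career", "study", "student", "school"]

# \b matches a word boundary, so only whole words are removed.
DOMAIN_WORD_PATTERN = re.compile(r"\b(?:" + "|".join(DOMAIN_WORDS) + r")\b")


def remove_substrings(text):
    """Same approach as process_text(): plain substring removal."""
    text = text.lower()
    for word in DOMAIN_WORDS:
        text = text.replace(word, "")
    return " ".join(text.split())


def remove_whole_words(text):
    """Variant: remove whole-word matches only."""
    text = DOMAIN_WORD_PATTERN.sub(" ", text.lower())
    return " ".join(text.split())


sample = "Studying at school"
print(remove_substrings(sample))
print(remove_whole_words(sample))
```

Which behaviour is better depends on the data. Substring removal also strips plurals like "students" (leaving a stray "s"), which may actually be what you want.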

<hr>
| <a id='Ask_a_Question'></a>

## 3. Ask a Question

Please select any question in the dataset or type your own question. Then run all cells in this kernel.


 ### ~ Option 1: Choose a question from the CareerVillage dataset ~
Please set QUESTION_INDEX equal to any row index between 0 and 51000.<br>
<br>
For your first try I suggest using QUESTION_INDEX = 777. It's a computer science related question that nicely demonstrates the performance of each of the 4 models and the filter.

If you'd like to choose Option 2 then please set QUESTION_INDEX = None

In [None]:
###################################

QUESTION_INDEX = 1710

###################################

### ~ Option 2: Ask a question as you would on the CareerVillage site ~
Please type your text within inverted commas - " i am a string "

In [None]:
# =========================================== #
# Please check that QUESTION_INDEX = None in the above cell before entering
# your own question.

my_question_title = "How do I become a data scientist?"

my_question_body = "I want to be a data scientist. What subjects should I study? #data-science"

# =========================================== #

==> After selecting one of the above options please Run all cells in this kernel. <==

### ~ This is your Question ~

In [None]:
# Code to process the question

# if Option 1 is chosen
if QUESTION_INDEX is not None:
    
    QUESTION_INDEX = int(QUESTION_INDEX)
    
    student_id = df_qa_prof.loc[QUESTION_INDEX, 'questions_author_id']
    # Get the question info from the dataset.
    # The text has already been cleaned above.
    question_id = df_qa_prof.loc[QUESTION_INDEX, 'questions_id']
    question_title = df_qa_prof.loc[QUESTION_INDEX, 'questions_title']
    question_body = df_qa_prof.loc[QUESTION_INDEX, 'questions_body']
    # question_text is clean text that is used in the models
    question_text = df_qa_prof.loc[QUESTION_INDEX, 'quest_text'] 

# if Option 2 is chosen
else:
    student_id = 33333333 # dummy id that's needed for the final selection code
    # get the input question
    question_id = 'My Question'
    question_title = my_question_title
    question_body = my_question_body
    # Clean the text using the process_text() function.
    # question_text is clean text that is used in the models
    question_text = process_text(question_title) + ' ' + process_text(question_body)
    

# Print the question
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

<hr>
| <a id='Model_1'></a>

## 4. Model 1 -  Tags, Profiles, Tfidf and Cosine Similarity


This model considers every professional in the dataset irrespective of whether or not they have answered a past question. 

**How does this model  work?**

It compares your question to each professional's background info. Background info is made up of a professional's title, the industry they work in and the tags they follow. The hash symbols are removed from the tags - the model sees the tags as words. The similarity is measured by comparing the vector encoding of the question to the vector encoding of each professional's background info. The encodings are created using Tfidf (Term Frequency Inverse Document Frequency). The vectors are compared using cosine similarity.

> **The idea is that if a question is similar to a professional's background then there's a good chance that he or she will be able to answer the question.**
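
Here's a tiny, self-contained sketch of this idea using scikit-learn. The question and the three background strings are made up for illustration, and the vectorizer settings are simpler than the ones used in the actual model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "how do i become a software engineer"

# Hypothetical background info: title + industry + tags followed
backgrounds = [
    "software engineer computer software python programming",
    "registered nurse hospital health care nursing",
    "high school teacher education",
]

# Vectorize the question together with the backgrounds.
# The question is placed in the first row.
vect = TfidfVectorizer(stop_words="english")
dtm = vect.fit_transform([question] + backgrounds)

# Compare the question (row 0) to every background (rows 1 onwards).
scores = cosine_similarity(dtm[0:1], dtm[1:]).ravel()
best = int(scores.argmax())

print(scores)
print("Best match:", backgrounds[best])
```

The software engineer shares words with the question, so that background gets the highest score. The other two backgrounds share no words with the question and score zero.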

**How can we assess how well the model is working?**

We will "run and read". We will run the model and then look at the results to see if they make sense. The model will print out the background info of each professional it has selected. By reading this and comparing it to the question we'll be able to tell whether the model is making reasonable choices.




## 4.1. Prepare the data

In [None]:
# load df_professionals
path_2 = '../input/data-prep-for-career-village-recsys/df_professionals.pickle'
df_professionals = pickle.load(open(path_2,'rb'))

# replace all missing values with nothing
df_professionals = df_professionals.fillna('')

# Create a dictionary of tag id's and tag names
keys = list(df_tags['tags_tag_id'])
values = list(df_tags['tags_tag_name'])
tags_dict = dict(zip(keys, values))

# Change the tag id numbers to tag names that we can read
df_tag_users['tag_name'] = df_tag_users['tag_users_tag_id'].map(tags_dict)

df_tag_users.head()

Because anyone is able to follow a tag there could be a mixture of students and professionals in df_tag_users. We need to keep only the professionals.

In [None]:
# get a list of professionals
prof_list = list(df_professionals['professionals_id'])
# keep only the rows in df_tag_users that belong to professionals
df_prof_tag_users = df_tag_users[df_tag_users['tag_users_user_id'].isin(prof_list)]

df_prof_tag_users.shape

Now that we've kept only the professionals, let's see which tags each professional follows. The hash sign # has been removed from the tags.

In [None]:
# drop the tag_users_tag_id column
df_prof_tag_users = df_prof_tag_users.drop('tag_users_tag_id', axis=1)

# replace missing values with nothing - just to be safe
df_prof_tag_users = df_prof_tag_users.fillna('')

# add a space to the end of each tag name
def add_space(x):
    x = x + ' '
    
    return x

df_prof_tag_users['tag_name'] = df_prof_tag_users['tag_name'].apply(add_space)

# groupby tag_users_user_id and sum() the tags
df_prof_tag_users = df_prof_tag_users.groupby('tag_users_user_id').sum()

# reset the index
df_prof_tag_users = df_prof_tag_users.reset_index()

# check how many professionals follow tags
num_followers = len(df_prof_tag_users['tag_users_user_id'])

# Are there professionals who don't follow any tags?

num_profs = df_professionals['professionals_id'].nunique()
num_tag_followers = df_prof_tag_users['tag_users_user_id'].nunique()

num_not_followers = num_profs - num_tag_followers

print(num_followers, 'professionals follow tags.')
print(num_not_followers, 'professionals do not follow tags.')

df_prof_tag_users.head()

2649 professionals don't follow tags.
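
The cell above joins the tags by adding a space to each tag name and then summing the strings within each group. An alternative that some people find easier to read is to aggregate with `' '.join`. Here's a minimal sketch on toy data (the ids and tags are made up):

```python
import pandas as pd

# Toy version of df_prof_tag_users: one row per (user, tag) pair.
df_toy = pd.DataFrame({
    "tag_users_user_id": ["p1", "p1", "p2"],
    "tag_name": ["python", "data-science", "nursing"],
})

# Join all tag names of each user into one space separated string.
tags_per_user = (df_toy.groupby("tag_users_user_id")["tag_name"]
                 .agg(" ".join)
                 .reset_index())

print(tags_per_user)
```

This gives one row per user, with no trailing space at the end of each tag string.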

### Add a new column to the df_professionals dataframe that shows the tags that each professional follows.

Because there are 2649 professionals that don't follow any tags, a default (inner) merge of df_professionals and df_prof_tag_users would silently drop them. We must keep this in mind when merging dataframes. We will do a left join instead. This keeps every row from the left dataframe (df_professionals) and adds the tag info wherever a match exists. Please refer to the tutorial video referenced in the 'Helpful Resources' section if you'd like to learn more about merging dataframes.
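
To see the difference between the two joins, here is a small sketch with toy dataframes (the ids and values are made up). Professional p2 follows no tags.

```python
import pandas as pd

profs = pd.DataFrame({
    "professionals_id": ["p1", "p2", "p3"],
    "professionals_headline": ["Data Scientist", "Firefighter", "Nurse"],
})

# p2 does not appear here because p2 follows no tags
tags = pd.DataFrame({
    "professionals_id": ["p1", "p3"],
    "tags_followed": ["python data-science ", "nursing "],
})

# The default is how='inner': p2 is dropped.
inner = pd.merge(profs, tags, on="professionals_id")

# how='left': p2 is kept, with a missing value in tags_followed.
left = pd.merge(profs, tags, on="professionals_id", how="left")

print(len(inner), "rows after the inner join")
print(len(left), "rows after the left join")
print(left["tags_followed"].isna().sum(), "professional with no tags")
```

The missing value that the left join creates is why the merged dataframe needs a `fillna('')` afterwards.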

In [None]:
# https://www.youtube.com/watch?v=h4hOPGo4UVU

# Change column name in df_prof_tag_users. 
# For the merge to work the column called professionals_id needs to be in
# both dataframes.
new_names = ['professionals_id', 'tags_followed']
df_prof_tag_users.columns = new_names

# perform the left merge
df_profs = pd.merge(df_professionals,df_prof_tag_users, 
                   on='professionals_id', how='left')

# replace missing values with nothing
df_profs = df_profs.fillna('')

print('We now have a combined dataframe containing the tag info and profile info for all professionals.')

df_profs.head()

### Create a new column that contains the background info of each professional. Then clean the text in this new column.


In [None]:
# Create the new column by summing the strings from each separate column.
df_profs['prof_info'] = df_profs['professionals_headline'] + ' ' \
+ df_profs['professionals_industry'] + ' ' + df_profs['tags_followed']

# clean the text using the process_text() function defined above
df_profs['prof_info'] = df_profs['prof_info'].apply(process_text)

print('The prof_info column contains the combined profile info of each professional.')
df_profs.head()

## 4.2. Process the question

Here we are inserting the question into the top row of the prof_info column. This column contains the background info of each professional. Inserting the question at the top of this column will make the cosine comparison code easier to write.

In [None]:
# copy a row from df_profs (.copy() avoids a SettingWithCopyWarning)
df_row1 = df_profs[df_profs.index == 0].copy()
# set all values to nothing
df_row1.loc[:,:] = ''
# reset the index
df_row1 = df_row1.reset_index(drop=True)
    
# Assign the prof_info in this row to be the same as the question.
# We do this because later we will compare this question to all other rows
# in the prof_info column.
df_row1.loc[0, 'prof_info'] = question_text

# Concat df_row1 to df_profs
# The question will be the first row
df_profs = pd.concat([df_row1, df_profs], axis=0).reset_index(drop=True)

print('The Question, in processed form, is now located at the top of the prof_info column.')

df_profs.head(2)

## 4.3. Vectorize the data

In [None]:

# Select the data we want to use. 
# This column has our new question at the top.
data = df_profs['prof_info']

# instantiate vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5)

# learn the 'vocabulary' of the data
vect.fit(data)

# Transform the data into a document term matrix.
# Keep in mind that the output type is a sparse matrix.
prof_dtm = vect.transform(data)

#prof_dtm.shape

In [None]:
# check what features have been created
#vect.get_feature_names()

## 4.4. Calculate the Cosine Similarity
We are calculating the similarity of your question to each professional's background info. This profile is made up of a professional's headline, industry and the tags that he or she follows.

TfidfVectorizer L2-normalizes each vector by default (norm='l2'), so every vector already has a length of 1. The cosine similarity of two unit vectors is simply their dot product, so we only need to take the dot product. In scikit-learn the dot product is available as the linear kernel.
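
If you'd like to convince yourself of this, the following sketch checks it on three made-up documents. It assumes the default TfidfVectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "data science and machine learning",
    "machine learning engineer",
    "pediatric nurse",
]

dtm = TfidfVectorizer().fit_transform(docs)

# Length (L2 norm) of each row vector. Should be 1.0 for every row.
row_norms = np.sqrt(np.asarray(dtm.multiply(dtm).sum(axis=1)).ravel())

# Compare the first document to all documents in two ways.
dot = linear_kernel(dtm[0:1], dtm)
cos = cosine_similarity(dtm[0:1], dtm)

print(row_norms)
print(np.allclose(dot, cos))
```

Note that this shortcut only holds while the vectors are normalized. If you change the vectorizer to `norm=None`, use `cosine_similarity` instead.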

In [None]:
# https://stackoverflow.com/questions/12118720/
# python-tf-idf-cosine-to-find-document-similarity

# prof_dtm[0:1] This selects the first row of prof_info column.
# We are saying: Tell me how similar every row is to the first row.
cosine_similarities = linear_kernel(prof_dtm[0:1], prof_dtm)

# The line of code commented out below would give us the cosine similarity score
# of every row to every other row, just like a correlation matrix.
# But there's no need for this and the RAM needed for this calculation
# would cause this kernel to crash.

# cosine_similarities = linear_kernel(prof_dtm, prof_dtm)

# Quick check: The first value should be 1.0 because it's the
# comparison of the question to itself.
cosine_similarities

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [None]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# get the professional ids from df_profs to use as column names
cols = list(df_profs['professionals_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Add the professionals id values as a new column.
# This is identical to answers_author_id.
#df_cosine_matrix['answers_id'] = df_train['answers_author_id']

# set the answers_id column as the index
#df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_prof_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_prof_id',axis=0, ascending=False)

# check the top 10 cosine scores
df.head(10)

## 4.5. Select the Professionals
Here we'll select those professionals whose background info has a cosine similarity (to the question) that is greater than or equal to a threshold. I established this threshold by trial and error. I asked several questions and looked for a cosine similarity value below which the answers were not relevant to the question. 0.13 seems to be a reasonable value for this data.
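
A quick way to support this kind of trial and error is to count how many professionals survive at a few candidate thresholds. The sketch below uses made-up scores. In the notebook you would pass in the sorted cosine score column instead.

```python
import pandas as pd

# Hypothetical cosine scores, indexed by professional id
scores = pd.Series(
    [0.42, 0.31, 0.18, 0.13, 0.09, 0.02],
    index=["p1", "p2", "p3", "p4", "p5", "p6"],
)


def count_selected(scores, threshold):
    """Number of professionals with a score >= threshold."""
    return int((scores >= threshold).sum())


for threshold in (0.05, 0.13, 0.20):
    print(threshold, "->", count_selected(scores, threshold), "selected")
```

The count only tells you how many professionals get through. You still need to read the printed profiles to judge whether the ones near the cut-off are relevant.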

In [None]:
# Set the cosine similarity threshold
MODEL_1_THRESHOLD = 0.13

# keep all rows that have a cosine_score >= THRESHOLD
df = df[df['cosine_score_for_each_prof_id'] >= MODEL_1_THRESHOLD]

# remove the first row because this row is the question we asked
df = df[1:]

num_professionals = len(df)

print('Number of professionals chosen: ', num_professionals)

print('This is a sample of the professionals the model has selected.')

# Print the id's of the professionals who have been 
# selected as well as the associated cosine scores
df.head(10)

Here we create a python list containing the id's of each professional selected.

In [None]:
# reset the index
df.reset_index(inplace=True)

# rename the columns
new_names = ['prof_id', 'cosine_score_for_each_prof_id']
df.columns = new_names

# create a list with all professional id values from df
prof_list = list(df['prof_id'])

# display the list
#prof_list

## 4.6. Why did the model select these professionals?

Next we'll print out the background info of the selected professionals. By looking at this we'll be able to tell if a professional was a reasonable choice or a bad choice. 

**First, let's print the question again.**

In [None]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

**Now let's print the background info. You'll need to scroll through the output. When looking at the printout in a forked kaggle kernel you could mistakenly think that what you see is all there is. Scroll to see more.**<br>
| <a id='model_1_prof_printout'></a>

In [None]:
# Print the profiles of the professionals who can answer this question. 
# Note: If you are running this kernel you may need to scroll the output otherwise
# you might mistakenly think that the text shown is all there is.

print('\n')
print('Model 1')
print('Number of professionals selected: ', len(prof_list))
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')

# set the index of df_profs to be the professional id
df_profs = df_profs.set_index('professionals_id')

# Create an empty list to store the id's of the professionals
# that have been selected.
model_1_list = []


for prof_id in prof_list:
    
    print('\n')
    
    # print the professional's id (i.e. their name) and add it to the list
    
    print('==> Professional id: ', prof_id)
    model_1_list.append(prof_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the tags that are followed
    tags = df_profs.loc[prof_id,'tags_followed']
    print('==Tags being followed:\n',tags)
    

What do you think? Just by looking at their backgrounds would you have chosen these professionals to answer your question?

This is only a preliminary list. Once we have the recommendations from all four models we'll make a final selection of professionals based on who is most likely to respond to an email containing the question. 

These are the id's of the professionals that Model_1 has selected:

In [None]:
model_1_list

Model 1 gives every professional in the dataset a chance to be selected, including those who've joined recently and haven't yet answered any questions. However, the three models that follow will only consider professionals who have answered past questions. 

Also, Model 1 used the hash tags that professionals follow. The other three models won't use these hash tags.

<hr>
|<a id='Model_2'></a>

## 5. Model 2 - Answers, Tfidf and Cosine Similarity

This model only considers professionals who've answered a past question.

**How does this model work?**

This model compares your question to every answer in the dataset.

> **The idea is that if your question is similar to a past answer then there's a good chance that the professional who gave that answer will be able to answer your question.**


**How can we assess how well the model is working?**

At the end the model will print both the past answer that was matched, and the profile of the professional who gave that answer. By reading this information you'll be able to judge whether that professional was a good choice to answer your question. 

## 5.1. Load the data

In [None]:
# load df_qa_prof
df_qa_prof = pickle.load(open(path_1,'rb'))
# load df_professionals
df_professionals = pickle.load(open(path_2,'rb'))


print(df_qa_prof.shape)
print(df_professionals.shape)

## 5.2. Process the Question

In [None]:
# copy a row from df_qa_prof (.copy() avoids a SettingWithCopyWarning)
df_row2 = df_qa_prof[df_qa_prof.index == 0].copy()
# set all values to nothing
df_row2.loc[:,:] = ''
# reset the index
df_row2 = df_row2.reset_index(drop=True)
    
# Assign the answers_text in this row to be the same as the question.
# We do this because later we will compare this question to all other rows
# in the answers_text column.
df_row2.loc[0, 'answers_text'] = question_text

# Concat df_row2 to df_qa_prof.
# The question will be at the top of the first row.
df_qa_prof = pd.concat([df_row2, df_qa_prof], axis=0).reset_index(drop=True)

df_qa_prof.head(2)

## 5.3. Vectorize the data

In [None]:
# Select the data we want to use. Note we are comparing the question to answers.
# We need to vectorize the answers_text column.
# This column has our new question at the top.
data = df_qa_prof['answers_text']

# instantiate vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5)

# learn the vocabulary of the data
vect.fit(data)

# Transform the data to a document term matrix.
# The output type is a sparse matrix.
prof_dtm = vect.transform(data)

prof_dtm.shape

## 5.4. Calculate the Cosine Similarity
We are calculating the similarity of your question to all past answers in the dataset.

In [None]:
# https://stackoverflow.com/questions/12118720/
# python-tf-idf-cosine-to-find-document-similarity

# prof_dtm[0:1] This selects the first row of the answers_text column (the question).
# We are saying: Tell me how similar every row is to the first row.
cosine_similarities = linear_kernel(prof_dtm[0:1], prof_dtm)

# The line below would give us the cosine similarity of every row to every other row,
# just like a correlation matrix.
# But there's no need for this and the RAM needed for this calculation
# would cause this kernel to crash.
# cosine_similarities = linear_kernel(prof_dtm, prof_dtm)

# Quick check: The first value should be 1.0 because it's the
# comparison of the question to itself.
cosine_similarities

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [None]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# get the answer ids from df_qa_prof to use as column names
cols = list(df_qa_prof['answers_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Add the professionals id values as a new column.
# This is identical to answers_author_id.
df_cosine_matrix['answers_id'] = df_qa_prof['answers_author_id']

# set the answers_id column as the index
df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_answer_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_answer_id',axis=0, ascending=False)

# check the top 20 cosine scores
df.head(20)


## 5.5. Select the Answers
Here we select the answers that have a cosine similarity that is greater than or equal to a threshold. 

In [None]:
# Set the cosine similarity threshold
MODEL_2_THRESHOLD = 0.1

# keep all rows that have a cosine_score >= THRESHOLD
df = df[df['cosine_score_for_each_answer_id'] >= MODEL_2_THRESHOLD]

# remove the first row because this row is the question we asked
df = df[1:]

num_answers = len(df)

print('Number of answers chosen: ', num_answers)

print('This is a sample of the answers the model has selected.')

# print the answers that have been selected as well as the associated cosine scores
df.head(10)

## 5.6. Identify the professionals that gave each answer

Here we will identify the professionals that gave each answer. These will be the professionals that this model thinks are best able to answer your question. 

We start by creating a python list containing the id's of each answer selected.

In [None]:
# reset the index
df.reset_index(inplace=True)

# rename the columns
new_names = ['answers_id', 'cosine_score_for_each_answer_id']
df.columns = new_names

# create a list with all answer id values from df
answer_list = list(df['answers_id'])

# display the list
#answer_list

**Let's print the question again.**

In [None]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

First we'll get the id of the professional who gave each answer. Then we'll print the profile of that professional and their answer. There could be duplicate professionals in this list because the same professional could have given several past answers.<br>

| <a id='model_2_prof_printout'></a>

In [None]:
# Print info on the professionals who can answer this question

#print('\n')
print('Model 2')
print('Number of professionals selected: ', len(answer_list))
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')



# Create an empty list to store the professional id's that are
# associated with the answers that have been selected,
model_2_list = []

# set the index of df_qa_prof to be the answer id
df_qa_prof = df_qa_prof.set_index('answers_id')

for ans_id in answer_list:
    
    # print the professional's id (i.e. their name)
    # get the prof id of the person who wrote the answer
    prof_id = df_qa_prof.loc[ans_id, 'answers_author_id']
    print('\n')
    print('==> Professional id: ', prof_id)
    model_2_list.append(prof_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the answer that they wrote which was similar to the question being asked
    answer = df_qa_prof.loc[ans_id,'answers_body']
    print('==Answer given to similar question:\n',answer)
    

We now have a list of professionals that model 2 has chosen to answer your question. Based on their past history printed above, would you say that your question is relevant to them?

Once again this is just a preliminary list. Later we'll keep only those professionals who are likely to actually submit an answer.

These are the id's of the professionals that Model_2 has selected:

In [None]:
# uncomment the next line to print the list of professional id's

# model_2_list
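
Because the same professional can appear more than once in this list, you may want a de-duplicated version that keeps the original ranking order. Here's one simple way to do that. The ids below are made up and `selected_ids` is just a stand-in for a list like model_2_list.

```python
# Hypothetical list of professional ids, ordered by cosine score.
selected_ids = ["prof_a", "prof_b", "prof_a", "prof_c", "prof_b"]

# dict.fromkeys() keeps the first occurrence of each id
# and preserves the insertion order.
unique_ids = list(dict.fromkeys(selected_ids))

print(unique_ids)
```

Using `set()` would also remove the duplicates, but it would lose the ordering by similarity score.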

<hr>
| <a id='Model_3'></a>

## 6. Model 3 - Answers, Tfidf, TruncatedSVD and Cosine Similarity


Singular Value Decomposition (SVD) is commonly understood as a dimensionality reduction technique. However, it can also be seen as a way of creating a new set of features, called latent features. It's not clear what each latent feature represents, but they're very effective in capturing the essence of the data. The previous model used more than 1.5 million features when calculating cosine similarity. This model will use 200.

The workflow is almost identical to model 2. The difference is that we'll take the output from tfidf and transform it using TruncatedSVD.
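
The three steps (tfidf, TruncatedSVD, cosine similarity) can be sketched on a toy corpus. This is only an illustration: the sentences and the choice of 2 components are made up, and row 0 plays the role of the question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made up for illustration). Row 0 plays the role of the question.
toy_docs = [
    "which classes should I take for computer science",
    "take programming and math classes for computer science",
    "computer science students should learn programming early",
    "firefighter training requires physical fitness",
    "a firefighter needs training at a fire academy",
]

# Step 1: tfidf gives one sparse column per term.
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Step 2: TruncatedSVD compresses those columns into a few dense latent features.
toy_svd = TruncatedSVD(n_components=2, random_state=101)
toy_latent = toy_svd.fit_transform(toy_tfidf)

# Step 3: compare the question (row 0) to every row in the latent space.
toy_scores = cosine_similarity(toy_latent[0:1], toy_latent).flatten()

print(toy_tfidf.shape, '->', toy_latent.shape)
print(toy_scores.round(2))
```

With only two latent features the computer science rows and the firefighter rows end up pointing in different directions, which is the effect the 200 latent features are meant to produce at full scale.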


## 6.1. Load the data

In [None]:
# load df_qa_prof
df_qa_prof = pickle.load(open(path_1,'rb'))
# load df_professionals
df_professionals = pickle.load(open(path_2,'rb'))


print(df_qa_prof.shape)
print(df_professionals.shape)

## 6.2. Process the Question

In [None]:
# copy a row from df_qa_prof (.copy() avoids modifying a view of df_qa_prof)
df_row2 = df_qa_prof[df_qa_prof.index == 0].copy()
# set all values to nothing
df_row2.loc[:,:] = ''
# reset the index
df_row2 = df_row2.reset_index(drop=True)
    
# Assign the answers_text in this row to be the same as the question.
# We do this because later we will compare this question to all other rows
# in the answers_text column.
df_row2.loc[0, 'answers_text'] = question_text

# Concat df_row2 to df_qa_prof.
# The question will be the first row.
df_qa_prof = pd.concat([df_row2, df_qa_prof], axis=0).reset_index(drop=True)

df_qa_prof.head(2)

## 6.3. Vectorize the data

In [None]:
# Select the data we want to use. Note we are comparing the question to answers.
# We need to vectorize the answers_text column.
# This column has our new question at the top.
data = df_qa_prof['answers_text']

# instantiate vectorizer
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5)

# learn the vocabulary of the data
vect.fit(data)

# transform the data to a document term matrix
prof_dtm = vect.transform(data)

prof_dtm.shape

## 6.4. Transform prof_dtm using TruncatedSVD

In [None]:
from sklearn.decomposition import TruncatedSVD

# Initialize
tsvd = TruncatedSVD(n_components=200, random_state=101)

# Fit
tsvd.fit(prof_dtm)

# Transform
# This returns a dense numpy array and not a sparse matrix as with tfidf.
prof_dtm = tsvd.transform(prof_dtm)

prof_dtm.shape

Let's put the output into a dataframe so we can see what it is. Initially there were 1,522,916 features. Now there are just 200 features. The number of rows is still 51,124.

In [None]:
# create a dataframe
df = pd.DataFrame(prof_dtm)

df.head()

## 6.5. Calculate the Cosine Similarity
We are calculating the similarity of your question to all past answers in the dataset.
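
Why cosine similarity rather than a plain dot product? By default TfidfVectorizer scales every row to unit length, so for raw tfidf output the two give the same numbers. After TruncatedSVD the rows are no longer unit length, so the dot product would favour longer vectors. A small sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np

# Made-up dense rows standing in for the TruncatedSVD output; row 0 is the question.
dense_rows = np.array([
    [3.0, 4.0],   # the question, length 5
    [6.0, 8.0],   # same direction as the question, length 10
    [4.0, -3.0],  # at right angles to the question
])
question_row = dense_rows[0]

# Plain dot products (what linear_kernel computes) grow with vector length.
dot_scores = dense_rows @ question_row

# Cosine similarity divides both lengths out, so only direction matters.
row_lengths = np.linalg.norm(dense_rows, axis=1)
cos_scores = dot_scores / (row_lengths * np.linalg.norm(question_row))

print('dot products:', dot_scores)
print('cosine      :', cos_scores)
```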

In [None]:
# https://stackoverflow.com/questions/12118720/
# python-tf-idf-cosine-to-find-document-similarity

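# prof_dtm[0:1] selects the first row, i.e. the question, and keeps it 2-dimensional.
# After TruncatedSVD prof_dtm is a dense numpy array, not a sparse matrix.
# We are saying: Tell me how similar every row is to the first row.
# Note that we are using cosine_similarity here and not linear_kernel
# because the rows are no longer scaled to unit length after the SVD step.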
cosine_similarities = cosine_similarity(prof_dtm[0:1], prof_dtm)

# The line below would give us the cosine similarity of every row to every other row,
# just like a correlation matrix.
# But there's no need for this and the RAM needed for this calculation
# would cause this kernel to crash.
# cosine_similarities = linear_kernel(prof_dtm, prof_dtm)

# Quick check: The first value should be 1.0 because it's the
# comparison of the question to itself.
cosine_similarities

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [None]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# get the column names (the answer ids) from df_qa_prof
cols = list(df_qa_prof['answers_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Add a column that will become the index of the single row.
# The dataframe has only row 0, so pandas keeps just the first value of this series.
df_cosine_matrix['answers_id'] = df_qa_prof['answers_author_id']

# set the answers_id column as the index
df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_answer_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_answer_id',axis=0, ascending=False)

# check the top 20 cosine scores
df.head(20)


## 6.6. Select the Answers
Here we select the answers that have a cosine similarity that is greater than or equal to a threshold. 
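
The selection step can be sketched on made-up scores. In this sketch the question row is dropped by its label rather than by position, which is an alternative to slicing off the first row; the answer ids and scores are hypothetical, and 0.65 is simply the threshold value used for this model.

```python
import pandas as pd

# Made-up scores; the index mimics the answer ids plus the question label.
toy_score_df = pd.DataFrame(
    {'cosine_score_for_each_answer_id': [1.0, 0.91, 0.70, 0.42]},
    index=['question_cosine_score', 'ans_a', 'ans_b', 'ans_c'],
)

TOY_THRESHOLD = 0.65  # example value only

# Drop the question by its label, then keep rows at or above the threshold.
toy_selected = toy_score_df.drop(index='question_cosine_score')
toy_selected = toy_selected[
    toy_selected['cosine_score_for_each_answer_id'] >= TOY_THRESHOLD
]

print(list(toy_selected.index))
```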

In [None]:
# Set the cosine similarity threshold
MODEL_3_THRESHOLD = 0.65

# keep only the rows that have a cosine_score >= THRESHOLD
df = df[df['cosine_score_for_each_answer_id'] >= MODEL_3_THRESHOLD]

# remove the first row because this row is the question we asked
df = df[1:]

num_answers = len(df)

print('Number of answers chosen: ', num_answers)

print('This is a sample of the answers the model has selected.')

# print the answers that have been selected as well as the associated cosine scores
df.head(10)

## 6.7. Identify the professionals that gave each answer

Here we will identify the professionals that gave each answer. These will be the professionals that this model thinks are best able to answer your question. 

We start by creating a python list containing the id's of each answer selected.

In [None]:
# reset the index
df.reset_index(inplace=True)

# rename the columns
new_names = ['answers_id', 'cosine_score_for_each_answer_id']
df.columns = new_names

# create a list with all answer id values from df
answer_list = list(df['answers_id'])

# display the list
#answer_list

**Let's print the question again.**

In [None]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

We'll get the id of the professional who gave each answer. Then we'll print the profile of that professional and their answer. There could be duplicate professionals in this list because the same professional could have given several past answers.<br>

In [None]:
# Print info on the professionals who can answer this question

#print('\n')
print('Model 3')
print('Number of professionals selected: ', len(answer_list))
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')



# Create an empty list to store the professional id's that are
# associated with the answers that have been selected,
model_3_list = []

# set the index of df_qa_prof to be the answer id
df_qa_prof = df_qa_prof.set_index('answers_id')

for ans_id in answer_list:
    
    # print the professional's id (i.e. their name)
    # get the prof id of the person who wrote the answer
    prof_id = df_qa_prof.loc[ans_id, 'answers_author_id']
    print('\n')
    print('==> Professional id: ', prof_id)
    model_3_list.append(prof_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the answer that they wrote which was similar to the question being asked
    answer = df_qa_prof.loc[ans_id,'answers_body']
    print('==Answer given to similar question:\n',answer)

These are the id's of the professionals that Model_3 has selected:

In [None]:
# uncomment the next line to print the list of professional id's

# model_3_list

Is your question relevant to these professionals? 

In the next model we're going to add some AI magic in the form of pre-trained word embeddings.

<hr>
<a id='Model_4'></a>

## 7. Model 4 - Answers, GloVe Embeddings and Cosine Similarity

Pre-trained word embeddings are dense vectors that have been trained on text corpora containing millions of words. Tfidf encodes word frequency, but embedding vectors encode the "meaning" of words and can capture analogies such as: man is to woman as king is to queen. In other words, embedding vectors help a model understand that the words man, woman, king and queen are all gender related. Moreover, a model will understand that king and queen are royalty.

In addition to trying to create relevant matches, here I'm using embedding vectors to increase the diversity of answers (more viewpoints) that students receive. For example, if a student asks how to become a film star, it would be good if that student also received advice from people involved in theatre. This would be possible because the model would know that the words film and theatre are closely related.

This model will encode words using pre-trained GloVe word embeddings. We will use 200-dimensional English word vectors. These were pre-trained on the combined Wikipedia 2014 + Gigaword 5th Edition corpora (6B tokens, 400K vocab). Because GloVe vectors are available as a Kaggle dataset I've simply imported them into this kernel.

The embedding vector length is 200. For each answer we'll consider only the first 500 words (max_length = 500). Shorter answers will be padded with zeros. For long answers, all words beyond the first 500 will be thrown away. To create a vector for a given answer, we will average all the 200-dimensional word vectors that make up that answer.
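
Here is a minimal sketch of the averaging idea, using hypothetical 3-dimensional vectors instead of real GloVe vectors. The words, the numbers and the helper names `encode` and `cosine` are all made up for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" (the real GloVe vectors here have 200 values).
toy_vectors = {
    'film':    np.array([0.9, 0.1, 0.0]),
    'theatre': np.array([0.8, 0.2, 0.1]),
    'star':    np.array([0.5, 0.5, 0.0]),
    'plumber': np.array([0.0, 0.1, 0.9]),
}

def encode(words, max_length=5):
    """Sum the vectors of the first max_length words and divide by max_length."""
    total = np.zeros(3)
    for word in words[:max_length]:
        # Unknown words contribute a zero vector, just as padding does.
        total += toy_vectors.get(word, np.zeros(3))
    return total / max_length

def cosine(a, b):
    """Cosine similarity of two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = encode(['film', 'star'])
print(cosine(question, encode(['theatre'])))
print(cosine(question, encode(['plumber'])))
```

Because cosine similarity ignores vector length, dividing by a fixed max_length instead of the true word count changes the size of each vector but not the similarity scores.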

Again, this model will compare your question to all answers in the dataset. If your question is similar to a past answer then there's a good chance that the professional who gave that answer will be able to answer your question.


## 7.1. Define the document corpus


> The answers_text column in dataframe df_qa_prof will be the corpus of documents that we'll use to create this model. Whenever we refer to a document corpus, this is the column that we'll be referring to.


## 7.2. Load the Data

In [None]:
# load df_qa_prof
df_qa_prof = pickle.load(open(path_1,'rb'))
# load df_professionals
df_professionals = pickle.load(open(path_2,'rb'))


print(df_qa_prof.shape)
print(df_professionals.shape)

In [None]:
# We will use GloVe vectors that have a standard length of 200
EMBED_LENGTH = 200

## 7.3. Process the question

In [None]:
# copy a row from df_qa_prof (.copy() avoids modifying a view of df_qa_prof)
df_row3 = df_qa_prof[df_qa_prof.index == 0].copy()
# set all values to nothing
df_row3.loc[:,:] = ''
# reset the index
df_row3 = df_row3.reset_index(drop=True)
    
# Assign the answers_text in this row to be the same as the question.
# We do this because later we will compare this question to all other rows
# in the answers_text column.
df_row3.loc[0, 'answers_text'] = question_text

# Concat df_row3 to df_qa_prof.
# The question will be the first row.
df_qa_prof = pd.concat([df_row3, df_qa_prof], axis=0).reset_index(drop=True)


## 7.4. Pre-process the data

In [None]:
# Create a new column showing the length (in characters) of each answer
df_qa_prof['answer_length'] = df_qa_prof['answers_body'].apply(len)

print('The answers_text column is the document corpus.')
df_qa_prof.head(2)

## 7.5. Assemble the GloVe Embedding Matrix for our text corpus
We'll use pre-trained GloVe embeddings that are available in Kaggle datasets.

In [None]:
# Create a corpus of documents
corpus_text_list = list(df_qa_prof['answers_text'])


### Tokenize the corpus of documents (i.e. extract the vocabulary)

In [None]:
# Instantiate the tokenizer.
# Note that this is a word tokenizer.
t = Tokenizer()

# create a dictionary where the word is the key and a number is the value
t.fit_on_texts(corpus_text_list)

# How many words are there in our corpus vocabulary?

vocab_size = len(t.word_index)
print('Vocab size: ', vocab_size)

In [None]:
# These are all the words in the vocabulary of our corpus.
# Each word is assigned an index starting at 1.

t.word_index

In [None]:
# Add 1 to the number of words in the vocabulary because
# the word indexes start at 1 and index 0 is used for padding.
vocab_size = len(t.word_index) + 1
vocab_size

### Integer encode the corpus documents

Here we are replacing every word with its corresponding index. Each row in our text corpus is a separate list of these index values. Take note that the lists have different lengths.

- encoded_docs is a list of lists. [[2, 101, 605], [33, 77],...]
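
To make the encoding concrete, here is a simplified pure-Python stand-in for what the tokenizer does. It is not the Keras implementation (for example, it does no punctuation filtering), and the two sentences are made up.

```python
from collections import Counter

toy_corpus = ["study hard and study often", "work hard"]

# Count words, then give the most frequent word index 1, the next index 2, ...
# Index 0 is left unused so that it can later stand for padding.
counts = Counter(word for doc in toy_corpus for word in doc.lower().split())
toy_word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Replace every word with its index. The lists keep their different lengths.
toy_encoded = [[toy_word_index[w] for w in doc.lower().split()] for doc in toy_corpus]

print(toy_word_index)
print(toy_encoded)
```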

In [None]:
# convert the text to sequences of numbers
encoded_docs = t.texts_to_sequences(corpus_text_list)

# Print the list of lists
#print(encoded_docs)

### Pad each list so they all have the same length

Here we'll pad each list with zeros so that they all have the same length.
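
Padding and truncating are two separate choices. Keras `pad_sequences` has a `padding` argument and a `truncating` argument, and both default to `'pre'`. The helper below is a hypothetical numpy stand-in (the name `pad_post` and its `keep` argument are made up) that shows the difference between keeping the first and the last items of a long sequence.

```python
import numpy as np

def pad_post(seq, maxlen, keep='first'):
    """Return seq as a length-maxlen array, zero-padded at the end.

    keep='first' keeps the first maxlen items of a long sequence,
    keep='last' keeps the last maxlen items.
    """
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[:maxlen] if keep == 'first' else seq[-maxlen:]
    out = np.zeros(maxlen, dtype='int32')
    out[:len(seq)] = seq
    return out

short_doc = [7, 3]
long_doc = [1, 2, 3, 4, 5, 6]

print(pad_post(short_doc, 4))
print(pad_post(long_doc, 4, keep='first'))
print(pad_post(long_doc, 4, keep='last'))
```

In `pad_sequences` terms, `keep='first'` corresponds to `truncating='post'` and `keep='last'` to `truncating='pre'`.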

In [None]:

# Let's look at the text lengths to decide what max_length to use.
# Note: answer_length counts characters, while max_length below counts words.
print('Min length: ', df_qa_prof['answer_length'].min())
print('Max length: ',df_qa_prof['answer_length'].max())
print('Mean length: ',df_qa_prof['answer_length'].mean())
print('Median length: ',df_qa_prof['answer_length'].median())
print('Mode lengths: ',df_qa_prof['answer_length'].mode()) # value that appears most often

# Set the max_length 
max_length = 500

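# Pad each list so they all have the same length.
# truncating='post' keeps the first max_length words of long answers
# (the Keras default, 'pre', would keep the last max_length words instead).
padded_docs = pad_sequences(encoded_docs, maxlen=max_length,
                            padding='post', truncating='post')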

# (num_answers, max_length)
padded_docs.shape  

### Create a GloVe embedding matrix specific to our corpus vocab

In [None]:
# source: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

# We will use pre-trained GloVe embedding vectors from Kaggle Datasets that
# have been imported into this kernel.
# https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation

# Load the pre-trained GloVe vectors
# Set the path to glove.6B.200d.txt
path = '../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt'

embeddings_index = dict()

# The GloVe file is UTF-8 encoded. The with-block closes the file for us.
with open(path, encoding='utf8') as f:
    for line in f:
        # Note: use split(' ') instead of split() if you get an error.
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix
embedding_matrix = np.zeros((vocab_size, EMBED_LENGTH))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print('The result is a matrix of embeddings.')
print('Words are the rows, the features are the columns.')

# The result is a matrix of embeddings only for words in our data.
# Words are the rows, the features are the columns.

Let's put the embedding matrix into a dataframe so we can more clearly see what it is. This matrix holds one embedding vector for each word in our corpus vocabulary. Words that have no GloVe vector are left as rows of zeros. Later we'll look up these vectors for the words in each document.
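
Before looking at the dataframe, it can help to see how a weight matrix of this kind treats words that have no pre-trained vector. The sketch below uses hypothetical stand-ins for `t.word_index` and `embeddings_index` with 2-dimensional vectors; the words and numbers are made up.

```python
import numpy as np

# Hypothetical stand-ins for t.word_index and embeddings_index.
toy_word_index = {'career': 1, 'nurse': 2, 'careervillage': 3}
toy_glove = {
    'career': np.array([0.1, 0.2], dtype='float32'),
    'nurse':  np.array([0.3, 0.4], dtype='float32'),
}

# Row 0 stays zero for padding, hence the +1.
toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
missing = []
for word, i in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is None:
        missing.append(word)   # no pre-trained vector: the row stays all zeros
    else:
        toy_matrix[i] = vector

print('Words without a vector:', missing)
print(toy_matrix)
```

Counting the missing words in this way is a quick check of how much of the corpus vocabulary the pre-trained vectors actually cover.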

In [None]:


# Note that the words are on the index column
df_glove_embeddings = pd.DataFrame(embedding_matrix)

# get all the dictionary keys as a list
word_dict = t.word_index

# get a list of keys
keys = list(word_dict.keys())

# Insert a dummy_word at the first position.
# The dummy_word exists because our dict key:value pairs
# start from word:1 and not word:0.
keys.insert(0, 'dummy_word')

# transpose the dataframe so that the words become the columns
df_glove_embeddings = df_glove_embeddings.T

# set the names of the columns
df_glove_embeddings.columns = keys


# convert the dataframe back to the original form
df_glove_embeddings = df_glove_embeddings.T

# reset the index
df_glove_embeddings = df_glove_embeddings.reset_index(drop=False)

# change the name of the first column to 'words'
column_names = list(df_glove_embeddings.columns)
column_names[0] = 'words'
df_glove_embeddings.columns = column_names

print('These are the embeddings in a dataframe.')
df_glove_embeddings.head(10)

Now let's create an encoding matrix that has one row for each document in our corpus. This is simply a look-up function. For every word in a document, this code looks up the embedding vector associated with that word and adds it to a running total. The result for each document is stored as one row of encoding_mat.

Take note that here we are **averaging** all the word vectors that make up a given answer.
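
The same look-up and averaging can also be written with numpy indexing instead of a Python loop over every word. This is a sketch under assumptions: the function name `encode_docs` and the batch size are made up, and the batching is there because indexing all documents at once would create a very large temporary array of shape (num_docs, max_length, embed_length).

```python
import numpy as np

def encode_docs(padded, emb_matrix, batch_size=1000):
    """Sum the embedding rows for each padded document and divide by its length."""
    n_docs, doc_len = padded.shape
    out = np.zeros((n_docs, emb_matrix.shape[1]))
    for start in range(0, n_docs, batch_size):
        batch = padded[start:start + batch_size]
        # emb_matrix[batch] has shape (batch, doc_len, embed_length).
        out[start:start + batch_size] = emb_matrix[batch].sum(axis=1) / doc_len
    return out

# Tiny made-up inputs: 4 "words" (row 0 is the padding row) and 3 documents.
toy_emb = np.array([[0., 0.], [1., 0.], [0., 1.], [2., 2.]])
toy_padded = np.array([[1, 2, 0], [3, 0, 0], [0, 0, 0]])

toy_encoding = encode_docs(toy_padded, toy_emb, batch_size=2)
print(toy_encoding)
```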



In [None]:
# create an empty matrix with one row per document
encoding_mat = np.zeros((len(padded_docs), EMBED_LENGTH))

for i in range(0,len(padded_docs)):
    # select the document
    padded_doc = padded_docs[i]
    # start from a vector of zeros
    encoding = np.zeros(EMBED_LENGTH)
    # loop over the word indexes in the document
    for item in padded_doc:
        # Here we are adding the vectors together.
        # embedding_matrix[item] selects one row (a 1-D numpy array).
        # Padding (index 0) and words without a GloVe vector add zeros.
        encoding = encoding + embedding_matrix[item] # item is an integer value
        
    # Insert the encoding into encoding_mat.
    # Here we are averaging the encodings by dividing by max_length.
    # Dividing by a constant does not change the cosine similarity.
    encoding_mat[i] = encoding/max_length

# check the shape of the matrix
encoding_mat.shape

In [None]:
# Display the encoding matrix
# The answers are the rows and the features are the columns.

# Every row represents one answer that has been encoded as a vector
df_encoding_mat = pd.DataFrame(encoding_mat)

print('This is the encoding matrix. Each row represents one answer that has been encoded as a vector.')
print('Row 0 is the question.')
df_encoding_mat.head()

In [None]:
# check the shape of the embedding matrix
encoding_mat.shape

## 7.6. Calculate the cosine similarity of the question (first row) to every answer (all other rows)

In [None]:
# encoding_mat already has the shape (num_samples, num_features).
# Reshape the base document (the question), i.e. the one we will compare to all others.
base_doc = encoding_mat[0].reshape(1,EMBED_LENGTH)

# calculate the cosine similarity
cosine_similarities = cosine_similarity(base_doc, encoding_mat)

# The following would compute a cosine similarity matrix comparing every
# doc to every other doc, like a correlation matrix.
# This uses a lot of RAM.
#cosine_similarities = cosine_similarity(encoding_mat, encoding_mat)

cosine_similarities.shape

In [None]:
# flatten the matrix
cosine_similarities = cosine_similarities.flatten()

#Check: The first value should be 1.0 because the 
# question is being compared to itself.
cosine_similarities

Let's put everything into a dataframe and sort the similarities from highest to lowest.

In [None]:
# create a dataframe
df_cosine_matrix = pd.DataFrame(cosine_similarities)

# transpose the dataframe
df_cosine_matrix = df_cosine_matrix.T

# get the column names (the answer ids) from df_qa_prof
cols = list(df_qa_prof['answers_id'])

# Change the name of the first column. This is the score for the Question
cols[0] = 'question_cosine_score'

# rename the columns in the dataframe
df_cosine_matrix.columns = cols

# Add a column that will become the index of the single row.
# The dataframe has only row 0, so pandas keeps just the first value of this series.
df_cosine_matrix['answers_id'] = df_qa_prof['answers_author_id']

# set the answers_id column as the index
df_cosine_matrix.set_index('answers_id', inplace=True)

# transpose the dataframe
df = df_cosine_matrix.T

# rename the column
new_col = ['cosine_score_for_each_answer_id']
df.columns = new_col

# sort the cosine similarity values in descending order
df = df.sort_values('cosine_score_for_each_answer_id',axis=0, ascending=False)

# check the top 20 cosine scores
df.head(20)


Pause here for a moment. Please take a look at the previous dataframe. You'll notice that Model 4 has identified quite a few professionals. However, for many questions you're going to find that the final printout for this model does not contain any recommended professionals. This is because the threshold is set quite high. It's at 0.94 at the moment. Later I'll explain why this threshold is set high.

## 7.7. Select the Answers
Select the answers that have a cosine similarity that is greater than or equal to a threshold value.

In [None]:
# Set the cosine similarity threshold
MODEL_4_THRESHOLD = 0.94


# keep only the rows that have a cosine_score >= THRESHOLD
df_selected = df[df['cosine_score_for_each_answer_id'] >= MODEL_4_THRESHOLD]

# remove the first row because this row is the question we asked
df_selected = df_selected[1:]

num_answers = len(df_selected)

print('Number of answers chosen: ', num_answers)

print('This is a sample of the answers the model has selected.')

# print the answers that have been selected as well as the associated cosine scores
df_selected.head(10)

This is a list of id's that correspond to the answers that are similar to the question you asked. We'll identify the professional that gave each answer. These will be the professionals that are best able to answer your question.

In [None]:
# reset the index
df_selected.reset_index(inplace=True)

# rename the columns
new_names = ['answers_id', 'cosine_score_for_each_answer_id']
df_selected.columns = new_names

# create a list with all answer id values from df
answer_list = list(df_selected['answers_id'])

# display the list
# answer_list

**Let's print the question again.**

In [None]:
print('Question id: ', question_id)
print('Question Title: ', question_title)
print('\n')
print('Question Body:\n ', question_body)

Next we'll print the profile of each professional as well as the answer they gave to a similar question. Again, there could be duplicate professionals in this list because the same professional could have given several past answers.<br>
<a id='model_3_prof_printout'></a>

In [None]:
# Print info on the professionals who can answer this question

#print('\n')
print('Model 4')
print('Number of professionals selected: ', len(answer_list)) # this counts answers, so there could be duplicate professionals
print('== Printing info on each professional who was selected ==')

# set the index
df_professionals = df_professionals.set_index('professionals_id')

# Create an empty list to store the professional id's that are
# associated with the answers that have been selected,
model_4_list = []

# set the index of df_qa_prof to be the answer id
df_qa_prof = df_qa_prof.set_index('answers_id')


for ans_id in answer_list:
    
    # print the professional's id (i.e. their name)
    # get the prof id of the person who wrote the answer
    prof_id = df_qa_prof.loc[ans_id, 'answers_author_id']
    print('\n')
    print('==> Professional id: ', prof_id)
    model_4_list.append(prof_id)
    
    print('Answer id: ', ans_id)
    
    
    # print their job title:
    title = df_professionals.loc[prof_id, 'professionals_headline']
    print('Title: ', title)
    
    # print the industry they work in
    industry = df_professionals.loc[prof_id, 'professionals_industry']
    print('Industry: ', industry)
    
    # Print the answer that they wrote which was similar to the question being asked
    answer = df_qa_prof.loc[ans_id,'answers_body']
    
    print('==Answer given to similar question:\n',answer)

These are the id's of the professionals that Model_4 has selected:

In [None]:
# uncomment the next line to print the list of professional id's

# model_4_list

<hr>
<a id='Select_professionals'></a>

## 8. Select those professionals who are most likely to answer the question

We have recommendations from four models. Now we need to select those professionals who are likely to answer when they are sent this question via email.

**What selection criteria are we going to use?**

> We will ask four questions:
> 1. Did this professional join CareerVillage within 30 days of the most recent sign-up?
> 2. Has this professional answered a past question from the student who posted this question?
> 3. Has this professional answered a question within 30 days of the most recent answer posted on CareerVillage?
> 4. Has this professional made a comment within 30 days of the most recent answer posted on CareerVillage?

If the answer is yes to any one of these questions then we will conclude that there is a high possibility that this professional will respond to the email.

This is the logic behind this filter:

New users are more likely to be highly active. Also, having recently answered a question or made a comment is an indicator that a person is contributing to the community and is therefore likely to respond to a relevant email. Lastly, if a professional answered a past question from the student posting this question, then there is already a kind of "connection" between them. There's a good chance of that professional also answering this new question.
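
The "yes to any one of these questions" rule is a logical OR over four flags. A small sketch with made-up flags for three hypothetical professionals:

```python
import pandas as pd

# Made-up flags for three hypothetical professionals (1 = yes, 0 = no).
toy_flags = pd.DataFrame({
    'professionals_id': ['p1', 'p2', 'p3'],
    'new_member':       [0, 0, 1],
    'past_interaction': [0, 0, 0],
    'recent_answer':    [1, 0, 0],
    'recent_comment':   [1, 0, 0],
}).set_index('professionals_id')

# A professional passes the filter if at least one flag is set.
toy_send_email = toy_flags[toy_flags.any(axis=1)]

print(list(toy_send_email.index))
```

Summing the flags and keeping rows with a total above zero gives the same selection, and the total also shows how many criteria each professional met.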


## 8.1. Get a summary of how many professionals each model has selected
There could be duplicate professional id's here. We'll remove those duplicates later.

In [None]:
# Note that there could be duplicate professional id's
# in these lists.

print('Model 1 Tags: ', len(model_1_list))
print('Model 2 Tfidf: ',len(model_2_list))
print('Model 3 TSVD: ',len(model_3_list))
print('Model 4 GloVe: ',len(model_4_list))


## 8.2. What is the total number of professionals the models have selected?

In [None]:
# Join all the lists
combined_list = model_1_list + model_2_list + model_3_list + model_4_list
# Create a dataframe containing all professionals
df_selected = pd.DataFrame(combined_list, columns=['professionals_id'])

# Drop any duplicate id's.
# Because models 2, 3 and 4 select professionals based on answers,
# there is a possibility that the same professional could be selected
# multiple times because they gave several answers that matched the Question.

# remove the duplicates
df_selected = df_selected.drop_duplicates('professionals_id')
# get the total number of professionals
total = len(df_selected)

print(total, 'professionals are able to answer the Question.')

## 8.3. Apply the final selection criteria

### ~ Did this professional join CareerVillage within 30 days of the most recent sign-up?

In [None]:

def new_member(x):
    # get the value from df_professionals
    num_days_member = df_professionals.loc[x, 'num_days_member']

    if num_days_member <= 30:
        return 1
    else:
        return 0

df_selected['new_member'] = df_selected['professionals_id'].apply(new_member)

### ~ Has this professional answered a past question from the student asking this question?

In [None]:
# Get the id of the student asking the question.
# student_id variable was captured above.

def past_interaction(x):
    # Select all the questions this professional has answered in the past
    df_past = df_qa_prof[df_qa_prof['answers_author_id'] == x]

    # Get a list of students who've asked the above questions
    student_list = list(df_past['questions_author_id'])

    # Check if the student asking this question is in student_list
    if student_id in student_list:
        return 1 # there was a past interaction
    else:
        return 0 # there has been no past interaction

# create a new column that shows if there was a past interaction
df_selected['past_interaction'] = \
df_selected['professionals_id'].apply(past_interaction)

In [None]:
df_selected.head()

### ~ Has this professional answered a question within 30 days of the most recent answer posted on CareerVillage?

In [None]:
# Has this professional answered a question within
# 30 days of the most recent answer posted on CareerVillage?
# Yes --> send email

# convert the answers_date_added to pandas datetime
df_answers['answers_date_added'] = \
pd.to_datetime(df_answers['answers_date_added'])

# get the date of the most recent answer
newest_answer_date = df_answers['answers_date_added'].max()

# Get the number of days between each answer and the most recent answer posted
# on CareerVillage.

def days_from_newest_answer(x):
    
    num_days = (newest_answer_date - x).days
    
    return num_days

# create a new column
df_answers['days_from_newest_answer'] = \
df_answers['answers_date_added'].apply(days_from_newest_answer)

# keep only the rows where days_from_newest_answer <= 30
df_filtered = df_answers[df_answers['days_from_newest_answer'] <= 30]

# Drop duplicate professional id's because some professionals
# may have answered multiple questions in that time period.
df_filtered = df_filtered.drop_duplicates('answers_author_id')

# get a list of professionals that made these recent answers
prof_list = list(df_filtered['answers_author_id'])


def recent_answer(x):
    if x in prof_list:
        return 1
    else:
        return 0

# create a new column
df_selected['recent_answer'] = \
df_selected['professionals_id'].apply(recent_answer)
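
The same check can be sketched without `apply`, using vectorised pandas operations and a set for the membership test. The table below is made up; only the column names follow the dataset.

```python
import pandas as pd

# Made-up answers table with the same column names as answers.csv.
toy_answers = pd.DataFrame({
    'answers_author_id': ['p1', 'p2', 'p1', 'p3'],
    'answers_date_added': pd.to_datetime(
        ['2019-01-31', '2018-11-01', '2018-06-15', '2019-01-05']),
})

# Whole days between each answer and the newest answer.
newest = toy_answers['answers_date_added'].max()
days_ago = (newest - toy_answers['answers_date_added']).dt.days

# A set gives fast membership tests for the recently active professionals.
recent_authors = set(toy_answers.loc[days_ago <= 30, 'answers_author_id'])

toy_selected = pd.DataFrame({'professionals_id': ['p1', 'p2', 'p3']})
toy_selected['recent_answer'] = (
    toy_selected['professionals_id'].isin(recent_authors).astype(int))

print(toy_selected)
```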

### ~ Has this professional made a comment within 30 days of the most recent answer posted on CareerVillage?

In [None]:
# Has this professional made a comment within
# 30 days of the most recent answer posted on CareerVillage?
# Yes --> send email

# convert comments_date_added to pandas datetime
df_comments['comments_date_added'] = pd.to_datetime(df_comments['comments_date_added'])

# Get the number of days between each comment and the most recent answer posted
# on CareerVillage.

def days_from_newest_answer(x):
    
    num_days = (newest_answer_date - x).days
    
    return num_days

# create a new column
df_comments['days_from_newest_answer'] = \
df_comments['comments_date_added'].apply(days_from_newest_answer)

# keep only the rows where days_from_newest_answer <= 30
df_filtered = df_comments[df_comments['days_from_newest_answer'] <= 30]

# Drop duplicate professional id's because some professionals
# may have made multiple comments in that time period.
df_filtered = df_filtered.drop_duplicates('comments_author_id')

# get a list of professionals that made these recent comments
prof_list = list(df_filtered['comments_author_id'])

# add a new column to df_selected
def recent_comment(x):
    if x in prof_list:
        return 1
    else:
        return 0

df_selected['recent_comment'] = df_selected['professionals_id'].apply(recent_comment)


<hr>
<a id='Final_Output'></a>

## 9. Final Output - The chosen ones

### ~ Select those professionals who've met at least one of the selection criteria ~

In [None]:
# sum up the row scores for each professional in df_selected
def sum_rows(row):
    
    total = row['new_member'] + row['recent_answer'] + \
    row['recent_comment'] + row['past_interaction']
    
    return total
    
df_selected['total_score'] = df_selected.apply(sum_rows, axis=1)


# keep only the rows where the score > 0
df_send_email = df_selected[df_selected['total_score'] > 0]



final_selection_list = list(df_send_email['professionals_id'])

num_selected = len(final_selection_list)

print('=== Final Results ===\n')

print(num_selected, 'professionals are likely to respond to the email.')

#print('These are their names:\n', final_selection_list)

print('These are their scores.')

# Print the list of professionals that have a high likelihood of
# responding to an email notification
df_send_email.head(20)

### Show which model selected each professional
These professional id's will help you to go back to a specific model's output and find a profile. There could be duplicates here because the same professional could have been selected by more than one model.
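
An alternative view groups the output by professional instead of by model. The sketch below uses made-up id lists as stand-ins for model_1_list to model_4_list and for final_selection_list.

```python
# Made-up id lists standing in for model_1_list ... model_4_list.
toy_model_lists = {
    'Model 1 Tags':  ['p1', 'p2'],
    'Model 2 Tfidf': ['p2', 'p3', 'p2'],
    'Model 3 TSVD':  ['p3'],
    'Model 4 GloVe': [],
}
toy_final_selection = ['p2', 'p3']

# Map each chosen professional to the models that selected them.
toy_sources = {
    prof_id: [name for name, ids in toy_model_lists.items() if prof_id in ids]
    for prof_id in toy_final_selection
}

for prof_id, names in toy_sources.items():
    print(prof_id, '<-', ', '.join(names))
```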

In [None]:
print('This shows which model selected each chosen professional:\n')

for prof_id in final_selection_list:
    if prof_id in model_1_list:

        print('Model 1 Tags: ', prof_id)

for prof_id in final_selection_list:
    if prof_id in model_2_list:

        print('Model 2 Tfidf: ', prof_id)
        
for prof_id in final_selection_list:
    if prof_id in model_3_list:

        print('Model 3 TSVD: ', prof_id)

for prof_id in final_selection_list:
    if prof_id in model_4_list:

        print('Model 4 GloVe: ', prof_id)

In [None]:
# End of Recommender System
#====================================================================#

<a id='Testing'></a>

## 10. Testing and Results

I used nine questions for testing and tuning. I tried to choose questions that reflect the strengths, limitations and quirks of this data. These include: 

1. Questions relating to careers that are well represented - e.g. computer science
2. Questions relating to careers that have a low representation - firefighting, plumbing
3. General questions that are more life-skills related
 
The test results for each model and the filter are summarised in a dataframe below.

These are the questions that I used for testing:

**Index**: 777<br>
**Question id**:  11ce7c537cd84db0bd7840ad3ca04004<br>
**Question Title**:  I want to major in computer science. What classes should I take?<br>
**Question Body**:<br>
 I want to do something like cyber security or write code for companys. #computer-science #programming #computer-engineering #computer-software 
 
 
**Index**: 999<br>
**Question id**:  0601a843065945ac86e473f421774952<br>
**Question Title**:  What is required to become a firefighter?<br>
**Question Body**:<br>
I want to know because so I can get the things I need to become a firefighter. #fireman 


**Index**: 2043<br>
**Question id**:  8b469efa88284afb907179e4c73a99af<br>
**Question Title**:  What exactly, is the difference between a psychologist and psychiatrist?<br>
**Question Body**:<br>
i dont know whether i want to be a psychologist or psychiatrist. I want to work with married people, going through a divorce. Do both of them work with people who are going through a divorce? #psychology #psychiatry


**Index**: 2487<br>
**Question id**:  1aaa4249d4ea41a4b2d196313e4e930e<br>
**Question Title**:  How do I decide what career I want to choose?<br>
**Question Body**:<br>
I am finding it very difficult to decide what I want to spend the rest of my life doing, and I would like to know what process others took to find their paths who were as lost as me. #undecided #unsure #searching

**Index**: 3618<br>
**Question id**:  e14626d53e5d44ac98e4e1c57404aa9d<br>
**Question Title**:  What are the best ways to maintain a work and school balance?<br>
**Question Body**:<br>
I think many people struggle with this and any type of advice towards this question is valuable.  #business #leadership #organization

**Index**: 1710<br>
**Question id**:  eb80205482e4424cad8f16bc25aa2d9c<br>
**Question Title**:  I want to become an army officer. What can I do to become an army officer?<br>
**Question Body**:<br>
I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


**Index**: 1000<br>
**Question id**:  a5dfa070a89c47c28460557b1f4dabb8<br>
**Question Title**:  What are the challenges I may face pursuing a career in science, and how can I stand out among the rest?<br>
**Question Body**:<br>
For my first two years in high school I have been studying biotechnology and I have been working diligently in preparing myself for college and the workforce. Though I have not decided on an exact career, I am deeply considering one within forensics, pharmaceuticals, or in the space exploration fields.  What are some of the challenges students may face when applying for college, throughout college, and in the workforce? How can I stand out from the rest? #high-school #student #biotechnology #workforce #science #career #forensics #medicine #pharmaceuticals #space-exploration #future #technology #stem #steam #nasa #astrophysics #planetary-science #women-in-stem


**Index**: custom question<br>
**Question id**:  custom question<br>
**Question Title**:  How do I become a data scientist?<br>
**Question Body**:<br>
I want to be a data scientist. What subjects should I study? #data-science

**Index**: custom question<br>
**Question id**:  custom question<br>
**Question Title**:  How do I become a plumber?<br>
**Question Body**:<br>
I want to be a plumber. What subjects should I study? #plumber #plumbing



In [None]:
import pandas as pd
# The next three lines cause all the text to appear. Sentences are not truncated.
# All columns and all rows are displayed. Nothing is hidden.
# Note: this must be in the same cell as import pandas as pd
# None means "no limit" (pandas >= 1.0). Older versions used -1, which is now deprecated.
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)


results_dict = {
'question_index': [777,999,2043,2487,3618,1710,1000,'custom question','custom question'],
'question_title': ['I want to major in computer science. What classes should I take?',
                 'What is required to become a firefighter?',
                 'What exactly, is the difference between a psychologist and psychiatrist?',
                 'How do I decide what career I want to choose?',
                 'What are the best ways to maintain a work and school balance?',
                 'I want to become an army officer. What can I do to become an army officer?',
                 'What are the challenges I may face pursuing a career in science, and how can I stand out among the rest?',
                 'How do I become a data scientist?', 'How do I become a plumber?'],
'Model_1_Tags': ['rec: 481 fp: 0','rec: 2 fp: 0','rec: 1 fp: 1','rec: 0 fp: 0','rec: 1 fp: 1','rec: 21 fp: 2','rec: 0 fp: 0','rec: 105 fp: 1','rec: 4 fp: 0'],
'Model_2_Tfidf': ['rec: 392 fp: 12','rec: 10 fp: 1','rec: 9 fp: 0','rec: 2 fp: 1','rec: 1 fp: 1','rec: 32 fp: 0','rec: 0 fp: 0','rec: 99 fp: 10','rec: 4 fp: 4'],
'Model_3_TSVD': ['rec: 140 fp: 1','rec: 0 fp: 0','rec: 0 fp: 0','rec: 6 fp: 1','rec: 0 fp: 0','rec: 46 fp: 0','rec: 0 fp: 0','rec: 108 fp: 8','rec: 7 fp: 7'],
'Model_4_GloVe': ['rec: 53 fp: 0','rec: 0 fp: 0','rec: 0 fp: 0','rec: 34 fp: 4','rec: 1588 fp: 0','rec: 0 fp: 0','rec: 26 fp: 0','rec: 0 fp: 0','rec: 0 fp: 0'],
'Final_Filter_Output': ['rec: 29 fp: 1','rec: 2 fp: 1','rec: 1 fp: 1','rec: 2 fp: 0','rec: 54 fp: 0','rec: 9 fp: 0','rec: 3 fp: 0','rec: 14 fp: 1','rec: 2 fp: 1']
  
}

    
df_results = pd.DataFrame(results_dict)

#df_results.head(10)

### Results

In each cell there are two numbers. One shows the number of recommendations, the other shows the number of false positives (bad matches) in those recommendations. To determine the number of false positives I looked at the output of each model and counted how many recommended professionals I believed were not well matched to the question. This is a subjective way of assessing performance, but it gives us a rough idea of how each model is performing.

For example, Model_3_TSVD recommended 140 professionals to answer the question: "I want to major in computer science. What classes should I take?" There was one false positive in the output. In the dataframe this result is displayed as `rec: 140 fp: 1`.<br>
After filtering, the system concluded that 29 professionals are likely to respond to the email. These final recommendations contain one false positive: `rec: 29 fp: 1`.
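
To make the cell format concrete, here is a minimal sketch that parses a `rec: N fp: M` string and computes the share of recommendations that were not flagged as bad matches. The helper names `parse_cell` and `precision` are hypothetical and not part of this notebook, and the resulting figure inherits the subjectivity of the manual false positive counts.

```python
import re

def parse_cell(cell):
    """Split a string such as 'rec: 140 fp: 1' into (rec, fp) integers."""
    match = re.fullmatch(r'rec:\s*(\d+)\s+fp:\s*(\d+)', cell.strip())
    if match is None:
        raise ValueError('Unexpected cell format: ' + repr(cell))
    return int(match.group(1)), int(match.group(2))

def precision(cell):
    """Fraction of recommendations that were not flagged as false positives.

    Returns None when a model made no recommendations.
    """
    rec, fp = parse_cell(cell)
    if rec == 0:
        return None
    return (rec - fp) / rec

# Values copied from the computer science question (index 777)
print(parse_cell('rec: 140 fp: 1'))
print(round(precision('rec: 29 fp: 1'), 3))
```

Applied to every cell of the results dataframe, something like this would allow the models to be compared numerically rather than by eye.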


These are the cosine similarity thresholds for each model:

MODEL_1_THRESHOLD = 0.13<br>
MODEL_2_THRESHOLD = 0.1<br>
MODEL_3_THRESHOLD = 0.65<br>
MODEL_4_THRESHOLD = 0.94
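
As a reminder of how such a threshold is used, the sketch below keeps only the professionals whose similarity to the question meets the cut-off. It is a simplified stand-in, not the notebook's code: the vectors are toy values, the helper names are hypothetical, and whether the comparison is `>=` or `>` is an assumption. The pure-Python `cosine_sim` is used only so that the example runs without the notebook's data.

```python
import math

MODEL_3_THRESHOLD = 0.65  # value quoted above

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_above_threshold(question_vec, prof_vecs, threshold):
    """Return ids whose similarity to the question meets the threshold."""
    return [prof_id for prof_id, vec in prof_vecs.items()
            if cosine_sim(question_vec, vec) >= threshold]

# Toy vectors, invented for illustration only
question_vec = [1.0, 0.0, 1.0]
prof_vecs = {
    'prof_a': [1.0, 0.0, 1.0],   # same direction as the question
    'prof_b': [0.0, 1.0, 0.0],   # orthogonal to the question
    'prof_c': [1.0, 1.0, 1.0],   # partial overlap
}
print(select_above_threshold(question_vec, prof_vecs, MODEL_3_THRESHOLD))
```

With the toy data, raising the threshold from 0.65 to 0.94 drops the partially matching `prof_c`, which is the same trade-off discussed in the observations below.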

In [None]:
df_results.head(10)

<hr>

**Observations**<br>

This system is performing well on careers for which a lot of data is available. Examples include the computer science, military and data science questions. 

For questions about firefighting, psychology and plumbing the number of final recommendations is low. The models are also generating a higher ratio of false positives. There are professionals in the dataset who are qualified to answer these questions and the system is detecting them. However, the final filter is rejecting these professionals because they are not active. Therefore, I believe that the problem lies not with this recommendation system but with the data: there's just not enough of it representing these careers.

Still, this highlights a weakness of this recommender system: For certain questions, especially those where a career has a low representation in the data, model 4 has a tendency to generate many false positives. To see a demonstration of this try entering index 2018 (question id:  b53c5e9b7436453fa13a416a23b512cc). This is a question about becoming a Politician. 

For the more general/life-skills related questions...

~ How do I decide what career I want to choose?<br>
~ What are the challenges I may face pursuing a career in science, and how can I stand out among the rest?<br>
~ What are the best ways to maintain a work and school balance?

...Model 4 (GloVe) is performing well. It's nicely making up for the weakness that the other three models are showing. 

**Is this a good recommender system?**<br>

Based on these tests my preliminary conclusion is that this recommender system works. However, one should be careful about drawing conclusions after only a quick test like this. This system needs to be field tested. Only then will we know how robust it really is.

**Now just a few thoughts on setting thresholds...**

I've tried to reduce the number of irrelevant recommendations (i.e. bad matches between question and professional) by setting the model 3 and model 4 thresholds quite high. Reducing the number of false positives is important to inspire confidence in this system and to reduce the number of irrelevant questions sent to professionals. 

As an example, consider the question: 
"I want to become an army officer. What can I do to become an army officer?"

At the current threshold of 0.94 Model 4 is generating 0 recommendations. However, if this threshold were lowered to 0.85 Model 4 would generate 305 recommendations. This would cause an increase in the number of false positives. In addition to recommending more professionals with a military background, Model 4 would also recommend police officers.

Moreover, if the threshold were set at 0.85 then model 4 would generate 38,220 recommendations when given the very general question: "What are the best ways to maintain a work and school balance?" 

38,220 (contains duplicates) is a lot of professionals and one might guess that many false positives would be generated... but think again - would you or I consider such a question irrelevant?
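
A threshold sweep is one way to see this effect before committing to a value. The sketch below counts both the raw matches and the unique professionals at several thresholds. The scores are invented, `count_recommendations` is a hypothetical helper, and it assumes that duplicates arise because the same professional can be matched more than once.

```python
def count_recommendations(matches, threshold):
    """Return (total matches, unique professionals) at a given threshold.

    `matches` is a list of (professional_id, similarity) pairs in which the
    same professional may appear more than once.
    """
    selected = [prof_id for prof_id, score in matches if score >= threshold]
    return len(selected), len(set(selected))

# Toy similarity scores, invented for illustration only
matches = [
    ('prof_a', 0.96), ('prof_a', 0.88), ('prof_b', 0.91),
    ('prof_c', 0.86), ('prof_c', 0.95), ('prof_d', 0.80),
]

for threshold in (0.94, 0.90, 0.85):
    total, unique = count_recommendations(matches, threshold)
    print(threshold, total, unique)
```

Reporting the unique count next to the raw count gives a more realistic picture of how many emails a given threshold would actually trigger.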

That said, when choosing these thresholds it's important to consider **business priorities**. One of CareerVillage's priorities is 'No question left behind' meaning that no question should be left unanswered. In this case an executive decision could be made to use lower thresholds. The benefit would be that there would be a higher possibility that a question would get an answer because the system would output a longer list of recommended professionals. The risk is that the number of irrelevant questions sent to professionals will increase. One way to manage this risk would be to warn professionals ahead of time that a new system is being tested and request their patience (and possibly their feedback) as the system is being tuned. 

Another point to consider is that these thresholds are dictated by the amount and quality of the data. As the amount of data increases, these thresholds could be adjusted. This system is not static. Its ongoing performance will need to be monitored. With time it could get better as data quantity increases or it could get worse if the data becomes polluted, with spam for example.


<hr>
<a id='Things'></a>

## 11. Things to keep in mind

**Domain knowledge is like the force, use it...**

If you are a domain expert, i.e. you work for CareerVillage or you're an experienced career counselor or child psychologist, then you may be able to use the practical insights you have to improve this solution. 

I suggest starting by looking at the filter. Are the conditions too strict? Are there better conditions that can be added?<br>
You can then try tuning the threshold values or other model parameters to see if the quality and quantity of the recommendations improve. Also, this system is set up so that the models work independently. Therefore, it's possible to experiment by including and excluding certain models.

**Be careful...**

However, as mentioned above, please be careful when lowering the threshold value for model 4 (GloVe). If a question is very general then this model may cast a wide net and recommend thousands of professionals. This could crash the system.
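
One possible safeguard, sketched below, is to cap the number of professionals a single model may return and keep only the highest scorers. This is not part of the current system: `MAX_RECOMMENDATIONS` and `cap_recommendations` are hypothetical names, and the limit of 500 is an arbitrary placeholder that would need tuning.

```python
import heapq

MAX_RECOMMENDATIONS = 500  # hypothetical limit, to be tuned

def cap_recommendations(scored, limit=MAX_RECOMMENDATIONS):
    """Keep the best score per professional, then the `limit` highest scorers."""
    best = {}
    for prof_id, score in scored:
        if score > best.get(prof_id, float('-inf')):
            best[prof_id] = score
    top = heapq.nlargest(limit, best.items(), key=lambda item: item[1])
    return [prof_id for prof_id, _ in top]

# Toy (professional_id, similarity) pairs, invented for illustration only
scored = [('prof_a', 0.97), ('prof_b', 0.95), ('prof_a', 0.99), ('prof_c', 0.96)]
print(cap_recommendations(scored, limit=2))
```

A cap like this bounds the size of the list passed to the filter regardless of how general the question is, at the cost of dropping some lower-scoring matches.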

**What steps can be taken to address the cold start problem?**

Say you've just signed up as a shopper on Amazon. The site won't know the most relevant products to recommend to you because you've never bought anything i.e. you don't have a shopping history. This is called the cold start problem. 

Here the new professionals are the "shoppers" and the "products" are the questions.

In this recommender system Model 1 gives every professional a chance to be paired with a question. It relies on tags and professional profiles. 

(Reminder: A Professional's profile includes a professional's industry and title.)

Therefore, one way of tackling the cold start problem is to encourage new professionals to complete their tag and profile information. There are many professionals whose info is incomplete or sparse. New professionals will have a better chance of being matched if they provide complete and detailed information about themselves. 

Another way of addressing the cold start problem is to simply do nothing. Professionals are currently able to scroll the forums to find questions to answer. With time cold start issues will resolve themselves as these professionals find and answer questions. Once they answer a question Model 2, 3 or 4 will automatically match them to more questions.

**Fairness**

One of the requirements for fairness is a diverse dataset. This isn't always easy to create for many reasons, one being that the perspective of those creating it may be limited. Let me explain using a professions vs trades example. (Please note that the term bias here means one-sidedness, not bias in the sense of bias/variance.)

Society tends to esteem professions above trades. Those of us who've reaped the social and financial rewards of a university education would naturally want to encourage all children to follow this path. But not all children want to be doctors. Many can find secure, fulfilling and often lucrative careers as plumbers and electricians. Not everyone knows that such opportunities exist outside the university system.

If CareerVillage's marketing team were to target a predominantly white-collar demographic when trying to attract new professionals, then the number of university-educated professionals in the dataset would be higher, and would grow faster, than the number of tradesmen or tradeswomen. This would lead to a data bias in favour of professions. Any algorithms constructed using this data would reflect this bias. The consequence: students who submit questions about learning a trade won't get answers. This would be an "exclusionary user experience", also known as discrimination. All because the data is not diverse.

Here's an enlightening TED talk on algorithmic bias:<br>
https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms?language=en#t-232473



<hr>
<a id='Ideas'></a>

## 12. Ideas for sharpening this system

**Include a Stoplist**<br>

A stoplist is a pre-defined list of root words. It can be used to blacklist professionals who've given answers or have profiles that contain words that are in the stoplist. The recommender system shouldn't know that these professionals exist.

Why? CareerVillage meets a very real need. It's certain to become popular with more students and professionals, in more countries. Unfortunately this increase in popularity will attract both the good and the bad - human and bot. It's important to build a stoplist into the recommender system that will exclude certain professionals from the final output when that person is promoting an agenda that's contrary to the CareerVillage value system.
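
A minimal sketch of such a check is shown below. The root words, the helper names and the prefix-matching rule are all assumptions made for illustration. A real stoplist would be curated by CareerVillage, and a proper stemmer would be more robust than matching on word prefixes.

```python
import re

STOPLIST = {'casino', 'gambl'}  # hypothetical root words

def contains_stoplisted_word(text, stoplist=STOPLIST):
    """True if any word in `text` starts with a root word from the stoplist."""
    words = re.findall(r'[a-z]+', text.lower())
    return any(word.startswith(root) for word in words for root in stoplist)

def remove_blacklisted(prof_texts, stoplist=STOPLIST):
    """Drop professionals whose profile or answer text hits the stoplist."""
    return {prof_id: text for prof_id, text in prof_texts.items()
            if not contains_stoplisted_word(text, stoplist)}

# Toy profile/answer text, invented for illustration only
prof_texts = {
    'prof_a': 'Software engineer who mentors students',
    'prof_b': 'Win big with online gambling today',
}
print(list(remove_blacklisted(prof_texts)))
```

Running this step before the models see the data would mean that blacklisted professionals never enter the candidate pool at all.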

**Train a Neural Network to automatically rate the quality of answers**<br>

The CareerVillage vision is to give students answers that are tailored, reliable, encouraging and inspirational. There are many answers in the dataset that meet these requirements. Unfortunately, there are also answers that don't. Being able to automatically rate the quality of answers may help improve them.

How might this be done?<br>

The main prerequisite is a labeled dataset. Once this is in place one could train a deep neural network to read each answer and then rate it on a scale of 1 to 10 according to the CareerVillage [Pro Tips](https://medium.com/@careervillage/introducing-protips-2d4ad51c445a) guidelines. This neural network could have one of several architectures - DNN, CNN, RNN or even BERT, which is a state-of-the-art pre-trained network created by Google. 

However, factors such as memory requirements, training time, inference time, web page load time, maintenance complexity and hosting costs will need to be considered when deciding if this idea is feasible.


**Add "Recent Login" as a criterion for selecting professionals**

If a professional visited CareerVillage recently then he or she was probably scrolling through the forums reading the questions and answers. There's a good chance that this person will respond to an email notification to answer a relevant question. Login information is not part of this dataset, but if it becomes available the following condition should be added to the filter:<br>
"Has this professional visited the site in the last 14 days?"

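A condition like this could be expressed as a 0/1 flag and added to the total score alongside the existing filter columns. The sketch below assumes a hypothetical `last_login` timestamp per professional, which does not exist in this dataset, and assumes that the filter conditions are 0/1 flags, as the summing of the score columns suggests.

```python
from datetime import datetime, timedelta

RECENT_LOGIN_DAYS = 14  # window quoted in the condition above

def visited_recently(last_login, now, days=RECENT_LOGIN_DAYS):
    """Return 1 if the professional logged in within the last `days` days, else 0."""
    if last_login is None:
        return 0
    return int(timedelta(0) <= now - last_login <= timedelta(days=days))

# Fixed reference date so the example is reproducible
now = datetime(2019, 4, 9)
print(visited_recently(datetime(2019, 4, 1), now))
print(visited_recently(datetime(2019, 3, 1), now))
```
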
**Encourage all professionals to complete their profile information**

There are many professionals who have incomplete or sparse profile information. This recommender system relies on profile information. If more professionals provide complete profiles then the performance of this system will improve.
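
To decide who should receive a reminder, one could flag the profiles with missing fields. The sketch below is only an illustration: the field names in `PROFILE_FIELDS` are hypothetical and would need to be mapped to the actual columns used to build each profile.

```python
PROFILE_FIELDS = ('industry', 'headline', 'tags')  # hypothetical field names

def missing_fields(profile):
    """Return the profile fields that are absent, None or blank."""
    missing = []
    for field in PROFILE_FIELDS:
        value = profile.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

# Toy profiles, invented for illustration only
profiles = {
    'prof_a': {'industry': 'Plumbing', 'headline': 'Master plumber',
               'tags': 'plumbing trades'},
    'prof_b': {'industry': '', 'headline': 'Engineer', 'tags': None},
}

for prof_id, profile in profiles.items():
    print(prof_id, missing_fields(profile))
```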

**Add 'About me' and 'Career Stories' to the profile information**

On some professionals' CareerVillage profile pages there are sections called "About me" and "Career Stories". In these sections they share what their career path has been, life experiences, past mistakes, what they do on a typical work day and other useful personal information and experiences. Here's a good example:<br>
https://www.careervillage.org/users/9852/kim/

This data is not included in this dataset, possibly because only a few professionals choose to share this information. "About me" and "Career Stories" could be a valuable data source for this recommender system. Including them will also help address the cold start problem.

<a id='Citations'></a>

## Citations

1. GloVe: [Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)<br>
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. 

2. Photo by jeshoots.com on Pixabay

<a id='Reference_Kernels'></a>

## Reference Kernels

1. Rounak Banik - Movie Recommender Systems<br>
https://www.kaggle.com/rounakbanik/movie-recommender-systems

2. Chris Crawford - Starter kernel<br>
https://www.kaggle.com/crawford/starter-kernel

3. wjsheng - UPDATE 5: text processing<br>
https://www.kaggle.com/wjshenggggg/update-5-text-processing

4. RodH - Recommender: things to consider<br>
https://www.kaggle.com/rdhnw1/recommender-things-to-consider

5. Marsh - Keras cnn + GloVe + Early Stopping<br>
https://www.kaggle.com/vbookshelf/keras-cnn-glove-early-stopping-0-048-lb


<a id='Helpful_Resources'></a>

## Helpful Resources

1. Frank Kane course on Building Recommender Systems<br>
https://www.udemy.com/building-recommender-systems-with-machine-learning-and-ai

2. Andrew Ng Deep Learning Specialization, Sequence Models, Week 2<br>
https://www.coursera.org/learn/nlp-sequence-models

3. What are word embeddings?<br>
https://www.youtube.com/watch?v=Eku_pbZ3-Mw

4. Blog post with a simple example explaining how to use pre trained embeddings:<br>
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

5. Machine learning with text<br>
https://www.youtube.com/watch?v=ZiKMIuYidY0

6. NLTK Tutorial series<br>
https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/

7. Blog post by Kaggle Grandmaster Abhishek Thakur<br>
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/

8. Tutorial on merging dataframes<br>
https://www.youtube.com/watch?v=h4hOPGo4UVU

9. Blog post by William Zinsser<br>
https://theamericanscholar.org/writing-english-as-a-second-language/#.XJ8oJhMzYWo



<hr>

<a id='Conclusion'></a>

## Conclusion

Ikigai is a formula for happiness and fulfillment. It's a Japanese word that roughly means "a reason for being" or "the reason you wake up in the morning". It's the area of intersection of four overlapping circles: what you love to do, what you're good at, what you get paid to do and what the world needs.

Thank you CareerVillage and Kaggle for hosting this challenging competition.