# CV: Question-Professional Matching with RecSys

## Problem Statement

The U.S. has almost 500 students for every guidance counselor. Underserved youth lack the network to find their career role models, making CareerVillage.org the only option for millions of young people in America and around the globe with nowhere else to turn.

To date, 25,000 volunteers have created profiles and opted in to receive emails when a career question is a good fit for them. This is where your skills come in. To help students get the advice they need, the team at CareerVillage.org needs to be able to send the right questions to the right volunteers. The notifications sent to volunteers seem to have the greatest impact on how many questions are answered.

### Problem Statement Breakdown:

1. The problem statement has two verticals - Questions and Professionals
2. The goal is to find the right connections between them
3. Find the relevant Professional for any given new Question
4. Send targeted emails to the recommended Professionals


### Dataset Summary:

1. There are **23,931 questions** from 12,329 students
2. There are **51,123 answers**
3. There are **50,106 answers** provided by **28,152 professionals**
4. There are **6,679 answers** provided where a professional has answered more questions than asked


In [None]:
# The usual suspects for data processing and visualization
import os
import math
import random
import datetime
import warnings

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy
import scipy.sparse  # sparse matrices for the text features
import sklearn
import sklearn.preprocessing  # normalize() for the professional profiles
import matplotlib.pyplot as plt  # note: pyplot, not the top-level matplotlib package
import seaborn as sns
from IPython.display import display
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds

warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
emails = pd.read_csv('../input/emails.csv')
questions = pd.read_csv('../input/questions.csv', parse_dates = ['questions_date_added'])
professionals = pd.read_csv('../input/professionals.csv')
comments = pd.read_csv('../input/comments.csv')
tag_users = pd.read_csv('../input/tag_users.csv')
group_memberships = pd.read_csv('../input/group_memberships.csv')
tags = pd.read_csv('../input/tags.csv')
students = pd.read_csv('../input/students.csv')
groups = pd.read_csv('../input/groups.csv')
tag_questions = pd.read_csv('../input/tag_questions.csv')
matches = pd.read_csv('../input/matches.csv')
answers = pd.read_csv('../input/answers.csv')
school_memberships = pd.read_csv('../input/school_memberships.csv')

### EDA

#### 1. Question and Answer Analysis

In [None]:
min_q_date = min(questions['questions_date_added'])
max_q_date = max(questions['questions_date_added'])
print('There were {:,} questions asked between {} and {}'.format(questions.shape[0], min_q_date.strftime('%Y-%m-%d'), max_q_date.strftime('%Y-%m-%d')))

# Plot count of questions across years
sns.set_style("white")
sns.countplot(x=questions['questions_date_added'].dt.year, data=questions, facecolor='darkorange').set_title('Volume of Questions per Year')
sns.despine();

In [None]:
answers = pd.read_csv('../input/answers.csv', parse_dates=['answers_date_added'])
min_a_date = min(answers['answers_date_added'])
max_a_date = max(answers['answers_date_added'])
print('There were {:,} answers provided between {} and {}'.format(answers.shape[0], min_a_date.strftime('%Y-%m-%d'), max_a_date.strftime('%Y-%m-%d')))

# Plot count of answers across years
sns.set_style("white")
sns.countplot(x=answers['answers_date_added'].dt.year, data=answers, facecolor='darkorange').set_title('Volume of Answers per Year')
sns.despine();

In [None]:
q_a = questions.merge(right=answers, how='inner', left_on='questions_id', right_on='answers_question_id')
print('There are {:,} questions that got answered, which is {:.0f}% of all questions.'.format(q_a['questions_id'].nunique(), 100*q_a['questions_id'].nunique()/questions.shape[0]))

#### 2. Distribution of days taken to answer a question

In [None]:
questions['questions_date_added'] = pd.to_datetime(questions['questions_date_added'])
answers['answers_date_added'] = pd.to_datetime(answers['answers_date_added'])
qa = pd.merge(questions, answers, left_on='questions_id', right_on='answers_question_id', how='left')
qa_grouped = qa.groupby('questions_id').agg({'questions_date_added':min, 'answers_date_added':min,
                                                   'questions_body':min})
qa_grouped['days_taken'] = (qa_grouped['answers_date_added'] - qa_grouped['questions_date_added']).dt.days
qa_grouped['questions_body_length'] = qa_grouped['questions_body'].apply(len)

In [None]:
print('Numerical summary of days taken to answer a question')
display(qa_grouped['days_taken'].describe())
plt.figure(figsize=(10,6))
plt.title('Distribution of days taken to answer a question')
plt.hist(qa_grouped['days_taken'], color='blue', edgecolor='black', bins=100)
plt.xlabel('min days taken to answer a question')
plt.ylabel('count')

#### 3. Distribution of tags amongst the questions:

In [None]:
q_t = questions.merge(right=tag_questions, how='left', left_on='questions_id', right_on='tag_questions_question_id')
q_tag = q_t.merge(right=tags, how='left', left_on='tag_questions_tag_id', right_on='tags_tag_id')
q_a_tag = q_tag.merge(right=answers, how='left', left_on='questions_id', right_on='answers_question_id')
tagnames = q_tag['tags_tag_name'].nunique()
notag = q_tag['questions_id'][q_tag['tags_tag_name'].isnull()]
print('Across the {:,} questions {} unique tags were used; {} questions had no tag'.format(questions.shape[0], tagnames, len(notag)))

In [None]:
# Distribution of tags 
q_tag.groupby(['questions_id'])['tags_tag_name'].nunique().describe()

In [None]:
sns.set(style='whitegrid', palette='bright', color_codes=True)
# Draw a histogram (with rug marks) of the number of unique tags per question
# sns.swarmplot(x=q_tag.groupby(['questions_id'])['tags_tag_name'].nunique(),data=q_tag)
ax = sns.distplot(q_tag.groupby(['questions_id'])['tags_tag_name'].nunique(), hist=True, kde=False, rug=True, bins=40)
ax.set(xlabel='Number of Unique Tags', ylabel='Number of Questions', title='Distribution of Unique Tags')
sns.despine(left=True)

#### 4. Questions with the highest number of tags

In [None]:
# Print the titles of the 20 questions with the most unique tags
top20_tag_counts = q_tag.groupby(['questions_id'])['tags_tag_name'].nunique().nlargest(20)
for question_id in top20_tag_counts.index:
    # Every row of a question carries the same title, so the first row is enough
    print(q_tag.loc[q_tag['questions_id'] == question_id, 'questions_title'].iloc[0])

#### 5. Some of the most popular tags

In [None]:
#To see the tags that every user follows 
tag_users_merged = pd.merge(tag_users, tags, left_on='tag_users_tag_id', right_on='tags_tag_id', how='inner')
#To see the tags that are linked with every question
questions_merged = pd.merge(tag_questions, tags, left_on='tag_questions_tag_id', right_on='tags_tag_id', how='inner')

In [None]:
plt.figure(figsize=(10,6))
plt.title('50 most popular tags wrt user following')
# Pass the data as x=...: recent seaborn versions no longer accept it positionally
top50_followed_tags = tag_users_merged['tags_tag_name'].value_counts().index[:50]
sns.countplot(x=tag_users_merged.loc[tag_users_merged['tags_tag_name'].isin(top50_followed_tags), 'tags_tag_name'],
              color='maroon', order=top50_followed_tags)
plt.ylabel('count')
plt.xticks(rotation='vertical')

In [None]:
plt.figure(figsize=(10,6))
plt.title('50 most popular tags wrt the number of questions they are linked to')
# Pass the data as x=...: recent seaborn versions no longer accept it positionally
top50_question_tags = questions_merged['tags_tag_name'].value_counts().index[:50]
sns.countplot(x=questions_merged.loc[questions_merged['tags_tag_name'].isin(top50_question_tags), 'tags_tag_name'],
              color='maroon', order=top50_question_tags)
plt.ylabel('count')
plt.xticks(rotation='vertical')

# Recommendation System

*Wikipedia*

A recommender system or a recommendation system (sometimes replacing "system" with a synonym such as platform or engine) is a subclass of information filtering system that seeks to predict the "preference" a user would give to an item.

Recommender systems typically produce a list of recommendations in one of two ways – through collaborative filtering or through content-based filtering (also known as the personality-based approach). Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties. These approaches are often combined (see Hybrid Recommender Systems).

Therefore, I propose the following three approaches to solve this problem:

* **Content-Based Filtering-** This method uses only the titles and descriptions of the questions a professional has previously answered when modeling the professional's preferences. In other words, these algorithms try to recommend questions that are similar to those the professional has answered in the past. In particular, candidate student questions are compared with the questions the professional has already answered, and the best-matching questions are recommended.

* **Collaborative Filtering-** This method makes automatic predictions (filtering) about the preferences of a professional by collecting preferences from many other professionals (collaborating). It predicts which questions a particular professional will answer based on the questions that similar professionals have answered. The underlying assumption of the collaborative filtering approach is that if Alice and Bob have answered the same question(s), Alice is more likely to share Bob's preference for a given question than that of a randomly chosen professional.

* **Hybrid methods-** Companies such as Netflix and Hulu use a hybrid approach in their recommendation models, which provides recommendations based on a combination of the content a user liked in the past and what other similar users like. Research has shown that a hybrid approach combining collaborative filtering and content-based filtering can be more effective than either approach alone in some cases. Hybrid methods can also help overcome common problems in recommender systems such as cold start and sparsity.
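
As a rough illustration of the hybrid idea, the sketch below blends two score dictionaries with a weighted sum after rescaling them to a common range. The helper names (`min_max_scale`, `blend_scores`), the toy scores, and the weight `cb_weight` are assumptions made for this example only; they are not part of the models built in this notebook.

```python
def min_max_scale(scores):
    """Rescale a {question_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {q: 0.0 for q in scores}
    return {q: (s - lo) / (hi - lo) for q, s in scores.items()}

def blend_scores(cb_scores, cf_scores, cb_weight=0.5):
    """Weighted sum of content-based and collaborative scores (hypothetical helper)."""
    cb, cf = min_max_scale(cb_scores), min_max_scale(cf_scores)
    all_ids = set(cb) | set(cf)
    blended = {q: cb_weight * cb.get(q, 0.0) + (1 - cb_weight) * cf.get(q, 0.0)
               for q in all_ids}
    # Highest blended score first
    return sorted(blended.items(), key=lambda kv: -kv[1])

# Toy scores for three questions from two different models
toy_cb_scores = {'q1': 0.9, 'q2': 0.1, 'q3': 0.5}
toy_cf_scores = {'q1': 2.0, 'q2': 8.0, 'q3': 5.0}
toy_ranking = blend_scores(toy_cb_scores, toy_cf_scores, cb_weight=0.7)
print(toy_ranking)
```

Rescaling first matters because the two models usually score on different scales; without it the model with the larger raw scores would dominate the blend regardless of the weight.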


### Model Evaluation

In recommender systems, a set of metrics is commonly used for evaluation. We chose to work with Top-N accuracy metrics, which evaluate the accuracy of the top recommendations provided to a user by comparing them with the items the user has actually interacted with.
This evaluation method works as follows:

* For each user
    * For each item the user has interacted with in the test set
        * Sample 100 other items the user has never interacted with.
        * Ask the recommender model to produce a ranked list of recommended items from a set composed of the one interacted item and the 100 non-interacted ("non-relevant") items.
        * Compute the Top-N accuracy metrics for this user and interacted item from the ranked list of recommendations.
* Aggregate the global Top-N accuracy metrics
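
The procedure above can be sketched on toy data as follows. This is a simplified illustration, not the evaluator used in this notebook: the function names (`hit_at_n`, `recall_at_n`), the toy scores, and the sample size of 10 negatives are assumptions chosen to keep the example small.

```python
import random

def hit_at_n(ranked_ids, target_id, n):
    """Return 1 if target_id is among the first n entries of ranked_ids, else 0."""
    return int(target_id in ranked_ids[:n])

def recall_at_n(scores, positives, candidates, n, num_negatives, seed=42):
    """Recall@n for one user: each held-out positive is ranked against sampled negatives."""
    rng = random.Random(seed)
    negatives_pool = sorted(set(candidates) - set(positives))
    hits = 0
    for pos in positives:
        # Mix the one relevant item with randomly sampled non-relevant items
        pool = rng.sample(negatives_pool, num_negatives) + [pos]
        ranked = sorted(pool, key=lambda item: -scores[item])
        hits += hit_at_n(ranked, pos, n)
    return hits / len(positives)

# Toy model: 20 questions, and a question's score is simply its position in the list
toy_candidates = ['q%02d' % i for i in range(20)]
toy_scores = {q: i for i, q in enumerate(toy_candidates)}
toy_positives = ['q19', 'q00']  # the best-scored and the worst-scored question

toy_recall = recall_at_n(toy_scores, toy_positives, toy_candidates, n=3, num_negatives=10)
print(toy_recall)
```

Here `q19` always ranks first and `q00` always ranks last, so exactly one of the two held-out questions lands in the top 3 whatever negatives are sampled.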

In the next code block, we build the model evaluator

In [None]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_questions = 100

class ModelEvaluator:

    def get_not_answered_questions_sample(self, professional_id, sample_size, seed=42):
        answered_questions = get_questions_answered(professional_id, answers_full_indexed_df)
        all_questions = set(questions['questions_id'])
        non_answered_questions = all_questions - answered_questions

        #random.seed(seed)
        # random.sample() no longer accepts a set (Python 3.11+), so convert to a list first
        non_answered_questions_sample = random.sample(list(non_answered_questions), sample_size)
        return set(non_answered_questions_sample)

    def _verify_hit_top_n(self, project_id, recommended_questions, topn):
        try:
            index = next(i for i, c in enumerate(recommended_questions) if c == project_id)
        except StopIteration:
            # The question does not appear in the recommendation list at all
            index = -1
        hit = int(index in range(0, topn))
        return hit, index

    def evaluate_model_for_professional(self, model, professional_id, processed_text_matrix = None):
        #Getting the questions in test set
        answered_values_testset = answers_test_indexed_df.loc[professional_id]
        if type(answered_values_testset['questions_id']) == pd.Series:
            professional_answered_questions_testset = set(answered_values_testset['questions_id'])
        else:
            professional_answered_questions_testset = set([answered_values_testset['questions_id']])  
        answered_questions_count_testset = len(professional_answered_questions_testset) 

        #Getting a ranked recommendation list from a model for a given professional
        # Use "is None": comparing a sparse matrix with == None does not yield a plain boolean
        if processed_text_matrix is None:
            professional_recs_df = model.recommend_questions(professional_id, 
                                                   questions_to_ignore=get_questions_answered(professional_id, 
                                                                                        answers_train_indexed_df), 
                                                   topn=100000000)
        else:
            professional_recs_df = model.recommend_questions(professional_id,processed_text_matrix, 
                                                   questions_to_ignore=get_questions_answered(professional_id, 
                                                                                        answers_train_indexed_df), 
                                                   topn=100000000)
        hits_at_3_count = 0
        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each project the professional has answered in test set
        for project_id in professional_answered_questions_testset:
            #Getting a random sample of 100 questions the professional has not answered
            #(to represent questions that are assumed to be not relevant to the professional)
            non_answered_questions_sample = self.get_not_answered_questions_sample(professional_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_questions, 
                                                                              seed=42)

            #Combining the current answered project with the 100 random questions
            questions_to_filter_recs = non_answered_questions_sample.union(set([project_id]))

            #Filtering only recommendations that are either the answered project or from a random sample of 100 non-answered questions
            valid_recs_df = professional_recs_df[professional_recs_df['questions_id'].isin(questions_to_filter_recs)]                    
            valid_recs = valid_recs_df['questions_id'].values
            #Verifying if the current answered project is among the Top-N recommended questions
            hit_at_3, index_at_3 = self._verify_hit_top_n(project_id, valid_recs, 3)
            hits_at_3_count += hit_at_3
            hit_at_5, index_at_5 = self._verify_hit_top_n(project_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(project_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the answered questions that are ranked among the Top-N recommended questions, 
        #when mixed with a set of non-relevant questions
        recall_at_3 = hits_at_3_count / float(answered_questions_count_testset)
        recall_at_5 = hits_at_5_count / float(answered_questions_count_testset)
        recall_at_10 = hits_at_10_count / float(answered_questions_count_testset)

        professional_metrics = {'hits@3_count':hits_at_3_count, 
                         'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'answered_count': answered_questions_count_testset,
                          'recall@3': recall_at_3,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return professional_metrics

    def evaluate_model(self, model, processed_text_matrix=None):
        #print('Running evaluation for professionals')
        people_metrics = []
        for idx, professional_id in enumerate(list(answers_test_indexed_df.index.unique().values)):
            
            professional_metrics = self.evaluate_model_for_professional(model, professional_id, processed_text_matrix)  
            professional_metrics['_professional_id'] = professional_id
            people_metrics.append(professional_metrics)
        print('%d professionals processed' % len(people_metrics))
        print(pd.DataFrame(people_metrics).head())

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('answered_count', ascending=False)
        
        global_recall_at_3 = detailed_results_df['hits@3_count'].sum() / float(detailed_results_df['answered_count'].sum())
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['answered_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['answered_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@3': global_recall_at_3,
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df

## 1. Data Preparation

Before getting into the recommender systems, we will load and preprocess our datasets. The code creates two datasets: "questions" contains all question-related information, while "answers_df" contains answer- and professional-related information. To make sure the code runs smoothly in a Kaggle kernel, I turned on test mode, which keeps only the first 10,000 answer events in the dataset.

Before modeling, we need to measure the strength of the relation between a professional and a question. Most professionals answer only once in the dataset, but to capture repeated answers we sum the number of answers per professional-question pair and create a new dataset containing the unique relations between a professional and a question together with their strength. The `data_preperation` function prints the number of questions and the number of unique professional-question answer events:
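
To make the aggregation concrete, here is a minimal sketch on a made-up answers table (the `toy_` names and ids are assumptions for illustration). Each answer counts as one event, and the events are summed per professional-question pair.

```python
import pandas as pd

# Toy answer log: professional p1 answered question q1 twice
toy_answers = pd.DataFrame({
    'professionals_id': ['p1', 'p1', 'p1', 'p2'],
    'questions_id':     ['q1', 'q1', 'q2', 'q1'],
})
toy_answers['eventStrength'] = 1  # every answer is one event

# One row per (professional, question) pair, with the summed strength
toy_strength = (toy_answers
                .groupby(['professionals_id', 'questions_id'])['eventStrength']
                .sum()
                .reset_index())
print(toy_strength)
```

The four answer events collapse into three unique pairs, and the pair (`p1`, `q1`) carries a strength of 2.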

In [None]:
def data_preperation(questions, answers, professionals):
   
    test_mode = True
    # Merge datasets
    answers = answers.merge(professionals, left_on='answers_author_id', right_on='professionals_id', how="left")
    df = answers.merge(questions,left_on='answers_question_id', right_on='questions_id', how="left")
    print(df.shape)
    # only load a few lines in test mode
    if test_mode:
        df = df.head(10000)

    answers_df = df

    # Define event strength as the number of answers a professional gave to a question
    answers_df['eventStrength'] = 1

    def smooth_professional_preference(x):
        return x
        
    answers_full_df = answers_df \
                        .groupby(['professionals_id', 'questions_id'])['eventStrength'].sum() \
                        .apply(smooth_professional_preference).reset_index()
            
    # Update questions dataset
    question_cols = questions.columns
    questions = df[question_cols].drop_duplicates()

    print('# of questions: %d' % len(questions))
    print('# of unique user/question answers: %d' % len(answers_full_df))

    return (questions, answers_full_df)

## 2. Split Data into Train & Test

To evaluate our models, we will split the answers dataset into training and test sets.

In [None]:
def split_train_test_data(answers_full_df):
    answers_train_df, answers_test_df = train_test_split(answers_full_df,
                                    test_size=0.20,
                                    random_state=42)

    print('# answers on Train set: %d' % len(answers_train_df))
    print('# answers on Test set: %d' % len(answers_test_df))
    
    answers_full_indexed_df = answers_full_df.set_index('professionals_id')
    answers_test_indexed_df = answers_test_df.set_index('professionals_id')
    answers_train_indexed_df = answers_train_df.set_index('professionals_id')
    
    return (answers_train_df, answers_test_df, answers_full_indexed_df, answers_test_indexed_df, answers_train_indexed_df)

# 1. Content Based RecSys

## A) Text Processing

#### Using tfidf

We will first use TF-IDF to extract information from question titles and descriptions. TF-IDF converts unstructured text into a vector, where each word is represented by a position in the vector and the value measures how relevant that word is to a question's title and description. It is used to compute the similarity between questions based on their titles and descriptions.
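
As a small illustration of the idea, the sketch below vectorizes three made-up question texts and compares them with cosine similarity. The example sentences and the `toy_` variable names are assumptions for this illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = [
    'how do i become a software engineer',
    'what degree does a software engineer need',
    'which colleges have good nursing programs',
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per text
toy_sims = cosine_similarity(toy_tfidf)             # pairwise similarity, shape (3, 3)
print(toy_sims.round(2))
```

The first two texts share the words "software" and "engineer", so their similarity is higher than the similarity between the first text and the nursing question, which shares no informative words with it.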

In [None]:
def process_text_using_tfidf(questions):
    # Preprocessing of text data
    textfeats = ["questions_title","questions_body"]
    for cols in textfeats:
        # Fill NA before casting: astype(str) would turn NaN into the string 'nan'
        questions[cols] = questions[cols].fillna('').astype(str)
        questions[cols] = questions[cols].str.lower() # Lowercase all text, so that capitalized words don't get treated differently
    
    text = questions["questions_title"] + ' ' + questions["questions_body"]
    vectorizer = TfidfVectorizer(strip_accents='unicode',
                                analyzer='word',
                                lowercase=True, # Convert all uppercase to lowercase
                                stop_words='english', # Remove commonly found english words ('it', 'a', 'the') which do not typically contain much signal
                                max_df = 0.9, # Ignore words that appear in more than 90% of all documents
                                # max_features=5000 # Maximum features to be extracted                    
                                )                        
    question_ids = questions['questions_id'].tolist()
    tfidf_matrix = vectorizer.fit_transform(text)
    # get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
    tfidf_feature_names = vectorizer.get_feature_names_out()

    return (tfidf_matrix, tfidf_feature_names, question_ids)

#### Using BERT Embedding

Now, let's use BERT to extract information from question title and descriptions.

In [None]:
def bert_setup():
    # Install bert-as-service
    !pip install bert-serving-server
    !pip install bert-serving-client

    # Download and unzip the pre-trained model
    !wget http://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
    !unzip uncased_L-12_H-768_A-12.zip

    # Start the BERT server
    bert_command = 'bert-serving-start -model_dir /kaggle/working/uncased_L-12_H-768_A-12'
    process = subprocess.Popen(bert_command.split(), stdout=subprocess.PIPE)
   
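def combine_two_list(x, y):
    # Concatenate the i-th title embedding with the i-th body embedding
    # (x[i] and y[i] are Python lists, so + joins them into one longer vector)
    return [x_i + y_i for x_i, y_i in zip(x, y)]

def process_text_using_bert_embeddings(questions):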

    title_embeddings = bc.encode(questions["questions_title"].tolist())
    body_embeddings = bc.encode(questions["questions_body"].tolist())

    question_embeddings = np.asarray(combine_two_list(title_embeddings.tolist(),body_embeddings.tolist()))

    return scipy.sparse.csr_matrix(question_embeddings)


## B) Build Professional profile

To build a professional's profile, we take the profiles of all the questions the professional has answered and average them. The average is weighted by the event strength, i.e. the number of times the professional answered each question.
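
A minimal sketch of this weighted average on toy vectors is shown below; the two-dimensional vectors and the strengths are made up for illustration and stand in for the TF-IDF or BERT rows used in the notebook.

```python
import numpy as np

# Three toy question vectors (one row each) answered by the same professional
toy_question_vectors = np.array([[1.0, 0.0],
                                 [0.0, 1.0],
                                 [1.0, 1.0]])
toy_strengths = np.array([2.0, 1.0, 1.0])  # the first question was answered twice

# Weighted average of the rows, followed by L2 normalization
toy_profile = toy_strengths @ toy_question_vectors / toy_strengths.sum()
toy_profile = toy_profile / np.linalg.norm(toy_profile)
print(toy_profile)
```

The first dimension ends up larger because the question answered twice weighs more. The notebook's own function divides by the sum of strengths plus one; since the profile is normalized afterwards, that constant factor does not change the resulting direction.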

In [None]:
def get_question_profile(question_id, processed_text_matrix):
    idx = question_ids.index(question_id)
    question_profile = processed_text_matrix[idx:idx+1]
    return question_profile

def get_question_profiles(ids, processed_text_matrix):
    question_profiles_list = [get_question_profile(x, processed_text_matrix) for x in np.ravel([ids])]
    question_profiles = scipy.sparse.vstack(question_profiles_list)
    return question_profiles

def build_professionals_profile(professional_id, answers_indexed_df, processed_text_matrix):
    answers_professional_df = answers_indexed_df.loc[professional_id]
    professional_question_profiles = get_question_profiles(answers_professional_df['questions_id'], processed_text_matrix)
    professional_question_strengths = np.array(answers_professional_df['eventStrength']).reshape(-1,1)
    #Weighted average of question profiles by the answers strength
    #np.asarray converts the np.matrix returned by the sparse sum, which recent scikit-learn versions reject
    professional_question_strengths_weighted_avg = np.asarray(np.sum(professional_question_profiles.multiply(professional_question_strengths), axis=0)) / (np.sum(professional_question_strengths)+1)
    professional_profile_norm = sklearn.preprocessing.normalize(professional_question_strengths_weighted_avg)
    return professional_profile_norm

from tqdm import tqdm

def build_professionals_profiles(answers_full_df, processed_text_matrix): 
    answers_indexed_df = answers_full_df[answers_full_df['questions_id'].isin(questions['questions_id'])].set_index('professionals_id')
    professional_profiles = {}
    for professional_id in tqdm(answers_indexed_df.index.unique()):
        professional_profiles[professional_id] = build_professionals_profile(professional_id, answers_indexed_df, processed_text_matrix)
    print("# of professionals with profiles: %d" % len(professional_profiles))
    return professional_profiles

## C) Build the Content-Based Recommender

Now it's time to build our content-based recommender:

In [None]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, questions_df=None):
        self.question_ids = question_ids
        self.questions_df = questions_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_questions_to_professional_profile(self, professional_id, processed_text_matrix, topn=1000):
        #Computes the cosine similarity between the professional profile and all question profiles
        cosine_similarities = cosine_similarity(professional_profiles[professional_id], processed_text_matrix)
        #Gets the top similar questions
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        #Sort the similar questions by similarity
        similar_questions = sorted([(question_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_questions
        
    def recommend_questions(self, professional_id, processed_text_matrix, questions_to_ignore=[], topn=10, verbose=False):
        similar_questions = self._get_similar_questions_to_professional_profile(professional_id, processed_text_matrix)
        #Ignores questions the professional has already answered
        similar_questions_filtered = list(filter(lambda x: x[0] not in questions_to_ignore, similar_questions))
        
        recommendations_df = pd.DataFrame(similar_questions_filtered, columns=['questions_id', 'recStrength']).head(topn)

        recommendations_df = recommendations_df.merge(self.questions_df, how = 'left', 
                                                    left_on = 'questions_id', 
                                                    right_on = 'questions_id')[['recStrength', 'questions_id', 'questions_title', 'questions_body']]


        return recommendations_df
    

def get_questions_answered(professional_id, answers_df):
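    # Get the set of questions the professional has already answered
    try:
        # The column is 'questions_id'; a wrong column name raises KeyError and silently returns an empty set
        answered_questions = answers_df.loc[professional_id]['questions_id']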
        return set(answered_questions if type(answered_questions) == pd.Series else [answered_questions])
    except KeyError:
        return set()

### a) Run Content-based model with TF-IDF

In [None]:
questions, answers_full_df = data_preperation(questions, answers, professionals)
answers_train_df, answers_test_df, answers_full_indexed_df, answers_test_indexed_df, answers_train_indexed_df = split_train_test_data(answers_full_df)
tfidf_matrix, tfidf_feature_names, question_ids = process_text_using_tfidf(questions)
professional_profiles = build_professionals_profiles(answers_full_df, tfidf_matrix)

In [None]:
myprofessional1 = "000d4635e5da41e3bfd83677ee11dda4"
myprofessional2 = "00271cc10e0245fba4a35e76e669c281"
cbr_model_tfidf = ContentBasedRecommender(questions)
cbr_model_tfidf.recommend_questions(myprofessional2, tfidf_matrix)

In [None]:
model_evaluator = ModelEvaluator()

print('Evaluating Content-Based Filtering model...')
cb_tfidf_global_metrics, cb_tfidf_detailed_results_df = model_evaluator.evaluate_model(cbr_model_tfidf, tfidf_matrix)
print('\nGlobal metrics:\n%s' % cb_tfidf_global_metrics)
cb_tfidf_detailed_results_df = cb_tfidf_detailed_results_df[['_professional_id', 'answered_count', "hits@3_count", 'hits@5_count','hits@10_count', 
                                                'recall@3','recall@5','recall@10']]
cb_tfidf_detailed_results_df.head(10)

### b) Run Content-based model with BERT

In [None]:
import subprocess
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
bert_setup()

In [None]:
from bert_serving.client import BertClient
# Start the BERT client
bc = BertClient()

questions, answers_full_df = data_preperation(questions, answers, professionals)
answers_train_df, answers_test_df, answers_full_indexed_df, answers_test_indexed_df, answers_train_indexed_df = split_train_test_data(answers_full_df)
processed_text_embedding = process_text_using_bert_embeddings(questions)
professional_profiles = build_professionals_profiles(answers_full_df, processed_text_embedding)

In [None]:
myprofessional1 = "000d4635e5da41e3bfd83677ee11dda4"
myprofessional2 = "00271cc10e0245fba4a35e76e669c281"
cbr_model_bert = ContentBasedRecommender(questions)
cbr_model_bert.recommend_questions(myprofessional2, processed_text_embedding)

In [None]:
model_evaluator = ModelEvaluator()

print('Evaluating Content-Based Filtering model...')
cb_bert_global_metrics, cb_bert_detailed_results_df = model_evaluator.evaluate_model(cbr_model_bert, processed_text_embedding)
print('\nGlobal metrics:\n%s' % cb_bert_global_metrics)
cb_bert_detailed_results_df = cb_bert_detailed_results_df[['_professional_id', 'answered_count', "hits@3_count", 'hits@5_count','hits@10_count', 
                                                'recall@3','recall@5','recall@10']]
cb_bert_detailed_results_df.head(10)

# 2. Collaborative Filtering RecSys

Next, we will build a model-based Collaborative Filtering (CF) recommender. In this approach, models are developed using machine learning algorithms to recommend questions to professionals. There are many model-based CF algorithms; here we adopt a latent factor model, which compresses the professional-question matrix into a low-dimensional representation in terms of latent factors. This reduced representation can be used by either professional-based or question-based neighborhood search algorithms to find recommendations. Here we use a popular latent factor model, Singular Value Decomposition (SVD).

### A) Create the professional-question matrix

We will first get the professional-question matrix and print the first five rows.

In [None]:
#Creating a sparse pivot table with professionals in rows and questions in columns
professionals_questions_pivot_matrix_df = answers_full_df.pivot(index='professionals_id', 
                                                          columns='questions_id', 
                                                          values='eventStrength').fillna(0)

# Transform the professional-question dataframe into a matrix
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement)
professionals_questions_pivot_matrix = professionals_questions_pivot_matrix_df.to_numpy()

# Get professional ids
professionals_ids = list(professionals_questions_pivot_matrix_df.index)

# Print the first 5 rows of the professional-question matrix
professionals_questions_pivot_matrix[:5]
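
To make the shape of this matrix concrete, here is a small sketch on toy data. The frame `toy_answers` is hypothetical; only its column names mirror `answers_full_df`. It uses `pivot_table` with an aggregation instead of `pivot`, on the assumption that a professional may have more than one row for the same question, in which case `DataFrame.pivot` raises a `ValueError`.

```python
import pandas as pd

# Hypothetical toy interactions; the column names mirror answers_full_df
toy_answers = pd.DataFrame({
    'professionals_id': ['p1', 'p1', 'p2', 'p3', 'p3'],
    'questions_id':     ['q1', 'q2', 'q2', 'q1', 'q1'],  # p3 appears twice for q1
    'eventStrength':    [1.0, 1.0, 1.0, 1.0, 1.0],
})

# pivot_table aggregates duplicate (professional, question) pairs,
# whereas DataFrame.pivot raises ValueError on them
toy_matrix_df = toy_answers.pivot_table(index='professionals_id',
                                        columns='questions_id',
                                        values='eventStrength',
                                        aggfunc='sum',
                                        fill_value=0)

# Rows are professionals, columns are questions, missing pairs are 0
toy_matrix = toy_matrix_df.to_numpy()
print(toy_matrix_df)
```

Whether summing repeated interactions is the right aggregation depends on how `eventStrength` was defined earlier in the notebook; `max` would be the alternative if each pair should count once.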

### B) Singular Value Decomposition (SVD)

Now we will use SVD to obtain the latent factors. After the factorization, we reconstruct the original matrix by multiplying its factors. The resulting matrix is no longer sparse: it contains predicted scores for questions the professional has not yet answered, which we will use for recommendations.

In [None]:
from scipy.sparse.linalg import svds

# Perform matrix factorization of the original professional-question matrix
# Here we set k = 20, the number of latent factors to keep
# SVD approximates the original matrix A as a product A ≈ UΣVᵀ,
# where U and V have orthonormal columns and Σ is a non-negative diagonal matrix.
U, sigma, Vt = svds(professionals_questions_pivot_matrix, k = 20)
sigma = np.diag(sigma)

# Reconstruct the matrix by multiplying its factors
all_professional_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 

#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_professional_predicted_ratings, 
                           columns = professionals_questions_pivot_matrix_df.columns, 
                           index=professionals_ids).transpose()
cf_preds_df.head()
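
The effect of the truncated factorization is easier to see on a matrix small enough to print. The following is a minimal sketch with a hypothetical 4x5 interaction matrix and k = 2 instead of 20; the names `toy` and `toy_pred` are illustrative only.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical 4x5 interaction matrix (rows: professionals, columns: questions)
toy = np.array([[1, 1, 0, 0, 0],
                [1, 1, 0, 0, 1],
                [0, 0, 1, 1, 0],
                [0, 0, 1, 1, 0]], dtype=float)

# Keep k = 2 latent factors; svds requires k < min(toy.shape)
U, sigma, Vt = svds(toy, k=2)
toy_pred = U @ np.diag(sigma) @ Vt

# U: (4, 2), sigma: (2,), Vt: (2, 5); the product restores the original shape
print(U.shape, sigma.shape, Vt.shape, toy_pred.shape)

# Some cells that were 0 in toy now carry non-zero predicted scores
print(np.round(toy_pred, 2))
```

The first professional never answered the last question, yet the reconstruction assigns that cell a positive score because the second professional, who shares the first two questions, did answer it. This is the signal the recommender ranks on.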

### C) Build the Collaborative Filtering Model

In [None]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, questions_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.questions_df = questions_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_questions(self, professional_id, questions_to_ignore=[], topn=10):
        # Get and sort the professional's predictions
        sorted_professional_predictions = self.cf_predictions_df[professional_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={professional_id: 'recStrength'})

        # Recommend the highest predicted questions that the professional hasn't answered
        recommendations_df = sorted_professional_predictions[~sorted_professional_predictions['questions_id'].isin(questions_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

 
        recommendations_df = recommendations_df.merge(self.questions_df, how = 'left', 
                                                          left_on = 'questions_id', 
                                                          right_on = 'questions_id')[['recStrength', 'questions_id', 'questions_title', 'questions_body']]


        return recommendations_df

In [None]:
cfr_model = CFRecommender(cf_preds_df, questions)
cfr_model.recommend_questions(myprofessional2)

In [None]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cfr_model, None)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df = cf_detailed_results_df[['_professional_id', 'answered_count', "hits@3_count", 'hits@5_count','hits@10_count', 
                                               'recall@3','recall@5','recall@10']]
cf_detailed_results_df.head(10)

# 3. Hybrid RecSys

The third approach, the hybrid method, combines the first two approaches to try to give even better recommendations. Hybrid methods have performed better than individual approaches in many studies and have been extensively used by researchers and practitioners.

Let's build a very simple hybridization method: multiply the Content-Based score by the Collaborative Filtering score, and rank by the resulting hybrid score.

In [None]:
class HybridRecommender:
    
    MODEL_NAME = 'Hybrid'
    
    def __init__(self, cb_rec_model, cf_rec_model, questions_df):
        self.cb_rec_model = cb_rec_model
        self.cf_rec_model = cf_rec_model
        self.questions_df = questions_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_questions(self, professional_id,processed_text_matrix, questions_to_ignore=[], topn=10):
        #Getting the top-1000 Content-based filtering recommendations
        cb_recs_df = self.cb_rec_model.recommend_questions(professional_id,processed_text_matrix, questions_to_ignore=questions_to_ignore, 
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
        
        #Getting the top-1000 Collaborative filtering recommendations
        cf_recs_df = self.cf_rec_model.recommend_questions(professional_id, questions_to_ignore=questions_to_ignore,  
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
        
        #Combining the results by question ID
        recs_df = cb_recs_df.merge(cf_recs_df,
                                   how = 'inner', 
                                   left_on = 'questions_id', 
                                   right_on = 'questions_id')
        
        #Computing a hybrid recommendation score based on CF and CB scores
        recs_df['recStrengthHybrid'] = recs_df['recStrengthCB'] * recs_df['recStrengthCF']
        
        #Sorting recommendations by hybrid score
        recommendations_df = recs_df.sort_values('recStrengthHybrid', ascending=False).head(topn)

        recommendations_df = recommendations_df.merge(self.questions_df, how = 'left', 
                                                    left_on = 'questions_id', 
                                                    right_on = 'questions_id')[['recStrengthHybrid', 
                                                                              'questions_id', 'questions_title', 
                                                                              'questions_body']]


        return recommendations_df
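
The scoring step inside `recommend_questions` can be traced by hand on toy data. This is a minimal sketch with hypothetical scores for a single professional; the frames `cb_recs` and `cf_recs` stand in for the outputs of the two recommenders.

```python
import pandas as pd

# Hypothetical scores for one professional from the two recommenders
cb_recs = pd.DataFrame({'questions_id': ['q1', 'q2', 'q3'],
                        'recStrengthCB': [0.9, 0.5, 0.2]})
cf_recs = pd.DataFrame({'questions_id': ['q2', 'q3', 'q4'],
                        'recStrengthCF': [0.4, 0.8, 0.7]})

# The inner merge keeps only questions scored by both models (q2 and q3)
recs = cb_recs.merge(cf_recs, how='inner', on='questions_id')

# The hybrid score is the product of the two scores
recs['recStrengthHybrid'] = recs['recStrengthCB'] * recs['recStrengthCF']
recs = recs.sort_values('recStrengthHybrid', ascending=False).reset_index(drop=True)
print(recs)
```

Two properties of this design are worth keeping in mind. The inner merge drops any question that is missing from either top-1000 list, such as `q1` and `q4` here. And because reconstructed SVD scores can be negative, a product of two negative scores comes out positive, so it may be worth inspecting the score ranges before relying on the ranking.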

In [None]:
hybrid_model = HybridRecommender(cbr_model_bert, cfr_model, questions)

In [None]:
hybrid_model.recommend_questions(myprofessional2, processed_text_embedding)

In [None]:
print('Evaluating Hybrid model...')
hybrid_global_metrics, hybrid_detailed_results_df = model_evaluator.evaluate_model(hybrid_model, processed_text_embedding)
print('\nGlobal metrics:\n%s' % hybrid_global_metrics)
hybrid_detailed_results_df = hybrid_detailed_results_df[['_professional_id', 'answered_count', "hits@3_count", 'hits@5_count','hits@10_count', 
                                             'recall@3','recall@5','recall@10']]
hybrid_detailed_results_df.head(10)

# Model Comparison

In [None]:
global_metrics_df = pd.DataFrame([cf_global_metrics, 
                                  cb_tfidf_global_metrics,
                                  cb_bert_global_metrics,
                                  hybrid_global_metrics]).set_index('modelName')
global_metrics_df

**Observation: the Content-Based model using TF-IDF is the best-performing model.**

## References

1. http://recommender-systems.org/
2. https://en.wikipedia.org/wiki/Recommender_system
3. https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/
4. https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101
