# Introduction 
**Notebook Overview:** This notebook contains a solution for the CareerVillage recommendation system competition. In it, I build a hybrid recommendation system that recommends students' questions to professionals on CareerVillage.org. The recommender matches professionals with questions using the tags they follow, the tags of questions they have previously answered, and similar tags. It also addresses some of the most pressing problems for a CareerVillage recommender, such as cold start.


**Competition problem statement:** CareerVillage.org is a non-profit organization that helps underserved youth get the information they need to build their careers. Students ask questions on CareerVillage.org, and professionals (experts who love to help students) answer them. The challenge is that CareerVillage has to recommend the right questions to the right professionals, so that each question matches the professional's interests. This increases the likelihood that a question gets an answer. So in this competition, we have to build a recommendation system that recommends questions matching each professional's interests.

# Recommendation system
Before we dive into the solution, let's first make sure we understand the terminology of recommendation systems. A recommendation system is a system that identifies and recommends content or digital items to users based on their interests. Recommender systems have become an important feature of modern websites, e.g. Amazon, Netflix, or Flickr. Click rates, revenues and other measures of success may be increased by applying effective recommender systems.

The difficult task is to identify relevant items even if they are generally unpopular. Recommender systems leverage available context such as user information, time, location, etc. to filter relevant items. In this way, items from the tail of the popularity distribution can also be recommended successfully. [1]


**Types of recommendation system:**
1. **Collaborative Filtering:** Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person. Below is a picture of how collaborative filtering works [5]. 

    ![](https://i.imgur.com/FZli7DC.gif)
    One major problem of collaborative filtering is the "cold start". Collaborative filtering can be a powerful way of recommending items based on user history, but what if there is no user history? This is called the “cold start” problem, and it can apply both to new items and to new users. Items with lots of history get recommended a lot, while those without never make it into the recommendation engine, resulting in a positive feedback loop. At the same time, new users have no history, so the system doesn’t have any good recommendations for them. Potential solutions: onboarding processes can learn basic info to jump-start user preferences, or social network contacts can be imported [4].
    



2. **Content-Based Filtering:** These filtering methods are based on the description of an item and a profile of the user’s preferred choices. In a content-based recommendation system, keywords are used to describe the items, and a user profile is built to state the type of item this user likes. In other words, the algorithms try to recommend products which are similar to the ones that a user has liked in the past. The idea of content-based filtering is that if you like an item you will also like a ‘similar’ item. This works well when we recommend items of the same kind, as in movie or song recommendation. [2]

    One major problem of this approach is diversity. Relevance is important, but it’s not all there is. If you watched and liked Star Wars, the odds are pretty good that you’ll also like The Empire Strikes Back, but you probably don’t need a recommendation engine to tell you that. It’s also important for a recommendation engine to come up with results that are novel (that is, stuff the user wasn’t expecting) and diverse (that is, stuff that represents a broad selection of their interests). [3]

 
3. **Hybrid recommender system:** A hybrid recommender system combines content-based and collaborative filtering methods. Combining the two can be more effective in some cases. Hybrid approaches can be implemented in several ways: by making content-based and collaborative predictions separately and then combining them, or by adding content-based capabilities to a collaborative approach (and vice versa). Several studies empirically compare the performance of hybrid methods with pure collaborative and content-based methods and demonstrate that hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommender systems, such as cold start and sparsity.
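To make the collaborative idea above concrete, here is a minimal sketch of user-based collaborative filtering on a tiny, made-up interaction matrix. The data, the function names (`cosine_similarity`, `recommend_for`) and the scoring rule are assumptions chosen for illustration; this is not the model used later in this notebook.

```python
import numpy as np

# Hypothetical implicit interactions: rows are users, columns are items.
interactions = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 1, 1, 0],  # user 1 (overlaps strongly with user 0)
    [0, 0, 0, 1],  # user 2 (no overlap with user 0)
], dtype=float)


def cosine_similarity(a, b):
    """Cosine similarity between two interaction vectors (0 if either is empty)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def recommend_for(user, interactions):
    """Return the unseen item with the highest similarity-weighted vote."""
    scores = np.zeros(interactions.shape[1])
    for other in range(interactions.shape[0]):
        if other == user:
            continue
        sim = cosine_similarity(interactions[user], interactions[other])
        scores += sim * interactions[other]
    # Never recommend something the user has already interacted with.
    scores[interactions[user] > 0] = -np.inf
    return int(np.argmax(scores))


best_item = recommend_for(0, interactions)
print(best_item)
```

A brand-new user would have an all-zero row here, so every similarity would be 0 and no item would stand out. That is the cold-start problem in miniature.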




**Types of data for building recommendation systems:** There are two kinds of data available for building a recommendation system:
1. **Explicit feedback:** Explicit feedback is data in which users directly rate a product (ratings etc.). It tells us directly whether users like a product or not.

2. **Implicit feedback:** With implicit feedback, we don't have data about how the user rates a product. Instead, preferences are inferred from behaviour. Examples of implicit feedback are clicks, watched movies, played songs, purchases or assigned tags.


Now we know what a recommendation system is and what its different types are. Next, we will look at which method is most suitable for building a recommendation system for CareerVillage.org.

# Choosing a Recommendation System for CareerVillage
We have to build a recommendation system for CareerVillage, so let's see what data CareerVillage provides. It gives us datasets about professionals, the questions professionals have answered, students, and students' questions. First, let's decide what kind of data we have for building a recommender system. Is it implicit or explicit feedback?

Because we don't have ratings or an equivalent measure of how much professionals like a question, we have implicit feedback data, and that is what we use for the recommendation system. Examples of implicit feedback in CareerVillage are the tags of the questions a user has answered and the user's location.


**Which method should we choose?**: We can't rely on pure collaborative filtering because it suffers from cold start: a large portion of professionals haven't answered any questions yet, and new professionals keep joining. What about content-based filtering? This also causes problems, because recommendations would lack diversity and, for most professionals, we don't have clear enough information to create a user feature matrix.

So what should we do? We can use a hybrid model. It combines collaborative and content-based methods and can make robust recommendations by analysing professionals' tags (if they have any) and the interests of similar professionals. This helps us address both the cold-start and the diversity problem.

A hybrid model is therefore a good fit for this problem. In the next section, we will see how to build a hybrid recommendation system for CareerVillage.

# LightFm hybrid recommender for CareerVillage
A hybrid recommender uses both collaborative and content-based filtering to make recommendations, which makes it a very useful method for building a recommendation system. There are several techniques for building a hybrid recommender. In this project, I use the **LightFM hybrid model**, a hybrid matrix factorization model by Lyst.

1. **What is LightFM?**: LightFM is a hybrid matrix factorisation model representing users and items as linear combinations of their content features’ latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant.

    In LightFM, like in a collaborative filtering model, users and items are represented as latent vectors (embeddings). However, just as in a CB model, these are entirely defined by functions (in this case, linear combinations) of embeddings of the content features that describe each product or user. 

    For example, if the movie ‘Wizard of Oz’ is described by the following features: ‘musical fantasy’, ‘Judy Garland’, and ‘Wizard of Oz’, then its latent representation will be given by the sum of these features’ latent representations. In doing so, LightFM unites the advantages of content-based and collaborative recommenders. [6]


2. **How LightFM works**: The [LightFM paper](https://arxiv.org/pdf/1507.08439.pdf) describes clearly how LightFM works. Put simply, the LightFM model learns embeddings (latent representations in a high-dimensional space) for users and items in a way that encodes user preferences over items. When multiplied together, these representations produce scores for every item for a given user; items scored highly are more likely to be interesting to the user [5].

    The user and item representations are expressed in terms of representations of their features: an embedding is estimated for every feature, and these features are then summed together to arrive at representations for users and items [5].
    
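    The latent representation of user u is given by the sum of its features’ latent vectors:  \\(q_{u}= \sum_{j\in f_{u}} e_{j}^{U}\\), where \\(f_{u}\\) is the set of features of user u and \\(e_{j}^{U}\\) is the embedding of user feature j.
    
    And the same for items:  \\(p_{i}= \sum_{j\in f_{i}} e_{j}^{I}\\), where \\(f_{i}\\) is the set of features of item i and \\(e_{j}^{I}\\) is the embedding of item feature j.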
    
    The model’s prediction for user u and item i is then given by the dot product of user and item representations, adjusted by user and item feature biases:
    \\(\hat{r_{ui}}= f\left ( q_{u}\cdot p_{i}+b_{u}+b_{i}  \right )\\)
    
    This is just a general idea of the model. Please read the LightFM paper for more in-depth knowledge.
    
    
    
    
    

3. **Why LightFM**: 
    * In both cold-start and low-density scenarios, LightFM performs at least as well as pure content-based models, substantially outperforming them when either (1) collaborative information is available in the training set or (2) user features are included in the model. This is really useful for our CareerVillage recommendation system because we will have a lot of new questions and students, which is exactly the setting where the cold-start problem appears.

    * When collaborative data is abundant (warm-start, dense user-item matrix), LightFM performs at least as well as the MF model.

    * Embeddings produced by LightFM encode important semantic information about features, and can be used for related recommendation tasks such as tag recommendations. This is also very important for our problem, because the embeddings are useful for finding similar tags, so that the model can recommend questions whose tags are similar to a professional's tags.
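The prediction rule described under "How LightFM works" can be sketched in NumPy as follows. This is only an illustration, not the library's implementation: the embeddings and biases are random placeholders rather than learned values, `f` is assumed to be the sigmoid function, and the names `user_feat_emb`, `item_feat_emb`, `user_feat_bias`, `item_feat_bias` and `predict` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_user_features, n_item_features, dim = 5, 6, 3

# One embedding vector and one scalar bias per feature
# (random placeholders; a real model would learn these).
user_feat_emb = rng.normal(size=(n_user_features, dim))
item_feat_emb = rng.normal(size=(n_item_features, dim))
user_feat_bias = rng.normal(size=n_user_features)
item_feat_bias = rng.normal(size=n_item_features)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict(user_features, item_features):
    """Score one (user, item) pair from the indices of their active features."""
    q_u = user_feat_emb[user_features].sum(axis=0)  # user representation
    p_i = item_feat_emb[item_features].sum(axis=0)  # item representation
    b_u = user_feat_bias[user_features].sum()       # user bias
    b_i = item_feat_bias[item_features].sum()       # item bias
    return sigmoid(q_u @ p_i + b_u + b_i)


# A user described by features 0 and 2, an item described by features 1, 3 and 4.
score = predict([0, 2], [1, 3, 4])
print(score)
```

Because a user is just the sum of its feature embeddings, a new professional with tags but no answers still gets a meaningful representation, which is why this formulation helps with cold start.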



# LightFM Python Library 
Fortunately, there is a library that makes it easy to build a LightFM model. The LightFM model was developed by Lyst, who also created a Python library for it. It is very popular on GitHub, with 2400+ stars and 226 closed issues. Because it is well maintained by Lyst (a London-based e-commerce company) and its community, the LightFM Python library is a really good basis for building a LightFM model.

**Benefit of the LightFM Python library**: We could obviously write our own implementation of the LightFM model, but that would be reinventing the wheel. The LightFM library is well maintained and is used in production by well-known brands (Lyst, Sketchfab).

The biggest benefit of the LightFM library is that it implements **WARP (Weighted Approximate-Rank Pairwise) loss** for implicit feedback learning-to-rank. Wait! What is that?

To train a matrix factorization model, we can use different optimization methods, e.g. ALS or SGD, and different loss functions. WARP (Weighted Approximate-Rank Pairwise) is a loss designed for ranking. According to the documentation, WARP works like this:
1. For a given (user, positive item pair), sample a negative item at random from all the remaining items. Compute predictions for both items; if the negative item’s prediction exceeds that of the positive item plus a margin, perform a gradient update to rank the positive item higher and the negative item lower. If there is no rank violation, continue sampling negative items until a violation is found.

2. If you found a violating negative example at the first try, make a large gradient update: this indicates that a lot of negative items are ranked higher than positives items given the current state of the model, and the model must be updated by a large amount. If it took a lot of sampling to find a violating example, perform a small update: the model is likely close to the optimum and should be updated at a low rate.

It performs very well for implicit feedback models. Remember that our data is implicit feedback! For that reason, WARP loss is well suited to the CareerVillage recommender system. The fact that the LightFM library implements this algorithm is what makes it stand out.
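The two WARP steps above can be sketched in plain Python. This is a simplified illustration, not LightFM's code: the function name `warp_step` and its parameters are hypothetical, a violation is assumed to mean that the negative item's score comes within `margin` of the positive item's score, and the update weight is assumed to be the logarithm of an approximate rank, estimated from how many samples were needed to find the violation.

```python
import math
import random


def warp_step(scores, positive_item, margin=1.0, max_sampled=10, rng=None):
    """Sample negatives until one violates the ranking; return the update weight.

    Returns (negative_item, n_sampled, weight). A violation found early implies
    a large estimated rank error and therefore a large weight.
    """
    rng = rng or random.Random(0)
    n_items = len(scores)
    negatives = [i for i in range(n_items) if i != positive_item]
    for n_sampled in range(1, max_sampled + 1):
        negative_item = rng.choice(negatives)
        if scores[negative_item] > scores[positive_item] - margin:
            # Few samples needed -> many negatives probably outrank the positive.
            rank_estimate = (n_items - 1) // n_sampled
            weight = math.log(max(1.0, rank_estimate))
            return negative_item, n_sampled, weight
    # No violation found within the budget: skip the update.
    return None, max_sampled, 0.0


# Badly ranked positive: every negative scores higher, so the first sample violates.
_, n_bad, weight_bad = warp_step([0.0, 5.0, 5.0, 5.0, 5.0], positive_item=0)

# Well ranked positive: no negative comes within the margin, so no update.
neg_good, n_good, weight_good = warp_step([10.0, 0.0, 0.0, 0.0, 0.0], positive_item=0)
print(weight_bad, weight_good)
```

In a full implementation, the returned weight would scale the gradient step that pushes the positive item's score up and the violating negative's score down.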



There are other important benefits as well:
1. LightFM is written in Cython and is parallelized via HOGWILD SGD, which is likely to be much faster than a plain Python implementation we would write ourselves.
2. It is already battle-tested by many developers.
3. It is already used in production by well-known brands.
4. Its API is developer friendly, which makes building a model easy.
5. It provides evaluation metrics for measuring the performance of the model.
6. Finally, it is very fast.

Building a LightFM model from scratch while keeping all of the features above would be difficult and time consuming. For that reason, I am using the LightFM Python library to build the model.



**How the LightFM Python library works**: This library makes it easy to build a LightFM model. There are a few steps for building a model with it:
1. Process our data and make a LightFM dataset using its API
2. Build the interaction matrix and the user/item features
3. Create and train the model
4. Evaluate the model
5. Make predictions


**Want to learn more about the LightFM library?**: If you want to dive deeper into how to use this library, please visit its [official page](https://lyst.github.io/lightfm/docs/index.html), where you can find usage guides.

Now we know which method and library to use for building our recommender model. Let's see why this solution is effective.

# Effectiveness of this solution
Now I am going to describe the effectiveness of my solution to the CareerVillage recommendation problem.
1. It addresses the major recommendation system problems for CareerVillage:
    * **Cold start**: LightFM handles the cold-start problem well because of its hybrid nature. It uses item and user features for recommendation, so when no interactions exist for a new user (the cold-start scenario), it can fall back on content features, and it uses collaborative information where that is available.
    * **Question ranking**: The goal of emailing questions to professionals is to get students answers quickly. So we have to make sure that underserved questions rank higher. For example, if a question already has 10 answers, the student most probably doesn't need another one, but a question with fewer than 3 answers needs one much more urgently.
    
    That's why I weight the model so that it gives less importance to questions that already have many answers. The weight is calculated as ```1/num_of_ans_per_ques``` and is passed along with the interaction matrix. This ensures that questions with fewer answers get a boost in recommendations.
    * **Similar tags**: Professionals and questions can have multiple tags, and tags often differ from each other by only a few letters. My model takes care of that. LightFM creates user and item feature embeddings that can capture the similarity between tags, which makes the model more effective. Embeddings produced by LightFM encode important semantic information about features (tags).
    
    This gives the model an extra edge, because there are a lot of tags and students often add tags to questions that are slightly different from professionals' tags. My model can pick up on these and find the similarities between them.
    * **Scope for adding new features**: I have provided classes and functions for building the dataset and the model. They are designed so that new features can be added easily, which lets CareerVillage test new features.
    * **Fast**: Because my model is built with the LightFM library and WARP loss, training and prediction are fast.

2. **Ready for deploying in production**:
    * I have provided step-by-step guidelines, functions and examples for deploying this model into production.
    * I built reusable functions that will be useful for production and for adding additional features.
    * All code is well formatted and documented. I provided comments on every method, function and class so that anybody can understand the code and implement it easily.
3. **Well documented**: 
    * Everything is documented, from what a recommender system is, to the model explanation, to the production steps.
    * Comments are written for most of the code to make it easy to understand.
    * I provide flowcharts to make the model structure and this notebook easier to understand.
    
Now we are ready to build our model. Without further ado, let's start building.

# Kernel Overview
This kernel is divided into three major parts.
1. **Solution overview**: This is the section you have just read. It contains information about recommendation systems, the different kinds of models, why we chose LightFM, and most importantly the effectiveness of this solution.

2. **Model Building**: In this part, we build our model step by step and document each step, including why we did certain things. This will help us understand the model and its implementation better. Here is a picture of how we will complete these steps:
![](https://i.imgur.com/UyaNRZg.png)

3. **Model in production**: This is a very important step. Here we again build the model and make recommendations, but we wrap each individual step from part 2 in a class and build a pipeline that makes it easy to put this model into production.

    You may ask why we don't use these functions and classes in part 2. In part 2, we build the model step by step, which makes it easier to understand. In this part, we prepare the model for production. The pipeline is the same as described in part 2. Here is a simple flowchart of how this model can be put into production:
    ![](https://i.imgur.com/7fG6gvc.png)
    
    This is not a definitive blueprint for how this model will be put into production, but it gives some idea of how it can be implemented. So without further ado, let's move on to part 2 (Model building).
    

# Gathering Data 
CareerVillage provides a very rich set of datasets for this competition. They contain information about students, professionals, questions, answers, comments and, especially, tags. The competition already has a lot of great EDA notebooks on this data. Feel free to look at those; they will give you a good picture of the dataset.

In [None]:
################################################
# Importing necessary library
################################################
import numpy as np
import pandas as pd

# all lightfm imports 
from lightfm.data import Dataset
from lightfm import LightFM
from lightfm import cross_validation
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score

# imports re for text cleaning 
import re
from datetime import datetime, timedelta

# we will ignore pandas warning 
import warnings
warnings.filterwarnings('ignore')


In [None]:
############################################
# Read all our datasets and store them in pandas dataframe objects. 
############################################
base_path = '../input/'
df_answer_scores = pd.read_csv(
    base_path + 'answer_scores.csv')

df_answers = pd.read_csv(
    base_path + 'answers.csv',
    parse_dates=['answers_date_added'])

df_comments = pd.read_csv(
    base_path + 'comments.csv')

df_emails = pd.read_csv(
    base_path + 'emails.csv')

df_group_memberships = pd.read_csv(
    base_path + 'group_memberships.csv')

df_groups = pd.read_csv(
    base_path + 'groups.csv')

df_matches = pd.read_csv(
    base_path + 'matches.csv')

df_professionals = pd.read_csv(
    base_path + 'professionals.csv',
    parse_dates=['professionals_date_joined'])

df_question_scores = pd.read_csv(
    base_path + 'question_scores.csv')

df_questions = pd.read_csv(
    base_path + 'questions.csv',
    parse_dates=['questions_date_added'])

df_school_memberships = pd.read_csv(
    base_path + 'school_memberships.csv')

df_students = pd.read_csv(
    base_path + 'students.csv',
    parse_dates=['students_date_joined'])

df_tag_questions = pd.read_csv(
    base_path + 'tag_questions.csv')

df_tag_users = pd.read_csv(
    base_path + 'tag_users.csv')

df_tags = pd.read_csv(
    base_path + 'tags.csv')


# Defining our necessary functions 
For easy and quick production use, I have built functions for every major part of data pre-processing and model building. In this step, I define all those functions in one place instead of spreading them across the notebook. Each function is documented so that it is easy to understand.

In [None]:
def generate_int_id(dataframe, id_col_name):
    """
    Generate unique integer id for users, questions and answers

    Parameters
    ----------
    dataframe: Dataframe
        Pandas Dataframe for Users or Q&A. 
    id_col_name : String 
        New integer id's column name.
        
    Returns
    -------
    Dataframe
        Updated dataframe containing new id column 
    """
    new_dataframe=dataframe.assign(
        int_id_col_name=np.arange(len(dataframe))
        ).reset_index(drop=True)
    return new_dataframe.rename(columns={'int_id_col_name': id_col_name})



def create_features(dataframe, features_name, id_col_name):
    """
    Generate features that will be ready for feeding into lightfm

    Parameters
    ----------
    dataframe: Dataframe
        Pandas Dataframe which contains features
    features_name : List
        List of feature column names available in dataframe
    id_col_name: String
        Column name which contains id of the question or
        answer that the features will map to.
        There are two possible values for this variable.
        1. questions_id_num
        2. professionals_id_num

    Returns
    -------
    List of tuples
        A list containing processed features
        that are ready to feed into lightfm.
        The format of each value
        will be (user_id, ['feature_1', 'feature_2', 'feature_3'])
        Ex. -> (1, ['military', 'army', '5'])
    """

    features = dataframe[features_name].apply(
        lambda x: ','.join(x.map(str)), axis=1)
    features = features.str.split(',')
    features = list(zip(dataframe[id_col_name], features))
    return features



def generate_feature_list(dataframe, features_name):
    """
    Generate features list for mapping 

    Parameters
    ----------
    dataframe: Dataframe
        Pandas Dataframe for Users or Q&A. 
    features_name : List
        List of feature column names available in dataframe. 
        
    Returns
    -------
    List of all features for mapping 
    """
    features = dataframe[features_name].apply(
        lambda x: ','.join(x.map(str)), axis=1)
    features = features.str.split(',')
    features = features.apply(pd.Series).stack().reset_index(drop=True)
    return features


def calculate_auc_score(lightfm_model, interactions_matrix, 
                        question_features, professional_features): 
    """
    Measure the ROC AUC metric for a model. 
    A perfect score is 1.0.

    Parameters
    ----------
    lightfm_model: LightFM model 
        A fitted lightfm model 
    interactions_matrix : 
        A lightfm interactions matrix 
    question_features, professional_features: 
        Lightfm features 
        
    Returns
    -------
    Float: the mean AUC score across users
    """
    score = auc_score( 
        lightfm_model, interactions_matrix, 
        item_features=question_features, 
        user_features=professional_features, 
        num_threads=4).mean()
    return score
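To show the output format of `create_features`, here is a small usage sketch on a made-up dataframe. The function body is repeated from the cell above only so that the example runs on its own; the toy rows and the chosen column list are assumptions for illustration.

```python
import pandas as pd


def create_features(dataframe, features_name, id_col_name):
    # Same logic as the function defined above, repeated so this example runs alone.
    features = dataframe[features_name].apply(
        lambda x: ','.join(x.map(str)), axis=1)
    features = features.str.split(',')
    return list(zip(dataframe[id_col_name], features))


# Hypothetical professionals with tags and a location.
toy_professionals = pd.DataFrame({
    "professionals_id_num": [0, 1],
    "professional_all_tags": ["military,army", "nursing"],
    "professionals_location": ["Boston, Massachusetts", "No Location"],
})

toy_features = create_features(
    toy_professionals,
    ["professional_all_tags", "professionals_location"],
    "professionals_id_num")

for row in toy_features:
    print(row)
```

Each element is an `(id, [features])` pair. Note that because the columns are joined and then split on commas, a value that itself contains a comma (such as a "city, state" location) becomes several separate features, and the part after the comma keeps its leading space.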

# Data Preprocessing and feature creation 
Data preprocessing is essential for every data science project. We need to clean and modify our data for our own use case. Feature creation is also very important, because it lets our model make good and diverse predictions.

**Generate numeric identifier**:
The LightFM model works with numeric ids, but the data CareerVillage provides uses UUIDs to identify professionals, students, questions and answers. In this step, I create a unique integer identifier for each professional, student, question and answer.

In [None]:
# generating unique integer id for users and q&a
df_professionals = generate_int_id(df_professionals, 'professionals_id_num')
df_students = generate_int_id(df_students, 'students_id_num')
df_questions = generate_int_id(df_questions, 'questions_id_num')
df_answers = generate_int_id(df_answers, 'answers_id_num')

In [None]:
#  df_answers.groupby(['answers_author_id'], sort=False).ngroup()

**Merging Datasets**: This is one of the most important steps of our solution. Our professionals, students, Q&A and tags are stored in separate datasets. For the model, we have to merge these datasets carefully so that they are useful.

1. All tags (Q&A) are stored in a separate dataset. So first we merge those tags with the questions and professionals datasets. 
2. Then we merge answers with questions, because one question can have multiple answers. 

In [None]:
###########################
# merging dataset
###########################

# just dropna from tags 
df_tags = df_tags.dropna()
df_tags['tags_tag_name'] = df_tags['tags_tag_name'].str.replace('#', '')


# merge tag_questions with tags name
# then group all tags for each question into single rows
df_tags_question = df_tag_questions.merge(
    df_tags, how='inner',
    left_on='tag_questions_tag_id', right_on='tags_tag_id')
df_tags_question = df_tags_question.groupby(
    ['tag_questions_question_id'])['tags_tag_name'].apply(
        ','.join).reset_index()
df_tags_question = df_tags_question.rename(columns={'tags_tag_name': 'questions_tag_name'})

# merge tag_users with tags name 
# then group all tags for each user into single rows 
# after that rename the tag column name 
df_tags_pro = df_tag_users.merge(
    df_tags, how='inner',
    left_on='tag_users_tag_id', right_on='tags_tag_id')
df_tags_pro = df_tags_pro.groupby(
    ['tag_users_user_id'])['tags_tag_name'].apply(
        ','.join).reset_index()
df_tags_pro = df_tags_pro.rename(columns={'tags_tag_name': 'professionals_tag_name'})


# merge professionals and questions tags with main merge_dataset 
df_questions = df_questions.merge(
    df_tags_question, how='left',
    left_on='questions_id', right_on='tag_questions_question_id')
df_professionals = df_professionals.merge(
    df_tags_pro, how='left',
    left_on='professionals_id', right_on='tag_users_user_id')

# merge questions with scores 
df_questions = df_questions.merge(
    df_question_scores, how='left',
    left_on='questions_id', right_on='id')
# merge questions with students 
df_questions = df_questions.merge(
    df_students, how='left',
    left_on='questions_author_id', right_on='students_id')



# merge answers with questions 
# then merge professionals and questions score with that 
df_merge = df_answers.merge(
    df_questions, how='inner',
    left_on='answers_question_id', right_on='questions_id')
df_merge = df_merge.merge(
    df_professionals, how='inner',
    left_on='answers_author_id', right_on='professionals_id')
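# note: df_questions already carries the question score columns from the
# merge above, so this inner merge mainly filters df_merge down to
# answers whose question has an entry in df_question_scores
df_merge = df_merge.merge(
    df_question_scores, how='inner',
    left_on='questions_id', right_on='id')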
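The tag-merging pattern used in the cell above (merge the link table with the tag names, then collapse all tags of one question into a single comma-separated row) can be seen on a tiny example. The toy tables below are made up; only the column names follow the competition data.

```python
import pandas as pd

# Hypothetical miniature versions of tags.csv and tag_questions.csv.
toy_tags = pd.DataFrame({
    "tags_tag_id": [1, 2, 3],
    "tags_tag_name": ["college", "#engineering", "nursing"],
})
toy_tag_questions = pd.DataFrame({
    "tag_questions_tag_id": [1, 2, 3],
    "tag_questions_question_id": ["q1", "q1", "q2"],
})

# Strip the leading '#' that some tag names carry.
toy_tags["tags_tag_name"] = toy_tags["tags_tag_name"].str.replace(
    "#", "", regex=False)

# Attach tag names to each (tag, question) link.
toy_merged = toy_tag_questions.merge(
    toy_tags, how="inner",
    left_on="tag_questions_tag_id", right_on="tags_tag_id")

# One row per question, with all of its tags joined by commas.
toy_tags_per_question = (
    toy_merged.groupby("tag_questions_question_id")["tags_tag_name"]
    .apply(",".join)
    .reset_index()
    .rename(columns={"tags_tag_name": "questions_tag_name"}))

print(toy_tags_per_question)
```

Questions without any tag do not appear in this result, which is why the later merge into `df_questions` uses `how='left'` and leaves missing values for them.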

**Generate some features**: In this step, we are going to generate some features: ```number of answers by professionals```, ```num of answers in each question```, ```num of tags per professionals``` and ```number of tags per question```. I will not use all of these features in this model, but I will use ```number of answers per question``` to weight our model so that it pays less attention to questions that already have many answers.

In [None]:
#######################
# Generate some features for calculates weights
# that will use with interaction matrix 
#######################

df_merge['num_of_ans_by_professional'] = df_merge.groupby(['answers_author_id'])['questions_id'].transform('count')
df_merge['num_ans_per_ques'] = df_merge.groupby(['questions_id'])['answers_id'].transform('count')
df_merge['num_tags_professional'] = df_merge['professionals_tag_name'].str.split(",").str.len()
df_merge['num_tags_question'] = df_merge['questions_tag_name'].str.split(",").str.len()



In [None]:
print("Maximum number of answer per question : " + str(df_merge['num_ans_per_ques'].max()))
print("Maximum number of tags per professional : " + str(df_merge['num_tags_professional'].max()))
print("Maximum number of tags per question : " + str(df_merge['num_tags_question'].max()))
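As a small illustration of how the answer count can turn into an interaction weight, the sketch below applies the `1/num_of_ans_per_ques` idea from the overview to a made-up answers table. The toy data and the `weight` column name are assumptions for this example.

```python
import pandas as pd

# Hypothetical answers: q1 has four answers, q2 has only one.
toy_answers = pd.DataFrame({
    "questions_id": ["q1", "q1", "q1", "q1", "q2"],
    "answers_id": ["a1", "a2", "a3", "a4", "a5"],
})

# Count answers per question and broadcast the count back to every row.
toy_answers["num_ans_per_ques"] = toy_answers.groupby(
    "questions_id")["answers_id"].transform("count")

# Questions with fewer answers receive a larger weight.
toy_answers["weight"] = 1.0 / toy_answers["num_ans_per_ques"]

print(toy_answers)
```

Here each interaction with `q1` gets a weight of 0.25, while the single interaction with `q2` gets a weight of 1.0, so the under-answered question counts for more during training.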

**Merge answered questions' tags with professionals' tags**: Professionals can follow tags, but not all professionals do. More importantly, we see from EDA that professionals sometimes answer questions that are not related to the tags they follow. For that reason, I merge the tags of the questions each professional has answered with that professional's own tags. This makes our model more robust and context aware.

In [None]:
########################
# Merge professionals previous answered 
# questions tags into professionals tags 
########################

# select professionals answered questions tags 
# and stored as a dataframe
professionals_prev_ans_tags = df_merge[['professionals_id', 'questions_tag_name']]
# drop null values from that 
professionals_prev_ans_tags = professionals_prev_ans_tags.dropna()
# because professionals answer multiple questions, 
# we group all tags of each user into a single row 
professionals_prev_ans_tags = professionals_prev_ans_tags.groupby(
    ['professionals_id'])['questions_tag_name'].apply(
        ','.join).reset_index()

# drop duplicates tags from each professionals rows
professionals_prev_ans_tags['questions_tag_name'] = (
    professionals_prev_ans_tags['questions_tag_name'].str.split(',').apply(set).str.join(','))

# finally merge the dataframe with professionals dataframe 
df_professionals = df_professionals.merge(professionals_prev_ans_tags, how='left', on='professionals_id')

# join professionals tags and their answered tags 
# we replace nan values with ""
df_professionals['professional_all_tags'] = (
    df_professionals[['professionals_tag_name', 'questions_tag_name']].apply(
        lambda x: ','.join(x.dropna()),
        axis=1))


**Handling null and duplicate values**: Now we want to clean our data a little. Null and duplicate values can cause errors and wrong predictions, so we remove the duplicates and replace null values with a generic name or value.

In [None]:
# handling null values 
df_questions['score'] = df_questions['score'].fillna(0)
df_questions['score'] = df_questions['score'].astype(int)
df_questions['questions_tag_name'] = df_questions['questions_tag_name'].fillna('No Tag')
# remove duplicates tags from each questions 
df_questions['questions_tag_name'] = df_questions['questions_tag_name'].str.split(',').apply(set).str.join(',')


# fill nan with 'No Tag' if any 
df_professionals['professional_all_tags'] = df_professionals['professional_all_tags'].fillna('No Tag')
# replace "" with "No Tag", because previously we replace nan with ""
df_professionals['professional_all_tags'] = df_professionals['professional_all_tags'].replace('', 'No Tag')
df_professionals['professionals_location'] = df_professionals['professionals_location'].fillna('No Location')
df_professionals['professionals_industry'] = df_professionals['professionals_industry'].fillna('No Industry')

# remove duplicates tags from each professionals 
df_professionals['professional_all_tags'] = df_professionals['professional_all_tags'].str.split(',').apply(set).str.join(',')



# remove some null values from df_merge
df_merge['num_ans_per_ques']  = df_merge['num_ans_per_ques'].fillna(0)
df_merge['num_tags_professional'] = df_merge['num_tags_professional'].fillna(0)
df_merge['num_tags_question'] = df_merge['num_tags_question'].fillna(0)

# Building model in LightFM
In this step, we are going to build our LightFM model using the lightfm Python library. First, we have to create a LightFM ```Dataset``` for our model. The LightFM Dataset class makes it easy to create the ```interaction matrix```, ```weights``` and ```user/item features```.
* ```interaction matrix```: a matrix that contains the user/item interactions, here the professional/question interactions. 
* ```weights```: the weight of each entry of the interaction matrix. A lower weight gives less importance to that interaction. 
* ```user/item features```: user/item features, supplied like this: ```(user_id, ['feature_1', 'feature_2', 'feature_3'])```
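
To make the three inputs concrete, here is a small sketch in plain Python. The ids, tags and weights are invented, and the dictionaries only stand in for the sparse matrices that LightFM builds.

```python
# Hypothetical toy rows: (professional_id, question_id, weight)
answers = [
    (0, 10, 1.0),
    (0, 11, 0.5),
    (1, 11, 0.5),
]

# Interaction matrix entries: which (professional, question) pairs exist.
interaction_pairs = {(pro, question) for pro, question, _ in answers}

# Weights: one value per interaction, at the same coordinates.
interaction_weights = {(pro, question): w for pro, question, w in answers}

# User/item features: (id, [feature names]) tuples built from tag strings.
professional_tags = {0: "engineering,college", 1: "nursing"}
professional_features = [
    (pro_id, tags.split(",")) for pro_id, tags in professional_tags.items()
]
print(professional_features)
# [(0, ['engineering', 'college']), (1, ['nursing'])]
```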

If you want to know how the lightfm library's Dataset class works and how to use it, please see [Building LightFM Datasets](https://lyst.github.io/lightfm/docs/examples/dataset.html). 

After that, we will build our LightFM model using the LightFM class, which makes it easy to define the model. Then we will train the model on our data. 

**Creating features list for Dataset class**: The LightFM library has a Dataset class that makes it easy to build the inputs the model needs. But we have to feed it the set of all unique professional/question ids and the list of all question and professional features. It uses these to create the internal mapping that LightFM relies on. 
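
Roughly speaking, such a mapping assigns an integer index to every distinct feature name. The sketch below shows the idea in plain Python. The function `build_feature_vocabulary` is hypothetical and is not how LightFM implements it internally.

```python
def build_feature_vocabulary(tag_strings):
    """Collect every distinct tag and give it a stable integer index."""
    vocabulary = {}
    for tag_string in tag_strings:
        for tag in tag_string.split(","):
            # setdefault keeps the first index assigned to a tag
            vocabulary.setdefault(tag, len(vocabulary))
    return vocabulary


# Toy tag strings in the same comma-separated format as the notebook
question_tags = ["college,engineering", "engineering,math", "No Tag"]
vocab = build_feature_vocabulary(question_tags)
print(vocab)
# {'college': 0, 'engineering': 1, 'math': 2, 'No Tag': 3}
```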

In [None]:
# generating features list for mapping 
question_feature_list = generate_feature_list(
    df_questions,
    ['questions_tag_name'])

professional_feature_list = generate_feature_list(
    df_professionals,
    ['professional_all_tags'])

In [None]:
# calculate our weight value 
df_merge['total_weights'] = 1 / (
    df_merge['num_ans_per_ques'])


# creating features for feeding into lightfm 
df_questions['question_features'] = create_features(
    df_questions, ['questions_tag_name'], 
    'questions_id_num')

df_professionals['professional_features'] = create_features(
    df_professionals,
    ['professional_all_tags'],
    'professionals_id_num')

**LightFM Dataset**: In this step we are going to build the LightFM dataset. Then we will build our interaction matrix, weights and professional/question features. 
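
Before looking at the real cell, here is a toy version of the weighting rule used in this notebook, where each interaction gets the weight 1 / (number of answers the question received). The answer pairs are made up, and `Counter` stands in for the pandas `groupby` used on the real data.

```python
from collections import Counter

# Toy answers (made up): (professionals_id_num, questions_id_num)
answers = [(0, 100), (1, 100), (2, 100), (3, 100), (0, 200)]

# Number of answers each question received
num_ans_per_ques = Counter(question for _, question in answers)

# Same rule as the notebook: weight = 1 / number of answers to the question
triples = [
    (pro, question, 1 / num_ans_per_ques[question])
    for pro, question in answers
]

for triple in triples:
    print(triple)
```

Question 100 has four answers, so each of its interactions gets weight 0.25, while the single answer to question 200 keeps weight 1.0.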

In [None]:
########################
# Dataset building for lightfm
########################

# define our dataset variable
# then we feed unique professionals and questions ids
# and item and professional feature list
# this will create lightfm internal mapping
dataset = Dataset()
dataset.fit(
    set(df_professionals['professionals_id_num']), 
    set(df_questions['questions_id_num']),
    item_features=question_feature_list, 
    user_features=professional_feature_list)


# now we are building the interactions matrix between professionals and questions
# we pass each professional id, question id and weight as a tuple
# e.g -> [(pro_id, question_id, weight), (pro_id, question_id, weight)]
# then we use lightfm's built-in method for building the interactions matrix
df_merge['author_question_id_tuple'] = list(zip(
    df_merge.professionals_id_num, df_merge.questions_id_num, df_merge.total_weights))

interactions, weights = dataset.build_interactions(
    df_merge['author_question_id_tuple'])



# now we are building our questions and professionals features
# in a way that lightfm understands.
# we are using lightfm's built-in methods for building
# questions and professionals features 
questions_features = dataset.build_item_features(
    df_questions['question_features'])

professional_features = dataset.build_user_features(
    df_professionals['professional_features'])

**Model building and training**: In this step, I am going to build the LightFM model and then train it. If you want to learn how to create a LightFM model using this library, please read this post: [recommender for the Movielens dataset](https://lyst.github.io/lightfm/docs/examples/movielens_implicit.html). 

In [None]:
################################
# Model building part
################################

# define lightfm model by specifying hyper-parameters
# then fit the model with interactions matrix, item and user features 
model = LightFM(
    no_components=150,
    learning_rate=0.05,
    loss='warp',
    random_state=2019)

model.fit(
    interactions,
    item_features=questions_features,
    user_features=professional_features, sample_weight=weights,
    epochs=5, num_threads=4, verbose=True)


# Evaluating the performance of the model 
Now we have to evaluate our model to see its performance. No matter how good your model is, if you can't evaluate it correctly you can't improve it or trust it. Recommendation problems do not have many good evaluation metrics, but luckily lightfm provides a rich set of them. In this step, we will calculate the AUC score for our model.

**What is AUC score in lightfm library?**: It measures the ROC AUC metric for a model: the probability that a randomly chosen positive example has a higher score than a randomly chosen negative example. A perfect score is 1.0. 
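
That definition can be checked by hand on a few made-up scores. The sketch below counts correctly ordered (positive, negative) pairs for a single professional. `pairwise_auc` is a hypothetical helper written for this illustration, not a lightfm function.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores for one professional
answered = [0.9, 0.4]           # questions the professional answered
not_answered = [0.7, 0.3, 0.1]  # questions the professional did not answer
auc = pairwise_auc(answered, not_answered)
print(auc)  # 5 of the 6 pairs are ordered correctly
```

Here only the pair (0.4, 0.7) is ordered wrongly, so the score is 5/6.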

Let's see what our model scores. 

In [None]:
calculate_auc_score(model, interactions, questions_features, professional_features)

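Wow! That is really impressive. Our AUC is over 90 percent, which suggests that the overall quality of the model is very good. Keep in mind that this score is computed on the same interactions the model was trained on, so it is an optimistic estimate.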
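
A held-out split gives a stricter check than scoring the training interactions, as done above. The sketch below shows the idea of such a split in plain Python on made-up triples. `split_interactions` is a hypothetical helper. With real LightFM matrices you would use `lightfm.cross_validation.random_train_test_split` instead, which the `TrainLightFM` class later in this notebook calls.

```python
import random


def split_interactions(triples, test_fraction=0.2, seed=2019):
    """Shuffle interaction triples and split them into train and test lists."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up (professional, question, weight) triples
triples = [(pro, question, 1.0) for pro in range(5) for question in range(4)]
train, test = split_interactions(triples)
print(len(train), len(test))  # 16 4
```

The model would then be fitted on the train part only and scored on the test part.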

**Make real recommendations**: We have seen how our model performs by looking at the AUC score. Now let's look at some real examples of recommendations. 

In [None]:
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

def recommend_questions(professional_ids):
     
    for professional in professional_ids:
        # print their previous answered question title
        previous_q_id_num = df_merge.loc[df_merge['professionals_id_num'] == professional][:3]['questions_id_num']
        df_previous_questions = df_questions.loc[df_questions['questions_id_num'].isin(previous_q_id_num)]
        print('Professional Id (' + str(professional) + "): Previous Answered Questions")
        display_side_by_side(
            df_previous_questions[['questions_title', 'question_features']],
            df_professionals.loc[df_professionals.professionals_id_num == professional][['professionals_id_num','professionals_tag_name']])
        
        # predict
        discard_qu_id = df_previous_questions['questions_id_num'].values.tolist()
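        # .copy() so that adding the 'scores' column below does not write into a view of df_questions
        df_use_for_prediction = df_questions.loc[~df_questions['questions_id_num'].isin(discard_qu_id)].copy()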
        questions_id_for_predict = df_use_for_prediction['questions_id_num'].values.tolist()
        
        scores = model.predict(
            professional,
            questions_id_for_predict,
            item_features=questions_features,
            user_features=professional_features)
        
        df_use_for_prediction['scores'] = scores
        df_use_for_prediction = df_use_for_prediction.sort_values(by='scores', ascending=False)[:8]
        print('Professional Id (' + str(professional) + "): Recommended Questions: ")
        display(df_use_for_prediction[['questions_title', 'question_features']])
    

    

    

In [None]:
recommend_questions([1200 ,19897, 3])

**Analysis**: Awesome! Finally we can see the recommendations made by our model. Let's take some time to ponder over them. 
* The first professional (1200) has not answered any questions yet, but he/she follows some tags. Our model takes those tags as features and predicts questions that have similar tags. 
* The second professional (19897) answered one question that has the tag major. But in his profile he follows tags about creative work, like arts and illustrator. So our model recommends questions that have creative tags like arts and illustrator, because he follows more tags on creative work. 
* The third professional (3) answered questions that have the tags forensic, criminal, science, justice and detective. From these tags we can get an idea of the professional's interests, and our model learns that too. That's why it recommends questions that have tags like law, criminal and detective. 

This is just a simple exploration, but I hope it gives you an idea of the model's recommendations. The model can cope with the cold-start and high-popularity problems. It also recommends questions that have few answers, because of the weights that I provided during training. Now we have built our model and tested it. In the next section, we will look at how we can put this model into production. 
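
Why can a professional with no answers still get sensible recommendations? LightFM represents a user or an item as the sum of the embeddings of its features and scores a pair with a dot product (plus bias terms). The toy sketch below uses invented 2-dimensional tag embeddings and ignores the biases and the per-user identity features, so it only illustrates the mechanism.

```python
# Made-up 2-dimensional tag embeddings; a trained model learns these values.
tag_embeddings = {
    "law": [1.0, 0.0],
    "criminal": [0.8, 0.2],
    "art": [0.0, 1.0],
}


def representation(tags):
    """Sum the embeddings of an entity's tags."""
    dims = len(next(iter(tag_embeddings.values())))
    total = [0.0] * dims
    for tag in tags:
        for i, value in enumerate(tag_embeddings[tag]):
            total[i] += value
    return total


def score(user_tags, question_tags):
    """Dot product of the user and question representations."""
    user = representation(user_tags)
    question = representation(question_tags)
    return sum(u * q for u, q in zip(user, question))


# A professional with no answers yet, known only by the tags they follow
new_professional = ["law"]
print(score(new_professional, ["criminal"]))  # higher: related tags
print(score(new_professional, ["art"]))       # lower: unrelated tags
```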

# Model in Production
We previously saw how the LightFM model works and how it is built for this project. Now we are going to build a pipeline that will help us put this model into production. We are going to build a class for each of the steps discussed in step 2. We are also going to build some additional functions and methods that add functionality to the model. 

Here is the picture of our pipeline: 
![](https://i.imgur.com/Kh4rVcL.png)


We will now build a class for each of these steps. Without further ado, let's begin. 

In [None]:
############################################
# Read all our datasets again 
# and store them in pandas dataframe objects. 
############################################
base_path = '../input/'
df_answer_scores = pd.read_csv(
    base_path + 'answer_scores.csv')

df_answers = pd.read_csv(
    base_path + 'answers.csv',
    parse_dates=['answers_date_added'])

df_comments = pd.read_csv(
    base_path + 'comments.csv')

df_emails = pd.read_csv(
    base_path + 'emails.csv')

df_group_memberships = pd.read_csv(
    base_path + 'group_memberships.csv')

df_groups = pd.read_csv(
    base_path + 'groups.csv')

df_matches = pd.read_csv(
    base_path + 'matches.csv')

df_professionals = pd.read_csv(
    base_path + 'professionals.csv',
    parse_dates=['professionals_date_joined'])

df_question_scores = pd.read_csv(
    base_path + 'question_scores.csv')

df_questions = pd.read_csv(
    base_path + 'questions.csv',
    parse_dates=['questions_date_added'])

df_school_memberships = pd.read_csv(
    base_path + 'school_memberships.csv')

df_students = pd.read_csv(
    base_path + 'students.csv',
    parse_dates=['students_date_joined'])

df_tag_questions = pd.read_csv(
    base_path + 'tag_questions.csv')

df_tag_users = pd.read_csv(
    base_path + 'tag_users.csv')

df_tags = pd.read_csv(
    base_path + 'tags.csv')


**Data Processing Class**: Now we are going to build a class for data cleaning and processing, designed specifically for the CareerVillage datasets. I have provided detailed documentation and comments with each part of the code. This should make the code and its intention easier to understand. 

In [None]:
class CareerVillageDataPreparation:
    """
    Clean and process CareerVillage data. 
    
    This class processes data in a way that will be useful
    for building a lightFM dataset. 
    """
    
    def __init__(self):
        pass

    def _assign_unique_id(self, data, id_col_name):
        """
        Generate unique integer id for users, questions and answers

        Parameters
        ----------
        data: Dataframe
            Pandas Dataframe for Users or Q&A. 
        id_col_name : String 
            New integer id's column name.

        Returns
        -------
        Dataframe
            Updated dataframe containing new id column
        """
        new_dataframe=data.assign(
            int_id_col_name=np.arange(len(data))
            ).reset_index(drop=True)
        return new_dataframe.rename(columns={'int_id_col_name': id_col_name})

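    def _dropna(self, data, column, axis):
        """Drop null values from specific column"""
        # pass the column through `subset`; the first positional
        # argument of DataFrame.dropna is `axis`, not a column name
        return data.dropna(subset=[column], axis=axis)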

    def _merge_data(self, left_data, left_key, right_data, right_key, how):
        """
        This function is used for merging two dataframe.
        
        Parameters
        -----------
        left_data: Dataframe
            Left side dataframe for merge
        left_key: String
            Left Dataframe merge key
        right_data: Dataframe
            Right side dataframe for merge
        right_key: String
            Right Dataframe merge key
        how: String
            Method of merge (inner, left, right, outer)
            
        
        Returns
        --------
        Dataframe
            A new dataframe merging left and right dataframe
        """
        return left_data.merge(
            right_data,
            how=how,
            left_on=left_key,
            right_on=right_key)

    def _group_tags(self, data, group_by, tag_column):
        """Group multiple tags into a single row separated by commas"""
        return data.groupby(
            [group_by])[tag_column].apply(
            ','.join).reset_index()

    def _merge_cv_datasets(
        self,
        professionals,students,
        questions,answers,
        tags,tag_questions,tag_users, questions_score):
        """
        This function merges all the necessary 
        CareerVillage dataset in defined way. 
        
        Parameters
        ------------
        professionals,students,
        questions,answers,
        tags,tag_questions,
        tag_users,
        questions_score: Dataframe
            Pandas dataframe defined by its name
        
        
        Returns
        ---------
        questions, professionals: Dataframe
            Updated dataframe after merge
        merge: Dataframe
            A new dataframe after merging answers with questions
        """
        
        
        # merge tag_questions with tags name
        # then group all tags for each question into single rows
        tag_question = self._merge_data(
            left_data=tag_questions,
            left_key='tag_questions_tag_id',
            right_data=tags,
            right_key='tags_tag_id',
            how='inner')
        tag_question = self._group_tags(
            data=tag_question,
            group_by='tag_questions_question_id',
            tag_column='tags_tag_name')
        
        tag_question = tag_question.rename(
            columns={'tags_tag_name': 'questions_tag_name'})
        
        # merge tag_users with tags name
        # then group all tags for each user into single rows 
        # after that rename the tag column name
        tags_pro = self._merge_data(
            left_data=tag_users,
            left_key='tag_users_tag_id',
            right_data=tags,
            right_key='tags_tag_id',
            how='inner')
        tags_pro = self._group_tags(
            data=tags_pro,
            group_by='tag_users_user_id',
            tag_column='tags_tag_name')
        tags_pro = tags_pro.rename(
            columns={'tags_tag_name': 'professionals_tag_name'})
        
        # merge professionals and questions tags with main merge_dataset 
        questions = self._merge_data(
            left_data=questions,
            left_key='questions_id',
            right_data=tag_question,
            right_key='tag_questions_question_id',
            how='left')
        professionals = self._merge_data(
            left_data=professionals,
            left_key='professionals_id',
            right_data=tags_pro,
            right_key='tag_users_user_id',
            how='left')
        
        # merge questions with scores 
        questions = self._merge_data(
            left_data=questions,
            left_key='questions_id',
            right_data=questions_score,
            right_key='id',
            how='left')
        
        # merge questions with students
        questions = self._merge_data(
            left_data=questions,
            left_key='questions_author_id',
            right_data=students,
            right_key='students_id',
            how='left')
        
        # merge answers with questions
        # then merge professionals and questions score with that
        merge = self._merge_data(
            left_data=answers,
            left_key='answers_question_id',
            right_data=questions,
            right_key='questions_id',
            how='inner')
        
        merge = self._merge_data(
            left_data=merge,
            left_key='answers_author_id',
            right_data=professionals,
            right_key='professionals_id',
            how='inner')
        
        return questions, professionals, merge
  
    def _drop_duplicates_tags(self, data, col_name):
        # drop duplicates tags from each row
        return (
            data[col_name].str.split(
                ',').apply(set).str.join(','))


    def _merge_pro_pre_ans_tags(self, professionals, merge):
        ########################
        # Merge professionals previous answered
        # questions tags into professionals tags
        ########################
        
        # select professionals answered questions tags
        # and stored as a dataframe
        professionals_prev_ans_tags = (
            merge[['professionals_id', 'questions_tag_name']])
        # drop null values from that
        professionals_prev_ans_tags = professionals_prev_ans_tags.dropna()
        
        # because professionals answer multiple questions,
        # we group all the tags of each user into a single row
        professionals_prev_ans_tags = self._group_tags(
            data=professionals_prev_ans_tags,
            group_by='professionals_id',
            tag_column='questions_tag_name')
        
        # drop duplicates tags from each professionals rows
        professionals_prev_ans_tags['questions_tag_name'] = \
        self._drop_duplicates_tags(
            professionals_prev_ans_tags, 'questions_tag_name')
        
        # finally merge the dataframe with professionals dataframe
        professionals = self._merge_data(
            left_data=professionals,
            left_key='professionals_id',
            right_data=professionals_prev_ans_tags,
            right_key='professionals_id',
            how='left')
        
        # join professionals tags and their answered tags 
        # we replace nan values with ""
        professionals['professional_all_tags'] = (
            professionals[['professionals_tag_name',
                           'questions_tag_name']].apply(
                lambda x: ','.join(x.dropna()),
                axis=1))
        return professionals

    def prepare(
        self,
        professionals,students,
        questions,answers,
        tags,tag_questions,tag_users, questions_score):
        
        """
        This function cleans and processes 
        the CareerVillage data sets. 
        """
        
        # assign unique integer id
        professionals = self._assign_unique_id(
            professionals, 'professionals_id_num')
        students = self._assign_unique_id(
            students, 'students_id_num')
        questions = self._assign_unique_id(
            questions, 'questions_id_num')
        answers = self._assign_unique_id(
            answers, 'answers_id_num')
        
        # just dropna from tags 
        tags = tags.dropna()
        tags['tags_tag_name'] = tags['tags_tag_name'].str.replace(
            '#', '')
        
        
        # merge necessary datasets
        df_questions, df_professionals, df_merge = self._merge_cv_datasets(
            professionals,students,
            questions,answers,
            tags,tag_questions,tag_users,
            questions_score)
        
        #######################
        # Generate some features for calculates weights
        # that will use with interaction matrix
        #######################
        df_merge['num_ans_per_ques'] = df_merge.groupby(
            ['questions_id'])['answers_id'].transform('count')
        
        # merge pro previous answered question tags with pro tags 
        df_professionals = self._merge_pro_pre_ans_tags(
            df_professionals, df_merge)
        
        # some more pre-processing 
        # handling null values 
        df_questions['score'] = df_questions['score'].fillna(0)
        df_questions['score'] = df_questions['score'].astype(int)
        df_questions['questions_tag_name'] = \
        df_questions['questions_tag_name'].fillna('No Tag')
        
        # remove duplicates tags from each questions 
        df_questions['questions_tag_name'] = \
        df_questions['questions_tag_name'].str.split(
            ',').apply(set).str.join(',')

        # fill nan with 'No Tag' if any 
        df_professionals['professional_all_tags'] = \
        df_professionals['professional_all_tags'].fillna(
            'No Tag')
        # replace "" with "No Tag", because previously we replace nan with ""
        df_professionals['professional_all_tags'] = \
        df_professionals['professional_all_tags'].replace(
            '', 'No Tag')
        
        df_professionals['professionals_location'] = \
        df_professionals['professionals_location'].fillna(
            'No Location')
        
        df_professionals['professionals_industry'] = \
        df_professionals['professionals_industry'].fillna(
            'No Industry')

        # remove duplicates tags from each professionals
        df_professionals['professional_all_tags'] = \
        df_professionals['professional_all_tags'].str.split(
            ',').apply(set).str.join(',')

        # remove some null values from df_merge
        df_merge['num_ans_per_ques']  = \
        df_merge['num_ans_per_ques'].fillna(0)
        
        return df_questions, df_professionals, df_merge


**Building Data for LightFM Class**: From step 2 we already know that the lightfm library expects data in a very specific and elegant format. That format is already discussed in step 2, so feel free to reread it. Now we are building a class that puts all the pieces of the dataset-building puzzle in one place. 

In [None]:
class LightFMDataPrep:
    def __init__(self):
        pass
    def create_features(self, dataframe, features_name, id_col_name):
        """
        Generate features that will be ready for feeding into lightfm

        Parameters
        ----------
        dataframe: Dataframe
            Pandas Dataframe which contains features
        features_name : List
            List of feature column names available in dataframe
        id_col_name: String
            Column name which contains id of the question or
            professional that the features will map to.
            There are two possible values for this variable.
            1. questions_id_num
            2. professionals_id_num

        Returns
        -------
        List
            A list containing processed features
            that are ready to feed into lightfm.
            The format of each value
            will be (user_id, ['feature_1', 'feature_2', 'feature_3'])
            Ex. -> (1, ['military', 'army', '5'])
        """

        features = dataframe[features_name].apply(
            lambda x: ','.join(x.map(str)), axis=1)
        features = features.str.split(',')
        features = list(zip(dataframe[id_col_name], features))
        return features



    def generate_feature_list(self, dataframe, features_name):
        """
        Generate features list for mapping 

        Parameters
        ----------
        dataframe: Dataframe
            Pandas Dataframe for Users or Q&A. 
        features_name : List
            List of feature column names available in dataframe. 

        Returns
        -------
        List of all features for mapping 
        """
        features = dataframe[features_name].apply(
            lambda x: ','.join(x.map(str)), axis=1)
        features = features.str.split(',')
        features = features.apply(pd.Series).stack().reset_index(drop=True)
        return features
    
    def create_data(self, questions, professionals, merge):
        question_feature_list = self.generate_feature_list(
            questions,
            ['questions_tag_name'])

        professional_feature_list = self.generate_feature_list(
            professionals,
            ['professional_all_tags'])
        
        merge['total_weights'] = 1 / (
            merge['num_ans_per_ques'])
        
        # creating features for feeding into lightfm 
        questions['question_features'] = self.create_features(
            questions, ['questions_tag_name'], 
            'questions_id_num')

        professionals['professional_features'] = self.create_features(
            professionals,
            ['professional_all_tags'],
            'professionals_id_num')
        
        return (question_feature_list, professional_feature_list,
                merge, questions, professionals)
        
    def fit(self, questions, professionals, merge):
        ########################
        # Dataset building for lightfm
        ########################
        (question_feature_list,
         professional_feature_list,
         merge, questions, professionals) = self.create_data(
            questions, professionals, merge)
        
        
        # define our dataset variable
        # then we feed unique professionals and questions ids
        # and item and professional feature list
        # this will create lightfm internal mapping
        dataset = Dataset()
        dataset.fit(
            set(professionals['professionals_id_num']), 
            set(questions['questions_id_num']),
            item_features=question_feature_list, 
            user_features=professional_feature_list)


        # now we are building the interactions
        # matrix between professionals and questions
        # we pass each professional id, question id and weight as a tuple
        # e.g -> [(pro_id, question_id, weight), (pro_id, question_id, weight)]
        # then we use lightfm's built-in method for building the interactions matrix
        merge['author_question_id_tuple'] = list(zip(
            merge.professionals_id_num,
            merge.questions_id_num,
            merge.total_weights))

        interactions, weights = dataset.build_interactions(
            merge['author_question_id_tuple'])



        # now we are building our questions and
        # professionals features
        # in a way that lightfm understands.
        # we are using lightfm's built-in methods for building
        # questions and professionals features 
        questions_features = dataset.build_item_features(
            questions['question_features'])

        professional_features = dataset.build_user_features(
            professionals['professional_features'])
        
        return (interactions, weights,
                questions_features, professional_features)
        
        

**Train Model Class**: In step 2, we saw how we build and train our model. Now we are going to put it all together in the TrainLightFM class. 

In [None]:
class TrainLightFM:
    def __init__(self):
        pass
        
    def train_test_split(self, interactions, weights):
        train_interactions, test_interactions = \
        cross_validation.random_train_test_split(
            interactions, 
            random_state=np.random.RandomState(2019))
        
        train_weights, test_weights = \
        cross_validation.random_train_test_split(
            weights, 
            random_state=np.random.RandomState(2019))
        return (train_interactions, test_interactions,
                train_weights, test_weights)
    
    def fit(self, interactions, weights,
            questions_features, professional_features,
            cross_validation=False,no_components=150,
            learning_rate=0.05,
            loss='warp',
            random_state=2019,
            verbose=True,
            num_threads=4, epochs=5):
        ################################
        # Model building part
        ################################

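        # define lightfm model by specifying hyper-parameters
        # then fit the model with interactions matrix,
        # item and user features
        
        # pass hyper-parameters by keyword: the second positional
        # argument of LightFM is `k`, not `learning_rate`
        model = LightFM(
            no_components=no_components,
            learning_rate=learning_rate,
            loss=loss,
            random_state=random_state)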
        model.fit(
            interactions,
            item_features=questions_features,
            user_features=professional_features, sample_weight=weights,
            epochs=epochs, num_threads=num_threads, verbose=verbose)
        
        return model


**Recommendations class**: Now we are going to build a class for making recommendations. This will make it easy to serve recommendations from a Django API. The class comes with some extra features. You can use it for general prediction by giving it professional ids and question features. It also lets you choose the questions added between two dates and make recommendations from those questions only. 

This is useful for professionals who chose "weekly" or "daily" as their email frequency level: we can select the questions from the matching period, for example the last week, and then recommend from those questions. 
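
As a rough sketch of how a frequency level could be turned into a date range, consider the plain-Python example below. The `LOOKBACK_DAYS` mapping, both helper names and the toy questions are assumptions made for illustration. Only the `start < date <= end` rule mirrors the date mask used by the recommendations class below.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from an email frequency level to a look-back window
LOOKBACK_DAYS = {"daily": 1, "weekly": 7}


def date_range_for_frequency(frequency, today):
    """Return (start_date, end_date) strings in '%Y-%m-%d' format."""
    start = today - timedelta(days=LOOKBACK_DAYS[frequency])
    return start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")


def filter_by_date(questions, start_date, end_date):
    """Keep questions with start < date_added <= end."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    return [q for q in questions if start < q["date_added"] <= end]


# Made-up questions
questions = [
    {"title": "old question", "date_added": datetime(2019, 1, 1)},
    {"title": "recent question", "date_added": datetime(2019, 1, 29)},
]
start_date, end_date = date_range_for_frequency("weekly", datetime(2019, 1, 31))
recent = filter_by_date(questions, start_date, end_date)
print(start_date, end_date, [q["title"] for q in recent])
```

The two date strings are in the same `'%Y-%m-%d'` format that `recommend_by_pro_id_frequency_date_range` parses.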

In [None]:
class LightFMRecommendations:
    """
    Make prediction given model and professional ids
    """
    def __init__(self, lightfm_model,
                 professionals_features,
                 questions_features,
                 questions,professionals,merge):
        self.model = lightfm_model
        self.professionals_features = professionals_features
        self.questions_features = questions_features
        self.questions = questions
        self.professionals = professionals
        self.merge = merge
        
    def previous_answered_questions(self, professionals_id):
        previous_q_id_num = (
            self.merge.loc[\
                self.merge['professionals_id_num'] == \
                professionals_id]['questions_id_num'])
        
        previous_answered_questions = self.questions.loc[\
            self.questions['questions_id_num'].isin(
            previous_q_id_num)]
        return previous_answered_questions
        
    
    def _filter_question_by_pro(self, professionals_id):
        """Drop questions that the professional has already answered"""
        previous_answered_questions = \
        self.previous_answered_questions(professionals_id)
        
        discard_qu_id = \
        previous_answered_questions['questions_id_num'].values.tolist()
        
        questions_for_prediction = \
        self.questions.loc[~self.questions['questions_id_num'].isin(discard_qu_id)]
        
        return questions_for_prediction
    
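    def _filter_question_by_date(self, questions, start_date, end_date):
        """Keep questions added after start_date and up to end_date"""
        mask = \
        (questions['questions_date_added'] > start_date) & \
        (questions['questions_date_added'] <= end_date)
        
        # copy so that the score column can be added to the result safely
        return questions.loc[mask].copy()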
        
    
    def recommend_by_pro_id_general(self,
                                    professional_id,
                                    num_prediction=8):
        questions_for_prediction = self._filter_question_by_pro(professional_id)
        score = self.model.predict(
            professional_id,
            questions_for_prediction['questions_id_num'].values.tolist(), 
            item_features=self.questions_features,
            user_features=self.professionals_features)
        
        questions_for_prediction['recommendation_score'] = score
        questions_for_prediction = questions_for_prediction.sort_values(
            by='recommendation_score', ascending=False)[:num_prediction]
        return questions_for_prediction
    
    def recommend_by_pro_id_frequency_date_range(self,
                                                 professional_id,
                                                 start_date,
                                                 end_date,
                                                 num_prediction=8):
        questions_for_prediction = \
        self._filter_question_by_pro(professional_id)
        
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
        
        questions_for_prediction = self._filter_question_by_date(
            questions_for_prediction, start_date, end_date)
        
        score = self.model.predict(
            professional_id,
            questions_for_prediction['questions_id_num'].values.tolist(), 
            item_features=self.questions_features,
            user_features=self.professionals_features)
        
        questions_for_prediction['recommendation_score'] = score
        questions_for_prediction = questions_for_prediction.sort_values(
            by='recommendation_score', ascending=False)[:num_prediction]
        return questions_for_prediction
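
The filter, score, and rank pattern used by `recommend_by_pro_id_general` can be tried without a trained model. The sketch below is only an illustration: `StubModel` and `top_n_unanswered` are hypothetical names that do not exist in this notebook, and the stub returns fake scores in place of a real LightFM `predict` call.

```python
import numpy as np
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a trained LightFM model (illustration only)."""

    def predict(self, user_ids, item_ids, item_features=None, user_features=None):
        # Deterministic fake score: a higher question id gets a higher score.
        return np.asarray(item_ids, dtype=float) / 10.0


def top_n_unanswered(model, questions, answered_ids, professional_id, n=3):
    """Drop answered questions, score the rest, and return the n best."""
    # .copy() gives an independent frame, so adding a column is safe.
    candidates = questions.loc[
        ~questions['questions_id_num'].isin(answered_ids)].copy()
    candidates['recommendation_score'] = model.predict(
        professional_id, candidates['questions_id_num'].to_numpy())
    return candidates.nlargest(n, 'recommendation_score')


questions = pd.DataFrame({
    'questions_id_num': [0, 1, 2, 3, 4],
    'questions_title': ['a', 'b', 'c', 'd', 'e'],
})
top = top_n_unanswered(StubModel(), questions, answered_ids=[4, 2],
                       professional_id=3, n=2)
print(top)
```

Questions 2 and 4 are removed before scoring, so the two best remaining questions are 3 and 1.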


**Put it all together**: We have now defined all of our important classes. Let's use each of them to build our model.

In [None]:
# instantiate all classes
cv_data_prep = CareerVillageDataPreparation()
light_fm_data_prep = LightFMDataPrep()
train_lightfm = TrainLightFM()

# process raw data
df_questions_p, df_professionals_p, df_merge_p = \
cv_data_prep.prepare(
    df_professionals,df_students,
    df_questions,df_answers,
    df_tags,df_tag_questions,df_tag_users,
    df_question_scores)


# prepare data for lightfm 
interactions, weights, \
questions_features, professional_features = \
light_fm_data_prep.fit(
    df_questions_p, df_professionals_p, df_merge_p)


# finally build and train our model
model = train_lightfm.fit(interactions,
                          weights,
                          questions_features,
                          professional_features)


Awesome! Do you see how easy it was to build our model? We can surely apply the same idea when putting the model into production. Now we are going to see some real recommendations.

In [None]:
# define our recommender class
lightfm_recommendations = LightFMRecommendations(
    model,
    professional_features,questions_features,
    df_questions_p, df_professionals_p, df_merge_p)

# let's see what our model predicts for professional id 3
print("Recommendation for professional: " + str(3))
display(lightfm_recommendations.recommend_by_pro_id_general(3)[:8])

In [None]:
# also let's see what our model predicts for professional 3
# given questions between two dates
print("Recommendations for professional " + str(3) + " (questions from 2016-1-1 to 2016-12-31):")
display(lightfm_recommendations.recommend_by_pro_id_frequency_date_range(3,
                                                                 '2016-1-1','2016-12-31')[:8])

Awesome! We can see our recommendations. We can also see that the recommendation class has a method for recommending questions from a date range. This is very helpful for recommending questions to professionals who have set their email frequency to daily or weekly.
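
As a rough sketch of how an email frequency could be turned into the two dates that `recommend_by_pro_id_frequency_date_range` expects, the helper below computes a look-back window. The `LOOKBACK_DAYS` mapping, its labels, and the `date_window` function are assumptions made for illustration; the real frequency values in the emails data may be named differently.

```python
from datetime import date, timedelta

# Assumed mapping from an email frequency label to a look-back window in days.
# The labels are hypothetical; use whatever values the emails data contains.
LOOKBACK_DAYS = {'daily': 1, 'weekly': 7}


def date_window(frequency, today):
    """Return (start_date, end_date) strings in '%Y-%m-%d' format."""
    start = today - timedelta(days=LOOKBACK_DAYS[frequency])
    return start.strftime('%Y-%m-%d'), today.strftime('%Y-%m-%d')


start_date, end_date = date_window('weekly', date(2016, 12, 31))
print(start_date, end_date)
# These strings could then be passed on, for example:
# lightfm_recommendations.recommend_by_pro_id_frequency_date_range(
#     3, start_date, end_date)
```

Note that the recommendation method parses `end_date` as midnight, so if `questions_date_added` carries a time of day, questions added later on the end date fall outside the window. Passing the following day as `end_date` would include them.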

# Conclusion

**Ideas that I tried but didn't implement in this notebook**: 
* Adding location features: I tried adding location features, but it decreased the model's AUC score from 91% to 84%. That's why I don't use those features.
* Adding dates and hearts data: I also tried this, but it didn't improve the AUC score. 
* Correcting spelling errors: I tried this method and implemented it successfully, but it is really slow. For that reason, I excluded it. 
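
For the spelling idea, one lightweight option is to snap each tag to the closest tag in a known vocabulary using Python's standard `difflib`. This is not the method I tried above, only a minimal sketch; the `normalize_tag` helper, the example vocabulary, and the `cutoff` value are assumptions for illustration.

```python
from difflib import get_close_matches


def normalize_tag(tag, known_tags, cutoff=0.8):
    """Map a possibly misspelled tag to its closest known tag, if any."""
    cleaned = tag.strip().lstrip('#').lower()
    if cleaned in known_tags:
        return cleaned
    # get_close_matches returns at most n candidates above the cutoff ratio.
    matches = get_close_matches(cleaned, known_tags, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


known_tags = ['engineering', 'medicine', 'computer-science']
print(normalize_tag('#Enginering', known_tags))
print(normalize_tag('law', known_tags))
```

A tag with no close match is returned unchanged (after cleaning), so unknown but valid tags are not forced onto a wrong neighbour. Each lookup still compares against the whole vocabulary, so caching results for repeated tags would be worth considering on the full data.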

**Ideas that I think are important but didn't implement in this notebook**: 
* Adding professionals' industry and title as features. This will enhance the diversity of our model's recommendations and will increase the overall recommendation scores.
* CareerVillage should auto-correct hashtags at the time students ask their questions. This will help the model match tags more effectively. 
* For professionals who have chosen the immediate email frequency, we can create a second model that is the same except that the user and item features are exchanged. I mean training the model with questions as users and professionals as items. That way, we can predict professionals for a given question, which helps to target these professionals as soon as a question is asked.
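
The last idea comes down to transposing the interaction matrix and swapping the two feature matrices. Here is a minimal sketch on toy data, assuming the interactions are a SciPy sparse matrix with professionals as rows and questions as columns. The commented LightFM call is untested and reuses the variable names from the earlier cells.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interactions: 2 professionals (rows) x 3 questions (columns).
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
interactions = coo_matrix((np.ones(3), (rows, cols)), shape=(2, 3))

# Transposing makes questions the "users" and professionals the "items".
interactions_by_question = interactions.T.tocoo()
print(interactions_by_question.shape)

# With LightFM, training on the swapped roles might then look like this
# (assumption: hyperparameters, features and weights as in the earlier cells):
# model_q = LightFM(...)
# model_q.fit(interactions.T.tocoo(),
#             user_features=questions_features,
#             item_features=professional_features,
#             sample_weight=weights.T.tocoo())
```

After the swap, calling `predict` with a question id as the user and professional ids as the items would rank professionals for that question.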

Finally, we have come to the end. Thank you very much for reading this notebook. I have provided a very good recommender system for CareerVillage in this notebook. If you find any mistakes or have any suggestions, feel free to comment. And don't forget to upvote. Good luck! 

References: 

[1] [Improving Pairwise Learning for Item Recommendation from Implicit Feedback](http://webia.lip6.fr/~gallinar/gallinari/uploads/Teaching/WSDM2014-rendle.pdf)

[2] [Content-based filtering](https://towardsdatascience.com/what-are-product-recommendation-engines-and-the-various-versions-of-them-9dcab4ee26d5)

[3] [What Is Content-Based Filtering?](https://www.upwork.com/hiring/data/what-is-content-based-filtering/)

[4] [What Is Collaborative Filtering?](https://www.upwork.com/hiring/data/how-collaborative-filtering-works/)

[5] [Collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)

[6] [LightFM model documentation](https://lyst.github.io/lightfm/docs/lightfm.html)

[7] [Metadata Embeddings for User and Item Cold-start Recommendations](https://arxiv.org/pdf/1507.08439.pdf)