# Notebook Overview: 

This notebook contains a solution for Career village recommendation system competition. In this notebook, I build a hybrid recommendation system for recommending students questions to professionals for CareerVillage.org. The recommender system works by matching professionals with questions by tags they follow, their previous answers' question tags and similar tags. Also, it overcomes some of the most highest rated problem for CareerVillage recommender system like cold-start and others.

# Competition problem statements: 
    
CareerVillage.org is a non-profit organization helping underserved youth to provide information to build their career. Students can ask their questions in the CareerVillage.org and professionals(expert people who love to help students) answer their questions. The challenge is that CareerVillage has to recommend correct questions to the professionals so that the questions match with the professional's interests. This will increase the likelihood of a question to get an answer. So in this competition, we have to make a recommendation system that will correctly recommend questions that will match with professionals interest.

In [1]:
################################################
# Importing necessary library
################################################
import numpy as np
import pandas as pd

# all lightfm imports 
from lightfm.data import Dataset
from lightfm import LightFM
from lightfm import cross_validation
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score

# imports re for text cleaning 
import re
from datetime import datetime, timedelta

# we will ignore pandas warning 
import warnings
warnings.filterwarnings('ignore')



In [2]:
############################################
# Read all our datasets and store them in pandas dataframe objects. 
############################################
base_path = '../../../data/data-science-for-good-careervillage/'
df_answer_scores = pd.read_csv(
    base_path + 'answer_scores.csv')

df_answers = pd.read_csv(
    base_path + 'answers.csv',
    parse_dates=['answers_date_added'])

df_comments = pd.read_csv(
    base_path + 'comments.csv')

df_emails = pd.read_csv(
    base_path + 'emails.csv')

df_group_memberships = pd.read_csv(
    base_path + 'group_memberships.csv')

df_groups = pd.read_csv(
    base_path + 'groups.csv')

df_matches = pd.read_csv(
    base_path + 'matches.csv')

df_professionals = pd.read_csv(
    base_path + 'professionals.csv',
    parse_dates=['professionals_date_joined'])

df_question_scores = pd.read_csv(
    base_path + 'question_scores.csv')

df_questions = pd.read_csv(
    base_path + 'questions.csv',
    parse_dates=['questions_date_added'])

df_school_memberships = pd.read_csv(
    base_path + 'school_memberships.csv')

df_students = pd.read_csv(
    base_path + 'students.csv',
    parse_dates=['students_date_joined'])

df_tag_questions = pd.read_csv(
    base_path + 'tag_questions.csv')

df_tag_users = pd.read_csv(
    base_path + 'tag_users.csv')

df_tags = pd.read_csv(
    base_path + 'tags.csv')

In [3]:
def generate_int_id(dataframe, id_col_name):
    """
    Generate unique integer id for users, questions and answers

    Parameters
    ----------
    dataframe: Dataframe
        Pandas Dataframe for Users or Q&A. 
    id_col_name : String 
        New integer id's column name.
        
    Returns
    -------
    Dataframe
        Updated dataframe containing new id column 
    """
    new_dataframe=dataframe.assign(
        int_id_col_name=np.arange(len(dataframe))
        ).reset_index(drop=True)
    return new_dataframe.rename(columns={'int_id_col_name': id_col_name})



def create_features(dataframe, features_name, id_col_name):
    """
    Generate features that will be ready for feeding into lightfm

    Parameters
    ----------
    dataframe: Dataframe
        Pandas Dataframe which contains features
    features_name : List
        List of feature columns name avaiable in dataframe
    id_col_name: String
        Column name which contains id of the question or
        answer that the features will map to.
        There are two possible values for this variable.
        1. questions_id_num
        2. professionals_id_num

    Returns
    -------
    Pandas Series
        A pandas series containing process features
        that are ready for feed into lightfm.
        The format of each value
        will be (user_id, ['feature_1', 'feature_2', 'feature_3'])
        Ex. -> (1, ['military', 'army', '5'])
    """

    features = dataframe[features_name].apply(
        lambda x: ','.join(x.map(str)), axis=1)
    features = features.str.split(',')
    features = list(zip(dataframe[id_col_name], features))
    return features



def generate_feature_list(dataframe, features_name):
    """
    Generate features list for mapping 

    Parameters
    ----------
    dataframe: Dataframe
        Pandas Dataframe for Users or Q&A. 
    features_name : List
        List of feature columns name avaiable in dataframe. 
        
    Returns
    -------
    List of all features for mapping 
    """
    features = dataframe[features_name].apply(
        lambda x: ','.join(x.map(str)), axis=1)
    features = features.str.split(',')
    features = features.apply(pd.Series).stack().reset_index(drop=True)
    return features


def calculate_auc_score(lightfm_model, interactions_matrix, 
                        question_features, professional_features): 
    """
    Measure the ROC AUC metric for a model. 
    A perfect score is 1.0.

    Parameters
    ----------
    lightfm_model: LightFM model 
        A fitted lightfm model 
    interactions_matrix : 
        A lightfm interactions matrix 
    question_features, professional_features: 
        Lightfm features 
        
    Returns
    -------
    String containing AUC score 
    """
    score = auc_score( 
        lightfm_model, interactions_matrix, 
        item_features=question_features, 
        user_features=professional_features, 
        num_threads=4).mean()
    return score

In [4]:
# generating unique integer id for users and q&a
df_professionals = generate_int_id(df_professionals, 'professionals_id_num')
df_students = generate_int_id(df_students, 'students_id_num')
df_questions = generate_int_id(df_questions, 'questions_id_num')
df_answers = generate_int_id(df_answers, 'answers_id_num')

In [5]:
###########################
# merging dataset
###########################

# just dropna from tags 
df_tags = df_tags.dropna()
df_tags['tags_tag_name'] = df_tags['tags_tag_name'].str.replace('#', '')


# merge tag_questions with tags name
# then group all tags for each question into single rows
df_tags_question = df_tag_questions.merge(
    df_tags, how='inner',
    left_on='tag_questions_tag_id', right_on='tags_tag_id')
df_tags_question = df_tags_question.groupby(
    ['tag_questions_question_id'])['tags_tag_name'].apply(
        ','.join).reset_index()
df_tags_question = df_tags_question.rename(columns={'tags_tag_name': 'questions_tag_name'})

# merge tag_users with tags name 
# then group all tags for each user into single rows 
# after that rename the tag column name 
df_tags_pro = df_tag_users.merge(
    df_tags, how='inner',
    left_on='tag_users_tag_id', right_on='tags_tag_id')
df_tags_pro = df_tags_pro.groupby(
    ['tag_users_user_id'])['tags_tag_name'].apply(
        ','.join).reset_index()
df_tags_pro = df_tags_pro.rename(columns={'tags_tag_name': 'professionals_tag_name'})


# merge professionals and questions tags with main merge_dataset 
df_questions = df_questions.merge(
    df_tags_question, how='left',
    left_on='questions_id', right_on='tag_questions_question_id')
df_professionals = df_professionals.merge(
    df_tags_pro, how='left',
    left_on='professionals_id', right_on='tag_users_user_id')

# merge questions with scores 
df_questions = df_questions.merge(
    df_question_scores, how='left',
    left_on='questions_id', right_on='id')
# merge questions with students 
df_questions = df_questions.merge(
    df_students, how='left',
    left_on='questions_author_id', right_on='students_id')



# merge answers with questions 
# then merge professionals and questions score with that 
df_merge = df_answers.merge(
    df_questions, how='inner',
    left_on='answers_question_id', right_on='questions_id')
df_merge = df_merge.merge(
    df_professionals, how='inner',
    left_on='answers_author_id', right_on='professionals_id')
df_merge = df_merge.merge(
    df_question_scores, how='inner',
    left_on='questions_id', right_on='id')

In [6]:
#######################
# Generate some features for calculates weights
# that will use with interaction matrix 
#######################

df_merge['num_of_ans_by_professional'] = df_merge.groupby(['answers_author_id'])['questions_id'].transform('count')
df_merge['num_ans_per_ques'] = df_merge.groupby(['questions_id'])['answers_id'].transform('count')
df_merge['num_tags_professional'] = df_merge['professionals_tag_name'].str.split(",").str.len()
df_merge['num_tags_question'] = df_merge['questions_tag_name'].str.split(",").str.len()

In [7]:
print("Maximum number of answer per question : " + str(df_merge['num_ans_per_ques'].max()))
print("Maximum number of tags per professional : " + str(df_merge['num_tags_professional'].max()))
print("Maximum number of tags per question : " + str(df_merge['num_tags_question'].max()))

Maximum number of answer per question : 58
Maximum number of tags per professional : 82.0
Maximum number of tags per question : 54.0


In [8]:
########################
#Merge answered questions tags with professional's tags: Professionals can follow some tags. 
#But not all professional follow tags and most especially we see from EDA that sometime professionals 
#answers questions that is not related to their tags. For that reason, I have merge questions tags that 
#each professional has answered with professional tags. 
#This makes our model more robust and context aware.

# Merge professionals previous answered 
# questions tags into professionals tags 
########################

# select professionals answered questions tags 
# and stored as a dataframe
professionals_prev_ans_tags = df_merge[['professionals_id', 'questions_tag_name']]
# drop null values from that 
professionals_prev_ans_tags = professionals_prev_ans_tags.dropna()
# because professsionals answers multiple questions, 
# we group all of tags of each user into single row 
professionals_prev_ans_tags = professionals_prev_ans_tags.groupby(
    ['professionals_id'])['questions_tag_name'].apply(
        ','.join).reset_index()

# drop duplicates tags from each professionals rows
professionals_prev_ans_tags['questions_tag_name'] = (
    professionals_prev_ans_tags['questions_tag_name'].str.split(',').apply(set).str.join(','))

# finally merge the dataframe with professionals dataframe 
df_professionals = df_professionals.merge(professionals_prev_ans_tags, how='left', on='professionals_id')

# join professionals tags and their answered tags 
# we replace nan values with ""
df_professionals['professional_all_tags'] = (
    df_professionals[['professionals_tag_name', 'questions_tag_name']].apply(
        lambda x: ','.join(x.dropna()),
        axis=1))

In [9]:
# handling null values and duplicates
df_questions['score'] = df_questions['score'].fillna(0)
df_questions['score'] = df_questions['score'].astype(int)
df_questions['questions_tag_name'] = df_questions['questions_tag_name'].fillna('No Tag')
# remove duplicates tags from each questions 
df_questions['questions_tag_name'] = df_questions['questions_tag_name'].str.split(',').apply(set).str.join(',')


# fill nan with 'No Tag' if any 
df_professionals['professional_all_tags'] = df_professionals['professional_all_tags'].fillna('No Tag')
# replace "" with "No Tag", because previously we replace nan with ""
df_professionals['professional_all_tags'] = df_professionals['professional_all_tags'].replace('', 'No Tag')
df_professionals['professionals_location'] = df_professionals['professionals_location'].fillna('No Location')
df_professionals['professionals_industry'] = df_professionals['professionals_industry'].fillna('No Industry')

# remove duplicates tags from each professionals 
df_professionals['professional_all_tags'] = df_professionals['professional_all_tags'].str.split(',').apply(set).str.join(',')



# remove some null values from df_merge
df_merge['num_ans_per_ques']  = df_merge['num_ans_per_ques'].fillna(0)
df_merge['num_tags_professional'] = df_merge['num_tags_professional'].fillna(0)
df_merge['num_tags_question'] = df_merge['num_tags_question'].fillna(0)

In [10]:
# generating features list for mapping 
question_feature_list = generate_feature_list(
    df_questions,
    ['questions_tag_name'])

professional_feature_list = generate_feature_list(
    df_professionals,
    ['professional_all_tags'])

In [11]:
# calculate our weight value 
df_merge['total_weights'] = 1 / (
    df_merge['num_ans_per_ques'])


# creating features for feeding into lightfm 
df_questions['question_features'] = create_features(
    df_questions, ['questions_tag_name'], 
    'questions_id_num')

df_professionals['professional_features'] = create_features(
    df_professionals,
    ['professional_all_tags'],
    'professionals_id_num')

In [12]:
########################
# Dataset building for lightfm
########################

# define our dataset variable
# then we feed unique professionals and questions ids
# and item and professional feature list
# this will create lightfm internel mapping
dataset = Dataset()
dataset.fit(
    set(df_professionals['professionals_id_num']), 
    set(df_questions['questions_id_num']),
    item_features=question_feature_list, 
    user_features=professional_feature_list)


# now we are building interactions matrix between professionals and quesitons
# we are passing professional and questions id as a tuple
# e.g -> pd.Series((pro_id, question_id), (pro_id, questin_id))
# then we use lightfm build in method for building interactions matrix
df_merge['author_question_id_tuple'] = list(zip(
    df_merge.professionals_id_num, df_merge.questions_id_num, df_merge.total_weights))

interactions, weights = dataset.build_interactions(
    df_merge['author_question_id_tuple'])



# now we are building our questions and professionals features
# in a way that lightfm understand.
# we are using lightfm build in method for building
# questions and professionals features 
questions_features = dataset.build_item_features(
    df_questions['question_features'])

professional_features = dataset.build_user_features(
    df_professionals['professional_features'])

In [13]:
################################
# Model building part
################################

# define lightfm model by specifying hyper-parametre
# then fit the model with ineteractions matrix, item and user features 
model = LightFM(
    no_components=150,
    learning_rate=0.05,
    loss='warp',
    random_state=2019)

model.fit(
    interactions,
    item_features=questions_features,
    user_features=professional_features, sample_weight=weights,
    epochs=5, num_threads=4, verbose=True)

Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4


<lightfm.lightfm.LightFM at 0x1a2d334e50>

In [14]:
calculate_auc_score(model, interactions, questions_features, professional_features)

0.91337264

In [15]:
#Make real recommendations
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

def recommend_questions(professional_ids):
     
    for professional in professional_ids:
        # print their previous answered question title
        previous_q_id_num = df_merge.loc[df_merge['professionals_id_num'] == professional][:3]['questions_id_num']
        df_previous_questions = df_questions.loc[df_questions['questions_id_num'].isin(previous_q_id_num)]
        print('Professional Id (' + str(professional) + "): Previous Answered Questions")
        display_side_by_side(
            df_previous_questions[['questions_title', 'question_features']],
            df_professionals.loc[df_professionals.professionals_id_num == professional][['professionals_id_num','professionals_tag_name']])
        
        # predict
        discard_qu_id = df_previous_questions['questions_id_num'].values.tolist()
        df_use_for_prediction = df_questions.loc[~df_questions['questions_id_num'].isin(discard_qu_id)]
        questions_id_for_predict = df_use_for_prediction['questions_id_num'].values.tolist()
        
        scores = model.predict(
            professional,
            questions_id_for_predict,
            item_features=questions_features,
            user_features=professional_features)
        
        df_use_for_prediction['scores'] = scores
        df_use_for_prediction = df_use_for_prediction.sort_values(by='scores', ascending=False)[:8]
        print('Professional Id (' + str(professional) + "): Recommended Questions: ")
        display(df_use_for_prediction[['questions_title', 'question_features']])

In [16]:
recommend_questions([1200 ,19897, 3])

Professional Id (1200): Previous Answered Questions


Unnamed: 0,questions_title,question_features

Unnamed: 0,professionals_id_num,professionals_tag_name
1200,1200,"marketing,strategy,entrepreneurship,management,java,advertising,python,data-analysis,online-advertising,real-estate,team-leadership,dj,analytics,display-advertising,football,blackjack,hip-hop,billiards,break"


Professional Id (1200): Recommended Questions: 


Unnamed: 0,questions_title,question_features
19011,How do you get started in starting your own bu...,"(19011, [business, marketing, management])"
8559,Which Business field of study is best suited t...,"(8559, [administration, entrepreneurship, busi..."
14451,What is the best way to find a good internship...,"(14451, [business, marketing])"
23680,what makes markerting so important in business?,"(23680, [business, marketing])"
15671,How beneficial will a business major be ?,"(15671, [business, finance, accounting, market..."
10073,How important is a business degree when trying...,"(10073, [business, entrepreneurship])"
9867,Best way to get started in entrepreneurship an...,"(9867, [business, entrepreneurship])"
15730,What does it take to succeed in business?,"(15730, [business, entrepreneurship])"


Professional Id (19897): Previous Answered Questions


Unnamed: 0,questions_title,question_features
22784,Do companies truly focus on your college major when applying for jobs?,"(22784, [major])"

Unnamed: 0,professionals_id_num,professionals_tag_name
19897,19897,"illustration,graphic-design,adobe-creative-suite,comic-books"


Professional Id (19897): Recommended Questions: 


Unnamed: 0,questions_title,question_features
19407,How can you be a successful photographer? What...,"(19407, [art, photography, graphic-design])"
2310,what is one of best things about being an anim...,"(2310, [animation, art, design, artist])"
9682,How to get started in animation?,"(9682, [animation, art, artist])"
6058,How should you start in the Graphic Design ind...,"(6058, [art, design, graphic-design])"
23691,How to comopose/write music without going ti s...,"(23691, [music, art])"
13484,Would a Graphic Design degree be a feesible op...,"(13484, [art, graphic-design])"
1416,Is Competition In The Animation Field Low or H...,"(1416, [animation, art])"
19471,Graphic Design - job outlook for the next 10 y...,"(19471, [art, graphic-design])"


Professional Id (3): Previous Answered Questions


Unnamed: 0,questions_title,question_features
11339,What are the different jobs a person can do in Forensic Science?,"(11339, [criminal, justice, forensic, science])"
14818,What does a typical work day for a forensic scientist look like?,"(14818, [No Tag])"
19077,Is most of your day spent working when being a detective?,"(19077, [detective])"

Unnamed: 0,professionals_id_num,professionals_tag_name
3,3,


Professional Id (3): Recommended Questions: 


Unnamed: 0,questions_title,question_features
2423,How long does it take to become a Detective?,"(2423, [law, criminal-justice, police, law-enf..."
17184,What types of Detectives are there?,"(17184, [law, criminal-justice, police, law-en..."
9778,I want to be a police officer or a police disp...,"(9778, [law, criminal-justice, police, law-enf..."
8863,What qualifications are needed to be promoted ...,"(8863, [criminal-justice, law-enforcement, pol..."
11514,What does an aspiring cop have to look forward...,"(11514, [criminal-justice, law-enforcement, po..."
1936,What degrees do you have to have in order to g...,"(1936, [criminal-justice, law-enforcement, pol..."
18947,"Could I go straight into Law Enforcememt, when...","(18947, [law, law-enforcement, police])"
20003,how many criminal psychologist jobs are our th...,"(20003, [criminal-justice, psychology, law-enf..."


# Analysis

Analysis: Awesome! Finally we can see our recommendation by our model. Let's take some time to ponder over the recommendations.

For the first professionl (1200) has not answer any questions yet. But he/she follows some tags. Our model take those tags as features and predict questions that has similar tags.
For the second professionals (19897) answered one questions that has tag major. But in his profile he follows tags like creative works like arts, illustrator etc. So our model recommend questions that has creative tags like arts, illustrator because he follows more tags one creative works.
For the third professionals (3): answered questions that has tag forensic, criminal, science, justice, detective. From the tags we can get an idea of professionals interests. Our model also learn that. That's why it recommend items that has tags like law, criminal, detective.
This is just a simple exploration. Hope you get idea of the model recommendations. The model can survive cold-start, high-poularity problem. It also recommend those questions that has less answer because of its weights that I provided during training. Now we build our model and tested it. In the next section, we will look how we can put this model in production.

# Model in production

Well, now we are going build a pipeline that will help us for putting this model into production. We are going to build class for each steps discuss in step 2. Also, we are going to build some additional functions and methods that will add additional functionality to the model.

In [17]:
############################################
# Read all our datasets agian 
# and store them in pandas dataframe objects. 
############################################
base_path = '../../../data/data-science-for-good-careervillage/'
df_answer_scores = pd.read_csv(
    base_path + 'answer_scores.csv')

df_answers = pd.read_csv(
    base_path + 'answers.csv',
    parse_dates=['answers_date_added'])

df_comments = pd.read_csv(
    base_path + 'comments.csv')

df_emails = pd.read_csv(
    base_path + 'emails.csv')

df_group_memberships = pd.read_csv(
    base_path + 'group_memberships.csv')

df_groups = pd.read_csv(
    base_path + 'groups.csv')

df_matches = pd.read_csv(
    base_path + 'matches.csv')

df_professionals = pd.read_csv(
    base_path + 'professionals.csv',
    parse_dates=['professionals_date_joined'])

df_question_scores = pd.read_csv(
    base_path + 'question_scores.csv')

df_questions = pd.read_csv(
    base_path + 'questions.csv',
    parse_dates=['questions_date_added'])

df_school_memberships = pd.read_csv(
    base_path + 'school_memberships.csv')

df_students = pd.read_csv(
    base_path + 'students.csv',
    parse_dates=['students_date_joined'])

df_tag_questions = pd.read_csv(
    base_path + 'tag_questions.csv')

df_tag_users = pd.read_csv(
    base_path + 'tag_users.csv')

df_tags = pd.read_csv(
    base_path + 'tags.csv')

In [18]:
class CareerVillageDataPreparation:
    """
    Clean and process data CareerVillage Data. 
    
    This class process data in a way that will be useful
    for building lightFM dataset. 
    """
    
    def __init__(self):
        pass

    def _assign_unique_id(self, data, id_col_name):
        """
        Generate unique integer id for users, questions and answers

        Parameters
        ----------
        data: Dataframe
            Pandas Dataframe for Users or Q&A. 
        id_col_name : String 
            New integer id's column name.

        Returns
        -------
        Dataframe
            Updated dataframe containing new id column
        """
        new_dataframe=data.assign(
            int_id_col_name=np.arange(len(data))
            ).reset_index(drop=True)
        return new_dataframe.rename(columns={'int_id_col_name': id_col_name})

    def _dropna(self, data, column, axis):
        """Drop null values from specific column"""
        return data.dropna(column, axis=axis)

    def _merge_data(self, left_data, left_key, right_data, right_key, how):
        """
        This function is used for merging two dataframe.
        
        Parameters
        -----------
        left_data: Dataframe
            Left side dataframe for merge
        left_key: String
            Left Dataframe merge key
        right_data: Dataframe
            Right side dataframe for merge
        right_key: String
            Right Dataframe merge key
        how: String
            Method of merge (inner, left, right, outer)
            
        
        Returns
        --------
        Dataframe
            A new dataframe merging left and right dataframe
        """
        return left_data.merge(
            right_data,
            how=how,
            left_on=left_key,
            right_on=right_key)

    def _group_tags(self, data, group_by, tag_column):
        """Grouop multiple tags into single rows sepearated by comma"""
        return data.groupby(
            [group_by])[tag_column].apply(
            ','.join).reset_index()

    def _merge_cv_datasets(
        self,
        professionals,students,
        questions,answers,
        tags,tag_questions,tag_users, questions_score):
        """
        This function merges all the necessary 
        CareerVillage dataset in defined way. 
        
        Parameters
        ------------
        professionals,students,
        questions,answers,
        tags,tag_questions,
        tag_users,
        questions_score: Dataframe
            Pandas dataframe defined by it's name
        
        
        Returns
        ---------
        questions, professionals: Dataframe
            Updated dataframe after merge
        merge: Dataframe
            A new datframe after merging answers with questions
        """
        
        
        # merge tag_questions with tags name
        # then group all tags for each question into single rows
        tag_question = self._merge_data(
            left_data=tag_questions,
            left_key='tag_questions_tag_id',
            right_data=tags,
            right_key='tags_tag_id',
            how='inner')
        tag_question = self._group_tags(
            data=tag_question,
            group_by='tag_questions_question_id',
            tag_column='tags_tag_name')
        
        tag_question = tag_question.rename(
            columns={'tags_tag_name': 'questions_tag_name'})
        
        # merge tag_users with tags name
        # then group all tags for each user into single rows 
        # after that rename the tag column name
        tags_pro = self._merge_data(
            left_data=tag_users,
            left_key='tag_users_tag_id',
            right_data=tags,
            right_key='tags_tag_id',
            how='inner')
        tags_pro = self._group_tags(
            data=tags_pro,
            group_by='tag_users_user_id',
            tag_column='tags_tag_name')
        tags_pro = tags_pro.rename(
            columns={'tags_tag_name': 'professionals_tag_name'})
        
        # merge professionals and questions tags with main merge_dataset 
        questions = self._merge_data(
            left_data=questions,
            left_key='questions_id',
            right_data=tag_question,
            right_key='tag_questions_question_id',
            how='left')
        professionals = self._merge_data(
            left_data=professionals,
            left_key='professionals_id',
            right_data=tags_pro,
            right_key='tag_users_user_id',
            how='left')
        
        # merge questions with scores 
        questions = self._merge_data(
            left_data=questions,
            left_key='questions_id',
            right_data=questions_score,
            right_key='id',
            how='left')
        
        # merge questions with students
        questions = self._merge_data(
            left_data=questions,
            left_key='questions_author_id',
            right_data=students,
            right_key='students_id',
            how='left')
        
        # merge answers with questions
        # then merge professionals and questions score with that
        merge = self._merge_data(
            left_data=answers,
            left_key='answers_question_id',
            right_data=questions,
            right_key='questions_id',
            how='inner')
        
        merge = self._merge_data(
            left_data=merge,
            left_key='answers_author_id',
            right_data=professionals,
            right_key='professionals_id',
            how='inner')
        
        return questions, professionals, merge
  
    def _drop_duplicates_tags(self, data, col_name):
        # drop duplicates tags from each row
        return (
            data[col_name].str.split(
                ',').apply(set).str.join(','))


    def _merge_pro_pre_ans_tags(self, professionals, merge):
        ########################
        # Merge professionals previous answered
        # questions tags into professionals tags
        ########################
        
        # select professionals answered questions tags
        # and stored as a dataframe
        professionals_prev_ans_tags = (
            merge[['professionals_id', 'questions_tag_name']])
        # drop null values from that
        professionals_prev_ans_tags = professionals_prev_ans_tags.dropna()
        
        # because professsionals answers multiple questions,
        # we group all of tags of each user into single row
        professionals_prev_ans_tags = self._group_tags(
            data=professionals_prev_ans_tags,
            group_by='professionals_id',
            tag_column='questions_tag_name')
        
        # drop duplicates tags from each professionals rows
        professionals_prev_ans_tags['questions_tag_name'] = \
        self._drop_duplicates_tags(
            professionals_prev_ans_tags, 'questions_tag_name')
        
        # finally merge the dataframe with professionals dataframe
        professionals = self._merge_data(
            left_data=professionals,
            left_key='professionals_id',
            right_data=professionals_prev_ans_tags,
            right_key='professionals_id',
            how='left')
        
        # join professionals tags and their answered tags 
        # we replace nan values with ""
        professionals['professional_all_tags'] = (
            professionals[['professionals_tag_name',
                           'questions_tag_name']].apply(
                lambda x: ','.join(x.dropna()),
                axis=1))
        return professionals

    def prepare(
        self,
        professionals,students,
        questions,answers,
        tags,tag_questions,tag_users, questions_score):
        
        """
        This function clean and process 
        CareerVillage Data sets. 
        """
        
        # assign unique integer id
        professionals = self._assign_unique_id(
            professionals, 'professionals_id_num')
        students = self._assign_unique_id(
            students, 'students_id_num')
        questions = self._assign_unique_id(
            questions, 'questions_id_num')
        answers = self._assign_unique_id(
            answers, 'answers_id_num')
        
        # just dropna from tags 
        tags = tags.dropna()
        tags['tags_tag_name'] = tags['tags_tag_name'].str.replace(
            '#', '')
        
        
        # merge necessary datasets
        df_questions, df_professionals, df_merge = self._merge_cv_datasets(
            professionals,students,
            questions,answers,
            tags,tag_questions,tag_users,
            questions_score)
        
        #######################
        # Generate some features for calculates weights
        # that will use with interaction matrix
        #######################
        df_merge['num_ans_per_ques'] = df_merge.groupby(
            ['questions_id'])['answers_id'].transform('count')
        
        # merge pro previoius answered question tags with pro tags 
        df_professionals = self._merge_pro_pre_ans_tags(
            df_professionals, df_merge)
        
        # some more pre-processing 
        # handling null values 
        df_questions['score'] = df_questions['score'].fillna(0)
        df_questions['score'] = df_questions['score'].astype(int)
        df_questions['questions_tag_name'] = \
        df_questions['questions_tag_name'].fillna('No Tag')
        
        # remove duplicates tags from each questions 
        df_questions['questions_tag_name'] = \
        df_questions['questions_tag_name'].str.split(
            ',').apply(set).str.join(',')

        # fill nan with 'No Tag' if any 
        df_professionals['professional_all_tags'] = \
        df_professionals['professional_all_tags'].fillna(
            'No Tag')
        # replace "" with "No Tag", because previously we replace nan with ""
        df_professionals['professional_all_tags'] = \
        df_professionals['professional_all_tags'].replace(
            '', 'No Tag')
        
        df_professionals['professionals_location'] = \
        df_professionals['professionals_location'].fillna(
            'No Location')
        
        df_professionals['professionals_industry'] = \
        df_professionals['professionals_industry'].fillna(
            'No Industry')

        # remove duplicates tags from each professionals
        df_professionals['professional_all_tags'] = \
        df_professionals['professional_all_tags'].str.split(
            ',').apply(set).str.join(',')

        # remove some null values from df_merge
        df_merge['num_ans_per_ques']  = \
        df_merge['num_ans_per_ques'].fillna(0)
        
        return df_questions, df_professionals, df_merge

In [20]:
class LightFMDataPrep:
    def __init__(self):
        pass
    def create_features(self, dataframe, features_name, id_col_name):
        """
        Generate features that will be ready for feeding into lightfm

        Parameters
        ----------
        dataframe: Dataframe
            Pandas Dataframe which contains features
        features_name : List
            List of feature columns name avaiable in dataframe
        id_col_name: String
            Column name which contains id of the question or
            answer that the features will map to.
            There are two possible values for this variable.
            1. questions_id_num
            2. professionals_id_num

        Returns
        -------
        Pandas Series
            A pandas series containing process features
            that are ready for feed into lightfm.
            The format of each value
            will be (user_id, ['feature_1', 'feature_2', 'feature_3'])
            Ex. -> (1, ['military', 'army', '5'])
        """

        features = dataframe[features_name].apply(
            lambda x: ','.join(x.map(str)), axis=1)
        features = features.str.split(',')
        features = list(zip(dataframe[id_col_name], features))
        return features



    def generate_feature_list(self, dataframe, features_name):
        """
        Generate features list for mapping 

        Parameters
        ----------
        dataframe: Dataframe
            Pandas Dataframe for Users or Q&A. 
        features_name : List
            List of feature columns name avaiable in dataframe. 

        Returns
        -------
        List of all features for mapping 
        """
        features = dataframe[features_name].apply(
            lambda x: ','.join(x.map(str)), axis=1)
        features = features.str.split(',')
        features = features.apply(pd.Series).stack().reset_index(drop=True)
        return features
    
    def create_data(self, questions, professionals, merge):
        question_feature_list = self.generate_feature_list(
            questions,
            ['questions_tag_name'])

        professional_feature_list = self.generate_feature_list(
            professionals,
            ['professional_all_tags'])
        
        merge['total_weights'] = 1 / (
            merge['num_ans_per_ques'])
        
        # creating features for feeding into lightfm 
        questions['question_features'] = self.create_features(
            questions, ['questions_tag_name'], 
            'questions_id_num')

        professionals['professional_features'] = self.create_features(
            professionals,
            ['professional_all_tags'],
            'professionals_id_num')
        
        return question_feature_list,\
    professional_feature_list,merge,questions,professionals
        
    def fit(self, questions, professionals, merge):
        ########################
        # Dataset building for lightfm
        ########################
        question_feature_list, \
        professional_feature_list,\
        merge,questions,professionals = \
        self.create_data(questions, professionals, merge)
        
        
        # define our dataset variable
        # then we feed unique professionals and questions ids
        # and item and professional feature list
        # this will create lightfm internel mapping
        dataset = Dataset()
        dataset.fit(
            set(professionals['professionals_id_num']), 
            set(questions['questions_id_num']),
            item_features=question_feature_list, 
            user_features=professional_feature_list)


        # now we are building interactions
        # matrix between professionals and quesitons
        # we are passing professional and questions id as a tuple
        # e.g -> pd.Series((pro_id, question_id), (pro_id, questin_id))
        # then we use lightfm build in method for building interactions matrix
        merge['author_question_id_tuple'] = list(zip(
            merge.professionals_id_num,
            merge.questions_id_num,
            merge.total_weights))

        interactions, weights = dataset.build_interactions(
            merge['author_question_id_tuple'])



        # now we are building our questions and
        # professionals features
        # in a way that lightfm understand.
        # we are using lightfm build in method for building
        # questions and professionals features 
        questions_features = dataset.build_item_features(
            questions['question_features'])

        professional_features = dataset.build_user_features(
            professionals['professional_features'])
        
        return interactions, weights,questions_features,professional_features

In [21]:
class TrainLightFM:
    def __init__(self):
        pass
        
    def train_test_split(self, interactions, weights):
        train_interactions, test_interactions = \
        cross_validation.random_train_test_split(
            interactions, 
            random_state=np.random.RandomState(2019))
        
        train_weights, test_weights = \
        cross_validation.random_train_test_split(
            weights, 
            random_state=np.random.RandomState(2019))
        return train_interactions,\
    test_interactions, train_weights, test_weights
    
    def fit(self, interactions, weights,
            questions_features, professional_features,
            cross_validation=False,no_components=150,
            learning_rate=0.05,
            loss='warp',
            random_state=2019,
            verbose=True,
            num_threads=4, epochs=5):
        ################################
        # Model building part
        ################################

        # define lightfm model by specifying hyper-parametre
        # then fit the model with ineteractions matrix,
        # item and user features
        
        model = LightFM(
            no_components,
            learning_rate,
            loss=loss,
            random_state=random_state)
        model.fit(
            interactions,
            item_features=questions_features,
            user_features=professional_features, sample_weight=weights,
            epochs=epochs, num_threads=num_threads, verbose=verbose)
        
        return model

In [22]:
"""
Recommendations classs: Now we are going to build a class for making recommendations. 
This will make easy for making recommendations in djono api. This recommendations class 
build with extra features. You can use this for general prediction by giving professionals ids 
and questions features. It has another features that let's choose questions from range of two dates 
and make recommendation from those questions.

This is useful because those professionals that choose email frequency lavel as "weekly" or "daily", 
we can select questions from a week and then recommend those questions.
"""

class LightFMRecommendations:
    """
    Make prediction given model and professional ids
    """
    def __init__(self, lightfm_model,
                 professionals_features,
                 questions_features,
                 questions,professionals,merge):
        self.model = lightfm_model
        self.professionals_features = professionals_features
        self.questions_features = questions_features
        self.questions = questions
        self.professionals = professionals
        self.merge = merge
        
    def previous_answered_questions(self, professionals_id):
        previous_q_id_num = (
            self.merge.loc[\
                self.merge['professionals_id_num'] == \
                professionals_id]['questions_id_num'])
        
        previous_answered_questions = self.questions.loc[\
            self.questions['questions_id_num'].isin(
            previous_q_id_num)]
        return previous_answered_questions
        
    
    def _filter_question_by_pro(self, professionals_id):
        """Drop questions that professional already answer"""
        previous_answered_questions = \
        self.previous_answered_questions(professionals_id)
        
        discard_qu_id = \
        previous_answered_questions['questions_id_num'].values.tolist()
        
        questions_for_prediction = \
        self.questions.loc[~self.questions['questions_id_num'].isin(discard_qu_id)]
        
        return questions_for_prediction
    
    def _filter_question_by_date(self, questions, start_date, end_date):
        mask = \
        (questions['questions_date_added'] > start_date) & \
        (questions['questions_date_added'] <= end_date)
        
        return questions.loc[mask]
        
    
    def recommend_by_pro_id_general(self,
                                    professional_id,
                                    num_prediction=8):
        questions_for_prediction = self._filter_question_by_pro(professional_id)
        score = self.model.predict(
            professional_id,
            questions_for_prediction['questions_id_num'].values.tolist(), 
            item_features=self.questions_features,
            user_features=self.professionals_features)
        
        questions_for_prediction['recommendation_score'] = score
        questions_for_prediction = questions_for_prediction.sort_values(
            by='recommendation_score', ascending=False)[:num_prediction]
        return questions_for_prediction
    
    def recommend_by_pro_id_frequency_date_range(self,
                                                 professional_id,
                                                 start_date,
                                                 end_date,
                                                 num_prediction=8):
        questions_for_prediction = \
        self._filter_question_by_pro(professional_id)
        
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
        
        questions_for_prediction = self._filter_question_by_date(
            questions_for_prediction, start_date, end_date)
        
        score = self.model.predict(
            professional_id,
            questions_for_prediction['questions_id_num'].values.tolist(), 
            item_features=self.questions_features,
            user_features=self.professionals_features)
        
        questions_for_prediction['recommendation_score'] = score
        questions_for_prediction = questions_for_prediction.sort_values(
            by='recommendation_score', ascending=False)[:num_prediction]
        return questions_for_prediction

In [23]:
# instiate all class instance
cv_data_prep = CareerVillageDataPreparation()
light_fm_data_prep = LightFMDataPrep()
train_lightfm = TrainLightFM()

# process raw data
df_questions_p, df_professionals_p, df_merge_p = \
cv_data_prep.prepare(
    df_professionals,df_students,
    df_questions,df_answers,
    df_tags,df_tag_questions,df_tag_users,
    df_question_scores)


# prepare data for lightfm 
interactions, weights, \
questions_features, professional_features = \
light_fm_data_prep.fit(
    df_questions_p, df_professionals_p, df_merge_p)


# finally build and trian our model
model = train_lightfm.fit(interactions,
                          weights,
                          questions_features,
                          professional_features)

Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4


In [24]:
# define our recommender class
lightfm_recommendations = LightFMRecommendations(
    model,
    professional_features,questions_features,
    df_questions_p, df_professionals_p, df_merge_p)

# let's what our model predict for user id 3
print("Recommendation for professional: " + str(3))
display(lightfm_recommendations.recommend_by_pro_id_general(3)[:8])

Recommendation for professional: 3


Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,questions_id_num,tag_questions_question_id,questions_tag_name,id,score,students_id,students_location,students_date_joined,students_id_num,question_features,recommendation_score
2423,9515b833b2ac4092a8b1a8cdb380781f,941ae126a59745fa9b4556293b38c1fb,2019-01-08 20:47:44+00:00,How long does it take to become a Detective?,#law #criminal-justice #lawyer #police #law-en...,2423,9515b833b2ac4092a8b1a8cdb380781f,"law,criminal-justice,police,law-enforcement,la...",9515b833b2ac4092a8b1a8cdb380781f,2,941ae126a59745fa9b4556293b38c1fb,"Oakland, California",2019-01-08 20:35:58+00:00,30755.0,"(2423, [law, criminal-justice, police, law-enf...",-2.352616
17184,570ca25a625d461abffac230ea110db5,941ae126a59745fa9b4556293b38c1fb,2019-01-10 01:48:47+00:00,What types of Detectives are there?,#law #criminal-justice #lawyer #law-enforcemen...,17184,570ca25a625d461abffac230ea110db5,"law,criminal-justice,police,law-enforcement,la...",570ca25a625d461abffac230ea110db5,2,941ae126a59745fa9b4556293b38c1fb,"Oakland, California",2019-01-08 20:35:58+00:00,30755.0,"(17184, [law, criminal-justice, police, law-en...",-2.392477
9778,776e22d9eb1045eb8a9771eb015e8ddf,d7601a6cc1d04e61aaa16c95cbd0b128,2018-10-03 14:04:13+00:00,I want to be a police officer or a police disp...,#police-officer #law #law-enforcement #crimina...,9778,776e22d9eb1045eb8a9771eb015e8ddf,"law,criminal-justice,police,law-enforcement,po...",776e22d9eb1045eb8a9771eb015e8ddf,2,d7601a6cc1d04e61aaa16c95cbd0b128,"Olney, Illinois",2018-10-03 14:01:25+00:00,29951.0,"(9778, [law, criminal-justice, police, law-enf...",-2.759542
18872,c3e6e57cb27b4134be9b8608a711e2fc,43f813594dd44e16843ecae4e2362ead,2015-03-23 21:17:34+00:00,What majors would fit a law enforcement career?,Im asking this question because I've heard tha...,18872,c3e6e57cb27b4134be9b8608a711e2fc,"law,law-enforcement,police",c3e6e57cb27b4134be9b8608a711e2fc,4,43f813594dd44e16843ecae4e2362ead,"Los Angeles, California",2015-03-23 21:09:01+00:00,3322.0,"(18872, [law, law-enforcement, police])",-2.781396
20003,36a31f62e40748bfa9feb069082086d7,59e78bc12f624fc9bc8ee1c8878f34b2,2017-10-24 21:04:39+00:00,how many criminal psychologist jobs are our th...,I want to be one. #criminal-justice #psycholog...,20003,36a31f62e40748bfa9feb069082086d7,"criminal-justice,psychology,law-enforcement",36a31f62e40748bfa9feb069082086d7,3,59e78bc12f624fc9bc8ee1c8878f34b2,"Fontana, California",2017-10-24 20:56:11+00:00,22222.0,"(20003, [criminal-justice, psychology, law-enf...",-2.793029
18947,cdd0b274ec8f4122a39989707342ccfe,8a8305d32bd144d5877842dcabdfb6d7,2016-05-05 15:39:53+00:00,"Could I go straight into Law Enforcememt, when...",I am an explorer and is trying to set my caree...,18947,cdd0b274ec8f4122a39989707342ccfe,"law,law-enforcement,police",cdd0b274ec8f4122a39989707342ccfe,4,8a8305d32bd144d5877842dcabdfb6d7,"Laurinburg, North Carolina",2016-05-02 16:37:52+00:00,7103.0,"(18947, [law, law-enforcement, police])",-2.825045
16214,ccb15b06a96a4bcfb4d5844550af25cc,8a8305d32bd144d5877842dcabdfb6d7,2016-05-04 16:32:58+00:00,"Do you go to college, then B.L.E.T( Basic Law ...",I am an explorer and is trying to set my caree...,16214,ccb15b06a96a4bcfb4d5844550af25cc,"law,law-enforcement,police",ccb15b06a96a4bcfb4d5844550af25cc,2,8a8305d32bd144d5877842dcabdfb6d7,"Laurinburg, North Carolina",2016-05-02 16:37:52+00:00,7103.0,"(16214, [law, law-enforcement, police])",-2.851564
1936,c544db1b7cda482497adcf059f36e709,f0eeb9fe04884944b5abffe2b7011b84,2017-02-08 19:27:57+00:00,What degrees do you have to have in order to g...,Law enforcement is a career I am interested in...,1936,c544db1b7cda482497adcf059f36e709,"criminal-justice,law-enforcement,police",c544db1b7cda482497adcf059f36e709,4,f0eeb9fe04884944b5abffe2b7011b84,"Gibson, Louisiana",2017-02-03 18:29:06+00:00,17965.0,"(1936, [criminal-justice, law-enforcement, pol...",-2.853464


In [25]:
# also let's see what our model predicts for professional 3
# given questions between two dates
print("Recommendations for professionals (question from 2016-1-1 to 2016-12-31): " + str(3))
display(lightfm_recommendations.recommend_by_pro_id_frequency_date_range(3, '2016-1-1','2016-12-31')[:8])

Recommendations for professionals (question from 2016-1-1 to 2016-12-31): 3


TypeError: Cannot compare tz-naive and tz-aware datetime-like objects

Awesome! We can see our recommendations. Also, we can see, the new recommendation class has a method for recommending questions by a frequency of date. This is very helpful for recommending questions to professionals that have set their email frequency to daily or weekly.


# Conclusion

Idea that I tried but don't implemented in this notebook:

Adding location features: I tried adding location features but somehow it decreases model AUC score to 91% to 84%. That's why I don't use that features.
Adding dates and hearts data: I also tried that but it doesn't improve AUC score.
Correcting spelling error: I tried this method and successfully implemented it. But this is really slow. For that reason, I exluded it.
Idea that I think is important don't implemented in this notebook:

Adding professionals industry and title as a features. This will inhance our model diversity and will increase overall recommendations score.
CareerVillage should auto correct the hashtags for students questions asking time. This will help the model to match the tags more efficiently.
For those professionals those have choosen email frequency to immediete, we can create another same model just exchange user/item features. I mean train our model by giving questions as users and professionals as items. In this way, we can predict professionals by giving a questions. So that it helps to target daily frequency professionals.
Finally we came to end. I want give you a big thank you for reading this notebook. I have provided a very good recommender system for CareerVillage in the notebook. If you find any mistakes or have any suggestions feel free to comment. And don't forget to upvote. Good luck!

References:

[1] Improving Pairwise Learning for Item Recommendation from Implicit Feedback

[2] Content-based filtering

[3] What Is Content-Based Filtering?

[4] What Is Collaborative Filtering?

[5] Collaborative filtering

[6] LightFM model documentation

[7] Metadata Embeddings for User and Item Cold-start Recommendations

# References

https://www.kaggle.com/niyamatalmass/lightfm-hybrid-recommendation-system