# Methodology
This noteook illustrates the idea of recommending professionals based the similairty between the question and professionals' previous answers or user information. The professional are divided into two categories.

## Ref
The recommendation system idea based on similary is based on Daniel Becker's [Kernal](https://www.kaggle.com/danielbecker/careervillage-org-recommendation-engine)

## Categories
- Prof: The professor has answered questions
- New prof: The professor is new or did not answer any question yet
### For the Prof (provided answered before)
The similarity is based on their previous answers and the new question
### For the New prof(never answered a question)
The similarity is based on their user information, including tags and the new question


# Steps for recommending a Prof:

1. Calculate the tf-idf for the query text and all the questions
2. Use the cosine similiarty to get similiar questions for the query text.
3. Get the answers and professionals for the similar questions.
4. Make a recommendation to fit the best professionals to answer the new question.

In [1]:
# Helper libraries
import os
import math
import warnings
import pickle
import re

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords

In [2]:
# Read Data
files = [(f[:-4],"../input/%s"%f) for f in os.listdir("../input")]
files = dict(files)
print(os.listdir("../input"))

def read_csv(name):
    if name not in files:
        print("error")
        return None    
    return pd.read_csv(files[name])

# load data
questions = read_csv("questions")
answers = read_csv("answers")
professionals = read_csv("professionals")

['emails.csv', 'questions.csv', 'professionals.csv', 'comments.csv', 'tag_users.csv', 'group_memberships.csv', 'tags.csv', 'answer_scores.csv', 'students.csv', 'groups.csv', 'tag_questions.csv', 'question_scores.csv', 'matches.csv', 'answers.csv', 'school_memberships.csv']


# Preprocess the data

### The following lines are used for preprocessing data. If no change made, you can just load the presaved data

In [3]:
### Read questions data
q = questions.copy()
q['qtext'] = list(q.apply(lambda x:'%s %s' %(x['questions_title'],x['questions_body']), axis=1))
q = q.drop(['questions_author_id','questions_date_added','questions_body','questions_title'],axis=1)

### Read answers data
a = answers.copy()
a = a.drop(['answers_date_added'],axis=1)
uri_re = r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
tag_re = r'<\/?[a-z]+>'

def strip_html_tag(s):
    temp = re.sub(uri_re, ' ', str(s))
    return re.sub(tag_re, ' ', temp)

a['answers_body'] = a['answers_body'].apply(strip_html_tag)
a.rename(columns={'answers_question_id':'questions_id',
                  'answers_author_id':'professionals_id',
                  "answers_body":"atext"},inplace=True)
          
### Read professionals data
p = professionals.copy()
# Connect these professionals with tags and user_tag tables
tags = read_csv("tags")
tag_users = read_csv("tag_users")

tag_users = tag_users[tag_users["tag_users_user_id"].isin(p['professionals_id'])]
tag_users = tag_users.merge(tags, how="left", left_on="tag_users_tag_id", right_on = "tags_tag_id")

t = tag_users.groupby(["tag_users_user_id"])['tags_tag_name'].apply(lambda x: ' '.join(x))
t = t.to_frame().reset_index()

p = p.merge(t, how="left", left_on="professionals_id", right_on = "tag_users_user_id")

# get text
temp = p[['professionals_location','professionals_industry','professionals_headline','tags_tag_name']].fillna('')

p['ptext'] = temp['professionals_location']+" "+temp['professionals_industry']+" "+\
             temp['professionals_headline']+" "+temp['tags_tag_name']

p = p.drop(['professionals_location','professionals_industry','professionals_headline',\
           'professionals_date_joined','tag_users_user_id','tags_tag_name'],axis=1)

print("Number of professionals: ",len(p['ptext']),",with ",sum(p['ptext']!="   ")," of them having data" )

Number of professionals:  28152 ,with  27770  of them having data


In [4]:
# Remove general words
noise = ['school','would','like', 'want', 'dont', 
         'become','sure','go', 'get', 'college', 
         'career', 'wanted', 'im', 'ing', 'ive',
         'know', 'high', 'becom', 'job', 'best',
         'day', 'hi', 'name', 'help', 'people',
         'year', 'years', 'next', 'interested', 
         'question', 'questions', 'take', 'even',
         'though', 'please', 'tell']

stop = stopwords.words('english')

def remove_general_words(df, col):
    df[col] = df[col].str.lower().str.split() # convert all str to lowercase    
    df[col] = df[col].apply(lambda x: [item for item in x if item not in stop]) # remove stopwords
    df[col] = df[col].apply(lambda x: [item for item in x if item not in noise]) # remove general words
    df[col] = df[col].apply(' '.join) # convert list to str
    return df

q = remove_general_words(q,'qtext')
p = remove_general_words(p,'ptext')
a = remove_general_words(a,'atext')

In [5]:
'''
# Save the data
pickle.dump(q, open('../input/q.p', 'wb'))
pickle.dump(a, open('../input/a.p', 'wb'))
pickle.dump(p, open('../input/p.p', 'wb'))
'''

OSError: [Errno 30] Read-only file system: '../input/q.p'

In [6]:
'''
# Read saved data
q = pickle.load(open('../input/q.p', mode='rb'))
a = pickle.load(open('../input/a.p', mode='rb'))
p = pickle.load(open('../input/p.p', mode='rb'))
'''

"\n# Read saved data\nq = pickle.load(open('../input/q.p', mode='rb'))\na = pickle.load(open('../input/a.p', mode='rb'))\np = pickle.load(open('../input/p.p', mode='rb'))\n"

In [7]:
def get_similar_docs(corpus, query_text, threshold=0.0, top=10):
    tfidf = TfidfVectorizer(ngram_range=(1,3), stop_words = 'english', max_features = 500, max_df=0.9)
    corpus_tfidf = tfidf.fit_transform(corpus)
    text_tfidf = tfidf.transform([query_text])
    sim = cosine_similarity(corpus_tfidf, text_tfidf)
    sim_idx = (sim >= threshold).nonzero()[0]
    result = pd.DataFrame({'similarity':sim[sim_idx].reshape(-1,),
                          'text':corpus[sim_idx]},
                          index=sim_idx)
    result = result.sort_values(by=['similarity'], ascending=False).head(top)
    return result

In [8]:
# Example
corpus = q['qtext']
query_text = corpus[2]
print('Example 1 Question:\n', query_text)
sim_questions = get_similar_docs(corpus, query_text)

Example 1 Question:
 going abroad first increase chances jobs back home? i'm planning going abroad first job. teaching serious ideas. working stay home instead i'm assuming staying leaving makeba huge difference care about, unless find something first job. think ways going abroad seen good bad. side respectable employers willl side with. #working-abroad #employment- #overseas


In [9]:
sim_questions

Unnamed: 0,similarity,text
2,1.0,going abroad first increase chances jobs back ...
5590,0.667993,scope b sc (zoology) abroad ? #any
6551,0.586143,look teaching abroad experience? enjoy tutorin...
10747,0.579467,what’s hardest part studying abroad student be...
8972,0.575461,"studying abroad summer valencia, spain wonderi..."
20681,0.569156,meaningful trips abroad experience budget? tra...
18806,0.566032,cashier; settle abroad suggest steps take? com...
21399,0.55279,learning abroad give leg amongst future compet...
17864,0.544714,studying abroad change perspective? going coll...
12330,0.537329,merits studying abroad? second student college...


## Merges the questions with the corresponding answers

In [10]:
def get_questions_answers(sim_questions):  
    sim_q_a = sim_questions.merge(q, left_index=True, right_index=True).merge(a)
    return sim_q_a

In [11]:
sim_q_a = get_questions_answers(sim_questions)
sim_q_a.head()

Unnamed: 0,similarity,text,questions_id,qtext,answers_id,professionals_id,atext
0,1.0,going abroad first increase chances jobs back ...,4ec31632938a40b98909416bdd0decff,going abroad first increase chances jobs back ...,1a6b3749d391486c9e371fbd1e605014,7e72a630c303442ba92ff00e8ea451df,work global company values highly internationa...
1,0.667993,scope b sc (zoology) abroad ? #any,89103412bd0f433996599b67c4d2ff5d,scope b sc (zoology) abroad ? #any,0663b9f6f90a485787865282a8aadb28,036358eed7044c66a6058d1006fedbad,
2,0.667993,scope b sc (zoology) abroad ? #any,89103412bd0f433996599b67c4d2ff5d,scope b sc (zoology) abroad ? #any,1885ba1004b540be8cf29496beb9fdea,58fa5e95fe9e480a9349bbb1d7faaddb,"sameer, could rephrase question?"
3,0.586143,look teaching abroad experience? enjoy tutorin...,0f12d98bfe5d4aef862bc77f8b40e736,look teaching abroad experience? enjoy tutorin...,2210b91d822c4c279cf31222b0d92e4a,ce169b8f116243849edd246fef5a204b,education key success!
4,0.579467,what’s hardest part studying abroad student be...,13249144c97f4d5e8e69efba68cdb4cc,what’s hardest part studying abroad student be...,44853686b07549e4adc2ad1c8a4aa8cb,8aca53dfdf6e4c2c85a4f0eacd996a46,"athena, congrats getting trinity! truly except..."


## Get the top recommended professionals based on questions

In [12]:
def get_recommendation(df, top_n=5):
    temp_values = df['similarity']/df['questions_id'].apply(lambda x: df.groupby('questions_id').size()[x])
    temp_values = temp_values * df['professionals_id'].apply(lambda x: df.groupby('professionals_id').size()[x])
    df["value"] = temp_values
    top_prof = df.groupby('professionals_id').sum().reset_index().sort_values('value', ascending = False).head(top_n)
    top_prof = top_prof[['professionals_id', 'value']]
    top_prof.columns = ['professional', 'recommendation_score']
    print(top_prof)        

In [13]:
get_recommendation(sim_q_a)

                        professional  recommendation_score
10  7e72a630c303442ba92ff00e8ea451df              1.000000
13  a2a3490912f9435b82172850e907c1c7              0.913240
15  ce169b8f116243849edd246fef5a204b              0.586143
16  e44c2855fd6649ae869fe145d167cbe4              0.575461
2   211e39ac71694bb5af233b205a0f2057              0.566032


# Steps for recommending a new prof

When a new professor has not answered any question, the above method will not be able to pair him to the new questions. We will use prof's information and tags in this case. 

1. Filter out the new professionals
2. Rank the similarity between the new question and professionals' information
3. Make a recommendation to fit the best professionals to answer the new question.

In [14]:
# find the new professionals
pa = p.merge(a, how = "left")
p_new = p[pa["questions_id"].isnull()]
print ("There are",p_new.shape[0],"new professionals")

There are 4078 new professionals


  This is separate from the ipykernel package so we can avoid doing imports until


In [15]:
# Example
corpus_p = p['ptext']
print('Example 2 Question:\n', query_text)
sim_professionals = get_similar_docs(corpus_p, query_text, top=2)

Example 2 Question:
 going abroad first increase chances jobs back home? i'm planning going abroad first job. teaching serious ideas. working stay home instead i'm assuming staying leaving makeba huge difference care about, unless find something first job. think ways going abroad seen good bad. side respectable employers willl side with. #working-abroad #employment- #overseas


In [16]:
# Recommend new professiors
rec_new = sim_professionals.merge(p, left_index=True, right_index=True)[['professionals_id', 'similarity']]
rec_new.columns = ['professional', 'recommendation_score']
rec_new

Unnamed: 0,professional,recommendation_score
26052,5f472f31cb784afea23dfa9e8eef4661,0.584398
27039,b8e73f39630044b19514db3608863805,0.569107
