In this kernel, we try to recommend the most suitable professional for a given question. To do so, we follow the approach of [Antons Ruberts](https://www.kaggle.com/antonsruberts/sentence-embeddings-centorid-method-vs-doc2vec): we build sentence-level embeddings so that similar questions end up close to each other.

But first, let's have a look at the available data.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

from IPython.display import HTML

import matplotlib.pyplot as plt

from nltk.corpus import stopwords
import gensim

from gensim.utils import simple_preprocess
from gensim.models import FastText

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer

from functools import partial
import random

from ipywidgets import interact

# None disables column truncation; the older -1 value is deprecated in recent pandas.
pd.set_option('display.max_colwidth', None)

The following files are available for the task at hand:

In [2]:
!ls '../input'

answer_scores.csv      groups.csv	    school_memberships.csv
answers.csv	       matches.csv	    students.csv
comments.csv	       professionals.csv    tag_questions.csv
emails.csv	       question_scores.csv  tag_users.csv
group_memberships.csv  questions.csv	    tags.csv


The main data sources for this kernel are the questions and their corresponding answers, so let's load them. We use BeautifulSoup to strip the HTML that appears in some of the questions and answers.

In [3]:
def get_text(text):
    try:
        soup = BeautifulSoup(text, 'lxml')
        return soup.get_text()
    except Exception as e:
        print(text)
        raise e
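
To make the effect of this cleaning step concrete, here is a minimal sketch that strips tags using only the standard library's `html.parser` instead of BeautifulSoup. It is a swapped-in technique for illustration; `strip_html` and `_TextExtractor` are hypothetical names that do not appear elsewhere in this kernel.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_html(text):
    """Return `text` with tags removed; plain text passes through unchanged."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(strip_html('<p>Any <b>help</b>?</p> #college'))
```

Hashtags such as `#college` are ordinary text, so they survive the cleaning and end up in the tokenized question later on.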

In [4]:
questions = pd.read_csv('../input/questions.csv', parse_dates=['questions_date_added'])
questions['questions_body'] = questions['questions_body'].apply(get_text)

display(questions.head(5))

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26,Teacher career question,What is a maths teacher? what is a maths teacher useful? #college #professor #lecture
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25,I want to become an army officer. What can I do to become an army officer?,I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question #military #army
2,4ec31632938a40b98909416bdd0decff,f2c179a563024ccc927399ce529094b5,2017-02-08 19:13:38,Will going abroad for your first job increase your chances for jobs back home?,"I'm planning on going abroad for my first job. It will be a teaching job and I don't have any serious career ideas. I don't know what job I would be working if I stay home instead so I'm assuming staying or leaving won't makeba huge difference in what I care about, unless I find something before my first job. I can think of ways that going abroad can be seen as good and bad. I do not know which side respectable employers willl side with. #working-abroad #employment- #overseas"
3,2f6a9a99d9b24e5baa50d40d0ba50a75,2c30ffba444e40eabb4583b55233a5a4,2017-09-01 14:05:32,"To become a specialist in business management, will I have to network myself?",i hear business management is a hard way to get a job if you're not known in the right areas. #business #networking
4,5af8880460c141dbb02971a1a8369529,aa9eb1a2ab184ebbb00dc01ab663428a,2017-09-01 02:36:54,Are there any scholarships out there for students that are first generation and live in GA?,I'm trying to find scholarships for first year students but they all seem to be for other states besides GA. Any help??\r\n\r\n#college\r\n#scholarships \r\n#highschoolsenior \r\n#firstgeneration \r\n


Now we need to do some basic preprocessing. For now, we won't do anything complex but more might be added later.

In [5]:
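stopword = stopwords.words('english')  # loaded for later use; not applied yet

# Join title and body with a space so the last title word and the first
# body word are not fused into a single token.
questions['text'] = questions['questions_title'] + ' ' + questions['questions_body']
questions['text_list'] = questions['text'].apply(simple_preprocess)
questions['text'] = questions['text_list'].apply(' '.join)

display(questions.head(5))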

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,text,text_list
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26,Teacher career question,What is a maths teacher? what is a maths teacher useful? #college #professor #lecture,teacher career questionwhat is maths teacher what is maths teacher useful college professor lecture,"[teacher, career, questionwhat, is, maths, teacher, what, is, maths, teacher, useful, college, professor, lecture]"
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25,I want to become an army officer. What can I do to become an army officer?,I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question #military #army,want to become an army officer what can do to become an army officer am priyanka from bangalore now am in th std when go to college should not get confused on what want to take to become army officer so am asking this question military army,"[want, to, become, an, army, officer, what, can, do, to, become, an, army, officer, am, priyanka, from, bangalore, now, am, in, th, std, when, go, to, college, should, not, get, confused, on, what, want, to, take, to, become, army, officer, so, am, asking, this, question, military, army]"
2,4ec31632938a40b98909416bdd0decff,f2c179a563024ccc927399ce529094b5,2017-02-08 19:13:38,Will going abroad for your first job increase your chances for jobs back home?,"I'm planning on going abroad for my first job. It will be a teaching job and I don't have any serious career ideas. I don't know what job I would be working if I stay home instead so I'm assuming staying or leaving won't makeba huge difference in what I care about, unless I find something before my first job. I can think of ways that going abroad can be seen as good and bad. I do not know which side respectable employers willl side with. #working-abroad #employment- #overseas",will going abroad for your first job increase your chances for jobs back home planning on going abroad for my first job it will be teaching job and don have any serious career ideas don know what job would be working if stay home instead so assuming staying or leaving won makeba huge difference in what care about unless find something before my first job can think of ways that going abroad can be seen as good and bad do not know which side respectable employers willl side with working abroad employment overseas,"[will, going, abroad, for, your, first, job, increase, your, chances, for, jobs, back, home, planning, on, going, abroad, for, my, first, job, it, will, be, teaching, job, and, don, have, any, serious, career, ideas, don, know, what, job, would, be, working, if, stay, home, instead, so, assuming, staying, or, leaving, won, makeba, huge, difference, in, what, care, about, unless, find, something, before, my, first, job, can, think, of, ways, that, going, abroad, can, be, seen, as, good, and, bad, do, not, know, which, side, respectable, employers, willl, side, with, working, abroad, employment, overseas]"
3,2f6a9a99d9b24e5baa50d40d0ba50a75,2c30ffba444e40eabb4583b55233a5a4,2017-09-01 14:05:32,"To become a specialist in business management, will I have to network myself?",i hear business management is a hard way to get a job if you're not known in the right areas. #business #networking,to become specialist in business management will have to network myself hear business management is hard way to get job if you re not known in the right areas business networking,"[to, become, specialist, in, business, management, will, have, to, network, myself, hear, business, management, is, hard, way, to, get, job, if, you, re, not, known, in, the, right, areas, business, networking]"
4,5af8880460c141dbb02971a1a8369529,aa9eb1a2ab184ebbb00dc01ab663428a,2017-09-01 02:36:54,Are there any scholarships out there for students that are first generation and live in GA?,I'm trying to find scholarships for first year students but they all seem to be for other states besides GA. Any help??\r\n\r\n#college\r\n#scholarships \r\n#highschoolsenior \r\n#firstgeneration \r\n,are there any scholarships out there for students that are first generation and live in ga trying to find scholarships for first year students but they all seem to be for other states besides ga any help college scholarships firstgeneration,"[are, there, any, scholarships, out, there, for, students, that, are, first, generation, and, live, in, ga, trying, to, find, scholarships, for, first, year, students, but, they, all, seem, to, be, for, other, states, besides, ga, any, help, college, scholarships, firstgeneration]"
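
Fused tokens such as `questionwhat` in the output above appear when the title and body are concatenated without a separator. The sketch below reproduces this with a rough stand-in for `simple_preprocess` (lowercase alphabetic tokens of 2 to 15 characters) and also shows what a stopword filter would remove. The `tokenize` and `remove_stopwords` helpers and the tiny stopword set are assumptions for illustration, not part of gensim or NLTK.

```python
import re

# Tiny illustrative stopword set; the notebook loads NLTK's English list instead.
STOPWORDS = {'is', 'a', 'what', 'to', 'the'}


def tokenize(text, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess: lowercase alphabetic tokens of bounded length."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that is in the stopword set."""
    return [t for t in tokens if t not in stopwords]


title = 'Teacher career question'
body = 'What is a maths teacher?'

fused = tokenize(title + body)         # no separator: 'question' and 'What' merge
spaced = tokenize(title + ' ' + body)  # a separator keeps the two words apart

print(fused)
print(spaced)
print(remove_stopwords(spaced))
```

Since the TF-IDF weights already down-weight very frequent words, explicit stopword removal may matter less here than the separator fix, but that would need to be checked on the real data.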


<h2> Finding similar questions </h2>

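Now let's train embeddings on the questions. These embeddings allow us to find the most similar questions.

In [6]:
emb_size = 100
# Passing the corpus to the constructor builds the vocabulary and already trains
# for the default number of epochs; train() then continues for another 50 epochs.
# `size` is the gensim 3.x argument name (gensim 4 renamed it to `vector_size`).
model_question = FastText(questions['text_list'], size=emb_size, window=6, sg=1, workers=4)
model_question.train(questions['text_list'], total_examples=len(questions.index), epochs=50)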

In [7]:
vect_question = TfidfVectorizer(min_df=model_question.vocabulary.min_count)
tfidf_question = vect_question.fit_transform(questions['text'])

In [10]:
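def get_sentence_embedding(m, tfidf, vectorizer, emb_size=100):
    """Return one TF-IDF-weighted average word vector per row of `tfidf`."""
    # Column i of `wordvecs` holds the word vector of vocabulary term i.
    wordvecs = np.zeros((emb_size, tfidf.shape[-1]))
    for i, name in enumerate(vectorizer.get_feature_names()):
        wordvecs[:, i] = m.wv[name]

    # Weighted sum of word vectors, divided by the total weight of each row;
    # the small constant avoids division by zero for rows without known terms.
    emb = tfidf @ wordvecs.T
    emb = emb / (tfidf.sum(axis=1) + 1e-10)

    return emb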
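
The function above computes, for each sentence, the weighted centroid `sum(w_i * v_i) / sum(w_i)`, where `v_i` is the word vector of term `i` and `w_i` its TF-IDF weight. A minimal pure-Python sketch of the same arithmetic for a single sentence follows; the 2-dimensional vectors and the weights are made-up values, and `weighted_centroid` is a hypothetical helper.

```python
def weighted_centroid(weights, vectors):
    """Average the word vectors, each scaled by its weight, over the total weight."""
    dim = len(next(iter(vectors.values())))
    total = sum(weights.values()) + 1e-10  # same guard against empty rows as above
    centroid = [0.0] * dim
    for word, w in weights.items():
        for k, value in enumerate(vectors[word]):
            centroid[k] += w * value
    return [c / total for c in centroid]


# Hypothetical 2-d word vectors and TF-IDF weights for one sentence.
vectors = {'army': [1.0, 0.0], 'officer': [0.0, 1.0], 'the': [1.0, 1.0]}
weights = {'army': 0.6, 'officer': 0.3, 'the': 0.1}

emb = weighted_centroid(weights, vectors)
print(emb)
```

The low-weight word `the` barely moves the centroid, which is the reason for weighting by TF-IDF instead of taking a plain average.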

In [11]:
sen_emb = get_sentence_embedding(model_question, tfidf_question, vect_question, 100)

In [22]:
# Fit the index once instead of on every slider change.
nn_questions = NearestNeighbors(n_neighbors=6, metric='cosine')
nn_questions.fit(sen_emb)

@interact
def get_similar_question(x=200):
    # The query question itself comes back first, with distance 0.
    dist, idxs = nn_questions.kneighbors(sen_emb[x])

    sim_questions = questions.loc[idxs[0], ['questions_id', 'questions_author_id', 'questions_title', 'questions_body']].copy()
    sim_questions['Score'] = dist[0]  # cosine distance: lower means more similar

    return sim_questions
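
The lookup above ranks all questions by cosine distance (one minus cosine similarity) to the query embedding. As a sketch of what that ranking does, here is a brute-force version in plain Python on made-up 2-dimensional embeddings; `cosine_distance` and `nearest` are hypothetical helpers, not scikit-learn functions.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, candidates, k=3):
    """Return (index, distance) pairs for the k candidates closest to the query."""
    scored = [(i, cosine_distance(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up embeddings; row 0 is used as the query.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

result = nearest(embeddings[0], embeddings, k=3)
print(result)
```

Because the query is part of the indexed set, it is always its own nearest neighbour, so `n_neighbors=6` yields the query plus five other questions.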


<h2> Questions based on tags </h2>
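In case we can't find a similar question, we have to look for a professional who can answer the question at hand. For this, we can create a training set based on previous question/answer pairs. As a first step, we describe each professional by the tags they follow and the types of groups they belong to.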

In [16]:
profs = pd.read_csv('../input/professionals.csv')
tag_users = pd.read_csv('../input/tag_users.csv')
tags = pd.read_csv('../input/tags.csv')
tags['tags_tag_name'] = tags['tags_tag_name'].fillna(' ').apply(get_text)

tag_users = tag_users.merge(tags, left_on='tag_users_tag_id', right_on='tags_tag_id')
tag_users['tags_tag_name'] += ' '
tag_users = tag_users.groupby('tag_users_user_id')['tags_tag_name'].sum().to_frame()

group_memberships = pd.read_csv('../input/group_memberships.csv')
groups = pd.read_csv('../input/groups.csv')

group_memberships = group_memberships.merge(groups, left_on='group_memberships_group_id', right_on='groups_id')
group_memberships['groups_group_type'] += ' '
group_memberships = group_memberships.groupby('group_memberships_user_id')['groups_group_type'].sum().to_frame()

# Inner joins: professionals without any tag or without any group membership are dropped.
profs = profs.merge(tag_users, left_on='professionals_id', right_on='tag_users_user_id')
profs = profs.merge(group_memberships, left_on='professionals_id', right_on='group_memberships_user_id')
profs['info'] = profs['tags_tag_name'] + ' ' + profs['groups_group_type']
profs = profs.drop(['professionals_location', 'professionals_date_joined'], axis=1)

profs['info_list'] = profs['info'].apply(simple_preprocess)
profs['info'] = profs['info_list'].apply(lambda x: ' '.join(x))

display(profs.sample(5))



Unnamed: 0,professionals_id,professionals_industry,professionals_headline,tags_tag_name,groups_group_type,info,info_list
301,42e9647b83164e09a9df8724b3f62e63,Security and Investigations,Web Developer | Woman in Tech | Avid Volunteer,computer-software computer-science college internships web-development computer-programming women-in-tech women-in-stem women-in-engineering computerscience security-and-investigations womenintech womeninstem computerprogramming,cause professional network,computer software computer science college internships web development computer programming women in tech women in stem women in engineering computerscience security and investigations womenintech womeninstem cause professional network,"[computer, software, computer, science, college, internships, web, development, computer, programming, women, in, tech, women, in, stem, women, in, engineering, computerscience, security, and, investigations, womenintech, womeninstem, cause, professional, network]"
135,7f5bc9e50fee45858f138c2f9308f890,Telecommunications,NOC Video Hub Tech,video spanish networking telecommunications customer-service visio information-technology network-security technical-support cybersecurity fiber-optics cyber bilingual-spanish it-support,youth program,video spanish networking customer service visio information technology network security technical support cybersecurity fiber optics cyber bilingual spanish it support youth program,"[video, spanish, networking, customer, service, visio, information, technology, network, security, technical, support, cybersecurity, fiber, optics, cyber, bilingual, spanish, it, support, youth, program]"
336,1cdad970105449c4af632527125a1725,,Recruiter at Opya | Talent Acquisition Enthusiast,"resume-writing interviewing-skills health,-wellness-and-fitness recruiting interviews women-in-tech women resume recruitment women-in-business salary-negotiation cover-letters",cause,resume writing interviewing skills health wellness and fitness recruiting interviews women in tech women resume recruitment women in business salary negotiation cover letters cause,"[resume, writing, interviewing, skills, health, wellness, and, fitness, recruiting, interviews, women, in, tech, women, resume, recruitment, women, in, business, salary, negotiation, cover, letters, cause]"
456,69b9668a07da4668b7e8fa079c76d197,Mental Health Care,Intensive Care Coordinator at Lahey Health Behavioral Services,college counseling college-major education graduate-school healthcare internships sociology hospital group-therapy community clinical mental-health-care school inpatient therapy coordination programs children masters outpatient socialwork trauma mentalhealth connecticut privatepractice massachusetts families adolescents umass macro socialissues socialchange,club mentorship program,college counseling college major education graduate school healthcare internships sociology hospital group therapy community clinical mental health care school inpatient therapy coordination programs children masters outpatient socialwork trauma mentalhealth connecticut privatepractice massachusetts families adolescents umass macro socialissues socialchange club mentorship program,"[college, counseling, college, major, education, graduate, school, healthcare, internships, sociology, hospital, group, therapy, community, clinical, mental, health, care, school, inpatient, therapy, coordination, programs, children, masters, outpatient, socialwork, trauma, mentalhealth, connecticut, privatepractice, massachusetts, families, adolescents, umass, macro, socialissues, socialchange, club, mentorship, program]"
132,d9803d63cfd94b47bf3c88ba54519193,Telecommunications,Tech Advisor,telecommunications information-technology careers,youth program,information technology careers youth program,"[information, technology, careers, youth, program]"


In [36]:
#model_profs = FastText(profs['info_list'], size=emb_size, window=8, sg=1, workers=4)
#model_profs.train(profs['info_list'], total_examples=len(profs.index), epochs=50)

#vect_profs = TfidfVectorizer(min_df=model_profs.vocabulary.min_count)
tfidf_profs = vect_question.transform(profs['info'])

In [38]:
prof_emb = get_sentence_embedding(model_question, tfidf_profs, vect_question, 100)

In [39]:
# Fit the index on the professionals once instead of on every slider change.
nn_profs = NearestNeighbors(n_neighbors=6, metric='cosine')
nn_profs.fit(prof_emb)

@interact
def get_closest_professional(x=200):
    dist, idxs = nn_profs.kneighbors(sen_emb[x])

    question = questions.loc[x, ['questions_id', 'questions_author_id', 'questions_title', 'questions_body']].to_frame()
    display(question)

    closest_profs = profs.loc[idxs[0], ['professionals_id', 'professionals_industry', 'professionals_headline', 'info']].copy()
    closest_profs['Score'] = dist[0]  # cosine distance: lower means a closer match

    return closest_profs


It looks like we are able to find related professionals, but the model seems to prefer professionals with many tags. This is probably because the model is trained on the full question text rather than on the tags of the questions.

<h2> Evaluate our current method </h2>

Now that we have a first setup for our model, we need a way to evaluate it. For the evaluation, we look at previously answered questions and check how the professional who actually answered each question relates to the score our model gave.

In [60]:
answers = pd.read_csv('../input/answers.csv', parse_dates=['answers_date_added'])
answer_score = pd.read_csv('../input/answer_scores.csv')

answers = answers.dropna(subset=['answers_body'])
answers['answers_body'] = answers['answers_body'].apply(get_text)

answers = answers.merge(answer_score, left_on='answers_id', right_on='id')
answers = answers.loc[answers['score'] > 0]
answers = answers.merge(profs.reset_index(), left_on='answers_author_id', right_on='professionals_id')
answers = answers.merge(questions.reset_index(), left_on='answers_question_id', right_on='questions_id')
answers = answers.loc[:, ['answers_id', 'professionals_id', 'score', 'info', 'index_x', 'index_y', 'questions_body', 'text', 'text_list']]

display(answers.head(5))

Unnamed: 0,answers_id,professionals_id,score,info,index_x,index_y,questions_body,text,text_list
0,9d1a775e148f4e0190a24dbebe53665f,b9d984e161c64171a9018e02b03eab3e,1,medicine nursing general surgery medical practice registered nurses nursing education nurse management icu nurse nursingschool nursingstudent youth program,210,21,"As medical technology improves, people live longer, healthier lives. Anti-aging technologies are getting better and better. If people start regularly living to 100 or longer, or if human life is extended even further beyond natural limits, how we provide for everyone?\r\n\r\nHow will people in my generation flourish if people currently in their 20s 30s, 40s, and 50s never retire, and keep working indefinitely? How will my generation support and care for a massive population of elderly people? \r\n\r\n#technology #future #medicine #career",if medicine improves and allows people to live much longer lives how will we provide for everyone as medical technology improves people live longer healthier lives anti aging technologies are getting better and better if people start regularly living to or longer or if human life is extended even further beyond natural limits how we provide for everyone how will people in my generation flourish if people currently in their and never retire and keep working indefinitely how will my generation support and care for massive population of elderly people technology future medicine career,"[if, medicine, improves, and, allows, people, to, live, much, longer, lives, how, will, we, provide, for, everyone, as, medical, technology, improves, people, live, longer, healthier, lives, anti, aging, technologies, are, getting, better, and, better, if, people, start, regularly, living, to, or, longer, or, if, human, life, is, extended, even, further, beyond, natural, limits, how, we, provide, for, everyone, how, will, people, in, my, generation, flourish, if, people, currently, in, their, and, never, retire, and, keep, working, indefinitely, how, will, my, generation, support, and, care, for, massive, population, of, elderly, people, technology, future, medicine, career]"
1,db560eaf375c48df9afd35b36e7058a7,be5d23056fcb4f1287c823beec5291e1,1,job search resume writing law law enforcement social work sociology social impact legal resume litigation civil litigation litigation support justice police civil rights mediation youth program,8,26,"I know this might be a bit of a hard question to answer... Lately, I have just been feeling like giving up because I feel like my education is pointless because I won't be able to succeed and get through it. I am about to graduate in May. I am trying to hold on, but I am not doing so well in my courses. Should I visit a counselor? How should I fix this? #dealing-with-college",how to not give up on college when you are feeling depressed know this might be bit of hard question to answer lately have just been feeling like giving up because feel like my education is pointless because won be able to succeed and get through it am about to graduate in may am trying to hold on but am not doing so well in my courses should visit counselor how should fix this dealing with college,"[how, to, not, give, up, on, college, when, you, are, feeling, depressed, know, this, might, be, bit, of, hard, question, to, answer, lately, have, just, been, feeling, like, giving, up, because, feel, like, my, education, is, pointless, because, won, be, able, to, succeed, and, get, through, it, am, about, to, graduate, in, may, am, trying, to, hold, on, but, am, not, doing, so, well, in, my, courses, should, visit, counselor, how, should, fix, this, dealing, with, college]"
2,4b72a86cedb8405593264f18f4adc054,be5d23056fcb4f1287c823beec5291e1,3,job search resume writing law law enforcement social work sociology social impact legal resume litigation civil litigation litigation support justice police civil rights mediation youth program,8,487,"I know that having confidence is very important when you are interviewing with potential employers and for when you are interacting with your peers. However, I find that sometimes I lose confidence when something negative happens, such as being rejected from a company, or doing poorly on a midterm. I try not to let it get to me, but sometimes it does and I feel very timid. How can I learn to stay confident even when something bad happens? Any help would be great.\r\n\r\nThanks in advance! #college #career #career-counseling #career-choice #interviews #interviewing-skills #personal-development #job-application",how do you stay confident know that having confidence is very important when you are interviewing with potential employers and for when you are interacting with your peers however find that sometimes lose confidence when something negative happens such as being rejected from company or doing poorly on midterm try not to let it get to me but sometimes it does and feel very timid how can learn to stay confident even when something bad happens any help would be great thanks in advance college career career counseling career choice interviews interviewing skills personal development job application,"[how, do, you, stay, confident, know, that, having, confidence, is, very, important, when, you, are, interviewing, with, potential, employers, and, for, when, you, are, interacting, with, your, peers, however, find, that, sometimes, lose, confidence, when, something, negative, happens, such, as, being, rejected, from, company, or, doing, poorly, on, midterm, try, not, to, let, it, get, to, me, but, sometimes, it, does, and, feel, very, timid, how, can, learn, to, stay, confident, even, when, something, bad, happens, any, help, would, be, great, thanks, in, advance, college, career, career, counseling, career, choice, interviews, interviewing, skills, personal, development, job, application]"
3,ef5169c25534400ca3441389ae6e373b,be5d23056fcb4f1287c823beec5291e1,1,job search resume writing law law enforcement social work sociology social impact legal resume litigation civil litigation litigation support justice police civil rights mediation youth program,8,519,Lately I’ve been stressing about college because of my social anxiety. I can barely talk in front of people. But I’m great at interviewing and I do have good friends. Is college a good place for me? #college #collegeanxiety #collegesocialanxiety #socialanxiety #college-advice #student,anxiety in collegelately ve been stressing about college because of my social anxiety can barely talk in front of people but great at interviewing and do have good friends is college good place for me college collegeanxiety socialanxiety college advice student,"[anxiety, in, collegelately, ve, been, stressing, about, college, because, of, my, social, anxiety, can, barely, talk, in, front, of, people, but, great, at, interviewing, and, do, have, good, friends, is, college, good, place, for, me, college, collegeanxiety, socialanxiety, college, advice, student]"
4,4121bd9612f1429fbaa71bd736d206a3,5fc4a2e4ab464675b57902bdc1aa83cd,1,project career jobs socialanxiety professional network,355,519,Lately I’ve been stressing about college because of my social anxiety. I can barely talk in front of people. But I’m great at interviewing and I do have good friends. Is college a good place for me? #college #collegeanxiety #collegesocialanxiety #socialanxiety #college-advice #student,anxiety in collegelately ve been stressing about college because of my social anxiety can barely talk in front of people but great at interviewing and do have good friends is college good place for me college collegeanxiety socialanxiety college advice student,"[anxiety, in, collegelately, ve, been, stressing, about, college, because, of, my, social, anxiety, can, barely, talk, in, front, of, people, but, great, at, interviewing, and, do, have, good, friends, is, college, good, place, for, me, college, collegeanxiety, socialanxiety, college, advice, student]"


In [59]:
sen_emb[519].shape

(1, 100)
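
One possible way to turn the evaluation idea above into a number: for each answered question, rank all professionals by cosine distance to the question embedding and record the position of the professional who actually answered. The sketch below does this on made-up embeddings and summarizes the ranks with the mean reciprocal rank. The choice of metric is an assumption, not something this kernel has settled on, and `rank_of_answerer` and `mean_reciprocal_rank` are hypothetical helpers.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_of_answerer(question_vec, prof_vecs, answerer_idx):
    """1-based rank of the real answerer when professionals are sorted by distance."""
    distances = [cosine_distance(question_vec, p) for p in prof_vecs]
    order = sorted(range(len(prof_vecs)), key=distances.__getitem__)
    return order.index(answerer_idx) + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; 1.0 means the answerer is always ranked first."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up embeddings standing in for prof_emb and sen_emb.
prof_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
question_vecs = [[0.9, 0.1], [0.1, 0.9]]
# (question row, professional row) pairs, like index_y and index_x in `answers`.
pairs = [(0, 0), (1, 2)]

ranks = [rank_of_answerer(question_vecs[q], prof_vecs, p) for q, p in pairs]
mrr = mean_reciprocal_rank(ranks)
print(ranks, mrr)
```

On the real data, the same loop would run over the `index_y` and `index_x` columns of `answers`, using the rows of `sen_emb` and `prof_emb` as vectors.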