# Summary
## The problem
This Kernel is an entry into the CareerVillage.org (CV) Kaggle competition with the following objective: develop a method to recommend relevant questions to the professionals who are most likely to answer them.

The proposed solution should support the CV metric:
A good recommender system should help us to reach a target of a high (95%?) percent of questions which get a 10-quality-point-answer within 24 hours, without churning the Pros, and within the bounds of fairness.

## Proposed Solution
A triage question handler is proposed together with a set of recommendations to handle the cold start problem and a method that provides a list of suggested questions for professionals that visit the CV site.
The triage question handler compares a new question with all the previously asked questions and puts the question into one of three categories:
1.	When there are many good matches with similar previously answered questions, students should be given immediate feedback by displaying this list of similar questions. They could be asked if their question was answered or if they still want professionals to be asked. This gives old questions/answers new life and the opportunity for professionals to get regular feedback to promote continued engagement.
2.	When there are fewer less good matches with similar previously answered questions, there is a group of questions where the matches are still good enough to find professionals who will find the student's question relevant.
3.	When there are very few or no good matches with similar previously answered questions the system can immediately flag this problem. Ideally CV could recruit a group of very engaged professionals who are willing to get involved in answering these difficult to answer questions. In some cases that answer may require the development of a dialogue which would require an extension to the current system.

A number of technologies are considered to drive the model. Six of these are explored in detail:

* tfidf questions bags of words similarity

* Word2Vec

* FastText

* GloVe

* Universal sentence encoder (USE)

* tfidf professionals bag of words similarity 

For the first 5 models, the 10 best matches are found for the last 100 questions in the data set that are answered. An extensive but subjective comparison shows the FastText system is the most suitable with Word2Vec almost as good.

The subjective analysis of the last 20 questions shows that the FastText method can deliver 4 relevant similar questions out of 5 whereas the tfidf method only delivers 2 relevant similar questions out of 3.

A process is then developed to use the recommendations to tune the model to fulfil the problem requirements.

## Support
The proposed solution is supported by an extensive Exploratory Data Analysis and recommendations for further improving the effectiveness of the system.


# Kernel Contents
## 1 [Jared's Checklist](#checklist)
## 2 [Exploratory Data Analysis (EDA)](#eda)
## 3 [Cold Start](#coldstart)
## 4 [Recommender technology for Engaged Professionals](#recommender)
## 5 [Triage Question Handler](#triage)
## 6 [Frequency and  Relevance](#FandR)
## 7 [Tuning the Question Handler](#tuning)
## 8 [Recommendations for further improvement](#recommend)
## 9 [Looking for similar questions](#simq)

###     [Method 1: tfidf](#m1)
###     [Method 2: Word2vec and sentence embedding](#m2)
###     [Method 3: Using Fasttext](#m3)
###     [Method 4: Using Global Vectors](#m4)
###     [Method 5: universal-sentence-encoder](#m5)
###     [Method 6: Using tfidf to compare question bow with prof bow](#m6)
## 10 [Comparing recommender engines](#compare)
## 11 [Finding a list of relevant questions for a professional](#proflist)
## 12 [Find similar professionals](#simprof)

## 13 [EDA Code](#edacode)



# Jared's Checklist<a id='checklist'></a>

Jared has provided a checklist in the Discussion area that is a useful way of summarising this Kernel:

#### Did you decide to predict Pros given a Question, or predict Questions given a Pro? Why?
Both directions are handled:
* When a student asks a question, similar questions are found and the professionals who answered these questions are predicted. Comparison of professionals also allows for the option of finding more professionals.

* When a professional visits the website, a list of relevant questions is provided by comparing the professional's bag of words with all the asked questions.

#### Does your model address the "cold start" problem for Professionals who just signed up but have not yet answered any questions? How'd you approach that challenge?

Yes [there is a separate section on this](#coldstart). Ideally the professional should be encouraged to answer a few questions on the website as part of the on-boarding.

In addition, a professional-defined question frequency variable is [recommended](#maxfreq). This is the frequency with which professionals would like to receive questions. If questions are not being allocated at the desired frequency, then the most relevant questions, similar to those the professional would have seen on visiting the website, are sent to the professional.
    
#### Did your model have any novel approaches to ensuring that "no question gets left behind"?

Yes, the triage system immediately identifies questions that are likely to be difficult to answer. Ideally these are directed to particularly engaged professionals who have “signed up” to answer difficult questions.

If this is impractical then the alternative is for these questions to be directed to professionals who have not received a question in their requested timeframe. Other questions which had not received an answer within 24 hours should also be allocated to these [professionals](#maxfreq2).

These difficult to answer questions should be identified separately from the relevant questions that could be directed to the professional on hitting the time limit, with a message like:

"We are having trouble finding somebody who might be able to provide an answer to these questions. Do you think you have any useful information for the student or do you know someone who does?"

This way, the danger of delivering irrelevant questions is overcome.

Some tuning of this method would be needed. For instance it should probably be avoided for new professionals and the allocation could be based on relevance but with a lower threshold than normal.
    
#### What types of models did you try, and why did you pick the model architecture that you picked?

[There is a section on this](#recommender).

    
#### Did you engineer any new data features? Which ones worked well for your model?

I think there are a number of factors that make the recommender work well using the FastText technology:

* Bringing all the question information together in one unit, as the students use the interface in different ways and so the information is spread.

* Cleaning the text in a bespoke way designed for the particular recommender technology.

* Using the cosine similarity measure to triage the questions so that they can be treated differently.

* Developing a bag of words for the professional, which means that the system is continually and automatically learning more about the professional.

* Using the professional-defined question frequency variable to manage the relationship with the professional. As a result, new professionals do not need to be treated differently from seasoned professionals. Also, all the professionals whose time period is about to elapse can be considered when we see that a question is not getting answered within the 24-hour target. In these circumstances the thresholds could be lowered.

#### Is there anything built into your approach to protect certain Professionals from being "overburdened" with a disproportionately high share of matches?

Yes, the professional-defined question frequency mentioned above makes sure that CV can tailor the emails to the individual professional. Professionals would not receive emails more frequently than requested. However, to ensure flexibility, when professionals answer a question they should be asked if they would be happy to receive another question within the time frame set by the frequency.

#### What do you wish you could use in the future (that perhaps we could try out on our own as an extension of your work)?

[There is a separate section on this](#recommend).

If I had to prioritise, I would pick exploring ways of:
* building on the interaction students have with the system so they provide more feedback and develop a richer understanding of their opportunities.
* building ownership of the mission and loyalty within the professionals




# Exploratory Data Analysis<a id='eda'></a>

This competition is, essentially, about changing the behaviour of the professionals.
By better targeting of questions to professionals, the aim is to make it more likely that they will respond.

Much of this EDA has therefore been designed to provide information and insights about the following:

•	if we want to change behavior it would be good to know what happens now

•	if we want to change behavior we need some way of measuring impact...so what happens now?

 
## Data Summary

[23,931](#questions) questions from 12,329 students  
[51,123](#questions)  answers   
[50,106](#pd) answers provided by [28,152 professionals](#professionals)  
[6,679](#pd) answers provided where a professional has answered more questions than asked
        * *hence professionals have some other way of accessing questions than emails*  
[4.3m questions are sent to professionals in 1.85m emails](#pd)  
        * *hence only about 1 in 100 questions sent to professional results in an answer*  
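The order of magnitude of the answer rate can be checked with simple arithmetic on the totals quoted above. This is a minimal sketch: the variable names are illustrative, and the 4.3m figure is rounded, so the result is only approximate.

```python
# Totals quoted in the Data Summary above (4.3m is a rounded figure).
answers_by_professionals = 50_106
questions_sent_by_email = 4_300_000

# Upper bound on the email answer rate: it assumes every answer came from an
# emailed question, which the summary above shows is not always the case.
answer_rate = answers_by_professionals / questions_sent_by_email
questions_per_answer = questions_sent_by_email / answers_by_professionals

print(f"answer rate: {answer_rate:.2%}")
print(f"about 1 answer per {questions_per_answer:.0f} questions sent")
```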
            
## Plots Summary

[This diagram](#log_questions_answered) which plots the log of the number of questions answered against the  log of the number of questions asked for each professional, provides some useful insights:  
* if the only source of answers was from emails sent to professionals then there should not be any points above the red line. But there are and many of these points are in the early stages of the professional interacting with CV  
* after receiving about 10 questions the number of answers drops off and from 40 questions onwards there is no noticeable increase in answers  
* there is a huge range in the ratio of questions answered to questions asked: one professional has answered 1,710 questions but has only been asked about 10 and there are some professionals who have been asked thousands of questions and have never provided an answer.  

[Interest in answering questions drops off quickly](#activity)  but  there is weak evidence that those that stay continue to contribute.

[This diagram](#answers v professionals) shows "who answers". The [top 5 professionals](#whoanswers) provide nearly 10% of the answers. 18k professionals have provided no answers. 9k professionals who have answered between 1 and 10 questions provide nearly half of the answers. The professionals who have answered between 11 and 40 questions provide nearly a quarter of the answers.  


# Insights

The analysis shows that: 
1.	In their interaction with CV, professionals are not limited to receiving questions and that this alternative is significant in the early stages of the relationship. This alternative is actually due to the fact that the professional can go to the CV website directly to  access relevant questions. This is an important source of answers and the solution should provide for it.
2.	Interest in answering questions drops off quickly and there is weak evidence that those that stay continue to contribute. Over 25% of answers are provided in the first day of involvement in CV.
3.	There are a few hero professionals who answer many questions but if we are going to make a significant improvement in the number of answers, we need to focus on the professionals that drop out early. 
4.	CV have set the aim of the competition to better target questions to professionals so it is more likely that they will respond. As we need to focus on the early stages of the relationship with the professional, we should not rely on using data from the answers of long-standing professionals to guide the targeting. They are non-typical.
5.	Professionals can access questions from the CV website and so are not restricted to answering emails. Care is therefore needed when using inferred links between questions and professionals to build a recommender. By answering, the professional indicates that the question is relevant but the contact method, ie email, should not be assumed.
6.	Tags are a working part of the existing CV system: students are familiar with the idea and are happy to use them. Any solution should therefore include the option of the currently available free-form tags.
7.	Students are free to invent tags. This has the benefit of allowing the data set to evolve and automatically respond to requirement changes. This occurred recently with the tag data-science.
8.	The hyphen in data-science is useful for simple term-based recommenders as it can be retained or collapsed to “datascience”. This is much more useful for targeting than “data” and “science”. Bi-grams can also be useful here. To give this problem some context: a data scientist can receive many irrelevant requests if their tags are “data” and “science” whereas “datascience” is very targeted.
9.	However there are problems with tags: not all students use them and many are so generic that they are not useful for targeting. Some form of augmentation is needed and that is the essence of the job.
10.	Although we shouldn't completely ignore the requirements of the professionals still answering questions after, say 30 days, initial focus should be on these first few days. In fact the CV metric is based on 24 hours and so professionals should be persuaded to respond within that time frame. Most membership organisations have "on-boarding" programmes and so it would be useful to think about how an automatic recommender system could help. 
11.	The analysis, which shows that student questions can be broadly categorised into three groups, can be used during the design of the recommender:

•	Ones that have many good matches with similar previously answered questions 

•	Ones that have fewer less good matches with similar previously answered questions

•	Ones that have very few or no good matches with similar previously answered questions

12.	These questions that have very few or no good matches with similar previously answered questions are a problem. In many cases it will not be clear what the student really wants to know or the question can appear to be completely off topic or so general that the professional doesn’t know where to start. 
13.	When presented with a question, a professional’s choices are either to answer or ignore the question. Ignoring a question leaves a bad feeling and too many of these difficult questions leads, possibly, to the professional moving away from CV. Ideally professionals new to CV should only get "just right" questions. Can we build a recommender that is that sophisticated or is some mix of human and machine more practical?
14.	Are professionals leaving because of inadequate feedback? We all need to know that our efforts, especially voluntary ones, are appreciated. There is a comment and “hearts” opportunity for students. They can either comment on or like an answer. 
15.	Unfortunately few students use the comments and “likes” opportunity. This has two drawbacks:

•	Professionals do not get enough feedback.

•	The comments system and “likes” are so little used that they should not currently be used in any scoring mechanism during the design of the recommender.

16. Having something in common is important. The formation of groups is therefore a good idea. Currently there are only a few groups with members and so groups cannot currently be useful in directing questions to professionals.
17. Membership of schools_membership is a mixture of students and professionals. The numbers are small and so cannot currently be a major part of the recommender. However, as having something in common is important, it is possible that this group could be used in the future to help prioritise which professionals should be asked the question, although relevance is usually more important.
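Insight 8 above, on retaining or collapsing the hyphen in data-science, can be sketched as follows. This is an illustrative tokeniser only; the function names and the `hyphen` parameter are assumptions and not the cleaning code used elsewhere in this Kernel.

```python
import re

def tag_tokens(text, hyphen="collapse"):
    """Tokenise text, controlling how hyphenated terms are handled.

    hyphen="collapse" turns "data-science" into "datascience";
    hyphen="keep" leaves it as "data-science";
    hyphen="split" gives the less targeted "data" and "science".
    """
    text = text.lower().replace("#", " ")
    if hyphen == "collapse":
        text = text.replace("-", "")
    elif hyphen == "split":
        text = text.replace("-", " ")
    # Letters and digits form a token; internal hyphens survive only if kept.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)

def bigrams(tokens):
    """Join adjacent tokens, so "data science" also yields "data_science"."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

print(tag_tokens("#data-science careers"))
print(bigrams(tag_tokens("data science careers")))
```

Collapsing every hyphen also merges terms such as "part-time", so the choice of mode is something to tune against the tags actually in use.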



# Cold Start<a id='coldstart'></a>

The data shows that many of the professionals who answer questions don’t answer many. This could be because the questions directed to them are not relevant and their motivation to help fades. This can happen with recommender systems for new users, where there has been insufficient interaction with the user to understand their interests.

In a typical collaborative filtering application like a movie recommender, there are problems when a new movie is added to the catalogue or a new user joins the group. Pure collaborative filtering algorithms rely on the interactions between items and users to make recommendations. If there are no interactions there can be no recommendations, or at least the recommendations are likely to be of poor quality.

With CV, each question is like a new item with no collaborative filtering possible. Our problem is more of a content-based filtering application where items are recommended on the basis of the features the item possesses. We have the words in the question and the tags and have to make the best use of them.

Strategies to deal with the cold start problem are more interesting when thinking about the new professionals. There is an interesting article on Wikipedia: https://en.wikipedia.org/wiki/Cold_start_(computing).

So should we focus on these new professionals? There are a lot of them and just answering one or two more questions would make a big difference to the quantity of answers. Or should we work on making life easier for the committed professionals so they can answer more questions? There are fewer of these people but the quality of their answers may be higher. The answer is probably that we should attempt to do both.

Ideally the recommender design should be seamless so that there is no distinction in the way that the core engine works with new and seasoned professionals.

When professionals join, we don’t know very much about them. This cold start problem can be addressed by engagement processes, user interface design and working with meta data.

Here are some recommended strategies, most of which can be implemented with relatively little disruption:

•	It would be good if we made it easier for professionals to add tags and perhaps categorise them: eg, sector, skills, experience, location, education. 

•	When professionals join they should be encouraged to go onto the website and answer a few questions. We can use the information from the answers to vastly improve the recommender.

•	Elsewhere in this Kernel it is suggested that professionals are given the option of determining how often they would like to receive emails. When this period is over without them receiving an email, they should receive the most relevant questions which they would have seen if they had gone to the website. This approach would work well for new professionals who will not have answered many questions.

•	Using the professionals' bags of words it is possible to find similar professionals. This process can be used to introduce new professionals into the system as follows. The student asks a question, the system finds a similar question and identifies the professionals who answered the similar questions. In addition to sending the student's question to these professionals, similar professionals are found by comparing professionals and if these similar professionals are within a tunable threshold they would also receive the email. It would be very useful to record metrics for this approach to allow tuning.

•	When a question is presented, the professional only has two options: answer or ignore. When the professional ignores the question, we are losing valuable information. The professional will have spent time reading the question and should be given the opportunity to provide feedback. Not only is this useful to CV, it engages the professional and is good for motivation. Questions are likely to be ignored because they are either off-topic, too generic or out of the professionals’ experience. Initially, we could have 2 new buttons: “off-topic” and “not for me”. “Off topic” questions could be re-directed for the moderator to review. When a question is marked “not for me”, a supplementary question could be asked: was the question “too generic” or “not in your field”. “Too generic” should go to the moderator for review. 

•	In all cases (off-topic, too generic and not in your field), the question should be marked so that the professional doesn’t see it again. This is really important as currently revisits to the CV site are discouraging because professionals are faced with a list of questions that they have previously ignored.
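The idea of finding similar professionals by comparing bags of words, subject to a tunable threshold, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the professional ids, the example bags, the smoothed tf-idf weighting and the threshold of 0.3 are all hypothetical and are not taken from the CV data.

```python
import math
from collections import Counter

def tfidf_vectors(bags):
    """Turn {name: list of words} into {name: {word: tf-idf weight}}."""
    n_docs = len(bags)
    doc_freq = Counter(word for words in bags.values() for word in set(words))
    vectors = {}
    for name, words in bags.items():
        counts = Counter(words)
        vectors[name] = {
            # Smoothed idf, so words shared by every bag keep a small weight.
            w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
            for w, c in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_professionals(target, bags, threshold=0.3):
    """Return (name, score) pairs at or above the threshold, best first."""
    vectors = tfidf_vectors(bags)
    scores = {
        name: cosine(vectors[target], vec)
        for name, vec in vectors.items() if name != target
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]

# Hypothetical bags of words for three professionals.
bags = {
    "pro_a": "nursing healthcare hospital nurse".split(),
    "pro_b": "nurse hospital medicine healthcare".split(),
    "pro_c": "software engineering python datascience".split(),
}
print(similar_professionals("pro_a", bags))
```

Recording how often the extra professionals found this way actually answer would give the metric needed to tune the threshold.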





# Recommender technology for Engaged Professionals<a id='recommender'></a>

Quote from the [FastText website](https://fasttext.cc/docs/en/supervised-tutorial.html):

"The goal of text classification is to assign documents (such as emails, posts, text messages, product reviews, etc...) to one or multiple categories. Such categories can be review scores, spam v.s. non-spam, or the language in which the document was typed. Nowadays, the dominant approach to build such classifiers is machine learning, that is learning classification rules from examples. In order to build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels)."

To build a deep learning, ie neural net, recommender we need labelled data, an algorithm and a classifier. For this problem we have data but it is not labelled. Some data is self-labelled, like the link between questions and the professionals that answered them, but there is not sufficient data like this to produce training, validation and test sets.

CV has experience of the FastText library and has given some thought to labelling the data. But this is not currently available and so a solution must be based on some more traditional text classification methods.

There has been some research on improving the word2vec methodology to improve document similarity accuracy. For instance the Word Mover algorithm looks promising. [This paper](http://proceedings.mlr.press/v37/kusnerb15.pdf) includes an interesting figure which shows for small documents there is actually little benefit in using these more sophisticated methods compared with a traditional tfidf bag of words (bow).

Some questions are long but many are short. The CV metric includes the phrase “within the bounds of fairness”. This refers to the desire to ensure that those with less advantage are catered for. It seems likely that these students will ask shorter questions and so the recommender should be tuned for short questions.

For this reason topic-based methods like LDA and, to some extent, LSA are not explored. These very useful techniques find multiple topics in a document but in our case we have some questions with very few words and the words are the topic. Other techniques like SVD are not explored because the data set is small enough to not benefit from scaling down.

Six recommender engines are developed on the dataset. Two are based on tfidf; the others use Word2Vec, FastText, GloVe and the Universal Sentence Encoder (USE). Results for each method are presented in dataframes.

### Accuracy Scores
One way of measuring the success of a recommender would be to see if the recommender identified the professional who answered the student's question in the previously answered set. This could be considered to be a good measure as it would then be compared against the current method. However we know that many questions are answered via the website and so it would not be a fair comparison. Also it could be that the professional's answer was the only time the professional answered that type of question. We would then have no evidence, apart from tags, that the professional should be identified. Tags are useful but are only part of the information available. This method is therefore not recommended.

As it is difficult to produce quantitative accuracy scores, the dataframes below hold the results for the last 100 questions in the provided data that were answered. This allows for inspection of a range of questions and avoids the problem of cherry-picking questions to show the validity of the models.

In addition for each method I have provided a grading based on inspection of the last 20 questions in the following table:

similar questions to students

This is the number of times that the recommender has at least one similar question that is deemed good enough to be presented to the student in the hope that it might answer the student's question.

false positives

This is the number of times the best match question that would have been provided by the recommender was not deemed to be sufficiently relevant by me.

false negatives

This is the number of times the best match question was not provided by the recommender but I deemed it to be sufficiently relevant.

Relevance of Top10 Qs

Out of the top 10 best matches, this is the number that I deemed to be relevant on the basis that the person who provided the answer to the question would find the student's questions to be answerable.

Of course, all these judgements are subjective but they do give us the chance of comparing recommender engines.




In [None]:
%%javascript 
IPython.OutputArea.auto_scroll_threshold = 9999;

In [None]:
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Subjective grading of the last 20 queries for each recommender engine.
# Rows of empty strings act as sub-headings or spacers in the displayed table.
df_recommender_results = pd.DataFrame(
    [['', '', '', '', ''], [13, 16, 18, 15, 12], [4, 1, 2, 4, 3], [4, 1, 0, 3, 3],
     ['', '', '', '', ''], ['', '', '', '', ''],
     # Number of relevant matches in the top 10 for query0 to query19
     [10, 10, 10, 10, 10], [10, 10, 10, 6, 10], [0, 6, 9, 1, 7], [0, 0, 0, 1, 0], [6, 10, 10, 6, 10],
     [0, 0, 0, 0, 0], [0, 0, 3, 0, 0], [0, 7, 6, 4, 6], [10, 10, 10, 10, 10], [10, 10, 10, 10, 10],
     [5, 10, 10, 3, 10], [8, 9, 10, 10, 9], [7, 6, 6, 5, 1], [10, 10, 9, 10, 10], [10, 10, 9, 9, 9],
     [10, 10, 9, 10, 10], [10, 9, 9, 8, 10], [10, 10, 10, 10, 9], [7, 10, 10, 8, 10], [8, 8, 10, 8, 8],
     # Column totals and the percentage of the 200 possible relevant matches
     [131, 155, 160, 129, 149], [66, 78, 80, 65, 75]],
    columns=['tfidf', 'Word2Vec', 'FastText', 'GloVe', 'UniSE'],
    index=['Similar Q for last 20 queries', 'similar questions to students available',
           'false positives', 'false negatives', '', 'Relevance of Top10 Qs']
          + ['query%d' % i for i in range(20)] + ['Total', '%'])
display(df_recommender_results)


The subjective analysis of the last 20 questions shows that the FastText method can deliver good matches to the student 18 times with only one result being wrong. This represents 94% accuracy or 1 in 20 wrong. In comparison, the tfidf method can only deliver good matches to the student 13 times with four results being wrong. This represents 69% accuracy or 1 in 3 wrong. Overall the results show that the FastText approach, with sentence embedding using averaging based on tfidf, gives the best results.
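Sentence embedding by tfidf-based averaging can be illustrated with a small sketch. The word vectors and idf weights below are toy values invented for the example, and the function name is hypothetical; in the Kernel the vectors come from the trained FastText model.

```python
def sentence_embedding(tokens, word_vectors, idf, default_idf=1.0):
    """Average the word vectors of a sentence, weighting each word by its idf.

    Repeated tokens are added once per occurrence, so the effective weight of
    a word is its term frequency times its idf.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in word_vectors:
            continue  # out-of-vocabulary words are skipped in this sketch
        weight = idf.get(token, default_idf)
        for i, value in enumerate(word_vectors[token]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when no token carries any weight
    return [value / weight_sum for value in total]

# Toy 2-dimensional vectors and idf weights, invented for illustration.
word_vectors = {"nurse": [1.0, 0.0], "career": [0.0, 1.0], "the": [0.5, 0.5]}
idf = {"nurse": 3.0, "career": 1.0, "the": 0.0}

embedding = sentence_embedding(["the", "nurse", "career"], word_vectors, idf)
print(embedding)
```

The weighting pulls the sentence vector towards rare, informative words such as "nurse" and away from common words such as "the".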

The "Relevance of Top10 Qs" data is a better proxy for determining each method's suitability to respond to the challenge set. FastText delivers 4 relevant similar questions out of 5 whereas tfidf delivers 2 relevant similar questions out of 3.
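The percentages behind these ratios follow from the "Total" row of the table: each engine is graded on 20 queries with 10 matches each, so 200 relevant matches are possible. A short sketch of the arithmetic, with a hypothetical helper that rounds halves up to match the table:

```python
# "Total" row of the table above: relevant matches out of 20 queries x 10 matches.
totals = {"tfidf": 131, "Word2Vec": 155, "FastText": 160, "GloVe": 129, "UniSE": 149}
possible = 20 * 10

def percent_half_up(count, out_of):
    """Integer percentage using integer arithmetic, rounding halves up."""
    return (200 * count + out_of) // (2 * out_of)

relevance_pct = {name: percent_half_up(total, possible) for name, total in totals.items()}
print(relevance_pct)
```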

FastText provides the most relevant similar questions and hence links to professionals most likely to answer a student's query. It also provides the best false positives and negatives results alongside Word2Vec.

It is not surprising that FastText and Word2Vec outperform tfidf but they are also better than GloVe and USE.

GloVe is better than tfidf but the other methods are much better. GloVe is based on a fixed vocabulary that is probably not a good reflection of the terms used by students when asking career questions. The dimensions of its vectors are also smaller at 200 than FastText, Word2Vec and USE.

USE uses a pre-trained vocabulary like GloVe but the vectors have been trained at a sentence level. The USE code here does not work on the Kaggle GPU and so the CPU has been used. 

Although USE produces good results, it does not seem necessary to use the more sophisticated methods on this problem: sentence embedding with FastText does the job. In fact the USE module became unavailable on April 21st and so the associated code has been commented out in this version of the Kernel. This is disappointing but not critical because the FastText method is preferred for this application. 

It is worth noting that the tag system is particularly useful with FastText and Word2Vec. Normally a number of tags are defined and they are collected and stay together in the bag of words. Word2Vec and FastText will allocate significance to the proximity of tags to each other and develop vectors to reflect this.

By regularly updating and using the professionals' bag of words (method 6) rather than the question bag of words, the system can continue to learn about a professional and respond to changes in demand and vocabulary. This approach can be used with any of the other five recommender engines but is only developed for tfidf here.

CV is updating the tags that professionals follow. These are included in the professionals' bags of words and so automatically benefit. The professionals' bags of words can be used in three ways:
* to find questions when a professional visits the site
* to find similar professionals when it is necessary to augment the number of professionals to email
* to find professionals who would find a student's question relevant.
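The first of these uses, finding questions when a professional visits the site, can be sketched with a bag of words that grows as the professional follows tags and answers questions. This is a minimal sketch with raw word counts rather than tfidf; the stop-word list, function names and example texts are all assumptions made for illustration.

```python
import math
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "as", "do", "for", "how", "i", "in", "is", "it",
              "should", "the", "what", "which"}

def tokens(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def update_bag(bag, text):
    """Add the words of a followed tag or newly answered question to a bag."""
    bag.update(tokens(text))
    return bag

def cosine(u, v):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def questions_for_professional(bag, questions, top_n=3):
    """Rank open questions by similarity to the professional's bag of words."""
    scored = [(cosine(bag, Counter(tokens(text))), qid)
              for qid, text in questions.items()]
    scored = [item for item in scored if item[0] > 0]
    return [qid for score, qid in sorted(scored, reverse=True)[:top_n]]

pro_bag = Counter()
update_bag(pro_bag, "nursing career hospital")                # followed tags
update_bag(pro_bag, "how do I become a nurse in a hospital")  # answered question

open_questions = {
    "q1": "what is it like working in a hospital as a nurse",
    "q2": "which python libraries should I learn",
    "q3": "is a nursing degree needed for a hospital career",
}
suggested = questions_for_professional(pro_bag, open_questions)
print(suggested)
```

Because the bag is updated after every answer, a new professional with only a few tags is handled by the same code path as a seasoned one.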


# Triage Question Handler<a id='triage'></a>


The method involves regularly pre-processing all the questions that have been previously answered. All the text associated with a question is lumped together in one bow. The reason for lumping everything together (title, body and tags) is that students use the interface in different ways. Some ask the question in the title, some put the tags in the body and others don’t bother with tags.
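Lumping the title, body and tags into a single bow might look like the sketch below. The function name, its arguments and the cleaning rules are assumptions for illustration; the bespoke cleaning used for each recommender engine in this Kernel differs in detail.

```python
import re

def question_bag_of_words(title, body, tags):
    """Lump a question's title, body and tags into one list of tokens."""
    combined = " ".join([title or "", body or "", " ".join(tags or [])])
    # Hashtags typed into the body become ordinary words.
    combined = combined.lower().replace("#", " ")
    # Keep hyphenated terms such as "data-science" together.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", combined)

bow = question_bag_of_words(
    title="How do I become a data scientist?",
    body="I like maths. #data-science #college",
    tags=["career"],
)
print(bow)
```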

When a question is asked it is compared with every previously answered question using the FastText method to produce a cosine similarity array with one value between 0 and 1 for every previously answered question.
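A minimal sketch of building the cosine similarity array and picking the best matches is given below. The 3-dimensional vectors are toy values standing in for the FastText sentence embeddings, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_array(new_question_vec, answered_question_vecs):
    """One similarity value per previously answered question, in input order."""
    return [cosine_similarity(new_question_vec, vec) for vec in answered_question_vecs]

def top_matches(similarities, n=10):
    """(index, score) pairs of the n best matches, best first."""
    ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy 3-dimensional sentence embeddings, invented for illustration.
answered = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sims = similarity_array([1.0, 0.0, 0.0], answered)
print(top_matches(sims, n=2))
```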
 
During the EDA it was noticed that student questions can be broadly categorised into three groups
* Ones that have many good matches with similar previously answered questions 
* Ones that have fewer less good matches with similar previously answered questions
* Ones that have very few or no good matches with similar previously answered questions

Two cosine thresholds are selected to separate questions into the three categories described above:

* When there are many good matches with similar previously answered questions, students should be given immediate feedback by displaying this list of similar questions. They could be asked if their question was answered or if they still want professionals to be asked. This gives old questions new life and the opportunity for professionals to get regular feedback to promote continued engagement.
 
* When there are fewer less good matches with similar previously answered questions, the DataFrame shows that the matches are still good enough to find professionals who will find the question relevant.

* When there are very few or no good matches with similar previously answered questions the system can immediately flag this problem. Ideally CV could recruit a group of very engaged professionals who are willing to get involved in answering these difficult to answer questions. In some cases that answer may require the development of a dialogue which would require an extension to the current system. 


If the approach of recruiting  super engaged professionals is impractical another approach is possible by identifying professionals that have not received an email withing their preferred frequency and asking them the question. The question should be flagged as one "that is a little out of their area but as CV are having trouble placing it could they provide an answer or do they someone who might?"

For FastText the current suggested thresholds are:

* high threshold: 0.94
* low threshold: 0.90

All questions, to a maximum of say 10, with a cosine similarity greater than the high threshold can be directed to the student.

The low threshold is triggered when the cosine similarity of the 10th best match is below this threshold.
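The two rules above can be sketched as a small function. This is an illustration under stated assumptions rather than the kernel's implementation: the name `triage` and the returned keys are hypothetical, and treating a question with fewer than 10 candidates as "hard to place" is an assumption not spelled out in the text.

```python
import numpy as np

HIGH_THRESHOLD = 0.94  # suggested FastText value from the text
LOW_THRESHOLD = 0.90   # suggested FastText value from the text
MAX_SHOWN = 10         # "a maximum of say 10"

def triage(similarities, high=HIGH_THRESHOLD, low=LOW_THRESHOLD, max_shown=MAX_SHOWN):
    """Apply the two thresholds to one question's cosine similarity array."""
    sims = np.sort(np.asarray(similarities, dtype=float))[::-1]  # best first
    # Matches strong enough to show straight to the student, capped at max_shown.
    n_for_student = int(min(np.count_nonzero(sims > high), max_shown))
    # Assumption: with fewer than max_shown candidates the low threshold is triggered.
    kth_best = sims[max_shown - 1] if sims.size >= max_shown else float("-inf")
    hard_to_place = bool(kth_best < low)
    return {"n_for_student": n_for_student, "hard_to_place": hard_to_place}

easy = triage([0.97, 0.95] + [0.91] * 10)
hard = triage([0.92, 0.85, 0.60])
```

Keeping the two outcomes separate means a question can both show a few strong matches to the student and still be flagged when the pool of similar questions is too thin to find professionals.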

# Frequency and Relevance<a id='FandR'></a>

*"A good recommender system should help us to reach a target of a high (95%?) percent of questions which get a 10-quality-point-answer within 24 hours, without churning the Pros, and within the bounds of fairness."*

How do we balance getting 95%+ of questions answered against professional churn? The more professionals that are asked a question, the more likely we are to get a timely answer.
But it also becomes more likely that professionals will find the question irrelevant and the frequency of being asked intolerable.

To be able to work towards a balance we need metrics and these are difficult to come by. The aspirational metric above is particularly difficult as there are qualitative items in it. As a first stage towards the aspiration, we could simplify the metric as follows:

#### % of questions answered in 24 hours

This is easy to measure.
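As a rough sketch of how this could be measured from the competition tables, the function below takes the time of the first answer to each question and checks whether it fell within 24 hours. The column names are those of `questions.csv` and `answers.csv`; the function name and the toy data are hypothetical, and counting an answer at exactly 24 hours as on time is an assumption.

```python
import pandas as pd

def pct_answered_within_24h(questions, answers):
    """Percentage of questions whose first answer arrived within 24 hours."""
    q = questions[["questions_id", "questions_date_added"]].copy()
    a = answers[["answers_question_id", "answers_date_added"]].copy()
    q["questions_date_added"] = pd.to_datetime(q["questions_date_added"])
    a["answers_date_added"] = pd.to_datetime(a["answers_date_added"])
    # Earliest answer per question.
    first_answer = a.groupby("answers_question_id")["answers_date_added"].min()
    # Unanswered questions map to NaT, which compares as False below.
    delay = q["questions_id"].map(first_answer) - q["questions_date_added"]
    return 100.0 * (delay <= pd.Timedelta(hours=24)).mean()

# Toy data (assumption): four questions asked at the same time.
toy_q = pd.DataFrame({
    "questions_id": ["q1", "q2", "q3", "q4"],
    "questions_date_added": ["2019-01-01 09:00:00"] * 4,
})
toy_a = pd.DataFrame({
    "answers_question_id": ["q1", "q1", "q2", "q4"],
    "answers_date_added": ["2019-01-01 12:00:00", "2019-01-05 09:00:00",
                           "2019-01-03 09:00:00", "2019-01-02 09:00:00"],
})
q24 = pct_answered_within_24h(toy_q, toy_a)
```

Here q1 and q4 are answered in time, q2 is answered late and q3 is never answered, so half of the toy questions meet the target.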

#### 10 quality point answer

This is difficult and some questions are difficult to answer comprehensively anyway. If we work on the basis that any answer is useful, shows someone cares and could be the start of a conversation, we could drop this requirement initially.

#### without churn

We know from the EDA that there is a churn problem. It is likely that it can be improved by addressing the cold start issues mentioned above. We do need a measure of the number of active professionals that is reliable and responsive. We could use, say, the % of professionals who have either visited the site or opened an email in the last 10 days.

#### within the bounds of fairness

This is about making the platform accessible and designed for people with less advantage. It is also about recruiting professionals from the full range of employment opportunities, and should include more trades. On this basis we could exclude this element from the relevance v frequency metrics as it is handled by design. A separate metric could be developed which interrogates professionals' tags etc. to categorise the professionals and then report on deficiencies in the coverage of professions.

We therefore have % questions answered v active professionals. 

This measure of active professionals should be a measure of the underlying active professionals such that in any given period being measured new professionals are excluded from the data. 


# Tuning the Question Handler<a id='tuning'></a>

We know that many questions are answered by professionals visiting the website. This is good, as professionals can choose for themselves what to answer and their answers expand our knowledge of them.

As far as emails are concerned we have four variables:

how many emails to send per question:
* Currently on average, 100 emails are sent for each answer

who to send them to:
*  Professionals who have answered similar questions or whose bow shows that the query could be relevant to the professional

when:<a id='maxfreq'></a>
*  We could ask professionals as part of the on-boarding to choose the max frequency of questions being sent to them: say the options are 1 per week, 1 per day or any. Also assume that the majority select 1 per week

when to send them:
* emails can be sent immediately, daily or weekly. Consider dropping the weekly option as it does not support the performance metric
        
     
We know there is a huge range of responses from professionals, ranging from 0 to 1,710 answers provided. However in the analysis below I use mean/average to help develop a tuning method:

In the last 12 months of data, 5,844 questions were asked, or 112 per week on average. 
Let's say we are going to target more answers per question on average, by going from the current 2 to 3. 
Target a move to 20 emails per answer from the current 100. This improvement should be expected as we provide an improved recommender.

Emails sent per week = 112 x 3 x 20 = 6,720

Active professionals required to service, based on 1 per week = 6,720

This seems to be in the right ballpark. By comparison, 6,720 x 100/20 = 33,600 professionals would be needed if the number of emails per answer did not improve. This is too many and so other things would have to compensate.

#### Question Frequency<a id='maxfreq2'></a>
It would be useful to ask a professional who has just answered a question if they would be willing to answer another question without waiting for their selected time period to end, i.e. if they have selected to receive questions once per week, would they be willing to receive another question within the week, immediately after answering one.

At the other end of the spectrum, we know that some professionals don't get asked questions. So at the end of the time period they have selected for their email frequency, if no question has been sent to them, they should receive an email with the top 10 relevant questions, calculated in the same way as the top 10 questions for professionals visiting the website.
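Finding the professionals who have gone a full period without being sent a question could look something like the sketch below. The function name, the two dictionaries and the example ids are all hypothetical; the competition data does not include a stored frequency preference in this form.

```python
from datetime import datetime, timedelta

def professionals_due_digest(last_question_sent, period_days, now):
    """Ids of professionals whose chosen period has passed with no question sent."""
    due = []
    for prof_id, sent_at in last_question_sent.items():
        # period_days holds each professional's selected frequency in days.
        period = timedelta(days=period_days[prof_id])
        if now - sent_at >= period:
            due.append(prof_id)
    return sorted(due)

now = datetime(2019, 1, 31, 12, 0)
last_sent = {
    "p_weekly_quiet": datetime(2019, 1, 20, 12, 0),   # 11 days ago
    "p_weekly_recent": datetime(2019, 1, 28, 12, 0),  # 3 days ago
    "p_daily_quiet": datetime(2019, 1, 29, 12, 0),    # 2 days ago
}
periods = {"p_weekly_quiet": 7, "p_weekly_recent": 7, "p_daily_quiet": 1}
due = professionals_due_digest(last_sent, periods, now)
```

The professionals returned would then be sent the top 10 relevant questions instead of a single matched question.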


To tune this process then we need the following data:

% q answered in 24 hours, q24

rolling average number of questions being asked per week, q

rolling average number of emails sent to professionals to receive one answer, e

average professional question frequency request, number per week, f

target answers per question, apq

active professionals, ap

active professionals required, apr


underlying active professionals, uap

emails to send per question = apq x e

apr = apq x e x q / f

if apr > ap then alert

monitor q24 and uap for trends
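The capacity check can be written out directly from these definitions. The sketch below uses the symbols defined above as parameter names; the function names are hypothetical, and the figures are the ones used earlier in this section (112 questions per week, 3 answers per question, 20 or 100 emails per answer, 1 email per professional per week).

```python
def active_professionals_required(q, e, apq, f):
    """apr = apq x e x q / f, using the symbols defined in the text."""
    if f <= 0:
        raise ValueError("f (questions per professional per week) must be positive")
    return apq * e * q / f

def needs_alert(apr, ap):
    """Alert when more active professionals are required than are available."""
    return apr > ap

target = active_professionals_required(q=112, e=20, apq=3, f=1)    # 6720.0
current = active_professionals_required(q=112, e=100, apq=3, f=1)  # 33600.0
```

With these inputs the function reproduces the 6,720 and 33,600 figures worked out by hand above.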

We know that CV gets about 2 answers from professionals per question. The top 10 most relevant questions shown in the results dataframe for each of the recommender technologies would therefore identify the 20 professionals required to service the targets above. We also know from the results that far more professionals could be identified for the majority of the questions asked and so even the 100 emails per answer could be serviced adequately. When this approach does not deliver sufficient professionals to ask, similar professionals can be found by [comparing the professionals' bags of words](#simprof).


# Recommendations for further improvement<a id='recommend'></a>

The subjective analysis of the last 20 questions shows that the FastText method delivers matches to the student 18 times, with only one of those results being wrong (17 of 18 correct, or 94% accuracy). This is pretty good and, while machine learning could beat it, it is likely that resource could be better allocated elsewhere. In comparison, the tfidf method delivers matches to the student only 13 times, with four results being wrong (9 of 13 correct, or 69% accuracy). Having said that, there is room for tuning the model. Some work has been done here, but further improvement is possible by tuning the parameters used to build the FastText model and the thresholds set in the triage model.

Having something in common between a student and an answering professional has been recognised as being important. This could be location, interest or skill. The development of groups and school_memberships should therefore be encouraged. Currently membership of school_memberships is a mixture of students and professionals, but the numbers are small and so it cannot currently be a major part of the recommender. Membership of groups is also low.

In some questions, location is important, for instance: how do I get a law internship in New York? Generally, though, subject relevance is more important. In this case, would a lawyer at some distance from NY provide a better answer than a teacher in the student's school_memberships? As CV develops, group membership and locations could be used in the future to help prioritise which professionals should be asked the question. At this stage it is probably better to rely almost totally on question similarity. The one exception could be to take into account previous interactions between professional and student. Even in this case, though, question similarity will probably be enough. If the student is trying to decide between being an engineer and a doctor, the student will probably get better advice from both an engineer and a doctor than from whichever one of them was the first to engage.

The GloVe prep code shows that 1% of terms used by students are not recognised by the standard vocab. This is due to spelling and typing errors and also to concatenating words to make tags. Some form of auto-correction would be useful. Concatenation can also be a problem for some recommender models because allowing freeform concatenation reduces the chance of finding matches. A Word2Vec model does not need this concatenation and, on balance, it would be better if concatenation was discouraged. The current CV experiment on providing tag suggestions would be ideal. It both ensures correct spelling and promotes well used tags.

Professions are represented far more than trades in the professionals database. It would be useful if this database was matched against some standard business category system so that data about professions/trades coverage could be established and monitored. Initiatives could then be designed to overcome areas of weaker coverage.

CV already has a system of rewards and feedback to professionals showing how they are doing. As students are not currently providing enough feedback, other forms of feedback are provided that are generated by the system. Although it is good to get this feedback, it is not as rewarding for a professional as direct feedback from a student that the professional has helped. Finding ways of encouraging students to provide feedback would be useful:
* Consider initially withholding next step information and telling a student it is available on ticking a box. The tick could be associated with a measure of the usefulness of the answer.
* Providing students with good matches to their question immediately and asking them to tick the relevant ones allows us to collect "views" and "likes" data for questions.
* Asking students if they think an answer would be useful to other students as suggested by Andrea Madunic.

On relaunch it would be good to contact all the professionals that have shown interest in the past and deliver some good matches to them and monitor if re-engagement is successful.

Re-engagement strategies could be developed to alert professionals that have not answered for a while to a question where the cosine match is high.
 
It is clear that there is a wide range of questions being asked and some are so generic or unclear that an answer is difficult to provide. In these cases it would help to coach the student in available choices and options and so developing content or a career content platform would be useful. This could prove to be impractically expensive but volunteers could perhaps provide content. Ideally, given the demographic, the content needs to be video based and interactive.

The CV platform will benefit from developments in deep learning in the future. CV are currently developing a suggested tag system using FastText, which will be very useful. Automation of metrics would also be useful. For instance, using something like BERT on a labelled set of answers could provide a machine learning method to grade answers.

 

# Set up

In [None]:
"""Read in the data"""
import pandas as pd  # data processing, CSV file I/O (pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
glove_path = '../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt'


professionals = pd.read_csv('../input/data-science-for-good-careervillage/professionals.csv')
emails = pd.read_csv('../input/data-science-for-good-careervillage/emails.csv')
matches = pd.read_csv('../input/data-science-for-good-careervillage/matches.csv')
questions = pd.read_csv('../input/data-science-for-good-careervillage/questions.csv')
answers = pd.read_csv('../input/data-science-for-good-careervillage/answers.csv')
tag_questions = pd.read_csv('../input/data-science-for-good-careervillage/tag_questions.csv')
tags = pd.read_csv('../input/data-science-for-good-careervillage/tags.csv')
tag_users = pd.read_csv('../input/data-science-for-good-careervillage/tag_users.csv')
comments = pd.read_csv('../input/data-science-for-good-careervillage/comments.csv')

question_scores = pd.read_csv('../input/data-science-for-good-careervillage/question_scores.csv')
answer_scores = pd.read_csv('../input/data-science-for-good-careervillage/answer_scores.csv')

group_memberships = pd.read_csv('../input/data-science-for-good-careervillage/group_memberships.csv')
groups = pd.read_csv('../input/data-science-for-good-careervillage/groups.csv')
school_memberships = pd.read_csv('../input/data-science-for-good-careervillage/school_memberships.csv')
students = pd.read_csv('../input/data-science-for-good-careervillage/students.csv')


In [None]:

import numpy as np # linear algebra
import time
import os
import nltk, string
import random
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

random_state = 21

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop_words = set(stopwords.words('english'))
stop_words.update(['gives'])

"""number of columns in the results from a recommender run"""
sample_len = 100

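'''remove punctuation, lowercase, tokenize'''
remove_punctuation_map = dict((ord(char), ' ') for char in string.punctuation)    
def normalize(text):
    return nltk.word_tokenize(text.lower().translate(remove_punctuation_map))


def cosine_sim(text1, text2):
    # Fit a tfidf vectorizer on just the two texts; rows are L2-normalised,
    # so their dot product is the cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(tokenizer=normalize)
    tfidf = vectorizer.fit_transform([text1, text2])
    return (tfidf * tfidf.T).toarray()[0, 1]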

def takeSecond(elem):
    return elem[1]

def clean_text(text):
    text = text.lower().translate(remove_punctuation_map)
    
    return ' '.join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))


print(os.listdir("../input"))


#pd.options.display.max_colwidth = -1

# Recommender Exploration<a id='recommender_exploration'></a>

## Questions Bags of Words

Students use the title, body and tags sections in different ways. To capture all the information in one place we produce a bag of words.

First step is to combine the question title and text and all the tags for the questions into one cell...


In [None]:
questions_tags = questions.merge(right=tag_questions, how = 'left',
                                            left_on ='questions_id',
                                            right_on ='tag_questions_question_id')

questions_tagwords = questions_tags.merge(right=tags, how = 'left',
                                            left_on ='tag_questions_tag_id',
                                            right_on ='tags_tag_id')

questions_tagwords =questions_tagwords.sort_values('questions_id')
questions_tagwords = questions_tagwords.drop (['tag_questions_tag_id','tag_questions_question_id','tags_tag_id','questions_author_id'], axis = 1)

questions_tagwords_tb = questions_tagwords.copy()
questions_tagwords_tb['q_tb'] = questions_tagwords['questions_title'] + " " + questions_tagwords['questions_body']

questions_tagwords_tb = questions_tagwords_tb.drop (['questions_title','questions_body'], axis = 1)
questions_tagwords_tb_str = questions_tagwords_tb.copy()
questions_tagwords_tb_str ['tags'] = questions_tagwords_tb ['tags_tag_name'].map (str)
questions_tagwords_tb_str = questions_tagwords_tb_str.drop (['tags_tag_name'], axis = 1)

foo =lambda x:', '.join(x)
agg_f = {'questions_id':'first', 'questions_date_added' : 'first' ,'q_tb': 'first','tags' : foo}

questions_q_tb_tags  = questions_tagwords_tb_str.groupby(by='questions_id').agg(agg_f)

questions_q_tb_tags  = questions_q_tb_tags.drop(['questions_id'], axis = 1)
questions_q_tb_tags  =questions_q_tb_tags .sort_values ('questions_date_added', ascending = False).reset_index()



questions_q_tb_tags.head(1)


### Clean Text

Here a "clean text" version is produced by moving to lower case, removing punctuation and lemmatizing, all in the function clean_text.
* Lemmatizing was particularly useful for tfidf as it removed the distinction between singular and plural.
* Punctuation is replaced with a space to reduce concatenation problems. In practice FastText could probably handle concatenated tags, although with the current setting only tags that appear more than once would be considered.

Then we remove stop words using a standard English stop word list, with the option of updating it as shown in the set up cell. This option was only used for a test word "gives" as results were good without further tuning.

The result of the cleaning can be seen by comparing the full text (bow_f) with the clean text (bow).

Further tuning by adding the stop word "would" would probably improve the accuracy for question4 below.

Given the experience gained here, it is recommended that the cleaning function is bespoke to the particular recommender chosen for the CV task.
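The effect of replacing punctuation with a space, rather than deleting it, can be seen with a short standard-library example. The sample sentence is made up for illustration, and `to_nothing` is a hypothetical alternative shown only for comparison; the kernel itself uses the space-replacement map defined in the set up cell.

```python
import string

# Same idea as remove_punctuation_map in the set up cell: punctuation -> space.
to_space = {ord(ch): " " for ch in string.punctuation}
# Alternative for comparison only: delete punctuation outright.
to_nothing = {ord(ch): None for ch in string.punctuation}

text = "Engineering,medicine or #computer-science?"
# split() then join() collapses the repeated spaces left behind.
spaced = " ".join(text.lower().translate(to_space).split())
deleted = " ".join(text.lower().translate(to_nothing).split())
```

Deleting punctuation fuses neighbouring words into new, rare tokens, whereas replacing it with a space keeps each word available for matching.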

In [None]:
"""Tags are repeated in the body of the question and so stop """
"""questions_bow  = questions_q_tb_tags.copy()
questions_bow ['bow'] = questions_bow['q_tb'] + " " + questions_bow['tags']
questions_bow  = questions_bow.drop(['q_tb','tags','questions_id'], axis = 1)
questions_bow  =questions_bow .sort_values ('questions_date_added', ascending = False).reset_index()
pd.options.display.max_colwidth = -1"""
start = time.time()
questions_bow  = questions_q_tb_tags.copy()
questions_bow = questions_bow.rename(columns={'q_tb': 'bow_f'})

questions_bow['bow'] = questions_bow.bow_f.apply(clean_text)
questions_bow['bow'] = questions_bow['bow'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
questions_bow  = questions_bow.drop(['tags'], axis = 1)
#questions_bow  =questions_bow .sort_values ('questions_date_added', ascending = False).reset_index()


end = time.time()
print('run time',end - start)
pd.options.display.max_colwidth = 500
questions_bow.head(5)



Drop the questions without answers.

To make connections between students and professionals we concentrate on the questions with answers.

Some questions have no answers because they were still new when the data set was collected.

In [None]:

q_with_answers_bow = questions_bow.merge(right=answers, how = 'left',
                                            left_on ='questions_id',
                                            right_on ='answers_question_id')
q_with_answers_bow = q_with_answers_bow.dropna(how='any')
q_with_answers_bow = q_with_answers_bow.drop (['answers_id','answers_author_id','answers_question_id','answers_date_added','answers_body'], axis=1)
q_with_answers_bow  =q_with_answers_bow .sort_values ('questions_date_added', ascending = False)

q_with_answers_bow.drop_duplicates( inplace = True)
q_with_answers_bow = q_with_answers_bow.reset_index()
q_with_answers_bow = q_with_answers_bow.drop (['index'], axis=1)

q_with_answers_bow.head(1)

Produce a subset of the questions data, using the most recent...

The last question was asked on 31/1/19.

5,844 questions asked in the previous 12 months.

In [None]:

questions_bow['questions_date'] = pd.to_datetime(questions_bow['questions_date_added'])
q_bow_n = questions_bow.copy()
q_bow_n['questions_date'] = questions_bow['questions_date'].dt.normalize()
q_bow_n = q_bow_n.drop (['questions_date_added'], axis = 1)
"""Change date to include more questions"""
"""last month is 304"""
last_qbow_full = q_bow_n[questions_bow.questions_date > '2018-01-31']
last_qbow_full.describe()


Produce a subset of the questions data for performance testing...

In [None]:
"""Produce a subset of the questions data for performance testing"""

half_qbow_full = q_bow_n[questions_bow.questions_date > '2017-01-1']
half_qbow_full.describe()

In [None]:
"""Run this cell to get a smaller random sample of the smaller question data set"""

last_qbow = last_qbow_full.sample(n=10, random_state = 42).reset_index()
last_qbow = last_qbow.drop (['index'], axis = 1)
last_qbow.head(1)

In [None]:
"""Produce a subset of the questions data, using the x most recent"""

#last_qbow = q_with_answers_bow[0:50]

last_qbow = questions_bow[0:sample_len]
last_qbow.head(1)

## Professionals Bags of Words

We want to get to know more about the professionals so that we can direct more relevant questions to them. We will do this by putting everything they have told us about themselves into a bag of words **together with all the words from the questions they have answered**.

We could also look at including the words of the answers they have provided, but there is the danger that they use different vocabulary from the students and so I will not do this initially. This could be tried later to see if relevance improves.

Get the tags associated with professionals...

In [None]:
professionals_tags = professionals.merge(right=tag_users, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='tag_users_user_id')

professionals_tagwords = professionals_tags.merge(right=tags, how = 'left',
                                            left_on ='tag_users_tag_id',
                                            right_on ='tags_tag_id')

professionals_tagwords =professionals_tagwords.sort_values('professionals_id')
professionals_tagwords = professionals_tagwords.drop (['professionals_location','professionals_date_joined','tag_users_tag_id','tag_users_user_id','tags_tag_id'], axis=1)

professionals_tagwords.head(1)

In [None]:
"""Convert the columns to strings, even though they already look like strings? 
    This is to make concatenation possible later..."""
df_p_q_str = professionals_tagwords.copy()
df_p_q_str ['tag'] = df_p_q_str ['tags_tag_name'].map (str)
df_p_q_str = df_p_q_str.drop (['tags_tag_name'], axis = 1)
df_p_q_str ['industry'] = df_p_q_str ['professionals_industry'].map (str)
df_p_q_str = df_p_q_str.drop (['professionals_industry'], axis = 1)
df_p_q_str ['job'] = df_p_q_str ['professionals_headline'].map (str)
df_p_q_str = df_p_q_str.drop (['professionals_headline'], axis = 1)

df_p_q_str.head(1)

In [None]:
"""merge the tags"""

foo =lambda x:', '.join(x)
agg_f = {'professionals_id':'first', 'industry': 'first','job': 'first','tag' : foo}

df_p_q= df_p_q_str.groupby(by='professionals_id').agg(agg_f)
df_p_q = df_p_q.drop (['professionals_id'], axis = 1).reset_index()


df_p_q.head(1)

In [None]:
"""Merge the questions answered by the professional to the professionals dataframe"""

df_p_a = df_p_q.merge(right=answers, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='answers_author_id')
df_p_a = df_p_a.drop (['answers_author_id','answers_date_added','answers_body'], axis=1)

df_p = df_p_a.merge(right=questions_bow, how = 'left',
                                            left_on ='answers_question_id',
                                            right_on ='questions_id')

df_p = df_p.drop (['answers_id','answers_question_id','questions_id'], axis=1)
df_p ['qbow'] = df_p ['bow'].map (str)
df_p = df_p.drop (['bow'], axis = 1)
df_p ['qbow_f'] = df_p ['bow_f'].map (str)
df_p = df_p.drop (['bow_f'], axis = 1)


df_p.head(1)

In [None]:
"""Merge the questions"""

foo = lambda x: ', '.join(x)
agg_f = {'professionals_id':'first',  'industry': 'first','job': 'first','tag' : 'first', 'qbow' : foo, 'qbow_f' : foo}

df_p_bow  = df_p.groupby(by='professionals_id').agg(agg_f)
df_p_bow = df_p_bow.drop (['professionals_id'], axis = 1).reset_index()
df_p_bow = df_p_bow.sort_values('professionals_id')


df_p_bow['bow_f'] = df_p_bow['industry'] + " " + df_p_bow['job']+ " " + df_p_bow['tag']+ " " + df_p_bow['qbow_f']

df_p_bow['bow'] = df_p_bow['industry'] + " " + df_p_bow['job']+ " " + df_p_bow['tag']+ " " + df_p_bow['qbow']
df_p_bow = df_p_bow.drop (['industry','job','tag','qbow','qbow_f'], axis = 1).reset_index()

df_p_bow.head(1)

In [None]:
"""drop the professionals who have not answered a question"""

df_p_nonan = df_p.dropna(how='any')

df_p_bow_nonan  = df_p_nonan.groupby(by='professionals_id').agg(agg_f)
df_p_bow_nonan = df_p_bow_nonan.drop (['professionals_id'], axis = 1).reset_index()
df_p_bow_nonan = df_p_bow_nonan.sort_values('professionals_id')



df_p_bow_nonan['bow'] = df_p_bow_nonan['industry'] + " " + df_p_bow_nonan['job']+ " " + df_p_bow_nonan['tag']+ " " + df_p_bow_nonan['qbow']
df_p_bow_nonan = df_p_bow_nonan.drop (['industry','job','tag','qbow'], axis = 1).reset_index()


df_p_bow_nonan.head(1)

# Looking for similar questions<a id='simq'></a>

# Method 1: tfidf<a id='m1'></a>

tfidf, or tf-idf, stands for term frequency–inverse document frequency. It is a statistic which reflects the importance of a word to a document within a collection of documents. Its value increases with the number of times the word appears in the document and decreases with the number of documents in the collection that contain the word.

tfidf is simple to calculate but does not account for context or meaning.
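A small worked example may help make the definition concrete. The sketch below uses the textbook form tf(t, d) x log(N / df(t)); the function name and the toy documents are made up for illustration. Note that scikit-learn's `TfidfVectorizer`, which is used in the cells that follow, by default applies a smoothed idf and L2-normalises each row, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook tf-idf: tf(t, d) * log(N / df(t)) for tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "career engineering engineering".split(),
    "career medicine".split(),
    "career law".split(),
]
w = tfidf(docs)
```

The word "career" appears in every document and so gets a weight of zero, while "engineering", which is repeated in one document and absent from the others, gets the highest weight.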



## Function to return array of cosine similarities for two arrays of questions using tfidf


In [None]:

"""Find the tfidf cosine similarity between one set of questions and another"""
"""The first array is the total sample and the second a smaller sample of queries"""

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

   
def get_sim_q_array (q_total,q_query):
    #vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
    vectorizer = TfidfVectorizer(tokenizer=normalize)

    """q_query could be passed 1 or more queries"""
    vectorizer.fit(q_total)
    q_total_tfidf = vectorizer.transform(q_total)
    q_query_tfidf = vectorizer.transform(q_query)
    q_sim_array = cosine_similarity(q_total_tfidf, q_query_tfidf)
    
    return (q_sim_array)

## Performance problem with tfidf<a id='production'></a>

This cell shows the first stage of a production implementation. The array output (q_sim) can be used to identify similar questions that can either be immediately presented to the student or be used to find suitable professionals to be asked the question.

For this test we are finding a similarity array for the last question in the set.

In [None]:
"""Single query to get tfidf similarities"""
start = time.time()

q_total = ["".join(x) for x in (q_with_answers_bow['bow'])]
q_queries = [last_qbow.loc [0]['bow']]
q_sim = get_sim_q_array (q_total,q_queries)

end = time.time()
print('run time',end - start)
#print(q_sim)

The last cell and the next cell show that the relationship between processing time and the number of questions in the data set is linear. This makes the use of tfidf questionable<a id='performance'></a> as the bank of questions expands due to the continued success of CV.

In [None]:
"""Single query to get tfidf similarities for half questions for performance test"""
start = time.time()

q_total = ["".join(x) for x in (half_qbow_full['bow'])]
q_queries = [last_qbow.loc [0]['bow']]
q_sim = get_sim_q_array (q_total,q_queries)

end = time.time()
print('run time',end - start)
#print(q_sim)

## Get the similarity array for all questions with answers v all questions with answers

In [None]:
"""Multiple queries to get tfidf similarities"""

start = time.time()

q_total = ["".join(x) for x in (q_with_answers_bow['bow'])]
q_queries = ["".join(x) for x in (q_with_answers_bow['bow'])]
q_sim_tfidf_array = get_sim_q_array (q_total,q_queries)

end = time.time()
print('run time',end - start)
#print(q_sim_tfidf_array)

In [None]:
"""function to produce dataframe of results of similarity tests"""
def get_sim_results_with_threshold (column_head,index,sim_array,questions,query,h_threshold,l_threshold):

    col_h = column_head + str(index)
    
    df_sim_q = pd.DataFrame({'Cosine':sim_array[:,index], col_h:questions['bow_f']})

    df_sim_q_sorted = df_sim_q.sort_values('Cosine',ascending = False )
    if df_sim_q_sorted.iloc[0]['Cosine'] > .9999:
        df_sim_q_sorted = df_sim_q_sorted.drop(df_sim_q_sorted.index[0])

    # Count the matches above each threshold. The frame is sorted in descending
    # order, so the scan stops at the first value at or below l_threshold.
    h_num = 0
    l_num = 0
    worst_h_num = -1
    i = 0
    # Use the sorted frame's own length: it is one row shorter than
    # `questions` when the self-match has been dropped above.
    n_rows = len(df_sim_q_sorted)
    while i < n_rows and df_sim_q_sorted.iloc[i]['Cosine'] > l_threshold:
        l_num += 1
        worst_match_to_profs = i
        if df_sim_q_sorted.iloc[i]['Cosine'] > h_threshold:
            worst_h_num = i
            h_num += 1
        i += 1
    
    df_sim_q_sample = df_sim_q_sorted[:10]
        
    best_cos_0 = df_sim_q_sample.iloc[0]['Cosine']
    best_cos_9 = df_sim_q_sample.iloc[9]['Cosine']
    
    df_sim_q_sample = df_sim_q_sample.drop ('Cosine', axis=1).reset_index()
    df_sim_q_sample = df_sim_q_sample.drop ( 'index', axis=1)

    df_sim_q_sample_T = df_sim_q_sample.T
    df_sim_q_sample_T.insert(loc=0, column='query_id', value=[query.iloc[index]['questions_id']] )
    df_sim_q_sample_T.insert(loc=1, column='query_bow', value=[query.iloc[index]['bow_f']]  )
    df_sim_q_sample_T.insert(loc=2, column='best_cos', value=best_cos_0)
    df_sim_q_sample_T.insert(loc=3, column='10th_best_cos', value=best_cos_9)
    df_sim_q_sample_T.insert(loc=4, column='similar Q to students', value= h_num)
    df_sim_q_sample_T.insert(loc=5, column='Qs to profs', value=l_num)
    df_sim_q_sample_T.insert(loc=6, column='best matches', value=' ')

    if worst_h_num > -1:
        df_sim_q_sample_T.insert(loc=17, column='worst match to students', value=df_sim_q_sorted.iloc[worst_h_num][col_h])
    else:
        df_sim_q_sample_T.insert(loc=17, column='worst match to students', value='not available')
    
    if l_num > 0:
        df_sim_q_sample_T.insert(loc=18, column='worst match to profs', value=df_sim_q_sorted.iloc[worst_match_to_profs][col_h])
    else:
        df_sim_q_sample_T.insert(loc=18, column='worst match to profs', value='not available')
    
    
    
    return ( df_sim_q_sample_T)

## tfidf similarity results<a id='tfidfsimilarityresults'></a>

The DataFrame below shows the results from the tfidf bows method for the last 100 questions asked that have answers.

query_id

This is the id of the questions asked

query_bow

This is all the text from the question lumped together: title, body, tags

best_cos

This is the cosine similarity between the query and the best match. The range is 0-1, the higher the better. When best_cos is lower than the high threshold, the best_match text is not displayed and the h_ and l_threshold_q text displays the best_match and next_best instead. In this case the student would not be presented with any previously answered questions.

10th_best_cos

This is the cosine similarity between the query and the 10th best match among the previously answered questions.

similar Q to students

This is the number of previously answered questions with a cosine similarity greater than the high threshold set by the user. In this case, it is the number of questions which could be presented to the student as they may answer the student's question.

Qs to profs

This is the number of previously answered questions with a cosine similarity greater than the low threshold. When this number is low (say less than 100) then the question could be directed to the highly engaged professionals who have committed to answer difficult to answer questions

best matches

These are the texts of the 10 previously answered questions that best match the query

worst match to students

This is the lowest-scoring previously answered question that is still above the high threshold. Its relevance to the query is a good indication of whether the high threshold has been set high enough to ensure that highly relevant answers are provided to the student

worst match to profs

This is the lowest-scoring previously answered question that is still above the low threshold. Its relevance to the query is a good indication of whether the low threshold has been set high enough to ensure that reasonably relevant questions are being selected to identify professionals who are likely to answer the query
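
The two counts and the two "worst match" rows described above can all be derived from one list of cosine scores. The sketch below is a minimal, pure-Python illustration and not the kernel's `get_sim_results_with_threshold`. The function names, the category labels and the `min_student_matches` cutoff are assumptions made for this example; the cutoff of 100 for professionals follows the "say less than 100" suggestion above.

```python
def triage_counts(cosines, h_threshold, l_threshold):
    """Count matches above each threshold and locate the weakest match kept."""
    ranked = sorted(cosines, reverse=True)           # best match first
    h_num = sum(1 for c in ranked if c > h_threshold)
    l_num = sum(1 for c in ranked if c > l_threshold)
    return {
        "similar Q to students": h_num,
        "Qs to profs": l_num,
        # weakest score still above each threshold, or None when nothing passes
        "worst cos to students": ranked[h_num - 1] if h_num else None,
        "worst cos to profs": ranked[l_num - 1] if l_num else None,
    }


def triage_category(counts, min_student_matches=1, min_prof_matches=100):
    """Map the counts to one of three triage categories (hypothetical cutoffs)."""
    if counts["similar Q to students"] >= min_student_matches:
        return "show similar questions to student"
    if counts["Qs to profs"] >= min_prof_matches:
        return "route to matched professionals"
    return "flag for highly engaged professionals"


scores = [0.61, 0.48, 0.30, 0.24, 0.10]
counts = triage_counts(scores, h_threshold=0.45, l_threshold=0.225)
print(counts, triage_category(counts))
```

Inspecting the text behind the two "worst" scores is a quick way to judge whether a threshold is too loose for a given similarity method.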


In [None]:
"""Compare  q with q using tfidf"""
h_threshold =0.45
l_threshold =0.225


results_T = get_sim_results_with_threshold ('query',0,q_sim_tfidf_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)

for i in range(1,sample_len):
    next_result = get_sim_results_with_threshold ('query',i,q_sim_tfidf_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)
    results_T = pd.concat([results_T,next_result])
results_tfidf = results_T.T
pd.options.display.max_colwidth = 500
display(results_tfidf) 

## Plots of number of similar questions above the thresholds

The plots show the wide variety in the number of responses and the need for a system that can handle both difficult-to-answer and easy-to-answer questions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

q_num = list(range(sample_len))

area = np.pi*3
plt.figure(figsize=(10,10))
plt.xlim(0, sample_len)
plt.ylim(0,200)

# Plot
plt.scatter(q_num, results_T['similar Q to students'], s=25, c='red', alpha=0.5, label = "> high threshold")
plt.scatter(q_num, results_T['Qs to profs'], s=25, c='blue', alpha=0.5, label = "> low threshold")

plt.title('Scatter plot showing similar questions found for each query')
plt.xlabel('query')
plt.ylabel('similar questions')
plt.legend()
plt.show()

### Plot showing number of questions above the high threshold

In [None]:
area = np.pi*3
plt.figure(figsize=(10,10))
plt.xlim(0, sample_len)
plt.ylim(0,20)

# Plot
plt.scatter(q_num, results_T['similar Q to students'], s=25, c='red', alpha=0.5, label = "> high threshold")

plt.title('Scatter plot showing similar questions found for each query')
plt.xlabel('query')
plt.ylabel('similar questions')
plt.legend()
plt.show()

### Function to call when finding professionals chosen by recommenders

In [None]:
def get_sim_questions_id (column_head,index,sim_array,questions,query):

    col_h = column_head + str(index)
    
    df_sim_q = pd.DataFrame({'Cosine':sim_array[:,index], col_h:questions['questions_id']})
    
    
    df_sim_q_sorted = df_sim_q.sort_values('Cosine',ascending = False )
    if df_sim_q_sorted.iloc[0]['Cosine'] > .9999:
        df_sim_q_sorted = df_sim_q_sorted.drop(df_sim_q_sorted.index[0])
        
    df_sim_q_sample = df_sim_q_sorted[:10]
    
    df_sim_q_sample = df_sim_q_sample.drop ('Cosine', axis=1).reset_index()
    df_sim_q_sample = df_sim_q_sample.drop ( 'index', axis=1)

    
    df_sim_q_sample_T = df_sim_q_sample.T
    df_sim_q_sample_T.insert(loc=0, column='id', value=[query.iloc[index]['questions_id']] )
    df_sim_q_sample_T.insert(loc=1, column='bow', value=[query.iloc[index]['bow_f']]  )    
    
    
    return ( df_sim_q_sample_T)

In [None]:
def find_top_q_ids (q_sim_array):
    q_id_results_T = get_sim_questions_id ('query',0,q_sim_array,q_with_answers_bow,q_with_answers_bow)

    for i in range(1,sample_len):
        next_result = get_sim_questions_id ('query',i,q_sim_array,q_with_answers_bow,q_with_answers_bow)
        q_id_results_T = pd.concat([q_id_results_T,next_result])
    return q_id_results_T.T


In [None]:
def professionals_to_ask (sim_q_results, index):
    col = 'query' + str(index)

    df_prof_with_a = sim_q_results[[col]]
    df_prof_with_a = df_prof_with_a.rename(columns={col: 'questions_id'})
    df_prof_with_a = df_prof_with_a.drop(['id','bow'], axis = 0)   
    df_prof_with_a = df_prof_with_a.merge(right=answers, how = 'left',
                                            left_on ='questions_id',
                                            right_on ='answers_question_id')
    df_prof_with_a = df_prof_with_a.merge(right=professionals, how = 'left',
                                            left_on ='answers_author_id',
                                            right_on ='professionals_id')
    df_prof_with_a =  df_prof_with_a.drop (['questions_id','answers_author_id','professionals_headline','professionals_date_joined',
                                        'answers_id','answers_date_added','answers_body',
                                    'answers_question_id','professionals_location','professionals_industry'], axis = 1)
    tot = df_prof_with_a.shape[0]
    
     
    df_prof_with_a_T = df_prof_with_a.T
    #df_prof_with_a_T.insert(loc=0, column='total', value = tot)
    df_prof_with_a = df_prof_with_a_T.T
    #df_prof_with_a = df_prof_with_a.rename(columns={'professionals_id': col})
    df_prof_with_a.insert(loc=1, column='total', value = 0)
    
    return df_prof_with_a
#tfidfx.head(20)

## Find professionals chosen using tfidf

In [None]:
"""Compare  q with q using tfidf to get q_ids then prof_ids"""
q_id_results = find_top_q_ids (q_sim_tfidf_array)
pd.options.display.max_colwidth = 500
q_id_results.head(20) 


In [None]:
"""check that we have right ids!"""
"""x = q_with_answers_bow.count()
print (x.questions_id)
i =0 
while i<x.questions_id:
    row = q_with_answers_bow.iloc[i]
    
    if  row.questions_id == 'c0c9260091b3443f9c712d5ff2d2c2e0':
        print ('q_with_answers_bow.iloc[i][questions_id]',row.questions_id)
        print ('q_with_answers_bow.iloc[i][bow_f]',row.bow_f)
    i += 1"""

In [None]:
df_tfifd_prof_with_a = professionals_to_ask (q_id_results, 1)
df_tfifd_prof_with_a.head()

In [None]:
df_tfifd_prof_with_a = professionals_to_ask (q_id_results, 0)


for i in range(1,sample_len):
    next_p_to_a = professionals_to_ask (q_id_results, i)
    df_tfifd_prof_with_a = pd.concat([df_tfifd_prof_with_a,next_p_to_a],axis=0, sort=True)
    
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

df_tfidf_p_grouped = df_tfifd_prof_with_a.groupby('professionals_id').count()
df_tfidf_p_grouped = df_tfidf_p_grouped.sort_values('total',ascending = False )


In [None]:
next_p_to_a.head()

## Professionals identified by tfidf

The cells below show that the 1,000 most relevant questions, according to tfidf, to the 100 last questions asked in the dataset were answered by:

1,108 professionals
who provided 2,180 answers
with the majority of professionals providing just one answer during that period but one professional providing 81 answers for the 100 questions. This is the same professional who has answered 1,710 questions.
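
These figures come from grouping the merged answers by `professionals_id` and counting rows. The sketch below reproduces that kind of summary on toy data. The ids are invented and `collections.Counter` stands in for the pandas `groupby`, so this illustrates the idea and is not the kernel's code. It also shows one way to measure how much of the load falls on a single very active professional, which is relevant to the "without churning the Pros" part of the CV metric.

```python
from collections import Counter

# Toy (question_id, professional_id) pairs: one row per answer to a similar question.
# The ids are made up for illustration.
toy_answers = [
    ("q1", "prof_a"), ("q1", "prof_b"),
    ("q2", "prof_a"), ("q3", "prof_a"),
    ("q3", "prof_c"),
]

answers_per_prof = Counter(prof for _, prof in toy_answers)
ranked = answers_per_prof.most_common()          # most active professional first

n_profs = len(answers_per_prof)
n_answers = sum(answers_per_prof.values())
# share of professionals who gave exactly one answer
single_answer_share = sum(1 for n in answers_per_prof.values() if n == 1) / n_profs
# fraction of all answers that came from the most active professional
top_share = ranked[0][1] / n_answers

print(n_profs, n_answers, ranked[0], round(single_answer_share, 2), round(top_share, 2))
```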


In [None]:
df_tfidf_p_grouped.sum(axis = 0, skipna = True) 

In [None]:
df_tfidf_p_grouped.describe()

In [None]:

q_num = list(range(df_tfidf_p_grouped.shape[0]))

area = np.pi*3
plt.figure(figsize=(10,10))
plt.xlim(0, df_tfidf_p_grouped.shape[0])
plt.ylim(0,100)

# Plot
# a label is needed so that plt.legend() has something to show
plt.scatter(q_num, df_tfidf_p_grouped['total'], s=25, c='red', alpha=1, label='answers per professional')

plt.title('Scatter plot showing number of queries answered by professionals')
plt.xlabel('professional')
plt.ylabel('queries answered')
plt.legend()
plt.show()

In [None]:
q_sim_tfidf_array = []

# Method 2: Word2vec and sentence embedding<a id='m2'></a>

There is a Kaggle tutorial on Word2Vec here: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

Here is an article which explains the idea behind Word2Vec: http://cgi.cs.mcgill.ca/~enewel3/posts/implementing-word2vec/

"One of these assumptions is the distributional hypothesis, which is the idea that the meaning of a word can be understood from the words that tend to be near it. For example “bread” might tend to show up near “eat”, “bake”, “butter”, “toast”, etc., and this entourage gives a signal of what “bread” means."
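
The "words that tend to be near it" idea can be made concrete by listing the (target, context) pairs that fall inside a window around each word. The sketch below only illustrates the windowing; it is not how gensim is implemented internally, and the function name `context_pairs` is an assumption for this example.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["bake", "bread", "eat", "toast"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

A larger window produces more pairs per word, which is the sense in which a wider `window` setting captures more context.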

Here is the original paper: https://arxiv.org/pdf/1301.3781.pdf

The first step to finding the similarity between questions is to convert all the words in the vocabulary to vectors...
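
Before any vectors are trained, each question is lower-cased, stripped of punctuation, tokenised and filtered for stop words. A dependency-free sketch of that preparation is shown below. It uses `str.split` and a tiny hand-written stop-word set as stand-ins for `nltk.word_tokenize` and the NLTK English stop-word list, so its output will differ from the kernel's on real text; the names `STOP_WORDS`, `REMOVE_PUNCTUATION` and `clean_tokens` are assumptions for this example.

```python
import string

# Tiny stand-in for the NLTK English stop-word list (assumption for this sketch).
STOP_WORDS = {"i", "a", "an", "the", "to", "do", "what", "is", "how"}

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean_tokens(text):
    """Lower-case, strip punctuation, split on whitespace and drop stop words."""
    words = text.lower().translate(REMOVE_PUNCTUATION).split()
    return [w for w in words if w not in STOP_WORDS]


print(clean_tokens("How do I become a police officer?"))
```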

In [None]:

start = time.time()

total_q_bow = ["".join(x) for x in (q_with_answers_bow['bow'])]
#vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
vectorizer = TfidfVectorizer(tokenizer=normalize)
tfidf = vectorizer.fit_transform(total_q_bow)

cachedStopWords = stopwords.words("english")
#print(total_q_bow)
total_q_bow_l  = [x.lower() for x in total_q_bow]
#print(total_q_bow_l)
all_words = [nltk.word_tokenize(x.translate(remove_punctuation_map)) for x in total_q_bow_l]

for i in range(len(all_words)):  
    all_words[i] = [w for w in all_words[i] if w not in cachedStopWords]

#print(all_words)

end = time.time()
print('run time',end - start)

In [None]:
"""w2v_model= Word2Vec(all_words, min_count=2)
"""
from gensim.models import Word2Vec

embed_size = 300

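#the model is set up as in the Kaggle tutorial above with the noted exceptions:
# min_count =2 because the corpus is relatively small
# window = 5 to capture more context around a word
# note: size= and wv.vocab are the gensim 3.x names; gensim 4 renamed them to vector_size= and wv.key_to_index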
w2v_model = Word2Vec(min_count=2,
                     window=5,
                     size=embed_size,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=1)

start = time.time()

w2v_model.build_vocab(all_words, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time.time() - start) / 60, 2)))

w2v_model.train(all_words, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time.time() - start) / 60, 2)))



print('w2v_model.corpus_count',w2v_model.corpus_count)
vocabulary = w2v_model.wv.vocab  
#print(vocabulary)


## How good is word2vec

This test shows that with the vocabulary built from the questions, the technique is finding good matches. In this case the most similar words to police are shown...
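
`most_similar` ranks the vocabulary by cosine similarity to the query word's vector. The toy sketch below reproduces that ranking with NumPy on three-dimensional vectors that are invented for illustration (the vectors trained in this kernel have 300 dimensions); the function name `most_similar_words` is hypothetical.

```python
import numpy as np

# Made-up word vectors, for illustration only.
toy_vectors = {
    "police":  np.array([0.9, 0.1, 0.0]),
    "officer": np.array([0.8, 0.2, 0.1]),
    "nurse":   np.array([0.1, 0.9, 0.2]),
    "bread":   np.array([0.0, 0.1, 0.9]),
}


def most_similar_words(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue                        # skip the query word itself
        cos = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores.append((other, cos))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar_words("police", toy_vectors))
```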

In [None]:
w2v_model.wv.most_similar(positive=["police"])

## Question embedding

In the following code the word2vec word vectors are combined to produce vectors for each question. This is done by taking the average of the word embeddings, weighted by their tfidf scores, as described in this paper: http://www2.aueb.gr/users/ion/docs/BioNLP_2016.pdf.

These question vectors are then used to find similarity using cosine similarity
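
Written out, the vector for a question is the sum of each word vector multiplied by its tfidf weight, divided by the sum of those weights. The sketch below computes this for all questions at once with NumPy. It is an alternative formulation and not the loop used in this kernel, and it assumes the tfidf weights of words that have no embedding have already been set to zero. A question with no usable words gets a zero vector here, which avoids a division by zero; the function name `weighted_question_vectors` is hypothetical.

```python
import numpy as np


def weighted_question_vectors(weights, word_vectors):
    """tfidf-weighted mean of word vectors for every question.

    weights      : (n_questions, n_words) tfidf scores, 0 where a word is absent
                   or has no embedding
    word_vectors : (n_words, embed_size) one embedding per vocabulary word
    """
    weights = np.asarray(weights, dtype=float)
    totals = weights.sum(axis=1, keepdims=True)      # sum of weights per question
    summed = weights @ word_vectors                  # weighted sum of embeddings
    # Questions with no usable words get a zero vector instead of NaN.
    return np.divide(summed, totals, out=np.zeros_like(summed), where=totals > 0)


toy_weights = np.array([[0.5, 0.5, 0.0],
                        [0.0, 0.2, 0.6],
                        [0.0, 0.0, 0.0]])   # third question has no known words
toy_word_vectors = np.array([[1.0, 0.0],
                             [0.0, 1.0],
                             [1.0, 1.0]])
print(weighted_question_vectors(toy_weights, toy_word_vectors))
```

Keeping one output row per question, even when the row is all zeros, keeps the embeddings aligned with the question DataFrame.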

In [None]:
"""sentence embedding for questions using Word2Vec"""
start = time.time()

rows, cols = tfidf.nonzero()
print (rows)
print (cols)
rows_l = len(rows)

s_embed = []
s_embeds = []
dividend = []
atStart = True
oldr = -1
w_cnt = 0
vocab = vectorizer.get_feature_names()

for i in range (rows_l):
    r = rows[i]
    c = cols[i]
    if (oldr != r):
        if (atStart == False):
            #calc embedding for questions
            s_embed = np.divide(dividend, divisor)
            s_embeds.append(s_embed.flatten())
            
        else: 
            atStart = False
        oldr = r
        w_cnt = 0
        dividend = np.zeros((1, embed_size))
        divisor = 0

       
    #print('r,c,w_cnt',r,c,w_cnt)
    word = vocab[c]
    if word in w2v_model.wv.vocab:
        wt = tfidf[r,c]
        #print (wt, word)
        w_embed = w2v_model.wv[word]
        #print(w_embed)
        #print(w_embed * wt)
        dividend = np.add(dividend, w_embed * wt)
        divisor += wt
        w_cnt +=1
#    else:
#        print (word, " not in vocab")
s_embed = np.divide(dividend, divisor)
s_embeds.append(s_embed.flatten())
#print (s_embeds)
end = time.time()
print('run time',end - start)


## Word2Vec Advantage

You would expect Word2Vec sentence embeddings to compare questions better than tfidf does, because Word2Vec accounts for the context in which words appear as well as term frequency.
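
One limitation of term-matching methods such as tfidf is that two questions with no words in common score zero, however close their meaning. The sketch below shows this with plain word counts; tfidf weighting would rescale the counts but could not make the first score non-zero. The function name `bow_cosine` and the example tokens are invented for illustration.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())   # only shared words contribute
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


no_overlap = bow_cosine(["become", "police", "officer"], ["join", "law", "enforcement"])
some_overlap = bow_cosine(["become", "police", "officer"], ["police", "officer", "salary"])
print(no_overlap, some_overlap)
```

Embedding-based question vectors are not bound by this constraint, because related words can have similar vectors even when the surface forms differ.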

Another big advantage of Word2Vec is that the vectors for the corpus of questions can be stored and do not have to be recalculated each time a new question arrives. Retraining on the whole corpus would be marginally better, because all new words would be vectorised in relation to each other, but the benefit is likely to be very small.

By re-running Word2Vec regularly it is possible to ensure that the vocabulary is a living record of how students ask questions and so relevance is improved.
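
A minimal sketch of the storage idea, assuming the question vectors are kept as a NumPy array: the stored matrix is normalised once, and a new question is compared against it with a single matrix-vector product. The toy two-dimensional vectors and the name `normalise_rows` are assumptions for this example.

```python
import numpy as np


def normalise_rows(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return np.divide(matrix, norms, out=np.zeros_like(matrix), where=norms > 0)


# Computed once and kept (for example with np.save); toy 2-d vectors for illustration.
stored = normalise_rows([[1.0, 0.0],
                         [1.0, 1.0],
                         [0.0, 2.0]])

# A new question only needs its own vector; the stored matrix is reused as-is.
new_question = normalise_rows([[2.0, 2.0]])
similarities = stored @ new_question[0]
best_first = np.argsort(-similarities)       # indices of stored questions, best match first
print(similarities, best_first)
```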

In [None]:
"""Dataframe used to store embeddings"""

df_embed = pd.DataFrame({'col':s_embeds})
df_q_s_embed = pd.merge( questions_bow,df_embed, left_index=True, right_index=True)
df_q_s_embed.head(1)

In [None]:
"""Cosine similarity for all q v one q"""

start = time.time()

q_embed_array = cosine_similarity(s_embeds, [s_embeds[0]])
end = time.time()
print('run time',end - start)

"""This should be compared against the tfidf method that takes about 12 secs to get the array"""

In [None]:
"""Cosine similarity for all q v all q"""

start = time.time()

q_embed_array = cosine_similarity(s_embeds, s_embeds)
end = time.time()
print('run time',end - start)


## Word2Vec Results

The key for the dataframe below can be found [here](#tfidfsimilarityresults).



In [None]:
start = time.time()
"""Compare  q with q using sentence embedding"""
h_threshold =0.84
l_threshold =0.75


results_T = get_sim_results_with_threshold ('query',0,q_embed_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)

for i in range(1,sample_len):
    next_result = get_sim_results_with_threshold ('query',i,q_embed_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)
    results_T = pd.concat([results_T,next_result])
results_word2vec = results_T.T
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000
end = time.time()
print('df run time',end - start)
display (results_word2vec) 

## Professionals chosen using Word2Vec

In [None]:
"""Compare  q with q using Word2Vec to get q_ids then prof_ids"""
q_id_results = find_top_q_ids (q_embed_array)
pd.options.display.max_colwidth = 500
q_id_results.head(20) 

In [None]:
"""check that we have right ids!"""
x = q_with_answers_bow.count()
print (x.questions_id)
i =0 
while i<x.questions_id:
    row = q_with_answers_bow.iloc[i]
    
    if  row.questions_id == 'f46e757e38fa4243805534133ff9cb5b':
        print ('q_with_answers_bow.iloc[i][questions_id]',row.questions_id)
        print ('q_with_answers_bow.iloc[i][bow_f]',row.bow_f)
    i += 1

## Professionals identified by Word2Vec

The cells below show that the 1,000 most relevant questions, according to Word2Vec, to the 100 last questions asked in the dataset were answered by:

1,198 professionals
who provided 2,2264 answers
with the majority of professionals providing just one answer during that period but one professional providing 79 answers for the 100 questions.


In [None]:
df_w2v_prof_with_a = professionals_to_ask (q_id_results, 0)


for i in range(1,sample_len):
    next_p_to_a = professionals_to_ask (q_id_results, i)
    df_w2v_prof_with_a = pd.concat([df_w2v_prof_with_a,next_p_to_a],axis=0, sort=True)
    
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

df_w2v_p_grouped = df_w2v_prof_with_a.groupby('professionals_id').count()
df_w2v_p_grouped = df_w2v_p_grouped.sort_values('total',ascending = False )
df_w2v_p_grouped.describe()

In [None]:
df_w2v_p_grouped.sum(axis = 0, skipna = True) 

In [None]:
q_embed_array.shape

In [None]:
q_embed_array =[]

# Method 3: Using Fasttext with sentence embedding<a id='m3'></a>

This Kaggle blog alerted me that FastText may produce better results than Word2Vec:
https://www.kaggle.com/antonsruberts/sentence-embeddings-centorid-method-vs-doc2vec

The author, Antons Ruberts, states:

"The main difference of FastText from Word2Vec is that it uses sub-word information (i.e character n-grams). While it brings additional utility to the embeddings, it also considerably slows down the process."


The method used here is identical to the one used for the Word2Vec model except that the vectors are calculated by FastText rather than Word2Vec.

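By using FastText here, we retain all the advantages of the Word2Vec process and, because sub-word information lets misspelled and rare words share character n-grams with known words, should improve the accuracy.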
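
To show what "sub-word information" means, the sketch below lists the character n-grams of a word wrapped in `<` and `>` boundary markers, using the 3 to 6 character range that is gensim's default for FastText. It is a simplified illustration: it leaves out the whole-word token and the hashing used by the real implementation, and the function name `char_ngrams` is an assumption.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("police", 3, 3))

# A variant spelling still shares many n-grams with the original word.
shared = set(char_ngrams("police")) & set(char_ngrams("polices"))
print(len(shared))
```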



First build the FastText model and show that the vectors can produce word similarities in the question bow.

In [None]:
"""uncomment to see all_words"""
#print (all_words)

In [None]:
from gensim.models import FastText
start = time.time()
embed_size = 300
"""all_words is a list of all the questions with the words separated and cleaned"""
ft_model = FastText(all_words, size=embed_size, window=5, min_count=2, workers=1
                    ,sg=1)
print('Time to build FastText model: {} mins'.format(round((time.time() - start) / 60, 2)))


ft_model.wv.most_similar(positive=["police"])

## Question embedding

In the following code the FastText word vectors are combined to produce vectors for each question. This is done by taking the average of the word embeddings, weighted by their tfidf scores, as described in this paper: http://www2.aueb.gr/users/ion/docs/BioNLP_2016.pdf.

These question vectors are then used to find similarity using cosine similarity.

In [None]:
"""Uncomment this to see how tfidfs are stored"""
#print (tfidf)

In [None]:
"""sentence embedding for questions using FastText"""
start = time.time()
"""tfidf is calculated in the Word2Vec section"""
"""There is a tfidf value for every word in all_words"""
rows, cols = tfidf.nonzero()
print (rows)
print (cols)
rows_l = len(rows)

s_embed = []
s_embeds = []
dividend = []
atStart = True
oldr = -1
w_cnt = 0
"""using vectorization calculated in the Word2Vec section"""
vocab = vectorizer.get_feature_names()

#this method of calculating the embeddings is a bit ugly but takes advantage of how tfidfs are stored
#for every question
for i in range (rows_l):
    r = rows[i]
    c = cols[i]
    if (oldr != r):
        #new questions and so store last embeddings
        if (atStart == False):
            #calc embedding for last questions
            s_embed = np.divide(dividend, divisor)
            s_embeds.append(s_embed.flatten())
            
        else: 
            atStart = False
        oldr = r
        w_cnt = 0
        dividend = np.zeros((1, embed_size))
        divisor = 0

       
    #find the next word
    word = vocab[c]
    if word in ft_model.wv.vocab:
        #word is in the vocab and so calculate its contribution to the question vector
        wt = tfidf[r,c]
        #print (wt, word)
        w_embed = ft_model.wv[word]
        #print(w_embed)
        #print(w_embed * wt)
        dividend = np.add(dividend, w_embed * wt)
        divisor += wt
        w_cnt +=1
#    else:
#        print (word, " not in vocab")
s_embed = np.divide(dividend, divisor)
s_embeds.append(s_embed.flatten())
#print (s_embeds)
end = time.time()
print('Sentence embedding run time',end - start)
start = time.time()

q_embed_array = cosine_similarity(s_embeds, s_embeds)
end = time.time()
print('cosine sim time',end - start)


The key for the dataframe below can be found [here](#tfidfsimilarityresults).

In [None]:
"""Compare  q with q using sentence embedding"""
start = time.time()

h_threshold =0.94
l_threshold =0.9

results_T = get_sim_results_with_threshold ('query',0,q_embed_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)
for i in range(1,sample_len):
#for i in range(1,20):

    next_result = get_sim_results_with_threshold ('query',i,q_embed_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)
    results_T = pd.concat([results_T,next_result])
results_FastText = results_T.T
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000
end = time.time()
print('df time',end - start)

display (results_FastText) 

## Professionals identified by FastText

The cells below show that the 1,000 most relevant questions, according to FastText, to the 100 last questions asked in the dataset were answered by:

1,198 professionals
who provided 2,314 answers
with the majority of professionals providing just one answer during that period but one professional providing 78 answers for the 100 questions.


In [None]:
"""Compare  q with q using FastText to get q_ids then prof_ids"""
q_id_results = find_top_q_ids (q_embed_array)
pd.options.display.max_colwidth = 500
#q_id_results.head(20) 

In [None]:
df_FT_prof_with_a = professionals_to_ask (q_id_results, 0)


for i in range(1,sample_len):
    next_p_to_a = professionals_to_ask (q_id_results, i)
    df_FT_prof_with_a = pd.concat([df_FT_prof_with_a,next_p_to_a],axis=0, sort=True)
    
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

df_FT_p_grouped = df_FT_prof_with_a.groupby('professionals_id').count()
df_FT_p_grouped = df_FT_p_grouped.sort_values('total',ascending = False )
df_FT_p_grouped.describe()

In [None]:
df_FT_p_grouped.sum(axis = 0, skipna = True) 

In [None]:
q_embed_array = []

# Method 4: Using Global Vectors<a id='m4'></a> 

From https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation:

"This dataset contains English word vectors pre-trained on the combined Wikipedia 2014 + Gigaword 5th Edition corpora (6B tokens, 400K vocab). All tokens are in lowercase. This dataset contains 50-dimensional, 100-dimensional and 200-dimensional pre trained word vectors. For 300-dimensional word vectors and additional information, please see the project website."

For these tests I have used the 200 dimension set of vectors.

In the following code the global word vectors are combined to produce vectors for each question. This is done by taking the average of the word embeddings, weighted by their tfidf scores, in the same way as for the Word2Vec test.
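
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses that layout from an in-memory toy file and reports which query words are missing. The three-dimensional entries are made up, and the names `parse_glove` and `coverage` are assumptions for this example. Because the vectors are pre-trained, a misspelled word simply has no entry.

```python
import io


def parse_glove(lines):
    """Build {word: [floats]} from lines of 'word v1 v2 ... vn'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def coverage(tokens, vectors):
    """Return (number of tokens found, list of tokens missing)."""
    missing = [t for t in tokens if t not in vectors]
    return len(tokens) - len(missing), missing


# Two made-up 3-dimensional entries standing in for the real 200-dimensional file.
toy_file = io.StringIO("police 0.1 0.2 0.3\nofficer 0.2 0.1 0.4\n")
glove = parse_glove(toy_file)
print(coverage(["police", "oficer"], glove))   # the misspelling is not found
```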


In [None]:
start = time.time()

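glove_embed = dict()
# each line of the GloVe file is a word followed by its vector components
# the with block closes the file even if parsing fails part way through
with open(glove_path, encoding='utf-8') as glove_data:
    for line in glove_data:
        data = line.split(' ')
        word = data[0]
        vectors = np.asarray(data[1:], dtype='float32')
        glove_embed[word] = vectors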
end = time.time()
print('run time',end - start)

In [None]:
"""sentence embedding for questions using Global Vectors"""
start = time.time()

embed_size = 200
rows, cols = tfidf.nonzero()
print (rows)
print (cols)
rows_l = len(rows)

s_embed = []
s_embeds = []
dividend = []
atStart = True
oldr = -1
w_cnt = 0
vocab = vectorizer.get_feature_names()
tot_words = 0
words_not_in_gv = 0
for i in range (rows_l):
    r = rows[i]
    c = cols[i]
    if (oldr != r):
        if (atStart == False):
            #calc embedding for questions
            s_embed = np.divide(dividend, divisor)
            s_embeds.append(s_embed.flatten())
            
        else: 
            atStart = False
        oldr = r
        w_cnt = 0
        dividend = np.zeros((1, embed_size))
        divisor = 0

       
    #print('r,c,w_cnt',r,c,w_cnt)
    word = vocab[c]
    #print (word)
    wt = tfidf[r,c]
    #print (wt, word)
    if word in glove_embed:
        w_embed = glove_embed[word]
        dividend = np.add(dividend, w_embed * wt)
        divisor += wt
    else:
        words_not_in_gv += 1
    tot_words += 1    
    w_cnt +=1
s_embed = np.divide(dividend, divisor)
s_embeds.append(s_embed.flatten())
print ('The following figures show the benefit of an auto correct or spell check on input')
print ('Number of words not in global vectors', words_not_in_gv,'total words', tot_words)
#print (s_embeds)
end = time.time()
print('run time',end - start)

In [None]:
"""Cosine similarity for all q v all q"""

start = time.time()

q_embed_array = cosine_similarity(s_embeds, s_embeds)
end = time.time()
print('run time',end - start)

## GloVe Results

The key for the dataframe below can be found [here](#tfidfsimilarityresults).

In [None]:
"""Compare  q with q using global vecs"""
h_threshold =0.85
l_threshold =0.75


results_T = get_sim_results_with_threshold ('query',0,q_embed_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)

for i in range(1,sample_len):
    next_result = get_sim_results_with_threshold ('query',i,q_embed_array,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)
    results_T = pd.concat([results_T,next_result])
results_glovec = results_T.T
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

display (results_glovec) 

## Professionals chosen using GloVe

In [None]:
"""Compare  q with q using GloVe to get q_ids then prof_ids"""
q_id_results = find_top_q_ids (q_embed_array)
pd.options.display.max_colwidth = 500
q_id_results.head(20) 

In [None]:
df_glo_prof_with_a = professionals_to_ask (q_id_results, 0)


for i in range(1,sample_len):
    next_p_to_a = professionals_to_ask (q_id_results, i)
    df_glo_prof_with_a = pd.concat([df_glo_prof_with_a,next_p_to_a],axis=0, sort=True)
    
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

df_glo_p_grouped = df_glo_prof_with_a.groupby('professionals_id').count()
df_glo_p_grouped = df_glo_p_grouped.sort_values('total',ascending = False )


## Professionals identified by GloVe

The cells below show that the 1,000 most relevant questions, according to GloVe, to the 100 last questions asked in the dataset were answered by:

1,118 professionals
who provided 2,373 answers
with the majority of professionals providing just one answer during that period but one professional providing 87 answers for the 100 questions.


In [None]:
df_glo_p_grouped.sum(axis = 0, skipna = True) 


In [None]:
df_glo_p_grouped.describe()


In [None]:
q_embed_array =[]

# Method 5: universal-sentence-encoder<a id='m5'></a>

# The USE module became unavailable on April 21st, so the associated code has been commented out in this version of the Kernel. This is disappointing but not critical because the FastText method is preferred for this application.



"Google’s Universal Sentence Encoder encodes text into high dimensional vectors.
The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.

The input is variable length English text and the output is a 512 dimensional vector."

More details can be found here:
https://tfhub.dev/google/universal-sentence-encoder/2

I found this blog useful:
https://medium.com/@gaurav5430/universal-sentence-encoding-7d440fd3c7c7

The encoder produces one vector per question; the cosine similarity between these vectors gives a matrix defining the similarity between every pair of questions.
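
Once a similarity matrix exists, the most similar questions for a query can be read off one row. The sketch below is a hedged illustration on a made-up 4 x 4 matrix; `top_n_similar` is a hypothetical helper and not part of the kernel or of the USE module. It excludes the query by index, whereas the kernel's helper functions drop the top row when its cosine exceeds .9999.

```python
import numpy as np


def top_n_similar(similarity_matrix, query_index, n=3):
    """Indices of the n most similar rows to `query_index`, excluding itself."""
    scores = np.array(similarity_matrix[query_index], dtype=float)  # copy of the row
    scores[query_index] = -np.inf            # never return the query as its own match
    order = np.argsort(-scores)              # highest similarity first
    return order[:n].tolist()


toy_similarity = np.array([[1.0, 0.2, 0.9, 0.4],
                           [0.2, 1.0, 0.1, 0.7],
                           [0.9, 0.1, 1.0, 0.3],
                           [0.4, 0.7, 0.3, 1.0]])
print(top_n_similar(toy_similarity, query_index=0, n=2))
```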


In [None]:
"""Edit 1"""

"""import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os, sys
from sklearn.metrics.pairwise import cosine_similarity

# get cosine similarity matrix
def cos_sim(input_vectors):
    similarity = cosine_similarity(input_vectors)
    return similarity

# get topN similar sentences

module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)"""


In [None]:
"""Edit 2"""

"""q_total = ["".join(x) for x in (q_with_answers_bow['bow'])]


start = time.time()
with tf.Session() as session:

  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  sentences_embeddings = session.run(embed(q_total))

similarity_matrix = cos_sim(np.array(sentences_embeddings))

print (similarity_matrix)
end = time.time()
print('run time',end - start)"""

## USE Results

The key for the dataframe below can be found [here](#tfidfsimilarityresults).

In [None]:
"""Edit 3"""
"""h_threshold =0.75
l_threshold =0.75


uni_results_T = get_sim_results_with_threshold ('query',0,similarity_matrix,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)

for i in range(1,sample_len):
    next_result = get_sim_results_with_threshold ('query',i,similarity_matrix,q_with_answers_bow,q_with_answers_bow,h_threshold,l_threshold)
    uni_results_T = pd.concat([uni_results_T,next_result])
uni_results = uni_results_T.T
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

display (uni_results)"""
# USE is unavailable, so the GloVe results are reused here as a placeholder
uni_results = results_glovec

## Professionals chosen using universal sentence encoder

In [None]:
"""Edit 4"""
"""Compare  q with q using USE to get q_ids then prof_ids
q_id_results = find_top_q_ids (similarity_matrix)
pd.options.display.max_colwidth = 500
q_id_results.head(20) """

In [None]:
# q_id_results still holds the GloVe ids here because the USE cell above is commented out
df_uni_prof_with_a = professionals_to_ask (q_id_results, 0)


for i in range(1,sample_len):
    next_p_to_a = professionals_to_ask (q_id_results, i)
    df_uni_prof_with_a = pd.concat([df_uni_prof_with_a,next_p_to_a],axis=0, sort=True)
    
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000

df_uni_p_grouped = df_uni_prof_with_a.groupby('professionals_id').count()
df_uni_p_grouped = df_uni_p_grouped.sort_values('total',ascending = False )


## Professionals identified by the universal sentence encoder

The cells below show that the 1,000 most relevant questions, according to USE, to the 100 last questions asked in the dataset were answered by:

1,118 professionals
who provided 2,079 answers
with the majority of professionals providing just one answer during that period but one professional providing 87 answers for the 100 questions.


In [None]:
df_uni_p_grouped.sum(axis = 0, skipna = True) 

In [None]:
df_uni_p_grouped.describe()

In [None]:
similarity_matrix = []

# Method 6: Using tfidf to compare question bow with professional bow<a id='m6'></a>

In [None]:
def get_q_prof_with_threshold (column_head,index,sim_array,questions,query,h_threshold,l_threshold):

    col_h = column_head + str(index)
    
    df_sim_q = pd.DataFrame({'Cosine':sim_array[:,index], col_h:questions['bow_f']})

    df_sim_q_sorted = df_sim_q.sort_values('Cosine',ascending = False )
    if df_sim_q_sorted.iloc[0]['Cosine'] > .9999:
        df_sim_q_sorted = df_sim_q_sorted.drop(df_sim_q_sorted.index[0])

    h_num = 0
    l_num = 0
    worst_h_num = -1
    i = 0
    questions_len = len(questions)
    while i< questions_len and df_sim_q_sorted.iloc[i]['Cosine'] > l_threshold:
        #print ('i, df_sim_q_sorted.iloc[i]['Cosine']')
        if df_sim_q_sorted.iloc[i]['Cosine'] > l_threshold:
            l_num += 1
            worst_match_to_profs= i
        if df_sim_q_sorted.iloc[i]['Cosine'] > h_threshold:
            worst_h_num = i
            h_num += 1
        i += 1
    
        
    df_sim_q_sample = df_sim_q_sorted[:10]
    
        
    best_cos_0 = df_sim_q_sample.iloc[0]['Cosine']
    best_cos_9 = df_sim_q_sample.iloc[9]['Cosine']
    
    df_sim_q_sample = df_sim_q_sample.drop ('Cosine', axis=1).reset_index()
    df_sim_q_sample = df_sim_q_sample.drop ( 'index', axis=1)

    df_sim_q_sample_T = df_sim_q_sample.T
    df_sim_q_sample_T.insert(loc=0, column='id', value=[query.iloc[index]['questions_id']] )
    df_sim_q_sample_T.insert(loc=1, column='bow', value=[query.iloc[index]['bow_f']]  )
    df_sim_q_sample_T.insert(loc=2, column='best_cos', value=best_cos_0)
    df_sim_q_sample_T.insert(loc=3, column='10th_best_cos', value=best_cos_9)
    df_sim_q_sample_T.insert(loc=4, column='similar Q to students', value= h_num)
    df_sim_q_sample_T.insert(loc=5, column='Qs to profs', value=l_num)
    df_sim_q_sample_T.insert(loc=6, column='best matches', value=' ')

    if worst_h_num > -1:
        df_sim_q_sample_T.insert(loc=17, column='worst match to students', value=df_sim_q_sorted.iloc[worst_h_num][col_h])
    else:
        df_sim_q_sample_T.insert(loc=17, column='worst match to students', value='not available')
    
    if l_num > 0:
        df_sim_q_sample_T.insert(loc=18, column='worst match to profs', value=df_sim_q_sorted.iloc[worst_match_to_profs][col_h])
    else:
        df_sim_q_sample_T.insert(loc=18, column='worst match to profs', value='not available')
    
    
    
    return ( df_sim_q_sample_T)

In [None]:

"""Compare  q with prof using tfidf"""
h_threshold =0.45
l_threshold =0.225


start = time.time()

q_total = ["".join(x) for x in (df_p_bow['bow'])]
q_queries = ["".join(x) for x in (q_with_answers_bow['bow'])]
q_sim_p_array = get_sim_q_array (q_total,q_queries)

end = time.time()
print('run time',end - start)
#print(q_sim_m_array)

results_prof_T = get_q_prof_with_threshold ('query',0,q_sim_p_array,df_p_bow,q_with_answers_bow,h_threshold,l_threshold)
#for i in range(1,sample_len):
#reduced to keep the UI manageable
for i in range(1,20):
    next_result = get_q_prof_with_threshold ('query',i,q_sim_p_array,df_p_bow,q_with_answers_bow,h_threshold,l_threshold)
    results_prof_T = pd.concat([results_prof_T,next_result])
results_prof_tfidf = results_prof_T.T
pd.options.display.max_colwidth = 700
display(results_prof_tfidf) 


In [None]:
def get_sim_p_id (column_head,index,sim_array,questions,query):

    col_h = column_head + str(index)
    
    df_sim_q = pd.DataFrame({'Cosine':sim_array[:,index], col_h:questions['professionals_id']})
    
    
    df_sim_q_sorted = df_sim_q.sort_values('Cosine',ascending = False )
    if df_sim_q_sorted.iloc[0]['Cosine'] > .9999:
        df_sim_q_sorted = df_sim_q_sorted.drop(df_sim_q_sorted.index[0])
        
    df_sim_q_sample = df_sim_q_sorted[:20]
    
    df_sim_q_sample = df_sim_q_sample.drop ('Cosine', axis=1).reset_index()
    df_sim_q_sample = df_sim_q_sample.drop ( 'index', axis=1)

    
    df_sim_q_sample_T = df_sim_q_sample.T
    """    df_sim_q_sample_T.insert(loc=0, column='id', value=[query.iloc[index]['questions_id']] )
    df_sim_q_sample_T.insert(loc=1, column='bow', value=[query.iloc[index]['bow_f']]  )    
    """    
    
    return ( df_sim_q_sample_T)

In [None]:
def find_top_p_ids (q_sim_array):
    q_id_results_T = get_sim_p_id ('query',0,q_sim_array,df_p_bow,q_with_answers_bow)
    for i in range(1,sample_len):
        next_result = get_sim_p_id ('query',i,q_sim_array,df_p_bow,q_with_answers_bow)
        q_id_results_T = pd.concat([q_id_results_T,next_result])
    return q_id_results_T.T

In [None]:
"""Compare  q with q using tfidf to get q_ids then prof_ids"""
p_id_results= find_top_p_ids (q_sim_p_array)
pd.options.display.max_colwidth = 500
p_id_results.head(2000) 

In [None]:
"""check that we have right ids!"""
"""x = df_p_bow.count()
print (x.professionals_id)
i =0 
while i<x.professionals_id:
    row = df_p_bow.iloc[i]
    
    if  row.professionals_id == '57a497a3dd214fe6880816c376211ddb':
        print ('df_p_bow.iloc[i][professionals_id]',row.professionals_id)
        print ('df_p_bow.iloc[i][bow_f]',row.bow_f)
    i += 1"""

Concatenate the professionals into one list

In [None]:
p_id_results_T0 = get_sim_p_id ('query',0,q_sim_p_array,df_p_bow,q_with_answers_bow)
p_id_results =p_id_results_T0.T
for i in range(1,sample_len):
    p_id_results_Tx = get_sim_p_id ('query',i,q_sim_p_array,df_p_bow,q_with_answers_bow)
    p_id_resultsx = p_id_results_Tx.T
    col_h = 'query' + str(i)
    p_id_resultsx = p_id_resultsx.rename(columns={col_h: 'query0'})

    p_id_results = pd.concat([p_id_results,p_id_resultsx],axis=0, sort=True)

p_id_results.describe()   

In [None]:
p_id_results = p_id_results.reset_index()
p_id_results = p_id_results.rename(columns={'query0': 'professionals_id'})

#p_id_results_r = p_id_results_r.drop('index')
p_id_results.head(10)   

In [None]:
p_id_results_grouped = p_id_results.groupby('professionals_id').count()
p_id_results_grouped = p_id_results_grouped.rename(columns={'index': 'total'})
p_id_results_grouped = p_id_results_grouped.sort_values('total',ascending = False )

p_id_results_grouped.head()

In [None]:
p_id_results_grouped.describe()

In [None]:
print (p_id_results.shape)

In [None]:
q_sim_p_array = []

# Comparing recommender engines<a id='compare'></a>

This cell can be used to visually compare the results for one student question from the five question-based methods.

Three examples are chosen to show the similar questions found side by side.


#### Example of a straightforward request about a job type

Handled well by all methods

In [None]:
"""Change the index number between 0 and sample_len """

index = 0

def compare_methods(index):

    col = 'query' + str(index)

    tfidfx = results_tfidf[[col]]
    tfidfx = tfidfx.rename(columns={col: 'tfidf'})
    
    senembedx = results_word2vec[[col]]
    senembedx = senembedx.rename(columns={col: 'Word2Vec'})
   
    
    senembedFTx = results_FastText[[col]]
    senembedFTx = senembedFTx.rename(columns={col: 'FastText'})
    

    glovecx = results_glovec[[col]]
    glovecx = glovecx.rename(columns={col: 'GloVe'})

    univecx = uni_results[[col]]
    univecx = univecx.rename(columns={col: 'USE'})

    df_queryx = pd.concat([tfidfx,senembedx],axis=1, sort=False)
    df_queryx = pd.concat([df_queryx,senembedFTx],axis=1, sort=False)
    df_queryx = pd.concat([df_queryx,glovecx],axis=1, sort=False)
    df_queryx = pd.concat([df_queryx,univecx],axis=1, sort=False)

    
    return df_queryx

display(compare_methods(index))

#### Example of a general request

Handled well by all methods

In [None]:
index = 4
compare_methods(index)


#### Example of handling a spelling error

FastText and, to a lesser extent, Word2Vec handle the incorrect spelling of pediatrician.


In [None]:
index = 2
compare_methods(index)


### Professionals who answered the questions

In the examples above, we can see that the four methods do provide similar results but the degree of overlap is difficult to analyse. The code below provides the statistics.

Note that these figures do not include USE.

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_w2v_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')

df_p_with_a = df_p_with_a.rename(columns={'total_x': 'tfidf'})
df_p_with_a = df_p_with_a.rename(columns={'total_y': 'Word2Vec'})


df_p_with_a = df_p_with_a.merge(right=df_FT_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a = df_p_with_a.rename(columns={'total': 'FastText'})



df_p_with_a = df_p_with_a.merge(right=df_glo_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')

df_p_with_a = df_p_with_a.rename(columns={'total': 'GloVe'})


df_p_with_a = df_p_with_a.merge(right=df_uni_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a = df_p_with_a.rename(columns={'total': 'USE'})


df_p_with_a.describe()

In [None]:
df_p_with_a.shape

The following table shows that, of the 2,440 professionals found, only 394 are present in all 4 sets of results. The numbers printed when the code is run may differ slightly from those quoted here because the model outputs vary slightly between runs; the difference is not significant.

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_w2v_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
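# Rename each 'total' column after its merge so later merges do not create clashing suffixes
df_p_with_a = df_p_with_a.rename(columns={'total_x': 'tfidf', 'total_y': 'Word2Vec'})

df_p_with_a = df_p_with_a.merge(right=df_FT_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a = df_p_with_a.rename(columns={'total': 'FastText'})

df_p_with_a = df_p_with_a.merge(right=df_glo_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a = df_p_with_a.rename(columns={'total': 'GloVe'})

df_p_with_a = df_p_with_a.merge(right=df_uni_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a = df_p_with_a.rename(columns={'total': 'USE'})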


df_p_with_a.describe()

In [None]:
df_p_with_a.shape

### Comparing the overlap of professionals answering between 2 engines

The following cell outputs show that each method shares roughly half of its professionals with each other method. Using two methods could therefore increase the number of identified professionals by about 50%. However, if more professionals are required, it is probably better to continue with the best method and take the professionals who answered the next most relevant questions. Visual inspection of the questions shows that the methods find the same questions but rank them differently.
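
The pairwise comparisons in the cells below could also be collected into a single table. The following is a minimal sketch, not the code used to produce the quoted figures. It assumes each grouped frame is indexed by `professionals_id` with one row per professional; `toy_grouped` and `pairwise_overlap` are hypothetical names introduced only for this illustration. Under that assumption, `either` corresponds to the row count of an outer merge and `shared` to the row count of an inner merge.

```python
from itertools import combinations

import pandas as pd

# Toy stand-ins for the grouped frames (index = professionals_id, one 'total' column).
# These are illustrative values, not competition data.
toy_grouped = {
    'tfidf': pd.DataFrame({'total': [3, 1, 2]},
                          index=pd.Index(['p1', 'p2', 'p3'], name='professionals_id')),
    'Word2Vec': pd.DataFrame({'total': [1, 4]},
                             index=pd.Index(['p2', 'p4'], name='professionals_id')),
    'GloVe': pd.DataFrame({'total': [2, 2, 1]},
                          index=pd.Index(['p1', 'p2', 'p5'], name='professionals_id')),
}


def pairwise_overlap(grouped):
    """Return one row per pair of methods with union and intersection sizes."""
    rows = []
    for (name_a, df_a), (name_b, df_b) in combinations(grouped.items(), 2):
        ids_a, ids_b = set(df_a.index), set(df_b.index)
        shared = len(ids_a & ids_b)   # professionals found by both methods
        either = len(ids_a | ids_b)   # professionals found by at least one method
        rows.append({'method_a': name_a, 'method_b': name_b,
                     'either': either, 'shared': shared,
                     'shared_fraction': shared / either if either else float('nan')})
    return pd.DataFrame(rows)


overlap = pairwise_overlap(toy_grouped)
print(overlap)
```

In the notebook, the dictionary would hold the real grouped frames (for example `df_tfidf_p_grouped` under the key `'tfidf'`) instead of the toy data.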

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_w2v_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_w2v_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_glo_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_glo_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_uni_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=df_uni_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_w2v_p_grouped.merge(right=df_glo_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_w2v_p_grouped.merge(right=df_glo_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_w2v_p_grouped.merge(right=df_uni_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_w2v_p_grouped.merge(right=df_uni_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_glo_p_grouped.merge(right=df_uni_p_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_glo_p_grouped.merge(right=df_uni_p_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
p_id_results_grouped.head()

In [None]:
df_tfidf_p_grouped.head()

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=p_id_results_grouped, how = 'outer',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape

In [None]:
df_p_with_a = df_tfidf_p_grouped.merge(right=p_id_results_grouped, how = 'inner',
                                            left_on ='professionals_id',
                                            right_on ='professionals_id')
df_p_with_a.shape


# Finding a list of relevant questions for a professional<a id='proflist'></a>

 This is used when a professional visits the website and needs to be presented with a set of questions that are relevant.
 
 In the production version, filtering would be required so that previously answered questions are not re-presented.
 
 In this code tfidf is used to provide the questions. Any of the other methods could also be used.
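
The production filter mentioned above is not implemented in this Kernel. As a rough sketch of one way it could work, the example below removes questions that a given professional has already answered from a ranked list. It assumes the ranked list carries a `questions_id` column (the function below keeps only the question text), and `ranked`, `toy_answers` and `drop_answered` are hypothetical names used only here.

```python
import pandas as pd

# Toy ranked recommendations for one professional (best match first).
ranked = pd.DataFrame({
    'questions_id': ['q1', 'q2', 'q3', 'q4'],
    'Cosine': [0.81, 0.64, 0.52, 0.31],
})

# Toy answers table using the competition's column names.
toy_answers = pd.DataFrame({
    'answers_author_id': ['prof_a', 'prof_a', 'prof_b'],
    'answers_question_id': ['q2', 'q4', 'q1'],
})


def drop_answered(ranked, answers, professional_id):
    """Remove questions that this professional has already answered."""
    answered = answers.loc[answers['answers_author_id'] == professional_id,
                           'answers_question_id']
    return ranked[~ranked['questions_id'].isin(answered)].reset_index(drop=True)


fresh = drop_answered(ranked, toy_answers, 'prof_a')
print(fresh)
```

The filter is applied after ranking, so the similarity computation itself does not need to change.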

In [None]:
def get_prof_q_results (prof_index,dfs_p_bow,q_sim_p_array,q_with_answers_bow):

    prof = 'prof ' + str(prof_index)
    df_profs_q = pd.DataFrame({'Cosine':q_sim_p_array[:,prof_index], prof:q_with_answers_bow['bow_f']})

    df_profs_q_sorted = df_profs_q.sort_values('Cosine',ascending = False )
    if df_profs_q_sorted.iloc[0]['Cosine'] > .9999:
        df_profs_q_sorted = df_profs_q_sorted.drop(df_profs_q_sorted.index[0])

    
    df_profs_q_sample = df_profs_q_sorted[:10]
    best_cos_0 = df_profs_q_sample.iloc[0]['Cosine']
    best_cos_9 = df_profs_q_sample.iloc[9]['Cosine']

    df_profs_q_sample = df_profs_q_sample.drop ('Cosine', axis=1).reset_index()
    df_profs_q_sample = df_profs_q_sample.drop ( 'index', axis=1)

    df_profs_q_sample_T = df_profs_q_sample.T
    df_profs_q_sample_T.insert(loc=0, column='professionals_id', value=[dfs_p_bow.iloc[prof_index]['professionals_id']] )
    df_profs_q_sample_T.insert(loc=1, column='professionals_bow', value=[dfs_p_bow.iloc[prof_index]['bow']]  )
    df_profs_q_sample_T.insert(loc=2, column='best_cos', value=best_cos_0)
    df_profs_q_sample_T.insert(loc=3, column='10th_best_cos', value=best_cos_9)
    df_profs_q_sample_T.insert(loc=4, column='Best Matches', value=' ')

    return ( df_profs_q_sample_T)


In [None]:
"""Produce a subset of the professional data of all professionals"""

dfs_p_bow = df_p_bow.sample(n=10, random_state = 21).reset_index()
dfs_p_bow = dfs_p_bow.drop (['index','level_0'], axis = 1)

"""Compare questions against the professionals bow"""

q_total = ["".join(x) for x in (q_with_answers_bow['bow'])]
q_queries = ["".join(x) for x in (dfs_p_bow['bow'])]
q_sim_p_array = get_sim_q_array (q_total,q_queries)

"""get results for all professionals sample"""
results_T = get_prof_q_results (0,dfs_p_bow,q_sim_p_array,q_with_answers_bow)
prof_sample_len = 10
for i in range(1, prof_sample_len):
    next_result = get_prof_q_results (i,dfs_p_bow,q_sim_p_array,q_with_answers_bow)
    results_T = pd.concat([results_T,next_result])
results = results_T.T
pd.options.display.max_colwidth = 500
results.head(15)    



# Find similar professionals<a id='simprof'></a>

## This is used when we need to find more professionals to answer a particular question.

In [None]:
"""Compare nonan sample of professionals bow against the professionals bow"""

dfs_p_bow_nonan = df_p_bow_nonan.sample(n=10, random_state = 21).reset_index()
dfs_p_bow_nonan = dfs_p_bow_nonan.drop (['index','level_0'], axis = 1)

# None means "no truncation"; the older value -1 is rejected by recent pandas versions
pd.options.display.max_colwidth = None

q_total = ["".join(x) for x in (df_p_bow['bow'])]
q_queries = ["".join(x) for x in (dfs_p_bow_nonan['bow'])]
q_sim_pp_array = get_sim_q_array (q_total,q_queries)

"""get results for  p v p comparison"""
results_T = get_prof_q_results (0,dfs_p_bow_nonan,q_sim_pp_array,df_p_bow)
prof_sample_len = 10
for i in range(1, prof_sample_len):
    next_result = get_prof_q_results (i,dfs_p_bow_nonan,q_sim_pp_array,df_p_bow)
    results_T = pd.concat([results_T,next_result])
results = results_T.T
pd.options.display.max_colwidth = 500
results.head(15)    

# EDA Code<a id='edacode'></a>

This competition is, essentially, about changing the behaviour of the professionals.
By better targeting of questions to professionals, the aim is to make it more likely that they will respond.

Much of this EDA has therefore been designed to provide information and insights about the following:

•	if we want to change behavior it would be good to know what happens now

•	if we want to change behavior we need some way of measuring impact...so what happens now?


## Setup

In [None]:
"""Merge matches and emails to find total questions asked"""
"""There are about 1.8m emails and 4.3m matches and so some emails contain more than one question """

match_q_p = matches.merge(right=emails, how = 'left',
                                            left_on ='matches_email_id',
                                            right_on ='emails_id')

match_q_p.head()

In [None]:
match_q_p_simple = match_q_p.drop (['emails_id','emails_date_sent','emails_frequency_level'], axis=1)
match_q_p_simple.head()

In [None]:
match_q_p_simple = match_q_p_simple.sort_values ('emails_recipient_id')
match_recipents = match_q_p_simple.groupby('emails_recipient_id').count()
match_recipents = match_recipents.sort_values ('emails_recipient_id')
match_recipents = match_recipents.reset_index()

match_recipents = match_recipents.drop ('matches_question_id', axis=1)
match_recipents = match_recipents.rename(columns={'matches_email_id': 'questions_received'})


match_recipents.head()

## This section counts up the number of emails that each professional has been sent

In [None]:


email_recipents = emails[['emails_id' ,'emails_recipient_id']]
email_recipents = email_recipents.groupby('emails_recipient_id').count()
sorted_email_recipents = email_recipents.sort_values ('emails_recipient_id')
sorted_email_recipents = sorted_email_recipents.reset_index()

df_profs_emails = professionals.copy()

df_profs_emails = df_profs_emails.sort_values('professionals_id')
df_profs_emails.reset_index(inplace =True, drop =True)
df_profs_emails = df_profs_emails.merge(right=sorted_email_recipents, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='emails_recipient_id')

df_profs_emails = df_profs_emails.drop ('emails_recipient_id', axis=1)
df_profs_emails = df_profs_emails.rename(columns={'emails_id': 'emails_received'})
df_profs_emails = df_profs_emails.fillna(0)
df_profs_emails.head() 

 


This cell shows that the people receiving the most emails have received over 3,000 in the last three years.

In [None]:
df_profs_emails_q = df_profs_emails.merge(right=match_recipents, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='emails_recipient_id')
df_profs_emails_q = df_profs_emails_q.drop ('emails_recipient_id', axis = 1)
df_profs_emails_q_sorted = df_profs_emails_q.sort_values ('emails_received' ,ascending = False)

df_profs_emails_q_sorted.head() 

## This section counts the answers provided by each professional

One professional has answered 1,710 questions out of the 3,280 received.

In [None]:

"""Using groupby to speed up data processing"""

answers_cut = answers[['answers_author_id' ,'answers_question_id']]
answer_count = answers_cut.groupby('answers_author_id').count()
sorted_answer_count = answer_count.sort_values ('answers_author_id')
sorted_answer_count = sorted_answer_count.reset_index()

sorted_answer_count.head()

"""Merge the info on answered questions to the professional df"""

df_profs_emails_answers = df_profs_emails_q.copy()
df_profs_emails_answers = df_profs_emails_answers.sort_values('professionals_id')
df_profs_emails_answers.reset_index(inplace =True, drop =True)

df_profs_emails_answers = df_profs_emails_answers.merge(right=sorted_answer_count, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='answers_author_id')
df_profs_emails_answers = df_profs_emails_answers.drop ('answers_author_id', axis=1)
df_profs_emails_answers = df_profs_emails_answers.rename(columns={'answers_question_id': 'questions_answered'})
df_profs_emails_answers = df_profs_emails_answers.fillna(0)
df_profs_emails_answers_sorted = df_profs_emails_answers.sort_values ('questions_answered' ,ascending = False)

df_profs_emails_answers_sorted.head() 

## Comparing answers provided to questions asked
**Scatter Plot**
In this plot, each dot represents the data of one professional and shows the relationship between questions asked and questions answered.

If you imagine a straight line going through 0,0 with a slope of 1 (45 degrees), then there should be no dots above the line.

**Why?** Because that would mean that more answers have been provided than questions received. You can see that the majority of dots are below the line but there is a significant minority that are above it.

The reason for this is that professionals can go directly to the website to find questions. This is an important source of answers for CV.
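
The size of that minority can be checked directly. The sketch below uses a small toy frame with the same column names as `df_profs_emails_answers`; the values are illustrative only and `toy` and `above_line` are names introduced for this example.

```python
import pandas as pd

# Toy data: one row per professional.
toy = pd.DataFrame({
    'questions_received': [10, 0, 5, 3],
    'questions_answered': [2, 4, 5, 7],
})

# A professional is above the 45 degree line when answers exceed questions received.
above_line = toy['questions_answered'] > toy['questions_received']
print('professionals above the line:', int(above_line.sum()))
print('share above the line:', above_line.mean())
```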

In [None]:
"""Scatter Plot for q answered v q asked"""
import matplotlib.pyplot as plt
import math


"""Need to use log to get data spreading"""
df_profs_emails_answers ['log_questions_received'] = df_profs_emails_answers ['questions_received']
df_profs_emails_answers ['log_questions_answered'] = df_profs_emails_answers ['questions_answered']

def getlog (x):
    # Return a float NaN (not the string 'NaN') for zero so the column stays numeric
    if (x == 0):
        x = math.nan
    else:
        x = math.log10(x)
    return x
   

df_profs_emails_answers['log_questions_received'] = df_profs_emails_answers['log_questions_received'].map(getlog)
df_profs_emails_answers['log_questions_answered'] = df_profs_emails_answers['log_questions_answered'].map(getlog)
df_profs_emails_truncatedanswers = df_profs_emails_answers.copy()

plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers['questions_received'],df_profs_emails_answers['questions_answered'],  color='k', s=25, alpha=0.2)
plt.xlim(-5, 50)
plt.ylim(-5,50)
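# Use one colour specification only and label the line so plt.legend() has an entry
plt.plot([-5,50], [-5,50], '-', color = 'red', label = 'answered = received')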

plt.xlabel('questions_received')
plt.ylabel('questions_answered')
plt.title('CareerVillage Q v A truncated at 50')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers['log_questions_received'],df_profs_emails_answers['log_questions_answered'],  color='k', s=25, alpha=0.2)
plt.plot([0,3], [0,3], '-', color = 'red', label = 'answered = received')
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.xlabel('log_questions_received')
plt.ylabel('log_questions_answered')
plt.title('CareerVillage Questions Chart')
plt.legend()
plt.show()

In [None]:
df_profs_emails_answers ['a_q_not_asked'] =  df_profs_emails_answers['questions_answered'] - df_profs_emails_answers['questions_received']
df_profs_emails_answers ['a_q_not_asked'] = df_profs_emails_answers ['a_q_not_asked'].apply(lambda x: 0 if x < 0 else x)
print (df_profs_emails_answers ['a_q_not_asked'].sum())
df_profs_emails_answers.describe()

# This section develops data to determine activity


In [None]:

df_profs_emails_answers['DateTime'] = pd.to_datetime(df_profs_emails_answers['professionals_date_joined'])
df_profs_emails_answers['date_joined'] = df_profs_emails_answers['DateTime'].dt.normalize()
df_profs_emails_answers = df_profs_emails_answers.drop(['professionals_date_joined','DateTime'],axis = 1)
#df_profs_emails_answers['Day Joined'] = [2011-01-01]
df_profs_emails_answers.head()


In [None]:
"""Start processing emails to get first and last emails sent dates"""
"""Perhaps don't need to sort but useful for manual checking"""
sorted_email_recipents = emails.sort_values  (['emails_recipient_id','emails_date_sent'])
sorted_email_recipents.reset_index(inplace =True, drop =True)

sorted_email_recipents.head()

In [None]:
"""This section finds the days that professionals receive emails"""
"""Very slow, working on 1.8m emails"""
sorted_email_recipents_dates = sorted_email_recipents.copy()
sorted_email_recipents_dates['email_date'] = pd.to_datetime(sorted_email_recipents['emails_date_sent'])


In [None]:
sorted_email_recipents_dates['email_date'] = sorted_email_recipents_dates['email_date'].dt.normalize()
sorted_email_recipents_dates = sorted_email_recipents_dates.drop ('emails_date_sent', axis = 1)


In [None]:

sorted_email_recipents_dates_min = sorted_email_recipents_dates.groupby('emails_recipient_id').min()
sorted_email_recipents_dates_min = sorted_email_recipents_dates_min.rename(columns={'email_date':'first_email_date'})
sorted_email_recipents_dates_min = sorted_email_recipents_dates_min.drop(['emails_id','emails_frequency_level'], axis = 1)


sorted_email_recipents_dates_min.head()

In [None]:
sorted_email_recipents_dates_max = sorted_email_recipents_dates.groupby('emails_recipient_id').max()
sorted_email_recipents_dates_max = sorted_email_recipents_dates_max.rename(columns={'email_date':'last_email_date'})
sorted_email_recipents_dates_max = sorted_email_recipents_dates_max.drop(['emails_id','emails_frequency_level'], axis = 1)

sorted_email_recipents_dates_max.head()

In [None]:

sorted_email_recipents_dates_min_max = sorted_email_recipents_dates_min.merge(right=sorted_email_recipents_dates_max, how = 'left',
                                            left_on ='emails_recipient_id',
                                            right_on ='emails_recipient_id')


sorted_email_recipents_dates_min_max.head()

In [None]:
df_profs_emails_answers = df_profs_emails_answers.merge(right=sorted_email_recipents_dates_min_max, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='emails_recipient_id')

df_profs_emails_answers.head()

In [None]:
#df_profs_emails_answers['Days to 1st email'] = pd.to_datetime(df['date'])
df_profs_emails_answers['days_before_1st_email'] = df_profs_emails_answers['first_email_date'] - df_profs_emails_answers['date_joined']
df_profs_emails_answers['days_ns'] = df_profs_emails_answers['last_email_date'] - df_profs_emails_answers['first_email_date']
df_profs_emails_answers['days_emailed'] = df_profs_emails_answers['days_ns'].apply(lambda x: x.days)
df_profs_emails_answers = df_profs_emails_answers.drop('days_ns', axis = 1)
df_profs_emails_answers.head()


In [None]:
df_profs_emails_answers['emails_per_day'] = df_profs_emails_answers['emails_received'] / df_profs_emails_answers['days_emailed']
df_profs_emails_answers['answers_per_day'] = df_profs_emails_answers['questions_answered'] / df_profs_emails_answers['days_emailed']

df_profs_emails_answers['log_answers_per_day'] = df_profs_emails_answers['answers_per_day'] 
df_profs_emails_answers['log_answers_per_day'] = df_profs_emails_answers['log_answers_per_day'].map(getlog)

df_profs_emails_answers['log_emails_per_day'] = df_profs_emails_answers['emails_per_day'] 
df_profs_emails_answers['log_emails_per_day'] = df_profs_emails_answers['log_emails_per_day'].map(getlog)

df_profs_emails_answers.head()

# Questions answered vs time being emailed

If professionals were staying with the program, we would expect to see a trend rising from left to right.

We do not see this trend, which indicates that professionals stop answering questions even though they continue to receive emails. It would be useful to know whether the professionals open the emails or not, since the answer would lead to different re-engagement strategies.

It is also apparent that there are a large number of professionals who do not answer even one question. A re-engagement strategy is also required for these people.
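
A scatter plot makes it hard to judge a trend by eye when most points sit near zero. One possible complement, sketched here on toy data, is to group professionals into bands of `days_emailed` and summarise `questions_answered` per band. The band edges and the names `toy`, `bins` and `summary` are assumptions made for this illustration.

```python
import pandas as pd

# Toy data with the same column names as df_profs_emails_answers.
toy = pd.DataFrame({
    'days_emailed': [5, 20, 45, 80, 150, 200, 230, 10],
    'questions_answered': [1, 0, 2, 1, 0, 3, 1, 0],
})

# Hypothetical band edges; intervals are open on the left and closed on the right.
bins = [0, 30, 90, 250]
toy['days_band'] = pd.cut(toy['days_emailed'], bins=bins)

# Count, median and mean of answers within each band.
summary = (toy.groupby('days_band', observed=True)['questions_answered']
              .agg(['count', 'median', 'mean']))
print(summary)
```

If professionals were staying with the program, the median would be expected to rise from the first band to the last.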

In [None]:

plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers['days_emailed'],df_profs_emails_answers['questions_answered'],  color='red', s=25, alpha=0.2)

plt.xlim(-5, 250)
plt.ylim(-5,50)

plt.xlabel('days_emailed')
plt.ylabel('questions_answered')
plt.title('CareerVillage Questions Answered')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers['days_emailed'],df_profs_emails_answers['log_questions_answered'],  color='red', s=25, alpha=0.2)

plt.xlabel('days_emailed')
plt.ylabel('log_questions_answered')
plt.title('CareerVillage Questions Answered')
plt.legend()
plt.show()

## This section finds the days that professionals are active

In [None]:
"""This section finds the days that professionals are active"""

answers_author_date = answers[['answers_author_id' ,'answers_date_added']]
answers_author_date = answers_author_date.sort_values(['answers_author_id' ,'answers_date_added'])
answers_author_date.reset_index(inplace =True, drop =True)
answers_author_date['answer_date'] = pd.to_datetime(answers_author_date['answers_date_added'])
answers_author_date['answer_date'] = answers_author_date['answer_date'].dt.normalize()
answers_author_date = answers_author_date.drop ('answers_date_added', axis = 1)

answers_author_date_min = answers_author_date.groupby('answers_author_id').min()
answers_author_date_max = answers_author_date.groupby('answers_author_id').max()
answers_author_date_min_max = answers_author_date_min.merge(right=answers_author_date_max, how = 'left',
                                            left_on ='answers_author_id',
                                            right_on ='answers_author_id')

answers_author_date_min_max = answers_author_date_min_max.rename(columns={'answer_date_x':'first_answer'})
answers_author_date_min_max = answers_author_date_min_max.rename(columns={ 'answer_date_y' :'last_answer'})

answers_author_date_min_max.head()

The plot below shows that answers per day decay with time. This reinforces the conclusion that professionals are not staying with the program. 
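
The per-day rates divide by `days_emailed`, which is zero for professionals whose first and last emails fall on the same day. The sketch below shows one way to guard against that and to take the log in a vectorised way. It runs on toy data and is an alternative under stated assumptions, not the calculation used for the plot; `toy`, `span` and `positive` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as df_profs_emails_answers.
toy = pd.DataFrame({
    'questions_answered': [4.0, 0.0, 10.0],
    'days_emailed': [0, 30, 100],
})

# A zero-day span would give a division by zero, so treat it as missing.
span = toy['days_emailed'].where(toy['days_emailed'] > 0)
toy['answers_per_day'] = toy['questions_answered'] / span

# log10 is undefined at zero; mask zeros to NaN before taking the log.
positive = toy['answers_per_day'].where(toy['answers_per_day'] > 0)
toy['log_answers_per_day'] = np.log10(positive)
print(toy)
```

Rows with a missing rate are simply omitted by `plt.scatter`, so no extra filtering is needed before plotting.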

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers['days_emailed'],df_profs_emails_answers['log_answers_per_day'],  color='red', s=25, alpha=0.2)

plt.xlabel('days_emailed')
plt.ylabel('log_answers_per_day')
plt.title('CareerVillage Response ')
plt.legend()
plt.show()

In [None]:
df_profs_emails_answers_active = df_profs_emails_answers.merge(right=answers_author_date_min_max, how = 'left',
                                            left_on ='professionals_id',
                                            right_on ='answers_author_id')
df_profs_emails_answers_active.head()

In [None]:
df_profs_emails_answers_active['days_active_ns'] = df_profs_emails_answers_active['last_answer'] - df_profs_emails_answers_active['first_answer']
df_profs_emails_answers_active['days_active'] = df_profs_emails_answers_active['days_active_ns'].apply(lambda x: x.days)
df_profs_emails_answers_active = df_profs_emails_answers_active.drop('days_active_ns', axis = 1)
df_profs_emails_answers_active.head(10)

# Activity v Engagement<a id='activity'></a>

The following plot explores the relationship between engagement, measured as the number of days between the first and last question answered, and activity, measured as the number of questions answered.

The plot shows a weak upward trend for those professionals who stay active. 
The density of the points shows the number of professionals, and this indicates that many do not stay active.
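
The strength of such a trend could be expressed as a rank correlation. The sketch below computes a Spearman correlation as the Pearson correlation of the ranks, on toy data with the same column names as `df_profs_emails_answers_active`. The values are illustrative and say nothing about the real data; `toy`, `valid` and `rho` are names introduced here.

```python
import pandas as pd

# Toy data; None marks a professional with no recorded answers.
toy = pd.DataFrame({
    'days_active': [0, 10, 40, 200, 365, None],
    'questions_answered': [1, 3, 2, 8, 15, 0],
})

# Keep only professionals with a defined active span.
valid = toy.dropna(subset=['days_active'])

# Spearman correlation is the Pearson correlation of the ranks.
rho = valid['days_active'].rank().corr(valid['questions_answered'].rank())
print(round(rho, 3))
```

A rank correlation is used here because the answer counts are heavily skewed, which makes a plain Pearson correlation sensitive to a few very active professionals.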

[Go to summary](#summary)   

In [None]:
df_profs_emails_answers_active['log_questions_answered'] = df_profs_emails_answers_active['questions_answered'] 
df_profs_emails_answers_active['log_questions_answered'] = df_profs_emails_answers_active['log_questions_answered'].map(getlog)
plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers_active['days_active'],df_profs_emails_answers_active['log_questions_answered'],  color='red', s=25, alpha=0.2)
plt.xlabel('days_active')
plt.ylabel('log_questions_answered')
plt.title('CareerVillage Activity ')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(df_profs_emails_answers_active['days_active'],df_profs_emails_answers_active['questions_answered'],  color='red', s=25, alpha=0.2)

plt.xlim(-50, 500)
plt.ylim(-5,50)

plt.xlabel('days_active')
plt.ylabel('questions_answered')
plt.title('CareerVillage Activity ')
plt.legend()
plt.show()

# Professionals Data Frame<a id='professionals'></a>

[Go to summary](#summary) 

In [None]:
df_profs_emails_answers_active.describe(include = 'all')

# Professionals Data<a id='pd'></a>

[Go to summary](#summary) 

In [None]:
df_profs_emails_answers_active.sum()

# Questions and Answers<a id='questions'></a>

[Go to summary](#summary) 

In [None]:
questions.describe()

In [None]:
questions_answers = questions.merge(right=answers, how = 'left',
                                            left_on ='questions_id',
                                            right_on ='answers_question_id')
questions_answers.describe()

In [None]:
questions_answers.head(2)

In [None]:
print(questions_answers['answers_id'].isna().sum() )

# Who Answers<a id='whoanswers'></a>

[Go to summary](#summary) 

In [None]:
answers_v_professionals = df_profs_emails_answers_active.copy()
answers_v_professionals = answers_v_professionals[['questions_answered','professionals_id']]
answers_v_professionals = answers_v_professionals.groupby('questions_answered').count()
match_recipents = answers_v_professionals.sort_values ('questions_answered')
answers_v_professionals = answers_v_professionals.reset_index()
answers_v_professionals = answers_v_professionals.rename(columns={'professionals_id': 'professionals'})
print (answers_v_professionals.sum())
answers_v_professionals.head(50)

# Answers v Professionals Plot 1

In [None]:
plt.figure(figsize=(10,10))
# The label gives plt.legend() below an entry to display
plt.scatter(answers_v_professionals['questions_answered'],answers_v_professionals['professionals'],  color='red', s=25, alpha=0.2, label='professionals')

plt.xlabel('questions_answered')
plt.ylabel('professionals')
plt.title('Professional Activity ')
plt.legend()
plt.show()


# Answers v Professionals Plot 2 <a id='answers v professionals'></a>

[Go to summary](#summary) 

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(answers_v_professionals['questions_answered'],answers_v_professionals['professionals'],  color='red', s=25, alpha=0.2, label='professionals')
# Zoom in on the dense region near the origin
plt.xlim(-10, 100)
plt.ylim(-50, 200)
plt.xlabel('questions_answered')
plt.ylabel('professionals')
plt.title('Professional Activity ')
plt.legend()
plt.show()

# What % of answers are provided in a professional's 1st day? <a id='answers_first_day'></a>

[Go to summary](#summary) 

In [None]:

answers_profs = answers.merge(right=professionals, how = 'left',
                                            left_on ='answers_author_id',
                                            right_on ='professionals_id')
answers_profs['DateTime'] = pd.to_datetime(answers_profs['professionals_date_joined'])
answers_profs['date_joined'] = answers_profs['DateTime'].dt.normalize()
answers_profs['DateTime'] = pd.to_datetime(answers_profs['answers_date_added'])
answers_profs['answer_date'] = answers_profs['DateTime'].dt.normalize()
answers_profs = answers_profs.drop(['DateTime'],axis = 1)

answers_profs['days_ns'] = answers_profs['answer_date'] - answers_profs['date_joined']
answers_profs['days_to_answer'] = answers_profs['days_ns'].apply(lambda x: x.days)
answers_profs = answers_profs.drop('days_ns', axis = 1)


answers_profs.describe()

In [None]:
answers_v_professionals = answers_v_professionals.groupby('questions_answered').count()


In [None]:
sorted_answers_profs = answers_profs.sort_values ('days_to_answer')
sorted_answers_profs = sorted_answers_profs.reset_index()
#sorted_answers_profs.head()
sorted_answers_profs_g = sorted_answers_profs.groupby('days_to_answer').count()
sorted_answers_profs_g.head()
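
The grouped counts above can be turned into the percentage that the section heading asks about. The following is a minimal sketch, not the notebook's own method: `share_answered_within` is a hypothetical helper, the toy series is made-up data, and dropping negative gaps (answers dated before the join date) is an assumption about how such rows should be treated.

```python
import pandas as pd


def share_answered_within(days_to_answer, max_days=0):
    """Fraction of answers written within `max_days` days of the author joining."""
    valid = days_to_answer.dropna()
    # Assumption: negative gaps are data quirks and are excluded.
    valid = valid[valid >= 0]
    return float((valid <= max_days).mean())


# Hypothetical toy data standing in for answers_profs['days_to_answer'].
toy_days = pd.Series([0, 0, 1, 5, 30, -1, None])

first_day = share_answered_within(toy_days, max_days=0)
first_two_days = share_answered_within(toy_days, max_days=1)
print(f"Answered on the joining day: {first_day:.0%}")
print(f"Answered by the day after joining: {first_two_days:.0%}")
```

On the real data the call would be `share_answered_within(answers_profs['days_to_answer'])`. Because both dates are normalized to midnight, a gap of 0 means the same calendar day rather than within 24 hours.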

# EDA on hearts and comments

The following analysis shows that there are too few comments and hearts to score answers reliably when deciding which professionals provide the best answers:

total answers: 51,138

total comments: 14,966

total answers with heart score > 0: 51,138 - 37,301 = 13,837
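
The totals above can be expressed as coverage, that is, the fraction of answers that received any feedback at all. The sketch below is one possible way to compute it under stated assumptions: `feedback_coverage` is a hypothetical helper, the toy frames are made-up data, and the column names (`answers_id`, `id`, `score`, `comments_parent_content_id`) are taken from the cells in this notebook. Note that it counts commented answers, not comments, so the result is not directly comparable with the raw comment total.

```python
import pandas as pd


def feedback_coverage(answers, answer_scores, comments):
    """Fraction of answers with at least one heart, and with at least one comment."""
    answer_ids = answers["answers_id"]
    hearted_ids = answer_scores.loc[answer_scores["score"] > 0, "id"]
    # Comments can also hang off questions; isin() keeps only parents that are answers.
    commented_ids = comments["comments_parent_content_id"]
    return {
        "hearted": float(answer_ids.isin(hearted_ids).mean()),
        "commented": float(answer_ids.isin(commented_ids).mean()),
    }


# Hypothetical toy data with the same column names as the competition tables.
toy_answers = pd.DataFrame({"answers_id": ["a1", "a2", "a3", "a4"]})
toy_scores = pd.DataFrame({"id": ["a1", "a2", "a3", "q1"], "score": [2, 0, 1, 3]})
toy_comments = pd.DataFrame({"comments_parent_content_id": ["a1", "a1", "q1"]})

coverage = feedback_coverage(toy_answers, toy_scores, toy_comments)
print(coverage)
```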


In [None]:
comments.describe ()

In [None]:
answer_scores.describe()

In [None]:
grp_answer_scores = answer_scores.groupby('score').count()
grp_answer_scores.head()

In [None]:
df_answer_scores_comments= answer_scores.merge(right=comments, how = 'left',
                                            left_on ='id',
                                            right_on ='comments_parent_content_id')
df_answer_scores_comments.head()

In [None]:
df_answer_scores_comments.describe()

In [None]:
df_qa= questions_bow.merge(right=answers, how = 'left',
                                            left_on ='questions_id',
                                            right_on ='answers_question_id')


In [None]:
df_qahc= df_qa.merge(right=df_answer_scores_comments, how = 'left',
                                            left_on ='answers_id',
                                            right_on ='id')

df_qahc_s = df_qahc.drop (['answers_id','answers_author_id',
                           'answers_question_id','answers_date_added','id','comments_id','comments_author_id',
                           'comments_parent_content_id','comments_date_added'], axis=1)
df_qahc_s = df_qahc_s.sort_values ('score' ,ascending = False)
df_qahc_s.head(1)

# Groups

Having something in common is important, so forming groups is a good idea. Currently, however, there are only groups with members, so groups cannot yet be used to direct questions to professionals.

In [None]:
group_memberships.describe()

In [None]:
group_memberships_group  = group_memberships.groupby(by='group_memberships_group_id').count()

group_memberships_group  =group_memberships_group.sort_values ('group_memberships_user_id', ascending = False).reset_index()
group_memberships_group.head(50)


In [None]:
groups.head()

# Schools membership

Membership is a mixture of students and professionals. The numbers are small, so school membership cannot currently be a major part of the recommender. However, since having something in common is important, school membership could be used in the future to help prioritise which professional should be asked a question, although relevance is usually more important.
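
To see how the mixture of students and professionals breaks down per school, each member can be labelled by the table its id appears in. This is a hedged sketch rather than the notebook's own approach: `members_by_user_type` is a hypothetical helper, the toy frames are made-up data, and the column names are assumed to match the competition tables used in the cells below.

```python
import pandas as pd


def members_by_user_type(school_memberships, students, professionals):
    """Count the members of each school, split by user type."""
    user_ids = school_memberships["school_memberships_user_id"]
    user_type = pd.Series("unknown", index=school_memberships.index)
    user_type[user_ids.isin(students["students_id"])] = "student"
    # If an id appears in both tables, the professional label wins (arbitrary choice).
    user_type[user_ids.isin(professionals["professionals_id"])] = "professional"
    return (
        school_memberships.assign(user_type=user_type)
        .groupby(["school_memberships_school_id", "user_type"])
        .size()
        .unstack(fill_value=0)
    )


# Hypothetical toy data with the same column names as the competition tables.
toy_memberships = pd.DataFrame({
    "school_memberships_school_id": [1, 1, 1, 2],
    "school_memberships_user_id": ["u1", "u2", "u3", "u4"],
})
toy_students = pd.DataFrame({"students_id": ["u1", "u4"]})
toy_professionals = pd.DataFrame({"professionals_id": ["u2", "u3"]})

counts = members_by_user_type(toy_memberships, toy_students, toy_professionals)
print(counts)
```

A table like this makes it easy to spot schools that have both students and professionals, which is where a shared-school signal could apply.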

In [None]:
school_memberships.head()

In [None]:
school_memberships_group  = school_memberships.groupby(by='school_memberships_school_id').count()

school_memberships_group  =school_memberships_group.sort_values ('school_memberships_user_id', ascending = False).reset_index()
school_memberships_group.head()

In [None]:
school_memberships_group.describe()

In [None]:
school_student = school_memberships.merge(right=students, how = 'outer',
                                            left_on ='school_memberships_user_id',
                                            right_on ='students_id')
school_student_prof = school_student.merge(right=professionals, how = 'outer',
                                            left_on ='school_memberships_user_id',
                                            right_on ='professionals_id')


In [None]:
school_student_prof_group  = school_student_prof.groupby(by='school_memberships_school_id').count()

school_student_prof_group  =school_student_prof_group.sort_values ('school_memberships_user_id', ascending = False).reset_index()
school_student_prof_group.head()