In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
import seaborn as sns
import warnings
import copy
import nltk
import string
from nltk.corpus import stopwords
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer
from scipy.spatial.distance import pdist, squareform
warnings.filterwarnings('ignore')
root_path = '../input/'
print('The csv files provided are:\n')
print(os.listdir(root_path))

The csv files provided are:

['emails.csv', 'questions.csv', 'professionals.csv', 'comments.csv', 'tag_users.csv', 'group_memberships.csv', 'tags.csv', 'answer_scores.csv', 'students.csv', 'groups.csv', 'tag_questions.csv', 'question_scores.csv', 'matches.csv', 'answers.csv', 'school_memberships.csv']


In [2]:
df_emails = pd.read_csv(root_path + 'emails.csv')
df_questions = pd.read_csv(root_path + 'questions.csv')
df_professionals = pd.read_csv(root_path + 'professionals.csv')
df_comments = pd.read_csv(root_path + 'comments.csv')
df_tag_users = pd.read_csv(root_path + 'tag_users.csv')
df_group_memberships = pd.read_csv(root_path + 'group_memberships.csv')
df_tags = pd.read_csv(root_path + 'tags.csv')
df_answer_scores = pd.read_csv(root_path + 'answer_scores.csv')
df_students = pd.read_csv(root_path + 'students.csv')
df_groups = pd.read_csv(root_path + 'groups.csv')
df_tag_questions = pd.read_csv(root_path + 'tag_questions.csv')
df_question_scores = pd.read_csv(root_path + 'question_scores.csv')
df_matches = pd.read_csv(root_path + 'matches.csv')
df_answers = pd.read_csv(root_path + 'answers.csv')
df_school_memberships = pd.read_csv(root_path + 'school_memberships.csv')

This notebook is a continuation of [Part I: Yet Another EDA, Strategy & Useful Links](https://www.kaggle.com/akshayt19nayak/part-i-yet-another-eda-strategy-useful-links). Here, I focus on building a recommender system **exclusively using tags**. Interestingly, this solution does not involve any machine learning. A recommender system built on text extracted from the bodies of questions and answers is in the works. I haven't done everything here due to memory and time constraints. There are 3 sections to this notebook:

- [Pre-processing](#Pre-processing)
- [Recommender System](#Recommender-System)
- [Real-Time Implementation](#Real-Time-Implementation)

## Pre-processing

First, we keep only the tags that are either followed by more than 0.25% of the tagged users or attached to more than 0.25% of the tagged questions (about 58 questions). These threshold values were selected after a bit of experimentation and can be treated as hyperparameters (although they aren't in the technical sense).

In [35]:
df_tag_users_merged = pd.merge(df_tag_users, df_tags, left_on='tag_users_tag_id', right_on='tags_tag_id', how='inner')
thresh = 0.0025
#To see the tags that are linked with every question
df_tag_questions_merged = pd.merge(df_tag_questions, df_tags, left_on='tag_questions_tag_id', right_on='tags_tag_id', how='inner')
#Share of tagged users that follow each tag
user_tag_share = df_tag_users_merged['tags_tag_name'].value_counts()/df_tag_users_merged['tag_users_user_id'].nunique()
user_tags = list(user_tag_share[user_tag_share > thresh].index)
#Share of tagged questions that carry each tag
question_tag_share = df_tag_questions_merged['tags_tag_name'].value_counts()/df_tag_questions_merged['tag_questions_question_id'].nunique()
question_tags = list(question_tag_share[question_tag_share > thresh].index)
relevant_tags = set(user_tags).union(set(question_tags))
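
As a rough illustration of the thresholding rule, here is a minimal, dependency-free sketch on made-up data. The function name `frequent_tags` and the toy pairs are assumptions for illustration only; it also assumes that each (id, tag) pair appears once, as in the merged tables.

```python
from collections import Counter

def frequent_tags(pairs, thresh):
    """Keep tags attached to more than `thresh` of the distinct ids.

    pairs: iterable of (entity_id, tag_name), e.g. (user id, tag) or (question id, tag).
    """
    pairs = list(pairs)
    n_entities = len({entity for entity, _ in pairs})   # distinct users or questions
    counts = Counter(tag for _, tag in pairs)           # rows per tag
    return {tag for tag, count in counts.items() if count / n_entities > thresh}

# Hypothetical toy data: 4 users, 'law' followed by 3 of them, 'film' by 1
toy_user_tags = [('u1', 'law'), ('u2', 'law'), ('u3', 'film'), ('u4', 'law')]
toy_question_tags = [('q1', 'film'), ('q2', 'film'), ('q3', 'law')]

relevant = frequent_tags(toy_user_tags, 0.5) | frequent_tags(toy_question_tags, 0.5)
print(relevant)
```

Taking the union means a tag only has to clear the threshold on one side (users or questions) to be kept.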

Now we'll look at the coverage of these tags

In [36]:
print('The number of relevant tags:', len(relevant_tags))
print('Coverage of tagged questions:', df_tag_questions_merged[df_tag_questions_merged['tags_tag_name'].isin(
relevant_tags)]['tag_questions_question_id'].nunique()/df_tag_questions_merged['tag_questions_question_id'].nunique())
print('Coverage of tagged users:', df_tag_users_merged[df_tag_users_merged['tags_tag_name'].isin(
relevant_tags)]['tag_users_user_id'].nunique()/df_tag_users_merged['tag_users_user_id'].nunique())

The number of relevant tags: 421
Coverage of tagged questions: 0.8777911370663002
Coverage of tagged users: 0.9266935964505661


There are a few things we would like to do here:
- Find tags that are misspellings of, or otherwise similar to, the relevant tags
- Replace the similar tags with the main tags. A good way to decide on main tags is to rank them by the number of people who follow them. Why? Professionals only get alerts for, or see, questions on the platform if they follow a particular tag
- Thereby increase the coverage

### Edit Distance (Levenshtein Distance)

- Edit Distance (a.k.a. Levenshtein Distance) is a measure of similarity between two strings referred to as the source string and the target string.

- The distance between the source string and the target string is the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target. The lower the distance, the more similar the two strings. ([Source](https://python.gotrained.com/nltk-edit-distance-jaccard-distance/))
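
To make the definition concrete, here is a small sketch of the textbook dynamic-programming computation of the distance. It is a plain-Python illustration (the name `levenshtein` is mine), not the NLTK implementation; with unit costs for all three operations it should agree with `nltk.edit_distance` under its default arguments.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete all i characters of the prefix
    for j in range(cols):
        dist[0][j] = j          # insert all j characters of the prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return dist[-1][-1]

for candidate in ['computers', '#computer', '#computers']:
    print(candidate, levenshtein('computer', candidate))
```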

To see an example we'll first test it on the tag 'computer'

In [5]:
def return_if_found(df, column, string):
    return(df[df[column].str.contains(string)][column].unique())
array_computer = return_if_found(df_tag_users_merged, 'tags_tag_name', 'computer')
computer_ed = np.array([nltk.edit_distance('computer',word) for word in array_computer])
df_computer_ed = pd.DataFrame({'string':array_computer, 'edit_distance':computer_ed})
df_computer_ed[df_computer_ed['edit_distance']<=2]

| | string | edit_distance |
|---|---|---|
| 3 | computer | 0 |
| 22 | computers | 1 |
| 56 | #computers | 2 |
| 72 | #computer | 1 |


The algorithm is as follows:
- Sort the tags based on the count of tagged users per tag
- Save the positions (or the rank) of these tags in a separate dictionary
- For every relevant tag longer than 5 characters, collect the tags within an edit distance of 2. For relevant tags of 5 characters or fewer, collect only the variants formed by adding a '#' at the beginning or an 's'/'-' at the end (with short tags, even an edit distance of 1 can change the meaning entirely)
- If a higher-ranked tag (rank 0 is the highest) appears in the list of similar tags for a particular tag, remove it from that list
- Extend each tag's list with the similar tags of its lower-ranked tags
- Drop any key of the dictionary of similar tags that also appears among its values
- Make a dictionary of tags to replace with the main tags

In [6]:
#Sorting tags based on follower count
dict_users_tags = df_tag_users_merged['tags_tag_name'].value_counts().to_dict()
list_users_tags = sorted(dict_users_tags, key=dict_users_tags.get)[::-1]
dict_rank_tags = {}
for rank, tags in enumerate(list_users_tags):
    dict_rank_tags[tags] = rank
def find_similar(relevant, space):
    dict_similar_tags = {}
    for tag_i in tqdm(relevant):
        dict_similar_tags[tag_i] = []
        for tag_j in space:
            if tag_i != tag_j:
                if (len(tag_i) >5 and nltk.edit_distance(tag_i, tag_j) <= 2): #Condition 1
                    dict_similar_tags[tag_i].append(tag_j)
                elif (len(tag_i) <=5 and (tag_j == '#'+tag_i or tag_j == tag_i+'s' or tag_j == tag_i+'-')): #Condition 2
                    dict_similar_tags[tag_i].append(tag_j)
    return(dict_similar_tags)
dict_similar_tags = find_similar(relevant_tags, list_users_tags)

100%|██████████| 421/421 [16:51<00:00,  2.31s/it]
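
The search above takes about 17 minutes because every pair of tags goes through a full edit-distance computation. One possible speed-up, sketched below under my own assumptions, relies on the fact that the edit distance can never be smaller than the difference in length of the two strings, and that the computation can stop as soon as every entry of a row exceeds the budget. The names `bounded_edit_distance` and `similar_within` are hypothetical and the distance is a plain-Python stand-in for `nltk.edit_distance`; I have not timed this on the actual data.

```python
def bounded_edit_distance(a, b, max_dist):
    """Exact Levenshtein distance if it is <= max_dist, otherwise max_dist + 1."""
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1      # the length gap alone already needs too many edits
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution or match
        if min(current) > max_dist:
            return max_dist + 1  # no alignment can get back under the budget
        previous = current
    return min(previous[-1], max_dist + 1)

def similar_within(tag, space, max_dist=2):
    """Tags in `space` (other than `tag`) within `max_dist` edits of `tag`."""
    return [other for other in space
            if other != tag and bounded_edit_distance(tag, other, max_dist) <= max_dist]

toy_space = ['internships', '#internships', 'intership', 'externship', 'engineering', 'internship']
print(similar_within('internship', toy_space))
```

This only covers Condition 1 (tags longer than 5 characters); the prefix/suffix check of Condition 2 is already cheap.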


In [7]:
print('Rank of tag internship:', dict_rank_tags['internship'])
print('Tags similar to internship:', dict_similar_tags['internship'])
print('Rank of tag internships:', dict_rank_tags['internships'])
print('Tags similar to internships:', dict_similar_tags['internships'])

Rank of tag internship: 204
Tags similar to internship: ['internships', '#internships', '#internship', 'internships2', '#intership', 'externship', 'intership', 'internship2', 'interneships', 'internships-']
Rank of tag internships: 29
Tags similar to internships: ['#internships', 'internship', '#internship', 'internships2', 'intership', 'internship2', 'interneships', 'internships-']


Our job now is to drop the lower-ranked tag 'internship' as a key and fold all of its variations into the higher-ranked tag 'internships'.

In [8]:
#Deepcopy as the value is a mutable data structure
dict_similar_tags_copy = copy.deepcopy(dict_similar_tags)
for tag, similar in dict_similar_tags.items():
    for subtag in similar:
        #Drop subtags that outrank (have more followers than) the tag itself
        if dict_rank_tags[subtag] < dict_rank_tags[tag]:
            dict_similar_tags_copy[tag].remove(subtag)
for tag, similar in dict_similar_tags_copy.items():
    extensions = []
    for subtag in similar:
        #Only relevant tags are keys, so any other subtag contributes nothing
        extensions.extend(dict_similar_tags_copy.get(subtag, []))
    dict_similar_tags_copy[tag].extend(extensions)
    dict_similar_tags_copy[tag] = list(set(dict_similar_tags_copy[tag]))
#This is what we'll replace
similar_tags = list(set([item for sublist in list(dict_similar_tags_copy.values()) for item in sublist]))
list_keys = list(dict_similar_tags.keys())
for tag in list_keys:
    if tag in similar_tags:
        del dict_similar_tags_copy[tag]
#This is what we'll use to replace
main_tags = list(dict_similar_tags_copy.keys())

In [9]:
print('Tags similar to internships:', dict_similar_tags_copy['internships'])
print('Tags similar to internship:\n')
try:
    print(dict_similar_tags_copy['internship'])
#internship is not a key on account of being a low ranked tag
except KeyError as e:
    print('Key Error:', e)

Tags similar to internships: ['intership', '#externships', 'internships2', '#internship', 'internship2', 'externship', 'internships-', 'internship', '#intership', 'interneships', '#internships']
Tags similar to internship:

Key Error: 'internship'
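
To see the whole collapsing step on something small enough to follow by hand, here is a simplified sketch on made-up follower counts. The function `collapse_variants` and the toy dictionaries are assumptions for illustration; unlike the cell above, it builds new dictionaries at each step instead of modifying one in place, so results on real data may differ slightly in edge cases.

```python
def collapse_variants(similar, follower_counts):
    """Map every tag variant to its most-followed 'main' tag."""
    ordered = sorted(follower_counts, key=follower_counts.get, reverse=True)
    rank = {tag: position for position, tag in enumerate(ordered)}   # 0 = most followed

    # 1. Keep only the variants ranked below the tag they are attached to.
    pruned = {tag: [v for v in variants if rank[v] > rank[tag]]
              for tag, variants in similar.items()}

    # 2. Pull in the variants of each variant, de-duplicating with a set.
    extended = {tag: set(variants).union(*(pruned.get(v, []) for v in variants))
                for tag, variants in pruned.items()}

    # 3. A tag that is somebody else's variant cannot be a main tag.
    absorbed = set().union(*extended.values())
    main = {tag: variants for tag, variants in extended.items() if tag not in absorbed}

    # 4. Replacement dictionary: variant -> main tag, and main tag -> itself.
    replace = {variant: tag for tag, variants in main.items() for variant in variants}
    replace.update({tag: tag for tag in main})
    return replace

# Hypothetical follower counts and similarity lists
toy_counts = {'internships': 500, 'law': 300, 'internship': 120,
              '#internship': 10, '#law': 5, 'intership': 3}
toy_similar = {'internships': ['internship', '#internship'],
               'internship': ['internships', '#internship', 'intership'],
               'law': ['#law']}
toy_replace = collapse_variants(toy_similar, toy_counts)
print(toy_replace)
```

Note how 'intership' ends up under 'internships' even though it was only listed as similar to 'internship': that is the effect of the extension step.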


Now let's look at the coverage

In [39]:
print('The number of relevant tags:', len(main_tags))
print('The number of tags that the relevant tags cover:', len(similar_tags))
df_tag_users_merged = pd.merge(df_tag_users, df_tags, left_on='tag_users_tag_id', right_on='tags_tag_id', how='inner')
df_tag_questions_merged = pd.merge(df_tag_questions, df_tags, left_on='tag_questions_tag_id', right_on='tags_tag_id', how='inner')
print('Coverage of tagged questions:', df_tag_questions_merged[df_tag_questions_merged['tags_tag_name'].isin(
    set(main_tags).union(set(similar_tags)))]['tag_questions_question_id'].nunique()/df_tag_questions_merged['tag_questions_question_id'].nunique())
print('Coverage of tagged users:', df_tag_users_merged[df_tag_users_merged['tags_tag_name'].isin(
    set(main_tags).union(set(similar_tags)))]['tag_users_user_id'].nunique()/df_tag_users_merged['tag_users_user_id'].nunique())

The number of relevant tags: 387
The number of tags that the relevant tags cover: 1121
Coverage of tagged questions: 0.895225008588114
Coverage of tagged users: 0.9358320641017152


The coverage has improved, albeit marginally, and the number of tags has gone down. We don't want to consider too many tags, as that would blow up the number of features per question and create computational problems. Next, we replace the similar tags with the main tags.

In an earlier version of the notebook, we used tags with usage above a threshold of 0.5% of tagged questions and 0.5% of tagged users. In that case the coverage of tagged questions improved by about 3%, compared to the 2% here. Minor improvements matter, as even a 1% improvement translates to 233 questions. By lowering the threshold and applying the algorithm, we have been able to cover 1400 additional questions.

In [11]:
#To create a dictionary of tags that'll have the tag to be replaced as the key and the tag to replace it by as the value
dict_replace_tag = {}
for tag, similar in dict_similar_tags_copy.items():
    for subtag in similar:
        dict_replace_tag[subtag] = tag
for tag in main_tags:
    dict_replace_tag[tag] = tag
#Only looking at the questions that have at least one tag out of the union of main tags and similar tags
df_all_tag_questions = df_tag_questions_merged[df_tag_questions_merged['tags_tag_name'].isin(
    set(main_tags).union(similar_tags))].groupby('tag_questions_question_id', as_index=False).agg({'tags_tag_name':list})
def replace(list_tags):
    #Map each tag to its main tag and drop duplicates; tags without a mapping are skipped
    replaced_list = [dict_replace_tag[tag] for tag in list_tags if tag in dict_replace_tag]
    return(list(set(replaced_list)))
df_all_tag_questions['replaced_tag_name'] = df_all_tag_questions['tags_tag_name'].apply(replace)

### Cosine Similarity

Now we one-hot encode the questions with tags they are linked with and build a similarity matrix using cosine similarity as the metric

In [12]:
#Create MultiLabelBinarizer object
one_hot = MultiLabelBinarizer()
#One-hot encode data
design_matrix = one_hot.fit_transform(df_all_tag_questions['replaced_tag_name'])

In [16]:
#Building cosine similarity matrix 
cos_sim = 1-squareform(pdist(design_matrix, metric='cosine')) #pdist computes cosine distance, so we subtract that from 1 to compute similarity
del design_matrix #To free up the RAM
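
To show what the similarity matrix contains, here is a tiny hand-checkable sketch in plain Python. The helper names `one_hot` and `cosine` and the three toy questions are assumptions for illustration; the notebook itself uses `MultiLabelBinarizer` and `pdist`. It assumes every question has at least one tag, otherwise the norm in the denominator would be zero.

```python
from math import sqrt

def one_hot(tag_lists):
    """Return the sorted vocabulary and one 0/1 row per question."""
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1 if tag in tags else 0 for tag in vocabulary] for tags in tag_lists]
    return vocabulary, rows

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical questions, already reduced to their main tags
toy_questions = [['psychology', 'law'],
                 ['psychology', 'law', 'clinical'],
                 ['film']]
vocabulary, rows = one_hot(toy_questions)
similarity = [[cosine(u, v) for v in rows] for u in rows]

print(vocabulary)
for row in similarity:
    print([round(value, 3) for value in row])
```

For 0/1 vectors the formula reduces to (number of shared tags) / sqrt(tags in A × tags in B), so the first two questions score 2/sqrt(2 × 3) and questions with identical tag sets always score 1.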

## Recommender System

To make recommendations based on a question:    
- Find questions that are similar to the one under consideration
- Check whether the similar questions have been answered
- If so, check whether the professionals who answered are active, i.e. have answered a question within the last year (measured from the date of the most recent question on the platform)
- If multiple professionals fit the criteria, rank them by the proportion of questions they have answered within 48 hours of posting, [since that is a key metric](https://www.kaggle.com/c/data-science-for-good-careervillage/discussion/84845#latest-496046)

In [17]:
#To see the profile of the volunteers and the questions that they have answered
df_questions['questions_date_added'] = pd.to_datetime(df_questions['questions_date_added'])
df_answers['answers_date_added'] = pd.to_datetime(df_answers['answers_date_added'])
df_answers_professionals = pd.merge(df_answers, df_professionals, left_on='answers_author_id', right_on='professionals_id', how='outer')
df_questions_answers_professionals = pd.merge(df_questions, df_answers_professionals, left_on='questions_id', right_on='answers_question_id')
#Earliest answer by each professional to each question
df_qap_time_taken = df_questions_answers_professionals.groupby(['professionals_id','questions_id']).agg({'questions_date_added':'min', 'answers_date_added':'min'})
df_qap_time_taken['less_than_2_days'] = df_qap_time_taken['answers_date_added'] - df_qap_time_taken['questions_date_added'] < pd.Timedelta(days=2)
#Proportion of each professional's answers that arrived within 2 days
df_qap_time_taken = df_qap_time_taken.reset_index().groupby('professionals_id', as_index=False).agg({'less_than_2_days':'mean'})
last_date = df_questions['questions_date_added'].max() #date of the last question asked on the platform
#Days between each professional's latest answer and the last question on the platform
df_ap_grouped = df_answers_professionals.groupby('professionals_id').agg({'answers_date_added':'max'}).apply(lambda x: (last_date-x).dt.days)
df_ap_grouped.rename(columns={'answers_date_added':'days_since_answered'}, inplace=True)
active_professionals = df_ap_grouped[df_ap_grouped['days_since_answered']<365].index
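
The activity filter and the ranking can be checked on a handful of made-up answers. The sketch below is a plain-Python stand-in for the pandas cell above, not a replacement for it: `rank_professionals`, the tuple format and the toy dates are all hypothetical, and it assumes at most one answer per professional and question.

```python
from datetime import datetime, timedelta

def rank_professionals(answers, last_date, active_days=365, fast=timedelta(days=2)):
    """Rank active professionals by the share of their answers given within `fast`.

    answers: iterable of (professional_id, question_added, answer_added) tuples.
    """
    stats = {}
    for pro, asked, answered in answers:
        entry = stats.setdefault(pro, {'total': 0, 'fast': 0, 'latest': answered})
        entry['total'] += 1
        entry['fast'] += (answered - asked) < fast         # True counts as 1
        entry['latest'] = max(entry['latest'], answered)   # most recent answer
    # Active = latest answer less than `active_days` before the last question
    active = [pro for pro, entry in stats.items()
              if (last_date - entry['latest']).days < active_days]
    return sorted(active, key=lambda pro: stats[pro]['fast'] / stats[pro]['total'],
                  reverse=True)

toy_last_date = datetime(2019, 1, 31)
toy_answers = [
    ('pro_a', datetime(2019, 1, 1), datetime(2019, 1, 2)),        # answered in 1 day
    ('pro_a', datetime(2019, 1, 10), datetime(2019, 1, 20)),      # answered in 10 days
    ('pro_b', datetime(2018, 12, 1), datetime(2018, 12, 1, 12)),  # answered in 12 hours
    ('pro_c', datetime(2016, 5, 1), datetime(2016, 5, 1, 6)),     # fast, but long inactive
]
toy_ranking = rank_professionals(toy_answers, toy_last_date)
print(toy_ranking)
```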

### Example Recommendation 1

In [67]:
qid = 1
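#Indices of the 5 highest similarities after dropping the last (top) position, which is normally the question itself
idx = np.argsort(cos_sim[qid,:])[-6:-1]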
print('Question Title and Body:\n')
#Sample question
print(list(df_questions[df_questions['questions_id']==df_all_tag_questions['tag_questions_question_id'].iloc[qid]]['questions_title']))
print(list(df_questions[df_questions['questions_id']==df_all_tag_questions['tag_questions_question_id'].iloc[qid]]['questions_body']))

Question Title and Body:

['Should I declare a minor during undergrad if I want to be a lawyer?']
["I'm currently an undergrad, but I want to go to law school and be a lawyer. I've been thinking about minoring in psychology but would it be helpful in the long run at all? #psychology #law "]


In [64]:
#Printing out the question body as it gives more insight into what the student actually wants to ask
print('Similar questions ranked by cosine similarity:\n')
for rank, index in enumerate(idx[::-1]):
    print(rank, '-', list(df_questions[df_questions['questions_id']==df_all_tag_questions.iloc[index]['tag_questions_question_id']]['questions_body']))

Similar questions ranked by cosine similarity:

0 - ["Interested in the field and seeing the various areas of expertise one can have. Also what types of schooling beyond a bachelor's degree could help? #psychology #law #clinical"]
1 - ["I'm asking this because on TV I've seen many shows about people who study the criminal brain and they study how criminals think. This seems very interesting to me but I don't know where to get started. #psychology #law #criminology #behavioral-health"]
2 - ['What are some ways to intertwine the two fields? Are there certain jobs that draw aspects from both? #psychology #law']
3 - ['My life is moving, and I cannot think of something that I would love to do everyday as a job unless it contains those two aspects in some way.  #psychology #law #lawyer #brain']
4 - ['For an adult-learner with no prior work experience in the legal field, but completing a Certificate in Paralegal Studies, what specific life skills might I use as transferable skills both on my 

In [65]:
author_id = df_answers[df_answers['answers_question_id'].isin(df_all_tag_questions.iloc[idx[::-1]]['tag_questions_question_id'])]['answers_author_id']
active_author_id = author_id[author_id.isin(active_professionals)]
df_recommended_pros = df_qap_time_taken[df_qap_time_taken['professionals_id'].isin(active_author_id)].sort_values('less_than_2_days', ascending=False)
print('The recommended professionals ranked by the proportion of questions answered within 48 hours:', df_recommended_pros['professionals_id'].tolist())
print('The profile of the professionals:')
df_professionals[df_professionals['professionals_id'].isin(df_recommended_pros['professionals_id'])]

The recommended professionals ranked by the proportion of questions answered within 48 hours: ['369f1c8646b649f6997eae7809696bd5', 'c5c2ca95fcd3463a8852b8bc9d636313', 'be5d23056fcb4f1287c823beec5291e1', 'e1d39b665987455fbcfbec3fc6df6056', '85f378f43eee44c986addf5fc27038ce', '0d8a769b6e2c447d97a6435f96814029', '874bb5c4fd4b4e498660c1f2c0a4ab3e']
The profile of the professionals:


| | professionals_id | professionals_location | professionals_industry | professionals_headline | professionals_date_joined |
|---|---|---|---|---|---|
| 1737 | 369f1c8646b649f6997eae7809696bd5 | Harlingen, Texas | Information Technology and Services, Franchise... | Career Guru | 2015-02-05 17:52:38 UTC+0000 |
| 2546 | c5c2ca95fcd3463a8852b8bc9d636313 | Tampa, Florida | Legal Services | Immigration Attorney | 2015-11-06 15:36:36 UTC+0000 |
| 3581 | be5d23056fcb4f1287c823beec5291e1 | San Antonio, Texas | Legal Services | Employment Counselor \| Open Records Specialist | 2016-01-21 03:23:22 UTC+0000 |
| 4397 | 85f378f43eee44c986addf5fc27038ce | Reno, Nevada | Law Practice | Attorney | 2016-03-07 18:41:34 UTC+0000 |
| 5876 | e1d39b665987455fbcfbec3fc6df6056 | Greater Philadelphia Area | Professional Training | Industrial-Organizational Psychology & HR Cons... | 2016-05-04 18:12:23 UTC+0000 |
| 10426 | 0d8a769b6e2c447d97a6435f96814029 | Moraga, California | Health psychology, Wellness and Fitness | Oncology Therapist, Wellness Professional | 2017-04-30 17:48:38 UTC+0000 |
| 26953 | 874bb5c4fd4b4e498660c1f2c0a4ab3e | San Antonio, Texas | Legal Services | Thinks outside the box. | 2019-01-06 13:46:54 UTC+0000 |


### Example Recommendation 2

In [69]:
qid = 512
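#Indices of the 5 highest similarities after dropping the last (top) position, which is normally the question itself
idx = np.argsort(cos_sim[qid,:])[-6:-1]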
print('Question Title and Body:\n')
#Sample question. Printing out the question body as it gives more insight into what the student actually wants to ask
print(list(df_questions[df_questions['questions_id']==df_all_tag_questions['tag_questions_question_id'].iloc[qid]]['questions_title']))
print(list(df_questions[df_questions['questions_id']==df_all_tag_questions['tag_questions_question_id'].iloc[qid]]['questions_body']))

Question Title and Body:

["My current plan is to go to a one year film college to get a certificate in screenwriting. Many people have mentioned that you really don't need a film degree to get into film, so a certificate is fine. Is this true?"]
['#film #film-production #director #screenwriting']


In [58]:
#Printing out the question body as it gives more insight into what the student actually wants to ask
print('Similar questions ranked by cosine similarity:\n')
for rank, index in enumerate(idx[::-1]):
    print(rank, '-', list(df_questions[df_questions['questions_id']==df_all_tag_questions.iloc[index]['tag_questions_question_id']]['questions_body']))

Similar questions ranked by cosine similarity:

0 - ['I Am A Junior In Highschool And We Had A Film Project We Worked On For A Bout A Month And I Really Enjoyed Working On It And Seem to Have A Bit Of Luck And Though Into How To Work The Camera Angles And Edit The Video. I Would Like To Learn More About Filming #film #movies-and-cinema']
1 - ['importance of which classes should be taken first before others. #film ']
2 - ['i am a senior in high school going on to college soon  to study film, and i am curious to know what are the best steps to take to ensure employment in the film industry after college. #film #in #film-production #industry']
3 - ['#film #film-production #director #screenwriting']
4 - ['because i would like to go to school out of state but i would like to make sure that im spending money on a good school that will help my future  #film']


In [59]:
author_id = df_answers[df_answers['answers_question_id'].isin(df_all_tag_questions.iloc[idx[::-1]]['tag_questions_question_id'])]['answers_author_id']
active_author_id = author_id[author_id.isin(active_professionals)]
df_recommended_pros = df_qap_time_taken[df_qap_time_taken['professionals_id'].isin(active_author_id)].sort_values('less_than_2_days', ascending=False)
print('The recommended professionals ranked by the proportion of questions answered within 48 hours:', df_recommended_pros['professionals_id'].tolist())
print('The profile of the professionals:')
df_professionals[df_professionals['professionals_id'].isin(df_recommended_pros['professionals_id'])]

The recommended professionals ranked by the proportion of questions answered within 48 hours: ['c3c345b8e5044054a0544296ac29cb88', 'a1006e6a58a0447592e2435caa230f78', '70855821c28f4b4c8fb5e627e081bb56', 'e27c43e8671242e1bfb80829744ee3ad', '9a5aead62c344207b2624dba90985dc5', '9421dd803d164e5da26436a01a92ce13', 'cc0ef2b535894a77a92cc740be6c2513']
The profile of the professionals:


| | professionals_id | professionals_location | professionals_industry | professionals_headline | professionals_date_joined |
|---|---|---|---|---|---|
| 1723 | a1006e6a58a0447592e2435caa230f78 | State of Goiás, State of Goiás, Brazil | Research | Educational Writer- New Heights Educational Group | 2015-01-26 20:00:16 UTC+0000 |
| 2338 | c3c345b8e5044054a0544296ac29cb88 | Santa Monica, California | Marketing | Accountant, Financial Analyst, Marketing Speci... | 2015-10-07 21:32:13 UTC+0000 |
| 3786 | cc0ef2b535894a77a92cc740be6c2513 | Austin, Texas | Motion Pictures and Film | Film Editor | 2016-02-08 18:18:52 UTC+0000 |
| 4227 | 9421dd803d164e5da26436a01a92ce13 | Ocean Shores, Washington | Motion Pictures and Film | Writer/Director | 2016-02-29 19:43:48 UTC+0000 |
| 5074 | 70855821c28f4b4c8fb5e627e081bb56 | Los Angeles, California | Motion Pictures and Film | Filmmaker/Screenwriter/Educator | 2016-03-30 17:06:05 UTC+0000 |
| 24748 | e27c43e8671242e1bfb80829744ee3ad | Chicago, Illinois | Entertainment | Theatre Professional and Consultant | 2018-10-30 18:11:23 UTC+0000 |
| 25379 | 9a5aead62c344207b2624dba90985dc5 | Newark, New Jersey | Education | Either fall or grow! | 2018-11-15 19:16:05 UTC+0000 |


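The recommendations in both cases do look relevant. One caveat: in the second example, the entry at rank 3 has the same body as the query question, so it appears to be the query itself (or an exact duplicate). The slice `[-6:-1]` drops only the last position of the sorted order, which is the query only when no other question ties with it at a similarity of 1.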
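
Since many questions share exactly the same set of main tags, ties at a similarity of 1 are common, and slicing `np.argsort(...)[-6:-1]` does not guarantee that the query question is the one dropped. A minimal sketch of a selection that excludes the query by index instead is given below; `top_k_similar` is a hypothetical helper, written in plain Python so that it works on a list or on a row of `cos_sim` alike.

```python
def top_k_similar(sim_row, qid, k=5):
    """Indices of the k most similar items to item `qid`, never including `qid` itself."""
    candidates = [i for i in range(len(sim_row)) if i != qid]
    # sorted() is stable, so tied candidates keep their original index order
    return sorted(candidates, key=lambda i: sim_row[i], reverse=True)[:k]

# Toy row for question 0: question 1 has identical tags, so it ties at 1.0
toy_row = [1.0, 1.0, 0.5, 0.8, 0.0]
print(top_k_similar(toy_row, qid=0, k=3))
```

With this, a question with identical tags still shows up as a neighbour, but the query never recommends itself.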

## Real-Time Implementation

- The approach above is not how recommendations would work in production: for a given question, only data from before it was posted is available, whereas here we have used all questions in the analysis
- Every time a new tag is added, find the main tag that it is closest to in terms of edit distance. More generally, there needs to be tighter control over tag creation, since fewer, cleaner tags make each tag more informative. Since it is a forum for asking career-based questions and not Twitter/Instagram, it is certainly possible to do this.
- The feature vector for every question can also be stored in the database
- The cosine similarity of all questions (say m) asked before a given point in time needs to be pre-computed and stored in the database, since it is a memory- and time-intensive computation
- Every time a new question is asked, it can be batched with other questions asked within (say) a 2-hour period, and the cosine similarity of these questions (say n) with the questions that have been asked in the past can be computed
- This way, we aren't computing the cosine similarity of (m+n) x (m+n) pairs of questions, but only of (n) x (m+n) pairs, and the update becomes easier and faster
- I know the technical details are a little fuzzy, but this is just a rough solution to the problem at hand
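
The batching idea in the last few bullets could look roughly like the sketch below. It is only an illustration under my own assumptions: questions are represented by their sets of (already replaced) tags rather than by one-hot rows, every question has at least one tag or scores 0, and `tag_cosine`, `similarity_block` and the toy ids are hypothetical names, not part of the notebook or the platform.

```python
from math import sqrt

def tag_cosine(a, b):
    """Cosine similarity of two tag sets, i.e. of their 0/1 indicator vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def similarity_block(new_questions, stored_questions):
    """Similarities of the n new questions against all m + n questions.

    Both arguments map question id -> set of tags. Only n x (m + n) pairs are
    evaluated; the stored m x m similarities are left untouched.
    """
    everything = {**stored_questions, **new_questions}
    return {
        new_id: {other_id: tag_cosine(tags, other_tags)
                 for other_id, other_tags in everything.items() if other_id != new_id}
        for new_id, tags in new_questions.items()
    }

# Hypothetical state: m = 2 stored questions, n = 2 questions in the current batch
toy_stored = {'q1': {'law', 'psychology'}, 'q2': {'film'}}
toy_batch = {'q3': {'law'}, 'q4': {'film', 'law'}}
toy_block = similarity_block(toy_batch, toy_stored)
print(toy_block['q3'])
```

The resulting rows can then be appended to the stored similarities, after which the batch becomes part of the "past" questions for the next update.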