# 1. Introduction & EDA

In this module we recommend professionals to answer a given question. We use the 'questions.csv', 'answers.csv', 'professionals.csv', 'answer_scores.csv', 'question_scores.csv', 'tag_users.csv', 'tag_questions.csv' and 'tags.csv' files from "Data Science for Good: CareerVillage.org" (Match career advice questions with professionals in the field).

During the Exploratory Data Analysis (EDA) of these files, we found that some questions in 'questions.csv' contain HTML tags in the text body, and hashtags are attached as well. Some questions are placed in the questions_title column rather than in questions_body, and vice versa. Some questions also contain URL links in their text.
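The checks described above can be sketched with plain regular expressions. This is a minimal illustration on made-up sample strings, not the notebook's own EDA code; the three patterns and the helper `count_matches` are assumptions chosen for the example.

```python
import re

# Toy question bodies standing in for rows of questions.csv (hypothetical samples)
samples = [
    "<p>I want to be a nurse. #nursing #college</p>",
    "See http://www.example.com/careers for details",
    "How do I prepare for law school?",
]

# Illustrative patterns: an HTML tag, a URL, and a hashtag
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"http\S+")
HASHTAG = re.compile(r"#\w+")

def count_matches(pattern, texts):
    """Return how many texts contain at least one match of pattern."""
    return sum(1 for text in texts if pattern.search(text))

counts = {
    "html": count_matches(HTML_TAG, samples),
    "url": count_matches(URL, samples),
    "hashtag": count_matches(HASHTAG, samples),
}
print(counts)
```

On the real data, the same patterns could be passed to `questi['questions_body'].str.contains(...)` to count affected rows.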




In [1]:
import pandas as pd
import operator
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt



#Load files and datasets
answer=pd.read_csv('../input/answers.csv')
professional=pd.read_csv('../input/professionals.csv')
tags=pd.read_csv('../input/tags.csv')
question_score=pd.read_csv('../input/question_scores.csv')
answer_score=pd.read_csv('../input/answer_scores.csv')
questi=pd.read_csv('../input/questions.csv')
tag_question=pd.read_csv('../input/tag_questions.csv')
tag_user=pd.read_csv('../input/tag_users.csv')

In [2]:
questi[questi['questions_body']=='<p>I am a sophomore in Boston and I am not sure what I want to do yet. I really think I want to be a pediatric nurse but I am not sure if I will want to go further into the health and medicine field if I really like it. </p>']

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body
9441,3a782425ac6544c9ba72f06d75220e97,dfc8e2f9dff14a46b103bf87942eb6aa,2016-04-08 15:13:02 UTC+0000,In college can I go through a nursing program ...,<p>I am a sophomore in Boston and I am not sur...


In [3]:
questi[questi['questions_body'].str.contains('http://www.typeoflawyer.com/different-types-of-law-careers/')]

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body
668,dc085775efb64764868cc41dd68188f6,2d86f61bdad747c9911ceb2248a030f3,2012-10-25 09:19:06 UTC+0000,How to Prepare for Law School?,There is no specific pre-law school course. It...


In [4]:
questi.isnull().sum() 

questions_id            0
questions_author_id     0
questions_date_added    0
questions_title         0
questions_body          0
dtype: int64

# 2. Pre-Processing Data

Before clustering and labeling, we need to merge several tables into the questions table.


In [5]:
#Rename the name of column in 'tag_users.csv' table to be merged with 'tags.csv'
tag_user=tag_user.rename(index=str, columns={'tag_users_tag_id':'tags_tag_id'})
tag_professional=tag_user.merge(tags, how='left', on='tags_tag_id')

#Merging the tag_professional for question with 'professionals.csv'
tag_professional=tag_professional.rename(index=str, columns={'tag_users_user_id':'professionals_id'})
final_professional=professional.merge(tag_professional, how='left', on='professionals_id')

#Rename the name of column in 'Professionals.csv' table to be merged with 'answers.csv'
final_professional=final_professional.rename(index=str, columns={'professionals_id':'answers_author_id'})
answer_prof_merged=answer.merge(final_professional, how='left', on='answers_author_id')

#Rename the name of column in 'answer_score.csv' table to be merged with 'answer_prof_merged'
answer_score1=answer_score.rename(index=str, columns={'id':'answers_id'})
answer_prof_merged=answer_prof_merged.merge(answer_score1, how='left', on='answers_id')

#Rename the name of column in 'question_scores.csv' table to be merged with 'questions.csv'
question_score1=question_score.rename(index=str, columns={'id':'questions_id'})
questi=questi.merge(question_score1, how='left', on='questions_id')

#Rename the name of column in 'tag_questions.csv' table to be merged with 'tags.csv'
tag_question=tag_question.rename(index=str, columns={'tag_questions_question_id':'questions_id'})
tag1=tags.rename(index=str, columns={'tags_tag_id':'tag_questions_tag_id'})

final_tag=tag_question.merge(tag1, how='left', on='tag_questions_tag_id')


#Merging the tags for question with 'questions.csv'
questi1=questi.merge(final_tag, how='left', on='questions_id')
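The rename-then-merge pattern used throughout this cell can be seen in isolation on a tiny example. The two frames below are hypothetical miniatures of 'tag_users.csv' and 'tags.csv'; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of tag_users.csv and tags.csv
tag_user_toy = pd.DataFrame({
    "tag_users_tag_id": [1, 2, 3],
    "tag_users_user_id": ["u1", "u1", "u2"],
})
tags_toy = pd.DataFrame({
    "tags_tag_id": [1, 2],
    "tags_tag_name": ["college", "nursing"],
})

# Align the key name, then left-merge so every row of the left table survives
renamed = tag_user_toy.rename(columns={"tag_users_tag_id": "tags_tag_id"})
merged = renamed.merge(tags_toy, how="left", on="tags_tag_id")
print(merged)
```

Tag id 3 has no match in the toy tags table, so its tag name comes back as NaN. This is the same mechanism that produces the missing values reported later for the merged dataframe.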

After that, we clean every question text of HTML tags and hashtags using the "cleaned_list" function. We also remove all URL links. The main purpose of this step is to provide good-quality text data before tokenization and TF-IDF vectorization.


The "merged_body_and_title" function joins the question title and the question body into a single "merged_question" column, so that a question body misplaced in the title column (or vice versa) is not lost.

In [6]:
import nltk

#Clean texts of HTML tags, URLs, digits, punctuation and hashtags
def cleaned_list(file,name):
    text_=list(str(x) for x in file[name])
    question_new_list=[]
    for i in text_:
        # Name the parser explicitly; bs4 warns when none is given
        cleaned_str=BeautifulSoup(i, "lxml")
        cleaned_text=cleaned_str.get_text()
        result= re.sub(r"http\S+", "", cleaned_text)
        # Collapse every run of digits and non-word characters into one space
        result0=re.sub("(\\d|\\W)+"," ",result)
        result1=result0.replace('#','')
        question_new_list.append(result1)
    return question_new_list

#Some questions are posted in the title rather than the body, so we merge both into one text
def merged_body_and_title(file):
    bodies= cleaned_list(file,'questions_body')
    titles= cleaned_list(file,'questions_title')
    merged=[]
    for title, body in zip(titles, bodies):
        # Tokenize and re-join to normalize whitespace in each part
        title_text=' '.join(nltk.word_tokenize(title))
        body_text=' '.join(nltk.word_tokenize(body))
        # Local variables only, so an empty title or body cannot reuse the previous row's text
        merged.append(title_text+'.'+body_text)
    return merged

questi1['merged_question']=merged_body_and_title(questi1)
answer_prof_merged['merged_question']=cleaned_list(answer_prof_merged,'answers_body')





# 3. Data Processing

For clustering, we use KMeans to get the top words per cluster. Each question in the 'questi1' table is then labeled with its cluster. Before that, we need to find the optimal number of clusters to pass to KMeans, so we apply the "Elbow Method" to find the optimal k.

For the experiment, we try cluster counts in the range 80 to 200.

In [7]:
non_nan_list=list(str(x) for x in questi1['merged_question'].drop_duplicates())
answer_list=list(str(x) for x in answer_prof_merged['answers_body'])

tfidf=TfidfVectorizer(stop_words='english')
X_idf = tfidf.fit_transform(non_nan_list)

In [8]:
# Sum_of_squared_distances = []
# K = range(80,200)
# for k in K:
#     print(k)
#     model = KMeans(n_clusters=k)
#     km=model.fit(X_idf)
#     Sum_of_squared_distances.append(km.inertia_)

In [9]:
# plt.figure(figsize=(80,80))
# plt.plot(K, Sum_of_squared_distances, 'bx-')
# plt.xlabel('k')
# plt.ylabel('Sum_of_squared_distances')
# plt.title('Elbow Method For Optimal k')
# plt.show()

The result shows that the 'elbow' appears between 100 and 150 clusters, so we take 138 as the optimal k for KMeans.
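Because the elbow cells above are commented out, here is a small runnable sketch of the same idea on synthetic data. It is an illustration only: the three Gaussian blobs, the range of k and the `random_state` values are assumptions and have nothing to do with the question corpus.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated blobs (illustrative, not the TF-IDF matrix)
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit KMeans for each candidate k and record the inertia
# (sum of squared distances of samples to their closest centroid)
ks = list(range(1, 7))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

The inertia drops sharply until k reaches the true number of blobs and flattens afterwards; that bend is the "elbow". On the real TF-IDF matrix the bend is much less pronounced, which is why the choice within the 100 to 150 range is a judgment call.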

# 4. KMeans Process and TFIDF

Exploring the top terms for each cluster using this method:

In [10]:

true_k =138
model = KMeans(n_clusters=true_k, init='k-means++',max_iter=300, n_init=10)
model.fit(X_idf)
    
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
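# get_feature_names() was removed in scikit-learn 1.2; older releases need the old name
terms = tfidf.get_feature_names_out()

# print(order_centroids)
for i in range(true_k):
    print("Cluster %d:" % i)
    # The ten highest-weighted terms of this cluster's centroid
    for t in order_centroids[i, :10]:
        print(' %s' % terms[t])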

Top terms per cluster:
Cluster 0:
 bio
 engineering
 medical
 engineer
 biomedical
 biology
 science
 medicine
 interested
 want
Cluster 1:
 want
 know
 college
 career
 like
 school
 help
 people
 need
 best
Cluster 2:
 volunteer
 volunteering
 work
 experience
 hours
 job
 career
 resume
 college
 want
Cluster 3:
 experience
 job
 college
 work
 field
 gain
 jobs
 career
 want
 need
Cluster 4:
 computers
 computer
 technology
 software
 career
 science
 engineering
 love
 interested
 want
Cluster 5:
 masters
 degree
 bachelors
 phd
 college
 master
 just
 education
 graduate
 school
Cluster 6:
 animation
 animator
 art
 animators
 artist
 computer
 cartoons
 design
 job
 character
Cluster 7:
 make
 profit
 non
 sure
 want
 friends
 college
 organization
 best
 nonprofits
Cluster 8:
 art
 artist
 fine
 arts
 college
 career
 drawing
 school
 want
 design
Cluster 9:
 summer
 job
 college
 internship
 school
 jobs
 internships
 programs
 high
 time
Cluster 10:
 jobs
 job
 college
 caree

 major
 teaching
 teacher
 language
 improve
 writing
 education
 teach
Cluster 130:
 dance
 dancer
 dancing
 arts
 hip
 ballet
 hop
 performing
 want
 college
Cluster 131:
 gpa
 college
 school
 high
 year
 good
 grades
 job
 colleges
 important
Cluster 132:
 campus
 live
 college
 living
 housing
 better
 dorm
 life
 money
 apartment
Cluster 133:
 airline
 flight
 aviation
 careervillage
 industry
 attendant
 behalf
 posted
 administrator
 students
Cluster 134:
 classes
 college
 high
 school
 taking
 want
 need
 major
 know
 year
Cluster 135:
 writing
 writer
 creative
 author
 write
 publishing
 editing
 english
 fiction
 journalism
Cluster 136:
 question
 posting
 response
 searched
 staff
 member
 youth
 popular
 moment
 share
Cluster 137:
 sociology
 psychology
 anthropology
 major
 jobs
 social
 career
 degree
 interested
 like


In [11]:

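# Same cell as above, run again: without a fixed random_state, KMeans starts from
# different centroids, so the clusters and their numbering differ between runs
true_k =138
model = KMeans(n_clusters=true_k, init='k-means++',max_iter=300, n_init=10)
model.fit(X_idf)
    
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; older releases need the old name
terms = tfidf.get_feature_names_out()

# print(order_centroids)
for i in range(true_k):
    print("Cluster %d:" % i)
    # The ten highest-weighted terms of this cluster's centroid
    for t in order_centroids[i, :10]:
        print(' %s' % terms[t])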

Top terms per cluster:
Cluster 0:
 event
 manager
 bank
 planner
 banking
 planning
 management
 business
 th
 completed
Cluster 1:
 life
 want
 know
 career
 future
 like
 college
 people
 don
 help
Cluster 2:
 major
 college
 want
 undecided
 choosing
 know
 career
 right
 double
 choose
Cluster 3:
 business
 management
 entrepreneurship
 start
 college
 major
 want
 degree
 know
 entrepreneur
Cluster 4:
 college
 advice
 know
 want
 admissions
 student
 going
 school
 best
 good
Cluster 5:
 na
 wan
 know
 college
 gon
 work
 really
 make
 model
 pediatrician
Cluster 6:
 electrical
 engineering
 engineer
 mechanical
 electrician
 college
 career
 major
 like
 best
Cluster 7:
 mechanical
 engineering
 engineer
 know
 engineers
 want
 like
 college
 job
 career
Cluster 8:
 counselor
 counseling
 guidance
 psychology
 marriage
 career
 school
 counselors
 genetic
 family
Cluster 9:
 want
 college
 career
 know
 like
 best
 need
 school
 interested
 help
Cluster 10:
 chemistry
 biology
 

 paths
 careers
Cluster 123:
 psychologist
 psychology
 clinical
 psychiatrist
 psychiatry
 child
 want
 people
 know
 does
Cluster 124:
 internships
 internship
 college
 start
 apply
 experience
 year
 intern
 summer
 school
Cluster 125:
 criminal
 justice
 law
 criminology
 police
 want
 major
 career
 psychology
 enforcement
Cluster 126:
 summer
 internship
 job
 college
 internships
 jobs
 school
 programs
 high
 time
Cluster 127:
 transfer
 college
 year
 university
 transferring
 community
 school
 admissions
 student
 credits
Cluster 128:
 anthropology
 anthropologist
 cultural
 sociology
 major
 career
 degree
 jobs
 college
 like
Cluster 129:
 biology
 wildlife
 major
 biologist
 conservation
 college
 zoology
 science
 degree
 molecular
Cluster 130:
 graduate
 school
 college
 grad
 debt
 undergraduate
 degree
 programs
 career
 going
Cluster 131:
 cosmetology
 cosmetologist
 hair
 makeup
 nails
 cosmetics
 beauty
 school
 know
 long
Cluster 132:
 med
 pre
 school
 medicine


We also check the similarity between the input question text and the questions in "questi1", using the "similarity" function, to support the accuracy of the KMeans clustering.

In [12]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
 
def similarity(X_list, y_list):
    X = tfidf.transform(X_list)
    y=tfidf.transform(y_list)
    
    d=cosine_similarity(X, y)
    return d
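To make the behaviour of this function concrete, here is a self-contained sketch of TF-IDF cosine similarity on a three-sentence toy corpus. The corpus, the query and the separate `vectorizer` are made up for illustration; the notebook itself reuses the `tfidf` object fitted on the real questions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini corpus standing in for the merged question texts
corpus = [
    "how to get a scholarship for college",
    "what does a mechanical engineer do",
    "best way to apply for a scholarship",
]

vectorizer = TfidfVectorizer(stop_words="english")
corpus_vec = vectorizer.fit_transform(corpus)

# Project the query into the same vocabulary, then compare it with every document
query_vec = vectorizer.transform(["how to get scholarship"])
scores = cosine_similarity(query_vec, corpus_vec)  # one row per query, one column per document
best = int(scores.argmax())
print(scores, best)
```

The second sentence shares no term with the query, so its score is exactly zero. The first sentence scores highest because "scholarship" makes up a larger share of its shorter vector.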
 

After the KMeans clustering, we can now label every question in the 'questi1' dataframe.

In [13]:
def labeling(name_list):
    clustering=[]
    for i in name_list:
        Y = tfidf.transform([i])
        prediction = model.predict(Y)
        for e in prediction:
            clustering.append(e)
    return clustering


In [14]:
full_list_question=list(str(x) for x in questi1['merged_question'])
questi1['label']=labeling(full_list_question)
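The labeling loop transforms and predicts one question at a time. As a possible speed-up, the whole list can be transformed and predicted in a single call; the sketch below checks on toy data that both routes give the same labels. The documents, `vec` and `km` are stand-ins for illustration, not the notebook's fitted objects.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy questions
docs = [
    "mechanical engineering career",
    "electrical engineering degree",
    "engineering internship advice",
    "nursing school requirements",
    "pediatric nursing career",
    "nursing degree programs",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Route 1: one transform + predict per document (what the loop above does)
one_by_one = [int(km.predict(vec.transform([d]))[0]) for d in docs]

# Route 2: a single batched transform + predict
batched = km.predict(vec.transform(docs)).tolist()

print(one_by_one == batched)
```

With the real objects, the batched form would read `model.predict(tfidf.transform(full_list_question))`.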

Now we can merge the answer dataframe with the labeled questi1 dataframe into "answer_and_question".

In [15]:
answer2=answer_prof_merged.rename(index=str, columns={'answers_question_id':'questions_id'})

answer_and_question=answer2.merge(questi1,how='left', on='questions_id')

answer_and_question.head()

Unnamed: 0,answers_id,answers_author_id,questions_id,answers_date_added,answers_body,professionals_location,professionals_industry,professionals_headline,professionals_date_joined,tags_tag_id,tags_tag_name_x,score_x,merged_question_x,questions_author_id,questions_date_added,questions_title,questions_body,score_y,tag_questions_tag_id,tags_tag_name_y,merged_question_y,label
0,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,"Cleveland, Ohio",Mental Health Care,Assist with Recognizing and Developing Potential,2015-10-19 20:56:49 UTC+0000,129.0,career,0.0,Hi You are asking a very interesting question ...,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,14147.0,lecture,Teacher career question.What is a maths teache...,137
1,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,"Cleveland, Ohio",Mental Health Care,Assist with Recognizing and Developing Potential,2015-10-19 20:56:49 UTC+0000,129.0,career,0.0,Hi You are asking a very interesting question ...,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,27490.0,college,Teacher career question.What is a maths teache...,137
2,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,"Cleveland, Ohio",Mental Health Care,Assist with Recognizing and Developing Potential,2015-10-19 20:56:49 UTC+0000,129.0,career,0.0,Hi You are asking a very interesting question ...,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,21438.0,professor,Teacher career question.What is a maths teache...,137
3,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,"Cleveland, Ohio",Mental Health Care,Assist with Recognizing and Developing Potential,2015-10-19 20:56:49 UTC+0000,187.0,jobs,0.0,Hi You are asking a very interesting question ...,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,14147.0,lecture,Teacher career question.What is a maths teache...,137
4,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,"Cleveland, Ohio",Mental Health Care,Assist with Recognizing and Developing Potential,2015-10-19 20:56:49 UTC+0000,187.0,jobs,0.0,Hi You are asking a very interesting question ...,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,27490.0,college,Teacher career question.What is a maths teache...,137


In [16]:
answer_and_question.isnull().sum()

answers_id                       0
answers_author_id                0
questions_id                     0
answers_date_added               0
answers_body                    24
professionals_location       63703
professionals_industry       73549
professionals_headline       35269
professionals_date_joined     3037
tags_tag_id                  10693
tags_tag_name_x              10693
score_x                        400
merged_question_x                0
questions_author_id              0
questions_date_added             0
questions_title                  0
questions_body                   0
score_y                         94
tag_questions_tag_id         15130
tags_tag_name_y              15130
merged_question_y                0
label                            0
dtype: int64

# 5. Finding How Fast The Response Comes From Professional

To get this data, we need the time interval between the date a question was first posted and the date it was answered.

In [17]:
from datetime import datetime
interval=[]
for i in range (0, len(answer_and_question)):
    answer_date=datetime.strptime(str(answer_and_question['answers_date_added'][i]),'%Y-%m-%d %H:%M:%S UTC+0000')
    sent_date=datetime.strptime(str(answer_and_question['questions_date_added'][i]),'%Y-%m-%d %H:%M:%S UTC+0000')
    diff= (answer_date-sent_date).days
    interval.append(diff)
answer_and_question['interval']=interval
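The row-by-row loop above can be slow on a large merged table. A vectorized alternative, sketched here on a two-row toy frame, parses both columns with `pd.to_datetime` and subtracts them. The first row reuses timestamps shown earlier in this notebook; the second row is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "questions_date_added": ["2016-04-26 11:14:26 UTC+0000", "2012-10-25 09:19:06 UTC+0000"],
    "answers_date_added": ["2016-04-29 19:40:14 UTC+0000", "2012-10-25 21:00:00 UTC+0000"],
})

def parse_dates(column):
    """Drop the constant ' UTC+0000' suffix and parse the rest as a timestamp."""
    stripped = column.str.replace(" UTC+0000", "", regex=False)
    return pd.to_datetime(stripped, format="%Y-%m-%d %H:%M:%S")

asked = parse_dates(toy["questions_date_added"])
answered = parse_dates(toy["answers_date_added"])

# Whole days between question and answer, as in the loop above
toy["interval"] = (answered - asked).dt.days
print(toy["interval"].tolist())
```

An answer posted on the same calendar day gives an interval of 0, so this measure cannot separate fast responders at finer than one-day resolution.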

Once we have the response interval in days, we can find the mean response time for each professional using the groupby() and agg() functions. A smaller mean indicates a professional who answers questions faster.

In [18]:
answer_recommend=answer_and_question.groupby('answers_author_id').agg({'interval':'mean'})
answer_final_end=answer_and_question.merge(answer_recommend, how='left', on='answers_author_id')
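A toy version of this group-and-merge step is shown below. It is a sketch with invented ids and intervals; the column name `interval_mean` is an assumption introduced here to avoid the automatic `interval_x` / `interval_y` suffixes that the merge above produces.

```python
import pandas as pd

# Hypothetical answers: author "a" answered twice, author "b" once
toy = pd.DataFrame({
    "answers_author_id": ["a", "a", "b"],
    "interval": [2, 4, 10],
})

# Mean response time per professional, under an explicit column name
mean_interval = (
    toy.groupby("answers_author_id")["interval"]
    .mean()
    .rename("interval_mean")
    .reset_index()
)

# Attach each professional's mean to every one of their answer rows
with_mean = toy.merge(mean_interval, how="left", on="answers_author_id")
print(with_mean)
```

In the notebook's version, both frames carry a column called `interval`, so after the merge the per-row value ends up in `interval_x` and the per-professional mean in `interval_y`.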

In [19]:
answer_recommend.head()

Unnamed: 0_level_0,interval
answers_author_id,Unnamed: 1_level_1
00009a0f9bda43eba47104e9ac62aff5,236.666667
000d4635e5da41e3bfd83677ee11dda4,199.5
00271cc10e0245fba4a35e76e669c281,213.609195
003cc21be89d4e42bc4424131a378e86,406.0
0046ab8089c04b3a8df3f8c28621a818,149.230769


# 6. Additional: How to get recommended professionals who haven't yet answered any question?
Sometimes a professional in 'professionals.csv' has not yet answered any question. We still want to send them relevant questions, to increase the chance that a question gets answered. To do this, we collect the tag names related to the 'label' column and, based on those tags, search for other professionals in 'professionals.csv'.

In [20]:
def not_answerred_prof(label):
    # Rows whose question falls in the requested cluster label(s)
    matched=answer_and_question.loc[answer_and_question['label'].isin(label)]
    # Distinct tag names attached to those rows; dropna() skips missing tags
    list_tags=matched['tags_tag_name_x'].dropna().drop_duplicates().tolist()
    # Professionals following any of those tags, one row per professional
    final=final_professional.loc[final_professional['tags_tag_name'].isin(list_tags)]
    final1=final[['answers_author_id','tags_tag_name']].drop_duplicates(subset='answers_author_id')
    return final1
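Note that the function above selects professionals by tag only; it does not check whether they have already answered something. If that restriction is wanted, one possible extension is to exclude every id that appears among the answer authors. The sketch below shows the idea on toy frames; all ids, tags and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical professionals with the tags they follow (one row per professional/tag pair)
professionals_toy = pd.DataFrame({
    "professionals_id": ["p1", "p2", "p3", "p1"],
    "tags_tag_name": ["scholarship", "scholarship", "aviation", "college"],
})
# Hypothetical answers table: only p1 has answered anything
answers_toy = pd.DataFrame({"answers_author_id": ["p1"]})

wanted_tags = ["scholarship"]

has_tag = professionals_toy["tags_tag_name"].isin(wanted_tags)
has_answered = professionals_toy["professionals_id"].isin(answers_toy["answers_author_id"])

# Keep professionals who follow a wanted tag and have no answers yet
candidates = (
    professionals_toy.loc[has_tag & ~has_answered]
    .drop_duplicates(subset="professionals_id")
)
print(candidates)
```

Here only `p2` remains: `p1` follows the tag but has already answered, and `p3` follows a different tag.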

# 7. Presenting The Recommended Professional
We use the question's label to filter the relevant professionals and sort them by 'similarity_score' to produce the final result. We present the top 10 recommended professionals to answer the question.

In [21]:


def find_the_professional(text_list):
    # Cluster label(s) predicted for the input question(s)
    label_list=labeling(text_list)
    # .copy() so that adding a column does not trigger pandas' SettingWithCopyWarning
    df=answer_final_end.loc[answer_final_end['label'].isin(label_list)].copy()
    # The original loop returned on its first pass, so only the first question is scored
    sim_score=similarity(text_list[:1],list(str(x) for x in df['merged_question_y']))
    df['similarity_score']=sim_score[0]
    df1=df.sort_values(by=['similarity_score'], ascending=False)
    # Keep each professional's best-matching row
    professional_result=df1[['answers_author_id','interval_y', 'similarity_score', 'score_x']].drop_duplicates(subset='answers_author_id')
    professional_result=professional_result.rename(columns={'interval_y':'interval mean', 'score_x':'heart'})
    additional_prof=not_answerred_prof(label_list)
    print('Label number:',label_list)
    # '\033[1m' switches bold on, '\033[0m' resets it
    print('\033[1m','Top 10 List of Professionals For This Question:',text_list[0],'\033[0m')
    return professional_result[0:10], additional_prof[0:10]




In [22]:
def final_recommendation(text_list):
    a,b=find_the_professional(text_list)
    display(a)
    # '\033[1m' switches bold on, '\033[0m' resets it
    print('\033[1m','--------------------------------------------------------------------------------------------------------------','\033[0m')
    print('\033[1m','Top 10 Additional Professionals:','\033[0m')
    display(b)

In [23]:
final_recommendation(['how to get scholarship ?'])


Label number: [9]
Top 10 List of Professionals For This Question: how to get scholarship ?


Unnamed: 0,answers_author_id,interval mean,similarity_score,heart
618709,e55402249ea24e23997d6674eaf283d4,193.142857,0.691588,0.0
188577,1f9f56d1f288449b96276a2583b06893,48.0,0.60546,0.0
139056,926c156d54a44a1780332ba20efc9f8f,129.486056,0.522282,0.0
139048,81c4329d8d1c4f499575fcfa99e8cecf,126.609756,0.522282,1.0
1259331,321e45d8cbf74a0d96c43e48448b6c2b,62.030303,0.503661,1.0
1399917,58fa5e95fe9e480a9349bbb1d7faaddb,18.141612,0.498818,0.0
1399904,53939a491acb42a4815463845a638186,169.018307,0.498818,0.0
940087,4dc7d9040592416f98c0bef1ad2c31f5,75.726073,0.491522,1.0
770247,a3c9ab19f7724ca1bca2cd1c1774f11b,63.666667,0.417469,0.0
770245,b62d97756bed415e9727f0db156d14dd,2.0,0.417469,0.0


--------------------------------------------------------------------------------------------------------------
Top 10 Additional Professionals:


Unnamed: 0,answers_author_id,tags_tag_name
2,0c673e046d824ec0ad0ebe012a0673e4,consulting
21,102fb92c28034ad988b593d0111cb4bb,design
27,5a4a16842ec64430ac3f916aacf35fe1,architecture
34,81999d5ad93549dab55636a545e84f2a,aviation
40,7d425e8d7cfb4fe7b0702fff4d6d84e7,politics
45,7daf1e6dfb3443b99b240890f0a4d69b,engineer
59,b7dc946585734ab8acfbeeeb0d76af20,entrepreneurship
65,4863a65cd35b42a1bb89f3ecfc8fa2fe,environment
72,68ecc66323b8418092fdd724aaf5be94,hospitality
75,2b17cd431763494fa6096617645ba173,politics
