In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
root_path = '../input/'
print('The csv files provided are:\n')
print(os.listdir(root_path))

In [None]:
df_emails = pd.read_csv(root_path + 'emails.csv')
df_questions = pd.read_csv(root_path + 'questions.csv')
df_professionals = pd.read_csv(root_path + 'professionals.csv')
df_comments = pd.read_csv(root_path + 'comments.csv')
df_tag_users = pd.read_csv(root_path + 'tag_users.csv')
df_group_memberships = pd.read_csv(root_path + 'group_memberships.csv')
df_tags = pd.read_csv(root_path + 'tags.csv')
df_answer_scores = pd.read_csv(root_path + 'answer_scores.csv')
df_students = pd.read_csv(root_path + 'students.csv')
df_groups = pd.read_csv(root_path + 'groups.csv')
df_tag_questions = pd.read_csv(root_path + 'tag_questions.csv')
df_question_scores = pd.read_csv(root_path + 'question_scores.csv')
df_matches = pd.read_csv(root_path + 'matches.csv')
df_answers = pd.read_csv(root_path + 'answers.csv')
df_school_memberships = pd.read_csv(root_path + 'school_memberships.csv')

This notebook is part of a three-notebook submission (with [Part II](https://www.kaggle.com/akshayt19nayak/part-ii-tag-recsys-cosine-levenshtein-dist) and [Part III](https://www.kaggle.com/akshayt19nayak/part-iii-nlp-word-mover-distance)) for the Data Science for Good: CareerVillage.org challenge. Although many people have already done extensive EDA on the datasets provided, this notebook was created to plan the workflow, gather additional insights (if any) and understand the scope of the problem. As I build the recommender system, I will commit any new insights here. The second notebook focuses on pre-processing, model building and NLP, which are not covered in much detail here. I have, however, inserted links to some of the interesting subproblems that need to be solved in order to build a reasonable model.

## EDA

The basic structure of every component under EDA:

- head of the relevant tables
- basic information and initial thoughts
- questions and exploratory data analysis
- insights
- strategy

### Index

1. [Questions and Answers](#Questions-and-Answers)
2. [Tags](#Tags)
3. [Professionals](#Professionals)
4. [Emails and Matches](#Emails-and-Matches)
5. [Students, Groups and School Memberships](#Students,-Groups-and-School-Memberships)
6. [Comments, Question and Answer Scores](#Comments,-Question-and-Answer-Scores)

## Questions and Answers

In [None]:
print('questions')
display(df_questions.head(2))
print('answers')
display(df_answers.head(2))

**Basic information about the tables and initial thoughts**

•	questions.csv (Contains Text): Questions get posted by students. Sometimes they're very advanced. Sometimes they're just getting started. It's all fair game, as long as it's relevant to the student's future professional success. Has no null values. Important to understand the topic of the question so that we can club similar questions and send them to the volunteer who is the most likely to answer them.

•	answers.csv (Contains Text): Contains answers. Has 1 null value in the answers_body field. Important as we believe every author will mostly answer questions that he has expertise in. This combined with the professional's industry and headline can be a powerful way of understanding a professional's expertise. 

**What is the distribution of time with which questions get answered and is there any relation between the response time and the length of the question body?**

In [None]:
df_questions['questions_date_added'] = pd.to_datetime(df_questions['questions_date_added'])
df_answers['answers_date_added'] = pd.to_datetime(df_answers['answers_date_added'])
df_qa = pd.merge(df_questions, df_answers, left_on='questions_id', right_on='answers_question_id', how='left')
df_qa_grouped = df_qa.groupby('questions_id').agg({'questions_date_added':min, 'answers_date_added':min,
                                                   'questions_body':min})
df_qa_grouped['days_taken'] = (df_qa_grouped['answers_date_added'] - df_qa_grouped['questions_date_added']).dt.days
df_qa_grouped['questions_body_length'] = df_qa_grouped['questions_body'].apply(len)

In [None]:
print('Numerical summary of days taken to answer a question')
display(df_qa_grouped['days_taken'].describe())
plt.figure(figsize=(10,6))
plt.title('Distribution of days taken to answer a question')
plt.hist(df_qa_grouped['days_taken'], color='blue', edgecolor='black', bins=100)
plt.xlabel('min days taken to answer a question')
plt.ylabel('count')

In [None]:
print('Correlation between length of the body of questions and response time')
display(df_qa_grouped[['questions_body_length', 'days_taken']].corr())
plt.figure(figsize=(10,6))
plt.scatter(df_qa_grouped['questions_body_length'], df_qa_grouped['days_taken'])
plt.xlabel('questions_body_length')
plt.ylabel('days_taken')

**How many questions are unanswered?**

In [None]:
print('The number of questions that are unanswered is:', df_qa['answers_id'].isnull().sum())

**How does the distribution of the count of questions look with respect to time?**

In [None]:
print('Numerical summary of count of questions with respect to time')
display(df_questions['questions_date_added'].dt.year.describe())
plt.figure(figsize=(10,6))
plt.title('Count of questions with respect to time')
sns.countplot(x=df_questions['questions_date_added'].dt.year, color='violet')

**Insights**

- There are a few minor problems in the date variable in both questions.csv as well as answers.csv. The minimum number of days taken to answer a question is coming out to be -1.
- Most questions typically get answered within 2 days. Furthermore, fewer than 4% of the questions in the dataset are unanswered.
- There doesn't seem to be any correlation between questions_body_length and days_taken
- The count of questions plot shows a dip in 2017. The majority of the questions were asked in the years 2016 and 2018
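
One possible reading of the -1 (an assumption, not something verified against the raw files) is that a few answers carry a timestamp slightly earlier than their question. Because `.dt.days` floors a negative timedelta, even a gap of a few hours shows up as -1. The toy frame below uses made-up timestamps to show the effect and how such rows could be isolated:

```python
import pandas as pd

# Toy timestamps (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'questions_date_added': pd.to_datetime(['2018-01-02 10:00:00', '2018-01-02 10:00:00']),
    'answers_date_added': pd.to_datetime(['2018-01-02 07:00:00', '2018-01-04 12:00:00']),
})

delta = toy['answers_date_added'] - toy['questions_date_added']
toy['days_taken'] = delta.dt.days                       # floors, so -3 hours becomes -1 day
toy['hours_taken'] = delta.dt.total_seconds() / 3600    # keeps the sign and the magnitude

# Rows where the answer is recorded before the question
negative_rows = toy[toy['days_taken'] < 0]
print(toy[['days_taken', 'hours_taken']])
```

Looking at `hours_taken` for the negative rows would show whether these are small clock offsets or genuinely inconsistent records.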

**Strategy**

- In order to make recommendations and evaluate question similarity, either on the basis of metadata (such as tags) or wording, we need past data. A recommendation cannot be made unless we have data to begin with (the cold start problem). Thus, we can use all questions asked prior to 2018 as the data source for computing question similarity, and keep updating it as we process each new question.
- Evaluating the similarity between two pieces of text is an active research problem. To make a recommendation for a new question, we can look at similar questions that were asked in the past and direct the new question to the professionals who answered them. A particularly useful resource could be [this article](https://medium.com/@adriensieg/text-similarities-da019229c894), which talks about estimating the degree of similarity between two texts.
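
As a rough illustration of the idea, the sketch below scores a new question against past questions with cosine similarity over plain word counts. The function names, the toy questions and the ids are all hypothetical, and a real model would likely need better text representations (tf-idf weights or word embeddings) than raw counts:

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r'[a-z]+', text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_question(new_question, past_questions):
    """Return the id of the past question closest to new_question."""
    new_vec = bag_of_words(new_question)
    return max(past_questions,
               key=lambda qid: cosine_similarity(new_vec, bag_of_words(past_questions[qid])))


# Hypothetical past questions keyed by a made-up id
past_questions = {
    'q1': 'How do I become a software engineer?',
    'q2': 'What should I study to become a nurse?',
}
best = most_similar_question('What courses help me become a software engineer', past_questions)
print(best)
```

The new question could then be routed to the professionals who answered the best-matching past question.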

## Tags 

In [None]:
print('tags')
display(df_tags.head(2))
print('tag_users')
display(df_tag_users.head(2))
print('tag_questions')
display(df_tag_questions.head(2))

**Basic information about the tables and initial thoughts**

•	tags.csv (Contains text): Each tag gets a name. There are 16269 distinct tags. We suspect that any user is allowed to create a hashtag and that most tags are similar to a lot of other tags. No null values

•	tag_users.csv: Users of any type can follow a hashtag. This shows you which hashtags each user follows. No null values. 

•	tag_questions.csv: Every question can be hashtagged. The hashtag-to-question pairings are contained in this file. No null values.

**What are some of the most popular tags?**

In [None]:
#To see the tags that every user follows 
df_tag_users_merged = pd.merge(df_tag_users, df_tags, left_on='tag_users_tag_id', right_on='tags_tag_id', how='inner')
#To see the tags that are linked with every question
df_tag_questions_merged = pd.merge(df_tag_questions, df_tags, left_on='tag_questions_tag_id', right_on='tags_tag_id', how='inner')

In [None]:
plt.figure(figsize=(10,6))
plt.title('50 most popular tags wrt user following')
top_user_tags = df_tag_users_merged['tags_tag_name'].value_counts().index[:50]
sns.countplot(x=df_tag_users_merged[df_tag_users_merged['tags_tag_name'].isin(top_user_tags)]['tags_tag_name'],
              color='maroon', order=top_user_tags)
plt.ylabel('count')
plt.xticks(rotation='vertical')

In [None]:
plt.figure(figsize=(10,6))
plt.title('50 most popular tags wrt the number of questions they are linked to')
top_question_tags = df_tag_questions_merged['tags_tag_name'].value_counts().index[:50]
sns.countplot(x=df_tag_questions_merged[df_tag_questions_merged['tags_tag_name'].isin(top_question_tags)]['tags_tag_name'],
              color='maroon', order=top_question_tags)
plt.ylabel('count')
plt.xticks(rotation='vertical')

In [None]:
relevant_tags = set(df_tag_questions_merged['tag_questions_tag_id'].unique()).union(set(df_tag_users_merged['tag_users_tag_id'].unique()))
len(relevant_tags)

In [None]:
print('The total number of unique tagged users is:', df_tag_users_merged['tag_users_user_id'].nunique())
print('The total number of unique tags is:', df_tags['tags_tag_id'].nunique())
print('The total number of unique tagged questions is:', df_tag_questions_merged['tag_questions_question_id'].nunique())
print('The proportion of total questions that are linked with tags:', 
      df_tag_questions_merged['tag_questions_question_id'].nunique()/df_questions['questions_id'].nunique())
print('The proportion of tags that are linked with questions out of the total number of tags:', 
      df_tag_questions_merged['tag_questions_tag_id'].nunique()/df_tags['tags_tag_id'].nunique())
print('The proportion of tags that are followed by users out of the total number of tags:', 
      df_tag_users_merged['tag_users_tag_id'].nunique()/df_tags['tags_tag_id'].nunique())
print('The total number of tags that have a user following > 1% :', 
      sum(df_tag_users_merged['tags_tag_name'].value_counts()/df_tag_users_merged['tag_users_user_id'].nunique() > 0.01)) 
print('The total number of tags that are used in > 1% of the tagged questions:', 
      sum(df_tag_questions_merged['tags_tag_name'].value_counts()/df_tag_questions_merged['tag_questions_question_id'].nunique() > 0.01)) 

1% of the tagged questions is roughly 232 questions. In other words, each of the other 7040 tags is linked with 232 questions or fewer.

Let's look at the tags in the last 2 groups: tags that have a user following of > 1% (user_tags) and tags that are used in > 1% of the tagged questions (question_tags).

In [None]:
#Share of tagged users following each tag, and share of tagged questions linked with each tag
user_tag_share = df_tag_users_merged['tags_tag_name'].value_counts()/df_tag_users_merged['tag_users_user_id'].nunique()
user_tags = list(user_tag_share[user_tag_share > 0.01].index)
question_tag_share = df_tag_questions_merged['tags_tag_name'].value_counts()/df_tag_questions_merged['tag_questions_question_id'].nunique()
question_tags = list(question_tag_share[question_tag_share > 0.01].index)
print('The total number of tags:', len(set(question_tags).union(user_tags)))
print('The number of common tags:', len(set(question_tags).intersection(user_tags)))
print('The tags are:\n', set(question_tags).intersection(user_tags))

**What is the coverage of the union of user and question tags?**

In [None]:
print('Coverage of tagged questions:', df_tag_questions_merged[df_tag_questions_merged['tags_tag_name'].isin(
    set(user_tags).union(set(question_tags)))]['tag_questions_question_id'].nunique()/df_tag_questions_merged['tag_questions_question_id'].nunique())

In [None]:
print('Coverage of tagged users:', df_tag_users_merged[df_tag_users_merged['tags_tag_name'].isin(
    set(user_tags).union(set(question_tags)))]['tag_users_user_id'].nunique()/df_tag_users_merged['tag_users_user_id'].nunique())

A cursory glance across the names of the tags reveals that a lot of tags are similar to a lot of other tags

In [None]:
def print_if_found(df, column, string):
    print(df[df[column].str.contains(string)][column].unique())
print_if_found(df_tag_users_merged, 'tags_tag_name', 'computer')

In [None]:
print_if_found(df_tag_users_merged, 'tags_tag_name', 'psychology')
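
The substring search above only finds tags that contain the keyword. To also catch misspellings, tags could be grouped by string similarity. The sketch below is a minimal illustration that uses `difflib.SequenceMatcher` from the standard library as a stand-in for an edit-distance library. The function name, the toy tag list and the 0.85 cutoff are assumptions chosen for the example:

```python
from difflib import SequenceMatcher


def group_similar_tags(tags_by_popularity, cutoff=0.85):
    """Greedily map each tag to the first more popular tag it resembles."""
    canonical = []   # representative tags, most popular first
    mapping = {}
    for tag in tags_by_popularity:
        match = next((c for c in canonical
                      if SequenceMatcher(None, tag, c).ratio() >= cutoff), None)
        if match is None:
            # No sufficiently similar tag seen so far: this tag represents itself
            canonical.append(tag)
            match = tag
        mapping[tag] = match
    return mapping


# Made-up tag names, assumed to be sorted from most to least followed
tags = ['psychology', 'computer-science', 'psycology', 'computerscience', 'finance']
mapping = group_similar_tags(tags)
print(mapping)
```

Because the list is processed in order of popularity, a misspelt tag is folded into the more widely followed spelling rather than the other way round.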

**What does the distribution of the number of tags linked with every question look like?**

In [None]:
#Looking at tags and questions
print('Numerical summary of the number of tags linked with every question')
display(df_tag_questions.groupby('tag_questions_question_id').agg({'tag_questions_tag_id':'count'})['tag_questions_tag_id'].describe())
plt.figure(figsize=(10,6))
plt.title('Count of the number of tags linked with every question')
sns.countplot(x=df_tag_questions.groupby('tag_questions_question_id').agg({'tag_questions_tag_id':'count'})['tag_questions_tag_id'], color='orange')
plt.xlabel('count of tags')
plt.ylabel('count')

**Insights**

- Most tags are in the long tail of the distribution. This is a consequence of not restricting tag creation by users. While some tags are extremely specific, such as anthropology or human-computer-interaction, others are just misspellings or near-duplicates of other tags.
- The popularity wrt user following is skewed towards tech (telecommunication, computer science, IT etc.)
- Just 86 tags cover nearly 70% of tagged users and tagged questions
- The majority of the questions are linked with fewer than 4 tags. The count decreases steadily as the number of tags increases

**Strategy**

- In order to eliminate noise and refine tags, we need to group similar tags together. Rank the tags in terms of user following. Then, going down the order, look for other tags that are similar to the given tag. String matching with [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) is one option. Edit distance is also available in the [nltk package](https://python.gotrained.com/nltk-edit-distance-jaccard-distance/).
- Another way to combine them would be on the basis of semantic similarity by using [Word Mover's Distance](https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html). The coverage of tagged questions and tagged users should go up once we refine and combine the tags
- Once we reduce the noise, a new question's degree of similarity with another question, on the basis of the tags it is linked with, can be calculated by using [cosine similarity](https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50).
- We can also look at professionals who are similar (maybe on the basis of [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) or [jaccard similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html)) wrt the tags they follow and direct the relevant questions towards them.
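
To make the tag-based similarity concrete, here is a small sketch of Jaccard similarity on sets of tags, written in plain Python rather than with scikit-learn. The professional ids, the tags and the helper names are invented for the example:

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_tag_overlap(target_tags, candidates):
    """Sort candidate ids by Jaccard similarity to target_tags, highest first."""
    scores = {cid: jaccard(target_tags, tags) for cid, tags in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tags followed by three professionals
followed_tags = {
    'prof_a': {'engineering', 'computer-science', 'programming', 'math'},
    'prof_b': {'nursing', 'medicine'},
    'prof_c': {'computer-science', 'programming', 'technology'},
}

# Tags on a new question (also made up)
question_tags = {'computer-science', 'programming'}
ranking = rank_by_tag_overlap(question_tags, followed_tags)
print(ranking)

# The same function compares two professionals with each other
print(jaccard(followed_tags['prof_a'], followed_tags['prof_c']))
```

The same helper works for question-to-question, question-to-professional and professional-to-professional comparisons, since all of them reduce to comparing two sets of tags.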

## Professionals 

In [None]:
print('professionals')
display(df_professionals.head(2))

**Basic information about the tables and initial thoughts**

•	professionals.csv (Contains text): They're the grown-ups who volunteer their time to answer questions on the site. Has 3098 null values in professionals_location, 2576 null values in professionals_industry, 2067 null values in professionals_headline

**How many professionals are actually active on the platform?**

In [None]:
#To see the profile of the volunteers and the questions that they have answered
df_answers_professionals = pd.merge(df_answers, df_professionals, left_on='answers_author_id', right_on='professionals_id', how='outer')

In [None]:
print('Number of professionals that are there on the platform:', df_professionals['professionals_id'].nunique())
print('Number of professionals that haven\'t answered questions on the platform:', df_answers_professionals['answers_id'].isnull().sum())
print('Number of users who have answered questions but are not listed as professionals (e.g. users who have changed their registration type):', 
      len(set(df_answers['answers_author_id']) - set(df_professionals['professionals_id'])))
print('Proportion of professionals who haven\'t answered a question:', 
     df_answers_professionals['answers_id'].isnull().sum()/df_professionals['professionals_id'].nunique())

Now let's look at the distribution of the activity of the professionals to find out which professionals haven't answered questions in some time. We'll take the date of the last question asked as the reference date.

In [None]:
last_date = df_questions['questions_date_added'].max() #date of the last question asked on the platform
df_ap_grouped = df_answers_professionals.groupby('professionals_id').agg({'answers_date_added':max}).apply(lambda x:
                                                                                          (last_date-x).dt.days)
df_ap_grouped.rename(columns={'answers_date_added':'days_since_answered'}, inplace=True)
print('Numerical summary of days_since_answered')
display(df_ap_grouped['days_since_answered'].describe())
plt.figure(figsize=(10,6))
plt.title('Activity of professionals')
plt.hist(df_ap_grouped['days_since_answered'], bins=50, color='blue', edgecolor='black')
plt.xlabel('days_since_answered')
plt.ylabel('count')

**What is the breakdown of days_since_answered in terms of years?**

In [None]:
plt.figure(figsize=(10,6))
plt.title('Count of years since last answered')
sns.countplot(x=(df_ap_grouped[pd.notnull(df_ap_grouped['days_since_answered'])]['days_since_answered']/365).apply(round), color='magenta')

**How many tagged users are professionals and what does the distribution of the number of tags they follow look like?**

In [None]:
#Looking at professionals and tag
df_tag_professionals = pd.merge(df_tag_users_merged, df_professionals, left_on='tag_users_user_id', 
                                 right_on='professionals_id')
print('Proportion of tagged users that are professionals:', df_tag_professionals['professionals_id'].nunique()/df_tag_users_merged['tag_users_user_id'].nunique())

In [None]:
print('Numerical summary of number of tags followed by tagged professionals')
display(df_tag_professionals.groupby('professionals_id').agg({'tag_users_tag_id':lambda x: len(x)})['tag_users_tag_id'].describe())
plt.figure(figsize=(10,6))
plt.title('Count of number of tags followed by tagged professionals')
sns.countplot(x=df_tag_professionals.groupby('professionals_id').agg({'tag_users_tag_id':len})['tag_users_tag_id'].astype(int), color='cyan')
plt.xlim((0,30))

**Insights**

- About 50% of the professionals have answered questions in the last 1.5 years
- The fact that 84% of tagged users are professionals suggests that only a small population of students follow hashtags. Is it because they don't find questions that are similar to the ones that they have on their mind? Or is it because the number of relevant questions that come under a particular hashtag is far smaller than the total number of questions asked?
- Almost 40% of professionals follow only 1 hashtag. This one hashtag may not contain all relevant questions that a professional would want to answer. 

**Strategy**

- Since 50% of professionals have answered questions in the last 1.5 years, we will mostly recommend questions to these professionals
- Along with these active professionals, we may also want to send questions to inactive professionals in order to engage them and encourage them to answer questions on the platform
- The location attribute can be refined by using the techniques described above in the Tags section
- For the headline and industry variables, we can either use [Named Entity Recognition](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da) or extract important words by using a [tf-idf](http://tfidf.com/) approach
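
As a rough sketch of the tf-idf idea, the snippet below scores the words in a few made-up headlines with an unsmoothed tf-idf written in plain Python. The headlines and function names are hypothetical, and libraries such as scikit-learn use slightly different (smoothed) formulas, so the exact scores would differ:

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf-idf} dict per document (unsmoothed tf-idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(term_scores, k=2):
    """The k highest-scoring terms, ties broken alphabetically."""
    return sorted(term_scores, key=lambda t: (-term_scores[t], t))[:k]


# Made-up headlines standing in for professionals_headline
headlines = [
    'software engineer at a bank',
    'data engineer at a startup',
    'registered nurse at a hospital',
]
scores = tfidf_scores(headlines)
print(top_terms(scores[0]))
```

Words that appear in every headline (such as "at") get a score of zero, while words that are specific to one professional rise to the top.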

## Emails and Matches

In [None]:
print('emails')
display(df_emails.head(2))
print('matches')
display(df_matches.head(2))

**Basic information about the tables and initial thoughts**

•	emails.csv: Each email corresponds to one specific email to one specific recipient. The frequency_level refers to the 3 types of email templates, which include immediate emails sent right after a question is asked, daily digests, and weekly digests. Has no null values

•	matches.csv: Each row tells you which questions were included in emails. If an email contains only one question, that email's ID will show up here only once. If an email contains 10 questions, that email's ID would show up here 10 times. Has no null values

The matches table is an interesting one since it provides us with a way to assess relevance. We can look at the table and know:
- which questions a professional was sent
- which questions he has answered in the past

**How many professionals have been sent emails and out of these professionals how many have not answered even one question?**

In [None]:
df_emails_matches = pd.merge(df_emails, df_matches, left_on='emails_id', right_on='matches_email_id')
df_em_answers = pd.merge(df_emails_matches, df_answers, left_on=['emails_recipient_id','matches_question_id'], 
                              right_on=['answers_author_id', 'answers_question_id'], how='left')
grouped_response_rate = df_em_answers.groupby('emails_recipient_id').agg({'answers_id':lambda x: 
                                   (x.notnull().sum())/len(x)})['answers_id']
df_em_answers.head(2)

In [None]:
print('Number of professionals that have been sent emails:', df_emails_matches['emails_recipient_id'].nunique())
print('Number of professionals that have not answered even one question that was emailed to them:',
      grouped_response_rate[grouped_response_rate == 0].shape[0])

**How many professionals have answered questions that were not sent to them?**

In [None]:
df_em_answers_right = pd.merge(df_emails_matches, df_answers, left_on=['emails_recipient_id','matches_question_id'], 
                              right_on=['answers_author_id', 'answers_question_id'], how='right')
print('Number of professionals who have answered questions that weren\'t emailed to them:', 
     (df_em_answers_right.groupby('answers_author_id').agg({'emails_id': lambda x: x.isnull().sum()})['emails_id']!=0).sum())

This is particularly insightful as nearly 40% of professionals have answered at least one question that wasn't emailed to them.

**Out of the questions that get mailed to the professionals, what is their response rate, i.e. how many of these questions get answered?**

In [None]:
print('Numerical summary of the response rate')
display(grouped_response_rate.describe())
plt.figure(figsize=(10,6))
plt.title('Response rate of matched questions')
plt.hist(grouped_response_rate, color='blue', edgecolor='black',bins=50)
plt.ylabel('count')
plt.xlabel('response rate')

Since we know that 18072 professionals haven't answered a single question that was emailed to them, the plot above was expected. Let's exclude these 18072 professionals and look at the rest.

In [None]:
plt.figure(figsize=(10,6))
plt.title('Response rate of matched questions > 0')
plt.hist(grouped_response_rate[grouped_response_rate > 0], color='blue', edgecolor='black',bins=50)
plt.ylabel('count')
plt.xlabel('response rate > 0')

This reveals that out of the 3754 professionals that have answered at least one question, a little fewer than 2500 have a response rate < 2%.

**How many emails have a frequency type of immediate out of the total emails that a professional receives?**

In [None]:
grouped_immediate_rate = df_emails_matches.drop_duplicates(['emails_recipient_id', 'emails_id']).groupby('emails_recipient_id').agg({'emails_frequency_level': 
                                        lambda x: (x.str.contains('email_notification_immediate').sum())/len(x)})['emails_frequency_level']
plt.figure(figsize=(10,6))
plt.title('Proportion of immediate emails')
plt.hist(grouped_immediate_rate, color='blue', edgecolor='black', bins=50)
plt.ylabel('count')
plt.xlabel('proportion of immediate emails')

In [None]:
grouped_immediate_response = df_em_answers.groupby(['emails_recipient_id', 'emails_frequency_level']).agg({'answers_id': 
                                                    lambda x: (x.notnull().sum())/len(x)}).reset_index()
plt.figure(figsize=(10,6))
plt.title('Distribution of response rate of immediate questions')
plt.hist(grouped_immediate_response[grouped_immediate_response['emails_frequency_level']==
                                    'email_notification_immediate']['answers_id'], color='blue', edgecolor='black', bins=50)
plt.xlabel('response rate')
plt.ylabel('count')

**Insights**

- An alarming 83% of professionals haven't answered even one question mailed to them. This really tells you why this competition has been held in the first place.
- The response rate, (questions answered)/(questions sent), gives us some sense of the relevance of matches. The fact that 40% of professionals have answered at least one question that wasn't emailed to them suggests that these professionals find questions by going directly to the website, which in turn suggests that most of the recommended questions are irrelevant
- The response rates are hardly encouraging. A majority of those who have answered questions have response rates of less than 2%. Maybe don't make daily emails the default and instead focus on immediate and weekly email templates?
- As far as immediate emails are concerned, according to [this discussion](https://www.kaggle.com/c/data-science-for-good-careervillage/discussion/84845#latest-496046), a question gets recommended to a professional if the professional follows the same hashtags as the ones on the question. If so, CV notifies them immediately. Since a question can be linked with many tags, there may not be a 1:1 match between the tags on the question and the tags that a user follows. It's also possible that a user follows only 1 hashtag (as 40% of professionals do), which is probably quite broad, and so receives questions linked with many other hashtags that end up being irrelevant to them.

**Strategy**

- Apply thresholded matching to recommend questions. Instead of requiring a 1:1 match, look at partial matches and recommend a question to a professional if the overlap between the tags on the question and the tags the professional follows is above a certain threshold
- In order to assess the performance of the recommender system, we can look at the questions recommended to a professional and see which of them he has answered. This is the metric that we'll optimize for
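
The two points above can be sketched together as follows. This is only an illustration under stated assumptions: the match score (fraction of the question's tags that the professional follows), the 0.5 threshold, the helper names and all of the toy data are made up for the example:

```python
def tag_match_score(question_tags, followed_tags):
    """Fraction of the question's tags that the professional follows."""
    question_tags = set(question_tags)
    if not question_tags:
        return 0.0
    return len(question_tags & set(followed_tags)) / len(question_tags)


def recommend(question_tags, professionals, threshold=0.5):
    """Ids of professionals whose match score reaches the threshold."""
    return [pid for pid, tags in professionals.items()
            if tag_match_score(question_tags, tags) >= threshold]


def answer_rate(recommended, answered):
    """Share of (professional, question) recommendations that were answered."""
    if not recommended:
        return 0.0
    return len(set(recommended) & set(answered)) / len(set(recommended))


# Hypothetical followed tags and question tags
professionals = {
    'prof_a': {'college', 'engineering'},
    'prof_b': {'nursing'},
    'prof_c': {'engineering', 'career', 'internships'},
}
questions = {
    'q1': {'engineering', 'college', 'internships'},
    'q2': {'nursing', 'college'},
}

recommended = [(pid, qid) for qid, tags in questions.items()
               for pid in recommend(tags, professionals)]
# Hypothetical history of who actually answered what
answered = [('prof_a', 'q1'), ('prof_b', 'q2')]
print(recommended)
print(answer_rate(recommended, answered))
```

Raising the threshold sends fewer, more targeted recommendations, so the threshold could be tuned against the answer rate on historical data.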

## Students, Groups and School Memberships

In [None]:
print('students')
display(df_students.head(2))
print('groups')
display(df_groups.head(2))
print('group_memberships')
display(df_group_memberships.head(2))
print('school_memberships')
display(df_school_memberships.head(2))

**Basic information about the tables and initial thoughts**

•	students.csv: They tend to range in age from about 14 to 24. 2033 null values in students_location.

•	groups.csv: There are a total of 49 group ids, 3 more than appear in group_memberships.csv, which suggests that either these 3 groups are new or that no users have joined them. Every group id is mapped to a group type. There are a total of 7 group types. Has no null values

•	group_memberships.csv: Any type of user can join any group. There are a total of 46 groups. Has no null values. This can be useful: a cross tab of the topic of the question asked by a student vs the group id that he/she belongs to will be handy in understanding the agenda of the group.

•	school_memberships.csv: Just like group_memberships, but for schools instead. Can be analysed in a similar fashion to group_memberships. Do people from the same schools ask similar questions?

In [None]:
#To see the group memberships and type together
df_groups_merged = pd.merge(df_group_memberships, df_groups, left_on='group_memberships_group_id', right_on='groups_id', how='outer')
df_groups_professionals = pd.merge(df_groups_merged, df_professionals, left_on='group_memberships_user_id', right_on='professionals_id')
df_groups_students = pd.merge(df_groups_merged, df_students, left_on='group_memberships_user_id', right_on='students_id')
df_school_professionals = pd.merge(df_school_memberships, df_professionals, left_on='school_memberships_user_id', right_on='professionals_id')
df_school_students = pd.merge(df_school_memberships, df_students, left_on='school_memberships_user_id', right_on='students_id')

**How many users have a group membership/school membership? What is the breakdown of professionals and students?**

In [None]:
print('Number of groups that don\'t have a user following:', df_groups_merged['group_memberships_group_id'].isnull().sum())
print('Total number of users that have a group membership:', df_groups_merged['group_memberships_user_id'].nunique())
print('Proportion of users in the group memberships that are professionals:', df_groups_professionals['professionals_id'].nunique()/
     df_groups_merged['group_memberships_user_id'].nunique())
print('Proportion of users in the group memberships that are students:', df_groups_students['students_id'].nunique()/
     df_groups_merged['group_memberships_user_id'].nunique())
print('Total number of users that have a school membership:', df_school_memberships['school_memberships_user_id'].nunique())
print('Proportion of users in the school memberships that are professionals:', df_school_professionals['professionals_id'].nunique()/
     df_school_memberships['school_memberships_user_id'].nunique())
print('Proportion of users in the school memberships that are students:', df_school_students['students_id'].nunique()/
     df_school_memberships['school_memberships_user_id'].nunique())

**How many students have asked a question on the platform?**

In [None]:
df_students_questions = pd.merge(df_students, df_questions, left_on='students_id', right_on='questions_author_id')
print('Number of students on the platform:', df_students['students_id'].nunique())
print('Proportion of students who have asked a question on the platform:', df_students_questions['students_id'].nunique()/df_students['students_id'].nunique())

**Insights**

- Very few users (professionals and students) are a part of groups and school memberships
- 40% of students have asked a question on the platform. Does that mean that the others have questions that get asked by other students and answered by the relevant professionals?

**Strategy**

- Group and School Memberships don't seem to be useful. The fill rates are very low for any sort of similarity computation
- Location has a nice fill rate. [This kernel](https://www.kaggle.com/wjshenggggg/update-5-text-processing) deals with standardising locations of professionals and students. They can also be refined using the same techniques as the ones discussed under Tags
- We can use the idea of collaborative filtering using cosine similarity to recommend questions. That is, if Students A and B have similar interests, and A has asked questions on the platform, then B's questions can be recommended to the professionals that answered A's questions
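
A minimal sketch of this collaborative-filtering idea is shown below. It assumes that a student's interests can be represented as a set (for example the tags they follow), which is an assumption made for the illustration. All ids, data and the 0.5 similarity cutoff are hypothetical:

```python
import math


def cosine(set_a, set_b):
    """Cosine similarity between two binary interest vectors stored as sets."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


def professionals_for_student(student, interests, answered_by, min_similarity=0.5):
    """Professionals who answered questions from students similar to `student`."""
    candidates = {}
    for other, professionals in answered_by.items():
        if other == student:
            continue
        similarity = cosine(interests[student], interests[other])
        if similarity >= min_similarity:
            for prof in professionals:
                # Keep the best similarity through which each professional was reached
                candidates[prof] = max(candidates.get(prof, 0.0), similarity)
    return sorted(candidates, key=lambda p: (-candidates[p], p))


# Hypothetical interests per student
interests = {
    'student_a': {'engineering', 'robotics', 'college'},
    'student_b': {'engineering', 'robotics'},
    'student_c': {'nursing', 'medicine'},
}
# Hypothetical history: professionals who answered each student's past questions
answered_by = {
    'student_a': ['prof_x', 'prof_y'],
    'student_c': ['prof_z'],
}
print(professionals_for_student('student_b', interests, answered_by))
```

Here student_b has not asked anything yet, but is similar to student_a, so a question from student_b would be sent to the professionals who answered student_a.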

## Comments, Question and Answer Scores

In [None]:
print('comments')
display(df_comments.head(2))
print('question_scores')
display(df_question_scores.head(2))
print('answer_scores')
display(df_answer_scores.head(2))

**Basic information about the tables and initial thoughts**

•	comments.csv (Contains Text): Comments can be made on Answers or Questions. We refer to whichever the comment is posted to as the "parent" of that comment. Comments can be posted by any type of user. Has 4 null values in the comments_body field. This can provide additional information about the topic of the question 

•	question_scores.csv: "Hearts" scores for each question.

•	answer_scores.csv: "Hearts" scores for each answer.

**What is the breakdown of the number of comments posted on questions vs answers?**

In [None]:
print('Number of comments:', df_comments['comments_id'].nunique())
#Share of comments (not of distinct parents) whose parent is a question or an answer
print('Proportion of comments on questions:',
      df_comments['comments_parent_content_id'].isin(df_questions['questions_id']).mean())
print('Proportion of comments on answers:',
      df_comments['comments_parent_content_id'].isin(df_answers['answers_id']).mean())

**How does the distribution of count of scores compare for questions vs answers?**

In [None]:
print('Numerical summary of count of hearts on questions')
display(df_question_scores['score'].describe())
plt.figure(figsize=(10,6))
plt.title('Count of hearts on questions ')
sns.countplot(x=df_question_scores['score'], color='red')
plt.xlim(0,30)

In [None]:
print('Numerical summary of count of hearts on answers')
display(df_answer_scores['score'].describe())
plt.figure(figsize=(10,6))
plt.title('Count of hearts on answers ')
sns.countplot(x=df_answer_scores['score'], color='red')

**Insights**

- Questions seem to have more 'hearts' than answers. Do students tend to upvote questions that are similar to the ones that they want to ask?
- Does the paucity of hearts on answers suggest that students are not getting the answers they need to the questions they have asked?
- If not, we suggest recommending to students questions that are similar to the ones they have asked (so as to reduce duplicates) and asking them to upvote the answers they find helpful. Upvotes are one way to encourage professionals to answer more questions on the platform, so that they don't churn. The data can then be used to rank professionals under tags (e.g. Michael is a top writer in #finance)
- The number of views/clicks on an answer would have been useful, as a metric such as likes/clicks could have been used to gauge the relevance of an answer to a student. The paucity of likes on answers makes it hard to determine which user is more likely to give a *10-quality point answer*

**Strategy**

- For a given professional, combine the scores of the questions that he has answered, hearts on his answers and the number of comments on his answers
- This will act as a proxy for user engagement with the professional and how well he is perceived on this platform
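
One way such a score could be put together is sketched below on a toy table. The column names other than answers_author_id, the values and the equal weights are all assumptions made for the illustration. In the notebook the per-answer columns would first have to be assembled from df_answers, df_question_scores, df_answer_scores and df_comments:

```python
import pandas as pd

# Toy per-answer table (hypothetical values)
toy_answers = pd.DataFrame({
    'answers_author_id': ['prof_a', 'prof_a', 'prof_b'],
    'question_score': [3, 1, 5],     # hearts on the question that was answered
    'answer_score': [2, 0, 1],       # hearts on the answer itself
    'answer_comments': [1, 0, 4],    # comments posted on the answer
})

# Equal weights are an arbitrary starting point, not a tuned choice
weights = {'question_score': 1.0, 'answer_score': 1.0, 'answer_comments': 1.0}

# Sum each signal per professional, then combine them into a single weighted score
per_professional = toy_answers.groupby('answers_author_id')[list(weights)].sum()
per_professional['engagement'] = sum(
    per_professional[col] * w for col, w in weights.items())
print(per_professional.sort_values('engagement', ascending=False))
```

The weights could later be adjusted, for instance to count hearts on answers more heavily than hearts on the questions themselves.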

Thus the final solution can be a combination of multiple different recommender systems that act in tandem to make the best recommendations and also improve the relevance/response rate.