## Challenge description: 
#### In this competition, we're challenged to use this new dataset to build predictive algorithms for different subjective aspects of question-answering. The question-answer pairs were gathered from nearly 70 different websites, in a "common-sense" fashion. The raters received minimal guidance and training and relied largely on their subjective interpretation of the prompts. As such, each prompt was crafted in the most intuitive fashion so that raters could simply use their common-sense to complete the task.
#### Demonstrating these subjective labels can be predicted reliably can shine a new light on this research area. Results from this competition will inform the way future intelligent Q&A systems will get built, hopefully contributing to them becoming more human-like.

## About the data:
#### The data for this competition includes questions and answers from various StackExchange properties. Our task is to predict the target values of 30 labels for each question-answer pair.
#### The list of 30 target labels is the same as the column names in the sample_submission.csv file. Target labels with the prefix question_ relate to the question_title and/or question_body features in the data. Target labels with the prefix answer_ relate to the answer feature.
#### Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions.
#### Target labels can have continuous values in the range [0,1]. Therefore, predictions must also be in that range.
#### The files provided are:
- train.csv - the training data (target labels are the last 30 columns)
- test.csv - the test set (you must predict 30 labels for each test set row)
- sample_submission.csv - a sample submission file in the correct format; column names are the 30 target labels
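
Because the training data repeats some questions with different answers, a purely row-wise validation split could put the same question on both sides of the split. Below is a minimal sketch of a group-aware fold assignment. It assumes that a question can be identified by its `question_title`; the toy list and the helper name `group_folds` are hypothetical and not part of the competition code. scikit-learn's `GroupKFold` implements the same idea.

```python
from collections import defaultdict

def group_folds(groups, n_folds):
    """Assign every row index to a fold so that equal group keys share a fold."""
    fold_of_group = {}
    folds = defaultdict(list)
    for row, key in enumerate(groups):
        if key not in fold_of_group:
            # round-robin over groups in order of first appearance
            fold_of_group[key] = len(fold_of_group) % n_folds
        folds[fold_of_group[key]].append(row)
    return [folds[i] for i in range(n_folds)]

# Toy stand-in for train['question_title']; 'q1' appears twice (two answers).
titles = ['q1', 'q2', 'q1', 'q3', 'q4']
folds = group_folds(titles, n_folds=2)
print(folds)
```

Both rows of `'q1'` end up in the same fold, so a model validated on that fold never sees the question during training.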

In [None]:
# importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  
from tqdm.notebook import tqdm
import plotly.graph_objects as go
import os
from urllib.parse import urlparse
%matplotlib inline
from IPython.display import Image 
sns.set()

#### data provided by Kaggle

In [None]:
os.listdir('../input/google-quest-challenge')

In [None]:
# reading the data into dataframe using pandas
train = pd.read_csv('../input/google-quest-challenge/train.csv')
test = pd.read_csv('../input/google-quest-challenge/test.csv')
submission = pd.read_csv('../input/google-quest-challenge/sample_submission.csv')

In [None]:
# Let's check the top 5 entries of train data.
train.head()

In [None]:
# Let's check the statistical description of the numerical features in train data
train.describe()

In [None]:
train.iloc[:, 11:].columns

In [None]:
# let's check the unique values in target features
np.unique(train.iloc[:, 11:].values)

In [None]:
# These are the features provided in the test data
test.columns

In [None]:
# these are the features that we need to include while submitting the results
submission.columns
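
Since the target labels lie in the range [0,1], it can help to clamp raw model outputs before writing them into the submission columns. A minimal sketch with NumPy follows; `raw_preds` is a made-up array standing in for real model outputs.

```python
import numpy as np

# Hypothetical raw model outputs for 3 rows and 2 of the 30 targets.
raw_preds = np.array([[-0.05, 0.40],
                      [0.70, 1.20],
                      [0.33, 0.99]])

# Clamp every value into the valid interval [0, 1].
preds = np.clip(raw_preds, 0.0, 1.0)
print(preds)
```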

## EDA

#### For illustration, below is the anatomy of a webpage (using the link in 'url') and the top 6 features on the webpage.
#### The features question_user_page and answer_user_page are links to the user's page, which can be reached by clicking on the user names (question_user_name and answer_user_name) shown below.

In [None]:
Image('../input/google-quest-qna-eda-img/url.png', width=920, height=480)

#### Question title

In [None]:
# A text feature that represents the title of the question.
train['question_title'].head()

In [None]:
# Let's calculate the length of each question title
length = train['question_title'].apply(lambda x:len(x.split(' ')))
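
One detail about this word count: `split(' ')` treats every single space as a separator, so consecutive or trailing spaces produce empty strings that are counted as words. The sketch below, on a made-up title, compares it with `split()`, which splits on any run of whitespace.

```python
title = 'How  do I parse JSON? '   # note the double space and the trailing space

n_split_space = len(title.split(' '))  # empty strings between spaces are counted
n_split_any = len(title.split())       # splits on whitespace runs, drops empties

print(n_split_space, n_split_any)
```

For clean titles the two counts agree, so the histograms should differ only where the raw text has irregular spacing.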

In [None]:
length.describe()

In [None]:
# histogram of length of question titles
plt = go.Figure(data=[go.Histogram(x=length)], 
                layout = go.Layout(title='histogram of length of question title in train data', 
                                  xaxis=dict(title='length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
test['question_title'].head()

In [None]:
# Let's calculate the length of each question title
length = test['question_title'].apply(lambda x:len(x.split(' ')))

In [None]:
length.describe()

In [None]:
# histogram of length of question titles
plt = go.Figure(data=[go.Histogram(x=length)], 
                layout = go.Layout(title='histogram of length of question title in test data', 
                                  xaxis=dict(title='length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

#### Question body

In [None]:
# This is the main text feature: the full description of the question asked
train['question_body'].head()

In [None]:
# Lets check the length of the questions body
length = train['question_body'].apply(lambda x:len(x.split(' ')))

In [None]:
length.describe()

In [None]:
# histogram of length of question bodies
plt = go.Figure(data=[go.Histogram(x=length, marker_color='#39f79b')], 
                layout = go.Layout(title='histogram of length of question body in train data',
                                  xaxis=dict(title='length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
# histogram of log of length of question bodies
plt = go.Figure(data=[go.Histogram(x=np.log1p(length), marker_color='#39f79b')], 
                layout = go.Layout(title='histogram of log of length of question body in train data', 
                                  xaxis=dict(title='log of length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()
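
The log-scaled histogram uses `np.log1p`, which computes log(1 + x). It compresses the long right tail of the length distribution and stays defined for a length of 0. A small sketch on made-up word counts:

```python
import numpy as np

lengths = np.array([0, 9, 99, 999, 9999])  # made-up word counts
log_lengths = np.log1p(lengths)            # log(1 + x), defined at x = 0
print(np.round(log_lengths, 2))

# np.expm1 is the inverse transform and recovers the original counts
recovered = np.expm1(log_lengths)
```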

In [None]:
test['question_body'].head()

In [None]:
# length of question in test data
length = test['question_body'].apply(lambda x:len(x.split(' ')))

In [None]:
# histogram of length of question bodies
plt = go.Figure(data=[go.Histogram(x=length, marker_color='#39f79b')], 
                layout = go.Layout(title='histogram of length of question body in test data', 
                                  xaxis=dict(title='length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
# histogram of log of length of question bodies
plt = go.Figure(data=[go.Histogram(x=np.log1p(length), marker_color='#39f79b')], 
                layout = go.Layout(title='histogram of log of length of question body in test data', 
                                  xaxis=dict(title='log of length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

#### Question user name, Answer user name
(these are irrelevant features so I have not done EDA for them)

In [None]:
train['question_user_name'].head()

In [None]:
train['answer_user_name'].head()

#### Answer

In [None]:
# Another important text feature: the answer given to the question.
train['answer'].head()

In [None]:
# Length of answers
length = train['answer'].apply(lambda x:len(x.split(' ')))

In [None]:
length.describe()

In [None]:
# histogram of length of answers
plt = go.Figure(data=[go.Histogram(x=length, marker_color='#eb4034')], 
                layout = go.Layout(title='histogram of length of answer in train data', 
                                  xaxis=dict(title='length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
# histogram of log of length of answers
plt = go.Figure(data=[go.Histogram(x=np.log1p(length), marker_color='#eb4034')], 
                layout = go.Layout(title='histogram of log of length of answer in train data', 
                                  xaxis=dict(title='log of length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
test['answer'].head()

In [None]:
length = test['answer'].apply(lambda x:len(x.split(' ')))

In [None]:
# histogram of length of answers
plt = go.Figure(data=[go.Histogram(x=length, marker_color='#eb4034')], 
                layout = go.Layout(title='histogram of length of answer in test data', 
                                  xaxis=dict(title='length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
# histogram of log of length of answers
plt = go.Figure(data=[go.Histogram(x=np.log1p(length), marker_color='#eb4034')], 
                layout = go.Layout(title='histogram of log of length of answer in test data', 
                                  xaxis=dict(title='log of length of sentences'), 
                                  yaxis=dict(title='frequency')))
plt.show()

#### Category

In [None]:
# This feature represents the category that the question-answer pair belongs to.
train['category'].head(10)

In [None]:
# There are 5 categories
train['category'].value_counts()

In [None]:
categories = train['category'].value_counts()
fig = go.Figure([go.Pie(labels=categories.keys(), values=categories)])
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_layout(title_text="'category' Pie chart for train data",
                  annotations=[dict(text='category', x=0.5, y=0.5, 
                                    font_size=20, showarrow=False)])
fig.show()

In [None]:
# This feature represents the category that the question-answer pair belongs to.
test['category'].head(10)

In [None]:
# There are 5 categories
test['category'].value_counts()

In [None]:
categories = test['category'].value_counts()
fig = go.Figure([go.Pie(labels=categories.keys(), values=categories)])
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_layout(title_text="'category' Pie chart for test data",
                  annotations=[dict(text='category', x=0.5, y=0.5, 
                                    font_size=20, showarrow=False)])
fig.show()

#### Host

In [None]:
# this feature represents the host/domain name of the question answer page url.
train['host'].head(10)
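
The `host` value looks like the network location part of the page URL. As a sketch, `urlparse` from the standard library extracts that part. The example URL is made up, and whether `host` equals `netloc` for every row is an assumption to check on the data.

```python
from urllib.parse import urlparse

# Hypothetical URL in the style of the 'url' column.
url = 'https://photo.stackexchange.com/questions/9169/example-question'
host = urlparse(url).netloc
print(host)
```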

In [None]:
# We can see that there are 63 distinct host names
train.host.value_counts()

In [None]:
categories = train['host'].value_counts()
fig = go.Figure([go.Pie(labels=categories.keys(), values=categories)])
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_layout(title_text="'host' Pie chart for train data",
                  annotations=[dict(text='host', x=0.5, y=0.5, 
                                    font_size=20, showarrow=False)])
fig.show()

In [None]:
# this feature represents the host/domain name of the question answer page url.
test['host'].head(10)

In [None]:
# Number of question-answer pairs per host in the test data
test.host.value_counts()

In [None]:
categories = test['host'].value_counts()
fig = go.Figure([go.Pie(labels=categories.keys(), values=categories)])
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_layout(title_text="'host' Pie chart for test data",
                  annotations=[dict(text='host', x=0.5, y=0.5, 
                                    font_size=20, showarrow=False)])
fig.show()

## Data Scraping

#### The remaining 3 features 'url', 'question_user_page', and 'answer_user_page' can be a great source of external data. Let's see how.

#### URL
This feature holds the webpage URLs of the questions and answers.

#### At the beginning of the EDA, we saw the anatomy of the webpage that we land on using the links in the feature 'url'. Now let's see what other features can be extracted from the webpage.
#### For each question, there can be multiple answers. The accepted answer (one with a green tick) is the one that is provided in the original dataset. But we can scrape the other answers from the webpage as well.
*I've denoted the other answers as 'post' below.

In [None]:
Image('../input/google-quest-qna-eda-img/posts.png', width=920, height=480)

#### There are 2 more features that we can scrape-- 'upvotes' and 'comments'. The feature 'upvotes' will hold the number of upvotes that the accepted answer received and the feature 'comments' will hold the comments in the  posts.

In [None]:
Image('../input/google-quest-qna-eda-img/upvotes_comments.png', width=920, height=480)

In [None]:
train['url'].head(10)

In [None]:
# Function for scraping the answers and their comments.
# All of the urls point to Stack Exchange sites, so they share the same html hierarchy.
from urllib import request
from bs4 import BeautifulSoup

def get_answers_comments(url): 
  try:
    get = request.urlopen(url).read() # read the html data from the url page
    src = BeautifulSoup(get, 'html.parser') # convert the data into a beautifulsoup object
    upvotes, answer = [], [] 
    correct_ans, comments = [], []
    new_features = []
    post_layout = src.find_all("div", class_ = 'post-layout') # collecting all the posts from the page
    for p in post_layout: # collecting answer, upvotes, comments from posts
      answer.append(p.find_all('div', class_='post-text')[0].text.strip())
      upvotes.append(int(p.find_all("div", class_ = 'js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center')[0].get('data-value')))
      correct_ans.append(len(p.find_all("div", class_ = 'js-accepted-answer-indicator grid--cell fc-green-500 ta-center py4')))
      comments.append('\n'.join([i.text.strip() for i in p.find_all('span', class_='comment-copy')]))
    idx = np.argmax(correct_ans) # index of the accepted answer among all the posts
    new_features.append(upvotes.pop(idx)) # accepted answer's upvotes
    new_features.append(comments.pop(idx)) # accepted answer's comments
    del answer[idx]
    # collecting the answer and comments from up to 3 posts apart from the one already provided in train.csv
    k = min(len(answer), 3)
    for a,b in zip(answer[:k], comments[:k]): 
      new_features.append(a) 
      new_features.append(b)
    # pad with empty strings so that every row has exactly 8 entries
    new_features.extend([''] * (2 * (3 - k)))

    return new_features
    
  except Exception:
    return [np.nan]*8 # return np.nan if the code runs into some error like page not found
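
Every scraped row needs the same number of entries, otherwise the per-row lists cannot be stacked into a rectangular array later. The sketch below isolates that padding step; the helper name `pad_pairs` and its inputs are hypothetical and only illustrate the idea.

```python
def pad_pairs(answers, comments, n_pairs=3):
    """Flatten up to n_pairs (answer, comment) pairs and pad with '' to 2 * n_pairs items."""
    flat = []
    for a, c in list(zip(answers, comments))[:n_pairs]:
        flat.extend([a, c])
    # fill the missing slots so that the length is always 2 * n_pairs
    flat.extend([''] * (2 * n_pairs - len(flat)))
    return flat

# A page with a single extra answer still yields 6 entries.
row = pad_pairs(['ans A'], ['cmt A'])
print(row)
```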

#### Question user page, Answer user page

#### If we go to the userpage using the link provided in the features 'question_user_page' and 'answer_user_page' there are 4 new useful features that we can scrape.
#### The 4 new features are 'reputation', 'gold_score', 'silver_score', 'bronze_score'.

In [None]:
Image('../input/google-quest-qna-eda-img/user.png', width=920, height=480)

In [None]:
train['question_user_page'].head()

In [None]:
train['answer_user_page'].head()

In [None]:
# Code for scraping the user data. All of the urls point to Stack Exchange sites, so they share the same html hierarchy.
from urllib import request
from bs4 import BeautifulSoup

def get_user_rating(url):
  try:
    get = request.urlopen(url).read()
    src = BeautifulSoup(get, 'html.parser')
    template = src.find_all("div", class_ = 'grid--cell fl-shrink0 ws2 overflow-hidden')[0] 
    # the counts are displayed with thousands separators (e.g. '1,234'), so strip the commas first
    reputation = int(''.join(template.find_all('div', class_='grid--cell fs-title fc-dark')[0].text.strip().split(',')))
    gold = int(''.join(template.find_all('div', class_='grid ai-center s-badge s-badge__gold')[0].text.strip().split(',')))
    silver = int(''.join(template.find_all('div', class_='grid ai-center s-badge s-badge__silver')[0].text.strip().split(',')))
    bronze = int(''.join(template.find_all('div', class_='grid ai-center s-badge s-badge__bronze')[0].text.strip().split(',')))
    output = [reputation, gold, silver, bronze] 
  except Exception:
    output = [np.nan]*4 # return np.nan if the code runs into some error like page not found

  return output
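
The reputation and badge counts are read from display text, which may contain thousands separators and surrounding whitespace. A minimal sketch of that conversion follows; the helper name `parse_count` is hypothetical, and it assumes the text contains only digits, commas and whitespace.

```python
def parse_count(text):
    """Convert a display string such as ' 12,345 ' into an int."""
    return int(text.strip().replace(',', ''))

print(parse_count(' 12,345 '))
```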

In [None]:
a = [[1,2],[3,4]]
b = [[4,5,6],[7,8,9]]
np.hstack((a,b))

In [None]:
from tqdm.notebook import tqdm
def scrape_data(df):
    answers_comments = []
    for url in tqdm(df['url']):
      answers_comments.append(get_answers_comments(url))
    question_user_rating = []
    for url in tqdm(df['question_user_page']):
      question_user_rating.append(get_user_rating(url))
    answer_user_rating = []
    for url in tqdm(df['answer_user_page']):
      answer_user_rating.append(get_user_rating(url))
    
    return np.hstack((answers_comments, question_user_rating, answer_user_rating))

# # Saving as dataframe
# columns = ['upvotes', 'comments_0', 'answer_1', 'comment_1', 'answer_2','comment_2',
#             'answer_3', 'comment_3', 'reputation_q', 'gold_q','silver_q', 'bronze_q', 
#             'reputation_a', 'gold_a', 'silver_a','bronze_a']
# scraped_train = pd.DataFrame(scrape_data(train), columns=columns)
# scraped_train.to_csv('scraped_train.csv', index=False)
# scraped_test = pd.DataFrame(scrape_data(test), columns=columns)
# scraped_test.to_csv('scraped_test.csv', index=False)

In [None]:
# Since I've already scraped the data once, I'll use that for the further analysis
scraped_train = pd.read_csv('../input/google-quest-qna-scraped-data/scraped_features_train.csv')
scraped_test = pd.read_csv('../input/google-quest-qna-scraped-data/scraped_features_test.csv')

In [None]:
scraped_train.head()

#### Upvotes

In [None]:
upvotes = scraped_train['upvotes'].replace(' ', np.nan).dropna().apply(lambda x:int(x.split('.')[0]))

In [None]:
# histogram of upvotes
plt = go.Figure(data=[go.Histogram(x=upvotes, marker_color='#00a0a0')], 
                layout = go.Layout(title='histogram upvotes for train data', 
                                  xaxis=dict(title='upvotes count'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
upvotes = scraped_test['upvotes'].replace(' ', np.nan).dropna().apply(lambda x:int(x.split('.')[0]))

In [None]:
# histogram of upvotes
plt = go.Figure(data=[go.Histogram(x=upvotes, marker_color='#00a0a0')], 
                layout = go.Layout(title='histogram upvotes for test data', 
                                  xaxis=dict(title='upvotes count'), 
                                  yaxis=dict(title='frequency')))
plt.show()

#### comments_0

In [None]:
length_c0 = scraped_train['comments_0'].apply(lambda x:len(x.split(' ')))
length_c1 = scraped_train['comment_1'].apply(lambda x:len(x.split(' ')))
length_c2 = scraped_train['comment_2'].apply(lambda x:len(x.split(' ')))
length_c3 = scraped_train['comment_3'].apply(lambda x:len(x.split(' ')))

In [None]:
# histogram of length of comments
plt = go.Figure(data=[go.Histogram(x=np.log1p(length_c0), marker_color='#941759', name='comment_0'),
                      go.Histogram(x=np.log1p(length_c1), marker_color='#386082', name='comment_1'),
                      go.Histogram(x=np.log1p(length_c2), marker_color='#789501', name='comment_2'),
                      go.Histogram(x=np.log1p(length_c3), marker_color='#e80995', name='comment_3')], 
                layout = go.Layout(title='histogram of log of length of comments for train data', 
                                  xaxis=dict(title='comment length'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
length_c0 = scraped_test['comments_0'].apply(lambda x:len(x.split(' ')))
length_c1 = scraped_test['comment_1'].apply(lambda x:len(x.split(' ')))
length_c2 = scraped_test['comment_2'].apply(lambda x:len(x.split(' ')))
length_c3 = scraped_test['comment_3'].apply(lambda x:len(x.split(' ')))

In [None]:
# histogram of length of comments
plt = go.Figure(data=[go.Histogram(x=np.log1p(length_c0), marker_color='#941759', name='comment_0'),
                      go.Histogram(x=np.log1p(length_c1), marker_color='#386082', name='comment_1'),
                      go.Histogram(x=np.log1p(length_c2), marker_color='#789501', name='comment_2'),
                      go.Histogram(x=np.log1p(length_c3), marker_color='#e80995', name='comment_3')], 
                layout = go.Layout(title='histogram of log of length of comments for test data', 
                                  xaxis=dict(title='comment length'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
length_a1 = scraped_train['answer_1'].apply(lambda x:len(x.split(' ')))
length_a2 = scraped_train['answer_2'].apply(lambda x:len(x.split(' ')))
length_a3 = scraped_train['answer_3'].apply(lambda x:len(x.split(' ')))

In [None]:
# histogram of length of answers
plt = go.Figure(data=[go.Histogram(x=np.log1p(length_a1), marker_color='#386082', name='answer_1'),
                      go.Histogram(x=np.log1p(length_a2), marker_color='#789501', name='answer_2'),
                      go.Histogram(x=np.log1p(length_a3), marker_color='#e80995', name='answer_3')], 
                layout = go.Layout(title='histogram of log of length of answers for train data', 
                                  xaxis=dict(title='answer_length'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
length_a1 = scraped_test['answer_1'].apply(lambda x:len(x.split(' ')))
length_a2 = scraped_test['answer_2'].apply(lambda x:len(x.split(' ')))
length_a3 = scraped_test['answer_3'].apply(lambda x:len(x.split(' ')))

In [None]:
# histogram of length of answers
plt = go.Figure(data=[go.Histogram(x=np.log1p(length_a1), marker_color='#386082', name='answer_1'),
                      go.Histogram(x=np.log1p(length_a2), marker_color='#789501', name='answer_2'),
                      go.Histogram(x=np.log1p(length_a3), marker_color='#e80995', name='answer_3')], 
                layout = go.Layout(title='histogram of log of length of answers for test data', 
                                  xaxis=dict(title='answer_length'), 
                                  yaxis=dict(title='frequency')))
plt.show()

In [None]:
scraped_train.columns[-8:]

#### Description of the last 8 scraped features from question and answer user page links
#### 'reputation_q', 'gold_q', 'silver_q', 'bronze_q', 'reputation_a', 'gold_a', 'silver_a', 'bronze_a'

In [None]:
# For train data
scraped_train.iloc[:, -8:].describe()

In [None]:
# For test data
scraped_test.iloc[:, -8:].describe()

### Target features

In [None]:
import matplotlib.pyplot as plt
# histograms of the target labels
f,ax = plt.subplots(5,6, figsize=(24,20))
for i,label in enumerate(train.columns[11:]):
  plt.subplot(5,6,i+1)
  plt.hist(train[label], bins=20)
  plt.title(label)

plt.show()

In [None]:
plt.figure(figsize=(16,14))
Var_Corr = train.iloc[:, 11:].corr()  # select the 30 target columns (not rows 11 onwards)
sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns) 
plt.title('Correlation between target features.')
plt.show()