## Removing unanswered questions

In [1]:
import pandas as pd
path = '../Dataset/pythonquestions/Answers.csv'
answers_df = pd.read_csv(path)
answers_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B..."


In [None]:
path = '../Dataset/pythonquestions/Questions.csv'
questions_df = pd.read_csv(path)
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...


In [None]:
answers_df.nunique()

In [None]:
questions_df.nunique()

There are 607,282 questions and 987,122 answers in total. Next, find the questions that have no answers:

In [None]:
answers_df.sort_values('ParentId', inplace=True)
answers_df.head()

In [None]:
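# Question IDs that appear as the parent of at least one answer
qIDs_withAnswers = set(answers_df['ParentId'].tolist())
print("qIDs with answers: {}".format(len(qIDs_withAnswers)))
# All question IDs in the dataset
total_qIDs = set(questions_df['Id'].tolist())
print("Total qIDs: {}".format(len(total_qIDs)))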

In [None]:
607282 - 539238

Hence, 68,044 questions have no answers. Remove them:

In [None]:
# Set difference: questions whose Id never appears as an answer's ParentId
unanswered_qIDs = list(total_qIDs - qIDs_withAnswers)
print(len(unanswered_qIDs))
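
The set difference used above can be tried on a toy example. This is only a sketch: the two small frames are built from the `Id` and `ParentId` values visible in the `head()` outputs earlier, so "unanswered" here means unanswered within those five answer rows, not in the full dataset. The `toy_*` names are made up for illustration.

```python
import pandas as pd

# Toy frames using the Id / ParentId values shown in the head() outputs.
toy_questions = pd.DataFrame({'Id': [469, 502, 535, 594, 683]})
toy_answers = pd.DataFrame({'Id': [497, 518, 536, 538, 541],
                            'ParentId': [469, 469, 502, 535, 535]})

# Questions that have at least one answer in the toy data.
answered = set(toy_answers['ParentId'])
all_ids = set(toy_questions['Id'])

# Set difference: question IDs that never appear as a ParentId.
toy_unanswered = sorted(all_ids - answered)
print(toy_unanswered)  # [594, 683]
```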

In [None]:
# REMOVING 'unanswered_qIDs' from the main 'questions_df'
clean_questions_df = questions_df[~questions_df['Id'].isin(unanswered_qIDs)]
clean_questions_df.head()

In [None]:
# Another way to do the same thing:
blah_df = questions_df[questions_df['Id'].isin(qIDs_withAnswers)]
# Check if both are equal:
print(blah_df.equals(clean_questions_df))

In [None]:
# Final check: every kept question should have an answer, and every answered
# ParentId should be a kept question. Both differences should be empty sets.
print(set(clean_questions_df['Id'].tolist()) - set(answers_df['ParentId']))
print(set(answers_df['ParentId'].tolist()) - set(clean_questions_df['Id']))
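
As a rough illustration of what the two-way check detects, the sketch below filters a toy question frame and computes both differences. The data is hypothetical: `ParentId` 999 is an invented answer whose question is missing from the toy question list, added only to show when the second difference is non-empty.

```python
import pandas as pd

toy_questions = pd.DataFrame({'Id': [469, 502, 535, 594, 683]})
# 999 is a made-up ParentId with no matching question (assumption for the demo).
toy_answers = pd.DataFrame({'ParentId': [469, 469, 502, 535, 999]})

answered_ids = set(toy_answers['ParentId'])

# Keep only questions that have at least one answer.
toy_clean = toy_questions[toy_questions['Id'].isin(answered_ids)]
kept = set(toy_clean['Id'])

# Direction 1: kept questions without any answer (empty by construction).
missing_answers = kept - answered_ids
# Direction 2: answers whose parent question is not in the question table.
orphan_parents = answered_ids - kept

print(missing_answers)
print(sorted(orphan_parents))  # [999]
```

The first difference is empty whenever the filter is applied correctly; the second is empty only if every answer points at a question that exists in the question table.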

In [None]:
# Writing it to a CSV
clean_questions_df.to_csv('../Dataset/pythonquestions/cleanQuestions.csv', index=False)

## Labelling answers as "BestAnswer" / "Non_BestAnswer"

In [None]:
answers_df.head(10)

In [None]:
# answers_df.sort_values(['ParentId','Score'], ascending=[True, False], inplace=True)

In [None]:
# Boolean mask: True where an answer's Score equals the highest Score for its question
idx = answers_df.groupby(['ParentId'])['Score'].transform('max') == answers_df['Score']
idx[:10]

In [None]:
bestAnswer_df = answers_df[idx]
bestAnswer_df.head(10)

In [None]:
bestAnswer_df.nunique()

In [None]:
clean_questions_df.nunique()

At this point there are 614,567 best answers for only 539,238 questions, which cannot be right. The excess comes from questions where several answers share the top score, so every tied answer passes the mask. Keeping one answer per question:

In [None]:
len(set(bestAnswer_df['ParentId'].tolist())) # set() should have the same number: 539238

In [None]:
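# Remove duplicates: keep one top-scoring answer per question (the last in row order).
# Reassign instead of using inplace=True, since bestAnswer_df is a slice of answers_df.
bestAnswer_df = bestAnswer_df.drop_duplicates(subset='ParentId', keep="last")
bestAnswer_df.head()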
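
The tie problem and its fix can be reproduced on a small frame. This is a minimal sketch with invented data (the `toy` frame and its values are assumptions, not rows from the dataset): question 1 has two answers tied at the top score, so the mask selects three rows for two questions.

```python
import pandas as pd

# Hypothetical toy data: answers 10 and 11 tie for the top score on question 1.
toy = pd.DataFrame({'Id': [10, 11, 12, 13],
                    'ParentId': [1, 1, 1, 2],
                    'Score': [5, 5, 2, 3]})

# True where an answer's score equals the maximum score of its question.
is_top = toy.groupby('ParentId')['Score'].transform('max') == toy['Score']
top = toy[is_top]
print(len(top))  # 3 rows for only 2 questions

# Keep a single row per question; keep='last' picks the last tied row.
one_per_q = top.drop_duplicates(subset='ParentId', keep='last')
print(one_per_q['Id'].tolist())  # [11, 13]
```

Which of the tied answers survives depends purely on row order, so the choice between equally scored answers is arbitrary unless the frame is sorted by some tie-breaking column first.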

In [None]:
bestAnswer_ids = set(bestAnswer_df['Id'].tolist())
bestAnswer_df.nunique()

In [None]:
nonBestAnswer_ids = set(answers_df['Id'].tolist()) - bestAnswer_ids
len(nonBestAnswer_ids)

In [None]:
447884 + 539238

In [None]:
answers_df.nunique()

In [None]:
# Writing every answer ID with its label to a 2-column CSV.
# Text mode ('w') is used because strings, not bytes, are written.
with open('../Dataset/pythonquestions/labeled_answerIDs.csv', 'w') as f:
    f.write("Answer_ID,Label\n")

    for ID in bestAnswer_ids:
        f.write(str(ID) + ",BestAnswer\n")

    for ID in nonBestAnswer_ids:
        f.write(str(ID) + ",Non_BestAnswer\n")
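
The same two-column file could also be produced with pandas rather than manual writes. The sketch below is one possible alternative, not the approach used in this notebook; it assumes Python 3, uses invented IDs, and writes to an in-memory buffer instead of the real path.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical answer IDs and a hypothetical set of best-answer IDs.
toy_answers = pd.DataFrame({'Id': [10, 11, 12, 13]})
toy_best_ids = {11, 13}

# Label each answer according to membership in the best-answer set.
labels = pd.DataFrame({
    'Answer_ID': toy_answers['Id'],
    'Label': np.where(toy_answers['Id'].isin(toy_best_ids),
                      'BestAnswer', 'Non_BestAnswer'),
})

# Write to an in-memory buffer; a file path would work the same way.
buffer = io.StringIO()
labels.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the label column in one step keeps the answer IDs in their original order and avoids maintaining two separate ID sets.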

In [None]:
temp_df = pd.read_csv('../Dataset/pythonquestions/labeled_answerIDs.csv')
temp_df.head()

In [None]:
ba = temp_df[temp_df['Label'] == 'BestAnswer']
ba.nunique()

In [None]:
nba = temp_df[temp_df['Label'] == 'Non_BestAnswer']
nba.nunique()

In [None]:
# Code to retrieve all columns given only answer IDs:
#answers_df[answers_df['Id'].isin(list(nonBestAnswer_ids))]
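
For reference, the lookup in the commented line above behaves as follows on a toy frame. The data and the `toy_*` names are invented for illustration.

```python
import pandas as pd

# Hypothetical answers table and a hypothetical set of non-best answer IDs.
toy_answers = pd.DataFrame({'Id': [10, 11, 12, 13],
                            'ParentId': [1, 1, 1, 2],
                            'Score': [5, 5, 2, 3]})
toy_non_best_ids = {10, 12}

# isin() builds a boolean mask; indexing with it returns the full rows.
non_best_rows = toy_answers[toy_answers['Id'].isin(toy_non_best_ids)]
print(non_best_rows['Id'].tolist())  # [10, 12]
```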