# Asking and Answering Questions on Stack Overflow
Stack Overflow has released official [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer) guides.

Why do we care about the quality of the questions asked? And where does that leave the pedagogical maxim "There is no such thing as a bad question"?

Although there is an educational aspect to Stack Overflow, at the end of the day it is a question-and-answer forum where developers get and give help. If you are asking for help, you want to maximize your chances of getting it. If you are helping, you want to maximize your chances of being effective. In this context, we can redefine "good" questions as questions likely to be answered, and "good" answers as answers likely to be accepted.

Stack Overflow also runs an automatic quality filter on new questions:

> Sometimes users encounter the following message when posting a question:
>> 'This post does not meet our quality standards.'
>
> If you see this message, then your question was automatically blocked by the server. All new questions are subjected to a "minimum quality" filter that checks for some basic indicators of a good, complete question. Check to make sure that your question has the following:
> - A clear title.
> - A reasonable explanation of what your question is. Add as much detail as you can.
> - Any background research you've tried but wasn't enough to solve your problem.
> - Correct use of English spelling and grammar to the best of your ability.

However, every item on that list is fairly subjective. Let's explore which features of a question correlate with the votes and answers it receives.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from multiprocessing import Pool
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.manifold import TSNE
import re

In [0]:
from google.colab import files
uploaded = files.upload()

Saving Answers.csv to Answers.csv
Saving Questions.csv to Questions.csv


In [2]:
Questions = pd.read_csv('pythonquestions/Questions.csv', encoding="ISO-8859-1", index_col='Id')
Answers = pd.read_csv('pythonquestions/Answers.csv', encoding="ISO-8859-1", index_col='Id')
Tags = pd.read_csv('pythonquestions/Tags.csv', encoding="ISO-8859-1", index_col='Id')

Questions.columns = ['QuestionUserId', 'QuestionCreateDate', 'QuestionScore', 'QuestionTitle', 'QuestionBody']
Answers.columns = ['AnswerUserId', 'AnswerCreateDate', 'ParentId', 'AnswerScore', 'AnswerBody']


In [0]:
print(Questions.shape)
Questions.head()

(607282, 5)


Id,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody
469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...


In [0]:
print(Answers.shape)
Answers.head()

(987122, 5)


Id,AnswerUserId,AnswerCreateDate,ParentId,AnswerScore,AnswerBody
497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...
518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...
536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...
538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...
541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B..."


In [0]:
print(Tags.shape)
Tags.head()

(1885078, 1)


Id,Tag
469,python
469,osx
469,fonts
469,photoshop
502,python


In [0]:
print(Questions.isnull().sum())
print(Answers.isnull().sum())
print(Tags.isnull().sum())

QuestionUserId        6212
QuestionCreateDate       0
QuestionScore            0
QuestionTitle            0
QuestionBody             0
dtype: int64
AnswerUserId        5367
AnswerCreateDate       0
ParentId               0
AnswerScore            0
AnswerBody             0
dtype: int64
Tag    443
dtype: int64


Looks like this data is pretty clean. The only nulls are for users (probably users that have deleted their accounts since posting), and tags (since these are optional).

# Question 1: How many users post questions, answers, or both?

In [0]:
User_id_inQ = Questions['QuestionUserId'].unique()
User_id_inA = Answers['AnswerUserId'].unique()
User_id_inBoth=set(User_id_inQ).intersection(User_id_inA)

In [0]:
print(str(len(User_id_inQ)) + ' users posting questions')
print(str(len(User_id_inA)) + ' users posting answers')
print(str(len(User_id_inBoth)) + ' users posting both')

213928 users posting questions
149177 users posting answers
63779 users posting both
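
`Series.unique()` keeps `NaN` as one of its values, so the first two counts above probably each include the missing user ID as if it were one more user. The following is a minimal sketch, on made-up toy data rather than the real dataset, of one way to exclude missing IDs before counting. The variable names `question_users` and `answer_users` are illustrative stand-ins for `Questions['QuestionUserId']` and `Answers['AnswerUserId']`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Questions['QuestionUserId'] and Answers['AnswerUserId'].
question_users = pd.Series([147.0, 147.0, 154.0, np.nan, 199.0])
answer_users = pd.Series([50.0, 154.0, np.nan, 199.0, 199.0])

# dropna() keeps deleted accounts (NaN) from being counted as a user.
askers = set(question_users.dropna().unique())
answerers = set(answer_users.dropna().unique())
both = askers & answerers

print(len(askers), 'users posting questions')
print(len(answerers), 'users posting answers')
print(len(both), 'users posting both')
```

On the real data this should lower the question and answer counts by at most one each, so the overall picture is unlikely to change.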


In [0]:
# reduce memory and computation
# selected_ids = np.random.choice(range(Questions.shape[0]), 10000, replace=False)
# Questions = Questions.loc[selected_ids, :]

# Question 2: How many questions contain Python errors and tracebacks?

In [0]:
Questions['HasError'] = Questions['QuestionBody'].str.contains(r"[A-Z][a-z]+Error:\s")

In [0]:
print(
    Questions['QuestionBody'].str.contains(r"Error").sum(),
    Questions['QuestionBody'].str.contains(r"Error:").sum(),
    Questions['QuestionBody'].str.contains(r"Error:\s").sum(),
    Questions['QuestionBody'].str.contains(r"[a-z]+Error:\s").sum(),
    Questions['HasError'].sum()
)

118516 96055 95190 85601 85481


In [0]:
print(str(Questions['HasError'].sum() / Questions['HasError'].count()) + " of questions have a full Python-formatted error message")

0.14075997641952173 of questions have a full Python-formatted error message


In [0]:
Questions['HasTraceback'] = Questions['QuestionBody'].str.contains(r"Traceback \(most recent call last\):")

In [0]:
print(str(Questions['HasTraceback'].sum() / Questions['HasTraceback'].count()) + " of questions have a full Python-formatted traceback")

0.06771977433877506 of questions have a full Python-formatted traceback
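
To make it clearer what the two patterns do and do not catch, here is a small sketch that applies them to a few hand-written strings. The sample texts are invented for illustration and are not taken from the dataset. Note that an exception name whose prefix is all capitals, such as `IOError`, is not matched by `[A-Z][a-z]+Error:\s`, which may explain part of the gap between the counts above.

```python
import re

# The same patterns as above, written as raw strings.
ERROR_RE = re.compile(r"[A-Z][a-z]+Error:\s")
TRACEBACK_RE = re.compile(r"Traceback \(most recent call last\):")

# Invented sample bodies, one per case of interest.
samples = {
    "typed error": "<pre>TypeError: unsupported operand</pre>",
    "bare mention": "I keep getting an Error when I run this",
    "acronym prefix": "IOError: [Errno 2] No such file",
    "traceback": "Traceback (most recent call last):\n  File ...",
}

# str.contains with a regex behaves like re.search on each body.
flags = {
    name: (bool(ERROR_RE.search(text)), bool(TRACEBACK_RE.search(text)))
    for name, text in samples.items()
}
for name, (has_error, has_traceback) in flags.items():
    print(f"{name:15} error={has_error} traceback={has_traceback}")
```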


In [0]:
Questions.head()

Id,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,HasError,HasTraceback
469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,False,False
502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,False,False
535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,False,False
594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,False,False
683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,False,False


# Overall average number of votes per question

In [0]:
total_average_votes = Questions['QuestionScore'].sum() / len(Questions['QuestionScore'])
print('total average votes ' + str(total_average_votes))

total average votes 2.2831369940159596


# Question 3: Do questions with errors have more or fewer votes?

In [0]:
QuestionsWithError = Questions[Questions['HasError']]
hasError_average_votes = QuestionsWithError['QuestionScore'].sum() / len(QuestionsWithError['QuestionScore'])
print('error average votes ' + str(hasError_average_votes))
QuestionsWithOnlyError = Questions[(Questions['HasError'] & ~Questions['HasTraceback'])]
hasOnlyError_average_votes = QuestionsWithOnlyError['QuestionScore'].sum() / len(QuestionsWithOnlyError['QuestionScore'])
print('only error average votes ' + str(hasOnlyError_average_votes))

error average votes 1.7215053637650473
only error average votes 1.6573415912339209


# Question 4: Do questions with tracebacks have more or fewer votes?

In [0]:
QuestionsWithTraceback = Questions[Questions['HasTraceback']]
hasTraceback_average_votes = QuestionsWithTraceback['QuestionScore'].sum() / len(QuestionsWithTraceback['QuestionScore'])
print('traceback average votes ' + str(hasTraceback_average_votes))
QuestionsWithOnlyTraceback = Questions[(~Questions['HasError'] & Questions['HasTraceback'])]
hasOnlyTraceback_average_votes = QuestionsWithOnlyTraceback['QuestionScore'].sum() / len(QuestionsWithOnlyTraceback['QuestionScore'])
print('only traceback average votes ' + str(hasOnlyTraceback_average_votes))

traceback average votes 1.8253617021276596
only traceback average votes 1.8328611898016998


# Question 5: Do questions with errors and tracebacks have more or fewer votes?

In [0]:
QuestionsWithErrorAndTraceback = Questions[(Questions['HasError'] & Questions['HasTraceback'])]

In [0]:
QuestionsWithErrorAndTraceback = QuestionsWithError[QuestionsWithError['HasTraceback']]
hasErrorAndTraceback_average_votes = QuestionsWithErrorAndTraceback['QuestionScore'].sum() / len(QuestionsWithErrorAndTraceback['QuestionScore'])
print('error and traceback average votes ' + str(hasErrorAndTraceback_average_votes))

error and traceback average votes 1.8235169363146095
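
The per-group averages in Questions 3 to 5 can also be read off a single `groupby`. The sketch below uses a made-up six-row frame named `toy` in place of `Questions`; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Toy stand-in for Questions with the two flag columns.
toy = pd.DataFrame({
    'QuestionScore': [4, 0, 2, 6, 1, 3],
    'HasError':      [True, True, False, False, True, False],
    'HasTraceback':  [True, False, True, False, True, False],
})

# One mean per (HasError, HasTraceback) combination.
by_flags = toy.groupby(['HasError', 'HasTraceback'])['QuestionScore'].mean()
print(by_flags)

# Marginal means, analogous to the "error" and "traceback" figures above.
error_mean = toy.loc[toy['HasError'], 'QuestionScore'].mean()
traceback_mean = toy.loc[toy['HasTraceback'], 'QuestionScore'].mean()
print(error_mean, traceback_mean)
```

The grouped result also includes the baseline of questions with neither flag, which is arguably a fairer comparison point than the overall average.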


# Overall average number of answers per question

In [0]:
QA = Questions.merge(Answers, how='left', left_index=True, right_on='ParentId')
QA.head(3)

Id,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,HasError,HasTraceback,AnswerUserId,AnswerCreateDate,ParentId,AnswerScore,AnswerBody
497,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,False,False,50.0,2008-08-02T16:56:53Z,469,4.0,<p>open up a terminal (Applications-&gt;Utilit...
518,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,False,False,153.0,2008-08-02T17:42:28Z,469,2.0,<p>I haven't been able to find anything that d...
3040,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,False,False,457.0,2008-08-06T03:01:23Z,469,12.0,<p>Unfortunately the only API that isn't depre...


In [0]:
AnswersPerQuestion = QA.groupby('ParentId').count()
AnswersPerQuestion.sort_values('AnswerCreateDate', ascending=False).head()

ParentId,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,HasError,HasTraceback,AnswerUserId,AnswerCreateDate,AnswerScore,AnswerBody
550632,55,55,55,55,55,55,55,54,55,55,55
312443,45,45,45,45,45,45,45,45,45,45,45
36932,43,43,43,43,43,43,43,42,43,43,43
89228,42,42,42,42,42,42,42,42,42,42,42
38987,37,37,37,37,37,37,37,37,37,37,37


In [0]:
total_average_answers = AnswersPerQuestion['AnswerCreateDate'].sum() / AnswersPerQuestion['AnswerCreateDate'].count()
print('total average answers ' + str(total_average_answers))

total average answers 1.625475479266634
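
For the average to be meaningful, questions with no answers have to count as zero rather than drop out. As a cross-check on the merge-and-count approach, here is a minimal sketch of an alternative that counts answers directly and then reindexes on the question IDs. The frames `questions` and `answers` are tiny invented stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins: three questions (indexed by Id) and the answers pointing at them.
questions = pd.DataFrame(
    {'QuestionTitle': ['a', 'b', 'c']},
    index=pd.Index([469, 502, 535], name='Id'),
)
answers = pd.DataFrame({'ParentId': [469, 469, 535], 'AnswerScore': [4, 2, 23]})

# Count answers per ParentId, then reindex on the question ids so that
# questions with no answers show up with a count of 0.
answer_counts = (
    answers['ParentId'].value_counts()
    .reindex(questions.index, fill_value=0)
)
print(answer_counts)
print('average answers per question', answer_counts.mean())
```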


# Question 6: Do questions with errors have more or fewer answers?

In [0]:
AnswersPerQuestionWithError = QA[QA['HasError']].groupby('ParentId').count()
hasError_average_answers = AnswersPerQuestionWithError['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithError['AnswerCreateDate'])
print('error average answers ' + str(hasError_average_answers))
AnswersPerQuestionWithOnlyError = QA[(QA['HasError'] & ~QA['HasTraceback'])].groupby('ParentId').count()
hasOnlyError_average_answers = AnswersPerQuestionWithOnlyError['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithOnlyError['AnswerCreateDate'])
print('only error average answers ' + str(hasOnlyError_average_answers))

error average answers 1.4572361109486318
only error average answers 1.4740924249642686


# Question 7: Do questions with tracebacks have more or fewer answers?

In [0]:
AnswersPerQuestionWithTraceback = QA[QA['HasTraceback']].groupby('ParentId').count()
hasTraceback_average_answers = AnswersPerQuestionWithTraceback['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithTraceback['AnswerCreateDate'])
print('traceback average answers ' + str(hasTraceback_average_answers))
AnswersPerQuestionWithOnlyTraceback = QA[(~QA['HasError'] & QA['HasTraceback'])].groupby('ParentId').count()
hasOnlyTraceback_average_answers = AnswersPerQuestionWithOnlyTraceback['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithOnlyTraceback['AnswerCreateDate'])
print('only traceback average answers ' + str(hasOnlyTraceback_average_answers))

only traceback average answers 1.3069343515211234


# Question 8: Do questions with errors and tracebacks have more or fewer answers?

In [0]:
AnswersPerQuestionWithErrorAndTraceback = QA[(QA['HasError'] & QA['HasTraceback'])].groupby('ParentId').count()
hasErrorAndTraceback_average_answers = AnswersPerQuestionWithErrorAndTraceback['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithErrorAndTraceback['AnswerCreateDate'])
print('error traceback average answers ' + str(hasErrorAndTraceback_average_answers))

error traceback average answers 1.4304368902623765


In [0]:
def purify_string(html):
    # removes line breaks and tags
    return re.sub('(\r\n)+|\r+|\n+', " ", re.sub('<[^<]+?>', '', html))

In [0]:
Questionsbodytext = Questions.loc[:, 'QuestionBody'].apply(purify_string)
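
Stripping tags leaves HTML entities such as `&gt;` in the text, and these would later be tokenized as words. A possible refinement, sketched here with a hypothetical helper named `purify_and_unescape` and an invented sample string, is to decode entities with the standard library's `html.unescape` after the tags are removed.

```python
import html
import re

TAG_RE = re.compile(r'<[^<]+?>')
NEWLINE_RE = re.compile(r'(\r\n)+|\r+|\n+')

def purify_and_unescape(raw_html):
    """Strip tags, decode HTML entities, and flatten line breaks (hypothetical helper)."""
    without_tags = TAG_RE.sub('', raw_html)
    # Unescape after removing tags so a decoded '<' or '>' is not mistaken for markup.
    decoded = html.unescape(without_tags)
    return NEWLINE_RE.sub(' ', decoded)

# Invented sample, not a row from the dataset.
sample = '<p>Use <code>a-&gt;b</code> here.</p>\n<p>Done.</p>'
cleaned = purify_and_unescape(sample)
print(cleaned)
```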

In [0]:
def combine_title_body(tnb):
    return tnb[0] + " " + tnb[1]

In [0]:
p = Pool(8)
Questionstext = p.map(combine_title_body, zip(Questions['QuestionTitle'], Questionsbodytext))
p.close()
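
Because each row only needs a string concatenation, the cost of shipping data to worker processes may outweigh any gain from the pool. A simpler option, sketched below on two invented rows, is to let pandas concatenate the two Series element-wise. The names `titles` and `bodies` stand in for `Questions['QuestionTitle']` and `Questionsbodytext`.

```python
import pandas as pd

# Invented rows sharing the same index, as the title and body Series do.
titles = pd.Series(['Get a preview', 'Iterate a result set'], index=[502, 594])
bodies = pd.Series(['I have an app', 'There are several ways'], index=[502, 594])

# Element-wise string concatenation, aligned on the shared index.
combined = titles + ' ' + bodies
print(combined.tolist())
```

This yields a Series rather than a list; calling `.tolist()` gives the same shape of result as the pool version.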

In [0]:
Questions[:1]

Id,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,HasError,HasTraceback
469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,False,False


In [0]:
lem = WordNetLemmatizer()
def cond_tokenize(t):
    if t is None:
        return []
    else:
        return [lem.lemmatize(w.lower()) for w in word_tokenize(t)]

p = Pool(8)
tokens = list(p.imap(cond_tokenize, Questionstext))
p.close()


In [0]:
pure_tokens = [" ".join(sent) for sent in tokens]

In [0]:
pure_tokens[:2]

In [0]:
vectorizer = TfidfVectorizer(max_features=2000, stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
tfidf = vectorizer.fit_transform(pure_tokens)

In [0]:
idfs = pd.DataFrame([[v, k] for k, v in vectorizer.vocabulary_.items()], columns=['id', 'word']).sort_values('id')
idfs['idf'] = vectorizer.idf_
idfs.sort_values('idf').head(10)
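
Sorting by `idf` in ascending order surfaces the most common terms, since a term that appears in many documents gets a low inverse document frequency. The sketch below works through that on three invented documents in plain Python. It assumes scikit-learn's default smoothed formula, `idf = ln((1 + n) / (1 + df)) + 1`, and skips stop-word removal and the `max_features` cut.

```python
import math

# Three invented documents, already lower-cased and space-separated.
docs = [
    "python list sort",
    "python dict key error",
    "python list index error",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = {}
for tokens in tokenized:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed idf, assuming scikit-learn's default smooth_idf=True formula.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Lowest idf first, mirroring sort_values('idf') above.
for term, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:6} df={doc_freq[term]} idf={value:.3f}")
```

In this toy corpus `python` appears in every document and so gets the minimum value of 1.0, while terms seen only once get the highest.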