# Asking and Answering Questions on Stack Overflow
Stack Overflow has released official [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer) guides.

Why do we care about the quality of the questions asked? Where does that leave the pedagogical maxim "There is no such thing as a bad question"?

Although there is an educational aspect to Stack Overflow, at the end of the day it is a question-and-answer forum where developers get and give help. If you are asking for help, you want to maximize your chances of getting it. If you are helping, you want to maximize your chances of being effective. In this context, we could redefine "good" questions as questions likely to be answered, and "good" answers as answers likely to be accepted.

Stack Overflow applies an automatic quality filter to new questions, which it describes as follows:

> Sometimes users encounter the following message when posting a question:
>> 'This post does not meet our quality standards.'
>
> If you see this message, then your question was automatically blocked by the server. All new questions are subjected to a "minimum quality" filter that checks for some basic indicators of a good, complete question. Check to make sure that your question has the following:
> - A clear title.
> - A reasonable explanation of what your question is. Add as much detail as you can.
> - Any background research you've tried but wasn't enough to solve your problem.
> - Correct use of English spelling and grammar to the best of your ability.

However, every item on that list is quite subjective. Let's explore which features correlate with the votes and answers a question receives.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

Let's read in the data processed by `Text Processing.ipynb`.

In [87]:
Questions = pd.read_csv('LemmatizedQuestions.csv', encoding="ISO-8859-1")
Answers = pd.read_csv('LemmatizedAnswers.csv', encoding="ISO-8859-1")
Tags = pd.read_csv('pythonquestions/Tags.csv', encoding="ISO-8859-1")
Tags.columns = ['TID', 'Tag']

In [88]:
print(Questions.shape)
Questions.head()

(607282, 8)


Unnamed: 0,QID,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,QuestionTitleAndBody,QuestionTitleAndBodyLemmatized
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,How can I find the full path to a font from it...,"['How', 'can', 'I', 'find', 'the', 'full', 'pa..."
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,Get a preview JPEG of a PDF on Windows? <p>I h...,"['Get', 'a', 'preview', 'JPEG', 'of', 'a', 'PD..."
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,Continuous Integration System for a Python Cod...,"['Continuous', 'Integration', 'System', 'for',..."
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,cx_Oracle: How do I iterate over a result set?...,"['cx_Oracle', ':', 'How', 'do', 'I', 'iterate'..."
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,Using 'in' to match an attribute of Python obj...,"['Using', ""'in"", ""'"", 'to', 'match', 'an', 'at..."


In [89]:
print(Answers.shape)
Answers.head()

(987122, 7)


Unnamed: 0,AID,AnswerUserId,AnswerCreateDate,ParentId,AnswerScore,AnswerBody,AnswerBodyLemmatized
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...,"['open', 'up', 'a', 'terminal', '(', 'Applicat..."
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...,"['I', 'have', ""n't"", 'been', 'able', 'to', 'fi..."
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...,"['You', 'can', 'use', 'ImageMagick', ""'s"", 'co..."
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...,"['One', 'possibility', 'is', 'Hudson', '.', 'I..."
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B...","['We', 'run', 'Buildbot', '-', 'Trac', 'at', '..."


In [90]:
print(Tags.shape)
Tags.head()

(1885078, 2)


Unnamed: 0,Id,Tag
0,469,python
1,469,osx
2,469,fonts
3,469,photoshop
4,502,python


In [91]:
print(Questions.isnull().sum())
print(Answers.isnull().sum())
print(Tags.isnull().sum())

QID                                  0
QuestionUserId                    6212
QuestionCreateDate                   0
QuestionScore                        0
QuestionTitle                        0
QuestionBody                         0
QuestionTitleAndBody                 0
QuestionTitleAndBodyLemmatized       0
dtype: int64
AID                        0
AnswerUserId            5367
AnswerCreateDate           0
ParentId                   0
AnswerScore                0
AnswerBody                 0
AnswerBodyLemmatized       0
dtype: int64
Id       0
Tag    443
dtype: int64


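Looks like this data is pretty clean. The only nulls are in the user ID columns (probably users who have deleted their accounts since posting) and in the tag column. The 443 null tags are unlikely to be missing tags, since each row of the tags file exists to record one tag; a more plausible explanation is that tag names such as `null` or `nan` were parsed as missing values by `read_csv`.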
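To illustrate the parsing behavior behind that guess, here is a minimal sketch on a made-up three-row CSV (not the real `Tags.csv`). By default `pd.read_csv` converts strings such as `null` and `nan` to NaN, and `keep_default_na=False` keeps them as text. Whether this actually accounts for the 443 nulls is an assumption that would need checking against the raw file.

```python
import io
import pandas as pd

# Hypothetical miniature of Tags.csv; the real file is not reproduced here.
sample_csv = "Id,Tag\n1,python\n2,null\n3,nan\n"

# Default parsing: the strings 'null' and 'nan' become NaN.
default_tags = pd.read_csv(io.StringIO(sample_csv))
default_nulls = int(default_tags['Tag'].isnull().sum())

# keep_default_na=False keeps them as ordinary strings.
literal_tags = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)
literal_nulls = int(literal_tags['Tag'].isnull().sum())

print(default_nulls, literal_nulls)
```

If the real file behaves the same way, rereading it with `keep_default_na=False` should leave no null tags.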

### Question 1: How many users post questions, answers, or both?

In [92]:
User_id_inQ = Questions['QuestionUserId'].unique()
User_id_inA = Answers['AnswerUserId'].unique()
User_id_inBoth=set(User_id_inQ).intersection(User_id_inA)

In [93]:
print(str(len(User_id_inQ)) + ' users posting questions')
print(str(len(User_id_inA)) + ' users posting answers')
print(str(len(User_id_inBoth)) + ' users posting both')

213928 users posting questions
149177 users posting answers
63779 users posting both
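One caveat on these counts: both user ID columns contain nulls, and `unique()` reports NaN as a value of its own, so each of the first two totals may be one higher than the number of real users. The sketch below uses toy Series (hypothetical values, not the dataset) to show the difference and a NaN-safe way to compute the overlap.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Questions['QuestionUserId'] and Answers['AnswerUserId'].
question_user_ids = pd.Series([147.0, 147.0, 154.0, np.nan])
answer_user_ids = pd.Series([50.0, 154.0, np.nan, np.nan])

# unique() keeps NaN as a value; nunique() ignores it by default.
with_nan = len(question_user_ids.unique())
without_nan = question_user_ids.nunique()

# Drop NaN before building sets so missing IDs can never count as a shared user.
askers = set(question_user_ids.dropna())
answerers = set(answer_user_ids.dropna())
both = askers & answerers

print(with_nan, without_nan, len(both))
```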


In [94]:
# reduce memory and computation
# selected_ids = np.random.choice(range(Questions.shape[0]), 10000, replace=False)
# Questions = Questions.loc[selected_ids, :]

### Question 2: How many questions contain errors, tracebacks, or code?

In [96]:
Questions['HasError'] = Questions['QuestionBody'].str.contains(r"[A-Z][a-z]+Error:\s")

The following cell shows how the match count changes as the regex is built up toward the format of a Python error message:

In [97]:
print(
    Questions['QuestionBody'].str.contains("Error").sum(), 
    Questions['QuestionBody'].str.contains("Error:").sum(), 
    Questions['QuestionBody'].str.contains(r"Error:\s").sum(),
    Questions['QuestionBody'].str.contains(r"[a-z]+Error:\s").sum(),
    Questions['HasError'].sum()
)

118516 96055 95190 85601 85481
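To make the final pattern concrete, here is a small sketch that applies it to a few hand-written lines (these samples are illustrative and not taken from the dataset). It suggests the pattern catches CamelCase exception names but skips names whose prefix is all capitals, such as `IOError`.

```python
import re

# The same pattern used for HasError, written as a raw string.
error_pattern = re.compile(r"[A-Z][a-z]+Error:\s")

# Hand-written sample lines, not taken from the dataset.
samples = [
    "TypeError: unsupported operand type(s)",
    "ZeroDivisionError: division by zero",
    "IOError: [Errno 2] No such file or directory",
    "I keep getting an Error: message",
]

matches = [bool(error_pattern.search(s)) for s in samples]
print(matches)
```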


In [98]:
print(str(Questions['HasError'].sum() / Questions['HasError'].count()) + " of questions have a full Python-formatted error message")

0.14075997641952173 of questions have a full Python-formatted error message


In [99]:
Questions['HasTraceback'] = Questions['QuestionBody'].str.contains(r"Traceback \(most recent call last\):")

In [100]:
print(str(Questions['HasTraceback'].sum() / Questions['HasTraceback'].count()) + " of questions have a full Python-formatted traceback")

0.06771977433877506 of questions have a full Python-formatted traceback


In [101]:
Questions['HasCodeTag'] = Questions['QuestionBody'].str.contains("</code>")

In [102]:
print(str(Questions['HasCodeTag'].sum() / Questions['HasCodeTag'].count()) + " of questions have the code tag")

0.8603647070059709 of questions have the code tag


In [103]:
Questions['HasMultiLineCode'] = Questions['QuestionBody'].str.contains(r"<code>.*\n.*</code>")

In [104]:
print(str(Questions['HasMultiLineCode'].sum() / Questions['HasMultiLineCode'].count()) + " of questions have multi line code")

0.23783514084066382 of questions have multi line code
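A note on what this pattern measures: `.` does not match a newline, so `<code>.*\n.*</code>` only matches when exactly one newline sits between the tags, that is, two-line code blocks. The sketch below compares it with one possible alternative on hand-written HTML snippets (hypothetical, not from the dataset). The alternative assumes code text contains no raw `<` characters, which holds when the HTML escapes them as `&lt;`.

```python
import pandas as pd

# Hand-written HTML snippets standing in for QuestionBody values.
bodies = pd.Series([
    "<p>Use <code>len(x)</code></p>",                    # inline code only
    "<pre><code>a = 1\nb = 2</code></pre>",              # two lines
    "<pre><code>a = 1\nb = 2\nc = a + b</code></pre>",   # three lines
])

# Pattern from the cell above: '.' stops at newlines, so exactly one '\n' is allowed.
original = bodies.str.contains(r"<code>.*\n.*</code>")

# Alternative: any run of non-'<' characters containing at least one newline.
alternative = bodies.str.contains(r"<code>[^<]*\n[^<]*</code>")

print(original.tolist(), alternative.tolist())
```

Using `[^<]*` rather than a dot-matches-newline flag keeps a match from stretching across the end of one code element into the next.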


In [105]:
Questions.head()

Unnamed: 0,QID,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,QuestionTitleAndBody,QuestionTitleAndBodyLemmatized,HasError,HasTraceback,HasCodeTag,HasMultiLineCode
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,How can I find the full path to a font from it...,"['How', 'can', 'I', 'find', 'the', 'full', 'pa...",False,False,False,False
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,Get a preview JPEG of a PDF on Windows? <p>I h...,"['Get', 'a', 'preview', 'JPEG', 'of', 'a', 'PD...",False,False,False,False
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,Continuous Integration System for a Python Cod...,"['Continuous', 'Integration', 'System', 'for',...",False,False,False,False
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,cx_Oracle: How do I iterate over a result set?...,"['cx_Oracle', ':', 'How', 'do', 'I', 'iterate'...",False,False,False,False
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,Using 'in' to match an attribute of Python obj...,"['Using', ""'in"", ""'"", 'to', 'match', 'an', 'at...",False,False,True,False


### Overall average votes per question

In [106]:
total_average_votes = Questions['QuestionScore'].sum() / len(Questions['QuestionScore'])
print('total average votes ' + str(total_average_votes))

total average votes 2.2831369940159596


### Question 3: Do questions with errors have more or fewer votes?

In [107]:
QuestionsWithError = Questions[Questions['HasError']]
hasError_average_votes = QuestionsWithError['QuestionScore'].sum() / len(QuestionsWithError['QuestionScore'])
print('error average votes ' + str(hasError_average_votes))
QuestionsWithOnlyError = Questions[(Questions['HasError'] & ~Questions['HasTraceback'])]
hasOnlyError_average_votes = QuestionsWithOnlyError['QuestionScore'].sum() / len(QuestionsWithOnlyError['QuestionScore'])
print('only error average votes ' + str(hasOnlyError_average_votes))

error average votes 1.7215053637650473
only error average votes 1.6573415912339209


### Question 4: Do questions with tracebacks have more or fewer votes?

In [108]:
QuestionsWithTraceback = Questions[Questions['HasTraceback']]
hasTraceback_average_votes = QuestionsWithTraceback['QuestionScore'].sum() / len(QuestionsWithTraceback['QuestionScore'])
print('traceback average votes ' + str(hasTraceback_average_votes))
QuestionsWithOnlyTraceback = Questions[(~Questions['HasError'] & Questions['HasTraceback'])]
hasOnlyTraceback_average_votes = QuestionsWithOnlyTraceback['QuestionScore'].sum() / len(QuestionsWithOnlyTraceback['QuestionScore'])
print('only traceback average votes ' + str(hasOnlyTraceback_average_votes))

traceback average votes 1.8253617021276596
only traceback average votes 1.8328611898016998


### Question 5: Do questions with errors and tracebacks have more or fewer votes?

In [109]:
QuestionsWithErrorAndTraceback = Questions[(Questions['HasError'] & Questions['HasTraceback'])]

In [110]:
QuestionsWithErrorAndTraceback = QuestionsWithError[QuestionsWithError['HasTraceback']]
hasErrorAndTraceback_average_votes = QuestionsWithErrorAndTraceback['QuestionScore'].sum() / len(QuestionsWithErrorAndTraceback['QuestionScore'])
print('error and traceback average votes ' + str(hasErrorAndTraceback_average_votes))

error and traceback average votes 1.8235169363146095
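The comparisons in Questions 3 to 5 can also be produced in one step by grouping on both flags. This is a minimal sketch on a toy frame (the scores are made up), assuming the same column names as the `Questions` frame above; it reports the mean score and the group size for each combination.

```python
import pandas as pd

# Toy stand-in for the Questions frame with the flag columns built above.
toy_questions = pd.DataFrame({
    'QuestionScore': [4, 2, 0, 2, 1, 3],
    'HasError':      [False, False, True, True, True, False],
    'HasTraceback':  [False, False, False, True, True, True],
})

# Mean score and group size for each (HasError, HasTraceback) combination.
summary = (
    toy_questions
    .groupby(['HasError', 'HasTraceback'])['QuestionScore']
    .agg(['mean', 'count'])
)
print(summary)
```

Seeing the group sizes next to the means helps judge how much weight each average deserves.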


### Overall average number of answers per question

In [111]:
QA = Questions.merge(Answers, how='left', left_on='QID', right_on='ParentId')
QA.head(3)

Unnamed: 0,QID,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,QuestionTitleAndBody,QuestionTitleAndBodyLemmatized,HasError,HasTraceback,HasCodeTag,HasMultiLineCode,AID,AnswerUserId,AnswerCreateDate,ParentId,AnswerScore,AnswerBody,AnswerBodyLemmatized
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,How can I find the full path to a font from it...,"['How', 'can', 'I', 'find', 'the', 'full', 'pa...",False,False,False,False,497.0,50.0,2008-08-02T16:56:53Z,469.0,4.0,<p>open up a terminal (Applications-&gt;Utilit...,"['open', 'up', 'a', 'terminal', '(', 'Applicat..."
1,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,How can I find the full path to a font from it...,"['How', 'can', 'I', 'find', 'the', 'full', 'pa...",False,False,False,False,518.0,153.0,2008-08-02T17:42:28Z,469.0,2.0,<p>I haven't been able to find anything that d...,"['I', 'have', ""n't"", 'been', 'able', 'to', 'fi..."
2,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,How can I find the full path to a font from it...,"['How', 'can', 'I', 'find', 'the', 'full', 'pa...",False,False,False,False,3040.0,457.0,2008-08-06T03:01:23Z,469.0,12.0,<p>Unfortunately the only API that isn't depre...,"['Unfortunately', 'the', 'only', 'API', 'that'..."


In [112]:
AnswersPerQuestion = QA.groupby('ParentId').count()
AnswersPerQuestion.sort_values('AnswerCreateDate', ascending=False).head()

ParentId,QID,QuestionUserId,QuestionCreateDate,QuestionScore,QuestionTitle,QuestionBody,QuestionTitleAndBody,QuestionTitleAndBodyLemmatized,HasError,HasTraceback,HasCodeTag,HasMultiLineCode,AID,AnswerUserId,AnswerCreateDate,AnswerScore,AnswerBody,AnswerBodyLemmatized
550632.0,55,55,55,55,55,55,55,55,55,55,55,55,55,54,55,55,55,55
312443.0,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45
36932.0,43,43,43,43,43,43,43,43,43,43,43,43,43,42,43,43,43,43
89228.0,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42
18686860.0,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37


In [113]:
total_average_answers = AnswersPerQuestion['AnswerCreateDate'].sum() / AnswersPerQuestion['AnswerCreateDate'].count()
print('total average answers ' + str(total_average_answers))

total average answers 1.8305868651689978
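Note that this figure is an average over questions that received at least one answer: after the left merge, unanswered questions have a NaN `ParentId`, and `groupby` drops NaN keys by default. The sketch below, on toy frames with hypothetical IDs, shows one way to include unanswered questions as zeros.

```python
import pandas as pd

# Toy stand-ins: three questions, one of which (QID 3) has no answers.
toy_questions = pd.DataFrame({'QID': [1, 2, 3]})
toy_answers = pd.DataFrame({'AID': [10, 11, 12], 'ParentId': [1, 1, 2]})

# Count answers per parent question.
answer_counts = toy_answers.groupby('ParentId').size()

# Average over answered questions only (what grouping on ParentId yields).
mean_answered_only = answer_counts.mean()

# Average over all questions: map counts onto QID and fill gaps with zero.
counts_all = toy_questions['QID'].map(answer_counts).fillna(0)
mean_all_questions = counts_all.mean()

print(mean_answered_only, mean_all_questions)
```

The same caveat applies to the per-group answer averages in Questions 6 to 8, which also group on `ParentId`.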


### Question 6: Do questions with errors have more or fewer answers?

In [114]:
AnswersPerQuestionWithError = QA[QA['HasError']].groupby('ParentId').count()
hasError_average_answers = AnswersPerQuestionWithError['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithError['AnswerCreateDate'])
print('error average answers ' + str(hasError_average_answers))
AnswersPerQuestionWithOnlyError = QA[(QA['HasError'] & ~QA['HasTraceback'])].groupby('ParentId').count()
hasOnlyError_average_answers = AnswersPerQuestionWithOnlyError['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithOnlyError['AnswerCreateDate'])
print('only error average answers ' + str(hasOnlyError_average_answers))

error average answers 1.6628310551046561
only error average answers 1.671052063080579


### Question 7: Do questions with tracebacks have more or fewer answers?

In [115]:
AnswersPerQuestionWithTraceback = QA[QA['HasTraceback']].groupby('ParentId').count()
hasTraceback_average_answers = AnswersPerQuestionWithTraceback['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithTraceback['AnswerCreateDate'])
print('traceback average answers ' + str(hasTraceback_average_answers))
AnswersPerQuestionWithOnlyTraceback = QA[(~QA['HasError'] & QA['HasTraceback'])].groupby('ParentId').count()
hasOnlyTraceback_average_answers = AnswersPerQuestionWithOnlyTraceback['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithOnlyTraceback['AnswerCreateDate'])
print('only traceback average answers ' + str(hasOnlyTraceback_average_answers))

only traceback average answers 1.600452488687783


### Question 8: Do questions with errors and tracebacks have more or fewer answers?

In [116]:
AnswersPerQuestionWithErrorAndTraceback = QA[(QA['HasError'] & QA['HasTraceback'])].groupby('ParentId').count()
hasErrorAndTraceback_average_answers = AnswersPerQuestionWithErrorAndTraceback['AnswerCreateDate'].sum() / len(AnswersPerQuestionWithErrorAndTraceback['AnswerCreateDate'])
print('error traceback average answers ' + str(hasErrorAndTraceback_average_answers))

error traceback average answers 1.6495353224792118


## Preparing text for NLP

Let's also merge in the tags for the questions.

In [117]:
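# Tags may still carry its original 'Id' column; rename is a no-op if it is already 'TID'.
QAT = QA.merge(Tags.rename(columns={'Id': 'TID'}), how='left', left_on='QID', right_on='TID')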

In [None]:
QAT.head()

In [None]:
# QAT.to_csv('QAT.csv', index=False)
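Because the tags table holds one row per tag, merging it directly onto `QA` repeats every question/answer row once per tag. If one row per answer is preferred, a possible alternative is to collapse the tags into a list per question first. This is a sketch on toy frames; the `Tags` list column and the `toy_` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for QA (one row per question/answer pair) and Tags (one row per tag).
toy_qa = pd.DataFrame({'QID': [469, 469, 502], 'AID': [497, 518, 536]})
toy_tags = pd.DataFrame({
    'TID': [469, 469, 502],
    'Tag': ['python', 'osx', 'python'],
})

# Collapse tags to one list per question so the merge keeps one row per answer.
tags_per_question = (
    toy_tags.dropna(subset=['Tag'])
    .groupby('TID')['Tag']
    .agg(list)
    .rename('Tags')
    .reset_index()
)

toy_qat = toy_qa.merge(tags_per_question, how='left', left_on='QID', right_on='TID')
print(toy_qat[['QID', 'AID', 'Tags']])
```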