# Analyzing Stack Exchange Data

We will explore and analyze a dataset containing questions from writers.stackexchange.com.

The data comes from the `writers.stackexchange.com.7z` archive, downloaded from https://archive.org/details/stackexchange

In [114]:
import sys

# Make the local ml_editor package importable from the notebook directory.
sys.path.append('../')
from ml_editor.data_ingestion import *

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display

# Show more rows, columns, and text per cell when displaying DataFrames.
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 70)
pd.set_option('display.max_colwidth', 100)

## Download Data

In [2]:
site = "writers"
writers = get_data_from_dump(site)

100%|██████████| 41717/41717 [00:32<00:00, 1265.54it/s]


## EDA

In [5]:
writers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41717 entries, 0 to 41716
Data columns (total 23 columns):
Id                       41717 non-null object
PostTypeId               41717 non-null object
AcceptedAnswerId         4971 non-null object
CreationDate             41717 non-null object
Score                    41717 non-null object
ViewCount                9674 non-null object
Body                     41717 non-null object
OwnerUserId              38833 non-null object
LastEditorUserId         13033 non-null object
LastEditorDisplayName    985 non-null object
LastEditDate             13941 non-null object
LastActivityDate         41717 non-null object
Title                    9674 non-null object
Tags                     9674 non-null object
AnswerCount              9674 non-null object
CommentCount             41717 non-null object
FavoriteCount            3981 non-null object
ClosedDate               1232 non-null object
ContentLicense           41717 non-null object
body_te

In [11]:
writers.sample(5)

Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastEditorUserId,LastEditorDisplayName,LastEditDate,LastActivityDate,Title,Tags,AnswerCount,CommentCount,FavoriteCount,ClosedDate,ContentLicense,body_text,ParentId,CommunityOwnedDate,OwnerDisplayName
24739,32428,2,,2018-01-09T20:54:43.880,9,,"<p>If I were writing, they would have to be su...",26047,,,,2018-01-09T20:54:43.880,,,,0,,,CC BY-SA 3.0,"If I were writing, they would have to be suspi...",32427.0,,
31631,39894,2,,2018-11-05T11:25:31.050,2,,<p>I would think that any age is ok - I'm thin...,33873,,,,2018-11-05T11:25:31.050,,,,0,,,CC BY-SA 4.0,I would think that any age is ok - I'm thinkin...,39841.0,,
7846,8887,2,,2013-09-12T22:17:28.880,9,,<p>The name on the book is a brand name. It's ...,272,272.0,,2013-09-12T22:31:16.073,2013-09-12T22:31:16.073,,,,0,,,CC BY-SA 3.0,The name on the book is a brand name. It's a s...,8880.0,,
22500,29950,1,29959.0,2017-08-28T15:05:53.993,18,4964.0,<p>I'm currently writing a tale with two prota...,10394,10394.0,,2017-08-28T16:16:44.223,2017-09-09T10:09:30.327,Two protagonists where one is dark - a mistake?,<characters><readers><protagonist>,6.0,11,2.0,,CC BY-SA 3.0,I'm currently writing a tale with two protagon...,,,
20266,27453,2,,2017-04-04T17:49:11.817,1,,<p>I face this issue a lot. I used to worry ab...,7968,,,,2017-04-04T17:49:11.817,,,,0,,,CC BY-SA 3.0,I face this issue a lot. I used to worry about...,27444.0,,


In [15]:
writers[writers["ViewCount"].notnull()]["ViewCount"].sample(5)

26787      50
15287    6885
7362     7060
29045     410
11979     171
Name: ViewCount, dtype: object
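
Note that `ViewCount` has dtype `object`, as does every column listed by `writers.info()`, which suggests the values were parsed as strings. A minimal sketch of converting such columns to numeric types follows. It uses a small hypothetical frame (`toy`, with made-up values) in place of `writers`, and the list of columns to convert is an assumption based on the `info()` output:

```python
import pandas as pd

# Hypothetical stand-in for `writers`: numbers stored as strings, some missing.
toy = pd.DataFrame({
    "Score": ["9", "2", "18"],
    "ViewCount": [None, "4964", "50"],
    "CommentCount": ["0", "11", "0"],
})

# Columns assumed to hold numbers (an assumption, not confirmed by the source).
numeric_cols = ["Score", "ViewCount", "CommentCount"]

# errors="coerce" turns unparseable values into NaN instead of raising.
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

Columns with missing values come out as floats, since NaN cannot be stored in a plain integer column.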

In [90]:
notnull_titles = writers[writers["Title"].notnull()]["Title"]
notnull_titles.sample(5)

17429    When Showing Over Telling Becomes Too Extravagant
39675            Will too many characters be overwhelming?
19390         How to show a brief hesitation around a word
28696    Do readers not like a book if it's too dark an...
34118    Consulting experts - why should they talk to s...
Name: Title, dtype: object

In [128]:
short_titles = notnull_titles[notnull_titles.str.len() < 20]
print(short_titles.shape)
short_titles.sample(5)


(175,)


33508     Reference of plots
36588     Am I a new writer?
9216     When opening a book
2494     English writers IDE
19238     Thriller sub-genre
Name: Title, dtype: object

In [124]:
no_body_text = writers[writers["body_text"].str.len() < 15]["body_text"]
print(no_body_text.shape)
no_body_text.sort_values(ascending=False).head(4)

(87,)


3375     \n
41195      
11794      
7502       
Name: body_text, dtype: object

In [121]:
body_text = writers[writers["body_text"].str.len() > 15]["body_text"]
print(body_text.shape)
body_text.sort_values(ascending=False).head(4)

(41630,)


17600    “Where do you get your inspiration?” \nThis is an often hated, and feared, Q author’s get. Their...
8343     “Out, Out—” has its morbid description of a young boy bleeding out and its underlying theme of d...
24097    “One should try to invite people from this world to eternity, from sin to obedience, from greedi...
10069    “If once a man indulges himself in murder, very soon he comes to think little of robbing; and fr...
Name: body_text, dtype: object

In [129]:
short_body_text = writers[(writers["body_text"].str.len() > 15) & (writers["body_text"].str.len() < 40)]["body_text"]
print(short_body_text.shape)
short_body_text.sort_values(ascending=False).head(4)

(53,)


2523       what about\n\nInformation IS Power\n\n
30953    it's called "as if told" first person.\n
2452                 help! I need somebody, help!
2658         You can write however you want to.\n
Name: body_text, dtype: object

The dataset has 41,717 posts. Some initial observations:

- `AcceptedAnswerId` is set for only 4,971 posts, so a significant number of those questions have no accepted answer.
- Only 9,674 posts have a `ViewCount`, so a significant number have either not been viewed or have no view data.
- Only 9,674 posts have a title, but all of them have a non-null `Body` (although 87 have fewer than 15 characters of `body_text`). This doesn't seem right. Shouldn't all posts have titles?
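
One way to probe the title question is to check whether missing titles line up with `PostTypeId`. The sample rows above hint that posts with `PostTypeId` 2 carry a `ParentId` and no title, which would be consistent with them being answers rather than questions. A minimal sketch of that check, on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for `writers`. Values are strings, as in the info() output.
# Assumption: PostTypeId "1" marks a question and "2" marks an answer.
toy = pd.DataFrame({
    "PostTypeId": ["1", "2", "2", "1", "2"],
    "Title": ["Two protagonists?", None, None, "Thriller sub-genre", None],
    "AcceptedAnswerId": ["29959", None, None, None, None],
})

# Share of non-null values in each column.
non_null_share = toy.notnull().mean()
print(non_null_share)

# Share of posts that have a title, split by post type.
title_share_by_type = toy["Title"].notnull().groupby(toy["PostTypeId"]).mean()
print(title_share_by_type)
```

Running the same two expressions on `writers` should show whether the missing titles are concentrated in a single post type.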