# Loading the dataset

First, I load the dataset into a pandas DataFrame.

In [2]:
import pandas as pd
from bs4 import BeautifulSoup

ORIGINAL_DATASET = "../data/python_answers.csv"
TARGET_DATASET = "../data/python_answers_new.csv"

data = pd.read_csv(ORIGINAL_DATASET, encoding='latin1')
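
A note on `encoding='latin1'`: Latin-1 maps every byte to a character, so decoding never raises an error, but a file that is actually UTF-8 would load with garbled non-ASCII characters. The sketch below illustrates the effect on a made-up in-memory CSV. Whether the real dataset is affected is not something this notebook checks, so treat it as an illustration only.

```python
import io

import pandas as pd

# Hypothetical one-row CSV encoded as UTF-8 (not taken from the real dataset).
raw = "Id,Body\n1,café\n".encode("utf-8")

as_latin1 = pd.read_csv(io.BytesIO(raw), encoding="latin1")
as_utf8 = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# Under latin1, the two UTF-8 bytes of 'é' are decoded as two separate characters.
print(as_latin1.loc[0, "Body"])
# Under utf-8, the original text comes back unchanged.
print(as_utf8.loc[0, "Body"])
```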

In [3]:
data.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B..."


# Getting the content of the body

Next, I define a function, `get_string`, to extract the text content of the HTML-based `Body` field. I use `BeautifulSoup` to collect the text of every `<p>` element, so anything outside a paragraph (such as a `<pre>` code block) is left out.

In [4]:
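def get_string(text):
    """Return the concatenated text of all <p> elements in an HTML string."""
    soup = BeautifulSoup(text, 'html.parser')
    parts = []
    for p in soup.find_all('p'):
        # .string is None for a child tag that has several children of its own,
        # so such a child contributes an empty string.
        parts.extend(child.string or "" for child in p.children)
    return "".join(parts)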
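
To make the behaviour of `get_string` concrete, here is a small sketch on two made-up HTML snippets. It also shows one limitation: a child tag with more than one child of its own has `.string == None`, so its text is dropped. `Tag.get_text()` is shown only for comparison and is not what the notebook uses.

```python
from bs4 import BeautifulSoup


def get_string(text):
    # Same logic as the notebook's function, repeated so the sketch runs on its own.
    soup = BeautifulSoup(text, 'html.parser')
    lst = []
    for p in soup.find_all('p'):
        lst.extend(list(map(lambda x: x.string if x.string else "", p.children)))
    return "".join(lst)


# Made-up examples, not rows from the dataset.
simple = '<p>Use <a href="#">this link</a> now.</p><pre><code>print(1)</code></pre>'
nested = '<p>See <b>bold <i>and italic</i></b> text</p>'

print(get_string(simple))  # Use this link now.
print(get_string(nested))  # See  text   (the <b> element's text is lost)

# For comparison: get_text() walks the whole subtree of the paragraph.
print(BeautifulSoup(nested, 'html.parser').p.get_text())  # See bold and italic text
```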

# Checking if code blocks are present

After that, I define a function, `is_code_present`, to check whether a post contains a code block. On Stack Overflow, code blocks are wrapped in `<pre>` tags.

In [5]:
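def is_code_present(text):
    """Return 1 if the HTML contains a <pre> block, else 0."""
    soup = BeautifulSoup(text, 'html.parser')
    # soup.pre is the first <pre> tag in the document, or None if there is none.
    return 1 if soup.pre else 0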
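
A quick sketch of what this check does and does not count, using two made-up snippets: a post with a `<pre>` block is flagged, while a post that only uses inline `<code>` is not.

```python
from bs4 import BeautifulSoup


def is_code_present(text):
    # Same logic as the notebook's function, repeated so the sketch runs on its own.
    soup = BeautifulSoup(text, 'html.parser')
    return 1 if soup.pre else 0


# Made-up examples, not rows from the dataset.
with_block = "<p>Try this:</p><pre><code>print('hi')</code></pre>"
inline_only = "<p>Call <code>print</code> directly.</p>"

print(is_code_present(with_block))   # 1
print(is_code_present(inline_only))  # 0
```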

# Running the functions

Next, we apply the two functions to the pandas DataFrame and save the result to a new CSV file.

In [6]:
data['full_text'] = data['Body'].apply(get_string)

data['is_code_present'] = data['Body'].apply(is_code_present)

In [18]:
data.to_csv(TARGET_DATASET, index=False)
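
One thing to keep in mind when the saved file is read back: a post with no `<p>` text gets an empty `full_text`, and by default `read_csv` turns empty fields into `NaN`. The sketch below shows this on a made-up frame, using an in-memory buffer instead of a real file. Passing `keep_default_na=False` is one possible way to keep the empty strings, assuming no other column relies on the default missing-value handling.

```python
import io

import pandas as pd

# Made-up frame: the second post has no <p> text, so full_text is empty.
df = pd.DataFrame({
    "Id": [1, 2],
    "full_text": ["some text", ""],
    "is_code_present": [0, 1],
})

csv_text = df.to_csv(index=False)  # with no path argument, to_csv returns a string

default_read = pd.read_csv(io.StringIO(csv_text))
kept_empty = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["full_text"].isna().tolist())  # [False, True]
print(kept_empty["full_text"].tolist())           # ['some text', '']
```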

In [7]:
data.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,full_text,is_code_present
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...,open up a terminal (Applications->Utilities->T...,1
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...,I haven't been able to find anything that does...,0
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...,You can use ImageMagick's convert utility for ...,1
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...,One possibility is Hudson. It's written in Ja...,0
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B...","We run Buildbot - Trac at work, I haven't used...",0


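As a sanity check, let's look at an arbitrary row (index 1024). The code inside the `<pre>` block should be missing from `full_text`.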

In [8]:
print(data.loc[1024, 'Body'])
print('###################################')
print(data.loc[1024, 'full_text'])

<p>In addition to calling os._exit() to avoid the registered exit handler you also need to catch the unhandled exception:</p>

<pre><code>import atexit
import os

def helloworld():
    print "Hello World!"

atexit.register(helloworld)    

try:
    raise Exception("Good bye cruel world!")

except Exception, e:
    print 'caught unhandled exception', str(e)

    os._exit(1)
</code></pre>

###################################
In addition to calling os._exit() to avoid the registered exit handler you also need to catch the unhandled exception:
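
The row above was picked by hand. If a genuinely random but reproducible pick is preferred, `DataFrame.sample` with a fixed `random_state` is one option. The frame and the seed below are placeholders for illustration.

```python
import pandas as pd

# Made-up stand-in for the processed answers frame.
df = pd.DataFrame({
    "Body": ["<p>first</p>", "<p>second</p><pre>x = 1</pre>", "<p>third</p>"],
    "full_text": ["first", "second", "third"],
})

# random_state fixes the seed, so the same row is drawn on every run.
row = df.sample(n=1, random_state=0).iloc[0]

print(row["Body"])
print('###################################')
print(row["full_text"])
```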


# Applying the same to the question dataset

Finally, we apply the same functions to the question dataset.

In [9]:
ORIGINAL_DATASET = "../data/python_questions.csv"
TARGET_DATASET = "../data/python_questions_new.csv"

data_question = pd.read_csv(ORIGINAL_DATASET, encoding='latin1')

In [10]:
data_question.head()

Unnamed: 0.1,Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,Number of Answers,Tags
0,0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,4.0,"['python', 'osx', 'fonts', 'photoshop']"
1,1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,3.0,"['python', 'windows', 'image', 'pdf']"
2,2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,7.0,"['python', 'continuous-integration', 'extreme-..."
3,3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,3.0,"['python', 'sql', 'database', 'oracle', 'cx-or..."
4,4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,8.0,"['python', 'arrays', 'iteration']"


In [None]:
data_question['full_text'] = data_question['Body'].apply(get_string)

data_question['is_code_present'] = data_question['Body'].apply(is_code_present)
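
Since the same two columns are added to both datasets, the step could be wrapped in a small helper. The name `add_text_features` and its `body_col` parameter are hypothetical and not part of the original notebook. This is only a sketch of one way to avoid repeating the two `apply` calls.

```python
import pandas as pd
from bs4 import BeautifulSoup


def get_string(text):
    # Same logic as the notebook's function.
    soup = BeautifulSoup(text, 'html.parser')
    lst = []
    for p in soup.find_all('p'):
        lst.extend(list(map(lambda x: x.string if x.string else "", p.children)))
    return "".join(lst)


def is_code_present(text):
    # Same logic as the notebook's function.
    soup = BeautifulSoup(text, 'html.parser')
    return 1 if soup.pre else 0


def add_text_features(df, body_col="Body"):
    """Hypothetical helper: return a copy of df with the two derived columns."""
    out = df.copy()
    out["full_text"] = out[body_col].apply(get_string)
    out["is_code_present"] = out[body_col].apply(is_code_present)
    return out


# Made-up frame standing in for either dataset.
demo = pd.DataFrame({"Body": ["<p>Hello</p>", "<p>Run:</p><pre>ls</pre>"]})
result = add_text_features(demo)
print(result[["full_text", "is_code_present"]])
```

Because the helper works on a copy, the input frame is left unchanged.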

In [12]:
data_question.head()

Unnamed: 0.1,Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,Number of Answers,Tags,full_text,is_code_present
0,0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,4.0,"['python', 'osx', 'fonts', 'photoshop']",I am using the Photoshop's javascript API to f...,0
1,1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,3.0,"['python', 'windows', 'image', 'pdf']",I have a cross-platform (Python) application w...,0
2,2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,7.0,"['python', 'continuous-integration', 'extreme-...",I'm starting work on a hobby project with a py...,0
3,3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,3.0,"['python', 'sql', 'database', 'oracle', 'cx-or...",There are several ways to iterate over a resul...,0
4,4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,8.0,"['python', 'arrays', 'iteration']",I don't remember whether I was dreaming or not...,1


In [13]:
print(data_question.loc[1024, 'Body'])
print('###################################')
print(data_question.loc[1024, 'full_text'])

<p>I started off programming in Basic on the <a href="http://en.wikipedia.org/wiki/ZX81" rel="nofollow">ZX81</a>, then <a href="http://en.wikipedia.org/wiki/IBM_BASICA" rel="nofollow">BASICA</a>, <a href="http://en.wikipedia.org/wiki/GW-BASIC" rel="nofollow">GW-BASIC</a>, and <a href="http://en.wikipedia.org/wiki/QBasic" rel="nofollow">QBasic</a>.  I moved on to C (Ah, Turbo C 3.1, I hardly knew ye...)</p>

<p>When I got started in microcontrollers I regressed with the <a href="http://en.wikipedia.org/wiki/BASIC_Stamp" rel="nofollow">BASIC Stamp</a> from Parallax.  However, BASIC is/was awesome because it was so easy to understand and so hard to make a mistake.  I moved on to assembly and C eventually because I needed the additional power (speed, capacity, resources, etc.), but I know that if the bar was much higher many people would never get into programming microcontrollers.</p>

<p>I keep getting an itch to make my own on-chip BASIC interpretor, but I wonder if there's need for BAS

In [14]:
data_question.head()

Unnamed: 0.1,Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,Number of Answers,Tags,full_text,is_code_present
0,0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,4.0,"['python', 'osx', 'fonts', 'photoshop']",I am using the Photoshop's javascript API to f...,0
1,1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,3.0,"['python', 'windows', 'image', 'pdf']",I have a cross-platform (Python) application w...,0
2,2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,7.0,"['python', 'continuous-integration', 'extreme-...",I'm starting work on a hobby project with a py...,0
3,3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,3.0,"['python', 'sql', 'database', 'oracle', 'cx-or...",There are several ways to iterate over a resul...,0
4,4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,8.0,"['python', 'arrays', 'iteration']",I don't remember whether I was dreaming or not...,1


The `Unnamed: 0` column is unnecessary (in the rows shown, it just repeats the row number), so let's delete it.

In [15]:
del data_question['Unnamed: 0']
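
A column named `Unnamed: 0` typically appears when a frame was saved with its index and then read back. The notebook does not say how the question file was produced, so that origin is an assumption. The sketch reproduces the effect on a made-up frame and shows two alternatives to `del`.

```python
import io

import pandas as pd

# Made-up frame, not the real dataset.
df = pd.DataFrame({"Id": [469, 502], "Score": [21, 27]})

# With the default index=True, to_csv writes the index as an unnamed first column.
csv_with_index = df.to_csv()

reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'Id', 'Score']

# Option 1: treat that first column as the index when reading.
as_index = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: drop it afterwards; errors='ignore' makes the call safe to re-run.
dropped = reloaded.drop(columns=["Unnamed: 0"], errors="ignore")

print(list(as_index.columns))  # ['Id', 'Score']
print(list(dropped.columns))   # ['Id', 'Score']
```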

In [16]:
data_question.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,Number of Answers,Tags,full_text,is_code_present
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,4.0,"['python', 'osx', 'fonts', 'photoshop']",I am using the Photoshop's javascript API to f...,0
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,3.0,"['python', 'windows', 'image', 'pdf']",I have a cross-platform (Python) application w...,0
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,7.0,"['python', 'continuous-integration', 'extreme-...",I'm starting work on a hobby project with a py...,0
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,3.0,"['python', 'sql', 'database', 'oracle', 'cx-or...",There are several ways to iterate over a resul...,0
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,8.0,"['python', 'arrays', 'iteration']",I don't remember whether I was dreaming or not...,1


In [19]:
data_question.to_csv(TARGET_DATASET, index=False)