# Collect questions from Stack Overflow

This notebook demonstrates Python code for collecting `watson-studio` questions from [Stack Overflow](https://stackoverflow.com/questions/tagged/watson-studio) using the [Stack Exchange API](https://api.stackexchange.com/docs):

- Step 0: Add project token to notebook
- Step 1: Collect questions
- Step 2: Clean and format questions
- Step 3: Save questions in Watson Studio project

## Step 0: Add project token

To save the questions as .csv assets in the Watson Studio project, the notebook needs a _project token_.

Follow the steps in this topic: [Adding a project token](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data)

## Step 1: Collect questions

Define some useful routines:

In [2]:
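import calendar, datetime
import requests

def toEpochSeconds( date_str ):
    # Parse a YYYY-MM-DD date and return epoch seconds for midnight UTC.
    # The Stack Exchange API takes dates as Unix epoch time, so convert with
    # calendar.timegm (UTC) rather than time.mktime (local time zone).
    d = datetime.datetime.strptime( date_str, '%Y-%m-%d' )
    return calendar.timegm( d.timetuple() )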

def queryStackOverflow( start_date_str, end_date_str, tag ):
    # StackExchange API: https://api.stackexchange.com/docs/advanced-search
    params = { "tagged"   : tag,
               "fromdate" : toEpochSeconds( start_date_str ),
               "todate"   : toEpochSeconds( end_date_str ),
               "sort"     : "creation",
               "order"    : "desc",
               "site"     : "stackoverflow",
               "filter"   : "withbody"
             }
    resp = requests.get( 'https://api.stackexchange.com/2.2/search/advanced', params=params )
    resp.raise_for_status()  # stop here on an HTTP error instead of returning an error payload
    return resp.json()



Query Stack Overflow:

In [3]:
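first_date_str = "2019-08-01"
# Dates are converted to midnight at the start of the day, so questions
# created later on 2019-08-31 fall outside this range.
final_date_str = "2019-08-31"
tag            = "watson-studio"
questions = queryStackOverflow( first_date_str, final_date_str, tag )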
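
A single request to the search endpoint returns one page of results: the API's `pagesize` parameter defaults to 30 (maximum 100), and the response wrapper sets `has_more` when further pages exist. The sketch below shows one way to gather every page. The names `collect_all_items` and `fetch_page` are hypothetical, and the fake pages stand in for real API responses so the example runs offline.

```python
import time

def collect_all_items(fetch_page, max_pages=10):
    """Gather "items" from successive pages until "has_more" is false.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    """
    items = []
    for page in range(1, max_pages + 1):
        response = fetch_page(page)
        items.extend(response.get("items", []))
        if not response.get("has_more", False):
            break
        # The API can ask clients to wait by returning a "backoff" value in seconds
        time.sleep(response.get("backoff", 0))
    return items

# Made-up responses standing in for real API calls
fake_pages = {
    1: {"items": [{"question_id": 1}, {"question_id": 2}], "has_more": True},
    2: {"items": [{"question_id": 3}], "has_more": False},
}
all_items = collect_all_items(lambda page: fake_pages[page])
print(len(all_items))
```

With the real API, `fetch_page` could wrap a `requests.get` call whose parameters also include `"page": page` and `"pagesize": 100`.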

View results:

In [4]:
import json
print( json.dumps( questions, indent=3 ) )

{
   "items": [
      {
         "tags": [
            "python-3.x",
            "jupyter-notebook",
            "ibm-watson",
            "watson-studio"
         ],
         "owner": {
         },
         "is_answered": false,
         "view_count": 17,
         "answer_count": 0,
         "score": 0,
         "last_activity_date": 1566575719,
         "creation_date": 1566575719,
         "question_id": 57629588,
         "link": "https://stackoverflow.com/questions/57629588/importing-scripts-into-a-notebook-in-ibm-watson-studio",
         "title": "Importing scripts into a notebook in IBM WATSON STUDIO",
         "body": "<p>I am doing PCA on CIFAR 10 image on IBM WATSON Studio Free version so I uploaded the python file for downloading the CIFAR10 on the studio</p>\n\n<p>pic below. \n<a href=\"https://i.stack.imgur.com/eoLz7.jpg\" rel=\"nofollow noreferrer\"><img src=\"https://i.stack.imgur.com/eoLz7.jpg\" alt=\"enter image description here\"></a> </p>\n\n<p>But when I trying to i

## Step 2: Clean and format questions

Take the results from Stack Overflow and make them suitable for further processing, e.g. sending them to the IBM Watson Natural Language Classifier (NLC) service or the IBM Watson Natural Language Understanding (NLU) service.

Define some useful routines:

In [5]:
import re

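def cleanText( text ):
    # Raw strings (r"...") keep regex escapes such as \S and \s from being
    # treated as invalid Python string escapes.
    text_clean = text
    text_clean = re.sub( r"[\n\r]",  " ", text_clean )  # line breaks
    text_clean = re.sub( r"&\S+;",   "",  text_clean )  # HTML entities such as &amp;
    text_clean = re.sub( r"http\S+", " ", text_clean )  # URLs
    text_clean = re.sub( r"<.*?>",   " ", text_clean )  # HTML tags
    text_clean = re.sub( r"[^0-9a-zA-Z .?\-_]", " ", text_clean )  # other punctuation and symbols
    text_clean = re.sub( r"\s+",     " ", text_clean )  # collapse runs of whitespace
    text_clean = text_clean.strip()
    return text_clean[:999]  # keep at most 999 characters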

def extractQuestionsTxt( questions_json ):
    questions_txt = []
    for item in questions_json["items"]:
        org_title = item["title"]
        org_body  = item["body"]
        txt_title = cleanText( org_title )
        txt_body  = cleanText( org_body )
        questions_txt.append( { "tags"      : item["tags"],
                                "title_org" : org_title,
                                "title_txt" : txt_title,
                                "question_org"  : org_body,
                                "question_txt"  : txt_body } )
    return questions_txt

def printQuestions( questions_txt ):
    for question in questions_txt:
        print( "**" + question["title_txt"] + "**" )
        print( "TAGS: " + " | ".join( question["tags"] ) )
        print( '"' + question["question_txt"] + '"' )
        print( "\n\n" )
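
To make the effect of each substitution concrete, here is a minimal sketch that applies the same sequence of patterns to a made-up HTML fragment. The function name `clean_text_sketch` and the sample string are assumptions for illustration only.

```python
import re

def clean_text_sketch(text):
    """Apply the cleaning steps in order and return at most 999 characters."""
    text = re.sub(r"[\n\r]", " ", text)               # line breaks -> space
    text = re.sub(r"&\S+;", "", text)                 # HTML entities such as &amp;
    text = re.sub(r"http\S+", " ", text)              # URLs, up to the next whitespace
    text = re.sub(r"<.*?>", " ", text)                # HTML tags
    text = re.sub(r"[^0-9a-zA-Z .?\-_]", " ", text)   # any other punctuation
    text = re.sub(r"\s+", " ", text).strip()          # collapse and trim whitespace
    return text[:999]

# Made-up fragment in the style of a Stack Overflow question body
sample_html = (
    "<p>I can't load a file.</p>\n"
    '<p>See <a href="https://example.com/doc" rel="nofollow">the docs</a> &amp; help?</p>'
)
cleaned = clean_text_sketch(sample_html)
print(cleaned)  # I can t load a file. See the docs help?
```

The apostrophe in "can't" becomes a space because only letters, digits, spaces and the characters `. ? - _` are kept, which matches the "can t" visible in the cleaned question text.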


Create some _cleaned_ questions:

In [6]:
questions_txt = extractQuestionsTxt( questions )

View results:

In [7]:
printQuestions( questions_txt )

**Importing scripts into a notebook in IBM WATSON STUDIO**
TAGS: python-3.x | jupyter-notebook | ibm-watson | watson-studio
"I am doing PCA on CIFAR 10 image on IBM WATSON Studio Free version so I uploaded the python file for downloading the CIFAR10 on the studio pic below. But when I trying to import cache the following error is showing. pic below- After spending some time on google I find a solution but I can t understand it. link the solution is as follows - Click the Add Data icon Shows the Add Data icon and then browse the script file or drag it into your notebook sidebar. Click in an empty code cell in your notebook and then click the Insert to code link below the file. Take the returned string and write to a file in the file system that comes with the runtime session. To import the classes to access the methods in a script in your notebook use the following command For Python from python file name import class name I can t understand this line and write to a file in the file sys

## Step 3: Save cleaned questions

Define a handy function:

In [8]:
import pandas as pd

def createDataFrameCSV( question_arr ):
    question_list = []
    for question in question_arr:
        text = question["question_txt"]
        question_list.append( text )
    return pd.DataFrame( data={ "Questions" : question_list } ).to_csv()
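
When `to_csv()` is called without a path it returns the CSV text as a string, and by default it writes the row index as an unnamed first column. The sketch below compares the default with `index=False` on two made-up questions. The variable names are illustrative, and whether to keep the index column depends on how the file will be consumed.

```python
import pandas as pd

# Two made-up cleaned questions
sample = pd.DataFrame(data={"Questions": ["How do I load a file?",
                                          "Why does my import fail?"]})

csv_with_index = sample.to_csv()                # default: index column included
csv_without_index = sample.to_csv(index=False)  # only the "Questions" column

print(csv_with_index)
print(csv_without_index)
```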

Use `project_lib` to save the questions in a .csv file as an asset in the project. The `project` object used here is created by the project token cell added in Step 0:

In [9]:
project.save_data( "so_questions_watson-studio_2019-August.csv", createDataFrameCSV( questions_txt ), overwrite=True )

{'file_name': 'so_questions_watson-studio_2019-August.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'nluworkshopproj-donotdelete-pr-kxjtz2yxb1sovi',
 'asset_id': '7392c276-2725-481a-b6ae-172fc3b9dc3b'}