<a href="https://colab.research.google.com/github/sielerod/search_stackoverflow/blob/master/Read_Stackoverflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Objective:**   
* Capture the most frequent questions about Python on Stack Overflow
* Store for each question: link, brief description, number of votes and views, full question text, and best-rated answers


**Source:** https://stackoverflow.com/questions/


In [153]:
import requests # Getting Webpage content
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup as bs # Scraping webpages
from time import sleep
import pandas as pd

#Reading the raw data from Stack Overflow
**read_stackoverflow_overview(tags=[], tab='Frequent', pages=5)**

Reads the summary listing of the most frequent Stack Overflow questions, based on a few search parameters.

Returns a list with one entry per page read. Each entry is the result of `soup.select` for that page, that is, the list of question-summary elements found on it.

* tags: optional list of strings with the tags to filter on, e.g. ['python', 'php', 'javascript']
>Example URL for a page with more than one tag: https://stackoverflow.com/questions/tagged/sql+sql-server?tab=Frequent

* tab: string with the sort order to apply. One of:
'Frequent' (default), 'Votes', 'Unanswered', 'Bounties', 'Active', 'Newest'

* selector: CSS class of the HTML elements to return. It is fixed inside the function to question-summary and is not a parameter

* pages: number of pages to read (default 5)
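
The URL layout described above can be tried in isolation. This is a minimal sketch rather than the notebook's own code: `build_overview_url`, `BASE_URL`, and `VALID_TABS` are hypothetical names introduced here, and the tab list is simply the one given above.

```python
BASE_URL = 'https://stackoverflow.com/questions'
# Sort orders listed in the parameter description above
VALID_TABS = ('Frequent', 'Votes', 'Unanswered', 'Bounties', 'Active', 'Newest')

def build_overview_url(tags=(), tab='Frequent', page=1):
  """Return the listing URL for the given tags, sort order and page number."""
  if tab not in VALID_TABS:
    raise ValueError('unknown tab: ' + tab)
  url = BASE_URL
  if tags:
    # Several tags are joined with '+', e.g. /tagged/sql+sql-server
    url += '/tagged/' + '+'.join(tags)
  return url + '?tab=' + tab + '&page=' + str(page)

print(build_overview_url(['sql', 'sql-server']))
print(build_overview_url(page=2))
```

Keeping the URL construction in its own function makes it possible to check the generated links without sending any request.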



In [8]:
def read_stackoverflow_overview(tags=[], tab='Frequent', pages=5):
  link = 'https://stackoverflow.com/questions'
  selector = 'question-summary'

  # Multiple tags are joined with '+', e.g. /tagged/python+django
  if tags:
    link += '/tagged/' + '+'.join(tags)

  link += '?tab=' + tab

  soup_selection = []
  for page in range(1, pages + 1):
    page_link = '&page=' + str(page)

    try:
      request = requests.get(link + page_link)
      request.raise_for_status()
      try:
        soup = bs(request.text, 'html.parser')
        # One list of question-summary elements per page
        soup_selection.append(soup.select('.' + selector))
      except Exception:
        print("Could not transform to soup object by selecting ", selector)
    except HTTPError:
      print("Could not download page ", page)

    sleep(0.05)  # short pause between requests

  return soup_selection


In [9]:
questions_overview_raw = read_stackoverflow_overview(tags=['python','django'],tab='Frequent',pages=2)
type(questions_overview_raw)

list

In [None]:
questions_overview_raw

#Transforming the raw Stack Overflow data into a dictionary
**questions_overview(questions_overview_raw)**

The dictionary should hold an overview of the Stack Overflow questions, with:

* link
* brief description (stored under the key `summary`)
* votes (stored under the key `vote_count`)
* views

###Analysis of the HTML page pattern to capture the relevant information:

Inside "question-summary", the relevant pieces of information are:

1.   class = statscontainer, with:
*   Vote count in class="vote-count-post"
>```<span class="vote-count-post high-scored-post"><strong>2473</strong></span>```

*   Number of answers in class="status answered-accepted" 
>```<div class="status answered-accepted"><strong>23</strong>answers</div>```

*   View count in class="views supernova": abbreviated in the element text, exact in the *title* attribute 
>```<div class="views supernova" title="307,292 views">307k views</div>```

2.   class = summary, with:
* class="question-hyperlink", whose *href* holds the path used to build the link to the question's detail page, and whose text is the question title
>``` <a href="/questions/15112125/how-to-test-multiple-variables-against-a-value" class="question-hyperlink">How to test multiple variables against a value?</a>```

*   Brief summary in class="excerpt"
>```<div class="excerpt"> brief description of the question ...</div>```

*   Tags in class="post-tag"
>```<a href="/questions/tagged/python" class="post-tag" title="show questions tagged 'python'" rel="tag">python</a>```
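
The selectors listed above can be tried on a small hand-written fragment before running them on a live page. The fragment below is assembled from the example tags shown above; how they nest is an assumption, and the real page contains more markup.

```python
from bs4 import BeautifulSoup as bs

# Hand-written fragment built from the example tags above (assumed nesting)
sample_html = '''
<div class="question-summary">
  <div class="statscontainer">
    <span class="vote-count-post high-scored-post"><strong>2473</strong></span>
    <div class="status answered-accepted"><strong>23</strong>answers</div>
    <div class="views supernova" title="307,292 views">307k views</div>
  </div>
  <div class="summary">
    <a href="/questions/15112125/how-to-test-multiple-variables-against-a-value" class="question-hyperlink">How to test multiple variables against a value?</a>
    <div class="excerpt"> brief description of the question ...</div>
    <a href="/questions/tagged/python" class="post-tag" title="show questions tagged 'python'" rel="tag">python</a>
  </div>
</div>
'''

q = bs(sample_html, 'html.parser').select_one('.question-summary')

votes = q.select_one('.vote-count-post').get_text()
views = q.select_one('.views')['title']  # exact count lives in the title attribute
link = q.select_one('.question-hyperlink')['href']
title = q.select_one('.question-hyperlink').get_text()
tags = [t.get_text() for t in q.select('.post-tag')]

print(votes, views, link, title, tags, sep='\n')
```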





In [124]:
def questions_overview(questions_overview_raw):
  overview = {'questions': []}
  for soups in questions_overview_raw:  # one list of summaries per page
    for q in soups:
      q_hyperlink = q.select_one('.question-hyperlink')
      q_title = q_hyperlink.getText()
      q_link = 'https://stackoverflow.com' + q_hyperlink.get('href')
      q_summary = q.select_one('.excerpt').getText()
      q_vote_count = q.select_one('.vote-count-post').getText()
      # The exact view count is in the title attribute, e.g. "307,292 views"
      q_views = q.select_one('.views').attrs['title']
      q_tags = [tag.getText() for tag in q.select('.post-tag')]

      overview['questions'].append({
          'title': q_title,
          'link': q_link,
          'summary': q_summary,
          'vote_count': q_vote_count,
          'views': q_views,
          'tags': q_tags,
          'full_question': "",  # left empty at this stage
          'best_answer': "",
      })

  return overview

#Converting the dictionary to a dataframe

In [125]:
questions_df = pd.DataFrame(questions_overview(questions_overview_raw)['questions'])
type(questions_df)

pandas.core.frame.DataFrame

#Examples of how to access the information in the dataframe:

In [12]:
print('List of links:\n',questions_df['link'])

print('\n Accessing the data of a specific link\n--- Link: ',questions_df['link'][3])

print('\n--- Title: ', questions_df['title'][3])

print('\n--- Brief description: ', questions_df['summary'][3])

print('\n--- Vote count: ', questions_df['vote_count'][3])

print('\n--- View count: ', questions_df['views'][3])

print('\n--- List of tags: ', questions_df['tags'][3])


List of links:
 0     https://stackoverflow.com/questions/23708898/p...
1     https://stackoverflow.com/questions/573618/set...
2     https://stackoverflow.com/questions/8000022/dj...
3     https://stackoverflow.com/questions/5100539/dj...
4     https://stackoverflow.com/questions/8609192/di...
                            ...                        
95    https://stackoverflow.com/questions/17716624/d...
96    https://stackoverflow.com/questions/26697565/d...
97    https://stackoverflow.com/questions/6367014/ho...
98    https://stackoverflow.com/questions/2201598/ho...
99    https://stackoverflow.com/questions/1208067/wh...
Name: link, Length: 100, dtype: object

 Accessing the data of a specific link
--- Link:  https://stackoverflow.com/questions/5100539/django-csrf-check-failing-with-an-ajax-post-request

--- Title:  Django CSRF check failing with an Ajax POST request

--- Brief description:  
            I could use some help complying with Django's CSRF protection mechanism via my

Next steps:


1.   Enrich questions_df with the full text of each question and the content of its best-rated answer
2.   Clean the data in questions_df to remove irrelevant content such as \n, \t, articles, and pronouns
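
The cleaning step could start from something like the sketch below. It only covers whitespace and numeric fields; removing articles and pronouns would need a word list and is not attempted here. `clean_text` and `views_to_int` are hypothetical helper names, and the input formats are assumed to look like the scraped values, such as '1,043,690 views'.

```python
import re

def clean_text(text):
  """Collapse runs of whitespace (including newlines and tabs) into single spaces."""
  return re.sub(r'\s+', ' ', text).strip()

def views_to_int(views):
  """Turn a string such as '1,043,690 views' into the integer 1043690."""
  digits = re.sub(r'[^0-9]', '', views)
  return int(digits) if digits else 0

print(clean_text("\r\n            I'm running into a weird error"))
print(views_to_int('1,043,690 views'))
```

Storing the counts as integers also makes it possible to sort and filter the dataframe numerically instead of by string.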



In [126]:
questions_df.head()

| | title | link | summary | vote_count | views | tags | full_question | best_answer |
|---|---|---|---|---|---|---|---|---|
| 0 | 'pip' is not recognized as an internal or exte... | https://stackoverflow.com/questions/23708898/p... | \r\n I'm running into a weird error... | 338 | 1,043,690 views | [python, django, windows, pip] | | |
| 1 | Set up a scheduled job? | https://stackoverflow.com/questions/573618/set... | \r\n I've been working on a web app... | 521 | 169,395 views | [python, django, web-applications, scheduled-t... | | |
| 2 | Django template how to look up a dictionary va... | https://stackoverflow.com/questions/8000022/dj... | \r\n mydict = {"key1":"value1", "ke... | 234 | 142,128 views | [python, django, templates, dictionary] | | |
| 3 | Django CSRF check failing with an Ajax POST re... | https://stackoverflow.com/questions/5100539/dj... | \r\n I could use some help complyin... | 180 | 151,034 views | [python, ajax, django, csrf] | | |
| 4 | differentiate null=True, blank=True in django | https://stackoverflow.com/questions/8609192/di... | \r\n When we add a database field i... | 917 | 257,589 views | [python, django, django-models] | | |


In [144]:
def read_question_detail(questions_df):
  # Fetch each question page and store the question body HTML in 'full_question'
  for idx, link in questions_df['link'].items():
    try:
      request = requests.get(link)  # fetch this row's link, not always the first one
      request.raise_for_status()
      try:
        soup = bs(request.text, 'html.parser')
        # .at writes to the dataframe itself and avoids chained assignment
        questions_df.at[idx, 'full_question'] = str(soup.select('.question .post-text'))
      except Exception:
        print("Could not transform to soup object by selecting")
        questions_df.at[idx, 'full_question'] = "No Answer :( "

    except HTTPError:
      print("Could not download page")

    sleep(0.05)  # short pause between requests

  print(questions_df['full_question'][0])

  return questions_df



In [145]:
questions_df = read_question_detail(questions_df)

[<div class="post-text" itemprop="text">
<p>I'm running into a weird error when trying to install Django on my computer.</p>
<p>This is the sequence that I typed into my command line:</p>
<pre class="lang-none prettyprint-override"><code>C:\Python34&gt; python get-pip.py
Requirement already up-to-date: pip in c:\python34\lib\site-packages
Cleaning up...

C:\Python34&gt; pip install Django
'pip' is not recognized as an internal or external command,
operable program or batch file.

C:\Python34&gt; lib\site-packages\pip install Django
'lib\site-packages\pip' is not recognized as an internal or external command,
operable program or batch file.
</code></pre>
<p>What could be causing this?</p>
<p>This is what I get when I type in <code>echo %PATH%</code>:</p>
<pre class="lang-none prettyprint-override"><code>C:\Python34&gt;echo %PATH%
C:\Program Files\ImageMagick-6.8.8-Q16;C:\Program Files (x86)\Intel\iCLS Client\
;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows

In [151]:
import re
from html import unescape

def cleanhtml(raw_html):
  # Strip tags, then decode entities such as &gt; into their characters
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return unescape(cleantext)

In [152]:
print(cleanhtml(questions_df['full_question'][0]))

[
I'm running into a weird error when trying to install Django on my computer.
This is the sequence that I typed into my command line:
C:\Python34> python get-pip.py
Requirement already up-to-date: pip in c:\python34\lib\site-packages
Cleaning up...

C:\Python34> pip install Django
'pip' is not recognized as an internal or external command,
operable program or batch file.

C:\Python34> lib\site-packages\pip install Django
'lib\site-packages\pip' is not recognized as an internal or external command,
operable program or batch file.

What could be causing this?
This is what I get when I type in echo %PATH%:
C:\Python34>echo %PATH%
C:\Program Files\ImageMagick-6.8.8-Q16;C:\Program Files (x86)\Intel\iCLS Client\
;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\S
ystem32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\
Windows Live\Shared;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Progr
am Files (x86)\Intel\OpenCL 