**Objective:**
* Capture the most frequent Python questions on Stack Overflow
* Store the best-rated answers for each question
* Build a search tool based on the information from Stack Overflow


**Source:** https://stackoverflow.com/questions/tagged/python?tab=Frequent

In [None]:
import numpy as np
import pandas as pd



#Reading the raw data from Stack Overflow
**read_stackoverflow_raw(tags=[], tab='Frequent', selector='question-summary', pages=5)**

Reads the summaries of the most frequent questions on Stack Overflow according to a few search parameters.

Returns a list with one entry per page read; each entry holds the HTML elements that matched the selector on that page.

* tags: optional list of strings with the tags used to filter the questions. E.g.: ['python', 'php', 'javascript']
>e.g. URL for a page with more than one tag: https://stackoverflow.com/questions/tagged/sql+sql-server?tab=Frequent

* tab: string with the sort order to apply; one of:
'Frequent' (default), 'Votes', 'Unanswered', 'Bounties', 'Active', 'Newest'

* selector: CSS class of the HTML elements to return. Defaults to question-summary

* pages: number of pages to read
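
As a minimal sketch of how the parameters above map onto a request URL, the helper below builds the address for one page. The name `build_page_url` and the constant `BASE_URL` are hypothetical and only serve this illustration; percent-encoding each tag is an added assumption so that tags such as `c++` do not clash with the `+` separator.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://stackoverflow.com/questions'  # hypothetical constant name


def build_page_url(tags=None, tab='Frequent', page=1):
    """Build the listing URL for one page of results."""
    url = BASE_URL
    if tags:
        # Tags are joined with '+'; each tag is percent-encoded first
        # (assumption: needed for tags containing '+' or '#').
        url += '/tagged/' + '+'.join(quote(tag, safe='') for tag in tags)
    # tab selects the sort order, page selects the result page
    return url + '?' + urlencode({'tab': tab, 'page': page})


print(build_page_url(['sql', 'sql-server']))
```

Keeping the URL construction in a small pure function makes it possible to check the address format without issuing any request.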



In [60]:
import requests # Getting Webpage content
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup as bs # Scraping webpages
from time import sleep

def read_stackoverflow_raw(tags=[], tab='Frequent', selector='question-summary', pages=5):
  link = 'https://stackoverflow.com/questions'

  # Multiple tags are joined with '+', e.g. /tagged/python+django
  if tags:
    link += '/tagged/' + '+'.join(tags)

  link += '?tab='+tab

  questions = []
  for page in range(1,pages+1):
    page_link = '&page='+str(page)

    try:
      request = requests.get(link+page_link)
      request.raise_for_status()
      try:
        soup = bs(request.content, 'html.parser')
        soup_selection = soup.select('.'+selector)
        questions.append(soup_selection)
        print(link+page_link)
      except Exception:
        print("Could not select page %d with selector %s" % (page, selector))
    except HTTPError:
      print("Could not download page", page)

    sleep(0.05)

  return questions

In [65]:
questions_list_raw = read_stackoverflow_raw(tags=['python','django'],tab='Frequent',selector='summary', pages=3)
#stack_page.content

https://stackoverflow.com/questions/tagged/python+django?tab=Frequent&page=1
https://stackoverflow.com/questions/tagged/python+django?tab=Frequent&page=2
https://stackoverflow.com/questions/tagged/python+django?tab=Frequent&page=3


#Transforming the raw data collected from Stack Overflow into a dictionary
**questions_overview()**

The dictionary with an overview of the Stack Overflow questions contains the following information:

* link
* brief_description
* votes
* views

**Analysis of the HTML page structure to capture the relevant information:**

Inside "question-summary", the relevant information is:

1.   class = statscontainer, with:
*   class="vote-count-post " containing the number of votes
*   class="status answered-accepted" containing the number of answers for questions with an accepted answer
*   class="views supernova" containing a string with the view count, which is also stored in the title attribute

2.   class = summary, with:
*   Title and link in
``` <a href="/questions/15112125/how-to-test-multiple-variables-against-a-value" class="question-hyperlink">How to test multiple variables against a value?</a>```

*   Short excerpt in class="excerpt"
*   Tags in class="post-tag"
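
The selectors listed above can be checked offline before scraping. The sketch below runs them against a small fragment; `SAMPLE_HTML` is a hypothetical fragment written to mimic the structure described here, not markup copied from a real page, and `parse_summary` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described above.
SAMPLE_HTML = """
<div class="question-summary">
  <div class="statscontainer">
    <span class="vote-count-post "><strong>12</strong></span>
    <div class="status answered-accepted"><strong>3</strong>answers</div>
    <div class="views supernova" title="1,234 views">1k views</div>
  </div>
  <div class="summary">
    <h3><a href="/questions/1/example-question" class="question-hyperlink">Example question?</a></h3>
    <div class="excerpt">Short excerpt of the question body.</div>
    <a class="post-tag">python</a>
    <a class="post-tag">django</a>
  </div>
</div>
"""


def parse_summary(block):
    """Extract the fields of one question-summary element as strings."""
    anchor = block.select_one('.question-hyperlink')
    return {
        'title': anchor.get_text(strip=True),
        'link': 'https://stackoverflow.com' + anchor['href'],
        'summary': block.select_one('.excerpt').get_text(strip=True),
        'vote_count': block.select_one('.vote-count-post').get_text(strip=True),
        'views': block.select_one('.views')['title'],
        'tags': [tag.get_text(strip=True) for tag in block.select('.post-tag')],
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
parsed = [parse_summary(block) for block in soup.select('.question-summary')]
print(parsed[0])
```

Working against a fixed fragment keeps the selector logic testable even if the live page is unavailable or its markup changes.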






In [47]:
def questions_overview():
  pass  # placeholder: the cells below prototype the extraction logic

In [195]:
py_questions_html.content



In [262]:
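# Create a BeautifulSoup object from the response
# (py_questions_html is not defined in this notebook; it is assumed to hold a requests.get response)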
soup = bs(py_questions_html.content, 'html.parser')

question_list = soup.select('.question-summary')
#body = soup.find('body')

In [307]:
#for link in body.find_all('a', class_ ='question-hyperlink'):
  #print(link.get('href'))
  #print('https://stackoverflow.com/questions'+link.get('href'))

for question in question_list:
  link = 'https://stackoverflow.com'+question.select_one('.question-hyperlink').get('href')
  print(link)


https://stackoverflow.com/questions/15112125/how-to-test-multiple-variables-against-a-value
https://stackoverflow.com/questions/509211/understanding-slice-notation
https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples
https://stackoverflow.com/questions/23294658/asking-the-user-for-input-until-they-give-a-valid-response
https://stackoverflow.com/questions/240178/list-of-lists-changes-reflected-across-sublists-unexpectedly
https://stackoverflow.com/questions/1132941/least-astonishment-and-the-mutable-default-argument
https://stackoverflow.com/questions/1373164/how-do-i-create-a-variable-number-of-variables
https://stackoverflow.com/questions/2612802/how-to-clone-or-copy-a-list
https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe
https://stackoverflow.com/questions/53645882/pandas-merging-101
https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating
https://stackoverflow.com/questions/952914/how-to-ma

In [None]:
# Capture the main fields with BeautifulSoup

questions_overview = { "questions":[]}

for question in question_list:
  q_title = question.select_one('.question-hyperlink').getText()
  q_link = question.select_one('.question-hyperlink').attrs['href']
  q_summary = question.select_one('.excerpt').getText()
  q_vote_count =  question.select_one('.vote-count-post').getText()
  #q_answered_accepted = question.select_one(".answered-accepted.mini-counts").getText()
  q_views =  question.select_one('.views').attrs['title']
  q_tags = [tag.getText() for tag in question.select('.post-tag')]

  questions_overview['questions'].append({
      'title': q_title,
      'link': 'https://stackoverflow.com'+q_link,
      'summary': q_summary,
      'vote_count': q_vote_count,
      'views': q_views,
      'tags': q_tags,
  })

In [177]:
questions_overview['questions'][49]['link']

'https://stackoverflow.com/questions/826948/syntax-error-on-print-with-python-3'

In [178]:
len(questions_overview['questions'])

50
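
The vote and view counts captured above are stored as strings. A minimal sketch of converting them to integers follows, assuming the views text looks like `'1,234 views'` (the exact format is an assumption, not confirmed by the output above). The helper name `count_to_int` and the `example` record are hypothetical.

```python
import re


def count_to_int(text):
    """Return the first number found in text as an int, ignoring thousands separators."""
    match = re.search(r'-?\d[\d,]*', text)
    if match is None:
        raise ValueError('no number found in %r' % text)
    return int(match.group().replace(',', ''))


# Hypothetical record shaped like the entries of questions_overview['questions'].
example = {'vote_count': '12', 'views': '1,234 views'}
cleaned = {key: count_to_int(value) for key, value in example.items()}
print(cleaned)
```

Numeric fields make it straightforward to sort or filter the questions later, for example by votes or views.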