<a href="https://colab.research.google.com/github/sielerod/search_stackoverflow/blob/master/Read_Stackoverflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Objective:**   
* Capture the most frequent Python questions on Stack Overflow
* Store for each question: link, brief description, number of votes and views, the full question, and the best-rated answers


**Source:** https://stackoverflow.com/questions/


In [81]:
import requests # Getting Webpage content
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup as bs # Scraping webpages
from time import sleep
import pandas as pd

import re # regular expressions (regex)
import string

import nltk
#nltk.download('punkt')
#nltk.download('stopwords')
from nltk.stem import RSLPStemmer # Portuguese stemming
from nltk.stem import PorterStemmer # English stemming
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import unidecode


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\siele\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


#Reading the raw data from Stack Overflow
**read_stackoverflow_overview(tags=[], tab='Frequent', pages=5)**

Reads the summaries of the most frequent questions on Stack Overflow according to a few search parameters.

Returns a list with one entry per page read; each entry holds the BeautifulSoup elements selected from that page's HTML.

* tags: optional list of strings with the question tags to filter by. E.g.: ['python', 'php', 'javascript']
>example URL for a page with more than one tag: https://stackoverflow.com/questions/tagged/sql+sql-server?tab=Frequent

* tab: string with the sort order to apply. One of:
'Frequent' (default), 'Votes', 'Unanswered', 'Bounties', 'Active', 'Newest'

* selector: CSS class of the HTML fragments to return. It is fixed inside the function as question-summary and is not an argument

* pages: number of pages to read (default: 5)
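
The way these parameters combine into a request URL can be sketched on its own, without touching the network. The helper name `build_overview_url` is hypothetical and only illustrates the URL pattern described above (tags joined with `+`, then the `tab` and `page` query parameters):

```python
def build_overview_url(tags=None, tab='Frequent', page=1):
    """Hypothetical helper: build the listing URL for one results page."""
    url = 'https://stackoverflow.com/questions'
    if tags:
        # Several tags are combined with '+', as in /tagged/sql+sql-server
        url += '/tagged/' + '+'.join(tags)
    return f'{url}?tab={tab}&page={page}'


print(build_overview_url(['sql', 'sql-server']))
print(build_overview_url(page=3))
```

Building the URL separately makes it easy to check the address before any page is downloaded.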



In [2]:
def read_stackoverflow_overview(tags=[], tab='Frequent', pages=5):
  link = 'https://stackoverflow.com/questions'
  selector = 'question-summary'

  # Multiple tags are joined with '+', e.g. /tagged/python+django
  if tags:
    link += '/tagged/' + '+'.join(tags)

  link += '?tab=' + tab

  soup_selection = []
  for page in range(1, pages + 1):
    page_link = '&page=' + str(page)

    try:
      request = requests.get(link + page_link)
      request.raise_for_status()
      try:
        soup = bs(request.text, 'html.parser')
        # One list of matching elements per downloaded page
        soup_selection.append(soup.select('.' + selector))
      except Exception:
        print("Could not transform to soup object by selecting", selector)
    except HTTPError:
      print("Could not download page", page)

    sleep(0.05)  # short pause between requests

  return soup_selection


In [3]:
questions_overview_raw = read_stackoverflow_overview(tags=['python','django'],tab='Frequent',pages=2)
type(questions_overview_raw)

list

In [4]:
questions_overview_raw

6</span>
  </div>
  <div class="user-gravatar32">
  <a href="/users/2592/josh-hunt"><div class="gravatar-wrapper-32"><img alt="" class="bar-sm" height="32" src="https://www.gravatar.com/avatar/7b101504605f7d8657ad0bbf87d565d0?s=32&amp;d=identicon&amp;r=PG" width="32"/></div></a>
  </div>
  <div class="user-details">
  <a href="/users/2592/josh-hunt">Josh Hunt</a>
  <div class="-flair">
  <span class="reputation-score" dir="ltr" title="reputation score 12,175">12.2k</span><span aria-hidden="true" title="24 gold badges"><span class="badge1"></span><span class="badgecount">24</span></span><span class="v-visible-sr">24 gold badges</span><span aria-hidden="true" title="71 silver badges"><span class="badge2"></span><span class="badgecount">71</span></span><span class="v-visible-sr">71 silver badges</span><span aria-hidden="true" title="94 bronze badges"><span class="badge3"></span><span class="badgecount">94</span></span><span class="v-visible-sr">94 bronze badges</span>
  </div>
  </div>
  

#Transforming the raw data collected from Stack Overflow into a dictionary
**questions_overview(questions_overview_raw)**

The dictionary must contain an overview of the Stack Overflow questions, with:

* link
* brief_description
* votes
* views

###Analysis of the HTML page structure to capture the relevant information:

Inside "question-summary" we find the following relevant information:

1.   class = statscontainer, with:
*   Number of votes in class="vote-count-post "
>```<span class="vote-count-post high-scored-post"><strong>2473</strong></span>```

*   Number of answers in class="status answered-accepted" (the class used when the question has an accepted answer)
>```<div class="status answered-accepted"><strong>23</strong>answers</div>```

*   View count, both as the element text and in the title attribute, in class="views supernova" 
>```<div class="views supernova" title="307,292 views">307k views</div>```

2.   class = summary, with:
* class="question-hyperlink", whose *href* holds the path used to build the link to the question's detail page, and whose text is the question title
>``` <a href="/questions/15112125/how-to-test-multiple-variables-against-a-value" class="question-hyperlink">How to test multiple variables against a value?</a>```

*   Brief summary in class="excerpt"
>```<div class="excerpt"> brief description of the question ...</div>```

*   Tags in class="post-tag"
>```<a href="/questions/tagged/python" class="post-tag" title="show questions tagged 'python'" rel="tag">python</a>```
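
Once an element has been selected, its text still has to be turned into a number. The sketch below assumes the labels look like the samples above; `parse_count` is a hypothetical helper, not part of the notebook. It keeps only the digits, which handles both the thousands separator in the `title` attribute and the "answers" suffix:

```python
import re


def parse_count(text):
    """Hypothetical helper: keep only the digits of a label and return an int."""
    digits = re.sub(r'\D', '', text)
    # An empty label (no digits at all) is treated as zero
    return int(digits) if digits else 0


print(parse_count('307,292 views'))  # value of the title attribute
print(parse_count('23answers'))      # text of the answers block
print(parse_count('2473'))           # text of the vote counter
```

The exact view count should come from the `title` attribute: the visible text ("307k views") is abbreviated, so stripping non-digits from it would give 307 rather than 307292.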





In [65]:
def clean_text (text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.replace('\n',' ').strip()
    text = text.lower()
    text = unidecode.unidecode(text)
    return text
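
To see what each step of `clean_text` does to a string, the same pipeline can be sketched with the standard library only. This is an approximation, not the notebook's function: `clean_text_stdlib` is a hypothetical name, and `unicodedata` replaces `unidecode`, so characters without an ASCII decomposition are dropped instead of transliterated.

```python
import string
import unicodedata


def clean_text_stdlib(text):
    """Hypothetical stdlib-only variant of clean_text."""
    # 1. Remove every ASCII punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 2. Turn newlines into spaces and trim both ends
    text = text.replace('\n', ' ').strip()
    # 3. Lowercase
    text = text.lower()
    # 4. Split accented letters into base letter + accent, then drop the accents
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    return text


print(clean_text_stdlib('Instalação\n do Django!'))
print(clean_text_stdlib('C:\\Python34> pip install Django'))
```

The second call shows a side effect worth keeping in mind: because `:`, `\` and `>` are all punctuation, a path such as `C:\Python34>` collapses into the single token `cpython34`.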

Testing the re library for handling regular expressions:

In [43]:
answers_count = '0answer'
# 'answers?' matches both "answer" and "answers"
int(re.sub(r'answers?','',answers_count))

0

Testing the BeautifulSoup library

In [5]:
rel_soup = bs('<div class="status answered-accepted"><strong>12</strong>answers</div>', 'html.parser')
rel_soup = int(rel_soup.select_one('.answered-accepted').next_element.getText())

print(rel_soup)

12


In [53]:
def questions_overview(questions_overview_raw):
  questions = []

  for soups in questions_overview_raw:  # one list of elements per page
    for soup in soups:                  # one element per question
      hyperlink = soup.select_one('.question-hyperlink')
      title = hyperlink.getText()
      link = 'https://stackoverflow.com' + hyperlink.get('href')
      summary = soup.select_one('.excerpt').getText()
      vote_count = soup.select_one('.vote-count-post').getText()
      # This class only matches questions with an accepted answer; others count as 0
      answers_count = soup.select_one('.answered-accepted')
      answers_count = re.sub(r'\D', '', answers_count.getText()) if answers_count else '0'
      # The title attribute holds the exact count, e.g. "307,292 views"
      views = re.sub(r'\D', '', soup.select_one('.views').attrs['title'])
      tags = [tag.getText() for tag in soup.select('.post-tag')]

      questions.append({
          'title': clean_text(title),
          'link': link,
          'summary': clean_text(summary),
          'vote_count': int(vote_count),
          'answers_count': int(answers_count),
          'views': int(views),
          'tags': tags,
          'full_question': '',
          'best_answer': '',
          'search_vector': '', # to be filled with a single vector for question and answer
      })

  questions_df = pd.DataFrame(questions)

  return questions_df

#Converting the dictionary into a DataFrame

In [54]:
questions_df = questions_overview(questions_overview_raw)
type(questions_df)

pandas.core.frame.DataFrame

#Examples of how to access the information in the DataFrame:

In [55]:
print('List of links:\n',questions_df['link'])

print('\n Accessing the data of a specific link\n--- Link: ',questions_df['link'][0])

print('\n--- Title: ', questions_df['title'][0])

print('\n--- Brief description: ', questions_df['summary'][0])

print('\n--- Vote count: ', questions_df['vote_count'][0])

print('\n--- Answer count: ', questions_df['answers_count'][0])

print('\n--- View count: ', questions_df['views'][0])

print('\n--- List of tags: ', questions_df['tags'][0])


List of links:
 0     https://stackoverflow.com/questions/23708898/p...
1     https://stackoverflow.com/questions/573618/set...
2     https://stackoverflow.com/questions/8000022/dj...
3     https://stackoverflow.com/questions/5100539/dj...
4     https://stackoverflow.com/questions/8609192/di...
5     https://stackoverflow.com/questions/1156246/ha...
6     https://stackoverflow.com/questions/2428092/cr...
7     https://stackoverflow.com/questions/7446187/no...
8     https://stackoverflow.com/questions/2642613/wh...
9     https://stackoverflow.com/questions/629551/how...
10    https://stackoverflow.com/questions/20306981/h...
11    https://stackoverflow.com/questions/298772/dja...
12    https://stackoverflow.com/questions/7933596/dj...
13    https://stackoverflow.com/questions/1395807/pr...
14    https://stackoverflow.com/questions/291945/how...
15    https://stackoverflow.com/questions/1626326/ho...
16    https://stackoverflow.com/questions/4668619/ho...
17    https://stackoverflow.co

Next steps:


1.   Enrich questions_df with the full text of each question and the content of its best-rated answer
2.   Clean the data in questions_df to remove irrelevant content, such as: \n, \t, articles, pronouns



In [9]:
questions_df.head()

| |answers_count|best_answer|full_question|link|summary|tags|title|views|vote_count|
|---|---|---|---|---|---|---|---|---|---|
|0|32| | |https://stackoverflow.com/questions/23708898/p...|\r\n I'm running into a weird error...|[python, django, windows, pip]|'pip' is not recognized as an internal or exte...|1045040|338|
|1|24| | |https://stackoverflow.com/questions/573618/set...|\r\n I've been working on a web app...|[python, django, web-applications, scheduled-t...|Set up a scheduled job?|169474|521|
|2|8| | |https://stackoverflow.com/questions/8000022/dj...|\r\n mydict = {"key1":"value1", "ke...|[python, django, templates, dictionary]|Django template how to look up a dictionary va...|142226|234|
|3|18| | |https://stackoverflow.com/questions/5100539/dj...|\r\n I could use some help complyin...|[python, ajax, django, csrf]|Django CSRF check failing with an Ajax POST re...|151126|180|
|4|17| | |https://stackoverflow.com/questions/8609192/di...|\r\n When we add a database field i...|[python, django, django-models]|differentiate null=True, blank=True in django|257805|917|


In [57]:

def read_question_detail(questions_df):

  for idx, link in enumerate(questions_df['link']):
    try:
      request = requests.get(link)
      request.raise_for_status()
      try:
        soup = bs(request.text, 'html.parser')
        # First .post-text under #question is the question body;
        # first .post-text under #answers is the first answer listed on the page.
        # .at writes directly into the DataFrame, avoiding chained assignment
        questions_df.at[idx, 'full_question'] = clean_text(soup.find("div", {"id": "question"}).select_one('.post-text').getText())
        questions_df.at[idx, 'best_answer'] = clean_text(soup.find("div", {"id": "answers"}).select_one('.post-text').getText())

      except Exception:
        print ("Could not transform to soup object by selecting")
        questions_df.at[idx, 'full_question'] = "No Question :( "
        questions_df.at[idx, 'best_answer'] = "No Answer :( "

    except HTTPError:
      print ("Could not download page")

    sleep(0.05)  # short pause between requests

  print(questions_df['best_answer'][0])

  return questions_df



Tests:

In [27]:
request = requests.get('https://stackoverflow.com/questions/23708898/pip-is-not-recognized-as-an-internal-or-external-command')
request.raise_for_status()
rel_soup = bs(request.text, 'html.parser')
#rel_soup = rel_soup.find("div", {"id": "answers"}).select_one('.post-text').getText()
rel_soup = rel_soup.find("div", {"id": "question"}).select_one('.post-text').getText()

print(rel_soup)


I'm running into a weird error when trying to install Django on my computer.
This is the sequence that I typed into my command line:
C:\Python34> python get-pip.py
Requirement already up-to-date: pip in c:\python34\lib\site-packages
Cleaning up...

C:\Python34> pip install Django
'pip' is not recognized as an internal or external command,
operable program or batch file.

C:\Python34> lib\site-packages\pip install Django
'lib\site-packages\pip' is not recognized as an internal or external command,
operable program or batch file.

What could be causing this?
This is what I get when I type in echo %PATH%:
C:\Python34>echo %PATH%
C:\Program Files\ImageMagick-6.8.8-Q16;C:\Program Files (x86)\Intel\iCLS Client\
;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\S
ystem32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\
Windows Live\Shared;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Progr
am Files (x86)\Intel\OpenCL SDK\2.0\bin\x

Tests - clean_text:

In [71]:
request = requests.get('https://stackoverflow.com/questions/17648966/handling-non-ascii-chars-in-c')
request.raise_for_status()
rel_soup = bs(request.text, 'html.parser')
#rel_soup = rel_soup.find("div", {"id": "answers"}).select_one('.post-text').getText()
rel_soup = rel_soup.find("div", {"id": "question"}).select_one('.post-text').getText()

print(clean_text(rel_soup))

i am facing some issues with nonascii chars in c i have one file containg nonascii chars which i am reading in c via file handling after reading the filesay 1txt i am storing the data into string stream and writing it into another filesay 2txt assume 1txt contains acao  in 2txt i should get same ouyput but nonascii chars are printed as their hex value in 2txt also i am quite sure that c is handling ascii chars as ascii only  please help on how to print these chars correctly in 2txt edit firstly psuedocode for whole process 1shell script to read from db one value and stores in 11txt 2cpp codeacpp reading 11txt and writing to ftxt  data present in db which is being read instalacao file 11txt contains instalaaSSaPSo file ftxt contains instalaaSSaPSo ouput of acpp on screen instalacao acpp include iterator include iostream include algorithm include sstream includefstream include iomanip  using namespace std int main      ifstream myreadfile     ofstream f2     myreadfileopen11txt     f2ope

In [58]:
questions_df = read_question_detail(questions_df)

You need to add the path of your pip installation to your PATH system variable By default pip is installed to CPython34Scriptspip pip now comes bundled with new versions of python so the path CPython34Scripts needs to be added to your PATH variable To check if it is already in your PATH variable type echo PATH at the CMD prompt To add the path of your pip installation to your PATH variable you can use the Control Panel or the setx command For example setx PATH PATHCPython34Scripts   Note According to the official documentation variables set with setx variables are available in future command windows only not in the current command window In particular you will need to start a new cmdexe instance after entering the above command in order to utilize the new environment variable  Thanks to Scott Bartell for pointing this out


In [151]:
def cleanhtml(raw_html):
  # Strip HTML tags and entities such as &amp; or &#39;
  cleanr = re.compile(r'<.*?>|&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

In [73]:
print(questions_df['full_question'][0])


Im running into a weird error when trying to install Django on my computer This is the sequence that I typed into my command line CPython34 python getpippy Requirement already uptodate pip in cpython34libsitepackages Cleaning up  CPython34 pip install Django pip is not recognized as an internal or external command operable program or batch file  CPython34 libsitepackagespip install Django libsitepackagespip is not recognized as an internal or external command operable program or batch file  What could be causing this This is what I get when I type in echo PATH CPython34echo PATH CProgram FilesImageMagick688Q16CProgram Files x86InteliCLS Client CProgram FilesInteliCLS ClientCWindowssystem32CWindowsCWindowsS ystem32WbemCWindowsSystem32WindowsPowerShellv10CProgram Files x86 Windows LiveSharedCProgram Files x86IntelOpenCL SDK20binx86CProgr am Files x86IntelOpenCL SDK20binx64CProgram FilesIntelIntelR Mana gement Engine ComponentsDALCProgram FilesIntelIntelR Management Engine C omponentsIPTC

In [91]:
stemmer = PorterStemmer()
# word_tokenize() breaks the text into individual words
tokens = word_tokenize(questions_df['full_question'][0]+ "success success success")
# English stop words to filter out
stop_words = stopwords.words('english')
keywords    = [word for word in tokens if word not in stop_words and len(word) > 2 and word not in string.punctuation]
# Stem the words that survived the filter above
stem_words  = [stemmer.stem(word) for word in keywords]

print(stem_words)

['run', 'weird', 'error', 'tri', 'instal', 'django', 'comput', 'thi', 'sequenc', 'type', 'command', 'line', 'cpython34', 'python', 'getpippi', 'requir', 'alreadi', 'uptod', 'pip', 'cpython34libsitepackag', 'clean', 'cpython34', 'pip', 'instal', 'django', 'pip', 'recogn', 'intern', 'extern', 'command', 'oper', 'program', 'batch', 'file', 'cpython34', 'libsitepackagespip', 'instal', 'django', 'libsitepackagespip', 'recogn', 'intern', 'extern', 'command', 'oper', 'program', 'batch', 'file', 'what', 'could', 'caus', 'thi', 'get', 'type', 'echo', 'path', 'cpython34echo', 'path', 'cprogram', 'filesimagemagick688q16cprogram', 'file', 'x86intelicl', 'client', 'cprogram', 'filesintelicl', 'clientcwindowssystem32cwindowscwindowss', 'ystem32wbemcwindowssystem32windowspowershellv10cprogram', 'file', 'x86', 'window', 'livesharedcprogram', 'file', 'x86intelopencl', 'sdk20binx86cprogr', 'file', 'x86intelopencl', 'sdk20binx64cprogram', 'filesintelintelr', 'mana', 'gement', 'engin', 'componentsdalcprog

In [92]:
print(keywords)

['running', 'weird', 'error', 'trying', 'install', 'Django', 'computer', 'This', 'sequence', 'typed', 'command', 'line', 'CPython34', 'python', 'getpippy', 'Requirement', 'already', 'uptodate', 'pip', 'cpython34libsitepackages', 'Cleaning', 'CPython34', 'pip', 'install', 'Django', 'pip', 'recognized', 'internal', 'external', 'command', 'operable', 'program', 'batch', 'file', 'CPython34', 'libsitepackagespip', 'install', 'Django', 'libsitepackagespip', 'recognized', 'internal', 'external', 'command', 'operable', 'program', 'batch', 'file', 'What', 'could', 'causing', 'This', 'get', 'type', 'echo', 'PATH', 'CPython34echo', 'PATH', 'CProgram', 'FilesImageMagick688Q16CProgram', 'Files', 'x86InteliCLS', 'Client', 'CProgram', 'FilesInteliCLS', 'ClientCWindowssystem32CWindowsCWindowsS', 'ystem32WbemCWindowsSystem32WindowsPowerShellv10CProgram', 'Files', 'x86', 'Windows', 'LiveSharedCProgram', 'Files', 'x86IntelOpenCL', 'SDK20binx86CProgr', 'Files', 'x86IntelOpenCL', 'SDK20binx64CProgram', 'Fi

In [93]:
from sklearn.feature_extraction.text import CountVectorizer #TF

text = ' '.join(keywords)
print(text)



running weird error trying install Django computer This sequence typed command line CPython34 python getpippy Requirement already uptodate pip cpython34libsitepackages Cleaning CPython34 pip install Django pip recognized internal external command operable program batch file CPython34 libsitepackagespip install Django libsitepackagespip recognized internal external command operable program batch file What could causing This get type echo PATH CPython34echo PATH CProgram FilesImageMagick688Q16CProgram Files x86InteliCLS Client CProgram FilesInteliCLS ClientCWindowssystem32CWindowsCWindowsS ystem32WbemCWindowsSystem32WindowsPowerShellv10CProgram Files x86 Windows LiveSharedCProgram Files x86IntelOpenCL SDK20binx86CProgr Files x86IntelOpenCL SDK20binx64CProgram FilesIntelIntelR Mana gement Engine ComponentsDALCProgram FilesIntelIntelR Management Engine omponentsIPTCProgram Files x86IntelIntelR Management Engine Components DALCProgram Files x86IntelIntelR Management Engine ComponentsIPTCP r

In [94]:
vectorizer = CountVectorizer()
vectorizer.fit([text])
vector = vectorizer.transform([text])
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

(1, 68)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 2 1 1 1 1 3 1 1 1 1 1 2 3 1 1 1 3 1 4 1 2 2 9 1 1 2 1 1 1 3 2 2 1 1
  1 3 1 2 2 3 2 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1]]


In [95]:

# Vectorize a new text
text2 = ["How can I install install install install  install Django? No success success success with django so far... django django"]
vector2 = vectorizer.transform(text2)
print(vector2.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
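
The second vector has only three non-zero entries because `transform` counts only the words already present in the fitted vocabulary; words such as "How" or "far" are ignored. A minimal standard-library sketch of that idea follows. It is not scikit-learn's implementation: the helper names are hypothetical, and it only mimics what are, as far as I know, the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters, vocabulary in alphabetical order).

```python
import re
from collections import Counter

# Words of two or more characters, similar to CountVectorizer's default token pattern
TOKEN = re.compile(r'\b\w\w+\b')


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(text):
    """Hypothetical helper: sorted list of the distinct tokens seen while fitting."""
    return sorted(set(tokenize(text)))


def transform(text, vocabulary):
    """Hypothetical helper: count each vocabulary word; unseen words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vocab = fit_vocabulary('pip install Django pip')
print(vocab)
print(transform('pip install Django pip', vocab))
print(transform('How can I install install Django? No success with django', vocab))
```

Here "success" contributes nothing to the last vector because it was never seen while fitting, which is the same reason most positions in the notebook's second vector are zero.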
