
#**Objective:**   
* Capture the most frequent Python questions on Stack Overflow
* Store for each question: link, brief description, vote and view counts, the full question, and the best-rated answers


**Source:** https://stackoverflow.com/questions/


In [33]:
import numpy as np 
import pandas as pd

import requests # Fetches web page content
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup as bs # Scraping webpages
from time import sleep
import json

import re # Regular expressions (regex)
import string
import unidecode

import nltk
#nltk.download('punkt')
#nltk.download('stopwords')
#from nltk.stem import RSLPStemmer # Portuguese stemming
#from nltk.stem import PorterStemmer # English stemming with Porter's algorithm: less aggressive reductions
from nltk.stem import SnowballStemmer # Porter2 stemming: more aggressive reductions than the Porter stemmer and slightly faster
#from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords


In [1]:
def cleanhtml(raw_html):
  # Strip HTML tags (<...>) and character entities such as &amp; or &#39;
  cleanr = re.compile(r'<.*?>|&#?\w+;')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

#Reading the raw data from Stack Overflow
**read_stackoverflow_overview(tags=[], tab='Frequent', pages=5)**

Reads the summaries of the most frequent questions on Stack Overflow according to a few search parameters.

Returns a list with one entry per page read; each entry holds the BeautifulSoup elements selected from that page.

* tags: optional list of strings with the question tags to select. E.g.: ['python', 'php', 'javascript']
>e.g. URL for a page with more than one tag: https://stackoverflow.com/questions/tagged/sql+sql-server?tab=Frequent

* tab: string with the sort order to apply; one of:
'Frequent' (default), 'Votes', 'Unanswered', 'Bounties', 'Active', 'Newest'

* selector: the HTML elements to return, chosen by class. It is fixed to question-summary inside the function

* pages: number of pages to read
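
As a minimal sketch of how these parameters map onto a request URL, the helper below joins the tags with `+` and appends the tab and page as query parameters. `build_overview_url` is a hypothetical name that is not part of the notebook, and the sketch assumes the URL layout shown in the example above.

```python
BASE_URL = 'https://stackoverflow.com/questions'

def build_overview_url(tags=None, tab='Frequent', page=1):
    """Hypothetical helper: compose the listing URL for one page."""
    url = BASE_URL
    if tags:
        # Several tags are combined with '+', e.g. /tagged/sql+sql-server
        url += '/tagged/' + '+'.join(tags)
    return '{}?tab={}&page={}'.format(url, tab, page)

print(build_overview_url(['sql', 'sql-server'], 'Frequent', 1))
```

Keeping the URL construction separate from the download makes it possible to check the generated links without sending any request.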



In [3]:
def read_stackoverflow_overview(tags=[], tab='Frequent', pages=5):
  link = 'https://stackoverflow.com/questions'
  selector = 'question-summary'

  # Multiple tags are joined with '+', e.g. /tagged/python+django
  if tags:
    link += '/tagged/' + '+'.join(tags)

  link += '?tab=' + tab

  soup_selection = []
  for page in range(1, pages + 1):
    page_link = '&page=' + str(page)

    try:
      request = requests.get(link + page_link)
      request.raise_for_status()
      try:
        # Keep only the elements with the question-summary class
        soup = bs(request.text, 'html.parser')
        soup_selection.append(soup.select('.' + selector))
      except Exception:
        print("Could not transform to soup object by selecting ", selector)
    except HTTPError:
      print("Could not download page ", page)

    sleep(0.05)  # brief pause between requests

  return soup_selection


In [8]:
questions_overview_raw = read_stackoverflow_overview(tags=['python','django'],tab='Frequent',pages=2)
type(questions_overview_raw)

list

#Transforming the raw data collected from Stack Overflow into a dataframe
**questions_overview(questions_overview_raw)**

The dataframe should contain an overview of the Stack Overflow questions, with:

* link
* brief_description
* votes
* views

###Analysis of the HTML page structure to capture the relevant information:

Within "question-summary", the relevant information is:

1.   class = statscontainer, with:
*   Number of votes in class="vote-count-post "
>```<span class="vote-count-post high-scored-post"><strong>2473</strong></span>```

*   Number of answers in class="status answered-accepted" (this class is present when one of the answers has been accepted)
>```<div class="status answered-accepted"><strong>23</strong>answers</div>```

*   Text content and title attribute holding the number of views in class="views supernova" 
>```<div class="views supernova" title="307,292 views">307k views</div>```

2.   class = summary, with:
* class="question-hyperlink", whose *href* holds the path used to build the link to the question's detail page and whose text is the question title
>``` <a href="/questions/15112125/how-to-test-multiple-variables-against-a-value" class="question-hyperlink">How to test multiple variables against a value?</a>```

*   Brief summary in class="excerpt"
>```<div class="excerpt"> brief description of the question ...</div>```

*   Tags in class="post-tag"
>```<a href="/questions/tagged/python" class="post-tag" title="show questions tagged 'python'" rel="tag">python</a>```
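
The view count sits in the `title` attribute as text such as `307,292 views`, so it has to be reduced to digits before it can be converted to an integer. A minimal sketch with the standard `re` module follows; `parse_count` is a hypothetical helper name used only for illustration.

```python
import re

def parse_count(text):
    """Hypothetical helper: keep only the digits of a label like '307,292 views'."""
    digits = re.sub(r'\D', '', text)
    # Fall back to 0 when the label contains no digits at all
    return int(digits) if digits else 0

print(parse_count('307,292 views'))  # 307292
print(parse_count('23answers'))      # 23
```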





In [9]:
def questions_overview(questions_overview_raw):
  questions_overview = { 'questions':[]}

  # questions_overview_raw holds one list of question-summary elements per page
  for soups in questions_overview_raw:
    for soup in soups:
      title = soup.select_one('.question-hyperlink').getText()
      link = 'https://stackoverflow.com'+soup.select_one('.question-hyperlink').get('href')
      summary = soup.select_one('.excerpt').getText()
      vote_count =  soup.select_one('.vote-count-post').getText()
      # The answer count is only read when the question has an accepted answer
      answers_count = soup.select_one('.answered-accepted')
      answers_count = re.sub(r'\D','',answers_count.getText('')) if answers_count else '0'
      # The title attribute looks like "307,292 views": keep only the digits
      views = re.sub(r'\D','',soup.select_one('.views').attrs['title'])
      tags = [tag.getText() for tag in soup.select('.post-tag')]

      questions_overview['questions'].append({
          'title': title,
          'link': link,
          'summary': summary,
          'vote_count': int(vote_count),
          'answers_count': int(answers_count),
          'views': int(views),
          'tags': tags,
          'full_question': '',
          'best_answer': '',
      })

  questions_df = pd.DataFrame(questions_overview['questions'])
  
  return questions_df

In [10]:
questions_df = questions_overview(questions_overview_raw)
type(questions_df)

pandas.core.frame.DataFrame

#Examples of how to access the information in the dataframe:

In [13]:
print('List of links:\n',questions_df['link'][0:3])
print('\n Accessing the data of a specific link\n--- Link: ',questions_df['link'][0])
print('\n--- Title: ', questions_df['title'][0])
print('\n--- Brief description: ', questions_df['summary'][0])
print('\n--- Vote count: ', questions_df['vote_count'][0])
print('\n--- Answer count: ', questions_df['answers_count'][0])
print('\n--- View count: ', questions_df['views'][0])
print('\n--- List of tags: ', questions_df['tags'][0])
questions_df.head(3)

List of links:
 0    https://stackoverflow.com/questions/23708898/p...
1    https://stackoverflow.com/questions/573618/set...
2    https://stackoverflow.com/questions/8000022/dj...
Name: link, dtype: object

 Accessing the data of a specific link
--- Link:  https://stackoverflow.com/questions/23708898/pip-is-not-recognized-as-an-internal-or-external-command

--- Title:  'pip' is not recognized as an internal or external command

--- Brief description:  
            I'm running into a weird error when trying to install Django on my computer.
This is the sequence that I typed into my command line:
C:\Python34> python get-pip.py
Requirement already up-to-date: ...
        

--- Vote count:  343

--- Answer count:  32

--- View count:  1060830

--- List of tags:  ['python', 'django', 'windows', 'pip']


| | answers_count | best_answer | full_question | link | summary | tags | title | views | vote_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | | | https://stackoverflow.com/questions/23708898/p... | \r\n I'm running into a weird error... | [python, django, windows, pip] | 'pip' is not recognized as an internal or exte... | 1060830 | 343 |
| 1 | 24 | | | https://stackoverflow.com/questions/573618/set... | \r\n I've been working on a web app... | [python, django, web-applications, scheduled-t... | Set up a scheduled job? | 170303 | 523 |
| 2 | 8 | | | https://stackoverflow.com/questions/8000022/dj... | \r\n mydict = {"key1":"value1", "ke... | [python, django, templates, dictionary] | Django template how to look up a dictionary va... | 143481 | 236 |


Next steps:


1.   Enrich questions_df with the full question text and the content of the best-rated answer
2.   Clean the data in questions_df to remove irrelevant content, such as: \n, \t, articles, pronouns



In [15]:

def read_question_detail(questions_df):

  # Fetch each question page and store the question body and the first answer shown
  for idx, link in enumerate(questions_df['link']):
    try:
      request = requests.get(link)
      request.raise_for_status()
      try:
        soup = bs(request.text, 'html.parser')
        # .at writes directly to the dataframe, avoiding pandas chained assignment
        questions_df.at[idx, 'full_question'] = soup.find("div", {"id": "question"}).select_one('.post-text').getText()
        questions_df.at[idx, 'best_answer'] = soup.find("div", {"id": "answers"}).select_one('.post-text').getText()

      except Exception:
        print ("Could not transform to soup object by selecting")

    except HTTPError:
      print ("Could not download page")

    sleep(0.05)

  return questions_df

In [16]:
questions_df = read_question_detail(questions_df)

In [17]:
questions_df.columns

Index(['answers_count', 'best_answer', 'full_question', 'link', 'summary',
       'tags', 'title', 'views', 'vote_count'],
      dtype='object')

In [20]:
#removes all punctuation, lowercases, strips accents, and returns the cleaned string
def clean_text (text):
    text = text.translate(str.maketrans('', '', string.punctuation)) #removes all punctuation: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    text = text.replace('\n',' ').strip() 
    text = text.lower()
    text = unidecode.unidecode(text)
    return text

In [54]:
#reduces words to their root (stemming), removes stopwords and words with 2 characters or fewer, and builds the vocabulary with the number of occurrences of each word across all documents

def stackoverflow_vocabulary(questions_df):
    docs_stem_words = []
    vocabulary = {}
    stop_words = stopwords.words('english')
    snowball_stemmer = SnowballStemmer("english")

    for index in range(len(questions_df)):
        # Join with spaces so the last word of one field does not merge with the first word of the next
        text = ' '.join([questions_df['title'][index], questions_df['full_question'][index], questions_df['best_answer'][index]])
        tokentext = word_tokenize(clean_text(text))
        stem_words  = [snowball_stemmer.stem(word) for word in tokentext if not word in stop_words and len(word) > 2 and word not in string.punctuation]
        docs_stem_words.append(stem_words)

        #Initialise the vocabulary without repeated words
        for word in stem_words:
            vocabulary[word] = 0

    #Count the occurrences of each word across all documents
    for words in docs_stem_words:
        for word in words:
            vocabulary[word] += 1
    
    return vocabulary, docs_stem_words

vocabulary, docs_stem_words = stackoverflow_vocabulary(questions_df)
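
The two-pass counting above (set every stem to 0, then increment) could also be written with `collections.Counter`. This is a minimal sketch on made-up, already-stemmed toy documents; it assumes the same structure as `docs_stem_words`, one list of stems per document, and the names `toy_docs_stem_words` and `count_vocabulary` are illustrative.

```python
from collections import Counter

# Toy stand-in for docs_stem_words: one list of stems per document
toy_docs_stem_words = [
    ['instal', 'django', 'pip'],
    ['django', 'templat', 'django'],
]

def count_vocabulary(docs_stem_words):
    """Count how often each stem occurs across all documents."""
    vocabulary = Counter()
    for stem_words in docs_stem_words:
        vocabulary.update(stem_words)
    return dict(vocabulary)

toy_vocabulary = count_vocabulary(toy_docs_stem_words)
print(toy_vocabulary)
```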


In [67]:
#Build an inverted index to enable searches
from operator import itemgetter

def create_InvertedIndex(vocabulary, docs_stem_words): 
    invertedList = dict()
    for term in vocabulary:
        invertedList[term] = list()
        for index, stem_words in enumerate(docs_stem_words):
            frequency = stem_words.count(term)
            if frequency > 0:
                invertedList[term].append([index, frequency])
        # Sort each posting list once, most frequent document first
        invertedList[term].sort(key=itemgetter(1), reverse=True)

    # Serialize data into file:
    with open("stackoverflow_InvertedIndex.json", 'w') as f:
        json.dump(invertedList, f)

    return #invertedList

#invertedList = create_InvertedIndex(vocabulary, docs_stem_words)
create_InvertedIndex(vocabulary, docs_stem_words)

# Read data from file:
with open("stackoverflow_InvertedIndex.json") as f:
    invertedList = json.load(f)
#invertedList.items()


In [71]:
invertedList['python'][:5]

[[69, 10], [9, 7], [70, 6], [73, 6], [31, 5]]
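
Each entry in a posting list is a `[document index, frequency]` pair, ordered by frequency, so `[69, 10]` means the stem occurs 10 times in document 69. As a sketch of the same structure built in one pass over each document rather than one scan per vocabulary term, the block below uses made-up stem lists; all names in it are illustrative.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs_stem_words):
    """Map each stem to [[doc_index, frequency], ...], most frequent first."""
    index = defaultdict(list)
    for doc_index, stem_words in enumerate(docs_stem_words):
        # Counter gives the per-document frequency of every stem at once
        for term, frequency in Counter(stem_words).items():
            index[term].append([doc_index, frequency])
    for postings in index.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)

toy_docs = [['django', 'pip'], ['django', 'django', 'templat'], ['pip']]
toy_index = build_inverted_index(toy_docs)
print(toy_index['django'])  # [[1, 2], [0, 1]]
```

The cost grows with the total number of tokens instead of with vocabulary size times corpus size, which matters once more pages are scraped.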

In [None]:
def simple_stemming_docs(documents):
    snowball_stemmer = SnowballStemmer("english")
    stop_words = stopwords.words('english')
    tokens = sum([word_tokenize(clean_text(document)) for document in documents], [])
    stem_words  = [snowball_stemmer.stem(word) for word in tokens if not word in stop_words and len(word) > 2 and word not in string.punctuation]

    return stem_words

In [38]:
def simple_lookup_query(query, invertedList):
    terms = simple_stemming_docs([query])

    docs_index = {}

    for term in terms:
        if term in invertedList.keys():
            docs_index[term] = [index[0] for index in invertedList[term]]
        else:
            docs_index['missingTerm'] = ['']

    return docs_index

In [39]:
searchTerms = input("Enter the search terms: ")
docs_index = simple_lookup_query(searchTerms,invertedList)

NameError: name 'stemming_docs' is not defined

In [None]:
def print_search_result(docs_index, docs, operator='OR'):
    # Show which documents each query term maps to
    for i, (k, v) in enumerate(docs_index.items()):
        print("{:<8}key: {:<20} value: {}".format(i, k, v))

    print()

    result_lists = list(docs_index.values())

    if operator == 'AND' and [''] in result_lists:
        # A term missing from the index makes an AND query unsatisfiable
        result_lists = []
    elif [''] in result_lists:
        result_lists.remove([''])

    # Combine the posting lists: intersection for AND, union for OR
    response_set = set(result_lists[0]) if result_lists else set()
    for doc_ids in result_lists[1:]:
        if operator == 'AND':
            response_set &= set(doc_ids)
        else:
            response_set |= set(doc_ids)

    print("Found", len(response_set), "documents matching the search terms...")
    print()

    # Build the result, highlighting each query term in the document text
    results = []
    for doc in sorted(response_set):
        document = docs[doc]
        for term in docs_index.keys():
            document = document.replace(term, "\033[48;5;0m\033[38;5;226m {term} \033[0;0m".format(term=term))
        results.append(str(doc + 1) + " - " + document)

    # Display the result
    for result in results:
        print(result)

    return

# 'documents' is expected to hold the raw text of each document, in the same order as docs_stem_words
print_search_result(docs_index,documents,'OR')
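
The AND/OR combination can be sketched with Python sets: intersect all posting lists for AND, take their union for OR. The function name and toy data below are illustrative assumptions, and a missing term is represented here by an empty list rather than `['']`.

```python
from functools import reduce

def combine_postings(posting_lists, operator='OR'):
    """Return the sorted document ids matching all (AND) or any (OR) terms."""
    sets = [set(postings) for postings in posting_lists]
    if not sets:
        return []
    if operator == 'AND':
        combined = reduce(set.intersection, sets)
    else:
        combined = reduce(set.union, sets)
    return sorted(combined)

toy_postings = [[0, 1, 4], [1, 4, 7], [4, 9]]
print(combine_postings(toy_postings, 'AND'))  # [4]
print(combine_postings(toy_postings, 'OR'))   # [0, 1, 4, 7, 9]
```

Folding over all lists at once gives a flat set of document ids however many terms the query has.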

In [77]:
# create functions for TF-IDF / BM25
import math

def tf(word, doc):
    return doc.count(word) / len(doc)

def n_containing(word, doclist):
    return sum(1 for doc in doclist if word in doc)

def idf(word, doclist):
    return math.log(len(doclist) / (0.01 + n_containing(word, doclist)))

def tfidf(word, doc, doclist):
    return (tf(word, doc) * idf(word, doclist))
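
To see how these functions behave, here is a small worked example on toy token lists (the documents are made up for illustration). The functions are repeated so the block runs on its own. A term that appears in every document gets an IDF close to zero (slightly negative, because of the 0.01 smoothing term), while a rarer term scores higher.

```python
import math

def tf(word, doc):
    return doc.count(word) / len(doc)

def n_containing(word, doclist):
    return sum(1 for doc in doclist if word in doc)

def idf(word, doclist):
    return math.log(len(doclist) / (0.01 + n_containing(word, doclist)))

def tfidf(word, doc, doclist):
    return tf(word, doc) * idf(word, doclist)

# Toy corpus: each document is a list of stems
toy_docs = [
    ['django', 'instal', 'pip'],
    ['django', 'templat'],
    ['django', 'pip', 'pip'],
]

# 'django' is in all three documents, 'templat' in only one
print(tfidf('django', toy_docs[1], toy_docs))
print(tfidf('templat', toy_docs[1], toy_docs))
```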

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer #TF-IDF

# Assumption: the per-document stem lists and the unique words are the
# docs_stem_words and vocabulary returned by stackoverflow_vocabulary() above.
worddic = {}

for index, doc_words in enumerate(docs_stem_words):
    for word in vocabulary:
        if word in doc_words:
            word = str(word)
            positions = list(np.where(np.array(doc_words) == word)[0])
            idfs = tfidf(word, doc_words, docs_stem_words)
            # Each word keeps [doc index, positions, TF-IDF] for the last document that contains it
            worddic[word] = [index,positions,idfs]

In [90]:
worddic['instal']

[84,
 [6, 44, 56, 60, 64, 86, 99, 102, 108, 111, 183, 186, 204, 217, 227],
 0.5233147938622832]

In [91]:
print(questions_df['full_question'][5])


I want users on the site to be able to download files whose paths are obscured so they cannot be directly downloaded.
For instance, I'd like the URL to be something like this: http://example.com/download/?f=somefile.txt
And on the server, I know that all downloadable files reside in the folder /home/user/files/.
Is there a way to make Django serve that file for download as opposed to trying to find a URL and View to display it?



In [92]:
print(questions_df['best_answer'][5])


For the "best of both worlds" you could combine S.Lott's solution with the xsendfile module: django generates the path to the file (or the file itself), but the actual file serving is handled by Apache/Lighttpd. Once you've set up mod_xsendfile, integrating with your view takes a few lines of code:
from django.utils.encoding import smart_str

response = HttpResponse(mimetype='application/force-download') # mimetype is replaced by content_type for django 1.7
response['Content-Disposition'] = 'attachment; filename=%s' % smart_str(file_name)
response['X-Sendfile'] = smart_str(path_to_file)
# It's usually a good idea to set the 'Content-Length' header too.
# You can also set any other required headers: Cache-Control, etc.
return response

Of course, this will only work if you have control over your server, or your hosting company has mod_xsendfile already set up.
EDIT:

mimetype is replaced by content_type for django 1.7

response = HttpResponse(content_type='application/force-download') 

In [75]:

tfidfvectorizer = TfidfVectorizer()
tfidfvectorizer.fit(['django'])
vectortfidf = tfidfvectorizer.transform(['django'])
# summarize encoded vector
print(vectortfidf.shape)
print(type(vectortfidf))
print(vectortfidf.toarray())

(1, 1)
<class 'scipy.sparse.csr.csr_matrix'>
[[1.]]


In [106]:
from sklearn.feature_extraction.text import CountVectorizer #TF

text = ' '.join([word for word in keywords])
print(text)

set()

running weird error trying install django computer sequence typed command line cpython34 python getpippy requirement already uptodate pip cpython34libsitepackages cleaning cpython34 pip install django pip recognized internal external command operable program batch file cpython34 libsitepackagespip install django libsitepackagespip recognized internal external command operable program batch file could causing get type echo path cpython34echo path cprogram filesimagemagick688q16cprogram files x86intelicls client cprogram filesintelicls clientcwindowssystem32cwindowscwindowss ystem32wbemcwindowssystem32windowspowershellv10cprogram files x86 windows livesharedcprogram files x86intelopencl sdk20binx86cprogr files x86intelopencl sdk20binx64cprogram filesintelintelr mana gement engine componentsdalcprogram filesintelintelr management engine omponentsiptcprogram files x86intelintelr management engine components dalcprogram files x86intelintelr management engine componentsiptcp rogram files x86

In [107]:
vectorizer = CountVectorizer()
vectorizer.fit([text])
vector = vectorizer.transform([text])
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

(1, 113)
<class 'scipy.sparse.csr.csr_matrix'>
[[ 1  1  2  1  2  1  1  2  1  1  1  1  1  1  1  1  1  7  1  1  1  1  1  1
   2  3  1  1  1  1  1  1  1  3  1  2  4  1  1  1  1  2  2  9  1  1  2  1
   1  1  1  3  2  1  1  2  2  1  1  1  3  2  1  3  1  1  1  2  1  1  1  1
  11  1  7  1  2  1  2  2  1  1  1  1  1  1  1  1  3  1  1  2  1  1  1  2
   1  1  1  1  5  2  1  1  1  2  1  1  1  2  2  1  1]]


In [111]:

#Vectorize a new text
text2 = ["How can I install install install install  install Django? No success success success with django so far... django django"]
vector2 = vectorizer.transform(text2)
print(vector2.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]]


In [109]:

from sklearn.feature_extraction.text import TfidfVectorizer #TF-IDF
tfidfvectorizer = TfidfVectorizer()
tfidfvectorizer.fit([text])
vectortfidf = tfidfvectorizer.transform([text])
# summarize encoded vector
print(vectortfidf.shape)
print(type(vectortfidf))
print(vectortfidf.toarray())

(1, 113)
<class 'scipy.sparse.csr.csr_matrix'>
[[0.04218245 0.04218245 0.08436491 0.04218245 0.08436491 0.04218245
  0.04218245 0.08436491 0.04218245 0.04218245 0.04218245 0.04218245
  0.04218245 0.04218245 0.04218245 0.04218245 0.04218245 0.29527718
  0.04218245 0.04218245 0.04218245 0.04218245 0.04218245 0.04218245
  0.08436491 0.12654736 0.04218245 0.04218245 0.04218245 0.04218245
  0.04218245 0.04218245 0.04218245 0.12654736 0.04218245 0.08436491
  0.16872982 0.04218245 0.04218245 0.04218245 0.04218245 0.08436491
  0.08436491 0.37964209 0.04218245 0.04218245 0.08436491 0.04218245
  0.04218245 0.04218245 0.04218245 0.12654736 0.08436491 0.04218245
  0.04218245 0.08436491 0.08436491 0.04218245 0.04218245 0.04218245
  0.12654736 0.08436491 0.04218245 0.12654736 0.04218245 0.04218245
  0.04218245 0.08436491 0.04218245 0.04218245 0.04218245 0.04218245
  0.46400699 0.04218245 0.29527718 0.04218245 0.08436491 0.04218245
  0.08436491 0.08436491 0.04218245 0.04218245 0.04218245 0.04218245
 

In [112]:

#Vectorize a new text
vectortfidf2 = tfidfvectorizer.transform(text2)
print(vectortfidf2.toarray())

[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.56568542 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.70710678 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.42426407 0.         0.         0.         0.
  0.         0.         0.         0.   

In [113]:
from sklearn.metrics.pairwise import cosine_similarity
print("TF ", cosine_similarity(vector, vector2))
print("TF-IDF: ", cosine_similarity(vectortfidf, vectortfidf2))

TF  [[0.1968615]]
TF-IDF:  [[0.1968615]]
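
Both scores coincide here, which is expected: each vectorizer was fitted on a single document, so every term receives the same IDF weight and the TF-IDF vector is just the count vector rescaled. Cosine similarity ignores such rescaling. A minimal pure-Python sketch with made-up count vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

counts_doc = [3, 1, 0, 2]      # toy term counts for a document
counts_query = [4, 0, 5, 3]    # toy term counts for a query

# Dividing each vector by its own length mimics L2-normalised weights
scaled_doc = [x / math.sqrt(14) for x in counts_doc]
scaled_query = [x / math.sqrt(50) for x in counts_query]

print(cosine(counts_doc, counts_query))
print(cosine(scaled_doc, scaled_query))
```

The two printed values agree up to floating-point rounding. TF and TF-IDF rankings only start to differ once the vectorizer is fitted on several documents, so that terms get different IDF weights.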

