In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd drive/My Drive/search_engine

/content/drive/My Drive/search_engine


## The Task

As per the problem statement, I have to develop a Stack Overflow based semantic search engine. To understand and learn from the data, I need to gather questions and answers that were posted on Stack Overflow. For each question I need the following:

- Title 
- Question body
- Answers for that question
- Votes for each answer


## Data


I will use the Stack Overflow dataset on Kaggle: https://www.kaggle.com/stackoverflow/stackoverflow. It includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive.

- I will use bq_helper, which is a helper class to perform read-only BigQuery tasks. 
Reference: https://www.kaggle.com/sohier/introduction-to-the-bq-helper-package
- There are many tables in the Stack Overflow database, but we only need to concern ourselves with **posts_questions** and **posts_answers**


## Library imports

In [None]:
import bq_helper
from bq_helper import BigQueryHelper
import os
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
from bs4 import BeautifulSoup
from textblob import TextBlob
import re
import nltk
import inflect
from nltk.corpus import stopwords
import heapq

## Getting the data 

In [None]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="credentials.json"
bq_assistant = BigQueryHelper("bigquery-public-data", "stackoverflow")

# One row per (question, answer) pair: each answer is joined to its parent question
query = """
SELECT q.id, q.title, q.body, q.tags, a.body AS answers, a.score
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON q.id = a.parent_id
LIMIT 1000000
"""
df = bq_assistant.query_to_pandas(query)
df.to_csv('Original_data.csv')

In [None]:
original_data = pd.read_csv('Original_data.csv', index_col=0)

In [19]:
original_data.head()

Unnamed: 0,id,title,body,tags,answers,score
0,6684668,SVN: Create a branch from branch and merge to ...,"<p>We have a branch B1, and it is still not st...",svn|svn-merge,<p>Since you stated that you created B2 just t...,10
1,6927011,Is the device token as unique as the device ID?,"<p>If we reset an iPhone, the device ID remain...",iphone|ios|devicetoken,<p>I assume you are referring to the device to...,22
2,6549414,How to run make in vim and open results in a s...,<p>I use vim for coding. When I have to compil...,vim|ide,<p>If you want compile and run if compile succ...,-2
3,2060741,Does Objective-C use short-circuit evaluation?,<p>I tried something along the lines of:</p>\n...,objective-c|short-circuiting,<p>Objective-C is a strict superset of C.</p>\...,10
4,13392623,Returning JSONP from Jersey,<p>I am currently using Jersey to return JSON....,jersey|jsonp,"<p>Take a look at the <a href=""http://java.net...",9


In [20]:
original_data.shape

(1000000, 6)

## Data analysis and processing

### Loading the language model for text processing

In [None]:
EN = spacy.load('en_core_web_sm')

### Missing value check

In [None]:
original_data.isna().sum()

id         0
title      0
body       0
tags       0
answers    0
score      0
dtype: int64

#### Observation
- It is good to see that there are no missing values in the dataset.
- Hence there is no need for data imputation.

### Checking for duplicates in the data

In [None]:
original_data.duplicated().any()

True

#### Observation
- As you can see, there are some duplicates in the data.
- So in the next cell I will combine them and create a corpus grouped by question and tags.
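
It may help to separate two kinds of repetition here. `duplicated()` with no arguments flags only rows that are identical in every column, whereas the join in the query yields one row per answer, so a question id can also repeat with different answers. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: question 1 has two different answers
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "title": ["Q1", "Q1", "Q2"],
    "answers": ["first answer", "second answer", "only answer"],
})

# True only if some row repeats in every column
full_row_duplicates = toy.duplicated().any()

# Number of rows whose question id has already appeared earlier
repeated_questions = toy.duplicated(subset=["id"]).sum()
```

In this toy frame there are no full-row duplicates, yet one row repeats a question id, which is the case the grouping step is meant to handle.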

### Creating a corpus

- Many questions are repeated, each time with a different answer, so in the cell below I combine the answers and sum the scores for each question.

In [21]:
combination = {
    # Function 1
    'answers':{
        'combined_answers': lambda x: "\n".join(x)
    },
    # Function 2
    'score':{
        'combined_score': 'sum'
    }
}
# https://www.geeksforgeeks.org/python-pandas-series-agg/
grouped_data = original_data.groupby(['id','title', 'body','tags'],as_index=False).agg(combination)
deduped_data = pd.DataFrame(grouped_data)

in a future version.

For column-specific groupby renaming, use named aggregation

    >>> df.groupby(...).agg(name=('column', aggfunc))

  return super().aggregate(arg, *args, **kwargs)
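
The warning above recommends named aggregation in place of the nested dictionary. A minimal sketch of what that could look like, run on a made-up frame (the `toy` data is hypothetical; the output column names are reused from the `combination` dictionary):

```python
import pandas as pd

# Hypothetical toy frame with the same columns as original_data
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "title": ["Q1", "Q1", "Q2"],
    "body": ["b1", "b1", "b2"],
    "tags": ["python", "python", "sql"],
    "answers": ["<p>a</p>", "<p>b</p>", "<p>c</p>"],
    "score": [3, 4, 5],
})

# Named aggregation: output_name=(input_column, aggregation function)
grouped = toy.groupby(["id", "title", "body", "tags"], as_index=False).agg(
    combined_answers=("answers", lambda s: "\n".join(s)),
    combined_score=("score", "sum"),
)
```

Note that this form produces flat column names (`combined_answers`, `combined_score`) rather than the two-level header shown in the next output, so code that reads values with `row.title.values[0]` would need to be adapted if this variant were used.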


In [22]:
deduped_data.head()

Unnamed: 0_level_0,id,title,body,tags,answers,score
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,combined_answers,combined_score
0,502,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,python|windows|image|pdf,<p>You can use ImageMagick's convert utility f...,47
1,1417,How can I get the authenticated user name unde...,"<p>First, let's get the security consideration...",php|apache|authentication|http-authentication,<p>I think that you are after this</p>\n\n<pre...,37
2,3144,'Best' Diff Algorithm,<p>I need to implement a Diff algorithm in VB....,vb.net|diff,"<p>I like <a href=""http://www.xmailserver.org/...",17
3,3196,"SQL query, count and group by",<p>If I have data like this:</p>\n\n<pre><code...,sql,<pre><code>select name from table group by nam...,45
4,3831,How do I best detect an ASP.NET expired session?,<p>I need to detect when a session has expired...,asp.net|http|session,<p>Try the following</p>\n\n<pre><code>If Sess...,9


## Text processing of the text data

Here I create some helper functions to preprocess the text data. They cover the following:
- Tokenization of the text
- Converting the tokens to lowercase
- Removing the punctuation from the tokens list
- Removing the stopwords from the tokens list
- Tokenization of the code string

In [None]:
def tokenize_text(text):
    """Tokenize text with spaCy and return the lowercased, non-whitespace tokens."""
    tokens = EN.tokenizer(text)
    return [token.text.lower() for token in tokens if not token.is_space]

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

# Build the stop word set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    return [word for word in words if word not in STOP_WORDS]

def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

from nltk.tokenize import RegexpTokenizer   # needed by tokenize_code

def tokenize_code(text):
    """A very basic procedure for tokenizing code strings."""
    return RegexpTokenizer(r'\w+').tokenize(text)

def preprocess_text(text):
    return ' '.join(normalize(tokenize_text(text)))

**The raw text of the questions and answers comes with the HTML markup with which it was originally displayed on Stack Overflow**. 
The markup consists mostly of *p tags, h1-h6 tags and code tags*.

- I constructed a new feature column called 'post_corpus' by combining the title, question body, and all the answers
- I prepended the title to the question body 
- I skipped the 'code' sections because they do not offer useful information for our task (note that the loop below removes only the first `code` element and keeps only the first `p` and `pre` element of each post)
- I constructed a URL for each question by appending the question id to 'https://stackoverflow.com/questions/'
- I constructed two sentiment features (polarity and subjectivity) using the open-source **TextBlob** library
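
For comparison, here is a minimal sketch of a stricter variant that drops every `code` element and keeps the text of every paragraph. It is not the method used in the loop below; the helper name `html_to_text` and the sample string are hypothetical, and it assumes the built-in `html.parser` backend instead of `lxml`:

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip all <code> elements and join the text of all <p> elements."""
    soup = BeautifulSoup(html, "html.parser")
    for code in soup.find_all("code"):
        code.decompose()                      # remove every code element
    paragraphs = soup.find_all("p")
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs)

# Hypothetical post body with inline code and a code block
sample = (
    "<p>Use <code>df.head()</code> first.</p>"
    "<pre><code>df.head()</code></pre>"
    "<p>Then inspect it.</p>"
)
clean = html_to_text(sample)
```

Whether the extra paragraphs help or hurt the search quality is an open question; the sketch only shows how the extraction rule could be widened.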

In [None]:
title_list = [] 
content_list = []
url_list = []
comment_list = []
sentiment_polarity_list = []
sentiment_subjectivity_list = []
vote_list =[]
tag_list = []
corpus_list = []

for i, row in tqdm(deduped_data.iterrows()):
    title_list.append(row.title.values[0])    # Get question title
    tag_list.append(row.tags.values[0])     # Get question tags
    
    # Questions
    content = row.body.values[0]
    soup = BeautifulSoup(content, 'lxml')
    if soup.code: soup.code.decompose()     # Remove the code section
    tag_p = soup.p
    tag_pre = soup.pre
    text = ''
    if tag_p: text = text + tag_p.get_text()
    if tag_pre: text = text + tag_pre.get_text()
        
    content_list.append(str(row.title.values[0]) + ' ' + str(text))   # Append title and question body data to the updated question body
    
    url_list.append('https://stackoverflow.com/questions/' + str(row.id.values[0]))
    
    # Answers
    content = row.answers.values[0]
    soup = BeautifulSoup(content, 'lxml')
    if soup.code: soup.code.decompose()
    tag_p = soup.p
    tag_pre = soup.pre
    text = ''
    if tag_p: text = text + tag_p.get_text()
    if tag_pre: text = text + tag_pre.get_text()
    comment_list.append(text)
    
    vote_list.append(row.score.values[0])       # Append votes
    
    corpus_list.append(content_list[-1] + ' ' + comment_list[-1])     # Combine the updated body and answers to make the corpus
    
    sentiment = TextBlob(row.answers.values[0]).sentiment
    sentiment_polarity_list.append(sentiment.polarity)
    sentiment_subjectivity_list.append(sentiment.subjectivity)

content_token_df = pd.DataFrame({'original_title': title_list, 'post_corpus': corpus_list, 
                                 'question_content': content_list, 'question_url': url_list, 
                                 'tags': tag_list, 'overall_scores':vote_list,
                                 'answers_content': comment_list, 
                                 'sentiment_polarity': sentiment_polarity_list, 
                                 'sentiment_subjectivity':sentiment_subjectivity_list})

155322it [37:48, 30.69s/it]

- Next I count how often every tag occurs, store the counts in a dictionary, and then select the top 100 tags. You can change that number.

In [None]:
content_token_df.tags = content_token_df.tags.apply(lambda x: x.split('|'))   # Convert raw text data of tags into lists

# Make a dictionary to count the frequencies for all tags
tag_freq_dict = {}
for tags in content_token_df.tags:
    for tag in tags:
        # Start from 0 for unseen tags so that the first occurrence counts as 1
        tag_freq_dict[tag] = tag_freq_dict.get(tag, 0) + 1

In [None]:
most_common_tags = heapq.nlargest(100, tag_freq_dict, key=tag_freq_dict.get)
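
The same counting and top-N selection can also be sketched with `collections.Counter` from the standard library. This is an equivalent alternative rather than the code used in this notebook, and the `tag_lists` data is made up:

```python
from collections import Counter

# Hypothetical tag lists, one list per question
tag_lists = [["python", "pandas"], ["python", "numpy"], ["sql"]]

# Count every tag across all questions
tag_counts = Counter(tag for tags in tag_lists for tag in tags)

# Keep the names of the two most frequent tags (use 100 for the real data)
top_two = [tag for tag, _ in tag_counts.most_common(2)]
```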

In [None]:
final_indices = []
for i,tags in enumerate(content_token_df.tags.values.tolist()):
    if len(set(tags).intersection(set(most_common_tags)))>1:   # Keep only posts that carry at least two of the most common tags
        final_indices.append(i)
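
To make the filter rule concrete, here is a small sketch on made-up data (the tag names and the `posts` list are hypothetical). A post is kept only when at least two of its tags are among the most common ones:

```python
# Hypothetical set of most common tags and three posts
most_common = {"python", "pandas", "numpy"}
posts = [["python", "pandas"], ["python", "flask"], ["sql"]]

# Indices of posts sharing at least two tags with the common set
kept = [
    i for i, tags in enumerate(posts)
    if len(most_common.intersection(tags)) >= 2
]
```

Only the first post passes here: the second has a single common tag and the third has none.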

In [None]:
final_data = content_token_df.iloc[final_indices].copy()   # copy so the column assignments below modify an independent frame

- After selecting the posts that carry the top 100 tags, I preprocess the text and normalize the numeric scores.

In [None]:
import spacy
EN = spacy.load('en_core_web_sm')

# Preprocess text for 'question_content', 'post_corpus' and a new column 'processed_title'
final_data.question_content = final_data.question_content.apply(lambda x: preprocess_text(x))
final_data.post_corpus = final_data.post_corpus.apply(lambda x: preprocess_text(x))
final_data['processed_title'] = final_data.original_title.apply(lambda x: preprocess_text(x))

# Normalize numeric data for the scores
final_data['overall_scores'] = (final_data.overall_scores - final_data.overall_scores.mean()) / (final_data.overall_scores.max() - final_data.overall_scores.min())
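
The line above applies mean normalization: each score is centred on the mean and divided by the range (max minus min). A minimal worked example on made-up scores (the `scores` series is hypothetical):

```python
import pandas as pd

# Hypothetical scores: mean is 5, range is 10
scores = pd.Series([0, 5, 10])

# Subtract the mean, then divide by the range
normalized = (scores - scores.mean()) / (scores.max() - scores.min())
```

The result is centred on zero and spans a total width of one. If every score were identical the range would be zero, so this formula assumes the scores vary.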

In [None]:
final_data.tags = final_data.tags.apply(lambda x: '|'.join(x))    # Combine the lists back into text data
final_data = final_data.drop(['answers_content'], axis=1)    # drop() returns a new frame, so reassign it

In [None]:
# Save the data
final_data.to_pickle('data/Preprocessed_data.pkl')