In [1]:
import pandas as pd

### Preprocessing the dataset and creating a smaller sample
The full dataset has over 2 million records and takes a long time to process, so this project works on a sample of it.
The sample is a random 30% of the questions in Questions.csv, together with the matching rows from Answers.csv and Tags.csv, so the relationships between the three files are preserved.

In [2]:
# Load the datasets.
questions = pd.read_csv("/home/shreyaspujari/Documents/DocumentClustering_Stackoverflow/data/Questions.csv", encoding="ISO-8859-1")
answers = pd.read_csv("/home/shreyaspujari/Documents/DocumentClustering_Stackoverflow/data/Answers.csv", encoding="ISO-8859-1")
tags = pd.read_csv("/home/shreyaspujari/Documents/DocumentClustering_Stackoverflow/data/Tags.csv", encoding="ISO-8859-1")

Display the first rows of each table to understand how the tables relate to one another.

In [3]:
print(questions.head())
print(answers.head())
print(tags.head())

    Id  OwnerUserId          CreationDate            ClosedDate  Score  \
0   80         26.0  2008-08-01T13:57:07Z                   NaN     26   
1   90         58.0  2008-08-01T14:41:24Z  2012-12-26T03:45:49Z    144   
2  120         83.0  2008-08-01T15:50:08Z                   NaN     21   
3  180    2089740.0  2008-08-01T18:42:19Z                   NaN     53   
4  260         91.0  2008-08-01T23:22:08Z                   NaN     49   

                                               Title  \
0  SQLStatement.execute() - multiple queries in o...   
1  Good branching and merging tutorials for Torto...   
2                                  ASP.NET Site Maps   
3                 Function for creating color wheels   
4  Adding scripting functionality to .NET applica...   

                                                Body  
0  <p>I've written a database generation script i...  
1  <p>Are there any really good tutorials explain...  
2  <p>Has anyone got experience creating <strong>... 

### In this part of the code we sample 30% of the questions for use in our analysis

In [4]:
sampled_questions = questions.sample(frac=0.3, random_state=42)

# Keep only the entries from Answers and Tags that belong to the sampled question IDs
sampled_ids = sampled_questions['Id'].tolist()
sampled_Answers = answers[answers["ParentId"].isin(sampled_ids)]
sampled_tags = tags[tags["Id"].isin(sampled_ids)]
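
A quick way to confirm that the sample keeps the relationships intact is to check that every sampled answer and tag points at a sampled question. The sketch below runs that check on small made-up frames; the toy values are illustrative and not taken from the dataset, while the column names `Id` and `ParentId` follow the cells above.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are made up for illustration.
questions = pd.DataFrame({"Id": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
answers = pd.DataFrame({"ParentId": [10, 10, 20, 40, 70, 70, 100]})
tags = pd.DataFrame({
    "Id": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "Tag": ["python"] * 10,
})

# Same sampling call as in the notebook, applied to the toy questions.
sampled_questions = questions.sample(frac=0.3, random_state=42)
sampled_ids = set(sampled_questions["Id"])

sampled_answers = answers[answers["ParentId"].isin(sampled_ids)]
sampled_tags = tags[tags["Id"].isin(sampled_ids)]

# Every kept answer and tag must reference a sampled question.
orphan_answers = set(sampled_answers["ParentId"]) - sampled_ids
orphan_tags = set(sampled_tags["Id"]) - sampled_ids
print(len(sampled_questions), len(orphan_answers), len(orphan_tags))
```

Both orphan sets should be empty. Using a `set` for the IDs is a small design choice that also makes the set difference direct.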

### Merging the Datasets

In [5]:
merged_data = sampled_questions[['Id', 'Title', 'Body']].merge(
    sampled_Answers[['ParentId', 'Body']],
    left_on='Id',
    right_on='ParentId',
    suffixes=('_question', '_answer'),
    how='left'
).merge(
    sampled_tags[['Id', 'Tag']],
    on='Id',
    how='left'
)
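
Because answers and tags are both joined on the question `Id`, a question with m answers and n tags yields m x n rows, which is why the same answer appears once per tag in the final output. If one row per answer is preferred, one option (an alternative to the cell above, not what this notebook does) is to collapse the tags of each question into a single string before merging. The sketch uses toy data, and the `Tags` column name is hypothetical.

```python
import pandas as pd

# Toy data: question 1 has 2 answers and 3 tags, question 2 has 1 of each.
questions = pd.DataFrame({"Id": [1, 2], "Title": ["q1", "q2"]})
answers = pd.DataFrame({"ParentId": [1, 1, 2], "Body": ["a1", "a2", "a3"]})
tags = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "Tag": ["android", "events", "send", "python"],
})

# Joining both tables directly gives one row per (answer, tag) pair.
wide = questions.merge(
    answers, left_on="Id", right_on="ParentId", how="left"
).merge(tags, on="Id", how="left")

# Collapse tags to one space-separated string per question first.
tags_per_question = (
    tags.groupby("Id")["Tag"].agg(" ".join).rename("Tags").reset_index()
)
compact = questions.merge(
    answers, left_on="Id", right_on="ParentId", how="left"
).merge(tags_per_question, on="Id", how="left")

print(len(wide), len(compact))
```

If any `Tag` values are missing in the real data, they would need to be dropped before joining the strings, since `" ".join` only accepts strings.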

In [6]:
# Rename columns for clarity
merged_data.rename(columns={
    'Body_question': 'Question_Body',
    'Body_answer': 'Answer_Body',
    'Title': 'Question_Title'
}, inplace=True)

### Cleaning the dataset

In [7]:
merged_data.dropna(subset=['Question_Body', 'Answer_Body'], inplace=True)  # Remove rows missing key fields
merged_data['Question_Body'] = merged_data['Question_Body'].str.lower().str.strip()
merged_data['Answer_Body'] = merged_data['Answer_Body'].str.lower().str.strip()
merged_data['Tag'] = merged_data['Tag'].str.lower().str.strip()

In [8]:
# Save the dataset back to the path
merged_data.to_csv("/home/shreyaspujari/Documents/DocumentClustering_Stackoverflow/data/final_dataset.csv")
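
Note that `to_csv` writes the DataFrame index as an unnamed first column by default, so reading the file back produces an extra `Unnamed: 0` column. If that column is not wanted, passing `index=False` avoids it. A small round trip on toy data, using an in-memory buffer instead of the real path:

```python
import io

import pandas as pd

# Toy frame; only the column names mirror the notebook.
df = pd.DataFrame({"Id": [17016800, 7685280], "Tag": ["android", "android"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```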

In [9]:
merged_data.head(20)

Unnamed: 0,Id,Question_Title,Question_Body,ParentId,Answer_Body,Tag
0,17016800,Handling the EditText send keyboard event for ...,<pre><code>import com.example.methanegaszonege...,17016800.0,<p>try changing the line</p>\n\n<pre><code>edi...,android
1,17016800,Handling the EditText send keyboard event for ...,<pre><code>import com.example.methanegaszonege...,17016800.0,<p>try changing the line</p>\n\n<pre><code>edi...,events
2,17016800,Handling the EditText send keyboard event for ...,<pre><code>import com.example.methanegaszonege...,17016800.0,<p>try changing the line</p>\n\n<pre><code>edi...,android-edittext
3,17016800,Handling the EditText send keyboard event for ...,<pre><code>import com.example.methanegaszonege...,17016800.0,<p>try changing the line</p>\n\n<pre><code>edi...,send
4,7685280,EditText: how to enable/disable input?,<p>i have a 7x6 grid of edittext views. i want...,7685280.0,<p>try using this <code>setfocusableontouch()<...,android
5,7685280,EditText: how to enable/disable input?,<p>i have a 7x6 grid of edittext views. i want...,7685280.0,<p>you can do the follwoing:<br>\nif you want ...,android
6,7685280,EditText: how to enable/disable input?,<p>i have a 7x6 grid of edittext views. i want...,7685280.0,<p>according the android guide line please use...,android
7,7685280,EditText: how to enable/disable input?,<p>i have a 7x6 grid of edittext views. i want...,7685280.0,<p>since you are using gridview to achieve you...,android
8,7685280,EditText: how to enable/disable input?,<p>i have a 7x6 grid of edittext views. i want...,7685280.0,"<p>setting input type to null is not enough, s...",android
9,7685280,EditText: how to enable/disable input?,<p>i have a 7x6 grid of edittext views. i want...,7685280.0,<p>i finally found a solution. it's a matter o...,android
