## Data Exploration
This notebook explores three Stack Overflow datasets: the Kaggle 45k dataset, the Cornell dataset, and the Kaggle 10% dataset.

In [3]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path

**Kaggle 45k dataset**  
https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate  

In [2]:
df = pd.read_csv('datasets/kaggle-stackoverflow-45k/train.csv')
df.shape

(45000, 6)

In [3]:
df.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2016-01-01 05:21:48,HQ


In [4]:
subset = df.iloc[:100]
subset.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2016-01-01 05:21:48,HQ


In [5]:
print(subset.iloc[0].Body)

<p>I'm already familiar with repeating tasks every n seconds by using Java.util.Timer and Java.util.TimerTask. But lets say I want to print "Hello World" to the console every random seconds from 1-5. Unfortunately I'm in a bit of a rush and don't have any code to show so far. Any help would be apriciated.  </p>



In the first 100 rows, every tag can be extracted cleanly (i.e. no stray < or > characters remain after extraction).

In [6]:
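tagTest = subset.copy()
# Turn "<java><repeat>" into ["java", "repeat"]
tagTest = tagTest.assign(Tags=tagTest["Tags"].str.findall(r"<(.*?)>"))
# Join all extracted tags once, then check for leftover angle brackets
all_tags = "".join(tagTest["Tags"].sum())
">" in all_tags or "<" in all_tags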

False

In [7]:
tagTest.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,"[java, repeat]",2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,"[java, optional]",2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,"[javascript, image, overlay, react-native, opa...",2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...","[swift, operators, whitespace, ternary-operato...",2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,"[android, material-design, floating-action-but...",2016-01-01 05:21:48,HQ


Links associated with a question can be found in anchor tags.  
Body tags may also reveal the language of embedded code (e.g. stdio.h for C).  
The regex below also picks up code fragments that are not HTML (e.g. <Attraction*>, <?>).

In [8]:
bodyTest = subset.copy()
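# Collect the distinct opening tags (anything starting with "<" but not "</")
t = set(bodyTest["Body"].str.findall(r"<[^/].*?>").sum())
# Set order is arbitrary, so these five are only a sample
for tag in list(t)[:5]:
    print(tag)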

<?>
<a href="https://i.stack.imgur.com/U971h.png" rel="noreferrer">
<div class="col-md-3">
<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/import" rel="noreferrer">
<script type="text/javascript">
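
Because the pattern above also matches code fragments such as <?>, one possible refinement is to capture only the tag name and keep it when it appears in an allowlist. The sketch below is an illustration rather than part of the original analysis: `KNOWN_HTML_TAGS`, `OPENING_TAG`, and `html_tag_names` are hypothetical names, and the allowlist is an assumed, incomplete set of elements.

```python
import re

# Hypothetical allowlist: an assumed, non-exhaustive set of HTML element names.
KNOWN_HTML_TAGS = {
    "a", "p", "div", "script", "img", "pre", "code",
    "strong", "em", "ul", "ol", "li", "blockquote",
}

# Capture the name of an opening tag. Names must start with a letter,
# so fragments such as <?> never match, and closing tags (</p>) are skipped.
OPENING_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def html_tag_names(body):
    """Return the allowlisted tag names that open in `body`, in order."""
    names = (match.group(1).lower() for match in OPENING_TAG.finditer(body))
    return [name for name in names if name in KNOWN_HTML_TAGS]

sample = '<p>Use List<?> or vector<Attraction*> <a href="https://example.com">here</a></p>'
print(html_tag_names(sample))  # ['p', 'a']
```

A limitation of this approach is that a single-letter generic type such as `<A>` in Java code would still be counted as an anchor tag, so the result is a heuristic filter, not a parser.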


In [9]:
bodyTest["Body"].str.findall(r"<(.*?)>(.*?)<(/\1)>").iloc[2]

[('p',
  'I am attempting to overlay a title over an image - with the image darkened with a lower opacity. However, the opacity effect is changing the overlaying text as well - making it dim. Any fix to this? Here is what is looks like:',
  '/p'),
 ('p',
  '<a href="https://i.stack.imgur.com/1HzD7.png" rel="noreferrer"><img src="https://i.stack.imgur.com/1HzD7.png" alt="enter image description here"></a>',
  '/p'),
 ('p',
  'And here is my code for the custom component (article preview - which the above image is a row of article preview components): ',
  '/p')]
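
Since links sit in anchor tags, as the second paragraph above shows, they could also be pulled out with the standard library's `html.parser` instead of a regex. This is a minimal sketch under that assumption; `LinkCollector` and `extract_links` are hypothetical helper names, and the sample string is the second paragraph from the output above.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            self.links.extend(
                value for name, value in attrs if name == "href" and value
            )

def extract_links(body):
    """Return every href found in the anchor tags of `body`."""
    parser = LinkCollector()
    parser.feed(body)
    parser.close()
    return parser.links

sample = (
    '<p><a href="https://i.stack.imgur.com/1HzD7.png" rel="noreferrer">'
    '<img src="https://i.stack.imgur.com/1HzD7.png" '
    'alt="enter image description here"></a></p>'
)
print(extract_links(sample))  # ['https://i.stack.imgur.com/1HzD7.png']
```

Applied to the whole column, this would look like `bodyTest["Body"].apply(extract_links)`.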

**Cornell Dataset**  
https://www.cs.cornell.edu/~arb/data/stackoverflow-answers/  
Judging by the contents of its text files, this graph-based dataset does not include the actual question or answer text.  
The Cornell dataset might still be useful for similarity, although any inference would be weak with only tags available.

**Kaggle 10% Dataset**  
https://www.kaggle.com/datasets/stackoverflow/stacksample?resource=download  
These files need to be loaded with the "latin-1" encoding; reading them as UTF-8 raises errors.
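
One way to check which encoding a file needs before loading it is to try decoding it with each candidate in turn. The sketch below is an illustration only: `first_working_encoding` is a hypothetical helper, and the demo file is a made-up two-line CSV, not part of the dataset. Note that "latin-1" maps every byte to a character, so it never fails and acts as a catch-all.

```python
import codecs
import os
import tempfile

def first_working_encoding(path, candidates=("utf-8", "latin-1"), chunk_size=1 << 20):
    """Return the first encoding in `candidates` that decodes the whole file."""
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            with open(path, "rb") as handle:
                # Decode in chunks so large files are not held in memory at once.
                for chunk in iter(lambda: handle.read(chunk_size), b""):
                    decoder.decode(chunk)
            # Flush the decoder so a truncated trailing sequence is reported.
            decoder.decode(b"", final=True)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError(f"none of {candidates} could decode {path}")

# Demo on a made-up file containing a byte (0xE9) that is invalid UTF-8 here.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b"Id,Title\n1,caf\xe9\n")
    detected = first_working_encoding(sample_path)

print(detected)  # latin-1
```

The returned name can then be passed along, e.g. `pd.read_csv(path, encoding=detected)`.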


In [19]:
b10 = Path("datasets") / "kaggle-stackoverflow-10-percent"

In [27]:
ten_percent_questions = pd.read_csv(b10 / "Questions.csv", encoding="latin-1")
ten_percent_questions.head(3)

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...


In [24]:
ten_percent_answers = pd.read_csv(b10 / "Answers.csv", encoding="latin-1")
ten_percent_answers.head(3)

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,92,61.0,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers..."
1,124,26.0,2008-08-01T16:09:47Z,80,12,<p>I wound up using this. It is a kind of a ha...
2,199,50.0,2008-08-01T19:36:46Z,180,1,<p>I've read somewhere the human eye can't dis...


In [26]:
ten_percent_tags = pd.read_csv(b10 / "Tags.csv", encoding="latin-1")
ten_percent_tags.head(3)

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
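
The output above suggests that Tags.csv stores one row per (question Id, tag) pair, since Id 80 appears three times. To get one list of tags per question, comparable to the extracted tag lists in the 45k dataset, the rows can be grouped by Id. A minimal standard-library sketch, where `group_tags` is a hypothetical helper and the rows are the three shown above:

```python
from collections import defaultdict

def group_tags(rows):
    """Map each question Id to the list of its tags, preserving row order."""
    tags_by_question = defaultdict(list)
    for question_id, tag in rows:
        tags_by_question[question_id].append(tag)
    return dict(tags_by_question)

rows = [(80, "flex"), (80, "actionscript-3"), (80, "air")]
print(group_tags(rows))  # {80: ['flex', 'actionscript-3', 'air']}
```

With the DataFrame already loaded, `ten_percent_tags.groupby("Id")["Tag"].apply(list)` should give an equivalent result as a Series.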


TODO: continue here by parsing the HTML in the bodies (as with the 45k dataset).