## Data Exploration
This notebook explores the Kaggle dataset: it loads the questions, answers, and tags, merges them into one table, cleans the text, and saves the result.

In [11]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import html
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import string
import functions  # local helper module (provides expand_contractions)


In [12]:
questions = pd.read_csv("Dataset/Questions.csv",encoding='latin-1')
answers = pd.read_csv("Dataset/Answers.csv",encoding='latin-1')
tags = pd.read_csv("Dataset/Tags.csv",encoding='latin-1')
tag_question = tags.groupby('Id').agg(list).merge(questions,how='inner',on = "Id")
df = tag_question.merge(answers,how = "inner",left_on = "Id", right_on = "ParentId")
df.columns = df.columns.str.replace("_x","_question").str.replace("_y","_answer")
df = df[['Id','Tag','Score_question','Title','Body_question',"Score_answer","Body_answer"]]

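The cell above reads all three files, collects each question's tags into a list, and joins every question to its answers. Both merges are inner joins, so each row pairs one question with one of its answers, and questions without answers are dropped. See the result below.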
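To make the shape of the merged table easier to follow, here is a minimal sketch of the same join on toy data. The rows are made up for illustration, and the sketch assumes the answers table carries its own `Id` column alongside `ParentId`. In that case pandas suffixes both `Id` columns during the second merge, so the sketch renames the answer columns up front (`AnswerId` is a hypothetical name) instead of relying on the `_x`/`_y` suffixes.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (made-up rows, not taken from the dataset).
questions = pd.DataFrame({
    "Id": [1, 2, 3],
    "Score": [5, 0, 2],
    "Title": ["q one", "q two", "q three"],
    "Body": ["body one", "body two", "body three"],
})
answers = pd.DataFrame({
    "Id": [10, 11, 12],
    "ParentId": [1, 1, 2],
    "Score": [3, 1, 0],
    "Body": ["a", "b", "c"],
})
tags = pd.DataFrame({"Id": [1, 1, 2, 3],
                     "Tag": ["python", "pdf", "python", "python"]})

# One row per question, with its tags collected into a list.
tag_lists = tags.groupby("Id")["Tag"].agg(list).reset_index()
tag_question = tag_lists.merge(questions, how="inner", on="Id")

# Rename up front so the result does not depend on merge suffixes.
questions_side = tag_question.rename(
    columns={"Score": "Score_question", "Body": "Body_question"})
answers_side = answers.rename(
    columns={"Id": "AnswerId", "Score": "Score_answer", "Body": "Body_answer"})

toy = questions_side.merge(answers_side, how="inner",
                           left_on="Id", right_on="ParentId")
toy = toy[["Id", "Tag", "Score_question", "Title", "Body_question",
           "Score_answer", "Body_answer"]]
print(toy)
```

Question 1 has two answers and therefore appears twice, while question 3 has no answer and disappears from the result.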

In [13]:
df

Unnamed: 0,Id,Tag,Score_question,Title,Body_question,Score_answer,Body_answer
0,469,"[python, osx, fonts, photoshop]",21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,4,<p>open up a terminal (Applications-&gt;Utilit...
1,469,"[python, osx, fonts, photoshop]",21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,2,<p>I haven't been able to find anything that d...
2,469,"[python, osx, fonts, photoshop]",21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,12,<p>Unfortunately the only API that isn't depre...
3,469,"[python, osx, fonts, photoshop]",21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,1,<p>There must be a method in Cocoa to get a li...
4,502,"[python, windows, image, pdf]",27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,9,<p>You can use ImageMagick's convert utility f...
...,...,...,...,...,...,...,...
987117,40142948,"[python, tuples]",-2,How do you make it so that a function returns ...,<p>Basically I have a tuple which has 5 tuples...,0,<p>Hereâs the quick and dirty way:</p>\n\n<p...
987118,40143133,"[python, beautifulsoup]",1,error handling with BeautifulSoup when scraped...,<p>I'm totally noob to python so please forgiv...,1,<p>You may check the value of <code>name_box</...
987119,40143166,"[python, python-3.x]",1,finding cubed root using delta and epsilon in ...,<p>I am trying to write a program that finds c...,2,"<p>First thing, you should use <code>if/elif</..."
987120,40143190,"[python, bash, multiline]",1,How to execute multiline python code from a ba...,<p>I need to extend a shell script (bash). As ...,5,<p>Use a here-doc:</p>\n\n<pre><code>result=$(...


In [16]:
df['Body_question'] = (df['Body_question']
    .apply(html.unescape)
    .str.replace(r'<[a-zA-Z/][^>]*>', '', regex=True)  #remove HTML tags
    .str.replace('\r', '')                             #drop carriage returns first so no double spaces are left behind
    .str.replace(r'\n+', ' ', regex=True)              #collapse newlines
    .str.replace(r'  +', ' ', regex=True)              #collapse multiple spaces
    .str.replace('’', "'")                             #normalize curly apostrophes
    .str.lower()
    .str.strip())
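To see what this chain does to a single string, the same steps can be sketched with the standard library alone. `clean_html_text` and the sample string are hypothetical, and the sketch handles carriage returns and newlines in one pattern, which is a small variation on the cell above.

```python
import html
import re

def clean_html_text(raw: str) -> str:
    """Mirror the cleaning chain on one string (illustrative helper)."""
    text = html.unescape(raw)                      # &lt; becomes <, &#39; becomes '
    text = re.sub(r"<[a-zA-Z/][^>]*>", "", text)   # remove HTML tags
    text = re.sub(r"[\r\n]+", " ", text)           # line breaks become one space
    text = re.sub(r" {2,}", " ", text)             # collapse runs of spaces
    text = text.replace("\u2019", "'")             # curly apostrophe to straight
    return text.lower().strip()

sample = "<p>I&#39;m using <code>PIL</code>\r\n\r\nto open  a &lt;b&gt;file</p>"
print(clean_html_text(sample))  # i'm using pil to open a file
```

Because the entities are unescaped before the tags are stripped, an escaped tag such as `&lt;b&gt;` in the sample is removed as well. That may or may not be desirable for questions that quote HTML.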

In [17]:
df['Body_answer'] = (df['Body_answer']
    .apply(html.unescape)
    .str.replace(r'<[a-zA-Z/][^>]*>', '', regex=True)  #remove HTML tags
)

In [18]:
df['Title'] = (df['Title']
    .str.replace('\r', '')                             #drop carriage returns first so no double spaces are left behind
    .str.replace(r'\n+', ' ', regex=True)              #collapse newlines
    .str.replace(r'  +', ' ', regex=True)              #collapse multiple spaces
    .str.lower()
    .str.replace('’', "'")                             #normalize curly apostrophes
    .str.strip())

In [21]:
stop_words = set(stopwords.words('english')) - {"not", "no", "never"}
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
    # Expand contractions
    text = functions.expand_contractions(text)
    
    # Tokenization
    tokens = word_tokenize(text)

    # Lowercase and keep only alphabetic words
    tokens = [word.lower() for word in tokens if word.isalpha()]

    # Remove stopwords
    tokens = [w for w in tokens if w not in stop_words]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return " ".join(tokens)
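The contraction step matters because `word_tokenize` splits "doesn't" into "does" and "n't", and the alphabetic filter then drops "n't", which loses the negation the stop-word list is trying to keep. The helper in `functions` is not shown in this notebook, so the following is only a hypothetical stand-in with a deliberately small mapping. It assumes the text is already lowercased.

```python
import re

# Hypothetical, deliberately small mapping; the real helper in `functions` may differ.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
    "i'm": "i am",
    "it's": "it is",   # ambiguous in general ("it has"); kept simple here
}

# Longer keys first so "can't" wins over the generic "n't" suffix.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
)

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0)], text)

print(expand_contractions("i'm sure it doesn't work and we can't fix it"))
# i am sure it does not work and we cannot fix it
```

After expansion, "not" reaches the stop-word filter as a separate alphabetic token and is kept.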

In [11]:
df['Body_question'] = df['Body_question'].apply(preprocess_text) #applies preprocess_text to every question body
df['Title'] = df['Title'].apply(preprocess_text) #Takes 20 min to run on full dataset
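Since the merged table repeats each question once per answer (question 469 appears four times in the output above), much of that run time is likely spent re-processing identical text. One possible way to cut it down, sketched here with a hypothetical `apply_on_unique` helper and a stand-in for `preprocess_text`, is to process each distinct string once and map the results back.

```python
import pandas as pd

def apply_on_unique(series: pd.Series, func) -> pd.Series:
    """Run func once per distinct value, then map the results back to every row."""
    unique_values = series.dropna().unique()
    cache = {value: func(value) for value in unique_values}
    return series.map(cache)  # missing values stay missing

# Stand-in for preprocess_text that records how often it is called.
calls = []
def slow_preprocess(text):
    calls.append(text)
    return text.upper()

bodies = pd.Series(["same question", "same question",
                    "same question", "other question"])
result = apply_on_unique(bodies, slow_preprocess)
print(result.tolist())
print(len(calls))  # 2 calls for 4 rows
```

In the notebook this would be used as `apply_on_unique(df['Body_question'], preprocess_text)`. The actual saving depends on how many answers each question has.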


In [None]:
df['question'] = df['Title'] + " " + df['Body_question'] #combine Title and Body for questions

In [21]:
df = df[['Id','Tag','Score_question','question','Score_answer','Body_answer']] #Keep relevant columns

In [22]:
df.to_csv('Dataset/cleaned.csv', index=False)
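One thing to keep in mind when loading `cleaned.csv` later: the `Tag` column holds Python lists, and CSV stores them as their string representation. The sketch below round-trips a toy frame through an in-memory buffer and restores the lists with `ast.literal_eval`. The two rows are illustrative, and the buffer stands in for the file on disk.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "Id": [469, 502],
    "Tag": [["python", "osx"], ["python", "pdf"]],
})

# Write to an in-memory buffer instead of Dataset/cleaned.csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read gives the tags back as strings such as "['python', 'osx']".
buffer.seek(0)
plain = pd.read_csv(buffer)
print(type(plain.loc[0, "Tag"]))

# A converter parses each cell back into a real list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"Tag": ast.literal_eval})
print(restored.loc[0, "Tag"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.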