## Data Exploration
This notebook explores the Kaggle dataset: it loads the raw CSV files, merges them, cleans the text, and saves the result.

In [3]:
import pandas as pd
import numpy as np
import os
import html
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import functions  # local helper module (provides expand_contractions)

In [2]:
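nltk.download('stopwords')
nltk.download('wordnet')
# word_tokenize (used below) also needs the 'punkt' tokenizer data ('punkt_tab' on recent NLTK releases); download it if it is missing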

[nltk_data] Downloading package stopwords to /Users/jake/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jake/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# Load the three raw CSV files
questions = pd.read_csv("Dataset/Questions.csv", encoding='latin-1')
answers = pd.read_csv("Dataset/Answers.csv", encoding='latin-1')
tags = pd.read_csv("Dataset/Tags.csv", encoding='latin-1')
# Collect each question's tags into a list, then attach the question fields
tag_question = tags.groupby('Id').agg(list).merge(questions, how='inner', on="Id")
# Join every answer to its question (one row per answer)
df = tag_question.merge(answers, how="inner", left_on="Id", right_on="ParentId")
# Rename the merge suffixes so question and answer columns are distinguishable
df.columns = df.columns.str.replace("_x", "_question").str.replace("_y", "_answer")
df = df[['Id', 'Tag', 'Score_question', 'Title', 'Body_question', "Score_answer", "Body_answer"]]

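The cell above reads all three files and merges them into a single DataFrame with one row per answer. Each row also carries the tags, score, title, and body of the question it answers. The result is shown below.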
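To make the shape of the merge easier to follow, here is a minimal sketch on toy data. The rows are invented for illustration, the toy answers table has no `Id` column of its own, and `suffixes=` is used in place of the `_x`/`_y` rename, so this is not the exact code from the cell above.

```python
import pandas as pd

# Toy stand-ins for Questions.csv, Answers.csv and Tags.csv (hypothetical rows).
questions = pd.DataFrame({
    "Id": [1, 2],
    "Score": [10, 3],
    "Title": ["q one", "q two"],
    "Body": ["body one", "body two"],
})
answers = pd.DataFrame({
    "ParentId": [1, 1, 2],
    "Score": [5, 0, 7],
    "Body": ["a1", "a2", "a3"],
})
tags = pd.DataFrame({"Id": [1, 1, 2], "Tag": ["python", "osx", "python"]})

# One row per question, with its tags collected into a list.
tags_per_question = tags.groupby("Id").agg(list).reset_index()

# Attach the question fields, then fan out to one row per answer.
tag_question = tags_per_question.merge(questions, how="inner", on="Id")
merged = tag_question.merge(
    answers,
    how="inner",
    left_on="Id",
    right_on="ParentId",
    suffixes=("_question", "_answer"),  # applied to the overlapping Score and Body columns
)

print(merged[["Id", "Tag", "Score_question", "Title", "Score_answer", "Body_answer"]])
```

Question 1 has two answers, so its tags, score, and title appear on two rows. The same repetition is visible for `Id` 469 in the real data.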

In [5]:
df

Unnamed: 0.1,Unnamed: 0,Id,Tag,Score_question,question,Score_answer,Body_answer
0,0,469,"['python', 'osx', 'fonts', 'photoshop']",21,find full path font display name mac using pho...,4,open up a terminal (Applications->Utilities->T...
1,1,469,"['python', 'osx', 'fonts', 'photoshop']",21,find full path font display name mac using pho...,2,I haven't been able to find anything that does...
2,2,469,"['python', 'osx', 'fonts', 'photoshop']",21,find full path font display name mac using pho...,12,Unfortunately the only API that isn't deprecat...
3,3,469,"['python', 'osx', 'fonts', 'photoshop']",21,find full path font display name mac using pho...,1,There must be a method in Cocoa to get a list ...
4,4,502,"['python', 'windows', 'image', 'pdf']",27,get preview jpeg pdf window python application...,9,You can use ImageMagick's convert utility for ...
...,...,...,...,...,...,...,...
987117,987117,40142948,"['python', 'tuples']",-2,make function return tuple divided multiple li...,0,Here’s the quick and dirty way:\n\ndef forma...
987118,987118,40143133,"['python', 'beautifulsoup']",1,error handling beautifulsoup scraped url not r...,1,You may check the value of name_box variable -...
987119,987119,40143166,"['python', 'python-3.x']",1,finding cubed root using delta epsilon python ...,2,"First thing, you should use if/elif instead of..."
987120,987120,40143190,"['python', 'bash', 'multiline']",1,execute multiline python code bash script need...,5,Use a here-doc:\n\nresult=$(python <\n


In [5]:
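df['Body_question'] = (df['Body_question']
    .apply(html.unescape)                              # decode HTML entities such as &amp;
    .str.replace(r'<[a-zA-Z/][^>]*>', '', regex=True)  # remove HTML tags
    .str.replace('\r', '', regex=False)                # drop carriage returns before collapsing newlines
    .str.replace(r'\n+', ' ', regex=True)              # collapse newlines into one space
    .str.replace(r'  +', ' ', regex=True)              # collapse multiple spaces
    .str.replace('’', "'", regex=False)                # normalize curly apostrophes
    .str.lower()
    .str.strip())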
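As a quick check of what this chain does, the sketch below applies the same kinds of steps to a single string using only the standard library, with carriage returns removed before newlines are collapsed. `clean_body` and the sample text are illustrative names and data, not part of the notebook.

```python
import html
import re

def clean_body(raw: str) -> str:
    """Apply the cleaning steps to one string (illustrative helper)."""
    text = html.unescape(raw)                      # decode entities, e.g. &#39; -> '
    text = re.sub(r"<[a-zA-Z/][^>]*>", "", text)   # remove HTML tags
    text = text.replace("\r", "")                  # drop carriage returns
    text = re.sub(r"\n+", " ", text)               # collapse newlines into one space
    text = re.sub(r"  +", " ", text)               # collapse multiple spaces
    text = text.replace("’", "'")                  # normalize curly apostrophes
    return text.lower().strip()

sample = "<p>I can&#39;t get <code>os.path</code> to work.</p>\n\n<p>Any   ideas?</p>"
print(clean_body(sample))
```

Because entities are decoded before tags are stripped, escaped angle brackets in the text (for example `&lt;b&gt;`) turn into real tags and are removed as well. That may or may not be desirable for code snippets inside question bodies.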

In [6]:
df['Body_answer'] = (df['Body_answer']
    .apply(html.unescape)
    .str.replace(r'<[a-zA-Z/][^>]*>', '', regex=True)  #remove HTML tags
)

In [7]:
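df['Title'] = (df['Title']
    .str.replace('\r', '', regex=False)                # drop carriage returns before collapsing newlines
    .str.replace(r'\n+', ' ', regex=True)              # collapse newlines into one space
    .str.replace(r'  +', ' ', regex=True)              # collapse multiple spaces
    .str.replace('’', "'", regex=False)                # normalize curly apostrophes
    .str.lower()
    .str.strip())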

In [8]:
stop_words = set(stopwords.words('english')) - {"not", "no", "never"}  # keep negations: they can change the meaning of a sentence
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
    # Expand contractions
    text = functions.expand_contractions(text)
    
    # Tokenization
    tokens = word_tokenize(text)

    # Lowercase and keep only alphabetic tokens
    tokens = [word.lower() for word in tokens if word.isalpha()]

    # Remove stopwords
    tokens = [w for w in tokens if w not in stop_words]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return " ".join(tokens)
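The effect of keeping negations out of the stopword set can be seen in a dependency-free sketch. It is only an approximation: the tiny stopword set and the regex tokenizer are hypothetical stand-ins for NLTK's English stopword list and `word_tokenize`, and contraction expansion and lemmatization are left out.

```python
import re

# Tiny hypothetical stopword list; NLTK's English list is much longer.
BASE_STOP_WORDS = {"i", "do", "the", "a", "to", "is", "how", "not", "no", "never"}
NEGATIONS = {"not", "no", "never"}
STOP_WORDS = BASE_STOP_WORDS - NEGATIONS  # negations survive the filter

def simple_preprocess(text: str) -> str:
    """Lowercase, keep alphabetic runs, and drop stopwords (illustrative only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(simple_preprocess("How do I not overwrite the file?"))
```

With the full stopword set the result would be `overwrite file`, which reads as the opposite request. Removing the negations from the set keeps `not` in the output.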

In [None]:
df['Body_question'] = df['Body_question'].apply(preprocess_text)  # apply the preprocessing to every question body
df['Title'] = df['Title'].apply(preprocess_text)  # takes about 20 min on the full dataset
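Since the merged DataFrame has one row per answer, each question's title and body appear once for every answer and are preprocessed repeatedly. One possible way to cut the runtime is to memoize the function so each distinct text is processed only once. The sketch below assumes this is acceptable memory-wise. `cached_preprocess` is a hypothetical name and its body is a placeholder for the real `preprocess_text`.

```python
from functools import lru_cache

calls = 0  # counts how many times the expensive work actually runs

@lru_cache(maxsize=None)
def cached_preprocess(text: str) -> str:
    """Placeholder for the real preprocessing; results are cached per input string."""
    global calls
    calls += 1
    return " ".join(text.lower().split())

# Three answers to one question and one answer to another: four rows, two unique texts.
rows = ["How to read a file", "How to read a file", "How to read a file", "Parse JSON"]
cleaned = [cached_preprocess(r) for r in rows]

print(calls)  # number of real computations, not number of rows
```

In the notebook the cached function could be passed to `.apply` in the same way as `preprocess_text`. The cache keeps every distinct input and output in memory, which is worth weighing on the full dataset.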

In [17]:
df['question'] = df['Title'] + " " + df['Body_question'] #combine Title and Body for questions

In [21]:
df = df[['Id','Tag','Score_question','question','Score_answer','Body_answer']] #Keep relevant columns

In [22]:
df.to_csv('Dataset/cleaned.csv', index=False) #Save dataframe
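One thing to keep in mind when reading this file back: a CSV has no list type, so the `Tag` lists are written as their string representation. The sketch below uses a one-row toy frame and an in-memory buffer instead of the real file, and shows one way to restore the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# One-row toy frame standing in for the cleaned DataFrame.
df_small = pd.DataFrame({"Id": [469], "Tag": [["python", "osx"]]})

# Write to an in-memory buffer instead of Dataset/cleaned.csv.
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
tag_type_after_read = type(reloaded["Tag"].iloc[0])  # a string, not a list

# Parse the string representation back into a Python list.
reloaded["Tag"] = reloaded["Tag"].apply(ast.literal_eval)
print(reloaded["Tag"].iloc[0])
```

Writing with `index=False`, as the cell above does, also avoids the extra `Unnamed: 0` columns that typically appear when a CSV is saved with its index and then read back.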