### Imports

Import the necessary packages.

In [73]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import random
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing


### Dataset

Load the dataset with pandas. The data ships already split into a training set and a validation set; we concatenate them for now to keep the code flexible. We split them again later, at which point we can choose how much of the data goes to training and how much to validation.

The model needs numerical target values, so we map the labels in `Y` to integers. We also drop the `Id` and `CreationDate` columns, then print the first rows so the reader can see the format.

In [74]:
train = pd.read_csv("../input/60k-stack-overflow-questions-with-quality-rate/train.csv")
valid = pd.read_csv("../input/60k-stack-overflow-questions-with-quality-rate/valid.csv")
# ignore_index=True gives the combined frame one running row index.
# (Passing the column names as `keys` labels the concatenated pieces
# instead, which is why "Id" showed up as an extra index level.)
data = pd.concat([train, valid], ignore_index=True)

# Drop the columns that are not used as features
data = data.drop(['Id', 'CreationDate'], axis=1)
# Map the quality labels to integers
data['Y'] = data['Y'].map({'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2})
data.head()


,Title,Body,Tags,Y
0,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,0
1,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2
2,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2
3,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2
4,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2
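
To make the concatenation and label mapping concrete, here is a minimal sketch on two tiny hand-made frames. The toy rows and the name `label_to_id` are assumptions for illustration only; the label strings are the ones used in the notebook. It uses `ignore_index=True` so the combined frame gets a single running index.

```python
import pandas as pd

# Toy stand-ins for train.csv and valid.csv (made-up rows)
train = pd.DataFrame({"Body": ["first question", "second question"],
                      "Y": ["LQ_CLOSE", "HQ"]})
valid = pd.DataFrame({"Body": ["third question"],
                      "Y": ["LQ_EDIT"]})

# Stack the two frames and renumber the rows 0..n-1
data = pd.concat([train, valid], ignore_index=True)

# Map each label string to an integer class id
label_to_id = {"LQ_CLOSE": 0, "LQ_EDIT": 1, "HQ": 2}
data["Y"] = data["Y"].map(label_to_id)

# Labels missing from the mapping would become NaN, so check for them
assert data["Y"].notna().all()
print(data)
```

The `notna()` check is optional, but it catches a misspelled or unexpected label before it reaches the model.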


### Clean data

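Clean the text with a regular expression, with the aim of improving the model's accuracy. The function lowercases the text and removes every character other than letters, digits, whitespace, and a small set of punctuation and operator symbols. Apply it to both the body and the title.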

In [75]:
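def clean_data(text):
    text = text.lower()
    # Keep letters, digits, whitespace and the symbols * + , - . / ( ) = & |
    # (same behavior as the earlier pattern, whose `\+-\/` range also
    # matched `,`, `-` and `.`); everything else is removed.
    text = re.sub(r'[^a-z0-9\s*+,\-./()=&|]', '', text)
    return text

data['Body'] = data['Body'].apply(clean_data)
data['Title'] = data['Title'].apply(clean_data)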
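
As a quick check of what the cleaning step does to a single string, the sketch below applies a character class equivalent to the one in `clean_data`, with the allowed punctuation listed one character at a time. The function name `clean_text` and the sample string are assumptions for illustration.

```python
import re

def clean_text(text):
    """Lowercase the text and drop every character outside the allowed set."""
    text = text.lower()
    return re.sub(r'[^a-z0-9\s*+,\-./()=&|]', '', text)

sample = "<p>Why is x += 1 slow?</p>"
cleaned = clean_text(sample)
print(cleaned)  # pwhy is x += 1 slow/p
```

Note that only the angle brackets are removed, so the tag name `p` stays behind and fuses with the neighbouring word. Stripping the HTML tags before cleaning would be one way to avoid that, but it is not something the notebook does.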

### Distribute data

Choose which data to use for training. By default, the first 45,000 rows (75%) are used for training and the remaining rows (25%) for validation. To change the split, change the slice boundaries on `data` that define `train` and `valid`. To include the tags and the title in the input text, set `Tags` and `Title` to `True`.

In [76]:
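def distribute_data(data=data):

    # Set to True to append the tags and/or the title to the body text
    Tags = False
    Title = False

    # Training set: the first 45,000 rows
    train = data[:45000]
    trainX = train['Body']

    # Build a new Series rather than using +=, so the columns of `data`
    # are left untouched; the space keeps adjacent fields from merging.
    if Tags:
        trainX = trainX + ' ' + train['Tags']
    if Title:
        trainX = trainX + ' ' + train['Title']
    trainY = train['Y'].values

    # Validation set: the remaining rows
    valid = data[45000:]
    validX = valid['Body']

    if Tags:
        validX = validX + ' ' + valid['Tags']
    if Title:
        validX = validX + ' ' + valid['Title']

    validY = valid['Y'].values
    return trainX, trainY, validX, validY


trainX, trainY, validX, validY = distribute_data()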
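
If the split point should follow from a fraction rather than a hard-coded row count, something along these lines could work. This is a sketch only: `split_by_fraction`, its `train_fraction` parameter, and the toy frame are hypothetical and not part of the notebook. Like the function above, it splits by position without shuffling.

```python
import pandas as pd

def split_by_fraction(df, train_fraction=0.75):
    """Split df by position: the first rows for training, the rest for validation."""
    n_train = int(len(df) * train_fraction)
    train, valid = df.iloc[:n_train], df.iloc[n_train:]
    return train["Body"], train["Y"].values, valid["Body"], valid["Y"].values

# Toy frame with 8 made-up rows
toy = pd.DataFrame({
    "Body": [f"question body {i}" for i in range(8)],
    "Y": [0, 1, 2, 0, 1, 2, 0, 1],
})

toy_trainX, toy_trainY, toy_validX, toy_validY = split_by_fraction(toy)
print(len(toy_trainX), len(toy_validX))  # 6 2
```

With the full data set and the default fraction this would reproduce the 45,000-row boundary, provided the frame holds 60,000 rows.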


### Vectorize data

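Vectorize the text with TF-IDF so that the machine learning model can interpret it. The vectorizer is fitted on the training set only and then applied unchanged to the validation set.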

In [77]:
vectorizer = TfidfVectorizer()
# Cast both sets to str so that any non-string values do not break the vectorizer
trainX = vectorizer.fit_transform(trainX.astype(str))
validX = vectorizer.transform(validX.astype(str))
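
The fit-then-transform pattern is easier to see on a tiny corpus. In the sketch below the three sentences are made up for illustration; `TfidfVectorizer` is used with its default settings, as in the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["how to sort a list in python", "sort a dictionary by value"]
valid_docs = ["how to reverse a list"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training text
train_matrix = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; words not seen during fitting are ignored
valid_matrix = vectorizer.transform(valid_docs)

print(train_matrix.shape)  # (2, 9): one column per vocabulary word
print(valid_matrix.shape)  # (1, 9): same columns as the training matrix
print(valid_matrix.nnz)    # 3: "how", "to", "list"; "reverse" is unseen
```

The default tokenizer drops single-character tokens such as "a", which is why the vocabulary has nine entries rather than ten.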


### Run

Choose the value for k. Train the model, score it on the validation set, and print the result.

In [78]:
knn = KNeighborsClassifier(n_neighbors = 250)

knn.fit(trainX, trainY)
score = knn.score(validX, validY)
print('Accuracy: {}'.format(score))


Accuracy: 0.6734
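
One way to choose k is to score several candidates and keep the best. The sketch below does this on synthetic two-cluster data standing in for the TF-IDF features; the data, the candidate values, and the names `scores` and `best_k` are assumptions for illustration, not the notebook's procedure.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two synthetic clusters, 40 points each, as a stand-in for real features
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(40, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(40, 2)),
])
y = np.array([0] * 40 + [1] * 40)

# Shuffle, then hold out the last 20 points for validation
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_train, y_train = X[:60], y[:60]
X_valid, y_valid = X[60:], y[60:]

# Score each candidate k on the held-out points
scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_valid, y_valid)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

An accuracy measured on the same set that was used to pick k tends to be optimistic, so a separate test set or cross-validation gives a fairer final number.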
