#### Predict Stack Overflow tags

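Baseline model: each post is encoded as a binary bag-of-words vector (a word's column is 1 if the word appears in the post) and classified by a network with a single hidden layer.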

In [1]:
import numpy as np
import pandas as pd
import os
import tensorflow as tf

from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


In [10]:
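# Read every CSV in data/ and stack the results into one DataFrame.
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once.
frames = [pd.read_csv(os.path.join('data', name)) for name in os.listdir('data/')]
df = pd.concat(frames, ignore_index=True)
df.head()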

Unnamed: 0,post,tags
0,c++ this to write object into binary file i v...,c++
1,move constructor call optimization on explicit...,c++
2,how to print the multimap elements having vect...,c++
3,forward declaration of class in a.h i have th...,c++
4,why is this legal c++ typedef func i did thi...,c++


In [3]:
#df = pd.read_csv('stack_overflow_tags.csv')
#df.head()

Unnamed: 0,post,tags
0,How to play audio specific time in wordpress /...,javascript
1,precommit asks for jest test when commiting I ...,javascript
2,firebase saves the document twice firebase sav...,javascript
3,Character clips into platform for 1 frame? I'm...,javascript
4,"How to fix, Fixed layout for the scrollable co...",javascript


In [11]:
df.shape

(40000, 2)

In [12]:
def word_count(x):
    return len(x.split())

In [13]:
df['post'].map(word_count).mean()

141.23445

In [7]:
df['post'].loc[1]

'precommit asks for jest test when commiting I am using precommit hook and husky in my project where I want to check the eslint and stylelint and prettify the code. I do not want to do any test in my application so there wont be any test related work. However, I am getting the following issue whenever i commit the changes.  This is how i have configured "scripts": {\n    "start": "react-scripts start",\n    "build": "react-scripts build",\n    "test": "react-scripts test",\n    "eject": "react-scripts eject",\n    "lint:css": "stylelint src/**/*.js",\n    "flow": "flow",\n    "precommit": "lint-staged"\n  },\n  "husky": {\n    "hooks": {\n      "pre-commit": "yarn precommit"\n    }\n  },\n lint-staged.config.js module.exports = {\n  linters: {\n    \'**/*.+(js|css|graphql)\': [\n      \'eslint --fix\',\n      \'prettier --write\',\n      \'stylelint app/**/*.js\',\n      \'git add\',\n    ],\n  },\n}\n I am using create-react-app. How do i bypass the test check when commiting the chang

In [14]:
df['post'].loc[10]



In [15]:
#df['post'].loc[1000]

In [16]:
df['post'].loc[10000]

'how to make a responsive step bar in html css  i want to make a responsive step bar with year on top side of step circle and text in bottom of the circle like this one:  <a href= https://i.stack.imgur.com/y4zws.png  rel= nofollow noreferrer ><img src= https://i.stack.imgur.com/y4zws.png  alt= enter image description here ></a>    here is my html code.   <pre><code>&lt;div class= container &gt;       &lt;ul class= progressbar &gt;         &lt;li&gt;visionet was founded lorem ipsum dolor sit amet new york usa by arshad masood&lt;/li&gt;           &lt;li&gt;visionet was founded lorem ipsum dolor sit amet new york usa by arshad masood&lt;/li&gt;           &lt;li&gt;visionet was founded lorem ipsum dolor sit amet new york usa by arshad masood&lt;/li&gt;           &lt;li&gt;visionet was founded lorem ipsum dolor sit amet new york usa by arshad masood&lt;/li&gt;           &lt;li&gt;visionet was founded lorem ipsum dolor sit amet new york usa by arshad masood&lt;/li&gt;           &lt;li id= l

In [17]:
df['post'].loc[30000]

'can anyone tell me how to display images from database without using gridview or repeater controls   please tell  how to display images from local folder in div without using gridview and other displaying controls and please see <a href= http://www.google.co.in/imgres biw=1366&amp;bih=600&amp;tbm=isch&amp;tbnid=j5q6cvwn3lo0km%3a&amp;imgrefurl=http://www.techalam.com/olx-in-a-great-place-to-buy-sell-stuff-online/&amp;docid=ieirb6zbxkzcom&amp;imgurl=http://www.techalam.com/wp-content/uploads/2013/09/olx-3.jpg&amp;w=350&amp;h=217&amp;ei=m20ku-o_dm7graf8tygicq&amp;zoom=1&amp;ved=0coubeiqcmc8&amp;iact=rc&amp;dur=1752&amp;page=3&amp;start=46&amp;ndsp=19  rel= nofollow >this example</a> because i want to display the images like that.'

In [18]:
df.tags.value_counts()

mysql            2000
c                2000
javascript       2000
jquery           2000
ios              2000
sql              2000
ruby-on-rails    2000
css              2000
php              2000
asp.net          2000
python           2000
angularjs        2000
iphone           2000
.net             2000
android          2000
objective-c      2000
html             2000
c++              2000
c#               2000
java             2000
Name: tags, dtype: int64

In [19]:
# Hold out 20% of the posts as a test set.
X_train, X_test, y_train, y_test = train_test_split(df['post'], df['tags'], test_size=0.2)

### Text preprocessing

Tokenize the posts and cap the vocabulary at `max_words` indices.

In [20]:
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)

In [21]:
tokenize.fit_on_texts(X_train) # only fit on train
X_train = tokenize.texts_to_matrix(X_train)
X_test = tokenize.texts_to_matrix(X_test)

In [22]:
X_train.shape

(32000, 1000)
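
To make the `(32000, 1000)` shape concrete, here is a minimal NumPy sketch of what a binary bag-of-words matrix looks like. It is a simplification and not the Keras implementation: the helper names `fit_vocab` and `texts_to_binary_matrix` are hypothetical, tokenization is a plain whitespace split, and the toy sentences are made up. It does mirror one Keras convention: index 0 is reserved, so column 0 stays empty.

```python
import numpy as np
from collections import Counter

def fit_vocab(texts, max_words):
    """Map the most frequent words to column indices 1..max_words-1 (index 0 is reserved)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [w for w, _ in counts.most_common(max_words - 1)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_binary_matrix(texts, vocab, max_words):
    """One row per text; a cell is 1.0 if that vocabulary word occurs in the text."""
    matrix = np.zeros((len(texts), max_words))
    for row, t in enumerate(texts):
        for word in t.lower().split():
            col = vocab.get(word)
            if col is not None:  # words outside the vocabulary are dropped
                matrix[row, col] = 1.0
    return matrix

# Toy data, for illustration only.
train = ["how to sort a list in python", "how to center a div in css"]
vocab = fit_vocab(train, max_words=8)
bow = texts_to_binary_matrix(train, vocab, max_words=8)
print(bow.shape)
print(bow)
```

Word order and word counts are discarded in this encoding; only presence or absence of each vocabulary word is kept.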

In [25]:
encoder = LabelEncoder()
encoder.fit(y_train)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)

In [26]:
X_train

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

### One-hot encoding

In [27]:
num_classes = np.max(y_train) + 1
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
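
As a rough illustration of the one-hot step, the sketch below reproduces the idea with plain NumPy. The `one_hot` helper and the toy labels are hypothetical; the notebook itself relies on `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids into rows with a single 1.0 at the class index."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Toy integer labels, as a LabelEncoder would produce for three classes.
labels = np.array([0, 2, 1, 2])
encoded = one_hot(labels, num_classes=3)
print(encoded)
print(encoded.argmax(axis=1))  # argmax recovers the integer labels
```

Each row sums to 1, which is the target format that `categorical_crossentropy` expects.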

### Model

In [None]:
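model = Sequential()
# Single hidden layer over the bag-of-words input
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
# Randomly drop half of the hidden units during training to reduce overfitting
model.add(Dropout(0.5))
# One output unit per tag; softmax turns the scores into class probabilities
model.add(Dense(num_classes))
model.add(Activation('softmax'))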

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
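
For a sense of the model's size, the number of trainable parameters can be worked out by hand. This is a small sketch, assuming the values used in this notebook (`max_words = 1000`, 512 hidden units, 20 tags); `dense_params` is a hypothetical helper, and the dropout and activation layers contribute no parameters.

```python
def dense_params(n_inputs, n_units):
    """A dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

max_words = 1000   # input width, as set earlier in the notebook
num_classes = 20   # one class per tag in df.tags.value_counts()

hidden = dense_params(max_words, 512)
output = dense_params(512, num_classes)
total = hidden + output
print(hidden, output, total)  # 512512 10260 522772
```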

In [38]:
history = model.fit(X_train, y_train,
                    batch_size=32,
                    epochs=3,
                    verbose=1,
                    validation_split=0.1)

Train on 28800 samples, validate on 3200 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [31]:
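# `batch_size` was never defined as a variable; reuse the value passed to model.fit.
score = model.evaluate(X_test,
                       y_test,
                       batch_size=32,
                       verbose=1)
score  # [test loss, test accuracy]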



[0.5725286792516708, 0.815625]
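
The two numbers returned by `evaluate` are the categorical cross-entropy loss and the accuracy. The sketch below shows how both can be computed from softmax outputs with NumPy; the probabilities and labels are made up for illustration and are not the notebook's predictions. Keras additionally clips probabilities before taking the log, which is omitted here.

```python
import numpy as np

# Made-up softmax outputs for 4 posts and 3 classes (illustrative only).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])
# Matching one-hot targets; the last post is misclassified.
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 1, 0]])

pred = probs.argmax(axis=1)   # most probable class per post
true = y_true.argmax(axis=1)  # integer labels recovered from the one-hot rows
accuracy = (pred == true).mean()

# Cross-entropy: negative log of the probability assigned to the true class, averaged.
loss = -np.log((probs * y_true).sum(axis=1)).mean()
print(accuracy, loss)
```

In the notebook, the same `argmax` applied to `model.predict(X_test)` gives integer class ids, and `encoder.inverse_transform` maps those ids back to tag names.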