# Exploratory Data Analysis

## Introduction
After the data cleaning step, in which we put our data into a few standard formats, the next step is to look at the data and check that it makes sense. Before applying any fancy algorithms, it is always important to explore the data first.

In this notebook, we use EDA to find the more obvious patterns before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each question posted on Stack Overflow:

1. **Most/least common words**: find these to add to a list of stopwords
2. **Other basic EDA**: look at scores, view count, etc.
3. **Choose a model**: decide which model to continue with for topic modeling

In [1]:
# data analysis and manipulation
import pandas as pd
import numpy as np
from collections import Counter

# files
import pickle

# text manipulation
import re
import string
from bs4 import BeautifulSoup

# natural language processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# topic modeling
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Most/Least Common Words
To extend our list of stop words, we read in the pickled document-term matrices produced by the CountVectorizer and the TF-IDF vectorizer, find the top 30 words used in each question, aggregate those words across all documents (i.e. questions), and add words from that aggregate to the stop word list.
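
The aggregation step described above could look roughly like the sketch below. It is only an illustration: `top_dict` is a small hypothetical stand-in for the per-question top-word lists, and the cutoff `min_share` is an assumed value, not one used in this notebook.

```python
from collections import Counter

# Hypothetical stand-in for the per-question top-word lists
top_dict = {
    0: [("sum", 8), ("int", 7), ("code", 5)],
    1: [("json", 3), ("like", 2), ("code", 1)],
    2: [("code", 2), ("use", 1), ("like", 1)],
}

# Count in how many questions each word shows up among the top words
doc_freq = Counter(
    word for top_words in top_dict.values() for word, _ in top_words
)

# Assumed cutoff: keep words that are a top word in more than half of the questions
min_share = 0.5
add_stop_words = sorted(
    word for word, n in doc_freq.items() if n / len(top_dict) > min_share
)

print(add_stop_words)
```

The resulting list can then be merged with scikit-learn's built-in English list, for example with `text.ENGLISH_STOP_WORDS.union(add_stop_words)`, and passed to a vectorizer through its `stop_words` parameter.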

In [2]:
# read in the document-term matrix (dtm) for the count vectorizer
df_cv = pd.read_pickle('df_cv.pkl')
# transpose so that rows are terms and columns are questions
df_cv = df_cv.transpose()
df_cv.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14081,14082,14083,14084,14085,14086,14087,14088,14089,14090
aaa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aba,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ababab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# do the same for the tf-idf vectorizer
df_tfidf = pd.read_pickle('df_tfidf.pkl')
# transpose so that rows are terms and columns are questions
df_tfidf = df_tfidf.transpose()
df_tfidf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14081,14082,14083,14084,14085,14086,14087,14088,14089,14090
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aba,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ababab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
# find the top 30 words used in each question
top_dict_cv = {}
for c in df_cv.columns:
    top = df_cv[c].sort_values(ascending=False).head(30)
    top_dict_cv[c]= list(zip(top.index, top.values))

top_dict_cv

{0: [('sum', 8),
  ('data', 8),
  ('arraysize', 8),
  ('int', 7),
  ('datac', 6),
  ('code', 5),
  ('runs', 4),
  ('long', 4),
  ('loop', 4),
  ('faster', 4),
  ('start', 4),
  ('unsigned', 4),
  ('thought', 3),
  ('include', 3),
  ('array', 3),
  ('generate', 2),
  ('new', 2),
  ('stdsortdata', 2),
  ('import', 2),
  ('public', 2),
  ('elapsedtime', 2),
  ('main', 2),
  ('stdendl', 2),
  ('random', 2),
  ('seconds', 2),
  ('sorted', 2),
  ('just', 2),
  ('test', 2),
  ('sorting', 2),
  ('stdcout', 2)],
 1: [('json', 3),
  ('know', 2),
  ('id', 2),
  ('like', 2),
  ('type', 2),
  ('content', 1),
  ('targeted', 1),
  ('ive', 1),
  ('gather', 1),
  ('rest', 1),
  ('slightly', 1),
  ('api', 1),
  ('theres', 1),
  ('support', 1),
  ('varying', 1),
  ('properly', 1),
  ('mime', 1),
  ('correct', 1),
  ('textjavascript', 1),
  ('question', 1),
  ('purported', 1),
  ('seen', 1),
  ('issues', 1),
  ('browser', 1),
  ('returned', 1),
  ('answer', 1),
  ('hurt', 1),
  ('applicationjson', 1),
  (

In [5]:
# print the top 14 words used in each question (the slice [0:14] keeps 14 words, not 15)
for question, top_words in top_dict_cv.items():
    print(question)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')

0
sum, data, arraysize, int, datac, code, runs, long, loop, faster, start, unsigned, thought, include
---
1
json, know, id, like, type, content, targeted, ive, gather, rest, slightly, api, theres, support
---
2
use, inside, comments, file, json, garbage, gateway, gates, gasrc, gas, garden, garblestrslice, garbled, zyx
---
3
myobject, newuri, ircevent, privmsg, regex, follows, method, var, object, new, http, say, create, way
---
4
object, null, avoid, code, test, want, nullpointerexception, use, appear, programming, example, unreadable, tests, alternative
---
5
suit, compile, enum, public, used, second, void, code, diamonds, spades, does, following, fails, time
---
6
nodejs, javascript, web, lets, run, problems, good, understand, general, server, like, stuff, type, considering
---
7
function, apply, methods, versa, performance, vs, vice, best, difference, using, differences, use, invoke, var
---
8
return, false, event, method, use, aclickfunction, custom, handling, executing, prone, eve

---
820
change, public, api, new, type, foo, void, languages, class, method, break, language, adding, affected
---
821
libraries, threading, boost, app, start, purpose, use, commercial, qt, possible, summary, testing, ive, database
---
822
order, sort, stop, descending, way, arrays, like, lazy, easy, class, array, ascending, gamesmapinsertstdpairint, gameplay
---
823
button, color, tried, highlighted, background, change, user, finger, didnt, effect, uibutton, code, app, changing
---
824
gdb, id, use, sessions, session, set, like, history, command, starting, saves, keys, previous, access
---
825
better, way, check, available, sizeofarr, judge, traditional, alternate, trying, meter, methods, just, element, situaions
---
826
case, string, mixed, regardless, query, sensitive, queries, value, function, returns, return, characters, mysql, make
---
827
ctrlc, app, thing, start, locally, just, interrupt, heroku, zyx, gaq, gasrc, gas, garden, garblestrslice
---
828
---
829
gridview, implement, 

---
1653
null, index, table, create, primary, variable, databasedefault, id, int, declare, key, server, temptable, collate
---
1654
commit, file, did, weeks, restore, single, ago, old, want, gaps, zyx, gapplyx, garbage, garbled
---
1655
gallery, load, holding, phone, works, correctly, picture, pictures, appears, horizontal, image, bitmap, shot, fine
---
1656
git, message, commit, push, following, solution, merge, feel, note, number, terminal, happy, did, fastforwards
---
1657
enum, values, public, myenum, declare, hexadecimal, flags, write, decimal, declarations, confused, hex, use, easier
---
1658
class, aclass, test, subclass, gaps, gateway, gates, gasrc, gas, garden, garblestrslice, garbled, garbage, gaq
---
1659
rvm, command, using, rails, install, installed, version, gemset, new, following, ruby, switched, gem, trying
---
1660
new, animal, want, class, just, animalsadd, java, add, simply, create, arraylistanimal, list, responses, way
---
1661
mathrandom, java, random, range, like,

2486
video, source, src, player, sources, firefox, object, im, movie, script, attribute, seen, work, just
---
2487
---
2488
im, ide, ant, really, holiday, use, java, eclipse, enhancements, coding, compaq, time, bought, ive
---
2489
script, shorthand, slashes, url, line, src, noticed, duplicate, http, absolute, following, start, preserve, tags
---
2490
text, arabic, server, textview, tvsettextt, code, boxes, values, instead, string, getting, successfully, showing, want
---
2491
usrlib, file, import, line, syspath, settings, media, make, wsgi, use, used, set, choices, mofin
---
2492
line, css, dotted, draw, gcdij, garbage, gatewayinterface, gateway, gates, gasrc, gas, garden, garblestrslice, garbled
---
2493
balance, cforeach, im, jsp, new, called, use, tags, var, scriptlets, retrieve, jstl, table, fine
---
2494
flask, access, documentation, tell, trying, im, doesnt, user, agent, gather, gatewayinterface, gateway, gates, gasrc
---
2495
page, user, print, automatically, dialog, javascript

3319
text, way, position, html, want, div, rect, svg, sparately, lines, autolinewrap, fills, using, elements
---
3320
topic, bear, thats, didnt, similar, zyx, garblestrslice, gather, gatewayinterface, gateway, gates, gasrc, gas, garden
---
3321
relative, dom, element, xy, plugins, jquery, locate, containerparent, topleft, document, current, documents, left, property
---
3322
image, bicubic, options, scale, size, reduction, open, way, sharp, does, example, documented, offer, algorithm
---
3323
button, listview, click, shows, checkbox, line, info, inside, add, display, react, bar, clickable, doesnt
---
3324
value, species, option, selected, yes, set, form, property, asc, manner, corresponding, itype, question, false
---
3325
im, ios, usingshouldautorotatetointerfaceorientation, working, expect, release, fixed, size, bug, rotation, master, using, called, controller
---
3326
round, use, number, php, say, nearest, code, garbage, gateway, gates, gasrc, gas, garden, garblestrslice
---
3327
ad

---
3986
keybinding, ctrl, needed, example, savecommand, binding, command, key, modifier, keys, shift, way, modifiers, like
---
3987
return, form, false, containssmileystring, scontains, null, boolean, statement, related, written, especially, exit, reasons, seen
---
3988
syntax, support, unsigned, doesnt, double, zyx, gapplyx, gas, garden, garblestrslice, garbled, garbage, gaq, gaps
---
3989
file, ignored, gitignore, git, directory, bartxt, foo, recursive, useful, applies, debugging, exist, tried, want
---
3990
iliststring, list, inside, order, new, alphabetic, strings, ascending, liststring, garbled, garblestrslice, zyx, garden, gaq
---
3991
int, just, select, way, ways, share, exists, given, suffice, value, simpler, believe, test, better
---
3992
dictionary, values, animals, newanimals, say, string, dictionaries, duplicates, merging, fastest, append, way, receive, copy
---
3993
text, nstextfieldcell, vertically, internal, flag, columns, apples, centered, mycategories, draws, datacell

---
4819
span, space, want, use, dont, spanspan, tag, like, css, parent, elements, html, javascript, optional
---
4820
campaigncategories, uintmaxvalue, sorcery, campaigns, concrete, campaigngroups, targets, devised, programming, multiply, heres, affiliategroups, affiliates, want
---
4821
calls, function, script, possible, achieve, multithreading, readcfg, independent, let, introduce, shell, scripts, different, arguments
---
4822
table, cells, width, size, make, outer, contained, fixed, specify, dynamic, currently, fit, autosize, stretch
---
4823
require, includelib, code, unshift, tell, does, fileexpandpath, ruby, started, root, rescue, references, rootactivesupportlib, character
---
4824
android, testing, applications, mocking, test, mock, unit, suggestions, roboelectric, suggested, features, hard, possibly, new
---
4825
view, attaching, need, method, overridden, actually, actuall, size, invoked, add, onmeasureint, looks, rendering, container
---
4826
value, string, message, loginpag

5652
using, im, trying, experience, creating, site, support, crop, libraries, image, does, actual, seen, imagemagick
---
5653
image, images, resolution, able, search, set, tineye, engine, supply, smaller, size, billion, sizes, algorithm
---
5654
detect, processorscores, correctly, physical, number, supported, able, threads, performance, enabled, windows, maximum, question, creating
---
5655
like, module, modules, im, buffer, allocation, memory, need, want, code, access, solutions, maths, bounty
---
5656
ios, want, database, android, seen, work, offline, solution, webdatabase, store, blackberry, writing, works, target
---
5657
dont, nice, difference, differences, different, developers, googled, importance, use, scenarios, models, exactly, examples, necessity
---
5658
int, util, qstring, namespace, using, iusrlocaltrolltechqt, tmp, type, reference, include, templateclass, main, utilcpp, undefined
---
5659
exception, log, try, exctraceback, like, using, excvalue, functionality, access, sy

6485
mvc, aspnet, regards, spring, comparison, performance, better, technology, sirmak, productivity, maintenance, vs, features, make
---
6486
content, user, activity, app, using, scrollers, scroller, new, development, containing, wanna, relativelayout, screen, android
---
6487
hibernate, checked, column, boolean, schema, orgspringframeworkorm, domain, orghibernatecfgconfigurationbuildsessionfactoryconfigurationjava, default, null, im, fields, annotations, mysql
---
6488
option, value, form, multiple, select, submit, values, id, post, div, class, method, posted, type
---
6489
uitextrange, textview, im, subclass, firstrectforrange, tried, need, different, rect, using, editable, apple, actually, docs
---
6490
hex, string, hello, world, format, way, best, example, like, convert, vice, versa, garden, garblestrslice
---
6491
branch, maintenance, master, merge, git, want, forward, commit, considered, project, cherrypicking, apply, conflicts, easy
---
6492
im, nice, gui, osx, tools, windows, 

---
7456
using, vertical, bootstrap, class, menu, create, hack, sidebar, just, dropdown, separate, css, way, know
---
7457
cart, render, using, carthtml, point, page, style, blog, string, updates, ajax, does, say, partial
---
7458
numbers, like, views, larger, add, function, question, simple, commas, php, make, change, know, possible
---
7459
times, restart, java, machine, related, version, use, works, mysql, built, jvmbind, fix, process, stop
---
7460
template, void, param, fooint, syntax, foointint, specialization, examples, foot, different, difference, translation, primer, czech
---
7461
backgroundimage, webkitfilter, blur, blurs, searching, code, fixed, describing, tried, body, use, resources, urlhttpwwwpublicdomainpicturesnetpictures, velkapebblesandsea
---
7462
class, extends, tdao, foodao, multiple, autowired, generics, genericdaot, service, ideally, erasure, foo, fails, type
---
7463
factory, service, creation, domain, data, class, objects, iam, object, does, layers, layer, ui,

select, field, make, want, null, like, conditional, possible, statement, checks, gaq, gatewayinterface, gateway, gates
---
8193
age, months, richness, using, plotdata, mydata, change, geomboxplotaesgroupage, aesage, data, left, right, continuous, value
---
8194
ive, android, looked, machine, sdk, command, key, using, folder, machines, jarsigner, linux, windows, debug
---
8195
potatoproject, hard, linking, creating, srcpotatoprojectegginfo, writing, file, srcpotatoprojectegginfosourcestxt, setuppy, src, srcpotato, license, static, srctomato
---
8196
app, engine, google, mac, sdk, ive, answer, os, autoupdating, havent, getting, uninstalldisable, alerts, looked
---
8197
handler, minute, like, doing, ashx, use, output, does, image, im, heavy, processing, caching, prevent
---
8198
foo, delete, foofoo, monsters, monster, monstersi, int, destructor, currently, considering, following, happens, code, virtual
---
8199
---
8200
templates, case, using, integrityerror, powered, message, catch, obvi

---
9123
file, card, advice, sd, images, called, code, new, working, directory, im, android, trying, build
---
9124
androidinputtype, string, phone, brings, hides, interface, combine, ability, edittext, hide, input, know, gccg, gaq
---
9125
trying, best, im, gui, eclipse, suggest, thanks, use, java, duplicate, builder, create, user, designer
---
9126
moc, using, use, main, nsmainqueueconcurrencytype, nsconfinementconcurrencytype, background, situation, fetched, controller, driven, believe, initializing, app
---
9127
application, updater, start, executable, patch, delete, make, oneclick, official, self, download, version, process, net
---
9128
load, dll, addresses, application, rebasing, modules, loader, dlls, doesnt, process, preferred, windows, executable, hard
---
9129
string, az, ideas, need, allow, things, regexp, dots, like, replace, edit, id, stripped, strip
---
9130
xml, xlink, xmlnsa, el, var, ahref, atitle, new, org, namespace, prefix, httpwww, fragment, code
---
9131
gesture,

file, vbscript, need, access, application, extension, addin, code, script, studio, windows, vbs, visual, gang
---
9985
clusters, variance, percent, set, total, sum, unique, variances, print, points, cluster, centroids, kmeans, number
---
9986
openwith, registrykey, currentuser, setvalue, work, application, shell, keyname, doesnt, string, using, filetype, createsubkey, shellcreatesubkey
---
9987
lib, ive, make, additional, files, add, link, happen, new, thing, dependencies, adding, thanks, file
---
9988
logo, language, second, joke, question, asked, im, podcast, linux, programming, use, programmed, world, windows
---
9989
string, return, dictionarystring, url, headers, webrequest, webresponse, public, static, using, code, new, webresponseheaders, bool
---
9990
make, issue, location, defaults, radius, mkmapview, common, getting, zoom, thanks, advance, thing, exact, answer
---
9991
catch, collectionaddnew, exception, jsonstringvalue, try, err, contextresponseend, contextresponsecontenttyp

10864
start, hardware, device, buy, bit, able, got, time, delphi, starting, learn, plug, good, simple
---
10865
luck, application, cmd, line, reason, command, experience, arguments, breakpoint, thats, using, setting, loss, ask
---
10866
parsertester, possible, java, jar, classpath, run, using, separate, need, tried, folder, like, classesjar, retain
---
10867
reddit, table, new, thing, things, didn, add, attribute, like, worry, users, idea, votes, data
---
10868
author, index, citation, paper, year, capable, web, institution, metrics, citationcount, given, despite, services, retrievecalculate
---
10869
server, remote, error, strings, computer, request, set, connection, occurred, web, path, processed, problem, site
---
10870
start, thanks, deleted, key, administrator, table, recordes, way, entry, tbphotos, want, primary, restart, doesnt
---
10871
standardized, having, mangling, machine, past, called, exporting, size, say, im, interoperability, extern, functions, volumes
---
10872
resourc

module, modules, daemon, initpy, project, particular, time, im, reload, small, suggestions, isnt, changed, takes
---
11818
user, login, want, ip, address, authenticated, django, access, credentials, webpage, different, time, deny, given
---
11819
used, shift, equivalent, arithmetic, value, calculate, powers, multiplying, useful, infinity, negative, results, wondering, wiki
---
11820
return, function, strcharat, var, words, array, strlength, str, code, wrod, characters, tmp, strcharatstrlength, thislength
---
11821
new, viewhsplashscreenh, self, animlengthscreenchangeanimlength, iphone, storyboard, test, screen, size, widescreen, code, currentdevice, uiscreen, uiuserinterfaceidiomphone
---
11822
script, json, im, escape, object, using, embed, foo, django, html, edit, creating, reference, lets
---
11823
objects, method, does, thread, use, object, type, memory, attributes, methods, class, programming, safe, instantiated
---
11824
characters, printf, character, chr, petscii, crandom, bash,

---
12749
program, struct, reason, point, code, randomized, shall, stored, expression, signed, byte, implementation, int, answer
---
12750
master, integration, branch, merged, cehk, result, command, merge, dm, bgp, fjloq, git, question, exceptional
---
12751
query, string, possible, javascript, just, url, using, existence, check, jquery, gateway, gates, gasrc, gas
---
12752
global, gcc, int, void, wall, wextra, ansi, file, size, aout, char, question, globaldatac, output
---
12753
file, directory, using, tries, check, throws, exist, does, open, knows, fileexists, traverses, particular, application
---
12754
appx, device, message, gets, running, receives, services, restarted, soon, send, android, applications, manage, say
---
12755
idrole, username, positional, iduser, user, select, parameters, query, parameter, hibernate, deprecated, role, instead, somebody
---
12756
fail, return, personpets, personid, cast, person, work, didnt, pets, iqueryable, raised, given, list, heres
---
12757
pkg

work, putting, trying, rails, google, does, doesnt, docs, dothis, methods, experience, aftercreate, create, array
---
13651
require, way, svn, lot, happens, files, new, ensure, repositories, gitsvn, attribute, gbytes, gaps, garblestrslice
---
13652
backbone, reusing, server, resources, code, node, possible, using, solution, application, rewriting, normal, works, client
---
13653
like, listint, int, new, myvalues, using, create, linq, compiler, way, possible, says, time, array
---
13654
specify, body, miliseconds, thanks, time, session, webconfig, unit, sessionstate, aspnet, tell, gamesmapinsertstdpairint, gaq, gasrc
---
13655
public, interface, set, property, interfaces, ihasmembers, iorderedentity, meaning, entity, inamedentity, abstract, objects, class, memberstatus
---
13656
catch, blocks, used, somethings, solution, writing, code, good, notice, simple, variations, errors, reviewing, way
---
13657
update, authority, user, enable, current, revisitmeans, immediately, does, way, using,
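
Several of the lists above end in terms such as `gateway`, `gasrc`, `garblestrslice` or `zyx`. That pattern would be consistent with short questions that contain fewer than 14 distinct terms, so that the remaining slots are filled with zero-count vocabulary entries. A minimal sketch of one way to avoid this, assuming the same layout as `df_cv` (terms as rows, questions as columns) and using a hypothetical toy matrix `dtm`:

```python
import pandas as pd

# Hypothetical toy document-term matrix: rows are terms, columns are questions
dtm = pd.DataFrame(
    {0: [2, 0, 1, 0], 1: [0, 0, 0, 3]},
    index=["json", "gateway", "file", "zyx"],
)

top_dict = {}
for c in dtm.columns:
    col = dtm[c]
    # Keep only terms that actually occur in the question, then take the top 30
    top = col[col > 0].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

print(top_dict)
```

With this filter, a question with only two distinct terms yields a list of two words instead of being padded to the full length.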

In [6]:
# repeat the process for the tf-idf document-term matrix
top_dict_tfidf = {}
for c in df_tfidf.columns:
    top = df_tfidf[c].sort_values(ascending=False).head(30)
    top_dict_tfidf[c]= list(zip(top.index, top.values))

top_dict_tfidf

{0: [('arraysize', 0.5278445485693687),
  ('datac', 0.39588341142702654),
  ('sum', 0.3577145535997552),
  ('data', 0.1815692342610622),
  ('int', 0.17446696918402244),
  ('unsigned', 0.16655725970467167),
  ('faster', 0.14889949877950148),
  ('runs', 0.14333036672831964),
  ('stdsortdata', 0.13196113714234217),
  ('elapsedtime', 0.13196113714234217),
  ('loop', 0.1317481591470164),
  ('long', 0.1186571069769755),
  ('start', 0.11372804101144657),
  ('stdendl', 0.09763393214700793),
  ('thought', 0.09574743812087676),
  ('stdcout', 0.09153821653454555),
  ('sorting', 0.09090080129809036),
  ('include', 0.09031491689789034),
  ('sorted', 0.0857892977753784),
  ('array', 0.0836750802485507),
  ('code', 0.0811676422931402),
  ('primary', 0.07672691286161575),
  ('random', 0.07640698073021791),
  ('seconds', 0.07314938846777747),
  ('generate', 0.06636257139723833),
  ('systemoutprintlnsystemnanotime', 0.06598056857117109),
  ('stdrand', 0.06598056857117109),
  ('clockt', 0.065980568571171

In [7]:
# print the top 14 words by tf-idf weight for each question
for question, top_words in top_dict_tfidf.items():
    print(question)
    print(', '.join([word for word, weight in top_words[0:14]]))
    print('---')

0
arraysize, datac, sum, data, int, unsigned, faster, runs, stdsortdata, elapsedtime, loop, long, start, stdendl
---
1
json, textxjson, textxjavascript, applicationxjavascript, purported, hurt, targeted, varying, gather, mime, messing, pushing, standards, type
---
2
comments, json, inside, file, use, getclassthatdefinedmethodbarfoomethod, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath, getclassgetresourceurl, getclassgetresourceurlgetpath, getclassmethodsobject, zyx, getclassvarsgetclassobject
---
3
myobject, newuri, privmsg, ircevent, regex, follows, var, method, http, remove, property, say, end, best
---
4
nullpointerexception, object, null, avoid, someobjectdocalc, consequence, necessity, someobject, unreadable, test, idiom, ugly, alternative, tests
---
5
suit, enum, dosomethingsuit, enumerateallsuitsdemomethod, compile, spades, diamonds, clubs, hearts, enumerate, public, keyword, foreach, fails
---
6
nodejs, lets, considering, web, javascript, run, genera

percent, sentence, selectiveescape, test, break, printselectiveescape, print, typeerror, str, happens, required, format, actually, output
---
722
interviewing, rounds, nonnegative, equation, interview, friend, optimal, feedback, sorted, integers, iterate, thoughts, job, pattern
---
723
key, unique, primary, index, use, getclassgetresourceurlgetpath, getclass, getclassgetclassloadergetresource, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath, getclassgetresourceurl, getclassthatdefinedmethodbarfoomethod, getclassmethodsobject
---
724
partial, views, viewstartcshtml, razor, layout, view, folder, file, assign, engine, common, layoutshtml, anonbarcshtml, poses
---
725
dll, appconfig, connectionstring, musicgenesis, configurationmanagerappsettings, attempts, return, file, putting, console, according, configuration, obviously, copy
---
726
preview, renderusercontrol, whith, renderpartial, offering, upgrading, anymore, partial, render, functionality, complete, folder

uibarbuttonitem, button, image, uiimage, setbackgroundimage, initwithcustomviewv, topcapheight, imagesizeheight, imagesizewidth, stretchableimagewithleftcapwidth, forward, imagenamed, uibutton, border
---
1555
passphrase, github, cuserssubnussshidrsa, subnusmvcgit, git, elevated, key, cmdexe, origin, remember, guide, master, host, push
---
1556
golangorg, windows, nov, releases, compilers, official, binary, implemented, programming, os, appears, compiler, linux, google
---
1557
copy, deep, constructoroperatorfunction, icloneable, implement, remark, disagree, variant, derive, irrelevant, brackets, clone, preferred, apparently
---
1558
questions, learning, knows, interesting, book, solutions, great, help, thanks, getclassmethodsobject, getclass, getclassgetclassloadergetresource, getclassgetresource, getclassgetresourceasstream
---
1559
align, center, layout, screen, left, xml, button, right, add, set, example, want, code, getclrframeworkmajorversion
---
1560
github, wiped, project, invo

---
2388
log, viewer, program, chainsaw, freetext, connector, proved, unresponsive, buggy, decent, filtering, levels, colors, socket
---
2389
volatile, stdatomics, williamss, anthony, stdatomic, orthogonal, sutter, herb, qualifier, atomic, concurrent, concepts, concerned, gcc
---
2390
transform, rotate, played, translate, scale, animation, keyframes, colorchange, meant, playing, rid, frame, stop, goes
---
2391
console, ide, instead, systemconsolewritelinestr, systemdiagnosticsdebugwritelinestr, studio, visual, erased, junk, output, exited, having, popping, text
---
2392
string, mean, alertstrsubstr, alertstr, foo, does, mutable, immutable, return, str, alert, modified, initial, assume
---
2393
promotion, promotiontitle, promotionurl, promotionid, dr, connection, string, static, connect, public, drread, constring, sqlopenconnection, retrievepromotion
---
2394
mixins, import, appassetsstylesheets, dialogdividercolor, mastercssscss, colorscssscss, mixin, rails, colors, variable, files, un

---
3221
double, sum, total, mylistsum, mylistamountsum, accomplishing, list, calculating, looping, incorrect, individual, love, entry, syntax
---
3222
table, columns, shape, color, colorcode, shapeid, vertexlist, colorname, colorid, shapename, maps, id, tables, lets
---
3223
dataproperty, datalist, orderby, descending, ascending, data, queries, sortascending, ascendingquery, descendingquery, parameter, resp, select, var
---
3224
myfilem, suppose, defined, fm, function, wasnt, online, clear, documentation, script, best, read, possible, want
---
3225
searchview, ifroomcollapseactionview, searchviewsetonquerytextlistenerthis, searchviewperformclick, androidwidgetsearchview, drawableicactionsearch, searchviewrequestfocus, androidactionviewclass, menufinditemridmenusearchgetactionview, idmenusearch, upfront, androidshowasaction, thanx, looks
---
3226
array, leo, reset, null, key, question, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath, getclassgetresourceurl, ge

4054
simulator, woooohooohtml, httpvolcorelimbicsoftcom, gamekitpt, iphoneos, bluetooth, run, detect, device, api, reason, test, app, time
---
4055
mvvm, app, ef, newbie, question, wpf, programming, xamls, coexist, ive, dal, singletons, know, write
---
4056
list, element, ul, addlistelement, lihihi, libli, thelistremovechildthelistlastchild, foodelegateli, thelisthaschildnodes, liali, elementdatagrade, idfoo, item, click
---
4057
dll, compact, deployment, sql, server, serviceability, dlls, xcopy, folder, microsoft, edition, itaexe, fraexe, oledb
---
4058
exit, language, tell, difference, zyx, getclassvarsgetclassobject, getclassgetresourcepath, getclassgetresourceurl, getclassgetresourceurlgetpath, getclassmethodsobject, getclassthatdefinedmethodbarfoomethod, getclrframeworkmajorversion, getclientresponseclass, getclassgetresource
---
4059
urlfailtogoto, exit, formerrorphp, headersprintf, filea, sth, redirect, location, header, calling, thank, php, function, return
---
4060
sattr, hell

4887
produces, mod, operation, python, modulo, depending, language, results, right, different, used, make, like, getclassmethodsobject
---
4888
showpanel, function, delay, global, parallaxaboutonloadfunction, aboutdelay, gods, fadein, responding, page, panel, children, navigation, hide
---
4889
copying, history, repository, copy, directory, simple, way, zyx, getclassgetresourceurlgetpath, getclassgetclassloadergetresource, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath, getclassgetresourceurl
---
4890
motionevent, unit, constructor, wanted, manually, test, create, doesnt, thanks, getclassgetresourceurl, getclassgetclassloadergetresource, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath
---
4891
ifelse, statements, compiled, differences, operator, faster, code, getcheckeditempositions, getcomments, getclientresponseclass, getclrframeworkmajorversion, getcode, getcolor, getcolumncount
---
4892
init, email, houseid, modelsmodelinitself, 

gson, typedto, piece, returned, itemdto, typedtoclass, gsonfromjsonreply, orgjson, arraylistitemdto, string, mytypes, jsonarray, david, jsonobject
---
5721
listcustomer, listobject, getlist, list, cast, suggestion, compiler, compile, says, appreciated, return, does, zyx, getclassgetresourceurlgetpath
---
5722
driverfindelementbyname, driverfindelementbyxpath, click, selecting, httpwwwtizagcomphptexamplesformexphp, sendkeys, using, working, option, quote, xoption, drivernavigategotourl, firefoxdriver, formselectoption
---
5723
bash, mysql, namehost, nagioshost, uroot, wd, script, value, id, fetch, returned, query, returns, select
---
5724
pytest, repo, repoteststestapppy, repotests, barks, repomodelspy, reposettingspy, somedefinapp, repoapppy, path, import, easyinstall, likes, behaves
---
5725
split, reliable, char, forced, sub, individual, definition, character, strings, single, update, method, use, getclassgetresource
---
5726
testdb, php, command, database, line, tables, db, wanted, 

rklogdebug, restkit, macro, mapping, enable, noticed, undefined, debug, appears, calls, object, trying, code, im
---
6554
directory, word, files, count, authapplication, zegrep, xception, wc, occurrence, particular, say, number, working, tried
---
6555
valueforkey, nsexception, writeable, inelegant, myproperty, doing, id, sdk, catch, iphone, exist, way, property, key
---
6556
folder, library, dll, physically, tortoise, tfs, dlls, physical, drag, im, stores, explorer, svn, suggested
---
6557
therevisual, enclose, encloses, form, inside, fieldset, legend, behave, inner, li, border, tag, stuff, use
---
6558
js, recommended, bundle, files, pages, sense, cdns, order, cdn, web, sit, serve, decision, make
---
6559
synonym, synonyms, ora, graphical, awesome, looping, chain, familiar, trick, schema, helpful, debugging, definition, tool
---
6560
row, entries, column, different, count, dataframe, remains, rowwise, columnwise, randomized, randomize, changed, previous, randomizing
---
6561
idalvikv

---
7387
clr, spconfigure, enabled, spconfig, dasolpsdev, reconfigure, msg, near, incorrect, disabled, installation, enable, execution, configuration
---
7388
webclient, useragent, webheadercollection, clientheaders, headers, proper, useragentstring, headershttprequestheaderuseragent, myuseragentstring, options, advise, set, considering, phone
---
7389
image, crop, resdrawable, slice, parsed, unsure, imageview, loading, folder, suggestions, duplicate, android, possible, like
---
7390
duedate, dyntaskduedatevalue, dyn, callsite, datetime, task, json, newtonsoftjsonlinqjobject, jsonconvertdeserializeobjectdynamicrawjson, callsitetargetclosure, deserialisation, systemdynamicupdatedelegates, newtonsoft, fixes
---
7391
systemgc, disableexplicitgc, calculation, java, seconds, implemented, module, stoptheworld, redeploying, developper, hazards, recomodation, calls, abovementioned
---
7392
client, sql, server, fancy, usb, exe, zip, installing, explorer, management, ideally, choose, does, tools

8236
window, output, systematic, perfect, displaying, wasnt, disable, mind, shown, prefer, noticed, necessary, close, compile
---
8237
sitecom, localstorage, inaccessible, subdomains, decides, subdomain, visit, cookies, decision, personal, replacing, stupid, originally, domain
---
8238
uiimage, image, object, instantiate, alloc, init, contain, contains, created, check, doesnt, like, getclassvarsgetclassobject, getclassthatdefinedmethodbarfoomethod
---
8239
css, examine, styles, vs, designcss, improving, fontstyle, poorly, resharper, improved, files, refactor, expert, knew
---
8240
exclamation, tortoisegit, pulled, stays, committed, pushed, sign, red, modified, small, shows, seen, issue, change
---
8241
datagridview, row, want, field, things, select, able, user, use, getclassmethodsobject, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath, getcalendarhourofdaycalgetcalendarminute
---
8242
jsonobjetcsarrays, orthisname, jsonobjectdescription, jsonobjects, jsonseri

9053
api, restcontroller, sturgeon, request, phil, codeigniter, create, generating, sending, keys, rest, manually, send, users
---
9054
print, lambdax, true, rise, false, lt, compares, eq, intercept, addresses, doesnt, low, hack, comparison
---
9055
mvc, variables, application, state, robustsecure, applicationsession, tenantsaccounts, sesion, stateless, things, busy, realise, caching, used
---
9056
flags, lunacy, moronic, aggravating, nullobjects, thought, clike, booleans, httpmsdnmicrosoftcomenuslibrary, java, earth, conversions, natural, numeric
---
9057
keyguardmanager, lock, getsystemserviceactivitykeyguardservice, lockdisablekeyguard, keyguardmanagernewkeyguardlockkeyguardservice, lockreenablekeygaurd, keyguardlock, android, unlock, permission, programmatically, phone, enable, snippet
---
9058
runnable, pool, submit, execute, ways, thread, difference, add, getcellbycolumnandrowcolumn, getclientresponseclass, getclassgetresourceurl, getclassgetresourceurlgetpath, getclassmethodsobj

9886
male, gender, female, btree, index, indexes, rowids, useful, bitmaps, unspecified, filtered, identifier, bitmap, wikipedia
---
9887
linux, primary, tools, environment, complete, recently, os, programming, started, set, need, using, getcamerastereomode, getclassgetresource
---
9888
icons, images, project, buttoncontent, folder, currently, projects, siteoforigin, packsiteoforigindataimagescomponenticons, proj, button, subfolder, called, designer
---
9889
pycharm, intellij, compelling, buy, beta, love, coming, plugin, noticed, havent, using, reason, python, just
---
9890
addindex, devise, unique, email, tests, constraintexception, testunitusertestrb, activerecordrecordnotunique, usertestrb, dbtestload, truth, users, reran, rerun
---
9891
proxy, jimmy, systems, internet, detect, requires, automatically, setting, access, read, java, user, application, thanks
---
9892
recommended, step, codeoriented, css, layouting, difficulties, feedback, regards, guide, seeing, appreciate, advice, mas

pagestatuscodeshould, pagestatuscode, capybaranotsupportedbydrivererror, deletes, spec, giving, pages, js, status, post, check, end, true, line
---
10720
harddrive, filezeigene, includeing, local, lying, userscript, greasemonkey, include, webpages, documentations, python, modifies, ease, indexhtml
---
10721
document, objectivec, grammar, provided, standard, years, close, docsapijavalangobjecthtml, bewildered, httpwwwomnigroupcommailmanarchivemacosxdev, httpdocsoraclecomjavase, httpclangllvmorgdocsobjectivecliteralshtml, barest, ironically
---
10722
xyvalue, valid, esscence, tested, chrome, double, console, worked, wanted, starting, expected, php, check, javascript
---
10723
content, cms, dynamic, store, sql, html, server, type, data, use, getclassgetclassloadergetresource, getclassgetresource, getclassgetresourceasstream, getclassgetresourcepath
---
10724
roles, role, group, deletetask, viewtask, addtask, permissions, aspnet, approach, user, assigned, application, authorizeroles, decor

pdf, byte, inputstreamtostring, systemoutprintln, bytearray, catch, inputstream, convertdoctobytearraystring, bytearraynull, fileinputstreamsourcepath, inputstreamtostringgetbytes, convertbytearraytodocbyte, systemoutprintlne, dabcxyz
---
11646
stylesheet, mobile, cssmobilecss, landscapeonly, autorotate, handheld, jquery, accordingly, rel, media, orientation, alter, devices, disable
---
11647
press, finish, stack, onbackpressed, toast, detection, exiting, warn, coded, pop, activities, id, detect, seconds
---
11648
fault, hardware, msdn, address, valid, memorymemory, acctually, encompasses, process, exception, belonging, troubleshooting, owned, freed
---
11649
serverclientside, question, maybe, javascript, clientside, vice, versa, serverside, stupid, green, considering, strange, scope, developers
---
11650
icollection, ilist, implement, collection, concurrentbag, nongeneric, indexer, robust, threadsafe, ordered, method, doesnt, interfaces, adds
---
11651
typeid, views, data, view, table

href, homea, httpwwwgusdecoolcom, contactushtml, googlea, usa, httpgooglecom, link, contact, blank, target, gusdecoolcom, jquery, server
---
12553
doubleminvalue, boiled, comparing, systemoutprintln, essentially, bug, double, returns, false, possible, following, code, getchilditem, getchildlocations
---
12554
safari, homescreen, mobile, curious, app, exclusives, comedy, ferrells, doubletab, fod, httpmfunnyordiecom, site, click, shows
---
12555
gamecenteravailable, bool, difference, userauthenticated, gcturnbasedmatchhelper, interface, selfivar, property, ivar, brackets, nsobject, declaring, defining, readonly
---
12556
documentall, nonprimitive, falsy, dom, alert, hello, explain, object, doesnt, example, code, getcode, getcolor, getclientresponseclass
---
12557
datamodule, conventions, compared, purpose, special, usually, normal, module, properties, having, project, used, class, does
---
12558
method, class, passing, arguments, command, main, running, line, trying, way, im, like, getch

array, linq, select, multidimensional, toarraydatatable, jagged, stringtablerowscounttablecolumnscount, tablecolumnscount, tablerowsijtostring, tablerowscount, twodimensional, arrayi, thats, int
---
13386
documentbodydeselectall, deselect, function, global, selected, figure, got, javascript, text, simple, just, like, getcsv, getcurrentsessioncreatequery
---
13387
array, cherry, position, searchvalue, racerecord, arraysearchsearchvalue, arraysearch, colorid, carid, key, banana, dimensional, associative, car
---
13388
androidlayoutwidth, androidlayoutheight, fillparent, imageview, linearlayout, wrapcontent, androidid, textview, view, androidorientation, convertview, androidlayoutgravity, gridview, vertical
---
13389
token, generatorbased, nexttoken, lookahead, implies, item, scan, generator, ahead, implemented, making, scanstring, tokenlistindex, tokenlist
---
13390
namedscope, mongoid, mongoidtimestamps, embeddedin, mongoiddocument, include, createdat, desc, active, record, limit, recen

**NOTE**: In both dictionaries of top words, many terms that look like concatenations of two or more words, and terms that don't appear to be actual words, have a score of 0 beside them in both document-term matrices. Not every term that scores 0 is meaningless, but we can filter out many of these terms by term frequency and weight. We'll first add the top 30 words from each question to a list, and then remove the less common terms that are not actual words.
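One possible reason zero-score terms show up among the top words: if a question has fewer than 30 distinct terms, taking the 30 largest counts from its column pads the list with terms that score 0. The sketch below uses a made-up four-term matrix and a hypothetical `top_terms` helper. It assumes the top-word dictionaries were built by sorting each column and keeping the first `n` entries, which may differ from how they were actually built.

```python
import pandas as pd

# Toy document-term matrix (hypothetical data): rows are terms, columns are question ids
dtm = pd.DataFrame(
    {0: [3, 1, 0, 0], 1: [0, 0, 2, 0]},
    index=["array", "loop", "json", "getclassgetresource"],
)

def top_terms(dtm, n=3, drop_zeros=False):
    """Return {question: [(term, count), ...]} with the n highest-count terms."""
    result = {}
    for question in dtm.columns:
        ranked = dtm[question].sort_values(ascending=False).head(n)
        if drop_zeros:
            # keep only terms that actually occur in this question
            ranked = ranked[ranked > 0]
        result[question] = list(zip(ranked.index, ranked.values))
    return result

padded = top_terms(dtm, n=3)                     # short questions are padded with 0-count terms
filtered = top_terms(dtm, n=3, drop_zeros=True)  # padding removed
```

With `drop_zeros=True`, each question keeps only the terms it actually contains, so the padding never reaches the aggregated word list.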

In [8]:
# add the top 30 terms from each question from the dictionary to a list
words_cv = []
for question in df_cv.columns:
    top = [word for (word, count) in top_dict_cv[question]]
    for t in top:
        words_cv.append(t)
        
words_cv

['sum',
 'data',
 'arraysize',
 'int',
 'datac',
 'code',
 'runs',
 'long',
 'loop',
 'faster',
 'start',
 'unsigned',
 'thought',
 'include',
 'array',
 'generate',
 'new',
 'stdsortdata',
 'import',
 'public',
 'elapsedtime',
 'main',
 'stdendl',
 'random',
 'seconds',
 'sorted',
 'just',
 'test',
 'sorting',
 'stdcout',
 'json',
 'know',
 'id',
 'like',
 'type',
 'content',
 'targeted',
 'ive',
 'gather',
 'rest',
 'slightly',
 'api',
 'theres',
 'support',
 'varying',
 'properly',
 'mime',
 'correct',
 'textjavascript',
 'question',
 'purported',
 'seen',
 'issues',
 'browser',
 'returned',
 'answer',
 'hurt',
 'applicationjson',
 'applicationxjavascript',
 'text',
 'use',
 'inside',
 'comments',
 'file',
 'json',
 'garbage',
 'gateway',
 'gates',
 'gasrc',
 'gas',
 'garden',
 'garblestrslice',
 'garbled',
 'zyx',
 'gather',
 'gaq',
 'gaps',
 'gapplyx',
 'gap',
 'gantt',
 'gang',
 'gamma',
 'gaming',
 'gameuimain',
 'gamesmapinsertstdpairint',
 'gamesmapcurrentpos',
 'gatewayinterf

In [9]:
# aggregate the list and count how many questions each word appears in as a top word
most_common_cv = Counter(words_cv).most_common()

In [10]:
len(most_common_cv)

34457

We can see that the least common terms in our list (i.e., those that occur just once) are overwhelmingly not words. Let's remove those.

In [11]:
most_common_cv[-100:]

[('loud', 1),
 ('administratorlevel', 1),
 ('seedsrb', 1),
 ('longnumber', 1),
 ('intnumber', 1),
 ('longmaxvalue', 1),
 ('nparraydata', 1),
 ('basemap', 1),
 ('rangelendata', 1),
 ('standardfamous', 1),
 ('rootfolder', 1),
 ('rootfoldersomefoldersomesubfolderxmlmyfilexml', 1),
 ('rootfoldersomefoldersomesubfoldernxml', 1),
 ('somenetworkpathrootfolder', 1),
 ('tagselection', 1),
 ('appendnewoption', 1),
 ('developerplatformsiphonesimulatorplatformdeveloperusrbinllvmgcc', 1),
 ('lz', 1),
 ('usersyariksmirnovlibrarydeveloperxcodederiveddatagoozzycugjuvvsrzjqwvfiicxtykbqaguxbuildproductsdebugiphonesimulatorgoozzyappgoozzy',
  1),
 ('isysroot', 1),
 ('developerplatformsiphonesimulatorplatformdeveloperusrbindeveloperusrbinusrbinbinusrsbinsbin',
  1),
 ('diphoneosversionminrequired', 1),
 ('usersyariksmirnovdesktopgoozybranchesnew', 1),
 ('usersyariksmirnovlibrarydeveloperxcodederiveddatagoozzycugjuvvsrzjqwvfiicxtykbqaguxbuildintermediatesgoozzybuilddebugiphonesimulatorgoozzybuildobjectsnor

In [12]:
# if fewer than two questions have it as a top word, add it to the extra stop words
add_stop_words_cv = [word for word, count in most_common_cv if count < 2]
add_stop_words_cv

['arraysize',
 'datac',
 'stdsortdata',
 'elapsedtime',
 'hurt',
 'newuri',
 'ircevent',
 'privmsg',
 'consequence',
 'someobjectdocalc',
 'diamonds',
 'spades',
 'clubs',
 'aclickfunction',
 'eiffel',
 'mulsd',
 'movapd',
 'movsd',
 'addsd',
 'jnj',
 'edx',
 'deleteafter',
 'funcs',
 'nullreferenceexception',
 'inte',
 'sizeofstruct',
 'buildbugonnulle',
 'whereever',
 'usrincludelinuxkernelh',
 'buildbugonzeroe',
 'dropbox',
 'readerclose',
 'readfilestring',
 'prototypically',
 'prototypal',
 'typeofint',
 'critique',
 'chatroom',
 'aonea',
 'sausage',
 'harrowprint',
 'foocontroller',
 'consolesvisual',
 'scopeemitsomeevent',
 'consolelogmass',
 'scopeonsomeevent',
 'secondctrlscope',
 'paw',
 'toes',
 'paws',
 'jextees',
 'maximums',
 'ity',
 'custs',
 'dbcustomers',
 'deferred',
 'sysstderr',
 'sysstderrwritespamn',
 'inunion',
 'tailinunionu',
 'inuniont',
 'unionnode',
 'templateid',
 'buckets',
 'prefixes',
 'tarraynewinstanceclazz',
 'stackt',
 'stackclasst',
 'shorthash',
 '

Great! We'll repeat this process once more for the TF-IDF document-term matrix and see what words are included in the stopwords list.
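Since the same steps run for both matrices, they could be wrapped in one helper. The sketch below is a minimal version under that assumption; `rare_top_terms`, its `max_count` parameter, and the toy dictionary are hypothetical names and data, not part of the original notebook.

```python
from collections import Counter

def rare_top_terms(top_dict, max_count=1):
    """Return terms that are a top word in at most max_count questions."""
    # flatten {question: [(word, score), ...]} into a single list of words
    words = [word for pairs in top_dict.values() for word, _ in pairs]
    counts = Counter(words)
    return [word for word, count in counts.most_common() if count <= max_count]

# hypothetical toy input in the same shape as the top-word dictionaries
toy_top_dict = {
    0: [("json", 4), ("arraysize", 2)],
    1: [("json", 1), ("stdsortdata", 1)],
}
rare = rare_top_terms(toy_top_dict)
```

Calling the helper once per dictionary keeps the two stop word lists from being mixed up by accident.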

In [13]:
words_tfidf = []
for question in df_tfidf.columns:
    top = [word for (word, count) in top_dict_tfidf[question]]
    for t in top:
        words_tfidf.append(t)
        
words_tfidf

['arraysize',
 'datac',
 'sum',
 'data',
 'int',
 'unsigned',
 'faster',
 'runs',
 'stdsortdata',
 'elapsedtime',
 'loop',
 'long',
 'start',
 'stdendl',
 'thought',
 'stdcout',
 'sorting',
 'include',
 'sorted',
 'array',
 'code',
 'primary',
 'random',
 'seconds',
 'generate',
 'systemoutprintlnsystemnanotime',
 'stdrand',
 'clockt',
 'arrayssortdata',
 'intarraysize',
 'json',
 'textxjson',
 'textxjavascript',
 'applicationxjavascript',
 'purported',
 'hurt',
 'targeted',
 'varying',
 'gather',
 'mime',
 'messing',
 'pushing',
 'standards',
 'type',
 'applicationjson',
 'slightly',
 'id',
 'textjavascript',
 'security',
 'know',
 'rest',
 'issues',
 'anybody',
 'returned',
 'properly',
 'browser',
 'theres',
 'api',
 'support',
 'seen',
 'comments',
 'json',
 'inside',
 'file',
 'use',
 'getclassthatdefinedmethodbarfoomethod',
 'getclassgetresource',
 'getclassgetresourceasstream',
 'getclassgetresourcepath',
 'getclassgetresourceurl',
 'getclassgetresourceurlgetpath',
 'getclassmet

In [14]:
most_common_tfidf = Counter(words_tfidf).most_common()

In [15]:
len(most_common_tfidf)

65299

It looks like we have the same problem with the words from the TF-IDF object, except that here the terms that don't make sense are both commonly and uncommonly used. This may not be the best model to move forward with. Again, the least common terms are usually not words, so let's remove them here as well.

In [16]:
most_common_tfidf[-100:]

[('auditrecord', 1),
 ('combartholemauditrecord', 1),
 ('webinfapplicationcontextxml', 1),
 ('orgspringframeworkbeansbeaninstantiationexception', 1),
 ('javasecurityprivilegedactionexception', 1),
 ('orgspringframeworkbeansfactorybeancreationexception', 1),
 ('tellus', 1),
 ('massa', 1),
 ('duis', 1),
 ('nisi', 1),
 ('venenatis', 1),
 ('erat', 1),
 ('httpwwwcodeprojectcomkbwpfhtmltextblockaspx', 1),
 ('mauris', 1),
 ('pharetra', 1),
 ('interdum', 1),
 ('siteasuspendisse', 1),
 ('tempus', 1),
 ('sapien', 1),
 ('phasellus', 1),
 ('posuere', 1),
 ('quam', 1),
 ('hrefhttpsomesitecomsome', 1),
 ('feugiat', 1),
 ('praesent', 1),
 ('lacus', 1),
 ('scrollsethorizontalscrollbarpolicyjscrollpanehorizontalscrollbaralways', 1),
 ('textareasetvisibletrue', 1),
 ('adel', 1),
 ('scrollsetverticalscrollbarpolicyjscrollpaneverticalscrollbaralways', 1),
 ('textareaseteditablefalse', 1),
 ('scrollvsetverticalscrollbarpolicyjscrollpaneverticalscrollbaralways', 1),
 ('frameaddscroll', 1),
 ('scrollv', 1),


In [17]:
# if fewer than two questions have it as a top word, add it to the extra stop words
add_stop_words_tfidf = [word for word, count in most_common_tfidf if count < 2]
add_stop_words_tfidf

We can now update our original DataFrame and our document-term matrices with the new list of stop words.

In [18]:
# read in the dataframe
df = pd.read_pickle('df.pkl')

# add new stop words
stop_words_cv = text.ENGLISH_STOP_WORDS.union(add_stop_words_cv)

# create a column with cleaned text
df['tokenized_word'] = df['body'].apply(word_tokenize)
df['removed_stop_words'] = df['tokenized_word'].apply(lambda x: [word for word in x if word not in stop_words_cv])
df['no_stop_words'] = df['removed_stop_words'].apply(lambda x: ' '.join(x))
df.drop(columns=['tokenized_word', 'removed_stop_words'], inplace=True)

# pickle the updated dataframe
df.to_pickle("df_stop.pkl")

In [19]:
df.head()

Unnamed: 0,id,title,body,answer_count,favorite_count,score,tags,view_count,reputation,no_stop_words
0,11227809,Why is processing a sorted array faster than a...,here is a piece of c code that seems very pecu...,13,7317.0,14772,java c++ performance optimization branch-predi...,805490,1,piece c code peculiar strange reason sorting d...
1,477816,What is the correct JSON content type?,ive been messing around with json for some tim...,29,1089.0,6768,json content-type,1403837,95,ive messing json time just pushing text anybod...
2,244777,Can I use comments inside a JSON file?,can i use comments inside a json file if so how,39,378.0,3437,json comments,631045,25,use comments inside json file
3,208105,How do I remove a property from a JavaScript o...,say i create an object as follows var myobject...,13,539.0,2891,javascript object-properties,865544,16,say create object follows var myobject method ...
4,271526,Avoiding != null statements,the idiom i use the most when programming in j...,49,1083.0,2499,java nullpointerexception null,737912,369,idiom use programming java test object null us...


In [20]:
# recreate the document-term matrix (stop words are passed as a list)
cv_stop = CountVectorizer(stop_words=list(stop_words_cv))
cv_wm = cv_stop.fit_transform(df.body)
# on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
df_stop_cv = pd.DataFrame(cv_wm.toarray(), columns=cv_stop.get_feature_names())
df_stop_cv.index = df.index

# pickle for later use
df_stop_cv.to_pickle("df_stop_cv.pkl")
with open("cv_stop.pkl", "wb") as outfile:
    pickle.dump(cv_stop, outfile)

In [21]:
# repeat the process for the tfidf dtm
stop_words_tfidf = text.ENGLISH_STOP_WORDS.union(add_stop_words_tfidf)

# use the TF-IDF vectorizer here so the matrix holds weights, not raw counts
tfidf_stop = TfidfVectorizer(stop_words=list(stop_words_tfidf))
tfidf_wm = tfidf_stop.fit_transform(df.body)
# on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
df_stop_tfidf = pd.DataFrame(tfidf_wm.toarray(), columns=tfidf_stop.get_feature_names())
df_stop_tfidf.index = df.index

# pickle for later use
df_stop_tfidf.to_pickle("df_stop_tfidf.pkl")
with open("tfidf_stop.pkl", "wb") as outfile:
    pickle.dump(tfidf_stop, outfile)
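A quick sanity check after refitting is to confirm that none of the added stop words made it into the vocabulary. The sketch below does this on a two-document toy corpus; the documents and the `extra` list are made up for illustration, and the same check could be run against the fitted vectorizers above.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical extra stop words and toy documents
extra = ["arraysize", "stdsortdata"]
stop_words = list(text.ENGLISH_STOP_WORDS.union(extra))
docs = ["the arraysize loop runs faster", "stdsortdata sorts the array"]

cv = CountVectorizer(stop_words=stop_words)
cv.fit(docs)

# any added stop word that still appears in the fitted vocabulary
leaked = set(cv.vocabulary_) & set(extra)
```

An empty `leaked` set means the custom stop words were applied as intended.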

## Basic EDA
We'll now pull some basic statistics from the DataFrame to learn more about the questions on Stack Overflow.

In [22]:
df.head()

Unnamed: 0,id,title,body,answer_count,favorite_count,score,tags,view_count,reputation,no_stop_words
0,11227809,Why is processing a sorted array faster than a...,here is a piece of c code that seems very pecu...,13,7317.0,14772,java c++ performance optimization branch-predi...,805490,1,piece c code peculiar strange reason sorting d...
1,477816,What is the correct JSON content type?,ive been messing around with json for some tim...,29,1089.0,6768,json content-type,1403837,95,ive messing json time just pushing text anybod...
2,244777,Can I use comments inside a JSON file?,can i use comments inside a json file if so how,39,378.0,3437,json comments,631045,25,use comments inside json file
3,208105,How do I remove a property from a JavaScript o...,say i create an object as follows var myobject...,13,539.0,2891,javascript object-properties,865544,16,say create object follows var myobject method ...
4,271526,Avoiding != null statements,the idiom i use the most when programming in j...,49,1083.0,2499,java nullpointerexception null,737912,369,idiom use programming java test object null us...


In [23]:
df.score.describe()

count    14091.000000
mean        52.935491
std        169.295244
min         16.000000
25%         20.000000
50%         28.000000
75%         48.000000
max      14772.000000
Name: score, dtype: float64

In [24]:
df.view_count.describe()

count    1.409100e+04
mean     4.295889e+04
std      7.523798e+04
min      2.380000e+02
25%      1.019850e+04
50%      2.237800e+04
75%      4.868150e+04
max      2.141036e+06
Name: view_count, dtype: float64
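Both summaries show a mean well above the median (for example, a mean score of about 52.9 against a median of 28, with a maximum of 14772), which points to a strongly right-skewed distribution. The sketch below shows one way to quantify this and to compress the long tail with a log transform. It uses a small made-up series rather than the real `df.score` column.

```python
import numpy as np
import pandas as pd

# hypothetical scores with one extreme value, mimicking a long right tail
scores = pd.Series([16, 18, 20, 22, 28, 35, 48, 60, 14772])

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "skew": scores.skew(),  # positive values indicate a right tail
}

# a log transform compresses the tail, which can make later plots easier to read
log_scores = np.log10(scores)
log_skew = log_scores.skew()
```

On the real data, the same calls could be applied to `df.score` and `df.view_count` directly.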

## Choose a Model
We saw that the top words from the TF-IDF vectorizer included both common and uncommon terms that did not make much sense. Because those terms are not confined to the rare end, a frequency cutoff alone cannot move them into the stop word list, which would make it harder for the later model to surface only meaningful terms. Therefore, we'll use the CountVectorizer object from here on for our model.
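One possible explanation for the noisier TF-IDF top words is that TF-IDF boosts terms that occur in few documents, so a one-off concatenated token can outrank an ordinary word with the same raw count. The sketch below illustrates this on a three-document toy corpus with default vectorizer settings; the corpus is made up and the effect on the real data may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical corpus: "json" is in every document, "stdsortdata" in only one
docs = [
    "json file json parse",
    "json file read",
    "json stdsortdata",
]

cv = CountVectorizer()
tfidf = TfidfVectorizer()
counts = cv.fit_transform(docs).toarray()
weights = tfidf.fit_transform(docs).toarray()

# compare the two terms in the last document, where each occurs once
last = 2
count_common = counts[last, cv.vocabulary_["json"]]
count_rare = counts[last, cv.vocabulary_["stdsortdata"]]
weight_common = weights[last, tfidf.vocabulary_["json"]]
weight_rare = weights[last, tfidf.vocabulary_["stdsortdata"]]
```

The raw counts are tied, while the TF-IDF weight of the rare token is higher, which is consistent with the count-based matrix being easier to clean with a simple frequency cutoff.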