# Exploratory Data Analysis

## Introduction
After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, exploratory data analysis (EDA) techniques include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data: we look for the more obvious patterns with EDA before identifying hidden patterns with machine learning (ML) techniques. We are going to look at the following for each question posted on Stack Overflow:

1. Most common words - find these to add to a list of stopwords
2. Size of vocabulary - look at the number of unique words
3. Relationship to tags - add topics to the original DataFrame to draw parallels
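
As a rough illustration of the first two items, here is a minimal sketch on a toy corpus. The `questions` strings and the `summarize` helper are made up for this example and are not part of the Stack Overflow data or the original notebook; the analysis below works on pickled document-term matrices instead.

```python
from collections import Counter

# Hypothetical, already-cleaned questions (not from the real data set)
questions = [
    "sort array faster sorted array",
    "json type json mime type",
    "git commit merge git push",
]

def summarize(question, n=2):
    """Return the n most common words and the vocabulary size of one question."""
    counts = Counter(question.split())
    return counts.most_common(n), len(counts)

for i, question in enumerate(questions):
    top_words, vocab_size = summarize(question)
    print(i, top_words, "vocabulary size:", vocab_size)
```

Splitting on whitespace only works here because the toy text is already clean; real questions need the tokenization and cleaning from the previous step.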

In [1]:
# data analysis and manipulation
import pandas as pd
import numpy as np
from collections import Counter

# files
import pickle

# text manipulation
import re
import string
from bs4 import BeautifulSoup

# natural language processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# topic modeling
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Most Common Words

In [2]:
# read in the document-term matrix (DTM) built with the count vectorizer
df_cv = pd.read_pickle('df_cv.pkl')
# transpose so that rows are terms and columns are questions
df_cv = df_cv.transpose()
df_cv.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14081,14082,14083,14084,14085,14086,14087,14088,14089,14090
aaa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aba,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ababab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
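
To make the layout of this matrix concrete, here is a minimal sketch that builds a tiny term-by-question count matrix by hand. It is only a stand-in: the pickled `df_cv` was presumably produced with `CountVectorizer` in the cleaning step, and the `docs` strings below are invented for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical cleaned questions standing in for the real corpus
docs = ["data sum data loop", "json type json", "loop sum"]

# One row per question, one column per term; terms a question lacks count as 0
dtm = pd.DataFrame([dict(Counter(doc.split())) for doc in docs]).fillna(0).astype(int)
dtm = dtm[sorted(dtm.columns)]  # alphabetical term order, like the output above

# Transposing gives the layout used in this section:
# rows are terms, columns are questions
term_by_question = dtm.transpose()
print(term_by_question)
```

With this orientation, selecting one column (for example `term_by_question[0]`) yields the word counts of a single question, which is what the loops further down rely on.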


In [3]:
# do the same for the DTM built with the tf-idf vectorizer
df_tfidf = pd.read_pickle('df_tfidf.pkl')
# transpose so that rows are terms and columns are questions
df_tfidf = df_tfidf.transpose()
df_tfidf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14081,14082,14083,14084,14085,14086,14087,14088,14089,14090
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aba,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ababab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
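
The two matrices differ in how they weight a term: raw counts reward frequency inside one question, while tf-idf also penalizes terms that occur in many questions. The sketch below uses a simplified textbook weighting, count times log(N / document frequency), on made-up numbers. This is an assumption for illustration: scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each document by default, so its values will not match these.

```python
import numpy as np
import pandas as pd

# Toy term-by-question counts in the same layout as df_cv
counts = pd.DataFrame(
    {0: [3, 1, 0], 1: [2, 0, 4], 2: [1, 0, 0]},
    index=["code", "arraysize", "json"],
)

n_docs = counts.shape[1]
doc_freq = (counts > 0).sum(axis=1)   # number of questions containing each term
idf = np.log(n_docs / doc_freq)       # terms found in every question get idf 0
tfidf = counts.mul(idf, axis=0)       # scale each term's row by its idf

print("top term by count: ", counts[0].idxmax())
print("top term by tf-idf:", tfidf[0].idxmax())
```

In this toy case `code` has the highest count in question 0 but appears in every question, so the rarer `arraysize` ranks first under the tf-idf weighting.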


In [4]:
# find the top 30 words used in each question
# (each column of df_cv holds the word counts for one question)
top_dict_cv = {}
for c in df_cv.columns:
    top = df_cv[c].sort_values(ascending=False).head(30)
    top_dict_cv[c] = list(zip(top.index, top.values))

top_dict_cv

{0: [('data', 8),
  ('arraysize', 8),
  ('sum', 8),
  ('int', 7),
  ('datac', 6),
  ('code', 5),
  ('unsigned', 4),
  ('start', 4),
  ('runs', 4),
  ('long', 4),
  ('faster', 4),
  ('loop', 4),
  ('include', 3),
  ('thought', 3),
  ('array', 3),
  ('just', 2),
  ('seconds', 2),
  ('stdcout', 2),
  ('new', 2),
  ('stdendl', 2),
  ('sorting', 2),
  ('stdsortdata', 2),
  ('primary', 2),
  ('public', 2),
  ('random', 2),
  ('sorted', 2),
  ('elapsedtime', 2),
  ('generate', 2),
  ('test', 2),
  ('main', 2)],
 1: [('json', 3),
  ('know', 2),
  ('like', 2),
  ('type', 2),
  ('id', 2),
  ('seen', 1),
  ('similar', 1),
  ('anybody', 1),
  ('properly', 1),
  ('mime', 1),
  ('applicationjson', 1),
  ('ive', 1),
  ('best', 1),
  ('purported', 1),
  ('question', 1),
  ('varying', 1),
  ('pushing', 1),
  ('correct', 1),
  ('returned', 1),
  ('browser', 1),
  ('time', 1),
  ('start', 1),
  ('doing', 1),
  ('gather', 1),
  ('issues', 1),
  ('slightly', 1),
  ('targeted', 1),
  ('answer', 1),
  ('cont

In [5]:
# print the top 14 words used in each question (the slice [0:14] keeps 14 items)
for question, top_words in top_dict_cv.items():
    print(question)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')

0
data, arraysize, sum, int, datac, code, unsigned, start, runs, long, faster, loop, include, thought
---
1
json, know, like, type, id, seen, similar, anybody, properly, mime, applicationjson, ive, best, purported
---
2
inside, comments, file, json, use, aaa, scheduledthreadpoolexecutor, scheduledrun, scheduledrunfindparamsscheduledrunid, scheduledrunid, scheduledruns, scheduledrunsid, scheduledruntransaction, scheduledschools
---
3
myobject, ircevent, follows, regex, newuri, var, privmsg, method, end, way, best, new, http, remove
---
4
object, null, test, use, code, nullpointerexception, avoid, want, good, field, java, idiom, appear, tests
---
5
suit, enum, public, compile, void, time, does, second, following, spades, clubs, diamonds, gives, code
---
6
nodejs, lets, web, run, javascript, like, good, server, type, considering, general, understand, problems, use
---
7
apply, function, func, performance, difference, differences, vice, versa, vs, use, methods, using, var, alerthello
---
8

---
818
window, code, undefined, jquery, ive, appear, going, pair, seen, noticed, whats, encapsulated, namespacing, particular
---
819
html, know, javascript, storing, highlighting, number, maintenance, does, sort, easier, plain, staying, cms, widget
---
820
change, api, public, type, new, foo, languages, class, void, method, break, adding, code, members
---
821
libraries, make, outofthebox, stl, general, game, qt, crossplatform, use, commercial, app, interface, start, compatible
---
822
order, sort, array, way, easy, descending, arrays, ascending, class, like, stop, lazy, scale, scheduledschools
---
823
button, color, tried, highlighted, user, background, change, finger, uibutton, code, calling, working, didnt, remains
---
824
gdb, starting, sessions, like, id, keys, saves, previous, arrow, use, new, set, session, commands
---
825
way, better, check, situaions, sizeofarr, alternate, judge, element, available, methods, array, just, traditional, trying
---
826
case, string, characters, 

---
1651
reusable, ihttphandler, use, situations, value, implement, property, says, request, true, class, isreusable, handler, questions
---
1652
like, json, object, want, does, string, difficulties, php, function, work, userid, thanks, data, looks
---
1653
null, table, index, create, temptable, key, nvarchar, server, id, sql, declare, variable, int, databasedefault
---
1654
commit, weeks, did, file, single, restore, old, ago, want, schedulewithfixeddelay, scheduledrunsid, scheduledruntransaction, scheduledschools, scheduledthreadpoolexecutor
---
1655
load, gallery, holding, horizontal, appears, image, shot, fine, correctly, bitmap, vertically, phone, vertical, rotated
---
1656
git, message, commit, following, push, solution, merge, question, section, checkout, reset, redid, quick, pushing
---
1657
enum, flags, write, declare, hexadecimal, myenum, public, values, usually, use, people, instead, times, way
---
1658
class, test, aclass, subclass, aaa, scheduledtimerwithtimeinterval, sched

worker, static, new, using, error, bwdowork, message, thread, backgroundworker, void, facebookfriendslist, facebookapplicationfacebookfriendslist, trying, wpf
---
2318
need, pm, function, numbers, library, place, outputting, ampm, strings, float, seconds, display, like, code
---
2319
im, dirnamefile, uses, projects, looks, sure, aaa, scheduleorstep, scheduledrunsid, scheduledruntransaction, scheduledschools, scheduledthreadpoolexecutor, scheduledtimerwithtimeinterval, scheduleinrunloopnsrunloop
---
2320
vector, iter, clear, iterating, times, numberin, way, int, multiple, code, stdvectorintiterator, enditer, mynmberspushbacki, element
---
2321
good, bad, value, slider, okay, sender, ibaction, slidervaluechangeduislider, slidervaluesender, nil, code, display, use, integer
---
2322
distinct, siteid, ts, site, count, cp, trying, time, datesubnow, im, select, reason, multiple, collect
---
2323
control, custom, scenario, delete, build, use, contains, want, sender, im, eventargs, function, pa

like, methods, object, use, namespace, js, using, searching, function, cases, site, performance, used, hard
---
3151
search, faceted, im, fulltext, wikipedia, requirement, open, feature, good, context, helpful, trying, usebenefit, understand
---
3152
file, folder, create, applicationcontext, spring, project, confused, classpath, root, inside, guess, src, mvc, terms
---
3153
text, contents, img, image, floatright, div, want, margintop, case, position, space, cake, contains, seeminglysimple
---
3154
file, application, got, manifest, error, signing, export, code, finish, problem, steps, package, apk, card
---
3155
command, bundle, rails, domaincom, ruby, executing, gemfile, server, test, development, works, rvmgemsruby, deploy, servers
---
3156
set, view, loaded, help, initializer, variables, storyboard, developer, segue, remains, instance, choose, like, id
---
3157
table, records, column, contained, regardless, words, returned, given, datetime, day, contains, following, wish, limit
---
3

---
3984
test, machinedoubleeps, function, replications, zerorange, relative, dwin, diffrangex, elements, john, meanx, elapsed, tolerance, results
---
3985
compilers, float, converting, int, multiplying, generally, gcc, multiplications, executable, general, shifts, smart, using, exponent
---
3986
keybinding, ctrl, example, like, modifiers, modifier, keys, created, key, command, way, binding, savecommand, needed
---
3987
return, containssmileystring, boolean, false, null, scontains, form, equivalently, second, complex, methods, arguably, experience, points
---
3988
doesnt, syntax, support, unsigned, double, aaa, scheduledrunid, scheduledruns, scheduledrunsid, scheduledruntransaction, scheduledschools, scheduledthreadpoolexecutor, scheduledtimerwithtimeinterval, scheduleinrunloopnsrunloop
---
3989
file, gitignore, ignored, git, bartxt, foo, directory, subdirectories, applies, ignore, want, useful, tried, exist
---
3990
new, list, iliststring, order, alphabetic, liststring, strings, insid

---
4817
schemas, service, web, generateelementproperty, im, false, using, problem, jaxbelement, wcf, file, generates, read, types
---
4818
razor, viewdata, escaped, array, using, js, possible, project, advance, serialize, expected, mvc, behaving, quot
---
4819
span, space, want, use, dont, html, tag, parent, spanspan, like, elements, css, looking, looks
---
4820
powers, campaigncategories, make, affiliategroups, campaigngroups, campaigns, uintmaxvalue, enum, multiply, number, targets, concrete, sorcery, flags
---
4821
calls, function, script, achieve, different, introduce, let, like, independent, scripts, feature, arguments, instantiate, readcfg
---
4822
table, cells, width, size, outer, currently, stretch, fixed, relative, entire, sizing, dont, way, make
---
4823
require, tell, fileexpandpath, code, includelib, ruby, unshift, does, source, repo, referencing, decided, rubygems, actionmailer
---
4824
android, testing, applications, suggestions, test, unit, mock, mocking, cases, thanks,

5650
got, xcode, certificate, iphone, new, csr, developer, error, code, build, original, identity, time, ide
---
5651
workaround, headers, section, javastyle, keyvalue, content, pairs, module, file, configparser, raises, properties, exception, parses
---
5652
im, using, certain, list, requires, image, website, id, does, like, nodejs, npm, experience, actual
---
5653
image, images, resolution, set, tineye, search, able, type, return, algorithm, compare, crawled, supply, use
---
5654
detect, processorscores, physical, correctly, number, supported, able, logical, order, flavours, linux, windows, degrades, platforms
---
5655
like, module, modules, im, allocation, memory, buffer, want, access, code, need, subset, solutions, things
---
5656
ios, seen, database, android, want, works, encrypted, windows, store, solution, plugin, party, prepopulated, plugins
---
5657
scenarios, differences, developers, dont, googled, different, examples, importance, difference, necessity, models, exactly, nice,

---
6150
dao, public, loadedgroup, class, intfc, native, sunreflectdelegatingmethodaccessorimplinvokedelegatingmethodaccessorimpljava, groupgetid, sunreflectnativemethodaccessorimpl, sunreflectnativemethodaccessorimplinvokenativemethodaccessorimpljava, object, protected, method, javalangreflectmethodinvokemethodjava
---
6151
simple, class, constructor, public, textbox, super, base, variables, way, things, extends, called, allowing, object
---
6152
line, factory, multiple, public, headers, new, im, wcf, listuri, error, summary, host, return, httpstackoverflowcomquestions
---
6153
pay, net, application, web, aspnet, software, microsoft, just, edit, payment, develop, free, make, apps
---
6154
datagrid, look, theme, aero, resourcedictionary, controls, like, wpf, toolkit, windows, im, just, need, running
---
6155
want, request, status, way, afhttpclient, reauthenticate, operation, handler, code, token, use, remove, failed, authentication
---
6156
service, binding, basichttpbinding, data, ma

7049
screen, element, window, corner, easy, screeny, possible, screenx, add, browser, left, picture, relative, specific
---
7050
alignment, type, malloc, dynamically, storage, thats, allocates, allocated, memory, widget, aligned, properly, new, objects
---
7051
like, bar, want, regex, split, instead, youre, quite, lines, expecting, doesnt, long, match, code
---
7052
username, focus, starts, activity, appear, inputmethodmanager, code, keyboard, checkfocus, boolean, guide, want, logi, checkfocususerrequestfocus
---
7053
code, difference, arm, writing, im, different, like, assembly, noticed, running, importantly, ive, fairly, behaviors
---
7054
transitive, consistent, symmetric, follow, reflexive, properties, work, dependency, harm, violating, implemented, correctly, specifies, java
---
7055
new, changes, site, git, framework, file, master, branch, merge, files, rebase, process, changed, problem
---
7056
activities, multiple, application, pages, way, close, open, remain, scheduledruntrans

---
7983
float, int, oval, public, override, void, layout, ondraw, new, hspec, tutorials, right, extends, private
---
7984
break, case, switch, stuff, foo, default, braces, matter, curly, asking, use, ago, doesnt, good
---
7985
admin, numerical, im, module, columns, django, using, display, simple, blank, id, total, custom, row
---
7986
right, change, isnt, constant, like, scenarioitemtemplate, scheduledtimerwithtimeinterval, scheduledexecutorservice, scheduledrun, scheduledrunfindparamsscheduledrunid, scheduledrunid, scheduledruns, scheduledrunsid, scheduledruntransaction
---
7987
field, javalanginteger, int, entity, use, does, options, unknown, class, nullable, type, heavy, values, present
---
7988
javascript, referenced, actual, height, css, background, use, possible, pixels, size, image, width, scheduleinrunloopnsrunloop, schedulerschedulewithfixeddelaynew
---
7989
dummy, java, books, ignored, anybody, lang, workaroundfix, commandline, help, single, example, term, quotes, arguments


default, field, message, characters, jquery, like, instead, override, plugin, validation, required, postcode, label, enter
---
8924
---
8925
nr, char, unknown, consists, chars, say, lets, string, scheduledtimerwithtimeinterval, scheduledrunfindparamsscheduledrunid, scheduledrunid, scheduledruns, scheduledrunsid, scheduledruntransaction
---
8926
file, tests, new, work, bar, foo, filewriterfile, close, sourcefromfilefilegetlinestolist, falseappend, extends, val, good, sbt
---
8927
log, messages, wrote, read, simple, app, rack, able, rackcommonlogger, file, called, sinatrarbcomintrohtml, sinatrabase, sinatra
---
8928
program, example, myprogramexe, make, console, echo, parameters, question, works, word, application, know, schedulerscheduledeferred, scheduledrun
---
8929
headers, using, precompiled, projects, like, pros, pertains, lot, large, specifically, cons, gotchas, time, save
---
8930
nodejs, applications, testing, tell, tools, experience, memory, leaks, detecting, scheduledschools, 

9736
readmostly, linker, section, myro, movl, script, init, define, code, eax, exit, main, movq, printf
---
9737
field, dot, error, insert, use, names, example, db, schedulerschedulewithfixeddelaynew, schedules, schema, scheduledrunid, scheduledruns, scheduledrunsid
---
9738
collection, cars, just, var, car, im, type, question, naming, listcar, variable, case, matter, doesnt
---
9739
class, object, practice, convert, ba, data, type, best, preserve, assume, second, defined, scheduleinrunloopnsrunloop, scheduledschools
---
9740
controller, someview, class, property, value, creating, access, view, delegate, appdelegate, reference, schedulerscheduledeferred, scheduler, schedulewithfixeddelayntimes
---
9741
err, nil, error, details, http, clientdoreq, resp, fmtprintf, authenticate, server, client, req, request, accessdenied
---
9742
implement, know, care, existing, function, package, does, exist, believe, difficult, said, path, cdirfile, write
---
9743
string, uiactivities, truncated, diffe

10649
function, functions, report, id, label, getting, bunch, predefined, generate, able, sections, pass, like, prints
---
10650
html, delay, content, fadeout, fade, complete, trick, products, fadein, replacing, little, replace, delayed, appears
---
10651
imageview, uiview, adding, uiimageview, centre, bounds, image, code, nslog, dynamicmainview, setcenterdynamicmainviewcenter, wrong, clear, new
---
10652
myelement, mylist, like, class, new, list, xml, represents, code, number, tag, add, linkedlist, achieve
---
10653
sequence, viterbi, algorithm, random, output, letters, just, words, trying, based, problem, seeing, added, observed
---
10654
bluetooth, device, keyboard, possible, type, app, limitation, iphone, android, ios, like, apis, limited, exposed
---
10655
does, equals, hood, javas, im, value, entirely, interested, parts, switch, compare, use, mainly, case
---
10656
compiled, javascript, coffescript, asyncawait, bring, feature, web, nodejs, browser, used, attempt, language, schmoe

---
11649
question, maybe, javascript, vice, asking, languages, server, clientside, mean, useful, client, edit, developers, versa
---
11650
icollection, ilist, implement, collection, method, doesnt, methods, sense, concurrentbag, dont, make, does, wont, add
---
11651
data, view, views, typeid, say, query, create, table, event, tables, members, caches, updating, used
---
11652
public, thing, string, reflection, class, method, foo, new, type, void, implicit, showthing, figure, object
---
11653
project, studio, features, bit, visual, windows, disabled, did, curiosity, missing, experience, codebehind, systemconsole, net
---
11654
new, environment, discussion, edit, said, use, variable, attach, want, used, style, good, question, inspect
---
11655
null, return, session, nulls, code, userid, method, getusersessionid, sessionid, string, check, getnewusersession, complicated, case
---
11656
ctx, var, context, taskschedulerfromcurrentsynchronizationcontext, null, use, systemwebhttpcontextcurrent

---
12649
sum, time, elapsed, ms, double, int, numpoints, runtest, code, const, numiters, var, end, include
---
12650
low, int, efficient, im, wondering, putting, versus, alternatives, numbers, rerplacing, difference, using, relevant, delta
---
12651
option, column, descr, description, columns, code, select, user, excel, selects, need, value, screenshot, drop
---
12652
ive, like, vcs, months, using, time, version, control, features, branching, ability, advance, used, ads
---
12653
uitabbaritem, badge, gets, index, applied, make, sure, uitabbarcontrollers, uitabbarcontroller, tbi, tbibadgevalue, adding, uitabbar, change
---
12654
trying, simulate, use, time, mouse, using, eclipse, app, emulator, touch, android, multitouch, im, scheduledrunsid
---
12655
subversion, svn, error, repository, history, message, url, git, path, old, module, created, cause, release
---
12656
mylistadd, mylist, liststring, contents, easy, listbox, list, strings, new, populate, using, way, world, hello
---
12657


im, testing, using, bit, tools, writing, tests, unit, mspec, good, write, lot, just, going
---
13649
doctotal, pediinvoicedetail, know, going, want, cte, value, entire, result, group, enter, update, following, inside
---
13650
work, experience, aftercreate, create, methods, dothis, doesnt, does, rails, trying, google, docs, array, putting
---
13651
attribute, lot, gitsvn, way, repositories, happens, files, svn, new, ensure, require, schedulerscheduledeferred, scheduledthreadpoolexecutor, scheduledruns
---
13652
backbone, reusing, server, solution, node, possible, using, application, resources, code, achieve, spa, rewriting, dont
---
13653
like, myvalues, new, listint, int, create, way, using, possible, linq, says, time, array, compile
---
13654
aspnet, unit, webconfig, time, sessionstate, session, specify, body, tell, thanks, miliseconds, scheduledschools, scheduledrunfindparamsscheduledrunid, scheduledrunid
---
13655
public, interface, set, property, interfaces, ihasmembers, iorderede
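
Several of the lists above end in alphabetically adjacent terms such as `scheduledrunsid` or `scheduledschools`. The most likely explanation is that those questions contain fewer distinct words than the cutoff, so `sort_values(...).head(30)` pads the list with terms whose count is 0. A minimal sketch of one way to avoid this, on a made-up matrix (the `top_terms` helper is hypothetical and not part of the original notebook):

```python
import pandas as pd

# Toy term-by-question matrix in the same layout as df_cv
# (rows: terms, columns: questions)
toy = pd.DataFrame(
    {0: [2, 0, 1, 0], 1: [0, 0, 0, 3]},
    index=["json", "aaa", "file", "git"],
)

def top_terms(column, n=30):
    """Return up to n (term, count) pairs, skipping terms that do not occur."""
    present = column[column > 0]
    top = present.sort_values(ascending=False).head(n)
    return list(zip(top.index, top.values))

top_dict = {c: top_terms(toy[c]) for c in toy.columns}
print(top_dict)
```

Short questions then simply get shorter lists instead of being filled with unrelated vocabulary.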

In [6]:
# repeat the process for the tf-idf document-term matrix
top_dict_tfidf = {}
for c in df_tfidf.columns:
    top = df_tfidf[c].sort_values(ascending=False).head(30)
    top_dict_tfidf[c] = list(zip(top.index, top.values))

top_dict_tfidf

{0: [('arraysize', 0.5278445485693687),
  ('datac', 0.39588341142702654),
  ('sum', 0.3577145535997552),
  ('data', 0.1815692342610622),
  ('int', 0.17446696918402244),
  ('unsigned', 0.16655725970467167),
  ('faster', 0.14889949877950148),
  ('runs', 0.14333036672831964),
  ('stdsortdata', 0.13196113714234217),
  ('elapsedtime', 0.13196113714234217),
  ('loop', 0.1317481591470164),
  ('long', 0.1186571069769755),
  ('start', 0.11372804101144657),
  ('stdendl', 0.09763393214700793),
  ('thought', 0.09574743812087676),
  ('stdcout', 0.09153821653454555),
  ('sorting', 0.09090080129809036),
  ('include', 0.09031491689789034),
  ('sorted', 0.0857892977753784),
  ('array', 0.0836750802485507),
  ('code', 0.0811676422931402),
  ('primary', 0.07672691286161575),
  ('random', 0.07640698073021791),
  ('seconds', 0.07314938846777747),
  ('generate', 0.06636257139723833),
  ('clockt', 0.06598056857117109),
  ('stdrand', 0.06598056857117109),
  ('dataarraysize', 0.06598056857117109),
  ('arraysso

In [7]:
# print the top 14 tf-idf-weighted words in each question
for question, top_words in top_dict_tfidf.items():
    print(question)
    print(', '.join([word for word, weight in top_words[0:14]]))
    print('---')

0
arraysize, datac, sum, data, int, unsigned, faster, runs, stdsortdata, elapsedtime, loop, long, start, stdendl
---
1
json, textxjavascript, textxjson, applicationxjavascript, purported, hurt, targeted, gather, varying, mime, messing, pushing, standards, type
---
2
comments, json, inside, file, use, powersave, powers, powerpoint, powerpc, powerof, powermockitoverifystatic, powermockitomockstaticloggerfactoryclass, powermock, powershellexe
---
3
myobject, newuri, privmsg, ircevent, regex, follows, var, method, http, remove, property, say, end, best
---
4
nullpointerexception, object, null, avoid, someobjectdocalc, consequence, necessity, someobject, unreadable, test, idiom, ugly, alternative, tests
---
5
suit, enum, dosomethingsuit, enumerateallsuitsdemomethod, compile, spades, clubs, diamonds, hearts, enumerate, public, keyword, foreach, fails
---
6
nodejs, lets, considering, web, javascript, run, general, nodepad, ajaxlike, mighty, stuff, problems, bitly, amazons
---
7
apply, funccal

886
camera, picture, intent, app, putextraextraoutput, mediastoreactionimagecapture, extraoutput, ok, actionimagecapture, button, just, locks, stays, works
---
887
typedefs, pointers, nots, functions, dos, understand, peoples, numerical, recall, stumped, tips, took, ago, thoughts
---
888
wikipedia, pages, monthly, statistics, number, edited, displays, blog, activity, allow, api, information, like, script
---
889
callback, parameters, params, object, function, functionparameters, asdf, structured, optional, containing, integer, required, string, example
---
890
purchase, customer, customers, table, denormalize, date, tablecolumn, id, purchases, performancewise, customerid, itemid, beneficial, belongs
---
891
api, users, version, returns, redirect, scope, latest, paramspath, apiasdfusers, vapiversion, stripe, apiusers, apiversion, end
---
892
nsjsonserialization, json, data, parse, structure, jsonobjectwithdata, nsjsonreadingmutablecontainers, id, jsonencode, excbadaccess, array, diction

1719
prepared, values, statements, insert, rows, multiple, query, readings, tbl, pdo, inserting, queries, security, know
---
1720
homescreen, login, facebook, screens, pressed, button, finish, sign, prevent, activity, intentflagactivitynohistory, loginsignup, return, onbackpressed
---
1721
vs, sizeofbuffer, snprintfbuffer, nonstandards, snprintf, include, shocked, unfortunate, somestring, plans, stdlibh, compliant, stdioh, identifier
---
1722
clone, huserservergitreposmyprojectgit, git, myproject, bare, path, cd, machine, gitrepos, answer, destination, logged, fatal, decided
---
1723
guavalibraries, googlecollections, guava, maven, features, repository, adding, library, looks, looking, like, powershellexe, powershell, powersave
---
1724
mysql, server, office, remote, waittimeout, process, halfway, dies, files, report, gone, heard, processing, away
---
1725
range, loop, python, doesnt, trying, work, aaa, powermockitoverifystatic, powers, powerpoint, powerpc, powerof, powermock, powermoc

threads, cleanupbeforeexit, myconsoleprogramonexit, event, program, termination, cleanup, monitor, terminate, triggered, closing, conditions, connections, handles
---
2553
blocker, event, pageid, dialogue, epreventdefault, solve, click, issue, opens, dialog, page, href, status, popoup
---
2554
search, tip, grep, maybe, quickest, speeding, quicker, organize, centos, text, modes, files, fashion, lowercase
---
2555
onetomanyfield, phonenumbers, modelsonetomanyfieldphonenumber, relationship, phonenumber, dude, onetomany, im, django, numbers, listphonenumber, phonenumbermodelsmodel, dudemodelsmodel, businessmodelsmodel
---
2556
video, chrome, downloads, sorry, loading, time, wallpaper, bandwidth, vote, videos, hey, miss, heavy, looping
---
2557
mtextpaint, isnmultilinentext, canvasdrawtexttext, ondraw, cryptic, text, paul, multiline, breaks, hopefully, canvas, seeing, pointers, quick
---
2558
svn, repository, separate, msys, currenty, httpcodegooglecompmsysgitdownloadslist, git, command, gi

serializable, matter, general, exactly, mean, java, class, does, powerof, powersave, powers, powerpoint, powerpc, powermock
---
3386
sid, coming, checking, exist, vdatabase, vthread, query, current, select, database, table, view, oracle, error
---
3387
threads, quadcore, multicore, stealing, point, simultaneously, time, computer, wouldnt, machine, thought, multiple, whats, running
---
3388
pgadmin, tables, visually, postgresql, utilizes, actually, forgive, phpmyadmin, utility, individual, possibly, piece, software, fields
---
3389
cookies, cookie, thischkremembermechecked, willl, httpcookietxtusernametext, emailid, thischkrememberme, responsecookiesaddcookie, cookieexpiresaddyears, txtpasswordtext, store, httpcookie, rememberme, passwords
---
3390
church, internal, locations, using, false, occurred, error, dynamic, compiler, funcchurch, xytrue, testapplication, funcdynamic, errorreport
---
3391
string, arrayofstring, org, httpwww, dictionary, collectionsee, store, xmlnsxsd, xmlschema, 

---
4219
node, children, template, root, poster, suggests, achieved, discussion, nodes, render, production, tree, passed, django
---
4220
innerhtml, divs, trigger, img, src, attribute, changed, event, lets, custom, say, change, way, powermanagernewwakelockpowermanagerscreenbrightwakelock
---
4221
createstring, actionresult, title, httpposthttpget, acceptverbshttpverbspostacceptverbshttpverbsget, acceptverbshttpverbspost, decorate, httppost, public, attributes, action, different, posxe, powersave
---
4222
rhtml, htmlerb, erb, difference, powerof, powershell, powersave, powers, powerpoint, powerpc, powermockitoverifystatic, powershells, powermockitomockstaticloggerfactoryclass, powermock
---
4223
literal, value, word, strings, context, mean, difference, values, used, does, postxml, powervrs, powervr, powerups
---
4224
startinfouseshellexecute, processstartinfo, true, startinfoworkingdirectory, correctionprocesswaitforexit, processexited, processexitedobject, rawdatafilename, processstart

---
5052
classme, div, coders, pageload, aspx, hoping, dynamically, follows, css, html, add, possible, thanks, id
---
5053
menu, tablets, game, options, button, ics, menus, bar, actionbar, action, deal, targetsdk, openoptionsmenu, lacked
---
5054
lengthsize, ntext, varchar, cast, bytes, longer, column, sql, data, thanks, aaa, powerof, powerpoint, powerpc
---
5055
arrb, arra, object, xaml, simplest, wpf, arrays, parse, json, display, function, string, example, powershellexe
---
5056
datetime, date, select, starttime, picker, calendar, filters, comparing, like, row, mysql, stored, strings, suggestions
---
5057
slider, valuechanged, event, dispatchertimer, sliding, discusses, subclassing, invokes, seek, timespan, dragging, invoking, triggers, user
---
5058
tool, directly, library, bit, test, powermockitoverifystatic, powersave, powers, powerpoint, powerpc, powerof, powermockitomockstaticloggerfactoryclass, powershellexe, powermock
---
5059
persistent, stores, core, multiple, data, nspersi

5885
namespace, mynamespace, somevalue, define, statement, correct, restricted, unsigned, const, saying, thing, int, following, poxweb
---
5886
psome, div, textp, solid, border, red, snippet, css, somethingelse, snippets, identical, following, class, jquery
---
5887
configurations, eclipse, im, arguments, argument, eclipseargumentstxt, extras, output, esp, project, executables, run, stores, commandline
---
5888
byte, static, public, mservicewriteprintercommandsfeedline, int, image, printer, bitset, pixel, offset, bitmap, im, mservicewriteprintercommands, selectbitimagemode
---
5889
stamps, graphical, easiest, extract, ubuntu, need, binary, know, tool, linux, actually, difference, command, line
---
5890
vertices, graph, edgeobjectvertices, gaddedgenew, vertex, verticesi, zoomzoomcontrol, dockpanel, edges, layoutalgorithmtype, edgeobjectverticesi, iedgeobject, fsa, creategraphtovisualize
---
5891
join, hql, relationship, asomebsome, left, select, beans, workarounds, add, use, outer, crit

6718
syntax, xhtml, html, gain, going, reason, compatibility, years, switch, applicationxhtmlxml, finiky, nonxml, mimetype, comfortable
---
6719
sillydb, class, construct, getconnection, inside, private, say, singleton, instantiated, function, meant, instantiate, docs, allowed
---
6720
phpinfo, thisprojectdevinfophp, infophp, browser, practical, ubuntu, isnt, write, type, create, time, question, thanks, file
---
6721
commit, git, fusion, fyi, vim, type, vm, opens, editor, hit, happening, keys, bash, creates
---
6722
bufferedstream, httpwebresponsestatuscode, requestlength, throughput, httpwebresponse, threads, webrequest, tool, stream, using, response, servicepointmanagerdefaultconnectionlimit, stwriterequest, binaryreaderbufferedstream
---
6723
picture, roll, pitch, degrees, yaw, sitting, rotation, reads, table, home, suppose, left, reading, resting
---
6724
hall, contexthallsaddorupdate, french, japanese, hid, german, var, id, updatedatabase, new, rows, management, public, console
--

7596
stringvalue, geselecteerd, beantwoord, value, niet, combobox, public, selected, stringvalueattribute, intalsenumparsetypeofals, stringvalueattributestring, stringcboalsselectedvalue, nietbeantwoord, nietgeselecteerd
---
7597
video, videoview, buffered, activity, starts, portion, minutes, button, resumed, current, onpause, buffering, seek, onresume
---
7598
maxfilesize, submit, input, hidden, supposed, type, filesuploadsize, byes, maxuploadsize, preceed, field, file, impose, form
---
7599
stdendl, stdcout, ownership, int, boostsharedptrt, xusecount, yusecount, xnew, boostsharedptrint, yget, xget, shares, yx, unusual
---
7600
floatmaxvalue, floatingpoint, overflow, wonder, types, explain, returns, true, powermock, powermockitomockstaticloggerfactoryclass, powermockitoverifystatic, powermanagerservice, powershell, powerof
---
7601
li, ul, script, dropdown, asdsali, loansli, loansa, lipayday, licontactli, liaboutli, dropdowntoggledropdown, classdropdown, classnav, jsbootstrapbootstrap

8440
tinymce, textarea, space, ocuppy, textareastyleheight, enables, resize, set, rendering, area, editor, stop, js, parent
---
8441
intent, emailintentsettype, htc, mail, works, send, emailintentputparcelablearraylistextra, intentextrastream, appliationoctetstream, intentactionsend, androidcontentintentactionsendmultiple, thunderbolt, shows, client
---
8442
linker, forceload, allload, flag, xcode, syntax, parameter, symbols, lib, party, hand, project, application, libfile
---
8443
stream, settingsstring, null, localencoding, string, buffer, writer, filled, new, encoding, settings, ifwriter, streamreadbuffer, streamwriterstream
---
8444
sms, intent, phone, vndandroiddirmmssms, intentt, intenttputextra, intenttputextraintentextratext, intenttsetdatauriparse, intenttsettype, datsmsto, actandroidintentactionsendto, contextstartactivityintentt, cmpcomandroidmmsuicomposemessageactivity, intentintentactionview
---
8445
requests, getting, ajax, batched, independantly, queued, newbie, necessar

invoice, invoices, fields, displays, thinking, parameters, companypkinvoicenopk, nooflines, invoicevalue, invoicecompanyinvoicenoinvoicelineno, invoicecompanyinvoiceno, invoicecompany, invoiceline, distinguishing
---
9385
maincpp, error, ctestg, tdm, stdthread, thread, lpthread, tjoin, tfoo, foon, include, stdc, main, mingw
---
9386
assembly, cool, extensions, extension, define, properties, defined, written, methods, use, powermanagernewwakelockpowermanagerscreenbrightwakelock, powermanagerservice, powermock, powermockitomockstaticloggerfactoryclass
---
9387
patterns, design, discussions, python, java, equally, apparently, follow, apply, head, bad, examples, writing, thing
---
9388
thingshandler, thingeditor, callback, alert, sloth, humbly, thingeditors, needs, motivated, apologise, inevitable, favour, ignorance, thing
---
9389
ipad, connecting, mobile, operator, mac, dummynet, downgrading, httpwwwmanageuscom, problem, error, suspecting, simulating, bandwidth, uk
---
9390
appsconvertal

---
10274
legend, fieldset, gap, span, border, position, legendspanfoospanlegend, luckily, ive, concurrently, fixes, solution, tiny, wrap
---
10275
bitmapfactoryoptions, outofmemoryerror, practice, catch, optionsinsamplesize, options, bitmapfactorydecodefilefile, catching, eprintstacktrace, bitmap, reduce, usage, ways, memory
---
10276
lines, draw, trails, image, mountain, draws, user, drawing, bunch, save, basic, basically, allow, path
---
10277
guice, bar, foo, somebarfooerimplementationfoo, public, somemodule, barfooer, barfooerfoothatbar, thisfoo, thisbar, injector, syntactic, sugar, void
---
10278
arch, iprefixinclude, src, export, cpp, os, pkgconfigpath, wstrictprototypes, sitepackagesnumpycoreinclude, lprefixlib, ilibrarypython, iusrlocalinclude, mplbuild, makeosx
---
10279
cstdlib, include, stdlibh, standards, std, contained, coding, namespace, style, writing, reason, difference, code, use
---
10280
singleton, pattern, design, programming, language, implement, does, powerof, po

jar, fileset, classpath, cappljavacommon, header, jre, files, path, dir, libdir, glazedlists, javacclasspath, dirlibext, rearchitect
---
11218
headers, copy, phase, private, section, header, mipadi, xcodes, difference, project, roles, want, public, explains
---
11219
deadlock, command, need, stdout, lot, minutes, output, execute, popenwait, popencommunicate, subprocesspipe, httpbugspythonorg, subprocesspopen, pass
---
11220
jar, files, relation, class, compiled, dynamically, necessary, package, load, directory, program, write, read, java
---
11221
stringnpos, somename, sep, ifsep, namefindfirstof, modifynamestring, namesep, modifynamename, testrtl, wrongly, includeiostream, sizet, matched, endl
---
11222
applications, production, deployment, virtualenv, process, run, sitepackages, development, server, works, weve, thats, deploy, command
---
11223
configdata, php, lack, keyword, messages, registrygetconfigdata, configgetdata, configreloaddata, error, configincphp, getdata, static, sat, 

12157
uiimage, image, path, imagepath, occasion, imagebyscalingandcroppingforsizecgsizemake, assetslibraryassetassetjpgid, imagenamedpath, extjpg, imagewithcontentsoffilepath, setimageimage, asset, imageview, temp
---
12158
asked, gmp, factorial, arbitrarily, interview, forums, digits, obtain, method, calculate, places, accomplish, searched, various
---
12159
page, variables, pagespecific, ive, templating, flask, colors, requested, ideal, templates, tutorials, decided, online, lots
---
12160
yaxis, labels, numeric, text, lowlowmediumhighvery, bands, highcharts, values, instead, puts, like, plot, graph, somewhat
---
12161
error, print, zerodivisionerror, yep, python, caught, works, skool, try, purpose, old, needed, exception, value
---
12162
https, magento, pages, ifserverhttps, clicky, tracker, identifying, installing, enabled, website, suggestions, great, php, help
---
12163
timeout, operation, client, wcf, proxyoperationruntime, servicemodeladdressinganonymous, service, httpschemasmi

---
12998
lowercase, settimestylensdateformattershortstyle, detailstimeformatter, pm, apples, returned, according, style, documentation, values, time, doesnt, set, following
---
12999
rails, curl, thirdparty, preferred, json, cronjobs, triedandtrue, xml, contextual, url, instincts, nightly, expertise, packaged
---
13000
func, badfunc, comments, strings, todo, doc, code, def, fix, prefixing, python, miss, disadvantages, mainly
---
13001
include, statex, unix, fp, define, ifoswindows, oswindows, defined, flag, windows, targeting, char, sizeof, segment
---
13002
resultset, eclipse, watch, coming, showing, plugin, world, fields, exist, expected, properties, studio, visual, shows
---
13003
schema, dotnetconfigxsd, complained, sections, webconfig, config, normal, said, making, expected, stuff, standard, custom, element
---
13004
statecd, plot, actualvalue, lattice, predictedvalue, xvalue, pg, plotdd, xyplotpredictedvalue, printpg, geompointshape, aesxvalue, optsaspectratio, datadd
---
13005


ulongs, java, longs, porting, unsigned, values, bytes, attain, marshalcopy, memmove, thing, portability, requesting, indices
---
13884
service, androidname, androidaccountsaccountauthenticator, androidexported, authenticator, authentication, intentfilter, sync, services, attribute, xmlauthenticator, authenticatorauthenticationservice, androidresource, gingerbread
---
13885
days, todaydaydate, todaymonthdate, todayyeardate, todays, minus, month, year, day, regular, random, need, choose, ill
---
13886
slower, java, compiled, faster, weakly, spectrum, javascript, technology, pre, fly, interpreter, optimal, prior, typed
---
13887
sql, quickest, table, kinds, dummy, wide, varchar, fields, thank, testing, performance, bit, int, server
---
13888
lock, bounded, implement, inelegant, acquire, multithreading, polling, preserve, inefficient, critical, guarantee, waiting, timeout, arbitrary
---
13889
testphp, normally, included, include, test, tried, file, just, powermock, powerpoint, powerpc, pow

**NOTE**: In both dictionaries of top words, many terms that look like concatenations of two or more words, or that don't appear to be actual words, have a score of 0 beside them in the document-term matrices. Not every term that scores 0 is meaningless, but we can filter out many of them by term frequency and weight. We'll first add the top 30 terms from each question to a list, and then remove the less common terms that are not actual words.

In [8]:
# add the top 30 terms from each question from the dictionary to a list
words_cv = []
for question in df_cv.columns:
    top = [word for (word, count) in top_dict_cv[question]]
    for t in top:
        words_cv.append(t)
        
words_cv

['data',
 'arraysize',
 'sum',
 'int',
 'datac',
 'code',
 'unsigned',
 'start',
 'runs',
 'long',
 'faster',
 'loop',
 'include',
 'thought',
 'array',
 'just',
 'seconds',
 'stdcout',
 'new',
 'stdendl',
 'sorting',
 'stdsortdata',
 'primary',
 'public',
 'random',
 'sorted',
 'elapsedtime',
 'generate',
 'test',
 'main',
 'json',
 'know',
 'like',
 'type',
 'id',
 'seen',
 'similar',
 'anybody',
 'properly',
 'mime',
 'applicationjson',
 'ive',
 'best',
 'purported',
 'question',
 'varying',
 'pushing',
 'correct',
 'returned',
 'browser',
 'time',
 'start',
 'doing',
 'gather',
 'issues',
 'slightly',
 'targeted',
 'answer',
 'content',
 'textjavascript',
 'inside',
 'comments',
 'file',
 'json',
 'use',
 'aaa',
 'scheduledthreadpoolexecutor',
 'scheduledrun',
 'scheduledrunfindparamsscheduledrunid',
 'scheduledrunid',
 'scheduledruns',
 'scheduledrunsid',
 'scheduledruntransaction',
 'scheduledschools',
 'scheduledtimerwithtimeinterval',
 'scheduledataprojecttasksprojecttaskindex'

In [9]:
# aggregate the list and identify the most common words along with how many questions they occur in
most_common_cv = Counter(words_cv).most_common()

In [10]:
len(most_common_cv)

34536
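
To make the aggregation step concrete, here is a minimal sketch on toy data. The dictionary `toy_top_dict` is hypothetical and only stands in for `top_dict_cv`. Assuming each question lists a word at most once among its top terms, the count that `Counter` reports is the number of questions in which that word is a top term.

```python
from collections import Counter

# hypothetical stand-in for top_dict_cv: question id -> [(word, count), ...]
toy_top_dict = {
    0: [("json", 4), ("file", 2), ("arraysize", 1)],
    1: [("json", 3), ("comments", 2), ("file", 1)],
    2: [("java", 5), ("json", 1), ("null", 1)],
}

# flatten the top words of every question into one list
toy_words = [word for top in toy_top_dict.values() for word, _ in top]

# count how many questions list each word among their top terms
toy_most_common = Counter(toy_words).most_common()
print(toy_most_common[:2])  # [('json', 3), ('file', 2)]
```

Terms near the end of the `most_common()` list are the ones that only a single question contributes.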

In our list, the least common terms (i.e., those that occur just once) are overwhelmingly not real words. Let's look at them and then remove them.

In [11]:
most_common_cv[-100:]

[('developerplatformsiphonesimulatorplatformdeveloperusrbinllvmgcc', 1),
 ('usersyariksmirnovdesktopgoozybranchesnew', 1),
 ('quartzcore', 1),
 ('usersyariksmirnovlibrarydeveloperxcodederiveddatagoozzycugjuvvsrzjqwvfiicxtykbqaguxbuildintermediatesgoozzybuilddebugiphonesimulatorgoozzybuildobjectsnormal',
  1),
 ('objcabiversion', 1),
 ('diphoneosversionminrequired', 1),
 ('developerplatformsiphonesimulatorplatformdeveloperusrbindeveloperusrbinusrbinbinusrsbinsbin',
  1),
 ('pausing', 1),
 ('unpausing', 1),
 ('nas', 1),
 ('comyokiandroidcat', 1),
 ('getapplicationcontextgetresources', 1),
 ('resid', 1),
 ('getidentifier', 1),
 ('aboutemailprompt', 1),
 ('readfd', 1),
 ('bufreserven', 1),
 ('vectorresize', 1),
 ('bufinsertbufend', 1),
 ('vecveccapacity', 1),
 ('whileinread', 1),
 ('responsesetheader', 1),
 ('abstractoutputbufferdowriteabstractoutputbufferjava', 1),
 ('scalagroovy', 1),
 ('useable', 1),
 ('utcdatetime', 1),
 ('datetimekind', 1),
 ('serializeprototthis', 1),
 ('serializedes

In [12]:
# if fewer than two of the questions have it as a top word, add it to the list of extra stop words
add_stop_words_cv = [word for word, count in most_common_cv if count < 2]
add_stop_words_cv

['arraysize',
 'datac',
 'stdsortdata',
 'elapsedtime',
 'ircevent',
 'newuri',
 'privmsg',
 'someobjectdocalc',
 'consequence',
 'spades',
 'clubs',
 'diamonds',
 'hearts',
 'amazons',
 'eventbased',
 'aclickfunction',
 'mailed',
 'mulsd',
 'movapd',
 'reduces',
 'movsd',
 'addsd',
 'jnj',
 'edx',
 'esi',
 'readfilefile',
 'deleteafter',
 'sizeofstruct',
 'inte',
 'usrincludelinuxkernelh',
 'whereever',
 'buildbugonnulle',
 'buildbugonzeroe',
 'tapping',
 'readfilestring',
 'bird',
 'prototypically',
 'prototypal',
 'typeofint',
 'monitored',
 'critique',
 'aonea',
 'sausage',
 'harrowprint',
 'samvermettes',
 'samvermette',
 'foocontroller',
 'consolesvisual',
 'consolelogmass',
 'scopeemitsomeevent',
 'scopeonsomeevent',
 'secondctrlscope',
 'paw',
 'paws',
 'toes',
 'maximums',
 'dbcustomers',
 'ity',
 'custs',
 'sysstderr',
 'sysstderrwritespamn',
 'inunion',
 'tailinunionu',
 'unionnode',
 'inuniont',
 'sizeofu',
 'templateid',
 'buckets',
 'prefixes',
 'stackclasst',
 'stackt',


Great! We'll repeat this process once more for the TF-IDF document-term matrix and see what words are included in the stopwords list.

In [13]:
# add the top 30 terms from each question in the TF-IDF dictionary to a list
words_tfidf = []
for question in df_tfidf.columns:
    top = [word for (word, count) in top_dict_tfidf[question]]
    for t in top:
        words_tfidf.append(t)
        
words_tfidf

['arraysize',
 'datac',
 'sum',
 'data',
 'int',
 'unsigned',
 'faster',
 'runs',
 'stdsortdata',
 'elapsedtime',
 'loop',
 'long',
 'start',
 'stdendl',
 'thought',
 'stdcout',
 'sorting',
 'include',
 'sorted',
 'array',
 'code',
 'primary',
 'random',
 'seconds',
 'generate',
 'clockt',
 'stdrand',
 'dataarraysize',
 'arrayssortdata',
 'miraculously',
 'json',
 'textxjavascript',
 'textxjson',
 'applicationxjavascript',
 'purported',
 'hurt',
 'targeted',
 'gather',
 'varying',
 'mime',
 'messing',
 'pushing',
 'standards',
 'type',
 'applicationjson',
 'slightly',
 'id',
 'textjavascript',
 'security',
 'know',
 'rest',
 'issues',
 'anybody',
 'returned',
 'properly',
 'browser',
 'theres',
 'api',
 'support',
 'seen',
 'comments',
 'json',
 'inside',
 'file',
 'use',
 'powersave',
 'powers',
 'powerpoint',
 'powerpc',
 'powerof',
 'powermockitoverifystatic',
 'powermockitomockstaticloggerfactoryclass',
 'powermock',
 'powershellexe',
 'powermanagerservice',
 'powermanagernewwakelo

In [14]:
most_common_tfidf = Counter(words_tfidf).most_common()

In [15]:
len(most_common_tfidf)

65298

It looks like we have the same problem in our list with words from the TF-IDF object. Again, the least common terms are the ones that are usually not words. Let's remove them here, as well.

In [16]:
most_common_tfidf[-100:]

[('combartholemauditrecord', 1),
 ('auditrecord', 1),
 ('orgspringframeworkbeansfactorybeancreationexception', 1),
 ('orgspringframeworkbeansbeaninstantiationexception', 1),
 ('webinfapplicationcontextxml', 1),
 ('javasecurityprivilegedactionexception', 1),
 ('tellus', 1),
 ('massa', 1),
 ('duis', 1),
 ('nisi', 1),
 ('pharetra', 1),
 ('venenatis', 1),
 ('mauris', 1),
 ('siteasuspendisse', 1),
 ('phasellus', 1),
 ('erat', 1),
 ('tempus', 1),
 ('sapien', 1),
 ('feugiat', 1),
 ('hrefhttpsomesitecomsome', 1),
 ('httpwwwcodeprojectcomkbwpfhtmltextblockaspx', 1),
 ('interdum', 1),
 ('quam', 1),
 ('molestie', 1),
 ('lacus', 1),
 ('praesent', 1),
 ('nisl', 1),
 ('adel', 1),
 ('scrollvsetverticalscrollbarpolicyjscrollpaneverticalscrollbaralways', 1),
 ('frameaddscroll', 1),
 ('framesetresizablefalse', 1),
 ('framesetsize', 1),
 ('framesetvisible', 1),
 ('textareasetvisibletrue', 1),
 ('textareasetsize', 1),
 ('boutros', 1),
 ('textareasetlinewraptrue', 1),
 ('scrollh', 1),
 ('scrollhsethorizont

In [17]:
# use the TF-IDF counts here; words that are a top term in fewer than two questions become stop words
add_stop_words_tfidf = [word for word, count in most_common_tfidf if count < 2]
add_stop_words_tfidf


We can now update our original DataFrame and our document-term matrices with the new list of stop words.

In [18]:
# read in the dataframe
df = pd.read_pickle('df.pkl')

# add new stop words
stop_words_cv = text.ENGLISH_STOP_WORDS.union(add_stop_words_cv)

# create a column with cleaned text
df['tokenized_word'] = df['body'].apply(word_tokenize)
df['removed_stop_words'] = df['tokenized_word'].apply(lambda x: [word for word in x if word not in stop_words_cv])
df['no_stop_words'] = df['removed_stop_words'].apply(lambda x: ' '.join(x))
df.drop(columns=['tokenized_word', 'removed_stop_words'], inplace=True)

# pickle the updated dataframe
df.to_pickle("df_stop.pkl")

In [19]:
df.head()

Unnamed: 0,id,title,body,answer_count,favorite_count,score,tags,view_count,reputation,no_stop_words
0,11227809,Why is processing a sorted array faster than a...,here is a piece of c code that seems very pecu...,13,7317.0,14772,java c++ performance optimization branch-predi...,805490,1,piece c code peculiar strange reason sorting d...
1,477816,What is the correct JSON content type?,ive been messing around with json for some tim...,29,1089.0,6768,json content-type,1403837,95,ive messing json time just pushing text anybod...
2,244777,Can I use comments inside a JSON file?,can i use comments inside a json file if so how,39,378.0,3437,json comments,631045,25,use comments inside json file
3,208105,How do I remove a property from a JavaScript o...,say i create an object as follows var myobject...,13,539.0,2891,javascript object-properties,865544,16,say create object follows var myobject method ...
4,271526,Avoiding != null statements,the idiom i use the most when programming in j...,49,1083.0,2499,java nullpointerexception null,737912,369,idiom use programming java test object null us...
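
As a rough illustration of what the `no_stop_words` column contains, the sketch below applies the same idea to a single question body. It is a simplification under stated assumptions: `base_stop_words` is a tiny hypothetical stand-in for `text.ENGLISH_STOP_WORDS`, `extra_stop_words` is a hypothetical stand-in for `add_stop_words_cv`, and `str.split` replaces `word_tokenize`.

```python
# hypothetical, tiny stand-in for text.ENGLISH_STOP_WORDS
base_stop_words = frozenset({"i", "a", "can", "if", "so", "how"})

# hypothetical stand-in for add_stop_words_cv
extra_stop_words = ["arraysize", "datac"]

# union() accepts any iterable and returns a new frozenset
stop_words = base_stop_words.union(extra_stop_words)

body = "can i use comments inside a json file if so how"

# simplification: whitespace split instead of nltk's word_tokenize
tokens = body.split()
no_stop_words = " ".join(token for token in tokens if token not in stop_words)

print(no_stop_words)  # use comments inside json file
```

Keeping the stop words in a set makes each membership test cheap, which matters when the filter runs over every token of every question.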


In [20]:
# recreate the document-term matrix
# (recent scikit-learn releases expect stop_words as a list and use get_feature_names_out())
cv_stop = CountVectorizer(stop_words=list(stop_words_cv))
cv_wm = cv_stop.fit_transform(df.body)
df_stop_cv = pd.DataFrame(cv_wm.toarray(), columns=cv_stop.get_feature_names_out())
df_stop_cv.index = df.index

# pickle for later use
df_stop_cv.to_pickle("df_stop_cv.pkl")
with open("cv_stop.pkl", "wb") as outfile:
    pickle.dump(cv_stop, outfile)

In [21]:
# repeat the process for the tfidf dtm, using TfidfVectorizer rather than CountVectorizer
stop_words_tfidf = text.ENGLISH_STOP_WORDS.union(add_stop_words_tfidf)

tfidf_stop = TfidfVectorizer(stop_words=list(stop_words_tfidf))
tfidf_wm = tfidf_stop.fit_transform(df.body)
df_stop_tfidf = pd.DataFrame(tfidf_wm.toarray(), columns=tfidf_stop.get_feature_names_out())
df_stop_tfidf.index = df.index

# pickle for later use
df_stop_tfidf.to_pickle("df_stop_tfidf.pkl")
with open("tfidf_stop.pkl", "wb") as outfile:
    pickle.dump(tfidf_stop, outfile)