### Data Cleaning & Preparation
We now have a sense of what the data looks like. The visualization is consistent with the source: these are comments on Wikipedia article edits, so words like 'talk', 'article', and 'think' are common. Next, we will:

- Do named-entity recognition
- Remove common stop words
- Make letters lowercase
- Include custom stop words (e.g. 'article') as necessary

These steps prepare the text for CountVectorizer feature extraction.

In [5]:
# Extend scikit-learn's built-in English stop words with corpus-specific ones.
# .union() is used because ENGLISH_STOP_WORDS is a frozenset:
# https://stackoverflow.com/questions/26826002/adding-words-to-stop-words-list-in-tfidfvectorizer-in-sklearn
# list() is applied because recent scikit-learn versions expect stop_words to be a list.
my_stop_words = list(ENGLISH_STOP_WORDS.union([
    'don', 'wikipedia', 'like', 'article', 'page', 'wiki', 'edit', 'edits',
    'did', 'maybe', 'talk', 'talks', 'just', 'people', 'know', 'information',
    'removed', 'look', 'use', 'section', 'user', 've', 'utc', 'think']))

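minwords = 50     # min_df: keep terms that appear in at least 50 documents
maxwords = 16000  # max_df: drop terms that appear in more than 16,000 documents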

vectorizer = CountVectorizer(min_df = minwords, max_df = maxwords,
                             stop_words = my_stop_words, lowercase = True, token_pattern="\\b[a-z][a-z]+\\b")

# Limit the data set to the first 100,000 rows for efficiency
limit = 100000

cleaned = vectorizer.fit_transform(df['comment_text'][0:limit])

# Vocabulary size; in scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
len(vectorizer.get_feature_names())

5622
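
To make the effect of these settings concrete, here is a minimal sketch on a toy corpus. The sentences, the `toy_` names, and the shortened stop-word list are hypothetical and are not taken from the Wikipedia comments data. The settings mirror the ones above at a smaller scale.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical toy corpus (assumption: not part of the original data set)
toy_docs = [
    "The article needs 2 more sources, I think.",
    "Sources matter, but they were removed from the Talk page!",
    "Reliable sources matter; citations matter too.",
]

# Built-in stop words plus a few custom ones, converted to a list
toy_stop_words = list(ENGLISH_STOP_WORDS.union(["article", "talk", "page", "think"]))

toy_vectorizer = CountVectorizer(
    min_df=2,                          # integer: term must occur in at least 2 documents
    stop_words=toy_stop_words,
    lowercase=True,                    # "Sources" and "sources" become the same token
    token_pattern=r"\b[a-z][a-z]+\b",  # alphabetic tokens of length >= 2, so "2" and "i" are dropped
)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)                 # ['matter', 'sources']
print(toy_counts.toarray())  # one row per document, one column per kept term
```

Only 'matter' and 'sources' survive: every other token is either a stop word, rejected by the token pattern, or found in fewer than two documents.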

### Topic Modelling
We fit a Latent Dirichlet Allocation (LDA) model with 20 topics to the document-term matrix and visualize the results with pyLDAvis.

In [6]:
lda_model_final = LatentDirichletAllocation(n_components=20,  # Number of topics
                                            learning_method='online',
                                            random_state=0,
                                            n_jobs=-1  # Use all available CPUs
                                            )
lda_output_final = lda_model_final.fit_transform(cleaned)
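
Before turning to the interactive visualization, it can help to look at the highest-weight words in each topic. The following is a minimal sketch on a hypothetical toy corpus. The helper `top_words_per_topic` and all `toy_` names are illustrative assumptions and are not part of the original notebook. The same helper could be applied to `lda_model_final` and `vectorizer`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def top_words_per_topic(model, feature_names, n_top=5):
    """Return the n_top highest-weight words for each fitted LDA topic."""
    topics = []
    for weights in model.components_:  # one row of word weights per topic
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics.append([feature_names[i] for i in top_idx])
    return topics


# Hypothetical toy corpus (assumption: not part of the original data set)
toy_docs = [
    "vandalism reverted blocked vandalism warning",
    "blocked warning vandalism reverted",
    "sources citation reliable sources reference",
    "reference citation reliable sources",
]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                    random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)  # rows are per-document topic mixtures

# vocabulary_ maps term -> column index; sorting by index recovers column order
feature_names = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

toy_topics = top_words_per_topic(toy_lda, feature_names, n_top=3)
for k, words in enumerate(toy_topics):
    print(f"Topic {k}: {', '.join(words)}")
```

Which words land in which topic depends on the random seed and the corpus, so the printed topics should be read as a qualitative check and not as a fixed result.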


In [7]:
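pyLDAvis.enable_notebook()  # Render the visualization inline in the notebook
# Note: in pyLDAvis >= 3.4 the pyLDAvis.sklearn module is named pyLDAvis.lda_model
visualized = pyLDAvis.sklearn.prepare(lda_model_final, cleaned, vectorizer, mds='tsne')
pyLDAvis.save_html(visualized, 'outputfile.html')  # Also save a standalone HTML copy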





