#### First we start by importing the libraries we'll use
We import print_function from the __future__ module so that print() behaves as a function, as it does in Python 3

We want to import from sklearn the TfidfVectorizer and CountVectorizer (found in the "text" sublibrary of the "feature_extraction" sublibrary)

We want to import the LatentDirichletAllocation class from the decomposition sublibrary of sklearn. This class is the main tool we use for topic modelling

We want to import the textclean library to format our input text and make it more uniform (lower case, no punctuation, no extra spaces, etc.)

In [None]:
from __future__ import print_function
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import textclean

#### We want to define a function that prints the top N words associated with each topic in a topic model

In [None]:
def print_top_words(model, feature_names, n_top_words):
	for topic_idx, topic in enumerate(model.components_):
		topic_idx += 1
		print("Topic #%d:" % topic_idx)
		print(" ".join([feature_names[i]
						for i in topic.argsort()[:-n_top_words - 1:-1]]))
	print()
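
The slice `[:-n_top_words - 1:-1]` in the function above is easy to misread, so here is a small sketch of what it selects. It assumes made-up weights and feature names, and it uses a pure-Python `argsort` helper as a stand-in for the numpy method so the example runs without any extra libraries.

```python
def argsort(values):
    """Pure-Python stand-in for numpy's argsort: indices that sort values ascending."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Made-up vocabulary and weights for a single topic (illustration only)
feature_names = ["data", "model", "science", "topic", "word"]
topic_weights = [0.5, 2.0, 0.1, 3.5, 1.2]
n_top_words = 3

order = argsort(topic_weights)             # indices from smallest to largest weight
top_indices = order[:-n_top_words - 1:-1]  # walk backwards from the end, n_top_words steps
top_words = [feature_names[i] for i in top_indices]
print(" ".join(top_words))
```

Reading the slice from the end with a step of -1 returns the indices of the largest weights first, which is why the words for each topic are printed from most to least important.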


#### We now want to load in the data.

Our data is the first chapter of the book, "Doing Data Science"

Each paragraph is on a new line

We open the file in read-binary mode ("rb") and then read every line into a list called "dataset", where each line is a new element in the list

We can print the first paragraph using the 0 index of the list

In [None]:
print("Loading dataset...")
# Raw string so the backslashes in the path are not treated as escape sequences
with open(r"..\dataset\doingdatascience1.txt", "rb") as f:
    dataset = f.readlines()
print("Dataset loaded")
print(dataset[0])
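
To see what reading in binary mode gives us, here is a minimal sketch that does not depend on the book chapter. The temporary file, its two lines, and the UTF-8 decoding are assumptions made for the illustration. In Python 3, each element returned by readlines() in "rb" mode is a bytes object that still ends with its newline character.

```python
import os
import tempfile

# Write a tiny two-paragraph file (made-up content, one paragraph per line)
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"First paragraph.\nSecond paragraph.\n")
    path = tmp.name

# Read it back the same way the notebook reads the dataset
with open(path, "rb") as f:
    raw_lines = f.readlines()
os.remove(path)

# Assuming the file is UTF-8 encoded, decode each line and drop the trailing newline
text_lines = [line.decode("utf-8").rstrip("\n") for line in raw_lines]
print(raw_lines[0])
print(text_lines[0])
```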

#### We now want to process the text: make everything lowercase, remove punctuation, remove extra spaces, and remove non-printable characters

The cleaned data is saved in a variable called data_samples

We can print the first line again to see how it changed

Because the documents are in a list, we can see how many documents we have using the len function

In [None]:
data_samples = [textclean.process(line) for line in dataset]
print(data_samples[0])
n_samples = len(dataset)
print("Number of documents: "+str(n_samples))
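
For readers who do not have textclean installed, the cleaning steps listed above can be sketched roughly as follows. The function `clean_text` is hypothetical and is not the implementation of textclean.process; it only assumes that the input is UTF-8 text and applies the same kinds of steps (lowercase, drop non-printable characters, drop punctuation, collapse extra spaces).

```python
import re
import string

def clean_text(line):
    """Hypothetical stand-in for textclean.process, not its actual implementation."""
    if isinstance(line, bytes):
        # Lines read in "rb" mode are bytes; assume UTF-8 and skip undecodable bytes
        line = line.decode("utf-8", errors="ignore")
    line = line.lower()
    # Keep only printable ASCII characters
    line = "".join(ch for ch in line if ch in string.printable)
    # Remove punctuation characters
    line = line.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", line).strip()

cleaned = clean_text(b"  Hello,   World!\tThis is   DATA science.\n")
print(cleaned)
```

Note that this sketch deletes punctuation outright, so a hyphenated word would be joined into one token; the real library may handle such cases differently.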



#### We want to define the parameters for the topic model

We want to only use the 500 most common words in our text for calculating the topic model

We want the topic model to have 3 topics

In [None]:
n_features = 500
n_topics = 3

#### We now start the analysis by creating a matrix of documents (rows) and how often each of the top 500 words occurs

We specify other features we want:

We want a maximum frequency for a word: it cannot occur in more than 95% of the documents

We want a minimum frequency for words. A word has to occur in at least 2 documents to be considered in the final feature list

We want stopwords ("the", "is", "an", "of", etc.) removed

In [None]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
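
To make the two frequency limits concrete, here is a rough pure-Python sketch of the filtering idea. The helper `filter_vocabulary` and the four toy documents are made up for illustration; this is not how CountVectorizer is implemented, and it leaves out stopword removal and the max_features cap.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.95, min_df=2):
    """Keep words whose document frequency lies between min_df and max_df.

    max_df is a fraction of the documents, min_df is an absolute document count.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document, no matter how often it repeats
        doc_freq.update(set(doc.split()))
    max_count = max_df * n_docs
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= max_count)

# Toy corpus (made up): "data" appears in every document
docs = [
    "data science uses statistics",
    "data models need tuning",
    "data science models",
    "data visualization",
]
vocab = filter_vocabulary(docs)
print(vocab)
```

Here "data" is dropped because it appears in all four documents (more than 95% of them), and words that appear in only one document are dropped by the minimum of 2, which leaves "models" and "science".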

#### We are fitting a topic model on our matrix of documents and words 

In [None]:
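from time import time  # used below to time the fit

print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# n_components was named n_topics before scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))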

#### We want to see the top 10 words associated with each topic

We use our "print_top_words" function that we created at the top

In [None]:
n_top_words = 10
print("\nTopics in LDA model:")
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)