# How does unsupervised machine learning work?
Unsupervised machine learning takes place in two steps: the *training* phase and the *testing* phase. In the *training* phase, the algorithm develops a set of criteria for determining patterns in the data. One common unsupervised learning task is clustering, wherein the algorithm seeks to find groupings (labels) in a dataset. Here, the clustering algorithm uses the training set to develop criteria for deciding which class (label) an observation belongs to.

In the *testing* phase, the criteria that the algorithm developed in the training phase are applied to data that it did not see during training. These criteria are used to assign each observation in the *testing* data set to one of the classes discovered during the *training* phase.
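
As a rough illustration of these two phases, here is a minimal sketch using sklearn's `KMeans` clustering. The measurements (height in cm, width in cm, mass in g) are made up for this example, and the choice of two clusters is an assumption, not something the algorithm discovers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up measurements (height cm, width cm, mass g); no labels are given.
X_train = np.array([
    [6.0, 7.0, 330.0],
    [6.2, 7.1, 325.0],
    [5.9, 6.8, 340.0],
    [5.0, 4.0, 150.0],
    [5.2, 4.1, 145.0],
    [4.9, 3.9, 155.0],
])
X_test = np.array([[6.1, 7.0, 335.0], [5.1, 4.0, 148.0]])

# Training phase: the algorithm finds two groupings on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_train)
train_labels = kmeans.labels_

# Testing phase: assign unseen observations to the discovered groupings.
test_labels = kmeans.predict(X_test)
print(train_labels, test_labels)
```

The cluster numbers themselves are arbitrary; what matters is which observations end up sharing a number.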

For example, let's consider the basket of fruit again:

Object | Height | Width | Color  | Mass | Round ?
-----  | -------| ------| -------| ---- | -------
Apple  | 6cm    | 7cm   | Red    | 330g | TRUE   
Orange | 6cm    | 7cm   | Orange | 330g | TRUE   
Lemon  | 5cm    | 4cm   | Yellow | 150g | FALSE  

In an unsupervised learning task, we don't know what fruit we have:

Height | Width | Color  | Mass | Round ?
-------| ------| -------| ---- | -------
 6cm    | 7cm   | Red    | 330g | TRUE   
 6cm    | 7cm   | Orange | 330g | TRUE   
 5cm    | 4cm   | Yellow | 150g | FALSE  
 

### How do we discover unknown patterns in the data?
![algorithms_cheatsheet](images/algorithms_cheatsheet.png)

## What is topic modeling using LDA?

![diagram showing a collection of texts then an arrow going towards a black box named LDA. On the other side of the black box are two arrows. One is slightly tilted up and points toward three circles. Each circle is a topic and contains a sample of words in that topic. The other arrow is slightly titled down and points towards a document. In the document, words are annotated to indicate which topic they belong to (if any)](images/lda_diagram.png)

In unsupervised learning tasks, the algorithm is not given any information about the type of observation it is seeing. As mentioned above, clustering tasks seek to find groupings in the dataset. One subset of these clustering tasks is topic extraction, where the aim is to find common groupings of items across collections of items. One method of doing so is Latent Dirichlet Allocation (LDA). In broad strokes, LDA extracts topics through the following method:<sup>1, 2</sup>

1. Arbitrarily decide that there are 10 topics.
2. Select one document and randomly assign each word in the document to one of the 10 topics.
3. Repeat step 2 for all the other documents. This results in the same word being assigned to multiple topics.
4. Compute
    1. how many words in each document are assigned to each topic?
    2. how many assignments to each topic are due to a given word?
5. Take one word in one document and reassign it to a new topic, chosen using the counts from step 4, and then repeat step 4.
6. Repeat step 5 until the model stabilizes such that reassigning topics does not change the distributions.

LDA yields a set of words associated with each topic (4.2) and the mixture of topics associated with each document (4.1).
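
The steps above can be sketched as a toy sampler. This is only an illustration, not how sklearn implements LDA: the function name `lda_gibbs`, the smoothing values `alpha` and `beta`, and the tiny corpus of integer word ids are all assumptions made for this example, and it runs for a fixed number of passes instead of checking whether the model has stabilized.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA sampler. docs is a list of lists of integer word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # step 4.1: topic counts per document
    topic_word = np.zeros((n_topics, vocab_size))  # step 4.2: word counts per topic
    # Steps 2-3: randomly assign every word in every document to a topic.
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
    # Steps 5-6: revisit each word, pick a new topic, and update the counts.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                doc_topic[d, t] -= 1   # remove the word's current assignment
                topic_word[t, w] -= 1
                # Weight each topic by how common it is in this document and
                # how strongly it is associated with this word.
                weights = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                           / (topic_word.sum(axis=1) + vocab_size * beta))
                t = rng.choice(n_topics, p=weights / weights.sum())
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
    return doc_topic, topic_word

# Made-up corpus: word ids 0-2 and 3-5 appear in different documents.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(doc_topic)   # topic counts per document
print(topic_word)  # word counts per topic
```

Each row of `doc_topic` sums to the length of that document, since every word always holds exactly one topic assignment.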
    

<sup>1</sup>[Introduction to Latent Dirichlet Allocation](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/) by Edwin Chen

<sup>2</sup>[The LDA Buffet is Now Open](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/) by Matthew Jockers 

<sup>3</sup> Image is inspired by Christine Doig's PyTexas 2015 ["Introduction to Topic Modeling"](http://chdoig.github.io/pytexas2015-topic-modeling/#/) presentation

### Read data in from a spreadsheet
Let's take the data we just saved and load it back into a dataframe so that we can do some analysis with it!

In [45]:
import pandas as pd
df = pd.read_csv("df_news_romance.csv")
df.head()

Unnamed: 0,label,sentence,NN,JJ
0,news,"['The', 'Fulton', 'County', 'Grand', 'Jury', '...",11,2
1,news,"['The', 'jury', 'further', 'said', 'in', 'term...",13,2
2,news,"['The', 'September-October', 'term', 'jury', '...",16,2
3,news,"['``', 'Only', 'a', 'relative', 'handful', 'of...",9,3
4,news,"['The', 'jury', 'said', 'it', 'did', 'find', '...",5,3


### Preparing data for machine learning
We're almost ready to do some machine learning!  First, we need to turn our sentences into the type of *feature vectors* LDA expects to work with. 

In [181]:
df['sentence'].head()

0    ['The', 'Fulton', 'County', 'Grand', 'Jury', '...
1    ['The', 'jury', 'further', 'said', 'in', 'term...
2    ['The', 'September-October', 'term', 'jury', '...
3    ['``', 'Only', 'a', 'relative', 'handful', 'of...
4    ['The', 'jury', 'said', 'it', 'did', 'find', '...
Name: sentence, dtype: object

#### Bag of Words
For LDA, we preprocess our data using sklearn's text feature extraction tools. In particular, we use the `CountVectorizer`, which counts how often each token occurs in each document. We can omit outliers from our feature set using the `max_df` keyword to ignore words that occur in more documents than a set threshold and the `min_df` keyword to ignore words that occur in fewer documents than a set threshold. The `CountVectorizer` can also strip out stop words using the `stop_words` keyword argument.
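
To see what these keywords do on a small scale, here is a minimal sketch on three made-up sentences (they are not from our dataset). It uses the `vocabulary_` attribute to list the words that survive the filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, used only for illustration.
docs = [
    "The jury said the election was fair",
    "The jury praised the election officials",
    "A romance began in the garden",
]

# min_df=2 keeps only words found in at least two documents;
# stop_words='english' drops common words such as "the".
vectorizer = CountVectorizer(min_df=2, stop_words='english')
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # words that survived the filtering
print(counts.toarray())                # one row per document, one column per word
```

Only the words shared by the first two sentences remain, so the third sentence ends up as a row of zeros.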

In [179]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(df['sentence'])

`CountVectorizer` processes the text such that `tf` is a sparse matrix containing the count of words in each document. One document in the Brown corpus is the following sentence: 
>Mrs. Robert O. Spurdle is chairman of the committee , which includes Mrs. James A. Moody , Mrs. Frank C. Wilkinson , Mrs. Ethel Coles , Mrs. Harold G. Lacy , Mrs. Albert W. Terry , Mrs. Henry M. Chance , 2d , Mrs. Robert O. Spurdle , Jr. , Mrs. Harcourt N. Trimble , Jr. , Mrs. John A. Moller , Mrs. Robert Zeising , Mrs. William G. Kilhour , Mrs. Hughes Cauffman , Mrs. John L. Baringer and Mrs. Clyde Newman .

The `CountVectorizer` has removed the stop words, punctuation, and very low frequency words. This yields the words and their counts listed below and visualized in the word cloud.

```python
{'2d': 1, 'albert': 1, 'chairman': 1, 'chance': 1, 'committee': 1,
 'ethel': 1, 'frank': 1, 'harold': 1, 'henry': 1, 'hughes': 1, 'includes': 1, 
 'james': 1, 'john': 2, 'jr': 2, 'lacy': 1, 'moody': 1, 'mrs': 15, 
 'robert': 3, 'spurdle': 2, 'terry': 1, 'trimble': 1, 'william': 1}
```

![Word cloud visualization, where the size of the word is relative to its frequency in a sentence, of "Mrs. Robert O. Spurdle is chairman of the committee , which includes Mrs. James A. Moody , Mrs. Frank C. Wilkinson , Mrs. Ethel Coles , Mrs. Harold G. Lacy , Mrs. Albert W. Terry , Mrs. Henry M. Chance , 2d , Mrs. Robert O. Spurdle , Jr. , Mrs. Harcourt N. Trimble , Jr. , Mrs. John A. Moller , Mrs. Robert Zeising , Mrs. William G. Kilhour , Mrs. Hughes Cauffman , Mrs. John L. Baringer and Mrs. Clyde Newman ."](images/countvect_wordcloud.png?)
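
Once we have count vectors like these, a topic model can be fit to them. The following is a minimal sketch, assuming sklearn's `LatentDirichletAllocation` class and a handful of made-up sentences instead of our dataframe. The choice of two topics is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up documents standing in for the sentences in the dataframe.
docs = [
    "the jury reviewed the election results",
    "the election officials praised the jury",
    "she smiled and kissed him in the garden",
    "he kissed her hand and she smiled",
]

vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(docs)

# n_components is the number of topics we arbitrarily ask for.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(tf)  # mixture of topics for each document

print(doc_topics.shape)       # (number of documents, number of topics)
print(lda.components_.shape)  # (number of topics, number of words)
```

Each row of `doc_topics` sums to 1, so it can be read as the share of each topic in that document, while each row of `lda.components_` holds the word weights for one topic.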

### Partitioning data into train and test sets
When you are partitioning your data into train and test sets, a good place to start is to use 75% of your data for training and 25% for testing. We want as much training data as possible, while also having enough testing data to check that our trained classifier generalizes across a number of examples. This will also lead to a more accurate evaluation of our trained classifier.

Fortunately, sklearn has a function that will do exactly this!

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(fv, df['label'],
                                                stratify=df['label'], 
                                                test_size=0.25,
                                                random_state = 42)

- We use the `stratify` argument because we have an uneven amount of data per class; we have more news sentences than romance sentences. By using `stratify`, we ensure that the train and test sets both keep the same proportion of news to romance sentences as the full dataset.

- In this example, we use a fixed random state to ensure that we always get exactly the same split, and therefore the same results, every time we run the code. Adding this argument is unnecessary for most types of classification; we do it here so that our results do not vary slightly across runs.
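
To make the effect of `stratify` concrete, here is a small sketch on made-up labels (60 news and 40 romance). The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 60 news sentences and 40 romance sentences.
labels = np.array(['news'] * 60 + ['romance'] * 40)
features = np.arange(100).reshape(-1, 1)  # placeholder feature values

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, test_size=0.25, random_state=42)

# Both splits keep the original 60/40 ratio of news to romance.
print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```

Without `stratify`, the ratio in each split would depend on the luck of the shuffle.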

Let's check the size of our train and test datasets using the `.shape` attribute of train and test data. 

In [6]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(6790, 2) (6790,)
(2264, 2) (2264,)


## Let's do topic modeling with sklearn!
One of the best things about sklearn is the simplicity of its syntax.

To do machine learning with sklearn, follow these five steps (the function names remain the same, regardless of the algorithm you use!):

### Step 1:  Import your desired algorithm

In [7]:
from sklearn.svm import LinearSVC

### Step 2: Create an instance of your machine learning algorithm

In [8]:
classifier = LinearSVC(random_state=42)

### Step 3:  Fit your data to your classifier (train)

In [9]:
classifier.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0)

As mentioned above, `LinearSVC` is a linear model for classification that separates classes using a line, a plane, or a hyperplane. The `classifier.fit` method searches for that line, plane, or hyperplane, which is also called the decision boundary. The dark gray line in the figure below is the decision boundary that the `LinearSVC` classifier found for this set of training data. All the data (dots) to the left of the gray line in the area with the orange background are classified as romance, while all the data to the right in the blue area are classified as news. The leftward skew of the classification space is due to the data being very dense and highly overlapping.

![LinearSVC training data](images/training_boundary.png?)
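
For a rough sense of what the decision boundary looks like in code, here is a minimal sketch on made-up two-feature data (not the data in the figure). After fitting, the boundary is described by the `coef_` and `intercept_` attributes, and the side a point falls on is given by the sign of its score.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up two-feature data: two well-separated groups of points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y = np.array(['romance'] * 3 + ['news'] * 3)

clf = LinearSVC(random_state=42)
clf.fit(X, y)

# The boundary is the set of points where w . x + b equals zero.
w, b = clf.coef_[0], clf.intercept_[0]
scores = X @ w + b

# A positive score falls on the side of the second class in clf.classes_.
manual = clf.classes_[(scores > 0).astype(int)]
print(manual)
print(clf.predict(X))
```

The labels computed by hand from the scores match what `predict` returns, which is all that `predict` does for a linear model with two classes.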

### Step 4:  Predict labels for unseen data (test)

In [10]:
y_predict = classifier.predict(X_test)

### Step 5: Score!
Evaluate the skill of the model by computing the following:
* score: the fraction of predicted labels that are the same as the actual labels
* confusion matrix: true positive, false positive, false negative, and true negative counts

In [11]:
classifier.score(X_test, y_test)

0.70759717314487636

Right now, our classifier can predict the labels of previously unseen news and romance sentences with about 71% accuracy.

In [12]:
from sklearn.metrics import confusion_matrix

In [13]:
confusion_matrix(y_test, y_predict)

array([[747, 409],
       [253, 855]])


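In sklearn's `confusion_matrix`, each row is an actual label and each column is a predicted label:

|      | predicted news | predicted romance |
|:--: | :--:| :--:|
| actual news | 747 | 409 |
| actual romance | 253 | 855 |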
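
The score from Step 5 can be recovered from the confusion matrix, since the correct predictions sit on its diagonal. A minimal sketch, with the counts copied by hand from the output above:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[747, 409],
               [253, 855]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

print(cm.sum())   # total number of test sentences
print(accuracy)   # fraction of test sentences labeled correctly
```

The total matches the number of rows in `X_test`, and the accuracy matches the value returned by `classifier.score`.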

In `LinearSVC`, `classifier.predict` decides which class a data point is in based on which side of the decision boundary (the gray line in the figure) the point falls on. Points in the orange area to the left of the gray line are classified as romance, while points in the blue area to the right of the gray line are classified as news. Orange points in the blue area are romance texts that are misclassified as news texts, while blue points in the orange area are news texts that are misclassified as romance texts.

![LinearSVC Testing Data](images/testing_boundary.png?)