## Lab 4.3

#### Requirements

- Collect the data
- Perform a **truncated singular value decomposition (SVD)** on the dataset to determine which components are most significant within the articles.
- Write up your findings. For technical team members, comment your code so your process is clear; for non-technical team members, draft a brief report explaining why your findings are significant.

Just as in a real-life scenario, the data and your analysis will not always be clear-cut. You may wonder how to tell when you've solved the problem; what we're looking for is your best recommendations based on the available data. Work through the process until you and your teammate have enough information to provide an in-depth analysis. Your manager would like to see at least 60% accuracy for your analysis.

**Bonus:**

1. Continue tuning your model to reach a higher threshold/percentage
2. Triangulate or repeat using a different method

#### Starter code

For this project, we'll use the 20 Newsgroups dataset, which is publicly available from the UCI Machine Learning Repository. Fortunately, scikit-learn ships a loader for it, which makes data collection easy:

```
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
```

We'll work with the training subset, which contains roughly 11,000 posts out of the full collection of approximately 20,000 newsgroup articles.

Within this set, we care about two attributes: the data and the class labels.

The class label names are under `target_names` (the integer label for each article is in `target`):

```
newsgroups_train.target_names
```

and our data:

```
newsgroups_train.data
```

[Here is your starter code!](./code/starter-code.ipynb)

#### Deliverable

Your finished product will be a Jupyter Notebook containing your analysis, which will include:

- Your solution code
- A brief write-up on your findings, with one paragraph on your findings and one paragraph on your procedures
- Recommendations for further analytical procedures on the datasets

> [Solution Code](./code/solution-code.ipynb)

## Additional Resources

- The [20 Newsgroups dataset documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) for scikit-learn.
- The [TruncatedSVD documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) for scikit-learn.

#### Set up your imports

```
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
```

#### 1. Pull the training set from the newsgroup data

```
newsgroups_train = fetch_20newsgroups(subset='train')
```

```
y = newsgroups_train.target
x = newsgroups_train.data
```

#### 2. Create the vectorizer 

```
vectorizer = CountVectorizer(max_features=1000,
                             ngram_range=(1, 2),
                             stop_words='english',
                             binary=True)
```
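
To make the vectorizer settings concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the `toy_` names are illustrative assumptions, not part of the lab data). It uses the same options as above, with `max_features` lowered so the result is small enough to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, for illustration only.
toy_docs = [
    "the engine of the car is loud",
    "the car engine needs oil",
    "space shuttle launch delayed",
]

toy_vectorizer = CountVectorizer(
    max_features=5,        # keep only the 5 most frequent terms
    ngram_range=(1, 2),    # single words and two-word phrases
    stop_words="english",  # drop common words such as "the" and "is"
    binary=True,           # record presence (1) or absence (0), not counts
)
toy_X = toy_vectorizer.fit_transform(toy_docs)

print(toy_X.shape)                         # (n_documents, n_features)
print(sorted(toy_vectorizer.vocabulary_))  # the terms that were kept
print(toy_X.toarray())                     # dense view of the sparse matrix
```

Because `binary=True`, every entry is 0 or 1 regardless of how often a term repeats within a document.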

#### 3. Create the Truncated Singular Value Decomposition

```
svd = TruncatedSVD(n_components=50, random_state=42)
```
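
One way to judge which components matter most is to look at `explained_variance_ratio_` after fitting. The sketch below runs on a random sparse matrix that stands in for a document-term matrix (the `demo_` names and the matrix itself are assumptions for illustration); the same attribute is available on `svd` once it has been fitted to the real data.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a document-term matrix: 200 "documents", 50 "terms".
rng = np.random.RandomState(42)
demo_X = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)

demo_svd = TruncatedSVD(n_components=10, random_state=42)
demo_X_reduced = demo_svd.fit_transform(demo_X)

# Share of variance captured by each component, and the running total.
ratios = demo_svd.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {i:2d}: {r:.3f} (cumulative {c:.3f})")
```

If the cumulative total levels off after a handful of components, the remaining ones add little, which is a reasonable starting point for choosing `n_components`.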

#### 4. Set up your k-means clustering

```
print(newsgroups_train.target)
```

```
[7 4 4 ..., 3 1 8]
```

```
k = 10
km = KMeans(n_clusters=k)
```

#### 5. Fit the vectorizer and SVD

```
X = vectorizer.fit_transform(x)
```

```
X2 = svd.fit_transform(X)
```

#### 7. Fit the k-means

```
km.fit(X2)
```

```
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=10, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
```

```
labels = km.labels_
centroids = km.cluster_centers_
```

```
print(labels)
print(centroids)
```

```
[1 1 2 ..., 3 8 1]
[[  4.18037149e+00  -2.78048992e-01   7.69426660e-01  -6.59378385e-01
   -3.93012745e-02  -2.24957642e-01   1.28168210e-01   1.14445597e-01
   -2.08116240e-01  -1.86407744e-02   9.91034775e-02  -8.54276914e-03
   -4.28923903e-03   1.00452962e-01  -9.71930406e-03  -1.63551569e-01
   -1.77202494e-02   1.73506368e-02  -6.92354287e-02  -3.11776415e-04
   -1.99893221e-02  -1.89868035e-02  -3.32071510e-02   1.71795990e-02
    7.76017902e-04   3.01091989e-02  -3.35620696e-02   4.30688762e-02
   -2.40724751e-02   2.64134672e-02  -3.72185501e-02  -5.97202905e-02
   -8.45392685e-02   1.75352312e-03   1.63625695e-02  -1.60607632e-02
   -3.71538717e-02   2.59588684e-02   1.79903594e-02  -1.34008183e-02
    2.20416530e-02   2.21621303e-02   1.06787731e-02   5.86048593e-02
   -2.05919310e-02  -8.65067137e-03  -1.76093348e-02   2.50642146e-02
   -1.27083343e-02   7.66767552e-03]
 [  2.66453712e+00  -1.57947545e+00   5.74052896e-01   2.55642333e-01
    1.51991800e-01  -2.29851524e-0
```
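
Raw centroid coordinates in SVD space are hard to interpret. One possible approach, sketched here on a made-up six-document corpus (all `mini_` names are illustrative assumptions), is to map each centroid back to term space with `inverse_transform` and list its heaviest terms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus with two obvious topics.
mini_docs = [
    "car engine oil", "engine car brakes", "car brakes oil",
    "space orbit launch", "orbit launch rocket", "space rocket orbit",
]

mini_vec = CountVectorizer(binary=True)
mini_counts = mini_vec.fit_transform(mini_docs)

mini_svd = TruncatedSVD(n_components=2, random_state=42)
mini_reduced = mini_svd.fit_transform(mini_counts)

mini_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(mini_reduced)

# Terms in column order, recovered from the fitted vocabulary.
terms = np.array(sorted(mini_vec.vocabulary_, key=mini_vec.vocabulary_.get))

# Map centroids from SVD space back to term space, then list the heaviest terms.
centroids_in_term_space = mini_svd.inverse_transform(mini_km.cluster_centers_)
top_terms = {
    cluster: terms[np.argsort(weights)[::-1][:3]].tolist()
    for cluster, weights in enumerate(centroids_in_term_space)
}
print(top_terms)
```

The same idea should carry over to the notebook's `svd`, `km`, and `vectorizer` objects, where the top terms per cluster give a quick, human-readable summary of what each cluster is about.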

#### 8. Check the performance of our k-means test

```
metrics.accuracy_score(y, labels)
```

```
0.058953508926993109
```
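
A caveat when reading this score: k-means assigns arbitrary cluster IDs, so cluster 3 has no built-in relationship to class 3, and raw accuracy can look poor even when the grouping is good. The sketch below, on made-up labels (an assumption for illustration), contrasts raw accuracy with two permutation-invariant scores and with accuracy after finding the best one-to-one match between clusters and classes using `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

# Hypothetical labels: the grouping is perfect, but the cluster IDs are permuted.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

# Raw accuracy compares IDs literally, so it reports a complete failure.
raw_accuracy = metrics.accuracy_score(true_labels, cluster_ids)

# Permutation-invariant scores recognise the perfect grouping.
ari = metrics.adjusted_rand_score(true_labels, cluster_ids)
homogeneity = metrics.homogeneity_score(true_labels, cluster_ids)

# Alternatively, find the best one-to-one mapping from clusters to classes
# (Hungarian algorithm) and compute accuracy after relabelling.
contingency = metrics.confusion_matrix(true_labels, cluster_ids)
row_idx, col_idx = linear_sum_assignment(-contingency)  # negate to maximise matches
matched_accuracy = contingency[row_idx, col_idx].sum() / contingency.sum()

print(raw_accuracy, ari, homogeneity, matched_accuracy)
```

Note that a one-to-one match can cover at most as many classes as there are clusters, so with `k = 10` and 20 newsgroups, at least ten classes will always go unmatched.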

#### Classification Report

```
print(metrics.classification_report(y, labels))
```

```
             precision    recall  f1-score   support

          0       0.07      0.11      0.08       480
          1       0.09      0.32      0.14       584
          2       0.02      0.05      0.03       591
          3       0.09      0.21      0.12       590
          4       0.01      0.00      0.00       578
          5       0.00      0.00      0.00       593
          6       0.01      0.03      0.01       585
          7       0.00      0.00      0.00       594
          8       0.09      0.26      0.13       598
          9       0.07      0.18      0.10       597
         10       0.00      0.00      0.00       600
         11       0.00      0.00      0.00       595
         12       0.00      0.00      0.00       591
         13       0.00      0.00      0.00       594
         14       0.00      0.00      0.00       593
         15       0.00      0.00      0.00       599
         16       0.00      0.00      0.00       546
         17       0.00      0.00      0.00   

  'precision', 'predicted', average, warn_for)
```


#### Confusion Matrix

```
print(metrics.confusion_matrix(y, labels))
```

```
[[ 54  31  67  14   4  30 123   0 138  19   0   0   0   0   0   0   0   0
    0   0]
 [ 30 185  14 102   4   3  71   2  54 119   0   0   0   0   0   0   0   0
    0   0]
 [ 27 141  28  85   2   1  87  11  64 145   0   0   0   0   0   0   0   0
    0   0]
 [ 34 172  24 125   2   3  76   3  52  99   0   0   0   0   0   0   0   0
    0   0]
 [ 28 177  23  83   1   1  74   0  89 102   0   0   0   0   0   0   0   0
    0   0]
 [ 27 196  16 122  12   0  57   2  69  92   0   0   0   0   0   0   0   0
    0   0]
 [ 15 225   9 124   0   0  15  11  15 171   0   0   0   0   0   0   0   0
    0   0]
 [ 62 107  62  64   2   6 115   0 103  73   0   0   0   0   0   0   0   0
    0   0]
 [ 29 119  40  62   1   2 147   2 154  42   0   0   0   0   0   0   0   0
    0   0]
 [ 49  91  44  38   0  12 121  14 123 105   0   0   0   0   0   0   0   0
    0   0]
 [ 52 101  37  29   1   9 134  42  94 101   0   0   0   0   0   0   0   0
    0   0]
 [ 86  54  85 101  10  23 106   5  92  33   0   0   0   0   0   0
```
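
Columns 10 through 19 of this matrix are all zero because only ten clusters exist while there are twenty classes. A rectangular contingency table can be easier to read in that situation. Here is a minimal sketch with `pandas.crosstab` on made-up labels (the arrays are illustrative assumptions); rows are true classes and columns are cluster IDs.

```python
import numpy as np
import pandas as pd

true_classes = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # four hypothetical classes
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # but only two clusters

# One row per true class, one column per cluster that actually occurs.
table = pd.crosstab(
    pd.Series(true_classes, name="true class"),
    pd.Series(cluster_ids, name="cluster"),
)
print(table)
```

Applied to the notebook's `y` and `labels`, the same call would give a 20-row by 10-column table showing how each newsgroup is spread across the clusters.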

**Note:** Repeat the k-means test with varying values of `k` to determine which gives the best performance. Use the techniques covered in the *Tuning Clusters* lesson to tune the clusters further.
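
The *Tuning Clusters* lesson is not reproduced here, so the sketch below assumes two commonly used diagnostics: inertia (for an elbow plot) and the silhouette score. It runs on synthetic blobs that stand in for the SVD-reduced matrix (`demo_points` is an assumption for illustration; in the notebook you would pass `X2` instead).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the SVD-reduced matrix (X2 in the notebook).
demo_points, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

results = {}
for k in range(2, 8):
    km_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit(demo_points)
    results[k] = {
        "inertia": km_k.inertia_,  # within-cluster sum of squares (lower is tighter)
        "silhouette": silhouette_score(demo_points, km_k.labels_),  # in [-1, 1]
    }

# Pick the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
for k, scores in results.items():
    print(k, round(scores["inertia"], 1), round(scores["silhouette"], 3))
print("best k by silhouette:", best_k)
```

Inertia always tends to fall as `k` grows, so look for the point where the drop flattens out rather than the minimum; the silhouette score gives a second opinion that does not automatically favour larger `k`.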