## K-Means algorithm

Let's build a K-Means clustering model that groups similar documents together. We will use the 'nlp' and 'philosophy' data from yesterday as the demo, and `train_finance.txt`, `train_medicine.txt`, `train_sports.txt`, `test_finance.txt`, `test_medicine.txt`, `test_sports.txt` for the exercise.

### Loading and Preprocessing

In [None]:
import glob, os
os.chdir('sample_data/') #change directory to where the folders are
folders = glob.glob('*') #load all the folder names into a list
# print(folders)

all_texts = []
all_categories = []

for folder in folders:
    print('importing text files from "{}" folder...'.format(folder), end=' ')
    
    files_in_folder = glob.glob(folder+'/*.txt')
    
    for _file_ in files_in_folder:
        with open(_file_, 'r', encoding='latin-1') as f:
            text_in_file = f.read()
            all_texts.append(text_in_file)
            all_categories.append(folder)
            
    print('found {} files'.format(len(files_in_folder)))
        
os.chdir('../') #revert back to original working directory

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import re

stopwords = nltk.corpus.stopwords
eng_stopwords = stopwords.words('english')
wordnet_lemmatizer = WordNetLemmatizer()

def basic_preprocessing(text):
    text = text.lower() #lowering
    text = re.sub(r'\[.*?\]', '', text) #removing all instances of citation brackets found in wiki articles
    text = word_tokenize(text)
    text = [word for word in text if word not in eng_stopwords] #removing stop words
    text = [wordnet_lemmatizer.lemmatize(word) for word in text]

    return text

processed_texts = [basic_preprocessing(text) for text in all_texts]
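
To see what the preprocessing steps do to a single sentence, here is a minimal sketch that does not need NLTK. The names `TOY_STOPWORDS` and `toy_preprocess` are made up for this illustration, the stop-word set is deliberately tiny, and the regex tokenizer is only a crude stand-in for `word_tokenize`; lemmatization is left out.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's much longer English list.
TOY_STOPWORDS = {"the", "is", "a", "of", "and"}

def toy_preprocess(text):
    """Lowercase, drop [...] citation markers, tokenize, remove stop words."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)       # same pattern as basic_preprocessing
    tokens = re.findall(r'[a-z0-9]+', text)   # crude stand-in for word_tokenize
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

sample = "Clustering[1] is a branch of Machine Learning[citation needed]."
tokens = toy_preprocess(sample)
print(tokens)
```

The non-greedy `.*?` matters here: it stops at the first closing bracket, so each citation marker is removed separately instead of everything between the first `[` and the last `]`.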


### Test-Train Split

In [None]:
test_texts = all_texts[0:3] + all_texts[10:13]
test_data = processed_texts[0:3] + processed_texts[10:13]
test_target = all_categories[0:3] + all_categories[10:13]

train_texts = all_texts[3:10] + all_texts[13:20]
train_data = processed_texts[3:10] + processed_texts[13:20]
train_target = all_categories[3:10] + all_categories[13:20]


train_data = list(zip(train_data, train_target))


print(len(train_data), len(test_data))
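
The slices above hard-code the positions of the documents, which only works if every folder holds exactly 10 files. A possible alternative, sketched below, takes the first few documents of each category as test data regardless of how many files there are. The helper name `split_per_category` and the placeholder documents are assumptions made for this illustration.

```python
from collections import defaultdict

def split_per_category(items, labels, n_test):
    """Put the first n_test items of every label into test, the rest into train."""
    seen = defaultdict(int)          # how many items of each label we have met so far
    train, test = [], []
    for item, label in zip(items, labels):
        if seen[label] < n_test:
            test.append((item, label))
        else:
            train.append((item, label))
        seen[label] += 1
    return train, test

# Placeholder data: 10 documents per category, as the slices above assume
docs = ['doc{}'.format(i) for i in range(20)]
cats = ['nlp'] * 10 + ['philosophy'] * 10

train_pairs, test_pairs = split_per_category(docs, cats, n_test=3)
print(len(train_pairs), len(test_pairs))
```

With 10 documents per category and `n_test=3`, this selects the same positions as the slices `[0:3]` and `[10:13]`.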

### K-Means Model
* Create a TF-IDF Matrix on the train data
* Input this to a KMeans object

```python
km = KMeans(n_clusters=k)
clusters = km.fit(train_tfidf)
```

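This runs K-Means on the TF-IDF matrix and stores the cluster centers. Note that `fit` returns the fitted `KMeans` object itself, so `clusters` and `km` refer to the same model.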

* Predict clusters on train and test data
```python
train_clusters = clusters.predict(train_tfidf)
test_clusters = clusters.predict(test_tfidf)
```  

This predicts which cluster each data point (row) belongs to, by finding the nearest cluster center.
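
To give an idea of what happens inside `fit`, here is a simplified pure-Python sketch of the two alternating steps of K-Means: assign every point to its nearest centroid, then move every centroid to the mean of its points. It is not sklearn's implementation (which, among other things, picks the starting centroids with k-means++ by default); the function names and the 2-D toy points are made up for this illustration, and the starting centroids are chosen by hand.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:              # keep an empty cluster's centroid where it is
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids

def toy_kmeans(points, centroids, max_iter=10):
    """Alternate the two steps until the assignments stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centroids = toy_kmeans(points, centroids=[(0.0, 0.0), (6.0, 5.0)])
print(labels, centroids)
```

Predicting the cluster of a new point is just the `assign` step with the final centroids, which is why `predict` can be called on test data that was never used in `fit`.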

In [None]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(train_texts)
train_tfidf = tfidf.transform(train_texts)

km = KMeans(n_clusters=2) #Specify how many clusters must be built
clusters = km.fit(train_tfidf) #This runs K-Means algo and decides where the cluster centroids are
print('Train document clusters {}'.format(clusters.predict(train_tfidf)))
#here 0 stands for one cluster and 1 stands for another

## Test data
test_tfidf = tfidf.transform(test_texts)
print('Test document clusters: {}'.format(clusters.predict(test_tfidf)))

```python
Train document clusters [1,1,1,1,1,1,1,0,0,0,0,0,0,0]
```
This means the first document in the training set is in cluster 1 and the last document is in cluster 0.  
The numbers don't mean anything by themselves, and they may be swapped on another run because K-Means starts from random centroids. What matters is which documents share a label. We know that the first 7 training documents belong to the nlp category, and this is reflected in all 7 of them being assigned to cluster 1.
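
One way to check a clustering without caring which number each cluster received is cluster purity: for every cluster, count the documents of its most common category, and divide the total by the number of documents. The sketch below is a minimal pure-Python version; the function name `purity` and the two hand-written label lists are assumptions made for this illustration, not output from the notebook.

```python
from collections import Counter

def purity(true_labels, cluster_ids):
    """Fraction of documents that fall in their cluster's majority category."""
    majority_total = 0
    for cluster in set(cluster_ids):
        members = [t for t, c in zip(true_labels, cluster_ids) if c == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(true_labels)

truth = ['nlp'] * 7 + ['philosophy'] * 7
run_a = [1] * 7 + [0] * 7   # labels as in the output above
run_b = [0] * 7 + [1] * 7   # same grouping with the numbers swapped

print(purity(truth, run_a), purity(truth, run_b))
```

Both runs get the same score because purity only looks at which documents are grouped together. Purity should be read with care, though: it rises automatically as the number of clusters grows.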

### (Optional) Computing Distances between two documents

K-Means measures distance with the Euclidean formula, so let's look at how far `data_point=10` is from each of the training documents, using two for loops: one over the first 7 documents (labelled cluster 1 in the plot) and one over the last 7 (labelled cluster 2). Intuitively, we should expect smaller distances in the second group, since `data_point=10` belongs to it. The distance from document 10 to itself is 0.

We then plot both the distances.

In [None]:
from sklearn.metrics.pairwise import euclidean_distances
all_distances_1 = []
all_distances_2 = []
data_point = 10

for i in range(7):
    all_distances_1.append(euclidean_distances(train_tfidf[data_point], train_tfidf[i])[0][0])

for i in range(7,14):
    all_distances_2.append(euclidean_distances(train_tfidf[data_point], train_tfidf[i])[0][0])

from matplotlib import pyplot as plt
%matplotlib inline
plt.plot(range(7), all_distances_1, color='r', label='distance of data point {} from cluster 1 documents'.format(data_point))
plt.plot(range(7,14), all_distances_2, color='b', label='distance of data point {} from cluster 2 documents'.format(data_point))
plt.legend()
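
For reference, the Euclidean distance between two vectors is the square root of the sum of squared differences of their components. The sketch below computes it by hand for two made-up 3-term document vectors; `euclidean`, `doc_a` and `doc_b` are names invented for this illustration.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared component-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up unit-length vectors, imitating rows of a TF-IDF matrix
doc_a = [0.6, 0.8, 0.0]
doc_b = [0.0, 0.8, 0.6]

d = euclidean(doc_a, doc_b)
print(round(d, 4))
print(euclidean(doc_a, doc_a))   # a document is at distance 0 from itself
```

With the default `norm='l2'` setting, `TfidfVectorizer` scales every row to unit length and all entries are non-negative, so the distance between two documents always lies between 0 and the square root of 2.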

### (Optional) Building other models
Because sklearn uses a uniform syntax, it is very easy to build other models on the same data: Agglomerative Clustering (another clustering method), and the classifiers Logistic Regression, Support Vector Classifier and K-Nearest Neighbors. Unlike the clustering models, the classifiers are supervised and use the `train_target` labels. Let's see how.

#### Hierarchical Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
Hclustering = AgglomerativeClustering(n_clusters=2, linkage='ward') #ward linkage always uses Euclidean distance; newer sklearn versions no longer accept the 'affinity' argument
Hclustering.fit(train_tfidf.toarray()) #This creates two clusters
ms = np.column_stack((train_target,Hclustering.labels_)) #column_stack will simply join two columns. Use print to see what it looks like
df = pd.DataFrame(ms, columns = ['Ground truth','Clusters']) #Conversion of array into a dataframe
pd.crosstab(df['Ground truth'], df['Clusters'], margins=True) #Creating a table.

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg_classifier = LogisticRegression()
logreg_classifier.fit(train_tfidf, train_target)

train_predictions = logreg_classifier.predict(train_tfidf)
test_predictions = logreg_classifier.predict(test_tfidf)

print('Train Accuracy: {}%'.format(accuracy_score(train_predictions, train_target)*100))
print('Test Accuracy: {}%'.format(accuracy_score(test_predictions, test_target)*100))


#### Support Vector Classification

In [None]:
from sklearn.svm import SVC

svm_classifier = SVC()
svm_classifier.fit(train_tfidf, train_target)

train_predictions = svm_classifier.predict(train_tfidf)
test_predictions = svm_classifier.predict(test_tfidf)

print('Train Accuracy: {}%'.format(accuracy_score(train_predictions, train_target)*100))
print('Test Accuracy: {}%'.format(accuracy_score(test_predictions, test_target)*100))

#### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_tfidf, train_target)

test_predictions = knn.predict(test_tfidf)
print('Test Accuracy: {}%'.format(accuracy_score(test_predictions, test_target)*100))
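
Since all three classifiers share the same `fit` and `predict` methods, they can also be trained in a single loop. The sketch below shows the pattern on a made-up 2-D toy dataset; `X_toy`, `y_toy` and the dictionary keys are assumptions made for this illustration. In the notebook, `train_tfidf`, `train_target` and `test_tfidf` would take their place.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Made-up toy data: two well-separated groups of three points each
X_toy = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
y_toy = ['nlp', 'nlp', 'nlp', 'philosophy', 'philosophy', 'philosophy']

models = {
    'logreg': LogisticRegression(),
    'svc': SVC(),
    'knn': KNeighborsClassifier(n_neighbors=3),
}

toy_predictions = {}
for name, model in models.items():
    model.fit(X_toy, y_toy)   # same call for every classifier
    toy_predictions[name] = list(model.predict([[0.05, 0.05], [0.95, 0.95]]))
    print(name, toy_predictions[name])
```

Note that `n_neighbors` cannot be larger than the number of training samples, which is worth keeping in mind with a training set as small as the one used here.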


## Exercise
The text files `train_finance.txt`, `train_medicine.txt` and `train_sports.txt` are loaded as train data, and `test_finance.txt`, `test_medicine.txt` and `test_sports.txt` as test data, from the activity_data folder.  
* Build a K-Means clustering model on the train data and see if the test data clusters are as expected.
* Note that we don't need the `train_categories` information to fit the model, since K-Means is an unsupervised algorithm.

In [None]:
import glob
import os
os.chdir('activity_data/')
train_files = glob.glob('train_*.txt')

train_data = []
train_categories = []

for train_file in train_files:
    with open(train_file, 'r') as f:
        _data_ = f.readlines()
        train_data.extend(_data_)
        train_categories.extend([train_file.split('_')[-1].split('.')[0]]*len(_data_))
        
test_files = glob.glob('test_*.txt')
test_data = []
test_categories = []
for test_file in test_files:
    with open(test_file, 'r') as f:
        _data_ = f.readlines()
        test_data.extend(_data_)
        test_categories.extend([test_file.split('_')[-1].split('.')[0]]*len(_data_))
        
print(len(train_data), len(test_data))
os.chdir('..')