# Lab 9: Document Analysis

In this assignment, we will learn how to perform document classification and clustering.



## 1. Example

In this example, we use the [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document, and there are 20 classes in total.

### 1.1 Load data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF representation

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


#TF-IDF representation for each document
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)


### 1.3 Use KNN to do document classification

Here, we use the cross-validation method to select $K$.

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtr = data_train_vectors
Ytr = data_train.target

Xte = data_test_vectors
Yte = data_test.target

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.16855203045338205
{'n_neighbors': 1}
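
`GridSearchCV` fits one model per candidate $K$ on each fold and averages the validation scores; `best_score_` is the largest of those averages. A minimal sketch of inspecting the per-candidate scores, using a small synthetic dataset from `make_classification` as a stand-in for the TF-IDF matrix (the data and the names `X_demo`, `y_demo`, `demo_grid` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the 20newsgroups vectors)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

demo_grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": range(1, 5)}, cv=5, scoring="accuracy"
)
demo_grid.fit(X_demo, y_demo)

# Mean validation accuracy over the 5 folds, one value per candidate K
mean_scores = demo_grid.cv_results_["mean_test_score"]
for k, score in zip(demo_grid.cv_results_["param_n_neighbors"], mean_scores):
    print(k, round(float(score), 3))

print(demo_grid.best_params_, demo_grid.best_score_)
```

Printing the whole table rather than only the best entry shows how close the candidates are, which helps when deciding whether to widen `k_range`.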


### 1.4 Use Logistic Regression to do document classification
Here, we also use the cross-validation method to select the regularization coefficient. 

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

{'C': 9}



0.6841476367498672 0.6746130862400499 0.6841476367498672
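
The "ITERATIONS REACHED LIMIT" message in the output is a `ConvergenceWarning`: the default lbfgs solver stopped at its default `max_iter=100` before converging. One remedy the warning itself suggests is a larger `max_iter`. A minimal sketch on synthetic stand-in data (the dataset, the helper `fit_and_count_warnings`, and the values 3 and 1000 are assumptions for illustration, not tuned for 20newsgroups):

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the 20newsgroups vectors)
X_lr, y_lr = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

def fit_and_count_warnings(max_iter):
    """Fit logistic regression and count the ConvergenceWarnings it raises."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(C=1.0, max_iter=max_iter).fit(X_lr, y_lr)
    n_warn = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, n_warn

_, n_short = fit_and_count_warnings(max_iter=3)     # deliberately too few iterations
_, n_long = fit_and_count_warnings(max_iter=1000)   # enough room to converge here
print(n_short, n_long)
```

In the lab code, the same keyword can be passed as `LogisticRegression(penalty='l2', max_iter=...)` both inside the grid search and in the final fit. The reported scores may change slightly once the solver is allowed to converge.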


## 2. Task: Document Classification and Clustering

In this task, we use the [BBCNews](BBC_News_Train.csv) dataset, which contains 1490 articles from 5 topics: tech, business, sport, entertainment, and politics.

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster. 

### 2.1 Load data and represent it with TF-IDF representation

In [10]:
# your code
import pandas as pd
data = pd.read_csv('BBC_News_train.csv')
data

| Unnamed: 0 | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business |
| ... | ... | ... | ... |
| 1485 | 857 | double eviction from big brother model caprice... | entertainment |
| 1486 | 325 | dj double act revamp chart show dj duo jk and ... | entertainment |
| 1487 | 1590 | weak dollar hits reuters revenues at media gro... | business |
| 1488 | 1587 | apple ipod family expands market apple has exp... | tech |


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode the category names as integer class labels
label_encoder = LabelEncoder()
data['Category'] = label_encoder.fit_transform(data['Category'])

print(label_encoder.classes_)
print(label_encoder.transform(label_encoder.classes_))

X = data['Text'].values
y = data['Category'].values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=9)

vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(X_train)
data_test_vectors = vectorizer.transform(X_test) 

print(data_train_vectors.shape, data_test_vectors.shape)

[0 1 2 3 4]
[0 1 2 3 4]
(894, 19811) (596, 19811)
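
For reference, `LabelEncoder` sorts the distinct category strings and maps them to the integers 0 to n-1, and `inverse_transform` recovers the names. A minimal sketch on a hand-written list that reuses the five topic names from the task description (the list and the `demo_encoder` name are illustrative assumptions):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels using the five BBC topic names
topics = ["tech", "business", "sport", "business", "politics", "entertainment"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(topics)

print(demo_encoder.classes_)   # distinct names in sorted order
print(codes)                   # position of each label within classes_
names_back = demo_encoder.inverse_transform([0, 4])
print(names_back)
```

If the encoding cell is run a second time on the already-encoded column, `classes_` would hold integers rather than names, which would be consistent with the `[0 1 2 3 4]` printed twice above.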


### 2.2 Use KNN to do document classification

In [13]:
# your code
Xtr = data_train_vectors
Ytr = y_train

Xte = data_test_vectors
Yte = y_test

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.9071872449940368
{'n_neighbors': 4}
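
The cell above reports only the cross-validated accuracy of KNN on the training split. To compare it with logistic regression on equal footing, the refit best model can also be scored on the held-out split. A minimal sketch with synthetic stand-in data (in the lab, `Xtr`, `Ytr`, `Xte`, `Yte` would take the place of the demo arrays; everything named `*_demo` or `knn_*` is hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the BBC TF-IDF vectors)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=9
)

knn_grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": range(1, 5)}, cv=5, scoring="accuracy"
)
knn_grid.fit(Xtr_demo, ytr_demo)

# With refit=True (the default), the best K is retrained on the whole
# training split, so the grid object can predict directly.
knn_pred = knn_grid.predict(Xte_demo)

knn_acc = accuracy_score(yte_demo, knn_pred)
knn_macro_f1 = f1_score(yte_demo, knn_pred, average="macro")
print(knn_grid.best_params_, knn_acc, knn_macro_f1)
```

Reporting the same three metrics for both classifiers on the same test split makes the comparison asked for in Task 1 direct.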


### 2.3 Use Logistic Regression to do document classification

In [14]:
#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

{'C': 5}
0.9697986577181208 0.9687099935490613 0.9697986577181208
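
In both logistic regression runs the micro-F1 is identical to the accuracy. This is expected for single-label multiclass problems, where micro-averaging pools all decisions before computing F1. Macro-F1 instead averages the per-class F1 scores, so small classes count as much as large ones. A small worked example with made-up labels (the two lists are illustrative assumptions):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions for 3 classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 0, 1, 1, 1, 2, 0]

demo_acc = accuracy_score(y_true, y_hat)
demo_micro = f1_score(y_true, y_hat, average="micro")
demo_macro = f1_score(y_true, y_hat, average="macro")
per_class = f1_score(y_true, y_hat, average=None)

print(demo_acc, demo_micro)   # 6 of 8 correct, so both are 0.75
print(per_class)              # per-class F1: 0.75, 0.8, 0.666...
print(demo_macro)             # unweighted mean of the three per-class scores
```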


### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster. 

In [24]:
# your code
top10 = []
for label in label_encoder.classes_:
    indices = np.where(y_pred == label)

    x_means = np.mean(Xtr[indices], axis=0)

    x_means = np.array(x_means).ravel() #convert to 1d array
    sorted_means = np.argsort(x_means)[::-1][:10] #indices of the top 10 scores
    print(sorted_means)

    # vocabulary terms by column index (older scikit-learn releases used get_feature_names())
    features = vectorizer.get_feature_names_out()
    top_features = [(features[i], x_means[i]) for i in sorted_means]
    top10.append(top_features)
top10

[17856 18046 12495  1536  9254  8575  9757  9789  7484 17849]
[17856 18046 12495  9254  1536  7484 17849  9757 12574  9789]
[17856 18046 12495  9254  1536  7484 17849 15530  8575  9757]
[17856 18046 12495  1536  9254  7484  8575 12574  9757 17849]
[17856 18046  9254  1536 12495  7484  9789 12574 15530  9757]


[[('the', 0.21185142727241907),
  ('to', 0.10888259281494002),
  ('of', 0.0784287000293933),
  ('and', 0.07639008273787903),
  ('in', 0.07462298860399737),
  ('he', 0.03970661921177506),
  ('is', 0.03889639420979456),
  ('it', 0.03793374045626795),
  ('for', 0.03739803709337121),
  ('that', 0.0373378730038568)],
 [('the', 0.21065390527794495),
  ('to', 0.09707225960726662),
  ('of', 0.07799136405621165),
  ('in', 0.07507077491352893),
  ('and', 0.0737558852325052),
  ('for', 0.042479964762264504),
  ('that', 0.03795431533502796),
  ('is', 0.03714259150295907),
  ('on', 0.03368356245026184),
  ('it', 0.03320164578077916)],
 [('the', 0.22374116397960972),
  ('to', 0.09681747127731297),
  ('of', 0.08019039870994858),
  ('in', 0.07866617521033697),
  ('and', 0.07218754380845005),
  ('for', 0.03904680107950239),
  ('that', 0.03886392564284952),
  ('said', 0.038753451323058),
  ('he', 0.03723947539383976),
  ('is', 0.03542445684023885)],
 [('the', 0.2032945660523669),
  ('to', 0.101151237036

In [25]:
# Keep only the words, dropping the mean TF-IDF scores
top10_words = [[word for word, _ in cluster] for cluster in top10]
top10_words

[['the', 'to', 'of', 'and', 'in', 'he', 'is', 'it', 'for', 'that'],
 ['the', 'to', 'of', 'in', 'and', 'for', 'that', 'is', 'on', 'it'],
 ['the', 'to', 'of', 'in', 'and', 'for', 'that', 'said', 'he', 'is'],
 ['the', 'to', 'of', 'and', 'in', 'for', 'he', 'on', 'is', 'that'],
 ['the', 'to', 'in', 'and', 'of', 'for', 'it', 'on', 'said', 'is']]