If you are using Google Colab, run the cell below to mount your Google Drive in your Colab instance. Adjust the path if your files are stored elsewhere in your Google Drive.

Outside Google Colab, the cell does nothing and can be run safely.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/07_TextMining'
except ImportError:
    # not running on Google Colab: nothing to mount
    pass

## Exercise 7: Text Mining

### 7.1. Which documents are similar?

#### 7.1.1. The file documents.zip is provided in ILIAS and contains three corpora. Load and vectorize the 4-documents corpus using the load_files function. How many different attributes does the generated example set have?

In [2]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd

corpus_4_docs = load_files('DataSetEx7', categories=['corpus-4docs'], encoding='utf-8')

# create a vectorizer and transform the documents

Answer: 947 attributes
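
The attribute count is the number of columns of the document-term matrix, which equals the vocabulary size of the fitted vectorizer. A minimal sketch on a made-up three-document corpus (the sentences and the names `toy_docs`, `toy_vectorizer` and `toy_matrix` are illustrative assumptions, not part of `documents.zip`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the exercise data (hypothetical sentences).
toy_docs = [
    "Madrid beat United in the league",
    "United lost to Madrid",
    "The league table changed",
]

# Default settings: lowercase the text and keep tokens of two or more characters.
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Rows are documents, columns are attributes (distinct tokens).
n_docs, n_attributes = toy_matrix.shape
print(f"{n_docs} documents, {n_attributes} attributes")
```

The same `shape` lookup should work on the matrix produced from `corpus_4_docs.data`.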

#### 7.1.2. Examine the generated word list. What are the most common words? Look for the three most common words that might be helpful for text mining tasks!

In [3]:
import pandas as pd

def generate_word_list(X, Y, feature_names, target_names):
    d = pd.DataFrame(X.toarray(), columns=feature_names)
    doc = d[ d>0 ].count()
    d = d.assign(target=Y)
    d = d.groupby(by='target').sum()
    d = d.transpose()
    d.columns = target_names
    total = d.sum(axis=1)
    d = d.assign(total_occurrences=total)
    d = d.assign(document_occurrences=doc)
    d = d.sort_values(by='total_occurrences', ascending=False)
    return d

In [4]:
# create the word list from the transformed dataset and show it


Answer: It is hard to find common words that help with text mining because the top of the list consists of so-called stopwords. Madrid appears at position 30, followed by United, which may indicate a football match. League is listed at position 46, which supports this conclusion.

#### 7.1.3. Remove stopwords and apply the Porter stemmer. By how many attributes do the operators reduce the size of your example set?
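
The reduction can be measured by comparing the vocabulary sizes of a plain vectorizer and one that filters stopwords and stems. The sketch below is only an illustration under assumptions: it uses two made-up sentences, scikit-learn's built-in `ENGLISH_STOP_WORDS` instead of the NLTK list (to avoid a download), and a hypothetical helper `stem_tokenize`.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Two made-up sentences; not taken from the exercise corpus.
toy_docs = ["The games are running", "A game was run"]

word_pattern = re.compile(r"(?u)\b\w\w+\b")
porter = PorterStemmer()

def stem_tokenize(text):
    # Drop stopwords, then reduce each remaining token to its Porter stem.
    return [porter.stem(tok) for tok in word_pattern.findall(text)
            if tok not in ENGLISH_STOP_WORDS]

# CountVectorizer lowercases the text before it calls the custom tokenizer.
plain = CountVectorizer().fit(toy_docs)
stemmed = CountVectorizer(tokenizer=stem_tokenize, token_pattern=None).fit(toy_docs)

reduction = len(plain.vocabulary_) - len(stemmed.vocabulary_)
print(sorted(stemmed.vocabulary_), reduction)
```

Inflected forms such as "games" and "game" collapse into one attribute, which is where most of the reduction beyond stopword removal comes from.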

In [6]:
!pip install nltk
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
import re, string

stemmer = PorterStemmer()
token_pattern = re.compile(r"(?u)\b\w\w+\b")
my_stopwords = set(stopwords.words('english'))

def tokenize(text):
    stems = []
    tokens = token_pattern.findall(text)
    for item in tokens:
        if item not in my_stopwords:
            stems.append(stemmer.stem(item))
    return stems

In [None]:
# create a new vectorizer with stemming and transform the documents again

# re-create the word list based on the new vectorizer


#### 7.1.4. Compute the cosine similarity on TF-IDF vectors between the documents with the cosine_similarity function. Which documents are most similar? Can you confirm the judgment of the algorithm by reading the documents?
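
One possible way to locate the most similar pair is to mask the diagonal of the similarity matrix and take the largest remaining entry. The following is a sketch on four made-up sentences (the corpus and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents; the first two are deliberately near-duplicates.
toy_docs = [
    "madrid won the league match",
    "madrid lost the league match",
    "the bank raised rates",
    "stocks fell sharply",
]

toy_vectors = TfidfVectorizer().fit_transform(toy_docs)

# Square matrix: entry (i, j) is the cosine similarity of documents i and j.
sim = cosine_similarity(toy_vectors)

# Ignore the diagonal (each document compared with itself) before searching.
masked = sim.copy()
np.fill_diagonal(masked, -1.0)
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(f"most similar pair: {i} and {j} (similarity {sim[i, j]:.2f})")
```

The resulting indices can then be used to print and read the two documents side by side.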

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# create a vectorizer that uses TF-IDF weights


# calculate the cosine similarity between all documents and show the results

In [None]:
# print the two most similar documents

#TODO: change the indices to the indices of the most similar documents
idx1 = 0
idx2 = 0

print(corpus_4_docs.data[idx1][:500])
print('\n==================\n')
print(corpus_4_docs.data[idx2][:500])

#### 7.1.5. Experiment with different similarity metrics as well as with different vector creation methods. Which combination produces the best similarity scores?

For different pairwise distances, you can use the [pairwise_distances function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html).
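
A compact way to compare combinations is to loop over vectorizers and metrics and store one distance matrix per pair. The sketch below assumes three made-up sentences and an arbitrary selection of vectorizers and metrics; note that `pairwise_distances` returns distances, so smaller values mean more similar documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Made-up documents; the first two share most of their words.
toy_docs = [
    "madrid won the league match",
    "madrid lost the league match",
    "the bank raised rates",
]

# Three vector creation methods: binary occurrence, term counts, TF-IDF weights.
vectorizers = {
    "binary": CountVectorizer(binary=True),
    "counts": CountVectorizer(),
    "tf-idf": TfidfVectorizer(),
}
metrics = ["cosine", "euclidean", "manhattan"]

results = {}
for vec_name, vec in vectorizers.items():
    # Dense array keeps the example independent of sparse-matrix support per metric.
    X = vec.fit_transform(toy_docs).toarray()
    for metric in metrics:
        results[(vec_name, metric)] = pairwise_distances(X, metric=metric)

# Show the distances from document 0 to all documents for each combination.
for key, dist in results.items():
    print(key, dist[0].round(2))
```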

In [None]:
from sklearn.metrics.pairwise import *
from sklearn.feature_extraction.text import *

# create different vectorizers


# calculate the features


# calculate different similarity/distance functions


### 7.2. Learn a Classifier for the 300-Documents Corpus
#### 7.2.1. The 300-documents corpus contains postings from three different news groups. Vectorize the 300-documents corpus and learn a classifier for classifying the postings. Evaluate the classifier using 10-fold X-Validation. Which accuracy does your classifier reach? Increase the performance of your classifier by pruning the document vectors.
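
Pruning can be expressed through the `min_df` (and `max_df`) parameters of the vectorizer, and placing the vectorizer inside a `Pipeline` keeps the pruning inside each cross-validation fold. The sketch below runs on a synthetic corpus; the topic vocabularies, the `typo...` tokens and the helper `evaluate` are assumptions made up for illustration.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the news group postings (hypothetical vocabulary).
topics = {
    0: ["engine", "car", "brake", "wheel", "driver"],
    1: ["orbit", "rocket", "moon", "launch", "planet"],
    2: ["goal", "team", "coach", "season", "player"],
}
rng = random.Random(0)
toy_docs, toy_labels = [], []
for label, words in topics.items():
    for n in range(10):
        # Each posting mixes topic words with one token that occurs only once.
        tokens = rng.choices(words, k=6) + [f"typo{label}x{n}"]
        toy_docs.append(" ".join(tokens))
        toy_labels.append(label)

cv10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

def evaluate(min_df):
    # min_df prunes terms that occur in fewer than min_df documents.
    pipe = Pipeline([
        ("vect", CountVectorizer(min_df=min_df)),
        ("clf", MultinomialNB()),
    ])
    scores = cross_val_score(pipe, toy_docs, toy_labels, cv=cv10, scoring="accuracy")
    n_terms = len(CountVectorizer(min_df=min_df).fit(toy_docs).vocabulary_)
    return scores, n_terms

full_scores, full_terms = evaluate(min_df=1)
pruned_scores, pruned_terms = evaluate(min_df=2)
print(f"unpruned: {full_terms} terms, accuracy {full_scores.mean():.2f}")
print(f"min_df=2: {pruned_terms} terms, accuracy {pruned_scores.mean():.2f}")
```

On real postings the effect of pruning on accuracy depends on the data, so the thresholds are best treated as parameters to search over.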

In [None]:
import matplotlib.pyplot as plt

corpus_300_docs = load_files('DataSetEx7/corpus-300docs',encoding='utf-8')

class_dist = pd.Series(corpus_300_docs.target).value_counts()
plt.bar(class_dist.index, class_dist)
plt.show()

In [None]:
# create a vectorizer

# inspect the word list

First, we create a baseline model with all features:

In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# create a vectorizer for your baseline


# define the cross-validation splits (10 folds, as requested in the exercise)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# evaluate a baseline model

Then, we test different pruning approaches:

In [None]:
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# define a pipeline and parameter grid


# define the cross-validation splits for the nested CV
nested_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# define and evaluate a grid search


#### 7.2.2 Try to do the same classification as in 7.2.1 using word2vec embeddings. You can aggregate word embeddings to get a document representation by applying mean pooling (elementwise average of word vectors).
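
Mean pooling sums the vectors of all known words in a document and divides by their number. A minimal sketch with a hypothetical three-dimensional embedding table (`toy_embeddings` and `mean_pool` are made-up names; the real word2vec model has 300 dimensions):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a word2vec model.
toy_embeddings = {
    "rocket": np.array([1.0, 0.0, 2.0]),
    "launch": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, embeddings, dim):
    # Collect vectors for known tokens; out-of-vocabulary tokens are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to the zero vector.
        return np.zeros(dim)
    # Elementwise average of the collected word vectors.
    return np.mean(vectors, axis=0)

doc_vector = mean_pool(["rocket", "launch", "unknownword"], toy_embeddings, dim=3)
print(doc_vector)
```

Dividing by the number of known words rather than by the document length keeps out-of-vocabulary words from shrinking the document vector.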

In [None]:
# this will download the model (a download of more than 1 GB) - to change the target folder, uncomment and execute the following two lines
#import os
#os.environ["GENSIM_DATA_DIR"] = "C:/cache"

import gensim.downloader
word2vec_model = gensim.downloader.load('word2vec-google-news-300')

In [None]:
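import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator

class Word2VecVectorizer(BaseEstimator, TransformerMixin):
    """Represent each document as the mean of its word2vec word vectors."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def transform(self, X):
        return_matrix = []
        for doc in X:
            mean_vector = np.zeros(self.model.vector_size)
            count = 0
            for word in self.tokenizer(doc):
                try:
                    word_vector = self.model[word]
                except KeyError:
                    # skip words that are not in the embedding vocabulary
                    continue
                count += 1
                mean_vector = np.add(mean_vector, word_vector)

            # divide the summed vectors by the number of known words (mean pooling);
            # documents without any known word keep the zero vector
            if count > 0:
                mean_vector = mean_vector / count
            return_matrix.append(mean_vector)
        return np.array(return_matrix)

    def fit(self, X, y=None, **fit_params):
        return self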

# initialize Word2VecVectorizer and run it similarly to 7.2.1

#### 7.2.3 Now do the same using BERT embeddings from the huggingface library. Experiment with mean pooling as well as using the [CLS] token representation as document representations.
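
The two document representations differ only in how the token embeddings are reduced: the [CLS] variant takes the first position, while mean pooling averages over all non-padding positions. The sketch below illustrates this with small NumPy arrays standing in for the model output and attention mask (the numbers are made up, and a real BERT hidden state typically has 768 dimensions):

```python
import numpy as np

# Stand-in for a model output of shape (batch, tokens, hidden) = (1, 4, 2).
# Position 0 plays the role of [CLS]; the last position is padding.
token_embeddings = np.array([[[1.0, 1.0], [2.0, 4.0], [6.0, 1.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 1, 0]])

# [CLS] representation: hidden state of the first token.
cls_vector = token_embeddings[:, 0, :]

# Mean pooling: average only over positions where the mask is 1.
mask = attention_mask[:, :, None].astype(float)
summed = (token_embeddings * mask).sum(axis=1)
# The clip guards against division by zero for an all-padding input.
mean_vector = summed / np.clip(mask.sum(axis=1), 1e-9, None)

print(cls_vector, mean_vector)
```

The padding position does not influence the mean because its mask entry is zero.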

In [None]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

class BertVectorizer(BaseEstimator, TransformerMixin):

    def __init__(self, model, tokenizer, use_cls=False):
        self.model = model
        self.tokenizer = tokenizer
        self.use_cls = use_cls

    def bert_mean_pooling(self, model_output, attention_mask):
        # first element of model_output contains all token embeddings
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # average over real tokens only; the clamp avoids a division by zero
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def transform(self, X):
        return_matrix = []

        # switch to evaluation mode once (disables dropout) before encoding
        self.model.eval()

        for doc in X:
            # documents longer than 512 tokens are truncated
            tokenized = self.tokenizer(doc, padding=True, truncation=True, max_length=512, return_tensors='pt')

            with torch.no_grad():
                output = self.model(tokenized['input_ids'], attention_mask=tokenized['attention_mask'])
            if self.use_cls:
                # for BERT models, the second element is the pooled output derived from the [CLS] token
                return_matrix.append(output[1].squeeze(0).numpy())
            else:
                mean_pooled = self.bert_mean_pooling(output, tokenized['attention_mask'])
                return_matrix.append(mean_pooled.squeeze(0).numpy())

        return np.array(return_matrix)

    def fit(self, X, y=None, **fit_params):
        return self

# initialize BertVectorizer and run it similarly to 7.2.1

### 7.3. Learn a Classifier for the Job Postings
#### 7.3.1. The Job Postings corpus contains 500 descriptions of open positions belonging to 30 different job categories. The corpus is provided as an Excel file in ILIAS. Vectorize the corpus and learn a Naïve Bayes classifier for classifying the job ads. Evaluate the classifier using 10-fold X-Validation. Analyze the classifier performance and the word list. What do you discover?

In [None]:
import pandas as pd
job_postings = pd.read_excel('DataSetEx7/JobPostings.xls')
job_postings.head()

In [None]:
job_postings_target = job_postings['Category']
job_postings_data = job_postings['JobText']

In [None]:
import matplotlib.pyplot as plt

# plot and inspect the class distribution

In [None]:
# vectorize the documents and show the word list

#### 7.3.2 Experiment with different vector creation and pruning methods as well as different types of classifiers in order to increase the performance. What is the highest accuracy that you can reach? Which problem concerning precision and recall remains?
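
With 30 categories spread over 500 postings, overall accuracy may hide weak results on small categories, so it can help to look at precision and recall per class. The sketch below uses made-up labels and predictions (not results from the Job Postings corpus) to show how such a breakdown could be computed:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted job categories for eight postings.
y_true = ["IT", "IT", "IT", "IT", "IT", "Sales", "Sales", "Legal"]
y_pred = ["IT", "IT", "IT", "IT", "IT", "Sales", "IT", "IT"]

labels = ["IT", "Sales", "Legal"]
# zero_division=0 reports 0 for classes that are never predicted.
precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy: {accuracy:.2f}")
for name, p, r, s in zip(labels, precision, recall, support):
    print(f"{name:<6} precision={p:.2f} recall={r:.2f} support={s}")
```

In this toy case the accuracy looks acceptable although the smallest class is never predicted. For the real corpus, the predictions could be obtained with `cross_val_predict` and passed to the same function.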

In [None]:
# setup and evaluate a baseline model

In [None]:
from sklearn.tree import DecisionTreeClassifier

# create a pipeline and parameter grid


# create and evaluate a grid search
