Libraries for natural language processing (NLP)
================================================

In this lab, you will train an unsupervised language model on a corpus. You will also use a pre-trained language model.

In this lab, we will use the external library: [Gensim](https://radimrehurek.com/gensim/).

In this lab, you will also practice using default and keyword arguments to functions, as well as using classes, instance variables and methods.

The questions marked in bold might appear in the quiz.

I. Importing modules and defining functions
----------------------------------------
First, the modules, classes and functions needed are imported.
Then three functions are defined.
(Nothing happens when you run this code block. But if you run individual code blocks below, you need to run this block first, to have everything imported and defined.)

In [1]:
import glob

# Import a class from gensim.models
from gensim.models import Word2Vec

# Import another class from gensim.models
from gensim.models import KeyedVectors


# Import a sub-module (when you use the function 'simple_preprocess'
# from this sub-module,
# you need to prefix it with 'utils')
from gensim import utils

KEYED_VECTORS_NAME = "newsgroups.kv"

def train_model_on_documents_list(document_list, window=5, min_count=5, nr_of_documents=None):
    
    if nr_of_documents:
        documents_to_use =  document_list[:nr_of_documents]
    else:
        documents_to_use = document_list
        
    # A simple built-in utility function that lowercases the text and splits it into tokens
    # (Need to prefix 'simple_preprocess' with 'utils', since only 'utils' is imported)
    token_lists = [utils.simple_preprocess(document) for document in documents_to_use]
    
    # Create an instance of the Word2Vec class, which also trains a model
    # (The class Word2Vec is imported, so no prefixes are needed)
    model = Word2Vec(sentences=token_lists, window=window, min_count=min_count, workers = 4)
    
    return model

def get_similar_words(word, model_vectors, topn=10):
    if word in model_vectors:
        return model_vectors.most_similar(word, topn=topn)
    else:
        return None

def get_and_preprocess_news_group_texts():
    data = []
    
    for j in glob.glob(f'/kaggle/input/news-groups-lab/20news-18828/*/*'):
        with open(j, 'r', encoding='cp1252') as f:
            lines = [line.replace("_", " ").strip() for line in f.readlines() if not (line.startswith("From:") or line.startswith("Subject:"))]
            data.append(" ".join(lines))
    return data

**1) ❓💬❓There is a Python naming convention for classes, and another naming convention for functions and methods. Can you see the different naming conventions in the code above? (We will not talk much about object orientation in this course. But to use external libraries, you need to be able to use classes, functions and methods.)**

> !💬! The naming convention for classes is *camel case*, i.e. start each word with a capital letter and leave out spaces, e.g. ```CamelCase```. The naming convention for functions and methods is *snake case*, i.e. use only small letters and separate words with underscore characters, e.g. ```snake_case```. Examples of classes above are *Word2Vec* and *KeyedVectors*, examples of methods are *most_similar*, *replace*, *strip*.

**2) ❓💬❓Which arguments for the two functions above have default values?**

> !💬! The function ```train_model_on_documents_list``` has default values for arguments ```window```, ```min_count``` and ```nr_of_documents```.<br>
> The function ```get_similar_words``` has a default value for only one argument: ```topn```.
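
To make the effect of default values concrete, here is a minimal sketch. The function `describe_training` is a hypothetical stand-in with the same signature as `train_model_on_documents_list`; it trains nothing and only reports which settings would be used.

```python
def describe_training(document_list, window=5, min_count=5, nr_of_documents=None):
    """Hypothetical stand-in: reports the settings instead of training a model."""
    if nr_of_documents:
        documents_to_use = document_list[:nr_of_documents]
    else:
        documents_to_use = document_list
    return {"documents": len(documents_to_use), "window": window, "min_count": min_count}


docs = ["first text", "second text", "third text"]

# Only the required positional argument is given: every default value applies
print(describe_training(docs))

# One default is overridden by name; the others keep their default values
print(describe_training(docs, min_count=10))

# Keyword arguments can be given in any order
print(describe_training(docs, nr_of_documents=2, window=4))
```

Passing arguments by keyword also documents the call: a reader does not need to remember the parameter order to see what `4` or `10` means.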

II. Import a text corpus and train a word2vec model
----------------------------------------------------
Below, you read in a text corpus, using the function `get_and_preprocess_news_group_texts` defined above. Then you use this text corpus to train a word2vec model. Finally, the word vectors of the model are saved. (See "Output" to the right, i.e. /kaggle/working.)



In [2]:
newsgroups_data = sorted(get_and_preprocess_news_group_texts())

# The function train_model_on_documents_list creates an instance of the class Word2Vec
model = train_model_on_documents_list(newsgroups_data, nr_of_documents=1000)

print("We have created an instance of:", type(model))
trained_vectors = model.wv

# Saving the keyed-vectors instance, rather than the full model, takes up less space
# The drawback is that you can't continue training it
trained_vectors.save(KEYED_VECTORS_NAME)

We have created an instance of: <class 'gensim.models.word2vec.Word2Vec'>


**3) ❓💬❓ Two Python [built-in](https://docs.python.org/3/library/functions.html) functions are used. Can you see where?** How are they colour-coded in Kaggle?

> !💬! Actually, *three* built-in functions are used: ```sorted```, ```print``` and ```type```. Kaggle colour-codes them green.

**4) ❓💬❓Our own defined function `train_model_on_documents_list` returns an instance of the gensim [Word2Vec-class](https://radimrehurek.com/gensim/models/word2vec.html). `model.wv` is an instance-variable that stores a KeyedVectors instance. When you run `trained_vectors.save(KEYED_VECTORS_NAME)`, you use one of the instance-methods of KeyedVectors. Can you spot the syntax difference between an instance-variable and an instance-method?**

> !💬! The instance method is ```save```, and the difference in syntax compared to an instance variable is that an instance method is always called with arguments inside parentheses, even if there are zero arguments, i.e. ```save()```. 
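
The difference can be tried out on a small class of your own. The sketch below uses a hypothetical toy class, `VectorStore`, which is not part of gensim and only loosely mimics how `model.wv` and `save()` are accessed.

```python
class VectorStore:
    """Hypothetical toy class, not part of gensim."""

    def __init__(self, vectors):
        # Instance variable: data stored on this particular instance
        self.vectors = vectors

    def size(self):
        # Instance method: a function that operates on the instance
        return len(self.vectors)


store = VectorStore({"tea": [0.1, 0.2], "juice": [0.3, 0.4]})

print(store.vectors)  # instance variable: accessed without parentheses
print(store.size())   # instance method: called with parentheses
print(store.size)     # without parentheses, the method object is printed, not its result
```

The last line shows a common mistake: leaving out the parentheses does not raise an error, it just refers to the method without calling it.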




In [3]:
# Load an instance of KeyedVectors
loaded_word2vec_vectors = KeyedVectors.load(KEYED_VECTORS_NAME)
print("You have loaded an instance of the class:", type(loaded_word2vec_vectors))

# Print some statistics for the KeyedVectors instance that you loaded
print("\nThe total number of word-vectors in the model is:", len(loaded_word2vec_vectors))
print("\nThe first five elements in the first vector look like this: \n", 
      loaded_word2vec_vectors[0][:5])
print("The first five elements in the second vector look like this: \n", 
      loaded_word2vec_vectors[1][:5])


print()
for word in ["computer", "jump", "dance", "bike", "good", "tea", "coffe", "sing", "juice"]:
    print(word, get_similar_words(word, loaded_word2vec_vectors))

You have loaded an instance of the class: <class 'gensim.models.keyedvectors.KeyedVectors'>

The total number of word-vectors in the model is: 6747

The first five elements in the first vector look like this: 
 [-0.19201891 -0.48475322 -0.5871777   2.461023   -1.4539908 ]
The first five elements in the second vector look like this: 
 [-0.6393317  -0.24812081 -0.43280372  0.99472964 -0.22260278]

computer [('international', 0.9925001263618469), ('relativity', 0.9911588430404663), ('bodies', 0.9906104803085327), ('seattle', 0.9900129437446594), ('seven', 0.9898717403411865), ('security', 0.9890558123588562), ('persons', 0.9889115691184998), ('various', 0.9885237812995911), ('rate', 0.9882726073265076), ('institutes', 0.9881563186645508)]
jump [('hardware', 0.9903427958488464), ('monitors', 0.9898173809051514), ('avoid', 0.9886245131492615), ('murder', 0.9884785413742065), ('union', 0.9881682395935059), ('scanner', 0.9880817532539368), ('grant', 0.9876208901405334), ('alone', 0.9872841835

In [4]:
# Retrain model (task 9 below)
model_retrained = train_model_on_documents_list(newsgroups_data, nr_of_documents=100000, window=4, min_count=10)
for word in ["computer", "jump", "dance", "bike", "good", "tea", "coffe", "sing", "juice"]:
    similar = get_similar_words(word, model_retrained.wv)
    pretty = [x[0]+':'+str(round(x[1], 2)) for x in similar] if similar else '-'
    print(word, ':', ', '.join(pretty) ,'\n')

computer : lab:0.65, architecture:0.63, engineering:0.62, network:0.61, silicon:0.61, software:0.6, computing:0.6, multimedia:0.59, technology:0.59, systems:0.59 

jump : hang:0.77, turn:0.74, sit:0.72, walk:0.72, break:0.69, move:0.68, pass:0.67, pull:0.67, drop:0.67, kick:0.66 

dance : panasonic:0.69, vhs:0.67, zyxel:0.66, strap:0.66, cassettes:0.66, steel:0.65, disc:0.65, broadway:0.65, xga:0.65, suns:0.64 

bike : riding:0.78, car:0.74, ride:0.74, tires:0.73, seat:0.73, helmet:0.72, tire:0.7, truck:0.68, cage:0.68, passenger:0.66 

good : bad:0.75, great:0.63, fair:0.61, nice:0.61, reasonable:0.58, poor:0.58, best:0.57, decent:0.57, big:0.56, perfect:0.53 

tea : cups:0.69, goaltenders:0.69, portugal:0.65, crunching:0.64, beers:0.64, skates:0.63, jacket:0.63, shorted:0.63, corn:0.63, seater:0.62 

coffe : - 

sing : crush:0.78, cain:0.76, forsake:0.75, kaan:0.75, eph:0.73, bury:0.73, persist:0.73, chorus:0.72, wreck:0.71, bribe:0.7 

juice : flu:0.77, glasses:0.76, sixteen:0.76, j

In [5]:
# Task 11 below
len(model_retrained.wv)

22142

**5) ❓💬❓When you load your saved KeyedVectors, you use a class-method. How do you see that it is a class-method?**

> !💬! Class methods are called by prefixing them with the class name (rather than with an instance), in this case ```KeyedVectors.load()```.
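
As an illustration of how a class method differs from an instance method, here is a minimal sketch with a hypothetical class `ToyVectors` that saves and loads itself as JSON. It is an assumption-laden toy and says nothing about how gensim stores its files.

```python
import json
import os
import tempfile


class ToyVectors:
    """Hypothetical class used only to contrast the two kinds of methods."""

    def __init__(self, vectors):
        self.vectors = vectors

    def save(self, path):
        # Instance method: needs an existing instance (self)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vectors, f)

    @classmethod
    def load(cls, path):
        # Class method: receives the class (cls) and builds a new instance
        with open(path, "r", encoding="utf-8") as f:
            return cls(json.load(f))


path = os.path.join(tempfile.mkdtemp(), "toy_vectors.json")

ToyVectors({"tea": [0.1, 0.2]}).save(path)  # called on an instance
loaded = ToyVectors.load(path)              # called on the class itself
print(type(loaded).__name__, loaded.vectors)
```

Loading is a natural fit for a class method: before the file has been read, there is no instance to call the method on.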

**6) ❓💬❓ In the code that trains a Word2Vec model and creates an instance of a Word2Vec model  `model = Word2Vec(sentences=token_lists, window=window, min_count=min_count)` three keyword arguments are used (and not positional arguments).**

There are many other parameters to the instantiation of `Word2Vec`, and they get their default values. What are these parameters, and their default values? (See the documentation: [Word2Vec-class](https://radimrehurek.com/gensim/models/word2vec.html), search for `gensim.models.word2vec.Word2Vec`). You don't have to understand what the parameters do, it's just to practice reading documentation for external libraries.

> !💬! Remaining parameters are: corpus_file, vector_size, alpha, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, ns_exponent, cbow_mean, hashfxn, epochs, null_word, trim_rule, sorted_vocab, batch_words, compute_loss, callbacks, comment, max_final_vocab, shrink_windows
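
Besides reading the documentation, default values can also be inspected from code with the standard-library module `inspect`. The sketch below applies this to a hypothetical function `train`, whose parameters and defaults are made up for the illustration.

```python
import inspect


def train(sentences, window=5, min_count=5, workers=3):
    """Hypothetical function; its defaults are made up for this illustration."""
    return None


def default_values(callable_object):
    """Return {parameter name: default value} for all parameters that have a default."""
    signature = inspect.signature(callable_object)
    return {
        name: parameter.default
        for name, parameter in signature.parameters.items()
        if parameter.default is not inspect.Parameter.empty
    }


print(default_values(train))
```

With gensim installed, `default_values(Word2Vec)` should list the defaults of the constructor in the same way, assuming the class exposes a regular Python signature.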

**7) ❓💬❓Find out, either by reading the documentation for `gensim.models.word2vec.Word2Vec`, or by printing out the length of the vectors in the code above. What is the length of the vectors for the Word2vec model that has been trained?**

> !💬! Vector length is specified by the keyword argument ```vector_size```, and here the default value of **100** is applied.

8) Take a look at the output of the model. How many words does the model contain? The semantically closest words to the given words are shown. Are they semantically close in your opinion?

> !💬! The model contains **6747** words. To a human reader, most of the listed neighbours are not semantically close to the query word.

9) 💻💻 Change the arguments that you use for invoking the `train_model_on_documents_list` function. Currently, default values are used for some arguments. Instead, use a window size of *4*, set the minimum word frequency (`min_count`) to *10*, and set the number of documents to use for training to *100000*.

10) With the new settings, how is the quality of the semantically similar words? Is it the same, or has it changed?

> !💬! The semantic similarity is much better in general, but still poor for some of the words.

**11) ❓💬❓ How many word-vectors does the model now contain?**

> !💬! The model contains **22142** words.


III. Use a pre-trained word2vec model
--------------------------------------
Here, we will use a pre-trained model.

It is trained on a Google News corpus of roughly 100 billion words, contains vectors for about 3 million words and phrases, and has 300-dimensional vectors (i.e., vectors that contain 300 elements).

In [6]:
# Load the pre-trained vectors (a large file, so loading is slow)
print("Loading a pre-trained language model. Might take some time")
pre_trained_vectors = KeyedVectors.load_word2vec_format("/kaggle/input/google-news/GoogleNews-vectors-negative300.bin", binary=True, unicode_errors='ignore')

for word in ["computer", "jump", "dance", "bike", "good"]:
    print(word, get_similar_words(word, pre_trained_vectors))




Loading a pre-trained language model. Might take some time
computer [('computers', 0.7979379892349243), ('laptop', 0.6640493273735046), ('laptop_computer', 0.6548868417739868), ('Computer', 0.647333562374115), ('com_puter', 0.6082080006599426), ('technician_Leonard_Luchko', 0.5662748217582703), ('mainframes_minicomputers', 0.5617720484733582), ('laptop_computers', 0.5585449934005737), ('PC', 0.5539618730545044), ('maker_Dell_DELL.O', 0.5519254207611084)]
jump [('jumps', 0.7040783166885376), ('jumping', 0.6920815706253052), ('leap', 0.6579699516296387), ('jumped', 0.5995128154754639), ('climb', 0.5455771088600159), ('leaping', 0.5435682535171509), ('Jump', 0.5267875790596008), ('drop', 0.5252555012702942), ('leaped', 0.5227952599525452), ('leapt', 0.5092256665229797)]
dance [('dancing', 0.8380802869796753), ('dances', 0.8213121891021729), ('dancers', 0.7513905763626099), ('Dance', 0.739539384841919), ('ballroom_dance', 0.7040098309516907), ('dancer', 0.6916878819465637), ('dance_troupe'

12) With this larger model, how is the quality of the semantically similar words? The same, or has it changed?

> !💬! The semantic similarity generally makes better sense. The tokenization could probably be better, since many of the returned neighbours are variants of the same word. These are not wrong, but they might not be that informative.

13) The number next to the neighbouring words is the cosine similarity between the vectors for the words. So, e.g., 'good' and 'great' have the cosine similarity '0.7291510105133057'. 
Read the documentation for [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) for how to use the instance-method `similarity()` to calculate the similarity between (i) the two words 'book' and 'novel', and (ii) the two words 'book' and 'mind'.  

**14) ❓💬❓ What is the cosine similarity between 'book' and 'mind'?** 

> !💬! Output from the code below: 0.13833922

In [7]:
# Similarity between 'book' and 'newspaper':
print('book and newspaper similarity', pre_trained_vectors.similarity('book', 'newspaper'))

# Similarity between 'book' and 'mind':
print('book and mind similarity', pre_trained_vectors.similarity('book', 'mind'))

print("Distances printed")

book and newspaper similarity 0.22616692
book and mind similarity 0.13833922
Distances printed
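
Note that `similarity()` returns a cosine *similarity*, where a higher value means more similar, not a distance. As a sketch of what is being computed, the function below calculates the cosine similarity in plain Python. The 3-dimensional vectors are made up for the illustration and are not taken from any trained model.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equally long vectors."""
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot_product / (norm_a * norm_b)


# Made-up vectors: same direction, orthogonal, and opposite direction
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

Since only the angle between the vectors matters, a vector and a scaled copy of it are maximally similar. A corresponding cosine distance is commonly defined as one minus the similarity.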


IV. Find out which words are the most common in different news groups
----------------------------------------------------------------------
Here you will use classes from the machine learning library [scikit-learn](https://scikit-learn.org/stable/index.html)
and a stopword list from the [NLTK library](https://www.nltk.org).
  

In [8]:
import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

def get_and_preprocess_news_group_texts_separate_groups(list_of_subjects_to_select):
    
    # Only select a subset of the newsgroup data
    data_for_all_subjects = []
    for subject in list_of_subjects_to_select:
        print("Reading from:", subject)
        data_for_subject = []
        for file_name in glob.glob(f'/kaggle/input/news-groups-lab/20news-18828/{subject}/*'):
            with open(file_name, 'r', encoding='cp1252') as f:
                lines = [line.replace("_", " ").strip() for line in f.readlines() if not (line.startswith("From:") or line.startswith("Subject:"))]
                data_for_subject.append(" ".join(lines))
        data_for_all_subjects.append(" ".join(data_for_subject))
    return data_for_all_subjects


# Select a subset of the newsgroups to investigate
news_group_subjects = ['sci.space', 'sci.med', 'sci.crypt', 'rec.sport.baseball', 'alt.atheism']
  
# Read the texts
text_data = get_and_preprocess_news_group_texts_separate_groups(news_group_subjects)
print(f"Have read texts for {len(text_data)} news groups\n")

# Count the frequencies of the different words

# The line below is the original, changed in task 16 below
# vectorizer = CountVectorizer(stop_words = stopwords.words('english'))
#
# The line below was again changed in task 18 
# vectorizer = CountVectorizer(stop_words = stopwords.words('english'), max_df=len(news_group_subjects)-1, ngram_range=(1,2))
vectorizer = TfidfVectorizer(stop_words = stopwords.words('english') + ['msg', 'n3jxp'], max_df=len(news_group_subjects)-1, ngram_range=(1,2))

X = vectorizer.fit_transform(text_data)

# Report the output
NR_OF_WORDS_TO_SHOW = 100
for transformed, subject in zip(X, news_group_subjects):
    
    score_vec = transformed.toarray()[0] #Get a vector with scores for the subject, for each of the words in the corpus
    scores_with_words = [(score, word) for score, word in zip(score_vec, vectorizer.get_feature_names_out())] 
        
    sorted_scores_with_words = sorted(scores_with_words, reverse=True)
    
    print(f"\nThe {NR_OF_WORDS_TO_SHOW} most typical words for {subject.upper()} are:")
    print(sorted_scores_with_words[:NR_OF_WORDS_TO_SHOW])

Reading from: sci.space
Reading from: sci.med
Reading from: sci.crypt
Reading from: rec.sport.baseball
Reading from: alt.atheism
Have read texts for 5 news groups


The 100 most typical words for SCI.SPACE are:
[(0.20800392767693884, 'spacecraft'), (0.20073125569981626, 'orbit'), (0.18879931961143948, 'launch'), (0.17333660639744902, 'lunar'), (0.16275803414779264, 'shuttle'), (0.14946013978612827, 'solar'), (0.141721103421798, 'satellite'), (0.11945254557589449, 'mars'), (0.11916891689824621, 'hst'), (0.1072159433461687, 'venus'), (0.10400196383846942, 'zoo toronto'), (0.09964009319075218, 'sky'), (0.08259970233000476, 'flight'), (0.08157734819817183, 'missions'), (0.07583476529888396, 'henry zoo'), (0.07439029357890521, 'baalke'), (0.07341961337835466, 'orbital'), (0.07294582185892647, 'space station'), (0.07294582185892647, 'comet'), (0.072836918034082, 'vehicle'), (0.072836918034082, 'alaska edu'), (0.07150135013894772, 'sci space'), (0.07150135013894772, 'pluto'), (0.0701350170517

15) Run the code for extracting the most frequent words. Try to understand the code, and look at the output. Do the most frequent words seem to represent the news group subjects well?

> !💬! They represent the subjects well enough that it would be easy to 'reverse engineer' and match the subjects to the 'word clouds'.

16) 💻💻 The [CountVectorizer class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn is used. Find out from the documentation of the class how to change the code above, so that words that occur in all four news group classes are excluded from the results. The code above creates one "document" from each of the news group classes. That is, you should change the code so that a word can occur in at most **three** documents, because you have four documents. When you do this, how does the result change?

> !💬! It is a major improvement, almost all noise from common words is gone

17) 💻💻 Add an additional news group category to the list `news_group_subjects`. (In Kaggle, in the right-hand side panel, you have the data sets. If you expand news-groups-lab, you will find the names of the news groups.) Run the code again, and see if you get words that seem representative of that category.

> !💬! Added ```alt.atheism``` and the words seem generally relevant

18) 💻💻 In step 17), did you hard-code the maximum number of document occurrences to **three**? In that case, change the code, so that it uses the length of `news_group_subjects` for determining the maximum number of documents.

> !💬! Named argument to CountVectorizer: ```max_df=len(news_group_subjects)-1``` 
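
The filtering that `max_df` performs can be sketched in plain Python. The four tiny "documents" below are made up, and the code only mimics the idea of a document-frequency cut-off; it is not the scikit-learn implementation.

```python
from collections import Counter

# Made-up stand-ins for the news group "documents"
documents = [
    "the shuttle reached orbit",
    "the doctor gave the patient medicine",
    "the cipher protects the key",
    "the pitcher threw the ball",
]

# Document frequency: in how many documents does each word occur?
document_frequency = Counter()
for document in documents:
    document_frequency.update(set(document.split()))

# A word may occur in at most all-but-one of the documents
max_document_frequency = len(documents) - 1
kept_words = sorted(
    word for word, df in document_frequency.items() if df <= max_document_frequency
)

print(document_frequency["the"], max_document_frequency)
print(kept_words)
```

In scikit-learn, an integer `max_df` is interpreted as an absolute number of documents, while a float between 0.0 and 1.0 is interpreted as a proportion of the documents. Computing the limit from `len(news_group_subjects)` keeps the integer form correct when the list of news groups changes.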

17) 💻💻 Change the code, so that in addition to single words, also n-grams are extracted, containing sequences of up to two words.

> !💬! Named argument to CountVectorizer: ```ngram_range=(1,2)```
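
What a range of `(1, 2)` means can be illustrated with a small helper. The function `extract_ngrams` is a hypothetical name for this sketch and is not how scikit-learn implements n-gram extraction.

```python
def extract_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams with min_n <= n <= max_n, as space-joined strings."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "the space station orbit".split()
print(extract_ngrams(tokens))
```

With four tokens, the result contains four single words followed by three two-word sequences, which is why two-word entries such as 'space station' show up in the output lists once the n-gram range is extended.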

18) 💻💻 CountVectorizer uses raw word frequencies. There is another class in scikit-learn that instead uses the TF-IDF measure for extracting typical words. What is the name of this class? Change the code so that this class is used instead. How do the lists of the most typical words change?

> !💬! The class is ```TfidfVectorizer```. The difference is that each word gets a weight between 0 and 1 instead of a raw count: the weight grows with the number of occurrences of the word in the text, and shrinks the more documents the word occurs in.
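
To show where such weights come from, here is a hand-rolled TF-IDF calculation on three made-up documents. It assumes a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation per document, which is what I understand the scikit-learn defaults to be. Treat it as a sketch of the idea rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Made-up documents; 'game' occurs in all of them
documents = [
    "orbit launch orbit game",
    "doctor patient game",
    "pitcher ball game game",
]
tokenized = [document.split() for document in documents]
n_documents = len(tokenized)

# Document frequency of each term
document_frequency = Counter()
for tokens in tokenized:
    document_frequency.update(set(tokens))


def tfidf_weights(tokens):
    """TF-IDF weights for one document (smoothed IDF, L2-normalised)."""
    term_frequency = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_documents) / (1 + document_frequency[term])) + 1)
        for term, count in term_frequency.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf_weights(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {weight:.3f}")
```

In the first document, 'launch' and 'game' both occur once, but 'game' gets the lower weight because it occurs in every document. That is the effect that pushes subject-specific words to the top of the lists.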

19) 💻💻 Add the words 'msg' and 'n3jxp' to the stop word list used.

> !💬! Named argument to TfidfVectorizer: ```stop_words = stopwords.words('english') + ['msg', 'n3jxp']```


V. Small code refactoring and data validation
----------------------------------------------

20) 💻💻 This code line now occurs twice in the code above:
```
lines = [line.replace("_", " ").strip() for line in f.readlines() if not (line.startswith("From:") or line.startswith("Subject:"))]
```

Rewrite the code so that you use a function for this instead, to avoid repeating code.

> !💬! Add a function:
> ```
> def clean_lines(lines):
>     return [line.replace("_", " ").strip() for line in lines if not (line.startswith("From:") or line.startswith("Subject:"))]
> ```
>
> Replace the original code with: ```lines = clean_lines(f.readlines())```


21) 💻💻 Are the variable names used good, or could some of them be improved? In that case, change them to better variable names.

> !💬! OK.

22) 💻💻 Add exception handling to the code above, so that it handles the case when `news_group_subjects` contains a news group that does not exist.
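
One possible approach is sketched below, under the assumption that a non-existing news group shows up as an empty result from `glob.glob` (which returns an empty list for a missing directory instead of raising an error). The helper `list_news_group_files` and the temporary directory are hypothetical stand-ins for the Kaggle input data.

```python
import glob
import os
import tempfile


def list_news_group_files(base_directory, subject):
    """Return the files for one news group, or raise if none are found."""
    file_names = glob.glob(os.path.join(base_directory, subject, "*"))
    if not file_names:
        # glob gives an empty list for a missing directory, so raise explicitly
        raise FileNotFoundError(f"No files found for news group '{subject}'")
    return file_names


# Temporary stand-in for the data directory, containing a single news group
base_directory = tempfile.mkdtemp()
os.makedirs(os.path.join(base_directory, "sci.space"))
with open(os.path.join(base_directory, "sci.space", "1"), "w", encoding="utf-8") as f:
    f.write("Some text about orbits")

found = {}
for subject in ["sci.space", "sci.spaec"]:  # the second name is misspelled on purpose
    try:
        found[subject] = len(list_news_group_files(base_directory, subject))
    except FileNotFoundError as error:
        print("Skipping:", error)

print(found)
```

Whether to skip the missing group, as here, or to stop the whole run is a design choice. Skipping keeps the remaining groups usable, but then the list of subjects used for reporting must be kept in sync with the texts that were actually read.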

Extra
-----
(a) Now, larger text passages than sentences are sent to the word2vec model. So the model learns word contexts across sentence boundaries. Try to split the texts into sentences before sending them to the model. Does it change the results in any way? To do that, you can, for instance, look at the [tokenize](https://www.nltk.org/api/nltk.tokenize.html) package of the NLTK library.

(b) Find out how to use scikit-learn to train a text classifier for the five news groups you used above. That is, a classifier that can determine which news group a text belongs to.

(c) There is a separate [Notebook on Topic modelling](https://www.kaggle.com/code/mariaskeppstedt/topic-modelling-example), which forms an example of another type of unsupervised language model. Run it on the same five news group categories you used here. Do you get any sensible topics?

(d) Rewrite the code in section IV above, so that it uses a DataFrame instead.
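
For task (a), the overall shape of the change could look like the sketch below. It uses a deliberately naive regular-expression splitter as a stand-in, so that the example runs without extra downloads. The function name `split_into_sentences` and the example text are made up for the illustration.

```python
import re


def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if sentence]


documents = ["The shuttle reached orbit. It will land on Friday! Is the crew ready?"]

# Flatten: one list entry per sentence instead of one per document
all_sentences = [
    sentence
    for document in documents
    for sentence in split_into_sentences(document)
]

print(all_sentences)
```

The flattened list could then be passed as `document_list` to `train_model_on_documents_list`, so that each sentence is tokenized and used as its own training unit. A naive splitter like this one breaks on abbreviations such as "e.g.", which is one reason to prefer a proper sentence tokenizer from NLTK for the actual experiment.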