In [None]:
import seaborn as sns
sns.set()

# Topic Modeling

**Note:**  We recommend choosing "Run All" when you start this notebook, since several of the calculations take some time to finish. 

Consider that we have a large collection of documents, where "document" could mean an email, a single tweet, a posting to a message board, etc., depending upon the collection in question.  Topic modeling is a technique that is used to extract the hidden topics from the document collection.  In this setting, a "topic" is considered to be a mixture of keywords, and a "document" is a mixture of topics.  

In a topic modeling problem, we are given the collection of documents and want to try to infer the (hidden) collection of topics and the words that define each topic (possibly together with the ratios that define the mixtures of words in topics and/or mixture of topics in each document).  

Topic modeling is an **unsupervised learning** technique: the documents come with no labels, and we have no a priori knowledge of the number of topics in the document collection.  We are trying to determine both the "right" number of labels and which label (or labels) we might assign to each document.  Once we specify a number of topics, the algorithm tries to infer the words that constitute each topic and the topics that make up each document, aiming for a good topic/keyword distribution.  

The goal of this notebook is to consider the topic modeling problem and examine some of the methods used to approach this unsupervised learning problem.  


## Importing our data

In the interests of time, we will use a random sample of 3% of a data set that is commonly used for NLP tasks, the Yelp review data.  It's unlikely that we'll get a truly representative sample of our data by only taking 3%, but the main purpose is to illustrate the topic modeling methods.  

In [None]:
# Download data file
!mkdir yelp_data 2> /dev/null
!wget --directory-prefix=yelp_data/ -nc https://s3.amazonaws.com/dataincubator-course/AML_NLP_data/restaurant_reviews.json.gz

# Download models
!wget -nc https://s3.amazonaws.com/dataincubator-course/AML_NLP_data/yelp_models.zip
!unzip -u yelp_models.zip

In [None]:
fraction = 0.03

import pandas as pd

filepath = 'yelp_data/restaurant_reviews.json.gz'
    
# Load JSON into DataFrame
df = pd.read_json(filepath)

# Take subsample of data
reviews = df.sample(frac=fraction, random_state=117).reset_index(drop=True)

In [None]:
len(reviews)

For the purposes of topic modeling, we need only the `text` field in the reviews. 

In [None]:
reviews_text = reviews[['text']]
reviews_text.head(10)

##  A first approach 

We'll first try topic modeling using a common unsupervised learning technique, namely clustering.  After all, we can think of taking the documents that we have and clustering them together to get groups of documents that are "talking about a similar topic (or topics)".  Relevant questions that we have to consider include:
 
1. What kind of pre-processing do we want to apply to the documents (e.g. tokenization, removal of stop words, lemmatization)?  
1. How do we measure "similarity" between documents?
1. How many clusters do we divide the documents into?  How do we decide that? 
1. Once we have the clusters, how do we find the words that make up the topic(s) of those documents? 

Let's consider some of these questions first, and we will get to others in due time.

## Pre-processing and feature extraction

Common preprocessing steps in natural language processing include these: 
1.  Tokenization:  The breaking of the text into "tokens", e.g., single words, pairs of words (bigrams), etc.  
1.  Stop word removal:  Do we remove common words from the text? 
1.  Lemmatization:  This is a normalization process on the text that reduces words to a common base form, for example treating `eat`, `eaten`, and `ate` all as the same word `eat`, treating the plural form of a word as the singular, or "de-tensing" verbs so that `ran` is treated as `run`, `walked` as `walk`, etc.  
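
To make the three steps above concrete, here is a toy, library-free sketch.  The names `TOY_STOP_WORDS`, `TOY_LEMMAS`, and `toy_preprocess` are made up for this illustration, and the tiny hand-written lookup tables are only stand-ins for the much richer rules that a real library such as spaCy applies.

```python
import re

# Hypothetical, hand-made stand-ins for a real stop word list and lemmatizer.
TOY_STOP_WORDS = {"the", "a", "and", "we", "at", "was"}
TOY_LEMMAS = {"ate": "eat", "eaten": "eat", "ran": "run",
              "walked": "walk", "tacos": "taco"}

def toy_preprocess(text):
    # 1. Tokenization: lowercase the text and pull out runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Stop word removal: drop very common words.
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 3. Lemmatization: map each token to its base form when we know one.
    return [TOY_LEMMAS.get(t, t) for t in tokens]

print(toy_preprocess("We walked to the diner and ate tacos."))
# ['walk', 'to', 'diner', 'eat', 'taco']
```

Note that `to` survives only because it is missing from the toy stop word set; a realistic stop word list would remove it.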

We will use the Python library `spaCy` for most of this work.  We will use the "bag of words" approach and just consider individual words, so-called "1-grams", in this first approach.  This does risk losing the context in which words are used, but is a common method in NLP.  

In order to save processing time, spaCy can run with some of its features disabled.  We will do that here, as we are utilizing a subset of its tools (in particular, lemmatization and parts of speech tagging).  The disabling can either be done when loading spaCy, or after the fact.  

In [None]:
import spacy
nlp = spacy.load('en', disable=['parser','ner'])

We'll make our own custom tokenizer/lemmatizer to use in the text processing steps.  We want to strip out punctuation and spaces from our document.  We'll let the vectorizer handle the not-very-helpful (from a topic modeling point of view) `-pron-` output from the lemmatizer by adding it to our collection of stopwords.  

In [None]:
def my_lemmatizer(doc):
    return [ w.lemma_.lower() for w in nlp(doc) 
                      if w.pos_ not in ['PUNCT', 'SPACE', 'SYM', 'CCONJ']
                      and w.lemma_ not in ['_', '.'] ]

In [None]:
stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Lemmatize the stop words so that they match the lemmatized tokens, then add '-pron-'
stopwords = set(my_lemmatizer(' '.join(list(stopwords)))).union(['-pron-'])

With our lemmatizer in hand, along with our stopwords, we're ready to vectorize our text.  We'll use the `CountVectorizer` to process the text.  This gives us a count of words in our text.  We follow that by the `TfidfTransformer` which will then weight the words by the inverse document frequency.  (Scikit-learn has a single transformer, the `TfidfVectorizer` that combines these two steps, but we have chosen to do these two transformations separately as we want to use information from the transformers later to help understand our clustering.)  
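
As a rough illustration of what the count-then-weight pipeline computes, the following sketch reimplements TF-IDF by hand on a toy corpus.  It assumes the smoothed inverse document frequency $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ followed by L2 normalization of each row, which is our understanding of the scikit-learn defaults; the function `tfidf_rows` and the toy documents are invented for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(w for d in docs for w in set(d))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

toy_docs = [["pizza", "good", "pizza"], ["service", "good"], ["service", "slow"]]
vocab, rows = tfidf_rows(toy_docs)
for row in rows:
    print({w: round(x, 2) for w, x in zip(vocab, row) if x > 0})
```

In the first toy document, `pizza` receives a larger weight than `good`, both because it occurs twice and because it appears in fewer documents.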

We will limit ourselves to 1000 terms in order to speed up the transformation and clustering methods.  

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
max_features = 1000

cv = CountVectorizer(tokenizer=my_lemmatizer, stop_words=stopwords, 
                     min_df=2, max_df=0.95, max_features=max_features)
counts = cv.fit_transform(reviews_text['text'])

tf = TfidfTransformer()
matrix = tf.fit_transform(counts)

In [None]:
matrix

How do we measure if two documents are "similar" to each other?  There are several ways we could consider doing this.  One way is that we can view each row of this transformed data as a multi-dimensional vector.  Two documents are similar to one another if their corresponding vectors are close to one another or, in other words, if the *difference* of their vectors is small in length.  This is the common comparison between vectors that is used in algorithms such as $k$-means clustering.  
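
The sketch below illustrates this distance-based notion of similarity on three made-up document vectors.  It also checks a fact that is handy when the rows are scaled to unit length (which we believe is the default behavior of `TfidfTransformer`): for unit vectors, the squared Euclidean distance equals $2 - 2\cos\theta$, so "close in distance" and "high cosine similarity" rank pairs of documents the same way.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def normalize(u):
    length = math.sqrt(sum(a * a for a in u))
    return [a / length for a in u]

# Made-up 3-term document vectors, scaled to unit length.
doc_a = normalize([3.0, 1.0, 0.0])
doc_b = normalize([2.0, 1.0, 0.0])   # uses the same terms as doc_a
doc_c = normalize([0.0, 1.0, 4.0])   # mostly uses a different term

# doc_a is closer to doc_b than to doc_c.
print(euclidean(doc_a, doc_b) < euclidean(doc_a, doc_c))

# For unit vectors: distance^2 == 2 - 2 * cosine similarity.
print(math.isclose(euclidean(doc_a, doc_c) ** 2,
                   2 - 2 * cosine_similarity(doc_a, doc_c)))
```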

## $K$-means clustering 

We can use this common clustering method to group our documents together to find "topics of conversation".  $K$-means clustering is an unsupervised learning algorithm: it groups the data without using any labels.  It does require us to choose the number of clusters up front, and we do not know the "right" number in advance, although there exist some well-known tools that we could use to attempt to determine it.  We will make the somewhat arbitrary choice to cluster our data into five clusters.  
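
For readers who have not seen the algorithm before, here is a minimal sketch of the basic $k$-means iteration (often called Lloyd's algorithm) on a handful of made-up 2-D points.  This is a simplification for illustration only: the function `toy_kmeans` is hypothetical, it starts from the first `k` points, and it runs a fixed number of iterations, whereas scikit-learn's `KMeans` uses a smarter initialization and several restarts.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def toy_kmeans(points, k, n_iter=20):
    # Simplifying assumption: use the first k points as the initial centers.
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centers = toy_kmeans(points, k=2)
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Even though both initial centers sit in the same group of points, the alternating assignment and update steps pull one center over to the second group after a couple of iterations.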

In [None]:
from sklearn.cluster import KMeans

In [None]:
number_of_clusters = 5

kmc_5 = KMeans(n_clusters=number_of_clusters, n_init=3, random_state=117)  # random_state for consistency
kmc_5.fit(matrix)

Once we have the clustering (which we hope identifies distinct topics), how do we find the words associated with each cluster?  

The `cluster_centers_` attribute of the clustering gives us the coordinates of each center of the clusters.  

In [None]:
kmc_5.cluster_centers_

What are the "important" words in each topic?  Let's use the center of each cluster, sort the coordinates to find the largest components of each vector, and combine that with the feature names to pick the corresponding words for those largest vector components.  

The `get_feature_names` method of the vectorizer gives us a list of words that we can use to look up a word, given its index in the vector. 

In [None]:
import numpy as np

In [None]:
number_of_top_words = 10

cluster_words = np.argsort(kmc_5.cluster_centers_, axis=1)
terms = cv.get_feature_names()

top_words = set()

for i in range(number_of_clusters):
    print('Cluster {}: '.format(i))
    print(' '.join([terms[k] for k in cluster_words[i][-number_of_top_words:]]),'\n')
    top_words = top_words.union([terms[k] for k in cluster_words[i][-number_of_top_words:]])
    
top_words = sorted(list(top_words))

## Visualizations

Can we get a little more insight into the clustering?   

Let's combine our `counts` that we computed earlier together with the cluster labels that were computed by the `KMeans` method to create a pandas DataFrame to help us out.

In [None]:
word_df = pd.DataFrame(counts.toarray(), columns=terms)[top_words]
word_df['Cluster'] = kmc_5.labels_.tolist()

In [None]:
word_df.head()

## Size of the clusters 

How many documents (reviews) are in each cluster? 

In [None]:
word_df.groupby('Cluster').count()[top_words[0]].\
    plot.bar(rot=0).\
    set(ylabel='Document count',
    title='Number of Documents per Cluster');

---
**Group Discussion**
- Does the above frequency count of the cluster sizes suggest anything about our original data set?  
---

## Word frequencies

What's the frequency of the "top words" within each cluster?

In [None]:
word_df.groupby('Cluster').sum().transpose().\
    plot.bar(figsize=(13,5), width=0.7).\
    set(ylabel='Word frequency', 
    title='Word Frequencies by Topic, Combining the Top {} Words in Each Topic'.format(number_of_top_words));

---
**Group Discussion**
- Does the above visualization suggest some additional stop words that we might want to add (and then reprocess the resulting new bag of words matrix) to see if we can make our clustering better?
---

## Word clouds

We can use a word cloud to show the relative frequency of these top words in each cluster.

In [None]:
word_totals = { i: word_df.groupby('Cluster').sum().loc[i].to_dict() for i in range(number_of_clusters) }

In [None]:
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider
from wordcloud import WordCloud

def show_wordcloud(topic=0):
    cloud = WordCloud(background_color='white', colormap='viridis')
    cloud.generate_from_frequencies(word_totals[topic])
    plt.gca().imshow(cloud)
    plt.axis('off')
    plt.tight_layout()
    
slider = IntSlider(min=0, max=number_of_clusters-1, step=1, value=0, description='Topic')
interact(show_wordcloud, topic=slider);

## Visualization in a low-dimensional space

We can attempt to see how well the clusters are separated by projecting into a lower-dimensional space using principal component analysis.  We have to remember that the `PCA` method from scikit-learn does not operate on sparse matrix representations, but given the size of our data, we can convert the feature matrix into a dense representation.  (Otherwise we might have to use the `TruncatedSVD` method which can take sparse matrix representations as input.)  

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=117)
matrix_pca = pca.fit_transform(matrix.toarray())

matrix_pca.shape

We'll use the `Cluster` label from the `word_df` to supply a color to each point in our transformed data set when we plot it.  

In [None]:
plt.scatter(matrix_pca[:,0], matrix_pca[:,1], c=word_df['Cluster'], 
            cmap='viridis', alpha=0.15)
plt.gca().set(title='Plot for 2-Dimensional PCA Projection', 
              xlabel='PCA component 1', ylabel='PCA component 2');

We can see separation for some of the clusters; one of them in particular seems relatively well-defined.  The other clusters show more "mixing", which may partly be a limitation of projecting down into a two-dimensional space (from our original 1000-dimensional feature space!).  

## Choosing the number of clusters

The choice of `number_of_clusters = 5` was more or less arbitrary when we made the clustering.  How do we determine the "right" number of clusters?  One way of attempting to evaluate the quality of a clustering is to use the so-called "silhouette score", and scikit-learn can compute this metric for us.  We again use the cluster labels that scikit-learn has computed for us earlier.
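
To give a feel for what this metric measures, here is a small hand-rolled sketch.  For each point, let $a$ be the mean distance to the other points in its own cluster and $b$ the smallest mean distance to the points of any other cluster; the point's silhouette is $(b - a)/\max(a, b)$, and the overall score is the average over all points.  The function `toy_silhouette_score` and the four sample points are made up for this illustration, and the sketch assumes at least two clusters.

```python
import math

def toy_silhouette_score(points, labels):
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        if lab not in by_cluster:
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        a = sum(by_cluster[lab]) / len(by_cluster[lab])
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = toy_silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = toy_silhouette_score(points, [0, 1, 0, 1])   # pairs far-apart points
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate tight, well-separated clusters, while scores near 0 or below suggest overlapping clusters or points assigned to the wrong cluster.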

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
print('Clusters: {}  Silhouette score: {}'.format(number_of_clusters, 
                                                  silhouette_score(matrix, word_df['Cluster'])))

We can compare this silhouette score for 5 clusters versus the silhouette score for other clusterings with different numbers of clusters.  Since this is a computationally intensive procedure, we have already performed these calculations and saved some relevant information which we will retrieve here.  

In [None]:
import pickle

with open('yelp_models/silhouettes.pkl', 'rb') as f:
    scores = pickle.load(f)

scores

What number of clusters (of those given here) maximizes the silhouette score?

In [None]:
sorted([ (k, v) for (k, v) in scores.items() if isinstance(k, int) ], key=lambda kv: kv[1], reverse=True)[0]

According to these numbers, we might conclude that 20 is the "right" number of clusters for this data.  On the other hand, we can note that there was a dip in the silhouette score in the change from 14 to 15 clusters.  We may not necessarily want to be guided solely by the silhouette score, but also ask if 20 is a manageable number of clusters to examine.  If we were to use 20 clusters, we should likely also examine the top words in the clusters to check if there is a significant overlap of those common words.  

In any case, we could recompute the $k$-means clustering using 20 clusters (or perhaps 14), and consider the various visualizations for that result more closely.  (We'll note that we can achieve even higher silhouette scores with a larger number of clusters, but again we need to determine if going beyond a certain threshold gives us a manageable clustering to work with.)  Instead of pursuing this avenue, we'll move on to the other main topic modeling method we want to discuss in this notebook.  

---
**Group Discussion**
- What other ideas might you have for determining if a topic modeling clustering is "good"?
- What are some possible limitations for using $k$-means clustering for topic modeling? 
---

## Latent Dirichlet allocation (LDA)

Clustering using values produced by the `TfidfTransformer` gives us a crude way of getting groups of topics.  This grouping is relying on "similarity" of the associated vectors being measured by the vectors being close to one another (in the resulting high-dimensional space).  In particular, we are basically assuming that people will be using the exact same words when discussing a given topic.  

[Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) moves us closer to the idealized model first described in this notebook, namely that a "topic" is a mixture of words, and a "document" is a mixture of topics.  Documents are thought of as being generated word by word: for each word position, a topic is chosen according to a (hidden) probability distribution over the set of topics, and then a particular word is chosen according to another (hidden) probability distribution over the words for that topic.  The training process in the LDA machine learning algorithm corresponds to attempting to construct these unknown probability distributions.  
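
The generative story can be sketched in a few lines.  In the toy example below, the two topics, their word probabilities, and the function `generate_document` are all invented for illustration, and the distributions are fixed by hand; in full LDA they are themselves drawn from Dirichlet priors and are exactly what training tries to recover.

```python
import random

rng = random.Random(117)

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "food":    {"pizza": 0.5, "taco": 0.3, "delicious": 0.2},
    "service": {"waiter": 0.4, "slow": 0.3, "friendly": 0.3},
}

def generate_document(topic_mixture, n_words):
    words = []
    for _ in range(n_words):
        # 1. Pick a topic according to the document's topic mixture.
        topic = rng.choices(list(topic_mixture),
                            weights=list(topic_mixture.values()))[0]
        # 2. Pick a word according to that topic's word distribution.
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist),
                                 weights=list(word_dist.values()))[0])
    return words

# A document that is 80% about food and 20% about service.
print(generate_document({"food": 0.8, "service": 0.2}, n_words=8))
```

Topic modeling runs this story in reverse: given only the generated words, it tries to estimate the topic mixture of each document and the word distribution of each topic.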

The exact process by which this is performed behind the scenes isn't important to us.  (But interested readers can find more details [here](https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation).)  While scikit-learn implements a `LatentDirichletAllocation` class, we are going to use the gensim library and its [`LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html) class.  Using gensim lets us extract some more information about the resulting generative model more easily than using scikit-learn's LDA class.  

In [None]:
import gensim

import re

In [None]:
texts = reviews_text.text.values.tolist()

We prepare our text much like before, removing stop words, and lemmatizing the text of the reviews.  In order to remove much of the "noise", we will also limit ourselves to taking particular parts of speech which should provide us with the most descriptive types of words in the reviews.  We use `spaCy` to help with the parts of speech tagging.  

In [None]:
def process_words(texts, stop_words=set(), allowed_pos=['NOUN', 'PROPN', 'ADJ', 'VERB', 'ADV']):
    result = []
    for t in texts:
        t = re.sub('\'', '', t)  #  remove single quotation marks (mainly to capture contractions)
        t = gensim.utils.simple_preprocess(t, deacc=True)
        doc = nlp(' '.join(t))
        result.append([token.lemma_ for token in doc if token.pos_ in allowed_pos and 
                       token.lemma_ not in stop_words])
    return result

In [None]:
processed_text = process_words(texts, stop_words=stopwords.union(['-PRON-']))

We can compare one of the original reviews...

In [None]:
texts[:1]

... to the processed text that will be used for the LDA method.

In [None]:
print(processed_text[:1])

## The `LdaModel` class

To use the gensim `LdaModel` class, we will also use that library's vectorization class to create the "bag of words" in this case, as it provides the input in the right form for gensim.  We must first assemble the overall vocabulary/dictionary of words (using the `gensim.corpora.Dictionary` class), and then create the word count vectors (using the `doc2bow` method of the resulting `Dictionary`).  
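
The following plain-Python sketch mimics the shape of the data that these two steps produce: a mapping from tokens to integer ids, and, for each document, a list of `(token_id, count)` pairs.  The helper functions `build_vocabulary` and `to_bag_of_words` are hypothetical stand-ins written for this illustration, not gensim's implementation, and the ids that gensim assigns may be ordered differently.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_texts = [["pizza", "good", "pizza"], ["service", "slow", "good"]]
token2id = build_vocabulary(toy_texts)
print(token2id)
# {'pizza': 0, 'good': 1, 'service': 2, 'slow': 3}
print([to_bag_of_words(t, token2id) for t in toy_texts])
# [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

This is a sparse representation: each document lists only the words it actually contains, rather than a full row with one entry per vocabulary word.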

In [None]:
dictionary = gensim.corpora.Dictionary(processed_text)
print('Number of unique tokens: {}'.format(len(dictionary)))

In [None]:
corpus = [dictionary.doc2bow(t) for t in processed_text]

Similar to the $k$-means clustering result, let's see what gensim's `LdaModel` gives us when we specify that we want to find five topics.

In [None]:
num_topics = 5

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                            id2word=dictionary,
                                            num_topics=num_topics, 
                                            random_state=117, update_every=1,
                                            chunksize=1500, 
                                            passes=5, iterations=10,
                                            alpha='asymmetric', eta=1/100,
                                            per_word_topics=True)

As mentioned previously, the `LdaModel` class provides methods to see more information about the topics it discovers.  For example, we can examine the five topics and get the relative weight of how much each of the topic's top words contribute to that topic.  

In [None]:
from pprint import pprint

pprint(lda_model.print_topics(num_words=15))

### Exercise

1. Do the topics seem somewhat "well-defined"?  
1. What happens if you increase the number of words shown for each topic (by modifying the `num_words` parameter)? 

##  More topic information
At the start of this section, we said that latent Dirichlet allocation views a document as a mixture of topics.  Using the `get_document_topics` method, we can see the topics it has found for a particular document, as well as the proportion of the document associated with each topic.  This method takes an argument in the form of the bag of words representation of a document.  

In [None]:
lda_model.get_document_topics(corpus[0])

Typically, one topic is dominant in a particular document.  Let's extract the dominant topic (and percentage) for each of the reviews in our data.

In [None]:
def get_main_topic_df(model, bow, texts):
    topic_list = []
    percent_list = []
    keyword_list = []
    
    for wc in bow:
        topic, percent = sorted(model.get_document_topics(wc), key=lambda x: x[1], reverse=True)[0]
        topic_list.append(topic)
        percent_list.append(round(percent, 3))
        keyword_list.append(' '.join(sorted([x[0] for x in model.show_topic(topic)])))

    result_df = pd.concat([pd.Series(topic_list, name='Dominant_topic'), 
                           pd.Series(percent_list, name='Percent'), 
                           pd.Series(texts, name='Processed_text'), 
                           pd.Series(keyword_list, name='Keywords')], axis=1)

    return result_df

In [None]:
main_topic_df = get_main_topic_df(lda_model, corpus, processed_text)

main_topic_df.head(10)

In [None]:
grouped_topics = main_topic_df.groupby('Dominant_topic')
grouped_topics.count()['Processed_text'].\
    plot.bar(rot=0).\
    set(title='Dominant Topic Frequency in the {} Reviews'.format(len(reviews)),
        ylabel='Topic frequency'); 

---
**Group Discussion**
-  What might the above plot reveal about the set of reviews we started with?  (Compare this to the similar plot of number of documents per cluster that we obtained using $k$-means clustering.)
---

## Representative data

What's the "most representative" review we have in the data for each topic?  Here we take it to be the review with the highest percentage for its dominant topic.  

In [None]:
representatives = pd.DataFrame()

for k in grouped_topics.groups.keys():
    representatives = pd.concat([ representatives, 
                                 grouped_topics.get_group(k).sort_values(['Percent'], ascending=False).head(1) ])
    
representatives

We can, of course, examine the original text of the review.

In [None]:
print('Document: {}  Dominant topic: {}\n'.format(representatives.index[1], 
                                       representatives.loc[representatives.index[1]]['Dominant_topic']))
print(texts[representatives.index[1]])

We have also located something that we may not have expected, a review in French!

In [None]:
print('Document: {}  Dominant topic: {}\n'.format(representatives.index[3], 
                                       representatives.loc[representatives.index[3]]['Dominant_topic']))
print(texts[representatives.index[3]])

### Exercise
1. Have a closer look at the reviews in topic 3.  Are many of them in French?  
1. How might you try to locate an English language review to serve as a "representative" for topic 3?  
(Note:  Another Python library `nltk` (which stands for ["Natural Language Toolkit"](https://www.nltk.org/)) contains a method that will attempt to identify the language in which particular text has been written.  We won't discuss the `nltk` library here.  Also note that the lemmatization method we were using from `spaCy` is assuming that the language of each review is English. `spaCy` supports other languages too, but we haven't taken the language of reviews into account here, assuming the majority are in English.)  

## Length of documents in each topic

Let's dig a little further and look at the length of the documents in each topic.  Note that we are using the "processed text" in the following visuals, i.e. after stop words and other "noise" have been removed by the processing steps we used to prepare for LDA.

In [None]:
def word_count_by_topic(topic=0):
    d_lens = [len(d) for d in grouped_topics.get_group(topic)['Processed_text']]
    plt.hist(d_lens, bins=50)
    large = plt.gca().get_ylim()[1]
    d_mean = round(np.mean(d_lens), 1)
    d_median = np.median(d_lens)
    plt.plot([d_mean, d_mean], [0,large], label='Mean = {}'.format(d_mean))
    plt.plot([d_median, d_median], [0,large], label='Median = {}'.format(d_median))
    plt.legend()
    plt.gca().set(xlabel='Document word count', ylabel='Number of documents', xlim=(0, 450), 
            title='Distribution of Document Lengths for {} Reviews in Topic {}'.format(len(d_lens), topic));

In [None]:
slider = IntSlider(min=0, max=num_topics-1, step=1, value=0, description='Topic')
interact(word_count_by_topic, topic=slider);

## Top word distribution per topic

Finally, can we reproduce a histogram of word distributions per topic, similar to what we did for the $k$-means clustering method?  Yes, we can, but it takes a little more engineering on our part since the "bag of words" representation isn't in such a nice format as we had when using the `CountVectorizer` from scikit-learn.

In [None]:
lda_top_words_index = set()
for i in range(lda_model.num_topics):
    lda_top_words_index = lda_top_words_index.union([k for (k,v) in lda_model.get_topic_terms(i)])

print('Indices of top words: \n{}\n'.format(lda_top_words_index))

In [None]:
words_we_care_about = [{dictionary[tup[0]]: tup[1] for tup in lst if tup[0] in lda_top_words_index} 
                       for lst in corpus]

In [None]:
lda_top_words_df = pd.DataFrame(words_we_care_about).fillna(0).astype(int).sort_index(axis=1)
lda_top_words_df['Cluster'] = main_topic_df['Dominant_topic']

In [None]:
lda_top_words_df.groupby('Cluster').sum().transpose().\
         plot.bar(figsize=(15, 5), width=0.7).\
         set(ylabel='Word frequency', 
         title='Word Frequencies by Topic for the {} Combined Top Words'.format(len(lda_top_words_index)));

## What are the common top words in the two topic modeling methods?
What top words have we found using both $k$-means clustering and LDA?  

In [None]:
common_words = set(lda_top_words_df.columns[:-1]).intersection(set(word_df.columns[:-1]))

In [None]:
print(len(common_words))
print(sorted(list(common_words)))

## The number of topics

As with $k$-means clustering, the initial selection of "5" for the number of topics in the LDA method was arbitrary.  One tool that gensim provides to help choose the number of topics is a measure of the "coherence" of the topics.  We are not going to go into the details of how coherence is defined, but the main idea is that coherence is supposed to model "human interpretability" of topics, with higher coherence scores corresponding to "better defined" topics.  

Similar to using silhouette scores for $k$-means, we could build LDA models with differing numbers of topics, and choose one with the highest coherence score.  
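
Without going into the measure that gensim uses by default, one simple co-occurrence-based variant can give a feel for the idea.  The sketch below implements a UMass-style score for a single topic: it rewards topics whose top words tend to appear together in the same documents.  The function `toy_umass_coherence` and the toy documents are invented for this illustration, the function assumes every top word occurs in at least one document, and the numbers it produces are not comparable to the scores computed in the cells that follow.

```python
import math
from itertools import combinations

def toy_umass_coherence(top_words, documents):
    """UMass-style coherence for one topic; top_words is ordered most to least probable."""
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        # Compare the pair's co-occurrence with the frequency of the higher-ranked word.
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

toy_docs = [["pizza", "cheese", "crust"], ["pizza", "cheese"],
            ["waiter", "slow"], ["waiter", "slow", "pizza"]]

coherent = toy_umass_coherence(["pizza", "cheese"], toy_docs)
mixed = toy_umass_coherence(["pizza", "waiter"], toy_docs)
print(coherent > mixed)   # True
```

Here `pizza` and `cheese` co-occur in two documents while `pizza` and `waiter` co-occur in only one, so the first "topic" scores higher.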

In [None]:
cm = gensim.models.coherencemodel.CoherenceModel(model=lda_model,
                                                 texts=processed_text,
                                                 dictionary=dictionary)

coherence_scores = [(num_topics, cm.get_coherence())]
print('Coherence score for {} topics:  {}'.format(*coherence_scores[0]))

In [None]:
for n in range(6, 9):
    mod = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                          id2word=dictionary,
                                          num_topics=n, 
                                          random_state=117, update_every=1,
                                          chunksize=1500, 
                                          passes=5, iterations=10,
                                          alpha='asymmetric', eta=1/100,
                                          per_word_topics=True)
    cmodel = gensim.models.coherencemodel.CoherenceModel(model=mod,
                                                 texts=processed_text,
                                                 dictionary=dictionary)
    coherence_scores.append((n, cmodel.get_coherence()))

In [None]:
coherence_scores

Interested readers can find many more technical details about coherence scores [here](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf).  

## Conclusion

In this notebook we have explored the topic modeling problem, that of extracting a useful grouping of documents into meaningful clusters.  The main takeaways from this lesson are:
- Topic modeling is an **unsupervised** learning problem, in that the documents come with no labels and there is no a priori number of topics into which to divide your input data.  
- Like most NLP problems, pre-processing your input is important to get meaningful results.
- $k$-means clustering using vectorized data (such as word counts or TFIDF values) is one method for clustering similar documents using distance between the document vector representations.  
- Latent Dirichlet allocation is a *generative model*, where documents are viewed as a mixture of topics, and topics as a mixture of words.  Python libraries such as gensim and scikit-learn provide methods for fitting these topic models and exploring the features of the fitted models. 
- Determining the "right" number of topics can be a difficult task, but we can utilize measures such as the silhouette score for $k$-means clustering or coherence for latent Dirichlet allocation to help.  Analyzing the overlap of words in each topic can also help to determine what is a good value for the number of topics.  

*Copyright &copy; 2020 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*