<a href="https://colab.research.google.com/github/se7ven7-7/815project_team10/blob/main/Hands-on/04-text-mining/Text_Analysis_Advanced_unsolved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

## 1. Intuition Behind Word2Vec

To understand how Word2Vec works, we will create a toy model by training it on a small number of sentences.

This is not common practice. Generally, we just use a *pre-trained* model that was fitted to millions of sentences; such a model produces much higher-quality embeddings than anything trained on a handful of sentences.
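
As a rough illustration of what such a model trains on: in the skip-gram variant of Word2Vec, each word is paired with the words that appear within a fixed window around it, and the embeddings are adjusted so that these pairs become predictable. The sketch below builds such pairs in plain Python. The helper name `skipgram_pairs` and the window size are illustrative assumptions, not part of `gensim`.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(['love', 'sleep', 'bed'], window=1)
print(pairs)
```

Words that share many contexts end up with similar embeddings, which is why a larger `window` tends to capture broader, more topical similarity.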




In [None]:
!pip install gensim



In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # Get the set of stop words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


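The following function cleans up a sentence: it tokenizes the text, lowercases it, removes punctuation and stop words, and stems the remaining words.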

In [None]:
import string
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def cleanup_text(sentence):
  # First, word tokenize.
  tokenized_sms_messages = word_tokenize(sentence)

  # Lower case
  tokenized_sms_messages = [word.lower() for word in tokenized_sms_messages]

  # Remove punctuation
  tokenized_sms_messages = [word for word in tokenized_sms_messages if word not in string.punctuation]

  # Remove stop words
  stop_words = set(stopwords.words('english'))
  tokenized_sms_messages = [word for word in tokenized_sms_messages if word not in stop_words]

  # Stem
  tokenized_sms_messages = [ps.stem(word) for word in tokenized_sms_messages]

  return tokenized_sms_messages

In [None]:
corpus = [
    'I love sleeping in my bed',
    'He hates eating at McDonalds every night',
    'I love drinking root beer',
    'He hates studying physics textbooks',
    'I love traveling to Europe every summer',
    'He hates swimming in the big pool',
]

# Tokenize first.
tokenized_corpus = [cleanup_text(sentence) for sentence in corpus]
tokenized_corpus

[['love', 'sleep', 'bed'],
 ['hate', 'eat', 'mcdonald', 'everi', 'night'],
 ['love', 'drink', 'root', 'beer'],
 ['hate', 'studi', 'physic', 'textbook'],
 ['love', 'travel', 'europ', 'everi', 'summer'],
 ['hate', 'swim', 'big', 'pool']]

In [None]:
from gensim.models import Word2Vec
import numpy as np

dimensions = 200

# We construct and train our own Word2Vec.
# NOTE: window=2 is an assumed context size; choose the value you want to experiment with.
model_word2vec = Word2Vec(sentences=tokenized_corpus, vector_size=dimensions, window=2, min_count=1, epochs=10000, workers=4, negative=10)

In [None]:
print("All words captured by the model:", model_word2vec.wv.key_to_index)

word = 'love'
print("The embedding of", word, "is", model_word2vec.wv[word])

# Get the embedding for each word captured by the model.
words = model_word2vec.wv.index_to_key
embeddings = np.array([model_word2vec.wv[word] for word in words])

In [None]:
embeddings.shape

Each of the 20 words in the vocabulary has its own embedding, so there are 20 embeddings. Each embedding is an n-dimensional vector, where n is `vector_size` (200 here).

Now, let's draw a 3D PCA plot to see these embeddings.



In [None]:
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd

def plot_scatter_3d(model, embeddings):
  dim_red = PCA(n_components=3, random_state=42)

  embeddings_for_visualization = dim_red.fit_transform(embeddings)

  # Convert the reduced embeddings and words into a DataFrame.
  # The labels always come from the toy model's vocabulary, which matches the row order of `embeddings`.
  df = pd.DataFrame(embeddings_for_visualization, columns=['x', 'y', 'z'])
  df['word'] = list(model_word2vec.wv.index_to_key)

  # Create a scatter plot using Plotly
  fig = px.scatter_3d(df, x='x', y='y', z='z', text='word', title='Word Embeddings Visualization')
  fig.show()

In [None]:
plot_scatter_3d(model_word2vec, embeddings)

Let's see how the same words are placed by a pre-trained embedding model (GloVe or Word2Vec).

In [None]:
import gensim.downloader as api

# Load the pretrained model
# pretrained_model = api.load('word2vec-google-news-300')
pretrained_model = api.load('glove-wiki-gigaword-200')
# pretrained_model = api.load('glove-twitter-200')


Check whether the model fails to recognize any of the words.

In [None]:
[word  for word in words if word not in pretrained_model]

In [None]:
vector_size = pretrained_model.vector_size

embeddings = np.array([
    # If the word is not recognized, replace it with a vector of zeros.
    pretrained_model[word] if word in pretrained_model else np.zeros(vector_size)
    for word in words
])

plot_scatter_3d(pretrained_model, embeddings)

## 2. Application: Using Neural Embeddings for Spam Detection

Now that we can represent words using the pre-trained embeddings, let's apply them to our spam detection problem.

In [None]:
url = "https://raw.githubusercontent.com/elhamod/BA820/main/Hands-on/04-text-mining/hamspam.csv"
df_sms = pd.read_csv(url, names = ['type', 'text'], index_col='type')

X = df_sms['text']
y = df_sms.index

df_sms

First, do some pre-processing.

In [None]:
message = df_sms['text'].iloc[0]  # positional access; the index holds the 'ham'/'spam' labels
message

In [None]:
tokens = cleanup_text(message)
print("a message:", tokens)

# get_mean_vector expects a list of tokens; tokens missing from the vocabulary are ignored by default.
print("Embedding of the entire message:", pretrained_model.get_mean_vector(tokens))

In [None]:
messages = df_sms['text']
tokenized_messages = [cleanup_text(message) for message in messages]

Now, to calculate sentence embeddings, let's average the word embeddings.
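
As a minimal sketch of the averaging idea, the snippet below uses a toy two-dimensional lookup table instead of the pre-trained model. The names `mean_embedding` and `toy_vectors` are hypothetical and exist only for this illustration; unknown tokens are skipped, and a message with no known tokens falls back to a zero vector.

```python
def mean_embedding(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]

toy_vectors = {'free': [1.0, 0.0], 'prize': [0.0, 1.0]}

avg = mean_embedding(['free', 'prize', 'xyz'], toy_vectors, 2)
fallback = mean_embedding(['xyz'], toy_vectors, 2)
print(avg)       # [0.5, 0.5]
print(fallback)  # [0.0, 0.0]
```

Averaging discards word order, so "dog bites man" and "man bites dog" get the same sentence embedding. That is a known limitation of this simple approach.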

In [None]:
import numpy as np

vector_size = pretrained_model.vector_size  # Get the embedding size

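vectorized_messages = [
    # If no tokens are recognized, use a zero vector.
    np.mean([pretrained_model[token] for token in sentence if token in pretrained_model], axis=0)
    if any(token in pretrained_model for token in sentence)
    else np.zeros(vector_size)
    for sentence in tokenized_messages
]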

Now that the embeddings are constructed, we can split to train/test sets and use supervised learning.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

def assess_model(df, embeddings):
  # train/test split
  X_train, X_test, y_train, y_test = train_test_split(embeddings, df.index, test_size=0.2, random_state=42)

  # train the model
  classifier = LogisticRegression()
  classifier.fit(X_train, y_train)

  # Predict on the test data
  y_pred = classifier.predict(X_test)

  # Evaluate the model
  accuracy = accuracy_score(y_test, y_pred)
  f1_score = sklearn.metrics.f1_score(y_test, y_pred, pos_label="spam")
  print(f"Accuracy: {accuracy}")
  print(f"f1_score: {f1_score}")
  print(sklearn.metrics.classification_report(y_test,y_pred))
  display(pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='true'), columns=classifier.classes_, index=classifier.classes_ ))




In [None]:
assess_model(df_sms, vectorized_messages)

### 2.1 Misc Functions

Find words that are most similar to a word

In [None]:
word = 'astrology'

pretrained_model.similar_by_word(word) # , topn=5

Find word analogies

In [None]:
pretrained_model.most_similar(positive=['woman', 'king'], negative=['man'])
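
The analogy query above can be read as vector arithmetic: add the "positive" vectors, subtract the "negative" ones, and look for the nearest remaining word by cosine similarity. The sketch below illustrates that idea with hand-made three-dimensional vectors. The `toy` values are invented for the example, and `gensim`'s actual implementation differs in details such as normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors; dimensions loosely mean (royalty, male, female).
toy = {
    'king':   [0.9, 0.8, 0.1],
    'queen':  [0.9, 0.1, 0.8],
    'man':    [0.1, 0.9, 0.1],
    'woman':  [0.1, 0.1, 0.9],
    'prince': [0.8, 0.9, 0.0],
    'apple':  [0.0, 0.2, 0.2],
}

# king - man + woman
target = [k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman'])]

# Exclude the query words themselves, then pick the closest candidate.
candidates = {w: v for w, v in toy.items() if w not in ('king', 'man', 'woman')}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```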

Find cosine similarity between two sentences

In [None]:
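# Lowercase first: the GloVe vocabulary is uncased, so a token such as 'I' would raise a KeyError.
pretrained_model.n_similarity(word_tokenize('I like it'.lower()), word_tokenize('hate it'.lower()))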

**Questions:**

- Would dimensionality reduction help improve the results?
- Would you be able to use clustering to find different groups of messages? Do the clusters align with the ham/spam split?
- Visualize the dataset using non-linear methods.

## 3. Using Deep Learning Embeddings

We just saw how embeddings like Word2Vec can help us represent text as vectors to perform downstream tasks, such as classification.

Let's now try more advanced deep learning models that produce more sophisticated embeddings.

We will use `DistilBERT` through [`huggingface`](https://huggingface.co/). `huggingface` is a widely used platform for datasets and deep learning models, especially Transformers.



In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens')

# Encode every message into a single sentence embedding.
embeddings = st_model.encode(df_sms['text'].tolist())


In [None]:
assess_model(df_sms, embeddings)

## 4. Using a Pretrained Model

Instead of extracting embeddings and then training logistic regression, how about we use a pre-trained deep learning model (a Transformer)?

Searching `huggingface` for a suitable model for ham/spam, one could find the following [Bert_Spam_ham](https://huggingface.co/saadkiet/Fine_Tuned_bert_Spam_ham) model.



In [None]:
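# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="saadkiet/Fine_Tuned_bert_Spam_ham")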


Let's try on one sentence.

In [None]:
pipe(df_sms['text'].iloc[0])

Notice that while the model computes the embeddings internally, it gives us the final classification directly, along with a score indicating its certainty. So, we do not need to train a separate classifier.
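
To make the output format concrete, here is a small sketch of turning a pipeline-style prediction into a class name. The `example_output` list is hypothetical (the score is made up), and the mapping of `LABEL_0` to ham and `LABEL_1` to spam is the same assumption used by `assess_model_bert` in this notebook.

```python
# Hypothetical pipeline output for one sentence; the score is invented for illustration.
example_output = [{"label": "LABEL_1", "score": 0.99}]

def to_class_name(prediction):
    """Map a label such as 'LABEL_0' or 'LABEL_1' to 'ham' or 'spam'."""
    label_id = int(prediction["label"][-1])  # last character is the class id
    return "ham" if label_id == 0 else "spam"

class_names = [to_class_name(p) for p in example_output]
print(class_names)  # ['spam']
```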

In [None]:
def assess_model_bert(df, model):
  # train/test split
  X_train, X_test, y_train, y_test = train_test_split(df["text"], df.index, test_size=0.2, random_state=42)

  # Predict on the test data
  y_pred = model(X_test.to_list())
  y_pred = [int(x["label"][-1]) for x in y_pred]
  y_pred = ["ham" if x == 0 else "spam" for x in y_pred]

  # Evaluate the model
  accuracy2 = accuracy_score(y_test, y_pred)
  f1_score = sklearn.metrics.f1_score(y_test, y_pred, pos_label="spam")
  print(f"Accuracy: {accuracy2}")
  print(f"f1_score: {f1_score}")
  print(sklearn.metrics.classification_report(y_test,y_pred))
  display(pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='true'), columns=["ham", "spam"], index=["ham", "spam"]))




In [None]:
assess_model_bert(df_sms, pipe)