# Brainstorm

In this brainstorm, we look at ways to tell trivial messages apart from non-trivial ones. This notebook uses the CheckMate messaging data from before 2023-04-19.

In [2]:
import pandas as pd

In [132]:
checkmate_messages_df = pd.read_csv('../../src/data/CheckMate_Messages_Table.csv')

In [86]:
print(checkmate_messages_df.shape)

(86, 5)


In [89]:
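# dropna() returns a new DataFrame; the result is not assigned here,
# so checkmate_messages_df (and its shape) stays unchanged.
checkmate_messages_df.dropna()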
print(checkmate_messages_df.shape)
print(checkmate_messages_df.iloc[54])

(86, 5)
text                          😊
taggedCategory          Trivial
type                       text
isMachineCategorised       True
truthScore                  NaN
Name: 54, dtype: object


In [5]:
# {'Legitimate', 'Trivial', 'Scam', 'Info/News/Opinion', 'Illicit', 'Unsure', 'Spam'} are available message tags
print(set(checkmate_messages_df['taggedCategory'].values))

{'Spam', 'Scam', 'Trivial', 'Illicit', 'Legitimate', 'Info/News/Opinion', 'Unsure'}


In [6]:
# Machine Categorised messages are all trivial texts at present
print(checkmate_messages_df[checkmate_messages_df['isMachineCategorised'] == True])

              text taggedCategory  type  isMachineCategorised  truthScore
40           scam?        Trivial  text                  True         NaN
42              Hi        Trivial  text                  True         NaN
43  is this a scam        Trivial  text                  True         NaN
47  this is a test        Trivial  text                  True         NaN
54               😊        Trivial  text                  True         NaN
56          Hello!        Trivial  text                  True         NaN
59           Ello!        Trivial  text                  True         NaN
69             Hi!        Trivial  text                  True         NaN
70           Hello        Trivial  text                  True         NaN
73               👌        Trivial  text                  True         NaN
79       Whats up?        Trivial  text                  True         NaN


## Transforming dataset

Stage 1 of the project is to separate trivial messages from non-trivial ones with high recall: false negatives should be as low as possible, and false positives matter less.
Hence, we first need to group the message categories into trivial and not trivial.
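
To make the recall target concrete: with "trivial" as the positive class, recall is TP / (TP + FN), so every trivial message that gets classified as non-trivial lowers it. Below is a minimal sketch in plain Python. The helper name `recall_for_trivial` and the example labels are made up for illustration and are not part of the CheckMate data.

```python
def recall_for_trivial(targets, predictions):
    """Recall = TP / (TP + FN), treating True (trivial) as the positive class."""
    true_positive = sum(1 for t, p in zip(targets, predictions) if t and p)
    false_negative = sum(1 for t, p in zip(targets, predictions) if t and not p)
    if true_positive + false_negative == 0:
        return None  # no trivial messages in the sample, so recall is undefined
    return true_positive / (true_positive + false_negative)


# Made-up labels, for illustration only.
example_targets = [True, True, True, False, False]
example_predictions = [True, False, True, True, False]

# Two trivial messages are caught and one is missed, so recall is 2/3.
# The false positive in position 4 does not affect recall at all.
example_recall = recall_for_trivial(example_targets, example_predictions)
print(example_recall)
```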

In [7]:
checkmate_messages_df['is_trivial'] = (checkmate_messages_df['taggedCategory']=='Trivial')

In [8]:
print(checkmate_messages_df[checkmate_messages_df['is_trivial']==True])

                                                 text taggedCategory   type  \
6                             Well done CheckMate 👏👏👏        Trivial   text   
11  Hello, no matter how busy work remember to eat Oh        Trivial   text   
17  Suck lozenges n  rub tiger balm on nose , last...        Trivial   text   
40                                              scam?        Trivial   text   
42                                                 Hi        Trivial   text   
43                                     is this a scam        Trivial   text   
45                                   Hello a bit slow        Trivial   text   
46                            Well done CheckMate 👏😅👏        Trivial   text   
47                                     this is a test        Trivial   text   
54                                                  😊        Trivial   text   
56                                             Hello!        Trivial   text   
59                                              Ello

In [9]:
print(len(checkmate_messages_df[checkmate_messages_df['is_trivial']== True]))
print(len(checkmate_messages_df))

23
86


23 of the 86 messages (about 27%) in the current repository are trivial.

## Word2Vec Embeddings using Gensim

In [25]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
import gensim.downloader as api
import numpy as np

In [26]:
wv = api.load('word2vec-google-news-300')

In [34]:
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")

word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is
word #5/3000000 is on
word #6/3000000 is ##
word #7/3000000 is The
word #8/3000000 is with
word #9/3000000 is said


In [35]:
print(type(wv))

<class 'gensim.models.keyedvectors.KeyedVectors'>


In [44]:
# get_vector replaces word_vec, which is deprecated in gensim 4
print(len(wv.get_vector('said')))
print(len(wv.get_vector('with')))
print(type(wv.get_vector('print')))

300
300
<class 'numpy.ndarray'>




Next, convert the words of a message into their vectors and average them.

In [72]:
print(checkmate_messages_df.iloc[2]['text'])
print(checkmate_messages_df.iloc[5]['text'])
print(checkmate_messages_df.iloc[6]['text'])
print(checkmate_messages_df.iloc[11]['text'])

TN 95546718362782 is out for del. Allow for contactless del here: https://2g.to/EkEqr/3jI
1. National School Games (NSG) 2023 (Secondary Schools): 10 to 25 April at Choa Chu Kang Stadium
    EOI Link: https://form.jotform.com/223399002605452
Well done CheckMate 👏👏👏
Hello, no matter how busy work remember to eat Oh


In [73]:
nt_message = checkmate_messages_df.iloc[2]['text']
t_message = checkmate_messages_df.iloc[6]['text']

test_nt_message = checkmate_messages_df.iloc[5]['text']
test_t_message = checkmate_messages_df.iloc[11]['text']

In [21]:
message_tokens = checkmate_messages_df.iloc[2]['text'].split(' ')
print(message_tokens)

['TN', '95546718362782', 'is', 'out', 'for', 'del.', 'Allow', 'for', 'contactless', 'del', 'here:', 'https://2g.to/EkEqr/3jI']


In [104]:
def sentence_2_vector(sentence):
    """Average the word2vec vectors of the items in `sentence`.

    Note: the callers below pass a plain string, so this loop visits
    individual characters rather than whitespace-separated words.
    Returns None when nothing in `sentence` is in the vocabulary.
    """
    message_vectors = []
    for token in sentence:
        if token in wv:
            # get_vector replaces the deprecated word_vec
            message_vectors.append(wv.get_vector(token))
    if not message_vectors:
        return None
    # keepdims=True keeps the averaged vector as shape (1, 300)
    return np.average(message_vectors, axis=0, keepdims=True)
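
Because `sentence_2_vector` receives a raw string, it looks up one character at a time. A word-level variant would split the message first. The sketch below shows that idea in plain Python with a toy lookup table, so it runs without downloading the model. The name `average_token_vectors` and the 2-dimensional `toy_embeddings` are hypothetical stand-ins, not part of the notebook.

```python
def average_token_vectors(sentence, embeddings):
    """Average the vectors of whitespace-separated tokens found in `embeddings`.

    Returns None when no token has a vector.
    """
    vectors = [embeddings[token] for token in sentence.split() if token in embeddings]
    if not vectors:
        return None
    # zip(*vectors) groups the values dimension by dimension.
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]


# Toy 2-dimensional embeddings, made up for illustration.
toy_embeddings = {"well": [1.0, 0.0], "done": [0.0, 1.0]}

averaged = average_token_vectors("well done CheckMate", toy_embeddings)
no_vector = average_token_vectors("👏👏👏", toy_embeddings)
print(averaged)   # → [0.5, 0.5]
print(no_vector)  # → None
```

Since gensim's `KeyedVectors` supports both `token in wv` and `wv[token]`, the same function should work with `wv` passed as `embeddings`, though that has not been run against the results in this notebook.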

In [84]:
nt_message_vector = sentence_2_vector(nt_message)
t_message_vector = sentence_2_vector(t_message)

test_nt_message_vector = sentence_2_vector(test_nt_message)
test_t_message_vector = sentence_2_vector(test_t_message)

Does not contain token:  
Does not contain token:  
Does not contain token:  
Does not contain token:  
Does not contain token:  
Does not contain token: .
Does not contain token:  
Does not contain token:  
Does not contain token:  
Does not contain token: a
Does not contain token:  
Does not contain token:  
Does not contain token: :
Does not contain token:  
Does not contain token: :
Does not contain token: /
Does not contain token: /
Does not contain token: .
Does not contain token: /
Does not contain token: /
Does not contain token:  
Does not contain token:  
Does not contain token: a
Does not contain token:  
Does not contain token: 👏
Does not contain token: 👏
Does not contain token: 👏
Does not contain token: .
Does not contain token:  
Does not contain token: a
Does not contain token: a
Does not contain token:  
Does not contain token:  
Does not contain token: a
Does not contain token:  
Does not contain token: (
Does not contain token: )
Does not contain token:  
Does not con



In [85]:
print(len(nt_message_vector))
print(len(t_message_vector))

1
1


In [86]:
distance = np.linalg.norm(nt_message_vector - test_nt_message_vector)
print(distance)

0.3369276


In [88]:
print(f'Distance of test non-trivial from non-trivial:{np.linalg.norm(nt_message_vector - test_nt_message_vector)}')
print(f'Distance of test non-trivial from trivial:{np.linalg.norm(t_message_vector - test_nt_message_vector)}')

Distance of test non-trivial from non-trivial:0.336927592754364
Distance of test non-trivial from trivial:0.8595778346061707


In [89]:
print(f'Distance of test trivial from non-trivial:{np.linalg.norm(nt_message_vector - test_t_message_vector)}')
print(f'Distance of test trivial from trivial:{np.linalg.norm(t_message_vector - test_t_message_vector)}')

Distance of test trivial from non-trivial:0.67839115858078
Distance of test trivial from trivial:0.6198282837867737


## Train Test Split

In [133]:
from sklearn.model_selection import train_test_split

print(checkmate_messages_df)

                                                 text     taggedCategory  \
0                                                 NaN  Info/News/Opinion   
1   https://www.mas.gov.sg/news/media-releases/202...  Info/News/Opinion   
2   TN 95546718362782 is out for del. Allow for co...         Legitimate   
3   🚩🚩🚩 *"You flag, we check"* 🔍🔍🔍\n\nNot sure if ...  Info/News/Opinion   
4        https://form.gov.sg/63f594b42413ea0011831e7e         Legitimate   
..                                                ...                ...   
81  [SHIN MIN CONTEST] Happycall Jumbo 双面锅等你赢取！翻阅到...             Unsure   
82  Hello, sorry to bother you, I'm Nico from the ...               Scam   
83  Hello, my name is Sarah, from CME Group. We've...               Scam   
84  LTA: Notice As no valid E-tag detected in your...               Scam   
85  Excuse me, this is Stella, have you arranged a...             Unsure   

     type  isMachineCategorised  truthScore  
0   image                 False    6.8538

In [135]:
checkmate_messages_df = pd.read_csv('../../src/data/CheckMate_Messages_Table.csv')
# dropna() is not assigned, so no rows are removed here (row 0 still has NaN text)
checkmate_messages_df.dropna()
checkmate_messages_df['is_trivial'] = (checkmate_messages_df['taggedCategory']=='Trivial')


df = checkmate_messages_df[['text','is_trivial']]
print(df)

df_train, df_test= train_test_split(
        df, test_size=0.50, random_state=42)

                                                 text  is_trivial
0                                                 NaN       False
1   https://www.mas.gov.sg/news/media-releases/202...       False
2   TN 95546718362782 is out for del. Allow for co...       False
3   🚩🚩🚩 *"You flag, we check"* 🔍🔍🔍\n\nNot sure if ...       False
4        https://form.gov.sg/63f594b42413ea0011831e7e       False
..                                                ...         ...
81  [SHIN MIN CONTEST] Happycall Jumbo 双面锅等你赢取！翻阅到...       False
82  Hello, sorry to bother you, I'm Nico from the ...       False
83  Hello, my name is Sarah, from CME Group. We've...       False
84  LTA: Notice As no valid E-tag detected in your...       False
85  Excuse me, this is Stella, have you arranged a...       False

[86 rows x 2 columns]


In [136]:
# Data is not clean: not all texts are strings (missing text is NaN).
# Note that astype('str') turns NaN into the literal string 'nan'.
df_test['text'] = df_test['text'].astype('str')
df_train['text'] = df_train['text'].astype('str')

In [137]:
print(df_train.shape)
print(df_test.shape)
# print(y_train.shape)
# print(y_test.shape)
# Pandas Series
# print(X_train.map(sentence_2_vector).shape)
# print(X_train.map(sentence_2_vector).iloc[1])
# print(X_train.map(sentence_2_vector).iloc[1].shape)

(43, 2)
(43, 2)


In [138]:
# Boolean indexing returns a new frame; calling dropna() without inplace
# avoids pandas' SettingWithCopyWarning.
df_train_is_trivial = df_train[df_train['is_trivial']].dropna()
df_train_not_trivial = df_train[~df_train['is_trivial']].dropna()


In [139]:
# print(df_train_is_trivial['text'].map(sentence_2_vector).shape)
# print(df_train_not_trivial['text'].map(sentence_2_vector).shape)
# print(df_train_is_trivial['text'].map(sentence_2_vector))

In [140]:
v_train_is_trivial = df_train_is_trivial['text'].map(sentence_2_vector).dropna()
v_train_not_trivial = df_train_not_trivial['text'].map(sentence_2_vector).dropna()



In [141]:
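# Each element of the Series is a (1, 300) array, so the average is the
# (1, 300) mean vector of the class.
ave_vector_is_trivial = np.average(v_train_is_trivial)
ave_vector_not_trivial = np.average(v_train_not_trivial)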

In [142]:
print(ave_vector_not_trivial)

[[-0.13695015  0.08342403  0.00514604  0.12669812 -0.06163373  0.0138057
  -0.0730136  -0.06388134 -0.02763501  0.03655811 -0.05462418 -0.06098907
  -0.16197072  0.02182103 -0.13201483  0.08333065  0.09458251  0.15757763
  -0.03072715  0.00231777 -0.22467469 -0.02953767  0.1022515   0.00427959
  -0.07633369  0.02171312 -0.23420987  0.06227919 -0.01620322 -0.01496878
  -0.02388855  0.0072543  -0.06449614 -0.14816591 -0.14029114  0.09190691
  -0.16870442  0.12773865 -0.06104794  0.08518694 -0.02328792 -0.05281211
   0.05741708  0.08812542  0.03634414 -0.00858928 -0.03811451 -0.15303779
  -0.08723135  0.08452263 -0.18150826  0.22038123 -0.00569471  0.2022705
   0.02431773  0.1271864  -0.15068513 -0.09973634 -0.01174608 -0.13402615
  -0.14104192 -0.06132596 -0.16747262 -0.07108618 -0.02530009 -0.17712204
  -0.0809148   0.12340422 -0.02468852  0.05159829  0.03182505 -0.08645429
   0.04162094 -0.01538992 -0.02737984  0.00783421  0.11337963 -0.0031864
   0.01052057 -0.07945222 -0.12340466 -0.

In [143]:
def classify_is_trivial(v):
    """Return True when v is at least as close to the trivial average as to the non-trivial one."""
    d_is_trivial = np.linalg.norm(ave_vector_is_trivial - v)
    d_not_trivial = np.linalg.norm(ave_vector_not_trivial - v)
    # Ties count as trivial.
    return bool(d_is_trivial <= d_not_trivial)
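
`classify_is_trivial` is a nearest-centroid rule: a message is labelled with the class whose average vector is closer. The sketch below shows the same rule end to end on toy 2-dimensional vectors in plain Python. The helper names and the toy data are made up for illustration; the notebook itself uses 300-dimensional word2vec averages.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length vectors, computed dimension by dimension."""
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]


def is_nearer_to_trivial(vector, trivial_centroid, not_trivial_centroid):
    # Ties go to "trivial", matching the <= comparison in classify_is_trivial.
    return math.dist(vector, trivial_centroid) <= math.dist(vector, not_trivial_centroid)


# Toy training vectors, made up for illustration.
toy_trivial = [[0.0, 1.0], [0.2, 0.8]]
toy_not_trivial = [[1.0, 0.0], [0.8, 0.2]]

trivial_centroid = centroid(toy_trivial)          # roughly [0.1, 0.9]
not_trivial_centroid = centroid(toy_not_trivial)  # roughly [0.9, 0.1]

near_trivial = is_nearer_to_trivial([0.0, 0.9], trivial_centroid, not_trivial_centroid)
near_not_trivial = is_nearer_to_trivial([1.0, 0.1], trivial_centroid, not_trivial_centroid)
print(near_trivial, near_not_trivial)  # → True False
```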

In [144]:
df_test

| index | text | is_trivial |
|---|---|---|
| 75 | Hi, last year one of my colleague met you at y... | False |
| 0 | | False |
| 70 | Hello | True |
| 22 | Hi! I pray this msg find you in good health. H... | False |
| 12 | Hey Jolyn! We haven't seen you in the studio y... | False |
| 56 | Hello! | True |
| 10 | Wear mask in trains, buses and crowded places ... | False |
| 18 | 📢 HUUUURRY! Use your code GLASSSKIN10 & start ... | False |
| 4 | https://form.gov.sg/63f594b42413ea0011831e7e | False |
| 67 | Hello, We have noticed your employment history... | False |


In [145]:
v_test = df_test['text'].map(sentence_2_vector).dropna()



In [146]:
prediction = (v_test.map(classify_is_trivial))

In [166]:
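# concat aligns on the index; any test message that was dropped from v_test
# (no vector found) ends up with NaN in the prediction column.
predicted_vs_output = pd.concat([df_test['is_trivial'], prediction], axis=1)
predicted_vs_output = predicted_vs_output.rename(columns={'is_trivial':'target', 'text':'prediction'})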

In [173]:
print(type(predicted_vs_output))

true_positive = len(predicted_vs_output[(predicted_vs_output['target']==True) & (predicted_vs_output['prediction']==True)])
false_negative = len(predicted_vs_output[(predicted_vs_output['target']==True) & (predicted_vs_output['prediction']==False)])

recall = true_positive/(true_positive+false_negative)
print(recall)

<class 'pandas.core.frame.DataFrame'>
0.6666666666666666
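
One caveat about the recall figure: test messages with no vector are dropped before classification, so they have no prediction and match neither the `prediction==True` nor the `prediction==False` filter. An emoji-only message, for example, may fall into this group. The sketch below shows one possible way to account for them, by counting a missing prediction on a trivial message as a miss. The helper name and the labels are made up for illustration; this is an assumption about how such rows could be treated, not the notebook's method.

```python
def recall_counting_missing(targets, predictions):
    """Recall for the trivial class where a prediction of None counts as a miss."""
    true_positive = 0
    false_negative = 0
    for target, prediction in zip(targets, predictions):
        if not target:
            continue  # recall only looks at messages that are truly trivial
        if prediction is True:
            true_positive += 1
        else:
            false_negative += 1  # predicted False, or no prediction at all
    total = true_positive + false_negative
    return true_positive / total if total else None


# Made-up labels: None marks a message that could not be vectorised.
toy_targets = [True, True, True, True, False]
toy_predictions = [True, True, None, False, None]

strict_recall = recall_counting_missing(toy_targets, toy_predictions)
print(strict_recall)  # → 0.5
```

Ignoring the `None` row instead would give 2/3 on the same toy labels, which shows how much dropped messages can move the number on a sample this small.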
