# English Message Modeling 

This notebook explores how we can classify chatbot messages by their associated category, and how we can use spell check to clean the messages up first. 


In [None]:
# import statements
import pandas as pd
import numpy as np
import os
import re
import string
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, BatchNormalization, Dropout, Bidirectional

In [None]:
# setting up colab
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
MY_DRIVE = "drive/My Drive/"
root_folder = os.path.join(MY_DRIVE, "girl_effect")

In [None]:
# loading data 
df = pd.read_csv(os.path.join(root_folder,"chatbots_data_new1.csv"))

In [None]:
df

# Getting English DataFrame
For this part, I just wanted to test out building the LSTM model, so I picked only the entries from the Big Sis chatbot, whose messages are in English. I opted to approach the Chhaa Jaa dataset in a separate notebook.  

In [None]:
english = df[df["Org Name"] == "Big Sis V3"]

In [None]:
english

# Spell Check
The first step was to explore spell check. I found a pretty good library called pyspellchecker, which lets us replace each word with its most likely autocorrected form (no change if it's already a valid word). 

In [None]:
# installing the spell checker 
!pip install pyspellchecker

In [None]:
from spellchecker import SpellChecker

In [None]:
spell = SpellChecker()  

# Applying Spell check to our data frame
We lowercase the message column in our data frame and apply spell check to it. This process takes quite a bit of time, because it checks each word individually.
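
Since every word is checked individually, a word that appears many times gets corrected many times. One possible speed-up is to cache the correction per word. The sketch below is only an illustration: `toy_correction` and `_TOY_FIXES` are hypothetical stand-ins for `spell.correction`, so the block runs without pyspellchecker installed, and `corrected_string_cached` is a made-up name.

```python
from functools import lru_cache

# Hypothetical stand-in for spell.correction: fixes two known typos and
# returns None for everything else.
_TOY_FIXES = {"luv": "love", "prgnant": "pregnant"}

def toy_correction(word):
    return _TOY_FIXES.get(word)

@lru_cache(maxsize=None)
def cached_correction(word):
    # Fall back to the original word when no correction is available.
    return toy_correction(word) or word

def corrected_string_cached(sentence):
    # Lowercase, split on whitespace, and correct each word via the cache.
    words = str(sentence).lower().split()
    return " ".join(cached_correction(w) for w in words)

result = corrected_string_cached("I luv him and I luv her")
cache_stats = cached_correction.cache_info()
print(result, cache_stats)
```

In the real notebook, swapping `toy_correction` for `spell.correction` should give the same caching behavior, since only repeated words hit the cache.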

In [None]:
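# function to spell check every word in a message
def corrected_string(sentence):
  sentence = str(sentence).lower()
  corrected = []
  for word in sentence.split():
    # correction() may return None when it finds no candidate; keep the original word in that case
    corrected.append(spell.correction(word) or word)
  return " ".join(corrected)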

In [None]:
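# work on a copy of the first 1000 rows so that adding columns does not trigger SettingWithCopyWarning
df = english.head(1000).copy()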

In [None]:
df["new_message"] = df["Message"].apply(corrected_string)

In [None]:
df

In [None]:
# create a mapping between key words and existing questions that big sis outputs
# source doc: https://docs.google.com/document/d/140TJz5oUsLpPQU2-zjLjM9hgo2MTUvYhvKwN8GP6O8I/edit 
phrases = {
            "love": ["Healthy relationships", "Am I The One?", "Communicating with your partner", "Is it love?", "Does love really exist?"],
            "sex": ["Choices for contraceptives", "Is it ok to be curious about sex?", "Curiosity about sex- take the quiz!", "What is sex?", "Unsure about masturbation"],
            "relationships": ["Am I the One?", "Choices for contraceptives", "Communicating with your partner", "Does he like you?"],
            "pregnancy":  ["How does pregnancy happen?", "How much do you know about pregnancy?", "Am I pregnant?", "If it's my first time can I get pregnant?", "Can I get pregnant through oral sex?"], 
            "unknown": []
          }
# key phrases (topics) that came up repeatedly within the dataframe
phrase_key = ["love", "sex", "relationships", "pregnancy", "unknown", "contraception", "boyfriend"]

In [None]:
df["new_message"] = df["new_message"].astype(str)

# Generating Labels
I couldn't find any labeled data indicating which topic a user was talking about in a message. For example, "I want to go out with my bf" should map to boyfriend. I created a very naive way to generate labels: look for one of 6 keywords within each message and label the message with the first keyword that occurs. This is obviously a flawed way of labeling, since it relies on exact word matches and ignores context (the "bf" example would not actually match "boyfriend"), but its purpose is to show what could be done if we had topic labels. This method also produces a fairly large class imbalance; again, this is just a proof of concept. 
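
To illustrate the exact-match limitation, here is a minimal sketch of how an alias table could catch abbreviations such as "bf". The `ALIASES` entries, the `TOPICS` list, and the function name `label_with_aliases` are assumptions made for this example and are not part of the labeling used in the rest of the notebook.

```python
import re

# Hypothetical alias table: these abbreviations and targets are assumptions.
ALIASES = {"bf": "boyfriend", "pregnant": "pregnancy", "relationship": "relationships"}
TOPICS = ["love", "sex", "relationships", "pregnancy", "contraception", "boyfriend"]

def label_with_aliases(message):
    # Replace punctuation with spaces, lowercase, then scan tokens left to right.
    tokens = re.sub(r"[^0-9a-zA-Z]+", " ", str(message)).lower().split()
    for token in tokens:
        # Map an alias to its topic; leave other tokens unchanged.
        token = ALIASES.get(token, token)
        if token in TOPICS:
            return token
    return "unknown"

example_label = label_with_aliases("I want to go out with my bf")
print(example_label)
```

This still labels by the first matching word, so it inherits the same blindness to context; it only widens the set of words that count as a match.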

In [None]:
# returns the first key phrase (defined above) that appears in the message
def map_sentence_to_word(message):
  # replace punctuation with spaces
  message = re.sub('[^0-9a-zA-Z]+', ' ', message)
  # split the message into words
  message_split = message.split(" ")
  for i in message_split:
    # lowercase each word before comparing
    curr_string = i.lower()
    if curr_string in phrase_key: 
      return curr_string
  return "unknown"

In [None]:
df["label"] = df["new_message"].apply(map_sentence_to_word)
df

In [None]:
df["label"].value_counts()

In [None]:
# Drop the unknowns because they would just dilute our data while we are trying to train our model on it. 
df = df[df["label"] != "unknown"]
df

In [None]:
SEQ_LEN = 10
SEQ_STEP = 2
PERCENT_VALIDATION = 0.2
# string.digits includes "0", which the cleaning regex keeps, so every character has an index
ALPHABET = set(string.ascii_letters + string.punctuation + string.digits + ' ')

In [None]:
# begin to create the inputs for the LSTM 
messages = list(df["new_message"])

for i in range(len(messages)):
  messages[i] = re.sub('[^0-9a-zA-Z]+', ' ', messages[i])

labels = list(df["label"])
messages

# Making the Input to the LSTM Model
We create a series of sequences, each SEQ_LEN characters long, starting a new window every SEQ_STEP characters. After creating our sequences, we build mappings from index to character and from character to index for the X data. We also build mappings from index to phrase ("boyfriend", "love", etc.) and from phrase to index; the phrase is what we are predicting. Finally, we use these mappings to build the one-hot LSTM input, whose shape is (number of sequences) x (SEQ_LEN) x (number of characters in the alphabet). 
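
As a small worked example of the windowing and one-hot encoding, the sketch below uses toy values (a window length of 4, a step of 2, and an alphabet built from the message itself) rather than the notebook's SEQ_LEN, SEQ_STEP and ALPHABET. All names prefixed with `toy_` are hypothetical.

```python
import numpy as np

# Toy values for illustration only; the notebook uses SEQ_LEN = 10 and SEQ_STEP = 2.
toy_len, toy_step = 4, 2
toy_message = "i love him"
toy_alphabet = sorted(set(toy_message))
toy_char_to_index = {ch: idx for idx, ch in enumerate(toy_alphabet)}

# Slide a window of toy_len characters over the message, with the same
# loop bounds as the notebook's sequence-building cell.
toy_windows = [toy_message[j:j + toy_len]
               for j in range(0, len(toy_message) - toy_len, toy_step)]

# One-hot encode: (number of windows, window length, alphabet size).
toy_X = np.zeros((len(toy_windows), toy_len, len(toy_alphabet)))
for i, window in enumerate(toy_windows):
    for j, ch in enumerate(window):
        toy_X[i, j, toy_char_to_index[ch]] = 1

print(toy_windows)
print(toy_X.shape)
```

Each position in each window has exactly one 1 along the last axis, which is the format the LSTM input expects.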

In [None]:
sequences = []
label_seq = []
for i in range(len(messages)):
  message = messages[i]
  for j in range(0, len(message)-SEQ_LEN, SEQ_STEP):
    sequences.append(message[j:j+SEQ_LEN])
    label_seq.append(labels[i])
len(label_seq), len(sequences)

In [None]:
sequences

In [None]:
characters = sorted(ALPHABET)
index_to_char = {}
char_to_index = {}
# creating a mapping
for idx, char in enumerate(characters):
  index_to_char[idx] = char
  char_to_index[char] = idx

index_to_phrase  = {}
phrase_to_index = {}
for idx, phrase in enumerate(phrase_key):
  index_to_phrase[idx] = phrase
  phrase_to_index[phrase] = idx

In [None]:
X = np.zeros((len(sequences), SEQ_LEN, len(characters)))
y = np.zeros((len(label_seq), len(phrase_key)))
for i, seq in enumerate(sequences):
  for j, char in enumerate(sequences[i]):
    X[i, j, char_to_index[char]] = 1
  y[i, phrase_to_index[label_seq[i]]] = 1

In [None]:
# input to the LSTM
X

In [None]:
# shuffle, then split into training data and validation data 
idxes = np.random.permutation(X.shape[0])
val_num = int(PERCENT_VALIDATION*len(X))
X = X[idxes]
X_train = X[val_num:]
X_val = X[:val_num]
y = y[idxes]
y_train = y[val_num:]
y_val = y[:val_num]

# Creating the LSTM
Below is one possible architecture for the LSTM. I achieved fairly low loss on this dummy data; however, this will obviously change when we have real labels. 

In [None]:
def build_model(num_chars, output_chars):
    model = Sequential()
    model.add(Bidirectional(LSTM(64, input_shape=(SEQ_LEN, num_chars), return_sequences=True,unit_forget_bias=True)))
    model.add(BatchNormalization())
    model.add(Dropout(rate=.1))
    model.add(Bidirectional(LSTM(128,return_sequences=True,unit_forget_bias=True)))
    model.add(BatchNormalization())
    model.add(Dropout(rate=.1))
    model.add(Bidirectional(LSTM(256,return_sequences=True,unit_forget_bias=True)))
    model.add(BatchNormalization())
    model.add(Dropout(rate=.1))
    model.add(Bidirectional(LSTM(128)))
    model.add(BatchNormalization())
    model.add(Dropout(rate=.2))
    model.add(Dense(output_chars, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='RMSprop')
    
    return model

In [None]:
print("Making New Model")
model = build_model(len(characters), len(phrase_to_index))

In [None]:

BATCH_SIZE = 32
EPOCHS = 20 

trained_model = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs= EPOCHS, validation_data=(X_val, y_val))

# Testing on a random point
Below is a test sentence. We split it into sequences, convert those into a valid input for our LSTM, and predict. The model returns a prediction for every subsequence of the sentence, and we take the majority vote as the final label. 
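
The majority vote can be illustrated on made-up numbers. In this sketch, `toy_preds` stands in for the softmax output of the model over four subsequences, and `toy_index_to_phrase` is a hypothetical three-class mapping; neither comes from the trained model.

```python
import numpy as np

# Made-up softmax outputs: four subsequences, three hypothetical classes.
toy_index_to_phrase = {0: "love", 1: "sex", 2: "boyfriend"}
toy_preds = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])

# Each subsequence votes for its most likely class.
toy_votes = toy_preds.argmax(axis=1)
# Count the votes per class and pick the class with the most votes.
vote_counts = np.bincount(toy_votes, minlength=len(toy_index_to_phrase))
winner = int(vote_counts.argmax())
predicted_phrase = toy_index_to_phrase[winner]
print(predicted_phrase)
```

Note that `argmax` breaks ties in favor of the lowest class index, so with very few subsequences the result can depend on class ordering.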


In [None]:
test_message = ["Tell my bf I love him"]

In [None]:
test_seqs = []
for i in range(len(test_message)):
  message = test_message[i]
  for j in range(0, len(message)-SEQ_LEN, SEQ_STEP):
    test_seqs.append(message[j:j+SEQ_LEN])
test_seqs

In [None]:
X = np.zeros((len(test_seqs), SEQ_LEN, len(characters)))
for i, seq in enumerate(test_seqs):
  for j, char in enumerate(test_seqs[i]):
    X[i, j, char_to_index[char]] = 1

In [None]:
from scipy.stats import mode

In [None]:
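preds = model.predict(X)
# each subsequence votes for its most likely class
votes = np.argmax(preds, axis=1)
# mode() returns an array in older SciPy releases and a scalar in newer ones; atleast_1d handles both
majority = int(np.atleast_1d(mode(votes)[0])[0])
print(index_to_phrase[majority])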