# Exploring Translations of Hindi Messages

In this notebook I explore translating Hindi chatbot messages into English as a proof of concept.

In [None]:
# import statements
import pandas as pd
import numpy as np
import os
import re 

In [None]:
# setting up google colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd "drive/My Drive/girl_effect"

In [None]:
# installing the google_trans_new package, which wraps Google Translate
!pip install google_trans_new

In [None]:
# importing the translator class (the instance is created further down)
from google_trans_new import google_translator

In a separate notebook, I split the data into messages sent through Chhaa Jaa, India's flagship chatbot, and Big Sis, which is used in primarily English-speaking countries. In this notebook, I work with the Hindi text specifically.

In [None]:
# chatbot data from chhaa jaa only 
frame = pd.read_csv("chatbots_data_hindi.csv")
frame

## Google Translate API

Google Translate supports Hindi-to-English translation, and I access it here through the `google_trans_new` package. In the following code blocks I explore this API and its results on the dataset.

In [None]:
# casting messages to strings (missing values become the string "nan")
frame["Message"] = frame["Message"].astype(str)
# creating translator instance
translator = google_translator()

In [None]:
# only focusing on the first 500 entries because the kernel timed out
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later
frame = frame.head(500).copy()

In [None]:
# detect the language
# detect returns a list; its second element is the language name, e.g. "chinese (simplified)"
def language_detect(message):
  val = translator.detect(message)
  return val[1]

frame["language"] = frame["Message"].apply(language_detect)

In [None]:
frame

In [None]:
# checking how many rows were detected as hindi 
frame["language"].value_counts()
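As a cross-check on the API's detection that needs no network calls, one option is to test whether a message contains any Devanagari characters at all. This is only a rough sketch: `has_devanagari` and the sample messages are hypothetical, and the check would miss Hindi written in Latin script, which may well occur in chatbot data.

```python
import re

# Devanagari Unicode block (U+0900 to U+097F), the script used to write Hindi.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def has_devanagari(message: str) -> bool:
    """Return True if the message contains at least one Devanagari character."""
    return bool(DEVANAGARI.search(message))


# Hypothetical sample messages standing in for frame["Message"].
sample_messages = ["नमस्ते", "hello", "nan", "$#@"]
flags = [has_devanagari(m) for m in sample_messages]

# In the notebook this could become a column (assumed usage):
# frame["has_devanagari"] = frame["Message"].apply(has_devanagari)
```

Comparing this flag against the `language` column could show which rows the API labelled as something other than Hindi despite containing Hindi script.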

I noticed that many entries were detected as "chinese (simplified)". This is because most of these messages were NaN values or contained emojis that did not survive the conversion to CSV format. Earlier, in preprocessing, I replaced all of those characters with dummy characters such as "$#@". Here, I drop those rows.

In [None]:
chinese_frame = frame[frame["language"] == "chinese (simplified)"]

In [None]:
chinese_frame

In [None]:
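# keep everything that was not detected as chinese (simplified)
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later
frame = frame[frame["language"] != "chinese (simplified)"].copy()
frame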
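An alternative to filtering on the detected language would be to drop the empty and placeholder-only rows directly, before any API call is made. The sketch below assumes the dummy characters are exactly `$`, `#`, and `@`, as in the "$#@" example above; `is_empty_message` is a hypothetical helper, and the pattern would need adjusting to match the actual preprocessing.

```python
import re

# Assumption: placeholder rows contain only "$", "#", "@" and whitespace.
PLACEHOLDER_ONLY = re.compile(r"^[\s$#@]*$")


def is_empty_message(message: str) -> bool:
    """Return True for "nan" strings, blank strings, and placeholder-only strings."""
    text = message.strip()
    return text.lower() == "nan" or bool(PLACEHOLDER_ONLY.match(text))


# Hypothetical sample messages standing in for frame["Message"].
messages = ["nan", "$#@", "  ", "नमस्ते $#@", "hello"]
kept = [m for m in messages if not is_empty_message(m)]

# In the notebook this could be applied as a mask (assumed usage):
# frame = frame[~frame["Message"].apply(is_empty_message)].copy()
```

Messages that mix real text with placeholders are kept, since they may still translate usefully.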

After detecting the languages, I translate the messages from Hindi to English and spot-check the entries as a sanity check. It helps that I speak Hindi.

In [None]:
# translate to english 
def to_english(message):
  translated_message = translator.translate(message, lang_src='hi', lang_tgt='en')
  return translated_message

In [None]:
frame["english_version"] = frame["Message"].apply(to_english)
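Since the kernel timed out on larger slices of the data, one way to cut the number of API calls might be to translate each distinct message only once and reuse the result for repeats. This is a sketch rather than what the notebook does: `translate_unique` is a hypothetical helper, and `fake_translate` is a stand-in so the example runs without network access.

```python
from typing import Callable, Dict, Iterable, List


def translate_unique(messages: Iterable[str], translate: Callable[[str], str]) -> List[str]:
    """Translate each distinct message once and reuse the result for repeats."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    for message in messages:
        if message not in cache:
            cache[message] = translate(message)
        results.append(cache[message])
    return results


# Stand-in translator for illustration only; it records each call it receives.
calls: List[str] = []


def fake_translate(message: str) -> str:
    calls.append(message)
    return message.upper()


translated = translate_unique(["haan", "nahi", "haan"], fake_translate)
```

In the notebook, the same helper could be called as `frame["english_version"] = translate_unique(frame["Message"], to_english)`. How much this saves depends on how often messages repeat in the dataset.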

In [None]:
frame

From here, it looks like most of the messages are starting to make more sense. Next, I can apply the spell-checking work I did in a separate notebook to correct the English messages and better understand what the words mean.