# Translating Text

This notebook is not part of the final submission used to build the model; it is where the idea of translating text into a new language was tried and tested. The translated versions were originally stored as separate data files, but those datasets were later removed. As a result, the code in this notebook may not be as neat or comprehensive as the rest of the project.

**NOTE:** In a way, this idea of translation can be likened to the automated dictionary approach.

### Some of the ideas tested
1. Labelling all stopwords as STOPWORD
2. Labelling profane words (based on a list of profanity) as PROFANITY
3. Labelling "you" words (you, your, yourself, and so on) as YOUWORD

### Advantages of such translation
1. Reduces the space needed to store the data, at the cost of a slight decrease in performance
2. Reinforces the social idea that *it does not matter what word was used as long as the meaning is offensive*
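
One way to read the first advantage is as a reduction in vocabulary size: many distinct words collapse into a handful of placeholder tokens. The sketch below illustrates this on a toy example. The word lists and sample sentences are made up for illustration and are not the lists used in this project.

```python
# Toy word classes (illustrative assumptions, not the project's real lists)
YOU_WORDS = {"you", "your", "yours"}
STOP_WORDS = {"is", "a", "the", "are", "so", "this"}


def collapse(tokens):
    """Replace tokens from each word class with a single placeholder."""
    out = []
    for tok in tokens:
        if tok in YOU_WORDS:
            out.append("YOUWORD")
        elif tok in STOP_WORDS:
            out.append("STOPWORD")
        else:
            out.append(tok)
    return out


samples = [
    "you are so wrong",
    "this is your problem",
    "the article is a mess",
]

# Distinct tokens before and after collapsing the word classes
original_vocab = {tok for s in samples for tok in s.split()}
collapsed_vocab = {tok for s in samples for tok in collapse(s.split())}

print(len(original_vocab), "distinct tokens before")
print(len(collapsed_vocab), "distinct tokens after")
```

The trade-off is that the model can no longer tell the collapsed words apart, which is one plausible source of the slight performance drop mentioned above.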

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("../data/train.csv")
df.head()

Unnamed: 0,id,attack,text
0,348598183,0,which may contain more details
1,61527923,1,"Regardless, the point is that I am willing to ..."
2,325989249,0,Lede \nI'm reverting (again) the additions to...
3,197250961,0,I just came to this page and was wondering why...
4,116195271,1,It's worth having an illustration. The Type 2...


In [None]:
import re
DEF_STOPWORDS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 'can', 'will', 'just', "don't", 'should', "should've", 'now', "aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't", "haven't", "isn't", "mightn't", "mustn't", "needn't", "shan't", "shouldn't", "wasn't", "weren't", "won't", "wouldn't"]

In [None]:
def read_word_list(path):
    """Read a one-word-per-line file and return the words as a set."""
    with open(path, 'r') as f:
        return {line.rstrip('\n') for line in f}


# Word lists are built once here instead of on every call to translate_text

# you words
you_tokens = {'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves'}

# most common profanity tokens as per training data
common_profane_tokens = {'fuck', 'nigga', 'suck', 'die', 'bitch', 'faggot', 'shit', 'ass', 'bastard', 'blocked', 'kill', 'block', 'aids'}

# profane tokens as per list
profane_tokens = read_word_list("../data/external/profanity_list.txt")

# single-occurrence words
custom_stopwords = read_word_list("../data/custom_stopwords.txt")

# default stopwords
stop_tokens = set(DEF_STOPWORDS)


def translate_text(text):
    """
    Translate the text into a new "special" language.

    Tokens are mapped to YOUWORD, COMMONPROFANITY, PROFANITY or STOPWORD
    (checked in that order); all other tokens are kept unchanged.
    """
    # lowercase and remove punctuation
    text = re.sub(r'[^\w\s]', '', text.lower())

    # drop single-occurrence words before translating
    text_tokens = [tok for tok in text.split() if tok not in custom_stopwords]

    translated_tokens = []
    for token in text_tokens:
        if token in you_tokens:
            translated_tokens.append("YOUWORD")
        elif token in common_profane_tokens:
            translated_tokens.append("COMMONPROFANITY")
        elif token in profane_tokens:
            translated_tokens.append("PROFANITY")
        elif token in stop_tokens:
            translated_tokens.append("STOPWORD")
        else:
            translated_tokens.append(token)

    return " ".join(translated_tokens)
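
One caveat with the function above: the regular expression `[^\w\s]` strips apostrophes before the tokens are compared, so `"you're"` becomes `youre` and no longer matches the contraction entries in `you_tokens` or `DEF_STOPWORDS`. The sketch below shows the effect and one possible workaround that keeps apostrophes during cleaning. `clean_original` and `clean_keep_apostrophes` are hypothetical helpers written for this illustration; they are not part of the notebook.

```python
import re

you_tokens = ['you', "you're", "you've", "you'll", "you'd",
              'your', 'yours', 'yourself', 'yourselves']


def clean_original(text):
    # Same cleaning step as translate_text: drops every punctuation mark
    return re.sub(r'[^\w\s]', '', text.lower()).split()


def clean_keep_apostrophes(text):
    # Hypothetical variant: keep straight apostrophes so contractions survive
    return re.sub(r"[^\w\s']", '', text.lower()).split()


sample = "You're wrong, and you know it!"

# Which tokens are recognised as "you" words under each cleaning step
print([tok in you_tokens for tok in clean_original(sample)])
print([tok in you_tokens for tok in clean_keep_apostrophes(sample)])
```

Keeping apostrophes would change the translated output shown in this notebook, so it is an option to evaluate rather than a drop-in fix. The variant also only handles the straight apostrophe `'`, not typographic ones.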

In [None]:
df['translated'] = df.text.apply(translate_text)

In [None]:
df.head()

Unnamed: 0,id,attack,text,translated
0,348598183,0,which may contain more details,may contain details
1,61527923,1,"Regardless, the point is that I am willing to ...",regardless point willing information add id ra...
2,325989249,0,Lede \nI'm reverting (again) the additions to...,lede im reverting STOPWORD additions lede cont...
3,197250961,0,I just came to this page and was wondering why...,came wondering criticism controversy tab click...
4,116195271,1,It's worth having an illustration. The Type 2...,STOPWORD worth STOPWORD illustration type 2 pi...


In [None]:
# Replace the raw text with its translated version and save the result
df = df.drop(columns=['text']).rename(columns={'translated': 'text'})
df.to_csv('../data/train_translated.csv', index=False)

In [None]:
df.head()

Unnamed: 0,id,attack,text
0,348598183,0,may contain details
1,61527923,1,regardless point willing information add id ra...
2,325989249,0,lede im reverting STOPWORD additions lede cont...
3,197250961,0,came wondering criticism controversy tab click...
4,116195271,1,STOPWORD worth STOPWORD illustration type 2 pi...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      15000 non-null  int64 
 1   attack  15000 non-null  int64 
 2   text    15000 non-null  object
dtypes: int64(2), object(1)
memory usage: 351.7+ KB


In [None]:
train = pd.read_csv("../data/train_translated.csv")
target = 'attack'
text_feat = 'text'

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      15000 non-null  int64 
 1   attack  15000 non-null  int64 
 2   text    14998 non-null  object
dtypes: int64(2), object(1)
memory usage: 351.7+ KB
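
The `text` column has 15000 non-null values before saving but only 14998 after reloading. A likely explanation (an assumption, not verified against the data) is that two comments were reduced to an empty string by the translation, and `pd.read_csv` reads empty fields as `NaN` by default. The sketch below reproduces that round trip on a made-up frame and shows two ways to get strings back.

```python
import io

import pandas as pd

# Made-up frame: the second row has an empty translated text
toy = pd.DataFrame({
    "id": [1, 2],
    "attack": [0, 1],
    "text": ["may contain details", ""],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["text"].isna().sum())

# Option 1: replace missing text after loading
filled = reloaded.fillna({"text": ""})

# Option 2: keep empty fields as empty strings while loading
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
```

Either option avoids passing `NaN` values to a text vectorizer later on. Dropping the affected rows is another choice if empty comments carry no signal.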

