### Data Augmentation
This notebook preprocesses the data, keeping only the columns needed to build a generic topic classification machine learning model. It:

* reads the data

* drops all columns except the title text and the topic label

* augments the under-represented topic by back translation so that the dataset becomes balanced

The resulting dataset can be found [here](https://www.kaggle.com/karimamd95/topic-balaned-dataset),
and classification models built on it can be found on my profile, such as [this](https://www.kaggle.com/karimamd95/topic-classification-doc2vec-shallow-learning) and [this](https://www.kaggle.com/karimamd95/news-topic-classification-keras/edit).

In [None]:
!pip install ipython-autotime
%load_ext autotime

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
file = None
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import itertools
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

In [None]:
file = '/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv'
df=pd.read_csv(file, delimiter=';')
df.head()

### Exploratory Analysis and basic cleaning

In [None]:
df.shape

In [None]:
# keep only needed columns
df=df[['topic', 'title']]
df.head()

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates()  # remove duplicate rows, then confirm via the shape
df.shape

In [None]:
df.topic.value_counts()

Science is under-represented! To avoid biasing the ML model's predictions against science, we can either upsample it or downsample the other topics.
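
To get a rough sense of how much oversampling is needed, the class counts can be compared directly. The sketch below uses made-up counts (not the dataset's actual numbers) and a hypothetical helper, `copies_needed`, which estimates how many extra copies of each minority-class title would bring that class up to the size of the largest other class.

```python
import math
import pandas as pd

def copies_needed(counts: pd.Series, label: str) -> int:
    """Extra copies of each `label` row needed to reach the largest other class."""
    target = counts.drop(label).max()
    current = counts[label]
    # ceil(target / current) is the total multiple; subtract the original copy
    return max(0, math.ceil(target / current) - 1)

# Illustrative counts only; these are not taken from the dataset
toy_counts = pd.Series({"TECH": 15000, "HEALTH": 15000, "SCIENCE": 3500})
extra = copies_needed(toy_counts, "SCIENCE")
print(extra)
```

In the notebook itself, `df.topic.value_counts()` could be passed in place of `toy_counts`; each intermediate language used for back translation then contributes roughly one extra copy.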

### Dealing with data imbalance: oversampling the science topic by back translation

Back translation is a text augmentation technique that translates the input sentence into another language and then back into the source language.
The output sentence is usually very close in meaning to the input but worded differently, so each pass can double our data, usually with an acceptable error rate.

In [None]:
!pip install BackTranslation # install library

In [None]:
from BackTranslation import BackTranslation

In [None]:
df['title'][0] #try back translation on a sample

In [None]:
translator = BackTranslation()
result = translator.translate(df['title'][0], src='en') #use default intermediate language
print(result.result_text)

The sentences have the same or a very similar meaning, but with different structure and wording!

#### Trying several intermediate languages to find the best fit

In [None]:
# we need 4 to 5 times as many science samples, which calls for several intermediate languages
# look up their language codes so we can use them
print(translator.searchLanguage('French'))
print(translator.searchLanguage('Italian'))
print(translator.searchLanguage('Spanish'))
print(translator.searchLanguage('German'))
print(translator.searchLanguage('Arabic'))
print(translator.searchLanguage('Chinese'))
print(translator.searchLanguage('Japanese'))
print(translator.searchLanguage('Russian'))

In [None]:
temp_languages = ['fr', 'it', 'es', 'de', 'ar', 'zh-cn', 'ja', 'ru']

In [None]:
# testing effectiveness
sample = df['title'][0]
print('original:')
print(sample)
print('backtranslated:')
for lang in temp_languages:
    result = translator.translate(sample, src='en', tmp=lang)
    print(lang + ':   ' + result.result_text   )
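
Eyeballing the printed output works for one sample, but the comparison can also be given a rough number. The sketch below is one possible approach, not something the notebook does: it scores each back-translation against the original with the standard library's `difflib.SequenceMatcher` and ranks languages from most to least different. The helper names and the example strings are hypothetical and written by hand for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rank_by_difference(original: str, candidates: dict) -> list:
    """Sort language codes so the most different paraphrase comes first."""
    return sorted(candidates, key=lambda lang: similarity(original, candidates[lang]))

# Hand-written stand-ins for back-translated text (not real translator output)
original = "Scientists discover a new planet"
candidates = {
    "fr": "Scientists discover a new planet",
    "ja": "Researchers have found a new planet",
    "de": "Scientists discover new planet",
}
ranking = rank_by_difference(original, candidates)
print(ranking)
```

A language whose output scores 1.0 returned the title unchanged and adds no new sample. Character-level similarity says nothing about whether the meaning was preserved, so a manual check of a few samples is still worthwhile.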

In [None]:
df_science= df.query('topic == "SCIENCE"')
df_science

In [None]:
import tqdm.notebook as tq  # notebook-friendly progress bars, used as tq.tqdm(...)

In [None]:
# goal: keep the languages that produce the most different sentences
# (the loop below currently uses every language in temp_languages)
science_titles=df_science['title'].to_list()
science_dict= {'topic': [], 'title':[]}

Brute-force loop over every intermediate language and every science title:

In [None]:
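for lang in tq.tqdm(temp_languages):
    for title in tq.tqdm(science_titles):
        try:
            # read the text before appending so both lists stay the same length
            text = translator.translate(title, src='en', tmp=lang).result_text
        except Exception:
            # skip titles that fail to translate (e.g. network or API errors)
            continue
        science_dict['topic'].append('SCIENCE')
        science_dict['title'].append(text)
# note: df_science now holds only the back-translated titles, not the original rows
df_science = pd.DataFrame(data=science_dict)
df_science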

In [None]:
non_science_df = df.query('topic != "SCIENCE"')
non_science_df

### Trying multiprocessing for data augmentation

The trial did not work as expected, but here is the code:

In [None]:
# import multiprocessing

# multiprocessing.cpu_count() # number of cores

In [None]:
# import time
# from multiprocessing import Pool

# def backtrans(title):
#     try:
#         result = translator.translate(title, src='en', tmp='it')
#         science_dict['topic'].append('SCIENCE')
#         science_dict['title'].append(result.result_text)
#         print('success')
#     except:
#         print('error in:')
#         print(title)
#     time.sleep(1)

# p = Pool(processes=5)
# result = p.map(backtrans, science_titles)
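
One likely reason the `Pool` attempt did not behave as hoped: each worker process gets its own copy of `science_dict`, so appends made inside `backtrans` never reach the parent notebook, and `p.map` only returns what the function returns (here `None`). A minimal sketch of an alternative is to return the result from the worker and collect it in the caller. The version below swaps in a thread pool (`concurrent.futures.ThreadPoolExecutor`), on the assumption that the work is dominated by waiting on the network, and uses a stand-in `fake_backtranslate` function instead of the real translator so that it runs offline. All helper names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_backtranslate(title: str) -> str:
    """Stand-in for translator.translate(...).result_text (illustrative only)."""
    return title.upper()

def safe_backtranslate(title: str):
    """Return the paraphrase, or None if the call fails."""
    try:
        return fake_backtranslate(title)
    except Exception:
        return None

def augment(titles, workers: int = 5) -> dict:
    """Back-translate titles concurrently and collect the results in the caller."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe_backtranslate, titles))  # keeps input order
    kept = [r for r in results if r is not None]
    return {"topic": ["SCIENCE"] * len(kept), "title": kept}

toy_titles = ["new comet spotted", "fusion milestone reached"]
augmented = augment(toy_titles)
print(augmented)
```

Because the workers return values instead of mutating shared state, the same pattern would also work with a process pool. Whether the translation service tolerates concurrent requests is a separate question that this sketch does not address.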



In [None]:
df_science.shape

In [None]:
pd.concat([non_science_df, df_science])  # preview only; DataFrame.append was removed in pandas 2.0

In [None]:
non_science_df.shape

In [None]:
df_balanced = pd.concat([non_science_df, df_science], ignore_index=True)
df_balanced['topic'].value_counts()

In [None]:
df_balanced.drop_duplicates(inplace=True)

In [None]:
df_balanced['topic'].value_counts()

In [None]:
# save the augmented dataset to a CSV file
df_balanced.to_csv('topic_balanced_aug.csv', index=False, header=True)

In [None]:
df['topic'].value_counts()