# Data augmentation using googletrans
When data augmentation for sentiment analysis relies on translation, it is typically applied before preprocessing. Preprocessing removes details such as capitalization and punctuation, so translating the raw text first lets the translator work on natural sentences, and any variation the translation introduces is then normalized together with the original data.

The quality of the translations still affects the sentiment analysis model: a poor translation can change the meaning of the text, so use a high-quality translation service or tool. Also evaluate the model with and without the augmented data to determine whether translation-based augmentation improves or hinders its accuracy.
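The ordering described above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: `preprocess` is a hypothetical stand-in for whatever cleaning is applied later, and `back_translate` is any callable that returns a paraphrase of its input (for example, a round trip through a translation service).

```python
from typing import Callable, Iterable, List


def preprocess(text: str) -> str:
    # Hypothetical minimal preprocessing: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def augment_then_preprocess(
    texts: Iterable[str], back_translate: Callable[[str], str]
) -> List[str]:
    raw = list(texts)
    # Step 1: augment the raw, unprocessed text.
    augmented = [back_translate(t) for t in raw]
    # Step 2: preprocess originals and augmented copies in the same way.
    return [preprocess(t) for t in raw + augmented]
```

Passing the translation step in as a function keeps the ordering explicit and makes it easy to swap the translation tool when comparing augmentation quality.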

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install 'googletrans==3.1.0a0'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting googletrans==3.1.0a0
  Downloading googletrans-3.1.0a0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==3.1.0a0)
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==3.1.0a0)
  Downloading hstspreload-2023.1.1-py3-none-any.whl (1.5 MB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==3.1.0a0)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting idna==2.* (from httpx==0.13.3->go

In [None]:
import pandas as pd

# Path to the CSV file of neutral tweets
neutral_tweets_path = '/content/neutrals_data.csv'

# Load the CSV file into a pandas DataFrame
neutral_tweets = pd.read_csv(neutral_tweets_path)

# Convert the 'tweet' column to the string data type
neutral_tweets['tweet'] = neutral_tweets['tweet'].astype(str)

# Collect the tweets in a plain list for the translation loop
list_neutral_tweets = list(neutral_tweets.tweet)

In [1]:
from googletrans import Translator

translator = Translator()
new_instances = []

# Back-translate each tweet: English -> Spanish -> English
val = 0
for text in list_neutral_tweets:
  try:
    translation_spanish = translator.translate(text, src='en', dest='es').text
    translation_english = translator.translate(translation_spanish, src='es', dest='en').text
    val += 1
    print(val)  # progress counter
    new_instances.append(translation_english)
  except TypeError:
    # Skip tweets that googletrans fails to translate
    continue
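
The loop above drops a tweet as soon as one call fails. A possible variant, sketched here under assumptions, retries a few times before giving up. The helper name `back_translate` and the `pivot`, `retries`, and `delay` parameters are hypothetical; the only library call assumed is `translator.translate(text, src=..., dest=...)` returning an object with a `.text` attribute, as used in the cell above.

```python
import time


def back_translate(text, translator, pivot="es", retries=3, delay=1.0):
    """Round-trip `text` through `pivot`; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            pivot_text = translator.translate(text, src="en", dest=pivot).text
            return translator.translate(pivot_text, src=pivot, dest="en").text
        except Exception:
            # Assumption: failures are transient, so wait briefly and retry.
            if attempt < retries - 1:
                time.sleep(delay)
    return None
```

With this helper, the loop body could become `result = back_translate(text, translator)` followed by appending `result` only when it is not `None`.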

In [None]:
new_neutral_instances = []

# Give each back-translated tweet the sentiment label 2
for tweet in new_instances:
    new_row = {'tweet': tweet, 'sentiment': 2}
    new_neutral_instances.append(new_row)

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(new_neutral_instances)

df.to_csv('/content/neutral_tweets.csv', index=False)
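
Back-translation sometimes returns the original sentence unchanged, or the same paraphrase for several inputs, and such rows add no new variation to the training set. The sketch below shows one way to filter them out before building the DataFrame. The function name `build_rows` and its case-insensitive comparison are assumptions made for illustration, not part of the notebook.

```python
def build_rows(augmented, originals, label=2):
    """Build row dicts, skipping empty results, copies of originals, and repeats."""
    seen = {t.strip().lower() for t in originals}
    rows = []
    for tweet in augmented:
        if not tweet:
            continue  # failed or empty translation
        key = tweet.strip().lower()
        if key in seen:
            continue  # same as an original tweet or an earlier augmented one
        seen.add(key)
        rows.append({'tweet': tweet, 'sentiment': label})
    return rows
```

The result can be passed directly to `pd.DataFrame(...)`, for example `pd.DataFrame(build_rows(new_instances, list_neutral_tweets))`.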