## 0-1. Import libraries

Import the necessary libraries and download the required NLTK packages.

In [1]:
import pandas as pd
import re
import random
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/wonsukcha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/wonsukcha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/wonsukcha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 0-2. Import word2vec

GoogleNews-vectors-negative300 is a word vector model pre-trained on the Google News corpus.
Gensim provides the `KeyedVectors` class, which can be constructed from a pre-trained model.

In [2]:
import os
import gzip
import shutil
import urllib.request
from gensim.models import Word2Vec, KeyedVectors

GoogleNews-vectors-negative300 is larger than 3 GB. Be careful when you run the following cell, which downloads and decompresses the full file.

In [3]:
URL = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
FILENAME = 'GoogleNews-vectors-negative300.bin.gz'
filepath = os.path.join(os.getcwd(), FILENAME)
urllib.request.urlretrieve(URL, filepath)

filepath_vec = filepath.replace('.gz', '')
# decompress the downloaded archive next to it
with gzip.open(filepath, 'rb') as f_in:
    with open(filepath_vec, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
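
Because the archive is large, it can help to avoid repeating work that has already been done. The following is a minimal sketch of one possible guard, not part of the original notebook: `decompress_once` is a hypothetical helper that decompresses a `.gz` file only when the decompressed file does not exist yet.

```python
import gzip
import os
import shutil


def decompress_once(gz_path):
    """Decompress gz_path next to itself unless the output already exists.

    Hypothetical helper for illustration; returns the path of the
    decompressed file.
    """
    # strip only the trailing '.gz' suffix, not every '.gz' in the path
    if gz_path.endswith('.gz'):
        out_path = gz_path[:-len('.gz')]
    else:
        out_path = gz_path + '.out'

    # skip the expensive copy when the output is already there
    if not os.path.exists(out_path):
        with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    return out_path
```

The same existence check could be placed in front of the download step. Note that this sketch does not detect a partially written output file.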

Create a word2vec instance using pretrained GoogleNews-vectors-negative300 vector model.

In [4]:
pretrained_path = filepath_vec
w2v_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)

### 1. File selection & load data

We are going to use a mini dataset with four tweets.

In [5]:
filename = 'input_tweets_miniset.csv'

First, load the data into a pandas dataframe and drop the `user` column.

In [6]:
# read csv
df = pd.read_csv(os.path.join(filename))

# drop user column
df = df.drop(['user'], axis=1)

### 2. Pre-processing

Create a copy of the dataframe. We will keep the intermediate steps in this copy.

In [7]:
# create a copy of dataframe
df_copied = df.copy()

Use regular expressions and Python string methods to remove unnecessary data.

In [8]:
# beginning: retweet marker and @mentions
# regex=True is passed explicitly because recent pandas versions default
# to regex=False, which rejects compiled patterns
regex_beginning = re.compile(r'(RT\s)?(@\S+)')
df_copied['preprocessed'] = df_copied['text'].str.replace(regex_beginning, '', regex=True)

# end: word cut off by a trailing ellipsis
regex_end = re.compile(r'[^. ]*…')
df_copied['preprocessed'] = df_copied['preprocessed'].str.replace(regex_end, '', regex=True)

# website
regex_web = re.compile(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')
df_copied['preprocessed'] = df_copied['preprocessed'].str.replace(regex_web, '', regex=True)

# remove hashtag
regex_hash = re.compile(r'#+[a-zA-Z0-9(_)]{1,}')
df_copied['preprocessed'] = df_copied['preprocessed'].str.replace(regex_hash, '', regex=True)

# remove digits, periods, parentheses, commas, etc.
regex_unnecessary = re.compile(r"[^a-zA-Z' ]")
df_copied['preprocessed'] = df_copied['preprocessed'].str.replace(regex_unnecessary, '', regex=True)

# lowercase and trim
df_copied['preprocessed'] = df_copied['preprocessed'].str.lower()
df_copied['preprocessed'] = df_copied['preprocessed'].str.strip()
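
To see what the cleaning steps do to a single tweet, the sketch below applies the same patterns, in the same order, to one string with the standard `re` module. The sample tweet and the helper name `clean_tweet` are invented for illustration and are not part of the dataset.

```python
import re

# The same patterns as in the cell above, in the same order.
PATTERNS = [
    re.compile(r'(RT\s)?(@\S+)'),  # retweet marker and @mentions
    re.compile(r'[^. ]*…'),        # word cut off by a trailing ellipsis
    re.compile(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}'
               r'\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'),  # URLs
    re.compile(r'#+[a-zA-Z0-9(_)]{1,}'),  # hashtags
    re.compile(r"[^a-zA-Z' ]"),           # anything but letters, apostrophes, spaces
]


def clean_tweet(text):
    """Remove every pattern in turn, then lowercase and trim."""
    for pattern in PATTERNS:
        text = pattern.sub('', text)
    return text.lower().strip()


# Hypothetical tweet, made up for this example.
sample = ("RT @some_user: Check 3 new #COVID19 updates at "
          "https://example.com/news today, it's imp…")

# split() is used only to hide the repeated spaces left by the removals
print(clean_tweet(sample).split())
```

The removals leave runs of spaces behind, which is harmless here because the next step tokenizes the text.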


Tokenize the cleaned tweet. Since NLTK tokenizes "'s" as a separate token, it is better to filter it out.
With this step, we prevent "'s" from being chosen as a word to be replaced.

In [9]:
# tokenize
df_copied['preprocessed'] = df_copied['preprocessed'].apply(word_tokenize)
df_copied['preprocessed'] = df_copied['preprocessed'].apply(lambda lst: [token for token in lst if token != "'s"])

Remove stop words.

In [10]:
stop_words_nltk = set(stopwords.words('english'))
df_copied['preprocessed'] = df_copied['preprocessed'] \
    .apply(lambda lst: [word for word in lst if word not in stop_words_nltk])
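
The two filtering steps can be tried on a plain list without NLTK. This is only a sketch: the token list is hand-written to resemble `word_tokenize` output, and the stop-word set is a small hand-picked stand-in for the full NLTK English list.

```python
# Hypothetical token list, similar in shape to word_tokenize output.
tokens = ['the', 'hospital', "'s", 'staff', 'is', 'not', 'resting']

# Small hand-picked stop-word set for illustration only; the notebook
# uses the full NLTK English list.
stop_words = {'the', 'is', 'not'}

# drop the possessive token and every stop word in one pass
filtered = [t for t in tokens if t != "'s" and t not in stop_words]
print(filtered)  # ['hospital', 'staff', 'resting']
```

Negations such as "not" are removed as stop words, as the first preprocessed row in the notebook output shows. Because the augmented text is later built from the original `text` column, this only affects which words can be picked for replacement.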

Let's check the copied dataframe. The preprocessed data is in the `preprocessed` column.

In [11]:
df_copied

| | sentiment | text | preprocessed |
|---|---|---|---|
| 0 | positive | RT @cutedejun: sm not letting xiaojun go for h... | [sm, letting, xiaojun, go, graduation, brother... |
| 1 | negative | @Marcheline3Di For me, it was spent recovering... | [spent, recovering, allnighter, spent, music, ... |
| 2 | positive | RT @OntarioHealthC: 954 ongoing #COVID19 outbr... | [ongoing, outbreaks, ontario, hospitals, retir... |
| 3 | negative | RT @MicahPollak: Well, #COVID19 is once again ... | [well, leading, cause, death, based, average, ... |


### 3. Data augmentation

Here we choose two words at random from each preprocessed tweet.
We then look up a synonym for each selected word using the word2vec model.

In [12]:
def get_synonyms(lst):
    # pick two distinct random words (fewer if the tweet has fewer than
    # two unique tokens, which would otherwise loop forever)
    unique_words = list(dict.fromkeys(lst))
    words = random.sample(unique_words, min(2, len(unique_words)))

    # look up the most similar word for each pick; a word that is not in
    # the word2vec vocabulary raises KeyError and gets None instead
    pairs = []
    for word in words:
        try:
            synonym = w2v_model.most_similar(word)[0][0]
        except KeyError:
            synonym = None
        pairs.append((word, synonym))
    return pairs

df_copied['synonyms'] = df_copied['preprocessed'].apply(get_synonyms)
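
Because the words are drawn at random, each run of the notebook augments different words. If repeatable results are wanted, one option is a seeded random generator. The sketch below assumes a hypothetical helper `pick_two` and is not part of the original notebook.

```python
import random


def pick_two(words, seed=None):
    """Return up to two distinct words, reproducibly when a seed is given.

    Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    # drop duplicates while keeping the original order
    unique = list(dict.fromkeys(words))
    # sample() never returns the same element twice
    return rng.sample(unique, min(2, len(unique)))


tokens = ['spent', 'recovering', 'allnighter', 'spent', 'music']
# the same seed always yields the same pair
print(pick_two(tokens, seed=42) == pick_two(tokens, seed=42))  # True
```

Sampling from the de-duplicated tokens also covers tweets with fewer than two distinct words, for which the function simply returns fewer picks.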

In [13]:
df_copied

| | sentiment | text | preprocessed | synonyms |
|---|---|---|---|---|
| 0 | positive | RT @cutedejun: sm not letting xiaojun go for h... | [sm, letting, xiaojun, go, graduation, brother... | [(brother, younger_brother), (last, earlier)] |
| 1 | negative | @Marcheline3Di For me, it was spent recovering... | [spent, recovering, allnighter, spent, music, ... | [(projects, project), (related, relating)] |
| 2 | positive | RT @OntarioHealthC: 954 ongoing #COVID19 outbr... | [ongoing, outbreaks, ontario, hospitals, retir... | [(high, low), (ontario, alberta)] |
| 3 | negative | RT @MicahPollak: Well, #COVID19 is once again ... | [well, leading, cause, death, based, average, ... | [(deaths, fatalities), (daily, weekly)] |
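
Pairs such as (high, low) and (ontario, alberta) in the output above show that the nearest word2vec neighbour is a word used in similar contexts, not necessarily a synonym. Gensim's `most_similar` ranks candidates by cosine similarity between vectors. The toy sketch below illustrates the idea with invented 3-dimensional vectors; the numbers are made up and do not come from the GoogleNews model.

```python
import math

# Toy vectors invented for illustration; real GoogleNews vectors have
# 300 dimensions.
toy_vectors = {
    'high':   [0.9, 0.1, 0.3],
    'low':    [0.8, 0.2, 0.3],
    'tall':   [0.5, 0.7, 0.1],
    'banana': [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


print(nearest('high', toy_vectors))  # low
```

For sentiment data, a replacement like high to low may change the meaning of a tweet, so it may be worth inspecting the chosen pairs before using the augmented text.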


Now that we have the words to replace and their synonyms, let's create a new dataset in which those words in the original texts are replaced with the synonyms.

In [14]:
text_augmented = []
for idx, synonyms in enumerate(df_copied['synonyms']):
    text = df['text'][idx]
    for synonym in synonyms:
        if synonym[1] is not None:
            text = text.replace(synonym[0], synonym[1])
    text_augmented.append(text)
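
`str.replace` substitutes every occurrence of the substring, including occurrences inside longer words. A possible alternative, sketched here with a hypothetical helper `replace_whole_word`, is a regular expression with word boundaries.

```python
import re


def replace_whole_word(text, word, replacement):
    """Replace word only where it appears as a whole word.

    Hypothetical helper for illustration.
    """
    # re.escape guards against regex metacharacters in the word
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    # a function replacement keeps backslashes in `replacement` literal
    return pattern.sub(lambda _: replacement, text)


sentence = 'the high rate highlights risk'
print(sentence.replace('high', 'low'))              # the low rate lowlights risk
print(replace_whole_word(sentence, 'high', 'low'))  # the low rate highlights risk
```

The selected words come from the lowercased `preprocessed` column, so a capitalised occurrence in the original text is not matched by either approach; passing `re.IGNORECASE` to `re.compile` would be one way to handle that.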

In [15]:
dict_augment = {
    'sentiment': df['sentiment'],
    'text': pd.Series(text_augmented)
}
df_augmented = pd.DataFrame(dict_augment)

Concatenating the two dataframes gives a new dataframe twice the size of the original.

In [16]:
df_doubled = pd.concat([df, df_augmented], ignore_index=True)
df_doubled

| | sentiment | text |
|---|---|---|
| 0 | positive | RT @cutedejun: sm not letting xiaojun go for h... |
| 1 | negative | @Marcheline3Di For me, it was spent recovering... |
| 2 | positive | RT @OntarioHealthC: 954 ongoing #COVID19 outbr... |
| 3 | negative | RT @MicahPollak: Well, #COVID19 is once again ... |
| 4 | positive | RT @cutedejun: sm not letting xiaojun go for h... |
| 5 | negative | @Marcheline3Di For me, it was spent recovering... |
| 6 | positive | RT @OntarioHealthC: 954 ongoing #COVID19 outbr... |
| 7 | negative | RT @MicahPollak: Well, #COVID19 is once again ... |


Export the new dataset as a comma-separated text file.

In [18]:
df_doubled.to_csv('output_data.txt', sep=',', index=False)