# Advanced Data Science Capstone - Week 2 - Feature Engineering/Creation and ETL

This notebook differs from *Feature Creation (Bag of Words)* in that it preprocesses and saves the data for word embedding approaches, which I found easier to do in a separate notebook. Consider it a complement to the notebook named *Feature Creation (Bag of Words) - Twitter US Airline Sentiment*.

In [2]:
# importing data
import pandas as pd

data = pd.read_csv("data/data_preprocessed.csv", sep='\t')

# checking its dimensions
data.shape

(14133, 24)

In [3]:
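# utils is this project's local helper module
import utils

# Clean each tweet first, then run the general text preprocessing on the result
data['text_preprocessed'] = data.text.apply(lambda x: utils.preprocess_tweet(x, punctuation=True))
data['text_preprocessed'] = data.text_preprocessed.apply(utils.preprocess_text)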

## Word2Vec

We will create word embeddings with the gensim library, which makes it easy to train a model and tune its hyperparameters. An important point is that we must split the data into training and test sets before training the embeddings: if we fit the embeddings on all of the data, information from the test set leaks into the features.

In [4]:
text = data['text_preprocessed']
y = data['airline_sentiment']

from sklearn.model_selection import train_test_split

text_train, text_test, y_train, y_test = train_test_split(text, y, test_size=0.30, random_state=0)

text_train.shape, text_test.shape

((9893,), (4240,))
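
To see why the embeddings should be fit on the training split only, it helps to look at the vocabulary. The sketch below uses made-up token lists (not the airline data) and two hypothetical helpers, `build_vocab` and `split_known_unknown`, that are not part of this project. It builds a vocabulary from the training sentences alone and shows that a test tweet can contain words the model has never seen, which is the situation the model will face on genuinely new data.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Collect every token that appears at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

def split_known_unknown(tokens, vocab):
    """Separate tokens into those inside and outside the vocabulary."""
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown

# Toy data, invented for illustration only
toy_train = [["flight", "was", "delayed"], ["lost", "my", "bag"]]
toy_test = [["flight", "cancelled", "again"]]

vocab = build_vocab(toy_train)  # built from the training split only
known, unknown = split_known_unknown(toy_test[0], vocab)
print(known, unknown)
```

If the vocabulary had been built from both splits, `unknown` would be empty and the evaluation would look more optimistic than it should.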

In [5]:
sentences = list(text_train.values)

In [8]:
import logging
import warnings
from gensim.models import word2vec

warnings.filterwarnings("ignore")

# Parameters
num_features = 25     # Word vector dimensionality
min_word_count = 1    # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 5           # Context window size
downsampling = 1e-3   # Downsample setting for frequent words


print("Training model...")
# Note: `size` is the gensim 3.x name; gensim 4.x renamed it to `vector_size`.
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient
# (gensim 3.x; the method is deprecated in gensim 4.x).
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "word_embeddings/25features_1minwords_5context"
model.save(model_name)

Training model...
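
The `sample` setting (`downsampling = 1e-3` in the cell above) controls how aggressively very frequent words are randomly dropped during training. As a rough illustration, the sketch below implements the keep-probability formula from the original word2vec C code, which gensim follows as far as I know. The function name `keep_probability` and the example frequencies are assumptions made for this example, not values taken from the airline data.

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens that are this word (must be > 0).
    sample: the downsampling threshold passed to Word2Vec.
    """
    prob = (math.sqrt(word_fraction / sample) + 1) * (sample / word_fraction)
    return min(1.0, prob)  # words at or below the threshold are always kept

# Hypothetical frequencies: a rare word, a common word, a very common word
for fraction in (0.001, 0.01, 0.1):
    print(fraction, round(keep_probability(fraction), 4))
```

Under this formula, a word that makes up 10% of all tokens is kept only about 11% of the time, while a word at or below the threshold is never dropped.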


In [9]:
print(len(sentences))
print(sentences[10])

9893
['ive', 'literally', 'been', 'holding', 'for', 'more', 'than', '2', 'hours', 'now', '<PERIOD>', 'i', 'was', 'told', '2hrs', '<PERIOD>', 'you', 'all', 'are', 'sabotaging', 'any', 'chance', 'i', 'have', 'of', 'getting', 'home']


In [10]:
# Query the word vectors through model.wv (the model-level method is deprecated)
model.wv.most_similar('electrical')

[('las-den', 0.839499831199646),
 ('no-enertainment-on', 0.8384756445884705),
 ('swift', 0.8367525339126587),
 ('display', 0.8246525526046753),
 ('989', 0.8217084407806396),
 ('5350', 0.8195247650146484),
 ('19-', 0.81806480884552),
 ('sfo-jfk', 0.8122883439064026),
 ('delayed-no', 0.812042772769928),
 ('compounding', 0.8109996914863586)]
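
The scores returned above are cosine similarities between word vectors. A minimal pure-Python sketch of that measure follows, using made-up 2-dimensional vectors rather than the 25-dimensional ones trained here; `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel, different length
```

Because cosine similarity ignores vector length, high scores alone do not guarantee meaningful neighbors. With `min_word_count = 1`, tokens such as `las-den` that likely occur only once or twice get poorly trained vectors, which probably explains the noisy neighbors in the output above.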

In [11]:
text_train[0:3]

12958    [in, miami, and, the, agents, rachel, wong, an...
13610    [delivered, mom's, delayed, bag, to, the, wron...
4523     [u, were, bae, until, u, lost, both, of, my, b...
Name: text_preprocessed, dtype: object