# Advanced Data Science Capstone - Week 2 - Feature Engineering/Creation and ETL

In this notebook, we work on the process known as feature engineering. We use the information extracted in the data understanding notebook, together with problem-specific knowledge (in this case, text mining), to format the data for a predictive model such as an SVM or Naive Bayes. We also create features to help our classifier, using well-established techniques from the text mining field. Later, we will compare these approaches with ones that use Deep Learning and automatic feature extraction for this specific case.

## 1.1 Pre-processing

Following the strategies found in [1], we pre-process the tweets as follows:
   - Remove all URLs (e.g. www.xyz.com), hashtags (e.g. #topic) and targets (e.g. @username)
   - Correct spelling and handle sequences of repeated characters
   - Replace each emoticon with its sentiment
   - Remove all punctuation, symbols and numbers
   - Remove stop words
   - Remove non-English tweets
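
As a rough illustration of the first two steps, the sketch below uses regular expressions to strip URLs, hashtags and targets, and to shorten runs of repeated characters. The name `clean_tweet` and the patterns are our own assumptions for illustration and are not part of the pipeline used in this notebook; shortening runs to three characters is similar in spirit to the `reduce_len` option of NLTK's `TweetTokenizer`.

```python
import re

# Hypothetical patterns, chosen only for this illustration
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # http(s) links and www.* addresses
TAG_OR_HANDLE_RE = re.compile(r"[#@]\w+")        # hashtags and @targets
REPEAT_RE = re.compile(r"(.)\1{2,}")             # three or more of the same character


def clean_tweet(text):
    """Remove URLs, hashtags and targets, then shorten repeated characters."""
    text = URL_RE.sub(" ", text)
    text = TAG_OR_HANDLE_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1\1", text)        # "soooooo" becomes "sooo"
    return " ".join(text.split())                # normalise whitespace


print(clean_tweet("@JetBlue soooooo late!!! see www.xyz.com #fail"))
```

Keeping up to three repeated characters, rather than collapsing them to one, preserves the signal that a word was emphasised while still limiting the vocabulary size.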

In [61]:
# importing data
import pandas as pd

data = pd.read_csv("Tweets.csv")

# checking its dimensions
data.shape

(14640, 15)

In [62]:
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string

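# Helper used to build the text variants below: it can lower-case the text, keep or strip
# punctuation, remove stop words, and return either a single string or a list of tokens.
def preprocess(s, preserve_case=False, strip_handles=False, reduce_len=False, 
               punctuation = False, stop_words = False, join = True):
    
    # punctuation=True keeps punctuation tokens; punctuation=False removes them
    punctuation = [] if punctuation else list(string.punctuation+'”“’')
    # tokens to discard: punctuation (if requested), 'rt'/'via' and, optionally, English stop words
    stop = punctuation + ['rt', 'via']
    if stop_words:
        stop = stopwords.words('english') + stop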
    tknzr = TweetTokenizer(preserve_case=preserve_case, 
                           strip_handles=strip_handles, reduce_len=reduce_len)
    
    tokens = tknzr.tokenize(s)
    
    tokens = [token for token in tokens
              if token not in stop and
              not token.startswith(('#', '@','http', '...'))] 
    
    if join:
        return ' '.join(tokens)
    
    return tokens

In [63]:
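import random

# randrange excludes the upper bound, so rnd is always a valid row index
# (randint(0, data.shape[0]) could return an index one past the last row)
rnd = random.randrange(data.shape[0])
stop_words = False
punctuation = False

# a fixed index is used below so that the example is reproducible
print("Original tweet: {}".format(data.text.values[7804]))
print("Preprocessed : {}".format(preprocess(data.text.values[7804], join=True, punctuation=True)))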

Original tweet: @JetBlue quick ? Why is a person traveling w  a mosaic not get the green tag? Doesn't make sense I end up waitin 4 my sons bag anyway :/
Preprocessed : quick ? why is a person traveling w a mosaic not get the green tag ? doesn't make sense i end up waitin 4 my sons bag anyway :/


## 1.2 Feature Extraction/Engineering

### Processing for Bag of Words approach

Again, we have the following alternatives:
   - Parts of Speech Tags
   - Opinion words and phrases
   - Position of terms
   - Negation
   - Presence of emoticons
   - Stemming and lemmatization

Here, we apply several of these feature engineering strategies and build a separate dataset for each one, so that we can test our first models on them.

In [64]:
data['text_preprocessed'] = data.text.apply(lambda x: preprocess(x, preserve_case=True, punctuation=False))
data['text_preprocessed'].values[0:3]

array(['What said',
       "plus you've added commercials to the experience tacky",
       "I didn't today Must mean I need to take another trip"], dtype=object)

### Sentiment Scores

Here, we use VADER, a sentiment analyser shipped with the nltk module that relies on a lexicon of positive, neutral and negative words. We create features from the polarity scores of our tweets: each tweet is passed to the function polarity_scores, which returns values for negative, neutral, positive and compound. These values are then used as extra features in our classification task.
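
To give some intuition for the compound value, the sketch below shows the normalisation that, to our knowledge, VADER applies to the summed word valences of a text so that the result falls between -1 and 1. The function name `normalize_compound` is our own, and the constant `alpha=15` is quoted from memory of the VADER source, so treat both as assumptions and check the library if exact values matter.

```python
import math


def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded summed valence into the interval [-1, 1].

    alpha is assumed to be VADER's default smoothing constant.
    """
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))


# A larger absolute valence sum pushes the score towards -1 or 1
for raw in (-4.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize_compound(raw), 4))
```

The neg, neu and pos values, in contrast, are proportions of the text falling in each category, which is why they add up to roughly 1 in the output table of this section.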

In [65]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# a single analyser instance is created once and reused for every tweet
analyser = SentimentIntensityAnalyzer()

def polarity_scores_all(data_text_column):
    neg, neu, pos, compound = [], [], [], []

    for text in data_text_column:
        dict_ = analyser.polarity_scores(text)
        neg.append(dict_['neg'])
        neu.append(dict_['neu'])
        pos.append(dict_['pos'])
        compound.append(dict_['compound'])
    
    return neg, neu, pos, compound

all_scores = polarity_scores_all(data.text_preprocessed.values)
data['neg_scores'] = all_scores[0]
data['neu_scores'] = all_scores[1]
data['pos_scores'] = all_scores[2]
data['compound_scores'] = all_scores[3]
data.tail(4)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,text_preprocessed,neg_scores,neu_scores,pos_scores,compound_scores
14636,569587371693355008,negative,1.0,Customer Service Issue,1.0,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,,leaving over 20 minutes Late Flight No warning...,0.296,0.704,0.0,-0.7906
14637,569587242672398336,neutral,1.0,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",,Please bring American Airlines to,0.0,0.635,0.365,0.3182
14638,569587188687634433,negative,1.0,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada),you have my money you change my flight and don...,0.0,0.885,0.115,0.3818
14639,569587140490866689,neutral,0.6771,,0.0,American,,daviddtwu,,0,@AmericanAir we have 8 ppl so we need 2 know h...,,2015-02-22 11:58:51 -0800,"dallas, TX",,we have 8 ppl so we need 2 know how many seats...,0.0,0.951,0.049,0.0772


### Presence of emoticons and marks

In this step, we replace each emoticon with a label for the feeling it represents, such as SAD or HAPPY. We also use INTERROGATION and EXCLAMATION to mark question marks and exclamation marks.

In [66]:
# Here, we replace emoticons with a label based on their value: SAD or HAPPY
# We also include the markers INTERROGATION and EXCLAMATION for '?' and '!'
positive_emoticons = [':-)', ':)', '=)', ':D', ';D', ':]', ';]', ': D']
negative_emoticons = [':-(', ':(', '=(', ';(', 'D:', 'D;', ':[', ';[', ':/']

def emoticon_detection(raw_string):
    list_emoticons = []
    s = preprocess(raw_string, punctuation=True, join=False)
    for token in s:
        if token in positive_emoticons:
            list_emoticons.append((token, 'HAPPY'))
        if token in negative_emoticons:
            list_emoticons.append((token, 'SAD'))
        if token in ['?']:
            list_emoticons.append((token, 'INTERROGATION'))
        if token in ['!']:
            list_emoticons.append((token, 'EXCLAMATION'))
            
    s = ' '.join(s)   
                                  
    for emoticon in list_emoticons:
        s = s.replace(emoticon[0], emoticon[1])
    return s                       

data['text_preprocessed_with_emoticon'] = data.text.apply(lambda x: emoticon_detection(x))
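
An alternative to replacing substrings in the joined text is to map each token directly to its label. The sketch below is a minimal, self-contained version of that idea; the name `label_tokens` and the `MARKS` dictionary are hypothetical, and the function takes an already tokenized tweet instead of calling `preprocess`.

```python
POSITIVE = {':-)', ':)', '=)', ':D', ';D', ':]', ';]'}
NEGATIVE = {':-(', ':(', '=(', ';(', 'D:', 'D;', ':[', ';[', ':/'}
MARKS = {'?': 'INTERROGATION', '!': 'EXCLAMATION'}


def label_tokens(tokens):
    """Replace emoticon and mark tokens with their label, token by token."""
    labelled = []
    for token in tokens:
        if token in POSITIVE:
            labelled.append('HAPPY')
        elif token in NEGATIVE:
            labelled.append('SAD')
        else:
            # '?' and '!' get a label, every other token is kept unchanged
            labelled.append(MARKS.get(token, token))
    return ' '.join(labelled)


print(label_tokens(['quick', '?', 'bag', 'anyway', ':/']))
```

Working on whole tokens means a label is only ever applied to an exact match, so a token that merely contains an emoticon as a substring is left alone.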

### Stemming

Stemming is another common step in text analysis. Here, we replace each word with its root using a rule-based algorithm called the Porter stemmer. The NLTK package provides some examples: http://www.nltk.org/howto/stem.html
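
To show what "rule-based" means here, the sketch below implements only the first rule group (step 1a) of the original Porter algorithm, which handles plural endings. It is an illustration under that narrow scope and not a replacement for `PorterStemmer`: the full algorithm has several more steps, and NLTK's default mode adds its own extensions, so results can differ for some words.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm (plural endings)."""
    if word.endswith('sses'):
        return word[:-2]      # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]      # ponies -> poni
    if word.endswith('ss'):
        return word           # caress -> caress
    if word.endswith('s'):
        return word[:-1]      # cats -> cat
    return word


for w in ['caresses', 'ponies', 'caress', 'cats', 'flight']:
    print(w, '->', porter_step_1a(w))
```

The rules are tried from the longest suffix to the shortest, and only the first one that matches is applied.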

In [67]:
from nltk.stem.porter import PorterStemmer

def stemming(s):
    stemmer = PorterStemmer()
    tokens = preprocess(s, stop_words=True, join=False)
    x = [stemmer.stem(w) for w in tokens]
    
    return ' '.join(x)

# randrange excludes the upper bound, so rnd is always a valid row index
rnd = random.randrange(data.shape[0])
stop_words = False

print(rnd)
print("Original tweet: {}".format(data.text.values[rnd]))
print("Preprocessed : {}".format(preprocess(data.text.values[rnd], preserve_case=True)))
print("Preprocessed with stemming: {}".format(stemming(data.text.values[rnd])))

5282
Original tweet: @SouthwestAir Any way that I can get a receipt for a Cancelled Flightled portion of a roundtrip flight? Used the flight voucher just need receipt.
Preprocessed : Any way that I can get a receipt for a Cancelled Flightled portion of a roundtrip flight Used the flight voucher just need receipt
Preprocessed with stemming: way get receipt cancel flightl portion roundtrip flight use flight voucher need receipt


In [68]:
# Stemming is applied to the already preprocessed columns. In the emoticon variant, question marks,
# exclamation marks and emoticons were replaced by word labels beforehand, so they are not removed here.
data['text_stemming_with_emoticon'] = data.text_preprocessed_with_emoticon.apply(lambda x: stemming(x))
data['text_stemming'] = data.text_preprocessed.apply(lambda x: stemming(x))

In [69]:
n = 10682

print(data['text_stemming_with_emoticon'].values[n])
print('*'*100)
print(data['text_stemming'].values[n])

us air asham servic hold hour interrog help repres reschedul due weather interrog
****************************************************************************************************
us air asham servic hold hour help repres reschedul due weather


### Negation

As discussed in [1], negation can be extremely informative about the opinion expressed in a text. Here, we create a variable that is True if the tweet contains a negation and False otherwise.
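
The idea behind such a flag can be sketched with a small word list and a check for the "n't" contraction. This is a simplified stand-in written for illustration: the name `has_negation` and the word list are our own assumptions, and the NLTK function used in this notebook relies on its own, longer list and additional rules.

```python
# Hypothetical, deliberately short list of negation words
NEGATION_WORDS = {"not", "no", "never", "none", "nothing", "nowhere",
                  "neither", "nor", "cannot", "without"}


def has_negation(tokens):
    """Return True if any token is a negation word or contains "n't"."""
    for token in tokens:
        word = token.lower()
        if word in NEGATION_WORDS or "n't" in word:
            return True
    return False


print(has_negation("I didn't today Must mean I need to take another trip".split()))
print(has_negation("Please bring American Airlines to".split()))
```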

In [70]:
from nltk.sentiment.vader import negated

data['negation'] = data.text_preprocessed.apply(lambda x: negated(x.split()))
data['negation'].value_counts(normalize=True)

False    0.73026
True     0.26974
Name: negation, dtype: float64

### POS - Counting

In [71]:
from collections import Counter
import nltk

dict(nltk.pos_tag(data.text_stemming.values[2].split()))

{'today': 'NN',
 'must': 'MD',
 'mean': 'VB',
 'need': 'MD',
 'take': 'VB',
 'anoth': 'DT',
 'trip': 'NN'}

In [72]:
pos_family = {
    'NOUN' : ['NN','NNS','NNP','NNPS'],
    'PRON' : ['PRP','PRP$','WP','WP$'],
    'VERB' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'ADJ' :  ['JJ','JJR','JJS'],
    'ADV' : ['RB','RBR','RBS','WRB']
}


def count_pos_tag(data_text):
    total_count = []
    for s in data_text.text_preprocessed.values:
        # tag every token of the tweet
        tagged = nltk.pos_tag(s.split())
        # count how many times each POS tag occurs; counting the (word, tag) pairs
        # directly keeps repeated words, which a dict of word -> tag would drop
        partial_count = dict(Counter(tag for _, tag in tagged))

        total_count.append(partial_count)

    return total_count



total_count = count_pos_tag(data)

In [73]:
pos_df = pd.DataFrame(total_count)
pos_df = pos_df.drop(['$', '.', ':', 'IN'], axis = 1)
pos_df.columns

Index(['CC', 'CD', 'DT', 'EX', 'FW', 'JJ', 'JJR', 'JJS', 'MD', 'NN', 'NNP',
       'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP',
       'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP',
       'WP$', 'WRB'],
      dtype='object')

In [74]:
pos_df['NOUN'] = pos_df[pos_family['NOUN']].sum(axis=1)
pos_df['PRON'] = pos_df[pos_family['PRON']].sum(axis=1)
pos_df['VERB'] = pos_df[pos_family['VERB']].sum(axis=1)
pos_df['ADJ'] = pos_df[pos_family['ADJ']].sum(axis=1)
pos_df['ADV'] = pos_df[pos_family['ADV']].sum(axis=1)

pos_df = pos_df[['NOUN', 'PRON', 'VERB', 'ADJ', 'ADV']]
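
The counting and grouping steps above can be sketched on a single hand-tagged sentence, without calling the tagger. The `tagged` list below is a hypothetical example written by hand, and `count_families` is a name introduced only for this illustration.

```python
from collections import Counter

POS_FAMILY = {
    'NOUN': ['NN', 'NNS', 'NNP', 'NNPS'],
    'PRON': ['PRP', 'PRP$', 'WP', 'WP$'],
    'VERB': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'ADJ': ['JJ', 'JJR', 'JJS'],
    'ADV': ['RB', 'RBR', 'RBS', 'WRB'],
}


def count_families(tagged_tokens):
    """Count tags over (word, tag) pairs, then sum the counts per family."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return {family: sum(tag_counts[tag] for tag in tags)
            for family, tags in POS_FAMILY.items()}


# Hand-tagged example (hypothetical, not produced by nltk.pos_tag)
tagged = [('I', 'PRP'), ('need', 'VBP'), ('a', 'DT'), ('flight', 'NN'),
          ('and', 'CC'), ('a', 'DT'), ('bag', 'NN'), ('quickly', 'RB')]

counts = count_families(tagged)
print(counts)
```

Note that converting the pairs to a dictionary first, as in `dict(tagged)`, keeps only one entry per distinct word, so the repeated word "a" would contribute a single DT instead of two.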

In [75]:
data = pd.concat([data, pos_df], axis = 1)
data = data.fillna(value=0.0)
data.shape

(14640, 29)

In [76]:
# let's see how the data set is now
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 29 columns):
tweet_id                           14640 non-null int64
airline_sentiment                  14640 non-null object
airline_sentiment_confidence       14640 non-null float64
negativereason                     14640 non-null object
negativereason_confidence          14640 non-null float64
airline                            14640 non-null object
airline_sentiment_gold             14640 non-null object
name                               14640 non-null object
negativereason_gold                14640 non-null object
retweet_count                      14640 non-null int64
text                               14640 non-null object
tweet_coord                        14640 non-null object
tweet_created                      14640 non-null object
tweet_location                     14640 non-null object
user_timezone                      14640 non-null object
text_preprocessed                  1

In [77]:
data.shape

(14640, 29)

In [78]:
# As a final step, let's remove tweets whose stemmed text is duplicated
data.drop_duplicates(subset=['text_stemming'], inplace=True)
data.drop_duplicates(subset=['text_stemming_with_emoticon'], inplace=True)
data.shape

(14076, 29)
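
The effect of dropping duplicates on a single column can be sketched in plain Python. By default, pandas keeps the first occurrence of each value, which is the behaviour mimicked below; the function `drop_duplicates_by` and the sample rows are hypothetical and serve only as an illustration.

```python
def drop_duplicates_by(rows, key):
    """Keep the first row for each distinct value of rows[i][key]."""
    seen = set()
    kept = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept


# Two different raw tweets can share the same stemmed text
rows = [
    {'id': 1, 'text_stemming': 'flight late'},
    {'id': 2, 'text_stemming': 'flight late'},
    {'id': 3, 'text_stemming': 'thank'},
]
print([row['id'] for row in drop_duplicates_by(rows, 'text_stemming')])
```

Because the comparison is made on the stemmed text, tweets that differ only in case, punctuation, handles or stop words are treated as duplicates.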

In [None]:
# the file is tab-separated despite its .csv extension, so it must be read back with sep='\t'
data.to_csv('data_preprocessed_with_pos.csv', sep ='\t', index=False)