---
# <center> Features: TF-IDF, N-grams and word2vec </center>
---

In [None]:
# imports
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import re

In [None]:
# load data
file = "../../data/AirlineSentiment.csv"
sent_data = pd.read_csv(file)
sent_data.head()

In [None]:
# tweet text
text = sent_data.values[:, 14]
# sentiment
sent = sent_data.values[:, 5]

# remove non-alphanumeric characters
text = [re.sub(r'\W+', ' ', x) for x in text]

# tokenize
text_tokens_full = [nltk.word_tokenize(x.lower()) for x in text]

# stem
stemmer = PorterStemmer()
text_tokens = [[stemmer.stem(w) for w in x] for x in text_tokens_full]

## 1. TF-IDF

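First, look at how features are represented in a binary bag-of-words model: each column is a vocabulary term, and each value is 1 if the term occurs in the tweet and 0 otherwise.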

In [None]:
# create an instance of TfidfVectorizer; get it to remove stopwords
tf = TfidfVectorizer(stop_words = 'english', use_idf = False, norm = None, binary = True, lowercase = True)

# convert text to features
text_tokens_feat = tf.fit_transform([' '.join(x) for x in text_tokens])

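# save feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tf.get_feature_names_out()

pd.set_option("display.max_columns", 50)
# slice the sparse matrix first so that only 10 rows are converted to a dense array
feature_matrix = pd.DataFrame(text_tokens_feat[:10].toarray(), columns = feature_names)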

# print the first several examples and the features
feature_matrix
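
To make the binary representation concrete, here is a minimal pure-Python sketch on a made-up two-document corpus. The names `toy_docs`, `toy_vocab` and `toy_binary` are illustrative only and are not part of the airline data. The sketch mirrors the idea behind `binary = True`, not scikit-learn's actual implementation.

```python
# Toy corpus (hypothetical, not taken from the airline data)
toy_docs = ["good flight good crew", "bad flight"]

# Vocabulary: the sorted set of unique tokens across all documents
toy_vocab = sorted({tok for doc in toy_docs for tok in doc.split()})

# Binary bag-of-words: 1 if the term occurs in the document at least once, else 0
toy_binary = [
    [1 if term in doc.split() else 0 for term in toy_vocab]
    for doc in toy_docs
]

print(toy_vocab)
for row in toy_binary:
    print(row)
```

Note that `good` occurs twice in the first document but is still encoded as 1: binary features record presence, not counts.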

#### Exercise. Represent features with tf-idf

Use `TfidfVectorizer` to create features represented with tf-idf weights. To do so, initialize this class with the following parameters:
* `use_idf: True`
* `norm: 'l2'`
* `binary: False`

In [None]:
# create an instance of TfidfVectorizer with tf-idf; also get it to remove stopwords, convert to lowercase
tf = TfidfVectorizer(stop_words = 'english', use_idf = True, norm = 'l2', binary = False, lowercase = True)

# convert text to features
text_tokens_tfidf = tf.fit_transform([' '.join(x) for x in text_tokens])

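# get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tf.get_feature_names_out()

# print the first several examples and the features
# slice the sparse matrix first so that only 10 rows are converted to a dense array
feature_matrix = pd.DataFrame(text_tokens_tfidf[:10].toarray(), columns = feature_names)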

feature_matrix
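
The weights above can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The toy corpus and the helper `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenised corpus (hypothetical, not taken from the airline data)
toy_docs = [["good", "flight", "good", "crew"], ["bad", "flight"]]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf_row(doc):
    """Return {term: weight} for one document, L2-normalised."""
    counts = Counter(doc)
    # raw term count times smoothed inverse document frequency
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # scale the row to unit Euclidean length (norm = 'l2')
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

toy_tfidf = [tfidf_row(doc) for doc in toy_docs]
for row in toy_tfidf:
    print(row)
```

In the second document, `bad` gets a higher weight than `flight`, because `flight` appears in every document and is therefore less informative.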

#### Exercise. Convert labels to numbers

The sentiment labels in the `sent` variable are currently stored as strings. Use `LabelEncoder` to convert them into integers.

In [None]:
# Your code here
le = LabelEncoder()
sent = le.fit_transform(sent)
class_labels = le.classes_
# print the first few labels next to their integer codes
for x in sent[:5]:
    print(class_labels[x], x)
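
As a rough illustration of the mapping `LabelEncoder` builds, the sketch below sorts the unique labels and assigns each one its position in that order. The label values here are made up, and the real class does more (for example, input validation).

```python
# Toy labels (hypothetical values)
toy_labels = ["positive", "negative", "neutral", "negative"]

# Classes are the sorted unique labels; each class gets its index as its code
toy_classes = sorted(set(toy_labels))
label_to_id = {label: i for i, label in enumerate(toy_classes)}

# Encode strings to integers, then decode back to check the round trip
toy_encoded = [label_to_id[label] for label in toy_labels]
toy_decoded = [toy_classes[i] for i in toy_encoded]

print(toy_classes)
print(toy_encoded)
print(toy_decoded)
```

Decoding by indexing into the class list is the same idea as looking up `le.classes_` with a predicted integer label.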

#### Exercise. Train the model

Perform a train / test split and train a `LinearSVC` model with features represented as tf-idf. Then, calculate the `f1` and `accuracy` scores.

In [None]:
# Your code here
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(text_tokens_tfidf, sent, random_state = 0)
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train_tfidf)
y_pred_tfidf = clf.predict(X_test_tfidf)
f1_tfidf = f1_score(y_test_tfidf, y_pred_tfidf, average = "macro")
accuracy_tfidf = accuracy_score(y_test_tfidf, y_pred_tfidf)
print("SVM f1 score:", f1_tfidf, "accuracy:", accuracy_tfidf)
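
The macro-averaged `f1` used above is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The following pure-Python sketch, with made-up labels and hypothetical helper names, shows how macro F1 and accuracy can diverge on imbalanced data.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)

# Toy labels (hypothetical): class 1 is never predicted correctly
toy_true = [0, 0, 0, 0, 1, 2]
toy_pred = [0, 0, 0, 0, 0, 2]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print("accuracy:", toy_accuracy)
print("macro f1:", macro_f1(toy_true, toy_pred))
```

Here accuracy stays high because most examples belong to class 0, while macro F1 drops because class 1 contributes a score of zero.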

## 2. N-grams

### 2.1. Bigrams

#### Exercise. Add bigram features

Bigrams are features made of two consecutive tokens. For example, in the sentence `"Today is sunny"`, the bigrams are `Today is` and `is sunny`.
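
A minimal sketch of n-gram extraction, using a hypothetical helper `ngrams()` that slides a window of `n` tokens over the sentence:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = "Today is sunny".split()

print(ngrams(toy_tokens, 1))  # unigrams
print(ngrams(toy_tokens, 2))  # bigrams
```

A sentence with `k` tokens yields `k - n + 1` n-grams, and none at all when it is shorter than `n`.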

Use `TfidfVectorizer` to add bigram features. The parameter that controls this is `ngram_range`; set it so that both unigrams and bigrams are produced. Have the vectorizer output tf-idf weights. You may also want to limit the number of features produced, which is controlled by the `max_features` parameter.

In [None]:
# create an instance of TfidfVectorizer with 2-grams; also get it to remove stopwords, convert to lowercase
tf = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2), use_idf = True, norm = 'l2', 
                     binary = False, lowercase = True, max_features = 30000)

# convert text to features
text_tokens_bigram = tf.fit_transform([' '.join(x) for x in text_tokens])

# get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tf.get_feature_names_out()

# print dimensions of the feature matrix
print("Examples, features:", text_tokens_bigram.shape)

In [None]:
# print the first few feature names
feature_names[:5]

#### Exercise. Write a function that trains a `LinearSVC` model and prints out accuracy scores

Write a function that:
* Takes features and labels as input
* Performs train/test split
* Trains a `LinearSVC` model
* Calculates `f1` and `accuracy` scores and prints them

In [None]:
def train_svc_model(text_tokens, sent):
    """
    Train a LinearSVC model and print out f1 and accuracy scores
    
    Inputs:
    text_tokens - features
    sent - labels
    """
    X_train, X_test, y_train, y_test = train_test_split(text_tokens, sent, random_state = 0)
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred, average = "macro")
    accuracy = accuracy_score(y_test, y_pred)
    print("SVM f1 score:", f1, "accuracy:", accuracy)

Use your `train_svc_model()` function to train a model with bigram features.

In [None]:
print("Bigrams")
train_svc_model(text_tokens_bigram, sent)

### 2.2. Trigrams

#### Exercise. Add 3-gram features

As before, use `TfidfVectorizer` and its `ngram_range` parameter to create 1-, 2- and 3-grams with tf-idf weights. Again, you may want to limit the number of features created.

In [None]:
# create an instance of TfidfVectorizer with 3-grams; also get it to remove stopwords, convert to lowercase
tf = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 3), use_idf = True, norm = 'l2', binary = False, 
                     lowercase = True, max_features = 80000)

# convert text to features
text_tokens_trigram = tf.fit_transform([' '.join(x) for x in text_tokens])

# get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tf.get_feature_names_out()

# dimensions of the feature matrix
print("Examples, features:", text_tokens_trigram.shape)

In [None]:
# print the first few feature names
feature_names[:5]
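
To see why limiting the number of features matters, the sketch below counts distinct n-grams in a made-up corpus as the maximum n-gram length grows, then keeps only the most frequent ones. This approximates the idea behind `max_features` (keeping the top terms by corpus frequency); the helper name and the data are hypothetical.

```python
from collections import Counter

def ngrams_up_to(tokens, max_n):
    """Return all n-grams of length 1..max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Toy corpus (hypothetical, not taken from the airline data)
toy_docs = ["flight was delayed again", "flight was cancelled"]

vocab_sizes = []
for max_n in (1, 2, 3):
    # count how often each n-gram occurs across the whole corpus
    vocab = Counter(g for doc in toy_docs for g in ngrams_up_to(doc.split(), max_n))
    vocab_sizes.append(len(vocab))
    print("max n =", max_n, "-> distinct features:", len(vocab))

# keep only the 3 most frequent features of the largest vocabulary
top_features = [gram for gram, _ in vocab.most_common(3)]
print(top_features)
```

Even on two short sentences the vocabulary more than doubles when going from unigrams to trigrams, and most of the added features occur only once.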

Use `train_svc_model()` to train a model with the new features. How do the scores compare with those of the bigram model?

In [None]:
print("Trigrams")
train_svc_model(text_tokens_trigram, sent)

## 3. word2vec

In [None]:
import gensim

# load pretrained vectors from file
w2v = gensim.models.KeyedVectors.load_word2vec_format("../../data/word2vec50tokens.bin", binary = True)

In [None]:
# look up the vector for a word (word_vec() is deprecated in gensim 4)
w2v["good"]

#### Exercise. Find words most similar to `great`

Use the `similar_by_word()` method of the `w2v` object to find the words that are most similar to `great`.

In [None]:
# Your code here
w2v.similar_by_word("great")
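
Similarity between word vectors is measured with cosine similarity, the cosine of the angle between two vectors. The sketch below ranks words by that measure using tiny made-up 3-dimensional vectors; the values in `toy_vectors` are invented for illustration and have nothing to do with the pretrained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (hypothetical values)
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "good": [0.8, 0.2, 0.3],
    "delayed": [-0.5, 0.9, 0.1],
}

query = "great"
# score every other word against the query and sort from most to least similar
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Words whose vectors point in a similar direction score close to 1, while unrelated or opposite words score near 0 or below.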