# BLU09 - Exercises

Welcome to the BLU09 exercises! If you get stuck on an exercise, take a look at the hints or at the learning notebook for some clues. Good luck!

In [1]:
import math
import hashlib
import inspect
import json
import pandas as pd
import numpy as np
import re
from hashlib import sha256
from collections import Counter
import string
import os
import random

In [2]:
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import WordPunctTokenizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

## The Goal
In this learning unit you are going to create a binary classifier to determine if a movie review is 'positive' or 'negative'. You will start by building some basic features, then go on to build more complex ones, and finally put it all together. You should have a working classifier by the end of the notebook.

## The Dataset
For this Exercise Notebook, you are going to use the IMDB large movie review dataset: [Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). Each movie review has either a Positive or a Negative label. A negative review has a score of 4 or less (out of 10), and a positive review has a score of 7 or more (out of 10). Hence, reviews with more neutral ratings are not included in the dataset.

## Loading Data


First of all, let's load both the train and test sets into a DataFrame.

In [3]:
def load_imdb_sentiment_analysis_dataset(data_path, seed=42):

    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(train_path, fname)) as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)
                
    print("\nFinished loading Train set\n")
    
    # Load the test data.
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname)) as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)
                
    print("\nFinished loading Test set\n")
    
    # Shuffle the training data and labels.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)
    
    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
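
The loader reseeds the generator before each `random.shuffle` call, so texts and labels are shuffled with the same permutation and stay aligned. Here is a minimal sketch of that idea on toy data; the helper name `shuffle_in_unison` and the toy lists are illustrative and not part of the notebook:

```python
import random

def shuffle_in_unison(texts, labels, seed=42):
    """Shuffle two equal-length lists with the same permutation."""
    texts, labels = list(texts), list(labels)  # work on copies
    random.seed(seed)
    random.shuffle(texts)
    random.seed(seed)  # reseed so the second shuffle repeats the same permutation
    random.shuffle(labels)
    return texts, labels

# Hypothetical toy data: each label is encoded in its review text
toy_texts = ["review 0", "review 1", "review 2", "review 3", "review 4"]
toy_labels = [0, 1, 2, 3, 4]

shuffled_texts, shuffled_labels = shuffle_in_unison(toy_texts, toy_labels)
pairs_still_match = all(
    text == f"review {label}"
    for text, label in zip(shuffled_texts, shuffled_labels)
)
print(pairs_still_match)
```

This only works because both lists have the same length; shuffling a single list of `(text, label)` pairs would be an alternative that avoids reseeding.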

Warning: the two cells below might take a few minutes, depending on your machine.

In [4]:
(x_train, y_train), (x_test, y_test) = load_imdb_sentiment_analysis_dataset("datasets/")
df = pd.DataFrame(data=[x_train, y_train], index=["text", "label"]).T
df = pd.concat([df, pd.DataFrame(data=[x_test, y_test], index=["text", "label"]).T])[:5000]


Finished loading Train set


Finished loading Test set



In [5]:
df.head()

| | text | label |
|---|---|---|
| 0 | The exploding zeppelins crashing down upon 'Sk... | 0 |
| 1 | Classic, highly influential low budget thrille... | 1 |
| 2 | A very close and sharp discription of the bubb... | 1 |
| 3 | When robot hordes start attacking major cities... | 0 |
| 4 | This is the second Eytan Fox film I have seen.... | 1 |


In [6]:
#! python -m spacy download 'en_core_web_sm'

In [7]:
# load the small-sized SpaCy model
nlp = spacy.load('en_core_web_sm')
en_stopwords = nlp.Defaults.stop_words


# Create a list of SpaCy "Docs" by leveraging the SpaCy pipeline
docs = list(nlp.pipe(df.text))

Now let's have a look at the stopwords and at the first 2 reviews to understand the text we are dealing with.

In [8]:
en_stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [9]:
docs[:2]

[The exploding zeppelins crashing down upon 'Sky Captain' Jude Law's base present an adequate metaphor to describe how truly terrible this movie is. First off, let me state right off the bat that I sincerely doubt that Paramount will ever recover any money from this film. A cult hit it might become, but only because it is so remarkable for what it failed to achieve. I can see the studio pitch now. "Let's combine 1920's German Expressionism and a 1940's globetrotting adventure with a modern action flick and use computer animation to dominate every scene! Wow, won't that be a success! " Skycaptain bludgeons the viewer with its sheer excess. There are too many fake explosions, too many unconvincing dogfight scenes, and too few real moments where the characters are anything but painfully two-dimensional. After all, why shock and awe with one floating airship when you can have three, or five, or one hundred?! Moreover, what could have been a groundbreaking film, seamlessly combining compute

## Q1 - Text cleaning

Looking at the text above, you see that there are several HTML tags. First, let's clean them up! BeautifulSoup has a handy `get_text()` method that strips all the leftover HTML tags. Then let's use Regex, something that you have learned previously, to remove all the punctuation.
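
As a quick illustration of the regex step, here is a small standalone sketch that mirrors the logic of `remove_punct` below. The function name `strip_punctuation` and the sample string are made up for this example:

```python
import re

def strip_punctuation(text):
    """Remove everything except letters, digits and whitespace, then lowercase."""
    # Drops punctuation but keeps "_", because \w matches the underscore too
    text = re.sub(r"[^\w\s]", "", text)
    # So the underscore is removed in a second pass
    text = re.sub(r"_", "", text)
    return text.lower()

sample = "A truly_great film... isn't it?!"  # hypothetical review snippet
cleaned = strip_punctuation(sample)
print(cleaned)
```

Note that removing the apostrophe fuses contractions (`isn't` becomes `isnt`), which is worth keeping in mind when such tokens are later compared against a stopword list.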

In [10]:
tokenizer = WordPunctTokenizer()

def remove_html_tags(text):
    soup = BeautifulSoup(text)
    return soup.get_text()

def remove_punct(text):
    # remove everything except word characters (letters, digits, underscore) and whitespace
    text = re.sub(r'[^\w\s]','',text)

    # \w also matches the underscore, so remove it separately
    text = re.sub(r'_','',text)
    text = text.lower()
    
    return text

def remove_stopwords(text, stopwords):
    tokens = tokenizer.tokenize(text)
    tokens = [tok.lower() for tok in tokens]
    if stopwords:
        tokens = [tok for tok in tokens if tok not in stopwords]

    text_processed = ' '.join(tokens)
    return text_processed

def preprocessing(df):
    """
    Apply the three functions above, in that order, to remove HTML tags, punctuation and stopwords.
    Hint: Use the apply function.
    
    """
    df_ = df.copy()
    
    #df_['text'] = df_['text'].apply(...).apply(...).apply(...)
    
    # YOUR CODE HERE
    #raise NotImplementedError()
    df_['text'] = df_['text'].apply(remove_html_tags).apply(remove_punct).apply(remove_stopwords,stopwords=en_stopwords)
    return df_

In [11]:
# Let's clean, and process the df
df_raw = df.copy()
df = preprocessing(df)
value_hash = '81596d9ecc63f0a3d1b634903b64affc939a27cb09ebe297d4c0d9697ca2bb11'
assert sha256(str(df['text']).encode()).hexdigest() == value_hash
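
The grading cells in this notebook compare the SHA-256 digest of the *string form* of your answer against a stored digest, so the value's type and formatting matter as much as the value itself. Here is a minimal sketch of that mechanism; `answer_hash` and the reference answer `42` are hypothetical:

```python
from hashlib import sha256

def answer_hash(value):
    """Hash the string form of an answer, as the grading cells do."""
    return sha256(str(value).encode()).hexdigest()

# Hypothetical reference answer; a real grading cell only stores the digest
expected_hash = answer_hash(42)

same_value = answer_hash(42) == expected_hash    # "42" vs "42"
as_float = answer_hash(42.0) == expected_hash    # "42.0" vs "42"
print(same_value, as_float)
```

If a check fails although your number looks right, it is worth confirming that it has the expected type (for example `int` rather than `float`).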

## Q2 - Text exploration with SpaCy 

Now that we have cleaned the data, let's start extracting some useful features. We will first start simple and perform some exploration using `SpaCy`.

### Q2.a) Create a simple matcher
You suspect that some positive words such as "excellent", "classic", and "great" often occur in Positive reviews. Let's quickly test that!

In [12]:
df.head()

| | text | label |
|---|---|---|
| 0 | exploding zeppelins crashing sky captain jude ... | 0 |
| 1 | classic highly influential low budget thriller... | 1 |
| 2 | close sharp discription bubbling dynamic emoti... | 1 |
| 3 | robot hordes start attacking major cities stop... | 0 |
| 4 | second eytan fox film seen fantastic actor lio... | 1 |


In [13]:
df[df['text'].str.contains('classic')].label.value_counts()

1    240
0    133
Name: label, dtype: int64

In [14]:
for word in ['classic', 'excellent', 'great']:
    print(word)
    print(df[df['text'].str.contains(word)].label.value_counts())
    print('-------')

classic
1    240
0    133
Name: label, dtype: int64
-------
excellent
1    332
0     68
Name: label, dtype: int64
-------
great
1    934
0    469
Name: label, dtype: int64
-------


Indeed, your intuition is right. It's clear that those positive words are more likely to occur in Positive reviews. 

Now, take advantage of SpaCy's `Matcher` to count the *exact* total number of matches of these words. The figure below should help you choose the pattern to use for this purpose.

![](media/token_attributes.png)
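
The `str.contains` checks above count *documents* and also fire on partial words, whereas an exact token match counts every occurrence of the exact word. The following plain-Python sketch shows the difference. It uses a simple regex tokenizer rather than SpaCy, and the mini-corpus is made up:

```python
import re

# Hypothetical mini-corpus standing in for the reviews
reviews = [
    "a classic, truly great film and a great cast",
    "classical music cannot save it",
    "not great",
]
targets = {"excellent", "great", "classic"}

# Document-level substring check, like str.contains:
# "classical" also counts as containing "classic"
docs_containing_classic = sum("classic" in review for review in reviews)

# Token-level exact matching: every occurrence counts, partial words do not
exact_matches = sum(
    token in targets
    for review in reviews
    for token in re.findall(r"\w+", review)
)

print(docs_containing_classic, exact_matches)
```

A `Matcher` pattern on the token text behaves like the second count: one match per matching token.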

In [15]:
# Count the number of total exact matches of the words "excellent", "great", and "classic" using the SpaCy Matcher and assign it to "count"
#matcher = Matcher(...)
#
#for ... :
#    pattern = [...]
#    matcher.add(...)
#
#count =0
#for doc in docs:
#    matches = ...
#    count += ...

# YOUR CODE HERE
matcher = Matcher(nlp.vocab)
words = ['excellent','great','classic']
for word in words:
    pattern = [{'TEXT':word.lower()}]
    matcher.add(word,None,pattern)
count =0
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        span_text = span.text  # the span as a string
        #print(start, end, span_text)
        count +=1
count

#raise NotImplementedError()

2473

In [16]:
count_hash = '33e14c27247dae6ca2ac565cf7d5fa4200defa487918c52a2dfcccb6d09b4329'
assert sha256(str(count).encode()).hexdigest() == count_hash

### Q2.b) Extract Emojis

Looking at a few review examples, you realize that people tend to use emojis in their reviews. Perhaps we could extract some signal out of these?

Let's build a matcher to extract positive emojis & negative emojis from the text and store their counts in `positive_emojis_count` and `negative_emojis_count`. 

You can easily do this with Regex: SpaCy allows us to add the `REGEX` operator to our Matcher patterns. Hint: Check out [SpaCy's documentation](https://spacy.io/usage/rule-based-matching#regex) to learn how to do that.
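
Parentheses are grouping characters in regular expressions, which is why they must be escaped to match a literal `:)` or `:(`. Here is a small sketch with Python's `re` module (the helper `compiles` and the sample text are made up for illustration):

```python
import re

def compiles(pattern):
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

unescaped_ok = compiles(r":)")   # ")" without a matching "(" is invalid
escaped_ok = compiles(r":\)")    # escaping makes ")" a literal character

happy = re.compile(r":\)")
sad = re.compile(r":\(")
text = "Loved it :) but the ending :( ... still :)"  # hypothetical review
n_happy = len(happy.findall(text))
n_sad = len(sad.findall(text))
print(unescaped_ok, escaped_ok, n_happy, n_sad)
```

The colon has no special meaning here, so escaping it (as in `\:`) is harmless but not required.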

In [17]:
nlp = English()  # We only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)

pos_patterns = [{'TEXT':{"REGEX": r"\:\)"}}] #- For Positive emoji let's use ":)"
neg_patterns = [{'TEXT':{"REGEX": r"\:\("}}] #- For Negative emoji let's use ":("

# Hint - Don't forget to escape the special characters "(" and ")"
# YOUR CODE HERE
#raise NotImplementedError()

def count_emoji_matches(pattern, docs = docs):
    matcher = Matcher(nlp.vocab)
    matcher.add("EMOJIS", None, pattern)
    
    n_emojis = []
    for doc in docs:
        matches = matcher(doc)
        emojis_count = len(matches)
        for match in matches:
            emojis_count += 1
        n_emojis.append(emojis_count)
            
    return n_emojis

positive_emojis_count = sum(count_emoji_matches(pos_patterns))
negative_emojis_count = sum(count_emoji_matches(neg_patterns))

In [18]:
positive_hash = '7688b6ef52555962d008fff894223582c484517cea7da49ee67800adc7fc8866'
negative_hash = 'd4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35'
assert sha256(str(positive_emojis_count).encode()).hexdigest() == positive_hash
assert sha256(str(negative_emojis_count).encode()).hexdigest() == negative_hash

### Q2.c) Extract Part of Speech features

You also think that negative reviews may contain several adverb-adjective sequences that express how bad a movie is (e.g. ridiculously bad, unbelievably awful).

To help you, here's the list of PoS available in SpaCy:

![](media/pos_helper.png)

To complete this exercise you should build a matcher to extract all adverbs that are followed by an adjective. Store this sequence in a list, and assign the result to `adv_adj_list`.
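
Conceptually, a two-token pattern slides over the document and fires wherever an adverb is immediately followed by an adjective. Here is a plain-Python sketch of that idea on hand-tagged tokens (the tags and the helper `adverb_adjective_pairs` are illustrative; in the exercise, SpaCy supplies the tags and the `Matcher` does the sliding):

```python
# Hypothetical pre-tagged tokens: (text, coarse POS tag) pairs
tagged = [
    ("a", "DET"), ("ridiculously", "ADV"), ("bad", "ADJ"), ("movie", "NOUN"),
    ("with", "ADP"), ("very", "ADV"), ("very", "ADV"), ("dull", "ADJ"),
    ("scenes", "NOUN"),
]

def adverb_adjective_pairs(tokens):
    """Return every adverb that is immediately followed by an adjective."""
    pairs = []
    for (word, pos), (next_word, next_pos) in zip(tokens, tokens[1:]):
        if pos == "ADV" and next_pos == "ADJ":
            pairs.append(f"{word} {next_word}")
    return pairs

adv_adj_pairs = adverb_adjective_pairs(tagged)
print(adv_adj_pairs)
```

Only the second "very" is paired with "dull", because the pattern requires the adjective to follow the adverb directly.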

In [19]:
#Store all the adv-adj sequence in a list called adv_adj_list
#matcher = ...
#pattern = [...]
#matcher.add(...)
#
#adv_adj_list = []
#for doc in docs:
#    matches = ...
#    for ... in matches:
#        adv_adj_list.append(...)


# YOUR CODE HERE
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'ADV'},{'POS':'ADJ'}]
matcher.add('word',None,pattern)
#
adv_adj_list = []
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id] 
        span = doc[start:end]
        adv_adj_list.append(span)
#raise NotImplementedError()
len(adv_adj_list)

15679

In [20]:
list_hash = '1cde2b9de1483573a1d70aed1fc8f90eb8c3f18a992f289543e3aa4c09a14edd'
assert len(adv_adj_list) == 15679
assert sha256(','.join(map(str, adv_adj_list)).encode()).hexdigest() == list_hash

### Q2.d) Extract entities

Your intuition is that Positive Reviews are likely to describe the movie plot, citing several locations in the movie. An idea is to extract some locations from the text.

Build a `Matcher` to match locations in the text and extract the 10 most common ones. Assign them to `most_common_ents`.

*hint: Use [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) to extract the most common elements (check the most_common(n) method). You will need to feed it strings (not SpaCy spans)*

*note: in a real-case scenario we would perform some text preprocessing first and build a better entity recognizer, but let us not worry about that here*
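
As a reminder of how `Counter.most_common` behaves, here is a tiny standalone sketch on made-up location strings:

```python
from collections import Counter

# Hypothetical location strings, e.g. the .text of matched spans
locations = ["Paris", "London", "Paris", "Tokyo", "London", "Paris"]

location_counts = Counter(locations)
top_two = location_counts.most_common(2)       # list of (element, count) pairs
top_two_names = [name for name, _ in top_two]  # keep only the elements
print(top_two, top_two_names)
```

`most_common(n)` returns `(element, count)` tuples sorted by descending count, so strip the counts yourself if you only need the names.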


In [21]:
len(docs)

5000

In [22]:
# Build a matcher to extract the location-type entities from the text and assign them to most_common_ents
#
#matcher = ...
#
#pattern = [...]
#matcher.add(...)
# 
#...
#
# most_common_ents = ...

# YOUR CODE HERE

#location_list = []
#for i, doc in enumerate(docs):
#    for e in doc.ents:
#        if e.label_ == 'GPE':   
#            location_list.append(e.text)
#most_common_ents = Counter(location_list).most_common(10)
#most_common_ents=[el[0] for el in most_common_ents]
#most_common_ents
###########################################################################
#
matcher = Matcher(nlp.vocab)
#
pattern = [{'ENT_TYPE':'GPE'}]
matcher.add('LOC', None, pattern)
#matcher.add(...)
most_common_ents_list = []
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        span_text = span.text  # the span as a string
        most_common_ents_list.append(span_text)
most_common_ents = Counter(most_common_ents_list).most_common(10)
most_common_ents
#raise NotImplementedError()

[('Hollywood', 337),
 ('New', 170),
 ('York', 141),
 ('America', 123),
 ('US', 106),
 ('London', 77),
 ('Japan', 57),
 ('Paris', 57),
 ('England', 50),
 ('City', 49)]

In [23]:
ent_hash = "35eb8667c8e28dacaba6290382dc6d46e4df777b27d1eef8678e099ad359de25"
assert len(most_common_ents) == 10
assert sha256(','.join(map(str, most_common_ents)).encode()).hexdigest() == ent_hash

Now that we have the most common locations, let's quickly check their usefulness and whether we should include them as a feature.

Indeed, as can be seen below, locations are more likely to occur in Positive Reviews.

In [24]:
most_common_locations = [loc[0] for loc in most_common_ents]

for word in most_common_locations:
    print(word)
    print(df_raw[df_raw['text'].str.contains(word)].label.value_counts())
    print('-------')

Hollywood
1    150
0    123
Name: label, dtype: int64
-------
New
1    144
0     78
Name: label, dtype: int64
-------
York
1    97
0    35
Name: label, dtype: int64
-------
America
1    263
0    192
Name: label, dtype: int64
-------
US
1    102
0     81
Name: label, dtype: int64
-------
London
1    42
0    32
Name: label, dtype: int64
-------
Japan
1    63
0    43
Name: label, dtype: int64
-------
Paris
1    32
0    17
Name: label, dtype: int64
-------
England
1    40
0    17
Name: label, dtype: int64
-------
City
1    47
0    23
Name: label, dtype: int64
-------


## Q3 - Create Numerical Features

You start thinking about which features could actually be useful for solving your problem. Some candidates are the number of adjectives used, the length of the review, the average word length, and the counts of positive and negative emojis.

Add extra fields to the `df` dataframe with:
- The number of adjectives - consider the adjectives as those identified by SpaCy
- The length of the document - you can simply count the number of characters
- The average word length - you learned how to do this in the Learning Notebook; you don't need to remove stopwords, as we already did that at the beginning
- The count of positive emojis - Hint: use the `count_emoji_matches` function from Q2.b
- The count of negative emojis - Hint: use the `count_emoji_matches` function from Q2.b

Assign the number of adjectives to a new column called `n_adjs`, the length of the reviews to a column called `text_length`, the average word length to a column called `avg_word_length`, and the counts of positive and negative emojis to two columns called `positive_emojis_count` and `negative_emojis_count`, respectively.
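
The two purely textual features can be sketched in plain Python as follows. The function names and the sample string are illustrative; in the exercise you would apply the same logic to the `text` column:

```python
def text_length(text):
    """Number of characters in the review."""
    return len(text)

def avg_word_length(text):
    """Mean length of whitespace-separated words; 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

sample = "close sharp description"  # hypothetical cleaned review
sample_length = text_length(sample)
sample_avg = avg_word_length(sample)
print(sample_length, sample_avg)
```

Guarding against an empty word list avoids a division by zero (or a `nan` when using `np.mean`) for reviews that become empty after cleaning.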

In [25]:
# Hint: you can iterate over the tokens in Spacy doc to inspect them 
# for doc in docs:
#    print(doc.ents)
#    for token in doc:
#        print(token.pos_)

#n_adjs = []
#
#for doc in docs:
#    count_adjs = 0
#    ...
#    ...
#    n_adjs.append(count_adjs)
#
#df['n_adjs'] = n_adjs
#df['text_length'] = ...
#df['avg_word_length'] = ...
#df['positive_emojis_count'] = count_emoji_matches(...)
#df['negative_emojis_count'] = count_emoji_matches(...)

# YOUR CODE HERE
# count the adjectives identified by SpaCy in each doc
n_adjs = [sum(token.pos_ == 'ADJ' for token in doc) for doc in docs]

# average token length per raw doc (includes stopwords and punctuation), kept for comparison
avg_word_length = [np.mean([len(token) for token in doc]) for doc in docs]

df['n_adjs'] = n_adjs
df['text_length'] = df['text'].map(len)
df['avg_word_length_old'] = avg_word_length
# average word length of the cleaned text; 0 for an empty review
df['avg_word_length'] = df['text'].apply(lambda x: np.mean([len(t) for t in x.split()]) if x.split() else 0)
df['positive_emojis_count'] = count_emoji_matches(pos_patterns)
df['negative_emojis_count'] = count_emoji_matches(neg_patterns)
df
#raise NotImplementedError()

| | text | label | n_adjs | text_length | avg_word_length_old | avg_word_length | positive_emojis_count | negative_emojis_count |
|---|---|---|---|---|---|---|---|---|
| 0 | exploding zeppelins crashing sky captain jude ... | 0 | 30 | 1333 | 4.008299 | 6.329670 | 0 | 0 |
| 1 | classic highly influential low budget thriller... | 1 | 38 | 1115 | 4.383686 | 6.696552 | 0 | 0 |
| 2 | close sharp discription bubbling dynamic emoti... | 1 | 15 | 227 | 4.275000 | 6.600000 | 0 | 0 |
| 3 | robot hordes start attacking major cities stop... | 0 | 45 | 1285 | 4.269767 | 6.224719 | 0 | 0 |
| 4 | second eytan fox film seen fantastic actor lio... | 1 | 21 | 733 | 3.602339 | 5.612613 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | best hong kong action films tense exciting sto... | 1 | 8 | 262 | 3.720000 | 5.575000 | 0 | 0 |
| 4996 | despite feelings star wars fans opinion return... | 1 | 34 | 1417 | 3.965812 | 5.784689 | 0 | 0 |
| 4997 | disappointed movie like french actors liked bu... | 0 | 4 | 137 | 3.960000 | 5.900000 | 0 | 0 |
| 4998 | nonaquatic role esther williams plays school t... | 1 | 13 | 410 | 4.539823 | 7.058824 | 0 | 0 |


In [26]:
df.avg_word_length.round().sum()

30553.0

In [27]:
assert all(col in df.columns for col in ('n_adjs', 'avg_word_length', 'text_length', 'positive_emojis_count', 'negative_emojis_count'))
assert df.n_adjs.sum() == 101733
assert np.allclose(df.avg_word_length.sum(), 29805, 5)
assert df.text_length.sum() == 3829351
assert df.positive_emojis_count.sum() == 56
assert df.negative_emojis_count.sum() == 2

## Q4 - Pipelines and Feature Unions
It is now time for you to leverage your newly built features and construct pipelines that can be fed to a classifier. You decide to use a [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), as you have heard from industry experts that it tends to work well for text classification problems.

In [28]:
# split data into train and test sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
# work on an explicit copy so that the cast below does not trigger a SettingWithCopyWarning
train_data = train_data.copy()
train_data['label'] = train_data['label'].astype(int)


In [29]:
class Selector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a column from the dataframe to perform additional transformations on
    """ 
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self
    

class TextSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def transform(self, X):
        return X[self.key]
    
    
class NumberSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def transform(self, X):
        return X[[self.key]]

    
def get_accuracy(feats, train_data, test_data):
    """
    Return the accuracy on the test_data by using a RandomForestClassifier trained on the 
    train_data with the features described by feats
    """

    pipeline = Pipeline([
        ('features',feats),
        ('classifier', RandomForestClassifier(random_state = 42, n_estimators=10)),
    ])

    pipeline.fit(train_data, train_data.label)

    preds = pipeline.predict(test_data)
    accuracy = np.mean(preds == test_data.label)
    
    print("Accuracy: {:.4f}".format(accuracy))
    
    return accuracy

### Q4.a) Build a Feature Union
You hypothesize that combining the text and numerical features could help you build a strong classifier. 

Use `FeatureUnion` to join:
- The text features extracted from a standard sklearn `TfidfVectorizer` (with `ngram_range=(1, 2)`)
- The numeric feature of the length of the reviews, scaled to zero mean and unit variance *[hint](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)*
- The average word length that you have created previously

Assign the Feature Union to a variable named `feats`.
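
"Scaled to zero mean and unit variance" means subtracting the column mean and dividing by the standard deviation. Here is a plain-Python sketch of that arithmetic. It uses the population standard deviation, which is what the `StandardScaler` documentation describes; the helper `standardize` and the sample lengths are made up:

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

text_lengths = [100, 200, 300, 400]  # hypothetical review lengths
scaled = standardize(text_lengths)
print(scaled)
```

In the pipeline, `StandardScaler` learns the mean and standard deviation on the training data only and reuses them when transforming the test data.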

In [30]:
text_pipe = Pipeline([
                ('selector', TextSelector("text")),
                ('tfidf', TfidfVectorizer(ngram_range=(1,2)))
            ])
text_len_pipe =  Pipeline([
                ('selector', NumberSelector("text_length")),
                ('standard', StandardScaler())
            ])
word_len_pipe =  Pipeline([
                ('selector', NumberSelector("avg_word_length")),
                ('standard', StandardScaler())
            ])
feats = FeatureUnion([('text_pipe',text_pipe),('text_len_pipe',text_len_pipe),('word_len_pipe',word_len_pipe)])

# YOUR CODE HERE
#raise NotImplementedError()

In [31]:
assert isinstance(feats, FeatureUnion)
assert any(isinstance(obj, Selector) for obj in feats.transformer_list[0][1])
assert any(isinstance(obj, TfidfVectorizer) for obj in feats.transformer_list[0][1])
assert np.allclose(get_accuracy(feats, train_data, test_data), 0.7290, 0.01)

Accuracy: 0.7290


### Q4.b) Add more features
You decide to try adding the number of adjectives to your features to see if it can improve the performance of your classifier.

On top of all features you have used for `feats`, add the number of adjectives `n_adjs` that you computed in Q3 to your features. Then assign your features to `feats_v2`. There should be 4 features in total.
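
One way to read "4 features" is as the number of blocks joined by the `FeatureUnion`, not the number of output columns, since the TF-IDF block alone produces one column per n-gram. The toy sketch below illustrates the distinction with two blocks. The data frame, the selector functions, and the use of `FunctionTransformer` in place of the notebook's selector classes are assumptions made to keep the example self-contained:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "text": ["good movie", "bad movie", "great film"],
    "text_length": [10, 9, 10],
})

def select_text(frame):
    """Return the text column as a 1-D Series, as TfidfVectorizer expects."""
    return frame["text"]

def select_length(frame):
    """Return the numeric column as a 2-D frame, as StandardScaler expects."""
    return frame[["text_length"]]

toy_feats = FeatureUnion([
    ("text_pipe", Pipeline([
        ("selector", FunctionTransformer(select_text)),
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ])),
    ("text_len_pipe", Pipeline([
        ("selector", FunctionTransformer(select_length)),
        ("standard", StandardScaler()),
    ])),
])

matrix = toy_feats.fit_transform(toy)
n_blocks = len(toy_feats.transformer_list)  # number of joined feature blocks
print(n_blocks, matrix.shape)               # columns = n-grams + numeric column
```

Here the union has 2 blocks, while the resulting matrix has one column for each of the 8 unigrams and bigrams plus one for the scaled length.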

In [32]:
adjs_pipe = Pipeline([
                ('selector', NumberSelector("n_adjs")),
                ('standard', StandardScaler())
            ])
#...
feats_v2 = FeatureUnion([('text_pipe',text_pipe),('text_len_pipe',text_len_pipe),('word_len_pipe',word_len_pipe),('adjs_pipe',adjs_pipe)])

# YOUR CODE HERE
#raise NotImplementedError()

In [33]:
accuracy = get_accuracy(feats_v2, train_data, test_data)
assert np.allclose(accuracy, 0.6860, 0.01)

Accuracy: 0.6860


### Q4.c) Add the Emojis feature
You try to improve your model even further by including the emoji count features `positive_emojis_count` and `negative_emojis_count` that you created above.

On top of all features in `feats_v2`, add the number of emojis to your features and assign the result to `feats_v3` (**no need to scale** the features this time). There should be 6 features in total.

In [34]:
positive_emojis_count  = Pipeline([
                ('selector', NumberSelector("positive_emojis_count"))
            ])
negative_emojis_count = Pipeline([
                ('selector', NumberSelector("negative_emojis_count"))
            ])
feats_v3 = FeatureUnion([('text_pipe',text_pipe),('text_len_pipe',text_len_pipe),('word_len_pipe',word_len_pipe),('adjs_pipe',adjs_pipe),('positive_emojis_count',positive_emojis_count),('negative_emojis_count',negative_emojis_count)])

# YOUR CODE HERE
#raise NotImplementedError()

In [35]:
accuracy = get_accuracy(feats_v3, train_data, test_data)
assert np.allclose(accuracy, 0.7010, 0.01)

Accuracy: 0.7010


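You realize that neither extended feature set beats the original `feats`: accuracy dropped from 0.7290 to 0.6860 with `feats_v2` and only recovered to 0.7010 with `feats_v3`. This reminds you that more features do not necessarily mean better results.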

## Conclusion

You realize you can get reasonable accuracy on the sentiment analysis problem using a fairly simple solution. You know there are many things you could improve (e.g. Dimensionality Reduction) and many further paths you could take to bring your classifier to the next level, but you decide to leave that challenge for another day.