# day17: Bag of Words Representations for Text Data

# Objectives

* Understand bag of words representations
* Think through the decision decisions you need to make and how they will impact classifier performance


# Outline

* [Part 1: Cleaning text into a list of tokens](#part1)
* [Part 2: Building a fixed-size vocabulary](#part2)
* [Part 3: Creating a BoW feature vector](#part3)
* [Part 4: Building a classifier given your BoW features](#part4)
* [Part 5: Sklearn's CountVectorizer for easy BoW feature processing](#part5)
* [Part 6: Using a pipeline with CountVectorizer](#part6)

We expect you'll get through part 4 during this class period. 

# Takeaways

* Bag of words representations are simple and still interpretable
* Bag of words representations are limited: you lose any information that comes from the *ordering* of the words
* Many key design decisions (how to handle rare words, how to handle similar words like "walk" and "walking") can matter

In [1]:
import numpy as np
import pandas as pd

In [2]:
import sklearn.linear_model
import sklearn.pipeline


In [3]:
# import plotting libraries
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('seaborn') # pretty matplotlib plots

import seaborn as sns
sns.set('notebook', font_scale=1.25, style='whitegrid')

# Setup: Raw Text Data

We've included some raw text from 200 negative reviews and 200 positive reviews below.

This is a subset of the training set you'll use for Project B.

Each line is one plain-text review. You'll see many slang terms, weird capitalization/punctuation/etc.

Just execute the cell and move on. You'll need to scroll down a lot.

In [4]:
all_reviews_as_line_separated_string = """Oh and I forgot to also mention the weird color effect it has on your phone.
THAT one didn't work either.
Waste of 13 bucks.
Product is useless, since it does not have enough charging current to charge the 2 cellphones I was planning to use it with.
None of the three sizes they sent with the headset would stay in my ears.
Worst customer service.
The Ngage is still lacking in earbuds.
It always cuts out and makes a beep beep beep sound then says signal failed.
the only VERY DISAPPOINTING thing was there was NO SPEAKERPHONE!!!!
Very disappointed in AccessoryOne.
Basically the service was very bad.
Bad Choice.
The only thing that disappoint me is the infra red port (irda).
horrible, had to switch 3 times.
It feels poorly constructed, the menus are difficult to navigate, and the buttons are so recessed that it is difficult to push them.
Don't make the same mistake I did.
Muddy, low quality sound, and the casing around the wire's insert was poorly super glued and slid off.
I advise EVERYONE DO NOT BE FOOLED!
Doesn't hold charge.
What a waste of time!
I'm very disappointed with my decision.
I also didn't like the "on" button, it felt like it would crack with use.
I bought these hoping I could make my Bluetooth headset fit better but these things made it impossible to wear.
We have tried 2 units and they both failed within 2 months.. Pros
Also difficult to put on.I'd recommend avoiding this product.
$50 Down the drain.
Absolutel junk.
Can't store anything but phone numbers to SIM.
Very disappointing.
I would not recommend this item to anyone.
Big Disappointment with calendar sync.
Not impressed.
Just does not work.
I even fully charged it before I went to bed and turned off blue tooth and wi-fi and noticed that it only had 20 % left in the morning.
Plus, I seriously do not believe it is worth its steep price point.
In my house I was getting dropped coverage upstairs and no coverage in my basement.
The phone takes FOREVER to charge like 2 to 5 hours literally.
Very unreliable service from T-mobile !
[...] down the drain because of a weak snap!
This is a simple little phone to use, but the breakage is unacceptible.
Pretty piece of junk.
This is so embarassing and also my ears hurt if I try to push the ear plug into my ear.
Unfortunately the ability to actually know you are receiving a call is a rather important feature and this phone is pitiful in that respect.
This is the first phone I've had that has been so cheaply made.
Awkward to use and unreliable.
Horrible phone.
If you are looking for a good quality Motorola Headset keep looking, this isn't it.
My father has the V265, and the battery is dying.
After a year the battery went completely dead on my headset.
Defective crap.
Poor Construction.
They work about 2 weeks then break.
Could not get strong enough signal.
I really wanted the Plantronics 510 to be the right one, but it has too many issues for me.The good
Excellent starter wireless headset.
Performed awful -- muffled, tinny incoming sound and severe echo for those on the other end of the call.
BT50 battery junk!.
The design might be ergonomic in theory but I could not stand having these in my ear.
camera color balance is AWFUL.
It dit not work most of the time with my Nokia 5320.
Looks good in the picture, but this case was a huge disappointment!!
I've had this bluetoooth headset for some time now and still not comfortable with the way it fits on the ear.
Plug was the wrong size.
the phone was unusable and was not new.
If I take a picture, the battery drops a bar, and starts beeping, letting me know its dieing.
It's so stupid to have to keep buying new chargers, car chargers, cradles, headphones and car kits every time a new phone comes out.
poor quality and service.
Poor product.
I'm a bit disappointed.
I tried talking real loud but shouting on the telephone gets old and I was still told it wasn't great.
There's a horrible tick sound in the background on all my calls that I have never experienced before.
The design is very odd, as the ear "clip" is not very comfortable at all.
At first I thought I was grtting a good deal at $7.44, until I plugged it into my phone (V3c Razr).
dont buy it.
So there is no way for me to plug it in here in the US unless I go by a converter.
Soyo technology sucks.
doesn't last long.
Still Waiting...... I'm sure this item would work well.. if I ever recieve it!
Poorly contstruct hinge.
Think it over when you plan to own this one!This sure is the last MOTO phone for me!
Problem is that the ear loops are made of weak material and break easily.
The bottowm line...another worthless, cheap gimmick from Sprint.
This pair of headphones is the worst that I have ever had sound-wise.
Att is not clear, sound is very distorted and you have to yell when you talk.
i would advise to not purchase this item it never worked very well.
It doesn't make you look cool.
Bought mainly for the charger, which broke soon after purchasing.
I put the latest OS on it (v1.15g), and it now likes to slow to a crawl and lock up every once in a while.
There's really nothing bad I can say about this headset.
We are sending it back.
I came over from Verizon because cingulair has nicer cell phones.... the first thing I noticed was the really bad service.
Unreliable - I'm giving up.
After my phone got to be about a year old, it's been slowly breaking despite much care on my part.
It was a waste of my money.
don't waste your money and time.
Due to this happening on every call I was forced to stop using this headset.
I checked everywhere and there is no feature for it which is really disappointing.
Does not fit.
I'll be looking for a new earpiece.
However, the keypads are so tinny that I sometimes reach the wrong buttons.
Unfortunately it did not work.
Couldn't figure it out
Worst software ever used.... If I could give this zero stars I would.
Lousy product.
Lasted one day and then blew up.
I bought it for my mother and she had a problem with the battery.
I was not impressed by this product.
Not enough volume.
Buyer--Be Very Careful!!!!!.
Were JERKS on the phone.
But when I check voice mail at night, the keypad backlight turns off a few seconds into the first message, and then I'm lost.
Disapointing Results.
I got the car charger and not even after a week the charger was broken...I went to plug it in and it started smoking.
I find this inexcusable and so will probably be returning this phone and perhaps changing carriers.
You get what you pay for I guess.
Bad Quality.
It's not what it says it is.
It's kind of embarrassing to use because of how it looks and mostly it's embarrassing how child-like the company is.
All in all, I'd expected a better consumer experience from Motorola.
Sending it back.
Everything about this product is wrong.First
It is cheap, and it feel and look just as cheap.
After receiving and using the product for just 2 days it broke.
Phone falls out easily.
The first thing that happened was that the tracking was off.
Linksys should have some way to exchange a bad phone for a refurb unit or something!
A must study for anyone interested in the "worst sins" of industrial design.
The BT headset was such a disapoinment.
I have had this phone for over a year now, and I will tell you, its not that great.
What a piece of junk.. I lose more calls on this phone.
Doesn't work at all.. I bougth it for my L7c and its not working.
very disappointed.
Earbud piece breaks easily.
I can barely ever hear on it and am constantly saying "what?"
It doesn't work in Europe or Asia.
I ordered this product first and was unhappy with it immediately.
I'm really disappointed all I have now is a charger that doesn't work.
Horrible, horrible protector.
Rip off---- Over charge shipping.
Case was more or less an extra that I originally put on but later discarded because it scratched my ear.
I've also had problems with the phone reading the memory card in which I always turn it on and then off again.
Piece of Junk.
The case is a flimsy piece of plastic and has no front or side protection whatsoever.
The battery is completely useless to me.
The biggest complaint I have is, the battery drains superfast.
So I had to take the battery out of the phone put it all back together and then restart it.
Bad Purchase.
It was horrible!.
Perhaps my phone is defective, but people cannot hear me when I use this.
Not only will it drain your player, but may also potentially fry it.
Improper description.... I had to return it.
Cant get the software to work with my computer.
I did not bother contacting the company for few dollar product but I learned the lesson that I should not have bought this form online anyway.
Reaching for the bottom row is uncomfortable, and the send and end keys are not where I expect them to be.3.
The calls drop, the phone comes on and off at will, the screen goes black and the worst of all it stops ringing intermittently.
The commercials are the most misleading.
Steer clear of this product and go with the genuine Palm replacementr pens, which come in a three-pack.
I don't think it would hold it too securly on your belt.
It makes very strange ticking noises before it ends the call.
I kept catching the cable on the seat and I had to pull the phone out to turn it on an off.
Piece of trash.
Not a good item.. It worked for a while then started having problems in my auto reverse tape player.
Then a few days later the a puff of smoke came out of the phone while in use.
Then I exchanged for the same phone, even that had the same problem.
Cumbersome design.
All three broke within two months of use.
One thing I hate is the mode set button at the side.
While I managed to bend the leaf spring back in place, the metal now has enough stress that it will break on the next drop.
The camera, although rated at an impressive 1.3 megapixels, renders images that fall well below expectations of such a relatively high resolution.
The screen does get smudged easily because it touches your ear and face.
Item Does Not Match Picture.
DO NOT BUY DO NOT BUYIT SUCKS
Doesn't Work.
Sprint charges for this service.
I am not impressed with this and i would not recommend this item to anyone.
This is essentially a communications tool that does not communicate.
stay away from this store, be careful.
It's A PIECE OF CRAP!
sucked, most of the stuff does not work with my phone.
Adapter does not provide enough charging current.
Talk about USELESS customer service.
What a big waste of time.
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!
You also cannot take pictures with it in the case because the lense is covered.
Don't make the same mistake that I did and please don't buy this phone.
The construction of the headsets is poor.
It's uncomfortable and the sound quality is quite poor compared with the phone (Razr) or with my previous wired headset (that plugged into an LG).
No additional ear gels provided, and no instructions whatsoever.
Echo Problem....Very unsatisfactory
These products cover up the important light sensor above the ear outlet.
And none of the tones is acceptable.
don't waste your money.
This is infuriating.
The loudspeaker option is great, the bumpers with the lights is very ... appealing.
It clicks into place in a way that makes you wonder how long that mechanism would last.
I found this product to be waaay too big.
The instructions didn't explain that a microphone jack could be used.
Uncomfortable In the Ear, Don't use with LG VX9900 (EnV).
You have to hold the phone at a particular angle for the other party to hear you clearly.
Can't upload ringtones from a third party.
Today is the second time I've been to their lunch buffet and it was pretty good.
I would recommend saving room for this!
This place receives stars for their APPETIZERS!!!
It is PERFECT for a sit-down family meal or get together with a few friends.
I had heard good things about this place, but it exceeding every hope I could have dreamed of.
They were golden-crispy and delicious.
The wontons were thin, not thick and chewy, almost melt in your mouth.
All in all an excellent restaurant highlighted by great service, a unique menu, and a beautiful setting.
Ample portions and good prices.
Their regular toasted bread was equally satisfying with the occasional pats of butter... Mmmm...!
The food was outstanding and the prices were very reasonable.
In an interesting part of town, this place is amazing.
I want to first say our server was great and we had perfect service.
Im in AZ all the time and now have my new spot.
The ambience is wonderful and there is music playing.
This is the place where I first had pho and it was amazing!!
Both of the egg rolls were fantastic.
It was just not a fun experience.
Cant say enough good things about this place.
The bartender was also nice.
Their steaks are 100% recommended!
Awesome service and food.
Our server was fantastic and when he found out the wife loves roasted garlic and bone marrow, he added extra to our meal and another marrow to go!
A good time!
I will come back here every time I'm in Vegas.
The lighting is just dark enough to set the mood.
Service was fantastic.
A couple of months later, I returned and had an amazing meal.
The food came out at a good pace.
Great food and awesome service!
I love the decor with the Chinese calligraphy wall paper.
Service is perfect and the family atmosphere is nice to see.
Once you get inside you'll be impressed with the place.
Would come back again if I had a sushi craving while in Vegas.
High-quality chicken on the chicken Caesar salad.
We were promptly greeted and seated.
I tried the Cape Cod ravoli, chicken,with cranberry...mmmm!
Fantastic service here.
Just had lunch here and had a great experience.
The pancake was also really good and pretty large at that.
The service was outshining & I definitely recommend the Halibut.
I also had to taste my Mom's multi-grain pumpkin pancakes with pecan butter and they were amazing, fluffy, and delicious!
A great way to finish a great.
I personally love the hummus, pita, baklava, falafels and Baba Ganoush (it's amazing what they do with eggplant!).
This place is pretty good, nice little vibe in the restaurant.
The chicken was deliciously seasoned and had the perfect fry on the outside and moist chicken on the inside.
So flavorful and has just the perfect amount of heat.
This wonderful experience made this place a must-stop whenever we are in town again.
We were sat right on time and our server from the get go was FANTASTIC!
OMG I felt like I had never eaten Thai food until this dish.
The sergeant pepper beef sandwich with auju sauce is an excellent sandwich as well.
Very very fun chef.
The food, amazing.
An excellent new restaurant by an experienced Frenchman.
I will be back many times soon.
There was a warm feeling with the service and I felt like their guest for a special treat.
The atmosphere here is fun.
I'm so happy to be here!!!"
Their chow mein is so good!
The service was great, even the manager came and helped with our table.
Just as good as when I had it more than a year ago!
What I really like there is the crepe station.
It was absolutely amazing.
Some may say this buffet is pricey but I think you get what you pay for and this place you are getting quite a lot!
On the up side, their cafe serves really good food.
We thought you'd have to venture further away to get good sushi, but this place really hit the spot that night.
Everyone is treated equally special.
This isn't a small family restaurant, this is a fine dining establishment.
I'll definitely be in soon again.
I miss it and wish they had one in Philadelphia!
This is an Outstanding little restaurant with some of the Best Food I have ever tasted.
Great place to have a couple drinks and watch any and all sporting events as the walls are covered with TV's.
Pretty cool I would say.
The flair bartenders are absolutely amazing!
The croutons also taste homemade which is an extra plus.
the potatoes were great and so was the biscuit.
So they performed.
It's worth driving up from Tucson!
Best tacos in town by far!!
The chips and salsa were really good, the salsa was very fresh.
Our server was super nice and checked on us many times.
The cocktails are all handmade and delicious.
They know how to make them here.
We made the drive all the way from North Scottsdale... and I was not one bit disappointed!
Great service and food.
We loved the biscuits!!!
Definitely worth venturing off the strip for the pork belly, will return next time I'm in Vegas.
They really want to make your experience a good one.
Favorite place in town for shawarrrrrrma!!!!!!
The burger is good beef, cooked just right.
The seasonal fruit was fresh white peach puree.
Great place to eat, reminds me of the little mom and pop shops in the San Francisco Bay Area.
Hawaiian Breeze, Mango Magic, and Pineapple Delight are the smoothies that I've tried so far and they're all good.
The food is good.
The atmosphere is modern and hip, while maintaining a touch of coziness.
Best tater tots in the southwest.
Back to good BBQ, lighter fare, reasonable pricing and tell the public they are back to the old ways.
Penne vodka excellent!
I believe that this place is a great stop for those with a huge belly and hankering for sushi.
Cute, quaint, simple, honest.
A fantastic neighborhood gem !!!
Now the pizza itself was good the peanut sauce was very tasty.
It was awesome.
The nachos are a MUST HAVE!
I like Steiners because it's dark and it feels like a bar.
I had the opportunity today to sample your amazing pizzas!
At first glance it is a lovely bakery cafe - nice ambiance, clean, friendly staff.
Waitress was good though!
You get incredibly fresh fish, prepared with care.
These were so good we ordered them twice.
Service is quick and friendly.
I liked the patio and the service was outstanding.
All of the tapas dishes were delicious!
In the summer, you can dine in a charming outdoor patio - so very delightful.
They have a good selection of food including a massive meatloaf sandwich, a crispy chicken wrap, a delish tuna melt and some tasty burgers.
The food was excellent and service was very good.
I have been here several times in the past, and the experience has always been great.
Great place fo take out or eat in.
An absolute must visit!
Very friendly staff.
I loved the bacon wrapped dates.
Last night was my second time dining here and I was so happy I decided to go back!
They have a plethora of salads and sandwiches, and everything I've tried gets my seal of approval.
The selection on the menu was great and so were the prices.
All in all, I can assure you I'll be back.
Great food.
Food was good, service was good, Prices were good.
The black eyed peas and sweet potatoes... UNREAL!
As always the evening was wonderful and the food delicious!
I have eaten here multiple times, and each time the food was delicious.
My fiancé and I came in the middle of the day and we were greeted and seated right away.
And considering the two of us left there very full and happy for about $20, you just can't go wrong.
Great food and service, huge portions and they give a military discount.
I *heart* this place.
this place is good.
I love the owner/chef, his one authentic Japanese cool dude!
I didn't know pulled pork could be soooo delicious.
I recently tried Caballero's and I have been back every week since!
The food was very good and I enjoyed every mouthful, an enjoyable relaxed venue for couples small family groups etc.
CONCLUSION: Very filling meals.
You won't be disappointed.
I was proven dead wrong by this sushi bar, not only because the quality is great, but the service is fast and the food, impeccable.
The steak and the shrimp are in my opinion the best entrees at GC.
Restaurant is always full but never a wait.
Love this place, hits the spot when I want something healthy but not lacking in quantity or flavor.
The fries were great too.
I don't each much pasta, but I love the homemade /hand made pastas and thin pizzas here.
This was my first crawfish experience, and it was delicious!
On a positive note, our server was very attentive and provided great service.
I had strawberry tea, which was good.
The waitress was friendly and happy to accomodate for vegan/veggie options.
We loved the place.
Nice, spicy and tender.
We walked away stuffed and happy about our first Vegas buffet experience.
Point your finger at any item on the menu, order it and you won't be disappointed.
The sides are delish - mixed mushrooms, yukon gold puree, white corn - beateous.
Service was fine and the waitress was friendly.
Generous portions and great taste.
The roast beef sandwich tasted really good!
Loved it...friendly servers, great food, wonderful and imaginative menu.
I had a seriously solid breakfast here.
We had a group of 70+ when we claimed we would only have 40 and they handled us beautifully.
Omelets are to die for!
They had a toro tartare with a cavier that was extraordinary and I liked the thinly sliced wagyu with white truffle.
The grilled chicken was so tender and yellow from the saffron seasoning.
To summarize... the food was incredible, nay, transcendant... but nothing brings me joy quite like the memory of the pneumatic condiment dispenser.
Great food for the price, which is very high quality and house made.
Best Buffet in town, for the price you cannot beat it.
Anyway, this FS restaurant has a wonderful breakfast/lunch.
The ambiance was incredible.
The atmosphere was great with a lovely duo of violinists playing songs we requested.
This place is amazing!
The food was very good.
The portion was huge!
The goat taco didn't skimp on the meat and wow what FLAVOR!
The food was great as always, compliments to the chef.
* Both the Hot & Sour & the Egg Flower Soups were absolutely 5 Stars!
He deserves 5 stars.
Lordy, the Khao Soi is a dish that is not to be missed for curry lovers!
This place is awesome if you want something light and healthy during the summer.
I was seated immediately.
I will continue to come here on ladies night andddd date night ... highly recommend this place to anyone who is in the area (;
They also have the best cheese crisp in town.
Food was delicious!
Both of them were truly unbelievably good, and I am so glad we went back.
you can watch them preparing the delicious food!)
The cow tongue and cheek tacos are amazing.
This was my first time and I can't wait until the next.
The staff are also very friendly and efficient.
Of all the dishes, the salmon was the best, but all were great.
Food was so gooodd.
They have great dinners.
Best fish I've ever had in my life!
Great brunch spot.
I don't have very many words to say about this place, but it does everything pretty well.
The sweet potato fries were very good and seasoned well.
I could eat their bruschetta all day it is devine.
Ambience is perfect.
We ordered the duck rare and it was pink and tender on the inside with a nice char on the outside.
Service was good and the company was better!"""

# Part 1: Cleaning text into a list of tokens

Our goal here is to represent each line of text in the "raw" data as a list of tokens.

A 'token' here is just a string of non-whitespace characters.

We'll make some simple decisions:

* Make everything lower case
* Remove any punctuation

In [5]:
def tokenize_text(raw_text):
    ''' Transform a plain-text string into a list of tokens
    
    We assume that *whitespace* divides tokens.
    
    Args
    ----
    raw_text : string
    
    Returns
    -------
    list_of_tokens : list of strings
        Each element is one token in the provided text
    '''
    list_of_tokens = raw_text.split() # split method divides on whitespace by default
    for pp in range(len(list_of_tokens)):
        cur_token = list_of_tokens[pp]
        # Remove punctuation
        for punc in ['?', '!', '_', '.', ',', '"', '/']:
            cur_token = cur_token.replace(punc, "")
        # Turn to lower case
        clean_token = cur_token.lower()
        # Replace the cleaned token into the original list
        list_of_tokens[pp] = clean_token
    return list_of_tokens

## Demonstration of turning the text into a list of tokens

Lets show the raw and token-list representations of the first 10 lines


In [6]:
for line in all_reviews_as_line_separated_string.split("\n")[:10]:
    print("Raw text:")
    print(line)
    print("Clean token list:")
    print(tokenize_text(line))

Raw text:
Oh and I forgot to also mention the weird color effect it has on your phone.
Clean token list:
['oh', 'and', 'i', 'forgot', 'to', 'also', 'mention', 'the', 'weird', 'color', 'effect', 'it', 'has', 'on', 'your', 'phone']
Raw text:
THAT one didn't work either.
Clean token list:
['that', 'one', "didn't", 'work', 'either']
Raw text:
Waste of 13 bucks.
Clean token list:
['waste', 'of', '13', 'bucks']
Raw text:
Product is useless, since it does not have enough charging current to charge the 2 cellphones I was planning to use it with.
Clean token list:
['product', 'is', 'useless', 'since', 'it', 'does', 'not', 'have', 'enough', 'charging', 'current', 'to', 'charge', 'the', '2', 'cellphones', 'i', 'was', 'planning', 'to', 'use', 'it', 'with']
Raw text:
None of the three sizes they sent with the headset would stay in my ears.
Clean token list:
['none', 'of', 'the', 'three', 'sizes', 'they', 'sent', 'with', 'the', 'headset', 'would', 'stay', 'in', 'my', 'ears']
Raw text:
Worst customer

<a id="part2"></a>

# Part 2: Building a fixed-size vocabulary

We want to select candidate tokens for our vocabulary.

Let's use the following *simple* rules to build our vocabulary

* Keep any token that appears at least 4 times in our corpus (entire dataset)

Why? If a token is *rare*, it might be difficult to learn a reliable pattern for how it can be used to predict sentiment, which is our ultimate goal.


In [7]:
tok_count_dict = dict()

for line in all_reviews_as_line_separated_string.split("\n"):
    tok_list = tokenize_text(line)
    for tok in tok_list:
        if tok in tok_count_dict:
            tok_count_dict[tok] += 1
        else:
            tok_count_dict[tok] = 1

### Print the 10 most common tokens

In [8]:
sorted_tokens = list(sorted(tok_count_dict, key=tok_count_dict.get, reverse=True))

In [9]:
for w in sorted_tokens[:10]:
    print("%5d %s" % (tok_count_dict[w], w))

  255 the
  150 and
  122 i
   91 a
   83 was
   81 it
   79 is
   72 to
   63 this
   53 in


### Print the 10 least common tokens

In [10]:
for w in sorted_tokens[-10:]:
    print("%5d %s" % (tok_count_dict[w], w))

    1 life
    1 brunch
    1 words
    1 potato
    1 bruschetta
    1 devine
    1 duck
    1 rare
    1 pink
    1 char


### Build vocabulary as list of all tokens that have count at least 4

In [11]:
# Create a list of strings that identify all tokens in our vocabulary
# We'll use a *list comprehension*, a way in Python to cleaning select a subset of a larger list
# by providing an if statement.

vocab_list = [w for w in sorted_tokens if tok_count_dict[w] >= 4]

In [12]:
type(vocab_list)

list

In [13]:
for w in vocab_list:
    print("%5d %s" % (tok_count_dict[w], w))

  255 the
  150 and
  122 i
   91 a
   83 was
   81 it
   79 is
   72 to
   63 this
   53 in
   49 of
   41 not
   39 for
   39 good
   36 with
   33 on
   33 my
   31 very
   31 great
   30 that
   30 had
   29 phone
   27 have
   27 service
   26 you
   26 food
   25 place
   23 are
   22 but
   22 all
   21 so
   20 were
   19 we
   18 they
   18 be
   15 time
   14 at
   14 an
   13 also
   13 would
   13 here
   13 back
   12 work
   12 first
   12 really
   11 product
   11 ear
   11 amazing
   10 has
   10 your
   10 headset
   10 out
   10 what
   10 just
   10 from
   10 about
   10 get
   10 delicious
    9 does
    9 use
    9 then
    9 don't
    9 i'm
    9 like
    9 it's
    9 will
    8 one
    8 disappointed
    8 me
    8 off
    8 could
    8 because
    8 battery
    8 as
    8 when
    8 experience
    8 or
    8 their
    8 our
    8 best
    7 enough
    7 there
    7 no
    7 bad
    7 quality
    7 doesn't
    7 made
    7 item
    7 piece
    7 if
    7 i've
 

### Discussion 2a: Do you see tokens in the vocabulary that might be useful for the sentiment prediction? What are they? Are there some that would never be useful? Why?

TODO make a list

### Exercise 2b: How many tokens are in the chosen vocabulary? How many would there be if you used count at least 2?

TODO

<a id="part3"></a>

# Part 3: Creating bag-of-words representation for individual review

Now, given the vocabulary we defined above in part 2, let's turn each text into a count vector.

Our goal is to write a method that can take each individual review text and produce a feature vector.

First, define a dictionary that maps each vocab term in order to an integer defining its order in the vocab

Each vocab term maps to a unique id.

In [14]:
vocab_dict = dict()
for vocab_id, tok in enumerate(vocab_list):
    vocab_dict[tok] = vocab_id

Define method to produce feature vector from provided text and the vocabulary (as a dict)

In [15]:
def transform_text_into_feature_vector(text, vocab_dict):
    ''' Produce count feature vector for provided text
    
    Args
    ----
    text : string
        A string of raw text, representing a single 'review'
    vocab_dict : dict with string keys
        If token is in vocabulary, will exist as key in the dict
        If token is not in vocabulary, will not be in the dict

    Returns
    -------
    count_V : 1D numpy array, shape (V,) = (n_vocab,)
        Count vector, indicating how often each vocab word
        appears in the provided text string
    '''
    V = len(vocab_dict.keys())
    count_V = np.zeros(V)
    for tok in tokenize_text(text):
        if tok in vocab_dict:
            vv = vocab_dict[tok]
            count_V[vv] += 1
    return count_V
        

### Example tranformations of short phrases

Below, we'll try our count-vector transformation on several manually constructed short 'documents'

* Common words: "the was this"
* Positive words: "good great fantastic"
* Negative words: "bad horrible awful"
* Nonsense words: "dinosaur nonsense"

In [16]:
# Common words (should produce many positive entries in first few positions of the vector)
transform_text_into_feature_vector("the was this the of a an of a", vocab_dict)

array([2., 0., 0., 2., 1., 0., 0., 0., 1., 0., 2., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

In [17]:
# Positive words (should produce a few positive entries!)
transform_text_into_feature_vector("good great fantastic excellent good", vocab_dict)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

In [18]:
# Negative words (should produce a few positive entries!)
transform_text_into_feature_vector("bad worse awful terrible horrible", vocab_dict)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

In [19]:
# Rare / nonsense words (should produce an all-zero vector!)
transform_text_into_feature_vector("dinosaur nonsense supercalifragilisticexpealidocious", vocab_dict)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

### Example tranformations of actual review (first row, index 0)

In [20]:
raw_text_0 = all_reviews_as_line_separated_string.split("\n")[0]
print(raw_text_0)
transform_text_into_feature_vector(raw_text_0, vocab_dict)

Oh and I forgot to also mention the weird color effect it has on your phone.


array([1., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

In [21]:
### Example tranformations of actual review (index 101)

In [22]:
raw_text_101 = all_reviews_as_line_separated_string.split("\n")[101]
print(raw_text_101)
transform_text_into_feature_vector(raw_text_101, vocab_dict)

Couldn't figure it out


array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

# Part 4: Using a bag-of-words representation for a classifier

Let's show how we can classify text reviews using our BoW count feature representation.

We'll assume we have N total reviews in our training set.

We have defined a vocabulary of V terms (in part 2 above).

In [23]:
N = len(all_reviews_as_line_separated_string.split("\n"))
V = len(vocab_list)

### Load the labels $y$ for all reviews

We'll use knowledge that we built the raw dataset here by stacking many negative reviews then many positive reviews.

In [24]:
y_tr_N = np.hstack([np.zeros(N//2), np.ones(N//2)])

### Transform to bow features $x$ for all reviews

We need a feature matrix $X$ (with N rows and V features). We can do this just stacking all the transformed features from individual reviews.

In [25]:
x_tr_NV = np.zeros((N, V))

In [26]:
for nn, raw_text_line in enumerate(all_reviews_as_line_separated_string.split("\n")):
    x_tr_NV[nn] = transform_text_into_feature_vector(raw_text_line, vocab_dict)

In [27]:
print(x_tr_NV.shape)

(400, 194)


### Exercise 4a: How many times does each word in vocabulary appear in our training set?

Hint: use np.sum with a specific axis specified, for the x_tr_NV array

### Train a binary classifier

Let's train a LogisticRegression classifier.

We'll do 20 iters to keep it fast.

Don't worry if you get warnings about convergence.

In [28]:
clf = sklearn.linear_model.LogisticRegression(C=1000.0, max_iter=20) # Just pick reasonable choices for quick demo

In [29]:
clf.fit(x_tr_NV, y_tr_N)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1000.0, max_iter=20)

### What is accuracy of our classifier on training data?

In [30]:
yhat_tr_N = clf.predict(x_tr_NV)
acc = np.mean( y_tr_N == yhat_tr_N )

print("Training accuracy: %.3f" % acc)

Training accuracy: 0.993


### What are the learned logistic regression weights for each token in our vocabulary?

Each token in our vocabulary is a feature in our model.

Logistic regression will learn one weight parameter for each feature (each vocab token)

In the code below, we look at the learned weights, sort them from decreasing to increasing order, and then print them next to the corresponding feature. This might help us *interpret* what has been learned.

Interpretation: 
* If a weight is very negative, the more that word appears, the more likely that review is a NEGATIVE sentiment one
* If a weight is very positive, the more that word appears, the more likley that review is a POSITIVE sentiment one

In [31]:
weights_V = clf.coef_[0]
sorted_tok_ids_V = np.argsort(weights_V)

for vv in sorted_tok_ids_V:
    print("% 7.3f %s" % (weights_V[vv], vocab_list[vv]))

-20.011 phone
-14.944 2
-13.671 headset
-11.589 no
-11.479 sound
-10.975 battery
-10.796 bad
-10.516 new
-10.466 waste
-10.388 still
 -9.777 product
 -9.750 case
 -9.597 thing
 -9.514 it
 -9.203 horrible
 -8.713 use
 -8.597 then
 -8.420 plug
 -8.029 quality
 -7.619 work
 -7.434 poor
 -7.391 piece
 -7.307 them
 -7.278 
 -7.066 but
 -6.818 of
 -6.689 ear
 -6.664 worst
 -6.388 i'm
 -6.309 that
 -6.126 doesn't
 -5.594 junk
 -5.074 don't
 -5.071 design
 -4.760 if
 -4.759 has
 -4.623 charger
 -4.513 wrong
 -4.490 from
 -4.402 not
 -4.130 do
 -4.088 this
 -4.082 charge
 -4.076 into
 -4.034 too
 -3.917 me
 -3.913 off
 -3.808 its
 -3.784 what
 -3.701 service
 -3.548 after
 -3.445 over
 -3.402 anyone
 -3.194 easily
 -3.065 only
 -2.874 same
 -2.857 take
 -2.744 did
 -2.723 call
 -2.614 bought
 -2.554 huge
 -2.383 could
 -2.380 both
 -2.362 enough
 -2.180 when
 -2.172 last
 -2.152 put
 -2.129 out
 -2.095 does
 -1.983 in
 -1.862 more
 -1.819 disappointed
 -1.739 these
 -1.733 right
 -1.677 my
 -1.

### Discussion 4b: Can you interpret these weights? What vocab terms would you expect to have large negative or large positive weights? Do they? 

TODO discuss and make a list of what makes sense and what might not.

### Exercise 4c: Try out your classifier on new data

Below are the raw text of 20 possible reviews, which are NOT in the original dataset we used in Parts 1-3.

What does your classifier predict for each one? Does your classifier generalize well to this new data?

(You can use your human ability to read to decide what the labels should be).

In [32]:
test_reviews = """Do not make the same mistake as me.
I might have gotten a defect, but I would not risk buying it again because of the built quality alone.
Not worth it.
you could only take 2 videos at a time and the quality was very poor.
If you plan to use this in a car forget about it.
I have 2-3 bars on my cell phone when I am home, but you cant not hear anything.
Battery has no life.
The internet access was fine, it the rare instance that it worked.
Saggy, floppy piece of junk.
wont work right or atleast for me.
Good service, very clean, and inexpensive, to boot!
The owners are super friendly and the staff is courteous.
Very good food, great atmosphere.1
This was my first and only Vegas buffet and it did not disappoint.
Interesting decor.
Plus, it's only 8 bucks.
Great steak, great sides, great wine, amazing desserts.
Four stars for the food & the guy in the blue shirt for his great vibe & still letting us in to eat !
The staff is great, the food is delish, and they have an incredible beer selection.
Nargile - I think you are great"""

In [33]:
# Try a prediction for the first example test review
x_V = transform_text_into_feature_vector("Not worth it.", vocab_dict)
clf.predict(x_V.reshape((1,V)))

array([0.])

In [34]:
# TODO make predictions for the 20 test reviews above.

# Part 5: Using built-in sklearn tool: CountVectorizer

Instead of writing manual code, you can use CountVectorizer class from sklearn.

This class lets you do fast and repeatable BoW feature extraction given a dataset.

You can control things like:

* How to ignore rare vocab terms.
* How to ignore too common vocab terms (like the, an, a that don't provide meaningful semantic content)
* How to ignore punctuation
* Whether to use single-word features, or ordered pairs (bigrams aka 2-grams) or even 3-grams.
* Whether to produce count or binary features.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

Look up the docs for CountVectorizer

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

### Parameters of the CountVectorizer

Here are all the settings you can control when you construct it.

```
Parameters
----------
input : string {'filename', 'file', 'content'}, default='content'
    If 'filename', the sequence passed as an argument to fit is
    expected to be a list of filenames that need reading to fetch
    the raw content to analyze.

    If 'file', the sequence items must have a 'read' method (file-like
    object) that is called to fetch the bytes in memory.

    Otherwise the input is expected to be a sequence of items that
    can be of type string or byte.

lowercase : bool, default=True
    Convert all characters to lowercase before tokenizing.

tokenizer : callable, default=None
    Override the string tokenization step while preserving the
    preprocessing and n-grams generation steps.
    Only applies if ``analyzer == 'word'``.

stop_words : string {'english'}, list, default=None
    If 'english', a built-in stop word list for English is used.
    There are several known issues with 'english' and you should
    consider an alternative (see :ref:`stop_words`).

    If a list, that list is assumed to contain stop words, all of which
    will be removed from the resulting tokens.
    Only applies if ``analyzer == 'word'``.

    If None, no stop words will be used. max_df can be set to a value
    in the range [0.7, 1.0) to automatically detect and filter stop
    words based on intra corpus document frequency of terms.

ngram_range : tuple (min_n, max_n), default=(1, 1)
    The lower and upper boundary of the range of n-values for different
    word n-grams or char n-grams to be extracted. All values of n such
    such that min_n <= n <= max_n will be used. For example an
    ``ngram_range`` of ``(1, 1)`` means only unigrams, ``(1, 2)`` means
    unigrams and bigrams, and ``(2, 2)`` means only bigrams.
    Only applies if ``analyzer is not callable``.

analyzer : string, {'word', 'char', 'char_wb'} or callable,             default='word'
    Whether the feature should be made of word n-gram or character
    n-grams.
    Option 'char_wb' creates character n-grams only from text inside
    word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features
    out of the raw, unprocessed input.

    .. versionchanged:: 0.21

    Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
    first read from the file and then passed to the given callable
    analyzer.

max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document
    frequency strictly higher than the given threshold (corpus-specific
    stop words).
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold. This value is also
    called cut-off in the literature.
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

max_features : int, default=None
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

vocabulary : Mapping or iterable, default=None
    Either a Mapping (e.g., a dict) where keys are terms and values are
    indices in the feature matrix, or an iterable over terms. If not
    given, a vocabulary is determined from the input documents. Indices
    in the mapping should not be repeated and should not have any gap
    between 0 and the largest index.

binary : bool, default=False
    If True, all non zero counts are set to 1. This is useful for discrete
    probabilistic models that model binary events rather than integer
    counts.
```

### Attributes of the CountVectorizer

After you fit the count vectorizer, you can access these attributes.

```
Attributes
----------
vocabulary_ : dict
    A mapping of terms to feature indices.

fixed_vocabulary_: boolean
    True if a fixed vocabulary of term to indices mapping
    is provided by the user

stop_words_ : set
    Terms that were ignored because they either:

      - occurred in too many documents (`max_df`)
      - occurred in too few documents (`min_df`)
      - were cut off by feature selection (`max_features`).

    This is only available if no vocabulary was given.
```

### Prepare our training data (make it an iterable)

In [36]:
list_of_training_text_reviews = all_reviews_as_line_separated_string.split("\n")

### Create CountVectorizer that uses our *predefined* vocabulary

In [37]:
bow_preprocessor = CountVectorizer(binary=False, vocabulary=vocab_dict)

In [38]:
bow_preprocessor.fit(list_of_training_text_reviews)

CountVectorizer(vocabulary={'': 156, '-': 137, '2': 97, 'a': 3, 'about': 55,
                            'after': 131, 'again': 179, 'all': 29, 'also': 38,
                            'always': 98, 'amazing': 47, 'an': 37, 'and': 1,
                            'anyone': 151, 'are': 27, 'as': 73, 'at': 36,
                            'atmosphere': 190, 'away': 181, 'awesome': 188,
                            'back': 41, 'bad': 83, 'battery': 72, 'be': 34,
                            'because': 71, 'been': 107, 'best': 79, 'both': 148,
                            'bought': 147, 'buffet': 182, ...})

### How do we get the features?

Use sklearn API's `transform` method of your trained extractor.

This will deliver a SPARSE matrix representation (since so many entries are exactly zero, as most documents do not contain any instances of many vocab terms).

In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.

See scipy.sparse docs for details: <https://docs.scipy.org/doc/scipy/reference/sparse.html>

In [39]:
sparse_arr = bow_preprocessor.transform(list_of_training_text_reviews[:3])
print(type(sparse_arr))
print(sparse_arr.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(3, 194)


You can convert to a dense representaiton via the `toarray()` method

In [40]:
dense_arr_3V = sparse_arr.toarray()
print(type(dense_arr_3V))
print(dense_arr_3V.shape)

<class 'numpy.ndarray'>
(3, 194)


Print out the extracted count features:

In [41]:
dense_arr_3V

array([[1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

### CountVectorizer that builds its own UNIGRAM vocabulary

* ngram_range=(1,1) : Means it will use unigrams (individual tokens)
* min_df=1 : Means include any term that occurs in at least one document in training set
* max_df=1.0 : Means include any terms that appears in less than 100% (fraction of 1.0) of training docs
* binary=False : means it will do counts, not just binary presence/absence of each vocab term


In [42]:
bow_preprocessor = CountVectorizer(ngram_range=(1,1), min_df=1, max_df=1.0, binary=False)

In [43]:
bow_preprocessor.fit(list_of_training_text_reviews)

CountVectorizer()

In [44]:
len(bow_preprocessor.vocabulary_)

1212

Print the first 10 words in the vocabulary

In [45]:
for term, count in list(bow_preprocessor.vocabulary_.items())[:10]:
    print("%4d %s" % (count, term))

 714 oh
  40 and
 421 forgot
1071 to
  30 also
 646 mention
1045 the
1166 weird
 209 color
 338 effect


### CountVectorizer that builds its own vocabulary with 2-grams 

2-grams means we look at an ordered pair of words, not individual words.

So "New York", "very bad", "not good", and "the cat" are all 2-grams (aka bigrams).


* ngram_range=(2,2) means look for 2-grams, not 1-grams
* min_df=3 : Means include any term that occurs i requires a 2-gram to appear in at least 3 documents in train set
* max_df=0.5 : Means include any terms that appears in less than 50% (fraction of 0.5) of training docs
* binary=False : means it will do counts, not just binary presence/absence of each vocab term


In [46]:
two_gram_preprocessor = CountVectorizer(binary=False, ngram_range=(2,2), min_df=3, max_df=0.75)

In [47]:
two_gram_preprocessor.fit(list_of_training_text_reviews)

CountVectorizer(max_df=0.75, min_df=3, ngram_range=(2, 2))

In [48]:
len(two_gram_preprocessor.vocabulary_)

116

In [49]:
for term, count in two_gram_preprocessor.vocabulary_.items():
    print("%4d %s" % (count, term))

 106 waste of
  17 does not
  93 to use
  56 of the
 111 with the
  32 in my
  94 very disappointed
  81 the service
  69 service was
 105 was very
  41 is the
  29 had to
  10 and the
  70 that it
  45 it is
  80 the same
  16 do not
 110 with my
  47 like the
  19 felt like
  12 and they
  89 this product
  65 recommend this
  86 this item
  91 to anyone
  53 not impressed
  54 not work
  33 in the
  78 the phone
  85 this is
  60 piece of
  55 of junk
  75 the ear
  51 my ear
 113 you are
  87 this phone
  76 the first
  72 the battery
  92 to be
  15 but it
  21 for me
  57 on the
 104 was the
  13 and was
 102 was not
  30 have to
   8 and service
  42 is very
  39 is not
  52 my phone
  83 there is
  44 it in
  59 phone for
  82 the worst
 115 you have
  22 for the
   7 and it
   0 about this
  46 it was
 109 which is
  48 ll be
  11 and then
   9 and so
 114 you get
  74 the company
   1 all in
  31 in all
  18 doesn work
  43 it and
  14 because it
   5 and had
  28 had the
  3

# Part 6: A Bag-of-Words CountVectorizer + Classifier pipeline

If we do a *pipeline*, we can make sure that:

1) We apply the same text transforms to training and text data

2) We use nice sklearn functionality that you don't need to reinvent from scratch

In this part, we'll show you how to combine CountVectorizer with a classifier, and do a *combined* grid search over relevant hyperparameters.

### Create a pipeline : CountVectors + classifier

In [50]:
my_bow_classifier_pipeline = sklearn.pipeline.Pipeline([
    ('my_bow_feature_extractor', CountVectorizer(min_df=1, max_df=1.0, ngram_range=(1,1))),
    ('my_classifier', sklearn.linear_model.LogisticRegression(C=1.0, max_iter=20, random_state=101)),
])


### Define hyperparam grid to search

Remember, for a *pipeline*, the name of each parameter is built by concatenating string names, like

```
<step_name>_<hyperparameter_name>
```

* where <step_name> is name of the step in the pipeline, like "my_classifier", which we defined when we created the pipeline
* where <hyperparameter_name> is the name of the hyperparameter for that step, like "C" which is a known parameter for that step's sklearn class



In [51]:
my_parameter_grid_by_name = dict()
my_parameter_grid_by_name['my_bow_feature_extractor__min_df'] = [1, 2, 4]
my_parameter_grid_by_name['my_classifier__C'] = np.logspace(-4, 4, 9)

In [52]:
my_scoring_metric_name = 'accuracy'

### Prepare a "predefined" split of our training data into train and validation (so we can do grid search)

In [53]:
# Number of training text reviews and labels
N

400

Recall the training labels have STRUCTURE, so we better scramble them when we define our split

In [54]:
print(y_tr_N)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

We have 400 total examples. Let's randomly assign 100 to validation and 300 to training


In [55]:
prng = np.random.RandomState(0)

valid_ids = prng.choice(np.arange(N), size=100)

valid_indicators_N = np.zeros(N)
valid_indicators_N[valid_ids] = -1

In [56]:
my_splitter = sklearn.model_selection.PredefinedSplit(valid_indicators_N)

### Create a custom searcher object with all our settings in place.

In [57]:
grid_searcher = sklearn.model_selection.GridSearchCV(
    my_bow_classifier_pipeline,
    my_parameter_grid_by_name,
    scoring=my_scoring_metric_name,
    cv=my_splitter,
    refit=False)

### Fit the grid search

Remember, we expect some convergence warnings.

In [58]:
grid_searcher.fit(list_of_training_text_reviews, y_tr_N)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

GridSearchCV(cv=PredefinedSplit(test_fold=array([ 0,  0, ..., -1,  0])),
             estimator=Pipeline(steps=[('my_bow_feature_extractor',
                                        CountVectorizer()),
                                       ('my_classifier',
                                        LogisticRegression(max_iter=20,
                                                           random_state=101))]),
             param_grid={'my_bow_feature_extractor__min_df': [1, 2, 4],
                         'my_classifier__C': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
       1.e+04])},
             refit=False, scoring='accuracy')

In [59]:
gsearch_results_df = pd.DataFrame(grid_searcher.cv_results_).copy()

param_keys = ['param_my_bow_feature_extractor__min_df', 'param_my_classifier__C']

# Rearrange row order so it is easy to skim
gsearch_results_df.sort_values(param_keys, inplace=True)


### Print the results of grid search

In [60]:
gsearch_results_df[param_keys + ['split0_test_score', 'rank_test_score']]

Unnamed: 0,param_my_bow_feature_extractor__min_df,param_my_classifier__C,split0_test_score,rank_test_score
0,1,0.0001,0.498413,24
1,1,0.001,0.501587,22
2,1,0.01,0.720635,14
3,1,0.1,0.777778,7
4,1,1.0,0.777778,7
5,1,10.0,0.793651,1
6,1,100.0,0.790476,3
7,1,1000.0,0.793651,1
8,1,10000.0,0.787302,5
9,2,0.0001,0.498413,24


### Discussion 6a: Which settings of min_df seem to perform best? Why do you think that is?

Hint: Which settings produce more features and which produce less? 

### Challenge Exercise 6b: Do 2-grams perform better than 1-grams?

Can you run a grid search to find out?