# Sentiment Analysis

Lots of libraries exist that will do sentiment analysis for you. Imagine that: just taking a sentence, throwing it into a library, and getting back a score! How convenient!

It's also **totally irresponsible** unless you know how the sentiment analyzer was built. In this homework we're going to see how sentiment analysis is done with a few packages.

## Installation

If you haven't already, you'll want to `pip install` two packages: NLTK and TextBlob.

In [1]:
# !pip install nltk
# !pip install textblob

# NLTK: Natural Language Toolkit

[Natural Language Toolkit](https://www.nltk.org/) is the basis for a lot of text analysis done in Python. It's old and terrible and slow, but it's just been used for so long and does so many things that it's generally the default when people get into text analysis. The new kid on the block is [spaCy](https://spacy.io/) (but it doesn't do sentiment analysis so we're leaving it out of this).

When you first run NLTK, you need to download some datasets to make sure it will be able to do everything you want.

In [2]:
import nltk
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/tbi/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/tbi/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tbi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

To do sentiment analysis with NLTK, it only takes a couple lines of code:

In [3]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
sia.polarity_scores("This restaurant was great, but I'm not sure if I'll go there again.")

{'neg': 0.153, 'neu': 0.688, 'pos': 0.159, 'compound': 0.0276}

Asking `SentimentIntensityAnalyzer` for the `polarity_scores` gave us four values in a dictionary:

- **negative:** the negative sentiment in a sentence
- **neutral:** the neutral sentiment in a sentence
- **positive:** the positive sentiment in the sentence
- **compound:** the aggregated sentiment, normalized to fall between -1 (most negative) and +1 (most positive)
    
Seems simple enough!
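If you want a single label instead of four numbers, a common convention is to cut the `compound` score at ±0.05. Here's a minimal sketch of that idea. The function name `label_from_compound` and its default threshold are our own choices for illustration, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# 0.0276 is the restaurant sentence above; the other two values are made up
for score in (0.0276, 0.6, -0.4):
    print(score, label_from_compound(score))
```

The restaurant sentence lands on "neutral" even though it contains the word "great", because its compound score is so close to zero.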

In [4]:
text = "I just got a call from my boss - does he realise it's Saturday?"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Just like in real life, if you use an emoticon you can be read as being more positive:

In [5]:
text = "I just got a call from my boss - does he realise it's Saturday? :)"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 0.786, 'pos': 0.214, 'compound': 0.4588}

In [6]:
text = "I just got a call from my boss - does he realise it's Saturday? 😊"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Why didn't it understand the emoji the same way it understood the emoticon? Well, **it only knows the words that it's been trained on,** and if VADER's never seen 😊 before it won't know what to think of it.

# TextBlob

TextBlob is built on top of NLTK, but is infinitely easier to use. It's still slow, but _it's so so so easy to use_. 

You can just feed TextBlob your sentence, then ask for a `.sentiment`!

In [7]:
!pip install textblob



In [8]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

In [9]:
blob = TextBlob("This restaurant was great, but I'm not sure if I'll go there again.")
blob.sentiment

Sentiment(polarity=0.275, subjectivity=0.8194444444444444)

**How could it possibly be easier than that?!?!?** This time we get a `polarity` and a `subjectivity` instead of all of those different scores, but it's basically the same idea.

If you like options: it turns out TextBlob actually has multiple sentiment analysis tools! How fun! We can plug in a different analyzer to get a different result.

In [10]:
blob = TextBlob("This restaurant was great, but I'm not sure if I'll go there again.", analyzer=NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=0.5879425317005774, p_neg=0.41205746829942275)

Wow, that's a **very different result.** To understand why it's so different, we need to talk about where these sentiment numbers come from.

# But where do those numbers come from?

The most important thing to understand is **sentiment is always just an opinion.** In this case it's an opinion, yes, but specifically **the opinion of a machine.**

## VADER

NLTK's Sentiment Intensity Analyzer works by using something called **VADER**, which is a list of words that have a sentiment associated with each of them.

|Word|Sentiment rating|
|---|---|
|tragedy|-3.4|
|rejoiced|2.0|
|disaster|-3.1|
|great|3.1|

If you have more positives, the sentence is more positive. If you have more negatives, it's more negative. It can also take into account things like capitalization - you can read more about the classifier [here](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html), or the actual paper it came out of [here](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

**How do they know what's positive/negative?** They came up with a very big list of words, then asked people on the internet and paid them one cent for each word they scored.
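To make the word-list idea concrete, here's a toy version of a lexicon scorer. It's a rough sketch, not VADER itself: it uses only the four ratings from the table above, it ignores capitalization, negation and punctuation, and the names `TOY_LEXICON` and `toy_compound` are made up for this example. The squashing step (dividing by the square root of the total squared plus 15) follows the normalization in VADER's published code as far as we know, so treat that constant as an assumption.

```python
import math

# Ratings copied from the table above; a real lexicon has thousands of entries
TOY_LEXICON = {"tragedy": -3.4, "rejoiced": 2.0, "disaster": -3.1, "great": 3.1}

def toy_compound(sentence, lexicon=TOY_LEXICON, alpha=15):
    """Sum the ratings of known words, then squash the total into (-1, 1)."""
    words = sentence.lower().split()
    # Words that are not in the lexicon contribute 0.0
    total = sum(lexicon.get(word.strip(".,!?"), 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)

print(round(toy_compound("The food was great!"), 4))
print(round(toy_compound("The food was great 😊"), 4))
print(round(toy_compound("What a disaster."), 4))
```

The emoji sentence scores exactly the same as the plain one: a word (or emoji) that isn't in the list contributes nothing.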

## TextBlob's `.sentiment`

TextBlob's sentiment analysis is based on a separate library called [pattern](https://www.clips.uantwerpen.be/pattern).

> The sentiment analysis lexicon bundled in Pattern focuses on adjectives. It contains adjectives that occur frequently in customer reviews, hand-tagged with values for polarity and subjectivity.

Same kind of thing as NLTK's VADER, but it specifically looks at words from customer reviews.

**How do they know what's positive/negative?** They look at (mostly) adjectives that occur in customer reviews and hand-tag them.

## TextBlob's `.sentiment` + NaiveBayesAnalyzer

TextBlob's other option uses a `NaiveBayesAnalyzer`, which is a machine learning technique. When you use this option with TextBlob, the sentiment is coming from "an NLTK classifier trained on a movie reviews corpus."

**How do they know what's positive/negative?** They looked at movie reviews and their scores, then used machine learning to see which words are associated with a positive/negative rating.
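Here's a tiny sketch of what "see which words are associated with a positive/negative rating" can look like. This is a hand-rolled naive Bayes with add-one smoothing on four invented reviews. It is not TextBlob's or NLTK's actual code, and `TRAINING`, `train` and `classify` are hypothetical names for this example only.

```python
import math
from collections import Counter

# Hypothetical, hand-written "reviews" standing in for a real labeled corpus
TRAINING = [
    ("a great and moving film", "pos"),
    ("great acting and a fun story", "pos"),
    ("a boring and predictable film", "neg"),
    ("boring story and terrible acting", "neg"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(TRAINING)
print(classify("a great story", word_counts, label_counts))
print(classify("terrible and boring", word_counts, label_counts))
```

Nobody told the model that "great" is a happy word. It only noticed that "great" shows up in the reviews labeled positive.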

## What's this mean for me?

When you're doing automatic sentiment analysis, you have two major questions: 

* Where does the list of known words come from?
* Where do the positive/negative scores come from?

Let's compare the tools we've used so far.

|technique|word source|word selection|scores|
|---|---|---|---|
|NLTK (VADER)|everywhere|hand-picked|internet people, word-by-word|
|TextBlob|product reviews|hand-picked, mostly adjectives|internet people, word-by-word|
|TextBlob + NaiveBayesAnalyzer|movie reviews|all words|automatic based on score|

A major thing that should jump out at you is **how different the sources are.**

While VADER focuses on content found everywhere, TextBlob's two options are specific to certain domains. The [original paper for VADER](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf) passive-aggressively noted that VADER is effective at general use, but being trained on a specific domain can have benefits: 

> While some algorithms performed decently on test data from the specific domain for which it was expressly trained, they do not significantly outstrip the simple model we use.

They're basically saying, "if you train a model on words from a certain field, it will be good at that field."

## Questions

### Question 1: Is it okay to use a sentiment analyzer built on product reviews to check the sentiment of tweets? How about to check the sentiment of wine reviews?

In [11]:
# it might be problematic, as the context of tweets really depends. it would be more accurate in the case of product reviews for wine


### Question 2: Is it okay to use a sentiment analyzer trained on everything to check the sentiment of tweets? How about to check the sentiment of wine reviews?

In [12]:
# Tweets - it might be more inclusive and more subjective than the analyzer trained on product reviews
# Wine reviews - less accurate representation than the last one


### Question 3: If I'm trying to report on whether people generally like or dislike what is happening throughout the Democratic debates, could I use these sorts of tools on tweets? Let's hear arguments for both sides.

In [13]:
# better to train the analyzer on related field

# Training our own sentiment analyzer

We don't want to rely on other people, we want to do this ourselves! There are two major ways to do sentiment analysis:

* Have a list of words that you humans assign positive or negative scores to
* Look at something scored (movie reviews, product reviews) and figure out which words appear with which scores

Depending on how you look at it, it's either a classification or a regression problem. We'll see the difference down below.

## Training on tweets

Let's say we were going to analyze the sentiment of tweets. If we had a list of tweets that were scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.

Luckily, we have **Sentiment140** - http://help.sentiment140.com/for-students - a list of 1.6 million tweets along with a score as to whether they're negative (0) or positive (4). We'll use it to build our own machine learning algorithm to separate positivity from negativity.

### Read in our data

In [14]:
import pandas as pd

columns = ['polarity', 'id', 'datetime', 'query', 'username', 'content']
df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", 
                 names=columns,
                 encoding='latin-1')
df = df.dropna()
df.head()

Unnamed: 0,polarity,id,datetime,query,username,content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


### Cleaning our data

The `polarity` field is whether something is positive or negative. How many do we have of each?

In [15]:
df.polarity.value_counts()

4    800000
0    800000
Name: polarity, dtype: int64

According to the documentation, `0` is negative and `4` is positive. Weird, right? Let's make it zero and one instead.

In [16]:
df['polarity'] = df.polarity.replace(4, 1)

Confirm you have 800k of each.

In [17]:
df.polarity.value_counts()

1    800000
0    800000
Name: polarity, dtype: int64

That is a **lot of tweets.**

Let's be honest: it's going to take our algorithms a long long time to process that many. Instead of working with our entire dataframe, let's use a **sample of 20,000**. If things are still slow, we can decrease this number.

* **Tip:** `df.sample(5)` will give you a sample of 5 elements of your dataframe

In [18]:
df = df.sample(20000)

Confirm you have 20,000 rows and 6 columns.

In [19]:
df.shape

(20000, 6)

## Vectorize our tweets

Create a `TfidfVectorizer` and use it to vectorize our tweets. Since we don't have all the time in the world, we should probably use `max_features` to only take a selection of terms - how about 2000 for now?

* **Tip:** Your end result should be a `words_df`
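If you're curious what a TF-IDF number actually is, here's a plain-Python sketch on three made-up "tweets". It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by scaling each row to length 1, which to our knowledge matches scikit-learn's defaults. The tokenizing is much cruder than the real thing, and the `tfidf` function is hypothetical, written just for this example.

```python
import math
from collections import Counter

docs = ["good good morning", "good night", "bad morning"]  # hypothetical tweets

def tfidf(docs):
    """Return (vocabulary, rows); each row holds one L2-normalized weight per word."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[word] * idf[word] for word in vocab]
        length = math.sqrt(sum(value * value for value in raw))
        rows.append([value / length for value in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(value, 3) for value in row])
```

In the second row, "night" gets a bigger weight than "good" because it appears in fewer documents. Words that show up everywhere are treated as less informative.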

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
tv = TfidfVectorizer(max_features = 2000)
ft = tv.fit_transform(df.content)
words_df = pd.DataFrame(ft.toarray(), columns = tv.get_feature_names())

Your dataframe should look something like

|00|000|10|...|your|...|yummy|yup|½t|
|---|---|---|---|---|---|---|---|---|
|0.0|0.0|0.0|...|0.0|...|0.0|0.0|0.0|
|0.0|0.0|0.0|...|0.0|...|0.0|0.0|0.0|
|0.0|0.0|0.0|...|0.235754|...|0.0|0.0|0.0|
|0.0|0.0|0.0|...|0.0|...|0.0|0.0|0.0|


In [22]:
words_df.head()

Unnamed: 0,00,000,09,10,100,11,12,14,15,16,...,your,youre,yours,yourself,youtube,yr,yummy,yup,zoo,ðµ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Training our algorithm

### Setting up our variables

Create an `X` and a `y`, same as ever. In this case, what are our **features** and what are our **labels?**

In [59]:
col = df.columns.tolist()
col.remove('polarity')

In [60]:
X = words_df
y = df.drop(columns = col)

Confirm that `X` has 20,000 rows and 2,000 columns, and that `y` has 20,000 rows of 1 column.

In [61]:
X.shape

(20000, 2000)

In [81]:
y.shape

(20000, 1)

### Picking an algorithm

What kind of algorithm do we want? We've used quite a few, and I just pulled a couple more classifiers out of thin air.

In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

When picking an algorithm, think about what the output should be: is it a category? A probability, an amount? In this case **it might be any of those!**

For example:

* **A category:** `0` or `1` for negative vs positive
* **A probability:** The % chance that it's either negative or positive (between 0 and 1)
* **An amount:** A score between 0 and 1 about how positive it is

So hey, let's just make **one of each** of these. Name them `linreg`, `logreg`, `forest`, `svc`, and `bayes`.

The two new ones - `LinearSVC` and `MultinomialNB` - work exactly the same as your other classifiers. You'll be doing the standard creation and fitting:

```python
svc = LinearSVC()
svc.fit(X, y)
```

**Create and train classifiers in the cells below.** Add `%%time` to the top of each cell to see how long they take to train.

* **Tip:** Remember you need to add `C=1e9` to logistic regression, and specify the solver!
* **Tip:** If the logistic regression doesn't converge, it hasn't found an answer. You might need to increase `max_iter` (the default is 100)

In [89]:
%%time
# Create and train a linear regression
mod = LinearRegression()
linreg = mod.fit(X, y)

CPU times: user 11.4 s, sys: 474 ms, total: 11.9 s
Wall time: 6.84 s


In [83]:
%%time
# Create and train a logistic regression - if it doesn't converge be sure to increase max_iter
logreg = LogisticRegression(C = 1e9, solver = 'lbfgs', max_iter = 2000)
logreg.fit(X, df.polarity)

CPU times: user 45.3 s, sys: 589 ms, total: 45.9 s
Wall time: 24.2 s


In [85]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(X, df.polarity)

CPU times: user 1min 24s, sys: 698 ms, total: 1min 25s
Wall time: 1min 27s


In [86]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 193 ms, sys: 6.58 ms, total: 200 ms
Wall time: 214 ms


  y = column_or_1d(y, warn=True)


In [87]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 128 ms, sys: 25 ms, total: 153 ms
Wall time: 122 ms


  y = column_or_1d(y, warn=True)


**How long did each take to train?** How much faster were some compared to others?

In [None]:
# random forest is the slowest

# Use our models on some new data

Now that we've trained our models, **they can try to predict whether a sentence is positive or negative**.

**Add five more sentences to the list below.** They should be a mix of positive and negative. They can be boring, they can be exciting, they can be short, they can be long.

In [90]:
# Create some test data

pd.set_option("display.max_colwidth", 200)

unknown = pd.DataFrame([
       "I'm not sure how I feel about toast",
       "Did you see the baseball game yesterday?",
       "The package was delivered late and the contents were broken",
       "Trashy television shows are some of my favorites",
       "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
       "I find chirping birds irritating, but I know I'm not the only one"
], columns=['content'])
unknown

Unnamed: 0,content
0,I'm not sure how I feel about toast
1,Did you see the baseball game yesterday?
2,The package was delivered late and the contents were broken
3,Trashy television shows are some of my favorites
4,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it."
5,"I find chirping birds irritating, but I know I'm not the only one"


First we need to **vectorize** our sentences into numbers, so the algorithm can understand them.

Our algorithm only knows **certain words.** Run `vectorizer.get_feature_names()` to show you the list of the words it knows.

In [116]:
unknown_vt = tv.transform(unknown.content)
pd.DataFrame(unknown_vt.toarray(), columns = tv.get_feature_names())

Unnamed: 0,00,000,09,10,100,11,12,14,15,16,...,your,youre,yours,yourself,youtube,yr,yummy,yup,zoo,ðµ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Usually when we use the vectorizer, we write code like this:
    
```python
vectors = vectorizer.fit_transform(....)
```

Which both learns all the words **and** counts them. In this case **we already have the list of words we know, we only want to count them.** So instead of `.fit_transform`, we just use `.transform`:

```python
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = ......
```
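To see the difference between learning the words and counting them, here's a stripped-down stand-in for a vectorizer. `TinyCountVectorizer` is a hypothetical class written for this example: it counts whole words instead of computing TF-IDF, and it is not how scikit-learn is implemented.

```python
class TinyCountVectorizer:
    """A toy vectorizer: learn a vocabulary with fit, count known words with transform."""

    def fit(self, texts):
        # Learn the vocabulary: every distinct lowercase word in the training texts
        self.vocabulary_ = sorted({word for text in texts for word in text.lower().split()})
        return self

    def transform(self, texts):
        # Count only words already in the vocabulary; anything else is ignored
        rows = []
        for text in texts:
            words = text.lower().split()
            rows.append([words.count(term) for term in self.vocabulary_])
        return rows

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)

tiny = TinyCountVectorizer()
train_rows = tiny.fit_transform(["good morning", "bad morning"])
new_rows = tiny.transform(["good good evening"])

print(tiny.vocabulary_)  # ['bad', 'good', 'morning']
print(new_rows)          # [[0, 2, 0]]
```

The word "evening" was never seen during `fit`, so it simply disappears. The new row still has exactly the same columns as the training rows, which is what the models need.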

Finish making your `unknown_words_df` in the cell below.

In [118]:
unknown_words_df = pd.DataFrame(unknown_vt.toarray(), columns = tv.get_feature_names())

Confirm `unknown_words_df` has one row per sentence (11 if you added five sentences to the original six) and 2,000 columns.

In [120]:
unknown_words_df.shape

(6, 2000)

### Predicting with our models

To make a prediction for each of our sentences, you can use `.predict` with each of our models. For example, it would look like this for linear regression:

```python
unknown['pred_linreg'] = linreg.predict(unknown_words_df)
```

To add the prediction for logistic regression, you'd run similar `.predict` code, which will give you a `0` (negative) or a `1` (positive). A difference between the two is that for logistic regression, you can **also ask for the probability that the sentence is in the `1` category** instead of just simply the category. To do that, you use this code:

```python
unknown['pred_logreg_prob'] = logreg.predict_proba(unknown_words_df)[:,1]
```
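Here's a small sketch of how a probability and a category relate for a two-class logistic regression: the model computes a weighted sum of the features, squashes it into the 0-1 range, and the category is that probability cut at 0.5. The weights and the two-word vocabulary below are invented for illustration, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba_positive(features, weights, intercept):
    """Probability of category 1: a weighted sum of the features pushed through a sigmoid."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, intercept):
    """Category 0 or 1: the probability cut at 0.5."""
    return int(predict_proba_positive(features, weights, intercept) > 0.5)

# Hypothetical weights for a two-word vocabulary: ["great", "broken"]
weights, intercept = [2.0, -3.0], 0.1
for features in ([1.0, 0.0], [0.0, 1.0]):
    probability = predict_proba_positive(features, weights, intercept)
    print(round(probability, 3), predict(features, weights, intercept))
```

Two sentences with probabilities of 0.51 and 0.99 both come back as category `1`, which is why it can be worth keeping the probability column around.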

**Add new columns for each of the models you trained.** If the model has a `.predict_proba`, add that as a column as well. 

* **Tip:** Tab is helpful for knowing whether `.predict_proba` is an option.
* **Tip:** Don't forget the `[:,1]` after `.predict_proba`, it means "give me the probability for category `1`"

In [122]:
unknown['pred_linreg'] = linreg.predict(unknown_words_df)

In [124]:
unknown['pred_logreg'] = logreg.predict(unknown_words_df)

In [126]:
unknown['pred_logreg_prob'] = logreg.predict_proba(unknown_words_df)[:,1]

In [128]:
unknown['pred_forest'] = forest.predict(unknown_words_df)

In [133]:
unknown['pred_forest_prob'] = forest.predict_proba(unknown_words_df)[:,1]

In [135]:
unknown['pred_svc'] = svc.predict(unknown_words_df)

In [137]:
unknown['pred_bayes'] = bayes.predict(unknown_words_df)

In [141]:
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [142]:
unknown

Unnamed: 0,content,pred_linreg,pred_logreg,pred_logreg_prob,pred_forest,pred_forest_prob,pred_svc,pred_bayes,pred_bayes_proba
0,I'm not sure how I feel about toast,0.379635,0,0.292673,0,0.35,0,0,0.430343
1,Did you see the baseball game yesterday?,0.517577,1,0.527496,1,0.79,0,0,0.487945
2,The package was delivered late and the contents were broken,-0.108664,0,0.000929,0,0.33,0,0,0.143234
3,Trashy television shows are some of my favorites,0.526338,1,0.524955,0,0.43,1,0,0.475704
4,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",0.736005,1,0.862959,0,0.5,1,1,0.642408
5,"I find chirping birds irritating, but I know I'm not the only one",0.175828,0,0.072134,0,0.36,0,0,0.409012


Your output should look something like the below. Check your column names to confirm they match up.

|content|pred_linreg|pred_logreg|pred_logreg_proba|pred_forest|pred_forest_proba|pred_svc|pred_bayes|pred_bayes_proba|
|---|---|---|---|---|---|---|---|---|
|I'm not sure how I feel about toast|0.342560|0|0.271403|0|0.5|0|0|0.425271|
|...|...|...|...|...|...|...|...|...|

# Questions

### Question 4: What do the numbers mean? What's the difference between a 0 and a 1? A 0.5? Negative numbers?

### Question 5: Were there any sentences that the classifiers seemed to disagree about? How do you feel about the amount they disagree? 

### Question 6: What's the difference between using a 0/1 to talk about sentiment compared to 0-1? When might you use one compared to another?


### Question 7: What's the difference between the linear regression model and the other models we're using? Why might it fit or not fit?

### Question 8: Between 0-1, what range do you think counts as "negative," "positive" and "neutral"?

### Question 9: Does the variation in scores reflect the variation you would see among people? Or is it better or worse?

# Maybe we should have tested this?

We can actually see **which model performs the best**. Let's remind ourselves what we have by looking at:

* `X`
* `y`
* `df`

In [145]:
X.head()

Unnamed: 0,00,000,09,10,100,11,12,14,15,16,...,your,youre,yours,yourself,youtube,yr,yummy,yup,zoo,ðµ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [146]:
y.head()

Unnamed: 0,polarity
1523282,1
1256083,1
78489,0
834199,1
99682,0


In [147]:
df.head()

Unnamed: 0,polarity,id,datetime,query,username,content
1523282,1,2176562368,Mon Jun 15 04:11:58 PDT 2009,NO_QUERY,ginavalakuzhy,Ha ha.. Will make a mental note of that! It was actually who figured it out! Smart one she is i say!!
1256083,1,1997401314,Mon Jun 01 16:47:08 PDT 2009,NO_QUERY,KenDahl4U,@Mizphit Just got finished working out. Chillin! Whatchu doin sweetness
78489,0,1751385203,Sat May 09 19:19:36 PDT 2009,NO_QUERY,valeriec24,Can't get Adobe air to load on my desktop so I can't use either Tweetdeck or Seesmic Desktop. &lt;sigh&gt;
834199,1,1557918175,Sun Apr 19 06:34:05 PDT 2009,NO_QUERY,bootifulGal1990,Hello everyone on twitter!!
99682,0,1793702302,Thu May 14 03:41:38 PDT 2009,NO_QUERY,ubuntugeeks,@rossphillips a small thingy or something big


Our original dataframe is a list of many, many tweets. We turned this into `X` - vectorized words - and `y` - whether the tweet is negative or positive.

Before, we used `.fit(X, y)` to train on all of our data. Instead, **we can test our models** by doing a train/test split and seeing if the predictions match the actual labels.

## Create test and training data 

Split your `X` and `y` into train and test datasets. I always have to look up how to do it, so here's the code for you:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
```
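If the split feels like magic, here's roughly what it does, sketched in plain Python: shuffle the row positions, hold some back for testing, and keep each row paired with its label. The function `simple_train_test_split` is hypothetical, and the 25% test share is an assumption based on scikit-learn's default as we understand it.

```python
import random

def simple_train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle row positions, then hold out a fraction of them for testing."""
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)
    n_test = round(len(rows) * test_fraction)
    test_positions = positions[:n_test]
    train_positions = positions[n_test:]

    def pick(values, chosen):
        return [values[i] for i in chosen]

    return (pick(rows, train_positions), pick(rows, test_positions),
            pick(labels, train_positions), pick(labels, test_positions))

rows = [[i] for i in range(8)]       # hypothetical feature rows
labels = [0, 1, 0, 1, 0, 1, 0, 1]    # hypothetical labels
X_tr, X_te, y_tr, y_te = simple_train_test_split(rows, labels)
print(len(X_tr), len(X_te))
```

The models never see the held-out rows while training, so their predictions on those rows are an honest test.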

In [148]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

Use `X_train` and `y_train` to train all your models, except the linear regression one. You should be training:

* `logreg`
* `forest`
* `svc`
* `bayes`

Again, do them each in **separate cells** and use `%%time` to see how long each one takes to learn what's a positive vs negative tweet.

In [149]:
%%time
logreg.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


CPU times: user 54.7 s, sys: 700 ms, total: 55.4 s
Wall time: 29.8 s


LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=2000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [150]:
%%time
forest.fit(X_train, y_train)

  """Entry point for launching an IPython kernel.


CPU times: user 54.8 s, sys: 609 ms, total: 55.5 s
Wall time: 57.2 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [151]:
%%time
svc.fit(X_train, y_train)

CPU times: user 133 ms, sys: 7.17 ms, total: 140 ms
Wall time: 142 ms


  y = column_or_1d(y, warn=True)


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [152]:
%%time
bayes.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


CPU times: user 110 ms, sys: 18.2 ms, total: 128 ms
Wall time: 111 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# Confusion matrices

To see how well they did, we'll use a confusion matrix for each one. For example, here is what you'll use for logistic regression:

```python
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
```

In [153]:
from sklearn.metrics import confusion_matrix

### Logistic Regression

In [154]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,1757,684
Is positive,628,1931


### Random forest

In [155]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,1805,636
Is positive,658,1901


### SVC

In [156]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,1786,655
Is positive,598,1961


### Multinomial Naive Bayes

In [157]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,1817,624
Is positive,629,1930
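Four tables of counts are hard to compare by eye, so here's a small sketch that boils each one down to a few numbers. The counts are copied from the outputs above. The `summarize` function is a hypothetical helper, and "caught" here means the share of actual negatives (or positives) that the model labeled correctly.

```python
# Confusion-matrix counts copied from the outputs above:
# rows are actual [negative, positive], columns are predicted [negative, positive]
matrices = {
    "logreg": [[1757, 684], [628, 1931]],
    "forest": [[1805, 636], [658, 1901]],
    "svc":    [[1786, 655], [598, 1961]],
    "bayes":  [[1817, 624], [629, 1930]],
}

def summarize(matrix):
    """Return overall accuracy plus the share of each actual class predicted correctly."""
    (true_neg, false_pos), (false_neg, true_pos) = matrix
    total = true_neg + false_pos + false_neg + true_pos
    accuracy = (true_neg + true_pos) / total
    negative_recall = true_neg / (true_neg + false_pos)
    positive_recall = true_pos / (false_neg + true_pos)
    return accuracy, negative_recall, positive_recall

for name, matrix in matrices.items():
    accuracy, neg_recall, pos_recall = summarize(matrix)
    print(f"{name:7s} accuracy={accuracy:.4f} "
          f"negatives caught={neg_recall:.4f} positives caught={pos_recall:.4f}")
```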


## Percentage-based confusion matrices

Those are kind of irritating in that they're just raw counts. It might work better to divide each row by its own total, so each row shows what share of the actual negatives (or positives) ended up in each predicted column:

```python
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
# axis=0 lines the row totals up with the rows, so every row sums to 1
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```

### Logistic Regression

In [158]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.719787,0.280213
Is positive,0.245408,0.754592


### Random forest

In [159]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.739451,0.260549
Is positive,0.257132,0.742868


### SVC

In [160]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.731667,0.268333
Is positive,0.233685,0.766315


### Multinomial Naive Bayes

In [161]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.744367,0.255633
Is positive,0.245799,0.754201


### Question 10: Which models performed the best? Were there big differences?

### Question 11: Do you think it's more important to be sensitive to negativity or positivity? Do we want more positive things incorrectly marked as negative, or more negative things marked as positive?

### Question 12: They all had very different training times. Which ones offer the best combination of performance and not making you wait around for an hour?

### Question 13: If you have a decent algorithm that trains more quickly, what could that mean about feature selection or the size of your training set? Why did we use `max_features=` and `df.sample`?

### Question 14: How do you feel about sentiment analysis?

### Question 15: How do you feel about [this piece from the UpShot](https://www.nytimes.com/interactive/2017/02/28/upshot/trump-sounds-different-tone-in-first-address-to-congress.html) that uses [the Emotional Lexicon](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)?

### Question 16: What would you feel comfortable using our sentiment classifier for?