# Week 3 - Lab - Logistic Regression

## Recap
![Imgur](https://i.imgur.com/UB4Kg0w.jpg)
[Link to sketchboard](https://sketchboard.me/tBjCwrhXsFwu)

## 1. Confusion Matrices

https://towardsdatascience.com/taking-the-confusion-out-of-confusion-matrices-c1ce054b3d3e

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

* We used 2 models to predict whether a woman is pregnant and saved the results in the file "pregnant.csv".
* The pregnant dataset has 3 columns:
    * Actual is the actual result (ground truth)
    * model_1_predict and model_2_predict are the predictions from model 1 and model 2

**Now we have to answer the question: which model is better?**

First of all, let's load the dataset and print it out.

In [49]:
df_pregnant = pd.read_csv('./pregnant.csv')
df_pregnant.head()

Unnamed: 0,Actual,model_1_predict,model_2_predict
0,0,0,0
1,1,0,1
2,0,0,0
3,0,0,0
4,0,0,0


### Calculate True Positive, False Positive, True Negative, False Negative
![Alt](https://cdn-images-1.medium.com/max/800/1*g5zpskPaxO8uSl0OWT4NTQ.png)
Note:
* Positive: 1 (Pregnant)
* Negative: 0 (Not Pregnant)

True Positive: the model predicts **Positive** and the prediction is **True** (correct)

In [55]:
# Predict Positive
positive = df_pregnant['model_1_predict'] == 1
# It's True (correct prediction)
correct_predict = df_pregnant['model_1_predict'] == df_pregnant['Actual']

TP = sum(positive & correct_predict)
TP

40

False Positive: the model predicts **Positive** and the prediction is **False** (incorrect)

In [56]:
# Predict Positive
positive = df_pregnant['model_1_predict'] == 1
# It's False (incorrect prediction)
incorrect_predict = df_pregnant['model_1_predict'] != df_pregnant['Actual']

FP = sum(positive & incorrect_predict)
FP

65

True Negative: the model predicts **Negative** and the prediction is **True** (correct)

In [52]:
# Predict Negative
negative = df_pregnant['model_1_predict'] == 0
# It's True (correct prediction)
correct_predict = df_pregnant['model_1_predict'] == df_pregnant['Actual']

TN = sum(negative & correct_predict)
TN

385

False Negative: the model predicts **Negative** and the prediction is **False** (incorrect)

In [54]:
# Predict Negative
negative = df_pregnant['model_1_predict'] == 0
# It's False (incorrect prediction)
incorrect_predict = df_pregnant['model_1_predict'] != df_pregnant['Actual']

FN = sum(negative & incorrect_predict)
FN

10

### Accuracy vs Precision vs Recall

![Alt img](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/440px-Precisionrecall.svg.png)

- Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
- Precision (true positives / predicted positives) = TP / (TP + FP)
- Sensitivity aka Recall (true positives / all actual positives) = TP / (TP + FN)

In [58]:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
print("Accuracy : ", Accuracy)
print("Precision: ", Precision)
print("Recall   : ", Recall)

Accuracy :  0.85
Precision:  0.38095238095238093
Recall   :  0.8


### Thanks to sklearn for making our lives easier with classification_report and confusion_matrix

In [59]:
from sklearn.metrics import confusion_matrix, classification_report

In [60]:
confusion = confusion_matrix(df_pregnant['Actual'], df_pregnant['model_2_predict'])
report = classification_report(df_pregnant['Actual'], df_pregnant['model_2_predict'])
print(confusion)
print(report)

[[395  55]
 [  5  45]]
             precision    recall  f1-score   support

          0       0.99      0.88      0.93       450
          1       0.45      0.90      0.60        50

avg / total       0.93      0.88      0.90       500
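
To answer which model is better, the counts above can be turned into the same metrics for both models. The sketch below plugs in the counts reported in this notebook: model 1 from the manual calculation, model 2 from the confusion matrix (rows are actual classes, columns are predicted classes). The helper name `summarize` is only illustrative. It also computes the F1 score, the harmonic mean of precision and recall, which `classification_report` prints.

```python
def summarize(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Counts taken from the outputs shown in this notebook
model_1 = summarize(tp=40, fp=65, tn=385, fn=10)
model_2 = summarize(tp=45, fp=55, tn=395, fn=5)

for name, scores in [("model 1", model_1), ("model 2", model_2)]:
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

With these counts, model 2 scores higher on all four metrics, so it is the better of the two on this dataset.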



## 2. Sentiment analysis

This task is taken from a real text-processing problem.

The task is to build a model that determines the tone (positive or negative) of a text. To do this, you will train the model on the existing data (train.csv). The resulting model will then have to determine the class (positive or negative) of new texts. The dataset contains the following fields:

| Field name | Meaning |
|------------|-----------|
| ItemID  | id of the tweet |
| Sentiment | sentiment (1 = positive, 0 = negative) |
| SentimentText | text of the tweet |

First, let's have a look at the data.

In [61]:
# Import pandas, numpy and the dataset, and save it in an object called 'sentiment'
# Your code here
import numpy as np
import pandas as pd
sentiment = pd.read_csv('data/train.csv', encoding='latin-1')

# Let's check sentiment.head(10) and sample(10)
# Your code here
sentiment.sample(10)

Unnamed: 0,ItemID,Sentiment,SentimentText
472,473,0,all I want is my ma in my arms that's all I ...
7443,7446,1,#whocangetit @THE_REAL_SHAQ I luv his sexy ma...
83322,83334,1,@asianpenmaster Mia! what's your ym?
23958,23970,0,@1op I've been good... how bout urself??? I se...
91797,91809,1,@bananaxxjana can you please become a burning ...
76022,76034,1,@carryfiasco my standing ticket is being sent ...
82062,82074,1,"@chelciscool you think so? well, good for you?..."
10858,10870,1,"&quot;Women form queue, when you got as much n..."
42104,42116,0,"@Andy_K it was supposed to get up to 400kph, b..."
3106,3107,0,I wish I could forget ...


As we can see, the structure varies a lot from tweet to tweet. They have different lengths and contain letters, numbers, strange characters, etc.

It is also important to note that **a lot** of words are not spelled correctly, for example the word _"Juuuuuuuuuuuuuuuuussssst"_ or the word _"sooo"_.

This makes it hard to measure how positive or negative the words within the tweets are.

So we need a way of scoring the words such that words that appear in positive tweets get a greater score than those that appear in negative tweets.

But first... how do we represent the tweets as vectors we can input to our algorithm?

### Bag of words

One thing we could do to represent the tweets as equal-sized vectors of numbers is the following:

* Create a list (vocabulary) with all the unique words in the whole corpus of tweets.
* Construct a feature vector for each tweet that contains the counts of how often each word occurs in that particular tweet.

_Note that since the unique words in each tweet represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros._

Let's construct the bag of words. We will work with a smaller example for illustrative purposes, and at the end we will work with our real data.

In [62]:
twits = [
    'This is amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...'
]

Let's import [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). It will help us convert a collection of text documents into a matrix of token counts.

In [63]:
# Your code here
from sklearn.feature_extraction.text import CountVectorizer

# Define a CountVectorizer() object, then fit and transform your twits into a 'bag'
# Your code here
count = CountVectorizer()
bag = count.fit_transform(twits)

In [64]:
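# Find in the CountVectorizer documentation a function that shows us the list of feature names
# Your code here
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions use count.get_feature_names_out() instead
count.get_feature_names()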

['about',
 'am',
 'amazing',
 'best',
 'end',
 'going',
 'how',
 'is',
 'it',
 'ml',
 'not',
 'sure',
 'the',
 'this',
 'to',
 'yes']

As we can see from executing the preceding command, the feature names are returned as a Python list in which each unique word sits at its integer index (the word-to-index mapping itself is stored in the dictionary `count.vocabulary_`). Next, let's print the feature vectors that we just created:

In [65]:
# Call toarray() on your 'bag' to see the feature vectors
# Your code here
bag.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]], dtype=int64)

In [None]:
# What is the index of the word 'is', and how many times does it occur in all three twits?
# Your answer here
# Index 7, 4 times

Each index position in the feature vectors corresponds to the integer value stored for a word in the CountVectorizer vocabulary. For example, the first feature at index position 0 represents the count of the word 'about', which only occurs in the last document. These values in the feature vectors are also called the **raw term frequencies**: `tf(t,d)`, the number of times a term `t` occurs in a document `d`.
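
To make the mechanics concrete, here is a minimal pure-Python sketch of roughly what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the unique tokens into a vocabulary, and count. The helper name `tokenize` is hypothetical, and the sketch ignores the many options the real class offers.

```python
import re
from collections import Counter

twits = [
    'This is amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...'
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: sorted unique tokens mapped to integer indices
unique_words = sorted({word for twit in twits for word in tokenize(twit)})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# One count vector per twit, ordered by vocabulary index
vectors = []
for twit in twits:
    counts = Counter(tokenize(twit))
    vectors.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary['is'])
print(vectors[1])
```

Note that the single-character token 'I' is dropped by the token pattern, which is why it does not appear in the feature names.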


### How relevant are words? Term frequency-inverse document frequency

We could use these raw term frequencies to score the words in our algorithm. There is a problem though: if a word is very frequent in _all_ documents, then it probably doesn't carry a lot of information. To tackle this problem we can use **term frequency-inverse document frequency** (tf-idf), which reduces a word's score the more frequently it occurs across all tweets. It is calculated like this:

\begin{equation*}
\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)
\end{equation*}

_tf(t,d)_ is the raw term frequency described above. _idf(t,d)_ is the inverse document frequency, which can be calculated as follows:

\begin{equation*}
\text{idf}(t,d) = \log \frac{n_d}{1+\text{df}(t,d)}
\end{equation*}

where `n_d` is the total number of documents and _df(t,d)_ is the number of documents in which the term `t` appears.

The `1` added to the denominator avoids a division by zero for terms that appear in no document, and the `log` ensures that low document frequencies don't get too much weight. With its default settings, scikit-learn uses a slightly different, smoothed formula, `idf(t) = ln((1 + n_d) / (1 + df(t))) + 1`, so that terms that appear in all documents are not entirely ignored, and then normalizes each feature vector to unit length.

Fortunately for us `scikit-learn` does all those calculations for us:

In [67]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()

# Show numbers with 2 digits after the decimal point in this notebook
np.set_printoptions(precision=2)

# Feed the tf-idf transformer with our previously created Bag of Words using fit_transform()
# Your code here
tfidf_bag = tfidf.fit_transform(bag)

In [68]:
# Now what is the weight of the word 'is' and 'amazing'?
# Your answer here
# In the first twit: 'amazing': 0.72, 'is': 0.43
tfidf_bag.toarray()

array([[0.  , 0.  , 0.72, 0.  , 0.  , 0.  , 0.  , 0.43, 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.55, 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.4 , 0.  , 0.  , 0.  , 0.47, 0.4 , 0.4 , 0.  ,
        0.  , 0.4 , 0.  , 0.  , 0.4 ],
       [0.33, 0.33, 0.  , 0.  , 0.33, 0.33, 0.33, 0.2 , 0.  , 0.  , 0.33,
        0.33, 0.  , 0.25, 0.33, 0.  ]])
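
The first row of this output can be reproduced by hand. The sketch below assumes the default `TfidfTransformer` settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and uses the counts for the first twit, 'This is amazing!'. The helper name `smoothed_idf` is only illustrative.

```python
import math

# Raw counts for the first twit, 'This is amazing!'
tf = {'this': 1, 'is': 1, 'amazing': 1}
# Number of twits (out of 3) that contain each word
df = {'this': 2, 'is': 3, 'amazing': 1}
n_docs = 3

def smoothed_idf(doc_freq, n):
    # Smoothed idf: the "+ 1" keeps words that occur in every document
    return math.log((1 + n) / (1 + doc_freq)) + 1

raw = {word: tf[word] * smoothed_idf(df[word], n_docs) for word in tf}

# L2-normalize so the feature vector has unit length
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {word: value / norm for word, value in raw.items()}

print({word: round(value, 2) for word, value in weights.items()})
```

The word 'is' appears in all three twits, so it gets the lowest idf and therefore the lowest weight of the three words, while 'amazing' appears only once in the corpus and gets the highest.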

## String manipulation in Python

One place where the Python language really shines is in the manipulation of strings. This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of regular expressions. Such string manipulation patterns come up often in the context of data science work.

### Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string. Here we'll look at the `upper()`, `lower()`, `capitalize()`, and `swapcase()` methods.

In [69]:
fox = 'tHe qUICk bROWn fOx.'
# Apply the functions above to `fox` and print out the results
# Your code here
print(fox.upper())
print(fox.lower())
print(fox.capitalize())
print(fox.swapcase())

THE QUICK BROWN FOX.
the quick brown fox.
The quick brown fox.
ThE QuicK BrowN FoX.


### Adding and removing spaces

Another common need is to remove spaces (or other characters) from the beginning or end of a string. The basic method for this is `strip()`, which strips whitespace from both the beginning and the end of the string. To remove whitespace only on the right or only on the left, use `rstrip()` or `lstrip()` respectively.

To remove characters other than whitespace, you can pass the desired characters to the `strip()` method.

In [71]:
line = '         this is the content         '

# Apply strip(), rstrip(), lstrip() to 'line' and print the results out
# Your code here
line.strip()

num = '00000000435'
# Remove the leading zeros from num (strip('0') only trims zeros at the two ends)
# Your code here
num.strip('0')

'435'
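
A small sketch of the three variants side by side, using `repr()` so the remaining whitespace is visible. It also shows that `strip('0')` trims both ends, while `lstrip('0')` only trims leading zeros. The variable `padded` is just an illustrative value.

```python
line = '         this is the content         '

print(repr(line.strip()))   # whitespace removed on both sides
print(repr(line.rstrip()))  # only trailing whitespace removed
print(repr(line.lstrip()))  # only leading whitespace removed

num = '00000000435'
padded = '0043500'
print(num.lstrip('0'))      # leading zeros removed
print(padded.strip('0'))    # zeros removed on both ends
print(padded.lstrip('0'))   # trailing zeros are kept
```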

### Finding, replacing, and splitting

If you want to find occurrences of a certain character in a string, the `find()`, `index()` and `replace()` methods are the best built-in methods.

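`find()` and `index()` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring. The only difference is what happens when the substring is not found: `find()` returns -1, while `index()` raises a `ValueError`.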

The `split()` method is perhaps more useful; it finds all instances of the split-point and returns the substrings in between. The default is to split on any whitespace, returning a list of the individual words in a string.

In [None]:
line = 'the quick brown fox jumped over a lazy dog'

# Find the index of 'fox' in 'line' using find() and index()
# Your code here

# Let's replace 'brown' with 'red'
# Your code here

# Split 'line' into a list of its words
# Your 1 line of code here
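
One possible way to complete the exercise above is sketched here; the variable names are only suggestions.

```python
line = 'the quick brown fox jumped over a lazy dog'

# find() and index() agree when the substring exists
fox_find = line.find('fox')
fox_index = line.index('fox')
print(fox_find, fox_index)

# They differ when it does not: find() returns -1, index() raises ValueError
missing = line.find('bear')
try:
    line.index('bear')
except ValueError as err:
    print('index() raised:', err)

# replace() returns a new string; the original is unchanged
red_line = line.replace('brown', 'red')
print(red_line)

# split() with no argument splits on any whitespace
words = line.split()
print(words)
```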

Note that if you would like to undo a `split()`, you can use the `join()` method, which returns a string built from a splitpoint and an iterable:

In [8]:
'--'.join(['1', '2', '3'])

'1--2--3'

A common pattern is to use the special character "\n" (newline) to join together lines:

In [9]:
print("\n".join(['Rules in family:', '1. Your wife is always right.', '2. If she is wrong, check the first rule again.']))

Rules in family:
1. Your wife is always right.
2. If she is wrong, check the first rule again.


### Regular Expression

In Python, regular expressions are supported by the `re` module.

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Let's walk through some examples:

In [77]:
import re

In [4]:
# Use a raw string so that backslashes reach the regex engine unchanged
email_regex = r'\w+@\w+\.[a-z]{2}'
text = "To email Hai Minh, try minhdh@coderschool.vn or the older address haiminh101@yahoo.vn"
re.findall(email_regex, text)

['minhdh@coderschool.vn', 'haiminh101@yahoo.vn']

In [5]:
# Replacing these email addresses with another string, perhaps to hide addresses in the output:
re.sub(email_regex, '--@--.--', text)

'To email Hai Minh, try --@--.-- or the older address --@--.--'

In [6]:
# The following will match any lower-case vowel:
re.split('[aeiou]', 'consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

You may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [7]:
re.findall('[A-Z][0-9]', '1043879, G2, H6')

['G2', 'H6']

The following tables list a few special characters and quantifiers that are commonly useful:

| Character | Description | Character | Description |
|------------|-----------|------------|-----------|
| "\d" | Match any digit   | "\D" | Match any non-digit|
| "\s" | Match any whitespace   | "\S" | Match any non-whitespace|
| "\w" | Match any alphanumeric char  | "\W" | Match any non-alphanumeric char|

| Character | Description | Example |
|------------|-----------|------------|
| ? | Match zero or one repetitions of preceding |  "ab?" matches "a" or "ab" |
| * | Match zero or more repetitions of preceding | "ab*" matches "a", "ab", "abb", "abbb"... |
| + | Match one or more repetitions of preceding |  "ab+" matches "ab", "abb", "abbb"... but not "a" |
| {n} | Match n repetitions of preceding | "ab{2}" matches "abb" |
| {m,n} | Match between m and n repetitions of preceding |  "ab{2,3}" matches "abb" or "abbb" |
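
A short sketch that exercises several entries from the two tables. The sample text is made up for illustration.

```python
import re

text = 'Order A7 shipped 12 boxes on 2018-03-05'

# \d+ : one or more digits
numbers = re.findall(r'\d+', text)
# \d{4}-\d{2}-\d{2} : exactly 4, 2 and 2 digits separated by dashes
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
# \s+ : split on any run of whitespace
parts = re.split(r'\s+', 'a  b \t c')
# ? allows zero or one 'b', while + requires at least one 'b'
optional_b = re.fullmatch(r'ab?', 'a') is not None
required_b = re.fullmatch(r'ab+', 'a') is not None

print(numbers)
print(dates)
print(parts)
print(optional_b, required_b)
```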

### Further Resources on Regular Expressions

* [Python's re package Documentation](https://docs.python.org/3/library/re.html)
* [Python's official regular expression HOWTO](https://docs.python.org/3/howto/regex.html)
* [Mastering Regular Expressions (O'Reilly, 2006)](http://shop.oreilly.com/product/9780596528126.do)

## Data cleanup

### Removing stop words

Now that we know how to format and score our input, let's look at our **real** vocabulary. Specifically, the most common words:

In [72]:
from collections import Counter

# Example
count = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    count[word] += 1
print(count)
print(count.most_common(2))

Counter({'blue': 3, 'red': 2, 'green': 1})
[('blue', 3), ('red', 2)]


In [73]:
vocab = Counter()

# Let's apply the example above to count words in our SentimentText
# Your code here
for twit in sentiment.SentimentText:
    for word in twit.split(' '):
        vocab[word] += 1

vocab.most_common(20)

[('', 123916),
 ('I', 32879),
 ('to', 28810),
 ('the', 28087),
 ('a', 21321),
 ('you', 21180),
 ('i', 15995),
 ('and', 14565),
 ('it', 12818),
 ('my', 12385),
 ('for', 12149),
 ('in', 11199),
 ('is', 11185),
 ('of', 10326),
 ('that', 9181),
 ('on', 9020),
 ('have', 8991),
 ('me', 8255),
 ('so', 7612),
 ('but', 7220)]

As you can see, the most common words are meaningless in terms of sentiment: _I, to, the, and_... they don't give any information on positiveness or negativeness. They're basically **noise** that can most probably be eliminated. These kinds of words are called _stop words_, and it is common practice to remove them when doing text analysis.

In [74]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/dks/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [75]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

vocab_reduced = Counter()
# Go through all of the items of vocab using vocab.items() and pick only words that are not in 'stop' 
# and save them in vocab_reduced
# Your code here
for w, c in vocab.items():
    if w not in stop:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)

[('', 123916),
 ('I', 32879),
 ("I'm", 6416),
 ('like', 5086),
 ('-', 4922),
 ('get', 4864),
 ('u', 4194),
 ('good', 3953),
 ('love', 3494),
 ('know', 3472),
 ('go', 2990),
 ('see', 2868),
 ('one', 2787),
 ('got', 2774),
 ('think', 2613),
 ('&amp;', 2556),
 ('lol', 2419),
 ('going', 2396),
 ('really', 2287),
 ('im', 2200)]

This looks better: within the 20 most common words we already see words that carry sentiment, such as good, love and really.
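
Two entries at the top of the list are still noise: the empty string, which appears because `split(' ')` returns empty strings for consecutive spaces, and 'I', which survives because the NLTK stop word list is lowercase. Below is a minimal sketch of one way to handle both. It uses a tiny hypothetical stop word set and two made-up twits instead of the real data, so that it runs on its own.

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK stop word list and the real twits
stop_sample = {'i', 'to', 'the', 'a'}
twits_sample = ['I  love the  beach', 'i want to go  home']

vocab_clean = Counter()
for twit in twits_sample:
    # lower() lets 'I' match the lowercase stop word 'i';
    # split() with no argument collapses runs of whitespace, so no '' tokens
    for word in twit.lower().split():
        if word not in stop_sample:
            vocab_clean[word] += 1

print(vocab_clean.most_common())
```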

### Removing special characters and "trash"

If you look closer, you'll see that we're also counting punctuation marks ('-', ',', etc.) and HTML entities like `&amp;`. We can definitely remove them for the sentiment analysis, but we will try to keep the emoticons, since those _do_ carry sentiment:

In [78]:
def preprocessor(text):
    """ Return a cleaned version of text
    """
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace runs of non-word characters with a space, convert to lower case,
    # and append the emoticons without the nose character ('-') for standardization
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))

    return text

# Create some random texts for testing the function preprocessor()
print(preprocessor(''))
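
A slightly more telling test is sketched below. The function is repeated so the block runs on its own, and the sample string is made up: it contains an HTML tag, punctuation, and three emoticons, one of them with a nose.

```python
import re

def preprocessor(text):
    """Return a cleaned version of text (same logic as above)."""
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '
            + ' '.join(emoticons).replace('-', ''))
    return text

sample = '</a>This :) is :( a test :-)!'
cleaned = preprocessor(sample)

# split() is used here only to make the resulting tokens easy to read
print(cleaned.split())
```

The tag and punctuation are gone, the words are lowercased, and the emoticons are moved to the end with the nose removed.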

 


We are almost ready! There is another trick we can use to reduce our vocabulary and consolidate words. If you think about it, words like love, loving, etc. _could_ express the same positivity. If that were the case, we would have two words in our vocabulary when we could have only one: love. This process of reducing a word to its root is called **stemming**.

We also need a _tokenizer_ to break down our tweets into individual words. We will implement two tokenizers, a regular one and one that does stemming:

In [79]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Write a function called `tokenizer()` that splits a text into a list of words
# Your code here
def tokenizer(text):
    return text.split()

# Write a function named `tokenizer_porter()` that splits a text into a list of words and applies stemming
# Hint: porter.stem(word)
# Your code here
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# Testing
print(tokenizer('Hi there, I am loving this, like with a lot of love'))
print(tokenizer_porter('Hi there, I am loving this, like with a lot of love'))

['Hi', 'there,', 'I', 'am', 'loving', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']
['Hi', 'there,', 'I', 'am', 'love', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']


## Training Logistic Regression

We are finally ready to train our algorithm.

In [80]:
# Split the dataset into train and test sets
# Your code here
from sklearn.model_selection import train_test_split

X = sentiment['SentimentText']
y = sentiment['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [81]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

# A pipeline chains several steps together once the initial exploration is done.
# Some steps transform features (normalising numerical values, turning text into vectors,
# or filling in missing data); these are transformers. Other steps predict variables by
# fitting an algorithm; these are estimators. Pipeline chains them all together so they
# can be applied to the training data in one call.
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<function preproc...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [82]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Now apply the metrics above to evaluate your model
# Your code here
predictions = clf.predict(X_test)
print('accuracy:',accuracy_score(y_test,predictions))
print('confusion matrix:\n',confusion_matrix(y_test,predictions))
print('classification report:\n',classification_report(y_test,predictions))

accuracy: 0.7521085441877521
confusion matrix:
 [[ 8781  4370]
 [ 3066 13780]]
classification report:
              precision    recall  f1-score   support

          0       0.74      0.67      0.70     13151
          1       0.76      0.82      0.79     16846

avg / total       0.75      0.75      0.75     29997



Finally, let's run some tests :-)

In [85]:
twits = [
    "This is really bad",
    "I love this!",
    ":)",
]

preds = clf.predict_proba(twits)

for i in range(len(twits)):
    print(f'{twits[i]} --> Negative, Positive = {preds[i]}')

This is really bad --> Negative, Positive = [0.96 0.04]
I love this! --> Negative, Positive = [0.08 0.92]
:) --> Negative, Positive = [0.38 0.62]
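
Each row of `predict_proba` lists the probabilities in the order of the classes (0 = negative first, 1 = positive second). The sketch below shows how such rows can be turned into labels with an adjustable threshold. The probabilities are copied from the output above, and the helper name `to_label` is only illustrative.

```python
# Probabilities copied from the output above: [P(negative), P(positive)]
preds_sample = [[0.96, 0.04], [0.08, 0.92], [0.38, 0.62]]
labels = ['negative', 'positive']

def to_label(probabilities, threshold=0.5):
    # Predict positive when P(positive) reaches the threshold
    return labels[1] if probabilities[1] >= threshold else labels[0]

default_labels = [to_label(p) for p in preds_sample]
strict_labels = [to_label(p, threshold=0.7) for p in preds_sample]

print(default_labels)
print(strict_labels)
```

Raising the threshold trades recall for precision on the positive class: with 0.7 the lone ":)" is no longer confident enough to count as positive.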


If we would like to use the classifier somewhere else, or just avoid training it again every time, we can save the model in a pickle file:

In [84]:
import pickle
import os

# Use a context manager so the file is closed once the model is written
with open(os.path.join('data', 'logisticRegression.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)
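
Loading the model back works the same way in reverse, with `pickle.load`. The sketch below uses a plain dictionary as a stand-in for the trained pipeline and a temporary directory instead of `data/`, so that it runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the trained pipeline so the example is self-contained
model_stand_in = {'name': 'logisticRegression', 'classes': [0, 1]}

path = os.path.join(tempfile.mkdtemp(), 'logisticRegression.pkl')

# Save
with open(path, 'wb') as f:
    pickle.dump(model_stand_in, f, protocol=4)

# Load
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_stand_in)
```

When loading the real pipeline, `preprocessor` and `tokenizer_porter` must be defined or importable in the loading session, because pickle stores functions by reference rather than by value. Only unpickle files from sources you trust.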

## And you're done! I hope you liked this!