# Hi sklearn!

Diving into machine learning.

## First challenge: [restaurant reviews](https://inclass.kaggle.com/c/restaurant-reviews)
The goal of this competition is to learn to predict whether a restaurant review is positive or negative.

### Reading the restaurant collection

Each line contains a grade and the review text, separated by a tab.

In [1]:
!head -n 3 ./data/1-restaurant-train.csv

4	Thank you thank you thank you !! I  want to thank the people that made this place happen ....you have made all my dreams come true. Imagine a delicious yogurt shop with super fun flavors like peanut butter, chocolate mint, cake batter and so many more.  I used to have to travel to  Yogurtland or Jujuberry  but not anymore , now we have one right in the hood!! Guess what?   instead of eating a normal lunch I can pig out with a healthy peanut butter yogurt smothered in chocolate chips.   Could be the perfect lunch!!  See you there!
5	A Humane Society store at the Biltmore?  Interesting.  I had seen an adorable chihuahua mix at the Humane Society's webpage, and headed over to check out the little dog and the store.  They sell a variety of toys, treats, leashes, collars, and foods- in short, they sell a little bit of everything.  There's no tax, and if you adopt a pet from them they will give you 10% off the price as a thank you.  When I went there were 4 dogs up for adoption, a rabbit,

In [2]:
import codecs
# split every line into (grade, review text) and unzip the pairs into two tuples
with codecs.open('./data/1-restaurant-train.csv') as f:
    labels, reviews = zip(*[line.split('\t') for line in f.readlines()])

In [3]:
labels[:10]

('4', '5', '1', '3', '5', '4', '5', '3', '5', '5')

In [4]:
reviews[:2]

('Thank you thank you thank you !! I  want to thank the people that made this place happen ....you have made all my dreams come true. Imagine a delicious yogurt shop with super fun flavors like peanut butter, chocolate mint, cake batter and so many more.  I used to have to travel to  Yogurtland or Jujuberry  but not anymore , now we have one right in the hood!! Guess what?   instead of eating a normal lunch I can pig out with a healthy peanut butter yogurt smothered in chocolate chips.   Could be the perfect lunch!!  See you there!\n',
 "A Humane Society store at the Biltmore?  Interesting.  I had seen an adorable chihuahua mix at the Humane Society's webpage, and headed over to check out the little dog and the store.  They sell a variety of toys, treats, leashes, collars, and foods- in short, they sell a little bit of everything.  There's no tax, and if you adopt a pet from them they will give you 10% off the price as a thank you.  When I went there were 4 dogs up for adoption, a rabb

### Reading the test dataset

In [5]:
with codecs.open('./data/1-restaurant-test.csv') as f:
    kaggle_test_reviews = f.readlines()

In [6]:
kaggle_test_reviews[:2]

["My son just loves this place.  Weird that he'd ask to come here everytime we go grocery shopping (bribe) and not even care to go to Toys R Us.  Not complaining.   I'm not into little knick knacks, but they have quite a selection on little travel toys, educational materials for kids and holiday stuff.  I bought a couple of red bows with brass jingles on it and wreaths to put on my porch lights for $4!   Why is it that it's so cheap, but you can end up spending $50?\n",
 '"We gave it a 9, so we will make that 5-, 4,5 stars. \\n   To start with it\'s just beautiful and we lucked out to be outside, under the heater next to the roaring fireplace. The service could not have been better and thanks to our YELPing friends, we hardly needed a menu. \\n  The Portugese clam soup was \\souper\\"" though on the salty side. The pork belly was \\""off the hook\\"" and the steak tacos were tops too! We gave the dishes, 9, 9.5 and 9.5 respectivley.\\n  Margaritas were awesome, Smokehouse and Pomograni

### Preparing the solution

Define a helper function that writes predictions to a submission file.

In [7]:
import numpy

In [8]:
import pandas
from IPython.display import FileLink

def create_solution(predictions, filename='1-restaurant-predictions.csv'):
    result = pandas.DataFrame({'Id': numpy.arange(len(predictions)), 'Solution': predictions})
    result.to_csv('data/{}'.format(filename), index=False)
    return FileLink('data/{}'.format(filename))

## Compute simple statistics in the reviews

In [9]:
def compute_data_expressions(reviews):
    features = []
    # length of each review in characters
    features.append([len(review) for review in reviews])
    
    # number of letters, digits, and whitespace characters (a proxy for the number of words)
    for pattern in [str.isalpha, str.isdigit, str.isspace]:
        features.append([sum(map(pattern, review)) for review in reviews])
        
    # rows are reviews, columns are features
    features = numpy.array(features).T
    return features

features = compute_data_expressions(reviews)
kaggle_test_features = compute_data_expressions(kaggle_test_reviews)

# convert labels to int values
labels = [int(label) for label in labels]

# Make the problem simpler: reviews graded 4 or 5 are positive, the rest are negative.
answers = numpy.array(labels) >= 4

In [10]:
answers

array([ True,  True, False, ...,  True, False,  True], dtype=bool)

In [11]:
features

array([[ 536,  411,    0,  107],
       [ 796,  603,    3,  159],
       [ 720,  534,    6,  143],
       ..., 
       [1092,  835,    9,  214],
       [ 574,  437,    2,  116],
       [ 637,  488,    4,  105]])

## Classification quality measure — ROC curve and area under the curve (AUC)
[Explanation of the ROC curve](http://arogozhnikov.github.io/RocCurve.html)

In [12]:
# area under the roc curve
from sklearn.metrics import roc_auc_score
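
As a rough illustration of what this number means, the sketch below computes the AUC on a tiny made-up example in two ways: with `roc_auc_score`, and by hand as the fraction of (positive, negative) pairs in which the positive sample gets the higher score (ties counted as half). The toy arrays `y_true` and `y_score` are assumptions for illustration, not data from the competition.

```python
import numpy
from sklearn.metrics import roc_auc_score

# Toy data (made up for illustration): true classes and predicted scores.
y_true = numpy.array([0, 0, 1, 1])
y_score = numpy.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# The same quantity by hand: compare every positive score with every negative score.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = (numpy.sum(diff > 0) + 0.5 * numpy.sum(diff == 0)) / float(diff.size)

print(auc)           # 0.75
print(pairwise_auc)  # 0.75
```

Because only the ordering of the scores matters, any monotone transformation of the predictions leaves the AUC unchanged.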

## Simple solution

In [13]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(features, answers) # train an algorithm
roc_auc_score(answers, knn_clf.predict_proba(features)[:, 1])

0.99376260608836542

In [14]:
create_solution(knn_clf.predict_proba(kaggle_test_features)[:, 1])

Submit it to Kaggle and check that the score differs _significantly_ from the value above.

## Cross validation / overfitting

In [15]:
from sklearn.model_selection import train_test_split
# hold out part of the labeled data (25% by default) to measure quality on unseen samples
trainX, testX, trainY, testY = train_test_split(features, answers, random_state=42)



## Kaggle challenges 

You'll participate in challenges where Kaggle assesses your model's performance using its predictions on unlabeled data.

This is important practice. The first challenge is [here](https://inclass.kaggle.com/c/restaurant-reviews).

The first Kaggle challenge is due 30 Jan.

## KNN

In [16]:
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(trainX, trainY)
print('test {}'.format(roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])))
print('train {}'.format(roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1])))

test 0.512878045192
train 0.995269637289


The situation above, where quality on the training set is far higher than on the test set, is called overfitting.
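
To see why a 1-nearest-neighbour model looks so good on its own training data, here is a minimal sketch on synthetic data where the labels are pure noise (the arrays and the random seed are made up for illustration). Each training point is its own nearest neighbour, so the model simply returns the label it memorised. In the notebook the train score is slightly below 1, presumably because some reviews share identical feature vectors but have different labels.

```python
import numpy
from sklearn.neighbors import KNeighborsClassifier

rng = numpy.random.RandomState(0)
# Random features and random labels: there is nothing to learn here.
X_train = rng.normal(size=(200, 4))
y_train = rng.randint(0, 2, size=200)
X_test = rng.normal(size=(200, 4))
y_test = rng.randint(0, 2, size=200)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_accuracy = clf.score(X_train, y_train)  # memorised labels
test_accuracy = clf.score(X_test, y_test)     # close to a coin flip

print(train_accuracy)
print(test_accuracy)
```

A perfect train score on noise shows that the train score alone says nothing about how the model behaves on new data.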

## Sidenote: sklearn interface

In [17]:
# In the simplest case, working with a scikit-learn model consists of three steps:
# 1. define the model with its parameters
knn_clf = KNeighborsClassifier(n_neighbors=3)
# 2. train it (method fit; X has shape [n_samples, n_features], target y has shape [n_samples])
knn_clf.fit(trainX, trainY)
# 3. predict (predict_proba returns class probabilities for classification)
knn_clf.predict_proba(testX)

array([[ 0.33333333,  0.66666667],
       [ 0.66666667,  0.33333333],
       [ 0.33333333,  0.66666667],
       ..., 
       [ 1.        ,  0.        ],
       [ 0.66666667,  0.33333333],
       [ 0.33333333,  0.66666667]])

## Finding the optimal number of neighbours

In [18]:
for n_neighbors in [1, 2, 4, 8, 16, 32, 64]:
    knn_clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_clf.fit(trainX, trainY)
    print('{} {}'.format(n_neighbors, roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])))

1 0.512878045192
2 0.523998599836
4 0.53293147584
8 0.541474487911
16 0.552905570616
32 0.56343027432
64 0.572396762402
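
The loop above selects `n_neighbors` on a single train/test split. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the competition data), is to let `GridSearchCV` average the AUC over several cross-validation folds for each candidate value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (features, answers).
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 2, 4, 8, 16, 32, 64]},
    scoring='roc_auc',  # same quality measure as in the loop above
    cv=3,               # 3-fold cross-validation
)
search.fit(X, y)

best_n_neighbors = search.best_params_['n_neighbors']
print(best_n_neighbors)
print(search.best_score_)
```

Averaging over folds makes the choice less dependent on one particular split, at the cost of fitting the model several times per candidate.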


## [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model)

The simplest way to represent a text is to count how many times each word occurs in it.
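
A minimal sketch of the idea on two made-up reviews (the names `toy_reviews`, `toy_vectorizer`, `toy_counts`, and `toy_words` are only for illustration): each column corresponds to one word, and each row holds the word counts for one review.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ['good food good service', 'bad food']
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

# Words ordered by their column index.
toy_words = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_words)   # ['bad', 'food', 'good', 'service']
print(toy_counts)  # [[0 1 2 1]
                   #  [1 1 0 0]]
```

Word order within a review is lost: only the counts remain.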

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
# keep only the 100 most frequent words
vectorizer = CountVectorizer(max_features=100)
vectorizer.fit(reviews)
counts = vectorizer.transform(reviews).toarray()
kaggle_test_counts = vectorizer.transform(kaggle_test_reviews).toarray()

In [20]:
counts.shape

(82065, 100)

In [21]:
counts

array([[0, 0, 1, ..., 0, 5, 0],
       [0, 0, 0, ..., 0, 4, 0],
       [1, 0, 1, ..., 0, 2, 0],
       ..., 
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 1, 0],
       [0, 0, 2, ..., 0, 0, 0]])

In [22]:
# vocabulary_ is a dictionary that maps each word to its column index
# vectorizer.vocabulary_

In [23]:
train_counts, test_counts, train_labels, test_labels = train_test_split(counts, answers, random_state=42)

## Naive Bayes

Naive Bayes will be explained later in the lectures. It is a generative model.

Naive explanation: for each word, we estimate the probability that it appears in a positive review and in a negative review. To assess a new review, these probabilities are combined over the words it contains.
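
The naive explanation can be sketched by hand for the multinomial variant on a tiny made-up count matrix (all arrays below are assumptions for illustration). Per-class word probabilities are estimated with add-one smoothing, combined in log space for a new review, and compared with what `MultinomialNB(alpha=1.0)` returns.

```python
import numpy
from sklearn.naive_bayes import MultinomialNB

# Toy word counts (made up): rows are reviews, columns are words.
X = numpy.array([[2, 0, 1],
                 [1, 0, 0],
                 [0, 3, 1],
                 [0, 1, 0]])
y = numpy.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

# Per-class word probabilities with add-one smoothing, and class priors.
log_word_prob = []
log_prior = []
for cls in [0, 1]:
    word_totals = X[y == cls].sum(axis=0) + 1.0
    log_word_prob.append(numpy.log(word_totals / word_totals.sum()))
    log_prior.append(numpy.log(numpy.mean(y == cls)))
log_word_prob = numpy.array(log_word_prob)
log_prior = numpy.array(log_prior)

# Combine: log p(class) + sum over words of count * log p(word | class).
new_review = numpy.array([[1, 0, 1]])
joint = new_review.dot(log_word_prob.T) + log_prior
manual_proba = numpy.exp(joint - joint.max(axis=1, keepdims=True))
manual_proba /= manual_proba.sum(axis=1, keepdims=True)

sklearn_proba = MultinomialNB(alpha=1.0).fit(X, y).predict_proba(new_review)
print(manual_proba)
print(sklearn_proba)
```

The "naive" part is the assumption that words occur independently given the class, which is what allows the per-word terms to be summed.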

### Bernoulli
Bernoulli naive Bayes uses only the presence of words in the text (whether the count is zero or not).

In [24]:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB()
nb_clf.fit(train_counts, train_labels)
roc_auc_score(test_labels, nb_clf.predict_proba(test_counts)[:, 1])

0.69277446873488746

### Multinomial

Multinomial naive Bayes also takes the word counts into account.

In [25]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(train_counts, train_labels)
roc_auc_score(test_labels, nb_clf.predict_proba(test_counts)[:, 1])

0.73529071403854918

In [26]:
train_counts.shape

(61548, 100)

In [27]:
create_solution(nb_clf.predict_proba(kaggle_test_counts)[:, 1], filename='1-restaurant-predictions-nb.csv')

## Linear regression + Ridge regularization

A linear model is a very simple approximation:
$$\hat{y}_i = \theta_0 + \sum_j \theta_j x_i^j$$
where $x_i^j$ is the value of the $j$-th feature for the $i$-th sample, $\theta_j$ is the parameter to find for the $j$-th feature, and $\hat{y}_i$ is the prediction of the linear model for the $i$-th sample.

We can then introduce a loss function that measures how far our approximation is from the true values. For example:
$$\mathcal{L} = \sum_i (y_i - \hat{y}_i)^2 \to \min$$
(widely known as ordinary least squares).
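
Ridge regularization adds a penalty on the size of the coefficients to this loss. In the notation above, with $\alpha \geq 0$ denoting the regularization strength (the `alpha` parameter of scikit-learn's `Ridge`), the penalized loss can be written as:
$$\mathcal{L}_{\text{ridge}} = \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_{j \geq 1} \theta_j^2 \to \min$$
The intercept $\theta_0$ is not penalized. With $\alpha = 0$ this reduces to ordinary least squares; larger $\alpha$ shrinks the coefficients towards zero.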

In [28]:
from sklearn.linear_model import Ridge

In [29]:
ridge_clf = Ridge()
ridge_clf.fit(train_counts, train_labels)
# for regression models, use the `predict` method to evaluate the fitted function on new data
print(roc_auc_score(test_labels,  ridge_clf.predict(test_counts)))
print(roc_auc_score(train_labels, ridge_clf.predict(train_counts)))

0.787431742938
0.792064902203


# Homework (due 26 Jan, 9 AM)

**Exercise #0. (1 point)** Play with the regularization parameter of `Ridge` (`alpha`) and see how it affects quality on train and test.
Check the quality of the best model by submitting to Kaggle.


In [30]:
# keep the 30000 most frequent words, and use 5000 samples for training
vectorizer_reg = CountVectorizer(max_features=30000)
vectorizer_reg.fit(reviews)
counts_reg = vectorizer_reg.transform(reviews)
train_counts_reg, test_counts_reg, train_labels_reg, test_labels_reg = \
    train_test_split(counts_reg, answers, random_state=42, train_size=5000)

In [31]:
ridge_regularization = Ridge(alpha=0.01)
ridge_regularization.fit(train_counts_reg, train_labels_reg)
print(roc_auc_score(test_labels_reg, ridge_regularization.predict(test_counts_reg)))

0.834342755226


In [32]:
# play with regularization here

**Exercise #1. (1 point)** Let's write out the correspondence between columns and words (done below). Which words are the most popular?

In [33]:
dictionary = numpy.empty(len(vectorizer.vocabulary_), dtype='O')
for word, index in vectorizer.vocabulary_.items():
    dictionary[index] = word

In [34]:
dictionary

array([u'about', u'after', u'all', u'also', u'always', u'an', u'and',
       u'are', u'as', u'at', u'back', u'be', u'because', u'been', u'but',
       u'by', u'can', u'chicken', u'could', u'do', u'don', u'even',
       u'food', u'for', u'from', u'get', u'go', u'good', u'got', u'great',
       u'had', u'has', u'have', u'he', u'here', u'if', u'in', u'is', u'it',
       u'just', u'like', u'little', u'love', u'me', u'menu', u'more',
       u'much', u'my', u'ni', u'nice', u'no', u'not', u'nthe', u'of',
       u'on', u'one', u'only', u'or', u'order', u'ordered', u'other',
       u'our', u'out', u'people', u'place', u'pretty', u'really',
       u'restaurant', u'service', u'she', u'so', u'some', u'than', u'that',
       u'the', u'their', u'them', u'there', u'they', u'this', u'time',
       u'to', u'too', u'try', u'up', u'us', u've', u'very', u'was', u'we',
       u'well', u'were', u'what', u'when', u'which', u'will', u'with',
       u'would', u'you', u'your'], dtype=object)

**Exercise #2. (1 point)** By analyzing the coefficients in `ridge_clf.coef_`, determine which words have the highest impact on the decision (i.e. have the largest absolute value of `coef_`).

**Exercise #3. (2 points)** Does combining `features` and `counts` improve quality? Use `numpy.hstack` to concatenate arrays.
Explain the result.
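
As a reminder of how horizontal stacking behaves, here is a small sketch on made-up arrays: both inputs must have the same number of rows, and the columns are placed side by side. The sparse variant with `scipy.sparse.hstack` is included as an assumption about what you might need if you keep the counts as a sparse matrix.

```python
import numpy
from scipy import sparse

# Two toy feature blocks for the same 3 samples.
dense_a = numpy.array([[1, 2], [3, 4], [5, 6]])
dense_b = numpy.array([[10], [20], [30]])

dense_joined = numpy.hstack([dense_a, dense_b])
print(dense_joined.shape)  # (3, 3)

# Sparse matrices are stacked with scipy.sparse.hstack instead of numpy.hstack.
sparse_joined = sparse.hstack([sparse.csr_matrix(dense_a),
                               sparse.csr_matrix(dense_b)]).tocsr()
print(sparse_joined.shape)  # (3, 3)
```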

**Exercise #4. (2 points)** Print examples on which your classifier makes mistakes (both false positives and false negatives).

This is an important step towards understanding what can be done to improve the classifier.

**Exercise #5. (optional, just for fun)** Write a restaurant review that is misclassified by your best model,
something like "Hate each time I'm not eating here".

Use your knowledge about the structure of the model.

**Major goal in the Kaggle competition (not in this homework).** Provide the best classification model for the problem.

You can start by computing new features:
1. Count occurrences of symbols (like "!", ":)", etc.).
2. Ignore stop words, words with digits, etc.
3. Add more words to the bag-of-words model.
4. The words "likes", "liked", "like", and "likely" all contain "like". You can use this information!
5. Test your ideas.

Or start by changing the parameters of the classifiers.

## Completed? 


Rename the notebook to `1.2-Surname-sklearn.ipynb`, download (`File > Download as .ipynb`) and send to `icl.ml@yandex.ru`