#Outline

In this colab, we study how to handle **large-scale datasets** in sklearn.  

* So far in this course, we could load the **entire dataset into memory** and train and make predictions on all of it at once.

* Large-scale datasets may not fit in memory, so we need strategies for handling them during both training and prediction.

In this colab, we will discuss the following topics:
- Overview of handling large-scale data
- Incremental preprocessing and learning.
  -  `fit()` vs. `partial_fit()`: `partial_fit` is our friend in this case.
- Combining preprocessing and incremental learning.

# **Large-scale Machine Learning**

Large-scale machine learning differs from traditional machine learning in that it involves processing large amounts of data, whether measured by **size**, **number of samples**, **features**, or **classes**.

There have been many exciting developments in efficient large-scale learning for real-world use cases over the last decade.

Although scikit-learn is optimized for **smaller data**, it does offer a decent set of **feature preprocessing** and **learning algorithms** (for classification, regression, and clustering) that work on large-scale data.

Scikit-learn handles large data through the `partial_fit()` method, used in place of the usual `fit()` method.
> The idea is to process data in **batches** and **update** the model parameters for each batch.  This way of learning is referred to as '**Incremental (or out-of-core) learning**'.

##Incremental Learning

Incremental learning may be required in the following two scenarios:

* For **out-of-memory (large) datasets**, where it’s not possible to **load the entire data into the RAM** at once, one can load the data in chunks and fit the training model for each chunk of data.

* For machine learning tasks where a new batch of data comes with time, re-training the model with the previous and new batch of data is a computationally expensive process. 
> Instead of re-training the model with the entire set of data, one can employ an incremental learning approach, where the model parameters are updated with the new batch of data.


###Incremental Learning in `sklearn`

To perform incremental learning, scikit-learn provides the **`partial_fit`** method, which can train on a dataset that does not fit in memory. In other words, it learns incrementally from one batch of instances at a time.

In this colab, we will see an example of how to read, process, and train on such a large dataset that can't be loaded in memory entirely. 

`partial_fit` is expected to be called several times in a row on different chunks of a dataset so as to implement out-of-core (online) learning. Each call has some performance overhead, so it is recommended to call it on chunks that are as large as possible while still fitting in memory; this spreads the overhead over more samples.

### partial_fit() parameters:

`partial_fit(X, y, [classes], [sample_weight])`

where,

* `X` : array of shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.

* `y` : array of shape (n_samples,) of target values.

* `classes` : array of shape (n_classes,) containing all the classes that can possibly appear in the `y` vector. It must be provided at the first call to `partial_fit` and can be omitted in subsequent calls.

* `sample_weight` : (optional) array of shape (n_samples,) containing weights applied to individual samples (1. for unweighted).

Returns: object (self)

For classification tasks, we have to pass the list of all possible target class labels in the `classes` parameter, so that the model can cope with target classes that do not appear in the first batch of data.
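The role of `classes` can be illustrated with a small sketch. The toy data below is an assumption made up for this illustration: the first batch contains only labels 0 and 1, while label 2 shows up only in the second batch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Toy data (assumption): three classes overall, but the first batch
# happens to contain only labels 0 and 1.
X_first = rng.normal(size=(20, 4))
y_first = np.array([0, 1] * 10)
X_second = rng.normal(size=(20, 4))
y_second = np.array([0, 1, 2, 2] * 5)

clf = SGDClassifier(random_state=0)

# Declaring every label up front lets the model allocate one weight
# vector per class, including the class it has not seen yet.
clf.partial_fit(X_first, y_first, classes=np.array([0, 1, 2]))
print(clf.classes_)     # [0 1 2]
print(clf.coef_.shape)  # (3, 4)

# Later batches may contain label 2; `classes` can now be omitted.
clf.partial_fit(X_second, y_second)
```

Omitting `classes` on the very first call raises a `ValueError`, which is why the label set has to be known before training starts.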

Estimators that implement the `partial_fit` method include:
* **Classification:** 
  * `MultinomialNB`
  * `BernoulliNB`
  * `SGDClassifier`
  * `Perceptron`

* **Regression:** 
  * `SGDRegressor`

* **Clustering:** 
  * `MiniBatchKMeans`


`SGDRegressor` and `SGDClassifier` are commonly used for handling large data. 

The problem with standard regression/classification implementations such as batch gradient descent, support vector machines (SVMs), and random forests is that they need to load all the data into memory at once, so they cannot be used when memory is insufficient. SGD, however, can handle large datasets effectively by breaking the data into chunks and processing them sequentially. Because only one chunk needs to be in memory at a time, it is useful for large-scale data as well as for data that arrives as a stream at intervals.
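As a rough sketch of this chunk-by-chunk idea for regression, the snippet below feeds synthetic data to `SGDRegressor` in slices. The data-generating formula, the chunk size of 1000, and the variable names are assumptions chosen for illustration; in a real out-of-core setting each chunk would come from disk rather than from an in-memory array.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)

# Synthetic data (assumption): y = 3*x0 - 2*x1 + small noise
X = rng.normal(size=(20000, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=20000)

reg = SGDRegressor(random_state=42)

# Process the data 1000 rows at a time; the model only ever sees
# one chunk per call.
for start in range(0, X.shape[0], 1000):
    X_chunk = X[start:start + 1000]
    y_chunk = y[start:start + 1000]
    reg.partial_fit(X_chunk, y_chunk)

# The learned coefficients should end up close to the true [3, -2].
print(reg.coef_)
```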

#**fit() versus partial_fit()**

Below, we show the use of `partial_fit()` along with `SGDClassifier` on sample data.

For illustration, we first use the traditional `fit()` method and then use `partial_fit()` on the same data.

In [None]:
# Importing Libraries
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

##1. Traditional Approach (using `fit()`)

###Sample dataset

We will use a synthetic classification dataset for demonstration. 

Let us have 50000 samples with 10 features in the feature matrix. Further, let's have 3 classes in the target label, each class having a single cluster.

In [None]:
x, y = make_classification(n_samples=50000, n_features=10, 
                           n_classes=3, 
                           n_clusters_per_class=1)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.15)


We will make use of `SGDClassifier` to learn the classification model.

In [None]:
clf1 = SGDClassifier(max_iter=1000, tol=0.01)


We will use traditional `fit()` method to train our model.

In [None]:
clf1.fit(xtrain, ytrain)

SGDClassifier(tol=0.01)

Let's obtain the training and test scores on the trained model.

In [None]:
train_score = clf1.score(xtrain, ytrain)
print("Training score: ", train_score) 


Training score:  0.8740470588235294


In [None]:
test_score = clf1.score(xtest, ytest)
print("Test score: ", test_score)

Test score:  0.8718666666666667


We obtain the confusion matrix and classification report for evaluating the classifier.

In [None]:
ypred = clf1.predict(xtest)

cm = confusion_matrix(ytest, ypred)
print(cm) 

[[2363   74  109]
 [ 250 1842  408]
 [ 101   19 2334]]


We use the `classification_report` function to obtain important evaluation metrics for all three classes.

In [None]:
cr = classification_report(ytest, ypred)
print(cr) 

              precision    recall  f1-score   support

           0       0.87      0.93      0.90      2546
           1       0.95      0.74      0.83      2500
           2       0.82      0.95      0.88      2454

    accuracy                           0.87      7500
   macro avg       0.88      0.87      0.87      7500
weighted avg       0.88      0.87      0.87      7500



##2. Incremental approach (using `partial_fit()`)

We will now assume that the data can not be kept completely in the main memory and hence, will load chunks of data and fit using `partial_fit()`.

In [None]:
xtrain[0:5]

array([[-0.68757687, -0.64550522,  0.62225691, -0.47085352,  0.40740261,
         1.26881267,  0.88761506,  0.65488746, -0.64590846,  0.11256402],
       [-0.89396789, -0.80955777,  0.37497888,  1.3207705 ,  0.27718403,
        -2.25014512,  1.52162745,  1.33110039,  1.73783113,  1.55400338],
       [ 0.0255818 ,  0.62806094,  0.73782478,  0.2250157 ,  0.56208794,
         0.60477614, -0.43315007,  1.37549026,  0.2692454 , -1.61447111],
       [-1.52689091,  0.37674956,  1.97143845,  0.56636717, -0.99702095,
        -1.40439757, -1.38714459, -1.91884614,  0.80334504,  0.87205605],
       [ 2.57706481, -1.71781642,  0.86873806,  0.97826578,  0.50814283,
        -1.79168631,  0.23275466,  1.66158004,  1.27139602,  0.1363464 ]])

In [None]:
ytrain[0:5]

array([1, 0, 0, 2, 0])



In order to load data chunk by chunk, we will first store the given (training) data in a CSV file. (This is just for demonstration purposes. In a real scenario, the large dataset might already be stored as, say, a CSV file, which we would read over multiple iterations.)

In [None]:
import numpy as np

In [None]:
train_data = np.concatenate((xtrain, ytrain[:, np.newaxis]), axis=1)

In [None]:
train_data[0:5]

array([[-0.68757687, -0.64550522,  0.62225691, -0.47085352,  0.40740261,
         1.26881267,  0.88761506,  0.65488746, -0.64590846,  0.11256402,
         1.        ],
       [-0.89396789, -0.80955777,  0.37497888,  1.3207705 ,  0.27718403,
        -2.25014512,  1.52162745,  1.33110039,  1.73783113,  1.55400338,
         0.        ],
       [ 0.0255818 ,  0.62806094,  0.73782478,  0.2250157 ,  0.56208794,
         0.60477614, -0.43315007,  1.37549026,  0.2692454 , -1.61447111,
         0.        ],
       [-1.52689091,  0.37674956,  1.97143845,  0.56636717, -0.99702095,
        -1.40439757, -1.38714459, -1.91884614,  0.80334504,  0.87205605,
         2.        ],
       [ 2.57706481, -1.71781642,  0.86873806,  0.97826578,  0.50814283,
        -1.79168631,  0.23275466,  1.66158004,  1.27139602,  0.1363464 ,
         0.        ]])

In [None]:
a = np.asarray(train_data)
np.savetxt("train_data.csv", a, delimiter=",")

Now, our data for demonstration is ready in a csv file. 

Let's create `SGDClassifier` object that we intend to train with `partial_fit`.

In [None]:
# Let us create another classifier and we will fit it incrementally.
clf2 = SGDClassifier(max_iter=1000, tol=0.01)


###Processing data chunk by chunk

Pandas' `read_csv()` function has a `chunksize` parameter that can be used to read data chunk by chunk. It specifies the number of rows per chunk. (The last chunk may contain fewer than `chunksize` rows, of course.)

We can then use each chunk for `partial_fit`, repeating these two steps (read a chunk, fit on it) until the file is exhausted. That way, the entire dataset never needs to be held in memory.
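To see the chunking behaviour in isolation, here is a minimal sketch that writes a tiny 10-row CSV file to a temporary location and reads it back 4 rows at a time. The file name `tiny.csv` and its contents are made up for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a tiny 10-row, header-less CSV file (illustration only).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "tiny.csv")
np.savetxt(csv_path, np.arange(30).reshape(10, 3), delimiter=",")

chunk_shapes = []
# header=None: np.savetxt writes no header row, so every line is data.
for chunk in pd.read_csv(csv_path, chunksize=4, header=None):
    chunk_shapes.append(chunk.shape)

print(chunk_shapes)  # [(4, 3), (4, 3), (2, 3)]
```

Each `chunk` is an ordinary `DataFrame`, so it can be sliced into features and labels before being passed to `partial_fit`.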

In [None]:
import pandas as pd

chunksize = 1000

# header=None: np.savetxt wrote no header row, so without it the first
# training sample would be consumed as column names.
reader = pd.read_csv("train_data.csv", chunksize=chunksize, header=None)

for chunk_no, train_df in enumerate(reader, start=1):
  # The first 10 columns are features; the last column is the label.
  xtrain_partial = train_df.iloc[:, 0:10].to_numpy()
  ytrain_partial = train_df.iloc[:, 10].to_numpy().astype(int)

  if chunk_no == 1:
    # In the first iteration, we are specifying all possible class
    # labels.
    clf2.partial_fit(xtrain_partial, ytrain_partial,
                     classes=np.array([0, 1, 2]))
  else:
    clf2.partial_fit(xtrain_partial, ytrain_partial)

  print("After iter #", chunk_no)
  print(clf2.coef_)
  print(clf2.intercept_)

After iter # 1
[[ -3.92037553  -8.80369826   2.94546802  29.19214586   9.01280493
    3.75789677  15.55871439  35.86026859  38.2597941   12.12102894]
 [-10.82495021  -8.68504032  -9.76804013  -1.99824024   9.0128883
   12.90017754   2.52402213  19.03041538  -3.12063155  -2.76948384]
 [ -3.16503835  10.74980392  -2.84624939 -14.94294568 -15.5253482
    8.01377013  -3.8599633  -42.70166356 -19.01602465  -8.49202777]]
[-19.66271808 -31.85445465 -18.43709513]
After iter # 2
[[ -4.23442234  15.33364978   9.51220132  27.73672075   9.75758109
    1.92337858 -12.5702007   36.73660429  36.29007357  10.56803131]
 [  5.55531209  -5.34347297   1.52884562 -10.70366723   9.04689135
   -1.35325804   3.88128797  14.40891297 -14.67193102   5.55807907]
 [  6.77524122 -14.52040094  -0.97362115 -16.89265126 -14.48505254
    2.20586224  -4.85366936 -41.43270364 -21.65690535   4.9768124 ]]
[-22.46235787 -25.33823759 -12.17728419]
After iter # 3
[[ -7.46932512 -10.45483468  -1.87311093  16.88919435   9.44214

**Notes:** 

* In the first call to `partial_fit()`, we passed the list of possible target class labels. For subsequent calls to `partial_fit()`, this is not required.

* Observe the changing values of the classifier attributes `coef_` and `intercept_`, which we print in each iteration.

In [None]:
test_score = clf2.score(xtest, ytest)
print("Test score: ", test_score)

Test score:  0.8244


Let's evaluate the classifier by examining the  `confusion_matrix`.

In [None]:
ypred = clf2.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm) 

[[2209  239   98]
 [ 204 1886  410]
 [  82  284 2088]]


In [None]:
cr = classification_report(ytest, ypred)
print(cr) 

              precision    recall  f1-score   support

           0       0.89      0.87      0.88      2546
           1       0.78      0.75      0.77      2500
           2       0.80      0.85      0.83      2454

    accuracy                           0.82      7500
   macro avg       0.82      0.82      0.82      7500
weighted avg       0.82      0.82      0.82      7500



Apart from `SGDClassifier`, we can also train `Perceptron()`, `MultinomialNB()` and `BernoulliNB()` in a similar manner.
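The incremental test score above is somewhat lower than the one obtained with `fit()`. One plausible reason is that `fit()` may run several epochs over the training data, whereas each `partial_fit()` call performs a single pass over the chunk it receives. A hedged sketch of making several passes over the chunks is shown below; the dataset size, the number of epochs (`n_epochs = 5`), and the use of `np.array_split` in place of re-reading a file are all assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_tr)
n_epochs = 5  # hypothetical choice, not tuned

for epoch in range(n_epochs):
    # In a real out-of-core setting, this inner loop would re-read the
    # file chunk by chunk instead of splitting an in-memory array.
    for X_chunk, y_chunk in zip(np.array_split(X_tr, 10),
                                np.array_split(y_tr, 10)):
        # Passing the same `classes` on every call is allowed.
        clf.partial_fit(X_chunk, y_chunk, classes=classes)
    print("epoch", epoch + 1, "test score:", clf.score(X_te, y_te))
```

Whether extra passes actually help depends on the data and the learning-rate schedule, so the printed scores should be treated as something to inspect rather than a guaranteed improvement.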



---



# **Incremental Preprocessing Example**

## `CountVectorizer` vs `HashingVectorizer`

Vectorizers are used to convert a collection of text documents to a vector representation, thus helping in preprocessing them before applying any model on these text documents. 

`CountVectorizer` and `HashingVectorizer` both perform the task of vectorizing text documents. However, there are some differences between them.


One difference is that `HashingVectorizer` does not store the resulting vocabulary (i.e. the unique tokens). Hence, it can be used to learn from data that does not fit into the computer’s main memory. Each mini-batch is vectorized using `HashingVectorizer`, which guarantees that the input space of the estimator always has the same dimensionality.

With `HashingVectorizer`, each token directly maps to a pre-defined column position in a matrix. For example, if there are 100 columns in the resultant (vectorized) matrix, each token (word) maps to 1 of the 100 columns. The mapping between the word and the position in matrix is done using hashing. 
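The token-to-column mapping can be sketched in a few lines. Note that this is only an illustration of the idea: the helper `token_to_column` is hypothetical and uses MD5 from the standard library as a stand-in, whereas scikit-learn uses a different hash function (MurmurHash3) internally, so the column indices will not match those produced by `HashingVectorizer`.

```python
import hashlib


def token_to_column(token, n_features=100):
    """Map a token to a column index in [0, n_features) via a hash.

    Illustration only: MD5 is a stand-in for the hash function that
    scikit-learn actually uses.
    """
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


# The same token always lands in the same column, and no vocabulary
# has to be stored to compute the mapping.
for token in ["apple", "doctor", "apple"]:
    print(token, "->", token_to_column(token))
```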

In other words, with `HashingVectorizer`, each token is transformed into a column position instead of being added to a vocabulary. Not storing the vocabulary is useful when handling large datasets, because holding a huge vocabulary of millions of tokens may be a challenge when memory is limited.

Since `HashingVectorizer` does not store a vocabulary, its object not only takes less space but also removes any dependence on calls made on previous chunks of data during incremental learning.
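A small sketch of this independence between chunks follows. The two toy chunks are made up for illustration. A `CountVectorizer` fitted on different chunks produces matrices with different numbers of columns, while two completely separate `HashingVectorizer` objects produce identical output for the same documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

chunk_1 = ["an apple a day", "keeps the doctor away"]
chunk_2 = ["british apples are loved", "what are the health benefits"]

# CountVectorizer: the number of columns depends on the chunk it was
# fitted on, so two chunks give incompatible feature spaces.
n_cols_1 = CountVectorizer().fit_transform(chunk_1).shape[1]
n_cols_2 = CountVectorizer().fit_transform(chunk_2).shape[1]
print(n_cols_1, n_cols_2)

# HashingVectorizer: the number of columns is fixed by n_features, and
# no fitting is needed, so independent objects agree exactly.
hv_a = HashingVectorizer(n_features=50)
hv_b = HashingVectorizer(n_features=50)
X_a = hv_a.transform(chunk_2)
X_b = hv_b.transform(chunk_2)
print(X_a.shape)              # (2, 50)
print(abs(X_a - X_b).sum())   # 0.0
```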

###Example

Let us take some sample text documents and vectorize them, first using CountVectorizer and then HashingVectorizer.

In [None]:
text_documents = ['The well-known saying an apple a day keeps the doctor away has a very straightforward, literal meaning, that the eating of fruit maintains good health.',
                  'The proverb first appeared in print in 1866 and over 150 years later is advice that we still pass down through generations.', 
                  'British apples are one of the nations best loved fruit and according to Great British Apples, we consume around 122,000 tonnes of them each year.', 
                  'But what are the health benefits, and do they really keep the doctor away?']


### 1. CountVectorizer

We will first import the library and then create an object of CountVectorizer class. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
c_vectorizer = CountVectorizer()

We will now use this object to vectorize the input text documents using the function `fit_transform()`.

In [None]:
X_c = c_vectorizer.fit_transform(text_documents)

In [None]:
X_c.shape

(4, 66)

Here, 66 is the size of the vocabulary.

We can also inspect the vocabulary using the `vocabulary_` attribute.


In [None]:
c_vectorizer.vocabulary_

{'000': 0,
 '122': 1,
 '150': 2,
 '1866': 3,
 'according': 4,
 'advice': 5,
 'an': 6,
 'and': 7,
 'appeared': 8,
 'apple': 9,
 'apples': 10,
 'are': 11,
 'around': 12,
 'away': 13,
 'benefits': 14,
 'best': 15,
 'british': 16,
 'but': 17,
 'consume': 18,
 'day': 19,
 'do': 20,
 'doctor': 21,
 'down': 22,
 'each': 23,
 'eating': 24,
 'first': 25,
 'fruit': 26,
 'generations': 27,
 'good': 28,
 'great': 29,
 'has': 30,
 'health': 31,
 'in': 32,
 'is': 33,
 'keep': 34,
 'keeps': 35,
 'known': 36,
 'later': 37,
 'literal': 38,
 'loved': 39,
 'maintains': 40,
 'meaning': 41,
 'nations': 42,
 'of': 43,
 'one': 44,
 'over': 45,
 'pass': 46,
 'print': 47,
 'proverb': 48,
 'really': 49,
 'saying': 50,
 'still': 51,
 'straightforward': 52,
 'that': 53,
 'the': 54,
 'them': 55,
 'they': 56,
 'through': 57,
 'to': 58,
 'tonnes': 59,
 'very': 60,
 'we': 61,
 'well': 62,
 'what': 63,
 'year': 64,
 'years': 65}

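The following is the sparse representation of the four text documents. Each line shows a `(document index, vocabulary index)` pair followed by the count of that token in that document.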

In [None]:
print(X_c)

  (0, 54)	3
  (0, 62)	1
  (0, 36)	1
  (0, 50)	1
  (0, 6)	1
  (0, 9)	1
  (0, 19)	1
  (0, 35)	1
  (0, 21)	1
  (0, 13)	1
  (0, 30)	1
  (0, 60)	1
  (0, 52)	1
  (0, 38)	1
  (0, 41)	1
  (0, 53)	1
  (0, 24)	1
  (0, 43)	1
  (0, 26)	1
  (0, 40)	1
  (0, 28)	1
  (0, 31)	1
  (1, 54)	1
  (1, 53)	1
  (1, 48)	1
  :	:
  (2, 39)	1
  (2, 4)	1
  (2, 58)	1
  (2, 29)	1
  (2, 18)	1
  (2, 12)	1
  (2, 1)	1
  (2, 0)	1
  (2, 59)	1
  (2, 55)	1
  (2, 23)	1
  (2, 64)	1
  (3, 54)	2
  (3, 21)	1
  (3, 13)	1
  (3, 31)	1
  (3, 7)	1
  (3, 11)	1
  (3, 17)	1
  (3, 63)	1
  (3, 14)	1
  (3, 20)	1
  (3, 56)	1
  (3, 49)	1
  (3, 34)	1




---

### 2. HashingVectorizer
Let us now see how `HashingVectorizer` is different from `CountVectorizer`.

We first import the `HashingVectorizer` class.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

Let us create an object of `HashingVectorizer` class. An important parameter of this class is `n_features`. It declares the number of features (columns) in the output feature matrix. 

Note: Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
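The effect of a small `n_features` can be made visible with a short sketch. The 12-word document is made up for this illustration, and `norm=None` together with `alternate_sign=False` is set here only so that the matrix holds raw non-negative counts.

```python
from sklearn.feature_extraction.text import HashingVectorizer

doc = ["apple banana cherry date elder fig grape honey kiwi lemon mango nectar"]

# Raw counts, no normalization and no sign flipping, so that
# collisions simply add up in the shared column.
small = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
large = HashingVectorizer(n_features=2**20, norm=None, alternate_sign=False)

X_small = small.transform(doc)
X_large = large.transform(doc)

# Both rows account for all 12 tokens ...
print(X_small.sum(), X_large.sum())  # 12.0 12.0
# ... but 12 tokens cannot occupy more than 8 columns, so with
# n_features=8 some tokens necessarily share a column.
print(X_small.nnz, X_large.nnz)
```

With the default `alternate_sign=True`, colliding tokens can be assigned opposite signs and cancel each other out, which is a likely explanation for any `0.0` entries that appear in the output of a small hashed matrix.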

In [None]:
h_vectorizer = HashingVectorizer(n_features=50)

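Let's perform hashing vectorization with `fit_transform`. (Since `HashingVectorizer` is stateless, `fit` learns nothing here; calling `transform` alone would give the same result.)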

In [None]:
X_h = h_vectorizer.fit_transform(text_documents)


Let us examine the shape of the transformed feature matrix. The number of columns in this matrix is equal to the `n_features` parameter we specified.

In [None]:
X_h.shape

(4, 50)

Let's print the representation of the first example.

In [None]:
print(X_h[0])

  (0, 5)	0.0
  (0, 8)	-0.47140452079103173
  (0, 10)	-0.23570226039551587
  (0, 11)	-0.23570226039551587
  (0, 13)	0.0
  (0, 18)	-0.23570226039551587
  (0, 20)	0.23570226039551587
  (0, 26)	0.0
  (0, 29)	0.23570226039551587
  (0, 33)	0.23570226039551587
  (0, 36)	-0.23570226039551587
  (0, 38)	0.47140452079103173
  (0, 39)	-0.23570226039551587
  (0, 45)	-0.23570226039551587
  (0, 46)	0.23570226039551587


Overall, `HashingVectorizer` is a good choice if we are falling short of memory and resources, or we need to perform incremental learning. However, `CountVectorizer` is a good choice if we need to access the actual tokens.



---



#**Combining preprocessing and fitting in Incremental Learning**

###(`HashingVectorizer` along with `SGDClassifier`)



We will now use a dataset containing a textual feature that requires preprocessing using a vectorizer. Since we wish to perform incremental learning using `partial_fit()`, we will preprocess (i.e., vectorize) the dataset feature using `HashingVectorizer` and then we will incrementally fit it. 

### 1. Downloading the dataset

Below, we download a dataset from the UCI Machine Learning Repository. (Instead of downloading, unzipping and then reading, we read the zipped CSV file directly. For that purpose, we make use of `urllib.request`, `BytesIO` and `TextIOWrapper`.)

This is a sentiment analysis dataset. It has only two columns: one for the textual review and the other for the sentiment.

In [None]:
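import pandas as pd
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile
import urllib.request

# Download the zip archive into memory and open it without writing to disk.
resp = urllib.request.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip')
zipfile = ZipFile(BytesIO(resp.read()))

# Wrap the binary member file so that pandas can read it as UTF-8 text.
data = TextIOWrapper(zipfile.open('sentiment labelled sentences/amazon_cells_labelled.txt'), encoding='utf-8')

# Note: the file appears to have no header row, so with the default
# settings read_csv consumes the first review as the header. Passing
# header=None, names=['review', 'sentiment'] would keep every row.
df = pd.read_csv(data, sep = '\t')
df.columns = ['review', 'sentiment']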

##2. Exploring the data set.

Let's explore the dataset a bit.

In [None]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | Good case, Excellent value. | 1 |
| 1 | Great for the jawbone. | 1 |
| 2 | Tied to charger for conversations lasting more... | 0 |
| 3 | The mic is great. | 1 |
| 4 | I have to jiggle the plug to get it to line up... | 0 |


In [None]:
df.tail()

| | review | sentiment |
|---|---|---|
| 994 | The screen does get smudged easily because it ... | 0 |
| 995 | What a piece of junk.. I lose more calls on th... | 0 |
| 996 | Item Does Not Match Picture. | 0 |
| 997 | The only thing that disappoint me is the infra... | 0 |
| 998 | You can not answer calls with the unit, never ... | 0 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     999 non-null    object
 1   sentiment  999 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [None]:
df.describe()

| | sentiment |
|---|---|
| count | 999.0 |
| mean | 0.500501 |
| std | 0.50025 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |


In [None]:
df.loc[:, 'sentiment'].unique()

array([1, 0])

As we can see, 
- There are 999 samples in the dataset. 
- The possible classes for sentiment are 1 and 0.

##4. Splitting data into train and test

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.loc[:, 'review']

In [None]:
y = df.loc[:, 'sentiment']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
X_train.shape

(799,)

In [None]:
y_train.shape

(799,)

## 5. Preprocessing

Since the data is textual, we need to vectorize it. In order to perform incremental learning, we will use HashingVectorizer.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer()

##6. Creating an instance of the SGDClassifier

In [None]:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(penalty='l2',loss='hinge')

## 7. Iteration 1 of partial_fit()

We will assume we do not have sufficient memory to handle all 799 training samples in one go. So, we will take the first 400 samples from the training data and `partial_fit` our classifier on them.

Another use case for `partial_fit` here is a scenario where only 400 samples are available at a time. We fit our classifier on them, but we use `partial_fit` so that we can train it with more data later, whenever that becomes available.


In [None]:
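# .iloc selects by position; after the shuffled split, the index labels
# of X_train are no longer 0, 1, 2, ...
X_train_part1_hashed = vectorizer.fit_transform(X_train.iloc[0:400])
y_train_part1 = y_train.iloc[0:400]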


In [None]:
all_classes = np.unique(df.loc[:, 'sentiment']) #we need to mention all classes in the first iteration of partial_fit()

In [None]:
classifier.partial_fit(X_train_part1_hashed, y_train_part1, classes=all_classes)

SGDClassifier()

Let us now use this classifier on our test data that we had kept aside earlier.

In [None]:
X_test_hashed = vectorizer.transform(X_test) #first we will have to preprocess the X_test with the same vectorizer that was fit on train data.

In [None]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.705


Note: We can also save this classifier to disk with `pickle` and load it later to continue training.
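A minimal sketch of saving a partially trained classifier and resuming later could look as follows. The random toy data and variable names are assumptions for illustration; `pickle.dumps`/`pickle.loads` are used so that the example does not touch the file system, but `pickle.dump`/`pickle.load` with a file object work the same way.

```python
import pickle

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# Toy data (assumption): an "old" batch available now, a "new" batch later.
X_old, y_old = rng.normal(size=(50, 5)), rng.randint(0, 2, size=50)
X_new, y_new = rng.normal(size=(50, 5)), rng.randint(0, 2, size=50)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X_old, y_old, classes=np.array([0, 1]))

# Serialize the partially trained model to bytes.
blob = pickle.dumps(clf)

# Later (possibly in another session): restore it ...
restored = pickle.loads(blob)
same_predictions = np.array_equal(restored.predict(X_new), clf.predict(X_new))
print(same_predictions)  # True

# ... and continue training on the newly arrived batch.
restored.partial_fit(X_new, y_new)
```

Pickled models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.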

## 8. Iteration 2 of partial_fit()

We will now assume that more data became available. So, we will fit the same classifier with more data and observe if our test score improves.

In [None]:
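# Reuse the same vectorizer. HashingVectorizer is stateless, so
# transform() is enough and no re-fitting is needed.
X_train_part2_hashed = vectorizer.transform(X_train.iloc[400:])
y_train_part2 = y_train.iloc[400:]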


In [None]:
classifier.partial_fit(X_train_part2_hashed, y_train_part2)

SGDClassifier()

In [None]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.76


We see that our test score has improved after we fed more data to the classifier in the second iteration of `partial_fit()`.
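The two manual iterations above generalize to a loop over any number of mini-batches. The sketch below uses a handful of made-up reviews and a hypothetical batch size of 3; in practice, each batch would be read from disk or from a stream.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy reviews and labels (made up for illustration).
reviews = ["great phone", "terrible battery", "works great", "bad screen",
           "excellent value", "poor quality", "love it", "waste of money"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

vectorizer = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(random_state=0)

batch_size = 3  # hypothetical choice
for start in range(0, len(reviews), batch_size):
    # Stateless preprocessing: every batch maps to the same 1024 columns.
    X_batch = vectorizer.transform(reviews[start:start + batch_size])
    y_batch = labels[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

print(clf.coef_.shape)  # (1, 1024)
```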

For a more elaborate example, refer to the scikit-learn out-of-core classification example: https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py