# Outline

In this colab, we study how to handle large-scale datasets in sklearn.

- So far in this course, we have been able to load the entire dataset into memory and to train and make predictions on all the data at once.

- Large-scale datasets may not fit in memory, so we need to devise
  strategies to handle them for both training and prediction.

In this colab, we will discuss the following topics:


- Overview of handling large-scale data

- Incremental preprocessing and learning

- `fit()` vs. `partial_fit()`: `partial_fit()` is the method to use for incremental learning

- Combining preprocessing and incremental learning

# Large-scale Machine Learning
- Large-scale machine learning differs from traditional machine learning in that it involves processing a large amount of data, measured by
its size or by its number of samples, features, or classes.
- The last decade saw many developments in efficient large-scale learning for real-world use cases.

- Although scikit-learn is optimized for smaller data, it does offer a decent set of feature preprocessing and learning algorithms for
classification, regression, and clustering on large-scale data.

- Scikit-learn handles large data through the `partial_fit()` method, which is used in place of the usual `fit()` method.

> The idea is to process data in batches and update the model parameters for
> each batch. This way of learning is referred to as 'Incremental (or
> out-of-core) learning'.

# Incremental Learning

Incremental learning may be required in the following two scenarios:

- For out-of-memory (large) datasets, where it is not possible to load the entire dataset into RAM at once, one can load the data in chunks
and fit the model on each chunk in turn.

- For machine learning tasks where new batches of data arrive over time, re-training the model on the previous and new batches together is a
computationally expensive process.


> Instead of re-training the model with the entire set of data, one can employ an incremental learning approach, where the
model parameters are updated with the new batch of data.

### Incremental Learning in `sklearn`

- To perform incremental learning, scikit-learn implements the `partial_fit()` method, which makes it possible to train on a dataset that does not
fit in memory. In other words, the estimator can learn incrementally from batches of instances.

- In this colab, we will see an example of how to read, process, and train on a dataset that cannot be loaded into memory in its entirety.

- `partial_fit()` is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core (online)
learning. Each call has some performance overhead, so it is recommended to call it on chunks that are as large as possible (while still fitting
in memory) to amortize that overhead.
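
As a rough illustration of the batching advice above, the sketch below feeds synthetic regression data to an `SGDRegressor` in fixed-size batches. The helper `iter_batches`, the batch size of 500, and the synthetic data are assumptions made for this example; they are not part of scikit-learn or of the original colab.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic linear data standing in for a dataset that would not fit in memory.
n_samples, n_features = 10_000, 5
true_coef = np.array([1.5, -2.0, 0.5, 3.0, -1.0])
X = rng.normal(size=(n_samples, n_features))
y = X @ true_coef + rng.normal(scale=0.1, size=n_samples)


def iter_batches(X, y, batch_size):
    """Yield consecutive (X, y) slices of at most batch_size rows (hypothetical helper)."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


reg = SGDRegressor(random_state=0)
n_calls = 0
for X_batch, y_batch in iter_batches(X, y, batch_size=500):
    # Each call updates the existing coefficients instead of refitting from scratch.
    reg.partial_fit(X_batch, y_batch)
    n_calls += 1

r2 = reg.score(X, y)
print(n_calls, round(r2, 3))
```

In a real out-of-core setting, `iter_batches` would read each batch from disk or from a stream rather than slice an in-memory array.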

### `partial_fit()` parameters:

`partial_fit(X, y, [classes], [sample_weight])`

where,


- X : array of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

- y : array of shape (n_samples,) of target values.

- classes : array of shape (n_classes,) containing a list of all the classes
  that can possibly appear in the y vector. Must be provided at the first call
  to partial_fit, can be omitted in subsequent calls.

- sample_weight : (optional) array of shape (n_samples,) containing weights applied to individual samples (1.0 for unweighted).

Returns: object (self)

- For classification tasks, we have to pass the list of all possible target class labels in the `classes` parameter on the first call, so that
the model can cope with target classes that do not appear in the first batch of data.
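
The role of `classes` can be seen in a small sketch. It assumes a three-class problem in which the first batch happens to contain only two of the classes; the random data and batch sizes are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
all_classes = np.array([0, 1, 2])

# First batch: contains only classes 0 and 1; class 2 shows up later.
X_first = rng.normal(size=(20, 4))
y_first = np.array([0, 1] * 10)

# Omitting `classes` on the very first call is rejected with a ValueError.
try:
    SGDClassifier().partial_fit(X_first, y_first)
    first_call_error = None
except ValueError as exc:
    first_call_error = exc

clf = SGDClassifier(random_state=0)
clf.partial_fit(X_first, y_first, classes=all_classes)

# A later batch may introduce class 2 without passing `classes` again.
X_later = rng.normal(size=(20, 4))
y_later = np.array([0, 1, 2, 2] * 5)
clf.partial_fit(X_later, y_later)

print(clf.classes_, clf.coef_.shape)
```

Because the classifier allocates one row of coefficients per class on the first call, it has to know the full label set before it sees every label.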

The following estimators, among others, implement the `partial_fit()` method:

- **Classification**:

    - MultinomialNB

    - BernoulliNB

    - SGDClassifier

    - Perceptron

- **Regression**:

    - SGDRegressor

- **Clustering**:

    - MiniBatchKMeans
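
A quick way to check whether an estimator supports incremental learning is to test for a `partial_fit` attribute. The sketch below does this for the estimators listed above; `SVC` is added only as a contrasting example of an estimator without `partial_fit`.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import Perceptron, SGDClassifier, SGDRegressor
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC

listed = [MultinomialNB, BernoulliNB, SGDClassifier, Perceptron, SGDRegressor, MiniBatchKMeans]

# Map each class name to whether it exposes partial_fit.
supports_partial_fit = {cls.__name__: hasattr(cls, "partial_fit") for cls in listed + [SVC]}

for name, ok in supports_partial_fit.items():
    print(f"{name:16s} partial_fit: {ok}")
```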

<br>

`SGDRegressor` and `SGDClassifier` are commonly used for handling large data.


Standard regression/classification implementations such as batch gradient descent, support vector machines (SVMs), and
random forests need the entire dataset in memory at once, so they cannot be used when we do not
have sufficient memory. SGD, however, can deal with large datasets effectively by breaking the data into chunks and processing them
sequentially. Because only one chunk has to be in memory at a time, SGD is
useful for large-scale data as well as for cases where data arrives as a stream
at intervals.

# **`fit()`** versus **`partial_fit()`**

- Below, we show the use of `partial_fit()` along with `SGDClassifier` on sample data.

- For illustration, we first use the traditional `fit()` method and then use `partial_fit()` on the same data.

In [1]:
# Importing Libraries
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## 1. Traditional approach (using `fit()`)

In [2]:
x, y = make_classification(
    n_samples=50000, n_features=10, n_classes=3, n_clusters_per_class=1
)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

In [3]:
clf1 = SGDClassifier(max_iter=1000, tol=0.01)

In [4]:
clf1.fit(xtrain, ytrain)

In [5]:
train_score = clf1.score(xtrain, ytrain)
print("Training score: ", train_score)

Training score:  0.9299058823529411


In [6]:
test_score = clf1.score(xtest, ytest)
print("Test score: ", test_score)

Test score:  0.9313333333333333


In [7]:
ypred = clf1.predict(xtest)

cm = confusion_matrix(ytest, ypred)
print(cm)

[[2227  211   29]
 [  32 2433  108]
 [ 104   31 2325]]


In [8]:
cr = classification_report(ytest, ypred)
print(cr)

              precision    recall  f1-score   support

           0       0.94      0.90      0.92      2467
           1       0.91      0.95      0.93      2573
           2       0.94      0.95      0.94      2460

    accuracy                           0.93      7500
   macro avg       0.93      0.93      0.93      7500
weighted avg       0.93      0.93      0.93      7500



## 2. Incremental approach (using `partial_fit()`)

We will now assume that the data can not be kept completely in the main memory and hence, will load chunks of data and fit using
`partial_fit()`.

In [9]:
xtrain[0:5]

array([[ 1.10500045,  0.84052026,  1.01608832,  0.11033884,  0.64264471,
         0.11505083, -0.0820009 ,  1.10066483,  0.14648171,  0.41402369],
       [ 0.98004185,  1.82124193,  0.7913414 ,  1.06080822, -0.5409595 ,
         0.67127883, -0.57478641, -1.33638612, -1.65815965, -1.09638091],
       [ 0.54139567, -0.95663694,  1.17351699, -0.10567863,  0.49043826,
         0.16380537,  0.17639576, -1.39930465,  0.24652796, -0.05422355],
       [-1.26806012, -1.14647083,  0.65693658, -1.97748136, -1.80454743,
        -0.04575879,  0.37600204,  0.38048225,  1.11432603,  0.76132667],
       [-0.68860502,  1.25605412,  1.34005005,  0.30035479, -1.47889937,
        -0.81469356, -0.53500156,  0.72654913, -1.83304557, -1.45186243]])

In [10]:
ytrain[0:5]

array([0, 1, 2, 2, 1])

In order to load data chunk by chunk, we will first store the (training) data in a CSV file. (This is just for demonstration purposes. In a real
scenario, the large dataset might already be stored as, say, a CSV file that we read in multiple iterations.)

In [11]:
import numpy as np

In [12]:
train_data = np.concatenate((xtrain, ytrain[:, np.newaxis]), axis=1)

In [13]:
train_data[0:5]

array([[ 1.10500045,  0.84052026,  1.01608832,  0.11033884,  0.64264471,
         0.11505083, -0.0820009 ,  1.10066483,  0.14648171,  0.41402369,
         0.        ],
       [ 0.98004185,  1.82124193,  0.7913414 ,  1.06080822, -0.5409595 ,
         0.67127883, -0.57478641, -1.33638612, -1.65815965, -1.09638091,
         1.        ],
       [ 0.54139567, -0.95663694,  1.17351699, -0.10567863,  0.49043826,
         0.16380537,  0.17639576, -1.39930465,  0.24652796, -0.05422355,
         2.        ],
       [-1.26806012, -1.14647083,  0.65693658, -1.97748136, -1.80454743,
        -0.04575879,  0.37600204,  0.38048225,  1.11432603,  0.76132667,
         2.        ],
       [-0.68860502,  1.25605412,  1.34005005,  0.30035479, -1.47889937,
        -0.81469356, -0.53500156,  0.72654913, -1.83304557, -1.45186243,
         1.        ]])

In [14]:
# train_data is already a NumPy array; it is written out without a header row.
np.savetxt("train_data.csv", train_data, delimiter=",")

- Now, our data for demonstration is ready in a csv file.

- Let's create `SGDClassifier` object that we intend to train with `partial_fit`.

In [15]:
# Let us create another classifier and we will fit it incrementally.
clf2 = SGDClassifier(max_iter=1000, tol=0.01)

### Processing data chunk by chunk

The pandas `read_csv()` function has a `chunksize` parameter that can be used to read data chunk by chunk. It specifies
the number of rows per chunk. (The last chunk may contain fewer than `chunksize` rows, of course.)

We can then pass each chunk to `partial_fit()` and repeat these two steps (read a chunk, fit on it) until the file is exhausted. That way, the
entire dataset never has to be kept in memory.
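
To see what chunked reading returns, here is a minimal sketch that reads a ten-row CSV in chunks of four. The in-memory `csv_text` is a made-up stand-in for a large file on disk.

```python
import io
import pandas as pd

# Ten rows of "feature,label" text standing in for a large CSV file on disk.
csv_text = "\n".join(f"{i * 0.5},{i % 2}" for i in range(10))

chunk_shapes = []
# header=None: the text has no header row, so every line is data.
for chunk in pd.read_csv(io.StringIO(csv_text), header=None, chunksize=4):
    chunk_shapes.append(chunk.shape)

print(chunk_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

Each chunk is an ordinary `DataFrame`, and only the last one is shorter than `chunksize`.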

In [16]:
import pandas as pd

chunksize = 1000

# The CSV written by np.savetxt has no header row, so we pass header=None;
# otherwise pandas would consume the first sample as column names.
reader = pd.read_csv("train_data.csv", chunksize=chunksize, header=None)

for iteration, train_df in enumerate(reader, start=1):
    # The first 10 columns are features; the last column is the label,
    # which np.savetxt stored as a float.
    xtrain_partial = train_df.iloc[:, 0:10].to_numpy()
    ytrain_partial = train_df.iloc[:, 10].to_numpy().astype(int)

    if iteration == 1:
        # In the first iteration, we specify all possible class labels.
        clf2.partial_fit(xtrain_partial, ytrain_partial, classes=np.array([0, 1, 2]))
    else:
        clf2.partial_fit(xtrain_partial, ytrain_partial)

    print("After iter #", iteration)
    print(clf2.coef_)
    print(clf2.intercept_)

After iter # 1
[[ -2.69191961  77.7938791  -10.56702519   2.18444354   4.1942952
   -8.46619245  -6.01635734   7.63100049  21.38395999  46.21719783]
 [-11.07398377  16.81560249  12.19717134   5.65591138 -15.97081715
   -0.77767623  -7.75460933   8.2155308  -27.48628015 -22.4098787 ]
 [  5.18568359 -63.71780506  -7.9352457    5.40251129   9.32088735
   -5.79233732  14.36395169   8.83780983  29.42919602   9.51540867]]
[-65.14372933 -15.06139529 -42.64355487]
After iter # 2
[[  4.33781882  36.66812582  -3.95291396   5.16340824  -1.54569694
   11.92624449  -4.24992582   2.94542184   3.04426414  14.68554768]
 [  3.92242705   5.16954939  -1.1869542    0.2759156   -2.4570855
    3.74275952  -3.31906713  -3.62208697 -13.10200082 -11.58361039]
 [  5.13434135 -42.43692127   3.18723748  -9.05314252  -5.40989385
  -14.13570154   8.38029709  -4.71104252  13.69859921   0.38217805]]
[-51.1770285   -8.54199683 -22.15800974]
After iter # 3
[[  3.01585412  28.3124663   -5.22138979  -8.81423678  -7.22706

#### Notes:

- In the first call to `partial_fit()`, we passed the list of possible target class labels. For subsequent calls to `partial_fit()`, this is not
required.

- Observe the changing values of the classifier attributes `coef_` and `intercept_`, which we print in each iteration.

In [17]:
test_score = clf2.score(xtest, ytest)
print("Test score: ", test_score)

Test score:  0.9225333333333333




In [18]:
ypred = clf2.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)

[[2170  256   41]
 [  34 2438  101]
 [  97   52 2311]]




In [19]:
cr = classification_report(ytest, ypred)
print(cr)

              precision    recall  f1-score   support

           0       0.94      0.88      0.91      2467
           1       0.89      0.95      0.92      2573
           2       0.94      0.94      0.94      2460

    accuracy                           0.92      7500
   macro avg       0.92      0.92      0.92      7500
weighted avg       0.92      0.92      0.92      7500
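
One difference between the two runs above is that each `partial_fit()` call performs a single pass over the batch it receives, whereas `fit()` may run several epochs over the full training set. The sketch below makes several passes over the chunks; the values of `chunk_size` and `n_epochs` and the smaller synthetic dataset are assumptions for illustration, and extra passes may or may not improve the score on a given dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)
chunk_size = 500  # assumed chunk size
n_epochs = 5      # assumed number of passes over the data

for epoch in range(n_epochs):
    for start in range(0, X_train.shape[0], chunk_size):
        stop = start + chunk_size
        # Passing `classes` on every call is allowed as long as it stays the same.
        clf.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)
    print(f"epoch {epoch + 1}: test accuracy = {clf.score(X_test, y_test):.3f}")
```

With data on disk, each epoch would re-open the file and iterate over its chunks again.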



# Incremental Preprocessing Example 

<br> 

### `CountVectorizer` vs. `HashingVectorizer`
- Vectorizers are used to convert a collection of text documents to a vector representation, thus helping in preprocessing them before applying
any model on these text documents.

- `CountVectorizer` and `HashingVectorizer` both vectorize text documents. However, there are some differences
between them.

- One difference is that `HashingVectorizer` does not store the resulting vocabulary (i.e., the unique tokens). Hence, it can be used to learn from
data that does not fit into the computer's main memory. Vectorizing each mini-batch with `HashingVectorizer` guarantees that the
input space of the estimator always has the same dimensionality.

- With `HashingVectorizer`, each token maps directly to a pre-defined column position in a matrix. For example, if there are 100 columns in the
resultant (vectorized) matrix, each token (word) maps to one of the 100 columns. The mapping between the word and its position in the matrix is
computed with a hash function.

- In other words, in `HashingVectorizer`, each token is transformed to a column position instead of being added to a vocabulary. Not storing the
vocabulary is useful when handling large datasets, because holding a huge vocabulary of millions of tokens may be a
challenge when memory is limited.
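
The statelessness described above can be checked directly. In this sketch, two made-up batches are vectorized independently, with no `fit` step; the choice of `n_features=2**10` is arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**10)

batch_1 = ["an apple a day", "keeps the doctor away"]
batch_2 = ["an apple a day", "a completely unseen sentence about oranges"]

# transform() works without any prior fit: the hashing function is fixed.
X_1 = vectorizer.transform(batch_1)
X_2 = vectorizer.transform(batch_2)

print(X_1.shape, X_2.shape)                # both batches have 1024 columns
print(hasattr(vectorizer, "vocabulary_"))  # no vocabulary is stored

# The same document maps to the same row no matter which batch it arrives in.
same_row = np.array_equal(X_1[0].toarray(), X_2[0].toarray())
print(same_row)
```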

### Example

- Let us take some sample text documents and vectorize them, first using `CountVectorizer` and then `HashingVectorizer`.

In [1]:
text_documents = ['The well-known saying an apple a day keeps the doctor away has a very straightforward, literal meaning, that the',
'The proverb first appeared in print in 1866 and over 150 years later is advice that we still pass down through ge',
'British apples are one of the nations best loved fruit and according to Great British Apples, we consume around 1',
'But what are the health benefits, and do they really keep the doctor away?']

### 1. CountVectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
c_vectorizer = CountVectorizer()

In [3]:
X_c = c_vectorizer.fit_transform(text_documents)

In [4]:
X_c.shape

(4, 57)

In [6]:
X_c.toarray()

array([[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 1, 1, 3, 0, 0, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
        0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0, 2, 1, 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

We can also see the vocabulary using `vocabulary_`

In [24]:
c_vectorizer.vocabulary_

{'the': 48,
 'well': 54,
 'known': 31,
 'saying': 44,
 'an': 4,
 'apple': 7,
 'day': 17,
 'keeps': 30,
 'doctor': 19,
 'away': 11,
 'has': 25,
 'very': 52,
 'straightforward': 46,
 'literal': 33,
 'meaning': 35,
 'that': 47,
 'proverb': 42,
 'first': 21,
 'appeared': 6,
 'in': 27,
 'print': 41,
 '1866': 1,
 'and': 5,
 'over': 39,
 '150': 0,
 'years': 56,
 'later': 32,
 'is': 28,
 'advice': 3,
 'we': 53,
 'still': 45,
 'pass': 40,
 'down': 20,
 'through': 50,
 'ge': 23,
 'british': 14,
 'apples': 8,
 'are': 9,
 'one': 38,
 'of': 37,
 'nations': 36,
 'best': 13,
 'loved': 34,
 'fruit': 22,
 'according': 2,
 'to': 51,
 'great': 24,
 'consume': 16,
 'around': 10,
 'but': 15,
 'what': 55,
 'health': 26,
 'benefits': 12,
 'do': 18,
 'they': 49,
 'really': 43,
 'keep': 29}

In [25]:
print(X_c)

  (0, 48)	3
  (0, 54)	1
  (0, 31)	1
  (0, 44)	1
  (0, 4)	1
  (0, 7)	1
  (0, 17)	1
  (0, 30)	1
  (0, 19)	1
  (0, 11)	1
  (0, 25)	1
  (0, 52)	1
  (0, 46)	1
  (0, 33)	1
  (0, 35)	1
  (0, 47)	1
  (1, 48)	1
  (1, 47)	1
  (1, 42)	1
  (1, 21)	1
  (1, 6)	1
  (1, 27)	2
  (1, 41)	1
  (1, 1)	1
  (1, 5)	1
  :	:
  (2, 9)	1
  (2, 38)	1
  (2, 37)	1
  (2, 36)	1
  (2, 13)	1
  (2, 34)	1
  (2, 22)	1
  (2, 2)	1
  (2, 51)	1
  (2, 24)	1
  (2, 16)	1
  (2, 10)	1
  (3, 48)	2
  (3, 19)	1
  (3, 11)	1
  (3, 5)	1
  (3, 9)	1
  (3, 15)	1
  (3, 55)	1
  (3, 26)	1
  (3, 12)	1
  (3, 18)	1
  (3, 49)	1
  (3, 43)	1
  (3, 29)	1


### 2. HashingVectorizer
- Let us now see how `HashingVectorizer` is different from `CountVectorizer`.

- We will create an object of `HashingVectorizer`. While creating the object, we need to specify the number of features we wish to have in the
feature matrix.

In [26]:
from sklearn.feature_extraction.text import HashingVectorizer

An important parameter of the `HashingVectorizer` class is `n_features`. It sets the number of features
(columns) in the output feature matrix.

Note: A small number of features is likely to cause hash collisions, but a large number results in larger coefficient dimensions in linear
learners.

In [27]:
h_vectorizer = HashingVectorizer(n_features=50)

Let's perform hashing vectorization with `fit_transform`.

In [28]:
X_h = h_vectorizer.fit_transform(text_documents)

In [29]:
X_h.shape

(4, 50)

In [30]:
print(X_h[0])

  (0, 5)	-0.2886751345948129
  (0, 8)	-0.5773502691896258
  (0, 10)	-0.2886751345948129
  (0, 11)	-0.2886751345948129
  (0, 13)	0.0
  (0, 18)	-0.2886751345948129
  (0, 20)	0.2886751345948129
  (0, 26)	0.0
  (0, 38)	0.2886751345948129
  (0, 39)	-0.2886751345948129
  (0, 45)	-0.2886751345948129


Overall, `HashingVectorizer` is a good choice if we are falling short of memory and resources, or we need to perform incremental learning.

However, `CountVectorizer` is a good choice if we need to access the actual tokens.

# Combining preprocessing and fitting in Incremental Learning
(`HashingVectorizer` along with `SGDClassifier`)

We will now use a dataset containing a textual feature that requires preprocessing using a vectorizer. Since we wish to perform incremental
learning using `partial_fit()`, we will preprocess (i.e., vectorize) the dataset feature using `HashingVectorizer` and then we will
incrementally fit it.

### 1. Downloading the dataset
Below, we load a dataset from the UCI Machine Learning Repository. (The code assumes that the file has already been downloaded and unzipped
into a local `data` directory.)

This is a sentiment analysis dataset. It has only two columns: one for the textual review and the other for the sentiment.

In [32]:
import pandas as pd

# Tab-separated file: review text, then sentiment label.
# Note: the raw file has no header row, so with pandas' default header handling
# the first line is consumed as column names and is not part of `df`.
# Pass header=None to read_csv to keep that line as data.
df = pd.read_csv('./data/amazon_cells_labelled.txt', sep='\t')
df.columns = ['review', 'sentiment']

### 2. Exploring the dataset

In [33]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | Good case, Excellent value. | 1 |
| 1 | Great for the jawbone. | 1 |
| 2 | Tied to charger for conversations lasting more... | 0 |
| 3 | The mic is great. | 1 |
| 4 | I have to jiggle the plug to get it to line up... | 0 |


In [34]:
df.tail()

| | review | sentiment |
|---|---|---|
| 994 | The screen does get smudged easily because it ... | 0 |
| 995 | What a piece of junk.. I lose more calls on th... | 0 |
| 996 | Item Does Not Match Picture. | 0 |
| 997 | The only thing that disappoint me is the infra... | 0 |
| 998 | You can not answer calls with the unit, never ... | 0 |


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     999 non-null    object
 1   sentiment  999 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [36]:
df.describe()

| | sentiment |
|---|---|
| count | 999.0 |
| mean | 0.500501 |
| std | 0.50025 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |


In [37]:
df.loc[:, 'sentiment'].unique()

array([1, 0], dtype=int64)

### 4. Splitting data into train and test

In [39]:
from sklearn.model_selection import train_test_split

X = df.loc[:, 'review']
y = df.loc[:, 'sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape

(799,)

In [40]:
y_train.shape

(799,)

### 5. Preprocessing
Since the data is textual, we need to vectorize it. In order to perform incremental learning, we will use HashingVectorizer.

In [41]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer()

### 6. Creating an instance of the SGDClassifier

In [46]:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(penalty='l2', loss='hinge')

### 7. Iteration 1 of partial_fit()

- We will assume that we do not have sufficient memory to handle all 799 training samples in one go. So, we will take the first 400
samples from the training data and call `partial_fit()` on our classifier.

- Another use case of `partial_fit()` here is a scenario where only 400 samples are available at a time. We fit our classifier on
them, but we use `partial_fit()` so that we can train it with more data later, whenever that becomes available.

In [47]:
X_train_part1_hashed = vectorizer.fit_transform(X_train[0:400])
y_train_part1 = y_train[0:400]

In [48]:
import numpy as np

# We need to pass all classes in the first call to partial_fit().
all_classes = np.unique(df.loc[:, 'sentiment'])

In [49]:
classifier.partial_fit(X_train_part1_hashed, y_train_part1, classes=all_classes)

In [50]:
# Preprocess X_test with the same vectorizer (i.e., the same hashing settings) that was used for the training data.
X_test_hashed = vectorizer.transform(X_test)

In [51]:
classifier.score(X_test_hashed, y_test)

0.735

### 8. Iteration 2 of `partial_fit()`

We will now assume that more data became available. So, we will fit the same classifier with more data and observe if our test score improves.

In [52]:
X_train_part2_hashed = vectorizer.fit_transform(X_train[400:])
y_train_part2 = y_train[400:]

In [53]:
classifier.partial_fit(X_train_part2_hashed, y_train_part2)

In [54]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.75
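
The two manual iterations above generalize to a loop over any number of batches. The sketch below is one way to write that loop; the tiny in-memory review list, the labels, `batch_size`, and `n_features` are all made up for illustration, and in practice the batches would come from a file or a stream.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy reviews standing in for a large labelled text stream (made up for illustration).
reviews = [
    "great phone, works well", "terrible battery, do not buy",
    "excellent sound quality", "broke after one week",
    "love this case", "worst purchase ever",
    "very good value", "poor reception and bad screen",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

vectorizer = HashingVectorizer(n_features=2**12)  # stateless, so no fit is needed
classifier = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
all_classes = np.array([0, 1])

batch_size = 4  # assumed; in practice as large as memory comfortably allows
for start in range(0, len(reviews), batch_size):
    text_batch = reviews[start:start + batch_size]
    X_batch = vectorizer.transform(text_batch)  # vectorize only the current batch
    classifier.partial_fit(X_batch, labels[start:start + batch_size], classes=all_classes)

predictions = classifier.predict(vectorizer.transform(["good quality", "bad battery"]))
print(predictions)
```

Because the vectorizer holds no learned state, the same object can vectorize training batches and test data at any point in the stream.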
