### Large Scale Machine Learning
So far, we have been able to load the entire dataset into memory and build models on it. In many real-life situations, however, the data is too large for that.

We will now look at how to

* handle such large-scale data
* do incremental preprocessing and learning
    * fit() vs partial_fit()
* combine preprocessing and incremental learning


#### Incremental Learning
The idea is to process the data in batches and update the model parameters after each batch. This way of learning is referred to as **incremental learning**. It is useful in two cases:
* For **out-of-memory** datasets, where it is not possible to load the entire dataset into RAM at once, one can load the data in chunks and update the model with each chunk.
* For ML tasks where new batches of data arrive over time, re-training the model on the previous and new data together is computationally expensive.
> Instead of re-training the model on the entire dataset, one can employ an incremental learning approach, where the model parameters are updated using only the new batch.
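
The batch-by-batch update described above can be sketched without any library model. The example below is only an illustration under stated assumptions: the data is synthetic, the `batches` generator stands in for chunks read from disk or a stream, and the learning rate `lr` is an arbitrary choice. The weights `w` persist between batches and each batch contributes one gradient step.

```py
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # weights used to generate the synthetic targets


def batches(n_batches, batch_size):
    """Yield (X, y) mini-batches, standing in for chunks read from disk or a stream."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = X @ true_w + rng.normal(scale=0.01, size=batch_size)
        yield X, y


w = np.zeros(3)  # model parameters, kept between batches
lr = 0.1         # learning rate (assumed value)

for X_batch, y_batch in batches(n_batches=200, batch_size=50):
    # One gradient step on the mean squared error of the current batch only
    grad = 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)
    w -= lr * grad

print(np.round(w, 2))
```

Only one batch is held in memory at a time, which is the property that makes this style of training usable for data that does not fit in RAM.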

##### partial_fit()
```py
partial_fit(X, y, [classes], [sample_weight])

# classes ==> list of all class labels.
# Supply the full list of classes on the very first call to partial_fit()
# (it can be omitted on later calls), because some classes may be absent
# from the first chunk.
```
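
To make the role of `classes` concrete, here is a small sketch with made-up data in which the first chunk contains only classes 0 and 1. The arrays and their sizes are assumptions chosen for illustration. Omitting `classes` on the first call raises a `ValueError`, while passing the full label list lets the model allocate parameters for class 2 before it has seen any example of it.

```py
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 4))
y1 = np.array([0, 1] * 10)        # first chunk: class 2 never appears
X2 = rng.normal(size=(20, 4))
y2 = np.array([0, 1, 2, 2] * 5)   # second chunk: class 2 shows up

# A throwaway estimator, used only to show the failure mode
try:
    SGDClassifier(random_state=0).partial_fit(X1, y1)  # no classes on first call
    first_call_failed = False
except ValueError as err:
    first_call_failed = True
    print("ValueError:", err)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X1, y1, classes=np.array([0, 1, 2]))
clf.partial_fit(X2, y2)           # classes may be omitted from now on

print(clf.classes_)
print(clf.coef_.shape)            # one row of coefficients per class
```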

The following estimators implement the `partial_fit` method:

* Classification:
    * MultinomialNB
    * BernoulliNB
    * SGDClassifier (can be used to implement different classifiers)
    * Perceptron

* Regression:
    * SGDRegressor
* Clustering:
    * MiniBatchKMeans

> **SGDRegressor** and **SGDClassifier** are commonly used for large-scale datasets.
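
The regression counterpart works the same way. The following sketch feeds synthetic batches to `SGDRegressor.partial_fit()`; the data-generating coefficients, batch size, and number of batches are assumptions made for the example, and the estimator uses its default hyperparameters.

```py
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
true_coef = np.array([3.0, -2.0])  # coefficients used to generate the targets

reg = SGDRegressor(random_state=42)

for _ in range(50):
    # Each iteration simulates a new batch arriving
    X_batch = rng.normal(size=(200, 2))
    y_batch = X_batch @ true_coef + rng.normal(scale=0.1, size=200)
    reg.partial_fit(X_batch, y_batch)  # one pass over this batch only

print(reg.coef_, reg.intercept_)
```

No `classes` argument is needed here, since regression has no label set to declare up front.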

##### fit() vs partial_fit()

In [1]:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

1. Traditional Approach

In [2]:
x, y = make_classification(
    n_samples=50000, n_features=10, n_classes=3, n_clusters_per_class=1
)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

In [3]:
clf1 = SGDClassifier(max_iter=1000, tol=0.01)

In [4]:
clf1.fit(x_train, y_train)

SGDClassifier(tol=0.01)

In [5]:
train_score = clf1.score(x_train, y_train)
print("Training score: ", train_score)

Training score:  0.8470823529411765


In [6]:
test_score = clf1.score(x_test, y_test)
print("Test score: ", test_score)

Test score:  0.8517333333333333


In [7]:
y_pred = clf1.predict(x_test)
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.83      0.87      0.85      2488
           1       0.84      0.83      0.84      2557
           2       0.89      0.85      0.87      2455

    accuracy                           0.85      7500
   macro avg       0.85      0.85      0.85      7500
weighted avg       0.85      0.85      0.85      7500



2. Incremental Approach

In [8]:
import numpy as np
import pandas as pd

train_data = np.concatenate((x_train, y_train[:, np.newaxis]), axis=1)

In [9]:
train_data[0:5]

array([[-0.11003047,  0.38908167,  1.47860953, -0.28270276, -0.31148655,
         0.16744432,  1.9966384 ,  0.55197083, -0.56721298, -1.0812275 ,
         1.        ],
       [ 0.68697399, -1.04758773,  0.04478718, -0.04564442,  0.22511452,
        -0.65409864,  2.6238179 , -1.10444198, -1.55020072, -1.17506703,
         1.        ],
       [ 0.22031978,  1.7393718 , -0.2605469 , -1.48837971, -2.07845681,
        -0.25352852, -0.84557823,  1.796273  ,  0.07208196,  0.94706743,
         2.        ],
       [-1.24892515, -0.39061413, -0.41031182,  0.03868295, -0.22756571,
         1.32660628,  0.52983493, -2.17001691,  1.02997175, -1.07805313,
         1.        ],
       [-0.04011214, -2.2423575 , -0.23955815,  0.37060199, -1.26875087,
         0.08466208,  1.6386385 ,  0.92970865, -0.51408611, -0.12544258,
         1.        ]])

In [10]:
a = np.asarray(train_data)
np.savetxt("train_data.csv", a, delimiter=",")
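
One detail matters when such a file is read back in chunks: `np.savetxt` writes no header row, while `pd.read_csv` treats the first line as a header by default. The sketch below uses a tiny hypothetical array and a temporary file to show the effect and how `header=None` avoids losing the first row.

```py
import os
import tempfile

import numpy as np
import pandas as pd

data = np.arange(12, dtype=float).reshape(6, 2)  # 6 rows, 2 columns

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    np.savetxt(path, data, delimiter=",")  # writes data rows only, no header

    # Default settings: the first data row is consumed as the header
    rows_default = sum(len(chunk) for chunk in pd.read_csv(path, chunksize=4))

    # header=None: every line is treated as data
    chunk_sizes = [len(chunk) for chunk in pd.read_csv(path, chunksize=4, header=None)]

print(rows_default)  # 5
print(chunk_sizes)   # [4, 2]
```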

In [11]:
clf2 = SGDClassifier(max_iter=1000, tol=0.01)

In [12]:
chunksize = 1000
all_classes = np.array([0, 1, 2])

# header=None: np.savetxt wrote no header row, so every line is data
reader = pd.read_csv("train_data.csv", chunksize=chunksize, header=None)

for i, train_df in enumerate(reader, start=1):
    x_train_partial = train_df.iloc[:, 0:10].to_numpy()
    y_train_partial = train_df.iloc[:, 10].to_numpy()

    if i == 1:
        # In the first iteration, we specify all possible class labels
        clf2.partial_fit(x_train_partial, y_train_partial, classes=all_classes)
    else:
        clf2.partial_fit(x_train_partial, y_train_partial)

    print("After iter #", i)
    print(clf2.coef_)
    print(clf2.intercept_)

After iter # 1
[[ 24.86977182  -0.76404621 -14.13044271  -2.45674511   8.40506152
  -25.38348618  29.28701781  20.56655709 -33.95172773  16.64349898]
 [-12.28891561 -16.07376281  10.70895973  13.35830589  -2.93840371
   13.29483941  14.52830124  18.49953659   6.99146449   9.12941022]
 [  7.01890264  -5.51692509  -1.14402555   9.10928025  11.58532188
   -8.62636488 -48.12690465  -8.91773147   9.4458253   -0.83175161]]
[-106.38130377  -22.123142     -0.42707729]
After iter # 2
[[ 17.32319306 -19.96940772  17.26280622   4.31857432 -11.53370198
  -17.36093809  32.74241067  -4.20670068 -27.81383225 -15.34485822]
 [-12.25689272   1.7952154  -11.16827132 -13.92941841  13.79337311
   13.13806953   9.7813337   -6.26302128   8.56218909  -3.87573571]
 [  4.99905856  -3.72970971  -1.24280574  -0.70273311  -1.89152705
   -6.02667202 -29.7556034   -0.21621234   5.20186526   1.81464476]]
[-71.70520108 -15.33757535 -23.43152305]
After iter # 3
[[ 20.2993236   -8.99253296   8.32998564  -6.10793663  -1.

In [13]:
test_score = clf2.score(x_test, y_test)
print("Test score: ", test_score)

Test score:  0.8361333333333333




In [14]:
y_pred = clf2.predict(x_test)
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.83      0.87      0.85      2488
           1       0.77      0.88      0.82      2557
           2       0.94      0.76      0.84      2455

    accuracy                           0.84      7500
   macro avg       0.85      0.84      0.84      7500
weighted avg       0.85      0.84      0.84      7500
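
One difference worth keeping in mind when comparing the two scores: `fit()` can make several passes over the training data (up to `max_iter`), whereas each `partial_fit()` call performs a single pass over the batch it receives. The sketch below shows one way to run several passes over the chunks. It assumes a small synthetic dataset held in memory and hypothetical values for `n_epochs` and `chunk`; with data on disk, each epoch would re-read the file instead.

```py
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(
    n_samples=5000, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=0
)
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
rng = np.random.default_rng(0)
n_epochs, chunk = 5, 500  # assumed values, chosen for illustration

for epoch in range(n_epochs):
    order = rng.permutation(len(X))  # visit the samples in a new order each epoch
    for start in range(0, len(X), chunk):
        idx = order[start:start + chunk]
        # Passing the same classes on every call is allowed
        clf.partial_fit(X[idx], y[idx], classes=classes)

train_accuracy = clf.score(X, y)
print(train_accuracy)
```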





#### Incremental preprocessing example

`CountVectorizer` vs `HashingVectorizer`
* `CountVectorizer` and `HashingVectorizer` both vectorize text data.
* `HashingVectorizer` doesn't store a vocabulary, so it can be used to learn from data that doesn't fit into main memory. Each mini-batch is vectorized with `HashingVectorizer`, which guarantees that the input space of the estimator always has the same dimensionality.
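
The idea behind hashing can be sketched in a few lines of plain Python. This is not scikit-learn's implementation: the hash function (MD5 here), the whitespace tokenizer, and the helper name `hash_vectorize` are all assumptions made for illustration. The point is that a token's column is computed from the token itself, so no vocabulary has to be stored or shared between batches.

```py
import hashlib


def hash_vectorize(text, n_features=16):
    """Map a document to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The column index depends only on the token, not on previously seen data
        index = int.from_bytes(digest[:4], "little") % n_features
        vec[index] += 1
    return vec


# Two documents that could come from different batches
v1 = hash_vectorize("Russell was raised by his grandparents")
v2 = hash_vectorize("He always bitterly resented his treatment")

his_index = hash_vectorize("his").index(1)  # column that the token "his" maps to
print(len(v1), len(v2))
print(v1[his_index], v2[his_index])
```

Both vectors have the same length, and the shared token lands in the same column in each. The price is that different tokens can collide in one column, and the mapping cannot be inverted to recover the words.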

In [15]:
text = ['Russell was raised by his paternal grandparents after his unconventional parents both died young.', 
        'He was discontented living with his grandparents, but enjoyed four happy years at Winchester College.',
        'His academic education came to a sudden end when he was sent down from Balliol College, Oxford, probably because authorities there had suspicions concerning the nature of his relationship with the future poet Lionel Johnson.',
        'He always bitterly resented his treatment by Oxford.']

`CountVectorizer`

In [46]:
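from sklearn.feature_extraction.text import CountVectorizer

test_text = ['I am a boy', 'I am indian']
c_vectorizer = CountVectorizer()
test_text_x = c_vectorizer.fit_transform(test_text)
print(test_text_x.toarray())
# get_feature_names() was removed in newer scikit-learn releases
print(c_vectorizer.get_feature_names_out().tolist())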

[[1 1 0]
 [1 0 1]]
['am', 'boy', 'indian']




In [53]:
from sklearn.feature_extraction.text import CountVectorizer
c_vectorizer = CountVectorizer()

In [54]:
X_c = c_vectorizer.fit_transform(text)

In [55]:
c_vectorizer.get_feature_names_out().tolist()

['academic',
 'after',
 'always',
 'at',
 'authorities',
 'balliol',
 'because',
 'bitterly',
 'both',
 'but',
 'by',
 'came',
 'college',
 'concerning',
 'died',
 'discontented',
 'down',
 'education',
 'end',
 'enjoyed',
 'four',
 'from',
 'future',
 'grandparents',
 'had',
 'happy',
 'he',
 'his',
 'johnson',
 'lionel',
 'living',
 'nature',
 'of',
 'oxford',
 'parents',
 'paternal',
 'poet',
 'probably',
 'raised',
 'relationship',
 'resented',
 'russell',
 'sent',
 'sudden',
 'suspicions',
 'the',
 'there',
 'to',
 'treatment',
 'unconventional',
 'was',
 'when',
 'winchester',
 'with',
 'years',
 'young']

In [56]:
X_c.shape

(4, 56)

In [57]:
X_c.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0,
        0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
        1, 0, 1, 0, 1, 2, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1,
        1, 2, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [58]:
c_vectorizer.vocabulary_

{'russell': 41,
 'was': 50,
 'raised': 38,
 'by': 10,
 'his': 27,
 'paternal': 35,
 'grandparents': 23,
 'after': 1,
 'unconventional': 49,
 'parents': 34,
 'both': 8,
 'died': 14,
 'young': 55,
 'he': 26,
 'discontented': 15,
 'living': 30,
 'with': 53,
 'but': 9,
 'enjoyed': 19,
 'four': 20,
 'happy': 25,
 'years': 54,
 'at': 3,
 'winchester': 52,
 'college': 12,
 'academic': 0,
 'education': 17,
 'came': 11,
 'to': 47,
 'sudden': 43,
 'end': 18,
 'when': 51,
 'sent': 42,
 'down': 16,
 'from': 21,
 'balliol': 5,
 'oxford': 33,
 'probably': 37,
 'because': 6,
 'authorities': 4,
 'there': 46,
 'had': 24,
 'suspicions': 44,
 'concerning': 13,
 'the': 45,
 'nature': 31,
 'of': 32,
 'relationship': 39,
 'future': 22,
 'poet': 36,
 'lionel': 29,
 'johnson': 28,
 'always': 2,
 'bitterly': 7,
 'resented': 40,
 'treatment': 48}

In [59]:
print(X_c)

  (0, 41)	1
  (0, 50)	1
  (0, 38)	1
  (0, 10)	1
  (0, 27)	2
  (0, 35)	1
  (0, 23)	1
  (0, 1)	1
  (0, 49)	1
  (0, 34)	1
  (0, 8)	1
  (0, 14)	1
  (0, 55)	1
  (1, 50)	1
  (1, 27)	1
  (1, 23)	1
  (1, 26)	1
  (1, 15)	1
  (1, 30)	1
  (1, 53)	1
  (1, 9)	1
  (1, 19)	1
  (1, 20)	1
  (1, 25)	1
  (1, 54)	1
  :	:
  (2, 5)	1
  (2, 33)	1
  (2, 37)	1
  (2, 6)	1
  (2, 4)	1
  (2, 46)	1
  (2, 24)	1
  (2, 44)	1
  (2, 13)	1
  (2, 45)	2
  (2, 31)	1
  (2, 32)	1
  (2, 39)	1
  (2, 22)	1
  (2, 36)	1
  (2, 29)	1
  (2, 28)	1
  (3, 10)	1
  (3, 27)	1
  (3, 26)	1
  (3, 33)	1
  (3, 2)	1
  (3, 7)	1
  (3, 40)	1
  (3, 48)	1
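
Because `CountVectorizer` learns its vocabulary from the data it is fitted on, it is awkward to use batch by batch. The sketch below uses two made-up batches to show both failure modes: fitting a separate vectorizer per batch gives matrices of different widths, and reusing a vectorizer fitted on the first batch silently drops words it has not seen.

```py
from sklearn.feature_extraction.text import CountVectorizer

batch_1 = ["the cat sat", "the dog barked"]
batch_2 = ["a bird sings"]

# A separate vocabulary per batch: the number of columns differs
shape_1 = CountVectorizer().fit_transform(batch_1).shape
shape_2 = CountVectorizer().fit_transform(batch_2).shape
print(shape_1, shape_2)  # (2, 5) (1, 2)

# A vocabulary fitted on batch 1 only: unseen words in batch 2 are ignored
cv = CountVectorizer().fit(batch_1)
batch_2_counts = cv.transform(batch_2).toarray()
print(batch_2_counts)    # [[0 0 0 0 0]]
```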


`HashingVectorizer`

In [49]:
from sklearn.feature_extraction.text import HashingVectorizer

test_text = ['I am a boy', 'I am indian']
h_vectorizer = HashingVectorizer(n_features=30)
test_text_x = h_vectorizer.fit_transform(test_text)
print(test_text_x.toarray())

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.70710678  0.          0.
   0.          0.          0.          0.          0.          0.
   0.         -0.70710678  0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.70710678  0.          0.
   0.          0.          0.         -0.70710678  0.          0.
   0.          0.          0.          0.          0.          0.        ]]


In [62]:
from sklearn.feature_extraction.text import HashingVectorizer
h_vectorizer = HashingVectorizer(n_features=50)
X_h = h_vectorizer.fit_transform(text)
X_h.shape

(4, 50)

In [63]:
print(X_h[0])

  (0, 1)	-0.2672612419124244
  (0, 6)	0.2672612419124244
  (0, 11)	-0.2672612419124244
  (0, 19)	-0.2672612419124244
  (0, 21)	0.2672612419124244
  (0, 23)	-0.2672612419124244
  (0, 24)	0.5345224838248488
  (0, 28)	0.2672612419124244
  (0, 30)	0.2672612419124244
  (0, 40)	-0.2672612419124244
  (0, 42)	0.0
  (0, 49)	0.2672612419124244
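
The property that matters for incremental learning is that `HashingVectorizer` is stateless: `transform()` can be called on any batch without fitting, and a given token always lands in the same column. The sketch below checks this on two tiny made-up inputs. It sets `alternate_sign=False`, an assumption made here so that colliding tokens cannot cancel each other; with the default signed hashing, an explicit `0.0` entry such as the one at column 42 above is what one would expect when two tokens collide with opposite signs.

```py
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2**10, alternate_sign=False)

# No fit() call: the column of each token is determined by the hash alone
a = hv.transform(["oxford college"])
b = hv.transform(["oxford"])

cols_a = set(a.nonzero()[1].tolist())
cols_b = set(b.nonzero()[1].tolist())

print(a.shape, b.shape)
print(cols_b.issubset(cols_a))  # the column for "oxford" is the same in both
```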


#### Combining preprocessing and fitting in incremental learning

In [64]:
import pandas as pd
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile
import urllib.request

# Download the zip archive into memory and open one file from it as text
response = urllib.request.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip')
zipfile = ZipFile(BytesIO(response.read()))

data = TextIOWrapper(zipfile.open('sentiment labelled sentences/amazon_cells_labelled.txt'), encoding='utf-8')
# Note: the file has no header row, so with the default header setting pandas
# uses its first line as the header; pass header=None to keep that line as data.
df = pd.read_csv(data, sep = '\t')
df.columns = ['review', 'sentiment']

In [24]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | Good case, Excellent value. | 1 |
| 1 | Great for the jawbone. | 1 |
| 2 | Tied to charger for conversations lasting more... | 0 |
| 3 | The mic is great. | 1 |
| 4 | I have to jiggle the plug to get it to line up... | 0 |


In [25]:
df.tail() 

|     | review | sentiment |
|-----|--------|-----------|
| 994 | The screen does get smudged easily because it ... | 0 |
| 995 | What a piece of junk.. I lose more calls on th... | 0 |
| 996 | Item Does Not Match Picture. | 0 |
| 997 | The only thing that disappoint me is the infra... | 0 |
| 998 | You can not answer calls with the unit, never ... | 0 |


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     999 non-null    object
 1   sentiment  999 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [27]:
df.loc[:, 'sentiment'].unique()

array([1, 0], dtype=int64)

In [28]:
from sklearn.model_selection import train_test_split
X = df.loc[:, 'review']
y = df.loc[:,'sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [29]:
X_train.shape

(799,)

In [30]:
vectorizer = HashingVectorizer()
classifier = SGDClassifier(penalty='l2', loss='hinge') # SVM

> Iteration 1 of `partial_fit()`

In [32]:
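# HashingVectorizer is stateless, so transform() is enough (no fitting needed)
X_train_part1_hashed = vectorizer.transform(X_train.iloc[0:400])
y_train_part1 = y_train.iloc[0:400]
# All class labels, passed on the first partial_fit() call
all_classes = np.unique(df.loc[:, 'sentiment'])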

classifier.partial_fit(
    X_train_part1_hashed, y_train_part1, classes = all_classes
)

SGDClassifier()

In [33]:
X_test_hashed = vectorizer.transform(X_test)
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.7


> Iteration 2 of `partial_fit()`

In [35]:
# Same stateless vectorizer, applied to the remaining training rows
X_train_part2_hashed = vectorizer.transform(X_train.iloc[400:])
y_train_part2 = y_train.iloc[400:]

classifier.partial_fit(X_train_part2_hashed, y_train_part2)

SGDClassifier()

In [36]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.765


The test accuracy has gone up from 0.70 to 0.765 after the second `partial_fit()` call.
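
The two iterations above generalize to a loop over any number of batches. The sketch below is a minimal version of that loop on a hypothetical toy corpus: the reviews, labels, `batch_size`, and `n_features` value are all made up for illustration and are not taken from the dataset used above.

```py
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy reviews (1 = positive, 0 = negative)
reviews = [
    "great phone, works well", "terrible battery life",
    "excellent value", "broke after one week",
    "love the sound quality", "worst purchase ever",
    "very happy with it", "poor reception and bad screen",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

vectorizer = HashingVectorizer(n_features=2**12)  # stateless, never fitted
classifier = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
all_classes = np.array([0, 1])

batch_size = 4
for start in range(0, len(reviews), batch_size):
    # Preprocess and learn from one batch at a time
    X_batch = vectorizer.transform(reviews[start:start + batch_size])
    y_batch = labels[start:start + batch_size]
    classifier.partial_fit(X_batch, y_batch, classes=all_classes)

preds = classifier.predict(vectorizer.transform(["great phone", "terrible battery"]))
print(preds)
```

With real data, the slices of `reviews` and `labels` would be replaced by chunks read from disk, for example from `pd.read_csv(..., chunksize=...)`.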