### Objectives
* Limitations of basic scikit-learn models
* Introduction to online-learning models
* Applications & Examples
* Limitations of online-learning models

#### Limitations of basic scikit-learn models
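* Once a model such as `LinearRegression` has been trained, it cannot be updated incrementally: calling `fit` again starts from scratch.
* When new data arrives, the model must be retrained on the old and new data together, which is a very expensive process for large datasets.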

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

In [3]:
house_data = fetch_california_housing()

In [4]:
feature_data = house_data.data

In [11]:
feature_data.shape

(20640, 8)

In [5]:
target_data = house_data.target

In [6]:
feature_data

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

In [7]:
target_data

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [8]:
current_feature_data = feature_data[:-5]

In [9]:
current_target_data = target_data[:-5]

In [10]:
current_feature_data.shape

(20635, 8)

In [14]:
new_feature_data = feature_data[-5:]

In [16]:
new_feature_data.shape

(5, 8)

In [15]:
new_target_data = target_data[-5:]

In [17]:
#linear model
lr = LinearRegression()

In [18]:
lr.fit(current_feature_data, current_target_data)

LinearRegression()

In [19]:
lr.coef_

array([ 4.36698295e-01,  9.44056975e-03, -1.07268396e-01,  6.44960409e-01,
       -3.92823742e-06, -3.78591131e-03, -4.21718526e-01, -4.34859339e-01])

* Now suppose we receive 5 additional rows of data.
* `LinearRegression` does not support partial training on new data.
* The model has to be retrained on all the data. Fitting on the new rows alone gives the following:

In [21]:
lr.fit(new_feature_data, new_target_data)

LinearRegression()

In [22]:
lr.coef_

array([-1.25982639e-01, -2.09113671e-02, -1.62817519e-02, -1.52094168e-02,
        6.80123634e-05,  4.74835543e-02,  1.15604694e-02,  3.51442814e-02])

* The previously learned parameters are completely forgotten: `fit` discards them and estimates new coefficients from the 5 new rows alone.
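To keep what was learned from the earlier rows, a `LinearRegression` has to be refit on the old and new rows stacked together. A minimal sketch on synthetic data (the arrays `X_old`, `X_new` and the coefficients used to generate them are illustrative stand-ins, not the housing data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_coef = np.array([1.5, -2.0, 0.5])  # assumed, for illustration only

# Synthetic stand-ins for the "current" and "new" data
X_old = rng.normal(size=(200, 3))
y_old = X_old @ true_coef + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(5, 3))
y_new = X_new @ true_coef + rng.normal(scale=0.1, size=5)

# fit() always starts from scratch, so old and new rows are stacked first
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

lr_all = LinearRegression().fit(X_all, y_all)
print(lr_all.coef_)
```

The cost of this approach grows with the full dataset every time a few rows arrive, which is the motivation for online learning.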

### Solution: use models that support online learning
https://scikit-learn.org/0.15/modules/scaling_strategies.html

In [23]:
from sklearn.linear_model import SGDRegressor

In [24]:
sgd_lr = SGDRegressor()

In [25]:
sgd_lr.fit(current_feature_data, current_target_data)

SGDRegressor()

In [26]:
sgd_lr.coef_

array([-9.04408262e+10,  2.32770465e+11,  1.37361132e+11, -5.61233187e+10,
        2.32035351e+11,  1.69019103e+10,  2.60470384e+11, -3.76218902e+11])

In [27]:
sgd_lr.partial_fit(new_feature_data, new_target_data)  # update with the new rows only; previously learned weights are kept

SGDRegressor()

In [28]:
sgd_lr.coef_

array([-9.13590269e+10,  2.30824392e+11,  1.35781732e+11, -5.64659279e+10,
        6.28251683e+10,  1.61904309e+10,  2.50950345e+11, -3.46821397e+11])

PS: When the data is too large to train on in one go, feed it in chunks to `partial_fit` on a model that supports online learning.
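Chunked training could look roughly like the sketch below. The `batches` generator is a hypothetical stand-in for any source that yields data piece by piece (a file read in chunks, a message queue), and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # assumed, for illustration only


def batches(n_batches=50, batch_size=100):
    """Yield synthetic (X, y) mini-batches, standing in for data read in chunks."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = X @ true_coef + rng.normal(scale=0.1, size=batch_size)
        yield X, y


model = SGDRegressor(random_state=0)
for X_batch, y_batch in batches():
    # Each call makes one pass over the batch, starting from the current weights
    model.partial_fit(X_batch, y_batch)

print(model.coef_)
```

Only one batch is held in memory at a time, so the total amount of data is not limited by RAM.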

### Limitations of Online Learning Models
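* `Pipeline` does not provide `partial_fit`, so a pipeline that contains an online model cannot be updated incrementally as a whole.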

In [29]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
model = make_pipeline(StandardScaler(), SGDRegressor())

In [31]:
ss = StandardScaler()

In [32]:
cv = CountVectorizer()
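One possible workaround is to chain the steps by hand, calling `partial_fit` on each one that supports it. The sketch below does this for `StandardScaler` and `SGDRegressor` on synthetic data whose features have very different scales (the data and its coefficients are assumptions for illustration). Scaling also matters for its own sake: the coefficients of order 1e11 shown earlier suggest that SGD did not converge on the unscaled housing features.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The pipeline object itself is not expected to expose partial_fit
pipeline = make_pipeline(StandardScaler(), SGDRegressor())
print(hasattr(pipeline, "partial_fit"))

rng = np.random.default_rng(0)
scaler = StandardScaler()
model = SGDRegressor(random_state=0)

for _ in range(50):
    # Synthetic batch: second feature is about 1000x larger than the first
    X = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 300.0], size=(100, 2))
    y = 3.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.1, size=100)

    scaler.partial_fit(X)           # update the running mean and variance
    X_scaled = scaler.transform(X)  # scale with the statistics seen so far
    model.partial_fit(X_scaled, y)  # update the regressor on the scaled batch

print(model.coef_)
```

Note that early batches are scaled with less accurate statistics than later ones, which is a trade-off of this manual approach.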

### Large Text Classification using Online Learning Models

In [33]:
import pandas as pd

In [34]:
data = pd.read_csv('amazon_food_review/Reviews.csv', nrows=10000,usecols=['Score','Text'])

In [36]:
data.head()

|   | Score | Text |
|---|-------|------|
| 0 | 5 | I have bought several of the Vitality canned d... |
| 1 | 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 4 | This is a confection that has been around a fe... |
| 3 | 2 | If you are looking for the secret ingredient i... |
| 4 | 5 | Great taffy at a great price. There was a wid... |


In [37]:
data.Score.unique()

array([5, 1, 4, 2, 3], dtype=int64)

In [38]:
from sklearn.feature_extraction.text import HashingVectorizer

In [39]:
hv = HashingVectorizer(n_features=1000)

In [40]:
hv.partial_fit(data.Text)  # HashingVectorizer is stateless, so nothing is learned from the data here

HashingVectorizer(n_features=1000)

In [42]:
hv.transform(data.Text[:5]).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
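`HashingVectorizer` suits online learning because it keeps no vocabulary: each token is mapped to a column by a hash function, so any chunk can be transformed without having seen the others. A small sketch with two made-up documents, which also uses `alternate_sign=False` as one way to obtain non-negative features:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Two made-up documents, for illustration only
docs = ["great taffy at a great price", "product arrived labeled wrong"]

# Two independent vectorizers, neither of them fitted on any data
hv_a = HashingVectorizer(n_features=1000, alternate_sign=False)
hv_b = HashingVectorizer(n_features=1000, alternate_sign=False)

features_a = hv_a.transform(docs)
features_b = hv_b.transform(docs)

# Number of entries where the two matrices differ
print((features_a != features_b).nnz)
# Smallest value in the matrix
print(features_a.min())
```

The trade-off is that different tokens can collide in the same column, and more so when `n_features` is small.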

In [43]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

In [44]:
mnb = MultinomialNB()

In [46]:
data_itr = pd.read_csv('amazon_food_review/Reviews.csv',chunksize=10000,usecols=['Score','Text'])

In [47]:
hv = HashingVectorizer(n_features=1000)
mnb = MultinomialNB()

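for data in data_itr:
    # HashingVectorizer is stateless, so this call learns nothing from the chunk
    hv.partial_fit(data.Text)
    feature = hv.transform(data.Text)
    # MultinomialNB needs non-negative features; hashed values can be negative
    feature = np.abs(feature)
    # All class labels must be declared on the first partial_fit call
    mnb.partial_fit(feature, data.Score, classes=[1, 2, 3, 4, 5])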

In [48]:
feature = hv.transform(data.Text[:5])
feature = np.abs(feature)
mnb.predict(feature)

array([5, 5, 5, 5, 5])

In [49]:
data = pd.read_csv('amazon_food_review/Reviews.csv',nrows=10000,usecols=['Score','Text'])

In [50]:
# Binarize the labels: scores 1-2 become 0, scores 3-5 become 1
data.Score = data.Score.map(lambda v: 0 if v < 3 else 1)

In [51]:
data.Score.value_counts()

1    8478
0    1522
Name: Score, dtype: int64

* The classes are imbalanced (8478 rows labelled 1 vs. 1522 labelled 0), so the data should be balanced, or the classes reweighted, before training.
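One option that fits chunked training is to reweight rows instead of resampling them. The sketch below uses `compute_sample_weight` on a toy label array (the labels are illustrative, not the review data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels with a skew similar in spirit to the one above
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# "balanced" gives each row the weight n_samples / (n_classes * count_of_its_class)
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)

# After weighting, both classes carry the same total weight
print(weights[y == 1].sum(), weights[y == 0].sum())
```

The resulting array can be passed as `sample_weight` to `MultinomialNB.partial_fit`. When computed per chunk, the weights reflect only that chunk's class mix, so this is an approximation of balancing the whole dataset.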