## ***GloVe (Global Vectors for Word Representation)***
- Learns word vectors from global statistics: how often words co-occur across a large corpus
- Builds a co-occurrence matrix X, where
    - **X[i][j] = number of times word j appears in the context of word i**.
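
As a rough illustration of the counting step, the sketch below builds such a matrix in plain Python. The `cooccurrence_counts` helper, the symmetric `window` size, and the toy sentence are assumptions made for this example. GloVe itself also down-weights distant context words, which is left out here.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often word j appears within `window` positions of word i."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
X = cooccurrence_counts(tokens, window=1)
print(X["the"]["cat"])  # 1
print(X["on"]["the"])   # 1
```

With a symmetric window the counts are symmetric too, so `X["cat"]["the"]` equals `X["the"]["cat"]`.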

> The medium and large English models in spaCy (`en_core_web_md`, `en_core_web_lg`) ship with pretrained static word vectors (**GloVe** vectors in older releases)
***

In [57]:
import pandas as pd
import numpy as np
import spacy

`The small model (en_core_web_sm) ships without static word vectors to keep the package small, so no vocabulary entry has a vector.`

In [58]:
nlp = spacy.load('en_core_web_sm')

In [59]:
vocabs = [w.text for w in nlp.vocab]
len(vocabs)

764

In [60]:
vocabs_with_vec = [w.text for w in nlp.vocab if w.has_vector]
len(vocabs_with_vec)

0

In [61]:
nlp = spacy.load('en_core_web_lg')

In [62]:
vocabs = [w.text for w in nlp.vocab]
len(vocabs)

764

In [63]:
vocabs_with_vec = [w.text for w in nlp.vocab if w.has_vector]
len(vocabs_with_vec)

483

In [64]:
text = "King's Brother Naliva, did not see tiger in his life. That's gonna be the very first time."

In [65]:
doc = nlp(text)
doc.vector

array([ 2.49257423e-02,  1.40086442e-01, -6.26299158e-02, -9.58364755e-02,
        8.62075537e-02,  6.23040497e-02,  6.32589161e-02, -2.12972954e-01,
       -4.82869148e-02,  2.18838739e+00, -2.52835602e-01,  1.56183504e-02,
        1.36078864e-01, -5.54644410e-03, -1.26936153e-01,  3.23903449e-02,
        1.67958271e-02,  8.27373505e-01, -1.61599889e-01,  2.16769502e-02,
        3.49579528e-02, -6.79826587e-02, -3.36050279e-02, -6.44635186e-02,
        1.58028845e-02,  2.86054332e-02, -7.25458190e-02, -2.21678298e-02,
        7.24087358e-02, -1.48268405e-03, -9.86358374e-02,  3.72204408e-02,
       -6.29735067e-02,  3.32303084e-02,  6.76210448e-02, -9.08748358e-02,
        1.09941661e-01,  3.23800147e-02, -6.57827258e-02, -1.44940242e-01,
       -6.85887486e-02,  1.83807798e-02,  6.81123659e-02, -1.62720487e-01,
        6.30724728e-02,  1.51250884e-02, -1.36790395e-01, -3.21125686e-02,
        5.34656048e-02, -1.76390335e-02, -3.68455402e-03,  5.39165884e-02,
        3.53638716e-02,  
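
By default, `doc.vector` is the average of the vectors of the tokens in the document. A minimal NumPy sketch of that averaging follows, with made-up 3-dimensional vectors standing in for the real 300-dimensional ones:

```python
import numpy as np

# Hypothetical token vectors, one row per token (values are illustrative only)
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])

# Averaging over the token axis gives one vector for the whole document
doc_vector = token_vectors.mean(axis=0)
print(doc_vector.shape)  # (3,)
print(doc_vector)
```

Because every token contributes equally, stop words and punctuation pull the document vector toward a generic direction, which is one reason to filter them out first.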

In [66]:
base_token = nlp("Lion")  # a one-token Doc, so its vector equals the vector of "Lion"
for token in doc:
    print(f"{token.text} -> {base_token.text} => {token.similarity(base_token)}")

King -> Lion => 0.4747401773929596
's -> Lion => 0.1708216369152069
Brother -> Lion => 0.29921838641166687
Naliva -> Lion => 0.0
, -> Lion => 0.09471897035837173
did -> Lion => 0.24573396146297455
not -> Lion => 0.24010586738586426
see -> Lion => 0.26982855796813965
tiger -> Lion => 0.7359829545021057
in -> Lion => 0.1637919545173645
his -> Lion => 0.27939265966415405
life -> Lion => 0.24868765473365784
. -> Lion => 0.17546972632408142
That -> Lion => 0.23986424505710602
's -> Lion => 0.1708216369152069
gon -> Lion => 0.13697659969329834
na -> Lion => 0.09639988094568253
be -> Lion => 0.20266763865947723
the -> Lion => 0.2469848245382309
very -> Lion => 0.1867997795343399
first -> Lion => 0.18334393203258514
time -> Lion => 0.18702884018421173
. -> Lion => 0.17546972632408142
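
The scores above are cosine similarities between word vectors. The sketch below shows the calculation with NumPy. The `cosine_similarity` helper is written for this example and is not spaCy's own code. Returning 0.0 for a zero-length vector is an assumption that mirrors the `Naliva` row, which presumably scores 0.0 because the model has no vector for that word.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([0.0, 0.0], [1.0, 1.0]))  # zero vector, like a word with no embedding
```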




In [67]:
modified_text = " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct and not token.is_oov])
modified_doc = nlp(modified_text)
for token in modified_doc:
    print(f"{token} -> {base_token.text} => {token.similarity(base_token)}")


King -> Lion => 0.4747401773929596
Brother -> Lion => 0.29921838641166687
tiger -> Lion => 0.7359829545021057
life -> Lion => 0.24868765473365784
go -> Lion => 0.23744122684001923
to -> Lion => 0.1585528552532196
time -> Lion => 0.18702884018421173


## **Exercise with classification**

***Data modifications***

In [68]:
data = pd.read_json('news.json')

In [69]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7500 entries, 0 to 7499
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      7500 non-null   object
 1   category  7500 non-null   object
dtypes: object(2)
memory usage: 175.8+ KB


In [70]:
data['category'].value_counts()

category
CRIME       2500
SPORTS      2500
BUSINESS    2500
Name: count, dtype: int64

In [71]:
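def vectorize(text):
    # Doc.vector is the average of the token vectors (300 dimensions with en_core_web_lg)
    return nlp(text).vector

# NOTE: this name shadows Python's built-in map(); it is kept because a later cell calls it.
def map(text):
    # Encode the category label as an integer:
    # CRIME -> 0, SPORTS -> 1, anything else (here BUSINESS) -> 2
    if text == "CRIME":
        return 0
    elif text == "SPORTS":
        return 1
    else:
        return 2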
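
One possible alternative for the label encoding is a lookup table, sketched below. The names `CATEGORY_TO_ID` and `encode_category` are made up for this example. Unlike an `else` branch, the dictionary raises `KeyError` on an unexpected label instead of silently assigning it to class 2.

```python
# Hypothetical lookup table covering the three categories in the dataset
CATEGORY_TO_ID = {"CRIME": 0, "SPORTS": 1, "BUSINESS": 2}

def encode_category(label):
    """Return the integer id for a category label; KeyError if the label is unknown."""
    return CATEGORY_TO_ID[label]

print([encode_category(c) for c in ["CRIME", "SPORTS", "BUSINESS"]])  # [0, 1, 2]
```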

In [72]:
data["text_vect"] = data["text"].apply(vectorize)

In [73]:
data['text_vect'][0]

array([-1.26001850e-01,  2.67008185e-01, -1.66271344e-01, -6.34676814e-02,
        1.11337133e-01,  7.27505758e-02,  6.63272515e-02, -2.29737118e-01,
       -1.11820459e-01,  2.11998129e+00, -1.75150022e-01,  1.19103529e-02,
        1.03851162e-01, -7.35084713e-02, -2.37800658e-01, -9.46056750e-03,
       -7.55394101e-02,  6.05688155e-01, -1.07233554e-01,  2.65875049e-02,
       -4.04296704e-02,  2.03981996e-02,  1.85514074e-02, -3.09930388e-02,
        2.69037182e-03, -1.15703279e-02, -6.55305460e-02, -2.48420443e-02,
        1.01293482e-01, -1.04971789e-01, -4.93194163e-02, -3.62486579e-02,
       -1.12083972e-01,  6.22642078e-02,  6.66726306e-02, -3.55293117e-02,
       -5.75957820e-02,  4.43126373e-02, -1.18519016e-01,  1.38854906e-02,
       -4.49024001e-03, -1.18816875e-01,  6.14356883e-02, -1.16614714e-01,
       -1.31945647e-02,  5.73158786e-02, -1.58320993e-01, -3.14859636e-02,
        1.05353110e-01,  8.28966592e-03, -8.78614113e-02,  5.73272184e-02,
       -1.15606472e-01,  

In [74]:
data['category'] = data['category'].apply(map)

In [75]:
data.head()

| | text | category | text_vect |
|---|---|---|---|
| 0 | Larry Nassar Blames His Victims, Says He 'Was ... | 0 | [-0.12600185, 0.2670082, -0.16627134, -0.06346... |
| 1 | Woman Beats Cancer, Dies Falling From Horse | 0 | [-0.15009212, 0.39993936, -0.030083999, 0.0827... |
| 2 | Vegas Taxpayers Could Spend A Record $750 Mill... | 1 | [0.0686294, 0.079817496, -0.10145511, -0.09158... |
| 3 | This Richard Sherman Interception Literally Sh... | 1 | [-0.05209184, 0.2640776, 0.024623524, -0.05012... |
| 4 | 7 Things That Could Totally Kill Weed Legaliza... | 2 | [-0.1287915, 0.17678502, -0.064180195, -0.1208... |


***train test split***

In [76]:
from sklearn.model_selection import train_test_split

In [77]:
train_text, test_text, train_cat, test_cat = train_test_split(data['text_vect'], data['category'], test_size=0.2, random_state=42)
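
To make the effect of `test_size=0.2` and a fixed seed concrete, here is a simplified sketch of a shuffled split over row indices. It is not scikit-learn's implementation and will not reproduce the same rows. The `shuffle_split` helper and its parameters are assumptions for illustration.

```python
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(7500)
print(len(train_idx), len(test_idx))  # 6000 1500
```

Fixing the seed means the same rows land in the test set on every run, which keeps the accuracy figures of different classifiers comparable.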

In [78]:
train_text, train_cat

(4664    [-0.12581278, 0.1898045, -0.036296602, -0.0293...
 4411    [-0.07932585, 0.17766216, -0.11637992, -0.0280...
 7448    [-0.12062322, 0.28803593, -0.09285785, 0.06842...
 1919    [-0.11157667, 0.22581784, -0.050059576, 0.1048...
 1298    [0.040780935, 0.079157814, -0.078565314, 0.062...
                               ...                        
 5191    [-0.15780966, 0.1636805, -0.018735997, 0.01396...
 5226    [-0.008577629, 0.16997755, -0.18298985, -0.023...
 5390    [-0.117595, 0.31183338, 0.029610638, -0.013123...
 860     [-0.020720065, 0.17224072, -0.18227829, -0.049...
 7270    [-0.14791192, 0.086226225, -0.2055983, -0.0767...
 Name: text_vect, Length: 6000, dtype: object,
 4664    1
 4411    0
 7448    0
 1919    1
 1298    1
        ..
 5191    0
 5226    2
 5390    0
 860     2
 7270    2
 Name: category, Length: 6000, dtype: int64)

***Stack train/test features into 2-D arrays***

In [88]:
train_stack = np.stack(train_text) # model requires 2-D array as X
test_stack = np.stack(test_text)
test_stack

array([[ 0.06671583,  0.1412695 , -0.11733154, ..., -0.11196616,
        -0.03278363, -0.00393685],
       [-0.01368161,  0.05240081,  0.08033022, ...,  0.01986429,
        -0.08102372,  0.1284342 ],
       [-0.02073403,  0.12885685, -0.12046428, ..., -0.0283075 ,
         0.05703985,  0.05961888],
       ...,
       [-0.06178999,  0.20459999, -0.11497214, ..., -0.0450891 ,
        -0.00802652,  0.12298948],
       [-0.06142196,  0.19728164, -0.12341338, ..., -0.1150021 ,
         0.0386393 ,  0.07016914],
       [-0.10421622,  0.19685285, -0.01277222, ..., -0.01948734,
         0.13160922,  0.0942944 ]], shape=(1500, 300), dtype=float32)
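
The stacking step is needed because a column of per-row vectors is a 1-D container of arrays, not a 2-D matrix. A small sketch with made-up values (the `column` variable is a stand-in for the pandas column):

```python
import numpy as np

# A 1-D object array whose elements are themselves vectors, like the text_vect column
column = np.empty(2, dtype=object)
column[0] = np.array([0.1, 0.2, 0.3], dtype=np.float32)
column[1] = np.array([0.4, 0.5, 0.6], dtype=np.float32)
print(column.shape)  # (2,)

# np.stack joins the vectors along a new first axis: one row per sample
X = np.stack(column)
print(X.shape)  # (2, 3)
```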

In [85]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [89]:
scaled_train_stack = scaler.fit_transform(train_stack)
scaled_test_stack = scaler.transform(test_stack)

scaled_train_stack

array([[0.416151  , 0.58160686, 0.3055647 , ..., 0.45544875, 0.45612225,
        0.48931414],
       [0.47877282, 0.568594  , 0.23375559, ..., 0.4453909 , 0.47794566,
        0.5037934 ],
       [0.42314178, 0.6868806 , 0.25484735, ..., 0.43832693, 0.40509003,
        0.43344137],
       ...,
       [0.42722106, 0.7123841 , 0.3646624 , ..., 0.5352946 , 0.3619061 ,
        0.4402717 ],
       [0.55771977, 0.5627839 , 0.17466584, ..., 0.4690867 , 0.46605614,
        0.49818766],
       [0.38638163, 0.4706029 , 0.15375525, ..., 0.42784762, 0.36653465,
        0.40723234]], shape=(6000, 300), dtype=float32)
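
Min-max scaling maps each feature to `(x - min) / (max - min)`, with the minimum and maximum taken from the training data only. The NumPy sketch below uses made-up numbers and is a simplified stand-in for `MinMaxScaler`, not its actual code.

```python
import numpy as np

train = np.array([[-1.0, 10.0],
                  [ 0.0, 20.0],
                  [ 1.0, 30.0]])
test = np.array([[0.5, 35.0]])

# "fit": learn per-column minimum and range from the training rows
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "transform": apply the same statistics to both sets
scaled_train = (train - col_min) / col_range
scaled_test = (test - col_min) / col_range
print(scaled_train)
print(scaled_test)
```

Test values outside the training range fall outside [0, 1] after scaling (here the second test feature becomes 1.25), so scaled test data is not guaranteed to stay within that interval.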

***Classification***

In [90]:
from sklearn.metrics import classification_report

`KNN` - 89 %
***
- Unscaled data

In [80]:
from sklearn.neighbors import KNeighborsClassifier

In [113]:
clf_1 = KNeighborsClassifier(n_neighbors=5, metric='cosine')
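
Conceptually, a k-nearest-neighbours classifier with a cosine metric ranks the training vectors by cosine distance to the query and takes a majority vote among the closest `k`. The sketch below shows that idea on toy 2-D data. The `knn_predict` helper is written for this example and is not how scikit-learn implements the search.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Majority label among the k training rows with the smallest cosine distance."""
    train_unit = train_X / np.linalg.norm(train_X, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    distances = 1.0 - train_unit @ query_unit  # cosine distance = 1 - cosine similarity
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_y = [0, 0, 1, 1]

print(knn_predict(train_X, train_y, np.array([1.0, 0.2])))  # 0
print(knn_predict(train_X, train_y, np.array([0.1, 1.0])))  # 1
```

Cosine distance ignores vector length and compares direction only, which often suits averaged word embeddings.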

In [114]:
clf_1.fit(train_stack, train_cat)

| Parameter | Value |
|---|---|
| n_neighbors | 5 |
| weights | 'uniform' |
| algorithm | 'auto' |
| leaf_size | 30 |
| p | 2 |
| metric | 'cosine' |
| metric_params | None |
| n_jobs | None |


In [115]:
preds_1 = clf_1.predict(test_stack)
preds_1[:10], test_cat[:10]

(array([1, 0, 2, 2, 1, 2, 1, 2, 0, 0]),
 970     1
 6279    0
 1859    2
 6803    2
 6305    1
 3039    2
 7194    1
 1446    2
 5199    0
 6234    0
 Name: category, dtype: int64)

In [116]:
print(classification_report(test_cat, preds_1))

              precision    recall  f1-score   support

           0       0.88      0.91      0.90       510
           1       0.90      0.85      0.87       482
           2       0.89      0.90      0.90       508

    accuracy                           0.89      1500
   macro avg       0.89      0.89      0.89      1500
weighted avg       0.89      0.89      0.89      1500



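- Scaled data (note: per the parameters printed below, this run used an earlier `clf_1` with `n_neighbors=10` and the default Minkowski metric, so it is not a like-for-like comparison with the cosine model above)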

In [99]:
clf_1.fit(scaled_train_stack, train_cat)

| Parameter | Value |
|---|---|
| n_neighbors | 10 |
| weights | 'uniform' |
| algorithm | 'auto' |
| leaf_size | 30 |
| p | 2 |
| metric | 'minkowski' |
| metric_params | None |
| n_jobs | None |


In [100]:
preds_1_ = clf_1.predict(scaled_test_stack)
preds_1_[:10], test_cat[:10]

(array([1, 0, 2, 2, 2, 2, 1, 2, 0, 0]),
 970     1
 6279    0
 1859    2
 6803    2
 6305    1
 3039    2
 7194    1
 1446    2
 5199    0
 6234    0
 Name: category, dtype: int64)

In [101]:
print(classification_report(test_cat, preds_1_))

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       510
           1       0.91      0.84      0.87       482
           2       0.88      0.92      0.90       508

    accuracy                           0.89      1500
   macro avg       0.89      0.89      0.89      1500
weighted avg       0.89      0.89      0.89      1500
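
To show how the precision, recall and f1-score columns are derived, the sketch below recomputes them for the ten predictions and true labels printed just before this report. The `precision_recall_f1` helper is written for this example. `classification_report` does the same over all 1500 test rows.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 from true/false positive and negative counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 2, 2, 1, 2, 1, 2, 0, 0]  # first ten test labels
y_pred = [1, 0, 2, 2, 2, 2, 1, 2, 0, 0]  # first ten predictions on scaled data

precision, recall, f1 = precision_recall_f1(y_true, y_pred, label=2)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 1.0 0.89
```

The single mistake (a class 1 headline predicted as class 2) lowers the precision of class 2 and the recall of class 1, while class 0 is unaffected.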



`Multinomial Naive Bayes` - ***83 %***
***
- Scaled data
    - Unscaled data cannot be used: `MultinomialNB` rejects negative feature values, and the raw embeddings contain them
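
A quick way to confirm this before fitting is to test a feature matrix for negative entries. The sketch below uses a hypothetical `has_negative_values` helper and two made-up embedding rows:

```python
import numpy as np

def has_negative_values(X):
    """True if any entry of the feature matrix is below zero."""
    return bool((np.asarray(X) < 0).any())

raw = np.array([[-0.126, 0.267],
                [ 0.068, 0.079]])

# Column-wise min-max scaling moves every training feature into [0, 1]
scaled = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

print(has_negative_values(raw))     # True
print(has_negative_values(scaled))  # False
```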

In [102]:
from sklearn.naive_bayes import MultinomialNB

In [103]:
clf_2 = MultinomialNB()

In [105]:
clf_2.fit(scaled_train_stack, train_cat)

| Parameter | Value |
|---|---|
| alpha | 1.0 |
| force_alpha | True |
| fit_prior | True |
| class_prior | None |


In [106]:
preds_2 = clf_2.predict(scaled_test_stack)

In [107]:
print(classification_report(test_cat, preds_2))


              precision    recall  f1-score   support

           0       0.90      0.82      0.86       510
           1       0.86      0.80      0.83       482
           2       0.76      0.88      0.82       508

    accuracy                           0.83      1500
   macro avg       0.84      0.83      0.83      1500
weighted avg       0.84      0.83      0.83      1500



`Random Forest Classifier` - ***88 %***

In [117]:
from sklearn.ensemble import RandomForestClassifier

In [118]:
clf_3 = RandomForestClassifier()
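
A random forest combines many decision trees. In scikit-learn the predicted class is the one with the highest mean probability estimate across the trees. The sketch below illustrates that final step with made-up probabilities from three hypothetical trees:

```python
import numpy as np

# Hypothetical class probabilities for one sample, one row per tree
# (columns: CRIME, SPORTS, BUSINESS)
tree_probas = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.1, 0.3],
])

# Average over the trees, then pick the class with the highest mean probability
mean_proba = tree_probas.mean(axis=0)
predicted_class = int(np.argmax(mean_proba))
print(predicted_class)  # 0
```

The second tree on its own would have chosen class 1, but the average still favours class 0. This averaging is what makes a forest less sensitive than a single tree to one unlucky split.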

In [119]:
clf_3.fit(train_stack, train_cat)

| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [120]:
preds_3 = clf_3.predict(test_stack)

In [121]:
print(classification_report(test_cat, preds_3))


              precision    recall  f1-score   support

           0       0.90      0.87      0.88       510
           1       0.87      0.90      0.88       482
           2       0.88      0.89      0.89       508

    accuracy                           0.88      1500
   macro avg       0.88      0.88      0.88      1500
weighted avg       0.89      0.88      0.88      1500



- Scaled data

In [122]:
clf_3.fit(scaled_train_stack, train_cat)


| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [123]:
preds_3_ = clf_3.predict(scaled_test_stack)

In [124]:
print(classification_report(test_cat, preds_3_))


              precision    recall  f1-score   support

           0       0.90      0.86      0.88       510
           1       0.86      0.89      0.87       482
           2       0.88      0.89      0.89       508

    accuracy                           0.88      1500
   macro avg       0.88      0.88      0.88      1500
weighted avg       0.88      0.88      0.88      1500



`Decision Tree Classifier` - **73 %**

In [126]:
from sklearn.tree import DecisionTreeClassifier

In [127]:
clf_4 = DecisionTreeClassifier()
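
With default settings the tree chooses its splits by Gini impurity, which is 0 for a node containing a single class and grows as the classes mix. A small sketch of the measure follows (the `gini_impurity` helper is written for this example):

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity([0, 0, 0, 0]))  # 0.0, a pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5, two classes evenly mixed
print(round(gini_impurity([0, 1, 2]), 3))  # 0.667, three classes evenly mixed
```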

In [128]:
clf_4.fit(train_stack, train_cat)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | None |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |


In [129]:
preds_4 = clf_4.predict(test_stack)

In [130]:
print(classification_report(test_cat, preds_4))

              precision    recall  f1-score   support

           0       0.76      0.70      0.73       510
           1       0.71      0.74      0.72       482
           2       0.73      0.75      0.74       508

    accuracy                           0.73      1500
   macro avg       0.73      0.73      0.73      1500
weighted avg       0.73      0.73      0.73      1500

