# P2W2D2AM - Natural Language Processing - Part 1

---
## A. Understanding CountVectorizer

In [None]:
# Let's Define Simple Corpus

corpus = [
    'Saya sedang belajar data science',
    'data yang saya proses adalah data teks',
    'NLP adalah cabang besar didalam data science'
]

`CountVectorizer()` is a class in Scikit-Learn that implements the Bag of Words representation. Its two most commonly used methods are : 
* `.fit()` : learns the vocabulary from the corpus.
* `.transform()` : converts each document into a vector of token counts.

*See this [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for more details.*
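
To make the two steps concrete, here is a minimal pure-Python sketch of what `.fit()` and `.transform()` do conceptually. It is not Scikit-Learn's actual implementation; the helper names `tokenize`, `fit` and `transform` are made up for this illustration, and the token pattern assumes the default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = [
    'Saya sedang belajar data science',
    'data yang saya proses adalah data teks',
    'NLP adalah cabang besar didalam data science',
]

# Assumed default behaviour: lowercase, keep tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def fit(docs):
    """Collect the sorted vocabulary and map each token to a column index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocab):
    """Count tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

vocab = fit(corpus)
matrix = transform(corpus, vocab)
print(list(vocab))
for row in matrix:
    print(row)
```

Each row has one column per vocabulary term, so the second document gets a `2` in the `data` column because the word appears twice.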

In [None]:
# Collect the Vocabularies using CountVectorizer()

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
count_vect.fit(corpus)

CountVectorizer()

In [None]:
# See Vocabularies

count_vect.get_feature_names_out()

array(['adalah', 'belajar', 'besar', 'cabang', 'data', 'didalam', 'nlp',
       'proses', 'saya', 'science', 'sedang', 'teks', 'yang'],
      dtype=object)

In [None]:
# Transform from Corpus into Numerical Vector

corpus_count_vect = count_vect.transform(corpus)
corpus_count_vect

<3x13 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

Notes : 

* As you can see, the variable `corpus_count_vect` is stored as a **sparse matrix**. 
* A sparse matrix is a matrix in which most values are zero; the sparse format stores only the non-zero values. 
* From the above output, there are **`3 x 13` = 39 values across 13 vocabulary terms, but only 18 of them (46.15 %) are non-zero**. 
* This is just a simple corpus that contains 3 documents. 
* Imagine a corpus with many large documents: the vocabulary grows, and the vast majority of the values will be zero.
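
The percentage in the notes can be reproduced with a few lines of arithmetic. This is only a sketch; `density` is a helper name invented here, and the three numbers are read from the matrix summary printed above.

```python
def density(n_rows, n_cols, n_stored):
    """Fraction of cells that hold an explicitly stored (non-zero) value."""
    return n_stored / (n_rows * n_cols)

# Figures read from the matrix summary: 3 x 13 with 18 stored elements.
n_rows, n_cols, n_stored = 3, 13, 18
total = n_rows * n_cols
d = density(n_rows, n_cols, n_stored)

print(f"total cells   : {total}")
print(f"non-zero cells: {n_stored} ({d:.2%})")
print(f"zero cells    : {total - n_stored} ({1 - d:.2%})")
```

On a SciPy sparse matrix the same inputs are available as `.shape` and `.nnz`, so the check can be repeated for any vectorized corpus.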

In [None]:
# Let's See Contents of Sparse Matrix

print(corpus_count_vect)

  (0, 1)	1
  (0, 4)	1
  (0, 8)	1
  (0, 9)	1
  (0, 10)	1
  (1, 0)	1
  (1, 4)	2
  (1, 7)	1
  (1, 8)	1
  (1, 11)	1
  (1, 12)	1
  (2, 0)	1
  (2, 2)	1
  (2, 3)	1
  (2, 4)	1
  (2, 5)	1
  (2, 6)	1
  (2, 9)	1


As previously mentioned, a sparse matrix stores only the non-zero values. A dense matrix, on the other hand, stores every value (both zero and non-zero). See the comparison below as an illustration.

<img src='https://static.javatpoint.com/ds/images/types-of-sparse-matrices.png'>
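
As a rough illustration of the difference, the sketch below converts a dense list of lists into `(row, column, value)` triples and back. This coordinate-style layout is a simplification chosen for readability; it is not the Compressed Sparse Row format that SciPy actually uses, and `to_triples` / `to_dense` are hypothetical helper names. The values are the counts for the three example documents.

```python
# Dense counts for the three example documents (3 rows x 13 columns).
dense = [
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
]

def to_triples(matrix):
    """Keep only the non-zero cells as (row, column, value)."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix, filling every unlisted cell with zero."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        out[r][c] = v
    return out

triples = to_triples(dense)
print(len(triples), 'stored values instead of', len(dense) * len(dense[0]))
print(triples[:5])
```

Nothing is lost in the round trip: `to_dense(triples, 3, 13)` gives back the original matrix.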

In [None]:
# See `corpus_count_vect` as Dense Matrix

corpus_count_vect.toarray()

array([[0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 0, 1, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]])

In [None]:
# See `corpus_count_vect` with Token Names

import pandas as pd
pd.DataFrame(corpus_count_vect.toarray(), columns = count_vect.get_feature_names_out())

Unnamed: 0,adalah,belajar,besar,cabang,data,didalam,nlp,proses,saya,science,sedang,teks,yang
0,0,1,0,0,1,0,0,0,1,1,1,0,0
1,1,0,0,0,2,0,0,1,1,0,0,1,1
2,1,0,1,1,1,1,1,0,0,1,0,0,0



---
## B. Understanding TF-IDF Vectorizer

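`CountVectorizer()` and `TfidfVectorizer()` have several things in common, such as :
* Methods : `.fit()`, `.transform()`, `.get_feature_names_out()`, etc.
* Both store their output in a sparse matrix.

You can think of `TfidfVectorizer()` as a normalized form of `CountVectorizer()`: the raw counts are reweighted by inverse document frequency, and each row is then scaled to unit length.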

*For more details, please visit this [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).*
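
To see where the TF-IDF numbers come from, the following pure-Python sketch recomputes them by hand for the same three-sentence corpus. It assumes the default settings of `TfidfVectorizer()` (smoothed IDF, i.e. `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the helper names are invented for this example.

```python
import math
import re
from collections import Counter

corpus = [
    'Saya sedang belajar data science',
    'data yang saya proses adalah data teks',
    'NLP adalah cabang besar didalam data science',
]

def tokenize(doc):
    # Assumed default tokenization: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(d)) for d in corpus]
n_docs = len(docs)
vocab = sorted({tok for d in docs for tok in d})

# Document frequency and smoothed inverse document frequency per term.
df = {tok: sum(1 for d in docs if tok in d) for tok in vocab}
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf_row(counts):
    """Weight raw counts by idf, then scale the row to unit (L2) length."""
    raw = {tok: counts[tok] * idf[tok] for tok in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}

weights = [tfidf_row(d) for d in docs]
for tok in ('belajar', 'data', 'saya'):
    print(tok, round(weights[0][tok], 6))
```

Note how `data`, which appears in every document, ends up with the lowest weight in the first sentence, while `belajar`, which appears in only one document, gets the highest.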

In [None]:
# Collect the Vocabularies using TfidfVectorizer()

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer()
tf_idf_vect.fit(corpus)

TfidfVectorizer()

In [None]:
# See Vocabularies

tf_idf_vect.get_feature_names_out()

array(['adalah', 'belajar', 'besar', 'cabang', 'data', 'didalam', 'nlp',
       'proses', 'saya', 'science', 'sedang', 'teks', 'yang'],
      dtype=object)

In [None]:
# Transform from Corpus into Numerical Vector

corpus_tf_idf_vect = tf_idf_vect.transform(corpus)
corpus_tf_idf_vect

<3x13 sparse matrix of type '<class 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [None]:
# Let's See Contents of Sparse Matrix

print(corpus_tf_idf_vect)

  (0, 10)	0.5340933749435833
  (0, 9)	0.4061917781433946
  (0, 8)	0.4061917781433946
  (0, 4)	0.3154441510317797
  (0, 1)	0.5340933749435833
  (1, 12)	0.42439575294071896
  (1, 11)	0.42439575294071896
  (1, 8)	0.32276390910429226
  (1, 7)	0.42439575294071896
  (1, 4)	0.5013099366829596
  (1, 0)	0.32276390910429226
  (2, 9)	0.3241235393856436
  (2, 6)	0.42618350336974425
  (2, 5)	0.42618350336974425
  (2, 4)	0.2517108425440014
  (2, 3)	0.42618350336974425
  (2, 2)	0.42618350336974425
  (2, 0)	0.3241235393856436


In [None]:
# See `corpus_tf_idf_vect` as Dense Matrix

corpus_tf_idf_vect.toarray()

array([[0.        , 0.53409337, 0.        , 0.        , 0.31544415,
        0.        , 0.        , 0.        , 0.40619178, 0.40619178,
        0.53409337, 0.        , 0.        ],
       [0.32276391, 0.        , 0.        , 0.        , 0.50130994,
        0.        , 0.        , 0.42439575, 0.32276391, 0.        ,
        0.        , 0.42439575, 0.42439575],
       [0.32412354, 0.        , 0.4261835 , 0.4261835 , 0.25171084,
        0.4261835 , 0.4261835 , 0.        , 0.        , 0.32412354,
        0.        , 0.        , 0.        ]])

In [None]:
# See `corpus_count_vect` with Token Names

print('Dense Matrix - CountVectorizer()')
pd.DataFrame(corpus_count_vect.toarray(), columns = count_vect.get_feature_names_out())

Dense Matrix - CountVectorizer()


Unnamed: 0,adalah,belajar,besar,cabang,data,didalam,nlp,proses,saya,science,sedang,teks,yang
0,0,1,0,0,1,0,0,0,1,1,1,0,0
1,1,0,0,0,2,0,0,1,1,0,0,1,1
2,1,0,1,1,1,1,1,0,0,1,0,0,0


In [None]:
# See `corpus_tf_idf_vect` with Their Token's Name

print('Dense Matrix - TfidfVectorizer()')
pd.DataFrame(corpus_tf_idf_vect.toarray(), columns = tf_idf_vect.get_feature_names_out())

Dense Matrix - TfidfVectorizer()


Unnamed: 0,adalah,belajar,besar,cabang,data,didalam,nlp,proses,saya,science,sedang,teks,yang
0,0.0,0.534093,0.0,0.0,0.315444,0.0,0.0,0.0,0.406192,0.406192,0.534093,0.0,0.0
1,0.322764,0.0,0.0,0.0,0.50131,0.0,0.0,0.424396,0.322764,0.0,0.0,0.424396,0.424396
2,0.324124,0.0,0.426184,0.426184,0.251711,0.426184,0.426184,0.0,0.0,0.324124,0.0,0.0,0.0


## C. Case Study : YouTube Spam Comment

For this case study, we will try to **detect whether a YouTube comment is spam or not**. 

*You can find the dataset in this [link](https://raw.githubusercontent.com/danupurnomo/hacktiv8-trial-class/main/dataset/dataset.csv).*

---
### C.1. CountVectorizer

In [None]:
# Load Dataset

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/danupurnomo/hacktiv8-trial-class/main/dataset/dataset.csv')
data.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12rwfnyyrbsefonb232i5ehdxzkjzjs2,Lisa Wellas,,+447935454150 lovely girl talk to me xxx﻿,1
1,z130wpnwwnyuetxcn23xf5k5ynmkdpjrj04,jason graham,2015-05-29T02:26:10.652000,I always end up coming back to this song<br />﻿,0
2,z13vsfqirtavjvu0t22ezrgzyorwxhpf3,Ajkal Khan,,"my sister just received over 6,500 new <a rel=...",1
3,z12wjzc4eprnvja4304cgbbizuved35wxcs,Dakota Taylor,2015-05-29T02:13:07.810000,Cool﻿,0
4,z13xjfr42z3uxdz2223gx5rrzs3dt5hna,Jihad Naser,,Hello I&#39;am from Palastine﻿,1


In [None]:
# Check Distribution of Dataset

data.CLASS.value_counts()

1    245
0    203
Name: CLASS, dtype: int64

For this task, we only need column `CONTENT` as corpus and column `CLASS` as target.

In [None]:
# Splitting Dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.CONTENT, 
                                                    data.CLASS, 
                                                    test_size=0.3,
                                                    random_state=10)

print('Train Size : ', len(X_train))
print('Test Size  : ', len(X_test))

Train Size :  313
Test Size  :  135


In [None]:
# Convert String into Numerical

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

count_vect.fit(X_train)
X_train_vect = count_vect.transform(X_train)
X_test_vect = count_vect.transform(X_test)

In [None]:
# See X_train_vect

X_train_vect

<313x1297 sparse matrix of type '<class 'numpy.int64'>'
	with 5083 stored elements in Compressed Sparse Row format>

From this corpus, there are **405,961 values (313 * 1297)**, but only **5,083 of them are non-zero (1.25 %), across a vocabulary of 1,297 terms**.

In [None]:
# Train the Model

from sklearn.neighbors import KNeighborsClassifier

model_knn_count_vect = KNeighborsClassifier(n_neighbors=5)
model_knn_count_vect.fit(X_train_vect, y_train)

KNeighborsClassifier()

In [None]:
# Model Evaluation - Train Set

from sklearn.metrics import classification_report

y_pred_train = model_knn_count_vect.predict(X_train_vect)
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.79      1.00      0.88       143
           1       1.00      0.78      0.87       170

    accuracy                           0.88       313
   macro avg       0.90      0.89      0.88       313
weighted avg       0.90      0.88      0.88       313



In [None]:
# Model Evaluation - Test Set

from sklearn.metrics import classification_report

y_pred_test = model_knn_count_vect.predict(X_test_vect)
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.70      1.00      0.82        60
           1       1.00      0.65      0.79        75

    accuracy                           0.81       135
   macro avg       0.85      0.83      0.81       135
weighted avg       0.87      0.81      0.80       135



In [None]:
# Predict New Texts

new_texts = ['i love this artist',
             'please subscribe my channel at bla bla bla']

new_texts_count_vect = count_vect.transform(new_texts)
model_knn_count_vect.predict(new_texts_count_vect)

array([0, 1])

In [None]:
# Save Classification Report into a Dictionary

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

all_reports = {}
score_reports = {
    'train - precision' : precision_score(y_train, y_pred_train),
    'train - recall' : recall_score(y_train, y_pred_train),
    'train - accuracy' : accuracy_score(y_train, y_pred_train),
    'train - f1_score' : f1_score(y_train, y_pred_train),
    'test - precision' : precision_score(y_test, y_pred_test),
    'test - recall' : recall_score(y_test, y_pred_test),
    'test - accuracy_score' : accuracy_score(y_test, y_pred_test),
    'test - f1_score' : f1_score(y_test, y_pred_test),
}
all_reports['CountVectorizer'] = score_reports
pd.DataFrame(all_reports)

Unnamed: 0,CountVectorizer
test - accuracy_score,0.807407
test - f1_score,0.790323
test - precision,1.0
test - recall,0.653333
train - accuracy,0.878594
train - f1_score,0.874172
train - precision,1.0
train - recall,0.776471
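
The test-set scores for the spam class can be traced back to four confusion counts. The counts below are not printed anywhere in the notebook; they are inferred from the report (support 75 with recall 0.653333 gives 49 spam comments caught, and precision 1.0 means no non-spam comment was flagged), so treat this as a worked check rather than an official output.

```python
# Inferred test-set counts for the spam class (label 1):
#   tp = spam predicted as spam,   fn = spam predicted as not spam
#   fp = not spam predicted spam,  tn = not spam predicted as not spam
tp, fn, fp, tn = 49, 26, 0, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision : {precision:.6f}")
print(f"recall    : {recall:.6f}")
print(f"accuracy  : {accuracy:.6f}")
print(f"f1_score  : {f1:.6f}")
```

Read this way, the model never raises a false alarm but misses roughly a third of the spam comments, which is why recall is the metric to watch in the later experiments.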


---
### C.2. TfidfVectorizer

In [None]:
# Convert String into Numerical

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer()

tf_idf_vect.fit(X_train)
X_train_vect = tf_idf_vect.transform(X_train)
X_test_vect = tf_idf_vect.transform(X_test)

In [None]:
# See X_train_vect

X_train_vect

<313x1297 sparse matrix of type '<class 'numpy.float64'>'
	with 5083 stored elements in Compressed Sparse Row format>

In [None]:
# Train the Model

from sklearn.neighbors import KNeighborsClassifier

model_knn_tf_idf_vect = KNeighborsClassifier(n_neighbors=5)
model_knn_tf_idf_vect.fit(X_train_vect, y_train)

KNeighborsClassifier()

In [None]:
# Model Evaluation - Train Set

from sklearn.metrics import classification_report

y_pred_train = model_knn_tf_idf_vect.predict(X_train_vect)
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.54      1.00      0.70       143
           1       1.00      0.29      0.45       170

    accuracy                           0.61       313
   macro avg       0.77      0.64      0.58       313
weighted avg       0.79      0.61      0.56       313



In [None]:
# Model Evaluation - Test Set

from sklearn.metrics import classification_report

y_pred_test = model_knn_tf_idf_vect.predict(X_test_vect)
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.53      1.00      0.69        60
           1       1.00      0.28      0.44        75

    accuracy                           0.60       135
   macro avg       0.76      0.64      0.56       135
weighted avg       0.79      0.60      0.55       135



In [None]:
# Predict New Texts

new_texts = ['i love this artist',
             'please subscribe my channel at bla bla bla']

new_texts_tf_idf_vect = tf_idf_vect.transform(new_texts)
model_knn_tf_idf_vect.predict(new_texts_tf_idf_vect)

array([0, 1])

In [None]:
# Save Classification Report into a Dictionary

score_reports = {
    'train - precision' : precision_score(y_train, y_pred_train),
    'train - recall' : recall_score(y_train, y_pred_train),
    'train - accuracy' : accuracy_score(y_train, y_pred_train),
    'train - f1_score' : f1_score(y_train, y_pred_train),
    'test - precision' : precision_score(y_test, y_pred_test),
    'test - recall' : recall_score(y_test, y_pred_test),
    'test - accuracy_score' : accuracy_score(y_test, y_pred_test),
    'test - f1_score' : f1_score(y_test, y_pred_test),
}
all_reports['TfidfVectorizer'] = score_reports
pd.DataFrame(all_reports)

Unnamed: 0,CountVectorizer,TfidfVectorizer
train - precision,1.0,1.0
train - recall,0.776471,0.288235
train - accuracy,0.878594,0.613419
train - f1_score,0.874172,0.447489
test - precision,1.0,1.0
test - recall,0.653333,0.28
test - accuracy_score,0.807407,0.6
test - f1_score,0.790323,0.4375


**Conclusion : the TF-IDF model performs worse than the CountVectorizer model. Let's try to improve the performance of the CountVectorizer model.**

---
### C.3. CountVectorizer with Stopwords

<img src='https://kite4sky.files.wordpress.com/2019/09/c3.jpg'>

As shown above, stopwords are words that appear very frequently. In other words, they are the high-frequency words generally known as language-builder words (`is`, `the`, `then`, `that`, etc.). These words cannot tell you what a document is about; for example, if someone has written a product review, they carry none of its context.

For stopwords, you can use any public stopwords list such as : 
* [NLTK](https://www.nltk.org/nltk_data/) (search `Stopwords Corpus`)
* [PySastrawi](https://github.com/har07/PySastrawi)
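
Before plugging a stopword list into the vectorizer, it can help to see the filtering step in isolation. The sketch below uses a tiny hand-picked stopword set, which is an assumption made purely for illustration (real lists such as NLTK's are much longer), and `remove_stopwords` is a hypothetical helper name.

```python
import re

# Tiny hand-picked list, for illustration only.
STOP_WORDS = {'i', 'my', 'this', 'to', 'at', 'is', 'the', 'that'}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Lowercase, split into word tokens, and drop every stopword."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(remove_stopwords('I love this artist'))
print(remove_stopwords('please subscribe my channel at bla bla bla'))
```

The words that remain (`love`, `artist`, `subscribe`, `channel`, ...) are the ones that actually hint at the topic of each comment.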

In [None]:
# Download Stopwords

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Define Stopwords

## Load Stopwords from NLTK
from nltk.corpus import stopwords
stop_words_en = stopwords.words("english")

print('Stopwords from NLTK')
print(len(stop_words_en), stop_words_en)
print('')

## Create New Stopwords
new_stop_words = ['aye', 'mine', 'have']

## Merge Stopwords and Drop Duplicates
stop_words_en = stop_words_en + new_stop_words
stop_words_en = list(set(stop_words_en))
print('Our Final Stopwords')
print(len(stop_words_en), stop_words_en)

Stopwords from NLTK
179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',

In [None]:
# Convert String into Numerical

from sklearn.feature_extraction.text import CountVectorizer
count_vect_sw = CountVectorizer(stop_words = stop_words_en)

count_vect_sw.fit(X_train)
X_train_count_vect_sw = count_vect_sw.transform(X_train)
X_test_count_vect_sw = count_vect_sw.transform(X_test)

In [None]:
# See X_train_count_vect_sw

X_train_count_vect_sw

<313x1190 sparse matrix of type '<class 'numpy.int64'>'
	with 3306 stored elements in Compressed Sparse Row format>

Comparison of the document-term matrix without and with stopword removal :
* Without Stopwords (only CountVectorizer)
  - Total values : 313*1297 = 405,961
  - Total non-zero values : 5,083
  - Percentage : 1.25 %
  - Total vocabularies : 1,297
* With Stopwords
  - Total values : 313*1190 = 372,470
  - Total non-zero values : 3,306
  - Percentage : 0.89 %
  - Total vocabularies : 1,190

As shown above, removing stopwords : 
* Reduces the non-zero values from 5,083 to 3,306 (a **34.96 %** reduction), and
* Reduces the vocabulary from 1,297 to 1,190 terms (an **8.25 %** reduction).
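
The percentages above are plain relative reductions and can be double-checked as follows. `pct_reduction` is a helper name made up for this sketch; the input numbers are the ones reported by the two sparse matrices.

```python
def pct_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100

nonzero_drop = pct_reduction(5083, 3306)   # stored (non-zero) values
vocab_drop = pct_reduction(1297, 1190)     # vocabulary size

# Share of non-zero cells in each matrix.
density_plain = 5083 / (313 * 1297) * 100
density_sw = 3306 / (313 * 1190) * 100

print(f"non-zero values reduced by {nonzero_drop:.2f} %")
print(f"vocabulary reduced by      {vocab_drop:.2f} %")
print(f"density: {density_plain:.2f} % -> {density_sw:.2f} %")
```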

In [None]:
# Train the Model

from sklearn.neighbors import KNeighborsClassifier
model_knn_count_vect_sw = KNeighborsClassifier(n_neighbors=5)

model_knn_count_vect_sw.fit(X_train_count_vect_sw, y_train)

KNeighborsClassifier()

In [None]:
# Model Evaluation - Train Set

from sklearn.metrics import classification_report

y_pred_train = model_knn_count_vect_sw.predict(X_train_count_vect_sw)
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.78      1.00      0.88       143
           1       1.00      0.76      0.87       170

    accuracy                           0.87       313
   macro avg       0.89      0.88      0.87       313
weighted avg       0.90      0.87      0.87       313



In [None]:
# Model Evaluation - Test Set

from sklearn.metrics import classification_report

y_pred_test = model_knn_count_vect_sw.predict(X_test_count_vect_sw)
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.70      1.00      0.82        60
           1       1.00      0.65      0.79        75

    accuracy                           0.81       135
   macro avg       0.85      0.83      0.81       135
weighted avg       0.87      0.81      0.80       135



In [None]:
# Predict New Texts

new_texts = ['i love this artist',
             'please subscribe my channel at bla bla bla']

new_texts_count_vect_sw = count_vect_sw.transform(new_texts)
model_knn_count_vect_sw.predict(new_texts_count_vect_sw)

array([0, 1])

In [None]:
# Save Classification Report into a Dictionary

score_reports = {
    'train - precision' : precision_score(y_train, y_pred_train),
    'train - recall' : recall_score(y_train, y_pred_train),
    'train - accuracy' : accuracy_score(y_train, y_pred_train),
    'train - f1_score' : f1_score(y_train, y_pred_train),
    'test - precision' : precision_score(y_test, y_pred_test),
    'test - recall' : recall_score(y_test, y_pred_test),
    'test - accuracy_score' : accuracy_score(y_test, y_pred_test),
    'test - f1_score' : f1_score(y_test, y_pred_test),
}
all_reports['CountVectorizer + Stopwords'] = score_reports
pd.DataFrame(all_reports)

Unnamed: 0,CountVectorizer,TfidfVectorizer,CountVectorizer + Stopwords
train - precision,1.0,1.0,1.0
train - recall,0.776471,0.288235,0.764706
train - accuracy,0.878594,0.613419,0.872204
train - f1_score,0.874172,0.447489,0.866667
test - precision,1.0,1.0,1.0
test - recall,0.653333,0.28,0.653333
test - accuracy_score,0.807407,0.6,0.807407
test - f1_score,0.790323,0.4375,0.790323


---
### C.4. CountVectorizer with Stopwords and Stemming

There are many stemming algorithms in NLTK, such as Porter, Lancaster, etc. You can experiment to find the stemming algorithm that best fits your problem.

*For more details, please visit this [link](https://www.nltk.org/api/nltk.stem.html). To see how to use them, please visit this [link](https://www.nltk.org/howto/stem.html).*

In [None]:
# Load Dataset

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/danupurnomo/hacktiv8-trial-class/main/dataset/dataset.csv')
data.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12rwfnyyrbsefonb232i5ehdxzkjzjs2,Lisa Wellas,,+447935454150 lovely girl talk to me xxx﻿,1
1,z130wpnwwnyuetxcn23xf5k5ynmkdpjrj04,jason graham,2015-05-29T02:26:10.652000,I always end up coming back to this song<br />﻿,0
2,z13vsfqirtavjvu0t22ezrgzyorwxhpf3,Ajkal Khan,,"my sister just received over 6,500 new <a rel=...",1
3,z12wjzc4eprnvja4304cgbbizuved35wxcs,Dakota Taylor,2015-05-29T02:13:07.810000,Cool﻿,0
4,z13xjfr42z3uxdz2223gx5rrzs3dt5hna,Jihad Naser,,Hello I&#39;am from Palastine﻿,1


In [None]:
# Stem The Corpus

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_content = []

for doc in data.CONTENT:
  result = [stemmer.stem(word) for word in doc.split()]
  result = ' '.join(result)
  stemmed_content.append(result)

data['STEMMED_CONTENT'] = stemmed_content
data[['CONTENT', 'STEMMED_CONTENT']]

Unnamed: 0,CONTENT,STEMMED_CONTENT
0,+447935454150 lovely girl talk to me xxx﻿,+447935454150 love girl talk to me xxx﻿
1,I always end up coming back to this song<br />﻿,i alway end up come back to thi song<br />﻿
2,"my sister just received over 6,500 new <a rel=...","my sister just receiv over 6,500 new <a rel=""n..."
3,Cool﻿,cool﻿
4,Hello I&#39;am from Palastine﻿,hello i&#39;am from palastine﻿
...,...,...
443,SUBSCRIBE TO MY CHANNEL X PLEASE!. SPARE,subscrib to my channel x please!. spare
444,Check out my videos guy! :) Hope you guys had ...,check out my video guy! :) hope you guy had a ...
445,3 yrs ago I had a health scare but thankfully ...,3 yr ago i had a health scare but thank i’m ok...
446,Rihanna looks so beautiful with red hair ;)﻿,rihanna look so beauti with red hair ;)﻿


In [None]:
# Splitting Dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.STEMMED_CONTENT, 
                                                    data.CLASS, 
                                                    test_size=0.3,
                                                    random_state=10)

print('Train Size : ', len(X_train))
print('Test Size  : ', len(X_test))

Train Size :  313
Test Size  :  135


In [None]:
# Download Stopwords

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Define Stopwords

## Load Stopwords from NLTK
from nltk.corpus import stopwords
stop_words_en = stopwords.words("english")

print('Stopwords from NLTK')
print(len(stop_words_en), stop_words_en)
print('')

## Create New Stopwords
new_stop_words = ['aye', 'mine', 'have']

## Merge Stopwords and Drop Duplicates
stop_words_en = stop_words_en + new_stop_words
stop_words_en = list(set(stop_words_en))
print('Our Final Stopwords')
print(len(stop_words_en), stop_words_en)

Stopwords from NLTK
179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',

In [None]:
# Convert String into Numerical

from sklearn.feature_extraction.text import CountVectorizer
count_vect_sw_stem = CountVectorizer(stop_words = stop_words_en)

count_vect_sw_stem.fit(X_train)
X_train_count_vect_sw_stem = count_vect_sw_stem.transform(X_train)
X_test_count_vect_sw_stem = count_vect_sw_stem.transform(X_test)

In [None]:
# See X_train_count_vect_sw_stem

X_train_count_vect_sw_stem

<313x1142 sparse matrix of type '<class 'numpy.int64'>'
	with 3430 stored elements in Compressed Sparse Row format>

Comparison of the document-term matrix without stopword removal, with stopword removal, and with stopword removal plus stemming :
* Without Stopwords (only CountVectorizer)
  - Total values : 313*1297 = 405,961
  - Total non-zero values : 5,083
  - Percentage : 1.25 %
  - Total vocabularies : 1,297
* With Stopwords
  - Total values : 313*1190 = 372,470
  - Total non-zero values : 3,306
  - Percentage : 0.89 %
  - Total vocabularies : 1,190
* With Stopwords and Stemming
  - Total values : 313*1142 = 357,446
  - Total non-zero values : 3,430
  - Percentage : 0.96 %
  - Total vocabularies : 1,142

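As shown above, with stopwords and stemming : 
* We reduce the vocabulary from 1,297 to 1,142 terms (**11.95 %**).
* The non-zero count, however, rises slightly compared with stopwords alone (3,306 to 3,430). A likely reason is that the corpus is stemmed before stopword removal, so stemmed forms such as `thi` (from `this`) no longer match the unstemmed stopword list.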
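
The order of operations in this section is worth a quick check: the text is stemmed first, and the unstemmed stopword list is applied afterwards inside the vectorizer. The small sketch below uses one stemmed sentence from the `STEMMED_CONTENT` column (with the trailing HTML dropped) and a handful of entries from the NLTK list; it only illustrates the mismatch and does not measure its effect on the model.

```python
# Stemmed example taken from the STEMMED_CONTENT column shown earlier.
stemmed_doc = 'i alway end up come back to thi song'

# A few entries from the NLTK English stopword list (unstemmed forms).
stop_words = {'i', 'up', 'to', 'this'}

kept = [tok for tok in stemmed_doc.split() if tok not in stop_words]
print(kept)

# 'thi' is the stemmed form of the stopword 'this', but it is not in the
# list, so it survives the filter and stays in the vocabulary.
print('thi' in kept)
```

Two possible remedies, both left as experiments: remove stopwords before stemming, or run the stopword list through the same stemmer (for example `stemmer.stem(word)` for each entry) before passing it to `CountVectorizer`.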

In [None]:
# Train the Model

from sklearn.neighbors import KNeighborsClassifier
model_knn_count_vect_sw_stem = KNeighborsClassifier(n_neighbors=5)

model_knn_count_vect_sw_stem.fit(X_train_count_vect_sw_stem, y_train)

KNeighborsClassifier()

In [None]:
# Model Evaluation - Train Set

from sklearn.metrics import classification_report

y_pred_train = model_knn_count_vect_sw_stem.predict(X_train_count_vect_sw_stem)
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.79      0.99      0.88       143
           1       0.99      0.78      0.87       170

    accuracy                           0.87       313
   macro avg       0.89      0.88      0.87       313
weighted avg       0.89      0.87      0.87       313



In [None]:
# Model Evaluation - Test Set

from sklearn.metrics import classification_report

y_pred_test = model_knn_count_vect_sw_stem.predict(X_test_count_vect_sw_stem)
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.66      0.98      0.79        60
           1       0.98      0.59      0.73        75

    accuracy                           0.76       135
   macro avg       0.82      0.78      0.76       135
weighted avg       0.83      0.76      0.76       135



In [None]:
# Predict New Texts

new_texts = ['i love this artist',
             'please subscribe my channel at bla bla bla']

new_texts_count_vect_sw_stem = count_vect_sw_stem.transform(new_texts)
model_knn_count_vect_sw_stem.predict(new_texts_count_vect_sw_stem)

array([0, 0])

In [None]:
# Save Classification Report into a Dictionary

score_reports = {
    'train - precision' : precision_score(y_train, y_pred_train),
    'train - recall' : recall_score(y_train, y_pred_train),
    'train - accuracy' : accuracy_score(y_train, y_pred_train),
    'train - f1_score' : f1_score(y_train, y_pred_train),
    'test - precision' : precision_score(y_test, y_pred_test),
    'test - recall' : recall_score(y_test, y_pred_test),
    'test - accuracy_score' : accuracy_score(y_test, y_pred_test),
    'test - f1_score' : f1_score(y_test, y_pred_test),
}
all_reports['CountVectorizer + Stopwords + Stemming'] = score_reports
pd.DataFrame(all_reports)

Unnamed: 0,CountVectorizer,TfidfVectorizer,CountVectorizer + Stopwords,CountVectorizer + Stopwords + Stemming
train - precision,1.0,1.0,1.0,0.985075
train - recall,0.776471,0.288235,0.764706,0.776471
train - accuracy,0.878594,0.613419,0.872204,0.872204
train - f1_score,0.874172,0.447489,0.866667,0.868421
test - precision,1.0,1.0,1.0,0.977778
test - recall,0.653333,0.28,0.653333,0.586667
test - accuracy_score,0.807407,0.6,0.807407,0.762963
test - f1_score,0.790323,0.4375,0.790323,0.733333


---
Conclusion : 
* Based on the Classification Reports above, the model that uses only `CountVectorizer()` has the best performance compared to the other models.

* The `CountVectorizer + Stopwords` model matches the best model on the test set while using a smaller vocabulary, so we can pick it as the chosen model.

* It is still too early to judge whether Stopwords and/or Stemming improve performance in this case. Try other algorithms such as Multinomial Naive Bayes or Random Forest, with and without stopwords and stemming, and compare their performance.
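
As a starting point for that follow-up, here is a minimal sketch of swapping KNN for Multinomial Naive Bayes. The four training comments and their labels are hypothetical toy data invented for this example (1 = spam, 0 = not spam), and the variable name `model_nb` is also an assumption; the result says nothing about how the model would score on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments, for illustration only (1 = spam, 0 = not spam).
train_texts = [
    'subscribe to my channel',
    'check out my channel please subscribe',
    'i love this song',
    'great song love it',
]
train_labels = [1, 1, 0, 0]

# Vectorizer and classifier in one object, so the vocabulary is learned
# only from whatever data is passed to .fit().
model_nb = make_pipeline(CountVectorizer(), MultinomialNB())
model_nb.fit(train_texts, train_labels)

pred = model_nb.predict(['please subscribe', 'love this song'])
print(pred)
```

In the notebook itself, the toy lists would be replaced by `X_train` and `y_train`, and `CountVectorizer(stop_words=stop_words_en)` could be used to test the stopword variant with the same pipeline.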