# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
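
The workflow described here (80/20 split, 10-fold cross-validation on the training portion, final evaluation on held-out test data) could be sketched roughly as follows. This is only an illustration under stated assumptions: the synthetic data from `make_classification` and the `LogisticRegression` classifier are stand-ins chosen so the snippet runs on its own, not part of the assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for vectorized reviews (assumption: real features come from the corpus).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out a test set to mimic the separate test file.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 80% training / 20% validation split of the training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, stratify=y_train_full, random_state=0)

clf = LogisticRegression(max_iter=1000)

# 10-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv)

# Fit once on the training portion, check on validation, then report on test.
clf.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(cv_scores.mean(), val_accuracy, test_accuracy)
```

Keeping the validation and test sets out of the cross-validation loop means the test score is only looked at once, after model choices are made.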


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
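
For reference, the four measures listed here can be computed by hand from the counts of true/false positives and negatives. The sketch below is a minimal illustration; the function name `binary_metrics` and the toy labels are hypothetical and chosen only for this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice `sklearn.metrics.classification_report` reports the same quantities per class.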


In [24]:
# Write your code here
import pandas as pd
from nltk.corpus import stopwords
import string
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import model_selection


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
from google.colab import files
u = files.upload()

Saving stsa-train.txt to stsa-train.txt


In [26]:
u = files.upload()

Saving stsa-test.txt to stsa-test.txt


In [27]:
def  text_split(x):
  x = x.split(" ")
  return x[0], x[1:]

In [28]:
train = pd.DataFrame(columns= ['text', 'target'])
test = pd.DataFrame(columns= ['text', 'target'])
with open('stsa-train.txt') as f:
  for i in f.readlines():
    target, text = text_split(i)
    train.loc[len(train.index)] = [text, target]

with open('stsa-test.txt') as f:
  for i in f.readlines():
    target, text = text_split(i)
    test.loc[len(test.index)] = [text, target]

In [29]:
test

Unnamed: 0,text,target
0,"[no, movement, ,, no, yuks, ,, not, much, of, ...",0
1,"[a, gob, of, drivel, so, sickly, sweet, ,, eve...",0
2,"[gangs, of, new, york, is, an, unapologetic, m...",0
3,"[we, never, really, feel, involved, with, the,...",0
4,"[this, is, one, of, polanski, 's, best, films,...",1
...,...,...
1816,"[an, often-deadly, boring, ,, strange, reading...",0
1817,"[the, problem, with, concept, films, is, that,...",0
1818,"[safe, conduct, ,, however, ambitious, and, we...",0
1819,"[a, film, made, with, as, little, wit, ,, inte...",0


In [30]:
train

Unnamed: 0,text,target
0,"[a, stirring, ,, funny, and, finally, transpor...",1
1,"[apparently, reassembled, from, the, cutting-r...",0
2,"[they, presume, their, audience, wo, n't, sit,...",0
3,"[this, is, a, visually, stunning, rumination, ...",1
4,"[jonathan, parker, 's, bartleby, should, have,...",1
...,...,...
6915,"[painful, ,, horrifying, and, oppressively, tr...",1
6916,"[take, care, is, nicely, performed, by, a, qui...",0
6917,"[the, script, covers, huge, ,, heavy, topics, ...",0
6918,"[a, seriously, bad, film, with, seriously, war...",0


In [31]:
stop_word = set(stopwords.words('english'))
digits = set(string.digits)

def text_manipulation(texts):
  """Lowercase tokens, drop stop words and pure numbers, strip embedded digits, keep tokens of length >= 3."""
  cleaned = []
  for text in texts:
    word = text.strip().lower()
    if word in stop_word or word.isdigit():
      continue
    if word.isalnum():
      # Remove digit characters embedded in alphanumeric tokens.
      word = ''.join(ch for ch in word if ch not in digits)
    # The length filter also discards single punctuation tokens such as ',' or '.'.
    if len(word) >= 3:
      cleaned.append(word)
  return cleaned

In [32]:
train['text'] = train['text'].apply(text_manipulation)
test['text'] = test['text'].apply(text_manipulation)

In [33]:
c = Counter()
for text in train['text']:
  for word in text:
    c[word] += 1
print(c)



In [34]:
#selecting top 200 features

top_200 = {k: v for k, v in sorted(c.items(), key=lambda item: item[1], reverse=True)}
top_200 = {k:c[k] for k in list(top_200.keys())[:200]}
features = list(top_200.keys())
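
The same top-N selection can be written more directly with `Counter.most_common`. A minimal sketch, assuming tokenized documents; `docs` and `top_n` are toy values made up for this example.

```python
from collections import Counter

# Toy tokenized documents (hypothetical data for illustration).
docs = [["good", "film", "good"], ["bad", "film"], ["film", "plot"]]

counts = Counter(word for doc in docs for word in doc)

# most_common(n) returns the n highest-count (word, count) pairs, highest first.
top_n = 2
top_features = [word for word, _ in counts.most_common(top_n)]
print(top_features)
```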


In [35]:
import pandas as pd
df_train = pd.DataFrame(columns = features)
df_test = pd.DataFrame(columns = features)


In [36]:
def term_document_matrix(word_list):
    term_matrix = dict()
    for word in features:
        if(word in word_list):
            occ = word_list.count(word)
            term_matrix[word] = occ
        else:
            term_matrix[word] = 0
    return term_matrix
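
The row-by-row term counting could also be done in one call with scikit-learn's `CountVectorizer`. The sketch below assumes pre-tokenized documents; `docs` and `vocab` are hypothetical toy values, and the callable `analyzer` is used so the token lists are passed through unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy pre-tokenized documents and a fixed feature list (hypothetical data).
docs = [["film", "good", "film"], ["movie"]]
vocab = ["film", "movie", "good"]

# A callable analyzer lets CountVectorizer accept token lists as-is;
# a fixed vocabulary keeps the column order under our control.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, vocabulary=vocab)
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document
dense = counts.toarray().tolist()
print(dense)
```

Building the whole matrix at once avoids appending to a DataFrame one row at a time, which tends to be slow for thousands of documents.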


In [37]:
for col, row in train.iterrows():
  w_list = term_document_matrix(row['text'])
  df_train.loc[len(df_train.index)] = w_list

In [38]:
df_train['Target'] = train['target']



In [39]:
df_train

Unnamed: 0,film,movie,n't,...,one,like,story,-rrb-,-lrb-,even,....1,genre,need,simply,idea,smart,plays,series,goes,whole,Target
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6915,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
6916,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6917,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6918,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
for col, row in test.iterrows():
  w_list = term_document_matrix(row['text'])
  df_test.loc[len(df_test.index)] = w_list

In [41]:
df_test['Target'] = test['target']
df_test



Unnamed: 0,film,movie,n't,...,one,like,story,-rrb-,-lrb-,even,....1,genre,need,simply,idea,smart,plays,series,goes,whole,Target
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1816,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1817,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1818,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1819,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Multinomial NB**

In [42]:
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
tr, val  = train_test_split(df_train, test_size=0.2)
x = tr[features]
y = tr['Target']
mul_nav_model = MultinomialNB()

In [43]:
#cross validation
cs  = model_selection.cross_val_score(mul_nav_model, x, y, cv=10)
print(cs)

[0.6101083  0.64440433 0.66967509 0.65703971 0.62093863 0.65523466
 0.64014467 0.65641953 0.65461121 0.66184448]
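
The ten fold scores are easier to compare across models when summarized as a mean and a spread. A small sketch using only the standard library; the numbers are copied from the Multinomial NB cross-validation output in this cell.

```python
from statistics import mean, stdev

# Fold accuracies copied from the Multinomial NB cross-validation output.
fold_scores = [0.6101083, 0.64440433, 0.66967509, 0.65703971, 0.62093863,
               0.65523466, 0.64014467, 0.65641953, 0.65461121, 0.66184448]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across folds
print(f"CV accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")
```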


In [44]:
x = df_train[features]
y = df_train['Target']
mul_nav_model.fit(x,y)

In [45]:
pred = mul_nav_model.predict(df_test[features])

In [46]:
print(classification_report(df_test['Target'], pred))

              precision    recall  f1-score   support

           0       0.69      0.56      0.62       912
           1       0.63      0.75      0.68       909

    accuracy                           0.66      1821
   macro avg       0.66      0.66      0.65      1821
weighted avg       0.66      0.66      0.65      1821



**SVM**

In [48]:
from sklearn import svm
svm_model = svm.SVC()
tr, val  = train_test_split(df_train, test_size=0.2)
x = tr[features]
y = tr['Target']

In [49]:
cs  = model_selection.cross_val_score(svm_model, x, y, cv=10)
print(cs)
x = df_train[features]
y = df_train['Target']
svm_model.fit(x,y)

[0.66606498 0.61913357 0.66787004 0.64620939 0.58303249 0.68050542
 0.62567812 0.62929476 0.61663653 0.65280289]


In [50]:
pred = svm_model.predict(df_test[features])
print(classification_report(df_test['Target'], pred))

              precision    recall  f1-score   support

           0       0.69      0.55      0.62       912
           1       0.63      0.75      0.68       909

    accuracy                           0.65      1821
   macro avg       0.66      0.65      0.65      1821
weighted avg       0.66      0.65      0.65      1821



**KNN**

In [51]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()

In [52]:
tr, val  = train_test_split(df_train, test_size=0.2)
x = tr[features]
y = tr['Target']
cs  = model_selection.cross_val_score(knn_model, x, y, cv=10)
print(cs)

[0.55956679 0.56137184 0.54873646 0.62454874 0.60469314 0.62635379
 0.60759494 0.59312839 0.58589512 0.59312839]


In [53]:
x = df_train[features]
y = df_train['Target']
knn_model.fit(x,y)
pred = knn_model.predict(df_test[features])
print(classification_report(df_test['Target'], pred))

              precision    recall  f1-score   support

           0       0.57      0.65      0.61       912
           1       0.59      0.51      0.55       909

    accuracy                           0.58      1821
   macro avg       0.58      0.58      0.58      1821
weighted avg       0.58      0.58      0.58      1821



**Decision Tree**

In [54]:
from sklearn import tree
tree_model = tree.DecisionTreeClassifier()

In [55]:
tr, val  = train_test_split(df_train, test_size=0.2)
x = tr[features]
y = tr['Target']

In [56]:
cs  = model_selection.cross_val_score(tree_model, x, y, cv=10)
print(cs)

[0.58483755 0.60288809 0.6299639  0.61732852 0.57220217 0.55776173
 0.60216998 0.60759494 0.57866184 0.57142857]


In [57]:
x = df_train[features]
y = df_train['Target']
tree_model.fit(x,y)

In [58]:
pred = tree_model.predict(df_test[features])
print(classification_report(df_test['Target'], pred))

              precision    recall  f1-score   support

           0       0.58      0.65      0.61       912
           1       0.60      0.53      0.57       909

    accuracy                           0.59      1821
   macro avg       0.59      0.59      0.59      1821
weighted avg       0.59      0.59      0.59      1821



**Random Forest**

In [59]:
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier()

In [60]:
tr, val  = train_test_split(df_train, test_size=0.2)
x = tr[features]
y = tr['Target']

In [61]:
cs  = model_selection.cross_val_score(forest_model, x, y, cv=10)
print(cs)

[0.61913357 0.65523466 0.63537906 0.59747292 0.62093863 0.58303249
 0.60216998 0.63110307 0.62748644 0.58951175]


In [62]:
x = df_train[features]
y = df_train['Target']
forest_model.fit(x,y)

In [63]:
pred = forest_model.predict(df_test[features])
print(classification_report(df_test['Target'], pred))

              precision    recall  f1-score   support

           0       0.61      0.68      0.64       912
           1       0.64      0.57      0.60       909

    accuracy                           0.62      1821
   macro avg       0.62      0.62      0.62      1821
weighted avg       0.62      0.62      0.62      1821



**XGBoost**

In [64]:
from xgboost import XGBClassifier
bst_model = XGBClassifier()

In [66]:
def string_ch(x):
  return int(x)

# XGBoost needs numeric labels, so the string targets are converted to int.
tr, val  = train_test_split(df_train, test_size=0.2)
x = tr[features]
y = tr['Target'].apply(string_ch)
cs  = model_selection.cross_val_score(bst_model, x, y, cv=10)
print(cs)

[0.63176895 0.65884477 0.68231047 0.62454874 0.66245487 0.62635379
 0.61663653 0.68716094 0.62206148 0.62025316]


In [67]:
x = df_train[features]
y = df_train['Target'].apply(string_ch)
bst_model.fit(x,y)

In [68]:
pred = bst_model.predict(df_test[features])
print(classification_report(df_test['Target'].apply(string_ch), pred))

              precision    recall  f1-score   support

           0       0.70      0.56      0.62       912
           1       0.63      0.75      0.69       909

    accuracy                           0.66      1821
   macro avg       0.66      0.66      0.65      1821
weighted avg       0.66      0.66      0.65      1821



**Word2Vec**

In [69]:
from gensim.models import Word2Vec

In [70]:
t_df = pd.concat([train, test])

In [71]:
word2vec = Word2Vec( vector_size = 100, negative=5, hs=1, min_count=2, sample = 0)

In [72]:
word2vec.build_vocab([i for i in t_df['text']])

In [73]:
words = set(word2vec.wv.index_to_key )

In [75]:
import numpy as np

# Represent each document as the mean of its in-vocabulary word vectors.
# Documents with no in-vocabulary words become rows of NaN.
final_1 = []
for tokens in t_df['text']:
  vectors = [word2vec.wv[token] for token in tokens if token in words]
  if vectors:
    final_1.append(np.mean(vectors, axis=0))
  else:
    final_1.append(np.full(word2vec.vector_size, np.nan))
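
The averaging step can be illustrated on a tiny example without gensim. This is a sketch under stated assumptions: `average_vectors` and `toy_vectors` are hypothetical names, and the 3-dimensional vectors stand in for learned embeddings. In gensim, `build_vocab` only initializes the vectors; a call to `Word2Vec.train` is what actually learns them.

```python
def average_vectors(tokens, vectors, size):
    """Mean of the vectors of the known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 3-dimensional embeddings, standing in for trained Word2Vec vectors.
toy_vectors = {"good": [1.0, 0.0, 2.0], "film": [3.0, 2.0, 0.0]}

doc_vec = average_vectors(["good", "film", "unknown"], toy_vectors, size=3)
empty_vec = average_vectors(["unknown"], toy_vectors, size=3)
print(doc_vec, empty_vec)
```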


In [76]:
p = pd.DataFrame(final_1)
p

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.000757,0.002255,0.002030,-0.005421,0.000784,-0.005349,0.004690,0.001427,0.001126,0.001172,...,0.001263,0.000432,0.001635,-0.002300,0.003737,-0.000247,0.002996,0.003703,0.000262,-0.001648
1,-0.004768,0.001027,0.003397,0.000224,-0.004614,0.000423,0.000570,0.001186,-0.000396,0.001570,...,0.005670,0.003467,-0.000930,0.006921,0.001199,0.002495,-0.000634,-0.001536,0.003766,0.001292
2,0.000609,-0.000492,0.002113,0.000257,0.001667,0.000425,-0.000737,0.002463,0.000342,0.000585,...,0.001447,0.003373,-0.000286,0.000649,0.002566,0.003283,0.001632,0.000894,0.000054,0.002296
3,-0.001654,-0.001845,0.003310,0.000913,0.000485,-0.000426,-0.001738,-0.000462,-0.000705,0.000087,...,0.002566,0.001909,-0.001976,0.004352,0.002855,0.003930,0.001311,-0.000465,-0.001840,0.002259
4,0.001410,-0.001984,0.000076,-0.001960,-0.002929,-0.000384,0.004550,0.003672,-0.000427,0.002060,...,0.000783,0.000587,0.000825,0.001727,-0.000057,0.002387,-0.002041,0.001970,-0.000916,-0.001172
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8736,0.001164,-0.000600,-0.000596,-0.002470,0.001388,0.000753,-0.000593,-0.001227,-0.000979,-0.002881,...,0.000382,0.001057,0.002783,-0.000343,-0.003848,-0.002693,0.000603,-0.001110,0.001082,-0.000573
8737,-0.002242,-0.000542,0.003749,-0.000007,-0.001611,-0.002029,0.002017,0.000051,-0.002733,-0.000931,...,0.001434,0.000679,0.002789,-0.004442,0.005002,-0.003213,0.001189,-0.001125,-0.000112,0.003282
8738,0.003420,-0.000010,-0.001959,-0.000289,0.000367,0.000621,-0.000902,0.002861,-0.000855,-0.000811,...,-0.000611,0.004264,-0.000251,-0.001921,-0.000317,-0.000563,0.000117,-0.001657,0.003757,0.002098
8739,-0.001526,-0.001371,0.001291,-0.001522,-0.000168,0.000012,0.000760,0.001995,-0.001011,0.001270,...,-0.000566,0.001764,0.001848,-0.000926,0.001023,0.003854,-0.002254,-0.002976,0.000878,-0.000610


In [77]:
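t_df = t_df.reset_index()
p['target'] = t_df['target']
mul_nav_model = MultinomialNB()
# Embedding columns (everything except the label).
f = list(p.columns)
f.remove('target')
tr, tes  = train_test_split(p, test_size=0.2)
x = p[f]
y = p['target']
# Note: x and y are reassigned to the bag-of-words features here, so the
# report printed by this cell reflects MultinomialNB on bag-of-words counts,
# scored on the training data, not on the Word2Vec features.
x = df_train[features]
y = df_train['Target']
mul_nav_model.fit(x,y)
pred = mul_nav_model.predict(x)
print(classification_report(df_train['Target'], pred))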

              precision    recall  f1-score   support

           0       0.68      0.59      0.63      3310
           1       0.66      0.75      0.70      3610

    accuracy                           0.67      6920
   macro avg       0.67      0.67      0.67      6920
weighted avg       0.67      0.67      0.67      6920



**BERT**

In [78]:
pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.70.0-py3-none-any.whl (315 kB)
Collecting datasets (from simpletransformers)
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)
Collecting wandb>=0.10.32 (from simpletransformers)


In [79]:
import string
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import torch
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [83]:
cuda_available = torch.cuda.is_available()
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "use_early_stopping": False,
              "num_train_epochs": 3}

model = ClassificationModel(
    "bert", 'bert-base-uncased',
    num_labels=2,
    args=train_args,
    use_cuda=cuda_available
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [84]:
train['target'] = train['target'].apply(string_ch)

In [None]:
model.train_model(train)



  0%|          | 0/13 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/865 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/865 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/865 [00:00<?, ?it/s]

In [91]:
test


Unnamed: 0,text,target
0,"[movement, yuks, much, anything]",0
1,"[gob, drivel, sickly, sweet, even, eager, cons...",0
2,"[gangs, new, york, unapologetic, mess, whose, ...",0
3,"[never, really, feel, involved, story, ideas, ...",0
4,"[one, polanski, best, films]",1
...,...,...
1816,"[often-deadly, boring, strange, reading, class...",0
1817,"[problem, concept, films, concept, poor, one, ...",0
1818,"[safe, conduct, however, ambitious, well-inten...",0
1819,"[film, made, little, wit, interest, profession...",0


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [3]:
test


NameError: name 'test' is not defined

In [104]:
t_df = pd.concat([train, test]).reset_index()

In [105]:
t_df  = t_df.drop(['index', 'target'], axis = 1)
t_df

Unnamed: 0,text
0,"[stirring, funny, finally, transporting, re-im..."
1,"[apparently, reassembled, cutting-room, floor,..."
2,"[presume, audience, n't, sit, still, sociology..."
3,"[visually, stunning, rumination, love, memory,..."
4,"[jonathan, parker, bartleby, be-all-end-all, m..."
...,...
8736,"[often-deadly, boring, strange, reading, class..."
8737,"[problem, concept, films, concept, poor, one, ..."
8738,"[safe, conduct, however, ambitious, well-inten..."
8739,"[film, made, little, wit, interest, profession..."


In [106]:
stop_word = set(stopwords.words('english'))
digits = set(string.digits)

def text_manipulation(texts):
  """Lowercase tokens, drop stop words and pure numbers, strip embedded digits, keep tokens of length >= 3."""
  cleaned = []
  for text in texts:
    word = text.strip().lower()
    if word in stop_word or word.isdigit():
      continue
    if word.isalnum():
      # Remove digit characters embedded in alphanumeric tokens.
      word = ''.join(ch for ch in word if ch not in digits)
    # The length filter also discards single punctuation tokens such as ',' or '.'.
    if len(word) >= 3:
      cleaned.append(word)
  return cleaned
t_df['text'] = t_df['text'].apply(text_manipulation)
c = Counter()
for text in t_df['text']:
  for word in text:
    c[word] += 1
print(c)



In [107]:
top_400_voc = {k: v for k, v in sorted(c.items(), key=lambda item: item[1], reverse=True)}
top_400_voc = {k:c[k] for k in list(top_400_voc.keys())[:400]}


In [108]:
features = list(top_400_voc.keys())
features


['film',
 'movie',
 "n't",
 '...',
 'one',
 'like',
 'story',
 '-rrb-',
 '-lrb-',
 'even',
 'much',
 'comedy',
 'good',
 'characters',
 'time',
 'way',
 'little',
 'funny',
 'make',
 'director',
 'never',
 'enough',
 'bad',
 'makes',
 'would',
 'work',
 'life',
 'may',
 'best',
 'love',
 'could',
 'movies',
 'well',
 'new',
 'really',
 'performances',
 'something',
 'films',
 'drama',
 'action',
 'made',
 'many',
 'plot',
 'still',
 'see',
 'people',
 'nothing',
 'two',
 'better',
 'every',
 "'re",
 'great',
 'without',
 'look',
 'ever',
 'long',
 'cast',
 'get',
 'fun',
 'sense',
 'humor',
 'audience',
 'might',
 'script',
 'also',
 'though',
 'world',
 'first',
 'performance',
 'often',
 'character',
 'another',
 'real',
 'feel',
 'big',
 'kind',
 'thing',
 'feels',
 'documentary',
 'tale',
 'thriller',
 'seems',
 'less',
 'entertaining',
 'picture',
 'screen',
 'minutes',
 'hard',
 "'ll",
 'hollywood',
 'watching',
 'take',
 'romantic',
 'far',
 'almost',
 "'ve",
 'acting',
 'heart'

In [99]:
bog_df = pd.DataFrame(columns = features)

In [100]:
def term_document_matrix(word_list):
    term_matrix = dict()
    for word in features:
        if(word in word_list):
            occ = word_list.count(word)
            term_matrix[word] = occ
        else:
            term_matrix[word] = 0
    return term_matrix

In [113]:
for col, row in t_df.iterrows():
  w_list = term_document_matrix(row['text'])
  # Append the term counts of this document as a new row.
  bog_df.loc[len(bog_df.index)] = w_list

In [111]:
bog_df

Unnamed: 0,film,movie,n't,...,one,like,story,-rrb-,-lrb-,even,....1,line,sequel,written,sex,writing,live,talent,psychological,animation,leaves
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17343,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17344,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
17345,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17346,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## K-means


In [2]:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,30), timings= True)
visualizer.fit(bog_df)
visualizer.show()

NameError: name 'bog_df' is not defined

The elbow curve suggests k = 11.
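
The elbow method can also be reproduced without a visualizer by recording the K-means inertia (within-cluster sum of squares) for each candidate k and looking for the point where the decrease levels off. A minimal sketch, assuming synthetic data from `make_blobs` in place of the bag-of-words matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups (assumption for illustration only).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Inertia (within-cluster sum of squares) for each candidate k.
inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The elbow is read off by eye from these values, so it is worth checking a couple of neighboring k values rather than treating one number as definitive.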

In [1]:
k=KMeans(init='k-means++',n_clusters=11,n_init=100,random_state=0).fit(bog_df)
labels =k.predict(bog_df)
c_d = t_df.assign(Cluster=labels)
for i in range(11):
  # Show up to 100 rows from each cluster.
  print(c_d[c_d['Cluster'] == i].head(100))
  print('*' * 10)

NameError: name 'KMeans' is not defined

In the first task there are only 2 target values, but here K-means produces 11 clusters, which may behave more like rating levels than like sentiment classes.

## DBSCAN

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import numpy as np
nn = NearestNeighbors(n_neighbors=20).fit(bog_df)
distances, indices = nn.kneighbors(bog_df)

In [None]:
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(10,8))
plt.plot(distances)

The k-distance plot suggests an eps value of 1.

In [None]:
db_model = DBSCAN(eps =  1)
db_model.fit(bog_df)
t_df['label'] = db_model.labels_

In [None]:
for i in range(-1,1):
  print(t_df[t_df['label'] == i])
  print('*' * 10)

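DBSCAN produces two labels here, -1 and 0, which matches the number of target classes in Question 1. Note, however, that scikit-learn uses -1 for noise points, so this result amounts to one cluster plus noise rather than two clusters.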
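
When reading DBSCAN output it helps to count clusters and noise points separately, since scikit-learn assigns the label -1 to noise. A small sketch on hand-made points (hypothetical data: two tight groups and one far-away outlier):

```python
from collections import Counter
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point.
points = [[0, 0], [0, 0.1], [0.1, 0],
          [5, 5], [5, 5.1], [5.1, 5],
          [20, 20]]

labels = DBSCAN(eps=0.5, min_samples=3).fit(points).labels_

label_counts = Counter(labels.tolist())
n_clusters = len(set(label_counts) - {-1})  # real clusters exclude the noise label
n_noise = label_counts.get(-1, 0)
print(n_clusters, n_noise)
```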

## Hierarchical clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering()
visualizer = KElbowVisualizer(model, k=(2,30), timings= True)
visualizer.fit(bog_df)
visualizer.show()

In [None]:
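# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the `affinity` parameter was removed in recent scikit-learn releases).
hierarchical_cluster = AgglomerativeClustering(n_clusters=9, linkage='ward')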
labels = hierarchical_cluster.fit_predict(bog_df)

In [None]:
c_d = t_df.assign(Cluster=labels)
for i in range(9):
  print(c_d[c_d['Cluster'] == i])
  print('*' * 10)

## Word2Vec

In [None]:
from gensim.models import Word2Vec
l_df = pd.concat([train, test]).reset_index()

In [None]:
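sentences = [i for i in t_df['text']]
word2vec = Word2Vec( vector_size = 100, negative=5, hs=1, min_count=2, sample = 0)
word2vec.build_vocab(sentences)
# build_vocab only initializes the vectors; train() is what learns them.
word2vec.train(sentences, total_examples=word2vec.corpus_count, epochs=word2vec.epochs)
words = set(word2vec.wv.index_to_key )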

In [None]:
import numpy as np

# Represent each document as the mean of its in-vocabulary word vectors.
# Documents with no in-vocabulary words become rows of NaN.
final_1 = []
for tokens in t_df['text']:
  vectors = [word2vec.wv[token] for token in tokens if token in words]
  if vectors:
    final_1.append(np.mean(vectors, axis=0))
  else:
    final_1.append(np.full(word2vec.vector_size, np.nan))

In [None]:
p = pd.DataFrame(final_1)
p = p.fillna(0)
p

In [None]:
l_df = l_df.drop('index', axis = 1)

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,30), timings= True)
visualizer.fit(p)
visualizer.show()

In [None]:
k=KMeans(init='k-means++',n_clusters=11,n_init=100,random_state=0).fit(p)
labels =k.predict(p)
c_d = t_df.assign(Cluster=labels)
# Iterate over all 11 clusters, matching n_clusters.
for i in range(11):
  print(c_d[c_d['Cluster'] == i])
  print('*' * 10)

## BERT

In [None]:
pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

In [None]:
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

In [None]:
# Rebuild one string per document from its token list.
corpus = [' '.join(tokens) for tokens in t_df['text']]
corpus_embeddings = embedder.encode(corpus)

In [None]:
# Use one sentence embedding per document as the clustering features.
k  = pd.DataFrame(corpus_embeddings)
k = k.fillna(0)
k


In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,30), timings= True)
visualizer.fit(k)
visualizer.show()

The elbow curve suggests k = 10.

In [None]:
d=KMeans(init='k-means++',n_clusters=10,n_init=100,random_state=0).fit(k)
labels =d.predict(k)
c_d = t_df.assign(Cluster=labels)

In [None]:
for i in range(10):
  print(c_d[c_d['Cluster'] == i])
  print('*' * 10)
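
One way to compare clusterings quantitatively, rather than by reading the printed clusters, is the silhouette score, which ranges from -1 to 1 with higher values indicating better-separated clusters. The sketch below is illustrative only: it uses synthetic data from `make_blobs`, and the `candidates` dictionary is a hypothetical helper, not part of the notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document feature matrix (assumption).
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=1)

candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

# Fit each method on the same features and score its labels.
scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    scores[name] = silhouette_score(X, labels)
print(scores)
```

The same loop could be run on the bag-of-words, Word2Vec, and BERT feature matrices to support the comparison paragraph.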

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:





'''