For understanding the concept of bag-of-word, let us setup the environment and import the necessary libraries such as:

1. pandas: for reading and understanding the data,
2. numpy: for doing numerical computations on the data,
3. BeautifulSoup: for pulling data out of HTML and XML files and remove the unnessary tags and helps in navigating, searching, and modifying the parse tree data,
4. re: is the library for regular expression and we are going to use it to clean out data and based on pattern matching using regular expressions,
5. nltk: is a natural language toolkit library used to do text processing for classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc,
6. sklearn: is used for all mahine learning tasks such as here we are using it to training a Random Forest model and predicting it's performance.

In [20]:
import pandas as pd     
import numpy as np
from bs4 import BeautifulSoup
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # Import the stop word list

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
import os
for dirname, _, filenames in os.walk('/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn/sampleSubmission.csv
/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn/testData.tsv
/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn/labeledTrainData.tsv
/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn/unlabeledTrainData.tsv


In [22]:
df_train=pd.read_csv('/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn/labeledTrainData.tsv',header=0,
                     delimiter="\t",quoting=3)
df_test=pd.read_csv('/content/drive/MyDrive/machine learning projects/training set/Bag of Words Meets Bags of Popcorn/testData.tsv',header=0, 
                     delimiter="\t",quoting=3)

In [23]:
print(df_train.shape)
print(df_test.shape)

(25000, 3)
(25000, 2)


In [24]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [25]:
df_train.describe()

Unnamed: 0,sentiment
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [26]:
print(df_train.columns.values)
print(df_test.columns.values)

['id' 'sentiment' 'review']
['id' 'review']


In [27]:
df_train['review'][0] # as you can see there are HTML symboles so we use beautifulsoup to erase them

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [28]:
bs_data=BeautifulSoup(df_train['review'][0])
print(bs_data.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

In [29]:
letters_only=re.sub("[^a-zA-Z]", " ", bs_data.get_text() )
print(letters_only)

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

In [30]:
lower_case=letters_only.lower()
words=lower_case.split()
print(words)

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 

In [31]:
print(stopwords.words("english") )

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [32]:
words = [w for w in words if not w in stopwords.words("english")]
print(words)

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

In [41]:
def clean_text_data(data_point,data_size):
  review_soup=BeautifulSoup(data_point)
  review_text=review_soup.get_text()
  review_letter_only=re.sub("^a-zA-Z"," ",review_text)
  review_lower=review_letter_only.lower()
  review_words=review_lower.split()
  stop_words=stopwords.words('english')
  meaningful_words = [x for x in review_words if x not in stop_words]
  if((i)%2000==0):
    print("Cleaned %d of %d data (%d %%)." % ( i, data_size, ((i)/data_size)*100))
  return( " ".join( meaningful_words)) 

In [42]:
df_train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [43]:
training_data_size = df_train["review"].size
testing_data_size = df_test["review"].size

print(training_data_size)
print(testing_data_size)

25000
25000


In [44]:
for i in range(training_data_size):
    df_train["review"][i] = clean_text_data(df_train["review"][i], training_data_size)
print("Cleaning training completed!")

Cleaned 0 of 25000 data (0 %).


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Cleaned 2000 of 25000 data (8 %).
Cleaned 4000 of 25000 data (16 %).
Cleaned 6000 of 25000 data (24 %).
Cleaned 8000 of 25000 data (32 %).
Cleaned 10000 of 25000 data (40 %).
Cleaned 12000 of 25000 data (48 %).
Cleaned 14000 of 25000 data (56 %).
Cleaned 16000 of 25000 data (64 %).
Cleaned 18000 of 25000 data (72 %).
Cleaned 20000 of 25000 data (80 %).
Cleaned 22000 of 25000 data (88 %).
Cleaned 24000 of 25000 data (96 %).
Cleaning training completed!


In [45]:
for i in range(testing_data_size):
    df_test["review"][i] = clean_text_data(df_test["review"][i], testing_data_size)
print("Cleaning validation completed!")

Cleaned 0 of 25000 data (0 %).
Cleaned 2000 of 25000 data (8 %).
Cleaned 4000 of 25000 data (16 %).
Cleaned 6000 of 25000 data (24 %).
Cleaned 8000 of 25000 data (32 %).
Cleaned 10000 of 25000 data (40 %).
Cleaned 12000 of 25000 data (48 %).
Cleaned 14000 of 25000 data (56 %).
Cleaned 16000 of 25000 data (64 %).
Cleaned 18000 of 25000 data (72 %).
Cleaned 20000 of 25000 data (80 %).
Cleaned 22000 of 25000 data (88 %).
Cleaned 24000 of 25000 data (96 %).
Cleaning validation completed!


In [46]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

In [47]:
X_train, X_cv, Y_train, Y_cv = train_test_split(df_train["review"], df_train["sentiment"], test_size = 0.3, random_state=42)

In [48]:
X_train = vectorizer.fit_transform(X_train)
X_train = X_train.toarray()
print(X_train.shape)

(17500, 5000)


In [50]:
X_cv = vectorizer.transform(X_cv)
X_cv = X_cv.toarray()
print(X_cv.shape)

(7500, 5000)


In [51]:
X_test = vectorizer.transform(df_test["review"])
X_test = X_test.toarray()
print(X_test.shape)

(25000, 5000)


In [52]:
X_test = vectorizer.transform(df_test["review"])
X_test = X_test.toarray()
print(X_test.shape)

(25000, 5000)


In [53]:
vocab = vectorizer.get_feature_names()
print(f"Printing first 100 vocabulary samples:\n{vocab[:100]}")

Printing first 100 vocabulary samples:
['00', '000', '10', '100', '11', '12', '13', '13th', '14', '15', '16', '17', '18', '1930', '1930s', '1933', '1939', '1940', '1950', '1950s', '1960', '1960s', '1968', '1970', '1970s', '1971', '1972', '1973', '1980', '1980s', '1983', '1984', '1988', '1990', '1993', '1996', '1999', '19th', '1st', '20', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '20th', '24', '25', '2nd', '30', '3000', '30s', '35', '3d', '3rd', '40', '45', '50', '50s', '60', '60s', '70', '70s', '80', '80s', '90', '90s', '99', 'abandoned', 'abc', 'abilities', 'ability', 'able', 'about', 'above', 'abraham', 'absence', 'absolute', 'absolutely', 'absurd', 'abuse', 'abusive', 'abysmal', 'academy', 'accent', 'accents', 'accept', 'acceptable', 'accepted', 'access', 'accident', 'accidentally', 'accompanied', 'accomplished', 'according', 'account']




In [54]:
distribution = np.sum(X_train, axis=0)

print("Printing first 100 vocab-dist pairs:")

for tag, count in zip(vocab[:100], distribution[:100]):
    print(count, tag)

Printing first 100 vocab-dist pairs:
68 00
208 000
3069 10
324 100
257 11
230 12
186 13
65 13th
146 14
352 15
93 16
122 17
117 18
72 1930
73 1930s
64 1933
58 1939
74 1940
116 1950
103 1950s
70 1960
62 1960s
58 1968
117 1970
104 1970s
60 1971
60 1972
60 1973
113 1980
96 1980s
70 1983
60 1984
59 1988
83 1990
62 1993
71 1996
78 1999
59 19th
88 1st
498 20
110 2000
122 2001
92 2002
81 2003
91 2004
94 2005
112 2006
78 2007
58 2008
86 20th
98 24
125 25
72 2nd
418 30
60 3000
60 30s
66 35
60 3d
74 3rd
272 40
84 45
324 50
77 50s
201 60
138 60s
291 70
213 70s
416 80
206 80s
380 90
72 90s
81 99
138 abandoned
93 abc
64 abilities
306 ability
897 able
593 about
103 above
63 abraham
72 absence
240 absolute
1007 absolutely
206 absurd
137 abuse
66 abusive
74 abysmal
192 academy
336 accent
144 accents
215 accept
88 acceptable
99 accepted
65 access
218 accident
128 accidentally
64 accompanied
87 accomplished
202 according
135 account


In [55]:
forest = RandomForestClassifier() 
forest = forest.fit( X_train, Y_train)

In [56]:
predictions = forest.predict(X_cv) 
print("Accuracy: ", accuracy_score(Y_cv, predictions))

Accuracy:  0.8428


In [57]:
result = forest.predict(X_test) 
output = pd.DataFrame( data={"id":df_test["id"], "sentiment":result} )
output.to_csv( "submission.csv", index=False, quoting=3 )