In this case study we implement an elementary text classification model built on word embeddings. Word embeddings are dense vectors that place words used in similar contexts close to each other. In this notebook we look up a pretrained GloVe vector for each word in a sentence and take the average of those vectors as the sentence representation. Each sentence representation is then classified into one of the categories. The entire process is described step by step below:

1. Load the dataset from the disk
2. Tokenize text in the dataset and create vocabulary
3. Load the pretrained GloVe vectors from the disk into a Python dictionary
4. Load embeddings for each word and take average
5. Encode the target labels as integers
6. Train the classifier

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from nltk.tokenize import RegexpTokenizer
import numpy as np
import re

### Load the dataset from the disk

In [3]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Glove/bbc-text-1.csv')
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [4]:
df['text'][0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

### Tokenizer
A regular-expression-based tokenizer that keeps only alphabetic sequences: digits and runs of x or X (masking placeholders such as XX) are removed, and the remaining text is lower-cased.

In [5]:
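def complaint_to_words(comp):
    # Split into runs of word characters, so punctuation is dropped
    words = RegexpTokenizer(r'\w+').tokenize(comp)
    # Remove digits and every run of x/X (this also strips x inside ordinary words), then lower-case
    words = [re.sub(r'[xX]+|\d+', '', w).lower() for w in words]
    # Discard tokens that became empty after the substitution
    words = [w for w in words if w != '']

    return words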

In [6]:
### Explanation of the above cell


from nltk.tokenize import RegexpTokenizer
r'''
\d, \D: ANY ONE digit/non-digit character. Digits are [0-9]
\w, \W: ANY ONE word/non-word character. For ASCII, word characters are [a-zA-Z0-9_]
\s, \S: ANY ONE space/non-space character. For ASCII, whitespace characters are [ \n\r\t\f]
'''
## Show pythex and usage
import re
tokenizer = RegexpTokenizer(r'\w+')

words = [re.sub(r'\d+', '', word) for word in tokenizer.tokenize('This is an NLP case study. Alright ww123.. 2.. 3.. 4')]
words = [word for word in words if word != '']
words

['This', 'is', 'an', 'NLP', 'case', 'study', 'Alright', 'ww']
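
The character classes listed in the cell above can be tried directly with Python's `re` module. The snippet below is a small sketch on an assumed sample string (not taken from the dataset); `re.findall` is used in place of the NLTK tokenizer only to keep the snippet free of extra dependencies.

```python
import re

sample = 'Alright ww123.. 2'  # assumed sample string, for illustration only

word_runs = re.findall(r'\w+', sample)   # runs of word characters
digit_runs = re.findall(r'\d+', sample)  # runs of digits
pieces = re.split(r'\s+', sample)        # split on runs of whitespace

print(word_runs)
print(digit_runs)
print(pieces)
```

Note that splitting on whitespace keeps the trailing dots attached to `ww123..`, whereas matching `\w+` drops them.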

In [7]:
text = "I have outdated information on my credit report that I have previously disputed that has yet to be removed this information xx is XX more then seven years old and 12.1 does not meet credit reporting requirements"

In [8]:
words = complaint_to_words(text)

In [9]:
words

['i',
 'have',
 'outdated',
 'information',
 'on',
 'my',
 'credit',
 'report',
 'that',
 'i',
 'have',
 'previously',
 'disputed',
 'that',
 'has',
 'yet',
 'to',
 'be',
 'removed',
 'this',
 'information',
 'is',
 'more',
 'then',
 'seven',
 'years',
 'old',
 'and',
 'does',
 'not',
 'meet',
 'credit',
 'reporting',
 'requirements']
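
One side effect of the substitution in `complaint_to_words` is worth seeing on a few tokens: the pattern removes every run of `x`, including those inside ordinary words. The sketch below compares it with a hypothetical alternative, `drop_masked_tokens`, which is not part of the notebook and only discards tokens made up entirely of `x`/`X`.

```python
import re

def strip_x_and_digits(token):
    # Same substitution as in complaint_to_words: remove x/X runs and digits everywhere
    return re.sub(r'[xX]+|\d+', '', token).lower()

def drop_masked_tokens(token):
    # Hypothetical alternative: drop a token only if it consists entirely of x/X
    if re.fullmatch(r'[xX]+', token):
        return ''
    return re.sub(r'\d+', '', token).lower()

for tok in ['tax', 'XX', 'next', 'ww123']:
    print(tok, '->', repr(strip_x_and_digits(tok)), 'vs', repr(drop_masked_tokens(tok)))
```

Whether the stricter variant is preferable depends on the dataset; swapping it in would change the vocabulary and the results reported further down.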

### Vocabulary
Extracting all the unique words from the dataset.

In [10]:
df.shape

(2225, 2)

In [11]:
all_words = list()
for comp in df['text']:
    for w in complaint_to_words(comp):
        all_words.append(w)

In [12]:
print('Size of vocabulary: {}'.format(len(set(all_words))))

Size of vocabulary: 27850
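
As a possible alternative to collecting every token in a list and calling `set` on it, `collections.Counter` gives the vocabulary size and the word frequencies in one pass. This is a minimal sketch on two assumed toy sentences, with `re.findall` standing in for the notebook's tokenizer.

```python
from collections import Counter
import re

docs = ["the cat sat on the mat", "the dog sat"]  # assumed toy corpus

# Count every token across all documents
counts = Counter(w for d in docs for w in re.findall(r'\w+', d.lower()))

vocab_size = len(counts)          # number of unique words
top_two = counts.most_common(2)   # most frequent words with their counts

print('Size of vocabulary:', vocab_size)
print(top_two)
```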


In [13]:
all_words[-10:-1]

['player', 'more', 'of', 'the', 'same', 'in', 'future', 'please', 'he']

In [14]:
print('Complaint\n', df['text'][10], '\n')
print('Tokens\n', complaint_to_words(df['text'][10]))

Complaint
 berlin cheers for anti-nazi film a german movie about an anti-nazi resistance heroine has drawn loud applause at berlin film festival.  sophie scholl - the final days portrays the final days of the member of the white rose movement. scholl  21  was arrested and beheaded with her brother  hans  in 1943 for distributing leaflets condemning the  abhorrent tyranny  of adolf hitler. director marc rothemund said:  i have a feeling of responsibility to keep the legacy of the scholls going.   we must somehow keep their ideas alive   he added.  the film drew on transcripts of gestapo interrogations and scholl s trial preserved in the archive of communist east germany s secret police. their discovery was the inspiration behind the film for rothemund  who worked closely with surviving relatives  including one of scholl s sisters  to ensure historical accuracy on the film. scholl and other members of the white rose resistance group first started distributing anti-nazi leaflets in the su

### Indexing
Indexing each unique word in the dataset by assigning it a unique number.

In [15]:
index_dict = dict()
count = 1
index_dict['<unk>'] = 0
for word in set(all_words):
    index_dict[word] = count
    count += 1
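
The indexing loop above iterates over a `set`, whose order for strings can differ between Python sessions, so the same word may receive a different number on each run. The sketch below shows one way to make the mapping reproducible by sorting the vocabulary first, and how unknown words fall back to the `<unk>` index. The helper names `build_index` and `encode` are assumptions made for this illustration.

```python
def build_index(words):
    # Reserve 0 for unknown words, then number the sorted unique words from 1
    index = {'<unk>': 0}
    for i, w in enumerate(sorted(set(words)), start=1):
        index[w] = i
    return index

def encode(tokens, index):
    # Replace each token by its index; unseen tokens map to <unk>
    return [index.get(t, index['<unk>']) for t in tokens]

toy_index = build_index(['b', 'a', 'b'])
encoded = encode(['a', 'z', 'b'], toy_index)

print(toy_index)
print(encoded)
```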

In [None]:
#index_dict

### Pretrained embeddings
Loading the pretrained GloVe vectors (glove.6B.300d) into a dictionary that maps each word to its 300-dimensional vector.

In [24]:
embeddings_index = {}
# Each line of the GloVe (Global Vectors) file holds a word followed by its vector components
with open('/content/drive/MyDrive/Colab Notebooks/Glove/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
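
The parsing step in the loop above can be tried without the large GloVe file. The sketch below feeds two made-up lines in the same "word followed by numbers" layout through an in-memory file; the function name `load_vectors` and the three-dimensional toy vectors are assumptions for illustration.

```python
import io
import numpy as np

def load_vectors(handle):
    # Parse lines of the form "<word> <v1> <v2> ..." into a word -> vector dictionary
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

# Two made-up lines standing in for the real file
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
toy_vectors = load_vectors(sample)

print(sorted(toy_vectors))
print(toy_vectors['cat'])
```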

In [16]:
# Peek at the first line of the GloVe file to see its format
# (embeddings_index is not reset here, so the vectors loaded above are kept)
with open('/content/drive/MyDrive/Colab Notebooks/Glove/glove.6B.300d.txt', encoding='utf-8') as f:
    print(next(f))

the 0.04656 0.21318 -0.0074364 -0.45854 -0.035639 0.23643 -0.28836 0.21521 -0.13486 -1.6413 -0.26091 0.032434 0.056621 -0.043296 -0.021672 0.22476 -0.075129 -0.067018 -0.14247 0.038825 -0.18951 0.29977 0.39305 0.17887 -0.17343 -0.21178 0.23617 -0.063681 -0.42318 -0.11661 0.093754 0.17296 -0.33073 0.49112 -0.68995 -0.092462 0.24742 -0.17991 0.097908 0.083118 0.15299 -0.27276 -0.038934 0.54453 0.53737 0.29105 -0.0073514 0.04788 -0.4076 -0.026759 0.17919 0.010977 -0.10963 -0.26395 0.07399 0.26236 -0.1508 0.34623 0.25758 0.11971 -0.037135 -0.071593 0.43898 -0.040764 0.016425 -0.4464 0.17197 0.046246 0.058639 0.041499 0.53948 0.52495 0.11361 -0.048315 -0.36385 0.18704 0.092761 -0.11129 -0.42085 0.13992 -0.39338 -0.067945 0.12188 0.16707 0.075169 -0.015529 -0.19499 0.19638 0.053194 0.2517 -0.34845 -0.10638 -0.34692 -0.19024 -0.2004 0.12154 -0.29208 0.023353 -0.11618 -0.35768 0.062304 0.35884 0.02906 0.0073005 0.0049482 -0.15048 -0.12313 0.19337 0.12173 0.44503 0.25147 0.10781 -0.17716 0.0386

In [18]:
# emmbed_dict = {}
# with open('/content/glove.6B.200d.txt','r') as f:
#   for line in f:
#     values = line.split()
#     word = values[0]
#     vector = np.asarray(values[1:],'float32')
#     emmbed_dict[word]=vector

In [19]:
from scipy import spatial

In [20]:
def find_similar_word(emmbedes):
  # Rank every word in the vocabulary by Euclidean distance to the given vector (closest first)
  nearest = sorted(embeddings_index.keys(), key=lambda word: spatial.distance.euclidean(embeddings_index[word], emmbedes))
  return nearest


# Explanation

In [25]:
spatial.distance.euclidean(embeddings_index['rat'],embeddings_index['rat'])

0.0

In [26]:
'''
Python sorted() key

sorted() has an optional parameter called 'key' that takes a function. The key function is applied to each element, and its return value is used for the comparison instead of the element itself. For example, a list of strings passed to sorted() is ordered alphabetically. With key=len, each string is passed to len and the list is ordered by string length instead.

'''

L = ["cccc", "b", "dd", "aaa"]
 
print("Normal sort :", sorted(L))
 
print("Sort with len :", sorted(L, key=len))

Normal sort : ['aaa', 'b', 'cccc', 'dd']
Sort with len : ['b', 'dd', 'aaa', 'cccc']


In [27]:
find_similar_word(embeddings_index['rat'])[0:10]

['rat',
 'rats',
 'rodent',
 'cockroach',
 'shmahn',
 'rabbit',
 'k978-1',
 'bb96',
 'bulletinyyy',
 'bdb94']
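
Sorting the whole vocabulary with a Python-level distance call per word is slow for a large embedding file. A vectorized version is sketched below under the assumption that all vectors fit in one NumPy matrix; the function name `nearest_words` and the two-dimensional toy vectors are made up for illustration.

```python
import numpy as np

def nearest_words(query, vectors, k=3):
    # Stack all vectors into one matrix and compute every Euclidean distance at once
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words])
    dists = np.linalg.norm(matrix - query, axis=1)
    order = np.argsort(dists)[:k]  # indices of the k smallest distances
    return [words[i] for i in order]

# Made-up two-dimensional vectors
toy = {
    'rat': np.array([1.0, 0.0]),
    'rats': np.array([0.9, 0.1]),
    'car': np.array([-1.0, 0.5]),
    'mouse': np.array([0.8, 0.3]),
}
closest = nearest_words(toy['rat'], toy, k=3)
print(closest)
```

The ranking criterion is still Euclidean distance, as in `find_similar_word`; only the way the distances are computed changes.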

In [28]:
#len(embeddings_index['unk'])

In [29]:
#embeddings_list = embeddings_index.items()

In [30]:
#list(embeddings_list)[99:101]

#### Taking average of all word embeddings in a sentence to generate the sentence representation.

In [31]:
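data_list = list()
for comp in df['text']:
    sentence = np.zeros(300)  # running sum of the 300-dimensional word vectors
    count = 0                 # number of words found in the embedding dictionary
    for w in complaint_to_words(comp):
        try:
            sentence += embeddings_index[w]
            count += 1
        except KeyError:
            # Skip words that have no pretrained vector
            continue
    # Guard against division by zero when no word of the text has a vector
    data_list.append(sentence / max(count, 1))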
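
The averaging step can be checked by hand on a tiny example. The sketch below uses a made-up two-dimensional embedding dictionary and a hypothetical helper, `average_embedding`, which skips out-of-vocabulary words and returns a zero vector when no word is found.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    # Collect the vectors of the tokens that have an embedding
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known word: fall back to a zero vector instead of dividing by zero
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Made-up two-dimensional embeddings
toy = {'good': np.array([1.0, 3.0]), 'film': np.array([3.0, 1.0])}

known = average_embedding(['good', 'film', 'zzz'], toy, dim=2)
unknown = average_embedding(['zzz'], toy, dim=2)

print(known)
print(unknown)
```

Here `zzz` has no vector, so the first result is the mean of the two known vectors only.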

#### Converting categorical labels to numerical format with a label encoder.

In [32]:
df

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy ex-chatshow ...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


In [33]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['category'])
df['Target'] = le.transform(df['category'])
df.head()

Unnamed: 0,category,text,Target
0,tech,tv future in the hands of viewers with home th...,4
1,business,worldcom boss left books alone former worldc...,0
2,sport,tigers wary of farrell gamble leicester say ...,3
3,sport,yeading face newcastle in fa cup premiership s...,3
4,entertainment,ocean s twelve raids box office ocean s twelve...,1
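
Scikit-learn classifiers accept integer class labels directly, so the `Target` column is enough for the models trained below. If one-hot vectors are needed (for example, as targets for a neural network), they can be derived from the integer labels. The sketch below uses NumPy only and an assumed toy list of four categories, so the integer codes differ from those in the table above.

```python
import numpy as np

labels = np.array(['tech', 'business', 'sport', 'sport', 'entertainment'])  # toy labels

# Sorted unique classes and the integer code of each label
classes, y = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of class i
one_hot = np.eye(len(classes))[y]

print(classes)
print(y)
print(one_hot)
```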


### Train/test split

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.array(data_list), df.Target.values, 
    test_size=0.15, random_state=0)

In [35]:
print(X_train.shape)

(1891, 300)


In [36]:
print(y_train.shape)

(1891,)


#### Training and testing the classifier

In [37]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
clf = BernoulliNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))

0.9610778443113772
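
`BernoulliNB` models binary features. In scikit-learn its `binarize` parameter defaults to `0.0`, so real-valued inputs such as the averaged embeddings are thresholded at zero before fitting. The sketch below reproduces that thresholding with NumPy on a made-up feature matrix to show what the classifier effectively sees.

```python
import numpy as np

# Made-up averaged embeddings: two samples, three dimensions
X = np.array([[0.12, -0.40, 0.05],
              [-0.03, 0.22, -0.51]])

# Values above the threshold become 1, all others 0
X_bin = (X > 0.0).astype(int)

print(X_bin)
```

In other words, with the default setting only the sign of each averaged dimension is used.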


In [47]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))

0.9640718562874252
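
The reported accuracies can be translated back into counts of correctly classified test documents. The arithmetic below is a sketch that assumes the test fraction is rounded up to a whole number of rows, which is consistent with the 1891 training rows printed earlier.

```python
import math

n_samples = 2225       # rows in the dataset
test_fraction = 0.15   # test_size passed to train_test_split

n_test = math.ceil(n_samples * test_fraction)  # rows held out for testing
n_train = n_samples - n_test                   # rows used for training

# Accuracy multiplied by the test-set size gives the number of correct predictions
accuracies = [0.9610778443113772, 0.9640718562874252]
correct_counts = [round(acc * n_test) for acc in accuracies]

print(n_train, n_test)
print(correct_counts)
```

The two classifiers therefore differ by a single test document, so the gap between them should not be over-interpreted.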
