<html>
<font size = 6 color = Red align = center>
Natural Language Processing - Banking Customer Reviews using Python<br/>
</font>
</html>

<html>
<font size = 4 color = blue>
Author: Sulekha Aloorravi
</font>
</html>

### Load the reviews in the form of HTML files

In [1]:
import glob

In [2]:
html_files = glob.glob(r".\BankBazaarData\*.html")

In [3]:
import codecs

In [4]:
S = " "
html_array = []
for file in html_files:
    # Use a context manager so each file handle is closed after reading
    with codecs.open(file, 'r') as f:
        html_array.append(f.read())

In [5]:
html = S.join(html_array)

In [6]:
html.count("ellipsis_text")

996

### Parse them using BeautifulSoup

In [7]:
from bs4 import BeautifulSoup

In [8]:
parsed_html = BeautifulSoup(html,"html.parser")

In [9]:
reviews = parsed_html.find_all('span', attrs = {'class': "ellipsis_text", 'itemprop': "description"})

In [10]:
ratings = parsed_html.find_all('span', attrs = {'itemprop': "ratingvalue"})

In [11]:
import re

In [12]:
def cleanhtml(raw_html):
    clean = re.compile('<.*?>')
    cleantext = re.sub(clean, '', raw_html)
    return cleantext
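
A quick check of what the non-greedy pattern `<.*?>` removes may help here. This is a minimal sketch: the sample string is made up for illustration and is not taken from the scraped files. Note that BeautifulSoup's `.text` already returns the text without tags, so on that input the function will usually leave the string unchanged.

```python
import re

def cleanhtml(raw_html):
    # Non-greedy match: each tag is removed on its own, text between tags is kept
    return re.sub(r'<.*?>', '', raw_html)

# Hypothetical input, shaped like one review span
sample = '<span class="ellipsis_text">Good <b>service</b> overall</span>'
cleaned = cleanhtml(sample)
print(cleaned)
```

A greedy pattern (`<.*>`) would instead match from the first `<` to the last `>` and delete the review text as well, which is why the `?` matters.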

In [13]:
rates = [cleanhtml(rating.text) for rating in ratings]

In [14]:
revws = [cleanhtml(review.text) for review in reviews]

### Convert the data into dataframe

In [15]:
import pandas as pd

In [16]:
Reviews = pd.DataFrame({'Reviews': revws, 'Ratings': rates})

In [17]:
Reviews = Reviews.replace({r'\s+$': '', r'^\s+': ''}, regex = True).replace(r'\n', ' ', regex = True)

In [18]:
Reviews.drop_duplicates(inplace = True)

In [19]:
Reviews.count()

Ratings    992
Reviews    992
dtype: int64

In [20]:
Reviews.head(5)

| | Ratings | Reviews |
|---|---|---|
| 0 | 5.0 | I have been using BANK BAZAAR services for a w... |
| 1 | 3.0 | I have taken personal loan from CASHE , my exp... |
| 2 | 4.0 | For the first time i have taken financial prod... |
| 3 | 4.0 | My personal loan was taken with HDFC BANK , th... |
| 4 | 4.0 | I have taken a personal loan from the HDFC ban... |


In [21]:
Reviews.dtypes

Ratings    object
Reviews    object
dtype: object

In [22]:
Reviews['Ratings'] = Reviews['Ratings'].astype(float)

In [23]:
Reviews.dtypes

Ratings    float64
Reviews     object
dtype: object

### Calculate sentiment

In [24]:
def calculate_sentiment(reviews):
    if reviews['Ratings'] < 4.0:
        return 0   #Negative
    else:
        return 1   #Positive

In [25]:
Reviews["Sentiment"] = Reviews.apply(calculate_sentiment, axis=1)

### Save data in the form of csv

In [26]:
Reviews.to_csv("BankReviews.csv")

### Store the reviews and sentiment labels as separate series for further processing and feature extraction

In [27]:
text, y = Reviews.Reviews, Reviews.Sentiment

### Split the data into train and test at a ratio of 67:33

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
text_train, text_test, y_train, y_test = train_test_split(text, y, test_size=0.33, random_state=42)
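
The split sizes that `test_size=0.33` produces can be worked out by hand. The sketch below assumes the 992 rows reported by `Reviews.count()` and relies on scikit-learn rounding a fractional test size up, with the remainder going to training.

```python
import math

n_samples = 992      # rows left after drop_duplicates, per Reviews.count()
test_size = 0.33

# The test portion is rounded up; training receives whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(round(n_train / n_samples, 2), round(n_test / n_samples, 2))
```

These sizes agree with the class counts printed for the training and test labels (136 + 528 and 68 + 260).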

### Explore train and test data

In [30]:
import numpy as np

In [31]:
np.unique(y_train)

array([0, 1], dtype=int64)

In [32]:
np.unique(y_test)

array([0, 1], dtype=int64)

In [33]:
print("Samples per Sentiment (training): {}".format(np.bincount(y_train)))

Samples per Sentiment (training): [136 528]


In [34]:
print("Samples per Sentiment (testing): {}".format(np.bincount(y_test)))

Samples per Sentiment (testing): [ 68 260]
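
The two counts show that the classes are imbalanced. As a rough sketch, the share of positive reviews and the accuracy of a trivial model that always predicts the majority class can be computed directly from the printed numbers. The variable and function names here are illustrative only.

```python
train_counts = [136, 528]   # [negative, positive] from the training output
test_counts = [68, 260]     # [negative, positive] from the testing output

def positive_share(counts):
    # Fraction of samples labelled 1 (positive)
    return counts[1] / sum(counts)

print(round(positive_share(train_counts), 3))
print(round(positive_share(test_counts), 3))

# Accuracy of always predicting the majority class on the test set
majority_baseline = max(test_counts) / sum(test_counts)
print(round(majority_baseline, 2))
```

Any classifier trained on this data is worth comparing against that baseline, since accuracy alone can look good while the minority class is mostly missed.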


### Feature Extraction

<html>
<font size = 4>
<b> Rescaling the Data with tf-idf </b> <br /></font>
<font>
One approach to extracting features from text is to
rescale each feature by how informative we expect it to be. One of the most common
ways to do this is the term frequency–inverse document frequency (tf-idf)
method. Let us create a function for it.
</font>
</html> 
\begin{equation*}
\text{tfidf}(w, d) = \text{tf} \left(\log\left(\frac{N + 1}{N_w + 1}\right) + 1\right)
\end{equation*}
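
To make the weighting concrete, here is a minimal hand computation of the smoothed form, tf × (ln((N + 1) / (N_w + 1)) + 1), followed by L2 normalisation. The toy corpus and helper names are made up for illustration. Whitespace tokenisation is a simplifying assumption: `TfidfVectorizer` lowercases text and drops single-character tokens by default.

```python
import math

# Toy corpus, made up for illustration
docs = [
    "loan process was smooth",
    "loan interest was high",
    "smooth smooth service",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)   # number of documents

def smooth_idf(word):
    # N_w: number of documents that contain the word
    n_w = sum(word in doc for doc in tokenized)
    return math.log((N + 1) / (n_w + 1)) + 1

def tfidf(word, doc):
    # Raw count of the word in the document times its smoothed idf
    return doc.count(word) * smooth_idf(word)

def l2_normalize(weights):
    # Scale the weights so the document vector has unit Euclidean length
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

doc = tokenized[2]
raw = {w: tfidf(w, doc) for w in set(doc)}
unit = l2_normalize(raw)
print(raw)
print(unit)
```

The trailing `+ 1` means a word that occurs in every document still gets an idf of 1 rather than 0, so it is down-weighted but not discarded.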

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [36]:
def tfidf_extractor(corpus, ngram_range=(1,1)):    
    vectorizer = TfidfVectorizer(min_df=1, 
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [37]:
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(text_train)  
tfidf_test_features = tfidf_vectorizer.transform(text_test) 

In [38]:
from sklearn import metrics

In [39]:
def get_metrics(true_labels, predicted_labels):
    print ('Accuracy:', np.round(metrics.accuracy_score(true_labels,predicted_labels),2))
    print ('Precision:', np.round(metrics.precision_score(true_labels,predicted_labels,average='weighted'),2))
    print ('Recall:', np.round(metrics.recall_score(true_labels, predicted_labels,average='weighted'),2))
    print ('F1 Score:', np.round(metrics.f1_score(true_labels, predicted_labels,average='weighted'),2)) 

In [40]:
def train_predict_evaluate_model(classifier,train_features, train_labels, test_features, test_labels):
    # build model    
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features) 
    # evaluate model prediction performance   
    get_metrics(true_labels=test_labels, 
                predicted_labels=predictions)
    return predictions 

### Multinomial Naive Bayes with tfidf features   

In [41]:
from sklearn.naive_bayes import MultinomialNB

In [42]:
mnb_best = MultinomialNB(alpha = 0.001,fit_prior = True)

In [43]:
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb_best,
                                           train_features=tfidf_train_features,
                                           train_labels=y_train,
                                           test_features=tfidf_test_features,
                                           test_labels=y_test)

Accuracy: 0.78
Precision: 0.73
Recall: 0.78
F1 Score: 0.74


### Confusion Matrix

In [44]:
cm = metrics.confusion_matrix(y_test, mnb_tfidf_predictions)
pd.DataFrame(cm, index=range(0,2), columns=range(0,2))

| | 0 | 1 |
|---|---|---|
| 0 | 12 | 56 |
| 1 | 16 | 244 |
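
The weighted scores reported earlier can be recovered from this matrix, which also exposes the per-class picture that the averages hide. The sketch below is plain Python and uses the scikit-learn convention that rows are actual labels and columns are predicted labels. The helper names are illustrative.

```python
# Confusion matrix from the output: rows = actual, columns = predicted
cm = [[12, 56],
      [16, 244]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def per_class(label):
    tp = cm[label][label]
    actual = sum(cm[label])                     # row sum: support of the class
    predicted = sum(row[label] for row in cm)   # column sum: times predicted
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual

stats = {label: per_class(label) for label in (0, 1)}

# Support-weighted averages, matching average='weighted'
weighted = [sum(stats[c][i] * stats[c][3] for c in stats) / total
            for i in range(3)]

for label, (p, r, f, n) in stats.items():
    print(label, round(p, 2), round(r, 2), round(f, 2), n)
print(round(accuracy, 2), [round(v, 2) for v in weighted])
```

Only 12 of the 68 negative reviews are identified (recall of about 0.18), so the weighted figures are dominated by the positive class.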


### Incorrect Predictions

In [45]:
print("*******[0 - Negative, 1 - Positive]*******")
for document, label, predicted_label in zip(text_test, y_test, mnb_tfidf_predictions):
    # Print only the reviews whose predicted label differs from the actual one
    if label != predicted_label:
        print("Actual Label:", label)
        print("Predicted Label:", predicted_label)
        print("Review:", re.sub('\n', ' ', document))

*******[0 - Negative, 1 - Positive]*******
Actual Label: 0
Predicted Label: 1
Review: My personal loan was taken with FULLERTON .The process was very good but their processing fee was very high.The loan amount was 1,54,000 and the interest rate was satisfactory .The tenure period was 5 years. I had a smooth and hassle  free
Actual Label: 0
Predicted Label: 1
Review: My personal loan was taken with HDFC Bank 4 years back. The loan amount was 1.9 lakhs  and the interest rate was very high . The tenure   period was 4 years . I had a good experience  and i had a very smooth process also. Their processing fee was also too high.My overall experience was good.
Actual Label: 0
Predicted Label: 1
Review: My personal loan was taken with FULLERTON 1 month back.The loan amount was 10 lakhs  and the interest rate was very high. Te tenure period was 5 years. The process was quite good and excellent . I haven't faced any issues and the documentation  was smooth but  The processing  fee was high.
Actu