<h1>Introduction</h1>

This notebook is an introduction to implementing end-to-end solutions for a few NLP tasks. The areas covered in this notebook are:

* Consumer complaint classification
* Basic sentiment predictions using Vader

Other notebooks in this series cover more advanced topics such as record linkage and text summarization.

In [None]:
!pip install -U scikit-learn

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import string
from io import StringIO

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from textblob import TextBlob, Word

import sklearn.feature_extraction.text as text
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
tqdm.pandas()

<h2>Multi-class classification - Consumer complaints</h2>

We have a database of thousands of consumer complaints about different financial products and services, submitted to the respective companies. We need to classify each complaint into a product category, using the data available to us.

In [None]:
df = pd.read_csv('../input/us-consumer-finance-complaints/consumer_complaints.csv', encoding='latin-1')

In [None]:
df.head()

In [None]:
df[df['consumer_complaint_narrative'].notnull()]

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
# We will extract the required columns for our predictions

df = df[['product', 'consumer_complaint_narrative']]
df = df[df['consumer_complaint_narrative'].notnull()]

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df['product'].nunique()

In [None]:
plt.figure(figsize=(15, 10))
sns.histplot(x='product', data=df)
plt.xticks(rotation=90)
plt.title('Distribution of complaints')
plt.show()

In [None]:
def processRow(row):
    import re
    from textblob import Word
    tweet = row
    # Lowercase (str.lower() returns a new string, so the result must be reassigned)
    tweet = tweet.lower()
    # Remove escaped unicode sequences like "\u002c" and any non-ASCII characters
    tweet = re.sub(r'(\\u[0-9A-Fa-f]+)', "", tweet)
    tweet = re.sub(r'[^\x00-\x7f]', "", tweet)
    # Convert any URL to the token "URL"
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Convert any @username to "AT_USER"
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Collapse runs of whitespace (including newlines) into a single space
    tweet = re.sub(r'\s+', ' ', tweet)
    # Replace every non-alphanumeric character with a space.
    # This covers punctuation, '#' in hashtags, emoticons and repeated '!', '?' or '.'
    tweet = re.sub(r'[^\w]', ' ', tweet)
    # Remove digits
    tweet = "".join([i for i in tweet if not i.isdigit()])
    # Lemmatize each word
    tweet = " ".join([Word(word).lemmatize() for word in tweet.split()])
    # Optional stemming, disabled by default
    # from nltk.stem import PorterStemmer
    # st = PorterStemmer()
    # tweet = " ".join([st.stem(word) for word in tweet.split()])
    # Trim leading and trailing quotes
    tweet = tweet.strip('\'"')
    return tweet

In [None]:
df.rename(columns={'consumer_complaint_narrative':'complaint'}, inplace=True)
df.head()

In [None]:
df['complaint'] = df['complaint'].progress_apply(lambda x: processRow(x))

In [None]:
df.head()

In [None]:
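# Split the data into training and test sets.
# A fixed random_state makes the split reproducible; stratify keeps class proportions equal in both sets
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    df['complaint'], df['product'], random_state=42, stratify=df['product'])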

In [None]:
le = preprocessing.LabelEncoder()
# Fit the encoder once on the training labels and reuse the same mapping for the test labels
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [None]:
tfidf = TfidfVectorizer()
# Fit the vocabulary and IDF weights on the training split only, so no test data leaks into the features
x_train_t = tfidf.fit_transform(x_train)
x_test_t = tfidf.transform(x_test)

In [None]:
x_train_t.shape

In [None]:
x_test_t.shape

In [None]:
y_train

In [None]:
model = linear_model.LogisticRegression(solver='liblinear').fit(x_train_t, y_train)
accuracy = metrics.accuracy_score(y_test, model.predict(x_test_t))
print("Accuracy: ", accuracy)

In [None]:
# le.classes_ lists the product names in the same order as the encoded labels
print(metrics.classification_report(y_test, model.predict(x_test_t), target_names=le.classes_))

In [None]:
matr = confusion_matrix(y_test, model.predict(x_test_t))

In [None]:
# Reuse the fitted LabelEncoder so that category ids match the labels the model was trained on
df['category_id'] = le.transform(df['product'])

In [None]:
df['category_id']

In [None]:
cleaned_df = df[['product', 'category_id']].drop_duplicates().sort_values('category_id')
cleaned_df.head()

In [None]:
conv = dict(cleaned_df.values)
dict(cleaned_df.values)

In [None]:
inverse_conv = dict(cleaned_df[['category_id','product']].values)
dict(cleaned_df[['category_id','product']].values)

In [None]:
plt.figure(figsize=(10, 10))
# Tick labels come from le.classes_ so they follow the row/column order of the confusion matrix
sns.heatmap(matr, annot=True, xticklabels=le.classes_,
           yticklabels=le.classes_, cmap="BuPu", fmt='d')
plt.ylabel('True value')
plt.xlabel('Predicted value')
plt.show()

In [None]:
# Predicting in real time
sentence = ["This company refuses to provide me verification and validation of debt "
            "per my right under the FDCPA. I do not believe this debt is mine."]
feature_sent = tfidf.transform(sentence)
pred = model.predict(feature_sent)
print(pred)

In [None]:
print(sentence)
# Decode the predicted label with the same encoder that produced the training labels
print("This has been predicted as:", le.inverse_transform(pred)[0])

Some ways to improve the accuracy:

* Use GridSearchCV for hyperparameter tuning. We can try out multiple parameter combinations to see which one fits our data best

* Use deep learning techniques such as RNNs and LSTMs

* Repeat the whole process with other vanilla ML models such as SVM, Naive Bayes, MLP, GBM etc.
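
As a rough illustration of the first suggestion, the sketch below wraps a TF-IDF vectorizer and a logistic regression in a scikit-learn `Pipeline` and tunes both with `GridSearchCV`. The toy texts, labels, parameter grid and `cv=2` are assumptions chosen only to keep the example small; on the complaints data one would pass `x_train` and the training labels instead and probably use more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the complaint texts and product labels
toy_texts = [
    "collector keeps calling about a debt that is not mine",
    "debt collector refuses to validate the debt",
    "collection agency reported a debt i already paid",
    "they threaten me over an old debt",
    "mortgage payment was applied to the wrong month",
    "my mortgage escrow account was miscalculated",
    "loan servicer lost my mortgage modification papers",
    "the bank started foreclosure on my mortgage",
]
toy_labels = ["Debt collection"] * 4 + ["Mortgage"] * 4

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so the validation fold never influences the features
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print(search.predict(["a collector called me about a debt"]))
```

After fitting, `search` behaves like the best pipeline found, so it can be used directly for prediction on raw text.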

<h2>Sentiment analysis for Amazon food</h2>

We've seen sentiment analysis using vectorizers and TextBlob (in previous notebooks). Here we will work with the Vader library to automate our process:

In [None]:
df = pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')
df.head()

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# Column we are interested in:
# Review summary
df['Summary'].head()

In [None]:
# Review description
df['Text'].head()

<h3>Preprocessing</h3>

In [None]:
from textblob import TextBlob
from textblob import Word
from nltk.corpus import stopwords

# Lowercase
df['Text'] = df['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# Remove punctuation
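# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['Text'] = df['Text'].str.replace(r'[^\w\s]', "", regex=True)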
df['Text'].head()

In [None]:
# Spelling corrections - very expensive process
#df['Text'] = df['Text'].progress_apply(lambda x: str(TextBlob(x).correct()))

In [None]:
df['Text'] = df['Text'].progress_apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

In [None]:
df.dropna(inplace=True)
df.Score.hist(bins=5, grid=False)
plt.show()

In [None]:
print(df.groupby('Score').count().Id)

As the dataset is highly skewed, we will draw the same number of samples (N = 28,000) from each score class.

In [None]:
score1 = df[df['Score']==1].sample(n=28000)
score2 = df[df['Score']==2].sample(n=28000)
score3 = df[df['Score']==3].sample(n=28000)
score4 = df[df['Score']==4].sample(n=28000)
score5 = df[df['Score']==5].sample(n=28000)
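
The per-class sampling above can also be written as a single grouped operation. The sketch below is one possible alternative, assuming pandas 1.1 or newer (where `DataFrameGroupBy.sample` is available); the toy frame and `n=2` are made up for illustration and stand in for the reviews DataFrame and 28,000.

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews DataFrame
toy = pd.DataFrame({
    "Score": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5, 5, 5],
    "Text": [f"review {i}" for i in range(12)],
})

# Draw the same number of rows from every score class.
# n must not exceed the size of the smallest class (here: 2);
# random_state makes the draw reproducible.
balanced = toy.groupby("Score").sample(n=2, random_state=42)

print(balanced["Score"].value_counts().sort_index())
```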

In [None]:
df_final = pd.concat([score1, score2, score3, score4, score5], axis=0)
df_final

In [None]:
df_final.reset_index(drop=True, inplace=True)

In [None]:
df_final.head()

<h3>Sentiment definition</h3>

Let us define our sentiments as follows (you can choose any criteria):

* Score <= 2 : Negative
* Score = 3 : Neutral
* Score >=4 : Positive
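
The criteria above can be sketched as a small helper. The function name `score_to_sentiment` and the sample scores are hypothetical and only serve the illustration.

```python
def score_to_sentiment(score: int) -> str:
    """Map a 1-5 star score to a sentiment label using the criteria above."""
    if score <= 2:
        return "Negative"
    if score == 3:
        return "Neutral"
    return "Positive"


# Hypothetical scores standing in for df_final['Score']
sample_scores = [1, 2, 3, 4, 5]
labels = [score_to_sentiment(s) for s in sample_scores]
print(labels)
# → ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```

On the notebook's data the same helper could be applied with `df_final['Score'].map(score_to_sentiment)`.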

In [None]:
df_final.shape

In [None]:
print(df_final.groupby('Score').count().Id)

<h3>Working with wordclouds</h3>

We will analyze the summary of each type of review.

In [None]:
from wordcloud import WordCloud
from wordcloud import STOPWORDS

In [None]:
# Wordcloud input is a single string
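# Join with a space so the last word of one summary does not merge with the first word of the next
total_str = df_final.Summary.str.cat(sep=' ')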
wordcloud = WordCloud(background_color='white')
wordcloud.generate(total_str)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Splitting our reviews

df_negative = df_final[df_final['Score'].isin([1, 2])]
df_positive = df_final[df_final['Score'].isin([4, 5])]
# Transform into a single string
df_negative_s = df_negative.Summary.str.cat(sep=' ')
df_positive_s = df_positive.Summary.str.cat(sep=' ')

In [None]:
neg_wc = WordCloud(background_color='white').generate(df_negative_s)
pos_wc = WordCloud(background_color='white').generate(df_positive_s) 

In [None]:
fig = plt.figure(figsize=(10,10))
plt.imshow(neg_wc,interpolation='bilinear')
plt.axis("off")
plt.title('Reviews with Negative Scores',fontsize=20)
plt.show()

In [None]:
fig = plt.figure(figsize=(10,10))
plt.imshow(pos_wc,interpolation='bilinear')
plt.axis("off")
plt.title('Reviews with Positive Scores',fontsize=20)
plt.show()

<h3>Feature engineering</h3>

We'll be using the Vader library here, so we do not have to engineer any features. Feature engineering would be needed if we were building our own model from scratch.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import os
import sys
import ast

In [None]:
!pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

VADER's `polarity_scores` method returns four values:

* pos: The proportion of the text that is rated as positive
* neu: The proportion of the text that is rated as neutral
* neg: The proportion of the text that is rated as negative
* compound: The sum of all lexicon ratings, normalized to a value between -1 (most negative) and 1 (most positive)
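
To show how the compound score can be turned into a label, here is a minimal sketch. It uses the cut-offs suggested in the VADER documentation (compound >= 0.05 positive, <= -0.05 negative, otherwise neutral), which differ from the simpler two-class split at 0 used later in this notebook. The helper name `compound_to_label` and the score dictionaries are made up for illustration; real values come from `analyzer.polarity_scores(text)`.

```python
def compound_to_label(compound: float, threshold: float = 0.05) -> str:
    """Turn a VADER compound score into a three-way sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical dictionaries shaped like the output of polarity_scores();
# pos, neu and neg sum to 1 in each of them
examples = [
    {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.8},
    {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.7},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

for scores in examples:
    print(scores["compound"], compound_to_label(scores["compound"]))
```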

In [None]:
listing = []
for row in tqdm(df['Text']):
    vs = analyzer.polarity_scores(row)
    listing.append(vs)
    
df_results = pd.DataFrame(listing)
df_results.head()

In [None]:
df_with_sent = pd.concat([df.reset_index(drop=True), df_results], axis=1)
df_with_sent.head()

In [None]:
df_with_sent['Sentiment'] = np.where(df_results['compound']>=0, 'Positive', 'Negative')

In [None]:
df_with_sent.head()

In [None]:
# Recent seaborn versions expect the column to be passed by keyword
sns.countplot(x='Sentiment', data=df_with_sent)
plt.show()

In [None]:
df_with_sent.groupby('ProductId')['Sentiment'].value_counts()

There we have it: using the Vader library, we were able to get the total number of positive and negative sentiments for each product. We can build the same process from scratch if we have labeled data; we'll see that in the next few notebooks.