# Topic Modeling and Classification of Customer Complaints 

## **Table of Contents** 

1. **Problem Statement** 


2. **Essential Comments**


3. **Import Required Libraries and Modules** 


4. **Load JSON Data as Pandas DataFrame** 


5. **Exploratory Data Analysis and Data Preprocessing**

    - 5.1 Data Exploration 
    - 5.2 Data Exploration Summary
    - 5.3 Data Preprocessing
    - 5.4 Visualization of Preprocessed Data


6. **Topic Modeling with TF-IDF Vectorization and NMF**
    
    - 6.1 Vectorize Raw Texts to TF-IDF Feature Matrix
    - 6.2 Find Optimal Number of Topics with NMF
    - 6.3 Manual Topic Modeling with NMF


7. **Build Supervised Model to Classify New Complaints**
    
    - 7.1 Predictive Classifier 1: Multinomial Naive Bayes
    - 7.2 Predictive Classifier 2: Logistic Regression
    - 7.3 Predictive Classifier 3: Decision Tree
    - 7.4 Predictive Classifier 4: Random Forest


8. **Conclusions** 

## **1. Problem Statement**

Build a model that can classify customer complaints based on the product/service. This will allow the segregation of these complaints (or tickets) into their relevant categories (or topics), thereby helping in the quick resolution of an issue.


We will be doing topic modeling on a consumer complaints data set. Since the data is not labeled, we will be applying the non-negative matrix factorization (**NMF**) approach for topic modeling of consumer complaints and clustering them into one of the following five categories:

- **Credit/Prepaid Card**
- **Bank Account Services**
- **Theft/Dispute Reporting**
- **Mortgages/Loans**
- **Others**

With the aid of topic modeling, we will be able to map each ticket onto the respective department/category. We will then use this data to train any classifier such as *logistic regression*, *decision tree*, or *random forest*. Finally, using the trained classifier we will classify any new customer complaint support ticket to the relevant department. 

## **2. Essential Comments**

1. If you are running this notebook on Google Colab, uncomment the first cell in the section below, called **"Import Required Libraries and Modules"**, and run the notebook end-to-end.

2. If you are running this notebook on a local or virtual machine, make sure to create a new virtual or `conda` environment, install all required libraries, and then run the notebook end-to-end. 

3. Make sure that you use `gensim==4.0` package if you want to use the `nmf` model available in `gensim.models`. The `NMF.py` file is not available in `gensim.models` with old versions of gensim such as `gensim==3.6` or `gensim==3.8`.

4. The most time-consuming parts of this notebook are the lemmatization step in **"Exploratory Data Analysis and Data Preprocessing"** and the NMF step in **"Topic Modeling with TF-IDF Vectorization and NMF"**. In principle, this time can be reduced with either or both of the following approaches:

    - Using **stemming** (from the `nltk` package) instead of lemmatization, or wrapping the **lemmatization** step together with the other data preprocessing steps in a `spaCy` **pipeline**. 

    - Using a subset of the "non-empty" ("non-blank") consumer complaints instead of the whole data set of around 22,000 records.
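
The second approach can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's own code: `sample_non_empty` and the example complaints are hypothetical, and a fixed seed is assumed so that the subset is reproducible.

```python
import random

def sample_non_empty(complaints, k, seed=42):
    """Return up to `k` randomly chosen non-blank complaints (hypothetical helper)."""
    non_empty = [c for c in complaints if c.strip()]
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(non_empty, min(k, len(non_empty)))

# Made-up complaints, two of which are blank
complaints = [
    "card was charged twice",
    "",
    "loan payment not applied",
    "   ",
    "account closed without notice",
]
subset = sample_non_empty(complaints, k=2)
print(subset)
```

Inside the notebook, the same idea would likely be expressed with `DataFrame.sample` (using its `n` and `random_state` arguments) on the DataFrame of non-empty complaints.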

## **3. Import Required Libraries and Modules**

In [None]:
!python3 -m spacy download en_core_web_sm --quiet
!python -m textblob.download_corpora --quiet
!python3 -m pip install gensim==4.0 --quiet

In [None]:
# Builtin libraries
import os
import warnings
import json
import re
import string
import IPython as ipy
import pickle
import pprint

# Third-party libraries for data science
# and machine learning 
import numpy as np
import pandas as pd
import matplotlib as mpl 
import seaborn as sns
import plotly
import sklearn as skl 

# Third-party NLP libraries
import nltk
import spacy
import en_core_web_sm
import textblob
import wordcloud
import gensim

In [None]:
print(f'{"re --version":{20}} : {re.__version__:s}')
print(f'{"json --version":{20}} : {json.__version__:s}')
print(f'{"nltk --version":{20}} : {nltk.__version__:s}')
print(f'{"spacy --version":{20}} : {spacy.__version__:s}')
print(f'{"ipython --version":{20}} : {ipy.__version__:s}')
print(f'{"numpy --version":{20}} : {np.__version__:s}')
print(f'{"pandas --version":{20}} : {pd.__version__:s}')
print(f'{"matplotlib --version":{20}} : {mpl.__version__:s}')
print(f'{"seaborn --version":{20}} : {sns.__version__:s}')
print(f'{"plotly --version":{20}} : {plotly.__version__:s}')
print(f'{"sklearn --version":{20}} : {skl.__version__:s}')
print(f'{"textblob --version":{20}} : {textblob.__version__:s}')
print(f'{"wordcloud --version":{20}} : {wordcloud.__version__:s}')
print(f'{"gensim --version":{20}} : {gensim.__version__:s}')

In [None]:
warnings.filterwarnings(action='ignore')

In [None]:
from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from matplotlib import pyplot as plt 

from sklearn.model_selection import train_test_split as tts 
from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer 
from sklearn.metrics import confusion_matrix, f1_score, classification_report
from sklearn import decomposition as decomp

from plotly import offline as plot
from plotly import graph_objects as go
from plotly import express as px

from pprint import pprint

from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS

from gensim.corpora.dictionary import Dictionary
from gensim.models import nmf, CoherenceModel 
from operator import itemgetter

In [None]:
nlp_model = en_core_web_sm.load()

## **4. Load JSON Data as Pandas DataFrame** 

In [None]:
!pwd 
!wget -nv https://raw.githubusercontent.com/rs2pydev/nlp_1_CLFwTM/main/data/Client_data.json
!ls 

In [None]:
with open("./Client_data.json") as f_handle:
    json_data = json.load(f_handle)

In [None]:
df = pd.json_normalize(json_data) 

## **5. Exploratory Data Analysis and Data Preprocessing**

### 5.1 Data Exploration 

In [None]:
# Display dataset
display(df.sample(n=10))

In [None]:
# Check number of rows and columns in dataset
print(df.shape)

In [None]:
# Variables' information
df.info()

In [None]:
# Identify and collect null columns
null_cols = [var for var in df.columns if df[var].isnull().sum() > 0]
print(*null_cols, sep='\n', end='\n\n')
display(df[null_cols].isnull().sum())

In [None]:
# Identify and collect non-null columns
not_null_cols = [var for var in df.columns if df[var].isnull().sum() == 0]
print(*not_null_cols, sep='\n', end='\n\n')
display(df[not_null_cols].isnull().sum())

In [None]:
# Create list of column names
col_names = df.columns.to_list()
print('Column Names: ')
print(*col_names, sep="\n", end='\n')

In [None]:
def value_count_df(df:pd.DataFrame=None, var:str=None) -> pd.DataFrame:
    """
    Given a Pandas DataFrame and a column name, this function returns 
    the items in the column and their counts (frequencies).
    Args:
        df: pd.DataFrame | Default value None
        var: str | Default value None
    Return:
        pd.DataFrame
    """
    new_df = df[var].value_counts().reset_index()
    new_df.columns = [str(var), 'Count']
    return new_df

In [None]:
# Check distributions of columns of interest
# (named `cols_of_interest` to avoid shadowing the builtin `vars`)
cols_of_interest = [
    '_source.product', 
    '_source.issue', 
    '_source.complaint_what_happened'
]

for var in cols_of_interest:
    tmp = value_count_df(df=df, var=var)
    print(f'For variable `{var:s}`: ')
    display(tmp)
    print()

In [None]:
# We will examine the consumer complaints column to check for
# null values hidden as empty strings
print('Non-empty items: ')
display(df.loc[(df['_source.complaint_what_happened'] != ''), :].shape)
print('Empty items: ')
display(df.loc[(df['_source.complaint_what_happened'] == ''), :].shape)

### 5.2 Data Exploration Summary

* The dataset has 78313 records and 22 features; the customer complaint text is in the `_source.complaint_what_happened` column.

* Using the 21072 non-empty (non-blank) rows of the `_source.complaint_what_happened` column, we will create a DataFrame called `df_text`. **NOTE:** 57241 rows of this column are empty (blank). 

* Next, we rename the `df_text` column to `complaints_unclean`.

* Finally, we apply text preprocessing (see below) on `df_text.complaints_unclean` and create a new column, `complaints_clean`. 
    * Convert text to lowercase.
    * Remove text in square brackets.
    * Remove punctuation.
    * Remove words containing numbers.
    * Remove all masked (*hidden*) words, such as `XXXX`.
    * Use POS tags to keep only the relevant words; we will keep singular nouns only.
    * Lemmatize the texts.
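
The regex-based cleaning steps listed above can be traced on a single string. The sketch below is illustrative only: the sample complaint is made up, `trace_cleaning` is a hypothetical helper, and the patterns are assumed to match those of the cleaning function in Section 5.3, including its extra rule that drops words of one to three letters.

```python
import re
import string

# (description, pattern, replacement), applied in order after lowercasing
STEPS = [
    ("remove text in square brackets", r"\[.*?\]", ""),
    ("remove words containing digits", r"\w*\d\w*", ""),
    ("remove masked tokens", r"x{4}|xx/", ""),
    ("remove 1- to 3-letter words", r"\b\w{1,3}\b", ""),
    ("remove punctuation", r"[%s]" % re.escape(string.punctuation), ""),
]

def trace_cleaning(text):
    """Apply each step in turn and print the intermediate result."""
    text = text.lower()
    for description, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
        print(f"{description:32s} -> {text!r}")
    # Collapse leftover runs of spaces (an extra step, for readability only)
    return " ".join(text.split())

sample = "I called Chase on XX/XX/2019 [see attached] about a $250.00 fee!"
cleaned = trace_cleaning(sample)
print(cleaned)
```

Note that the short-word rule also drops meaningful three-letter words such as "fee", which is worth keeping in mind when interpreting the resulting topics.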

### 5.3 Data Preprocessing

In [None]:
df_text = pd.DataFrame()
df_text = pd.DataFrame(df.loc[(df['_source.complaint_what_happened'] != ''), 
                 '_source.complaint_what_happened']).reset_index(drop=True)
df_text.rename(columns={'_source.complaint_what_happened': 'complaints_unclean'}, 
               inplace=True)
display(df_text.sample(n=10))

In [None]:
def text_cleaner(text:str=None) -> str:
    '''
    Make text lowercase and remove text in square brackets, words containing 
    digits, masked tokens ('xxxx', 'xx/'), new lines, words of 1-3 letters, 
    and punctuation.
    Args:
        text: str | Default value None
    Returns:
        str 
    '''
    text = text.lower() # Make text lowercase
    text = re.sub(r'\[.*?\]', '', text)  # Remove text in square brackets
    text = re.sub(r'\w*\d\w*', '', text) # Remove words with digits 
    text = re.sub(r'x{4}|xx/', '', text) # Remove masked tokens 'xxxx' | 'xx/' 
    text = re.sub(r'\n', ' ', text) # Replace new lines with spaces so words do not merge
    text = re.sub(r'\b\w{1,3}\b', '', text) # Remove all 1-, 2-, and 3-letter words
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text) # Remove punctuation
    return text

In [None]:
df_text['complaints_clean'] = df_text['complaints_unclean'].apply(text_cleaner)

In [None]:
# `None` means "no truncation"; the older value `-1` is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)
display(df_text.sample(n=5))

In [None]:
def text_lemmatizer(text:str=None) -> str:        
    '''
    Function to Lemmatize an input text.
    Args:
        text: str | Default value None
    Returns:
        str 
    '''
    lemmas = []
    doc = nlp_model(text)
    for word in doc:
        lemmas.append(word.lemma_)
    return " ".join(lemmas)

In [None]:
# Creating a dataframe with 
# --- original (uncleaned) complaints 
# --- cleaned complaints 
# --- lemmatized complaints.
df_text["complaints_lemmatize"] =  df_text.apply(lambda x: text_lemmatizer(
    x['complaints_clean']), axis=1)
display(df_text.sample(n=6))

* Unlike verbs and common nouns, a personal pronoun has no clear base form. In spaCy 2.x, the solution is a special symbol, `-PRON-`, which is used as the lemma for all personal pronouns.

* **Chunking** in NLP is the process of grouping small pieces of information into larger units. Its primary use is forming groups of "noun phrases". Here we keep only singular nouns (tag `NN`), as we have already lemmatized the texts.
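
The noun filtering described above reduces to keeping tokens by their tag. Here is a minimal sketch, assuming Penn Treebank style tags (where `NN` marks a singular common noun) and a hand-written list of `(word, tag)` pairs standing in for real tagger output; `keep_tag` is a hypothetical helper.

```python
# Hypothetical (word, tag) pairs, in the shape a POS tagger typically returns
tagged = [
    ("bank", "NN"),
    ("charge", "VB"),
    ("fees", "NNS"),
    ("account", "NN"),
    ("late", "JJ"),
]

def keep_tag(tagged_tokens, tag="NN"):
    """Join the words whose tag equals `tag`, preserving their order."""
    return " ".join(word for word, t in tagged_tokens if t == tag)

nouns_only = keep_tag(tagged)
print(nouns_only)
```

Only exact `NN` matches are kept, so plural nouns (`NNS`) are dropped unless lemmatization has already reduced them to their singular form.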

In [None]:
def pos_tag(text):
    # Return (word, tag) pairs, or None if tagging fails
    try:
        return TextBlob(text).tags
    except Exception:
        return None

def get_singular_nouns(text):
    # Keep only the words tagged as singular nouns (NN)
    blob = TextBlob(text)
    return ' '.join([word for (word,tag) in blob.tags if tag == "NN"])

df_text["complaints_POS_removed"] =  df_text.apply(lambda x: 
                                                    get_singular_nouns(x['complaints_lemmatize']), 
                                                    axis=1)

In [None]:
# Now, `df_text` DataFrame contains: 
# --- Raw (unclean) complaints
# --- Cleaned complaints 
# --- Lemmatized complaints 
# --- Complaints with only singular nouns (POS tag NN) kept.

display(df_text.sample(n=5))

### 5.4 Visualization of Preprocessed Data

In [None]:
plt.figure(figsize=(10,6))
doc_lens = [len(d) for d in df_text.complaints_POS_removed]
plt.hist(doc_lens, bins = 50)
plt.title('Distribution of complaint character length')
plt.ylabel('Number of complaints')
plt.xlabel('Complaint character length')
sns.despine()
plt.show();

The above plot shows that the distribution of complaint character lengths is positively skewed.
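
Positive skew means a long right tail: most complaints are short and a few are very long, which pulls the mean above the median. The check below is a small sketch on made-up lengths (not values from the dataset), using the moment-based skewness coefficient; `skewness` is a hypothetical helper.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical character lengths with one very long complaint
lengths = [40, 55, 60, 62, 70, 75, 90, 400]

print("mean    :", statistics.mean(lengths))
print("median  :", statistics.median(lengths))
print("skewness:", skewness(lengths))
```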

Below, we show the top 40 words by frequency among all the complaints after processing the text.

In [None]:
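stopwords = set(STOPWORDS)
# Join all complaints into one string; `str(Series)` would only 
# give a truncated preview of the column
all_text = ' '.join(df_text['complaints_POS_removed'])
wc = WordCloud(background_color='white', stopwords=stopwords, max_words=40, 
               max_font_size=40, random_state=42).generate(all_text)
print(wc)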

In [None]:
mpl.rcParams['figure.figsize'] = (12.0,12.0)  
mpl.rcParams['font.size'] = 12            
mpl.rcParams['savefig.dpi'] = 100             
mpl.rcParams['figure.subplot.bottom'] =.1 
fig = plt.figure()
plt.imshow(wc);
plt.axis('off')
plt.show();

In [None]:
#Removing `-PRON-` from the text corpus
df_text['complaints_fin_ver'] = df_text['complaints_POS_removed'].str.replace('-PRON-', '')
display(df_text.sample(n=5))

Next, we extract the top unigrams, bigrams, and trigrams by frequency among all the complaints after processing the text. Some of the top words are:

- **credit**
- **debt** 
- **bank** 
- **loan** 
- **mortgage** 

These top words make sense given the focus of the complaints.
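
Counting n-grams amounts to sliding a window of `n` tokens over each document and tallying the results. The following is a minimal standard-library sketch on made-up documents; `top_ngrams` is a hypothetical helper and, unlike `CountVectorizer`, it does no lowercasing or stop-word removal.

```python
from collections import Counter

def top_ngrams(docs, n=1, k=3):
    """Return the `k` most frequent n-grams across whitespace-tokenized docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of n tokens over the document
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

docs = ["credit card payment", "credit card fee", "credit report error"]
print(top_ngrams(docs, n=1, k=1))
print(top_ngrams(docs, n=2, k=1))
```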

In [None]:
def get_top_n_ngram(corpus=None, ng_range:tuple=(), 
                     n:int=0) -> list:
    """
    Get top `n` ngrams from a given corpus.
    Args:
        corpus:iterable of str | Default value None
        ng_range:tuple         | Default value ()
        n:int                  | Default value 0
    Returns:
        words_freq:list of (ngram, count) tuples
    """
    if ng_range:
        vec = CountVectorizer(ngram_range=ng_range, stop_words='english').fit(corpus)
    else:
        vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
# Extract and plot top 30 unigrams

common_words = get_top_n_ngram(corpus=df_text['complaints_fin_ver'].values.astype('U'), n=30)
df2 = pd.DataFrame(common_words, columns = ['unigram' , 'count'])
display(df2.head(10))

fig = go.Figure([go.Bar(x=df2['unigram'], y=df2['count'])])
fig.update_layout(title=go.layout.Title(
    text="Top 30 unigrams in complaint text after removing stop words and lemmatization"))
fig.show();

In [None]:
# Extract and plot top 30 bigrams

common_words = get_top_n_ngram(corpus=df_text['complaints_fin_ver'].values.astype('U'), 
                               ng_range=(2, 2), n=30)
df3 = pd.DataFrame(common_words, columns = ['bigram' , 'count'])
display(df3.head(10))

fig = go.Figure([go.Bar(x=df3['bigram'], y=df3['count'])])
fig.update_layout(title=go.layout.Title(
    text="Top 30 bigrams in complaint text after removing stop words and lemmatization"))
fig.show();

In [None]:
# Extract and plot top 30 trigrams

common_words = get_top_n_ngram(corpus=df_text['complaints_fin_ver'].values.astype('U'), 
                               ng_range=(3, 3), n=30)
df4 = pd.DataFrame(common_words, columns = ['trigram' , 'count'])
display(df4.head(10))

fig = go.Figure([go.Bar(x=df4['trigram'], y=df4['count'])])
fig.update_layout(title=go.layout.Title(
    text="Top 30 trigrams in complaint text after removing stop words and lemmatization"))
fig.show();

## **6. Topic Modeling with TF-IDF Vectorization and NMF**

### 6.1 Vectorize Raw Texts to TF-IDF Feature Matrix

Here:

- `max_df` removes terms that appear too frequently, also known as "corpus-specific stop words". `max_df = 0.95` means "ignore terms that appear in more than 95% of the complaints".

- `min_df` removes terms that appear too infrequently. `min_df = 2` means "ignore terms that appear in fewer than 2 complaints".
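
The effect of these two thresholds can be sketched without scikit-learn. The helper below is a hypothetical illustration on made-up documents: it treats `min_df` as a document count and `max_df` as a fraction of documents, which is how the integer and float forms of these parameters are interpreted.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=2, max_df=0.95):
    """Keep terms found in at least `min_df` docs and at most a `max_df` share of docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

docs = ["bank account fee", "bank loan rate", "bank card fee", "bank mortgage"]
kept = filter_by_document_frequency(docs)
print(kept)
```

Here "bank" is dropped because it appears in every document, and the terms seen only once are dropped by `min_df`, leaving just "fee".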

In [None]:
tfidf_vec = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

Create a document-term matrix using `fit_transform()`. The matrix is stored in sparse form: each stored entry is a `(complaint_id, token_id)` pair with its TF-IDF score, and any pair that is absent has a score of 0.
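
To make the sparse representation concrete, here is a small pure-Python sketch that computes TF-IDF scores and stores only the non-zero entries. It assumes scikit-learn's default weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); `tfidf_sparse` and the two example documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_sparse(docs):
    """Return {(doc_id, token_id): score} plus the token-to-id vocabulary."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = {tok: i for i, tok in enumerate(sorted({t for toks in tokenized for t in toks}))}
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    entries = {}
    for doc_id, toks in enumerate(tokenized):
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in Counter(toks).items()
        }
        # L2-normalize each row so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        for tok, w in weights.items():
            entries[(doc_id, vocab[tok])] = w / norm
    return entries, vocab

docs = ["card fee", "card loan"]
entries, vocab = tfidf_sparse(docs)
for key in sorted(entries):
    print(key, round(entries[key], 3))

# A pair that is not stored is implicitly zero
print(entries.get((0, vocab["loan"]), 0.0))
```

Within the first document, the rarer term "fee" receives a higher score than "card", which appears in both documents.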

In [None]:
doc_mat = tfidf_vec.fit_transform(df_text['complaints_fin_ver'])

### 6.2 Find Optimal Number of Topics with NMF 

Non-Negative Matrix Factorization (NMF) is an unsupervised technique in which a high-dimensional non-negative matrix (here, the TF-IDF document-term matrix) is factorized into two lower-rank matrices: one mapping complaints to topics and one mapping topics to words. All entries of both factors are constrained to be non-negative.
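
The factorization can be sketched in a few lines of NumPy. The example below uses the classic multiplicative update rules for the Frobenius-norm objective on a tiny made-up matrix; it is an illustration of the idea only (scikit-learn's default solver is coordinate descent, not these updates), and `nmf_multiplicative` is a hypothetical helper.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Approximate V (non-negative) as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "document-term" matrix with two obvious blocks (two topics)
V = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])
W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)
print(W.shape, H.shape)
print("reconstruction error:", error)
```

In scikit-learn terms, `W` plays the role of the output of `nmf_model.transform(doc_mat)` (complaints by topics) and `H` the role of `nmf_model.components_` (topics by words).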

We will use a **coherence model** to automatically select the best number of topics.
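
The intuition behind a coherence score is that the top words of a good topic tend to occur together in the same documents. The sketch below uses a simplified UMass-style measure based on document co-occurrence counts, not the `c_v` measure that Gensim computes in this notebook; the documents, candidate topics, and `umass_coherence` helper are all hypothetical.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        # Number of documents containing all the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_count(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_count(top_words[l]))
    return score

docs = [
    ["card", "fee", "credit"],
    ["card", "credit", "limit"],
    ["loan", "mortgage", "rate"],
    ["mortgage", "loan", "payment"],
]
candidates = {
    "coherent topic": ["card", "credit"],
    "mixed topic": ["card", "mortgage"],
}
scores = {name: umass_coherence(words, docs) for name, words in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Selecting the number of topics follows the same pattern: compute one score per candidate model and keep the candidate with the highest score.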

In [None]:
# Use Gensim's NMF to get the best num of topics via coherence score
texts = df_text['complaints_fin_ver']
dataset = [d.split() for d in texts]

In [None]:
# Create a Gensim dictionary, i.e., a mapping between 
# words and their integer id
dictionary = Dictionary(dataset)

In [None]:
# Filter out extremes to limit the number of features
dictionary.filter_extremes(no_below=3, no_above=0.85, keep_n=5000)

In [None]:
# Create the bag-of-words format => list of tuples with 
# each tuple being (token_id, token_count)
corpus = [dictionary.doc2bow(text) for text in dataset]

In [None]:
# Create a list of the topic numbers we want to try
topic_nums = list(np.arange(5, 10, 1))

In [None]:
# Run the nmf model and calculate the coherence score
# for each number of topics
coherence_scores = []
for num in topic_nums:

    nmf_gensim = nmf.Nmf(corpus=corpus, num_topics=num, id2word=dictionary, chunksize=2000, 
              passes=5, kappa=.1, minimum_probability=0.01, w_max_iter=300, 
              w_stop_condition=0.0001, h_max_iter=100, h_stop_condition=0.001, 
              eval_every=10, normalize=True, random_state=42)
    
    # Run the coherence model to get the score; `c_v` expects 
    # tokenized texts (lists of tokens), i.e., `dataset`
    cm = CoherenceModel(model=nmf_gensim, texts=dataset, dictionary=dictionary, coherence='c_v')
    coherence_scores.append(round(cm.get_coherence(), 5))

In [None]:
# Get the number of topics with the highest coherence score
scores = list(zip(topic_nums, coherence_scores))
best_num_topics = sorted(scores, key=itemgetter(1), reverse=True)[0][0]
print(best_num_topics)

### 6.3 Manual Topic Modeling with NMF

With the above `CoherenceModel`, we got the best number of topics as 5. Now, all we need to do is run the model. The only required parameter is the number of components, i.e., the number of topics we want. *Choosing this number is the most crucial part of any topic modeling process and greatly affects how good the final topics are.*

In [None]:
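nmf_model = decomp.NMF(n_components=5,random_state=40)
nmf_model.fit(doc_mat)
# Vocabulary size; `get_feature_names_out()` replaces the older 
# `get_feature_names()` in scikit-learn >= 1.0
print(len(tfidf_vec.get_feature_names_out()))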

In [None]:
# Print the top 10 words of a sample component
single_topic = nmf_model.components_[0]
single_topic.argsort()
top_word_indices = single_topic.argsort()[-10:]
feature_names = tfidf_vec.get_feature_names_out()
for index in top_word_indices:
    print(feature_names[index])

In [None]:
# Print the top 15 words for each of the topics
feature_names = tfidf_vec.get_feature_names_out()
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

In [None]:
# Find the best (highest-weight) topic for each complaint
topic_results = nmf_model.transform(doc_mat)
topic_results[0].round(2)
topic_results[0].argmax()
topic_results.argmax(axis=1)

In [None]:
# Assign the best topic to each of the complaints in 
# `Topic` column
df_text['Topic'] = topic_results.argmax(axis=1)

In [None]:
df_text.head()

In [None]:
# Print the first 5 complaints for each of the topics
df_dc=df_text.groupby('Topic').head(5)
df_dc.sort_values('Topic')

In [None]:
# Create a dictionary of topic names and 
# topics, i.e., topic number
topic_names = {
    0: "Bank Account Services", 
    1: "Credit/Prepaid Card", 
    2: "Others", 
    3: "Theft/Dispute Reporting", 
    4: "Mortgage/Loan"
}

In [None]:
# Replace Topics with Topic Names
df_text['Topic'] = df_text['Topic'].map(topic_names)

In [None]:
display(df_text.head())

## **7. Build Supervised Model to Classify New Complaints** 

We have analyzed and preprocessed raw text data (consumer complaints) and clustered them into 5 topics using the NMF technique. In this section, we will use supervised machine learning to classify new consumer complaints into the appropriate topic. 

Since we will be using a supervised learning technique, we convert the topic names back to integer class labels, which the evaluation code in this section uses to index the list of topic names.

In [None]:
Topic_names = {
    "Bank Account Services": 0, 
    "Credit/Prepaid Card": 1, 
    "Others": 2, 
    "Theft/Dispute Reporting": 3, 
    "Mortgage/Loan":4
}

df_text['Topic'] = df_text['Topic'].map(Topic_names)

In [None]:
display(df_text.head())

In [None]:
train_data = df_text.loc[:, ["complaints_unclean", "Topic"]]
display(train_data.sample(n=6))

In [None]:
# Build the token count matrix 
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data.complaints_unclean)

# Save the fitted vocabulary to disk 
with open("count_vector.pkl", "wb") as f_out:
    pickle.dump(count_vect.vocabulary_, f_out)

In [None]:
# Transform the count matrix to a TF-IDF matrix 
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Save the fitted TF-IDF transformer to disk
with open("tfidf.pkl", "wb") as f_out:
    pickle.dump(tfidf_transformer, f_out)

In [None]:
def train_test_evaluate(estimator=None, name=None, 
                        trainData=None, testData=None, testSize:float=0.3, 
                        randomState:int=42):
    """
    Perform train-test split, train model, test model, and evaluate model performance.
    Args:
        estimator:sklearn.estimator class object    | Default value None
        name:str (file name stem of saved model)    | Default value None
        trainData:feature matrix (e.g., TF-IDF)     | Default value None
        testData:pd.Series of integer topic labels  | Default value None
        testSize:float                              | Default value 0.3
        randomState:int                             | Default value 42
    """

    # Perform train-test split with `test_size=testSize`
    X_train, X_test, y_train, y_test = tts(
        trainData, testData, test_size=testSize, random_state=randomState)

    # Fit the given classifier on the training split
    clf = estimator.fit(X_train, y_train)

    # Save model to disk
    model_name = name + ".pkl"
    with open(model_name, "wb") as f_out:
        pickle.dump(clf, f_out)
    print('Model created and saved to disk!')

    # Manual creation of the topic names as a list
    target_names = [
    "Bank Account Services", 
    "Credit/Prepaid Card", 
    "Others", 
    "Theft/Dispute Reporting", 
    "Mortgage/Loan"
    ]

    # A single new complaint, wrapped in a list (one document)
    docs_new = [
        "I can't get any info from chase about who services my mortgage, "
        "who owns it and who has original loan docs"
    ]

    # Load vectorizer vocabulary, TF-IDF transformer, and model
    with open("count_vector.pkl", "rb") as f_in:
        load_cvec = CountVectorizer(vocabulary=pickle.load(f_in))
    with open("tfidf.pkl", "rb") as f_in:
        load_tfidf = pickle.load(f_in)
    with open(model_name, "rb") as f_in:
        load_model = pickle.load(f_in)
    print('Model loaded from disk!')

    # Classify the new complaint
    X_new_counts = load_cvec.transform(docs_new)
    X_new_tfidf = load_tfidf.transform(X_new_counts)
    predicted_new = load_model.predict(X_new_tfidf)
    print('Predicted topic for new complaint: ', target_names[predicted_new[0]])

    # Test model: Make predictions on the test split and evaluate model
    predicted = load_model.predict(X_test)
    result = pd.DataFrame({
        'true_topic': y_test.apply(lambda x: target_names[x]), 
        'predicted_topic': [target_names[x] for x in predicted], 
        'true_topic_num': y_test, 
        'predicted_topic_num': predicted
        })
    display(result.head(10))
    print()
   
    conf_mat = confusion_matrix(y_test,predicted)
    print(conf_mat)
    print()
    clf_report = classification_report(
        y_test, predicted, target_names=target_names)
    print(clf_report)

### 7.1 Predictive Classifier 1: Multinomial Naive Bayes 

In [None]:
clf_mB = MultinomialNB()
train_test_evaluate(estimator=clf_mB, name='nb_model', 
                    trainData=X_train_tfidf, testData=train_data.Topic, 
                    testSize=0.25)

### 7.2 Predictive Classifier 2: Logistic Regression  

In [None]:
clf_lr = LogisticRegression(random_state=0)
train_test_evaluate(estimator=clf_lr, name='lr_model',
                    trainData=X_train_tfidf, testData=train_data.Topic, 
                    testSize=0.25)

### 7.3 Predictive Classifier 3: Decision Tree  

In [None]:
clf_dt = DecisionTreeClassifier(random_state=0)
train_test_evaluate(estimator=clf_dt, name='dt_model',
                    trainData=X_train_tfidf, testData=train_data.Topic, 
                    testSize=0.25)

### 7.4 Predictive Classifier 4: Random Forest


In [None]:
clf_rf = RandomForestClassifier(random_state=0)
train_test_evaluate(estimator=clf_rf, name='rf_model',
                    trainData=X_train_tfidf, testData=train_data.Topic, 
                    testSize=0.25)

## **8. Conclusions** 

1. Of the four models compared, *without any kind of model optimization*, the Logistic Regression algorithm performs best in terms of F1 score, precision, and recall for all 5 topics.


2. The *unoptimized* performances of the Decision Tree and Random Forest classifiers come in second and third positions, respectively. The Decision Tree classifier is found to have a very balanced performance w.r.t. all 5 classes.


3. The Multinomial Naive Bayes algorithm, in comparison to the other 3 algorithms, performs very poorly w.r.t. the **Mortgage/Loan** topic.
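
The precision, recall, and F1 values behind these comparisons can all be derived from a confusion matrix. The sketch below shows the arithmetic on a made-up two-class matrix (not results from this notebook), assuming the scikit-learn convention that rows are true labels and columns are predicted labels; `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(conf_mat):
    """Return (precision, recall, f1) per class from a square confusion matrix."""
    n = len(conf_mat)
    scores = []
    for k in range(n):
        tp = conf_mat[k][k]
        fp = sum(conf_mat[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(conf_mat[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append((precision, recall, f1))
    return scores

# Hypothetical confusion matrix for two classes
conf_mat = [
    [8, 2],
    [1, 9],
]
scores = per_class_scores(conf_mat)
for label, (precision, recall, f1) in enumerate(scores):
    print(f"class {label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A class with low recall, as reported above for Multinomial Naive Bayes on **Mortgage/Loan**, shows up as a row whose diagonal entry is small relative to the row total.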