## Problem Statement 

You need to build a model that is able to classify customer complaints based on the products/services. By doing so, you can segregate these tickets into their relevant categories and, therefore, help in the quick resolution of the issue.

You will be doing topic modelling on the <b>.json</b> data provided by the company. Since this data is not labelled, you need to apply NMF to analyse patterns and classify tickets into the following five clusters based on their products/services:

* Credit card / Prepaid card

* Bank account services

* Theft/Dispute reporting

* Mortgages/loans

* Others 


With the help of topic modelling, you will be able to map each ticket onto its respective department/category. You can then use this data to train any supervised model such as logistic regression, decision tree or random forest. Using this trained model, you can classify any new customer complaint support ticket into its relevant department.

## Pipelines that need to be performed:

You need to perform the following eight major tasks to complete the assignment:

1.  Data loading

2. Text preprocessing

3. Exploratory data analysis (EDA)

4. Feature extraction

5. Topic modelling 

6. Model building using supervised learning

7. Model training and evaluation

8. Model inference

## Importing the necessary libraries

In [70]:
import json 
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import en_core_web_sm
nlp = en_core_web_sm.load()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from pprint import pprint

from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist

In [71]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [72]:
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

Archive:  /usr/share/nltk_data/corpora/wordnet.zip
replace /usr/share/nltk_data/corpora/wordnet/lexnames? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


## Loading the data

The data is in JSON format and we need to convert it to a dataframe.
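As a rough illustration of what the conversion does, the sketch below flattens two made-up records with `pd.json_normalize`. The records and their values are hypothetical and only mimic the nested `_source` layout of the complaint export; the real file has many more fields.

```python
import pandas as pd

# Two hypothetical records shaped like the complaint export (illustrative values only)
records = [
    {"_id": "1", "_source": {"product": "Mortgage", "complaint_what_happened": ""}},
    {"_id": "2", "_source": {"product": "Credit card",
                             "complaint_what_happened": "card was charged twice"}},
]

# Nested keys are flattened into dot-separated column names such as '_source.product'
toy_df = pd.json_normalize(records)
print(list(toy_df.columns))
print(toy_df.shape)
```

This is why the real dataframe ends up with column names prefixed by `_source.`, which are cleaned up later.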

In [73]:
# Open the JSON file (write the path to your data file here)
with open('/kaggle/input/automatic-ticket-classification/complaints-2021-05-14_08_16.json') as f:
    # json.load returns the parsed JSON object
    data = json.load(f)

# Flatten the nested records into a dataframe
df = pd.json_normalize(data)

## Data preparation

In [74]:
# Inspect the dataframe to understand the given data.
df.sample(10)


Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,...,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
20682,complaint-public-v2,complaint,4242096,0.0,,770XX,4242096,Unauthorized transactions or other transaction...,2021-03-24T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2021-03-24T12:00:00-05:00,,Mobile or digital wallet,Yes,,,
66138,complaint-public-v2,complaint,2079539,0.0,,,2079539,Disclosure verification of debt,2016-08-25T12:00:00-05:00,,...,Closed with explanation,JPMORGAN CHASE & CO.,Postal mail,2016-08-29T12:00:00-05:00,,Credit card,Yes,,Not given enough info to verify debt,
25562,complaint-public-v2,complaint,3797044,0.0,,114XX,3797044,Attempts to collect debt not owed,2020-08-14T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2020-08-24T12:00:00-05:00,,Credit card debt,Yes,i did a debt resolution program back in XXXX o...,Debt was paid,Consent provided
43831,complaint-public-v2,complaint,2705796,0.0,,,2705796,Managing an account,2017-10-18T12:00:00-05:00,WA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2017-10-18T12:00:00-05:00,,Checking account,Yes,,Problem using a debit or ATM card,Other
21217,complaint-public-v2,complaint,3350123,0.0,,100XX,3350123,Incorrect information on your report,2019-08-22T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-08-22T12:00:00-05:00,,Credit reporting,Yes,I have been going back and forth with chase si...,Personal information incorrect,Consent provided
25196,complaint-public-v2,complaint,3997521,0.0,"Older American, Servicemember",34609,3997521,Managing an account,2020-12-08T12:00:00-05:00,FL,...,Closed with explanation,JPMORGAN CHASE & CO.,Phone,2020-12-08T12:00:00-05:00,,Checking account,Yes,,Deposits and withdrawals,
7987,complaint-public-v2,complaint,3448442,0.0,Servicemember,,3448442,Managing an account,2019-11-23T12:00:00-05:00,CO,...,Closed with monetary relief,JPMORGAN CHASE & CO.,Web,2019-11-23T12:00:00-05:00,,Checking account,Yes,"I was approached yesterday on XX/XX/19, by a w...",Deposits and withdrawals,Consent provided
49910,complaint-public-v2,complaint,1037094,0.0,,92115,1037094,APR or interest rate,2014-09-19T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2014-09-19T12:00:00-05:00,,,Yes,,,
43662,complaint-public-v2,complaint,2700861,0.0,,432XX,2700861,Getting a credit card,2017-10-13T12:00:00-05:00,OH,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2017-10-13T12:00:00-05:00,,Store credit card,Yes,Chase bank ohio applay Credit Cards. Chase fre...,Application denied,Consent provided
75165,complaint-public-v2,complaint,2992154,0.0,,331XX,2992154,Problem with a purchase shown on your statement,2018-08-15T12:00:00-05:00,FL,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2018-08-15T12:00:00-05:00,,General-purpose credit card or charge card,Yes,Purchase Date Of XXXX XXXX Services : XX/XX/2...,Credit card company isn't resolving a dispute ...,Consent provided


In [75]:
df.shape

(78313, 22)

In [76]:
#print the column names
df.columns

Index(['_index', '_type', '_id', '_score', '_source.tags', '_source.zip_code',
       '_source.complaint_id', '_source.issue', '_source.date_received',
       '_source.state', '_source.consumer_disputed', '_source.product',
       '_source.company_response', '_source.company', '_source.submitted_via',
       '_source.date_sent_to_company', '_source.company_public_response',
       '_source.sub_product', '_source.timely',
       '_source.complaint_what_happened', '_source.sub_issue',
       '_source.consumer_consent_provided'],
      dtype='object')

In [85]:
#Assign new column names by stripping the leading '_' and the 'source.' prefix
def clean_column_name(column_name):
    cleaned_column_name = re.sub(r'^_', '', column_name)
    return re.sub(r'^source\.', '', cleaned_column_name)

df.rename(columns=clean_column_name, inplace=True)

In [86]:
df.columns

Index(['index', 'type', 'id', 'score', 'tags', 'zip_code', 'complaint_id',
       'issue', 'date_received', 'state', 'consumer_disputed', 'product',
       'company_response', 'company', 'submitted_via', 'date_sent_to_company',
       'company_public_response', 'sub_product', 'timely',
       'complaint_what_happened', 'sub_issue', 'consumer_consent_provided'],
      dtype='object')

In [79]:
df.sample(10)

Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,...,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
58025,complaint-public-v2,complaint,2805412,0.0,,070XX,2805412,Closing your account,2018-02-06T12:00:00-05:00,NJ,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2018-02-06T12:00:00-05:00,,General-purpose credit card or charge card,Yes,,Company closed your account,Consent not provided
17041,complaint-public-v2,complaint,2822777,0.0,,30008,2822777,Managing an account,2018-02-22T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Postal mail,2018-02-22T12:00:00-05:00,,Other banking product or service,Yes,,Deposits and withdrawals,
30176,complaint-public-v2,complaint,3274115,0.0,,750XX,3274115,Problem with a purchase or transfer,2019-06-13T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-06-13T12:00:00-05:00,,General-purpose prepaid card,Yes,I AM INSTANTLY PUSHED OVER TO CORPORATE WHO I ...,Card company isn't resolving a dispute about a...,Consent provided
22024,complaint-public-v2,complaint,3196099,0.0,Servicemember,30016,3196099,Struggling to pay mortgage,2019-03-30T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-03-30T12:00:00-05:00,,FHA mortgage,Yes,,,Consent not provided
53296,complaint-public-v2,complaint,1651159,0.0,,89141,1651159,"Loan modification,collection,foreclosure",2015-11-12T12:00:00-05:00,NV,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2015-11-16T12:00:00-05:00,,FHA mortgage,Yes,,,Consent not provided
42473,complaint-public-v2,complaint,40720,0.0,,598XX,40720,"Loan servicing, payments, escrow account",2012-03-25T12:00:00-05:00,MT,...,Closed without relief,JPMORGAN CHASE & CO.,Referral,2012-04-05T12:00:00-05:00,,Other mortgage,Yes,,,
48352,complaint-public-v2,complaint,192971,0.0,Servicemember,490XX,192971,Billing disputes,2012-11-19T12:00:00-05:00,MI,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2012-11-19T12:00:00-05:00,,,Yes,,,
23812,complaint-public-v2,complaint,644391,0.0,,949XX,644391,Payoff process,2013-12-18T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2013-12-17T12:00:00-05:00,,,Yes,,,
76054,complaint-public-v2,complaint,2885647,0.0,,77004,2885647,Improper use of your report,2018-04-23T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2018-04-24T12:00:00-05:00,,Credit reporting,Yes,,Credit inquiries on your report that you don't...,
25811,complaint-public-v2,complaint,3453244,0.0,,27950,3453244,Trouble during payment process,2019-11-29T12:00:00-05:00,NC,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-12-04T12:00:00-05:00,,Other type of mortgage,Yes,,,Consent withdrawn


In [87]:
(df.complaint_what_happened == "").sum()

57241

In [None]:
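#Assign nan in place of blanks in the complaints column
# Use .loc with the column name so only this column is changed, not the whole row
df.loc[df.complaint_what_happened == "", 'complaint_what_happened'] = np.nan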

In [None]:
(df.complaint_what_happened == "").sum()

In [None]:
# NaN never compares equal to itself, so count missing values with isna()
df.complaint_what_happened.isna().sum()

In [None]:
#Remove all rows where complaints column is nan
df.dropna(subset=['complaint_what_happened'], inplace=True)

In [None]:
df.shape

## Prepare the text for topic modeling

Once you have removed all the blank complaints, you need to:

* Make the text lowercase
* Remove text in square brackets
* Remove punctuation
* Remove words containing numbers


Once you have done these cleaning operations you need to perform the following:
* Lemmatize the texts
* Extract the POS tags of the lemmatized text and keep only the words tagged as singular nouns (tag == "NN"), removing all other words.


In [None]:
# Write your function here to clean the text and remove all the unnecessary elements.
def clean_text(text):
    # Make the text lowercase
    text = text.lower()
    
    # Remove text in square brackets using regular expression
    text = re.sub(r'\[.*?\]', '', text)
    
    # Remove punctuation using string library
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove words containing numbers using regular expression
    text = re.sub(r'\w*\d\w*', '', text)
    
    return text
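
To see what each cleaning step does, the sketch below applies the same four operations, in the same order, to a made-up sentence. The sample text is hypothetical; the regular expressions are the ones used in `clean_text`.

```python
import re
import string

# Hypothetical complaint text, for illustration only
sample = "On 03/14 I called [redacted] about my Card2Go account!!"

text = sample.lower()                                             # lowercase
text = re.sub(r'\[.*?\]', '', text)                               # drop text in square brackets
text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
text = re.sub(r'\w*\d\w*', '', text)                              # drop words containing digits

print(text.split())
# ['on', 'i', 'called', 'about', 'my', 'account']
```

Because punctuation is removed before the digit filter runs, `03/14` first becomes `0314` and is then dropped as a word containing numbers.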

In [None]:
#Write your function to Lemmatize the texts
# Initialize the WordNet Lemmatizer once and reuse it for every complaint
lemmatizer = WordNetLemmatizer()

# Map the first letter of a Penn Treebank tag to a WordNet POS code.
# nltk.pos_tag marks adjectives as 'JJ*', so 'j' (not 'a') must map to 'a'.
PENN_TO_WORDNET = {'j': 'a', 'r': 'r', 'n': 'n', 'v': 'v'}

def lemmatize_text(text):
    # Tokenize the text into words
    words = word_tokenize(text.lower())
    
    # Lemmatize each word using its part of speech (POS); unknown tags default to noun
    lemmatized_words = []
    for word, pos in nltk.pos_tag(words):
        pos_letter = PENN_TO_WORDNET.get(pos[0].lower(), 'n')
        lemma = lemmatizer.lemmatize(word, pos=pos_letter)
        lemmatized_words.append(lemma)
    
    # Join the lemmatized words back into a sentence
    lemmatized_text = ' '.join(lemmatized_words)
    
    return lemmatized_text

In [None]:
df['cleaned_complaints'] = df['complaint_what_happened'].apply(clean_text)

In [None]:
from tqdm import tqdm
tqdm.pandas()
df['lemmatized_complaints'] = df['cleaned_complaints'].progress_apply(lemmatize_text)

In [None]:
df['lemmatized_complaints'].sample(10)

In [None]:
#Create a dataframe('df_clean') that will have only the complaints and the lemmatized complaints 
# .copy() makes df_clean independent of df, so adding columns later does not raise a SettingWithCopyWarning
df_clean = df[['complaint_what_happened','lemmatized_complaints']].copy()

In [None]:
df_clean

In [None]:
# Write your function to extract the POS tags
def extract_pos_tag(sentence):
    # Tokenize the sentence into words
    words = word_tokenize(sentence)
    
    # Perform POS tagging using nltk.pos_tag
    pos_tags = nltk.pos_tag(words)
    
    # Extract words with tags 'NN', join them, and return
    return ' '.join([word for (word, tag) in pos_tags if tag == "NN" and word != 'i'])

df_clean["complaint_POS_removed"] = df_clean["lemmatized_complaints"].progress_apply(extract_pos_tag)
df_clean["length"] = df_clean["complaint_POS_removed"].progress_apply(len)


In [None]:
#The clean dataframe should now contain the raw complaint, lemmatized complaint and the complaint after removing POS tags.
df_clean

## Exploratory data analysis to get familiar with the data.

Write the code in this task to perform the following:

*   Visualise the data according to the 'Complaint' character length
*   Using a word cloud, find the top 40 words by frequency among all the complaints after processing the text
*   Find the top unigrams, bigrams and trigrams by frequency among all the complaints after processing the text




In [None]:
# Write your code here to visualise the data according to the 'Complaint' character length
plt.figure(figsize=(18,5))
plt.hist([l for l in df_clean.length if l < 4000], bins=50)
plt.xlabel("Complaint Character Length")
plt.show()

#### Find the top 40 words by frequency among all the complaints after processing the text.

In [None]:
#Using a word cloud find the top 40 words by frequency among all the complaints after processing the text
from wordcloud import WordCloud

# Join every complaint into one string; str(Series) would only pass the truncated printout of the Series
wordcloud = WordCloud(
    max_words=40,
    max_font_size=40
).generate(' '.join(df_clean['complaint_POS_removed']))

fig = plt.figure(figsize=(20,15))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
#Removing -PRON- from the text corpus
df_clean['Complaint_clean'] = df_clean['complaint_POS_removed'].str.replace('-PRON-', '')

#### Find the top unigrams,bigrams and trigrams by frequency among all the complaints after processing the text.

In [None]:
#Write your code here to find the top 30 unigram frequency among the complaints in the cleaned dataframe(df_clean). 
complaints_list = df_clean['Complaint_clean'].tolist()

# Concatenate all complaints into a single string
all_complaints_text = " ".join(complaints_list)

# Tokenize the text into words
words = word_tokenize(all_complaints_text)

# Calculate the frequency distribution of unigrams
fdist = FreqDist(words)

# Get the top 30 most common unigrams
top_30_unigrams = fdist.most_common(30)

# Create a DataFrame to display the results
top_30_unigrams_df = pd.DataFrame(top_30_unigrams, columns=['Unigram', 'Frequency'])

# Display the top 30 unigrams
print(top_30_unigrams_df)

In [None]:
#Print the top 10 words in the unigram frequency
top_30_unigrams_df.head(10)

In [None]:
#Write your code here to find the top 30 bigram frequency among the complaints in the cleaned dataframe(df_clean). 
# Create bigrams from the list of words
bigrams = list(nltk.bigrams(words))

# Calculate the frequency distribution of bigrams
fdist = FreqDist(bigrams)

# Get the top 30 most common bigrams
top_30_bigrams = fdist.most_common(30)

# Create a DataFrame to display the results
top_30_bigrams_df = pd.DataFrame(top_30_bigrams, columns=['Bigram', 'Frequency'])

# Display the top 30 bigrams
print(top_30_bigrams_df)

In [None]:
#Print the top 10 words in the bigram frequency
top_30_bigrams_df.head(10)

In [None]:
#Write your code here to find the top 30 trigram frequency among the complaints in the cleaned dataframe(df_clean). 
# Create trigrams from the list of words
trigrams = list(nltk.ngrams(words, 3))

# Calculate the frequency distribution of trigrams
fdist = FreqDist(trigrams)

# Get the top 30 most common trigrams
top_30_trigrams = fdist.most_common(30)

# Create a DataFrame to display the results
top_30_trigrams_df = pd.DataFrame(top_30_trigrams, columns=['Trigram', 'Frequency'])

# Display the top 30 trigrams
print(top_30_trigrams_df)

In [None]:
#Print the top 10 words in the trigram frequency
top_30_trigrams_df.head(10)
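
The n-gram counts above can also be reproduced with the standard library alone, which may help when checking the results on a small sample. The helper `top_ngrams` and the token list below are hypothetical and not part of the notebook.

```python
from collections import Counter

# Hypothetical token list standing in for the tokenized complaints
tokens = "card fee card fee dispute card fee".split()

def top_ngrams(tokens, n, k):
    """Return the k most common n-grams as (tuple_of_words, count) pairs."""
    # zip over n shifted copies of the token list yields consecutive n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

print(top_ngrams(tokens, 1, 2))  # [(('card',), 3), (('fee',), 3)]
print(top_ngrams(tokens, 2, 1))  # [(('card', 'fee'), 3)]
```

Note that, like the `nltk` version above, joining all complaints into one token stream creates a few n-grams that span the boundary between two complaints.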

## The personal details of customers have been masked in the dataset with xxxx. Let's remove the masked text, as it is of no use for our analysis

In [None]:
df_clean['Complaint_clean'] = df_clean['Complaint_clean'].str.replace('xxxx','')

In [None]:
#All masked texts have been removed
df_clean

## Feature Extraction
Convert the raw texts to a matrix of TF-IDF features

**max_df** is used for removing terms that appear too frequently, also known as "corpus-specific stop words"
max_df = 0.95 means "ignore terms that appear in more than 95% of the complaints"

**min_df** is used for removing terms that appear too infrequently
min_df = 2 means "ignore terms that appear in less than 2 complaints"
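
A small sketch of how the two thresholds filter the vocabulary, using four made-up documents (the documents are hypothetical; the parameter values match the ones described above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only
docs = [
    "bank closed account",
    "bank charged card fee",
    "bank denied mortgage loan",
    "bank mortgage card payment",
]

# "bank" occurs in 4 of 4 documents (more than 95%), so max_df drops it.
# Words that occur in a single document fail min_df=2.
toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
toy_tfidf.fit(docs)

print(sorted(toy_tfidf.vocabulary_))
# ['card', 'mortgage']
```

Only the terms that appear in at least two documents but not in nearly all of them survive.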

In [None]:
#Write your code here to initialise the TfidfVectorizer 
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")

#### Create a document term matrix using fit_transform

The document term matrix is stored as a sparse matrix: each stored entry is a (complaint_id, token_id) pair together with its tf-idf score.
Pairs that are not stored have a tf-idf score of 0.

In [None]:
#Write your code here to create the Document Term Matrix by transforming the complaints column present in df_clean.
dtm = tfidf.fit_transform(df_clean.Complaint_clean)

## Topic Modelling using NMF

Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so there are no topic labels for the model to be trained on. NMF decomposes (or factorizes) the high-dimensional document term matrix into two lower-dimensional matrices, one relating complaints to topics and one relating topics to terms. All entries of these matrices are non-negative.

In this task you have to perform the following:

* Find the best number of clusters 
* Apply the best number to create word clusters
* Inspect & validate the correctness of each cluster with respect to the complaints 
* Correct the labels if needed 
* Map the clusters to topics/cluster names

In [None]:
from sklearn.decomposition import NMF
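
To make the factorization concrete, here is a minimal sketch on a toy matrix with two obvious blocks. The matrix `V` is made up, and the `init` and `max_iter` settings are choices made for this example, not values taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix: 4 "documents" x 4 "terms" forming two clear blocks
V = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=40, max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic weights, shape (4, 2)
H = toy_nmf.components_       # topic-term weights, shape (2, 4)

print(W.shape, H.shape)
print(bool((W >= 0).all() and (H >= 0).all()))  # both factors are non-negative
print(np.round(W @ H, 2))                       # approximately reconstructs V
```

In the notebook, `V` corresponds to the tf-idf document term matrix, the rows of `W` give each complaint's topic weights, and the rows of `H` give each topic's term weights.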

## Manual Topic Modeling
You need to take a trial-and-error approach to find the best number of topics for your NMF model.

The only parameter that is required is the number of components i.e. the number of topics we want. This is the most crucial step in the whole topic modeling process and will greatly affect how good your final topics are.
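
One possible way to support the trial-and-error search is to compare the reconstruction error for several candidate values. The sketch below does this on a random stand-in matrix, which is an assumption made only so the example runs on its own; in the notebook you would pass the real `dtm` instead.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random stand-in for the document term matrix (assumption: replace with dtm)
rng = np.random.default_rng(0)
toy_dtm = rng.random((60, 30))

errors = {}
for k in range(2, 8):
    model = NMF(n_components=k, init="nndsvda", random_state=40, max_iter=500)
    model.fit(toy_dtm)
    # Frobenius norm of the difference between the data and its reconstruction
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(k, round(err, 3))
```

The error generally shrinks as components are added, so it cannot pick the number of topics by itself. Look for the point where the improvement levels off, and check that the top words of each topic are still interpretable.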

In [None]:
#Load your nmf_model with the n_components i.e 5
num_topics = 5 #write the value you want to test out

#keep the random_state =40
nmf_model = NMF(n_components=num_topics, random_state=40) #write your code here

In [None]:
nmf_model.fit(dtm)
len(tfidf.get_feature_names_out())

In [None]:
#Print the Top15 words for each of the topics
for index,topic in enumerate(nmf_model.components_):
    words_in_topic = []
    for word_index in topic.argsort()[-15:]:
        words_in_topic.append(tfidf.get_feature_names_out()[word_index])
    print(f'topic {index + 1}: {words_in_topic}')
    print('\n')

In [None]:
#Create the best topic for each complaint in terms of integer value 0,1,2,3 & 4
topic_results = nmf_model.transform(dtm)
topic_results[0].round(2)
topic_results[0].argmax()
topic_results.argmax(axis=1)

In [None]:
#Assign the best topic to each of the complaints in Topic Column

df_clean['Topic'] = topic_results.argmax(axis=1) #write your code to assign topics to each rows.

In [None]:
df_clean.head()

In [None]:
#Print the first 5 Complaint for each of the Topics
# Keep the preview in its own variable so that df_clean retains every complaint for training
df_first5 = df_clean.groupby('Topic').head(5)
df_first5.sort_values('Topic')

#### After evaluating the mapping, if the topics assigned are correct then assign these names to the relevant topic:
* Bank Account services
* Credit card or prepaid card
* Theft/Dispute Reporting
* Mortgage/Loan
* Others

In [None]:
#Create the dictionary of Topic names and Topics

Topic_names = {0:"Bank Account services",
               1:"Credit card or prepaid card", 
               2:"Others",
               3:"Theft/Dispute Reporting",
               4:"Mortgage/Loan"}
#Replace Topics with Topic Names
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean

## Supervised model to predict any new complaints to the relevant Topics.

You have now built the model that creates a topic for each complaint. In the section below, you will use these topics to classify any new complaints.

Since you will be using a supervised learning technique, you have to convert the topic names to numbers (numeric labels are simpler to work with as NumPy arrays).

In [None]:
#Create the dictionary again of Topic names and Topics

Topic_names = {"Bank Account services":0,
               "Credit card or prepaid card":1,
               "Others":2,
               "Theft/Dispute Reporting":3,
               "Mortgage/Loan":4}
#Replace Topic Names with their numeric ids
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean

In [None]:
#Keep the columns"complaint_what_happened" & "Topic" only in the new dataframe --> training_data
training_data = df_clean[["complaint_what_happened","Topic"]]

In [None]:
training_data

#### Apply the supervised models on the training data created. In this process, you have to do the following:
* Create the vector counts using Count Vectoriser
* Transform the word vector to tf-idf
* Create the train & test data using the train_test_split on the tf-idf & topics


In [None]:

#Write your code to get the Vector count
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_data.complaint_what_happened)

#Write your code here to transform the word vector to tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [None]:
from sklearn.model_selection import train_test_split

# Performing Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, training_data.Topic, test_size=0.25, random_state=42)

You have to try at least 3 models on the train & test data from these options:
* Logistic regression
* Decision Tree
* Random Forest
* Naive Bayes (optional)

**Using the required evaluation metrics judge the tried models and select the ones performing the best**
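
As a sketch of what evaluation could look like beyond a single F1 number, the example below prints a confusion matrix and a per-class report. The label lists are made up for illustration; in the notebook they would be `y_test` and a model's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hypothetical true and predicted topic ids (0-4), for illustration only
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_hat = [0, 0, 1, 2, 2, 2, 3, 0, 4, 4]

# Rows are true topics, columns are predicted topics
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Precision, recall and F1 for each topic
print(classification_report(y_true, y_hat, zero_division=0))

# Single summary number, weighted by the number of samples per topic
print(f1_score(y_true, y_hat, average="weighted"))
```

The confusion matrix shows which topics get mixed up with each other, which the weighted F1 score alone hides.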

In [None]:
# Write your code here to build any 3 models and evaluate them using the required metrics





In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report, confusion_matrix

# Run the Logistic Regression model
model_name = 'LOGISTIC REGRESSION'
clf_lr = LogisticRegression(solver='liblinear')
clf_lr.fit(X_train, y_train)
y_pred_lr = clf_lr.predict(X_test)

In [None]:
# Calculate F1 Score using weighted average method
f1_lr = f1_score(y_test, y_pred_lr, average="weighted")
f1_lr

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Run Decision Tree on default hyperparameters
model_name = 'DECISION TREE'
clf_dt = DecisionTreeClassifier()
# %time only measures the statement on the same line
%time clf_dt.fit(X_train, y_train)
y_pred_dt = clf_dt.predict(X_test)

In [None]:
# Calculate F1 Score using weighted average method
f1_dt = f1_score(y_test, y_pred_dt, average="weighted")
f1_dt

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Run the Random Forest model on default hyperparameters
model_name = 'RANDOM FOREST'
clf_rf = RandomForestClassifier()
# %time only measures the statement on the same line
%time clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)

In [None]:
# Calculate F1 Score using weighted average method
f1_rf = f1_score(y_test, y_pred_rf, average="weighted")
f1_rf
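
The last task in the pipeline, model inference, could look roughly like the sketch below. It trains on a tiny made-up dataset so that it runs on its own; the texts, labels and the `pipe` object are hypothetical. With the notebook's objects, the same idea is to call `count_vect.transform`, then `tfidf_transformer.transform`, then `predict` on the chosen classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hypothetical training set; in the notebook this would be training_data
texts = [
    "my credit card was charged an annual fee",
    "credit card interest rate increased",
    "mortgage loan payment was applied late",
    "loan modification for my mortgage was denied",
]
labels = [1, 1, 4, 4]  # 1 = credit card, 4 = mortgage/loan

# Chaining the steps guarantees new text is transformed exactly like the training text
pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
pipe.fit(texts, labels)

topic_names = {0: "Bank Account services",
               1: "Credit card or prepaid card",
               2: "Others",
               3: "Theft/Dispute Reporting",
               4: "Mortgage/Loan"}

new_complaint = ["the bank raised the interest rate on my credit card"]
pred = int(pipe.predict(new_complaint)[0])
print(topic_names[pred])
```

New complaints must only be passed through `transform` (never `fit_transform`), so that they are mapped onto the vocabulary learned during training.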