<a href="https://colab.research.google.com/github/tmj1432/Death-Row-Last-Statement-Topic-Modeling/blob/main/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In this notebook, we apply topic modeling to the last statements of death row inmates.

---

## Installation

In [94]:
# pip install bertopic

In [95]:
# nltk.download('stopwords')

In [96]:
# nltk.download('wordnet')

In [97]:
# nltk.download('omw-1.4')

---

## Import Libraries

In [98]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [99]:
import pandas as pd
from bertopic import BERTopic
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

---

## Data Cleaning

In [100]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Last Messages Topic Modelling/last_messages.csv')

### Null Values

In [101]:
df.isna().sum()

Unnamed: 0      0
0               0
1               2
2               2
3               2
4               2
5               3
6             381
7             434
8             453
9             464
10            469
11            472
dtype: int64

In [102]:
df.loc[df['6'].notna()].tail()

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
425,425,Date of Execution:,"January 4, 1995",Inmate:,Jesse Jacobs #872,Last Statement:,"I have committed lots of sin in my life, but I...","I would like to tell my son, daughter and wife...","Eden, if they want proof of them, give it to t...",,,,
426,426,Date of Execution:,"December 11, 1994",Inmate:,Raymond Kinnamon #808,Last Statement:,…guys like them got tied up in something like ...,"I’m not ready to go, but I have no choice; I s...",If my words can persuade you to discontinue th...,(I gave Warden Hodges the phone at this time a...,,,
427,427,Date of Execution:,12/06/1994,Inmate:,Herman Clark #715,Last Statement:,I told the daughter not to come. Discontinue; ...,Jesus Christ is the Lord of Lords and the King...,,,,,
438,438,Date of Execution:,"August 31, 1993",Inmate:,Richard J. Wilkerson #756,Last Statement:,This execution is not justice. This execution ...,I will say once again…..This execution isn’t j...,"""Seeing Through the Eyes of a Death Row Inmate""","Sometime I wonder why, why he? Why did he go o...",Richard J. Wilkerson,Written through his sister Michelle Winn,
474,474,Date of Execution:,"December 7, 1982",Inmate:,"Charlie Brooks, Jr. #592",Last Statement:,"Statement to the Media: I, at this very moment...",Spoken:,"Yes, I do. \r\n I love you. \r\n Asdadu an l...",,,,


Since the messages are spread across columns 5 to 11, I will combine them into a single column.

In [103]:
# filling null values with empty space so that we can concatenate the messages
df.fillna(' ', inplace=True)

In [104]:
# concatenate messages in column 5 to 11 inclusive
df['last_message'] = ''
for i in range(5,12):
    df['last_message'] = df['last_message'] + df[str(i)] + ' '
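Filling nulls with a space and then concatenating leaves runs of extra whitespace in statements that occupy only a few columns. A minimal sketch of an alternative follows, assuming the fragments of one row are available as a list. The helper name `join_fragments` and the sample row are hypothetical and not part of the original notebook.

```python
import math


def join_fragments(fragments):
    """Join the non-empty pieces of one statement with single spaces."""
    kept = []
    for piece in fragments:
        # Skip missing values: None, or NaN as pandas represents nulls.
        if piece is None or (isinstance(piece, float) and math.isnan(piece)):
            continue
        text = str(piece).strip()
        if text:  # skip blank or whitespace-only pieces
            kept.append(text)
    return ' '.join(kept)


# Hypothetical row: two real fragments followed by missing columns.
row = ['I love you.', 'Stay strong.', float('nan'), None, ' ']
print(join_fragments(row))
```

In the notebook this could be applied row-wise, for example with `df[[str(i) for i in range(5, 12)]].apply(lambda r: join_fragments(r.tolist()), axis=1)`.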

In [105]:
# check index 438, whose message spans columns 5 to 11
df['last_message'][438]

'This execution is not justice. This execution is an act of revenge! If this is justice, then justice is blind. Take a borderline retarded young male who for the 1st time ever in his life committed a felony then contaminate his TRUE tell all confession add a judge who discriminates plus an ALL-WHITE JURY pile on an ineffective assistance of counsel and execute the option of rehabilitation persecute the witnesses and you have created a death sentence for a family lasting over 10 years. I will say once again…..This execution isn’t justice – but an act of revenge. Killing R.J. will not bring Anil back, it only justifies "an eye for an eye and a tooth for a tooth." It’s too late to help R.J., but maybe this poem will help someone else out there. "Seeing Through the Eyes of a Death Row Inmate" Sometime I wonder why, why he? Why did he go out into the world to see? To be out there and see what really did exist, now his name is written down on the Death Row list. I can only imagine how loneso

### Check for duplicates

In [106]:
df.duplicated().sum()

0

There are no duplicate rows.

---

## Text Preprocessing

In [107]:
# Lemmatizer
wn = WordNetLemmatizer()
# Load the default English stopwords (this rebinds the imported `stopwords` name to a list)
stopwords = nltk.corpus.stopwords.words('english')
print(f'There are {len(stopwords)} default stopwords.')

# 'no' is taken out of the stopword list so that 'no last statement' keeps its meaning
not_stopwords = {'no'}
final_stopwords = {word for word in stopwords if word not in not_stopwords}
print(f'There are {len(final_stopwords)} stopwords after removal.')

There are 179 default stopwords.
There are 178 stopwords after removal.
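
Because the statements are split on whitespace, a token such as `me.` keeps its trailing punctuation and does not match the stopword `me`. The sketch below shows one way to compare tokens against the stopword list after stripping surrounding punctuation. The small `demo_stopwords` set and the helper `remove_stopwords` are illustrative assumptions that stand in for the NLTK list so the example runs without any downloads.

```python
import string

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
demo_stopwords = {'i', 'me', 'my', 'to', 'the', 'no'}
# Keep 'no' so that 'no last statement' retains its meaning.
demo_final_stopwords = demo_stopwords - {'no'}


def remove_stopwords(text, stop_set):
    """Drop tokens whose lowercase, punctuation-stripped form is a stopword."""
    kept = []
    for token in text.split():
        core = token.strip(string.punctuation).lower()
        if core not in stop_set:
            kept.append(token)
    return ' '.join(kept)


print(remove_stopwords('No last statement given.', demo_final_stopwords))
print(remove_stopwords('I ask the Lord to forgive me.', demo_final_stopwords))
```

Only the comparison uses the stripped form; the tokens that are kept retain their original punctuation and casing.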


In [108]:
# Remove stopwords (case-insensitive), keeping 'no'
df['last_message_without_stopwords'] = df['last_message'].apply(lambda x: ' '.join([w for w in x.split() if w.lower() not in final_stopwords]))
# Lemmatization; stopwords were already removed above, so filtering again with the
# full list would only drop the 'no' we chose to keep
df['last_message_lemmatized'] = df['last_message_without_stopwords'].apply(lambda x: ' '.join([wn.lemmatize(w) for w in x.split()]))
# Keep only the text columns and take a look at the data
df = df[['last_message','last_message_without_stopwords','last_message_lemmatized']]
df.head()

Unnamed: 0,last_message,last_message_without_stopwords,last_message_lemmatized
0,No last statement given.,No last statement given.,No last statement given.
1,I want to take this moment to be shared with e...,"want take moment shared everyone, give God glo...","want take moment shared everyone, give God glo..."
2,"Yes, I just want to thank (pause) I don’t wan...","Yes, want thank (pause) don’t want leave baby,...","Yes, want thank (pause) don’t want leave baby,..."
3,I just want to say to the family of Pablo Cas...,"want say family Pablo Castro, appreciate every...","want say family Pablo Castro, appreciate every..."
4,I would like to thank my Jesus Christ my Lord...,would like thank Jesus Christ Lord Savior. wou...,would like thank Jesus Christ Lord Savior. wou...


---

## Topic Modeling with BERTopic

The aim of topic modeling is to discover the themes that run through a corpus by analyzing the words of the original texts.
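
The word scores that BERTopic reports for each topic come from a class-based TF-IDF (c-TF-IDF): all documents in a cluster are treated as one long document, and a term scores highly when it is frequent in that cluster but rare across the others. The sketch below is a simplified pure-Python illustration of that weighting on a toy corpus, not BERTopic's implementation. The function name `c_tf_idf`, the toy clusters, and the exact normalization are assumptions made for the example.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score each term per cluster as tf(t, c) * log(1 + A / f(t)).

    `clusters` maps a cluster id to the list of tokens from all of its
    documents. A is the average token count per cluster and f(t) is the
    total count of term t over all clusters.
    """
    counts = {cid: Counter(tokens) for cid, tokens in clusters.items()}
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    avg_len = sum(total.values()) / len(clusters)

    scores = {}
    for cid, counter in counts.items():
        size = sum(counter.values())
        scores[cid] = {
            term: (freq / size) * math.log(1 + avg_len / total[term])
            for term, freq in counter.items()
        }
    return scores


# Toy clusters (hypothetical tokens, not taken from the dataset).
toy = {
    0: 'sorry forgive family sorry love'.split(),
    1: 'love yall strong love family'.split(),
}
scores = c_tf_idf(toy)
for cid, terms in scores.items():
    top = sorted(terms, key=terms.get, reverse=True)[:2]
    print(cid, top)
```

Words shared by both toy clusters, such as `family`, receive a lower weight than words concentrated in one cluster, which is why very common words can still dominate when they are frequent everywhere in a small corpus.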

### HDBSCAN

HDBSCAN does not require us to set the number of clusters in advance.

In [109]:
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model,n_gram_range=(1, 3))
topics, probs = topic_model.fit_transform(df['last_message_lemmatized'])

In [110]:
topic_model.get_topics()

{-1: [('love', 0.038629578773375396),
  ('you', 0.023075620937316806),
  ('know', 0.021277773034272217),
  ('family', 0.017618682929228743),
  ('thank', 0.015954393826922902),
  ('yall', 0.015366491906310217),
  ('me', 0.014660085155271067),
  ('sorry', 0.014606957459921951),
  ('love you', 0.013671850462598634),
  ('im', 0.013593974504982183)],
 0: [('sorry', 0.03211017328404651),
  ('family', 0.030586753341662925),
  ('love', 0.027785229022126275),
  ('me', 0.023052380379648234),
  ('forgive', 0.021794635149903877),
  ('you', 0.021233029396584897),
  ('know', 0.020579520558085807),
  ('hope', 0.018172382968058225),
  ('like', 0.017706886436030914),
  ('would', 0.017330543187604568)],
 1: [('love', 0.07787060476117681),
  ('yall', 0.04131851505863636),
  ('tell', 0.04038798795529721),
  ('you', 0.03908874806337),
  ('love you', 0.03858007555819863),
  ('family', 0.03645468891175988),
  ('strong', 0.028163975136233206),
  ('want', 0.027026722003120174),
  ('stay', 0.0269184840847446),


After looking at the topics produced with HDBSCAN, I felt that they were not representative enough, so we will explore other clustering techniques.
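
HDBSCAN assigns documents that fit no cluster to topic -1, so one rough check on how well the topics cover the corpus is the share of statements labeled -1. The sketch below computes that share. The helper `outlier_share` and the `example_topics` list are hypothetical; the list stands in for the `topics` value returned by `fit_transform` and does not contain results from this notebook.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


# Hypothetical assignments, not results from the notebook.
example_topics = [-1, 0, 0, 1, -1, 0, 1, -1]
print(f'{outlier_share(example_topics):.1%} of documents are outliers')
```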

The following two clustering techniques require us to set the number of clusters.

### Agglomerative Clustering

In [111]:
cluster_model = AgglomerativeClustering(n_clusters=5)
topic_model = BERTopic(hdbscan_model=cluster_model,n_gram_range=(1, 3))
topics, probs = topic_model.fit_transform(df['last_message_lemmatized'])

In [112]:
topic_model.get_topics()

{0: [('love', 0.03230039350230063),
  ('sorry', 0.022197973863978284),
  ('family', 0.022139812881507546),
  ('you', 0.022090267263628092),
  ('know', 0.021629298503248885),
  ('me', 0.016985357712153103),
  ('yall', 0.01560113163635324),
  ('hope', 0.01456129722271946),
  ('god', 0.014455460345766469),
  ('forgive', 0.014211511626871004)],
 1: [('love', 0.06074006385392828),
  ('you', 0.030224755758443026),
  ('family', 0.026194543133986305),
  ('yall', 0.026112959334738856),
  ('tell', 0.024889926799426833),
  ('thank', 0.022454537584349815),
  ('know', 0.022149500100979436),
  ('want', 0.020561085408473286),
  ('love you', 0.02054357459597393),
  ('me', 0.017727407236679215)],
 2: [('thank', 0.038388132515277514),
  ('love', 0.03119486226247102),
  ('would', 0.024021285644947397),
  ('like', 0.023193061687592772),
  ('you', 0.022916023643312626),
  ('god', 0.021849451269425994),
  ('would like', 0.020683485433044158),
  ('lord', 0.018763502808593093),
  ('jesus', 0.01859244791163383

### K-Means Clustering

In [113]:
cluster_model = KMeans(n_clusters=5)
topic_model = BERTopic(hdbscan_model=cluster_model,n_gram_range=(1, 3))
topics, probs = topic_model.fit_transform(df['last_message_lemmatized'])

In [114]:
topic_model.get_topics()

{0: [('sorry', 0.029686038476237294),
  ('love', 0.027171775428034776),
  ('family', 0.026750074153545272),
  ('forgive', 0.022326583404156417),
  ('you', 0.021726625364630357),
  ('know', 0.02147501565484286),
  ('me', 0.021377381550354735),
  ('would', 0.01849474567690138),
  ('god', 0.01750228214857628),
  ('like', 0.01741740721569078)],
 1: [('love', 0.05806985013811193),
  ('you', 0.030975981804198662),
  ('yall', 0.026536955592677092),
  ('know', 0.025701247557127146),
  ('love you', 0.021905241428252224),
  ('tell', 0.020555688675218852),
  ('want', 0.020433744680752656),
  ('family', 0.019781640445601226),
  ('thank', 0.017260806488453472),
  ('strong', 0.015088642207646738)],
 2: [('love', 0.020012156505414873),
  ('people', 0.015601614490863714),
  ('you', 0.014077425012968166),
  ('family', 0.013744665218593575),
  ('know', 0.013684169263906958),
  ('me', 0.012864237756897707),
  ('im', 0.012754559501328528),
  ('death', 0.011893806906705802),
  ('say', 0.011694037043482527)

By using clustering methods that require us to set the number of clusters, we obtained an additional cluster not seen with HDBSCAN: the cluster of inmates who gave no last statement.

Although the difference is small, I felt that the K-means clusters were slightly more distinct from one another.
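
One rough way to check how distinct two topics are is the Jaccard overlap of their top keywords, where a lower value means fewer shared words. The sketch below applies it to topics 0 and 1 from the agglomerative and K-means outputs shown earlier. The helper `jaccard` is an assumption added for illustration, and only these two topics are compared because the remaining keyword lists are truncated in the output.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of keywords."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Top-10 keywords for topics 0 and 1, copied from the notebook outputs.
agglomerative = {
    0: ['love', 'sorry', 'family', 'you', 'know', 'me', 'yall', 'hope', 'god', 'forgive'],
    1: ['love', 'you', 'family', 'yall', 'tell', 'thank', 'know', 'want', 'love you', 'me'],
}
kmeans = {
    0: ['sorry', 'love', 'family', 'forgive', 'you', 'know', 'me', 'would', 'god', 'like'],
    1: ['love', 'you', 'yall', 'know', 'love you', 'tell', 'want', 'family', 'thank', 'strong'],
}

print(f"Agglomerative: {jaccard(agglomerative[0], agglomerative[1]):.2f}")
print(f"K-means:       {jaccard(kmeans[0], kmeans[1]):.2f}")
```

For this pair of topics the overlap is lower under K-means (4 shared keywords out of 16) than under agglomerative clustering (6 out of 14). This is only a partial check, since topic numbering is not aligned between runs and K-means results can vary without a fixed seed.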

Let us visualize the topics from K-means clustering.

In [115]:
# Visualize top topic keywords
topic_model.visualize_barchart(top_n_topics=10)

To better understand the topics, we will look at the kinds of statements that make up each cluster.

In [123]:
topic_model.get_representative_docs(0)[0]

"Yes do, know way make pain suffering gave you. sorry. punishment nothing compared pain sorrow caused. hope someday find peace. strong enough ask forgiveness worth. realize I've done pain I've given. Please Lord forgive me. done horrible things. ask Lord please forgive me. gained nothing, brought sorrow pain wonderful people. sorry. sorry. Sanchez family showed love. Hawkings' family, sorry. know affected long. Please forgive me. Irene, want thank thank husband Jack. I'll waiting you. sorry. family ask forgiveness. Father God ask forgiveness. ask forgiveness Lord. ready go Lord. Thank you. ready go. Jesus Savior none like you. day want praise, let every breath. Shout Lord let u sing."

We can see that Topic 0 is characterized by requests for forgiveness.

In [124]:
topic_model.get_representative_docs(1)[0]

"Yes, do, Victor, Gary Hey bros, know hear me, can't hear you. right thinking grew up... know grew house. need love like use to. Deena, Bob raised house, need take care love like use to. Adela love you, Mijta, need take care mom. need love like use to. Juana, kindness showed me. Taking time show friendship did. never repay that. Take OK; see good. OK. Thank showing loved again. showed love sometimes deserve. love that. need take care yourself. going OK; know I'll be. love you, love you, love you, love , love you, love bro. Take care yall. May God mercy soul. thought going harder this. ready go. going sleep now. feel it, affecting now."

We can see that Topic 1 is characterized by love, family and acceptance.

In [133]:
topic_model.get_representative_docs(2)[1]

"want start acknowledging love I've family. No man world better family me. best parent world. best brother sister world. I've wonderful life man could ever had. I've never proud anybody daughter son. I've got complaint regret that. love everyone always loved life. I've never doubt that. Couple matter want talk since one time people listen say. Unites States gotten zero respect human life. death symptom bigger illness. point government got wake stop thing destroy country killing innocent children. ongoing embargo sanction place like Iran Iraq, Cuba places. anything change world, harming innocent children. That's got stop point. Perhaps important lot way environment even devastating long keep going direction we're going end result matter treat people everybody planet way out. got wake stop that. Ah, one way world truth ever going get out, people ever going know what's happening long support free press there. see press struggling stay existent free institution One truly free institution p

We can see that Topic 2 is characterized by love and family, along with some frustrations.

In [126]:
topic_model.get_representative_docs(3)[0]

"First all, would like ask Sister Teresa send Connie yellow rose. want thank Lord, Jesus Christ, year spent death row. blessing life. opportunity serve Jesus Christ thankful opportunity. would like thank Father Walsh become Franciscan, people world become friends. wonderful experience life. would like thank Chaplain Lopez, witness giving support love. would like thank Nuns England support. want tell son love them; always loved - greatest gift God. want tell witnesses, Tannie, Rebecca, Al, Leo, Dr. Blackwell love thankful support. want ask Paulette forgiveness heart. One day, hope will. tragedy family family. sorry. special angel, love you. love you, Connie. May God pas Kindom's shore softly gently. ready."

We can see that Topic 3 is characterized by religion.

In [127]:
topic_model.get_representative_docs(4)[0]

'inmate declined make last statement.'

We can see that Topic 4 consists of inmates who did not give a last statement.

In [116]:
# Visualize intertopic distance
topic_model.visualize_topics()

Based on the plot above, topics 0, 2 and 3 sit close to one another, as do topics 1 and 4. The large distance between these two groups indicates low similarity between them.

---

## Summary

In summary, we have managed to cluster last statements of inmates into five different topics.

Topic 0: Forgiveness

Topic 1: Love, Family and Acceptance

Topic 2: Love, Family and Frustrations

Topic 3: Religion

Topic 4: No last statement
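
The labels above can be attached to the individual statements with a simple lookup. The sketch below is one way to do this; the helper `label_topics` and the `example_topics` list are hypothetical, with the list standing in for the `topics` value returned by `fit_transform`.

```python
# Labels taken from the summary.
topic_labels = {
    0: 'Forgiveness',
    1: 'Love, Family and Acceptance',
    2: 'Love, Family and Frustrations',
    3: 'Religion',
    4: 'No last statement',
}


def label_topics(topics, labels, default='Unlabeled'):
    """Map each topic id to its label, falling back to a default."""
    return [labels.get(topic, default) for topic in topics]


# Hypothetical topic ids, not results from the notebook.
example_topics = [4, 0, 3, 1]
print(label_topics(example_topics, topic_labels))
```

In the notebook, the result could be stored alongside the statements, for example with `df['topic_label'] = label_topics(topics, topic_labels)`.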

---