# Affinity Propagation

Affinity Propagation is a clustering algorithm used in natural language processing (NLP) for sentiment analysis which is another unsupervised machine learning algorithm. In contrast to K-means clustering, Affinity Propagation Algorithm does not require the number of clusters to be specified beforehand. Instead, it uses a message-passing algorithm to identify exemplars, or data points that are representative of their clusters. Each data point is assigned a responsibility score and an availability score, which are used to update the exemplars until convergence.

Affinity Propagation is a clustering algorithm used in natural language processing (NLP) for sentiment analysis. Clustering algorithms group similar objects together based on a certain similarity metric. In the case of sentiment analysis, the objects are textual documents or sentences, and the similarity metric is based on the sentiment expressed in the text.

Affinity Propagation is particularly useful for sentiment analysis because it does not require the number of clusters to be predefined, which can be difficult to determine in advance. Instead, it automatically determines the number of clusters by finding the most representative data points, known as exemplars, for each cluster.

The algorithm works by iteratively passing messages between data points until a set of exemplars is identified that best represent the entire data set. These exemplars can then be used to assign new data points to the appropriate cluster.

Overall, Affinity Propagation is a powerful tool for sentiment analysis in NLP because it can automatically identify the most representative data points for each cluster without requiring a priori knowledge of the number of clusters.

In [8]:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation
import pandas as pd
from sklearn import metrics


In [2]:
#Load csv files
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('sampled_twitter_data.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
y_train = pd.read_csv('y_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
y_test = pd.read_csv('y_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
X_test = pd.read_csv('X_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
X_train = pd.read_csv('X_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]

X_train.fillna('', inplace=True)
X_test.fillna('', inplace=True)



  y_train = pd.read_csv('y_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


  y_test = pd.read_csv('y_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


  X_test = pd.read_csv('X_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


  X_train = pd.read_csv('X_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


In [6]:
# Select the first and last 2000 rows from X_train
X_train_reduced = pd.concat([X_train.iloc[:2000], X_train.iloc[-2000:]])

# Vectorize the reduced text data
vectorizer = TfidfVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X_train_reduced)


# Perform clustering using Affinity Propagation
af = AffinityPropagation().fit(X_vec)

# Get the cluster labels
cluster_labels = af.labels_

# Print the number of clusters
n_clusters = len(set(cluster_labels))
print('Number of clusters: %d' % n_clusters)

# Print the clusters and their associated text data
clusters = {}
for i, label in enumerate(cluster_labels):
    if label not in clusters:
        clusters[label] = []
    clusters[label].append(X_train.iloc[i])
for cluster in clusters:
    print('Cluster %d:' % cluster)
    print(clusters[cluster])

Number of clusters: 13
Cluster 7:
Cluster 10:
Cluster 8:
['currently playing sims im boycotting sims find sims drop dead ugly', 'ughh reaction anaesthetic filling dentist feeling poorly', 'donniedoll jordandoll morning afternoon best dolls hugs', 'good day everyone working big projects youtube views keep freezing really sux guess well never see million', 'jonasbrothers st nighttt well morning', 'yr old son better putting together lego models', 'jlanier went game watching nyc briliant header usa usa usa', 'morning people lot work today']
Cluster 0:
['happy mothers day', 'happy mothers day', 'sreejith least got karteekj create account', 'miss much know']
Cluster 9:
['well thank u baby', 'vintagegoodness thank much', 'thebadcat thank', 'jayfk thank u', 'rip mouth', 'mennard youre welcomemost enjoyable read x']
Cluster 11:
['class miss already', 'class right', 'rumlover empty rum barrel sad rum barrel shakes head horror horror', 'quotcome fly meee lets fly lets flyyyyyyyyy pack lets fly aw

In [9]:
# Compute the silhouette score
silhouette_score = metrics.silhouette_score(X_vec, af.labels_, metric='cosine')
print('Silhouette score: %.3f' % silhouette_score)

# Compute the Calinski-Harabasz score
ch_score = metrics.calinski_harabasz_score(X_vec.toarray(), af.labels_)
print('Calinski-Harabasz score: %.3f' % ch_score)

# Compute the Davies-Bouldin score
db_score = metrics.davies_bouldin_score(X_vec.toarray(), af.labels_)
print('Davies-Bouldin score: %.3f' % db_score)

Silhouette score: -0.029
Calinski-Harabasz score: 3.274
Davies-Bouldin score: 10.860


In [None]:
# Silhouette score: This score measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a score of 1 indicates that the object is very similar to its own cluster and very dissimilar to other clusters, while a score of -1 indicates the opposite. In your case, the silhouette score is -0.029, which is close to 0 and indicates that the clustering is not very good.

# Calinski-Harabasz score: This score measures the ratio of between-cluster dispersion and within-cluster dispersion. A higher score indicates better clustering. In your case, the score is 3.274, which is low and indicates poor clustering.

# Davies-Bouldin score: This score measures the average similarity between each cluster and its most similar cluster, taking into account the size of the clusters. A lower score indicates better clustering. In your case, the score is 10.860, which is high and indicates poor clustering.

# Overall, based on these scores, it appears that the clustering algorithm is not performing very well on your data.

In [4]:
# Import the Necessary Dependencies

# Data Manipulation
import re ## regular expression operator
import numpy as np
import pandas as pd
import string 


# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Methods and stopwords text processing
import nltk ## natural language toolkit
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer


# Machine Learning Libraries
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
# Importing the dataset
DATASET_COLUMNS=['target','ids','date','flag','user','text']
DATASET_ENCODING = "ISO-8859-1"
df_raw = pd.read_csv('twitter_data.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)

#Exploratory Data Analysis
df_raw.head()

# Removing the unnecessary columns.
df_raw = df_raw[['target','text']]

# Storing data in lists.
text, target = list(df_raw['text']), list(df_raw['target'])

In [6]:
print('length of data is', len(df_raw))

length of data is 1600000


In [7]:

# Create subgroups based on the target column
subgroups = df_raw['target'].astype(str)  # Convert the 'target' column to a string to ensure correct stratification

# Specify the number of tweets you want to include in your final dataset for each subgroup
#n_samples = 20000  # Replace with the desired number of tweets in each subgroup
n_samples = 1000  # Replace with the desired number of tweets in each subgroup


# Use train_test_split to do stratified sampling
target0_df, _ = train_test_split(df_raw[subgroups == '0'], stratify=subgroups[subgroups == '0'], train_size=n_samples/len(df_raw[subgroups == '0']), random_state=42)
target4_df, _ = train_test_split(df_raw[subgroups == '4'], stratify=subgroups[subgroups == '4'], train_size=n_samples/len(df_raw[subgroups == '4']), random_state=42)

# Combine the two subgroups into a single dataframe
df = pd.concat([target0_df, target4_df], ignore_index=True)

# Save the sampled dataset
df.to_csv('sampled_twitter_data.csv', index=False)

In [8]:
print('The new length of data is', len(df))

The new length of data is 2000


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   target  2000 non-null   int64 
 1   text    2000 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


In [10]:
df.dtypes

target     int64
text      object
dtype: object

In [11]:
#Checking for null values
np.sum(df.isnull().any(axis=1))

0

In [12]:
#Rows and columns in the dataset
print('Count of columns in the data is:  ', len(df.columns))
print('Count of rows in the data is:  ', len(df))

Count of columns in the data is:   2
Count of rows in the data is:   2000


In [13]:
#Check unique target values
df['target'].unique()

array([0, 4], dtype=int64)

In [14]:
#Check the number of target values
df['target'].nunique()

2

In [15]:
df['text'].tail()

1995    @ThomasFritts I'm so jealous. Mountains are my...
1996                       home. just a few more minutes 
1997    im getting the voyager for cheap  $50 baby! im...
1998    @Nonicam that classic help desk response-turn ...
1999    is chatting on facey with @JuliaBier and Matt ...
Name: text, dtype: object

In [16]:
# Convert all text to lowercase
df['text'] = df['text'].str.lower()
df['text'].tail()

1995    @thomasfritts i'm so jealous. mountains are my...
1996                       home. just a few more minutes 
1997    im getting the voyager for cheap  $50 baby! im...
1998    @nonicam that classic help desk response-turn ...
1999    is chatting on facey with @juliabier and matt ...
Name: text, dtype: object

In [17]:
#Cleaning and removing punctuations

english_punctuations = string.punctuation
punctuations_list = english_punctuations
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
df['text']= df['text'].apply(lambda x: cleaning_punctuations(x))
df['text'].tail()

1995    thomasfritts im so jealous mountains are my fa...
1996                        home just a few more minutes 
1997    im getting the voyager for cheap  50 baby im s...
1998    nonicam that classic help desk responseturn it...
1999    is chatting on facey with juliabier and matt a...
Name: text, dtype: object

In [18]:
#Cleaning and removing repeating characters

def cleaning_repeating_char(text):
    return re.sub(r'(.)1+', r'1', text)
df['text'] = df['text'].apply(lambda x: cleaning_repeating_char(x))
df['text'].tail()

1995    thomasfritts im so jealous mountains are my fa...
1996                        home just a few more minutes 
1997    im getting the voyager for cheap  50 baby im s...
1998    nonicam that classic help desk responseturn it...
1999    is chatting on facey with juliabier and matt a...
Name: text, dtype: object

In [19]:
#Cleaning and removing URLs

def cleaning_URLs(data):
    return re.sub('((www.[^s]+)|(https?://[^s]+))',' ',data)
df['text'] = df['text'].apply(lambda x: cleaning_URLs(x))
df['text'].tail()

1995    thomasfritts im so jealous mountains are my fa...
1996                        home just a few more minutes 
1997    im getting the voyager for cheap  50 baby im s...
1998    nonicam that classic help desk responseturn it...
1999    is chatting on facey with juliabier and matt a...
Name: text, dtype: object

In [20]:
#Cleaning and removing numeric numbers

def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)
df['text'] = df['text'].apply(lambda x: cleaning_numbers(x))
df['text'].tail()

1995    thomasfritts im so jealous mountains are my fa...
1996                        home just a few more minutes 
1997    im getting the voyager for cheap   baby im so ...
1998    nonicam that classic help desk responseturn it...
1999    is chatting on facey with juliabier and matt a...
Name: text, dtype: object

In [21]:
# Removing # and @ characters from tweets and other symbols

def cleaning_characters(data):
    return re.sub(r'\@\w+|\#|\'|\"|\´|\`|\,','', data)
df['text'] = df['text'].apply(lambda x: cleaning_characters(x))
df['text'].tail()


1995    thomasfritts im so jealous mountains are my fa...
1996                        home just a few more minutes 
1997    im getting the voyager for cheap   baby im so ...
1998    nonicam that classic help desk responseturn it...
1999    is chatting on facey with juliabier and matt a...
Name: text, dtype: object

In [22]:
#Defining set containing all stopwords in English.

nltk.download ('stopwords')
stop_words = set(stopwords.words( 'english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Bita\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
#Cleaning and removing the above stop words list from the tweet text

def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])
df['text'] = df['text'].apply(lambda text: cleaning_stopwords(text))
df['text'].tail()


1995    thomasfritts im jealous mountains favorite tol...
1996                                         home minutes
1997             im getting voyager cheap baby im excited
1998    nonicam classic help desk responseturn turn gl...
1999                  chatting facey juliabier matt maine
Name: text, dtype: object

In [24]:
# Count the words used per user

def word_count(sentence):
    return len(sentence.split())
    
df['word count'] = df['text'].apply(word_count)
df.tail()

Unnamed: 0,target,text,word count
1995,4,thomasfritts im jealous mountains favorite tol...,8
1996,4,home minutes,2
1997,4,im getting voyager cheap baby im excited,7
1998,4,nonicam classic help desk responseturn turn gl...,8
1999,4,chatting facey juliabier matt maine,5


In [25]:
# Stemming process

st = nltk.PorterStemmer()
def stemming_process(data):
    text = [st.stem(word) for word in data]
    return data
df['text']= df['text'].apply(lambda x: stemming_process(x))
df['text'].tail()



1995    thomasfritts im jealous mountains favorite tol...
1996                                         home minutes
1997             im getting voyager cheap baby im excited
1998    nonicam classic help desk responseturn turn gl...
1999                  chatting facey juliabier matt maine
Name: text, dtype: object

In [26]:
# Lemmatizer Proccess
from nltk.stem import WordNetLemmatizer
lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    text = [lm.lemmatize(word) for word in data]
    return data
df['text'] = df['text'].apply(lambda x: lemmatizer_on_text(x))
df['text'].tail()

1995    thomasfritts im jealous mountains favorite tol...
1996                                         home minutes
1997             im getting voyager cheap baby im excited
1998    nonicam classic help desk responseturn turn gl...
1999                  chatting facey juliabier matt maine
Name: text, dtype: object

In [27]:
# Separating positive and negative tweets

data_pos = df[df['target'] == 4]
data_neg = df[df['target'] == 0]
df = pd.concat([data_neg, data_pos])
data_neg.tail()

# Save the sampled dataset
df.to_csv('sampled_twitter_data.csv', index=False)


In [28]:
data_pos.tail()

Unnamed: 0,target,text,word count
1995,4,thomasfritts im jealous mountains favorite tol...,8
1996,4,home minutes,2
1997,4,im getting voyager cheap baby im excited,7
1998,4,nonicam classic help desk responseturn turn gl...,8
1999,4,chatting facey juliabier matt maine,5


In [29]:
# Doing tokenization of Tweet Text

tokenizer = RegexpTokenizer(r'\w+')
df_tk = df['text'].apply(tokenizer.tokenize)
df_tk.tail()

1995    [thomasfritts, im, jealous, mountains, favorit...
1996                                      [home, minutes]
1997     [im, getting, voyager, cheap, baby, im, excited]
1998    [nonicam, classic, help, desk, responseturn, t...
1999            [chatting, facey, juliabier, matt, maine]
Name: text, dtype: object

In [30]:
# Separate the date into train and test subset

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'],
                                                    test_size = 0.05, random_state = 0)
print(f'Data Split done.')

# X = df.drop('target', axis=1)  # Predictor feature columns
# y = df['target']   # Predicted class (1=True, 0=False)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=0) 
# 0.05 is the proportion of the data to allocate to the test set 
#the random seed used to split the data. This ensures that the same random split is used every time the code is run.

X_train.tail()
X_train_all = df['text']

X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)



Data Split done.


In [31]:
print(('X_train shape =', X_train.shape), ('y_train shape =', y_train.shape), ('X_test shape =', X_test.shape), ('y_test shape =', y_test.shape))

('X_train shape =', (1900,)) ('y_train shape =', (1900,)) ('X_test shape =', (100,)) ('y_test shape =', (100,))


In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# # vectorize the text data
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vz = vectorizer.fit_transform(X_train_all)
#vectorizer.fit(X_train)
print('Vectorizer fitted.')
print('No. of feature_words: ', len(vectorizer.get_feature_names_out()))

# # transform the text data into a sparse matrix
X_train_vec = vectorizer.transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# save it into sparse matrix format
# save_npz('X_train.npz', X_train_vec) #model do save sparse matrix
# save_npz('X_test.npz', X_test_vec)

Vectorizer fitted.
No. of feature_words:  18480


Affinity Propagation Algorithm

In [33]:
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
cosine_sim_features = cosine_similarity(vz)
ap = AffinityPropagation(max_iter=1000)
ap.fit(cosine_sim_features)
res = Counter(ap.labels_)
res.most_common(13)

[(4, 1963),
 (13, 4),
 (0, 3),
 (1, 3),
 (2, 3),
 (12, 3),
 (3, 3),
 (6, 3),
 (5, 3),
 (8, 3),
 (10, 3),
 (7, 2),
 (9, 2)]

In [34]:
df['cluster_label'] = ap.labels_
for num_clusters in range(1,13):
  filtered_clusters = [item[0] for item in res.most_common(num_clusters)]
  filtered_df = df[df['cluster_label'].isin(filtered_clusters)]
  twitter_clusters = (filtered_df[['text', 'cluster_label']]
                    .sort_values(by=['cluster_label'], 
                                ascending=False)
                    .groupby('cluster_label').head(20))
  twitter_clusters = twitter_clusters.copy(deep=True)

  # get key features for each cluster
  # get twetts belonging to each cluster
  for cluster_num in range(len(filtered_clusters)):
      twitts = twitter_clusters[twitter_clusters['cluster_label'] == filtered_clusters[cluster_num]]['text'].values.tolist()
      print('CLUSTER '+str(cluster_num)+':')
      print('Some similar twitts:', twitts)
      print('-'*50)
  print('='*100)

CLUSTER 0:
Some similar twitts: ['ksammm omg yessss im watching th one definitely cried brought cedricks body back', 'im listening greenday sorry im actually goin lol', 'nemoniknemonik time', 'beckyfearns hahaha okay hopeee loooose oh pre orderd new jb album burnin book rolling stone posterr x', 'finethere many movies anticipateespecially quotharry potter quotcause thats shown birthmonthwooot', 'erethfamily hehe theres always someone celebrating somthing family', 'ddlovato hey demi say hi pleasejust hi day colorfull yesterday', 'donniewahlberg thanks much reminding live life kinda stagnant till came back thank xx', 'way home movie anyone coming', 'andrewmoriarty tweet made laugh much', 'buildingwalls ive change thrown times sq riding bike tend avoid area days', 'angesbiz mothers day chillaxed', 'favorite party last night getting attention guys sat watched', 'geeksyndicate posted review site hope tis shite', 'trackies day woop im chaving', 'bronzin body poolside', 'bethenny omg always g