# Affinity Propagation

Affinity Propagation is a clustering algorithm used in natural language processing (NLP) for sentiment analysis. Clustering algorithms group similar objects together based on a certain similarity metric. In the case of sentiment analysis, the objects are textual documents or sentences, and the similarity metric is based on the sentiment expressed in the text.

Affinity Propagation is particularly useful for sentiment analysis because it does not require the number of clusters to be predefined, which can be difficult to determine in advance. Instead, it automatically determines the number of clusters by finding the most representative data points, known as exemplars, for each cluster.

The algorithm works by iteratively passing messages between data points until a set of exemplars is identified that best represent the entire data set. These exemplars can then be used to assign new data points to the appropriate cluster.

Overall, Affinity Propagation is a powerful tool for sentiment analysis in NLP because it can automatically identify the most representative data points for each cluster without requiring a priori knowledge of the number of clusters.

In [1]:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation
import pandas as pd

In [2]:
#Load csv files
DATASET_ENCODING = "ISO-8859-1"
y_train = pd.read_csv('y_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
y_test = pd.read_csv('y_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
X_test = pd.read_csv('X_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]
X_train = pd.read_csv('X_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]

X_train.fillna('', inplace=True)
X_test.fillna('', inplace=True)

y_test = y_test.replace(4,1)
y_train = y_train.replace(4,1)

# Load data
data = pd.read_csv('x_train.csv')



  y_train = pd.read_csv('y_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


  y_test = pd.read_csv('y_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


  X_test = pd.read_csv('X_test.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


  X_train = pd.read_csv('X_train.csv', squeeze=True, encoding=DATASET_ENCODING)[:-1]


In [7]:
# Vectorize the text data
vectorizer = TfidfVectorizer(stop_words='english')
X_vect = vectorizer.fit_transform(X_train)

# Perform clustering using Affinity Propagation
af = AffinityPropagation().fit(X_vect)

# Get the cluster labels
cluster_labels = af.labels_

# Print the number of clusters
n_clusters = len(set(cluster_labels))
print('Number of clusters: %d' % n_clusters)

# Print the clusters and their associated text data
clusters = {}
for i, label in enumerate(cluster_labels):
    if label not in clusters:
        clusters[label] = []
    clusters[label].append(X_train.iloc[i])
for cluster in clusters:
    print('Cluster %d:' % cluster)
    print(clusters[cluster])

Number of clusters: 6
Cluster 0:
[ids                                                  2321931262
date                               Wed Jun 24 21:47:09 PDT 2009
flag                                                   NO_QUERY
user                                                     ahrris
text          ['always', 'happen', 'maybe', 'rly', 'thinks',...
word count                                                   10
Name: 0, dtype: object]
Cluster 1:
[ids                              2065074756
date           Sun Jun 07 07:50:15 PDT 2009
flag                               NO_QUERY
user                              shanajaca
text          ['try', 'rach', 'xxxxxxxxxx']
word count                                3
Name: 1, dtype: object]
Cluster 2:
[ids                                                  1834918736
date                               Mon May 18 04:36:46 PDT 2009
flag                                                   NO_QUERY
user                                            demoni