The code uses TfidfVectorizer to transform the text data into TF-IDF feature vectors, then applies TruncatedSVD to reduce the number of features and make the clustering faster. The reduced vectors are normalized before clustering.

It then applies k-means clustering to the reduced feature vectors.

To adjust the number of clusters, change the n_clusters parameter in the KMeans constructor.

The code needs to be customized to fit your specific use case.
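One possible way to pick n_clusters, rather than fixing it by hand, is to compare silhouette scores across a few candidate values. The sketch below is only an illustration: the mini-corpus `docs`, the candidate range 2 to 5, and the choice of `n_components=3` are assumptions, not part of the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical mini-corpus, used only for illustration.
docs = [
    "painting and sculpture in renaissance art",
    "renaissance art and classical literature",
    "philosophy and literature in human culture",
    "algorithms solve problems in computer science",
    "computer science and data structures",
    "data science uses statistics and algorithms",
    "privacy concerns about social media",
    "social media addiction and privacy",
]

# Same steps as the notebook: TF-IDF, then TruncatedSVD, then normalization.
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = make_pipeline(TruncatedSVD(n_components=3, random_state=0), Normalizer(copy=False))
X_lsa = lsa.fit_transform(X_tfidf)

# Fit k-means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_lsa)
    if len(set(labels)) < 2:
        # silhouette_score needs at least two distinct clusters.
        continue
    scores[k] = silhouette_score(X_lsa, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

A higher silhouette score suggests tighter, better-separated clusters, but on a corpus this small the scores are noisy, so treat the result as a hint rather than a decision.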

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# define the dataframe with hardcoded data

data = {'text': [
    'The study of humanities involves exploring human culture, including art, literature, philosophy, and history. Humanities courses often encourage critical thinking and analysis of complex ideas.',
    'In computer science, algorithms are used to solve problems and automate tasks. They are step-by-step procedures for calculations or other operations, and are essential to modern computing.',
    'Many people argue that technology is having a negative impact on society. They point to issues such as privacy concerns, addiction to social media, and the loss of jobs due to automation.',
    'One of the most important movements in the history of art is the Renaissance. This period saw a revival of interest in classical art and literature, as well as the development of new techniques and styles.',
    'Data science is a field that involves extracting insights and knowledge from data. It incorporates elements of statistics, computer science, and domain expertise to make sense of large and complex data sets.'
]}


df = pd.DataFrame(data)

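# create a TfidfVectorizer object to transform the text data into feature vectors;
# min_df=2 keeps only terms found in at least two documents, and max_df=0.5 drops
# terms found in more than half of them
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)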
X = vectorizer.fit_transform(df['text'])

# apply TruncatedSVD to reduce the number of features and make the clustering faster,
# then L2-normalize each reduced vector (Normalizer defaults to the L2 norm)
svd = TruncatedSVD(n_components=2)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
X = lsa.fit_transform(X)

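# perform k-means clustering on the feature vectors; no random_state is set,
# so the cluster numbering (and possibly the grouping) can differ between runs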
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1)
kmeans.fit(X)

# add the cluster labels to the dataframe
df['cluster'] = kmeans.labels_

# print the results
print(df)


                                                text  cluster
0  The study of humanities involves exploring hum...        1
1  In computer science, algorithms are used to so...        0
2  Many people argue that technology is having a ...        0
3  One of the most important movements in the his...        1
4  Data science is a field that involves extracti...        0
