1. Use the BOW model on the text of the BBC dataset (apply data preprocessing) and do K-means clustering (don't use the category feature). See if the clusters correspond to the categories. (Choose K to be the number of categories)

In [None]:
import pandas as pd

ds = pd.read_csv("../input/bbc-fulltext-and-category/bbc-text.csv")
print(ds.head())
print(" ")
print("Category         Counts")
print(ds["category"].value_counts())

In [None]:
# Data Preprocessing

# Dropping the "category" column

ds2 = ds.drop("category", axis=1)

# Removing punctuation

import string

punct = "\n\r"+string.punctuation
ds2["text"] = ds2["text"].str.translate(str.maketrans('', '', punct))

# Removing stop words

from nltk.corpus import stopwords

stop = set(stopwords.words("english"))  # a set makes the membership test below faster
ds2 = ds2["text"].str.lower().str.split()
ds2 = ds2.apply(lambda k: ([i for i in k if i not in stop]))

# Stemming

import re
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
def SW(txt):
    # Stem every token in the list and return the stemmed tokens
    return [stemmer.stem(i2) for i2 in txt]
ds2 = ds2.apply(SW)

# Joining back

def JB(il):
    return " ".join(il)  
ds2 = ds2.apply(JB)

# Lemmatization

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
def tokenize(str_input):
    words = re.sub(r"(?u)[^A-Za-z]", " ", str_input).lower().split(" ")
    words = [wordnet_lemmatizer.lemmatize(word) for word in words if len(word)>2]
    return words

# Vectorization (TF-IDF-weighted bag of words)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer = tokenize)
vectors = vectorizer.fit_transform(ds2)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
ds2 = pd.DataFrame(vectors.toarray(), columns = feature_names)
print(ds2)

In [None]:
# Applying the K-Means clustering algorithm

from sklearn import cluster

k_means = cluster.KMeans(n_clusters = 5, max_iter = 50, random_state = 1)
k_means.fit(ds2) 
labels = k_means.labels_
ds2 = pd.DataFrame(labels, index = ds.category , columns = ["Cluster ID"])
print(ds2)

In [None]:
# The clusters' correspondence to the categories

print("Article category: Sport")
print(ds2.loc["sport"].value_counts())
print(" ")
print("Article category: Business")
print(ds2.loc["business"].value_counts())
print(" ")
print("Article category: Politics")
print(ds2.loc["politics"].value_counts())
print(" ")
print("Article category: Tech")
print(ds2.loc["tech"].value_counts())
print(" ")
print("Article category: Entertainment")
print(ds2.loc["entertainment"].value_counts())

Results:

1. The smallest cluster is cluster 2 (190 articles).
1. The largest cluster is cluster 0 (953 articles); it contains almost all of the business articles (501 of 510).
1. Most sport articles fall in cluster 1 (489 of 511).
1. Politics articles are split mainly between clusters 0 (199 of 417) and 3 (213 of 417).
1. Most tech articles fall in cluster 4 (350 of 401).
1. Entertainment articles are split mainly between clusters 0 (194 of 386) and 2 (184 of 386).
1. Overall, business, sport and tech articles are each concentrated in a single cluster, whereas politics and entertainment articles are each split roughly in half, with cluster 0 absorbing one half in both cases.
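
The per-category counts listed above can also be read off a single contingency table. The following is a minimal sketch using `pd.crosstab` on made-up labels; `categories` and `cluster_ids` are hypothetical stand-ins for `ds["category"]` and `k_means.labels_`, and the purity score is an extra summary that the notebook does not compute.

```python
import pandas as pd

# Hypothetical toy labels standing in for ds["category"] and k_means.labels_
categories = pd.Series(
    ["sport", "sport", "business", "business", "tech", "tech"], name="category"
)
cluster_ids = pd.Series([1, 1, 0, 0, 4, 0], name="Cluster ID")

# Rows are the true categories, columns are cluster IDs, cells are article counts
table = pd.crosstab(categories, cluster_ids)
print(table)

# Purity: share of articles that belong to the majority category of their cluster
purity = table.max(axis=0).sum() / table.values.sum()
print(f"Purity: {purity:.2f}")
```

A purity close to 1 would mean that the clusters line up with the categories almost one to one; split categories such as politics and entertainment pull the value down.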

2. Apply K-means clustering to the customer segmentation dataset. See what the clusters correspond to.

In [None]:
dss = pd.read_csv("../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")
print(dss.head())
print(" ")

# Obtaining information about the dataset to determine which features are important for the clustering analysis

print("Dataset Information: ")
print(dss.info())

In [None]:
# Dropping the unneeded columns

dss = dss.drop(["CustomerID", "Gender"], axis=1)

# Performing the SSE test to determine the best number of clusters

import matplotlib.pyplot as plt
%matplotlib inline

numClusters = [1,2,3,4,5,6]
SSE = []
for k in numClusters:
    # Fixed seed so that the SSE curve is reproducible between runs
    k_means = cluster.KMeans(n_clusters=k, random_state=1)
    k_means.fit(dss)
    SSE.append(k_means.inertia_)
plt.plot(numClusters, SSE)
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")

In [None]:
# K-Means clustering

k_means = cluster.KMeans(n_clusters = 5, max_iter = 50, random_state = 1)
k_means.fit(dss) 
labels = k_means.labels_
pd.DataFrame(labels, columns = ["Cluster ID"])

In [None]:
# Plotting the results

import seaborn as sns

dss["Labels"] = k_means.labels_
f = plt.figure(figsize = (24,12))
ax1 = f.add_subplot(221)
sns.swarmplot(x = "Labels", y = "Age", hue = dss["Labels"], data = dss, ax = ax1)
ax1.set_title("Age Labels")
ax2 = f.add_subplot(222)
sns.swarmplot(x = "Labels", y = "Annual Income (k$)", hue = dss["Labels"], data = dss, ax = ax2)
ax2.set_title("Annual Income Labels")
ax3 = f.add_subplot(223)
sns.swarmplot(x = "Labels", y = "Spending Score (1-100)", hue = dss["Labels"], data = dss, ax = ax3)
ax3.set_title("Spending Score Labels")
plt.show()

Results:

The plots above show that age plays little role in the clustering (or at least a smaller one than the other features), whereas annual income and spending score play a significant role: the clusters line up with these two features.
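
One way to check this reading numerically is to compare the per-cluster feature means. The sketch below runs on a made-up four-row frame (`toy`) that only borrows the column names of the dataset; the values and the two-cluster labelling are hypothetical, not results from the notebook.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the notebook's dss frame
toy = pd.DataFrame({
    "Age": [25, 47, 31, 52],
    "Annual Income (k$)": [20, 22, 85, 90],
    "Spending Score (1-100)": [80, 76, 15, 12],
    "Labels": [0, 0, 1, 1],
})

# Mean of every feature within each cluster
profile = toy.groupby("Labels").mean()
print(profile)

# Gap between the largest and smallest cluster mean, per feature:
# a small gap suggests the feature does little to separate the clusters
spread = profile.max() - profile.min()
print(spread)
```

Applied to `dss` after the `Labels` column has been added, the same two lines give a rough check of which features the clusters follow. Because the features are on different scales, the gaps are only comparable after standardizing or when judged relative to each feature's range.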