# Basic Clustering Analysis on Questionnaire

 - Clustering on three questions using K-means
 - Make clustering prediction
 - Use the clustering results to make classification predictions
 
 - Clustering on three questions using SVD (LSA)
 - Return SVD values (n = 1, n = 2)
 - Use SVD values as features in classification prediction
 
 v.2

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

pd.options.mode.chained_assignment = None

In [2]:
data = pd.read_csv('Clustering_DataSet.csv', header = 0, usecols=['UID-G-NR', 'Gender', 'Major', 'Class', 
                                                                  'Q1 Raw', 'Q2 Raw', 'Q3 Raw'])
data.head()

Unnamed: 0,UID-G-NR,Gender,Major,Class,Q1 Raw,Q2 Raw,Q3 Raw
0,UID-A-1,Male,Business,Senior,"the body of technologies, processes and practi...",Long and precise,Making sure you're making secure payments
1,UID-A-2,Female,Business,Junior,Certain programs that protect websites from ha...,Different characters and numbers,Companies pay other companies to protect their...
2,UID-A-3,Female,Business,Senior,to protect me from bad things such as fraud an...,capital and lower letters and many numbers.\n,everything is private
3,UID-A-4,Male,Business,Sophomore,Cyber security is just secuirty and protection...,"Both characters, numbers, special numbers and ...",Security is achieved through some type of encr...
4,UID-A-5,Female,Business,Sophomore,It is a form of security that protects individ...,A strong password consists of uppercase and lo...,A software that protects online consumers


In [3]:
data.shape

(999, 7)

In [4]:
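# drop rows where any of the three answers is missing;
# .copy() gives an independent frame so later column assignments are safe
df = data.dropna(subset=['Q1 Raw', 'Q2 Raw', 'Q3 Raw']).copy()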
df.shape

(134, 7)

In [5]:
major_list = df.Major.unique().tolist()
major_list

['Business', 'Computing', 'LiberalArts', 'Other', 'Nursing']

## Clustering using K-means

Run K-means clustering on Q1, Q2, and Q3 separately.


In [6]:
q1_content = df['Q1 Raw'].values
q2_content = df['Q2 Raw'].values
q3_content = df['Q3 Raw'].values

q1_content[0]

'the body of technologies, processes and practices designed to protect networks, computers, programs and data from attack,'

### Vectorization

In [7]:
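# convert each question's answers into TF-IDF vectors
# NOTE: stop_words={'english'} is a one-element set, so only the literal token
# 'english' is filtered; stop_words='english' would select scikit-learn's
# built-in English stop-word list. The outputs below come from the set form.
vectorizer = TfidfVectorizer(stop_words={'english'})
X_1 = vectorizer.fit_transform(q1_content)

vectorizer = TfidfVectorizer(stop_words={'english'})
X_2 = vectorizer.fit_transform(q2_content)

vectorizer = TfidfVectorizer(stop_words={'english'})
X_3 = vectorizer.fit_transform(q3_content)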
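
The vectorizer cell above passes `stop_words={'english'}`. As a minimal sketch of how that differs from `stop_words='english'`, the example below fits both variants on two made-up sentences (the `docs` list is hypothetical). It uses a one-element list rather than a set, on the assumption that newer scikit-learn releases may reject a set for this parameter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences, not taken from the questionnaire data.
docs = [
    "the password is long and the password is strong",
    "a firewall protects the network from attack",
]

# One-element collection: only the literal token "english" is a stop word.
literal = TfidfVectorizer(stop_words=["english"]).fit(docs)

# The string 'english' selects scikit-learn's built-in English stop-word list.
builtin = TfidfVectorizer(stop_words="english").fit(docs)

print(sorted(literal.vocabulary_))   # still contains words such as "the"
print(sorted(builtin.vocabulary_))   # common function words are removed
```

If the built-in list is what was intended, switching to the string form will change the TF-IDF matrices and therefore every downstream cluster label and SVD value.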

### K-means

In [8]:
# k-means
true_k = 2

model_1 = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model_1.fit(X_1)
labels_1 = model_1.labels_

model_2 = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model_2.fit(X_2)
labels_2 = model_2.labels_

model_3 = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model_3.fit(X_3)
labels_3 = model_3.labels_

In [9]:
# assigning cluster number to data frame
df['cluter_q1'] = labels_1
df['cluter_q2'] = labels_2
df['cluter_q3'] = labels_3

In [10]:
df.head()

Unnamed: 0,UID-G-NR,Gender,Major,Class,Q1 Raw,Q2 Raw,Q3 Raw,cluter_q1,cluter_q2,cluter_q3
0,UID-A-1,Male,Business,Senior,"the body of technologies, processes and practi...",Long and precise,Making sure you're making secure payments,1,0,0
1,UID-A-2,Female,Business,Junior,Certain programs that protect websites from ha...,Different characters and numbers,Companies pay other companies to protect their...,1,0,0
2,UID-A-3,Female,Business,Senior,to protect me from bad things such as fraud an...,capital and lower letters and many numbers.\n,everything is private,1,0,0
3,UID-A-4,Male,Business,Sophomore,Cyber security is just secuirty and protection...,"Both characters, numbers, special numbers and ...",Security is achieved through some type of encr...,1,1,1
4,UID-A-5,Female,Business,Sophomore,It is a form of security that protects individ...,A strong password consists of uppercase and lo...,A software that protects online consumers,1,0,0


`cluter_q1`, `cluter_q2`, and `cluter_q3` hold the cluster number that K-means assigned to each response for questions 1, 2, and 3 respectively.

Now we are ready for further experiments, such as regression analysis or classification prediction.

For example, the Classification section below runs a simple classifier on these features.

## Clustering using SVD

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work efficiently with sparse matrices, such as the TF-IDF matrices produced by the vectorizer above.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/
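
As a small illustration of the point about sparse input, the sketch below feeds a sparse TF-IDF matrix straight into `TruncatedSVD`. The `answers` list is hypothetical. Note that `fit_transform` returns each document's coordinates in the reduced space, while the singular values themselves are stored on the fitted estimator.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy answers, not taken from the questionnaire data.
answers = [
    "use long passwords with numbers and symbols",
    "passwords need numbers symbols and capital letters",
    "a firewall protects the network from hackers",
    "antivirus software protects computers from hackers",
]

X = TfidfVectorizer().fit_transform(answers)
print(sparse.issparse(X))            # the TF-IDF matrix is a scipy sparse matrix

svd = TruncatedSVD(n_components=2, random_state=42)
Z = svd.fit_transform(X)             # dense array, one row per document

print(Z.shape)
print(svd.singular_values_)          # one singular value per component
print(svd.explained_variance_ratio_)
```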

### Try `n_components = 1`

In [11]:
n = 1

svd = TruncatedSVD(n_components=n, random_state=42)
svd_1 = svd.fit_transform(X_1) 

svd = TruncatedSVD(n_components=n, random_state=42)
svd_2 = svd.fit_transform(X_2)

svd = TruncatedSVD(n_components=n, random_state=42)
svd_3 = svd.fit_transform(X_3) 

In [12]:
# add SVD results into new columns
df['svd_1'] = svd_1
df['svd_2'] = svd_2
df['svd_3'] = svd_3

### Try `n_components = 2`

In [13]:
n = 2

svd = TruncatedSVD(n_components=n, random_state=42)
svd_1 = svd.fit_transform(X_1) 

svd = TruncatedSVD(n_components=n, random_state=42)
svd_2 = svd.fit_transform(X_2)

svd = TruncatedSVD(n_components=n, random_state=42)
svd_3 = svd.fit_transform(X_3) 

In [14]:
# add SVD results as new columns
# with n_components = 2, each question yields two SVD values per row

df['svd_1_1'] = svd_1[:,0]
df['svd_1_2'] = svd_1[:,1]

df['svd_2_1'] = svd_2[:,0]
df['svd_2_2'] = svd_2[:,1]

df['svd_3_1'] = svd_3[:,0]
df['svd_3_2'] = svd_3[:,1]

In [15]:
df.head()

Unnamed: 0,UID-G-NR,Gender,Major,Class,Q1 Raw,Q2 Raw,Q3 Raw,cluter_q1,cluter_q2,cluter_q3,svd_1,svd_2,svd_3,svd_1_1,svd_1_2,svd_2_1,svd_2_2,svd_3_1,svd_3_2
0,UID-A-1,Male,Business,Senior,"the body of technologies, processes and practi...",Long and precise,Making sure you're making secure payments,1,0,0,0.259529,0.130182,0.188329,0.259521,0.032111,0.130176,0.04282,0.188346,-0.113996
1,UID-A-2,Female,Business,Junior,Certain programs that protect websites from ha...,Different characters and numbers,Companies pay other companies to protect their...,1,0,0,0.173411,0.389092,0.107757,0.173407,-0.060382,0.389096,0.176357,0.107792,-0.040546
2,UID-A-3,Female,Business,Senior,to protect me from bad things such as fraud an...,capital and lower letters and many numbers.\n,everything is private,1,0,0,0.168492,0.39658,0.091454,0.168485,-0.033793,0.396577,0.116475,0.091468,0.006161
3,UID-A-4,Male,Business,Sophomore,Cyber security is just secuirty and protection...,"Both characters, numbers, special numbers and ...",Security is achieved through some type of encr...,1,1,1,0.279454,0.478717,0.198504,0.279456,-0.002025,0.478719,0.473946,0.198527,0.12706
4,UID-A-5,Female,Business,Sophomore,It is a form of security that protects individ...,A strong password consists of uppercase and lo...,A software that protects online consumers,1,0,0,0.287306,0.364646,0.115523,0.287311,-0.086706,0.364646,-0.162954,0.115499,0.000475


## Classification

In [17]:
# convert categorical features into numerical codes
df["gender"] = df["Gender"].astype('category').cat.codes
df["major"] = df["Major"].astype('category').cat.codes
df["class"] = df["Class"].astype('category').cat.codes

In [18]:
# build a numeric label from UID-G-NR

mask = df['UID-G-NR'].str.contains('A')
# rows whose UID-G-NR contains 'A' get label 1; all other rows get label 0
df.loc[mask, 'label'] = 1
df.loc[~mask, 'label'] = 0

In [19]:
df.head()

Unnamed: 0,UID-G-NR,Gender,Major,Class,Q1 Raw,Q2 Raw,Q3 Raw,cluter_q1,cluter_q2,cluter_q3,...,svd_1_1,svd_1_2,svd_2_1,svd_2_2,svd_3_1,svd_3_2,gender,major,class,label
0,UID-A-1,Male,Business,Senior,"the body of technologies, processes and practi...",Long and precise,Making sure you're making secure payments,1,0,0,...,0.259521,0.032111,0.130176,0.04282,0.188346,-0.113996,1,0,2,1.0
1,UID-A-2,Female,Business,Junior,Certain programs that protect websites from ha...,Different characters and numbers,Companies pay other companies to protect their...,1,0,0,...,0.173407,-0.060382,0.389096,0.176357,0.107792,-0.040546,0,0,1,1.0
2,UID-A-3,Female,Business,Senior,to protect me from bad things such as fraud an...,capital and lower letters and many numbers.\n,everything is private,1,0,0,...,0.168485,-0.033793,0.396577,0.116475,0.091468,0.006161,0,0,2,1.0
3,UID-A-4,Male,Business,Sophomore,Cyber security is just secuirty and protection...,"Both characters, numbers, special numbers and ...",Security is achieved through some type of encr...,1,1,1,...,0.279456,-0.002025,0.478719,0.473946,0.198527,0.12706,1,0,3,1.0
4,UID-A-5,Female,Business,Sophomore,It is a form of security that protects individ...,A strong password consists of uppercase and lo...,A software that protects online consumers,1,0,0,...,0.287311,-0.086706,0.364646,-0.162954,0.115499,0.000475,0,0,3,1.0


## Classify using only the K-means cluster numbers
`cluter_q1, cluter_q2, cluter_q3`

Using only the cluster numbers from the three questions, we reach about 0.63 accuracy, measured on the same rows the classifier was trained on.
That is better than guessing (about 0.5) but not good enough for a classification experiment.

In [20]:
# select features
X = df[['cluter_q1', 'cluter_q2', 'cluter_q3']].values
y = df[['label']].values.ravel()

In [21]:
# call random forest classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(X, y)
y_pred = clf.predict(X)

In [22]:
# print classification results
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       0.64      0.67      0.65        70
         1.0       0.62      0.58      0.60        64

    accuracy                           0.63       134
   macro avg       0.63      0.62      0.62       134
weighted avg       0.63      0.63      0.63       134
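
The report above scores the classifier on the same rows it was fit on, so it reflects training accuracy. A held-out estimate could be obtained with cross-validation; the sketch below shows the idea on random stand-in data (`X_demo` and `y_demo` are hypothetical, and the 5-fold setting is an arbitrary choice).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the three binary cluster columns and the label;
# in the notebook, X and y from the cells above would be passed instead.
X_demo = rng.integers(0, 2, size=(134, 3))
y_demo = rng.integers(0, 2, size=134)

clf = RandomForestClassifier(n_estimators=10, random_state=42)

# 5-fold cross-validation: every fold is scored on rows the model did not see.
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```

Because the stand-in data is random, the printed accuracy here says nothing about the questionnaire; only the evaluation pattern carries over.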



### Test with other features

In [23]:
# try a different combination of features
X = df[['gender',  'cluter_q1', 'cluter_q2', 'cluter_q3']].values

y = df[['label']].values.ravel()

In [24]:
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(X, y)
y_pred = clf.predict(X)


print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       0.78      0.89      0.83        70
         1.0       0.85      0.72      0.78        64

    accuracy                           0.81       134
   macro avg       0.81      0.80      0.80       134
weighted avg       0.81      0.81      0.80       134



### All Features

In [25]:
X = df[['gender', 'major', 'class', 'cluter_q1', 'cluter_q2', 'cluter_q3']].values

y = df[['label']].values.ravel()

In [26]:
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(X, y)
y_pred = clf.predict(X)


print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       0.97      0.97      0.97        70
         1.0       0.97      0.97      0.97        64

    accuracy                           0.97       134
   macro avg       0.97      0.97      0.97       134
weighted avg       0.97      0.97      0.97       134



### A Notable Issue

One issue with this classification is that the `major` attribute **contributes too much** to the classification performance.
This single feature alone reaches `0.89` accuracy.

In [27]:
X = df[['major',]].values

y = df[['label']].values.ravel()

In [28]:
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(X, y)
y_pred = clf.predict(X)


print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       0.97      0.81      0.88        70
         1.0       0.83      0.97      0.89        64

    accuracy                           0.89       134
   macro avg       0.90      0.89      0.89       134
weighted avg       0.90      0.89      0.89       134
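
One way to check why a single categorical feature predicts the label so well is to cross-tabulate the two. The sketch below uses a hypothetical `toy` frame; in the notebook, `df['Major']` and `df['label']` would take its place.

```python
import pandas as pd

# Hypothetical rows, not the real questionnaire data.
toy = pd.DataFrame({
    "Major": ["Business", "Business", "Business", "Computing", "Computing", "Nursing"],
    "label": [1, 1, 1, 0, 0, 0],
})

# Row-normalised counts: the share of each label within every major.
table = pd.crosstab(toy["Major"], toy["label"], normalize="index")
print(table)
```

If most majors fall almost entirely under one label, `major` is acting as a near proxy for the label rather than adding information about the answers.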



## Use SVD values as Features

### Use SVD results when `n_components = 1`

In [29]:
# select features
X = df[['svd_1', 'svd_2', 'svd_3']].values
y = df[['label']].values.ravel()

# call random forest classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(X, y)
y_pred = clf.predict(X)

# print classification results
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       0.93      0.99      0.96        70
         1.0       0.98      0.92      0.95        64

    accuracy                           0.96       134
   macro avg       0.96      0.95      0.95       134
weighted avg       0.96      0.96      0.96       134



**Very strong performance: 0.96 accuracy!**

The LSA (SVD) features perform much better than the K-means cluster numbers.

Let's try SVD with `n_components = 2`.

### Use SVD results when `n_components = 2`

In [30]:
# select features
X = df[['svd_1_1', 'svd_1_2', 'svd_2_1', 'svd_2_2', 'svd_3_1', 'svd_3_2']].values
y = df[['label']].values.ravel()

# call random forest classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(X, y)
y_pred = clf.predict(X)

# print classification results
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.97        70
         1.0       0.98      0.95      0.97        64

    accuracy                           0.97       134
   macro avg       0.97      0.97      0.97       134
weighted avg       0.97      0.97      0.97       134



Accuracy rises by only one percentage point, so `n_components = 1` appears to be sufficient for our data.

Nice!

## Others

We can also run some regression analysis, but remember to convert categorical features into **numerical dummy variables**. A numerical encoder alone is not enough, because integer codes impose an ordering on the categories that does not exist.
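
As a minimal sketch of the difference, the example below encodes a hypothetical `Major` column both ways. `drop_first=True` is one common choice for regression, so that the remaining columns are not perfectly collinear with an intercept.

```python
import pandas as pd

# Hypothetical column, not the real questionnaire data.
toy = pd.DataFrame({"Major": ["Business", "Computing", "Nursing", "Business"]})

# Integer codes: compact, but they imply Business < Computing < Nursing.
codes = toy["Major"].astype("category").cat.codes
print(codes.tolist())

# Dummy variables: one indicator column per category (first category dropped).
dummies = pd.get_dummies(toy["Major"], prefix="major", drop_first=True)
print(dummies)
```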

More analysis is also needed.

For example, change the `K` value in **K-means**, or add more NLP models to analyze the content and topics.

Alternatively, change `n_components` or other parameters of the SVD model.
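
One possible way to compare `K` values is the silhouette score, which is higher when clusters are tighter and better separated. The sketch below is an illustration on hypothetical toy answers, not a result for the questionnaire; the range of `K` and `random_state=42` are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy answers, not taken from the questionnaire data.
answers = [
    "use long passwords with numbers and symbols",
    "passwords need numbers symbols and capital letters",
    "a strong password mixes letters and numbers",
    "a firewall protects the network from hackers",
    "antivirus software protects computers from hackers",
    "security software blocks attacks on the network",
    "encryption keeps online payments private",
    "payments are encrypted so data stays private",
]

X = TfidfVectorizer(stop_words="english").fit_transform(answers)

scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Silhouette score lies in [-1, 1]; larger means better-separated clusters.
    scores[k] = silhouette_score(X, labels)

for k, s in scores.items():
    print(k, round(s, 3))
```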