# Approach 1

<ol>
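    <li>Find a vocabulary of the 1000 most important words (here, the 1000 most frequent terms left after English stop words are removed).</li>
    <li>Build a 1000-dimensional vector for each document, where each dimension is the TF-IDF weight of the corresponding vocabulary word.</li>
    <li>Use these vectors to perform K-means clustering.</li>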
</ol>

In [45]:
import pandas as pd
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

In [11]:
df = pd.read_csv('incident_wide_descriptions.csv', header=0, encoding='ISO-8859-1')
df = df[['number', 'short_description']]

print(df.info())

df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193396 entries, 0 to 193395
Data columns (total 2 columns):
number               193396 non-null object
short_description    193395 non-null object
dtypes: object(2)
memory usage: 3.0+ MB
None


|   | number | short_description |
|---|---|---|
| 0 | INC0229960 | Testing HRIT Request |
| 1 | INC0229990 | Testing HRIT Request 2 |
| 2 | INC0230041 | Testing HRIT Request 3 |
| 3 | INC0230756 | Testing HRIT Request 4 |
| 4 | INC0160236 | RE: ServiceNow IT Trouble Ticket Link Question |
| 5 | INC0156765 | I get an error message in closing an epic fill... |
| 6 | INC0156736 | Unable to print from EPIC |
| 7 | INC0146421 | Add location option for a provider |
| 8 | INC0148808 | Printer will not feed properly |
| 9 | INC0149930 | Add Darrell Misner (u6010696) to be able to bo... |


In [32]:
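# Sanity check: every character in this string should be stripped,
# so the final print produces an empty line.
description = r':;!?@#$%&*(){}[]\/.,""' + "''"

# Raw strings avoid invalid-escape warnings; the character class is unchanged.
val = re.sub(r'[:;!?@#$%&*()\[\]{}./\\,"]', "", description)
print(re.sub("[']", "", val))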
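
The punctuation check above can also be packaged as a small reusable helper. This is only a sketch: the name `clean_description` and the compiled pattern `PUNCT_RE` are illustrative and not part of the original notebook, and the character set is assumed to be the same one tested above, plus the single quote.

```python
import re

# Same characters as the notebook's two re.sub calls, merged into one class.
# Hyphens and digits are deliberately left untouched.
PUNCT_RE = re.compile(r"""[;:!?@#$%&*()\[\]{}./\\,"']""")

def clean_description(text):
    """Lowercase the text and delete the punctuation characters listed above."""
    return PUNCT_RE.sub("", str(text).lower())

print(clean_description("RE: ServiceNow IT Trouble Ticket Link Question"))
```

Compiling the pattern once avoids re-parsing it for each of the roughly 193,000 rows, and a named function is easier to test than a nested lambda.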




In [36]:
# Lowercase each description and strip punctuation. The one missing description
# becomes an empty string rather than the literal text 'nan'.
punct_re = re.compile(r"""[;:!?@#$%&*()\[\]{}./\\,"']""")
df['short_description'] = df['short_description'].fillna('').apply(lambda x: punct_re.sub('', str(x).lower()))

df.head(10)

|   | number | short_description |
|---|---|---|
| 0 | INC0229960 | testing hrit request |
| 1 | INC0229990 | testing hrit request 2 |
| 2 | INC0230041 | testing hrit request 3 |
| 3 | INC0230756 | testing hrit request 4 |
| 4 | INC0160236 | re servicenow it trouble ticket link question |
| 5 | INC0156765 | i get an error message in closing an epic fill... |
| 6 | INC0156736 | unable to print from epic |
| 7 | INC0146421 | add location option for a provider |
| 8 | INC0148808 | printer will not feed properly |
| 9 | INC0149930 | add darrell misner u6010696 to be able to book... |


## Vectorizing the input

In [59]:
documents = list(df['short_description'])
print(documents[:10])

['testing hrit request', 'testing hrit request 2', 'testing hrit request 3', 'testing hrit request 4', 're servicenow it trouble ticket link question', 'i get an error message in closing an epic fill that says the file is icomplete phone 801-580-2765  dr meikle', 'unable to print from epic', 'add location option for a provider', 'printer will not feed properly', 'add darrell misner u6010696 to be able to book into provider slots']


In [60]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(documents)

# print(type(X))

# idf = vectorizer.idf_
# print(type(idf))

# print(idf[:10])

# print(type(vectorizer.get_feature_names()))
# print(vectorizer.get_feature_names()[:10])

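# Map each vocabulary term to its IDF weight (read directly from the fitted vectorizer).
# In scikit-learn 1.0 and later, use get_feature_names_out() instead of get_feature_names().
display(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))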


{'10': 6.5564351648557517,
 '102': 8.0077143760650564,
 '11': 8.0665548760879879,
 '12': 8.0426016350654947,
 '15': 8.2452466648313649,
 '17': 7.9413917331339832,
 '1st': 8.0787501491818077,
 '20': 8.2672255715501404,
 '2013': 8.2026870504125675,
 '2016': 7.2509219303447541,
 '2017': 7.1063922598848217,
 '24': 8.1552205131736457,
 '2fa': 6.7873059509908442,
 '2nd': 7.8253928192711006,
 '30': 8.0665548760879879,
 '30757': 8.1035961477683376,
 '360': 7.9202269219419392,
 '365': 7.8112081842791445,
 '3m': 7.696036798057059,
 '3rd': 8.1618650558923136,
 '581': 7.7743376484708167,
 '585': 7.8018623218609067,
 '587': 8.0192087554907907,
 '801': 6.684816331503959,
 'able': 5.9833326115682466,
 'access': 3.8667588932491337,
 'accessing': 6.663731213016888,
 'account': 4.7101858200823212,
 'accounts': 7.2063536108648778,
 'acknowledge': 7.9963506174147403,
 'acrobat': 7.9963506174147403,
 'action': 7.6312368048301433,
 'activate': 7.2114950103652964,
 'activated': 8.0077143760650564,
 'activati
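
To make the IDF values above easier to interpret, here is a minimal sketch of the smoothed IDF formula that `TfidfVectorizer` applies with its default `smooth_idf=True`: `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The toy corpus and the names `smoothed_idf`, `toy_docs`, and `toy_idf` are hypothetical, and the whitespace tokenization is a simplification of what the vectorizer actually does (it also drops stop words and single-character tokens).

```python
import math
from collections import Counter

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical toy corpus, already lowercased and stripped of punctuation.
toy_docs = [
    "unable to print from epic",
    "printer will not feed properly",
    "epic access request",
]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in toy_docs for term in set(doc.split()))

toy_idf = {term: smoothed_idf(len(toy_docs), count) for term, count in doc_freq.items()}
for term in ("epic", "printer"):
    print(term, round(toy_idf[term], 4))
```

Rarer terms get larger weights, which is why a common word such as `access` scores far lower in the output above than a rare one such as `acrobat`.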

## Clustering using k-means 

In [64]:
true_k = 20
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=20, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
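
For intuition about what `model.fit(X)` does, the following is a simplified pure-Python sketch of the classic Lloyd iteration behind K-means: assign every point to its nearest centroid, then move each centroid to the mean of its members, and repeat until nothing changes. It is not scikit-learn's implementation. The function names and the 2-D toy points are illustrative, and the starting centroids are supplied by hand rather than chosen by `k-means++`.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (empty clusters stay put)."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids

def lloyd(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids

toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = lloyd(toy_points, [(0, 0), (10, 10)])
print(labels, centers)
```

Because the result depends on the starting centroids, a run with `n_init=1` and `random_state=None`, as in the cell above, can produce different clusters each time it is executed.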

## Human readable clusters
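
Each cluster centroid is a vector of 1000 TF-IDF weights, one per vocabulary term, so a cluster can be summarized by the terms with the largest weights. The sketch below shows that ranking step in plain Python on made-up numbers; `top_terms`, `toy_terms`, and `toy_centroid` are hypothetical names, and the notebook itself uses NumPy's `argsort` for the same purpose.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest weights in a centroid vector."""
    # Sort term indices by weight, largest first, then keep the first n.
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in ranked[:n]]

toy_terms = ["access", "epic", "printer", "umail"]
toy_centroid = [0.10, 0.45, 0.05, 0.30]
print(top_terms(toy_centroid, toy_terms, n=2))
```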

In [65]:
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print( "Cluster %d:" % i, end=" ")
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind], end=" ")
    print('\n')

Top terms per cluster:
Cluster 0:  duo  question  general  bypass  code  generated  setup  reactivation  authenticate  caller  task  unable  set  token  request  reactivtion  reactivate  enrollment  information  device 

Cluster 1:  umail  quota  increase  request  alias  lockout  account  locked  unlock  cleared  mailbox  display  change  issue  forwarding  increased  administer  behalf  created  users 

Cluster 2:  issues  outlook  computer  printer  having  citrix  monitor  login  epic  connectivity  connection  network  printing  phone  kronos  laptop  lms  skype  display  receiver 

Cluster 3:  data  warehouse  bi  report  access  need  request  table  port  updated  center  new  team  boe  epic  reports  research  missing  edw  toad 

Cluster 4:  access  epic  need  request  unable  peoplesoft  kronos  wants  drive  pulse  folder  remote  grant  umb  shared  network  citrix  employee  vpn  granted 

Cluster 5:  epic  hfs  team  scheduling  issue  unable  icon  visit  add  provide