# NMF

Non-negative Matrix Factorization (NMF) is a dimensionality reduction technique that is also used for clustering. It requires a non-negative input matrix and has applications in text mining, image analysis, bioinformatics, and audio signal processing.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import NMF

### Bring in data

In [2]:
healthSet = pd.read_csv('trainY2_Y3.csv')

Separate the target (`Y`) from the features (`X`):

In [3]:
Y = healthSet['DaysInHospital'].copy()
X = healthSet.drop(['DaysInHospital'], axis=1)
features = X.columns

In [4]:
X.shape

(104495, 103)

NMF approximates the given matrix as the product of two smaller non-negative matrices. If the input has shape (104495, 103) and `n_components` is set to 20, the result is one matrix of shape (104495, 20) and another of shape (20, 103).
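
To make the shapes concrete, here is a minimal sketch on a small random matrix. The toy data, the variable names (`X_toy`, `W`, `H`), and the chosen sizes are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data standing in for the real table (made-up values).
rng = np.random.default_rng(0)
X_toy = rng.random((50, 10))

model = NMF(n_components=4, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X_toy)  # one row of component weights per sample
H = model.components_           # one row of feature weights per component

print(W.shape)        # (50, 4)
print(H.shape)        # (4, 10)
print((W @ H).shape)  # (50, 10), same shape as the input
```

The product `W @ H` has the shape of the input and serves as its low-rank, non-negative approximation.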

Here we use the Frobenius norm as the objective function, which is the default `beta_loss` in scikit-learn.
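
As a rough illustration of what this objective measures, the sketch below computes half the squared Frobenius norm of the reconstruction residual by hand. It ignores the regularization terms controlled by `alpha` and `l1_ratio`, and the helper name `frobenius_loss` and the tiny matrices are hypothetical.

```python
import numpy as np

def frobenius_loss(X, W, H):
    """Return 0.5 * ||X - WH||_F^2 (no regularization terms)."""
    residual = X - W @ H
    return 0.5 * np.sum(residual ** 2)

# Made-up example: W @ H reproduces X except for one entry that is off by 1.
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
W_toy = np.array([[1.0], [2.0]])
H_toy = np.array([[1.0, 2.0]])

print(frobenius_loss(X_toy, W_toy, H_toy))  # 0.5
```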

In [5]:
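n_components = 20
# Note: `alpha` is accepted by older scikit-learn releases; newer releases use `alpha_W`/`alpha_H` instead.
nmf = NMF(n_components=n_components, random_state=1, alpha=.1, l1_ratio=.5)
# Per-row component weights, shape (n_samples, n_components)
nmf_fitted = nmf.fit_transform(X)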

Each row of `nmf.components_` is treated as a cluster here. We can view the 5 highest-weighted features (column headers) for each one:

In [6]:
n_top = 5
for idx, cluster in enumerate(nmf.components_):
    message = "Cluster #%d: " % idx
    message += " ".join([features[i] for i in cluster.argsort()[:-n_top - 1:-1]])
    print(message)

Cluster #0: MemberID PayDelaySum PayDelayMax PayDelayMean LabCountMax
Cluster #1: PayDelaySum PayDelayMax PayDelayMean PayDelayCount LengthOfStay.daysCount
Cluster #2: PayDelaySum PayDelayMax PayDelayMean LabCountMax LabCountMean
Cluster #3: PayDelayMin PayDelayMean PayDelayMax PayDelaySum LabCountCount
Cluster #4: PayDelaySum LabCountSum LabCountMax LengthOfStay.daysCount PayDelayCount
Cluster #5: LengthOfStay.daysSum LengthOfStay.daysMax LengthOfStay.daysCount PayDelayCount Specialty_Internal
Cluster #6: PayDelaySum PayDelayMax PayDelayMean PrimaryConditionGroup_NEUMENT PlaceSvc_Urgent Care
Cluster #7: PayDelayMean LabCountMax LabCountMean PlaceSvc_Office LabCountMin
Cluster #8: PayDelaySum PrimaryConditionGroup_MSC2a3 Specialty_Laboratory PlaceSvc_Independent Lab PayDelayCount
Cluster #9: PayDelaySum Specialty_Internal PlaceSvc_Office ProcedureGroup_EM PayDelayCount
Cluster #10: LabCountMin LabCountMean LabCountSum LabCountMax PayDelayMin
Cluster #11: PayDelaySum Specialty_General P
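
The slice `argsort()[:-n_top - 1:-1]` used in the loop above can be hard to read. The following sketch walks through it on a made-up weight vector; the feature names and weights are hypothetical.

```python
import numpy as np

# Hypothetical feature names and one made-up component row.
feature_names = np.array(["a", "b", "c", "d", "e"])
weights = np.array([0.1, 0.7, 0.0, 0.3, 0.9])
n_top = 2

order = weights.argsort()        # indices in ascending weight order: [2, 0, 3, 1, 4]
top_idx = order[:-n_top - 1:-1]  # walk backwards from the end, n_top steps: [4, 1]
top_features = feature_names[top_idx].tolist()

print(top_features)  # ['e', 'b']
```

The negative step reverses the ascending order, so the features come out from the largest weight downward.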

Here we use the generalized Kullback-Leibler divergence as the objective function. In scikit-learn this requires the multiplicative-update solver (`solver='mu'`).

In [7]:
n_components = 20
nmfKL = NMF(n_components=n_components, random_state=1,
            beta_loss='kullback-leibler', solver='mu', max_iter=1000,
            alpha=.1, l1_ratio=.5)
nmfKL_fitted = nmfKL.fit_transform(X)

In [8]:
n_top = 5
for idx, cluster in enumerate(nmfKL.components_):
    message = "Cluster #%d: " % idx
    message += " ".join([features[i] for i in cluster.argsort()[:-n_top - 1:-1]])
    print(message)

Cluster #0: MemberID PayDelaySum PayDelayMax PayDelayMean PayDelayMin
Cluster #1: PayDelaySum PayDelayMax PayDelayMean PlaceSvc_Office LengthOfStay.daysCount
Cluster #2: PayDelayMax PayDelayMean PayDelayMin LabCountMax LabCountMean
Cluster #3: PayDelaySum PayDelayMean PayDelayMin LabCountMin LabCountMean
Cluster #4: LabCountSum LengthOfStay.daysCount PayDelayCount PlaceSvc_Independent Lab Specialty_Laboratory
Cluster #5: LengthOfStay.daysSum LengthOfStay.daysMax PlaceSvc_Inpatient Hospital PayDelayMax PayDelayCount
Cluster #6: PayDelaySum PayDelayMean LabCountSum LabCountMax PlaceSvc_Independent Lab
Cluster #7: PlaceSvc_Office LengthOfStay.daysCount PayDelayCount Specialty_General Practice ProcedureGroup_EM
Cluster #8: LengthOfStay.daysCount PayDelayCount PrimaryConditionGroup_MSC2a3 PlaceSvc_Independent Lab Specialty_Laboratory
Cluster #9: Specialty_Internal LengthOfStay.daysCount PayDelayCount ProcedureGroup_EM PrimaryConditionGroup_MISCHRT
Cluster #10: LabCountMin LabCountMean LabCo