<a href="https://colab.research.google.com/github/tarun-jethwani/quickrepo/blob/master/FeatureSelectionForClustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Selecting S features/columns from a P-feature dataset for clustering

The approach below ranks features by their variance scores. The idea comes from a research paper on selecting a subset of features for clustering.

The more spread out the data points are in the feature space, the better the chance of getting well-formed clusters. Well-distributed data points mean high variance, which is taken here to mean low uncertainty, the ideal condition for getting well-formed clusters.

Therefore we are going to rank features on the basis of their variance.
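
As a minimal sketch of what ranking by variance looks like in pandas, the snippet below builds a small toy DataFrame and sorts its columns by sample variance. The column names (`narrow`, `medium`, `wide`) and the data are hypothetical and chosen only so that the ordering is obvious; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three evenly spaced columns with increasing spread.
toy = pd.DataFrame({
    "narrow": np.linspace(-1, 1, 50),
    "medium": np.linspace(-3, 3, 50),
    "wide": np.linspace(-10, 10, 50),
})

# Sample variance of every column, sorted from highest to lowest.
variance_rank = toy.var().sort_values(ascending=False)

# Names of the two highest-variance columns.
top_two = list(variance_rank.index[:2])
print(top_two)
```

The column with the widest range comes first, which is the ordering the ranking step relies on.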

In [0]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import heapq as hq # to pick the S feature names with the highest variance scores
from sklearn.cluster import KMeans

In [0]:
# Takes a DataFrame and S, the number of features to keep
def rank_features(df, S):
  # Build a dictionary of column : variance score.
  # A higher variance score means the column is treated as
  # more suitable for clustering.
  col_rank = {}
  for d in df:
    col_rank[d] = df[d].var()
  # Return the S columns/features with the highest variance
  # scores, i.e. the top S ranks.
  return hq.nlargest(S, col_rank, key=lambda x: col_rank[x])
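
The selection step relies on `heapq.nlargest`, which, when given a dictionary, iterates over its keys and orders them by whatever the `key` function returns. A minimal sketch, assuming a hypothetical `scores` dictionary that stands in for the column-to-variance mapping:

```python
import heapq

# Hypothetical column -> variance score mapping.
scores = {"a": 0.4, "b": 2.5, "c": 1.1, "d": 3.0}

# Iterating a dict yields its keys; key=scores.get ranks them by value.
top = heapq.nlargest(2, scores, key=scores.get)
print(top)  # ['d', 'b']
```

The result is a list of column names, highest score first, which can be used directly to index a DataFrame.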


In [0]:
def build_dataset(N,P,S,K):
  # S is a valid parameter only if P is greater than or equal to S
  if P >= S : 
    data = np.random.randn(N,P) # dataset of shape N x P
    df = pd.DataFrame(data) # pandas DataFrame built from the data
    # Pass this DataFrame to rank_features to get the S columns
    # that are most suitable for clustering
    S_features = rank_features(df, S)
    # cluster the sample on the top S columns to get K clusters
    clusters = KMeans(n_clusters=K, random_state=0).fit(df[S_features])
    return clusters 
  else :
    print("invalid entry: S must not be greater than P") 
    return 0

### Tested and working

In [46]:
build_dataset(1000,10,5,3)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
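
The fitted `KMeans` object printed above carries the clustering result in its attributes. The following sketch shows one way to inspect it; the array `X` is a hypothetical stand-in for the selected columns (1000 rows, 5 features), not the data produced by `build_dataset`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the S selected columns: 1000 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = model.labels_                    # cluster index for every row
centers = model.cluster_centers_          # one centroid per cluster
sizes = np.bincount(labels, minlength=3)  # number of rows in each cluster

print(centers.shape)
print(sizes.sum())
```

`labels_` and `cluster_centers_` are the usual starting points for checking how the rows were split across the K clusters.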