# Guide to RobMel's Segmentation Code
The segmentation code lets us choose the number of segments to impute by, that is, the number of times the dataset is split into fractions, imputed, and recombined.

The method takes the column with the fewest NaN values, splits the data on that column, imputes once, and then continues to split on the next unaltered column. If a segment has NaNs, they are filled with the *most frequent* value if the column is categorical and the *median* if it is numerical (the choices for these two are outlined in the code).
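As a rough illustration of a single split-and-impute step, here is a minimal sketch using pandas only. The helper name `impute_by_least_nan_column` and the toy data are hypothetical and are not part of the module; the sketch only mirrors the description above (fewest-NaN column, most frequent for categorical, median for numerical).

```python
import numpy as np
import pandas as pd

def impute_by_least_nan_column(df, cat_cols=()):
    """Split df on its least-NaN column, fill each segment, and recombine.

    Hypothetical helper for illustration only; it is not the module's API.
    """
    # Column with the fewest missing values becomes the segmenting column
    split_col = df.isnull().sum().idxmin()
    pieces = []
    for _, segment in df.groupby(split_col, dropna=False):
        segment = segment.copy()
        for col in segment.columns:
            observed = segment[col].dropna()
            if observed.empty:
                continue  # all-NaN in this segment: nothing to learn from here
            # Most frequent value for categorical columns, median for numerical ones
            fill = observed.mode().iloc[0] if col in cat_cols else observed.median()
            segment[col] = segment[col].fillna(fill)
        pieces.append(segment)
    # Recombine the segments and restore the original row order
    return pd.concat(pieces).sort_index(), split_col

toy = pd.DataFrame({
    "region": ["a", "a", "a", "b", "b", "b"],
    "price": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "grade": ["x", "x", np.nan, "y", np.nan, "y"],
})
filled, split_col = impute_by_least_nan_column(toy, cat_cols=("region", "grade"))
print(split_col)
print(filled)
```

Each missing value is filled from its own segment only, so a missing `price` in region `a` is never influenced by prices in region `b`.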

If a column within a segment is entirely NaN, the parent imputed dataframe is used to fill in its values. Each imputation step returns four versions: one each using the **mean**, **median**, **most frequent** and **KNN**.
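The fallback for all-NaN columns could look roughly like the sketch below. The function `fill_from_parent` is a hypothetical name, and the sketch assumes that the segment and the parent imputed dataframe share the same index labels; the published module may handle this differently.

```python
import numpy as np
import pandas as pd

def fill_from_parent(segment, parent_imputed):
    """Fill columns that are entirely NaN in `segment` from `parent_imputed`.

    Hypothetical helper; assumes both frames share index labels.
    """
    filled = segment.copy()
    # Columns with no observed value anywhere in this segment
    all_nan_cols = filled.columns[filled.isnull().all().to_numpy()]
    for col in all_nan_cols:
        # Look up the same rows in the parent, which was imputed one level up
        filled[col] = parent_imputed.loc[filled.index, col]
    return filled

# Parent dataframe that has already been imputed on a coarser split
parent_imputed = pd.DataFrame(
    {"group": ["a", "a", "b", "b"], "value": [1.0, 3.0, 2.0, 2.0]}
)
# Segment for group "b" where `value` is missing in every row
segment = pd.DataFrame(
    {"group": ["b", "b"], "value": [np.nan, np.nan]}, index=[2, 3]
)
result = fill_from_parent(segment, parent_imputed)
print(result)
```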


---


## Difference with Sklearn IterativeImputer
I found that Sklearn's IterativeImputer offers some flexibility in which attributes you divide by; **however**, there appear to be some limitations, namely that all-NaN columns are dropped and that the imputation is done one by one (these are challenges which may be overcome). The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html).
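The all-NaN limitation can be checked with a small experiment. This is a sketch on made-up data, assuming a scikit-learn version where `IterativeImputer` is still behind the experimental import and runs with its default settings; under those defaults I would expect the empty column to be missing from the output. Recent releases also expose a `keep_empty_features` option, which may be worth trying.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy matrix: the third column has no observed values at all
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, np.nan],
    [3.0, 6.0, np.nan],
    [np.nan, 8.0, np.nan],
])

imputed = IterativeImputer(random_state=0).fit_transform(X)

# Compare the number of columns before and after imputation
print(X.shape, "->", imputed.shape)
```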

I have been focusing on building the custom imputation code described above, and leave IterativeImputer to experimentation once we have a working imputer from that code.



---

## Perceived Benefits

The perceived benefit of this code is that it splits and imputes the data on features heuristically selected to maximize the amount of data used as input for imputation.



---

# Non-Disclosure Note

Due to the non-disclosure requirements of the original project this came from, some parts of the module have been redacted and are still in the process of being generalized and published.

In [None]:
#@title Imports
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm
import scipy
from scipy import interpolate
from itertools import product
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [None]:
filepath = "PATH TO DF"
df = pd.read_excel(filepath)

# Module

In [None]:
# Sample Imputer: Feel free to change based on your needs and data distribution
imp = KNNImputer(missing_values=np.nan, n_neighbors=1)

In [None]:
class SegmentImputer:
  """
  Custom Segmentation Imputer Class
  """
  def __init__(self,imputer,segments = [], skip = [], drop = [], threshold = 1,backup=None):
    """
    imputer             - any imputer with the same API as sklearn imputers
    segments (optional) - priority list of segments to take place
    skip (optional)     - features to not be included when choosing a segmenting feature
    drop (optional)     - features to be dropped if spotted in dataset
    threshold (default) - maximum percentage of NaN values in a DF before looking outside for 
    backup (optional)   - precalculated baseline backup dataset if imputer to be used with same dataset
    """
    self.imputer = imputer
    # copy the lists so the shared default arguments are never mutated
    self.segments = list(segments)
    self.skip = list(skip)
    self.drop = list(drop)
    self.threshold = threshold
    self.available = [] # segments partitioned by (moves from more to less, aka specific to general)
    self.__labelencoderlist = []
    self._dummydict = {} #used for temporary storage of dummy columns
    self._baselinebackup = backup # to save computation time
    self.__checksegmentlist()

  def preprocess(self,df,method="default",deep=True,fillna=False,cat_cols=[],dummies=[]):
    """
    Preprocess DataFrame to be compatible for imputation
    
    df      - dataframe to be preprocessed
    method  - method to be used for preprocessing
      * default: uses LabelEncoder to transform categorical variables to numeric (optimal KNN if n_neighbors=1)
      * missforest: non-parametric preprocessing [coming soon]
      * none: no preprocessing
      * onehot: apply one hot encoding (increasing dimension sizes in process)
    """

    # UPDATE NOTE: DOES NOT HANDLE MISSING VALUES YET FOR CATEGORICAL ENCODING
    if method=="none":
      return df
      
    if method=='onehot':
      encdf = pd.DataFrame(index=df.index)
      cols = []
      for col in df.columns:
        if col in cat_cols:
          cols.append(pd.get_dummies(df[col],prefix=col))
        else:
          cols.append(df[col])
      
      encdf = pd.concat(cols,axis = 1)
      return encdf      

    if method=='default':
      encdf = df.copy(deep=deep)
      # drop only the listed features that are actually present in the dataset
      encdf.drop(columns=self.drop,inplace=True,errors="ignore")
      if not len(cat_cols):
        return encdf

      # start from a clean list so repeated calls do not reuse stale encoders
      self.__labelencoderlist = []
      try:
        for col in cat_cols:
          if col in self.skip or col in self.drop:
            continue

          # one encoder per encoded column, stored in encoding order
          encoder = LabelEncoder()
          encdf[col] = encoder.fit_transform(encdf[col])
          self.__labelencoderlist.append(encoder)

      except (TypeError,IndexError) as e:
        print(f"For col {col}, the following error appeared\n{e}")
        print(f"Col type {type(col)}")
    else:
      raise Exception("No Method Selected")
    
    if len(dummies)>0:
      for dummy in dummies:
        dummyset = pd.get_dummies(encdf[dummy])
        self._dummydict[dummy] = dummyset.columns
        for col in dummyset.columns:
          encdf[col] = dummyset[col]
        encdf = encdf.drop(columns=dummy)
    return encdf

  def segment(self,df,n,segmentcriteria="leastnan", splitcriteria="unique",preprocess=True,preprocess_method="default",reset_available=True,cat_cols=[],dummies=[]):
    """
    Allows for segmentation with backup to take place

    df      - raw dataframe to be segmented
    n       - number of segmentations to take place
    """
    if preprocess:
      self._baselinebackup = df
      df = self.preprocess(df, method=preprocess_method,cat_cols=cat_cols,dummies=dummies)
    splits = self.get_splits(df,n,criteria=segmentcriteria,reset_available=reset_available)

    counter = 0
    imputed_splits = []
    for split in splits:
      subset = df
      for i in range(n):
        attr = self.available[i]
        subset = subset[subset[attr]==split[i]]
      counter+=len(subset)
      imputed_subset = self.impute(subset)
      if imputed_subset is not None:
        imputed_splits.append(imputed_subset)
      else:
        print(f"Current Split {split}")
        print(f"Split Size: {len(subset)} ({len(subset)*100/len(df):.2f}%)")
        print(f"{counter} of {len(df)} points processed ({counter*100/len(df):.2f}%)")
        continue
    
    imputed_df = pd.concat(imputed_splits)
    if imputed_df.isnull().sum().sum()>0 and n>0: # stop once no segmenting attributes remain
      print(f"moving one layer up to {n-1} segments from {n}")
      self.available.pop()
      return self.segment(imputed_df,n-1,segmentcriteria=segmentcriteria,splitcriteria=splitcriteria,preprocess=False,reset_available=False,cat_cols=cat_cols,dummies=dummies)
    
    print("Segmentation Complete")

    print("Post Processing...")
    post_process_df = self._postprocess(imputed_df,cat_cols=cat_cols,dummies=dummies)
    return post_process_df

  def impute(self,subset):
    """
    Subset to be imputed
    """
    # check if subset exists
    if len(subset)==0:
      return None
    elif len(subset)==1:
      return subset # cannot be imputed alone
    
    if subset.isnull().sum().sum()==0:
      return subset
    
    # save nan columns: sklearn imputers drop all-NaN columns from their output by default
    nan_bools = subset.isnull().all()
    nan_cols = list(nan_bools[nan_bools].index)
    order = subset.columns # in case columns get dropped
    temp_order = [col for col in order if col not in nan_cols]

    # impute
    print("imputing...")
    imputed_subset = self.imputer.fit_transform(subset)
    # keep the original index so rows stay identifiable after recombining
    imputed_subset = pd.DataFrame(imputed_subset,columns=temp_order,index=subset.index)
    
    # put back nan columns
    for col in nan_cols:
      imputed_subset[col]=np.nan
    
    imputed_subset = imputed_subset.reindex(columns = order)
    
    return imputed_subset


  def get_splits(self,df,n,criteria='leastnan',reset_available=False):
    """
    Gets subsets for given number of splits

    df      - preprocessed dataframe to be segmented
    n       - number of segments
    criteria (optional) - method to be used when choosing a new segment
    reset_available (optional) - whether to recalculate partitions to be used
    """
    if reset_available:
      self._get_set_segments(df,n,criteria=criteria)
    
    # NaN never equals itself, so rows with NaN in a segmenting column match no split
    values = [list(df[col].unique()) for col in self.available]
    splits = product(*values) # iterator
    
    return splits


  def _get_set_segments(self,df,n,criteria="leastnan"):
    """
    Gets and Sets Segments to be used

    df      - preprocessed dataframe to be segmented
    n       - number of segments
    criteria (optional) - method to be used when choosing new segment
    """
    self.available = []
    for i in range(n):
      if i<len(self.segments):
        self.available.append(self.segments[i])
      else:
        self.available.append(self.next_segment(df,used_segments = self.available,criteria=criteria))
    return self.available
    
  def _updatesplitcount(self):
    pass

  def _check_threshold(self,df,split):
    nans = self._count_nans(df)[split]
    pass

  def _count_nans(self,df):
    return ((df.isnull().sum()).sort_values(ascending=False)/len(df))

  ### BEG OUTDATED ###
      #split,splitbackup = splitter(df,seg)
      #backup.extend(splitbackup)
      #dfs.extend(split)
    # while 'n'
      # choose segment
      # split data based on unique values of segment
      # impute, if not available use backup data to impute
      # return, merge, and sort data
  ### END OUTDATED ###

  def next_segment(self,df,used_segments=[],criteria="leastnan"):
    """
    df            - data frame to find next segmenting attribute
    used_segments - list of used segments to be skipped over
    criteria      - criteria which to choose next segment attribute
      * leastnan: least number of NaNs (default)
      * mostnan: most number of Nans
      * alphabetical: alphabetically by attribute name
      * leastunique: least number of unique values
      * mostunique: most number of unique values
    """
    # preliminary check for priority segments
    for seg in self.segments:
      if seg not in used_segments:
        return seg

    # else resume normal flow
    if criteria=="leastnan":
      attrs = list((df.isnull().sum()).sort_values(ascending=False).index)
    elif criteria=="mostnan":
      attrs = list((df.isnull().sum()).sort_values(ascending=True).index)
    elif criteria=="alphabetical":
      attrs = list(df.columns.sort_values(ascending=False))
    elif criteria=="leastunique":
      attrs = list(df.nunique().sort_values(ascending=False).index)
    elif criteria=="mostunique":
      attrs = list(df.nunique().sort_values(ascending=True).index)
    else:
      raise Exception(f"Segmentation Criteria {criteria} not found.\nChoose from 'leastnan','mostnan','alphabetical','leastunique','mostunique'.")
    
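    # walk from the best candidate backwards, skipping used or excluded attributes
    for attr in reversed(attrs):
      if attr not in used_segments and attr not in self.skip:
        return attr
    raise Exception("No attribute left to segment on; reduce n or shorten skip.")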

  def split(self,df,segment,criteria="unique",interval=None):
    """
    df        - dataframe to be split
    segment   - segment as to split on
    criteria  - criteria of which to apply splits
      * unique: by each unique value (default)
      * range: by intervals of values (only works on numeric, else defaults to unique)
    """
    if criteria=="range":
      if not interval:
        print("Interval not supplied, defaulting to unique")
      else:
        raise Exception("Range Not Built Yet")
    elif criteria not in ("range","unique"):
      raise Exception(f"Supplied criteria {criteria} not valid.\nCriteria must be: 'unique','range'.\nIf 'range' supply 'interval' value.")
    
    # default
    return df[segment].unique()

  def _postprocess(self,df,method="none",cat_cols = [],dummies = []):
    """
    Post-processing of data
    
    df      - dataframe to be processed
    method  - method to be used
      * default: for REDACTED Project
      * none: no processing
    """

    if method=="none":
      return df

    ### BEG OUTDATED ###
    #df = df.apply(lambda series: pd.Series(
    #self.__labelencoder.inverse_transform(series[series.notnull()]),
    #index=series[series.notnull()].index))
    ### END OUTDATED ###

    if len(dummies)>0:
      # get max
      for dummy in dummies:
        cols = self._dummydict[dummy]
        df[dummy] = df[cols].idxmax(axis=1)
        df = df.drop(columns=cols)  

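    if len(cat_cols)>0:
      # encoders were fitted only for columns that were not skipped or dropped
      encoded_cols = [col for col in cat_cols if col not in self.skip and col not in self.drop]
      for encoder,col in zip(self.__labelencoderlist,encoded_cols):
        # assign the raw array so values stay aligned with df's own index
        df[col] = encoder.inverse_transform(df[col].astype(int))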

    # sort columns
    cols = self._baselinebackup.columns
    df = df.reindex(cols,axis=1)

    return df
  
  def __checksegmentlist(self):
    # rebuild the list instead of removing while iterating, which skips elements
    self.segments = [seg for seg in self.segments if seg not in self.skip]
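
To make the multi-level splitting easier to follow, here is a standalone sketch of how combinations of segment values can be enumerated with `itertools.product` and turned into subsets, in the spirit of `get_splits` and the filtering loop in `segment`. The function `iter_segments` and the toy columns are hypothetical and exist only for this illustration.

```python
from itertools import product

import pandas as pd

def iter_segments(df, attrs):
    """Yield (key, subset) for every combination of unique values in attrs.

    Hypothetical helper; it mirrors the idea of the module, not its exact code.
    """
    # Unique observed values per segmenting attribute (NaN is left out here)
    values = [list(df[attr].dropna().unique()) for attr in attrs]
    for key in product(*values):
        subset = df
        # Narrow the dataframe one attribute at a time
        for attr, value in zip(attrs, key):
            subset = subset[subset[attr] == value]
        if len(subset):  # many combinations may not occur in the data
            yield key, subset

toy = pd.DataFrame({
    "region": ["a", "a", "b", "b"],
    "year": [2020, 2021, 2020, 2020],
    "value": [1.0, 2.0, 3.0, 4.0],
})
sizes = {key: len(subset) for key, subset in iter_segments(toy, ["region", "year"])}
print(sizes)
```

Because the number of combinations grows multiplicatively with each extra attribute, many of them can be empty or hold a single row, which is one reason the class falls back to fewer segments when NaNs remain.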

# Model Run and Visualization

In [None]:
# May be skewed, but using KMeans on the nonans dataset
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
def fitting(df):
    # Scale once up front; the scaled data is the same for every k
    s_df = StandardScaler().fit_transform(df)
    Sum_of_squared_distances = []
    K = range(1,20)
    for k in K:
        km = KMeans(n_clusters=k).fit(s_df)
        Sum_of_squared_distances.append(km.inertia_)
    plt.plot(K, Sum_of_squared_distances, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Sum_of_squared_distances')
    plt.title('Elbow Method For Optimal k')
    plt.show()

seg = SegmentImputer(KNNImputer())
fitting(seg.preprocess(df))

# End Notes:

With further updates in NumPy and pandas, there remains an opportunity to parallelize many of the operations in the original module.