# Group Project - Risk Based Segmentation 

This is document contains a description of the task and also a starter code. 
Implement your exercise by changing only this Jupyter Notebook and the class inside RiskDataFrame.py, deliver both files. 


## Introduction

Customer segmentation involves categorizing the portfolio by industry, location, revenue, account size, and number of employees and many other variables to reveal where risk and opportunity live within the portfolio. Those patterns can then provide key measurable data points for more predictive credit risk management. Taking a portfolio approach to risk management gives credit professionals a better fix on the accounts, in order to develop strategies for better serving segments that present the best opportunities. Not only that, you can work to maximize performance
in all customer segments, even seemingly risky segments.

Customer segmentation analysis can lead to several tangible improvements in credit risk management: stronger credit policies, and improved internal communication and cooperation across teams.

## Task scope
Your group is working in the retail risk modeling team and you are asked to build a class to perform risk-based segmentation and test it for car loans’ customers based on given historical data of customer behavior. The class must perform the segmentation from a risk management perspective.

## Class
We will use Nico's great initial code, which extends Pandas DataFrame in a magical way turning our own class into like-Pandas:

    #Initializing the inherited pd.DataFrame
    def __init__(self, *args, **kwargs):
        super().__init__(*args,**kwargs)
    
    @property
    def _constructor(self):
        def func_(*args,**kwargs):
            df = RiskDataframe(*args,**kwargs)
            return df
        return func_

### PERFORM YOUR OWN DATA CLEANNING & DATA PREPARATION HERE

Your objective in this part is simply to prepare the data to apply to the missing_not_at_random and find_segment_split
methods. Do not overcomplicate the data cleanning and data preparation. Keep it simple!


Import the necessary packages including periculum!

In [1]:
from Periculum_Group_B_IE import RiskDataframe as rdf
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.read_excel(r'C:\Users\setej\Documents\IE Masters\Python 2\Group Project\spp2_GroupProject_RBASegmentation\AUTO_LOANS_DATA.xlsx')

Removing account number and costumer id as no value in them

In [2]:
df = df.drop(['ACCOUNT_NUMBER', 'CUSTOMER_ID'], axis=1)

Instanciatate the RiskDataframe class

---
The RiskDataframes init method:

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cat_columns = []
        self.num_columns = []
        self.date_columns = []
        self.mnra_columns = 'Must run the missing_not_at_random method to update this attribute'
        self.full_file_segment_variable = 'Must run the missing_not_at_random method to update this attribute'
        self.thin_file_segment_variable = 'Must run the missing_not_at_random method to update this attribute'
        self.GINI_fitted_full_model = 'Must run the find_segmentation_split method to update this attribute'
        self.accuracy_fitted_full_model = 'Must run the find_segmentation_split method to update this attribute'
        self.variable_split = 'Must run the find_segment_split method to update this attribute'
        self.date_types()
---

In [3]:
myrdf = rdf.RiskDataframe(df)

Check that the class inherited the Pandas dataframe attributes, and methods

In [4]:
myrdf.columns

Index(['REPORTING_DATE', 'PROGRAM_NAME', 'LOAN_OPEN_DATE',
       'EXPECTED_CLOSE_DATE', 'ORIGINAL_BOOKED_AMOUNT', 'OUTSTANDING',
       'BUCKET', 'SEX', 'CUSTOMER_OPEN_DATE', 'BIRTH_DATE', 'PROFESSION',
       'CAR_TYPE'],
      dtype='object')

In [5]:
myrdf.head()

Unnamed: 0,REPORTING_DATE,PROGRAM_NAME,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,BUCKET,SEX,CUSTOMER_OPEN_DATE,BIRTH_DATE,PROFESSION,CAR_TYPE
0,2016-01-31,Auto Loans 50% Down Payment - Employed,2015-11-25,2020-11-03,91000.0,88223.4,0,M,2015-10-27,1986-03-24,EMPLOYEE,KIA
1,2016-01-31,Pick Up and Small Trucks,2015-12-08,2017-12-03,35000.0,33714.82,0,M,2015-11-29,1985-08-18,EMPLOYEE,CARRY
2,2016-01-31,Auto Loans 40% Down Payment - Employed,2016-01-12,2021-01-03,52500.0,52500.0,0,F,2015-12-28,1985-07-02,HOUSEWIFE,CHEVROLET
3,2016-01-31,Auto Loans 30% Down Payment - Self Employed,2015-11-23,2019-10-03,103000.0,99054.45,0,M,2015-10-21,1979-01-01,Shop Owner,MITSUBISHI
4,2016-01-31,Auto Loans 30% Down Payment - Self Employed,2015-11-23,2018-11-03,94250.0,89450.17,0,M,2015-11-02,1977-01-20,Shop Owner,SEAT


date_types method is called in the instantiation of the class as a way to give the object the necessary attributes for the other methods.

---
    def date_types(self):
        """
        This method is used as a way to create a list for each column types,
        in order to use these lists in further methods that will be called.
        Since this method is necessary for the other methods in the class,
        it is called when the class is instantiated.
        """
        for column in self.columns:
            if self[column].dtype == 'O':
                self.cat_columns.append(column)
            elif self[column].dtype == 'float64':
                self.num_columns.append(column)
            elif self[column].dtype == 'int64':
                self.num_columns.append(column)
            elif self[column].dtype == '<M8[ns]':
                self.date_columns.append(column)
            else:
                None
---

In [6]:
myrdf.num_columns, myrdf.cat_columns, myrdf.date_columns

(['ORIGINAL_BOOKED_AMOUNT', 'OUTSTANDING', 'BUCKET'],
 ['PROGRAM_NAME', 'SEX', 'PROFESSION', 'CAR_TYPE'],
 ['REPORTING_DATE',
  'LOAN_OPEN_DATE',
  'EXPECTED_CLOSE_DATE',
  'CUSTOMER_OPEN_DATE',
  'BIRTH_DATE'])

date_to_int method is an option method that will take any date columns and convert them to timedelta numbers as fraction of a year, using the parameter report_date as a required argument to calculate the difference

---
    def date_to_int(self, reporting_date):
        """
        This is an optional method in case the data in the dataframe has date columns,
        this method will convert all the dates to year fraction from the reporting date
        to calculate the time difference. Which is necessary for the application of the
        segmentation method, because Sklearn logistic regression does not accept date type
        values.
        ----------
        reporting_date : This variable is a datetime object that will be the point in time
        where all of the timedelta's will be calculated from.
        -------
        """
        for column in self.columns:
            if self[column].dtype == '<M8[ns]':
                self[column] = abs(self[column] - reporting_date).astype('timedelta64[D]')
                self[column] = round(self[column] / 365, 2)
            else:
                pass
        self.num_columns = self.num_columns + self.date_columns
        self.date_columns = "None due to date_to_int method being called before."
---

In [7]:
reporting_date = myrdf['REPORTING_DATE'].max()

In [8]:
myrdf.date_to_int(reporting_date)

Dropping reporting date column, and remoove it from the num_columns attribute

In [9]:
myrdf.drop('REPORTING_DATE', inplace=True, axis=1)
myrdf.num_columns.remove('REPORTING_DATE')

In [10]:
myrdf.num_columns, myrdf.cat_columns, myrdf.date_columns

(['ORIGINAL_BOOKED_AMOUNT',
  'OUTSTANDING',
  'BUCKET',
  'LOAN_OPEN_DATE',
  'EXPECTED_CLOSE_DATE',
  'CUSTOMER_OPEN_DATE',
  'BIRTH_DATE'],
 ['PROGRAM_NAME', 'SEX', 'PROFESSION', 'CAR_TYPE'],
 'None due to date_to_int method being called before.')

In [11]:
myrdf.head()

Unnamed: 0,PROGRAM_NAME,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,BUCKET,SEX,CUSTOMER_OPEN_DATE,BIRTH_DATE,PROFESSION,CAR_TYPE
0,Auto Loans 50% Down Payment - Employed,3.76,1.18,91000.0,88223.4,0,M,3.84,33.45,EMPLOYEE,KIA
1,Pick Up and Small Trucks,3.73,1.74,35000.0,33714.82,0,M,3.75,34.05,EMPLOYEE,CARRY
2,Auto Loans 40% Down Payment - Employed,3.63,1.35,52500.0,52500.0,0,F,3.67,34.18,HOUSEWIFE,CHEVROLET
3,Auto Loans 30% Down Payment - Self Employed,3.77,0.1,103000.0,99054.45,0,M,3.86,40.68,Shop Owner,MITSUBISHI
4,Auto Loans 30% Down Payment - Self Employed,3.77,0.82,94250.0,89450.17,0,M,3.82,42.63,Shop Owner,SEAT


#  Implement the following 2 methods to automate the Risk-based segmentation process:
* You can implement more methods if you think it is necessary.
* In computer science, when we are dividing the code it is important to think which code does what. For example, __data cleanning__ and __data preparation__, is it done by the Risk Based Segmentation Class or the class assumes that the data is clean and ready for modelling (all variables are numeric, and dummies are already provided)?
* Use: the input dataset should already be clean and ready for the trainning of a Logistic Regression with a binary target 0 and 1 class. 
* Scope: data cleanning and data preparation is out of the scope of the Class, but notice that .missing_not_at_random() requires the data to have missing values.

## 1) Implement a method .missing_not_at_random() 
To identify different potential segments sharing data (based on sharing missing values) - Expected result is a print:
Missing Not At Random Repport (MNAR) -  PROFESSION, SEX and BIRTH_DATE variables seem Missing Not at Random, there for we recommend:

&emsp;  Thin File Segment Variables (all others variables free of MNAR issue): REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, 
OUTSTANDING, CUSTOMER_OPEN_DATE, CAR_TYPE

&emsp; Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE


Check the mnra_columns attribute before running the method called missing_not_at_random

In [12]:
myrdf.mnra_columns

'Must run the missing_not_at_random method to update this attribute'

---
    def missing_not_at_random(self, corr_threshold=0.9):
        """
        This method is checks for the correlation between the missing values in all the columns, pair by
        pair in order, in order to see if the correlation is higher than threshold to be considered missing
        not at random.
        -------
        corr_threshold: This variable is the threshold that will be used as a cut off to decide if the
        correlation between the missing values between a pair of columns is high enough to be considered missing
        not at random.
        """

        def redundant_pairs(self):
            """
            This function inside the method is used to find pairs of columns that are
            are repeated in the correlation matrix used in the method missing_not_at_random.
            """
            pairs_to_drop = set()
            cols = self.columns
            for i in range(0, self.shape[1]):
                for j in range(0, i + 1):
                    pairs_to_drop.add((cols[i], cols[j]))
            return pairs_to_drop

        NaS = self.iloc[:, [i for i, n in enumerate(np.var(self.isna(), axis='rows')) if n > 0]]
        labels_to_drop = redundant_pairs(NaS)
        NaS_df = NaS.isnull().corr().unstack()
        NaS_corr = NaS_df.drop(labels=labels_to_drop).sort_values(ascending=False)
        mnra_list = []

        for i in range(len(NaS_corr)):
            if (NaS_corr[i] >= corr_threshold):
                mnra_list.append(NaS_corr.index[i])
            else:
                pass
        mnra_columns = list(set([item for sublist in mnra_list for item in sublist]))
        full_file_segment_variable = self.num_columns + self.cat_columns
        thin_file_segment_variable = [x for x in full_file_segment_variable if x not in mnra_columns]

        self.mnra_columns = mnra_columns
        self.full_file_segment_variable = full_file_segment_variable
        self.thin_file_segment_variable = thin_file_segment_variable
---

In [13]:
myrdf.missing_not_at_random()

Missing Not At Random Repport - ['PROFESSION', 'BIRTH_DATE', 'SEX'] variables seem Missing Not at Random,there for we recommend:
Thin File Segment Variables: ['ORIGINAL_BOOKED_AMOUNT', 'OUTSTANDING', 'BUCKET', 'LOAN_OPEN_DATE', 'EXPECTED_CLOSE_DATE', 'CUSTOMER_OPEN_DATE', 'PROGRAM_NAME', 'CAR_TYPE']
Full File Segment Variables: ['ORIGINAL_BOOKED_AMOUNT', 'OUTSTANDING', 'BUCKET', 'LOAN_OPEN_DATE', 'EXPECTED_CLOSE_DATE', 'CUSTOMER_OPEN_DATE', 'BIRTH_DATE', 'PROGRAM_NAME', 'SEX', 'PROFESSION', 'CAR_TYPE']


Check the results in of the attribute following the method being called

In [14]:
myrdf.mnra_columns

['PROFESSION', 'BIRTH_DATE', 'SEX']

---
OUTPUT SAMPLE:

__Missing Not At Random Repport__ -  REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID variables seem Missing Not at Random, there for we recommend:

&emsp;  Thin File Segment Variables: PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, 
ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, SEX, 
CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

&emsp; Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, 
OUTSTANDING, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

---

##	2) implement a method .find_segment_split(variable)
given one variable, implement a method to identify if the variable is a good segmentation splitter and if the variable is a good splitter,   
different segments of customers with different level of risk (the one explained in the second video)
* Scope: data cleanning and data preparation is out of the scope of the Class, note that .find_segment_split(VARIABLE) assumes the data is already clean free of missing values.
* Categorical: for the segmentation process of categorical variable, dummy transformation is not practical, it is recommended that categorical variables come pre-transformed into numerical by replacing the categories by the Probability of belonging to class 1.
* The following code only works for a single variable, implement a loop going over each variable of the dataset as a candidate for segmentation.
* The following method must implement two segmentation approaches, one for Categorical Nominal (order not relevant - variable must be automatically transformed) and others where order is important.



Check for the amount of missing rows in the dataset

In [15]:
myrdf.isna().sum()/myrdf.count()*100 , myrdf.shape

(PROGRAM_NAME              0.000000
 LOAN_OPEN_DATE            0.000000
 EXPECTED_CLOSE_DATE       0.000000
 ORIGINAL_BOOKED_AMOUNT    0.000000
 OUTSTANDING               0.000000
 BUCKET                    0.000000
 SEX                       0.505170
 CUSTOMER_OPEN_DATE        0.000000
 BIRTH_DATE                0.505731
 PROFESSION                0.620796
 CAR_TYPE                  1.295115
 dtype: float64,
 (900860, 11))

Use simple imputing to fill these rows, as this is out of the score of the RiskDataframe class

In [16]:
cat_imputer = SimpleImputer(strategy='most_frequent')
num_imputer = SimpleImputer(strategy='median')

df_cat = pd.DataFrame(cat_imputer.fit_transform(myrdf[myrdf.cat_columns]))
df_cat.columns = myrdf.cat_columns
df_num = pd.DataFrame(num_imputer.fit_transform(myrdf[myrdf.num_columns]))
df_num.columns = myrdf.num_columns


df = df_cat
df = df.join(df_num)
myrdf = rdf.RiskDataframe(df)
myrdf.head()

Unnamed: 0,PROGRAM_NAME,SEX,PROFESSION,CAR_TYPE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,BUCKET,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,CUSTOMER_OPEN_DATE,BIRTH_DATE
0,Auto Loans 50% Down Payment - Employed,M,EMPLOYEE,KIA,91000.0,88223.4,0.0,3.76,1.18,3.84,33.45
1,Pick Up and Small Trucks,M,EMPLOYEE,CARRY,35000.0,33714.82,0.0,3.73,1.74,3.75,34.05
2,Auto Loans 40% Down Payment - Employed,F,HOUSEWIFE,CHEVROLET,52500.0,52500.0,0.0,3.63,1.35,3.67,34.18
3,Auto Loans 30% Down Payment - Self Employed,M,Shop Owner,MITSUBISHI,103000.0,99054.45,0.0,3.77,0.1,3.86,40.68
4,Auto Loans 30% Down Payment - Self Employed,M,Shop Owner,SEAT,94250.0,89450.17,0.0,3.77,0.82,3.82,42.63


Check the results

In [17]:
myrdf.isna().sum()/myrdf.count()*100 , myrdf.shape

(PROGRAM_NAME              0.0
 SEX                       0.0
 PROFESSION                0.0
 CAR_TYPE                  0.0
 ORIGINAL_BOOKED_AMOUNT    0.0
 OUTSTANDING               0.0
 BUCKET                    0.0
 LOAN_OPEN_DATE            0.0
 EXPECTED_CLOSE_DATE       0.0
 CUSTOMER_OPEN_DATE        0.0
 BIRTH_DATE                0.0
 dtype: float64,
 (900860, 11))

We are going to binarize the target variable in order improve model performance and simplification

In [18]:
myrdf['BUCKET'].value_counts()

0.0    743348
1.0    100979
2.0     35120
3.0     11306
4.0      4981
5.0      2627
7.0      1457
6.0      1042
Name: BUCKET, dtype: int64

In [19]:
target = 'BUCKET'
target_series = myrdf[target]
myrdf = myrdf.drop(target, axis=1)
target_array = np.vectorize(lambda x: 0 if x == 0 else 1)(target_series)
target_df = pd.DataFrame(target_array)
target_df = target_df.rename(columns={0:'BUCKET'}) 


myrdf = myrdf.join(target_df)
myrdf.num_columns.append(target)
myrdf.head()

Unnamed: 0,PROGRAM_NAME,SEX,PROFESSION,CAR_TYPE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,CUSTOMER_OPEN_DATE,BIRTH_DATE,BUCKET
0,Auto Loans 50% Down Payment - Employed,M,EMPLOYEE,KIA,91000.0,88223.4,3.76,1.18,3.84,33.45,0
1,Pick Up and Small Trucks,M,EMPLOYEE,CARRY,35000.0,33714.82,3.73,1.74,3.75,34.05,0
2,Auto Loans 40% Down Payment - Employed,F,HOUSEWIFE,CHEVROLET,52500.0,52500.0,3.63,1.35,3.67,34.18,0
3,Auto Loans 30% Down Payment - Self Employed,M,Shop Owner,MITSUBISHI,103000.0,99054.45,3.77,0.1,3.86,40.68,0
4,Auto Loans 30% Down Payment - Self Employed,M,Shop Owner,SEAT,94250.0,89450.17,3.77,0.82,3.82,42.63,0


In [20]:
myrdf['BUCKET'].value_counts()

0    743348
1    157512
Name: BUCKET, dtype: int64

In [21]:
myrdf.head()

Unnamed: 0,PROGRAM_NAME,SEX,PROFESSION,CAR_TYPE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,CUSTOMER_OPEN_DATE,BIRTH_DATE,BUCKET
0,Auto Loans 50% Down Payment - Employed,M,EMPLOYEE,KIA,91000.0,88223.4,3.76,1.18,3.84,33.45,0
1,Pick Up and Small Trucks,M,EMPLOYEE,CARRY,35000.0,33714.82,3.73,1.74,3.75,34.05,0
2,Auto Loans 40% Down Payment - Employed,F,HOUSEWIFE,CHEVROLET,52500.0,52500.0,3.63,1.35,3.67,34.18,0
3,Auto Loans 30% Down Payment - Self Employed,M,Shop Owner,MITSUBISHI,103000.0,99054.45,3.77,0.1,3.86,40.68,0
4,Auto Loans 30% Down Payment - Self Employed,M,Shop Owner,SEAT,94250.0,89450.17,3.77,0.82,3.82,42.63,0


In [22]:
myrdf.accuracy_fitted_full_model

'Must run the find_segmentation_split method to update this attribute'

---
    def find_segment_split(self, target='', robust_scaler=''):
        """
        This method finds if the data in each column performs better if it is split in order to segment the data
        and have a better model fit. The model used is logistic regression for a binary classification, which does
        not accept alphanumeric values, therefore labelencoder is automatically called if the method detects these
        data type columns. The required argument for this method is target, since the logistic regression model needs
        this. Robust_scaler is an optional argument in order to enhance model performance. Once the baseline model with
        the full file without segmentation is calculated, this method continues to find where is the optimal place
        for spltting each column by applying a decision tree classifier, and extracting the root node splitting point.
        Finally it fits a model on the segmented dataset and compares the results of both models.

        Returns
        -------
        Example 1:
        ORIGINAL_BOOKED_AMOUNT: Not good for segmentation. Afer analysis, we did not find a good split using this variable.
        Model Developed on ORIGINAL_BOOKED_AMOUNT Seg 1 (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 1 (test sample): 0.269 %
        Model Developed on Full Population (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 1 (test sample): 0.269 %
        Model Developed on ORIGINAL_BOOKED_AMOUNT Seg 2 (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 2 (test sample): 0.263 %
        Model Developed on Full Population (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 2 (test sample): 0.263 %
                
        """

        if len(self.cat_columns) > 0:
            df_cat = self[self.cat_columns]
            for column in range(len(self.cat_columns)):
                df_cat[self.cat_columns[column]] = LabelEncoder().fit_transform(df_cat[self.cat_columns[column]])
            self.drop(self.cat_columns, inplace=True, axis=1)
            for col in df_cat.columns:
                self[col] = df_cat[col]
        else:
            pass

        if robust_scaler.upper() == 'YES':
            non_target_df = self.drop(target, axis=1)
            scaled_features = RobustScaler().fit_transform(non_target_df.values)
            scaled_df = pd.DataFrame(scaled_features, index=non_target_df.index, columns=non_target_df.columns)
            self.drop(scaled_df.columns, inplace=True, axis=1)
            for col in scaled_df.columns:
                self[col] = scaled_df[col]
        else:
            pass

        # Baseline model
        df_train, df_test = train_test_split(self, test_size=0.2, random_state=42)
        try:
            self.num_columns.remove(target)
        except:
            self.cat_columns.remove(target)
        X_train = df_train.drop(target, axis=1)
        y_train = df_train[target]
        X_test = df_test.drop(target, axis=1)
        y_test = df_test[target]
        method = LogisticRegression(random_state=0, solver='lbfgs', max_iter=100)
        fitted_full_model = method.fit(X_train, y_train)
        y_pred_proba = fitted_full_model.predict_proba(X_test)[:, 0]
        y_pred = fitted_full_model.predict(X_test)
        fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
        roc_auc = auc(fpr, tpr)
        self.GINI_fitted_full_model = abs((2 * roc_auc) - 1)
        self.accuracy_fitted_full_model = accuracy_score(y_test, y_pred)

        # Function to decide where to split
        all_columns = self.num_columns + self.cat_columns
        split_list = []
        def splits(column):
            x = self.drop(target, axis=1)
            y = self[target]
            single_x = np.array(x[column]).reshape(-1, 1)
            X_train, X_test, y_train, y_test = train_test_split(single_x, y, test_size=0.2, random_state=42)
            method = DecisionTreeClassifier(random_state=0, max_depth=3)
            individual_feature_model = method.fit(X_train, y_train)
            y_pred = individual_feature_model.predict(X_test)
            split = str(tree.export_text(individual_feature_model))
            split = float(split[17:23])
            split_list.append(split)
            return split_list
        np.vectorize(splits, otypes=[list])(all_columns)
        self.variable_split = dict(zip(all_columns, split_list))

        # Function to decide if good segmentation loop
        def segmentation(column, split):
            df_train_seg1 = df_train[self[column] > split]
            df_train_seg2 = df_train[self[column] <= split]
            df_test_seg1 = df_test[self[column] > split]
            df_test_seg2 = df_test[self[column] <= split]

            X_train_seg1 = df_train_seg1[all_columns]
            y_train_seg1 = df_train_seg1[target]
            X_test_seg1 = df_test_seg1[all_columns]
            y_test_seg1 = df_test_seg1[target]

            fitted_model_seg1 = method.fit(X_train_seg1, y_train_seg1)
            y_pred_seg1 = fitted_model_seg1.predict_proba(X_test_seg1)[:, 1]
            y_pred_seg1_fullmodel = fitted_full_model.predict_proba(X_test_seg1)[:, 1]

            fpr, tpr, thresholds = roc_curve(y_test_seg1, y_pred_seg1)
            roc_auc = auc(fpr, tpr)
            GINI_seg1 = round(abs((2 * roc_auc) - 1),3)

            fpr, tpr, thresholds = roc_curve(y_test_seg1, y_pred_seg1_fullmodel)
            roc_auc = auc(fpr, tpr)
            GINI_seg1_full = round(abs((2 * roc_auc) - 1),3)

            X_train_seg2 = df_train_seg2[all_columns]
            y_train_seg2 = df_train_seg2[target]
            X_test_seg2 = df_test_seg2[all_columns]
            y_test_seg2 = df_test_seg2[target]

            fitted_model_seg2 = method.fit(X_train_seg2, y_train_seg2)
            y_pred_seg2 = fitted_model_seg2.predict_proba(X_test_seg2)[:, 1]
            y_pred_seg2_fullmodel = fitted_full_model.predict_proba(X_test_seg2)[:, 1]

            fpr, tpr, thresholds = roc_curve(y_test_seg2, y_pred_seg2)
            roc_auc = auc(fpr, tpr)
            GINI_seg2 = round(abs((2 * roc_auc) - 1),3)

            fpr, tpr, thresholds = roc_curve(y_test_seg2, y_pred_seg2_fullmodel)
            roc_auc = auc(fpr, tpr)
            GINI_seg2_full = round(abs((2 * roc_auc) - 1),3)

            if GINI_seg1 > GINI_seg1_full and GINI_seg2 > GINI_seg2_full:
                print(f'{column}: Good for segmentation.')
                print(f'Segment1: {column} > {split} [GINI Full Model: {GINI_seg1_full}% / GINI Segmented Model: {GINI_seg1}')
                print(f'Segment2: {column} > {split} [GINI Full Model: {GINI_seg2_full}% / GINI Segmented Model: {GINI_seg2}')
            else:
                print(f'{column}: Not good for segmentation. Afer analysis, we did not find a good split using this variable.')

            print(f"Model Developed on {column} Seg 1 (train sample) applied on {column} Seg 1 (test sample):",
                  GINI_seg1,'%')
            print(f"Model Developed on Full Population (train sample) applied on {column} Seg 1 (test sample):",
                  GINI_seg1_full,'%')
            print(f"Model Developed on {column} Seg 2 (train sample) applied on {column} Seg 2 (test sample):",
                  GINI_seg2,'%')
            print(f"Model Developed on Full Population (train sample) applied on {column} Seg 2 (test sample):",
                  GINI_seg2_full,'%')
        np.vectorize(segmentation, otypes=[list])(all_columns, split_list)
---

In [23]:
myrdf.find_segment_split(target='BUCKET', robust_scaler='Yes')

ORIGINAL_BOOKED_AMOUNT: Not good for segmentation. Afer analysis, we did not find a good split using this variable.
Model Developed on ORIGINAL_BOOKED_AMOUNT Seg 1 (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 1 (test sample): 0.238 %
Model Developed on Full Population (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 1 (test sample): 0.238 %
Model Developed on ORIGINAL_BOOKED_AMOUNT Seg 2 (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 2 (test sample): 0.265 %
Model Developed on Full Population (train sample) applied on ORIGINAL_BOOKED_AMOUNT Seg 2 (test sample): 0.265 %
OUTSTANDING: Not good for segmentation. Afer analysis, we did not find a good split using this variable.
Model Developed on OUTSTANDING Seg 1 (train sample) applied on OUTSTANDING Seg 1 (test sample): 0.256 %
Model Developed on Full Population (train sample) applied on OUTSTANDING Seg 1 (test sample): 0.256 %
Model Developed on OUTSTANDING Seg 2 (train sample) applied on OUTSTANDING Seg 2 (test sample):

In [24]:
myrdf.head()

Unnamed: 0,BUCKET,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,CUSTOMER_OPEN_DATE,BIRTH_DATE,PROGRAM_NAME,SEX,PROFESSION,CAR_TYPE
0,0,0.316327,0.712837,-0.121547,-0.039735,-0.158654,-0.428502,0.0,0.0,0.0,0.227273
1,0,-0.826531,-0.260049,-0.138122,0.331126,-0.201923,-0.389547,2.5,0.0,0.0,-0.909091
2,0,-0.469388,0.075235,-0.19337,0.072848,-0.240385,-0.381107,-0.333333,-1.0,0.181818,-0.772727
3,0,0.561224,0.906153,-0.116022,-0.754967,-0.149038,0.040902,-0.833333,0.0,2.363636,0.636364
4,0,0.382653,0.734733,-0.116022,-0.278146,-0.168269,0.167505,-0.833333,0.0,2.363636,1.0


In [25]:
myrdf.accuracy_fitted_full_model

0.8240681126923163

# TO DO:

## Mandatory (part 1 - p1) -  Implement the following: (5 out of 10 points)

- __Project Name__: pick a name for your project (if it is taken at https://pypi.org/ please create small variations), I recommend you get inspiration by the following Pockemon names: some_pockemon_examples.zip

- __Project Managment (Github)__: Work in group using Github, invite professor manoelgad@gmail.com as a collaborator to your project from the very beggining.

- __Implementation__: Discuss in group and decide the implementation you need to do for each method (missing_not_at_random and find_segment_split), then do the implementation of missing_not_at_random and find_segment_split. Your implementation should work in any dataset (make the necessary assumption and inform the user if the assumption are not followed, for example: inform the dataset must be clean and types must be informed in case they are not). 

- __Video__: Create a video from 5 to 15 minutes explainning the whole library (including p1 and p2) and showing examples of how to use it. The video will not be assessed own its own and won't be assessed by colleagues. The video can be very simple, just the notebook/python class and someone explainning things. Upload the video to Youtube and include a link to the video in the website if your group decide to pick Publishing below.    


## Improvements (part 2 - p2) -  Implement 2 of the following list of tasks:  (5 out of 10 points)

- __Improving__: Make improvements to the code -  Reliable/Robust: Create a train-test split, train all models in train and test all models always in test; Robust: Research and apply a statistical test to decide when the accuracy diffrence in statiscally relevant. Small functions/methods: Break your implementation into small functions/methods; Fast: Optimize your code, use vectorization when possible. Use stratified random sampling to reduce dataset sizes and therefore speed up the segmentation process. Implement segmentation split using Tree algorithm.

- __Publishing__: Publish your code in GitHub -  Work in group using Github, invite professor manoelgad@gmail.com as a collaborator to your project from the very beggining. Create a python package and distribute your package using https://pypi.org/, by the end of the project one must be able to pip install your project and use it.
References: https://www.youtube.com/watch?v=GIF3LaRqgXo; and  https://github.com/judy2k/publishing_python_packages_talk

- __Testing__: Implement a Test class using unittest with an "comprehensive" set of tests using a series of datasets of your choice. Have a look at this: https://ains.co/blog/things-which-arent-magic-flask-part-1.html and https://www.youtube.com/watch?v=1Lfv5tUGsn8

- __Documentation__: create a documentation for your project and publish it at GitHub project (readme) and also a pythonanywhere.com website (simple HTML). The documentation must contain an about, a how to and also examples of how to use with one or more datasets. All used datasets should be provided within the project (make sure you don't share huge datasets, make it small before sharing your code).

- __Logging & Repporting__: Log all intermediate results and final results into a Sqlite database using SQLAlchemy, then produce the final result repport in HTML format using Bokeh.


# Evaluation criteria

All team members have the choice of focusing, by choising 1 or 2 of the tasks of part 2.
*   If 1 task of part 2 is choosen, grade will be: p1\*0.5 + p2x\*0.4 + (p2y\*0.1) (p2x is the grade in the choosen task and p2y the other) 
*   If 2 tasks of part 2 are choosen, grade will be: p1\*0.5 + (p2x\*0.25 + p2y\*0.25)
* If your group only implement 1 extra part, all members will be assessed using: p1\*0.5 + p2x\*0.4 + (p2y\*0.1)
* If your group implements more then 2 parts, please indicate the ones you want to be assessed upon.


What professor will look at when assessing the project:
*	Problem structuring - How did you structure the problem and the project?
*	What assumptions did you make? (Please mention them in the video)
*	How did you narrow the scope? (Please mention them in the video)
*	Technical Skills: How reliable (does it use your own class? does is it apply data quality controls?)), readable and flexible (can you apply your code to a new dataset?) was the code that you developed?
* Analytical Skills: How logically sound, complete and meaningful was the approach (machine learning, statistics, analytics, visualization…) that you applied?
*	Usefulness:	How useful would the results of your work for new datasets?



# Tools
You are allowed to use __Python only__ and any Python Library inside your Jupyter Notebook or inside your Class, always give preference to Pandas and Scikit Learn whenever you can.



# Deliverables
*	A zip file with all code and datasets used for the projet.


# Data description
We will provide you with historical data of car loans. The data contains monthly status for each loan for 3 years. In addition to some demographic information

# Notes:
*	This data is Loan level NOT Customer level, meaning that one customer can take more than one loan
*	The data is monthly starting from 2016-01-01 to 2019-09-01 so if the loan already started before Jan2016 you will find partial history for it.
*	We have multiple programs under the car loans product
*	Make sure you understand the difference between Buckets

# Research:
In order to implement the methods missing_not_at_random and find_segment_split, you are allowed to search for whichever information you need in the internet including but not limited to:
*	Code syntax
*	Business term (However you can ask me)

Start by looking into these 3 videos:
*	What is Risk-based segmentation? https://www.youtube.com/watch?v=2ZpLgUcucfQ 
*   This is a generic video on Segmentation, it is a good reference, but careful not all needs to be implemented and not all mentioned here is relevant for this project: https://www.youtube.com/watch?v=PLsUfDDytaE 


---

# APPENDIX: 

## Simple example of Risk Based Segmentation

*   Video explainning the code below: https://www.youtube.com/watch?v=kWtnlpGwh_o


In [None]:
df = dataframe#pd.read_csv("AUTO_LOANS_DATA.csv", sep=";")

In [None]:
argument_dict = {'REPORTING_DATE':'datetime64[ns]','LOAN_OPEN_DATE':'datetime64[ns]',
                 'EXPECTED_CLOSE_DATE':'datetime64[ns]','CUSTOMER_OPEN_DATE':'datetime64[ns]',
                 'BIRTH_DATE':'datetime64[ns]','PROGRAM_NAME':'category','BUCKET':'category','SEX':'category',
                'PROFESSION':'category','CAR_TYPE':'category'}
myrdf.SetAttributes(argument_dict)
myrdf.dtypes

### Random Sample trick to speed up the process...

In [None]:
from sklearn.model_selection import train_test_split
df = dataframe
df_random_sample, _ = train_test_split(dataframe, test_size=0.95)

In [None]:
df = df_random_sample

In [None]:
df

In [None]:
df.head()

### Dirty variable selection, feature transformation and data cleanning

In [None]:
#df = df.fillna(0)

In [None]:
def get_specific_columns(df, data_types, to_ignore = list(), ignore_target = False):
    columns = df.select_dtypes(include=data_types).columns
    if ignore_target:
        columns = filter(lambda x: x not in to_ignore, list(columns))
    return list(columns)

In [None]:
target = 'BINARIZED_TARGET'

In [None]:
all_numeric_variables = get_specific_columns(df, ["float64", "int64"], [target], ignore_target = True)

# LogisticRegression - Full Model - all variables
You sould use LogisticRegression in the modeling part to avoid any overfitting issues, and also split your data into train and test split.


In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
splitter = train_test_split
"-----------------------"

df_train, df_test = splitter(df, test_size = 0.2, random_state = 42)

In [None]:
X_train = df_train[all_numeric_variables]
y_train = df_train[target]

In [None]:
X_test = df_test[all_numeric_variables]
y_test = df_test[target]

In [None]:
from sklearn.linear_model import LogisticRegression
method = LogisticRegression(random_state=0)
fitted_full_model = method.fit(X_train, y_train)
y_pred = fitted_full_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

# GINI vs Accuracy - use GINI for this analysis!

GINI as well as accuracy is a 0 to 1 measure, 0 being very bad prediction and 1 being perfect separation.
For this project __you should use GINI__ as it looks the model in all predictions (all range of probabilities), accuracy gets the probability and using a cut-off and transform the probability into predicted class 0 for probabilities below 50% and predicted class 1 for above or equal to 50%. So using accuracy makes our analysis for segmentation very short sighted as the result of the analysis could change if one changes the cut-off to let say 40%, for this reason we will use the GINI coeficient which is independent of the cut-off having a better overview of the whole model predictions.

GINI is a simple calculation resulting from AUC. You will not find directly the Gini Coefficient as an attribute for the LogisticRegressor Class, but you can use the 2*AUC-1 formula to calculate it. 

If you want more details about GINI have a look into this video:
https://www.youtube.com/watch?v=MiBUBVUC8kE


Make sure you use .predict_proba (to predict probability) and then get the first column using [:,1] to get only the probability of being 1, instead of .predict which gives the 0 or 1 class. This proba is  what you need to pass as predictions_list below, to finally obtain the GINI:

In [None]:
y_pred_probadbility = fitted_full_model.predict_proba(X_test)[:,1]
#y_test is your actual 0 and 1 class and y_pred_probadbility is the predicted probability of belonging to class 1.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probadbility)
roc_auc = auc(fpr, tpr)
GINI = (2 * roc_auc) - 1
print(GINI)

# Analysis Model 1 - Gender M and F

In [None]:
df['SEX'].value_counts()

In [None]:
df_train_seg1 = df_train[df['SEX'] == "M"]
df_train_seg2 = df_train[df['SEX'] != "M"]
df_test_seg1 = df_test[df['SEX'] == "M"]
df_test_seg2 = df_test[df['SEX'] != "M"]

# Full Model vs Seg 1 on Seg 1

In [None]:
X_train_seg1 = df_train_seg1[all_numeric_variables]
y_train_seg1 = df_train_seg1[target]
X_test_seg1 = df_test_seg1[all_numeric_variables]
y_test_seg1 = df_test_seg1[target]
fitted_model_seg1 = method.fit(X_train_seg1, y_train_seg1)

def GINI(y_test, y_pred_probadbility):
    from sklearn.metrics import roc_curve, auc
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_probadbility)
    roc_auc = auc(fpr, tpr)
    GINI = (2 * roc_auc) - 1
    return(GINI)

y_pred_seg1_proba = fitted_model_seg1.predict_proba(X_test_seg1)[:,1]
y_pred_seg1_fullmodel_proba = fitted_full_model.predict_proba(X_test_seg1)[:,1]

print("Segment1: SEX in ('M') [GINI Full Model: {:.4f}% / GINI Segmented Model: {:.4f}%]".format(
    GINI(y_test_seg1, y_pred_seg1_proba)*100,
    GINI(y_test_seg1, y_pred_seg1_fullmodel_proba)*100
)) 

# Full Model vs Seg 2 on Seg 2

In [None]:
X_train_seg2 = df_train_seg2[all_numeric_variables]
y_train_seg2 = df_train_seg2[target]
X_test_seg2 = df_test_seg2[all_numeric_variables]
y_test_seg2 = df_test_seg2[target]
fitted_model_seg2 = method.fit(X_train_seg2, y_train_seg2)

y_pred_seg2_proba = fitted_model_seg2.predict_proba(X_test_seg2)[:,1]
y_pred_seg2_fullmodel_proba = fitted_full_model.predict_proba(X_test_seg2)[:,1]

print("Segment1: SEX in ('F') [GINI Full Model: {:.4f}% / GINI Segmented Model: {:.4f}%]".format(
    GINI(y_test_seg2, y_pred_seg2_proba)*100,
    GINI(y_test_seg2, y_pred_seg2_fullmodel_proba)*100
))   

# Execution Summary Repport

&emsp;
BUCKET is the target variable and was not analyzed separetly.

__Missing Not At Random Repport__ -  REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID variables seem Missing Not at Random, there for we recommend:

&emsp;  Thin File Segment Variables: PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, 
ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, SEX, 
CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

&emsp; Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, 
OUTSTANDING, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

__Variable by Variable Risk Based Segmentation Analysis__:

&emsp; REPORTING_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; ACCOUNT_NUMBER Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CUSTOMER_ID Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; PROGRAM_NAME Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; LOAN_OPEN_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; EXPECTED_CLOSE_DATE Good for segmentation.  

&emsp; &emsp; Segment1: EXPECTED_CLOSE_DATE < '22/07/2021'  [GINI Full Model: 32.1234% / GINI Segmented Model: 33.4342%]

&emsp; &emsp;  Segment2: EXPECTED_CLOSE_DATE >= '22/07/2021' [GINI Full Model: 63.7523% / GINI Segmented Model: 68.8342%]

&emsp; ORIGINAL_BOOKED_AMOUNT Good for segmentation.  

&emsp; &emsp; Segment1: ORIGINAL_BOOKED_AMOUNT < 90000 [GINI Full Model: 32.3243% / GINI Segmented Model: 33.9833%]

&emsp; &emsp; Segment2: ORIGINAL_BOOKED_AMOUNT >= 90000 [GINI Full Model: 63.3449% / GINI Segmented Model: 68.9438%]

&emsp; OUTSTANDING Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; SEX Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CUSTOMER_OPEN_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CUSTOMER_OPEN_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; BIRTH_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; PROFESSION Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CAR_TYPE Good for segmentation.  

&emsp; &emsp; Segment1: CAR_TYPE in (BMW', 'BYD', 'CARRY', 'Changan', 'CHEVROLET', 'Gelory', 'GELY', 'HYUNDAI') [GINI Full Model: 35.3492% / GINI Segmented Model: 37.3943%]

&emsp; &emsp; Segment2: CAR_TYPE in ('Jack', 'KIA', 'MERCEDES', 'MITSUBISHI', 'NISSAN', 'RENAULT', 'SEAT', 'SKODA', 'SUZUKI') [GINI Full Model: 42.4324% / GINI Segmented Model: 49.4393%]