In [1]:
'''Data Set Information:

This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability 
of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the 
estimated probability of default will be more valuable than the binary result of classification - credible or not credible 
clients. Because the real probability of default is unknown, this study presented the novel "Sorting Smoothing Method" 
to estimate the real probability of default. With the real probability of default as the response variable (Y), and the 
predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that 
the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression 
intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, 
artificial neural network is the only one that can accurately estimate the real probability of default.


Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. 
This study reviewed the literature and used the following 23 variables as explanatory variables:
X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) 
    credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;
X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 
1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 
9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005;
    X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 
2005; . . .;X23 = amount paid in April, 2005.'''

In [1]:
import pandas as pd
import numpy as np

In [2]:
import os
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler



In [3]:
pwd

'c:\\Users\\tusha\\Documents\\Data Science Course\\ML Projects\\credit card MLProject\\notebook\\data'

In [4]:
df=pd.read_excel('C:\\Users\\tusha\\Documents\\Data Science Course\\ML\\ML Day 5\\default of credit card clients.xls')

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [6]:
df.isnull().sum()

Unnamed: 0    0
X1            0
X2            0
X3            0
X4            0
X5            0
X6            0
X7            0
X8            0
X9            0
X10           0
X11           0
X12           0
X13           0
X14           0
X15           0
X16           0
X17           0
X18           0
X19           0
X20           0
X21           0
X22           0
X23           0
Y             0
dtype: int64

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  30001 non-null  object
 1   X1          30001 non-null  object
 2   X2          30001 non-null  object
 3   X3          30001 non-null  object
 4   X4          30001 non-null  object
 5   X5          30001 non-null  object
 6   X6          30001 non-null  object
 7   X7          30001 non-null  object
 8   X8          30001 non-null  object
 9   X9          30001 non-null  object
 10  X10         30001 non-null  object
 11  X11         30001 non-null  object
 12  X12         30001 non-null  object
 13  X13         30001 non-null  object
 14  X14         30001 non-null  object
 15  X15         30001 non-null  object
 16  X16         30001 non-null  object
 17  X17         30001 non-null  object
 18  X18         30001 non-null  object
 19  X19         30001 non-null  object
 20  X20   

In [8]:
df.columns

Index(['Unnamed: 0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9',
       'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19',
       'X20', 'X21', 'X22', 'X23', 'Y'],
      dtype='object')

In [9]:
df['X3'].unique()

array(['EDUCATION', 2, 1, 3, 5, 4, 6, 0], dtype=object)

In [10]:
df['X2'].unique()

array(['SEX', 2, 1], dtype=object)
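
The strings 'EDUCATION' and 'SEX' among the unique values suggest that the spreadsheet's second header row was read as data, which would also explain why every column has dtype object. Below is a minimal sketch of one way to promote that row to column names on an already-loaded frame. The helper name promote_first_row_to_header and the toy frame are illustrative assumptions, not part of this notebook. Passing header=1 to pd.read_excel may achieve the same thing at load time.

```python
import pandas as pd


def promote_first_row_to_header(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as column names and drop it from the body."""
    out = raw.iloc[1:].copy()
    out.columns = [str(name) for name in raw.iloc[0]]
    out = out.reset_index(drop=True)
    # The remaining cells are numeric, so convert each column from object.
    return out.apply(pd.to_numeric)


# Toy frame (hypothetical) that mimics the double header seen above.
toy = pd.DataFrame({
    "X1": ["LIMIT_BAL", 20000, 120000],
    "X2": ["SEX", 2, 2],
    "Y": ["default payment next month", 1, 1],
})
clean = promote_first_row_to_header(toy)
print(clean.dtypes)
```

With descriptive names such as LIMIT_BAL in place, the later cells would need to refer to those names instead of X1 ... X23.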

In [11]:
df.rename(columns={'Unnamed: 0':'Customer_ID'},inplace=True)

In [12]:
df.head()

Unnamed: 0,Customer_ID,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [13]:
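# Drop the embedded header row; copy so later column assignments do not act on a view of df
data = df.iloc[1:].copy()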

In [14]:
data.head()

Unnamed: 0,Customer_ID,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [15]:
data=data.astype(int)

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 1 to 30000
Data columns (total 25 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Customer_ID  30000 non-null  int32
 1   X1           30000 non-null  int32
 2   X2           30000 non-null  int32
 3   X3           30000 non-null  int32
 4   X4           30000 non-null  int32
 5   X5           30000 non-null  int32
 6   X6           30000 non-null  int32
 7   X7           30000 non-null  int32
 8   X8           30000 non-null  int32
 9   X9           30000 non-null  int32
 10  X10          30000 non-null  int32
 11  X11          30000 non-null  int32
 12  X12          30000 non-null  int32
 13  X13          30000 non-null  int32
 14  X14          30000 non-null  int32
 15  X15          30000 non-null  int32
 16  X16          30000 non-null  int32
 17  X17          30000 non-null  int32
 18  X18          30000 non-null  int32
 19  X19          30000 non-null  int32
 20  X20   

In [17]:
data['X3']=data['X3'].map({1:1,2:2,3:3,4:4,5:4,6:4,0:4})

In [18]:
data['X3'].unique()
#X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)

array([2, 1, 3, 4], dtype=int64)

In [19]:
data['X2'].unique()
#X2: Gender (1 = male; 2 = female).

array([2, 1])

In [20]:
data['X4'].value_counts()

X4
2    15964
1    13659
3      323
0       54
Name: count, dtype: int64

In [21]:
# Code 0 is undocumented, so fold it into 3 (others) rather than 1 (married)
data['X4']=data['X4'].map({1:1,2:2,3:3,0:3})

In [22]:
data['X4'].unique()
#X4: Marital status (1 = married; 2 = single; 3 = others)

array([1, 2, 3], dtype=int64)
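
Series.map with a dict returns NaN for any code that is missing from the dict, so every undocumented value has to be listed by hand. A minimal sketch of an alternative that folds anything outside the documented codes into the "others" bucket is shown below. The helper name collapse_unknown_codes and the toy series are illustrative assumptions.

```python
import pandas as pd


def collapse_unknown_codes(codes: pd.Series, valid: list, other: int) -> pd.Series:
    """Keep documented codes as they are and replace everything else with `other`."""
    return codes.where(codes.isin(valid), other)


# Toy education codes (hypothetical), including the undocumented 0, 5 and 6.
education = pd.Series([1, 2, 3, 4, 5, 6, 0])
cleaned = collapse_unknown_codes(education, valid=[1, 2, 3, 4], other=4)
print(cleaned.tolist())
```

The same helper would cover X4 with valid=[1, 2, 3] and other=3.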

In [23]:
data.head(1)

Unnamed: 0,Customer_ID,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1


In [24]:
data.isnull().sum()

Customer_ID    0
X1             0
X2             0
X3             0
X4             0
X5             0
X6             0
X7             0
X8             0
X9             0
X10            0
X11            0
X12            0
X13            0
X14            0
X15            0
X16            0
X17            0
X18            0
X19            0
X20            0
X21            0
X22            0
X23            0
Y              0
dtype: int64

In [25]:
# index=False keeps the row index out of the file, so it cannot come back as an extra column
data.to_csv('credit_card.csv',index=False)
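
Whether the index is written matters once the file is read back: an unnamed index is stored with an empty header and returns as a column called "Unnamed: 0". A small round-trip sketch on a toy frame (hypothetical data, in-memory buffers instead of files) shows the difference.

```python
import io

import pandas as pd

# Toy frame (hypothetical) with a 1-based index like the notebook's `data`.
frame = pd.DataFrame({"Customer_ID": [1, 2], "Y": [1, 0]}, index=[1, 2])

# Round trip with the index written out.
with_index = io.StringIO()
frame.to_csv(with_index, index=True)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# Round trip without the index.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```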

In [25]:
X = data.drop(labels=['Y'],axis=1)
Y= data[['Y']]

In [26]:
X

Unnamed: 0,Customer_ID,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
1,1,20000,2,2,1,24,2,2,-1,-1,...,689,0,0,0,0,689,0,0,0,0
2,2,120000,2,2,2,26,-1,2,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
3,3,90000,2,2,2,34,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
4,4,50000,2,2,1,37,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
5,5,50000,1,2,1,57,-1,0,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,29996,220000,1,3,1,39,0,0,0,0,...,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000
29997,29997,150000,1,3,2,43,-1,-1,-1,-1,...,3502,8979,5190,0,1837,3526,8998,129,0,0
29998,29998,30000,1,2,2,37,4,3,2,-1,...,2758,20878,20582,19357,0,0,22000,4200,2000,3100
29999,29999,80000,1,3,1,41,1,-1,0,0,...,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804


In [27]:
Y

Unnamed: 0,Y
1,1
2,1
3,0
4,0
5,0
...,...
29996,0
29997,0
29998,1
29999,1


In [91]:
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.9,random_state=42)

In [92]:
y_train.value_counts()

Y
0    2324
1     676
Name: count, dtype: int64
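
With roughly one defaulter in five, a purely random split can shift the class ratio between train and test. A minimal sketch of a stratified split on synthetic data follows. The arrays X_demo and y_demo and the 0.2 test fraction are assumptions for illustration, not the notebook's data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 22% positives, similar to the default rate.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 220 + [0] * 780)

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```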

In [96]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [97]:
pwd

'c:\\Users\\tusha\\Documents\\Data Science Course\\ML Projects\\credit card MLProject\\notebook\\data'

In [185]:
train_df= pd.read_csv('c:\\Users\\tusha\\Documents\\Data Science Course\\ML Projects\\credit card MLProject\\artifacts\\train.csv')
test_df= pd.read_csv('c:\\Users\\tusha\\Documents\\Data Science Course\\ML Projects\\credit card MLProject\\artifacts\\test.csv')


In [186]:
# Separate the features from the target; columns= already implies axis=1
input_feature_train_df = train_df.drop(columns=['Y'])
target_feature_train_df = train_df[['Y']]

input_feature_test_df = test_df.drop(columns=['Y'])
target_feature_test_df = test_df[['Y']]

In [187]:
input_feature_train_arr=scaler.fit_transform(input_feature_train_df)
input_feature_test_arr=scaler.transform(input_feature_test_df)

In [188]:
train_arr = np.c_[input_feature_train_arr, np.array(target_feature_train_df)]
test_arr = np.c_[input_feature_test_arr, np.array(target_feature_test_df)]

In [189]:
train_arr.shape, test_arr.shape

((9000, 26), (21000, 26))

In [190]:
train_arr

array([[ 0.5938067 ,  0.5938067 , -1.13831215, ..., -0.24370598,
         0.15359641,  1.        ],
       [-1.66598644, -1.66598644, -0.75725278, ..., -0.13373262,
         0.0105827 ,  0.        ],
       [-0.11064175, -0.11064175,  0.38592535, ..., -0.16628473,
         6.67233326,  0.        ],
       ...,
       [-0.24959418, -0.24959418, -0.6810409 , ...,  4.56439216,
        -0.07522553,  0.        ],
       [-0.87620621, -0.87620621,  1.91016286, ...,  1.98617671,
         3.6648692 ,  0.        ],
       [-1.69354629, -1.69354629, -0.90967653, ..., -0.28668985,
        -0.28597055,  0.        ]])
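
The first two columns of the scaled array are identical in every visible row, which suggests that a saved row index and Customer_ID both ended up among the features. Identifiers carry no predictive signal, so one option is to drop them before scaling. The sketch below assumes the columns are named "Unnamed: 0" and "Customer_ID"; the actual names in train.csv are not shown in this notebook, and the helper name is hypothetical.

```python
import pandas as pd


def drop_identifier_columns(frame: pd.DataFrame,
                            id_like=("Unnamed: 0", "Customer_ID")) -> pd.DataFrame:
    """Return a copy of `frame` without identifier-like columns that are present."""
    present = [name for name in id_like if name in frame.columns]
    return frame.drop(columns=present)


# Toy frame (hypothetical) with a duplicated identifier, one feature and the target.
toy_train = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "Customer_ID": [1, 2, 3],
    "X1": [20000, 120000, 90000],
    "Y": [1, 1, 0],
})
features_only = drop_identifier_columns(toy_train).drop(columns=["Y"])
print(list(features_only.columns))
```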

In [191]:
test_arr

array([[ 1.27899784,  1.27899784, -0.29998152, ..., -0.3065479 ,
        -0.18654741,  0.        ],
       [-1.48079341, -1.48079341, -0.90967653, ..., -0.25627436,
        -0.24684199,  0.        ],
       [ 0.08642854,  0.08642854, -1.21452403, ..., -0.07447269,
        -0.30404748,  0.        ],
       ...,
       [ 1.70392705,  1.70392705,  3.43440036, ...,  0.14032099,
        -0.19129547,  0.        ],
       [-0.95127511, -0.95127511,  0.00486598, ..., -0.25187543,
        -0.30404748,  0.        ],
       [ 0.58700322,  0.58700322, -0.6810409 , ..., -0.24370598,
        -0.24684199,  0.        ]])

In [192]:
# The last column is the target; everything before it is a feature
X_train, y_train, X_test, y_test = (
    train_arr[:, :-1],
    train_arr[:, -1],
    test_arr[:, :-1],
    test_arr[:, -1],
)

In [193]:
X_train.ndim

2

In [194]:
X_train

array([[ 0.5938067 ,  0.5938067 , -1.13831215, ..., -0.25387395,
        -0.24370598,  0.15359641],
       [-1.66598644, -1.66598644, -0.75725278, ..., -0.13939918,
        -0.13373262,  0.0105827 ],
       [-0.11064175, -0.11064175,  0.38592535, ...,  0.03270547,
        -0.16628473,  6.67233326],
       ...,
       [-0.24959418, -0.24959418, -0.6810409 , ..., -0.11997117,
         4.56439216, -0.07522553],
       [-0.87620621, -0.87620621,  1.91016286, ..., -0.31928811,
         1.98617671,  3.6648692 ],
       [-1.69354629, -1.69354629, -0.90967653, ..., -0.29861724,
        -0.28668985, -0.28597055]])

In [195]:
pd.DataFrame(y_train).value_counts(normalize=True)

0.0    0.778889
1.0    0.221111
Name: proportion, dtype: float64
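
These proportions set the bar for accuracy: a model that always predicts "no default" already scores about 0.78. A minimal sketch of such a baseline using scikit-learn's DummyClassifier on synthetic labels with a similar ratio follows. The arrays here are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels (hypothetical) with roughly the class ratio shown above.
y_demo = np.array([0] * 779 + [1] * 221)
X_demo = np.zeros((len(y_demo), 1))  # the features are ignored by this baseline

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(baseline_accuracy)
```

Any accuracy close to this value should be read as "no better than the majority class" rather than as evidence of a useful model.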

In [196]:
feature_names = ['Customer_ID','X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9',
                'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19',
                'X20', 'X21', 'X22', 'X23', 'Y']

In [219]:
test_model = LogisticRegression(
    max_iter=100000, C=0.001, fit_intercept=True,
    penalty='l1', solver='saga', class_weight='balanced',
)

In [220]:

#model = LogisticRegression(max_iter=100000, C=0.001, penalty=feature_names, solver='liblinear')


In [221]:
test_model.fit(X_train, y_train)

In [222]:
y_pred=test_model.predict(X_test)

In [223]:
test_model.score(X_train,y_train)

0.779

In [224]:
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy_score(y_test,y_pred)

0.7806190476190477
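
GridSearchCV is imported at the top of the notebook but never used. A minimal sketch of how C could be tuned with cross-validation, scoring on F1 instead of accuracy, is shown below. It runs on synthetic data from make_classification; the grid values, the F1 choice, and the pipeline layout are assumptions for illustration, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=42
)

# Scaling inside the pipeline means it is refit on each training fold only.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

best_c = search.best_params_["logisticregression__C"]
best_f1 = search.best_score_
print(best_c, best_f1)
```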

In [217]:
X_test.shape

(21000, 25)

In [218]:
y_train.shape

(9000,)

In [148]:
y_train.ndim

1

In [149]:
#y_train=np.ravel(y_train)

In [150]:
y_train

array([1., 0., 0., ..., 0., 0., 0.])

In [151]:
# Define the model before fitting it; without class_weight it is free to favour the majority class
model = LogisticRegression(max_iter=100000, C=0.001, fit_intercept=True, penalty='l1', solver='saga')


In [155]:
model.fit(X_train,y_train)

In [153]:
model.score(X_train,y_train)

0.7788888888888889

In [154]:
y_pred=model.predict(X_test)

In [140]:
accuracy_score(y_test,y_pred)

0.7787619047619048

In [141]:
confusion_matrix(y_test,y_pred)

array([[16354,     0],
       [ 4646,     0]], dtype=int64)
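
The second column of the matrix is all zeros, so the model never predicts a default, and the accuracy of about 0.779 is simply the share of non-defaulters in the test set. A short sketch that derives accuracy, recall, and precision directly from the matrix above makes this explicit; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted.
cm = np.array([[16354, 0],
               [4646, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)  # share of actual defaulters that were caught
# Precision is undefined when nothing is predicted positive; report 0.0 in that case.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(accuracy, recall, precision)
```

A recall of zero means no defaulter is identified, which is the case that matters most for risk management.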

In [195]:
def evaluate_model(X_train, y_train, X_test, y_test, model):
    """Fit `model` and return ({name: fitted model}, test accuracy)."""
    # Wrap the estimator in a dict so that more candidates can be added later
    models = {type(model).__name__: model}

    for name, estimator in models.items():
        estimator.fit(X_train, y_train)

        # Predict on the held-out test data
        y_test_pred = estimator.predict(X_test)

        # Accuracy on the test data
        test_model_score2 = accuracy_score(y_test, y_test_pred)

    # Return after the loop so that every model in the dict gets fitted
    return models, test_model_score2

In [198]:
report:tuple=evaluate_model(X_train,y_train,X_test,y_test,model)

In [201]:
report[0].values()

dict_values([LogisticRegression(C=0.001, max_iter=100000, penalty='l1', solver='saga')])

In [182]:
type(model)

sklearn.linear_model._logistic.LogisticRegression

In [183]:
report

({'LogisticRegression': LogisticRegression(C=0.001, max_iter=100000, penalty='l1', solver='saga')},
 0.7787619047619048)

In [184]:
# report is a (models, score) tuple, so index it instead of calling .values()
report[1]

0.7787619047619048