
#### 1. Define the problem

A credit scoring algorithm estimates the probability of default, and banks use it to decide whether a loan should be granted. The goal here is to improve on the current state of credit scoring by predicting the probability that a borrower will experience financial distress within the next two years.

In [None]:
#Data collation and analysis
import pandas as pd
import numpy as np

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Machine learning
import time
import os
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### 2.1 Read data

In [None]:
train_df = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv')
test_df = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv')
combine=[train_df, test_df]
train_df.head()

### 2.2 Observe the data

#### 2.2.1 info()

In [None]:
train_df.info()

Observations:
- "MonthlyIncome" and "NumberOfDependents" contain null values, which must be handled during data cleaning.

In [None]:
test_df.info()

#### 2.2.2 describe()

In [None]:
# describe() summarizes the numeric columns. Every column here is numeric,
# so describe(include=['O']) for non-numeric columns is not needed.
train_df.describe()

Observations:
- "NumberOfDependents": more than 50% of people have no dependents and the values are widely dispersed, so the mode is used to fill the null values.
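To make the choice of fill value concrete, here is a minimal sketch on made-up toy values (not taken from the dataset). It compares filling with the mode against filling with the mean; the variable names are hypothetical.

```python
import pandas as pd

# Toy column (made-up values): most entries are 0, two are missing
s = pd.Series([0, 0, 0, 1, 2, None, None], name="NumberOfDependents")

# The mode is the most frequent value, so it is always a value that occurs in the data
mode_value = s.mode()[0]
filled_mode = s.fillna(mode_value)

# The mean is pulled upward by the few larger values and need not be a whole number
mean_value = s.mean()
filled_mean = s.fillna(mean_value)

print("mode fill:", mode_value)
print("mean fill:", mean_value)
```

For a count such as the number of dependents, the mode keeps the filled values plausible, while the mean can introduce fractional counts.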

#### 2.2.3 corr() Find association

In [None]:
#Find associations (also often used when cleaning data later, to compare the effect)
corr_matrix = train_df.corr()
print(corr_matrix["SeriousDlqin2yrs"].sort_values(ascending=False))

# The following code graphically shows the correlation between each feature
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(corr_matrix,xticklabels=corr_matrix.columns,yticklabels=corr_matrix.columns,annot=True)

Check how strongly each feature correlates with SeriousDlqin2yrs (the target, where 1 marks a borrower who became seriously delinquent).
- Among the remaining features, when several are correlated with each other above 0.6, one of them is enough for modeling. Here, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse are that strongly correlated, so only one of them needs to be used.
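The 0.6 rule above can also be checked programmatically instead of reading the heatmap by eye. The following is a rough sketch on made-up toy data; the helper name `correlated_pairs` and the toy columns are hypothetical, and the threshold is simply the 0.6 mentioned above.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.6):
    """Return (feature_a, feature_b, correlation) for pairs with |corr| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data (made up): "b" is nearly a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, the same helper could be called as `correlated_pairs(train_df.drop(columns=["SeriousDlqin2yrs"]))` to list the candidate features to merge or drop.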

### 2.3 Data cleaning

#### 2.3.1 Null value processing

In [None]:
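for data in combine:
  # Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas
  data["MonthlyIncome"] = data["MonthlyIncome"].fillna(data["MonthlyIncome"].mean())
  # Caution: this fills with the mode of MonthlyIncome, not of NumberOfDependents.
  # After the mean fill above, that mode is most likely the mean income, which section 2.3.2 treats as an outlier.
  # data["NumberOfDependents"].mode()[0] would fill with the column's own mode directly.
  data["NumberOfDependents"] = data["NumberOfDependents"].fillna(data["MonthlyIncome"].mode()[0])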

#Check the replaced data and confirm that there is no empty value
train_df.info()

#### 2.3.2 Outlier handling

NumberOfDependents

In [None]:
#value_counts() shows a value near 6,670 in about 2.6% of rows. That cannot be a real number of dependents;
#it appears to be the fill value written by the null handling in 2.3.1, so it is treated as an outlier below.
train_df.NumberOfDependents.value_counts()

In [None]:
#Check the correlation between the number of dependents and the target before fixing the outliers,
#so the effect can be compared afterwards. The correlation before processing is -0.013881.
#(All correlations are listed so the other features can be checked at any time.)
corr_matrix = train_df.corr()
corr_matrix["SeriousDlqin2yrs"].sort_values(ascending=False)

In [None]:
for data in combine:
  # Use .loc for the assignment; chained indexing may write to a temporary copy
  data.loc[data["NumberOfDependents"] > 30, "NumberOfDependents"] = 0
  
corr_matrix = train_df.corr()
corr_matrix["SeriousDlqin2yrs"].sort_values(ascending=False)

#After modifying the abnormal value, the correlation of "NumberOfDependents" reached 0.046869

age

In [None]:
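# Keep only rows with age above 18; .copy() avoids SettingWithCopyWarning when columns are added later
train_df = train_df[train_df["age"] > 18].copy()
# Caution: any test rows removed here will also be missing from the submission
test_df = test_df[test_df["age"] > 18].copy()

combine = [train_df, test_df]
# Confirm the filter worked: this should return an empty frame
train_df[train_df["age"] <= 18]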

#### 2.3.3 Create new features

In [None]:
for data in combine:
  # 1 if the borrower was past due in any of the three buckets at least once, else 0
  data["CombinedDefaulted"] = data["NumberOfTimes90DaysLate"] + data["NumberOfTime60-89DaysPastDueNotWorse"] + data["NumberOfTime30-59DaysPastDueNotWorse"]
  data.loc[(data["CombinedDefaulted"] >= 1), "CombinedDefaulted"] = 1

  # 1 if the borrower has more than 5 credit lines and real estate loans combined, else 0
  data["CombinedCreditLoans"] = data["NumberOfOpenCreditLinesAndLoans"] + data["NumberRealEstateLoansOrLines"]
  data.loc[(data["CombinedCreditLoans"] <= 5), "CombinedCreditLoans"] = 0
  data.loc[(data["CombinedCreditLoans"] > 5), "CombinedCreditLoans"] = 1


In [None]:
train_df.corr()["SeriousDlqin2yrs"][["CombinedDefaulted", "CombinedCreditLoans"]]

### 3 Models and predictions

The training set needs to be split again. The test set loaded earlier is reserved for the final submission and has no target values; after you submit to Kaggle, it returns an AUC score, which effectively measures generalization ability. Before that, we need to evaluate the current model ourselves.
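As a rough illustration of evaluating a model locally before submitting, the sketch below holds out part of a labeled set and scores it with AUC. It uses synthetic data from `make_classification` rather than the competition files, and the split ratio and class weights are assumptions chosen only for the example; this notebook itself relies on cross-validation for the same purpose.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the labeled training data (not the real dataset)
X, y = make_classification(n_samples=2000, n_features=6,
                           weights=[0.93, 0.07], random_state=0)

# Hold out 25% for validation; stratify keeps the class ratio the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=15, max_depth=6, random_state=0)
model.fit(X_tr, y_tr)

# AUC is computed from the predicted probability of the positive class
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation AUC: {val_auc:.4f}")
```

A single hold-out split is quick but noisier than cross-validation, which averages the score over several splits.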

In [None]:
# Training columns: the target plus the selected features
attributes = ["SeriousDlqin2yrs", 'age', 'NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfDependents', 'MonthlyIncome', "CombinedDefaulted", "CombinedCreditLoans"]
sol = ['SeriousDlqin2yrs']

# The test set has no target; "Unnamed: 0" is kept as the row Id for the submission
attributes2 = ["Unnamed: 0", 'age', 'NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfDependents', 'MonthlyIncome', "CombinedDefaulted", "CombinedCreditLoans"]

train_df = train_df[attributes]
test_df = test_df[attributes2]


#### 3.1 Logistic regression

For quick testing, I wrote a class

In [None]:
class Tester():
    def __init__(self, target):
        self.target = target
        self.datasets = {}
        self.models = {}
        self.cache = {} # We added a simple cache to speed up

    def addDataset(self, name, df):
        self.datasets[name] = df.copy()

    def addModel(self, name, model):
        self.models[name] = model
        
    def clearModels(self):
        self.models = {}

    def clearCache(self):
        self.cache = {}
    
    def testModelWithDataset(self, m_name, df_name, sample_len, cv):
        if (m_name, df_name, sample_len, cv) in self.cache:
            return self.cache[(m_name, df_name, sample_len, cv)]

        clf = self.models[m_name]
        
        if not sample_len:
            sample = self.datasets[df_name]
        else:
            sample = self.datasets[df_name].sample(sample_len)
            
        X = sample.drop([self.target], axis=1)
        Y = sample[self.target]

        s = cross_validate(clf, X, Y, scoring=['roc_auc'], cv=cv, n_jobs=-1)
        self.cache[(m_name, df_name, sample_len, cv)] = s

        return s

    def runTests(self, sample_len=80000, cv=4):
        # Test the added model on all added data sets
        scores = {}
        for m_name in self.models:
            for df_name in self.datasets:
                # print('Testing %s' % str((m_name, df_name)), end='')
                start = time.time()

                score = self.testModelWithDataset(m_name, df_name, sample_len, cv)
                scores[(m_name, df_name)] = score
                
                end = time.time()
                
                # print(' -- %0.2fs ' % (end - start))

        print('--- Top 10 Results ---')
        for score in sorted(scores.items(), key=lambda x: -1 * x[1]['test_roc_auc'].mean())[:10]:
            auc = score[1]['test_roc_auc']
            print("%s --> AUC: %0.4f (+/- %0.4f)" % (str(score[0]), auc.mean(), auc.std()))
    

In [None]:
# One Tester instance is reused for all models
tester = Tester('SeriousDlqin2yrs')

# Add data set
tester.addDataset('Drop Missing', train_df.dropna())

# Add model
rfc = RandomForestClassifier(n_estimators=15, max_depth = 6, random_state=0)
log = LogisticRegression()
tester.addModel('Simple Random Forest', rfc)
tester.addModel('Simple Logistic Regression', log)

# test
tester.runTests()

In [None]:
X_train = train_df.drop(['SeriousDlqin2yrs'], axis=1)
Y_train = train_df['SeriousDlqin2yrs']

X_test = test_df.drop(["Unnamed: 0"], axis=1)
rfc.fit(X_train, Y_train)
Y_pred = rfc.predict_proba(X_test)


In [None]:
submission = pd.DataFrame({
        "Id": test_df["Unnamed: 0"],
        # Use the NumPy column directly; wrapping Y_pred in a DataFrame aligns on index and can produce NaN
        "Probability": Y_pred[:, 1]
    })

submission.to_csv('submission.csv', index=False)