## <center>Course project<a class="anchor" id="course_project"></a></center>

### Problem statement<a class="anchor" id="course_project_task"></a>

**Task**

Using the available data on bank customers, build a model that predicts default on the current loan. Produce predictions for the examples in the test dataset.

**Data files**

course_project_train.csv - training dataset<br>
course_project_test.csv - test dataset

**Target variable**

Credit Default - whether the borrower failed to meet the loan obligations

**Quality metric**

F1-score (sklearn.metrics.f1_score)

**Solution requirements**

*Target metric*
* F1($\beta$ = 1) > 0.5 with Precision > 0.5 and Recall > 0.5
* The metric is evaluated on the predictions for the main class (1 - overdue loan)
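
A minimal sketch of how the requirement above could be checked for class 1. The labels are made-up toy data, and `precision_recall_f1` and `meets_requirement` are hypothetical helper names, not part of the assignment. The metrics are computed by hand here; `sklearn.metrics.f1_score`, named in the task, computes the same F1 for the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def meets_requirement(y_true, y_pred):
    """True when precision, recall and F1 for class 1 all exceed 0.5."""
    precision, recall, f1 = precision_recall_f1(y_true, y_pred)
    return precision > 0.5 and recall > 0.5 and f1 > 0.5


# Toy labels (assumption): 4 defaults among 10 loans
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1, meets_requirement(y_true, y_pred))
```

A model that always predicts 0 fails this check even though its accuracy on imbalanced data can look acceptable, which is why the requirement is stated for class 1.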

*The solution must include*
1. A Jupyter Notebook with the code of your solution, named following the pattern {FullName}\_solution.ipynb, e.g. SShirkin\_solution.ipynb
2. A CSV file with predictions of the target variable (0 or 1, NOT a probability) for the test dataset, named following the pattern {FullName}\_predictions.csv, e.g. SShirkin\_predictions.csv
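
The predictions file has to contain class labels rather than probabilities. Below is a rough sketch of that conversion. The 0.5 threshold, the `to_class_labels` helper and the single `Credit Default` column are assumptions for illustration; the task statement does not fix the column layout of the file.

```python
import io

import numpy as np
import pandas as pd


def to_class_labels(probabilities, threshold=0.5):
    """Turn class-1 probabilities into 0/1 labels (threshold is an assumption)."""
    return (np.asarray(probabilities) >= threshold).astype(int)


# Made-up probabilities standing in for model.predict_proba(...)[:, 1]
proba = [0.12, 0.81, 0.50, 0.33]
labels = to_class_labels(proba)

predictions = pd.DataFrame({'Credit Default': labels})

# Written to an in-memory buffer here; a real run would pass a file name instead
buffer = io.StringIO()
predictions.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```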

*Recommendations for the code file (ipynb)*
1. The file should contain headings and comments (markdown)
2. Repeated operations are better wrapped in functions
3. Do not print large numbers of table rows (5-10 are enough)
4. Where possible, add plots that describe the data (about 3-5)
5. Include only the best model, i.e. do not keep every attempted variant of the solution in the code
6. The project script must run from start to finish (from loading the data to exporting the predictions)
7. The whole project must be in a single script (ipynb file).
8. Python libraries and machine learning models covered in this course may be used.

**Deadline**

The project must be submitted within 6 days after the last webinar ends (by 20:00 on Sunday).
Scores for projects submitted before the deadline will be published as a ranking ordered by the specified quality metric.
Projects submitted after the deadline, or resubmitted, are not included in the ranking, but you can still find out your result.

### Approximate outline of the course project steps<a class="anchor" id="course_project_steps"></a>

**Building the classification model**
1. Overview of the training dataset
2. Handling outliers
3. Handling missing values
4. Data analysis
5. Feature selection
6. Class balancing
7. Trying out models, obtaining a baseline
8. Choosing the best model, tuning hyperparameters
9. Quality check, countering overfitting
10. Interpreting the results

**Predicting on the test dataset**
1. Apply the same processing and feature-engineering steps to the test dataset
2. Predict the target variable with the model built on the training dataset
3. Predictions must cover every example in the test dataset (every row)
4. Preserve the original order of the examples in the test dataset
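
One way to keep the test-set steps above consistent with training is to put the preprocessing into a single function and call it on both datasets. The sketch below assumes a hypothetical `preprocess` helper and toy frames with only two columns; the fill value is computed on the training data and reused for the test data, and the row order is left untouched.

```python
import pandas as pd

TERM_MAP = {'Short Term': 0, 'Long Term': 1}


def preprocess(df, bankruptcies_mode):
    """Apply identical cleaning steps to any dataset without reordering rows."""
    out = df.copy()
    out['Bankruptcies'] = out['Bankruptcies'].fillna(bankruptcies_mode)
    out['Term'] = out['Term'].map(TERM_MAP)
    return out


# Toy stand-ins for df_train and df_test (assumption)
train = pd.DataFrame({
    'Bankruptcies': [0.0, 0.0, 1.0, None],
    'Term': ['Short Term', 'Long Term', 'Short Term', 'Short Term'],
})
test = pd.DataFrame({
    'Bankruptcies': [None, 1.0],
    'Term': ['Long Term', 'Short Term'],
})

# Statistics come from the training data only
mode_from_train = train['Bankruptcies'].mode()[0]

train_prepared = preprocess(train, mode_from_train)
test_prepared = preprocess(test, mode_from_train)
print(test_prepared)
```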

### Data overview<a class="anchor" id="course_project_review"></a>

**Dataset description**

* **Home Ownership** - home ownership status
* **Annual Income** - annual income
* **Years in current job** - number of years in the current job
* **Tax Liens** - tax liens
* **Number of Open Accounts** - number of open accounts
* **Years of Credit History** - number of years of credit history
* **Maximum Open Credit** - largest open credit
* **Number of Credit Problems** - number of credit problems
* **Months since last delinquent** - number of months since the last late payment
* **Bankruptcies** - bankruptcies
* **Purpose** - purpose of the loan
* **Term** - loan term
* **Current Loan Amount** - current loan amount
* **Current Credit Balance** - current credit balance
* **Monthly Debt** - monthly debt
* **Credit Score** - credit score
* **Credit Default** - whether the borrower failed to meet the loan obligations (0 - repaid on time, 1 - overdue)

In [1]:
# 1. Core libraries
import numpy as np
import pandas as pd
import pickle   # saving the model

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2. Dataset splitting, cross-validation and hyperparameter search
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# 3. Models and preprocessing
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.svm import SVR
from sklearn.experimental import enable_iterative_imputer  # must come before the IterativeImputer import
from sklearn.impute import IterativeImputer

# 4. Quality metrics
from sklearn.metrics import mean_squared_error as mse, r2_score as r2
from sklearn.metrics import accuracy_score, f1_score

# 5. Displaying external images in the notebook
from IPython.display import Image

**Paths to directories and files**

In [2]:
TRAIN_DATASET_PATH = 'D:/repo/Python-Data-Science-2/course_project_train.csv'
TEST_DATASET_PATH = 'D:/repo/Python-Data-Science-2/course_project_test.csv'

**Loading the training data**

In [3]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_train.head()

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,,0.0,11.0,26.3,685960.0,1.0,,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0


**Fill the missing values of 'Bankruptcies' with the mode**

In [4]:
df_train.loc[pd.isnull(df_train['Bankruptcies']), 'Bankruptcies'] = df_train['Bankruptcies'].mode()[0]

**Convert categories to numbers**

In [5]:
df_train.replace({'Home Ownership':{'Home Mortgage':3, 'Rent':2, 'Own Home':1, 'Have Mortgage':3}, 
                  'Years in current job':{'10+ years':10, '2 years':2, '3 years':3, '< 1 year':0, '5 years':5, 
                                          '1 year':1, '4 years':4, '6 years':6, '7 years':7, '8 years':8, '9 years':9}, 
                  'Purpose':{'debt consolidation':15, 'other':14, 'home improvements':13, 'business loan':12, 'buy a car':11, 
                             'medical bills':10, 'major purchase':9, 'take a trip':8, 'buy house':7, 'small business':6, 
                             'wedding':5, 'moving':4, 'educational expenses':3, 'vacation':2, 'renewable energy':1}, 
                  'Term':{'Short Term':0, 'Long Term':1}}, inplace = True)

### Option 1

The dataset has to be split into two parts manually.

X contains the features used for training.

y holds all known values of 'Years in current job'.

X_nan holds the rows where 'Years in current job' is missing. Check the sizes of all three.

In [6]:
# Columns that still contain missing values, plus the target of this imputation step
drop_cols = ['Annual Income', 'Months since last delinquent', 'Years in current job', 'Credit Score']
job_is_missing = df_train['Years in current job'].isna()

# Rows with a known 'Years in current job' are used for training
X = df_train.loc[~job_is_missing].drop(columns=drop_cols)

y = df_train.loc[~job_is_missing, 'Years in current job'].astype(int)

# Rows with a missing 'Years in current job' are the ones to predict
X_nan = df_train.loc[job_is_missing].drop(columns=drop_cols)

In [8]:
df_y = pd.DataFrame(y)
df_y.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7129 entries, 1 to 7499
Data columns (total 1 columns):
Years in current job    7129 non-null int32
dtypes: int32(1)
memory usage: 83.5 KB


In [9]:
df_y.isna().sum()

Years in current job    0
dtype: int64

In [10]:
df_y.shape

(7129, 1)

Scale the features

In [7]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

# Reuse the statistics fitted on X; refitting on X_nan would put the two sets on different scales
X_nan = scaler.transform(X_nan)

After scaling, train the model and predict the missing values,

then write them back in place of the gaps.

In [8]:
knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X, y)

y_pred = knn.predict(X_nan)

df_train.loc[df_train['Years in current job'].isna(), 'Years in current job'] = y_pred

In [10]:
pd.DataFrame(y_pred).to_csv('y_pred_2021-05-06.csv', index = False)

In [9]:
df_train.isna().sum()

Home Ownership                     0
Annual Income                   1557
Years in current job               0
Tax Liens                          0
Number of Open Accounts            0
Years of Credit History            0
Maximum Open Credit                0
Number of Credit Problems          0
Months since last delinquent    4081
Bankruptcies                       0
Purpose                            0
Term                               0
Current Loan Amount                0
Current Credit Balance             0
Monthly Debt                       0
Credit Score                    1557
Credit Default                     0
dtype: int64
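
The Option 1 steps above (split by missingness, scale, fit KNN, write predictions back) can be gathered into one function. This is only a sketch on made-up data: `impute_with_knn` is a hypothetical helper, and `n_neighbors=3` is chosen to suit the tiny toy frame rather than taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def impute_with_knn(df, target, feature_cols, n_neighbors=3):
    """Fill missing values of `target` with KNN predictions from `feature_cols`."""
    out = df.copy()
    missing = out[target].isna()
    if not missing.any():
        return out

    # Fit the scaler on rows with a known target, then reuse it for the rest
    scaler = StandardScaler()
    X_known = scaler.fit_transform(out.loc[~missing, feature_cols])
    X_missing = scaler.transform(out.loc[missing, feature_cols])

    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_known, out.loc[~missing, target].astype(int))
    out.loc[missing, target] = knn.predict(X_missing)
    return out


# Toy data (assumption): two well-separated groups of borrowers
toy = pd.DataFrame({
    'Monthly Debt': [100.0, 110.0, 120.0, 900.0, 910.0, 920.0, 105.0, 915.0],
    'Number of Open Accounts': [2.0, 3.0, 2.0, 10.0, 11.0, 10.0, 2.0, 11.0],
    'Years in current job': [1.0, 1.0, 1.0, 10.0, 10.0, 10.0, np.nan, np.nan],
})

filled = impute_with_knn(toy, 'Years in current job',
                         ['Monthly Debt', 'Number of Open Accounts'])
print(filled)
```

Working on a copy keeps the original frame intact, so the imputation can be rerun with different settings.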

### Option 2

Use **IterativeImputer()** with its default settings.

Each variable with missing values is modeled as a function of all the other variables. Its gaps are replaced with the values computed by that function.

In [17]:
feature_names = df_train.drop(['Annual Income', 'Months since last delinquent', 'Years in current job', 'Credit Score'], axis=1).columns
X = df_train[feature_names]
y = df_train['Years in current job']
X.isna().sum()[X.isna().sum() != 0]

Series([], dtype: int64)

In [18]:
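imputer = IterativeImputer()

# Assumption: gb_model and cv are not defined in the cells shown here;
# a default gradient boosting regressor and 5-fold CV are used as stand-ins
gb_model = GradientBoostingRegressor()
cv = KFold(n_splits=5, shuffle=True, random_state=42)

X_imp = imputer.fit_transform(X)
scores = cross_val_score(gb_model, X_imp, y, scoring='r2', cv=cv)
print('R2: %.4f' % scores.mean())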



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
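
In the cell above `X` has no missing values, so the imputer has nothing to fill there. A minimal sketch of the approach described for Option 2 keeps the columns with gaps inside the matrix passed to `IterativeImputer`. The toy frame and `random_state=0` are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the import below
from sklearn.impute import IterativeImputer

# Toy data (assumption): two columns with gaps and one complete column
toy = pd.DataFrame({
    'Annual Income': [500.0, 600.0, 700.0, 800.0, np.nan, 1000.0],
    'Monthly Debt': [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    'Years in current job': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# Each column with gaps is regressed on the other columns in turn
imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
print(filled)
```

The imputed values are continuous, so a column such as 'Years in current job' may need rounding before it is used as a category.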

**Overview of quantitative features**

In [16]:
df_train.shape

(7500, 17)

In [10]:
df_train.describe(include='all')

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
count,7129,5647.0,7129,7129.0,7129.0,7129.0,7129.0,7129.0,3243.0,7115.0,7129,7129,7129.0,7129.0,7129.0,5647.0,7129.0
unique,4,,11,,,,,,,,15,2,,,,,
top,Home Mortgage,,10+ years,,,,,,,,debt consolidation,Short Term,,,,,
freq,3472,,2332,,,,,,,,5670,5245,,,,,
mean,,1389593.0,,0.027914,11.177164,18.063712,959766.9,0.15963,34.49152,0.11033,,,11905130.0,291505.5,18549.739094,1153.308482,0.277178
std,,852821.9,,0.264173,4.933924,6.765567,16436650.0,0.480883,21.712328,0.336679,,,31959080.0,317511.4,11976.756991,1607.975737,0.447636
min,,164597.0,,0.0,2.0,4.0,0.0,0.0,0.0,0.0,,,11242.0,0.0,0.0,585.0,0.0
25%,,855627.0,,0.0,8.0,13.5,281710.0,0.0,16.0,0.0,,,184888.0,115577.0,10308.0,711.0,0.0
50%,,1199261.0,,0.0,10.0,16.9,480546.0,0.0,32.0,0.0,,,313302.0,212135.0,16306.0,731.0,0.0
75%,,1670756.0,,0.0,14.0,21.5,795168.0,0.0,50.0,0.0,,,525844.0,362767.0,24030.0,743.0,1.0


**Overview of the target variable**

In [4]:
df_train['Credit Default'].value_counts(normalize=True)

0    0.718267
1    0.281733
Name: Credit Default, dtype: float64
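
With class shares of roughly 72% and 28%, a plain random split can shift the share of defaults between the training and validation parts. The sketch below, on synthetic labels with the same proportions, uses the `stratify` argument of `train_test_split` to keep the shares equal; the single dummy feature and `test_size=0.25` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target (assumption): 72 non-defaults and 28 defaults
y = pd.Series([0] * 72 + [1] * 28, name='Credit Default')
X = pd.DataFrame({'feature': range(100)})

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Share of class 1 in each part
print(y_train.mean(), y_valid.mean())
```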

### Type casting<a class="anchor" id="cast"></a>

In [5]:
df_train.dtypes

Home Ownership                   object
Annual Income                   float64
Years in current job             object
Tax Liens                       float64
Number of Open Accounts         float64
Years of Credit History         float64
Maximum Open Credit             float64
Number of Credit Problems       float64
Months since last delinquent    float64
Bankruptcies                    float64
Purpose                          object
Term                             object
Current Loan Amount             float64
Current Credit Balance          float64
Monthly Debt                    float64
Credit Score                    float64
Credit Default                    int64
dtype: object

**Overview of nominal/categorical features**

In [14]:
#for cat_colname in df_train.columns:
for cat_colname in df_train.select_dtypes(include='object').columns:
    print(str(cat_colname) + '\n\n' + str(df_train[cat_colname].value_counts()) + '\n' + '*' * 100 + '\n')

Home Ownership

Home Mortgage    3637
Rent             3204
Own Home          647
Have Mortgage      12
Name: Home Ownership, dtype: int64
****************************************************************************************************

Years in current job

10+ years    2332
2 years       705
3 years       620
< 1 year      563
5 years       516
1 year        504
4 years       469
6 years       426
7 years       396
8 years       339
9 years       259
Name: Years in current job, dtype: int64
****************************************************************************************************

Purpose

debt consolidation      5944
other                    665
home improvements        412
business loan            129
buy a car                 96
medical bills             71
major purchase            40
take a trip               37
buy house                 34
small business            26
wedding                   15
moving                    11
educational expenses      10
vacation  

### Handling missing values<a class="anchor" id="gaps"></a>

In [15]:
df_train.isna().sum()

Home Ownership                     0
Annual Income                   1557
Years in current job             371
Tax Liens                          0
Number of Open Accounts            0
Years of Credit History            0
Maximum Open Credit                0
Number of Credit Problems          0
Months since last delinquent    4081
Bankruptcies                      14
Purpose                            0
Term                               0
Current Loan Amount                0
Current Credit Balance             0
Monthly Debt                       0
Credit Score                    1557
Credit Default                     0
dtype: int64

### Handling outliers<a class="anchor" id="outliers"></a>

**Correct the outliers in 'Credit Score'**

Fix typos (remove the extra zero in the last digit)

In [6]:
#df_train.loc[df_train['Credit Score'] > 999, 'Credit Score'] = df_train['Credit Score']/10
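
The commented-out line above can be tried on a small frame first. This sketch assumes, as that line does, that any score above 999 carries one extra trailing zero; the sample values are made up, apart from 7260.0, which matches a value visible in the test data preview.

```python
import pandas as pd

scores = pd.DataFrame({'Credit Score': [749.0, 7260.0, 694.0, 7190.0, None]})

# Rows where the score looks ten times too large (missing values compare as False)
too_large = scores['Credit Score'] > 999

# Divide only the affected rows; the rest, including NaN, stay as they are
scores.loc[too_large, 'Credit Score'] = scores.loc[too_large, 'Credit Score'] / 10
print(scores)
```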

**Split the X and y dataframes into training and validation sets**

In [6]:
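# 'Years in current job' is the target here, so it is dropped from the features as well
X_train, X_valid, y_train, y_valid = train_test_split(
    df_train.drop(['Annual Income', 'Months since last delinquent', 'Credit Score', 'Years in current job'], axis = 'columns'),
    df_train['Years in current job'], test_size = 0.2, random_state = 42)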

In [10]:
y_train = pd.DataFrame(y_train)
y_valid = pd.DataFrame(y_valid)

In [10]:
type(y_train)#, y_valid

pandas.core.series.Series

### Normalizing the dataset<a class="anchor" id="outliers"></a>

In [7]:
cols_for_scale = ['Number of Open Accounts', 'Years of Credit History', 'Maximum Open Credit', 
                  'Bankruptcies', 'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt']

In [8]:
scaler = RobustScaler()

In [22]:
#scaler = MinMaxScaler()

In [9]:
X_train[cols_for_scale] = scaler.fit_transform(X_train[cols_for_scale])

In [10]:
X_valid[cols_for_scale] = scaler.transform(X_valid[cols_for_scale])

### Classification with KNN<a class="anchor" id="outliers"></a>

In [16]:
knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_valid)

ValueError: Unknown label type: 'continuous'
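
scikit-learn classifiers raise this error when the target holds non-integer float values. One likely cause here is a target that was rescaled to fractions such as 0.4. The sketch below reproduces the error on toy data and shows one possible fix, converting the target back to integer labels; the factor of 10 is an assumption about how the values were scaled.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_scaled = np.array([0.2, 0.2, 0.2, 0.9, 0.9, 0.9])  # fractional targets (toy data)

# Fitting a classifier on fractional targets fails
try:
    KNeighborsClassifier(n_neighbors=3).fit(X, y_scaled)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Converting to integer class labels lets the classifier fit
y_labels = np.round(y_scaled * 10).astype(int)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y_labels)
pred = knn.predict([[0.05], [1.05]])
print(pred)
```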

In [15]:
X_train#.isna().sum()

Unnamed: 0,Home Ownership,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Default
6399,0.33,0.0,1.0,0.166667,0.358025,-0.098071,1.0,0.0,0.00,0,294.957851,-0.037041,0.630223,0
442,0.33,0.4,0.0,0.333333,0.074074,0.598671,0.0,0.0,0.00,0,-0.143861,0.479627,0.134948,0
6573,0.33,1.0,0.0,0.333333,0.012346,0.006687,0.0,0.0,0.00,0,0.726338,0.265704,-0.384911,0
583,0.00,0.3,0.0,0.333333,0.296296,0.285727,0.0,0.0,0.00,0,-0.341948,0.179478,0.712978,0
1467,0.00,0.2,0.0,-0.166667,1.469136,-0.432662,1.0,1.0,0.14,0,0.228681,-0.740062,-0.654141,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3978,0.00,0.4,0.0,0.333333,2.555556,-0.435577,0.0,0.0,0.00,0,-0.308619,-0.205522,0.259191,0
5475,0.00,0.9,0.0,0.500000,0.567901,0.470424,0.0,0.0,0.00,1,0.328473,-0.397984,-0.116713,1
5511,0.00,0.0,0.0,0.333333,0.938272,-0.310759,0.0,0.0,0.00,0,-0.189884,0.130599,0.354053,0
5679,0.33,0.0,0.0,0.166667,0.506173,-0.492542,0.0,0.0,0.00,0,-0.617302,-0.185054,0.457257,0


In [11]:
from sklearn.metrics import accuracy_score

k_vals = np.arange(2,10)

accuracy_valid = []
accuracy_train = []

for val in k_vals:
    knn = KNeighborsClassifier(n_neighbors=val)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_valid)
    # Training accuracy needs predictions for X_train, not X_valid
    y_pred_train = knn.predict(X_train)
    acc_valid = accuracy_score(y_valid, y_pred)
    acc_train = accuracy_score(y_train, y_pred_train)
    accuracy_valid.append(acc_valid)
    accuracy_train.append(acc_train)
    print('n_neighbors = {} \n\t acc_valid = {} \n\t acc_train = {}\n'.format(val, acc_valid, acc_train))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


### Loading the test data<a class="anchor" id="gaps"></a>

In [38]:
df_test = pd.read_csv(TEST_DATASET_PATH)
df_test.head()

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score
0,Rent,,4 years,0.0,9.0,12.5,220968.0,0.0,70.0,0.0,debt consolidation,Short Term,162470.0,105906.0,6813.0,
1,Rent,231838.0,1 year,0.0,6.0,32.7,55946.0,0.0,8.0,0.0,educational expenses,Short Term,78298.0,46037.0,2318.0,699.0
2,Home Mortgage,1152540.0,3 years,0.0,10.0,13.7,204600.0,0.0,,0.0,debt consolidation,Short Term,200178.0,146490.0,18729.0,7260.0
3,Home Mortgage,1220313.0,10+ years,0.0,16.0,17.0,456302.0,0.0,70.0,0.0,debt consolidation,Short Term,217382.0,213199.0,27559.0,739.0
4,Home Mortgage,2340952.0,6 years,0.0,11.0,23.6,1207272.0,0.0,,0.0,debt consolidation,Long Term,777634.0,425391.0,42605.0,706.0


In [36]:
df_test.shape

(2500, 16)

### Homework

1. Give two examples of cases where it is better to maximize Precision and two where it is better to maximize Recall.

2. Why do we use the F-measure? Why, for example, can't we simply take the average of Precision and Recall?