Представлен датасет центра приюта животных, и вашей задачей будет обучить модель таким образом, чтобы  по определенным признакам была возможность максимально уверенно предсказать метки 'Adoption' и 'Transfer' (столбец “outcome_type”).

Здесь вы вольны делать что угодно. Я хочу видеть от вас:
1. Проверка наличия/обработка пропусков
2. Проверьте взаимосвязи между признаками
3. Попробуйте создать свои признаки
4. Удалите лишние
5. Обратите внимание на текстовые столбцы. Подумайте, что можно извлечь полезного оттуда
6. Использование профайлера вам поможет.
7. Не забывайте, что у вас есть PCA (Метод главных компонент). Он может пригодиться.

Вспомните о всем, что я говорил на предыдущих занятиях. Не все будет пригодится, но в жизни вам никто не будет говорить, что использовать :)

Хорошим классификатором для этой задачи будет "Случайный лес" (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Понимать суть работы "леса" не обязательно на данном этапе, но качество предсказаний будет выше, чем с линейным классификатором. (если желаете, вот гайд https://adataanalyst.com/scikit-learn/linear-classification-method/)



In [239]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

data = pd.read_csv('aac_shelter_outcomes.csv').iloc[:10000]



In [240]:
data = data[data['outcome_type'].isin(['Adoption', 'Transfer'])]
data['outcome_type']=(data['outcome_type'] == 'Adoption')*1
data['age_upon_outcome'] = (pd.to_datetime(data['datetime'])-pd.to_datetime(data['date_of_birth'])).dt.days
data['animal_type'].value_counts()
data['breed'].value_counts()
data['sex_upon_outcome'].value_counts(dropna=False)
data ['sex_upon_outcome']=data['sex_upon_outcome'].fillna('Unknown')
def split_sex(s):
    if s=='Unknown':
        return pd.Series([np.nan, np.nan], index=['sex', 'is_sterilized'])
    is_stirilized, sex = s.split()
    if is_stirilized =='Spayed' or 'Neutered':
        is_stirilized=1
    else:
        is_stirilized=0
    if sex=='Male':
        sex=1
    else:
        sex=0
    return pd.Series([sex, is_stirilized], index=['sex', 'is_sterilized'])
data[['sex', 'is_stirilized']] = data['sex_upon_outcome'].apply(split_sex).astype('Int64')

data=data.loc[data['age_upon_outcome']<200]
data.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome,sex,is_stirilized
0,15,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,0,Intact Male,1,1
5,126,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,0,Intact Male,1,1
8,59,A685067,Cat,Domestic Shorthair Mix,Blue Tabby/White,2014-06-16T00:00:00,2014-08-14T18:45:00,2014-08-14T18:45:00,Lucy,,1,Intact Female,0,1
9,95,A678580,Cat,Domestic Shorthair Mix,White/Black,2014-03-26T00:00:00,2014-06-29T17:45:00,2014-06-29T17:45:00,*Frida,Offsite,1,Spayed Female,0,1
12,80,A677679,Dog,Chihuahua Shorthair/Pomeranian,Black,2014-03-07T00:00:00,2014-05-26T19:10:00,2014-05-26T19:10:00,Kash,Foster,1,Neutered Male,1,1


In [241]:
breed = data['breed'].value_counts()
breed_for = breed[breed<10].index
data.loc[data['breed'].isin(breed_for), 'breed'] = 'Mix'
data.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome,sex,is_stirilized
0,15,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,0,Intact Male,1,1
5,126,A664462,Dog,Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,0,Intact Male,1,1
8,59,A685067,Cat,Domestic Shorthair Mix,Blue Tabby/White,2014-06-16T00:00:00,2014-08-14T18:45:00,2014-08-14T18:45:00,Lucy,,1,Intact Female,0,1
9,95,A678580,Cat,Domestic Shorthair Mix,White/Black,2014-03-26T00:00:00,2014-06-29T17:45:00,2014-06-29T17:45:00,*Frida,Offsite,1,Spayed Female,0,1
12,80,A677679,Dog,Mix,Black,2014-03-07T00:00:00,2014-05-26T19:10:00,2014-05-26T19:10:00,Kash,Foster,1,Neutered Male,1,1


In [242]:
x = data[['age_upon_outcome', 'animal_type', 'breed', 'color', 'sex', 'is_stirilized']]
y = data['outcome_type']
categoricals = ['animal_type', 'breed', 'color', 'sex', 'is_stirilized']
# нашли категориальные переменные, по которым будем предсказывать
x[categoricals] = x[categoricals].astype('object').fillna('NULL')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [243]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, stratify=y, random_state=1995)
#чтобы причины случайной величины были всегда одинаковы(напрмер, дуновение ветра, тряска руки и тд)
cat_cols = np.where(x.columns.isin(categoricals))[0]
#x.columns - список наших колонок, isin - проверяем, находится ли элемент в списке (колонка, а не эл-т)
#np.where- чтобы узнать номера, а не булевые значеня
#[0] - чтобы возвращался не кортеж, а только номер из 1 числа

# 1 способ Catboost

In [244]:
clf = CatBoostClassifier(n_estimators=300, learning_rate=0.2, max_depth=5, cat_features=cat_cols, eval_metric='AUC')
clf.fit(x_train, y_train, eval_set=[(x_train, y_train), (x_test, y_test)])
y_pred_train, y_pred_test = clf.predict_proba(x_train)[:, 1], clf.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_train, y_pred_train), roc_auc_score(y_test, y_pred_test))

0:	test: 0.7967624	test1: 0.8360316	best: 0.8360316 (0)	total: 8.38ms	remaining: 2.51s
1:	test: 0.8263406	test1: 0.8650602	best: 0.8650602 (1)	total: 15.1ms	remaining: 2.25s
2:	test: 0.8268283	test1: 0.8667545	best: 0.8667545 (2)	total: 20.8ms	remaining: 2.06s
3:	test: 0.8369380	test1: 0.8638742	best: 0.8667545 (2)	total: 26.1ms	remaining: 1.93s
4:	test: 0.8382865	test1: 0.8632342	best: 0.8667545 (2)	total: 31.4ms	remaining: 1.85s
5:	test: 0.8412944	test1: 0.8638554	best: 0.8667545 (2)	total: 36.5ms	remaining: 1.79s
6:	test: 0.8420924	test1: 0.8599398	best: 0.8667545 (2)	total: 41.3ms	remaining: 1.73s
7:	test: 0.8425113	test1: 0.8611446	best: 0.8667545 (2)	total: 46.9ms	remaining: 1.71s
8:	test: 0.8425085	test1: 0.8611446	best: 0.8667545 (2)	total: 50ms	remaining: 1.62s
9:	test: 0.8425085	test1: 0.8611446	best: 0.8667545 (2)	total: 52.2ms	remaining: 1.51s
10:	test: 0.8446775	test1: 0.8622929	best: 0.8667545 (2)	total: 56.3ms	remaining: 1.48s
11:	test: 0.8448752	test1: 0.8618411	best: 0

111:	test: 0.8811847	test1: 0.8702937	best: 0.8709337 (110)	total: 579ms	remaining: 971ms
112:	test: 0.8811739	test1: 0.8702937	best: 0.8709337 (110)	total: 588ms	remaining: 973ms
113:	test: 0.8813938	test1: 0.8712349	best: 0.8712349 (113)	total: 597ms	remaining: 974ms
114:	test: 0.8814068	test1: 0.8711973	best: 0.8712349 (113)	total: 605ms	remaining: 973ms
115:	test: 0.8814120	test1: 0.8714608	best: 0.8714608 (115)	total: 610ms	remaining: 967ms
116:	test: 0.8819492	test1: 0.8719503	best: 0.8719503 (116)	total: 617ms	remaining: 964ms
117:	test: 0.8825345	test1: 0.8717809	best: 0.8719503 (116)	total: 622ms	remaining: 959ms
118:	test: 0.8829863	test1: 0.8716303	best: 0.8719503 (116)	total: 625ms	remaining: 951ms
119:	test: 0.8826302	test1: 0.8712161	best: 0.8719503 (116)	total: 630ms	remaining: 944ms
120:	test: 0.8827175	test1: 0.8710655	best: 0.8719503 (116)	total: 633ms	remaining: 937ms
121:	test: 0.8829443	test1: 0.8717056	best: 0.8719503 (116)	total: 640ms	remaining: 934ms
122:	test:

211:	test: 0.8960436	test1: 0.8736446	best: 0.8750188 (138)	total: 1.17s	remaining: 487ms
212:	test: 0.8962438	test1: 0.8735693	best: 0.8750188 (138)	total: 1.18s	remaining: 483ms
213:	test: 0.8964212	test1: 0.8734187	best: 0.8750188 (138)	total: 1.19s	remaining: 478ms
214:	test: 0.8964366	test1: 0.8738328	best: 0.8750188 (138)	total: 1.19s	remaining: 472ms
215:	test: 0.8966475	test1: 0.8741340	best: 0.8750188 (138)	total: 1.2s	remaining: 467ms
216:	test: 0.8967101	test1: 0.8743223	best: 0.8750188 (138)	total: 1.21s	remaining: 461ms
217:	test: 0.8966111	test1: 0.8745105	best: 0.8750188 (138)	total: 1.21s	remaining: 456ms
218:	test: 0.8973197	test1: 0.8748494	best: 0.8750188 (138)	total: 1.22s	remaining: 451ms
219:	test: 0.8976189	test1: 0.8750753	best: 0.8750753 (219)	total: 1.22s	remaining: 445ms
220:	test: 0.8976543	test1: 0.8748494	best: 0.8750753 (219)	total: 1.23s	remaining: 439ms
221:	test: 0.8976142	test1: 0.8750753	best: 0.8750753 (219)	total: 1.23s	remaining: 434ms
222:	test: 

 2 способ RandomForest

In [222]:
#breed = data['breed'].value_counts()
#breed

In [223]:
#for val, index in zip(data['breed'].value_counts(), data['breed'].value_counts().index):
#    print(index, ' ', val)

In [224]:
#breed_for = breed[breed<10].index
#breed_for

In [225]:
#data.loc[data['breed'].isin(breed_for), 'breed'] = 'Mix'
#data.head()
#data['breed'].value_counts()

In [226]:
'''age = data['age_upon_outcome'].value_counts()
age = data['age_upon_outcome'].astype(int).astype(str).value_counts()
age
age_1 = age[age<2].index
age_1 = age_1.get_values()

np.sort(age_1)

data=data.loc[data['age_upon_outcome']<200]
data.head()

SyntaxError: EOF while scanning triple-quoted string literal (<ipython-input-226-b85d4375cadb>, line 10)

In [227]:
'''color = data['color'].value_counts()
#color_1 = color_1.get_values()
#color_1=data['color'].unique
data = data[data.duplicated(subset=["color"], keep='first')]
#color_1 = data['color'].astype(object).astype(str).value_counts()
#data = data.drop(color_1, axis = 1, inplace = True)
data['color'].unique


SyntaxError: EOF while scanning triple-quoted string literal (<ipython-input-227-0b79ca85314c>, line 7)

In [217]:
'''x = data[['age_upon_outcome', 'animal_type', 'breed', 'color', 'sex', 'is_stirilized']]
y = data['outcome_type']
categoricals = ['animal_type', 'breed', 'color', 'sex', 'is_stirilized']
# нашли категориальные переменные, по которым будем предсказывать
x[categoricals] = x[categoricals].astype('object').fillna('NULL')

In [221]:
'''# Разделяем (тестовый и тренировочный наборы)
#
train, test = train_test_split(data, test_size = 0.4)

classes = train.pop('animal_type').values
features = train

testClasses = test.pop('animal_type').values
testFeatures = test

#
# Тренировка модели
#
model = RandomForestClassifier(n_estimators=500, max_depth=30).fit(features, classes)

#
# Отчёты о точности работы
#
print(classification_report(testClasses, model.predict(testFeatures)))
print(accuracy_score(testClasses, model.predict(testFeatures)))'''

ValueError: could not convert string to float: 'A667625'