# Homework 3

You are given a dataset from an animal shelter. Your task is to train a model that, from the available features, predicts the 'Adoption' and 'Transfer' labels (the “outcome_type” column) as confidently as possible.

You are free to do whatever you like here. I want to see the following from you:

1. Check for missing values and handle them
2. Check the relationships between features
3. Try to create your own features
4. Remove the redundant ones
5. Pay attention to the text columns. Think about what useful information can be extracted from them
6. Using a profiler will help you.
7. Don't forget that you have PCA (principal component analysis). It may come in handy.

Recall everything I covered in the previous classes. Not all of it will be useful, but in real life nobody will tell you what to use :)
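Items 1 and 2 of the list above can be sketched on a small toy frame. The column names `age_days` and `has_name` and all values are invented for illustration; filling with the median is only one of several reasonable choices, not a prescribed method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shelter data; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Lucy", None, "Monday", None],
    "age_days": [365.0, 14.0, np.nan, 150.0],
    "is_mix": [1, 1, 1, 0],
    "adopted": [0, 0, 1, 1],
})

# Item 1: number of missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# One possible treatment: turn "has a name" into a feature
# and fill the numeric gap with the column median.
toy["has_name"] = toy["name"].notna().astype(int)
toy["age_days"] = toy["age_days"].fillna(toy["age_days"].median())

# Item 2: pairwise correlations between the numeric columns.
corr = toy[["age_days", "is_mix", "has_name", "adopted"]].corr()
print(corr.round(2))
```

On the real data the same calls would be applied to `df` directly; `df.info()` gives a quick first look at the non-null counts.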

A good classifier for this task is a random forest (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

You don't need to understand how the forest works at this stage, but its predictions are likely to be better than those of a linear classifier. (If you are interested, here is a guide: https://adataanalyst.com/scikit-learn/linear-classification-method/)

Good luck :)

In [1]:
import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import re
import datetime as dt
from functools import reduce

In [52]:
df = pd.read_csv('aac_shelter_outcomes.csv')
df.shape

(78256, 12)

In [53]:
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown


In [54]:
df.outcome_type.unique()

array(['Transfer', 'Adoption', 'Euthanasia', 'Return to Owner', 'Died',
       'Disposal', 'Relocate', 'Missing', nan, 'Rto-Adopt'], dtype=object)

In [55]:
df = df[df['outcome_type'].isin(['Transfer', 'Adoption'])]
df.shape

(56611, 12)

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56611 entries, 0 to 78255
Data columns (total 12 columns):
age_upon_outcome    56609 non-null object
animal_id           56611 non-null object
animal_type         56611 non-null object
breed               56611 non-null object
color               56611 non-null object
date_of_birth       56611 non-null object
datetime            56611 non-null object
monthyear           56611 non-null object
name                38660 non-null object
outcome_subtype     29425 non-null object
outcome_type        56611 non-null object
sex_upon_outcome    56611 non-null object
dtypes: object(12)
memory usage: 5.6+ MB


Let's drop the columns that, in my view, are unnecessary, to shrink the data.

In [57]:
df.drop(['age_upon_outcome',
         'animal_id', 'date_of_birth',
         'monthyear', 'name', 'outcome_subtype'], axis=1, inplace=True)
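The cell above drops both `age_upon_outcome` and `date_of_birth`. As an alternative, the two date columns could be turned into a numeric age before being dropped. The sketch below uses two rows copied from the `df.head()` output; the column name `age_days` is an assumption made for illustration.

```python
import pandas as pd

# Two rows copied from the df.head() output above.
sample = pd.DataFrame({
    "date_of_birth": ["2014-07-07T00:00:00", "2012-11-06T00:00:00"],
    "datetime": ["2014-07-22T16:04:00", "2013-11-07T11:47:00"],
})

# Age at outcome = outcome timestamp minus date of birth, in whole days.
age = pd.to_datetime(sample["datetime"]) - pd.to_datetime(sample["date_of_birth"])
sample["age_days"] = age.dt.days

print(sample["age_days"].tolist())  # [15, 366]
```

Whether such a feature actually helps the classifier would have to be checked on the full dataset.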

In [58]:
df.head(10)

Unnamed: 0,animal_type,breed,color,datetime,outcome_type,sex_upon_outcome
0,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-22T16:04:00,Transfer,Intact Male
1,Dog,Beagle Mix,White/Brown,2013-11-07T11:47:00,Transfer,Spayed Female
2,Dog,Pit Bull,Blue/White,2014-06-03T14:20:00,Adoption,Neutered Male
3,Dog,Miniature Schnauzer Mix,White,2014-06-15T15:50:00,Transfer,Neutered Male
5,Dog,Leonberger Mix,Brown/White,2013-10-07T13:06:00,Transfer,Intact Male
7,Dog,Chihuahua Shorthair Mix,Brown,2014-12-08T15:55:00,Transfer,Spayed Female
8,Cat,Domestic Shorthair Mix,Blue Tabby/White,2014-08-14T18:45:00,Adoption,Intact Female
9,Cat,Domestic Shorthair Mix,White/Black,2014-06-29T17:45:00,Adoption,Spayed Female
11,Dog,Papillon/Border Collie,Black/White,2014-03-28T14:39:00,Transfer,Neutered Male
12,Dog,Chihuahua Shorthair/Pomeranian,Black,2014-05-26T19:10:00,Adoption,Neutered Male


We kept the type, breed, color, date, outcome, and sex.
Next, we convert the outcome and the animal's sex status into binary features to simplify the dataset further.

In [59]:
df['adopted'] = df.outcome_type.str.lower().str.contains('adopt').astype(int)
df.head()

Unnamed: 0,animal_type,breed,color,datetime,outcome_type,sex_upon_outcome,adopted
0,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-22T16:04:00,Transfer,Intact Male,0
1,Dog,Beagle Mix,White/Brown,2013-11-07T11:47:00,Transfer,Spayed Female,0
2,Dog,Pit Bull,Blue/White,2014-06-03T14:20:00,Adoption,Neutered Male,1
3,Dog,Miniature Schnauzer Mix,White,2014-06-15T15:50:00,Transfer,Neutered Male,0
5,Dog,Leonberger Mix,Brown/White,2013-10-07T13:06:00,Transfer,Intact Male,0


In [60]:
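# case=False: the raw values are capitalised ('Intact Male'), so a case-sensitive
# search for 'intact' never matches; the outputs shown in this notebook come from
# the case-sensitive version and therefore have sex_intact = 0 everywhere.
df['sex_intact'] = df.sex_upon_outcome.str.contains('intact', case=False).astype(int)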

df.head()

Unnamed: 0,animal_type,breed,color,datetime,outcome_type,sex_upon_outcome,adopted,sex_intact
0,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-22T16:04:00,Transfer,Intact Male,0,0
1,Dog,Beagle Mix,White/Brown,2013-11-07T11:47:00,Transfer,Spayed Female,0,0
2,Dog,Pit Bull,Blue/White,2014-06-03T14:20:00,Adoption,Neutered Male,1,0
3,Dog,Miniature Schnauzer Mix,White,2014-06-15T15:50:00,Transfer,Neutered Male,0,0
5,Dog,Leonberger Mix,Brown/White,2013-10-07T13:06:00,Transfer,Intact Male,0,0
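In the output above `sex_intact` is 0 even for the `Intact Male` rows. A minimal sketch of why: `str.contains` is case-sensitive by default, and the values in `sex_upon_outcome` are capitalised. The series below is a toy sample built from values that appear in the dataset.

```python
import pandas as pd

sex = pd.Series(["Intact Male", "Spayed Female", "Neutered Male",
                 "Intact Female", "Unknown"])

# Default behaviour: lower-case 'intact' does not match 'Intact'.
case_sensitive = sex.str.contains("intact").astype(int)

# With case=False the match ignores letter case.
case_insensitive = sex.str.contains("intact", case=False).astype(int)

print(case_sensitive.tolist())    # [0, 0, 0, 0, 0]
print(case_insensitive.tolist())  # [1, 0, 0, 1, 0]
```

Lower-casing the column first, as is done for `breed` in a later cell, would work equally well.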


In [61]:
MIX_ATTR = 'mix'
SHORT_HAIR_ATTR = 'shorthair'
MEDIUM_HAIR_ATTR = 'medium hair'
LONG_HAIR_ATTR = 'longhair'

df['breed'] = df.breed.str.lower()

In [62]:
df['is_mix'] = df.breed.str.contains(MIX_ATTR).astype(int)
df.head()

Unnamed: 0,animal_type,breed,color,datetime,outcome_type,sex_upon_outcome,adopted,sex_intact,is_mix
0,Cat,domestic shorthair mix,Orange Tabby,2014-07-22T16:04:00,Transfer,Intact Male,0,0,1
1,Dog,beagle mix,White/Brown,2013-11-07T11:47:00,Transfer,Spayed Female,0,0,1
2,Dog,pit bull,Blue/White,2014-06-03T14:20:00,Adoption,Neutered Male,1,0,0
3,Dog,miniature schnauzer mix,White,2014-06-15T15:50:00,Transfer,Neutered Male,0,0,1
5,Dog,leonberger mix,Brown/White,2013-10-07T13:06:00,Transfer,Intact Male,0,0,1


In [63]:
df['l_hair'] = df.breed.str.contains(LONG_HAIR_ATTR).astype(int)
df['m_hair'] = df.breed.str.contains(MEDIUM_HAIR_ATTR).astype(int)
df['s_hair'] = df.breed.str.contains(SHORT_HAIR_ATTR).astype(int)

df.head()

Unnamed: 0,animal_type,breed,color,datetime,outcome_type,sex_upon_outcome,adopted,sex_intact,is_mix,l_hair,m_hair,s_hair
0,Cat,domestic shorthair mix,Orange Tabby,2014-07-22T16:04:00,Transfer,Intact Male,0,0,1,0,0,1
1,Dog,beagle mix,White/Brown,2013-11-07T11:47:00,Transfer,Spayed Female,0,0,1,0,0,0
2,Dog,pit bull,Blue/White,2014-06-03T14:20:00,Adoption,Neutered Male,1,0,0,0,0,0
3,Dog,miniature schnauzer mix,White,2014-06-15T15:50:00,Transfer,Neutered Male,0,0,1,0,0,0
5,Dog,leonberger mix,Brown/White,2013-10-07T13:06:00,Transfer,Intact Male,0,0,1,0,0,0


In [64]:
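# regex=True is stated explicitly: recent pandas versions treat the pattern as a literal string by default.
df.breed = df.breed.str.replace(f'{MIX_ATTR}|{LONG_HAIR_ATTR}|{SHORT_HAIR_ATTR}|{MEDIUM_HAIR_ATTR}', '', regex=True).str.strip()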
df.head()

Unnamed: 0,animal_type,breed,color,datetime,outcome_type,sex_upon_outcome,adopted,sex_intact,is_mix,l_hair,m_hair,s_hair
0,Cat,domestic,Orange Tabby,2014-07-22T16:04:00,Transfer,Intact Male,0,0,1,0,0,1
1,Dog,beagle,White/Brown,2013-11-07T11:47:00,Transfer,Spayed Female,0,0,1,0,0,0
2,Dog,pit bull,Blue/White,2014-06-03T14:20:00,Adoption,Neutered Male,1,0,0,0,0,0
3,Dog,miniature schnauzer,White,2014-06-15T15:50:00,Transfer,Neutered Male,0,0,1,0,0,0
5,Dog,leonberger,Brown/White,2013-10-07T13:06:00,Transfer,Intact Male,0,0,1,0,0,0


In [65]:
def one_hot_encode_new_columns(df: pd.DataFrame, col_name: str):
    enc = OneHotEncoder(categories='auto')
    
    encoded_data = enc.fit_transform(
        np.array( df[col_name] ).reshape(-1, 1)
    ).todense()
    
    encoded_feature_names = list(map(lambda val: re.sub(r'^.+_', f'{col_name}_', val), enc.get_feature_names()))
    
    return pd.DataFrame(data=encoded_data, columns=encoded_feature_names)

In [66]:
df = df.reset_index(drop=True)

# Note: 'outcome_type' is the target column, so its one-hot columns end up among the features.
ohe_col_names = ['animal_type', 'outcome_type', 'color', 'sex_upon_outcome']

df_ohe = pd.concat([
    df,
    *map(lambda col_name: one_hot_encode_new_columns(df, col_name), ohe_col_names)
], sort=False, axis=1)

df_ohe.drop(ohe_col_names, axis=1, inplace=True)

df_ohe.head()

Unnamed: 0,breed,datetime,adopted,sex_intact,is_mix,l_hair,m_hair,s_hair,animal_type_Bird,animal_type_Cat,...,color_Yellow/Cream,color_Yellow/Orange,color_Yellow/Orange Tabby,color_Yellow/Tan,color_Yellow/White,sex_upon_outcome_Intact Female,sex_upon_outcome_Intact Male,sex_upon_outcome_Neutered Male,sex_upon_outcome_Spayed Female,sex_upon_outcome_Unknown
0,domestic,2014-07-22T16:04:00,0,0,1,0,0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,beagle,2013-11-07T11:47:00,0,0,1,0,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,pit bull,2014-06-03T14:20:00,1,0,0,0,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,miniature schnauzer,2014-06-15T15:50:00,0,0,1,0,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,leonberger,2013-10-07T13:06:00,0,0,1,0,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
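The one-hot encoding above is done with a hand-written helper around `OneHotEncoder`. For comparison, a similar result can be sketched with `pd.get_dummies`, which also prefixes the new columns with the source column name. The toy frame and its values are invented for illustration; this is not the approach used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "animal_type": ["Cat", "Dog", "Dog"],
    "color": ["White", "Black", "White"],
    "is_mix": [1, 1, 0],
})

# Encode the listed columns; all other columns are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["animal_type", "color"], dtype=int)

print(encoded.columns.tolist())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less convenient when new data must be encoded later with exactly the same columns.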


In [67]:
df_ohe.drop(['breed', 'datetime' ], axis=1, inplace=True)

df_ohe.head()

Unnamed: 0,adopted,sex_intact,is_mix,l_hair,m_hair,s_hair,animal_type_Bird,animal_type_Cat,animal_type_Dog,animal_type_Livestock,...,color_Yellow/Cream,color_Yellow/Orange,color_Yellow/Orange Tabby,color_Yellow/Tan,color_Yellow/White,sex_upon_outcome_Intact Female,sex_upon_outcome_Intact Male,sex_upon_outcome_Neutered Male,sex_upon_outcome_Spayed Female,sex_upon_outcome_Unknown
0,0,0,1,0,0,1,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0,0,1,0,0,0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,0,0,0,0,0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0,0,1,0,0,0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0,0,1,0,0,0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [68]:
df_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56611 entries, 0 to 56610
Columns: 493 entries, adopted to sex_upon_outcome_Unknown
dtypes: float64(487), int32(6)
memory usage: 211.6 MB


In [69]:
X = df_ohe.drop(['adopted'], axis=1)
Y = df_ohe['adopted']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

In [70]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((39627, 492), (16984, 492), (39627,), (16984,))
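The split above uses neither a fixed seed nor stratification, so the partition changes on every run. A minimal sketch of a reproducible, stratified split on toy arrays follows; the seed value 42 and the toy data are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 12 of class 0 and 8 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(int(y_te.sum()), "of", len(y_te), "test samples are class 1")
```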

In [71]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [72]:
clf.score(X_test, Y_test)

1.0

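Strangely enough, the accuracy of our model came out at 100%. The most likely cause is target leakage: `outcome_type` is listed in `ohe_col_names`, so the columns `outcome_type_Adoption` and `outcome_type_Transfer` stay in `X` and encode the label `adopted` directly.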
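A perfect score usually deserves a check for columns derived from the label. The sketch below shows, on an invented toy frame, how one might detect and remove one-hot columns that come from `outcome_type` before building `X`. The column names follow the `{col_name}_{value}` pattern produced by `one_hot_encode_new_columns`.

```python
import pandas as pd

# Toy frame mimicking the structure of df_ohe; the values are invented.
toy = pd.DataFrame({
    "adopted": [1, 0, 1, 0],
    "is_mix": [1, 1, 0, 1],
    "outcome_type_Adoption": [1.0, 0.0, 1.0, 0.0],
    "outcome_type_Transfer": [0.0, 1.0, 0.0, 1.0],
})

# Columns produced by one-hot encoding the target column.
leaky = [c for c in toy.columns if c.startswith("outcome_type_")]

# The leaked column reproduces the label exactly.
is_copy_of_label = bool((toy["outcome_type_Adoption"] == toy["adopted"]).all())
print(leaky, is_copy_of_label)

# Build the feature matrix without the label and without its encoded copies.
X_clean = toy.drop(columns=["adopted"] + leaky)
Y_clean = toy["adopted"]
print(X_clean.columns.tolist())
```

In the notebook itself the equivalent change would be to leave `outcome_type` out of the encoded columns and drop it together with `breed` and `datetime`; the resulting accuracy would then have to be measured again.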