### Multiclass Logistic Regression [Homework]

Follow these instructions:

1. Load the ```"titanic"``` dataset from the ```seaborn``` library.

2. Analyze, clean, and preprocess the dataset as needed. Note: you are not allowed to drop data points.

3. Split the data into ```train``` and ```validation``` sets in an 80/20 ratio using the ```train_test_split()``` function from the ```sklearn``` library. Note that you must set ```random_state = 1```.

4. Build a model that predicts the passenger class, ```pclass```.

5. The model's accuracy on both the ```train``` and ```validation``` data must be above ```80%```.


**Note:** Please include explanations at every step.

## 1. Load the "titanic" dataset from the seaborn library

In [1]:
# Start of code
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

titanic = sns.load_dataset("titanic")
# End of code

In [2]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


## 2. Clean and preprocess the dataset

First, let's replace the categorical values with numeric ones.

In [3]:
titanic["class"].unique()

[Third, First, Second]
Categories (3, object): [Third, First, Second]

In [4]:
titanic["who"].unique()

array(['man', 'woman', 'child'], dtype=object)

For values with a natural numeric interpretation we use label encoding (done below with `replace`); for the rest we use one-hot encoding.

In [5]:
titanic["sex"].replace("male", 1, inplace=True)
titanic["sex"].replace("female", 0, inplace=True)
titanic["class"].replace("First", 1, inplace=True)
titanic["class"].replace("Second", 2, inplace=True)
titanic["class"].replace("Third", 3, inplace=True)
titanic["who"].replace("man", 1, inplace=True)
titanic["who"].replace("woman", 0, inplace=True)
titanic["who"].replace("child", 2, inplace=True)
titanic["adult_male"].replace(True, 1, inplace=True)
titanic["adult_male"].replace(False, 0, inplace=True)
titanic["alive"].replace("yes", 1, inplace=True)
titanic["alive"].replace("no", 0, inplace=True)
titanic["alone"].replace(True, 1, inplace=True)
titanic["alone"].replace(False, 0, inplace=True)

titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.2500,S,3,1,1.0,,Southampton,0,0.0
1,1,1,0,38.0,1,0,71.2833,C,1,0,0.0,C,Cherbourg,1,0.0
2,1,3,0,26.0,0,0,7.9250,S,3,0,0.0,,Southampton,1,1.0
3,1,1,0,35.0,1,0,53.1000,S,1,0,0.0,C,Southampton,1,0.0
4,0,3,1,35.0,0,0,8.0500,S,3,1,1.0,,Southampton,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,S,2,1,1.0,,Southampton,0,1.0
887,1,1,0,19.0,0,0,30.0000,S,1,0,0.0,B,Southampton,1,1.0
888,0,3,0,,1,2,23.4500,S,3,0,0.0,,Southampton,0,0.0
889,1,1,1,26.0,0,0,30.0000,C,1,1,1.0,C,Cherbourg,1,1.0
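
The same label encoding can be written more compactly with `Series.map` and one dictionary per column. The sketch below is only an illustration on a made-up three-row frame; the name `toy` and the `encodings` dictionary are assumptions, not part of the original notebook. Note that `map` turns any value missing from the dictionary into `NaN`, so the dictionaries must cover every category.

```python
import pandas as pd

# Toy frame standing in for a few titanic columns (values are made up)
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "who": ["man", "woman", "child"],
    "alone": [False, False, True],
})

# One dictionary per column keeps the whole encoding in one place
encodings = {
    "sex": {"male": 1, "female": 0},
    "who": {"man": 1, "woman": 0, "child": 2},
}
for column, mapping in encodings.items():
    toy[column] = toy[column].map(mapping)

# Boolean flags convert directly to 0/1 integers
toy["alone"] = toy["alone"].astype(int)

print(toy)
```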


Notice that the `pclass` and `class` columns hold very similar values. Let's check how similar.

In [6]:
(titanic["pclass"] == titanic["class"]).value_counts()

True    891
dtype: int64

`class` exactly duplicates `pclass`, which is our target, so keeping it would leak the answer into the features. We can therefore drop the `class` feature.

In [7]:
titanic = titanic.drop(columns="class")

Before applying one-hot encoding, though, it is better to fill in the missing data.

In [8]:
titanic.describe()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,who,adult_male,alive,alone
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,0.789001,0.602694,0.383838,0.602694
std,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429,0.594291,0.489615,0.486592,0.489615
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.125,0.0,0.0,7.9104,0.0,0.0,0.0,0.0
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542,1.0,1.0,0.0,1.0
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.0,1.0,1.0,1.0,1.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,2.0,1.0,1.0,1.0


In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null int64
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
who            891 non-null int64
adult_male     891 non-null float64
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null int64
alone          891 non-null float64
dtypes: category(1), float64(4), int64(7), object(2)
memory usage: 91.9+ KB


Not every column contains 891 values. Let's look at the features with missing values.

In [10]:
titanic["age"].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In [11]:
titanic["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [12]:
titanic["embark_town"].unique()

array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

In [13]:
titanic["deck"].unique()

[NaN, C, E, G, D, A, B, F]
Categories (7, object): [C, E, G, D, A, B, F]

Since we will apply one-hot encoding to `embarked`, `embark_town`, and `deck` anyway, we can replace `NaN` with any other value, for example "Unknown".

In [14]:
titanic['deck'] = titanic['deck'].cat.add_categories('Unknown')
titanic['deck'].fillna('Unknown', inplace=True)

In [15]:
titanic['embarked'].fillna('Unknown', inplace=True)
titanic['embark_town'].fillna('Unknown', inplace=True)
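
The `deck` column needs the extra `add_categories` call because it has a categorical dtype: pandas refuses to fill a categorical column with a value that is not one of its categories. A minimal sketch on made-up data (the series below is an assumption for illustration, not the real column):

```python
import numpy as np
import pandas as pd

# Toy categorical column with a missing value (made-up data)
deck = pd.Series(["C", np.nan, "E"], dtype="category")

# "Unknown" is not among the categories yet, so register it first
deck = deck.cat.add_categories("Unknown")
deck = deck.fillna("Unknown")

print(deck.tolist())
```

The plain `object` columns `embarked` and `embark_town` have no such restriction, which is why they are filled directly.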

In [16]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null int64
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       891 non-null object
who            891 non-null int64
adult_male     891 non-null float64
deck           891 non-null category
embark_town    891 non-null object
alive          891 non-null int64
alone          891 non-null float64
dtypes: category(1), float64(4), int64(7), object(2)
memory usage: 91.9+ KB


Missing values of the `age` feature could be replaced with the column mean, but that seems too crude. Instead, let's fill each missing value with the age of the most similar row. To measure the similarity of two vectors we use `scipy.spatial.distance.cosine`. We still have categorical values, however, so we first apply one-hot encoding.

In [17]:
import pandas as pd
numeric_titanic = pd.get_dummies(titanic, columns=['embarked', 'embark_town', 'deck'])
numeric_titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'who',
       'adult_male', 'alive', 'alone', 'embarked_C', 'embarked_Q',
       'embarked_S', 'embarked_Unknown', 'embark_town_Cherbourg',
       'embark_town_Queenstown', 'embark_town_Southampton',
       'embark_town_Unknown', 'deck_A', 'deck_B', 'deck_C', 'deck_D', 'deck_E',
       'deck_F', 'deck_G', 'deck_Unknown'],
      dtype='object')

In [18]:
numeric_titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,who,adult_male,alive,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
0,0,3,1,22.0,1,0,7.2500,1,1.0,0,...,1,0,0,0,0,0,0,0,0,1
1,1,1,0,38.0,1,0,71.2833,0,0.0,1,...,0,0,0,0,1,0,0,0,0,0
2,1,3,0,26.0,0,0,7.9250,0,0.0,1,...,1,0,0,0,0,0,0,0,0,1
3,1,1,0,35.0,1,0,53.1000,0,0.0,1,...,1,0,0,0,1,0,0,0,0,0
4,0,3,1,35.0,0,0,8.0500,1,1.0,0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,1,1.0,0,...,1,0,0,0,0,0,0,0,0,1
887,1,1,0,19.0,0,0,30.0000,0,0.0,1,...,1,0,0,1,0,0,0,0,0,0
888,0,3,0,,1,2,23.4500,0,0.0,0,...,1,0,0,0,0,0,0,0,0,1
889,1,1,1,26.0,0,0,30.0000,1,1.0,1,...,0,0,0,0,1,0,0,0,0,0
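
To make the effect of `pd.get_dummies` easier to see, here is a small sketch on a made-up frame (the name `toy` and its values are assumptions). Each distinct value of the encoded column becomes its own indicator column, and exactly one indicator is set per row.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values)
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05],
    "embarked": ["S", "C", "Unknown"],
})

# Only the listed columns are expanded; the rest are kept unchanged
encoded = pd.get_dummies(toy, columns=["embarked"])

print(encoded.columns.tolist())
```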


Now let's find, for each row with no `age` value, the most similar row that has one. When computing `similarity`, we use the vector of all features except `age` itself.

In [19]:
nas = numeric_titanic[numeric_titanic["age"].isna()].index.tolist()

In [20]:
from scipy.spatial.distance import cosine

# Feature vectors without the column being imputed
features = numeric_titanic.drop(columns='age')

for idx in nas:
    navector = features.loc[idx].values
    # Rows with a known age (including ages imputed on earlier iterations)
    known = numeric_titanic[numeric_titanic["age"].notna()].index
    similarities = {
        idx2: 1 - cosine(navector, features.loc[idx2].values) for idx2 in known
    }
    # Copy the age of the most similar row (the first one on ties)
    max_idx2 = max(similarities, key=similarities.get)
    numeric_titanic.at[idx, 'age'] = numeric_titanic["age"].loc[max_idx2]
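
The nested loop above computes one cosine similarity at a time. As a rough sketch of the same idea, the similarities can be computed in a single matrix product with NumPy. The helper `nearest_donor_values` and the sample arrays are hypothetical, and the sketch differs from the loop in one respect: donors are restricted to rows whose value was known from the start, and rows with an all-zero feature vector are assumed not to occur.

```python
import numpy as np

def nearest_donor_values(features, target):
    """For each row with a missing target, return the target of the most
    cosine-similar row whose target is known."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    missing = np.isnan(target)

    # Normalise rows so that a dot product equals cosine similarity
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit[missing] @ unit[~missing].T

    # Position of the best donor for every row with a missing target
    best = similarity.argmax(axis=1)
    return target[~missing][best]

# Made-up example: three features per row, the third age is missing
X = np.array([[1.0, 0.0, 7.25],
              [0.0, 1.0, 71.0],
              [1.0, 0.0, 8.05]])
age = np.array([22.0, 38.0, np.nan])

print(nearest_donor_values(X, age))
```

Keep in mind that cosine similarity on unscaled data is dominated by large-valued columns such as `fare`, so scaling the features first may change which row counts as the nearest.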

In [21]:
numeric_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 27 columns):
survived                   891 non-null int64
pclass                     891 non-null int64
sex                        891 non-null int64
age                        891 non-null float64
sibsp                      891 non-null int64
parch                      891 non-null int64
fare                       891 non-null float64
who                        891 non-null int64
adult_male                 891 non-null float64
alive                      891 non-null int64
alone                      891 non-null float64
embarked_C                 891 non-null uint8
embarked_Q                 891 non-null uint8
embarked_S                 891 non-null uint8
embarked_Unknown           891 non-null uint8
embark_town_Cherbourg      891 non-null uint8
embark_town_Queenstown     891 non-null uint8
embark_town_Southampton    891 non-null uint8
embark_town_Unknown        891 non-null uint8
deck_A       

We now have a complete, fully numeric dataset.

In [22]:
numeric_titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,who,adult_male,alive,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
0,0,3,1,22.0,1,0,7.2500,1,1.0,0,...,1,0,0,0,0,0,0,0,0,1
1,1,1,0,38.0,1,0,71.2833,0,0.0,1,...,0,0,0,0,1,0,0,0,0,0
2,1,3,0,26.0,0,0,7.9250,0,0.0,1,...,1,0,0,0,0,0,0,0,0,1
3,1,1,0,35.0,1,0,53.1000,0,0.0,1,...,1,0,0,0,1,0,0,0,0,0
4,0,3,1,35.0,0,0,8.0500,1,1.0,0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,1,1.0,0,...,1,0,0,0,0,0,0,0,0,1
887,1,1,0,19.0,0,0,30.0000,0,0.0,1,...,1,0,0,1,0,0,0,0,0,0
888,0,3,0,30.0,1,2,23.4500,0,0.0,0,...,1,0,0,0,0,0,0,0,0,1
889,1,1,1,26.0,0,0,30.0000,1,1.0,1,...,0,0,0,0,1,0,0,0,0,0


## 3. Split the data into train and validation

In [23]:
titanic_label = numeric_titanic["pclass"]
numeric_titanic = numeric_titanic.drop(columns="pclass")
numeric_titanic

Unnamed: 0,survived,sex,age,sibsp,parch,fare,who,adult_male,alive,alone,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
0,0,1,22.0,1,0,7.2500,1,1.0,0,0.0,...,1,0,0,0,0,0,0,0,0,1
1,1,0,38.0,1,0,71.2833,0,0.0,1,0.0,...,0,0,0,0,1,0,0,0,0,0
2,1,0,26.0,0,0,7.9250,0,0.0,1,1.0,...,1,0,0,0,0,0,0,0,0,1
3,1,0,35.0,1,0,53.1000,0,0.0,1,0.0,...,1,0,0,0,1,0,0,0,0,0
4,0,1,35.0,0,0,8.0500,1,1.0,0,1.0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,1,27.0,0,0,13.0000,1,1.0,0,1.0,...,1,0,0,0,0,0,0,0,0,1
887,1,0,19.0,0,0,30.0000,0,0.0,1,1.0,...,1,0,0,1,0,0,0,0,0,0
888,0,0,30.0,1,2,23.4500,0,0.0,0,0.0,...,1,0,0,0,0,0,0,0,0,1
889,1,1,26.0,0,0,30.0000,1,1.0,1,1.0,...,0,0,0,0,1,0,0,0,0,0


In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
         numeric_titanic, titanic_label, test_size=0.2, random_state=1)
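
As a quick sanity check of what `test_size=0.2` and `random_state=1` do, the sketch below splits a made-up array of 10 samples (the arrays `X` and `y` are assumptions for illustration). The split sizes follow the 80/20 ratio, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each and three classes
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3])

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

# A second call with the same seed yields the identical split
X_train_again, X_val_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=1)

print(X_train.shape, X_val.shape)
```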

In [25]:
X_train

Unnamed: 0,survived,sex,age,sibsp,parch,fare,who,adult_male,alive,alone,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
301,1,1,28.0,2,0,23.2500,1,1.0,1,0.0,...,0,0,0,0,0,0,0,0,0,1
309,1,0,30.0,0,0,56.9292,0,0.0,1,1.0,...,0,0,0,0,0,0,1,0,0,0
516,1,0,34.0,0,0,10.5000,0,0.0,1,1.0,...,1,0,0,0,0,0,0,1,0,0
120,0,1,21.0,2,0,73.5000,1,1.0,0,0.0,...,1,0,0,0,0,0,0,0,0,1
570,1,1,62.0,0,0,10.5000,1,1.0,1,1.0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715,0,1,19.0,0,0,7.6500,1,1.0,0,1.0,...,1,0,0,0,0,0,0,1,0,0
767,0,0,30.5,0,0,7.7500,0,0.0,0,1.0,...,0,0,0,0,0,0,0,0,0,1
72,0,1,21.0,0,0,73.5000,1,1.0,0,1.0,...,1,0,0,0,0,0,0,0,0,1
235,0,0,45.0,0,0,7.5500,0,0.0,0,1.0,...,1,0,0,0,0,0,0,0,0,1


In [26]:
X_test

Unnamed: 0,survived,sex,age,sibsp,parch,fare,who,adult_male,alive,alone,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
862,1,0,48.0,0,0,25.9292,0,0.0,1,1.0,...,1,0,0,0,0,1,0,0,0,0
223,0,1,28.0,0,0,7.8958,1,1.0,0,1.0,...,1,0,0,0,0,0,0,0,0,1
84,1,0,17.0,0,0,10.5000,0,0.0,1,1.0,...,1,0,0,0,0,0,0,0,0,1
680,0,0,21.0,0,0,8.1375,0,0.0,0,1.0,...,0,0,0,0,0,0,0,0,0,1
535,1,0,7.0,0,2,26.2500,2,0.0,1,0.0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,1,0,49.0,0,0,25.9292,0,0.0,1,1.0,...,1,0,0,0,0,1,0,0,0,0
815,0,1,40.0,0,0,0.0000,1,1.0,0,1.0,...,1,0,0,1,0,0,0,0,0,0
629,0,1,21.0,0,0,7.7333,1,1.0,0,1.0,...,0,0,0,0,0,0,0,0,0,1
421,0,1,21.0,0,0,7.7333,1,1.0,0,1.0,...,0,0,0,0,0,0,0,0,0,1


## 4. Build a logistic regression model

Let's create and train a multiclass logistic regression model in one-vs-rest mode (`multi_class="ovr"`) on the training set.

In [27]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(multi_class='ovr', max_iter = 500)
classifier.fit(X_train,y_train)

LogisticRegression(max_iter=500, multi_class='ovr')

We set `max_iter` to 500 because with the default of 100 iterations the model does not converge.
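
Slow convergence is often a sign that the features live on very different scales (here, for instance, `fare` versus the 0/1 indicator columns). One common alternative to raising `max_iter` is to standardise the features first. The sketch below is an assumption-laden illustration on synthetic data, not the notebook's model: it combines `StandardScaler` with a one-vs-rest wrapper, `OneVsRestClassifier`, which trains one binary logistic regression per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three clusters, first feature on a much larger scale
rng = np.random.default_rng(0)
centers = [[0, 0], [100, 0], [50, 3]]
X = np.vstack([rng.normal(loc=c, scale=[10, 0.5], size=(100, 2))
               for c in centers])
y = np.repeat([1, 2, 3], 100)   # class labels 1, 2, 3 as in pclass

model = make_pipeline(
    StandardScaler(),                           # zero mean, unit variance
    OneVsRestClassifier(LogisticRegression()),  # one binary model per class
)
model.fit(X, y)

print(f"Training accuracy: {model.score(X, y):.2f}")
```

Wrapping the scaler and the classifier in one pipeline also ensures the scaling parameters are learned from the training data only.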

In [28]:
predictions = classifier.predict(X_test)
accuracy = (predictions == y_test).sum() / len(y_test)

print(f"The accuracy of the Logistic Regression model on test data is: {accuracy*100:.2f} %")

The accuracy of the Logistic Regression model on test data is: 81.01 %


In [29]:
score = classifier.score(X_train, y_train)
print(f"The accuracy of the Logistic Regression model on train data is: {score*100:.2f} %")

The accuracy of the Logistic Regression model on train data is: 84.69 %
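
Accuracy alone does not show which classes get confused with each other. A confusion matrix gives that breakdown. The sketch below uses made-up label arrays (`y_true` and `y_pred` are assumptions, not the model's real predictions) with `accuracy_score` and `confusion_matrix` from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted passenger classes
y_true = [1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 3, 1]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes (order 1, 2, 3)
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

print(f"Accuracy: {accuracy:.2f}")
print(matrix)
```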


#### Model accuracy is 84.69% on the training set and 81.01% on the test set.

### Great job