
# Task

In these tasks, use the implementation of logistic regression from the sklearn library:

`from sklearn.linear_model import LogisticRegression`

When training, use the following parameters: `random_state = 2019`, `solver = 'lbfgs'`:

`LogisticRegression(random_state = 2019, solver = 'lbfgs').fit(X, y)`

[Description of the implementation of logistic regression from the sklearn library.](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
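
For orientation, here is a minimal sketch of the call above on a tiny made-up dataset. The arrays `X` and `y` are hypothetical stand-ins, not the Titanic data; only the constructor arguments come from the task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, classes separated around x = 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Parameters required by the task
model = LogisticRegression(random_state=2019, solver='lbfgs').fit(X, y)

pred = model.predict(X)          # hard class labels (0 or 1)
proba = model.predict_proba(X)   # columns: P(class 0), P(class 1)
```

`predict()` returns class labels, while `predict_proba()` returns one probability per class for every row.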

# Dataset description

[The provided dataset](https://drive.google.com/file/d/1qTELQc2Nvl8gx_PRWhuuo22eskHL2za2/view?usp=sharing) contains information about the passengers of the Titanic, which sank on the night of April 14-15, 1912. Some of the passengers were rescued, and survival depended on many factors, including gender, age, the deck the cabin was on, social status, etc.

The dataset consists of various features that describe information about the passengers. Each row of the table is an individual passenger, with all the information about that passenger contained in its row.

Dataset description:
- **Survived** (target): whether the passenger survived or not (0 = No, 1 = Yes);
- **Pclass**: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd);
- **Sex**: gender (female or male);
- **Age**: age in years;
- **SibSp**: number of siblings/spouses aboard the Titanic;
- **Parch**: number of parents/children aboard the Titanic;
- **Ticket**: ticket number;
- **Fare**: passenger fare;
- **Cabin**: cabin number;
- **Embarked**: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

We need to solve a classification problem: learn to predict the target feature **Survived** (whether the passenger survived) from the remaining, non-target features.


# Data preparation and exploratory analysis

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data = pd.read_csv('/content/drive/MyDrive/itmo|AI_cources/titanic_train.csv', encoding = 'utf-8', delimiter=',')

In [5]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
0,3,1,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q,
1,3,1,"Glynn, Miss. Mary Agatha",female,,0,0,335677,7.75,,Q,"Co Clare, Ireland Washington, DC"
2,3,1,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38.0,0,0,2688,7.2292,,C,
3,3,0,"Vovk, Mr. Janko",male,22.0,0,0,349252,7.8958,,S,
4,3,0,"de Pelsmaeker, Mr. Alfons",male,16.0,0,0,345778,9.5,,S,


Find the number of missing values in the <code>age</code> column:

In [None]:
# < ENTER YOUR CODE HERE >
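
One possible way to count missing values is sketched below. The frame `toy` is a made-up stand-in for `data`, so the number printed here is not the answer to the task.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic table (values are made up)
toy = pd.DataFrame({'age': [22.0, np.nan, 38.0, np.nan, 16.0]})

# isna() marks missing cells with True; summing the mask counts them
n_missing_age = toy['age'].isna().sum()
print(n_missing_age)  # 2 for this toy frame
```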

Calculate the proportion of survivors.

In [None]:
# < ENTER YOUR CODE HERE >
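
Because `survived` is coded as 0/1, its mean equals the proportion of survivors. A minimal sketch on made-up values (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (values are made up)
toy = pd.DataFrame({'survived': [1, 1, 1, 0, 0]})

# The mean of a 0/1 column is the share of ones
survivor_share = toy['survived'].mean()
```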

Determine the proportion of missing values within each feature and get rid of those features where the proportion of missing values is greater than a third. Also delete the column <code>ticket</code> as this information is unlikely to be useful.

In [None]:
# < ENTER YOUR CODE HERE >
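
The step above could be sketched as follows. The frame `toy` and its values are made up for illustration; only the one-third threshold and the `ticket` column name come from the task.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic table (values are made up)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 38.0, 16.0],           # 1/4 missing, kept
    'cabin': [np.nan, np.nan, np.nan, 'C85'],    # 3/4 missing, dropped
    'ticket': ['335432', '2688', '349252', '345778'],
    'fare': [7.73, 7.23, 7.90, 9.50],
})

# Share of missing values in every column
missing_share = toy.isna().mean()

# Columns where more than a third of the values are missing
too_sparse = missing_share[missing_share > 1 / 3].index

# Drop the sparse columns, then the ticket column
reduced = toy.drop(columns=too_sparse).drop(columns=['ticket'])
```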

From the dataset description, you can see that the columns <code>sibsp</code> and <code>parch</code> both describe family size. Replace these two columns with a single <code>fam_size</code> column whose values are the sum of the corresponding values in <code>sibsp</code> and <code>parch</code>.

In [None]:
# < ENTER YOUR CODE HERE >
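
A minimal sketch of this replacement on a made-up frame (`toy` is hypothetical; the column names come from the task):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (values are made up)
toy = pd.DataFrame({'sibsp': [0, 1, 3], 'parch': [0, 2, 1]})

# Row-wise sum of the two family-related columns
toy['fam_size'] = toy['sibsp'] + toy['parch']

# The source columns are no longer needed
toy = toy.drop(columns=['sibsp', 'parch'])
```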

The resulting dataset will be called **INITIAL** (the features discarded at this stage do not need to be returned at any of the subsequent stages of the task).

In [None]:
# < ENTER YOUR CODE HERE >

Based on the available statistics, estimate the probability of survival given that the passenger belongs to a particular category (the category is specified in your individual assignment).

In [None]:
# < ENTER YOUR CODE HERE >
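
Such a probability can be estimated as the survival rate inside the category. The sketch below assumes, purely for illustration, that the category is "female passenger in 3rd class"; your individual assignment may name a different one. The frame `toy` is made up.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (values are made up)
toy = pd.DataFrame({
    'pclass': [3, 3, 3, 1, 3],
    'sex': ['female', 'female', 'male', 'female', 'female'],
    'survived': [1, 0, 0, 1, 1],
})

# Assumed category: women travelling in 3rd class
in_category = (toy['sex'] == 'female') & (toy['pclass'] == 3)

# Empirical P(survived = 1 | category): the mean of the 0/1 target in the subset
p_survive = toy.loc[in_category, 'survived'].mean()
```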

Construct histograms of survivors and non-survivors by age.

In [None]:
# < ENTER YOUR CODE HERE >
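
One way to draw the two histograms on shared axes is sketched below. The frame `toy`, the bin edges, and the labels are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic table (values are made up)
toy = pd.DataFrame({
    'age': [4.0, 22.0, 38.0, 16.0, 45.0, np.nan, 29.0, 61.0],
    'survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

fig, ax = plt.subplots()
bins = list(range(0, 81, 10))  # assumed 10-year bins from 0 to 80

for label, name in [(1, 'survived'), (0, 'did not survive')]:
    # Ages for one group; rows with a missing age are skipped
    ages = toy.loc[toy['survived'] == label, 'age'].dropna()
    ax.hist(ages, bins=bins, alpha=0.5, label=name)

ax.set_xlabel('age')
ax.set_ylabel('number of passengers')
ax.legend()
```

In a notebook the figure is rendered when the cell finishes; in a plain script you would add `plt.show()`.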

# Model based on numerical features

## Removing rows with missing values

As a baseline, it makes sense to start with a model that requires minimal effort.

From the initial dataset, remove all categorical features as well as rows containing missing values.

Use <code>train_test_split()</code> to split the dataset into training and test sets <b>with the parameters specified in your individual assignment</b>. Stratify by the column <code>survived</code>.

Train a <code>LogisticRegression()</code> model (<b>with the parameters specified in your individual assignment</b>) on the training set, and evaluate it on the test set.

Compute <code>f1_score()</code> of the model on the test dataset (we recommend using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html">the corresponding function</a> with default parameters).

In [None]:
# < ENTER YOUR CODE HERE >
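
The whole baseline pipeline might look like the sketch below. The data are synthetic, and `test_size=0.2` and `random_state=42` in `train_test_split()` are placeholders: take the real values from your individual assignment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the INITIAL dataset (not the real Titanic data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'pclass': rng.integers(1, 4, n),
    'sex': rng.choice(['female', 'male'], n),
    'age': rng.uniform(1, 70, n),
    'fare': rng.uniform(5, 100, n),
})
# Synthetic target, loosely tied to fare so the model has something to learn
toy['survived'] = (toy['fare'] + rng.normal(0, 20, n) > 50).astype(int)
# Knock out 40 ages to imitate missing values
toy.loc[rng.choice(n, 40, replace=False), 'age'] = np.nan

# Keep numeric columns only, then drop rows with missing values
numeric = toy.select_dtypes(include='number').dropna()
X = numeric.drop(columns='survived')
y = numeric['survived']

# Placeholder split parameters; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(random_state=2019, solver='lbfgs').fit(X_train, y_train)
score = f1_score(y_test, model.predict(X_test))
```

On unscaled features `lbfgs` may emit a convergence warning; the sketch leaves the default iteration limit in place because the task does not mention changing it.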

## Filling missing values with the mean

The quality of the resulting model leaves much to be desired, so it makes sense to try filling in the missing values. Remove the categorical features from the initial dataset, and fill the missing values with the mean of the corresponding column. The remaining steps are the same: splitting, training, evaluation.

Compute <code>f1_score()</code> of the model on the test dataset.

In [None]:
# < ENTER YOUR CODE HERE >
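
Mean imputation itself can be sketched in a few lines. The frame `toy` is made up; splitting, training, and evaluation would follow as in the previous model.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the INITIAL dataset (values are made up)
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0, np.nan],
    'fare': [7.25, 8.05, np.nan, 10.0],
    'sex': ['female', 'male', 'male', 'female'],
})

# Keep numeric columns only
numeric = toy.select_dtypes(include='number')

# fillna() with a Series of column means fills each column with its own mean
filled = numeric.fillna(numeric.mean())
```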

## Filling missing values based on honorifics

Obviously, filling the missing values in the <code>age</code> column with the overall mean is not the smartest idea. It may be better to do this more intelligently, for example by taking into account how the passenger is addressed.

Notice that the <code>name</code> column of the initial dataset carries information about the passenger's social status in the form of honorifics such as <code>Mr., Mrs., Dr.</code>, etc. Based on this information, we can try to make an assumption about the passenger's age.

Put the column <code>name</code> back into consideration. Create a separate <code>honorific</code> column and store the honorific of each passenger there.

Calculate the number of unique honorifics.

In [None]:
# < ENTER YOUR CODE HERE >
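
A possible extraction is sketched below. It assumes the honorific is the first word in the name that is directly followed by a period, which holds for the sample rows shown by `data.head()` but should be checked against the full dataset.

```python
import pandas as pd

# Names copied from the data.head() output above
toy = pd.DataFrame({'name': [
    'Smyth, Miss. Julia',
    'Whabee, Mrs. George Joseph (Shawneene Abi-Saab)',
    'Vovk, Mr. Janko',
    'de Pelsmaeker, Mr. Alfons',
]})

# Capture the first run of letters immediately followed by a period
toy['honorific'] = toy['name'].str.extract(r'([A-Za-z]+)\.', expand=False)

n_unique = toy['honorific'].nunique()
```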

Most likely, it makes sense to reduce the number of honorifics by merging small groups into larger ones, as there seems to be no fundamental difference between, for example, <code>Don</code> and <code>Mr</code>. Note that <code>Master</code> is an old-fashioned form of address for a boy, so we will treat it separately.

Make the following substitutions:

<code>Mr</code> $\leftarrow$ <code>['Rev', 'Col', 'Dr', 'Major', 'Don', 'Capt']</code>

<code>Mrs</code> $\leftarrow$ <code>['Dona', 'Countess']</code>

<code>Miss</code> $\leftarrow$ <code>['Mlle', 'Ms']</code>

In [None]:
# < ENTER YOUR CODE HERE >
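
The substitutions can be expressed as a lookup from each rare honorific to its target group. In this sketch the Series `honorific` is a made-up sample; the three lists are the ones given above.

```python
import pandas as pd

# Hypothetical sample of extracted honorifics
honorific = pd.Series(['Mr', 'Don', 'Dr', 'Mrs', 'Countess',
                       'Mlle', 'Ms', 'Miss', 'Master'])

# Target group -> honorifics to merge into it (lists from the task)
groups = {
    'Mr': ['Rev', 'Col', 'Dr', 'Major', 'Don', 'Capt'],
    'Mrs': ['Dona', 'Countess'],
    'Miss': ['Mlle', 'Ms'],
}

# Invert to a flat mapping: rare honorific -> target group
mapping = {old: new for new, olds in groups.items() for old in olds}

# Values absent from the mapping (Mr, Mrs, Miss, Master) stay unchanged
merged = honorific.replace(mapping)
```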

Calculate the proportion of rows with the <code>Master</code> value relative to the number of all males.

In [None]:
# < ENTER YOUR CODE HERE >
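
A minimal sketch of this proportion on made-up rows (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the dataset with the honorific column added
toy = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'female', 'male'],
    'honorific': ['Mr', 'Master', 'Mr', 'Miss', 'Mr'],
})

# Restrict to male passengers first, then take the share of Master among them
males = toy[toy['sex'] == 'male']
master_share = (males['honorific'] == 'Master').mean()
```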

Calculate the average age of the category specified in your individual assignment.

In [None]:
# < ENTER YOUR CODE HERE >
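
Average ages for all categories at once can be obtained with `groupby()`. The frame `toy` is made up, and picking `'Mr'` at the end is only an example of reading off one category.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset with the honorific column added
toy = pd.DataFrame({
    'honorific': ['Mr', 'Mr', 'Miss', 'Miss', 'Master'],
    'age': [30.0, 40.0, 20.0, np.nan, 4.0],
})

# mean() skips missing ages by default
mean_age = toy.groupby('honorific')['age'].mean()

# Example: read off the value for one category
mean_age_mr = mean_age['Mr']
```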

Fill the missing values in the column <code>age</code> with the mean age of the corresponding <code>honorific</code> category.

Remove the non-numeric features. The next steps are the same: splitting, training, evaluation.

Compute <code>f1_score()</code> of the model on the test dataset.

In [None]:
# < ENTER YOUR CODE HERE >
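
Group-wise imputation can be sketched with `groupby().transform()`, which returns the group mean aligned to every row. The frame `toy` is made up; splitting, training, and evaluation would follow as before.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset with the honorific column added
toy = pd.DataFrame({
    'honorific': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master', 'Master'],
    'age': [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, np.nan],
})

# For each row, the mean age of that row's honorific group
group_mean = toy.groupby('honorific')['age'].transform('mean')

# Replace only the missing ages; known ages stay as they are
toy['age'] = toy['age'].fillna(group_mean)
```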

# Model that uses categorical features

In the original dataset, fill the missing values in the column <code>age</code> with values based on the honorifics (as in the previous step).

After that, drop the features <code>name</code> and <code>honorific</code>: they have served their purpose.

Perform <code>one-hot</code> encoding of the non-numeric features, for example, with <code>pd.get_dummies(drop_first=True)</code>. Then follow the familiar pattern: split, train, evaluate.

Compute <code>f1_score()</code> of the model on the test dataset.

In [None]:
# < ENTER YOUR CODE HERE >
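
The encoding step alone can be sketched as follows. The frame `toy` is made up; with `drop_first=True` the first level of each categorical column (in sorted order) is dropped, so `sex` yields one indicator column and `embarked` yields two.

```python
import pandas as pd

# Hypothetical stand-in for the dataset after dropping name and honorific
toy = pd.DataFrame({
    'pclass': [3, 1, 2],
    'sex': ['female', 'male', 'male'],
    'embarked': ['Q', 'C', 'S'],
})

# Non-numeric columns are replaced by indicator columns; numeric ones are kept
encoded = pd.get_dummies(toy, drop_first=True)
```

After encoding, every column is numeric or boolean, so the frame can be passed to `train_test_split()` and `LogisticRegression()` in the same way as in the numeric models.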