For this homework you will work on weather prediction. The data file is in the corresponding directory. You are given the dataset weather.csv: use the FIRST 75% of its rows (shuffle = False) for training and the last 25% for testing.
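
A minimal sketch of such a chronological split. The tiny DataFrame below is a made-up stand-in for weather.csv, and using `train_test_split` is one possible way to do it (plain slicing works as well). With `shuffle=False` the row order is preserved, so the first 75% of rows go to training and the last 25% to testing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for weather.csv (values are illustrative only).
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7],
    "RainTomorrow": ["No", "No", "No", "No", "No", "Yes", "No", "Yes"],
})

# shuffle=False keeps the original order: first 75% train, last 25% test.
train, test = train_test_split(df, test_size=0.25, shuffle=False)

print(train.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(test.index.tolist())   # [6, 7]
```

Keeping the order matters here because the rows are dated observations: a shuffled split would let the model train on days that come after the ones it is tested on.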

You need to build 4 models that predict the target variable <b>RainTomorrow</b> using:

   1. logistic regression [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)
   
   2. the nearest neighbors method [sklearn.neighbors](https://scikit-learn.org/stable/modules/neighbors.html)
 
   3. a Bayesian classifier [sklearn.naive_bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
   
   4. a logistic regression that you implement yourself
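
For item 4, one possible starting point is sketched below: logistic regression trained with plain batch gradient descent on the mean log-loss. The function names, the learning rate, the iteration count and the synthetic data are all assumptions made for illustration, not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)           # current P(y = 1 | x)
        grad_w = X.T @ (p - y) / n_samples
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(X, w, b):
    """Probability of class 1 for each row of X."""
    return sigmoid(X @ w + b)

# Synthetic, linearly separable toy data (not the weather dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)

w, b = fit_logistic_regression(X_toy, y_toy)
accuracy = np.mean((predict_proba(X_toy, w, b) >= 0.5) == y_toy)
print(w, b, accuracy)
```

This sketch has no regularization and no stopping criterion; on the real data both, as well as feature scaling, would likely need attention.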

Then compare the models' results (by quality and by running time) and conclude which model, with which parameters, gives the best results.
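
A hedged sketch of how quality and running time could be collected side by side. It uses synthetic data from `make_classification` instead of the weather dataset, and the choice of F1 as the quality metric and of `time.perf_counter` for timing are assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, split 75/25 without shuffling.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
n_demo = int(len(X_demo) * 0.75)
X_tr, y_tr = X_demo[:n_demo], y_demo[:n_demo]
X_te, y_te = X_demo[n_demo:], y_demo[n_demo:]

results = {}
for model in [LogisticRegression(max_iter=1000), KNeighborsClassifier(), GaussianNB()]:
    name = type(model).__name__

    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_te)
    predict_seconds = time.perf_counter() - start

    results[name] = {
        "fit_s": fit_seconds,
        "predict_s": predict_seconds,
        "f1": f1_score(y_te, y_pred),
    }

for name, row in results.items():
    print(f"{name}: fit {row['fit_s']:.4f}s, predict {row['predict_s']:.4f}s, F1 {row['f1']:.3f}")
```

Timing fit and predict separately is useful because nearest neighbors does almost no work at fit time and most of its work at prediction time.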

Keep in mind that feature engineering plays a very large role in building a good model.

Brief description of the data:

    Date - Date of the observation
    Location - Name of the location where the weather station is situated
    MinTemp - Minimum temperature in degrees Celsius
    MaxTemp - Maximum temperature in degrees Celsius
    Rainfall - Amount of rainfall recorded for the day, in mm
    Evaporation - The so-called Class A "pan evaporation" (mm) in the 24 hours to 9am
    Sunshine - Number of hours of sunshine during the day
    WindGustDir - Direction of the strongest wind gust in the last 24 hours
    WindGustSpeed - Speed (km/h) of the strongest wind gust in the last 24 hours
    WindDir9am - Wind direction at 9am

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import seaborn as sns
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
%matplotlib notebook

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [3]:
X = pd.read_csv('weather.csv')
y = X['RainTomorrow'].replace({'No' : 0, 'Yes' : 1})
X.head()

,Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [4]:
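# Encode RainToday as 0/1. Assign the result back: a chained inplace call
# on a column does not modify X under pandas copy-on-write.
X['RainToday'] = X['RainToday'].replace({'No' : 0, 'Yes' : 1})
# Drop the stray index column and the target (already stored in y).
X.drop(['Unnamed: 0', 'RainTomorrow'], axis=1, inplace=True)
# Split the date into separate Year / Month / Day features.
date = pd.DatetimeIndex(X['Date'])
X.drop('Date', axis=1, inplace=True)
X['Year'] = date.year
X['Month'] = date.month
X['Day'] = date.day
X.head()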

,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,...,1007.7,1007.1,8.0,,16.9,21.8,0.0,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,...,1010.6,1007.8,,,17.2,24.3,0.0,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,...,1007.6,1008.7,,2.0,21.0,23.2,0.0,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,...,1017.6,1012.8,,,18.1,26.5,0.0,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,...,1010.8,1006.0,7.0,8.0,17.8,29.7,0.0,2008,12,5


## Filling in missing values

In [5]:
X.isna().sum().sort_values()

Location             0
Year                 0
Month                0
Day                  0
MaxTemp            322
MinTemp            637
Temp9am            904
WindSpeed9am      1348
RainToday         1406
Rainfall          1406
Humidity9am       1774
WindSpeed3pm      2630
Temp3pm           2726
Humidity3pm       3610
WindDir3pm        3778
WindGustSpeed     9270
WindGustDir       9330
WindDir9am       10013
Pressure3pm      13981
Pressure9am      14014
Cloud9am         53657
Cloud3pm         57094
Evaporation      60843
Sunshine         67816
dtype: int64

In [6]:
X.dtypes

Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday        float64
Year               int64
Month              int64
Day                int64
dtype: object

In [7]:
# Fill the categorical wind-direction columns with their most frequent value.
val = {
    'WindGustDir': X['WindGustDir'].mode()[0],
    'WindDir9am': X['WindDir9am'].mode()[0],
    'WindDir3pm': X['WindDir3pm'].mode()[0]
}
X.fillna(value=val, inplace=True)
# Fill the remaining gaps in the numeric (float64) columns with the column means.
type_float64 = X.dtypes == 'float64'
X.fillna(value=X[X.columns[type_float64]].mean(), inplace=True)

In [8]:
X.isna().sum().sort_values()

Location         0
Year             0
RainToday        0
Temp3pm          0
Temp9am          0
Cloud3pm         0
Cloud9am         0
Pressure3pm      0
Pressure9am      0
Humidity3pm      0
Humidity9am      0
WindSpeed3pm     0
WindSpeed9am     0
WindDir3pm       0
WindDir9am       0
WindGustSpeed    0
WindGustDir      0
Sunshine         0
Evaporation      0
Rainfall         0
MaxTemp          0
MinTemp          0
Month            0
Day              0
dtype: int64
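
The imputation in this section computes the mode and the mean over all rows, including those that later form the test set. As a hedged alternative sketch (not what this notebook does), the fill values can be estimated on the training rows only and then applied to every row. The toy frame and the variable names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in one categorical and one numeric column.
df = pd.DataFrame({
    "WindGustDir": ["W", "W", np.nan, "NE", "W", np.nan, "NE", "NE"],
    "MinTemp": [10.0, np.nan, 14.0, 12.0, 8.0, 16.0, np.nan, 30.0],
})

# The first 75% of rows play the role of the training set.
n_train = int(len(df) * 0.75)
train_part = df.iloc[:n_train]

# Statistics come from the training rows only, so nothing from the
# test rows leaks into the fill values.
fill_values = {
    "WindGustDir": train_part["WindGustDir"].mode()[0],
    "MinTemp": train_part["MinTemp"].mean(),
}

df_filled = df.fillna(value=fill_values)
print(df_filled)
```

In this toy example the test rows contain an unusually high `MinTemp` of 30.0, which would shift the mean if the whole frame were used.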

## Feature transformation

In [29]:
# All non-object columns; despite the name, this also includes the int64 Year, Month and Day.
float64_features = X.columns[X.dtypes != 'object']
hist = X.hist(float64_features, bins=50, figsize=(8,20));
hist

[Figure: grid of 50-bin histograms, one panel per numeric column: MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustSpeed, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday, Day, 2007, ... (output truncated)]

In [10]:
# Scale every numeric feature to [0, 1]. The scaler is fit on all rows, test rows included.
scaler = MinMaxScaler()
scaler.fit(X[float64_features])
X[float64_features] = scaler.transform(X[float64_features])

In [11]:
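# log1p compresses the long right tails of Rainfall and Evaporation
# (applied after min-max scaling, so the inputs are already in [0, 1]).
cols_to_log = ['Rainfall', 'Evaporation']
for col in cols_to_log:
    X[col] = np.log1p(X[col])
# Restore the raw calendar values, overwriting the scaled Year / Month / Day.
X['Year'] = date.year
X['Month'] = date.month
X['Day'] = date.day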

In [12]:
date.day

Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
            15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
           dtype='int64', name='Date', length=142193)

In [13]:
# One-hot encode the categorical columns (Day stays a numeric feature).
cols_to_encode = ['Year', 'Month', 'Location', 'WindDir3pm', 'WindDir9am', 'WindGustDir']
for col in cols_to_encode:
    d_f = pd.get_dummies(X[col])
    X.drop(col, axis=1, inplace=True)
    X = pd.concat([X, d_f], axis=1)
# Chronological split: first 75% of rows for training, last 25% for testing.
n = int(X.shape[0] * 0.75)
X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()
X_test = X_test.to_numpy()
y_test = y_test.to_numpy()


In [14]:
# Fit the three scikit-learn models with default settings and store their test predictions by class name.
models = [LogisticRegression(), KNeighborsClassifier(algorithm='brute'), GaussianNB()]
predicted = {}
for model in models:
    model.fit(X_train, y_train)
    print(type(model).__name__, 'FIT DONE')
    predicted[type(model).__name__] = model.predict(X_test)
    print(type(model).__name__, 'PREDICT DONE')

LogisticRegression FIT DONE
LogisticRegression PREDICT DONE
KNeighborsClassifier FIT DONE
KNeighborsClassifier PREDICT DONE
GaussianNB FIT DONE
GaussianNB PREDICT DONE


The class-label prediction function takes the probabilities of belonging to class 1 as input and returns class labels $y \in \{0, 1\}$
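
A minimal sketch of such a function. The name `predict_label`, the default threshold of 0.5 and the `>=` convention at the boundary are assumptions; the notebook does not show its own version.

```python
import numpy as np

def predict_label(proba, threshold=0.5):
    """Map probabilities of class 1 to hard labels in {0, 1}."""
    return (np.asarray(proba) >= threshold).astype(int)

print(predict_label([0.1, 0.5, 0.8]))       # [0 1 1]
print(predict_label([0.1, 0.5, 0.8], 0.6))  # [0 0 1]
```

Raising the threshold trades recall on the rainy class for precision, which matters here because rainy days are the minority class.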

In [15]:
predicted

{'LogisticRegression': array([0, 0, 1, ..., 0, 0, 0]),
 'KNeighborsClassifier': array([0, 0, 0, ..., 0, 0, 0]),
 'GaussianNB': array([1, 1, 1, ..., 0, 0, 0])}

In [16]:
for model_name, pred_val in predicted.items():
    print(model_name)
    print(classification_report(y_test,pred_val))
    print('\n')

LogisticRegression
              precision    recall  f1-score   support

           0       0.87      0.95      0.91     27882
           1       0.74      0.48      0.59      7667

    accuracy                           0.85     35549
   macro avg       0.80      0.72      0.75     35549
weighted avg       0.84      0.85      0.84     35549



KNeighborsClassifier
              precision    recall  f1-score   support

           0       0.81      0.94      0.87     27882
           1       0.46      0.20      0.28      7667

    accuracy                           0.78     35549
   macro avg       0.64      0.57      0.57     35549
weighted avg       0.74      0.78      0.74     35549



GaussianNB
              precision    recall  f1-score   support

           0       0.91      0.77      0.83     27882
           1       0.46      0.73      0.57      7667

    accuracy                           0.76     35549
   macro avg       0.69      0.75      0.70     35549
weighted avg       