# **Rain in Australia**
## **Predict next-day rain in Australia**

## 1. Importing tools

First, let's import all the tools we need.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from lightgbm import LGBMClassifier
from optuna.samplers import TPESampler
import optuna.integration.lightgbm as lightgbm
from sklearn.metrics import roc_auc_score
import optuna
from sklearn.feature_selection import RFECV
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

## 2. Exploratory data analysis 

Before selecting models, creating new features, or choosing among the existing ones, we need to analyze the data we have. Let's do it step by step.

## 2.1 Data reading and primary processing

In the following cells we read the data and split it into the target column, which is self-explanatory, and the remaining data, which we will be working with.

Before splitting, we drop all rows with an unknown target value and factorize the target: 0 if tomorrow is not a rainy day and 1 otherwise.

In [None]:
data = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
data.dropna(subset=['RainTomorrow'], inplace=True)
data['RainTomorrow'] = pd.factorize(data['RainTomorrow'])[0]
target = data['RainTomorrow']
data = data.drop(['RainTomorrow'], axis=1)
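
`pd.factorize` numbers the labels in order of first appearance, so which label becomes 0 depends on the first row of the file. A minimal sketch on made-up toy labels (not rows from the real dataset) of how an explicit mapping pins the encoding down; the name `label_map` is an assumption for illustration:

```python
import pandas as pd

# Toy labels (hypothetical values, not taken from weatherAUS.csv).
labels = pd.Series(['Yes', 'No', 'No', 'Yes'])

# factorize assigns codes by first appearance: 'Yes' comes first here, so it becomes 0.
codes, uniques = pd.factorize(labels)

# An explicit mapping gives the same encoding regardless of row order.
label_map = {'No': 0, 'Yes': 1}
mapped = labels.map(label_map)

print(list(codes), list(uniques))
print(list(mapped))
```

If the first row of the real file happens to be a dry day, both approaches agree; the explicit mapping simply removes the dependence on row order.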

Let's create two lists: one with the categorical features and one with the numeric features.

***object* - means a non-numeric format**

***float64* - means a numeric format**

In [None]:
data.info()

In [None]:
cat = ['Location', 
       'WindGustDir',
       'WindDir9am',
       'WindDir3pm',
       'RainToday',
       'Date']       
dig = [i for i in list(data) if i not in cat]
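
The two lists can also be derived from the column dtypes instead of being typed by hand. A small sketch on a toy frame; the values are invented and the names `toy`, `cat_cols` and `num_cols` are hypothetical:

```python
import pandas as pd

# Toy frame with two text columns and two numeric columns (invented values).
toy = pd.DataFrame({
    'Location': ['Albury', 'Sydney'],
    'MinTemp': [13.4, 7.4],
    'RainToday': ['No', 'Yes'],
    'Rainfall': [0.6, 0.0],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(cat_cols)
print(num_cols)
```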

Next, we will deal with the *NaN* values.

In [None]:
percent_nan = data.isna().sum()/data.shape[0]
percent_nan

We replace missing values in the numeric columns with the column mean, and in the categorical columns with the nearest preceding value in the same column (forward fill).

In [None]:
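# Fill numeric columns that contain NaN with the column mean.
for ind, d in zip(percent_nan.index, percent_nan):
  if ind in dig and d != 0:
    # Assign the result back; fillna(inplace=True) on a selected column
    # may not modify the original frame in recent pandas versions.
    data[ind] = data[ind].fillna(data[ind].mean())

# Forward-fill the remaining gaps (fillna(method='ffill') is deprecated).
data = data.ffill()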
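
To make the two imputation rules concrete, here is a minimal sketch on a three-row toy frame (invented values, hypothetical name `toy`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'MinTemp': [10.0, np.nan, 14.0],
    'WindGustDir': ['W', None, 'E'],
})

# Numeric column: replace NaN with the column mean (the mean of 10 and 14 is 12).
toy['MinTemp'] = toy['MinTemp'].fillna(toy['MinTemp'].mean())

# Categorical column: carry the previous valid value forward.
toy['WindGustDir'] = toy['WindGustDir'].ffill()

print(toy)
```

A missing value in the very first row has nothing before it, so a forward fill leaves it as NaN.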

Our next step is to factorize the categorical features into numeric codes.

In [None]:
for c in cat:
  data['{}_dig'.format(c)] = pd.factorize(data[c])[0]

In [None]:
cat_dig = ['{}_dig'.format(c) for c in cat]
features = dig + cat_dig

## 2.2 Data visualization

To find insights in our data, let's visualize it.

In [None]:
data[features].hist(figsize=(20, 15), bins=20)

As we can see, the categorical columns mostly have a uniform distribution.

The numeric columns have a nearly bell-shaped distribution, but some of them are heavy-tailed.

In [None]:
data[features].corrwith(target)

From this output we can see which features have the weakest or the strongest negative/positive correlation with the target variable. We may drop some of the weakly correlated ones, like *'Location_dig', 'Temp9am'* and so on.

In [None]:
sns.pairplot(data[dig])

Using this plot, we can see that some of the time-based features are highly positively correlated with each other. So, we may want to drop one feature from each such pair.

I will use a threshold of 0.05 on the absolute correlation with the target. It is just a heuristic, and you may keep these features in the data. Different decisions give different advantages: for example, the fewer the features, the faster the model, and vice versa.

In [None]:
# Compute the correlation with the target once and reuse it.
target_corr = data[features].corrwith(target)

# Features whose absolute correlation with the target is below the threshold.
drop_features = [feature for feature, corr in target_corr.items() if np.abs(corr) < 0.05]
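
The same filter can be tried on a toy example where the answer is known in advance. This is only a sketch: the columns `signal` and `noise` are constructed by hand so that one follows the target and the other is exactly uncorrelated with it.

```python
import numpy as np
import pandas as pd

# Hand-built toy data: 'signal' follows the target, 'noise' is independent of it.
y = pd.Series(np.tile([0, 0, 1, 1], 25))
pattern = np.tile([0.0, 1.0, 0.0, 1.0], 25)
toy = pd.DataFrame({
    'signal': y + 0.1 * pattern,
    'noise': pattern,
})

# Pearson correlation of every column with the target.
corr = toy.corrwith(y)

# Keep the names of the columns below the 0.05 threshold.
weak = [name for name, c in corr.items() if abs(c) < 0.05]

print(corr)
print(weak)
```

Pearson correlation only measures linear association, so a feature with a low value here can still be useful to a non-linear model.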

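## 2.3 Data standardization

Many machine learning algorithms, such as k-nearest neighbors and neural networks, work better when the features share the same scale (tree-based models are largely insensitive to it), so let's subtract from each column its mean value and divide by its standard deviation.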

In [None]:
for d in dig:
  data[d] = (data[d] - data[d].mean()) / data[d].std()
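
The cell above computes the mean and the standard deviation over all rows, including the ones that later end up in the test subset. A common alternative, sketched here on toy numbers under the assumption that the first rows form the training part, takes the statistics from the training rows only and reuses them for the rest:

```python
import pandas as pd

toy = pd.DataFrame({'Temp3pm': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Assumed chronological split: the first four rows train, the last row tests.
train, test = toy.iloc[:4], toy.iloc[4:]

# Statistics come from the training rows only.
mu = train['Temp3pm'].mean()
sigma = train['Temp3pm'].std()  # pandas uses the sample standard deviation (ddof=1)

train_scaled = (train['Temp3pm'] - mu) / sigma
test_scaled = (test['Temp3pm'] - mu) / sigma

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.tolist())
```

The training part ends up with mean 0 and standard deviation 1, while the test part is only approximately standardized, which is expected.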

So, if we look at the distributions of the rescaled numeric columns, we can see that each mean is equal to zero and each standard deviation is equal to one.

In [None]:
data[dig].hist(figsize=(20, 15), bins=20)

## 3. Model selection

## 3.1 Parameters selection

The next step is to choose and prepare a model for predicting the target.

First, I will take one of the most popular algorithms among Kaggle competitors: *XGBClassifier*.

Split the data into train and test subsets. I also create a new features list without *drop_features*.

In [None]:
new_features = [feature for feature in features if feature not in drop_features]

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(
    data[new_features], target, random_state=100, test_size=0.2, shuffle=False) # shuffle False because data is time series.
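
With `shuffle=False`, `train_test_split` simply cuts the rows in their existing order, so the test subset is the tail of the data. A tiny sketch on ten numbered rows (the names `first` and `last` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10)

# No shuffling: the first 80% of the rows train, the last 20% test.
first, last = train_test_split(rows, test_size=0.2, shuffle=False)

print(first)
print(last)
```

Since nothing is shuffled, `random_state` has no effect in this mode.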

In [None]:
clf = xgb.XGBClassifier(random_state=123)  # random_state is the scikit-learn style name for the seed
clf.fit(xtrain, ytrain)
clf.score(xtest, ytest)

Not bad for a first result. You can compare it with the full features list: maybe more is better?

In [None]:
xtrain_old, xtest_old, ytrain_old, ytest_old = train_test_split(
    data[features], target, random_state=100, test_size=0.2, shuffle=False)

clf_old = xgb.XGBClassifier(random_state=123)  # random_state is the scikit-learn style name for the seed
clf_old.fit(xtrain_old, ytrain_old)
clf_old.score(xtest_old, ytest_old)

But I will choose *LGBMClassifier*, because it is lighter and faster than *XGBClassifier* and, on this data, slightly more accurate.

In [None]:
lgb_model = LGBMClassifier(random_state=123)
lgb_model.fit(xtrain, ytrain)
lgb_model.score(xtest, ytest)

We can also go through a hyperparameter tuning procedure.

* Create a dictionary of parameters
* Use Optuna to choose the best combination of them

In [None]:
lgb_train = lightgbm.Dataset(xtrain, ytrain)
lgb_eval = lightgbm.Dataset(xtest, ytest)

def create_model(trial):
    params = {
            'num_leaves': trial.suggest_int('num_leaves', 32, 512),
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            # suggest_float replaces the deprecated suggest_uniform / suggest_loguniform
            'learning_rate': trial.suggest_float('learning_rate', 0.05, 0.5),
            'max_depth': trial.suggest_int('max_depth', 3, 18),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
            'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 8),
            'min_child_samples': trial.suggest_int('min_child_samples', 4, 80),
            # log=True samples the regularization strengths on a logarithmic scale
            'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 1.0, log=True),
            'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1.0, log=True),
            'n_estimators': trial.suggest_int('n_estimators', 200, 600),
            'random_state': 123
        }
    model = LGBMClassifier(**params)
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(xtrain, ytrain)
    preds = model.predict_proba(xtest)[:, 1]
    score = roc_auc_score(ytest, preds)
    return score

In [None]:
sampler = TPESampler(seed=123)
study = optuna.create_study(direction='maximize', sampler=sampler)
study.optimize(objective, n_trials=100) # you can increase n_trials number for better result 

In [None]:
params = study.best_params 
print(params)

In [None]:
new_lgb = LGBMClassifier(**params)
new_lgb.fit(xtrain, ytrain)
new_lgb.score(xtest, ytest)

Tuning the parameters did not improve the result (0.8642 against 0.8643), but the difference is really small.
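
One thing to keep in mind when reading this comparison: the study maximizes ROC AUC (through `roc_auc_score`), while `.score` reports accuracy, and the two metrics can disagree. A minimal sketch with invented probabilities shows a perfect ranking that still yields a mediocre accuracy at the 0.5 threshold:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Invented scores: positives are ranked above negatives, but all stay below 0.5.
proba = np.array([0.1, 0.2, 0.3, 0.4])

# AUC only looks at the ranking of the scores.
auc = roc_auc_score(y_true, proba)

# Accuracy depends on the 0.5 cut-off, which predicts class 0 for every sample here.
acc = accuracy_score(y_true, (proba >= 0.5).astype(int))

print(auc, acc)
```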

## 3.2 Features selection

Earlier we manually selected the features that seemed promising to us.

Let's give this job to the RFECV algorithm.

In [None]:
select_model = LGBMClassifier(**params)
selector = RFECV(select_model, step=1, cv=5, verbose=10, min_features_to_select=6)
selector.fit(xtrain, ytrain)

Look at the feature ranking:

In [None]:
selector.ranking_

In [None]:
# Keep the features that RFECV marked as selected (support_ is a boolean mask).
final_features = [feature for feature, keep in zip(new_features, selector.support_) if keep]
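
How `support_` and `ranking_` relate to the column names is easier to see on a small synthetic problem. This is only a sketch: `LogisticRegression` is swapped in for `LGBMClassifier` to keep it light, and the data, the `f0`..`f5` column names and the name `selector_demo` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 3 carry information about the label.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(6)])

# Recursive feature elimination with 3-fold cross-validation.
selector_demo = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3,
                      min_features_to_select=2)
selector_demo.fit(X, y)

# support_ is a boolean mask over the columns; selected features have rank 1.
kept = X.columns[selector_demo.support_].tolist()

print(kept)
print(selector_demo.ranking_)
```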

Let's train the model for the last time and see the final result.

In [None]:
xtrain_final, xtest_final, ytrain_final, ytest_final = train_test_split(
    data[final_features], target, random_state=100, test_size=0.3, shuffle=False)

final_lgb = LGBMClassifier(**params)
final_lgb.fit(xtrain_final, ytrain_final)
final_lgb.score(xtest_final, ytest_final)

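As we can see, the result does not improve, so we go back to the previous feature set. Note that this cell uses *test_size=0.3*, while the earlier cells use 0.2, so the scores are not strictly comparable.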

## 4. Ensemble of models

Let's create three different models (of your choice).

And a list of (name, model) tuples for them, together with the LightGBM model.

In [None]:
k = KNeighborsClassifier(n_neighbors=7)
g = GaussianNB()
rf = RandomForestClassifier()

estimators = [
    ('k', k), ('g', g), ('l', final_lgb), ('rf', rf)
]

Let's check each of these models.

In [None]:
k.fit(xtrain, ytrain)
k.score(xtest, ytest)

In [None]:
g.fit(xtrain, ytrain)
g.score(xtest, ytest)

In [None]:
rf.fit(xtrain, ytrain)
rf.score(xtest, ytest)

Next, combine them.

In [None]:
vote = VotingClassifier(voting='hard', estimators=estimators)
vote.fit(xtrain, ytrain)
vote.score(xtest, ytest)
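
With `voting='hard'`, each fitted model casts one vote per sample and the most frequent class wins. A minimal NumPy sketch of that rule for binary labels, using three hypothetical models and made-up votes (not outputs of the models above):

```python
import numpy as np

# Rows: three hypothetical models; columns: four samples.
votes = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# Hard voting: predict 1 only when more than half of the models vote for 1.
majority = (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

print(majority)
```

With an even number of models, as in the four-estimator list above, a 2-2 split is possible. This sketch resolves it to class 0, and scikit-learn's documentation describes ties as being broken by ascending class label order, which gives the same answer.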

As we can see, training an ensemble of different models also did not give an improvement in this particular problem.

## 5. Recurrent Neural Network (RNN)

In [None]:
import tensorflow as tf
from tensorflow import keras
# Use the Keras bundled with TensorFlow consistently instead of mixing it with standalone keras.
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

This cell checks whether a GPU is available.

In [None]:
device_name = tf.test.gpu_device_name()
if "GPU" not in device_name:
    print("GPU device not found")
else:
    # Report the device only when one was actually found.
    print('Found GPU at: {}'.format(device_name))

For a recurrent neural network we need to prepare sequences of data, where k is the length of a sequence in days.

In [None]:
k = 30 # for best result, you may increase k value

r_xtrain = [xtrain.iloc[i:i+k] for i in range(xtrain.shape[0]-k-1)]
r_ytrain = [ytrain.iloc[i+k] for i in range(ytrain.shape[0]-k-1)]

r_xtest = [xtest.iloc[i:i+k] for i in range(xtest.shape[0]-k-1)]
r_ytest = [ytest.iloc[i+k] for i in range(xtest.shape[0]-k-1)]

r_xtrain = np.array(r_xtrain)
r_ytrain = np.array(r_ytrain)

r_xtest = np.array(r_xtest)
r_ytest = np.array(r_ytest)
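
The windowing is easier to follow on a tiny array. In this sketch `window` plays the role of `k`, and the six "days" with two features each are invented; each sample is a block of `window` consecutive rows, and its label is the value on the day right after the block.

```python
import numpy as np

values = np.arange(12).reshape(6, 2)     # 6 days, 2 features per day
labels = np.array([0, 1, 0, 1, 1, 0])    # one label per day
window = 3

# Each window covers days i .. i+window-1; its target is the label of day i+window.
windows = np.array([values[i:i + window] for i in range(len(values) - window)])
targets = np.array([labels[i + window] for i in range(len(labels) - window)])

print(windows.shape)  # (samples, timesteps, features)
print(targets)
```

The cell above iterates over `range(... - k - 1)`, so it stops one window earlier than this sketch; the resulting arrays are still aligned with each other.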

Let's create a simple 1-layer RNN.

In [None]:
with tf.device('/cpu:0'):
    model = models.Sequential()

    model.add(layers.GRU(32, input_shape=(None, r_xtrain.shape[-1]), recurrent_dropout=0.2))
    model.add(layers.Dense(1))
    model.compile(optimizer=Adam(amsgrad=True), loss='mse', metrics=['accuracy'])  # metrics passed as a list
    history = model.fit(x=r_xtrain, 
                        y=r_ytrain,
                        epochs=15,
                        validation_data=(r_xtest, r_ytest)
                        )

Plot the training history.

In [None]:
def history_plt(history):
    loss = history.history['loss']
    acc = history.history['accuracy']
    val_loss = history.history['val_loss']
    val_acc = history.history['val_accuracy']

    epochs = range(1, len(loss) + 1)

    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[0].plot(epochs, acc, label='acc', color='b')
    ax[0].plot(epochs, val_acc, label='val_acc', color='r')
    ax[0].legend()
    ax[1].plot(epochs, loss, label='loss', color='b')
    ax[1].plot(epochs, val_loss, label='val_loss', color='r')
    ax[1].legend()
    plt.show()

In [None]:
history_plt(history)

In [None]:
np.max(history.history['val_accuracy'])

Looking at the plots, we could continue training to achieve better results.

## 6. Convolutional Neural Network (CNN)

In [None]:
with tf.device('/cpu:0'):
    model = models.Sequential()

    model.add(layers.Conv1D(16, 3, input_shape=(None, r_xtrain.shape[-1]), activation='relu', kernel_regularizer='l1_l2'))
    model.add(layers.Conv1D(32, 3, activation='relu', kernel_regularizer='l1_l2'))

    model.add(layers.BatchNormalization())
    model.add(layers.MaxPool1D())
    
    model.add(layers.Conv1D(64, 3, activation='relu', kernel_regularizer='l1_l2'))
    model.add(layers.Conv1D(128, 3, activation='relu', kernel_regularizer='l1_l2'))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPool1D())
    
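    # Collapse the time axis so that Dense(1) yields one prediction per sequence.
    model.add(layers.GlobalAveragePooling1D())
    model.add(layers.Dense(1))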
    model.compile(optimizer=Adam(amsgrad=True), loss='mse', metrics=['accuracy'])  # metrics passed as a list
    history = model.fit(x=r_xtrain, 
                        y=r_ytrain,
                        epochs=15,
                        validation_data=(r_xtest, r_ytest)
                        )
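
The convolution and pooling layers shorten the time axis step by step. The sketch below traces the sequence length for a 30-day input, assuming the Keras defaults used above (Conv1D with `padding='valid'` and stride 1, MaxPool1D with `pool_size=2`); the helper names are hypothetical.

```python
def conv1d_len(n, kernel_size=3):
    """Output length of a Conv1D with 'valid' padding and stride 1."""
    return n - kernel_size + 1


def maxpool1d_len(n, pool_size=2):
    """Output length of a MaxPool1D with 'valid' padding and stride equal to pool_size."""
    return n // pool_size


n = 30  # sequence length in days
trace = [n]
# Same order as the model: two convolutions, a pooling step, two convolutions, a pooling step.
for layer in ['conv', 'conv', 'pool', 'conv', 'conv', 'pool']:
    n = conv1d_len(n) if layer == 'conv' else maxpool1d_len(n)
    trace.append(n)

print(trace)
```

Several timesteps remain after the last pooling layer. A `Dense` layer applied directly to such a tensor acts on every remaining timestep separately, which is why a global pooling or flattening step is commonly placed before the final `Dense` layer.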

In [None]:
history_plt(history)

In [None]:
np.max(history.history['val_accuracy'])

### In conclusion, I want to say that there are many ways and opportunities to improve the quality of a model, and each of them is effective in some specific task.

Test accuracy of each approach:

* XGBClassifier with all features: 0.8624
* XGBClassifier without the dropped features: 0.8483
* LGBMClassifier: 0.8643
* LGBMClassifier with hyperparameter tuning: 0.8642
* LGBMClassifier with hyperparameter tuning and feature selection: 0.8587
* RandomForestClassifier: 0.8607
* KNeighborsClassifier: 0.8435
* GaussianNB: 0.8399
* Ensemble of [RFC, KNC, GNB, LGBM]: 0.8611
* Recurrent Neural Network: 0.8127
* Convolutional Neural Network: 0.7979

### That's all.

### Good luck to all!