# Hackathon #1 - Binary Classification - Template
Always check the [Reference Document](https://docs.google.com/document/d/1A-1-UK9ol4tegfU4YQySEiBXBeqbx_tu1qdmiMAzNCM/edit?usp=sharing) for all information ;) 

Remember to keep your workflow consistent!

## Regarding the Data
- The dataset can be loaded with `load_data()` (check the already provided code below). It will output two dataframes:
  - `train`: labeled dataset (with target) 
  - `test`: unlabeled dataset (target is not available)
- You will use the `train` data to do your magic! Once you have a predictive model, make predictions on the `test` data and submit them to our platform, where you will get an AUC score (much like a Kaggle competition). Check the file `submission_example.csv` for an example and the [Reference Document](https://docs.google.com/document/d/1A-1-UK9ol4tegfU4YQySEiBXBeqbx_tu1qdmiMAzNCM/edit?usp=sharing) for further information.
- You can and should perform train-test splits on the `train` data that you have available. Cross-validation is highly recommended. 
- The target name is `TomorrowRainForecast`: it is either 1 (rains tomorrow) or 0 (does not rain tomorrow).
- The `ID` is the identification variable which is very important for keeping track of the predictions you will make on the `test` data.

*Good luck,  
LDSA team*
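As a rough illustration of the cross-validation advice above, the sketch below scores a classifier with stratified 5-fold cross-validation using AUC, the same metric the platform reports. The data is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, `cv` and `model` are placeholders for this example, not part of the hackathon code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the processed `train` features and target (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)

# Stratified folds keep the share of positive rows similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)

# One AUC value per fold; the spread hints at how stable the estimate is.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by the processed features and the `TomorrowRainForecast` column.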

In [2]:
%load_ext autoreload

In [3]:
# some useful imports
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.decomposition import PCA
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score, cross_validate

from sklearn.naive_bayes import GaussianNB

from sklearn.linear_model import LogisticRegression

from utils import read_and_process
%autoreload 2
%matplotlib inline

In [4]:
def load_data(data_dir='../data/'):
    train = pd.read_csv(data_dir+'train.csv')
    test = pd.read_csv(data_dir+'test.csv')
    return train, test

# 1. Load Data
Load the `train` and `test` datasets

In [5]:
train, test = load_data()

In [6]:
# Median of each numeric column on days when it rained
train[train.DidRainToday == 'Yes'].median(numeric_only=True)

ID                      3315.5
AmountRain                 5.4
StrongWindSpeed           35.0
AfternoonWindSpeed        15.0
MorningHumidity           87.0
AfternoonHumidity         63.0
MorningTemp               15.2
AfternoonTemp             18.8
DaysSinceNewYear         174.0
TomorrowRainForecast       0.0
dtype: float64

In [6]:
train.head()

| | ID | AmountRain | StrongWindDir | StrongWindSpeed | MorningWindDir | AfternoonWindDir | AfternoonWindSpeed | MorningHumidity | AfternoonHumidity | MorningTemp | AfternoonTemp | DidRainToday | DaysSinceNewYear | TomorrowRainForecast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5683 | 0.0 | WSW | 43.0 | N | SW | 17.0 | 82.0 | 51.0 | 15.4 | 20.8 | No | 268 | 0 |
| 1 | 2971 | 0.0 | E | 15.0 | NaN | SE | 7.0 | 91.0 | 63.0 | 9.7 | 16.6 | No | 136 | 0 |
| 2 | 3560 | 0.0 | S | 33.0 | WSW | WSW | 17.0 | 58.0 | 38.0 | 10.9 | 15.9 | No | 217 | 0 |
| 3 | 2304 | 10.0 | NE | 30.0 | S | SSE | 11.0 | 96.0 | 93.0 | 4.7 | 6.4 | Yes | 195 | 1 |
| 4 | 3573 | 0.0 | WSW | 48.0 | NaN | SW | 26.0 | 59.0 | 40.0 | 11.5 | 15.9 | No | 232 | 0 |


# Do your magic!

In [20]:
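train, test = load_data()
# Balance the classes by appending three extra copies of every positive row.
# Caution: this happens before any train/validation split, so copies of the
# same row can end up on both sides of a later split and inflate the AUC.
train = pd.concat([train,
                   train[train.TomorrowRainForecast==1],
                   train[train.TomorrowRainForecast==1],
                   train[train.TomorrowRainForecast==1]
                  ])
y = train['TomorrowRainForecast']
# Class counts after oversampling
print(train.groupby(['TomorrowRainForecast']).size())
train = train.drop(columns='TomorrowRainForecast')
# read_and_process comes from the local utils module
X = read_and_process(train)
# Check that no missing values remain after processing
X.isnull().sum()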

TomorrowRainForecast
0    5607
1    5384
dtype: int64


AmountRain            0
StrongWindSpeed       0
AfternoonWindSpeed    0
MorningHumidity       0
AfternoonHumidity     0
MorningTemp           0
AfternoonTemp         0
DaysSinceNewYear      0
MorningWindDir_E      0
MorningWindDir_ENE    0
MorningWindDir_ESE    0
MorningWindDir_N      0
MorningWindDir_NE     0
MorningWindDir_NNE    0
MorningWindDir_NNW    0
MorningWindDir_NW     0
MorningWindDir_S      0
MorningWindDir_SE     0
MorningWindDir_SSE    0
MorningWindDir_SSW    0
MorningWindDir_SW     0
MorningWindDir_W      0
MorningWindDir_WNW    0
MorningWindDir_WSW    0
diff_temp             0
dtype: int64
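Because the positive rows are duplicated before any split, a random split of the oversampled frame can place copies of the same day in both the fitting and the validation part. One way to avoid this, sketched below under stated assumptions, is to split first and oversample only the fitting part. The toy frame and the names `toy`, `fit_part`, `holdout` and `fit_balanced` are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `train` (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ID": np.arange(200),
    "AfternoonHumidity": rng.uniform(20, 100, size=200),
    "TomorrowRainForecast": (rng.uniform(size=200) < 0.25).astype(int),
})

# 1. Split first, stratifying on the target so both parts keep the class ratio.
fit_part, holdout = train_test_split(
    toy, test_size=0.25, stratify=toy["TomorrowRainForecast"], random_state=42
)

# 2. Oversample the positive class in the fitting part only.
positives = fit_part[fit_part["TomorrowRainForecast"] == 1]
fit_balanced = pd.concat([fit_part, positives, positives, positives])

# 3. Check that no hold-out row was copied into the fitting data.
leaked = set(fit_balanced["ID"]) & set(holdout["ID"])
print("rows shared between fit and hold-out:", len(leaked))
print(fit_balanced["TomorrowRainForecast"].value_counts())
```

The hold-out part keeps the original class balance, so an AUC measured on it is closer to what unseen `test` data would give.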

In [19]:
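clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Hold out 25% of the rows (the train_test_split default) for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf.fit(X_train, y_train)
# Score with the predicted probability of the positive class (rain tomorrow)
y_pred = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred)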

0.8981508024104902

In [117]:
clf = GradientBoostingClassifier(n_estimators=99)
# NOTE: `params` (the search space) is not defined in the cells shown here;
# it must exist before this cell runs.
randomized_search = RandomizedSearchCV(
                       clf,
                       params,
                       cv=5, n_iter=1,  # n_iter=1 samples a single candidate
                       random_state=99,
                       return_train_score=True,
                       scoring='roc_auc'
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# The search is fitted on the full X, y with 5-fold CV, not on X_train
randomized_search.fit(X, y)
print(randomized_search.best_score_)
#results = cross_validate(randomized_search.best_estimator_, X_test, y_test, scoring="roc_auc",
#                             return_train_score=True, cv=5)
#print(np.mean(results['test_score']))
best_p = randomized_search.best_params_
#clf = LogisticRegression(C=best_p["C"], penalty=best_p['penalty'])
#clf.fit(X,y)

0.8721211939002287


In [113]:
print(best_p)

{'n_estimators': 99}
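The search above samples a single candidate, so it cannot compare alternatives. As a hedged sketch of a wider search, the example below defines a search space with `scipy.stats.randint` and a list of learning rates and lets `RandomizedSearchCV` try a few combinations. The ranges, the synthetic data and the names `param_distributions` and `search` are assumptions for illustration, not the search space used in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the processed features and target (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search space; the ranges are illustrative only.
param_distributions = {
    "n_estimators": randint(20, 60),     # integers in [20, 60)
    "max_depth": randint(2, 5),          # integers in [2, 5)
    "learning_rate": [0.05, 0.1, 0.2],   # a list is sampled uniformly
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With the default `refit=True`, `search.best_estimator_` is refitted on all the data passed to `fit` and can be used directly for predictions.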


In [None]:
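# Use the refitted best estimator from the search; the `clf` passed to
# RandomizedSearchCV is cloned internally and is not itself fitted.
best_model = randomized_search.best_estimator_
features_importance = pd.Series(index=X.columns, data=best_model.feature_importances_)
features_importance.nlargest(20).plot.barh()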

In [126]:
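result = test.copy()
X_submit = read_and_process(test)
# Average the positive-class probabilities of two classifiers.
# NOTE: `clf2` is not defined in the cells shown here; both `clf` and `clf2`
# must already be fitted models.
y_pred = (clf.predict_proba(X_submit)[:, 1] + clf2.predict_proba(X_submit)[:, 1]) / 2
result['TomorrowRainForecast'] = y_pred
result[['ID', 'TomorrowRainForecast']].to_csv('../data/submission5.csv', index=False)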
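The averaging step in the cell above can be tried end to end with the self-contained sketch below. It fits two different classifiers on synthetic data, averages their positive-class probabilities and builds a frame with the `ID` and `TomorrowRainForecast` columns. The models `model_a` and `model_b`, the IDs and the data are hypothetical stand-ins; the second model used in the notebook is not shown, so this is not a reconstruction of it.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data: the first 200 rows play the role of `train`, the rest of `test`.
X_all, y_all = make_classification(n_samples=300, n_features=6, random_state=1)
X_fit, y_fit, X_new = X_all[:200], y_all[:200], X_all[200:]
ids = np.arange(1000, 1100)  # made-up IDs for the 100 unlabeled rows

# Two different model families, fitted on the same data.
model_a = GradientBoostingClassifier(n_estimators=50, random_state=1).fit(X_fit, y_fit)
model_b = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_fit, y_fit)

# Simple average of the two positive-class probabilities.
proba = (model_a.predict_proba(X_new)[:, 1] + model_b.predict_proba(X_new)[:, 1]) / 2

submission = pd.DataFrame({"ID": ids, "TomorrowRainForecast": proba})

# to_csv without a path returns the CSV text, handy for a quick look.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
print(submission.head())
```

Keeping `ID` next to each prediction, as the notebook does, makes it easy to confirm that every row of `test` received exactly one score before uploading.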