# Will it rain tomorrow in Australia?

In this notebook we are going to analyze a dataset about weather in Australia. The dataset contains, for a number of ```Date - Location``` pairs, a collection of columns with weather information about that day in that location. One of those columns is ```RainTomorrow```, indicating whether it rained in that location on the following day.

Our goal is to build a model that predicts this ```RainTomorrow``` column. The data correspond to several values of ```Location``` all across Australia, and we want the predictive model to work well for each of these values.

Throughout this notebook we discuss several aspects of the data, providing some ideas for exploration and data preprocessing, and testing a couple of classification models.

* 1- [Exploratory Data Analysis](#eda): First steps in data exploration. We explore string and numeric data separately.
* 2- [Feature Transformation](#ft): We perform some feature engineering tasks, including transforming some string variables to numeric.
* 3- [Exploratory Data Analysis (reprise)](#eda-r): We perform some additional exploration steps after transforming some string variables to numeric.
* 4- [Missing Data Handling](#mdh): We explain what we do with missing data.
* 5- [Model Training](#mt): We try to predict ```RainTomorrow``` by using two types of algorithms: **Logistic Regression** and **XGBoost**.
* 6- [Conclusions](#con): We give some final impressions and outline some possibilities for future work.


First of all, we import the libraries we need.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

And we load the data:

In [None]:
data_folder = '/kaggle/input/weather-dataset-rattle-package'
data_file = 'weatherAUS.csv'

data_path = os.path.join(data_folder, data_file)

df = pd.read_csv(data_path)

## Exploratory Data Analysis <a name="eda"></a>

We perform a first EDA to get an idea of what the data look like. We show the data and their dimensions:

In [None]:
df

In [None]:
df.shape

The variable ```RISK_MM``` measures how much it will rain tomorrow, so the value of ```RainTomorrow``` can be inferred from it. As the following summary shows, the value of ```RainTomorrow``` depends on whether ```RISK_MM``` is greater than 1.0 or not.

In [None]:
df.groupby('RainTomorrow')['RISK_MM'].describe()

That is why we will not take this column into account when we predict ```RainTomorrow```.
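
As a quick sanity check of the rule above, the following sketch compares the label with the thresholded rainfall. The helper name `label_matches_threshold` and the toy values are illustrative assumptions, not part of the dataset; on the real data one would pass `df` instead of `toy`.

```python
import pandas as pd

def label_matches_threshold(frame, threshold=1.0):
    """Return True if RainTomorrow == 'Yes' exactly when RISK_MM > threshold."""
    derived = (frame['RISK_MM'] > threshold).map({True: 'Yes', False: 'No'})
    return bool((derived == frame['RainTomorrow']).all())

# Toy data for illustration only (not taken from weatherAUS.csv)
toy = pd.DataFrame({
    'RISK_MM': [0.0, 0.8, 1.0, 1.2, 15.4],
    'RainTomorrow': ['No', 'No', 'No', 'Yes', 'Yes'],
})

print(label_matches_threshold(toy))
```

If the check holds on the full dataset, ```RISK_MM``` leaks the label and has to stay out of the feature set.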

We have 142193 rows and 23 columns. Let us see the type of each column.

In [None]:
df.dtypes

We store the names of the text and numeric variables separately.

In [None]:
num_vars = df.columns[df.dtypes == 'float'] ## We get the numeric vars
str_vars = df.columns[df.dtypes == 'object'] ## We get the string vars


### Exploratory Analysis of Numerical Variables

In this section we are going to study the variables whose type is ```float64```.

#### Basic Statistics

First of all we are going to show the basic statistics for each variable.

In [None]:
df.describe()

For now we do not see any suspicious value. It is true that some variables (for instance ```Rainfall```) show a large gap between the mean and the maximum, even taking the standard deviation into account. However, since these values represent weather information for a total of 49 different locations in a big country like Australia, we have no reason to be suspicious.

Let us see the distribution of these numeric variables.

We can see how some variables have an almost Gaussian shape, while others are more asymmetric. Some of them have many extreme outliers, while others have milder ones. Let us see two examples:

In [None]:
vars_to_show = ['Pressure9am', 'Humidity3pm']

plot, ax = plt.subplots(ncols=2, nrows=len(vars_to_show), figsize=(20, 15))

for n, v in enumerate(vars_to_show):
    
    df[v].plot(ax = ax[n, 0], kind='hist', bins=50, title=v)
    df.boxplot(v, ax = ax[n, 1])

(Expand the hidden cell below to see these plots for all the numerical variables.)

In [None]:
plot, ax = plt.subplots(ncols=2, nrows=len(num_vars), figsize=(20, 80))

for n, v in enumerate(num_vars):
    
    df[v].plot(ax = ax[n, 0], kind='hist', bins=50, title=v)
    df.boxplot(v, ax = ax[n, 1])

#### Do we remove outliers?

We are talking about weather variables for a big country like Australia, so it is normal for these variables to show high variability. We therefore do not dare to say that any of these outliers is not believable. Nonetheless, even if they are real, it might be worth leaving them out, since they can have too much impact on the model. Let us see how many outliers each numerical column has, using the $|x - \mu| > 3\sigma$ criterion.

In [None]:
df_num = df[num_vars]

df_is_outlier = ~df_num[np.abs(df_num - df_num.mean()) > 3 * df_num.std()].isna()
(df_is_outlier).sum()

And now see how many rows have at least one outlier:

In [None]:
row_has_outlier = df_is_outlier.sum(axis=1) > 0
row_has_outlier.sum()  # number of rows with at least one outlier

Only 9006 rows out of 142193 contain outliers, so removing them might not seem a big deal. However, there might be some good reasons to keep them. For instance, these outliers might be highly concentrated in one of the locations, making it very hard to predict ```RainTomorrow``` for that location if we remove them.

The following computation shows that, by removing the outliers, we could be removing more than 30% of the *Yes* instances of ```RainTomorrow``` for some locations. This would be inconvenient (later we will see that this amount might not be such a big number), so we will keep the values.

In [None]:
df_check_outlier = pd.DataFrame({'Location': df['Location'], 'RainTomorrow': df['RainTomorrow'],'is_out': df_is_outlier.sum(axis=1) > 0, 'total': df_is_outlier.sum(axis=1) > -1})
df_check_prop = df_check_outlier.groupby(['Location', 'RainTomorrow']).sum().sort_values('is_out')
(df_check_prop['is_out'] / df_check_prop['total']).reset_index().sort_values(0)

#### Correlations between numerical variables

Let us plot a correlation heatmap of the numerical variables (we do not plot the pairwise scatter plots because they overload the engine).

In [None]:
sns.heatmap(df[num_vars].corr())

We can see some expected correlations, like positive correlation between minimal and maximal temperature the same day, and negative correlation between sunshine and cloudiness. We will see later if this can help to reduce the dimensionality of the data.

### Exploratory Analysis of String Variables

We are going to see the different values of the string variables.

In [None]:
for v in str_vars:
    print('Different values of', v, '\n ')
    print(df[v].value_counts())
    print('\n \n \n')

We can get a great deal of information from here.

* Regarding ```Date``` and ```Location```, the combination of these columns forms the key that identifies each row.
* ```Date``` has to be parsed from string to datetime.
* ```Location``` has 49 different values and most of them seem to be reasonably balanced, which should allow us to build models that (we expect) will work well independently for each location.
* ```WindGustDir```, ```WindDir9am``` and ```WindDir3pm``` represent wind directions. Taking one reference direction as angle 0, these directions can be translated into angles.
* ```RainToday``` and ```RainTomorrow``` have *Yes* and *No* values.

Before we explore the relation of ```RainTomorrow``` with the other variables, we must perform some variable transformations to convert these variables to numbers in a suitable way.

## Feature Transformation <a name="ft"></a>

Let us start by addressing the variable ```Date```. First of all we have to parse it from text to datetime.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

If we want to determine whether it will rain, the date is relevant, but not in the form in which it is given. First of all, we are only interested in the position of the date within the year, so 2006 Jan 1st is *the same* as 2007 Jan 1st. Then, we need to take into account that the date is cyclical, since Jan 1st comes right after Dec 31st. That is why we apply the following transformations:

In [None]:
df['day_num'] = df['Date'].dt.dayofyear  # Day of the year (1 to 366)
df['day_num_angle'] = df['day_num'] / 365 * 2 * np.pi  # Day of the year expressed as an angle
df['day_num_sin'] = np.sin(df['day_num_angle'])  # Sine and cosine of that angle
df['day_num_cos'] = np.cos(df['day_num_angle'])
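
To illustrate why the sine and cosine pair is useful, here is a minimal sketch (the helpers `encode_day` and `dist` are hypothetical names introduced only for this example). It measures how far apart two days are once they are placed on the unit circle.

```python
import numpy as np

def encode_day(day_num, period=365):
    """Map a day-of-year number to a point (sin, cos) on the unit circle."""
    angle = day_num / period * 2 * np.pi
    return np.array([np.sin(angle), np.cos(angle)])

def dist(day_a, day_b):
    """Euclidean distance between the encodings of two days of the year."""
    return float(np.linalg.norm(encode_day(day_a) - encode_day(day_b)))

print('Dec 31st vs Jan 1st:', dist(365, 1))
print('Jan 1st vs Jan 2nd :', dist(1, 2))
print('Jan 1st vs Jul 2nd :', dist(1, 183))
```

With the raw day number, Dec 31st and Jan 1st would be 364 units apart; with this encoding they are as close as any other pair of consecutive days, while days half a year apart sit on opposite sides of the circle.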

This way we have encoded the *periodicity* of the date. Finally, since there might be differences from one year to another, we also keep the year of the date.

In [None]:
df['Year'] = df['Date'].dt.year

Now, let us see the wind direction variables. These are ```WindGustDir```, ```WindDir9am```, ```WindDir3pm```.

We will transform these into angles according to the list below, taking the 0 angle at **E** and going counterclockwise. We map each direction to its angle and create two variables for each one: its sine and its cosine.

In [None]:
wind_dir_vars = ['WindGustDir', 'WindDir9am', 'WindDir3pm']

dir_list = ['E', 'ENE', 'NE', 'NNE', 'N', 'NNW', 'NW', 'WNW', 'W', 'WSW', 'SW', 'SSW', 'S', 'SSE', 'SE', 'ESE']
ang_rad_list = [i * np.pi / 8 for i in range(16)]

wind_dir_map = dict(zip(dir_list, ang_rad_list))

for v in wind_dir_vars:
    v_name_sin = v + '_sin'
    v_name_cos = v + '_cos'
    df[v_name_sin] = np.sin(df[v].map(wind_dir_map))
    df[v_name_cos] = np.cos(df[v].map(wind_dir_map))
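
The following self-contained sketch checks that the mapping behaves as intended. It rebuilds the same direction list and dictionary as above; `to_vec` is a hypothetical helper introduced only for this check.

```python
import numpy as np

dir_list = ['E', 'ENE', 'NE', 'NNE', 'N', 'NNW', 'NW', 'WNW',
            'W', 'WSW', 'SW', 'SSW', 'S', 'SSE', 'SE', 'ESE']
wind_dir_map = {d: i * np.pi / 8 for i, d in enumerate(dir_list)}

def to_vec(direction):
    """Return the (sin, cos) pair used as features for a compass direction."""
    angle = wind_dir_map[direction]
    return np.array([np.sin(angle), np.cos(angle)])

# Opposite compass points are 8 steps apart in dir_list, so their vectors should cancel
for i, d in enumerate(dir_list):
    opposite = dir_list[(i + 8) % 16]
    assert np.allclose(to_vec(d) + to_vec(opposite), 0.0)

print('N ->', to_vec('N').round(3), ' S ->', to_vec('S').round(3))
```

Note that a missing direction stays missing after the mapping, so the sine and cosine columns inherit the nulls of the original column.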

Finally we are going to parse ```RainToday``` and ```RainTomorrow``` as 0 and 1.

In [None]:
df['RainToday_num'] = df['RainToday'].map({'Yes': 1, 'No': 0})
df['RainTomorrow_num'] = df['RainTomorrow'].map({'Yes': 1, 'No': 0})

Ok, now we keep this new set of variables.

In [None]:
feat_cols = ['Year', 'day_num_sin', 'day_num_cos', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir_sin', 'WindGustDir_cos', 'WindGustSpeed', 'WindDir9am_sin', 'WindDir9am_cos', 'WindDir3pm_sin', 'WindDir3pm_cos',
       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday_num']

label_col = 'RainTomorrow_num'

Apart from ```Location```, all the input variables are now numeric. Let us go back to data exploration.

## Exploratory Data Analysis (Reprise) <a name="eda-r"></a>

Let us see density plots of each numerical feature for the two different values of ```RainTomorrow```.

In [None]:
feat_no_location = [c for c in feat_cols if c != 'Location']

for v in feat_no_location:
    plt.figure()
    df.groupby('RainTomorrow')[v].plot(kind='density', legend=True, title=v)

Now let us see the density of ```RainTomorrow``` by ```Location```.

In [None]:
locs = df['Location'].unique()

for l in locs:
    plt.figure()
    df[df['Location'] == l]['RainTomorrow_num'].plot(kind='hist', legend=True, title=l)

A better way to visualize this (excluding the location name) is to see the percentage of ```RainTomorrow == 'Yes'``` for the different cities.

In [None]:
plt.plot(df.groupby('Location')['RainTomorrow_num'].sum() / df.groupby('Location')['RainTomorrow_num'].count() * 100)

We can see how some variables have different distributions depending on the value of ```RainTomorrow```. Some of these differences, like those in the cloud or temperature variables, make intuitive sense. ```RainToday``` appears to be strongly associated with ```RainTomorrow```.
We can also see that ```RainTomorrow``` has a different proportion of positive values depending on ```Location```.

Finally, we are going to plot the density of all the variables by value of ```RainTomorrow``` and also by ```Location``` (the output is hidden, since it would be too much to show).

In [None]:
feat_no_location = [c for c in feat_cols if c != 'Location']
locs = df['Location'].unique()


for v in feat_no_location:
    print('Variable:', v)

    min_val = df[v].min()
    max_val = df[v].max()
    for l in locs:  
        plt.figure()
        try:
            df[df['Location'] == l].groupby('RainTomorrow')[v].plot(kind='density', legend=True, title=v + ' ' + l, xlim=(min_val, max_val))
            plt.show()
        except ValueError:
            print('There was no non-null value for Location', l, 'and variable', v)
            plt.close()

## Missing Data Handling <a name="mdh"></a>

It seems that there are some locations for which some columns have no non-null values. 

In [None]:
df.groupby('Location').aggregate(lambda x: (~x.isna()).sum())

Let us see all the cases:

In [None]:
stack_table = df.groupby('Location').aggregate(lambda x: (~x.isna()).sum()).stack()
stack_table = stack_table[stack_table == 0]

stack_table.reset_index().groupby('Location')['level_1'].agg([list, 'count']).reset_index()

We have 22 locations for which at least one column is all nulls.

Let us see them in a heatmap (black means all the values are missing).

In [None]:
st_reset = stack_table.reset_index()
st_reset.columns = ['Location', 'var', 'val']
st_piv = st_reset.pivot(index='Location', columns='var', values='val')
st_piv.iloc[:, :] = np.where(np.isnan(st_piv), 1 , 0) 

plt.figure(figsize=(10, 10))
sns.heatmap(st_piv, linewidths=1)

We now want to impute the missing values.

In [None]:
print('Number of rows:', df.shape[0])
print('Number of rows with some null value:', df.isna().sum(axis=1)[df.isna().sum(axis=1) > 0].shape[0])

More than half of the rows have some null column. Besides that, there are locations for which some columns are completely null. That is why removing all the rows with some null value is not a good strategy. On the other hand, all the columns except ```Date```, ```Location``` and ```RainTomorrow``` have some null value, so we cannot remove every column that has nulls either.

In [None]:
df.isna().sum(axis=0)[df.isna().sum(axis=0) > 0]

We are going to do the following:

* We are going to assign, to each date, its quarter of the year.
* For each null value, we will use the mean value of that column for that location during that quarter of that year.
* If no non-null value is available for that location in that quarter of that year, we will use the mean value of that column for that location during that quarter over all the years.
* If no non-null value is available for that location in that quarter in any year, we will use the mean value for that quarter of that year over all the locations.

**Careful!!** It might take a while!

In [None]:
df['Quarter'] = df['Date'].dt.quarter

# Float columns to impute
num_vars_2 = [c for c in df.columns[df.dtypes == 'float'] if c not in ['Date', 'Quarter']]

# Group means at the three levels described above. They are all computed from
# the original (not yet imputed) data and aligned row by row with df.
# Only the numeric columns are aggregated, so text columns cannot break the mean.
mean_1 = df.groupby(['Year', 'Quarter', 'Location'])[num_vars_2].transform('mean')
mean_2 = df.groupby(['Quarter', 'Location'])[num_vars_2].transform('mean')
mean_3 = df.groupby(['Quarter', 'Year'])[num_vars_2].transform('mean')

# Fill each null with the first available mean, from most to least specific
df[num_vars_2] = df[num_vars_2].fillna(mean_1).fillna(mean_2).fillna(mean_3)
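
To make the fallback order concrete, here is a small sketch on a toy frame. The column `Temp`, the locations `A`, `B`, `C` and the helper `impute_hierarchical` are assumptions made up for this illustration; it uses `groupby(...).transform('mean')` so that each group mean is aligned with the original rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Year':     [2010, 2010, 2012, 2011, 2011, 2011],
    'Quarter':  [1, 1, 1, 1, 1, 1],
    'Location': ['A', 'A', 'A', 'A', 'B', 'C'],
    'Temp':     [10.0, np.nan, 20.0, np.nan, np.nan, 30.0],
})

def impute_hierarchical(frame, cols):
    """Fill nulls with group means, from the most specific grouping to the least."""
    levels = [['Year', 'Quarter', 'Location'], ['Quarter', 'Location'], ['Quarter', 'Year']]
    # All group means come from the original values, before any filling
    means = [frame.groupby(keys)[cols].transform('mean') for keys in levels]
    filled = frame[cols]
    for level_mean in means:
        filled = filled.fillna(level_mean)
    return filled

print(impute_hierarchical(toy, ['Temp'])['Temp'].tolist())
```

Row 1 is filled from the same location, quarter and year; row 3 falls back to location `A` in that quarter over all years; row 4 (location `B`, which has no values at all) falls back to the mean of that quarter and year over all locations.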

We are now ready to try to find a good model.

In [None]:
feat_cols = ['Year', 'day_num_sin', 'day_num_cos', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir_sin', 'WindGustDir_cos', 'WindGustSpeed', 'WindDir9am_sin', 'WindDir9am_cos', 'WindDir3pm_sin', 'WindDir3pm_cos',
       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday_num']

label_col = 'RainTomorrow_num'


## Model Training <a name="mt"></a>

We want to train a model that is able to predict the value of ```RainTomorrow_num``` from the rest of the feature variables.

We would like to train a model using only purely meteorological variables. That is, we exclude the chronological variables (```Year```, ```day_num_sin``` and ```day_num_cos```) and ```Location```. We exclude ```Location``` to prevent the model from becoming, *de facto*, a collection of models, one per location, which would make it hard to generalize to possible new locations in the future.

Then, our variables are the following:

In [None]:
feats_exclude = ['Year', 'day_num_sin', 'day_num_cos', 'Location']
feat_cols_model = [f for f in feat_cols if f not in feats_exclude]

### Feature Selection 

Our first step is to reduce the dimensionality of the input data. To do that, we use the $\chi^2$-test to determine what variables are more relevant.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# We add this offset since chi-sq requires non-negative values
offset = 1e6

X = df[feat_cols_model] + offset
y = df[label_col]
scores = chi2(X, y)[0]

sorted_vars = [var for _, var in sorted(zip(-scores, feat_cols_model))]

m_scores = -scores
m_scores.sort()

df_scores = pd.DataFrame({'var': sorted_vars, 'score': -m_scores})
df_scores

We will take a few of the most important variables, and then check how keeping only those compares with keeping all of them.

### Model Selection

We will split the dataset into training and test sets, in a 70%-30% proportion.

#### XGBoost

We try first with an XGBoost classifier.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[feat_cols_model], df[label_col], test_size=0.3, random_state=42)

Let us now train a model by using all the features:

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier()

# fit the model with the training data
model.fit(X_train, y_train)
 
# predict the target on the test dataset
predict_test = model.predict(X_test)

Let us show the metrics:

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy_test = accuracy_score(y_test, predict_test)
conf_mat = confusion_matrix(y_test, predict_test, labels=[1, 0])

print('Accuracy of the model: ', accuracy_test)
print('Confusion matrix:\n', conf_mat)

tpr = conf_mat[0, 0] / conf_mat[0, :].sum()
tnr = conf_mat[1, 1] / conf_mat[1, :].sum()

print('True positive rate:', tpr)
print('True negative rate:', tnr)

Let us see how the model performs when we keep only a number of the most relevant features.

In [None]:
df_metrics = pd.DataFrame({})

for i, _ in enumerate(sorted_vars):
    
    n_feats = i + 1

    important_feats = sorted_vars[:n_feats]
    
    model = XGBClassifier()

    # fit the model with the training data
    model.fit(X_train[important_feats], y_train)
 
    # predict the target on the test dataset
    predict_test = model.predict(X_test[important_feats])

    accuracy_test = accuracy_score(y_test, predict_test)
    conf_mat = confusion_matrix(y_test, predict_test, labels=[1, 0])

    tpr = conf_mat[0, 0] / conf_mat[0, :].sum()
    tnr = conf_mat[1, 1] / conf_mat[1, :].sum()
    
    df_aux = pd.DataFrame({'n_feats': [n_feats], 'accuracy': [accuracy_test],
                           'true_pos_rate': [tpr], 'true_neg_rate': [tnr]})

    df_metrics = pd.concat([df_metrics, df_aux], axis=0)
    
df_metrics

As we can see, the true positive rate is very low, and each feature is important in the model to increase this rate.

In [None]:
line0, = plt.plot(df_metrics['n_feats'], df_metrics['accuracy'], label='Accuracy')
line1, = plt.plot(df_metrics['n_feats'], df_metrics['true_pos_rate'], label='True Positive Rate')
line2, = plt.plot(df_metrics['n_feats'], df_metrics['true_neg_rate'], label='True Negative Rate')
legend = plt.legend(handles=[line0, line1, line2], loc='upper right')
ax = plt.gca().add_artist(legend)
plt.ylim(0, 1)



#### Logistic Regression

Since XGBoost was not a good choice, we will try a simpler model, **logistic regression**, since it allows us to tune the probability threshold in order to trade specificity for sensitivity.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve


model = LogisticRegression()

# fit the model with the training data
model.fit(X_train, y_train)
 
# predict the target on the test dataset
predict_test_proba = model.predict_proba(X_test)

Let us see the ROC:

In [None]:
fpr, tpr, thresholds = roc_curve(y_test,predict_test_proba[:,1])

plt.plot(fpr, tpr)
plt.ylabel('TPR')
plt.xlabel('1-TNR')


Let us choose a probability threshold that maximizes the product between the True Positive Rate and the True Negative Rate.

In [None]:
from sklearn.preprocessing import binarize

which_th = ((1 - fpr) * tpr).argmax()

print('Threshold:', thresholds[which_th])
print('True Positive Rate:', tpr[which_th])
print('True Negative Rate:', 1 - fpr[which_th])


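# The threshold is hard-coded to 0.25 here; use thresholds[which_th] to apply the value selected above
y_pred_class = binarize(predict_test_proba, threshold=0.25)[:,1]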
accuracy = accuracy_score(y_test,y_pred_class)

print('Accuracy:', accuracy)

Now, we have sacrificed some specificity (true negative rate) and accuracy in order to improve our sensitivity (true positive rate).
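
The trade-off can be reproduced on a handful of made-up scores. The sketch below uses only NumPy; the helper `rates_at` and the toy labels and probabilities are assumptions for illustration, not results from the model above.

```python
import numpy as np

def rates_at(y_true, proba, threshold):
    """True positive and true negative rate when predicting 1 if proba >= threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(proba) >= threshold).astype(int)
    tpr = ((pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
    tnr = ((pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()
    return tpr, tnr

# Toy labels and scores, for illustration only
y_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
p_toy = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.28, 0.40, 0.60, 0.90]

for th in (0.5, 0.25):
    tpr, tnr = rates_at(y_toy, p_toy, th)
    print(f'threshold={th}: TPR={tpr:.2f}, TNR={tnr:.2f}, product={tpr * tnr:.2f}')
```

Lowering the threshold catches more of the positive instances at the cost of some false alarms; the product of both rates is one possible way to balance them.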

Let us try to do this with a more limited number of features:

In [None]:
df_metrics = pd.DataFrame({})

for i, _ in enumerate(sorted_vars):
    
    n_feats = i + 1

    important_feats = sorted_vars[:n_feats]
    
    model = LogisticRegression()

    # fit the model with the training data
    model.fit(X_train[important_feats], y_train)
 
    # predict the target on the test dataset
    predict_test_proba = model.predict_proba(X_test[important_feats])
    predict_test = y_pred_class = binarize(predict_test_proba, threshold=0.25)[:,1]
    
    accuracy_test = accuracy_score(y_test,y_pred_class)
    
    fpr, tpr, thresholds = roc_curve(y_test, predict_test_proba[:,1])

    which_th = ((1 - fpr) * tpr).argmax()
    
    threshold = thresholds[which_th]
    tr_pos_rate = tpr[which_th]
    tr_neg_rate = 1 - fpr[which_th]
    
    df_aux = pd.DataFrame({'n_feats': [n_feats], 'accuracy': [accuracy_test],
                           'true_pos_rate': [tr_pos_rate], 'true_neg_rate': [tr_neg_rate],
                           'threshold': [threshold]})

    df_metrics = pd.concat([df_metrics, df_aux], axis=0)
    
df_metrics

Now, we can see that a single feature already gives decent predictions, but generally, each new feature improves the model.

For now we are not interested in improving the model any further. According to [this kernel](https://www.kaggle.com/michaelcao/will-it-rain), **random forests** outperform logistic regression in every aspect.

### Interpretation: Why is there no perfect model?

When we trained our **XGBoost** model we saw that it was easy to obtain high specificity but not high sensitivity. By switching to **logistic regression** we could trade some specificity for some sensitivity. We are going to study why this happens.

Judging by accuracy alone, our XGBoost model was good enough. This is because the number of instances for each value of ```RainTomorrow``` is highly unbalanced.

In [None]:
df['RainTomorrow'].value_counts() / df.shape[0]

That is, a model that always predicts that it will not rain tomorrow will obtain an accuracy of 77.58%. Considering this, the fact that our model has a true positive rate greater than 70% is remarkable.
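
The effect of class imbalance on accuracy is easy to reproduce. The sketch below scores a constant "majority class" predictor on toy labels with roughly the balance discussed above; the helper `majority_baseline` and the 78/22 split are illustrative assumptions, not the real data.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy, TPR and TNR of a model that always predicts the most frequent class."""
    y = np.asarray(y)
    majority = int(y.mean() >= 0.5)
    pred = np.full_like(y, majority)
    accuracy = (pred == y).mean()
    tpr = (pred[y == 1] == 1).mean()
    tnr = (pred[y == 0] == 0).mean()
    return accuracy, tpr, tnr

# Toy labels: 78 dry days and 22 rainy days (illustrative, not the real data)
y_toy = np.array([0] * 78 + [1] * 22)
print(majority_baseline(y_toy))
```

The constant predictor reaches a high accuracy while never detecting a rainy day, which is why accuracy alone is a poor yardstick here.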

We have tried several ways (not included) to increase this true positive rate while preserving the true negative rate, without success. Our interpretation is the following: the boolean variable ```RainTomorrow``` is defined by whether ```RISK_MM``` is greater than 1.0. Therefore instances where the values of ```RISK_MM``` are 1.0 and 1.1 will have different ```RainTomorrow``` values. Because of this, predicting ```RainTomorrow``` when ```RISK_MM``` is close to 1.0 is very difficult, since it is equivalent to predicting ```RISK_MM``` with very high precision. We will see how the ability of the model to accurately predict that ```RainTomorrow``` is ```Yes``` increases as ```RISK_MM``` grows.

In [None]:
from math import ceil

model = XGBClassifier()
model.fit(X_train, y_train)

test_df = df.loc[X_test.index]  # X_test.index holds labels, so we select by label

max_RISK = ceil(test_df['RISK_MM'].max())

df_tpr = pd.DataFrame()

for i in range(max_RISK):
        
    test_df_filt = test_df[(test_df['RISK_MM'] >= i) & (test_df['RainTomorrow_num'] == 1)]
    
    test_df_labels = test_df_filt[label_col]
    
    predictions = model.predict(test_df_filt[feat_cols_model])

    good_preds = (test_df_labels == predictions).sum()
    total_preds = predictions.shape[0]

    df_aux = pd.DataFrame({'min_RISK_MM': [i], 'true_positive_rate': [good_preds / total_preds]})

    df_tpr = pd.concat([df_tpr, df_aux], axis=0)
    
df_tpr.reset_index(inplace=True, drop=True)

In [None]:
plt.plot(df_tpr['min_RISK_MM'], df_tpr['true_positive_rate'])
plt.xlabel('minimal RISK_MM')
plt.ylabel('True Positive Rate')
plt.ylim(0.5, 1.1)

The plot shows the True Positive Rate of the model if we limit the testing set to instances where ```RISK_MM``` is at least a given amount. We can see how the original true positive rate was very low (~0.54), but if we condition our testing set to have ```RISK_MM``` greater than 10 it rises to ~0.72, and to ~0.78 if the lower limit of ```RISK_MM``` is 20.

We can also see a range of values of ```min_RISK_MM``` for which the true positive rate decreases. This is because those proportions are calculated over a very small set of instances, so a single missed positive (false negative) can have a great impact on the rate.

## Conclusions <a name="con"></a>

The problem of predicting ```RainTomorrow``` is ill-conceived, since ```RainTomorrow``` is a category based on an arbitrary threshold on the variable ```RISK_MM```. That is why it is so hard to predict ```RainTomorrow``` for borderline values.

Furthermore, the affirmative instances of ```RainTomorrow``` cover a very wide variety of situations. We have some suggestions to improve the way the problem is posed:

* To pose a **regression** problem where we try to predict ```RISK_MM```. The problem is that in more than half of the instances the value of this variable is 0, and the rest tend to be concentrated around very low values. The higher values would be extreme outliers and would be very hard to predict, unless a very good resampling job is done.

* To pose the problem as a matter of **ordinal classification**, where, based on ```RISK_MM```, we split its domain into a series of categories. Of course, we would still have the problem of arbitrary thresholds, but we could arrange them in a way that makes these categories more meaningful.
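
As a minimal sketch of the ordinal idea, ```RISK_MM``` could be binned with `pd.cut`. The cut points and the category names below are hypothetical choices made only for illustration; only the first edge (1.0) comes from the ```RainTomorrow``` rule discussed in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical cut points in mm; apart from 1.0 they are arbitrary illustrative choices
bins = [-np.inf, 1.0, 10.0, 25.0, np.inf]
labels = ['none', 'light', 'moderate', 'heavy']

risk_mm = pd.Series([0.0, 1.0, 1.2, 9.9, 10.0, 18.0, 60.0])  # toy values
rain_class = pd.cut(risk_mm, bins=bins, labels=labels)  # intervals are right-closed

print(rain_class.tolist())
print(rain_class.cat.codes.tolist())
```

The integer codes preserve the order of the categories, so they could serve as the target of an ordinal model.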

Additionally, we would like to propose a different problem for future work:

* Provided we have a classification model whose performance we are happy with, it would be interesting to check whether it works purely based on *physical* variables, disregarding the location. A way to evaluate this would be to train a model using the data from some values of ```Location``` and evaluate it on instances from the remaining values of ```Location```. This way we would know whether, given a high enough variety of situations when training the model, the model would work well for data coming from new locations.
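
A split of this kind could look like the sketch below, where whole locations are held out for testing. The helper `split_by_location`, its parameters and the toy frame are assumptions for illustration, not part of the analysis above.

```python
import numpy as np
import pandas as pd

def split_by_location(frame, test_fraction=0.3, seed=42):
    """Split rows so that no Location appears in both the train and the test part."""
    rng = np.random.default_rng(seed)
    locations = np.sort(frame['Location'].unique())
    n_test = max(1, int(round(test_fraction * len(locations))))
    test_locations = rng.choice(locations, size=n_test, replace=False)
    is_test = frame['Location'].isin(test_locations)
    return frame[~is_test], frame[is_test]

# Toy frame for illustration only
toy = pd.DataFrame({
    'Location': ['Sydney', 'Sydney', 'Perth', 'Perth', 'Darwin', 'Darwin', 'Hobart', 'Hobart'],
    'RainTomorrow_num': [0, 1, 0, 0, 1, 1, 0, 1],
})

train_part, test_part = split_by_location(toy, test_fraction=0.25)
print(sorted(train_part['Location'].unique()), sorted(test_part['Location'].unique()))
```

scikit-learn's `GroupShuffleSplit` implements the same idea and would be a natural choice when repeated splits are needed.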