> Hi! I'm Mauricio Ruanova. My friends call me Mau. I am a Data Science and Machine Learning Engineer. 

This notebook contains a heuristic prediction for the weather in Seattle.

It uses a dataset that contains the complete records of daily rainfall patterns from January 1, 1948 to December 12, 2017.

Maybe if it rained yesterday and it is raining today, then it is likely to rain tomorrow.

But how much can we predict using the numbers provided?

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.mode.chained_assignment = None  # default='warn'
from sklearn.model_selection import train_test_split 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('../input/did-it-rain-in-seattle-19482017/seattleWeather_1948-2017.csv')
df.shape

In [None]:
df.head(3)

In [None]:
numrows = df.shape[0] - 2 # 25551-2 = 25549
numrows

In [None]:
# create an empty dataframe
heuristic_df = pd.DataFrame({'yesterday':[0.0]*numrows,
                             'today':[0.0]*numrows,
                             'tomorrow':[0.0]*numrows,
                             'guess':[False]*numrows, #logical guess
                             'rain_tomorrow':[False]*numrows, #historical observation
                             'correct':[False]*numrows, #TRUE if your guess matches the historical observation
                             'true_positive':[False]*numrows, #TRUE If you said it would rain and it did
                             'false_positive':[False]*numrows,#TRUE if you said it would rain and it didn't
                             'true_negative':[False]*numrows, #TRUE if you said it wouldn't rain and it didn't
                             'false_negative':[False]*numrows}) #TRUE if you said it wouldn't rain and it did
heuristic_df.shape

In [None]:
heuristic_df.head(3)

### Build a loop to add your heuristic model guesses as a column to this dataframe
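Here is a loop that populates the dataframe created earlier with the total precipitation from yesterday and today.

The example heuristic sets the guess to True if it rained both yesterday and today. The loop below uses custom thresholds instead (`yesterday >= 0.9 or today >= 0.05`).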
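
The example rule (rain on both previous days) can also be sketched without an explicit loop. This is a minimal illustration on made-up precipitation values, not the Seattle data; the helper name `label_heuristic` and the sample series are assumptions introduced here.

```python
import pandas as pd

def label_heuristic(prcp: pd.Series) -> pd.DataFrame:
    """Build yesterday/today/tomorrow columns and score a simple rain guess.

    Hypothetical helper for illustration; `prcp` is daily precipitation in date order.
    """
    out = pd.DataFrame({
        "yesterday": prcp.shift(2),  # value two days before "tomorrow"
        "today": prcp.shift(1),      # value one day before "tomorrow"
        "tomorrow": prcp,
    }).dropna().reset_index(drop=True)
    # Example rule: guess rain if both previous days had any precipitation.
    out["guess"] = (out["yesterday"] > 0) & (out["today"] > 0)
    out["rain_tomorrow"] = out["tomorrow"] > 0
    out["correct"] = out["guess"] == out["rain_tomorrow"]
    out["true_positive"] = out["guess"] & out["rain_tomorrow"]
    out["false_positive"] = out["guess"] & ~out["rain_tomorrow"]
    out["true_negative"] = ~out["guess"] & ~out["rain_tomorrow"]
    out["false_negative"] = ~out["guess"] & out["rain_tomorrow"]
    return out

# Made-up precipitation values, one per day.
prcp = pd.Series([0.0, 0.3, 0.1, 0.0, 0.2, 0.4, 0.5])
labelled = label_heuristic(prcp)
print(labelled[["yesterday", "today", "tomorrow", "guess", "rain_tomorrow", "correct"]])
```

The first two days are dropped because they have no "yesterday" or "today" value, which is the same reason the loop starts at row 2.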

In [None]:
for z in range(numrows):
    # start at time 2 in the data frame
    i = z + 2
    # pull values from the dataframe
    yesterday = df.iloc[(i-2),1]
    today = df.iloc[(i-1),1]
    tomorrow = df.iloc[i,1]
    rain_tomorrow = tomorrow > 0 # historical observation as a boolean: did it rain tomorrow?
    heuristic_df.iat[z,0] = yesterday
    heuristic_df.iat[z,1] = today
    heuristic_df.iat[z,2] = tomorrow
    heuristic_df.iat[z,3] = False # set guess default to False
    heuristic_df.iat[z,4] = rain_tomorrow
    # example heuristic : if today > 0.0 and yesterday > 0.0:
    if yesterday >= 0.9 or today >= 0.05: # 0.707073 # my own heuristic based on personal experience
        heuristic_df.iat[z,3] = True
    if heuristic_df.iat[z,3] == heuristic_df.iat[z,4]:
        heuristic_df.iat[z,5] = True
        if heuristic_df.iat[z,3] == True:
            heuristic_df.iat[z,6] = True #true positive
        else:
            heuristic_df.iat[z,8] = True #true negative
    else:
        heuristic_df.iat[z,5] = False
        if heuristic_df.iat[z,3] == True:
            heuristic_df.iat[z,7] = True #false positive
        else:
            heuristic_df.iat[z,9] = True #false negative
heuristic_df.head()

In [None]:
data1 = heuristic_df[['yesterday']]
data1.head()

In [None]:
data2 = heuristic_df[['today']]
data2.head()

In [None]:
data3 = heuristic_df[['tomorrow']]
data3.head()

In [None]:
X = heuristic_df.dropna() # caution: X keeps 'tomorrow' and the outcome columns, so the features include the target
X.shape

In [None]:
y = pd.Series(np.where(X['tomorrow'].dropna() > 0, 1, 0)) # 1 if it rained tomorrow, else 0
y.shape

In [None]:
y.head()

In [None]:
y.tail()

## Prevent overfitting with a train/test split

Break the dataset into two parts, training and testing. 

Use the first 80% of the dataset for training and the last 20% for testing. 

Evaluate both sets of data using your function. 

What difference do you see in the calculated values (Precision and Recall)?

- Separate a dataset into training and testing subsets
- Calculate Precision and Recall for training and test sets
- Calculate SSE for both training and test sets
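
The steps in this list can be sketched on a small example. The snippet below is a minimal illustration in plain Python on made-up (guess, actual) pairs; the helper names `chronological_split` and `precision_recall` are assumptions introduced here, not part of the notebook.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split an ordered sequence into a leading train part and a trailing test part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def precision_recall(guesses, actuals):
    """Return (precision, recall) for paired boolean guesses and observations."""
    tp = sum(1 for g, a in zip(guesses, actuals) if g and a)
    fp = sum(1 for g, a in zip(guesses, actuals) if g and not a)
    fn = sum(1 for g, a in zip(guesses, actuals) if not g and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up (guess, actual) pairs in date order.
pairs = [(True, True), (True, False), (False, False), (True, True), (False, True),
         (False, False), (True, True), (False, True), (True, False), (True, True)]

train, test = chronological_split(pairs)
results = {}
for name, part in (("train", train), ("test", test)):
    guesses = [g for g, _ in part]
    actuals = [a for _, a in part]
    results[name] = precision_recall(guesses, actuals)
    print(name, results[name])
```

Because the split is by position rather than random, the test part always holds the most recent days, which is what "first 80% / last 20%" asks for.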

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False) # train = first 80% / test = last 20% (shuffle=False keeps date order)
X_train.shape

In [None]:
X_train.head() # the first 80% of the dataset for training

In [None]:
X_test.shape

In [None]:
X_test.head()

## RandomForestClassifier

Fit the model on the training data, then predict on the test data.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
# Fitting train data into model
clf.fit(X_train, y_train)
# Prediction
y_pred = clf.predict(X_test)
y_pred

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

In [None]:
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

In [None]:
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

## Feature Importances
As expected, the plot suggests that 3 features are informative, while the remaining ones are not.
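
To read importances by column name rather than by position, each value can be paired with its feature name. The sketch below uses synthetic data in which only one column drives the label; the column names `signal`, `noise_a`, and `noise_b` and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic data: only "signal" determines the label; the other columns are noise.
features = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
labels = (features["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, labels)

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(model.feature_importances_, index=features.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

The importances sum to 1, so each value can be read as that feature's share of the forest's total impurity reduction.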

In [None]:
clf.feature_importances_

In [None]:
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)
import matplotlib.pyplot as plt
plt.figure()
plt.title("Feature Importances")
plt.barh(range(X.shape[1]), importances[indices], color="r", xerr=std[indices], align="center")
plt.yticks(range(X.shape[1]), X.columns[indices]) # label each bar with its feature name
plt.ylim([-1, X.shape[1]])
plt.show()

## Confusion matrix

The four concepts of true/false positive/negative describe how each guess compares with what actually happened.

These counts then serve as core measures of performance.

This information can also be organized using a confusion matrix:

| Confusion Matrix | Predicted Positives | Predicted Negatives |
| ---------------- | ------------------- | ------------------- |
| Actual Positives | True Positives      | False Negatives     |
| Actual Negatives | False Positives     | True Negatives      |

### Precision
The percentage of your positive predictions that are correct.
### Recall
The percentage of actual positives that you correctly predicted as positive.
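
As a small worked example, the four counts and both metrics can be read off scikit-learn's confusion matrix. The labels below are made up for illustration. Note that for 0/1 labels scikit-learn puts the negative class first, so the matrix is laid out as `[[TN, FP], [FN, TP]]`, with rows for actual values and columns for predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]  # made-up observations (1 = rain)
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # made-up guesses

# Flatten [[TN, FP], [FN, TP]] into four counts.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

precision = tp / (tp + fp)  # share of positive guesses that were right
recall = tp / (tp + fn)     # share of actual positives that were found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "sklearn:", recall_score(y_true, y_pred))
```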

In [None]:
print(confusion_matrix(y_test, y_pred))

## Classification report
Build a text report showing the main classification metrics.

In [None]:
print(metrics.classification_report(y_test, y_pred))

## Accuracy classification score
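For binary labels such as rain / no rain, this score is the fraction of predictions that match the observations.

In multilabel classification, the function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.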
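
For 0/1 labels, the error metrics printed earlier carry the same information as accuracy: every wrong prediction has an absolute (and squared) error of exactly 1, so MAE and MSE both equal 1 minus the accuracy. A minimal check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])  # made-up observations
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])  # made-up predictions

accuracy = accuracy_score(y_true, y_pred)
manual = float(np.mean(y_true == y_pred))  # fraction of matching labels
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print("accuracy:", accuracy, "manual:", manual)
print("MAE:", mae, "MSE:", mse, "1 - accuracy:", 1 - accuracy)
```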

In [None]:
print('accuracy score: ', accuracy_score(y_test,y_pred))

In [None]:
X_test.head()

### True Positives

In [None]:
true_positives = X_test['true_positive'].value_counts()
true_positives

In [None]:
# type(true_positives) # pandas.core.series.Series

In [None]:
# true_positives.array # <PandasArray> [3914, 1195] Length: 2, dtype: int64

### True Negatives

In [None]:
true_negatives = X_test['true_negative'].value_counts()
true_negatives

### The accuracy of your predictions
We used this simple approach in the first part to see what percent of the time we were correct,

calculated as (true positive + true negative) / number of guesses.

In [None]:
accuracy = (X_test['true_positive'].sum() + X_test['true_negative'].sum()) / X_test.shape[0] # correct guesses / number of guesses in the test set
print('accuracy: ', accuracy)

### The precision of your predictions
Precision is the percent of your positive predictions that are correct.

More specifically, it is calculated as (num true positive)/(num true positive + num false positive).

In [None]:
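# true positives / (true positives + false positives), counted on the test set
precision = X_test['true_positive'].sum() / (X_test['true_positive'].sum() + X_test['false_positive'].sum())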
print('precision: ', precision)

### The recall of your predictions
Recall is the percent of actual positives that you correctly predicted as positive.

More specifically, it is calculated as (num true positive)/(num true positive + num false negative).
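
The three formulas can be applied directly to boolean flag columns like the ones in this notebook, since `True` counts as 1 when summed. The sketch below uses a made-up table of flags; the variable name `flags` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up outcome flags: exactly one flag is True per row.
flags = pd.DataFrame({
    "true_positive":  [True, False, False, True, False, False, False],
    "false_positive": [False, True, False, False, False, False, False],
    "true_negative":  [False, False, True, False, False, True, False],
    "false_negative": [False, False, False, False, True, False, True],
})

counts = flags.sum()  # number of True values per column
tp = int(counts["true_positive"])
fp = int(counts["false_positive"])
tn = int(counts["true_negative"])
fn = int(counts["false_negative"])

accuracy = (tp + tn) / len(flags)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

Summing the column is used here instead of indexing into `value_counts()` by position, because `value_counts()` sorts by frequency, so the position of the `True` count depends on the data.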

In [None]:
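# true positives / (true positives + false negatives), counted on the test set
recall = X_test['true_positive'].sum() / (X_test['true_positive'].sum() + X_test['false_negative'].sum())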
print('recall: ', recall)

## Sum of Squared Error (SSE) Cost of your prediction
SSE is the sum of the squared differences between each prediction and the actual value.

https://www.wikihow.com/Calculate-the-Sum-of-Squares-for-Error-(SSE)
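
As a minimal sketch of the calculation, the function below squares each difference and adds the results. The helper name `sum_squared_error` and the sample values are assumptions for illustration. Using the mean of the actual values as the prediction for every row, as the cells below do, gives a baseline to compare other predictions against.

```python
import numpy as np

def sum_squared_error(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((actual - predicted) ** 2))

actual = [0.0, 0.2, 0.5, 0.0]     # made-up observed precipitation
predicted = [0.1, 0.2, 0.3, 0.0]  # made-up predictions

sse = sum_squared_error(actual, predicted)

# Baseline: predict the mean of the observed values for every day.
baseline = sum_squared_error(actual, [np.mean(actual)] * len(actual))

print("SSE:", sse)
print("baseline SSE (mean as prediction):", baseline)
```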

In [None]:
### The sum of squared error (SSE) of your predictions
mean = X_test['tomorrow'].mean() # mean of tomorrow's precipitation in the test set
print('mean: ', mean)

In [None]:
X_test['deviation'] = X_test['tomorrow'] - mean

In [None]:
X_test['squared'] = X_test['deviation']**2

In [None]:
X_test.head()