# Rain in Australia

This project builds a model that predicts whether it will rain the next day in Australia. The prediction is a binary outcome (yes or no) produced by a trained classifier.


### About the data

This dataset contains daily weather observations recorded by meteorological stations across Australia.

The target is the feature **RainTomorrow**, which answers the question: "Will it rain tomorrow in Australia? Yes or no".

### Sources

The data comes from Kaggle: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/home.
See the link for more details about the data and its original source.

### Imports

In [None]:
# For exploring the data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# For scaling values (z-scores)
from scipy import stats

# For the machine learning models
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# For validating the models
from sklearn.metrics import accuracy_score, precision_score, recall_score
from numpy import mean


### Data analysis

First we will access the data and view the dataframe.

In [None]:
df = pd.read_csv('../input/weatherAUS.csv')
df.head()

In [None]:
# Shape of the dataset
df.shape

Following the guidance from the data source, we will not use the feature "RISK_MM". For more information: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/discussion/78316.

In [None]:
# Removing "RISK_MM"
df = df.drop(columns='RISK_MM')
df.shape

Inspecting each feature and its count of non-null values.

In [None]:
df.info()

Based on the information above and on the **value_counts** output for some features (hidden here because it takes up too much space), we remove the following features:

* Location: The question is "Will it rain tomorrow in Australia", so the location has little bearing on the question.
* Date: The dates work more like an index for the observations, and they are not continuous.
* Evaporation, Sunshine, Cloud9am and Cloud3pm: These features have a large share of null values, so it is better to remove them.

### Preparing the data

Let's start by removing the features the models do not need.

In [None]:
df = df.drop(columns=['Location', 'Date', 'Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm'])
df.info()

Now the features have similar counts of non-null values.

In [None]:
# Checking the shape
df.shape

This step removes the null values, an important step for the predictions, and checks that they have been removed.

In [None]:
df = df.dropna(how='any')
df.shape

Checking whether some features have outliers.

In [None]:
# Plotting boxplots
sns.boxplot(x=df['MinTemp'])
plt.show()

In [None]:
sns.boxplot(x=df['MaxTemp'])
plt.show()

In [None]:
sns.boxplot(x=df['WindGustSpeed'])
plt.show()

In [None]:
sns.boxplot(x=df['Rainfall'])
plt.show()

In [None]:
sns.boxplot(x=df['WindSpeed9am'])
plt.show()

In [None]:
sns.boxplot(x=df['Humidity9am'])
plt.show()

An example of outliers can be seen in the **Humidity9am** plot. There are some humidity values of 0 and 100, which are unrealistic for measurements taken in the open air.

To remove outliers I will apply the Z-score technique. It standardizes every numeric value in the dataframe using the mean and standard deviation of its column, $z = (x - \mu) / \sigma$, which gives a score for each value. The mean maps to 0, and values within one standard deviation of the mean score between -1 and 1. Values whose scores reach 3 or -3 and beyond are considered outliers and can be removed from the dataframe.

In [None]:
# Standardizing the numeric columns
z = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
# Printing the table of absolute z-scores
print(z)
# Keeping only the rows where every numeric value has |z| < 3
df = df[(z < 3).all(axis=1)]
# Checking the new shape of the dataframe
print(df.shape)
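
As a small illustration of the filter above, the following plain-Python sketch computes z-scores for a made-up list of humidity readings and keeps only the values with |z| < 3. The data and the helper name `zscores` are invented for this example; `statistics.pstdev` is the population standard deviation, which corresponds to the default `ddof=0` of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

def zscores(values):
    """Return (x - mean) / std for each x, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical humidity readings: 19 ordinary values and one extreme value (0).
readings = [60, 62, 58, 61, 59, 63, 57, 60, 62, 58,
            61, 59, 60, 62, 58, 61, 59, 60, 61, 0]

z = zscores(readings)
# Keep only the readings whose absolute z-score is below 3.
kept = [x for x, score in zip(readings, z) if abs(score) < 3]

print(len(readings), len(kept))
```

Only the reading of 0 is dropped here. Note that the notebook applies the same rule row-wise: a row is removed if any of its numeric columns exceeds the threshold.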

Checking the number of values in each feature now.

In [None]:
df.info()

Now I will convert the categorical wind direction features (**WindGustDir, WindDir3pm, WindDir9am**) to numerical features.

Checking the number of distinct wind directions:

In [None]:
len(df.WindGustDir.value_counts())

Converting the categorical features with the **get_dummies** function.

In [None]:
# List of features that will be converted
winds = ['WindGustDir', 'WindDir3pm', 'WindDir9am']

# Doing the transformation with "get_dummies"
df = pd.get_dummies(df, columns=winds)

# Checking the new shape
df.shape

Now the dataframe has 48 new columns, which replace the 3 old wind direction columns. However, for mathematical reasons we need to drop one column from each group created from a wind direction feature, because within a group the columns must not be fully determined by one another.

In [None]:
# Removing one column from each group
df = df.drop(['WindGustDir_WSW', 'WindDir3pm_SSW', 'WindDir9am_NNE'], axis=1)
df.shape
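
To make the reason for dropping one column per group concrete, here is a minimal plain-Python sketch of one-hot encoding on made-up wind directions. The helper `one_hot` and the sample data are hypothetical and only mimic what **get_dummies** produces.

```python
def one_hot(values, drop=None):
    """Encode each value as a dict of 0/1 indicator columns.

    If `drop` is given, the indicator column for that category is left out.
    """
    categories = sorted(set(values))
    kept = [c for c in categories if c != drop]
    return [{c: int(v == c) for c in kept} for v in values]

# Hypothetical wind directions for four days.
directions = ['N', 'S', 'E', 'N']

full = one_hot(directions)
reduced = one_hot(directions, drop='S')

# With all columns, every row sums to 1, so any one column
# can be computed from the others (it carries no extra information).
print([sum(row.values()) for row in full])

# After dropping 'S', a row of all zeros still identifies 'S'.
print(reduced[1])
```

pandas also offers `pd.get_dummies(..., drop_first=True)`, which drops the first category of each group instead of a hand-picked one.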

Converting the values of **RainToday** and **RainTomorrow** to 0 and 1, the labels of a binary system.

In [None]:
df['RainToday'] = df['RainToday'].map({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})

Checking the conversion

In [None]:
df.RainToday.value_counts()

In [None]:
df.RainTomorrow.value_counts()

The last step before working with the machine learning models is to scale all values to the range between 0 and 1. This keeps features such as pressure (values around a thousand) and temperature (values in the tens) from distorting the models, since they are on different scales. The scaling preserves the relative differences between values of the same feature.

In [None]:
# Scaling with the "MinMaxScaler" model
scaler = preprocessing.MinMaxScaler()
# Fitting the scaler
scaler.fit(df)
# Transforming the data
df = pd.DataFrame(scaler.transform(df), index=df.index, columns=df.columns)
# Showing the dataframe after scaling
df.head()
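
The min-max formula behind this step is $x' = (x - x_{min}) / (x_{max} - x_{min})$, applied to each column separately. A minimal sketch with made-up pressure and temperature values (the helper `min_max_scale` is hypothetical, not part of scikit-learn):

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical columns on very different scales.
pressure = [1000.0, 1010.0, 1020.0]
temperature = [10.0, 20.0, 30.0]

scaled_pressure = min_max_scale(pressure)
scaled_temperature = min_max_scale(temperature)

print(scaled_pressure)     # [0.0, 0.5, 1.0]
print(scaled_temperature)  # [0.0, 0.5, 1.0]
```

Both columns end up on the same 0 to 1 scale, while the spacing between values inside each column is preserved.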

### Predictive models

The first step is to determine which features are most relevant for the predictive models. For this I will fit **SelectKBest** to generate a score for each feature and see which are most relevant.

In [None]:
# Splitting the data into features (X) and labels (y)
X = df.loc[:, df.columns != 'RainTomorrow']
y = df['RainTomorrow']
# Using SelectKBest with the chi2 score and k = 58, the number of features
selector = SelectKBest(chi2, k=58)
# Fitting
selector.fit(X, y)
# Retrieving the scores
scores = selector.scores_
# Creating a list of feature names
lista = [x for x in df.columns if x != 'RainTomorrow']
# Creating a dictionary from the feature names and scores, sorted by descending score
unsorted_pairs = zip(lista, scores)
sorted_pairs = sorted(unsorted_pairs, key=lambda x: x[1], reverse=True)
k_best_features = dict(sorted_pairs[:58])
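
To give an intuition for the scores, the sketch below computes a chi-squared statistic for a single non-negative feature against binary labels in plain Python. It follows my understanding of how scikit-learn's `chi2` works (per-class feature sums compared with the sums expected if feature and class were independent); the helper `chi2_score` and the toy data are assumptions made for illustration.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between one non-negative feature and 0/1 labels."""
    n = len(labels)
    total = sum(feature)
    score = 0.0
    for cls in (0, 1):
        # Observed: sum of the feature over the samples of this class.
        observed = sum(f for f, y in zip(feature, labels) if y == cls)
        # Expected under independence: class frequency times the feature total.
        class_prob = sum(1 for y in labels if y == cls) / n
        expected = class_prob * total
        score += (observed - expected) ** 2 / expected
    return score

labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # follows the labels exactly
uninformative = [1, 0, 1, 0]  # spread evenly over both classes

print(chi2_score(informative, labels))    # 2.0
print(chi2_score(uninformative, labels))  # 0.0
```

A feature concentrated in one class gets a high score, while a feature spread evenly across classes scores zero, which is why **RainToday** can stand out in the ranking.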

Plotting a bar chart of the feature scores.

In [None]:
# Setting up the figure
plt.figure(figsize=(20,7),facecolor = 'w',edgecolor = 'w')
# Plotting the bar chart
p = plt.bar(range(len(k_best_features)), list(k_best_features.values()), align='center')
plt.xticks(range(len(k_best_features)), list(k_best_features.keys()))
# Rotating the tick labels
plt.xticks(rotation=90)
plt.title('K best features scores')
plt.xlabel('Features')
plt.ylabel('Score')
plt.show()

Now we can see that **RainToday** is the most important feature for the models. The wind direction features do not have high scores. I will use only the features whose score is at least 1% (71 points) of the **RainToday** score (7136 points).

In [None]:
# Creating a list of feature names with a score of at least 71 points
K_values = []
for key in k_best_features:
    if float(k_best_features[key]) >= float(0.01 * k_best_features['RainToday']):
        K_values.append(key)

Splitting the new group of features (X) and labels (y) for the models

In [None]:
df_predi = df[K_values + ['RainTomorrow']]
X = df[K_values]
y = df['RainTomorrow']

Now I will analyze how many features the models need to perform well.
The models I will work with:

* Logistic Regression
* Decision Tree
* KMeans


In [None]:
# Creating a list of feature counts to try, up to len(K_values) (31 in this case)
n_features_list = list(range(2,len(K_values)+1))

Although we have chosen the best features for the models, we also need to assess how many features give the best results, so let's create a loop that trains the models and records their accuracy. Each iteration uses the n highest-scoring features, so the features left out are always the ones with the lowest scores.

In [None]:
# Creating a list for each model to collect the results
# For Logistic Regression
accuracy_LR=[]
# For Decision Tree
accuracy_dt=[]
# For KMeans
accuracy_Kmeans=[]

# Looping over the number of features, using the list of feature names
for n in n_features_list:  
    
    # Splitting the values into training and test sets with "train_test_split"
    # We will leave 20% of the data for testing and the rest for training.
    features_train, features_test, labels_train, labels_test = train_test_split(df[K_values[:n]], y, test_size=0.2, random_state=42)

    # Applying the Logistic Regression model
    l_clf = LogisticRegression()
    # Training
    l_clf.fit(features_train, labels_train)
    # Doing the prediction
    prediction_lr = l_clf.predict(features_test)
    # Appending the accuracy to the list
    accuracy_LR.append(accuracy_score(labels_test, prediction_lr))
    
    # The steps are the same for the other models
    
    # For Decision Tree
    dt_clf = DecisionTreeClassifier(random_state=0)
    dt_clf.fit(features_train, labels_train)
    prediction_dt = dt_clf.predict(features_test)
    accuracy_dt.append(accuracy_score(labels_test, prediction_dt))
    
    # For Kmeans
    k_clf = KMeans(n_clusters=2)
    k_clf.fit(features_train, labels_train)
    prediction_k = k_clf.predict(features_test)
    accuracy_Kmeans.append(accuracy_score(labels_test, prediction_k))
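
One caveat when scoring KMeans with `accuracy_score`: it is a clustering method, so its cluster ids are arbitrary and cluster 0 need not correspond to "no rain". The sketch below, with made-up labels and hypothetical helper names, maps each cluster to the majority class seen in training before computing accuracy. It is one possible way to handle this, not what the loop above does.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cluster_to_label_map(train_labels, train_clusters):
    """Map each cluster id to the majority class among its training samples."""
    mapping = {}
    for cluster in set(train_clusters):
        members = [y for y, c in zip(train_labels, train_clusters) if c == cluster]
        mapping[cluster] = max(set(members), key=members.count)
    return mapping

# Hypothetical data: the cluster ids happen to be flipped relative to the labels.
train_labels   = [0, 0, 0, 1, 1, 0]
train_clusters = [1, 1, 1, 0, 0, 1]
test_labels    = [0, 1, 0, 1]
test_clusters  = [1, 0, 1, 1]

mapping = cluster_to_label_map(train_labels, train_clusters)
raw = accuracy(test_labels, test_clusters)
aligned = accuracy(test_labels, [mapping[c] for c in test_clusters])

print(mapping, raw, aligned)
```

Here the raw accuracy is 0.25 while the aligned accuracy is 0.75, which shows how much an arbitrary cluster numbering can distort the score.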


To better understand how model performance changes with the number of features, I plotted the accuracy.

In [None]:
# Setting up the figure
plt.figure(figsize=(9,6),facecolor = 'w',edgecolor = 'w')

# Plotting accuracy against the number of features for each model
# Logistic Regression
line1 = plt.plot(n_features_list, accuracy_LR, 'b', label='LR')
# Decision Tree
line2 = plt.plot(n_features_list, accuracy_dt, 'r', label= 'dt')
# KMeans
line3 = plt.plot(n_features_list, accuracy_Kmeans, 'g', label= 'Kmean')

# Setting the legend
plt.legend(('Logistic Regression', 'Decision tree', 'Kmean'), loc = 'best')
# Setting the title and axis labels
plt.title('accuracy x features')
plt.ylabel('accuracy score')
plt.xlabel('n features')
# Setting the ticks and grid
plt.xticks(n_features_list)
plt.grid(True, which='both', axis='both')
plt.show()

Now we can see that the models performed well with 3 features **(RainToday, Rainfall and Humidity3pm)**; only KMeans showed the same performance with more features. We'll use these 3 features going forward to compare the 3 models. The next step is to determine the best parameters for each method using **GridSearchCV**, which will be applied to at least 3 parameters per model.


In [None]:
# Splitting the data into training and test sets with 3 features
X = df[K_values[:3]]
y = df['RainTomorrow']
features_train, features_test, labels_train, labels_test = train_test_split(X, y, test_size=0.2, random_state=42)

For Logistic Regression I will change:

* Solver - Algorithm to use in the optimization problem (newton-cg, lbfgs, liblinear, sag, saga)
* C - Inverse of regularization strength $(0.01, 0.1, 10, 10^{5}, 10^{10}, 10^{15}, 10^{20})$
* tol - Tolerance for stopping criteria $(10^{-20}, 10^{-15}, 10^{-10}, 10^{-5}, 0.01, 0.1, 10)$

In [None]:
# Creating a dictionary with the parameter grid
parameters = {'solver':('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'), 'C':[0.01, 0.1, 10, 10**5,10**10,10**15,10**20],'tol':[10**-20,10**-15,10**-10,10**-5,0.01, 0.1, 10]}
# Applying the model
l_clf = LogisticRegression()
clf = GridSearchCV(l_clf, parameters)
clf.fit(features_train, labels_train)
# Output of the best parameters
best_l_clf = clf.best_estimator_
clf.best_estimator_
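
To get a feel for the cost of this search, the sketch below enumerates the grid used above with `itertools.product`. It only counts the combinations; it does not fit any model. Each combination is then fitted once per cross-validation fold by **GridSearchCV**.

```python
from itertools import product

# Same grid as in the cell above.
grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'C': [0.01, 0.1, 10, 10**5, 10**10, 10**15, 10**20],
    'tol': [10**-20, 10**-15, 10**-10, 10**-5, 0.01, 0.1, 10],
}

# Every combination of one value per parameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 5 * 7 * 7 = 245
print(combinations[0])
```

With 245 candidate settings, the total number of fits grows quickly, so trimming values that are unlikely to matter (for example the extreme tolerances) is one way to shorten the search.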

For Decision Tree:

* criterion - The function to measure the quality of a split (gini or entropy)
* min_samples_leaf - The minimum number of samples required to be at a leaf node (1-4)
* max_depth - The maximum depth of the tree, searched over the same range as min_samples_leaf (1-4)
* class_weight - Weights associated with the classes (balanced)

In [None]:
# Creating a dictionary with the parameter grid
parameters = { 'criterion': ('gini', 'entropy'), 'min_samples_leaf' : range(1, 5), 'max_depth' : range(1, 5), 'class_weight': ['balanced'] }
# Applying the model
dt_clf = DecisionTreeClassifier(random_state=0)
clf = GridSearchCV(dt_clf, parameters)
clf.fit(features_train, labels_train)
# Output of the best parameters
clf.best_estimator_

For Kmeans:

* algorithm - K-means algorithm to use (auto, full, elkan)
* tol - Relative tolerance with regards to inertia to declare convergence $(10^{-20},10^{-15},10^{-10},10^{-5},0.01, 0.1, 10)$
* n_init - Number of times the k-means algorithm will be run with different centroid seeds (10,25,50,75,100,200)

In [None]:
# Creating a dictionary with the parameter grid
# Note: recent scikit-learn releases may expect 'lloyd' and 'elkan' instead of 'auto' and 'full'
parameters = {'algorithm': ('auto', 'full', 'elkan'), 'tol':[10**-20,10**-15,10**-10,10**-5,0.01, 0.1, 10], 'n_init': [10,25,50,75,100,200]}
# Applying the model
k_clf = KMeans(n_clusters=2)
clf = GridSearchCV(k_clf, parameters)
clf.fit(features_train, labels_train)
# Output of the best parameters
clf.best_estimator_

With the parameters of each model optimized, I will now predict rain using the validation method described below.

First we store the configuration of each model. Parameters not mentioned above also appear here; they are other tuning options. Since they were not assessed, they are left at each model's default value.

In [None]:
# Logistic Regression
l_clf = LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='saga', tol=0.1,
          verbose=0, warm_start=False)

For the decision tree I will add AdaBoostClassifier to try to improve the result.

In [None]:
# Decision Tree with AdaBoost
dt_clf = AdaBoostClassifier(DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best'), n_estimators=50, learning_rate=.8)

In [None]:
# KMeans
k_clf =  KMeans(algorithm='elkan', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=1e-20, verbose=0)

The model validation consists of running a thousand iterations of each predictive model and collecting the accuracy, precision and recall in lists. Averaging these values gives a result closer to the real value, which is used to determine the best model.
What makes these thousand iterations useful is the function **train_test_split**. In each iteration it selects a different random split of the data for testing and training, so the model is trained and evaluated on many different combinations and the averaged results are more reliable.

Precision and recall are more helpful in binary prediction than accuracy alone, because accuracy can look good even when the model rarely identifies the rainy days.
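
As a worked example of why accuracy alone can mislead, the sketch below computes the three metrics from the confusion counts for made-up labels. The helper `binary_metrics` is hypothetical; the notebook itself relies on scikit-learn's `accuracy_score`, `precision_score` and `recall_score`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels, with 1 as the positive class.

    Assumes at least one positive prediction and one positive label,
    otherwise the divisions below would fail.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # rain predicted, rain observed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # rain predicted, no rain
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # rain missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # no rain predicted, no rain
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Hypothetical outcome: 8 dry days, 2 rainy days, and only one rainy day predicted.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The accuracy is 0.9 and the precision is 1.0, yet the recall is only 0.5: half of the rainy days were missed, which accuracy by itself would hide.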

Creating the function for validation:

In [None]:
def avaliacao_clf(clf, features, labels, n_iters=1000):
    print(clf)

    # Creating lists for the outputs
    accuracy = []
    precision = []
    recall = []

    # Looping over n_iters iterations (a thousand by default)
    for tentativa in range(n_iters):

        # Splitting the data passed to the function into training and test sets
        features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3)

        # Applying the model
        clf.fit(features_train, labels_train)
        predictions = clf.predict(features_test)
        # Appending accuracy
        accuracy.append(accuracy_score(labels_test, predictions))
        # Appending precision
        precision.append(precision_score(labels_test, predictions))
        # Appending recall
        recall.append(recall_score(labels_test, predictions))

    # Averaging the metrics over all iterations and printing the results
    print("precision: {}".format(mean(precision)))
    print("recall:    {}".format(mean(recall)))
    print("accuracy:  {}".format(mean(accuracy)))

    return mean(precision), mean(recall), mean(accuracy)

Doing the validation for logistic regression:

In [None]:
avaliacao_clf(l_clf, X, y)

For Decision Tree:

In [None]:
avaliacao_clf(dt_clf, X, y)

For Kmeans:

In [None]:
avaliacao_clf(k_clf, X, y)

### Conclusion

The results for accuracy, precision and recall are in the table below:


| Model                     | Precision   | Recall      | Accuracy    |
|---------------------------|-------------|-------------|-------------|
| Decision Tree - AdaBoost  | 0.454       | 0.694       | 0.763       |
| KMeans                    | 0.366       | 0.467       | 0.637       |
| Logistic Regression       | 0.721       | 0.378       | 0.837       |

From these results we can conclude that Logistic Regression is the best option among the three models: it has the best precision (0.721), meaning its rain predictions are most often correct, and the best accuracy (0.837). Its recall (0.378), however, is the lowest of the three, so it misses more rainy days than the Decision Tree with AdaBoost (0.694).