# SnowyCocoon - Predicting the rain in Australia

## This is my first task commission/notebook. I'm still learning, so I will very probably update this notebook later on :).

We will compare 5 different methods to predict the rain:
1. Logistic Regression
2. Decision Trees
3. Random Forest
4. KNeighborsClassifier
5. Neural Network (built in PyTorch)

Dataset from:
- https://www.kaggle.com/jsphyg/weather-dataset-rattle-package


List of notebooks that helped me with the process:
- https://www.kaggle.com/aninditapani/will-it-rain-tomorrow
- https://www.kaggle.com/prashant111/logistic-regression-classifier-tutorial
- https://www.kaggle.com/rafetcan/red-wine-quality-classification-95-76-acc



# 0. Importing the libraries and data

In [None]:
import torch
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from torch import nn, optim
import torch.nn.functional as F
from scipy.stats import norm
#from sklearn.utils import resample

In [None]:
df = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv') #Reading the csv

In [None]:
df.head() #Displaying the first 5 rows of the dataframe 

# 1. Preprocessing the data

### 1.1. Handling the NA/Null/Empty values

In [None]:
df.count().sort_values() #Checking how many non-missing values each column has.

In [None]:
df.isna().sum().sort_values() #Checking how many missing values each column has.

In [None]:
df = df.drop(columns=['Sunshine','Evaporation','Cloud3pm','Cloud9am','Location','Date']) #Dropping the sparse columns (less than 60% of data), along with Location and Date

In [None]:
df = df.dropna(how='any') #We are dropping all the rows with any missing value.
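
The two steps above (dropping sparse columns, then dropping incomplete rows) can be sketched in plain Python to make the arithmetic visible. This is only an illustration: the toy rows and the `min_fraction` parameter are made up, and in the notebook the columns were picked by hand from the counts shown earlier.

```python
# Toy rows; None marks a missing value. Column names mirror the dataset, values are invented.
rows = [
    {"MinTemp": 13.4, "Sunshine": None, "Rainfall": 0.6},
    {"MinTemp": 7.4, "Sunshine": None, "Rainfall": 0.0},
    {"MinTemp": None, "Sunshine": 9.1, "Rainfall": 0.0},
    {"MinTemp": 9.2, "Sunshine": None, "Rainfall": 1.0},
    {"MinTemp": 17.5, "Sunshine": None, "Rainfall": 0.2},
]

def non_missing_fraction(rows):
    """Return {column: share of rows where the column has a value}."""
    columns = rows[0].keys()
    return {c: sum(r[c] is not None for r in rows) / len(rows) for c in columns}

def drop_sparse_columns(rows, min_fraction=0.6):
    """Keep only columns whose non-missing share is at least min_fraction (hypothetical parameter)."""
    fractions = non_missing_fraction(rows)
    keep = [c for c, f in fractions.items() if f >= min_fraction]
    return [{c: r[c] for c in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Drop every row that still has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

fractions = non_missing_fraction(rows)
cleaned = drop_incomplete_rows(drop_sparse_columns(rows))
print(fractions)
print(len(cleaned), "rows left with columns", list(cleaned[0].keys()))
```

Dropping the sparse columns first matters: if we dropped incomplete rows first, almost every row would disappear because of the mostly empty column.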

### 1.2. Dealing with outliers
https://en.wikipedia.org/wiki/Outlier

We calculate the Z-score for every numeric value in the dataframe. If the absolute Z-score of any value in a row is 3 or more, we delete the whole row. The higher the absolute Z-score, the more unusual the value!

![title](https://raw.githubusercontent.com/SnowyCocoon/Data-Science-Projects/main/11.%20Rain%20Classification%20using%205%20different%20classification%20models/Img1.png)

source(https://en.wikipedia.org/wiki/Z-test)

![title](https://github.com/SnowyCocoon/Data-Science-Projects/blob/main/11.%20Rain%20Classification%20using%205%20different%20classification%20models/Img2.png?raw=true)

source (https://www.dummies.com/education/math/statistics/how-to-calculate-a-confidence-interval-for-a-population-mean-when-you-know-its-standard-deviation/)

In [None]:
from scipy import stats

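z = np.abs(stats.zscore(df.select_dtypes(include=np.number))) # Absolute Z-score of every numeric column (public API instead of the private _get_numeric_data)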
print(z)
df= df[(z < 3).all(axis=1)]
print(df.shape)
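
To see what the Z-score filter does, here is a minimal plain-Python sketch for a single column. The sample values are invented; the population standard deviation is used because that matches the default (`ddof=0`) of `scipy.stats.zscore`.

```python
import statistics

def z_scores(values):
    """Z-score of each value: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # ddof=0, the default of scipy.stats.zscore
    return [(x - mean) / std for x in values]

# Invented sample: 19 ordinary readings and one extreme one.
rainfall = [10.0] * 19 + [100.0]
z = z_scores(rainfall)

# Keep only the values whose absolute Z-score is below 3.
kept = [x for x, score in zip(rainfall, z) if abs(score) < 3]
print(round(max(z), 3), len(kept))
```

In the notebook the same test is applied to every numeric column at once, and a row is kept only if all of its values pass.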

### 1.3. Dealing with categorical data (in string format)

In [None]:
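# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df['RainToday'] = df['RainToday'].map({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})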

In [None]:
categorical_columns = ['WindGustDir', 'WindDir3pm', 'WindDir9am']
df = pd.get_dummies(df, columns=categorical_columns)
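
As a rough picture of what `pd.get_dummies` produces, here is a plain-Python sketch of one-hot encoding. The `one_hot` helper and the sample directions are made up for illustration; depending on the pandas version, the real indicator columns may hold booleans rather than 0/1.

```python
def one_hot(values, prefix):
    """Turn a list of category strings into {f"{prefix}_{category}": [0/1, ...]}."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

wind = ["N", "SE", "N", "W"]  # invented sample of wind directions
encoded = one_hot(wind, prefix="WindGustDir")
for name, column in encoded.items():
    print(name, column)
```

Each row ends up with exactly one 1 across the new columns, so no artificial ordering between wind directions is introduced.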

### 1.4. Standardizing/Normalizing our data

In [None]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(df)
df = pd.DataFrame(scaler.transform(df), index=df.index, columns=df.columns)
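
Min-max scaling maps each column linearly onto the range 0 to 1. A minimal sketch of the formula for one column, with invented readings and a hypothetical `min_max_scale` helper:

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

humidity = [20.0, 45.0, 70.0, 95.0]  # invented readings
scaled = min_max_scale(humidity)
print(scaled)
```

Having every feature non-negative is also what lets us use the chi-squared test for feature selection in the next step.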

### 1.5. Selecting the features to include in our model

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

X = df.loc[:,df.columns!='RainTomorrow']
y = df[['RainTomorrow']]

selector = SelectKBest(chi2, k=3)
selector.fit(X, y)

X_new = selector.transform(X)
print(X.columns[selector.get_support(indices=True)]) #top 3 columns
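
`SelectKBest(chi2, k=3)` scores every feature with a chi-squared statistic and keeps the 3 highest. The sketch below is my understanding of how that score is computed for one non-negative feature; it is an illustration in plain Python with invented data, not scikit-learn's actual code, and it assumes the feature has a positive total.

```python
def chi2_score(feature, labels):
    """Chi-squared score of one non-negative feature against class labels."""
    classes = sorted(set(labels))
    n = len(labels)
    feature_total = sum(feature)
    score = 0.0
    for c in classes:
        # Observed: how much of the feature's total falls into class c.
        observed = sum(x for x, y in zip(feature, labels) if y == c)
        # Expected: the share we would see if the feature were independent of the class.
        expected = feature_total * labels.count(c) / n
        score += (observed - expected) ** 2 / expected
    return score

rain_tomorrow = [1, 1, 0, 0]
informative = [1.0, 1.0, 0.0, 0.0]    # lines up with the label
uninformative = [1.0, 0.0, 1.0, 0.0]  # spread evenly over both classes
print(chi2_score(informative, rain_tomorrow), chi2_score(uninformative, rain_tomorrow))
```

A feature whose mass is concentrated in one class gets a high score, while a feature spread evenly over both classes scores zero.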

### 1.6. Analysing the top 3 columns

The probability plots already look reasonable, because we have already removed the outliers!

#### 1.6.1. Humidity at 3 PM

In [None]:
df[["Humidity3pm","RainTomorrow"]].groupby(["RainTomorrow"], as_index = False).mean().sort_values(by = "RainTomorrow").style.background_gradient("Reds")

In [None]:
plt.figure(figsize=(13,10))
plt.subplot(2,2,1)
plt.hist(df["Humidity3pm"], color="orange")
plt.xlabel("Humidity3pm")
plt.ylabel("Frequency")
plt.title("Humidity3pm histogram", color = "black", fontweight='bold', fontsize = 11)
plt.subplot(2,2,2)
sns.distplot(df["Humidity3pm"], fit=norm, color="orange")
plt.title("Humidity3pm Distplot", color = "black", fontweight='bold', fontsize = 11)
plt.subplot(2,2,3)
stats.probplot(df["Humidity3pm"], plot = plt)

plt.show()

#### 1.6.2. Rainfall

In [None]:
df[["Rainfall","RainTomorrow"]].groupby(["RainTomorrow"], as_index = False).mean().sort_values(by = "RainTomorrow").style.background_gradient("Reds")

In [None]:
plt.figure(figsize=(13,10))
plt.subplot(2,2,1)
plt.hist(df["Rainfall"], color="purple")
plt.xlabel("Rainfall")
plt.ylabel("Frequency")
plt.title("Rainfall histogram", color = "black", fontweight='bold', fontsize = 11)
plt.subplot(2,2,2)
sns.distplot(df["Rainfall"], fit=norm, color="purple")
plt.title("Rainfall Distplot", color = "black", fontweight='bold', fontsize = 11)
plt.subplot(2,2,3)
stats.probplot(df["Rainfall"], plot = plt)

plt.show()

#### 1.6.3. Rain Today

In [None]:
df[["RainToday","RainTomorrow"]].groupby(["RainTomorrow"], as_index = False).mean().sort_values(by = "RainTomorrow").style.background_gradient("Reds")

In [None]:
plt.figure(figsize=(13,10))
plt.subplot(2,2,1)
plt.hist(df["RainToday"], color="blue")
plt.xlabel("RainToday")
plt.ylabel("Frequency")
plt.title("RainToday histogram", color = "black", fontweight='bold', fontsize = 11)
plt.subplot(2,2,2)
sns.distplot(df["RainToday"], fit=norm, color="blue")
plt.title("RainToday Distplot", color = "black", fontweight='bold', fontsize = 11)

plt.show()

### 1.7. Splitting the data

In [None]:
X = df[['Humidity3pm','Rainfall','RainToday']]
y = df[['RainTomorrow']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
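
Conceptually, the split shuffles the rows and sets 20% of them aside for testing. A minimal sketch in plain Python, where `split_rows` and its `seed` parameter are hypothetical names used only for illustration:

```python
import math
import random

def split_rows(rows, test_size=0.2, seed=None):
    """Shuffle row indices, then use the first share of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

days = list(range(10))  # stand-in for 10 rows of features
train, test = split_rows(days, test_size=0.2, seed=0)
print(len(train), len(test))
```

Note that the notebook's call does not pass `random_state`, so the split (and therefore every score below) changes slightly between runs.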

### 1.8. Upsampling (not used)

Upsampling can really help us get a more balanced F1 score between the 2 classes, but we don't want to use it here. We can get better accuracy without up/downsampling because there are more days without rain.

In [None]:
from imblearn.over_sampling import SMOTE
import collections

In [None]:
sm = SMOTE(random_state=14)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
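
The trade-off between accuracy and F1 on imbalanced data can be seen with a tiny plain-Python example. The 80/20 class split and the "always dry" model below are invented for illustration and are not the dataset's real numbers.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:  # no correct predictions for this class
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented 80/20 split between dry and rainy days.
y_true = [0] * 80 + [1] * 20
always_dry = [0] * 100  # a "model" that never predicts rain

print(accuracy(y_true, always_dry))
print(f1_for_class(y_true, always_dry, cls=0), f1_for_class(y_true, always_dry, cls=1))
```

The accuracy looks decent even though the rainy class is never detected, which is why the per-class F1 scores are worth checking alongside accuracy.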

# 2. Creating and Fitting the Models + Evaluation

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report

### 2.1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf_logreg = LogisticRegression(random_state=0)
clf_logreg.fit(X_train, y_train.to_numpy().ravel()) # ravel() gives the 1-D label array scikit-learn expects

y_pred_1 = clf_logreg.predict(X_test)
score_1 = accuracy_score(y_test,y_pred_1)

print('Accuracy :',score_1)

In [None]:
cm = confusion_matrix(y_test, y_pred_1)
classes = ['No rain', 'Raining']
df_cm = pd.DataFrame(cm, index=classes, columns=classes)
hmap = sns.heatmap(df_cm, annot=True, fmt="d")
hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
plt.ylabel('True label')
plt.xlabel('Predicted label');
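
To read the heatmap, it helps to know how the matrix is built: rows are the true classes and columns are the predicted classes. A minimal plain-Python sketch with invented labels (the `confusion` helper is hypothetical):

```python
def confusion(y_true, y_pred, n_classes=2):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0]
cm_demo = confusion(y_true_demo, y_pred_demo)
tn, fp = cm_demo[0]  # true "No rain" row
fn, tp = cm_demo[1]  # true "Raining" row
print(cm_demo, (tn + tp) / len(y_true_demo))
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of samples.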

### 2.2. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf_dt = DecisionTreeClassifier(random_state=0)
clf_dt.fit(X_train,y_train)

y_pred_2 = clf_dt.predict(X_test)
score_2 = accuracy_score(y_test,y_pred_2)

print('Accuracy :',score_2)

In [None]:
cm = confusion_matrix(y_test, y_pred_2)
df_cm = pd.DataFrame(cm, index=classes, columns=classes)
hmap = sns.heatmap(df_cm, annot=True, fmt="d")
hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
plt.ylabel('True label')
plt.xlabel('Predicted label');

### 2.3. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
clf_rf.fit(X_train, y_train.to_numpy().ravel()) # ravel() gives the 1-D label array scikit-learn expects

y_pred_3 = clf_rf.predict(X_test)
score_3 = accuracy_score(y_test,y_pred_3)

print('Accuracy :',score_3)

In [None]:
cm = confusion_matrix(y_test, y_pred_3)
df_cm = pd.DataFrame(cm, index=classes, columns=classes)
hmap = sns.heatmap(df_cm, annot=True, fmt="d")
hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
plt.ylabel('True label')
plt.xlabel('Predicted label');

### 2.4. KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train.to_numpy().ravel()) # ravel() gives the 1-D label array scikit-learn expects

y_pred_4 = knn.predict(X_test)
score_4 = accuracy_score(y_test,y_pred_4)

print('Accuracy :',score_4)
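
The idea behind k-nearest neighbours can be sketched in a few lines of plain Python: find the closest training points and let them vote. The points below are invented, and the sketch uses `k=3` (not the notebook's `n_neighbors=2`) simply to avoid tied votes.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance to each training point
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Invented, already-scaled points: (Humidity3pm, Rainfall, RainToday).
points = [(0.2, 0.0, 0.0), (0.3, 0.0, 0.0), (0.25, 0.1, 0.0),
          (0.9, 0.6, 1.0), (0.8, 0.5, 1.0), (0.85, 0.7, 1.0)]
labels = [0, 0, 0, 1, 1, 1]
dry_guess = knn_predict(points, labels, query=(0.28, 0.05, 0.0))
rain_guess = knn_predict(points, labels, query=(0.88, 0.55, 1.0))
print(dry_guess, rain_guess)
```

Because the method relies on distances, it benefits from the min-max scaling done in section 1.4, which keeps any single feature from dominating.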

In [None]:
cm = confusion_matrix(y_test, y_pred_4)
df_cm = pd.DataFrame(cm, index=classes, columns=classes)
hmap = sns.heatmap(df_cm, annot=True, fmt="d")
hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
plt.ylabel('True label')
plt.xlabel('Predicted label');

### 2.5. Neural Network (built in PyTorch)


In [None]:
X_train = torch.from_numpy(X_train.to_numpy()).float()
y_train = torch.squeeze(torch.from_numpy(y_train.to_numpy()).float())

X_test = torch.from_numpy(X_test.to_numpy()).float()
y_test = torch.squeeze(torch.from_numpy(y_test.to_numpy()).float())

In [None]:
class Net(nn.Module):
    def __init__(self, n_features):  # NN constructor
        super(Net, self).__init__()  # Parent (nn.Module) constructor
        self.fc1 = nn.Linear(n_features, 32)  # Input to first hidden layer
        self.fc2 = nn.Linear(32, 16)  # First to second hidden layer
        self.fc3 = nn.Linear(16, 8)  # Second to third hidden layer
        self.fc4 = nn.Linear(8, 1)  # Third hidden layer to output

    def forward(self, x):  # Passing each layer's output to the next layer
        x = F.relu(self.fc1(x))  # Input to first hidden layer (ReLU)
        x = F.relu(self.fc2(x))  # First to second hidden layer (ReLU)
        x = F.relu(self.fc3(x))  # Second to third hidden layer (ReLU)
        return torch.sigmoid(self.fc4(x))  # Third hidden layer to output (sigmoid)

net = Net(X_train.shape[1]) 
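
To get a feel for how small this network is, we can count its trainable parameters by hand. The sketch assumes 3 input features (Humidity3pm, Rainfall, RainToday) and the default bias term in each linear layer; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """A fully connected layer has n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out

n_features = 3  # Humidity3pm, Rainfall, RainToday
layer_sizes = [n_features, 32, 16, 8, 1]
per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

With only a few hundred parameters, the network trains quickly even on the CPU.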

In [None]:
#https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
criterion = nn.BCELoss() #Binary Cross Entropy Loss
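
Binary cross entropy punishes confident wrong predictions much harder than cautious ones. A minimal plain-Python sketch of the formula, averaged over the samples as `nn.BCELoss` does by default; the probabilities are invented and must lie strictly between 0 and 1 in this sketch.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

targets = [1.0, 0.0, 1.0]
confident_right = [0.9, 0.1, 0.8]  # probabilities close to the true labels
confident_wrong = [0.1, 0.9, 0.2]  # probabilities far from the true labels
loss_right = binary_cross_entropy(confident_right, targets)
loss_wrong = binary_cross_entropy(confident_wrong, targets)
print(loss_right, loss_wrong)
```

This is why the network ends in a sigmoid: the loss expects a probability of rain for each sample.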

In [None]:
optimizer = optim.Adam(net.parameters(), lr=0.001) #Optimizer

In [None]:
def calculate_accuracy(y_true, y_pred):
  predicted = y_pred.ge(.5).view(-1)
  return (y_true == predicted).sum().float() / len(y_true)

In [None]:
def round_tensor(t, decimal_places=3):
  return round(t.item(), decimal_places)

In [None]:
for epoch in range(1000):
    y_pred = torch.squeeze(net(X_train))
    train_loss = criterion(y_pred, y_train)
    if epoch % 50 == 0:
        train_acc = calculate_accuracy(y_train, y_pred)
        with torch.no_grad():  # No gradients needed when evaluating on the test set
            y_test_pred = torch.squeeze(net(X_test))
            test_loss = criterion(y_test_pred, y_test)
        test_acc = calculate_accuracy(y_test, y_test_pred)
        print(
f'''epoch {epoch}
Train set - loss: {round_tensor(train_loss)}, accuracy: {round_tensor(train_acc)}
Test  set - loss: {round_tensor(test_loss)}, accuracy: {round_tensor(test_acc)}
''')
    optimizer.zero_grad()  # Reset the gradients from the previous step
    train_loss.backward()  # Backpropagate the training loss
    optimizer.step()  # Update the weights

In [None]:
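with torch.no_grad():  # Inference only, so no gradient tracking
    y_pred_5 = net(X_test)
y_pred_5 = y_pred_5.ge(.5).view(-1).cpu()  # Threshold the predicted probabilities at 0.5
y_test = y_test.cpu()
score_5 = accuracy_score(y_test, y_pred_5)
print(classification_report(y_test, y_pred_5, target_names=classes))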

In [None]:
cm = confusion_matrix(y_test, y_pred_5)
df_cm = pd.DataFrame(cm, index=classes, columns=classes)
hmap = sns.heatmap(df_cm, annot=True, fmt="d")
hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
plt.ylabel('True label')
plt.xlabel('Predicted label');

### 2.6. Results

In [None]:
print(f'Scores for different models: \nLogistic Regression {score_1} \nDecision Tree {score_2} \nRandom Forest {score_3} \nKNeighborsClassifier {score_4} \nNeural Network {score_5}')