<h1 style="background-color:#8fabcd;font-family:newtimeroman;font-size:550%;text-align:center;border-radius: 15px 50px;padding: 5px">Rainfall Prediction</h1>


<center><img src="https://acegif.com/wp-content/uploads/rainy-10.gif" height="500" width="600"></center>

This dataset contains about 10 years of daily weather observations from many locations across Australia.

RainTomorrow is the target variable to predict. It answers the question: did it rain the next day, Yes or No? The column is Yes if the rainfall for that day was 1 mm or more.

<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Importing modules and Loading datasets</h3>

In [None]:
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from scipy.stats import zscore
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score,confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
warnings.filterwarnings('ignore')
data=pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv',parse_dates=['Date'])
data.head()

In [None]:
data.info()

<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Plotting Missing values</h3>

In [None]:
msno.bar(data,color='#ff80ff')

### The plot above shows that 4 features ('Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm') have roughly 40% or more of their values missing, so we drop them. We also drop 'Location' so that the model generalizes across locations.

In [None]:
data.drop(['Evaporation','Sunshine','Cloud9am','Cloud3pm','Location'],axis=1,inplace=True)
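
The missing-value share can also be checked numerically rather than read off the bar chart. The following is a minimal sketch on a small made-up frame; the `toy` frame and the `threshold` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (all values are made up).
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, 7.1, np.nan, 9.0],   # 3 of 5 missing
    "Rainfall": [0.0, 1.2, np.nan, 0.4, 0.0],         # 1 of 5 missing
    "Humidity9am": [71.0, 44.0, 38.0, 45.0, 82.0],    # none missing
})

threshold = 0.4  # assumed cut-off, matching the 40% mentioned above

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(cols_to_drop)
```

On the real data, the same two lines applied to `data` would list the columns that cross the chosen threshold.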

#### Filling missing numerical values with the median of the column, and missing categorical values with the previous row's value (forward fill).

In [None]:
# Filling null values

def fill_na(info):
    cols=info.columns
    for col in cols:
        if info[col].dtype=='object':
            # Categorical: forward fill with the previous row's value
            info[col]=info[col].ffill()
        else:
            # Numerical (and Date): fill with the column median
            info[col]=info[col].fillna(info[col].median())
    return info


In [None]:
cleaned_data=fill_na(data)
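
To see what the two fill strategies do, here is a small sketch on a made-up frame. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one categorical and one numerical column.
toy = pd.DataFrame({
    "WindGustDir": ["N", None, "SE", None],
    "Humidity3pm": [20.0, np.nan, 40.0, 90.0],
})

# Categorical column: forward fill copies the previous row's value.
toy["WindGustDir"] = toy["WindGustDir"].ffill()

# Numerical column: the median of the observed values (20, 40, 90) is 40.
toy["Humidity3pm"] = toy["Humidity3pm"].fillna(toy["Humidity3pm"].median())

print(toy)
```

Note that forward fill cannot fill a missing value in the very first row, because there is no previous value to copy.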

In [None]:
msno.bar(cleaned_data,color='#0099ff')

### As the dataset is very large (about 145k rows), 10% of it is enough for testing our models.

In [None]:
train_data,test_data=train_test_split(cleaned_data,test_size=0.1,random_state=40)
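
As a quick sanity check of the split sizes, the sketch below applies the same `test_size=0.1` to a made-up frame of 1000 rows (the `toy` frame is an assumption for illustration). With about 145k rows, the same ratio leaves roughly 14.5k rows for testing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 1000 rows standing in for the cleaned data.
toy = pd.DataFrame({"x": range(1000)})

# Same ratio and seed as in the cell above.
train, test = train_test_split(toy, test_size=0.1, random_state=40)

print(len(train), len(test))
```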


<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Categorical Features</h3>

In [None]:
categorical=[]
numerical_cols=[]
for col in cleaned_data.columns:
    if cleaned_data[col].dtype=='object':
        categorical.append(col)
    else:
        numerical_cols.append(col)
categorical

In [None]:
plt.figure(figsize=(20,15))
for i in range(3):
    fig=px.pie(train_data,names=train_data[categorical[i]].value_counts().index.values,
               values=train_data[categorical[i]].value_counts(),
               hole=0.3,title='{0}'.format(categorical[i]))
    fig.show()
    

In [None]:
direction_encoder=LabelEncoder()
train_data.WindGustDir=direction_encoder.fit_transform(train_data.WindGustDir)
test_data.WindGustDir=direction_encoder.transform(test_data.WindGustDir)
for col in categorical[1:3]:
    train_data[col]=direction_encoder.fit_transform(train_data[col])
    test_data[col]=direction_encoder.transform(test_data[col])
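
The following sketch shows what `LabelEncoder` does to a direction column, using made-up lists in place of the real columns. It learns the classes from the training values in sorted order and then maps both splits to the same integer codes.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up wind directions for illustration.
train_dirs = ["N", "SE", "W", "N", "E"]
test_dirs = ["W", "E"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_dirs)  # learn classes, then encode
test_codes = encoder.transform(test_dirs)        # reuse the learned mapping

print(list(encoder.classes_))
print(train_codes.tolist())
print(test_codes.tolist())
```

Because the notebook refits `direction_encoder` for every column, after the loop it only holds the mapping of the last column it was fitted on. A label that appears in the test split but not in the training split would make `transform` raise an error.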

In [None]:
px.bar(data_frame=train_data,x=train_data.RainToday.value_counts().index.values,y=train_data.RainToday.value_counts(),
       color=['NO','YES'],title='Did it rain today?')

## The dataset is clearly imbalanced, so we use SMOTE in a later section of this notebook to address it.

In [None]:
px.bar(data_frame=train_data,x=train_data.RainTomorrow.value_counts().index.values,y=train_data.RainTomorrow.value_counts(),
       color=['NO','YES'],title='Will it rain tomorrow?')
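
The imbalance can also be expressed as a class share instead of a bar height. This is a minimal sketch on a made-up label column; the 80/20 split is an assumption for illustration, not the dataset's actual ratio.

```python
import pandas as pd

# Made-up labels: 8 dry days and 2 rainy days.
labels = pd.Series(["No"] * 8 + ["Yes"] * 2, name="RainTomorrow")

# normalize=True returns fractions instead of raw counts.
class_share = labels.value_counts(normalize=True)

print(class_share)
```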


<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Numerical Features</h3>

In [None]:
# Numerical columns

numerical_cols

In [None]:
plt.figure(figsize=(32,32))
# numerical_cols[0] is 'Date', so plotting starts from index 1
for i in range(12):
    plt.subplot(4,3,i+1)
    sns.histplot(train_data[numerical_cols[i+1]],kde=True)
plt.show()

In [None]:
train_data.describe()

## The plots above show that the Rainfall feature is not evenly distributed.<br> It contains outliers that need to be handled.

In [None]:
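# Removing rows whose Rainfall z-score is greater than 3 or less than -3.
# .copy() gives an independent frame, so later column assignments do not act on a slice of train_data.

cleaned_train_data=train_data[abs(zscore(train_data.Rainfall))<3].copy()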
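
As a worked example of the z-score rule, the sketch below uses made-up rainfall readings with one extreme day. The mean is 6.2 and the (population) standard deviation is about 26.1, so the 120 mm reading has a z-score of about 4.4 and is removed.

```python
import numpy as np
from scipy.stats import zscore

# Made-up rainfall readings: 15 dry days, 4 light days and one extreme day.
rainfall = np.array([0.0] * 15 + [1.0] * 4 + [120.0])

# zscore standardizes with the mean and population standard deviation.
z = zscore(rainfall)

# Keep only readings within 3 standard deviations of the mean.
kept = rainfall[np.abs(z) < 3]

print(z.max())
print(len(rainfall), len(kept))
```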

In [None]:
prediction_encoder=LabelEncoder()
cleaned_train_data.RainToday=prediction_encoder.fit_transform(cleaned_train_data.RainToday)
cleaned_train_data.RainTomorrow=prediction_encoder.transform(cleaned_train_data.RainTomorrow)
test_data.RainToday=prediction_encoder.transform(test_data.RainToday)
test_data.RainTomorrow=prediction_encoder.transform(test_data.RainTomorrow)

In [None]:
cleaned_train_data.drop(['Date'],axis=1,inplace=True)

In [None]:
# Correlation heatmap of the encoded training features

plt.figure(figsize=(20,20))
sns.heatmap(cleaned_train_data.corr(),annot=True,vmin=-1)
plt.show()

## The heatmap above shows that some features are highly correlated with others, so we drop features whose correlation with another feature exceeds 0.7.

In [None]:
cleaned_train_data.drop(['MinTemp','MaxTemp','Temp9am','Pressure3pm'],inplace=True,axis=1)

In [None]:
test_data.drop(['MinTemp','MaxTemp','Temp9am','Pressure3pm','Date'],inplace=True,axis=1)
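
Reading correlated pairs off a large heatmap is error-prone, so they can also be listed programmatically. The sketch below is one possible way to do it on a made-up frame; the synthetic columns and the `high_pairs` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp_3pm = rng.normal(22, 5, size=200)

# Made-up data: MaxTemp is nearly a copy of Temp3pm, Humidity3pm is independent.
toy = pd.DataFrame({
    "Temp3pm": temp_3pm,
    "MaxTemp": temp_3pm + rng.normal(0, 1, size=200),
    "Humidity3pm": rng.normal(50, 15, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Collect the pairs whose absolute correlation exceeds 0.7.
high_pairs = [(row, col) for row in upper.index for col in upper.columns
              if upper.loc[row, col] > 0.7]

print(high_pairs)
```

One feature from each listed pair can then be dropped, as done above for the temperature and pressure columns.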


<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">SMOTE: Resolving the class imbalance</h3>

In [None]:
oversample=SMOTE()
train_inputs,train_output=oversample.fit_resample(cleaned_train_data.drop(['RainToday'],axis=1),cleaned_train_data.RainToday)
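
SMOTE balances the classes by creating synthetic minority samples on the line segment between an existing minority point and one of its neighbours. The sketch below illustrates only that interpolation step with plain NumPy. It is a simplification, not imblearn's implementation: it picks a random pair of points instead of a nearest neighbour, and the `synthesize` helper and the `minority` points are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(points, n_new, rng):
    """Create n_new points, each on the segment between two distinct originals."""
    new_points = []
    for _ in range(n_new):
        # Simplification: any two distinct points, not a nearest neighbour.
        i, j = rng.choice(len(points), size=2, replace=False)
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=5, rng=rng)
print(synthetic)
```

Because every synthetic point is an interpolation, its feature values lie between those of the two originals, including for integer-encoded columns.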

In [None]:

fig=px.pie(train_output,names=prediction_encoder.inverse_transform([0,1]),values=train_output.value_counts(),
   hole=0.3,title='Did it rain today?')
fig.show()

In [None]:
train_x,train_y1,train_y2=train_inputs.iloc[:,:11],train_output,train_inputs.iloc[:,11]
test_x,test_y1,test_y2=test_data.iloc[:,:11],test_data.iloc[:,11],test_data.iloc[:,12]


<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Model Selection</h3>

<h5 style="background-color:#ff4dff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Logistic Regression</h5>

In [None]:
modelLR=LogisticRegression()
modelLR.fit(train_x,train_y1)

In [None]:

print('Classification Report on training data\n',classification_report(train_y1,modelLR.predict(train_x)))
print('Classification Report on validation data\n',classification_report(test_y1,modelLR.predict(test_x)))

In [None]:
print('Train Accuracy of Logistic Regression model is {0} %'.format(round(accuracy_score(train_y1,modelLR.predict(train_x))*100,2)))
print('Validation Accuracy of Logistic Regression model is {0} %'.format(round(accuracy_score(test_y1,modelLR.predict(test_x))*100,2)))

In [None]:
plt.figure(figsize=(10,8))
plt.title('CONFUSION MATRIX OF TEST DATA-SET IN LOGISTIC REGRESSION')
sns.heatmap(confusion_matrix(test_y1,modelLR.predict(test_x)),annot=True,fmt='d')
plt.show()
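
The heatmap shows raw counts; accuracy, precision and recall follow directly from its four cells. The sketch below works through this on made-up labels (the `y_true` and `y_pred` lists are assumptions for illustration).

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (0 = no rain, 1 = rain).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted rain days that were rainy
recall = tp / (tp + fn)     # share of rainy days that were predicted

print(cm)
print(accuracy, precision, recall)
```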


<h5 style="background-color:#ff4dff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">XGBoost</h5>

In [None]:
model_xgb=XGBClassifier()
model_xgb.fit(train_x,train_y1)

In [None]:
print('Classification Report on training data\n',classification_report(train_y1,model_xgb.predict(train_x)))
print('Classification Report on validation data\n',classification_report(test_y1,model_xgb.predict(test_x)))

In [None]:
print('Train Accuracy of XGBoost model is {0} %'.format(round(accuracy_score(train_y1,model_xgb.predict(train_x))*100,2)))
print('Validation Accuracy of XGBoost model is {0} %'.format(round(accuracy_score(test_y1,model_xgb.predict(test_x))*100,2)))

In [None]:
plt.figure(figsize=(10,8))
plt.title('CONFUSION MATRIX OF TEST DATA-SET IN XGBOOST')
sns.heatmap(confusion_matrix(test_y1,model_xgb.predict(test_x)),annot=True,fmt='d')
plt.show()


<h5 style="background-color:#ff4dff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Random Forest</h5>

In [None]:
model_random_forest=RandomForestClassifier(max_depth=10)
model_random_forest.fit(train_x,train_y1)

In [None]:

print('Classification Report on training data\n',classification_report(train_y1,model_random_forest.predict(train_x)))
print('Classification Report on validation data\n',classification_report(test_y1,model_random_forest.predict(test_x)))

In [None]:
print('Train Accuracy of Random Forest model is {0} %'.format(round(accuracy_score(train_y1,model_random_forest.predict(train_x))*100,2)))
print('Validation Accuracy of Random Forest model is {0} %'.format(round(accuracy_score(test_y1,model_random_forest.predict(test_x))*100,2)))

In [None]:
plt.figure(figsize=(10,8))
plt.title('CONFUSION MATRIX OF TEST DATA-SET IN RANDOM FOREST')
sns.heatmap(confusion_matrix(test_y1,model_random_forest.predict(test_x)),annot=True,fmt='d')
plt.show()

### XGBoost and Random Forest have about the same accuracy, so we use XGBoost to predict whether it rains today.


<h3 style="background-color:#00ffff;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Preparing data to predict whether it will rain tomorrow</h3>

In [None]:
train_x2=pd.concat([train_x,train_y1],axis=1)

In [None]:
# Using the RainToday value predicted by the first model as an input feature for predicting RainTomorrow.

test_y1_pred=pd.Series(model_xgb.predict(test_x),name='RainToday')
test_x2=pd.concat([test_x.reset_index(drop=True),test_y1_pred.reset_index(drop=True)],axis=1)
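
The two-stage idea can be sketched on made-up data: the second model is trained with the true RainToday labels, but at test time it receives the first model's predictions instead. In this sketch `DecisionTreeClassifier` is only a lightweight stand-in for XGBoost, and the synthetic `Humidity3pm` rule is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Made-up data: both targets depend on a single humidity feature.
features = pd.DataFrame({"Humidity3pm": rng.uniform(10, 100, n)})
rain_today = (features["Humidity3pm"] > 60).astype(int)
rain_tomorrow = (features["Humidity3pm"] > 70).astype(int)

train_X = features.iloc[:150]
test_X = features.iloc[150:].reset_index(drop=True)
y1_train, y2_train = rain_today.iloc[:150], rain_tomorrow.iloc[:150]

# Stage 1: predict RainToday from the weather features.
stage1 = DecisionTreeClassifier(max_depth=2, random_state=0).fit(train_X, y1_train)

# Stage 2 inputs: true RainToday for training, predicted RainToday for testing.
train_X2 = train_X.assign(RainToday=y1_train)
test_X2 = test_X.assign(RainToday=stage1.predict(test_X))

# Stage 2: predict RainTomorrow from features plus RainToday.
stage2 = DecisionTreeClassifier(max_depth=2, random_state=0).fit(train_X2, y2_train)
pred_tomorrow = stage2.predict(test_X2)

print(test_X2.head())
print(pred_tomorrow[:10])
```

Keeping the column order identical between `train_X2` and `test_X2` matters, because the second model expects the same feature layout it was trained on.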

In [None]:
xgbmodel_model2=XGBClassifier()
xgbmodel_model2.fit(train_x2,train_y2)

In [None]:

print('Classification Report on training data\n',classification_report(train_y2,xgbmodel_model2.predict(train_x2)))
print('Classification Report on validation data\n',classification_report(test_y2,xgbmodel_model2.predict(test_x2)))

In [None]:
print('Train Accuracy for predicting whether it will rain tomorrow is {0} %'.format(round(accuracy_score(train_y2,xgbmodel_model2.predict(train_x2))*100,2)))
print('Validation Accuracy for predicting whether it will rain tomorrow is {0} %'.format(round(accuracy_score(test_y2,xgbmodel_model2.predict(test_x2))*100,2)))


<h3 style="background-color:#12abcd;font-family:newtimeroman;font-size:400%;text-align:center;border-radius: 15px 30px;padding: 3px">Conclusion</h3>

## Both models worked quite well: the first predicts whether it rains today with an accuracy of about 99.3%, and the second predicts whether it will rain tomorrow with an accuracy of about 84.67%.

<font style='color:red' size=5><center>Hope you liked this notebook. If so, please upvote it! If you have any suggestions or questions, feel free to ask in the comment section.</center></font>