# Tomorrow Rain Prediction in Australia 



<img src = "https://www.skymetweather.com/content/wp-content/uploads/2020/01/RaininJanuary.jpg" height=500 width=500 style="margin : auto;">

> ## About 

> 1. **Supervised classification problem**
> 2. **Model: KNeighborsClassifier**
> 3. **Feature selection: ExtraTreesClassifier feature importances**
> 4. **Model accuracy: about 84%**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

> ## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

> ## Read Data

In [None]:
df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")
df.head()

> **Convert column names to lowercase for ease of use**

In [None]:
l_columns = [x.lower() for x in df.columns]
df.columns = l_columns

> ## Handle Null Values


In [None]:
name = []
null = []
for i in df.columns:
    name.append(i)
    null.append(df[i].isnull().sum() / len(df))
    
null_desc = pd.DataFrame({"col_name" : name, "null_per": null})
null_desc.sort_values(by="null_per", ascending=False)
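The per-column null fraction computed by the loop above can also be obtained without explicit iteration, since the column-wise mean of `isnull()` is exactly that fraction. A minimal sketch on a toy frame (the values are made up for illustration and are not from the weather data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (made-up values).
toy = pd.DataFrame({
    "mintemp": [10.0, np.nan, 12.5, 9.0],
    "sunshine": [np.nan, np.nan, np.nan, 7.2],
    "raintoday": ["No", "Yes", None, "No"],
})

# isnull() yields a boolean frame; its column-wise mean is the null fraction.
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```

On the real data, `df.isnull().mean()` should give the same figures as the `null_per` column.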

> **Drop columns in which more than 20% of the values are null**

In [None]:
drop_col = null_desc[null_desc.null_per > 0.20].col_name.values
df.drop(drop_col, axis = 1, inplace=True)

> **Split the columns into two groups: categorical and numeric**

In [None]:
catogrical = [x for x in df.columns if df[x].dtype == "object"]
numeric = [x for x in df.columns if df[x].dtype == "float64"]

In [None]:
df[catogrical].isnull().sum()

> **Fill null values in categorical features with the most frequent value (the mode)**

In [None]:
for i in catogrical:
    # Assign back: fillna(inplace=True) on a column selection may not update df in recent pandas
    df[i] = df[i].fillna(df[i].mode()[0])

> **Fill null values in numeric features with the column mean**

In [None]:
for i in numeric:
    # Assign back: fillna(inplace=True) on a column selection may not update df in recent pandas
    df[i] = df[i].fillna(df[i].mean())

In [None]:
df.isnull().sum()
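To make the two fill strategies concrete, here is a small sketch on a toy frame. The column names are borrowed from the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up values).
toy = pd.DataFrame({
    "windgustdir": ["N", "N", None, "SE"],
    "humidity9am": [70.0, np.nan, 50.0, 60.0],
})

# Categorical: fill with the mode. mode() returns a Series, so take its first entry.
toy["windgustdir"] = toy["windgustdir"].fillna(toy["windgustdir"].mode()[0])

# Numeric: fill with the column mean, which is computed over the non-null values only.
toy["humidity9am"] = toy["humidity9am"].fillna(toy["humidity9am"].mean())

print(toy)
```

The missing direction becomes `N` (the most frequent value) and the missing humidity becomes 60.0 (the mean of 70, 50, and 60).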

> ## Feature Engineering

> **Convert the date column to datetime format**

In [None]:
df["date"] = pd.to_datetime(df.date)

In [None]:
for i in catogrical:
    print("{} unique = {}".format(i, df[i].nunique()))

> **A column with many unique values, such as the date, is hard to encode and tends to generalize poorly, so split the date column into three new columns: year, day, and month**

In [None]:
df["year"] = df["date"].dt.year
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
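As a quick illustration of what the `.dt` accessor returns, the sketch below decomposes two arbitrary dates (chosen only for the example) into the same three parts:

```python
import pandas as pd

# Two arbitrary dates, parsed the same way as the date column.
dates = pd.to_datetime(pd.Series(["2008-12-01", "2017-06-25"]))

# Each .dt attribute returns an integer Series aligned with the input.
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
})
print(parts)
```

Each date with thousands of distinct values is thereby replaced by three low-cardinality integer columns.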

In [None]:
df.drop(["date", "location"],axis = 1, inplace=True)

> **One-hot encode the categorical columns**

In [None]:
dummies = pd.get_dummies(df[['windgustdir','winddir9am','winddir3pm','raintoday','raintomorrow']], drop_first=True)
df1 = pd.concat([df, dummies], axis=1)
df1.drop(['windgustdir','winddir9am','winddir3pm','raintoday','raintomorrow'], axis = 1, inplace = True)

In [None]:
df1.rename(columns={"raintoday_Yes" : "raintoday", "raintomorrow_Yes" : "raintomorrow"}, inplace = True)
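The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up frame with two of the dataset's column names; it is an illustration, not the notebook's actual data.

```python
import pandas as pd

# Made-up frame: one three-level column and one Yes/No column.
toy = pd.DataFrame({
    "winddir9am": ["N", "S", "E", "N"],
    "raintomorrow": ["No", "Yes", "No", "No"],
})

# drop_first=True drops the first level (in sorted order) of each column,
# so a k-level column becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, columns=["winddir9am", "raintomorrow"], drop_first=True)
print(encoded.columns.tolist())
```

A Yes/No column therefore collapses to a single `_Yes` indicator, which is why the rename above can map `raintomorrow_Yes` back to `raintomorrow`.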

> **Split data into features (X) and target (y)**

In [None]:
# raintomorrow is the last column of df1 after the concat above, so split by position
X = df1.iloc[:, :-1]
y = df1.iloc[:, -1]

> **Check Feature Importance** 

In [None]:
model = ExtraTreesClassifier()
model.fit(X,y)

In [None]:
plt.figure(figsize=(10, 35))
feature_rank = pd.Series(model.feature_importances_, index = X.columns)
feature_rank.sort_values().plot(kind = "barh")

> **Select the 16 most important columns for prediction**

In [None]:
imp_columns = feature_rank.nlargest(16).index
X = df1[imp_columns]
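The ranking-and-selection step can be tried in isolation. The sketch below uses synthetic data from `make_classification` and a fixed `random_state`; both are assumptions made for a reproducible illustration and are not part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption for illustration, not the weather dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and normalized to sum to 1.
importance = pd.Series(forest.feature_importances_, index=X_demo.columns)

# Keep only the three highest-ranked columns.
top3 = importance.nlargest(3).index
X_selected = X_demo[top3]
print(importance.sort_values(ascending=False))
```

Because the trees are randomized, the ranking can shift slightly between runs unless `random_state` is fixed.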

> ## Model Building

In [None]:
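# Hold out 20% of the rows for testing (no random_state is set, so the split differs between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# k-nearest neighbours classifier with k = 10
knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_train, y_train)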

> ## Check Accuracy

In [None]:
predict = knn.predict(X_test)
acc = accuracy_score(y_test, predict)
acc
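An accuracy figure is easier to interpret next to the accuracy of always predicting the most common class, especially if one class (for example, dry days) dominates. A minimal sketch with hypothetical labels and predictions, which are not results from this notebook:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical test labels and predictions (1 = rain tomorrow); not real results.
y_true = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

model_acc = accuracy_score(y_true, y_pred)

# Accuracy obtained by always predicting the most common class.
baseline_acc = y_true.value_counts(normalize=True).max()

print(f"model: {model_acc:.2f}, majority-class baseline: {baseline_acc:.2f}")
```

In the notebook, the same baseline could be computed as `y_test.value_counts(normalize=True).max()` and compared with `acc`.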