Requirements
1. Dataset must contain more than 5 features and have more than 500 observations. ✔
2. Analysis should only be based on ONE question ✔
3. Program must use two different machine learning algorithms ✔
4. 
i  Print out the original question that you are asking. ✔
ii  Visualise the data and prediction/classification if relevant. ✔
iii Print out the answer to the question obtained from both models. ✔
iv  Print out the accuracy or error of the ML models. ✔
5. Mark-up cells must give a detailed explanation of why you picked the specific 
algorithms. ✔
6. A comparison of the results obtained with both models and a reflection 
on why you think one model is better than the other. ✔
7. Your code must be explained in comments as part of the code cells. ✔

Import all the Python functions I need. Code comments appear here and throughout the code to satisfy requirement 7.

In [None]:
#data handling and plotting
import pandas as pd
import numpy as np
import seaborn as sns
#preprocessing: imputers and scalers
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#pipelines, train/test splitting and grid search
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
#classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

In [None]:
weather  = pd.read_csv("/kaggle/input/australia-weather/australia weather.csv")  #Load the dataset using pandas read_csv

Take a look at the dataframe information. We have 145460 entries (index 0 to 145459) and 23 columns in total, which satisfies requirement 1.

In [None]:
weather.info()  #info() prints the summary itself, so wrapping it in print() would only add a stray "None"

Let's look at the first 5 rows of the data.

In [None]:
weather.head()

This is a lot of data. I will look for correlations using a pairplot and then decide which columns to keep. I will do this before looking at the nulls. This plot takes a long time to run.

In [None]:
sns.pairplot(weather)
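Since the pairplot is slow on the full dataframe, one option is to plot a random sample of rows instead. The sketch below is only an illustration under assumptions: `demo` is a synthetic stand-in for `weather`, the sample size of 1000 is an arbitrary choice, and it assumes a random sample shows roughly the same relationships as the full data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real `weather` dataframe (illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(10_000, 4)),
    columns=["MinTemp", "MaxTemp", "Humidity3pm", "Pressure3pm"],
)

# Draw a fixed-size random sample; random_state makes the sample repeatable.
sample = demo.sample(n=1_000, random_state=0)
print(sample.shape)

# The sample could then be passed to seaborn instead of the full dataframe:
# sns.pairplot(sample)
```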

My question will be: Can we predict whether it will rain tomorrow based on this dataset? This satisfies requirement 2.

The aim is to use two different machine learning algorithms to create two regression or classification models over the same dataset. With this in mind, I will reduce my dataset to a manageable size by dropping the following columns:

In [None]:
#drop the columns I will not use as features
weather = weather.drop(columns=['Date', 'Location', 'Rainfall', 'Evaporation', 'Sunshine',
                                'WindGustDir', 'WindDir9am', 'WindDir3pm', 'WindSpeed3pm',
                                'RainToday'])

I will check how many nulls I have

In [None]:
weather.isnull().sum()

I could use imputation to replace the nulls, but for the moment I will delete the rows that contain nulls, as my dataset is large and I can afford to lose some data.

In [None]:
weather  = weather.dropna(axis = 0, how ='any')
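As an alternative to dropping rows, the nulls could be filled in with `SimpleImputer`. This is a minimal sketch, not what the notebook does: the small `demo` frame and its values are made up for illustration, and the median strategy is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values standing in for two numeric weather columns.
demo = pd.DataFrame({
    "Humidity3pm": [22.0, np.nan, 30.0, 16.0],
    "Pressure3pm": [1007.1, 1007.8, np.nan, 1012.8],
})

# Replace each NaN with the median of its own column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(filled)
```

In a real workflow the imputer would normally be a pipeline step, so that the medians are learned from the training data only.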

In [None]:
weather.isnull().values.any()  #I will check if all nulls are gone

In [None]:
weather.head()  #I will take another look at the data now

I will use 'RainTomorrow' as my target

In [None]:
y = weather.pop('RainTomorrow')
y

In [None]:
sns.heatmap(weather.corr(), annot=True)

In [None]:
#split the data into 80% training and 20% test sets using train_test_split from sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(weather, y, test_size=0.2)
X_train.shape, X_test.shape
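The split above is random, so the scores will change slightly on every run. A possible refinement, sketched here on synthetic data, is to pass `random_state` for repeatability and `stratify` so that both sets keep the same share of each class. The arrays `X_demo` and `y_demo` and the seed 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 20% labelled "Yes".
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array(["Yes"] * 200 + ["No"] * 800)

# random_state fixes the shuffle; stratify keeps the class proportions in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print("Share of 'Yes' in test set:", (y_te == "Yes").mean())
```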

I will use pipelines to create my models and compare results.
All my input data is numeric, so I don't have to encode any columns.
I have no NaN values, so I do not need to impute any input columns. My output, however, is categorical, so I will use a classifier in my pipeline.

1st Model - I will use StandardScaler and KNeighborsClassifier (manhattan and euclidean metrics with a selection of neighbours). Then I will use MinMaxScaler and compare. This will satisfy part of requirement 3.

In [None]:
#here I set up for Standard scaler with kNN
kNNpipeSS  = Pipeline(steps=[('scaler', StandardScaler()),
                             ('classifier', KNeighborsClassifier())])

In [None]:
#here I set up for minmax scaler with kNN
kNNpipeMM  = Pipeline(steps=[('scaler', MinMaxScaler()),
                             ('classifier', KNeighborsClassifier())])

In [None]:
#here I set my classifier metrics
param_grid = {'classifier__n_neighbors':[5,10,15,20], 
              'classifier__metric':['manhattan','euclidean']}
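The keys in `param_grid` follow the scikit-learn convention `<step name>__<parameter name>`, which is how the grid search reaches parameters inside a pipeline step. The following sketch shows this on synthetic data from `make_classification`; the data and the smaller grid are assumptions chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example runnable.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", KNeighborsClassifier())])

# "classifier__n_neighbors" means: parameter n_neighbors of the step named "classifier".
grid = {"classifier__n_neighbors": [5, 10],
        "classifier__metric": ["manhattan", "euclidean"]}

search = GridSearchCV(pipe, grid).fit(X_demo, y_demo)
print(search.best_params_)

# All tunable names can be listed with get_params().
print([name for name in pipe.get_params() if name.startswith("classifier__")])
```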

In [None]:
#use grid for knn and standard scaler
pipe_knn_ss = GridSearchCV(kNNpipeSS, param_grid, verbose = 1)

In [None]:
#use grid for knn and minmax scaler
pipe_knn_mm = GridSearchCV(kNNpipeMM, param_grid, verbose = 1)

In [None]:
#fit the training data
pipe_knn_ss = pipe_knn_ss.fit(X_train, y_train)

In [None]:
#the best combinations are:
pipe_knn_ss.best_params_

In [None]:
#the accuracy for this combination is:
y_pred_gs = pipe_knn_ss.predict(X_test)
print("Accuracy for KNN and standard scaler is: ", pipe_knn_ss.score(X_test,y_test))
print('Can we predict the weather tomorrow? Yes, to {} accuracy'.format(pipe_knn_ss.score(X_test, y_test)) ) #satisfies req. 4.iii

In [None]:
#fit the training data for minmax scaler
pipe_knn_mm = pipe_knn_mm.fit(X_train, y_train)

In [None]:
#the best combinations are:
pipe_knn_mm.best_params_

In [None]:
#the accuracy for this combination is:
y_pred_gs = pipe_knn_mm.predict(X_test)
print("Accuracy for knn and min max scaler: ",pipe_knn_mm.score(X_test,y_test))
print('Can we predict the weather tomorrow? Yes, to {} accuracy'.format(pipe_knn_mm.score(X_test, y_test)) ) #satisfies req. 4.iii
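To judge whether an accuracy figure is good, it helps to compare it with a baseline that always predicts the most common class. The sketch below shows the idea with `DummyClassifier` on synthetic, imbalanced data. The data and the 80/20 class weights are assumptions, so the printed number says nothing about the weather dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 80% of rows belong to class 0.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print("Baseline accuracy:", baseline_acc)
```

A model is only useful if its accuracy is clearly above this baseline.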

2nd Model - I will use standardscaler and DecisionTreeClassifier (entropy and gini). Then I will use minmax scaler and compare. This will satisfy part of requirement 3

In [None]:
# here I set up for Standard scaler with decision tree
treepipeSS  = Pipeline(steps=[('scaler', StandardScaler()),
                             ('classifier', DecisionTreeClassifier())])

In [None]:
#here I set up for minmax scaler with decision tree
treepipeMM  = Pipeline(steps=[('scaler', MinMaxScaler()),
                             ('classifier', DecisionTreeClassifier())])
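A decision tree splits on a threshold of one feature at a time, so rescaling the features is generally not expected to change its predictions much, unlike kNN, which depends on distances. The sketch below checks this on synthetic data. The data, `max_depth=3` and the seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to make the example runnable.
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same tree settings, once on raw features and once behind a StandardScaler.
plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scaled = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0)
).fit(X_tr, y_tr)

# Share of test rows on which both trees give the same prediction.
agreement = (plain.predict(X_te) == scaled.predict(X_te)).mean()
print("Share of identical predictions:", agreement)
```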

In [None]:
#set classifier metrics
param_grid = {'classifier__criterion':['entropy', 'gini'],
             'classifier__min_samples_split': range(2, 403, 10)
             }
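This grid is much larger than the kNN one: `range(2, 403, 10)` gives 41 values, and with 2 criteria that makes 82 combinations. The sketch below counts them with `ParameterGrid`. The fold count of 5 is an assumption based on the default `cv` of `GridSearchCV` in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

grid = {"classifier__criterion": ["entropy", "gini"],
        "classifier__min_samples_split": range(2, 403, 10)}

# Number of parameter combinations the grid search will try.
n_candidates = len(ParameterGrid(grid))

# Assumed number of cross-validation folds (default cv in recent scikit-learn versions).
n_folds = 5
print("Candidates:", n_candidates)
print("Model fits during the search:", n_candidates * n_folds)
```

With these numbers each tree grid search fits 410 models, plus one final refit on the whole training set.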

In [None]:
#use grid for decision tree and standard scaler
pipe_tree_SS = GridSearchCV(treepipeSS, param_grid, verbose = 1)

In [None]:
#use grid for decision tree and minmax scaler
pipe_tree_MM = GridSearchCV(treepipeMM, param_grid, verbose = 1)

In [None]:
#fit train data for standard scaler
pipe_tree_SS = pipe_tree_SS.fit(X_train, y_train)

In [None]:
#best combinations are
pipe_tree_SS.best_params_

In [None]:
#the accuracy for this combination is:
y_pred_gs = pipe_tree_SS.predict(X_test)
print("Accuracy for decision tree and standard scaler: ",pipe_tree_SS.score(X_test,y_test))
print('Can we predict the weather tomorrow? Yes, to {} accuracy'.format(pipe_tree_SS.score(X_test, y_test)) ) #satisfies req. 4.iii

In [None]:
#fit train data for min max scaler
pipe_tree_MM = pipe_tree_MM.fit(X_train, y_train)

In [None]:
#best combinations are
pipe_tree_MM.best_params_

In [None]:
#the accuracy for this combination is:
y_pred_gs = pipe_tree_MM.predict(X_test)
print("Accuracy for decision tree and min max scaler: ",pipe_tree_MM.score(X_test,y_test))
print('Can we predict the weather tomorrow? Yes, to {} accuracy'.format(pipe_tree_MM.score(X_test, y_test)) ) #satisfies req. 4.iii

#### To satisfy requirement 4: all results are similar, with KNN/standard scaler just ahead with a score of 0.844.
I will plot the scores for a visual comparison, which satisfies requirement 6.

In [None]:
#collect the test accuracy of the standard scaler version of each model
models_default_scores = {
    'K-Nearest Neighbors' : pipe_knn_ss.score(X_test, y_test),
    'Decision Tree' : pipe_tree_SS.score(X_test, y_test),
    }

In [None]:
#put the scores in a dataframe and plot them as a bar chart, one bar per model
default_models_compare = pd.DataFrame(models_default_scores, index=['accuracy'])
default_models_compare.T.plot.bar()

The next stage would be to create a function instead of repeating the steps for each model.
As I had a large dataset, I could afford to lose some data, since the purpose was to create a model. With more experience and time I would have done more preprocessing and dealt with the missing data differently.
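One way the function mentioned above could look is sketched here. The name `fit_and_score`, the synthetic data and the reduced grids are all hypothetical and only illustrate the idea. It is not the code used for the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(scaler, classifier, param_grid, X_train, X_test, y_train, y_test):
    """Grid-search a scaler + classifier pipeline; return (best params, test accuracy)."""
    pipe = Pipeline(steps=[("scaler", scaler), ("classifier", classifier)])
    search = GridSearchCV(pipe, param_grid).fit(X_train, y_train)
    return search.best_params_, search.score(X_test, y_test)


# Synthetic data, used only to make the example runnable.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Reduced grids to keep the example fast.
knn_grid = {"classifier__n_neighbors": [5, 10]}
tree_grid = {"classifier__criterion": ["entropy", "gini"]}

# Run every scaler/model combination through the same helper.
results = {}
for scaler_name, scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    results[("knn", scaler_name)] = fit_and_score(
        scaler, KNeighborsClassifier(), knn_grid, X_tr, X_te, y_tr, y_te)
    results[("tree", scaler_name)] = fit_and_score(
        scaler, DecisionTreeClassifier(random_state=0), tree_grid, X_tr, X_te, y_tr, y_te)

for key, (params, acc) in results.items():
    print(key, params, round(acc, 3))
```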
