# Australian rain data.

Hello Kaggle. 

This will be my first attempt at reading in and working with data. 
I graduated in mathematics and chemistry, but I never considered computer science and wasn't even slightly aware of data science as an academic field. After a bit of digging I found the world of data science. To date I've started a few Udemy tutorials and read a few kernels, but I haven't shared any of my attempts. This is my first shared kernel, so please feel free to offer critique or insight, or to point out errors of understanding and better ways to do things.

## Importing libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Read in the data

In [None]:
rawData = pd.read_csv("../input/weatherAUS.csv")

## Investigate, understand, and clean the data

### Investigating
We will begin investigating the data by viewing it and gathering information about its shape and the values in it.
We will first have a quick look at the first few rows, then check its shape.

In [None]:
rawData.head()

In [None]:
rawData.shape

### Investigate: Brief analysis
The dataset contains 24 columns, the last of which records whether or not it rained the next day.
Each entry describes a given day at a given location and contains:

1. Date: The date of the observation.
2. Location: The location of the weather station.
#### Temperature
3. MinTemp: The minimum temperature of that day in degrees Celsius.
4. MaxTemp: The maximum temperature of that day in degrees Celsius.
#### Volume difference
5. Rainfall: The volume of rainfall recorded for that day in mm.
6. Evaporation: The Class A pan evaporation in mm.
#### Sunshine
7. Sunshine: The number of hours of sunshine in the given day.
#### Wind
8. WindGustDir: The direction of the strongest wind gust in that given day (categorical).
9. WindGustSpeed: The speed of the strongest wind gust in that given day.
10. WindDir9am: The wind direction at 9 am (categorical).
11. WindDir3pm: The wind direction at 3 pm (categorical).
12. WindSpeed9am: The wind speed at 9 am.
13. WindSpeed3pm: The wind speed at 3 pm.
#### Humidity and pressure
14. Humidity9am: The humidity at 9 am.
15. Humidity3pm: The humidity at 3 pm.
16. Pressure9am: The pressure at 9 am.
17. Pressure3pm: The pressure at 3 pm.
#### Overhead cloud recording, sky obstruction/cloud volume
18. Cloud9am: The fraction of the sky obstructed by clouds at 9 am on a given day, measured in eighths.
19. Cloud3pm: The fraction of the sky obstructed by clouds at 3 pm on a given day, measured in eighths.
#### Temperature difference
20. Temp9am: The temperature at 9 am.
21. Temp3pm: The temperature at 3 pm.
####  Misc.
22. RainToday:  Whether or not it rained on the day of the observation
23. RISK_MM: The amount of rain recorded on the following day in mm. Because it is derived from the next day's weather, it leaks the target and is dropped below.
#### Target
24. RainTomorrow: Did the next day have rain recorded?

### Investigation data count
We have 142,193 entries and 24 columns (23 candidate features plus the target).
We will start by looking at the null entries and decide how to handle them on a case-by-case basis.

In [None]:
rawData.isnull().sum()
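
Raw null counts can be hard to compare at a glance, so it may help to look at the fraction of missing entries per column as well. The following is a minimal sketch on a made-up frame; `toyData` and its values are assumptions for illustration and are not taken from weatherAUS.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rawData (hypothetical values, not the real dataset).
toyData = pd.DataFrame({
    'MinTemp': [13.4, np.nan, 12.9, 9.2],
    'Sunshine': [np.nan, np.nan, np.nan, 7.1],
    'RainToday': ['No', 'No', None, 'Yes'],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of missing entries in each column.
missingFraction = toyData.isnull().mean().sort_values(ascending=False)
print(missingFraction)
```

The same expression applied to `rawData` would rank the real columns from most to least incomplete.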

In [None]:
workingData = rawData.drop(['Evaporation', 'Location', 'Date', 'Sunshine', 'Cloud9am', 'Cloud3pm','RISK_MM', 'RainTomorrow'], axis=1).copy()
Y = rawData['RainTomorrow'].copy()
#We will take this opportunity to also map our categorical values.
Y = Y.map({'Yes':1, 'No':0})
Y = pd.DataFrame(Y, columns = ['RainTomorrow'])
workingData['RainToday'] = workingData['RainToday'].map({'Yes':1, 'No':0})

We will briefly re-examine the dataset.

In [None]:
workingData.isnull().sum()

For numerical features with fewer than 10,000 missing values, we will replace the missing values with the column mean. We will drop features that have more than 10,000 missing values.

In [None]:
workingData = workingData.drop(['Pressure9am', 'Pressure3pm','WindDir9am'], axis = 1)

#Fill the missing data with numerical averages.
meanMatrix = workingData.drop(['WindDir3pm', 'RainToday', 'WindGustDir'], axis=1).copy()
meanWorkingData = meanMatrix.fillna(meanMatrix.mean())
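
To make the mean replacement concrete, here is a small sketch of what `fillna(mean())` does column by column. The frame `toyNumeric` and its numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with one gap each.
toyNumeric = pd.DataFrame({
    'MinTemp': [10.0, np.nan, 14.0],
    'Rainfall': [0.0, 3.0, np.nan],
})

# mean() skips NaN by default, so each gap is filled with the mean
# of the observed values in its own column.
toyFilled = toyNumeric.fillna(toyNumeric.mean())
print(toyFilled)
```

Each column is filled with its own mean, so features on different scales do not affect one another.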

For the missing categorical values, we would need to determine whether each value is missing at random or not in order to replace it correctly. Otherwise, we can discard them, either by removing the feature entirely or by removing the affected observations. In this instance we will remove the wind direction variables, since we are also dropping the location for this analysis.

In [None]:
#We will replace the observations that don't have a reading for the 'RainToday' feature with the most popular observation for that feature.
rawData['RainToday'].value_counts()

In [None]:
#So we will replace all missing RainToday values with 'No':0
meanWorkingData['RainToday'] = workingData['RainToday'].fillna(0)
#Brief investigation.
meanWorkingData.head(10)
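
The most common value can also be looked up in code rather than read off the counts and typed in by hand. This is a sketch on an invented series; `toyRain` is an assumption and only mimics the 0/1 encoding used for RainToday.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 column with two missing readings.
toyRain = pd.Series([0, 0, 1, np.nan, 0, np.nan], name='RainToday')

# mode() ignores NaN and returns the most frequent value(s);
# [0] takes the first one if there is a tie.
mostCommon = toyRain.mode()[0]
toyRainFilled = toyRain.fillna(mostCommon)
print(mostCommon)
print(toyRainFilled.tolist())
```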

In [None]:
#Ensure all Nulls are gone
meanWorkingData.isnull().sum()

The data is now ready to be scaled. We will use the sklearn preprocessing library to scale the data.

In [None]:
from sklearn import preprocessing
xScaled = preprocessing.scale(meanWorkingData.drop(['RainToday'], axis = 1))



#xScaled['RainToday'] = meanWorkingData['RainToday'].values
xScaled = pd.DataFrame(xScaled, columns=meanWorkingData.drop(['RainToday'], axis =1).columns)
xScaled['RainToday'] = meanWorkingData['RainToday']
#We will briefly visualise the correlation matrix to investigate if there are (m)any relationships.
#We can expect the temperature features to be related, since those values stay fairly close to one another,
#and similarly the wind speeds, since they should not change drastically within a day.
sns.heatmap(data = xScaled.corr(), cmap = 'Blues')
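
As a quick check on what the scaling step produces: as far as I understand it, `preprocessing.scale` standardises each column to zero mean and unit variance rather than squeezing it into a fixed range. The sketch below uses a made-up skewed column (`toyValues`) to show that scaled values can fall outside -1 and 1.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical skewed column: mostly zeros with one large value,
# loosely similar in shape to a rainfall column.
toyValues = np.array([[0.0], [0.0], [0.0], [0.0], [10.0]])

toyScaled = preprocessing.scale(toyValues)

print(toyScaled.mean(axis=0))  # approximately 0
print(toyScaled.std(axis=0))   # approximately 1
print(toyScaled.max())         # larger than 1 for this skewed column
```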

## We will now attempt several types of Classification
### Feature Selection
We will select the k best features of our remaining feature set. We will fit models with 1, 3, 5, 7, 9, and all 11 features and compare their predictive power.
Since we standardised our data to zero mean and unit variance, it contains negative values, so we cannot use the chi2 test, which requires non-negative features. We can compare ANOVA F-scores instead.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

nFeatures = [1, 3, 5, 7, 9, 11]
xSelectors = []
xFeatures = []

for feature in nFeatures:
    xSelectionModel = SelectKBest(f_classif, k=feature).fit(xScaled, Y['RainTomorrow'])
    xSelectors.append(xSelectionModel)
    
for selector in xSelectors:
    featureList = xScaled.columns[selector.get_support()]
    xFeatures.append(featureList)
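
It can be useful to look at the scores behind the selection and not only at which columns were kept. A fitted `SelectKBest` exposes `scores_` and `pvalues_`. The sketch below runs on synthetic data from `make_classification`; the column names `f0` to `f4` are placeholders, not weather features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for xScaled and the target (assumption for illustration).
toyX, toyY = make_classification(n_samples=200, n_features=5, n_informative=2,
                                 n_redundant=0, random_state=0)
toyX = pd.DataFrame(toyX, columns=['f0', 'f1', 'f2', 'f3', 'f4'])

toySelector = SelectKBest(f_classif, k=2).fit(toyX, toyY)

# One row per feature: ANOVA F-score and its p-value, best first.
scoreTable = pd.DataFrame(
    {'F': toySelector.scores_, 'p': toySelector.pvalues_},
    index=toyX.columns,
).sort_values('F', ascending=False)
print(scoreTable)
print(list(toyX.columns[toySelector.get_support()]))
```

The two columns reported by `get_support()` are the ones with the highest F-scores in the table.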

## Model Preparation
We will attempt several types of classification model, and we want to try each of them on every feature set selected above. We will use our list of xFeatures to produce a list of fitted classifiers of each type.

In [None]:
from sklearn.base import clone as skClone

def produceModels ( xElements, yElements, featureList, modelObject ):
    output = []
    for featureset in featureList:
        print('Generating model.')
        #Filter the xElements to the appropriate values in the xSelector
        xFilteredElements = xElements[featureset]
        print(xFilteredElements.shape)
        classifierModel = skClone(modelObject)
        classifierModel = classifierModel.fit(xFilteredElements, yElements['RainTomorrow'])
        output.append(classifierModel)
        print('Done.')
        #print('Coefficients: {0}'.format(classifierModel.coef_))
    return output

def compareModels ( xElements, yElements, featureList, modelList, showCoeffs = False):
    print('Comparing models.')
    #Pair each fitted model with the feature set it was trained on.
    for model, featureset in zip(modelList, featureList):
        #Filter the xElements to the appropriate values in the xSelector
        xFilteredElements = xElements[featureset]
        if showCoeffs:
            print(model.coef_)
        #Use the length of the feature set rather than the global nFeatures list.
        print('Comparing model with {0} features:'.format(len(featureset)))
        print('Accuracy : {0}'.format(model.score(xFilteredElements, yElements['RainTomorrow'])))
    return

Is there a good way of comparing the predictive features, similar to an adjusted R-squared or a p-value for each of the predictors? Please let me know if you can think of a good guide on how to do this for future reference. Thanks.
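
One possible answer to the question above, offered as a suggestion rather than the standard method, is permutation importance: shuffle one feature at a time in held-out data and see how much the score drops. The sketch assumes scikit-learn 0.22 or later (for `sklearn.inspection.permutation_importance`) and uses synthetic data, so the printed numbers say nothing about the weather features.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the scaled weather features (assumption).
toyX, toyY = make_classification(n_samples=300, n_features=4, n_informative=2,
                                 n_redundant=0, random_state=0)
toyXTrain, toyXTest, toyYTrain, toyYTest = train_test_split(
    toyX, toyY, test_size=0.25, random_state=0)

toyModel = LogisticRegression(solver='lbfgs').fit(toyXTrain, toyYTrain)

# Shuffle each feature 10 times on the test set and record the accuracy drop.
result = permutation_importance(toyModel, toyXTest, toyYTest,
                                n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print('feature {0}: {1:.3f} +/- {2:.3f}'.format(
        i, result.importances_mean[i], result.importances_std[i]))
```

A feature whose mean importance is close to zero contributes little to this particular model on this particular test set.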

## Prepare the data by splitting into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(xScaled, Y, test_size = 0.2, random_state = 28122018)
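
One thing that may be worth checking: the scaling and the feature selection above were fitted on the whole dataset before this split, so the test rows have already influenced both steps. A common way to avoid that is to wrap the steps in a `Pipeline` and fit it on the training rows only. The sketch below shows the idea on synthetic data; the step names and `toy` variables are my own choices, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled features and target (assumption).
toyX, toyY = make_classification(n_samples=400, n_features=8, n_informative=3,
                                 random_state=0)

# stratify keeps the class proportions the same in both parts.
toyXTrain, toyXTest, toyYTrain, toyYTest = train_test_split(
    toyX, toyY, test_size=0.2, random_state=0, stratify=toyY)

toyPipeline = Pipeline([
    ('scale', StandardScaler()),              # fitted on training rows only
    ('select', SelectKBest(f_classif, k=3)),  # selection sees training rows only
    ('model', LogisticRegression(solver='lbfgs')),
])
toyPipeline.fit(toyXTrain, toyYTrain)
toyScore = toyPipeline.score(toyXTest, toyYTest)
print(toyScore)
```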

### Logistic Regression
We will begin by attempting to fit a logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression
logisticRegressor = LogisticRegression(random_state = 28122018, solver='lbfgs')
logisticModels = produceModels(xTrain, yTrain, xFeatures, logisticRegressor)

In [None]:
compareModels(xTest, yTest, xFeatures, logisticModels)
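
Accuracy alone can flatter a model when one class dominates. If dry days are much more common than rainy ones, as the RainToday counts suggest, then always predicting 'No' already scores well. A baseline makes this visible. The sketch uses an invented 80/20 target; the real class balance should be checked on `Y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced target: 8 dry days, 2 rainy days.
toyY = np.array([0] * 8 + [1] * 2)
toyX = np.zeros((10, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(toyX, toyY)
baselineAccuracy = baseline.score(toyX, toyY)
toyConfusion = confusion_matrix(toyY, baseline.predict(toyX))

print(baselineAccuracy)
# Rows are true classes, columns are predicted classes.
print(toyConfusion)
```

A classifier is only useful here if it beats this baseline, and the confusion matrix shows whether it finds any rainy days at all.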

### Decision tree

In [None]:
from sklearn import tree
decisionTreeClassifier = tree.DecisionTreeClassifier(random_state = 28122018)
treeModels = produceModels(xTrain, yTrain, xFeatures, decisionTreeClassifier)

In [None]:
compareModels(xTest, yTest, xFeatures, treeModels)

### KNN Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNC
#Using 5 neighbors
print('KNN (5)')
KNNClassifier5 = KNC(n_neighbors=5)
KNNModels5 = produceModels(xTrain, yTrain, xFeatures, KNNClassifier5)
compareModels(xTest, yTest, xFeatures, KNNModels5)

#Using 7 neighbors
print('KNN (7)')
KNNClassifier7 = KNC(n_neighbors=7)
KNNModels7 = produceModels(xTrain, yTrain, xFeatures, KNNClassifier7)
compareModels(xTest, yTest, xFeatures, KNNModels7)

#Using 10 neighbors
print('KNN (10)')
KNNClassifier10 = KNC(n_neighbors=10)
KNNModels10 = produceModels(xTrain, yTrain, xFeatures, KNNClassifier10)
compareModels(xTest, yTest, xFeatures, KNNModels10)
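
The three KNN blocks above differ only in the number of neighbours, so they could be folded into a loop. This is a sketch of that pattern on synthetic data, fitting one model per k on a single feature set; adapting it to `produceModels` and `compareModels` would follow the same shape.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the train/test split above (assumption).
toyX, toyY = make_classification(n_samples=300, n_features=5, random_state=0)
toyXTrain, toyXTest, toyYTrain, toyYTest = train_test_split(
    toyX, toyY, test_size=0.2, random_state=0)

# Collect test accuracy for each neighbour count in one place.
knnScores = {}
for k in (5, 7, 10):
    knnModel = KNeighborsClassifier(n_neighbors=k).fit(toyXTrain, toyYTrain)
    knnScores[k] = knnModel.score(toyXTest, toyYTest)

for k, score in knnScores.items():
    print('KNN ({0}) accuracy: {1:.3f}'.format(k, score))
```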