# Test ML Rain detector

The purpose of this program is to read Austin weather data and use binary classification machine learning to predict whether rain will occur on a given day.

## ETL

**1. Extract**

We read the .csv file of weather data into a pandas dataframe, which makes the data easier to transform.

**2. Transform**

Next, a column called 'Outcome' was added to the dataframe to serve as the label for the binary classification ML algorithms. For every row, the 'Events' column, which lists the weather events for that day, was checked for rain. If it mentioned rain, 'Outcome' was set to 1 (True); otherwise it was set to 0 (False).

Additionally, all rows containing '-', which marks incomplete data, were dropped. We then split the dataset into a training set, on which the ML algorithms would be trained, and a testing set, which would be the benchmark for how accurate our models were. The training set included the first 1000 rows of data, and the testing set had the remaining (218) rows. Because the code splits by index label (`.loc[0:1000]`) after the incomplete rows were dropped, the exact row counts depend on how many rows were removed from each range. Both sets were then exported to .csv files.
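
The transform steps can be sketched on a small made-up frame. The values below are hypothetical; only the column names mirror the weather dataset. The sketch also illustrates how the slicing method affects the split sizes: once rows have been dropped, `.loc` selects by the surviving index labels, while `.iloc` selects by position.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the weather dataset.
toy = pd.DataFrame({
    "TempAvgF": ["60", "-", "72", "65", "70", "58"],
    "Events": ["Rain", " ", "Rain , Thunderstorm", " ", "Fog", "Rain"],
})

# Label each day: 1 if the Events text mentions rain, otherwise 0.
toy["Outcome"] = toy["Events"].str.contains("Rain").astype(int)

# Drop every row that has a '-' placeholder in any column (here: label 1).
toy = toy[~toy.isin(["-"]).any(axis=1)]

# Label-based slice: labels 0 through 3 inclusive, minus the dropped label 1.
by_label = toy.loc[0:3]
# Position-based slice: exactly the first four surviving rows.
by_position = toy.iloc[:4]

print(list(by_label.index))     # [0, 2, 3]
print(list(by_position.index))  # [0, 2, 3, 4]
```

If an exact number of training rows matters, a positional slice (or resetting the index after dropping rows) makes the count predictable.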

**3. Load**

In the second code block, the .csv files were loaded as training and testing dataframes. We selected all the numeric attributes of the original dataset (barring precipitation, due to incomplete data) to train the algorithms on.
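
One side effect of the CSV round trip is worth noting: a column that contained '-' placeholders is parsed as text, and it stays text after the offending rows are dropped. Writing the frame out and reading it back lets pandas re-infer numeric types. The sketch below, on made-up inline data, shows this along with an in-memory alternative using `pd.to_numeric`. The sample values are hypothetical, and the alternative is a suggestion rather than what the notebook does.

```python
import io
import pandas as pd

# Hypothetical sample in the same shape as the weather file.
raw = io.StringIO(
    "TempAvgF,WindAvgMPH\n"
    "60,4\n"
    "72,-\n"
    "65,7\n"
)
df = pd.read_csv(raw)

# The '-' placeholder makes pandas parse WindAvgMPH as text, not numbers.
wind_is_numeric_before = pd.api.types.is_numeric_dtype(df["WindAvgMPH"])

# Dropping the bad row does not change the column's type ...
df = df[df["WindAvgMPH"] != "-"]

# ... but a CSV round trip lets pandas re-infer it as numeric.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# In-memory alternative: convert the column explicitly.
converted = pd.to_numeric(df["WindAvgMPH"])

print(wind_is_numeric_before)
print(pd.api.types.is_numeric_dtype(reloaded["WindAvgMPH"]))
print(pd.api.types.is_numeric_dtype(converted))
```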

In [123]:
import pandas as pd

weather = pd.read_csv('austin_weather.csv')

# Add a new column called 'Outcome', used as the binary classification label.
# Outcome is 1 if the Events text for that date mentions rain, and 0 if not.
# na=False treats a missing Events value as "no rain".
weather['Outcome'] = weather['Events'].str.contains('Rain', na=False).astype(int)

# Drop every row with incomplete data (any cell equal to "-").
# These rows break the ML algorithms and have to be removed.
weather = weather[~weather.isin(['-']).any(axis=1)]

#weather = weather.loc[:,["TempHighF", "TempAvgF", "TempLowF", "WindHighMPH", "WindAvgMPH", "WindGustMPH", "Events", "Outcome"]]

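# .loc slices by index label (inclusive), so train gets the surviving rows labeled 0 through 1000.
train = weather.loc[0:1000, :]
test = weather.loc[1001:, :]

# Uncomment on the first run to write the files that the next cell reads.
#train.to_csv('weather_train.csv')
#test.to_csv('weather_test.csv')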

## Training our ML algorithm

With the training and testing datasets loaded, we imported sklearn and trained two classifiers: logistic regression and a multilayer perceptron (MLP). On the testing set, the logistic regression classifier was 87.0% accurate, while the MLP classifier was 85.8% accurate.

In [124]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score


train_df = pd.read_csv("weather_train.csv")
train_features = train_df[["TempHighF", "TempAvgF", "TempLowF", "DewPointHighF", "DewPointAvgF", "DewPointLowF", "HumidityHighPercent", "HumidityAvgPercent", "HumidityLowPercent", "SeaLevelPressureHighInches", "SeaLevelPressureAvgInches", "SeaLevelPressureLowInches", "VisibilityHighMiles", "VisibilityAvgMiles", "VisibilityLowMiles", "WindHighMPH", "WindAvgMPH", "WindGustMPH"]]

train_labels = train_df["Outcome"]

# Now let's define our models
lr_classifier = LogisticRegression(solver='lbfgs',max_iter=10000)
mlp_classifier = MLPClassifier(solver='lbfgs', alpha=1e-5,
                               hidden_layer_sizes=(8, 2), random_state=11,max_iter=10000)

# train our models
lr_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())
mlp_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())

print ("Models trained successfully...")

#load test data
test_df = pd.read_csv("weather_test.csv")

# Extract the input features
test_inputs = test_df[["TempHighF", "TempAvgF", "TempLowF", "DewPointHighF", "DewPointAvgF", "DewPointLowF", "HumidityHighPercent", "HumidityAvgPercent", "HumidityLowPercent", "SeaLevelPressureHighInches", "SeaLevelPressureAvgInches", "SeaLevelPressureLowInches", "VisibilityHighMiles", "VisibilityAvgMiles", "VisibilityLowMiles", "WindHighMPH", "WindAvgMPH", "WindGustMPH"]]
y_actual = test_df["Outcome"]

# predict using the logistic regression model
y_predicted_lr = lr_classifier.predict(test_inputs.to_numpy())
# accuracy_score expects (y_true, y_pred)
lr_accuracy_score = accuracy_score(y_actual, y_predicted_lr)

# predict using the MLP model
y_predicted_mlp = mlp_classifier.predict(test_inputs.to_numpy())
mlp_accuracy_score = accuracy_score(y_actual, y_predicted_mlp)

print (f"Accuracy of the Logistic Classifier on test data= {lr_accuracy_score}")
print (f"Accuracy of the MLP Classifier on test data = {mlp_accuracy_score}")

Models trained successfully...
Accuracy of the Logistic Classifier on test data= 0.870253164556962
Accuracy of the MLP Classifier on test data = 0.8575949367088608
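
For context on what these scores mean: accuracy is the fraction of test rows whose predicted label matches the actual label. A common sanity check, which is not part of the original notebook, is to compare that score against always predicting the most frequent class. The sketch below uses made-up labels, and `accuracy` and `majority_baseline` are hypothetical helper names.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels for illustration only (1 = rain, 0 = no rain).
y_actual = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_predicted = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

print(accuracy(y_actual, y_predicted))  # 0.8
print(majority_baseline(y_actual))      # 0.7
```

A classifier is only informative to the extent that it beats this baseline on the same test set.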
