**Import Libraries and Dataset**

**Exploratory Data Analysis**
1. Divide the features into categorical and numerical
2. Handle the missing values of each feature type accordingly
3. Check for duplicates in the features
4. Check the cardinality of the categorical features and perform feature engineering if necessary
5. Check the data distribution of Numerical Features
6. Check the Outliers and tackle them accordingly

**Data Preprocessing**
1. Encode the categorical features first
2. Remove the unnecessary features
3. Scale the features

**Machine Learning Model Creation**
1. Model Training
2. Predict with the model
3. Check the accuracy score
4. Confusion Matrix

**Results and Conclusion**

# **Import Libraries and Dataset**

In [5]:
#import library and package

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error,r2_score, accuracy_score, confusion_matrix

from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")

# **Exploratory Data Analysis**

In [34]:
#Have a sneak peek at the dataset
df = pd.read_csv('weatherAUS.csv')
df.head()

PermissionError: [Errno 13] Permission denied: 'weatherAUS.csv'

In [None]:
#check row and column
df.shape

In [None]:
#check the column name
df.columns

In [None]:
#check data type, any null
df.info()

The dataset mixes data types. Columns of type object must be encoded as numbers (int) before they can be used for machine learning.
The Date column should be split into day, month, and year as int columns.

In [None]:
#sum of null values per column
df.isnull().sum()

Many columns contain missing values, which scikit-learn's LogisticRegression cannot handle. We will fill the categorical columns with their most frequent value (mode) and the numerical columns with their median.
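
As a minimal sketch of this imputation strategy, the loop below fills a small toy frame by column type. The values are hypothetical, and `pd.api.types.is_numeric_dtype` is used here only as one convenient way to tell the two column types apart.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one categorical and one numerical column
toy = pd.DataFrame({
    "WindGustDir": ["N", "N", None, "SE"],
    "Rainfall": [0.0, np.nan, 2.0, 10.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numerical: the median is less sensitive to extreme values than the mean
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        # Categorical: use the most frequent value (mode)
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The same loop would replace one `fillna` line per column, at the cost of treating every non-numeric column the same way.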

In [None]:
#check statistical report of the dataset
df.describe()

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

Parse the Date column from str into datetime format so that we can extract the year, month, and day.

In [None]:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

Extract the year, month, and day from the Date column.

In [None]:
#checking the new column, year, month, and day
df.head()

In [None]:
#listing column with object type
cat_col = [col for col in df.columns if df[col].dtype == 'O']
cat_col

In [None]:
#count number of null
df[cat_col].isnull().sum()

In [None]:
#assign the most frequent value to the NaN entries
df['WindGustDir'] = df['WindGustDir'].fillna(df['WindGustDir'].mode()[0])
df['WindDir9am'] = df['WindDir9am'].fillna(df['WindDir9am'].mode()[0])
df['WindDir3pm'] = df['WindDir3pm'].fillna(df['WindDir3pm'].mode()[0])
df['RainToday'] = df['RainToday'].fillna(df['RainToday'].mode()[0])
df['RainTomorrow'] = df['RainTomorrow'].fillna(df['RainTomorrow'].mode()[0])

df[cat_col].isnull().sum()

In [None]:
for val in cat_col :
    print('*********************************************')
    print(df[val].value_counts())

Based on the report, there is an ambiguity: SydneyAirport, PerthAirport, and MelbourneAirport are treated here as the same locations as Sydney, Perth, and Melbourne.

In [None]:
#replacing the ambiguous elements
df['Location'] = df['Location'].replace('SydneyAirport', 'Sydney')
df['Location'] = df['Location'].replace('PerthAirport', 'Perth')
df['Location'] = df['Location'].replace('MelbourneAirport', 'Melbourne')
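
The three replacements can also be written as one mapping passed to `replace`. This is only a sketch on a toy Series; the name `airport_to_city` is illustrative, and merging the stations assumes, as the notebook does, that an airport and its city can be treated as one location.

```python
import pandas as pd

# Toy Series standing in for df['Location'] (hypothetical sample values)
locations = pd.Series(["Sydney", "SydneyAirport", "PerthAirport", "Albury"])

# One mapping instead of three separate replace calls
airport_to_city = {
    "SydneyAirport": "Sydney",
    "PerthAirport": "Perth",
    "MelbourneAirport": "Melbourne",
}

merged = locations.replace(airport_to_city)
print(merged.tolist())
```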

In [None]:
df.drop('Date', axis = 1, inplace = True)

The Date column is no longer needed now that Year, Month, and Day exist, so we remove it from the dataset.

In [None]:
num_col = [feature for feature in df.columns if df[feature].dtype != 'O']


Make a list of the columns with numerical values.

In [None]:
df[num_col].isnull().sum()

There are a lot of null values. We will handle them by replacing each missing numerical value with the median of its column.

In [None]:
for val in num_col :
    #assign the result back; fillna with inplace=True on a selected column is unreliable in recent pandas
    df[val] = df[val].fillna(df[val].median())

In [None]:
df.isnull().sum()

All the null values are gone, and we can proceed to the next step.
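
The outline at the top also lists duplicate, cardinality, and outlier checks, which the cells above do not show. A minimal sketch of what those checks could look like on a toy frame follows. The values are hypothetical, and the 1.5 × IQR rule is just one common outlier convention, not necessarily the one intended here.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Perth", "Perth", "Perth"],
    "Rainfall": [0.0, 0.0, 1.0, 2.0, 50.0],
})

# Duplicates: rows that repeat an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# Cardinality: number of distinct values in each non-numeric column
cardinality = toy.select_dtypes(exclude="number").nunique()

# Outliers: values outside 1.5 * IQR from the quartiles
q1, q3 = toy["Rainfall"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["Rainfall"] < q1 - 1.5 * iqr) | (toy["Rainfall"] > q3 + 1.5 * iqr)]

print(n_duplicates)
print(cardinality)
print(outliers)
```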

# **Data Preprocessing**

In [None]:
cat_col

In [None]:
encoder = LabelEncoder()

In [None]:
for col in cat_col :
    df[col] = encoder.fit_transform(df[col])

Transform the object columns into int so that they can be used to train the model.
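
To see which integer each label receives, the encoder's `classes_` attribute can be inspected. The sketch below uses a toy list rather than the dataset. `LabelEncoder` sorts the labels, so 'No' becomes 0 and 'Yes' becomes 1, which matters later when reading the confusion matrix.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values standing in for the RainTomorrow column
enc = LabelEncoder()
codes = enc.fit_transform(["No", "Yes", "No", "No"])

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): int(code) for code, label in enumerate(enc.classes_)}

print(mapping)
print(codes.tolist())
```

Because the loop above reuses a single encoder, only the mapping of the last column survives; keeping one encoder per column would preserve all of them.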

In [None]:
#check the data type of dataset
df.info()

In [None]:
x = df.drop('RainTomorrow', axis = 1)
y = df['RainTomorrow']

scaler = StandardScaler()
#scale x itself: x is a copy of df, so scaling df at this point would leave x unscaled
x[num_col] = scaler.fit_transform(x[num_col])

We separate the dependent and independent variables.
The independent variables are x,
and the dependent variable is y.

The StandardScaler then standardizes the numerical columns in num_col by removing the mean and scaling to unit variance.
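
A common refinement, not used in this notebook, is to fit the scaler on the training rows only, so that test-set statistics do not leak into the preprocessing. The sketch below shows that ordering on random toy data; all names and values in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features and binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=20.0, scale=5.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```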



# **Machine Learning Model Creation**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

In [None]:
model = LogisticRegression()

In [None]:
#training the data
model.fit(x_train, y_train)

In [None]:
#predict the data
predict = model.predict(x_test)

In [None]:
#mse calculation (for 0/1 labels this equals the misclassification rate, i.e. 1 - accuracy)
mean_squared_error(y_test, predict)

In [None]:
#accuracy score of the prediction
accuracy_score(y_test, predict)

The model reaches 83.8% accuracy on the test set.

In [None]:
cm = confusion_matrix(y_test, predict)

In [None]:
#scikit-learn layout: rows = actual class, columns = predicted class (0 = No, 1 = Yes)
print('TN =', cm[0,0])
print('FP =', cm[0,1])
print('FN =', cm[1,0])
print('TP =', cm[1,1])

Confusion Matrix
A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.
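
As a small illustration of how scikit-learn lays out the matrix, the sketch below builds one from toy labels (hypothetical values). Rows are actual classes and columns are predicted classes, so for binary 0/1 labels `ravel()` yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = no rain, 1 = rain
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# Flatten row by row: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```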

# **Conclusion**

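1. The model predicts the test set with an accuracy score of 83.8%.
2. Read with scikit-learn's layout (rows are actual classes, columns are predicted classes, 0 = no rain, 1 = rain), the confusion matrix gives TN = 21479, FP = 1193, FN = 3494, TP = 2926. The model therefore predicts "no rain tomorrow" for the large majority of days, and it catches 2926 of the 6420 actual rain days (about 46%).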
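
As a rough check, the headline metrics can be recomputed from the four reported counts. The sketch assumes scikit-learn's layout (rows = actual, columns = predicted) and that LabelEncoder mapped 'No' to 0 and 'Yes' to 1, so the count at `cm[0,0]`, 21479, is read as true negatives. The variable names are illustrative.

```python
# Counts reported for the test set, read as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 21479, 1193, 3494, 2926
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # share of all days predicted correctly
recall = tp / (tp + fn)         # share of actual rain days that were caught
precision = tp / (tp + fp)      # share of predicted rain days that really had rain
baseline = (tn + fp) / total    # accuracy of always predicting "no rain"

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} baseline={baseline:.3f}")
```

Comparing the accuracy with the always-"no rain" baseline gives a sense of how much the model adds on this imbalanced target.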