## About the Dataset
PJM Hourly Energy Consumption Data
PJM Interconnection LLC (PJM) is a regional transmission organization (RTO) in the United States. It is part of the Eastern Interconnection grid operating an electric transmission system serving all or parts of Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, and the District of Columbia.

The hourly power consumption data comes from PJM's website and is reported in megawatts (MW).

The regions have changed over the years, so data may only appear for certain dates per region.

In [1]:
#Show PJM Regions
from IPython.display import Image
Image(url= "http://slideplayer.com/4238181/14/images/4/PJM+Evolution.jpg")

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression

In [3]:
# NOTE: machine-specific absolute path; point it at your local copy of the PJME hourly CSV
df = pd.read_csv('E:/Fall 2019/Data Science with Python/Project/CS677Project_Padala/hourly-energy-consumptionPJME_hourly.csv')

FileNotFoundError: [Errno 2] File b'E:/Fall 2019/Data Science with Python/Project/CS677Project_Padala/hourly-energy-consumptionPJME_hourly.csv' does not exist: b'E:/Fall 2019/Data Science with Python/Project/CS677Project_Padala/hourly-energy-consumptionPJME_hourly.csv'

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

## Checking for null values

In [None]:
df.isna().any()

## Drop any duplicate values

In [None]:
df.drop_duplicates(subset='Datetime', keep='last', inplace=True)

## Convert into datetime and set as index

In [None]:
df['Datetime']= pd.to_datetime(df['Datetime'])

In [None]:
df = df.set_index('Datetime') # set the Datetime as our index

## Histogram plot

In [None]:
df['PJME_MW'].plot.hist(figsize=(18, 6), bins=300, title='Distribution of PJME Load')
plt.show()

## Check whether our dataset is continuous

In [None]:
print(df.index.freq) # our dataset is not continuous

In [None]:
# creating a continuous date range with hourly frequency
date_range = pd.date_range(start=min(df.index), 
                           end=max(df.index), 
                           freq='H')

In [None]:
df = df.reindex(date_range)

In [None]:
df.isnull().any()

## Now we have null values, so let's fill them

In [None]:
df = df.ffill() # forward-fill; fillna(method='ffill') is deprecated in recent pandas

In [None]:
df.isnull().any()

In [None]:
print(df.index.freq) # now our timeseries is continuous with no missing data
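
To make the reindex-and-forward-fill step concrete, here is a minimal sketch on a toy series. The timestamps and values are made up for illustration and are not taken from the PJME data.

```python
import pandas as pd

# toy hourly series with the 02:00 reading missing (values are illustrative only)
toy = pd.DataFrame(
    {"PJME_MW": [100.0, 110.0, 130.0]},
    index=pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 03:00"]),
)

# build the full hourly range and reindex; the missing hour shows up as NaN
full_range = pd.date_range(start=toy.index.min(), end=toy.index.max(), freq="h")
toy = toy.reindex(full_range)

# forward-fill copies the last known reading into the gap
toy = toy.ffill()
print(toy)
```

Forward-filling keeps the series on a regular hourly grid, at the cost of repeating the previous reading wherever an hour was missing.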

## Extracting more features from the time series

In [None]:
df['dow'] = df.index.dayofweek
df['doy'] = df.index.dayofyear
df['year'] = df.index.year
df['month'] = df.index.month
df['quarter'] = df.index.quarter
df['hour'] = df.index.hour
df['weekday'] = df.index.day_name() # weekday_name was removed in pandas 1.0
df['woy'] = df.index.isocalendar().week.astype(int) # weekofyear was removed in pandas 2.0
df['dom'] = df.index.day # Day of Month
df['date'] = df.index.date 

# let's add the season number
df['season'] = df['month'].apply(lambda month_number: (month_number%12 + 3)//3)
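
As a quick check of the season arithmetic, the sketch below applies the same formula to every month. The season names are labels assumed here for illustration (Northern Hemisphere, meteorological seasons); the notebook itself only stores the numbers.

```python
def season_number(month_number):
    # same arithmetic as the lambda: December wraps to 0, then months are bucketed in threes
    return (month_number % 12 + 3) // 3

# hypothetical labels for the four codes
SEASON_NAMES = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}

for month in range(1, 13):
    code = season_number(month)
    print(month, code, SEASON_NAMES[code])
```

December, January and February map to 1, March to May to 2, June to August to 3, and September to November to 4.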

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

## Exploratory Data Analysis

In [None]:
plt.figure(figsize=(15,5))
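plt.title('Total Energy Consumption per Year throughout 2002-2018')
plt.ylabel('PJME_MW')
# sum the load per year; a countplot would only count the hourly rows in each year
df.groupby('year').PJME_MW.sum().plot(kind='bar', color='lightblue');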

In [None]:
plt.figure(figsize=(15,5))
plt.title('Total Energy Consumption per Month throughout 2002-2018')
plt.ylabel('PJME_MW')
# keep the Axes in `ax`; assigning it to `plt` would shadow the pyplot module
ax = df.groupby('month').PJME_MW.sum().plot(kind='bar',color='green')

## Correlation plot

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15,5))
sns.heatmap(df.corr(numeric_only=True),annot=True) # skip the text/date columns, which recent pandas refuses to correlate
plt.show()

- The correlation matrix indicates the variables "dow" (day of week) and "hour" will be interesting to look at in the context of predicting our target variable.

## Time series plot

In [None]:
import plotly.graph_objects as go
fig = go.Figure([go.Scatter(x=df.index, y=df.PJME_MW,line_color='orange')])
fig.update_layout(title_text='Yearly Power Consumption',template="plotly_dark")
fig.update_xaxes(title_text='Date')
fig.update_yaxes(title_text='Energy Demand(MW)')
fig.show()

## Hourly, Daily, Weekly Time Series Analysis

In [None]:
import plotly.express as px
plot_df=df.groupby(['hour', 'weekday'], as_index=False).agg({'PJME_MW':'mean'})
# plotting
fig = px.line(plot_df, x='hour', y='PJME_MW', color='weekday', title='Average Hourly Power Demand per Weekday')
fig.update_layout(xaxis_title='Hour',yaxis_title='Energy Demand [MW]',template="plotly_dark")
fig.show()

- Demand for electricity is lower during weekends and dips a little sooner on Friday afternoons.

In [None]:
plt_df=df.groupby(['hour', 'season'], as_index=False).agg({'PJME_MW':'mean'})

# plotting
fig = px.line(plt_df,x='hour', y='PJME_MW', color='season', title='Average Hourly Power Demand per Season')
fig.update_layout(xaxis_title='Hour',
                  yaxis_title='Energy Demand [MW]',template="plotly_dark")
fig.show()

- Energy consumption is highest during season 3, i.e. summer.

## Summer vs Winter Demand

In [None]:
plt_df = df.loc[(df.index >= '2015-11-01') & (df.index < '2016-01-01')]
fig = go.Figure([go.Scatter(x=plt_df.index, y=plt_df.PJME_MW,line_color='lightblue')])
fig.update_layout(title_text='Winter Power Consumption',template="plotly_dark")
fig.update_xaxes(title_text='Date')
fig.update_yaxes(title_text='Energy Demand(MW)')
fig.show()

- We notice dips in energy consumption at midday.
- In the winter months, people tend to use less energy at midday.

In [None]:
plt_df = df.loc[(df.index >= '2016-06-01') & (df.index < '2016-08-01')]
fig = go.Figure([go.Scatter(x=plt_df.index, y=plt_df.PJME_MW,line_color='yellow')])
fig.update_layout(title_text='Summer Power Consumption',template="plotly_dark")
fig.update_xaxes(title_text='Date')
fig.update_yaxes(title_text='Energy Demand(MW)')
fig.show()

- We notice bell-shaped daily curves throughout.
- More energy is consumed at midday; this may be due to the use of air conditioners in the summer.

## Hourly Trend

In [None]:
fig = px.scatter(df, x="hour", y="PJME_MW")
fig.update_layout(title_text='Hourly Power Consumption',template="plotly_dark")
fig.update_xaxes(title_text='Hour')
fig.update_yaxes(title_text='Energy Demand(MW)')
fig.show()

## Pandas Profiling
- This library helps us quickly get an overview of the data.
- Let's try it and see what we can infer.
- To install it, run `pip install pandas_profiling`.

In [None]:
#pip install pandas_profiling

In [None]:
import pandas_profiling
pandas_profiling.ProfileReport(df)

## Quarterly Trends

In [None]:
# one hourly boxplot per quarter; x and y are passed as keywords because recent seaborn versions require it
for q in [1, 2, 3, 4]:
    quarter_df = df.loc[df['quarter']==q]
    fig, ax = plt.subplots(figsize=(15,5))
    sns.boxplot(x=quarter_df.hour, y=quarter_df.PJME_MW, ax=ax)
    ax.set_title('Hourly Boxplot PJME Q{}'.format(q))
    _ = ax.set_ylim(0,65000)

## Time Series Decomposition

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# seasonal_decompose needs a series with a datetime index
series = df.PJME_MW
frequency = 24*365

# decomposing the time-series, with one seasonal cycle being 24 hours x 365 days
# (recent statsmodels versions take `period`; the older `freq` argument was removed)
decomposed = seasonal_decompose(series, model='additive', period=frequency)
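
For reference, the additive model assumes each observation splits into three parts. The symbols below are notation introduced here for illustration, not names from the notebook:

$$y_t = T_t + S_t + R_t, \qquad S_{t+m} = S_t, \qquad m = 24 \times 365 = 8760$$

Here $y_t$ is the hourly load, $T_t$ the trend, $S_t$ the seasonal component that repeats every $m$ hours, and $R_t$ the residual, so the residual is simply $R_t = y_t - T_t - S_t$.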

In [None]:
# plotting the different elements constituting our time-series
def plot_decompositions(decompositions, titles, line_widths):
    for d, t, lw in zip(decompositions, titles, line_widths):
        
        # draw a line plot of the data
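        # pass index and values directly: depending on the statsmodels version,
        # the component Series may not be named 'PJME_MW'
        fig = px.line(x=d.index,
              y=d.values,
              title=t,
              height=300)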
        
        # adjust line width
        fig.update_traces(line=dict(width=lw))
        
        # change layout of axes and the figure's margins 
        # to emulate tight_layout
        fig.update_layout(
            xaxis=dict(
                showticklabels=False,
                linewidth=1
            ),
            yaxis=dict(title=''),
            margin=go.layout.Margin(
                l=40, r=40, b=0, t=40, pad=0
            ),
        )
        
        # display
        fig.show()

# calling the function 
plot_decompositions(decompositions=[decomposed.trend, 
                                    decomposed.seasonal, 
                                    decomposed.resid],
                    titles=['Trend', 
                            'Seasonality',
                            'Residuals'],
                    line_widths=[2, 0.025, 0.05])

## Feature selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression 
X = df[['dow', 'doy', 'year', 'month', 'quarter', 'hour',
       'woy', 'dom', 'season']]
y = df['PJME_MW'] # 1-D target, as f_regression expects

In [None]:
selector = SelectKBest(f_regression, k=2)
selector.fit(X, y)

In [None]:
print(X.columns[selector.get_support()])

## Prediction using Linear Regression

In [None]:
column_names = ['hour','dow'] # selecting hour and dow as our input features
X = df[column_names]
y = df['PJME_MW']

### Using the entire dataset to predict

In [None]:
model = LinearRegression()
model.fit(X, y)
df['predicted'] = model.predict(X)

In [None]:
# create figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=df.PJME_MW,
                         mode='lines',
                         name='Actual'))
fig.add_trace(go.Scatter(x=df.index, y=df.predicted,
                         mode='lines', 
                         name='Predicted'))

# adjust layout
fig.update_traces(line=dict(width=0.5))
fig.update_layout(title='Linear Regression Forecast of Hourly Energy Demand',
                  xaxis_title='Date & Time (yyyy/mm/dd hh:MM)',
                  yaxis_title='Energy Demand [MW]')

In [None]:
plt.scatter(df.PJME_MW, df.predicted);

In [None]:
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y,df.predicted))
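
The value computed here is the root-mean-squared error (RMSE). As a dependency-free sketch of what that number means, the helper below (a hypothetical `rmse` function, with made-up inputs rather than PJME readings) spells out the arithmetic:

```python
import math

def rmse(actual, predicted):
    # square each error, average the squares, then take the square root
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# illustrative numbers only
print(rmse([30000.0, 32000.0, 31000.0], [31000.0, 31000.0, 31000.0]))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same unit as the target (MW).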

In [None]:
df = df.drop(columns = 'predicted')

- It looks like linear regression on hour and day of week alone won't work in our case.

## Let's try the Holt-Winters model for forecasting

In [None]:
df.head()

In [None]:
# set manually
CUTOFF_DATE = pd.to_datetime('2017-08-01')
TIME_DELTA = pd.DateOffset(years=8)

# splitting
train = df.loc[(df.index < CUTOFF_DATE) & (df.index >= CUTOFF_DATE-TIME_DELTA) ].copy()
test = df.loc[df.index >= CUTOFF_DATE].copy()

In [None]:
train.head()

In [None]:
test.head()

In [None]:
import statsmodels.api as sm
exp_smooth_train, exp_smooth_test = train['PJME_MW'], test['PJME_MW']
# fit & predict
holt_winter = sm.tsa.ExponentialSmoothing(exp_smooth_train,
                                          seasonal_periods=24*365,
                                          seasonal='add').fit()
y_hat_holt_winter = holt_winter.forecast(len(exp_smooth_test))
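
For reference, a sketch of the textbook recursions for exponential smoothing with additive seasonality and no trend term, which is the configuration requested here (`seasonal='add'`, no `trend`). The symbols are notation introduced for this note, and the library's internal parameterization may differ in detail:

$$\ell_t = \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,\ell_{t-1}$$

$$s_t = \gamma\,(y_t - \ell_{t-1}) + (1-\gamma)\,s_{t-m}$$

$$\hat{y}_{t+h \mid t} = \ell_t + s_{t+h-m(k+1)}, \qquad k = \left\lfloor \frac{h-1}{m} \right\rfloor$$

Here $\ell_t$ is the level, $s_t$ the seasonal component, $m$ the seasonal period (`seasonal_periods`), and $\alpha, \gamma \in [0, 1]$ the smoothing parameters estimated by `fit()`.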

In [None]:
# create figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=exp_smooth_test.index, y=exp_smooth_test,
                         mode='lines',
                         name='Actual'))
fig.add_trace(go.Scatter(x=y_hat_holt_winter.index, y=y_hat_holt_winter,
                         mode='lines', 
                         name='Predicted'))

# adjust layout
fig.update_traces(line=dict(width=0.5))
fig.update_layout(title='Holt-Winter Forecast of Hourly Energy Demand',
                  xaxis_title='Date & Time',
                  yaxis_title='Energy Demand [MW]')

# Classification

# About Dataset
This dataset contains daily weather observations from numerous Australian weather stations.

The target variable RainTomorrow means: Did it rain the next day? Yes or No.

Note: You should exclude the variable RISK_MM when training a binary classification model. Not excluding it will leak the answers to your model and make its measured accuracy misleading.

In [None]:
df = pd.read_csv('E:/Data Science with Python/Project/weather-dataset-rattle-package/weatherAUS.csv')

In [None]:
df = df.drop(columns = 'RISK_MM') #the dataset description asked me to do so

In [None]:
df.head()

In [None]:
df.describe().T

In [None]:
df.info()

## Dealing with null values

In [None]:
df.isnull().mean()*100 #calculating the percentage of null values (nulls / total rows)

- Four columns (Sunshine, Evaporation, Cloud3pm and Cloud9am) are missing a far larger share of their values than the rest, so I am going to drop those columns
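
The percentage of missing values in a column is the number of nulls divided by the total number of rows, which is what `isnull().mean()` returns. Dividing by `count()` (the non-null rows) instead gives a ratio that overstates it. A minimal sketch with a made-up column:

```python
# toy column with 2 missing entries out of 5 (illustrative values)
column = [1.2, None, 0.4, None, 3.1]

n_missing = sum(value is None for value in column)
n_present = len(column) - n_missing

share_of_rows = 100 * n_missing / len(column)    # nulls as a share of all rows
ratio_to_present = 100 * n_missing / n_present   # nulls relative to non-null rows only
print(share_of_rows, ratio_to_present)
```

The two numbers diverge more the sparser a column is, so a threshold chosen on one scale does not carry over to the other.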

In [None]:
df = df.drop(columns=['Sunshine','Evaporation','Cloud3pm','Cloud9am'])

In [None]:
df = df.dropna(how='any') #dropping all null values since their percentage is very low

In [None]:
df.shape

In [None]:
df.isnull().sum() #checking if there are anymore null values

In [None]:
f, ax = plt.subplots(figsize=(6, 8))
ax = sns.countplot(x="RainTomorrow", data=df)
plt.show()

## Using pandas profiling

In [None]:
pandas_profiling.ProfileReport(df)

## Creating features from the date variable

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

In [None]:
df.drop('Date', axis=1, inplace = True) #Dropping the original date

In [None]:
df.head()

## Correlation Plot

In [None]:
correlation = df.corr(numeric_only=True) # the text columns are not label-encoded yet

In [None]:
plt.figure(figsize=(16,12))
plt.title('Correlation Heatmap of Rain in Australia Dataset')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)           
plt.show()

- MinTemp and MaxTemp are highly correlated
- MinTemp and Temp9am are highly correlated
- MinTemp and Temp3pm are highly correlated
- MaxTemp and Temp9am are highly correlated
- MaxTemp and Temp3pm are highly correlated
- WindGustSpeed and WindSpeed3pm are highly correlated
- Pressure9am and Pressure3pm are highly correlated
- Temp9am and Temp3pm are highly correlated

## Let's look at a pairplot to learn more about the correlated variables

In [None]:
var = ['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm', 'WindGustSpeed', 'WindSpeed3pm', 'Pressure9am', 'Pressure3pm']
sns.pairplot(df[var], kind='scatter', diag_kind='hist', palette='Rainbow')
plt.show()

## Label Encoding the categorical data

In [None]:
categorical = [var for var in df.columns if df[var].dtype=='O']

In [None]:
categorical

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[categorical]=df[categorical].apply(le.fit_transform)
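
As a rough illustration of what label encoding does to one column, the sketch below mimics it in plain Python: the classes are the sorted unique values, and each value is replaced by its position in that list. `label_encode` is a hypothetical helper written for this note, not part of scikit-learn.

```python
def label_encode(values):
    # classes are the sorted unique values; each value maps to its index in that list
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

codes, classes = label_encode(["No", "Yes", "No", "No", "Yes"])
print(classes)
print(codes)
```

Because "No" sorts before "Yes", a Yes/No column such as RainTomorrow ends up with No as 0 and Yes as 1. The integer codes imply an ordering that wind directions or locations do not really have, which is worth keeping in mind for distance-based models.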

In [None]:
df.head()

In [None]:
df[categorical].head()

## Perform PCA to determine the number of features

In [None]:
from sklearn.decomposition import PCA
X = df.drop(columns='RainTomorrow')
y = df.RainTomorrow

In [None]:
pca = PCA(n_components=2)

X_r = pca.fit(X).transform(X)

X_r.shape

In [None]:
X_r[0:5]

In [None]:
pca.inverse_transform(X_r)[0:5]

In [None]:
print(pca.explained_variance_ratio_)

## Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df)
scaled_df = pd.DataFrame(scaler.transform(df), index=df.index, columns=df.columns)
scaled_df.head()
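
Min-max scaling maps every column onto the range [0, 1] with x' = (x - min) / (max - min). A dependency-free sketch of that formula follows; `min_max_scale` is a hypothetical helper, and its handling of a constant column is an assumption made for this example.

```python
def min_max_scale(values):
    # x' = (x - min) / (max - min), so the smallest value becomes 0 and the largest 1
    lo, hi = min(values), max(values)
    if hi == lo:
        # assumption for this sketch: a constant column maps to all zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10.0, 20.0, 40.0]))
```

Note that the notebook fits the scaler on the whole DataFrame, so the minimum and maximum are taken over all rows, including those that later end up in a test split.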

## Feature extraction

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
X = scaled_df.loc[:,scaled_df.columns!='RainTomorrow']
y = scaled_df['RainTomorrow']
selector = SelectKBest(chi2, k=5)
selector.fit(X, y)
X_new = selector.transform(X)
print(X.columns[selector.get_support(indices=True)])

## Apply PCA on these new features with scaled vs unscaled data

In [None]:
#scaled data
X = scaled_df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = scaled_df.RainTomorrow
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
print(pca.explained_variance_ratio_)

In [None]:
# unscaled data
X = df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = df.RainTomorrow
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
print(pca.explained_variance_ratio_)

- We see that the scaled data gives a higher explained-variance ratio than the unscaled data, and the first 2 components account for 91% of the total variance

## Logistic Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

X = df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = df.RainTomorrow

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)

In [None]:
clf = make_pipeline(MinMaxScaler(),
                        PCA(n_components=2),LogisticRegression())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
result = []
result.append(('Logistic Regression',accuracy_score(y_test, y_pred)))

In [None]:
print('Training set score: {:.4f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf.score(X_test, y_test)))

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [None]:
# rows of cm are actual classes and columns are predicted classes, both ordered 0 then 1
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Blues')
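
In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, in sorted label order. For 0/1 labels that makes `cm[0,0]` the true negatives and `cm[1,1]` the true positives. The sketch below rebuilds that layout by hand on made-up labels; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    # 2x2 table: row index = actual class, column index = predicted class, order [0, 1]
    table = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        table[actual][predicted] += 1
    return table

# illustrative labels only (0 = no rain, 1 = rain)
toy_cm = confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
tn, fp = toy_cm[0]
fn, tp = toy_cm[1]
print(tn, fp, fn, tp)
```

Reading the four cells in this order avoids swapping true positives and true negatives when the positive class is encoded as 1.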

## Classification Analysis
### Naive Bayes

In [None]:
# NaiveBayes Classifier
from sklearn.naive_bayes import GaussianNB
X = df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = df.RainTomorrow
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
clf = make_pipeline(MinMaxScaler(),PCA(n_components=2), GaussianNB())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
result.append(('Naive Bayes',accuracy_score(y_test, y_pred)))

In [None]:
print('Training set score: {:.4f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf.score(X_test, y_test)))

In [None]:
cm = confusion_matrix(y_test, y_pred)
# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [None]:
# rows of cm are actual classes and columns are predicted classes, both ordered 0 then 1
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Reds')

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
X = df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = df.RainTomorrow
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
clf = make_pipeline(MinMaxScaler(),PCA(n_components=2), DecisionTreeClassifier())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
result.append(('Decision Tree',accuracy_score(y_test, y_pred)))

In [None]:
print('Training set score: {:.4f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf.score(X_test, y_test)))

In [None]:
cm = confusion_matrix(y_test, y_pred)
# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [None]:
# rows of cm are actual classes and columns are predicted classes, both ordered 0 then 1
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Greens')

### Visualizing Decision Tree

In [None]:
from sklearn.tree import export_graphviz, plot_tree
import graphviz
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'
X = df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = df[['RainTomorrow']]
clf1 = DecisionTreeClassifier(max_depth=3)
clf1 = clf1.fit(X, y)

In [None]:
df1 = pd.read_csv('E:/Data Science with Python/Project/weather-dataset-rattle-package/weatherAUS.csv')

In [None]:
data = export_graphviz(clf1,out_file=None, 
                     feature_names=X.columns,
                      class_names=df1.RainTomorrow.unique(),
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(data)

graph

### SVM

In [None]:
from sklearn.svm import SVC

In [None]:
X = df[['Rainfall', 'WindGustSpeed', 'Humidity9am', 'Humidity3pm', 'RainToday']]
y = df.RainTomorrow
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
clf = make_pipeline(MinMaxScaler(),PCA(n_components=2), SVC())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
result.append(('SVM',accuracy_score(y_test, y_pred)))

In [None]:
print('Training set score: {:.4f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf.score(X_test, y_test)))

In [None]:
cm = confusion_matrix(y_test, y_pred)
# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [None]:
# rows of cm are actual classes and columns are predicted classes, both ordered 0 then 1
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='coolwarm')

## Clustering

In [None]:
scaled_df.head()

In [None]:
X = scaled_df[['WindGustSpeed','Humidity9am']]

In [None]:
plt.scatter(X["WindGustSpeed"],X["Humidity9am"],c='blue')
plt.xlabel('WindGustSpeed')
plt.ylabel('Humidity9am')

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)

In [None]:
print(kmeans.labels_)

In [None]:
centers = kmeans.cluster_centers_
print(centers)

In [None]:
plt.scatter(X['WindGustSpeed'], X['Humidity9am'], c=kmeans.labels_, s=32, cmap="viridis")
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=128, alpha=0.4);

- Using k-means, we are able to separate the data into two clusters
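
For intuition about what the clustering step does, here is a bare-bones sketch of the classic assign-then-update (Lloyd) iteration on a handful of made-up 2-D points. It is not scikit-learn's implementation: the starting centers are fixed by hand here, and `kmeans_2d` is a hypothetical helper.

```python
def kmeans_2d(points, centers, n_iter=10):
    # repeat: assign each point to its nearest center, then move each center to its cluster mean
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[distances.index(min(distances))].append((x, y))
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# illustrative points in the unit square, like two min-max scaled features
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
centers, clusters = kmeans_2d(points, centers=[(0.0, 0.0), (1.0, 1.0)])
print(centers)
```

On this toy input the two centers settle near the middle of each group of three points. K-means will always return the requested number of clusters, so the split by itself does not show that the data has two natural groups.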

In [None]:
result

In [None]:
Names, accuracy = zip(*result)

In [None]:
def extractDigits(lst): 
    return [[el] for el in lst] 
accuracy1 = extractDigits(accuracy)

In [None]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
fig.set_figheight(7)
fig.set_figwidth(14)
plt.boxplot(accuracy1)
ax.set_xticklabels(Names)
#ax.set_ylim([ymin,ymax])
plt.show()


## Logistic regression has the best accuracy