<a href="https://colab.research.google.com/github/scbd-laboratory/datadriven-dm/blob/master/machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Hands-on of Data Driven Decision Making Training - 17 February 2020 - PUSDIKLAT Keuangan Umum, Pancoran*


---



# 1. Supervised Learning - Classification

In [0]:
# Import library
import pandas as pd

# Load the dataset from GitHub
url = 'https://raw.githubusercontent.com/dianrdn/rc-dataanalytic/master/dataset/churn_trasnsformed_new.csv'
df_csv = pd.read_csv(url, sep=',')

# Show the first 5 rows
df_csv.head()

In [0]:
# Remove the "Unnamed: 0" column
df = df_csv.drop("Unnamed: 0", axis=1)
df.head()

In [0]:
# Check the data information
df.info()

In [0]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize the min-max scaler
mm_scaler = MinMaxScaler()
column_names = df.columns.tolist()
column_names.remove('Churn')

# Scale every feature column to [0, 1]; the target 'Churn' is left untouched
df[column_names] = mm_scaler.fit_transform(df[column_names])
df.sort_index(inplace=True)
df.head()
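
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value in a column becomes 0 and the largest becomes 1. The following is a minimal plain-Python sketch of that formula, not the scikit-learn implementation. The helper name `min_max_scale` and the sample values are made up for illustration, and returning zeros for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: there is no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Made-up monthly charges, for illustration only.
charges = [20.0, 50.0, 80.0, 110.0]
scaled = min_max_scale(charges)
print(scaled)
```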

In [0]:
# Select the features by dropping the target and the unused column
feature = ['Churn', 'TotalCharges']
train_feature = df.drop(feature, axis=1)

# Set the target
train_target = df["Churn"]

In [0]:
# Show the Feature
train_feature.head(5)

In [0]:
# Split the data into 70% training and 30% test sets
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_target, shuffle=True, test_size=0.3, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [0]:
# Show the training data
X_train.head()

#### Decision Tree

We use the scikit-learn [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class. Its default parameters are listed below. The exact signature varies between releases; newer scikit-learn versions no longer accept `min_impurity_split` and `presort`.


`DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)`

An explanation of decision trees is available at [Medium: Decision Tree](https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1).
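
To make the `criterion` and `min_impurity_decrease` parameters more concrete, here is a minimal plain-Python sketch of the Gini impurity, `1 - sum(p_k^2)`, and of the impurity decrease produced by one split. The helper names and the toy labels are hypothetical, and the decrease formula is written for a split at the root node only (where the node holds all samples); it is an illustration, not the library's code.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Impurity decrease of a root-level split, weighting children by their size."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))


# Toy labels: a perfectly mixed parent split into two partly mixed children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]
decrease = impurity_decrease(parent, left, right)
print(gini_impurity(parent), decrease)
```

A split is only kept when its decrease is at least `min_impurity_decrease`, so raising that threshold yields a smaller tree.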

In [0]:
# Import library
from sklearn import tree

# Train Decision Tree
dtc = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtc.fit(X_train, y_train)

# Predict to Test Data 
y_preddtc = dtc.predict(X_test)

In [0]:
# Plot the fitted tree (the model is already trained, so no refit is needed)
tree.plot_tree(dtc, class_names=['0','1'])

In [0]:
# Visualize with graphviz
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(dtc, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

In [0]:
# Import the metrics class
from sklearn import metrics

# Show the confusion matrix
cnf_matrixdtc = metrics.confusion_matrix(y_test, y_preddtc)
cnf_matrixdtc

In [0]:
# Show the accuracy, precision, recall, F1 score, and Cohen's kappa
acc_dtc = metrics.accuracy_score(y_test, y_preddtc)
prec_dtc = metrics.precision_score(y_test, y_preddtc)
rec_dtc = metrics.recall_score(y_test, y_preddtc)
f1_dtc = metrics.f1_score(y_test, y_preddtc)
kappa_dtc = metrics.cohen_kappa_score(y_test, y_preddtc)

print("Accuracy:", acc_dtc)
print("Precision:", prec_dtc)
print("Recall:", rec_dtc)
print("F1 Score:", f1_dtc)
print("Cohen's Kappa Score:", kappa_dtc)
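
All five scores can be derived from the four cells of a binary confusion matrix, which scikit-learn lays out as `[[tn, fp], [fn, tp]]` (rows are true labels, columns are predictions). The sketch below shows the arithmetic in plain Python. The function name `binary_metrics` and the counts are invented for illustration and are not results from the churn data.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute common binary classification scores from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Cohen's kappa compares observed agreement with agreement expected by chance.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Invented counts, for illustration only.
scores = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in scores.items():
    print(name, round(value, 4))
```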

In [0]:
# Cross-validation score
cv_iterations = 5
cv_score = cross_val_score(dtc, train_feature, train_target, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))
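
As a rough picture of what `cv=5` means, the sketch below cuts the sample indices into contiguous folds and uses each fold once as the test set while training on the rest. It is a simplified, unstratified illustration with a hypothetical helper name; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, test_idx in folds:
    print("train size:", len(train_idx), "test indices:", test_idx)
```

Every sample lands in exactly one test fold, so the mean of the per-fold accuracies uses all of the data for evaluation.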

In [0]:
# Import Visualization Package
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

# Visualize the ROC curve
y_pred_probadtc = dtc.predict_proba(X_test)[:, 1]
fprdtc, tprdtc, _ = metrics.roc_curve(y_test, y_pred_probadtc)
aucdtc = metrics.roc_auc_score(y_test, y_pred_probadtc)
plt.plot(fprdtc, tprdtc, label="Decision Tree, auc=" + str(aucdtc))
plt.legend(loc=4)
plt.show()
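
The AUC value in the legend can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below computes it that way in plain Python. The function name `pairwise_auc` and the four scores are illustrative assumptions, and the quadratic loop is meant for reading, not for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```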

#### K-Nearest Neighbor

We use the scikit-learn KNeighborsClassifier class. Its default parameters are:

`KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)`

A small value of k makes the prediction sensitive to noise, while a large value makes it computationally expensive. For binary classification, k is usually chosen as an odd number to avoid tied votes. Another simple approach is to set k = sqrt(n), where n is the number of training samples.
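
The two rules of thumb and the majority vote itself can be sketched in a few lines of plain Python. The helpers `suggest_k` and `knn_predict` and the toy points are hypothetical, and rounding sqrt(n) up to the next odd number is just one way to combine the two heuristics.

```python
import math
from collections import Counter


def suggest_k(n_samples):
    """Round sqrt(n) to the nearest integer, then bump even values to the next odd one."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1


def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k nearest training points (Euclidean)."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(suggest_k(100))
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))
```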

In [0]:
# Import library
from sklearn.neighbors import KNeighborsClassifier

# Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=75)
knn.fit(X_train, y_train)

# Predict to test data
y_predknn = knn.predict(X_test)

In [0]:
# Show the confusion matrix
cnf_matrixknn = metrics.confusion_matrix(y_test, y_predknn)
cnf_matrixknn

In [0]:
# Show the accuracy, precision, recall, F1 score, and Cohen's kappa
acc_knn = metrics.accuracy_score(y_test, y_predknn)
prec_knn = metrics.precision_score(y_test, y_predknn)
rec_knn = metrics.recall_score(y_test, y_predknn)
f1_knn = metrics.f1_score(y_test, y_predknn)
kappa_knn = metrics.cohen_kappa_score(y_test, y_predknn)

print("Accuracy:", acc_knn)
print("Precision:", prec_knn)
print("Recall:", rec_knn)
print("F1 Score:", f1_knn)
print("Cohen's Kappa Score:", kappa_knn)

In [0]:
# Cross-validation score
cv_iterations = 10
cv_score = cross_val_score(knn, train_feature, train_target, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

In [0]:
# ROC curve
y_pred_probaknn = knn.predict_proba(X_test)[:, 1]
fprknn, tprknn, _ = metrics.roc_curve(y_test, y_pred_probaknn)
aucknn = metrics.roc_auc_score(y_test, y_pred_probaknn)
plt.plot(fprknn, tprknn, label="K-NN, auc=" + str(aucknn))
plt.legend(loc=4)
plt.show()

#### Naive Bayes

We use the scikit-learn [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) class. Its default parameters are:

`GaussianNB(priors=None, var_smoothing=1e-09)`
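
Gaussian Naive Bayes scores each class as its prior multiplied by a normal density fitted to that class's feature values, and predicts the class with the highest score. The sketch below shows this for a single feature in plain Python. The helper names and the toy values are hypothetical, and the sketch leaves out the `var_smoothing` term that the library adds to the variances.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gnb(values, labels):
    """Estimate (prior, mean, variance) per class for one feature."""
    params = {}
    for cls in sorted(set(labels)):
        xs = [v for v, y in zip(values, labels) if y == cls]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[cls] = (len(xs) / len(values), mean, var)
    return params


def predict_gnb(params, x):
    """Return the class with the largest prior * likelihood."""
    scores = {cls: prior * gaussian_pdf(x, mean, var)
              for cls, (prior, mean, var) in params.items()}
    return max(scores, key=scores.get)


# Toy feature: short values belong to class 1, long values to class 0.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [1, 1, 1, 0, 0, 0]
params = fit_gnb(values, labels)
print(predict_gnb(params, 2.5), predict_gnb(params, 9.0))
```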


In [0]:
from sklearn.naive_bayes import GaussianNB 

# Train Naive Bayes Model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_predgnb= gnb.predict(X_test)

In [0]:
# Show the confusion matrix
cnf_matrixgnb = metrics.confusion_matrix(y_test, y_predgnb)
cnf_matrixgnb

In [0]:
# Show the accuracy, precision, recall, F1 score, and Cohen's kappa
acc_gnb = metrics.accuracy_score(y_test, y_predgnb)
prec_gnb = metrics.precision_score(y_test, y_predgnb)
rec_gnb = metrics.recall_score(y_test, y_predgnb)
f1_gnb = metrics.f1_score(y_test, y_predgnb)
kappa_gnb = metrics.cohen_kappa_score(y_test, y_predgnb)

print("Accuracy:", acc_gnb)
print("Precision:", prec_gnb)
print("Recall:", rec_gnb)
print("F1 Score:", f1_gnb)
print("Cohen's Kappa Score:", kappa_gnb)

In [0]:
# Cross-validation score
cv_iterations = 10
cv_score = cross_val_score(gnb, train_feature, train_target, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

In [0]:
# ROC curve
y_pred_probagnb = gnb.predict_proba(X_test)[:, 1]
fprgnb, tprgnb, _ = metrics.roc_curve(y_test, y_pred_probagnb)
aucgnb = metrics.roc_auc_score(y_test, y_pred_probagnb)
plt.plot(fprgnb, tprgnb, label="Naive Bayes, auc=" + str(aucgnb))
plt.legend(loc=4)
plt.show()

#### Model Comparison

In [0]:
# Comparing Model Performance
print("Decision Tree Accuracy =",acc_dtc)
print("Decision Tree Precision =",prec_dtc)
print("Decision Tree Recall =",rec_dtc)
print("Decision Tree F1-Score =", f1_dtc)
print("_______________________")
print("k-NN Accuracy =", acc_knn)
print("k-NN Precision =", prec_knn)
print("k-NN Recall =", rec_knn)
print("k-NN F1-Score =", f1_knn)
print("_______________________")
print("Naive Bayes Accuracy =", acc_gnb)
print("Naive Bayes Precision =", prec_gnb)
print("Naive Bayes Recall =", rec_gnb)
print("Naive Bayes F1-Score =", f1_gnb)

In [0]:
# Comparing ROC Curve
plt.plot(fprdtc,tprdtc,label="Decision Tree, auc="+str(aucdtc))
plt.plot(fprknn,tprknn,label="K-NN, auc="+str(aucknn))
plt.plot(fprgnb,tprgnb,label="Naive Bayes, auc="+str(aucgnb))
plt.legend(loc=4)
plt.show()

# 2. Unsupervised Learning - Clustering

#### K-Means Clustering

In [0]:
# Import Library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Styling Plot
sns.set() 
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

In [0]:
# Import Dataset
df = pd.read_csv('https://raw.githubusercontent.com/rc-dbe/bigdatacertification/master/dataset/clustering.csv')

In [0]:
# Show 10 Rows of Dataset
df.head(10)

In [0]:
# Show the length of the dataset
len(df) 

In [0]:
# Descriptive statistics
df.describe().transpose()

In [0]:
# Import Standard Scaler
from sklearn.preprocessing import StandardScaler
column_names = df.columns.tolist()
standard_scaler = StandardScaler()

df[column_names] = standard_scaler.fit_transform(df[column_names])
df.sort_index(inplace=True)
df.head()
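
Standardization replaces each value with its z-score, `(x - mean) / std`, so every column ends up with mean 0 and standard deviation 1. The sketch below applies the formula in plain Python using the population standard deviation (dividing by n). The helper name `standardize` and the sample values are made up, and returning zeros for a constant column is a choice made for this sketch.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert a list of numbers to z-scores using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant column: every value equals the mean, so all z-scores are zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up income values, for illustration only.
incomes = [10.0, 20.0, 30.0, 40.0, 50.0]
z_scores = standardize(incomes)
print([round(z, 3) for z in z_scores])
```

Because K-Means relies on Euclidean distance, putting both columns on the same scale keeps one of them from dominating the clustering.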

In [0]:
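# Visualise the distributions of income and spend
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it)
plot_income = sns.histplot(df["INCOME"], kde=True)
plot_spend = sns.histplot(df["SPEND"], kde=True)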

In [0]:
# Plotting the values to understand the spread
Income = df['INCOME'].values
Spend = df['SPEND'].values
X = np.array(list(zip(Income, Spend)))
plt.scatter(Income, Spend, s=50)

In [0]:
# Elbow method: within-cluster sum of squares (WCSS) for k = 1..10
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
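
The quantity stored in `km.inertia_` is the within-cluster sum of squares: the squared distance from every point to the centroid of its assigned cluster, summed over all points. The sketch below computes it by hand for a toy layout. The function name `within_cluster_ss`, the points, and the centroids are illustrative assumptions.

```python
def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for (px, py), label in zip(points, labels):
        cx, cy = centroids[label]
        total += (px - cx) ** 2 + (py - cy) ** 2
    return total


# Toy layout: two tight pairs of points far apart from each other.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

# One cluster: every point is assigned to the overall mean.
wcss_one = within_cluster_ss(points, [0, 0, 0, 0], [(5, 1)])
# Two clusters: each pair gets its own centroid.
wcss_two = within_cluster_ss(points, [0, 0, 1, 1], [(0, 1), (10, 1)])
print(wcss_one, wcss_two)
```

The sharp drop between one and two clusters, followed by only small gains afterwards, is the "elbow" that the plot is meant to reveal.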

In [0]:
# Silhouette
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster, random_state=0).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))
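
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. The coefficient printed above is the average over all points. The sketch below is a plain-Python illustration with a hypothetical function name and toy data; scoring single-point clusters as 0 is the convention adopted here.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    clusters = set(labels)
    values = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # convention for single-point clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        values.append((b - a) / max(a, b))
    return sum(values) / len(values)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # neighbours split apart
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit closer to another cluster than to their own.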

In [0]:
# Fitting Model with K-Means
km2=KMeans(n_clusters=3,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km2.fit_predict(X)

In [0]:
# Visualising the clusters for k=3
plt.scatter(X[y_means == 0, 0], X[y_means == 0, 1], s = 50, label = 'Cluster 1')
plt.scatter(X[y_means == 1, 0], X[y_means == 1, 1], s = 50, label = 'Cluster 2')
plt.scatter(X[y_means == 2, 0], X[y_means == 2, 1], s = 50, label = 'Cluster 3')

plt.scatter(km2.cluster_centers_[:,0], km2.cluster_centers_[:,1],s=200,marker='s', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income')
plt.ylabel('Annual spend')
plt.legend()
plt.show()

# 3. Finding Patterns - Association Rules

In [0]:
# Import Library
import pandas as pd
import numpy as np

import seaborn as sns

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [0]:
# Import dataset
retail_df = pd.read_excel("https://github.com/dianrdn/rc-dataanalytic/blob/master/dataset/Online_Retail.xlsx?raw=true")
retail_df.head()

In [0]:
# Encode quantities as 0/1: 1 if the item was bought at least once, otherwise 0
def encode_units(x):
    return 1 if x >= 1 else 0

def create_basket(country_filter):
    basket = (retail_df[retail_df['Country'] == country_filter]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
    return basket
 

In [0]:
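# Build the 0/1 encoded basket for one country
country_filter = "France"
basket_french = create_basket(country_filter)
# Note: pandas 2.1+ deprecates DataFrame.applymap in favour of DataFrame.map
basket_sets = basket_french.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)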

In [0]:
basket_sets.head()

In [0]:
# Find itemsets that appear in at least 5% of the invoices
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)
frequent_itemsets

In [0]:
# Generate Rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules.head()
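
The main columns of the rules table can be reproduced by counting. Support is the share of transactions that contain an itemset, confidence is `support(A and B) / support(A)`, and lift is `confidence / support(B)`. The sketch below works through one rule in plain Python; the helper names and the five toy transactions are invented for illustration and are not taken from the retail dataset.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    joint = support(antecedent | consequent, transactions)
    confidence = joint / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return joint, confidence, lift


# Invented shopping baskets, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
sup, conf, lift = rule_metrics({"bread"}, {"milk"}, transactions)
print(sup, conf, lift)
```

A lift above 1 means the two sides of the rule occur together more often than they would if they were independent, which is why the cell above filters on `min_threshold=1.2`.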