__Assignment 9__

1. [Import](#Import)
1. [Assignment 9](#Assignment-9)
    1. [Load data](#Load-data)
    1. [Build Naive Bayes model](#Build-Naive-Bayes-model)
    1. [Evaluate results](#Evaluate-results)

# Import

<a id = 'Import'></a>

In [2]:
# standard library and settings
import warnings

warnings.simplefilter("ignore")
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.options.display.float_format = "{:,.6f}".format

# modeling extensions
from sklearn.metrics import (
    precision_score, recall_score, f1_score, explained_variance_score,
    mean_squared_log_error, mean_absolute_error, median_absolute_error,
    mean_squared_error, r2_score, confusion_matrix, roc_curve, accuracy_score,
    roc_auc_score, homogeneity_score, completeness_score, classification_report,
    silhouette_samples,
)
from sklearn.model_selection import (
    KFold, train_test_split, GridSearchCV, StratifiedKFold, cross_val_score,
    RandomizedSearchCV,
)
from sklearn.naive_bayes import MultinomialNB

# Assignment 9

<a id = 'Assignment-9'></a>

## Load data

<a id = 'Load-data'></a>

In [3]:
# load and inspect data
# alternative source: 'https://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data'
df = pd.read_csv(
    "s3://tdp-ml-datasets/misc/cmc.data",
    sep=",",
    names=[
        "WifeAge",
        "WifeEdu",
        "HusbandEdu",
        "NumChildren",
        "WifeReligion",
        "WifeWorking",
        "HusbandOccupation",
        "StandardOfLiving",
        "MediaExposure",
        "ContraceptiveMethod",
    ],
)
df.info()
display(df[:5])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1473 entries, 0 to 1472
Data columns (total 10 columns):
WifeAge                1473 non-null int64
WifeEdu                1473 non-null int64
HusbandEdu             1473 non-null int64
NumChildren            1473 non-null int64
WifeReligion           1473 non-null int64
WifeWorking            1473 non-null int64
HusbandOccupation      1473 non-null int64
StandardOfLiving       1473 non-null int64
MediaExposure          1473 non-null int64
ContraceptiveMethod    1473 non-null int64
dtypes: int64(10)
memory usage: 115.2 KB


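|  | WifeAge | WifeEdu | HusbandEdu | NumChildren | WifeReligion | WifeWorking | HusbandOccupation | StandardOfLiving | MediaExposure | ContraceptiveMethod |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 2 | 3 | 3 | 1 | 1 | 2 | 3 | 0 | 1 |
| 1 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 3 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 4 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |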


In [4]:
# print unique class labels
df["ContraceptiveMethod"].unique()

array([1, 2, 3])

> Remarks - The response variable has three levels, stored as the integers 1, 2, and 3.

In [5]:
# train/test split
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
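
The `stratify=y` argument asks the split to keep the class proportions of `y` in both parts. The sketch below illustrates that effect on synthetic stand-in data, not on the CMC data; the names `X_demo`, `y_demo`, and `class_counts` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for X and y (NOT the CMC data):
# 100 rows, three classes with shares 50% / 20% / 30%
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 5, size=(100, 4))
y_demo = np.array([1] * 50 + [2] * 20 + [3] * 30)

# same split settings as the train/test split cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)


def class_counts(labels):
    """Return {class label: number of rows} for a label array."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))


print("full :", class_counts(y_demo))
print("train:", class_counts(y_tr))
print("test :", class_counts(y_te))
```

With these class sizes, both parts should keep the 50/20/30 ratio. Without `stratify`, the ratios would depend on the random draw.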

## Build Naive Bayes model

<a id = 'Build-Naive-Bayes-model'></a>

In [6]:
# perform cross validation
model = MultinomialNB()
scores = cross_val_score(
    model, X_train, y_train, scoring="accuracy", cv=10
)
print(
    "CV accuracy on training data: {:.3f} +/- {:.3f}".format(
        np.mean(scores), np.std(scores)
    )
)

CV accuracy on training data: 0.493 +/- 0.027


In [7]:
# print all scores
print("All scores: {}".format(scores))

All scores: [0.51260504 0.53781513 0.51260504 0.46610169 0.44915254 0.49152542
 0.46153846 0.4957265  0.52136752 0.48275862]


> Remarks - The mean accuracy across the 10 cross-validation folds was 0.493 with a standard deviation of 0.027. The fold scores are fairly consistent, ranging from 0.449 to 0.538.
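
A mean accuracy is easier to judge next to a simple baseline. The sketch below is an illustration only and was not part of the original analysis: it runs scikit-learn's `DummyClassifier`, which always predicts the most frequent class, through the same 10-fold scheme. It uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names), so its numbers say nothing about the CMC data. Passing `X_train` and `y_train` instead would give the comparable baseline for this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# synthetic stand-in for X_train and y_train (NOT the CMC data)
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 5, size=(100, 4))          # non-negative integer features
y_demo = np.array([1] * 50 + [2] * 20 + [3] * 30)  # majority class share: 50%

# baseline: always predict the most frequent class seen in the training folds
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"),
    X_demo, y_demo, scoring="accuracy", cv=10,
)

# Naive Bayes model evaluated under the same 10-fold scheme
nb_scores = cross_val_score(
    MultinomialNB(), X_demo, y_demo, scoring="accuracy", cv=10
)

print(
    "Baseline CV accuracy: {:.3f} +/- {:.3f}".format(
        np.mean(baseline_scores), np.std(baseline_scores)
    )
)
print(
    "Naive Bayes CV accuracy: {:.3f} +/- {:.3f}".format(
        np.mean(nb_scores), np.std(nb_scores)
    )
)
```

A model is only informative to the extent that it beats this baseline under the same folds.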

## Evaluate results

<a id = 'Evaluate-results'></a>

In [8]:
# determine accuracy of model when used on test set
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy: {:.3f}".format(accuracy_score(y_true=y_test, y_pred=y_pred)))

Accuracy: 0.512
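
A single accuracy figure hides which classes are confused with each other. The sketch below shows one way to get a per-class view with `confusion_matrix` and `classification_report`. The label arrays are hypothetical stand-ins for `y_test` and `y_pred`, so the printed values are illustrative and are not results for this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# hypothetical true and predicted labels, standing in for y_test and y_pred
y_true_demo = np.array([1, 1, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred_demo = np.array([1, 1, 3, 1, 1, 2, 3, 3, 1, 3])

# rows are true classes, columns are predicted classes, both ordered 1, 2, 3
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])
print(cm)

# overall accuracy equals the diagonal sum divided by the number of rows
print("Accuracy: {:.3f}".format(accuracy_score(y_true_demo, y_pred_demo)))

# per-class precision, recall, and F1
print(classification_report(y_true_demo, y_pred_demo))
```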


> Remarks - Test accuracy (0.512) is slightly higher than the mean cross-validation accuracy (0.493), but the model still classifies only about half of the observations correctly. One possible cause is dependence among the features: Naive Bayes assumes the features are conditionally independent given the class, and this dataset may violate that assumption. The dataset presents a classification problem with three categories describing which type of contraception, if any, a married couple uses. The features describe the husband and wife, such as each spouse's education level, their standard of living, and their number of children. It could be that more highly educated men tend to marry more highly educated women, and that more highly educated couples have a higher standard of living and more or fewer children than less educated couples. These are a few of the dependencies that could exist in this dataset and limit the accuracy of Naive Bayes.
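
One rough way to look for such dependencies is a rank correlation matrix of the features. The sketch below is a minimal illustration on synthetic data, not on the CMC data: the column names are borrowed from this notebook for readability, and the relationship between the two education columns is built in by construction. Marginal correlation is only a proxy for the class-conditional dependence that matters to Naive Bayes, so this is a screening step and not a test of the assumption.

```python
import numpy as np
import pandas as pd

# synthetic data (NOT the CMC data); the dependence below is constructed on purpose
rng = np.random.RandomState(1)
n = 500
wife_edu = rng.randint(1, 5, size=n)  # education levels 1-4
# husband's level is tied to the wife's level, plus or minus one, clipped to 1-4
husband_edu = np.clip(wife_edu + rng.randint(-1, 2, size=n), 1, 4)
num_children = rng.randint(0, 8, size=n)  # drawn independently of education

demo = pd.DataFrame(
    {
        "WifeEdu": wife_edu,
        "HusbandEdu": husband_edu,
        "NumChildren": num_children,
    }
)

# Spearman rank correlation is a reasonable choice for ordinal features
corr = demo.corr(method="spearman")
print(corr.round(2))
```

On the real data, `df.iloc[:, :-1].corr(method="spearman")` would give the corresponding matrix. Large off-diagonal values would be consistent with the dependence described in the remark.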