# About the Data
The data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.  It covers 1,473 married women who were either not pregnant or did not know whether they were pregnant at the time of the survey.  Each respondent is described by ten attributes.  Two are numerical: "w_age" (wife's age, between 16 and 49) and "num_kid" (number of children ever born).  Four are categorical, each taking one of four values: "w_ed" (wife's education), "h_ed" (husband's education), and "sol" (standard-of-living) run from 1 (low) to 4 (high), while the four codes for "h_job" (husband's occupation) carry no low/high label.  Three are binary: "w_islam" (wife is Islamic, 0=no, 1=yes), "home" (wife is not currently working, 0=working, 1=not working), and "med_ex" (media exposure, 0=good, 1=not good).  The class attribute, "cont", has three categories of contraceptive use: 1 = no-use, 2 = long-term, 3 = short-term.

   1. Wife's age                     (numerical)
   2. Wife's education               (categorical)      1=low, 2, 3, 4=high
   3. Husband's education            (categorical)      1=low, 2, 3, 4=high
   4. Number of children ever born   (numerical)
   5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
   6. Wife's now working?            (binary)           0=Yes, 1=No
   7. Husband's occupation           (categorical)      1, 2, 3, 4
   8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
   9. Media exposure                 (binary)           0=Good, 1=Not good
   10. Contraceptive method used     (class attribute)  1=No-use 
                                                        2=Long-term
                                                        3=Short-term

Upon seeing the list of attributes, I immediately noticed that some of the values seem to be coded counterintuitively.  If 0 typically means false and 1 means true, then the values for attributes 6 (Wife's now working?) and 9 (Media exposure) may need to be reversed.  I am also curious why the class attribute (Contraceptive method used) is coded 1=No-use, 2=Long-term, 3=Short-term.  In my mind, the long-term and short-term values may need to be swapped so that the codes form an ordered scale and any correlations are portrayed accurately (since I would imagine that long-term contraception is more effective than short-term).  This also raises the question of how a respondent would be classified if she used both long-term and short-term contraception.
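
If the reversed coding gets in the way, the two binary columns can be flipped without overwriting the original values.  The following is a minimal sketch, assuming the column names "home" and "med_ex" used in this notebook; the new column names `works` and `med_good` and the three sample rows are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical sample rows; "home" and "med_ex" follow the survey coding
# (home: 0=working, 1=not working; med_ex: 0=good, 1=not good).
sample = pd.DataFrame({"home": [1, 0, 1], "med_ex": [0, 0, 1]})

# Flip each 0/1 column so that 1 means "yes/good", keeping the originals.
sample["works"] = 1 - sample["home"]        # 1 = wife is working
sample["med_good"] = 1 - sample["med_ex"]   # 1 = good media exposure

print(sample)
```

Keeping the original columns next to the flipped ones makes it easy to confirm that the recoding did what was intended.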

The data set can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.  

# Hypotheses
With respect to the categories of this survey, several hypotheses seem reasonable.  For example, each of the following hypotheses predicts a positive correlation between contraceptive use and the attribute in question:

1. Women who use contraception have a higher level of education than women who do not use contraception.
2. Women who use contraception have husbands with higher levels of education than women who do not use contraception.
3. Women who use contraception have a higher standard-of-living than women who do not use contraception.

The following hypotheses predict negative correlations:

4. Women who use contraception have fewer children than women who do not use contraception.
5. Women who use contraception are younger than women who do not use contraception.
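
One way to check the sign of each hypothesized correlation is to collapse the class attribute into a 0/1 "uses contraception" flag and correlate it with each attribute.  The sketch below assumes the column names used in this notebook; the helper `correlation_with_use` and the four toy rows are hypothetical and are not survey data.  Pearson correlation is the pandas default; because the education and standard-of-living scales are ordinal, `method="spearman"` may be the better choice.

```python
import pandas as pd

def correlation_with_use(frame):
    """Correlate each attribute with a 0/1 'uses contraception' flag."""
    uses = (frame["cont"] != 1).astype(int)   # cont: 1 = no-use, 2/3 = use
    attrs = ["w_ed", "h_ed", "sol", "num_kid", "w_age"]
    return frame[attrs].corrwith(uses)

# Hypothetical rows, made up only to show the call; not survey data.
toy = pd.DataFrame({
    "w_age":   [45, 30, 25, 38],
    "w_ed":    [1, 4, 3, 2],
    "h_ed":    [2, 4, 4, 3],
    "num_kid": [6, 2, 1, 4],
    "sol":     [2, 4, 3, 2],
    "cont":    [1, 2, 3, 1],
})
print(correlation_with_use(toy))
```

On the real data, a positive value for "w_ed", "h_ed", and "sol" and a negative value for "num_kid" and "w_age" would be consistent with the five hypotheses.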

In [1]:
# First I will import tools.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline



In [4]:
# Next I will load the data set from the .csv file and name the columns.
df = pd.read_csv('cmc.csv')
df.columns = ["w_age", "w_ed", "h_ed", "num_kid", "w_islam", "home", "h_job", "sol", "med_ex", "cont"]
df.head()

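|   | w_age | w_ed | h_ed | num_kid | w_islam | home | h_job | sol | med_ex | cont |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 1 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 3 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |
| 4 | 19 | 4 | 4 | 0 | 1 | 1 | 3 | 3 | 0 | 1 |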


Next I will compute summary statistics for every column: count, mean, standard deviation, minimum, maximum, and quartiles.

In [8]:
df.describe()

|   | w_age | w_ed | h_ed | num_kid | w_islam | home | h_job | sol | med_ex | cont |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1472.0 | 1472.0 | 1472.0 | 1472.0 | 1472.0 | 1472.0 | 1472.0 | 1472.0 | 1472.0 | 1472.0 |
| mean | 32.544158 | 2.959239 | 3.430027 | 3.261549 | 0.850543 | 0.749321 | 2.137908 | 3.133832 | 0.074049 | 1.920516 |
| std | 8.227027 | 1.015031 | 0.816549 | 2.359341 | 0.356659 | 0.433552 | 0.865144 | 0.976486 | 0.261939 | 0.876345 |
| min | 16.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 26.0 | 2.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 | 1.0 |
| 50% | 32.0 | 3.0 | 4.0 | 3.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 2.0 |
| 75% | 39.0 | 4.0 | 4.0 | 4.25 | 1.0 | 1.0 | 3.0 | 4.0 | 0.0 | 3.0 |
| max | 49.0 | 4.0 | 4.0 | 16.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 3.0 |
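
The count of 1,472 is one fewer than the 1,473 respondents described at the start.  One possible explanation, which is an assumption and has not been verified against the file itself, is that cmc.csv has no header row, so `pd.read_csv` treats the first record as column names.  The sketch below reproduces that behaviour on a three-line in-memory CSV and shows how `header=None` together with `names=` keeps every record.

```python
import io
import pandas as pd

cols = ["w_age", "w_ed", "h_ed", "num_kid", "w_islam",
        "home", "h_job", "sol", "med_ex", "cont"]

# Three records with no header line, copied from the df.head() output above.
raw = "45,1,3,10,1,1,3,4,0,1\n43,2,3,7,1,1,3,4,0,1\n42,3,2,9,1,1,3,3,0,1\n"

# Default behaviour: the first record is consumed as the header.
lost_row = pd.read_csv(io.StringIO(raw))
lost_row.columns = cols

# header=None keeps the first record as data; names= supplies the labels.
all_rows = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(lost_row), len(all_rows))
```

If the assumption holds, reading the real file with `header=None, names=cols` would restore the missing respondent, and the statistics in this notebook would shift slightly.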


In order to evaluate the data set more clearly and look for correlations, I will separate the data into three data sets, grouped by responses for contraceptive usage.

In [5]:
# df1, which will be women who do not use contraception.
df1 = df[df['cont']==1]
df1.describe()

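|   | w_age | w_ed | h_ed | num_kid | w_islam | home | h_job | sol | med_ex | cont |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 628.0 | 628.0 | 628.0 | 628.0 | 628.0 | 628.0 | 628.0 | 628.0 | 628.0 | 628.0 |
| mean | 33.43949 | 2.671975 | 3.281847 | 2.934713 | 0.880573 | 0.729299 | 2.200637 | 2.953822 | 0.117834 | 1.0 |
| std | 9.123353 | 1.052397 | 0.902869 | 2.657577 | 0.324548 | 0.444676 | 0.840293 | 1.044207 | 0.322669 | 0.0 |
| min | 16.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 25.0 | 2.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 |
| 50% | 32.0 | 3.0 | 4.0 | 2.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 1.0 |
| 75% | 42.0 | 4.0 | 4.0 | 4.0 | 1.0 | 1.0 | 3.0 | 4.0 | 0.0 | 1.0 |
| max | 49.0 | 4.0 | 4.0 | 12.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 |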


In [6]:
# df2, which will be women who use contraception on a long-term basis.
df2 = df[df['cont']==2]
df2.describe()

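|   | w_age | w_ed | h_ed | num_kid | w_islam | home | h_job | sol | med_ex | cont |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 34.384384 | 3.456456 | 3.663664 | 3.738739 | 0.771772 | 0.732733 | 1.840841 | 3.468468 | 0.03003 | 2.0 |
| std | 7.454844 | 0.796488 | 0.70781 | 2.104406 | 0.420322 | 0.443199 | 0.885908 | 0.770149 | 0.170927 | 0.0 |
| min | 17.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 |
| 25% | 28.0 | 3.0 | 4.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 | 2.0 |
| 50% | 35.0 | 4.0 | 4.0 | 3.0 | 1.0 | 1.0 | 2.0 | 4.0 | 0.0 | 2.0 |
| 75% | 41.0 | 4.0 | 4.0 | 5.0 | 1.0 | 1.0 | 3.0 | 4.0 | 0.0 | 2.0 |
| max | 49.0 | 4.0 | 4.0 | 13.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 |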


In [7]:
# df3, which will be women who use contraception on a short-term basis.
df3 = df[df['cont']==3]
df3.describe()

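|   | w_age | w_ed | h_ed | num_kid | w_islam | home | h_job | sol | med_ex | cont |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 511.0 | 511.0 | 511.0 | 511.0 | 511.0 | 511.0 | 511.0 | 511.0 | 511.0 | 511.0 |
| mean | 30.244618 | 2.988258 | 3.459883 | 3.35225 | 0.864971 | 0.784736 | 2.254403 | 3.136986 | 0.048924 | 3.0 |
| std | 6.943811 | 0.96602 | 0.728856 | 2.049675 | 0.34209 | 0.411408 | 0.838916 | 0.954259 | 0.21592 | 0.0 |
| min | 16.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 |
| 25% | 25.0 | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 |
| 50% | 29.0 | 3.0 | 4.0 | 3.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 |
| 75% | 35.0 | 4.0 | 4.0 | 4.0 | 1.0 | 1.0 | 3.0 | 4.0 | 0.0 | 3.0 |
| max | 49.0 | 4.0 | 4.0 | 16.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 3.0 |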
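
Because the hypotheses compare women who use contraception with women who do not, it may help to pool the long-term and short-term groups.  The sketch below is one way to do that using only the counts and means reported in the three summaries above; the labels `no_use`, `long_term`, and `short_term` and the `difference` column are names introduced here for illustration.

```python
import pandas as pd

# Group sizes and means copied from the three describe() outputs above.
counts = pd.Series({"no_use": 628, "long_term": 333, "short_term": 511})
means = pd.DataFrame(
    {
        "w_age":   [33.43949, 34.384384, 30.244618],
        "w_ed":    [2.671975, 3.456456, 2.988258],
        "h_ed":    [3.281847, 3.663664, 3.459883],
        "num_kid": [2.934713, 3.738739, 3.35225],
        "sol":     [2.953822, 3.468468, 3.136986],
    },
    index=["no_use", "long_term", "short_term"],
)

# Pool the two "use" groups with a count-weighted average of their means.
users = ["long_term", "short_term"]
pooled = means.loc[users].mul(counts[users], axis=0).sum() / counts[users].sum()

comparison = pd.DataFrame({"no_use": means.loc["no_use"], "use": pooled})
comparison["difference"] = comparison["use"] - comparison["no_use"]
print(comparison.round(3))
```

A positive value in the `difference` column means the pooled users have the higher mean for that attribute.  A difference in means is only a first look; it says nothing about whether the gap is statistically significant.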
