In [None]:
! git clone https://github.com/timw5/AI_Interview.git
! pip install pandas numpy scikit-learn matplotlib seaborn yellowbrick

This dataset comprises 8 input variables that describe medical details of patients and one output variable that indicates whether the patient will have an onset of diabetes within 5 years. <br> You can learn more about this dataset [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), on Kaggle.

In [None]:
import pandas as pd
import numpy as np

# load data
dataset = pd.read_csv('AI_Interview/diabetes.csv')

The columns of this dataset are:

1. Pregnancies — Number of times pregnant
2. Glucose — Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure — Diastolic blood pressure (mm Hg)
4. SkinThickness — Triceps skin-fold thickness (mm)
5. Insulin — 2-hour serum insulin (mu U/ml)
6. BMI — Body mass index (weight in kg/(height in m)²)
7. DiabetesPedigreeFunction — Diabetes pedigree function <br> (this summarizes the history of diabetes in the patient's relatives and how closely related those relatives are to the patient. A higher value means the patient is more likely to have diabetes.)
8. Age — Age in years

9. Outcome — Class variable (0 or 1)

The first eight are numeric predictors, while the ninth is the binary outcome indicating whether the patient will get diabetes.

In [None]:
# first eight columns are the predictors, the ninth is the outcome
X = dataset.iloc[:, 0:8]
y = dataset.iloc[:, 8]

We would like you to use a decision tree and stratified k-fold cross-validation from the sklearn package <br> to predict whether someone will develop diabetes within the next 5 years, based on the features above. <br> Below, we have provided the documentation for the relevant sklearn classes as a reference.
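As a small illustration of what "stratified" means here, the sketch below splits a made-up label vector (not the diabetes data) and records the share of positives in each test fold. The names `X_toy`, `y_toy`, and `fold_positive_rates` are invented for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels, made up for illustration: 70 negatives and 30 positives.
y_toy = np.array([0] * 70 + [1] * 30)
# The features do not influence how StratifiedKFold splits, so zeros suffice.
X_toy = np.zeros((len(y_toy), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Fraction of positive labels in this test fold
    fold_positive_rates.append(float(y_toy[test_idx].mean()))

print(fold_positive_rates)
```

Because the class counts divide evenly across the five folds, every test fold should keep the overall 30% positive rate; with real data the folds match the overall rate only approximately.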

# EDA

<br> What does EDA mean to you? <br><br> Can you explain your initial thoughts about the data? <br><br> What is the first thing you would do with this dataset? <br><br> What decisions about data quality would you need to make? <br><br> Are there any other features that could be derived from the dataset? <br><br> What plots, graphs, or visualizations would you use to help understand the data?
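One possible starting point for these questions, sketched on a few invented rows rather than the real file: summary statistics, class balance, and a count of zeros per measurement column. Treating a zero glucose or BMI reading as a placeholder for a missing value is an assumption that would need checking against the actual data; the column names are assumed to match the list above.

```python
import pandas as pd

# Toy rows, invented for illustration only; not taken from diabetes.csv.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Summary statistics for a first look at ranges and scales
summary = toy.describe()

# Class balance: fraction of rows for each outcome value
class_balance = toy["Outcome"].value_counts(normalize=True)

# Assumption: a zero in a measurement column may stand in for a missing value
zero_counts = (toy[["Glucose", "BMI"]] == 0).sum()

print(summary)
print(class_balance)
print(zero_counts)
```

On the real notebook the same calls would be applied to `dataset` instead of `toy`.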

In [2]:
from sklearn.model_selection import RepeatedStratifiedKFold
help(RepeatedStratifiedKFold)

Help on class RepeatedStratifiedKFold in module sklearn.model_selection._split:

class RepeatedStratifiedKFold(_RepeatedSplits)
 |  RepeatedStratifiedKFold(*, n_splits=5, n_repeats=10, random_state=None)
 |  
 |  Repeated Stratified K-Fold cross validator.
 |  
 |  Repeats Stratified K-Fold n times with different randomization in each
 |  repetition.
 |  
 |  Read more in the :ref:`User Guide <repeated_k_fold>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |  n_repeats : int, default=10
 |      Number of times cross-validator needs to be repeated.
 |  
 |  random_state : int, RandomState instance or None, default=None
 |      Controls the generation of the random states for each repetition.
 |      Pass an int for reproducible output across multiple function calls.
 |      See :term:`Glossary <random_state>`.
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> from sklearn.model_selection import Repeat

In [3]:
from sklearn.tree import DecisionTreeClassifier
help(DecisionTreeClassifier)

Help on class DecisionTreeClassifier in module sklearn.tree._classes:

class DecisionTreeClassifier(sklearn.base.ClassifierMixin, BaseDecisionTree)
 |  DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)
 |  
 |  A decision tree classifier.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : {"gini", "entropy", "log_loss"}, default="gini"
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "log_loss" and "entropy" both for the
 |      Shannon information gain, see :ref:`tree_mathematical_formulation`.
 |  
 |  splitter : {"best", "random"}, default="best"
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to 

In [4]:
from sklearn.model_selection import cross_val_score
help(cross_val_score)

Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like of shape (n_samples, n_features)
        The data to fit. Can be for example a list, or an array.
    
    y : array-like of shape (n_samples,) or (n_samples, n_outputs),             default=None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like of shape (n_samples,), default=None
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv

# Model Evaluation

Start with a simple decision tree with a maximum depth of 4 to see how well it predicts diabetes. Use stratified k-fold cross-validation and report the mean and standard deviation of the scores.
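A minimal sketch of one way to approach this, not a reference solution. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo` and `y_demo` are stand-ins for the notebook's `X` and `y`, and the fold count, repeat count, and accuracy metric are assumptions rather than requirements of the exercise.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y: 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# define the model
model = DecisionTreeClassifier(max_depth=4, random_state=1)

# define the evaluation method (assumed: 10 folds, repeated 3 times)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model: one accuracy score per fold per repeat
scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)

# report performance (mean and standard deviation)
print(f"Accuracy: {mean(scores):.3f} ({std(scores):.3f})")
```

The standard deviation across folds gives a rough sense of how much the estimate depends on the particular split.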

In [None]:
# evaluate a decision tree classifier with stratified k-fold cross-validation
from numpy import mean
from numpy import std


# define the model


# define the evaluation method


# evaluate the model on the dataset


# report performance (mean and standard deviation)


Can you do better? Try five different decision trees with varying depths, evaluate each model using stratified k-fold cross-validation, and report the performance of each.
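The comparison could be organized as below. This is a sketch on synthetic data; the depths chosen, the helper name `get_models`, and the cross-validation settings are all assumptions made for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y: 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)


def get_models(depths=(1, 2, 3, 5, 8)):
    """Return a dict mapping a label to a tree with that maximum depth."""
    return {
        f"depth={d}": DecisionTreeClassifier(max_depth=d, random_state=1)
        for d in depths
    }


# Same splitter for every model so the scores are comparable
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# evaluate the models and store results
results, names = list(), list()
for name, model in get_models().items():
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
    results.append(scores)
    names.append(name)
    print(f"{name}: {mean(scores):.3f} ({std(scores):.3f})")
```

The collected `results` and `names` lists can then be passed to a box plot, for example `pyplot.boxplot(results)` with `names` as the tick labels, to compare the score distributions side by side.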

In [None]:
# So you can make a plot (Not required, only if you want to)
from matplotlib import pyplot
import seaborn as sns


# get a list of models to evaluate



# evaluate the models and store results
results, names = list(), list()



# plot model performance for comparison



See if you can beat the decision tree with a random forest. 

Evaluate how accurately a random forest predicts the outcome, again using stratified k-fold cross-validation as above, and report your results.
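A sketch of the same evaluation loop with a random forest, again on synthetic stand-in data. The number of trees and the cross-validation settings are assumptions, not tuned values.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for X and y: 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# define the model (100 trees is the library default, stated explicitly here)
forest = RandomForestClassifier(n_estimators=100, random_state=1)

# evaluate the model with the same kind of splitter used for the trees
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
forest_scores = cross_val_score(forest, X_demo, y_demo, scoring="accuracy", cv=cv)

# report performance
print(f"Accuracy: {mean(forest_scores):.3f} ({std(forest_scores):.3f})")
```

Reusing the same splitter settings and `random_state` as for the single trees keeps the comparison between the two model families on equal footing.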


In [5]:
from sklearn.ensemble import RandomForestClassifier
help(RandomForestClassifier)

Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
 |  RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and uses averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  For a comparison between tree-based ensemble models see the example
 |  :ref:`sp

In [None]:
# define the model


# evaluate the model on the dataset


# report performance


# Summarize Results


What would you say is the best model out of all these options, and why?

(Bonus, Optional) <br> What would the next steps be for putting this model into production?