# Diabetes Data: Pickle Lab

### Goal: Fit an ML model pipeline and pickle it

#### The Data: This project uses the Diabetes dataset. The objective is to determine whether an individual has diabetes based on 8 attributes. The data is sourced from OpenML (data ID 37).

##### For building the classification pipeline, the following sources were very helpful:
##### https://www.kaggle.com/gautham11/building-a-scikit-learn-classification-pipeline
##### https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [21]:
##Import Libraries

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pickle
import json


In [22]:
##Import the dataset and preview the data

diabetes=fetch_openml(data_id=37, as_frame=True)

diabetes.data

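| | preg | plas | pres | skin | insu | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 0.0 | 26.6 | 0.351 | 31.0 |
| 2 | 8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10.0 | 101.0 | 76.0 | 48.0 | 180.0 | 32.9 | 0.171 | 63.0 |
| 764 | 2.0 | 122.0 | 70.0 | 27.0 | 0.0 | 36.8 | 0.340 | 27.0 |
| 765 | 5.0 | 121.0 | 72.0 | 23.0 | 112.0 | 26.2 | 0.245 | 30.0 |
| 766 | 1.0 | 126.0 | 60.0 | 0.0 | 0.0 | 30.1 | 0.349 | 47.0 |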


In [23]:
##Note that the dataset is an sklearn Bunch data type

print(type(diabetes))

<class 'sklearn.utils.Bunch'>


In [24]:
## Test Train Split

##Note that the target variable indicates whether a person has diabetes

X_train, X_test, y_train, y_test= train_test_split(diabetes.data, diabetes.target, random_state=13)
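
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. The sketch below checks the resulting sizes on a synthetic stand-in array with the same shape as the diabetes data (768 rows, 8 columns). The random values and the `"pos"`/`"neg"` labels are placeholders, and the `stratify` call is an optional variation that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the diabetes data (768 x 8).
# The values and the "pos"/"neg" labels are placeholders, not real records.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = np.array(["pos", "neg"] * 384)

# Default split: 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)
print(X_train.shape, X_test.shape)  # (576, 8) (192, 8)

# Optional variation: stratify=y keeps the class proportions equal in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=13, stratify=y
)
pos_in_test = int((y_te_s == "pos").sum())
print(pos_in_test, "of", len(y_te_s), "test rows are 'pos'")
```

Stratifying is worth considering when one class is much rarer than the other, since a purely random split can leave the test set with a different class balance than the training set.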


In [25]:
## Build a Pipeline using logistic regression
## See references in intro markdown block

pipe=Pipeline([('normalize', StandardScaler()), ('classify', LogisticRegression())])

pipe.fit(X_train, y_train)

Pipeline(steps=[('normalize', StandardScaler()),
                ('classify', LogisticRegression())])
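
The notebook fits the pipeline but does not score it. As a rough sketch of how the held-out rows could be used, the example below builds the same two-step pipeline on synthetic data from `make_classification` (an assumption made so the block runs without downloading anything) and reports accuracy with `Pipeline.score`. The printed number describes the synthetic data only, not the diabetes model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

# Same structure as the notebook's pipeline: scale, then classify.
pipe = Pipeline([("normalize", StandardScaler()), ("classify", LogisticRegression())])
pipe.fit(X_train, y_train)

# score() runs the scaler fitted on the training rows, then returns mean accuracy.
accuracy = pipe.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the scaler lives inside the pipeline, its mean and variance are learned from the training rows only and reused unchanged at prediction time.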

In [26]:
## Pickle the fitted pipeline and save it as model.pkl

with open('model.pkl', 'wb') as f:
    pickle.dump(pipe, f, pickle.HIGHEST_PROTOCOL)
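
To confirm that a pickled pipeline survives the round trip, it can be loaded back and compared against the original. The sketch below is a minimal version of that check. It assumes synthetic data and a temporary directory so it does not overwrite the notebook's `model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a small stand-in pipeline on synthetic data.
X, y = make_classification(n_samples=200, n_features=8, random_state=13)
pipe = Pipeline([("normalize", StandardScaler()), ("classify", LogisticRegression())])
pipe.fit(X, y)

# Write to a temporary location rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(pipe, f, pickle.HIGHEST_PROTOCOL)

# Read the pipeline back; "rb" mirrors the "wb" used when writing.
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The restored pipeline should give the same predictions as the original.
same = np.array_equal(pipe.predict(X), loaded.predict(X))
print("predictions match:", same)
```

Two caveats apply to pickled models in general: only unpickle files from a trusted source, since loading a pickle can execute arbitrary code, and load the file with the same scikit-learn version that wrote it where possible.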

In [27]:
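## Write one sample row to a file for testing the pickled model
## (json is already imported in the first cell)

##With as_frame=True, diabetes.data is already a DataFrame;
##this call just makes the column order explicit
diabetesDF = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

##Alternative: save the row as testdata.json
#with open('testdata.json','w') as f:
#    json.dump(diabetesDF.iloc[0].values.tolist(),f)

##Save the first row as a JSON list of feature values
with open('newdata.py','w') as f:
    json.dump(diabetesDF.iloc[0].values.tolist(), f)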
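
The saved row is a flat list of eight numbers, while scikit-learn estimators expect a 2-D input with one row per sample. The sketch below shows one way to read the list back and wrap it before prediction. The file name `newdata.json` and the temporary directory are assumptions for illustration; the values are row 0 from the preview above.

```python
import json
import os
import tempfile

# Row 0 from the data preview: preg, plas, pres, skin, insu, mass, pedi, age.
row = [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0]

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "newdata.json")
with open(path, "w") as f:
    json.dump(row, f)

# Read the list back from disk.
with open(path) as f:
    loaded_row = json.load(f)

# Wrap the flat list in an outer list: one sample with eight features.
batch = [loaded_row]
print(len(batch), "sample with", len(batch[0]), "features")

# With the pickled pipeline loaded as `model`, prediction would then be:
# model.predict(batch)
```

Since the pipeline was fitted on a DataFrame, recent scikit-learn versions may warn that a plain list carries no feature names. Wrapping `batch` in a `pd.DataFrame` with the original column names should avoid that.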