# Data Exploration
This notebook performs exploratory data analysis on the dataset.
To expand on the analysis, attach this notebook to the cluster **Sohail Hosseini's Cluster**,
edit [the options of pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/557053454446975/s?orderByKey=metrics.%60val_f1_score%60&orderByAsc=false)
- Navigate to the parent notebook [here](#notebook/557053454446976) (If you launched the AutoML experiment using the Experiments UI, this link isn't very useful.)

Runtime Version: _10.4.x-cpu-ml-scala2.12_

In [0]:
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

from mlflow.tracking import MlflowClient

# Download input data from mlflow into a pandas DataFrame
# Create temporary directory to download data
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

# Download the artifact and read it
client = MlflowClient()
training_data_path = client.download_artifacts("c8daf7e8bbe743b794c46f66869f8146", "data", temp_dir)
df = pd.read_parquet(os.path.join(training_data_path, "training_data"))

# Delete the temporary data
shutil.rmtree(temp_dir)

target_col = "Outcome"
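The cell above deletes the temporary directory only when the download and the read both succeed. The following is a minimal sketch of a variant that also cleans up when an exception is raised, using the standard library's `tempfile.TemporaryDirectory`. The helper `load_via_temp_dir` and the stand-in callables `fake_download` and `read_text` are hypothetical names introduced for illustration; they are not part of the notebook or of MLflow.

```python
import tempfile
from pathlib import Path


def load_via_temp_dir(download, read, base_dir=None):
    """Call download(temp_dir), then read(path); the directory is always removed.

    download: callable taking a directory and returning the downloaded path.
    read: callable taking that path and returning the loaded data.
    base_dir: optional existing parent directory for the temporary directory.
    """
    with tempfile.TemporaryDirectory(dir=base_dir) as temp_dir:
        local_path = download(temp_dir)
        # The data must be fully loaded into memory before the block exits,
        # because the directory is deleted on exit (also on exceptions).
        return read(local_path)


# Stand-ins so the sketch runs without MLflow or pandas (hypothetical).
seen = {}


def fake_download(dest_dir):
    path = Path(dest_dir) / "training_data.txt"
    path.write_text("Pregnancies,Outcome\n6,1\n")
    return str(path)


def read_text(path):
    seen["dir"] = str(Path(path).parent)
    return Path(path).read_text()


contents = load_via_temp_dir(fake_download, read_text)
print(contents)
```

In this notebook, `download` would wrap the `client.download_artifacts` call and `read` would wrap `pd.read_parquet`, with `base_dir` pointing at an existing directory under `SPARK_LOCAL_DIRS` if the same location is wanted.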

## Profiling Results

In [0]:
from pandas_profiling import ProfileReport
df_profile = ProfileReport(df, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)
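The introduction suggests editing the pandas-profiling options and rerunning. One possible way to keep those options in a single place is sketched below. The names `BASE_OPTIONS`, `build_profile_options`, and `make_profile` are hypothetical; the three base options are the ones passed in the cell above, and `minimal=True` is the pandas-profiling switch that turns off the more expensive computations.

```python
# Options used by the profiling cell above.
BASE_OPTIONS = {
    "title": "Profiling Report",
    "progress_bar": False,
    "infer_dtypes": False,
}


def build_profile_options(**overrides):
    """Return a copy of the base options with the given overrides applied."""
    options = dict(BASE_OPTIONS)
    options.update(overrides)
    return options


def make_profile(df, **overrides):
    """Build a ProfileReport for df using the merged options."""
    # Imported lazily so the sketch loads even where pandas-profiling is absent.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, **build_profile_options(**overrides))


# Example: a faster report for a first pass over the data.
quick_options = build_profile_options(minimal=True)
print(quick_options)
```

Calling `make_profile(df, minimal=True)` would then replace the direct `ProfileReport(...)` call; the resulting report contains fewer sections than the one shown here.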

Dataset statistic,Value
Number of variables,9
Number of observations,768
Missing cells,0
Missing cells (%),0.0%
Duplicate rows,0
Duplicate rows (%),0.0%
Total size in memory,54.1 KiB
Average record size in memory,72.2 B
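The memory figures above can be cross-checked with a little arithmetic. The sketch below assumes that every column is stored as an 8-byte type (`int64` or `float64`) and that the default index adds 128 bytes; both are assumptions that happen to reproduce the reported numbers, not facts stated by the report.

```python
n_rows = 768
n_cols = 9
bytes_per_value = 8   # assumption: int64 / float64 columns
index_bytes = 128     # assumption: overhead of the default index

# Whole DataFrame: all values plus the index.
total_bytes = n_rows * n_cols * bytes_per_value + index_bytes
total_kib = total_bytes / 1024
avg_record_bytes = total_bytes / n_rows

# A single column, counted together with the index.
column_kib = (n_rows * bytes_per_value + index_bytes) / 1024

print(f"Total size in memory: {total_kib:.1f} KiB")
print(f"Average record size:  {avg_record_bytes:.1f} B")
print(f"Memory size per column: {column_kib:.1f} KiB")
```

Under these assumptions the results round to 54.1 KiB, 72.2 B, and 6.1 KiB, matching the overview table and the per-variable "Memory size" rows.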

Variable type,Count
Numeric,9

Warning,Type
Pregnancies is highly correlated with Age,High correlation
Age is highly correlated with Pregnancies,High correlation
SkinThickness is highly correlated with Insulin,High correlation
Insulin is highly correlated with SkinThickness,High correlation
BloodPressure is highly correlated with BMI,High correlation
BMI is highly correlated with BloodPressure,High correlation

Run detail,Value
Analysis started,2023-08-04 00:28:00.875562
Analysis finished,2023-08-04 00:28:18.062950
Duration,17.19 seconds
Software version,pandas-profiling v3.1.0

### Pregnancies

Statistic,Value
Distinct,17
Distinct (%),2.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,3.845052083

Statistic,Value
Minimum,0
Maximum,17
Zeros,111
Zeros (%),14.5%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0
5-th percentile,0
Q1,1
median,3
Q3,6
95-th percentile,10
Maximum,17
Range,17
Interquartile range (IQR),5

Descriptive statistic,Value
Standard deviation,3.369578063
Coefficient of variation (CV),0.8763413316
Kurtosis,0.1592197775
Mean,3.845052083
Median Absolute Deviation (MAD),2
Skewness,0.9016739792
Sum,2953
Variance,11.35405632
Monotonicity,Not monotonic
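Several rows in these tables are derived from others, which gives a quick consistency check. The sketch below recomputes them for this variable from the reported mean, standard deviation, sum, quartiles, and extremes, using only the figures shown above and the observation count of 768 from the overview.

```python
# Figures copied from the tables above.
n_rows = 768
total = 2953
mean = 3.845052083
std = 3.369578063
q1, q3 = 1, 6
minimum, maximum = 0, 17

mean_from_sum = total / n_rows   # Mean = Sum / Number of observations
cv = std / mean                  # Coefficient of variation
variance = std ** 2              # Variance = squared standard deviation
iqr = q3 - q1                    # Interquartile range
value_range = maximum - minimum  # Range

print(f"mean={mean_from_sum:.9f} cv={cv:.10f} variance={variance:.8f}")
print(f"iqr={iqr} range={value_range}")
```

The same relations hold for the other variables in the report, so the check can be repeated by swapping in their figures.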

Value,Count,Frequency (%)
1,135,17.6%
0,111,14.5%
2,103,13.4%
3,75,9.8%
4,68,8.9%
5,57,7.4%
6,50,6.5%
7,45,5.9%
8,38,4.9%
9,28,3.6%

Value,Count,Frequency (%)
0,111,14.5%
1,135,17.6%
2,103,13.4%
3,75,9.8%
4,68,8.9%
5,57,7.4%
6,50,6.5%
7,45,5.9%
8,38,4.9%
9,28,3.6%

Value,Count,Frequency (%)
17,1,0.1%
15,1,0.1%
14,2,0.3%
13,10,1.3%
12,9,1.2%
11,11,1.4%
10,24,3.1%
9,28,3.6%
8,38,4.9%
7,45,5.9%

### Glucose

Statistic,Value
Distinct,136
Distinct (%),17.7%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,120.8945312

Statistic,Value
Minimum,0
Maximum,199
Zeros,5
Zeros (%),0.7%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0.0
5-th percentile,79.0
Q1,99.0
median,117.0
Q3,140.25
95-th percentile,181.0
Maximum,199.0
Range,199.0
Interquartile range (IQR),41.25

Descriptive statistic,Value
Standard deviation,31.9726182
Coefficient of variation (CV),0.2644670347
Kurtosis,0.6407798204
Mean,120.8945312
Median Absolute Deviation (MAD),20
Skewness,0.1737535018
Sum,92847
Variance,1022.248314
Monotonicity,Not monotonic

Value,Count,Frequency (%)
99,17,2.2%
100,17,2.2%
129,14,1.8%
125,14,1.8%
106,14,1.8%
111,14,1.8%
102,13,1.7%
95,13,1.7%
105,13,1.7%
108,13,1.7%

Value,Count,Frequency (%)
0,5,0.7%
44,1,0.1%
56,1,0.1%
57,2,0.3%
61,1,0.1%
62,1,0.1%
65,1,0.1%
67,1,0.1%
68,3,0.4%
71,4,0.5%

Value,Count,Frequency (%)
199,1,0.1%
198,1,0.1%
197,4,0.5%
196,3,0.4%
195,2,0.3%
194,3,0.4%
193,2,0.3%
191,1,0.1%
190,1,0.1%
189,4,0.5%

### BloodPressure

Statistic,Value
Distinct,47
Distinct (%),6.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,69.10546875

Statistic,Value
Minimum,0
Maximum,122
Zeros,35
Zeros (%),4.6%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0.0
5-th percentile,38.7
Q1,62.0
median,72.0
Q3,80.0
95-th percentile,90.0
Maximum,122.0
Range,122.0
Interquartile range (IQR),18.0

Descriptive statistic,Value
Standard deviation,19.35580717
Coefficient of variation (CV),0.2800908166
Kurtosis,5.18015656
Mean,69.10546875
Median Absolute Deviation (MAD),8
Skewness,-1.843607983
Sum,53073
Variance,374.6472712
Monotonicity,Not monotonic

Value,Count,Frequency (%)
70,57,7.4%
74,52,6.8%
78,45,5.9%
68,45,5.9%
72,44,5.7%
64,43,5.6%
80,40,5.2%
76,39,5.1%
60,37,4.8%
0,35,4.6%

Value,Count,Frequency (%)
0,35,4.6%
24,1,0.1%
30,2,0.3%
38,1,0.1%
40,1,0.1%
44,4,0.5%
46,2,0.3%
48,5,0.7%
50,13,1.7%
52,11,1.4%

Value,Count,Frequency (%)
122,1,0.1%
114,1,0.1%
110,3,0.4%
108,2,0.3%
106,3,0.4%
104,2,0.3%
102,1,0.1%
100,3,0.4%
98,3,0.4%
96,4,0.5%

### SkinThickness

Statistic,Value
Distinct,51
Distinct (%),6.6%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,20.53645833

Statistic,Value
Minimum,0
Maximum,99
Zeros,227
Zeros (%),29.6%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0
5-th percentile,0
Q1,0
median,23
Q3,32
95-th percentile,44
Maximum,99
Range,99
Interquartile range (IQR),32

Descriptive statistic,Value
Standard deviation,15.95221757
Coefficient of variation (CV),0.776775494
Kurtosis,-0.5200718662
Mean,20.53645833
Median Absolute Deviation (MAD),12
Skewness,0.1093724965
Sum,15772
Variance,254.4732453
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,227,29.6%
32,31,4.0%
30,27,3.5%
27,23,3.0%
23,22,2.9%
18,20,2.6%
28,20,2.6%
33,20,2.6%
31,19,2.5%
19,18,2.3%

Value,Count,Frequency (%)
0,227,29.6%
7,2,0.3%
8,2,0.3%
10,5,0.7%
11,6,0.8%
12,7,0.9%
13,11,1.4%
14,6,0.8%
15,14,1.8%
16,6,0.8%

Value,Count,Frequency (%)
99,1,0.1%
63,1,0.1%
60,1,0.1%
56,1,0.1%
54,2,0.3%
52,2,0.3%
51,1,0.1%
50,3,0.4%
49,3,0.4%
48,4,0.5%

### Insulin

Statistic,Value
Distinct,186
Distinct (%),24.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,79.79947917

Statistic,Value
Minimum,0
Maximum,846
Zeros,374
Zeros (%),48.7%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0.0
5-th percentile,0.0
Q1,0.0
median,30.5
Q3,127.25
95-th percentile,293.0
Maximum,846.0
Range,846.0
Interquartile range (IQR),127.25

Descriptive statistic,Value
Standard deviation,115.2440024
Coefficient of variation (CV),1.444169856
Kurtosis,7.214259554
Mean,79.79947917
Median Absolute Deviation (MAD),30.5
Skewness,2.272250858
Sum,61286
Variance,13281.18008
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,374,48.7%
105,11,1.4%
140,9,1.2%
130,9,1.2%
120,8,1.0%
100,7,0.9%
180,7,0.9%
94,7,0.9%
115,6,0.8%
135,6,0.8%

Value,Count,Frequency (%)
0,374,48.7%
14,1,0.1%
15,1,0.1%
16,1,0.1%
18,2,0.3%
22,1,0.1%
23,2,0.3%
25,1,0.1%
29,1,0.1%
32,1,0.1%

Value,Count,Frequency (%)
846,1,0.1%
744,1,0.1%
680,1,0.1%
600,1,0.1%
579,1,0.1%
545,1,0.1%
543,1,0.1%
540,1,0.1%
510,1,0.1%
495,2,0.3%

### BMI

Statistic,Value
Distinct,248
Distinct (%),32.3%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,31.99257813

Statistic,Value
Minimum,0
Maximum,67.1
Zeros,11
Zeros (%),1.4%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0.0
5-th percentile,21.8
Q1,27.3
median,32.0
Q3,36.6
95-th percentile,44.395
Maximum,67.1
Range,67.1
Interquartile range (IQR),9.3

Descriptive statistic,Value
Standard deviation,7.88416032
Coefficient of variation (CV),0.2464371671
Kurtosis,3.290442901
Mean,31.99257813
Median Absolute Deviation (MAD),4.6
Skewness,-0.4289815885
Sum,24570.3
Variance,62.15998396
Monotonicity,Not monotonic
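The "Zeros" rows reported so far differ widely between variables. The sketch below recomputes the zero share for each of the six variables profiled up to this point and ranks them; the counts are copied from the tables, and 768 is the observation count from the overview. The report itself does not say whether a zero is a real measurement or a stand-in for a missing value, so the ranking is only a hint about where to look first.

```python
n_rows = 768

# "Zeros" counts copied from the per-variable tables above.
zero_counts = {
    "Pregnancies": 111,
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
}

# Share of zero values per variable, in percent with one decimal.
zero_share = {
    name: round(100 * count / n_rows, 1) for name, count in zero_counts.items()
}

# Variables with the largest share of zeros first.
ranked = sorted(zero_share.items(), key=lambda item: item[1], reverse=True)
for name, share in ranked:
    print(f"{name:<15} {share:>5.1f}%")
```

A zero is plausible for a count such as Pregnancies, whereas a zero body mass index or blood pressure is not physiologically meaningful, so those columns may deserve a closer look before modelling.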

Value,Count,Frequency (%)
32,13,1.7%
31.2,12,1.6%
31.6,12,1.6%
0,11,1.4%
33.3,10,1.3%
32.4,10,1.3%
32.9,9,1.2%
32.8,9,1.2%
30.8,9,1.2%
30.1,9,1.2%

Value,Count,Frequency (%)
0.0,11,1.4%
18.2,3,0.4%
18.4,1,0.1%
19.1,1,0.1%
19.3,1,0.1%
19.4,1,0.1%
19.5,2,0.3%
19.6,3,0.4%
19.9,1,0.1%
20.0,1,0.1%

Value,Count,Frequency (%)
67.1,1,0.1%
59.4,1,0.1%
57.3,1,0.1%
55.0,1,0.1%
53.2,1,0.1%
52.9,1,0.1%
52.3,2,0.3%
50.0,1,0.1%
49.7,1,0.1%
49.6,1,0.1%

### DiabetesPedigreeFunction

Statistic,Value
Distinct,517
Distinct (%),67.3%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,0.4718763021

Statistic,Value
Minimum,0.078
Maximum,2.42
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0.078
5-th percentile,0.14035
Q1,0.24375
median,0.3725
Q3,0.62625
95-th percentile,1.13285
Maximum,2.42
Range,2.342
Interquartile range (IQR),0.3825

Descriptive statistic,Value
Standard deviation,0.331328595
Coefficient of variation (CV),0.7021513764
Kurtosis,5.594953528
Mean,0.4718763021
Median Absolute Deviation (MAD),0.1675
Skewness,1.919911066
Sum,362.401
Variance,0.1097786379
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0.258,6,0.8%
0.254,6,0.8%
0.268,5,0.7%
0.261,5,0.7%
0.207,5,0.7%
0.238,5,0.7%
0.259,5,0.7%
0.551,4,0.5%
0.692,4,0.5%
0.284,4,0.5%

Value,Count,Frequency (%)
0.078,1,0.1%
0.084,1,0.1%
0.085,2,0.3%
0.088,2,0.3%
0.089,1,0.1%
0.092,1,0.1%
0.096,1,0.1%
0.1,1,0.1%
0.101,1,0.1%
0.102,1,0.1%

Value,Count,Frequency (%)
2.42,1,0.1%
2.329,1,0.1%
2.288,1,0.1%
2.137,1,0.1%
1.893,1,0.1%
1.781,1,0.1%
1.731,1,0.1%
1.699,1,0.1%
1.698,1,0.1%
1.6,1,0.1%

### Age

Statistic,Value
Distinct,52
Distinct (%),6.8%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,33.24088542

Statistic,Value
Minimum,21
Maximum,81
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,21
5-th percentile,21
Q1,24
median,29
Q3,41
95-th percentile,58
Maximum,81
Range,60
Interquartile range (IQR),17

Descriptive statistic,Value
Standard deviation,11.76023154
Coefficient of variation (CV),0.3537881556
Kurtosis,0.6431588885
Mean,33.24088542
Median Absolute Deviation (MAD),7
Skewness,1.129596701
Sum,25529
Variance,138.3030459
Monotonicity,Not monotonic

Value,Count,Frequency (%)
22,72,9.4%
21,63,8.2%
25,48,6.2%
24,46,6.0%
23,38,4.9%
28,35,4.6%
26,33,4.3%
27,32,4.2%
29,29,3.8%
31,24,3.1%

Value,Count,Frequency (%)
21,63,8.2%
22,72,9.4%
23,38,4.9%
24,46,6.0%
25,48,6.2%
26,33,4.3%
27,32,4.2%
28,35,4.6%
29,29,3.8%
30,21,2.7%

Value,Count,Frequency (%)
81,1,0.1%
72,1,0.1%
70,1,0.1%
69,2,0.3%
68,1,0.1%
67,3,0.4%
66,4,0.5%
65,3,0.4%
64,1,0.1%
63,4,0.5%

### Outcome

Statistic,Value
Distinct,2
Distinct (%),0.3%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,0.3489583333

Statistic,Value
Minimum,0
Maximum,1
Zeros,500
Zeros (%),65.1%
Negative,0
Negative (%),0.0%
Memory size,6.1 KiB

Quantile statistic,Value
Minimum,0
5-th percentile,0
Q1,0
median,0
Q3,1
95-th percentile,1
Maximum,1
Range,1
Interquartile range (IQR),1

Descriptive statistic,Value
Standard deviation,0.4769513772
Coefficient of variation (CV),1.366786036
Kurtosis,-1.600929755
Mean,0.3489583333
Median Absolute Deviation (MAD),0
Skewness,0.6350166434
Sum,268
Variance,0.2274826163
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,500,65.1%
1,268,34.9%
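The two class counts above show a moderately imbalanced target. As a minimal sketch, the code below recomputes the positive rate (which equals the reported mean of the 0/1 column) and derives inverse-frequency class weights with the formula n / (k * count), the same heuristic scikit-learn uses for `class_weight="balanced"`. Whether the AutoML trials apply any such weighting is not stated in this notebook; the weights are shown only to quantify the imbalance.

```python
# Class counts copied from the frequency table above.
counts = {0: 500, 1: 268}

n = sum(counts.values())   # total number of observations
k = len(counts)            # number of classes

positive_rate = counts[1] / n   # equals the mean of a 0/1 column
imbalance_ratio = counts[0] / counts[1]

# Inverse-frequency weights: rarer classes get proportionally larger weights.
weights = {label: n / (k * count) for label, count in counts.items()}

print(f"positive rate:   {positive_rate:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
print(f"class weights:   {weights}")
```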


### First rows

Row,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1
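The report flags Pregnancies and Age as highly correlated. To make concrete what such a coefficient measures, the sketch below computes a Pearson correlation in plain Python for the ten rows shown above. The `pearson` helper is written here for illustration and is not the routine pandas-profiling uses; ten rows are far too few to reproduce the report's result, which is based on all 768 observations and on several correlation measures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Pregnancies and Age columns of the ten rows listed above.
pregnancies = [6, 1, 8, 1, 0, 5, 3, 10, 2, 8]
age = [50, 31, 32, 21, 33, 30, 26, 29, 53, 54]

r = pearson(pregnancies, age)
print(f"Pearson r on the ten sample rows: {r:.3f}")
```

On the full DataFrame, `df[["Pregnancies", "Age"]].corr()` gives the corresponding figure directly.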

### Last rows

Row,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
758,1,106,76,0,0,37.5,0.197,26,0
759,6,190,92,0,0,35.5,0.278,66,1
760,2,88,58,26,16,28.4,0.766,22,0
761,9,170,74,31,0,44.0,0.403,43,1
762,9,89,62,0,0,22.5,0.142,33,0
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0
