#### Setup

In [26]:
import pandas as pd, matplotlib.pyplot as plt, numpy as np
import seaborn as sns
from scipy.io.arff import loadarff
from sklearn.feature_selection import f_classif
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the data
data = loadarff('diabetes.arff')
df = pd.DataFrame(data[0])

#### 1) ANOVA is a statistical test that can be used to assess the discriminative power of a single input variable. Using f_classif from sklearn, identify the input variables with the worst and best discriminative power. Plot their class-conditional probability density functions.

In [None]:
X = df.drop('Outcome', axis = 1)
df['Outcome'] = df['Outcome'].astype(int)
y = df['Outcome']

f_values, p_values = f_classif(X, y)

# Identify the best and worst input variables based on f-values
best = np.argmax(f_values)
worst = np.argmin(f_values)

fig = plt.figure(figsize=(12, 5))

print(f"The input variable with the worst discriminative power is: {X.columns[worst]} and the best is: {X.columns[best]}")

# Best input variable
plt.subplot(1, 2, 1)
sns.kdeplot(df[df['Outcome'] == 0][X.columns[best]], label= 'Normal', fill= True)
sns.kdeplot(df[df['Outcome'] == 1][X.columns[best]], label= 'Diabetes', fill= True)
plt.xlabel(f'{X.columns[best]}')
plt.ylabel('Density')
plt.title(f'Class Conditional PDF for {X.columns[best]}')
plt.legend()  # plt.legend() only affects the current axes, so call it for this subplot too

# Worst input variable
plt.subplot(1, 2, 2)
sns.kdeplot(df[df['Outcome'] == 0][X.columns[worst]], label= 'Normal', fill= True)
sns.kdeplot(df[df['Outcome'] == 1][X.columns[worst]], label= 'Diabetes', fill= True)
plt.xlabel(f'{X.columns[worst]}')
plt.ylabel('Density')
plt.title(f'Class Conditional PDF for {X.columns[worst]}')

plt.tight_layout()
plt.legend()
plt.show()
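
To make the ranking above easier to interpret, the following is a minimal sketch of the one-way ANOVA F-statistic that `f_classif` computes for each input variable: the between-class variance of the variable divided by its within-class variance. The helper name `anova_f` and the toy values are assumptions made for illustration; they are not taken from the diabetes data.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for one variable, given its values per class."""
    k = len(groups)                          # number of classes
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own class mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy values (hypothetical): well-separated class means vs. identical class means
strong = anova_f([[1, 2, 3], [4, 5, 6]])
weak = anova_f([[1, 3, 5], [2, 3, 4]])
print(strong, weak)  # 13.5 0.0
```

A large F-value means the class means are far apart relative to the spread within each class, which is why `np.argmax(f_values)` picks the most discriminative variable and `np.argmin(f_values)` the least.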

#### 2) Using a stratified 80-20 training-testing split with a fixed seed (random_state=1), assess in a single plot both the training and testing accuracies of a decision tree with minimum sample split in {2,5,10,20,30,50,100} and the remaining parameters as default.

In [None]:
X = df.drop('Outcome', axis = 1)
df['Outcome'] = df['Outcome'].astype(int)
y = df['Outcome']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

avg_train_accuracies, avg_test_accuracies = [], []
min_sample_splits = [2,5,10,20,30,50,100]

for min_sample_split in min_sample_splits:
    train_acc_scores = []
    test_acc_scores = []

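    # Repeat the fit 10 times and average the scores.
    # Note: with random_state=1 fixed, every repetition yields the same tree and scores.
    for i in range(10):
        # Create a decision tree classifier
        clf = DecisionTreeClassifier(min_samples_split=min_sample_split, random_state=1)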
        clf.fit(x_train, y_train)

        
        train_accuracy = accuracy_score(y_train, clf.predict(x_train))
        test_accuracy = accuracy_score(y_test, clf.predict(x_test))        
        train_acc_scores.append(train_accuracy)
        test_acc_scores.append(test_accuracy)
        
    avg_train_accuracies.append(np.mean(train_acc_scores))
    avg_test_accuracies.append(np.mean(test_acc_scores))
    
fig = plt.figure(figsize=(12, 5))
plt.plot(min_sample_splits, avg_train_accuracies, label='Training Accuracy')
plt.plot(min_sample_splits, avg_test_accuracies, label='Test Accuracy')
plt.title('Decision Tree Accuracy vs. Min Sample Split')
plt.xlabel('Min Sample Split')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()
plt.show()
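
As a rough illustration of what `stratify=y` asks for in the split above, the sketch below shuffles the indices of each class separately and sends 20% of every class to the test set, so both sets keep the original class proportions. The function `stratified_split` and the toy labels are hypothetical; this is not scikit-learn's actual implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 65 negatives and 35 positives
labels = [0] * 65 + [1] * 35
train, test = stratified_split(labels)
print(len(train), len(test), sum(labels[i] for i in test))  # 80 20 7
```

Without stratification, a random 20% sample could over- or under-represent the minority class, which would make the test accuracy less comparable across settings.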

    

#### 3) Critically analyze these results, including the generalization capacity across settings.

To make the analysis clearer, we can split the results into two ranges along the x-axis (Min Sample Split).

* Min Sample Split < ≈30
	
	The training accuracy starts at 100% for a Min Sample Split of 2. This is expected: any node holding at least two samples with different outcomes can still be split, so the tree keeps growing until every leaf is pure (provided no two training instances share the same attribute values but differ in outcome). The result is a tree that classifies every training instance correctly, yet it scores below 70% accuracy on unseen (test) data.
	As Min Sample Split increases across this range, the training accuracy decreases and the test accuracy increases, which indicates better generalization.
	We can conclude that the decision trees built with smaller Min Sample Split values overfit. They become too complex, learning noise along with the genuine patterns in the training data, and therefore score poorly on the test data.

* Min Sample Split > ≈30
		
	As Min Sample Split increases beyond 30, the test accuracy slowly decreases along with the training accuracy. By this point the decision trees are becoming too simple, which lowers the accuracy on both data sets and points to underfitting.

The best balance between complexity and generalization capacity is found with Min Sample Split set to around 30, where the maximum test accuracy of about 77% is reached. At this point, the model captures enough of the patterns in the data to perform well on unseen data without overfitting the training set.
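
The reasoning above can also be checked numerically rather than read off the plot. The sketch below computes the train-test gap for each setting and picks the setting with the highest test accuracy. The helper `summarize_gap` is a hypothetical name introduced here for illustration.

```python
def summarize_gap(settings, train_acc, test_acc):
    """Return per-setting (setting, train, test, gap) rows and the best setting."""
    rows = [
        (s, tr, te, tr - te)   # gap = training accuracy minus test accuracy
        for s, tr, te in zip(settings, train_acc, test_acc)
    ]
    # Best setting = highest test accuracy (first one wins on ties)
    best_setting = max(rows, key=lambda row: row[2])[0]
    return rows, best_setting
```

In the notebook it could be called as `summarize_gap(min_sample_splits, avg_train_accuracies, avg_test_accuracies)`. A large positive gap suggests overfitting, while low accuracy on both sets with a small gap suggests underfitting.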

#### 4) To deploy the predictor, a healthcare provider opted to learn a single decision tree (random_state=1) using all available data and ensuring that the maximum depth would be 3 in order to avoid overfitting risks.

> i. Plot the decision tree 

In [None]:
X = df.drop('Outcome', axis = 1)
df['Outcome'] = df['Outcome'].astype(int)
y = df['Outcome']

predictor = tree.DecisionTreeClassifier(random_state=1, max_depth=3)
predictor.fit(X, y)

feature_names = X.columns
class_names = ['Normal', 'Diabetes']

plt.figure(figsize=(12, 5))
tree.plot_tree(predictor, feature_names=feature_names, class_names=class_names, impurity=False, filled=True)
plt.show()


> ii. Explain what characterizes diabetes by identifying the conditional associations together with their posterior probabilities.

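
One way to approach this question is to read each root-to-leaf path of the plotted tree as a conditional association, and to take the class distribution at the leaf as its posterior probabilities. The sketch below normalizes the per-class sample counts of a single leaf. The helper `leaf_posteriors` and the counts in the example are hypothetical and are not read from the fitted tree; note also that some scikit-learn versions display proportions rather than counts in the `value` field of `plot_tree`, in which case no normalization is needed.

```python
def leaf_posteriors(class_counts, class_names):
    """Posterior P(class | path to leaf) from the per-class sample counts of a leaf."""
    total = sum(class_counts)
    return {name: count / total for name, count in zip(class_names, class_counts)}


# Hypothetical leaf with 30 'Normal' and 10 'Diabetes' training samples
print(leaf_posteriors([30, 10], ['Normal', 'Diabetes']))
# {'Normal': 0.75, 'Diabetes': 0.25}
```

Each leaf that predicts `Diabetes` then gives one rule of the form "if the conditions on the path hold, then P(Diabetes) equals the leaf's posterior".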