## 1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
### a. Describe what it is in your own words in markdown.

SMOTE - Synthetic Minority Over-sampling Technique - a technique for oversampling. It supports multi-class resampling. It is one of the most commonly used oversampling methods to solve the imbalance problem.

### b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.

Running LogisticRegression without using SMOTE or any oversampling approach does not perform as well if you are looking at Recall however, some of the other performance metrics (like precision and accuracy) lean the other way. SMOTE performed better than the oversampling approach done in class based on Recall. In addition, changing the k_neighbors for SMOTE can tune the performance. 

I chose Recall as the performance metric to look at since it means out of all the points that actually belong to a positive class, what percentage of the model detected to be positive class. I feel the False Negative should be observed keenly as it has more impact so Recall becomes important.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Read in the diabetes data, create dataframe and print the head
diabetes_df = pd.read_csv("../../in_class/in_class_assignments/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
diabetes_df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [3]:
X = diabetes_df.drop('Outcome', axis = 1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

In [4]:
from imblearn.over_sampling import SMOTE 

sm = SMOTE(random_state=42)

X_res, y_res = sm.fit_resample(X_train_scaler,y_train)


In [5]:
#train using resampled
model = LogisticRegression(random_state = 42)
model.fit(X_res,y_res)

LogisticRegression(random_state=42)

In [6]:
#calculate accuracy
from sklearn.metrics import balanced_accuracy_score #assumes model is balanced

y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test,y_pred)

0.6888658940397351

In [7]:
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test,y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.80      0.72      0.66      0.76      0.69      0.48       151
          1       0.55      0.66      0.72      0.60      0.69      0.47        80

avg / total       0.71      0.70      0.68      0.70      0.69      0.47       231



In [8]:
X = diabetes_df.drop('Outcome', axis = 1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

model = LogisticRegression(random_state=42)

model.fit(X_train_scaler, y_train)


LogisticRegression(random_state=42)

In [9]:
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test,y_pred)

#print(model.score(X_train_scaler, y_train))
#print(model.score(X_test_scaler, y_test))

0.7105960264900661

In [10]:
print(classification_report_imbalanced(y_test,y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.79      0.82      0.60      0.81      0.70      0.50       151
          1       0.64      0.60      0.82      0.62      0.70      0.48        80

avg / total       0.74      0.74      0.68      0.74      0.70      0.50       231



## 2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number.

Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.

Examples:

16 --> 1 + 6 = 7

942 --> 9 + 4 + 2 = 15 --> 1 + 5 = 6

132189 --> 1 + 3 + 2 + 1 + 8 + 9 = 24 --> 2 + 4 = 6

493193 --> 4 + 9 + 3 + 1 + 9 + 3 = 29 --> 2 + 9 = 11

In [11]:
def rec_digit_sum(n):
    """ Return the sum of digits of a number.
        number: non-negative integer
    """
    # go through the number inputted, adding each digit together
    x = sum(int(i) for i in str(n))
    
    # if the resulting sum is greater than 9, need to go through the process again
    if x > 9:
        return rec_digit_sum(x)
    return x

In [12]:
print(rec_digit_sum(16))
print(rec_digit_sum(942))
print(rec_digit_sum(132189))
print(rec_digit_sum(493193))

7
6
6
2
