1.	Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

Combined sampling works by using oversampling and undersampling to compensate for an imbalance in the dataset. In oversampling, we are dupllicating samples in the minority class and for undersampling, deleting samples in the majority class. This is repeated until the desired class distribution is achieved. 

In [27]:
import pandas as pd
import numpy as np

diabetes_df = pd.read_csv('../week_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [31]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome',axis = 1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

In [45]:
from collections import Counter
from imblearn.combine import SMOTEENN 
from imblearn.pipeline import Pipeline
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

#model
model = tree.DecisionTreeClassifier(max_depth = 8,random_state=42)

#Oversampling
resample = SMOTEENN()

steps = [('r', resample),('m', model)]
pipeline = Pipeline(steps=steps)

# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

#Evaluate model
scoring=['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X_train_scaler, y_train, scoring=scoring, cv=cv, n_jobs=-1)

# summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

# scores2 = cross_val_score(pipeline, X_test_scaler, y_test, scoring='roc_auc', cv=cv, n_jobs=-1)
# # summarize performance
# print('Mean ROC AUC: %.3f' % np.mean(scores2))

Mean Accuracy: 0.8728
Mean Precision: 0.8733
Mean Recall: 0.8720


2.	Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

The combined sampling worked really well compared to other approaches. The accuracy, precision, and recall are all above 80% close to 90% so combined sampling works really well. With the other approaches, the scores have been around 70%, so this does a much better job. 

3.	What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Outlier detection is removing observations in a dataset that don't fit. It is useful because by removing these outliers, a more accurate representation of the data can be made. Methods that can be used are Isolation Forests, Minimum Covariance Determinant, and Linear Regression Models. 

4.	Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 

In [37]:
australian_df = pd.read_csv('australian.csv',header=None)
australian_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1,22.08,11.46,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.0,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
3,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1
4,1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1


In [40]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report

X = australian_df.drop(australian_df.columns[14],axis=1)
y = australian_df[australian_df.columns[14]]

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42, stratify=y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

classifier = SVC(kernel='linear')
classifier.fit(X_train_scaler,y_train)
y_pred = classifier.predict(X_test_scaler)

#Classification Report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.76      0.84       153
           1       0.76      0.93      0.84       123

    accuracy                           0.84       276
   macro avg       0.84      0.85      0.84       276
weighted avg       0.85      0.84      0.84       276



5.	How did the SVM model perform? Use a classification report.

The SVM model performed pretty well. The accuracy score is 84%, overall precision score is 84% and the recall score is 85%. The accuracy, precision, and recall scores are all close together so it does a good job of predicting the correct outcome. 

6.	What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

    I want to build off my background in biomedical engineering and my interest in neuroscience and psychology, so I found some job titles such as Neuroscience Data Scientist. This included the design and analysis of data, preparing data and data pipelines for model development, and building, training, testing, and deploying machine learning modes. This job included needing experience with sci-kit-learn libraries, pandas, numpy, scipy, and Matlab. Some data science jobs in neuroscience also involve exploring brain recordings and drawing conclusions from that data using data science. This also includes modeling neural networks and neural data science. 

	Another industry I looked at was consulting and there are a lot of areas to work in based on interest. Working as a consulting provides an opportunity to work in specific area and to work on different projects and with different clients. One company that deals with data science consulting is Accenture—where you work on a team and use statistics, data mining, and Machine learning for clients. This also gives you a chance to work face to face with clients also and present to them directly. I also found that you can do data science consulting in media – so working with companies like Netflix, Twitter, Spotify. This is also interesting because you’re working with lots of data, it’s more real-world data and about real human behavior. One can also see it’s real-life impact because people will be using it in real-time. 